Post Incident Report - 2020-04-29
Descriptive Name: SSO outage
Incident Reference Number: 20200429
Date of Incident: Monday 27/04/2020 & Tuesday 28/04/2020
Time & Duration of Incident: 08:30 to 09:25 AEST, 27/04/2020
08:07 to 09:05 AEST, 28/04/2020
Severity: ☒ Service Affecting ☐ Access Affecting
☒ Performance Affecting ☐ Network Affecting
Location Affected: ☐ Isolated school ☐ Host + all VMs
☐ Sub-Net ☒ Multiple schools
Services Affected: - SSO login from any location inoperative.
Incident Cause: - Cloudwork SSO service experienced a DoS-like surge of sign-in activity on Monday and Tuesday mornings
- Regularly scheduled reports running overtime and consuming DB resources
- Analysis on Monday (27/04/2020) determined that the authentication volume was not legitimate activity. The volume was generated by a mis-behaving app continuously completing successful sign-on requests, hundreds of times a minute per successfully logged-in user. The traffic came from School1, matching the incident of 31/03/2020. On Tuesday (28/04/2020) a second school (School2) was identified exhibiting exactly the same mis-behaviour from exactly the same app. On Wednesday (29/04/2020) a third school (School3) was also found exhibiting the same mis-behaviour from the same app.
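The detection described above can be sketched as a simple rate check over authentication logs. This is a hypothetical illustration only; the event format, function name, and the 100-per-minute threshold are assumptions, not the actual Cloudwork tooling:

```python
from collections import defaultdict

def flag_runaway_clients(events, threshold_per_minute=100):
    """Return user IDs whose successful sign-on count in any single
    minute exceeds the threshold - the signature of the mis-behaving
    app (hundreds of sign-ons per minute per logged-in user).

    events: iterable of (user_id, minute_bucket) pairs, where
    minute_bucket is e.g. an epoch-minute integer.
    """
    counts = defaultdict(int)  # (user, minute) -> sign-on count
    for user_id, minute_bucket in events:
        counts[(user_id, minute_bucket)] += 1
    return sorted({user for (user, _), n in counts.items()
                   if n > threshold_per_minute})
```

For example, 150 successful sign-ons for one user inside a single minute would flag that user, while users with normal sign-on rates are left alone.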
Incident Resolved: ☒ Yes ☐ No ☐ Open
Time of Resolution: 09:15, Tuesday 28/04/2020 (service restored)
18:30, Tuesday 28/04/2020 (additional capacity commissioned)
Restoration Timeframe: 1hr 53mins 00secs (combined across both days)
Issued By: Technical Support, email@example.com & firstname.lastname@example.org
PIR Issue Date: 30/04/2020
Contact Information: Please report any continued service disruption immediately to:
Studentnet NOC Support: +61 2 9281 3905
Support Email: email@example.com firstname.lastname@example.org
Incident Timeline
Monday 27/04/2020:
- 08:15 Extraordinarily high sign-in authentication activity observed
- Some expectation that this reflected legitimate extra traffic caused by schools moving en masse to remote learning as a result of the COVID-19 pandemic.
- There was also a possibility that the mis-behaving app had recurred
- Schools started reporting poor sign-in performance
- 08:30 all sign-in services stopped.
- Investigation commenced to audit resources allocated to critical processes and heavily utilised schools
- Investigation commenced to determine the status of the mis-behaving app.
- Investigation determined:
- Mis-behaving app was present again at School1, but only for some of their account holders
- Weekly reports generation was running overtime and was creating unnecessary database load
- 08:45 Reporting jobs were re-scheduled
- 09:15 School1 service terminated until the school could confirm that it would no longer generate mis-behaved traffic.
- Services to other schools re-commenced
- 10:15 School1 services brought back online in a resource constrained container so as to have no impact on other services
- 15:00 – Evidence started to appear that another school (School2) was exhibiting the same mis-behaving app traffic
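The 08:45 report-job rescheduling above might look like the following crontab change. The paths and times are illustrative assumptions, not the actual Cloudwork job definitions; the point is moving the weekly run out of the Monday-morning sign-in peak into an overnight window:

```
# Before (hypothetical): weekly reports colliding with the Monday sign-in peak
# 30 8 * * 1  /opt/cloudwork/bin/run_weekly_reports
# After: same job shifted to an overnight low-traffic window
30 2 * * 1  /opt/cloudwork/bin/run_weekly_reports
```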
Tuesday 28/04/2020:
- 08:10 – Positive confirmation that School2 was generating the same mis-behaved traffic from the same app
- 08:15 – attempts made to call IT admins at School2
- Contact from School2 admins confirming that they are using the same app causing issues at School1
- Inbound load increasing uncontrollably
- Examination of load indicated that it was not being efficiently distributed to available servers
- Planning for a new round-robin based resource allocation scheme was completed. This would require a DNS change to implement.
- School2 resources constrained to a resource limited container.
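A "resource constrained container" of the kind used for School1 and School2 could be expressed as follows. This is a sketch only; the container name, image, and limit values are assumptions:

```
# Hypothetical: cap CPU and memory so a runaway tenant cannot starve others
docker run -d --name school2-sso \
  --cpus="0.5" \
  --memory="512m" --memory-swap="512m" \
  cloudwork/sso-frontend:latest
```

Capping both memory and swap to the same value prevents the container from spilling its excess load onto the host's swap.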
08:59 – first status notification texted out to all reporting schools:
- “Studentnet Cloudwork SSO service currently experiencing capacity issues. Further update coming shortly.”
09:00 DNS changes were prepared to establish a more efficient round-robin allocation of tasks to available servers
09:15 DNS changes were deployed, commencing the TTL propagation period clock
Services progressively came back online as the DNS TTL period expired
09:31 – second status notification texted out to all reporting schools:
- “Studentnet Cloudwork SSO outage, update. A fix has been applied that required a DNS change. TTL for DNS propagation means the fix will take effect after a 30-40min delay.”
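The round-robin allocation was implemented via DNS; a minimal sketch in BIND zone-file form is below. Hostnames, addresses, and the TTL value are illustrative assumptions, not the actual Cloudwork records:

```
; Multiple A records for one name - resolvers rotate through them,
; spreading sign-in load across the available physical servers.
; A short TTL (300s = 5min) bounds future propagation delays.
sso.cloudwork.example.  300  IN  A  192.0.2.10
sso.cloudwork.example.  300  IN  A  192.0.2.11
sso.cloudwork.example.  300  IN  A  192.0.2.12
```

The 30-40min delay mentioned in the notification corresponds to the TTL on the previous records expiring at downstream resolvers.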
Capacity planning completed to dramatically increase available capacity. The plan as implemented included:
- Configure and physically deploy newly purchased servers into the DC. Hardware servers were purchased in March 2020 with deployment planned for Q2 2020; this plan was brought forward and completed in 8 hours.
- Complete implementation of the round-robin load allocation policy to utilise all 3 available physical servers more efficiently
- Configure and commission 2 new database servers, increasing available DB capacity
- 16:00 New servers delivered, and racked into the DC
- 18:30 New servers network connected and incorporated into swarm.
Wednesday 29/04/2020:
06:19 – third status notification texted out to all reporting schools:
- “Studentnet Cloudwork status update: All systems operational. New hardware deployed, extra DBMSs and DNSs commissioned, app behaviour being monitored. PIR to follow. Please report any problems to 02 9281 1626. Thank You”
07:30 A third school (School3) was detected exhibiting the same mis-behaving app traffic
08:00 Attempts being made to contact School3 IT admins
08:15 Contact made with School3 IT admins advising of mis-behaving app traffic being generated
08:30 School3 service placed into resource constrained container to isolate any impact on other schools
08:30-08:55 School3 experiences slow performance issues arising from self-generated mis-behaving traffic. All other services continue unaffected
08:55 School3 disables mis-behaving app
09:10 School3 re-enables mis-behaving app in constrained fashion.
Root Causes:
- Inappropriate authentication behaviour by an app
- Poorly timed weekly report generation jobs
Copyright © 2020, Studentnet/Coherent Cloud(CoClo) ABN 90 001 966 892