Post Incident Report - 2020-04-29
Descriptive Name: SSO outage
Incident Reference Number: 20200429
Date of Incident: Monday 27/04/2020 & Tuesday 28/04/2020
Time & Duration of Incident: 08:30 to 09:25 AEST, 27/04/2020
08:07 to 09:05 AEST, 28/04/2020
Severity: ☒ Service Affecting ☐ Access Affecting
☒ Performance Affecting ☐ Network Affecting
Location Affected: ☐ Isolated school ☐ Host + all VMs
☐ Sub-Net ☒ Multiple schools
Services Affected: - SSO login from any location inoperative.
Incident Cause: - Cloudwork SSO service experienced a DoS-like surge of sign-in activity on Monday and Tuesday mornings
- Regularly scheduled reports running overtime and consuming DB resources
- Analysis on Monday (27/04/2020) determined that the authentication volume was not legitimate activity. The volume was generated by a mis-behaving app continuously completing successful sign-on requests, hundreds of times a minute per successfully logged-in user. The traffic came from School1, matching the incident of 31/03/2020. On Tuesday (28/04/2020) a second school (School2) was identified exhibiting exactly the same mis-behaviour from exactly the same app. On Wednesday (29/04/2020) a third school (School3) was also found exhibiting the same mis-behaviour from the same app.
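The detection described above can be sketched as a simple rate check over authentication logs. This is a hypothetical illustration only; the event format, function name, and the 100-per-minute threshold are assumptions, not the actual Cloudwork tooling:

```python
from collections import defaultdict

def flag_runaway_clients(events, threshold_per_minute=100):
    """Return user IDs whose successful sign-on count in any single
    minute exceeds the threshold - the signature of the mis-behaving
    app (hundreds of sign-ons per minute per logged-in user).

    events: iterable of (user_id, minute_bucket) pairs, where
    minute_bucket is e.g. an epoch-minute integer.
    """
    counts = defaultdict(int)  # (user, minute) -> sign-on count
    for user_id, minute_bucket in events:
        counts[(user_id, minute_bucket)] += 1
    return sorted({user for (user, _), n in counts.items()
                   if n > threshold_per_minute})
```

For example, 150 successful sign-ons for one user inside a single minute would flag that user, while users with normal sign-on rates are left alone.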
Incident Resolved: ☒ Yes ☐ No ☐ Open
Time of Resolution: 09:15, Tuesday 28/04/2020 (service restored)
18:30, Tuesday 28/04/2020 (additional capacity commissioned)
Restoration Timeframe: 1hr 53mins 00secs (combined across both days)
Issued By: Technical Support, email@example.com & firstname.lastname@example.org
PIR Issue Date: 30/04/2020
Contact Information: Please report any continued service disruption immediately to:
Studentnet NOC Support: +61 2 9281 3905
Support Email: email@example.com firstname.lastname@example.org
Incident Timeline
Monday 27/04/2020:
- 08:15 Extraordinarily high sign-in authentication activity observed
- Some expectation that this reflected legitimate extra traffic caused by schools moving en masse to remote learning as a result of the COVID-19 pandemic.
- There was also a possibility that the mis-behaving app had recurred
- Schools started reporting poor sign-in performance
- 08:30 all sign-in services stopped.
- Investigation commenced to audit resources allocated to critical processes and heavily utilised schools
- Investigation commenced to determine the status of the mis-behaving app.
- Investigation determined:
- Mis-behaving app was present again at School1, but only for some of their account holders
- Weekly reports generation was running overtime and was creating unnecessary database load
- 08:45 Reporting jobs were re-scheduled
- 09:15 School1 service terminated until the school could confirm that it would no longer generate mis-behaved traffic.
- Services to other schools re-commenced
- 10:15 School1 services brought back online in a resource constrained container so as to have no impact on other services
- 15:00 – Evidence started to appear that another school (School2) was exhibiting the same mis-behaving app traffic
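The 08:45 report-job rescheduling above might look like the following crontab change. The paths and times are illustrative assumptions, not the actual Cloudwork job definitions; the point is moving the weekly run out of the Monday-morning sign-in peak into an overnight window:

```
# Before (hypothetical): weekly reports colliding with the Monday sign-in peak
# 30 8 * * 1  /opt/cloudwork/bin/run_weekly_reports
# After: same job shifted to an overnight low-traffic window
30 2 * * 1  /opt/cloudwork/bin/run_weekly_reports
```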
Tuesday 28/04/2020:
- 08:10 – Positive confirmation that School2 was generating the same mis-behaved traffic from the same app
- 08:15 – attempts made to call IT admins at School2
- Contact from School2 admins confirming that they are using the same app causing issues at School1
- Inbound load increasing uncontrollably
- Examination of load indicated that it was not being efficiently distributed to available servers
- Planning for a new round-robin based resource allocation scheme was completed. This would require a DNS change to implement.
- School2 resources constrained to a resource limited container.
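A "resource constrained container" of the kind used for School1 and School2 could be expressed as follows. This is a sketch only; the container name, image, and limit values are assumptions:

```
# Hypothetical: cap CPU and memory so a runaway tenant cannot starve others
docker run -d --name school2-sso \
  --cpus="0.5" \
  --memory="512m" --memory-swap="512m" \
  cloudwork/sso-frontend:latest
```

Capping both memory and swap to the same value prevents the container from spilling its excess load onto the host's swap.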
08:59 – first status notification texted out to all reporting schools:
- “Studentnet Cloudwork SSO service currently experiencing capacity issues. Further update coming shortly.”
09:00 DNS changes were prepared to establish a more efficient round-robin allocation of tasks to available servers
09:15 DNS changes were deployed, commencing the TTL propagation period clock
Services progressively came back online as the DNS TTL period expired
09:31 – second status notification texted out to all reporting schools:
- “Studentnet Cloudwork SSO outage, update. A fix has been applied that required a DNS change. TTL for DNS propagation means the fix will take effect after a 30-40min delay.”
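The round-robin allocation was implemented via DNS; a minimal sketch in BIND zone-file form is below. Hostnames, addresses, and the TTL value are illustrative assumptions, not the actual Cloudwork records:

```
; Multiple A records for one name - resolvers rotate through them,
; spreading sign-in load across the available physical servers.
; A short TTL (300s = 5min) bounds future propagation delays.
sso.cloudwork.example.  300  IN  A  192.0.2.10
sso.cloudwork.example.  300  IN  A  192.0.2.11
sso.cloudwork.example.  300  IN  A  192.0.2.12
```

The 30-40min delay mentioned in the notification corresponds to the TTL on the previous records expiring at downstream resolvers.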
Capacity planning completed to dramatically increase available capacity. The plan as implemented included:
- Configure and physically deploy newly purchased servers into the DC. Hardware servers were purchased in March 2020 with deployment planned for Q2 2020; this plan was brought forward and completed in 8 hours.
- Complete implementation of the round-robin load allocation policy to utilise all 3 available physical servers more efficiently
- Configure and commission 2 new database servers, increasing available DB capacity
- 16:00 New servers delivered, and racked into the DC
- 18:30 New servers network connected and incorporated into swarm.
Wednesday 29/04/2020:
06:19 – third status notification texted out to all reporting schools:
- “Studentnet Cloudwork status update: All systems operational. New hardware deployed, extra DBMSs and DNSs commissioned, app behaviour being monitored. PIR to follow. Please report any problems to 02 9281 1626. Thank You”
07:30 A third school (School3) was detected exhibiting the same mis-behaving app traffic
08:00 Attempts being made to contact School3 IT admins
08:15 Contact made with School3 IT admins advising of mis-behaving app traffic being generated
08:30 School3 service placed into resource constrained container to isolate any impact on other schools
08:30-08:55 School3 experiences slow performance issues arising from self-generated mis-behaving traffic. All other services continue unaffected
08:55 School3 disables mis-behaving app
09:10 School3 re-enables mis-behaving app in constrained fashion.
Root Causes:
- Inappropriate authentication behaviour by an app
- Poorly timed weekly report generation jobs
Copyright © 2020, Studentnet/Coherent Cloud(CoClo) ABN 90 001 966 892