20200429 - SSO Outage
Incident Report for Studentnet
Postmortem

Post Incident Report - 2020-04-29

Descriptive Name: SSO outage

Incident Reference Number: 20200429

Date of Incident: 27/04/2020, Monday & 28/04/2020, Tuesday

Time & Duration of Incident: 08:30 to 09:25 AEST 27/04/2020

08:07 to 09:05 AEST 28/04/2020

Total: 1hr 53min

Severity: ☒ Service Affecting ☐ Access Affecting

☒ Performance Affecting ☐ Network Affecting

Location Affected: ☐ Isolated school ☐ Host + all VMs

☐ Sub-Net ☒ Multiple schools

Services Affected: - SSO login from any location inoperative.

Incident Cause: - The Cloudwork SSO service experienced a DoS-like surge of sign-in activity on Monday and Tuesday mornings.

  • Regular reports ran overtime, consuming DB resources.
  • Analysis on Monday (27/4/2020) determined that the authentication volume was not legitimate activity. It was generated by a mis-behaving app that continuously completed successful sign-on requests, hundreds of times a minute for each successfully logged-in user. The traffic came from School1, the school involved in the incident on 31/3/2020.
  • On Tuesday (28/4/2020) a second school, School2, was identified as exhibiting exactly the same mis-behaviour from exactly the same app.
  • On Wednesday (29/04/2020) a third school, School3, was also exhibiting exactly the same mis-behaviour from exactly the same app.
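The per-user sign-on volume described above can be spotted with a sliding-window count. The following is a minimal sketch only, not the tooling used during the incident: the event format `(timestamp_seconds, school, user)`, the function name `find_noisy_users` and the thresholds are all assumptions made for illustration.

```python
from collections import defaultdict, deque

# Hypothetical threshold: sign-ons per user per minute treated as abnormal.
MAX_SIGNONS_PER_WINDOW = 100
WINDOW_SECONDS = 60


def find_noisy_users(events, limit=MAX_SIGNONS_PER_WINDOW, window=WINDOW_SECONDS):
    """Return {(school, user): peak_count} for users whose successful sign-ons
    exceed `limit` within any sliding window of `window` seconds.

    `events` is an iterable of (timestamp_seconds, school, user) tuples,
    assumed to be sorted by timestamp.
    """
    recent = defaultdict(deque)  # (school, user) -> timestamps still in window
    peaks = {}
    for ts, school, user in events:
        key = (school, user)
        stamps = recent[key]
        stamps.append(ts)
        # Drop timestamps that have fallen out of the window.
        while stamps and stamps[0] <= ts - window:
            stamps.popleft()
        if len(stamps) > limit:
            peaks[key] = max(peaks.get(key, 0), len(stamps))
    return peaks


# Illustrative data: one user signing on every second, one signing on normally.
events = [(t, "School1", "alice") for t in range(120)]
events += [(t, "School4", "bob") for t in (0, 30, 60, 90)]
events.sort()
print(find_noisy_users(events, limit=30))
```

Grouping by school as well as user makes it straightforward to see whether the abnormal traffic is confined to one school, as it was on each of the three days.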

Incident Resolved: ☒ Yes ☐ No ☐ Open

Time of Resolution: 09:15, 28/04/2020, Tuesday

18:30, 28/04/2020, Tuesday

Restoration Timeframe: 1hr 53mins 00secs

Issued By: Technical Support, support@studentnet.net & support@coherentcloud.com

PIR Issue Date: 30/04/2020

Contact Information: Please report any continued service disruption immediately to:

Studentnet NOC Support: +61 2 9281 3905

Support Email: support@studentnet.net & support@coherentcloud.com

Incident Description

27/04/2020

  • 08:15 Extraordinarily high sign in authentication activity observed
  • Some expectation that this reflected legitimate extra traffic caused by schools moving en masse to remote learning work models as a result of the COVID-19 pandemic.
  • There was also a possibility that the mis-behaving app problem was recurring
  • Schools started reporting poor sign-in performance
  • 08:30 all sign-in services stopped.
  • Investigation commenced to audit resources allocated to critical processes and heavily utilised schools
  • Investigation commenced to determine status of mis-behaving app.
  • Investigation determined:
  1. Mis-behaving app was present again at School1, but for only some of their account holders
  2. Weekly report generation was running overtime and creating unnecessary database load
  • 08:45 Reporting jobs were re-scheduled
  • 09:15 School1 service terminated until the school could confirm that it would no longer generate mis-behaved traffic.
  • Services to other schools re-commenced
  • 10:15 School1 services brought back online in a resource constrained container so as to have no impact on other services
  • 15:00 – Evidence started to appear that another school (School2) was exhibiting the same mis-behaving app traffic
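The report-scheduling finding above comes down to an interval overlap: a job that is planned to finish before the morning sign-in peak collides with it once it runs overtime. A minimal sketch of that check follows. The peak window, job start time and durations are hypothetical figures chosen for illustration; the report does not state the actual schedule.

```python
from datetime import datetime, timedelta


def overlaps(start_a, end_a, start_b, end_b):
    """True when the half-open intervals [start_a, end_a) and [start_b, end_b) intersect."""
    return start_a < end_b and start_b < end_a


def job_hits_peak(job_start, job_duration, peak_start, peak_end):
    """True when a job that starts at `job_start` and runs for `job_duration`
    is still running at some point during the peak window."""
    return overlaps(job_start, job_start + job_duration, peak_start, peak_end)


# Hypothetical figures for illustration only.
peak_start = datetime(2020, 4, 27, 8, 0)
peak_end = datetime(2020, 4, 27, 9, 30)
job_start = datetime(2020, 4, 27, 5, 0)

planned = job_hits_peak(job_start, timedelta(hours=2), peak_start, peak_end)
overrun = job_hits_peak(job_start, timedelta(hours=4), peak_start, peak_end)
print(planned, overrun)  # False True
```

Running the same check against the worst observed job duration, rather than the planned one, is one way to choose a start time that stays clear of the peak.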

28/04/2020

  • 08:10 – positive confirmation that School2 was generating same mis-behaved traffic from same app
  • 08:15 – attempts made to call IT admins at School2
  • Contact from School2 admins confirming that they are using the same app causing issues at School1
  • Inbound load increasing uncontrollably
  • Examination of load indicated that it was not being efficiently distributed to available servers
  • Planning for a new round-robin based resource allocation scheme was completed. This would require a DNS change to implement.
  • School2 resources constrained to a resource limited container.
  • 08:59 – first status notification texted out to all reporting schools:

    • Studentnet Cloudwork SSO service currently experiencing capacity issues. Further update coming shortly.
  • 09:00 DNS changes were implemented to establish a more efficient round-robin allocation of tasks to available servers

  • 09:15 DNS changes were deployed, starting the TTL propagation period

  • Services progressively came back online as the DNS TTL period expired

  • 09:31 – second status notification texted out to all reporting schools:

    • Studentnet Cloudwork SSO outage, update. A fix has been applied that required a DNS change. TTL for DNS propagation means that fix will take 30-40mins delay.
  • Capacity planning completed to dramatically increase available capacity. The plan included:

  1. Configure and physically deploy newly purchased servers into the DC. Hardware servers were purchased in March 2020, with planned deployment in Q2 2020. This plan was brought forward to be completed in 8 hours.
  2. Complete implementation of round-robin load allocation policy to now more efficiently utilise 3 available physical servers
  3. Configure and commission 2 new database servers increasing available DB capacity
  • 16:00 New servers delivered, and racked into the DC
  • 18:30 New servers network connected and incorporated into swarm.
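Two points from the entries above can be illustrated with a short sketch: round-robin allocation spreads requests evenly across the available servers, and a DNS change only takes full effect once every cached copy of the old record has expired. The server names, the 40-minute TTL and the cache ages below are assumptions for illustration. DNS round-robin in practice rotates the order of records between responses; the sketch models that as a strict rotation.

```python
from collections import Counter
from itertools import cycle


def round_robin(servers):
    """Return a function that hands out servers in strict rotation."""
    rotation = cycle(servers)
    return lambda: next(rotation)


def seconds_until_all_clients_switch(cache_ages, ttl):
    """Worst-case wait before every caching resolver has dropped the old record.

    `cache_ages` holds how long ago (in seconds) each resolver cached the old
    record at the moment of the DNS change; each entry expires `ttl - age`
    seconds later. Returns 0 when no resolver holds a cached copy.
    """
    return max((ttl - age for age in cache_ages), default=0)


# Hypothetical server names; the report only states that 3 physical servers exist.
pick = round_robin(["server-a", "server-b", "server-c"])
load = Counter(pick() for _ in range(9000))
print(load)

# Hypothetical TTL of 40 minutes and resolver cache ages of 0, 10 and 39 minutes.
print(seconds_until_all_clients_switch([0, 600, 2340], ttl=2400))
```

A resolver that cached the old record just before the change keeps it for the full TTL, which is why services returned progressively rather than all at once.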

29/04/2020

  • 06:19 – third status notification texted out to all reporting schools:

    • Studentnet Cloudwork status update: All systems operational. New hardware deployed, extra DBMSs and DNSs commissioned, app behaviour being monitored. PIR to follow. Please report any problems to 02 9281 1626. Thank You
  • 07:30 A third school (School3) was detected exhibiting the same mis-behaving app traffic

  • 08:00 Attempts being made to contact School3 IT admins

  • 08:15 Contact made with School3 IT admins advising of mis-behaving app traffic being generated

  • 08:30 School3 service placed into resource constrained container to isolate any impact on other schools

  • 08:30-08:55 School3 experiences slow performance issues arising from self-generated mis-behaving traffic. All other services continue unaffected

  • 08:55 School3 disables mis-behaving app

  • 09:10 School3 re-enables mis-behaving app in constrained fashion.
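The isolation described above was achieved with a resource constrained container. The same idea, that one school's excess traffic should only slow that school down, can also be expressed at the application level with a per-school token bucket. This is a different technique from the one used in the incident, shown only as a sketch; the class names, rate and capacity are hypothetical.

```python
class TokenBucket:
    """Allow `rate` requests per second on average, with bursts up to `capacity`."""

    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)
        self.updated = 0.0

    def allow(self, now):
        # Refill in proportion to the time elapsed since the last call.
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


class PerSchoolLimiter:
    """Keep an independent bucket per school so that one school cannot
    consume another school's allowance."""

    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.buckets = {}

    def allow(self, school, now):
        if school not in self.buckets:
            self.buckets[school] = TokenBucket(self.rate, self.capacity)
        return self.buckets[school].allow(now)


# Hypothetical limits: 10 sign-ons per second per school, bursts of 20.
limiter = PerSchoolLimiter(rate=10, capacity=20)
noisy = sum(limiter.allow("School3", now=0.0) for _ in range(1000))
quiet = sum(limiter.allow("School4", now=0.0) for _ in range(5))
print(noisy, quiet)  # 20 5
```

Passing `now` explicitly keeps the sketch deterministic; a real deployment would read a monotonic clock instead.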

Root Cause

  • Inappropriate authentication behaviour by an app
  • Poorly timed weekly report generation jobs

Recommendations/Preventative Measures

  • Audit app behaviours and request rectification to conform to standard protocols where needed
  • Monitor resource usage, re-allocating and optimising where needed
  • Implement a Cloudwork status notification page
  • Prepare for further orders-of-magnitude growth in authentication volumes as remote learning is established as the new normal mode of operation

                                                                                         _oOo_
    

Strictly Confidential

Copyright © 2020, Studentnet/Coherent Cloud(CoClo) ABN 90 001 966 892

Posted May 01, 2020 - 15:36 AEST

Resolved
Descriptive Name: SSO outage
Incident Reference Number: 20200429
Date of Incident: 27/04/2020, Monday & 28/04/2020, Tuesday
Time & Duration of Incident: 08:30 to 09:25 AEST 27/04/2020
08:07 to 09:05 AEST 28/04/2020
Total: 1hr 53min

Severity: ☒ Service Affecting ☐ Access Affecting
☒ Performance Affecting ☐ Network Affecting

Location Affected: ☐ Isolated school ☐ Host + all VMs
☐ Sub-Net ☒ Multiple schools
Services Affected: - SSO login from any location inoperative.
Posted Apr 28, 2020 - 08:00 AEST