During operating hours in the EU region, we experienced a larger-than-normal volume of NinjaOne Remote session traffic. This increase created a backlog in session validation processing that our system struggled to clear. Because session validations were delayed, NinjaOne Remote invalidated those sessions to protect our customers' security, which caused connections to devices to drop during the outage. To provide immediate relief, NinjaOne began scaling up infrastructure services and identified further mitigations that are included in the 12.1.0 release.
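To illustrate this failure mode, the sketch below models a TTL-based validation check: a session stays connected only if its most recent successful validation is newer than an allowed window, so a backlog that delays validations causes otherwise healthy sessions to be dropped. The names, TTL value, and structure are hypothetical assumptions for illustration and are not NinjaOne Remote's actual implementation.

```python
import time
from dataclasses import dataclass, field

# Hypothetical TTL: how long a session may go without a successful
# validation before it is treated as invalid (value is illustrative).
VALIDATION_TTL_SECONDS = 30.0

@dataclass
class RemoteSession:
    session_id: str
    last_validated: float = field(default_factory=time.monotonic)

def is_session_valid(session: RemoteSession, now: float) -> bool:
    # A session stays connected only if it was validated within the TTL.
    return (now - session.last_validated) <= VALIDATION_TTL_SECONDS

def reap_stale_sessions(sessions: dict[str, RemoteSession]) -> list[str]:
    """Drop every session whose last validation is older than the TTL.

    When the validation pipeline is backlogged, `last_validated` is not
    refreshed in time, so even healthy sessions fail this check and are
    disconnected as a security precaution.
    """
    now = time.monotonic()
    stale = [sid for sid, s in sessions.items() if not is_session_valid(s, now)]
    for sid in stale:
        del sessions[sid]
    return stale
```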
In addition to the NinjaOne Remote incident, NinjaOne experienced significant performance degradation on several backend databases due to capacity saturation. Each of these databases exhibited a different degradation pattern, which complicated debugging because several distinct issues had to be investigated concurrently. Customers mostly experienced this as page load timeouts, delays in agent data processing, and slow console page loads.
NinjaOne scaled up database capacity and rebalanced customers across database clusters to bring performance back to normal levels and resolve the incident.
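For illustration only, the sketch below shows one simple way such a tenant rebalance could be planned: move the heaviest tenants off a saturated cluster onto the cluster with the most free headroom. The cluster names, load figures, and saturation threshold are hypothetical assumptions, not NinjaOne's actual topology or tooling.

```python
# Hypothetical cluster headroom (fraction of capacity still free); the
# names and numbers are illustrative, not NinjaOne's real topology.
cluster_headroom = {"eu-db-1": 0.05, "eu-db-2": 0.60, "eu-db-3": 0.55}

def plan_rebalance(customer_load: dict[str, float],
                   assignment: dict[str, str],
                   saturation_threshold: float = 0.10) -> dict[str, str]:
    """Plan tenant moves off clusters whose free headroom is below a threshold.

    Tenants are visited heaviest-first; each tenant moved off a saturated
    cluster is placed on the cluster with the most remaining headroom.
    """
    headroom = dict(cluster_headroom)
    new_assignment = dict(assignment)
    for customer, load in sorted(customer_load.items(), key=lambda kv: -kv[1]):
        source = new_assignment[customer]
        if headroom[source] >= saturation_threshold:
            continue  # source cluster still has headroom; leave tenant in place
        target = max(headroom, key=headroom.get)
        if target != source:
            headroom[source] += load   # load leaves the saturated cluster
            headroom[target] -= load   # and lands on the least-loaded cluster
            new_assignment[customer] = target
    return new_assignment

# Example: the heaviest tenant on the saturated eu-db-1 is moved to eu-db-2.
moves = plan_rebalance(
    {"tenant-a": 0.20, "tenant-b": 0.10, "tenant-c": 0.05},
    {"tenant-a": "eu-db-1", "tenant-b": "eu-db-1", "tenant-c": "eu-db-2"},
)
```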
The incident stretched across multiple days because the issues only reproduced under peak load during EU business hours, which limited active troubleshooting to those windows. We are using the metrics and patterns observed during the incident to continue improving our monitoring, while addressing the specific contributing issues in the 12.0.30 and 12.1.0 releases.