On February 25 and February 26, Shepherd experienced two separate production outages during peak business hours. In both incidents the application was fully unavailable, and Read-Only mode was only intermittently accessible.
Although the timing overlapped with release activity, both incidents were caused by resource exhaustion and thread pool starvation in our production environment under peak load.
At 2:32pm ET on February 25, production application servers became unresponsive within seconds of each other. The application did not crash from an exception; instead, worker processes froze and stopped responding to health checks.
As threads became blocked and connection pools were exhausted, the system stopped accepting new requests. This created a cascading failure across all servers.
Manual reboots and the addition of capacity restored service by 3:12pm ET.
The infrastructure had limited scaling headroom under peak daytime load. When threads were exhausted simultaneously across servers, the application became unresponsive.
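The failure mode described above, where servers stop answering health checks without raising any exception, can be illustrated with a small simulation. This is a minimal sketch and not Shepherd's actual code: the pool size, the `handle_request` and `health_check` functions, and the event standing in for a slow dependency are all hypothetical.

```python
import threading
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

POOL_SIZE = 4  # hypothetical worker count, chosen only for illustration


def simulate_starvation(pool_size=POOL_SIZE, health_timeout=0.2):
    """Return (starved, probe_result) for a pool whose workers are all blocked."""
    # Stands in for a slow downstream resource (e.g. an exhausted connection pool).
    dependency_ready = threading.Event()

    def handle_request():
        dependency_ready.wait()  # the worker thread is held here, doing nothing
        return "done"

    def health_check():
        return "ok"  # trivial, but it still needs a free worker to run

    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        # Fill every worker with a request that is waiting on the dependency.
        blocked = [pool.submit(handle_request) for _ in range(pool_size)]
        # The health check is queued behind them and cannot start.
        probe = pool.submit(health_check)
        try:
            probe.result(timeout=health_timeout)
            starved = False
        except FutureTimeout:
            starved = True  # no exception in the app itself, just no response

        # Once the dependency recovers, workers free up and the probe runs.
        dependency_ready.set()
        probe_result = probe.result(timeout=5)
        for future in blocked:
            future.result()
    return starved, probe_result


if __name__ == "__main__":
    print(simulate_starvation())
```

In this sketch nothing crashes: the health check simply waits in the queue until a worker is released, which is why the outage looked like frozen processes rather than errors.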
On February 26 at 3:32pm ET, a midday deployment was performed to correct a user-impacting feature. During the deployment process:
Clearing the cache under peak load caused all servers to reload data from Redis and the database at the same time. This created a surge in concurrent requests, which again exhausted the available thread pools and caused cascading failures similar to those on February 25.
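The reload surge described above is often called a cache stampede. The sketch below is illustrative only and makes several assumptions: `ReloadingCache`, `slow_backend`, and the request count are hypothetical, and the `single_flight` option shows one common way to collapse concurrent reloads, not necessarily the change being made to Shepherd.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class ReloadingCache:
    """Tiny in-memory cache that counts how often it falls through to the backend."""

    def __init__(self, load_from_backend, single_flight=False):
        self._load_from_backend = load_from_backend
        self._single_flight = single_flight
        self._data = {}
        self._reload_lock = threading.Lock()
        self._count_lock = threading.Lock()
        self.backend_loads = 0

    def clear(self):
        self._data.clear()

    def get(self, key):
        if key in self._data:
            return self._data[key]
        if self._single_flight:
            # Only one caller reloads; the rest wait and reuse its result.
            with self._reload_lock:
                if key not in self._data:
                    self._data[key] = self._load(key)
                return self._data[key]
        # Naive path: every caller that misses goes to the backend itself.
        self._data[key] = self._load(key)
        return self._data[key]

    def _load(self, key):
        with self._count_lock:
            self.backend_loads += 1
        return self._load_from_backend(key)


def slow_backend(key):
    time.sleep(0.1)  # hypothetical latency of a database or Redis round trip
    return f"value-for-{key}"


def loads_after_clear(single_flight, concurrent_requests=8):
    """Warm the cache, clear it, then hit it with simultaneous requests."""
    cache = ReloadingCache(slow_backend, single_flight=single_flight)
    cache.get("settings")      # warm cache, as before the deployment
    cache.clear()              # the cache clear performed under load
    cache.backend_loads = 0
    start = threading.Barrier(concurrent_requests)

    def request():
        start.wait()           # release all requests at the same moment
        return cache.get("settings")

    with ThreadPoolExecutor(max_workers=concurrent_requests) as pool:
        futures = [pool.submit(request) for _ in range(concurrent_requests)]
        for future in futures:
            future.result()
    return cache.backend_loads


if __name__ == "__main__":
    print("naive reloads:", loads_after_clear(single_flight=False))
    print("single-flight reloads:", loads_after_clear(single_flight=True))
```

With the naive path, requests that arrive while the cache is empty each trigger their own backend load; with `single_flight=True`, the same burst results in a single load. In production the same effect is multiplied across every cached key and every server.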
Primary instability lasted approximately 27 minutes, with full stabilization by 4:42pm ET.
Beginning this weekend and into next week, the following changes are being implemented:
These two incidents exposed infrastructure limits under peak load and deployment conditions. While the feature releases themselves did not introduce logic errors that caused the outages, the infrastructure and deployment safeguards were not sufficient for current usage patterns.
We take full responsibility for strengthening these systems and processes to prevent recurrence.
A separate communication from Amber on our team addresses user concerns, release timing, and broader context.