Shepherd Application System Outage

Incident Report for Shepherd Veterinary Solutions

Postmortem

Summary

On February 25 and February 26, Shepherd experienced two separate production outages during peak business hours. In both incidents the application was fully unavailable, and Read-Only mode was only intermittently accessible.

February 25: 2:32pm–3:12pm ET (~40 minutes)
February 26: 3:32pm–4:42pm ET (~70 minutes from first degradation to full stabilization; primary instability ~27 minutes)

While the timing overlapped with release activity, both incidents were the result of resource exhaustion and thread pool starvation within our production environment under peak load conditions.

February 25

What Happened

At 2:32pm ET on February 25, production application servers became unresponsive within seconds of each other. The application did not crash from an exception; instead, worker processes froze and stopped responding to health checks.

As threads became blocked and connection pools were exhausted, the system stopped accepting new requests. This created a cascading failure across all servers.

Manual reboots and the addition of capacity restored service by 3:12pm ET.

Technical Cause

Progressive thread pool starvation
Blocking operations consumed available worker threads
Database connection pools became exhausted during recovery
Reduced infrastructure headroom left no buffer for failure
A cache shared between the primary app and EES contributed to EES also being unavailable, which increased the impact

The infrastructure had limited scaling headroom under peak daytime load. When threads were exhausted simultaneously across servers, the application became unresponsive.
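
The failure mode described above, where workers freeze rather than crash and health checks go unanswered, can be reproduced in miniature. The Python sketch below is illustrative only: this report does not state Shepherd's runtime or pool implementation, and the names health_check_ok and demo_starvation, the pool size of 4, and the timeouts are assumptions chosen for the example.

```python
import threading
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout


def health_check_ok(pool, timeout_s=0.2):
    """Return True if the pool runs a trivial task within timeout_s seconds."""
    future = pool.submit(lambda: "ok")
    try:
        return future.result(timeout=timeout_s) == "ok"
    except FutureTimeout:
        future.cancel()  # drop the queued probe so it does not run later
        return False


def demo_starvation(workers=4):
    """Block every worker, probe the pool, then release and probe again."""
    release = threading.Event()
    pool = ThreadPoolExecutor(max_workers=workers)
    # Each task stands in for a request stuck on a blocking operation.
    blocked = [pool.submit(release.wait) for _ in range(workers)]
    ok_while_blocked = health_check_ok(pool)
    release.set()  # the blocking operations finally complete
    for task in blocked:
        task.result()
    ok_after_release = health_check_ok(pool, timeout_s=2.0)
    pool.shutdown()
    return ok_while_blocked, ok_after_release
```

Calling demo_starvation() returns (False, True): while every worker is blocked the health probe sits in the queue and times out, even though no exception has been raised anywhere, and the pool answers again once the blocked work is released.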

February 26

What Happened

On February 26 at 3:32pm ET, a midday deployment was performed to correct a user-impacting feature. During the deployment process:

A maintenance script triggered a full cache clear
The deploy process also included a manual cache clear

Clearing cache under peak load caused all servers to simultaneously reload data from Redis and the database. This created a surge in concurrent requests, which again exhausted available thread pools and caused cascading failures similar to February 25.

Primary instability lasted approximately 27 minutes, with full stabilization by 4:42pm ET.
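
One common way to soften the reload surge that follows a cache clear is to let concurrent misses for the same key share a single backend load (often called single-flight or request coalescing). This is a general technique, not a description of Shepherd's fix. The sketch below is a minimal in-memory version in Python; SingleFlightCache, count_loads, the key "clinic:42", and the 50 ms loader delay are all hypothetical.

```python
import threading
import time


class SingleFlightCache:
    """Cache where concurrent misses for one key share a single load."""

    def __init__(self, loader):
        self._loader = loader            # hypothetical function: key -> value
        self._values = {}
        self._key_locks = {}
        self._guard = threading.Lock()   # protects the two dicts above

    def get(self, key):
        with self._guard:
            if key in self._values:
                return self._values[key]
            key_lock = self._key_locks.setdefault(key, threading.Lock())
        # One thread per key performs the load; the others wait on key_lock
        # and then find the value already cached.
        with key_lock:
            with self._guard:
                if key in self._values:
                    return self._values[key]
            value = self._loader(key)
            with self._guard:
                self._values[key] = value
            return value


def count_loads(thread_count=20):
    """Hit one cold key from many threads at once; return (loads, results)."""
    calls, results = [], []
    start = threading.Barrier(thread_count)

    def slow_loader(key):
        calls.append(key)    # record each backend load
        time.sleep(0.05)     # stand-in for a Redis or database round trip
        return key.upper()

    cache = SingleFlightCache(slow_loader)

    def worker():
        start.wait()         # release all threads together, like a cold cache
        results.append(cache.get("clinic:42"))

    threads = [threading.Thread(target=worker) for _ in range(thread_count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(calls), results
```

With 20 threads requesting the same cold key, count_loads() reports a single backend load instead of 20. Note the limit of this approach: the waiting threads are still occupied while the load runs, so coalescing reduces pressure on Redis and the database but does not by itself free worker threads.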

Technical Cause

Full cache clear during peak traffic
Simultaneous cache reload across all servers
Surge in Redis and database requests
Thread pool exhaustion under load
A cache shared between the primary app and EES contributed to EES also being unavailable, which increased the impact

Contributing Factors Across Both Incidents

Limited scaling headroom during peak usage
Thread pool configuration too small for spike conditions
Connection pool limits too restrictive for recovery scenarios
Blue/green deployment process not optimized for high-traffic windows
Full cache clear behavior not safe for midday deployments
Not enough compute resources (servers) to scale appropriately
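
To illustrate why a thread pool sized for normal traffic can be too small for spike or recovery conditions, the sketch below applies Little's law (requests in flight = arrival rate x average latency) and adds a headroom margin. The function name required_workers, the 25% headroom default, and the traffic figures in the usage note are assumptions for illustration; they are not measurements from these incidents.

```python
import math


def required_workers(requests_per_second, avg_latency_seconds, headroom=0.25):
    """Estimate worker threads needed to keep up with a given load.

    Little's law gives the average number of requests in flight; headroom
    is extra capacity on top of that (0.25 means 25% more).
    """
    in_flight = requests_per_second * avg_latency_seconds
    return math.ceil(in_flight * (1 + headroom))
```

For a hypothetical 200 requests per second at 0.25 seconds each, required_workers(200, 0.25) gives 63. If a slow dependency pushes average latency to 2 seconds at the same request rate, required_workers(200, 2.0) gives 500, which shows how quickly a latency spike can outgrow a fixed pool.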

Immediate Actions

Beginning this weekend and continuing into next week, the following changes are being implemented:

Increasing thread pool minimum thresholds
Increasing database connection pool limits
Adding additional application server capacity
Removing full cache clear behavior from deployment and maintenance scripts
Implementing more granular cache invalidation controls
Adjusting scaling thresholds to add headroom earlier
Beginning a formal architecture review led by our CTO
Establishing an architecture review board to evaluate infrastructure resilience
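
As an illustration of what more granular cache invalidation can look like compared with a full clear, the Python sketch below removes only the entries under a given key prefix and leaves the rest of the cache warm. It is an in-memory stand-in, not Shepherd's implementation: the class NamespacedCache, the method invalidate_prefix, and the "clinic:<id>:" key scheme are hypothetical.

```python
class NamespacedCache:
    """In-memory cache that can invalidate one key prefix at a time."""

    def __init__(self):
        self._data = {}

    def set(self, key, value):
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

    def invalidate_prefix(self, prefix):
        """Remove only keys starting with prefix; return how many were removed."""
        doomed = [key for key in self._data if key.startswith(prefix)]
        for key in doomed:
            del self._data[key]
        return len(doomed)


cache = NamespacedCache()
cache.set("clinic:1:schedule", ["9:00 exam"])
cache.set("clinic:1:settings", {"tz": "ET"})
cache.set("clinic:2:schedule", ["10:30 surgery"])

# A change affecting only clinic 1 evicts two entries; clinic 2 stays warm.
removed = cache.invalidate_prefix("clinic:1:")
```

Here removed is 2 and the clinic 2 entry is still served from cache, so only the affected slice of data has to be reloaded under load.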

Longer-Term Improvements Under Review

Production auto-scaling
Rolling and canary deployment safeguards
EES isolation from primary cache and infrastructure
Multi-zone failover improvements
Expanded monitoring and alerting
Formalized release-day health validation checks
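
A release-day health validation check could take the form of a gate that refuses to proceed when the environment has too little headroom. The Python sketch below is one possible shape under stated assumptions: the HealthSnapshot fields, the release_blockers function, and the 70% utilization threshold are hypothetical and are not taken from Shepherd's tooling.

```python
from dataclasses import dataclass


@dataclass
class HealthSnapshot:
    """Point-in-time readings a deploy gate might inspect (hypothetical)."""
    busy_threads: int
    max_threads: int
    db_connections_in_use: int
    db_connection_limit: int
    healthy_servers: int
    total_servers: int


def release_blockers(snapshot, max_utilization=0.7):
    """Return a list of reasons not to deploy now; empty means proceed."""
    blockers = []
    if snapshot.busy_threads / snapshot.max_threads > max_utilization:
        blockers.append("thread pool utilization above threshold")
    if snapshot.db_connections_in_use / snapshot.db_connection_limit > max_utilization:
        blockers.append("database connection pool utilization above threshold")
    if snapshot.healthy_servers < snapshot.total_servers:
        blockers.append("one or more servers failing health checks")
    return blockers


calm = HealthSnapshot(40, 100, 20, 50, 6, 6)
peak = HealthSnapshot(90, 100, 45, 50, 5, 6)
```

With these sample readings, release_blockers(calm) returns an empty list, while release_blockers(peak) returns all three reasons, which a deploy script could use to postpone a midday release.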

Closing

These two incidents exposed infrastructure limits under peak load and deployment conditions. While the feature releases themselves did not introduce logic errors that caused the outages, the infrastructure and deployment safeguards were not sufficient for current usage patterns.

We take full responsibility for strengthening these systems and processes to prevent recurrence.

A separate communication from Amber on our team addresses user concerns, release timing, and broader context.

Posted Feb 27, 2026 - 17:21 EST

Resolved

The incident has been resolved and Shepherd has maintained stability. A full postmortem will be shared once the final investigation has been completed.

Posted Feb 25, 2026 - 16:33 EST

Monitoring

Service has been restored and Shepherd is accessible again. A full postmortem will be shared once the final investigation has been completed.

Posted Feb 25, 2026 - 15:27 EST

Investigating

We are currently investigating an issue with accessing Shepherd.

Read-only access to your appointment schedule and medical records is available here - https://ees.shepherd.vet/login

Learn more about read-only mode here -
https://help.shepherd.vet/en/articles/9142634-read-only-mode-and-emergency-kit

Posted Feb 25, 2026 - 14:43 EST

This incident affected: Shepherd Application, SMS (Texting Services), AI Services (TranscribeAI, SummarizeAI, DiagnoseAI), OpenAPI (Clinic Connect and Integrations), and Shepherd Support Chat (Intercom).