Shepherd Application System Outage

Incident Report for Shepherd Veterinary Solutions

Postmortem

Summary

On February 25 and February 26, Shepherd experienced two separate production outages during peak business hours. Both incidents resulted in full application unavailability, with only intermittent access to Read-Only mode.

  • February 25: 2:32pm–3:12pm ET (~40 minutes)
  • February 26: 3:32pm–4:42pm ET (~70 minutes from first degradation to full stabilization; primary instability ~27 minutes)

While the timing overlapped with release activity, both incidents were the result of resource exhaustion and thread pool starvation within our production environment under peak load conditions.

February 25

What Happened

At 2:32pm ET on February 25, production application servers became unresponsive within seconds of each other. The application did not crash from an exception; instead, worker processes froze and stopped responding to health checks.

As threads became blocked and connection pools were exhausted, the system stopped accepting new requests. This created a cascading failure across all servers.
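As a simplified illustration of this failure mode (a toy sketch, not Shepherd's actual stack), a fixed worker pool can be starved by a handful of blocking operations: once every worker is parked on slow I/O, new requests only queue and the server looks frozen to health checks.

```python
import concurrent.futures
import threading
import time

# Hypothetical sketch: 4 workers stand in for an app server's thread pool.
pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)
blocker = threading.Event()  # simulates a blocking call that doesn't return


def handle_request(i):
    # Every request performs a blocking operation (e.g. a slow DB query);
    # the worker thread is parked here and cannot serve anything else.
    blocker.wait(timeout=2)
    return i


# The first 4 requests occupy every worker; the remaining 6 only queue.
futures = [pool.submit(handle_request, i) for i in range(10)]
time.sleep(0.5)

running = sum(1 for f in futures if f.running())
queued = sum(1 for f in futures if not f.running() and not f.done())
print(f"{running} workers busy, {queued} requests queued")

blocker.set()  # release the "I/O" so the demo can shut down cleanly
pool.shutdown()
```

With the pool saturated, health-check requests sit in the same queue as everything else, which is why the processes appeared unresponsive rather than crashed.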

Manual reboots and the addition of capacity restored service by 3:12pm ET.

Technical Cause

  • Progressive thread pool starvation
  • Blocking operations consumed available worker threads
  • Database connection pools became exhausted during recovery
  • Reduced infrastructure headroom left no buffer for failure
  • A cache shared between the primary application and EES meant EES was also unavailable, compounding the impact

The infrastructure had limited scaling headroom under peak daytime load. When threads were exhausted simultaneously across servers, the application became unresponsive.

February 26

What Happened

On February 26 at 3:32pm ET, a midday deployment was performed to correct a user-impacting issue. During the deployment process:

  • A maintenance script triggered a full cache clear
  • The deploy process also cleared the cache manually

Clearing cache under peak load caused all servers to simultaneously reload data from Redis and the database. This created a surge in concurrent requests, which again exhausted available thread pools and caused cascading failures similar to February 25.
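This reload surge is a classic cache stampede, and a common mitigation is per-key "single-flight" loading: when a key is missing, one caller recomputes it while concurrent callers for the same key wait, rather than all of them hitting the database at once. The sketch below uses hypothetical names and is not Shepherd's actual code.

```python
import threading


class SingleFlightCache:
    """Toy cache: on a miss, only one thread pays the expensive backend
    load; concurrent callers for the same key wait for that result
    instead of piling onto the database after a cache clear."""

    def __init__(self):
        self._data = {}
        self._locks = {}
        self._guard = threading.Lock()
        self.loads = 0  # counts expensive backend loads

    def get(self, key, loader):
        if key in self._data:
            return self._data[key]
        with self._guard:
            lock = self._locks.setdefault(key, threading.Lock())
        with lock:
            if key not in self._data:   # re-check after acquiring the lock
                self.loads += 1         # only one thread reaches this line
                self._data[key] = loader(key)
        return self._data[key]


cache = SingleFlightCache()
threads = [
    threading.Thread(target=cache.get, args=("plan", lambda k: k.upper()))
    for _ in range(50)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(cache.loads)  # 1: fifty concurrent reads, one backend load
```

Without the per-key lock, all fifty readers would miss simultaneously and issue fifty backend loads, which is the surge pattern described above at production scale.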

Primary instability lasted approximately 27 minutes, with full stabilization by 4:42pm ET.

Technical Cause

  • Full cache clear during peak traffic
  • Simultaneous cache reload across all servers
  • Surge in Redis and database requests
  • Thread pool exhaustion under load
  • A cache shared between the primary application and EES meant EES was also unavailable, compounding the impact

Contributing Factors Across Both Incidents

  1. Limited scaling headroom during peak usage
  2. Thread pool configuration too small for spike conditions
  3. Connection pool limits too restrictive for recovery scenarios
  4. Blue/green deployment process not optimized for high-traffic windows
  5. Full cache clear behavior not safe for midday deployments
  6. Insufficient compute resources (servers) to scale appropriately

Immediate Actions

Beginning this weekend and into next week, the following changes are being implemented:

  • Increasing thread pool minimum thresholds
  • Increasing database connection pool limits
  • Adding additional application server capacity
  • Removing full cache clear behavior from deployment and maintenance scripts
  • Implementing more granular cache invalidation controls
  • Adjusting scaling thresholds to add headroom earlier
  • Beginning a formal architecture review led by our CTO
  • Establishing an architecture review board to evaluate infrastructure resilience
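"More granular cache invalidation" typically means replacing a global flush with tag- or prefix-scoped invalidation, so a deploy evicts only the entries it actually changed and the rest of the cache stays warm. The sketch below uses assumed names and is only an illustration of the concept, not Shepherd's implementation.

```python
class TaggedCache:
    """Toy cache where entries carry tags, so a deploy can invalidate
    just the data it touched instead of clearing everything."""

    def __init__(self):
        self._data = {}  # key -> value
        self._tags = {}  # key -> set of tags

    def set(self, key, value, tags=()):
        self._data[key] = value
        self._tags[key] = set(tags)

    def get(self, key):
        return self._data.get(key)

    def invalidate_tag(self, tag):
        """Drop only entries carrying `tag`; everything else stays cached."""
        stale = [k for k, tags in self._tags.items() if tag in tags]
        for k in stale:
            del self._data[k]
            del self._tags[k]
        return stale


cache = TaggedCache()
cache.set("schedule:42", "9:00am exam", tags=["schedule"])
cache.set("record:7", "rabies vaccine", tags=["records"])

# A deploy that only touched scheduling invalidates the schedule tag;
# medical-record entries remain warm, so there is no full reload surge.
cache.invalidate_tag("schedule")
print(cache.get("schedule:42"), cache.get("record:7"))
```

Scoped invalidation keeps the post-deploy reload proportional to what changed, rather than forcing every server to rebuild the entire cache at once.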

Longer-Term Improvements Under Review

  • Production auto-scaling
  • Rolling and canary deployment safeguards
  • EES isolation from primary cache and infrastructure
  • Multi-zone failover improvements
  • Expanded monitoring and alerting
  • Formalized release-day health validation checks

Closing

These two incidents exposed infrastructure limits under peak load and deployment conditions. While the feature releases themselves did not introduce logic errors that caused the outages, the infrastructure and deployment safeguards were not sufficient for current usage patterns.

We take full responsibility for strengthening these systems and processes to prevent recurrence.

A separate communication from Amber on our team addresses user concerns, release timing, and broader context.

Posted Feb 27, 2026 - 17:20 EST

Resolved

The incident has been resolved and Shepherd has maintained stability. A full postmortem will be shared once the final investigation has been completed.
Posted Feb 26, 2026 - 16:44 EST

Investigating

We are currently investigating an issue with accessing Shepherd.

Read-only access to your appointment schedule and medical records is available here - https://ees.shepherd.vet/login

Learn more about read-only here -
https://help.shepherd.vet/en/articles/9142634-read-only-mode-and-emergency-kit
Posted Feb 26, 2026 - 15:45 EST
This incident affected: Shepherd Application.