Shepherd Application System Outage

Incident Report for Shepherd Veterinary Solutions

Postmortem

Summary

On September 17, 2025, Shepherd experienced an outage from 4:28 PM ET until 5:01 PM ET. During this time, users were unable to access the application. Service was restored at 5:01 PM ET and has remained stable since.

What Happened

The outage was triggered by a sudden spike in database requests. Under normal conditions, Shepherd processes around 150 requests at any given time. On September 17, that number surged to over 1,000 requests within a few minutes, overwhelming system resources and making the application unavailable.

The exact trigger for the surge is still under investigation. It appears to be tied to existing queries unexpectedly consuming far more resources than usual, and it was not caused by a recent release.

What We Have Implemented

  • Improved Monitoring & Alarms: Lowered the threshold on our alerts for elevated request volumes, giving engineering several extra minutes to respond before widespread impact.
  • Load Mitigation: Continued shifting heavy or long-running queries off the main database and onto read-only systems as part of ongoing infrastructure improvements.
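
For readers curious how a lowered alert threshold buys response time, the idea can be sketched as below. This is an illustrative sketch only, not Shepherd's actual monitoring configuration: the baseline of 150 concurrent requests comes from this report, while the multipliers, the names AlertPolicy and classify_load, and the "warn"/"page" levels are hypothetical.

```python
from dataclasses import dataclass

# Typical concurrent database requests, as cited in this report.
BASELINE_CONCURRENT_REQUESTS = 150


@dataclass(frozen=True)
class AlertPolicy:
    """Hypothetical alert thresholds, expressed as multiples of the baseline."""
    warn_multiplier: float = 2.0  # assumption: early warning at 2x baseline
    page_multiplier: float = 4.0  # assumption: page on-call at 4x baseline


def classify_load(concurrent_requests: int,
                  policy: AlertPolicy = AlertPolicy(),
                  baseline: int = BASELINE_CONCURRENT_REQUESTS) -> str:
    """Return "ok", "warn", or "page" for the observed request count."""
    if concurrent_requests >= baseline * policy.page_multiplier:
        return "page"
    if concurrent_requests >= baseline * policy.warn_multiplier:
        return "warn"
    return "ok"


if __name__ == "__main__":
    # A surge climbing from normal load toward the level seen in the outage.
    for observed in (150, 300, 600, 1000):
        print(observed, classify_load(observed))
```

With these assumed multipliers, a surge like the one on September 17 would raise a warning at 300 requests and page engineering at 600, before reaching the 1,000-request level at which the application became unavailable.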

What’s Next

  • Resilience Improvements: We’re working on changes that reduce single points of failure so an issue in one area doesn’t impact all practices.
  • Deeper Analysis: We are continuing to investigate why the spike occurred. No repeating pattern has been observed in the days since the outage.

Although the precise root cause of the request surge remains under investigation, immediate safeguards are in place to detect problems earlier, contain them faster, and minimize the risk of another system-wide outage.

Thank you for your continued trust, patience, and partnership.

— The Shepherd Team

Posted Sep 24, 2025 - 19:02 EDT

Resolved

Between 4:28 PM ET and 5:01 PM ET, users were unable to access the Shepherd application due to a system-wide outage. Service has been fully restored and has remained stable since.
Posted Sep 17, 2025 - 17:00 EDT

Investigating

An issue was identified preventing access to the Shepherd application for all users.
Posted Sep 17, 2025 - 16:30 EDT
This incident affected: Shepherd Application, SMS (Texting Services), AI Services (TranscribeAI, EchoAI, DiagnoseAI), and OpenAPI (Clinic Connect and Integrations).