Client Context
“MediHealth Systems” is a fast-growing SaaS provider that powers Electronic Health-Record (EHR), E-Prescription and Tele-Consultation services for 350+ clinics across the United Kingdom.
- Handles ~3 million API calls/day, with spikes during morning clinic hours.
- The entire stack runs on AWS (EC2, RDS PostgreSQL, Lambda, API Gateway, AWS HealthLake, X-Ray).
- Three core teams—DevOps, Application Engineering, Clinical Support—share a strict 99.99 % SLA and HIPAA compliance duties.
The Challenge
- Metrics, logs & traces scattered in separate consoles
- Little real-time visibility; detection relied on alarms or on user complaints
- Root-cause searching took ~1 hour with lots of tab-switching
- Errors often went unnoticed
Solution Overview
Deploy a unified Amazon CloudWatch Dashboard that surfaces metrics, logs and traces side-by-side for every critical microservice. The dashboard:
- Auto-refreshes every 30 seconds for near real-time insight.
- Embeds CloudWatch Logs Insights queries, CloudWatch Metrics graphs and AWS X-Ray ServiceLens widgets.
- Displays key alarms and anomaly-detection bands so that potential incidents stand out immediately.
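
As a rough illustration of how metrics, logs and alarms can sit side by side, the sketch below builds a CloudWatch dashboard body with one metric widget, one Logs Insights widget and one alarm widget. The region, API name, log group, alarm ARN and dashboard name are hypothetical placeholders, not values from the MediHealth deployment, and the `publish` helper is left uncalled because it needs boto3 and AWS credentials.

```python
import json

# All names below are hypothetical placeholders for illustration only.
REGION = "eu-west-2"
LOG_GROUP = "/medihealth/prescription-service"
ALARM_ARN = "arn:aws:cloudwatch:eu-west-2:123456789012:alarm:prescription-5xx"


def build_dashboard_body() -> dict:
    """Return a dashboard body with one metric, one log and one alarm widget."""
    metric_widget = {
        "type": "metric",
        "x": 0, "y": 0, "width": 12, "height": 6,
        "properties": {
            "title": "API Gateway 5XX errors",
            "region": REGION,
            "metrics": [["AWS/ApiGateway", "5XXError", "ApiName", "prescriptions-api"]],
            "stat": "Sum",
            "period": 60,
            "view": "timeSeries",
        },
    }
    log_widget = {
        "type": "log",
        "x": 12, "y": 0, "width": 12, "height": 6,
        "properties": {
            "title": "ERROR lines per 5 minutes",
            "region": REGION,
            # Logs Insights query counting ERROR lines in 5-minute buckets.
            "query": (
                f"SOURCE '{LOG_GROUP}' | fields @timestamp, @message "
                "| filter @message like /ERROR/ "
                "| stats count() as errors by bin(5m)"
            ),
            "view": "timeSeries",
        },
    }
    alarm_widget = {
        "type": "alarm",
        "x": 0, "y": 6, "width": 24, "height": 3,
        "properties": {"title": "Key alarms", "alarms": [ALARM_ARN]},
    }
    return {"widgets": [metric_widget, log_widget, alarm_widget]}


def publish(name: str = "medihealth-unified") -> None:
    """Push the dashboard to CloudWatch (requires boto3 and credentials)."""
    import boto3

    boto3.client("cloudwatch", region_name=REGION).put_dashboard(
        DashboardName=name, DashboardBody=json.dumps(build_dashboard_body())
    )


body = build_dashboard_body()
print([w["type"] for w in body["widgets"]])
```

Keeping the body as a plain dictionary makes it easy to review in version control before anything is pushed to an account.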
Core Components
| Component | Description |
| --- | --- |
| Amazon CloudWatch Dashboards | Central visual console. |
| CloudWatch Metrics & Alarms | Infrastructure metrics (EC2, RDS, Lambda) plus custom application KPIs (key performance indicators), e.g., “Prescription Success Rate”. |
| CloudWatch Logs Insights | Log-based error counts and patterns. |
| AWS X-Ray & ServiceLens | End-to-end traces across API Gateway → Lambda → EC2 → RDS (the managed relational database). |
| CloudWatch Agent & RDS Enhanced Monitoring | Extra OS and database telemetry. |
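
A custom KPI such as “Prescription Success Rate” has to be published by the application before a dashboard can chart it. The following is a minimal sketch of one way to do that, assuming a hypothetical namespace `MediHealth/Clinical`, a hypothetical `Service` dimension and made-up counts; the case study does not state how the metric was actually emitted.

```python
from datetime import datetime, timezone

NAMESPACE = "MediHealth/Clinical"  # hypothetical custom namespace


def prescription_success_datum(succeeded: int, attempted: int, service: str) -> dict:
    """Build one MetricData entry expressing the success rate as a percentage."""
    if attempted <= 0:
        raise ValueError("attempted must be positive")
    return {
        "MetricName": "PrescriptionSuccessRate",
        "Dimensions": [{"Name": "Service", "Value": service}],
        "Timestamp": datetime.now(timezone.utc),
        "Value": 100.0 * succeeded / attempted,
        "Unit": "Percent",
    }


def publish(datum: dict) -> None:
    """Send the datum to CloudWatch (requires boto3 and credentials)."""
    import boto3

    boto3.client("cloudwatch").put_metric_data(Namespace=NAMESPACE, MetricData=[datum])


# Illustrative counts only: 485 successes out of 500 attempts.
datum = prescription_success_datum(succeeded=485, attempted=500, service="e-prescription")
print(datum["Value"])  # 97.0
```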
Technical Implementation
- Inventory & Prioritise Signals
  - CPU, memory, DB connections, EHR read/write request latency, medication-order throughput, 4xx/5xx error rates.
- Enable Deep Telemetry
  - Installed CloudWatch Agent on EC2; turned on RDS Enhanced Monitoring; activated Lambda active tracing.
- Expand AWS X-Ray Coverage
  - Added SDK instrumentation to the web tier; shipped traces for each transaction through EC2, Lambda and RDS.
- Design the Dashboard (sample widgets)
  - Infrastructure Health: EC2 CPU/Memory, RDS connections, queue depth.
  - Clinical Workflow KPIs: “New-Encounter latency vs throughput”, “Prescription success %”.
  - Error Monitors: Log-derived ERROR spikes, 5xx rates, Lambda exceptions.
  - X-Ray Service Map: Latency heat-spots across microservices.
  - Alarm Annotations: Red/green overlay with anomaly bands.
- Access & Sharing
  - Read-only links for Clinical Support; default CloudWatch landing page for DevOps; 24×7 wall-board in the ops area.
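
The anomaly bands mentioned under Alarm Annotations can be backed by a CloudWatch anomaly-detection alarm. The sketch below assembles the arguments for `put_metric_alarm` using the `ANOMALY_DETECTION_BAND` metric-math function; the alarm name, the RDS instance identifier, the band width of 2 and the three evaluation periods are assumptions chosen for illustration, not settings taken from the project.

```python
def anomaly_alarm_kwargs(namespace: str, metric_name: str,
                         dimensions: list, band_width: int = 2) -> dict:
    """Build put_metric_alarm arguments for an anomaly-detection alarm."""
    return {
        "AlarmName": f"{metric_name}-anomaly",
        # Fire when the metric leaves the expected band in either direction.
        "ComparisonOperator": "LessThanLowerOrGreaterThanUpperThreshold",
        "EvaluationPeriods": 3,
        "ThresholdMetricId": "band",
        "TreatMissingData": "notBreaching",
        "Metrics": [
            {
                "Id": "m1",
                "ReturnData": True,
                "MetricStat": {
                    "Metric": {
                        "Namespace": namespace,
                        "MetricName": metric_name,
                        "Dimensions": dimensions,
                    },
                    "Period": 60,
                    "Stat": "Average",
                },
            },
            {
                "Id": "band",
                "Expression": f"ANOMALY_DETECTION_BAND(m1, {band_width})",
                "Label": "expected range",
                "ReturnData": True,
            },
        ],
    }


def create_alarm(kwargs: dict) -> None:
    """Create the alarm (requires boto3 and credentials)."""
    import boto3

    boto3.client("cloudwatch").put_metric_alarm(**kwargs)


# Hypothetical RDS instance identifier.
kwargs = anomaly_alarm_kwargs(
    "AWS/RDS", "DatabaseConnections",
    [{"Name": "DBInstanceIdentifier", "Value": "medihealth-ehr-db"}],
)
print(kwargs["Metrics"][1]["Expression"])  # ANOMALY_DETECTION_BAND(m1, 2)
```

A wider band (a larger second argument) tolerates more deviation before the alarm fires, which may suit metrics with predictable morning spikes.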
Measurable Outcomes
| KPI | Before | After |
| --- | --- | --- |
| Mean-time-to-detect (MTTD) | 10-15 min | Seconds |
| Mean-time-to-resolve (MTTR) | ~60 min | Often <15 min |
| Major incident frequency | Several per month | “Noticeably lower” |
| Failed prescription renewals during peak | Frequent spikes | Minimal |
| Team bridge calls | Tool-switching | Single shared screen |
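
For readers who want to track the same KPIs, the sketch below shows one common way to compute MTTD and MTTR from incident timestamps. The incident records are invented purely to exercise the arithmetic and are not MediHealth data, and measuring MTTR from incident start (rather than from detection) is an assumption; teams define it both ways.

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident records: (started, detected, resolved).
incidents = [
    (datetime(2024, 1, 8, 9, 0, 0), datetime(2024, 1, 8, 9, 0, 30), datetime(2024, 1, 8, 9, 12, 0)),
    (datetime(2024, 1, 15, 8, 45, 0), datetime(2024, 1, 15, 8, 45, 50), datetime(2024, 1, 15, 8, 59, 0)),
]


def mean_minutes(deltas) -> float:
    """Average a sequence of timedeltas and return the result in minutes."""
    return mean(d.total_seconds() for d in deltas) / 60


# MTTD: average gap between incident start and detection.
mttd = mean_minutes(detected - started for started, detected, _ in incidents)
# MTTR: average gap between incident start and resolution (assumed definition).
mttr = mean_minutes(resolved - started for started, _, resolved in incidents)

print(f"MTTD: {mttd:.2f} min")  # MTTD: 0.67 min
print(f"MTTR: {mttr:.1f} min")  # MTTR: 13.0 min
```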
Key Learnings
- Single-pane observability slashes MTTR.
- Complete telemetry is a prerequisite for an effective dashboard.
- Shared dashboards drive collaboration across teams.
- Continually refine widgets and alarms as the architecture evolves.
Future Outlook
- Extend the dashboard to new microservices and event-streaming components.
- Integrate real-user monitoring (CloudWatch RUM) into the dashboard.
- Automate capacity decisions with Application Auto Scaling driven by dashboard metrics.
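
One possible shape for the auto-scaling item is a target-tracking policy on Lambda provisioned concurrency, which Application Auto Scaling supports. This is only a sketch: the function name, alias and 70% utilisation target are hypothetical, and the case study does not say which resources would be scaled.

```python
def provisioned_concurrency_policy(function_name: str, alias: str,
                                   target_utilisation: float = 0.7) -> dict:
    """Build put_scaling_policy arguments for Lambda provisioned concurrency."""
    if not 0 < target_utilisation < 1:
        raise ValueError("target_utilisation must be between 0 and 1")
    return {
        "PolicyName": f"{function_name}-pc-target-tracking",
        "ServiceNamespace": "lambda",
        "ResourceId": f"function:{function_name}:{alias}",
        "ScalableDimension": "lambda:function:ProvisionedConcurrency",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": target_utilisation,
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "LambdaProvisionedConcurrencyUtilization"
            },
        },
    }


def apply_policy(policy: dict) -> None:
    """Attach the policy (requires boto3, credentials and a registered scalable target)."""
    import boto3

    boto3.client("application-autoscaling").put_scaling_policy(**policy)


# Hypothetical function name and alias.
policy = provisioned_concurrency_policy("prescription-renewal", "live")
print(policy["ResourceId"])  # function:prescription-renewal:live
```

The scalable target must be registered with `register_scalable_target` before the policy is attached; the utilisation metric the policy tracks is a CloudWatch metric, so it can be charted on the same dashboard.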



