Client Context

“MediHealth Systems” is a fast-growing SaaS provider that powers Electronic Health-Record (EHR), E-Prescription and Tele-Consultation services for 350+ clinics across the United Kingdom.

The platform handles ~3 million API calls per day, with spikes during morning clinic hours.
The entire stack runs on AWS (EC2, RDS PostgreSQL, Lambda, API Gateway, AWS HealthLake, X-Ray).
Three core teams (DevOps, Application Engineering and Clinical Support) share a strict 99.99% SLA and HIPAA compliance duties.

The Challenge

Metrics, logs and traces were scattered across separate consoles.
There was very little real-time visibility; the teams relied on alarms or user complaints to learn about problems.
Root-cause analysis took ~1 hour and involved a lot of tab-switching.
Errors often went unnoticed.

Solution Overview

Deploy a unified Amazon CloudWatch Dashboard that surfaces metrics, logs and traces side-by-side for every critical microservice. The dashboard:

Auto-refreshes every 30 seconds for near real-time insight.
Embeds CloudWatch Logs Insights queries, CloudWatch Metrics graphs and AWS X-Ray ServiceLens widgets.
Displays key alarms and anomaly-detection bands so that potential incidents stand out immediately.
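
As a rough illustration of how such a dashboard can be defined as code, the sketch below assembles a CloudWatch dashboard body with one metric widget, one Logs Insights widget and one alarm widget. The region, API name, log group and alarm ARN are hypothetical placeholders, not values from the MediHealth deployment. The auto-refresh interval is a console setting and is not part of the body.

```python
import json

REGION = "eu-west-2"  # assumption: a UK region; the case study does not name one


def metric_widget(title, metrics, x, y, stat="Sum", period=60):
    """Return a CloudWatch 'metric' widget definition."""
    return {
        "type": "metric",
        "x": x, "y": y, "width": 12, "height": 6,
        "properties": {
            "title": title,
            "region": REGION,
            "metrics": metrics,
            "stat": stat,
            "period": period,
            "view": "timeSeries",
        },
    }


def log_widget(title, log_group, query, x, y):
    """Return a 'log' widget that runs a Logs Insights query."""
    return {
        "type": "log",
        "x": x, "y": y, "width": 12, "height": 6,
        "properties": {
            "title": title,
            "region": REGION,
            "query": f"SOURCE '{log_group}' | {query}",
            "view": "table",
        },
    }


def alarm_widget(title, alarm_arns, x, y):
    """Return an 'alarm' widget showing the state of the given alarms."""
    return {
        "type": "alarm",
        "x": x, "y": y, "width": 24, "height": 3,
        "properties": {"title": title, "alarms": alarm_arns},
    }


def build_dashboard_body():
    """Assemble the widgets into the JSON string CloudWatch expects."""
    widgets = [
        # Hypothetical API name and log group, for illustration only.
        metric_widget(
            "API 5xx errors",
            [["AWS/ApiGateway", "5XXError", "ApiName", "prescriptions-api"]],
            x=0, y=0,
        ),
        log_widget(
            "Recent ERROR lines",
            "/aws/lambda/create-prescription",
            "fields @timestamp, @message | filter @message like /ERROR/ | limit 20",
            x=12, y=0,
        ),
        alarm_widget(
            "Key alarms",
            ["arn:aws:cloudwatch:eu-west-2:123456789012:alarm:prescription-5xx-high"],
            x=0, y=6,
        ),
    ]
    return json.dumps({"widgets": widgets})
```

The resulting string could then be passed as `DashboardBody` to the CloudWatch `put_dashboard` API (for example through boto3), which keeps the dashboard under version control alongside the rest of the infrastructure.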

Core Components

Component	Description
Amazon CloudWatch Dashboards	Central visual console.
CloudWatch Metrics & Alarms	Infrastructure metrics (EC2, RDS, Lambda) plus custom application KPIs (key performance indicators), e.g. “Prescription Success Rate”.
CloudWatch Logs Insights	Log-based error counts and patterns.
AWS X-Ray & ServiceLens	End-to-end traces across API Gateway → Lambda → EC2 → RDS (the managed relational database service).
CloudWatch Agent & RDS Enhanced Monitoring	Extra OS and database telemetry.
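
The case study does not say how the custom KPIs were published. One common option is the CloudWatch embedded metric format (EMF), where the application writes a structured JSON log line and CloudWatch extracts the metric from it. The sketch below builds such a record for a success-rate KPI; the namespace `MediHealth/Clinical`, the metric name `PrescriptionSuccessRate` and the `Service` dimension are assumptions made for illustration.

```python
import json
import time


def prescription_success_emf(successes, attempts, service="e-prescription",
                             timestamp_ms=None):
    """Build one EMF log record carrying a success-rate percentage."""
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    rate = 100.0 * successes / attempts
    if timestamp_ms is None:
        timestamp_ms = int(time.time() * 1000)
    return {
        "_aws": {
            "Timestamp": timestamp_ms,
            "CloudWatchMetrics": [{
                "Namespace": "MediHealth/Clinical",   # hypothetical namespace
                "Dimensions": [["Service"]],
                "Metrics": [{"Name": "PrescriptionSuccessRate",
                             "Unit": "Percent"}],
            }],
        },
        # Dimension value and metric value live at the top level of the record.
        "Service": service,
        "PrescriptionSuccessRate": rate,
    }


if __name__ == "__main__":
    # Writing the record to stdout as a single JSON line is how a Lambda
    # function would typically emit it.
    print(json.dumps(prescription_success_emf(985, 1000)))
```

Once ingested, the metric can be graphed and alarmed on like any built-in CloudWatch metric.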

Technical Implementation

Inventory & Prioritise Signals
- CPU, memory, DB connections, EHR read/write request latency, medication-order throughput, 4xx/5xx error rates.
Enable Deep Telemetry
- Installed CloudWatch Agent on EC2; turned on RDS Enhanced Monitoring; activated Lambda active tracing.
Expand AWS X-Ray Coverage
- Added SDK instrumentation to the web tier; shipped traces for each transaction through EC2, Lambda and RDS.
Design the Dashboard (sample widgets)
- Infrastructure Health: EC2 CPU/Memory, RDS connections, queue depth.
- Clinical Workflow KPIs: “New-Encounter latency vs throughput”, “Prescription success %”.
- Error Monitors: Log-derived ERROR spikes, 5xx rates, Lambda exceptions.
- X-Ray Service Map: Latency heat-spots across microservices.
- Alarm Annotations: Red/green overlay with anomaly bands.
Access & Sharing
- Read-only links for Clinical Support; default CloudWatch landing page for DevOps; 24×7 wall-board in the ops area.
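
Two of the widget types listed above lend themselves to a short sketch: a metric graph with an anomaly-detection band, and a Logs Insights query that counts ERROR lines per time bucket. The helper names, the default region and the RDS instance identifier in the usage example are hypothetical; the `ANOMALY_DETECTION_BAND` metric-math function and the query syntax are standard CloudWatch features.

```python
def anomaly_band_widget(namespace, metric, dim_name, dim_value,
                        std_devs=2, region="eu-west-2"):
    """Return a metric widget that overlays an anomaly-detection band."""
    return {
        "type": "metric",
        "x": 0, "y": 0, "width": 12, "height": 6,
        "properties": {
            "title": f"{metric} with anomaly band",
            "region": region,
            "view": "timeSeries",
            "stat": "Average",
            "period": 60,
            "metrics": [
                # The raw metric, given an id so the expression can refer to it.
                [namespace, metric, dim_name, dim_value, {"id": "m1"}],
                # Band of expected values, std_devs wide on either side.
                [{"expression": f"ANOMALY_DETECTION_BAND(m1, {std_devs})",
                  "id": "ad1", "label": "Expected range"}],
            ],
        },
    }


def error_spike_query(bin_minutes=5):
    """Return a Logs Insights query counting ERROR lines per time bucket."""
    return (
        "fields @timestamp, @message"
        " | filter @message like /ERROR/"
        f" | stats count(*) as errorCount by bin({bin_minutes}m)"
    )


if __name__ == "__main__":
    widget = anomaly_band_widget(
        "AWS/RDS", "DatabaseConnections",
        "DBInstanceIdentifier", "ehr-primary",  # hypothetical instance name
    )
    print(widget["properties"]["metrics"][1][0]["expression"])
    print(error_spike_query())
```

A wider band (a larger `std_devs`) produces fewer false positives during the morning clinic spike, at the cost of slower detection of genuine anomalies.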

Measurable Outcomes

KPI	Before	After
Mean-time-to-detect (MTTD)	10-15 min	Seconds
Mean-time-to-resolve (MTTR)	~60 min	Often <15 min
Major incident frequency	Several per month	“Noticeably lower”
Failed prescription renewals during peak	Frequent spikes	Minimal
Team bridge calls	Tool-switching	Single shared screen

Key Learnings

Single-pane observability slashes MTTR.
Complete telemetry is a prerequisite for an effective dashboard.
Shared dashboards drive collaboration across teams.
Continually refine widgets and alarms as the architecture evolves.

Future Outlook

Extend the dashboard to new microservices and event-streaming components.
Integrate real-user monitoring (CloudWatch RUM) into the dashboard.
Automate capacity decisions with Application Auto Scaling driven by dashboard metrics.
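
As a sketch of what the last item might look like, the helper below builds a target-tracking configuration for Application Auto Scaling that follows a custom CloudWatch metric. The metric name, namespace, dimension and target value are hypothetical; the case study does not specify which metrics or resources would be scaled.

```python
def target_tracking_config(metric_name, namespace, dimensions, target_value,
                           scale_out_cooldown=60, scale_in_cooldown=300):
    """Build a TargetTrackingScalingPolicyConfiguration for a custom metric."""
    return {
        "TargetValue": float(target_value),
        "CustomizedMetricSpecification": {
            "MetricName": metric_name,
            "Namespace": namespace,
            "Dimensions": [{"Name": name, "Value": value}
                           for name, value in dimensions.items()],
            "Statistic": "Average",
        },
        # Scale out quickly, scale in more cautiously.
        "ScaleOutCooldown": scale_out_cooldown,
        "ScaleInCooldown": scale_in_cooldown,
    }


if __name__ == "__main__":
    config = target_tracking_config(
        "PendingPrescriptionsPerWorker",      # hypothetical custom metric
        "MediHealth/Clinical",                # hypothetical namespace
        {"Service": "e-prescription"},
        target_value=10,
    )
    print(config)
```

This dictionary would be passed as `TargetTrackingScalingPolicyConfiguration` to the Application Auto Scaling `put_scaling_policy` API with `PolicyType="TargetTrackingScaling"`. Target tracking works best when the chosen metric falls as capacity is added, so a per-worker metric is usually a better fit than a raw total.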

Stakeholder Feedback

“Amazon X-Ray lets us trace a single ‘Create-Prescription’ request hop-by-hop through 34 micro-services. A deep dive into that segment showed a mis-configured exponential back-off hammering DynamoDB. We rolled a one-line config change, redeployed, and latency snapped back under 200 ms in 17 minutes.”

Improving Observability with Amazon CloudWatch Dashboards
