Improving Observability with Amazon CloudWatch Dashboards

Client Context

“MediHealth Systems” is a fast-growing SaaS provider that powers Electronic Health-Record (EHR), E-Prescription and Tele-Consultation services for 350+ clinics across the United Kingdom.

  • Handles ~3 million API calls/day, with spikes during morning clinic hours.
  • The entire stack runs on AWS (EC2, RDS PostgreSQL, Lambda, API Gateway, AWS HealthLake, X-Ray).
  • Three core teams (DevOps, Application Engineering and Clinical Support) share a strict 99.99% SLA and HIPAA compliance duties.
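To put the 99.99% SLA in perspective, the arithmetic below converts it into a downtime allowance. This is a back-of-the-envelope sketch: the 30-day window is an assumption (the source does not say how the SLA is measured), and the function name is illustrative.

```python
def error_budget_minutes(sla_percent: float, days: int = 30) -> float:
    """Minutes of downtime permitted by an availability SLA over `days` days."""
    total_minutes = days * 24 * 60
    return total_minutes * (100.0 - sla_percent) / 100.0


# Assumption: the SLA is evaluated over a 30-day window.
budget = error_budget_minutes(99.99)
print(f"Downtime budget: {budget:.2f} minutes per 30 days")

# Average request rate implied by ~3 million API calls per day
# (morning peaks will be higher than this average).
average_rps = 3_000_000 / (24 * 60 * 60)
print(f"Average load: {average_rps:.1f} requests/second")
```

Under that assumption the budget comes to a little over four minutes per month, so a single incident that takes an hour to resolve would exceed it many times over.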

The Challenge

  • Metrics, logs and traces were scattered across separate consoles.
  • Real-time visibility was minimal; the teams learned of problems only from alarms or user complaints.
  • Root-cause analysis took ~1 hour and involved a lot of tab-switching.
  • Errors often went unnoticed.

Solution Overview

Deploy a unified Amazon CloudWatch Dashboard that surfaces metrics, logs and traces side-by-side for every critical microservice. The dashboard:

  • Auto-refreshes every 30 seconds for near real-time insight.
  • Embeds CloudWatch Logs Insights queries, CloudWatch Metrics graphs and AWS X-Ray ServiceLens widgets.
  • Displays key alarms and anomaly-detection bands so that potential incidents stand out immediately.
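A dashboard of this kind is defined by a JSON body made up of widgets. The following is a minimal sketch of what such a body might look like, with one metric widget carrying an anomaly-detection band, one Logs Insights widget and one alarm widget. The region, instance ID, log group name, alarm ARN and dashboard name are all hypothetical placeholders, not values from the MediHealth deployment.

```python
import json

REGION = "eu-west-2"  # assumption: London region for a UK-based provider


def metric_widget_with_band(title, namespace, metric, dim_name, dim_value, x, y):
    """Time-series metric widget with an anomaly-detection band (2 std devs)."""
    return {
        "type": "metric",
        "x": x, "y": y, "width": 12, "height": 6,
        "properties": {
            "title": title,
            "region": REGION,
            "view": "timeSeries",
            "period": 60,
            "stat": "Average",
            "metrics": [
                [namespace, metric, dim_name, dim_value, {"id": "m1"}],
                [{"expression": "ANOMALY_DETECTION_BAND(m1, 2)",
                  "id": "band", "label": "Expected range"}],
            ],
        },
    }


def log_widget(title, log_group, x, y):
    """Logs Insights widget counting ERROR lines in 5-minute bins."""
    query = (
        f"SOURCE '{log_group}' | fields @timestamp, @message "
        "| filter @message like /ERROR/ "
        "| stats count() as errors by bin(5m)"
    )
    return {
        "type": "log",
        "x": x, "y": y, "width": 12, "height": 6,
        "properties": {"title": title, "region": REGION,
                       "query": query, "view": "timeSeries"},
    }


def alarm_widget(title, alarm_arns, x, y):
    """Alarm status widget listing the given alarm ARNs."""
    return {
        "type": "alarm",
        "x": x, "y": y, "width": 24, "height": 3,
        "properties": {"title": title, "alarms": alarm_arns},
    }


widgets = [
    # Hypothetical instance ID, log group and alarm ARN, for illustration only.
    metric_widget_with_band("Web tier CPU", "AWS/EC2", "CPUUtilization",
                            "InstanceId", "i-0123456789abcdef0", x=0, y=0),
    log_widget("Prescription errors", "/aws/lambda/create-prescription", x=12, y=0),
    alarm_widget("Key alarms",
                 ["arn:aws:cloudwatch:eu-west-2:123456789012:alarm:api-5xx-high"],
                 x=0, y=6),
]
dashboard_body = json.dumps({"widgets": widgets}, indent=2)
print(dashboard_body)

# With boto3 installed and credentials configured, the body could be published with:
#   boto3.client("cloudwatch").put_dashboard(
#       DashboardName="medihealth-overview",  # hypothetical name
#       DashboardBody=dashboard_body)
```

Keeping the body in code rather than editing it in the console makes it easier to review widget changes and to stamp out the same layout for each microservice.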

Core Components

Component | Description
--- | ---
Amazon CloudWatch Dashboards | Central visual console.
CloudWatch Metrics & Alarms | Infrastructure metrics (EC2, RDS, Lambda) plus custom application key performance indicators (KPIs), e.g., “Prescription Success Rate”.
CloudWatch Logs Insights | Log-based error counts and patterns.
AWS X-Ray & ServiceLens | End-to-end traces across API Gateway → Lambda → EC2 → RDS (the managed relational database).
CloudWatch Agent & RDS Enhanced Monitoring | Extra OS and database telemetry.
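A custom KPI such as “Prescription Success Rate” has to be published by the application before a dashboard can chart it. The sketch below builds one data point in the shape that CloudWatch's `put_metric_data` call accepts. The metric name, namespace, dimension and the counts used are made up for illustration; the source does not describe how MediHealth computes this KPI.

```python
from datetime import datetime, timezone


def prescription_success_datum(succeeded: int, attempted: int, service: str) -> dict:
    """Build one MetricData entry expressing success rate as a percentage."""
    if attempted <= 0:
        raise ValueError("attempted must be positive")
    rate = 100.0 * succeeded / attempted
    return {
        "MetricName": "PrescriptionSuccessRate",  # hypothetical metric name
        "Dimensions": [{"Name": "Service", "Value": service}],
        "Timestamp": datetime.now(timezone.utc),
        "Value": rate,
        "Unit": "Percent",
    }


# Illustrative counts only: 970 successes out of 1000 attempts.
datum = prescription_success_datum(succeeded=970, attempted=1000,
                                   service="e-prescription")
print(datum["MetricName"], datum["Value"], datum["Unit"])

# With boto3 installed and credentials configured, the datum could be sent with:
#   boto3.client("cloudwatch").put_metric_data(
#       Namespace="MediHealth/Clinical",  # hypothetical namespace
#       MetricData=[datum])
```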

Technical Implementation

  1. Inventory & Prioritise Signals
    • CPU, memory, DB connections, EHR read/write request latency, medication-order throughput, 4xx/5xx error rates.
  2. Enable Deep Telemetry
    • Installed CloudWatch Agent on EC2; turned on RDS Enhanced Monitoring; activated Lambda active tracing.
  3. Expand AWS X-Ray Coverage
    • Added SDK instrumentation to the web tier; shipped traces for each transaction through EC2, Lambda and RDS.
  4. Design the Dashboard (sample widgets)
    • Infrastructure Health: EC2 CPU/Memory, RDS connections, queue depth.
    • Clinical Workflow KPIs: “New-Encounter latency vs throughput”, “Prescription success %”.
    • Error Monitors: Log-derived ERROR spikes, 5xx rates, Lambda exceptions.
    • X-Ray Service Map: Latency heat-spots across microservices.
    • Alarm Annotations: Red/green overlay with anomaly bands.
  5. Access & Sharing
    • Read-only links for Clinical Support; default CloudWatch landing page for DevOps; 24×7 wall-board in the ops area.
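The “5xx rates” error monitor from step 4 can be expressed with CloudWatch metric math, dividing the error count by the request count. The widget below is a sketch under the assumption that the API is an API Gateway REST API reporting the standard `Count` and `5XXError` metrics under the `ApiName` dimension; the API name and region are hypothetical.

```python
def error_rate_widget(api_name: str, region: str = "eu-west-2") -> dict:
    """Metric widget plotting 5xx responses as a percentage of all requests."""
    return {
        "type": "metric",
        "x": 0, "y": 0, "width": 12, "height": 6,
        "properties": {
            "title": f"{api_name} 5xx error rate (%)",
            "region": region,
            "view": "timeSeries",
            "period": 60,
            "stat": "Sum",
            "metrics": [
                # Only the derived expression is drawn on the graph.
                [{"expression": "100 * errors / requests",
                  "id": "rate", "label": "5xx rate (%)"}],
                # Raw inputs are hidden but referenced by id in the expression.
                ["AWS/ApiGateway", "Count", "ApiName", api_name,
                 {"id": "requests", "visible": False}],
                ["AWS/ApiGateway", "5XXError", "ApiName", api_name,
                 {"id": "errors", "visible": False}],
            ],
            "yAxis": {"left": {"min": 0}},
        },
    }


def error_rate_percent(errors: int, requests: int) -> float:
    """Same arithmetic as the widget expression, for checking by hand."""
    return 100.0 * errors / requests if requests else 0.0


widget = error_rate_widget("prescriptions-api")  # hypothetical API name
print(widget["properties"]["metrics"][0][0]["expression"])
```

Plotting a rate rather than a raw error count keeps the graph comparable between quiet periods and the morning peak.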

Measurable Outcomes

KPI | Before | After
--- | --- | ---
Mean-time-to-detect (MTTD) | 10-15 min | Seconds
Mean-time-to-resolve (MTTR) | ~60 min | Often <15 min
Major incident frequency | Several per month | “Noticeably lower”
Failed prescription renewals during peak | Frequent spikes | Minimal
Team bridge calls | Tool-switching | Single shared screen

Key Learnings

  • Single-pane observability slashes MTTR.
  • Complete telemetry is a prerequisite for an effective dashboard.
  • Shared dashboards drive collaboration across teams.
  • Continually refine widgets and alarms as the architecture evolves.

Future Outlook

  • Extend the dashboard to new microservices and event-streaming components.
  • Integrate real-user monitoring (CloudWatch RUM) into the dashboard.
  • Automate capacity decisions with Application Auto Scaling driven by dashboard metrics.
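As one possible shape for the auto-scaling item above, the sketch below builds the arguments for an Application Auto Scaling target-tracking policy on Lambda provisioned concurrency. The choice of Lambda as the scaled resource, the function name, the alias and the 70% target are all assumptions made for illustration; the source does not say which resources would be scaled.

```python
def target_tracking_policy(function_name: str, alias: str,
                           target_utilization: float = 0.7) -> dict:
    """Arguments for a target-tracking policy on Lambda provisioned concurrency."""
    if not 0.0 < target_utilization <= 1.0:
        raise ValueError("target_utilization must be in (0, 1]")
    return {
        "PolicyName": f"{function_name}-provisioned-concurrency-tracking",
        "ServiceNamespace": "lambda",
        "ResourceId": f"function:{function_name}:{alias}",
        "ScalableDimension": "lambda:function:ProvisionedConcurrency",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            # Keep provisioned-concurrency utilization near the target value.
            "TargetValue": target_utilization,
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "LambdaProvisionedConcurrencyUtilization"
            },
        },
    }


# Hypothetical function name and alias.
policy = target_tracking_policy("create-prescription", "live")
print(policy["ResourceId"])

# With boto3 installed, and after the alias has been registered as a scalable
# target via register_scalable_target, the policy could be applied with:
#   boto3.client("application-autoscaling").put_scaling_policy(**policy)
```

Target tracking is used here because it needs only a single target value; step scaling or scheduled scaling ahead of morning clinic hours would be alternatives.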

Stakeholder Feedback

“AWS X-Ray lets us trace a single ‘Create-Prescription’ request hop-by-hop through 34 micro-services. A deep dive into that segment showed a mis-configured exponential back-off hammering DynamoDB. We rolled a one-line config change, redeployed, and latency snapped back under 200 ms in 17 minutes.”
