Challenge 8: What’s really UP, doc?
Prerequisites
Reset the backend server
Note
With this command we’re priming the backend that generates metrics to behave in a specific way. This simulates your colleagues making some changes that might have broken/fixed a few things.

## Check if the METRICS_APP_ADDRESS env variable is set in your environment before you do this.
curl -X POST \
  -H "Content-Type: application/json" \
  -d '{
    "ChatSuccess": 0.95,
    "ChatSafetyIssue": 0.1,
    "ChatEngaged": 0.70,
    "ChatAcknowledged": 0.15,
    "ChatRejected": 0.05,
    "ChatUnclassified": 0.1,
    "ChatSPositive": 0.6,
    "ChatSNegative": 0.1,
    "ChatSNeutral": 0.2,
    "ChatSUnclassified": 0.1,
    "LoginSuccess": 0.999,
    "StartupSuccess": 0.95,
    "PrefUpdateSuccess": 0.99,
    "PrefGetSuccess": 0.999,
    "LoginLatencyMinMS": 10,
    "LoginLatencyMaxMS": 200,
    "ChatLatencyMinMS": 906,
    "ChatLatencyMaxMS": 4683,
    "StartupLatencyMinMS": 400,
    "StartupLatencyMaxMS": 1000,
    "PrefGetLatencyMinMS": 153,
    "PrefGetLatencyMaxMS": 348,
    "PrefUpdateLatencyMinMS": 363,
    "PrefUpdateLatencyMaxMS": 645
  }' \
  $METRICS_APP_ADDRESS/phase
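The reset command depends on METRICS_APP_ADDRESS being set. One way to check this before sending the request is a small guard function. This is a minimal sketch, not part of the challenge tooling: the function name `check_metrics_address` is hypothetical, and it only tests that the variable is non-empty, not that the address is reachable.

```shell
# Hypothetical helper: fail early if METRICS_APP_ADDRESS is unset or empty.
# It does not verify that the backend at that address is reachable.
check_metrics_address() {
  if [ -z "${METRICS_APP_ADDRESS:-}" ]; then
    echo "METRICS_APP_ADDRESS is not set" >&2
    return 1
  fi
  echo "Using METRICS_APP_ADDRESS=${METRICS_APP_ADDRESS}"
}
```

Running `check_metrics_address` on its own prints the address in use, or an error if the variable is missing. Chaining it with `&&` in front of the curl command stops the reset from being sent to an empty address.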
Introduction
The Calm Before the Storm: You settle in for another day of SRE serenity, casually monitoring the dashboards and basking in the glow of Movie Guru’s stable performance. Suddenly, your peaceful morning is shattered by a frantic colleague from customer support.
“Mayday! Mayday!” they exclaim, bursting into your cubicle. “Users are reporting that Movie Guru is acting up! They can’t seem to use the website properly!”
You and your colleagues decide to see what is wrong by navigating to the Movie Guru website. You notice that approximately 50% of your chat messages fail.
Look at the video below to understand what the user experience is like (the video is sped up):
Description
- Look at your SLO dashboards to spot issues (wait a few minutes before you check).
- Compare your observation (when visiting the website) with the data displayed on the dashboards. What discrepancies do you notice?
- Explain the reason for the difference between what users are reporting and what the dashboards are showing.
- How can you improve your monitoring to better reflect the actual user experience?
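To make the comparison in the steps above concrete, it can help to turn what you observe on the website into a number that can be set next to the dashboard's value. The sketch below assumes you count chat attempts and failures by hand. The function name `observed_success_rate` and the sample counts are illustrative, not part of Movie Guru.

```shell
# Hypothetical helper: print the observed success rate (0.00 to 1.00)
# from hand-counted attempts.
# Arguments: total attempts, failed attempts.
observed_success_rate() {
  total="$1"
  failed="$2"
  if [ "$total" -le 0 ]; then
    echo "total must be greater than 0" >&2
    return 1
  fi
  awk -v t="$total" -v f="$failed" 'BEGIN { printf "%.2f\n", (t - f) / t }'
}
```

For example, `observed_success_rate 20 10` prints `0.50`, which matches the roughly 50% failure rate seen on the website. Set that figure against the ChatSuccess value of 0.95 that the reset command primes in the backend.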
Success Criteria
- You have identified the monitoring gap.
- You have proposed solutions to improve monitoring.