Challenge 4: Let the monitoring begin

Introduction

The platform team introduces you to the app’s monitoring dashboards in the Google Cloud Console. These dashboards use metrics collected from the movie-guru backend service.

  • Login Dashboard: Tracks the health and efficiency of the user login process.
  • Main Page Load Dashboard / Startup Dashboard: Monitors the performance of the main page load that follows login.
  • Chat Dashboard: Provides a comprehensive view of user interactions with the chatbot, including engagement, sentiment, and response times.

Note: Metrics in the dashboards may appear blocky because we’re simulating load with only a few users. Achieving smoother graphs generally requires a larger user load.

Description

Make guesses for this exercise whenever you don’t have real information to go on.

  1. Browse existing dashboards
    • Find them at Google Cloud Monitoring > Dashboards > Custom Dashboards.
  2. Assess user experience
    • Based on the metrics and your own experience, describe how users likely perceive the app’s performance.
  3. Note down which aspects of the app need serious improvement.
  4. Choose your SLIs
    • SREs need to carefully document the metrics they use to construct their SLIs (refresher on what SLIs are).
    • Your challenge is to first define the SLIs (on paper) by examining the dashboards to identify relevant metrics.
    • Write them down in definition form as illustrated below.
      • Example Availability SLI: The availability of service abc is measured as the ratio of successful startups, recorded in metric x, to total startup attempts, recorded in metric y.
      • Example Latency SLI: The latency of service abc, measured by the metric x, is tracked as a histogram at the 10th, 50th, 90th, and 99th percentiles.
    • Tips:
      • Look at the Business Goals below to narrow down your search to just a few SLIs relevant for this and the upcoming exercises.
      • If you aren’t sure of the difference between an SLI and a metric, see “How do metrics differ from SLIs?” under Learning Resources.
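
The example availability SLI above can be sketched as a small calculation. This is only an illustration under stated assumptions: the names `successful_startups` and `total_attempts` and their values are hypothetical stand-ins for "metric x" and "metric y", not real metric names or data from the movie-guru dashboards. Treating zero traffic as fully available is also an assumed convention.

```python
def availability_sli(successful: int, total: int) -> float:
    """Return the fraction of attempts that succeeded, between 0.0 and 1.0."""
    if successful < 0 or total < 0 or successful > total:
        raise ValueError("counts must satisfy 0 <= successful <= total")
    if total == 0:
        # Assumption: with no traffic there were no failures, so report 1.0.
        return 1.0
    return successful / total


# Hypothetical counter values read off a dashboard for one time window.
successful_startups = 940   # stand-in for "metric x"
total_attempts = 1000       # stand-in for "metric y"

sli = availability_sli(successful_startups, total_attempts)
print(f"Availability SLI: {sli:.1%}")
```

Writing the SLI as a ratio of two named counters keeps the definition unambiguous: anyone reading it knows which metric is the numerator and which is the denominator.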

Business goals

  • Business goal 1: The main page should be accessible and load quickly for users.
  • Business goal 2: The chatbot should respond quickly to users and keep them engaged.

Success Criteria

  • You’ve chosen SLIs based on the metrics that are being collected from the server.
  • These SLIs can measure the health of the app based on the two business goals.

Learning Resources

How do metrics differ from SLIs?

Metrics and Service Level Indicators (SLIs) both provide valuable data about a system’s performance, but they serve distinct roles. Metrics are broad measurements that capture various aspects of system activity, such as CPU usage, latency, and error rates. They form the foundational data used to observe, monitor, and troubleshoot a system. SLIs, on the other hand, are carefully selected metrics that directly reflect the quality of service experienced by users. Focusing on factors like availability, latency, or error rate, SLIs gauge how well a service is meeting specific reliability targets known as Service Level Objectives (SLOs). While metrics provide a comprehensive view of system health, SLIs narrow the focus to measure the specific qualities that most affect user satisfaction, aligning system performance with business objectives.

Latency Metrics

  • These metrics (present on all dashboards) measure how long it takes for users to get a successful response from the server.
  • They provide insight into the speed and efficiency of a specific server process (e.g., login or chat).
  • Lower latency means faster responses (for example, faster logins), contributing to a better user experience.
  • For example, the Login Dashboard displays several percentiles of login latency (10th, 50th, 90th, 95th, 99th), giving you a comprehensive view of the login speed distribution.
  • Latency is also displayed as a line chart, allowing you to track changes over time and identify any performance degradation.
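
To make the percentile view above concrete, here is a minimal sketch that computes latency percentiles from raw samples using the nearest-rank method. The sample values and the `latencies_ms` name are hypothetical. The real dashboards most likely derive percentiles from histogram buckets rather than raw samples, so treat this as an illustration of the idea, not of how Cloud Monitoring computes them.

```python
import math


def percentile(samples: list[float], p: int) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in the range 1..100")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Hypothetical login latencies in milliseconds for one time window.
latencies_ms = [120, 135, 150, 160, 180, 210, 250, 400, 900, 2500]

for p in (10, 50, 90, 95, 99):
    print(f"p{p}: {percentile(latencies_ms, p)} ms")
```

In this made-up sample the median is modest while the 95th and 99th percentiles are dominated by a single slow request. That is why latency SLIs are usually stated at several percentiles rather than as an average.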
