Challenge 3: SLOs: Not Just Another Acronym
Introduction
In the previous challenge, you dove deep into Movie Guru’s reliability landscape, discovering a young app with room to grow. You learned that the company currently lacks a robust way to measure and define user experience, relying instead on the unsustainable goal of constant uptime.
Armed with the insights gained from exploring the app, collaborating with stakeholders, and understanding the system’s design, challenges, and user feedback, it’s time to take a crucial step: defining SLIs and SLOs for User Journeys. If you need a refresher on SLIs or SLOs, see the Learning Resources.
Description
Make guesses for this exercise whenever you don’t have real information to go on.
- Choose Your Journeys: Select two key user journeys for Movie Guru.
- Choose Your SLIs: Which SLIs would tell you whether your application is healthy, as experienced by the user?
- Craft Your SLOs: Define relevant SLOs for each chosen user journey using the SLIs identified above.
- Consider what aspects of reliability matter most to users in each journey and how you can measure success.
- See Learning Resources for an example.
Success Criteria
- You have selected SLIs for your app.
- You have crafted 2 SLOs for the Movie Guru app. Each SLO includes the following components, demonstrating a comprehensive understanding of how to define and measure service level objectives:
  - Objective: A clear statement of the reliability target for a specific user journey or feature. The value has to have a good business reason behind it.
  - Time window: The period over which the SLI is measured (e.g., 30-day rolling window).
  - Service Level Indicator (SLI): A metric used to assess the service’s performance against the objective (e.g., availability, latency, quality, throughput, timeliness). Make your best guess here.
Learning Resources
What are SLIs?
Service Level Indicators (SLIs) are specific measurements that show how well a service is performing. They help teams understand if they are meeting their goals for reliability and quality. For example, one SLI might measure how often a website is available to users, while another could track how quickly the website responds to requests. An SLI can also look at the number of errors or failures compared to total requests. These indicators are important because they help teams see where they can improve their services.
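As a rough illustration of the ratio-style SLIs described above, the Python sketch below computes an availability SLI and a latency SLI from a handful of request records. The `Request` record, its field names, the sample values, and the 500 ms threshold are all assumptions made for this example; they are not taken from Movie Guru.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    """One hypothetical request record (field names are assumptions)."""
    ok: bool           # True if the request succeeded from the user's point of view
    latency_ms: float  # time taken to answer the request, in milliseconds


def availability_sli(requests: list[Request]) -> float:
    """Fraction of requests that succeeded (good events / total events)."""
    if not requests:
        raise ValueError("no requests recorded, so the SLI is undefined")
    return sum(r.ok for r in requests) / len(requests)


def latency_sli(requests: list[Request], threshold_ms: float) -> float:
    """Fraction of requests answered within threshold_ms."""
    if not requests:
        raise ValueError("no requests recorded, so the SLI is undefined")
    return sum(r.latency_ms <= threshold_ms for r in requests) / len(requests)


# Made-up sample data, for illustration only.
sample = [
    Request(ok=True, latency_ms=120),
    Request(ok=True, latency_ms=480),
    Request(ok=False, latency_ms=2500),
    Request(ok=True, latency_ms=900),
]

print(f"availability: {availability_sli(sample):.2%}")         # 3 of 4 succeeded
print(f"latency (<= 500 ms): {latency_sli(sample, 500):.2%}")  # 2 of 4 were fast enough
```

Both functions express the SLI as good events divided by total events, which keeps every indicator on the same 0 to 100% scale and makes it straightforward to compare against a target.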
What are SLOs?
In Google’s SRE framework, Service Level Objectives (SLOs) are target values or ranges for a service’s reliability, as measured by Service Level Indicators (SLIs). SLOs help you define the desired user experience and guide decisions about reliability investments. They act as a key communication tool between technical teams and business stakeholders, ensuring everyone is aligned on what level of reliability is acceptable and achievable. Crucially, SLOs should be based on what matters most to users, not arbitrary targets like 100% uptime.
Example SLO:
For the user journey of “adding a product to an online shopping cart”, a possible SLO could be:
- 99.9% of “Add to Cart” requests should be successful within 2 seconds, measured over a 30-day rolling window.

This SLO focuses on the key user action (“Add to Cart”) and sets targets for both availability (99.9% success rate) and latency (2-second response time). It’s directly tied to user experience, ensuring a smooth and efficient shopping experience.

The addition of “measured over a 30-day rolling window” specifies the timeframe for evaluating the SLO. This means that the success rate and response time are calculated based on data collected over the past 30 days. This rolling window provides a continuous and up-to-date assessment of the SLO’s performance.
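To make the example SLO concrete, here is a minimal Python sketch of how it could be evaluated. It counts a request as good only if it succeeded and finished within 2 seconds, looks only at events from the last 30 days, and compares the resulting ratio with the 99.9% target. The `AddToCartEvent` record, the `evaluate_slo` function, and the generated sample data are hypothetical; a real system would typically read these values from its monitoring backend.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

TARGET = 0.999               # 99.9% of requests must be good
LATENCY_LIMIT_S = 2.0        # "within 2 seconds"
WINDOW = timedelta(days=30)  # 30-day rolling window


@dataclass(frozen=True)
class AddToCartEvent:
    """One hypothetical "Add to Cart" request (field names are assumptions)."""
    timestamp: datetime
    ok: bool          # did the request succeed?
    latency_s: float  # how long it took, in seconds


def evaluate_slo(events: list[AddToCartEvent], now: datetime) -> dict:
    """Evaluate the example SLO over the 30 days ending at `now`."""
    window_start = now - WINDOW
    in_window = [e for e in events if window_start < e.timestamp <= now]
    if not in_window:
        raise ValueError("no events in the window, so the SLI is undefined")
    # A request is "good" only if it succeeded AND was fast enough.
    good = sum(e.ok and e.latency_s <= LATENCY_LIMIT_S for e in in_window)
    sli = good / len(in_window)
    return {"total": len(in_window), "good": good, "sli": sli, "met": sli >= TARGET}


# Made-up data: 9,995 good requests, 5 slow ones, and 1 old failure
# that falls outside the rolling window and is therefore ignored.
now = datetime(2024, 6, 30)
events = (
    [AddToCartEvent(now - timedelta(days=1), True, 0.4)] * 9995
    + [AddToCartEvent(now - timedelta(days=2), True, 3.5)] * 5
    + [AddToCartEvent(now - timedelta(days=45), False, 0.3)]
)

result = evaluate_slo(events, now)
print(f"SLI: {result['sli']:.4%}, SLO met: {result['met']}")
```

Because the window rolls, re-running the evaluation with a later `now` drops the oldest events automatically, so a bad day stops counting against the SLO once it is more than 30 days old.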