Challenge 2: Yes, there are others


Introduction

Your second day as an SRE at The Movie Advisory Company started with a bang. The CEO, clearly fueled by an excessive amount of coffee, stormed into your workspace, ranting about Movie Guru’s unreliable performance. “Users are complaining that the site isn’t always reachable!” he yelled. “This is unacceptable! Movie Guru needs to be up 100% of the time!” He demanded a solution immediately. With a panicked look in his eyes, he pointed you towards the platform team (a single, overworked engineer) and the application team (known for their eccentric work habits).

Your challenge: figure out how to improve the app’s stability, manage the CEO’s expectations, and prevent a complete meltdown. Welcome to the world of SRE!

Description

  1. Initial Response to CEO: Analyze the CEO’s demands in the context of SRE principles. Do any of his demands clash with those principles? Discuss your analysis with your teammates and coach. You can also do a short role-play with one of you acting as the CEO.

    Note: The role-play should focus on articulating your reasoning and how it aligns with SRE principles, not on persuading the CEO to change his mind (this isn’t a communication or negotiation workshop).

  2. Information Gathering: You’re not alone in this quest for stability! To improve Movie Guru’s stability, you’ll need to collaborate with others. Identify the key stakeholders within the company and determine what information you need from each of them to achieve your reliability goals.

Success Criteria

To successfully complete this challenge, you should be able to demonstrate the following:

Initial Response:

  • Explained why 100% uptime is an unrealistic and potentially harmful goal.
  • Clearly articulated the relationship between reliability and cost.
  • Emphasized the importance of aligning reliability targets with user needs and business priorities.
  • [BONUS] Communicated the need to balance reliability investments with other factors like innovation.
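
To make the first two criteria concrete, it can help to translate an availability target into the downtime it permits. The following is a minimal sketch, not part of the challenge material: the function name `allowed_downtime_minutes`, the 30-day window, and the example targets are all assumptions chosen for illustration, not Movie Guru’s actual objectives.

```python
# Illustrative sketch: the window and the targets below are assumptions,
# not values defined by The Movie Advisory Company.
MINUTES_PER_30_DAYS = 30 * 24 * 60  # 43,200 minutes in a 30-day window


def allowed_downtime_minutes(availability_percent: float,
                             window_minutes: int = MINUTES_PER_30_DAYS) -> float:
    """Return the downtime (in minutes) that a target permits within the window."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    return window_minutes * (100 - availability_percent) / 100


if __name__ == "__main__":
    # Each extra "nine" cuts the permitted downtime by a factor of ten.
    for target in (99.0, 99.9, 99.99, 100.0):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% availability allows {minutes:.2f} minutes of downtime per 30 days")
```

Under these assumptions, a 99.9% target leaves roughly 43 minutes of downtime per 30 days, while a 100% target leaves none: no room for deployments, maintenance, or failures in dependencies outside the team’s control. That is the core of the argument to bring to the CEO.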

Information Gathering:

  • Identified key stakeholder teams within The Movie Advisory Company (including technical teams, product owners, and business stakeholders).
  • Explained the role of each stakeholder group in ensuring Movie Guru’s reliability.
  • Specified the information needed from each stakeholder group to assess the current state of reliability and plan for improvements.
  • Demonstrated an understanding of the importance of collaboration and communication in achieving reliability goals.

Learning Resources

  • Realistic Expectations on Reliability: It’s essential to communicate that 100% uptime is neither feasible nor necessary. Regularly reinforcing this helps align stakeholders with a balanced, achievable reliability strategy.

  • Stakeholder Engagement: Involve key technical and business stakeholders, such as the development, platform, and QA teams, as well as product managers. Each provides crucial insights into stability, user needs, and resource constraints.

  • Gathering Critical Information: Collect performance data, architecture diagrams, deployment processes, capacity planning information, and incident response details. This helps build a clear picture of the current system’s strengths and areas for improvement.

  • Balanced Reliability Goals: Align SLOs (service level objectives) with both user needs and practical limits. Doing so focuses effort on improvements that support long-term stability while leaving room for innovation, which fosters trust and a sustainable reliability model.
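
One common way SRE teams express the balance between reliability and innovation is an error budget: the share of requests an SLO allows to fail. The sketch below is an illustration under stated assumptions, not part of the challenge material. The `ErrorBudget` class, its field names, and the sample numbers are hypothetical; real figures would come from the information gathered from the stakeholder teams.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    """Hypothetical helper: tracks how much of an SLO's failure allowance is left."""
    slo_percent: float      # assumed target, e.g. 99.5 means 99.5% of requests succeed
    total_requests: int     # requests observed in the measurement window
    failed_requests: int    # requests that failed in the same window

    @property
    def allowed_failures(self) -> float:
        """Number of failed requests the SLO tolerates in this window."""
        return self.total_requests * (100 - self.slo_percent) / 100

    @property
    def remaining_fraction(self) -> float:
        """Fraction of the budget still unspent (negative if overspent)."""
        if self.allowed_failures == 0:
            # A 100% SLO has no budget: any failure exhausts it immediately.
            return 1.0 if self.failed_requests == 0 else 0.0
        return 1 - self.failed_requests / self.allowed_failures


if __name__ == "__main__":
    # Sample numbers are invented for illustration only.
    budget = ErrorBudget(slo_percent=99.5, total_requests=1_000_000, failed_requests=2_000)
    print(f"Allowed failures: {budget.allowed_failures:.0f}")
    print(f"Budget remaining: {budget.remaining_fraction:.0%}")
```

While budget remains, the team can afford riskier work such as feature launches; once it is spent, the usual response is to prioritize stability work. A 100% target leaves no budget at all, which is another way of showing why it blocks innovation.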
