SRE / Platform Engineer

Intermediate

Implement advanced observability, practice chaos engineering, and build internal platform tools.

Topics

1

Advanced Observability (Prometheus, Grafana & OpenTelemetry)

  • What are the four core metric types in Prometheus, and when would you use each one?
  • What are the key principles for designing effective Grafana dashboards that teams actually use during incidents?
  • What is OpenTelemetry and how does it unify the three pillars of observability? Why has it become the industry standard?
  • How would you write PromQL queries to calculate and monitor a latency-based SLO, such as "99% of requests complete within 300ms"?
  • How do you implement distributed tracing across a microservices architecture using OpenTelemetry? Walk through the key decisions and trade-offs.
  • How would you design and implement a Grafana dashboard that monitors the four golden signals for a critical production service, including appropriate alerting?
  • Explain how you would design an OpenTelemetry Collector pipeline for a production Kubernetes cluster. What receivers, processors, and exporters would you configure and why?
  • Your organization processes millions of requests per second across hundreds of microservices, and your Prometheus instance is hitting performance limits. How would you architect a scalable observability platform?
  • How do you build a unified observability experience that correlates metrics, traces, and logs? Describe the technical implementation including exemplars, trace-to-log correlation, and unified querying.
  • Describe how you would design an observability-driven incident response workflow where alerting, dashboards, traces, and runbooks work together to minimize mean time to resolution.

2

Reliability Engineering Practices

  • What is an error budget, and how does it create alignment between development and operations teams?
  • How do feature flags improve release safety, and what are the key considerations when implementing them?
  • What is the difference between canary deployments, blue-green deployments, and rolling deployments? When would you choose each one?
  • How do you implement error budget policies in practice? Walk through what happens when a team is burning through their error budget too quickly and how you escalate responses.
  • How would you implement a progressive rollout strategy for a critical payment service that processes thousands of transactions per minute?
  • What does a mature release engineering pipeline look like for an SRE team? Describe the stages from code commit to production and the safety mechanisms at each stage.
  • How do you use SLOs to drive engineering prioritization decisions? Give examples of how error budget data influences what a team works on.
  • How do you manage reliability across a system of interdependent microservices where the failure of one service cascades to others? Describe the architectural patterns and operational practices you would implement.
  • Your organization is struggling with a tension between development teams who want to deploy multiple times daily and SRE concerns about production stability. How do you design a system and process that safely enables high deployment frequency?
  • How do you implement progressive rollouts across a multi-region deployment, and what strategies do you use to prevent a bad release from impacting all regions simultaneously?

3

Capacity Planning & Performance

  • What are the different types of load testing, and when would you use each one in a production environment?
  • What is the difference between horizontal and vertical scaling, and what factors influence which approach to choose?
  • What are the key resource metrics you monitor for capacity planning, and what utilization targets do you aim for?
  • How would you design and execute a comprehensive load testing strategy for a service that is about to handle a major product launch expected to triple normal traffic?
  • How do you configure and tune Kubernetes Horizontal Pod Autoscaler for a latency-sensitive API service? What metrics do you use and what are the common tuning challenges?
  • Walk through your systematic approach to identifying and resolving performance bottlenecks in a distributed system where response times have gradually increased over the past month.
  • How do you build a resource forecasting model that predicts when your infrastructure will need additional capacity? What data inputs and methods do you use?
  • Your primary PostgreSQL database is approaching its capacity limits with growing read and write traffic. Design a comprehensive scaling strategy that addresses both immediate relief and long-term growth.
  • Design an auto-scaling architecture for a system with highly variable traffic patterns, including predictable daily peaks, unpredictable viral spikes, and cost optimization requirements. How do you handle the tension between responsiveness and cost?
  • How do you embed performance engineering into the software development lifecycle so that performance issues are caught early rather than discovered in production? Describe the technical systems and processes you would implement.

4

Chaos Engineering

  • What is chaos engineering, and how does it differ from traditional testing approaches like load testing or fault injection?
  • What is Chaos Monkey, and what other chaos engineering tools are commonly used in Kubernetes environments?
  • What is a steady state hypothesis in chaos engineering, and how do you define one for a practical experiment?
  • How do you plan and execute a game day exercise for your team? Walk through the preparation, execution, and follow-up process.
  • What are the most valuable failure injection patterns for testing microservice resilience, and how do you implement each one in a Kubernetes environment?
  • How do you ensure safety when running chaos experiments in production? What safeguards and controls should be in place?
  • How do you integrate chaos experiments into the CI/CD pipeline to continuously validate resilience as part of the deployment process?
  • How would you build and scale a chaos engineering program across an organization with 50 microservices owned by 10 different teams? What are the technical and organizational challenges?
  • Design a comprehensive chaos experiment to test your system's resilience to a complete availability zone failure. What is your hypothesis, what do you inject, and how do you measure success?
  • How do chaos engineering experiments improve your observability posture, and how does better observability enable more sophisticated chaos experiments? Describe this feedback loop with specific examples.

5

Platform Engineering & Internal Tools

  • What is platform engineering, and how does it differ from traditional infrastructure or DevOps teams?
  • What are golden paths in platform engineering, and why are they important for developer productivity?
  • What does self-service infrastructure mean, and what are the key components needed to implement it safely?
  • How would you implement a developer portal using Backstage to serve as the front door to your internal developer platform? What components would you configure and what integrations are most valuable?
  • How do you measure the success of an internal developer platform? What metrics and feedback mechanisms indicate that the platform is actually improving developer productivity?
  • How do you design the right level of abstraction for your internal developer platform? What are the trade-offs between giving developers full control versus providing highly opinionated abstractions?
  • How should a platform team operate to effectively serve its internal developer customers? Describe the team structure, rituals, and practices that make a platform team successful.
  • How do you architect an internal developer platform that supports multiple teams with different requirements while maintaining security isolation, fair resource allocation, and operational consistency?
  • You are tasked with migrating 30 existing services onto a new internal developer platform without disrupting ongoing development. How do you plan and execute this migration?
  • How do you systematically identify and eliminate the biggest friction points in the developer experience across your organization, from local development to production debugging?


Quick Stats

  • Total Questions: 50
  • Topics: 5
  • Difficulty: Intermediate