Question 1

What are the four core metric types in Prometheus, and when would you use each one?

Accepted Answer

Prometheus defines four core metric types, each designed for different measurement scenarios. Understanding when to use each one is essential for building meaningful observability into your services.

A Counter is a cumulative metric that only ever increases, except when it resets to zero on a process restart. It suits values that accumulate over time, such as the total number of HTTP requests served, errors encountered, or bytes processed. Never use a counter for a value that can decrease. When querying counters, you typically use the rate() or increase() functions in PromQL to calculate how

Question 2

What are the key principles for designing effective Grafana dashboards that teams actually use during incidents?

Accepted Answer

Effective Grafana dashboards need to serve as operational tools, not vanity displays. The difference between a dashboard that gets used during incidents and one that gets ignored comes down to a few key design principles.

Start with a clear hierarchy and purpose. Every dashboard should answer a specific question. A service overview dashboard answers "is my service healthy right now?" while a debugging dashboard answers "where exactly is the problem?" Mixing these concerns creates cluttered dashboards that serve neither purpose well. Organize dashboards into layers: a top-level overview for

Question 3

What is an error budget, and how does it create alignment between development and operations teams?

Accepted Answer

An error budget is the maximum amount of unreliability that a service is allowed over a defined time window, derived directly from the service level objective (SLO). If your SLO states that a service should be available 99.9 percent of the time over a 30-day rolling window, then your error budget is the remaining 0.1 percent. A 30-day window contains 43,200 minutes, so that works out to approximately 43 minutes of allowable downtime per window.
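The arithmetic behind that figure can be sketched in a few lines of Python. This is a minimal illustration only, assuming a purely time-based availability SLO; the function names error_budget_minutes and budget_remaining_minutes are hypothetical and not part of any library.

```python
def error_budget_minutes(slo_percent: float, window_days: int) -> float:
    """Return the allowed downtime, in minutes, for an availability SLO."""
    window_minutes = window_days * 24 * 60              # 30 days -> 43,200 minutes
    allowed_failure_fraction = (100.0 - slo_percent) / 100.0
    return window_minutes * allowed_failure_fraction


def budget_remaining_minutes(
    slo_percent: float, window_days: int, downtime_minutes: float
) -> float:
    """Return the unspent budget; a negative value means the SLO was missed."""
    return error_budget_minutes(slo_percent, window_days) - downtime_minutes


if __name__ == "__main__":
    budget = error_budget_minutes(99.9, 30)
    print(f"Budget for 99.9% over 30 days: {budget:.1f} minutes")
    print(f"Left after a 20 minute outage: {budget_remaining_minutes(99.9, 30, 20):.1f} minutes")
```

A request-based SLO would follow the same pattern, with total requests in place of window minutes.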

The power of the error budget lies in what it makes explicit. Without an error budget, reliability is treated as an absolute: every outage is bad, and teams argue about whether to prioritize new fe

Question 4

How do feature flags improve release safety, and what are the key considerations when implementing them?

Accepted Answer

Feature flags, also called feature toggles, are a technique where new functionality is deployed to production but wrapped in a conditional check that controls whether it is active. This separates the act of deploying code from the act of releasing functionality to users, which fundamentally changes how teams manage risk.
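As a rough sketch of that conditional check, the Python below guards a new code path behind a flag. Everything here is an assumption made for illustration: the FeatureFlags class, the new_checkout_flow flag name, and the two checkout functions are hypothetical, and a real system would more likely read flag state from a configuration service than from an in-memory dictionary.

```python
class FeatureFlags:
    """Hypothetical in-memory flag store, used here only for illustration."""

    def __init__(self, flags=None):
        self._flags = dict(flags or {})

    def is_enabled(self, name: str) -> bool:
        # Unknown flags default to off, so deployed code stays dormant
        # until someone explicitly releases it.
        return self._flags.get(name, False)

    def set(self, name: str, enabled: bool) -> None:
        self._flags[name] = enabled


def legacy_checkout(total: float) -> str:
    return f"legacy:{total:.2f}"


def new_checkout(total: float) -> str:
    return f"new:{total:.2f}"


def checkout(flags: FeatureFlags, total: float) -> str:
    # Both code paths are deployed; the flag decides which one is released.
    if flags.is_enabled("new_checkout_flow"):
        return new_checkout(total)
    return legacy_checkout(total)


if __name__ == "__main__":
    flags = FeatureFlags()
    print(checkout(flags, 10))              # flag off: old behavior
    flags.set("new_checkout_flow", True)
    print(checkout(flags, 10))              # flag on: new behavior
```

Defaulting unknown flags to off is a design choice in this sketch: a missing or misspelled flag then falls back to the existing behavior rather than exposing unfinished functionality.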

The primary safety benefit is that feature flags enable instant rollback without redeployment. If a new feature causes problems in production, you flip the flag off and the old behavior is immediately restored. This is dramatically faster than rolling back a deployment, whi

Question 5

What are the different types of load testing, and when would you use each one in a production environment?

Accepted Answer

Load testing verifies how a system behaves under various levels of demand. There are several distinct types, each designed to answer different questions about your system's capacity and resilience.

Baseline load testing establishes the normal performance characteristics of your system under typical production traffic levels. You replay or simulate traffic patterns that match your average workday load and measure response times, throughput, error rates, and resource utilization. The results become your performance baseline, which you compare against when making changes. Run baseline tests re

Question 6

What is the difference between horizontal and vertical scaling, and what factors influence which approach to choose?

Accepted Answer

Scaling is the process of increasing a system's capacity to handle more load. The two fundamental approaches are vertical scaling, which means making individual machines bigger, and horizontal scaling, which means adding more machines. Each has distinct advantages and constraints.

Vertical scaling, or scaling up, involves increasing the resources of an existing instance: more CPU cores, more RAM, faster storage. If your database server has 16 GB of RAM and is hitting memory limits, you upgrade to 64 GB. Vertical scaling is operationally simple because the application architecture does not c

Question 7

What is chaos engineering, and how does it differ from traditional testing approaches like load testing or fault injection?

Accepted Answer

Chaos engineering is the discipline of experimenting on a system to build confidence in its ability to withstand turbulent conditions in production. It was pioneered by Netflix and formalized in the Principles of Chaos Engineering. While it shares goals with other testing approaches, its methodology and philosophy are distinct.

Traditional testing verifies that a system behaves correctly under known conditions. Unit tests check individual functions, integration tests check component interactions, and load tests check behavior under expected traffic volumes. These tests operate on the assump

Question 8

What is Chaos Monkey, and what other chaos engineering tools are commonly used in Kubernetes environments?

Accepted Answer

Chaos Monkey is the original chaos engineering tool, created by Netflix as part of their Simian Army suite. Its core function is simple but powerful: it randomly terminates virtual machine instances in production during business hours. The philosophy is that if your system cannot handle the loss of a single instance, which will eventually happen due to hardware failures, software bugs, or cloud provider issues, it is better to discover that during business hours when engineers are available to respond than at 3 AM on a weekend.
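The selection logic described above can be approximated in a few lines of Python. This is not Chaos Monkey's actual implementation, only a sketch of the idea: the business-hours boundaries, the in_business_hours and pick_victim names, and the instance identifiers are all assumptions, and the code only chooses a candidate rather than terminating anything.

```python
import random
from datetime import datetime

# Assumed business hours for this sketch: weekdays, 09:00 to 17:00 local time.
BUSINESS_START_HOUR = 9
BUSINESS_END_HOUR = 17


def in_business_hours(now: datetime) -> bool:
    """Return True on Monday through Friday between the assumed start and end hours."""
    is_weekday = now.weekday() < 5
    return is_weekday and BUSINESS_START_HOUR <= now.hour < BUSINESS_END_HOUR


def pick_victim(instances, now: datetime, rng=random):
    """Pick one instance at random, or None outside business hours or if none exist."""
    if not instances or not in_business_hours(now):
        return None
    return rng.choice(list(instances))


if __name__ == "__main__":
    fleet = ["instance-a", "instance-b", "instance-c"]  # hypothetical identifiers
    victim = pick_victim(fleet, datetime.now())
    print(victim if victim else "No termination scheduled right now")
```

Passing the random source as a parameter keeps the sketch testable with a seeded generator.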

Chaos Monkey forced Netflix's engineering teams to design servi

Question 9

What is platform engineering, and how does it differ from traditional infrastructure or DevOps teams?

Accepted Answer

Platform engineering is the discipline of designing and building toolchains and workflows that enable software engineering organizations to deliver value faster. Platform engineers create internal developer platforms that abstract infrastructure complexity and provide self-service capabilities, allowing development teams to provision, deploy, and operate their services independently without needing deep infrastructure expertise.

The key distinction from traditional infrastructure teams is the product mindset. Traditional infrastructure teams operate as service providers: development teams s

Question 10

What are golden paths in platform engineering, and why are they important for developer productivity?

Accepted Answer

Golden paths, sometimes called paved roads, are opinionated, well-supported default workflows for common engineering tasks. They represent the best way to accomplish something within your organization, pre-integrated with all the necessary tooling, security controls, and operational best practices. The term captures the idea that there is a smooth, well-lit path that developers can follow with confidence.

For example, a golden path for creating a new microservice might include: a service template that generates a repository with the standard project structure, Dockerfile, CI pipeline, Kuber

