🎯 Crack Your Next DevOps / SRE Interview

Daily tips, mock interviews & community-driven prep resources.
Follow for free content. Book a session for focused prep.

📸 Follow @master.devops ▶️ YouTube Shorts

🚀 DevOps & SRE Interview Quick Reference

Complete revision notes for all 18 tools — built by @master.devops. Click any card for key concepts, commands, and interview Q&A.
Follow @master.devops on Instagram for daily tips & new resources.

Free DevOps & SRE Interview Reference Guide

DevOps Interview Kit is a free, comprehensive reference built for engineers preparing for DevOps, SRE, Platform Engineering, and Cloud Engineering interviews. Every section is written from real interview experience rather than copied from documentation. It covers 18 tools, each with key concepts, important commands, architecture patterns, and curated interview Q&As with full answers.

Whether you're targeting a Senior SRE role, a Platform Engineer position, or a DevOps Engineer role at a product-based company, this guide covers the tools and concepts that consistently appear in technical interviews across all levels. Built and maintained by @master.devops — a DevOps education community sharing real production knowledge, free for everyone.

Tools & Topics Covered

Each tool has its own full reference page. Below is a summary of what each section covers.

☸️

Kubernetes

Pod lifecycle, Deployments vs StatefulSets vs DaemonSets, Services and Ingress, RBAC and ServiceAccounts, HPA, liveness vs readiness probes, NetworkPolicy, PersistentVolumes, and common troubleshooting scenarios including CrashLoopBackOff and OOMKilled.

🐳

Docker

Dockerfile best practices, multi-stage builds, minimal base images (alpine, distroless), layer caching, CMD vs ENTRYPOINT, Docker networking (bridge, host, overlay), named volumes vs bind mounts, security hardening with non-root users and read-only filesystems.

☁️

AWS

EC2 instance types and placement groups, VPC design with subnets and NAT, IAM roles and policies, S3 storage classes and lifecycle rules, EKS cluster architecture, Lambda functions and event sources, RDS Multi-AZ vs Read Replicas, CloudWatch and Auto Scaling.

🔷

Azure

AKS architecture, Azure DevOps pipelines vs GitHub Actions, App Service plans, Virtual Networks and NSGs, Azure Key Vault and Managed Identity, Azure Load Balancer vs Application Gateway, Azure Monitor and Log Analytics, Entra ID (formerly Azure AD) and RBAC.

🏔️

Terraform

Providers, resources, data sources, and variables. Remote state with S3 and DynamoDB locking. Modules for reusability. Workspaces for environment isolation. Importing existing infrastructure. The plan/apply/destroy lifecycle. State manipulation commands, and taint (deprecated in favor of terraform apply -replace).

🔄

ArgoCD

GitOps principles and ArgoCD architecture. Application CRD, sync policies (auto vs manual), self-healing, and pruning. App-of-apps pattern for multi-cluster management. Image Updater for automated tag promotion. Handling secrets in GitOps pipelines with External Secrets Operator.

⚙️

GitHub Actions

Workflow YAML structure, triggers (push, pull_request, schedule, workflow_dispatch), jobs and steps, secrets and OIDC for passwordless cloud auth, actions/cache for speed, matrix builds for parallel testing, reusable workflows, and concurrency controls.

🏗️

Jenkins

Declarative vs Scripted Pipelines, Jenkinsfile structure, agents and node labels, shared libraries for DRY pipelines, credential binding, upstream/downstream jobs, Blue Ocean UI, parallel stages, and integration with SonarQube, Nexus, and Kubernetes agents.

⛵

Helm

Chart structure (Chart.yaml, values.yaml, templates), Go templating syntax, values override hierarchy (--set vs -f vs defaults), helm upgrade --install for idempotent CI deploys, rollback with helm rollback to a revision listed by helm history, lifecycle hooks for database migrations, and OCI chart repositories.

📊

Prometheus & Grafana

Prometheus data model (metrics, labels, timestamps), scrape configurations, PromQL query language, recording rules for expensive queries, Alertmanager routing and inhibition, Grafana dashboards and data sources, and the four golden signals: latency, traffic, errors, saturation.

🌿

Git

Branching strategies (GitFlow vs trunk-based development), rebase vs merge and when to use each, interactive rebase for squashing commits, git reflog as a safety net, cherry-pick, bisect for bug hunting, and branch protection rules for team workflows.

🐧

Linux

File permissions and ownership (chmod, chown, umask), process management (ps, kill signals, systemctl), networking tools (ss, ip, dig, curl), log analysis with journalctl, cron jobs and scheduling, performance analysis with top/vmstat/iostat, and shell scripting fundamentals.

🔍

SonarQube

SAST analysis, quality gates and quality profiles, bugs vs vulnerabilities vs code smells vs security hotspots, test coverage integration with JaCoCo and pytest-cov, Jenkins and GitHub Actions integration, branch analysis, and technical debt measurement.

🛡️

OWASP / DevSecOps

OWASP Top 10 with mitigations, shift-left security principles, SAST vs DAST, complete DevSecOps pipeline stages: Gitleaks for secret scanning, SonarQube for SAST, OWASP Dependency-Check for SCA, Trivy for image scanning, and OWASP ZAP for DAST.

📦

JFrog Artifactory

Binary repository management, repository types (local, remote, virtual), artifact promotion from snapshot to release, retention policies, integration with Maven, Docker, npm and pip, Xray scanning for CVEs, and build information for full traceability.

🔭

Splunk

SPL (Search Processing Language) for log analysis, Universal and Heavy Forwarders for log collection, index management, dashboards and visualizations, saved searches and scheduled alerts, field extractions, and Splunk as a SIEM for security event correlation.

🎯

SRE Concepts

SLI, SLO, SLA definitions and relationships, error budget calculation and usage, burn rate alerting, MTTR and MTBF, blameless post-mortem structure, toil definition and reduction strategies, deployment strategies (blue-green, canary, rolling), and incident management workflows.

🏗️

Maven & Build Tools

Maven build lifecycle phases, dependency scopes (compile, runtime, provided, test), BOM (Bill of Materials) for version management, multi-module projects, SNAPSHOT vs RELEASE artifact promotion, plugin configuration, and Gradle comparison for DevOps interviews.

Sample Interview Questions & Answers

The following are representative questions from the reference guide, shown in full to illustrate the depth of coverage.

Kubernetes — Liveness vs Readiness Probes

Q: What is the difference between liveness and readiness probes, and what happens when each fails?

A: A liveness probe answers: "Is this container still alive?" If it fails (failureThreshold consecutive times), the kubelet restarts the container. It protects against deadlocks and hung processes that are running but not making progress. A readiness probe answers: "Is this container ready to receive traffic?" If it fails, the pod is removed from the Service's endpoints, so no traffic is routed to it, but the container is not restarted. Use readiness probes for slow-starting apps or apps that need time to warm up caches. A startup probe, when configured, disables the liveness and readiness checks until it succeeds, giving slow-starting applications time to initialise before the liveness probe takes over; if it never succeeds within failureThreshold × periodSeconds, the container is killed and handled according to its restartPolicy. Misconfigured probes are a common source of production incidents in Kubernetes: a liveness probe that checks a downstream dependency, for example, can restart healthy containers across the fleet when only that dependency is down.
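The key difference is what happens after the failure threshold is reached. The toy model below is a simplification, not the kubelet's implementation: it assumes one probe result per periodSeconds tick and an action after failureThreshold consecutive failures (default 3), and it ignores initialDelaySeconds, timeoutSeconds, successThreshold, and restart backoff. The function name is hypothetical.

```python
def probe_action_ticks(results, failure_threshold=3, reset_on_action=False):
    """Return the probe periods (0-based) at which the failure action fires.

    results: one boolean per probe period (True = probe passed).
    failure_threshold: consecutive failures needed, like the probe's failureThreshold.
    reset_on_action: True models liveness (a restarted container starts a fresh count);
                     False models readiness (the pod stays out of endpoints until a success).
    """
    fired = []
    failures = 0
    for tick, ok in enumerate(results):
        failures = 0 if ok else failures + 1
        if failures == failure_threshold:
            fired.append(tick)
            if reset_on_action:
                failures = 0
    return fired


if __name__ == "__main__":
    # One success, six failures, then a success.
    results = [True] + [False] * 6 + [True]
    print("liveness restarts at:", probe_action_ticks(results, reset_on_action=True))  # [3, 6]
    print("readiness removal at:", probe_action_ticks(results))  # [3]
```

With the same failing signal, the liveness path restarts the container every third failed period in this model, while the readiness path pulls the pod out of rotation once and leaves the process running.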

Terraform — Remote State and Locking

Q: Why do you need remote state in Terraform and how does state locking work?

A: Terraform state (terraform.tfstate) is the source of truth that maps your configuration to real infrastructure. By default it is stored locally, which breaks team workflows: two engineers running apply at the same time can corrupt state or make conflicting changes. Remote state in S3 solves the storage problem; a DynamoDB lock table solves the concurrency problem. When any Terraform operation that can write state begins, it writes a lock record to the table. If another operation tries to start, it finds the lock and fails immediately, or keeps retrying for up to the duration given with -lock-timeout. This prevents race conditions. Terraform 1.10 and later can also keep the lock as a file in the S3 bucket itself (use_lockfile), and recent releases deprecate the DynamoDB option. Important: state files often contain sensitive values in plaintext, so restrict S3 bucket access with strict IAM policies, enable server-side encryption, and enable versioning so you can recover from accidental state corruption.
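Underneath, state locking is an atomic "create only if absent" write. The sketch below models that idea in memory, with a threading.Lock standing in for the atomicity a conditional write gives you in a real lock table (conceptually, a put that succeeds only when no row with that LockID exists). The LockTable class and its methods are hypothetical, not Terraform's backend code.

```python
import threading
import uuid


class LockTable:
    """Hypothetical in-memory lock table: at most one row per state path, keyed by LockID."""

    def __init__(self):
        self._rows = {}
        # Makes "check then insert" atomic, the job a conditional write does in a real table.
        self._mutex = threading.Lock()

    def acquire(self, lock_id, who):
        """Insert a lock row only if none exists; otherwise fail fast, like a second apply."""
        with self._mutex:
            holder = self._rows.get(lock_id)
            if holder is not None:
                raise RuntimeError(f"state locked by {holder['who']} (lock ID {holder['token']})")
            token = str(uuid.uuid4())
            self._rows[lock_id] = {"who": who, "token": token}
            return token

    def release(self, lock_id, token):
        """Delete the row, but only for the caller that holds it."""
        with self._mutex:
            holder = self._rows.get(lock_id)
            if holder is None or holder["token"] != token:
                raise RuntimeError("lock is not held by this caller")
            del self._rows[lock_id]


if __name__ == "__main__":
    table = LockTable()
    state_path = "example-bucket/prod/terraform.tfstate"  # hypothetical bucket and key
    alice = table.acquire(state_path, "alice")
    try:
        table.acquire(state_path, "bob")  # Bob's apply starts while Alice's is running
    except RuntimeError as err:
        print("bob:", err)
    table.release(state_path, alice)
    print("bob after release:", table.acquire(state_path, "bob") != alice)  # True
```

The per-holder token matters: only the process that took the lock should remove it. Terraform's escape hatch for a stale lock left behind by a crashed run is terraform force-unlock LOCK_ID.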

SRE — SLO, Error Budgets, and Deployment Decisions

Q: What is an error budget and how does it influence deployment decisions?

A: An error budget is the allowed failure quota derived from your SLO. For a 99.9% SLO, the error budget is 0.1%, approximately 43.8 minutes of downtime in an average month. When your error budget is healthy, your team has room to take risk: ship features faster, experiment with larger deploys, try new technologies. When the budget is exhausted, or burning fast enough that it soon will be, you freeze non-critical feature work and direct effort toward reliability improvements. This framework is powerful because it gives both Dev and Ops a shared quantitative language for risk: instead of arguing about whether to deploy, you check the budget. Example: "We used 60% of our error budget this month on the payment service rollout, so we're pausing the next feature release until the budget recovers."
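The 43.8-minute figure is (1 − SLO) × the minutes in an average month (365.25 / 12 ≈ 30.44 days, or 43,830 minutes); a 30-day window gives 43.2 minutes instead. A minimal sketch of that arithmetic, plus a request-based "budget remaining" figure and the burn-rate ratio used for burn rate alerting (observed error ratio divided by the budgeted error ratio). Function names are illustrative.

```python
MINUTES_PER_AVG_MONTH = 365.25 / 12 * 24 * 60  # 43,830 minutes


def error_budget_minutes(slo, window_minutes=MINUTES_PER_AVG_MONTH):
    """Allowed full-downtime minutes in the window: (1 - SLO) x window length."""
    return (1 - slo) * window_minutes


def budget_remaining(slo, bad_events, total_events):
    """Fraction of a request-based error budget left (negative means overspent)."""
    allowed_bad = (1 - slo) * total_events
    return 1 - bad_events / allowed_bad


def burn_rate(observed_error_ratio, slo):
    """Budget consumption speed: 1.0 spends exactly the whole budget over the SLO window."""
    return observed_error_ratio / (1 - slo)


if __name__ == "__main__":
    print(round(error_budget_minutes(0.999), 1))                # 43.8 (average month)
    print(round(error_budget_minutes(0.999, 30 * 24 * 60), 1))  # 43.2 (30-day window)
    print(round(budget_remaining(0.999, 600, 1_000_000), 2))    # 0.4: 60% of the budget used
    print(round(burn_rate(0.005, 0.999), 1))                    # 5.0
```

A burn rate of 5 on a 30-day window would exhaust the whole budget in about six days, which is the kind of signal a burn rate alert is designed to page on long before the budget hits zero.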

Docker — CMD vs ENTRYPOINT

Q: What is the difference between CMD and ENTRYPOINT in a Dockerfile, and why does it matter in Kubernetes?

A: ENTRYPOINT defines the executable that always runs when the container starts. CMD provides default arguments to that executable and can be overridden at runtime by passing arguments to docker run (overriding ENTRYPOINT itself requires --entrypoint). When both are set and ENTRYPOINT uses exec form, CMD is appended to ENTRYPOINT; a shell-form ENTRYPOINT ignores CMD. In Kubernetes, a container's command field overrides ENTRYPOINT and its args field overrides CMD. The critical production detail: always use exec form (["java", "-jar", "app.jar"]), not shell form (java -jar app.jar). Shell form wraps your process in /bin/sh -c, which typically makes the shell PID 1 and your app its child. When Kubernetes sends SIGTERM for graceful shutdown, the signal goes to the shell, not your app, so the app never sees it and the pod waits out the full terminationGracePeriodSeconds before being force-killed with SIGKILL. This slows rolling updates and can drop in-flight requests.
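These combination rules are easy to mix up under pressure, so here they are encoded as small functions. This is a sketch that follows the documented Docker ENTRYPOINT/CMD interaction and the Kubernetes command/args override rules; it models exec form as a Python list and shell form as a string, and the function names are hypothetical.

```python
from typing import List, Optional, Union

Instruction = Union[List[str], str, None]  # list = exec form, str = shell form, None = unset


def to_argv(instr: Instruction) -> List[str]:
    """Shell form is wrapped in /bin/sh -c, which is how the shell can end up as PID 1."""
    if instr is None:
        return []
    if isinstance(instr, str):
        return ["/bin/sh", "-c", instr]
    return list(instr)


def image_argv(entrypoint: Instruction, cmd: Instruction) -> List[str]:
    """What `docker run IMAGE` executes with no overrides."""
    if entrypoint is None:
        return to_argv(cmd)
    if isinstance(entrypoint, str):
        return to_argv(entrypoint)  # shell-form ENTRYPOINT ignores CMD entirely
    return to_argv(entrypoint) + to_argv(cmd)  # exec-form ENTRYPOINT: CMD is appended


def pod_argv(entrypoint: Instruction, cmd: Instruction,
             command: Optional[List[str]] = None,
             args: Optional[List[str]] = None) -> List[str]:
    """Kubernetes: `command` replaces ENTRYPOINT, `args` replaces CMD."""
    if command is None and args is None:
        return image_argv(entrypoint, cmd)
    if command is not None:
        return list(command) + list(args or [])  # image ENTRYPOINT and CMD are both ignored
    return to_argv(entrypoint) + list(args)  # args only: keep the image ENTRYPOINT


if __name__ == "__main__":
    # Shell form: typically the shell, not java, is PID 1 and receives SIGTERM.
    print(image_argv(None, "java -jar app.jar"))
    # args in the pod spec replace CMD but keep the exec-form ENTRYPOINT.
    print(pod_argv(["java", "-jar", "app.jar"], ["--port=8080"], args=["--port=9090"]))
```

Reading the first element of the resulting list is a quick way to answer "what will be PID 1?" in an interview: if it is /bin/sh, graceful shutdown depends on the shell forwarding signals.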

GitOps — ArgoCD CI vs CD Separation

Q: How do CI and CD responsibilities split in a GitOps workflow with ArgoCD?

A: In GitOps, CI and CD are explicitly separated. CI (GitHub Actions, Jenkins) handles: running tests, building the Docker image, pushing to a registry, and updating the image tag in the GitOps repository (typically via a commit to a values.yaml or Kustomize overlay). ArgoCD then detects that Git change and handles the actual Kubernetes deployment — applying manifests, managing rollouts, and reporting sync status. This separation means your CI pipeline never needs kubectl access or cloud credentials for deployment. Rollback is a git revert, not a pipeline re-run. Audit history is Git history. This clean boundary is what interviewers are testing when they ask about GitOps architecture.
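The CI half of that handoff is usually tiny: rewrite one image tag and commit it. A hedged sketch of the rewrite step, assuming a Helm-style values.yaml whose first indented tag: key is the image tag; the function and the ghcr.io/example/app repository are hypothetical, and real pipelines more often use a YAML-aware tool such as yq or kustomize edit set image instead of a regex.

```python
import re

# Matches an indented `tag:` line (for example under `image:`); assumes a simple layout.
TAG_LINE = re.compile(r"^(?P<indent>[ \t]+)tag:.*$", re.MULTILINE)


def bump_image_tag(values_yaml: str, new_tag: str) -> str:
    """Return values_yaml with the first indented `tag:` value replaced by new_tag."""
    updated, count = TAG_LINE.subn(
        lambda m: f'{m.group("indent")}tag: "{new_tag}"', values_yaml, count=1
    )
    if count != 1:
        raise ValueError("no indented `tag:` key found; refusing to commit an unchanged file")
    return updated


if __name__ == "__main__":
    before = 'image:\n  repository: ghcr.io/example/app\n  tag: "1.4.2"\nreplicaCount: 2\n'
    print(bump_image_tag(before, "1.5.0"))
    # CI would then commit and push this change; ArgoCD sees the new commit and syncs.
```

Failing loudly when no tag is found keeps CI from pushing a no-op commit that looks like a successful release.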

Security — DevSecOps Pipeline Stages

Q: Walk me through a complete DevSecOps pipeline and which security tool you'd use at each stage.

A: A complete shift-left security pipeline has a tool at every stage:
▸ Pre-commit: Gitleaks or TruffleHog to catch secrets before they reach the repo.
▸ SAST: SonarQube or Semgrep to analyze source code for vulnerabilities without running it.
▸ SCA (Software Composition Analysis): OWASP Dependency-Check or Snyk to scan third-party libraries for known CVEs.
▸ Image scanning: Trivy or Grype to scan Docker image layers, failing the build on HIGH/CRITICAL CVEs.
▸ DAST: OWASP ZAP to test the running application with real HTTP requests, finding runtime and configuration issues that SAST can't detect.
▸ Runtime: Falco for real-time anomaly detection in production containers.
The key interview phrase: "shift-left" means finding vulnerabilities earlier in the pipeline, where they're cheaper and faster to fix.
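Trivy can enforce the image-scanning gate on its own (trivy image --severity HIGH,CRITICAL --exit-code 1 IMAGE). The sketch below spells out the same decision over a JSON report, assuming the shape of Trivy's JSON output: a Results list whose entries carry a Vulnerabilities list with VulnerabilityID and Severity fields. The report in the test is made up, with placeholder CVE IDs.

```python
import json
import sys

BLOCKING = {"HIGH", "CRITICAL"}


def blocking_findings(report: dict) -> list:
    """Collect (target, vulnerability ID, severity) for every HIGH/CRITICAL finding."""
    findings = []
    for result in report.get("Results") or []:
        # Vulnerabilities may be missing or null when a target has no findings.
        for vuln in result.get("Vulnerabilities") or []:
            if vuln.get("Severity") in BLOCKING:
                findings.append((result.get("Target"), vuln.get("VulnerabilityID"), vuln["Severity"]))
    return findings


def gate(report_json: str) -> int:
    """Return a process exit code: 1 fails the pipeline stage, 0 lets it continue."""
    findings = blocking_findings(json.loads(report_json))
    for target, vuln_id, severity in findings:
        print(f"{severity}: {vuln_id} in {target}", file=sys.stderr)
    return 1 if findings else 0
```

In a CI step you would pass the return value to sys.exit() so that any HIGH or CRITICAL finding stops the pipeline before the image is pushed or deployed.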

How to Use This Guide

▸ Click any tool card in the nav bar to open the full reference for that tool
▸ Each tool page has: key concepts, important commands, architecture notes, and interview Q&As with full answers
▸ Click any question to expand the answer; try answering it yourself before you expand it
▸ Use the Revision Tracker to check off tools as you complete them
▸ Check the "How to Answer" page for the universal interview answer framework
▸ Progress is saved in your browser's localStorage — no account needed

About master.devops

This site is run by master.devops — a faceless DevOps education brand helping engineers master the skills and interview techniques needed for DevOps, SRE, and cloud engineering roles.

Content is written by the master.devops community — engineers with real production experience in DevOps, SRE, and cloud infrastructure.

Follow for daily interview tips, real commands, and career advice: @master.devops on Instagram · YouTube Shorts · LinkedIn

Revision Tracker

Check off tools as you complete revision.

💡 How to Answer Any Interview Question

The universal framework for DevOps/SRE interviews — especially security and availability questions.

The 4-Step Answer Formula

1. State the risk/problem — "The main concern with X is..."
2. Name the mechanism — "We handle this using Y feature/tool"
3. Real example — "In production, we handle this by..."
4. Trade-off — "The downside is cost/complexity, balanced by..."

When you blank: Say "Let me think through this from first principles..." — buys 5 seconds and signals structured thinking.

Security Answer Template

"To secure X, I focus on three layers: authentication/authorization, data in transit and at rest, and audit logging. For Kubernetes: RBAC for auth, TLS/mTLS for in-cluster traffic, encryption at rest for Secrets in etcd, and audit logs enabled. Secrets go through Vault, never into plain manifests."

Availability Answer Template

"Availability depends on eliminating single points of failure + fast recovery. For Kubernetes: multi-AZ node pools, PodDisruptionBudgets, readiness probes, HPA. For databases: read replicas + automated failover. RTO/RPO targets drive which mechanisms I pick."

Deployment Strategy Cheat Sheet

Strategy	How it works	Pros	Cons
Rolling	Replace pods gradually, N at a time	Zero downtime, low cost	Mixed versions temporarily
Blue/Green	Two identical envs, instant LB switch	Instant rollback	Doubles infra cost
Canary	Route 5%→25%→100% to new version	Minimal blast radius	Complex, slower
Recreate	Stop all, deploy new	Simple	Has downtime
Feature flags	Deploy disabled, enable per user	Decouple deploy from release	Code complexity
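Canary's "minimal blast radius" comes from gating each traffic step on health. A minimal sketch of that control loop, assuming a hypothetical error_rate_at(weight) callback that reports the new version's error rate at a given traffic percentage and an illustrative 1% abort threshold; progressive-delivery controllers (Argo Rollouts and Flagger are two examples) automate this against real metrics.

```python
def run_canary(error_rate_at, steps=(5, 25, 100), max_error_rate=0.01):
    """Walk the canary through traffic weights, aborting at the first unhealthy step.

    error_rate_at: hypothetical callback returning the new version's error rate at a weight (%).
    Returns ("promoted", final_weight) or ("rolled back", weight_that_failed).
    """
    for weight in steps:
        # A real controller would shift load-balancer or mesh weights here, then wait
        # for an analysis window before reading metrics.
        if error_rate_at(weight) > max_error_rate:
            return "rolled back", weight  # send all traffic back to the stable version
    return "promoted", steps[-1]


if __name__ == "__main__":
    def healthy(weight):
        return 0.002

    def breaks_under_load(weight):
        return 0.0 if weight < 25 else 0.04

    print(run_canary(healthy))            # ('promoted', 100)
    print(run_canary(breaks_under_load))  # ('rolled back', 25)
```

The second run shows the trade-off in the table: a bug that only appears under real load is caught at 25% of traffic instead of 100%, at the cost of a slower, more complex rollout.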

SLI / SLO / SLA Quick Reference

Term	What it is	Example
SLI	The metric you measure	% requests returning 200
SLO	Your internal target	99.9% success rate
SLA	Contract with customers	99.5% guaranteed (with penalties)
Error Budget	1 - SLO	0.1% = 43.8 min/month downtime allowed
