🚀 DevOps & SRE Interview Quick Reference
Complete revision notes for all 18 tools — built by @master.devops. Click any card for key concepts, commands, and interview Q&A.
Follow @master.devops on Instagram for daily tips & new resources.
Free DevOps & SRE Interview Reference Guide
DevOps Interview Kit is a free, comprehensive reference built for engineers preparing for DevOps, SRE, Platform Engineering, and Cloud Engineering interviews. Every section is written from real interview experience, not copied from documentation. It covers 18 tools with key concepts, important commands, architecture patterns, and curated interview Q&As with full answers.
Whether you're targeting a Senior SRE role, a Platform Engineer position, or a DevOps Engineer role at a product-based company, this guide covers the tools and concepts that consistently appear in technical interviews across all levels. Built and maintained by @master.devops — a DevOps education community sharing real production knowledge, free for everyone.
Tools & Topics Covered
Click any tool above to access the full reference. Below is a summary of what each section covers.
Kubernetes: Pod lifecycle, Deployments vs StatefulSets vs DaemonSets, Services and Ingress, RBAC and ServiceAccounts, HPA, liveness vs readiness probes, NetworkPolicy, PersistentVolumes, and common troubleshooting scenarios including CrashLoopBackOff and OOMKilled.
Docker: Dockerfile best practices, multi-stage builds, minimal base images (alpine, distroless), layer caching, CMD vs ENTRYPOINT, Docker networking (bridge, host, overlay), named volumes vs bind mounts, security hardening with non-root users and read-only filesystems.
AWS: EC2 instance types and placement groups, VPC design with subnets and NAT, IAM roles and policies, S3 storage classes and lifecycle rules, EKS cluster architecture, Lambda functions and event sources, RDS Multi-AZ vs Read Replicas, CloudWatch and Auto Scaling.
Azure: AKS architecture, Azure DevOps pipelines vs GitHub Actions, App Service plans, Virtual Networks and NSGs, Azure Key Vault and Managed Identity, Azure Load Balancer vs Application Gateway, Azure Monitor and Log Analytics, Entra ID (formerly Azure AD) and RBAC.
Terraform: Providers, resources, data sources, and variables. Remote state with S3 and DynamoDB locking. Modules for reusability. Workspaces for environment isolation. Import existing infrastructure. Terraform plan/apply/destroy lifecycle. State manipulation and taint commands.
ArgoCD: GitOps principles and ArgoCD architecture. Application CRD, sync policies (auto vs manual), self-healing, and pruning. App-of-apps pattern for multi-cluster management. Image Updater for automated tag promotion. Handling secrets in GitOps pipelines with External Secrets Operator.
GitHub Actions: Workflow YAML structure, triggers (push, pull_request, schedule, workflow_dispatch), jobs and steps, secrets and OIDC for passwordless cloud auth, actions/cache for speed, matrix builds for parallel testing, reusable workflows, and concurrency controls.
Jenkins: Declarative vs Scripted Pipelines, Jenkinsfile structure, agents and node labels, shared libraries for DRY pipelines, credential binding, upstream/downstream jobs, Blue Ocean UI, parallel stages, and integration with SonarQube, Nexus, and Kubernetes agents.
Helm: Chart structure (Chart.yaml, values.yaml, templates), Go templating syntax, values override hierarchy (--set vs -f vs defaults), helm upgrade --install for idempotent CI deploys, rollback with helm rollback after checking revisions in helm history, lifecycle hooks for database migrations, and OCI chart repositories.
Prometheus and Grafana: Prometheus data model (metrics, labels, timestamps), scrape configurations, PromQL query language, recording rules for expensive queries, Alertmanager routing and inhibition, Grafana dashboards and data sources, and the four golden signals: latency, traffic, errors, saturation.
Git: Branching strategies (GitFlow vs trunk-based development), rebase vs merge and when to use each, interactive rebase for squashing commits, git reflog as a safety net, cherry-pick, bisect for bug hunting, and branch protection rules for team workflows.
Linux: File permissions and ownership (chmod, chown, umask), process management (ps, kill signals, systemctl), networking tools (ss, ip, dig, curl), log analysis with journalctl, cron jobs and scheduling, performance analysis with top/vmstat/iostat, and shell scripting fundamentals.
SonarQube: Static analysis (SAST), quality gates and quality profiles, bugs vs vulnerabilities vs code smells vs security hotspots, test coverage integration with JaCoCo and pytest-cov, Jenkins and GitHub Actions integration, branch analysis, and technical debt measurement.
DevSecOps: OWASP Top 10 with mitigations, shift-left security principles, SAST vs DAST, complete DevSecOps pipeline stages: Gitleaks for secret scanning, SonarQube for SAST, OWASP Dependency-Check for SCA, Trivy for image scanning, and OWASP ZAP for DAST.
JFrog Artifactory: Binary repository management, repository types (local, remote, virtual), artifact promotion from snapshot to release, retention policies, integration with Maven, Docker, npm and pip, Xray scanning for CVEs, and build information for full traceability.
Splunk: SPL (Search Processing Language) for log analysis, Universal and Heavy Forwarders for log collection, index management, dashboards and visualizations, saved searches and scheduled alerts, field extractions, and Splunk as a SIEM for security event correlation.
SRE: SLI, SLO, SLA definitions and relationships, error budget calculation and usage, burn rate alerting, MTTR and MTBF, blameless post-mortem structure, toil definition and reduction strategies, deployment strategies (blue-green, canary, rolling), and incident management workflows.
Maven: Maven build lifecycle phases, dependency scopes (compile, runtime, provided, test), BOM (Bill of Materials) for version management, multi-module projects, SNAPSHOT vs RELEASE artifact promotion, plugin configuration, and Gradle comparison for DevOps interviews.
Sample Interview Questions & Answers
The following are representative questions from the reference guide, shown in full to illustrate the depth of coverage.
How to Use This Guide
- Click any tool card in the nav bar to open the full reference for that tool
- Each tool page has: key concepts, important commands, architecture notes, and interview Q&As with full answers
- Click any question to expand the answer; to practice, keep the answer closed and test yourself first
- Use the Revision Tracker to check off tools as you complete them
- Check the "How to Answer" page for the universal interview answer framework
- Progress is saved in your browser's localStorage, so no account is needed
About master.devops
This site is run by master.devops — a faceless DevOps education brand helping engineers master the skills and interview techniques needed for DevOps, SRE, and cloud engineering roles.
Content is written by the master.devops community — engineers with real production experience in DevOps, SRE, and cloud infrastructure.
Follow for daily interview tips, real commands, and career advice: @master.devops on Instagram · YouTube Shorts · LinkedIn
Revision Tracker
Check off tools as you complete revision.
💡 How to Answer Any Interview Question
The universal framework for DevOps/SRE interviews — especially security and availability questions.
The 4-Step Answer Formula
1. State the risk/problem: "The main concern with X is..."
2. Name the mechanism: "We handle this using Y feature/tool"
3. Real example: "In production, we handle this by..."
4. Trade-off: "The downside is cost/complexity, balanced by..."
Security Answer Template
Availability Answer Template
Deployment Strategy Cheat Sheet
| Strategy | How it works | Pros | Cons |
|---|---|---|---|
| Rolling | Replace pods gradually, N at a time | Zero downtime, low cost | Mixed versions temporarily |
| Blue/Green | Two identical envs, instant LB switch | Instant rollback | Doubles infra cost |
| Canary | Route 5%→25%→100% to new version | Minimal blast radius | Complex, slower |
| Recreate | Stop all, deploy new | Simple | Has downtime |
| Feature flags | Deploy disabled, enable per user | Decouple deploy from release | Code complexity |
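The staged percentages in the Canary row (and the per-user enablement in the Feature flags row) can be illustrated with a small Python sketch. This is only an illustration under stated assumptions: the names `ROLLOUT_STAGES`, `bucket_for`, and `routes_to_canary` are hypothetical, and real canary releases are usually driven by load balancer, ingress, or service mesh traffic weights rather than application code.

```python
import hashlib

# Hypothetical rollout stages, mirroring the 5% -> 25% -> 100% example in the table.
ROLLOUT_STAGES = (5, 25, 100)


def bucket_for(user_id: str) -> int:
    """Map a user ID to a stable bucket in the range 0..99.

    A cryptographic hash is used only for its even spread; the same
    user always lands in the same bucket across processes and restarts.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def routes_to_canary(user_id: str, canary_percent: int) -> bool:
    """Return True if this user should be served by the new version."""
    if not 0 <= canary_percent <= 100:
        raise ValueError("canary_percent must be between 0 and 100")
    return bucket_for(user_id) < canary_percent


if __name__ == "__main__":
    users = [f"user-{i}" for i in range(10_000)]
    for percent in ROLLOUT_STAGES:
        share = sum(routes_to_canary(u, percent) for u in users) / len(users)
        print(f"stage {percent:>3}%: observed canary share {share:.1%}")
```

Because the bucket is derived from the user ID rather than chosen at random per request, a user who sees the new version at 5% keeps seeing it at 25% and 100%, which keeps the blast radius small and the experience consistent while the rollout widens.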
SLI / SLO / SLA Quick Reference
| Term | What it is | Example |
|---|---|---|
| SLI | The metric you measure | % requests returning 200 |
| SLO | Your internal target | 99.9% success rate |
| SLA | Contract with customers | 99.5% guaranteed (with penalties) |
| Error Budget | 1 - SLO | 0.1% ≈ 43.8 min/month of allowed downtime (average-length month) |
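The error budget arithmetic in the table can be checked with a short Python sketch. It assumes an average month of 365.25 / 12 days (43,830 minutes), which is where the 43.8 figure comes from; a flat 30-day window gives 43.2 minutes instead. The function names are hypothetical, and `burn_rate` uses the common definition (observed error rate divided by the error budget) that underlies the burn rate alerting topic listed in the SRE summary.

```python
# Average month length in minutes: 365.25 days / 12 months.
MINUTES_PER_AVG_MONTH = 365.25 * 24 * 60 / 12  # 43,830 minutes


def error_budget_fraction(slo: float) -> float:
    """Error budget as a fraction: 1 - SLO (for example 0.999 -> 0.001)."""
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be a fraction in (0, 1]")
    return 1.0 - slo


def allowed_downtime_minutes(
    slo: float, period_minutes: float = MINUTES_PER_AVG_MONTH
) -> float:
    """Minutes of full downtime the budget permits over the period."""
    return error_budget_fraction(slo) * period_minutes


def burn_rate(observed_error_rate: float, slo: float) -> float:
    """How fast the budget is being spent; 1.0 means exactly on budget."""
    budget = error_budget_fraction(slo)
    if budget == 0.0:
        raise ValueError("a 100% SLO leaves no error budget to burn")
    return observed_error_rate / budget


if __name__ == "__main__":
    # 99.9% SLO over an average month
    print(f"{allowed_downtime_minutes(0.999):.1f} min/month")  # 43.8 min/month
    # Same SLO over a flat 30-day window
    print(f"{allowed_downtime_minutes(0.999, 30 * 24 * 60):.1f} min/30 days")  # 43.2 min/30 days
    # 1% of requests failing against a 99.9% SLO burns budget about 10x too fast
    print(f"burn rate: {burn_rate(0.01, 0.999):.1f}")  # burn rate: 10.0
```

In an interview it helps to state which window you are assuming, since the same SLO yields slightly different minute counts for a 30-day window and an average calendar month.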