// Kubernetes Guide · Intermediate–Advanced

Kubernetes Complete Guide: Pods, Deployments, RBAC & Production Patterns

📅 Updated April 2026 ⏱ 15 min read 🏷 Kubernetes · DevOps · SRE · Containers
👨‍💻
master.devops
Practising DevOps Engineer with deep hands-on experience in Kubernetes, AWS, CI/CD, and SRE. Every guide is written from real production work.
// Table of Contents
  1. What is Kubernetes and Why Does It Exist?
  2. Kubernetes Architecture
  3. Core Workload Objects
  4. Networking — Services and Ingress
  5. Real YAML Examples
  6. Health Probes — The Most Misunderstood Feature
  7. Security and RBAC
  8. Autoscaling with HPA
  9. Debugging Kubernetes in Production
  10. Interview Q&A

Kubernetes (K8s) is the single most important skill in modern DevOps and SRE. I have used it daily in production — managing clusters on AWS EKS and Azure AKS, responding to incidents, tuning HPA policies, and designing multi-AZ high-availability architectures. This guide is written from that real-world experience, not from reading documentation.

Originally built by Google based on their internal system Borg, Kubernetes was open-sourced in 2014 and has since become the industry standard for running containerised applications at scale. If you are preparing for a Senior DevOps, Platform Engineer, or SRE interview, a deep, practical understanding of Kubernetes is non-negotiable.

What is Kubernetes and Why Does It Exist?

Before Kubernetes, running Docker containers at scale exposed a fundamental problem: Docker solved packaging, but not orchestration. What happens when a container crashes? How do you spread containers across 50 servers? How do you update 100 running containers without downtime? How do you handle a traffic spike at 2am?

Kubernetes answers all of these questions. It is an orchestration platform — a system that manages containers across a cluster of machines. You declare what you want (3 replicas of my API, always running, with 512MB RAM each) and Kubernetes makes it happen and maintains it — even when servers fail, containers crash, or traffic doubles.

Core philosophy: Kubernetes is a desired-state system. You describe the desired state in YAML. Kubernetes continuously reconciles actual state toward desired state. This reconciliation loop is the foundation of everything in the platform.
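As a minimal sketch of that desired-state declaration (the name my-api and the image are illustrative placeholders, not from a specific cluster):

```yaml
# desired-state.yaml: "3 replicas of my API, 512Mi RAM each"
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-api
spec:
  replicas: 3                 # desired state: always 3 Pods
  selector:
    matchLabels:
      app: my-api
  template:
    metadata:
      labels:
        app: my-api
    spec:
      containers:
        - name: api
          image: registry.example.com/my-api:1.0
          resources:
            requests:
              memory: "512Mi"   # scheduler reserves this per Pod
```

Delete one of these Pods and the reconciliation loop immediately creates a replacement, because actual state (2 Pods) no longer matches desired state (3).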

Kubernetes Architecture

A Kubernetes cluster has two types of machines: the control plane and worker nodes. Explaining this architecture is the first question in most K8s interviews.

Control Plane Components

The control plane makes the global decisions for the cluster. Its main components: the kube-apiserver (the front door; every kubectl command and every controller talks to it), etcd (the distributed key-value store holding all cluster state), the kube-scheduler (decides which node each new Pod lands on), and the kube-controller-manager (runs the reconciliation loops behind Deployments, ReplicaSets, Nodes, and more).

Worker Node Components

Worker nodes run the actual workloads. Each node runs the kubelet (the agent that starts containers and reports Pod status back to the API server), kube-proxy (programs iptables/IPVS rules so Service IPs route to Pods), and a container runtime such as containerd or CRI-O that pulls images and runs the containers.

Core Workload Objects

Pod — The Atomic Unit

A Pod is the smallest deployable unit. It contains one or more containers that share a network namespace (they talk to each other via localhost) and storage volumes. In practice, most Pods run a single container. The sidecar pattern (Istio's Envoy proxy, Vault's secret injector) is the main case for multi-container Pods.

Pods are ephemeral. They are not self-healing. If a Pod dies, it stays dead unless a controller (like a Deployment) creates a replacement. Never create bare Pods in production — always use a Deployment or StatefulSet.
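To illustrate the sidecar pattern mentioned above, here is a minimal two-container Pod sketch. The names (app, log-shipper) and the shared volume are illustrative, not a specific vendor's injector configuration; in production this Pod spec would live inside a Deployment template, never as a bare Pod:

```yaml
# sidecar-pod.yaml (illustrative sketch)
apiVersion: v1
kind: Pod
metadata:
  name: app-with-sidecar
spec:
  containers:
    - name: app
      image: registry.example.com/app:1.0
      ports:
        - containerPort: 8080
      volumeMounts:
        - name: app-logs          # app writes logs here...
          mountPath: /var/log/app
    - name: log-shipper           # ...sidecar reads them from the same volume
      image: fluent/fluent-bit:2.2
      volumeMounts:
        - name: app-logs
          mountPath: /var/log/app
  volumes:
    - name: app-logs              # shared scratch volume, dies with the Pod
      emptyDir: {}
```

Both containers also share one network namespace, so the sidecar could equally reach the app on localhost:8080.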

Deployment — For Stateless Applications

A Deployment manages a ReplicaSet (a set of identical Pods). You declare the number of replicas and the container image, and the Deployment controller ensures that number of Pods is always running. It also handles rolling updates — replacing Pods gradually so there is no downtime — and rollbacks when a new version fails.

StatefulSet — For Stateful Applications

StatefulSets are like Deployments but with three additional guarantees: stable Pod names (pod-0, pod-1, pod-2 — never random), stable per-Pod PersistentVolumeClaims that survive rescheduling, and ordered start/stop (pod-0 starts before pod-1, pod-1 stops before pod-0). Use StatefulSets for PostgreSQL, MySQL, Kafka, Redis Cluster, Cassandra, Elasticsearch — any workload where instance identity matters.
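A minimal StatefulSet sketch showing all three guarantees (names are illustrative; it assumes a headless Service called postgres-headless exists for the stable per-Pod DNS):

```yaml
# statefulset-sketch.yaml (illustrative)
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres
spec:
  serviceName: postgres-headless   # required: gives each Pod a stable DNS name
  replicas: 3                      # pods: postgres-0, postgres-1, postgres-2
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
      containers:
        - name: postgres
          image: postgres:16
          volumeMounts:
            - name: data
              mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:            # one stable PVC per Pod, survives rescheduling
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 10Gi
```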

DaemonSet — One Pod Per Node

DaemonSets ensure exactly one Pod runs on every node (or a subset). Used for: log collectors (Fluentd, Filebeat), monitoring agents (Prometheus Node Exporter), network plugins (Calico CNI), and security agents (Falco). When a new node joins the cluster, the DaemonSet Pod is automatically scheduled on it.
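A short DaemonSet sketch for a node-exporter-style monitoring agent (illustrative names; the toleration is what lets it also land on tainted control-plane nodes):

```yaml
# daemonset-sketch.yaml (illustrative)
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-exporter
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: node-exporter
  template:
    metadata:
      labels:
        app: node-exporter
    spec:
      tolerations:                  # run on control-plane nodes too
        - key: node-role.kubernetes.io/control-plane
          operator: Exists
          effect: NoSchedule
      containers:
        - name: node-exporter
          image: prom/node-exporter:v1.7.0
          ports:
            - containerPort: 9100
```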

Networking — Services and Ingress

Kubernetes networking is the most common area where engineers get stuck. There are four distinct communication problems to understand: container-to-container (containers in the same Pod share a network namespace and talk over localhost), Pod-to-Pod (every Pod gets its own routable IP on a flat, NAT-free cluster network), Pod-to-Service (a Service puts one stable virtual IP and DNS name in front of an ever-changing set of Pods), and external-to-cluster (NodePort, LoadBalancer, and Ingress bring outside traffic in).

Service Types

ClusterIP (the default) exposes a Service on an internal virtual IP, reachable only inside the cluster. NodePort additionally opens the same static port (range 30000–32767) on every node. LoadBalancer builds on NodePort and provisions a cloud load balancer (AWS ELB, Azure LB) in front of it. ExternalName is a DNS-level CNAME alias pointing at a service outside the cluster.
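A minimal ClusterIP Service sketch; the selector and ports here assume the api-server Deployment used in the YAML examples in this guide:

```yaml
# service.yaml (illustrative)
apiVersion: v1
kind: Service
metadata:
  name: api-service
  namespace: production
spec:
  type: ClusterIP          # default: internal virtual IP only
  selector:
    app: api-server        # must match the Deployment's Pod labels
  ports:
    - port: 80             # the Service's stable port
      targetPort: 8080     # the containerPort it forwards to
```

A selector typo here is a classic outage: the Service exists, DNS resolves, but the endpoint list is empty and every request fails.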

Real YAML Examples

Production Deployment

# production-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
  namespace: production
  labels:
    app: api-server
    version: "2.1.0"
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api-server
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0        # never drop below 100% capacity
      maxSurge: 1              # create 1 extra pod during update
  template:
    metadata:
      labels:
        app: api-server
        version: "2.1.0"
    spec:
      serviceAccountName: api-sa
      securityContext:
        runAsNonRoot: true
        runAsUser: 1001        # NEVER run as root
      containers:
        - name: api
          image: registry.company.com/api:2.1.0
          ports:
            - containerPort: 8080
          resources:
            requests:          # scheduler uses this for placement
              cpu: "100m"
              memory: "256Mi"
            limits:            # OOMKilled if memory limit exceeded
              cpu: "500m"
              memory: "512Mi"
          env:
            - name: DB_PASSWORD
              valueFrom:
                secretKeyRef:  # never hardcode secrets
                  name: db-secret
                  key: password
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 5
          livenessProbe:
            httpGet:
              path: /health/live
              port: 8080
            initialDelaySeconds: 30   # give app time to start
            periodSeconds: 10
          lifecycle:
            preStop:
              exec:
                command: ["sleep", "5"]   # drain connections gracefully
      terminationGracePeriodSeconds: 30

Ingress with TLS

# ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: api-ingress
  namespace: production
  annotations:
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
    nginx.ingress.kubernetes.io/rate-limit: "100"   # 100 req/s per IP
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - api.company.com
      secretName: api-tls-cert   # cert-manager manages this
  rules:
    - host: api.company.com
      http:
        paths:
          - path: /api/v1
            pathType: Prefix
            backend:
              service:
                name: api-service
                port:
                  number: 80
          - path: /health
            pathType: Exact
            backend:
              service:
                name: api-service
                port:
                  number: 80

HPA with Custom Metrics

# hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  minReplicas: 3
  maxReplicas: 50
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60   # scale up when avg CPU > 60%
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 70
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300   # wait 5 min before scaling down
      policies:
        - type: Percent
          value: 10                     # scale down max 10% at a time
          periodSeconds: 60

Health Probes — The Most Misunderstood Feature

Health probe misconfiguration is the single most common cause of production incidents I have investigated. Get these wrong and you will have either CrashLoopBackOff (liveness probe too aggressive) or traffic routed to broken Pods (readiness probe missing).

Liveness probe failure → Kubernetes kills and restarts the container. Use for detecting deadlocks, infinite loops, or processes that are stuck but still "running".
Readiness probe failure → Pod is removed from Service endpoints. No traffic is routed to it. Container is NOT restarted. Use for: app warming up, database not yet connected, cache not yet populated.
Critical rule: Never put an external dependency check in a liveness probe. If your liveness probe calls your database and the database has a 30-second outage, Kubernetes will restart every pod in your fleet simultaneously. Use readiness for external dependency checks.
Startup probe tip: For Java apps with long JVM warm-up, set failureThreshold: 30 and periodSeconds: 10 on the startup probe. This gives 300 seconds (5 minutes) for startup before liveness kicks in. Without this, JVM apps frequently enter CrashLoopBackOff on first deploy.
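The startup-probe tip above, written out as a container-spec fragment (the /health/live path and port follow the deployment example earlier in this guide):

```yaml
# container spec excerpt: startup probe shields liveness during JVM warm-up
startupProbe:
  httpGet:
    path: /health/live
    port: 8080
  failureThreshold: 30   # 30 attempts...
  periodSeconds: 10      # ...x 10s = up to 300s to start
livenessProbe:           # only begins once the startup probe has succeeded
  httpGet:
    path: /health/live
    port: 8080
  periodSeconds: 10
```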

Security and RBAC

Role-Based Access Control (RBAC) is the primary security mechanism inside a Kubernetes cluster. The principle of least privilege applies everywhere: a Pod should only have access to the resources it actually needs, and nothing more.

# Create a role that can only read pods and logs
kubectl create role pod-reader \
  --verb=get,list,watch \
  --resource=pods,pods/log \
  -n production

# Bind it to a service account
kubectl create rolebinding api-pod-reader \
  --role=pod-reader \
  --serviceaccount=production:api-sa \
  -n production

# Verify — always test what you set up
kubectl auth can-i get pods \
  --as=system:serviceaccount:production:api-sa -n production
# Output: yes
kubectl auth can-i delete pods \
  --as=system:serviceaccount:production:api-sa -n production
# Output: no
Kubernetes Secrets are NOT encrypted by default. They are base64-encoded, which anyone with etcd access can trivially decode. For production, use HashiCorp Vault with the Agent Injector, Sealed Secrets (Bitnami), or External Secrets Operator with AWS Secrets Manager or Azure Key Vault. Enable etcd encryption at rest as a baseline.
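To see how thin the base64 "protection" is, decoding is a one-liner. The value below is a stand-in, not pulled from a real cluster, and the kubectl fetch is shown only as a comment:

```shell
# With cluster access you would fetch a Secret value via:
#   kubectl get secret db-secret -n production -o jsonpath='{.data.password}'
# Decoding it needs no key at all:
echo 'cGFzc3dvcmQ=' | base64 -d   # prints: password
```

Anyone who can read the Secret object, or the etcd file on disk, has the plaintext.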

Debugging Kubernetes in Production

This is the section that separates senior engineers from juniors in interviews. When a pod is broken in production, you need a systematic, fast debugging approach — not random kubectl commands.

# Step 1: Get pod status — look for Error, CrashLoopBackOff, Pending, OOMKilled
kubectl get pods -n production -o wide

# Step 2: Describe — the Events section is the most useful part
kubectl describe pod api-7d4b-xyz -n production
# Look for: ImagePullBackOff, resource limits, scheduling failures

# Step 3: Logs — use --previous for the crashed container's last logs
kubectl logs api-7d4b-xyz -n production --previous
kubectl logs api-7d4b-xyz -n production -c api --tail=100

# Step 4: Events — sorted by time, cluster-wide view
kubectl get events --sort-by='.lastTimestamp' -n production

# Step 5: Resource usage — check for OOMKilled (memory limit exceeded)
kubectl top pods -n production --sort-by=memory

# Step 6: Exec into a running container for live debugging
kubectl exec -it api-7d4b-xyz -n production -- sh

# Step 7: Check if the Service has healthy endpoints
kubectl get endpoints api-service -n production
# Empty endpoints = readiness probe failing or selector mismatch

# Step 8: Port-forward for local testing without Ingress
kubectl port-forward svc/api-service 8080:80 -n production

Interview Q&A

Q1: What is CrashLoopBackOff and how do you debug it?
CrashLoopBackOff means a container is crashing immediately after start and Kubernetes is restarting it with exponential backoff (10s, 20s, 40s up to 5 minutes). Common causes: application startup error, missing environment variable or ConfigMap key, liveness probe firing before app is ready, OOMKilled (memory limit too low), wrong ENTRYPOINT command. Debug with kubectl logs --previous to see the last crash output, and kubectl describe pod to see the exit code. Exit code 137 = OOMKilled, exit code 1 = application error.
Q2: What is the difference between Deployment and StatefulSet?
Deployments are for stateless applications — Pods are interchangeable, get random names, can share a single PVC or use no PVC. StatefulSets are for stateful applications needing stable identity: ordered names (pod-0, pod-1), per-Pod stable PVCs that survive rescheduling, and ordered startup/shutdown. Use StatefulSet for: databases (Postgres, MySQL), Kafka, Redis Cluster, Elasticsearch. Never use StatefulSet for stateless apps — it makes rolling updates unnecessarily sequential.
Q3: Liveness vs Readiness — what happens when each fails?
Liveness failure: kubelet kills the container and restarts it. Readiness failure: the Pod is removed from the Service endpoint list — no traffic, no restart. A Pod can be "live" but "not ready" (still warming up). The most common mistake: aggressive liveness probe on slow-starting app causes CrashLoopBackOff. The second most common: no readiness probe, so traffic hits Pods during startup before the app is ready to serve.
Q4: How does the Kubernetes scheduler decide where to place a Pod?
Two phases: Filtering eliminates nodes that cannot run the Pod — insufficient CPU/memory requests, taint not tolerated, node selector mismatch, affinity rule violation, node not Ready. Scoring ranks remaining nodes by least-allocated resources, pod topology spread score, image locality (node already has the image cached). The highest-scoring node wins. If no node passes filtering, the Pod stays Pending indefinitely. Check kubectl describe pod events for "FailedScheduling" to see why.
Q5: How do you ensure zero-downtime deployments?
Set maxUnavailable: 0 and maxSurge: 1 in the RollingUpdate strategy. Configure a proper readiness probe so the new Pod only joins the Service endpoints when it is actually ready to serve traffic. Set a preStop hook (sleep 5) to allow in-flight requests to complete before the container is terminated. Set terminationGracePeriodSeconds long enough for graceful shutdown. Test with a canary deployment first using a second Deployment with 1 replica pointing to the new image.
Q6: What is a PodDisruptionBudget and when do you need one?
A PodDisruptionBudget (PDB) limits how many Pods of a Deployment can be simultaneously unavailable during voluntary disruptions — node drains during maintenance, cluster upgrades, or node auto-scaling scale-down events. Without a PDB, a node drain could evict all replicas of a Deployment at once, causing downtime. Example: minAvailable: 2 ensures at least 2 Pods are always running. Set PDBs for every production Deployment with more than 1 replica.
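The minAvailable: 2 example from the answer above, as a manifest (it pairs with the api-server Deployment from the YAML examples section):

```yaml
# pdb.yaml (illustrative)
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
  namespace: production
spec:
  minAvailable: 2          # voluntary evictions may never drop us below 2 Pods
  selector:
    matchLabels:
      app: api-server      # must match the Deployment's Pod labels
```

With this in place, a kubectl drain on a node hosting the third-to-last replica will block until a replacement Pod is ready elsewhere.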