Introduction: Learning from the Best in DevOps
Netflix, Google, and Amazon aren't just tech giants; they helped pioneer modern DevOps culture. Netflix deploys thousands of times per day. Google launches billions of containers every week. Amazon reported in 2011 that it pushed code to production every 11.6 seconds on average. These companies have built, open-sourced, and battle-tested many of the tools that the rest of us now rely on.
In this comprehensive guide, we'll dissect the DevOps toolchains associated with each company, provide configuration files you can adapt, and show you how to implement enterprise-grade DevOps practices, even on a small team budget.
The Netflix DevOps Stack
Netflix operates one of the most sophisticated cloud-native architectures in the world, serving over 230 million subscribers across 190+ countries. Their engineering culture is built on freedom and responsibility, and their DevOps tooling reflects that philosophy.
Spinnaker: Multi-Cloud Continuous Delivery
Spinnaker is Netflix's open-source continuous delivery platform, designed for releasing software changes with high velocity and confidence. Unlike simpler CI/CD tools, Spinnaker provides multi-cloud deployment, advanced deployment strategies, and deep integration with cloud providers.
Install Spinnaker using Halyard (the official CLI tool):
## Install Halyard on Ubuntu/Debian
curl -O https://raw.githubusercontent.com/spinnaker/halyard/master/install/debian/InstallHalyard.sh
sudo bash InstallHalyard.sh
## Verify installation
hal version list
## Configure the cloud provider (AWS example)
hal config provider aws account add my-aws-account \
  --account-id 123456789012 \
  --assume-role role/spinnakerManaged \
  --regions us-east-1,us-west-2
hal config provider aws enable
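## Set the deployment type (assumes a Kubernetes account named my-k8s-account was already added with hal config provider kubernetes account add)
hal config deploy edit --type distributed --account-name my-k8s-account
## Choose a version from the hal version list output (SPINNAKER_VERSION is a placeholder you set yourself), then deploy Spinnaker
hal config version edit --version $SPINNAKER_VERSION
hal deploy apply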
Here is an illustrative Spinnaker pipeline definition that gates a production deployment on canary analysis. It is written as YAML for readability (Spinnaker itself stores pipelines as JSON), and the exact stage types and field names depend on your Spinnaker version and canary setup:
application: my-web-app
name: production-deploy-pipeline
triggers:
  - type: docker
    registry: index.docker.io
    repository: myorg/my-web-app
    tag: "v.*"
    enabled: true
stages:
  - refId: "1"
    type: bake
    name: "Bake AMI"
    baseOs: ubuntu
    package: my-web-app
    regions:
      - us-east-1
    vmType: hvm
    storeType: ebs
  - refId: "2"
    requisiteStageRefIds: ["1"]
    type: canaryAnalysis
    name: "Canary Analysis"
    canaryConfig:
      lifetimeDuration: PT1H
      metricsAccountName: my-datadog
      scoreThresholds:
        marginal: 50
        pass: 75
      storageAccountName: my-s3
  - refId: "3"
    requisiteStageRefIds: ["2"]
    type: deploy
    name: "Deploy to Production"
    clusters:
      - account: my-aws-account
        application: my-web-app
        capacity:
          desired: 6
          max: 10
          min: 3
        cooldown: 300
        instanceType: m5.xlarge
        loadBalancers:
          - my-web-app-prod-lb
        provider: aws
        strategy: rollingpush
        subnetType: internal
  - refId: "4"
    requisiteStageRefIds: ["3"]
    type: webhook
    name: "Notify Slack"
    url: https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK
    method: POST
    payload: |
      {
        "channel": "#deployments",
        "text": "Production deployment of ${trigger.tag} completed successfully!"
      }
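To make the scoreThresholds values concrete, here is a minimal Python sketch of how a canary gate with a marginal threshold of 50 and a pass threshold of 75 could be evaluated. It assumes the behavior usually described for Spinnaker's automated canary analysis (any interval scoring below marginal fails the run, and the final interval must reach pass). The names ScoreThresholds and judge_canary are hypothetical, and this is not Spinnaker's own implementation.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ScoreThresholds:
    """Hypothetical container mirroring scoreThresholds in the pipeline."""
    marginal: float = 50.0
    passing: float = 75.0  # "pass" is a Python keyword, hence "passing"


def judge_canary(interval_scores: Sequence[float],
                 thresholds: ScoreThresholds = ScoreThresholds()) -> str:
    """Return "pass", "marginal", or "fail" for a series of interval scores."""
    if not interval_scores:
        raise ValueError("at least one interval score is required")
    # Assumption: a single interval below the marginal threshold fails the run.
    if any(score < thresholds.marginal for score in interval_scores):
        return "fail"
    # Assumption: only the final interval is compared with the pass threshold.
    if interval_scores[-1] >= thresholds.passing:
        return "pass"
    return "marginal"


if __name__ == "__main__":
    print(judge_canary([80, 90]))      # pass
    print(judge_canary([80, 40, 95]))  # fail
    print(judge_canary([60, 70]))      # marginal
```

In this sketch a "marginal" verdict means the release is not promoted automatically; whether to retry or roll back is left to the team.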
Chaos Monkey and the Simian Army
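Netflix's Chaos Monkey randomly terminates production instances to ensure that engineers build resilient services. It began as part of the broader Simian Army, which also included Latency Monkey, Conformity Monkey, and Chaos Gorilla. The Simian Army project has since been retired, while Chaos Monkey continues as a standalone tool that integrates with Spinnaker.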
## Clone Chaos Monkey
git clone https://github.com/Netflix/chaosmonkey.git
cd chaosmonkey
## Build from source
go build -o chaosmonkey ./cmd/chaosmonkey
## Create configuration file
cat > chaosmonkey.toml <<EOF
[chaosmonkey]
enabled = true
schedule_enabled = true
accounts = ["my-aws-account"]
[chaosmonkey.scheduler]
frequency = 5
clustering_enabled = true
[database]
host = "localhost"
name = "chaosmonkey"
user = "chaosmonkey"
password = "secure-password-here"
port = 3306
EOF
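## Apply the database schema migrations, then generate the termination schedule
chaosmonkey migrate
chaosmonkey schedule
## Manually trigger chaos (for testing); the app name is followed by the account name
chaosmonkey terminate my-web-app my-aws-account --region us-east-1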
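The idea behind random termination can be sketched in a few lines of Python. This is only an illustration of the concept, not Chaos Monkey's scheduling algorithm: pick_victims, the cluster and instance names, and the per-run kill_probability are all assumptions made for the example.

```python
import random
from typing import Dict, List


def pick_victims(clusters: Dict[str, List[str]],
                 kill_probability: float,
                 rng: random.Random) -> Dict[str, str]:
    """Choose at most one instance per cluster to terminate.

    kill_probability is a hypothetical knob: the chance that a given
    cluster loses an instance in this run.
    """
    victims: Dict[str, str] = {}
    for cluster in sorted(clusters):
        instances = sorted(clusters[cluster])
        # Skip empty clusters; otherwise roll the dice for this cluster.
        if instances and rng.random() < kill_probability:
            victims[cluster] = rng.choice(instances)
    return victims


if __name__ == "__main__":
    fleet = {
        "web-prod": ["i-1", "i-2", "i-3"],
        "api-prod": ["i-9"],
    }
    # A seeded generator keeps a dry run reproducible.
    print(pick_victims(fleet, kill_probability=0.2, rng=random.Random(7)))
```

Limiting the blast radius to one instance per cluster is the design point worth copying: the service should survive the loss of any single instance.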
Zuul Gateway and Eureka Service Discovery
Netflix built Zuul as their API gateway for dynamic routing, monitoring, resiliency, and security. Combined with Eureka for service discovery, these tools form the backbone of Netflix's microservices communication layer.
# application.yml for Zuul Gateway
server:
  port: 8080
spring:
  application:
    name: api-gateway
eureka:
  client:
    serviceUrl:
      defaultZone: http://localhost:8761/eureka/
    register-with-eureka: true
    fetch-registry: true
zuul:
  routes:
    user-service:
      path: /api/users/**
      serviceId: user-service
      stripPrefix: true
    order-service:
      path: /api/orders/**
      serviceId: order-service
      stripPrefix: true
    product-service:
      path: /api/products/**
      serviceId: product-service
      stripPrefix: true
  ribbon:
    eager-load:
      enabled: true
  host:
    connect-timeout-millis: 5000
    socket-timeout-millis: 10000
hystrix:
  command:
    default:
      execution:
        isolation:
          thread:
            timeoutInMilliseconds: 60000
# Eureka Server application.yml
server:
  port: 8761
eureka:
  instance:
    hostname: localhost
  client:
    registerWithEureka: false
    fetchRegistry: false
    serviceUrl:
      defaultZone: http://${eureka.instance.hostname}:${server.port}/eureka/
  server:
    enable-self-preservation: false
    eviction-interval-timer-in-ms: 5000
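To show what the route table above does to an incoming request, here is a small Python sketch of prefix-based routing with stripPrefix. It mirrors the three routes in the gateway configuration; the ROUTES dictionary and the resolve function are hypothetical names for illustration and do not reproduce Zuul's filter chain or Eureka lookups.

```python
from typing import Dict, Optional, Tuple

# Route prefixes taken from the zuul.routes section (the /** wildcard is implied).
ROUTES: Dict[str, str] = {
    "/api/users/": "user-service",
    "/api/orders/": "order-service",
    "/api/products/": "product-service",
}


def resolve(path: str,
            routes: Dict[str, str] = ROUTES,
            strip_prefix: bool = True) -> Optional[Tuple[str, str]]:
    """Return (service_id, forwarded_path), or None if no route matches."""
    for prefix, service_id in routes.items():
        if path.startswith(prefix):
            # With stripPrefix enabled, the route prefix is removed before
            # the request is forwarded to the backing service.
            forwarded = "/" + path[len(prefix):] if strip_prefix else path
            return service_id, forwarded
    return None


if __name__ == "__main__":
    print(resolve("/api/users/42"))   # ('user-service', '/42')
    print(resolve("/health"))         # None
```

In the real gateway, the service ID is then resolved to a live instance through the Eureka registry.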
The Google DevOps Stack
Google runs over 4 billion containers per week using their internal Borg system. Their DevOps philosophy gave birth to Site Reliability Engineering (SRE), and their open-source contributions have fundamentally shaped the industry.
Kubernetes: From Borg to K8s
Kubernetes (K8s) was born from Google's internal Borg system, which has been managing containerized workloads since 2003. Google reworked those orchestration concepts into Kubernetes, open-sourced it in 2014, and donated it to the CNCF in 2015.
Set up a Google Kubernetes Engine (GKE) cluster:
## Install Google Cloud SDK
curl https://sdk.cloud.google.com | bash
exec -l $SHELL
gcloud init
## Create a GKE cluster
gcloud container clusters create production-cluster \
  --zone us-central1-a \
  --num-nodes 3 \
  --machine-type e2-standard-4 \
  --enable-autoscaling \
  --min-nodes 2 \
  --max-nodes 10 \
  --enable-autorepair \
  --enable-autoupgrade \
  --enable-network-policy \
  --enable-vertical-pod-autoscaling \
  --release-channel regular
## Get credentials
gcloud container clusters get-credentials production-cluster \
  --zone us-central1-a
## Verify cluster
kubectl get nodes
kubectl cluster-info
Here is a production-ready Kubernetes deployment manifest:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
  namespace: production
  labels:
    app: web-app
    version: v2.1.0
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web-app
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  template:
    metadata:
      labels:
        app: web-app
        version: v2.1.0
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8080"
    spec:
      serviceAccountName: web-app-sa
      containers:
        - name: web-app
          image: gcr.io/my-project/web-app:v2.1.0
          ports:
            - containerPort: 8080
              name: http
          resources:
            requests:
              cpu: 250m
              memory: 256Mi
            limits:
              cpu: 500m
              memory: 512Mi
          livenessProbe:
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /ready
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 5
          env:
            - name: DB_HOST
              valueFrom:
                secretKeyRef:
                  name: db-credentials
                  key: host
            - name: DB_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: db-credentials
                  key: password
---
apiVersion: v1
kind: Service
metadata:
  name: web-app-service
  namespace: production
spec:
  selector:
    app: web-app
  ports:
    - protocol: TCP
      port: 80
      targetPort: 8080
  type: ClusterIP
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-app-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
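The HorizontalPodAutoscaler above scales on CPU (70%) and memory (80%) utilization. Its documented core formula is desiredReplicas = ceil(currentReplicas * currentMetric / targetMetric), and when several metrics are configured the largest result wins. Here is a minimal Python sketch of that arithmetic; the function names are hypothetical, the 10% tolerance is the Kubernetes default, and real HPA behavior (stabilization windows, pod readiness, missing metrics) is deliberately left out.

```python
import math
from typing import Iterable, Tuple


def desired_replicas(current_replicas: int,
                     current_utilization: float,
                     target_utilization: float,
                     min_replicas: int = 3,
                     max_replicas: int = 20,
                     tolerance: float = 0.1) -> int:
    """Apply the HPA scaling formula for a single metric."""
    ratio = current_utilization / target_utilization
    if abs(ratio - 1.0) <= tolerance:
        # Close enough to the target: keep the current replica count.
        desired = current_replicas
    else:
        desired = math.ceil(current_replicas * ratio)
    # Clamp to the minReplicas/maxReplicas bounds from the manifest.
    return max(min_replicas, min(max_replicas, desired))


def hpa_decision(current_replicas: int,
                 metrics: Iterable[Tuple[float, float]]) -> int:
    """With several (current, target) metrics, the largest proposal wins."""
    return max(desired_replicas(current_replicas, cur, tgt)
               for cur, tgt in metrics)


if __name__ == "__main__":
    # 3 pods at 140% CPU against a 70% target doubles the deployment.
    print(desired_replicas(3, 140, 70))            # 6
    # CPU is on target, memory is at twice its target: memory drives scaling.
    print(hpa_decision(4, [(70, 70), (160, 80)]))  # 8
```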
Cloud Build CI/CD
Google Cloud Build is Google's fully managed CI/CD platform. It executes builds on Google's infrastructure, scales automatically, and integrates seamlessly with GKE.
# cloudbuild.yaml
steps:
  # Install dependencies and run tests
  - name: 'node:18'
    entrypoint: npm
    args: ['install']
  - name: 'node:18'
    entrypoint: npm
    args: ['test']
    env:
      - 'NODE_ENV=test'
  # Build Docker image
  - name: 'gcr.io/cloud-builders/docker'
    args:
      - 'build'
      - '-t'
      - 'gcr.io/$PROJECT_ID/web-app:$SHORT_SHA'
      - '-t'
      - 'gcr.io/$PROJECT_ID/web-app:latest'
      - '.'
  # Push to GCR (only the $SHORT_SHA tag is pushed here; the latest tag stays local to the build)
  - name: 'gcr.io/cloud-builders/docker'
    args: ['push', 'gcr.io/$PROJECT_ID/web-app:$SHORT_SHA']
  # Deploy to GKE
  - name: 'gcr.io/cloud-builders/gke-deploy'
    args:
      - 'run'
      - '--filename=k8s/'
      - '--image=gcr.io/$PROJECT_ID/web-app:$SHORT_SHA'
      - '--location=us-central1-a'
      - '--cluster=production-cluster'
timeout: 1200s
options:
  machineType: E2_HIGHCPU_8
Istio Service Mesh Configuration
Istio, co-developed by Google, IBM, and Lyft, provides traffic management, security, and observability for microservices. Here is a practical configuration:
## Install Istio
curl -L https://istio.io/downloadIstio | sh -
cd istio-*
export PATH=$PWD/bin:$PATH
## Install Istio with the default profile (the one Istio recommends for production; there is no built-in "production" profile)
istioctl install --set profile=default -y
## Enable sidecar injection for a namespace
kubectl label namespace production istio-injection=enabled
# Istio VirtualService for traffic splitting (Canary)
# The host must match the Kubernetes Service name (web-app-service in the manifest above)
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: web-app-vs
  namespace: production
spec:
  hosts:
    - web-app-service
  http:
    - match:
        - headers:
            x-canary:
              exact: "true"
      route:
        - destination:
            host: web-app-service
            subset: canary
    - route:
        - destination:
            host: web-app-service
            subset: stable
          weight: 90
        - destination:
            host: web-app-service
            subset: canary
          weight: 10
---
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: web-app-dr
  namespace: production
spec:
  host: web-app-service
  trafficPolicy:
    connectionPool:
      http:
        h2UpgradePolicy: DEFAULT
        http1MaxPendingRequests: 1024
        http2MaxRequests: 1024
      tcp:
        maxConnections: 100
    outlierDetection:
      consecutive5xxErrors: 3
      interval: 30s
      baseEjectionTime: 30s
  subsets:
    - name: stable
      labels:
        version: v2.0.0
    - name: canary
      labels:
        version: v2.1.0
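The routing rules above can be read as a two-step decision: requests carrying the x-canary: "true" header always go to the canary subset, and everything else is split 90/10 between stable and canary. Here is a rough Python sketch of that decision, assuming a simple weighted random choice; choose_subset is a hypothetical name and the sketch says nothing about how Envoy actually balances traffic.

```python
import random
from typing import Mapping, Sequence, Tuple

# Weights taken from the VirtualService's default route.
DEFAULT_WEIGHTS: Sequence[Tuple[str, int]] = (("stable", 90), ("canary", 10))


def choose_subset(headers: Mapping[str, str],
                  rng: random.Random,
                  weights: Sequence[Tuple[str, int]] = DEFAULT_WEIGHTS) -> str:
    """Pick the destination subset for one request."""
    # First rule: an exact header match pins the request to the canary.
    if headers.get("x-canary") == "true":
        return "canary"
    # Fallback rule: weighted split across the subsets.
    names = [name for name, _ in weights]
    shares = [share for _, share in weights]
    return rng.choices(names, weights=shares, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(42)
    sample = [choose_subset({}, rng) for _ in range(10_000)]
    print(sample.count("canary") / len(sample))  # roughly 0.10
```

The header override is what lets testers exercise the canary on demand while ordinary users only see it for about one request in ten.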
The Amazon DevOps Stack
Amazon was arguably the first company to fully embrace microservices and DevOps, driven by the famous "two-pizza team" mandate from Jeff Bezos. Their AWS platform now powers a massive share of the internet's infrastructure.
AWS CodePipeline + CodeBuild
AWS CodePipeline orchestrates the entire release process, while CodeBuild handles the actual build and test execution. Here is a complete buildspec.yml configuration:
# buildspec.yml for AWS CodeBuild
version: 0.2
env:
  variables:
    NODE_ENV: "production"
    AWS_DEFAULT_REGION: "us-east-1"
  parameter-store:
    DB_PASSWORD: "/myapp/production/db-password"
    API_KEY: "/myapp/production/api-key"
phases:
  install:
    runtime-versions:
      nodejs: 18
    commands:
      - echo "Installing dependencies..."
      - npm ci --production=false
  pre_build:
    commands:
      - echo "Running linter and type checking..."
      - npm run lint
      - npm run type-check
      - echo "Logging in to Amazon ECR..."
      # AWS_ACCOUNT_ID is not provided by CodeBuild; set it as an environment variable on the build project
      - aws ecr get-login-password --region $AWS_DEFAULT_REGION | docker login --username AWS --password-stdin $AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com
      - REPOSITORY_URI=$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com/my-web-app
      - COMMIT_HASH=$(echo $CODEBUILD_RESOLVED_SOURCE_VERSION | cut -c 1-7)
      - IMAGE_TAG=${COMMIT_HASH:-latest}
  build:
    commands:
      - echo "Running tests..."
      - npm test -- --coverage --ci
      - echo "Building Docker image..."
      - docker build -t $REPOSITORY_URI:latest .
      - docker tag $REPOSITORY_URI:latest $REPOSITORY_URI:$IMAGE_TAG
  post_build:
    commands:
      - echo "Pushing Docker image..."
      - docker push $REPOSITORY_URI:latest
      - docker push $REPOSITORY_URI:$IMAGE_TAG
      - echo "Writing image definitions file..."
      - printf '[{"name":"web-app","imageUri":"%s"}]' $REPOSITORY_URI:$IMAGE_TAG > imagedefinitions.json
artifacts:
  files:
    - imagedefinitions.json
    - appspec.yml
    - taskdef.json
reports:
  coverage-report:
    files:
      - "coverage/clover.xml"
    file-format: CLOVERXML
cache:
  paths:
    - 'node_modules/**/*'
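Two small pieces of logic in this buildspec are easy to miss: the image tag is the first seven characters of the commit hash (falling back to latest), and imagedefinitions.json is a one-element JSON array that tells the ECS deploy action which container to update. Here is a Python sketch of both, for illustration only; the function names and the sample repository URI are assumptions.

```python
import json
from typing import Optional


def image_tag(resolved_source_version: Optional[str]) -> str:
    """Mirror COMMIT_HASH (cut -c 1-7) and IMAGE_TAG=${COMMIT_HASH:-latest}."""
    short_hash = (resolved_source_version or "")[:7]
    return short_hash or "latest"


def image_definitions(repository_uri: str, tag: str,
                      container_name: str = "web-app") -> str:
    """Render the contents of imagedefinitions.json as a JSON string."""
    return json.dumps([
        {"name": container_name, "imageUri": f"{repository_uri}:{tag}"}
    ])


if __name__ == "__main__":
    # Hypothetical repository URI built from the sample account ID and region.
    repo = "123456789012.dkr.ecr.us-east-1.amazonaws.com/my-web-app"
    tag = image_tag("0123456789abcdef0123456789abcdef01234567")
    print(tag)                           # 0123456
    print(image_definitions(repo, tag))
```

The container name in the file has to match the container name in the ECS task definition, which is why both use web-app here.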
AWS CloudFormation Templates
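CloudFormation is Amazon's Infrastructure as Code service. Here is an excerpt from a template that provisions a VPC and an ECS Fargate service. It is not deployable as shown: the resources it references but does not define (ECSExecutionRole, ECSTaskRole, LogGroup, ContainerSecurityGroup, TargetGroup, and ALB), along with an internet gateway and route tables for the public subnets, must be added before the stack will create. Auto scaling for the service would also need its own resources.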
AWSTemplateFormatVersion: '2010-09-09'
Description: Production VPC with ECS Fargate Service
Parameters:
  EnvironmentName:
    Type: String
    Default: production
  ContainerImage:
    Type: String
    Description: Docker image URI for the application
  ContainerPort:
    Type: Number
    Default: 8080
Resources:
  VPC:
    Type: AWS::EC2::VPC
    Properties:
      CidrBlock: 10.0.0.0/16
      EnableDnsHostnames: true
      EnableDnsSupport: true
      Tags:
        - Key: Name
          Value: !Sub "${EnvironmentName}-vpc"
  PublicSubnet1:
    Type: AWS::EC2::Subnet
    Properties:
      VpcId: !Ref VPC
      CidrBlock: 10.0.1.0/24
      AvailabilityZone: !Select [0, !GetAZs ""]
      MapPublicIpOnLaunch: true
  PublicSubnet2:
    Type: AWS::EC2::Subnet
    Properties:
      VpcId: !Ref VPC
      CidrBlock: 10.0.2.0/24
      AvailabilityZone: !Select [1, !GetAZs ""]
      MapPublicIpOnLaunch: true
  ECSCluster:
    Type: AWS::ECS::Cluster
    Properties:
      ClusterName: !Sub "${EnvironmentName}-cluster"
      ClusterSettings:
        - Name: containerInsights
          Value: enabled
  TaskDefinition:
    Type: AWS::ECS::TaskDefinition
    Properties:
      Family: !Sub "${EnvironmentName}-web-app"
      Cpu: 512
      Memory: 1024
      NetworkMode: awsvpc
      RequiresCompatibilities:
        - FARGATE
      ExecutionRoleArn: !GetAtt ECSExecutionRole.Arn
      TaskRoleArn: !GetAtt ECSTaskRole.Arn
      ContainerDefinitions:
        - Name: web-app
          Image: !Ref ContainerImage
          PortMappings:
            - ContainerPort: !Ref ContainerPort
          LogConfiguration:
            LogDriver: awslogs
            Options:
              awslogs-group: !Ref LogGroup
              awslogs-region: !Ref "AWS::Region"
              awslogs-stream-prefix: ecs
          HealthCheck:
            Command:
              - CMD-SHELL
              - !Sub "curl -f http://localhost:${ContainerPort}/healthz || exit 1"
            Interval: 30
            Timeout: 5
            Retries: 3
  Service:
    Type: AWS::ECS::Service
    Properties:
      Cluster: !Ref ECSCluster
      TaskDefinition: !Ref TaskDefinition
      DesiredCount: 3
      LaunchType: FARGATE
      NetworkConfiguration:
        AwsvpcConfiguration:
          Subnets:
            - !Ref PublicSubnet1
            - !Ref PublicSubnet2
          SecurityGroups:
            - !Ref ContainerSecurityGroup
          AssignPublicIp: ENABLED
      LoadBalancers:
        - ContainerName: web-app
          ContainerPort: !Ref ContainerPort
          TargetGroupArn: !Ref TargetGroup
Outputs:
  ClusterName:
    Value: !Ref ECSCluster
  ServiceURL:
    Value: !GetAtt ALB.DNSName
ECS vs EKS: Which to Choose?
| Feature | ECS (Elastic Container Service) | EKS (Elastic Kubernetes Service) |
|---|---|---|
| Complexity | Lower learning curve | Steeper, but industry standard |
| Pricing | No control plane cost | $0.10/hour for control plane |
| Portability | AWS-only | Multi-cloud via K8s |
| Integration | Deep AWS native integration | Good AWS + K8s ecosystem |
| Auto Scaling | Application Auto Scaling | HPA + Karpenter/Cluster Autoscaler |
| Service Mesh | AWS App Mesh | Istio, Linkerd, App Mesh |
| Best For | AWS-only shops, simpler workloads | Multi-cloud, complex orchestration |
Common Tools Across All Three Giants
Terraform: Infrastructure as Code
Terraform is widely used across the AWS and Google Cloud ecosystems, and in multi-cloud setups, alongside each provider's native IaC service. Here is a representative Terraform configuration:
# main.tf - Production Infrastructure
terraform {
  required_version = ">= 1.5.0"
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
  backend "s3" {
    bucket         = "my-terraform-state-prod"
    key            = "infrastructure/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-lock"
    encrypt        = true
  }
}
provider "aws" {
  region = var.aws_region
  default_tags {
    tags = {
      Environment = var.environment
      ManagedBy   = "terraform"
      Project     = var.project_name
    }
  }
}
module "vpc" {
  source               = "terraform-aws-modules/vpc/aws"
  version              = "5.1.0"
  name                 = "${var.project_name}-${var.environment}"
  cidr                 = "10.0.0.0/16"
  azs                  = ["${var.aws_region}a", "${var.aws_region}b", "${var.aws_region}c"]
  private_subnets      = ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"]
  public_subnets       = ["10.0.101.0/24", "10.0.102.0/24", "10.0.103.0/24"]
  enable_nat_gateway   = true
  single_nat_gateway   = var.environment != "production"
  enable_dns_hostnames = true
  enable_dns_support   = true
  public_subnet_tags = {
    "kubernetes.io/role/elb" = 1
  }
  private_subnet_tags = {
    "kubernetes.io/role/internal-elb" = 1
  }
}
module "eks" {
  source          = "terraform-aws-modules/eks/aws"
  version         = "19.15.0"
  cluster_name    = "${var.project_name}-${var.environment}"
  cluster_version = "1.28"
  vpc_id          = module.vpc.vpc_id
  subnet_ids      = module.vpc.private_subnets
  eks_managed_node_groups = {
    general = {
      desired_size   = 3
      min_size       = 2
      max_size       = 10
      instance_types = ["m5.xlarge"]
      capacity_type  = "ON_DEMAND"
    }
    spot = {
      desired_size   = 2
      min_size       = 1
      max_size       = 8
      instance_types = ["m5.large", "m5a.large", "m4.large"]
      capacity_type  = "SPOT"
      labels = {
        workload-type = "batch"
      }
      taints = [{
        key    = "spot"
        value  = "true"
        effect = "NO_SCHEDULE"
      }]
    }
  }
}
# variables.tf
variable "aws_region" {
  description = "AWS region"
  type        = string
  default     = "us-east-1"
}
variable "environment" {
  description = "Environment name"
  type        = string
  default     = "production"
}
variable "project_name" {
  description = "Project name"
  type        = string
  default     = "myapp"
}
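A common mistake when adapting a VPC layout like this one is choosing subnet CIDRs that overlap or fall outside the VPC range. Here is a quick Python sketch, using only the standard ipaddress module, that checks the six subnets from the vpc module against 10.0.0.0/16. The validate_subnets helper is hypothetical and is not part of Terraform or the VPC module.

```python
import ipaddress
from itertools import combinations
from typing import List, Sequence


def validate_subnets(vpc_cidr: str, subnet_cidrs: Sequence[str]) -> List[str]:
    """Return a list of human-readable problems (empty if the layout is fine)."""
    vpc = ipaddress.ip_network(vpc_cidr)
    subnets = [ipaddress.ip_network(cidr) for cidr in subnet_cidrs]
    problems: List[str] = []
    # Every subnet must sit inside the VPC range.
    for subnet in subnets:
        if not subnet.subnet_of(vpc):
            problems.append(f"{subnet} is outside {vpc}")
    # No two subnets may share addresses.
    for first, second in combinations(subnets, 2):
        if first.overlaps(second):
            problems.append(f"{first} overlaps {second}")
    return problems


if __name__ == "__main__":
    private = ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"]
    public = ["10.0.101.0/24", "10.0.102.0/24", "10.0.103.0/24"]
    print(validate_subnets("10.0.0.0/16", private + public))  # []
```

Running a check like this before terraform plan catches layout errors earlier than waiting for the AWS API to reject them.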
Prometheus + Grafana Monitoring Setup
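Monitoring is non-negotiable at scale. Each company runs its own metrics platform internally (Atlas at Netflix, Monarch at Google, CloudWatch at Amazon); Prometheus, which was inspired by Google's Borgmon, is the open-source option most teams can adopt. Here is a monitoring stack configuration: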
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: production
    region: us-east-1
rule_files:
  - "alerts/*.yml"
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093
scrape_configs:
  - job_name: "kubernetes-pods"
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      # Rewrite the scrape address to <pod IP>:<annotated port>; replacing it with the port alone would break the target
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        target_label: __address__
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
  - job_name: "node-exporter"
    static_configs:
      - targets: ["node-exporter:9100"]
  - job_name: "application"
    metrics_path: /metrics
    static_configs:
      - targets: ["web-app:8080"]
        labels:
          service: web-app
# alerts/application.yml
groups:
  - name: application-alerts
    rules:
      # Aggregate per instance so the 5xx rate is divided by the total rate, not by itself
      - alert: HighErrorRate
        expr: sum by (instance) (rate(http_requests_total{status=~"5.."}[5m])) / sum by (instance) (rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected on {{ $labels.instance }}"
          description: "Error rate is {{ $value | humanizePercentage }} (threshold: 5%)"
      - alert: HighLatency
        expr: histogram_quantile(0.95, sum by (le, instance) (rate(http_request_duration_seconds_bucket[5m]))) > 1.0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High p95 latency on {{ $labels.instance }}"
          description: "p95 latency is {{ $value }}s"
      - alert: PodCrashLooping
        expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Pod {{ $labels.pod }} is crash looping"
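The HighErrorRate rule boils down to simple arithmetic: the per-second rate of 5xx responses divided by the per-second rate of all responses over a five-minute window, compared against 5%. Here is a simplified Python sketch of that calculation. It works from two counter samples per series and ignores the counter-reset handling and extrapolation that Prometheus's rate() performs; all function names are hypothetical.

```python
def per_second_rate(first: float, last: float, window_seconds: float) -> float:
    """Average per-second increase of a counter across the window."""
    return max(last - first, 0.0) / window_seconds


def error_ratio(errors_start: float, errors_end: float,
                total_start: float, total_end: float,
                window_seconds: float = 300.0) -> float:
    """Share of requests that failed with a 5xx status during the window."""
    total_rate = per_second_rate(total_start, total_end, window_seconds)
    if total_rate == 0:
        return 0.0  # no traffic, so nothing to alert on
    error_rate = per_second_rate(errors_start, errors_end, window_seconds)
    return error_rate / total_rate


def should_fire(ratio: float, threshold: float = 0.05) -> bool:
    """The alert condition: strictly above the 5% threshold."""
    return ratio > threshold


if __name__ == "__main__":
    # 60 errors out of 1,000 requests in five minutes is a 6% error rate.
    ratio = error_ratio(100, 160, 10_000, 11_000)
    print(f"{ratio:.2%}", should_fire(ratio))
```

The rule's for: 5m clause adds one more condition that this sketch omits: the ratio has to stay above the threshold for five minutes before the alert fires.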
ELK Stack (Elasticsearch, Logstash, Kibana)
Centralized logging with the ELK stack is critical for observability:
# docker-compose.yml for ELK stack
version: "3.8"
services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.10.0
    environment:
      - discovery.type=single-node
      - xpack.security.enabled=true
      # Sets the password for the elastic superuser only; replace changeme before real use
      - ELASTIC_PASSWORD=changeme
      - "ES_JAVA_OPTS=-Xms2g -Xmx2g"
    volumes:
      - esdata:/usr/share/elasticsearch/data
    ports:
      - "9200:9200"
    healthcheck:
      test: curl -s http://localhost:9200 > /dev/null || exit 1
      interval: 30s
      timeout: 10s
      retries: 5
  logstash:
    image: docker.elastic.co/logstash/logstash:8.10.0
    volumes:
      # Pipeline files must supply Elasticsearch credentials in their output section, since security is enabled
      - ./logstash/pipeline:/usr/share/logstash/pipeline
    ports:
      - "5044:5044"
    depends_on:
      elasticsearch:
        condition: service_healthy
  kibana:
    image: docker.elastic.co/kibana/kibana:8.10.0
    environment:
      - ELASTICSEARCH_HOSTS=http://elasticsearch:9200
      # The kibana_system password is not set by ELASTIC_PASSWORD; set it separately (for example with elasticsearch-reset-password) and use that value here
      - ELASTICSEARCH_USERNAME=kibana_system
      - ELASTICSEARCH_PASSWORD=changeme
    ports:
      - "5601:5601"
    depends_on:
      elasticsearch:
        condition: service_healthy
volumes:
  esdata:
    driver: local
Implementing Enterprise DevOps on a Small Team Budget
You don't need a Netflix-sized budget to adopt enterprise-grade DevOps practices. Here is a practical roadmap for teams of 3-10 engineers:
Phase 1: Foundation (Week 1-2, Cost: ~$0/month)
- Version Control: GitHub Free or GitLab Community Edition
- CI/CD: GitHub Actions (2,000 free minutes/month) or GitLab CI
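- Containerization: Docker Desktop (free for small businesses under Docker's subscription terms)
- IaC: Terraform Cloud, now HCP Terraform (has a free tier; check the current limits, which have changed over time)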
Phase 2: Orchestration (Week 3-4, Cost: ~$75/month)
- Kubernetes: DigitalOcean Managed K8s ($12/month per node, minimum 1)
- Registry: Docker Hub free tier or GitHub Container Registry
- Secrets: HashiCorp Vault (open source) or Sealed Secrets for K8s
Phase 3: Observability (Week 5-6, Cost: ~$0-50/month)
- Monitoring: Prometheus + Grafana (both open source)
- Logging: Loki + Grafana (lighter than ELK, open source)
- Alerting: Alertmanager + PagerDuty free tier or Slack webhooks
Sample GitHub Actions Workflow (Free Tier)
# .github/workflows/deploy.yml
name: Build, Test, and Deploy
on:
  push:
    branches: [main]
  pull_request:
    branches: [main]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
          cache: npm
      - run: npm ci
      - run: npm test -- --coverage
      - uses: actions/upload-artifact@v4
        with:
          name: coverage
          path: coverage/
  build-and-push:
    needs: test
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    # GITHUB_TOKEN needs package write access to push to ghcr.io
    permissions:
      contents: read
      packages: write
    steps:
      - uses: actions/checkout@v4
      - uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - uses: docker/build-push-action@v5
        with:
          push: true
          tags: ghcr.io/${{ github.repository }}:${{ github.sha }}
  deploy:
    needs: build-and-push
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: azure/k8s-set-context@v3
        with:
          kubeconfig: ${{ secrets.KUBE_CONFIG }}
      # Backslashes keep the kubectl set image command on one logical line
      - run: |
          kubectl set image deployment/web-app \
            web-app=ghcr.io/${{ github.repository }}:${{ github.sha }} \
            -n production
          kubectl rollout status deployment/web-app -n production
Troubleshooting Common DevOps Issues
Problem: Kubernetes Pods Stuck in CrashLoopBackOff
Cause: Application crashes on startup due to missing environment variables, failed database connection, or out-of-memory errors.
Solution:
## Check pod logs
kubectl logs pod-name -n production --previous
## Describe the pod for events
kubectl describe pod pod-name -n production
## Check resource limits
kubectl top pod pod-name -n production
## Common fix: increase memory limits or fix env vars
kubectl edit deployment web-app -n production
Problem: Terraform State Lock Stuck
Cause: A previous terraform apply was interrupted, leaving a stale DynamoDB lock.
Solution:
## Force unlock (use the Lock ID from the error message)
terraform force-unlock LOCK_ID
## If using S3 backend, check DynamoDB for the lock
aws dynamodb scan --table-name terraform-lock
Problem: Docker Build Cache Causing Stale Images
Cause: Docker aggressively caches layers, sometimes serving outdated dependencies.
Solution:
## Build without cache
docker build --no-cache -t myapp:latest .
## Prune all unused images and cache (destructive: --volumes also deletes unused volumes and the data in them)
docker system prune -a --volumes
## Use BuildKit for smarter caching
DOCKER_BUILDKIT=1 docker build -t myapp:latest .
Quick Reference: DevOps Tool Comparison Cheat Sheet
| Category | Netflix | Google | Amazon | Open-Source Alternative |
|---|---|---|---|---|
| CD Platform | Spinnaker | Cloud Build | CodePipeline | ArgoCD, Flux |
| Container Orchestration | Titus (internal) | Kubernetes/GKE | ECS/EKS | K3s, Nomad |
| Service Mesh | Zuul + custom | Istio | App Mesh | Linkerd, Consul |
| Service Discovery | Eureka | K8s DNS | Cloud Map | Consul, CoreDNS |
| Chaos Engineering | Chaos Monkey | Internal tools | FIS | Litmus, Gremlin |
| IaC | Terraform | Deployment Manager | CloudFormation | Terraform, Pulumi |
| Monitoring | Atlas | Monarch (internal) | CloudWatch | Prometheus + Grafana |
| Logging | Custom ELK | Cloud Logging | CloudWatch Logs | Loki, ELK Stack |
| Tracing | Custom Zipkin | Cloud Trace | X-Ray | Jaeger, Zipkin |
Key Takeaways
- Netflix: Freedom and responsibility. Invest in resilience through chaos engineering. Their open-source tools (Spinnaker, Eureka, Zuul, Chaos Monkey) are production-proven at massive scale.
- Google: Invented modern container orchestration. Their SRE practices (error budgets, SLOs) should be adopted by every team. Kubernetes and Istio are industry standards.
- Amazon: Deep AWS integration matters. If you're all-in on AWS, their native tools offer the tightest integration. CloudFormation + CodePipeline + ECS is a powerful combination.
- Start small: You can replicate 80% of these practices using open-source tools and free tiers. Focus on CI/CD automation first, then containerization, then observability.