DevOps Tools Used by Netflix, Google, and Amazon

Introduction: Learning from the Best in DevOps

Netflix, Google, and Amazon aren't just tech giants — they're the pioneers of modern DevOps culture. Netflix deploys thousands of times per day. Google manages billions of containers weekly. Amazon pushes code to production every 11.7 seconds. These companies have built, open-sourced, and battle-tested the tools that the rest of us now rely on.

In this comprehensive guide, we'll dissect the exact DevOps toolchains used by each company, provide real configuration files you can adapt, and show you how to implement enterprise-grade DevOps practices — even on a small team budget.

The Netflix DevOps Stack

Netflix operates one of the most sophisticated cloud-native architectures in the world, serving over 230 million subscribers across 190+ countries. Their engineering culture is built on freedom and responsibility, and their DevOps tooling reflects that philosophy.

Spinnaker: Multi-Cloud Continuous Delivery

Spinnaker is Netflix's open-source continuous delivery platform, designed for releasing software changes with high velocity and confidence. Unlike simpler CI/CD tools, Spinnaker provides multi-cloud deployment, advanced deployment strategies, and deep integration with cloud providers.

Install Spinnaker using Halyard (the official CLI tool):

## Install Halyard on Ubuntu/Debian
curl -O https://raw.githubusercontent.com/spinnaker/halyard/master/install/debian/InstallHalyard.sh
sudo bash InstallHalyard.sh

## Verify installation
hal version list

## Configure the cloud provider (AWS example)
hal config provider aws account add my-aws-account \
    --account-id 123456789012 \
    --assume-role role/spinnakerManaged \
    --regions us-east-1,us-west-2

hal config provider aws enable

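## Set the deployment type (a distributed install runs on Kubernetes, so
## my-k8s-account must already be configured as a Kubernetes provider account)
hal config deploy edit --type distributed --account-name my-k8s-account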

## Deploy Spinnaker
hal deploy apply

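Here is an illustrative Spinnaker pipeline that bakes an image, runs a canary analysis, and then deploys to production. Spinnaker itself stores pipelines as JSON, so treat this YAML as a readable outline of the stages rather than a file to import as-is: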

application: my-web-app
name: production-deploy-pipeline
triggers:
  - type: docker
    registry: index.docker.io
    repository: myorg/my-web-app
    tag: "v.*"
    enabled: true

stages:
  - refId: "1"
    type: bake
    name: "Bake AMI"
    baseOs: ubuntu
    package: my-web-app
    regions:
      - us-east-1
    vmType: hvm
    storeType: ebs

  - refId: "2"
    requisiteStageRefIds: ["1"]
    type: canaryAnalysis
    name: "Canary Analysis"
    canaryConfig:
      lifetimeDuration: PT1H
      metricsAccountName: my-datadog
      scoreThresholds:
        marginal: 50
        pass: 75
      storageAccountName: my-s3

  - refId: "3"
    requisiteStageRefIds: ["2"]
    type: deploy
    name: "Deploy to Production"
    clusters:
      - account: my-aws-account
        application: my-web-app
        capacity:
          desired: 6
          max: 10
          min: 3
        cooldown: 300
        instanceType: m5.xlarge
        loadBalancers:
          - my-web-app-prod-lb
        provider: aws
        strategy: rollingpush
        subnetType: internal

  - refId: "4"
    requisiteStageRefIds: ["3"]
    type: webhook
    name: "Notify Slack"
    url: https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK
    method: POST
    payload: |
      {
        "channel": "#deployments",
        "text": "Production deployment of ${trigger.tag} completed successfully!"
      }
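
The scoreThresholds block is easier to reason about as a decision rule. Here is a minimal Python sketch, assuming a score at or above pass succeeds, a score below marginal fails, and anything in between is flagged as marginal for a person or a later analysis interval to settle. The function name and labels are hypothetical, and this is not Spinnaker's own judging code:

```python
def classify_canary_score(score: float, marginal: float = 50, passing: float = 75) -> str:
    """Map a 0-100 canary score onto pass / marginal / fail using two thresholds."""
    if not 0 <= score <= 100:
        raise ValueError("canary scores are expected to be between 0 and 100")
    if score >= passing:
        return "pass"
    if score >= marginal:
        return "marginal"
    return "fail"


for score in (80, 60, 40):
    print(score, classify_canary_score(score))
```

With the thresholds from the pipeline above, a score of 80 passes, 60 is marginal, and 40 fails.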

Chaos Monkey and the Simian Army

Netflix's Chaos Monkey randomly terminates production instances to ensure that engineers build resilient services. It's part of the broader Simian Army that includes Latency Monkey, Conformity Monkey, and Chaos Gorilla.

## Clone Chaos Monkey
git clone https://github.com/Netflix/chaosmonkey.git
cd chaosmonkey

## Build from source
go build -o chaosmonkey ./cmd/chaosmonkey

## Create configuration file
cat > chaosmonkey.toml <<EOF
[chaosmonkey]
enabled = true
schedule_enabled = true
accounts = ["my-aws-account"]

[chaosmonkey.scheduler]
frequency = 5
clustering_enabled = true

[database]
host = "localhost"
name = "chaosmonkey"
user = "chaosmonkey"
password = "secure-password-here"
port = 3306
EOF

## Apply the database schema migrations
chaosmonkey migrate

## Generate today's termination schedule
chaosmonkey schedule

## Manually trigger a termination (for testing); the account name is required
chaosmonkey terminate my-web-app my-aws-account --region=us-east-1
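
Chaos Monkey's core idea fits in a few lines: pick a random instance from each cluster and terminate it. The Python sketch below models only the selection step on hypothetical in-memory data. The cluster and instance names are made up, and the real tool reads its inventory from Spinnaker and applies schedules and opt-outs that are left out here:

```python
import random
from typing import Dict, List, Optional


def pick_victims(clusters: Dict[str, List[str]],
                 rng: Optional[random.Random] = None) -> Dict[str, str]:
    """Choose one random instance per non-empty cluster."""
    rng = rng or random.Random()
    return {
        name: rng.choice(instances)
        for name, instances in clusters.items()
        if instances  # empty clusters have nothing to terminate
    }


# Hypothetical inventory for illustration only.
inventory = {
    "my-web-app-prod": ["i-0a1", "i-0a2", "i-0a3"],
    "my-web-app-batch": ["i-0b1", "i-0b2"],
    "my-web-app-empty": [],
}
print(pick_victims(inventory, random.Random(7)))
```

Passing a seeded random.Random makes a run reproducible, which helps when testing the surrounding automation.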

Zuul Gateway and Eureka Service Discovery

Netflix built Zuul as their API gateway for dynamic routing, monitoring, resiliency, and security. Combined with Eureka for service discovery, these tools form the backbone of Netflix's microservices communication layer.

# application.yml for Zuul Gateway
server:
  port: 8080

spring:
  application:
    name: api-gateway

eureka:
  client:
    serviceUrl:
      defaultZone: http://localhost:8761/eureka/
    register-with-eureka: true
    fetch-registry: true

zuul:
  routes:
    user-service:
      path: /api/users/**
      serviceId: user-service
      stripPrefix: true
    order-service:
      path: /api/orders/**
      serviceId: order-service
      stripPrefix: true
    product-service:
      path: /api/products/**
      serviceId: product-service
      stripPrefix: true
  ribbon:
    eager-load:
      enabled: true
  host:
    connect-timeout-millis: 5000
    socket-timeout-millis: 10000

hystrix:
  command:
    default:
      execution:
        isolation:
          thread:
            timeoutInMilliseconds: 60000

# Eureka Server application.yml
server:
  port: 8761

eureka:
  instance:
    hostname: localhost
  client:
    registerWithEureka: false
    fetchRegistry: false
    serviceUrl:
      defaultZone: http://${eureka.instance.hostname}:${server.port}/eureka/
  server:
    enable-self-preservation: false
    eviction-interval-timer-in-ms: 5000
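
To make the effect of stripPrefix concrete, here is a small Python sketch of how the routes above map an incoming path to a service and a forwarded path. It is a simplified model that assumes plain prefix matching; the resolve function is hypothetical and ignores Eureka lookups, load balancing, and filters:

```python
from typing import Optional, Tuple

# Route prefixes taken from the zuul.routes section above (the /** wildcard is implied).
ROUTES = {
    "/api/users": "user-service",
    "/api/orders": "order-service",
    "/api/products": "product-service",
}


def resolve(path: str, strip_prefix: bool = True) -> Optional[Tuple[str, str]]:
    """Return (serviceId, path sent downstream), or None if no route matches."""
    for prefix, service_id in ROUTES.items():
        if path.startswith(prefix + "/"):
            forwarded = path[len(prefix):] if strip_prefix else path
            return service_id, forwarded
    return None


print(resolve("/api/users/42"))        # ('user-service', '/42')
print(resolve("/api/orders/7/items"))  # ('order-service', '/7/items')
print(resolve("/health"))              # None
```

With stripPrefix enabled, user-service sees /42 rather than /api/users/42, so the downstream service does not need to know the gateway's URL layout.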

The Google DevOps Stack

Google has said it launches more than 2 billion containers per week using its internal Borg system. Its DevOps philosophy gave birth to Site Reliability Engineering (SRE), and its open-source contributions have fundamentally shaped the industry.

Kubernetes: From Borg to K8s

Kubernetes (K8s) was born from Google's internal Borg system, which has been managing containerized workloads since 2003. When Google decided to share this orchestration knowledge with the world, they rebuilt the concepts into Kubernetes and donated it to the CNCF.

Set up a Google Kubernetes Engine (GKE) cluster:

## Install Google Cloud SDK
curl https://sdk.cloud.google.com | bash
exec -l $SHELL
gcloud init

## Create a GKE cluster
gcloud container clusters create production-cluster \
    --zone us-central1-a \
    --num-nodes 3 \
    --machine-type e2-standard-4 \
    --enable-autoscaling \
    --min-nodes 2 \
    --max-nodes 10 \
    --enable-autorepair \
    --enable-autoupgrade \
    --enable-network-policy \
    --enable-vertical-pod-autoscaling \
    --release-channel regular

## Get credentials
gcloud container clusters get-credentials production-cluster \
    --zone us-central1-a

## Verify cluster
kubectl get nodes
kubectl cluster-info

Here is a production-ready Kubernetes deployment manifest:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
  namespace: production
  labels:
    app: web-app
    version: v2.1.0
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web-app
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  template:
    metadata:
      labels:
        app: web-app
        version: v2.1.0
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8080"
    spec:
      serviceAccountName: web-app-sa
      containers:
        - name: web-app
          image: gcr.io/my-project/web-app:v2.1.0
          ports:
            - containerPort: 8080
              name: http
          resources:
            requests:
              cpu: 250m
              memory: 256Mi
            limits:
              cpu: 500m
              memory: 512Mi
          livenessProbe:
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /ready
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 5
          env:
            - name: DB_HOST
              valueFrom:
                secretKeyRef:
                  name: db-credentials
                  key: host
            - name: DB_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: db-credentials
                  key: password
---
apiVersion: v1
kind: Service
metadata:
  name: web-app-service
  namespace: production
spec:
  selector:
    app: web-app
  ports:
    - protocol: TCP
      port: 80
      targetPort: 8080
  type: ClusterIP
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-app-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
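
The HorizontalPodAutoscaler above follows the formula documented for Kubernetes: desiredReplicas = ceil(currentReplicas * currentMetricValue / targetMetricValue), evaluated per metric, with the largest result winning and the outcome clamped to minReplicas and maxReplicas. Here is a rough Python sketch of that arithmetic, assuming the default 10% tolerance; it leaves out stabilization windows, unready pods, and scaling policies:

```python
import math
from typing import Dict

TARGETS = {"cpu": 70, "memory": 80}  # averageUtilization targets from the manifest


def desired_replicas(current: int, utilization: Dict[str, float],
                     targets: Dict[str, float] = TARGETS,
                     min_replicas: int = 3, max_replicas: int = 20,
                     tolerance: float = 0.1) -> int:
    """Estimate the replica count the HPA would request for the given utilization."""
    proposals = []
    for metric, target in targets.items():
        ratio = utilization[metric] / target
        if abs(ratio - 1.0) <= tolerance:
            proposals.append(current)  # close enough to target: no change
        else:
            proposals.append(math.ceil(current * ratio))
    # The largest proposal wins, then the min/max bounds apply.
    return max(min_replicas, min(max_replicas, max(proposals)))


print(desired_replicas(3, {"cpu": 140, "memory": 40}))   # 6
print(desired_replicas(10, {"cpu": 700, "memory": 80}))  # 20
```

In the first call CPU runs at twice its target, so three replicas become six even though memory alone would allow scaling down.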

Cloud Build CI/CD

Google Cloud Build is Google's fully managed CI/CD platform. It executes builds on Google's infrastructure, scales automatically, and integrates seamlessly with GKE.

# cloudbuild.yaml
steps:
  # Run tests
  - name: 'node:18'
    entrypoint: npm
    args: ['install']
    
  - name: 'node:18'
    entrypoint: npm
    args: ['test']
    env:
      - 'NODE_ENV=test'

  # Build Docker image
  - name: 'gcr.io/cloud-builders/docker'
    args:
      - 'build'
      - '-t'
      - 'gcr.io/$PROJECT_ID/web-app:$SHORT_SHA'
      - '-t'
      - 'gcr.io/$PROJECT_ID/web-app:latest'
      - '.'

  # Push to GCR
  - name: 'gcr.io/cloud-builders/docker'
    args: ['push', 'gcr.io/$PROJECT_ID/web-app:$SHORT_SHA']

  # Deploy to GKE
  - name: 'gcr.io/cloud-builders/gke-deploy'
    args:
      - 'run'
      - '--filename=k8s/'
      - '--image=gcr.io/$PROJECT_ID/web-app:$SHORT_SHA'
      - '--location=us-central1-a'
      - '--cluster=production-cluster'

timeout: 1200s
options:
  machineType: E2_HIGHCPU_8

Istio Service Mesh Configuration

Istio, co-developed by Google, IBM, and Lyft, provides traffic management, security, and observability for microservices. Here is a practical configuration:

## Install Istio
curl -L https://istio.io/downloadIstio | sh -
cd istio-*
export PATH=$PWD/bin:$PATH

## Install Istio with the default profile (the one Istio recommends for production)
istioctl install --set profile=default -y

## Enable sidecar injection for a namespace
kubectl label namespace production istio-injection=enabled

# Istio VirtualService for traffic splitting (Canary)
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: web-app-vs
  namespace: production
spec:
  hosts:
    - web-app
  http:
    - match:
        - headers:
            x-canary:
              exact: "true"
      route:
        - destination:
            host: web-app
            subset: canary
    - route:
        - destination:
            host: web-app
            subset: stable
          weight: 90
        - destination:
            host: web-app
            subset: canary
          weight: 10
---
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: web-app-dr
  namespace: production
spec:
  host: web-app
  trafficPolicy:
    connectionPool:
      http:
        h2UpgradePolicy: DEFAULT
        http1MaxPendingRequests: 1024
        http2MaxRequests: 1024
      tcp:
        maxConnections: 100
    outlierDetection:
      consecutive5xxErrors: 3
      interval: 30s
      baseEjectionTime: 30s
  subsets:
    - name: stable
      labels:
        version: v2.0.0
    - name: canary
      labels:
        version: v2.1.0
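
The routing rules above are evaluated in order: a request carrying the x-canary: true header always goes to the canary subset, and everything else is split 90/10 by weight. The Python sketch below simulates that decision. It is an illustrative model only, and choose_subset is a hypothetical helper rather than anything Istio exposes:

```python
import random
from typing import Dict


def choose_subset(headers: Dict[str, str], rng: random.Random) -> str:
    """Pick a destination subset the way the VirtualService rules would."""
    # First rule: exact header match sends the request straight to the canary.
    if headers.get("x-canary") == "true":
        return "canary"
    # Fallback rule: weighted split between stable (90) and canary (10).
    return rng.choices(["stable", "canary"], weights=[90, 10])[0]


rng = random.Random(1)
picks = [choose_subset({}, rng) for _ in range(10_000)]
print("canary share:", picks.count("canary") / len(picks))
print(choose_subset({"x-canary": "true"}, rng))  # canary
```

Over many requests the canary share settles near 10%, while testers can opt in deterministically with the header.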

The Amazon DevOps Stack

Amazon was arguably the first company to fully embrace microservices and DevOps, driven by the famous "two-pizza team" mandate from Jeff Bezos. Their AWS platform now powers a massive share of the internet's infrastructure.

AWS CodePipeline + CodeBuild

AWS CodePipeline orchestrates the entire release process, while CodeBuild handles the actual build and test execution. Here is a complete buildspec.yml configuration:

# buildspec.yml for AWS CodeBuild
version: 0.2

env:
  variables:
    NODE_ENV: "production"
    AWS_DEFAULT_REGION: "us-east-1"
  parameter-store:
    DB_PASSWORD: "/myapp/production/db-password"
    API_KEY: "/myapp/production/api-key"

phases:
  install:
    runtime-versions:
      nodejs: 18
    commands:
      - echo "Installing dependencies..."
      - npm ci --production=false
      
  pre_build:
    commands:
      - echo "Running linter and type checking..."
      - npm run lint
      - npm run type-check
      # AWS_ACCOUNT_ID is not provided by CodeBuild; define it as a project environment variable
      - echo "Logging in to Amazon ECR..."
      - aws ecr get-login-password --region $AWS_DEFAULT_REGION | docker login --username AWS --password-stdin $AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com
      - REPOSITORY_URI=$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com/my-web-app
      - COMMIT_HASH=$(echo $CODEBUILD_RESOLVED_SOURCE_VERSION | cut -c 1-7)
      - IMAGE_TAG=${COMMIT_HASH:-latest}
      
  build:
    commands:
      - echo "Running tests..."
      - npm test -- --coverage --ci
      - echo "Building Docker image..."
      - docker build -t $REPOSITORY_URI:latest .
      - docker tag $REPOSITORY_URI:latest $REPOSITORY_URI:$IMAGE_TAG
      
  post_build:
    commands:
      - echo "Pushing Docker image..."
      - docker push $REPOSITORY_URI:latest
      - docker push $REPOSITORY_URI:$IMAGE_TAG
      - echo "Writing image definitions file..."
      - printf '[{"name":"web-app","imageUri":"%s"}]' $REPOSITORY_URI:$IMAGE_TAG > imagedefinitions.json

artifacts:
  files:
    - imagedefinitions.json
    - appspec.yml
    - taskdef.json

reports:
  coverage-report:
    files:
      - "coverage/clover.xml"
    file-format: CLOVERXML

cache:
  paths:
    - 'node_modules/**/*'

AWS CloudFormation Templates

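CloudFormation is Amazon's Infrastructure as Code service. Here is an abridged template that provisions a VPC with two public subnets and an ECS Fargate service. To keep it short, it omits the internet gateway and routing as well as the resources it references but does not define (ECSExecutionRole, ECSTaskRole, LogGroup, ContainerSecurityGroup, TargetGroup, and ALB); add those before deploying the stack: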

AWSTemplateFormatVersion: '2010-09-09'
Description: Production VPC with ECS Fargate Service

Parameters:
  EnvironmentName:
    Type: String
    Default: production
  ContainerImage:
    Type: String
    Description: Docker image URI for the application
  ContainerPort:
    Type: Number
    Default: 8080

Resources:
  VPC:
    Type: AWS::EC2::VPC
    Properties:
      CidrBlock: 10.0.0.0/16
      EnableDnsHostnames: true
      EnableDnsSupport: true
      Tags:
        - Key: Name
          Value: !Sub "${EnvironmentName}-vpc"

  PublicSubnet1:
    Type: AWS::EC2::Subnet
    Properties:
      VpcId: !Ref VPC
      CidrBlock: 10.0.1.0/24
      AvailabilityZone: !Select [0, !GetAZs ""]
      MapPublicIpOnLaunch: true

  PublicSubnet2:
    Type: AWS::EC2::Subnet
    Properties:
      VpcId: !Ref VPC
      CidrBlock: 10.0.2.0/24
      AvailabilityZone: !Select [1, !GetAZs ""]
      MapPublicIpOnLaunch: true

  ECSCluster:
    Type: AWS::ECS::Cluster
    Properties:
      ClusterName: !Sub "${EnvironmentName}-cluster"
      ClusterSettings:
        - Name: containerInsights
          Value: enabled

  TaskDefinition:
    Type: AWS::ECS::TaskDefinition
    Properties:
      Family: !Sub "${EnvironmentName}-web-app"
      Cpu: 512
      Memory: 1024
      NetworkMode: awsvpc
      RequiresCompatibilities:
        - FARGATE
      ExecutionRoleArn: !GetAtt ECSExecutionRole.Arn
      TaskRoleArn: !GetAtt ECSTaskRole.Arn
      ContainerDefinitions:
        - Name: web-app
          Image: !Ref ContainerImage
          PortMappings:
            - ContainerPort: !Ref ContainerPort
          LogConfiguration:
            LogDriver: awslogs
            Options:
              awslogs-group: !Ref LogGroup
              awslogs-region: !Ref "AWS::Region"
              awslogs-stream-prefix: ecs
          HealthCheck:
            Command:
              - CMD-SHELL
              - !Sub "curl -f http://localhost:${ContainerPort}/healthz || exit 1"
            Interval: 30
            Timeout: 5
            Retries: 3

  Service:
    Type: AWS::ECS::Service
    Properties:
      Cluster: !Ref ECSCluster
      TaskDefinition: !Ref TaskDefinition
      DesiredCount: 3
      LaunchType: FARGATE
      NetworkConfiguration:
        AwsvpcConfiguration:
          Subnets:
            - !Ref PublicSubnet1
            - !Ref PublicSubnet2
          SecurityGroups:
            - !Ref ContainerSecurityGroup
          AssignPublicIp: ENABLED
      LoadBalancers:
        - ContainerName: web-app
          ContainerPort: !Ref ContainerPort
          TargetGroupArn: !Ref TargetGroup

Outputs:
  ClusterName:
    Value: !Ref ECSCluster
  ServiceURL:
    Value: !GetAtt ALB.DNSName

ECS vs EKS: Which to Choose?

Feature	ECS (Elastic Container Service)	EKS (Elastic Kubernetes Service)
Complexity	Lower learning curve	Steeper, but industry standard
Pricing	No control plane cost	$0.10/hour for control plane
Portability	AWS-only	Multi-cloud via K8s
Integration	Deep AWS native integration	Good AWS + K8s ecosystem
Auto Scaling	Application Auto Scaling	HPA + Karpenter/Cluster Autoscaler
Service Mesh	AWS App Mesh	Istio, Linkerd, App Mesh
Best For	AWS-only shops, simpler workloads	Multi-cloud, complex orchestration
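
The $0.10/hour EKS control plane fee is easier to compare as a monthly figure. Here is a quick Python calculation, assuming an average month of 730 hours and the rate quoted in the table (check current AWS pricing before budgeting):

```python
HOURS_PER_MONTH = 730  # 365 days * 24 hours / 12 months


def monthly_control_plane_cost(hourly_rate: float = 0.10, clusters: int = 1) -> float:
    """Approximate monthly EKS control plane cost in US dollars."""
    return round(hourly_rate * HOURS_PER_MONTH * clusters, 2)


print(monthly_control_plane_cost())            # 73.0
print(monthly_control_plane_cost(clusters=3))  # 219.0
```

At roughly $73 per cluster per month before any worker nodes, separate EKS clusters for dev, staging, and production add up quickly for a small team.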

Common Tools Across All Three Giants

Terraform: Infrastructure as Code

Terraform is the common denominator for multi-cloud infrastructure management across these companies' ecosystems, with mature providers for both AWS and Google Cloud. Here is an example Terraform configuration for an AWS VPC and EKS cluster:

# main.tf - Production Infrastructure
terraform {
  required_version = ">= 1.5.0"
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
  backend "s3" {
    bucket         = "my-terraform-state-prod"
    key            = "infrastructure/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-lock"
    encrypt        = true
  }
}

provider "aws" {
  region = var.aws_region
  default_tags {
    tags = {
      Environment = var.environment
      ManagedBy   = "terraform"
      Project     = var.project_name
    }
  }
}

module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "5.1.0"

  name = "${var.project_name}-${var.environment}"
  cidr = "10.0.0.0/16"

  azs             = ["${var.aws_region}a", "${var.aws_region}b", "${var.aws_region}c"]
  private_subnets = ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"]
  public_subnets  = ["10.0.101.0/24", "10.0.102.0/24", "10.0.103.0/24"]

  enable_nat_gateway     = true
  single_nat_gateway     = var.environment != "production"
  enable_dns_hostnames   = true
  enable_dns_support     = true

  public_subnet_tags = {
    "kubernetes.io/role/elb" = 1
  }
  private_subnet_tags = {
    "kubernetes.io/role/internal-elb" = 1
  }
}

module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "19.15.0"

  cluster_name    = "${var.project_name}-${var.environment}"
  cluster_version = "1.28"

  vpc_id     = module.vpc.vpc_id
  subnet_ids = module.vpc.private_subnets

  eks_managed_node_groups = {
    general = {
      desired_size = 3
      min_size     = 2
      max_size     = 10
      instance_types = ["m5.xlarge"]
      capacity_type  = "ON_DEMAND"
    }
    spot = {
      desired_size = 2
      min_size     = 1
      max_size     = 8
      instance_types = ["m5.large", "m5a.large", "m4.large"]
      capacity_type  = "SPOT"
      labels = {
        workload-type = "batch"
      }
      taints = [{
        key    = "spot"
        value  = "true"
        effect = "NO_SCHEDULE"
      }]
    }
  }
}

# variables.tf
variable "aws_region" {
  description = "AWS region"
  type        = string
  default     = "us-east-1"
}

variable "environment" {
  description = "Environment name"
  type        = string
  default     = "production"
}

variable "project_name" {
  description = "Project name"
  type        = string
  default     = "myapp"
}
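
Subnet CIDR mistakes are a common reason a terraform apply fails late. Before applying, a short Python check can confirm that every subnet in the vpc module sits inside the VPC range and that none overlap. This is a standalone sketch using the CIDRs from the configuration above; check_subnets is a hypothetical helper, not part of Terraform:

```python
import ipaddress
from itertools import combinations
from typing import List


def check_subnets(vpc_cidr: str, subnet_cidrs: List[str]) -> List[str]:
    """Return a list of human-readable problems; an empty list means the layout is valid."""
    vpc = ipaddress.ip_network(vpc_cidr)
    subnets = [ipaddress.ip_network(cidr) for cidr in subnet_cidrs]
    problems = []
    for subnet in subnets:
        if not subnet.subnet_of(vpc):
            problems.append(f"{subnet} is outside {vpc}")
    for a, b in combinations(subnets, 2):
        if a.overlaps(b):
            problems.append(f"{a} overlaps {b}")
    return problems


private = ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"]
public = ["10.0.101.0/24", "10.0.102.0/24", "10.0.103.0/24"]
print(check_subnets("10.0.0.0/16", private + public))  # []
```

An empty list means the six subnets fit inside 10.0.0.0/16 without colliding.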

Prometheus + Grafana Monitoring Setup

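Monitoring is non-negotiable at scale. Each company runs its own metrics platform internally (Atlas at Netflix, Monarch at Google, CloudWatch at Amazon), and Prometheus with Grafana is the open-source stack most teams use to get comparable visibility. Here is a monitoring stack setup: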

# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: production
    region: us-east-1

rule_files:
  - "alerts/*.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

scrape_configs:
  - job_name: "kubernetes-pods"
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        target_label: __address__
        # Keep the pod IP and swap in the annotated port
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2

  - job_name: "node-exporter"
    static_configs:
      - targets: ["node-exporter:9100"]

  - job_name: "application"
    metrics_path: /metrics
    static_configs:
      - targets: ["web-app:8080"]
        labels:
          service: web-app
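
Relabeling is the least intuitive part of a Prometheus config. For a replace action, Prometheus joins the source_labels values with ";", matches the result against a fully anchored regex, and writes the replacement into target_label. The Python sketch below mimics that for a rule of the kind commonly used to point __address__ at the port named in the prometheus.io/port annotation. The regex and function name are illustrative, and IPv6 addresses are not handled:

```python
import re

# Capture the host, drop any existing port, and capture the annotated port.
ADDRESS_RULE = re.compile(r"([^:]+)(?::\d+)?;(\d+)")


def rewrite_address(address: str, port_annotation: str) -> str:
    """Return the scrape address after applying the replace rule."""
    joined = f"{address};{port_annotation}"  # source label values joined by ';'
    match = ADDRESS_RULE.fullmatch(joined)   # Prometheus anchors the regex
    if match is None:
        return address                       # no match: the label is left unchanged
    return f"{match.group(1)}:{match.group(2)}"


print(rewrite_address("10.0.0.5:9090", "8080"))  # 10.0.0.5:8080
print(rewrite_address("10.0.0.5", "8080"))       # 10.0.0.5:8080
print(rewrite_address("10.0.0.5:9090", ""))      # 10.0.0.5:9090
```

The last call shows why the anchored match matters: a pod without the port annotation keeps its discovered address.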

# alerts/application.yml
groups:
  - name: application-alerts
    rules:
      - alert: HighErrorRate
        expr: sum by (instance) (rate(http_requests_total{status=~"5.."}[5m])) / sum by (instance) (rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected on {{ $labels.instance }}"
          description: "Error rate is {{ $value | humanizePercentage }} (threshold: 5%)"

      - alert: HighLatency
        expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 1.0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High p95 latency on {{ $labels.instance }}"
          description: "p95 latency is {{ $value }}s"

      - alert: PodCrashLooping
        expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Pod {{ $labels.pod }} is crash looping"
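
The for: field in these rules is what keeps a brief spike from paging anyone: an alert stays pending until its expression has been true for the whole duration, and only then fires. Here is a simplified Python model of that state machine, assuming evenly spaced evaluations; alert_states is a hypothetical helper and the sample values are invented:

```python
from typing import List, Tuple


def alert_states(samples: List[Tuple[int, float]],
                 threshold: float = 0.05,
                 hold_seconds: int = 300) -> List[str]:
    """Return inactive / pending / firing for each (timestamp_seconds, value) sample."""
    states = []
    breach_started = None
    for timestamp, value in samples:
        if value > threshold:
            if breach_started is None:
                breach_started = timestamp  # first evaluation above the threshold
            held = timestamp - breach_started
            states.append("firing" if held >= hold_seconds else "pending")
        else:
            breach_started = None           # condition cleared: reset the timer
            states.append("inactive")
    return states


# One evaluation per minute; the error ratio sits at 8% from t=60s to t=360s.
ratios = [0.01, 0.08, 0.08, 0.08, 0.08, 0.08, 0.08, 0.01]
samples = [(i * 60, ratio) for i, ratio in enumerate(ratios)]
print(alert_states(samples))
```

The HighErrorRate thresholds used here (5% for 5 minutes) produce five pending evaluations before the alert fires at the 360-second mark, and it clears as soon as the ratio drops.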

ELK Stack (Elasticsearch, Logstash, Kibana)

Centralized logging with the ELK stack is critical for observability:

# docker-compose.yml for ELK stack
version: "3.8"
services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.10.0
    environment:
      - discovery.type=single-node
      - xpack.security.enabled=true
      - ELASTIC_PASSWORD=changeme
      - "ES_JAVA_OPTS=-Xms2g -Xmx2g"
    volumes:
      - esdata:/usr/share/elasticsearch/data
    ports:
      - "9200:9200"
    healthcheck:
      test: curl -s http://localhost:9200 > /dev/null || exit 1
      interval: 30s
      timeout: 10s
      retries: 5

  logstash:
    image: docker.elastic.co/logstash/logstash:8.10.0
    volumes:
      - ./logstash/pipeline:/usr/share/logstash/pipeline
    ports:
      - "5044:5044"
    depends_on:
      elasticsearch:
        condition: service_healthy

  kibana:
    image: docker.elastic.co/kibana/kibana:8.10.0
    environment:
      - ELASTICSEARCH_HOSTS=http://elasticsearch:9200
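      # ELASTIC_PASSWORD only sets the "elastic" superuser's password; set the
      # kibana_system password separately (for example via the change-password API)
      - ELASTICSEARCH_USERNAME=kibana_system
      - ELASTICSEARCH_PASSWORD=changeme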
    ports:
      - "5601:5601"
    depends_on:
      elasticsearch:
        condition: service_healthy

volumes:
  esdata:
    driver: local

Implementing Enterprise DevOps on a Small Team Budget

You don't need a Netflix-sized budget to adopt enterprise-grade DevOps practices. Here is a practical roadmap for teams of 3-10 engineers:

Phase 1: Foundation (Week 1-2, Cost: ~$0/month)

Version Control: GitHub Free or GitLab Community Edition
CI/CD: GitHub Actions (2,000 free minutes/month) or GitLab CI
Containerization: Docker Desktop (free for companies with fewer than 250 employees and under $10 million in annual revenue)
IaC: Terraform Cloud, now HCP Terraform (free tier; check current limits, which HashiCorp has changed from a per-user cap to a managed-resource cap)

Phase 2: Orchestration (Week 3-4, Cost: ~$75/month)

Kubernetes: DigitalOcean Managed K8s ($12/month per node, minimum 1)
Registry: Docker Hub free tier or GitHub Container Registry
Secrets: HashiCorp Vault (open source) or Sealed Secrets for K8s

Phase 3: Observability (Week 5-6, Cost: ~$0-50/month)

Monitoring: Prometheus + Grafana (both open source)
Logging: Loki + Grafana (lighter than ELK, open source)
Alerting: Alertmanager + PagerDuty free tier or Slack webhooks

Sample GitHub Actions Workflow (Free Tier)

# .github/workflows/deploy.yml
name: Build, Test, and Deploy
on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
          cache: npm
      - run: npm ci
      - run: npm test -- --coverage
      - uses: actions/upload-artifact@v4
        with:
          name: coverage
          path: coverage/

  build-and-push:
    needs: test
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write
    steps:
      - uses: actions/checkout@v4
      - uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - uses: docker/build-push-action@v5
        with:
          push: true
          tags: ghcr.io/${{ github.repository }}:${{ github.sha }}

  deploy:
    needs: build-and-push
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: azure/k8s-set-context@v3
        with:
          kubeconfig: ${{ secrets.KUBE_CONFIG }}
      - run: |
          kubectl set image deployment/web-app \
            web-app=ghcr.io/${{ github.repository }}:${{ github.sha }} \
            -n production
          kubectl rollout status deployment/web-app -n production
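
The needs and if keys in this workflow decide which jobs run for a given event. The Python sketch below models that, assuming every job that starts also succeeds and that a job is skipped when any job it needs was skipped (the default behaviour without if: always()). It uses graphlib, which requires Python 3.9 or later, and jobs_that_run is a hypothetical helper:

```python
from graphlib import TopologicalSorter
from typing import Dict, List

# Job dependencies copied from the workflow's needs: keys.
NEEDS: Dict[str, List[str]] = {
    "test": [],
    "build-and-push": ["test"],
    "deploy": ["build-and-push"],
}


def jobs_that_run(ref: str) -> List[str]:
    """List the jobs that execute for a Git ref, in dependency order."""
    ran: List[str] = []
    for job in TopologicalSorter(NEEDS).static_order():
        upstream_ok = all(dep in ran for dep in NEEDS[job])
        # Only build-and-push carries an if: condition in the workflow.
        condition_ok = job != "build-and-push" or ref == "refs/heads/main"
        if upstream_ok and condition_ok:
            ran.append(job)
    return ran


print(jobs_that_run("refs/heads/main"))     # ['test', 'build-and-push', 'deploy']
print(jobs_that_run("refs/pull/12/merge"))  # ['test']
```

Pull requests therefore only run the tests, while a push to main runs the full chain through to deployment.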

Troubleshooting Common DevOps Issues

Problem: Kubernetes Pods Stuck in CrashLoopBackOff

Cause: Application crashes on startup due to missing environment variables, failed database connection, or out-of-memory errors.

Solution:

## Check pod logs
kubectl logs pod-name -n production --previous

## Describe the pod for events
kubectl describe pod pod-name -n production

## Check resource limits
kubectl top pod pod-name -n production

## Common fix: increase memory limits or fix env vars
kubectl edit deployment web-app -n production
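
The name CrashLoopBackOff refers to the delay the kubelet inserts between restarts. Kubernetes documents it as an exponential back-off (10s, 20s, 40s, and so on) capped at five minutes. Here is a small Python sketch of that sequence; the generator is illustrative and ignores the reset that happens after a container runs cleanly for a while:

```python
from itertools import islice
from typing import Iterator


def restart_delays(initial: int = 10, cap: int = 300) -> Iterator[int]:
    """Yield the wait in seconds before each successive container restart."""
    delay = initial
    while True:
        yield delay
        delay = min(delay * 2, cap)  # double each time, never beyond the cap


print(list(islice(restart_delays(), 7)))  # [10, 20, 40, 80, 160, 300, 300]
```

After about five crashes the pod restarts only once every five minutes, so a fix can take that long to show up unless you delete the pod or roll the deployment.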

Problem: Terraform State Lock Stuck

Cause: A previous terraform apply was interrupted, leaving a stale DynamoDB lock.

Solution:

## Force unlock (use the Lock ID from the error message)
terraform force-unlock LOCK_ID

## If using S3 backend, check DynamoDB for the lock
aws dynamodb scan --table-name terraform-lock

Problem: Docker Build Cache Causing Stale Images

Cause: Docker aggressively caches layers, sometimes serving outdated dependencies.

Solution:

## Build without cache
docker build --no-cache -t myapp:latest .

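## Prune all unused images, containers, networks, and build cache
## (--volumes also deletes unused volumes, so check for data you still need)
docker system prune -a --volumes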

## Use BuildKit for smarter caching
DOCKER_BUILDKIT=1 docker build -t myapp:latest .
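
It helps to picture why the cache goes stale. In a simplified model, each layer's cache key depends on the previous layer's key, the instruction text, and a fingerprint of any files it copies. The Python sketch below implements that model. It is an approximation for intuition, not Docker's actual algorithm, and the instructions and fingerprints are invented:

```python
import hashlib
from typing import List, Tuple


def layer_keys(steps: List[Tuple[str, str]]) -> List[str]:
    """Compute a chained cache key for each (instruction, file_fingerprint) step."""
    keys = []
    parent = ""
    for instruction, fingerprint in steps:
        material = f"{parent}|{instruction}|{fingerprint}"
        parent = hashlib.sha256(material.encode()).hexdigest()
        keys.append(parent)
    return keys


before = [
    ("FROM node:20", ""),
    ("COPY package.json .", "pkg-v1"),
    ("RUN npm ci", ""),       # key ignores what the registry serves today
    ("COPY . .", "src-v1"),
]
after = before[:3] + [("COPY . .", "src-v2")]  # only the source code changed

reused = [a == b for a, b in zip(layer_keys(before), layer_keys(after))]
print(reused)  # [True, True, True, False]
```

The RUN npm ci layer is reused because nothing it depends on changed from Docker's point of view, even if newer packages have been published since. That is the stale-dependency case --no-cache works around.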

Quick Reference: DevOps Tool Comparison Cheat Sheet

Category	Netflix	Google	Amazon	Open-Source Alternative
CD Platform	Spinnaker	Cloud Build	CodePipeline	ArgoCD, Flux
Container Orchestration	Titus (internal)	Kubernetes/GKE	ECS/EKS	K3s, Nomad
Service Mesh	Zuul + custom	Istio	App Mesh	Linkerd, Consul
Service Discovery	Eureka	K8s DNS	Cloud Map	Consul, CoreDNS
Chaos Engineering	Chaos Monkey	Internal tools	FIS	Litmus, Chaos Mesh
IaC	Terraform	Deployment Manager	CloudFormation	Terraform, Pulumi
Monitoring	Atlas	Monarch (internal)	CloudWatch	Prometheus + Grafana
Logging	Custom ELK	Cloud Logging	CloudWatch Logs	Loki, ELK Stack
Tracing	Custom Zipkin	Cloud Trace	X-Ray	Jaeger, Zipkin

Key Takeaways

Netflix: Freedom and responsibility. Invest in resilience through chaos engineering. Their open-source tools (Spinnaker, Eureka, Zuul, Chaos Monkey) are production-proven at massive scale.
Google: Invented modern container orchestration. Their SRE practices (error budgets, SLOs) should be adopted by every team. Kubernetes and Istio are industry standards.
Amazon: Deep AWS integration matters. If you're all-in on AWS, their native tools offer the tightest integration. CloudFormation + CodePipeline + ECS is a powerful combination.
Start small: You can replicate 80% of these practices using open-source tools and free tiers. Focus on CI/CD automation first, then containerization, then observability.