Upgrading EKS Clusters!

Upgrades are always fun

Upgrading EKS the hard way!

Let me set the scene: 12 EKS clusters, spread across different clients, all running Kubernetes 1.22, and a mountain of potential complexity waiting to explode. This is the story of how I turned a potential nightmare into a surgically precise upgrade operation.

In previous posts I talked about the managed Kubernetes services offered by all major cloud providers. They are a big win for engineers who run and manage Kubernetes workloads, because upgrades tend to be reasonably quick and safe, whereas operating your own vanilla Kubernetes distribution across multiple virtual machines can be a real nightmare!

The Initial Landscape: A Kubernetes Pressure Cooker

Imagine inheriting a system with:

- 12 EKS clusters spread across different clients
- Every control plane stuck on Kubernetes 1.22, several versions behind
- A mountain of potential complexity waiting to explode

Spoiler alert: this wasn’t going to be a simple kubectl upgrade situation or an out-of-the-box terraform apply -auto-approve. 🚨

The Battle Plan: Terraform, Communication, and Nerves of Steel

Step 1: Reconnaissance and Mapping

First, I created a comprehensive inventory:

locals {
  cluster_upgrade_map = {
    "client-a-prod"     = { priority = "high", maintenance_window = "off-peak" }
    "client-b-staging"  = { priority = "medium", maintenance_window = "weekend" }
    "client-c-dev"      = { priority = "low", maintenance_window = "anytime" }
    # ... and 9 more clusters
  }
}

The Terraform Upgrade Module: Our Secret Weapon

resource "aws_eks_cluster" "upgraded_cluster" {
  name     = var.cluster_name
  role_arn = var.cluster_role_arn
  version  = "1.29" # Target Kubernetes version

  vpc_config {
    subnet_ids = var.subnet_ids
  }
}

# Node group version alignment, with a controlled, phased drain
resource "aws_eks_node_group" "upgraded_nodes" {
  cluster_name    = aws_eks_cluster.upgraded_cluster.name
  node_group_name = "${var.cluster_name}-workers"
  node_role_arn   = var.node_role_arn
  subnet_ids      = var.subnet_ids
  version         = "1.29" # Keep nodes aligned with the control plane

  scaling_config {
    desired_size = 3
    max_size     = 4
    min_size     = 3
  }

  update_config {
    max_unavailable_percentage = 33 # Replace at most a third of the nodes at once
  }
}
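One detail worth calling out: EKS only moves the control plane one minor version at a time, so getting from 1.22 to 1.29 really means seven upgrades back to back, each followed by node group alignment. A small Python sketch of the hop sequence (nothing EKS-specific, just the version math):

```python
def upgrade_path(current: str, target: str) -> list[str]:
    """Return the minor versions an EKS control plane must step through,
    since EKS upgrades exactly one minor version per operation."""
    major, cur_minor = (int(p) for p in current.split("."))
    _, tgt_minor = (int(p) for p in target.split("."))
    return [f"{major}.{m}" for m in range(cur_minor + 1, tgt_minor + 1)]

# Seven hops for our 1.22 fleet:
print(upgrade_path("1.22", "1.29"))
# -> ['1.23', '1.24', '1.25', '1.26', '1.27', '1.28', '1.29']
```

Budget your maintenance windows accordingly: one "upgrade" on paper is seven in practice.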

The Communication Symphony

Client Coordination: Like Conducting an Orchestra 🎼

Communication Template

Dear [Client],

Upcoming Kubernetes Upgrade Details:
- Current Version: 1.22
- Target Version: 1.29
- Estimated Disruption: 15-30 minutes of rolling node replacements (the EKS control plane stays available throughout)
- Maintenance Window: [Specific Date/Time]
- Rollback Capability: ✅ Fully Prepared

The Upgrade Phases: A Strategic Ballet

Phase 1: Non-Critical Clusters

Phase 2: Staging and Pre-Prod

Phase 3: Production Clusters
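The three phases fall straight out of the priority field in the inventory map. A tiny sketch of deriving the rollout order (cluster names and priorities are the illustrative ones from the locals block above, not real clients):

```python
# Mirror of the cluster_upgrade_map locals block; entries are illustrative.
clusters = {
    "client-a-prod":    {"priority": "high",   "maintenance_window": "off-peak"},
    "client-b-staging": {"priority": "medium", "maintenance_window": "weekend"},
    "client-c-dev":     {"priority": "low",    "maintenance_window": "anytime"},
}

# Lowest-risk clusters go first: dev (Phase 1), staging (Phase 2), prod (Phase 3).
phase_order = {"low": 1, "medium": 2, "high": 3}
rollout = sorted(clusters, key=lambda name: phase_order[clusters[name]["priority"]])
print(rollout)  # -> ['client-c-dev', 'client-b-staging', 'client-a-prod']
```

Any surprises surface on the expendable clusters long before a production window opens.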

Terraform Magic: Making Complex Look Simple

module "eks_cluster_upgrades" {
  source  = "my-org/eks-upgrade/aws"
  version = "1.0.0"

  for_each = local.cluster_upgrade_map

  cluster_name       = each.key
  target_version     = "1.29"
  priority           = each.value.priority
  maintenance_window = each.value.maintenance_window
}

War Stories and Lessons Learned 🏆

Unexpected Challenges

Pro Tips for the Brave

  1. Always have a rollback plan.
  2. Test, test, and test again on lower environments.
  3. Communication is 90% of the battle; executing the plan you communicated is the other 10%. As crazy as it sounds, that is the real truth.
  4. Automate everything you can, because it will save you from human error. Believe me, human errors do happen, and there are situations where a single mistyped letter can blow up your entire infrastructure.
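Tip 4 in practice: a hypothetical pre-flight gate (my sketch, not part of the actual module) that rejects typos in the upgrade map before Terraform ever runs:

```python
import re

VALID_PRIORITIES = {"low", "medium", "high"}
VERSION_RE = re.compile(r"^1\.\d{2}$")  # e.g. "1.29"

def validate(cluster_map: dict, target_version: str) -> list[str]:
    """Return human-readable errors; an empty list means safe to apply."""
    errors = []
    if not VERSION_RE.match(target_version):
        errors.append(f"bad target version: {target_version!r}")
    for name, cfg in cluster_map.items():
        if cfg.get("priority") not in VALID_PRIORITIES:
            errors.append(f"{name}: unknown priority {cfg.get('priority')!r}")
    return errors

# A single mistyped letter is caught here instead of in production:
print(validate({"client-a-prod": {"priority": "hgih"}}, "1.29"))
# -> ["client-a-prod: unknown priority 'hgih'"]
```

Wire something like this into CI so a bad map fails the pipeline, not the maintenance window.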

The Aftermath: A Hero’s Reflection

Final Thoughts: Kubernetes is a Journey, Not a Destination

Upgrading isn’t just about bumping version numbers. It’s about understanding systems, managing expectations, and executing with enough precision that highly sensitive workloads are never disrupted and, most importantly, the end users of your applications never notice a thing.

To my fellow infrastructure engineers: Stay curious, stay prepared, and always have a good coffee nearby. ☕🚀

May your upgrades be smooth and your downtime minimal!

Build On!