Upgrading EKS Clusters!
Upgrades are always fun

Upgrading EKS the hard way!
Let me set the scene: 12 EKS clusters, spread across different clients, running on Kubernetes 1.22, and a mountain of potential complexity waiting to explode. This is the story of how I turned a potential nightmare into a surgically precise upgrade operation.
In previous posts I talked about the managed Kubernetes services offered by all major cloud providers. They are a big win for engineers who run and manage Kubernetes workloads, because upgrades tend to be reasonably quick and safe, whereas running your own vanilla Kubernetes distribution across multiple virtual machines can be a real nightmare!
The Initial Landscape: A Kubernetes Pressure Cooker
Imagine inheriting a system with:
- 12 uniquely configured EKS clusters
- Clients with varying levels of tolerance for downtime
- A Kubernetes version that was rapidly approaching end of support (this was literally the main driver: AWS was changing the pricing of old versions to push us, the customers, towards the newest releases)
- A mix of critical and non-critical workloads
Spoiler alert: this wasn't going to be a simple kubectl upgrade or an out-of-the-box terraform apply -auto-approve situation. 🚨
The Battle Plan: Terraform, Communication, and Nerves of Steel
Step 1: Reconnaissance and Mapping
First, I created a comprehensive inventory:
locals {
  cluster_upgrade_map = {
    "client-a-prod"    = { priority = "high", maintenance_window = "off-peak" }
    "client-b-staging" = { priority = "medium", maintenance_window = "weekend" }
    "client-c-dev"     = { priority = "low", maintenance_window = "anytime" }
    # ... and 9 more clusters
  }
}
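To complement the map, a quick way to check which version each cluster is actually running before planning anything is the aws_eks_cluster data source. A minimal sketch, assuming all clusters are reachable from the same provider configuration:
# Read the live state of every cluster in the upgrade map
data "aws_eks_cluster" "current" {
  for_each = local.cluster_upgrade_map
  name     = each.key
}

# Inventory output: which cluster is on which Kubernetes version
output "current_cluster_versions" {
  value = { for name, cluster in data.aws_eks_cluster.current : name => cluster.version }
}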
The Terraform Upgrade Module: Our Secret Weapon
resource "aws_eks_cluster" "upgraded_cluster" {
# Base cluster configuration
name = var.cluster_name
version = "1.29" # Target Kubernetes version
# Controlled, phased upgrade strategy
upgrade_policy {
max_unavailable_percentage = 33%
drain_strategy = "carefully"
}
# Node group version alignment
node_group_version = "1.29"
}
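In the AWS provider, the rolling-replacement knobs live on the node group rather than on the cluster itself: version pins the nodes to the target release, and update_config controls how many nodes may be unavailable while they are cordoned, drained and replaced. A minimal sketch, with placeholder variables such as var.node_role_arn and var.subnet_ids:
resource "aws_eks_node_group" "upgraded_nodes" {
  # Node group version alignment with the control plane
  cluster_name    = aws_eks_cluster.upgraded_cluster.name
  node_group_name = "${var.cluster_name}-nodes"
  node_role_arn   = var.node_role_arn
  subnet_ids      = var.subnet_ids
  version         = "1.29"

  scaling_config {
    desired_size = 3
    min_size     = 3
    max_size     = 4
  }

  # Controlled, phased upgrade strategy: at most a third of the nodes are
  # replaced at any one time, and each old node is drained before termination
  update_config {
    max_unavailable_percentage = 33
  }
}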
The Communication Symphony
Client Coordination: Like Conducting an Orchestra 🎼
- Created detailed upgrade runbooks for each client
- Mapped out exact maintenance windows
- Prepared rollback plans (always have a Plan B!)
Communication Template
Dear [Client],
Upcoming Kubernetes Upgrade Details:
- Current Version: 1.22
- Target Version: 1.29
- Estimated Downtime: 15-30 minutes
- Maintenance Window: [Specific Date/Time]
- Rollback Capability: ✅ Fully Prepared
The Upgrade Phases: A Strategic Ballet
Phase 1: Non-Critical Clusters
- Started with development and test environments that had no critical workloads (the Terraform sketch after this list shows how we expressed the phasing).
- Validated the upgrade procedure on test clusters first. This was one of the key steps in defining the best approach and exercising the changes against somewhat realistic environments.
- Collected initial learnings and potential pitfalls.
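One way to express that phasing in Terraform itself is to filter the cluster_upgrade_map from earlier before feeding it to for_each. A minimal sketch, assuming the priority field doubles as the phase selector:
locals {
  # Phase 1: only the low-priority clusters (dev/test, no critical workloads)
  phase_1_clusters = {
    for name, cfg in local.cluster_upgrade_map : name => cfg
    if cfg.priority == "low"
  }
}
The same filter with priority == "medium" and priority == "high" then drives Phases 2 and 3.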
Phase 2: Staging and Pre-Prod
- Applied learnings from Phase 1
- More rigorous testing and monitoring in place, which consisted of watching the existing workloads in each cluster and how they behaved once the new nodes came up on the new Kubernetes version.
- Increased monitoring and alert sensitivity (one example of turning up cluster-level visibility follows below).
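One concrete way to get that extra visibility during the upgrade window is to enable the EKS control plane logs, which land in CloudWatch Logs. A minimal sketch, extending the aws_eks_cluster resource shown earlier:
resource "aws_eks_cluster" "upgraded_cluster" {
  # ... name, role_arn, version and vpc_config stay as before ...

  # Ship control plane logs to CloudWatch for extra visibility during the upgrade
  enabled_cluster_log_types = [
    "api",
    "audit",
    "authenticator",
    "controllerManager",
    "scheduler",
  ]
}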
Phase 3: Production Clusters
- Surgical precision mode activated
- Staggered upgrades: rather than touching everything at once, we upgraded specific clients and environments in phases, on dates confirmed with each customer beforehand.
- Minute-by-minute monitoring
Terraform Magic: Making Complex Look Simple
module "eks_cluster_upgrades" {
source = "my-org/eks-upgrade/aws"
version = "1.0.0"
for_each = local.cluster_upgrade_map
cluster_name = each.key
target_version = "1.29"
priority = each.value.priority
maintenance_window = each.value.maintenance_window
}
War Stories and Lessons Learned 🏆
Unexpected Challenges
- Some legacy Helm charts required updates, and some of the engineers who built them were no longer in the company, a classic scenario in this lovely IT world!
- Certain container images were version-specific and had to be updated, so some refactoring was done with the help of the engineers.
- Some Kubernetes objects had to be updated as their API versions graduated from alpha to beta to stable.
Pro Tips for the Brave
- Always have a rollback plan.
- Test, test, and test again on lower environments.
- Communication is 90% of the battle; executing the plan you communicated is the remaining 10%. As crazy as it sounds, that is the real truth.
- Automate everything possible, as this will save you from human error. Believe me, human errors do happen, and there are situations where a single mistyped character can blow up the entire infrastructure (a small guardrail example follows below).
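As a small example of that kind of guardrail, Terraform's variable validation can stop a typo before it ever reaches a cluster. A sketch, using a hypothetical target_version input for the upgrade module:
variable "target_version" {
  description = "Kubernetes version to upgrade the clusters to"
  type        = string
  default     = "1.29"

  validation {
    # Only allow versions that have already been validated in lower environments
    condition     = contains(["1.27", "1.28", "1.29"], var.target_version)
    error_message = "target_version must be a version already tested in dev/staging."
  }
}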
The Aftermath: A Hero’s Reflection
- 12 clusters upgraded ✅
- Zero unplanned downtimes ✅
- Minimal client disruption ✅
- Personal sanity mostly intact 😅
Final Thoughts: Kubernetes is a Journey, Not a Destination
Upgrading isn't just about bumping version numbers. It's about understanding systems, managing expectations, and executing with precision so that you don't disrupt highly sensitive workloads and, most importantly, don't affect the end users of your applications.
To my fellow infrastructure engineers: Stay curious, stay prepared, and always have a good coffee nearby. ☕🚀
May your upgrades be smooth and your downtime minimal!
Build On!