GCP FinOps #002: Chaos Engineer Your Way to a Lower Cost // Dmitri Lerko

tl;dr;

Save money and build more reliable systems by running preemptible instances. The frequent failures that preemptible instances generate force you to adopt best practices around resilience.

Preemptible Instances

Cloud costs in dynamic (fast-moving, growing, and innovative) organizations tend to be ever-increasing. It is a by-product of engineering activity. Stabilizing or driving costs down can be challenging once ‘low hanging fruit’ opportunities are exhausted.

GCP has preemptible instances. Data centers are built for peak demand, so much of the time (depending on seasonality, time of day, and day of the week) there is a lot of spare, idle server capacity. Preemptible instances offer those idle servers at a reduced list price to maximize utilization of GCP data centers. Preemptible instances have the following characteristics:

They are not guaranteed to be available.
They live for at most 24 hours and are then automatically terminated.
They can be terminated before the 24 hours are up, with 30 seconds' notice.
Instances are up to 80% cheaper than on-demand prices.
Local SSDs, TPUs, and GPUs attached to preemptible instances are also heavily discounted.
Unlike with Spot Instances, there is no auction and no ability to extend the lifespan beyond 24 hours.
The price of preemptible instances is known in advance.
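
To make the discount concrete, here is a back-of-the-envelope sketch. The hourly price, fleet size, and hours per month are made-up placeholders, not real GCP list prices; only the "up to 80%" figure comes from the list above, and it is the best case rather than a guarantee.

```python
def monthly_cost(hourly_price, instances, hours_per_month=730):
    """Cost of running `instances` VMs around the clock for one month."""
    return hourly_price * instances * hours_per_month


def preemptible_savings(on_demand_hourly, discount, instances):
    """Return (on_demand_cost, preemptible_cost, savings) for a fleet."""
    on_demand = monthly_cost(on_demand_hourly, instances)
    preemptible = monthly_cost(on_demand_hourly * (1 - discount), instances)
    return on_demand, preemptible, on_demand - preemptible


# Hypothetical numbers: 0.10 per hour on demand, 20 instances, 80% discount.
on_demand, preemptible, saved = preemptible_savings(0.10, 0.80, 20)
print(f"on-demand: {on_demand:.2f}, preemptible: {preemptible:.2f}, saved: {saved:.2f}")
```

Plug in your own instance prices and a more conservative discount to see what the pitch is worth for your fleet.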

Preemptible instances have a great reputation for cost saving, but a terrible one for reliability. Engineers beg not to use preemptible instances so they can focus on writing features instead of dealing with “artificial” reliability issues. “Why can’t we just pay for full-price instances and run everything reliably?”, “I have a deadline to deliver this feature and no time for this,” they say. Let me tell you: there is nothing artificial about the reliability issues you experience when running preemptible instances. Preemptible instances merely turn failures that would have happened anyway into a guaranteed daily occurrence. VM instances need periodic maintenance, physical servers do crash, and GCP has outages. Paying more will not prevent those events from happening, but when they do happen, you will be unprepared, because you were hiding from failures rather than seeking them out.

Hold on, what has this got to do with Chaos Engineering? Chaos Engineering is the discipline of experimenting on a software system in production to build confidence in the system’s capability to withstand turbulent and unexpected conditions. Early tools in the Chaos Engineering domain were very focused on instance-level failures. As the field matured, it expanded to chaos in networking, IAM, and even random resource deletions in Kubernetes.

Preemptible instances can act as a fundamental Chaos Engineering tool and an excellent starting point. I know many of you have longed to try Chaos Engineering out, but it is a hard pitch to make.

Hey, I am going to spend valuable engineering time and money on Chaos Engineering products to break things (yep, break it more often than we do already) so we can have a more reliable future.

Compare this with:

Hey, there is a way to save up to 80% on our costly instances, while making our infrastructure and applications more reliable at the same time.

I’d favor the second pitch without any doubt.

What have preemptibles got to do with Chaos Engineering?

Preemptibles are not guaranteed to be available, have a limited lifespan, and can be terminated at any time (chaotic) with 30 seconds' notice.

To counteract some of the preemptible constraints, you have to:

Run compute on different machine types. This reduces the probability of preemption, as demand usually increases for a specific machine type at any one point in time.
Run compute in different zones/regions. Preemption is a result of an increase in local demand, so more localities increase overall availability.
Run applications in a highly available way. Ensure that you have N+1 instances of your application, and ensure that they don’t all end up on the same node, zone, or even region (pick your battles depending on your availability SLOs).
Do not rely on ephemeral storage or on the server still being around in the future. This constraint forces you to delegate state to databases. Logs also have to be ingested remotely in real time, or you risk losing the last-minute entries.
Your applications have to get faster at starting up and shutting down, especially shutting down. You have less than 30 seconds to stop what you are doing, or you risk losing or corrupting your data/state.
Introduce async message queues like Pub/Sub, so workloads can be picked up at any later point. At the same time, processing that fails due to preemption should result in the original message going back onto the queue to be processed by another instance.
Use a sub-minute metrics collection interval. Tools with 1- or 5-minute metric scrapes quickly become irrelevant in the face of frequent terminations.
Your nodes have to be bootstrapped and ready to work. Running a 30-minute user script at each startup is not sustainable. Instead, you have to rely on disk images with everything pre-provisioned.
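
Two of the points above, shutting down quickly and putting interrupted work back on the queue, can be sketched together. This is a minimal illustration and not a Pub/Sub client: it uses an in-memory `queue.Queue` as a stand-in for the message queue, and it assumes the process receives SIGTERM when the instance is shutting down (the usual behaviour for OS shutdown and for Kubernetes pods). The names `run_worker`, `process`, and `simulate_preemption` are hypothetical.

```python
import queue
import signal
import threading

# Set when the instance is going away; checked between small steps of work.
shutdown = threading.Event()


def handle_sigterm(signum, frame):
    """Flip the flag; the work loop notices it and stops taking new messages."""
    shutdown.set()


def process(message, stop):
    """Hypothetical unit of work, split into steps so it can stop early."""
    for _ in range(3):
        if stop.is_set():
            return False  # interrupted before finishing
        # ... one small step of real work would go here ...
    return True


def run_worker(jobs, process_fn, done):
    """Drain `jobs` until it is empty or a shutdown is requested.

    A message interrupted by shutdown is put back on the queue so that
    another instance can pick it up later.
    """
    while not shutdown.is_set():
        try:
            message = jobs.get_nowait()
        except queue.Empty:
            break
        if process_fn(message, shutdown):
            done.append(message)  # finished: acknowledge
        else:
            jobs.put(message)     # interrupted: hand it back


def simulate_preemption():
    """Demo: the shutdown flag is raised while message "b" is in flight."""
    shutdown.clear()
    jobs = queue.Queue()
    for name in ("a", "b", "c"):
        jobs.put(name)
    done = []

    def process_with_preemption(message, stop):
        if message == "b":
            stop.set()  # stands in for SIGTERM arriving mid-message
        return process(message, stop)

    run_worker(jobs, process_with_preemption, done)
    remaining = []
    while not jobs.empty():
        remaining.append(jobs.get_nowait())
    return done, remaining


if __name__ == "__main__":
    signal.signal(signal.SIGTERM, handle_sigterm)
    print(simulate_preemption())
```

In the simulation only "a" is acknowledged; "b" goes back on the queue behind "c", ready for another instance. With a real queue the same idea usually maps onto not acknowledging (or explicitly negatively acknowledging) the in-flight message, which also means processing has to tolerate seeing the same message twice.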

What Google products support preemptibles?

Google Compute Engine

Tick the checkbox in the UI, use gcloud xxx --preemptible, or set preemptible = "true" in Terraform. I’d recommend using preemptibles in Managed Instance Groups, so that the group can maintain minimum availability for you.
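
On Compute Engine, an instance can also ask the metadata server whether it has been preempted, which is handy for telling a preemption apart from an ordinary shutdown in logs or cleanup logic. The sketch below reads the `instance/preempted` metadata value, which returns `TRUE` or `FALSE`. The helper names `fetch_metadata` and `is_preempted` are my own, and the injectable `fetch` argument is only there so the logic can be exercised outside of a VM.

```python
import urllib.request

# Compute Engine metadata endpoint; the value is "TRUE" or "FALSE".
PREEMPTED_URL = (
    "http://metadata.google.internal/computeMetadata/v1/instance/preempted"
)


def fetch_metadata(url, timeout=5):
    """Read a metadata value; only works from inside a Compute Engine VM."""
    request = urllib.request.Request(url, headers={"Metadata-Flavor": "Google"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8").strip()


def is_preempted(fetch=fetch_metadata):
    """Return True once the instance has been marked as preempted."""
    return fetch(PREEMPTED_URL) == "TRUE"
```

A shutdown script or a shutdown handler in your application could call `is_preempted()` and tag its final log line accordingly, for example to count preemptions per zone or machine type.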

Google Kubernetes Engine

You can apply preemptibility to a NodePool. It has to be a new NodePool, as you cannot retrofit preemptibility onto an existing one. I will cover the topic of preemptibles in GKE in a separate future blog post.

Dataflow

When you use the FlexRS mode of Dataflow, it can blend preemptible with standard instances. The process is rather abstracted away and very low touch for the user. Read more in the documentation for the --flexRSGoal=COST_OPTIMIZED option.

Dataproc

Dataproc calls them Secondary Workers. This is the ability to add preemptible, processing-only workers to a cluster. They do not contribute to HDFS and only do computational processing. More info.

Local SSD, TPU, and GPU

All three can be used with preemptible instances, creating an interesting challenge to solve, especially around the large, fast ephemeral NVMe disk that can disappear at short notice.

Kubeflow

Kubeflow (which is a combination of some of the above products) can benefit from preemptible instances.

Summary

The addition of two constraints, the lifespan of the instance and instance availability, is all that is needed to embark on the journey of Chaos Engineering. It may not sound like much, but the challenges you’ll face solving for availability, reliability, and durability while instances are being removed in under 30 seconds will pay huge dividends (in both cost savings and reliability). Once you steady the ship while using preemptibles, both you and your stakeholders will have a lot more confidence in trying more sophisticated Chaos Engineering tools, but do start small.