Author: Marc LeBlanc


Google has done a great job of making it easy to build and deploy GKE clusters. With a few clicks or a few lines of Terraform code you can deploy clusters of virtually any size, in any region or across regions, with almost as many options for sizing the nodes the cluster runs on. At some point, if not right up front, the discussion always turns to cost optimization. Not to fret: there is a really simple way to enjoy up to an 80% cost reduction by running at least part of your GKE cluster on Preemptible VMs.
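The node pool examples later in this post reference a cluster resource named google_container_cluster.primary, which isn't shown here. For context, a minimal sketch of that resource might look something like the following (the variable names are assumed to match the node pool examples, and a real cluster will usually need more configuration than this):

resource "google_container_cluster" "primary" {
  name     = var.cluster_name
  location = var.cluster_location

  # Remove the default node pool so node pools can be managed
  # as separate Terraform resources, as in the examples below.
  remove_default_node_pool = true
  initial_node_count       = 1
}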

When to Preempt

There are some very important characteristics of preemptible instances you need to take into consideration. Preemptible instances are best for fault-tolerant workloads - think batch jobs.

Why?

1) Preemptible instances are terminated and released after a maximum of 24 hours.
2) Preemption occurs with only a 30 second notice.
3) Preemptible instances cannot be set to live-migrate to a normal instance.

Defining a Preemptible Node Pool via Terraform

To enable preemptible instances via Terraform, there is a single line that needs to be added to your node pool definition: preemptible = true

Example:

resource "google_container_node_pool" "preemptible_nodes" {
  name     = "${var.cluster_name}-pe-nodes"
  location = "${var.cluster_location}"
  cluster  = "${google_container_cluster.primary.name}"

  node_config {
    preemptible  = true
    machine_type = "${var.machine_type}"
    oauth_scopes = "${var.oauth_scopes}"

    metadata {
      disable-legacy-endpoints = "true"
    }
  }
}
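The example above references a handful of input variables. A minimal variables file to go with it might look like this (the types are what the arguments expect; the defaults are just placeholder assumptions):

variable "cluster_name" {
  type = string
}

variable "cluster_location" {
  type = string
}

variable "machine_type" {
  type    = string
  default = "e2-standard-4" # placeholder, pick a machine type that fits your workload
}

variable "oauth_scopes" {
  type    = list(string)
  default = ["https://www.googleapis.com/auth/cloud-platform"]
}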

Risk Mitigation through Dual Node Pools

A 30 second notice and your nodes might be ripped out from under your feet?! Yes, yes, I know, that sounds terribly risky. The idiom “Don’t put all your eggs in one basket” is very applicable here. You can add stability to your cluster and its stateful workloads by setting up multiple node pools: one on preemptible instances, and one not. Doing this within the console is very easy: just create two node pools and, in one of them, select the Enable preemptible nodes option under Nodes.

Again, this is easy to achieve with Terraform. Simply duplicate your node pool definition with a new name and remove the preemptible line from the copy.

Example:

resource "google_container_node_pool" "preemptible_nodes" {
  name     = "${var.cluster_name}-pe-nodes"
  location = "${var.cluster_location}"
  cluster  = "${google_container_cluster.primary.name}"
  
  node_config {
    preemptible  = true
    machine_type = "${var.machine_type}"
    oauth_scopes = "${var.oauth_scopes}"
  }
}

resource "google_container_node_pool" "ondemand_nodes" {
  name     = "${var.cluster_name}-od-nodes"
  location = "${var.cluster_location}"
  cluster  = "${google_container_cluster.primary.name}"

  node_config {
    machine_type = "${var.machine_type}"
  }
}

This gives you two node pools: one on preemptible instances and one on normal, on-demand instances.

But Wait! There’s more! Autoscaling

Great, you have two node pools. Now what? Enter autoscaling. Generally speaking, your stateful workloads should be fairly predictable in terms of sizing, and you can likely size your on-demand/normal node pool based on documented sizing guidelines, but with batch processing it may be less clear. Again, within the console it is fairly easy to set up your node pools to autoscale accordingly. Doing so programmatically via Terraform is just as easy.

Simply add an autoscaling {} block to your node pool definition and the pool will stay within the min/max node counts you set. For this example, I recommend turning on autoscaling only for the preemptible pool, per my earlier note that you should be able to properly size the pool hosting your stateful workloads.

Example:

resource "google_container_node_pool" "preemptible_nodes" {
  name     = "${var.cluster_name}-pe-nodes"
  location = "${var.cluster_location}"
  cluster  = "${google_container_cluster.primary.name}"
  
  autoscaling {
    min_node_count = "${var.pe_initial_node_count}"
    max_node_count = "${var.pe_node_count}"
  }

  node_config {
    preemptible  = true
    machine_type = "${var.machine_type}"
  }
}
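The autoscaling bounds above come from two more input variables that aren't defined anywhere in this post. Here is an assumed sketch with placeholder defaults:

variable "pe_initial_node_count" {
  type    = number
  default = 1 # placeholder minimum size for the preemptible pool
}

variable "pe_node_count" {
  type    = number
  default = 5 # placeholder maximum size for the preemptible pool
}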

That’s it. You now have a simple cost optimization applied to your GKE cluster for stateless/fault-tolerant workloads. You can control which node pool your pods are scheduled on by using a nodeSelector, as shown in the sketch below.
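GKE labels every node with cloud.google.com/gke-nodepool set to the name of its node pool, so a fault-tolerant workload can be pinned to the preemptible pool with a nodeSelector on that label. Below is a rough sketch using the Terraform kubernetes provider (the deployment name, app label, and container image are made up for illustration, and the kubernetes provider is assumed to already be configured against the cluster); the same nodeSelector works just as well in a plain Kubernetes manifest if you deploy with kubectl.

resource "kubernetes_deployment" "batch_worker" {
  metadata {
    name = "batch-worker" # hypothetical workload name
  }

  spec {
    replicas = 2

    selector {
      match_labels = {
        app = "batch-worker"
      }
    }

    template {
      metadata {
        labels = {
          app = "batch-worker"
        }
      }

      spec {
        # Schedule these pods only onto nodes in the preemptible pool
        node_selector = {
          "cloud.google.com/gke-nodepool" = google_container_node_pool.preemptible_nodes.name
        }

        container {
          name  = "worker"
          image = "gcr.io/my-project/batch-worker:latest" # placeholder image
        }
      }
    }
  }
}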

After leaving your cluster running for a few days and inspecting the nodes (for example, with kubectl get nodes), you will likely see that the preemptible instances have been rolling over as expected.

That’s it for now! Happy DevOpsing!
