A Year With GitOps in Production // Dmitri Lerko

Do you know what is running in your production?

At this moment I know exactly what is running in my production, without even accessing it. Had you asked me the same question before the move to GitOps, I would have had to delete and reinstall all the VMs, redeploy the applications and only then answer with a similar level of confidence. That confidence would drift away with every passing minute, as nearly all non-immutable, CIOps-controlled environments are bound to deviate from the desired state. I attribute this drift between desired and actual state to the limitations of the Continuous Delivery tools of the past, where the Continuous Integration tool is also the centrepiece of all deployments. Changes are only captured within the CI tool. The desired state of specific applications is expressed via button presses in the CI tool, but never actually verified as a whole. The entire environment is never checked against the desired state, because the desired state is not defined in one single place. Given these drawbacks of CIOps, I encourage you to move on to the CD tools made for the cloud-native era: GitOps.

What is GitOps?

Fundamentally, GitOps changes only two aspects of well-established Continuous Delivery.

Push vs Pull

With GitOps, you no longer tell your servers and clusters what to run; instead, they ask what there is for them to run. This seems subtle, but it has a tremendous impact. Where in the past you had to know what to deploy and where, now you only manage the what.

Continuous Syncing

Continuous Syncing is the paradigm shift where deploying the desired environment state is not a one-off process. Instead, GitOps periodically ensures that there is no drift between the desired and actual state of the environment. You therefore don’t create a deployment event by clicking a button in a CIOps tool. Instead, you modify the desired state in source control. Once a modification is made, GitOps will eventually bring your environment in sync with source control.

Continuous Delivery goals for the cloud-native era

When we kicked off the migration to the Google Cloud Platform, we defined several goals for the CD platform to ensure that we did not repeat the mistakes of the CIOps era. This exercise allowed us to confirm that our existing tools were no longer fit for purpose, while GitOps ticked all the boxes.

Full Audit Trail

It is not sustainable to learn about past deployments and versions from the CI/CD tool itself. 100% of this information should be persisted in a change management database. Luckily, GitOps uses git, which is an excellent database. You also do not duplicate your efforts, as the changes made in git to perform deployments are simultaneously a full audit trail of what was changed, when and by whom.
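Because every deployment is a commit, plain git history answers the audit questions. A minimal sketch, assuming a local clone of the GitOps repo; the function name and the manifest path you pass in are hypothetical and depend on your own repo layout:

```shell
# Print who changed a manifest, when, and with what message, newest first.
# $1: path to a local clone of the GitOps repo (assumption)
# $2: manifest path inside that repo (hypothetical layout)
audit_trail() {
  git -C "$1" log --date=iso --pretty='format:%h %an %ad %s' -- "$2"
}
```

Each output line carries the short commit hash, the author, the commit date and the subject of one change to that manifest.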

Everything as code

In the cloud, both CI and CD should be defined as code first. The desired state of environments should also be set out in full as declarative code.

No cap on deployment parallelism

There should be no limit on how many deployments happen at the same time. With CIOps and on-premises this is a significant issue, as any deployment is essentially a temporary reduction in service capacity, while in the cloud it is possible to scale your environment up to run the latest version of the service without impacting live capacity.

Minimise configurational drift

Environments should not deviate from the desired state. Ideally, they should also prevent or roll back any changes that are not declared in source control.
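To make drift visible, the declared manifests can be compared against the live cluster. The sketch below relies on the documented exit codes of kubectl diff (0 for no differences, 1 for differences, greater than 1 for errors). The function name and the manifests directory are assumptions, and the check only covers objects that are declared in git, not extra objects created by hand.

```shell
# Report whether the live cluster matches the manifests in a directory.
# $1: directory of manifests (hypothetical local clone of the GitOps repo)
# Returns 0 when in sync, 1 when drift is found, 2 on any other error.
check_drift() {
  status=0
  kubectl diff -R -f "$1" >/dev/null || status=$?
  case "$status" in
    0) echo "in sync" ;;
    1) echo "drift detected"; return 1 ;;
    *) echo "kubectl diff failed (exit $status)"; return 2 ;;
  esac
}
```

A GitOps tool goes one step further than this check: it also re-applies the declared state to correct the drift.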

Don’t give CI/CD wildcard access to your environments

I believe that the aim of Continuous Integration is to create deployable artefacts, and the objective of Continuous Delivery is to provide a mechanism to run those artefacts in production. Giving CD unrestricted access to your VMs and clusters has long been considered standard practice. There are better and more secure ways to run your code in production.

Poor man’s GitOps

Every new tool has to overcome initial resistance. Will it be complex to grasp? Will it be challenging to maintain? Is it going to be obsolete in a year? Luckily, with GitOps, we can take a lot of scepticism away thanks to the simplicity of its concept. If you asked me to implement GitOps with only 5 minutes to spare, it would look as follows:

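# Placeholder: substitute the real URL of your GitOps repository.
git clone declarative-definitions
while true; do
  # Fetch the latest desired state, then apply it to the cluster.
  git -C declarative-definitions pull
  kubectl apply -f declarative-definitions
  sleep 30
done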

I do not recommend running this in production, but it makes clear that GitOps is quite simple, yet very powerful.

GitOps Build Step

At the end of every build we run a custom build step. It creates a Pull Request against the GitOps repo that changes the image version to the one just built.

containers:
  - name: app-name
-   image: gcr.io/project-name/app-name:v1
+   image: gcr.io/project-name/app-name:v2

To implement this build step you’ll need git, hub (GitHub’s CLI) and yq (a jq alternative for YAML).

Recipe:

Check out GitOps repo with git
Make a branch
Update image version using yq
Commit changes
Push branch
Create a PR with hub

Our developers and QAs deploy applications by merging the Pull Request created by this step.
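The recipe can be sketched as a single shell function. This is an illustrative sketch rather than the exact step described here: it assumes the Go implementation of yq with v4 syntax, a Deployment manifest whose first container is the one being bumped, and a hypothetical branch naming scheme. The function name and its arguments are made up for the example.

```shell
# Open a Pull Request that bumps one image reference in the GitOps repo.
# All arguments are hypothetical examples of what a build would pass in:
# $1: GitOps repo URL, $2: manifest path inside the repo, $3: new image
# The body runs in a subshell, so the caller's working directory is untouched.
open_deploy_pr() (
  repo_url=$1; manifest=$2; image=$3
  # Derive a branch name by replacing every non-alphanumeric character with "-".
  branch="deploy-$(printf '%s' "$image" | tr -c 'A-Za-z0-9' '-')"
  workdir=$(mktemp -d) &&
  git clone "$repo_url" "$workdir" &&        # 1. check out the GitOps repo
  cd "$workdir" &&
  git checkout -b "$branch" &&               # 2. make a branch
  # 3. update the image (assumes yq v4 and that the first container is the target)
  yq eval -i ".spec.template.spec.containers[0].image = \"$image\"" "$manifest" &&
  git commit -am "Deploy $image" &&          # 4. commit changes
  git push origin "$branch" &&               # 5. push branch
  hub pull-request -m "Deploy $image"        # 6. create a PR with hub
)
```

A build could then finish with something like open_deploy_pr "$GITOPS_REPO_URL" app-name/deployment.yaml gcr.io/project-name/app-name:v2, where the variable and the manifest path are again placeholders.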

Downsides

Generally, GitOps improves upon a lot of things, but there are some cases where it needs extra attention to ensure smooth operation.

Restarts

Each time you change a Secret or a ConfigMap without modifying the deployment itself, you face the challenge of making sure that your application picks up the latest values. This problem is not unique to GitOps, but I would like to share some options.

The one we use the most is a simple string environment variable in the deployment manifest, which we increment each time we modify a ConfigMap or a Secret. Because this changes the pod template, Kubernetes rolls out new pods that pick up the updated values.

env:
- name: RESTART_ME
  value: "Increment the counter in this string to restart Deployment [COUNTER:1]"

Another option is to use Reloader and remove this cognitive load from the developers.
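If you stay with the counter approach, the increment itself is easy to script so that nobody edits the string by hand. A minimal sketch, assuming the manifest contains exactly one [COUNTER:n] marker as in the RESTART_ME example; the function name is hypothetical:

```shell
# Increment the [COUNTER:n] marker in a manifest, in place.
# $1: path to the deployment manifest holding the RESTART_ME value
# Assumes a single marker per file; returns 1 if no marker is found.
bump_restart_counter() {
  current=$(sed -n 's/.*\[COUNTER:\([0-9][0-9]*\)].*/\1/p' "$1" | head -n 1)
  [ -n "$current" ] || { echo "no [COUNTER:n] marker in $1" >&2; return 1; }
  next=$((current + 1))
  tmp=$(mktemp) || return 1
  # Write to a temporary file first, then copy back to keep the file's permissions.
  sed "s/\[COUNTER:$current]/[COUNTER:$next]/" "$1" > "$tmp" &&
    cat "$tmp" > "$1" &&
    rm -f "$tmp"
}
```

The change still has to be committed to the GitOps repo, together with the ConfigMap or Secret change, for the restart to happen.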

Broken YAML

Another common issue with Kubernetes deployments is broken YAML manifests, where the syntax or object references are wrong. In this case, a CIOps pipeline or even kubectl apply -f has an advantage: feedback about broken YAML is immediate. With GitOps, broken manifests will break the GitOps tool of choice (Flux in our case), but unless you actively monitor and alert on the logs of the GitOps tool, the problem might remain unnoticed for longer.
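Part of that fast feedback can be recovered by checking manifests before a Pull Request is merged. The sketch below is one possible check, not something Flux provides: it assumes the Go implementation of yq (v4) is available and only verifies that each file parses as YAML. Broken object references are not caught, so monitoring the GitOps tool is still needed. The function name and directory argument are hypothetical.

```shell
# Fail fast on manifests that do not parse as YAML.
# $1: directory to scan (hypothetical local checkout of the GitOps repo)
# Prints each file that fails to parse and returns 1 if there is any.
lint_manifests() {
  broken=$(find "$1" -type f \( -name '*.yaml' -o -name '*.yml' \) |
    while IFS= read -r file; do
      # yq exits non-zero when the file is not valid YAML.
      yq eval '.' "$file" >/dev/null 2>&1 || echo "$file"
    done)
  if [ -n "$broken" ]; then
    echo "Broken YAML:"
    echo "$broken"
    return 1
  fi
  echo "All manifests parse"
}
```

Run as a required check on Pull Requests against the GitOps repo, this reports syntax errors before the change reaches the cluster.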

API throttling

GitOps constantly polls the git repo. Ensure that your provider has a generous allowance for polling. Additionally, tools like Flux scan your container registries for changes, and you might hit the API limits there as well.
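A quick back-of-the-envelope calculation helps when comparing your setup with a provider's documented limits. The numbers below are purely illustrative: a 30-second interval and four clusters polling the same repo are assumptions, not our configuration.

```shell
# Estimate git polls per hour for a fleet of GitOps agents.
# $1: poll interval in seconds, $2: number of clusters polling the same repo
polls_per_hour() {
  echo $(( 3600 * $2 / $1 ))
}

polls_per_hour 30 4   # prints 480
```

Raising the interval or reducing the number of pollers scales the total down proportionally.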

A year with GitOps: outcomes

Having used GitOps to manage multiple clusters in different projects by numerous teams over the last year, I am happy to declare that it has met all of our original goals and has become a great cloud-native success for loveholidays. Collectively, we’ve deployed more than 10'000 times since September 2018, and it has been smooth sailing. It is hard to imagine Continuous Delivery done any other way.

Best practices

In my mind, I compare GitOps to a strict teacher. GitOps can’t stop you from doing outright silly things like kubectl apply -f from your machine. Instead, it will let you do it, monitor the environment for a few minutes and then correct it back to the state it was in before you made the change. Next you try kubectl edit configmap; GitOps lets you do it, but corrects it back to the original state a few minutes later. Even the most stubborn developers quickly learn that they are tilting at windmills, and that making changes via git is just as quick, but also traceable.

Developer friendly

Developers are pushed daily to try the new, shiny tool that will make all of their problems go away. I believe that tool proliferation is part of the developer’s problem. There is too much to try, learn and maintain. GitOps gives a unique opportunity to take a tool away from the developer’s toolbox. With GitOps, developers use git not only for feature work but for deployments too. It comes very naturally to developers and helps them stay in flow for longer.

GitHub vs kubectl

Before GitOps I was raving about *kubectl* and its powers. With GitOps, I am a lot more likely to use GitHub to assess the state of my environments than to use *kubectl* to achieve the same. This highlights the change in trust in your CI/CD, where there is no doubt that the desired state described in source control is the actual state of the runtime environment.

Available Implementations

Flux: this is what we chose a year ago, and it has served us well.

Argo

Razee

Presentations

I presented on the topic at the London DevOps meetup on the 4th of July 2019. The deck is available here.