A Year With GitOps in Production
Do you know what is running in your production?
At this moment I know exactly what is running in production, without even accessing it. Had you asked me the same question before our move to GitOps, I would have had to delete and reinstall all the VMs and redeploy every application before I could answer with a similar level of confidence. Even then, that confidence would drift away with every passing minute, because nearly all mutable, CIOps-controlled environments are bound to deviate from the desired state. I attribute this drift between desired and actual state to the limitations of past Continuous Delivery tools, in which the Continuous Integration tool is also the centrepiece of every deployment. Changes are captured only within the CI tool. The desired state of each application in an environment is expressed through button presses in the CI tool, but the environment is never verified as a whole, because its desired state is not defined in one single place. Given these drawbacks of CIOps, I encourage you to move on to a CD approach made for the cloud-native era: GitOps.
What is GitOps?
Fundamentally, GitOps changes only two aspects of well-established Continuous Delivery.
Push vs Pull
With GitOps, you no longer tell your servers and clusters what to run; instead, they ask what there is for them to run. This seems subtle, but it has a tremendous impact. Where in the past you had to know both what to deploy and where, now you manage only the what.
Continuous Syncing
Continuous Syncing is the paradigm shift in which deploying the desired environment state is no longer a one-off process. Instead, GitOps periodically ensures that there is no drift between the desired and actual state of the environment. You therefore don’t create a deployment event by clicking a button in a CIOps tool; you modify the desired state in source control. Once a modification is made, GitOps will eventually bring your environment into sync with source control.
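As a rough illustration of continuous syncing, here is a minimal sketch of a single reconciliation pass. It assumes, purely for illustration, that desired and actual state are plain dictionaries mapping a resource name to its spec; real tools such as Flux work with Kubernetes objects, and `plan_sync` and `reconcile` are hypothetical names.

```python
from typing import Dict, List, Tuple

# Hypothetical model of an environment: resource name -> spec
State = Dict[str, dict]


def plan_sync(desired: State, actual: State) -> List[Tuple[str, str]]:
    """List the actions that would bring `actual` in line with `desired`."""
    actions: List[Tuple[str, str]] = []
    for name in sorted(desired):
        if name not in actual:
            actions.append(("create", name))  # declared in git, missing live
        elif actual[name] != desired[name]:
            actions.append(("update", name))  # drift: live spec differs from git
    for name in sorted(actual):
        if name not in desired:
            actions.append(("delete", name))  # live, but not declared in git
    return actions


def reconcile(desired: State, actual: State) -> State:
    """Apply the planned actions and return the resulting actual state."""
    result = dict(actual)
    for action, name in plan_sync(desired, actual):
        if action == "delete":
            del result[name]
        else:
            result[name] = desired[name]
    return result


if __name__ == "__main__":
    desired = {"app": {"image": "app:v2"}}
    actual = {"app": {"image": "app:v1"}, "debug-pod": {"image": "busybox"}}
    print(plan_sync(desired, actual))
    # [('update', 'app'), ('delete', 'debug-pod')]
    print(reconcile(desired, actual) == desired)  # True
```

Calling `reconcile` on a timer, rather than once per button press, is what turns a deployment into continuous syncing. This sketch always deletes undeclared resources; real tools may only prune when configured to do so.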
Continuous Delivery goals for the cloud-native era
When we kicked off the migration to Google Cloud Platform, we defined several goals for the CD platform to ensure that we would not repeat the mistakes of the CIOps era. This exercise confirmed that our existing tools were no longer fit for purpose, while GitOps ticked all the boxes.
Full Audit Trail
It is not sustainable to learn about past deployments and versions from the CI/CD tool itself. All of this information should be persisted in a change management database. Luckily, GitOps uses git, which is an excellent database for this. You also don’t duplicate effort: the changes made in git to perform deployments simultaneously form a full audit trail of what was changed, when, and by whom.
Everything as code
In the cloud, both CI and CD should be defined as code first. The desired state of each environment should also be defined in full as declarative code.
No cap on deployment parallelism
There should be no limit on how many deployments happen at the same time. With CIOps on-premise this is a significant issue, because every deployment temporarily reduces service capacity. In the cloud, you can scale your environment up to run the latest version of a service without affecting live capacity.
Minimise configuration drift
Environments should not deviate from the desired state. Ideally, they should also prevent or roll back any changes that are not declared in source control.
Don’t give CI/CD wildcard access to your environments
I believe the aim of Continuous Integration is to create deployable artefacts, while the objective of Continuous Delivery is to provide a mechanism for running those artefacts in production. Giving CD unrestricted access to your VMs and clusters has long been considered standard practice, but there are better and more secure ways to run your code in production. With the pull model, the CI system only needs write access to the GitOps repository, not credentials for the clusters themselves.
Poor man’s GitOps
Every new tool has to overcome initial resistance. Will it be complex to grasp? Will it be challenging to maintain? Will it be obsolete in a year? Luckily, with GitOps much of the scepticism goes away thanks to the simplicity of the concept. If you asked me to implement GitOps with only 5 minutes to spare, it would look like this:
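# GITOPS_REPO_URL is a placeholder for the URL of your declarative-definitions repo
git clone "$GITOPS_REPO_URL" declarative-definitions
while true; do
  git -C declarative-definitions pull            # fetch the latest desired state
  kubectl apply -R -f declarative-definitions    # apply all manifests, recursively
  sleep 30                                       # wait before the next sync
done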
GitOps Build Step
At the end of each build, we run a custom build step that opens a Pull Request against the GitOps repo, changing the image version to the one just built.
 containers:
 - name: app-name
-  image: gcr.io/project-name/app-name:v1
+  image: gcr.io/project-name/app-name:v2
To implement this build step, you’ll need git, hub (GitHub’s CLI) and yq (a jq alternative for YAML).
Recipe:
- Check out the GitOps repo with git
- Make a branch
- Update image version using yq
- Commit changes
- Push branch
- Create a PR with hub
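The yq step is the only one that actually edits the manifest. For readers who prefer to see that logic spelled out, here is a hedged Python equivalent. It assumes the Deployment has already been parsed into a dictionary (for example with a YAML library) and that containers live under spec.template.spec.containers, as in a standard Deployment. `split_image` and `bump_image` are hypothetical helpers; our build step itself uses yq.

```python
import copy
from typing import Optional, Tuple


def split_image(image: str) -> Tuple[str, Optional[str]]:
    """Split an image reference into (repository, tag); tag is None if absent.

    Digest references (image@sha256:...) are not handled in this sketch.
    """
    repo, sep, tag = image.rpartition(":")
    if not sep or "/" in tag:
        # No colon at all, or the colon belongs to a registry port (host:5000/app)
        return image, None
    return repo, tag


def bump_image(manifest: dict, container: str, new_tag: str) -> dict:
    """Return a copy of a Deployment manifest with one container's tag replaced."""
    updated = copy.deepcopy(manifest)  # leave the caller's manifest untouched
    containers = updated["spec"]["template"]["spec"]["containers"]
    for c in containers:
        if c.get("name") == container:
            repo, _old_tag = split_image(c["image"])
            c["image"] = f"{repo}:{new_tag}"
            return updated
    raise KeyError(f"container {container!r} not found in manifest")


if __name__ == "__main__":
    deployment = {"spec": {"template": {"spec": {"containers": [
        {"name": "app-name", "image": "gcr.io/project-name/app-name:v1"},
    ]}}}}
    bumped = bump_image(deployment, "app-name", "v2")
    print(bumped["spec"]["template"]["spec"]["containers"][0]["image"])
    # gcr.io/project-name/app-name:v2
```

Splitting on the last colon, and treating a colon followed by a slash as a registry port, keeps references such as localhost:5000/app intact.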
Our developers and QAs deploy applications by merging the Pull Request created by this step.
Downsides
GitOps improves many things, but some cases need extra attention to ensure smooth operation.
Restarts
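Each time you change a Secret or a ConfigMap without modifying the Deployment itself, you face the challenge of making sure that your application picks up the latest configuration. Pods that read these values as environment variables only see the new values after a restart, and a Deployment only rolls out new Pods when its Pod template changes. This problem is not unique to GitOps, but I would like to share some options.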
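The option we use most is a simple string environment variable in the Deployment manifest, with a counter that we increment each time we modify a ConfigMap or Secret. Changing the value changes the Pod template, which triggers a rolling restart.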
env:
- name: RESTART_ME
value: "Increment the counter in this string to restart Deployment [COUNTER:1]"
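Because the counter is bumped by hand, it is easy to forget. A hypothetical helper like the one below could bump it in the same commit that edits the ConfigMap or Secret. It assumes the container’s env list has already been parsed into Python dictionaries; `bump_restart_counter` and `bump_env` are illustrative names, not part of our tooling.

```python
import re

# Matches the "[COUNTER:n]" marker inside the RESTART_ME value
COUNTER_RE = re.compile(r"\[COUNTER:(\d+)\]")


def bump_restart_counter(value: str) -> str:
    """Return `value` with its [COUNTER:n] marker incremented by one."""
    match = COUNTER_RE.search(value)
    if match is None:
        raise ValueError("no [COUNTER:n] marker found in value")
    new_count = int(match.group(1)) + 1
    return value[:match.start()] + f"[COUNTER:{new_count}]" + value[match.end():]


def bump_env(env: list, name: str = "RESTART_ME") -> None:
    """Increment the counter of the named entry in a container's env list, in place."""
    for var in env:
        if var.get("name") == name:
            var["value"] = bump_restart_counter(var["value"])
            return
    raise KeyError(f"env var {name!r} not found")


if __name__ == "__main__":
    env = [{"name": "RESTART_ME",
            "value": "Increment the counter in this string to restart Deployment [COUNTER:1]"}]
    bump_env(env)
    print(env[0]["value"])
    # Increment the counter in this string to restart Deployment [COUNTER:2]
```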
Another option is to use Reloader, which watches ConfigMaps and Secrets and performs rolling restarts of the workloads that use them, removing this cognitive load from developers.
Broken YAML
Another common issue with Kubernetes deployments is broken YAML manifests, with either invalid syntax or broken object references. Here, a CIOps pipeline or even a plain kubectl apply -f has an advantage: feedback about broken YAML is immediate. With GitOps, broken manifests will break your GitOps tool of choice (Flux in our case), but unless you actively monitor and alert on the GitOps tool’s logs, the problem may go unnoticed for longer.
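A pre-merge check on the GitOps repo can restore some of that timely feedback. Syntax errors surface as soon as the manifests are parsed with a YAML library; the hedged sketch below covers the second failure mode, dangling references. It assumes the manifests are already parsed into dictionaries, only inspects ConfigMaps and Secrets consumed via envFrom or volumes in Deployments, and ignores namespaces as well as objects that exist in the cluster but not in the repo. All function names are hypothetical.

```python
from typing import Iterable, List, Set, Tuple

Ref = Tuple[str, str]  # (kind, name)


def defined_objects(manifests: Iterable[dict]) -> Set[Ref]:
    """Collect (kind, name) for every ConfigMap and Secret declared in the repo."""
    return {
        (m["kind"], m["metadata"]["name"])
        for m in manifests
        if m.get("kind") in ("ConfigMap", "Secret")
    }


def referenced_objects(deployment: dict) -> Set[Ref]:
    """Collect ConfigMap/Secret references from a Deployment's Pod template."""
    pod = deployment["spec"]["template"]["spec"]
    refs: Set[Ref] = set()
    for c in pod.get("containers", []):
        for src in c.get("envFrom", []):
            if "configMapRef" in src:
                refs.add(("ConfigMap", src["configMapRef"]["name"]))
            if "secretRef" in src:
                refs.add(("Secret", src["secretRef"]["name"]))
    for v in pod.get("volumes", []):
        if "configMap" in v:
            refs.add(("ConfigMap", v["configMap"]["name"]))
        if "secret" in v:
            refs.add(("Secret", v["secret"]["secretName"]))
    return refs


def broken_references(manifests: List[dict]) -> List[str]:
    """Describe every reference that no manifest in the repo satisfies."""
    defined = defined_objects(manifests)
    problems: List[str] = []
    for m in manifests:
        if m.get("kind") != "Deployment":
            continue
        for kind, name in sorted(referenced_objects(m) - defined):
            problems.append(f"{m['metadata']['name']}: missing {kind} {name}")
    return problems


if __name__ == "__main__":
    manifests = [
        {"kind": "ConfigMap", "metadata": {"name": "app-config"}},
        {"kind": "Deployment", "metadata": {"name": "app"},
         "spec": {"template": {"spec": {"containers": [
             {"name": "app", "envFrom": [{"configMapRef": {"name": "app-config"}},
                                         {"secretRef": {"name": "app-secrets"}}]}]}}}},
    ]
    for problem in broken_references(manifests):
        print(problem)  # app: missing Secret app-secrets
```

Failing the Pull Request on a non-empty result moves the error from the GitOps tool’s logs back into the review, where developers already look.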
API throttling
A GitOps tool constantly polls your git repository, so make sure your provider has a generous allowance for polling. Additionally, tools like Flux scan your container registries for new images, and you might hit API limits there as well.
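A quick back-of-the-envelope calculation helps when choosing polling intervals. The sketch below assumes, as a simplification, that every poll costs exactly one API request and that each cluster polls each repository independently; the hourly limit in the example is a made-up number, not any provider’s actual quota.

```python
import math


def polls_per_hour(interval_seconds: float, repos: int = 1, clusters: int = 1) -> int:
    """Requests per hour when every cluster polls every repo at a fixed interval."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return math.ceil(3600 / interval_seconds) * repos * clusters


def min_interval(limit_per_hour: int, repos: int = 1, clusters: int = 1) -> float:
    """Shortest polling interval, in seconds, that keeps the total within the limit."""
    if limit_per_hour <= 0:
        raise ValueError("limit_per_hour must be positive")
    return 3600 * repos * clusters / limit_per_hour


if __name__ == "__main__":
    # 30-second polling of 2 repos from 5 clusters
    print(polls_per_hour(30, repos=2, clusters=5))  # 1200
    # With a hypothetical allowance of 1,000 requests per hour
    print(min_interval(1000, repos=2, clusters=5))  # 36.0
```

Registry scanning is not modelled here; if each watched image costs its own requests, the total grows with the number of images as well.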
Outcomes of a year with GitOps
Having used GitOps over the last year to manage multiple clusters in different projects across numerous teams, I am happy to declare that it has met all of our original goals and become a great cloud-native success for loveholidays. Collectively, we’ve deployed more than 10,000 times since September 2018, and it has been smooth sailing. It is hard to imagine doing Continuous Delivery any other way.
Best practices
In my mind, GitOps is like a strict teacher. It can’t stop you from doing outright silly things like running kubectl apply -f from your machine. Instead, it lets you do it, watches the environment for a few minutes, and then corrects it back to the state it was in before your change. Next you try kubectl edit configmap; GitOps lets you do that too, but reverts it to the original state a few minutes later. Even the most stubborn developers quickly learn that they are tilting at windmills, and that making changes via git is just as quick but also traceable.
Developer friendly
Every day, developers are pushed to try some new, shiny tool that promises to make all of their problems go away. I believe tool proliferation is part of the developer’s problem: there is too much to try, learn and maintain. GitOps offers a unique opportunity to take a tool out of the developer’s toolbox. With GitOps, developers use git not only for feature work but for deployments too. It comes naturally to developers and helps them stay in flow for longer.
GitHub vs kubectl
Before GitOps, I was raving about *kubectl* and its powers. With GitOps, I am far more likely to use GitHub than *kubectl* to assess the state of my environments. This highlights the change in trust in CI/CD: there is no doubt that the desired state described in source control is the actual state of the runtime environment.
Available Implementations
Flux: this is what we chose a year ago, and it has served us well.
Presentations
I presented on this topic at the London DevOps meetup on 4 July 2019. The deck is available here.