Implementing zero downtime deployments on Kubernetes — the plan

Published in Spawn DB · 4 min read · Apr 6, 2021

We’ve reached the stage with Spawn where the application is stable and we are actively recruiting new users through our open beta. With this extra usage, some of it coming from other parts of the world, we now have the problem of user operations being interrupted by our deployments. When the vast majority of use came from our internal teams, scheduling deployments outside UK working hours or announcing them to those teams was an adequate solution. With in the region of 1,200 database containers now being created a day, the likelihood of user operations occurring during the current 5 minute deployment window is high and needs addressing.

Requirements

Spawn is a cloud hosted service that delivers databases on demand for dev, CI and testing workflows, complete with data and in a matter of seconds.

All databases are containers behind the scenes, and Kubernetes is used to orchestrate them as well as to host the application itself. A dependency external to the cluster adds complexity: a virtual machine with a worker installed that handles storage. Some third party software is also used, such as NATS to pass messages between the components.

The main user tasks can be boiled down to creating database containers and database images. These tasks take anywhere from 5 seconds to, for images, potentially many hours. Any task that is in progress before a deployment begins, or that starts during the deployment, must be unaffected.

We would also like to be able to deploy as frequently as is needed; multiple deployments a day must be possible.

Our requirements list to judge the possible solutions against:

Invisible to the user
Handle tasks spanning multiple hours
Deploy multiple times per day
Incrementally implement without need for additional downtime

Options

There are two main approaches that we investigated when thinking about this: Kubernetes rolling updates and blue/green deployments. During this research a third strategy also came to light: rainbow deployments.

Kubernetes rolling updates

As Spawn already uses Kubernetes deployments for its components, making use of the rolling update functionality seemed like the way to go. The main question mark was around the components which process tasks that may take several hours to complete. This excellent blog post by Daniele Polencic goes through the graceful shutdown process in detail. While handling long running tasks this way can be made to work, the post recommends against it.

✅ Invisible to the user
❌ Handle tasks spanning multiple hours
✅ Deploy multiple times per day
✅ Incrementally implement without need for additional downtime

Blue/green deployments

Blue/green deployments are essentially two separate deployments that take turns being live with each release, which is more in line with Daniele’s suggestion for handling long running tasks. The problem comes when releasing frequently is a requirement. Let us say that we have a database image creation task that will take 6 hours, and it starts at 8am when the “green” deployment is live. At 9am we release a new version, keeping the “green” deployment in place but replacing the “blue” version with our release. Both deployments are now running tasks, and “green” still has 5 hours before its long running image task has completed. We now can’t deploy again without impacting user actions.

✅ Invisible to the user
✅ Handle tasks spanning multiple hours
❌ Deploy multiple times per day
✅ Incrementally implement without need for additional downtime
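
The timing in the scenario above can be worked through with a small sketch. The function name and the two-slot model are illustrative assumptions rather than Spawn code: it simply assumes that a slot cannot be replaced until its longest in-progress task has finished.

```python
from datetime import datetime, timedelta


def earliest_next_deploy(task_start: datetime, task_duration: timedelta,
                         last_release: datetime) -> datetime:
    """Earliest time the next release can go out without interrupting the task.

    With only two slots, the next release has to overwrite the slot that is
    still running the long task, so it must wait until that task finishes.
    """
    task_end = task_start + task_duration
    return max(task_end, last_release)


day = datetime(2021, 4, 6)
task_start = day.replace(hour=8)   # image creation starts on "green" at 8am
release = day.replace(hour=9)      # new version replaces "blue" at 9am
next_slot = earliest_next_deploy(task_start, timedelta(hours=6), release)

print(next_slot.strftime("%H:%M"))  # 14:00
print(next_slot - release)          # 5:00:00
```

Under these assumptions a single long running task blocks further releases for five hours, which is what rules out deploying multiple times per day.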

Rainbow deployments

To remove the limiting nature of blue/green deployments in terms of frequency, a method called rainbow deployments can be used. In this case each release will create a new, separate deployment. The existing deployment will be left in place until all tasks it is currently serving have been completed, and then removed. In cases where there are no in progress long running tasks, or they finish shortly after the new release, cleaning up the deployment promptly will free resources and save money.

✅ Invisible to the user
✅ Handle tasks spanning multiple hours
✅ Deploy multiple times per day
✅ Incrementally implement without need for additional downtime

Plan

Given that the two main approaches explored fell short on the list of requirements, rainbow deployments seem to be the best fit for our Spawn application. We believe we can achieve this incrementally, without additional downtime and with minimal application code changes.

Message bus

To keep the components of each release separate we will append a release identifier to all message subjects, maintaining a single message bus for all releases.
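
As a minimal sketch of what appending a release identifier could look like, the helper below adds it as the final token of a dot-separated subject. The function name, the example subject and the identifier format are all hypothetical; the only NATS behaviour relied on is that subjects are dot-separated tokens and that `*` and `>` are wildcards.

```python
import re

# Assumption: release identifiers are simple tokens such as a build number or git SHA.
_VALID_TOKEN = re.compile(r"[A-Za-z0-9_-]+")


def release_subject(base_subject: str, release_id: str) -> str:
    """Append a release identifier as the last token of a message subject."""
    if not _VALID_TOKEN.fullmatch(release_id):
        # '.', '*', '>' and whitespace would change how the subject is interpreted
        raise ValueError(f"invalid release identifier: {release_id!r}")
    return f"{base_subject}.{release_id}"


# Hypothetical subject name, for illustration only.
print(release_subject("spawn.container.create", "r42"))  # spawn.container.create.r42
```

Because every component of a release publishes and subscribes using its own suffix, two releases can share the bus without seeing each other's messages.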

Activating the release

Integration tests will be run against each new release before it becomes the active deployment. Following a successful run of tests, the Ingress for the API will be updated to point to the API server of the new release, leaving the existing release in place.
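
One way the Ingress switch could be expressed is as a JSON Patch against a `networking.k8s.io/v1` Ingress. This is a sketch under assumptions: the `spawn-api-<release>` Service naming is hypothetical, and it assumes the API is served by the first path of the first rule.

```python
import json


def ingress_switch_patch(release_id: str, rule_index: int = 0,
                         path_index: int = 0) -> list:
    """Build a JSON Patch (RFC 6902) that repoints one Ingress path at a release's Service."""
    service_name = f"spawn-api-{release_id}"  # hypothetical naming convention
    pointer = (f"/spec/rules/{rule_index}/http/paths/{path_index}"
               "/backend/service/name")
    return [{"op": "replace", "path": pointer, "value": service_name}]


patch = ingress_switch_patch("r42")
print(json.dumps(patch))
```

A patch like this could be applied with `kubectl patch ingress <name> --type=json -p '<patch>'`. Only the backend reference changes, so the previous release and its Service are left untouched.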

Clean up

A Prometheus metric recording the number of active tasks for a release will be regularly checked to determine when an existing release has finished its workload and can be destroyed. This will be accompanied by a minimum timespan (for example 1 hour) for which old releases are kept available in the event of issues with the new release.
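
The clean up rule combines two conditions, which can be sketched as a single check. The function and parameter names are hypothetical, the active task count is assumed to have already been read from the metric, and measuring the minimum timespan from the moment the release was superseded is our assumption for this example.

```python
from datetime import datetime, timedelta

MIN_RETENTION = timedelta(hours=1)  # the "for example 1 hour" minimum timespan


def can_destroy(active_tasks: int, superseded_at: datetime, now: datetime,
                min_retention: timedelta = MIN_RETENTION) -> bool:
    """True once a superseded release is idle and has been kept for the minimum timespan."""
    idle = active_tasks == 0
    retained_long_enough = now - superseded_at >= min_retention
    return idle and retained_long_enough


superseded = datetime(2021, 4, 6, 9, 0)
print(can_destroy(0, superseded, datetime(2021, 4, 6, 9, 30)))   # False: idle, but too recent
print(can_destroy(2, superseded, datetime(2021, 4, 6, 11, 0)))   # False: tasks still running
print(can_destroy(0, superseded, datetime(2021, 4, 6, 11, 0)))   # True
```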

Rollbacks

For issues found with a release soon after deployment, the previous release will still be available, either finishing in progress tasks or sitting dormant. In this case the Ingress can be quickly and easily switched back to the previous release so it can regain active status. In the event that previous releases have been cleaned up, they can be redeployed and treated like a new release.
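
The two rollback paths described here can be sketched as a small decision function. The `Release` type and the action names are hypothetical and exist only to illustrate the choice between switching the Ingress back and redeploying.

```python
from typing import NamedTuple, Sequence, Tuple


class Release(NamedTuple):
    release_id: str
    deployed: bool  # False once the release has been cleaned up


def rollback_plan(history: Sequence[Release]) -> Tuple[str, str]:
    """Decide how to roll back from the newest release in `history` (oldest first).

    Returns ("switch", id) when the previous release is still deployed and only
    the Ingress needs to change, or ("redeploy", id) when it must be deployed again.
    """
    if len(history) < 2:
        raise ValueError("no previous release to roll back to")
    previous = history[-2]
    action = "switch" if previous.deployed else "redeploy"
    return action, previous.release_id


print(rollback_plan([Release("r41", True), Release("r42", True)]))   # ('switch', 'r41')
print(rollback_plan([Release("r41", False), Release("r42", True)]))  # ('redeploy', 'r41')
```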

This is something that we will start working towards in the next few weeks. Do you have any experiences of implementing zero downtime deployments that you can share to help us or others with this journey?