Master breaking change deployments – 11 small steps

Whenever systems are subject to a change we have to uphold compatibility to the clients. Any change that breaks a contract (technically or semantically) will in best case lead to traceable malfunction or in worst case to a broken, but silently proceeding system. In most cases when we change API contracts we just extend the old contract. In practice breaking changes will occur. Good news is that there is a strategy called „Expand and Contract“-pattern which also plays well together with modern platforms.

In a further, not yet published blog post and code repository, I will describe everything step by step with code examples using Kubernetes.

The problem of a breaking change

We should not do that, right? We often try hard to not break any API and add a lot of technical debt to the system just because it was designed in a way that doesn’t work anymore for new features we have to add now. From that experience, that’s also when developers try to find the very best solution for any possible future and over-engineer things for new features. And that’s where software development can get very slow and nevertheless at some point in the future we would like to break our API to not implement the workaround for the workaround, which means that whatever we consider for our design:

Don’t spend too much time preparing for every eventuality as things will always change. Rather pick the easiest solution, which is good enough for now and know how to easily change it in the future.

Although we have tooling for deployments we are responsible on our own to patch our software that it doesn’t break and the patch itself has to match the deployment strategy. Luckily there is a pattern called “Expand and Contract” which doesn’t need any sophisticated tooling or technology and describes a strategy of changes in software that will in sum apply a breaking change by consecutive small non-breaking changes.

Expand and Contract

We start with an overview of the essential parts of that pattern. It basically starts with the expand-phase and ends with the contract-phase. Applying the whole change at once break the system, but going through all phases in order makes it work. The central idea is that in a client-server relation at first the server has to expand and allow the clients to use that extension. After a few steps, we can contract everything and clean up the old implementation. It works the same for server with database, where the server actually is the client in that relation.

A common use case is a breaking change in the database model, where the data structure for a new feature doesn’t work anymore and should be altered. The following steps are described with a Canary Release, which is the default rollout strategy in Kubernetes.

1 – Expand phase

At first we start with a running system, where we identified Table X to be changed because for example the cardinality in table will not work anymore with a new feature and we have to prepare that in order to be able to implement the new feature.

Breaking changes - Running system
1 – Running System

Instead of changing and migrating everything with the next version, we start with the “Expand”-phase and create a new table Table N that contains the new structure and our Server v2 should keep everything as is, but write updates also to Table N. Notice that with the introduction of Table N the information content and source of truth Table X stayed the same. Because we now keep partial duplicated information it is inconsistent, due to Server v1 will not update Table N.

The goal is to guarantee that Server v2 will not introduce new inconsistencies from a consistent state, writing to Table X and Table N . Thus writing to both tables should be executed in a transaction.

Breaking changes - Write new data structure
2 – Write new data structure

Before we bring both tables to a consistent state, we need to make sure, that all Server v1 instances are drained, because they would introduce new inconsistent data.

Breaking changes - Drain old server instances
3 – Drain old server instances

2 – Migrate

As soon as all Server v1 instances have been gone it’s guaranteed that a migration to a consistent state will uphold consistency from that point on. The migration should take care, that still the source of truth is Table X and it should only change Table N. The good thing is that the migration job can be split into arbitrary sized chunks of work to not overload the system and the migration step is not time-critical at all, so the system can run in this state for a longer time. In other words the migration makes the data converges iteratively to a consistent state while migrated data stays consistent.

Breaking changes - Synchronize old and new data structure
4 – Synchronize old and new data structure

After the migration we should be able to verify over time that the data really stays consistent. We could therefore let the system run for a while.

Breaking changes - Verify consistency (optional)
5 – Verify consistency (optional)

3 – Contract phase

With a better confidence that our system does the right thing for the change, we can now launch our Server v3 which does the same as Server v2 but symmetrically inverted. It should now use Table N as the source of truth for reading and writing and write to Table X.

Breaking changes - Write old, read+write new data structure
6 – Write old, read+write new data structure

Before we deploy our last version, which uses Table N as source and not using Table X anymore, we need to make sure that no Server v2 instances are running anymore, as the last version (again symmetric to Server v1 at the beginning) will introduce inconsistent data between both tables as Table X is not updated anymore.

Breaking changes - Drain old server instances
7 – Drain old server instances

Last but not least we deploy a version that only uses the new structure Table N thus again introduces inconsistent data as Table X is not updated anymore.

Breaking changes - Read+write new data structure
8 – Read+write new data structure

Before we’re allowed to clean up and remove Table X we need to drain Server v3 instances.

Breaking changes - Drain old server instances
9 – Drain old server instances

For the last deployment the server is allowed to remove Table X.

Breaking changes - Clean up old data structures
10 – Clean up old data structures

Finally after all Service v4 instances has been stopped we have successfully rolled out a breaking database change on a productive system.

Breaking changes - Drain old server instances
11 – Drain old server instances

Recap – Expand and Contract

As a short recap here are all steps in one diagram:

Expand and Contract steps

Each server version will have a single source of truth for reading, while possibly writing data to more than one source.

Breaking changes in practice

Companies often go a quicker, but riskier way and do a Blue-Green deployment, where the “Green” version breaks the “Blue” version and put the change together with the migration in one step. The higher risk is also often tried to be compensated by a defined service downtime window and/or deployments outside the business hours, if possible. Depending on the criticality of the service, as well as the extent of the change and other factors, the “quick way” can become a severe problem when rolling-back the breaking change was not consider or for some reason doesn’t work because it wasn’t properly tested. With Expand and Contract going back one version is always possible.

Conclusion

Utilizing Expand and Contract means more overhead, but a smooth and solid transition without any “point of no return”-barriers, as well as no time pressure at all. Every deployment step will allow to rollback one version if found faulty. There are no downtimes and a possible large migration job can be split into smaller tasks to reduce delays for concurrent requests.

Even though this pattern could be considered a secret weapon for any type of system change, it also often turns out that it’s not necessary, because we don’t have to meet all the requirements the pattern can fulfill. Another underestimated aspect is that changes can become very complicated. Starting with some infrastructure setups that may not support easy, automated deployments and finally the implementation itself may turn out to be very complex, because of the business domain. Try to find your solution according to your own taste.

Outlook

As already mentioned the Expand and Contract pattern can get quite complex even for smaller changes. To be able to focus on the change itself it’s good to have a solid platform that allows for convenient deployment and rollback. One of them is Kubernetes and in Part 2 I’ll show you how to use it to rollout a breaking change, step by step.

If you’re interested in having your own bullet-proof platform for development and production, just feel free to reach out to us to help you.

Kommentar verfassen

Deine E-Mail-Adresse wird nicht veröffentlicht. Erforderliche Felder sind mit * markiert

Nach oben scrollen
WordPress Cookie Hinweis von Real Cookie Banner