Code Freeze - An Answer to the Wrong Question
Code freeze is a solution. But not many people discuss what problem it's a solution for.
This article seeks to shed light on this solution, the problem it addresses, and the other problems that may be related to it.
Understanding code freeze
What's a code freeze?
Code freeze is the process of stopping all changes to a certain piece of software for a certain period. Here are some examples:
- A company is going to do a product demo for a client, and they ask their engineering team to stop deploying website changes so that they're sure that the website is stable for the demo.
- An open-source project does a code freeze before a large release, so that contributors don't introduce any last-minute surprises.
- An institution is preparing for the holidays and asks their team to stop all changes so that they're sure that there won't be new bugs introduced and noticed during their time off.
In all the examples above, we can notice a few common characteristics:
- They express situations where there's a lack of trust (in the software, the people, or both).
- They're used as a means to protect the organization that's in charge of the software, and its people.
What does it solve?
From the characteristics mentioned above (lack of trust, protecting the organization), we know that code freeze does protect the organization at inconvenient times, so it's a solution for that. And it's used whenever the organization cannot afford instability for long periods of time. But it's not at all a solution to the lack of trust, since trust is not something that comes and goes after a few hours or days - rather, trust is a cultural aspect, and it takes time to build, be it towards people or software.
Therefore, code freeze can be seen much more as a palliative than a real solution - organizations assume they have a lack of trust and, therefore, try to reduce the damage that it can cause, instead of fixing the lack of trust at the source. The question that code freeze answers is: "How can we avoid sabotaging ourselves at the worst possible times?"
No-Deployment Fridays
No-deployment Fridays are a particular kind of code freeze: they happen every week, every Friday, reducing deployment capacity to 80% of the working week. The reason they're adopted is so that issues introduced on a Friday don't cause havoc during the weekend (nobody wants to start debugging software on a weekend, right?!).
Let me emphasize: "issues introduced on a Friday / havoc during the weekend". There's an implied long feedback loop here, besides the lack of trust. This is important.
Is it possible to avoid code freeze altogether?
Yes, it is. At least in many situations, perhaps the majority of them, if we take "code freeze" as being a function of "lack of trust".
So let's see how we can fix the lack of trust itself.
Lack of trust in people
Starting off with open-source software: if the project chooses to accept contributions from a large number of people, then it's hard, sometimes impossible, to build trust with everyone involved. Something that can be done to mitigate this is to subdivide contributors into smaller teams, each with their own "representative", and have a "core team" that gets changes from these representatives. But this is not always feasible.
When it comes to smaller teams, or to organizations where only internal staff can contribute, it's easier to solve: it's about getting good engineers (more or less experienced, depending on the organization's goals and needs) or training the current ones, so that the organization can start trusting them.
Lack of trust in software
Lack of trust in software might be about the development process, the release process, or both. There's an easy part and a hard part to building that trust.
The hard part is changing the engineering culture. The people involved might need to accept working in a different way than they're used to, and there might be resistance - out of skepticism, fear, protection of the status quo, or a combination of these. Showing how the culture can be better might help, but it's not guaranteed to work.
Changing the processes, though, is the easy part. It all comes down to quality assurance embedded in every stage of these processes - before and after deployments. By fixing these, the organization can build trust in the changes being introduced, even if there are still bugs along the way.
Changing the processes
As stated earlier, the development processes might need to be changed (perhaps dramatically) for the organization to avoid code freezes. If doing deployments is painful and brittle, then the solution is not to do fewer of them - rather, it's doing more, way more of them. But the processes need to be changed - the team can't just start deploying all the time without changing anything else.
In general, the engineering team has to adopt an inflexible "top-quality stance": whatever they do, they always ensure they're doing it with the best quality they can, given what they know about what needs to be done. Incurring technical debt (not to be confused with cruft, or crap code) should be acceptable as a normal part of the evolution of the project. These are some of the items the team needs to commit to:
- Aggressive testing: no matter if the team does TDD (preferably), test-later, or a combination of the two, they need to make sure every behavior in the system they develop is covered by automated tests, and the test suites need to be high quality too - tests should clearly specify the behaviors and expectations they're working with, not just cover lines of code (see the test sketch after this list).
- Great code design: great code design is simple to understand and easy to change. It carries low cognitive load and welcomes developers without making them feel overwhelmed or like they're fighting the code.
- Fast deployments: deployments should take a few minutes at most, preferably under a minute if possible. This way, developers face no friction to deploy, and they end up doing more deployments, with smaller changes.
- Small deployments: the smaller the deployments are, the easier it is to spot problems, and the narrower the impact they cause on average. Working this way, combined with fast deployments, dramatically reduces the risk of each deployment.
- Easy and reliable rollbacks: there's no such thing as "bug-free software", and, in case of bugs, the team needs an easy way out - which is normally a rollback. Rollbacks need to happen in seconds, not minutes, so that the team doesn't have to fear the impact of the issues introduced.
- Feature flags: since deployments don't need to coincide with public releases in many cases, teams can (and should) use feature flags to control what they make available to end users. This is not only a way to "buffer" visible changes without stopping deployments, but also an alternative way to disable something new that turns out to be broken - which is sometimes easier and/or preferable to rolling back the code (see the flag sketch after this list).
- Aggressive observability: it's not enough to make changes easy and reliable; the team also needs to aggressively publish and explore application telemetry, in order to better understand how their systems are running in production. The better they can observe their product running live, in the hands of end users, the better they can gauge its health, and the faster they can act to fix possible issues (a telemetry sketch also follows this list).
- Shift-left on security: instead of leaving security as an afterthought, the engineering team needs to adopt a "secure by default" stance and add all necessary tools to ensure their systems are as secure as possible. Tools like container image scanners, dependency scanners etc. come in very handy as part of the team's integration pipelines.
- Other quality tools: it's also very important to have other code quality tools, like formatters and linters, type checkers (in the case of dynamically typed languages), documentation checkers, and so on, to help keep the codebase maintainable.
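To make the "aggressive testing" point more concrete, here is a minimal sketch of a behavior-focused test written with pytest. The ShoppingCart class and its API are hypothetical, invented purely for illustration; the point is that the test names and assertions describe expected behaviors, not implementation details or line coverage.

```python
# A minimal sketch of behavior-focused tests, runnable with pytest.
# ShoppingCart is a hypothetical class used only for illustration.

class ShoppingCart:
    def __init__(self):
        self._items = []

    def add(self, name, price_cents, quantity=1):
        self._items.append((name, price_cents, quantity))

    def total_cents(self):
        return sum(price * qty for _, price, qty in self._items)


def test_total_reflects_every_item_added():
    cart = ShoppingCart()
    cart.add("coffee", price_cents=450)
    cart.add("bagel", price_cents=300, quantity=2)
    # The test name and assertion state the expected behavior,
    # not how the total is computed internally.
    assert cart.total_cents() == 1050


def test_empty_cart_costs_nothing():
    assert ShoppingCart().total_cents() == 0
```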
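And here is a minimal sketch of the feature flag idea, assuming flags are read from environment variables; real teams would more likely use a flag service or database-backed configuration. The checkout functions are hypothetical stand-ins for a new feature and its fallback.

```python
# A minimal feature flag sketch. Flags are read from environment variables
# (e.g. FEATURE_NEW_CHECKOUT=true); the checkout flows are hypothetical.

import os


def flag_enabled(name: str, default: bool = False) -> bool:
    """Return True when the flag is switched on in the environment."""
    raw = os.environ.get(f"FEATURE_{name.upper()}", str(default)).lower()
    return raw in ("1", "true", "yes", "on")


def legacy_checkout(order: dict) -> str:
    return f"processed {order['id']} with the legacy flow"


def new_checkout(order: dict) -> str:
    return f"processed {order['id']} with the new flow"


def checkout(order: dict) -> str:
    # The new code is deployed either way; the flag decides whether end
    # users see it. Turning the flag off disables it without a rollback.
    if flag_enabled("new_checkout"):
        return new_checkout(order)
    return legacy_checkout(order)


if __name__ == "__main__":
    print(checkout({"id": "42"}))
```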
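Finally, a minimal sketch of publishing structured telemetry from application code, using only the Python standard library. The event names and fields are illustrative assumptions; in practice these events would be shipped to a metrics or tracing backend such as Prometheus or OpenTelemetry rather than just logged.

```python
# A minimal sketch of structured telemetry events, standard library only.
# Event names and fields are illustrative assumptions.

import json
import logging
import time

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("telemetry")


def emit(event: str, **fields) -> None:
    # Emit one structured event per line, easy to parse and aggregate later.
    log.info(json.dumps({"event": event, "ts": time.time(), **fields}))


def handle_request(path: str) -> None:
    start = time.monotonic()
    try:
        # ... real request handling would happen here ...
        emit("request.handled", path=path,
             duration_ms=(time.monotonic() - start) * 1000)
    except Exception as exc:
        emit("request.failed", path=path, error=type(exc).__name__)
        raise


if __name__ == "__main__":
    handle_request("/checkout")
```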
Adopting all of these will, at least in many cases, bring engineering teams to a state of comfort and confidence in deploying pretty much any time, on any day.
Conclusion
I'm a firm believer that it's possible to drop code freezes altogether, and to say "goodbye" to the idea of "no-deployment Fridays", giving teams back that 20% of deployment time. But there are certain critical pillars that have to be put in place for this to happen, as explained above.