On rebuilding a system from scratch

Recently I've been seeing a number of respected (and really good) higher-level managers state that an engineering team should never rebuild a system from scratch, and that it should instead always find ways to improve the existing system. As much as I understand the reasoning behind that, which involves risk, cost and other factors, I find that perspective dogmatic, and would like to offer a different one.

Actually, I'd like to tell my own success story about rebuilding a system from scratch, as this will feed into my opinion about the subject.

My story

Context

I used to work for a company called YouGov, in a team that built their most successful product at the time: BrandIndex.

The team consisted of a manager, myself as a Senior Engineer and Backend Lead, another Senior Engineer focused on backend and another Senior Engineer focused on frontend. All of us had at least 10 years of experience. It was a pretty amazing team; actually, the best team I have ever worked with. The manager had been there since 2011 or so, I had been there since 2014, the other backend engineer since 2013 and the frontend engineer since 2017.

The challenge

BrandIndex had been built using a few old technologies and some new ones, but without much focus on architecture. It had grown from a prototypical structure, which served its purpose very well at the start of the project, into a large monolith, and certain places were a big ball of mud. A system that once worked very well, and that allowed the team to iterate fast and discover things to feed into the product, had morphed into a significantly messy one that was super hard to test, refactor and maintain.

Over the years, we spent considerable effort on increasing code coverage and refactoring the codebase to make everything more readable and controllable. Many of the view functions (think "Controller" in the MVC pattern) had hundreds or even thousands of lines and touched everything from MongoDB to e-mail sending to file storage. Refactoring was a massive undertaking, but one that had to be done; otherwise we would quickly have lost the ability to make changes to the project in a reasonable time. However, the lack of an architectural vision, of a minimal design that could help us make the project performant, scalable and extensible, was gradually putting a hard stop on our ability to add new features.

The decision

In 2017, after 3 years working there, I started telling my superiors that we were hitting the limit of what the project could reach, and thus it was time for a rebuild. Some asked if we could reuse the majority of the code and technology we had, but I was emphatic about the need for a technology and architecture reassessment. A number of features being requested by the business teams just weren't feasible in the old system, and we made sure that was clearly communicated.

One year later, in 2018, we got the news: they accepted our proposal, and we could start rebuilding it!

It was time to start working on the new system. We needed a serious plan, and we needed to make it work well, otherwise it would come back to bite us. Should the new project fail, our team would take a severe hit to its reputation.

The journey

We started with the most obvious thing: the architecture. Since most of us had been working there for years, we were very comfortable with the domain, which allowed us to create a very solid design for the new system. At this point we knew our problems very well, we knew the trade-offs we were ready to make, and we knew what worked and what didn't in the old system. So we went with a service-based design. Maybe I can write about the architectural and technological choices in a later post, but that is out of the scope of this article.

But we didn't just abandon the old system: we still ended up implementing a few more changes to it while working on the new system at the same time. We also made sure that the data could be migrated to the new system (after transformations to adapt it to the new design). So there was a period when both systems were running in parallel. The sales teams continued to sell product subscriptions even with the old system running, and technical support continued to support it as well.

After implementing most of the old system's features in the new one, we started creating new features that had been in our backlog. People were very happy to see them come to life, and noticed that this was only possible thanks to the new system.

We also gradually migrated companies from the old system to the new one, first allowing them to use both, and later allowing only the new system.

The result

Shortly after we "finished" the new system (meaning it had all the relevant functionality of the old one, plus the new features that had long been in the backlog), we noticed that subscriptions had multiplied tenfold: we had 10x more users and companies than before, since the new features and the more robust system made the product easier to sell. We managed to take the product to a whole new level: much more reliable, performant, maintainable and extensible.

Not only that, but our team ended up being regarded as one of the most respected in the company. We never competed against the other teams, but it's interesting to notice how much respect we earned in a process that could have been a dramatic failure.

Why it succeeded

Let's do a recap of why this rebuilding of the system was so successful:

Team seniority

The fact that we had seniors in all areas of the team played a major role. If one or more of the areas hadn't had that, it could have compromised the quality of the new project, and we could have ended up in pretty much the same situation as before: an unmaintainable system. This reinforces my belief that every engineering team should have at least one senior engineer in each area.

Domain experience

Because we knew what the product was and what the system should accomplish, we didn't have to experiment too much before heading in the right direction. Without this domain knowledge, we would have done a lot of trial and error, spending much more money and time. This one is critical: I've seen a number of projects fail, either completely or at least in hitting the deadline, because the engineering team lacked domain experience.

Support by managers

Without approval from the higher management layers we wouldn't have been allowed to rebuild the system. Therefore, knowing how to "sell" this project to them was also critical, and one of the ingredients of success here was showing that we wouldn't be able to progress without it. For example, two of the new features, which were my ideas, were to model brand analyses as resources instead of distinguishing them by URL fragment parameters (as in the old system), and to let users share resources, including analyses, with each other. These features were so successful that they opened the door to more features being built on top of them, and they were only possible to build in the new system.

Keep the engine running

Even though we spent considerable effort on the new system, we still kept the old one running, and gradually transitioned users. This is similar in spirit to what is sometimes referred to as the "strangler fig" technique. This approach not only allowed us to keep the money coming in through the usage of the old system, but also allowed us to fall back on the old system whenever we made mistakes in the new one. If we had done a hard migration of users from the old to the new system, all at once, users would have been frustrated by the problems and confusion they faced, and we could have lost many subscriptions as a result.
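To make the gradual transition a bit more concrete, here is a minimal sketch of how a per-company rollout could be modelled. This is not the code we actually used: the `Stage` enum, the `ROLLOUT` table, the company names and the `route` function are all hypothetical, and the sketch assumes each company is in exactly one of three stages (old system only, both systems, new system only).

```python
from enum import Enum


class Stage(Enum):
    OLD_ONLY = "old_only"  # not migrated yet
    BOTH = "both"          # transition period: either system may be used
    NEW_ONLY = "new_only"  # migration complete


# Hypothetical rollout table; the company names are made up.
ROLLOUT = {
    "acme": Stage.NEW_ONLY,
    "globex": Stage.BOTH,
}


def allowed_systems(company: str) -> set[str]:
    """Return the systems a company may use; unknown companies stay on the old one."""
    stage = ROLLOUT.get(company, Stage.OLD_ONLY)
    if stage is Stage.OLD_ONLY:
        return {"old"}
    if stage is Stage.BOTH:
        return {"old", "new"}
    return {"new"}


def route(company: str, requested: str) -> str:
    """Serve the requested system if allowed, otherwise fall back to the allowed one."""
    if requested not in ("old", "new"):
        raise ValueError(f"unknown system: {requested!r}")
    allowed = allowed_systems(company)
    if requested in allowed:
        return requested
    # Only one system is allowed here, because BOTH would have matched above.
    return next(iter(allowed))
```

In a setup like this, moving a company forward is just a change to its entry in the table, and moving it back to `BOTH` is an easy way to retreat to the old system if something goes wrong in the new one.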

Support from other teams

The other, non-engineering teams kept working with the old system, selling it and providing technical support for it. Not only that, but they also started learning the new system and convincing users to adopt it early. This allowed us to fail faster but with much lower impact, which gave us a foundation to carefully engineer the new system.

Quality as non-negotiable

A few times we did have to take the pragmatism train, but we never compromised on quality. The new system ended up being solid as a rock, with the server side having 94% code coverage (last time I checked). I very proudly remember the other backend engineer starting to maintain the new system and telling me he enjoyed making changes to it, because they were easy to make, probably because the layers were well encapsulated and the responsibilities of the code modules (classes, functions, etc.) were clear.

Conclusion

Should engineering teams always rebuild problematic systems? Absolutely not. Actually, most of the time they should be trying to improve their current systems, not rebuilding them. It would be a gigantic waste of time and money if we rebuilt a system from scratch every time we found a big issue in it.

But sometimes it's necessary. Sometimes it's the difference between having a system that halts the progress of a product, and a system that opens the door to the future of the product.

#leadership #management