Essays

Headwinds to Redesign

Teams that plan to redesign later nearly always face business, technical, and social forces: headwinds that prevent them from initiating a redesign effort.

Michael Keeling

This essay was originally published in IEEE Software (Volume: 38, Issue: 2, March-April 2021).

From the Editor

A half-century ago, Winston Royce warned us that, in practice, a waterfall process “is risky and invites failure.” In a twist, Michael Keeling warns us that, in practice, teams using an iterative process will drag their feet on necessary redesign work, sabotaging the main advantage iterative processes have over waterfall.

- George Fairbanks, IEEE Software "Pragmatic Designer" column editor

On small systems, or when system requirements are churning, it’s common for teams to do only a little upfront design and instead start writing code with a plan to redesign later, as needed. In practice, however, teams often don’t follow through on those redesign plans.

Teams that plan to redesign later always face severe headwinds: business, technical, and social forces that blow them off course and prevent them from initiating a redesign effort. Teams that are successful with incremental design and minimal design upfront know how to overcome these headwinds. This is the tale of one team that chose redesign over upfront analysis, the headwinds faced, and how those headwinds were overcome to successfully redesign the software system.

The Road to Redesign

Once upon a time, I was on a team whose mission was to develop a set of cloud-based web services. Working in a start-up-like environment, we had tight deadlines and operated under extreme uncertainty. Fast time to market was essential to our business. Our product manager had identified a handful of core use cases, but it was difficult to prioritize work beyond these use cases without knowing how customers would respond to the first batch of features.

Recognizing that time to market was important and that requirements were still churning, we chose to incrementally design the architecture. Embracing a minimalist approach, we built the initial foundation by selecting thematic patterns for the architecture, laying out the basic web services we thought we’d need, and picking the primary technologies. After sketching a plan, we constructed and deployed a walking skeleton, the simplest implementation of our design that could run in the production environment.

Once the walking skeleton was ready, we shifted focus to putting “meat on the bones.” Starting with the most important features first, we designed, implemented, and deployed pieces of the system incrementally. Each deployment was a potentially shippable increment of the system.

As the software grew, we refined the architecture, recorded architecture decision records, and refactored the code in the small. We wrote well-crafted code that was easy to test, easy to maintain, and easy to change. We did our best work, but sometimes our decisions didn’t play out as we’d expected.

After a few months, we were ready for the official release. Now that people actually used our software, it wasn’t long before we were blessed with many of the problems we’d hoped success might someday bring.

From the outside, our web services were well designed and functioned as required. A look under the hood told a different story. Some web services were poorly partitioned and some architectural responsibilities were ill defined. There were holes in our logging and metrics that limited operational visibility. The internals of some services were overcomplicated and difficult to maintain.

End users wouldn’t have noticed, but it was nearly impossible for us to add new functionality to one of our most important web services. In the race to the finish, it had become a tangled mess of conditional statements and unfathomable exception handling. It was challenging to determine when a fault was permanent and when it might be possible to retry. It was difficult to isolate code for testing purposes. There were surely opportunities to improve performance as well as maintainability, modifiability, and testability.

When we decided to spend less time on upfront design and ship sooner, we understood there would be issues. We didn’t take the time to fully analyze the problem space or consequences of our decisions before implementing them. The idea was to release the web services as soon as possible and to make time for redesign after release. This plan fell by the wayside once our system was in use.

In fact, it was a full year before the promised redesign work would begin. As we discussed ideas for how to redesign the system, a series of headwinds emerged one after another and blew us off course: firefighting, fear, valuing features over architecture, and finally, lingering doubts. Before any redesign could start, we would need to overcome each of these headwinds (see "How Do I Overcome Headwinds?").

The Era of Firefighting

After the release, we spent nearly all of our effort keeping things running. We joked that the steady stream of alarms indicated how popular our web services were, which was only funny because it was somewhat true. The jokes weren’t as amusing at 3:00 a.m. while frantically working to discern real problems from the numerous false alarms.

Here’s an example of what we faced. A whole class of alarms weren’t caused by faults, but by legitimate users making honest mistakes while using our services. Users had direct access to our web service application programming interfaces (APIs), so input errors, such as a missing required field in a request, were expected to happen. Database constraint violations, by contrast, were truly exceptional errors that were not expected. The former is par for the course, while the latter is a genuine problem. In some web services, we didn’t do a good job distinguishing between the two, and both cases raised an alarm in our system.
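The distinction above can be sketched in a few lines. This is not the team’s actual code; all names here are hypothetical. The point is that expected input errors become client-facing responses, while only unexpected failures trigger the alarm path.

```python
# Sketch (hypothetical names): route expected input errors to a 400 response,
# and reserve alarm-level events for truly unexpected failures.

class InputError(Exception):
    """Expected: the caller sent a bad request (e.g., a missing field)."""

def validate_request(payload: dict) -> dict:
    if "user_id" not in payload:
        raise InputError("missing required field: user_id")
    return payload

def alert_on_call(exc: Exception) -> None:
    print(f"ALARM: {exc!r}")              # stand-in for a real paging system

def handle_request(payload: dict, save) -> tuple:
    try:
        record = validate_request(payload)
        save(record)                      # may hit a database constraint, etc.
        return 200, "ok"
    except InputError as e:
        # Par for the course: tell the caller, don't page anyone.
        return 400, str(e)
    except Exception as e:
        # Truly exceptional: this is what alarms are for.
        alert_on_call(e)
        return 500, "internal error"
```

With a layer like this, a user forgetting a field produces a 400 and a quiet log line, while a database constraint violation still wakes someone up.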

The result was a steady stream of alarms, a constant reminder of how little we knew when we started and how desperately we needed to redesign things. The idea of redesigning anything in such a climate was almost laughable.

As much as we would have loved to redesign the error handling and eradicate those problems from our system, we simply didn’t have the time while we were busy fighting fires. We only had time to treat symptoms. We improved logging and recalibrated alarms. We built tools to help us quickly triage alarms. We made deploying web services faster and easier. We made cheap, quick fixes that had a big impact.

Over several weeks, the number and frequency of alarms decreased to a tolerable hum. We finally had time to think about how to improve the system design and fix root causes. After surviving such a harrowing ordeal, some teammates were afraid that a redesign might upset the careful balance we had only recently achieved.

The Era of Fear

Looking at the data and reflecting on our recent experiences, it was easy to see that one web service in particular was causing most of our pain. This web service was difficult to reason about and was a linchpin in the architecture. It was an obvious target for immediate redesign.

Some on the team strongly disagreed. They argued that the service provided a key business differentiator and nearly all the traffic flowed through it, so it was too important to redesign. Yes, there were problems and it was difficult to change, but it was functionally correct and, most importantly, working in production right now. “Replacing a mission-critical service is just asking for trouble,” they said.

When a chunk of code is difficult to change, it isn’t an excuse to avoid redesign, it’s a reason to embrace it. That said, there was still something to be learned from our fears. It would be catastrophic if the redesigned service were worse, even temporarily. We needed a new, more detailed plan that everyone felt good about.

Good risk management put most of our fears at ease. In many cases, we already had work in progress that enabled us to safely redesign web services. Since the beginning of the project, we had built up a regression test suite as a safety net for redesign. We had also created a web proxy so that we could conditionally direct traffic as we pleased. The proxy required work, but it was exactly what we needed to run two versions of a web service in parallel.
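The heart of such a proxy is the routing decision. The sketch below shows one common way to make it, assuming a per-request key and a configurable rollout percentage; the names and approach are illustrative, not a description of our actual proxy.

```python
# Sketch (hypothetical names): deterministic traffic splitting, the kind of
# decision a proxy makes when running two versions of a service in parallel.
import hashlib

def choose_backend(request_key: str, new_service_percent: int) -> str:
    """Route a stable fraction of traffic to the redesigned service.

    The same request_key always lands on the same backend, so a given
    caller doesn't bounce between old and new implementations.
    """
    digest = hashlib.sha256(request_key.encode()).digest()
    bucket = digest[0] * 100 // 256       # 0..99, roughly uniform
    return "new" if bucket < new_service_percent else "old"

# Setting the percentage to 0 flips everyone back to the old service
# instantly; setting it to 100 completes the cutover.
```

Because the split is deterministic, kinks found in the new code can be reproduced by replaying the same requests, and rollback is a configuration change rather than a deployment.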

As we separated general anxiety from real engineering risk, we settled on a more detailed plan for redesigning this mission-critical service that everyone felt confident would work. The only thing missing was the time to do the work.

The Era of Features Over Architecture

Given a choice between redesign and feature work, we nearly always chose to implement new features. Everyone, including our product manager, agreed that some day, redesign would be necessary. In retrospect, we never discussed how we’d know when that day had come.

Our product manager could always tell us how customers would benefit from a particular new feature. He knew how many people had asked for a new capability and how likely a feature was to attract new customers. Meanwhile, the costs of redesign were easy to see, while the benefits were largely invisible.

Our only hope was to find a way to make those benefits visible. We did this by estimating the value of our most important quality attributes.

First, we looked at design-focused quality attributes such as maintainability, modifiability, and testability. We created soft metrics such as waste and potential efficiency gains to quantify them. For example, how much faster might we add a new feature if the code were more malleable? Design-time quality attributes helped make a stronger case, but soft metrics alone were not enough to convince our product manager—or ourselves—that the time was right for redesign.

Next, we looked at runtime quality attributes, such as performance and reliability. These quality attributes tell a dramatically more convincing story. We could measure real-world usage and concretely forecast how much users might benefit from particular improvements. For example, what if we could automatically retry certain faults or increase request throughput?
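The retry forecast is simple arithmetic. As a rough illustration (the numbers below are invented, not our production data): if attempts fail independently, automatic retries shrink the failure rate users see geometrically.

```python
# Rough illustration (invented numbers): how automatically retrying
# transient faults shrinks the effective failure rate users experience.

def effective_failure_rate(per_attempt_failure: float, attempts: int) -> float:
    """A request only fails if every independent attempt fails."""
    return per_attempt_failure ** attempts

rate_no_retry = effective_failure_rate(0.02, 1)      # 2% of requests fail
rate_with_retries = effective_failure_rate(0.02, 3)  # roughly 8e-06, ~0.0008%
```

Concrete numbers like these, tied to measured traffic, made the benefit of redesigning the error handling visible in a way soft metrics could not.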

The case for redesign became even stronger after looking at our feature road map. Based on past experiences, we were concerned about our ability to accommodate upcoming features with the existing codebase. If we didn’t act soon, it might be too late, and we would be unable to deliver important features in time for the big year-end trade show.

Redesign crept its way toward the top of the backlog. Nearly a year to the day after our first release, we were finally ready to kick off the redesign as we had always intended. Instead of a vague idea that we’d “learn by implementing first, redesigning later,” we now had a detailed plan for execution. That plan addressed the risks of redesign and showed that redesign had real value. We had the full support of our product manager and line manager.

We had conquered every headwind that had blown us off course since the initial release, except one. The last step was to start the work.

The Era of Lingering Doubt

One day, over tacos in the lunchroom, my manager asked me plainly, “Why haven’t we started redesigning the service? What are you waiting for?” For weeks we had discussed kicking off redesign of this core web service, and we had yet to start. After everything we had been through, now that it was the development team’s call, we hesitated.

I explained to my manager how the timing was wrong, how we couldn’t possibly afford to redesign things right now with everyone being so busy. He patiently listened and asked encouragingly, “What do you need to be ready to start?”

The realization washed over me all at once: What were we waiting for? The only thing preventing the redesign effort from starting was us. We had worked for so long in a world where we couldn’t redesign that we had become normalized to that reality. After lunch I checked in with my team, and we decided to just do it. That day, we built the scaffolding for what would become the newly redesigned web service.

Over the next six weeks, we redesigned everything in that linchpin web service. We separated the API code from business logic and the client adapters used to communicate with downstream web services. We introduced immutable business objects that could operate only on validated types and revamped the settings system. We banned exceptions for all but truly exceptional conditions and adopted monadic error handling as a core pattern.
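A minimal sketch of the monadic error-handling pattern, assuming a simple Result type (this is an illustration of the pattern, not the team’s actual implementation): expected failures flow through the pipeline as values instead of raising exceptions.

```python
# Sketch of monadic error handling: expected failures are values, not
# exceptions, and short-circuit the rest of the pipeline.
from dataclasses import dataclass
from typing import Callable, Union

@dataclass(frozen=True)
class Ok:
    value: object

@dataclass(frozen=True)
class Err:
    reason: str

Result = Union[Ok, Err]

def and_then(result: Result, fn: Callable) -> Result:
    """Chain a step that may itself fail; stop at the first Err."""
    return fn(result.value) if isinstance(result, Ok) else result

# Hypothetical steps producing a validated value.
def parse_amount(raw: str) -> Result:
    try:
        return Ok(int(raw))
    except ValueError:
        return Err(f"not a number: {raw!r}")

def check_positive(amount: int) -> Result:
    return Ok(amount) if amount > 0 else Err("amount must be positive")

good = and_then(parse_amount("42"), check_positive)   # Ok(value=42)
bad = and_then(parse_amount("-5"), check_positive)    # Err(...)
```

Banning exceptions for expected conditions makes every failure path visible in the function signatures, which is exactly what the old tangle of conditionals and exception handlers lacked.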

As planned, we used the existing regression test suite to not only check functionality of the new service but also to demonstrate our progress. We ran both old and new services in production simultaneously and switched between them freely as we worked out the kinks in the new code.

The redesigned web service was a success; it easily accepted new features, reduced faults by orders of magnitude, increased operational visibility, and improved performance. Perhaps most importantly, the redesigned web service created opportunities that let us embrace changes yet to come and paved the way for redesigning other parts of our architecture.

How Do I Overcome Headwinds?

Headwinds are the business, technical, and social forces that prevent a team from starting a planned redesign effort. Here are some ways to overcome them.

Headwind: Firefighting

The team is overwhelmed by persistent problems caused by the system to be redesigned. As a result, it’s difficult to find time for redesign work.

Try this: Create slack for redesign by making quick, cheap changes that will free up the team’s time. Examples include

- improving logging so real problems are easier to spot
- recalibrating alarms to reduce false positives
- building tools that speed up triage
- making deployment faster and easier

Headwind: Fear

The team is unwilling to redesign for fear of further deteriorating the current design.

Try this: Reduce anxiety by identifying concrete engineering risks and creating a plan to mitigate the top risks. Examples include

- maintaining a regression test suite as a safety net for redesign
- using a proxy to run old and new versions in parallel and direct traffic conditionally

Headwind: Features Over Architecture

The team consistently deprioritizes redesign work in favor of feature work.

Try this: Quantify the value gained by redesign so as to create a fair comparison with feature work. Examples include

- estimating design-time quality attributes with soft metrics such as waste and potential efficiency gains
- measuring runtime quality attributes such as performance and reliability, and forecasting how much users would benefit from improvements
- checking the feature road map for upcoming work the current design cannot accommodate

Headwind: Lingering Doubt

The team refuses to initiate redesign even when no other headwinds remain.

Try this: Just do it. Examples include

- asking the team plainly what it is waiting for
- building the first scaffolding for the redesigned component the same day you decide

Redesign Is Inevitable

After having lived this tale, if I could travel back in time, what advice would I give my team? Do less. Release sooner. Anticipate headwinds to redesign.

Embracing incremental design requires that we accept the inevitability of redesign. While redesign is inevitable, headwinds will defer it and prolong your suffering. Do not underestimate the effort required to overcome the headwinds.

Designing incrementally means flaws will make their way into the design. Code will be brittle, error prone, messy, challenging to reason about, or difficult to change. These design flaws slow teams down and require firefighting. Teammates will be afraid to make changes for fear of breaking something or making things worse. The team will agree to redesign someday, but features are prioritized over redesign, so someday never comes.

It’s up to you to recognize headwinds and overcome them. This is true even when your own lingering doubts are the only thing stopping you from getting started. Redesign doesn’t magically happen simply because you decided to forgo upfront analysis and design incrementally.

This is not a cautionary tale to scare teams away from incremental design. In many situations, it’s the best strategy. When planning to redesign, there will be headwinds. Forewarned is forearmed. Now that you know to look for headwinds, you can prepare to overcome them before they blow you too far off course.

Details

Published in: IEEE Software (Volume: 38, Issue: 2, March-April 2021)

DOI: 10.1109/MS.2020.3043081

M. Keeling, "Headwinds to Redesign," in IEEE Software, vol. 38, no. 2, pp. 128-132, March-April 2021, doi: 10.1109/MS.2020.3043081.

Tags: Design, Experience Report, IBM Stories, IEEE Software, Microservices, Pragmatic Designer Column, Refactoring

Change Log

Formatted for publishing on this website.
Final draft published by IEEE Software.