Towards an operational definition of tech debt
What is tech debt?
Informally, “technical debt,” or tech debt for short, is a metaphor for expedience in software development. You take shortcuts now, borrowing from the future. By building things more quickly now you can solve more business problems faster, and deal with technical problems later when you (hopefully) have more time.
It’s quite hard to measure tech debt. Under the finance metaphor, tech debt is incurred at the time the code is written, because it represents a deliberate choice by the code’s authors. I don’t disagree that this happens sometimes, but in my observation, systems accrue tech debt over time in the form of “bit rot” – not just via compound interest on sacrifices made for expedience. For example, the design of a library you depend on might have suited the original simple product, but the product is much more complex now, the dependency’s design no longer makes sense, and nobody has bothered to update it yet. Or you started requiring every client to use GraphQL to access your API, but all the old HTTP request infrastructure is still in use even though it’s complicated and does work that no longer needs doing. Or you invented a new way to write unit tests, but all the old tests are still in the old form.
I’m not certain, but I do think most of the tech debt I encounter is in this form. I realized maybe I could formalize it:
Definition: Tech debt is the difference between your codebase’s current state and the desired state, which is where it would be if you wrote it from scratch knowing what you know now to solve the problems it is currently solving.
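One way to write this down, as a rough sketch rather than a rigorous measure: let $C_t$ be the codebase at time $t$, $P_t$ the problems it is currently solving, and $K_t$ what you know at that point. These symbols, the from-scratch rewrite $C^{*}$, and the distance $d$ are all notation assumed here for illustration.

```latex
\[
  \mathrm{debt}(t) \;=\; d\bigl(C_t,\; C^{*}(P_t, K_t)\bigr)
\]
```

Here $d$ stands for whatever notion of “difference” you care about, for instance the effort needed to get from the current code to the rewrite.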
Interesting implications of this definition:
- The very first version of the product has zero tech debt. Because, by definition, when you write code for the first time, that’s the code you wrote from scratch to solve all the problems it is currently solving. But the debt starts creeping in as your software gets applied to new people/problems and as you learn more about how it should work.
- As you learn new things about how to do software engineering, you incur tech debt. Yep, that’s right. For example, if you realize you should have designed your API with GraphQL, or written your backend in another language – you just “discovered” tech debt in your code.
- Tech debt is in the eye of the beholder. A codebase does not have tech debt on its own; it only has tech debt from your current perspective on how it would best have been written.
- Adding a feature without changing other code can decrease tech debt. Heh: this scenario is bizarre but it could happen if (for example) you used to think: “my code doesn’t need to support complicated feature X and so my original organization was overkill”; but you just added feature X so now you’re thinking: “I’m glad I organized it that way so feature X was easy – I’d definitely do it that way in the future!”
- When modifying code, always aim at the end-state. Often when changing code, I am tempted to leave bits and pieces unchanged for historical reasons. But doing this adds to the tech debt of the codebase because it’s not how you would do it today.
Aiming at the end state
I want to expand on the last point, “aim at the end-state”: let’s say when adding a feature, I realize that a widely-used class or function I modified is now slightly misnamed for what it does today, even though the name used to be accurate. It’s a pain to rename things so I leave it. But the next person to come along and read the code is going to be a bit confused. That’s a real and painful consequence of this form of tech debt. Remember, code is read far more often than it is written.
Instead, when changing code, compare it to what you would have written if you had always known you needed that code. Usually this results in a bunch of things that would be different.
Design Example: Ledger
For example, let’s say you have a double-entry ledger, and you’re allowing people to send money from their own account to another account. Normally everything just touches two accounts – you send $100 to someone, and their account goes up by $100 and yours goes down by $100.
But now you want to add a promo where you get $5 off your next transfer. Prior to your code change, the system always assumed that there were exactly two accounts involved in each ledger entry and that the deltas add up to zero. But when adding the promo, you have to break one of these assumptions: the sender should only be debited $95 while the recipient gets $100, so there is a $5 difference, and you have to either stop assuming the deltas add to zero or take $5 from some other account in the process. And if you decide to take $5 from another account, you are again faced with two options: you could make the ledger support more than two accounts per entry, or you could break up your transaction into two transactions (the promo award followed by the transfer). Each of these choices has upsides and downsides – not only in how easy they are to implement, but in how well they reflect the proper “end state” of the code.
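To make the broken assumption concrete, here is a minimal Python sketch. The TwoAccountEntry class, its method, and the account names are hypothetical, invented for illustration; only the column names and the dollar amounts come from the example.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class TwoAccountEntry:
    """Hypothetical original ledger row: exactly two accounts, one amount."""
    debit_account: str   # account the money leaves
    credit_account: str  # account the money arrives in
    amount_cents: int

    def deltas(self) -> Dict[str, int]:
        # The two deltas are -amount and +amount, so they always sum to zero.
        return {
            self.debit_account: -self.amount_cents,
            self.credit_account: self.amount_cents,
        }


# A normal $100 transfer fits the model.
transfer = TwoAccountEntry("alice", "bob", 100_00)
assert sum(transfer.deltas().values()) == 0

# The promo transfer does not: the sender pays $95, the recipient gets $100.
promo_deltas = {"alice": -95_00, "bob": 100_00}
imbalance_cents = sum(promo_deltas.values())  # 500 cents appear from nowhere
```

A row with one amount and two account columns simply has no place to record where those 500 cents came from, which is why one of the two assumptions has to give.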
One might enumerate the options and analyze them as such:
- If you allow +/- deltas to not equal zero, you’ve broken a core assumption of double-entry accounting. This is easy to implement, but may be very difficult to account for in the future. If you have a bug with promo transactions, money could just disappear and be hard to track down. This seems quite far from ideal.
- If promo money comes from a third account and we split it up into multiple transactions, then the application gets a bit more complex. It might be easy to introduce bugs – maybe you want to undo a transaction but forget to undo the promo as well; or maybe you introduce a problem where the transaction order matters, because the person’s account might go to zero in between the two transactions. While better than #1, this is not ideal either.
- If promo money comes from a third account and we allow a single transaction to modify more than two accounts, we probably need to change the storage format of the ledger entries – no longer is there a single credit_account and debit_account column, we now need something else. So this option is more work, but it reduces the long-term technical pain of the above two options and thus I think it’s closest to the end state.
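As a rough illustration of the third option, here is one possible shape for a multi-account entry in Python. Posting, LedgerEntry, apply_entry, and the promo_budget account are hypothetical names, and the in-memory balances dictionary stands in for whatever storage a real ledger uses; this is a sketch under those assumptions, not a recommended schema.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class Posting:
    """One account's change within an entry (hypothetical name)."""
    account: str
    delta_cents: int  # negative = debited, positive = credited


@dataclass(frozen=True)
class LedgerEntry:
    """A single entry that may touch any number of accounts."""
    postings: Tuple[Posting, ...]

    def __post_init__(self) -> None:
        # Keep the double-entry invariant: the deltas must sum to zero.
        total = sum(p.delta_cents for p in self.postings)
        if total != 0:
            raise ValueError(f"unbalanced entry: deltas sum to {total} cents")


def apply_entry(balances: Dict[str, int], entry: LedgerEntry) -> Dict[str, int]:
    """Return new balances with every posting of the entry applied together."""
    updated = dict(balances)
    for posting in entry.postings:
        updated[posting.account] = updated.get(posting.account, 0) + posting.delta_cents
    return updated


# $100 transfer with a $5 promo: three accounts, one entry, still balanced.
promo_transfer = LedgerEntry((
    Posting("alice", -95_00),
    Posting("promo_budget", -5_00),
    Posting("bob", +100_00),
))
balances = apply_entry(
    {"alice": 200_00, "promo_budget": 50_00, "bob": 0},
    promo_transfer,
)
```

Because the promo and the transfer live in one entry, undoing the entry undoes both, and there is no intermediate state in which the sender's balance could be observed between the two steps.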
Now, having enumerated the options and compared them to the end state, you can actually decide what to do! Any of these options might end up being the right choice for your situation, depending on what exactly you’re trying to achieve, how serious your accountants are, how good you are at debugging, etc. In general, I like to do this “compare to end state” process both at the design level (like in this example) and the code level (e.g., what a given function’s abstraction boundary should be). A big fraction of all code-review comments I write are something along the lines of “if we wrote the code originally to support this feature, your change is not how I would have written it.”
Code-level Example: Formatting Logic
Here’s another: I’m a maintenance UI programmer and need to format a typed phone number on a certain screen. I know the code must already have a way to do something like this, because I can see the formatted phone number in another screen; but I can only find format_phone, which depends on a user and only works if we know the phone number is valid, but the typed number doesn’t have a user, and isn’t guaranteed valid. I might be tempted to create a new phone-formatting function from scratch, and this is often a fine thing to do – but it’s important to consider the end state:
- I could do the simplest possible thing (create a format_typed_phone function), but then the next person to come along finds format_phone as well as format_typed_phone in different places, and tears their hair out trying to figure out how and why this happened and which one they should use. They don’t have my context about why there are two functions.
- I could do #1 but also rename format_phone to format_user_phone, and ensure they are both documented. This should reduce confusion, but still requires people to know that there are two functions and choose between them, and there’s probably still some code duplication here.
- I could refactor the code so that the format_phone function doesn’t depend on a user and accepts invalid phones, then call it from everywhere.
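The third option might look something like the Python sketch below. Everything beyond the name format_phone is assumed for illustration: the User class is a hypothetical stand-in, and the rule that only 10-digit numbers are formatted as (555) 123-4567 is invented, since the real formatting rules are not given here.

```python
import re
from dataclasses import dataclass


def format_phone(raw: str) -> str:
    """Format a 10-digit number as (555) 123-4567; return other input unchanged.

    Assumption for this sketch: only 10-digit US-style numbers are formatted.
    """
    digits = re.sub(r"\D", "", raw)
    if len(digits) != 10:
        # Not guaranteed valid (e.g. the user is still typing): leave it alone.
        return raw
    return f"({digits[:3]}) {digits[3:6]}-{digits[6:]}"


@dataclass
class User:
    """Hypothetical stand-in for the existing user object."""
    phone: str


# Existing screens that have a user pass the stored number explicitly.
stored = format_phone(User(phone="5551234567").phone)

# The new screen passes whatever has been typed so far.
typed_complete = format_phone("555-123-4567")
typed_partial = format_phone("555-12")
```

The function takes a plain string, so callers with a user and callers with raw typed input share one code path, and the next reader finds exactly one function to choose from.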
Again, which you choose depends on the situation you’re in. I hope most programmers (when given the options) would spend the extra hour or two doing #3, so as to improve the code – yet I see a lot of people not even considering #3 in their daily work and I think it’s because they don’t have the process of thinking through the proper end states.
Indeed, the process of “diffing against what you would have written” may be very hard for newcomers to a codebase. To this I say yes, it is hard; but learning how your codebase was designed at a high level is usually important enough to take priority over your other work anyway. You don’t need a 100% precise understanding of it to avoid making serious tech-debt errors by implementing features at the wrong level of the code, or by duplicating functionality that already exists; but you DO need a rough understanding.
The principle of avoiding unnecessary tech debt means that you can make documentation demands from senior engineers. For example, if you couldn’t understand the overall structure of the code after a few hours, that’s a documentation issue. Every experienced programmer on a codebase knows which parts of the code are core and non-core and how the 80/20 control flow and data structures are represented, and it’s worth their time to write that stuff down so that newer engineers can quickly get their bearings.