Software Entropy
Posted November 21st, 2022
Software is written iteratively. It doesn't matter if you follow agile, waterfall, or any other methodology. You can only write one function at a time.
There is always some sort of cycle, whether that's a pull request, a commit, or even a change in your mental state.
The changes between each cycle can be small or huge. As you go through more cycles, the changes you introduce accumulate. By adding in a little functionality each time, eventually you have a working application. But, by making things marginally worse each time, you end up with a mess.
Imagine if every day, you write 60 lines of code. On the first day, that's manageable. The first few days are also manageable. But, after a month, that's 1 file with 1800 lines of code. Finding the right line of code to change and scrolling to it is now a bottleneck to future productivity. The likelihood that a new feature introduces a bug to an old feature becomes very high.
Similarly, if every day you put your code in a new file in the same directory, after a month you have a directory with 30 files in it. They're typically sorted alphabetically, but files near each other might not be related. Like the previous example, the speedy productivity of the first day slows to a crawl.
No matter what cycle size you use, you're screwed. If you write code in large cycles, then chances are you make things worse within any given cycle, e.g. when you write 100 lines of code or create 3 files at once. And if you use small cycles, then you go through more cycles, and the small regressions that escape your notice add up.
Let's extend this further. Add a couple more developers, and the team could be writing around 130 lines of code per day, or 5 new files.
Assuming that you work weekdays only, after 6 months (roughly 120 working days) that's about 16,000 lines of code, or 600 files.
One file with 16,000 lines of code is going to be very slow to edit in an editor. Syntax highlighting and other editor intelligence will be limited. Bug fixing becomes very difficult. It's very easy to figure out what to change in a 30-line file, and relatively easy in a 60-line file. But in a 16,000-line file, finding what to change, and making sure the change doesn't break something else, is almost impossible. The proper way to do it is with fast feedback and divide-and-conquer debugging, so you can narrow the bug down to 8,000 lines, then 4,000, 2,000, 1,000, 500, 250, 125, and eventually 1. But, because of the amount of code, recompiling and rechecking means you won't get fast feedback. And if things are not organised properly, the bug won't be confined to one line; many lines will be tangled together.
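That divide-and-conquer halving can be sketched in code. This is a toy model, not a real debugger: `bug_in_first` is a hypothetical check standing in for "recompile and retest the first k units" (lines, commits, or modules), and we assume a single change introduced the bug and it stays broken afterwards.

```python
def bisect_fault(n_units, bug_in_first):
    """Return (index, checks): the first unit that introduces the bug,
    and how many checks it took to find it.

    bug_in_first(k) -> True if the bug appears somewhere in units [0, k).
    Assumes exactly one unit introduces the bug.
    """
    # Invariant: bug_in_first(lo) is False, bug_in_first(hi) is True.
    lo, hi = 0, n_units
    checks = 0
    while hi - lo > 1:
        mid = (lo + hi) // 2
        checks += 1
        if bug_in_first(mid):
            hi = mid    # bug is in the first half
        else:
            lo = mid    # bug is in the second half
    return lo, checks

# 16,000 lines, bug hiding at line 12,345:
culprit, checks = bisect_fault(16_000, lambda k: k > 12_345)
print(culprit, checks)
```

Fourteen checks instead of sixteen thousand; which is exactly why slow recompiles hurt so much here, since each halving costs one full feedback cycle.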
Similarly, with 600 files in the same directory, finding the right file becomes a bottleneck.
By only focusing on the functionality during each cycle, there comes a point where it becomes prohibitively difficult to add new functionality, or fix existing functionality.
Anyone who intends to stay in business for long should be concerned about this.
The problem is not having many lines of code, or many files. It's having lots of code that is not organised properly.
Ideally, the files will be split into folders, with a few files in each folder. The file hierarchy would be logical, and it would be easy to find where to add or edit things based on the file structure. It would communicate information about how the system works. And each file would have a manageable amount of code in it.
The 600 files might be organised into 6 folders, each containing 10 folders, which each contain 10 files. Finding the right file would involve starting at the top of the hierarchy, asking which of the 6 most likely contains what you're looking for, and then which of the 10, twice. We can deal with that many options at once. If you have 600 files, there are going to be ways to group them that make sense to us and mean we don't have to think that much when finding the right folder.
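As a back-of-the-envelope check of those numbers: a flat directory makes you consider up to 600 names, while the 6 x 10 x 10 hierarchy only ever asks you to choose among a handful at each level.

```python
# Worst-case number of candidates to consider when hunting for one file.
flat_scan = 600          # every file in one directory
hierarchy = 6 + 10 + 10  # pick 1 of 6 folders, then 1 of 10, then 1 of 10

print(flat_scan, hierarchy)
```

26 small decisions instead of one 600-way scan, and each decision stays within the handful of options we can comfortably weigh at once.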
However, the number of ways to organise code well is vastly outnumbered by the ways to organise it badly.
If code is not modularised properly, then you will have to make changes in many different places for one requirement. And the file hierarchy would be a hindrance, since the divisions don't make sense.
Because of this, software projects often slow down as they grow.
I don't think this is inevitable. I think that by being concerned about code quality and modularity, and writing well factored code, projects can remain modifiable for years.
How can this be dealt with?
I've described what happens if this is completely ignored: it takes longer to add new functionality, and longer to fix bugs. For a business, this is great for the competition.
Not addressing this, and trying to work around it, is like never doing laundry or washing up dishes, and instead buying new ones all the time and stacking dirty things on top of each other, reducing the available space until there is none left.
Sometimes a business will let it go on for a long time before they recognise that this has become a bottleneck, and allow developers to majorly refactor the system.
This is better. I have seen some success with this. Unfortunately, it's very difficult, and it's hard to make significant progress. Typically more work is being done in parallel, on other branches. It's harder to make something better than to make it worse. If people are adding more functionality whilst the major refactor is underway, progress will not be noticeable.
Back to my washing up analogy, this is like not doing any washing up or laundry for a month, and then trying to clean everything all at once. In addition to killing the Hydra and capturing the guardian of the underworld, Hercules was tasked with cleaning the Augean stables in one night. They had not been cleaned in 30 years, and had over a thousand cattle.
The problem with not worrying about the future, or only worrying when you get there, is that eventually you do get there. And you have to deal with the consequences of the decisions you made that ignored the future.
That being said, it's possible to be too worried about this. You could be held back by analysis paralysis, refusing to write code that makes sense now because it won't make sense later. Or by being overly fussy about code in each cycle, even though the details being fussed over are not relevant to the big picture.
If a project is delayed too much for what seem to be aesthetic/purist programmer concerns, then it won't have a future.
Worrying about everything that could turn out to be a problem and trying to avoid it leads to overengineering: work wasted on things that turn out to be unnecessary.
I think the proper way to think about this is relative rates of change. If each cycle makes the system 5% worse but your practices make it 3% better, then quality still declines by roughly 2% per cycle, and that loss compounds.
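Compounded over many cycles, even a small net loss is brutal. A minimal sketch of that model, treating the 5% and 3% as multiplicative factors applied each cycle:

```python
# Each cycle: new work degrades quality 5%, cleanup recovers 3%.
quality = 1.0
for cycle in range(100):
    quality *= 0.95  # entropy added this cycle
    quality *= 1.03  # entropy removed this cycle

print(f"quality after 100 cycles: {quality:.2f}")
```

After 100 cycles, only around a tenth of the original quality remains. Flip the rates so improvement outpaces decay, and quality holds instead of collapsing; the gap is what matters, not either rate alone.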
It's important to make sure that the process of software development works towards addressing this gap.
There are a few factors that can help with this:
- Testing. Having good test coverage means that you can be more confident that a change made to the code does not introduce unexpected behaviour. This makes it easier to improve and rewrite code later on. Without this, you have to manually test it, and as the system grows the number of possible states grows significantly, making it prohibitively slow.
- Static typing. By pointing out errors quickly, less incorrect code is merged in each cycle.
- Linting. Not everything can be caught by static typing and tests. Linting checks for problematic code style, with rules often designed around data about common causes of bugs. Linters flag duplicated code and complexity that could be addressed, helping developers focus their efforts where they get the most bang for their buck.
- Code reviewing. By looking at each other's code, developers can spot mistakes, and suggest simpler ways of doing things.
- Pair programming. By comparing possible solutions, it's easier to simplify and introduce less unnecessary code. It's also a more effective form of code review, since both people understand the code far better than when reading through it in a pull request.
- Constant refactoring: the boy scout rule, leaving the code a little cleaner than you found it.
- Modularity, encapsulation, minimisation of dependencies. By reducing the number of "things" and their interactions, it becomes easier to reason about things independently, and modify them without worrying about everything else.
- Good design, continuously rethought. If the application itself isn't designed well, then programmers will have a hard time with it. If it's designed haphazardly, with features thrown together randomly, users unable to understand how to use it, and no one knowing how it works, then how are developers supposed to find the logical place to modify things?
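To illustrate the first of those factors, a tiny sketch of how tests enable later rewriting. The `slug` helper here is hypothetical; the point is that the assertions pin down behaviour, so the body can be refactored freely as long as they still pass.

```python
def slug(title):
    # Current implementation; free to rewrite as long as the tests pass.
    return "-".join(title.lower().split())

def test_slug():
    # These assertions are the safety net for any future refactor.
    assert slug("Software Entropy") == "software-entropy"
    assert slug("  extra   spaces ") == "extra-spaces"

test_slug()
print("ok")
```

Without tests like these, every rewrite of `slug` would mean manually re-checking each behaviour, and that cost grows with the system.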
Any given measure in a cycle could fail, but multiple measures will catch a lot.
However, the small amount missed still accumulates. And, even if you catch everything within your cycles, you still have to address phase changes: points where the existing structure stops scaling and a new level of organisation is needed.
The problem mentioned earlier, of too many files or lines of code, is not about the new code being bad.
At first, when all the code is in one file, that's not necessarily a problem when the code is only 60 lines. Dividing it into files isn't strictly necessary at that point. And the example file hierarchy is not necessary when there are only 4 files. Eventually, that example hierarchy will also need modification, once there are 1000 files, for example.
In George Miller's famous paper on working memory capacity, he estimated that people can only hold about 7 things in memory (plus or minus 2). Since then, that estimate has been revised down to 3-4 "things", if they're unfamiliar. In order to deal with more complex ideas, items can be chunked together.
To me, this says a great deal about how we should organise code. We should split code into functions that can be treated as one "thing". We should split those functions across files once there are too many in one file. We should organise files into directories once there are more than 5 in a directory. A directory should almost never have more than 10 things in it. When significantly below capacity, there's no need to worry about this. But close to it, developers should be wary. And above it, they'll have to apply a new level of organisation to their code.
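That last rule of thumb is easy to automate. A minimal sketch, assuming a simple "count the entries in each directory" definition of crowding (the function name and the limit of 10 are my own choices, taken from the rule above):

```python
import os

def crowded_dirs(root, limit=10):
    """Return directories under root holding more than `limit` entries,
    i.e. more than working memory comfortably handles."""
    return [path for path, dirs, files in os.walk(root)
            if len(dirs) + len(files) > limit]
```

Run occasionally, or wired into CI, a check like this turns the vague feeling of "this folder is getting big" into a concrete prompt to introduce the next level of organisation.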