Scientific Software Development
Posted December 31st, 2022
Knowing the difference between great features and useless features is the difference between a highly successful product and a failed product.
Great features and useless features take just as much time, effort, and money to create. And, before the product is built, they are hard to tell apart.
Any business that wants to make a highly successful product needs to find out which of its ideas are great and which are useless as soon as possible, so it can cut investment in the useless ideas and focus on the great ones.
A business needs to find out which features users will use and pay for, what problems users have, and whether the features address those problems.
Everyone can spin a nice story about why their idea is so great, but if the users don't follow through then it's pointless.
The best way to do this is to use the scientific method.
In every other field, when we want to make sure we aren't believing what we want to believe, we turn to science.
I'm not talking about "Scientific Management". In the early 20th century, prefixing things with "Scientific" made them seem legitimate, but often really meant something horrible and bureaucratic. I mean "real science", not "science theatre". That is, the techniques applied in fields such as chemistry and physics, applied to software.
What does this look like for software?
Hypotheses, Predictions, and Falsifiability
When thinking about a feature, think about what kind of results you could expect. Be specific. Make it measurable. Think about what kind of results would tell you that you are wrong. This kind of thinking is very useful for understanding the Return On Investment for a given feature. For example, if you think a feature will increase the retention rate by 10%, you can calculate how much that will increase revenue, and compare that to how much it might cost to implement it.
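To make that concrete, here's a rough sketch of the kind of calculation I mean, in TypeScript. The names, the numbers, and the simplifying assumption that revenue scales linearly with retention are all invented for illustration.

```typescript
// Back-of-the-envelope check of a feature hypothesis. Everything here is
// made up: the feature, the figures, and the simplifying assumption that
// extra retention translates proportionally into extra revenue.

interface FeatureHypothesis {
  name: string;
  expectedRetentionLift: number; // 0.10 means "we predict +10% retention"
  implementationCost: number;    // estimated cost to build and ship
}

function estimatedReturn(
  hypothesis: FeatureHypothesis,
  currentAnnualRevenue: number
): number {
  const extraRevenue = currentAnnualRevenue * hypothesis.expectedRetentionLift;
  return extraRevenue - hypothesis.implementationCost;
}

const darkMode: FeatureHypothesis = {
  name: "dark-mode",
  expectedRetentionLift: 0.1,
  implementationCost: 50_000,
};

// If the prediction holds, the feature is worth roughly this much. If the
// measured lift comes in well below 10%, the hypothesis is falsified and
// the same numbers tell you how much that matters.
console.log(estimatedReturn(darkMode, 1_000_000)); // 50000
```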
Cause and Effect
It's not enough to predict the outcome of an action. Science is about investigating how the action results in the outcome.
Scientists don't just observe that sunlight increases plant growth and go home, they understand the role sunlight plays in converting carbon dioxide and water into carbohydrates and oxygen.
Similarly, it's not enough to know whether moving a button around the page or changing its colour increases the retention rate; you want to know whether it's the extra visibility or accessibility that increases retention.
Knowing this, a company can then compare different ways of making the button more visible, for example.
Controlled Variables
Science does not always deal with controlled variables; sometimes it has to make do with what it has. When scientists are trying to predict when an earthquake will happen or when a volcano will erupt, they can't hold all the variables constant, but they can still study the question scientifically.
However, where possible, controlling variables leads to more confidence. For software, this means that rather than constantly changing things, change fewer things and give the data time to accumulate. If too many changes are made in a short period of time, it's hard to attribute a result to a specific feature.
If this is too prohibitive for a business, then the next best thing is feature flagging. Rather than controlling the variables, any given feature is enabled for some users and disabled for others, with the other features toggled on or off randomly across both groups, and the average difference between the groups can be compared.
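Here's a minimal sketch of what that could look like, assuming users are bucketed by a hash of their id and compared on a single metric. The names and the metric are illustrative, not a real feature-flagging library.

```typescript
// Feature flagging as an experiment: users are assigned the flag by hashing
// their id, and the average of some metric is compared between the group
// with the feature and the group without it.

interface User {
  id: string;
  weeklySessions: number; // stand-in for whatever metric you care about
}

// Stable assignment: the same user always lands in the same group.
function hasFeature(userId: string): boolean {
  let hash = 0;
  for (const ch of userId) {
    hash = (hash * 31 + ch.charCodeAt(0)) >>> 0;
  }
  return hash % 2 === 0;
}

function averageDifference(users: User[]): number {
  const mean = (group: User[]) =>
    group.reduce((sum, u) => sum + u.weeklySessions, 0) / group.length;
  const withFeature = users.filter((u) => hasFeature(u.id));
  const withoutFeature = users.filter((u) => !hasFeature(u.id));
  return mean(withFeature) - mean(withoutFeature);
}
```

Because the other flags are toggled randomly across both groups, their effects should average out, leaving the remaining difference attributable, on average, to the feature being measured.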
Controlling variables doesn't just apply to features; it applies to all observable behaviour of the system. This means no new bugs or regressions should make it into the system. Almost everything should behave exactly the same. If every release accidentally modifies existing functionality, it's hard to tell what difference the new feature made.
This means thorough testing before releasing into production, and a comprehensive automated test suite.
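As a sketch of the idea (not a recommendation of any particular test framework), pinning existing behaviour down can be as small as this, where `formatPrice` stands in for whatever the product already does.

```typescript
// Existing behaviour is pinned down by automated checks, so a release that
// accidentally changes it fails before it ships. In practice you'd use
// whatever test runner the project already has.

function formatPrice(cents: number): string {
  return `$${(cents / 100).toFixed(2)}`;
}

function assertEqual(actual: string, expected: string): void {
  if (actual !== expected) {
    throw new Error(`Regression: expected ${expected}, got ${actual}`);
  }
}

// These should keep passing, release after release, unless a change in
// behaviour is deliberate.
assertEqual(formatPrice(1999), "$19.99");
assertEqual(formatPrice(0), "$0.00");
```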
Large Representative Samples
If you have very few users, then any results you have could be due to random noise rather than the features themselves. If you separate users into small cohorts, you have the same issue.
"Get more users" isn't very actionable, since it's often the goal. However, if you're doing something like beta-testing, you can increase the size. Any testing that isn't done on production users runs the risk of biased sampling. Are the people doing the testing representitive of the users? If they're an in-house team, they're more likely to be familiar with the product and therefore miss things that a first time user might struggle with. Or maybe the testing team is more tech savvy than average, or don't have disabilities. If you want to minimise surprises when releasing into production, then either test on production users or make sure that the testing team is representitive.
Reproducibility
If an experiment is repeated by someone else and the same results aren't replicated, that's a bad sign in science. A result that can be replicated is a result that can be trusted. By not replicating tests, you run the risk of being swayed by experimental error, anomalies, or bias.
I'm not sure exactly how this translates to software, but I imagine the same company trying something again in different circumstances, or companies publishing the effect that, for example, adding deep links had on their Key Performance Indicators.
Interpretation
A common issue with science is that the thing you want to know is not what's easiest to test. But rather than waxing poetic, scientists look for what they can test. What they want to know is often much more general than what they can test. But headlines often skip that part.
For example, suppose a company is interested in understanding the impact that offers have on retention rate. They're going to test certain kinds of offers, within their own company. People looking at their results have to make sure they're not reading more into them than the data actually supports. The results don't say much about the effectiveness of other kinds of offers in other companies. Maybe the cost of the product affects the effectiveness of offers, or the age of the average customer does.
It's important to interpret the results in a way that is faithful to the data, which means becoming familiar with the details of the data collection and statistics.
If a new feature doesn't improve certain metrics, it could be because users don't like the feature, or it could be because they can't find it or access it easily. With enough data, it's possible to tell which. It's important to collect enough data to reduce the number of different interpretations that are consistent with it.
Avoid Epicycles
In a previous post on scientific debugging I told the story of how people tried to explain certain astronomical observations whilst preserving the idea of the Earth being at the centre of the universe by making the theory more and more complex - adding cycles within cycles within cycles.
I was historically wrong. At the time there were people who believed that the Earth orbited the sun, but scientifically it was just as appropriate to believe in the Epicyclic model, because the Heliocentric model meant revising their measurements of the distances to the sun and the rest of the stars. They had measured the furthest stars to be about as far away as the sun actually is, and measured the sun to be far closer, so they were understandably concerned about the idea.
Nevertheless, the moral of the story that's widely told is still useful.
When someone's pet theory is not treated well by the evidence, a common response is to posit more things to explain why the data is the way it is whilst preserving the theory. This is especially problematic when the theory is the Highest Paid Person's Opinion (HiPPO), and everyone else is tasked with validating it. It's important to give up on pet theories if the evidence continually disproves them, and switch to better ones.
If a feature consistently doesn't deliver on its promises, get rid of it.
Priors
Scientists don't start from a blank slate. They don't use random word generators for hypotheses. They don't forget the world is round when doing a new experiment.
No scientist is holding their breath hoping that an experiment they do today will show the world is round. If it doesn't show the Earth is round, they check their tools and assume they've made a mistake.
Instead, scientists rely on the results of previous experiments (especially ones that have met the conditions I've described), and use those to test new things. They use their knowledge to generate reasonable hypotheses, rather than spending all of their time on hypotheses with no chance of being true.
Similarly, you don't have to experimentally verify everything when making a new product. You can rely on the results of others instead of having to reinvent everything.
One of the implications of this is that not all changes are equally likely.
The particular shade of blue for a particular button might change, although it will probably change in the same way as the rest of the buttons.
The way that users sign into a service might change, but the fact that the service allows users to sign in won't.
For programmers, this affects the way they should write their code. The parts that aren't likely to change much can be written more concretely, which is good for performance and easier to understand. The parts that are more likely to change should be written so that they're easy to change in the ways they're likely to change. For example, the colour of a button should be a variable, and all of the primary buttons should use the same variable.
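As a sketch of what that might look like (the token names and values are invented):

```typescript
// "The colour should be a variable": the parts likely to change live in one
// place, and every primary button reads from it.

const theme = {
  primaryButtonColor: "#1a73e8",
  primaryButtonTextColor: "#ffffff",
} as const;

function primaryButtonStyle(): Record<string, string> {
  // Changing the shade later is a one-line edit to `theme`, not a hunt
  // through the codebase for every hard-coded hex value.
  return {
    backgroundColor: theme.primaryButtonColor,
    color: theme.primaryButtonTextColor,
  };
}
```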
Hawthorne Effect
One difficulty with social science is that unlike molecules, humans have memory and awareness. If people know that they're being observed, they might behave differently. This can skew the results of a study. This is something to consider when user groups are testing a product.
Putting it all together
In summary, we have an iterative development process, with some upfront planning. It shares some qualities with agile, but it's not quite the same. Getting rid of features is as important as adding them, and adding too many things too quickly is discouraged. Some of the things I have mentioned are costly, and may not make sense for a business. A company might not have enough evidence to satisfy a scientist that it should make a certain decision, but it would not want to risk its competition making that decision and succeeding.
I don't think that companies should seek to be 100% scientific, but I think that most companies can benefit from being more scientific than they currently are. There are probably low-cost scientific approaches that would make them more effective.