On Culture: Solving problems at a systemic level

Treating problems as isolated events increases the chance of their recurrence. It's far better to assume all problems are systemic and respond accordingly.

David Jarvis
Thursday, 2 November 2023

A common failure mode for companies is to write off negative incidents as one-time events. The problem with this approach is that one-time incidents are actually quite rare. If something can happen once, it can happen again...and it probably will.

As fintechs, we operate in an environment that has a very low tolerance for failure and we need to take problems seriously in order to be deserving of the trust that our customers and regulators have placed in us.

If your stakeholders feel that you are not deserving of that trust, your business will die.

Think in terms of systems, not individual accountability

Individual responsibility and accountability are important, but ensuring systemic resilience means you have to solve problems at a systemic level rather than an individual level.

Thinking about problems in terms of individuals leads to problems being written off as one-time incidents in a well-intentioned way ("oh, they just made a mistake, it won't happen again"). The trouble with this approach is that it does nothing to address the underlying cause.

By way of example: let's say someone makes a mistake. One response is to teach that person how to avoid that mistake, which means they probably won't make that particular mistake again. But a better response is to put in place a systemic check to make sure that class of mistake will be prevented in the future across your entire company. Maybe start screening for the mistake in your hiring process. Maybe add a new check to your technical infra to ensure that it's simply impossible for the mistake to happen. Whatever the response, you should think about it in terms of systems and processes, not individual people.

The existence of one problem means more problems are likely

The fact that something went wrong doesn't necessarily mean that more things like it have gone wrong or will go wrong, but it's pretty important to check. If the systems that should have prevented one thing from going wrong were not operating successfully, it's possible that there were other similar issues also going unchecked. Treating all problems as systemic is a good way to find out if your checks and controls are working as intended.

As the company scales, so will your problems

A once-a-year problem at 20 people will become a once-a-month problem at 200 and will happen multiple times a day at 2,000. You need to think about how issues will scale as your business grows.

When facing a problem, ask yourself: how much worse will this be when we're much bigger? Problems, for the most part, are much cheaper to catch and fix early on. An ounce of prevention is worth a pound of cure.
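To put rough numbers on this, here is a small Python sketch. It assumes, purely for illustration, that the number of incidents grows either in line with headcount or in line with the number of pairs of people who can interact. Neither model is based on real data, and the function name is made up for this example.

```python
def incidents_per_year(base_rate, base_headcount, headcount, exponent=1.0):
    """Scale a yearly incident rate by (headcount / base_headcount) ** exponent.

    exponent=1.0 assumes incidents grow with the number of people.
    exponent=2.0 assumes they grow roughly with pairwise interactions.
    Both are illustrative assumptions, not measurements.
    """
    return base_rate * (headcount / base_headcount) ** exponent


# Start from one incident a year at 20 people, as in the example above.
for headcount in (20, 200, 2000):
    linear = incidents_per_year(1, 20, headcount, exponent=1.0)
    pairwise = incidents_per_year(1, 20, headcount, exponent=2.0)
    print(f"{headcount:>5} people: ~{linear:,.0f}/year (linear), "
          f"~{pairwise:,.0f}/year (pairwise)")
```

Under the linear assumption, once a year at 20 people becomes about ten times a year at 200 and about a hundred times a year at 2,000. Under the pairwise assumption it becomes about a hundred times a year at 200 and about 10,000 times a year at 2,000, which is roughly 27 a day. The figures quoted above sit between those two curves, so the exact numbers matter less than the direction: rare problems stop being rare as you grow.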

Postmortems and remediation

Treating problems as systemic means always saying "unless we correct for this, it is highly likely to happen again". Correcting and remediating problems typically starts with a postmortem, which should answer the core question, "why did this happen?", and then follow that up with "what can we do to ensure this doesn't happen again?"

Here’s how to write a good postmortem:

Why did this happen?

Answering the core question of why usually means a bunch of other questions need to be answered. The following is a good start:

What: describe the event with as little ambiguity or uncertainty as possible. Be specific about the order of events, including dates and times where possible. This is also a good point to describe the impact of the event: did it lead to downtime, financial loss, or problems for customers? Even internal impacts like "led to internal communication failure between teams" are important.
Who: who discovered the issue? Who helped to address it? Avoid language that blames others; you want to focus on statements of fact here while you work to understand the issue constructively and collaboratively.
How: what was the initial trigger that caused this problem? How did you become aware of it, and how long did it take for you to become aware of it?
Why: why did this happen? In particular, was there an existing system in place to prevent this from happening?
1. If so, why didn't that system prevent it?
2. If not, why not? Was it a blind spot where you had no systems in place at all? Are there any adjacent blind spots you should be worried about? Alternatively, was it a space you knew could be a source of problems, but had decided not to worry about?

What can we do to ensure it doesn't happen again?

The right preventative system will depend on the nature of the issue, but the following is a good starting place:

Putting monitoring in place
Ensuring that the right permissions are required in order to take a given action (both when dealing with human and technical systems)
Requiring review prior to action (particularly from a manager / a different team / an external or independent advisor)
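
For technical systems, the three ideas above can be combined in a single guard. The following is a minimal Python sketch under stated assumptions, not a production design: the Action class, the execute function, the ControlError exception, the logger name, and the people and permission names are all hypothetical and exist only for this example.

```python
import logging
from dataclasses import dataclass, field

logger = logging.getLogger("controls")  # hypothetical logger name


class ControlError(Exception):
    """Raised when an action is blocked by a systemic check."""


@dataclass
class Action:
    name: str
    requested_by: str
    required_permission: str
    approved_by: set = field(default_factory=set)


def execute(action, permissions, run):
    """Call run() only if the permission and review checks both pass."""
    # Check 1: the requester must hold the permission this action needs.
    granted = permissions.get(action.requested_by, set())
    if action.required_permission not in granted:
        logger.warning("Blocked %r: %s lacks %s", action.name,
                       action.requested_by, action.required_permission)
        raise ControlError("missing permission")

    # Check 2: at least one person other than the requester must have reviewed it.
    independent_reviewers = action.approved_by - {action.requested_by}
    if not independent_reviewers:
        logger.warning("Blocked %r: no independent review", action.name)
        raise ControlError("independent review required")

    # Monitoring: record every action that goes through, and who reviewed it.
    logger.info("Running %r for %s, reviewed by %s", action.name,
                action.requested_by, sorted(independent_reviewers))
    return run()


# Example usage with made-up people and permissions.
permissions = {"alice": {"refund:issue"}}
refund = Action("issue refund", "alice", "refund:issue", approved_by={"bob"})
print(execute(refund, permissions, lambda: "refund issued"))
```

The point of the sketch is that the check lives in the system rather than in anyone's memory: a self-approved or unpermitted action is blocked for everyone, and every blocked or completed action leaves a log entry that monitoring can pick up.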

Caveats

Obviously, not all problems are systemic! But the moment you start to treat a problem as a one-time event you make it likely that the same problem is going to bite you again in the future. It's far better to assume all problems are systemic and respond accordingly. It's a bit of extra overhead for the occasional problem that is truly unique, but this is far outweighed by the benefits of catching systemic problems the first time around.