Expect Crashes

Dying in a car crash is one of the leading causes of death for those of us living in developed countries. It’s not surprising then that we spend a lot of time as a society trying to mitigate that risk. We implement things like speed limits and safety standards for vehicles and education programs for drivers to try and prevent crashes. Prevention is the best cure, and all that.

We don’t stop there though do we? We know that despite our best efforts, crashes are still going to happen and so we put in place things like seatbelts and airbags and safety rails. We also have tools in place to help us deal with the problems that arise after the crash.  We have ambulances, and paramedics and laws about moving over for emergency vehicles. We don’t just try to prevent crashes, we also try to mitigate the effect of crashes.

What I’ve been describing here is an approach to injury prevention that can be summarized with the Haddon Matrix.  We have a pre-event phase, a during event phase, and post-event phase and we have strategies to help mitigate the impact in each phase.

I like to take ideas from other fields and think about how they relate to testing, so let’s do that for a minute here.  What phase do we spend most of our time in as testers?

Traditionally it has been the pre-event phase.  We are trying to find the bugs before they ever make it to the customer.  We are trying to find the crashes and errors ahead of time.  We work primarily in the prevention realm. But shouldn’t we consider that despite our best efforts, some crashes will still happen? We will have issues that customers face, so what is our strategy at that point? What is our during and post event strategies for bugs that get exposed to customers?

Think about filling out something like the table below. I simplified the Haddon matrix by taking out environmental factors, but just the process of going through this could be a helpful way to see where you can invest as a company.  The ability to prevent problems is important and helpful, but as applications grow in size we will never be able to do that completely.  We need to have strategies in place to deal with what happens when things go south.  What are your strategies?

Phase Human Factors System Factors
Pre-Event
  • Testing
  • Dogfooding
  • Code Review
  • Feature Flags
  • Build Processes
  • Realistic Test Environments
During-Event
  •  Dynamic response to failures
  • Ability to debug in production
  • Immediate access to live production data
  • Logging & Alerts
  • Automatic fail safes
  • Self-healing capabilities
  • Flighting and rollback ability
Post-Event
  • Root cause analysis
  • Customer follow up
  • Quick build pipelines
  • Ability to get fixes to production in a timely manner

1 Comment

Leave a Comment

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s