I have a confession to make.
We don’t like to talk about the ‘bad’ things we do (at least not until those things are distant memories), but that isn’t really helpful, is it? If we don’t talk about our mistakes, how are people going to learn from them? So I have a confession to make.
I have been ignoring certain failures in my regression suite for months now. I have a couple of tests that fail about 25% of the time and then pass on the next run. So what have I done? Nothing. I just let them run again on the next build and pass. I haven’t looked into the failures. I haven’t tried to see if there are certain conditions that reproduce the failures. I haven’t taken any steps to deal with these problematic tests differently from the other tests. Nope. All I’ve done is look the other way, cross my fingers, and go on with my day. I’ve been too busy and I haven’t had time and I’ve… *mumbles more excuses*
Well, today it nearly came back to bite me. I was looking through the build history and realized that the build had been failing for a few days. That seemed odd, as the spurious failures would usually only fail one build, so I took a closer look. When I did, I realized that the failures were actually happening on similarly named tests and were due to a recent code change. It didn’t end up being a defect (I just needed to make a minor tweak to a couple of tests), but it made me realize that ignoring those spurious failures could easily have allowed other issues to slip through unnoticed. By being OK with failing tests, I had trained myself not to trust failing build statuses.
Also today, I saw an email from one of the customer engagement staff asking why he was seeing a certain error when trying to set up something for a potential sale. The error message looked familiar. I don’t think it is the exact same error I’m seeing (and in his case it is consistent), but it is probably in a similar area of the code. Maybe if I had spent more time tracking down the root cause of the errors I’m seeing, we could have found a fix for this already.
I’m sure most of us have been there. Spurious failures are part of the job when it comes to writing test automation, but don’t make my mistake. Don’t treat them as something unimportant that you can ignore. The reality is, as I have said elsewhere, your test automation changes your behaviour. If you have tests that fail sporadically, those failures will train you. They’ll train you to pay less attention to build failures. Is that really what you want from your automation?
Is it easy to deal with issues like this? No. Why do you think I’ve been ignoring it for so long? It’s really hard to figure out what to do and how to handle these kinds of failures, but isn’t that what you’re getting paid to do? Figure out the hard problems so that you can help your team release high-quality software.
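One low-effort place to start is simply measuring how often a suspect test really fails, rather than guessing. Here is a minimal Python sketch of that idea, not taken from any real suite: `measure_flake_rate` and `flaky_test` are hypothetical names, and the stand-in test is rigged to fail roughly 25% of the time using a seeded random generator.

```python
import random
from typing import Callable


def measure_flake_rate(test: Callable[[], None], runs: int = 200) -> float:
    """Run `test` repeatedly and return the fraction of runs that failed.

    A run counts as a failure if the test raises AssertionError.
    """
    failures = 0
    for _ in range(runs):
        try:
            test()
        except AssertionError:
            failures += 1
    return failures / runs


# Hypothetical stand-in for a flaky test (an assumption for illustration):
# it fails whenever the seeded generator draws a value below 0.25.
rng = random.Random(0)


def flaky_test() -> None:
    assert rng.random() >= 0.25, "spurious failure"


rate = measure_flake_rate(flaky_test, runs=1000)
print(f"failed {rate:.0%} of runs")
```

In a real suite you would pass in the actual test function and vary one condition at a time (machine, data set, ordering) to see whether the failure rate moves. A number that shifts with a condition is a clue; a number that never changes at least tells you how much noise you are living with.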
Alright, time to get back to work – I have a test failure to investigate.