Spurious Failures

I have a confession to make.

We don’t like to talk about the ‘bad’ things we do (at least not until those things are distant memories), but that isn’t really helpful, is it?  If we don’t talk about our mistakes, how are people going to learn from them?  So I have a confession to make.

I have been ignoring certain failures in my regression suite for months now.  I have a couple of tests that fail about 25% of the time and then pass on the next run.  So what have I done?  Nothing.  I just let them run again on the next build and pass.  I haven’t looked into the failures.  I haven’t tried to see if there are certain conditions that reproduce the failures.  I haven’t taken any steps to deal with these problematic tests differently from the other tests.  Nope.  All I’ve done is look the other way, cross my fingers, and go on with my day.  I’ve been too busy and I haven’t had time and I’ve….*mumbles more excuses*

Well today it nearly bit me in the back.  I was looking through the build history and realized that the build had been failing for a few days.  That seemed odd as usually the spurious failures would only fail one build, so I took a closer look.  When I did this I realized that the failures were actually happening on similarly named tests and were due to a recent code change.  It didn’t end up being a defect (I just needed to make a minor tweak to a couple tests), but it made me realize that my ignoring those spurious failures could have easily allowed other issues to slip through unnoticed.  By being ok with failing tests, I had trained myself to not trust failing build statuses.

Also today I saw an email from one of the customer engagement staff asking about why he was seeing a certain error when trying to set up something for a potential sale.  The error message looked familiar.  I don’t think it is the exact same error I’m seeing (and in his case it is consistent), but it is probably in a similar area of the code.  Maybe if I had spent more time on tracking down the root cause of the errors I’m seeing we could have found a fix for this already.

I’m sure most of us have been there.  Spurious failures are part of the job when it comes to writing test automation, but don’t make my mistake.  Don’t just treat it as an unimportant thing you can ignore.  The reality is, as I have said elsewhere, your test automation changes your behaviour.  If you have tests that fail sporadically, those failures will train you.  They’ll train you to pay less attention to build failures.  Is that really what you want from your automation?

Is it easy to deal with issues like this?  No.  Why do you think I’ve been ignoring it for so long?  It’s really hard to figure out what to do and how to handle these kinds of failures, but isn’t that what you’re getting paid to do?  Figure out the hard problems so that you can help your team release high quality software.
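One small place to start, as a rough sketch rather than a recipe: rerun the suspect test many times and count how it fails.  The Python below assumes the test can be called as a plain function.  The names `measure_flakiness` and `flaky_test` are hypothetical, and `flaky_test` is only a stand-in that fails about a quarter of the time.

```python
import random
from collections import Counter


def measure_flakiness(test, runs=200):
    """Call `test` repeatedly and count passes and failures by exception type."""
    outcomes = Counter()
    for _ in range(runs):
        try:
            test()
            outcomes["pass"] += 1
        except Exception as exc:
            # Group failures by exception type so different causes stay visible.
            outcomes[type(exc).__name__] += 1
    return outcomes


# Hypothetical stand-in for a spurious test: fails roughly 25% of the time.
rng = random.Random(0)


def flaky_test():
    if rng.random() < 0.25:
        raise AssertionError("intermittent failure")


results = measure_flakiness(flaky_test, runs=200)
failure_rate = 1 - results["pass"] / 200
print(dict(results), f"failure rate: {failure_rate:.0%}")
```

Knowing the actual failure rate, and whether the failures all share one exception type, at least turns ‘it fails sometimes’ into something you can investigate.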

Alright, time to get back to work – I have a test failure to investigate.

Trustworthy Tests

I was working on a script to use for an automated test and one of the checks in it was comparing two similar things with different inputs.  I decided to change one of them to make the difference between them more obvious, but to my surprise, when I ran the test everything passed.  What gives?  I had made a pretty significant change to the test; it should have been failing.  I tried out some stuff interactively and it all looked ok, so I ran the test again.  Still passing.

Puzzled, I reviewed the test again.  Everything looked fine, but clearly something was going wrong.  I started adding some debug output into the tests and after a couple of tries I found out that I had accidentally switched the order of some commands.  I was checking the output before updating the engine.  A simple switch later and everything was as I expected.

This is just a normal story in the life of a tester.  I’m sure anyone who has written automation scripts can relate to this, but let’s take a minute to think about the implications of it.  What if I hadn’t tried that change?  I would have just merged in the test and it would have been passing but telling us nothing. The passing state of the test was due to a bug in the test and not due to the code we were trying to check.  The code could have changed quite dramatically and this test would have happily kept on reporting green.

I’ll just get straight to the point here.  Your tests are only as good as you make them and since you are a human, you are going to have bugs in them sometimes.  A good rule of thumb is to not trust a test that has never failed.  I usually try to deliberately do something that I expect to cause a failure just to make sure my test is doing what I expect it to be doing.  Try it yourself.  You might be surprised at how often you find bugs in your tests just by doing this simple action.
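To make that concrete, here is a minimal Python sketch of the kind of ordering bug described above.  It is not the real test: `Engine`, `check_stale`, and `check_fixed` are made-up names, and the engine is a toy whose output only changes when `update()` is called.

```python
class Engine:
    """Hypothetical stand-in for the system under test."""

    def __init__(self):
        self.scale = 1
        self._output = 10  # result computed from the default settings

    def update(self):
        # Recompute the output from the current settings.
        self._output = 10 * self.scale

    def output(self):
        return self._output


def check_stale(scale, expected):
    """Buggy check: reads the output BEFORE updating the engine."""
    engine = Engine()
    engine.scale = scale
    result = engine.output()  # stale value, ignores the new scale
    engine.update()
    return result == expected


def check_fixed(scale, expected):
    """Corrected check: updates the engine, then reads the output."""
    engine = Engine()
    engine.scale = scale
    engine.update()
    return engine.output() == expected


# Deliberately change the input to something that should NOT match.
print(check_stale(3, 10))  # still passes, so the check is telling us nothing
print(check_fixed(3, 10))  # fails, as it should
```

The buggy version passes no matter what scale it is given, which is exactly what a deliberate ‘this should fail’ change exposes.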

Waiting too Long for Reviews

As we continue to refactor our automated test scripts, I have been a part of a number of pull request reviews.  I have also been involved in some code reviews for large changes that are coming.  Code (or test) review is a very important and helpful strategy for a team to participate in and I am thankful for the many tools that have enabled this and made it a common practice.  To be honest, I can’t really imagine a world where we didn’t have easy and frequent code reviewing.

But. (Could you sense that coming?)

Just because something is good and helpful, doesn’t mean it is perfect and certainly doesn’t mean we can use it in every situation without putting on our critical thinking caps.  Participating in so many reviews has made me put that hat on.  What kinds of things are code reviews good for, and more importantly, what are their limitations?

I think code reviews are very helpful for fine tuning.  ‘Don’t use that data type here.’ ‘Should you return this instead?’ ‘It looks like you are missing a check here.’  Comments like these help greatly in catching mistakes and tweaking things to get the code into a good, readable state.

Code reviews are great at helping us make those little adjustments that are needed to keep up the quality and maintainability, but what happens if the entire approach is wrong?  What if you are reviewing something and you want to just shout out, ‘NO, you’re doing it completely backwards’?  We all do silly things sometimes and need to be called out on it, but do code reviews help with this?  I guess in some ways they do.  The fact that someone else will see the code makes us think about it differently, and by putting it up for review someone might let us know we are approaching it from the wrong angle.  But think about it: how likely are you to, in effect, tell someone ‘you just wasted 3 days doing something you need to delete’?  And even if you do, how effective is this form of feedback?

If we rely on code review as the first time someone else sees major new work we are doing, we are waiting too long.  We need much earlier feedback on it.  Does my approach make sense?  Am I solving the correct problem here?  Are there any considerations I’m missing?  These kinds of questions need to be asked long before the code review goes up.  Many of them should be asked before any code is even written.  We need early communication on these kinds of things, so that we can get the feedback before we spend too much time going down a dead-end road.  Another consideration is that code reviews are usually text-based, and having a discussion on direction and approach via a text-based medium is likely to be an exercise in frustration.  Some things work better with face-to-face discussions.  The powerful tools that we have which enable easy code reviews can sometimes push us towards inefficient communication about the code.  Stay aware of how your tools affect you and seek to do what makes sense and not just what is easy.

I have been able to participate in more code reviews lately. Code reviews are an important part of testing and so I’m glad to be included in these, but there is still lots of testing work to do before the code review and I will be trying to get more testing input happening then. I guess it is my lot in life to celebrate the small victories (more code reviews – yay!) and then push on to the next thing.  Never stop improving!


Should Automation Find Bugs?

Our automated tests help us find a lot of bugs.  In our reporting system we link bug investigations to the tests that are failing because of them.  I looked through this the other day and was surprised by the number of bugs that have been found with our automated tests.  They hit a lot of issues.

At first blush this seems like a good thing.  They are doing what they are meant to, right?  They are helping us find regressions in the product before we ship to customers.  Isn’t this the whole point of having automated regression tests?  Well, yes and no.

Yes, the point of these tests is to help make sure the product regressions don’t escape to the users and yes it is good that they are finding these kinds of issues.  The problem is in when we are finding these issues.  I don’t want our automation to find bugs, at least not the kind that get put into a bug tracking system.  I don’t want those bugs to get to customers, but I also don’t want them to even get to the main branch.  I want them taken care of well before they are something that even needs to be tracked or managed in some bug tracking system.

We find many bugs – good.  But we find them days removed from the code changes that introduced them – bad.  When bugs are found this late, they take a lot of time to get resolved and they end up creating a lot of extra overhead around filing and managing official defect reports.  I want a system that finds these bugs before they officially become bugs, but why doesn’t it work that way already?  What needs to change to get us there?  There are a few ideas we are currently working on as a team.

Faster and more accessible test runs

One of the main reasons bugs don’t get found early is that tests don’t get run early. There are two primary pain points that are preventing that.  One is how long it takes to execute the tests.  When the feedback time on running a set of tests starts at hours and goes up from there to a day or more depending on which sets you run, it is no wonder we don’t run them.  We can’t.  We don’t have enough machine resources to spend that much time on each merge.

Another pain point is how easy it is to select particular sets of tests to run.  We can easily do this manually on our own machines, but we don’t have an easy way to pick particular sets of tests to run as part of the merge build chain.  This means we default to a minimum set of ‘smoke’ tests that get run as part of every merge.  This is helpful, but often there are other sets of tests that it would make sense to run, but that don’t get run because it is too difficult to get them to run.
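As a rough illustration of what picking test sets per merge could look like, here is a small Python sketch.  It is not how our build chain works.  The path patterns, the set names, and the `select_test_sets` function are all hypothetical, and it assumes the build can tell us which files a merge touched.

```python
from fnmatch import fnmatch

# Hypothetical mapping from source path patterns to named test sets.
TEST_SETS_BY_PATH = {
    "src/engine/*": {"engine"},
    "src/ui/*": {"ui"},
}


def select_test_sets(changed_files):
    """Return the test sets to run for a merge, always including 'smoke'."""
    selected = {"smoke"}
    for path in changed_files:
        for pattern, test_sets in TEST_SETS_BY_PATH.items():
            if fnmatch(path, pattern):
                selected |= test_sets
    return sorted(selected)


print(select_test_sets(["src/engine/solver.py"]))
print(select_test_sets(["docs/readme.txt"]))
```

The smoke set stays as the floor, and the extra sets only run when a change touches the area they cover, which keeps the added machine time proportional to the change.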

Improper weighting for test runs

We do have some concept of rings of deployment in our process, and so we run different sets of tests along the way.  The purpose of this is to have finer and finer nets as we get closer to releasing to customers.  This is a good system overall, but right now it would seem that the holes in the early stages of the net are too big.  We need to run more of the tests earlier in the process so that we don’t have so many defects getting through the early rings.  Part of the issue here is that we don’t yet have good data around how many issues are caught at each stage in the deployment and so we are running blind when trying to figure out which sets of tests to run at each stage.  We don’t need to catch every single bug in the first ring (as that kind of defeats part of the purpose of rings of deployment), but we do need to catch a high percentage of them.  At this point we don’t have hard numbers on how many are getting caught at each stage, but looking at the number that make it to the final stage, we can be pretty confident that the holes are too big in the early stages.
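The data we are missing would not need to be fancy.  As a sketch only: if each defect were tagged with the ring where it was caught, a few lines of Python could show the cumulative share caught by each stage.  The ring names, the `catch_rates` function, and the sample records below are all made up for illustration; they are placeholders, not our numbers.

```python
from collections import Counter

# Hypothetical ring names, ordered from earliest to latest.
RINGS = ["merge", "nightly", "pre-release", "final"]


def catch_rates(defect_rings):
    """Return the cumulative fraction of defects caught by the end of each ring."""
    counts = Counter(defect_rings)
    total = sum(counts.values())
    caught = 0
    rates = {}
    for ring in RINGS:
        caught += counts[ring]
        rates[ring] = caught / total if total else 0.0
    return rates


# Placeholder records: one entry per defect, naming the ring that caught it.
sample = ["merge"] * 2 + ["nightly"] * 3 + ["pre-release"] * 5 + ["final"] * 10

for ring, rate in catch_rates(sample).items():
    print(f"{ring}: {rate:.0%} of defects caught so far")
```

With placeholder data shaped like this, half of the defects would only be caught in the final stage, which is the ‘holes are too big’ pattern in numeric form.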

Late integration

We work on a large application that has a lot of different parts that need to be integrated together.  Many of the defects that our integration tests find are, well, integration defects.  The challenge here is that much of the integration happens as one batch process where we pull together many different parts of the software.  If we could integrate the parts in a more piece-wise fashion we could find some of these integration bugs sooner without needing to wait for many different moving parts to all come together. We need to continue to work on making better components and also on improving our build chains and processes to enable earlier integration of different parts of the product.

Our automation finds a lot of bugs, but instead of celebrating this we took a few minutes to think about it and came to realize that this wasn’t the good news it seemed at first. This gives us the ability to take steps towards improving it. What about your automation?  Does it find a lot of bugs?  Should it?  What can you do to make your automation better serve the needs of the team?