Trustworthy Tests

I was working on a script for an automated test, and one of the checks in it compared two similar things with different inputs.  I decided to change one of the inputs to make the difference between them more obvious, but to my surprise, when I ran the test everything passed.  What gives?  I had made a pretty significant change to the test – it should be failing.  I tried a few things interactively and it all looked fine, so I ran the test again – still passing.

Puzzled, I reviewed the test again.  Everything looked fine, but clearly something was going wrong.  I started adding some debug output to the test, and after a couple of tries I found that I had accidentally switched the order of some commands: I was checking the output before updating the engine.  A simple swap later and everything behaved as I expected.
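
To make the failure mode concrete, here is a minimal sketch of that kind of ordering bug.  The Engine class, the inputs, and the expected values are invented for illustration and are not the actual test code.

```python
class Engine:
    """Stand-in for the component under test (purely illustrative)."""

    def __init__(self):
        self._value = 0

    def update(self, new_input):
        self._value = new_input * 2

    def output(self):
        return self._value


def test_output_checked_too_early():
    engine = Engine()
    engine.update(1)
    # Bug: the assertion runs before the update it is meant to verify,
    # so it only ever sees the old state and passes no matter what
    # update(5) would have produced.
    assert engine.output() == 2
    engine.update(5)


def test_output_checked_after_update():
    engine = Engine()
    # Fix: update first, then check the result of that update.
    engine.update(5)
    assert engine.output() == 10
```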

This is just a normal story in the life of a tester.  I’m sure anyone who has written automation scripts can relate to this, but let’s take a minute to think about the implications of it.  What if I hadn’t tried that change?  I would have just merged in the test and it would have been passing but telling us nothing. The passing state of the test was due to a bug in the test and not due to the code we were trying to check.  The code could have changed quite dramatically and this test would have happily kept on reporting green.

I’ll just get straight to the point here. Your tests are only as good as you make them and since you are a human, you are going to have bugs in them sometimes.  A good rule of thumb is to not trust a test that has never failed.  I usually try to deliberately do something that I expect to cause a failure just to make sure my test is doing what I expect it to be doing.  Try it yourself.  You might be surprised at how often you find bugs in your tests just by doing this simple action.
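
One cheap way to bake that habit into the suite itself is sketched below with pytest.  The discount function and values are hypothetical, and a check like this supplements, rather than replaces, actually watching the real test go red once.

```python
import pytest


def apply_discount(price, percent):
    """Illustrative function under test."""
    return round(price * (1 - percent / 100), 2)


def test_discount_applied():
    assert apply_discount(100.0, 15) == 85.0


def test_discount_check_can_fail():
    # Sanity-check the check itself: feed an input that should violate the
    # assertion and confirm that it really does raise.  If this ever stops
    # raising, the main check has probably stopped checking anything.
    with pytest.raises(AssertionError):
        assert apply_discount(100.0, 0) == 85.0
```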

Waiting too Long for Reviews

As we continue to refactor our automated test scripts, I have been a part of a number of pull request reviews.  I have also been involved in some code reviews for large changes that are coming.  Code (or test) review is a very important and helpful strategy for a team to participate in and I am thankful for the many tools that have enabled this and made it a common practice.  To be honest, I can’t really imagine a world where we didn’t have easy and frequent code reviewing.

But. (Could you sense that coming?)

Just because something is good and helpful doesn’t mean it is perfect, and it certainly doesn’t mean we can use it in every situation without putting on our critical thinking caps.  Participating in so many reviews has made me put that cap on.  What kinds of things are code reviews good for, and more importantly, what are their limitations?

I think code reviews are very helpful for fine tuning.  ‘Don’t use that data type here.’  ‘Should you return this instead?’  ‘It looks like you are missing a check here.’  Comments like these go a long way toward catching mistakes and tweaking things to get the code into a good, readable state.

Code reviews are great at helping us make those little adjustments that are needed to keep up the quality and maintainability, but what happens if the entire approach is wrong?  What if you are reviewing something and you want to just shout out, NO, you’re doing it completely backwards?  We all do silly things sometimes and need to be called out on it, but do code reviews help with this?  I guess in some ways they do.  The fact that someone else will see the code makes us think about it differently, and by putting it up for review someone might let us know we are approaching it from the wrong angle.  But think about it: how likely are you to, in effect, tell someone ‘you just wasted three days doing something you need to delete’?  And even if you do, how effective is this form of feedback?

If we rely on code review as the first time someone else sees major new work we are doing, we are waiting too long.  We need much earlier feedback on it.  Does my approach make sense?  Am I solving the correct problem here?  Are there any considerations I’m missing?  These kinds of questions need to be asked long before the code review goes up.  Many of them should be asked before any code is even written.  We need early communication on these kinds of things so that we can get the feedback before we spend too much time going down a dead-end road.  Another consideration is that code reviews are usually text based, and having a discussion on direction and approach via a text-based medium is likely to be an exercise in frustration.  Some things work better with face-to-face discussion.  The powerful tools that enable easy code reviews can sometimes push us towards inefficient communication about the code.  Stay aware of how your tools affect you and seek to do what makes sense, not just what is easy.

I have been able to participate in more code reviews lately.  Code reviews are an important part of testing, so I’m glad to be included in these, but there is still lots of testing work to do before the code review stage, and I will be trying to get more testing input happening earlier.  I guess it is my lot in life to celebrate the small victories (more code reviews – yay!) and then push on to the next thing.  Never stop improving!

Should Automation Find Bugs?

Our automated tests help us find a lot of bugs.  In our reporting system we link bug investigations to the tests that are failing because of them.  I looked through this the other day and was surprised by the number of bugs that have been found with our automated tests.  They hit a lot of issues.

At first blush this seems like a good thing.  They are doing what they are meant to right? They are helping us find regressions in the product before we ship to customers.  Isn’t this the whole point of having automated regression tests?  Well, yes and no.

Yes, the point of these tests is to help make sure product regressions don’t escape to the users, and yes, it is good that they are finding these kinds of issues.  The problem is when we are finding these issues.  I don’t want our automation to find bugs, at least not the kind that get put into a bug tracking system.  I don’t want those bugs to get to customers, but I also don’t want them to even get to the main branch.  I want them taken care of well before they are something that needs to be tracked or managed in a bug tracking system.

We find many bugs – good.  But we find them days removed from the code changes that introduced them – bad.  When bugs are found this late, they take a lot of time to get resolved and they end up creating a lot of extra overhead around filing and managing official defect reports.  I want a system that finds these bugs before they officially become bugs, but why doesn’t it work that way already?  What needs to change to get us there?  There are a few ideas we are currently working on as a team.

Faster and more accessible test runs

One of the main reasons bugs don’t get found early is that tests don’t get run early. There are two primary pain points that are preventing that.  One is how long it takes to execute the tests.  When the feedback time on running a set of tests starts at hours and goes up from there to a day or more depending on which sets you run, it is no wonder we don’t run them.  We can’t.  We don’t have enough machine resources to spend that much time on each merge.

Another pain point is how easy it is to select particular sets of tests to run.  We can easily do this manually on our own machines, but we don’t have an easy way to pick particular sets of tests to run as part of a merge build chain.  This means we default to a minimum set of ‘smoke’ tests that get run as part of every merge.  This is helpful, but often there are other sets of tests it would make sense to run that don’t get run because it is too difficult to wire them in.
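
Our harness isn’t pytest, but as a generic illustration of the kind of selection we’re missing, tagging tests and letting a build configuration pick a subset by name might look something like this (the marker names are made up):

```python
# Hypothetical sketch: tag tests so a merge build can run a named subset
# instead of choosing between "just the smoke set" and "everything".
import pytest


@pytest.mark.smoke
def test_service_starts():
    assert True  # placeholder body


@pytest.mark.reporting
def test_monthly_report_totals():
    assert True  # placeholder body


# A build chain could then pick a set per change, for example:
#   pytest -m smoke                  (minimal set on every merge)
#   pytest -m "smoke or reporting"   (when reporting code changed)
```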

Improper weighting for test runs

We do have some concept of rings of deployment in our process, and so we run different sets of tests along the way.  The purpose of this is to have finer and finer nets as we get closer to releasing to customers.  This is a good system overall, but right now it would seem that the holes in the early stages of the net are too big.  We need to run more of the tests earlier in the process so that we don’t have so many defects getting through the early rings.  Part of the issue here is that we don’t yet have good data around how many issues are caught at each stage of the deployment, and so we are running blind when trying to figure out which sets of tests to run at each stage.  We don’t need to catch every single bug in the first ring (as that kind of defeats part of the purpose of rings of deployment), but we do need to catch a high percentage of them.  At this point we don’t have hard numbers on how many are getting caught at each stage, but looking at the number that make it to the final stage, we can be pretty confident that the holes are too big in the early stages.
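
Once we do start collecting those numbers, the calculation itself is simple.  The stage names and counts below are invented purely to show the shape of the report we’d like to be able to produce.

```python
# Hypothetical counts of bugs first caught at each ring -- invented data,
# only to illustrate the calculation, not real numbers from our pipeline.
bugs_first_caught = {
    "merge (smoke tests)": 12,
    "nightly full run": 30,
    "pre-release ring": 25,
    "customer reports": 8,
}

total = sum(bugs_first_caught.values())
cumulative = 0
for stage, count in bugs_first_caught.items():
    cumulative += count
    print(f"{stage:<22} {count:3d} bugs  "
          f"({cumulative / total:.0%} caught by the end of this ring)")
```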

Late integration

We work on a large application that has a lot of different parts that need to be integrated together.  Many of the defects that our integration tests find are, well, integration defects.  The challenge here is that much of the integration happens as one batch process where we pull together many different parts of the software.  If we could integrate the parts in a more piece-wise fashion we could find some of these integration bugs sooner without needing to wait for many different moving parts to all come together. We need to continue to work on making better components and also on improving our build chains and processes to enable earlier integration of different parts of the product.

Our automation finds a lot of bugs, but instead of celebrating this we took a few minutes to think about it and came to realize that this wasn’t the good news it seemed at first. This gives us the ability to take steps towards improving it. What about your automation?  Does it find a lot of bugs?  Should it?  What can you do to make your automation better serve the needs of the team?

Know Thyself (And Thy Automation Tools)

Technology can hypnotize us sometimes.  There seems to be something about working with software that makes us get caught up in solutions and artifacts and tools.  These are good things, but sometimes we forget the human element.  We forget that software isn’t made in a vacuum, it’s made by humans.  We all know humans have their flaws and biases, and so let’s just be honest and admit that those flaws and biases are going to come out in the software we create.

One fascinating theory I’ve heard is the idea that the structure of a software product tends to reflect the structure of the company (or division) that made it.  I can think of examples of this and find it to be an interesting theory, but that isn’t what I want to get into today.  I want to consider the ways in which the automation tools we use shape the way we approach software quality.

Test automation is something almost every tester will be exposed to, and there are many different tools out there.  When evaluating a tool we often look at many different technical aspects.  One of the interesting things I have noticed, though, as I switch to a different test harness than the one I was using before, is that the tool itself will shape your behavior.  This can be good or bad, but mostly we need to be aware of it so that we can decide whether we want to make changes to address it.

For example, the previous test harness I was using gave the automation engineer a lot of low-level control over things, but the test harness I have now hides away a lot more of the details of the execution.  This shows up in something as simple as the fact that the old tool was a set of readable Python scripts, while this tool is a set of compiled files.  There are a lot of benefits to the new tool, such as a gentle learning curve, ease of use, and a lot of features that are built in from the start, but the fact is this different approach will push me to behave in different ways.  With the previous harness I could get down into the details pretty easily and so could target more precisely the exact thing I wanted to test.  In the new harness I have a harder time getting to that level of detail, and so the harness itself will push me to write more high-level, integration-style tests.

This isn’t necessarily a bad thing, since tests like that can be valuable, but it does mean I should recognize that I might be tempted to take the easy path and create an integration test when a lower-level or more detailed test might be more appropriate.  A tool that allows me to very easily create tests of one sort means I will be tempted to create those kinds of tests even when it doesn’t make sense to do so.

The very fact that certain automation tools make certain things easy is the whole point of using them in the first place, but we need to remember that making something easy means we will probably do more of it, and that may not always be the best thing for the product.  Don’t lose sight of the end goal – delivering high-quality code in a timely fashion.

When Automation Has Bugs

“The regression tests are all passing, so we can merge this code.”  How many times have you heard or said something similar to this?  It is a very common way of thinking and reveals to us something about the way we view automated regression tests.  We see their main purpose to be giving us confidence that it is ok to release new code to the next stage of production.  This is all well and good and probably should be one of the main purposes of automated regression tests, but do you know how good your tests are at doing this?  If you trust that running your regression suite means that it is probably ok to release your code, is that trust well founded?  How do you know?

I was recently faced with these questions.  A regression test started failing and so I dug into it.  Nothing out of the ordinary there, but as I looked at the failure I was puzzled.  The code looked like it was doing the right thing, and sure enough, after a bit more digging it turned out that we had encoded a check that asserted the wrong behavior as being correct.  Recent changes had caused this bug to get fixed, but we had been running this test for months with it explicitly checking that the bug was there and passing if it was.
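
In spirit, the problem looked something like the sketch below (the function and values are invented): the check was written against what the code did at the time rather than what it should do, so fixing the product bug is what finally broke the test.

```python
def trim_label(text):
    # Suppose the product is supposed to trim whitespace from both ends.
    # An old bug only trimmed the left side; a recent change fixed that.
    return text.strip()


def test_trim_label():
    # This check encodes the old, buggy behavior as "correct".  It passed
    # for months while the bug existed and only fails now that the product
    # does the right thing.
    assert trim_label("  padded  ") == "padded  "
```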

This led to a bit of an existential crisis.  If this test is asserting on the existence of a bug, how many other tests are doing the same thing?  How can I know if my tests are any good?  Do I have to test all my tests?  And then who tests the testing of the testing?  This could get really crazy really quickly, so what should I do?

Deep Breath.

Ok, what I need to do is think about ways that I can evaluate how well founded my confidence is in these tests.  What are some heuristics or indicators I could use to let me know if my tests should be trusted when they tell me everything is ok?

I want to emphasize that the ideas below are just indicators.  They don’t cover every circumstance and I certainly wouldn’t want them to be applied as hard and fast rules, but they might indicate something about the trustworthiness of the tests.

Do they fail?

One indicator would be looking at how often they fail.  If the tests rarely fail, then they might not be a helpful indicator of the quality of a build.  After all, if the tests never fail, then either we haven’t broken anything at all (hmm) or we aren’t checking for the kinds of things that we actually have broken.  A note here: when I say fail, I mean fail in ways that find bugs.  Failures that merely require test updates don’t count.

Do they miss a lot of bugs?

The point of regression tests is to find out if you have caused things to break that were working before.  If we find a lot of bugs of this sort after the automated scripts have run, those scripts might not be checking the right things.

Do they take a long time to run?

What is the average run time per test?  Long-running tests may be an indicator that we are not checking as much as we could.  Typically, long-running tests spend a lot of their time on setup and other activities that aren’t actively checking or asserting anything.  If you have a lot of these in your test suite, you might not be getting the coverage that you think you are.
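
If your runner exports timing data, this is a quick check to make.  The durations below are fabricated; real numbers would come from your own results files.

```python
# Fabricated (test name -> seconds) durations, standing in for whatever
# timing data your test runner reports.
durations = {
    "test_login": 4.2,
    "test_full_order_workflow": 1260.0,
    "test_report_totals": 38.5,
}

average = sum(durations.values()) / len(durations)
print(f"average run time per test: {average:.1f}s")

# Tests that dwarf the average are worth a closer look: how much of that
# time is setup versus actual checking?
for name, seconds in sorted(durations.items(), key=lambda kv: -kv[1]):
    if seconds > 2 * average:
        print(f"  long runner: {name} ({seconds:.0f}s)")
```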

When is the last time I looked at this test?

Tests age, and they usually don’t age well.  If you haven’t looked at a test script in a long time, it is quite likely that it’s checking things that just aren’t as important as they used to be.  Regular test maintenance and review is essential to keeping a useful and trustworthy test suite.  And don’t be afraid to delete tests either – sometimes things just need to go!

So there you have it.  A few indicators that can be used to give you an idea of the trustworthiness of your automated regression tests.  I’m sure you can add many more. Feel free to share indicators you use in the comments below!

Trusting Others’ Testing

As we continue consolidating automated tests from several different teams into one suite of tests, one of the things we are looking at is figuring out where there is duplication. Even though each of the previous teams had their own area of focus for testing, they were all still working on a common platform and in some cases there is significant overlap between some of the test sets.  It is easy to just accept that as a given (and indeed I think some level of overlap will always be there due to leaky abstractions and other considerations), but I wonder if the level of test duplication we are seeing is a symptom of something.  I wonder if it points to a trust and communication issue.

Why would Team A add tests that exercise functionality from area B?  Well, Team A depends on that functionality working a certain way, so they want to make sure that the stuff Team B is doing will indeed work well for their needs, and so Team A duplicates some of the testing Team B is doing.  But what if Team A trusted that Team B was checking for all the things that Team A needed to be there?  Would they be adding tests like this?  It’s very unlikely they would.  If you knew that the kinds of things you cared about were being dealt with at the priority level you would give them, there would be no need to check those things yourself.  The problem, of course, was that we didn’t trust each other to do that.

This raises the question of why.  Why didn’t we trust each other like this?  The most obvious answer is that we hadn’t shown ourselves to be trustworthy to each other.  If Team A ended up getting hurt a number of times by Team B not checking for the things Team A needed, it would be very rational for them to check for those things themselves, even if it meant duplicating effort.

With a new team structure, we may have fewer trust issues causing test duplication, but test automation isn’t the only place this comes into play.  For example, how much duplication of testing is there between the work that developers do and the work that testers do?  If a developer has done a lot of testing, do we trust that testing?  What if a product manager has tested a feature?  Do we trust that testing?  Why is it that we often feel the need to do our own testing?  This distrust may be well founded (we do it because past experience has taught us that they might be missing important things), but the reality is, it hurts our teams.  It slows us down and makes us less efficient.  We need to work on building trust, communication and trustworthiness so that we can stop redoing other people’s work.  What are you doing to learn about the testing others on your team are doing?  What are you doing to help them grow so that their testing is more trustworthy?

Too Many Tests?

As I mentioned before, I have moved to a new team and we are working on a consolidation effort for automated regression tests that we inherited from several different teams.  I have spent the last two weeks setting up builds to run some of the tests in TeamCity.  Two weeks seems like a long time to get some testing builds up and running, so I stopped to think about why it was taking me so long.  I know TeamCity quite well, as I have been using it to do similar test runs for years now.  The automation framework I’m working with now is new to me, but it is pretty easy and straightforward to use.  I would say I only lost about a day figuring out the tool and getting it set up on my VMs.  So what is making this take so long?

The most important factor at play here is how long it takes to run the tests.  I broke the tests up across several builds, but each build still takes between 3 and 6 hours to run.  What that means is that if I have a wrong configuration on the machine, or I’ve set up the wrong permissions, or anything like that, I have a 3 to 6 hour turnaround on getting feedback.  I did manage to help that a bit by creating some dummy test runs that were much faster, which allowed me to do some of the debugging more quickly, but at the end of the day these long run times slowed me down a lot in getting the tests up and running.
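
For what it’s worth, the dummy runs were nothing clever.  A couple of near-instant placeholder tests along these lines (illustrative only) are enough to shake out agent permissions and configuration without paying the multi-hour cost of the real suite.

```python
# test_build_config_smoke.py -- a near-instant stand-in suite used only to
# debug build configuration (agents, permissions, paths), not the product.
import os
import tempfile


def test_agent_can_write_temp_files():
    # Fails fast if the build agent's account lacks basic file permissions.
    with tempfile.NamedTemporaryFile(mode="w") as handle:
        handle.write("ok")


def test_expected_environment_is_present():
    # Swap in whatever variables or paths the real tests depend on.
    assert "PATH" in os.environ
```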

But let’s stop and think about that for a minute.  What is the point of running automated regression tests?  To give feedback on any regressions we might have introduced while making changes to the code, right?  Well, if that is the case, how effective are these test runs at doing that?  The length of the feedback loop made it a slow process for me to find and fix issues with my builds and led to it taking much longer to get them set up than it would have otherwise.  What does the length of that feedback loop do for finding and fixing issues in the code?  The slow feedback meant that I would have to go work on other things while I waited for a build to finish, which led to some of the inefficiencies of multitasking.  Does the same thing happen with long feedback loops for code changes?

I could go on, but I think the point has been made.  The pain I felt here applies to code changes as well.  Having a long-running set of regression tests reduces the value of those tests in many ways!  Do we perhaps have too many tests?