Know Thyself (And Thy Automation Tools)

Technology can hypnotize us sometimes.  There seems to be something about working with software that makes us get caught up in solutions and artifacts and tools.  These are good things, but sometimes we forget the human element.  We forget that software isn’t made in a vacuum, it’s made by humans.  We all know humans have their flaws and biases, and so let’s just be honest and admit that those flaws and biases are going to come out in the software we create.

One fascinating theory I’ve heard (often referred to as Conway’s law) is that the structure of a software product tends to reflect the structure of the company (or division) that made it. I can think of examples of this and find it to be an interesting theory, but that isn’t what I want to get into today. I want to consider the ways in which the automation tools we use shape the way we approach software quality.

Test automation is something almost every tester will be exposed to, and there are many different tools out there. When evaluating a tool we often look at the technical aspects. One of the interesting things I have noticed, though, as I switch to a different test harness than the one I was using before, is that the tool itself shapes your behavior. This can be good or bad, but mostly we need to be aware of it so that we can decide whether we want to make changes to address it.

For example, the previous test harness I was using gave the automation engineer a lot of low-level control over things, but the test harness I have now hides away many more of the details of the execution. This shows up in something as simple as the fact that the old tool was a set of readable Python scripts, while this tool is a set of compiled files. There are a lot of benefits to the new tool, such as a shallow learning curve, ease of use, and a lot of features that are built in from the start, but the fact is this different approach will push me to behave in different ways. With the previous harness I could get down into the details pretty easily and so could target more precisely the exact thing I wanted to test. In the new harness I have a harder time getting to that level of detail, and so the harness itself will push me to write more high-level, integration-style tests.

This isn’t necessarily a bad thing since tests like that can be valuable, but it does mean that I should consider the fact that I might be tempted to take the easy path and create an integration test when a lower level or more detailed test might be more appropriate.  The fact that I have a tool that will allow me to very easily create tests of one sort means that I will be tempted to create those kinds of tests even when it doesn’t make sense to do so.

The very fact that certain automation tools make certain things easy is the whole point of using them in the first place, but making something easy means we will probably do more of it, and that may not always be the best thing for the product. Don’t lose sight of the end goal: delivering high-quality code in a timely fashion.

When Automation Has Bugs

“The regression tests are all passing, so we can merge this code.” How many times have you heard or said something similar to this? It is a very common way of thinking and reveals something about the way we view automated regression tests. We see their main purpose as giving us confidence that it is OK to move new code along to the next stage on its way to production. This is all well and good and probably should be one of the main purposes of automated regression tests, but do you know how good your tests are at doing this? If you trust that a passing regression suite means it is probably OK to release your code, is that trust well founded? How do you know?

I was recently faced with these questions. A regression test started failing, so I dug into it. Nothing out of the ordinary there, but as I looked at the failure I was puzzled. The product looked like it was doing the right thing, and sure enough, after a bit more digging it turned out that we had encoded a check that asserted the wrong behavior as correct. Recent changes had caused this bug to get fixed, but we had been running this test for months with it explicitly checking that the bug was there and passing the test if it was.

This led to a bit of an existential crisis. If this test was asserting on the existence of a bug, how many other tests are doing the same thing? How can I know if my tests are any good? Do I have to test all my tests? And then who tests the testing of the testing? This could get really crazy really quickly, so what should I do?

Deep Breath.

Ok, what I need to do is think about ways that I can evaluate how well founded my confidence is in these tests.  What are some heuristics or indicators I could use to let me know if my tests should be trusted when they tell me everything is ok?

I want to emphasize that the ideas below are just indicators.  They don’t cover every circumstance and I certainly wouldn’t want them to be applied as hard and fast rules, but they might indicate something about the trustworthiness of the tests.

Do they fail?

One indicator would be looking at how often they fail. If the tests rarely fail, then they might not be a helpful indicator of the quality of a build. After all, if the tests never fail, then either we haven’t broken anything at all (hmm), or we aren’t checking for the kinds of things that we actually have broken. A note here: when I say fail, I mean fail in ways that find bugs. Failures that merely require test updates don’t count.

Do they miss a lot of bugs?

The point of regression tests is to find out if you have caused things to break that were working before.  If we find a lot of bugs of this sort after the automated scripts have run, those scripts might not be checking the right things.

Do they take a long time to run?

What is the average run time per test? Long-running tests may be an indicator that we are not checking as much as we could. More often than not, long-running tests are spending a lot of their time on setup and other activities that aren’t actively checking or asserting anything. If you have a lot of these in your test suite, you might not be getting the coverage that you think you are.
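
Measuring this doesn’t have to be fancy. Here is a minimal sketch, assuming your runner can export JUnit-style XML results; the report file name and the threshold are made up for illustration:

```python
# A minimal sketch for flagging long-running tests, assuming JUnit-style XML
# output from the test runner. The file name and threshold are placeholders.
import xml.etree.ElementTree as ET

THRESHOLD_SECONDS = 300  # flag anything slower than five minutes

def slow_tests(report_path):
    """Yield (test name, duration) for tests that exceed the threshold."""
    tree = ET.parse(report_path)
    for case in tree.iter("testcase"):
        duration = float(case.get("time", 0))
        if duration > THRESHOLD_SECONDS:
            yield case.get("name"), duration

if __name__ == "__main__":
    for name, duration in slow_tests("results.xml"):
        print(f"{name}: {duration / 60:.1f} minutes")
```

Even a crude report like this can point out which tests are eating most of the run time and deserve a closer look.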

When is the last time I looked at this test?

Tests age, and they usually don’t age well. If you haven’t looked at a test script in a long time, it is quite likely that it’s checking things that just aren’t as important as they used to be. Regular test maintenance and review is essential to keeping a useful and trustworthy test suite. And don’t be afraid to delete tests either; sometimes things just need to go!

So there you have it.  A few indicators that can be used to give you an idea of the trustworthiness of your automated regression tests.  I’m sure you can add many more. Feel free to share indicators you use in the comments below!

Trusting Others’ Testing

As we continue consolidating automated tests from several different teams into one suite of tests, one of the things we are looking at is figuring out where there is duplication. Even though each of the previous teams had their own area of focus for testing, they were all still working on a common platform and in some cases there is significant overlap between some of the test sets.  It is easy to just accept that as a given (and indeed I think some level of overlap will always be there due to leaky abstractions and other considerations), but I wonder if the level of test duplication we are seeing is a symptom of something.  I wonder if it points to a trust and communication issue.

Why would Team A add tests that exercise functionality from area B? Well, Team A depends on that functionality working a certain way, and so they want to make sure that the stuff Team B is doing will indeed work well for their needs, and so Team A duplicates some of the testing Team B is doing. But what if Team A trusted that Team B was checking for all the things that Team A needed to be there? Would they still be adding tests like this? It’s very unlikely. If you knew that the kinds of things you cared about were being dealt with at the priority level you would give them, there would be no need to check those things yourself. The problem, of course, was that we didn’t trust each other to do that.

This raises the question of why. Why didn’t we trust each other like this? The most obvious answer is that we didn’t show ourselves to be trustworthy to each other. If Team A ended up getting hurt a number of times because Team B wasn’t checking for the things Team A needed, it would be very rational for them to check for those things themselves even if it meant duplicating effort.

With a new team structure, we may have fewer trust issues causing test duplication, but test automation isn’t the only place this comes into play. For example, how much duplication of testing is there between the work that developers do and the work that testers do? If a developer has done a lot of testing, do we trust that testing? What if a product manager has tested a feature? Do we trust that testing? Why is it that we often feel the need to do our own testing? This distrust may be well founded (we do it because past experience has taught us that they might be missing important things), but the reality is, it hurts our teams. It slows us down and makes us less efficient. We need to work on building trust, communication, and trustworthiness so that we can stop redoing other people’s work. What are you doing to learn about the testing others on your team are doing? What are you doing to help them grow so that their testing is more trustworthy?

Too Many Tests?

As I mentioned before, I have moved to a new team and we are working on a consolidation effort for automated regression tests that we inherited from several different teams. I have spent the last two weeks setting up builds to run some of the tests in TeamCity. Two weeks seems like a long time to get some testing builds up and running, so I stopped to think about why it was taking me so long. I know TeamCity quite well, as I have been using it to do similar test runs for years now. The automation framework tool that I’m working with now is new to me, but it is pretty easy and straightforward to use. I would say I only lost about a day figuring out the tool and getting it set up on my VMs. So what is making this take so long?

The most important factor at play here is how long it takes to run the tests. I broke the tests up between several builds, but each build still takes between 3 and 6 hours to run. What that means is that if I have the wrong configuration on a machine, or I’ve set up the wrong permissions, or anything like that, I have a 3 to 6 hour turnaround on getting feedback. I did manage to help that out a bit by creating some dummy test runs that were much faster, which allowed me to do some of the debugging more quickly, but at the end of the day these long run times significantly slowed down my ability to get the tests up and running.

But let’s stop and think about that for a minute. What is the point of running automated regression tests? To give feedback on any regressions we might have introduced while making changes to the code, right? Well, if that is the case, how effective are these test runs at doing that? The length of the feedback loop made it a slow process for me to find and fix issues with my builds and led to it taking much longer to get them set up than it would have otherwise. What does the length of that feedback loop do for finding and fixing issues in the code? The slow feedback meant that I would have to go work on other stuff while I waited for a build to finish, which led to some of the inefficiencies of multitasking. Does the same thing happen with long feedback loops for code changes?

I could go on, but I think the point has been made.  The pain I felt here applies to code changes as well.  Having a long running set of regression tests reduces the value of those tests in many ways! Do we perhaps have too many tests?

Build Hygiene

Our company recently went through a restructuring and so I’ve ended up on a new team. This team is going to be taking over the testing and regression scripts for parts of a number of other teams.  For the last couple of weeks I’ve been working on moving test runs over so that they can run on the machine resources we have in our business unit. As part of this I have had the chance to see a number of different build scripts and test running scripts as well as a number of different ways of setting up builds in TeamCity.

One of the things I noticed was that some builds and scripts were a lot easier to understand and convert than others.  Overall most teams were trying to do very similar things but each group ended up using TeamCity in different ways.  This got me thinking about code hygiene and how this applies to builds and build tools as well.  It seems that often the build process gets approached from the ‘git er done’ perspective.  It is often a high pressure area since we need to keep the build pipeline alive and moving or people get pretty upset, and this can sometimes lead to hacking things in to make it work.  One little thing follows another and soon the build scripts and processes are so complicated they can hardly be understood by an outsider.

It might seem that this doesn’t matter, but I think it does, and not just because I’m going through a conversion process right now. You could (probably rightly) argue that moving a bunch of tests and builds to another team is not a common occurrence, but the reality is that in builds, as in everything else in software development, change is the only constant. There are always going to be things that need to be changed, and especially in the build process, those changes often need to happen quickly. What happens if you are out and someone else has to maintain the builds? What happens if your build setups and scripts get too large and fragile? You start to threaten your ability to quickly make the changes you need to keep the build chain running smoothly. Just like with your code, it is important to practice build hygiene. Take some time to clean things up once in a while and make sure that you aren’t building up technical debt in your build process.

Test Mutation

In an earlier post I wrote about the importance of learning from your automation, and in that post I mentioned some of the tools I have to help me learn from my automation. I was asked about these on Twitter, and while it will be tough to answer in a general way since we use a custom test automation platform, I thought it would be worth trying to explain.

This post will probably be slightly more technical than most of my posts, but hopefully it will be helpful even to those who are less technically minded (I’ll keep any code to a few small illustrative sketches).

In this post I want to talk about my test variation tool: how it works and how I use it to give me new insights. To do that, let’s start at the beginning with our test automation framework. The framework was custom built for the product we were testing, but when we were writing it we didn’t want it to be too tightly coupled to the actual product. There were a couple of reasons for this, not least of which was a desire to not have to modify the testing framework when making changes to the product code. As a result the framework was built in a layered approach that ended up being very helpful in many ways. The testing framework itself merely required you to specify a config file for each test with a few items in it (test description, keywords, etc.). The config file then needed to define a Run() function (using Python syntax) which could call any arbitrary code as long as it returned certain status codes (pass/fail/timeout, etc.) and messages.

This meant that the actual work of running the product under test, pointing to the run scripts used, and so on was done in a set of ‘helper’ functions that we could import and use in any given test. This gave a high degree of customization to the tests and allowed anyone to easily write their own additional functions to use in any particular set of tests.
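
To make that concrete, here is a rough sketch of what one of these test configs might have looked like. This is my own illustration rather than the real framework code, so the module names, status constants, and helper signature are stand-ins:

```python
# Illustrative sketch of a test config; the imports, constants, and helper
# signature are hypothetical stand-ins for the custom framework.
from harness import helpers, status  # hypothetical imports

DESCRIPTION = "Solver converges with option X enabled"
KEYWORDS = ["solver", "nightly"]

def Run():
    # The helper does the heavy lifting: start the product under test, point it
    # at the run scripts, wait for completion, and collect the results.
    result = helpers.run_product(scripts=["case_01.run"], timeout=3600)
    if result.timed_out:
        return status.TIMEOUT, "Run exceeded the time limit"
    if result.passed:
        return status.PASS, "Converged as expected"
    return status.FAIL, result.message
```

The framework only cared that Run() handed back a status and a message; everything else was up to the test author.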

You can probably figure out by now how I managed to make my test variation tool work. I wrote a function to modify the test scripts, which could be used in any of the tests. This function would find the scripts that were going to be run as part of that test. It would also look for a predefined file that contained the commands we wanted to add. It would then parse through the test scripts and modify them to add the requested commands at a point in the scripts immediately before we asked the engine to solve. After that, control would be given back to the functions that were used to start up and run the system under test, but now instead of running the original scripts we would be running modified copies of the scripts.
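
In rough outline, the script-modification step looked something like the sketch below. Again, this is an illustration rather than the actual code; the ‘.mutated’ copies, the ‘solve’ marker, and the function name are assumptions:

```python
# Illustration of the script-modification idea: copy each run script and
# splice the extra commands in just before the solve step.
from pathlib import Path

def mutate_scripts(script_paths, extra_commands_file, marker="solve"):
    """Return paths to modified copies of the run scripts."""
    extra = Path(extra_commands_file).read_text()
    if not extra.endswith("\n"):
        extra += "\n"
    mutated = []
    for path in map(Path, script_paths):
        copy = path.with_name(path.name + ".mutated")
        with copy.open("w") as out:
            for line in path.read_text().splitlines(keepends=True):
                if line.strip().startswith(marker):
                    out.write(extra)  # inject the requested commands first
                out.write(line)
        mutated.append(str(copy))
    return mutated
```

The rest of the run then simply points the normal startup helpers at the returned copies instead of the original scripts.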

This allowed us to do a lot of interesting things. For example, we could force every test in the system to use one particular option. This would let us see how the option would work in a wide range of settings, and we could find potential feature combination issues in any tests that crashed. We could also use it to evaluate what might happen if we changed the default on an option. We could just force that option to use the new default in all the tests and see what happened. In many ways this allowed us to better explore our product, and in fact when the developers saw some of the cool things that we could do, they started to pull some of the ideas into the development code itself, making it even easier to experiment with these things.

Now, my particular framework (and the fact that I had complete access to it, since I was one of the people who wrote it) made it pretty easy for me to do something like this, but can we do this in general? I’ve moved to another team now, so I guess I’ll be able to find out how easy it is to generalize, but I’ll close with a few thoughts on how you might be able to implement something like this. If you don’t have a nicely modular testing framework, you could still do something like this fairly easily. The nice part about what I did was that I could integrate the test mutations right into the run itself: I only needed to define commands in a particular file and then turn on a flag to tell the tests to use the test variation tool. But the modifications could just as easily be done in a way that is external to the test system itself. You could write a script that traverses your tests and modifies your test scripts before you even start your test run, as in the sketch below. This might be slightly less convenient, but in theory it should be pretty straightforward to do.
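
Something along those lines might look like this, with the caveat that the directory layout, the file extension, and the ‘solve’ marker are again just placeholders:

```python
# Sketch of the 'external' variant: rewrite run scripts in place before the
# suite starts. Directory layout, extension, and marker are placeholders.
from pathlib import Path

EXTRA_COMMANDS = Path("extra_commands.txt").read_text()

for script in Path("tests").rglob("*.run"):
    original = script.read_text().splitlines(keepends=True)
    with script.open("w") as out:
        for line in original:
            if line.strip().startswith("solve"):
                out.write(EXTRA_COMMANDS)  # inject just before the solve step
            out.write(line)
```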

If I end up doing something like this on my new team I’ll post the results of that here as well, but in the meantime maybe you can give it a try with your tests.  Who knows, your automation might just be able to teach you something!

Learning From Continuous Delivery

We only ship our product three times a year, and while we do fairly frequent builds, we don’t have a continuous delivery or deployment system. Since we don’t do it, it might seem that I shouldn’t bother learning about continuous delivery and integration, but I try to keep up with new developments of this sort and so I’ve read and learned a bit about these things. The interesting thing is that, as with many things that don’t seem directly applicable to my day-to-day work, I have been able to use ideas from this to help improve things. A lesson or skill learned can be very helpful even if it isn’t directly about the work you are doing.

For example, due to a recent reorganization effort in our company, I am now part of a new team which is combining some of the automation work of several previously separate teams. I have been thinking about how we should go about consolidating and running these various tests, and one of the ideas that I will be using is drawn from the continuous delivery world. This is the idea of what is sometimes called rings of deployment, or flighting. The (very) quick summary of this is that you expose new code to a small group of people first, and then if everything goes well you gradually roll it out to larger and larger groups of your customers.

Since we only ship once every few months it might seem the idea of rings of deployment doesn’t apply, but in fact I think the idea will fit in very nicely with our development process. However, instead of thinking about gradually larger groups of customers, we’ll structure it in terms of gradually exposing changes to larger and larger rings of the code base. So, for example, if we have changes in a particular component we will run a set of tests particular to that component. These tests check if anything has broken directly within the component itself. If they pass, we will then build that component against the latest ‘certified’ build and run some integration tests. These tests check if the component changes broke anything the larger system relies on. If those pass, we can then verify that everything works when the component is integrated with the latest combined builds, so that the build pipeline won’t break. At that point we can start to consume those changes in all the developer builds and start certifying a new package, which is the final integration step.
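
The gating logic itself is nothing fancy. Here is a simplified sketch of the idea; the stage names and the test commands are placeholders rather than our actual TeamCity setup:

```python
# Simplified sketch of the 'rings' idea applied to a test pipeline. The stage
# names and commands are placeholders for illustration only.
import subprocess
import sys

STAGES = [
    ("component tests", ["run_tests", "--suite", "component"]),
    ("integration against latest certified build", ["run_tests", "--suite", "integration"]),
    ("combined-build checks", ["run_tests", "--suite", "combined"]),
]

def run_rings():
    for name, command in STAGES:
        print(f"Running ring: {name}")
        if subprocess.run(command).returncode != 0:
            # Stop at the first broken ring so later rings (and other teams)
            # never see the breaking change.
            print(f"Ring failed: {name}; not promoting further")
            return False
    print("All rings passed; the change can be promoted to the shared builds")
    return True

if __name__ == "__main__":
    sys.exit(0 if run_rings() else 1)
```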

Once you add in the ability to use feature flags and beta features, we can actually have a pretty sophisticated deployment chain that gives us a lot of information on where failures are coming from and that also allows us to limit the number of people affected by breaking changes. If there is a breaking issue in a component, that will only affect teams working on that component, and if there is a breaking issue in the integration, that will only affect the people consuming the latest version of the component for their own integration testing. By gradually exposing the changes to integrate with more and more of the code base, we are able to reduce the number of developers and testers affected by breakages.

I know this isn’t rings of deployment in a strict sense, and the purpose of this article isn’t to define that clearly. The point I want to make is that I probably would not have taken the approach I am taking to the overall structure of the automation if I hadn’t heard of the idea of rings of deployment. By learning new things (even things that might seem like they don’t really apply to my context), I have been able to make translations in my head, let ideas create new ideas, and come up with something better.

Never stop learning!