“Now that is weird.”
I stared at my screen and puzzled over the problem. How could this happen?
It had all started with a 500 error due to a missing config. When I checked the config panel, sure enough, the config was missing. We were searching through pull requests trying to figure out what had gone wrong, when one of my co-workers said that it seemed to be there for her.
So I refreshed the page. And there it was.
How can that be? We didn’t make any code changes to the site so how can a config that wasn’t there a minute ago just appear?
But oh well right? The initial thing I was trying to do was working now, so maybe it was just some network glitch or something. I felt the temptation to just shake my head, attribute it to the whims of the software gods, and move on. But I decided to be curious and to see if I could find out anything else.
I posted on our company’s testing slack channel asking for ideas and within minutes I had a response that led to understanding the issue. We were creating the variable too late in the process and so the site didn’t know about it the first time we tried to use it. We got a fix together and took care of the issue, but I think there are a few important lessons to learn.
Be curious – It would have been easy to shrug my shoulders and move on, but that would have led to a bunch of clients hitting weird random errors when they first started using the feature. Not a good first impression!
It takes a team – I might have noticed the issue, but without the help of others it would have taken a long time to understand what was going on. We can’t know everything about everything. It really does take a team to solve problems like this
Open communication – A team isn’t a team without good communication and if it wasn’t for open communication channels like Slack, it would have taken much longer to get to the bottom of an issue like this.
Intermittent bugs are still bugs – Just because you can’t reproduce something, doesn’t mean it doesn’t matter. The only way to reproduce this bug is to create a whole new site. The problem is we create hundreds of sites for clients for each release. So even though I can’t reproduce this bug (at least not without spinning up a whole new site), it would still be a high impact bug. The more users and usage you have, the more important intermittent bugs become. Don’t just ignore them!
Did that bug really go away? Don’t assume it too quickly!