Don’t stop me now

From Arbitrary Technical Decision To Bug: It’s Just a Matter of Time…
A miniseries on software testing, legacy code and the impact of wrong assumptions or arbitrary technical decisions on software products.

If you have ever tested medical devices, or other safety-critical software products, you know that bugs in production are not that common in such a context.
Still, you can come across one every once in a while, and when you do, everyone involved tends to take the situation very seriously, since it is usually a great opportunity to learn something of extraordinary importance: the best possible lesson for avoiding the same mistake again.

So, some years ago, while I was working as a software tester on medical devices, we received a worrying complaint from a laboratory that used an instrument developed and sold by the company I was working for.

The issue was pretty serious: the device in question had suddenly stopped working and rebooted right in the middle of a job run, that is, a set of assays being performed on a batch of blood samples.
Arguably, one of the most effective ways to upset a lab technician.

Different people within my team (programmers, testers, managers) started investigating how that could have happened, all of us being practically flabbergasted by the issue.
You know, professionals dealing with regulated environments don’t really enjoy these kinds of surprises.

After a long and difficult investigation, which lasted several hours, the cause of the issue was eventually exposed: there was an internal time-out somewhere (something nobody was aware of, of course) which made our instruments reboot automatically after about three weeks of continuous operation.
The thing is, not only was nobody aware of that technical specification(?)/limitation(?), which had to be removed, of course, but we also realized that we would never have discovered the problem by ourselves.
Can you guess why?

"#time #out #timeout #broken" by Eric_Dorsey

Well, because every Friday afternoon, before leaving the office, we took special care to switch off all the instruments.
Which means they had never been running continuously for more than a few days.
So, we had to acknowledge that, on the one hand, we didn’t know the devices we spent so much time with as well as we believed, and, on the other hand, we didn’t know enough about our customers and their habits either [1].

The bottom line is that both wrong assumptions (about how customers use a product, for example) and arbitrary(?) technical decisions (especially the ones nobody is aware of because they were made in the very distant past on legacy code and never documented) will turn into bugs sooner or later.

This is why I believe that challenging assumptions is the most effective way to reduce the impact of arbitrary technical decisions on software products.

Anyway, I guess you agree that one can always try to learn something from their mistakes.

Even though, out of respect for both the environment and the electricity bill, we didn’t give up the habit of making sure the instruments were turned off over the weekend, we started experimenting much more with system date and time settings, for example…
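One way to run that kind of experiment without waiting three weeks is to make the clock something a test can control. The following is only a minimal sketch under stated assumptions, not the actual instrument code: `FakeClock`, `Instrument`, `HIDDEN_UPTIME_LIMIT` and `reboots_during` are hypothetical names invented for illustration, and the exact three-week limit is assumed from the "about three weeks" mentioned above.

```python
from datetime import datetime, timedelta


class FakeClock:
    """Test double for the system clock; time moves only when told to."""

    def __init__(self, start: datetime) -> None:
        self._now = start

    def now(self) -> datetime:
        return self._now

    def advance(self, delta: timedelta) -> None:
        self._now += delta


class Instrument:
    """Hypothetical device model with an undocumented uptime limit."""

    # Assumed value: the story only says "about three weeks".
    HIDDEN_UPTIME_LIMIT = timedelta(weeks=3)

    def __init__(self, clock: FakeClock) -> None:
        self._clock = clock
        self._booted_at = clock.now()
        self.reboots = 0

    def tick(self) -> None:
        """One pass of the main loop; reboots once the hidden limit is reached."""
        if self._clock.now() - self._booted_at >= self.HIDDEN_UPTIME_LIMIT:
            self.reboots += 1
            self._booted_at = self._clock.now()


def reboots_during(days_powered_on: int) -> int:
    """Simulate continuous operation, one tick per simulated hour."""
    clock = FakeClock(datetime(2020, 1, 6, 8, 0))  # arbitrary start date
    instrument = Instrument(clock)
    for _ in range(days_powered_on * 24):
        clock.advance(timedelta(hours=1))
        instrument.tick()
    return instrument.reboots


if __name__ == "__main__":
    print(reboots_during(5))   # the Monday-to-Friday routine from the office
    print(reboots_during(30))  # a lab that never powers the device down
```

With a setup like this, a five-day simulated run never reaches the limit, while a thirty-day run does, which is exactly the gap between our office habits and the customer’s.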

[1] By the way, this is also why I believe that one of the most interesting experiences I had while working at that company was visiting different hospitals that were using our instruments.

