Step away from the AI fanfare and focus on software reliability


Here is a challenge for you: go to any popular technology trade show and attempt to escape the talk tracks that focus on AI. AI is still very much in vogue, the zeitgeist that promises to define this decade's technology.

However, while AI is undoubtedly going to change everything, in some ways it is simply the poster child for code that executes more intelligently than it did before. My favourite (tongue-in-cheek) definition of AI is anything that computers can do well today that they couldn’t do ten years ago. Decades ago that was playing chess; today it’s playing Go or driving a car.

The user scenarios demonstrated at the tech shows are still a pipe dream for many developers. Examples such as the AI systems in Tesla’s automated cars and Google DeepMind’s AlphaGo, which beat the world champion of Go, are complex intelligent systems far removed from the everyday ones that typical engineering teams deal with. It’s a bit like teenage sex – everyone is talking about it but few are doing it.

AI threat to jobs

When AI is not heralded as the next new imperative, it is vilified as the grim reaper set to destroy humanity, or at least to steal our jobs. Since the Industrial Revolution, automation has progressively replaced workforces, from the Jacquard loom and agricultural mechanisation to automated supermarket checkouts and online accounting software. In reality, nearly all business applications created today are still based on traditional software. When it comes to jobs, humans will continue to work alongside automated systems, roles will evolve to account for the management of new, complex systems, and human decision-making will go hand in hand with their development.

While many senior software engineers within tech companies are charged by the executive team to “implement AI”, they are all still dealing with business applications based on conventional software. In a world where about 95 percent of ATM transactions use COBOL, and at a time when a range of products remain built on decades-old code, AI headlines distract from the real software reliability issues at play.

(Image credit: Geralt / Pixabay)

Laying the foundation for a successful AI deployment

AI, like any traditional business system or application, needs a robust foundation. Before they can get to grips with creating complex applications, engineers need to address immediate software reliability issues first. Nearly all software ships with undiagnosed bugs that may turn into serious production incidents, resulting in client churn and burning hours of engineering resource later down the line. It would be foolish to ignore the promise – and the threats – heralded by AI; it is equally foolish to ignore the problems and risks that stem from the unreliability of today’s “regular” software.

Commercial pressures mean that software development managers and their teams have to make trade-offs between code quality and the pressure to ship new product features. The Economist writes that some of the neatest software ever written – by NASA’s Software Assurance Technology Centre – carried 0.1 errors per 1,000 lines of source code. Most software has a defect rate many orders of magnitude higher than that. Tricentis, a testing platform vendor, highlighted this problem in its January 2018 Software Fail Watch report. It analysed 606 software failures and found that over 3.6 billion people had been affected by these software problems, resulting in $1.7 trillion in lost revenue to software vendors.

Database vendors are particularly vulnerable due to the highly competitive nature of the market, the complexity of the systems, and the high costs of unreliability. As a result, bugs must be addressed as early as possible in testing. Unfortunately, many are difficult to identify: they affect the program only subtly, so they may not show up in the testing phase at all. Once in production, these bugs can lead to severe outages and software failures.

Software reliability

To ensure businesses steer away from the above scenario, engineering departments are well advised to consider their software reliability strategy and take preventive measures to diagnose serious software defects before they cause havoc on customer sites. So what can software development teams do to make their software more reliable? The revolution in testing (Continuous Integration, Test Driven Development, Fuzz Testing, etc.) means that today thousands of automated tests can be run. A typical software project of a given size will be running thousands of times more tests than an equivalent project ten or twenty years ago. For the industry, this is a big leap forward. But all these tests are a nightmare to triage if even a tiny fraction fail, particularly if they fail intermittently.
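To make the triage problem concrete, here is a minimal Python sketch of one common first step: rerunning each test several times to separate consistent failures from intermittent ones. The function and test names (`classify`, `make_flaky` and so on) are hypothetical and invented for illustration; real test runners offer their own rerun mechanisms.

```python
def classify(test, runs=20):
    """Run a zero-argument test callable `runs` times and label the outcome."""
    failures = 0
    for _ in range(runs):
        try:
            test()
        except AssertionError:
            failures += 1
    if failures == 0:
        return "stable-pass"
    if failures == runs:
        return "stable-fail"
    return "intermittent"


# Hypothetical tests, for illustration only.
def always_passes():
    assert 1 + 1 == 2


def always_fails():
    assert 1 + 1 == 3


def make_flaky():
    """Build a test that fails on every third call, standing in for a
    timing- or ordering-dependent failure."""
    calls = {"n": 0}

    def flaky():
        calls["n"] += 1
        assert calls["n"] % 3 != 0

    return flaky


if __name__ == "__main__":
    tests = [("always_passes", always_passes),
             ("always_fails", always_fails),
             ("flaky", make_flaky())]
    for name, test in tests:
        print(name, classify(test))
```

Rerunning tells a team which failures are intermittent, but not why they happen, which is where the hard diagnostic work begins.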

One possible solution to this trillion-dollar problem of software reliability is software flight recording. By recording a program execution as it fails, engineering teams obtain a reliable, reproducible test case that gives them total visibility into all the factors that led up to (and caused) a crash or program misbehaviour. This approach is especially effective against intermittent test failures, which are by nature very difficult to reproduce – a common problem in software development. Software failures can then be captured, replayed in a reversible debugger and diagnosed orders of magnitude faster than with traditional techniques. Recording and replaying program execution allows software engineering teams to observe exactly what their program did at any point in time, and why. This helps to speed up time-to-resolution and minimise customer disruption.
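The general idea behind record and replay can be illustrated with a toy Python sketch. It is an assumption-laden simplification, not a description of any particular product: production tools record at the level of the whole process, whereas this example only logs the values coming from a single non-deterministic source. All names here (`Recorder`, `Replayer`, `process_orders`) are hypothetical.

```python
import random


class Recorder:
    """Wraps a source of non-determinism and logs every value it hands out."""

    def __init__(self, source):
        self.source = source
        self.log = []

    def next_value(self):
        value = self.source()
        self.log.append(value)
        return value


class Replayer:
    """Feeds back a previously recorded log, value by value."""

    def __init__(self, log):
        self._values = iter(log)

    def next_value(self):
        return next(self._values)


def process_orders(inputs, count=50):
    """Hypothetical program under test: it fails only on rare input values."""
    total = 0
    for step in range(count):
        value = inputs.next_value()
        if value % 97 == 0:  # the rare, hard-to-reproduce condition
            raise ValueError(f"bad value {value} at step {step}")
        total += value
    return total


def record_until_failure(max_attempts=1000):
    """Run the program repeatedly, keeping the log of the first failing run."""
    for _ in range(max_attempts):
        recorder = Recorder(lambda: random.randint(1, 1000))
        try:
            process_orders(recorder)
        except ValueError as exc:
            return recorder.log, str(exc)
    return None, None


if __name__ == "__main__":
    log, message = record_until_failure()
    if log is not None:
        print("recorded failure:", message)
        try:
            process_orders(Replayer(log))
        except ValueError as exc:
            print("replayed failure:", exc)
```

Because the replay consumes exactly the values that were logged, the failure recurs at the same step every time, which is what turns an intermittent bug into a repeatable test case.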

Recording and replaying program execution is a revolution in software development and testing. Businesses should be less concerned with the fanfare surrounding AI systems and consider instead how to improve the foundation on which their business applications and products are built.

Dr Greg Law, Co-Founder & CTO at Undo