Rapid response to a high-priority application incident creates an urgent—and nearly always painful—fire drill for any developer, Site Reliability Engineer (SRE), or DevOps professional responsible for an application. Besides the necessity to drop everything, there is the considerable stress of mitigating the issue as quickly as possible, knowing that revenue and customers may be at stake. For whatever odd reason, these problems or outages seem to occur at inconvenient hours, and the virtual “elapsed time clock” for mean time to resolve (MTTR) that starts ticking immediately only adds more pressure.
The Quick Fixes and Temporary Band-Aid Solutions
The first pressure is to find a quick fix or work-around to mitigate the impact on users. A work-around is not always possible without finding the root cause, and even when one is, it is only a temporary “band-aid” solution, because full remediation cannot happen until the root cause has been identified and a durable fix is implemented. Band-aids are almost always temporary, and the same problem or a related one is likely to occur again. The simplest example is a system that periodically slows down, with the fix being to restart it (a variant of the familiar “CTRL-ALT-DEL” fix in Windows). Although the system becomes fully operational after the “fix,” the problem will undoubtedly recur at a most inconvenient time. Finding ways to shrink “time to root cause” is a big part of overall MTTR improvement.
The Burden of the Rules-Based Approach
Another game-changer for incident management would be reducing the burden of building and maintaining alert rules or health checks. Alert rules were more reliable in a bygone software era, when software stacks were much simpler and changed relatively slowly. They allowed future occurrences of known issues to be automatically detected and possibly even remediated by triggering customized runbooks for each issue. Relying on alert rules has become much harder as software stacks have become more horizontal and complex, and the pace of software change has accelerated by leaps and bounds. For one thing, it is hard to build rules that help you pinpoint problems when the software stack has an ever-growing landscape of possible failure modes. For another, there is the ongoing maintenance burden of testing and updating these rules as software changes rapidly. Building and maintaining such a rule system takes a lot of work and diverts technical teams from other important activities. So most teams rely on very simple rules that catch the most catastrophic or coarse symptoms of problems but do not tell you the causes (an approach Google calls black-box monitoring).
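To make that maintenance burden concrete, here is a minimal sketch (not from any particular tool) of the kind of hand-written, regex-based alert rule many teams rely on. The rule names, log formats, and patterns are hypothetical; the point is that every one of them has to be re-tested whenever the log format shifts.

```python
import re

# Hypothetical hand-maintained alert rules: each pattern must be updated
# and re-tested whenever the underlying log format changes.
ALERT_RULES = [
    {
        "name": "db-connection-pool-exhausted",
        "pattern": re.compile(r"ERROR .* connection pool exhausted \(max=(\d+)\)"),
        "severity": "page",
    },
    {
        "name": "payment-timeout",
        "pattern": re.compile(r"WARN .* payment gateway timeout after (\d+)ms"),
        "severity": "ticket",
    },
]

def evaluate_rules(log_line: str):
    """Return the first rule that matches a raw log line, or None."""
    for rule in ALERT_RULES:
        if rule["pattern"].search(log_line):
            return rule
    return None

# A small format change ("max=" becomes "limit=") silently breaks the rule.
print(evaluate_rules("2024-05-01 ERROR db: connection pool exhausted (max=50)"))
print(evaluate_rules("2024-05-01 ERROR db: connection pool exhausted (limit=50)"))
```

The second lookup returns nothing even though the same failure occurred, which is exactly the kind of silent drift that makes rule catalogs expensive to keep trustworthy.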
Automated Root Cause Analysis
Rather than relying on a rules-based approach, organizations are moving towards automated root cause analysis. Machine learning technologies can revamp these human-driven processes by quickly and efficiently finding the root cause of software incidents, dramatically reducing the time it takes to implement a fix. This saves SREs and software engineers countless hours digging through dashboards and millions of log events to determine what went wrong. As a general troubleshooting workflow, metrics tell you when something goes wrong, traces help you narrow down where, and logs help you understand why. The volume, diversity, and free-form nature of logs make them an excellent application for machine learning-based automated root cause analysis systems, and correlating the discovered abnormalities is likewise well suited to machine learning.
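For intuition only, here is a minimal sketch of one common building block of log-based anomaly detection: reduce raw lines to rough “templates,” learn how often each template normally appears, and flag templates that are rare or brand new around the time of an incident. Real systems use far more sophisticated templating and statistical models; the function names and log lines below are made up for illustration.

```python
import re
from collections import Counter

def template_of(line: str) -> str:
    """Reduce a raw log line to a rough template by masking hex IDs and numbers."""
    line = re.sub(r"0x[0-9a-fA-F]+", "<HEX>", line)
    return re.sub(r"\d+", "<NUM>", line)

def rare_templates(baseline_lines, incident_lines, max_count=1):
    """Templates seen during the incident window that were rare or unseen in the baseline."""
    baseline = Counter(template_of(l) for l in baseline_lines)
    flagged = []
    for line in incident_lines:
        t = template_of(line)
        if baseline[t] <= max_count:
            flagged.append((t, line))
    return flagged

# Toy usage: the unusual deadlock line stands out against a large "normal" baseline.
normal = ["INFO request 123 served in 12ms"] * 1000
incident = [
    "INFO request 456 served in 9ms",
    "ERROR worker 7 deadlock detected on lock 0xdeadbeef",
]
for tmpl, line in rare_templates(normal, incident):
    print("anomalous:", line)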
How Machine Learning Can Expedite the Process
Some organizations may have professionals who are especially adept at manually finding the root cause without machine learning, but their process can take considerable time. The most experienced engineers develop instincts that help them spot unusual events in the logs and correlate them with errors or warnings. But it is most often still a quest to find the unknown.
1. Operating in Real-Time
Machine learning is far quicker and more exhaustive than human eyes at spotting these anomalous event patterns, particularly when dealing with large volumes of data. In real time, it can find abnormal correlations between unusual or rare events and errors and construct RCA narratives from the resultant data. Machine learning techniques such as natural language processing (NLP) can even summarize a problem in plain language using models trained on known technical details available in the public domain.
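As a rough illustration of the correlation step (a tiny sketch with made-up event names, timestamps, and a fixed time window, where a production system would learn these associations statistically), the idea is to line up rare events against the errors that follow shortly after them:

```python
from datetime import datetime, timedelta

def correlated_rare_events(rare_events, errors, window_seconds=60):
    """Pair each error with any rare/unusual event that occurred shortly before it,
    a crude stand-in for the correlation an ML pipeline would automate."""
    window = timedelta(seconds=window_seconds)
    hits = []
    for err_time, err_line in errors:
        for ev_time, ev_line in rare_events:
            if err_time - window <= ev_time <= err_time:
                hits.append((ev_line, err_line))
    return hits

# Toy timestamps: a rare config-reload event lands just before an error burst.
t0 = datetime(2024, 5, 1, 3, 14, 0)
rare_events = [(t0, "WARN config reload took 4800ms, using stale cache")]
errors = [(t0 + timedelta(seconds=20), "ERROR 502 from upstream for /checkout")]
for cause, effect in correlated_rare_events(rare_events, errors):
    print(f"{cause!r} may explain {effect!r}")
```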
2. Ongoing Analysis
Besides remediating problems, a system for root cause analysis can also work proactively. Organizations can configure a set of “signals” or conditions to trigger these machine learning-generated reports. These signals could come from monitoring tools that deterministically detect real incidents or known symptoms. Some teams, for example, might simply watch for spikes in overall error frequency to know something is amiss. That same simple alert could trigger machine learning to scan the logs around the time of the alert and identify unusual events or sequences of events that explain the spike in errors. Machine learning can even fingerprint these root cause sequences, so when they occur again, there is already a pre-built rule that can be easily connected to an alert channel. Rather than a manually maintained system of rules, the automated machine learning approach requires no time investment from technical teams. Machine learning can also take over the load of maintaining complex “diagnostic” alert rules, avoiding the monotony of tweaking and testing regular expressions (regexes) to keep up with changing log formats.
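The fingerprinting idea can be sketched very simply: once a root cause sequence of log templates has been identified, hash it into a stable identifier so the next occurrence can be matched instantly. The sequence contents and labels below are hypothetical, and real systems typically fingerprint normalized templates rather than raw lines.

```python
import hashlib

def fingerprint(templates):
    """Stable fingerprint for an ordered sequence of root-cause log templates."""
    joined = "\n".join(templates)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()[:16]

# Hypothetical root-cause sequence surfaced by an ML scan around an alert.
sequence = [
    "ERROR db: connection pool exhausted (max=<NUM>)",
    "WARN retry queue depth <NUM> exceeds threshold",
    "ERROR checkout service returning <NUM> to clients",
]

# The first time this sequence is diagnosed, its fingerprint is stored with a label
# and wired to an alert channel; later recurrences match without any manual rules.
KNOWN_ROOT_CAUSES = {fingerprint(sequence): "db-pool-exhaustion -> checkout errors"}

def match_known_cause(new_sequence):
    """Return the label of a previously diagnosed root cause, or None if it is new."""
    return KNOWN_ROOT_CAUSES.get(fingerprint(new_sequence))

print(match_known_cause(sequence))  # -> "db-pool-exhaustion -> checkout errors"
```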
3. Uncovering Incidents Early or Even Before They Happen
Finally, machine learning can proactively find silent, not-yet-manifested bugs and inform the team before they cause serious problems in production. It used to be that new releases were always tested extensively before being deployed to production. These stress and usage tests could surface issues before they caused problems in production. Today, fast deployment cycles leave little room for such extensive testing. The new trend is “testing in production,” but even that has its limitations.
By proactively surfacing correlated errors and anomalous event patterns, machine learning can uncover subtle or dormant bugs early, before they have a severe or widespread impact on users. For instance, our team recently caught a bug in this way: a malformed middleware SQL query that could have prevented users from completing their intended workflow. The ML identified it and sent an alert.
Machine Learning is a Game-Changer for the Incident Management Lifecycle
The shift to machine learning can be a game-changer for the entire incident management lifecycle. As more users rely on software applications, the need to shrink MTTR and the stress of troubleshooting incidents under pressure grow proportionally. Machine learning counters these pressures. It can remove a substantial amount of disruption and fire drills, enabling better, faster, uninterrupted development work and the proactive ability to prevent or minimize problems before they impact the business or customers. After all, game-changing development work deserves a game-changing incident management lifecycle.
How do you think machine learning can transform the incident management lifecycle? Let us know on LinkedIn, Twitter, or Facebook. We’d love to hear from you!