Advancements in IT incident remediation: GenAI-powered incident response
In the realm of IT operations, I'm intrigued by the ongoing advances in automation. From streamlined anomaly detection to de-duplication and correlation, automation has undoubtedly revolutionized many aspects of incident management. However, it's equally fascinating to observe the persistent challenges, particularly when resolution information is scarce and knowledge articles are outdated — lacking detailed step-by-step instructions.
Enter generative AI. I wanted to see whether, amidst all the hype, it could indeed bridge the last mile in incident remediation, incorporating aspects of self-healing. Here, I'd like to share my experiences based on a pilot I ran for one of our clients.
The challenges of the automation journey
In the heart of a bustling data center, a critical incident unfolds. One of the core pieces of IT equipment, responsible for a vital part of the operation, has encountered a perplexing issue. A sense of urgency fills the air as engineers rush to investigate and resolve the problem. However, here's where the challenge arises: No one has encountered this issue before, and the resolution information in the knowledge base is scant and outdated. In addition, the equipment comes with hundreds of user manuals, each describing specific functionalities and troubleshooting procedures.
In their desperation for a quick fix, some engineers resort to Google searches, hoping to find relevant information on the web. Unfortunately, what they find is often outdated, unverified, and contradictory from one source to the next, further complicating an already complex situation.
The role of GenAI
This is where generative AI (GenAI) takes center stage, providing loosely structured remediation recommendations. It combines the verified solutions from your knowledge database with the extensive knowledge aggregated by public GenAI models. Using large language models (LLMs), vector databases, embeddings and machine learning (ML) techniques, the platform processes the incident data, consults the knowledge articles, refines the resolution text drawn from those articles, and generates a concise, structured recommendation.
This recommendation draws on the entirety of your knowledge database as well as the vast technical knowledge on which the LLM was originally trained. Armed with it, your IT engineers can review the recommendation against their own body of knowledge and swiftly implement the resolution.
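To make the retrieval idea above more concrete, here is a minimal sketch in Python. It is not the platform used in the pilot: the article ids, the article texts, and the function names embed, retrieve and build_prompt are all hypothetical, and a toy bag-of-words vector stands in for a real embedding model and vector database. It only shows the general shape of the flow: embed the incident, find the closest knowledge articles, and assemble a prompt that an LLM could turn into a recommendation.

```python
import math
import re
from collections import Counter

# Hypothetical, hand-written knowledge articles; a real system would load
# verified articles from the organization's knowledge base.
KNOWLEDGE_ARTICLES = {
    "KB-001": "Restart the storage controller service after a firmware update fails.",
    "KB-002": "Clear the print spooler queue when print jobs hang.",
    "KB-003": "Replace the faulty power supply unit and check the fan status.",
}


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(incident, articles, top_k=2):
    """Return ids of up to top_k articles that share vocabulary with the incident."""
    query = embed(incident)
    scored = [
        (cosine_similarity(query, embed(body)), article_id)
        for article_id, body in articles.items()
    ]
    # Highest score first; ties broken by article id for a stable order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [article_id for score, article_id in scored[:top_k] if score > 0]


def build_prompt(incident, articles, article_ids):
    """Assemble the text a language model would receive (no model is called here)."""
    context = "\n".join(f"[{i}] {articles[i]}" for i in article_ids)
    return (
        "Incident:\n" + incident + "\n\n"
        "Verified knowledge articles:\n" + context + "\n\n"
        "Write a concise, step-by-step remediation recommendation."
    )


incident_text = "Storage controller is unresponsive after firmware update"
matches = retrieve(incident_text, KNOWLEDGE_ARTICLES)
prompt = build_prompt(incident_text, KNOWLEDGE_ARTICLES, matches)
print(prompt)
```

In a production setup, embed would presumably call an embedding model and retrieve would query a vector database, but the division of labor would stay the same: verified articles supply the context, and the LLM only drafts the wording.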
The benefits of AI-assisted remediation
GenAI brings a host of advantages to the table, revolutionizing the landscape of incident response. First, it significantly reduces the manual effort required to identify technical and procedural remediation details. What used to be a time-consuming task is now streamlined, allowing incident responders to focus on resolution rather than hunting for information. The results of an initial pilot project we conducted showed a 30% savings in effort. Second, GenAI empowers junior staff by providing access to a wealth of knowledge. It ensures that even less experienced staff can confidently refer to the correct body of knowledge, improving their competence and accelerating their growth within the team.
Moreover, the structured output not only makes the information easier for engineers to absorb, but also paves the way for seamless bot-to-bot communication. A consistent structure keeps the information flow uniform and improves the overall efficiency of the incident management process.
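As an illustration of what such a structured output might look like, the sketch below models a recommendation as a small Python dataclass and serializes it to JSON so that another bot could consume it. The field names, the incident and article ids, and the needs_human_review flag are assumptions made for this example, not the schema used in the pilot.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Recommendation:
    """Hypothetical schema for a structured remediation recommendation."""
    incident_id: str
    summary: str
    steps: list
    source_articles: list = field(default_factory=list)
    # Assumed default: an engineer validates before anything is executed.
    needs_human_review: bool = True


def to_message(rec):
    """Serialize a recommendation to a JSON string for another bot to read."""
    return json.dumps(asdict(rec), sort_keys=True)


def from_message(message):
    """Rebuild a recommendation from its JSON form."""
    return Recommendation(**json.loads(message))


rec = Recommendation(
    incident_id="INC-0001",
    summary="Storage controller unresponsive after firmware update",
    steps=[
        "Confirm the firmware update failed",
        "Restart the storage controller service",
        "Verify the controller responds again",
    ],
    source_articles=["KB-001"],
)
message = to_message(rec)
restored = from_message(message)
print(message)
```

Because the message round-trips without loss, a ticketing bot, a chat bot and a human-facing dashboard could all read the same payload, and the review flag keeps the engineer in the loop.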
Finally, GenAI plays a pivotal role in improving the management of an organization’s technical and institutional knowledge. By continuously learning and adapting, it helps keep that knowledge up to date and readily accessible.
How to move forward with GenAI support
Going forward, there will be challenges in how junior staff members interact with GenAI recommendations. While these recommendations may be invaluable for streamlining incident resolution, they must be used judiciously. Less experienced staff members may rely on them too heavily, assuming that they are definitive solutions. This highlights the importance of a balanced approach in which junior staff view GenAI guidance as a starting point, not the final word.
Finally, there is one fundamental challenge that absolutely must be addressed to ensure the success of any AI-based IT Ops initiative. It is critical to understand that these solutions excel when a robust and continuously updated knowledge dataset is available for them to consume.
The effectiveness of GenAI is intrinsically linked to the quality of the information it relies upon. If it is given a quality dataset (and that is a big “if”), GenAI has the capability to write a resolution that the support engineer only needs to validate.
In the future, we will see GenAI becoming a true helper and — as feedback is incorporated into the loop — we will see GenAI reach self-healing mode, where it can implement some resolutions itself. Watch this space for more!