Sumo Logic Utilizes GenAI to Reduce Mean Time-to-Resolution of Log Data

Kendra Rasmusson

About Sumo Logic

Sumo Logic is a cloud-based analytics platform that collects, analyzes, and manages log data from applications and networks across an organization. It provides real-time insight into security, operations, and business intelligence, and can help automate troubleshooting. More than 2,000 organizations worldwide rely on Sumo Logic for powerful real-time analytics and insights that resolve the hardest questions facing their cloud-native applications.

Sumo Logic’s Challenge

A long-established player, Sumo Logic recognized the step change in product and service improvements made possible by GenAI and has been investing heavily in innovation across its many product offerings.

“Customers have accepted AI as a real innovator, so the time is now for differentiating and disrupting the market,” said Tej Redkar, Chief Product Officer at Sumo Logic.

Sumo Logic surmised that GenAI could improve rule creation for security information and event management (SIEM) – specifically, the challenges experienced in the logging and observability space and the process of identifying a root cause from logs. An initial project aimed to automap structured log data to the Elastic Common Schema (ECS). The end goal of this initial proof of concept (POC) engagement was to catapult innovation that reduces mean time-to-resolution (MTTR) for Sumo Logic’s customers.

Why Tribe AI?

Sumo Logic was introduced to Tribe AI by the private equity firm Francisco Partners, which had recently acquired the company. After hearing the Tribe AI team lead a conversation on applying GenAI to innovate observability and security, the Sumo Logic team knew that Tribe AI was the right partner for the engagement.

Developing the Phase 1 Use Case

A Proof of Concept for Auto Mapping Log Data

The engagement with Sumo Logic started with a four-week POC in which Tribe AI performed extensive research and testing to determine whether a large language model (LLM) could automap structured log data to the Elastic Common Schema (ECS) format to improve observability. The work successfully demonstrated that an LLM could indeed be used for this type of functionality.

However, the Tribe AI team didn’t stop there. They also tested using an LLM to interpret unstructured log data, and showed that this works with Anthropic’s Claude.

Unstructured log data parsing tasks demonstrated with Claude:

1. Claude parses unstructured logs into ECS format
2. Claude identifies the log type, then parses it
3. Claude correctly explains what's happening in the log data
4. Claude identifies incidents from looking at logs (analysis)
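Tasks like these hinge on how the model is prompted. The sketch below shows one plausible way to assemble such a prompt; the prompt wording and the ECS field list are illustrative assumptions, not Sumo Logic's actual implementation.

```python
# Hypothetical sketch: building a prompt that asks an LLM (e.g., Claude) to
# identify a log's type and parse it into ECS fields. The instructions and
# the field list below are illustrative, not the production prompt.

ECS_FIELDS = ["@timestamp", "log.level", "event.action", "source.ip", "message"]

def build_ecs_parse_prompt(raw_log: str) -> str:
    """Assemble a parsing prompt for a single unstructured log line."""
    field_list = ", ".join(ECS_FIELDS)
    return (
        "Identify the log type, then parse the log line below into JSON "
        f"using only these ECS fields: {field_list}.\n"
        "If a field is not present in the log, omit it.\n\n"
        f"Log line: {raw_log}"
    )

prompt = build_ecs_parse_prompt(
    '192.0.2.7 - - [10/Oct/2024:13:55:36 +0000] "GET /health HTTP/1.1" 200'
)
```

The resulting string would be sent as the user message in an API call to the model; constraining the model to a fixed field list is one common way to limit hallucinated keys.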

Phase 1 Proposed Solution

During the POC engagement, the Tribe AI team proposed a solution, executed as the series of steps detailed below, to achieve automapping of structured log data.

1. Read the log file from JSON along with its corresponding CSV ground truth, a two-column map of source key to ECS key (key ---> ECS key)
2. Generate descriptions from the logs to support better "step by step" reasoning
3. Generate the mapping using the FieldSets and keys detailed in the prompt
4. Filter hallucinations from the mapping and, if ground truth is provided, evaluate against it
5. Present a summary (counts and accuracy)
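Steps 4 and 5 above can be sketched in a few lines. This is a minimal illustration, assuming a small hypothetical ECS key set and toy data; the real pipeline's field sets and scoring are Sumo Logic's own.

```python
# Illustrative sketch of steps 4-5: drop "hallucinated" mappings whose target
# is not a real ECS key, then score against a ground-truth map (source key ->
# ECS key) when one is provided. Field names and data here are hypothetical.

ALLOWED_ECS_KEYS = {"source.ip", "http.response.status_code", "url.path"}

def filter_and_evaluate(candidate, ground_truth=None):
    # Keep only mappings that point at a known ECS key (filters hallucinations).
    filtered = {k: v for k, v in candidate.items() if v in ALLOWED_ECS_KEYS}
    summary = {"mapped": len(filtered), "dropped": len(candidate) - len(filtered)}
    if ground_truth is not None:  # evaluate only if ground truth is provided
        correct = sum(1 for k, v in filtered.items() if ground_truth.get(k) == v)
        summary["accuracy"] = correct / len(ground_truth)
    return filtered, summary

candidate = {
    "clientip": "source.ip",
    "status": "http.response.status_code",
    "path": "url.fake_field",  # hallucinated: not a valid ECS key
}
truth = {"clientip": "source.ip", "status": "http.response.status_code"}
mapping, report = filter_and_evaluate(candidate, truth)
# report -> {'mapped': 2, 'dropped': 1, 'accuracy': 1.0}
```

The final summary dict is exactly the kind of counts-and-accuracy report step 5 describes.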

Developing the Phase 2 Use Case

Unlocking the Future of ‘Generative Context Engine’ with GenAI

Due to the success of the teams’ working relationship during the POC engagement, Sumo Logic decided to begin a second phase of the project. The focus of phase two was the interpretation of unstructured log data in the event of an incident – e.g., a security threat or an infrastructure outage – with the goal of reducing mean time-to-resolution when incidents arise. Customers presently spend millions on observability tools and are still limited by archaic, time-consuming processes when it comes to discovering the root cause of any given incident.

Current State of Logging & Observability:
Organizations face massive volumes of log data generated by applications, infrastructure, and services. These logs provide invaluable insights, but the sheer volume and complexity can make extracting meaningful information a daunting task.

Complexity of Logs:
Logs often come in unstructured formats from various sources—servers, applications, security systems, etc.—making it hard to identify patterns or link events. Traditional systems rely on predefined schemas, but they struggle with dynamic, unstructured data.

The Root Cause Process:
When an issue arises, teams need to sift through these logs, identify patterns, and link logs across different systems to form a 'trace' of the event. Traces provide a clearer picture of what went wrong. This end-to-end process can take hours or even days—especially when the logs are unstructured or missing critical information.

Tracing Instrumentation Challenges:
Implementing tracing systems can be time-consuming, requiring expert-level knowledge to set up and interpret. While tracing is helpful for deep diagnostics, many organizations haven’t yet implemented it or lack the resources to manage it effectively.

Business Opportunity:
Sumo Logic recently moved to a value-based licensing model with free data ingestion. By removing price as a barrier to entry, Sumo Logic saw a large increase in new customers. The company wants to cut through the complexity and demonstrate the platform’s ability to pinpoint root causes quickly, in real time, for both new customers and existing customers who have not yet implemented tracing.

The teams hypothesized that LLMs could provide a more dynamic and efficient way to interpret customer log data.

By providing LLMs with unstructured log data in natural language, the models can interpret it and respond – again in natural language – to uncover the root cause of an incident far faster.

Phase two spanned three months and ultimately proved the team's hypothesis valid: mean time-to-resolution was reduced from hours or days to less than one minute. The team calls this functionality the ‘Generative Context Engine’, and it has created a tremendous amount of excitement in the industry.

Phase 2 Proposed Solution

Tribe AI harnessed the power of Anthropic’s Claude 3.5 Sonnet LLM to extract meaningful insights from customer log data. Customers can skip the process of instrumenting traces to identify root causes, as Claude 3.5 Sonnet automatically analyzes logs and directly pinpoints the root cause of incidents. This functionality, called the ‘Generative Context Engine’, eliminates the need for predefined trace identifiers and accelerates the troubleshooting process. Claude’s capabilities had improved significantly since the original POC, so the frequency of 'hallucinations' noted in phase 1 was greatly reduced in phase 2.


How it works:

  • Log Compression: Using proprietary approaches, logs are deduplicated and then sampled to retain representation across all the services in the data, while maximizing the number of error messages that fit in the context window alongside the prompt.
  • Log Summarization: Sonnet 3.5 summarizes the resulting thousands of logs, allowing analysts to extract key insights without manually sorting through endless data.
  • Service map view: Using Sonnet 3.5, an overview map is generated showing how services are connected, highlighting services that are exhibiting problems.
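The compression step above can be illustrated with a toy sketch: deduplicate, then sample down to a token budget while always retaining error lines. The budget, the `"ERROR"` marker, and the log format here are illustrative assumptions, not Sumo Logic's proprietary approach.

```python
# Hypothetical sketch of "log compression": drop exact duplicates, keep every
# error message, and fill the remaining budget with a sample of other lines so
# errors dominate what fits in the model's context window. Illustrative only.
import random

def compress_logs(lines, budget=5, seed=0):
    deduped = list(dict.fromkeys(lines))           # deduplicate, preserve order
    errors = [l for l in deduped if "ERROR" in l]  # always retain error messages
    others = [l for l in deduped if "ERROR" not in l]
    room = max(budget - len(errors), 0)            # budget left after errors
    sampled = random.Random(seed).sample(others, min(room, len(others)))
    return errors + sampled

logs = ["INFO start"] * 3 + ["ERROR db timeout", "INFO ok", "ERROR db timeout"]
compressed = compress_logs(logs)
# compressed contains "ERROR db timeout" once, plus the two distinct INFO lines
```

A real implementation would budget by token count rather than line count and sample per service, but the priority ordering (errors first, representative sample second) is the core idea described above.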

Tech Stack Details

Full-stack cloud-based application working alongside the existing Sumo Logic’s environment: 

  • Cloud: Amazon Web Services (AWS) - Bedrock, S3, Cloudwatch, EKS
  • Large language models: Anthropic Claude 3.5 Sonnet
  • Language: Python

Model Training with Sample Data

Because Sumo Logic data wasn’t available at the start of the engagement, Tribe AI formulated a plan for demoing and testing the model using sample data. Data was generated with the open-source OpenTelemetry demo Astronomy app, running on minikube on Tribe team members’ laptops. The app was set up to send log and trace data from there into Sumo Logic, and the Tribe AI team then fed the data through Claude 3.5 Sonnet for analysis and interpretation. Eventually the app was tested with multiple transactions of log data, then deliberately over-scaled with a load generator to 1,000 simulated users until it crashed, so the teams could test the model’s accuracy in interpreting the incident and its root cause. Later, another open-source tool called Chaos Mesh was leveraged to generate test data representing specific scenarios, such as network outages and cascading failures.

Sumo Logic’s Experience Working with Tribe

“Partnering with Tribe AI – and leveraging their complementary GenAI skill set – was critical to the success of this project,” said Tej Redkar, Chief Product Officer.

Redkar’s team at Sumo Logic has experience building and using traditional AI algorithms, but without the Tribe AI team’s first-hand experience leveraging LLMs, delivering the first-of-its-kind ‘Generative Context Engine’ would not have gone so smoothly. Redkar credits Tribe AI for its ability to rapidly assemble a custom team of experts who understood the use case and contributed to the positive outcomes of the engagement.

“We had an ambitious scope and needed a really novel GenAI application to achieve our goal, which required a very high level of expertise in LLMs and prompt engineering,” said Tej Redkar, Chief Product Officer.

Tribe Team Members

Kuba & Alex: ML Engineers
Kash: AI Engineer
Sam: Technical & Product Lead + ML/AI Engineer
Orges: Engagement Manager

Impact

For Sumo Logic, the biggest impact achieved during the engagement was in reducing mean time-to-resolution (MTTR), a critical metric in their space. What used to take hours or days can now be achieved in less than a minute. Additionally, the cost per root cause is predicted to be around $0.50, compared to the hourly or daily rate of a full-time engineer.

This has also empowered their non-expert users to easily troubleshoot issues without relying on specialized staff or complicated instrumentation, broadening adoption and democratizing the use of log data across teams in the organization. 

A demo of the ‘Generative Context Engine’ was showcased at AWS re:Invent in December 2024 and was overwhelmingly well-received. Coupled with the announcement of the general availability of Sumo Logic’s copilot (called ‘Mo’), the demo signaled to the market that Sumo Logic is a category-defining leader in the observability space.

The Future

Tribe AI and Sumo Logic know that they are just scratching the surface of what’s possible with GenAI, leaning on Anthropic and AWS as partners. Looking ahead, the teams believe the solution can do more than analyze unstructured logs and suggest fixes. The hope is that one day it could predict outages before they happen or automatically resolve issues on its own. GenAI is on track to completely transform how security and performance monitoring is done, not just at Sumo Logic, but across the industry.

The teams are currently working to scale up the Phase 2 solution to a one-to-one match of functionality through a beta test that incorporates real customer data. The post-beta solution will be productized, with a General Availability launch planned for the end of Q1 2025.
