Sumo Logic Utilizes GenAI to Reduce Mean Time-to-Resolution of Log Data

Kendra Rasmusson

About Sumo Logic

Sumo Logic is a cloud-based analytics platform that collects, analyzes, and manages log data from applications and networks across an organization. It provides real-time insight into security, operations, and business intelligence, and can help automate troubleshooting. More than 2,000 organizations worldwide rely on Sumo Logic for powerful real-time analytics and insights that resolve the hardest questions facing their cloud-native applications.

Sumo Logic’s Challenge

A long-established player, Sumo Logic recognized the step change in product and service improvements made possible by GenAI and has been investing heavily in innovation across its many product offerings.

“Customers have accepted AI as a real innovator, so the time is now for differentiating and disrupting the market,” said Tej Redkar, Chief Product Officer at Sumo Logic.

Sumo Logic surmised that GenAI could improve rule creation for security information and event management (SIEM) – specifically, the challenges experienced in the logging and observability space and the process of identifying a root cause from logs. An initial project aimed to automap structured log data to the Elastic Common Schema (ECS). The end goal of this initial proof of concept (POC) engagement was to catapult innovation that reduces mean time-to-resolution (MTTR) for Sumo Logic’s customers.

Why Tribe AI?

Sumo Logic was introduced to Tribe AI by the private equity firm Francisco Partners, which had recently acquired the company. After hearing the Tribe AI team lead a conversation on applying GenAI to innovate observability and security, the Sumo Logic team knew that Tribe AI was the right partner for the engagement.

Developing the Phase 1 Use Case

A Proof of Concept for Auto Mapping Log Data

The engagement with Sumo Logic started with a four-week POC in which Tribe AI performed extensive research and testing to determine whether a large language model (LLM) could automap structured log data to the Elastic Common Schema (ECS) format to improve observability. The work successfully demonstrated that an LLM could indeed be used for this type of functionality.

However, the Tribe AI team didn’t stop there. They also tested using an LLM to interpret unstructured log data, and showed that this works with Anthropic’s Claude.

Unstructured log data parsing tasks demonstrated with Claude:

1. Claude parses unstructured logs into ECS format
2. Claude identifies the log type, then parses it
3. Claude correctly explains what's happening in the log data
4. Claude identifies incidents from looking at logs (analysis)
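Tasks like these hinge on how the model is prompted. The sketch below shows one plausible way to assemble such a prompt; the prompt wording and the ECS field list are illustrative assumptions, not Sumo Logic's actual implementation.

```python
# Hypothetical sketch: building a prompt that asks an LLM (e.g., Claude) to
# identify a log's type and parse it into ECS fields. The instructions and
# the field list below are illustrative, not the production prompt.

ECS_FIELDS = ["@timestamp", "log.level", "event.action", "source.ip", "message"]

def build_ecs_parse_prompt(raw_log: str) -> str:
    """Assemble a parsing prompt for a single unstructured log line."""
    field_list = ", ".join(ECS_FIELDS)
    return (
        "Identify the log type, then parse the log line below into JSON "
        f"using only these ECS fields: {field_list}.\n"
        "If a field is not present in the log, omit it.\n\n"
        f"Log line: {raw_log}"
    )

prompt = build_ecs_parse_prompt(
    '192.0.2.7 - - [10/Oct/2024:13:55:36 +0000] "GET /health HTTP/1.1" 200'
)
```

The resulting string would be sent as the user message in an API call to the model; constraining the model to a fixed field list is one common way to limit hallucinated keys.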

Phase 1 Proposed Solution

During the POC engagement, the Tribe AI team proposed a solution, executed as the series of steps detailed below, to achieve automapping of structured log data.

1. Read the log file from JSON along with its corresponding CSV ground truth, a two-column map of source key to ECS key (key ---> ECS key)
2. Generate descriptions from the logs to support better "step by step" reasoning
3. Generate the mapping using the FieldSets and keys detailed in the prompt
4. Filter hallucinations from the mapping and, if ground truth is provided, evaluate against it
5. Present a summary (counts and accuracy)
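Steps 4 and 5 above can be sketched in a few lines. This is a minimal illustration, assuming a small hypothetical ECS key set and toy data; the real pipeline's field sets and scoring are Sumo Logic's own.

```python
# Illustrative sketch of steps 4-5: drop "hallucinated" mappings whose target
# is not a real ECS key, then score against a ground-truth map (source key ->
# ECS key) when one is provided. Field names and data here are hypothetical.

ALLOWED_ECS_KEYS = {"source.ip", "http.response.status_code", "url.path"}

def filter_and_evaluate(candidate, ground_truth=None):
    # Keep only mappings that point at a known ECS key (filters hallucinations).
    filtered = {k: v for k, v in candidate.items() if v in ALLOWED_ECS_KEYS}
    summary = {"mapped": len(filtered), "dropped": len(candidate) - len(filtered)}
    if ground_truth is not None:  # evaluate only if ground truth is provided
        correct = sum(1 for k, v in filtered.items() if ground_truth.get(k) == v)
        summary["accuracy"] = correct / len(ground_truth)
    return filtered, summary

candidate = {
    "clientip": "source.ip",
    "status": "http.response.status_code",
    "path": "url.fake_field",  # hallucinated: not a valid ECS key
}
truth = {"clientip": "source.ip", "status": "http.response.status_code"}
mapping, report = filter_and_evaluate(candidate, truth)
# report -> {'mapped': 2, 'dropped': 1, 'accuracy': 1.0}
```

The final summary dict is exactly the kind of counts-and-accuracy report step 5 describes.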

Developing the Phase 2 Use Case

Unlocking the Future of ‘Generative Context Engine’ with GenAI

Due to the success of the teams’ working relationship during the POC engagement, Sumo Logic decided to begin a second phase of the project. The focus of phase two was the interpretation of unstructured log data in the event of an incident – e.g., a security threat or an infrastructure outage – with the goal of reducing mean time-to-resolution when incidents arise. Customers presently spend millions on observability tools and are still limited by archaic, time-consuming processes when it comes to discovering the root cause of any given incident.

Current State of Logging & Observability:
Organizations face massive volumes of log data generated by applications, infrastructure, and services. These logs provide invaluable insights, but the sheer volume and complexity can make extracting meaningful information a daunting task.

Complexity of Logs:
Logs often come in unstructured formats from various sources—servers, applications, security systems, etc.—making it hard to identify patterns or link events. Traditional systems rely on predefined schemas, but they struggle with dynamic, unstructured data.

The Root Cause Process:
When an issue arises, teams need to sift through these logs, identify patterns, and link logs across different systems to form a 'trace' of the event. Traces provide a clearer picture of what went wrong. This end-to-end process can take hours or even days—especially when the logs are unstructured or missing critical information.

Tracing Instrumentation Challenges:
Implementing tracing systems can be time-consuming, requiring expert-level knowledge to set up and interpret. While tracing is helpful for deep diagnostics, many organizations haven’t yet implemented it or lack the resources to manage it effectively.

Business Opportunity:
Sumo Logic recently moved to a value-based licensing model with free data ingestion. By removing price as a barrier to entry, Sumo Logic saw a large increase in new customers. The company wants to cut through the complexity and demonstrate the platform’s ability to pinpoint root causes quickly, in real time, for both new customers and existing customers who have not yet implemented tracing.

The teams hypothesized that LLMs could provide a more dynamic and efficient way to interpret customer log data.

By providing LLMs with unstructured log data in natural language, the models can interpret it and respond – again in natural language – to uncover the root cause of an incident far faster.

Phase two spanned three months and ultimately proved the team's hypothesis valid: mean time-to-resolution was reduced from hours or days to less than one minute. The team calls this functionality the ‘Generative Context Engine’, and it has created a tremendous amount of excitement in the industry.

Phase 2 Proposed Solution

Tribe AI harnessed the power of Anthropic’s Claude 3.5 Sonnet LLM to extract meaningful insights from customer log data. Customers can skip the process of instrumenting traces to identify root causes, as Claude 3.5 Sonnet automatically analyzes logs and directly pinpoints the root cause of incidents. This functionality, called the ‘Generative Context Engine’, eliminates the need for predefined trace identifiers and accelerates the troubleshooting process. Claude’s capabilities had improved significantly since the original POC, so the frequency of 'hallucinations' noted in phase 1 was greatly reduced in phase 2.


How it works:

  • Log Compression: Using proprietary approaches, logs are deduplicated and then sampled to retain representation across all the services in the data, while maximizing the number of error messages that fit in the context window alongside the prompt.
  • Log Summarization: Sonnet 3.5 summarizes the resulting thousands of logs, allowing analysts to extract key insights without manually sorting through endless data.
  • Service map view: Using Sonnet 3.5, an overview map is generated showing how services are connected, highlighting services that are exhibiting problems.
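The compression step above can be illustrated with a toy sketch: deduplicate, then sample down to a token budget while always retaining error lines. The budget, the `"ERROR"` marker, and the log format here are illustrative assumptions, not Sumo Logic's proprietary approach.

```python
# Hypothetical sketch of "log compression": drop exact duplicates, keep every
# error message, and fill the remaining budget with a sample of other lines so
# errors dominate what fits in the model's context window. Illustrative only.
import random

def compress_logs(lines, budget=5, seed=0):
    deduped = list(dict.fromkeys(lines))           # deduplicate, preserve order
    errors = [l for l in deduped if "ERROR" in l]  # always retain error messages
    others = [l for l in deduped if "ERROR" not in l]
    room = max(budget - len(errors), 0)            # budget left after errors
    sampled = random.Random(seed).sample(others, min(room, len(others)))
    return errors + sampled

logs = ["INFO start"] * 3 + ["ERROR db timeout", "INFO ok", "ERROR db timeout"]
compressed = compress_logs(logs)
# compressed contains "ERROR db timeout" once, plus the two distinct INFO lines
```

A real implementation would budget by token count rather than line count and sample per service, but the priority ordering (errors first, representative sample second) is the core idea described above.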

Tech Stack Details

Full-stack cloud-based application working alongside the existing Sumo Logic’s environment: 

  • Cloud: Amazon Web Services (AWS) - Bedrock, S3, Cloudwatch, EKS
  • Large language models: Anthropic Claude 3.5 Sonnet
  • Language: Python

Model Training with Sample Data

Because Sumo Logic data wasn’t available at the start of the engagement, Tribe AI formulated a plan for demoing and testing the model using sample data. Data was generated with the open-source OpenTelemetry demo Astronomy app, running on minikube on Tribe team members’ laptops. The app was set up to send log and trace data from there into Sumo Logic, and the Tribe AI team then fed the data through Claude 3.5 Sonnet for analysis and interpretation. Eventually the app was tested with multiple transactions of log data, then deliberately over-scaled with a load generator to 1,000 simulated users until it crashed, so the teams could test the model’s accuracy in interpreting the incident and its root cause. Later, another open-source tool called Chaos Mesh was leveraged to generate test data representing specific scenarios, such as network outages and cascading failures.

Sumo Logic’s Experience Working with Tribe

“Partnering with Tribe AI – and leveraging their complementary GenAI skill set – was critical to the success of this project,” said Tej Redkar, Chief Product Officer.

Redkar’s team at Sumo Logic has experience building and using traditional AI algorithms, but without the Tribe AI team’s first-hand experience leveraging LLMs, delivering the first-of-its-kind ‘Generative Context Engine’ would not have gone so smoothly. Redkar credits Tribe AI for its ability to rapidly assemble a custom team of experts who understood the use case and contributed to the positive outcomes of the engagement.

“We had an ambitious scope and needed a really novel GenAI application to achieve our goal, which required a very high level of expertise in LLMs and prompt engineering,” said Tej Redkar, Chief Product Officer.

Tribe Team Members

Kuba & Alex: ML Engineers
Kash: AI Engineer
Sam: Technical & Product Lead + ML/AI Engineer
Orges: Engagement Manager

Impact

For Sumo Logic, the biggest impact achieved during the engagement was in reducing mean time-to-resolution (MTTR), a critical metric in their space. What used to take hours or days can now be achieved in less than a minute. Additionally, the cost per root cause is predicted to be around $0.50, compared to the hourly or daily rate of a full-time engineer.

This has also empowered their non-expert users to easily troubleshoot issues without relying on specialized staff or complicated instrumentation, broadening adoption and democratizing the use of log data across teams in the organization. 

A demo of the ‘Generative Context Engine’ was showcased at AWS re:Invent in December 2024 and was overwhelmingly well-received. Coupled with the announcement of the general availability of Sumo Logic’s copilot (called ‘Mo’), the demo signaled to the market that Sumo Logic is a category-defining leader in the observability space.

The Future

Tribe AI and Sumo Logic know that they are just scratching the surface of what’s possible with GenAI, leaning on Anthropic and AWS as partners. Looking ahead, the teams believe the solution can do more than analyze unstructured logs and suggest fixes. The hope is that one day it could predict outages before they happen or automatically resolve issues on its own. GenAI is on track to completely transform how security and performance monitoring is done, not just at Sumo Logic, but across the industry.

The teams are currently working to scale up the Phase 2 solution to a one-to-one match of functionality through a beta test that incorporates real customer data. The post-beta solution will be productized, with a General Availability launch planned for the end of Q1 2025.
