2025-03-19

Introducing SIFT: The next stage of Observability

Over the past few years, observability has primarily focused on collecting more data, increasing cardinality, and giving users tools to slice and dice massive telemetry datasets. This was bound to happen: OpenTelemetry, Prometheus, and other open-source software have democratized telemetry collection, while the wide availability of modern serverless databases and cheap storage has increased the amount of telemetry organizations can store.

However, while more data means a greater chance of having the necessary data to troubleshoot, it also means (way) higher costs and (way) more telemetry to sift through. We very much like the “needle in the haystack” analogy: think of your telemetry as a haystack, and the reason for your outage is a needle somewhere in it. The haystacks out there are looming higher than ever and keep growing at a steady pace.

Today, we announce our answer to this unsolved challenge: SIFT. This revolutionary framework transforms observability by leveraging state-of-the-art AI-driven intelligence, statistical analysis, and user-driven insights. It combines hardened workflows with new capabilities made possible by the combination of AI and OpenTelemetry, and with uncompromising empathy and love for Dash0’s end users.

What is SIFT?

SIFT is an acronym of the following four steps:

Spam removal
Improve telemetry
Filtering and grouping
Triage

SIFT is a multi-layered approach that makes extracting insights from vast telemetry datasets simple and effective. Unlike traditional observability approaches, SIFT doesn’t just collect and store more telemetry: it actively removes unwanted telemetry, refines and enhances what is kept, and pinpoints the relevant signals in it. This allows engineers to focus on solving problems rather than sifting through overwhelming amounts of telemetry.

Let’s break SIFT down into its components.

1. Spam removal: Cutting through the noise

Most observability platforms charge by GB of telemetry data, which means their business models depend on storing more data, not less. Dash0’s approach is somewhat different: we charge by the number of logs, spans and metric data points you send, regardless of their size, because we want to enable you to send all metadata that you find relevant, irrespective of how much “larger” it makes your signals. Moreover, we wanted a pricing model aligned with the levers you can operate to match your costs to the value you perceive in Dash0. And we know firsthand that it is way easier to send fewer distinct pieces of telemetry, e.g., via sampling or dropping, and much, much harder to change telemetry to be “smaller”. So that’s what we built our pricing around.

Having a pricing model that is well aligned with the ways you can discard (and not pay for) unwanted telemetry on your end is one thing. Still, we felt it was not ergonomic enough, not… “Dash0-y” enough. So we introduced Spam filters, the most streamlined way we could imagine to let you drop telemetry without having to wrangle configurations and infrastructure changes. Much like an email spam filter, our Spam filters let you remove irrelevant telemetry with a single click. Behind the scenes, Dash0 automatically generates OpenTelemetry Transformation Language (OTTL) rules that execute inside Dash0’s ingestion pipeline, so unwanted data is never stored and you are never charged for it. (And the filtering itself is also free, by the way!)

The value of easy removal of telemetry cannot be overstated: The haystack is smaller, so finding the needle is easier. And since you pay for the hay, the bill will be smaller at the end of the month. It’s a win-win for you, and we firmly believe that what’s good for our end users is ultimately good for Dash0 too.

OTTL rules generated via Spam filters can also be exported and applied directly in your own OpenTelemetry Collector, reducing egress costs and ensuring that unnecessary telemetry never leaves your infrastructure. We have also incorporated support for filtering telemetry into our Dash0 Operator for Kubernetes.
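The core idea of a spam filter can be sketched in a few lines of Python. This is an illustration only: it is neither Dash0's implementation nor OTTL, and `DropRule`, `apply_spam_filters`, and the `/healthz` health-check rule are hypothetical names picked for the example. A rule is modelled as a predicate over a record's attributes, and any record matched by at least one rule is dropped before it would be stored.

```python
from dataclasses import dataclass
from typing import Callable

# A telemetry record reduced to its attributes; real OTLP records carry much more.
Record = dict[str, str]


@dataclass(frozen=True)
class DropRule:
    """Hypothetical stand-in for one spam filter: a label plus a match predicate."""

    name: str
    matches: Callable[[Record], bool]


def apply_spam_filters(records: list[Record], rules: list[DropRule]) -> list[Record]:
    """Return only the records that no rule matches."""
    return [r for r in records if not any(rule.matches(r) for rule in rules)]


# Illustrative rule (an assumption, not from the post): drop health-check requests.
rules = [DropRule("health checks", lambda r: r.get("http.route") == "/healthz")]

records = [
    {"service.name": "frontend", "http.route": "/healthz"},
    {"service.name": "frontend", "http.route": "/cart"},
]

kept = apply_spam_filters(records, rules)
print(kept)
```

Only the `/cart` record survives. Running the same kind of predicate as early as possible, for example in your own Collector, is what keeps the dropped telemetry from leaving your infrastructure at all.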

There’s of course more we want to do on this topic, and we didn’t pick the name “Spam filter” by chance: Over time, Dash0 AI will learn from user-defined rules what telemetry is worth storing and what is not, just like modern email filters do today. We will be able to automatically and proactively remove unnecessary telemetry without any user interaction, or offer you suggestions you can apply with one click.

Learn more about the Spam filters in our release post.

2. Improve telemetry for actionable insights

Observability is only as powerful as the context encoded in the data. While OpenTelemetry’s semantic conventions (SemConv) provide a significant step toward standardization, not all telemetry adheres to these conventions. Unstructured logs without severity, outdated and inconsistent attributes, inconsistent metadata on different signals from the same components, and lack of context make troubleshooting a struggle. Many of us at Dash0 have been involved in attempts at standardizing telemetry across organizations large and small, and even when the efforts were successful, we felt there should be a better way. So we set out to build one.

Dash0’s Resource Centricity was already a game-changing feature in this area. And now it is joined by a growing suite of automated enhancements:

Log AI: Automatically structures logs, making them easier to analyze.
Semantic convention upgrades: Align metadata on logs, traces and metrics with the OpenTelemetry SemConv standards.
Pattern Recognition with Log and Trace Pattern AI: Identifies recurring patterns in logs and traces, simplifying complex debugging workflows—and making large traces readable in the first place.
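To make the semantic convention upgrades more concrete, here is a minimal Python sketch of the underlying idea: rewriting legacy attribute names to their current equivalents. The three renames shown come from the stabilized OpenTelemetry HTTP semantic conventions; the `RENAMES` table and `upgrade_attributes` function are hypothetical and say nothing about how Dash0 implements the feature.

```python
# A deliberately tiny, non-exhaustive rename table (legacy name -> current name).
RENAMES = {
    "http.method": "http.request.method",
    "http.status_code": "http.response.status_code",
    "http.url": "url.full",
}


def upgrade_attributes(attributes: dict) -> dict:
    """Return a copy of `attributes` with legacy keys renamed."""
    upgraded = {}
    for key, value in attributes.items():
        new_key = RENAMES.get(key, key)
        if new_key != key and new_key in attributes:
            # The current name is already present: keep it, drop the legacy duplicate.
            continue
        upgraded[new_key] = value
    return upgraded


legacy = {"http.method": "GET", "http.status_code": 200, "service.name": "frontend"}
print(upgrade_attributes(legacy))
```

Once every signal uses the same attribute names, a single filter or grouping works across logs, traces, and metrics, regardless of which SDK version emitted them.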

To return to our analogy, imagine a haystack where each piece of hay is precisely color-coded, combed, and sorted into consistent groups—this is what Dash0 does for telemetry data.

Read more about our automatic log classification using AI and the semantic convention upgrades, and stay tuned for Log Pattern AI and Trace AI in the coming weeks. Our AI strategy shows how we see Dash0 leverage the incredible advances in state-of-the-art large language models and other AI-related techniques. Suffice it to say that we have a lot of ideas about how we can further improve your telemetry without any toil on your end.

3. Filtering and grouping: Finding the signal in the noise

Filtering is one of the most fundamental workflows in observability, yet most tools treat it as an afterthought. Dash0 redefines filtering by making every tag, parameter, and UI element interactive—allowing users to filter with just a click or keystroke. Moreover, Dash0 does not only enable you to filter: Before you do, you are shown insights about what the outcome will be. It’s one of our core tenets: “if we give you a choice, we give you guidance”. Say goodbye to frustrating, confusing, and time-consuming trial and error.

We are very proud of Dash0’s Filtering, which has been joined by the new Grouping feature. Grouping lets you cluster telemetry and quickly isolate the relevant data. This reduces analysis time and shrinks the haystack into a focused dataset: one large stack of hay is split into small, manageable piles.

The new RED metrics above traces, providing a quick indicator of spans grouped by service name.
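As a rough illustration of what grouping spans by service name and deriving RED (rate, errors, duration) metrics involves, consider the Python sketch below. The `Span` type, the `red_by_service` function, and the sample numbers are assumptions made for the example; they are not Dash0's data model.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    service_name: str
    duration_ms: float
    is_error: bool


def red_by_service(spans: list[Span], window_seconds: float) -> dict[str, dict[str, float]]:
    """Group spans by service name and compute rate, error ratio, and mean duration."""
    grouped: dict[str, list[Span]] = defaultdict(list)
    for span in spans:
        grouped[span.service_name].append(span)

    summary = {}
    for service, group in grouped.items():
        summary[service] = {
            "rate_per_s": len(group) / window_seconds,
            "error_ratio": sum(s.is_error for s in group) / len(group),
            "avg_duration_ms": sum(s.duration_ms for s in group) / len(group),
        }
    return summary


# Made-up sample: four spans observed within a two-second window.
spans = [
    Span("productcatalog", 10.0, False),
    Span("productcatalog", 20.0, True),
    Span("productcatalog", 30.0, False),
    Span("frontend", 40.0, False),
]

red = red_by_service(spans, window_seconds=2.0)
print(red["productcatalog"])
```

A production system would typically report duration percentiles rather than a mean; the mean is used here only to keep the sketch short.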

4. Triage: Instant root cause analysis

So far in the SIFT process, we have made the haystack smaller with Spam filters. We cleaned up and improved the hay with Dash0 AI, semantic convention upgrades, and more. We gave you a means of filtering the hay and sorting your haystack into manageable chunks with Filtering and Grouping.

And now, let’s turn it up a notch, and let Dash0 do the analysis for you!

Enter Dash0’s Triage: A one-click, entirely automated analysis of your telemetry to find out the probable causes and correlations behind outliers and patterns in your telemetry.

To illustrate Triage, we’ll take as an example one of the scenarios of the OTel Demo Application: the one in which the GetProduct API of the productcatalog service has higher error rates for the product ID OLJCESPC7Z. Not every error is related to the product ID OLJCESPC7Z. And more often than not, requests about the product ID OLJCESPC7Z succeed.

With millions of logs and spans, and hundreds of metadata tags on each, how long would it take you to detect this?

Even with Filtering and Grouping, it would have taken either intuition or luck to group by the right attribute among many. Instead, we can just click on the Triage button on the spans of the Product Catalog service, which automatically highlights the faulty product ID OLJCESPC7Z and shows all associated attributes in a visual, easy-to-digest format.
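One simple way to approximate this kind of analysis, offered here as a sketch and not as a description of how Triage works internally, is to compare the error rate of every attribute value against the overall error rate and rank the values by the difference. The `rank_suspects` and `make_spans` helpers, the `app.product.id` attribute key, the `OTHER` product ID, and the span counts are all assumptions made for this example.

```python
from collections import Counter


def rank_suspects(spans, min_support=5):
    """Rank (attribute, value) pairs by how far their error rate exceeds the overall rate."""
    baseline = sum(s["error"] for s in spans) / len(spans)
    seen, failed = Counter(), Counter()
    for span in spans:
        for pair in span["attributes"].items():
            seen[pair] += 1
            if span["error"]:
                failed[pair] += 1
    scored = [
        (pair, failed[pair] / count - baseline)
        for pair, count in seen.items()
        if count >= min_support  # ignore values seen too rarely to be meaningful
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


def make_spans(product_id, count, errors):
    """Build synthetic spans; the first `errors` of them are marked as failed."""
    return [
        {
            "error": i < errors,
            "attributes": {"rpc.method": "GetProduct", "app.product.id": product_id},
        }
        for i in range(count)
    ]


# Synthetic data mirroring the scenario: the suspect product fails more often
# than the rest, yet still succeeds in most requests (4 failures out of 10).
spans = make_spans("OLJCESPC7Z", 10, 4) + make_spans("OTHER", 40, 2)

ranked = rank_suspects(spans)
top_pair, top_lift = ranked[0]
print(top_pair)
```

In this synthetic dataset the product ID OLJCESPC7Z ranks first, even though most of its requests succeed, because its error rate is well above the baseline. Attributes shared by every span, such as the method name, score zero and fall away on their own.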

In the needle and haystack analogy, it is as if somebody told you to wait in the shade of a tree, and brought you the needle on a silver platter shortly thereafter.

Triage provides a revolutionary, straightforward way to identify the root cause of the problem. We have also created various triage analysis methods to help you identify performance outliers and bottlenecks, compare attribute values, and understand their impact.

In the near future, we will integrate agentic AI into our Triage product, enabling our platform to proactively assist users in solving problems by iteratively and autonomously testing hypotheses. This approach will automatically identify root causes within traces, logs, and metrics, providing clear, actionable paths to resolution.

We are the first users of Dash0. We drink a lot of champagne. While we had big hopes for how useful Triage would turn out to be, we were surprised by how much value it gave us right away. We cannot wait for you to try it and let us know what you think!

Read more about our Triage release and how it redefines observability analytics.

SIFT: A Paradigm Shift in Observability

With SIFT, we introduce a fundamentally new approach to observability: removing spam, improving telemetry, filtering and grouping effortlessly, and pinpointing root causes instantly with Triage.

Dash0 combines AI-driven automation and user-friendly workflows to make observability faster, more actionable, and cost-efficient—without requiring complex queries, chatbots, or manual analysis.

SIFT is built on modern LLMs, ML algorithms, and cutting-edge analytics—empowering engineers to do their jobs faster, more effectively, and at a lower cost.

Try Dash0 and experience SIFT today.