Grouping Logs and Spans
When faced with a large amount of telemetry, a very natural question is “how does it all relate?” Tired of sifting through endless streams of logs and traces? If you grouped all your logs and spans by, for example, the service that emits them, what would you learn? Wonder no more! Within the Logging and Tracing Explorers, you can now group by attributes to quickly get an overview. Grouping unlocks yet another way to iteratively narrow down the data you are sifting through and get from data to answers.
- Effortless Aggregation: Now, you can easily group your logs and spans based on any attribute, allowing you to quickly identify patterns and anomalies.
- Contextual Insights: For each group, you'll get clear statistics to guide your workflow.
- Seamless Drill-Down: Need to investigate a specific group further? Simply click to drill down and explore the detailed logs or spans within that group.
Take a Tour: Using Span Grouping to Understand Errors
Let's examine a real-world scenario in which the new grouping capabilities are used to analyze a problem within the OpenTelemetry demo. Do you prefer this tour in video format? Check out this YouTube video!
Let's say I noticed several errors within the Dash0 Tracing area, which the Tracing Heatmap helpfully visualizes as a red cluster. I need to understand whether these are isolated occurrences or a systemic problem, and in which part of my application they are happening. In other words: should I drop everything I am doing to work on this, or is it not that important?
At the top, I can see that there were 775 errors across 271k spans. Usually, this wouldn't worry me, but I see specific patterns within the heatmap and span list that make me curious. So, I continue with a quick check.
To start, I click on ERRORS. This adds a filter so that only errors are shown across the whole view. The filtered heatmap confirms that we are experiencing ongoing errors, but it doesn't tell me how severe they are: does a service have 1 error in 100 spans or 1 error in 100,000? That ratio is crucial for deciding whether this topic needs my attention right now.
Alright, it's time for some statistics. One of the easiest ways to get an overview is to look at services. With the just-added grouping capabilities, this is trivial. To start, I remove the filter for errors and expand the time range. The larger time range helps me understand whether the problem (if any) just started or has persisted for some time. Next, I tell Dash0 to group spans by service.name.
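Conceptually, grouping spans by service.name boils down to a simple aggregation: count spans and errors per service, then derive an error percentage. The Python sketch below illustrates that idea on plain dictionaries; it is not Dash0's implementation, and the spans list is a hypothetical stand-in for real span data.

```python
from collections import defaultdict

# Hypothetical span records: an attribute we want to group by plus the span
# status ("OK", "ERROR", or "UNSET"), loosely following OpenTelemetry.
spans = [
    {"service.name": "frontend", "status": "ERROR"},
    {"service.name": "frontend", "status": "OK"},
    {"service.name": "adservice", "status": "ERROR"},
    {"service.name": "adservice", "status": "OK"},
    # ... many more spans
]

def group_by(spans, attribute):
    """Group spans by an attribute and compute per-group totals and error rates."""
    groups = defaultdict(lambda: {"spans": 0, "errors": 0})
    for span in spans:
        group = groups[span.get(attribute, "unknown")]
        group["spans"] += 1
        if span["status"] == "ERROR":
            group["errors"] += 1
    for stats in groups.values():
        stats["error_pct"] = 100.0 * stats["errors"] / stats["spans"]
    return dict(groups)

for service, stats in group_by(spans, "service.name").items():
    print(f'{service}: {stats["spans"]} spans, {stats["errors"]} errors '
          f'({stats["error_pct"]:.1f}%)')
```

This is also why raw error counts alone can be misleading: a busy service can rack up many errors and still have a low error percentage, which is exactly the pattern that shows up next.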
Now the view has switched to the grouped presentation, swapping out the table for a new one and replacing the heatmap with classic RED Charts (which you can read more about in the Changelog). Together, they instantly give me valuable insights that help me prioritize! Here is what I can see at a glance:
- The frontend service has the most errors by count, but it also has far more activity, resulting in a lower overall error percentage.
- In contrast, the ad service has far less activity, yet a much higher error percentage.
- The chart also shows that this situation has persisted for quite a while. What gives?
Seeing a 10% error percentage worries me; it's time to dig deeper. I hit the drill-down button on the adservice row to continue. Dash0 applies a filter, switches back to the ungrouped presentation, and shows me the detailed RED charts for that specific service.
This view also confirms the continuous errors. To determine what's causing them, I select some of the errors: all five of my samples have the attribute rpc.grpc.status_code with a value of 8, which Dash0 translates for me as RESOURCE_EXHAUSTED and color-codes to bring it to my attention. This contextualization of information allows me to stay in my workflow; otherwise, I would have to Google or ChatGPT the meaning of gRPC status code 8.
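For context, gRPC defines status code 8 as RESOURCE_EXHAUSTED. The Python sketch below shows the kind of server-side code that produces this status; the AdService class and its limit are invented for illustration (the demo's actual ad service is written in Java), while grpc.StatusCode.RESOURCE_EXHAUSTED is the real grpcio enum member. With OpenTelemetry's gRPC instrumentation in place, such a rejection would typically show up on the span as rpc.grpc.status_code = 8 together with an error status.

```python
import grpc

# grpc.StatusCode.RESOURCE_EXHAUSTED corresponds to numeric status code 8.
class AdService:
    """Illustrative servicer only; not the demo's actual (Java) ad service."""

    MAX_IN_FLIGHT = 100  # made-up capacity limit for the example

    def __init__(self):
        self.in_flight = 0

    def GetAds(self, request, context):
        # Reject the call with RESOURCE_EXHAUSTED when the made-up limit is hit;
        # context.abort raises an exception and terminates the RPC.
        if self.in_flight >= self.MAX_IN_FLIGHT:
            context.abort(grpc.StatusCode.RESOURCE_EXHAUSTED,
                          "too many concurrent ad requests")
        self.in_flight += 1
        try:
            ...  # build and return the AdResponse here
        finally:
            self.in_flight -= 1
```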
The sample size of 5 is small, so now I am curious whether what I saw is representative of a broader problem. Back to statistics we go! I quickly group by rpc.grpc.status_code to see all the possible status codes and how they correlate with error states. Maybe other gRPC errors are happening as well, and what I saw through manual inspection was just a set of isolated incidents?
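Grouping by rpc.grpc.status_code is the same kind of aggregation with a different key. As a rough sketch (again over hypothetical span dictionaries, not Dash0's query engine), cross-tabulating status code against span status makes it obvious whether the errors are confined to a single code:

```python
from collections import Counter

# Hypothetical ad-service spans with their gRPC status code and span status.
spans = [
    {"rpc.grpc.status_code": 0, "status": "OK"},
    {"rpc.grpc.status_code": 8, "status": "ERROR"},
    {"rpc.grpc.status_code": 0, "status": "OK"},
    # ... many more spans
]

# Count (status_code, span_status) pairs: if every ERROR falls under code 8,
# we are looking at one failure mode, not several unrelated ones.
breakdown = Counter(
    (span["rpc.grpc.status_code"], span["status"]) for span in spans
)
for (code, status), count in sorted(breakdown.items()):
    print(f"rpc.grpc.status_code={code} {status}: {count}")
```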
Good… and bad. What I saw through those individual clicks turns out to be representative of a larger issue: whenever a span is an error, it is always status 8. On the upside, it seems to be a single problem rather than several. That's something!
Time to go really deep. I inspect one individual trace in the Trace view to understand the underlying problem: where the errors originate and how they spread “upwards and outwards” through my traces. Opening the Trace view, I can see at which point in the call chain the error started. Through span events, I get even deeper insights (although, in this case, the root cause is simply a feature flag in the OpenTelemetry demo that makes this problem occur).
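Span events are what carry those deeper insights: point-in-time records attached to a span. As a rough illustration of how such an event gets there, here is a sketch using the OpenTelemetry Python API; the span name, event name, attribute, and failure are hypothetical and not the demo's actual instrumentation.

```python
from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

tracer = trace.get_tracer("adservice-example")

def get_ads(context_keys):
    with tracer.start_as_current_span("GetAds") as span:
        try:
            # Stand-in for the real work; pretend the backend ran out of resources.
            raise RuntimeError("resource exhausted")
        except RuntimeError as exc:
            # Attach a point-in-time event with extra context to the span.
            span.add_event(
                "ad request failed",
                attributes={"app.ads.context_keys.count": len(context_keys)},
            )
            # record_exception stores the exception details as a span event, too.
            span.record_exception(exc)
            span.set_status(Status(StatusCode.ERROR, str(exc)))
            raise
```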
This concludes our journey through some of the new capabilities of Dash0. Maybe it's a good time for you to start yours!
Start Exploring Today
These enhancements represent our ongoing commitment to making data analysis more intuitive, powerful, and accessible. We invite you to explore these new features and discover how they can transform your observability workflows.
Sign up today to experience the new Query Builder and share your feedback with us. We're excited to see how these tools will help you gain deeper insights into your systems.