One of the main ways of gaining insight into the inner workings of your microservices and applications is analyzing the telemetry they generate. There are different types of telemetry, called “signals”, and the most widely adopted are metrics, logs, and traces, although others, such as events, real user monitoring, and profiles, have been growing in adoption and relevance.
The concept of observability has been gaining significant traction in the tech industry since 2016, when Charity Majors, co-founder of Honeycomb, and Ben Sigelman, co-founder of LightStep, began to popularize the term. Having worked at internet giants like Facebook and Google, Majors and Sigelman had observed the effective monitoring and troubleshooting practices these companies employed for their cloud-native applications and distributed systems, and set out to make similar tools accessible to developers across the industry, democratizing the practice of observability.
Observability vs Application Performance Management
Observability is often compared with Application Performance Management (APM), but this comparison is an oversimplification: Observability is not merely a replacement or an improvement over APM; it represents a conceptual evolution driven by technological advancements and the unique requirements of cloud-native applications. In fact, Observability stems from the convergence of the various monitoring disciplines and tooling that used to be separate industry verticals: APM, Log Management, Infrastructure Monitoring, Profiling, Real-User Monitoring, and more.
This article dives into the main categories of tools that have emerged in the field of Observability, along with the technological advancements that have facilitated their development.
A brief history of APM
Around the year 2000, the emergence of internet applications brought to the forefront the concept of Application Performance Monitoring (APM), championed initially by early tools such as Wily Introscope. Wily Introscope introduced a groundbreaking technique known as bytecode instrumentation: by injecting dedicated logic into the monitored applications at runtime, it could trace user requests as they flowed through the system without requiring source-code modifications to the various system components. In the early days, the technique was primarily focused on monolithic Java applications (hence the usage of the Java-specific term “bytecode”) hosted on application servers and communicating with a single database.
At first, tracing was confined to monolithic applications, covering both method and database invocations. Much like CPU profiling, it traded information density for lower overhead. Wily also featured user-friendly dashboards for data visualization and metric tracking.
The next step for APM came with the advent of Service-Oriented Architectures (SOA), where it became common for monoliths to interact with one another. Vendors capitalized on this by introducing techniques that generated traces of the interactions among these monoliths, most notably Dynatrace, established in 2005, and AppDynamics, in 2008. Their respective products, PurePath® and Business Transaction, revolutionized the field. A key focus of that period was advancing the instrumentation technology, which rapidly improved in quality and in coverage of supported libraries, frameworks, and runtimes. The challenge was balancing the level of manual instrumentation required for comprehensive tracing with the associated runtime overhead.
In 2008, Lew Cirne, the founder of Wily, established another APM company called New Relic that, alongside Dynatrace and AppDynamics, would come to lead the APM market for the next decade. New Relic initially focused on Ruby on Rails applications, was developed specifically to monitor cloud environments, and was notably the first APM tool offered exclusively as SaaS, that is, without the possibility of running it on-premises. The SaaS nature of New Relic significantly reduced the toil needed by developers to adopt it. But it also limited New Relic mostly to startups and small and medium-sized businesses (SMBs), as larger enterprises were hesitant to send their monitoring and tracing data to a third party due to data governance concerns. Those concerns would largely dissipate across the industry over the next decade due to the rising popularity of cloud computing, although they persist in strongly regulated environments.
As of 2024, New Relic and Dynatrace are still multi-billion dollar companies in terms of market cap (New Relic was taken private in 2023 by PE firms Francisco Partners and TPG for $6.5B). AppDynamics was acquired by Cisco in 2017 for $3.7B and has since lost a considerable part of its user base to other tools. In 2023 Cisco also acquired Splunk for $28B to modernize and enhance its observability offerings. Cisco had also acquired the serverless monitoring vendor Epsagon in 2021 and has since retired it.
Three Pillars of Observability
The "three pillars" idiom is commonplace when discussing what an observability can or cannot do for you. These pillars are most commonly-adopted signals: metrics, logs, and traces. "Three pillars" is a legacy of the convergence of APM tools and other specialized monitoring tools to encompass more than tracing or metrics or logs. Nowadays, the “three pillars” term is frankly problematic to the evolution of the discourse, as observability has grown beyond traces, metrics and logs, currently including profiling and real-user monitoring, and probably more in the future.
Microservices and Distributed Tracing
APM systems built for monolithic applications were not prepared for the rise of microservices, containers, and orchestration systems like Kubernetes. The volume of high-cardinality telemetry and the increasingly distributed nature of systems, with the corresponding growth in the complexity and size of traces, demanded a new approach. The Dapper paper, published by Google in 2010, provided the concepts for a distributed tracing system that are still today the basis for tracing in modern observability solutions like OpenTelemetry.
The fundamental unit of data in distributed tracing is a span. A span represents an action that a component is performing, like serving an HTTP request or sending a query to a database. Unlike logs or events, which occur instantaneously, spans have a start time and a duration. Spans are organized into a hierarchy, called a trace, with a reverse link from the child span to the parent. That is, when a span is created while another span is active, the former is a child of the latter. For example, suppose you are querying a database while serving an HTTP request. In that case, the span representing the execution of the database query will likely be a child of the one representing the serving of the HTTP request.
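For illustration, here is a minimal sketch of how such a parent-child relationship could be expressed with the OpenTelemetry Python SDK; the span names and the helper function are hypothetical:

```python
from opentelemetry import trace

tracer = trace.get_tracer("shop.checkout")  # hypothetical instrumentation scope

def handle_request():
    # Parent span: serving the HTTP request
    with tracer.start_as_current_span("GET /cart"):
        fetch_cart_items()

def fetch_cart_items():
    # Child span: it is created while the request span is still active,
    # so it automatically becomes that span's child within the same trace
    with tracer.start_as_current_span("SELECT cart_items"):
        pass  # execute the database query here
```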
Parent-child relations between spans also occur between components: the span representing an HTTP interaction from the point of view of the client will be the parent of the span representing the same interaction from the point of view of the server. When creating the span about serving the request, the server knows which span is the parent because the instrumentation in the client has added trace context to the outgoing request, i.e., it has performed trace context propagation. Think of the trace context as metadata about:
- Which trace is being recorded
- Which span is currently active (in our example, the client HTTP span)
- Other information like sampling (i.e., whether tracing data is actually being collected for this trace, which is useful to cut down on telemetry in systems with high workloads)
This model of traces, spans and their parent-child relations, and trace context propagation, is commonplace in the various distributed tracing models and is well-suited for microservice-based systems, in no small part because the protocols used in cloud-native applications (HTTP, gRPC, messaging, etc.), generally support adding metadata to outgoing requests, which is used for trace context propagation.
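As a hedged sketch of what propagation might look like in practice with the OpenTelemetry Python API, the client injects the current trace context into the outgoing HTTP headers (typically as the W3C traceparent header); the URL and the use of the requests library here are illustrative:

```python
import requests

from opentelemetry import trace
from opentelemetry.propagate import inject

tracer = trace.get_tracer("shop.checkout")

with tracer.start_as_current_span("call-payment-service"):
    headers = {}
    inject(headers)  # adds trace context (e.g. the W3C traceparent header)
    # The server extracts this context and creates its own span as a child
    requests.post("https://payments.example.com/charge", headers=headers)
```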
In situations where a single “parent” span does not suffice, like consuming batches of messages from a queue, each with its own trace context, there are more advanced concepts like OpenTelemetry span links and the “follows-from” relation in OpenTracing.
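For example, here is a rough sketch of using span links when processing a batch with the OpenTelemetry Python SDK; the shape of the messages and the attribute holding the producer's span context are assumptions:

```python
from opentelemetry import trace
from opentelemetry.trace import Link

tracer = trace.get_tracer("queue.consumer")

def process_batch(messages):
    # Each message carries the span context of the producer that enqueued it;
    # instead of picking a single parent, link the batch span to all of them.
    links = [Link(msg.producer_span_context) for msg in messages]
    with tracer.start_as_current_span("process-batch", links=links):
        for msg in messages:
            pass  # process each message here
```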
Log Management
Logs are effectively the “ground floor of observability”. Logs are textual records used by operating systems, infrastructure components, and applications to report events externally with varying severity levels (e.g., ERROR, INFO, DEBUG). DevOps and software development teams rely on log data to gain visibility into components that cannot be traced, or where logs provide additional insight into runtime behavior and errors.
Traditionally, logs were written into files on the server where the components generating those logs were running. Accessing the logs of a component required connecting to that component’s server, locating the log file, and opening it in an editor or tailing it to the console. Log Management solutions address these challenges with agents like Fluent Bit that collect logs across the systems and send them to a centralized log database, making all the logs across the various components of a distributed system readily available for filtering, grouping, searching, and more. Splunk essentially invented the Log Management category in 2003, and since 2010 Elastic has made it available as open-source technology to thousands of developers with the famous ELK stack (Elasticsearch, Logstash, Kibana), although a controversial licensing change in 2021 led to the forking of Elasticsearch into the OpenSearch project.
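As a minimal illustration, assuming a Python service, emitting logs as structured JSON with explicit severity levels makes them much easier for such agents and log databases to parse, index, and filter; the field names below are arbitrary:

```python
import json
import logging
import sys

class JsonFormatter(logging.Formatter):
    def format(self, record):
        # One JSON object per line: straightforward for log shippers to parse
        return json.dumps({
            "timestamp": self.formatTime(record),
            "severity": record.levelname,   # e.g. ERROR, INFO, DEBUG
            "logger": record.name,
            "message": record.getMessage(),
        })

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logging.basicConfig(level=logging.INFO, handlers=[handler])

logging.getLogger("checkout").error("payment declined for order %s", "A-1234")
```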
Infrastructure Monitoring
In the early 2000s, infrastructure monitoring was dominated by traditional monitoring vendors (IBM Tivoli, HP OpenView) and open-source tools like Nagios (and later Icinga, which forked Nagios). The primary function of these tools, which nowadays have far smaller adoption than they did back then, is to collect time-stamped measurements called metrics. Metrics typically include host data such as CPU, memory, or network information, as well as data from infrastructure components such as HTTP servers, databases, or firewalls. These tools typically offer plugins or integrations to support a wide range of components.
The collected metrics are stored in a time-series database and serve two main use cases:
- Visualization: The data can be visualized using dashboarding systems like Grafana, enabling users to monitor and analyze system performance over time.
- Alerting: Alerts can be set up to notify users when a metric reaches a predefined “static” threshold, such as CPU usage exceeding 100% for 30 seconds. Beyond static thresholds, some tools also offer baselines calculated with different techniques.
In 2010, Datadog was among the first vendors to establish infrastructure monitoring specifically tailored for the cloud. This involved the collection of metrics from various Amazon Web Services’ services through integrations, enhancing visibility into the cloud environment, and offering a user-friendly dashboarding and alerting system. Developers benefited from this development, gaining valuable insights and control over their cloud infrastructure at a time when the native monitoring capabilities of the cloud vendors were still in their infancy. Nowadays, Prometheus is the de-facto standard for cloud-native applications to collect, visualize, query and alert on metrics. Prometheus is a graduated Cloud Native Computing Foundation (CNCF) project and has wide support from different vendors; moreover, the communities of Prometheus and OpenTelemetry (the second largest CNCF project by contributors count) are working closely together to make the technologies of the two projects easy to integrate.
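To give a flavor of how application metrics are typically exposed for Prometheus to scrape, here is a hedged sketch using the official Python client library; the metric name, label, and port are made up:

```python
import random
import time

from prometheus_client import Gauge, start_http_server

# Expose an HTTP endpoint (/metrics) that Prometheus scrapes periodically
start_http_server(8000)

queue_depth = Gauge("job_queue_depth", "Number of pending jobs", ["queue"])

while True:
    # In a real service this would reflect the actual queue state
    queue_depth.labels(queue="payments").set(random.randint(0, 100))
    time.sleep(5)
```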
If you want to learn more about OpenTelemetry, read the “What is OpenTelemetry” article.
Observability and the convergence of tools
Although it was intended to denote an evolution in the practice of monitoring systems, "observability" has come to represent the convergence of the three distinct categories of monitoring tools we covered earlier in this article: Application Performance Management, Log Management, and Infrastructure Monitoring.
The convergence of these categories of tools was very much driven by end users, who demanded the proverbial “single pane of glass” to collect the various signals from a variety of systems and correlate them into better insights than can be had by analyzing traces, metrics, or logs in isolation from one another.
The convergence of tool categories, in turn, triggered a flurry of acquisitions among vendors in the various categories, as they sought to acquire the missing pieces for their platforms. Notable examples are:
- Splunk, a log management vendor, acquired SignalFx (infrastructure monitoring) and Omnition (distributed tracing) in 2019.
- Datadog acquired Logmatic for logging in 2017 and later embarked on building its own APM solution.
As a result, between 2015 and 2020, nearly all vendors in the monitoring category had constructed platforms that encompassed APM, log management, and infrastructure monitoring capabilities.
In many cases, even observability platforms that cover all signals still treat each of them largely in isolation, storing its telemetry in loosely-coupled and independently-operated databases and providing the user with access to all of them through a unified user interface. For example, the open-source vendor Grafana has created three separate databases, with three distinct query languages, one for each signal, relying mainly on its dashboarding technology to bring the signals together. With respect to interconnections among signals at the “data” layer, OpenTelemetry is arguably the first open-source project that natively supports relating logs, metrics, and profiles to tracing data, using mechanisms such as trace context for logs, exemplars for metrics, and links for profiles. Moreover, the concept of resource in OpenTelemetry provides a consistent way across signals to describe which system emits which telemetry.
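To make correlation at the data layer concrete, here is a minimal sketch, assuming the OpenTelemetry Python API, of stamping a log line with the identifiers of the currently active span so that the backend can join logs and traces; the message and field names are illustrative:

```python
import json
import logging

from opentelemetry import trace

def log_with_trace_context(message):
    ctx = trace.get_current_span().get_span_context()
    # Emit the trace and span IDs alongside the log message so the backend
    # can link this log line to the corresponding trace and span
    logging.error(json.dumps({
        "message": message,
        "trace_id": format(ctx.trace_id, "032x"),
        "span_id": format(ctx.span_id, "016x"),
    }))
```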
Database evolution for Observability tools
In parallel with the trend of convergence of monitoring tools, there have been significant changes in database technology and cloud adoption. APM tools like AppDynamics initially used relational databases like MySQL to store traces and were primarily deployed on-premises. However, the limitations of relational databases and on-premises deployments restricted the amount of data that could be stored and the number of parameters available for querying, primarily due to performance and cost considerations.
Around 2015, newly-established vendors like LightStep and Instana started employing modern columnar database technologies like Google BigQuery or ClickHouse. Others, like Honeycomb, developed in-house datastores specifically designed to handle high-cardinality and high-dimensional data in near real-time. Other interesting examples are companies like Humio, which introduced index-free databases tailored for log data, and Axiom, which developed a serverless datastore for logs that uses object storage for persistence.
The advancement of database technology, particularly with the advent of cost-effective storage solutions like AWS S3, has empowered observability vendors to gather significantly more data with increased cardinality and dimensionality, and to run much faster queries on this vast amount of data than was previously possible.
Another positive development due to the convergence of monitoring tools is that newer vendors tend to approach signals, especially logs, metrics, and traces, in a much more unified fashion – some say: “everything is a wide event”. This integration enhances the efficiency and effectiveness of data analysis.
Observability and OpenTelemetry
OpenTelemetry is the emerging de-facto standard for collecting telemetry in a vendor-neutral manner. Additionally, with OpenTelemetry resources, it offers a solid foundation for delivering context about metrics, logs, and traces in terms of which systems emit them. As stated at LEAP 2024: "Without context, telemetry is just data." Without adequate context, troubleshooting application issues in a vast amount of telemetry is like searching for a needle in a haystack.
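As a hedged illustration of that context, an OpenTelemetry resource describes the entity emitting telemetry with key-value attributes; in the Python SDK this might look like the following, where the attribute values are placeholders:

```python
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider

# The same resource is attached to every span produced by this provider
# (and, with the corresponding providers, to metrics and logs), so all
# signals can be traced back to the service instance that emitted them.
resource = Resource.create({
    "service.name": "checkout",
    "service.version": "1.4.2",
    "deployment.environment": "production",
})

tracer_provider = TracerProvider(resource=resource)
```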
Check out our “What is OpenTelemetry” article in the Observability FAQ to learn more about OpenTelemetry and more.
So what is a modern OpenTelemetry Native Observability Platform?
The establishment of OpenTelemetry as the de-facto standard for collecting and processing telemetry for cloud-native applications has wide-reaching implications for the observability industry as a whole. The most notable of these is the growing momentum behind the concept of OpenTelemetry-native observability. In the remainder of this section, we cover the major trends.
Changes in processing, storing and querying data
The evolution of modern database technology has significantly streamlined the processes of collecting, storing, filtering, analyzing, and visualizing telemetry data. This advancement allows for the economical storage of vast data volumes and supports the efficient handling of complex, high-cardinality, high-dimensional datasets. Centralizing data in a single datastore eliminates the fragmentation traditionally caused by separate silos for different telemetry types, while a unified query language facilitates quick identification and correlation of signals. Additionally, leveraging analytics and machine learning aids in swiftly pinpointing the root causes of issues and detecting incidents within intricate cloud-native applications. This comprehensive approach to data management not only simplifies operations but also enhances the ability to glean actionable insights, ensuring more effective and timely decision-making.
All signals, correlated
A modern observability platform encompasses a suite of capabilities essential for ensuring optimal application performance and system health. These include Application Performance Management and Distributed Tracing, which offer insights into application behavior and trace transactions across distributed systems. Log Management organizes and analyzes log data, while Infrastructure Monitoring oversees the health of both hardware and software components. Alerting systems notify stakeholders of potential issues, and Dashboarding provides a real-time overview of various metrics. Additional valuable tools cover Real User Monitoring, Kubernetes Monitoring, Error Monitoring, Continuous Profiling, and Network Monitoring, each playing a critical role in a comprehensive monitoring strategy. Together, these capabilities equip organizations with the necessary resources to maintain, optimize, and secure their digital environments effectively.
Most vendors currently available in the market fall short of supporting all the requirements above. They treat signals as separate silos. Their offerings are limited in data volume, query flexibility, and openness, as they often convert OpenTelemetry data into their proprietary formats.
Integration in the cloud-native ecosystem
Openness and seamless integration are fundamental for observability tools within cloud-native ecosystems. True value from observability arises when these tools are deeply integrated, supporting necessary metadata and automation in sophisticated cloud-native environments. Essential integrations include OpenTelemetry, Kubernetes, Prometheus, Grafana, among other cloud-native technologies. This interconnectedness ensures that observability is not an isolated function but a cohesive part of the broader digital infrastructure, enhancing the monitoring, management, and optimization of cloud-native applications.
Challenges of Observability
As of 2024, the observability industry has strong momentum behind it, in large part because of the possibilities unlocked by OpenTelemetry. Nevertheless, there are notable challenges worth mentioning. These challenges now seem well understood by the observability community at large, and the relevant open-source projects are clearly trending toward addressing them.
Data Volume and Signal vs Noise
Observability tools collect more and more data, which can actually create less visibility, as it gets harder to decide which data are real signals and what is noise.
How to fix it: Smart sampling and data aggregation algorithms, as well as tools that aid the analysis of large datasets, can help deal with the data explosion.
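One common mitigation is head-based sampling, where only a fraction of traces is recorded. A minimal sketch with the OpenTelemetry Python SDK, assuming an arbitrary 10% sampling ratio:

```python
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Record roughly 10% of traces at the root; child spans follow their
# parent's sampling decision so that sampled traces remain complete.
sampler = ParentBased(root=TraceIdRatioBased(0.1))
provider = TracerProvider(sampler=sampler)
```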
High cost
Coinbase paying Datadog $65M made the news and showed that observability can impose a significant tax on your overall IT spend.
How to fix it: Observability tools must provide cost visibility and controls to their users so that they can manage spend intelligently.
Integration and Migration
Observability must be integrated into the overall development and SRE processes and tools, which can be a painful and manual effort.
How to fix it: Adopting open-source de-facto standards like PromQL can help reduce that pain.
Data Correlation
Correlating and comparing a vast number of signals can be almost impossible for a human. Finding issues costs your best developers and SREs a lot of time and requires expert knowledge of the underlying architecture and technology.
How to fix it: Observability tools must provide functionality to automate this process, because machines are actually really good at this kind of task.
False Positives
Being on call and getting up at 3 a.m. is painful, but it hurts even more when the page was triggered by your observability tool detecting a problem that actually wasn’t there.
How to fix it: Reducing false positives is a complicated technical task and requires careful configuration and tuning of the system by expert users.
Steep Learning Curve of OpenTelemetry
Instrumenting code with traces, logs, and metrics and providing the right semantic metadata is not easy and requires some training and knowledge to get right.
How to fix it: Observability tools can help by pinpointing issues in instrumentation and suggesting improvements. Automatic instrumentation based on injection technology in custom distributions can also reduce the learning curve and close the gap with existing APM solutions.
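For instance, here is a hedged sketch of manual instrumentation that attaches semantic-convention attributes to a span using the OpenTelemetry Python API; the route and status code are placeholders:

```python
from opentelemetry import trace

tracer = trace.get_tracer("shop.checkout")

with tracer.start_as_current_span("GET /cart") as span:
    # Attribute names follow the OpenTelemetry HTTP semantic conventions
    span.set_attribute("http.request.method", "GET")
    span.set_attribute("http.route", "/cart")
    span.set_attribute("http.response.status_code", 200)
```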
Benefits of Observability
Observability offers a wide range of benefits that enhance the reliability, performance, and efficiency of software systems, while also supporting business objectives. Some key benefits are:
- Improved System Reliability: By providing deep insights into system behavior, observability helps identify and fix issues before they impact users, leading to more stable and reliable systems.
- Faster Problem Resolution: With comprehensive data on system operations, teams can quickly pinpoint the root cause of issues, reducing the Mean Time To Repair (MTTR) significantly.
- Enhanced User Experience: Observability tools help monitor and analyze user interactions with applications, allowing teams to detect and rectify user experience problems, thus improving overall satisfaction.
- Cost Optimization: By providing insights into resource usage and performance bottlenecks, observability helps organizations optimize their infrastructure costs, avoiding overprovisioning and waste.
- Better Collaboration Across Teams: The shared visibility into system data and metrics, as well as a common understanding of the system architecture and service dependencies, fosters collaboration between development and SRE/DevOps teams. Service Level Objectives (SLOs) are a great way to derive higher-level information for system management from the metrics provided by observability systems.
- Safe Continuous Delivery: Observability supports continuous delivery practices by providing feedback on the impact of changes, helping teams iterate on and refine their systems quicker.
- Scalability and Flexibility: With insights into system performance and user behavior, organizations can more effectively scale their systems to meet demand, ensuring they remain responsive and resilient as they grow.
- Business Insights: The ability to track everything that happens in complex systems also makes it possible to derive new, higher-level information that can be used for strategic business decision-making, for example, which geographies’ user interactions generate the most value.
References
- Wikipedia: “Observability”. https://en.wikipedia.org/wiki/Observability
- https://research.ijais.org/volume4/number1/ijais12-450607.pdf
- https://dynatrace.wordpress.com/2008/08/14/visual-studio-team-system-for-unit-web-and-load-testing-with-dynatrace/
- B. H. Sigelman, L. A. Barroso, M. Burrows, P. Stephenson, M. Plakal, D. Beaver, S. Jaspan, C. Shanbhag: “Dapper, a Large-Scale Distributed Systems Tracing Infrastructure”, Google Technical Report, 2010.
- Michele Mancioppi: “OpenTelemetry Resources: What they are. Why you need them. Why they are awesome.” LEAP 2024.