What Is Data Observability and Why Should You Care?
As modern data stacks expand and grow in capabilities, one of the areas getting more attention is data observability. You might be wondering, what is data observability? It sounds fairly generic on the surface, but there is a well-formed industry view of what it means. In an enterprise today, using the modern data stack, there is exponential growth in the data and systems contributing to the data platform. With all of those contributing systems, it is inevitable that at some point the data being sent will not be what is expected, and that will have an adverse effect on reporting for the company. If and when you catch that data anomaly determines how painful the impact is. It could be a serious system outage that manifests immediately, or it might be a more subtle change that goes undetected for a period of time until an end user points it out to your analytics team.
The ability to detect anomalies early in the cycle, become aware of them, and act before they become a broader issue is the key to protecting the business's reporting. Think of data observability as the warning lights on the dashboard of your car. What if you didn't have a check engine light, or didn't know that your washer fluid or tire pressure was getting low? You would be driving around unaware of the potential hazards until something suddenly went wrong, and now you are on the side of the road trying to fix it rather than having dealt with it when you first became aware of the issue. Now you have cars whizzing by, potentially a bunch of grumpy passengers, and you are trying to figure out in the moment what is wrong. You could have stayed home or taken the car to the mechanic ahead of time and saved yourself the headache.
When we talk about observability and detecting issues, we can classify those issues into a few categories, listed below; a sketch of what simple checks in the first few categories might look like follows the list.
Volume: What is the size of your data and the projected growth/decline? Have you seen any major shifts in the number of rows in your table?
Distribution: What is the distribution of the rows within a given table? If your current customers table historically splits 40 percent North America and 60 percent South America, and that drops to 20 percent North America and 80 percent South America, you have a significant distribution shift.
Freshness: How often do you expect your data to arrive for a given table? If a daily feed doesn’t deliver for the day or delivers the same data from yesterday, you have a stale data problem.
Structure: What is the actual makeup of your data warehouse: tables, columns, keys, and so on? If you drop a column, rename another, or delete a table, it could have knock-on failure effects downstream. Some of these are low impact, and others are significant enough to knock out a highly used report.
Lineage: When you experience changes, where do they have an impact and manifest in your data structure? How do you quickly identify the impact of a change and what upstream and downstream actions need to take place as a result?
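To make these categories concrete, here is a minimal sketch of what volume, distribution, and freshness checks might look like in Python with pandas. The table, column names, and thresholds are assumptions for illustration rather than a reference to any particular tool; structure and lineage checks would typically come from your warehouse's information schema and your transformation tool's dependency graph instead.

```python
from datetime import datetime, timedelta, timezone

import pandas as pd


def check_volume(rows_today: int, rows_yesterday: int, max_shift: float = 0.10) -> bool:
    """Flag a volume anomaly when the day-over-day row count shifts more than max_shift."""
    if rows_yesterday == 0:
        return True  # an empty baseline is itself suspicious
    return abs(rows_today - rows_yesterday) / rows_yesterday > max_shift


def check_distribution(values: pd.Series, baseline: dict, max_drift: float = 0.15) -> bool:
    """Flag a distribution anomaly when any category's share drifts more than
    max_drift from its historical baseline (e.g. region share of customers)."""
    shares = values.value_counts(normalize=True)
    return any(abs(shares.get(category, 0.0) - expected) > max_drift
               for category, expected in baseline.items())


def check_freshness(last_loaded_at: datetime, max_age: timedelta = timedelta(days=1)) -> bool:
    """Flag stale data when the most recent load is older than the expected cadence."""
    return datetime.now(timezone.utc) - last_loaded_at > max_age


# The hypothetical customers table from the distribution example above.
customers = pd.DataFrame({"region": ["North America"] * 20 + ["South America"] * 80})

print(check_volume(rows_today=125, rows_yesterday=100))                    # True: a 25 percent shift
print(check_distribution(customers["region"],
                         {"North America": 0.40, "South America": 0.60}))  # True: 40 percent dropped to 20
print(check_freshness(datetime.now(timezone.utc) - timedelta(hours=30)))   # True: the daily feed is late
```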
Layered on top of the changes we can observe is the question of severity and significance. Not every problem has the same scope or impact; some might have zero impact or be expected changes. A combination of the observability categories above, usage patterns, and an analysis of the problem can help you make that determination.
Say, for example, that in the hypothetical current customers example above, where our North American and South American customer distribution went from 40/60 to 20/80, we experienced the following observability patterns: a 10 percent shift in volume, the significant distribution shift, no freshness anomalies, no structure changes, and no lineage impacts. You would most likely want to know about it at the time it happened and investigate the data change. In your observation of the data and follow-up with other business units, you may discover the change is intended because there was a focus on new business partners in the South American region while low performers in the North American region were being sunset. Great, no need to panic, and you have an explanation for your end users. It would also be prudent to send an alert on the data source letting folks know the impact.
Now think of this problem at scale and all of the ways data can change and manifest in each data source. How do you keep on top of true issues and avoid digging into each false positive? One way is to look at the usage patterns of your assets: how often is an object used, and what is involved in that object's lineage? If an asset is highly utilized company-wide, you may want to put that object and its lineage under stricter observability detection. This may represent 10 percent of your total objects and allow you to focus more acutely on what matters to the business.
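As a rough illustration of that prioritization, the sketch below ranks assets by query count and downstream reach and puts the top slice under stricter checks. The asset names, the scoring weights, and the 10 percent cutoff are hypothetical assumptions, not output from any real catalog.

```python
# Rank assets by a simple usage score and flag the top slice for stricter checks.
assets = [
    {"name": "fct_orders",      "queries_30d": 4200, "downstream_models": 35},
    {"name": "dim_customers",   "queries_30d": 3100, "downstream_models": 28},
    {"name": "stg_clickstream", "queries_30d": 150,  "downstream_models": 2},
    {"name": "tmp_backfill",    "queries_30d": 3,    "downstream_models": 0},
]

# Weight downstream reach heavily: a change here fans out through the lineage.
for asset in assets:
    asset["score"] = asset["queries_30d"] + 100 * asset["downstream_models"]

ranked = sorted(assets, key=lambda a: a["score"], reverse=True)
top_tier = ranked[: max(1, len(ranked) // 10)]  # roughly the top 10 percent

for asset in top_tier:
    print(f"Apply strict observability checks to {asset['name']} (score={asset['score']})")
```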
The bigger question is how you go about actually implementing observability. There is the long-standing concept of unit testing, which can help detect issues by writing explicit tests against your data or applying constraints such as keys or not-null conditions. This has become more robust and easier to implement with modern data stack tools such as dbt, which let you embed these tests in your models at run time. You can then get alerted immediately when an issue occurs and even determine things like whether the model should stop or keep running. The problem with this paradigm is that you must anticipate and test for every problem up front, which may not be feasible from an efficiency and time perspective.
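In dbt these tests are declared in a model's YAML schema file (not_null, unique, accepted_values, and so on), but the idea is easy to see in plain Python. The sketch below runs equivalent checks against a hypothetical orders table; the table, its columns, and the accepted status values are assumptions for illustration, not dbt's actual API.

```python
import pandas as pd

# A hypothetical orders model as it might look after a run.
orders = pd.DataFrame({
    "order_id": [1, 2, 3],
    "customer_id": [10, 11, 11],
    "status": ["shipped", "pending", "shipped"],
})


def run_tests(df: pd.DataFrame) -> list:
    """Return the names of failed tests; an empty list means the model passed."""
    failures = []
    if df["order_id"].isnull().any():
        failures.append("not_null: order_id")
    if df["order_id"].duplicated().any():
        failures.append("unique: order_id")
    if not df["status"].isin({"pending", "shipped", "returned"}).all():
        failures.append("accepted_values: status")
    return failures


failures = run_tests(orders)
if failures:
    # Like a severity setting, you decide here whether the pipeline
    # should stop on failure or just warn and keep running.
    raise RuntimeError(f"Data tests failed: {failures}")
```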
So, what to do now? In modern data stack deployments, several solutions have popped up around this problem. The value proposition is that SaaS providers sit on top of your cloud data platform and do the observability and monitoring for you through deployed machine learning models. This means you spend less active time on pre-emptive detection and more on sorting through anomalies to determine their impact on your business and whether they have been reviewed. Because of the growing amount and variability of data, more companies are finding this an attractive option. There are commercial solutions you can buy as well as open-source options if you want to deploy something yourself.
The analytically minded may also wonder: how do I know what the ROI on observability is? A great question that should be asked of any tool in the data stack. I tend to think of observability ROI in terms of two metrics: adverse event (downtime) avoidance and engineering productivity. Adverse events can have a material impact on your business depending on the significance and duration of the outage. They could be as bad as production-level outages, but also more subtle capability losses, such as instrumentation or analytics being unavailable to the business. These can be quantified by severity and duration, and potentially by cost as well. When these events occur, they are also a drag on your data and analytics resources, who are pulled off feature work and forced to deal with unexpected changes and firefighting in the moment. This has a knock-on effect on your ability to keep delivering new features and creates a lot of context switching in the process. It could also be quantified as a dollar figure by multiplying engineering hours saved by FTE cost.
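As a back-of-the-envelope illustration, the sketch below turns those two metrics into a single ROI figure. Every number in it is a hypothetical assumption; substitute your own incident history, fully loaded FTE cost, and tool price.

```python
# All figures below are assumed for illustration only.
incidents_avoided_per_year = 6          # adverse events caught early instead of in production
avg_downtime_hours_per_incident = 8
cost_per_downtime_hour = 2_000          # assumed value of reporting being unavailable

engineering_hours_saved = 300           # firefighting hours returned to feature work
fte_hourly_cost = 90                    # assumed fully loaded engineering cost

annual_tool_cost = 60_000               # assumed license plus infrastructure

downtime_savings = incidents_avoided_per_year * avg_downtime_hours_per_incident * cost_per_downtime_hour
productivity_savings = engineering_hours_saved * fte_hourly_cost
roi = (downtime_savings + productivity_savings - annual_tool_cost) / annual_tool_cost

print(f"Downtime avoidance:       ${downtime_savings:,}")      # $96,000
print(f"Engineering productivity: ${productivity_savings:,}")  # $27,000
print(f"ROI on the tool:          {roi:.0%}")                  # 105%
```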
As data size and variety grow, it is a constant battle to make sure the data is fresh and accurate for the business to consume. Staying ahead of data observability is a challenge all companies will face at different sizes and stages. The innovation and thought going into this problem will help keep analytics and data professionals focused on bringing new capabilities and features to the business in the long term.



