Is a $1 million Datadog bill worth it?
In a recent reddit thread, I got into a conversation about justifying the cost of observability. It got to a really basic question about running a tech company: how do you know that any cost is justified? While a small number of expenses have clear and direct business values, a bunch of other costs, I would even say most costs, just aren’t that clear cut.
I’d like to write a bit about how Observability costs are significant, how these costs tend to be justified, and how precise amount a company spends on anything tends to be more subjective than you’d think. This article is not about how to reduce or control these costs, but rather how the costs are justified.
Observability is expensive
SaaS tools to do observability have significant costs for any large-scale user, and no observability tool has a cost of 0. While tools like SigNoz have self-hosted solutions that will cut your Datadog bill down to zero dollars, you still have to consider both the infrastructure costs of hosting your own data backend and the support you’ll need to do on such a tool. Even with the (amazing) open-source options, the cost is never zero when you count implementation effort and storage costs.
While everyone knows that observability costs something I think most people don’t really understand just how expensive it can be. While many organizations may target spending 10% of their Ops budget for full-stack observability, it’s common for observability to be the second-highest cost after infrastructure. Why so expensive? And why so universally? For an object lesson, consider my demo flask app before and after I add OpenTelemetry instrumentation.
# Acquire a tracer
app = Flask(__name__)
@app.route("/rolldice")
def roll_dice():
do_work("busywork")
return str(do_roll())
def do_roll():
res = randint(1, 6)
# Do the thing
rollspan.set_attribute("roll.value", res)
return res
def do_work(work_type):
print("doing some work...")
Now let’s add OpenTelemetry manual instrumentation:
# Acquire a tracer
tracer = trace.get_tracer("diceroller.tracer")
app = Flask(__name__)
@app.route("/rolldice")
def roll_dice():
do_work("busywork")
return str(do_roll())
def do_roll():
# This creates a new span that's the child of the current one
with tracer.start_as_current_span("do_roll") as rollspan:
res = randint(1, 6)
current_span = trace.get_current_span()
current_span.add_event("Gonna try it!")
# Do the thing
current_span.add_event("Did it!")
rollspan.set_attribute("roll.value", res)
return res
metric_reader = PeriodicExportingMetricReader(ConsoleMetricExporter())
provider = MeterProvider(metric_readers=[metric_reader])
# Sets the global default meter provider
metrics.set_meter_provider(provider)
# Creates a meter from the global meter provider
meter = metrics.get_meteFr("otel.nica.meter.name")
work_counter = meter.create_counter(
"work.counter", unit="1", description="Counts the amount of work done"
)
def do_work(work_type):
# count the work being doing
work_counter.add(1, {"work.type": work_type})
print("doing some work...")
Obviously a contrived example, but it’s worth noting that there were more lines involved in monitoring the app than the app doing its work. With detailed enough measurement and tracing, along with transmitting and storing that measurement, observability can easily rival the application’s footprint.
Cost reductions are possible
Again, this is not the focus of this piece (geez don’t you read the introductory paragraphs??), but it’s important to note that it’s completely possible to significantly reduce the costs of observability. Tools like the OpenTelemetry Collector have tons of processors available to filter, parse, and compress data so that you’re sending way less data than your app’s actual workload.
And open-source options are absolutely going to give more bang for your buck than closed-source SaaS tools. While companies like Datadog and New Relic may claim support for open standards like OpenTelemetry, the reality is a little different.
However costs exist, they’re never going to be smaller than, say, what your office spends on sugar packets. How do we justify these expenses?
Justifying the cost
Back to the Reddit thread where we started, the question is really how do we decide that a certain value is ‘worth it’ as a cost of observability? Let’s talk about some costs that are a lot easier to justify and specify, than observability:
Costs with a clear, quantifiable benefit
- Infrastructure & Operations - When we talk about infrastructure costs, it’s possible to calculate the marginal cost of a single user and show the value of a single additional user to our business, and therefore justify the costs. The same is true of the Operations team and SRE teams: they’re needed to keep the system up and running, so easy to put a dollar value on the benefit.
- Sales & Marketing - Here again, the calculation is fairly simple: calculate the benefit of new business, and you can show whether what you’re paying to acquire it makes sense
Okay, if these costs can be justified to the last clipped Scottish groat, can’t we do that for all business expenses? Not at all:
Costs that are less clear-cut
Here are some costs where the benefit is harder to quantify
- Literally, everyone else who works at the office - Not to start with a bold one, but the benefit of literally everyone else isn’t a number you can type into a spreadsheet. Some examples of people at your company who have a nebulous, hard-to-exactly define benefit:
- All executives - At publicly traded companies, we often give credit for all stock moves to the CEO. But everyone knows this can’t be an accurate number, and the direct benefit of all other executives is similarly hard to pin down. Again, this doesn’t mean they have zero benefit, just that it can’t be quantified as an exact figure.
- All Managers - If we set an arbitrary value for the contributions of a team of engineers, say (shudder) a dollar value on each line of code added, we can arrive at some vague estimate of the engineers’ contribution to value. Even this rough estimate isn’t possible with the team’s management. And as such, the benefit of management is impossible to get a perfect read on. We know that a great manager can get great things from their team. But the dollar value is quite difficult to calculate.
- The Office itself - More on this in later sections. But the value of all those cubicles does not fit on a spreadsheet.
So, we can see that a ton of the expenses a company runs up, especially at a tech company, can’t be calculated against the exact value they produce.
If this situation is hard to believe, consider that most airlines can’t specify the exact cost (and therefore the precise profit) of a single passenger on a single flight, and you’ll see that very often businesses must make decisions heuristically rather than with precise valuations.
The real argument for observability: the cost of not having it
When a developer uses a cool tracing tool to find the bit of code that’s slowing down performance, the dollar benefit of that act to the company is extremely difficult to show. You might try calculating how long the developer would have spent trying to fix the problem without the tool, but that’s always going to be a very rough estimate since a proper observability tool fundamentally alters the way dev teams operate so a simple time estimate is unlikely to be accurate.
No, the real way to show what observability has to offer is the cost of not having it. The cost of downtime, system slowdowns, and dissatisfied users is one that every exec can estimate.
No one wants to be responsible for the site going down, and the cost of downtime can be millions of dollars an hour for a large enough system.
This, then, is the core of observability’s cost justification: it may be hard to justify the cost of uptime, but the cost of extended downtime is enough to tank a successful company.
The Birth of Million Dollar Observability Bills
In the deep mists of observability history, I remember the healthcare.gov launch, and a now-famous line in a meeting with Mikey Dickerson and the IT firms who had bungled the website launch
“If I hear one more person tell me I can’t use New Relic,” he said. “I’ll punch them in the face.”
And that, truly, is the story of how our Datadog, New Relic, and Splunk bills got so high. After a spate of outages, failures, bugs, and other problems that users notice; some executive says ‘add observability, I don’t care what it costs.’ And the million-dollar observability bill was born.
Observability Shouldn’t Get a Blank Check
The initial rush to implement observability tools like Datadog, New Relic, and Splunk often comes with little regard for cost optimization. This urgency is understandable; system failures and outages can be catastrophic for business. However, as the dust settles, organizations begin to scrutinize those hefty monthly bills. This scrutiny often leads to a more nuanced approach to observability. Teams start to explore open-source alternatives, like Prometheus and Grafana, or build custom solutions tailored to their specific needs. The aim shifts from simply "adding observability" to achieving meaningful insights in a cost-effective manner. This evolution reflects a maturing understanding of observability, where quality and cost-efficiency are balanced to meet the organization's unique requirements.
OpenTelemetry and SigNoz can help with out-of-control-costs
This piece helps explain why observability SaaS offerings have often received a blank check as long as they reduced the risk of downtime. We haven't discussed why, so often, observability bills continue to grow and often outpace the growth in infrastructure costs. To explain that, we have to admit that part of the story is lock-in. With a closed-source SaaS offering for observability, switching service providers means at least an arduous change of installed software agents. In the worst case, teams will have added thousands of custom metric calls to their application code which will all have to be changed to switch APM tools. Inevitably, this leaves customers 'stuck' and unable to do much as their observability bill grows.
OpenTelemetry can solve this problem. By implementing open standards for how observability data is gathered and transmitted, OpenTelemetry makes it very easy to switch service providers. If you're using the OpenTelemetry Collector (and you should be), all you have to do is reconfigure your collection endpoint in a single place.
Along with OpenTelemetry, you'll need a backend to report and chart data. The OpenTelemetry project is neutral about your data backend, but a tool like SigNoz uses the power of Clickhouse to store data efficiently, and it even has a self-hosted option.