Conversations about Software Engineering

Conversations about Software Engineering (CaSE) is a podcast for software engineers about technology, software engineering, software architecture, reliability engineering, and data engineering. The three of us regularly come together to discuss recent events or articles, share what we have learned, and reflect on our professional and personal experiences. Additionally, our guest episodes feature engaging conversations with interesting people from the world of software engineering.

Transcript

[00:12]

Sven: Welcome to a new episode of the CaSE Podcast, or rather, CaSE Reloaded. As you might have noticed, our last episode with Corey was about one and a half years ago, around the same time our founder, Stefan Tilkov, sadly passed away. Since then, we haven’t done any recordings. But today marks the first episode with new hosts. I’m Sven, still with CaSE, and joining me are two new hosts, Heinrich and Alex. Our idea for the podcast remains the same: conversations about software engineering, discussing technical topics freely, without limitations. We’ll use this episode to introduce ourselves, share our interests, and outline what you can expect from the podcast. We have some new formats and ideas, and we’ll also touch on technical topics that are on our minds.

[01:43]

Sven: First up, let’s introduce Alex Heusingfeld. Alex, welcome!

Alex: Thanks for having me and for the invitation to join the show. A little about myself: I’ve been deeply involved in software architecture for over 10 years, with a background in software engineering dating back to the 1990s. Currently, I’m responsible for technology strategy at Vorwerk.

[02:21]

Alex: For our international listeners, Vorwerk is well-known in Germany, but for those unfamiliar, Vorwerk is best known for producing vacuum cleaners for horses—just kidding, that was in the 1920s. Today, Vorwerk is best known for the Thermomix smart kitchen appliance and the recipe portal behind it, the Cookidoo platform, which has about 18 million registered users and over 5 million paying subscribers. Vorwerk also produces vacuum cleaners and vacuum robots, which require software and firmware development.

[03:31]

Sven: Now, Heinrich, who are you?

Heinrich: Hello, Sven! I’m Heinrich, thrilled to be joining you two on this podcast. I’m a Senior Principal Engineer at Zalando. My journey has largely been about building reliable systems and understanding how to build software properly. I have a background in mathematics and spent over 10 years in academia, including earning a PhD. After transitioning to industry, I got involved with one of the first metric SaaS companies, Circonus, where I joined as Chief Data Scientist, developing query languages and anomaly detection for metrics data. Later, I joined Zalando and led their Site Reliability Engineering (SRE) organization for two and a half years before returning to an individual contributor role.

[05:00]

Heinrich: My focus remains on SRE and reliability, and data is emerging as a new field within reliability engineering. With AI systems relying heavily on data, managing and ensuring the reliability of data systems is increasingly important. I’m excited to use this forum to discuss these topics, improve my understanding, and share insights with our listeners.

[06:02]

Sven: I’m Sven, and I’ve been with CaSE since the beginning. Before co-founding CaSE, I was a podcast host with IEEE Software Engineering Radio. For me, podcasts are a great way to get personal consulting: you get to talk to the best people in the world about specific topics. As a consultant, I’m solid in software architecture, software engineering, software delivery, and reliability engineering, though not as deeply as Heinrich. Podcasts allow me to dive deep into topics by discussing them with experts, and I believe these discussions often align with the interests of many others.

[07:22]

Sven: Our format will change slightly. We’ll have discussions among ourselves and also invite guests, as we’ve done before. Interviews with guests have proven to be engaging and fun, and I’ve always enjoyed them. Additionally, Alex and I once had a conversation that a colleague found interesting, and we thought it would be nice to share such discussions with our audience. So, we’ll also have informal sessions where we discuss the latest news, trends, and whatever comes to mind.

[08:39]

Sven: This informal style is what makes podcasting great—it’s about inviting listeners into interesting, personal conversations between peers. It’s perfect for casual listening, whether you’re on a walk or on a treadmill. It should be easy listening, with some information density but not overload. Hopefully, we’ll get along well, and if not, we’ll just stick to guest interviews—just kidding!

[09:51]

Sven: Now, let’s move on to the topics we have for today. What do you bring to the table? Maybe one thing before we dive in...

[10:01]

Sven: We also want to expand to two new publication channels: Spotify and YouTube. While Spotify isn’t live yet, it’s on our roadmap. YouTube is a big focus for us, with ideas around snippets, shorts, and more. I don’t want to overpromise, but I’m confident we’ll make it work. One thing I particularly like about YouTube is the feedback potential—viewers can comment on specific scenes or sections, which adds valuable perspectives. For example, after Heinrich and I recorded a podcast in German, we received feedback about topics we missed, which is great for understanding what listeners want to dive deeper into.

[11:48]

Sven: Another advantage of YouTube is the ability to cut chapters from a podcast and publish them independently. While this requires extra effort, it’s worth exploring. I’ve seen non-technical podcasts do this successfully, and it’s an interesting approach.

[12:35]

Sven: Now, let’s dive into today’s topics. Heinrich, you wanted to talk about the cost of observability?

[12:58]

Heinrich: Yes, I’ve been thinking about this recently. A good starting point is Coinbase’s leaked Datadog bill in 2023, which was €65 million for the year. At first, it wasn’t clear if this was quarterly or annual, but it turned out to be annual. Coinbase had €3 billion in revenue, so I started speculating about their AWS or infrastructure costs. For €100 million, you can get a lot of compute on AWS, and Coinbase isn’t a video-heavy platform, so their observability bill might even rival their infrastructure costs.

[14:08]

This raised the question: Is €65 million too much? The answer is yes—Coinbase didn’t want to pay it. They started experimenting with on-prem solutions, like Prometheus, and eventually renegotiated a lower deal with Datadog. This story highlights how costs can creep up, especially with volume-based pricing and vendor lock-in.

[15:13]

Alex: The broader question is: How much value do you get from observability, and how much are you willing to pay? Historically, monitoring was seen as a minimal-cost add-on, but now we understand that good observability comes at a premium. I think of it as a tax on your infrastructure—you pay extra to make your systems observable.

[17:05]

The key is to ask: What do you want to observe? What questions do you need answered? In the early 2000s, developers often logged everything without considering what was useful. Later, with tools like Nagios, the trend shifted to collecting everything, even if it wasn’t immediately useful. SaaS products now promise to make sense of all this data, but that comes at a cost.

[18:26]

Ultimately, you need to define what insights are relevant to your product and what data is required to generate those insights. Observability isn’t different from building a customer-facing product—you need to understand what data is necessary, whether it’s for customer use or for feeding into systems like vector databases for AI.

[19:15]

Heinrich: Regarding Datadog’s billing, it’s typically volume-based, with some vendors also using seat-based pricing. Custom metrics are another factor, as seen with Dynatrace and Datadog. Pricing models vary, but volume and custom metrics are usually the dominant factors.

[20:00]

Beyond the pricing model itself, it’s worth remembering that, from a vendor’s perspective, infrastructure costs aren’t the largest part of their expenses; payroll for engineers, product development, and support are. So, when you pay for observability, you’re paying for superior product development and reliability, not just server costs.

[20:49]

There are ways to reduce costs, like aggressively pruning metrics or using smart sampling to maintain observability while lowering volume. However, I believe the focus shouldn’t be on minimizing metrics but on understanding the value the tool provides. Vendors shouldn’t try to sneak in extra metrics to inflate costs, and customers shouldn’t obsess over sending as little data as possible. The best conversations I’ve had with vendors revolve around the unique value their tool provides versus its cost.
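
As a rough illustration of the pruning option mentioned here, a minimal sketch assuming the OpenTelemetry Python SDK (the instrument names are hypothetical): views let you drop noisy instruments or strip high-cardinality attributes inside the process, before anything is exported or billed.

```python
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import (
    ConsoleMetricExporter,
    PeriodicExportingMetricReader,
)
from opentelemetry.sdk.metrics.view import DropAggregation, View

reader = PeriodicExportingMetricReader(ConsoleMetricExporter())

provider = MeterProvider(
    metric_readers=[reader],
    views=[
        # Drop debug-only instruments entirely (hypothetical instrument names).
        View(instrument_name="myapp.debug.*", aggregation=DropAggregation()),
        # Keep the latency histogram, but only with low-cardinality attributes.
        View(
            instrument_name="http.server.duration",
            attribute_keys={"http.method", "http.status_code"},
        ),
    ],
)
```

The mechanics are simple; deciding what is safe to drop is the value-versus-cost conversation described above.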

[22:17]

Heinrich: It’s important to consider the context. For example, Zalando is one of Europe’s largest retailers, handling massive traffic, while Vorwerk’s Thermomix serves millions of users. On the other hand, some finance companies I work with have much lower traffic, so their observability costs are less of a concern. Ultimately, it’s about cost per user and what’s sustainable for the service you’re offering.

[23:42]

Heinrich: A key point is distinguishing between internal users (e.g., developers and operations teams) and end users (customers). Observability provides value to both. For internal users, it’s about documentation, user experience, and insights. For end users, it’s about reliability and performance. Quantifying this value can be challenging—how much does better observability improve reliability? It’s not just about the tool’s quality but also the skill set of the team using it to turn observability data into customer value.

[26:05]

I recently discussed OpenTelemetry with a colleague who argued that spans alone provide most of the observability data needed. While this is a great starting point—especially since spans anchor to user experience—it’s not the whole story. Resource metrics (e.g., CPU, memory) and pod health metrics in containerized environments are still important, though their relevance has diminished as infrastructure concerns have shifted away from developers.
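
To make the point about spans anchoring to user experience a bit more concrete, here is a minimal sketch using the OpenTelemetry Python API; the checkout handler, the payment client, and the PaymentError type are hypothetical. The idea is one span per user-facing operation, enriched with the attributes you will later want to slice by; resource metrics like CPU and memory would still come from the runtime or the platform, outside this code path.

```python
from opentelemetry import trace

tracer = trace.get_tracer("checkout-service")  # hypothetical instrumentation name

class PaymentError(Exception):
    """Hypothetical domain error, stands in for whatever the payment client raises."""

def handle_checkout(order, payment_client):
    # One span per user-facing operation: this is the anchor to user experience.
    with tracer.start_as_current_span("checkout") as span:
        span.set_attribute("order.items", len(order.items))
        span.set_attribute("customer.tier", order.customer_tier)
        try:
            payment_client.charge(order)  # hypothetical payment call
            span.set_attribute("checkout.outcome", "success")
        except PaymentError as exc:
            span.record_exception(exc)
            span.set_attribute("checkout.outcome", "failed")
            raise
```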

[28:49]

Heinrich: Observability tooling is another tool in the developer’s toolkit, and they need to understand how to use it effectively. This brings us to the next topic: observability training.

[29:15]

Sven: I’ve been asked to expand a two-hour observability training into a two-day program. The target audience is senior software engineers and architects, and the training is embedded in a larger training program. The key takeaways should focus on operational sustainability and aligning observability with quality goals.

[31:49]

In my experience, many architects neglect the operational impact of their designs, failing to consider whether their applications can run sustainably in production. Training should emphasize starting with requirements—both functional and non-functional (quality goals and scenarios). These scenarios act as hypotheses for how the system should behave, which can be validated through observability.

[32:44]

I often run workshops to help stakeholders, architects, and product owners understand operational requirements. For example, stating “99.9% availability” without context is meaningless. Training should focus on major use cases and stakeholder expectations, not just abstract metrics like “nines.”
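
As a side note on why the bare number needs context, the arithmetic behind “three nines” is quickly sketched:

```python
# What a 99.9% availability target allows, as plain arithmetic.
minutes_per_month = 30 * 24 * 60          # 43,200 minutes in a 30-day month
availability_target = 0.999

error_budget = minutes_per_month * (1 - availability_target)
print(error_budget)                       # 43.2 minutes of downtime per month

print(365 * 24 * 60 * (1 - availability_target))  # ~525.6 minutes (~8.8 hours) per year
```

Whether 43 minutes per month is fine or catastrophic depends entirely on which use case is affected and when, which is exactly the point about stakeholder expectations.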

[35:00]

Service Level Objectives (SLOs) are a great way to formalize these expectations, but the process should start with understanding what the service should do and how well it should perform. This overlaps with defining SLOs, which don’t need to be fully implemented immediately but should guide observability efforts.
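
A minimal sketch of the mechanics behind such an objective, with made-up names and numbers, might look like this; the hard part, as discussed above, is agreeing on what counts as a “good” event before writing any of it down.

```python
from dataclasses import dataclass

@dataclass
class Slo:
    name: str
    objective: float  # e.g. 0.999 means 99.9% of requests must be "good"

def availability_sli(good: int, total: int) -> float:
    """Ratio-based SLI: the fraction of requests that met the definition of 'good'."""
    return good / total if total else 1.0

def budget_remaining(slo: Slo, good: int, total: int) -> float:
    """Fraction of the error budget left; negative means the objective is breached."""
    allowed_bad = total * (1 - slo.objective)
    actual_bad = total - good
    return 1 - (actual_bad / allowed_bad) if allowed_bad else 1.0

checkout = Slo(name="checkout availability", objective=0.999)
print(availability_sli(998_800, 1_000_000))            # 0.9988, below the 0.999 target
print(budget_remaining(checkout, 998_800, 1_000_000))  # -0.2: budget overspent by 20%
```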

[37:13]

One challenge is measuring less obvious objectives, like the replaceability of a service. For example, if the goal is to replace a service within 30 minutes, how do you measure that? It’s not something you track in production but rather in the pipeline or code repository. This highlights the importance of aligning observability with the specific quality scenarios stakeholders expect.

[38:40]

Alex: Another example is measuring the time to fix a production failure. This could involve rebuilding and deploying a previous version, which raises questions about reliable builds, binary availability, and third-party library dependencies. These complexities underscore the need for robust observability practices that go beyond simple metrics.

[40:00]

We’re not talking about server-side applications here but client-side applications, like firmware updates. That’s true, it’s a lot of pain. I don’t even want to go there, but that’s what I mean. These are all topics architects need to understand: what am I building for? I would distinguish between application architects and platform engineering. The platform engineering team should track KPIs like time to deploy for every application released on the platform. Ideally, if it’s a standardized runtime, they should have that in mind.

[40:48]

I haven’t seen time to deploy actively managed as an operational signal, like monitoring a microservice or tracking error rates. Deployment cycles vary—sometimes you have one or two deployments a week, or even zero, so it operates on a different time scale.

[41:15]

The requirement is straightforward: you want to be able to roll back a change within 10 minutes. However, I haven’t seen this actively tested or monitored regularly. It’s similar to business continuity management and disaster recovery. If the system crashes, how fast can you rebuild it, get the service back up, or restore the platform? If you deploy often, you’ve likely automated 90% of it, reducing human error. But if you don’t regularly test database restores, for example, you might not even know how to test for data integrity.

[42:31]

Sven: Is this a topic for observability training? It depends. You can verify all of this with observability tooling today, even with tools like Nagios. The observability training is part of a larger program, and reliability engineering is a separate two-day training. Should we even discuss SLOs in the observability training? There’s an SRE training for that. The anchor points are interesting; we’ve already talked about understanding user needs. Whether the SRE part and the observability part are kept separate usually comes down to how the company is organized.

[43:40]

In a large company, you’ll find every setup and tool that exists. The anchor points matter—you need to observe the user experience, which is the ultimate goal. SLOs are naturally related to user experience, and from there, it trickles down to machine behavior, like detecting latencies in hardware components. Starting with user experience and understanding upstream dependencies is a good starting point.

[45:00]

The next question is how to get there and allow operators to reason about the data. Understanding different telemetry types and how they relate is crucial. Traces are the gold standard for transactional microservices communicating over REST APIs. OpenTelemetry in different language stacks is something every engineer should understand. Complementing traces with logs and metrics is important. There’s an interesting twist when it comes to events versus logs versus traces and Observability 2.0.

[46:03]

There’s some confusion here, but also valuable insights. I’d distinguish between event monitoring and resource monitoring. Resource monitoring, like room temperature, is a small aspect where metrics excel. Everything else is event-based, like a request being served or a scheduling event. How do you best represent an event? A structured object or log line works well. The complexity around metrics is about aggregation—rolling up data over time intervals, taking averages, percentiles, or histograms. Spans and traces are just events with trace IDs and parent IDs, stored in systems optimized for specific joins. Tracing is an opinionated subset of event processing, while logs are more unopinionated.
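
To illustrate the claim that spans are just events with trace and parent IDs, here is a hand-written sketch; the field names only loosely resemble real tracing formats and are meant to show the structure, not any particular wire protocol.

```python
import json
import time
import uuid

def emit(event: dict) -> None:
    # Stand-in for whatever the log/event pipeline does with a structured record.
    print(json.dumps(event))

# A plain structured event (a "log line" in object form):
emit({
    "timestamp": time.time(),
    "event": "order_scheduled",
    "order_id": "A-1234",
})

# A span is the same kind of record, plus the IDs that make trace joins possible:
trace_id = uuid.uuid4().hex
emit({
    "timestamp": time.time(),
    "name": "charge_payment",
    "trace_id": trace_id,
    "span_id": uuid.uuid4().hex,
    "parent_span_id": None,   # root span of this trace
    "duration_ms": 42,
    "order_id": "A-1234",
})
```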

[48:08]

Observability 2.0 introduces the idea of wide, fat events. Instead of logging multiple steps in a function, you keep track of states in memory and write out a single large event at the end. This reduces noise and correlation problems. Event-based monitoring advocates for fewer, larger logs with better indexing in storage solutions. This is a high-level take on the concept.
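
A rough sketch of the “one wide event per request” idea, with all names hypothetical: instead of several log lines per step, the handler accumulates context in memory and writes a single structured record at the end.

```python
import json
import time

def handle_checkout(request, db, payment):
    started = time.time()
    event = {
        "event": "http_request",
        "path": request.path,
        "user_id": request.user_id,
    }
    try:
        cart = db.load_cart(request.user_id)  # hypothetical data access
        event["cart_items"] = len(cart)

        receipt = payment.charge(cart)        # hypothetical payment call
        event["payment_provider"] = receipt.provider
        event["outcome"] = "success"
        return receipt
    except Exception as exc:
        event["outcome"] = "error"
        event["error"] = repr(exc)
        raise
    finally:
        # One wide event per request, written exactly once, whatever happened.
        event["duration_ms"] = round((time.time() - started) * 1000, 1)
        print(json.dumps(event))
```

Note that if the process dies hard before the finally block runs, the whole event is lost, which is exactly the objection raised a bit later in this conversation.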

[49:45]

My original idea aligns with the framework by the person who wrote the paper and created the concept. He asks: Is it broken? This connects business to software systems and user needs. Then, what’s broken? And what types of telemetry data exist—metrics, logs, traces, etc.? How is this data stored? Observability 2.0, as discussed by Charity Majors, introduces the idea of wide events. But I’m not sure how to implement this with tools like Dynatrace. Is it worth discussing if not all tools support it?

[52:10]

If you have a logging database that stores structured logs, what queries can you run? Tools like Splunk and the Elasticsearch/Kibana stack offer full-text indexing, while others index specific columns for SQL-like queries. Honeycomb excels in visualization and insights, but you can achieve similar results with structured log solutions. Tracing solutions for HTTP microservices often fit well into span and trace structures, reducing the need for additional events.

[54:24]

Alex: Observability has been around since 2007, but Honeycomb popularized it by advocating for wide events and specific query patterns. Observability 2.0 reintroduces the original idea, emphasizing wide events, good indexing, and visualizations. However, you can achieve similar results with structured logs and tracing solutions. The key is enabling an explorative approach to debugging with enough context and attributes for meaningful analysis.

[57:20]

Alex: I’m not fully convinced by fat events yet, but I remember when observability first emerged. Charity Majors emphasized observing systems without assuming everything upfront. Observability 2.0 continues this mindset, focusing on exploration and understanding systems in production. The challenge is balancing cost and relevance, especially for companies moving back to on-prem solutions to avoid high SaaS observability costs.

[1:00:00]

Right? That’s the takeaway I get from your plea for fat, or wide, events. The issue I have with wide events is this: how often have you seen a process die in the middle of an operation? This is exactly my objection. I was never a fan of it at the beginning. I kept asking, what happens if it crashes? Back then, I was working with C stacks and dealing with tons of segmentation faults, which were the most interesting things to debug. So, I was always arguing on Twitter: is this really the right thing?

There was a production incident at Honeycomb that couldn’t be diagnosed because of this issue. It leaves the door open for problems that are hard to troubleshoot. However, interestingly, a lot of companies today don’t use C stacks. They have Java stacks with proper error handling and don’t see that kind of abrupt process death nearly as often. In practice, it hasn’t been a significant issue. I’m not saying this is the solution, but it’s an idea worth exploring. It works well in many contexts and is better than just monitoring everything indiscriminately.

[1:01:24]

Maybe what I wanted to say didn’t come across. I had this discussion several months ago with someone at Fastly. They told me, “Of course, we don’t capture every log. Our log pipeline might drop a few if there’s too much traffic.”

I was raised with the mindset that losing any log events is unacceptable because you can’t properly diagnose the system. But over time, I started thinking differently. If a payment flow fails and the customer sees an error, what do they do? They click the button again—just like turning on a light switch. If it doesn’t work the first time, they try again.

The real question is whether the error is reproducible. If an issue only happens in a specific market or on a particular Android version, retries won’t help. That’s where good telemetry systems should differentiate between transient and systemic failures. However, I haven’t seen many telemetry systems that are good at making these distinctions.

[1:02:53]

Agreed. If an issue happens multiple times, your observability system will certainly catch it. There’s no need to log everything.

I once had a P1 incident where we initially thought there was a major issue in Italy—customers couldn’t book. After investigating, we realized it affected just eight customers because a single phone provider was blocking our IP. So, across all of Italy, it was just those eight users who connected via their mobile provider. Context matters.

[1:03:40]

I fully agree. Losing a few log lines is usually harmless. Many companies systematically sample their logs, traces, and events—sometimes at a ratio of 1:10,000. They aren’t storing and processing everything. If you’re smart about it, you can still maintain high observability.
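
As one concrete way to implement that kind of systematic sampling, the OpenTelemetry SDKs ship ratio-based samplers; a minimal sketch in Python, with the 1:10,000 ratio taken from the figure mentioned above:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Keep roughly 1 in 10,000 traces at the root; child spans follow their
# parent's decision so that sampled traces stay complete end to end.
sampler = ParentBased(root=TraceIdRatioBased(1 / 10_000))

trace.set_tracer_provider(TracerProvider(sampler=sampler))
```

Tail-based sampling, where the keep-or-drop decision is made after a trace has completed (for example, keeping every trace that contains an error), requires a collector in the path; that is roughly where the “keep everything in memory and manage retention boundaries” engineering mentioned next comes in.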

[1:04:09]

Some teams discard 999 out of 1,000 traces. Distributed tracing seems crazy at first: for every HTTP request, you add another request just to export the span data. You generate a ton of metadata, almost as much as the actual request volume. These systems become incredibly high-load, and initially, it seemed like heavy sampling was the only option.

Now, with better engineering, we see that by keeping everything in memory and carefully managing retention boundaries, you can get away with very modest sampling rates. It took time to figure this out—partly due to technical limitations, but also because it required bold thinking.

I suspect the same applies to data. Right now, people discard a lot of it because it’s too expensive to store and process. But maybe doubling our data isn’t as costly as we think. Maybe we can just do it and be fine.

[1:05:49]

That’s an interesting idea—learning from reliability engineering in other fields. Looking at different domains and extracting useful patterns could be valuable.

I’ve been thinking about this in another context. As you know, I recently bought a house and am planning a complete renovation. I’ve been looking into what smart home technology has to offer in 2024. Opinions are all over the place—some say Zigbee is great because it’s flexible, while others insist on wiring everything because cables are more reliable.

At first, I was looking for software engineering patterns to apply to smart homes. Now, I’m thinking in the opposite direction—what can we learn from traditional architecture and engineering? Buildings last for decades. What can we take from that?

[1:07:31]

We already borrow a lot from architecture in software engineering—structural views, runtime views, infrastructure views. The same applies to buildings: you have electrical plans, material considerations, and structural analysis.

Buildings also have cross-cutting concerns, just like software. For example, in software, logging is a cross-cutting concern. In buildings, you have elements like windows and carpets that span multiple areas.

[1:08:24]

One principle I’ve been advocating recently is building for replacement. Technology moves so fast that sustainability isn’t about the specific tech stack but about functionality. We must continuously assess whether new technology can do the job better, faster, or more sustainably—and be willing to replace old systems.

It’s a modularization challenge. Some buildings embrace this idea. There’s a great book called How Buildings Learn—worth putting on your coffee table. It explores how buildings evolve over time, which I find fascinating.

[1:09:46]

I had discussions with electrical engineers about wiring. One insisted that cables should go directly into the walls: why bother with conduits? Another engineer, who recently renovated his parents’ 1969 home, was grateful for conduit pipes because they made replacing the old wiring much easier.

There’s no absolute right or wrong. It’s context-dependent, and that’s what I’m exploring as I plan my renovation.

This also applies to organizations. Software rarely exists in isolation—it’s embedded within teams and processes. Software can change quickly, but how fast can people adapt to new technologies?

That’s a great topic. “First, we shape our buildings, and then they shape us.”

[1:13:10]

Coming from a mathematics background, I value solid foundations. Early 20th-century mathematics focused on rigorous proofs—thousands of pages just to prove 1+1=2. But software never feels solid in the same way.

A key difference is that software systems behave more like dynamical systems rather than static structures. They require feedback loops to maintain stability. If you make things too rigid, they become brittle and hard to change. There’s a balance between robustness and adaptability.

We’ve come a long way. In the late ‘90s, deploying a website meant copying PHP files to an FTP server. Now, we have CI/CD pipelines, Kubernetes, and version-controlled deployments. We’re in a better place, but I still hope for even more rigor without sacrificing speed.

[1:16:55]

Progress can be hard to see in the moment. I once spoke with a Gartner analyst who asked how our systems had evolved over three years. Initially, I complained about all the things that were still broken. But when I looked back, I realized how far we’d come. That perspective shift was eye-opening.

The Slack CEO once said he was embarrassed by his product—so many flaws, so much room for improvement. Yet, compared to IRC or Microsoft Messenger, Slack was years ahead. You need that mindset: seeing the gaps while still recognizing progress.

[1:18:25]

For me, today is my last working day of the year. I’m looking forward to reflecting on what we achieved. If I had to fire myself, what advice would I leave for my successor?

Sven: That’s a great topic for our first episode next year—how to reflect, plan, and actually execute on plans. Looking forward is easy; following through is hard. I want to learn from both of you on this.

[1:20:33]

Closing the year is powerful. Letting go of what didn’t get done and making peace with it allows for a fresh start. It’s all about mindset.

[1:21:14] Sven: This episode will be published in 2025, so I’ll just say: Merry Christmas and Happy New Year! Looking forward to our next discussions, including guest interviews. Stay tuned!