Sven Johann: Hello everyone, and welcome to a new conversation about software engineering. Today is the first episode of a three-part series about the operational requirements on latency and availability. The Google Site Reliability Engineering program did a lot of work in that field, mostly known as service level objectives, service level indicators, and error budgets.
Sven Johann: In all three episodes I'm going to talk to Alex Bramley, a site reliability engineer at Google. In the first episode we are going to cover putting user happiness in front of almost everything, measuring user happiness with SLIs (service level indicators), defining user happiness with SLOs (service level objectives), what are error budgets and how to use them, defining good enough baseline requirements for latency and availability, how to establish feedback loops with people who understand our users best, measuring and defining latency - how to do that.
Sven Johann: Alright, let's get started.
Sven Johann: Welcome to a new conversation about software engineering, today about the art of service level objectives. My guest today is Alex Bramley. Alex joined Google in January 2010 as the first mobile site reliability engineer in London. He spent around 7.5 years in various reincarnations of mobile, Android and today site reliability engineering. He now works as a customer reliability engineer for Google Cloud, where much of his time recently has been spent rethinking how people teach customers, partners and the general public about service level objectives. Alex, welcome to the show.
Alex Bramley: Hi there. Thank you for having me.
Sven Johann: Yes, awesome. Looking forward to this conversation for a long time.
Alex Bramley: Yes, sorry it's taken so long.
Sven Johann: Okay, let's start with service level terminology. Why do we need to think about the level of service we provide to our customers?
Alex Bramley: Well, I sort of hope the answer is self-evident from the question, but for the sake of argument, let me just preach for a bit. A lot of my job in teaching people has been standing on the soapbox and just kind of shouting at them, and for the first part of this that's kind of what I'm gonna do... So I think that caring about your users and providing reliable services to them is just good business practice.
Alex Bramley: In most cases, the continued existence of a company is dependent on its users still wanting to use the services that it offers, and without customers to pay for those services or users to serve adverts to, the revenue stream for your company is gonna dry up, and the company is eventually gonna collapse. In well-run markets, with healthy competition, your users are gonna have a variety of providers to choose from for any given service... This means that if they can get equivalent or better service from a competitor for equivalent or less money, the only disincentives to moving are the time, cost and inconvenience of doing so.
Alex Bramley: As a business owner, if you'd rather your users didn't leave, you pretty much have to provide good service, or good enough service that it doesn't seem worth the effort of leaving. And that means that it's worth your while to continuously measure and understand the experience your users have with your services... Because you don't wanna discover that your service is no longer good enough when all of your users suddenly decamp to a competitor. At that point, it's much harder, a lot harder, to win back their trust and their business.
Sven Johann: That's true. Once I'm very unhappy with the (let's say) reliability of a certain service, I just move on and I don't look back... So I think that you as a SaaS provider, for example, you have to do a lot of work to regain customer trust.
Alex Bramley: Yes, it's very easy to just burn this away accidentally, not even thinking about it. It can be an unintended consequence of a choice that you made that you thought was going to make your product better. But once it's gone, it's gone, and it's really hard to get it back.
Sven Johann: That's indeed the case... When I'm unhappy, I don't go back. I just look for someone new. Most people have heard about service level agreements (SLAs). Can you briefly explain what it means?
Alex Bramley: Sure, totally. SLAs are more of a legal thing than a technical thing, in my opinion. They're a bit like an insurance policy that you provide for your customers. In general terms, an SLA tends to say "If the level of service you receive is this much below what you're paying for, you get this much in this form of compensation. If we promise you that your service will be available 99.99% of the time and it's only available 80% of the time, we're going to pay you this much for the difference."
Alex Bramley: Insurance is, I think, a nice analogy, because I don't think most people want to have insurance. They don't wanna have to have insurance. Most of the time we do this anyway, because it helps mitigate risk. Take home insurance, for example - you don't want to have to have home insurance. It would be nice if your house was just fine, right? The problem is that when you have to fix your house, it can cost a significant amount of money. You don't wanna have to pay out that money straight away.
Alex Bramley: This doesn't mean we're happy when we have to make a claim. It's usually quite the opposite. For example, I had to make a claim on my home insurance a couple of years ago due to some subsidence problems with my house; the insurance folks were great. The subsidence was sorted quickly, and they even repainted half of my house after fixing up the cracks. But you know what I would have preferred? That I didn't have one wall of my house sinking into the ground. That would have been way better.
Alex Bramley: A similar argument applies to a service level agreement. Yes, you can compensate your customers because they got poor service from you, but I'm pretty sure you burned a lot of trust too, because you gave them poor service. And that's not something that you can buy back with this compensation. They would most likely much rather have had good service in the first place. This applies doubly to a cloud service provider, because often the customers of a cloud services provider are going to also have customers, and they're gonna have to compensate their customers for the poor service that they receive from you. And the compensation they have to pay out may even be more than the compensation that they receive from the cloud provider. At that point, they've lost the trust of their customers, they've lost money, they're not gonna be happy at all, and they're gonna come to you as a cloud provider and say "This is not good enough."
Alex Bramley: As usual, there are some game theory aspects to consider in this, too. I talk a lot about incentives when I'm talking to people about how to talk about reliability and how to organize your business so that your service is more reliable. The cut-off point for when an SLA violation triggers compensation is gonna be set near the point where that compensation motivates customers to continue using your services, rather than moving to a competitor. And given that there's a number of other factors to consider, that dissuade customers from that sort of change - the cost of switching a cloud provider for example is astronomically high. And it's not just about being able to stand up your services in a different cloud; you have to have all your data in multiple places... And people talk about this thing "data gravity", and moving your data out to another cloud provider is one of the highest cost/highest risk things in any migration.
Sven Johann: Yes, absolutely.
Alex Bramley: That additional cost means there's usually a massive gap between the level of service that's good enough to keep your customers happy and the level of service which is bad enough to trigger a compensation payout.
Sven Johann: Yes. Before talking about customer happiness -- yeah, at the time of this recording, I think it was one week ago, one major cloud provider had a big problem in U.S. East... Basically, it was in the news because all of a sudden doorbells and refrigerators and all those internet-enabled things, vacuum cleaners - they didn't work anymore. I'm sure that those companies got money back from their cloud provider, because there was an SLA breach... But true, nobody wants that problem; the cloud provider doesn't want to pay back the money to their customers, and their customers have lots of costs because... I mean, "During the weekend we had an incident, because U.S. East was down." Everyone was quite unhappy.
Alex Bramley: When you get into providing consumer services on top of a cloud provider, you've got a massive fan-out problem. The relationship between you and the cloud provider is one-to-one, but you have a one-to-many relationship with tens of thousands of your customers. And that means that the scale of compensation that you receive from your cloud provider is not likely to be the same as the scale of compensation that you may have to pay out to your customers... Although I suspect a lot of people providing end-to-end services to their customers aren't going to provide any kind of monetary compensation. You may get some free credits, like you won't have to pay as much on your subscription next month if it's been really bad. But a lot of the time your costs are gonna be "Okay, so instead of paying X for my support staff, I had to pay Y, because we had to burst to 3 to 5 times the number of support staff. We had to deal with 20 times the number of complaints, and of course, now the reputation of my company is in the trash." That's the cost. And it's one that's hard to put a monetary figure on, because reputation isn't fungible with cash, but it is still a cost that the company has to pay.
Sven Johann: Yes, absolutely. The cost of a day-long outage can be quite high from different perspectives. Customer happiness - so you mentioned that we want to keep our customers happy... That is a very difficult question. Let's say I go to a product manager and I say "What availability or performance makes our customers happy?" Usually, they have no idea what I'm asking... So obviously, that doesn't work. So how should we measure customer happiness?
Alex Bramley: I have a terrible joke in The Art of SLOs about users not consenting to having electrodes put into their brains... You can't measure serotonin and dopamine levels directly, but that is what actual happiness is. So you have to find a proxy, as always, for this kind of thing. And really, the question is "How good a proxy is your measure for the actual happiness of your customers?" So the things you can measure about your systems tend to be metrics; we have monitoring metrics.
Alex Bramley: Anyone running a large-scale system in production is gonna have something checking on the health of that system consistently, all the time - or so I would hope. Otherwise, how do you know it's even working? So some of these metrics can be useful for determining whether -- not necessarily that the customer is happy, but that the service they're receiving from you is meeting their expectations.
Alex Bramley: One of the tricks of SLOs in general is that conversion from "user happiness" to "the service is currently meeting users' expectations." If you start talking about it in terms of whether your service is meeting those expectations, then you can measure aspects of your system and its performance, and judging whether it's meeting expectations becomes much more tractable. You can make some assumptions that make it possible to do that. We start talking about this in terms of service level indicators, which are specific metrics, measured by your monitoring systems, that are a good proxy for the experience your users are having while using your service.
Alex Bramley: In The Art of SLOs we like to recommend that it's a ratio of two metrics - the number of good events to the total number of events over time. And when you start talking about specific metrics, it becomes easier to make this more concrete. The canonical example I tend to use here is HTTP responses, because a lot of the things that I monitor tend to be web-based services working for Google; most of the stuff we provide is a web-based service, so we talk about a lot of things in terms of HTTP responses.
Alex Bramley: No one really wants to receive the dreaded 500 internal server error when they're just trying to browse their favorite site or read their email... So this is clearly not a good event. They were expecting to receive the web page that they -- you know, they clicked a link and they were expecting to receive the page that they wanted, and instead they get an error page. That's going to make the user unhappy, because they didn't get the thing that they were expecting to get.
Alex Bramley: On the other hand, getting the 200 OKs - okay, that's good. That's clearly a good event. So if you divide the rate at which you're serving these HTTP 200s by the rate you're serving all HTTP responses, you're probably gonna get a number that's a little bit smaller than one.
Alex Bramley: If it's one or close to one, most of your users are getting okay responses, and because of the translation we were talking about just now, you can say they're mostly getting the responses they expected from your system, so they are mostly happy. But if it's a lot smaller than one, it means that lots of your users are not getting this okay response, and so they're not getting the response they expected from your system, therefore you're going to assume that they are not happy... And therefore this metric is indicating the level of service that your users are receiving, and by proxy indicating how happy they are in aggregate about your system and its operation.
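The good-events-over-total-events ratio Alex describes can be sketched in a few lines of Python. This is a minimal illustration, not a real monitoring system; the response counts are made-up example numbers.

```python
def availability_sli(good_responses: int, total_responses: int) -> float:
    """Fraction of responses that were good (0.0 - 1.0), e.g. HTTP 200s
    divided by all HTTP responses over the measurement window."""
    if total_responses == 0:
        return 1.0  # no traffic means no users saw an error
    return good_responses / total_responses

# Illustrative window: 999,850 OK responses out of 1,000,000 total.
sli = availability_sli(999_850, 1_000_000)
print(f"SLI: {sli:.5f}")  # close to 1.0, so users are mostly getting what they expect
```

In a real monitoring system both counters would come from your metrics backend; the point is only that the SLI is a plain ratio you can compute and alert on.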
Sven Johann: Okay, so I have my service level indicator, which - let's say we have the Google search. If Google search returns 200 most of the time, then the user is happy. Or if I have my webshop and the search and the product detail page and the checkout from that webshop works (let's say) most of the time, my users are happy. The question is -- I mean, we talked about that before the podcast; it's very hard or impossible to have 100%. It would be easy to say "We want 100% of good events in the ratio with all events." That would certainly make the customer happy... But 100% is impossible. The question is "What is the right level? How do we know when our users are happy enough? Is it 100%, or 80%, or whatever?" It probably depends.
Alex Bramley: I guess this is the natural next question. Everyone wants 100%, right?
Sven Johann: Exactly...
Alex Bramley: But in practical terms, reality has a way of messing with you when it comes to that kind of thing. Everyone wants 100%, but achieving 100% is a very different matter entirely. Answering the question "What level of service is good enough?" is quite difficult. "When your users are complaining about something else" is one way of putting it, I guess.
Alex Bramley: For the purposes of the discussion right now though, I think it's fine to just pick a number. Set a target that you think will keep most of your users happy, and measure that. You need to find external correlating events which can tell you whether you set your target in the right place or not.
Alex Bramley: A good way of doing this is to have an outage. Nobody likes having outages, but you get information from them that you do not get when your service is operating normally. So if you have an outage, and say you've set a target/an objective for your service levels, that's like "I think my service should be three nines reliable. I think that 99.9% of the time my users should be getting an okay response." And then you have an outage, and you -- actually, I guess the best thing for the purposes of this discussion is to say "Well, you have a dip in your level of okay responses from 99.99% okay to 99.9% okay." So you're still at your target, and you're serving at the goal that you said would keep your users happy, but you start seeing increases to your support request lines, you start seeing sadness on Twitter, you start seeing forum posts saying "Why isn't this working anymore?" Then you know that your target was in the wrong place. That is information you couldn't have got without serving at that target, or below that target.
Alex Bramley: Conversely, if you set the three nines target and you dip down to 2.5 nines, and you still don't see any more support requests, you still don't see any complaints on Twitter, then that's an indication that maybe your users don't care that much about your availability.
Alex Bramley: You don't know in a vacuum. You have to find ways of getting information from your users. Sometimes just asking them can be enough. If you're a small company and you have a relatively good relationship with your users, then you can do things like customer satisfaction surveys and they will tell you whether it's good enough, or what they would like to see done better. When you've got two billion users, like some of Google's products do, then it's a little more difficult to have an individual conversation with each one of those...
Sven Johann: Yes. I would dive into that one a bit later, because there is a difference -- if you have already a service running, that's a different story, I believe, than "I'm currently working on a brand new service where I basically cannot really check if my customers are happy, because they cannot use the service yet." So let's dive into that later.
Sven Johann: I have another question when it comes to terminology... So we talked about service level objectives, we talked about service level indicators... There is another term, error budget. So what is an error budget and how does it relate to my SLOs?
Alex Bramley: If you define your service level indicators as the ratio of good events to all events, like I was suggesting just now, your service level objective is going to be a static threshold that's just short of one. We've talked about 3 nines, 2.5 nines... 3 nines is where 99.9% of your events are good, and 4 nines is where 99.99% of your events are good. The converse of that is that one in a thousand events is allowed to fail. But if you think about this, you're measuring this over time - you measure whether you're meeting your service level objectives over a window, like 28 days or an hour... So if you're serving thousands of requests per second and you've served a thousand requests successfully, that one allowable failure is in the bank, as it were, from the perspective of your service level objective, because you're measuring over the period of an entire hour or an entire 28 days.
Alex Bramley: So until all of these successful requests move out of that window of measurement, you can serve two bad responses in the next 1,000 requests and you're still okay overall. And that failure being in the bank is like -- that's where the error budget concept comes from. It's the idea that once you have a target that says "Some level of service is good enough", it gives you a budget of allowable errors that you can serve during any particular measurement window, while still maintaining good overall service. And if you haven't served all the errors in that budget, you've banked the remainder and you can spend them on risky activities that may allow you to serve more errors than usual.
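The error budget arithmetic Alex walks through is simple enough to sketch directly. The SLO target and traffic figures below are illustrative assumptions, not recommendations.

```python
def error_budget(slo_target: float, total_events: int) -> int:
    """Number of bad events allowed in the window at this SLO target.
    A 99.9% target means 0.1% of all events may fail."""
    return int((1.0 - slo_target) * total_events)

def budget_remaining(slo_target: float, total_events: int, bad_events: int) -> int:
    """How many more errors can be served before the SLO is breached."""
    return error_budget(slo_target, total_events) - bad_events

# Illustration: 1,000 requests/second over a 28-day window.
total = 1000 * 86_400 * 28  # ~2.4 billion events
print(error_budget(0.999, total))               # allowable errors in the window
print(budget_remaining(0.999, total, 500_000))  # budget left to "spend" on risk
```

A positive remainder is the banked budget that can be spent on risky activities like releases; a negative one means the SLO for the window has already been missed.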
Alex Bramley: The canonical one here is pushing a new release. We've got good data in Google that shows that 70% or so of the outages that we have are related to changing our systems in one way or another, like pushing a new release, pushing a new configuration... That kind of risk is a major one. So we treat software releases as a risky business and we make sure that we have spare error budget that we could burn if the release goes wrong before we push any releases.
Sven Johann: I think the idea of an error budget is absolutely fantastic... It makes so much sense to have something like that. I guess for most companies that's true - change usually leads to outages. The thing is, in our projects we don't measure that precisely... But my guess - it's not even an educated guess - would also be that most of the problems happen after a release. And yeah, if you've just consumed most of your error budget, then slow down a little bit on new features and focus on reliability work.
Alex Bramley: Absolutely. We were talking about trade-offs before you pressed the record button, and this is one of the core trade-offs in software engineering - when do you build more features and when do you engineer more reliability into your service? Because you can't do both -- or you can't do 100% of both. You have to find a nice balance between the two.
Sven Johann: Yes.
Alex Bramley: I just realized I hadn't covered one other aspect of this... But it kind of leads nicely into it. So that trade-off - finding the right point in that configuration space of engineering for reliability versus engineering for features - is one of the things you can do with an error budget. You can build a feedback loop to regulate the pace of change in your production environment. If you have plenty of error budget remaining, then you've got the green light to take more risks. And if there's very little remaining, it's the signal that you need to act more conservatively... Perhaps by postponing the release of potentially buggy new features, like I was saying, or by turning off experiments.
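One way to picture that feedback loop is as a simple gate on risky actions, driven by how much of the window's budget is left. The thresholds below are illustrative assumptions, not a prescribed policy - a real error budget policy would be agreed between development, SRE, and product.

```python
def release_decision(budget_remaining_fraction: float) -> str:
    """Map remaining error budget (as a fraction of the window's total
    budget, 0.0 - 1.0) to a coarse go/no-go signal for risky changes.
    The 0.5 and 0.1 cut-offs are made-up example thresholds."""
    if budget_remaining_fraction > 0.5:
        return "go"            # plenty of budget: green light for risk
    if budget_remaining_fraction > 0.1:
        return "go-with-care"  # running low: smaller, safer changes only
    return "freeze"            # nearly exhausted: reliability work first

print(release_decision(0.8))   # "go"
print(release_decision(0.05))  # "freeze"
```

The value of making the gate this mechanical is that the development team can drive it themselves, without a per-release negotiation.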
Alex Bramley: Another good use of spare error budget is running A/B tests. Say you want to gather data on whether a new feature is going to go down well with users, but it's still potentially buggy - you release it to half a percent or one percent of your user base... You'll burn some error budget if it does go wrong, but you're not going to burn all of it. So A/B testing is a nice way of using some of your error budget.
Alex Bramley: If you have an error budget policy that's agreed upon between all parties - development, SRE, product management - and hopefully sponsored by an executive, that can make sure you've got the correct incentives in place. With a feedback loop built upon the error budget, and the policy guiding the balance between reliability and feature work, the process can be pretty much self-driven by the development team.
Sven Johann: Yes. I think the error budget policy is also quite interesting, but I would postpone that discussion to a later point... First, it was more clarification of the terms, and now I want to dive a little bit deeper into the topic. SLOs - I mean, we've briefly touched on how to get the right SLOs, but now let's have a bit deeper discussion on it.
Sven Johann: As I already said, I cannot go to a technical product sponsor and say "How much availability do you want?" because usually he/she would say "I want 100%" or "I want a very fast system." We really cannot work with that, so both statements are really not good. Could you explain why that is not the right approach to define SLOs?
Alex Bramley: Sure, totally. We all want these things -- don't we all want these things? I also want a pony, if that's possible. Can I have a pony, please? No? Oh, snap. But you're right, they are both problematic statements. And they're problematic in different ways. The second is easier to pick apart, "I want a very fast system." It's just not specific enough. You can't measure very fast with a monitoring system, because computers don't understand very fast. They need a specific response time threshold that you can describe to them, like 500 milliseconds. And if you can't measure it, you can't tell whether you're very fast enough. So the statement isn't helpful.
Alex Bramley: For the first statement, you might have noticed when I was talking about SLAs before that I suggested targets just short of 100%, like 3 nines. Ben Treynor, the vice-president of 24/7 at Google, has a quote... And we've used it in the SRE books and other training tools like The Art of SLOs - it says "100% is the wrong target for almost everything." I know this sounds like an appeal-to-authority kind of argument... "Oh, Ben Treynor says so, so it must be true", but that's not what I'm trying to do. Really, it's an acknowledgment that some level of failure is simply inevitable. Everyone strives for 100% as much as possible, but reality gets in the way sometimes... And this is true even of things that we consider reliability to be essential for, like pacemakers. I went digging through the medical literature to prove this point when I was writing The Art of SLOs, because otherwise someone from the audience was gonna say "Citation needed."
Alex Bramley: I found a paper from 2005, and - this sounds like I did a bunch of research, but I literally just went to Wikipedia... But it had data showing that the post-installation reliability of pacemakers fitted between 1990 and 2002 was around 99.6%, so 2.5 nines. So around 0.4% of pacemakers inside a person, keeping them alive, failed at some point. And the thing I really, really like about this - the title of the paper is "Pacemakers malfunction less often than defibrillators." They're not even hitting 3 nines, and this is a success story in medicine, where you have to keep people alive.
Sven Johann: Interesting.
Alex Bramley: So the point I'm trying to make here is that 100% is just not a realistic reliability target over anything but the shortest time windows. What's more, making a service more reliable requires increasing commitments of both engineering time and operational support for ever-decreasing improvements to your overall reliability. We have a rough rule of thumb inside Google, where each extra nine costs ten times more than the last one... And at some point before you reach 100% reliability, the trade-off is just no longer worth making, because the costs outweigh the benefits.
Alex Bramley: And you were asking before about where we put the target, when the user is happy enough, and how we figure out where that target should be. In Google's case, talking about web search specifically, we came to the conclusion that if we aim to be slightly more reliable than top consumer ISPs, users would be substantially more likely to attribute random errors to failures at their ISP, rather than failures within Google.
Alex Bramley: Google doesn't target 100% reliability, but how do you normally check whether your internet connection is working? For most people, that'd be like "Well, I'd load Google.com. If it's working, then my internet connection is working." And that just shows you - we're not 100% reliable. I've seen the graphs, I know that's true, but people consider us to be reliable because we're more reliable than the other steps in the path to them getting their search results.
Sven Johann: Yes, that's true. Thanks to actually the Google book, I always use that analogy; it's not an analogy, but when I say 100%, what happens if your internet connection just doesn't work? Or you're on your mobile phone and you drive through a tunnel, or something? So yes, no one expects 100%.
Alex Bramley: And if it's just random failure, people are willing to just F5 and retry it. If it works that time, then they don't think about it anymore. It's sustained failure that starts to cause people's expectations to really drop.
Sven Johann: Yes. The question is whether failure happens too often or lasts too long... So I just don't know -- or in other words, the Google books say the SLO should be the dividing line between happy and unhappy customers. When I saw that, I thought "Okay, let's implement it." But it turned out -- I mean, no surprise, it turned out to be an incredibly hard question to answer. So okay, 100% is not really necessary, as we learned... But what is necessary? We briefly touched on that a couple of minutes ago, but let's dive deeper into the topic. So how do I get the initial values for a running system, for example?
Alex Bramley: This is gonna be a thing you hear me preaching a lot, because it's a very core part of the process, in my opinion... Finding something is mostly a case of choosing some initial numbers and then gathering data and iterating. Don't try and make your initial numbers too good. Don't wait, basically. Don't wait to try and make the numbers better and refine them before starting to measure them. Measure them as soon as possible, because the more data you have, the better decisions you can make.
Alex Bramley: The first thing on that note to remember is that your past performance is already an implicit SLO for your users, because your users' expectations of your service's reliability are gonna be anchored by their past experience of it. If you have a live system, you can simply start measuring SLIs you think capture your service's reliability, and base your initial targets on that performance once you have a few months' worth of data. You don't even need a goal to start off with. Just measure something you think might work as an SLI and see how it performs... Because you learn something new about your system. You might learn that your idea of an SLI was a bad one. So going through the effort of setting a goal for something that turned out to be a bad SLI - that's wasted effort. The more data you can gather beforehand, the better your initial targets will be.
Alex Bramley: If the SLI turns out to perform well - if it turns out to be a good proxy for the experience your users have when using your system - then you'll know from the monitoring history of that SLI that you can set a target you can meet over the short to medium term. And if your users are currently happy, then this should keep them happy, broadly speaking. You're not going to regress; you're not going to suddenly start performing less well, because now you've set a target that you know you can meet, and you're measuring that you're meeting that target.
Alex Bramley: And of course, this only works if your users are actually happy with the current level of service and reliability. If you know that they're not, maybe because they're loudly informing you of that fact via Twitter, then you know that you have to set your future targets higher than what your past performance indicates you can meet. This also means you've got some engineering work to do to improve the reliability of your systems. If you don't already have a running system, then things get a bit more difficult; then you have to make some educated guesses as to what level of unreliability will start negatively impacting your users. This is a kind of thought experiment. Will 1 in 100 page loads be an error? Will that be a minor nuisance or a frustrating experience? Will 1 error in 1,000 page loads be frustrating? Is taking half a second to respond really fast enough for 99% of your users?
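The thought experiment Alex describes for a brand-new service can be made concrete by translating candidate targets into "how many errors would users actually see" terms. The traffic figure below is an assumption purely for illustration.

```python
def errors_per_day(target: float, daily_page_loads: int) -> float:
    """Expected number of bad page loads per day if you serve exactly
    at the given reliability target."""
    return (1.0 - target) * daily_page_loads

# Assumed traffic for a hypothetical new service: 50,000 page loads/day.
daily_page_loads = 50_000
for target in (0.99, 0.999, 0.9999):
    n = errors_per_day(target, daily_page_loads)
    print(f"{target:.2%} target -> about {n:.0f} error pages per day")
```

Seeing "500 errors a day" versus "5 errors a day" side by side is often enough to get product folks and UX designers to an opinion, even without historical data.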
Alex Bramley: And even though you don't have concrete data to base these decisions on, you're not doing this in a vacuum. When you're developing a new product, your initial SLOs should be part of the product definition, so you'll need to involve product folks and user experience designers into these discussions. You can look at other companies with similar products in similar market segments and you can analyze their performance to see what level of service they are providing to their users.
Alex Bramley: And if no one's sure, as I said before, don't set targets yet. You don't have any users, so they can't be unhappy. You can come up with some SLIs and measure them as you start to introduce traffic into the system, and then you can set initial targets based on this data. The problem with targets based on historical data is that you don't know whether users might be happier, or use your system more, if it was more reliable. What this means is you should also try to have some idea of what reliability target the business thinks the service needs to meet as well. This might be to avoid breaching SLAs or contractual agreements with third parties, or because user engagement starts to drop off, or because customers start to leave when the target is missed. If you can tie the SLO target back to how the company makes money, then you're gonna have a much easier time selling it to the rest of the business.
Sven Johann: Yes, that's true. If you are able to find that metric... I think that's quite hard.
Alex Bramley: Absolutely.
Sven Johann: One thing you said in the beginning - just start as quickly as possible with measuring...
Alex Bramley: Yes.
Sven Johann: Let's assume we don't have the perfect monitoring system available yet... Because that's another thing - you really have to build the monitoring around it. And now I could say "Yeah, I'll just wait another two or three months until my monitoring is there..." So what I once proposed - and people thought it was too simplistic - is to just start by looking at your incidents. Just put them in an Excel sheet and work with that Excel sheet. It's totally simplistic, and it's missing a lot of interesting data, but at least you get started. You have something until you have your monitoring ready for SLOs. So is that something you also think is doable, or is that a bit too simplistic?
Alex Bramley: No, I think that's a great idea. If you don't have monitoring, then it's important to start gathering any data that you can. If it's gonna take you three months to build a monitoring system to start measuring any of this, that's three months you're wasting, that you could be gathering any data at all... Even if it's in an Excel spreadsheet.
Alex Bramley: Incomplete data, data that only has a few data points over the entire three-month period, is still more data than your non-existent monitoring system is gonna provide. I think that's valuable.
Alex Bramley: If you're doing it from an incident, then presumably that's coming as part of your post mortem process, or I'd hope it's part of your post mortem process. Any post mortem for an incident should be able to tell the company the impact on its users and on its revenue streams. The process of gathering that data is something you have to do for the post mortem anyway, in my opinion. Something you at least ought to be doing. Maybe I shouldn't be saying "You have to do it", because I don't wanna be forcing people into things... But it's something I believe you ought to be doing, because it's valuable to the company to understand the cost of that outage.
Alex Bramley: And then all your Excel spreadsheet is doing is like "Okay, I have X, Y, Z post mortems. This is the impact to the users, this is the cost to the company." You maybe wanna have some other tagging things, like which region you were serving from at the time, or which countries the users were most impacted in, or something like that... Because you're spotting patterns at this point, and it's still valuable to do that, and you can still use even the most incomplete of data to guide some engineering decisions, in the absence of a monitoring system that can tell you more.
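To make that concrete, here's a minimal sketch of the spreadsheet-as-data idea in Python. The field names, regions, and figures are all invented for illustration; the point is just that even a handful of tagged post-mortem rows lets you total the cost of unreliability and spot patterns by region.

```python
# A tiny incident log standing in for the Excel spreadsheet.
# All field names and numbers are made up for illustration.
incidents = [
    {"id": "PM-1", "region": "eu-west", "users_impacted": 12000, "cost_eur": 4500},
    {"id": "PM-2", "region": "eu-west", "users_impacted": 3000,  "cost_eur": 900},
    {"id": "PM-3", "region": "us-east", "users_impacted": 500,   "cost_eur": 150},
]

# Total cost of unreliability over the period.
total_cost = sum(i["cost_eur"] for i in incidents)

# Spot patterns: which region is hurting the most?
cost_by_region = {}
for i in incidents:
    cost_by_region[i["region"]] = cost_by_region.get(i["region"], 0) + i["cost_eur"]

print(total_cost)      # 5550
print(cost_by_region)  # {'eu-west': 5400, 'us-east': 150}
```

Even with only three data points, "eu-west is where the money is being lost" is enough signal to guide an engineering decision.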
Sven Johann: Yes, it sounds like incremental software development; start with something small and then grow it, and don't wait until you have something perfect in place.
Alex Bramley: Yes. You kind of raise some eyebrows when you say "Okay, this sprint we're having an outage, because I need to gather some data..."
Sven Johann: Yes. Okay, so now I collect those data, maybe with my perfect monitoring, or just with my Excel spreadsheet... And no one is complaining on Twitter, let's say, heavily, but I still do not perfectly know what makes my customers happy or not... And now I want to establish feedback loops to get better insight into what my customers really need. So what are your ideas for those feedback loops, to get better customer insights?
Alex Bramley: I think the feedback loop is really two things. Well, there's two halves to it, really. You've got to understand the customer experience of your service's reliability. So you're measuring your service from the perspective of your users and how they're receiving the service... And then you need to use that understanding to tune your SLO targets. One reasonable path towards the first half of this, towards understanding the customer experience, is simply to ask them. You need to find things that you can correlate with your monitoring signals.
Alex Bramley: I mentioned customer satisfaction surveys before - these are nice; they're good at small scales. They don't scale up to billions of customers. And one of the other problems with these is that they can be biased, because people often simply ignore these surveys, unless they're dissatisfied and they want to complain. But since you're mostly looking for evidence of unhappy customers, that's not a huge problem anyway.
Alex Bramley: If people are willing to tell you that you suck, then that's valuable data. And if you have customer support forms, or a ticket system, or things like that, it can provide a similar signal. And the same goes for social media, like Twitter or Facebook, although you need to take those with a slightly bigger pinch of salt, because it's very easy to be angry on Twitter.
Alex Bramley: Other things that can be very helpful here are product metrics, like engagement, session length etc. Your product folks should have some idea of what they want to track to understand how your users are interacting with the services you provide. Each of their launches, they're gonna want to prove to the business that the launch was a success. They're gonna have a metric that tells them that the launch was a success in their terms. Those metrics - you should not just measure them for the launch, you should measure them continuously over time... Because if they dip again, then the launch has suddenly become less successful, even if it's like two years down the line, and you wanna know why... And if that dip in a metric that told you the launch was successful correlates to an outage that you had, then that metric of customer engagement that the product folks like is also a metric that is useful to tell you that your customers are less happy with your services, just in general.
Alex Bramley: And they're often very different - they're often engagement metrics - but being able to tie an increase in the 500's served to users to a drop in user engagement is valuable, because it tells you "These 500's had this impact on the trust our users put in our services." Because as you said before, if there's an outage, I don't come back to the service as often. If they burn enough of my trust, I'm just not gonna come back. And that drop in engagement is users not coming back.
Alex Bramley: The last thing to factor in here are the business metrics, like revenue... Although in larger companies these can be increasingly difficult to get a hold of, because there's a lot of regulation and reporting requirements that mean they have to be kept secret... Like insider trading things, and stuff like that. I'm not a finance guy, but I've seen these things happen before. And they're also a lot harder to tie back to concrete service metrics, because they're often very much lagging indicators. You don't have a real-time signal - I mean, these days some companies do. I don't know if you've seen Shopify's real-time Black Friday Cyber Monday monitoring... It's fantastic.
Sven Johann: No, I haven't seen it.
Sven Johann: Oh, okay, okay.
Alex Bramley: It's amazing. But most companies that'll have this -- maybe there'll be an accounting job that runs at the end of the day, or maybe they only figure it out once a quarter, "How much money did we make?" But if you can get a real-time signal of that, and you can tie that back to your 500's, then you can gauge the revenue impact of any unreliability, because these errors are potential lost sales.
Alex Bramley: If you make your money via adverts rather than actually selling things to people, then each page view has an associated revenue. Each error is money down the drain. So you can tie the 500's you served to a lost ad revenue. And then when it comes to writing the post mortem, you can hopefully just write a script that says "Okay, so what's the average revenue per ad that we serve on this property, and how many pages did we not serve? ...therefore this is how much cash we've lost."
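The script Alex describes is essentially one multiplication. A hedged sketch, with the function name and all numbers invented for illustration:

```python
def estimated_ad_revenue_loss(error_count, avg_revenue_per_page):
    """Each failed page view is a page of ads that was never served."""
    return error_count * avg_revenue_per_page

# Hypothetical outage: 42,000 HTTP 500s served, and an assumed
# EUR 0.004 of ad revenue per page view on this property.
loss = estimated_ad_revenue_loss(42_000, 0.004)
print(f"Estimated ad revenue lost: EUR {loss:.2f}")  # EUR 168.00
```

The hard part in practice is not the arithmetic but getting hold of the average-revenue-per-page number, as Alex notes below.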
Sven Johann: Yes. On knowing how much cash we've lost - we need to dive a bit deeper. I mean, not here on the podcast, but in our project... Because it's not that you can say "I had a ten-minute outage and I lost ten minutes of potential revenue", because customers may come back... But still, it's interesting to see, if you had an outage, how much revenue you lost compared to last month's, for example.
Alex Bramley: I have a nice story on that... And this isn't like an approved one, so this might cause me to have to go back to the press folks, but... So in Play, and in fact in a lot of the Google services that deal with money, we do have the real-time graph -- not of dollar value, but of order numbers. So on the Play store you'd have the orders per second kind of ticking across, and it's one of the metrics we kept track of... And you could see outages, and you did see the crater from the outage, and you did see the spike afterwards, as people came back and then started buying whatever they were buying on the Play store again.
Alex Bramley: Someone with a much better grasp of statistics than I did wrote a script in R that would take two hours of monitoring data with a crater in it, and it would plot what it thought the trendline would have been through that dip and subsequent spike, and calculate the net loss of orders... Which was really great. It made writing the post mortems a lot easier, because you'd just point the script at your monitoring data and it would tell you how much money you'd lost. All we had to give it was the average order value per day, which is -- that was a thing that I managed to get access to, being one of the people on the Play Store SRE team, specifically for the purpose of writing the post mortems. So you'd just run the script and it'd be like "Ouch! That was expensive today..."
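The R script itself isn't public, but the idea can be sketched in a few lines of Python. This naive version just draws a straight-line trend between the traffic levels on either side of the crater and counts the missing orders against it; unlike the real script, it doesn't statistically model the post-outage catch-up spike. All numbers are invented.

```python
# Rough sketch: estimate orders lost during an outage "crater" by comparing
# observed orders/sec against a straight-line trend through the gap.
def estimate_lost_orders(rates, outage_start, outage_end, secs_per_sample=60):
    """rates: orders/sec sampled once per minute; outage window is [start, end)."""
    before_avg = sum(rates[:outage_start]) / outage_start
    after_avg = sum(rates[outage_end:]) / len(rates[outage_end:])
    n = outage_end - outage_start
    lost = 0.0
    for i in range(n):
        # Linear trend between the averages on either side of the crater.
        expected = before_avg + (after_avg - before_avg) * (i + 1) / (n + 1)
        lost += (expected - rates[outage_start + i]) * secs_per_sample
    return lost

# Ten minutes of steady traffic, a four-minute crater, then recovery
# (including the spike of returning buyers).
rates = [50, 51, 49, 50, 52, 50, 51, 50, 49, 50,
         5, 2, 1, 8,
         60, 58, 52, 51, 50, 50]
print(round(estimate_lost_orders(rates, 10, 14)), "orders lost")
```

Multiply the lost-order count by the average order value and the post mortem's revenue-impact line writes itself.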
Sven Johann: That's basically perfect to discuss with the product people, right? You can say "It costs us X amount of money to make things more reliable, and if you look at our numbers, our unreliability cost Y euros over the last month or year", and then you really have nice numbers to see how far do we want to get... Because of course, reliability, as you said, each 9 costs you a factor of ten.
Sven Johann: A simpler way - what you said with Net Promoter Score... Or you just ask customers if they're happy. In the beginning, I thought this doesn't work, but I have to admit that my experience with that is actually a good indication... Because I reached out to the customer happiness team, and they said "Yeah, we ask our customers on a regular basis, and usually they complain that we need more features (for example), and there was a time they complained about performance, but that's now fixed, so nobody complains about performance anymore... So it seems that we are quite good at performance. And nobody complains about availability, except in this and that region." I was quite surprised that you get a very simple answer and you can work with that answer.
Alex Bramley: You're looking at broad brush strokes anyway, right? You don't need really fine-grained data, because it tends to be too much. As you said, just the signal of like "In this region customers are complaining about reliability more than the customers in other regions" - that's enough of a signal for you to go to a software engineer, "Right, your job for this sprint is to figure out why."
Sven Johann: Exactly, exactly. And the thing is no one is asking "Is that really necessary?" because everyone knows "Yeah, we need to put engineering effort into that piece, because we know that customers are complaining about it." So it's not so hard to get this work into a sprint.
Sven Johann: Okay, I mentioned performance or latency; in this one specific case customers complained about latency. If you work with customers internally at Google, or outside, with customer reliability engineering, what is your approach to defining latency SLO? With what numbers should I start? For example, in my opinion, a search or product detail page must be really fast, because I do that quite often, and I expect that to have very low latency.
Sven Johann: If I then come to the checkout and payment, I'm actually happy if it's a bit slower, because I search and look at products quite often, and checkout and payment happens only once, so I'm okay with it if it's slower... But what's a good reference point to get started with to understand good latency requirements?
Alex Bramley: I'm gonna preach the same as I've always preached - you measure what you can do, and start from there. That's the easiest thing to get started with - you're not working in a vacuum, you have data on how fast you are already; let's just start with that. What you don't want to do is get worse.
Alex Bramley: I think payments are an interesting example, because we're talking about anchored expectations here, people already using your service. They know how fast it usually is in their experience. If you get worse than that, they will start getting less happy... And payments are a great example of anchored expectations, because everyone knows that it takes a few seconds for the transaction to complete when you're buying something on the internet... So you don't expect it to be instant. And this is mostly fine, because the transaction is usually the culmination of a whole session of searching and browsing and comparing... Like you were saying, the search and product detail pages need to be fast, because you've got to bang through 50 of them to find that nice Christmas present for your mom, or whatever... But you'd be pretty annoyed if every click of those 50 or so clicks to find the right present took 3-4 seconds to return a response. You'd have to spend -- instead of five minutes poking through Amazon, you'd spend 20 minutes poking through Amazon, and by that time you'd be like "God, why isn't this experience over already?" So you're right, those pages need to load considerably faster.
Alex Bramley: The other thing to consider here is where you're measuring the latency. Are you measuring from the client, including the roundtrip time from your users to your servers? This is the most accurate way to assess what your users are experiencing, because they're there on the other end of that maybe long piece of string, going "Load... Load, please... Now..." But there's a lot of variability inherent in that measure, and you don't have a huge amount of control over a lot of that variability either.
Sven Johann: Yes, that's a question I'm asking myself for a longer time, "What is the best place to measure, if a best place exists?" For example, as you said, the most realistic part is to measure directly at the end user. For example - I put that in the show notes - there is something from Google, the largest contentful paint... It's the part on the web browser, for example, where the user can see the most important stuff... And that is a really good indication what the user really sees and needs in the web browser, if it's a web app, of course.
Alex Bramley: Yeah, I agree, the best place is not really where you wanna be, because there's usually not one best place... And it's kind of an engineering decision you need to make given the system that you're measuring and the expectations and behavior of your users is where you measure your latency. I can suggest some rules of thumb, I think, that can help with making those decisions though...
Alex Bramley: The first one is to try and minimize the variability that I mentioned just now. If you can't control it and it can go from 100 to 1,000 milliseconds without you changing anything in your system, then it's very hard to make an SLI that measures that and gives you a useful signal. And it's much more of a concern with things like latency. Because an HTTP 500 is the same no matter where it's measured. You're not going to have a 200 leaving your load balancer that suddenly turns into a 500 by the time it reaches the user, unless someone's playing terrible games on the return path of your serving... And they may be.
Alex Bramley: The canonical example of something like this happening is a company proxies the web browsing of its employees through proxies, for monitoring reasons and various other things... And they're legally allowed to do this kind of stuff. But if that proxy is malfunctioning, like you get the request, you send it back and then the proxy serves a 500 to your user, you have no control over that. But in general, HTTP 500 is the same no matter where it's measured, but a 200 millisecond response time measured at your load balancers - it could translate into several seconds if you're measuring it at the wrong end of a poor mobile connection.
Alex Bramley: As I said, this variability reduces the signal-to-noise ratio of your SLI, because you can't tell whether the deterioration in performance is a natural variation, because someone's walking around in a spotty mobile signal area, or because your service is running slowly.
Alex Bramley: Another thing to ask yourself is what proportion of this thing that I'm measuring is down to factors beyond my control. If you're serving from Western Europe, but you have users accessing these services in South Africa, those users are going to have a completely different experience, and probably completely different expectations from your European users.
Alex Bramley: The distance between the two places is going to incur a couple of hundred milliseconds of latency alone, and that's before you factor in the differences in capacity and connectivity of the local internet infrastructure. The only way you can materially change the experience of these folks in South Africa is to serve them locally... And this may not be possible for any number of reasons, like regulation, or just the sheer cost of setting up another data center significantly further away from where your business is.
Alex Bramley: Excluding this roundtrip time from your SLIs and only measuring latency incurred by the infrastructure that you control directly is a reasonable choice to make here. And this brings me to the last point that I wanna make here, which is that it's useful to think about the mechanics of measurement. Your request latency forms a distribution of values, which is normally tracked as a metric by quantizing these individual latency measurements into buckets. Each bucket counts the number of requests within a slice of the latency distribution, like say 50 to 100 milliseconds.
Alex Bramley: To measure a latency SLO with 100% accuracy with a metric like this, you have to ensure that the threshold that counts as too slow falls exactly on a bucket boundary... Because if you don't, you're at the mercy of whatever intrabucket interpolation method is used to calculate the number of events that were above that threshold - usually it's just linear. For large bucket sizes and low event rates, which can happen pretty regularly when you're measuring long-tail latency, this can result in a really significant loss of measurement accuracy. So you don't really know whether you're meeting your SLO, because you're at the mercy of where in that linear interpolation between the two bucket boundaries your 99th percentile happens to fall, for example.
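Alex's interpolation problem can be shown in a few lines. A sketch with invented bucket boundaries and counts: the threshold lands in the middle of a bucket, so linear interpolation has to guess how many of that bucket's requests were fast enough.

```python
# A latency histogram bucket spans 400-600 ms and holds 10 requests.
# The SLO threshold of 500 ms falls in the middle of it.
bucket_lo, bucket_hi, bucket_count = 400, 600, 10
threshold = 500

# Linear interpolation assumes requests are spread evenly across the bucket.
assumed_good = bucket_count * (threshold - bucket_lo) / (bucket_hi - bucket_lo)
print(assumed_good)  # 5.0 -- but in reality anywhere from 0 to 10 of these
                     # requests may actually have been under 500 ms.

# If a bucket boundary fell exactly at 500 ms instead, there'd be no guessing:
# every request in buckets below the boundary is definitely "good".
```

With low event rates in the tail, being off by a few requests here can be the difference between meeting and missing a 99.9% target.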
Sven Johann: Yes, talking about percentiles, or percentile buckets - do you have an opinion on which percentile to choose? For our listeners - what is a percentile, actually? A percentile tells you how many requests you look at from a sample. For example, if I have 100 requests, the 99th percentile looks at the 99 fastest requests and ignores the slowest one. And the 75th percentile looks at the 75 fastest requests and ignores the 25 slowest requests.
Sven Johann: One question I'm always -- I mean, it got a bit better, but it's not a super-easy question to choose the right percentile. So what is your recommendation on choosing percentiles?
Alex Bramley: If you're defining your latency SLIs in the way that "The Art of SLOs" suggests, you separate your requests into good and bad events. The alternative is choosing a static percentile and measuring the latency at that percentile - like you say, "I wanna find the 99th percentile latency", where you order every 100 requests and pick the latency of the 99th one. The problem is that the latency you're measuring that way fluctuates around.
Alex Bramley: Instead of choosing a static percentile and measuring this fluctuating percentile latency, you set a static latency threshold like 500 milliseconds. That divides your distribution into good versus bad. The requests that are faster than this 500 millisecond threshold are good, and those that aren't are bad... And this is easy to turn into an SLI, because you can just measure the fraction of requests that are good.
Alex Bramley: When you set an SLO goal, you're choosing what percent of requests need to meet your latency threshold for the service to be reliable enough. So instead of saying "I want my 99th percentile latency to be 500 milliseconds", you're saying "I want 99% of my requests to be faster than 500 milliseconds." Which is a very slight difference, but in terms of measurements, it makes the problem a lot more tractable.
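The good/bad-events SLI Alex describes is trivially small in code. A minimal sketch, with the function name and the sample latencies invented:

```python
# Good-events latency SLI with a static threshold: the SLI is simply the
# fraction of requests that were fast enough.
def latency_sli(latencies_ms, threshold_ms=500):
    """Fraction of requests faster than the threshold."""
    good = sum(1 for l in latencies_ms if l < threshold_ms)
    return good / len(latencies_ms)

# Invented sample: mostly fast requests, with a couple in the long tail.
samples = [120, 95, 340, 180, 220, 760, 150, 90, 1900, 210]
sli = latency_sli(samples)
print(f"{sli:.0%} of requests were faster than 500 ms")  # 80% ...
print("SLO met:", sli >= 0.99)                           # False
```

Note there's no percentile calculation anywhere: the 99% only appears when you compare the measured fraction against the SLO target.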
Sven Johann: Yes, that's true. The problem is if I only look at percentiles and I have this very bad moment in a certain month where I don't meet my percentile... I have this binary decision that I didn't make my percentile, or I left my percentile... But if I just measure good and bad events - okay, it's just a bad event, for example.
Alex Bramley: I guess it depends, because when you start talking about the fraction of requests rather than percentile latency, you can end up with a very long tail skewing the percentile latency - and that's because of the interpolation I was talking about before. But if 99.95% of your requests were still faster than the 500 millisecond threshold, then everyone's still broadly speaking happy. So yeah, I think it can be more helpful to measure things when you've got the static threshold.
Alex Bramley: I think also that making the latency threshold constant means that the SLO is easy to reason about, for the reasons we've just discussed, but it also helps when discussing with other parts of the business. Your product folks - they're almost certainly gonna have a specific goal in mind when it comes to the tail latency. This is the small fraction of responses that are significantly slower than the rest, whether it be for garbage collection or whatever other reasons. Tracking and minimizing this latency is an important part of optimizing the user experience for the product folks.
Alex Bramley: When your users are interacting with the service regularly, even if only one in a hundred requests is really slow, when they've got an average browsing session for a user making 100 requests, then that means they're almost guaranteed to have one slow response every time they use your service. And having to noticeably wait for that response will stick in their mind.
Alex Bramley: If we go back to the example of browsing Amazon, or whatever shopping site for Christmas presents - that one time where the page didn't load quickly. It's not gonna stick in your mind. But when you're sitting there -- and it's always the one that you really wanted to look at as well, isn't it? It's always the one product you think "Oh, this is the one, this is the one", and then it takes three seconds to load rather than being damn quick and you're sitting there going "Why...!?"
Alex Bramley: So like other SLOs, what you're trying to do with the latency SLO is you're modeling the relationship between user happiness and response latency. The example I was giving there is a nice illustration of that. But to get technical for a second, this relationship between user happiness and response latency - it kind of follows an inverse sigmoidal curve. I can't describe in words what this looks like, so please image-search for this term, because this is one of those examples where a picture really is worth a thousand words.
Alex Bramley: So it follows this inverse sigmoidal curve, with increasing latency on the X axis, and percentage of users that are happy on the Y axis. So when you set a latency threshold and an associated SLO target, you're approximating this curve with a step function... Like, instead of it being a nice, slow drop in happiness as latency increases, you're approximating with this thing where users are 100% happy until they hit this threshold of 500 milliseconds, and then they're 100% unhappy.
Alex Bramley: This is obviously a pretty crude approximation. Your users are not gonna go from 100% happy at 498 milliseconds to 100% unhappy at 502 milliseconds. That is just not how people work. And this approximation forces you to aim for the long tail, because that becomes the top priority.
Alex Bramley: You're normally talking about 99% of your responses being faster than 500 milliseconds, and the problem with this is that you could have 98% of your responses being served in exactly 498 milliseconds and still be within SLO. And because of what we were talking about just now, you know that you've ensured your long tail is not too long, but at what cost? Because your users probably still aren't that happy about the situation. If every page load takes 498 milliseconds, you're within SLO, but your users are probably still grumbling quite a bit.
Sven Johann: Yes, yes.
Alex Bramley: So depending on the shape of your particular curve, depending on how quickly happiness drops off as latency increases, you can get a better approximation of the curve by creating two latency SLIs instead of one. You might say that 75% of your requests should complete in 200 milliseconds, and 99% of them should complete in 500 milliseconds, for example. This extra SLI ensures that only a quarter of your requests can cause your users to grumble a bit before you start doing something about it.
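Combining the two thresholds is a small extension of the single good/bad SLI. A sketch, with invented names and data, using exactly the targets mentioned above (75% under 200 ms, 99% under 500 ms):

```python
# Two latency SLIs approximating the happiness curve better than one:
# every (threshold_ms, target_fraction) pair must be met.
def meets_slo(latencies_ms, slos=((200, 0.75), (500, 0.99))):
    n = len(latencies_ms)
    for threshold, target in slos:
        good = sum(1 for l in latencies_ms if l < threshold)
        if good / n < target:
            return False
    return True

# 98 requests at a sluggish 498 ms would pass a lone 99%-under-500ms SLO,
# but the 75%-under-200ms SLI now catches the grumbling users.
sluggish = [498] * 98 + [120, 150]
print(meets_slo(sluggish))             # False: only 2% are under 200 ms
print(meets_slo([120] * 99 + [800]))   # True: both targets met
```

This directly addresses the 498-millisecond failure mode from the earlier example: uniformly sluggish traffic no longer looks healthy.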
Sven Johann: Yes.
Sven Johann: No problem. I've just had my -- I don't want to say my little epiphany, but... What you're saying is perfectly right, but most of the tooling you can buy is looking at percentiles. That's a little bit of a problem. I'm so deep into percentiles, and approaching it that way requires some rethinking of tooling, and of how to think about performance.
Alex Bramley: Yes, I agree. But the trick here, I think, is to remember that underlying any percentile is a distribution measurement. And as I said before, as long as one of your bucket boundaries is where your SLO threshold for latency is, then you can measure it accurately just from the distribution. Because you sum all of the buckets that are below the boundary, and that's your good, and then you take the whole distribution sum as your valid requests. So if your percentiles are being measured from an underlying distribution, you should be okay without having to change things too much. You just have to change how you aggregate the data - because a percentile is just a statistical function. So instead of applying the percentile statistical function to your distribution, you sum the buckets that are below the threshold and you sum all of the buckets, and that gives you your good and your valid.
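Alex's good-over-valid aggregation can be sketched directly against a latency histogram. The bucket boundaries and counts here are invented; the key assumption, as he says, is that one boundary sits exactly on the 500 ms threshold:

```python
# Computing the SLI straight from a latency histogram, with no percentile
# function needed. One bucket boundary is aligned with the 500 ms threshold.
buckets = {  # (lower_ms, upper_ms) -> request count, all numbers invented
    (0, 100): 7000,
    (100, 250): 2200,
    (250, 500): 600,
    (500, 1000): 150,
    (1000, 5000): 50,
}

threshold = 500
good = sum(c for (lo, hi), c in buckets.items() if hi <= threshold)   # 9800
valid = sum(buckets.values())                                         # 10000
print(f"SLI: {good / valid:.2%}")  # 98.00%
```

Because the threshold is a boundary, no intrabucket interpolation is needed, which sidesteps the accuracy loss discussed earlier.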
Sven Johann: Thank you for listening to this first episode. Please also consider listening to the second episode, where we will cover availability SLOs and availability requirements, dependency graphs and service availability calculus, how to deal with unknown availability SLOs, communicating availability SLOs, the SLI menu, where should we measure what, which measurement has which pros and cons, when and how to combine those measurements, and finally, reporting windows.