Transcript
Sven Johann: Hello everyone, and welcome to a new conversation about software engineering. Today is the third and last episode of our three-part series on service-level objectives with Alex Bramley, site reliability engineer at Google, who is also the author of the open source "The Art of SLOs" training.
Sven Johann: In this last episode we are going to cover error budget policies. We talk about developing, communicating and sharing error budget policies, and we will discuss alerting, especially alerting on service-level objectives. Rather than looking at single individual problems, which cause alert fatigue, we're looking at the burn rate of an error budget - whatever that means, we are going to discuss it.
Sven Johann: Alright, please enjoy our last episode.
Sven Johann: We are talking about measurements... I already asked how to decide if I met my SLO or not if I have synthetic clients. That was one thing I asked. But what do I need to think of if it gets even more complicated, if I have multiple types of clients, like a web browser, a mobile app, or maybe even an IoT device - like Netflix has different kinds of devices talking to their API... Then I possibly need to measure multiple endpoints. The browser returns HTML, but there is also an API endpoint for the mobile app and the IoT device, returning JSON. This is also true for use cases which require multiple actions, for example the checkout process of a web shop. How do I calculate SLOs for multiple endpoints?
Alex Bramley: I think the problem here isn't so much the calculating of lots of SLOs, but you're getting a lot of information overload that's coming from measuring things in such a fine-grained manner. So let's deal with the specifics of the question first. Measuring an individual endpoint may be too fine-grained when it's possible to have tens, even hundreds of them. If you've got a lot of endpoints that serve related traffic at similar request rates, you can probably group them together for measurement purposes.
Alex Bramley: I gave an example for this -- it's not in The Art of SLOs, because we had to really cut that down for the purposes of getting the theory part done in the morning... But in the longer SLO training that CRE gives the customers we work with, we have an example from the Play Store, which I worked on. It goes into how we created SLO buckets for certain types of requests. We had a browse bucket which contained the details pages, people looking at the frontpage, and people trying to search for particular things on the Play Store... Because if you want to get to a particular piece of content, those are the things you're generally gonna be doing, and they all have relatively similar request rates. People hit the frontpage, and then they either click on the detail of an app, or a book, or a movie, or whatever they want to buy... Or they search for what they want, because they know what they're looking for. Those are all within an order of magnitude of each other in request rates, and they're all served by the same underlying set of jobs, apart from search, which obviously has a more specialized thing. And they also all fail at relatively similar rates.
Alex Bramley: So you can just take the requests and responses and sum everything together as an SLO for the browse bucket, as opposed to having 4-5 different SLOs for the 4-5 different endpoints which are doing relatively similar things.
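To make the bucketing idea concrete, here is a minimal sketch of summing good and valid events across related endpoints before computing a single SLI for the bucket. The endpoint names and counts are purely illustrative, not taken from the Play Store or The Art of SLOs.

```python
# Minimal sketch: group related endpoints into one "browse" SLI bucket by
# summing their good and valid (total) request counts first, then computing
# a single availability SLI for the whole bucket. All numbers are made up.

browse_bucket = {
    # endpoint: (good_requests, valid_requests) over the SLO window
    "/frontpage": (998_500, 1_000_000),
    "/details":   (1_995_000, 2_000_000),
    "/search":    (498_000, 500_000),
}

good = sum(g for g, _ in browse_bucket.values())
valid = sum(v for _, v in browse_bucket.values())

sli = 100.0 * good / valid  # one availability SLI for the browse bucket
print(f"browse bucket availability: {sli:.3f}%")
```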
Alex Bramley: So that's one way of approaching it... You can cut down the number of SLOs you have, so you're not getting so information-overloaded by trying to find patterns of similar things and aggregating them together. But I think sometimes a better approach is to have an SLO that covers a particular use case start to finish. I mentioned before that the synthetic clients - you'd have one that tried to go through the whole checkout process, from start to finish. So it would load the frontpage of your shop, it would do a search for a particular item or navigate to a particular item, and then it would attempt to stick it in the cart, and it would attempt to buy it with a fake credit card that would be recognized by your system... And that kind of thing tests everything in a similar way that a user would test it, so in some ways it's a more accurate representation of the user experience.
Sven Johann: Yes, that's true.
Alex Bramley: What you really want to know there is whether a user encountered an error during the entire process, right? This takes more effort to set up. If you don't have a synthetic client doing it, you probably have to do some kind of after-the-fact session reconstruction from your logs, which can make your time measurements harder, but this more accurate understanding of your customer experience means that you don't have to worry so much about tracking all of the multiple endpoints separately. Instead of having the -- so we said the frontpage, the details page, the search, the add to cart and then the checkout, and then probably there's a confirmation. That's six different potential handlers which you're rolling up into one higher-level user journey, and really it's whether people can successfully complete that user journey that you care about.
Sven Johann: Yes.
Alex Bramley: Now, talking more generally, once you start dicing your SLOs into dimensions like client type, the number of things you're measuring can really snowball quite quickly. You're doing it like a cross-product of all of the different dimensions. And I think there's a couple of things to keep in mind here. First, do you really care about the dimensions? Sometimes it can just be fine to pretend everything is equal, rather than sliced up along a particular dimension. Or to avoid creating an SLO for specific types of traffic, because it's just not important enough. Say, if you know that 90% of the traffic to this handler is a Google bot scraping your service or something like that, then maybe it's not such a big deal.
Alex Bramley: And if you are sure that you need your SLOs broken down like this, then try to avoid considering more than one dimension at a time in each SLO. You talked about like -- what did you talk about? I don't know... Sorry.
Sven Johann: I have a mobile client and I have the web browser, and I potentially have some IoT device, for example.
Alex Bramley: Okay. That's kind of one dimension. That's the client type. But you also mentioned multiple endpoints. So your second dimension is server endpoints. Maybe you care about what country your users are in, or something like that. The resulting cardinality when you combine all of these three things is massive. Do you really want to measure an SLO specifically for your mobile users hitting the add-to-cart endpoint from Germany? Maybe you do; maybe it's specifically those users that are having a real problem. But if that's the case, it's probably only going to be a transient thing, because you've got a bug for those specific users, and so you don't need to measure an SLO long-term for that very small particular cohort. You wanna have some way of spinning up temporary monitoring, so you can track the progress of fixing the bug, and then you just kind of throw it away again afterwards. Or you track the metrics, but you don't turn them into an SLI.
Alex Bramley: SLIs are for very specific things. I'm not saying you should replace all of your monitoring with SLIs. That's not gonna help you. Because SLIs just tell you something is wrong, and you need all the other metrics to tell you what's wrong and really dig into the details.
Sven Johann: Yes, that's something which I think is sometimes confusing, because I've heard in the past here and there "Only measure your SLOs, and the rest magically works." But that's not how it works.
Alex Bramley: No, definitely not. SLOs are giving you a reduced problem space, especially for your operations teams. You should be able to translate directly from an SLO to a particular group of users being harmed in a particular way, and that should be enough information to narrow down where they need to go and look in the systems to find where the bug is and find the cause. SLOs should be symptoms-based. They should be close to your users and close to your users' experience. And the closer you are to your users' experience, generally the further you are away from the real cause of whatever the problem is.
Sven Johann: Yes. Alright, cool. So from the measuring perspective, all my questions are answered...
Alex Bramley: Just one more thing I wanted to say... I know we're skipping a couple of bits, but... The last thing - because it pertains to SLOs rather than SLIs - is try not to have lots of different SLO goals and, for example, latency thresholds. Everyone has a preference for round numbers, so find the nearest 100 or 500 milliseconds and stick to those across all of your SLOs. It will make it easier for everyone.
Alex Bramley: Similarly, we always tend to use a round number of nines, like 3 nines, or 4 nines, or maybe 3.5 nines... But if you stick to these, then people will be able to reason about your data more easily, because they don't have to go and check out "Well, this particular SLO has a goal of 2.5 nines, or it's 99.8% for this one, because reasons..." I think that's a valuable extra thing to think about.
Sven Johann: Yes, I can easily second that, because my first attempt with SLOs was too fine-grained, and different endpoints had different numbers, and I confused a lot of people... Because you know, "Why does this endpoint--" I can't remember in detail, but this endpoint has 150 milliseconds, and the other one has 250, and stuff like that, and it was really confusing for everyone.
Alex Bramley: Yeah. I think that's a very common thing, because it's a process of iteration. All through this, I've been recommending that you start with things like load balancer metrics, which are there already... And the thing about that is it's quite seductively easy to break down by endpoint, because the load balancer metrics automatically do that for you. You're like, "Well, okay, let's look at this dimension, because it's there and it's easy." But the thing to remember is that you've got to be trying to take it back to the user experience every time. Your user doesn't care about the endpoint that they're hitting, they care about seeing the web page, and buying the thing.
Alex Bramley: So I think load balancer metrics and looking at individual endpoints is a great place to start, but there's got to be some kind of iteration process towards having SLOs that cover a user journey at some point, because those are the higher-level things that your users actually care about. Like, what goal are they trying to accomplish with your service.
Sven Johann: Alright. Shifting gears a little bit again... So now we know how and what to measure, and how to report. So now I look at my measurements, and if I'm lucky, I'm within my SLO, I met my requirements, but at some point I will fail to meet my SLOs, and it is good to be prepared and get an agreement before that happens, to know what needs to be done next. So the SRE book recommends an error budget policy. Could you describe what an error budget policy is? What are its parts, and so on?
Alex Bramley: Sure, no problem. It just so happens that I wrote a three-part blog series on this topic on the CRE Life Lessons blog a couple of years ago... And I'm sure all of your listeners are avid readers of the CRE blog, but I'm happy to go over it again for the few people who haven't been keeping up with all of our output.
Sven Johann: I will also put the links in the show notes for more detailed reading.
Alex Bramley: Cool, thank you. But if people need to find it, searching for "CRE life lessons" will get you to all of the things. So one of the major goals when you're creating SLOs for a service is to have that feedback loop that we talked about, that regulates the pace of change so that you're keeping the service reliable enough. But you can't have this regulation happen, the feedback loop doesn't really work unless there are consequences when the service is out of SLO.
Alex Bramley: We've talked about the SLO being the dividing line between your happy customers and your unhappy customers, and I've kind of also talked about it being the dividing line between having engineering resources working on -- having people. I hate the word "resources" when fundamentally they are human beings doing the work here... But it's the dividing line between having people working on new features that the company needs, or improving the reliability, so your users are still happy.
Alex Bramley: The thing here is -- I talk about the dividing line, but it's never as clear-cut as that. We said you don't have a situation where at 498 milliseconds everybody is happy, and at 502 milliseconds everybody is unhappy; you don't have that kind of black and white thing. In the same vein, your company doesn't have a black and white thing where, on one hand, 100% of the work is on features, and on the other hand 100% of the work is on reliability. Even when you're way out of SLO, there's always gonna be some feature work happening. And even when a service is operating well within its SLOs, it's sometimes well worth doing some proactive reliability work to reduce the risk of future outages.
Alex Bramley: So when you're writing an error budget policy, you're trying to describe how your company should make this shift of engineering effort towards improving reliability for a service, based on the past SLO performance over the last 28 days, over the last quarter, or maybe even just over the last hour. And you're completely right that you want to have this prepared and agreed in advance, because having it written down and approved by people with decision-making power means that everyone has the same understanding of what's gonna happen when your SLOs are missed, and what the consequences are if that doesn't happen.
Alex Bramley: The last thing you want when you're in the middle of a big outage is having people arguing about what should be happening. Everyone should be focused on making the outage stop, mitigating it so that your users aren't harmed. And if you're having a political argument about whether it's right to roll back, or who's gonna be fixing the bugs, then you're not helping your users at that point.
Alex Bramley: I think it's really helpful when the policy starts well before your first long-term, 28-day SLO is missed. I think the first place you should start is by saying "What triggers an operational response?" Because really, the first change in stance towards reliability is someone in your operations organization getting paged and going to investigate the problem. Your service was reliable, and now your monitoring systems are telling you it's no longer reliable enough, and somebody needs to go do something.
Alex Bramley: So that is a change in stance towards reliability, right? Beforehand, that operations person was carrying on with their project work, they were trying to make tomorrow better than today; now the monitoring systems are telling them "Today is not good enough. Go do something to fix it." One of the reasons I like to start here is because -- if you've got a split responsibility, where you have separate operations and development teams, it makes it clear to the development team what responsibilities the operations team has and how much work they have to do before they can say "Okay, this is now a reliability bug. I need you to stop feature work to do something to fix it."
Sven Johann: I think it's also good for product owners, because usually a product owner or product management - they do not really know how to prioritize work which is not feature work. If it's architectural work, reliability work, it's not so easy to prioritize. But if you have the requirements as an SLO and you missed it, then I think it's easier for them to then understand "Okay, I'm about to miss it, and now I need to invest more time into reliability."
Alex Bramley: I have a theory on this... Operations folks tend to only be considering things from the perspective of "Oh, we've got to make reliability better." And product folks are often thinking about things just in terms of "Well, I need to get this feature shipped." Really, to harmonize these two views, I think it's best to align both of them in terms of keeping the user happy, or making the user happier. You've got this kind of spectrum of user happiness, and operations folks are generally thinking about "I need to make the user less unhappy, because I am responding to this outage and I am trying to stop the SLO burn." And the product folks are thinking "I need to make the user more happy by designing this delightful new feature that will make their life easier."
Alex Bramley: When you think about it in these terms, if you can bring everything down to an increment or decrement in the happiness of your users, you can make trade-offs against feature work and reliability work.
Alex Bramley: So if you can phrase your reliability work in terms of "I think this will increase our users' happiness by this much", then your product person can say "Well, okay, I think my feature is only gonna increase the happiness of users by this much. I can see that there is a lot of harm to users happening because this reliability problem isn't fixed, so I can prioritize the reliability work above the feature work, because it has more of an impact on the happiness of our users."
Sven Johann: Yes, that's true. I think it sounds so simple to just say "Take the perspective of your user..." Of course. But we are all living in our silos, and we are only seeing our stuff, and we rarely look at our users as a whole and say "Okay, maybe our users need less of the stuff I'm building right now. They need more of the things other people do, so I hold back with my requirements, because my requirements are not as important as maybe other requirements."
Alex Bramley: It's difficult. It's especially difficult for product folks when they've been banging away on the design and the iteration of this product, and they fundamentally believe that it's gonna do great things for the users, and then to have an outage come along and everyone goes "Well, actually, we've gotta make our users happy by just fixing these problems first." It's gotta be pretty painful to see, but for the users it may still be better. It's a difficult thing to let go of something that you've been holding dear and trying to get released for a while, because actually it's better for the users to focus on other things. I can see that being difficult.
Sven Johann: I think so. But you have to start somewhere, and I think it's a simple idea, but I'm not -- what is it, simple but not easy? I think that's the saying.
Alex Bramley: Yes. There's a lot of politics involved. And no one ever thinks politics is easy.
Sven Johann: So yeah - error budget policy is a good thing to have. Someone has to create it... Who is involved in creating an error budget policy? How do I do that?
Alex Bramley: It can be difficult. It's important to involve everyone who's got some responsibility for the product, or towards the users, or towards keeping the service reliable. Usually you'll have stakeholders in the operations organization, the development organization, and the product folks, and it's critical to have an executive sponsor as well. When there are differences of opinion as to what is the highest priority - say you've got the product person who can't quite let go of their feature and is refusing to prioritize reliability work because they want to get the feature out of the door, but you're out of SLO - that can be a really bad political situation, because you're gonna have two competing views of what is the most important thing, and you need to have a single person who is empowered and equipped to make a decision there.
Alex Bramley: Ideally, the decision should come from enforcing the policy. So your policy - in this situation where you've got two competing interests - should say clearly what is going to happen. And the person who's making the decision just has to say "Look, the policy says this. This is what's happening." And eventually, people will stop having to go to that executive sponsor to enforce it, because they'll know that the policy is just going to be enforced, blanket enforced, without any questions.
Alex Bramley: But when you're starting off, it usually comes down to "But the policy doesn't apply to me in this particular case, does it? Surely not." Those kinds of questions. So it's important to have someone with the executive power to enforce it for those kinds of things when they crop up.
Sven Johann: Yes, that's also a thing... It sounds simple, but getting someone with executive power and convincing that person to maybe stop developing features and concentrate on reliability - yes, that can be a difficult thing.
Alex Bramley: Yes, I agree. I think it is one of the most difficult things, in fact... We've kind of talked about it - it usually comes down to the same things as with the product folks; they need to see it in terms of the user happiness. You can also put it in terms of the reputational damage to the company. We've talked a lot about user trust at the start of this... And I think someone who's an executive for a company should have some understanding of the reputational costs of an outage...
Alex Bramley: Say you're an operations person and you're trying to create your first error budget policy, trying to get buy-in from someone in a leadership role. If you have data from post-mortems that can show "Well, this outage cost the company this many dollars, this outage cost the company this much in user trust and engagement, and if you sum across all of the outages in the last six months, this is the opportunity cost that we've lost", then come to those kinds of people with data. That's what a lot of people in those kinds of roles operate on. If you give them data, they should be able to make a reasonable decision. And if you don't have data that's good enough, then you can't necessarily expect the decision to go your way.
Alex Bramley: The product folks will have data on how much value each feature is bringing into the company, because they have to make a business case for the feature to be allowed to command some engineering resources, to have that feature built. Take the specific example we were talking about, with the feature that is being pushed forward to production when the service is out of SLO, and the operations team would prefer the developer effort was focused on improving reliability in the short term. If you can show that the cost of an outage is going to be more damaging to the company than the value that the feature will bring over the next couple of weeks, it should be a relatively easy decision that actually the right thing to do is pause the feature. But you do need to have good data, and that can be hard to get.
Alex Bramley: The thing about the policy is that it's kind of a couple of steps down the line. You can measure your SLIs first, you can start gathering data on impact - you can do all of that before you really start creating the feedback loop. You only really need the error budget policy when you want the data that you're gathering in your SLIs and the output of your SLOs to actually modulate the engineering effort of your company. So at that point you should have at least 3-6 months' worth of data to prove that your SLOs are working correctly, they're detecting outages, they are a good representation of your user experience... So when the SLO dips, you can show that users are getting hurt. And ideally - although this is really challenging in lots of cases - you can quantify that cost to the company. If the SLO goes from 99.9 to 98 - how much is that costing your company? If you can show that, you can win many arguments. But it's hard to do.
Sven Johann: Yes. As engineers, we are almost used to that. If you want to pay down technical debt, or if you want to do something else related to architecture which is not feature development, you always have to explain what the value of what you're doing really is, because maybe people just think you want to play around... I think as engineers we always have to gather data to explain to decision-makers why the non-feature work is really necessary, and what value it brings to the business.
Alex Bramley: Yes. And if you think about it, it's only fair. The product folks have to do this all the time to justify their features... If you want to say that something is more important than that feature work, you should have to go through the same effort. It's only fair.
Sven Johann: Yes, that's true. I never thought about it that way, but yes. One part you touched on is what I have to do if I miss my SLO - there is a part in an error budget policy which is called the SLO miss policy. So what am I going to do when/if I breach my SLO, or when maybe only half of my error budget is left... What are the typical contents of that SLO miss policy?
Alex Bramley: Okay, the blog post goes into a lot more detail, but I like to think of a policy like this as a series of triggers and escalating consequences. Each trigger is a certain amount of error budget burnt, and the consequence is a thing that changes the stance of your engineering organization towards doing more reliability work. It's hard to give you specifics of the contents, because it's very dependent on how a particular business wants to reallocate engineering time and change behavior when a service is not reliable enough. I don't think there's any typical contents, but I can talk about the example I give in the blog post, which is a lightly reedited version of the things we did on the Play team.
Alex Bramley: I wanna stress that this is just one example, and the consequences there and the thresholds we chose are based on Play's market environment and how the developers needed to balance engineering time for features against engineering time for reliability. One of the things you want to keep in mind is that the Play Store is mostly accessed from mobile phones on wireless connections, so we could take a bit of a hit in terms of availability and latency, because the users were always going to be expecting a lot more variability anyway, because of the mobile connections they were on. So that gave us a little more leeway to be flexible in terms of how we dealt with reliability problems than, for example, web search would have, where everyone's on a wired connection and expects results very quickly.
Alex Bramley: So I've said already that I think a good place to start is describing the conditions that trigger your first operational response. The first point in your policy might be that your site reliability engineer or operations engineer gets paged when the one-hour burn rate exceeds 10, for example - so you've burned ten times your one-hour error budget in the last hour - or the 12-hour burn rate exceeds two, so you've burned twice your 12-hour error budget in the last 12 hours.
Sven Johann: Yes. We didn't talk about burn rate, right? So maybe you can say a few words about what the burn rate is.
Alex Bramley: Yes, of course. Sorry. So the burn rate is just the multiple of your error budget you've burned over a limited window. Say you have a one-hour measurement window, so you're looking at the previous hour's worth of data for your SLO, and you're calculating the performance of the SLO over that last hour. A burn rate of one means you've burned exactly all of your one-hour error budget in that hour. If you've got 1,000 requests in the last hour - I'm gonna try and do maths; I'm sorry, it might go badly wrong... But if you have a 3 nines service and you've served exactly 0.1% errors for the past hour, then your one-hour burn rate is exactly one. If you served 0.2% errors, that's double the error rate allowed by your SLO, so your burn rate is 2. And if you've served 1% errors, then your burn rate is 10. It's a nice, dimensionless way of looking at these things.
Alex Bramley: So for a 3 nines SLO, if you serve 1% errors for an hour, then your one-hour burn rate is 10. And then you'd page somebody. If you kept serving 1% errors over 28 days, you'd be ten times out of your SLO. You'd want to stop that early. Is that enough for that?
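As a quick illustration of the burn rate Alex just walked through, here is a minimal sketch assuming a simple request-based availability SLO; the numbers mirror his 3 nines example.

```python
# Minimal sketch of the burn rate calculation for a request-based availability
# SLO. A burn rate of 1 means you are consuming error budget exactly as fast
# as the SLO allows over the measured window.

def burn_rate(errors: int, total: int, slo_target: float) -> float:
    allowed_error_fraction = 1.0 - slo_target    # 0.001 for a 3 nines SLO
    observed_error_fraction = errors / total
    return observed_error_fraction / allowed_error_fraction

# 1,000 requests in the last hour against a 3 nines SLO:
print(burn_rate(errors=1, total=1000, slo_target=0.999))    # 1.0  -> exactly on budget
print(burn_rate(errors=2, total=1000, slo_target=0.999))    # 2.0  -> 0.2% errors
print(burn_rate(errors=10, total=1000, slo_target=0.999))   # 10.0 -> 1% errors, page someone
```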
Sven Johann: Yes, yes.
Alex Bramley: So the consequence of reaching this point, where your burn rate is significantly higher than it should be, is that the responding engineer - they get the page, they're expected to figure out why the error budget burn is elevated, and take steps to bring the service back into SLO... And do the usual kind of operations things, like redirecting traffic at the load balancers away from a bad cell, or rolling back the new release of the software that just happens to have gone out to the previous version.
Alex Bramley: They should inform the development team that these things have happened, especially if a rollback is required. There's no expectation that the development team has to do anything at this point. This is normal operational work. The service is being brought back into SLO by doing standard operational things. But what happens if the standard operational things, the usual playbooks, don't bring the service back into SLO? A rollback doesn't fix it. Or you can't roll back, because for some reason you've got a dependency between the backend and the frontend that has locked you into this new state. And as an operations engineer, you need to get assistance from the dev team, maybe because there are significant code changes needed to fix whatever problem has happened.
Alex Bramley: The consequence of reaching this point is that the development team has to start allocating some time to resolving the problem. The wording is quite vague here, I think, because you need to leave it up to your operations engineers to decide when it is appropriate to escalate to a development team. That's one of their jobs - to recognize when the things they're doing are not solving the problem, and they need to get someone who has a deep understanding of the code, like the person who built the feature, to come and allocate some time to resolve the problem.
Alex Bramley: They might do this by ensuring that there's a bug or a ticket assigned to an engineer, or your developers might have an on-call rotation that the operations team can escalate to. Or of course, in a DevOps model, where the development folks are the operations folks too, presumably the problem here is that the person responding isn't the person who is most familiar with the feature, and they need to find the right person within the dev team. So here the policy just describes the escalation process and the expected time commitments each team is gonna make.
Alex Bramley: If you have a system where you page the development on-call rotation, then there should be an expected response time there, and someone from their team - there's usually gonna be someone on call - needs to be willing to drop whatever project work they're doing and prioritize the interrupt. It's important to note that both of these thresholds come before the service has breached any of the 28-day SLOs, and that's intentional. The goal here is to stop any breach of the 28-day SLO from happening, where possible. So you're doing a lot of work ahead of time to make sure that overall the service stays within SLO over the 28-day window.
Alex Bramley: Of course, sometimes you will go out of SLO across the 28-day window, and this is where most people think they need to stop writing the policy. This is the dividing line: "Okay, we're out of SLO over the longer-term window, so this is where the organization needs to start doing something more serious about changing the reliability of the service." And broadly speaking, while there are many ways you can get to this point, there are two extremes, and you probably wanna handle them differently in your policy. The first is you have a single massive outage; say you're hard down for four hours, everything is just total madness and 100% errors for all of your users, and you completely erase your 28-day error budget over the course of this very stressful four-hour period.
Alex Bramley: The second is when you're serving -- we talked about the 3 nines thing, and that allows 0.1% errors across 28 days; that's your error budget. So if you're serving 0.11% errors over the course of the 28 days, then over your 28-day period you will be out of SLO, but it's a very slow burn.
Alex Bramley: The thing about the first kind of SLO miss is that once your terrible four hours is over, it's over. There's gonna be a post-mortem, it's inevitable, but that ought to do a good job of uncovering the root cause and producing action items that stop it from happening again... Because that's the whole point of a post-mortem - you've gotta learn from your outage.
Alex Bramley: In this case, the consequences should be that you make sure the action items are completed, and once they're completed, other engineering work can resume. But because the post-mortem will have identified all of the things that caused the outage in the first place - or it should have done, at least - you can generally just return to normal without further escalation or further engineering time needed, because you should be relatively sure that the problem won't happen again.
Alex Bramley: The second kind of SLO miss is way more insidious, because people tend not to take that kind of slow-burn problem seriously. Like, "Oh, it's just a tiny bit of errors. It's not that big of a deal." A four-hour outage, everyone being stressed, things in the newspapers - that focuses attention; people take it seriously. Serving a couple more errors than you should be - everyone's like "Oh, it's not a big deal. My feature doesn't have to be stopped because of that, does it?"
Sven Johann: Yes, that sounds very familiar to me.
Alex Bramley: Yes. And my experience is this kind of thing really tests the commitment of a team towards their SLOs, because it's so easy to just go "Oh, we don't have to do this right now, do we? I can get my feature out the door before I take you seriously."
Alex Bramley: So your policy has to call this scenario out specifically, because it's where most of the friction will occur between keeping the service reliable and making your product improvements. It's where you'll call on that executive sponsor to enforce the policy the most, because no one after a massive outage is gonna say "Well, my feature work has to take priority now" - they'd just get laughed out of the room. But in this slow-burn case, they're gonna have a reasonable argument to say "Well, it's not that big of a deal, is it?" The policy needs to be enforced objectively here.
Alex Bramley: The way I've dealt with this in the blog post is I gated further escalation on the 28-day error budget being exhausted and the previous conditions having been satisfied for a week or more. Essentially, the operations team has to have been escalating to the development team to try and get things fixed for more than a week, and the development team still hasn't been able to fix the underlying root cause of the small extra errors in that time. At that point, your users have been slightly harmed by this for a month - or a 28-day period - and the development team has not been able to fix it in over a week. It's clear that there's some kind of significant ongoing problem, that it's not a big-bang outage, and it's going to need proper, dedicated engineering time to fix it. The policy at this point uses the consequence of pausing feature code reaching production to achieve this. This is something that Google does a lot, and it's pretty controversial outside of Google.
Alex Bramley: The reason we do this is that it aligns incentives well, because the features can't go out until the service is reliable again. So for the engineering team - there's no point in them working on even more features, because none of them are going to reach production until the SLO burn is fixed. So what usually happens is that a couple of engineers, maybe a small group, stop the feature work they're doing so that they can unblock everyone else on the team from this kind of release freeze. You'll go sit down with your counterpart in engineering management and you'll say "Okay, so you're out of SLO. All of the features you're working on - none of them are going to production until you're back in SLO again. So can we have a couple of engineers to really dig into this problem, take this bug seriously, and help us fix it?" That usually works. That's usually what happens within Google.
Sven Johann: Yes, I think if you have too many production problems, also outside of Google, it works, I would say. There are probably some companies who just say "Oh, just build your service and it works 100% of the time", but I think that's not 100% of the cases.
Sven Johann: I'm just wondering -- you know, say I'm now writing my error budget policy, and I'm on call, for example as an operations engineer or SRE, or just in a normal "You build it, you run it" team where the developers are also involved in the operations... And now I'm constantly in a rush to develop features, I don't make my SLO, and the error budget policy actually says "You now really have to focus on reliability", but I get ignored all the time. The escalation path doesn't work, my product owner doesn't give me any time, and I'm just drowning in operations work, even on the weekend or in the evening. So what does a team do if the policy is ignored? What can you do as a team if no one really listens to you?
Alex Bramley: I don't know. I mean, the question to ask is "You've written a policy. Why is it being ignored?" When you write it, as I said, you've got to get that executive buy-in. They've got to sign off on the policy. And if they sign off on the policy, then you've got the signature to say "You said you would take this seriously. Please take this seriously." If you can't get that buy-in at the start, then it's probably not worth pursuing the path of even writing the policy, because you're not going to get it enforced.
Alex Bramley: It comes down to being able to make a good argument that the reliability work is valuable enough to the company that it should have engineering people working on it. And if you can't make that argument, then maybe it's not that important.
Alex Bramley: If you're in a situation where you're working extra time on the weekends to try and keep the service reliable - that's the hero complex, and it's kind of a failure mode of this. You shouldn't, as an individual, feel beholden to try and keep the service reliable in the face of a company that doesn't care. Companies don't care - this is just a personal opinion. And Google is in fact very good at mostly caring about its employees, but the way of the world is that the company does things for the company, and you still need to look after yourself, I think. If you go down the hero path, you're not looking after yourself, and that's a dangerous thing to do. I'm sure the press folks will make me cut that bit.
Sven Johann: Another reason an error budget policy could be ignored is that it was just wrong. We can actually live with all those problems, because no one really complains. That would be the time to adjust the SLOs and the error budget policy. Is that the correct thinking?
Alex Bramley: Yeah, I mean... That's absolutely possible. And that's why I say you need the data. Your SLOs don't operate in a vacuum. The feedback loop that you're trying to create here - you need some external source of information that your users are happy or unhappy, so that you can refine your SLOs. We talked before about looking at things like Twitter, looking at your support calls, looking at your forums, things like that... So if you have a big dip in your SLOs but you don't see increased support calls, you don't see sadness on Twitter, then did that SLO really detect a problem? You need to have some external thing to validate that your SLOs are actually detecting real problems. If you can tie it to increases in support calls, increases in forum posts, increases in sadness on Twitter - that's the kind of data that makes your policy look valid in the eyes of your local executives. You can say to them "Well, we served 5% errors for an hour here, and look at all the complaints on Twitter. This is the user trust that burned, so we need to have things that allow us to stop doing that. I need two engineers for a week to fix the bug that caused this, so it doesn't happen again in the future. And that is a useful and valuable thing for these people to be doing, because this is the user trust that we burned, and we don't wanna do that again."
Sven Johann: Do you have a recommendation that I should reconsider my SLOs and my error budget policy on a regular schedule?
Alex Bramley: Yes, I think that's the most important thing. If you're just waiting for a sign that your SLOs are wrong, then that might take a long time to appear. You may not even be collecting data such that you can see that they're in the wrong place. You might be blind to problems that are really there, because your SLOs are all green, but your users are still unhappy. So it's important to really review SLO performance on a regular basis.
Alex Bramley: We usually recommend that you do this at least once a year. You look at your SLO performance, you look at your other sources of user impact, you look at your post mortems, you look at all of this holistically, and try and see if there are patterns, try and see if there's any mismatches. See if one source of data is telling you one thing and another source of data is telling you another thing, and then trying to dig into why there's a disagreement. You do this at least once a year, and ideally more frequently. Six months is not a bad time schedule.
Alex Bramley: If your SLOs are new and you're just starting out on this, then looking at it every one month or three months is good, because a shorter iteration period means that you're more likely to converge on good SLOs and good data more quickly.
Alex Bramley: One good trigger -- if you have a massive outage, that's a good time to do a review, too. If you have one of those terrible four-hour periods where all of your user queries are going into /dev/null and you really didn't want that - at that point you should see obvious signs in your other sources that users were discontented, and that's very valuable data.
Alex Bramley: So when you're conducting the review, as I said, you're looking for these mismatches, you're looking for disagreement between your sources of data... And what you're trying to do with the output of this is changes to your SLO. So if you have a huge outage in the last quarter, but you didn't see any dips in your SLOs, then your SLOs are probably wrong. You're either measuring the wrong thing or your targets aren't strict enough.
Alex Bramley: If your SLOs indicate your service was unreliable but you can't find any evidence of this in your user impact, then your targets might be too strict, or you might be measuring something that you thought your users cared about, but it turns out they really didn't. And as I said, you probably won't get things right the first time, so doing these things more frequently, when you've got new SLOs, is a good thing to do.
Alex Bramley: The reason we say that you have to do this on a regular basis, even if you think things are good - say you've gone through this process, you've kind of dialed in your SLOs, you've ratcheted it down from doing every one month to every three months, to every six months, but you still need to do it on a yearly basis, because over the course of a year your service can change; you will release new features, you'll get new users, and you may even retire some things and turn some stuff down, or mostly turn some stuff down... And some of those changes will mean that your SLOs change, and your patterns of user behavior change, and your users' expectations change... And you don't wanna miss those.
Sven Johann: Yes. A final question on the error budget policies - would you recommend to have one error budget policy which rules everything across the organization or across a certain product group, or maybe have an error budget policy per service? How fine-grained should it be?
Alex Bramley: One error budget policy to rule them all, signed off by the Eye of Sauron himself...
Sven Johann: Signed off by whom?
Alex Bramley: The Eye of Sauron himself...
Sven Johann: Okay...
Alex Bramley: I think usually one shared policy is enough. The consistency is quite valuable here. Everyone knows what to expect, you can publish it in a public space for the whole company to see, and everyone has the same expectations, which is valuable. The only reason you might wanna have different policies, I think, is if you have different parts of a larger organization that have very different trade-offs they need to make on the reliability engineering spectrum.
Alex Bramley: We talked about authentication earlier... That one usually has pretty high reliability requirements, because it underpins a lot of the other parts of the organization, and core infrastructure pieces like that often need at least one, maybe even two extra nines of reliability, versus something that's only serving a subset of your users in a subset of locations and isn't in any critical user-serving path. Often, differing SLO goals deal with this quite nicely.
Alex Bramley: Say your authentication service has a 5 nines requirement, and your random frontend job that's only serving a fraction of the users has 3 nines or 2.5 nines - they will still burn error budget in the same way, and that kind of normalizes all of it. So even in that case, you don't need particularly different policies. But sometimes, say, you've gotta jump on periods of unreliability with more of a shift towards fixing the problems, as opposed to a more laissez-faire attitude of "Oh, we can just leave it for a bit." That's going to be more of a business decision than a market decision. You may need to move fast and break things if you're Facebook, for example, and that's the market you're in. Whereas if you're providing banking services, then people are gonna be pretty angry if you break things... So you have to do things differently.
Sven Johann: Yes. Alright, cool. Last chapter or last part - alerting. Ideally, I want to get warned before things get critical, because then I can react and solve problems as long as they are still small, right?
Alex Bramley: I agree, yeah. This is kind of a big thing. And I think that SLO-based alerting - this can replace most or even all of your previous service alerts. The caveat here is as long as you are relatively confident your SLOs capture the performance of your service from your users' perspective. You've gotta have gone through the kind of iteration process that I was talking about just now, and be relatively sure that you have a high signal-to-noise ratio from any periods where you're burning error budget, so you know that they correlate well to periods where your users are actually being harmed. If you've got that, then I think you can throw away a lot of your other alerts. And I know that's gonna sound scary to a lot of people, but if you think about it, I think it makes sense. If you've got good coverage of all your user interactions and your metrics tell you when your users are not interacting successfully with your service, and you know that the correlation is good, why do you need to worry about anything else?
Alex Bramley: Your users, as I said before - they don't care about which endpoint they're hitting; they don't care that your CPU utilization is high. They don't care about your server's load average. As long as the interactions they have with your service are meeting their expectations, they are happy. So if you're measuring those interactions well, you don't need much else. And again, I wanna stress here that that doesn't mean you throw away all your metrics. It means you can throw away the alerts, but the underlying metrics are still very useful.
Sven Johann: Yes. I mean, there are probably some exceptions, right? So if I run out of disk space, or something like that - that's something an SLO probably doesn't catch; or some background job is not working...
Alex Bramley: Well, at some point it has to have an impact on the user, or why do you care? If you run out of disk space, presumably that's going to manifest itself in users not being able to upload things correctly, or your service just 100% throwing errors because it can't write its logs anymore. Those are the things that your users actually care about. They don't care that it's disk space that is causing that, they just see the 500 error and they're like "Arrghh! I'm angry now!" So if your SLOs are capturing the user experience, they will see this.
Alex Bramley: Disk space is an interesting one though, because it's a problem that you can detect in advance, or predict in advance. I like it as an example, because I think it's quite illuminating, because it's hard to apply SLO methodology to things like disk utilization or other quotas, because they're different to serving errors. The quota exhaustion tends to act like a step function. When you've got disk space, you've got quota to spare, your service is functioning normally. But when you run out, suddenly you're burning a lot of error budget very quickly. And what's more, this should be an avoidable situation, so you've gotta just go in and request a quota increase, or go buy more disks, or something like that. And you've gotta do this far enough in advance that you don't run out of quota. So you should be able to predict and avoid the outage.
Alex Bramley: But there are still a lot of similarities when it comes to detecting and responding to this avoidable situation in time. Like most of those, you can set a static threshold: "You go above 80% disk utilization, you go ask someone to buy another disk." And this works well enough, for the most part. But just like static threshold alerting for outages, it's got the same drawbacks. If you have a little spike in disk usage - say you've got a cron job that runs once an hour, generates a bunch of disk usage and then cleans up after itself - and you go 1% over your threshold for five minutes, you get paged to buy another disk, even though that may not be necessary at this point in time.
Alex Bramley: As a little aside, I can talk about -- I've got an experiment with trying to create SLIs from quotas, if you're interested in hearing about it...
Sven Johann: Okay, yes.
Alex Bramley: So the idea is that instead of alerting when your utilization goes over some fraction of your limit - you hit that 80% threshold of disk usage and you fire an alert to have someone do something - you treat usage above some fraction of your limit as an error budget.
Alex Bramley: Let's take this disk usage as a concrete example. Say you've got a terabyte of disk quota. As I said earlier, this is a hard limit. You hit that terabyte of disk quota - all your writes fail and your system goes down. If you set this arbitrary threshold below this hard limit as a soft limit, say like 80% or 90%, then you can create an SLI from the difference between these two limits. You can then plug this into the SLI equation that we talk about in The Art of SLOs, by saying that your valid bytes are the total buffer space between your soft and hard limit, and the good bytes is the amount of buffer space you have remaining.
Sven Johann: Oh, okay, yes.
Alex Bramley: This works like an SLI, because when you're under your soft limit, your SLI is 100%. Everything is good. When you've consumed the entire buffer, your SLI is at 0% and everything is bad. So if you set an SLO goal of, say, 99% or 95% for this SLI, you're effectively saying you're okay with using on average 1% or 5% of buffer at steady state across your SLO window. More than that will start burning your error budget, and then you can plug this into burn rate alerting, and everything will just work.
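A minimal sketch of that quota-as-SLI idea, assuming a 1 TiB hard disk quota and an arbitrary 80% soft limit; the helper function and numbers are purely illustrative, not a Google tool.

```python
# Sketch of the quota-as-SLI experiment: treat the buffer between a soft limit
# and the hard limit as the "valid" bytes, and the buffer still unused as the
# "good" bytes. Limits and usage values are illustrative.

HARD_LIMIT_BYTES = 1 * 1024**4                      # 1 TiB disk quota (hard limit)
SOFT_LIMIT_BYTES = int(0.8 * HARD_LIMIT_BYTES)      # arbitrary 80% soft limit

def quota_sli(used_bytes: int) -> float:
    """100% while under the soft limit, 0% once the hard limit is reached."""
    buffer_total = HARD_LIMIT_BYTES - SOFT_LIMIT_BYTES                      # valid bytes
    buffer_left = max(0, min(HARD_LIMIT_BYTES - used_bytes, buffer_total))  # good bytes
    return 100.0 * buffer_left / buffer_total

print(quota_sli(int(0.75 * HARD_LIMIT_BYTES)))  # 100.0 -> under the soft limit
print(quota_sli(int(0.90 * HARD_LIMIT_BYTES)))  # 50.0  -> half the buffer consumed
print(quota_sli(HARD_LIMIT_BYTES))              # 0.0   -> quota exhausted
```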
Sven Johann: Yes, it sounds cool. Alright...
Alex Bramley: We actually skipped forward over actually talking about burn rate alerting, though. I think this is part of the problem...
Sven Johann: Yes, before we talk about burn rate alerting, I'm just wondering how I find the right thresholds for alerting if my SLOs are in danger. Finding the right threshold seems to be very difficult, at least for me.
Alex Bramley: Yes, so the other thing about having static thresholds is that you need to tune them all the time, depending on how your service is performing... And that's a source of operational load that you don't really need. It can be frustrating. Knowing whether you've got it right is hard until you have an outage, and then you try and tune it for what you have.
Alex Bramley: There's a lot of good advice on building alerts based on SLOs in the SRE Workbook, and you can find that freely online at google.com/sre. That's the one piece of advertising that I'll put in this; I'm working for an advertising company... The principle behind this kind of alert is instead of notifying someone when you go above a static threshold, you notify someone when you burn some portion of your error budget. You've used up some portion of your error budget in a short period of time.
Alex Bramley: You can't completely get away from the process of iteration of finding the right thresholds, because it's dependent on your business requirements for reliability, but there's a good starting point suggested by the Workbook that says "You should fire an alert when you've burned through around 2% of your 28-day error budget in an hour, or 5% in six hours."
Alex Bramley: So if you think about it, you probably (hopefully) aren't gonna have more than, say, 14 of these -- hang on... So 2% would give you the ability to do this 50 times in 28 days. If you have 50 hours where you burn 2% of your error budget, then you're gonna be out of SLO over the 28 days. So you can do that roughly twice a day. Does that make sense? I'm trying to do math, and it's a terrible idea...
Alex Bramley: So you can do that roughly twice a day across your 28 days and you're gonna be slightly out of SLO. That's where the rationale behind the 2% number comes from. And you're also gonna need another threshold to catch persistent low-level rates of errors. For that one, the Workbook recommends creating a ticket if 10% of your budget is burned over three days, and that's just slightly below the overall burn allowance. So if over each three-day period you burn 10% of your budget, you're going to be roughly at your SLO - you're going to roughly burn all of your error budget over the 28-day period.
Alex Bramley: These are great places to start, because you know that you're defending the 28-day SLO effectively by getting someone to do something at this point, for the reasons I just went into.
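A sketch of those Workbook starting points, assuming a hypothetical helper budget_fraction_burned(window) that returns the fraction of the 28-day error budget consumed over a given lookback window; the thresholds are the ones described above.

```python
# Sketch of the SRE Workbook's suggested starting thresholds. The helper
# budget_fraction_burned(window) is hypothetical: it should return the fraction
# of the 28-day error budget consumed over the given lookback window.

from datetime import timedelta

def check_slo_alerts(budget_fraction_burned) -> list:
    alerts = []
    if budget_fraction_burned(timedelta(hours=1)) >= 0.02:
        alerts.append("PAGE: 2% of the 28-day error budget burned in the last hour")
    if budget_fraction_burned(timedelta(hours=6)) >= 0.05:
        alerts.append("PAGE: 5% of the 28-day error budget burned in the last 6 hours")
    if budget_fraction_burned(timedelta(days=3)) >= 0.10:
        alerts.append("TICKET: 10% of the 28-day error budget burned in the last 3 days")
    return alerts

# Toy example: pretend 3% of the budget went in every window we ask about.
print(check_slo_alerts(lambda window: 0.03))  # fires only the one-hour page
```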
Alex Bramley: The problem with implementing alerting in this fashion is that you've got to be able to look back over 28 days' worth of performance history to figure out what 2% of your 28-day error budget is. And this can be quite difficult, depending on how you're measuring and storing the data. It can be quite an expensive query to run every minute, to go "Okay, so how's things over the last 28 days?"
Sven Johann: I just wanted to say that, for example, if you use an application performance management (APM) tool, which is not only looking at performance in terms of latency, but also looking at availability, the tools I know - they basically cannot do that. You cannot say "Alert me based on a burn rate of so and so much percent over the last two days", because they can only run queries about the last 30 minutes, for example. It's just not possible to do -- for example, in Prometheus you can do more, but a lot of tools just don't allow me to make this kind of expensive query.
Alex Bramley: Well, 30 minutes of history is not enough for anybody, in my opinion. If you're debugging a problem, then maybe that's enough. I think those are more for developers to really understand the performance of their application in the short-term, when they're building new features and things, rather than trying to do operations work with it. I wouldn't want to do any kind of operations work with only 30 minutes of monitoring data. That would make me very sad.
Sven Johann: No, I mean, I'm not saying 30 minutes of monitoring data. Those tools have more than a month's monitoring data, maybe two or three months, but you cannot run a query over those 30 days, for example. They don't have a query language; you can just click Alerts, and it doesn't allow you to select more than 30 minutes or so for how far back you look at errors. You cannot say "Alert me if the last ten days have burned so-and-so many percent of the error budget." You can only say "Alert me if we burned so-and-so many percent over the last 30 minutes, or hour."
Alex Bramley: Oh, okay. So I've not really used these application performance monitoring packages, I've got to admit. I'm unfamiliar with them as a concept, generally. Sorry. But even when you're working with things like Prometheus, it can be difficult to do queries over months of data.
Alex Bramley: The way the Workbook recommends dealing with this kind of problem is to calculate something called the burn rate, which we've kind of touched on before.
Alex Bramley: With some maths that I'm not gonna get into here - because I've tried to do maths already and it's not been the most enthralling thing for your listeners to listen to as I stare off into space and don't say anything - the Workbook shows that having a one-hour burn rate of 14.4, where you have 14.4 times as many errors in the past hour as your SLO permits, is the same as burning 2% of a 30-day error budget in that hour. The Workbook uses 30 days rather than 28 days because we settled on 28 days as the thing to recommend to people after the Workbook was published. There's a process of iteration, as always...
Alex Bramley: The equivalent burn rate for a 2% threshold of an SLO measured over 28 days is 13.44. I conveniently did the maths ahead of time for that one. So the idea is that if you use that threshold, you're catching 2% of your 28-day error budget being burned over the hour. You're essentially translating the 28-day budget down to an alerting window of an hour and saying, "Okay, I know the 28-day error budget is this much, so over the course of an hour I'm allowed to burn this much of it."
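As a rough illustration of where the 14.4 and 13.44 figures come from (a sketch to accompany the episode, not code from the Workbook), the equivalent burn rate is just the budget fraction scaled by the ratio of the SLO window to the alerting window:

```python
# Burn rate needed to spend a given fraction of the error budget
# within a short alerting window, for a given SLO window.
def equivalent_burn_rate(budget_fraction, alert_window_hours, slo_window_days):
    slo_window_hours = slo_window_days * 24
    # At burn rate 1 you spend exactly 1/slo_window_hours of the budget per hour,
    # so spending budget_fraction in alert_window_hours needs this multiple of it.
    return budget_fraction * slo_window_hours / alert_window_hours

print(equivalent_burn_rate(0.02, 1, 30))   # 14.4  - the Workbook's 30-day figure
print(equivalent_burn_rate(0.02, 1, 28))   # 13.44 - the 28-day equivalent
print(equivalent_burn_rate(0.05, 6, 28))   # 5.6   - 5% of a 28-day budget in six hours
```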
Alex Bramley: The reason for taking this approach - using these burn rate alerts - is that the alerting scales with both the severity of the outage and the tightness of the SLO goal. So it takes away two of the tuning dimensions from setting a static threshold.
Alex Bramley: So if you are hard down serving 100% errors, you're gonna hit that 14.4 times burn rate threshold and page someone in just a minute or two. And again, as I said, it scales with both the severity of the outage and the tightness of the SLO goal. If you're serving 100% errors, if your SLO goal is 5 nines, then you're gonna page basically instantly. If your SLO goal is just one nine, you're gonna take a lot longer to page, because you have much more error budget to burn.
Alex Bramley: On the other hand, if you're only just burning enough error budget to be out of SLO over your entire measurement window, you're gonna get a ticket after three days, and someone can do something about it whenever it's convenient, because you've still got plenty of time to deal with it.
Alex Bramley: Because of this kind of scaling, you don't have to worry so much about finding the right thresholds or measurement windows that balance sensitivity to outages against a high signal-to-noise ratio, because these two scaling dimensions take care of a lot of the problems.
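A small sketch of that scaling (assuming a total outage that starts after a clean hour, and the 14.4 one-hour threshold discussed above) shows how the time to page shrinks as the SLO gets tighter:

```python
# Rough time until the one-hour burn-rate alert fires during a total outage,
# assuming the preceding hour was error-free. Serving 100% errors means an
# instantaneous burn rate of 1 / (1 - slo), so the one-hour average reaches
# the threshold after roughly threshold * 60 * (1 - slo) minutes.
def minutes_to_page(slo, burn_rate_threshold=14.4):
    return burn_rate_threshold * 60 * (1 - slo)

for slo in (0.99999, 0.999, 0.9):
    print(f"SLO {slo}: page after ~{minutes_to_page(slo):.2f} minutes")
# 5 nines: ~0.01 min (basically instantly), 3 nines: ~0.86 min, 1 nine: ~86 min
```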
Sven Johann: Yes. I have here one final question, and I believe you basically answered it already...
Alex Bramley: Yeah, I've kind of jumped forward...
Sven Johann: ...and that is that the threshold I have doesn't tell me how fast my error budget is shrinking. But if I look at what you've said, I probably -- yeah, I could also combine those alerts. So I have an alert which looks at a burn rate over the last ten minutes, a burn rate over the last hour, a burn rate over the last day... Something like that.
Alex Bramley: Yeah, absolutely. We tend to do that at Google. We have an SLO alert based on the one-hour burn rate, an SLO alert based on the 12-hour burn rate, and a ticket alert that files a bug based on the burn rate over a week. It works well. The thing you need to have is some alert suppression. So if the one-hour alert is fired, then when the 12-hour alert inevitably also fires, you don't need to page the person again, because they're probably doing something about it.
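A hypothetical sketch of that multiwindow pattern follows - the window names, the 12-hour and one-week thresholds, and the `burn_rate` helper are illustrative assumptions, not Google's actual configuration:

```python
# Hypothetical sketch of multiwindow burn-rate alerting with suppression.
# burn_rate(window) is assumed to return the errors seen in that window
# divided by the errors the SLO would allow over the same window.

PAGE_1H_THRESHOLD = 13.44    # ~2% of a 28-day budget burned in one hour
PAGE_12H_THRESHOLD = 2.8     # illustrative: ~5% of a 28-day budget in twelve hours
TICKET_7D_THRESHOLD = 1.0    # burning budget at least as fast as the SLO allows

def evaluate(burn_rate):
    alerts = []
    if burn_rate("1h") > PAGE_1H_THRESHOLD:
        alerts.append(("page", "fast burn over the last hour"))
    # Suppress the slower page when the fast one has already fired, so the
    # on-caller isn't paged a second time for the same outage.
    elif burn_rate("12h") > PAGE_12H_THRESHOLD:
        alerts.append(("page", "sustained burn over the last 12 hours"))
    if burn_rate("7d") > TICKET_7D_THRESHOLD:
        alerts.append(("ticket", "slow burn over the last week"))
    return alerts
```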
Alex Bramley: That's the only thing I will say... That's proven to be a source of additional noise for us. You have a big outage and the one-hour alert fires, because it's the fastest-responding one. But you're just getting into working on the problem and then your pager goes off again, because it's saying "Hey, your service is out of SLO over the 12 hours as well." It's like, "Thank you, I knew this." Alert suppression is something you have to have built into your alerting system, and it can be difficult to add if your system doesn't already support it.
Sven Johann: Yes, you get alert fatigue.
Alex Bramley: Yes.
Sven Johann: Alright, awesome. So I do not have any questions left. Overall, we only talked three hours, or something like that... Three and a half hours. So I will put all of the things you mentioned - the parts from the SRE Workbook, your blog posts, the ACM Queue article, and so on - in the show notes. For listeners who found the math hard to follow in the podcast - we have show notes, of course, but we also have a transcript, where you can read everything slowly and go over it again.
Sven Johann: Alex, I am super-happy about those three hours (or more than three hours) of conversation. I think we touched on a lot of stuff... I can't imagine I forgot to ask something important, but maybe you think I did forget something...
Alex Bramley: Let's see... No, I think there are a few bits we've skipped over, but I wrote so much, because I like the subject... And I just wanna say thank you for the opportunity to talk about these things. I think it's an important thing to get into people's heads. The key things to reiterate are focus on the users' experience, and try and measure that as closely as possible. And don't be afraid to just set something up quickly and start measuring things before you have confidence that what you're doing is good... Because the way to get confidence that what you're doing is good is to start, and then see how bad it is, and then try and make it better.
Sven Johann: Yes, yes.
Alex Bramley: A quote that I really like from Adventure Time With Finn and Jake - which is a kids' cartoon; I'm terrible, I like watching kids' cartoons - is where Jake says "Sucking at something is the first step on the road to being good at something." It's a fundamentally important thing that everyone needs to know in their lives. You have to be bad at something first, so just start.
Sven Johann: Yes. A previous guest, Philippe Kruchten - his answer to that is always "Yes, incremental and iterative software development." I think that also applies here.
Alex Bramley: It's not just software development, it's literally everything in life. And that's one of the reasons I like that quote - because children especially don't get told this enough. There's the expectation that they should just be good at stuff. I've got kids myself, and letting them go through that process of iteration is hard for them, because they don't like getting things wrong, and it can be a very frustrating experience; helping them understand that that's just how life works is important.
Sven Johann: Yes. Alright, cool. Thank you very much, and also thank you to our listeners for listening to those...
Alex Bramley: Three hours...
Sven Johann: ...all those episodes. Alright, bye-bye.
Alex Bramley: Cool, thank you so much again for the experience. It's been a great time. Thank you.