Alex Bramley on The Art of SLO, Part 2

Transcript

Sven Johann: Hello everyone to a new conversation about software engineering. Today is the second episode of our three-part series with Alex Bramley, site reliability engineer at Google, and author of the open source Art of SLO training.

Sven Johann: So we still are in our deep-dive on service-level objectives, architectural requirements for operations, and the show today covers availability requirements and service-level objectives, how to do that, how to do a deep dive into that. We talk about dependency graphs and availability calculus. You can imagine in a microservice system you have many moving parts; how do you calculate your availability... We talk about unknown availability SLOs, we talk about communication of availability to all kinds of stakeholders, we discuss the service-level indicator menu, where to measure what, we have several places where we can measure availability at the user level, load balancer, at the service level, and so on. We discuss all the pros and cons of those possibilities, and when and how to combine them... And we are closing the episode with reporting. Please enjoy!

Sven Johann: So going back to availability, but to have a little bit of a different look on availability... Usually, when I want to define an availability SLO for my service, my service X, whatever it is - search, or payment, or product detail page, or you name it - unfortunately, it has some sort of dependency on something... Like the platform, or external services like SAP, or Elasticsearch, or whatever... We need storage...

Alex Bramley: Those pesky cloud providers...

Sven Johann: Sorry?

Alex Bramley: I said "Those pesky cloud providers..."

Sven Johann: Yes... For example, for example. But also, just other services I depend on if I have a microservices architecture. And obviously, those dependencies influence the SLO of my service. So what's your take on this one?

Alex Bramley: Well, if they're internal dependencies, then you've got some extra options, because you can kind of go talk to them... It's definitely worth checking if they have any published SLOs internally, and then maybe you can go sit down with the teams if you've got some spare time... Or these days, have a Zoom call with the team that run the services, and have a chat about what your expectations as a user should be. They'll be interested in how you're using their service, and hopefully they'll want to get your perspective on their reliability. So it can be a win/win situation for everyone. They get real data from their users about how they're experiencing the service reliability, and they can also hopefully figure out -- you can also hopefully figure out whether your intended usage is gonna cause problems for that service. If you can go sit down and chat with them and say "Hey, I want to send you half a million QPS" and then you see their eyebrows hit the roof, then you know that maybe they're not prepared for that and they might have some problems with that. They'll also be able to give you some idea of whether they think they can provide the reliability that you think you need... Like, going in there with an idea of a goal, like "I'd like to get 3 nines out of your service. Do you think that's possible?" That's gonna anchor the discussion.

Sven Johann: Yes, I just wanted to second that... I was once working with an insurance company, and they had an authentication service, and those people of course knew that this authentication service is quite critical, but that it's also critical during the weekends for some services, not only during normal working days, and that we have increasing load, and stuff like that... For them, that was kind of surprising. They said "We cannot provide you that sort of reliability." But if more and more of the services that depend on them are coming and saying "We have the following requirements", it also reflects in their backlog. So it's easier for them to say to their product owner "Oh, our clients need more reliability. We have to work on it."

Alex Bramley: Yes. I think authentication is a great example, because everything has a dependency on authentication; whatever you're doing, you need to know who the person is doing it, and whether they're allowed to do it. Authentication and authorization underpin most businesses. So they often have high reliability requirements, and I think they're also -- because they tend not to have many dependencies themselves, they're also a great place to start if you're trying to move your business towards having SLOs for services, starting with authentication. Everyone can usually agree that authentication needs to be reliable, because it's a core dependency for a lot of things. And because it doesn't have many dependencies of its own, and because it's also theoretically a relatively simple service, like "Is this person authenticated? What authorization do they have to do things?", it's a reduced problem space to get started on. So it can be a great place to start when you're trying to bring SLOs into a company.

Sven Johann: Yes. So that's if you have internal dependencies. But now the tricky part, I guess, are external services.

Alex Bramley: Yes. Do everything in-house. Depending on external services is more risky, especially if you're building your entire business on top of a cloud provider. You're taking on a risk there. One of the things that the CRE team was formed to help mitigate is that risk, because -- I gave you the example earlier of Shopify having their millions of dollars per minute run rates. That's built on GCP, and they're happy to say that. We work quite closely with Shopify on CRE, and one of the things we're trying to do is help them have the confidence to run their business on top of GCP. We do shared postmortems with them, we are there to respond to their incidents in some cases, because that helps them have the confidence to run on top of a cloud provider. And that's a very extreme case, because they want to reach high reliability, and that means that Google has to be there to help them do that, because they have this massive external dependency on us.

Alex Bramley: If you remember the discussion about SLAs earlier, as someone depending on an external service, you're only likely to get compensation for major outages, and even that won't make up for the loss of user trust that your customers have experienced. So one way to cope with this is to measure SLIs for your dependencies separately, perhaps even set your own targets for them. If Spotify came to us -- not Spotify, sorry; Shopify came to us... It's difficult, they're very, very similar words... If Shopify came to us and said "Hey, we expect this level of reliability from you", that kind of a discussion is really helpful, because as you say, it's something that we can take to product teams and say "Okay, so we need to engineer for this, because one of our biggest customers is demanding it."

Alex Bramley: But by measuring the performance of your dependencies in isolation, you can get insights that you can feed into the engineering decisions you have. Say if you need 4 nines from a cloud provider, and you're frequently experiencing only 3.5, maybe you need to adjust your usage patterns, maybe you need to file a support ticket. Maybe you can start the discussion with the provider to say "Hey, our business is suffering because your reliability isn't good enough for us, and we don't wanna have to leave, because that's an expensive thing to do. What can we do about making this better?"
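As a rough illustration of measuring a dependency in isolation, the sketch below counts good and total calls to one dependency and compares the measured ratio against a target you set yourself. The dependency name, the request counts and the targets are all made-up assumptions for illustration, not figures from the conversation.

```python
from dataclasses import dataclass


@dataclass
class DependencySli:
    """Good/total call counts for ONE dependency over an SLO window."""
    name: str   # hypothetical dependency name
    good: int   # calls that succeeded from our point of view
    total: int  # all calls we sent to the dependency

    @property
    def availability(self) -> float:
        # With no traffic there is nothing to count against the dependency.
        return self.good / self.total if self.total else 1.0


def meets_target(sli: DependencySli, target: float) -> bool:
    """True if the measured availability reaches our own target."""
    return sli.availability >= target


# Hypothetical numbers: we wanted 99.99% but measured 99.95%.
storage = DependencySli(name="cloud-storage", good=999_500, total=1_000_000)

print(f"{storage.name}: measured {storage.availability:.4%}")
print("meets 99.99% target:", meets_target(storage, 0.9999))
print("meets 99.9% target: ", meets_target(storage, 0.999))
```

A "False" against the target you need is the data point that lets you open the conversation with the provider, or start planning mitigations.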

Sven Johann: Yes. I think one additional thing I find interesting with measuring the reliability of my dependency is that I understand my reliability of the dependency. How often do we fail because of let's say the dependency failure, but -- I have no clear numbers how often it actually happens. I mean, I can always do something about it, right? So that is a very concrete work item, that I want to understand the reliability of my dependency, because that's obviously important.

Sven Johann: And I think there is one absolutely fantastic paper about dealing with external dependencies. You mentioned Ben Treynor, the VP of 24/7 at Google... So he and his colleagues (or your colleagues), they wrote a great article for the ACM Queue, and it's called (I'll put it also in the show notes) The Calculus of Service Availability. For me it was, from many perspectives, an eye-opener; a fantastic paper, and it would be a show on its own to discuss that paper, but let's just pick one example here.

Sven Johann: So I have a service A, and this service A has ten critical dependencies. So not only one, it has ten... Which itself have critical dependencies, because that's how you need to think about it, right? So I have always a dependency tree.

Alex Bramley: This is the world of microservices, right?

Sven Johann: Sorry?

Alex Bramley: The world of microservices, right?

Sven Johann: Exactly, exactly. Especially if you do Netflix-style microservices. So you have this dependency tree, and now I need to calculate the availability of my service A. I would ignore for a second that we can do a lot to mitigate the risk of a dependency failing; that's a story for another day. But now I have this dependency tree, and how do I calculate the availability of my service A? How do we do that?

Alex Bramley: This was a fun part for me to figure out an answer to, because I had to go and read the paper and think about it a lot. It was interesting. It was an eye-opening paper to me too, and I think - if you can excuse just a little bit of pedantry from my side, you can estimate an expected availability using the maths in the paper, but what it says is you have to measure the actual availability achieved over a given time window, because the events affecting that availability occur randomly. You don't schedule an outage on the fourth day of every month... That's how you meet your targets, right? That's not how these things work, unfortunately.

Alex Bramley: So you can't really calculate the availability of service. You have to measure the availability of service over a time window in the past. But you can estimate what you can expect to receive based on the availability for your dependencies. But the paper proposes that you kind of run this the other way around. You start by declaring a desired availability for your parent service. You set an SLO, effectively. You say "I want to get 3 nines out of the top-level service." And then you run the maths the other way around to figure out the required availability the dependencies must provide.

Alex Bramley: So that's kind of a different way of looking at things than the one you're proposing, where you're going upwards from the leaves to the root. You go downwards, and you say "If the root wants to achieve 3 nines, what do I need to get from each of these dependencies?" I think it's easier when you start talking to the dependencies, to be able to say "I want to achieve this. Because the fan-out from my service is 10x, I need you to be 10x more reliable." It's gonna be a difficult pill for them to swallow, but at least you have some data to show why you want that performance out of it.
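The top-down direction can be sketched with very simple arithmetic. The function below is a simplification under loudly stated assumptions: the parent's whole error budget is split equally across its critical dependencies, failures are independent, and the parent itself consumes none of the budget. The function name and the example numbers are illustrative only and are not taken from the paper.

```python
def required_dependency_availability(parent_target: float, critical_deps: int) -> float:
    """Availability each critical dependency must provide so that the
    parent can still hit parent_target.

    Assumption: the parent's error budget (1 - parent_target) is divided
    equally between the dependencies, and nothing else spends any of it.
    """
    if critical_deps < 1:
        raise ValueError("need at least one critical dependency")
    parent_budget = 1.0 - parent_target
    return 1.0 - parent_budget / critical_deps


# "I want 3 nines at the top, and my fan-out is 10x."
needed = required_dependency_availability(parent_target=0.999, critical_deps=10)
print(f"each dependency needs roughly {needed:.4%}")
```

With a fan-out of ten, each dependency gets a tenth of the parent's budget, which is the "10x more reliable" request in numeric form.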

Alex Bramley: That said, I think there are two key points that the paper makes with regards to your question. Firstly, service dependencies are rarely unique to the service, especially when you have a service mesh, Istio, web buzzwords etc. Service mesh... I've said that already. I got lost in the buzz words. Firstly, your service dependencies are rarely unique to the service, especially when you've got a nested hierarchy, like we were saying; you've got a service mesh, you've got microservices, all of that kind of thing. Service A is gonna depend on service B, service B is gonna depend on C, and they all depend on your authentication service. But you also talk to service D that depends on service C, and then God knows what's happening in all the other services you hit.

Alex Bramley: So your critical dependencies are not gonna be disjoint sets, like everything is gonna depend on authentication somewhere down at the bottom. And it doesn't really matter if you depend on a particular service via one or many paths, because if authentication is down, everything's down; it doesn't really matter. So the unavailability of each transitive service dependency can only contribute once to the top-level unavailability of your service. That's one of the key things the paper says. So that has a consequence for the maths, because if you run it upwards, then you're saying "Okay, well if this fails, then it contributes via service B, it contributes via service C", and then you get a completely different number if you start at the top and work down and say "Well, okay, I need the reliability of this service to be X, because I depend on it, and I need it to be this reliable... And it doesn't matter how I depend on it, it just matters that I depend on it, and I have this requirement for it."
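To make the "contributes only once" point concrete, here is a small sketch over a hypothetical dependency graph shaped like the A/B/C/D example. It contrasts walking every path (which counts authentication several times) with collecting the unique set of transitive dependencies. The graph, the 99.99% figures and the assumption of independent failures are all illustrative, and the result is only an estimate, not a measurement.

```python
import math
from typing import Dict, List, Set

# Hypothetical graph: service -> services it critically depends on.
GRAPH: Dict[str, List[str]] = {
    "A": ["B", "D"],
    "B": ["C", "auth"],
    "C": ["auth"],
    "D": ["C"],
    "auth": [],
}


def deps_along_every_path(graph: Dict[str, List[str]], root: str) -> List[str]:
    """Naive bottom-up view: a dependency is listed once per path that
    reaches it (assumes the graph has no cycles)."""
    found: List[str] = []
    for dep in graph.get(root, []):
        found.append(dep)
        found.extend(deps_along_every_path(graph, dep))
    return found


def unique_transitive_deps(graph: Dict[str, List[str]], root: str) -> Set[str]:
    """Each transitive dependency appears exactly once, however many
    paths lead to it."""
    seen: Set[str] = set()
    stack = list(graph.get(root, []))
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        stack.extend(graph.get(node, []))
    return seen


naive = deps_along_every_path(GRAPH, "A")
unique = unique_transitive_deps(GRAPH, "A")

# Assumed availability for every dependency, failing independently.
assumed = {name: 0.9999 for name in unique}
estimate = math.prod(assumed[name] for name in unique)

print("auth counted per path:", naive.count("auth"))
print("unique dependencies:  ", sorted(unique))
print(f"estimated ceiling for A: {estimate:.4%}")
```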

Alex Bramley: Sorry, I got a bit off into the weeds there... Anyway, secondly - while you say that you want to ignore the possibility of mitigations, that's kind of what the bulk of the paper concentrates on. It's proposing a rule of thumb that your critical dependencies need to provide an additional nine of reliability compared to the parent service... And this has some pretty stark implications for running complex, multi-layered services reliably. In many cases, risk mitigation is the only effective strategy available, because it's simply not physically possible for human beings to respond in time to preserve the availability of the service.

Alex Bramley: If you've got a nested hierarchy that's six levels deep, then you add an additional nine of reliability at each level. Then you go from 2 nines to 7 nines, or 8 nines. I can do maths... You go from 2 nines to 8 nines, and 8 nines is just like -- 5 nines is almost impossible to meet in reality. 8 nines - that's marketing speak.
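The compounding effect of the "extra nine per level" rule of thumb is easy to see as downtime budgets. The sketch below assumes a 28-day window and a root service at 2 nines, and simply adds one nine per level of nesting; the window length and function names are choices made for illustration.

```python
WINDOW_SECONDS = 28 * 24 * 60 * 60  # assumed 28-day SLO window


def required_nines(root_nines: int, depth: int) -> int:
    """Rule of thumb: each level of critical dependency needs one more
    nine than its parent."""
    return root_nines + depth


def availability_from_nines(nines: int) -> float:
    """2 nines -> 0.99, 3 nines -> 0.999, and so on."""
    return 1 - 10 ** -nines


def downtime_budget_seconds(nines: int) -> float:
    """Total unavailability allowed in the window at that many nines."""
    return WINDOW_SECONDS * 10 ** -nines


for depth in range(7):
    nines = required_nines(root_nines=2, depth=depth)
    print(f"depth {depth}: {nines} nines, "
          f"{downtime_budget_seconds(nines):.3f} s of downtime per 28 days")
```

At 5 nines the budget is already only about 24 seconds per window, and at 8 nines it is a few hundredths of a second, which is why human response cannot be the mechanism that preserves it.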

Sven Johann: Yes, that's true. I cannot just go somewhere and say I need 8 nines, or 7. Even 5 nines is crazy, as you say... Maybe not crazy for Google, but for normal companies, 5 nines is already pretty crazy, I think.

Alex Bramley: Yes. I see 7 nines when you're talking about storage durability, durability being like if I write a byte, what are the chances of me not being able to read that byte back, because the core system is pretty reliable. You also see it in a lot of literature that's talking about availability in terms of the system is up and running, versus the system is down. The binary, black and white things. Because if you're talking about a mainframe kind of thing, the mainframe availability - they can reach 7 or 8 nines, because they're built in with hugely redundant systems, so that it can keep running even if parts of it fail.

Alex Bramley: You mentioned IBM before we started recording... I used to work for a startup that got bought by IBM, and when we got bought, they had this really interesting story about mainframes. There was apparently a guy who was a salesman in Texas who used to sell mainframes, and one of his favorite sales tactics was literally to pull a gun out and fire a bullet at one of the processors... And he'd do this while the customers were in the room. I think he'd probably use an airgun or something, rather than an actual bullet, because you know, the danger, and things like that... But this is Texas, so maybe not. I don't know. And the story may even be apocryphal, I don't know, but it's a lovely story.

Alex Bramley: So he'd do this while the system was running, and the customers were looking at the console prompt and all you'd see is CPU 4 offline, and the whole system would just be like "Oh, we lost a CPU. It's been shot. It's fine. I'm only bleeding a little bit." To paraphrase Monty Python, "It's just a flesh wound!"

Sven Johann: Yes. You already said it in a partial sentence, that I have to measure my dependencies. I just want to make it a little bit more specific. I obviously have also dependencies which do not offer an SLO. It's just unknown. And okay, you said if that's the case, I have to start measuring it, and go to the folks and tell them "I have an SLO requirement." But is there anything else I can do if one of my dependencies has an unknown SLO?

Alex Bramley: I mean, it's difficult. It usually comes down to engineering work, I think. Most third-party dependencies are almost certainly not going to tell you about their SLOs. I mean, they may well be measuring them. I know for example Google measures a ton of SLOs about the performance of GCP, but customers don't usually get to see these things. But I think what it comes down to is if you measure the reliability of your dependencies, then you can figure out whether they're meeting the availability, the reliability that you need from them. And that's the signal that you need. If they are meeting the reliability you need from them, then you don't need to do anything. If they aren't, then you have to start putting mitigations in place. You can maybe find another third party that provides that service and use both of them, and fail over between one and the other if you need to.

Alex Bramley: You can go talk to their sales teams and say "Look, this simply isn't good enough. What are you going to do about it?" There are many options you have when you have data showing that the reliability of the third party is not what you need it to be. And until you have that data, you can't have the conversation.

Sven Johann: Yes, exactly. Exactly. Switching gears a little bit again, how do I communicate those service-level objectives and get a buy-in? When I started with SLOs and error budgets, although I think that the concept is absolutely fantastic, some people - or maybe even all business stakeholders and service managers, they thought that it's kind of a crazy idea to just be fine with a certain amount of downtime. That's one thing. And the other thing is that if you reached a certain amount of downtime, then you are not allowed to, let's say, deploy new features, for example, and you're only allowed to work on reliability.

Sven Johann: To me, when I started out with SLOs and error budgets, it was kind of tricky to communicate that to those people. So do you have any recommendations on communication?

Alex Bramley: I don't know about recommendations... Communication is a difficult skill, and one that has to be practiced. I think the mainframe example I gave just now is where it kind of comes from... The idea of the mainframe being up, being the availability of the service, is like -- if the hardware is running, that's fine; that's where a lot of this 100% thing comes from, because it's possible to build highly-redundant systems that can reach 7 or 8 nines of hardware reliability. The hardware is on, and the software is nominally running.

Alex Bramley: So the "Downtime is unacceptable" mindset comes from that binary understanding of availability as the mainframe is on, that comes from running your large enterprise software on these huge, vertically-scaled servers. When you have one server, that server is up or down, and when that server is down because of hardware failure - maybe your salesman has just shot the CPU or something - then that's definitely bad.

Alex Bramley: What's more, on these large enterprise systems you update the software at most once a quarter, probably more like once a year, because these things tend to be huge projects, massive amounts of work. And then when you're doing those software updates, they generally involve scheduled downtime. So the rate of change that systems experience is mostly zero, and you've got then these massive step-changes where you roll out a new version of Oracle, and then you take down the entire mainframe for a weekend and nobody can use that bank, or whatever... And that scheduled downtime is -- I don't know, I'm not trying to be mean here, but to me this is kind of cheating... Because scheduled downtime is still downtime, but because you're telling people about it, it doesn't count against your availability metrics.

Alex Bramley: When you're running your services as a bunch of load-balanced Kubernetes pods in multiple regions on a cloud provider, like everyone wants to do these days with the DevOps and all, individual hardware failures are invisible to your software and your users. On the other hand, you're probably deploying new code at least once a week, maybe even several times a day if you're really speedy on this kind of thing. So there's a constant rate of change in the system. And while there's still the possibility of complete downtime for your service, there's also this broad spectrum of availability from the perspective of users, because each change brings with it the risk of something going wrong.

Alex Bramley: The mainframe -- you can equate the user's experience of availability with the mainframe being on, because when the mainframe is on, it is serving, and the software is not changing in that time... And the biggest risk you've probably got to contend with is that maybe disk fills up, or something like that. You can mitigate all of those risks ahead of time, I think... Or most of them ahead of time.

Alex Bramley: So in a service mesh world, in a microservices world, where you're changing things constantly, the chance of something going wrong is much higher. Most of the time, all that will happen is a few users see a few errors; and that's bad for your users, but it's very rare that you see 100% everything is hard down. So I think SLOs and error budgets - they provide you with the tools to deal with this kind of reality where availability isn't binary, and each error you serve to your users is a slight detriment to the trust they have in you.

Sven Johann: Yes, I think that's a really good point, and I think that's also super-helpful to make clear that -- I mean, we also had that discussion before we started the podcast... Do you want to have five outages which take two minutes or ten minutes and affect 10% of your customers, or do you want to have one outage which affects 100% of the customers? It's not even an outage, it's a planned/scheduled downtime... And that for a day, or something. Still, some people think that the former is worse, which I still cannot believe, but that's how it is. Looking at SLOs is way better, and it helps to communicate (let's say) customer satisfaction.

Sven Johann: A long time ago, maybe five years or something, for one customer it was totally okay to have those insane release weekends, where from Friday evening to Monday morning everything was down. Scheduled downtime. That is more than 50 hours every three months, where you are completely unavailable. But if during a random day you have a 30-minute outage, people just freak out because you have a 30-minute outage. And then the discussion is "How can we avoid those 30-minute outages?", but nobody thinks really about the 50-hour scheduled downtime, which really annoys a lot of customers.
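Putting the two kinds of downtime on the same scale shows how lopsided the comparison is. The sketch below assumes a quarter of 91 days and treats both the release weekend and the outage as complete unavailability; those are simplifying assumptions for illustration.

```python
QUARTER_HOURS = 91 * 24  # assumed length of one quarter


def availability(downtime_hours: float, window_hours: float = QUARTER_HOURS) -> float:
    """Fraction of the window in which the service was usable."""
    return 1 - downtime_hours / window_hours


release_weekend = availability(50)   # one 50-hour scheduled downtime
random_outage = availability(0.5)    # one 30-minute unplanned outage

print(f"50 h release weekend per quarter: {release_weekend:.3%}")
print(f"one 30-minute outage per quarter: {random_outage:.3%}")
print(f"the release weekend costs {50 / 0.5:.0f}x more downtime")
```

Under these assumptions the release weekend alone keeps the quarter below 98%, while a single 30-minute outage still leaves it above 99.97%.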

Alex Bramley: Yes, it's very much a mindset change, and I think when you're operating mainframes and you're not -- like, can you imagine trying to do an Agile, continuous deployment type thing in a large Oracle database? Firstly, you don't get software releases from Oracle; secondly, updating the database is a hugely risky thing for a bank that completely rides on that kind of thing... So it's very much a mindset change, and I don't necessarily think that it's the wrong thing to do, because when you've got this big step change in software, you need to have enough space to really work with that. What happens if you've rolled 90% of the way forward and then you discover a problem? You need to be able to roll back, and that probably still takes a bunch of time.

Alex Bramley: Their approach to things is a product of the software environment they're working in, I think, is where I'm going with this... And I think where they start having problems is trying to take that approach to a different software environment, the cloud, where a lot of the assumptions that they're making and a lot of the fundamental building blocks operate quite differently. It just doesn't work, and that's why we've got SLOs instead, and this granular notion of availability - because that's the software environment we're working in.

Sven Johann: Yes. Another larger topic I want to discuss is the measurement itself. One of my problems I had to work on was, for example, measuring latency. My initial reaction was that I measure the latency at the load balancer. And then we had our first major performance incident, and that was because some JavaScript library took like 40 seconds to render a search text field. It was a single-page application; not that I recommend to do that, but that's just what it was.

Alex Bramley: That was what people did at that time.

Sven Johann: Yes, exactly. And then I was like "Okay, load balancer - that didn't do the trick." So basically, we cannot only do load balancer, we also have to do some client-side measuring. So how do you approach the problem of measuring performance? How do I get started? What possibilities do I have? And how should I incrementally improve measuring performance?

Alex Bramley: To this specific question - I think we touched a bit on it previously, when you were asking about measuring latency before... The thing you want to avoid when measuring is the variability. The real problem comes in when your SLI has a lot of variability in it, and that means you can't set a good static threshold that says "These things are good and these things are bad."

Alex Bramley: There's no one correct way to measure latency, but the key thing to avoid is high variability in the underlying SLI. As you've discovered, there are advantages and disadvantages to most approaches. A lot of this comes down to making engineering decisions about what trade-offs you want to make. Load balancer metrics - they were a solid foundation, they were a good place to start, especially if you're running in the cloud, because you often have these metrics already being recorded for you by the cloud load balancer.

Alex Bramley: This gives you some historical data to look at, so you can say "Okay, I know my application performance tends to vary between these boundaries, and that means if I set my threshold slightly above those boundaries, then I'll know I can meet my SLO most of the time, and I will catch any regressions."

Alex Bramley: The load balancers - the cloud ones especially - will capture the serving time of the entire request, without incorporating any of the variable roundtrip latency between your users and this serving environment. As I've explained before, we go into this in a reasonable amount of detail in The Art of SLOs. The variance makes it very difficult to set an SLI threshold.

Alex Bramley: But as you've found, that doesn't cover client-side stuff. If you happen to do something that incurs a lot of latency in the render pipeline or in other parts of the serving of the request, you can't see that, and therefore you're not able to capture your users' experience correctly. But you don't wanna measure the entire request from the client-side all the way through the internet to the servers and back again, because that variability from wherever your client happens to be on the internet is a real problem for setting a meaningful threshold.

Alex Bramley: So the best way to go about this is probably to have two latency SLIs. I think I mentioned this earlier as well, if you can have one that captures the client-side render time, so it's from the time that the client receives the response from the server to the time where it's done -- what was that phrase you used? The largest paint, or something like that...

Sven Johann: Yes, the largest contentful paint.

Alex Bramley: Yes, that sounds like a great metric to have. That's one latency SLI. The second latency SLI is how quickly the server could take the request and render results. Then you can combine the two of them to give you a good picture of how long it takes for the client to render things, as well as for your service to serve the things, but without the latency in the middle, which you can't really do meaningful things about unless you're a large company or want to pay Cloudflare or other content delivery networks a large quantity of money to get your service and have some caching close to your users.
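A sketch of what the two separate latency SLIs might look like: each one is the fraction of events faster than its own static threshold, and the network time in between is deliberately left out. The sample values and both thresholds are invented for illustration; in practice the server samples would come from load balancer metrics and the client samples from browser instrumentation.

```python
from typing import Sequence


def latency_sli(samples_ms: Sequence[float], threshold_ms: float) -> float:
    """Fraction of events that completed within the threshold."""
    if not samples_ms:
        return 1.0  # no events, nothing was slow
    fast = sum(1 for sample in samples_ms if sample <= threshold_ms)
    return fast / len(samples_ms)


# Hypothetical server-side serving times, as a load balancer would see them.
server_ms = [120, 180, 90, 450, 150]
# Hypothetical client-side render times (response received -> largest
# contentful paint), including one 40-second render.
client_render_ms = [800, 1200, 40_000, 950, 1100]

server_sli = latency_sli(server_ms, threshold_ms=200)            # assumed threshold
client_sli = latency_sli(client_render_ms, threshold_ms=2_500)   # assumed threshold

print(f"server latency SLI: {server_sli:.0%} of requests under 200 ms")
print(f"client render SLI:  {client_sli:.0%} of renders under 2.5 s")
```

The 40-second render is invisible in the server-side numbers and only shows up in the client-side SLI, which is the gap the second indicator is there to close.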

Sven Johann: Okay, so now I know how to measure the latency... The other thing I want to measure is availability. I find that equally hard. I could look at an HTTP result code, for example, at the load balancer... But that wouldn't catch bugs like a button disappeared, or shrunk, it shows some wrong data, or it's the wrong format, it expects JSON but it's something else, or it's not correct JSON... How do you approach a problem of measuring availability? How do I get started and what possibilities do I have, and how should I incrementally improve here?

Alex Bramley: Yes, so I love HTTP response codes, because they're simple and quick. They're the thing that we use in The Art of SLOs as the kind of canonical, easy thing to introduce people to the idea of how to measure availability... But you're right, they don't capture everything. They're really problematic. The 500 is the server's opinion on whether it's served the correct response... But due to bugs in software or any other kind of thing, it may send an OK response header with, for example, the wrong user's data, or just a blank HTML body, or whatever... Essentially, at that point your SLI is lying to you, because it's not an accurate representation of user experience.

Alex Bramley: Again, there's not really a single right answer, because you need to figure out where the gaps in your coverage are. What is the delta between what your SLI is measuring and what your user is experiencing, and how can you meaningfully reduce that.

Alex Bramley: In the specific case of HTTP response codes, they're quick to get started with, because again, just like the latency if you're using your cloud load balancer, you're gonna have metrics already, and you can see what your performance is already... And the nice thing is they're also conveniently real-time. You can get an instant response from your server and you can instantly measure it to say "Okay, what's happening right now with my service?" And for many scenarios, this can be good enough, especially if you're just looking to trigger an instant response. If you're serving 100% 500's to your users, you'll want to do something about that quickly and you want to know something about that quickly. 500 is kind of categorically bad. If you're serving 100% blank responses to your users, that is also bad, but it's harder to detect.
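A minimal sketch of an availability SLI built from response codes alone. Treating every non-5xx response as "good" is an assumption made here for illustration; whether 4xx responses should count as good is a decision each team has to make for its own service.

```python
from typing import Iterable


def availability_from_status(codes: Iterable[int]) -> float:
    """Good events / valid events, where 'good' means 'not a 5xx'."""
    codes = list(codes)
    if not codes:
        return 1.0
    good = sum(1 for code in codes if code < 500)
    return good / len(codes)


# Hypothetical sample of load balancer logs.
observed = [200] * 97 + [404] + [500, 503]
print(f"availability: {availability_from_status(observed):.1%}")

# The blind spot: a 200 with a blank body still counts as good here,
# because this SLI never looks inside the response.
all_blank_pages = [200] * 100
print(f"all-blank responses: {availability_from_status(all_blank_pages):.1%}")
```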

Alex Bramley: If you need more coverage of that kind of thing, you generally need to instrument a client or have some kind of synthetic client, so that you're actually introspecting the response; you're having something that has an understanding of what the correct response looks like, and is telling you if the response it's receiving when it sends requests to your server isn't correct.

Alex Bramley: Often, these synthetic clients can be harder to write, but the goal with one of those is that you try and follow the actual journey that your user is trying to take with your service. Try and follow the same kind of pattern of requests as a user would send when they're talking to your service, and then make sure that each of the responses that your service sends back across an entire set of actions... Like, taking a slightly higher-level view than the single request here. So you try and go through the entire buy flow process; you load up a search page, then try and search for a result, then go to the details page for that, and then try and add it to your cart, and then try and check out, and then have some kind of special thing with your system where the synthetic client could use a fake credit card number and a fake name, and that would not actually execute the purchase in the system, but it would do all the things necessary to test if that worked, and then send back a correct result... And then you'd have an SLO based on that.

Alex Bramley: So if the synthetic client that tried to do this once a minute was green, then you know that your users are most likely able to buy things on your store and they can go through all of the steps they need to take to buy things on your store.
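To make the journey idea concrete, here is a minimal Python sketch of a synthetic client that walks a buy flow and inspects each response rather than trusting the status code alone. Everything specific in it is an assumption for illustration: the `Response` shape, the injected `fetch` function, the paths, the test card marker, and the strings it looks for are all hypothetical, not anything described in the conversation.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Response:
    """Hypothetical response shape: a status code plus the response body."""
    status: int
    body: str


# A step is a name, a request path, and a check that inspects the response.
Step = Tuple[str, str, Callable[[Response], bool]]


def run_journey(fetch: Callable[[str], Response], steps: List[Step]) -> bool:
    """Return True only if every step of the journey gets a correct response."""
    for _name, path, looks_correct in steps:
        response = fetch(path)
        # A 200 with a blank or wrong body still counts as a failure here.
        if response.status != 200 or not looks_correct(response):
            return False
    return True


# Hypothetical buy flow; paths and expected strings are made up for this sketch.
BUY_FLOW: List[Step] = [
    ("search", "/search?q=widget", lambda r: "results" in r.body),
    ("details", "/product/42", lambda r: "Add to cart" in r.body),
    ("add_to_cart", "/cart/add/42", lambda r: "1 item" in r.body),
    ("checkout", "/checkout?card=TEST", lambda r: "order confirmed" in r.body),
]
```

Running `run_journey` once a minute and recording the boolean result would give one good or bad data point per minute for an SLI. Passing `fetch` in as a parameter keeps the sketch independent of any particular HTTP library.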

Sven Johann: Yes. We do have synthetic clients, so we measure -- as you said, it's quite a lot of work to implement them...

Alex Bramley: Yes, very much so.

Sven Johann: The thing is, it looks like a good combination to have synthetic clients which probably cannot catch everything. If you only have problems for users which are from France, or something like that, you cannot catch everything, so you probably need a combination of synthetic clients and measuring directly at the load balancer.

Alex Bramley: Absolutely. And the biggest problem with synthetic clients, a major drawback in my opinion, is that you can't measure really really high availability with them, because they only -- like, unless you have them sending hundreds of QPS to your service, you don't have enough data points across a 28-day window to really measure high availability. I'm not gonna try and do maths while talking live, because there's always a danger, but there's 1440 minutes in a day, and there's 28 days in an SLO window... So measuring 4 nines - it only takes a couple of request failures across a whole 28-day window to put you out of SLO for the entire 28 days. That's not the level of sensitivity you want, really...
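The arithmetic Alex skips over can be sketched in a few lines of Python. It assumes one synthetic probe per minute over a 28-day window, as in the conversation; the function and variable names are illustrative only.

```python
MINUTES_PER_DAY = 24 * 60  # 1440
WINDOW_DAYS = 28


def allowed_failures(target: float, probes: int) -> float:
    """Number of failed probes the error budget allows for a given target."""
    return (1 - target) * probes


# One probe per minute for the whole window.
probes_in_window = MINUTES_PER_DAY * WINDOW_DAYS

for target in (0.99, 0.999, 0.9999):
    budget = allowed_failures(target, probes_in_window)
    print(f"{target:.2%} target: {probes_in_window} probes, "
          f"about {budget:.1f} failures allowed")
```

With 40,320 probes in the window, a 4 nines target leaves room for roughly four failed probes, so a fifth failure anywhere in the 28 days puts the SLO out of compliance.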

Sven Johann: Yes. Another thing I see with synthetic clients is if you're in multiple regions... Let's say we have a data center in Europe, and one in the U.S, and one in Australia, maybe China, maybe something in South Africa... And let's say your load is not equally distributed. Lots of customers are in China -- 80% are in China and the U.S, and then you have a small portion in Europe, and a small portion in Australia, and a very little portion in South Africa... And if I would need to do some weighting with my synthetic clients -- or would I measure each data center alike? Like, I look at the availability per data center, and not for the whole application? Does it make sense what I'm saying?

Alex Bramley: It does. I think you'd probably want to just kind of agglomerate everything together. What kind of problems are you running into when you're -- what's the problem with synthetic clients not sending... Is it because you're expecting so much more traffic to come from China and you want your synthetic client near China to be sending more traffic, to be more representative?

Sven Johann: Yes, exactly. Let's say in China we have ten times more customers than in South Africa, and then if I want to look at the overall availability of the whole application across the globe, then I cannot just -- I can look at all the requests from all the data centers when I look at the load balancer, but when I send synthetic requests, the synthetic request should somehow represent in the overall picture the part of the region that I'm looking at. So if China has ten times more traffic than South Africa, in order to do some weighting I need to send ten times more synthetic traffic to China than to South Africa. That's the thing.

Alex Bramley: I'm not sure that's really -- so the thing to remember is that synthetic users aren't real users, which is one of the other reasons why you need to also have your load balancer metrics. Because they're not real users, they measure a proxy for user experience; they're not measuring the actual user experience, again, just like the other SLI's we have. So I would say that it's fine for a synthetic client near each of your data centers to be sending requests to that data center and be sending a constant level of requests... Because it's a separate SLI, and it will give you an idea of your availability. Nothing is 100% accurate; 100% is the wrong target, as with SLOs themselves.

Alex Bramley: So you're gonna have two different signals. One of them is gonna tell you whether your users are receiving bad response codes, one of them is going to tell you whether your synthetic clients can follow your user journeys... That's good enough coverage, and I don't know whether you need to weight the synthetic client--

Sven Johann: Okay, okay...

Alex Bramley: What problem are you trying to solve with the weighting that isn't solved already by the synthetic clients?

Sven Johann: I want to understand the overall availability of a certain feature across the globe, with synthetic clients. And if only, let's say, South Africa has a problem and all the other regions do not have a problem, you could say that my availability, looking at the synthetic clients, would be 99.9%. But actually that's not true; it's higher, because in all the other regions I have more customers who are accessing the page.

Alex Bramley: This is specifically for a problem that is only caught by the synthetic client failing, and not the...

Sven Johann: Exactly, yeah.

Alex Bramley: Okay, I can see that. So it would artificially -- so I would say that instead of trying to deal with that by having the synthetic clients send more requests, I would try to deal with that when you aggregate the data center availabilities to a global availability for your service. Instead of aggregating the SLIs together by summing the total number of requests sent and the total number of good replies for each synthetic client, turn each one into a data center-level SLO, so you get your South-African availability, you get your Chinese availability, you get your American availability, and then you weight those when you create an aggregate SLO. I think that will -- a) it's gonna be easier, because you don't have to do different things with your different synthetic clients, so it means operationally, when you're running these in production, they don't look different, which means that life is easier for you. And b) that kind of maths is the kind of thing that your monitoring system should be able to do relatively easily. You just have a table of constants which you multiply each of the availabilities by before you then sum them to the global level.
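The "table of constants" approach can be sketched as a weighted average in Python. The data center names, availabilities, and traffic weights below are invented for illustration; in practice the weights would presumably come from real traffic shares seen at the load balancer.

```python
from typing import Dict


def weighted_availability(per_dc: Dict[str, float],
                          weights: Dict[str, float]) -> float:
    """Weight each data center's availability by its share of real traffic."""
    total_weight = sum(weights[dc] for dc in per_dc)
    return sum(per_dc[dc] * weights[dc] for dc in per_dc) / total_weight


# Hypothetical per-data-center availability from identical synthetic clients.
availability = {
    "china": 1.0, "us": 1.0, "europe": 1.0,
    "australia": 1.0, "south_africa": 0.95,
}
# Hypothetical table of constants: relative share of real user traffic.
traffic_weight = {
    "china": 50, "us": 30, "europe": 10,
    "australia": 8, "south_africa": 2,
}

unweighted = sum(availability.values()) / len(availability)
weighted = weighted_availability(availability, traffic_weight)
print(f"unweighted: {unweighted:.4f}, weighted: {weighted:.4f}")
```

In this made-up case a problem confined to the smallest region drags the unweighted figure down to 99%, while the weighted figure stays at 99.9%, which is closer to what the bulk of users experienced. The synthetic clients themselves stay identical everywhere.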

Sven Johann: Yes, okay. Cool. That was...

Alex Bramley: It's a difficult problem...

Sven Johann: ...a deep dive into a very specific problem. I don't know how many listeners will have that problem...

Alex Bramley: So you can generalize it to say -- your problem there is trying to aggregate the availabilities, and aggregating SLOs is quite a difficult thing. It's something I don't think Google has great answers for in a lot of cases... So there's a few different strategies for it. You can aggregate at the SLI, you can aggregate at the SLO, and another practice we've taken a couple of times is to turn everything into a bad minute thing. So each individual SLO, whether that SLO is good or bad, is just kind of turned into a boolean, and then your aggregate SLO is either good or bad, depending on how many of the individual components are good or bad.

Alex Bramley: So you can say "For this minute - this minute is a good minute if all of the component SLOs are performing well. And this is a bad minute if any two of the component SLOs are bad."
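A rough Python sketch of the bad-minute idea follows. It assumes each component reports a boolean per minute, and it treats a minute as bad once a configurable number of components are bad; the threshold of two mirrors the example above, and the component names are hypothetical.

```python
from typing import Dict, List


def minute_is_bad(component_ok: Dict[str, bool], bad_threshold: int = 2) -> bool:
    """A minute is bad when at least bad_threshold components miss their SLO."""
    bad_components = sum(1 for ok in component_ok.values() if not ok)
    return bad_components >= bad_threshold


def aggregate_availability(minutes: List[Dict[str, bool]],
                           bad_threshold: int = 2) -> float:
    """Fraction of minutes in the list that were not bad."""
    good = sum(1 for m in minutes if not minute_is_bad(m, bad_threshold))
    return good / len(minutes)


# Hypothetical per-minute component results (True means the component was fine).
minutes = [
    {"search": True, "cart": True, "checkout": True},
    {"search": False, "cart": True, "checkout": True},
    {"search": False, "cart": False, "checkout": True},
    {"search": True, "cart": True, "checkout": True},
]
print(aggregate_availability(minutes))                   # threshold of two
print(aggregate_availability(minutes, bad_threshold=1))  # stricter variant
```

Choosing the threshold is a judgment call: a threshold of one makes the aggregate as strict as its weakest component, while a higher threshold tolerates isolated component problems.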

Sven Johann: Okay, cool. In the beginning of your answer to my synthetic client weighting question you mentioned -- no, it was actually the answer of the synthetic clients in general, when you said it's pretty coarse-grained, if you have a measurement window of 28 days... We haven't talked about measurement windows. How do I calculate if I met my SLO or not? You already gave a few examples here, but I want to discuss now more what measurement window should I look at in order to say "I met my SLO or not."

Alex Bramley: Well, I keep throwing that number 28 days around... We've found this one tends to work well, because -- like, it depends what you're trying to achieve with your SLO, but generally, one of the goals is to set up that feedback loop that we discussed earlier about trying to modulate your development velocity depending on your service's reliability. So your developers can go gung-ho on features, do whatever they want, as long as you've got plenty of error budget to burn. But when you start getting low on error budget, you need to start tilting the overall stance of your engineering organization towards fixing the bugs, burning down the post mortem action items, pushing for more reliability, rather than going all out on features.

Alex Bramley: The thing about this kind of changing of engineering stance is your organization can't react to that quickly. You can't change what you're doing every week. That tends to be too short of a time. I know a lot of people do Agile, and often there's like two-week sprinting. If you do two-week sprinting, then maybe just 14 days rather than 28 days can be good. One of the inputs to your sprint planning could be "Okay, so over the last two weeks were we in SLO or not? If we weren't in SLO the last two weeks, then hey, the next sprint is gonna be more reliability work, rather than less reliability work."

Alex Bramley: Four weeks is two sprint cycles, but it's also approximately a month. Months are a nice human construct, and it works well with organizations to think about things in terms of months and quarters, because the business tends to operate in months and quarters... But the problem with using actual months is that they aren't all the same length. Often, it means that there are five weekends rather than four weekends in a window if you have a month... And this is another source of the variability that I've said that you don't really want to have in your SLOs, especially if your business does lots of business during the week, and it is dead during the weekend. That kind of variability where you've got five of these things in the window rather than four can make your SLO jump up and down a lot.

Alex Bramley: Also, if you aren't doing super-agile daily delivery type things, if you have one release build a week, then you'll find that your release day is a source of SLO error budget burn, and if you have five release days in your window rather than four, then it's gonna look worse, because you naturally have an increased source of errors.
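The four-versus-five effect is easy to check with Python's standard `calendar` and `datetime` modules. The sketch below counts how often a given weekday (a Saturday, or a weekly release day) falls in each calendar month compared with a 28-day window; the year 2021 is an arbitrary example.

```python
import calendar
from datetime import date, timedelta


def weekday_count_in_month(year: int, month: int, weekday: int) -> int:
    """Count how often a weekday (Monday=0 ... Sunday=6) falls in a month."""
    days_in_month = calendar.monthrange(year, month)[1]
    return sum(1 for day in range(1, days_in_month + 1)
               if date(year, month, day).weekday() == weekday)


def weekday_count_in_window(end: date, window_days: int, weekday: int) -> int:
    """Count a weekday in the window_days days ending on, and including, end."""
    return sum(1 for offset in range(window_days)
               if (end - timedelta(days=offset)).weekday() == weekday)


# Saturdays per calendar month in an example year: a mix of fours and fives.
per_month = [weekday_count_in_month(2021, m, calendar.SATURDAY)
             for m in range(1, 13)]
print(per_month)

# Any 28-day window contains exactly four of every weekday.
print(weekday_count_in_window(date(2021, 12, 1), 28, calendar.SATURDAY))
```

Because 28 is a multiple of seven, every 28-day window holds the same number of weekends and the same number of weekly release days, which removes that source of jumpiness from the SLO.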

Alex Bramley: So that's why I kind of push 28 days as a good starting point in The Art of SLOs... But you can use a lot of shorter windows as well. If your goal isn't to change the engineering stance of your organization but it is to, say, trigger some kind of operational response because things are on fire, then using a window like an hour or 12 hours can be really helpful, because those respond quickly to the periods of increased errors.

Alex Bramley: You can look at your error budget burn over, say, a one-hour window, and if it's many multiples of your error budget in an hour, you probably wanna have someone go and do something about that before your 28-day error budget is gone and you have to then go and do the whole changing of stance and making people work more on bugs and reliability.
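The "many multiples of your error budget in an hour" check is usually expressed as a burn rate. Here is a minimal Python sketch, assuming a request-based SLI with counts of bad and total requests over the short window; the example numbers and the assumption of steady traffic are illustrative, not from the conversation.

```python
def burn_rate(bad_events: int, total_events: int, slo_target: float) -> float:
    """Observed error rate as a multiple of the error rate the SLO allows."""
    if total_events == 0:
        return 0.0
    error_rate = bad_events / total_events
    allowed_error_rate = 1 - slo_target
    return error_rate / allowed_error_rate


def budget_fraction_consumed(rate: float, window_hours: float,
                             slo_window_days: int = 28) -> float:
    """Share of the long-window budget used up, assuming steady traffic."""
    return rate * window_hours / (slo_window_days * 24)


# Hypothetical hour: 200 bad requests out of 10,000 against a 99.9% target.
rate = burn_rate(bad_events=200, total_events=10_000, slo_target=0.999)
used = budget_fraction_consumed(rate, window_hours=1)
print(f"burn rate: {rate:.1f}x, 28-day budget used in this hour: {used:.1%}")
```

A burn rate of 1 means the budget would last exactly the full 28 days; a rate of about 20, as in this made-up hour, would exhaust it in well under two days if it continued, which is the kind of signal worth sending someone to look at.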

Sven Johann: If I have a 28 days error window, that's obviously a rolling window, I guess. It's not a static one, it's always rolling... And then I'm wondering how some people just want a management report, "How did my service do over the past quarter or month?" How would I solve that? Because obviously a month or a quarter, as you explained, is not really a good measurement window.

Alex Bramley: So reporting tends to be something that you do separate to this. People should be willing to react more quickly than waiting for the next report if things are bad. That's the nice thing about a rolling window, is that it's better because it provides that continuous measurement and it better matches the experience of your users. But with a static window for the purposes of choosing what your organization is doing, your error budget resets, like "Hey, it's the 1st of December today." Let's say we had a massive outage yesterday; today we have a fresh, clean slate. No error budget burned at all. The last month looks terrible, just a sea of red, but today everything's fine. That's not how your users see it. They're still gonna be snarling from yesterday's outage; they're still gonna be moaning on Twitter, or whatever.

Alex Bramley: When you use a static calendar window, you're not representing your users' experience as well as you could. But I understand that the management chain - they want reports, they wanna see "Okay, November was bad. Why was November bad? Do I need to dig into this? Do I need to go and have a meeting with somebody?" So you can absolutely build static reporting... On the 1st of December you have a cron job that runs and goes and crawls all of the logs from November and produces a very detailed report for the consumption of higher-up people in your organization. That's fine. By all means, do that. But I don't think it's good for the continuous measurement, for the things that you make the day-to-day engineering decisions on. I think the continuous measurement is an important thing to have.
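The difference between the two windows on the day after an outage can be sketched in Python. The daily request counts below are invented: a clean November, a bad outage on the 30th, and a clean 1st of December. The data layout and function names are assumptions for illustration.

```python
from datetime import date, timedelta
from typing import Dict, Tuple

# Hypothetical daily counts: (good_requests, total_requests) per day.
DailyCounts = Dict[date, Tuple[int, int]]


def availability(counts: DailyCounts, start: date, end: date) -> float:
    """Availability over the inclusive date range from start to end."""
    good = total = 0
    day = start
    while day <= end:
        day_good, day_total = counts.get(day, (0, 0))
        good += day_good
        total += day_total
        day += timedelta(days=1)
    return good / total if total else 1.0


def rolling_availability(counts: DailyCounts, today: date,
                         window_days: int = 28) -> float:
    """Availability over the last window_days days, including today."""
    return availability(counts, today - timedelta(days=window_days - 1), today)


def calendar_month_availability(counts: DailyCounts, today: date) -> float:
    """Availability from the 1st of the current month up to today."""
    return availability(counts, today.replace(day=1), today)


# Clean November, an outage on the 30th, and a clean 1st of December.
counts: DailyCounts = {date(2021, 11, d): (1000, 1000) for d in range(1, 31)}
counts[date(2021, 11, 30)] = (500, 1000)
counts[date(2021, 12, 1)] = (1000, 1000)

today = date(2021, 12, 1)
print(f"calendar month so far: {calendar_month_availability(counts, today):.4f}")
print(f"rolling 28 days:       {rolling_availability(counts, today):.4f}")
```

On the 1st of December the calendar window reports a clean slate, while the rolling window still carries yesterday's outage. The same `availability` function over the whole of November is all a monthly report job would need.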

Sven Johann: Thank you for listening to this second episode. Please also consider listening to the third and final episode with Alex, where we cover error budget policies, when to improve or not on your reliability, developing and communicating and sharing an error budget policy, and we finally discuss alerting on service-level objectives. Look at your burn rate, not at single problems. See you then!