Conversations about Software Engineering

Conversations about Software Engineering (CaSE) is an interview podcast for software developers and architects about Software Engineering and related topics. We release a new episode every three weeks.

Transcript

Sven Johann: Welcome to a new conversation about software engineering. My guest today is Eoin Woods. Eoin is the CTO of Endava, a global consulting firm. Before working for Endava, he spent many years working on software products, and in corporate IT for finance.

Sven Johann: He's a regular speaker at conferences about architectural topics in general, and about operations and security. He has also published a book on software architecture, which we'll discuss a bit later, at least the operational part... And he's also a contributor to IEEE Software magazine. Actually, that's how I know him; I was a contributor there, too.

Sven Johann: Today we talk about production environments, production-ready software and why this knowledge is so important for software developers. Eoin, welcome to the show.

Eoin Woods: Good morning. It's very nice to be here. Thank you for the invitation.

Sven Johann: Thank you for coming. Let's start with a few definitions. What is a production environment?

Eoin Woods: That's a great question, because that very much depends where you're standing. For me, a production environment is any environment where, if it's not available, people can't get their work done, and if you lose information out of the environment, that causes somebody a real problem with their work. So that does of course mean that not every production environment is what we think of as a classical one, such as a transaction processing system, or an eCommerce site, or something like that. In our world of software development, if you have an internal JIRA, that's a production system. Because if it's not there, then the software development organization probably can't function properly.

Sven Johann: Yes, that's actually an interesting point, because in one of my projects that was not clear to a lot of people. For them, the production environment was actually only the environment where our software was running. But if GitLab was down, for example, nobody really saw that as a production problem. Obviously, if GitLab is down, and we have GitLab pipelines and stuff like that, nobody can do anything.

Eoin Woods: I think that's quite a common problem. I don't think you find it so much in tech-first firms, but it's quite common in historical, older retailers or manufacturing firms - not to name names, but firms that don't view software engineering maybe as core to their business. They never think that if the software engineering systems aren't there, actually a really important part of the business can't function.

Sven Johann: Okay, so now we know what a production environment is... What is a non-production environment?

Eoin Woods: Well, logically, I suppose I have to invert my previous definition... There are some grey areas. Many people have the idea of pre-production environments. Again, you've gotta think about the fact that if you're using blue-green deployment, for example - or A/B deployment, whatever you call it - where you're using two environments, the productionness (that's not really a word, is it? but you know what I mean) - whether something is production - varies over time. So it's production today, and maybe next Monday it won't be production, because the other one will be running production.

Eoin Woods: So there are some gray areas, but it's all the other things, such as our development environments. We can destroy and recreate those any time we like. What's the classic way of debugging the long-running integration tests? We reset the environment, we run them again, and magically, the problem goes away, because it was probably a problem that we created in the test data, or something like that. So those, I'd say, are non-production environments, because they are totally under our control, we own them, and we can destroy the information in them if it suits us.

Sven Johann: Yeah. If you just throw away the data of the production environment, people get very mad...

Eoin Woods: Yes, they get really unreasonably annoyed.

Sven Johann: Yes. When you said the production environment - well, in your definition of a production environment, when you said "When it's not available, people cannot work", I also remember when I worked for a large insurance company - they had those integration environments, and basically it was really hard to work there without those integration environments. You could, but it was not that easy... And it was not a production environment, but that's kind of a grey area. One of the grey areas you mentioned. We can still work --

Eoin Woods: We probably view those as tools rather than production environments, but actually that's definitely, as you say, a grey area, because they do affect people's ability to get their job done.

Sven Johann: Yes. A question we had lately, and it was an interesting discussion - who actually owns those environments? Who owns production?

Eoin Woods: Well, I'm quite clear on that. It was pointed out to me quite a few years ago now, and it was a bit of a shock when somebody actually pointed it out to me... The organization that owns the system owns production. It doesn't belong to the software development team; the production environment belongs to the business. Now, arguably, everything could belong to the business in some sense, but if you think of it as -- in the environment I work in, we build software for clients, so often we are providing some of the software development environments ourselves; we own those, to a certain degree. Obviously, it's always on behalf of the client, and their views and their needs are always what drives us, but we in some sense own them. The client doesn't need to know if we've recreated a development environment. But the production environment - they absolutely own it, and if we want to remove a piece of data from it, if we need access to it, if we want to recreate something, they have to know.

Eoin Woods: It's much more obvious when there's a sort of contractual relationship, even if it's a partnership; a contract makes it more obvious. But in a big firm, somebody who was very experienced, a production operations manager, took me aside years ago and said to me, "You've gotta remember the people out there. This is their business, it's THEIR environment, not ours." I've carried that piece of advice with me ever since, and it really helps to focus the mind.

Sven Johann: Yes, that's true. In the discussion we had back a few months ago, of course, everybody claimed "It's our environment", because they wanted the responsibility, and the budget, and whatever... But in the end it was pretty clear that the business sponsor, the organization, owns the production environment, and even if it's the operations team or development team, they work on behalf of the project sponsor and the business.

Eoin Woods: I think that's exactly right, yes.

Sven Johann: Yes, so I couldn't agree more. But it leads to some discussions... Okay, so now we know what a production environment is, who owns it, and now our software runs on a production system... Usually, if everything works fine, then everything is fine and nobody's unhappy. But things do go wrong in production environments... So what could go wrong in a production environment, so that it's not usable for its users?

Eoin Woods: Gosh, so many things can go wrong, as I'm sure you're well aware... Some of the things that I've seen go wrong repeatedly, the things that cause people real problems - performance, I've noticed, is something that repeatedly surprises people when you're in the production environment. And perhaps inevitably so. Things to do with the system's performance are often the problem.

Eoin Woods: The system's running in what's probably a less controlled environment, in the sense that, particularly in a large organization, very few systems stand alone; you are connected to a lot of other systems, and their behavior has an effect on your system - which, of course, right through the testing cycle we normally mock out; we replace those systems with highly controlled versions. And of course, in many large organizations you're running on a platform - be it the storage platform, the database, the application platform - that somebody else actually owns and controls. The environment often causes you complications.

Eoin Woods: Things go wrong in production, where, as you were saying earlier, you typically can't just start everything from scratch to solve a problem. I mean, things often go wrong in production in new and unpredictable ways people haven't seen before.

Eoin Woods: And the last one is probably my favorite, which is that security is nearly always much more complicated in the real world than in development and test environments, where nobody cares nearly so much about it. As soon as you're in the real world, actually, one, there are real people who want to attack you; not test scripts. There are real people trying to get their work done that don't quite fit the security model. There are real complications in, for example, the data that you're trying to control access to. There are real, unexpected situations that go wrong in the business, that mean they want to bypass a security control temporarily, possibly for a good reason, but that was probably never predicted when the system was put together. That's quite a lot, I guess... But the things that I see again and again are performance in the broad sense often creates surprises, the environment you're running in creates surprises, the kinds of failure you can encounter in production tend to be more numerous than the ones outside, and security is dramatically more difficult in many operational environments than it is in the development environment.

Sven Johann: Yes, I think that's a very nice categorization - environment, security, performance, and let's say functional correctness... Is that--

Eoin Woods: Yes, failures of all sorts, I suppose. To give you an example of the failures - years and years ago I was involved with a system in a big bank, a system with a tricky history, but we had put it in a very good place, where the business were once again happy. It was one of those systems where, if people forgot about it, it was working perfectly, because it was one of those systems that was critical to lots of things, but nobody ever thought about. Suddenly, we had a lot of performance problems. The system, by its very nature, had to do a lot of processing in big batches; that's just the way it was. And over a couple of nights we had this thing -- we were getting paged and alarmed at 3 o'clock in the morning London time, our jobs were running too long, our message queues were starting to overflow... We felt something was going wrong.

Eoin Woods: When we looked into it, to cut a long story short, what had actually happened was that we were sharing massive storage arrays with another system in the bank, which had suddenly had a huge peak in workload in its environment; nothing to do with our business line, nothing to do with our databases, nothing to do with our data. It's just that all of our data access between something like 1 AM and 3 AM London time suddenly dramatically slowed down... So it looked as if our system had a performance problem.

Eoin Woods: That's the kind of environmental thing that, as we said afterwards, we were never going to predict that that was going to happen. And it's not something we really could have tested for.

Sven Johann: Yes, I think the unpredictability, and how do you prepare - that's a totally different story, I think. Or not totally different, sorry. What I mean is that's something we can talk about a lot, and actually I have a question later, where we can discuss how we can prepare for the unpredictable events.

Sven Johann: When you gave a talk about this, I always felt "Yes, of course, performance, and environments, and bugs... That's usually the case. That's something I can relate to." But for some reason - I don't know why - I never really thought about security. And then I was like "Yeah, obviously security is a big thing", especially these days, when Garmin for example was down for two weeks because - we don't know; some security issue. Probably some ransomware attack.

Sven Johann: I think it's quite interesting now that we have all those ransomware attacks, for example, or lots of security breaches, because that moves security and its effect on system availability very much to the top of an organization. I was talking two years ago with a client, and we had the question with the product management team, when it came to availability, whether we should be able to evacuate the customers from (let's say) Europe to another region if we have a big problem, let's say, in Europe, whatever that problem is. AWS could be down, or you could have a ransomware attack, or something like that...

Eoin Woods: Yes, absolutely.

Sven Johann: And two years ago product management was like "Yeah, no... I mean, no. Nah..."

Eoin Woods: "It's never gonna happen."

Sven Johann: It's never gonna happen, and if it happens, especially when AWS is down, the hub of the internet is down, and stuff like that... And now, C-level people are asking "What are we doing if we have a ransomware attack? Is our system still operating? What happens if we have a major data breach? How do we detect that? How can we protect ourselves from that?"

Sven Johann: Security requirements - or at least that's my feeling - especially when it comes to the availability of a system, have come quite to the top in the last 15 months.

Eoin Woods: Yes, I agree. My background in security is unusual, in that I'm not a security engineer, but I've worked in secure systems environments. So I've sort of worked with security and around security people for quite a long time; I'm very conscious that, like other specializations, you often get the security guys and the dev guys, and a bit like traditionally with ops, there's a yawning gap in between. So I've been trying to give talks, mainly to dev people, about security for like 15 years. And what's just been amazing is how 10-12 years ago I was getting 20 people in a hall that could sit 100, and when I said "Who here is a security engineer?", ten of them put their hands up. Basically, I was talking to the people who already knew it.

Eoin Woods: Now, if you give a security talk, the entire room is packed out, and I say "Who is a security engineer?" If there's a hundred people, five hands go up. There's much, much more interest in security, and how to apply it, and what's important in the mainstream community. And that's just so fantastic to see. It's such a turnaround, in a relatively short time.

Sven Johann: Yes, true. I'm also organizing conferences -- I mean, not organizing, but I'm on the program committee, or a program committee chair... And 5-6 years ago, when we thought "Hey, it would be a good idea to have a track on security and privacy", it was really hard to get speakers from that area to speak at, let's say, regular software development conferences... And also to get people in the room. Nobody was there. I remember -- maybe it was three years ago, and we had a track on Rugged. That was kind of a hip security thing a couple of years ago...

Eoin Woods: That's right...

Sven Johann: ...and basically every room was totally packed. Everyone was interested in how to do secure coding, how can I -- whatever. It's interesting to see that the interest is really rising a lot on security.

Eoin Woods: Yes. My big theory is that this is what the industry does, though... The thing I'm afraid of is that there's going to be a wave of interest in security now, and then in two years' time everyone will have moved on to something else... Whereas people mustn't do that. Maybe they can't keep security as their only interest, but they've gotta hang on to how important security is, even in a couple of years' time, when it's not quite so cool anymore... Because it currently seems to, as you say, be very much what people focus on.

Sven Johann: Yes, good point. I think two years ago I talked to Sam Newman, and we had a podcast on microservices security... And Sam said his hope and hopeful prediction is that security will become something like unit testing. A couple of years ago, or decades ago (maybe better), no developer or only a few developers really cared a lot about testing - at least that's what he said - and then, 15 years ago (or a little bit more), the test-driven development book came out, and now every developer thinks about testing, and writing test cases, and stuff like that. His hopeful prediction was that it will be kind of the same with security, that most developers will have this built-in security know-how.

Eoin Woods: That's good to hear. I hope he's right.

Sven Johann: Yes. Okay... When I look at the things which can go wrong in production, I see functional correctness, security, performance, capacity, maybe availability, reliability... Those are the bigger architectural qualities we have to care about. And we rarely have good -- I mean, it would be nice to have at least some requirements, but we rarely have really good requirements for those operational concerns. How do I get good operational requirements? You don't have to answer everything, because that probably goes too far, and also, I have to say that I did a recording with Alex Bramley from Google, where we talked in-depth about performance and availability requirements... But in general, the question is "Why don't we get good operational requirements?" Or often we don't get any operational requirements at all. Why is that the case?

Eoin Woods: That's a great question, and I have the same observation. They're very difficult to get. I think it's part of a broader problem, which - speaking from a software architecture perspective - is that it's difficult to get good quality attribute requirements; non-functionals, as we used to call them. In general, they're quite difficult things for people to think about, and people find it hard to give you anything other than very trivial answers: the system must be usable, or it should be available all the time, or it should definitely be secure. Beyond that, people find it quite hard to think it through.

Eoin Woods: In the operations area, I think it's probably exacerbated slightly further by the fact that historically, the people who operate the system have not had very much to do with building the system, so they don't have very much background, or very much experience or education in thinking about requirements. Often the people who view themselves as being responsible for some of these qualities, such as performance, scalability and security, were actually seen as part of the infrastructure team - the infrastructure design/architecture team - but they didn't operate it either. So they were sort of thinking about what they might call non-functional requirements. And then you have the development team, who are really focused on functions.

Eoin Woods: So I think, one, it's just hard; non-functional quality attributes are hard. And two, you've always had this fragmentation in responsibility, where it's always someone else's problem, which is why it's doubly hard.

Eoin Woods: A general approach to specifying quality attribute requirements which I've used for a long time and I think really does work is to use scenarios; scenario-based definition. And you can also use scenarios for exploring whether a system can meet a quality attribute or not. This is work which goes back a very long way - I think it's probably the Software Engineering Institute (SEI) at Carnegie Mellon University in Pittsburgh that has done the most to popularize this approach; I'm not sure if they came up with it, but they've done a lot to make it visible... And the idea is that rather than simply saying "We have to have scalability, and here's a number", you actually tell people a story, so that they understand the implications of what they're asking for.

Eoin Woods: So you can say that the stimulus on the system is that it's Black Friday and there's this transaction load, and the transaction load is increasing by 40% every five minutes, something like that. And then you say "Well, in response, what should the system do?" And you actually need to kind of go through from an abstract perspective and say "Well, the system will accept a load up to the following level and rate of increase, after which it will load-shed. When it load-sheds, here's what the end user will see."

Eoin Woods: And then there's normally also a measurement section of the scenario, which is "Here are the critical metrics that we would use to understand this scenario in terms of what we need, and how the system should respond." And the thing that I've found over the years is that if you just write down quality attributes as simple statements, people just sort of gloss over them and go "Yes, that sounds okay", because they don't really think what the implication of them is. If you write them down as scenarios or stories, it brings them to life, especially for less technical users such as acquirers, business sponsors, people like that, as to what they're really asking for.
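To make the shape of such a scenario concrete, here is a minimal sketch of how one might be written down - the Black Friday figures, field names and thresholds are purely illustrative, not taken from any real system:

# Illustrative scenario-based quality attribute requirement.
# All numbers and wording are made-up examples.
scalability_scenario = {
    "quality": "scalability",
    "stimulus": "Black Friday peak: transaction load increases by 40% every 5 minutes",
    "environment": "production, normal staffing, all downstream systems available",
    "response": (
        "System accepts load up to 5,000 transactions/second; beyond that it "
        "load-sheds by rejecting non-essential requests with a friendly error page"
    ),
    "measurement": [
        "95th percentile response time stays under 2 seconds while load-shedding",
        "no accepted transaction is lost",
        "an alert is raised within 1 minute of load-shedding starting",
    ],
}

Written this way, the stimulus, the expected response and the measurements give business sponsors something concrete to agree or push back on.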

Eoin Woods: And for operations, we can do the same thing - we can use stories to try and bring to life for the operations group how it's going to be to operate this system, so what will we be expecting them to do under certain conditions, what operational facilities or tools should be there.

Eoin Woods: The other thing which causes problems, I'm sure - and knowing your background, you've seen it many times in large companies - is where the operations team aren't very involved in the whole process of developing the system, so they never get the chance in the stand-up to put their hand up and go "Hang on a minute... I don't think we're gonna be able to operate that, because if we haven't got monitoring, how would we know that this happened?" And then the team go, "Oh yeah, that's a really good point. We didn't think of that." Because their voice isn't in the room very early, it can be a very long way down the delivery lifecycle, way past any point where people are thinking about requirements, before you actually get any input from anyone who knows anything about operations.

Eoin Woods: So these are the two things I like to use - use scenarios, so tell people stories if you like, but also, see if you can persuade people who really know about operations to come and be part of the definition and the building of a system, not just taking it when it's finished.

Sven Johann: Yes, I also really like the scenario approach. And to communicate the scenarios and help people understand what a scenario means, I think a good example is availability. If you have scenarios for, let's say, availability and non-availability, the reaction is always pretty interesting... Because usually -- I mean, you cannot ask someone what level of availability they want, because usually they would say 100%, which is of course not possible... But if you talk to them and you show that a certain amount of availability or non-availability is maybe okay, and also say "Okay, we have here a scenario where the system is not available, for example, and that's probably still okay... And if that's not okay for you, then of course you have to pay more money to get higher availability, for example."

Eoin Woods: Exactly. Actually, bringing money into it can be key, because you can show people that it's a non-linear relationship. You don't get 10% more availability for 10% more money. A guy I used to know worked for Tandem, when there was a Tandem, before they became part of Compaq, and then HP... And that's how Tandem used to have that conversation with their clients, because Tandem, of all people, knew that getting that last 1%, or tenth, or hundredth of a percent got exponentially more expensive. And that's what they did - they said "Well, we can give you 100% availability. It will be 20 million dollars", and the customers would just look at them as if they were mad... And then they went, "But we can get you 99% availability for two million", and the customers went "Oh, that's a lot better." Now they could have that trade-off discussion between availability and cost.

Sven Johann: Yes, that's true. There is an interesting Google paper, I think in ACM Queue, where they have this internal rule of thumb that each additional 9 costs a factor of ten more.

Eoin Woods: That seems intuitively about right to me, yes.

Sven Johann: Yes, so that's their kind of rule of thumb. And was it Frank Buschmann who once said in a talk that at Siemens they usually try to answer with money; everybody can show up and say "Our telephone switch should handle 40 million parallel calls", and then they would say "Okay, you can have that, and it costs you 3 billion." And then people are shocked. "But if you want to have what you would expect to have, then we can give you that for 5 million", or something like that.

Eoin Woods: Yes. That really helped.

Sven Johann: Yes, yes. Money is always -- everyone understands that. And also, I think it reduces -- if you have that conversation, it reduces this kind of friction, because if you don't have operational requirements, usually people freak out, even if everything is okay. For example, you have an incident, and people really -- I heard quite often, now luckily less often, "we want to be as close as possible to zero incidents", or something like that. And for every incident, people freak out. But if you have a requirement, and you have, let's say, three incidents of 20 minutes per quarter or month, then everyone is still relaxed when you have an incident... Because it should be fine to be down for, let's say, three hours a month; and now we consumed like 1.5 hours with all three incidents, so everything is still okay. Nobody needs to freak out. So it also helps to be a little bit more friendly with each other.

Eoin Woods: Yes, I think that's a realistic expectation. This is why Google introduced this idea of the error budget, didn't they?

Sven Johann: Yes, exactly.

Eoin Woods: Because they have ops (actually SRE) and dev teams, and this is their shared metric, which is "We have an error budget, and we will use it."
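As a rough illustration of how an availability target turns into an error budget, here is a minimal sketch - the target, the month length and the incident durations are made-up numbers, not from any real service:

# Illustrative error-budget arithmetic; all figures are examples only.
def allowed_downtime_minutes(availability: float, period_minutes: float) -> float:
    """Downtime budget for a period, given an availability target such as 0.999."""
    return (1.0 - availability) * period_minutes

MINUTES_PER_MONTH = 30 * 24 * 60  # roughly 43,200

budget = allowed_downtime_minutes(0.999, MINUTES_PER_MONTH)  # about 43 minutes/month
incidents = [20, 10, 5]                                      # minutes of downtime so far
remaining = budget - sum(incidents)

print(f"Budget: {budget:.0f} min, used: {sum(incidents)} min, remaining: {remaining:.0f} min")
# Note: each extra nine (99.9% -> 99.99%) shrinks the allowed downtime by a factor
# of ten, which is one intuition behind the "each extra nine costs ~10x" rule of thumb.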

Sven Johann: Yes, yes. I think they did a very good job with helping the community to find good requirements for availability and performance, and how to measure them, and how to improve... I think that was a really good idea.

Eoin Woods: Agreed.

Sven Johann: Let's assume we have good requirements, or good enough requirements. You once described a solution framework for dealing with those operational concerns. Could you elaborate on that solution framework?

Eoin Woods: Yes, sure. What I observed was that there are four broad areas that people seem to expect from their production systems. The first is that they're functionally correct, which is probably outside what we can talk about today - that's what things like unit testing and functional integration testing are for. But then there's stability, there's capacity, and there's security. Those are three broad areas we need to think about achieving... Because there's an expectation, especially today, that systems are stable, in the broad sense; that they have the capacity they will require for a reasonable workload; and that they are secure, whatever that means, in the right context.

Eoin Woods: This was at one of the big banks, and I was just trying to help all the teams think about it -- I mean, they sort of accepted the point: "Yes, those are all important to us, and if we don't get them right, we'll be called in the middle of the night... What is it we need to know to achieve this?" And I observed there are three things you really need to know. One, are there design principles in this area that help you to be stable, have capacity, be secure? So how do you go about designing the system? Two, are there specific technology solutions which are useful in achieving these qualities? And three, are there particular processes, maybe operational processes, that we should be aware of, or that we should be assuming, or telling operations that we need, in order to achieve them?

Eoin Woods: It was not a terribly sophisticated framework, but that was a little framework I came up with just to really help organize knowledge in this area, especially for teams who didn't have large amounts of operational experience themselves.

Sven Johann: Yes. Simple is always good, right?

Eoin Woods: It normally helps.

Sven Johann: Yes, yes. I also used your solution framework, and I think it really helps me to categorize my work. For example, when I want to have an available system, what do I need to do in order to achieve that? And then I can look up what you said - design principles; I can look, for example, at the work of Michael Nygard, who describes a lot of design principles for stability...

Eoin Woods: Yes, exactly.

Sven Johann: And I can think of technologies which support me... And also the processes - I think those are quite often forgotten. In that Google ACM paper I mentioned, they describe basically the same thing you said. They have design principles, they mention technologies, and they also say "Look, if you want to have 99.999% availability..." Some internal Google services really only allow (let's say) 10 minutes of downtime a year, for example... Then you just cannot get by without an incident process. Or if you have an incident process, you cannot just say "People need to respond within 30 minutes", because if the requirement is 10 minutes, then 30 minutes is obviously not enough, so you also need follow-the-sun operational support, for example. I think it's really helpful...

Sven Johann: In one of your presentations I believe you played through one example to describe the solution framework... Maybe we could also play through one example now. For example, let's say stability. How do I achieve stability? What technologies, design principles, processes support me?

Eoin Woods: Sure, yes. We could definitely do stability. So if we stand back and say -- once again, how we structure it... So what we need to think about in terms of design, what we need to think about in terms of specific technology that would be useful to actually help to do this, and then are there processes that we should be aware of, or that we're mandating or building in.

Eoin Woods: Examples of the kind of design principles that help with stability are things like fail quickly or fail fast, so that you don't allow errors to build up and propagate in the system and become their own incidents... Michael Nygard describes it that way - you don't want a failure to become a propagating incident itself.

Eoin Woods: Then you can isolate problems. So when something happens, you want to limit the -- I think the phrase lots of people like to use is the blast radius. You want a problem to be isolated in the element or the part of the system where it occurs, and not affect everything else.

Eoin Woods: Then a principle would be "Ensure steady state operation." It sounds obvious that the system, if there's nothing going wrong with it, will just continue running... But actually, there's all kinds of things during normal operation that could also affect stability, such as a single query running out of control in a database, that kind of thing.

Eoin Woods: For each of these, you can look at specific things. Timeout is a design pattern which you could apply to almost any kind of request within a system. You don't want synchronous requests, or even asynchronous requests, to last forever. You wanna time them out. And we've just mentioned asynchronous versus synchronous communication - they're both very valuable patterns in different places, but you want to choose between those quite intentionally, weighing stability against, for example, performance and latency; for stability, asynchronous integration tends to be better.
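As a small illustration of the timeout idea, here is a sketch using Python's requests library - the endpoint, the limits and the fallback are invented for the example:

import requests

# Never let a synchronous call wait forever: bound both the connect and the read time.
# The endpoint and the limits here are made-up examples.
try:
    response = requests.get(
        "https://inventory.internal.example/stock/42",
        timeout=(0.5, 2.0),  # 0.5s to connect, 2.0s to read the response
    )
    response.raise_for_status()
    stock_level = response.json().get("quantity")
except requests.RequestException:
    # Fail fast (timeouts included) and degrade gracefully rather than letting the caller hang.
    stock_level = None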

Eoin Woods: There are design patterns which - again, going back to Michael Nygard's book, which is full of good advice on this kind of thing... He talks about bulkheads, which is quite a conceptual pattern - separating the system into different pieces from a particular perspective... And circuit-breakers - that's where we prevent one part of a system from overloading another, by backing off when errors occur... Analogous to the Ethernet collision protocol.

Eoin Woods: Around ensuring steady state operation it's things like making sure that the system has sensible housekeeping. I should probably be embarrassed by the number of systems I've been involved with that have actually run into an operational problem, simply because something wasn't house-kept correctly, such as a quota, or disk space wasn't monitored, or memory wasn't monitored, or something like that.

Eoin Woods: In fact, it was only a few weeks ago that Google had a major outage, which - I'm not sure we have all the details, but something running out of disk space due to a quota was part of it. I'm not sure if that was a housekeeping problem (it may not have been), but it just shows that things that sound quite minor can actually have quite a big impact.

Eoin Woods: And then things like governance patterns such as throttling. There's lots of specific technology solutions, some of which implement those patterns and some of which are useful in building them. Things like you can have API gateways which can incorporate quite a few of those design patterns, such as circuit-breaker, and failing fast, and asynchronous communication, and so on. We can build bulkheads into the system by isolating different parts of the system inside different parts of our cloud infrastructure, for example...

Eoin Woods: Some modern technologies have got things like resource governors built into them. We saw this first in relational databases, where you could put a limit in that said "A single query can only consume so much I/O", or so many memory blocks, or whatever. So if we build those kinds of patterns into the design, we can then go and look for specific technologies that will help us implement them.

Eoin Woods: Like with any quality attribute, my advice is always "Use an existing, tested one." Don't try and build it yourself, because these things are always harder than they look. But in some cases we may have to select or build some of the technology ourselves.

Eoin Woods: Circuit-breakers, for example - I think it's relatively rare, actually, outside of a couple of API gateways, that you get circuit-breakers built into things, so you may need to build yourself a little wrapper for a network request library that builds in the circuit breaker, that kind of thing. Again, my advice is test it awfully carefully. You don't want your stability mechanism to become the source of a problem itself, which is all too easy to do.
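For illustration, here is a very minimal sketch of the kind of hand-rolled wrapper being described - the thresholds and names are invented, and a real implementation would also need to handle concurrency, per-endpoint state and metrics, which is exactly why an existing, well-tested library is usually the better choice:

import time

class CircuitOpenError(Exception):
    """Raised when the breaker is open and calls are being rejected fast."""

class CircuitBreaker:
    # Illustrative thresholds only.
    def __init__(self, failure_threshold=5, reset_timeout_seconds=30):
        self.failure_threshold = failure_threshold
        self.reset_timeout_seconds = reset_timeout_seconds
        self.failure_count = 0
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout_seconds:
                # Fail fast while the downstream dependency is considered unhealthy.
                raise CircuitOpenError("circuit open: rejecting call without trying")
            # Half-open: allow one trial call through after the cool-off period.
            self.opened_at = None
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failure_count += 1
            if self.failure_count >= self.failure_threshold:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        self.failure_count = 0  # a success closes the breaker again
        return result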

Eoin Woods: When it comes to processes, I'm a great believer in taking as many of the processes away from the people as possible... But it's things like making sure we have transparency in the system, making sure that we automate as much as possible, make sure that things are repeatable, even if we can't automate them, so that people don't have to figure out processes every time... And to make sure that we've got processes in the system, be they ideally automated, or if not, manual, for that steady state operation. What does the system need done all the time, just to make sure that it keeps running reliably? ...the obvious one being trimmed logs, so that you have enough log messages going back a significant time to investigate what's going on, and perhaps you wanna take statistics out of the old ones, but you don't let them build up forever. I know it's quite a trivial example, but I'm sure if any listener thinks of any system they've been close to, there are things that you need to do to the system routinely, just to keep it running in a steady state... And those are often really easy to overlook, because they're not very interesting, but they're actually quite important to steady state operation.

Sven Johann: Yes. I just remembered that I implemented a circuit-breaker myself seven years ago, or something like that...

Eoin Woods: Right... I'm sure you tested it really carefully though.

Sven Johann: Yes, and the thing is, I looked at Hystrix back in the day, and I thought "It has too many options. It's so hard to configure correctly", and of course, "not invented here", so we had to build it on our own... But the thing is, you test for your one use case, and then you take it live, and then all of a sudden you're like "Oh, we forgot this, and we forgot that..." And in the end, it was just a waste of time.

Eoin Woods: "Now I know why they've put that feature in..." Yes.

Sven Johann: Exactly. So it was quite stupid. One thing that you said regarding processes - that some just sound trivial, but you still have to remember them... We at INNOQ use the arc42 template to (let's say) document and communicate architectural requirements... And arc42 is, to some extent, a kind of -- when I say "checklist", that's not quite correct, but for simplicity I think it's good to say that... You look at the things which remind you what you should think of. "Don't forget that, because it can bring you into trouble."

Sven Johann: Let's say you have to think about logging, or you have to think about "How do I monitor my system?", for example, or error handling, or you name it... And what we don't have in arc42 - and basically I haven't seen it in any of the templates, neither arc42 nor C4 - is a place to document the operational stuff. Of course, there is always a place for design principles, there is always a place for technologies, but the supporting processes - those are not part of either of the two. There is no great guidance.

Sven Johann: In your book - and I mentioned the book at the beginning - you propose the operational viewpoint, and I think that's quite interesting. Can you say a few words about your operational viewpoint? What it does, and...

Eoin Woods: Yes, sure. I mean, I'd first have to say both C4 and arc42 are great frameworks as well; if anything, a disadvantage of the one Nick Rozanski and I came up with is that it's a bit more all-encompassing, and therefore a bit more complicated, which is perhaps why people don't always use it... But I think one of the reasons C4 has been so successful is that it's very straightforward to get started with, and that's just fantastic. Simon and I have talked a lot about it over the years... So I think he probably deliberately didn't put in the operational stuff, because he didn't want to give people too much to think about when they were starting off... But what Nick and I found was--

Sven Johann: Sorry to interrupt... Also with arc42, it deliberately doesn't talk about operational processes. The goal is to come up with everything related to development; what needs to be implemented, and why.

Eoin Woods: And I think that's very fair. There's plenty to think about just in how you build it. When Nick and I were looking at putting our framework together - to be honest, this was a long time ago, but even back then we were trying to find something that existed, and it just didn't seem to, which is why we created our own... One of the things we realized early on, when we were talking through what architects need to care about - of course, there were these quality attributes, performance and scalability and security and so on - and we started off actually thinking "Well, there's a set of these that are important to operation." Both of us had seen quite experienced software developers and architects running into problems, for them and their organizations, because they didn't think about operations early. So we said "All we'll do is we'll find the quality attributes useful for operations."

Eoin Woods: Then, to be honest, we flip-flopped back and forward a bit as to whether this was a bundle of quality attributes or an actual view of the system... But in the end we decided that it was a view of the system, the reason being one of the reasons you create a view is because you're trying to create a model, draw a picture, tell a story for a particular group of people. And of course, we realized that there were a couple of groups of people such as infrastructure folks and operations folks, and compliance, and maybe audit, those kind of people - they really wanted to talk about "How is the system going to be operated? How will you ensure it can be monitored? How will you put software into the environment?" and so on.

Eoin Woods: So we realized we actually did have a constituency, a group of stakeholders who cared, and therefore we came up with the operational view. The kind of thing it tends to contain is those kinds of concerns: How do you operate the system? How do you monitor the system? How is it controlled? From an operational perspective, how does software enter the system? In fact, things like deployment pipelines are probably in the development view, because the operations people don't really care too much about what's in the pipeline; they care about what's coming out of the pipeline, and how that's applied to the operational environment... So, how are operational environments created?

Eoin Woods: Hopefully, these days -- we wrote the book quite a long time ago... These days a lot of the historical problems, such as "We need to go and buy servers, and wire them together, raise lots of tickets to get the networking configured" - a lot of that is going away very quickly, because people are in public cloud or private cloud environments with very high degrees of automation, and a lot of infrastructure as code. But a lot of the concerns about "How is this gonna be monitored? How is this gonna be controlled? How do we know what's going on inside production? What kind of analytics are we going to run on that environment? Where does that data come from? Where is it stored? How does that processing work?" All of those things that are of great interest to people -- I don't wanna say the operations team; the people focused on operation. They are of less interest to a lot of other people.

Eoin Woods: So therefore we came up with a view, which actually -- I mean, we've published it, it's part of the book, and then I've put out an academic paper on it, and it appeared in IEEE Software... I've been quite surprised over the years that an awful lot of people have come across it and found it at least a useful concept, even if not in all the detail; and quite a few people have actually picked up the detail and said "Yes, it's a few years old now, but the ideas do seem to still be very relevant", especially as people think more and more about how to integrate operational concerns and operational people into the development lifecycle.

Sven Johann: Yes, I will put the paper into the show notes... A colleague of mine once said "You know, operations is kind of the forgotten view in software architecture", and I said "Yes, correct. Maybe we should do something about it." And then I googled exactly those words and I found your paper in IEEE Software...

Sven Johann: Yes. Arc42, for example, will remain what it is, but maybe that's kind of an idea - that we have a supplement, so to speak; something like OPS42, where we have the operations view, and where we would probably heavily steal from you... So let's see. But yes, the viewpoint you have - I think it's a great idea, and very useful.

Eoin Woods: That's good to hear.

Sven Johann: So it's always good, if other people steal it.

Eoin Woods: Definitely. It's a sign of success, definitely.

Sven Johann: Exactly, exactly. Okay. Maybe coming to the end - when we run systems in production... So now we've developed the system, and we've deployed it, and we deploy it regularly, and now it's running in production... You already talked about monitoring. I have (let's say) a little extended question - what is observability of a system? That's a term which came up lately. Nobody talks about logging and monitoring anymore; everyone talks about observability now... So what is that, and is it very different from, let's say, the traditional terms?

Eoin Woods: That's a great question. To be honest, I think that term -- from my recent research into it, that term is still very much emerging. There are some quite large voices in that area who, I think, are trying to lead the industry to a particular definition of observability that they feel is very important, and that people need to understand better. Of course, observability as a term has been used informally for quite a long time, so there's at least the potential for a bit of confusion here...

Eoin Woods: Historically, there have been three technical bits to the puzzle. I think there's been logging, which we all know and have loved for a long time, there's been monitoring - so that's your metrics, if you like. And then more recently, there's been the idea of tracing, as in comprehensive tracing, which I suppose companies like AppDynamics had a lot to do with popularizing... Where we can trace actual transactions through actual multi-tier systems in production. Those three things give you the building blocks.

Eoin Woods: I think what the observability folks are pointing out is that it's all very well having this data, but unless you turn it into information and then understanding, it doesn't actually get you very far, and I think that's a really good point. My take on the observability area is that really what people are saying is "You've gotta have an understanding of how your system works internally, and you've gotta link all this data to that understanding." Now, dare I say it, it sounds a bit like having a model of your system - a subject close to my heart, because I think models unlock a tremendous amount of insight.

Eoin Woods: I think this is where observability is trying to go - they're trying to say "Of course you need comprehensive logs, of course you need traces, of course you need runtime monitoring, but what you need is a system model that all three of those relate to, so that you can understand what a particular queue length in this part of the system actually means for the health of the system." That sounds like a pretty lofty kind of research goal. There are companies who are very actively trying to work on observability systems right now, and there are a lot of challenges.

Eoin Woods: Early on, the observability folks were saying "Just collect all the data. You don't wanna sample, you just need to have it all, and then we'll analyze it." This is the world of big data. And I think for big systems, that's quite overwhelming. A friend of mine is a product manager in the APM market, and his view has always been that that works up to a certain scale, and then - it doesn't matter, even if you're Google - collecting every event in every system across your estate probably isn't gonna work.

Eoin Woods: There's lots of things still to be worked out, but that's where observability is trying to go. And people today probably do observability themselves, in their heads. They maybe build system-specific tools to help them understand what their metrics and traces and logs are showing them. I think what the observability community is trying to do is bring it to the next level, where we have some automated tools to do it, which sounds like it would be a fantastic step forward when we get there.

Sven Johann: Yes, that's true. When you said "collect everything" - I also dealt with an APM provider, and they also said "We are just in the world of big data, and we collect everything", and I was a bit surprised, because even Google, in their tracing system - I forgot what the name is... They brought out a paper, and in that paper they said that they basically only collect one in 1,000 requests.

Eoin Woods: Right... Which at their scale is still, I'm sure, a huge amount of data.

Sven Johann: Yes. But what they try to do is ignore, let's say, the good requests, and only collect the possibly problematic ones - not requests, but traces.

Eoin Woods: Yes, the outliers. You'll probably learn a lot more, rather than collecting mainly the successful ones.
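A tiny sketch of that kind of sampling policy - keep every trace that looks problematic, and only a small fraction of the healthy ones; the thresholds and the 1-in-1,000 rate are illustrative, not taken from any particular tracing system:

import random

# Illustrative sampling policy: always keep problematic traces,
# and only roughly 1 in 1,000 of the healthy ones.
KEEP_HEALTHY_RATE = 1 / 1000
LATENCY_THRESHOLD_MS = 2000

def should_keep_trace(trace: dict) -> bool:
    if trace.get("error"):
        return True                                   # keep every failed request
    if trace.get("duration_ms", 0) > LATENCY_THRESHOLD_MS:
        return True                                   # keep slow outliers
    return random.random() < KEEP_HEALTHY_RATE        # sample the rest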

Sven Johann: Yes. But I like the idea to say we have a model, and that it's also a kind of architectural concern, "What do I want to get out of my system, in terms of understanding it?"

Sven Johann: Okay, so our system is now running in production, and you said we probably need processes supporting our system... Obviously, we need processes to get systems or deployment units into a production environment... But I'm also wondering if you could say a few words about other necessary processes, like security incident processes, maybe fire drills, something like that - what those are and why they are also important for software developers and architects.

Eoin Woods: Sure. Although I should immediately say I've never been a specialist operations person; this is not something I'm an expert on. But having worked with operations quite a lot -- I think you've named a lot of the important processes already. Historically, ops have always picked up lots and lots of manual processes, often because the software wasn't finished, quite honestly, so they had to pick up things that they shouldn't really have to do... But in terms of the essentials - yes, there needs to be a way of putting software into production; a very reliable, controlled, highly automated way, so we know exactly what's there... That includes emergency temporary fixes, just as much as significant functional releases.

Eoin Woods: There needs to be a way of routinely understanding what's going on in the operational environment, and a process for analyzing that, so that trends can be extracted... And that's all part of helping the wider team understand 1) how successful they're being, and 2) where they need to improve... And then there need to be processes for both recognizing and dealing with all of the things that can happen in production. You've named two of the important ones already - one is, if you like, a generalized incident, and the second one is a security incident specifically, because that can be a live incident causing ever greater harm to the environment, with human beings actively attacking the system, so there can be a need to take a different level of action there... But any production incident obviously needs immediate attention, needs a lot of focus, and we need to resolve it as quickly as we can... Going through some structured triage process is quite important, so that the real impact on the organization that owns the system is understood.

Eoin Woods: Some incidents can look quite dramatic, but actually, because of the way that the system is being used, we've got a little bit more time to deal with them; we can deal with them in a more thoughtful manner. Others are genuine emergencies, and we absolutely have to deal with them right now.

Eoin Woods: A good example, in the wholesale banking environments I used to work in, is that anything involving a client interface is really an absolute emergency, because clients are demanding and expecting to be able to use it... And there are some very important processes, for example internal reconciliation. We absolutely have to be able to do them, they have to be demonstrated as robust to a regulator, and the business needs to know that they can trust the values that they're seeing on their screens. But if that's not there for a couple of hours, actually there are a number of workarounds that can be used. So there are different levels of response required for those.

Eoin Woods: The last thing - and these often get forgotten - is learning from incidents. I think this is a really important point about improving the resilience of the organization - use any problem, mistake or incident as an opportunity to improve and for it not to happen again. That needs to be institutionalized. Not just running through the steps - most organizations do that - but producing real learning points out of them, and having the commitment and bravery sometimes to actually tell people what needs to change, and to see it through a change process that improves the organization in some way, so that the learning is beneficial and isn't just an academic exercise. And all of those need to be done.

Eoin Woods: One of the things we sometimes dismiss - and I must admit, I sometimes dismiss it, too - is that service management processes have been with us for a long time, best characterized by ITIL. I know that from a modern ops perspective there's a lot about historical ITIL that maybe isn't what we want to emulate today... But it's worth being familiar with that world. One, because a lot of operations people have been trained in it since they were graduates, so they're very, very familiar with the terminology and the approach; it's good to be able to understand what their historical background is. And two, although we might want to go about things in a sort of "ITIL version two" way today... We might well want to see what's in ITIL to make sure that we learn from all of that amassed experience, and maybe we apply the approach and the learning of all those decades in a different way, but we do actually use that learning.

Eoin Woods: Something Matt Skelton - he's one of the co-authors of Team Topologies - said to me many years ago (we've known Matt for quite a long time) was "People come to me and say 'We do DevOps. We have no time for ITIL.' It always kind of makes me smile, because I inevitably go and look at them and find there's all these gaps in what they're doing operationally, because they don't understand anything about ITIL. And it's not that the ideas behind ITIL are in themselves wrong, it's that they're optimized for a particular kind of environment." Many of the ideas are good, but we need to re-optimize them for a different kind of environment, the one we're in today.

Eoin Woods: I think the key things are: we need the processes to keep the system running stably, and to understand how it's running, so we can spot trends and gain insights. We need the processes that allow us to recognize and respond to incidents in a rapid, structured, but calm manner. And lastly, we need a process we all really believe in and really follow, so that we learn from anything that goes wrong... so that 1) we don't do it again, and 2) we improve the organization in some way as a result of a problem we've suffered.

Sven Johann: Yes, learning from incidents - I also think that this is quite important. For example, in one of my projects we have a lot of old-school service managers and ITIL people... But they adapt to the new situation and they bring in their know-how. I have to say, I was very positively surprised about all the gaps they were filling. That was quite interesting.

Sven Johann: We also tried to learn from incidents, but that turned out to be not so easy with all the teams. Like you said, you can learn from incidents, but then you also have to do something about it. That happened quite often, and you think "Hey, didn't that thingy occur two months ago, exactly in the same way? Yes. Do we have a post-mortem report? Yes. Did the post-mortem report say "Here are the things we need to improve?" Yes. Did anyone do that? No..."

Sven Johann: So I think it's all good that you have those processes, and then you have the learnings... But yeah, if certain people do not believe that it's important to implement those learnings, then it's kind of a waste of time.

Eoin Woods: Exactly.

Sven Johann: So everyone needs to agree that it's important to implement the learnings.

Eoin Woods: That's right. And of course, people like product owners need to understand that there is a trade-off there. They will be able to do perhaps less functional delivery for a month or so, because having had an incident, we're going to learn from it and improve some things.

Sven Johann: Yes. On the other side, something I try to do - and then we come back to the requirements - if you have an agreed requirement, if the product owner thinks it's important to have, let's say, this amount of availability, and through your monitoring you see that you are far away from this requirement, then obviously you need to invest time, because you're not meeting the requirement. Or you just say "Okay, if you don't want to implement the learnings we had, then we have to change the requirements, because obviously we cannot meet them, and nobody's interested in spending more time on implementing the learnings."

Eoin Woods: Yes, this is the unfortunate reality. Understanding you've got a problem is only part of what's required... And actually knowing what to do in response to the problem and implementing it is really the hard bit, which - you know, there's no way around that.

Sven Johann: Yes, that's true. Learning from incidents, maybe even before we have an incident... Over the past years, people like Adrian Cockcroft have popularized the terms "chaos engineering" and "resilience engineering" - running controlled experiments in production to see how my system behaves. I have to say, initially I thought "This is maybe only for Netflix kinds of companies", but lately I became a big fan of it... Because it helps us to understand our production environment way better. What is your opinion about those kinds of controlled experiments in production? I'll just give a few examples - I put additional synthetic load, spikes of load, onto my production environment just to see how it behaves. Obviously, it's not about creating chaos or doing random things; I do believe that all my experiments will work, but you cannot really know...

Sven Johann: Or the classic thing is that I kill machines because I know that my system still works, or I believe that my system still works... So what do you think about running controlled experiments in production?
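
As an illustration of the shape such a controlled experiment can take, here is a short Python sketch: check the steady state, inject the fault, observe, and always roll back. Every function in it is a placeholder standing in for your monitoring and tooling; it's a sketch of the pattern, not any particular chaos engineering tool.

    # Sketch of one controlled experiment: hypothesis first, small blast radius,
    # unconditional rollback. Every function here is an illustrative placeholder.

    def get_error_rate():
        """Placeholder: read the current error rate from your monitoring."""
        return 0.002

    def get_p99_latency_ms():
        """Placeholder: read the current p99 latency from your monitoring."""
        return 180.0

    def steady_state_ok():
        """The hypothesis: error rate below 1% and p99 latency below 500 ms."""
        return get_error_rate() < 0.01 and get_p99_latency_ms() < 500

    def inject_fault():
        """The experiment itself, e.g. add synthetic load or stop one instance."""
        print("injecting fault (placeholder)")

    def rollback():
        """Kill switch: remove the synthetic load / restart the stopped instance."""
        print("rolling back (placeholder)")

    def run_experiment():
        if not steady_state_ok():
            print("not in steady state - do not start the experiment")
            return
        try:
            inject_fault()
            if steady_state_ok():
                print("hypothesis held: the system tolerated the fault")
            else:
                print("hypothesis failed: stop, investigate, learn")
        finally:
            rollback()  # always restore normal operations

    run_experiment()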

Eoin Woods: I think that’s a great idea. That's how you really build confidence in both the technical, but also the business teams, that you have a resilient system. The thing I worry about is that sometimes people have the tendency to read the headlines, and they suddenly jump in and start running all this stuff in production, without thinking intentionally about how they do it. All the stuff I've read about chaos engineering from people like Netflix and a few books on the subject - they're all very sensible. They talk about the fact that you've gotta make sure that the software you're using to cause the problems is itself very reliable; you've gotta have confidence in it, you've gotta make sure you set expectations and build the confidence with business stakeholders... You've gotta do a lot of it in non-production environments before you get to the point of having the confidence to do it in the production environment.

Eoin Woods: In the production environment you do it step by step; you do small things that aren't critical, you learn from those, you get better at it, to the point where you end up with these little chaos robots running all over your system, causing unplanned problems. That definitely seems to be somewhere it would be good to get to. I would be quite concerned about a team that says they're just going to start doing that. You do hear some people talk in those terms: "Hey, we should start applying a simian army to our production system." "Well, okay, but why don't you start on your laptop? Because I'm not sure your software is robust enough that it wouldn't cause chaos even in that environment."

Eoin Woods: It sounds like a really good end state, but like so many of these things - I hear this from clients quite a bit - "Google do this." Yes, but Google's environment is not your environment. They've been doing this a long time, and they didn't go from a standing start to that in a single step. So view it as an inspiration or a vision, and let's take sensible, risk-controlled steps to get there.

Sven Johann: Yes, true. I hear quite often, if something goes really wrong in production... For example, lately I think someone accidentally deleted a production cluster in the U.S., or something like that... And then everyone was like "Oh yeah, this was a chaos experiment." No, no, it wasn't.

Eoin Woods: It was just chaos.

Sven Johann: It was just chaos, yes. If it had been a chaos experiment, we would have tried it out in a non-production environment first, where it worked, we would have been aware of it, and we would have had (let's say) our monitoring in place to see whether everything really works as intended... Yes, I think that's a good point; it's not about randomly creating chaos, it's really about getting a better understanding of your system, only doing things in production which you have already verified in non-production and which you're sure about, and taking tiny steps toward that goal.

Eoin Woods: It's about building confidence...

Sven Johann: Yes, exactly.

Eoin Woods: ...and the way to build confidence is not to deliberately have a huge production outage.

Sven Johann: Yes. I have to say, I was always very much against it. I have a customer who says "We need to do chaos engineering", and I was always blocking that, because, as I said, I thought this was only for Netflix kinds of companies... And when I talked to some consultancies around that topic, it was really a bit disappointing, because they just wanted to sell tools...

Eoin Woods: Right.

Sven Johann: But then - I have to make some advertisement... I attended a training from Russ Miles.

Eoin Woods: Oh, yes.

Sven Johann: And he basically turned my thinking upside down on that. He could really explain why you should do it, and how... And I bought his book, and we implemented what he proposes in it, the step-by-step process, and I found that quite helpful. I think Casey Rosenthal and Nora Jones are also good resources to look at... Yes, it's about building confidence.

Eoin Woods: All those people definitely give pointers to getting to that vision, but in a sensible way.

Sven Johann: I also have to say, they are selling tools too, but I think they do an additional great job of explaining how to do it right. With one customer I found it quite interesting - we had planned one of those experiments shortly before Christmas, and it was at a time when it was the first lockdown day in Germany, the Covid-19 lockdown day. Some business people kind of freaked out and said "Are you crazy to run that experiment now? It's the first lockdown day." And we were like "Hm... We have this under control. We know exactly what's going on; we have (what you said) the steady state, we know what the steady state is, we have the monitoring in place... We can see if we go even slightly out of the steady state, and we have a kill switch and everything... We have a lot of confidence in our system." So it really felt good to run that experiment and say "Well, it doesn't matter if it's Christmas or anything else. We basically know what's going to happen. And if the hypothesis is wrong, we know how to stop and get back to normal operations within seconds."
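
One way to picture that kill switch is as a watchdog that keeps probing the steady state while the experiment runs and aborts within seconds if a metric drifts out of bounds. The following Python sketch is illustrative only; the metric, the threshold, and the abort action are placeholders for whatever your monitoring and tooling actually provide.

    import time

    # Illustrative watchdog for a running experiment: if the observed metric leaves
    # the agreed steady-state band, abort immediately. The metric, names and
    # thresholds are made up for the sketch.

    STEADY_STATE_MAX_ERROR_RATE = 0.01   # agreed upper bound for the error rate
    CHECK_INTERVAL_SECONDS = 5

    def current_error_rate():
        """Placeholder: read the metric from your monitoring system."""
        return 0.003

    def abort_experiment():
        """Placeholder kill switch: stop the fault injection, restore normal operation."""
        print("kill switch triggered - experiment aborted")

    def watchdog(duration_seconds=60):
        """Run alongside the fault injection for the whole experiment window."""
        deadline = time.time() + duration_seconds
        while time.time() < deadline:
            if current_error_rate() > STEADY_STATE_MAX_ERROR_RATE:
                abort_experiment()
                return
            time.sleep(CHECK_INTERVAL_SECONDS)
        print("experiment window finished within steady-state bounds")

    # watchdog(duration_seconds=60)  # start this before injecting the fault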

Eoin Woods: That's great.

Sven Johann: Maybe one of the final questions - or actually, it's my second to last question - how should we organize operations? There are different approaches... You mentioned Google; Google has a site reliability engineering team. Other companies prefer "You build it, you run it", and then other organizations still have a separate development team and an operations team... What are the questions I need to ask to decide which approach I should use?

Eoin Woods: That's a great question. I think I'd point back to a book I mentioned very fleetingly earlier, the Team Topologies book. Matt Skelton and Manuel Pais do a good job of working through some of the trade-offs the different team structures have. I'm not sure I've got a simple answer... There are a number of organizational forces or pressures that would cause you to build one or the other... One I see repeatedly is what I'd call the DevOps model, where you have a single team, they're accountable for everything, you have a mix of skills in the team, the team can do everything... That does require people who've got real operational skills to be happy working in that sort of environment, and it was probably Endava's head of managed services who pointed out to me that those people often come from a different background and actually like to work in a different way. So if you've got people like that, maybe a pure DevOps model is not for you; or even if that's the direction, maybe there need to be steps on the way.

Eoin Woods: Google's model obviously works very well for them... And to be fair, I think there are organizations where there is quite a big split between dev and ops, but because they've got a culture where they all collaborate and they've got shared goals and direction - it's not specifically SRE, they just have a separate run-it team - they actually get good results anyway.

Eoin Woods: So I don't think I have a very simple set of principles to drive people towards one, but what I do think is that it's very important to understand the people, the organizational pressures, the expectations, maybe the politics, dare I say it, in the organization you're working in... And then you work out which models could possibly work and what their trade-offs are, rather than saying "DevOps, integrated. You build it, you run it. That's a good thing, let's just do that." I think that's not the way to select the operational model for your organization.

Sven Johann: Yes... Through great pain I also came to realize that, I have to admit... But I think it's good that more and more books and knowledge come out discussing those approaches, with their pros and cons. Team Topologies, I have to say, is now also pretty much at the top of my reading stack... So I'm looking forward to that one.

Sven Johann: Also, there is a book called "Seeking SRE", which I find quite interesting. There are various people talking about site reliability engineering and how they do it, but not the Google way... Because they worked for Google, and now they work for another company, and they fail with SRE at that company because it's just not Google.

Sven Johann: When I read that, it's not something where you can say "Oh yeah, now I know exactly which model to choose." It's what you said - you have to understand a lot of things about your organization, then you need to know all the different approaches, and then you need to select one and adapt it.

Sven Johann: Final question - let's say I choose an SRE approach, or a "you build it, you run it" approach, so basically there is, let's say, one team which is responsible for both building and operating the system. Classic SRE also says you have site reliability engineers, but developers are part of the on-call support too... In my experience, that works great if you constantly work on your services or applications, whatever you call them. I work on my service, I deploy it, and once it runs in production, I add more features in parallel. Then it works fine. But often there are services which are stable - they don't change, or only very rarely... And maybe I even have this (let's say) DevOps team which takes care of them. But how do I effectively operate systems or services which are basically stable? Should that be the DevOps team, or should I hand that over to an old-school operations team?

Eoin Woods: That's a great question. A related question is how do you operate package software, where you get occasional quite significant releases, and rather than writing a lot of code, you would probably do more configuration.

Eoin Woods: I think it's a good question. I don't think as an industry we necessarily have the right answer to that yet. Honestly, I think we're still struggling to find the right answer for systems that change all the time. My observation is that people with systems that don't change very much tend to be much happier with the traditional service delivery model. And I think that's probably because there's a lot more predictability. These probably aren't systems that suffer very unpredictable workloads. They're probably not systems that (as you say) have to absorb a lot of change; they're probably not systems that are used in unusual ways or are at the core of the fastest-moving business processes in the organization.

Eoin Woods: When it's run well, there's absolutely nothing wrong with high-quality, responsive service delivery, and perhaps that's the right way to do it. The other way to do it is what at Endava we call application management-oriented work. One of our professional disciplines is called application management, but in fact they're software engineers with a range of skills, from development through to operations... They're less specialized in development than our development engineering discipline, but they still have development skills, and they do everything from sustaining engineering on complex products right through to application support type work.

Eoin Woods: Where customers don't want so much a managed service, we still have a project team - we've still got a cross-functional team responsible for everything - but with a different mix of skills. The other thing is that a relatively small number of highly skilled people in an environment like that, where the customer has invested in good underlying platforms, kept things up to date, and invested in automation, can actually run quite a lot of software, because there are fewer unexpected surprises and less unpredictability. The automation tends to be very reliable, because automated tasks just run the same way every time.

Eoin Woods: So I think a lot of people would go towards traditional service delivery, but another way to look at it is a cross-functional team model, where you change the particular mix of skills in the team to be more oriented towards automation and operations, rather than traditional development.

Sven Johann: Okay. Thank you.

Eoin Woods: You're very welcome.

Sven Johann: What would be the resources every developer and architect needs to know about? Obviously, we have Team Topologies, we had Michael Nygard, Adrian Cockcroft, all the chaos engineering people to follow... Anything else?

Eoin Woods: I could send you a list; there are a few other things, I'm sure. For example, I came across an O'Reilly book, "Distributed Systems Observability" by Cindy Sridharan. I thought that was a very good, accessible, balanced introduction to the whole area of observability, which I found interesting.

Eoin Woods: There are classic texts on resilience - outside software as well; not just software resilience - which are probably a little abstract for a lot of practicing software engineers, but definitely can contain good advice on things to be aware of.

Sven Johann: Sorry to jump in - I think Adrian Cockcroft is kind of reading those materials and bringing them to the software world.

Eoin Woods: Oh. Fantastic.

Sven Johann: What are the resilience practices in, for example, hospitals or airlines, and what can we learn from them?

Eoin Woods: That's the kind of thing Adrian is great at doing... So yes, I'll be looking forward to seeing that.

Sven Johann: Sorry, I didn't want to -- well, of course I wanted to interrupt, but only briefly.

Eoin Woods: That's a very good interjection. I would add one other point: it's good to get a good book on retrospectives. Aino Vonge Corry recently wrote a new one, and there's an older one I've come across that I thought was very good - Esther Derby was the author, I think, maybe with some other people. I think it was just called "Agile Retrospectives." These are books that lead you through the process not only of working out where it all went wrong, but of actually thinking about the organizational change required to do things better.

Sven Johann: I have another book by Esther Derby, but from what I know, I think this one came out recently, like two months ago or something.

Eoin Woods: It's very recent, yes.

Sven Johann: Probably, yes... Really cool. I actually saw a presentation about that book, which was quite nice. Okay, I will put all of that in the show notes. I have to thank you a lot, and once we get to the OPS42 and arc42 topic, I'll come back to you.

Eoin Woods: Yes, it sounds great. Thank you for the interview. It was really interesting, and I think I learned at least as much as everybody else did. Thank you.

Sven Johann: Cool. Thank you very much, and see you next time.

Eoin Woods: Yes, thank you. Have a good day.