Conversations about Software Engineering

Conversations about Software Engineering (CaSE) is an interview podcast for software developers and architects about Software Engineering and related topics. We release a new episode every three weeks.

Transcript

Sven Johann: Welcome to a new conversation about software engineering. Our guest today is Daniel Bryant. Daniel is an independent technical consultant, a product architect at Datawire, an InfoQ news manager, InfoQ podcaster and a conference speaker. What else? He also published a book on continuous delivery with Java.

Sven Johann: Today we will talk about API gateways and service meshes, and how they open the door to application modernization. Welcome to the show, Daniel.

Daniel Bryant: Thank you, Sven. Pleasure to be here.

Sven Johann: Gregor Hohpe, the book author, famous ex-Googler and chief architect at Allianz, the insurance company - he once tackled the application modernization problem with infrastructure... Which kind of surprised me, because for me, application modernization was always about the code of the application itself. But what he said in a training was, you know, if you have an old car and you make it nice, but you still have bumpy roads and mud, it just doesn't matter if you have a race car or not. A tractor is just fine. I thought this was a nice analogy when it comes to application modernization... And I was just wondering, what's your take on application modernization and infrastructure? Does it go in a similar direction?

Daniel Bryant: It's an interesting one, Sven. I think the key thing with all this stuff we do - and you know I've talked about this before - is remember the value you're seeking. It very much depends on what you wanna get out of these projects. With that modernization it can be many things, and I'm with you, it definitely impacts the application itself... But also, I totally get Gregor's ideas, as in you can modernize an application only so far before the underlying hardware it runs on (or the underlying infrastructure) holds it back.

Daniel Bryant: So the two things in my mind that are really important - one is understanding what we wanna get out of this modernization; why are we doing it, what value are we delivering to the business, to the customers, these kinds of things. And then what is the kind of current state we're at in the application. Is it the application code that is a problem, is it the infrastructure that is a problem? Once we've figured out the combination of priorities and what we need to change, then we prioritize again - do we change the app first, or the infrastructure? But I think they go very much hand in hand.

Sven Johann: In your talk and in your article about that topic, you also focused on infrastructure... So what would I aim for when modernizing infrastructure? What are the business problems I can solve by modernizing infrastructure?

Daniel Bryant: The typical business problem I see in relation to infrastructure is not being able to deliver value fast enough. The customer says "I want these things", the business recognizes how it can deliver them, but then the infrastructure holds them back. Maybe they can only do deploys every two weeks, or every month, every three months, six months even, because of the way the infrastructure is configured. Or maybe it has to do with the change fail rate; deploying on the infrastructure is really risky, so we want to limit the number of times we deploy onto it. I think that's the key angle from a business driver's point of view.

Daniel Bryant: There are also other things sometimes... Some people would say they want to decompose applications, because developers are finding it too challenging to understand how to modify an app, and then they want to go to what we're now calling microservices, and a lot of existing infrastructures don't support microservices. That's still connected to a value angle, because the business wants to go fast, it wants to make more changes, and the developers are really struggling because of the software architecture... But the infrastructure is often a key area that you can leverage. If you can change the infrastructure to meet the requirements of the software architecture and the business, you're all good.

Sven Johann: You mentioned the change fail percentage and deployment frequency... So those are two of the four key metrics from the Accelerate book. The book is almost two years old, but I also think it's quite interesting that businesses - let's say half-technical people - really come now and say... Either they say "We want that. We heard about the book", or you as a consultant - you can always show the book and say "If you do that, you will be better." So I fully agree that (let's say) having a higher deployment frequency is really difficult if you're in a very old-fashioned IT department.

Daniel Bryant: I completely agree, Sven. I've been there back when I was working at a company called OpenCredo several years ago; we used to work with a number of big clients... They were doing fantastic stuff, and there were very capable people there, but the decisions that were made in the past - I'm sure with good intentions - meant that infrastructure was built a certain way, and that totally held us back. So we often had to change the infrastructure, change the apps, and also change the mindset. There are a lot of challenges there.

Sven Johann: Yes, I know. I know... I know there are challenges. Let's say we're looking at a higher deployment frequency, and mean time to restore should be low, and maybe we want microservices... What do I need to do in terms of infrastructure modernization? I think you say you have to decouple infrastructure from the application... What do you mean by decoupling infrastructure from applications?

Daniel Bryant: Most of that is in relation to containers being the solution. But taking a step back and looking at the problem, what I've found with people using, say, bare metal, a mainframe, or even VMs to some degree, is that the application is often tightly coupled to the underlying infrastructure... Either in that we're building apps to take advantage of certain properties of the infrastructure - back when I was doing Java in my early days, we were sometimes running on custom hardware, and we were coding to that custom hardware... So if we wanted to sort of lift and shift that app, it wouldn't really work that well.

Daniel Bryant: For example, I'll pick on Oracle. It's easy to pick on Oracle... I was doing some stuff with databases, and people were trying to lift and shift their Oracle databases into the cloud. The Oracle database server was a very effective piece of software, but it was literally coded to run on Oracle racks, and there were expectations around timing and around access to hardware. When you move to the cloud, not only is the underlying infrastructure architecture different, but everything is going over the network. So suddenly the software was not behaving correctly, because the person who wrote the software at the time (completely correctly) made a bunch of assumptions about how it was running on the hardware.

Daniel Bryant: So when I say decoupling - you often have to change those assumptions, and that can sometimes mean a complete rewrite of an app; a database is an extreme example, where you often code very close to the hardware. But for most of us writing business apps - to answer your question - it's more a case of being able to deploy units of code in isolation, if you like.

Daniel Bryant: I've seen some folks that were literally building big VM images and putting their applications on there... This was pretty much best practice 5, 6, maybe 7 years ago. Netflix, with their famous Aminator (I think it was called), where they used to bake Java apps into AMIs on Amazon, and then they'd just deploy those whole AMIs, whole VM images, onto a VM. That definitely gives you more flexibility than the old coding-to-the-actual-hardware approach, but these images are quite big, and they are somewhat coupled to the VM format that you're running on - be it VMware, be it QEMU, these kinds of things... Whereas with containers you get even more flexibility; or that's the theory of why they were created, and why Docker became so popular.

Daniel Bryant: The big thing for me with decoupling is this: if you code your applications to not only fit in and be deployed within a container, but to work well within a container - bearing in mind that most communication is going to be going over a network, and your block store may, again, be going over a network - providing you code the application to respect those things and it fits nicely into a container, then you can run it pretty much anywhere you've got a container-aware host. So you can run it locally, you can run it on some bare metal servers with Docker or a CRI-O daemon installed, you can run it on Kubernetes, you can run it on Mesos... You can take your pick. So I really like this notion of - it's not really full workload portability, but it's portability of deployment environments.
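
To make that concrete, here is a minimal sketch of the kind of deployment descriptor this enables - the same container image can be scheduled onto any container-aware host, from a laptop to a managed Kubernetes cluster. All names and the image URL are hypothetical:

```yaml
# Minimal Kubernetes Deployment: the app is decoupled from the host,
# and state and communication are expected to go over the network.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app                 # hypothetical service name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: my-app
        image: registry.example.com/my-app:1.0.0   # any OCI-compliant image
        ports:
        - containerPort: 8080
        env:
        - name: DB_HOST        # data stores are reached over the network,
          value: db.internal   # not assumed to be on local hardware
```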

Sven Johann: Back in the day, in the pre-Docker era before 2015, I thought it was easy to deploy... I'm also a Java guy, and you just -- of course, you have your Tomcat, and you deploy a WAR file... But Docker is a different story. You also ship your configuration, for example... So yes, that's really one of the reasons it has had so much success.

Sven Johann: But now I decouple (let's say) the deployment unit from the execution platform... So what kinds of decoupling are available? You said something like "We can decouple from the compute fabric, and we can decouple from the network." What's the compute fabric?

Daniel Bryant: This was based on some observations I made around how the big cloud vendors are recognizing that hybrid cloud is a really big thing. Hybrid cloud has been a thing for many years, but Amazon in particular were very focused on "Move all your workloads to the cloud." Azure and Google took a slightly different tack. Now Amazon - we've literally had re:Invent a couple of weeks ago - are tacking towards "hybrid cloud is a thing" as well.

Daniel Bryant: But if you look at the way the vendors are pitching it, they're pitching it very differently. They're all saying app modernization - you're gonna be running some workloads in the cloud, you're probably gonna be running some workloads on premises - but the way they want their customers to do it is very different. So if you look at Azure and Google, they are saying Kubernetes is the common fabric. Google have got Anthos, and Azure have got something called Arc now... And the idea is they provide control planes that manage Kubernetes running on premises, and also in the cloud.

Daniel Bryant: You kind of get this homogenized experience of "As an engineer, I can log in just to the Azure console, or just to the Google console, and I can control the deployment of my apps and services", and it's actually running Kubernetes across the underlying fabric, the underlying hardware, where we don't really care about it too much. Someone else configures Kubernetes running on that data center, on premises, someone else configures it running in the cloud... And then as an engineer/developer, I just access this by one control plane, one UI or one console. Kubernetes is very much the homogenized fabric of compute, if you like.

Daniel Bryant: Whereas if you look at other folks like Amazon - Amazon have got something called Outposts, where you can install some of their hardware on premises, next to your existing systems... But they're pitching that you connect the classic Amazon cloud with these Outposts using something like a service mesh. At re:Invent they talked a lot about their App Mesh product, which is basically their managed version of Envoy, a popular network proxy... And they're pitching App Mesh as a service mesh that joins the existing Amazon cloud with your Outposts, your on-premises stuff.

Daniel Bryant: At Datawire we were talking about the same thing. We've done a lot of work with HashiCorp as well, and we were saying if you deploy an API gateway like Ambassador, which we have at Datawire, as a front door into Kubernetes, and you use something like Consul, which can route across Kubernetes, VMs, existing on-premise stuff, you can basically incrementally migrate from the on-premises to the cloud, or you can even run hybrid options. Sometimes it's not worth migrating things.

Daniel Bryant: So there's these two very distinct approaches to modernizing the infrastructure. One is going the Kubernetes as a compute fabric, and the other is using network service mesh abstractions, and they're almost offering it at different -- same goals, but at different layers of the stack.

Sven Johann: But why is that? When I hear Microsoft and Google are going for Kubernetes, I think "Okay, Kubernetes came out of Google", and one of the founders of Kubernetes has now moved to Microsoft, so it's clear that they follow the strategy of having Kubernetes everywhere, let's say... Federated Kubernetes I think is the product, right?

Daniel Bryant: Yes...

Sven Johann: And AWS is maybe not so active with Kubernetes, following another strategy... Would that be an explanation? What are other pros and cons of choosing one or the other?

Daniel Bryant: It's a great question, Sven, and I'll offer an opinion here. I genuinely don't have much insight into Amazon. I know folks at Amazon, I know folks at Google and Azure as well... I think Amazon have always been a little bit hesitant around Kubernetes. I think they were the last big public cloud to offer Kubernetes as a service; they pretty much had to, because the customers drove that request, and Amazon are very customer-focused...

Daniel Bryant: I also think they're somewhat wary of standardizing around one option. And if you look at Amazon historically, it kind of makes sense. They kind of offer you a smorgasbord of services. Now, that can be confusing, because there's many ways to deploy. There's EC2, there's EKS, there's ECS, there's Lambda... But that's kind of how they work. It's a high-volume/low-margin type business. They give you all the tools, and then we as engineers have to choose accordingly.

Daniel Bryant: So I think if you look at it from that angle, it's just Amazon doing their thing, with a slight bias against one true platform, which is what Kubernetes is. As you've already said, Google - they totally make sense. The masterplan has always been around Kubernetes; that totally makes sense. You look at Azure, with Brendan Burns joining, and many of the other team members doing fantastic work there... Again, it kind of makes sense to centralize around this one platform... Even VMware - Joe Beda and Craig McLuckie and the rest of the Heptio team are at VMware now - they're kind of doing the same thing as well; they're betting on Kubernetes.

Daniel Bryant: I think it kind of makes sense, because we as engineers do like choice, but sometimes an opinionated platform helps us be productive really quickly. If you look at things like Heroku and Cloud Foundry, they actually had a lot of success because they remove a lot of choice, and we as engineers can just focus on delivering business value. So I think Google and Azure are almost skewing towards what Amazon are not doing, with the goal of trying to make it easier for developers... And the added bonus of using Kubernetes as the common substrate is that you do get a bit of a crossover between the platforms; a bit of interoperability is their theory.

Daniel Bryant: There are new things popping out, like the Service Mesh Interface (SMI). If you can lift the abstractions up - I think Google and Azure are betting on those abstractions being a thing, and then it's gonna be easier for customers to migrate and run hybrid setups between Google and Azure, let's say. I think that's what they're going for.

Sven Johann: I'm also one of the people who likes freedom from choice - is that how you say it? One thing is freedom OF choice, and the other one is freedom FROM choice... So yeah, AWS - we are going to move to EKS, but we also run Fargate and Lambda... So all kinds of stuff. It's quite mixed. But let's say I use EKS here, I use Fargate there, I have Lambda somewhere else... How does it work with App Mesh to (let's say) harmonize all those platforms? Or maybe another question - before, you mentioned AWS Outposts... What is it exactly? It's hardware you order, where basically the Amazon cloud software is installed on the hardware?

Daniel Bryant: Exactly, Sven. I definitely encourage listeners to pop along to the Amazon site and the re:Invent announcements around this. Outposts has been around a while, but it went GA (generally available) at this re:Invent.

Daniel Bryant: The interesting thing with (say) Azure and GCP is you pretty much install software in your data center. They have some different ways of abstracting it; some is a VM image, some is just software you install on a Linux machine... But Amazon have gone very much like "We deliver a rack to your premises." The beauty is it's low latency, because you connect that rack into your actual system, and so your existing on-premises stuff has a very low-latency connection into the Amazon rack... But it is someone else's hardware in your data center.

Sven Johann: But I also have something like EKS on that hardware, and could theoretically --

Daniel Bryant: You can't run everything on there, and that's definitely worth noting. They do offer a bunch of services - obviously EC2 and things - and it just abstracts over the actual physical hardware that's now running in your data center. You use the standard Amazon control plane. You can't run every service on an Outpost, and I honestly can't remember off the top of my head which ones you can and cannot, so I definitely encourage listeners to pop along to the Amazon website. I'm sure it will change over time, as well...

Sven Johann: Let's talk about decoupling the network... I had this question about App Mesh, but we'll dive into service meshes and available products a bit later, so I'll just place my App Mesh question there, I think.

Sven Johann: Another thing you mentioned in your talk was decoupling the network. What does this mean, and why should I decouple the network?

Daniel Bryant: Yeah, I talked about that - again, in the context of API gateways and service meshes and so forth... And I think we were referring to - and correct me if I'm wrong, Sven - this notion that you may have multiple sub-nets, multiple networks even, within a typical enterprise deployment. And often that's best practice in terms of security, of DMZs; you effectively balkanize your network, so if someone does break in or something goes wrong on the network, you limit the blast radius. So I was talking about that as a common pattern traditionally, but I often see folks - when they're deploying Kubernetes, they just have one massive Kubernetes cluster, sometimes even one massive namespace, and they kind of forget about this notion of segmentation and so forth.

Daniel Bryant: So what I was talking about is that you can often use something like an API gateway and something like a service mesh to bridge the gaps between the networks. So your networks are decoupled - your networks, your sub-nets; you may even have something running in Kubernetes, something running on premises, something running in the cloud, or whatever... But with something like HashiCorp Consul you can connect all those things up. So they are decoupled, and you get all the good practices associated with that in terms of security and resilience, but you do wanna join them together, obviously, to deliver the customer experience of the application.

Daniel Bryant: I've got the most experience with Consul, to be honest, but I know Istio is looking at this, Linkerd are looking at this, all the other service meshes are looking at this too, and there are many other traditional technologies in this space... But I think it's really important, when you have decoupled your network and segmented applications and services, to be able to join them all back together.

Daniel Bryant: Now, I was focusing on an example - I've got some code online; you can look at it on my GitHub - around using Ambassador, which is Datawire's open source API gateway, in combination with HashiCorp Consul, which is now being rebranded as a service mesh... I'm sure many of the listeners know and love Consul from its service discovery and distributed key-value days, but these days we can actually use it as a service mesh. So that was the main focus of my talk around that space in the actual presentation at the conference.
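
As a rough sketch of the kind of setup described here (not the actual example code mentioned above; this assumes the getambassador.io/v2 API of the time, and the `payments` service and addresses are hypothetical), Ambassador can delegate endpoint lookup to Consul, so a route at the edge can target services registered outside Kubernetes:

```yaml
# Tell Ambassador to resolve service endpoints via a Consul catalog
apiVersion: getambassador.io/v2
kind: ConsulResolver
metadata:
  name: consul-dc1
spec:
  address: consul-server.default.svc.cluster.local:8500  # hypothetical Consul address
  datacenter: dc1
---
# Route edge traffic to a service registered in Consul, which may be
# running on VMs or on premises, not just in Kubernetes
apiVersion: getambassador.io/v2
kind: Mapping
metadata:
  name: payments
spec:
  prefix: /payments/
  service: payments          # name as registered in the Consul catalog
  resolver: consul-dc1
  load_balancer:
    policy: round_robin      # endpoint routing needs an explicit LB policy
```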

Sven Johann: Having (let's say) one gigantic Kubernetes cluster, even with one gigantic namespace - I also think that's quite risky. Also, operating one big thing -- it's better to have smaller units within smaller clusters, and within those clusters multiple namespaces... For security reasons, but also -- yeah, we'll come later to it; if you introduce a service mesh, I think it's better to do it incrementally, namespace per namespace... But yeah, let's discuss that later.

Sven Johann: Once we have those different execution environments - they could be Kubernetes namespaces, or something on premises (that could also be the case)... As we have it, we work on different platforms - Kubernetes on EC2, AWS Lambda, Fargate... But basically, you could see it as one application. And I think what you're saying is, to show this application as one thing to the outside, we could use an API gateway, right? So that would basically be the North-South traffic management.

Daniel Bryant: Perfectly said, Sven. Yeah, that's the big picture of what I've been talking about for over a year now. If you look at an API gateway as a classic facade - think back to our design patterns days from the Java space, and I'm sure you and I have done much work in that area... An API gateway is literally a facade with which you can mesh together several backend services into one API that is exposed to end users; or you can expose them individually, whatever you like... But the API gateway provides that ability to choose how you expose these backend services. And it also provides things like location transparency, so in theory you can move your backend services around and the user, the frontend, never knows. They just hit a URL, they hit an endpoint, and they're kind of happy.

Daniel Bryant: Also, an API gateway provides a bunch of other things, like security, rate limiting, DDoS protection using a web application firewall, things like that... So it's a multi-function API gateway, but it definitely provides that location transparency and the ability to aggregate services at the backend.

Sven Johann: I think you also mentioned it in your talk - there was an article about this, "The identity crisis of the API gateway."

Daniel Bryant: Oh, yes, Christian Posta's blog post.

Sven Johann: So, so many things an API gateway is doing...

Daniel Bryant: Yes...

Sven Johann: At the company I work for, INNOQ, we see the danger that an API gateway is becoming the new enterprise service bus, with too much -- let's say you have a horizontal monster in front of your application... But basically, it should only be about being a reverse proxy and offering API management.

Daniel Bryant: I agree, Sven. If you look back, I did this very popular talk a few years ago called "The seven deadly sins of microservices". I talked about that for a year or two; people really wanted to hear about the bad things with microservices, and one of my deadly sins was exactly as you described. I said "Be very wary of an API gateway turning into an ESB." And not to pick on Netflix, because Netflix are amazing; I've got massive respect for the Netflix team. But one anti-pattern I did see in the Java space - a lot of people using Netflix's Zuul; the original version, not the Netty version.

Daniel Bryant: The original version was released about five years ago, I think. And you could write dynamic Groovy scripts and inject them into Zuul... And Netflix had some really fantastic use cases for this. They could dynamically load scripts that rate-limit and route traffic, and so forth... But I saw a lot of people abusing it by putting arbitrary business code into the gateway. So when you did a deployment, you had to look at the app, you had to look at some integration stuff, like Apache Camel, or whatever... And you also had to look in the gateway now, because business logic was smooshed into there as well.

Daniel Bryant: That was an absolute, complete anti-pattern of very highly-coupled business logic into an API gateway. I've talked about this a bunch of times before, and I'm very conscious of all the work I do at Datawire now... We are actively saying to folks, "The API gateway provides a whole bunch of cross-cutting concerns, developer portal, security, all these good things... But please, don't put business logic in your API gateway."

Sven Johann: So the best thing would be an API gateway that just doesn't allow you to put any business logic in it somehow... But I assume that's tricky.

Daniel Bryant: Yeah... I think it's almost impossible. We have something called Filters in Ambassador, and we had to put them in, because customers were so keen to do it. It was not really business logic, but they wanted to do some kind of security analysis for some special cases, so we needed the ability to write arbitrary code... But we basically say to folks, "Be very careful about what you put in those filters. Use them only for cross-cutting concerns; for security checks, for doing special auth... Please, please, please don't put business logic in the filters."

Sven Johann: Yeah, one can only hope...

Daniel Bryant: Yes, ha-ha!

Sven Johann: But when it comes to application modernization and API gateways, what would be a tactic to separate -- or let me rephrase. I have my large application, and I want to improve delivery, so I try to (let's say) rip out a candidate from the application, put it in a container, have something lightweight, maybe not really start with Kubernetes in the beginning... And put it on - just to have an example - AWS Lambda... I mean, Lambda doesn't support containers, but... Let's say a container on Fargate, or my function on Lambda... And then I put an API gateway in front of the application, and I just say "Okay, all my clients now talk to the API gateway." And one part goes to the old application and the other to the new application. So basically very simple reverse proxy functionality.

Daniel Bryant: Yes, yes.

Sven Johann: Then from there -- basically, when we have that, we can grow from there, right?

Daniel Bryant: Yes, and what you've talked about there is the classic strangler pattern... Probably a horrible name, but it's kind of caught on. Martin Fowler has talked a lot about this on his blog. There's a fantastic book, actually, "Monolith to Microservices" - it's Sam Newman's second book. I'm sure you know Sam Newman's Microservices book... But this one has just been published. I actually met Sam in Berlin recently, and he gave me a copy of the book... Fantastic book. And he talks a lot about the strangler pattern and related tactics... So I thoroughly encourage listeners - if they haven't got Sam's original book, it's still worth buying that one... But he's updated a lot of his thinking and a lot of his learning from all the consultancy work and all the training he's done; and Sam's just generally a deep thinker in this space as well. The combination of his book and (say) Martin Fowler's work is really nice as a sort of conceptual pattern.

Daniel Bryant: Then where people like me and other folks build on this is that we give you perhaps more practical advice. There's benefit in both here, the theory and the practice. I often use Kubernetes as my example - if you're pulling out a service and you know you're gonna go to Kubernetes, sometimes it is good just to spin up a Kubernetes cluster... Even if you're running only one or two services on it, you're getting experience of running it pre-production, and ultimately of running it in production, before you start increasing the load. Does that make sense?

Daniel Bryant: The key thing with all this app modernization is the learning. We've got so much choice these days. Things are fundamentally changing... Like, the way you run VMware, or VMs, versus the way you run Kubernetes is quite different, so the ops team, the platform team have to do quite a bit of learning there... And it's all totally doable, but I really encourage people to do it incrementally.

Daniel Bryant: I love the pattern you mentioned - take out one service, bundle it in a container, put it on Kubernetes, run it in production for a while... Because you get a feel for how Kubernetes runs, you get a feel for how you're gonna monitor the app in Kubernetes, you start building a good pipeline around getting your apps into containers, into Kubernetes and so forth... And then as you get more and more confident, you can expand this out more and more. Then you can start applying the classic strangler pattern - you can start pulling more and more bits of your monolith out, putting them in containers and running them in Kubernetes.

Daniel Bryant: Now, you may never get rid of the original application completely; that's totally fine. I think Twitter, when they were doing their rewrite and were running their monorail, as they called it - their classic monolith Rails app - they were running it for many years, but they had most of their traffic going through microservices... But it was just not cost-effective to take out the final bits of this monolith, so they just kind of left it hanging around.

Daniel Bryant: Many customers that I work with, many folks at Datawire, have worked in this space. Some of them put the monolith in a box, as we say - some of them put the monolith in a container - and some of them don't bother, and they just route the traffic, as you said, to the monolith... And most of the traffic, when you've made good progress, is going to the Kubernetes cluster and to your new microservices.
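
A minimal sketch of that edge routing for the strangler pattern, assuming Ambassador Mappings (service names are hypothetical): the extracted service claims its prefix, and everything else falls through to the monolith, since Ambassador matches the longest prefix first:

```yaml
# The extracted microservice takes over one slice of the API...
apiVersion: getambassador.io/v2
kind: Mapping
metadata:
  name: orders-extracted
spec:
  prefix: /orders/
  service: orders:8080
---
# ...while everything else still routes to the monolith ("monolith in a box")
apiVersion: getambassador.io/v2
kind: Mapping
metadata:
  name: monolith-fallback
spec:
  prefix: /
  service: legacy-monolith:8080
```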

Sven Johann: Talking about Sam Newman and his books - actually, we had a podcast with Sam a year ago, and he basically said "Okay, there will be a new book", and we already agreed to do a new podcast on the new book... So thanks for the reminder; I have to catch up with him.

Sven Johann: I also fully agree -- I think three years ago we were at a bank, and they had a large, very specific monolithic enterprise service bus that talked to all the core banking systems. And we decided that we'd basically use the application strangulation pattern. We just left the whole integration architecture as it was, and installed -- at that time it was Docker Swarm... We decided to take Docker Swarm because it was easier at the time than Kubernetes. Now we would probably use Kubernetes... But yeah, just start with one dumb pipe and smart endpoints; that was the idea.

Sven Johann: We are visiting those guys again in January or February. So far I've never heard any negative feedback... Because I think the nice thing is that you're not so much under time pressure. You're not chasing a date where you completely rewrite your whole system. You always have a working system, and everything which is new then runs on the new platform; and everything you have to change, where it's tricky to change - you don't change it, you just rewrite it on the new platform. And it's very easy to add new or updated functionality on the new platform.

Sven Johann: So yeah, I really like the application strangulation pattern. The only downside we experienced was that it obviously takes a lot of time to build the infrastructure - let's say the developer experience of using the new infrastructure and the strangulation pattern; it takes time... And then people really got nervous that we were always working on architecture and infrastructure, and there was no business logic. So that was a bit tricky, I think...

Daniel Bryant: It's a constant balancing act, Sven. I've worked on some projects where, same as you mentioned, the developer experience is super-important... And if listeners are interested, I did a write-up of Shopify's experience. So if they pop along to InfoQ and search for Shopify Niko Kurtti - I learned a lot from Niko about how they took the Heroku experience, which their developers were used to, and actually built a Kubernetes platform with a Heroku-like experience.

Daniel Bryant: Niko and his team purposely listened to the developers, listened to what they want - the developer experience was super-important to the whole team, to be able to go fast - and they basically put this facade over Kubernetes to allow them to use the same developer experience.

Daniel Bryant: I thought that was a very interesting way of minimizing that pain... Because it did take a while to build out the platform, but ultimately the developer experience was a very easy shift from Heroku to Kubernetes, because they put this layer of abstraction in between.

Sven Johann: Yeah, I will link it in the show notes.

Daniel Bryant: Nice.

Sven Johann: So I think API gateways - it's kind of clear. Let's assume for now it's a reverse proxy with API management capabilities. We also -- we have an API gateway, but not as we described it here... But in my current project, we really think about "How can we harmonize all the applications with one common API gateway?"

Daniel Bryant: Right, yeah.

Sven Johann: What other properties should I look at to make a product decision?

Daniel Bryant: Yeah, great question, Sven. So I'm still learning this as well. There's a bunch of things that are business or developer-specific to your organization, certain properties you're looking for... But there are certain key things in terms of the ability to route traffic dynamically, the ability to secure endpoints, and the ability to do things like rate limiting, and then the ability to offer the API documentation to end users, using something like Swagger, or whatever.

Daniel Bryant: So there's a bunch of standard things that you can look at in an API gateway, but it's actually quite a fast-moving space now, because the underlying fabric that a lot of gateways are being deployed onto - Kubernetes and the associated change in architectures, like microservices - are actually creating new use cases. So if you look at a lot of classic API gateways, they were built with the assumption that you're kind of running monoliths in the backend, or at least big services in the backend. So that is somewhat changing the requirements now.

Daniel Bryant: You definitely want the ability, say in a modern API gateway, to route across multiple backends, be they Kubernetes, be they VMs, be they Lambda (as you mentioned; it's a good example). So those are the broad brushstrokes of what I look for in an API gateway, but it's something we're actively working on at Datawire, to be honest, because a lot of our customers say "How do we compare Ambassador with Kong, or other ones?" And we've got the standard answers that I've just mentioned, in terms of basic functionality, but when we actually scratch the surface of individual customers, individual use cases, often lots of interesting requirements pop up... Anyone who's worked in an enterprise for many years has often got many layers of infrastructure, many layers of apps, as you've already alluded to in our conversation earlier. And because of that, you've often got certain requirements like "Well, my existing API gateway does this", and it's something I've never heard of. I'm like "Why does it do that?" But because it's their own API gateway, they built it to meet their requirements, they built it to do exactly what they want.

Daniel Bryant: So that often plays a big part in discussions, and we have to say to customers, "Hang on... That functionality - I totally get what you did there, but really it shouldn't belong in the gateway. We need to move this into an integration layer, or we need to move that responsibility into the services themselves."

Daniel Bryant: It seems simple on the face of it, and hopefully my answer around the basic things to look for is useful to folks, but I've actually discovered as we've done more and more work at Datawire that the list of requirements is actually much deeper than I thought it was... So as we speak, we're literally putting together some docs on how to help folks on understanding the problems an API gateway solves, and how some of the solutions map to the existing problems and requirements they have.

Sven Johann: Also, we haven't talked about cost. I think that's also one thing which I find quite interesting, when you look at costs of certain solutions...

Daniel Bryant: Agree, yes.

Sven Johann: But what you are working on - will it be public soon? How do I select a -- I mean, I assume you are quite biased with that, but...

Daniel Bryant: No, of course not... Yes, I know what you're saying. We are doing a lot of documentation, a lot of articles in this space at Datawire, so yes, it's totally planned to be public. We're doing a lot of articles helping people understand the problem space, and helping people understand solutions, and so forth. So I'm hoping, given a month or two -- we've got lots of interesting stuff coming out now; we've just launched a new product, actually, the Ambassador Edge Stack, so our big focus is on that in the run-up to the holidays, and probably a little bit afterwards as well... But I'm definitely keen - after you and I chatted in Berlin (at the O'Reilly Software Architecture conference)...

Daniel Bryant: I'm definitely really keen to share some of our ideas around the criteria for evaluating a gateway, because I think it will be a conversation. I think we will put some stuff out there, and then you and other folks will go "Hang on, have you thought about this? I don't agree with this. What do you mean by this?" And I think it will take a few iterations, because - you already mentioned Christian Posta's article, and some of the other stuff I've done... I think between all of us we've realized that in the API gateway space there's a critical, core bit of functionality, and it's actually a really important part of a system, of an application, because all your traffic goes through this gateway. So it's on the critical path of every request.

Daniel Bryant: So it is really important to understand a bunch of things. You already mentioned cost... Obviously, we get a lot of inquiries about support at Datawire; people say "If something goes wrong, we need to be able to call you, we need to be able to email you and fix it..." So support is one of those things that you -- if you go with an open source solution, it's very easy to forget about the total cost of ownership, if you like... It's very easy to forget about "I need to have engineers on call for the gateway, and for the mesh", and these kinds of things. Anything that goes through the critical path has actually got a lot of requirements associated with it.

Sven Johann: When you say it's on the critical path, all the requests go through the API gateway - I once worked with (let's say) a classic product, and that was a disaster in terms of traffic, because this thing went down every once in a while... And then the whole application was down. Of course, there was a big anti-pattern - one API gateway for all applications, and totally distinct applications. We could have had several API gateways, but yeah - when it was down, every application was down. That was really bad. So the stability, the non-functional properties - you have to understand them, too.

Daniel Bryant: I completely agree, Sven. And there's a lot of technology changing; a lot of the more modern gateways, like Ambassador and others, are built on the Envoy proxy, and that is a fantastic bit of technology. It came out of Lyft - a U.S. version of Uber, if you like, or a U.S. competitor to things like Uber; it's a ride-sharing service... And they battle-tested it at Lyft. There are many other companies, all the big clouds, that are using Envoy, for example, so I'm really confident in the abilities of Envoy, and we've hardly had any issues with it over the time we've been working on it at Datawire. But it is new technology. So a lot of people that are embracing these newer modern gateways have to do a bit of learning - "How does, say, Envoy differ from HAProxy?" or "How does Envoy differ from NGINX?", for example. Those are the two classic reverse proxies you often see in deployments.

Daniel Bryant: The ops team have just got to learn how to read different logs, they've got to learn how to monitor the things differently. So that's one of the things that can cause a bit of disruption when you're adopting these modern gateways, too.

Sven Johann: Envoy - to me it seems it's everywhere now. I think Istio, or Linkerd, the two service meshes - they basically are based on Envoy, right?

Daniel Bryant: So Istio - yes, it's based on Envoy, but Linkerd is not. Linkerd uses a custom Rust proxy. So they made a conscious choice. Whereas Consul, with the HashiCorp service mesh - that also uses Envoy, too. But the odd ones out - definitely Linkerd, and I think it's Maesh, which is Traefik's mesh. I think they wrote Maesh in Go. So the Linkerd and the Maesh are the two service meshes that don't use Envoy. All the other meshes pretty much use Envoy.

Sven Johann: What's so special about Envoy? I once read a very nice article comparing Envoy with all the other ones, and I think the main reason was - if I remember correctly - that observability is really easy with Envoy. All the other ones - HAProxy, NGINX - also have a broad user base and are battle-proven, but observability is just a built-in property of Envoy.

Daniel Bryant: Actually, my boss, Richard Li, wrote a nice blog post on the Ambassador blog around this, so you may have even read his work...

Sven Johann: Maybe, maybe.

Daniel Bryant: He did that Envoy vs NGINX, and so forth. So definitely point listeners to that one. But one thing I say - to be honest, I've used NGINX for many years; I'm sure you have, I'm sure many of the listeners have... Fantastic bit of kit, I've used HAProxy for many years, too... The big difference in my mind - and they are fantastic pieces of technology, but they were born pre-cloud. Envoy is probably the first proxy that was born, created, and the ideas going into it were formed after the cloud revolution, so 2006 roughly, when Amazon made the big announcement.

Daniel Bryant: Matt Klein, the creator of Envoy, still works at Lyft. Matt had worked at Twitter on networking, he'd worked at Amazon on networking, and he'd worked at a bunch of other places as well. So he'd been at the front lines, seeing all these networking challenges, seeing all these proxies, how they worked, how they failed, how you could observe them or not. And he and his team at Lyft created Envoy with all this knowledge baked in. All the strengths of the classic proxies that might almost become weaknesses in a cloud environment - where the fabric is different, things are more ephemeral, there's potentially higher latency in connections, all the classic cloud issues - Matt baked all of that into Envoy. Observability, for example, was a key part of that.

Daniel Bryant: So the big distinction, in my mind - it's a constant trade-off, and I have some very interesting discussions with folks, of "Yes, NGINX, HAProxy and the classic proxies are more battle-tested. You can't argue with that. They've been around longer. But Envoy was born in the cloud era." And if you're running in the cloud, I think with a bit of analysis it's probably easier to argue that Envoy is more appropriate, given the requirements and the challenges of the cloud. That's the key thing people need to think about... But I totally agree with you, things like observability and reliability are really important for a proxy.
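
As a small illustration of that built-in observability, an Envoy bootstrap config (the v2 API, current at the time of recording) exposes an admin interface from which stats and cluster state can be scraped out of the box, with no extra modules:

```yaml
# Excerpt from an Envoy bootstrap config: the admin interface serves
# /stats, /clusters and /config_dump without any additional setup
admin:
  access_log_path: /tmp/envoy_admin.log
  address:
    socket_address:
      address: 127.0.0.1
      port_value: 9901
```

Prometheus-format metrics are then available at /stats/prometheus on that port, which is a large part of why observability feels built in rather than bolted on.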

Sven Johann: That's the only problem we have with NGINX - it implements OpenTracing, but somehow... You have to move to the paid version, and then we still had some trouble making it work. But I have to say that now, with the paid version, it also works quite nicely. But okay.

Sven Johann: I also have the feeling Envoy is the new kid on the block, and it makes a big difference if something is made for whatever you need. It doesn't only mean made for the cloud - I think Jenkins is also a nice example. Jenkins was the number one CI server, but now it also feels like it's not the right choice anymore... And of course, it can do a lot of things, but it's just not -- you know, you build all those new capabilities in after the new world of containers and cloud-native comes along, and it just feels a bit strange... And maybe it's the same with Envoy.

Daniel Bryant: A couple of things worth mentioning with Envoy - Matt Klein, the creator, deliberately chose not to build a business around it, because he feared the business drivers would interfere with how the product was developed. And he wrote a fascinating blog post several years ago; it's still on his blog. If you go to Matt's blog, you can see his thought process around why he wasn't creating a company. Apparently, VCs were throwing money at him, saying "We'd love you to create a company, there's so much potential here", but he was like "No, I don't want to, for various reasons."

Daniel Bryant: It's really interesting understanding Matt's reasoning, and given the hindsight of time now, we can look back and see that Matt was actually very wise. And not only did he not create these perverse incentives of maximizing profit over creating a good product, but it's also encouraged a whole bunch of other folks to work on the product.

Daniel Bryant: So Envoy now - the main contributors come from Lyft, of course, from Google, from IBM, Amazon are contributing, Azure are contributing... There's many, many other companies. I missed out a bunch. But I do think having all these people come together, people that would probably not collaborate on anything else that's not open-source and open-governed, makes it a better product.

Daniel Bryant: So I think Matt was very wise in that he saw the potential for Envoy, but realized that commercializing it could halt it, in some ways. So now, the fact that all of these companies are bringing their expertise... They're not putting all their expertise in NGINX, because that's largely a proprietary system; I know it's an open source codebase, but the NGINX Plus is a proprietary wrapper on the open core. I think that is one reason why even if Envoy doesn't meet someone's requirements now, it's still an interesting investment to look into, because I think Envoy has got the best trajectory. If you look at the community engagement, the incentives around it... A whole bunch of reasons. I think Envoy is on the up, basically.

Sven Johann: That's my feeling too. But it's just a feeling. I will check Matt's blog post and put it in the show notes. It's an interesting discussion about Envoy; we'll see how it plays out in the future.

Daniel Bryant: Yes, I think a lot of the roles that you and I and people like us take - you're making technical bets. You're putting something in a stack and you're betting that it's good for the future. And as I've moved up through different roles in my career, I've realized that having the ability to examine the current state of the world, make predictions and then justify them in a very structured way is actually a really core skill. You've gotta look not only at the code, but you've gotta look at the politics, you've gotta look at the organization, you've gotta look at the general industry trends, and then make an informed decision as to what you're gonna base your stack on.

Sven Johann: That's true. Looking at the technology is only one small thing.

Daniel Bryant: Yes.

Sven Johann: What are the challenges with an API gateway? We already had a few; you can put business logic into it if you're not careful... But are there any other challenges with using an API gateway? If I make this bet, or if I decide to use an API gateway, are there any other things I should particularly look at?

Daniel Bryant: Going back to the earlier point around requirements - I definitely think it's worth folks doing some analysis upfront as to what they expect from their API gateway, in addition to the standard things we mentioned. The things that we just assume in our organization are standard in API gateways, because our API gateway supports them - that may not be the case for other API gateways.

Daniel Bryant: The classic challenge we see at Datawire is people choosing an API gateway without realizing that they're moving to microservices at the same time - a bunch of folks are adopting a bunch of things at once: microservices, Kubernetes, these kinds of things... With microservices you've got many more things being exposed at the edge. So you're managing APIs associated with all these new microservices at the edge.

Daniel Bryant: One monolith was quite easy to manage. You could configure the gateway with hardcoded stuff. But if you're deploying (say) hundreds of microservices, like some of our customers at Datawire, who literally have hundreds of services at the edge, you can't get away with hardcoding that into the gateway, for example. You need some kind of dynamic mechanism of service discovery, and these kinds of things.
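
One common way to avoid that hardcoding - sketched here with Ambassador's annotation-driven config; the `inventory` service is hypothetical - is to let each microservice publish its own route alongside its Kubernetes Service, so the gateway discovers routes dynamically instead of being configured centrally:

```yaml
# Each team ships its own route with its service; the gateway watches
# for these annotations, so nothing is hardcoded in one central config
apiVersion: v1
kind: Service
metadata:
  name: inventory
  annotations:
    getambassador.io/config: |
      ---
      apiVersion: getambassador.io/v2
      kind: Mapping
      name: inventory-mapping
      prefix: /inventory/
      service: inventory:8080
spec:
  selector:
    app: inventory
  ports:
  - port: 8080
```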

Daniel Bryant: The second thing that we see quite a lot is the ability to have comprehensive configuration. So not only have you got more things at the edge, but quite often the services are different from each other. One service exposes a REST API, one service exposes WebSockets, we've got gRPC-Web - there's a bunch of different protocols these services can support.

Daniel Bryant: So you need to make sure when you're looking at a gateway, do recognize what protocols you need to support. The classic in the enterprise space is SOAP. Folks say to us "We're moving to REST, we're moving to gRPC, but we've got this SOAP stuff, because hey, it is what it is." SOAP is very different to REST. So you need to make sure that the gateway supports the old world and the new.
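
For instance, supporting a gRPC backend next to REST ones is mostly a per-route flag in a modern gateway. A hedged sketch with an Ambassador Mapping (the `search` service is hypothetical):

```yaml
# gRPC routes on the fully-qualified service name in the request path
apiVersion: getambassador.io/v2
kind: Mapping
metadata:
  name: search-grpc
spec:
  prefix: /search.SearchService/
  grpc: true                 # proxy HTTP/2 gRPC traffic for this route
  service: search:50051
```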

Sven Johann: Okay. Yes, you tend to forget it. Or I tend to forget this SOAP affair. Alright, let's change gears a little bit, going to service meshes... Definitely, at least in my bubble, the hottest topic for the last maybe 2-3 years. The interesting thing with service meshes for me is it seems that people can actually use it right now. In the interview I did with Sam Newman a year ago I said "Sam, I recently visited a training on Istio", and they said "Oh, you know, it's not quite ready... But just wait another six months and then you can use it." And then Sam was like "Ah, you know, they keep saying this for three years now..."

Sven Johann: Also, I think once 1.0 came out, there was an interview with Eric Brewer, VP of Infrastructure at Google, and also the CAP theorem guy... And the interesting question was "Is Istio ready for production?" That was maybe March 2019. I will also put this interview in the show notes... And he was like, "Yeah, you know, it's 1.0. This means it's production-ready. But you know..."

Sven Johann: What I've found interesting - he was really, really positive. He made it very explicitly clear that building and open-sourcing Istio at Google was not an easy thing to get an approval to do that. But he was really positive about the idea. But still, it's new technology, be careful...

Sven Johann: I was always very hesitant to use a service mesh, because I hear people like Sam Newman, or even Eric Brewer, saying "You have to be very careful." But now we will introduce a service mesh in the next six months, I assume. Actually, it's on the roadmap, and we'll discuss the migration ideas in a couple of minutes... But before that, maybe let's just set the baseline. So what is a service mesh?

Daniel Bryant: A service mesh is a piece of technology -- it's almost an abstraction layer over the application-level communications. It's probably easier to describe how a mesh is implemented. Within each service of our system, within each application, if you like, you have a sidecar system, typically, or some way of that application joining the mesh.

Daniel Bryant: The classic way at the moment is that in containers you have a sidecar container, something like Envoy, and all the communication from each service going to another service goes via these proxies, these sidecar proxies. So you don't have app A talking to app B; you have app A talking to proxy A, and proxy A talks to proxy B, and then proxy B talks to application B.

Daniel Bryant: So you have this kind of abstraction mesh, if you like, across all the communication of your applications. And it's not just applications - it can also be, of course, an API gateway, it can be things like data stores... These kinds of things. But the fundamental difference is you put this layer of abstraction above the applications, if you like. Or it's probably better to say below the applications, so that all communication is now going through this common mesh, this common fabric.

Daniel Bryant: And that gives you the ability to observe in a standard way, whether your service A and service B are both written in Java, or one's written in Java and one's written in Ruby. It doesn't matter from an observability point of view now, because you're wiring into these proxies, you're wiring into the mesh. So it gives you a common way to observe 503s being returned on HTTP, it gives you a common way to understand latency, these kinds of things.

Daniel Bryant: It also gives you security. Because (again) you're not baking security into the individual services, you can bake security into these proxies; providing the app can securely talk to a proxy, the proxy can then do things like TLS (transport layer security), it can do things like auth, it can make policy decisions... There's a very interesting framework called the Open Policy Agent popping up at the moment that gives you almost rule-based access control - service A can talk to B, but perhaps not talk to C, or not be able to talk to C... These kinds of things.
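
A sketch of that kind of service-to-service policy, using Istio's AuthorizationPolicy as one concrete example (names and namespace are hypothetical): service A's identity is allowed to call service B, and anything unmatched - service C, say - is denied once an ALLOW policy is in place:

```yaml
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: allow-a-to-b
  namespace: default
spec:
  selector:
    matchLabels:
      app: service-b           # enforced at service B's sidecar proxy
  action: ALLOW
  rules:
  - from:
    - source:
        # SPIFFE-style identity of service A's service account
        principals: ["cluster.local/ns/default/sa/service-a"]
```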

Daniel Bryant: And finally, it also gives you reliability. Many of us in the Java space have used things like Hystrix from Netflix, which is a circuit breaker; we've also used retries, bulkheads, timeouts, but we used to bake that into our application language. Hystrix in the Java case - Ruby's got something similar, and Go's got something similar, but the implementations were often different. Whereas now, because we're using this common abstraction layer, we can do network-level circuit breaking and retries in a standardized way. Because it's outside of the actual application, you can do it in a standardized way. It's not quite the same as doing it in the app - you've gotta be careful if you're running some sort of business logic, particularly with retries; you don't wanna retry certain transactions out of process, because you'll get a bunch of disasters happening. But that mesh offers those three things - observability, security and reliability - and it really is a fundamental abstraction there of the communication.
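
A minimal sketch of that mesh-level reliability, using an Istio VirtualService (the `service-b` name is hypothetical) - note the caveat above about only retrying requests that are safe to repeat:

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: service-b
spec:
  hosts:
  - service-b
  http:
  - route:
    - destination:
        host: service-b
    retries:
      attempts: 3
      perTryTimeout: 2s
      # retry only on conditions that are safe for idempotent calls
      retryOn: 5xx,connect-failure
```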

Sven Johann: When you talk about sidecars, in the Kubernetes world this would mean my service is in a pod, and the sidecar - the Envoy proxy - is in the same pod.

Daniel Bryant: Exactly, Sven. It's a separate container. I've used Consul quite a bit as of late, but Consul will even bootstrap an Envoy sidecar preconfigured for you. So you deploy your application... I think Istio has something very similar, with webhooks, and you can just specify "I want this to be Envoy-enabled, or Istio-enabled, or Consul-enabled", and then it will wire in another container, a sidecar container into your pod, that sits alongside your existing application containers.
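
With Istio, for example, that opt-in is a namespace label; the mutating webhook then injects the preconfigured Envoy sidecar container into every new pod (the namespace name is hypothetical):

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: payments
  labels:
    # the webhook adds an istio-proxy (Envoy) container to new pods here
    istio-injection: enabled
```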

Sven Johann: Yeah, I mean, because a service mesh basically spans the whole application, or makes a mesh across the application, I used to be very scared of it, but I have to say we built a lot of those things ourselves. We have our own authentication proxy, which is a sidecar in a pod. We do mTLS communication between the services, service to service, stuff like that. Also, you talked about observability, the whole distributed tracing - of course, we also have a solution for that...

Sven Johann: Circuit breakers I think are quite interesting -- or resilience in general, but circuit breakers as a special case... I've found that quite interesting, because I'm a Java guy, or a JVM guy, and I always thought with a circuit breaker - okay, you just take Hystrix and that's it. But then we have Node.js people, and at least they said there is just no such thing as Hystrix on Node.js... So it really helps to have this platform-independent. Also, canary testing is something we have on our list for the service mesh. I'm not sure if you -- you didn't mention it, but...

Daniel Bryant: Great point, Sven. There are fundamentals I forget about sometimes... But traffic shaping and traffic splitting is a fundamental property of API gateways and service meshes... And the ability to split traffic, so that one percent of users go to a new service, is a great way to canary-test that service.

Daniel Bryant: One thing I would advise - and again, I am probably biased with my work at Datawire here - is start at the edge and work your way in. We've had some customers that tried to do canarying with Istio and got a bit confused... Because you can put canaries at an arbitrary point in the stack, but then when you're debugging you've gotta remember, as the request flows down through the services - did it get canaried? Did it get canaried in two services? Where did it get canaried?

Daniel Bryant: We often say to folks "Canary at the edge", and it's really obvious then. Maybe one percent of your traffic goes to this new service, or gets a property injected into the header, and then that goes down through the stack... But it's very easy when you're doing lots of canarying at arbitrary points to get confused.
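
As an illustration of canarying at the edge, a weighted split might look roughly like this with Ambassador's Mapping resources - a sketch with invented names; check your gateway version's docs for the exact weight semantics:

```yaml
# Hypothetical sketch: ~99% of /checkout/ traffic goes to the stable service,
# ~1% to the canary. Based on Ambassador's Mapping resource.
apiVersion: getambassador.io/v1
kind: Mapping
metadata:
  name: checkout-stable
spec:
  prefix: /checkout/
  service: checkout:8080
---
apiVersion: getambassador.io/v1
kind: Mapping
metadata:
  name: checkout-canary
spec:
  prefix: /checkout/
  service: checkout-canary:8080
  weight: 1                 # roughly 1% of requests go to the canary
```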

Daniel Bryant: In my experience at least, when you look at some of the big companies that are doing this at scale, successfully - like Twitter and others - they've got a whole lot of machinery that supports it. They've written dedicated tools to understand where the request flowed, where it branched; they can link these things up, they do machine learning based on the responses going back... Obviously, the scale at Twitter is so big that you can't look at it manually. You have to have machines understand these things.

Daniel Bryant: So there's a lot of caveats with this thing. I often say to folks, be really clear about the requirements you have for canarying. Sometimes you can find it's easier to do with feature flags, for example, and quite often you can pull a requirement closer to the edge... And be conscious the whole time that if you are doing these traffic-splitting things, they are very powerful, and they actually are game-changing, to use an overused phrase... They really are game-changing if you get it done right, in terms of continuous delivery and progressive delivery... But with great power comes great responsibility, as the old cliché goes. I've definitely seen folks get a bit overwhelmed with the cool stuff they can do with a service mesh, to be honest.

Sven Johann: We also look at the Accelerate metrics, and of course we want a low change failure rate... And back in the day, like five years ago, we did canary testing, but it was at the edge; the reverse proxy just looked at some property in a cookie, and then decided "Okay, let's go to this version or the other version." On EC2, with everything very static, it was "easy" to do canary testing... But I've also found this very helpful. At that time I was really relaxed about doing a deployment, because if the deployment failed, it only failed for 10% of the users, or 1% of the users, depending on the project... And if it worked for this 1%, I could move to 10% and then 100%. But now, when you say "Do it on the edge" - you mean the decision on which version to choose, you should make that at the API gateway?

Daniel Bryant: Yes, so it's quite an overloaded recommendation, I guess, because it depends on how folks have configured their services. We often see that when people are starting out on their microservice journey, they have quite a shallow stack - often it's the monolith and a few other services next to the monolith. So then it's quite easy to make a decision at the edge. You might have service A and service A', and you canary 1% of traffic to service A'. So if you've got shallow systems, it's really quite simple to do.

Daniel Bryant: When you move to what a lot of folks now are calling deep systems - where the microservice call chain is 2, 3, 4 other services, so the edge routes to the first service, and that service then routes on to another service, on to another service, and so on - I always recommend folks keep the chains as shallow as possible, for so many reasons, in terms of understandability and latency and so forth; I've seen many things over the years. Then you can do a number of things. You can have literally different chains - almost like two separate chains - and you route to the different chains, depending.

Daniel Bryant: You can also do as you mentioned - put something in the cookie, at the edge, and then the services look and do things depending on what the cookie value is. Does that make sense? You can have an arbitrary deep stack, but you set the cookie at the edge, because then the traceability is much easier to understand; when this request came in, that person was eligible for Canary testing, we set the flag, and therefore we know all the services downstream (or upstream) of that request went through and processed that request with the Canary flag.

Daniel Bryant: When you start having multiple canary flags - which sometimes is essential for certain systems - for smaller systems, or systems where you're just starting out with the microservice pattern, I recommend keeping it as simple as possible. If you've got multiple canaries happening at multiple points of a deep stack, you get quite deep and wide branching... Does that make sense? The combinations of ways requests can be handled get quite complicated.

Daniel Bryant: I think if you keep something simple at the edge, you at least know this cookie was set at the edge, and therefore you can trace the flow of the cookie up the stack much more easily.
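
A hypothetical sketch of how a downstream service could branch on a flag set once at the edge, using Istio's VirtualService syntax with invented names (the "canary" and "stable" subsets would be defined in a separate DestinationRule, omitted here):

```yaml
# Hypothetical sketch: route on a canary header that the edge injected.
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: service-b
spec:
  hosts:
    - service-b
  http:
    - match:
        - headers:
            x-canary:              # flag set once, at the edge
              exact: "true"
      route:
        - destination:
            host: service-b
            subset: canary         # defined in a DestinationRule (omitted)
    - route:                       # default: everything else goes to stable
        - destination:
            host: service-b
            subset: stable
```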

Sven Johann: That makes sense. I never thought about the deep call chain, because we consider having so many services calling other services a not-so-good solution... So keep the tree small - not a deep tree, but a maximum of three or four hops.

Daniel Bryant: Yes, I've seen one pattern being espoused by a number of vendors (I won't mention their names), and I think it's with good intentions that they're encouraging folks to do this... But it's the classic layering approach. I talk a lot about systems of engagement and systems of record. Systems of engagement are your web apps and your funky cool stuff, and systems of record are the data stores and all these other things. And these vendors are kind of steeped in the traditional enterprise way of just adding another layer, because that's kind of what you did back in the day... And they're advocating that you have these application-level services or APIs, that then talk to process-level APIs, which then talk to system-level APIs. It's an arbitrarily deep stack.

Daniel Bryant: I understand why they're pitching that. They're trying to say "As a developer, you just build up the appropriate layer you're working on." This would be like the classic Java thing - you used to always have a DAO, a service layer and a view layer, because that was the done thing. In an ideal world you could swap out the view layers or you could call multiple DAO layers from multiple views... I think that's the kind of mentality folks are bringing into this new microservice architecture, but I am 100% in agreement with what you've said; I actually think keeping the call chain as small as possible ultimately makes it easier to understand end-to-end.

Daniel Bryant: I think they're good intentions, with these multiple abstraction layers of application, process and system, and I get where they're coming from, but it optimizes for local understandability. If you're looking at the global picture of "How did my customer's request get served?", keeping that call stack short makes it much easier.

Sven Johann: Yes, that's really the case. And it's faster.

Daniel Bryant: Yes.

Sven Johann: Talking about being faster - one thing... I said a service mesh is complex, therefore I was a bit scared... What about latency? Introducing all those sidecars - doesn't that introduce lots of latency? Is that really a big problem, or would you say "Well, sidecars are local calls, and it's acceptable"?

Daniel Bryant: I think it's always good to look at your requirements, and I think there are a couple of angles here. One is the "Premature optimization is the root of all evil" kind of thing. I do see some folks getting really worried about potential latency introduced by a service mesh, when in reality their system is a standard e-commerce system, and a couple of extra microseconds or milliseconds doesn't make a big difference. And the benefits they might get from having a service mesh clearly outweigh the costs.

Daniel Bryant: But then I've worked with some financial companies who are like "We've gotta have everything down to the microsecond", so they minimize these extra hops. So it very much depends on your requirements, and it's all about trade-offs. You're gonna get some benefits from a service mesh, but there's clearly some costs... Not only in terms of complexity of running the mesh, but in terms of latency.

Daniel Bryant: I think Matt Klein has shared some blog posts, and we've done some research internally at Datawire - for example, the standard kind of hop through Envoy costs single-digit milliseconds. So for most use cases it isn't a big problem, particularly when a hop over the internet is gonna cost you ten milliseconds or whatever it is anyway.

Daniel Bryant: It's one of those things -- I always say to folks, if you do have your doubts, do a small proof of concept. I've learned over the years there are just too many variables in a typical system for me to predict what's right or wrong - the way you've got your applications configured, the way your hardware is configured, all these things. I say to folks - I understand why you're worried, because yes, it is another layer in the stack, but do a PoC with a few jumps, with a few proxies in the mix, and just see what kind of performance you get at the P99, the worst(ish)-case scenario.

Daniel Bryant: Anecdotally, all of the folks I've chatted to have come back and said "It's not an issue. We've got five milliseconds of latency, but the benefits we've got by doing these things clearly outweigh that cost."

Sven Johann: In our case, since we already have sidecars - just not automatically injected by a service mesh - there is already this latency. I find it interesting that it seems to be easier to get faster by removing Kubernetes pod CPU limits... That was something that really surprised me, that CPU limits can have very bad consequences for latency. But I would also think that in the general case latency is not a big issue. It's something you have to keep in mind...

Daniel Bryant: Exactly.

Sven Johann: ...that if you introduce proxies everywhere, it can be a mess. So now I have all those proxies... How does it actually work under the hood? If I install a mesh, the proxies automatically get injected into the Kubernetes pods, and they are then centrally managed by the control plane? Is that how it works?

Daniel Bryant: Yes, pretty much, Sven. I would say to listeners, definitely check out the setup for your individual service mesh, because they are subtly different, but fundamentally it's exactly what you said. Consul is one I've worked with quite a bit - you inject the Envoy sidecar (it's like a one-liner) in the Kubernetes deployment... And obviously, if you've installed Consul into your cluster, then when you deploy your app with this one-liner, Consul recognizes that and injects a pre-configured container into your pod. And then it updates along the way; there are different mechanisms you can use to update it, but fundamentally there is a centralized control plane that talks to Envoy and synchronizes all the changes from the control plane into the individual proxies, which we call the data plane.

Sven Johann: Yes, I can somehow remember my Istio class... So when we look at the functionalities of the service mesh, how does this help with our application modernization? Of course, it makes things easier if I already have a microservice architecture, but you already need to have the split, right? If you move from monolith to microservices, you first need a decent number of microservices, and then you would introduce the service mesh in order not to develop all those things - security, tracing, canarying, resilience - from scratch.

Daniel Bryant: Yes, and one thing we've seen... A pattern that's quite popular - say you're moving to Kubernetes and you know it's gonna take a while to incrementally migrate - is using something like a service mesh for location transparency. Particularly Consul, because it's the one that provides the best connectivity across many platforms at the moment; others are looking to do this, for sure.

Daniel Bryant: So you can install a Consul agent alongside your existing app. The app can effectively be anywhere. I've got some tutorials, which I can share, on my GitHub, where I'm using Ambassador and Consul on Google Cloud Platform; I'm using some Terraform to spin up a Kubernetes cluster with Ambassador and Consul installed, and then I show you how to route to a Consul-based service that is outside the Kubernetes cluster - it's running on a classic VM in Google Cloud. And then you can move that service into Kubernetes, and Consul will (behind the scenes) change the routing. So it will no longer route to the external VM; it will route to the service in the Kubernetes cluster.

Daniel Bryant: The beauty of doing that is not only does your user never know or care about this, but even your API gateway doesn't... Because your API gateway is talking to Consul to get the service discovery information, and Consul is basically giving you location transparency over wherever these services are. And you can choose to always leave them running in a VM, no problem at all. But if you do wanna migrate them, the service mesh gives you that layer of abstraction.
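
For a rough idea of how that wiring looks, here is a hedged sketch of an edge gateway resolving a service via Consul instead of Kubernetes DNS, using Ambassador's ConsulResolver syntax; the names and address are invented, and the exact fields should be checked against your Ambassador version:

```yaml
# Hypothetical sketch: the gateway asks Consul where "legacy-service" lives,
# so the service can sit on a VM today and inside Kubernetes tomorrow.
apiVersion: getambassador.io/v1
kind: ConsulResolver
metadata:
  name: consul-dc1
spec:
  address: consul-server.default.svc.cluster.local:8500
  datacenter: dc1
---
apiVersion: getambassador.io/v1
kind: Mapping
metadata:
  name: legacy-service
spec:
  prefix: /legacy/
  service: legacy-service       # name as registered in Consul, not k8s DNS
  resolver: consul-dc1
  load_balancer:
    policy: round_robin         # needed when using a non-default resolver
```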

Sven Johann: Okay. So now let's say we decide that a service mesh is a good idea. Two things I still want to discuss are how to select the product, and what the migration tactics to a service mesh are. So maybe the first one - which products even exist, and how do I select the right one? I think within the last six months this became a very interesting question, because six months ago I had the feeling Istio was really leading the pack and Linkerd was somewhere behind, with everyone looking at Istio; but now I have the feeling the whole landscape has changed. Linkerd is quite popular, but also 3-4 other products came out - Traefik, AWS App Mesh, HashiCorp... So quite a lot of products out there. What are the things to look at if I have to choose? Obviously requirements, but... How would you approach this?

Daniel Bryant: It's kind of a similar answer, Sven, to the API gateway one. There are some standard requirements you can look for, but the space is developing rapidly; it's one of those things you constantly need to revisit. The big thing, pretty much first up, is: do you want a Kubernetes-only service mesh? Linkerd, for example, is focusing only on the Kubernetes space. They've been clear about that upfront; they believe Kubernetes will rule the world. So if you wanna have a service mesh that talks outside of Kubernetes, Linkerd is not the choice for you.

Daniel Bryant: Then another thing, probably very closely connected to that one, is the complexity of the mesh. Because Linkerd have made a bunch of assumptions around this - Kubernetes-only - the installation experience and the usability of Linkerd are amazing... Compared to other meshes that are more powerful, but also more complex to understand.

Daniel Bryant: So yes, Kubernetes-only... And you have to think about how much complexity you need or want. Istio has taken on a little bit of the classic Java EE vibe, in that it does everything... But as a developer, you may not want everything. This is something I struggled with in the classic Java EE days - you could literally do anything you wanted with a lot of the specs in Java EE, but as an engineer I actually wanted an opinionated way to do things. So I think Istio is a market leader in terms of Google promoting it, it is very powerful, and you can do pretty much everything you'd want from a service mesh with it, but that comes with the baggage of complexity.

Daniel Bryant: So if you want a simple use case, I think Linkerd has a fantastic installation experience. If you're just looking for TLS, I think Consul is a fantastic experience. Consul - I install it with a Helm chart into my Kubernetes cluster, add a one-line injection into my pods, and I've got mTLS between all my services. So those are a bunch of things...
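
For a sense of how little configuration that can be, here is a hedged sketch of Helm values enabling Consul's Connect injector; the value names follow HashiCorp's consul-helm chart, but verify them against the chart version you use:

```yaml
# values.yaml - hypothetical sketch for the consul-helm chart.
# Installed with something like: helm install consul hashicorp/consul -f values.yaml
global:
  name: consul
server:
  replicas: 3            # a typical small HA server cluster
connectInject:
  enabled: true          # enables the sidecar injector webhook; injected
                         # services get mTLS via Consul Connect
```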

Daniel Bryant: The standard stuff to look through is the traffic management, the observability, the reliability patterns... Those are pretty standard across all the meshes, and they're almost table stakes, to some degree. Like I mentioned with API gateways, there are some things you just kind of expect from an API gateway.

Daniel Bryant: Then the next thing is have a look at the community. There's a fantastic community around all the big meshes. Definitely Istio is a good community. They're very Google-driven, to be honest; that's one thing to bear in mind. Linkerd - fantastic community; obviously, driven by Buoyant, the company behind it; they're great people. Consul - HashiCorp, fantastic community in general. They really recognize the power of the community...

Daniel Bryant: So all three of the big ones have really good communities behind them. The newer ones - Kong with Kuma, Traefik with Maesh - have smaller communities behind them... And with Maesh, for example, you're also betting on non-commoditized tech; the underlying proxy of Maesh is custom Go code, whereas most of the meshes are based on Envoy, with the exception of Linkerd. You've gotta recognize those things as well.

Daniel Bryant: But those are probably the core things I'd look at. It is a very new field, and there are obvious table stakes, but there are also more subtle requirements. As I've mentioned a couple of times in this podcast, I definitely recommend folks try stuff out. I know it can seem like it takes a bit of time, but that time upfront will save you many days further down the line. Spin them all up - spin up Consul, spin up Linkerd, spin up Istio - have a play around with them, try to upgrade them in the cluster, try to add some services, try to take some services away... Just do it like a one-week bootcamp where you cram all the things you think you'll be doing over the next year into this PoC, and get a feel for it: is the documentation good? Does the service mesh behave as you expect? Is it easy to operate? Do my developers understand it? Often within an organization - anecdotally, from my conversations - a winner emerges. Each of the big meshes is quite opinionated, and those opinions will either map to your requirements and your organizational style or not.

Sven Johann: Yes, it's work, trying out everything...

Daniel Bryant: Yes, it is. Exactly.

Sven Johann: ...but with every decision that has application-wide consequences, the work is worth doing, because otherwise everyone suffers for a long time.

Daniel Bryant: Yes. I don't think you can do a paper analysis, particularly with service meshes. I don't even think you can do it with API gateways these days; that's why I was hesitant to talk about requirements of API gateways. People want a tick box; they wanna compare Istio with Linkerd, or they wanna compare Ambassador with Kong... And the tick boxes don't really represent, in my mind, a fair analysis. You have to play with these things just to understand them. It's new tech, it's a new world... You're bringing your baggage and your requirements with you, and how that translates to these new technologies - I think it's exactly what you said, Sven; it is work, but it's work that is really valuable in the bigger scheme of things.

Sven Johann: I also tend to be a checkbox guy, but the thing is, especially if it's something new - a service mesh, where there are all those products and you have no clue what to choose or by what criteria - then you just look at the features... But yes, without trying it out, making a prototype, even having certain teams deploy on it and play around with it... Not only one or two people playing around with it, but a broader group - yes, you have to do that.

Daniel Bryant: Yes.

Sven Johann: But let's assume you made a decision... How do I roll out a service mesh?

Daniel Bryant: I talked about this a little bit in my O'Reilly Software Architecture conference talk in Berlin. The pattern I like is called balkanization, where you basically take a segment of your network, deploy a service mesh in that smaller area, and if it's successful, then you gradually roll it out further.

Daniel Bryant: There are a few caveats on that. For example with Istio, I believe you have to do a cluster-wide install. So even if you've got one cluster and many namespaces, I don't think you can install it into just one namespace. That was the case when I last checked, so I encourage listeners to double-check that fact.

Daniel Bryant: Consul - I've done a whole bunch of work with Nick Jackson, a buddy of mine; we've talked about how you segment the network and how you roll out things like Consul. I know Linkerd have got this notion as well, where they talk about incrementally rolling out Linkerd. They've got a permissive mode for TLS connections, so you can go gradually -- you put TLS everywhere, but for the services that haven't got Linkerd connected into them, permissive mode drops back to basic unsecured HTTP communication. And as you add more Linkerd, you can upgrade as you go along.
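
As a hedged sketch of that incremental style: Linkerd's proxy injection is opt-in per namespace or per workload via an annotation, so you can mesh one area of the cluster at a time (the namespace name below is invented):

```yaml
# Hypothetical sketch: only pods created in this namespace get the Linkerd
# proxy; meshed-to-meshed traffic is mTLS'd, while traffic to and from
# unmeshed services falls back to plain communication until they're injected.
apiVersion: v1
kind: Namespace
metadata:
  name: team-a
  annotations:
    linkerd.io/inject: enabled
```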

Daniel Bryant: So definitely look at how the technologies you're choosing support these rollouts. Definitely be wary of monolithic installs and monolithic upgrades, because as we've said a few times, the service mesh is pretty much on the critical path of all your services. So if you do a monolithic install of a service mesh and it breaks, probably your whole application will go down... There have been some great stories - I think Monzo talked about some of their journeys, some of their experiences with a service mesh when it went a bit wrong. Like many things in software development - and many things in life, to be honest - start small, get that feedback, and then iterate.

Sven Johann: I think with service meshes people say "Well, the whole power of a service mesh is that it helps you application-wide, making things easier. So don't do it incrementally." There is some truth in that, but I also think it's new technology; if something goes wrong, the whole application goes wrong. Yes, we understand that the full power only comes when it's rolled out across the whole application, but - yes, start small...

Daniel Bryant: To your point, Sven... I think you mentioned ESBs earlier - I remember when I first started playing with ESBs 15 years ago; the vendors had the same pitch then. They were like "Oh, just go all-in on this ESB. It's the only way to do communication in your cluster, in your data center; it makes everything so much easier." And those projects used to take a year or two to roll out, and then there was always a lot of pain when you went live. So in some ways we haven't changed our mindset on these things, but I definitely recommend starting small. Otherwise it'll take you many months to actually roll something out, and then the pain of your learning will be crazy when it goes into production, if it's a system-wide thing.

Sven Johann: Yes, I think so. I don't want to have that pain... But it's interesting that you say the latest info is that Istio needs to be installed cluster-wide... Because I think it's a good pattern, if you have Kubernetes, to start with one small namespace, for example, and then add another one, and another one. So I will look it up. I'm really curious whether that's still the case with Istio, because I thought they were moving in the direction of making it easier to install, so you don't have to install everything at once - it's more decomposed; you can install a few functionalities, so to speak.

Daniel Bryant: Yes. You might well be right, Sven. To be honest, I haven't used Istio for a couple of months now, and in the service mesh world that's a long time. I'll be geeking out a bit over the Christmas holidays; I will play around with some of these meshes. I love playing with technology in general... But I find I have to play with tech every couple of months, because it all changes so much.

Sven Johann: Yes, that's true. Back in the day it was like "Do I choose Docker Swarm, Mesosphere or Kubernetes?" With service meshes it feels like the same situation; things can change overnight, and whatever you bet on might be wrong tomorrow.

Daniel Bryant: The nature of the game, isn't it?

Sven Johann: Yes, yes. Alright, so all my questions are answered. Is there anything in particular I haven't asked, but that you think is important to mention?

Daniel Bryant: I don't think so, Sven. I mean, I'm keen to give a plug to the stuff we're doing at Datawire. I thoroughly encourage folks to jump on our Slack, or read our articles... We're trying to share publicly as much of the thinking you and I have discussed today. So we're doing blog posts, we're in Slack, chatting about these things... Hopefully, something that has come through to the listeners is that this API gateway and service mesh space is pretty new; the cloud-native versions of these things are pretty new... And we're all learning together. Obviously, I've got my vendor hat on today, but we're trying to learn as much as we can from the community, as much as we can from customers. I know all the other service mesh and API gateway vendors are doing the same.

Daniel Bryant: So having some kind of discussion I think is really important, and regardless of where we hang our hats, it's really important that we share all this knowledge, share our learnings, give feedback... I see this happening a lot - for example in the Linkerd Slack and in the Kubernetes Slack I see a lot of interesting conversation... And I'm really keen to encourage folks, particularly the folks that are arriving in the space now, who are quite new, to join in.

Daniel Bryant: I was at KubeCon recently, and a lot of folks I was chatting to there are actually new to the whole cloud space. They were at KubeCon to learn about Kubernetes, service meshes and these things. And their input is really valuable - yes, they don't know all the latest buzzwords that most of us who have been in the space for a couple of years know; they don't know what a pod is, they don't know what a mesh is, or a proxy, or a service mesh proxy. But their input is really valuable, because it helps us all understand how to map the journey for people who are solving real-world business problems and have just been too busy up until now to look at cloud. Now that they're arriving, we all need to understand how we can help them on this journey towards cloud-native systems. So I would thoroughly encourage people to get involved in the conversation - that's my takeaway.

Sven Johann: Okay. Thank you very much, Daniel, and talk to you soon!

Daniel Bryant: Sounds good. Till then. Thanks, man.

Sven Johann: Bye-bye.

Daniel Bryant: Bye.