Conversations about Software Engineering

Conversations about Software Engineering (CaSE) is a podcast for software engineers about technology, software engineering, software architecture, reliability engineering, and data engineering. The three of us regularly come together to discuss recent events or articles, exchange our learnings, and reflect on our professional and personal experiences. Additionally, our guest episodes feature engaging conversations with interesting people from the world of software engineering.

Transcript

00:00:22: Sven Hello and welcome to a new conversation about Software Engineering. Today with Christoph Windhäuser, it's actually not a conversation about Software Engineering, but more about Data Architecture. So before I welcome our guest, we are still the three of us, Sven, Heinrich, how are you doing?

00:00:46: Heinrich Great, good to be here.

00:00:47: Sven Gut. Alex, how are you doing?

00:00:51: Alex I'm still fine, end of the week, it's great.

00:00:56: Sven Ja, ja, nice weather, that's true. Today I'm really excited to have our guest Christoph Windhäuser. He has a multiple decade, I don't know, 30 plus years career in data. Right now he is at Databricks. He was head of Data and AI at a global consultancy called Thoughtworks. You know, basically being at Thoughtworks and Databricks should be enough, ja? But I think there were a lot of other interesting parts of his career, being at SAP for example, but also you were in France, in Paris, doing your PhD back in the days when all those, you know, big guys like Yann LeCun and so on were around, you did your PhD in Artificial Intelligence, right?

00:01:54: Christoph Ja, this is right. So thanks Sven, Heinrich and Alex for having me on the show. I'm very, very happy to be here and talk to you. Ja, that's great. And Sven, you are right, that was an interesting experience. You see, I'm a little bit older, so I have up to 30 years of experience in the industry, and I actually did my PhD, that was with France Telecom in 1995, and at that time already in Machine Learning, and we worked with Neural Networks, that was the name at that point in time. It was not really Deep Learning because the networks weren't deep, but it was exactly the same principle. It was gradient descent, it was back propagation, so the same mathematics as you have today up to the Transformers, same mathematics behind it, and we could show it's actually working. What was obviously lacking at that time was, on the one side, the data. We had tiny little databases, so for speech recognition, for example, we recorded the speech by ourselves, cut it into single words or digits, and of course the architecture of the network and the hardware behind it was much more limited. So we could run, I don't know, one, two, three layers, that was the maximum, not only because of the hardware, but also because of the training data. If you don't have that much training data, you cannot train a model with a huge number of parameters, right? But it was working, and France Telecom was really interested in using this for digit recognition, dialing and this kind of stuff, and it was, ja, it was an amazing time, and Yann LeCun and all these guys were already around, doing their PhDs, at AT&T Bell Labs and other places, and we met at conferences. NeurIPS already existed at that time and it was very exciting.

00:04:23: Sven Cool. All right. Before we start, I think Heinrich had a question on the piano, did you?

00:04:31: Heinrich Ja, I really like the background and I was just curious about your your take on Chopin versus Tchaikovsky.

00:04:40: Christoph Äh ja, and by the way, this is a real background, it's not virtual, so you are right, and I'm a very lucky one because I'm one of the few people who actually has a grand piano in his office. Nice. So that's great, but I have to disappoint you a bit, Heinrich, I'm not playing Chopin or Tchaikovsky, not anymore, but I'm playing Jazz. I love Jazz music and Jazz piano, I also play in a band here in Cologne where I live, so you can see me live sometimes. And that is something I do when I'm not working on big data architecture.

00:05:25: [Unknown] Any favorite Jazz composers that you like to play?

00:05:30: Christoph Ähm, well, my favorite Jazz pianist actually is Bill Evans. I think he's the greatest. Oscar Peterson is also great, but these are huge, huge people, you know, you will never achieve that. But we are playing the standards in the Real Book, and of course Miles Davis stuff, Kind of Blue, probably a lot of people know this album. So that's just amazing music.

00:06:07: [Unknown] Nice. And it's so great to have it like in the picture, you know, it kind of sets the tone, it sets the atmosphere, and I'm a big music fan myself, sadly not so active at the moment, but yeah, really appreciate that, ja. Do you use it, do you use it, Heinrich?

00:06:30: Sven Do you use it between, I don't know, like meetings to relax, to to decouple?

00:06:37: Christoph Not really. I think it would be a good idea, right? But what really happens when I have my calls with my colleagues at Databricks or so, they say, oh, you have a piano, can you play? And sometimes, you know, when we have a little bit of time, I just play for them. That happens actually.

00:07:00: Sven Nice. Cool. I hope nobody tries to join in because of the latency. I mean, everyone knows these cringe moments when people try to sing in harmony over Microsoft Teams or something.

00:07:26: Christoph Actually, that's very difficult and you know, during the pandemic, we tried to to to practice rehearsal with our band remotely, but as you said with the latency and and Jazz is very rhythmic music, of course, that didn't really work well. Ja.

00:07:50: Sven All right, wonderful. Let's start with Data Architecture, I would say. Let's say there are three like modern types of Data Architecture, Data Lake, Data Lake House and Data Mesh, we want to talk about those. But first, before we start, I'm also not that young anymore, and when I started with Data Architecture, you know, there was the good old Data Warehouse, where, you know, I happily created materialized views for ETL processes to pick my data up. So what actually is a Data Warehouse and why did it become unpopular in the last 15 years?

00:06:51: Heinrich Why did it become popular in the first place, right?

00:06:54: Christoph It became popular and then maybe not popular any longer. I think it's a great question, Sven. And ja, probably the younger people among us haven't been living at that time, don't remember. But it all started, you know, with the ERP systems, at that time I worked for SAP, and these were very transactional systems and they had some kind of reporting, you know, like the reporting you have today, SQL reporting in a transactional database. But these reports are very limited, and what the companies like SAP and Oracle, Teradata, wanted to offer, and the customers really wanted to have, is some analytics workload, really doing this kind of slicing and dicing, what a data analyst is doing. So going into the data: show me all the sales results of this region or of this product group or of this sales organization or whatever it is. And this is a very different requirement than transactional listings or reporting, because it's more real time, the analyst goes there and you actually need access to a huge amount of data, and then you slice and dice and have to select out of that data. And the problem is, this is a very different requirement to the database. You know, transactional databases, and today we call them OLTP, online transactional processing, they are built to make lots of transactions more or less at the same time or at very short notice, so they block maybe a row in a database, make the update and then free it again as soon as possible so that the next transaction can go. Imagine flight booking systems, that's a very good example for this, right? But when you do analytics, you want to have access to a huge amount of data, and then you read actually, you are not updating, you are just reading, but big amounts, and you take it out of the database, then you make a selection on that, so this is a smaller part, and another selection, but then you go back and read again, so it's a completely different workload. And it was very clear that the transactional databases at that point in time were not made to do this with good performance, it was not working. And therefore they developed another architecture, which was optimized for this kind of workload, and they called this online analytics processing, OLAP, right?

00:09:45: Heinrich OLAP cubes.

00:09:46: Christoph Exactly, and that was the cubes, you know, big cubes, you can read and then you can slice and dice along dimensions through these cubes, get data out of them and then present it in some graphical way. And actually that was a lot of workload on these machines, and therefore some special hardware was required, so that the memory, the storage, had to be very close to the processing, and these machines were really on prem at that point in time, obviously very expensive, and you could not really scale, the only thing to scale is just to buy a bigger machine. And Alex, maybe you know this history from your companies, these were these huge machines in the basement, super, super expensive, these business warehouses, right? They still are there with a lot of customers today.
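
A minimal sketch of the contrast Christoph describes between transactional (OLTP) and analytical (OLAP) access, using an in-memory SQLite table; the table and its columns are invented for illustration:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, region TEXT, product TEXT, amount REAL)")
con.executemany(
    "INSERT INTO sales (region, product, amount) VALUES (?, ?, ?)",
    [("EMEA", "A", 100.0), ("EMEA", "B", 250.0), ("APAC", "A", 80.0)],
)
con.commit()

# OLTP-style: touch a single row, keep the transaction as short as possible, commit.
with con:
    con.execute("UPDATE sales SET amount = amount + 10 WHERE id = ?", (1,))

# OLAP-style: read a large share of the data and slice and dice along dimensions.
for row in con.execute(
    "SELECT region, product, SUM(amount) FROM sales GROUP BY region, product"
):
    print(row)
```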

00:10:54: Alex Oracle Exadata, it was for me back then.

00:10:58: Christoph Exactly, you remember, right? And that was, I mean, they worked great at that point in time, but that was also the disadvantage, they were not able to scale, very, very expensive, usually on prem, you know, CPU and storage were very close to speed up the processing, but at some point you hit the biggest machine already in your basement, you couldn't scale up even more, and that was a big problem. And then of course, and then we are here at our topic, the cloud came up, right? A decade or two ago, and of course the cloud infrastructure changed the whole picture. And what also changed is that new processing schemes, massive parallel processing, were developed. This all started with Hadoop at that point in time, because they had this massive amount of data and they said, hey, we cannot just build bigger and bigger machines, this doesn't work, let's scale horizontally, just buy cheap servers, cheap PCs, Media Markt PCs, well, kind of, right? But massive amounts of them, and have clever software that is distributing the data to these different processors, and then we are able to scale what we call horizontally instead of vertically, right? And that was of course a big, big shift in the architecture and it was very successful. So there was Hadoop, programming Hadoop was super, super complex, and then a bit later Spark was developed, right? And maybe you know who developed Spark, as a PhD project at the University of California, Berkeley, these have been the founders of Databricks.
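
A minimal sketch of the horizontal scaling idea, assuming pyspark and a local Java runtime are installed; the "cluster" here is just local threads, and the numbers are arbitrary:

```python
from pyspark.sql import SparkSession

# local[4] stands in for a cluster of cheap machines; the same code would run
# unchanged on a real cluster, where scaling out means adding workers.
spark = SparkSession.builder.master("local[4]").appName("horizontal-scaling").getOrCreate()

df = spark.range(1_000_000)          # a million rows, split into partitions
print(df.rdd.getNumPartitions())     # how many chunks the engine distributes

print(df.selectExpr("sum(id)").collect())
```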

00:13:30: Sven Ja. I mean, a long time ago I did a track at QCon London, it was called Big Data Architecture or something, it was 2013 or so, and my slogan, of course I stole it from someone from MIT, was about what actually is big data, I mean, back in the days, ja? So it was: data that is too big and/or too fast and/or too versatile for normal databases, like the classical databases, to handle. So is that, or was that, kind of correct? I mean, was there also, beyond the scaling, the speed of data coming in, or the types of data, like unstructured data?

00:14:47: Christoph The question is, as you rightly asked, what is big data anyway? When is data big and when is it not big, when is it small, right? And actually there is no border, you don't say everything bigger than a terabyte is big data and everything below this is not big data, that doesn't make sense. So today I think there's a consensus, I like this definition, I read it from James Serra, he's from Microsoft and he wrote a very nice book at O'Reilly, Deciphering Data Architectures. He actually said, very simply, big data isn't anything else than just all data, and that's exactly what you said, Sven, it's the kind of data, you know, we not only have this kind of structured data in databases, we have semi-structured data, JSON, XML, CSV, we have unstructured data, text, pictures, video, audio, we have batch data, we have streaming data, so I would say big data means you can handle all kinds of data, right? And then definitively, with these structured databases and warehouses, you are at your limit, that is not possible.

00:16:18: Sven Ja.

00:14:11: Alex I remember this wonderful quote from my supervisor, must have been 2004, 2005 when I was working at a wholesale company, and he said, you only felt real stress when you executed a query on the database cluster that brought down the whole logistics department, because they had to wait for your query to finish.

00:14:39: Sven Yes, yes.

00:14:40: Alex And that was the time when we bought the Exadata, but only to notice a year afterwards that it was actually idle 95% of the time.

00:14:55: Sven Ja, ja.

00:14:56: Alex And that, in my point of view, was a big problem of the data warehouse concept, isn't it? This idle time. And as you said, with the advent of cloud services, we could somehow have a pay as you go model, but that was kind of clumsy, if I remember correctly. Like the first installation of Hadoop, it wasn't so much pay as you go, it was actually a very big installation that we had there. So when did actually this point in time arrive that we had real on demand compute for data processing, would you say?

00:14:56: Christoph I cannot give you a clear date here. In terms of architecture, what was the shift? Very important are of course the cloud vendors, so 2006 AWS started this business, Google came two years later, Azure another two years later, but also the development of Spark, I would say, that was 2009, and then the further development of this made the whole thing much easier, and now with companies like Databricks you have Spark out of the box, so it's pretty easy to set it up, right? You don't have to turn all these knobs and optimizations. And Spark is a big advantage over Hadoop. Hadoop was really, as you said, very tricky to build, to optimize, to write the programs. And the nice thing about Spark, it's a descriptive language. So it's like SQL, you don't have to program how Spark is doing this and how Spark is optimizing. You just describe what Spark has to do with your data, right? Like SQL, etc. That is the beauty of this approach.
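
A minimal sketch of the declarative style Christoph describes, assuming pyspark is installed; the column names are invented, and the point is that you state what you want and let the engine decide how to execute it:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.master("local[*]").appName("declarative").getOrCreate()

sales = spark.createDataFrame(
    [("EMEA", "A", 100.0), ("EMEA", "B", 250.0), ("APAC", "A", 80.0)],
    ["region", "product", "amount"],
)

report = (
    sales
    .filter(F.col("amount") > 50)               # the "what", not the "how"
    .groupBy("region")
    .agg(F.sum("amount").alias("revenue"))
)

report.explain()   # the engine derives the physical plan, much like a SQL optimizer
report.show()
```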

00:17:40: Sven What I also like is this repeatability of the of the pipelines. If you kept your your source data, you could like repeat a pipeline and insert new modules into that, that was really nice.

00:17:54: Christoph Ja.

00:17:54: Sven Maybe we are we are already a bit too far in the pipeline. Before we talk about the pipeline, so we said data warehouse, it was just not the right thing, it had it had some issues, then infrastructure changed, the type of data we wanted to process changed. What was then the next architecture which came up? Was that the Data Lake?

00:18:42: Christoph That was probably the Data Lake, exactly. And Data Lake and data warehouse, they exist in parallel. It's not that you switch off your data warehouse and put everything in the Data Lake. The Data Lake had some other needs for which it was developed, right? And the Data Lake basically is, when the hyperscalers came up, like the Amazons, Azures and Googles, suddenly very cheap cloud storage was available, and kind of unlimited, right? So you're not storing your stuff on your own servers in the basement, you just could rent as much as you wanted in the cloud environment. And of course then it was already clear, we talked about Big Data, that data is important for the business, and everybody starts to first of all collect the data, right? Because maybe you are not even using it, but you thought, hey, it's important, my machinery data, sales data, sensor data, website usage data, whatever, just let's not delete it, storage is cheap, so let's store it, right? And later probably we might need this data. So that was the birth of the Data Lake, because you just take this cheap storage and, in the easiest way, you just have a structure of directories, some way how you order your data, and you just dump your data on this Data Lake in whatever format, be it CSV or Parquet files, whatever, and those were, I would say, the first Data Lakes, right? Of course, very soon you recognize all the problems this simple architecture and these simple principles are causing, right?
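
A minimal sketch of that "just dump it into cheap storage" idea, assuming pyspark is installed; the local path stands in for cloud object storage, and all names are illustrative:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("data-lake").getOrCreate()

events = spark.createDataFrame(
    [("2024-05-01", "click", "user1"), ("2024-05-01", "view", "user2")],
    ["day", "event_type", "user_id"],
)

# A directory layout plus an open file format (Parquet) is all the structure
# the simplest Data Lake has.
events.write.mode("append").partitionBy("day").parquet("/tmp/lake/web_events")
```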

00:21:00: Alex Can we contrast this with the data warehouse? I mean, we had this example of this Oracle database. Um, was the idea of a data warehouse still that you have a unified technology, like deployment of a certain database where you copy everything in, um, or was it already that you kind of federate and have multiple storage backends that you can kind of have on top of a like multiple technological solutions? And for the Data Lake, I think there is the file system and files, but I think we also have integrations with databases and and query front ends for this, right?

00:21:49: Christoph Ja, that was the next step, then you go kind of into a lake house, I would say. But to your first question, um, the classical data warehouses had a big problem with unstructured data, obviously. They were meant and built for structured data, table based data, and for that they are great. But now suddenly people have, um, ja, more XML stuff or even videos, text files, right? Unstructured texts, pictures, this kind of stuff. And this is really hard to store, it doesn't really make sense in the classical data warehouse. You cannot use SQL, you cannot select them, you cannot search for pictures or whatever, this doesn't work, right? And that was the reason why they put this stuff on Data Lakes, because that was easy, it's a storage system, right? And one important difference, as we say, between data warehouses and Data Lakes is, in data warehouses we call this schema on write. So when you write some data into the warehouse, you first have to transform it into a fixed schema, right? The data schema, so the kind of structure of the table, the fields, is it an integer or a float or a text, whatever it is. So you change your data and then you can store it. And when you read it, it's in this schema, right? Always. A Data Lake is different. It's schema on read, as we call it. That means you just dump the data there, you don't care about the schema basically when you write it, but when you read it back, then you probably have to transform it into the schema that you need. And this depends on the application. Is it an SQL and analytics application, then you have to transform it into a table format, is it a machine learning application, right? Then it's probably, I don't know, some Python stuff or whatever, it's a different file format. So that is a big difference between the two approaches.
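
A minimal sketch of schema on write versus schema on read, assuming pyspark is installed; paths and field names are invented:

```python
import json, pathlib
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.master("local[*]").appName("schemas").getOrCreate()

schema = StructType([
    StructField("customer", StringType()),
    StructField("amount", DoubleType()),
])

# Schema on write (warehouse style): the data is forced into the schema before storing.
orders = spark.createDataFrame([("c1", 19.99)], schema=schema)
orders.write.mode("overwrite").parquet("/tmp/lake/orders")

# Schema on read (lake style): raw JSON is just dumped somewhere...
raw = pathlib.Path("/tmp/lake/raw_orders")
raw.mkdir(parents=True, exist_ok=True)
(raw / "part-0.json").write_text(json.dumps({"customer": "c2", "amount": 7.5}))

# ...and a schema is only imposed when an application reads it back.
spark.read.schema(schema).json(str(raw)).show()
```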

00:24:20: Sven Ja, so now, um, so you said they both exist in parallel, right? So the...

00:21:47: Christoph They still exist, right? I mean, it's not the whole world switched something off and and switched on something new, of course, right?

00:21:56: Alex Ja, but I I was just wondering because there is also the Data Lake House. Yes, yes. And I was just wondering if the Data Lake House is kind of the solution to having a Data Lake and a Data Warehouse in parallel.

00:22:14: Christoph Ja, ja, that's the idea. And actually Databricks has invented this Lake House concept. I think it was in 2020. And it was, ja, kind of a revolutionary idea at that point in time. Basically what they said is, actually we don't need a special Data Warehouse with special hardware and all what we described, right, to achieve this performance. We have cloud storage, and that is basically a Data Lake, right? Where the data is in certain formats, but open formats, that's also important, not proprietary formats. And on the other side we have the compute, and the compute is also in the cloud, right? And compute is independent from storage, and it's also independently scalable. And with techniques like Hadoop, or Spark later, it was possible to scale also the compute and get as much compute as you need for your analytics loads. And this combination of having the Data Lake with your storage, and then having a high performance compute engine also running in the cloud on top of this, doing for example SQL stuff, that was the birth of the Data Lake House. So Lake House is an artificial word, a combination of Data Lake and Data Warehouse, obviously.
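
A minimal sketch of the Lake House combination, assuming pyspark is installed: the storage is just open-format files in the lake, and a separately scalable compute engine runs warehouse-style SQL directly on top of them; the path and columns are invented:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("lakehouse").getOrCreate()

spark.createDataFrame(
    [("EMEA", 100.0), ("APAC", 80.0)], ["region", "amount"]
).write.mode("overwrite").parquet("/tmp/lake/sales")

# Warehouse-style analytics, but directly over the files in the lake.
spark.read.parquet("/tmp/lake/sales").createOrReplaceTempView("sales")
spark.sql("SELECT region, SUM(amount) AS revenue FROM sales GROUP BY region").show()
```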

00:23:47: Alex Ja, okay, okay.

00:23:53: Christoph And maybe to add this, one really important point was that it was possible, with different tools, to kind of implement what we call the ACID properties on the data. If a Data Lake is just a file storage system, then of course you have the problem that maybe your read or write fails, right? You cannot recover, you cannot roll back, and this stuff. For small data it's not the problem. You see, okay, my copy was not successful, so I copy again, but imagine you have tens of thousands, hundreds of thousands of files, right? All the time there might be an error, there can be a network error, hardware error, whatever, and then very soon you have an inconsistent state in your database. You don't know exactly, was everything copied correctly and changed and transferred correctly, or are there some errors here, and this leads to follow-up errors, obviously. And you know, that is the big advantage of the Data Warehouses, this commit and this roll back, so that either the whole transaction is successful and you are again in a consistent state, or if an error occurs, you have the possibility to roll back to the previous consistent state. So that would mean you are always in a consistent data state, right? Whatever you do. And in a simple Data Lake this is not the case, but with tools like Spark and a layer in between the data and the compute, you could achieve this with software as well, right? And promise this, guarantee this. That's what we call ACID properties. ACID stands for atomicity, consistency, isolation and durability. That was a very important step.
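
A hedged sketch of getting ACID guarantees on top of lake storage through such a layer, assuming the open-source delta-spark package is installed alongside pyspark; the path and columns are invented, and the exact session setup depends on your environment:

```python
from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

builder = (
    SparkSession.builder.master("local[*]").appName("acid-on-the-lake")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

df = spark.createDataFrame([("c1", 19.99)], ["customer", "amount"])

# Each write is an atomic transaction recorded in a log: a failed job leaves the
# previous consistent version untouched instead of half-copied files.
df.write.format("delta").mode("append").save("/tmp/lake/orders_delta")
spark.read.format("delta").load("/tmp/lake/orders_delta").show()
```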

00:25:57: Alex Ja. Well in that case, consistency would refer to integrity, right? That you can rely on the on the integrity of of your of your data. I'm currently wondering Christoph, you said Data Lake is actually schema on read. And if we combine now Data Lake and Data Warehouse, where Data Warehouse basically means you know which schema you have and and you want to query. How does that go together conceptually, I mean?

00:26:46: Christoph Um, in the Data Lake, ja, you go back to the old Data Warehouse approach and you actually have schema on write, because you need this. Um, you have in your Data Lake some very well-known file formats, Parquet is the standard here, absolutely. And of course, at least for your tabular, structured data, you have schemas and you make sure that the data you save in your Parquet files is in that structure. That is important so that the warehouse part of your Lake House is working. But you are not forced to this. You can also still dump pictures and text files and whatnot into your Data Lake. You cannot query this with SQL, right? But you can use it with machine learning approaches, right? So you actually have more flexibility in a Lake House. Another big advantage of the Lake House is, before this you had different things. You had the Data Lake for your unstructured data and you had the warehouse for your structured data. And you should not forget, you always need some data governance. You need rules, you need who has access to data, what is data quality, all these checks. So this is what we call data governance. That's very, very important. Otherwise you end up in a data mess, or how we call it, a data swamp, right? The Data Lake very soon becomes a data swamp, and then you don't find any data anymore. You need a data catalog and maybe lineage and all this kind of stuff. Um, if you have a Data Lake and a Data Warehouse, or several Data Warehouses, you have to set up this data governance in all these places, probably with different tools. And this is difficult, because you have different tools, you have to set it up twice, you end up with inconsistencies. In a Lake House, the nice part is you have one governance layer for your lake, and that's the central place to set up your governance.
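
A hedged sketch of what that single governance layer can look like in practice, assuming a Databricks notebook (where a spark session is predefined) with Unity Catalog enabled; catalog, schema, table and group names are made up:

```python
# Access rules live in one place for everything in the lake, regardless of
# whether it is later queried with SQL or consumed by a machine learning job.
spark.sql("GRANT SELECT ON TABLE main.sales.orders TO `data_analysts`")

# The same catalog also answers "what is this table and who owns it?"
spark.sql("DESCRIBE TABLE EXTENDED main.sales.orders").show(truncate=False)
```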

00:29:17: Alex So it means you, you wrap your head around data quality of the data that is incoming, even though it doesn't have a schema yet. Um, or you don't know in which schema it will be queried, let's put it this way.

00:29:37: Christoph Right, but data governance is much more than schema, obviously.

00:29:41: Alex Ja, ja, of course.

00:29:41: Christoph Right? Ja, so it's much bigger than just the schema. It's for example, who has access to data, what about personal data, what about data lineage, this kind of stuff, quality rules.

00:28:37: Alex Mhm. Um, but when it comes to data quality, um

00:28:37: Christoph I would say it depends on the use cases that you have when you query the data, what quality you need. Right? So for example, taking the actuality of data, where you say, I must make sure that the data is not older than 10 seconds. Ja. Or no zero somewhere, or if it's a text field, you should not have numbers, this kind of stuff, 100%.

00:28:37: Alex And how would you go about this? If you say, ah, okay, there is a new use case that we have on the Data Lake where we say, we want to provide that use case, or let's call it a data application, but that means that we need to adjust our data governance. How would you approach that?

00:28:37: Christoph Ja, good question. Usually the data is transferred into your lake with, we haven't talked about this yet, with ETL processes, extract, transform, load, and that's also what we call data pipelines, right? These are automatic tools or scripts or software, whatever, that are consistently moving data from one place to the other. And you would build these quality gates or quality rules, however you call them, into these pipelines as checks, for example. And then you have a governance tool, for example at Databricks this is the Unity Catalog, where you can manage this kind of rules and this kind of pipelines in a central place.
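
A minimal sketch of such a quality gate inside a pipeline, in plain Python; the fields and rules are invented for illustration:

```python
import re

PHONE_RE = re.compile(r"^\+?\d+$")   # digits only, optional leading +

def check_quality(record: dict) -> list[str]:
    errors = []
    if not record.get("customer"):
        errors.append("customer must not be empty")
    if not PHONE_RE.match(record.get("phone", "")):
        errors.append("phone may contain only digits and an optional leading +")
    return errors

def run_pipeline(extracted: list[dict]) -> tuple[list[dict], list[tuple[dict, list[str]]]]:
    loaded, rejected = [], []
    for record in extracted:
        problems = check_quality(record)
        if problems:
            rejected.append((record, problems))   # quarantine instead of loading
        else:
            loaded.append(record)                 # transform and load would happen here
    return loaded, rejected

good, bad = run_pipeline([
    {"customer": "c1", "phone": "+4922112345"},
    {"customer": "c2", "phone": "022x-99"},
])
print(good)
print(bad)
```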

00:30:20: Sven Maybe one basic question, we talked about data quality, data catalog, data lineage, data access. For those of us who do not really know exactly what this means, can we briefly give a little definition of what that is? So what is data quality, what are typical properties of data quality? We already have actuality, what else? I mean, you don't have to list all the words, but just so that all listeners get a little feeling for it.

00:30:59: Christoph Absolutely. So data quality, you can write books about this. It is, well, data has a certain quality. You expect values, you know, and that is also important. Data is a very technical term. Data at the end are ones and zeros, bits and bytes, but what is important is that data is also information, and that is important for the business, for example. What does it mean actually? And therefore it has to be of a certain quality. So you define a schema, as we said. A schema is, you define how the different fields of a set of data are defined. Ja, for example, you have a telephone number and there shouldn't be letters in this, there should be only numbers, and maybe a plus for the country code, whatever. Or you have names with or without umlauts, I don't know, capital letters, small letters. These are all things you define. And if you obey these rules and the data follows these rules, then you have a high data quality. And it makes it easier to interpret the data. Ja. If it's not following the rules, you have inconsistencies, you can interpret the data in a different way, or you don't know what it means. Ja, if you want a telephone number and there are letters in there, you say, what should I do with this?

00:32:56: Sven Ja. So if I if I may rephrase that, data quality depends on what you want to do with your data. So you cannot in general say, I have great data quality if you don't know what your data will be used for.

00:33:13: Christoph Absolutely, a very, very important point. Later we will see, when we discuss data mesh for example, data quality has to be defined by the business department. They are the only ones who know what high quality data is, whether they can use the data, right? And you cannot see it in the bits and bytes, you have to look at the semantics of the data, basically.

00:33:45: Alex Ja. Before we dive into the data mesh, I have another question on the quality, if I may, Sven.

00:33:45: Sven Ja, ja, just go ahead with quality. We still have the catalog and the lineage. But go ahead.

00:34:01: Alex I wanted to know, Christoph, with your experience, did you meet situations where you had different business stakeholders phrasing different requirements that were opposing each other?

00:34:23: Christoph Ja, in terms of data quality? Actually that happens all the time, and later, not yet, but later when we talk about data mesh, you will see that is an approach to deal with this. But of course, in an organization, look at the business, they have the data, they use the data, but there are different departments, and different departments have a different view on data. For example, look at customer data. I mean, you know this from your company, right? The sales people, they have a different need than the service people, right? Or the after sales people, or maybe the product people who want to develop new products, they also need the customer data, but different aspects of this data. And maybe they're telling the sales people, hey, I want to know what the customer doesn't like about the product, because we want to develop a new and better product. And the sales people say, I don't care if my customer doesn't like my product, so I'm not collecting this. And you always have this kind of fight, who is the owner of the data, who defines the data, who defines the quality of the data. That happens all the time.

00:35:40: [Unknown] I think a typical workflow is also that you just have certain data sets available and people start using them, and you don't really know who is using them. And every time you read the data, the program you write actually has expectations on the data. And a data quality problem is kind of the thing where the expectations the program has on the data are not actually matching the reality. And I think a very interesting field here is, how do you facilitate this alignment between the producer and the consumer, and how do you kind of get an overview of all the consumers and the, you could already call it a contract, because I think this is the direction where this is going. Um, but yeah, Christoph, does that reflect your understanding of the space, and what approaches do you see to manage this tension?

00:35:01: Christoph Ja. Ja, this is another very interesting development going on in this field, and this has nothing to do with technology and architecture, this has to do with organization and actually the business side of the data. As you said, you have producers of data and you have consumers of data. And, um, you know, maybe still today, in the old days you had these data analytics teams, very technical, part of IT, they were dealing with data in a very technical sense, database administrators, they were looking: table spaces are working well, the database is performant, but they had no clue what these data actually are that they have in their database, right? No idea what it means for the organization, maybe production makes some sense to them, but they just see their bits and bytes. And so that has to change, that also caused bottlenecks, because the business was coming and saying, I need this report, I need this ML model, and then the technical people say, I don't really understand, or I have to learn this, how do we do this? And there is a movement to move more and more of these responsibilities into the business, and one of the very exciting things is what we call data products, or data as a product: like you have internal products, IT is a product, maybe your bookkeeping or controlling is a product, and data is another product, it has a meaning, it has business context. Um, and a product needs a product owner, who is responsible for this product, and you should sell the product to others, to other consumers as you call them, right? So the whole thing is moving from a technical aspect more and more into the direction of organizational changes and organizational responsibilities.

00:37:09: [Unknown] Yeah, I don't think it will stay there. I think having it in business and in management is kind of just saying that it's important, like people are sufficiently caring about this that you have business stakeholders caring about it. I think they will try to push it down to tech again, and it may not be the data specialists or data analysts, I think it will be software engineers that are more tightly integrating these things.

00:37:44: Sven That is an interesting viewpoint, and of course we could probably discuss the rest of the evening about this, that's exciting. Ja. But maybe let's get back to Sven's catalog and continue that. Exactly. Ja, oh, I want to mention on that, you know, famous, famous last announcements, but in the future we will have an episode with Simon Harrer on data products, data contracts, so that we have the chance to basically talk one hour about data products. Ja, okay, so, um, we had data quality, uh, what is a data catalog?

00:38:41: Christoph Uh, good question. Data has to be what we call observable, so this is the observability of data, that's very important. If you just have a bit of data on your laptop, that's not a problem, you know your data and you use it on your laptop, but imagine an organization, a huge organization with tons of data. If you want to use the data, you first have to discover it, you have to know what kind of data is available, ja? We talked about customer data, and imagine there's not just one file with the customer data, there are tons of data sources about customers with all the different aspects. So what is the right one to choose? You want, for example, to build a new report, um, I don't know, how customers are using a special product of your company in a special region depending on the weather, just an example, right? So what kind of data are you using? What is the correct data? And so that is what we call data has to be discoverable, people have to find it and say, okay, this is exactly the data I want, this is the description, this is the description of the schema of course, of the fields, which I can use to build more complex reports. And this is what a data catalog is doing, that is where all the data sources or data products, however you call them, are described, and people can discover the data there.
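
A minimal sketch of what a catalog entry could hold to make data discoverable; the entries and the search are invented for illustration, not any particular catalog product:

```python
from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    name: str
    owner: str
    description: str
    schema: dict[str, str]
    tags: list[str] = field(default_factory=list)

catalog = [
    CatalogEntry(
        name="sales.customer_orders",
        owner="sales-domain-team",
        description="One row per order, including region and product group",
        schema={"order_id": "string", "region": "string", "amount": "double"},
        tags=["customer", "orders", "region"],
    ),
]

def discover(keyword: str) -> list[CatalogEntry]:
    """Find data sets whose name, description or tags mention the keyword."""
    kw = keyword.lower()
    return [e for e in catalog
            if kw in e.name or kw in e.description.lower() or kw in e.tags]

print(discover("region"))
```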

00:40:15: Sven Ah, okay, okay, ja. I know I know a product for this, I just remember, it's called.

00:40:31: [Unknown] I think Collibra is the main player here, right?

00:40:33: Sven Sorry?

00:40:34: [Unknown] Is Collibra the main player here?

00:40:35: Christoph For example, I wouldn't say the main player, but definitively an important player on the market, absolutely.

00:40:35: Sven Okay. All right, then we also had data lineage. What is data lineage?

00:40:54: Christoph Data lineage, also important, is: where is the data coming from? What's the history of the data, ja? So you look into, I don't know, a report or into a data product and you want to know, hey, actually where is this data coming from? Um, so who has entered that data? How has this data entered the organization? Is it coming, I don't know, from a sensor, from the internet, from a machine, whatever, and maybe who changed the data over the life cycle of this data. And this is for example important, we all here in Germany and Europe, we have GDPR, you know, right? Um, this general data protection, what is it? Regulation, exactly. And you know, users or consumers have the right that their data gets deleted, and so you cannot just delete it in one database, you have to look where it is coming from, because you know, there are a lot of replications, there are pipelines, if you just delete at one place, it pops up again. So this is not helpful, and so you have to delete it at the source, for example, in all systems. Or if you do debugging, you have errors, you find out data quality is bad, why is this? Why are the fields wrong, or why is the schema wrong? For this kind of thing, data lineage is very important, and in particular for all this regulatory stuff. If you think about banking, insurance, by law they need to show where data is coming from, where it is transformed and changed. If this is financial data, for example, that's super important, it cannot just pop up somewhere and disappear again, so you have to really show how this data is traveling through your architecture and your organization.
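
A minimal sketch of lineage information: for every data set, record where it came from and which step produced it, so questions like "what do I have to touch for a GDPR deletion?" or "why is this field wrong?" can be traced back. All names are invented:

```python
lineage = {
    "reports.revenue_by_region": {
        "inputs": ["sales.customer_orders", "masterdata.regions"],
        "produced_by": "pipelines/revenue_report.py@v42",
    },
    "sales.customer_orders": {
        "inputs": ["crm.raw_orders"],
        "produced_by": "pipelines/order_cleansing.py@v17",
    },
    "crm.raw_orders": {"inputs": [], "produced_by": "source: CRM export"},
}

def upstream(dataset: str) -> list[str]:
    """All sources a data set ultimately depends on."""
    found = []
    for src in lineage.get(dataset, {}).get("inputs", []):
        found.append(src)
        found.extend(upstream(src))
    return found

print(upstream("reports.revenue_by_region"))
```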

00:41:29: Alex What I like about the data lineage aspect is that helps, it helps you validate your assumptions. So assumptions in terms of what can I actually do with a with a certain information or data point. Because you can always verify, ah, okay, it's coming from that system, okay, that system will not be able to update the data every 30 minutes.

00:41:53: Christoph Exactly, ja? We talked about data products and if you think about using a data product for some reporting or some business need, then the question is, is this the right data, as you said Alex, and and the question is of course, where is this data coming from, right? And and who at the end is the owner maybe or is responsible for it?

00:42:16: Alex Who has touched it in between.

00:42:19: Christoph Ja, exactly, who has touched this.

00:42:22: Alex And therefore it helps you to to sort of say handle stakeholder expectations.

00:42:31: Christoph Exactly, ja, and you can explain it, and one thing with data that is very, very important is the trust in data, right? We see this in organizations when people say, ja, I know there's this warehouse or this data lake and there's tons of data, but I really don't trust it, you know, I got a report, but it had a lot of errors and I don't trust it, and then they are not using it and you will never get it into production and into usage. So trust in data is very, very important, and if you can show the lineage, where things are coming from and where maybe something went wrong and you corrected it, that's very important for the trust in data.

00:43:09: Alex What I also noticed is, sometimes, I'm not saying this is the usual case, but sometimes stakeholders have very high expectations in terms of quality, like this needs to be correct and precise up to the millisecond, and with data lineage in place, you can actually show them what costs it would create if you had to have this consistency, this actuality of data, and that gets you into, I would say, more constructive discussions, because in the end the stakeholders would be the ones that have to invest this amount of money.

00:43:58: Christoph Ja, in order to get that. And this is another very important aspect, the whole financial aspect of your data management. Imagine, if you have everything in the cloud, it's pretty expensive, it's really a significant amount of money you are paying to the hyperscalers, obviously. And of course the management wants to know, okay, how much are we paying for what, right? You have tons of use cases, applications, departments, and they're all working on the same data architecture, on the same cloud provider, but who is really generating which kind of cost, right? That's important information as well.

00:44:49: Alex Isn't it, I forgot the name of the product, but their marketing slogan was, put all your data into our data lake, it's very cheap, and then it was extremely expensive to get the data out of there.

00:45:03: [Unknown] I mean, that might be the case. But for this also lineage is important, right? To see where data is traveling, and then also having the associated cost to this, as Alex said.

00:45:19: Alex I mean, if I had a magic wand and could just have one thing, then having perfect data lineage, I think, would be ranking way up there. And why? I mean, I really like to think about parallels between microservice architectures and data architectures, and what can we learn from them. And in the microservice space, for me, distributed tracing really stands out as a core enabler of observability and debugging, that allows you to reason across microservices, right? You cannot really understand the reliability of a single microservice, you have to piece it together and understand these multi-system interactions, understand the relationships. Ja, and here you have really HTTP requests that are nested, and every time you are processing something, you really have a chain of sockets and memory that is allocated in different microservices that give you the context of a transaction, and with distributed tracing you get information about the state, so that you can understand causality: why am I here, who is the user, who expects something from me, and so on. And even without distributed tracing, in the moment you are doing the processing, you have all these connections and all this memory allocated, you have this information in the system somewhere. And with data, I always feel like, okay, we have thrown all of this away. Once we produced that row in the CSV file, we erase everything, it's gone, it's no longer there, and all we are left with is the result, it was 35 euros, this thing cost. And we don't know which version of the code produced this and all that, so it's very decoupled. That's kind of the attractive bit about it, you are not limited by having all the systems around you while you do the processing, but you also threw everything away, you just retained the output. And I have a fundamental kind of hunch that we have to retain a lot more information about the processing in the way we publish and handle data, and that's, I think, fundamentally something that should be enabled through data lineage or systems that are handling data lineage. And an additional complication here, just to say one more, is the platform aspect. I think probably if you're all in Databricks, you can have kind of better tools, if you're all in Airflow, you can solve parts of this, but what if this arrives on an FTP server or something, and then it goes to Kafka, and then it hits Databricks, and then you have a custom reporting application after that. So how would that look in the end, how can we approach the problem, ja?

00:48:58: Christoph This is a very good, that's a one million dollar question, I would say. Maybe before we dive into this, Heinrich, your first question. I don't think that you throw everything away. Maybe that was in former times, or this is the case on your laptop when you move files around. At Databricks, we have this Delta layer, right? And you have time travel, for example, you can go back in time, you have all this versioning, so all this information is stored actually, all this lineage as you said. You have this information available, the problem is how to use it and what you do with it, but the data is there, it's not thrown away.
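
A hedged sketch of that versioning, assuming the Delta-enabled Spark session and the illustrative table path from the earlier sketch; both the path and the version number are examples:

```python
# The transaction log records every change made to the table...
spark.sql("DESCRIBE HISTORY delta.`/tmp/lake/orders_delta`").show(truncate=False)

# ...and "time travel" reads the table as it looked at an earlier version.
old = spark.read.format("delta").option("versionAsOf", 0).load("/tmp/lake/orders_delta")
old.show()
```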

00:49:49: Alex Okay, so one more reason to really check out Databricks in more detail to understand all this power, ja? But, ja, I fundamentally think we have to retain it and build better tools to leverage it. I mean, I'm a reliability engineer, so a lot of the use cases I care about are: okay, we have a data quality problem or a data availability problem in production, like what the hell is going on, ja?

00:50:25: Christoph Ja, ja, and of course putting these tools together augments the complexity of the solution, 100%. On the other side, that's what customers usually like today, they like this best of breed, and it makes a lot of sense, ja? You're not going to buy everything from one vendor, you choose the things and you use a lot of open source tools, but you have to stick things together, right? And of course you have this kind of complexity. But also, you know, the real life and the business you are implementing is also complex, right? So don't expect that your data architecture is super simple if the life you are managing with it is very complex.

00:51:18: [Unknown] Hä. Ja. If there is inherent complexity, you cannot remove it.

00:51:18: Christoph Exactly, exactly, don't make it easier than it can be, right? But also don't make it more complex, 100%.

00:51:18: Heinrich Um, I would like to go back to the Delta Lake topic for a second, um, if I understood the Unity Catalog concept correctly, um, Christoph, is it actually correct that in order for Databricks to be able to support you, um, when you import a certain record of data, you have to describe the relationship to other data points, even if they are outside of Databricks? Otherwise, if the Unity Catalog doesn't know about a relationship of one record to another data point, then it obviously cannot help you. Is that a correct understanding?

00:51:18: Christoph I mean, we are not forcing any customer to use Unity Catalog, so Databricks also works, or Spark and Delta, without the Unity Catalog, but Unity Catalog of course, as we said, has huge advantages, it is the Databricks product for data governance, right? And for example, we have the advantage, I would say it's the advantage, that it not only includes the tables which are on the Databricks platform, but you can also include and manage tables outside of the Databricks platform, if they are in Parquet or Delta file format. But for this, of course, you have to register these tables in the Unity Catalog, otherwise this is not possible.

00:51:18: Heinrich Um, what I wanted to stress is, from my understanding, Unity Catalog is not only an enterprise data catalog like, I don't know, DataHub for example, this open source version, but it actually comes with a lot of additional features that would be helpful to improve data quality, or actually make data quality visible in the first place.

00:51:18: Christoph Ja, absolutely. We talked about lineage, we talked about quality rules, registering the tables and observability of tables and of data, this all is managed in the Unity Catalog, that's correct.

00:51:18: Alex Ja, and to be honest, Heinrich, I didn't get through all the power of the Unity Catalog either, so it's really something to look at every now and then in order to understand even more. But, I mean, that's anyway the topic with data, right? If you gain new perspectives, you should look at your data with this new perspective in order to see whether you need to change something in your data governance.

00:51:18: Christoph And in every data architecture, the data governance tool is really a central tool, it's really a centerpiece for basically everything, and for Databricks it's the Unity Catalog, and therefore we have all this functionality in this tool.

00:51:18: [Unknown] Um, just one tangent here, what is the granularity at which a data catalog is cataloging the data? A discussion that we had internally was like, okay, we have maybe a data set around customer, but it's not one data set, it's actually 15 data sets and three APIs and something more. And then there is this concept of a data product, you already alluded to it, and so there was a lot of discussion, like, what is the right concept that we should track and how this thing is exposed. Um, if you are just kind of listing, like a schema dump, every data set or table we have somewhere, then I think there is a risk of being too fine granular, and if we are zooming out, then there is, I think, always an ontology question, where to draw the boundary. Is there some guidance you can give for this kind of problem, or how is this generally approached?

00:51:18: Christoph It's hard, I think, to give general guidance on this, it really depends on the use cases, on the customer situation, on the architecture, obviously, right? I mean, the granularity goes down to tables, I would say, even to columns and lines in the table, you can have column based, um, sorry, um, authorizations, you can define this, right? So even down to the level of lines and columns, and on the other side you can combine tables into databases, and so you can really pick the right level of granularity you need for your use cases. So the tool is generally capable, and how to approach it, really look at the use case and see what your users need to work with, ja.

00:51:18: [Unknown] Mhm. [unintelligible]

00:54:08: Sven Alright. We still have one architectural style left. Before we dive into more topics, I just want to talk briefly about that one. I mean, I was about to say the new kid on the block, but it's already five, six, seven years old, or yeah, seven years, I believe. And that is the Data Mesh.

00:54:43: Heinrich It's it's six years old, ja.

00:54:46: Sven So what is what is a Data Mesh?

00:54:52: Christoph Ja, very good question, what is a Data Mesh? It was developed by Zhamak Dehghani at her time at Thoughtworks, I was working with her actually, and I think, Heinrich, you said this, and this is exactly what is behind it, you talked about a microservice architecture for data, that is exactly the idea behind the Data Mesh. Actually what Thoughtworks has developed, like microservice architecture, but also very important this domain driven design, maybe you remember, that's another architectural pattern for software development in big organizations. This just was combined, and they said, can't we use this for a data architecture, right? And so instead of microservices as a software service, Zhamak defined data products as one central piece, data as a product, and like Lego blocks, you can connect these different data products, so they have input and output, standardized, and you can connect them and then create very complex reporting and analytics tools, right? So that is the one idea. The other idea is that these data products actually belong into an organization, into a domain, into a business domain. So this is connecting the technical architecture, as a product, with a business domain, with a business surrounding. And this is also what this domain driven design is doing, if you know it, right? You define the different domains, sorry, business domains, and then you organize your whole software architecture along these domains and not independently through an organization, and that really reduces the complexity. That's the same as what you do with Data Mesh. You define your data products along these business domains, and then you reduce complexity and you define clear ownership, who is responsible, who is the owner of a data product, it is in this business domain, right? But from this you see the Data Mesh is not a technical architecture like a Lakehouse or a Data Lake or Warehouse, it's much more an organizational pattern, and that is very, very important to understand. So it's not a technical implementation project, implementing a Data Mesh, it's much more an organizational or transformational project in the organization.

00:58:10: Alex Ah, may I just mirror what I understood and you correct my my point of view?

00:58:18: Sven Absolutely.

00:58:20: Alex So, um, when I think about this kind of microservice architecture, what I heard reminds me of like domain gateways and ways to put interfaces in between, so I can segregate domains better. So if I naively translate this to a data architecture, I would just say instead of exposing tables, I kind of create specialized views that I give to certain stakeholders, so that data is kind of consumed through these views. Is that kind of the basic

00:58:53: Christoph Exactly, that's exactly the same. In Data Mesh, every data product has what Zhamak called polyglot data ports, that's her name for it. Polyglot means it is independent of technical standards, it can be batch, it can be streaming, it can be graph, it can be SQL, right? That's independent, but it's kind of a data port, and there are input ports and there are output ports. So you can collect data from somewhere, then you do data transformation, whatever you do with the data, could be SQL reporting, right, filtering, could be a machine learning tool, whatever classification, and you have these output ports where you provide this data again. And important, these are kind of polyglot, and of course you need a technology underneath that is managing all of this, of course.

00:58:47: Alex Can you kind of republish on the same port, so to say, a new kind of transformed version, or are they always kind of read only once they are written?

00:58:58: Christoph Well, the output port is read only and and the input port is you write into this, ja? So you have both directions, but you can stick these ports basically freely together and you can create new data products, reading from some data products and then creating some new knowledge or data and then providing this data at the output port for other consumers.

00:59:22: Alex So I'm thinking now of a multi-step pipeline, and this goes into a related problem, which is multi-domain ownership, or cross-domain data sets. So I may have a producer, like say an e-commerce platform, so you have a partner who provides a product for you, then you have certain augmentation steps, and then you finally feed it into a catalog. So you have three different domains that are owning it, kind of partner integrations and so on. So in the one model I have in mind, you would have certain channels or fields that are maybe owned by the partner integrations and that are kind of fed through. So they may be dropped in subsequent views, but they may also just be passed through. Then there are fields which are added to a thing by a certain domain, that's relatively clear. The question would be, is it allowed, it's maybe more a technical detail, but is it allowed to mutate certain channels which are already predefined, and then how would you have the ownership of those? Because those things, I think, get really hairy when it's like, okay, this is something that should be owned by, was produced by, partners, but now we kind of see two-stage derivatives being published under the same name.

01:01:00: Christoph Ja, so what you do is you define or build your new data product. What you're describing is, you have a need, you want to do a data transformation, so basically you define your own data product. You are the owner of this product, you are responsible for its quality. You can read from other products, you cannot change anything in the other products, you can just read the data from them, that's read-only. But when the data is in your product, of course you can do whatever you want with it. You can delete, you can merge, you can drop, whatever, right? That is your responsibility, and then you use the data on your own and you also offer this new data with an output port to other organizations. Ja, so as a data owner, and that's the beauty of this product concept, you are also responsible for offering your product to others in the organization. Maybe they find it super helpful and say, that's great what you are doing, exactly what I need, but I have to modify it again, and that I do in my own product. That is the whole idea of Data Mesh and data products.

01:02:06: Alex But isn't that like a, how do you say it, like a chain reaction? We had this wonderful case in microservice architecture when one microservice was calling another microservice, which was calling another microservice, which was calling another microservice. So how do you resolve, or actually guarantee, data quality if you are dependent on someone who is dependent on someone, you know what I mean?

01:02:41: Christoph And there the contract comes into place. I think we mentioned this already. The output ports also have a contract, the same as in microservice architecture, where you describe the data, and this actually happens in the data catalog, where exactly the schema and the quality of the data are described, and the owner of the product is responsible for adhering to this quality, right? So you rely on them, absolutely. If you find out something is wrong, then of course you call them and say, look here, this is not in line with the contract in the data catalog, there's a problem. That's the way this is managed.

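The data contract idea can be sketched in a few lines: the owner publishes a schema plus quality expectations for an output port, and the consumer (or the platform) validates incoming rows against it. The contract fields, column names and thresholds below are made up purely for illustration.

```python
from typing import Dict, List

# Illustrative data contract for an output port, as it might be registered
# in a data catalog: a schema plus a simple quality expectation.
ORDERS_CONTRACT = {
    "schema": {"order_id": int, "amount": float, "country": str},
    "max_null_fraction": 0.01,   # at most 1% missing values per column
}

def validate(rows: List[Dict], contract: Dict) -> List[str]:
    """Return a list of contract violations (empty list = contract fulfilled)."""
    violations = []
    for column, expected_type in contract["schema"].items():
        values = [row.get(column) for row in rows]
        nulls = sum(v is None for v in values)
        if rows and nulls / len(rows) > contract["max_null_fraction"]:
            violations.append(f"too many nulls in '{column}'")
        if any(v is not None and not isinstance(v, expected_type) for v in values):
            violations.append(f"wrong type in '{column}', expected {expected_type.__name__}")
    return violations

rows = [{"order_id": 1, "amount": 120.0, "country": "DE"},
        {"order_id": 2, "amount": "n/a", "country": "FR"}]
print(validate(rows, ORDERS_CONTRACT))  # ["wrong type in 'amount', expected float"]
```
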
01:03:26: Alex Since I know that I cannot predict production, I'm currently wondering whether there are already resilience patterns for data.

01:03:38: Sven Sorry, what do you mean by this, Alex?

01:03:40: Alex Um, I mean there is this wonderful book by Michael Nygard, who wrote about resilience patterns for, in the end, applications or microservice applications, where you have patterns like the bulkhead or the circuit breaker, and that basically says: okay, how can I make myself resilient if my relationship partner doesn't fulfill the contract or the expectations that I have? And that could be, for example, that I return a default value.

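One way Alex's idea of resilience patterns for data could look in practice is a circuit-breaker-style fallback around an upstream data product: if the read fails or the contract is violated, the consumer falls back to the last good snapshot, or to a default value. This is only a sketch of the idea under those assumptions, not an established pattern catalogue for data; the function and file names are hypothetical.

```python
import json
from pathlib import Path
from typing import Callable, List

def read_with_fallback(read_upstream: Callable[[], List[dict]],
                       cache_file: Path,
                       default: List[dict]) -> List[dict]:
    """Consume an upstream data product defensively: use fresh data if it
    arrives, otherwise the last good snapshot, otherwise a default value."""
    try:
        rows = read_upstream()
        cache_file.write_text(json.dumps(rows))     # remember the last good state
        return rows
    except Exception:
        if cache_file.exists():
            return json.loads(cache_file.read_text())
        return default

# Usage: the upstream read may fail (network issue, broken contract, ...).
def flaky_upstream() -> List[dict]:
    raise RuntimeError("upstream data product unavailable")

rows = read_with_fallback(flaky_upstream, Path("orders_snapshot.json"), default=[])
print(len(rows))  # 0 on the very first failure, the last snapshot afterwards
```
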
01:04:36: Sven Ja, ja, I think it's a very, very good and valid question. I actually have no answer to that question, but you see, the main challenge of introducing and running a Data Mesh is not the technology, it's the transformation, the behavior of people. And it sometimes completely changes the behavior of people in companies. I know, maybe you also have examples: in old companies, if you have data, if you are a data owner, you have power, and you are not easily distributing and giving this data away to others. You say, this is my data, right? Sales people say, no, no, of course you don't get the customer data, no way. Um, it's power, and Data Mesh turns this exactly the opposite way. It says: I'm an owner and I have to distribute it. That's a product, I have to offer this product to others in the company, right? And you need incentives and a whole management and governance structure around this so that it actually works. Um, but spinning this for a second, just for a second, Sven: what would be your recommendations in terms of adopting this towards data governance? Like, if we think of Data Mesh, for example, which so to say distributes responsibility, but at the same time makes responsibility very explicit in the role of data owners, what effect does that have on data governance, would you say, or what would be your recommendations?

01:06:37: Christoph Ja, in my experience, when I worked for Thoughtworks I worked in projects where we implemented Data Mesh architectures, and the most difficult part, the part that took the most time, was actually this organizational change. It starts with defining the domains. It's not that easy in an organization to really find the right cut and granularity of data domains, right? That is already the first fight, and you need really senior people in the room to help you with this, it's very important. The next step is: who are these product owners? These should be people from the business, and they often have no idea about data and IT and SQL and databases, right? Um, they suddenly say, I never did this, so what is my job? What does it mean? And now I have to build data products, how do I do this? It's a completely new job description, and you need IT people, or at least analysts or developers, who are actually developing these data products. So you have a decentralized organization, which usually companies don't have, so you have to change that as well. These are the biggest challenges of Data Mesh projects.

01:08:06: Alex What I'm currently thinking, Heinrich, is that we might soon see an extension of data aspects into something like service level objectives. Oh yeah, this is actually already happening. Um, over the last year I was personally doing data SLOs for a lot of the data sets, and we're running into all kinds of problems along those lines, and, um, maybe just to

01:06:43: Unknown To bring a new kind of impulse here, I would also be curious, Christoph, about the macro perspective: AI is, I think, driving a lot of this. If I look at data as a whole, I think for a long time all these problems existed, but it was the problem of the business analysts to figure this out for the KPI runs, and you had a certain amount of friction, manual labor and just incorrectness that you tolerated. Not everything is perfect, so you just write your big SQL query and get along. Um, and now I think it's shifting. It's really like, hey, we need to integrate tons of data into RAG systems or into chatbots that are then really business critical, and that's a different animal than catalog data or customer data that was already in those mission critical systems. And, um, yeah, what's your take on that global shift? What are you already seeing, and what is your extrapolation of that, maybe over the next few months?

01:08:19: Christoph Ja, great question, so let's look a little bit into the future. But before this, maybe one step back, because I find this very interesting. In an organization you have these kinds of flows. Usually when you produce something, you have a material flow: raw material, you produce something, that's your value creation. Then you have a financial flow, right? You have costs, you have revenue, that also has to be tracked by controlling and finance. And you have an informational flow: information is flowing through your organization, is needed at certain points for decisions, is generated somewhere and is used or consumed at another place. And in, I would say, the good old days, 20 years ago with the SAP ERP system, all these flows were managed in one ERP system. But in particular the data side is becoming bigger and bigger and larger and larger, and today you cannot just manage all the information flow in an ERP system. That's the reason why you need this information architecture in addition to an ERP system, right? And now you are mentioning Gen AI and machine learning and AI tools, and that again makes it even bigger and more important, right? Every decision somewhere in the organization needs this kind of data, is ideally data driven, right? And at some point in time it probably also gets automated, that's what I would assume in maybe five years. You know, SAP is investing heavily into that, also with Data Bricks as a partner, and today there is maybe an SAP administrator sitting in front of a screen, getting some data, making some decision and clicking some buttons, and of course that probably can be automated by a large language model, right? Um, so as you said, this means more and more data, because as we said in the beginning, you have to collect that data, you have to manage that data, it has to be in the right quality, that's very important, otherwise your LLMs are not working properly, and you have to monitor this, but you also need the right access rights, not everybody is allowed to see every data. And that is a big challenge with large language models, because you cannot explain to them, hey, this person is HR and can see this personal data, but this other person is not HR, so please don't show them the paycheck from Heinrich, for example. So you do this with RAG systems, as you said, you do this with vector stores, these are the kind of classical databases for vectors, and there again you have all these access rights which you can manage. Ja. Then you have the updates. Maybe you need your product catalog in a RAG system, right? You need this for decisions or for customer recommendations. But your product range is changing constantly, in your company you know this by heart of course, right? All the time, so how do you update this? You cannot do this by LLM training, that is not working, an LLM is not forgetting, right? It ends in a mess. So you again use classical database architecture, which is super interesting, for your vectors, for your knowledge vectors, to manage them.

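The two points Christoph makes here, updating knowledge in a store instead of retraining the model, and enforcing access rights outside the model, can be sketched with a toy in-memory "vector store". A real system would of course use an actual vector database; every name and number below is an illustrative assumption.

```python
import math
from typing import Dict, List, Tuple

# Toy in-memory "vector store": product_id -> (embedding, metadata).
# Updates and access rules live in the store, not in the LLM weights.
store: Dict[str, Tuple[List[float], Dict]] = {}

def upsert(product_id: str, embedding: List[float], metadata: Dict) -> None:
    store[product_id] = (embedding, metadata)   # catalog change = overwrite, no retraining

def delete(product_id: str) -> None:
    store.pop(product_id, None)                 # a discontinued product disappears immediately

def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def search(query_vec: List[float], user_roles: set, top_k: int = 3) -> List[str]:
    # Access rights are enforced here, outside the model.
    visible = [(pid, cosine(query_vec, emb))
               for pid, (emb, meta) in store.items()
               if meta["allowed_roles"] & user_roles]
    return [pid for pid, _ in sorted(visible, key=lambda x: x[1], reverse=True)[:top_k]]

upsert("sku-1", [0.9, 0.1], {"allowed_roles": {"sales", "hr"}})
upsert("sku-2", [0.1, 0.9], {"allowed_roles": {"hr"}})
print(search([1.0, 0.0], user_roles={"sales"}))   # ['sku-1'], sku-2 is filtered out
```
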
01:12:30: Unknown Ja, I think right now a lot of this is just driven by batch jobs which are running on a daily basis or something, and having the data available is pretty critical. I think if you extrapolate it out, it's probably going to be a bit more fine-grained, maybe hourly or online in the end game. Um, but a lot of this is currently relatively classical and the integrations are relatively naive, as far as I can tell. RAG, I'm always a little underwhelmed when I read about what it exactly is: just take the query, do some keyword lookup on the query in classical information retrieval, and then put that in addition to the prompt and say, maybe these documents are useful. And, um, I think we will see more creative ways to dynamically integrate, but at least, and my view is very limited on this, this kind of refinement training and training new language models is not happening so much. I think the majority is really relatively basic integrations. But yeah, the availability of the data, the quality of that data, is certainly business critical, so you really have: okay, we had some issues and there is production impact and revenue impact if that stuff is not there, versus KPIs missing in a business meeting, which does not have these direct implications, ja.

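The "relatively naive" RAG integration described here fits in a few lines: retrieve the best-matching documents for the query with classical keyword scoring and paste them into the prompt. Everything below (the corpus, the scoring, the prompt wording) is a simplified illustration, not any specific product's implementation.

```python
from collections import Counter

DOCUMENTS = [
    "Return policy: products can be returned within 30 days.",
    "Shipping: orders above 50 EUR ship for free within Germany.",
    "Warranty: electronics come with a two year warranty.",
]

def keyword_score(query: str, document: str) -> int:
    """Very crude relevance: count overlapping lowercase words."""
    q = Counter(query.lower().split())
    d = Counter(document.lower().split())
    return sum((q & d).values())

def build_prompt(query: str, top_k: int = 2) -> str:
    ranked = sorted(DOCUMENTS, key=lambda doc: keyword_score(query, doc), reverse=True)
    context = "\n".join(ranked[:top_k])
    return (f"Maybe these documents are useful:\n{context}\n\n"
            f"Question: {query}\nAnswer:")

# The resulting prompt would then be sent to whatever LLM is in use.
print(build_prompt("how long can I return a product"))
```
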
01:14:22: Christoph Ja. Um, I think what you're saying is right, but don't forget, we are moving really, really fast, not we, but the whole field, the AI field is moving crazy fast. Remember, I think it was three years ago that ChatGPT came up and everybody was blown away by it, just three years. And now a lot of companies are at least thinking about use cases and doing POCs, not everything goes into production of course. And one reason is, we all know LLMs are not perfect, obviously, they make a lot of mistakes, they screw up, they hallucinate. And at least in applications which are customer facing, you don't want that, that is way too dangerous, right? It could be hacked, there can be very nasty comments, which really is a catastrophe for the public and for your company of course, right? So I see companies, for their employees internally, are already more advanced and doing stuff, right? With RAG and with fine-tuning and training of LLMs, but customer facing they are still very, very careful.

01:15:45: Sven Mhm. I want to briefly go back to the different styles, like Data Mesh and Data Lakehouse, one final question from my side on those styles, because for me those are data architectural styles, like we also have architectural styles in, let's say, normal software systems, you know, like microservices, pipes and filters

01:16:30: Unknown Mhm.

01:14:17: Sven Stuff like that. So now, if I have to make a decision what kind of style I want to choose, then obviously I need some requirements, and based on my requirements I have to decide: do I now want to go for Microservices or a Monolith or whatever, right? How is that for Data Mesh and, for example, Data Lakehouse? If I'm an organization and I have to think about a Data Architecture, I have no idea, ja? My personal feeling is, everyone who uses Microservices is like: oh yeah, of course we use Domain-Driven Design, we have bounded contexts and per bounded context we have a service, and of course we use Data Mesh as our Data Architecture style. And I'm just wondering, is that always correct? What questions do I need to ask in order to decide: is Data Mesh the right thing, is Data Lakehouse the right thing? Is it even possible to have a mix of both?

01:15:46: Christoph Ja, definitiv, so in particular the Data Lakehouse and the Data Mesh are not mutually exclusive. The Data Lakehouse is a technical, architectural pattern for how you build your architecture. Data Mesh is more an organizational pattern, and for example you can build a Data Mesh with a Lakehouse, or with several Lakehouses, and a lot of customers are actually doing this. The question, what is the right architecture for you? Of course there is not just a simple answer to that, but one thing is very clear: nowadays, I would say, you have to go to the cloud, that makes a lot of sense, you would not have this stuff on prem. Maybe you are, I don't know, top secret or whatever, right? But usually you would go into the cloud. Then you can have a dedicated Data Warehouse in the cloud, something like Redshift for example, or BigQuery from Google, if you heavily use transactional data and classical analytics on SQL; Snowflake is another example. But the problem is, of course, if you go more and more into the AI, ML, GenAI space, that doesn't work with this, so you would need a Data Lake for your data. And then you should really consider, and I think the whole development is going in this direction, the Lakehouse. And it's not only Data Bricks who is offering this, Microsoft Fabric and others, they are all going in this direction of a Lakehouse, because it makes a lot of sense. It is, as we said, a combination of both: you have the Data Lake, you have the governance and you have your warehouses, performant, in the cloud. So I would say from a technical architecture perspective, the Lakehouse is probably the way to go nowadays. Data Mesh is then the next question, of an organizational and data governance type. It's a much longer way to go, as we said, regarding all the transformation of your organization, of your business, but if you want to go into this kind of Microservice organization, with ownership and data products, then Data Mesh is, I would say, the right way to do this. And, it's a bit controversial, but I would say it makes sense for larger organizations. If you are a mid-sized, smaller company, I don't know, 1,000 to 2,000 employees, you have an analytics department and they're probably fine with the Lakehouse and can build the reports and everything needed. But if you are a huge multinational organization and you have to organize the data between different continents and different architectures, then a Data Mesh probably makes a lot of sense. All right.

01:18:39: Heinrich Ja.

01:18:40: Christoph But of course you cannot just easily implement Data Mesh, there is always also some custom build. You can of course use Snowflake, Data Bricks or Microsoft Fabric, but you have to stitch a lot of things together, right? You have to build the data catalog, these interfaces we described. So there is still a lot of custom software development necessary.

01:19:12: Sven And, I mean, when I decide to go for Data Mesh, I can probably also introduce it incrementally, over a few parts of the organization.

01:19:27: Christoph Of course, you would never do this as a big bang, that makes no sense, it doesn't work. You start with an MVP, so usually you start with one domain and you define a handful, maybe one, two, three products, you build these products, you build the platform to run these products on, and then you can already combine them and you get the first results and people see how this works, right? And I saw with customers at Thoughtworks, when we implemented the Data Mesh, in the beginning a lot of skepticism, a lot of theoretical discussion, but when you had the first products, people were coming and saying, oh, I also have a product, and I also have a product, and can we also implement that product? Suddenly you have a huge demand in the organization and you have to say, stop, stop, stop, we cannot build thousands of products here, we have to organize this.

01:20:27: Sven Mm.

01:20:29: Unknown Hm.

01:20:02: Alex Okay.

01:20:05: Sven Ja, so, I mean, I wanted to go through all the different styles, so I'm happy now, but Alex and Heinrich, you had some more detailed

01:20:22: Heinrich We have a huge set of questions, ja, ja.

01:20:25: Alex Ja, I actually have two questions here, and you can choose which one you want to take. The first one is streaming in comparison with those things, so what's the relationship, and is the lake actually something that flows or something that stands still? And the other thing is around DuckDB and local compute. I just started using DuckDB recently, and it's kind of this shim library that allows you to query all kinds of data sets, like maybe a data warehouse would, because you have your parquet files lying around in file systems. And it's kind of the scale-down idea, which I like a lot: having technology you can use on a single node that adds value and that has a trajectory to scale out and sit behind the cloud compute. I would just like to get your take on that, how you view this coming into the picture.

01:21:48: Christoph Let's start with the streaming maybe. Streaming is getting more and more important, that's what we see with our customers, right? More and more streaming use cases are coming. Streaming doesn't mean 100% real time, zero milliseconds. Streaming can be 100, 200 milliseconds, maybe a few seconds, right? At Data Bricks, or in Spark, streaming is realized with kind of mini batches. You have mini blocks of data which move between tables like in batch, but because they are fast and they are small, it is really a kind of streaming. Basically you can handle it the same way as batch, right? It's not a big difference. Of course there are differences in how you deal with the data, but basically it's the same thing. That also goes back to what we said about the Data Mesh, these polyglot ports, right? It doesn't matter if it is batch or streaming, on a higher abstraction level it's basically the same. So streaming is a part of this, and streaming is getting more and more important, so it's flowing.

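What Christoph describes, streaming handled as a sequence of mini batches behind the same API as batch, is roughly what Spark Structured Streaming does. A minimal sketch, with the source and the trigger interval chosen arbitrarily for illustration:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("micro-batch-sketch").getOrCreate()

# A built-in test source that emits rows continuously.
events = spark.readStream.format("rate").option("rowsPerSecond", 100).load()

# The same DataFrame API as in batch processing.
counts = events.groupBy((col("value") % 10).alias("bucket")).count()

# Each trigger processes one micro batch, here roughly every 2 seconds.
query = (counts.writeStream
         .outputMode("complete")
         .format("console")
         .trigger(processingTime="2 seconds")
         .start())
query.awaitTermination()
```
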
01:23:12: Heinrich I have one story I want to tell about the mini batches, because I was doing stream processing in my former life and I was always annoyed by the micro batch topic, because it was like, no, I want to process the messages one at a time, why are you forcing me to take this batch? And then I built a stream processor which was based on message processing, and it was slow as fuck, because for every message, and you get millions, you had to traverse your call stack up and down and up and down and up and down, and I couldn't get this to run faster. So I was like, okay, at least for the offline use cases I need to build a completely different query engine, where everything is vectorized and I'm processing big chunks. And then I was like, how nice would it be if this were unified and I could operate on micro batches. This is why they did it this way. So it's like, okay, you have been there and you know that this kind of, I don't know, RabbitMQ style of individual message processing doesn't scale. So if you want something that's able to either do high volume or to do batch and online at the same time, you will end up at micro batches in some form, and this is actually what you probably want in the stream processing use cases. And probably you can tune it in a way that the micro batch is a single message, so if you really, really want to be super up to date, then you can just do that, right?

01:25:12: Christoph Then you have to use Kafka, right? There you can do this, that's specialized software where you have single events flowing around. If you really need this in your use case, then you would use these kinds of tools.

01:25:25: Heinrich Ja. And what about the integrations between stream and these batch systems, or this Lakehouse or warehouse thing? Is it really common that you basically have your pipe ending in the warehouse, or, yeah, probably that's kind of how it works, right?

01:25:49: Christoph At Data Bricks it's not a big difference, it all ends in Delta tables, right? That's the data format, and whether it is coming from streaming or from batch doesn't matter. At the end it's a table where the data is.

01:26:03: Heinrich Ja, okay, that makes sense.

01:26:06: Alex Alex, I know you look like you have questions. A lot. Um, I was just thinking, as Heinrich shared the story: it's like you said, okay, the industry is moving so fast, we're actually learning new things every day. And I'm trying to look at takeaways that, for example, our listeners can apply in their daily jobs, and I was wondering, Christoph, how do you personally decide which learnings you take into your decisions, and where you say, well, that is actually a learning that is outdated in the meantime?

01:27:15: Christoph I don't have an easy answer or an easy rule for this, to say, okay, this I'm going to read, this I don't look at. At the end it's what interests me, right? You need to be curious and stay curious in your career, in your job. That is the way in our careers, in our jobs: we have to continuously learn and be curious and learn new things, and this is also the exciting part of our job, right? What I really, really enjoy, and I take some time for it, I try to do this every day, is to really check the internet and X and the different platforms and see what is new, what is coming, and to try to read maybe one article a day, it's not always possible as a manager, but really to keep up. Also sometimes trying to play around with little Python scripts, right? Try things out, and in particular in Gen AI, with very simple, few lines of code, you can do amazing things, right? And this is also something I enjoy, I have to say.

01:28:57: Alex Couldn't agree more. Um, trying things out actually makes me understand whether my assumptions hold in terms of: is that a decision that is reversible or irreversible, which I usually take as the main criterion for what I have to put in place as governance. Um, I learned that on the one hand, there can be new technology that makes things reversible, so that I can actually take a decision more lightweight, or actually leave it to a team. If it's reversible, it's not a big problem. At the same time, reversibility is not always a technical thing, right? For example, when it comes to data governance, or actually governance in general, I understood that sometimes it's hard for people to reverse a decision. To say, well, actually this is not carved in stone and we can do it differently. And, um, I'm currently wondering, as you said, the industry is moving so fast: how do you approach your colleagues with such things of reversibility and technical evolution? Um, in terms of engineering culture, for example, at Data Bricks or in general when you're talking to colleagues in the industry.

01:27:51: Christoph Ja, that goes into company culture and how you process this. I think we are very happy at Data Bricks, we have a great culture, like in other tech companies. I think in general in the tech industry, I worked in mid-size companies in Germany, which was completely different, and I am very happy to be out of that, so to say, but I worked at SAP, Capgemini, Thoughtworks, now Data Bricks, and it's all the same. These are really smart people, I have to say, they are open-minded, already just by their job, right? And this kind of reversibility, Alex, I don't see as a big problem. They admit errors. They would not say, I said this architecture is the right way, I stick to this to the end. Nobody would do that, at least that's my experience. They would say, okay, that was my assumption, but I was wrong, so sorry for this. And you know, we have a very nice saying at Data Bricks, and also at Thoughtworks, the same thing: don't ask for permission, ask for forgiveness, and I think that's a very, very nice thing, do something, right? Because we move fast in the industry, you cannot wait for days until some manager gives you the green light, right? Do it, and if it was wrong, then correct it, fail, but fail fast, right?

01:29:25: Alex Exactly. And innovation is not possible without failure, that's basically it.

01:29:31: Christoph 100% exactly.

01:29:33: Sven So, um, when it comes to your personal experience, I was actually looking to tease you a little bit in terms of recommendations for software engineers out there. Let's assume there are software engineers building on top of Data Bricks or building LLM-based solutions. What would you say are emerging technologies and trends you would recommend them to look into?

01:30:15: Christoph Ja, in the LLM space, definitely at the moment agent-based systems. I mean, you maybe saw the great systems coming from China, I think Manus was the name, and DeepSeek, exactly. The next thing is building systems out of LLM components: one is doing the reasoning, one is doing the planning, they are specialized and you glue them together, that is a very, very exciting thing. Um, another very important thing in LLM development is actually fine-tuning, because you have to adapt the LLMs to a domain or to a company, we talked about this. It can be done with RAG, but there are certain limitations with RAG, we also talked about this, so people are also using fine-tuning, and in particular reinforcement learning, I think, is super, super promising. Um, DeepSeek used a lot of reinforcement learning, and also when you train an LLM on boundaries or guidelines, you do this with reinforcement learning. Um, maybe a last word on what you said, Alex, that you have to try things out, right? This is this grounding of the knowledge. For us humans, we know it's super important. We cannot just learn out of books and never try it, that will never work. You cannot learn to drive a car just from books, you have to actually drive it, and it's the same for software. LLMs are learning from the books, learning from the data they get, right? But there are new ways, for example giving them a sandbox where they can actually try coding, if they are a copilot for coding. And this, I hope, will let them go through this kind of ceiling they have today from the knowledge, and maybe even reach superhuman performance, that's the hope, right? So that, I think, is super, super exciting.

01:32:51: Sven I really like your excitement and engagement on that. Um, taking this a step further: assume you have a big company, for example in Germany, and people who are tasked with data governance. What would be your recommendation to the data governance team to allow such experimentation and innovation?

01:33:30: Christoph Data governance, at the end, is everything. If you don't have governance, you have anarchy, and as a company you cannot work with your data at the end. I would say it's like traffic: if you have a bigger city, you need roads, you need traffic rules, you need traffic lights, you need signs, otherwise you have a mess, you have chaos and you cannot get from A to B. Same with data in an organization: without rules and governance it will end in a mess, you don't find anything, you don't know what is allowed, you're doing things which are not allowed, etcetera. Um, so data governance today is one of the most important things in organizations. I would tell the team: you are at a very, very central point for the success of our organization.

01:34:33: Sven Thanks a lot, Christoph. Ja, um, talking about governance: without governance there is anarchy. The same applies to software architecture, ja? I mean, if everyone can do whatever they want, it's maybe not a catastrophe, but it's really bad. And in software architecture, I'm just citing Martin Fowler here, Martin Fowler once said governance is a scary word and you can see it in two different ways. One way would be, you know, hard rules: you have to do it, and if you don't comply with the rules, you get in trouble. The other one is guidance: we help you make decisions, it's like decision scaling, we have guard rails for you, but you have to use those guard rails. Is it the same in data, that you say governance is guard rails, maybe you can break them once in a while, but usually you should stick to them in 98% of the cases?

01:34:27: Christoph It depends on the reliability you want to have, right? For example, there are very strict rules for how you use personal information, and you cannot just say, okay, there are the 2% where I just look into the salary paycheck, you can't do this. Maybe it depends a bit, maybe there are areas where you say, okay, 80% is fine for me, right? And in particular if you go with AI: AI is capable of understanding data even if the data quality is not perfect. So there you could say, 80% of the address data is fine, that's good enough, perfect, right? But in other areas, look at banking, public services, governance there, you cannot allow this. So I think it really depends on the use cases.

01:35:11: Sven Yeah. I mean, in software architecture it is similar. If you have some sort of authentication, or also data protection rules, you could say it's guidance, but it's actually set-in-stone rules, yeah. It's just a nicer word for rules.

01:35:36: Christoph Maybe another nice example for this, and this brings us back to the very beginning of our talk, and it's a completely different domain, it's the music again. I told you that I'm playing jazz, and jazz is actually very interesting because it gives people the utmost freedom, you know, jazz is improvisation, but it's not that everybody can play whatever they want, that is cacophony, that is a catastrophe. So you have these kinds of rules: everybody in the band knows exactly what the form is, for example, what the harmonies are, so you have a structure, but inside the structure, everybody who is improvising has the freedom to express themselves in the moment. That's a great balance between regulation and freedom, right?

01:36:28: Sven Yeah. In the microservice world we have something called the macro architecture and the micro architecture, and the macro architecture basically defines the rules you have to apply, yeah. Right, right, right. Alex, Alex looks like he wants to say something about it.

01:36:56: Alex No, no, I was just reading a comment from Simon Hara earlier today. Okay. It's, uh, yeah.

01:37:04: Sven Read more.

01:37:05: Heinrich Yeah, I actually look at this more like a zoo, right? If your business is large enough, I think it's a trap to strive for too much order. You have very different demands and very different requirements and cultures in a very large org, and if you're trying to set rules for everybody, there's a good chance they will only fit for half of it. So I think, when building microservice governance and data governance, giving enough freedom for adjustment is something to consider, also with regards to innovation speed. I think we are at a time, in the LLM world at least, or the AI world, where speed to market and experimentation speed are the most important things you have to enable. And the biggest slowdown on a lot of these greenfield experimentation projects are governance rules that you have to satisfy before you can touch certain things, move certain things to production and so on. So I think a lot about how we can balance this. How can you allow this iteration, how can you give flexibility to different bits of the business, and at the same time you don't want the crazy, and if you don't have enough rules, nobody can talk to each other, so you get crazy slow because of incompatibilities.

01:39:07: Sven That's exactly what I was looking at.

01:39:08: Christoph Exactly what you're saying, Heinrich. As you said in the beginning, you cannot regulate everything and every detail. In a Data Mesh there is another very important principle, what we call federated governance. And that is like a state, like a democracy: you don't manage everything on the same level. You have some fundamental things you manage on the top level, which are valid for everybody, but on the lower levels you manage things in a different granularity, in more detail, like in a state or in a city or in a household, right? You manage different things on different levels.

01:39:52: Sven I love that picture. Yeah. You manage the general interfaces, how things are supposed to play together, and the individual parts have

01:40:04: Heinrich You manage on a high level, but the details, how you move bits and bytes and what it means, maybe on a domain level, right? Or even on a data product level.

01:40:16: Sven Getting back to your picture, Heinrich, I was actually looking for those data governance rules where you say: better not mix the lions and the giraffes, because it won't play out well. Yeah, or maybe cultural mixture, I don't know. Are there any such rules, Christoph, where you say, well, that's never a great idea?

01:40:55: Christoph Probably. Nothing comes to my mind at the moment, it's hard not to be distracted by the picture of the lions and the giraffes. The human rights of data, probably, right? Um, there are probably some principles you should never violate, but I don't know them at the moment, to be honest, right?

01:41:27: Sven Okay. No, I need to laugh, because we were discussing architecture principles the other day and I referred to the Charter of Human Rights in terms of things you should never do with data. Yeah, there are some parallels maybe. But looking at architecture principles: in software architecture I've never heard the term federated architecture governance, but we have subsidiarity, I hope I say it correctly, subsidiarity, I think that's the correct pronunciation, and it means exactly that: you have rules on different levels. You always want to make the decisions on the lowest possible level, that's the idea, but, you know, some decisions have to be made at the top, yeah.

01:41:00: Heinrich Yeah. Now is a great time to jump into DuckDB and get your take on local data processing.

01:41:11: Christoph Ja, ja, so I'm not an expert on DuckDB, but as you said, it's a completely different paradigm which is coming up here, and it's a great product, from all I saw it's amazing, also the performance, the speed, and how it was developed, it's great. I see the impact in the industry and with other people, and as you said, it's very local, it's on a single server, it's even on your laptop. It's a completely different paradigm: you're not running on a cloud server somewhere, centralized with a lot of users, you have your own stuff, and you can save money with this. I think the biggest challenge is probably again the governance, how do you manage that, right? It's kind of bring your own device, or bring your own database, or your own DuckDB, but still, in an organization it has to be managed in some way. What is allowed, what is not allowed, where do you get your data from, where do you store your data, right? Is it just on your laptop, and then your laptop gets stolen? Do you have to, I don't know, safeguard it somewhere, versioning, so there are millions of questions which I don't know how to solve. I mean, when Heinrich mentioned DuckDB first, did we clarify briefly what DuckDB is?

01:42:37: Heinrich Yeah, I think we should do that. My experience with it is really as a CLI tool and as a Python library. For a long time, the question was: how do you have your source data available, right? It's a bunch of CSV, JSON or parquet files living in folders. And for a long time I've been looking for a tool where I can just say, treat this as a database and let me load this stuff and run SQL queries on top of it. And this is exactly what DuckDB allows you to do. So instead of loading all those things into memory, then putting them into a SQLite with an ETL step and then doing the SQL, you can import them as virtual tables, or they are just automatically discovered and loaded into a namespace. So you directly get a SQL-like query interface to stuff that lies scattered on the file system. And from that point of view it's kind of a local toy, right? It's a command line tool that allows you to be productive, it can replace some of the Pandas magic you might do in Python, also without loading stuff into memory, so it has performance aspects to it. But it also reminded me of systems like Presto SQL that are a federated query layer on top of different data stores, where you just query all of this. And I think I read some Hacker News articles saying that it has a trajectory in the data architecture space, where you would use the tool or its ideas to cross over, but maybe it isn't really there yet; the basic idea behind the tool is as I just described.

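For readers who have not used it, the workflow Heinrich describes looks roughly like this in the DuckDB Python API; the file names and column names are placeholders, not from the conversation.

```python
import duckdb

# DuckDB can treat files on disk as tables directly, no ETL or server needed.
con = duckdb.connect()          # in-memory database; pass a path to persist it

result = con.execute("""
    SELECT country, count(*) AS orders, sum(amount) AS revenue
    FROM 'data/orders/*.parquet'                 -- parquet files scattered in a folder
    JOIN read_csv_auto('data/customers.csv') USING (customer_id)
    GROUP BY country
    ORDER BY revenue DESC
""").df()                        # results come back as a Pandas DataFrame

print(result)
```
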
01:44:43: Sven And I think the biggest advantage is that it's very simple and easy and fast, right? And of course end users might ask, why do I have to build a data product and use these standardized environments of my organization and go through all this hassle until I get my results, why not just dump the data on my laptop and do my DuckDB magic and Pandas, an easy little Python script, and I'm there? Much faster, easier, more flexible. But it also has a lot of challenges. We talked about data lineage, right? Data governance, all this stuff, can you still ensure that?

01:45:34: Unknown Nothing.

01:45:36: Sven Yeah, it's a different approach. I would say it's a paradigm shift, right?

01:45:45: Heinrich Yeah, I don't know if we want to explore this avenue, but one striking experience was just lately crunching some data, doing that in Presto and querying a data lake: a quick query took like three minutes, and the entire data set I was working with was maybe 10 gigabytes. So pulling that down to the file system first and then working with DuckDB was just orders of magnitude faster, and the queries actually look the same, just with minor translations, right? And also, building applications on the query side and the reporting side on top of the data lake always ran into this problem that the query on the data lake is actually too slow, so you need a hot version of that data set that lives somewhere. And I don't claim I'm an expert, but this is a wall I'm running against often, and I know that there are data reporting systems that have some hot loading capabilities, but this is something where I could see a system like DuckDB shine. It's not out-of-the-box functionality, but I could imagine a system that does the persistent caching for you in a similar way, yeah.

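The "hot local copy" idea Heinrich sketches could, under the assumptions above, look like this: materialise the slow lake query once into a persistent local DuckDB file and serve the interactive reporting queries from there. Paths and table names are made up, and the lake path stands in for wherever the parquet files actually live.

```python
import duckdb

# One-off step: materialise the (slow) lake query into a local DuckDB file.
con = duckdb.connect("reporting_cache.duckdb")
con.execute("""
    CREATE OR REPLACE TABLE daily_orders AS
    SELECT order_date, country, sum(amount) AS revenue
    FROM '/mnt/data-lake/orders/*.parquet'   -- slow remote scan, done once
    GROUP BY order_date, country
""")

# Afterwards: interactive queries hit the local cache and return quickly.
top = con.execute("""
    SELECT country, sum(revenue) AS revenue
    FROM daily_orders
    GROUP BY country
    ORDER BY revenue DESC
    LIMIT 5
""").df()
print(top)
```
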
01:47:20: Sven Or for departments who are doing innovation, who are moving very fast, trying things out with data, maybe that is the right tool for them to be super flexible, super fast, but maybe not for production-grade data management. Is there something in Data Bricks that allows you to cache data sets from the data lake for local processing?

01:47:43: Christoph Uh, as far as I know, no, but we even work with DuckDB, I think we have an interface, you can even integrate it into the Unity Catalog, but I don't know the details about it.

01:47:55: Sven Okay. All right. I think we had more than 100 very interesting minutes. Christoph, thank you very much for taking the time and answering all the questions.

01:48:37: Heinrich Thanks a lot, Christoph.

01:48:39: Alex It was really great.

01:48:41: Christoph Thank you.

01:48:42: Alex Enjoyed it very much.

01:48:44: Christoph That is my secret: instead of reading papers, I just ask Christoph all the questions. It's a way. I really enjoyed the conversation with you guys, and I'm a big fan of this CaSE podcast, so I'm very, very happy to also be part of it now. Thank you for your time and for setting this up and organizing it, it was a pleasure talking to you, thank you.

01:49:10: Sven Ja, and maybe one more thing, not a question, just a conversation I had with Christoph: we both live in Cologne, so we met for lunch and talked about this podcast. And Christoph also said he can come to yet another episode later this year where we talk about interesting AI use cases, which we didn't touch that much today. Then let's see what we can do in the second or third quarter. Maybe then as an onsite recording? Sorry? I said maybe as an onsite recording, because we were talking about having an onsite recording once, maybe.

01:50:10: Christoph I'm open to that, absolutely. And I'm happy to also talk about this topic, you know, that's really where my heart beats. And hey, at Data Bricks of course we see a lot of examples of how customers are now using AI use cases, what they are doing and what the status is, and as I said, the development is so fast, it's crazy, right? It's very, very fascinating.

01:50:38: Sven All right, then I would say, thank you very much and see you next time on exciting AI use cases.

01:50:48: Christoph Thank you very much for having me. Have a nice day.

01:50:50: Sven And also for all the listeners, thank you for making it so far. Thank you, hope you enjoyed the show. Bye.

01:51:03: Christoph Bye.