Transcript
00:01:09: Sven Welcome to a new episode of the Case Podcast. Today our guest is Simon Harrer, CEO and co-founder of Entropy Data, a recently founded company.
00:01:23: Sven Besides that, there is a long list, a man of many talents, so to speak: PhD in computer science, author of Java by Comparison, a book by the Pragmatic Programmers, maintainer and co-author of mob.sh, a tool for mob programming and ensemble programming, author of the GitOps primer, translator of the Data Mesh book. Many things.
00:01:54: Sven Welcome, Simon.
00:01:57: Simon Yeah, thanks. Thanks for having me.
00:02:00: Sven And also Alex and Heinrich. How are you doing, guys?
00:02:06: Heinrich It's very hot in here.
00:02:09: Heinrich Otherwise, great, yeah.
00:02:09: Sven I'm burning a lot of energy here on my air conditioning.
00:02:14: Heinrich Yeah.
00:02:15: Sven No,
00:02:16: Alex Nice. I hope it's coming from solar energy. Okay.
00:02:21: Sven No, no.
00:02:22: Heinrich yeah
00:02:22: Alex Yeah.
00:02:23: Sven But actually, at least it's green power, I must say.
00:02:24: Alex to him
00:02:30: Sven Yeah.
00:02:30: Alex okay
00:02:30: Heinrich Diving straight into our first agenda topic: Sven's homelab setup.
00:02:33: Sven Yeah.
00:02:35: Alex yeah
00:02:36: Sven Yeah. But maybe a little anecdote to warm up with, about the heat. A few colleagues of mine and I translated a book by Gregor Hohpe, Cloud Strategy, and Gregor lives in Singapore.
00:02:55: Heinrich Hmm.
00:02:56: Sven We started the translation one year ago. One of my colleagues said, ah, it's so hot here under the roof. And Gregor was like, yeah, good morning from Singapore, we had 28 degrees at 8 o'clock in the morning.
00:03:14: Sven And then my colleague said, yeah, but you have air conditioning. And Gregor was like, I mean, you cannot run the air conditioning all the time. You just have to get used to the heat.
00:03:26: Sven it's
00:03:29: Alex Okay.
00:03:31: Sven All right. So let's dive into the topic. I already said it: Entropy Data. Maybe you can say a few words about the company. What are you guys doing?
00:03:45: Simon We just founded it in August 2025, together with Jochen Christ. It's a spin-off from InnoQ. And it markets the Data Mesh Manager, a data marketplace built on data products and data contracts.
00:04:04: Simon And that's really helpful to really large companies
00:04:09: Simon Because it's an internal data marketplace where teams can share data with other teams, in a way that builds trust, and build cool things on top of that data.
00:04:21: Alex It sounds really interesting.
00:04:21: Simon And it's really an interesting situation right now. We did this step because the product grew at InnoQ.
00:04:33: Simon And we already have a lot of customers.
00:04:38: Simon We are not VC funded, we are bootstrapped, we are break even, and we really want to grow now.
00:04:50: Simon And it feels like the right time. Data becomes very important with the rise of AI, and we have so many inbound leads. So it's a really interesting topic at the moment.
00:05:03: Alex So Simon, can you give some career advice? How do you go from Java programmer to data, and to CEO of a data company?
00:05:14: Simon For me, it was a long, long way, so to say.
00:05:15: Alex What was your story?
00:05:19: Simon But maybe I share that story because I think it might be interesting. I'm a software engineer at heart. And I was at InnoQ doing a consultancy project for Breuninger where, as part of their e-commerce system, we built the distributed order management.
00:05:44: Simon It was part of this InnoQ team. We did mob programming, as you already mentioned. It was a really fun time there. And at some point while we worked there, something happened that had never happened before in my life.
00:06:02: Simon The backlog went dry.
00:06:04: Heinrich Huh.
00:06:05: Simon And at the same time, people from their data team said, hey, we've got BigQuery. Who wants to try it out? And so we said, oh, well, actually, we don't have too much to do at the moment.
00:06:19: Simon Let's try it out. So we pumped the data from the operational systems we had built, self-contained systems, from all of them into one BigQuery project.
00:06:35: Simon And we started doing analytical queries on all of that data. And that's how we transitioned into a hypothesis-driven product team, by the book.
00:06:49: Simon So we really evaluated all the user stories that we got, or the ideas we had, with data first. And then we decided whether it's sensible to really follow up and build the feature.
00:07:03: Simon And after we built the features, we could really evaluate whether they had the impact we predicted. That was really my way into the data world, because in the end, to point a little to the future, we were basically the small domain team that not only had ownership of the operational data, but also of the analytical data, the whole end-to-end ownership.
00:07:33: Alex Mm-hmm.
00:07:34: Simon One could say this is kind of the idea of data mesh. BigQuery was our self-service data platform. We were the domain team. Governance was not really there too much, and we didn't share data products on yet.
00:07:51: Simon But in the end, we had this first step of a bottom-up approach to data mesh, I would say.
00:07:58: Alex Would you advocate for that? I can imagine it's quite a stretch, because usually the people consuming analytics data are different stakeholders than those interested in product management and bringing features to customers, aren't they?
00:08:23: Simon There are some people that use the data in a different way; maybe you have to do some reporting. But as a product team, you need your own data and also data from other teams to make your product better.
00:08:35: Alex Mm-hmm.
00:08:37: Simon And the best way to do that is using an analytics platform, because then you can crunch a lot of data and combine it very easily.
00:08:48: Simon I think that's also the easiest way to get the culture in, because how do you convince software people to share data with others?
00:09:05: Simon The best way is that you need data from others as well. And so you will try to be a good citizen in that ecosystem, because you know you need something from the others as well.
00:09:11: Alex Okay.
00:09:18: Simon And so you try to be helpful as well.
00:09:24: Sven Yeah.
00:09:27: Sven I'm not very good, I assume, at analyzing data, at least not as good as, let's say, a data scientist. Before we dive into the marketplace and data products and stuff like that, maybe one further question.
00:09:47: Sven How should a team do that? Should I build up this expertise, becoming an analytics expert? Or is it that, in the sense of Team Topologies, there is an enabling team which supports me with the knowledge so that I'm able to analyze data in a certain way?
00:10:16: Simon The bottleneck is always domain knowledge. So it's not the knowledge of how to do data analysis; it's more that you have to have the domain knowledge, and then you can be supported to do the analysis on your own.
00:10:35: Simon And so it's much easier to get people from an enabling team into your team so they help you learn these analysis skills.
00:10:46: Sven Okay, okay.
00:10:47: Simon Rather than the other way around. We tried the other way around: we put all our great data scientists in one team, and then they had to acquire the domain knowledge from all around the company.
00:10:53: Sven Yeah, yeah, yeah.
00:10:59: Simon And that created a much, much more problematic bottleneck, I would say.
00:11:04: Sven Yeah, exactly.
00:11:05: Simon And this is exactly the situation that companies are often in, and they don't want that anymore.
00:11:09: Heinrich simple
00:11:11: Simon They want to really move to a domain-oriented architecture; domain-driven design also follows that approach a little bit. And that's also where the enabling team and the platform team come in.
00:11:27: Simon Because both help: having a good, easy-to-use self-service platform also helps these domain teams, making it easier for them to do the analysis, to share data, or to use other teams' data.
00:11:40: Sven Okay.
00:11:44: Simon So, yeah, I would say yes.
00:11:47: Sven Okay, okay. Yeah.
00:11:52: Sven All right.
00:11:55: Sven Good. So let's start with the data marketplace, then go to data products, and then finally to data contracts.
00:12:08: Sven So what is a data marketplace? I think you wrote an article about it where you compare it to a data catalog. I must say we had an episode with Christoph Windhäuser on data architecture, and he said the data catalog is, so to speak, the heart of data architecture.
00:12:40: Sven So please explain what a marketplace is and how it compares to a data catalog.
00:12:49: Simon To me, most modern, basically all modern, data approaches go towards this federated approach. You have federated, decentralized teams that own their data,
00:13:05: Simon and they have to share it on. But for that to work, you need a central place where everything comes together, where we can monitor, govern, and manage what one team owns and shares and who is using it, and so on and so forth.
00:13:24: Simon You have to coordinate that somehow. And the data marketplace, to me, is that system. This is where everybody meets: the owner of data who shares data on,
00:13:37: Simon the consumer who can look for data, can request access, and then use the data. We can also ensure trust. Think of an entry in the marketplace where we see, oh, these are the guarantees around the data: we are guaranteed this data schema,
00:13:55: Simon we are guaranteed that this column contains at most 5% null values. And we also see the green check mark next to it which says, OK, this is automatically checked.
00:14:09: Simon And governance is at the marketplace as well, because they want to ensure that only data that follows certain rules, certain policies, is shared, and that it's shared in a correct way that doesn't violate GDPR or other constraints.
00:14:30: Simon And the platform is interested as well, because you can automate a lot around that. Maybe someone requests access to data and it gets approved; then we need to automatically make sure that those people really have access where the data actually is, in some buckets or SQL tables and so on.
00:14:51: Simon So everybody kind of meets in this marketplace. And that's why I think this is such an essential component. The details page in our data marketplace, as I see it, should be the data product, and the guarantees around the data I'm sharing should be captured in the data contract. With that, this is what I think works. And what we had previously, what was classified as a data catalog,
00:15:30: Simon often did not work. They were very expensive, and they had this functionality of scraping the whole company. Every SQL table was scraped and put in the data catalog.
00:15:45: Simon Then governance said, OK, let's assign ownership, and then said, you own it now, you have to document this now, and you are responsible.
00:15:49: Heinrich Yeah.
00:15:55: Alex And keep it updated, right?
00:15:56: Simon That did not work. No one had an incentive to actually keep that up to date. And there was a lot of scraped metadata from tables that were not meant for sharing.
00:16:11: Simon They were just internal tables.
00:16:12: Heinrich yeah
00:16:13: Simon So I've never experienced a company that said, we are proud of our data catalog, and it brings the value that we expected.
00:16:25: Simon And it has a good ROI. I have never heard that.
00:16:27: Heinrich It's like just looking at all your GitHub issues or your Jira issues, right?
00:16:27: Simon Sorry.
00:16:32: Heinrich Nobody is ever proud of the whole collection. It really matters that the things that are important move forward, right?
00:16:41: Alex But isn't that, I mean, that's obviously a cultural topic. Isn't that related to what you said in the beginning? You said, hey, let's be a good citizen because we depend on the others, so let's be a role model and deliver what we expect from others.
00:16:41: Simon yeah
00:16:42: Heinrich Yeah.
00:17:01: Alex Is that what you want to sort of formalize in a data marketplace?
00:17:06: Simon In some way, yes. For me, it's more this: I just put the data I want to share in the marketplace, and this is curated, and this is a
00:17:19: Heinrich Yeah.
00:17:23: Simon step I take, I don't know the English word, bewusst, consciously.
00:17:28: Alex Conscious.
00:17:28: Sven How do you say that?
00:17:28: Simon Yeah. And I put this there, I own it, and I also add guarantees to that.
00:17:36: Alex Mm-hmm. Mm-hmm.
00:17:37: Simon And this is the other way around. I'm totally against automatically creating data products, because this has to be a step where I say: I do this. You can help with a lot of tools to make it easy to do, of course, but it should be a conscious step.
00:17:54: Heinrich Yep.
00:17:57: Alex Would it help maybe? Let's just assume listeners have heard about the term data products, but they are used to, let's say, classical API publishing. So having an API spec that I publish, and maybe people know where the different instances of my application reside, what the URL is, and where they can find that API.
00:18:23: Alex How would you compare that to a data product being published in a data marketplace?
00:18:29: Simon Think of a data product like a microservice or self-contained system, and the data contracts are basically the API specification.
00:18:40: Alex Is it just an OpenAPI 3.0 spec or is there more to it?
00:18:46: Simon Yeah, so the data contracts come with their own YAML specification.
00:18:53: Alex Mm-hmm.
00:18:53: Simon I think the most popular one is the Open Data Contract Standard from the Linux Foundation. And to clarify, I'm also part of the Technical Steering Committee driving this open standard.
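For illustration, here is a minimal sketch of what such a YAML data contract could look like. The field names are simplified and may not match the Open Data Contract Standard exactly, and the dataset and column names are made up; consult the standard itself for the authoritative schema.

```yaml
# Illustrative sketch only: simplified field names, hypothetical dataset.
# See the Open Data Contract Standard for the exact, authoritative schema.
kind: DataContract
id: shipments                  # hypothetical identifier
version: 1.0.0
status: active
description: Current state of all shipments, updated continuously.
schema:
  - name: shipments
    properties:
      - name: shipment_id
        logicalType: string
        required: true
        unique: true
      - name: carrier
        logicalType: string
      - name: delivered_at
        logicalType: date
        quality:
          - description: At most 5% of rows may have a null delivered_at
```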
00:19:07: Alex Okay, that's probably something that we should dive into deeper. If I look into Sven's face, that's a topic where we have a lot of questions, right?
00:19:16: Sven Yeah, exactly.
00:19:17: Alex Let's stick to the to the data products for a second.
00:19:17: Sven Yeah.
00:19:20: Alex So you said a data product is like a microservice.
00:19:24: Simon Mm-hmm. Mm-hmm.
00:19:24: Alex The thing that I have in mind is: when I publish an API, I'm usually very careful with what I publish, because I'm aware that things I published once are hard to take back.
00:19:40: Alex So I'm actually never publishing all the data that I have. At the same time, I would like my colleagues to know which data I have so they could ask me for it.
00:19:52: Alex So they know, okay, the data is available somewhere in the company, and Alex and his team have it. So how would I do that?
00:20:03: Simon Yeah, so what we would recommend is specifying a proposed data product with a data contract. The data contract is also in the proposed state, and you can describe your data model there.
00:20:20: Alex Cool.
00:20:21: Simon But it's not yet published. It's not yet implemented. No one can build something on it. You can just say, hey, this is what I own.
00:20:27: Alex So.
00:20:30: Simon This is what I could share.
00:20:30: Alex Cool. Yeah.
00:20:33: Simon um
00:20:33: Sven Mm-hmm.
00:20:35: Alex So it's a proposal not only that I have this data and could share it, but also a proposal for how I would want to share it.
00:20:35: Simon that's
00:20:42: Alex Cool.
00:20:42: Simon Both ways would work. I think the discussion will then start, and you can define the interface, perhaps even together with your initial customer, while still thinking about how it could be helpful to many others, because the idea with a data product is still sharing data so many can use it.
00:20:55: Alex Mm-hmm.
00:21:07: Alex Yeah, at the same time, I'm a big fan of the statement: you have to pay twice before it's cheap, right? Only after the third time someone asks for it do I create this general abstraction and a general interface.
00:21:17: Simon which
00:21:25: Simon Yeah, but when you come more from the software engineering side and you have your operational system and want to share that data, you might already have a good API from your operational system.
00:21:37: Alex Mm-hmm.
00:21:43: Simon Think of the aggregate that you share in your operational system, speaking in domain-driven design terms.
00:21:43: Alex Yeah. Yeah.
00:21:53: Simon And you might have a feed implementation or a REST API where you can look at the current state of your aggregate. That's probably very interesting to data people.
00:22:06: Simon And having this table where you see the current snapshot of this aggregate, and also how it evolves over time, that's already a good first guess.
00:22:23: Alex It's interesting. I'm seeing a rabbit hole there, and I'm wondering whether we should go into that in terms of API design, because most of the APIs that I see let me query the current state, rather than being APIs that emit the events that occurred within the system.
00:22:43: Alex That's only very seldom seen, at least in my environment and the systems that I see, unfortunately, I have to say.
00:22:55: Simon Just one quick comment on that, because I also find this very interesting. In the data world, what you don't want to do is re-implement business logic you already have implemented in an operational system.
00:23:10: Alex Yeah.
00:23:10: Simon So you don't want to
00:23:11: Alex Agreed.
00:23:13: Simon interpret events and determine the state again. Maybe for simple business logic that works, it's very stable. But with more complex business logic, you might end up with wrong data and wrong conclusions. And that's horrible. We don't want to do that.
00:23:29: Alex Yeah.
00:23:29: Simon So there's no alternative to just getting the aggregate states, because otherwise you have to re-implement the business logic that leads to the current state.
00:23:44: Simon Otherwise it would basically be event sourcing, re-implemented in the data platform.
00:23:44: Alex I. Yeah.
00:23:46: Alex i
00:23:49: Simon Not sure you want to do that.
00:23:51: Alex I couldn't agree more, Simon.
00:23:52: Sven and
00:23:53: Alex In the end, that's coming back to the statement that you phrased earlier: domain knowledge is always the bottleneck, right? And only those with the domain knowledge know how to properly aggregate their domain's events.
00:24:07: Alex So the point that I wanted to make is basically that the people with the domain knowledge should be the ones understanding the questions of the consumer, no matter whether it's an operational API or a data analytics API that I'm specifying.
00:24:29: Sven Can we come up with an example? Let's say an example data product.
00:24:43: Simon I can tell one: one of our clients gave a talk and told about their most important data product, and I think that's an interesting story.
00:24:56: Simon It's a large logistics company, and their most important data product was the shipments data product.
00:25:06: Simon Because it's a large logistics company, they have shipments for road,
00:25:08: Heinrich yeah.
00:25:12: Heinrich Yeah.
00:25:12: Simon sea, air, water, I don't know, and they have several systems for each of those.
00:25:15: Sven Mm-hmm.
00:25:16: Heinrich Yeah.
00:25:18: Simon And it's very complicated, and it's very hard to get the data. And what they did is they created this shipments data product. And it has four different output ports, so the same data,
00:25:34: Simon but it's available through a REST API, a Kafka topic, as SQL tables on Snowflake, and, I think, as JSON files on AWS S3.
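Sketched in the same spirit, a data product like this might be described roughly as follows. The YAML structure and field names are illustrative, not taken from any particular data product specification, and the team and port names are hypothetical.

```yaml
# Illustrative sketch of one data product exposing the same data via several output ports.
# Structure and names are hypothetical, not an exact rendering of a specific specification.
name: shipments
owner: shipment-domain-team        # hypothetical owning team
description: All shipments of the company, fed by the road, sea, and air systems.
outputPorts:
  - name: shipments-rest
    type: rest-api
    dataContractId: shipments      # the same contract can describe every port
  - name: shipments-kafka
    type: kafka-topic
    dataContractId: shipments
  - name: shipments-snowflake
    type: snowflake-tables
    dataContractId: shipments
  - name: shipments-s3-json
    type: s3-json-files
    dataContractId: shipments
```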
00:25:46: Sven So the data product shipments, how would I imagine it? Is it that we have a description of this product and we say, hey, we offer data about all kinds of shipments we have?
00:26:04: Simon Yeah, of all the shipments of the company, regardless of where they are right now.
00:26:04: Sven And
00:26:10: Simon You get the whole picture, you can always track the thing, and
00:26:18: Sven Okay. Yeah. Yeah.
00:26:19: Simon it's fed by a lot of systems. It's a clean interface. It doesn't have all the details, but it's so valuable that it's the most popular thing, and they told in the
00:26:27: Sven yeah
00:26:35: Simon talk that it enabled a lot of interesting new use cases.
00:26:43: Heinrich I would add two examples here. For the first one, I would also think about companies selling data sets. Here the term data product is just very literally true. I think a couple of the colleagues we met at the InnoQ roundtable are actually in that business of really selling data sets to others. And you can imagine different channels through which you make the data available, but as a vendor you are very directly on the hook for the quality and availability of the data. And there are certain SLAs with regards to freshness that you guarantee to your customers.
00:27:16: Heinrich And then, in an e-commerce context, I think there is the user data product, which I would consider extremely important: knowing who your users are.
00:27:28: Heinrich Traditionally that may be just a SQL table, but in federated architectures this gets a lot more subtle. There are many, many different systems that build up an opinion on what a user is and what attributes are associated with them, and there's a lot of duplication.
00:27:37: Sven Yeah.
00:27:46: Heinrich So having an authoritative source and a team that is accountable for having up-to-date user data and providing it to everybody else, that is kind of
00:27:57: Heinrich an important asset. Similarly, sales orders, right? That's also something that a lot of systems touch. And then you also have data sets or data products that are more internal, for example, on the platform.
00:28:12: Heinrich We do have data sets about all the employees that we have. We have data sets about all applications. And here too, there are different ways to query the data and different ways to view the data.
00:28:28: Heinrich But I think these governance and accountability structures are quite important: that there is somebody you can go to who will take ownership, and if there are issues in the data or in the way it is made available,
00:28:45: Heinrich that they will take ownership of this, fix them, and also document the data set in a way that others can discover it and leverage it in a self-service manner.
00:28:57: Heinrich So you don't have to go to people or read source code in order to make use of a data set. That's how I see it, Simon. Does that resonate with you, or would you put the accents differently?
00:29:15: Simon Yeah, I would fully agree. Maybe we can use the classification of data products from Zhamak Dehghani's book, because what we mostly talked about were the aggregated data products, data products
00:29:30: Heinrich and
00:29:32: Simon that are fed from many different systems, where we have created a unified data model. They have a lot of value in the company, and there are a lot of different use cases they could apply to.
00:29:46: Simon But they are kind of the middle layer. On the bottom layer, we have the source-aligned data products, which offer data mostly from one source system, and their data model is heavily linked to the model of the source system.
00:30:03: Heinrich Yeah.
00:30:03: Simon They're very dependent. When the source system changes, they can't really do much; they basically have to follow the source model.
00:30:13: Simon And that's the lower part.
00:30:14: Sven and
00:30:18: Simon Then we touched the middle layer, the aggregated layer. And then we have the more consumer-aligned data products. This is where we produce maybe a data mart for a specific set of reports or for a specific team.
00:30:34: Simon So we have a limited number of people we produce data for.
00:30:37: Heinrich yeah
00:30:38: Simon up four
00:30:39: Sven know
00:30:40: Simon And those are the types. That might help.
00:30:42: Sven What is a data mart?
00:30:46: Simon Think of like a small data warehouse with a specific group of people in mind.
00:30:49: Sven Okay. Hmm. Okay. Okay.
00:30:53: Simon um
00:30:55: Heinrich I mean, aren't management stakeholders a typical example here? You have management reports that you're producing, and they are obviously very important, but only a handful of people read them.
00:31:05: Heinrich Yeah.
00:31:06: Simon Yes. Think of a data mart for finance data that's just used for creating finance reports.
00:31:17: Simon And the underlying data for those reports should also be captured in, in this case, a consumer-aligned data product.
00:31:30: Sven All right. Yeah, wonderful. Actually, more than 10 years ago, I think I worked on a data product. No, actually, we sold data. We aggregated data, but there were no terms for it, or at least we didn't know any terms or structure. We just did stuff and sold data.
00:31:56: Sven Yeah. Yeah.
00:31:59: Sven So now we had the data marketplace, and we now know what a data product is. Now, eventually, we come to the data contract.
00:32:11: Sven So what is a data contract?
00:32:16: Simon Let me just say, before we come to that, one more word about the data product, because this gives us the mindset. With the data product, the core idea is to share data so that it's good for my consumers. I share in a way that's easy to use and perfect for my consumers.
00:32:44: Simon That's the core idea, the product-thinking idea of a data product.
00:32:50: Sven Hmm.
00:32:51: Simon I have the consumer in mind. I want to help them. I want to build it so they can easily use it. Maybe I use the technologies that are good for my consumers. I create a data model that is good for my consumers, so that they can benefit.
00:33:10: Alex But then again, Simon, what is the difference to regular API specification?
00:33:10: Sven but
00:33:16: Alex Because I would assume that when people design an API, they also have the consumers in mind and they make some assumptions on how it will be interacted with, even though they don't specify that.
00:33:24: Sven yeah
00:33:30: Alex Yeah.
00:33:31: Simon Yeah, nothing. Absolutely nothing. The only thing is that in the data world, we had this situation where people just dumped data somewhere and thought it was valuable,
00:33:47: Alex An S3 bucket, internet readable.
00:33:48: Simon and
00:33:49: Alex Great idea.
00:33:50: Simon Well, not internet readable, more enterprise readable. You just dumped it there and thought it was valuable as an asset. That's how we thought about a data lake: we have to fish for the gold in there.
00:34:03: Simon And sometimes some person will go there and find the gold and everybody will be happy.
00:34:06: Alex Interesting.
00:34:10: Simon And this did not work.
00:34:10: Alex What's that phrase again? Is it a piece of art or is it a trash dump?
00:34:15: Simon It was a trash dump. Absolutely.
00:34:17: Alex Mm-hmm. Yeah.
00:34:18: Sven Yeah, but on the other side, I mean, with one of my current customers, which is a large enterprise IT organization, we also deal with a lot of APIs, and "no one" is a bit too extreme, but lots of those teams who offer the API,
00:34:18: Simon It was a costly one.
00:34:45: Sven we have to explain to them that they have to talk to us. We are their customers. They don't give a shit, you know; they just build something and think that's probably useful.
00:34:56: Alex Yeah.
00:34:57: Sven So I would assume in the data world it's the same: you see your data product as something for which you want to have users. You offer a product. But probably
00:35:12: Sven the product thinking is not there in most of the teams, I would say, the development teams.
00:35:16: Alex Well, it depends, Sven. I would argue, from a domain-driven design perspective, you are a conformist, or at least the producer thinks you are a conformist.
00:35:27: Sven Exactly, exactly. But I am not, yeah? Because, no, I mean, even if I'm the... They don't even think...
00:35:39: Sven Probably we have to explain the terms a bit. So domain-driven design, let's say strategic design, talks about different types of relationships.
00:35:41: Alex Mm-hmm.
00:35:46: Alex Strategic design knows relationship patterns.
00:35:51: Sven So you can be a conformist. You have to eat what your supplier gives you. But even...
00:35:56: Alex Because you don't have a connection or relationship to them through which you could negotiate the interface. You cannot negotiate, so you just have to accept.
00:36:01: Sven Exactly. Which is fine, you know, if you're Google or AWS: you have so many clients, you have to come up with something. And I mean, if I say I want the interface, the API, to be different, they don't care about me.
00:36:11: Alex Mm-hmm.
00:36:16: Alex Yeah.
00:36:18: Sven But in another setting, I'd rather want another relationship. I want people to talk to me, and also to possible other consumers.
00:36:30: Sven And I see that as a problem with normal APIs. And I would also see that with data products, that...
00:36:41: Alex Well, the point for me being, and Simon, please correct me, what I understood from data contracts is that they make a relationship more reliable, because each side knows what to expect from the other.
00:36:58: Sven Data contracts, are you saying?
00:36:59: Alex Data contracts, yes.
00:37:00: Sven ah We are not at the data contracts yet, are we?
00:37:00: Simon yeah
00:37:03: Alex Well, wait, the point here is reliability in terms of: I know what to expect. The point that I'm trying to make is that I only know very few people who wrap their head around strategic design and intentionally designing such relationships.
00:37:22: Alex So you say, well, I'm not a conformist, but obviously the political relationships at your client say, well, you are, because otherwise the team that is offering the data, so the supplier, would need to talk to you.
00:37:42: Alex And then you would have a customer-supplier relationship.
00:37:48: Alex So if you had political power, if your team had political power, you could call them into a customer-supplier relationship and say, I want to negotiate with you, because I have the political power to force you into that.
00:38:00: Sven Yeah, exactly, exactly. Actually, to be honest, that's true. I mean, this is happening. They see me as a conformist, and I say, no, we are a customer and you are our supplier.
00:38:11: Alex Yeah.
00:38:12: Simon Mm-hmm.
00:38:15: Sven And also that team and that team and that team: please talk to us.
00:38:19: Alex yeah
00:38:20: Sven um
00:38:21: Alex And the interesting part for me is that
00:38:23: Sven And they do, they do. I must admit, most of them do. Right. So you can change that relationship. Sorry.
00:38:30: Alex I mean, the interesting part here is, obviously, that this is pretty frustrating for people who have to be conformists and have to, so to say, accept what is offered to them.
00:38:30: Sven Go ahead.
00:38:42: Alex If it doesn't fit, then you have a problem and you cannot fulfill your team's goals, right? And sometimes it's pretty hard to make this visible, right?
00:38:54: Alex Like making it visible to your team's stakeholders: hey, we are blocked, we cannot deliver our goals, our sprint goals, our monthly or quarterly goals, whatever, because we don't get the data from this other team's interface.
00:39:12: Alex And now again, Simon, correct me. Data contracts are a way to express what I would actually need in order to fulfill that goal, and to very explicitly show what is missing.
00:39:32: Simon Yes. So maybe moving on to data contracts. A data contract is a description of
00:39:47: Sven right.
00:39:47: Simon basically a data model, quality guarantees, and service level agreements, and it can be defined by the one offering the data.
00:39:53: Alex Exactly. Mm-hmm.
00:39:58: Simon That's a producer-driven data contract. So it's more about guarantees. It's more, this is what I promise, and you can rely on that because it's my data. I share it,
00:40:10: Simon I own it. When you go there and you see the data contract, and we can also verify it automatically, potential consumers can rely on what's there.
00:40:23: Simon And then there's the consumer-driven data contract. This is when consumers define their expectations and say, oh, I actually need these columns.
00:40:31: Sven Mmm.
00:40:35: Simon And I need it with that SLA. And I have these additional quality constraints, and so on and so forth.
00:40:40: Alex Exactly.
00:40:46: Simon And both work.
00:40:46: Alex And, and,
00:40:48: Simon And in the best case, both overlap in some way.
00:40:52: Alex but,
00:40:52: Simon should, yeah.
00:40:53: Alex But that's the benefit that I see from data contracts, and why I think people should wrap their head around them: because exactly aspects like freshness of data are something that you cannot express in regular API contracts.
00:41:08: Alex So your expectations: how fresh the data shall be, how available, during which time periods of the day the data should be available, and in which, how do you say, querying interval.
00:41:24: Alex So if you say, well, between 9 in the morning and 6 in the evening I want to query your API every 60 seconds, and I want a freshness of data that's no older than five minutes.
00:41:39: Alex That is something that you cannot express in an OpenAPI spec.
00:41:44: Simon Yeah, so they're different; they have their own focus. Think of data contracts as guarantees around a large data set.
00:41:54: Alex Mm-hmm.
00:41:54: Simon The data set could span many tables, just using the table metaphor here. And you guarantee things around this whole data set. This could be structure, this could be SLAs, and it could also be governance things, like: you are allowed to use the data for these purposes, but not for those.
00:42:17: Simon We can limit that. Maybe you can use the data, but you cannot reshare it. So there can be additional terms and conditions as well. So it's a little bit more than what we typically experience in the operational world, because for analytical data there are constraints on what you can do with it.
00:42:31: Sven Mm-hmm.
00:42:37: Simon And in the operational world, you always have this, I don't know the English term, the free ticket basically, because you can always say, we need this to do business.
00:42:49: Simon And then you're allowed to do a lot of stuff.
00:42:49: Alex Yeah, that's true.
00:42:51: Simon And in the analytical world, you always have to verify: is this allowed? Do we have the purpose?
00:42:56: Heinrich Oh.
00:42:56: Simon Do we have the consent? Do we have whatever? And maybe you also buy data from outside, weather data, statistics data. And they also come with license terms.
00:43:07: Simon You can't just do anything with that.
00:43:09: Simon so
00:43:10: Sven Mm-hmm.
00:43:10: Simon These are all aspects of the data contract.
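To make these kinds of guarantees concrete, here is a hedged sketch of how freshness, usage terms, and quality constraints could sit next to each other in a contract. The field names are simplified and illustrative rather than quoted from the standard, and the concrete values are made up.

```yaml
# Illustrative sketch of non-schema guarantees in a data contract; simplified field names.
terms:
  usage: Internal analytics and reporting only.          # allowed purposes
  limitations: No resharing outside the company; no marketing use without consent.
slaProperties:
  - property: freshness          # how far data may lag behind the operational systems
    value: 60
    unit: minutes
  - property: availability
    value: 99.5
    unit: percent
quality:
  - description: delivered_at contains at most 5% null values
  - description: shipment_id is unique across the data set
```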
00:43:15: Heinrich I mean, one important dimension here, I think, is also tooling support, or automated integrations and checks. Because in order for this to actually work, you need some degree of really hard linking and maybe some degree of discovery.
00:43:34: Heinrich You cannot purely make this a documentation exercise. You can think of a data catalog like a telephone book, and then you could have just a Google Docs process, or an editorial process of releasing the data catalog every month.
00:43:47: Simon Mm-hmm. you
00:43:48: Heinrich And you can see how horrible that would be. At the same time, just running a discovery cron job over your databases and saying, hey, here are all the databases, that also clearly doesn't work.
00:44:00: Heinrich So on a practical level, a lot of the art is making the tooling support seamless, but also strictly positive in the value that it adds, since it should really support the developer in articulating the guarantees, expectations, and so on.
00:44:21: Heinrich And a lot of the journeys actually start extremely basic, right? I have a data set, I just put it on Google Drive as a zip file and share the link. That can be a data product, or it can be a spreadsheet with our customers in it.
00:44:38: Heinrich I can put 10,000 rows into a spreadsheet, no problem. And for many companies this is the master database of all the customers. I mean, at the majority of companies, this is the master database.
00:44:50: Sven Mm-hmm.
00:44:51: Heinrich So I think a lot of this journey starts extremely low tech. And if you put everything in a MySQL database, you immediately build this big wall around it.
00:44:55: Simon Yeah.
00:45:01: Heinrich Only developers can even integrate or see the data. And so the technological landscape of systems which carry data is very large.
00:45:13: Heinrich And yeah, the trade-offs are just very subtle, because then there are also the vendors who say, oh, just bring everything to Databricks, and then everybody will be in your Unity Catalog and you will be happy. And then I'm like, no, they are in Google Drive, or on, I don't know, my FTP server, or in my other system.
00:45:30: Heinrich So can you talk a little bit about this set of trade-offs and challenges? And maybe do you have an opinion, or a sweet spot that should work for medium-sized companies?
00:45:43: Simon So a data contract really can apply to any data set on any technology.
00:45:50: Simon I always use this example of CSV files on an SFTP server, very similar to what you're talking about.
00:45:58: Heinrich yeah
00:46:00: Simon It could also be a data set behind a REST API, where you go through the data set via pagination. It could be SQL tables. But in the end, it's always a data set,
00:46:12: Simon regardless of how you access it. Is it just a GET request or not? Does it take some time, or do you have to read the whole Kafka topic to get all the data?
00:46:20: Heinrich yeah
00:46:27: Simon In the end, the data contract protects a potentially very large data set and gives the guarantees around that.
00:46:39: Simon And then I would totally agree about the tooling support. It's nice to document it, and it's already nice to say, well, this is what I offer,
00:46:51: Simon and potential consumers can have a look at it. But the benefit comes from automation. I'm co-maintainer of the Data Contract CLI, which can automatically read YAML structures written using the Open Data Contract Standard.
00:47:11: Simon And then it can automatically connect to a data source and check whether the guarantees are met.
00:47:20: Heinrich Okay, now we come.
00:47:20: Simon That's very popular.
00:47:21: Heinrich Yeah.
00:47:22: Simon And that's the key to winning the hearts of your data engineers.
00:47:27: Heinrich That's actually important and interesting information. I mean, I have been circling around this for a while, but I have not really come across those concrete tools. So is that the way forward right now: we have this open data contract spec, there is some open source tooling you can already use for managing data contracts, and then there are vendor solutions which add value on top?
00:47:54: Simon Yeah, this is the current situation as I see it. I think, in the data world,
00:47:57: Heinrich Yeah.
00:48:00: Simon we are at this spot where open standards are really driving a lot of things. Think of Iceberg, which changed a lot in the data world, and I think the Open Data Contract Standard will have its impact.
00:48:16: Simon And there will be an open data product standard; there will be more standards coming as well. I'm working together with a lot of great minds on that.
00:48:26: Heinrich Nice.
00:48:26: Simon and
00:48:26: Heinrich Yep.
00:48:27: Simon But the combination is with open source tools that you make use of as an engineer. Where the vendors come in is the central platforms,
00:48:41: Simon where everything comes together; we talked about the data marketplace.
00:48:41: Heinrich yeah
00:48:45: Simon We also see that trend here. There you see the data product, you see your contract, you see the validation results next to it, you see who already has access and why, and you can request access and provide a purpose.
00:48:59: Simon And all these things together, I think, is really how I see the current world.
00:49:09: Alex It's pretty interesting, Simon, to me.
00:49:09: Simon yeah
00:49:11: Alex Usually I would say engineers have a kind of dislike for specifications and for specifying a lot of things.
00:49:24: Alex Would you mind sharing a little bit about the history? If I remember correctly, when Jochen and you built data contracts, it was in parallel to an initiative at PayPal.
00:49:37: Alex And what was the reason or the use case you had for building this spec, and how did it evolve since then into open data contracts?
00:49:49: Simon So we started datacontract.com with the Data Contract Specification. And we started it because we realized the key part of a data product is actually the interface around the data sets it shares, like the API.
00:50:05: Alex Mm-hmm.
00:50:06: Simon That's where the music is playing. This is the essential part. But for it to work, you need to specify it, like in an OpenAPI spec.
00:50:18: Simon But you have to do it in combination with tooling, with automatic verification, and you can also use it to generate a lot of stuff, like the code generators for OpenAPI.
00:50:35: Simon That was basically one of the reasons why OpenAPI was so successful: because of those generators, because of the automation around it. And we had the same idea. We wanted to say, OK, we have the data contract spec now.
00:50:47: Simon Let's build the automation around it to make it successful. At the same time, data contracts were invented in several places. We had this at GoCardless with Andrew Jones.
00:51:01: Simon We had this at PayPal with Jean-Georges Perrin. And what Jean-Georges Perrin did is basically convince the people at PayPal to open source their own format, and later he pushed it into the Linux Foundation as the Open Data Contract Standard and tried to find a lot of people to support this.
00:51:28: Simon And we looked at it and realized that the initial version from PayPal was still very PayPal-specific.
00:51:39: Simon And so we invested a lot of time to make it more applicable for other companies, for other technologies, because PayPal was using BigQuery a lot. So it was very BigQuery-specific, but not everybody is using BigQuery.
00:51:57: Simon And we also realized that it's much better to have a standard that a lot of people are behind than to do your own thing. And this is why we said, hey, we put all our resources together, and the Open Data Contract Standard is the new way forward.
00:52:22: Simon And we are basically sunsetting the Data Contract Specification.
00:52:29: Simon That's kind of the history in a nutshell.
00:52:32: Alex Thanks.
00:52:33: Simon Maybe just to add: the Data Contract CLI, the automation, already supports the Open Data Contract Standard as well, because, again, automation is the key to adoption in that case.
00:52:48: Alex So, to make it easier for developers to use, right? To make it simple to do the right thing.
00:52:55: Simon And to get a benefit. So it's not just a documentation exercise; you get this benefit of checking whether your data, your interface, matches what you specified.
00:53:08: Simon Is the table I produce structured as I defined it in my data contract? You can just run that in your CI/CD pipeline while you develop.
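As a sketch of what that can look like in practice, the snippet below runs a contract check on every push using the Data Contract CLI inside a GitHub Actions pipeline. It assumes the CLI's `test` command and a `datacontract.yaml` in the repository; the installation step, the secret name, and the connection details are assumptions and will differ per setup.

```yaml
# Sketch of a data contract check in CI (GitHub Actions syntax).
# Assumes the Data Contract CLI and a datacontract.yaml in the repo; details will vary.
name: data-contract-check
on: [push]
jobs:
  test-contract:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install Data Contract CLI
        run: pip install datacontract-cli
      - name: Check the produced data against the contract
        run: datacontract test datacontract.yaml
        env:
          # hypothetical credential for whatever server the contract points to
          DATACONTRACT_SNOWFLAKE_PASSWORD: ${{ secrets.SNOWFLAKE_PASSWORD }}
```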
00:53:18: Alex I was going to ask, is it similar to consumer-driven contract testing, where you can integrate checking your data contracts against your latest build?
00:53:31: Simon Yes and no. The cool thing with data contracts is that, when you have access to the data, you can check it.
00:53:42: Simon You don't need a complex mock system.
00:53:45: Alex Mm-hmm.
00:53:45: Simon The moment you get access, you can run the check and verify whether the data that you have access to matches the contract.
00:53:57: Alex Okay, but that means,
00:53:57: Simon There's no complex setup required to execute that.
00:54:02: Alex I'm thinking about systems that have different environments: I don't know, a developer environment, an integration environment, a demo environment, a production environment.
00:54:13: Simon who
00:54:13: Alex You would have to run the tests. If you say you actually validate against the real data that is stored in the system, you would have to validate against all of these environments, right?
00:54:27: Simon Yeah, if you want to, and they might differ a little bit, because...
00:54:34: Alex That would be my question. Do you already have some experience, some recommendations you could give people, in terms of: when you update your data contract, this is how you should run the validation, this is how you should roll out updates in terms of process and communication to others?
00:54:53: Alex Because I assume you will not run an update of your data contract on all the systems, on all the environments at once. Say, I don't know, you first release it to an integration environment and inform people. What are your thoughts?
00:55:10: Simon So first of all, only data in prod is interesting.
00:55:14: Alex That's what you say.
00:55:15: Simon The problem is that with data in other stages, you can try to mimic the characteristics of your prod data, but it's never prod data.
00:55:25: Sven Mm-hmm.
00:55:25: Alex Yeah, it's true. It's true.
00:55:27: Simon It's always different. It's hard to verify that when you do something, it's the right thing. You can anonymize, you can try to retain the structure, but in the end, it's always different.
00:55:42: Simon so
00:55:43: Alex So you're a big proponent of testing changes in production.
00:55:43: Simon and
00:55:46: Simon Well, if you can: data is only fun in prod.
00:55:50: Alex Ah.
00:55:51: Simon That being said, that's also where others rely on it, and where you can use the contract to continuously monitor the guarantees.
00:56:02: Simon Because data is constantly flowing, data is increasing, data is appended to the interfaces.
00:56:03: Alex Mm-hmm.
00:56:10: Simon It's getting bigger and bigger. It might also be truncated, but normally it's getting bigger and bigger. And with the contract, you can always verify whether the data still matches what you guaranteed,
00:56:23: Simon that the SLA is not broken, at some interval.
00:56:27: Sven I mean, basically you do data monitoring: you run those tests, let's say, every hour or whatever, and if it fails, you get an alert, for example.
00:56:28: Alex you
00:56:42: Simon In a nutshell, yeah. The problem is that it's costly, because you pay for compute.
00:56:51: Sven hmm
00:56:51: Simon Running those checks means paying for compute. And just using the BigQuery pricing model, you pay, I think, $5 or $10 for one terabyte that you process.
00:57:07: Sven Oh, OK.
00:57:07: Simon And when you have a large table and you process it every hour, and
00:57:07: Sven Yeah.
00:57:14: Sven yeah
00:57:14: Simon it covers one terabyte and you really crunch that terabyte for each check, then it's always $10 or so each time. So you have to...
00:57:23: Alex Well, in the end, it could be worth it, right?
00:57:28: Simon What I mean is you have to make sure that it fits; you have to think a little bit about it. It's not just a bit of unit testing during development with fixed costs.
00:57:40: Alex Yeah.
00:57:42: Simon Data monitoring can be really expensive, and you have to really manage that.
00:57:45: Sven Hmm.
00:57:48: Sven Yeah.
00:57:49: Simon But regarding the stages, Alex, what you said, just one word on that.
00:57:59: Simon What we recommend is basically
00:58:05: Simon using the validation of the contract also in the CI/CD pipeline: you build, and you check it against your testing environment.
00:58:18: Simon And then in your pipeline, you can push it further to staging, and you can also run those data contract checks against the data API on your staging environment.
00:58:32: Simon And then at the end, you publish it to production. And in production, I would rather go just with the monitoring.
00:58:41: Alex Mm-hmm.
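For the production side, monitoring can be the same check on a schedule rather than on every push. The cron sketch below is illustrative and reuses the assumptions from the CI example above; the cost comment uses the $5 to $10 per terabyte figure Simon mentions for the BigQuery pricing model.

```yaml
# Sketch of scheduled contract monitoring against production data (GitHub Actions cron).
# Assumes the Data Contract CLI; schedule and figures are illustrative.
name: data-contract-monitoring
on:
  schedule:
    - cron: "0 * * * *"    # hourly; every run scans the data and therefore costs compute
jobs:
  monitor:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install datacontract-cli
      - run: datacontract test datacontract.yaml   # verify the guarantees against prod data
# Rough cost intuition: if a check scans a 1 TB table at $5-10 per terabyte,
# hourly runs add up to roughly $120-240 per day for that single table.
```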
00:58:48: Sven I had a look at the Data Mesh Manager, I believe about five weeks ago.
00:59:01: Sven When I looked at it, I think I saw all the data products. And if you look at a data product, you see the status: is the data correct?
00:59:13: Sven Whatever the contract promises, is that true?
00:59:16: Simon Mm-hmm.
00:59:19: Sven So how often, or is it even the case, that if I go to the marketplace and publish my data product with a certain data contract, this marketplace, the Data Contract Manager, also executes those contracts once in a while? Or is it only executed once?
00:59:45: Simon That's the monitoring aspect. It should be, and we execute it, through a schedule, because in production you want to monitor it.
00:59:48: Sven Yeah.
00:59:56: Simon When you move through the stages, it's not that important to publish it to others.
01:00:02: Sven Yeah,
01:00:03: Simon You can still publish it if you want, if you want to give people insight into your dev or staging environment and you want to document that.
01:00:09: Sven yeah.
01:00:11: Simon But in the end, it's not that important.
01:00:14: Sven Yeah, in this case I'm only interested in production.
01:00:17: Simon yeah
01:00:19: Sven And how often is it then executed?
01:00:23: Simon Yeah. You have to define that yourself.
01:00:27: Sven okay, yeah.
01:00:28: Simon Because, again, it's costly, and nobody can make that decision for you.
01:00:31: Sven Hmm.
01:00:34: Sven No.
01:00:35: Simon And cost ops is also very important there, because you have to manage that.
01:00:38: Sven OK.
01:00:42: Simon But that's also kind of the idea of a data product: to check, when you have no customers, that you also shut it down again at some point. So you don't run the checks anymore, you can shut down the data product and remove it at some point. This whole lifecycle is also in there. That's also new in the data world, where people thought we could store data forever and at some point the data will become valuable.
01:01:12: Sven Who is then responsible for all that?
01:01:12: Alex um
01:01:17: Sven So, I mean, usually I would say product thinking. Well, you could say everyone in the team should have that product thinking, but in general you have a product manager or product owner who cares about the thing we are building. Is it also the product owner who is then responsible for the data product, looking at how successful it is, talking to customers?
01:01:41: Sven That's okay, okay.
01:01:44: Simon Yes. In the best case, it's that simple.
01:01:49: Heinrich Are there new roles emerging?
01:01:52: Heinrich Data product manager?
01:01:56: Simon We see it a little, sometimes, here and there. I think these roles emerge for teams that just build data products, and then they just rename the product owner
01:02:12: Simon that we know from operational teams. They name it differently because they only have data products, so they are data product owners.
01:02:27: Heinrich In practice, how do you discover your clients? How do you know who is querying your data sets and building stuff on top of them?
01:02:39: Simon So the idea we have with the marketplace is that before you can access it, you have to request or get access.
01:02:48: Heinrich Ah.
01:02:49: Simon Think like the App Store in Apple where when the app is free, you still have to say, I want this. I agree to the terms. And you just get it. There's no approval. It's just it's just installed.
01:03:02: Simon And we say the same here, because we always want to track not only who's using it, but also why.
01:03:12: Alex And also have a back channel, right? In case something occurs, you need to communicate something to your consumers.
01:03:19: Simon Yeah, it's just the know-your-customer idea.
01:03:20: Alex Mm-hmm.
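[Editor's note: an illustrative sketch of the "request access, and record who and why" idea; all names and fields here are hypothetical, not the Data Mesh Manager's actual data model.]

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class AccessRequest:
    data_product: str          # which data product is being requested
    consumer_team: str         # who wants the data
    purpose: str               # why they want it (the know-your-customer part)
    approved: bool = False
    requested_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

# Marketplace-style flow: access is only granted after an explicit request,
# so the producing team always has a list of consumers and their purposes.
access_requests: list[AccessRequest] = []

def request_access(product: str, team: str, purpose: str, auto_approve: bool) -> AccessRequest:
    req = AccessRequest(product, team, purpose, approved=auto_approve)
    access_requests.append(req)
    return req

# Like a free App Store download: terms accepted, no manual approval needed.
request_access("orders-latest", "marketing-analytics", "campaign attribution report", auto_approve=True)
```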
01:03:23: Heinrich But could you distinguish here between experimentation and production workloads?
01:03:31: Heinrich Because as a developer, I mean, oftentimes I'm just like, let me see what I can do with that data.
01:03:32: Simon Yeah.
01:03:36: Heinrich Just give me a piece of it. And if I have to kind of go through a process here to get access, that may slow exploration down.
01:03:38: Simon Yeah.
01:03:47: Simon So what we recommend is basically that it's auto-approved. You just click Get Access, and you get it. It's still an additional step, though.
01:03:56: Heinrich Oh.
01:03:56: Alex Depends on the data, I would say.
01:03:58: Simon It's not just: I have access to everything and I can do exploration on all the data any time. I don't think GDPR would be happy with that approach, to some extent.
01:04:10: Heinrich I mean, for certain datasets, right?
01:04:10: Simon Yeah.
01:04:11: Heinrich I mean, for internal datasets, from my perspective, there should be no restrictions on any employee just looking at, for example, what my applications are doing.
01:04:22: Heinrich All the telemetry systems are purposefully completely open. So everybody can look into, not the logs, but the traces of a lot of the applications that we have.
01:04:34: Heinrich And I think that's important.
01:04:34: Alex Exactly.
01:04:37: Heinrich And so there's this balance of: okay, you can put a security guard in front of everything, and then you just play the approval and review game all day.
01:04:39: Simon I love it.
01:04:46: Heinrich Or you can have some less strict access, but then... I mean, for me, the sweet spot is: allow experimentation, but not production use. So if I have a production application that is depending on it, I want to know about it.
01:05:08: Heinrich But in the data world, that's often not so clear-cut, because a data analyst who prepares a management report every Wednesday is kind of a production user, if you like, right?
01:05:19: Heinrich Although he may come through a CLI on his notebook. Yeah.
01:05:27: Simon Yeah, so we think, and I agree, that experimentation should be easily supportable.
01:05:38: Simon What we do there is that you can easily request access, maybe for a limited time.
01:05:46: Heinrich Yep, that makes sense.
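[Editor's note: a sketch of the compromise discussed here, auto-approving exploration but time-boxing it while routing production use to the data owner; the policy values and function names are invented for illustration.]

```python
from datetime import datetime, timedelta, timezone

EXPLORATION_TTL = timedelta(days=14)  # assumed default; each organisation picks its own

def grant_access(purpose_category: str) -> dict:
    """Return an access grant: exploration is auto-approved but expires,
    production use needs an owner's approval and has no expiry."""
    now = datetime.now(timezone.utc)
    if purpose_category == "exploration":
        return {"approved": True, "expires_at": now + EXPLORATION_TTL, "needs_review": False}
    # Production / recurring use: someone depends on the data, so the owner should know.
    return {"approved": False, "expires_at": None, "needs_review": True}

print(grant_access("exploration"))
print(grant_access("production"))
```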
01:05:49: Alex Simon, now that you founded a company around that, I'm hearing: well, you have to think about this, you have to think about that, there should be a process for this, there need to be people who take care of this and that.
01:05:58: Simon Mm-hmm.
01:06:01: Alex Is there actually something like a map, a landscape, that you use to explain the idea of data contracts and the Data Contract Manager, your product,
01:06:15: Alex to your prospects, and say: this is why you should use it, because these are the roles that you should have, these are the processes that you need to put in place, and we actually built the automation for that. Is there an overview already, or are you going to create it before we release the podcast?
01:06:32: Alex Yeah.
01:06:33: Simon I should probably get right on that. The funny thing is that most customers basically just come to us and say: yeah, we realize we need that data marketplace. We are already hooked.
01:06:49: Simon But they basically all need a little bit of assistance, a little bit of consulting. And I think what you said is right: we know that in our heads, and it might be helpful to write that down.
01:07:05: Simon We try to write it down here and there, but it's not a "here you go, this is the one-size-fits-all." That's also a little bit complicated, because every company has its own quirks in that space.
01:07:18: Simon But in the end,
01:07:22: Simon this is kind of our approach. So we encoded our best practices into the product, including the processes.
01:07:30: Sven I thought you were now going to say: go to datamesh-architecture.com.
01:07:39: Alex Yeah, that would be my follow-up question.
01:07:40: Sven Yeah.
01:07:42: Alex Do you have to have a full-blown data mesh architecture, down to the last bits, before you can actually think about adopting data contracts?
01:07:53: Simon No, you can start whenever you share data in a company from one team to another team; you can use a data contract.
01:08:01: Alex So it could also be a pilot use case. Let's try out data contracts between two, three, four systems in order to understand whether that would help our data governance move forward.
01:08:15: Simon Yeah, and it's actually a very easy one. You can do that very decentralized. You don't need a lot of people. You just need access to the data. You need a place to put the YAML.
01:08:25: Alex Mm-hmm.
01:08:25: Simon And you need a CI/CD pipeline to run the Data Contract CLI using the YAML file and connect it to the data.
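[Editor's note: a sketch of that "YAML in a repo plus a CI job" starting point. The contract fields below only illustrate what a data contract typically carries (owner, schema, expectations); consult the Data Contract Specification or the Open Data Contract Standard for the exact format, and treat the `datacontract test` invocation as an assumption about the CLI.]

```python
import pathlib
import subprocess

# A minimal, illustrative contract for a data product shared between two teams.
CONTRACT = """\
id: orders-latest
info:
  title: Orders (latest)
  owner: checkout-team
  version: 1.0.0
models:
  orders:
    fields:
      order_id:
        type: string
        required: true
      amount_eur:
        type: decimal
"""

def main() -> None:
    path = pathlib.Path("datacontract.yaml")
    path.write_text(CONTRACT)
    # In CI, this step runs the open-source Data Contract CLI against the real data source,
    # assuming a `datacontract test` command and configured connection credentials.
    subprocess.run(["datacontract", "test", str(path)], check=True)

if __name__ == "__main__":
    main()
```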
01:08:34: Alex I know it's pretty hard for you to answer this question, but are there alternatives to using the Data Contract Manager to serve your data contracts?
01:08:48: Simon um Of course. um
01:08:52: Alex But I won't recommend them.
01:08:54: Sven It's all crap.
01:08:55: Simon Well, when you go into the details, there are differences, and, you know, yeah.
01:08:55: Heinrich Mm-hmm.
01:08:59: Alex Mm-hmm.
01:09:00: Simon I think what we...
01:09:06: Simon so
01:09:08: Alex I mean, let's talk about the options that one has when they want to get started with data contracts, and why you think the Data Contract Manager is a good way to go forward.
01:09:20: Simon Yeah, so first of all I would always start with the Open Data Contract Standard, because this is really a standard driven by a group of very diverse people from vendors, end users, and consultancies.
01:09:36: Simon And I think this is the future. Then you need some kind of automation.
01:09:41: Simon You can use the open source tool for free. It's very popular. We have over 80 contributors and over 650 GitHub stars, and it's really rising.
01:09:55: Simon So this is what you can do for free, and you can go a long way with it. And then what you need, I think, when you go bigger, is one central place where you have your contracts, where you have your check results.
01:10:14: Alex Mm-hmm.
01:10:15: Simon You need to know who the owner is. You need a way to contact them.
01:10:20: Alex And that's the marketplace offering, right?
01:10:20: Simon In that sense, you now need this kind of software, and it should be able to read and support that YAML format that you have.
01:10:26: Alex Mm-hmm.
01:10:33: Simon It should
01:10:39: Simon have a lot of other features around it, but in the end, this is what you need: a central place to manage it. And there are others that can do that; it's not unique.
01:10:52: Simon But we are unique in that we have very strong support for those open standards. Others will follow, but at the moment we have this support.
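[Editor's note: a sketch of what "one central place" implies technically: after the checks run, push the contract and its latest result to a registry so contracts, owners, and check status are visible in one spot. The endpoint and payload are entirely hypothetical, not the Data Mesh Manager API.]

```python
import json
import urllib.request

REGISTRY_URL = "https://registry.example.com/api/contracts"  # hypothetical registry endpoint

def publish_check_result(contract_id: str, owner: str, passed: bool) -> None:
    payload = json.dumps({
        "contractId": contract_id,
        "owner": owner,            # so consumers know whom to contact
        "checksPassed": passed,    # so consumers can see the current status
    }).encode()
    req = urllib.request.Request(
        REGISTRY_URL, data=payload,
        headers={"Content-Type": "application/json"}, method="POST",
    )
    urllib.request.urlopen(req)  # error handling omitted in this sketch

# Typically called at the end of the CI job that ran the contract checks, e.g.:
# publish_check_result("orders-latest", "checkout-team", passed=True)
```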
01:11:04: Alex When you say central place, would you agree with Heinrich's idea of saying: we make this open for everyone so exploration can freely flourish?
01:11:20: Alex So if you say a central place for all the data contracts, does that mean that you share them somewhere, so that everybody sees what is offered there?
01:11:29: Simon Yeah. So what we recommend for the permission model is that everybody has read access to the metadata.
01:11:29: Alex Mm-hmm.
01:11:35: Alex Mm-hmm.
01:11:36: Simon So you see all the offerings. You see all the check results. You see who's owning it, who's using it, for what purpose. You see the whole graph, so to say, of the data products, the contracts, and the usages.
01:11:56: Simon But to use the data, you have to request access, or just say, I want to have access, and you get it automatically. Just one click, and you get it for exploration. This is what I think makes sense.
01:12:22: Alex Okay. To take a look into the future: where do you think data contracts are going? What is the perspective for the specification, and maybe also the roadmap for the Data Contract Manager?
01:12:47: Simon I think the future is very agentic, and AI is everywhere. So what we already explored: we built a data product MCP.
01:13:00: Simon So, an MCP server in front of the Data Mesh Manager.
01:13:03: Alex Yeah.
01:13:04: Simon You have to do that at the moment. But the idea is really quite nice, because the data contract is the source of truth of the metadata.
01:13:18: Simon What's in the contract, that's true. Now, the data can be wrong; it could deviate from the contract. But the contract is highly curated, highly valuable metadata.
01:13:33: Simon And it's very condensed. And with that, we can make a lot of use of it in an LLM, in an agent. So what we did is basically saying: hey, you can find data very efficiently now. You can request data. You can discover all the data products you have, all the data contracts you have, quite easily through this MCP. Based on your use case, you as a business user just go there and say: I need... who are my top customers? Or any other use case. Then you can
01:14:11: Simon search our metadata database, search for the data products that might be interesting. You see the data contract. You can evaluate whether the data fulfills your need, whether you can answer the business question we started with,
01:14:28: Simon with the data contract. You also see whether you already have access. And if you don't have access, it could immediately request access. And if it's auto-approved, just for exploration, it could grant that for that specific purpose.
01:14:42: Simon And it knows the purpose, because you already gave the business question in the beginning. And all this can be done in the MCP. And then you can also query the data, because you can construct the SQL query based on the data contract: you know the structure, you know where it's located, and you can
01:15:05: Simon run the SQL query. But then you also know why you want to use the data, and we can check whether the governance policies allow querying that for that purpose.
01:15:19: Simon This can all be done. So what I'm proposing here is basically that in the future, our most popular customer will be an agent using the MCP server on the consumer side.
01:15:29: Alex Mm-hmm.
01:15:31: Simon Now, the producer side will still have to define the data products and data contracts, but the consumers will mostly be people saying: oh, this is my Claude Desktop, this is the MCP server I connected, and I just
01:15:35: Sven Mm-hmm.
01:15:47: Simon use it, and it's just handling that on its own.
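[Editor's note: a plain-Python outline of the consumer-side agent flow sketched above. Every function is a stand-in for an MCP tool call; the names, signatures, and SQL construction are hypothetical, not the actual data product MCP server.]

```python
def discover_products(business_question: str) -> list[dict]:
    """Search the metadata (data products + contracts) for candidates. Stub."""
    return []

def has_access(product_id: str, consumer: str) -> bool:
    return False  # stub

def request_access(product_id: str, consumer: str, purpose: str) -> bool:
    """Auto-approved for exploration purposes; otherwise routed to the owner. Stub."""
    return False

def build_query(contract: dict) -> str:
    """The contract gives structure and location: enough to construct a SQL query."""
    table = contract["server"]["table"]
    columns = ", ".join(contract["model"]["fields"])
    return f"SELECT {columns} FROM {table} LIMIT 100"

def answer(business_question: str, consumer: str) -> None:
    for product in discover_products(business_question):
        if not has_access(product["id"], consumer):
            # The agent already knows the purpose: it is the business question itself.
            if not request_access(product["id"], consumer, purpose=business_question):
                continue
        sql = build_query(product["contract"])
        # A governance check would run here: is this purpose allowed for this data?
        print(f"Would run against {product['id']}: {sql}")
```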
01:15:53: Alex So thinking about automated validation by agents, or agents training models by themselves, creating reports, or doing whatever else will be done with the data in the future.
01:16:08: Simon You can also do things like check whether two data products are similar: which is the better one, do we have duplications, should we retire one, is one data product included in the other? All these things; we could monitor the whole data mesh also through that approach.
01:16:29: Simon And
01:16:33: Simon maybe we also integrate AI into our product itself, to handle this request-access flow.
01:16:43: Simon Say I request access from you, Alex, perhaps. Now you have to check whether my access request is valid. And that's a lot of burden, because you have to know all the policies that might lead to rejecting that access.
01:16:58: Simon Maybe my purpose is not good; it's not allowed, perhaps. And for that, you have to know a lot. And here we can also use LLMs, because they can go through GDPR, through your privacy policies, through your license agreements, and check whether my purpose, my usage scenario, fits, to lessen your burden.
01:17:22: Simon It's easier, and it's not so hard on your shoulders. And this is also where we see a lot of impact: using AI to automate what previously data governance experts have done, and trying to automate the 80% and leave the 20% that is not so clear to the humans.
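[Editor's note: a sketch of the LLM-assisted access review described above: let a model pre-screen whether a stated purpose is compatible with the relevant policies, and escalate only the unclear cases to a human. `call_llm` is a placeholder for whatever model API is used; the prompt and decision labels are illustrative.]

```python
def call_llm(prompt: str) -> str:
    """Placeholder: plug in the LLM client of choice here."""
    raise NotImplementedError

def review_access_request(purpose: str, policies: list[str]) -> str:
    prompt = (
        "You review data access requests.\n"
        "Policies:\n" + "\n".join(f"- {p}" for p in policies) + "\n"
        f"Requested purpose: {purpose}\n"
        "Answer with exactly one word: APPROVE, REJECT, or ESCALATE if unsure."
    )
    decision = call_llm(prompt).strip().upper()
    # Automate the clear ~80%; the ESCALATE cases stay with the human data owner.
    return decision if decision in {"APPROVE", "REJECT"} else "ESCALATE"
```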
01:17:53: Alex Thanks a lot.
01:17:58: Alex Are there things that we missed, topics that you definitely want to have covered, Simon?
01:18:05: Simon Oh, that's a tough one. I think we covered a lot.
01:18:11: Alex Okay.
01:18:14: Alex So as the CEO of Entropy Data, are you going to take vacations still this year?
01:18:19: Simon Fun story, actually. So we founded the company at the beginning of August. And I think almost in the second week, I went on holiday for three days.
01:18:34: Simon so
01:18:35: Alex So exhausted already.
01:18:36: Simon Yeah, yeah, yeah, so exhausted already.
01:18:37: Heinrich Mm-hmm.
01:18:38: Sven laughter
01:18:38: Simon But I just wanted to phrase it a little differently. I want to put it in the perspective of: well, I have a great co-founder, Jochen, we have a great team there, so taking a few days off is never a problem.
01:18:52: Alex OK.
01:18:52: Simon That's what I wanted to say.
01:18:55: Alex OK, sounds good. Yeah.
01:18:58: Sven All right. Yeah, thank you very much for coming, and lots of success in the future. One thing we didn't mention: if our listeners know the ThoughtWorks Technology Radar,
01:19:19: Sven I think, correct me if I'm wrong, but the ThoughtWorks Technology Radar put your product into Assess. Am I right?
01:19:29: Simon Yeah, and it was an interesting entry, because it was named Data Mesh Manager, but in the text it was three things. So we had the Data Mesh Manager as a product, the Open Data Contract Standard as the standard to go forward for data contracts, and the Data Contract CLI as the open source tool to automate the checks.
01:19:56: Simon So it was actually three things put into one.
01:19:58: Sven Oh.
01:19:59: Alex Nice.
01:20:03: Sven All right. Cool.
01:20:06: Alex Yeah, again, thanks a lot for your time, Simon, and for answering all our questions.
01:20:06: Sven So.
01:20:13: Simon Yeah, thank you all for having me.
01:20:13: Alex It's been a pleasure.
01:20:17: Sven Yeah. Okay. Then.
01:20:23: Sven I would say, have a nice weekend.
01:20:25: Alex Have a great evening.
01:20:26: Sven we
01:20:27: Alex And thanks to all the listeners for taking the time with us.
01:20:31: Sven Exactly. Thank you for making it so far.
01:20:37: Sven All right. Then I would say bye-bye.
01:20:38: Alex Hear you next time.
01:20:42: Sven Cheers.
01:20:43: Alex Bye, everybody.
01:20:44: Heinrich Bye-bye.
01:20:44: Simon Yeah.