Katrin (00:10)
Welcome back to Knowledge Distillation, where we explore how AI is reshaping the role of the data analyst. I'm your host, Katrin Ribant, CEO and founder of Ask-Y. My guest today is someone I've known for 15 years, back when we were both riding the first wave of what would become the data revolution. Mike Driscoll is the co-founder and CEO of Rill Data, but before that, Mike founded MetaMarkets in 2010, where he co-invented Apache Druid, one of the most important real-time analytics databases of the last decade. And we'll talk about that. Mike, you're also a founding partner of Data Collective, a VC fund that raised over a billion dollars to fund data ventures. And before that, when we met in 2010, you had founded Dataspora, a data science consultancy, back when data scientist was all the rage. I mean, technically before data scientist was all the rage. I should also mention that you hold a PhD in bioinformatics. So I have to start with that, actually. Can you tell us a bit about it? What is bioinformatics, and how did you end up deciding to have both an entrepreneurial career and a VC career in analytics?
Mike (01:30)
Well, there's actually one company that I started before I did my PhD. It was one of the early online retailers, CustomInk.com, a t-shirt business. And in some ways, the way I ended up doing a PhD in bioinformatics is that I had gone very deep into custom apparel as an entrepreneur. Almost a year into that, I realized that if I continued on this path, I would be in the t-shirt world potentially for decades. And I realized that I didn't just want to be an entrepreneur. I loved data. I loved computer science. I was sort of a frustrated computer scientist; I'd never really been technical as an undergraduate. So I sold that business to my college roommate, who went on to run it for 25 years. If you've ever bought t-shirts online, you've probably heard of CustomInk. And I left for grad school.

So what is bioinformatics? It's really computer science and biology; some people call it computational biology. The way I got into it is that I was a self-taught programmer working for the Human Genome Project as a developer, and it was just fascinating to me. I've always loved data, and the origins of big data, the term itself, really come from the life sciences. It was this recognition that our DNA is this incredibly rich trove of information. At that point, the late 1990s and early 2000s, we were sequencing the human genome. We'd started with Drosophila, the fruit fly, and we were using some really interesting techniques to sequence the human genome. I found that absolutely fascinating. And I decided that rather than be a t-shirt entrepreneur for the rest of my life, I'd go to that PhD program, thinking I would become a biotech entrepreneur. Of course, as the saying goes, man plans and God laughs. So I had a plan, but I ended up going in a different direction. But yeah, that's how I ended up in computational biology and bioinformatics: leaving e-commerce, inspired by the data in the life sciences.
Katrin (04:07)
So Mike, you and I met at the onset of the rise of the data scientist, around 2010. This was before Harvard Business Review declared it the sexiest job of the twenty-first century in 2012, and when everyone was scrambling to figure out what it actually meant. I remember those debates at the time. This is not about that debate, by the way.
Mike (04:32)
Yes.
Katrin (04:33)
At the time I was migrating Havas's data platform from Oracle to an MPP architecture, and I was in the process of selecting a vendor among the MPP darlings of the time. My POC included Netezza, Vertica, Greenplum, and a few others. Scott Yara, the founder of Greenplum, which is what I ended up selecting, introduced us during a workshop. I remember you were really in the trenches, solving the hardest technical problems in high-scale, real-time analytics. So tell us about the life of an analyst in 2010: the issues with managing queries at scale, and what using AI/ML really looked like at that time.
Mike (05:16)
Sure. I mean, it's incredible sometimes to look back, right? How many cycles of Moore's Law have we had? Ten cycles of Moore's Law in fifteen years. At the time the data seemed enormous. I distinctly remember being in New York when we met, as I was brought in as a consultant to
Katrin (05:24)
It's only fifteen years.
Mike (05:46)
the Greenplum team, as their kind of data scientist for hire. Gosh, everything back then was new. AWS was nascent. The idea of massively parallel MPP databases was new. Greenplum was based on Postgres; it was a distributed Postgres engine. Postgres, of course, continues to thrive today. In some ways things change, but they stay the same. One of the things that has always been a challenge when you're a data analyst is the data transformation and data preparation. What I recall from those early days of working with Greenplum, and I think Havas was one of the clients we worked with, is that moving data out of Oracle into another system was always a huge headache. I remember working with one of Scott's colleagues on that. Data movement, orchestration, ETL: that was often 80% of the effort.
Katrin (06:58)
That project cost me a few years of my life.
Mike (07:00)
Right.

My first gray hairs came in at that point. But I also recall similar trends. One of the very common themes we saw was that while we could design analytics, for instance, back then I was very heavily involved in the R community. We were using the open-source R programming language for statistical analysis; it was very popular among data scientists at the time. We could get something running on a laptop with a sample of data, but scaling that analysis up and having it run in a system like Greenplum was also a major hurdle. What I recall from working with Havas was that translating those models, from something you could get to work on a small sample, an hour or a day of data, to something that could work on a month or a year of data, was always a challenge. We were essentially rewriting our scripts, rewriting things like linear regression.
Katrin (08:00)
We were doing sophisticated things for the time. I mean, we had user-level data at the hit level, and we were doing dynamic attribution modeling on that in 2010, right? That wasn't popular at the time.
Mike (08:15)
Yeah, there were really no out-of-the-box solutions for the things we were working on. So it was definitely the early days. And I think big data was not that big, right? The data we
Katrin (08:30)
It wasn't that big, no. But back then the challenge was simply having tools powerful enough to handle vast amounts of data, which, sure, wasn't that big by today's standards, but was still too big for the tools, right? Today that problem is largely solved, or solved-ish, reasonably good enough, I would say. We have incredible databases, transformation pipelines, BI platforms. But now we face something I think far more subtle and far more critical: maintaining context throughout the entire analytics process. So let's look at that evolution and how we got to where we are today. Do you want to talk to us a little bit about Druid and MetaMarkets?
Mike (09:15)
Sure. Sure. Maybe I'll try to embed it in a larger context, a kind of through line as we look back from 2010 to the present. I think one of the reasons why many of us who love data analysis end up in the world of media, publishing, and advertising is because that's where the data is.
Katrin (09:44)
Lots of data and lots of analysis.
Mike (09:47)
The media business was one of the first to digitally transform; the products are digital. So getting digital signal and exhaust from the consumption of digital media is a very natural thing. I think outside of IT observability, the places where folks like Splunk play, digital media has always been the tip of the spear for innovation in analytics and data infrastructure.

So when we started MetaMarkets, there was an issue. Hadoop was solving the problem of scale. Hadoop had come on the scene, and obviously we've since evolved to things like Spark, but scale was at least solved. What was not solved was speed at scale. The vision I had was that I really wanted end users who were doing analytics, observing trends for advertising campaigns, going deep on user-level analysis, understanding how different cohorts were performing, all the questions that many folks want to ask of their data, to get answers. Those questions, frankly, would take too long to answer in an interactive manner with something like Hadoop on the back end.

My experience with Greenplum really inspired me. Greenplum was a very powerful engine for delivering speed at scale, one of the first. But it still wasn't designed for the level of concurrency we were looking to deliver for user-facing dashboards. Our early customers at MetaMarkets were digital advertising platforms: Jim Payne's MoPub was an early customer, ultimately acquired by Twitter, and OpenX, still around, was one of the early platforms. So we did what is often a crazy thing: we decided to roll our own database. We had a very narrow set of use cases and requirements for powering interactive dashboards at, let's say, multi-hundred-gigabyte scale, which at the time was a lot of data. One of our engineers had an idea for an architecture that I think he had always wanted to build. So we developed Druid as an in-memory, columnar, distributed data store. It was a NoSQL data store; it didn't have SQL support. And we used that for probably the next eight years, until we were acquired by Snap. Druid powered all of our interactive dashboards for the leading digital media platforms out there. So that was the genesis: trying to build something that was fast, not just scalable, with high performance at scale.
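The columnar layout Mike describes can be shown in miniature. This is a toy sketch of the idea only, not Druid's actual implementation (Druid is a distributed Java system): storing each field as its own array means an aggregation touches only the columns it needs, which is what makes interactive group-by queries fast.

```python
# Toy sketch of an in-memory columnar store, in the spirit of Druid's design.
# Row-oriented events as they arrive...
events = [
    {"campaign": "a", "impressions": 10},
    {"campaign": "b", "impressions": 5},
    {"campaign": "a", "impressions": 7},
]

# ...stored column-wise: one array per field.
columns = {
    "campaign": [e["campaign"] for e in events],
    "impressions": [e["impressions"] for e in events],
}

def group_sum(dim, measure):
    """Aggregate a measure by a dimension, scanning only two columns."""
    totals = {}
    for key, value in zip(columns[dim], columns[measure]):
        totals[key] = totals.get(key, 0) + value
    return totals

print(group_sum("campaign", "impressions"))  # {'a': 17, 'b': 5}
```

A real engine adds compression, indexes, and distribution across nodes, but the core trade is the same: pay at ingest to organize data by column, and aggregations over billions of rows become sequential scans.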
Katrin (13:06)
And I remember at the time you had to be very specific about the use case you served, because, you remember, we competed a few times in pitches, right? And we were asking each other, why are we in the same pitch?
Mike (13:19)
Right. You were at Datorama. Right.
Katrin (13:21)
Yes. Why did they put MetaMarkets and Datorama in the same pitch? MetaMarkets serves one type of use case, and Datorama serves pretty much exactly the opposite type of use case, right? How is that confusing to anyone? I suppose that really speaks to the kind of confusion there is around any emerging technology when something is new. The market doesn't really understand the nuances of what is what, and you end up lumped into categories that you don't belong to, right? And so then, from the MetaMarkets acquisition by Snap, in 2017 I think, right? To Rill today, how
Mike (14:07)
Yes.
Katrin (14:11)
did your thinking about the analyst's role evolve? And how did that lead you to dashboards as code?
Mike (14:20)
So I think the experience of going to Snap was illuminating, because one of the questions we always asked ourselves at MetaMarkets was: this very powerful, interactive, exploratory tool that we built, does it have use cases beyond digital media platforms? That's where we had been quite successful, building tens of millions of dollars of recurring revenue in that vertical. We ultimately made the decision to exit the business, but I always thought in the back of my mind, could this be valuable beyond the set of vertical use cases we had defined? When we got to Snap, we saw that the analytics platform was valuable. Snap had begun with an Elasticsearch stack they had built, and they migrated a lot of their analytics onto the platform we brought in through MetaMarkets. It was involved not just in advertising optimization, which was critical for Snap at that period, but also in experimentation and crash analytics. At the time they were trying to launch an Android app, so they were quite literally looking at a trillion-plus events a day coming back from all the telemetry of their user base at scale, trying to get an Android app built that didn't crash. So that confirmed for me that this technology, and our philosophy, was valuable beyond the use cases at MetaMarkets.

In terms of the role of the analyst: I think the hardest thing for a lot of technologies is adoption. MetaMarkets was an enterprise sales motion; it took weeks to months to get customers on board. Around 2017, 2018, I looked at some of the emerging platforms that developers were adopting. We saw things like Next.js, whose creators ultimately became Vercel. I saw the rise of infrastructure as code, with companies like HashiCorp doing quite well. Grafana Labs was, I think, very inspiring. There was this set of developer-led-growth companies that made it very easy to adopt their tool, and Druid was never that easy to adopt, even as an open-source database. So really, my shift was to recognize that analysts were becoming more technical, or at least that there was a cohort of analysts becoming more
Katrin (17:17)
I think it's really a general movement. Analysts have become considerably more technical. Across the span of analytics roles there's definitely been a strong shift.
Mike (17:31)
And so we saw, and people talked about this term, the analytics engineer. We saw the rise of dbt, out of Fishtown Analytics, where again you had a group of analysts who were comfortable not just writing SQL but using something like Git, and Python was becoming more and more used in the analyst community. What I witnessed was that between analysts on one end of the continuum and analytics engineers and data engineers on the other, there was this compression, where one person was able to be a data engineer writing an ETL pipeline, but also able to write some Python to do transformation and get their data from an object store into a database, and then even write some SQL to get a dashboard built, whether a Grafana dashboard or something like Superset or Tableau or Looker.

So that was the thesis: could we build something similar, inspired by MetaMarkets? We actually ended up spinning the core MetaMarkets technology and stack out of Snap and rebuilding it from the ground up. But what we added was this layer of BI-as-code, to say: can we let analysts define an entire stack, from data to dashboard, in a single GitHub repo? And that's the journey we've been on. There was also the observation that while Druid was a great first-generation analytics database, there's a new class of real-time analytical databases, like ClickHouse, DuckDB, StarRocks, and Pinot, that actually deliver the performance at scale we felt was necessary for the kinds of data applications we want to deliver at Rill. So the evolution was just thinking that analysts are much more capable than they were 10 or 15 years ago, and asking whether we could lean into that, and into this code-first trend we were seeing succeed in a lot of other areas of software.
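To make the BI-as-code idea concrete, here is a minimal sketch in Python. This is a hypothetical structure for illustration only (Rill's actual project files are declarative, and the class and field names here are invented): the point is that the dashboard is a versionable spec living in a repo, from which queries are generated, rather than state clicked together in a GUI.

```python
# Minimal sketch of BI-as-code: a dashboard is a declarative spec kept in
# version control, and the SQL it needs is generated from that spec.
# (Hypothetical names -- not Rill's real format.)
from dataclasses import dataclass, field

@dataclass
class Dashboard:
    table: str
    dimensions: list = field(default_factory=list)
    measures: dict = field(default_factory=dict)  # metric name -> SQL expression

    def to_sql(self):
        """Generate the aggregation query behind this dashboard."""
        cols = self.dimensions + [
            f"{expr} AS {name}" for name, expr in self.measures.items()
        ]
        return (f"SELECT {', '.join(cols)} FROM {self.table} "
                f"GROUP BY {', '.join(self.dimensions)}")

spec = Dashboard(
    table="ad_events",
    dimensions=["campaign"],
    measures={"impressions": "SUM(imps)", "spend": "SUM(cost)"},
)
print(spec.to_sql())
```

Because the spec is plain text, it gets the benefits Mike lists: it can be diffed, code-reviewed, and deployed from Git like any other software artifact.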
Katrin (19:56)
It's fascinating, right? Because I so agree with you about the shift to the left of the skill set of the average analyst, across the board from data engineer to business analyst. That really has happened. And I think it's reinforced by the rise of the AI analyst, because having code generated for you really does help that shift to the left. I also think that LLMs help the shift to the right, because you have the ability to bring in more business context and to build better visualization and storytelling skills with the help of whichever model you like to use. In my case it's Claude, which is my favorite for that.

So our thesis is that the rise of the AI analyst basically powers a full-stack shift of the skill set. Not that everybody will become a specialist in everything, but your general skills as an analyst become the foundation for leveraging additional help to expand your skill set, and hence fight the commoditization of your job. I think that happened to software engineers first; we've observed the rise of the AI engineer. And I think it also powers a lot of confusion in the market, because when I look at AI analyst jobs currently on the market, there's this big confusion: we want a software engineer who will set up infrastructure, but they should also be a data scientist.
Mike (21:51)
And
Katrin (21:52)
They should also be a data scientist who has built LLMs, preferably, or been close to somebody who built LLMs; maybe that counts. But they should also be able to do plain data analysis and present dashboards to stakeholders. It's kind of funny, in a way, when you look at those job descriptions. It's also really not funny.
Again, it speaks to the confusion in the market at the beginning of any big trend. Do you see some of that with your clients, with the people you work with? You work with a very technical user, right? What do you see?
Mike (22:33)
So first, I think you put your finger on it when you said we're in the early chapters of this AI revolution, one that's changing every aspect of both software development and business processes. There are effectively two stakeholders we work with, and I think many companies work with. You have developers who are implementing tools, and analysts, who I think are increasingly moving toward that developer persona. And then you have business users, who use the tools the AI analyst or AI engineer sets up for them. There's a gap there, obviously, that's always been there. The business users often have the domain expertise. They have the business questions; they have that tacit knowledge of what metrics matter and what the right questions are to ask.

You talk about bridging this gap, which has always existed and has always been a huge issue. If you think about the way the analytics supply chain has traditionally worked, it has been, frankly, broken for years, right? A business user like the chief marketing officer says, hey, why are conversions down today? And they throw that question over the wall. Maybe they write a Jira ticket, and a data team picks it up, and maybe they build a dashboard, and that dashboard gets sent back to the CMO. And of course it's never the end of the discussion. The CMO says, I want to know why we saw this particular drop in Phoenix, or why Android devices in APAC were not converting as well. There's always a question that follows every question. And that ping-pong of Jira tickets, that back and forth, is just so slow, right?

So I think the promise and the excitement of AI-powered analytics is really to bridge this gap. Instead of humans on the data team getting that ticket when they wake up in the morning and then building a pipeline and building a dashboard, and by the way, I think the employment numbers already back this up, I do think a lot of data teams are going to drastically reduce their size. We are going to be replacing data teams with analyst agents. Instead of a Jira ticket waiting to be picked up and taking a few hours, the CMO now has the opportunity to get answers from a conversational prompt about why conversions are down in APAC for Android devices. That answer could come back in minutes.

So then the question is, what's the role of the analyst in this new world? The analyst is not the person who's fishing. You know the old saying: give a person a fish and they eat for a day; teach a person to fish and they eat for a lifetime. The real role of these AI engineers and AI analysts is to set up the systems that the business users interact with. Data teams are not giving answers; they're building tool chains, creating infrastructure that the business can use. What does that infrastructure look like? I've got some theories about the types of infrastructure, and you and I are both building tools, that I think this next generation of teams should stand up. But there is promise, finally, for companies to go beyond dashboards, beyond Jira tickets and data teams and this almost sclerotic, slow, bureaucratic way of being data-driven, and replace it with something much more fluid and natural in how business users interface with the most critical data and metrics in their business. I think we're on the cusp of that. And that's really the role of the analyst: to set up these internal tools, which will streamline, for the first time, self-serve analytics inside of businesses.
Katrin (27:48)
So you and I are both on the tool-building side of that revolution, right? But for the AI analysts who are listening: in your view, given the types of users you interact with, and given that you've got a very long career behind you, you've hired a lot of people, you've seen a lot of different eras and evolutions, different crazes for different languages and technologies, we've all seen quite a bit of that. If you think about being a digital analyst today, shifting to become an AI analyst, upskilling to get the best out of this new technology, what would you say are the areas they should focus on? Is it prompt engineering? What is it?
Mike (28:49)
I am a strong believer that data engineering remains an absolutely critical skill, and it's a superpower for analysts. I'll tell a story, harkening back to the early days of big data. One of the signature events of that period was when Netflix open-sourced a portion of their ratings data and initiated what was called the Netflix Prize. They were going to award, I think it was a million dollars, to the machine learning scientist who could best predict which movies Netflix users were likely to watch, based on their historical rating patterns. It was a great corpus of real data to test the idea: could we do predictive analytics at scale? And Netflix, being smart, said, yeah, we'll open-source this and run a Netflix Prize, and I think every week they released a new data set, and folks would try to predict what Netflix users were watching.

What I remember from talking to some of the prize contestants: one of them was a PhD student at Berkeley, in the statistics department. Many statisticians would come to him and say, David, you should try this algorithm, this algorithm would be a lot more powerful. And he would say, sure, you could try it yourself. But they had no ability to do it, because so much of the work was data prep: getting the data from Netflix, putting it into a matrix that would fit in memory. There was just a lot of work to prepare that data and make it amenable to analysis. That was ETL work. And the vast majority of people who wanted to compete in the Netflix Prize didn't have any data engineering skills. They needed the data put on a silver platter for them, and that's a huge crutch. So I think, especially in the era of cloud and Copilot, and now Codex, if an analyst is able
not just to build a metrics layer or a semantic layer, or build dashboards, visualizations, or prompts, but to go down the stack and actually orchestrate raw logs of data, get close to the source, and carry that data all the way through the stack to business users, that's a superpower. And it does require data engineering skill. It's getting easier: AI agents can help you write transformation code to extract from a data lake or object storage and get the data into a database like ClickHouse or Redshift or Snowflake, as you desire. So data engineering is critical, because what you actually have the opportunity to make real is the one-human data team. People talk about the one-person billion-dollar company, which I think we're on the cusp of seeing. AI provides leverage, and so I think we're on the cusp of seeing one-person data teams that can really span the gamut from data engineer to analyst. I've actually seen it. Some of our users at Rill are CTOs and VPs of engineering who, on the one hand, have business context because they're in the meetings with the CMO and the CFO, and on the other hand, know where the data lives. They've got the credentials; they know where their Cloudflare logs are located. And they are capable, in a couple of hours, of getting to the bottom of really mission-critical business questions, of building a tool that can answer key business questions, without having to bring in a team of folks, and frankly, without having to bring seventeen SaaS vendors of the modern stack into the picture.
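The kind of pipeline Mike is describing, raw logs landing in object storage and ending up in a queryable database, can be sketched with nothing but the standard library. This is an illustrative sketch with invented field names, using sqlite3 as a stand-in for an analytical database like ClickHouse or Snowflake:

```python
# Sketch of a minimal extract-transform-load step: JSON-lines log records
# parsed and loaded into a database table. (Invented fields; sqlite3 stands
# in for a real warehouse.)
import json
import sqlite3

# Raw JSON-lines records, as they might land in object storage.
raw_lines = [
    '{"ts": "2024-01-01T00:00:00Z", "user": "u1", "event": "click"}',
    '{"ts": "2024-01-01T00:01:00Z", "user": "u2", "event": "view"}',
    'not valid json',  # real logs are messy
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (ts TEXT, user TEXT, event TEXT)")

for line in raw_lines:
    try:
        rec = json.loads(line)
    except json.JSONDecodeError:
        continue  # drop malformed rows instead of failing the whole load
    conn.execute("INSERT INTO events VALUES (?, ?, ?)",
                 (rec["ts"], rec["user"], rec["event"]))

count, = conn.execute("SELECT COUNT(*) FROM events").fetchone()
print(count)  # 2
```

The unglamorous parts, schema decisions, bad-row handling, idempotent loads, are exactly the data engineering skill being argued for; AI assistants can draft code like this, but knowing what to keep and what to drop is the human's job.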
Katrin (33:13)
Well, that really is a dream skill set, right? Being able to go fully to the left and then extend fully, full stack. And I think the critical aspect in that is that you obviously have to have the technical skills, because you can generate all the code you want, but if you don't know how to architect, it's not going to be very good. So you really do need to have skills there. And so much of the data engineering, the pipeline creation, is about the business context, because you can put that data together in many ways, and most of them are not very useful. You really do need to understand something about how the data is going to be used in order to create the pipeline correctly.
Mike (34:02)
Right. Absolutely. Which is why it's so valuable if you can traverse all the way from the left, data engineering and pipelines, to the right, the business use cases. It is so powerful. We use the term predicate pushdown in analytics: if you're going to filter data out, you'd like to filter it at the database level, not after you've moved all that data out of the database. It's much more efficient to push logic down the stack. Similarly, in our world, in digital media for example, not every advertising event is created equal. We know that in online programmatic auctions, as in any auction marketplace, you'll have many, many bidders, and only one bidder wins the auction. If you don't have the business context to know that it's the winning bid that matters, you can end up carrying around two orders of magnitude more data than you need to answer the business question at the end of the pipeline. So the ability to filter out the noise and enrich the signal requires business expertise: knowing what signal is really important, and what log files you might be able to sample or even just throw away at the beginning of your analysis. A lot of data engineers don't have that business
Katrin (35:44)
Context.
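Mike's predicate-pushdown point can be sketched in a few lines. This is an illustrative example with invented field names: applying the "only winning bids matter" filter at the start of the pipeline, rather than carrying every auction event through it.

```python
# Sketch of pushing a business predicate down the pipeline: keep only the
# winning bids at the source instead of hauling every auction event along.
# (Invented field names; real bid logs are far richer.)
auction_events = [
    {"auction": 1, "bidder": "x", "price": 1.2, "won": False},
    {"auction": 1, "bidder": "y", "price": 1.5, "won": True},
    {"auction": 2, "bidder": "x", "price": 0.9, "won": True},
    {"auction": 2, "bidder": "z", "price": 0.4, "won": False},
]

# Filter early: everything downstream now sees only the rows that matter.
winning = [e for e in auction_events if e["won"]]

revenue = sum(e["price"] for e in winning)
print(len(winning), round(revenue, 2))  # 2 2.4
```

Knowing that this filter is safe, that losing bids carry no revenue signal for this question, is the business context; without it, the pipeline dutifully ships every row.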
That's fascinating, because that almost answers what was going to be my next question. I think the main challenge really is the context, and the continuity of that context across the entire process, right? If you have only one person managing the whole process, then you can consider that that person has all the context. But then whenever you have questions, you always have to go to that person, right? And that person has limited time, like everybody. So the context has to be embedded in the process, in the tools. We used to hear, you have to do documentation, which is very funny when you've been in this for 20 years, right? You're like, yeah, show me one instance where that actually worked. The context has to be something that is alive, something that is
Mike (36:14)
Yes.
Yes.
Katrin (36:40)
Automated with every interaction, and it has to be something that is intelligent, because it has to handle exactly the kinds of scenarios you just mentioned, right? So how do you see that context question coming up in teams? I'm not talking about the one-person data team, but the type of teams that Rill works with.
Mike (37:04)
Well, I think there are two broad categories of context that really matter, and one is a lot easier to solve for than the other, but both are very valuable. The first category is sort of macro context. This is what the large language models are actually great at. They have a notion of what an advertising campaign is, a notion of what an auction is; they understand conceptual entities frankly better than any junior or even senior data engineer might. I've seen, with great success, that if you give a data set to an LLM and say, can you define an ontology for this data set, can you provide me a set of dimensions and measures that are useful for a BI tool, it will do a very good job. It will define a metric like MAU or DAU given user session logs, which is fantastic; that's often better than most data engineers would do. It's got a world model. So that's great. The second, harder category of context, I might call company domain context. While there's a set of concepts out there in the world, like MAU and DAU, that everyone uses, you and I know from working with lots of companies that every company has their own bespoke, custom concepts and ontologies and...
Katrin (39:02)
Generally, every department has their own variation, and sources their fields from different systems.
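As a concrete aside on the MAU and DAU metrics Mike mentions, here is a minimal sketch of how they fall out of user session logs. The field layout and the trailing 30-day window are assumptions; as the conversation notes, every company tweaks these definitions:

```python
from datetime import date, timedelta

# Toy session log: (user_id, session_date) pairs.
sessions = [
    ("alice", date(2024, 6, 1)),
    ("alice", date(2024, 6, 1)),   # repeat sessions count once per day
    ("bob",   date(2024, 6, 1)),
    ("alice", date(2024, 5, 20)),
    ("carol", date(2024, 5, 10)),
]

def dau(sessions, day):
    """Daily active users: distinct users with at least one session on `day`."""
    return len({user for user, d in sessions if d == day})

def mau(sessions, day, window=30):
    """Monthly active users: distinct users in the trailing `window` days."""
    start = day - timedelta(days=window - 1)
    return len({user for user, d in sessions if start <= d <= day})

print(dau(sessions, date(2024, 6, 1)))  # 2 (alice, bob)
print(mau(sessions, date(2024, 6, 1)))  # 3 (alice, bob, carol)
```

Whether "monthly" means a trailing 30 days or a calendar month, and whether a session at 11:59pm counts toward one day or two time zones' worth, is exactly the kind of company-specific variation being discussed.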
Mike (39:09)
Right. And this is the messy reality. Probably one of the only areas where things have gotten a little cleaned up, after centuries, is accounting, right? Which is in some ways where the first real data scientists and data analysts worked: you were counting money. And we have the Generally Accepted Accounting Principles, GAAP, which define a set of standards. After centuries, we can finally mostly agree on what EBITDA and profit and bookings are; these metrics are mostly settled, right? We still see people on Wall Street debating and gaming these metrics. But when we talk about what is a customer, how do we count active customers? There's still a lot of...
Katrin (39:49)
Mostly, yes.
Mike (40:06)
Right. And forget customers: how do we count whether an item was shipped, or an item was returned? All of these bespoke concepts that every company has. That, I think, can be extremely challenging, and that's where I think there's opportunity.
Katrin (40:25)
So it's sometimes just very operationally sound to have differences. Like, what is revenue? If I want to optimize my ads, it's going to be revenue as counted by my ads pixel. If I want to optimize overall, it's going to be revenue deduplicated in GA4, right, or whatever counts that. Operationally, it makes complete sense to have that flexibility.
Mike (40:51)
Yes, sure. And I think we've seen, again, in the world of, let's say, retail advertisers: if you're Warby Parker, or if you're Netflix, and you're trying to optimize an advertising campaign, there are different ways to measure the return on that advertising spend. So, like you said, you want that flexibility. Some retailers might say we want to optimize for the number of glasses sold. Some want to optimize for the number of new customers that sign up. Some want to optimize for pure revenue. There are so many metrics a company may decide it wants to optimize advertising spend for, and that's an example of why you need that flexibility. And if you're running a platform, you may be working with a hundred different advertisers who all have different metrics, who actually want to pay you based on a different set of metrics. So you need to be able to flex your measurement for all of those combinations.
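One way to picture "flexing your measurement" in code is a table of per-advertiser metric definitions applied to the same event stream. This is a hypothetical sketch: the metric names, event fields, and mapping are invented for illustration and are not a real Rill API:

```python
# Toy event stream shared by all advertisers on the platform.
events = [
    {"type": "purchase", "revenue": 120.0, "new_customer": True},
    {"type": "purchase", "revenue": 80.0,  "new_customer": False},
    {"type": "signup",   "revenue": 0.0,   "new_customer": True},
]

# Assumed metric catalog: each advertiser picks the definition they pay on.
METRICS = {
    "units_sold":    lambda evs: sum(1 for e in evs if e["type"] == "purchase"),
    "new_customers": lambda evs: sum(1 for e in evs if e["new_customer"]),
    "revenue":       lambda evs: sum(e["revenue"] for e in evs),
}

def measure(events, metric_name):
    """Evaluate whichever metric a given advertiser wants to optimize for."""
    return METRICS[metric_name](events)

print(measure(events, "units_sold"))     # 2
print(measure(events, "new_customers"))  # 2
print(measure(events, "revenue"))        # 200.0
```

The point of the lookup table is that the platform ships one pipeline while each advertiser gets their own definition of success, rather than hard-coding a single notion of "return."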
Katrin (42:08)
And so, talking about building this and automating those processes: in a world where analytics agents let us automate parts of our workflows, AI analysts become more like orchestrators of these agents than what we are today as analysts, which is mostly people who type on a keyboard and operate different platforms. Do you already see some of that among your users? And if you do, how does it manifest?
Mike (42:45)
So maybe flesh that out for me for a moment. Are you suggesting a shift in how AI analysts do their role? Is that what you're...
Katrin (42:57)
Yeah, well, basically, let's take a very simple example: code generation. What do you do when you get an LLM to generate code for you, in Cursor or whatever it is? You prompt it, it starts working, and depending on how good your prompt is, how good the agent is, and how complex the task is, it can take a while. You're not going to stare at your screen during that time, right? You're going to do another process in parallel, and another process in parallel. And then at some point, not right now, but you hope at some point, we get to a stage where process one is going to call you saying: hi, decision point, checkpoint here, please make a decision. And in analytics this is probably a little more prevalent than in software engineering, because of that exact iteration loop you mentioned: we have a question, we get an answer, and that answer basically means three more questions. Every data point becomes a decision and a fork in the process, right? And there is no unit test that tells you everything you did is valid. So in a world like that, you end up spending more time orchestrating different processes and architecting those processes. And talking to somebody who manages a team like that, in a company that also builds tools, she told me that the project planning, the project management, the chunking and planning aspect, is something that is difficult for a lot of people. So I'm thinking this is an aspect we need upskilling in, right? I don't want to call it analytics architecture, because it's not about architecting tools; it's about architecting processes, and managing a virtual team that does part of those processes for you. So what I'm asking is: among the users you have, who are deeply technical and used to thinking architecturally, because they have to architect complex systems for big data analytics, which is not simple, do you see some of that already emerging?
Mike (45:17)
For sure. I think in some ways we're seeing a shift in the nature of knowledge work. As you mentioned, for a long time you had the managers and then the doers, right? A manager would say: we're going to break up this analytics project; I'm going to have my data engineer write the pipeline; I'm going to have my analyst define a set of measures and dimensions on that data; and I'm going to have my collaborator on the marketing team give us input on the dashboards we're going to design for the business stakeholders. And now, increasingly, a lot of that work will be done by agents. So everyone who was a doer becomes a manager, right? Everyone needs to learn how to break work up. Frankly, prompting is a form of managing: clearly communicating to agents the task that needs doing. So absolutely, I think those who are good at thinking about the architecture of a project, and how to break it up into a set of smaller tasks that can be delegated, previously to humans and increasingly to agents, will be much more successful.

I also think, not unlike management, there's the idea people talk about of task-relevant maturity: depending on the maturity of the team member you're working with, do you let them work for days before you check in on how they're doing? Or do you have a daily check-in, or even check in multiple times a day? Right now, in the current state of agents, I do think they're not very mature. They're toddlers; they're very junior. Just as you would never have an intern go off for two weeks and then come back with the results of their work. In fact, a lot of people don't hire interns precisely because they require a lot of management, a lot of checking. I would say a lot of agents today are at intern-level capability. Now, there are lots of them, but you do need to check in several times a day. That's going to change; we're going to see more and more maturity. But the nature of work is shifting, and yes, the upskilling is: how do you manage workers, communicate with them, evaluate their work, give good feedback? All the things that, frankly, not everyone loves. Some people just like to write code. But writing code might go the way of data entry and digging holes with a shovel. Writing code may eventually not be a job that any human does.
Katrin (48:25)
But I think reading code is, right? Because, and I love this management analogy, as you said, prompting is like talking to your team. You need to know the LLM. You need to really understand LLMs: they're not magic, they're tools. You need to understand how they work and how they differ. That's "know your team." Conversely, this is probably the first time in history when we can actually say, "I wrote this, but I didn't read it." And a lot of people do, which is generally not a good result. If you generate code, you have to learn how to read it and how to evaluate it, because it still has to integrate with the rest of your system. All of that still has to happen. And most importantly, you have to keep your critical sense about you. You have to actually check everything.
Mike (49:22)
Right, yes, absolutely. I think the role of architects is actually going to be more important than ever, because it's not enough just to ask an agent to solve a problem with code. You really do need to hint at how you would like it to solve the problem, and you need enough domain expertise to guide that agent to solve it in a thoughtful way. I'll just observe that when I was a teaching assistant in graduate school, we noticed that the best programmers' homework assignments would often have the fewest lines of code, while the least mature developers often solved problems with many more lines of code. And if you look at AI agents writing software today, they tend to be on the verbose side, and that can be a problem. Unmanaged, you end up with a lot of just, essentially...
Katrin (50:34)
Slop.
So, Mike, as usual when we talk, I could do this for hours. However, we are running over time and unfortunately have to wrap up. So: Rill is in public beta, you're tackling this data lake to dashboard in minutes vision, and you've got this unique architecture combining last-mile ETL, an in-memory database, and operational dashboards, all in one tool. Where can people find Rill? Shameless plug time. Where should they go to try it out? Where should they follow your work to keep up with what you're building? And are you hiring? Anything you want to put out as a message?
Mike (51:18)
Sure. Well, first, if you are a data engineer, or a data analyst with a data-engineering bias, you can use our tool and get it running on your local MacBook in literally seconds, at rilldata.com; that's r-i-l-l data dot com. We are actually beyond public beta at this point. We're live with dozens of the largest enterprises in the world, who are using Rill today: folks like Comcast, AT&T, and some of the largest fintech firms are leveraging our tool. But we also have a free, open-source Rill Developer tool that you can run on your own. And so folks...
Katrin (52:03)
You can just download it; you don't even have to pay with your personal data.
Mike (52:08)
That's right. You can run it locally and securely. And yeah, we support DuckDB and ClickHouse in that tool. So I would encourage any AI-native data engineer who wants to impress their colleagues: you can build not just dashboards but conversational analytics in minutes with the tool. We would love that. And we have a Discord channel, which you can sign up for, if you want to hang out with a bunch of other data geeks who are trying Rill.
Katrin (52:39)
We're in that channel, it's fun. And, you know, shameless plug: we use Rill as well at ask-y.ai.
Mike (52:46)
Of course, of course. We love having you.
Katrin (52:50)
And so, thank you for that, Mike. It's been a pleasure. That's episode two of Knowledge Distillation. If you're a data analyst trying to navigate this evolution from worrying about scale to worrying about context, check out ask-y.ai and try Prism. Thank you for listening, and remember: bots handle the what, AI analysts handle the why.
Thanks to Tom Fuller for the editing magic on this episode. If you want to work with Tom, head to ask-y.ai and check out the show notes for his contact info.