[Tech] dbt as a standard - Laurie Voss

@seldo explains why dbt has become so successful and how they use it at Netlify.
Listen to The Right Track: https://www.heavybit.com/library/podcasts/the-right-track/ep-6-domain-expertise-with-laurie-voss-of-netlify/


Transcript

Stefania: I wanted to maybe shift a little bit to how the industry is changing, before we move on to how you have seen data cultures being built and data trust being undermined and all those things.

Can you talk a little bit about how you've seen the industry change in the past few years?

Laurie: Yeah. I wrote a blog post about this recently.

I think it's probably the thing that spurred you to invite me to this podcast in the first place.

Stefania: Correct.

Laurie: Which is that about nine months ago, I was introduced to dbt. dbt has been around for a while now, I think five or six years, but it was new to me nine months ago.

And it definitely seems to be exponentially gaining in momentum at the moment.

I hear more and more people are using it and see more and more stuff built on top of it.

And the analogy that I made in the blog post is as a web developer, it felt kind of like Rails in 2006.


Ruby on Rails very fundamentally changed how web development was done, because prior to that, everybody had sort of figured out some architecture for their website, and it worked okay. But it meant that every time you hired someone into a company, you had to teach them your architecture. And it would take them a couple of weeks, or if it was complicated, a couple of months, to figure out your architecture and become productive. And Ruby on Rails changed that.
Ruby on Rails was you hire someone and you say, "Well, it's a Rails app."

And on day one, they're productive.

They know how to change Rails apps.

They know how to configure them.

They know how to write the HTML and CSS and every other thing.

And taking the time to productivity for a new hire from three months down to one month, times a million developers, is a gigantic amount of productivity that you have unlocked.

The economic impact of that is huge. And dbt feels very similar.

It's not doing anything that we weren't doing before.

It's not doing anything that you couldn't do if you were rolling your own, but it is a standard and it works very well and it handles the edge cases and it's got all of the complexities accounted for.

So you can start with dbt and be pretty confident that you're not going to run into something that dbt can't do.

And it also means that you can hire people who already know dbt.

We've done it at Netlify. We've hired people with experience in dbt and they were productive on day one.

They were like, "Cool. I see that you've got this model. It's got a bug. I've committed a change. I've added some tests. We have fixed this data model."

What happens on day two? It's great.

The value of a framework is more that the framework exists at all than any specific technical advantage of that framework.
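
(For readers unfamiliar with dbt, here is a rough sketch of what a day-one fix like the one described above could look like. The model, the columns, and the bug itself are hypothetical examples, not Netlify's actual project.)

    -- models/marts/daily_active_users.sql
    -- Hypothetical model: the "bug" was a join that double-counted users with
    -- multiple sessions per day; the fix is to count distinct users instead.
    select
        date_trunc('day', s.started_at) as activity_date,
        count(distinct s.user_id)       as daily_active_users
    from {{ ref('stg_sessions') }} as s
    group by 1

    -- tests/assert_no_negative_daily_active_users.sql
    -- Hypothetical singular test: dbt marks the test as failed if this query
    -- returns any rows.
    select *
    from {{ ref('daily_active_users') }}
    where daily_active_users < 0
       or activity_date is null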

Stefania: Yeah. I love that positioning of dbt.

Do you have any thoughts on why this has not happened in the data space before?

We have a lot of open source tools already built.

We had a huge rise in people using Spark and Hadoop and all those things for their data infrastructure a while ago, maybe 10 years ago, and that's still happening in some companies.

What are your thoughts on why this is happening now?

Laurie: I think it was inevitable.

I mean, the big data craze was 10 years ago.

I recently was reminded by somebody that I wrote a blog post.

It was literally 10 years ago. It was like July 15th 2011.

I was like, statisticians are going to be the growth career for the next 10 years, because all I see is people collecting data blindly.

They're just creating data warehouses and just pouring logs into them and then doing the most simple analyses on them.

They're just like counting them up.

They're not doing anything more complicated than counting them up.

A lot of companies in 2010 made these huge investments and then were like, "What now?"

And they were like, "Well, we've sort of figured we'd be able to do some kind of analysis, but we don't know how. This data is enormous. It's very difficult to do."

It was inevitable that people would be trying to solve this problem.

And lots of people rolled their own over and over.

Programmers are programmers, so when they find themselves rolling their own at the third job in a row, that's usually when they start writing a framework.

And that seems to be what dbt emerged from.

I think it's natural that it emerged now. I think this is how long it takes.

This is how much iteration the industry needed to land at this.

Stefania: Yeah. That's a good insight.

I maybe want to touch on then also another thing that a lot of people talk about.

And ultimately, I mean, I think what most companies want to strive for, although it remains to be defined what it literally means, is self-serve analytics.

What does that mean to you and how does that fit into the dbt world?

Laurie: I have what might be a controversial opinion about self-serve analytics, which is that I don't think it's really going to work.

There are a couple of problems that make self-serve analytics difficult.

What people are focusing on right now are like just the pure technical problems.

One of the problems with self-serve analytics is that it's just hard to do.

You have to have enormous amounts of data.

If people are going to be exploratory about the data, then the database needs to be extremely fast.

If queries take 10 minutes, then you can't do ad hoc data exploration.

Nobody but a data scientist is going to hang around for 10 minutes waiting for a query to finish.

Stefania: Finishing your query is the new "it's compiling."

Laurie: But even when you solve that problem, and I feel like a lot of companies now solve that problem, you run into the next problem, which is, what question do I ask?

What is the sensible way to ask?

And also, where is it?

Discovery is another thing.

If you've instrumented properly, you're going to have enormous numbers of data sources, even if you're using dbt.

And even if they're all neatly arrayed in very nicely named tables and those tables have documentation, you're going to have 100, 200, 300 tables, right?

You have all sorts of forms of data.

And somebody has to go through every table by name and try to figure out what's in that table.

And does it answer my question?

The data team knows where the data is, but it's very hard to make that data automatically discoverable.

I don't think people have solved that problem.

Even if you solved that problem, the chances are that somebody whose job isn't data is going to run into traps.

They're going to run into obvious data problems that a professional data person would avoid.

The simplest one is like people who are using an average instead of a median.

They're like, "The average is enormously high. So we don't have to care about this."

And I'm like, "No, no, no, no. The median is two."

And that's different from an average of 10.

You've just got a couple of outliers that are dragging your average up.
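
(A worked illustration of that average-versus-median trap, with made-up numbers and hypothetical table and column names, not a real Netlify query:)

    -- With values like (1, 2, 2, 3, 42), avg() returns 10 while the median is 2:
    -- a couple of outliers drag the average up.
    select
        avg(tickets_per_user)                                         as avg_tickets,
        percentile_cont(0.5) within group (order by tickets_per_user) as median_tickets
    from user_ticket_counts
    -- percentile_cont syntax varies by warehouse: Snowflake and Postgres accept
    -- this form; BigQuery uses percentile_cont(tickets_per_user, 0.5) over ().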

I solve that problem for stakeholders in our organization multiple times a week.

It's like correcting them just on that particular point.

And that's not even a particularly subtle question about data.

There's lots of ways that somebody who doesn't spend all of their time thinking about how to present and analyze and question data is going to mislead themselves if they are self-serve.

So that doesn't mean that I don't think self-serve should happen.

I think one of the most productive ways that I interact with my colleagues outside of the data department is we have self-serve analytics.

There's no barrier.

They can go in and write their own queries and build their own dashboard.

And they get like 80% of the way.

And then they come to me and they're like, "Is this right? Does this say what I think it says?"

And some of the time I'll be like, "Yes," some of the time I'll be like, "Nope, you're being misled by this. Sorry about that. You looked at the wrong table or you misunderstood what that problem was for."

And sometimes it will be, "You're almost there. I need to make a couple of tweaks to fix this source of error," that kind of stuff.

They can get a lot of the way, but I think being a hundred percent self-serve is not practical. No.

Stefania: I think that's a really good way to put it.

Another way also I like to think about it is there are layers of self-serve and it depends on your audience, what that means.

Providing self-serve analytics to a very non-technical product manager means one thing, and providing self-serve analytics to a very technical backend engineer who wants to answer some question because they're deciding how to architect their API, or something like that, means another. Those are two very different things.

And this touches a little bit on who your stakeholders are as a data team, I think.

Laurie: I agree.

Stefania: But it sounds like you have already built some sort of self-serve analytics and it depends on people knowing SQL.

Is that right?

Laurie: We have a couple of tools. We have a bunch of dashboards.

We use Mode, and we have a bunch of dashboards in Mode where, if your question is one that the exploration tools for the visualizations we've already built can answer, you can completely self-serve using just point-and-click.

If that doesn't work for you, Mode will let you write your own SQL.

We have recently adopted a new tool called Transform, whose whole raison d'être is to be a source of consistently defined metrics across the business.

So you give it a metric and then it gives you quite expressive ways of slicing and dicing that metric, filtering it and resorting it and stuff like that.

So we believe our goal is to have most of our metrics be in Transform, and have people be able to examine them there and be confident that that data is correct and that those metrics mean what they think they mean, which I think is going to lead us naturally to the next part of our conversation.

And Mode is going to become more about ad hoc analysis, one-off reports, very detailed explorations of specific questions, not everyday metrics.

Stefania: Yeah. Exciting, exciting times.
