Time Series DBs: ClickHouse at Sentry (Ted Kaemming, James Cunningham)
The best way to evaluate a DB is to listen to extremely knowledgeable users.
Listen to the Data Eng podcast: https://www.dataengineeringpodcast.com/snuba-event-data-warehouse-episode-108/ (11mins in)
Transcript
James Cunningham
Yeah, so as far as all the decisions that led us to this new platform: one of the biggest drivers was a big push to make environments a first-class filter. We had to build a new dimension across all of this denormalized data, which essentially doubled the storage we had. And then we said to ourselves, all this is great, this looks cool, environments are dope. But what happens when we want to add another dimension, and another? Are we just going to keep extrapolating across this data set and eventually end up with 100 terabytes holding five different dimensions of data? So we said to ourselves that we needed a flat event model that we'd be able to search across. There were a few other pieces we wanted on top of that: we wanted to be able to search across arbitrary fields, whether those are custom tags or something that we promote, like releases or traces, or searching across messages, and we didn't want that to take as long as it did. The other part is that we had all this data stored in this tag store for all the searches we have to run, but we had a completely different system for time series data that, again, had to have that dimensionality in it. If we let people search across arbitrary fields, the next thing a customer asks for is, "Hey, can I please see a pretty graph?" So if we could boil that search and that time series data down into the same system, we'd be destroying two systems with one rewrite.
Ted Kaemming
And also, as part of that process, you always have these standard checkpoints: replication and durability are obviously really important for us, ease of maintenance is huge, and low cost as well. Even that ruled out some of the hosted magic storage solutions, those kinds of pressures.
Tobias Macey
And as you were deciding how to architect this new system, can you talk through the initial list of possible components that you were evaluating, and what the process was for determining whether something was going to stay or go in the final design?
James Cunningham
Yeah, of course. Um, so the first thing we crossed off was row orientation. Postgres had served us well, and we hoped we could engineer a good solution on top of it, but ultimately we decided we probably needed a differently shaped database to get the queries we wanted. We had about five major options. We had document stores; we had some sort of Google proprietary blend, because we are completely on GCP; we had more generic distributed query stuff, a little bit of Spark, maybe a little bit of Presto; we took a look at other distributed databases, and I ran a good amount of Cassandra at my old gig, so I know how to run that; and we also said, hey, we could just put data down on disk ourselves and not have to worry about this. The other serious consideration was a columnar store. Some of the ones we actually kicked the tires on were Pinot and Druid, and ultimately we found ClickHouse as a columnar store and just started running it; it was one of the easiest ones to kick the tires on. With some of these other columnar stores built on top of distributed file systems, it really did take a good amount of bricks to put down in order to get to your first query. Some of the things we wanted were figuring out the operational costs, being able to iterate across queries, and being able to pare down the dependencies the service had. While we weren't afraid to run a few JVMs, or a little bit of HDFS, that was something that, realistically, I didn't want to have to dedicate an entire engineer to running.
And on the antithesis of that, we could choose some of this Google proprietary blend, but how would it feel to go from Sentry only requiring Redis and Postgres to saying you can only run the new version on Google? That seemed a little bit silly. So we ended up really just getting through an MVP of, I think, both Kudu and ClickHouse, and one of the biggest things that really did kick us (and for anyone listening, go ahead and correct me if I'm wrong) was that, in my memory, one of our engineers started loading data into Kudu, and you didn't really know when the data was there. It was great for being able to crunch your numbers, but one of our biggest requirements, which you did kind of hint at, is that we need real time data: to be able to write into this data store and then read it back on a consistent basis. One of the things we need it for is a feature called alert rules, where you say, hey, only tell me if an event with some tag, say foo equal to some value, occurred maybe 10 times in the last hour. You want to be able to read that pretty quickly, so that when that 10th event comes in, you're not waiting minutes until the alert shows up, and ClickHouse was able to do that. So it kind of just worked its way up to number one.
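A minimal sketch of the alert-rule check James describes, counting matching events in a sliding window, might look like this in Python. The event shape, tag names, and threshold here are hypothetical illustrations, not Sentry's actual model:

```python
from datetime import datetime, timedelta

def alert_should_fire(events, tag_key, tag_value, threshold, now,
                      window=timedelta(hours=1)):
    """Count recent events carrying a given tag value; fire when the
    count reaches the threshold (e.g. 10 events in the last hour)."""
    cutoff = now - window
    matching = [
        e for e in events
        if e["timestamp"] >= cutoff and e["tags"].get(tag_key) == tag_value
    ]
    return len(matching) >= threshold
```

The point of the anecdote is latency: this check is only useful if the 10th event is readable almost immediately after it is written.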
Ted Kaemming
Yeah, I think also, in general, at Sentry we try to bias a little bit towards relatively simple solutions, and ClickHouse, at least to us, based on our backgrounds, seemed more straightforward to get running. I think that appealed to us quite a bit as well. The documentation is pretty solid. It's also open source, and a lot of open source projects can be hit or miss there, but ClickHouse has a pretty active repository. They've been very responsive when we've had questions or issues, and they're very public about their development plan. So I think a lot of these things just worked out in its favor.
Tobias Macey
Yeah, from what I've been able to understand, it's definitely a fairly new entrant into the overall database and data storage market, but I've heard a few different stories of people using it in fairly high load environments. I've heard about the work that you're doing with Snuba, and as far as I understand, Cloudflare is also using it for some of their use cases, and they definitely operate at pretty massive scale with high data volume. So it seems like a pretty impressive system with a lot of different capabilities. I was also pretty impressed when I had some of the folks from Altinity on the podcast a while ago to talk about their experience of working on it and working with some of their clients on getting it deployed. I'm curious what some of the other types of systems you were able to replace with ClickHouse were, given that, as you said, you had these four different systems that you had to replicate event data to. Were you able to collapse them all down into this one storage engine?
Ted Kaemming
Yeah. So in our code base, those four different things, the TSDB, search, tag store, and Nodestore, all have kind of abstract service interfaces, which really just evolved from the fact that it's an open source project and people wanted to use different backends for them. Three of those are now backed by the same data set in ClickHouse. All the TSDB data comes directly out of ClickHouse; there's no pre-aggregation that happens anymore, we're just ripping over individual rows and computing those aggregates on demand, at least for now. Some of the data for search still lives in Postgres, but a lot of it now essentially comes straight from the event data in ClickHouse. And the tag store, we've removed entirely. How many servers were we using for tags?
James Cunningham
We had, oh goodness, like 12, each of them 32-core with maybe 200-odd gigs. But getting into some of the other stats a little further down the list: we went from 52 terabytes of SSD to two terabytes, which is a good number to come down from. Yeah,
Ted Kaemming
so we were able to, yeah, absolutely, we were able to decommission an entire Redis cluster (cluster in quotes) and an entire Postgres cluster, with drastically less hardware. And just the fact that it all reads from the same ClickHouse cluster, and there's none of this weird replication lag between all these systems, is a huge positive.
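The on-demand aggregation Ted described, scanning raw rows and computing the time series at query time instead of maintaining pre-aggregated rollups, can be sketched roughly like this (the row shape is an assumption for illustration):

```python
from collections import Counter

def hourly_counts(rows):
    """Aggregate raw event rows into per-hour counts at query time,
    instead of maintaining pre-aggregated rollups."""
    counts = Counter()
    for row in rows:
        # Truncate each event's timestamp to the start of its hour bucket.
        bucket = row["timestamp"].replace(minute=0, second=0, microsecond=0)
        counts[bucket] += 1
    return dict(counts)
```

The trade-off is that every graph request rescans rows, but a columnar store makes that scan cheap enough to avoid keeping rollups in sync.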
Tobias Macey
Can you talk a bit more about the overall architecture of Snuba itself, some of the operational characteristics and experience that you've had with ClickHouse, and maybe some of the early pain points and sharp edges that you ran into as you were getting used to this new system?
Ted Kaemming
Yeah, sure. So just to give you a brief overview of the architecture, because it's really not particularly fancy: Snuba is just a relatively small Flask application, at least small when you compare it with the remainder of Sentry. So it's a Flask application, it just speaks HTTP, it's in Python, and it's generally stateless. Writes, as they come in, go through a Kafka topic, published directly from the remainder of the Sentry codebase. The Sentry codebase and the Snuba codebase are actually completely independent, at least as far as the projects' git repos go. So Sentry writes into this Kafka topic, the Snuba consumer picks the events up, does some denormalization and some data munging, conventional Kafka consumer stuff, and writes large batches of events to ClickHouse. We don't use the ClickHouse Kafka engine or anything particularly special for that; we just use the plain Kafka driver from Confluent, which is librdkafka-based, and that's all in Python. Reads go much the same way, also over HTTP; nothing particularly fancy there either. We have various optimizations: we keep a general query cache and do deduplication of queries, so that we don't have large queries with long run times executing concurrently on the cluster. We do some optimizations where we move some conditions from the WHERE clause in ClickHouse SQL to a PREWHERE clause, which is basically the closest thing you get to any sort of query optimization, plus some other query rewriting based on our domain model. There are also rate limits and quality-of-service metrics and logging that happen in there. As long as all that goes well, a response is returned to the caller that is almost identical to what you would get if you were just interacting with the HTTP interface of ClickHouse itself.
If it doesn't go well, that ends up getting logged to Sentry, and we then enter the system again to go look at it. So that's a brief overview. It's nothing particularly fancy.
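A rough sketch of the batching behavior Ted describes in the consumer, buffering events and flushing them to the database in large batches. The `flush_fn` callback stands in for the ClickHouse bulk insert, and the thresholds are made-up values, not Snuba's actual configuration:

```python
import time

class BatchWriter:
    """Buffer incoming events and flush them in large batches, either
    when the batch is full or when the oldest event has waited too long."""

    def __init__(self, flush_fn, max_batch=1000, max_wait=1.0):
        self.flush_fn = flush_fn      # stand-in for the ClickHouse insert
        self.max_batch = max_batch    # flush when this many events buffered
        self.max_wait = max_wait      # or when the batch is this many seconds old
        self.batch = []
        self.first_seen = None

    def add(self, event):
        if not self.batch:
            self.first_seen = time.monotonic()
        self.batch.append(event)
        self._maybe_flush()

    def _maybe_flush(self):
        full = len(self.batch) >= self.max_batch
        stale = bool(self.batch) and time.monotonic() - self.first_seen >= self.max_wait
        if full or stale:
            self.flush_fn(self.batch)
            self.batch = []
```

Large batched inserts are the pattern ClickHouse favors, which is why the consumer accumulates events rather than writing them one at a time.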
Tobias Macey
Yeah, sometimes simple is best, particularly when you're dealing with something that is as critical path as this.
James Cunningham
Yeah, for sure. So, to talk a little bit about the early engineering you might have alluded to: I'd say one of our biggest early difficulties was that we'd put a lot of eggs in the Postgres basket. So we turned this on and the queries that we'd set up for a row-oriented database just absolutely did not map to a columnar store, which is a crazy thing to say,
Ted Kaemming
it's so easy to type SELECT *.
James Cunningham
So easy. Spelling is hard. But there were some things that just absolutely did not cut over to this columnar store, where we had to redesign how we did every query. Sentry had a common pattern of ORDER BY some arbitrary column and then LIMIT 1000, to explicitly hit a B-tree index in Postgres. That didn't matter in ClickHouse: any LIMIT just truncated what rows you were returning, and if you applied an ORDER BY, it would take your entire data set and order it. Another thing is that we had a lot of SELECT stars everywhere, like Ted said, and that is honestly one of the worst ways to operate on a columnar store, because you're reading from every single file. So we changed that a little bit. Also, we didn't have a query planner, so there was a lot of taking a query and just moving pieces around. One of the things Ted alluded to was the notion of a PREWHERE. When you have multiple columns that you want to filter on in a WHERE clause, you have the ability to give ClickHouse a little bit of a heuristic and say: this is the column that we believe has the highest selectivity. You put it in a PREWHERE clause, and it will read through that column first and decide which block IDs it's going to read for the rest of them. So something like an event ID, which for us is globally unique, might have higher selectivity than environment, or a release might have a bit higher selectivity. We were working around these edges by just swapping variables around and asking, well, did that make it faster? And when the answer was yes, we threw some high fives around.
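The PREWHERE rewriting James describes could be sketched like this. The column names, the selectivity ranking, and the query builder are illustrative assumptions, not Snuba's actual implementation:

```python
# Assumed selectivity ranking: lower rank means more selective.
# Event IDs are globally unique, so they filter hardest.
SELECTIVITY = {"event_id": 0, "release": 1, "environment": 2}

def build_query(table, columns, conditions):
    """Emit ClickHouse SQL that puts the single most selective equality
    condition into PREWHERE and leaves the rest in WHERE."""
    ranked = sorted(conditions, key=lambda c: SELECTIVITY.get(c[0], 99))
    prewhere, rest = ranked[0], ranked[1:]
    sql = f"SELECT {', '.join(columns)} FROM {table}"
    sql += f" PREWHERE {prewhere[0]} = '{prewhere[1]}'"
    if rest:
        sql += " WHERE " + " AND ".join(f"{c} = '{v}'" for c, v in rest)
    return sql
```

ClickHouse reads the PREWHERE column first and only touches the remaining columns for the row blocks that survive, which is why picking the most selective condition matters.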
Ted Kaemming
Yeah, there was also just the integration into some of the query patterns we have in Sentry, which was a bit of a challenge. ClickHouse is really designed to do particularly well with inserts; it does not do particularly well with updates or deletes, to the point where they aren't even syntactically valid in the ClickHouse-flavored SQL. Sentry as a whole is particularly insert-heavy, but it's not insert-only, so we had to work around the fact that ClickHouse is extremely oriented towards inserts. We ended up with something that, as James mentioned, he worked on Cassandra in a past life, and I did as well, is fairly similar to Cassandra's tombstones for how we delete data, where we implement our own last-write-wins semantics on top of the ReplacingMergeTree engine in ClickHouse. There's a long blog post about how we do that, as part of this field guide series we've been working on, where we go into some of these weird things we do with ClickHouse. Similarly, for things like those alerts that James mentioned earlier, we basically require sequential consistency to be able to execute those queries effectively. That becomes a problem when you're dealing with multi-master replication, which is what ClickHouse does. So we ended up having to do some kind of dodgy load balancing, where we don't have a literal primary for all writes, but we have this ad hoc primary that all writes go to as long as it is up, and some subset of queries are only allowed to evaluate on that primary. It's not guaranteed sequential consistency in a true distributed-systems sense, but it's good enough for what we need. It's also particularly complicated because the system doing the querying is not Snuba; it lives in the Sentry codebase.
So we basically need to be able to notify the Sentry codebase that these rows have been written to ClickHouse from Snuba. We ended up having to engineer a solution where we have a commit log coming out of the Snuba Kafka consumer; the Sentry application actually subscribes to that commit log Kafka topic and gates its own progress based on the progress of the Snuba writer. There's also a blog post on the Sentry blog that goes into more depth about how we specifically implemented that, as part of this field guide series. But yeah, things like that. We knew the mutations were going to be something we had to manage, though we didn't particularly have a strategy around them, and the sequential consistency stuff probably caught us a little more by surprise than it should have: we were doing some of our integration testing in production and noticed that some of the queries weren't returning exactly what we thought they would. So that was something we also had to solve.
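A toy model of the last-write-wins collapse Ted describes, mimicking what ClickHouse's ReplacingMergeTree does at merge time, with a deleted flag acting as a tombstone. The row shape and field names are hypothetical, not Sentry's actual schema:

```python
def collapse(rows):
    """Keep only the highest-version row per event id, the way
    ReplacingMergeTree keeps one row per key at merge time; rows
    flagged as deleted act like tombstones and drop the event."""
    latest = {}
    for row in rows:
        eid = row["event_id"]
        if eid not in latest or row["version"] > latest[eid]["version"]:
            latest[eid] = row
    return [r for r in latest.values() if not r.get("deleted", False)]
```

The key subtlety is that the real merge happens asynchronously, so queries may still see both versions until the merge runs; that is part of what the commit-log gating has to paper over.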
Tobias Macey
And you mentioned that one of the reasons you went further with ClickHouse than any of the other systems is that it was pretty easy to get up and running with, and seemed fairly simple operationally. So I'm curious what you have found to be the case now that you're actually using it in production and putting it under heavier load in a clustered environment, and any useful lessons you've learned in the process that you think anybody else evaluating ClickHouse should know about.
James Cunningham
Absolutely. So this is my time to shine.
So one of the things that I kind of had to had to make a concession Is that I've never worked with a database that possibly be bound by CPU. It's always been, you know, make sure that your disks are as fast as possible, you know that the data is on the
disks, you got to read from the disk.
And the reason it very well can be bound by CPU is compression. I'd seen compression in the past, but I didn't really understand what compression could actually give you until we turned ClickHouse on. Compression realistically brings our entire data set, as we alluded to earlier, from 52 terabytes down to two terabytes, and about 800 gigs of that is unsurprisingly incompressible, because it's unique 32-character strings. If anyone can tell me an algorithm that helps compress those, I think someone already made a TV series about that. But for the rest of the data, it's so well compressed that computing across it performs really well; we run a small number of servers to serve what is a large data set. If there's any advice for anyone out there: start by sharding, and never shard by two, because two is a cursed number in distributed systems. We really just started with three shards and three replicas, and with that blessed number of nine, we haven't gone up yet. We have a high watermark of about a terabyte per machine. Google gives you a certain amount of read and write throughput off a disk based on how much storage you have, and we've unlocked a certain level at one terabyte per machine, if anyone else is somehow running ClickHouse on GCP, that is. We're about to apply our fourth shard. Realistically, one of the other operationally sound things is that, as much as we'd all love to hammer on XML, or praise it, ClickHouse is very explicit about what you have to write in. It's configured via XML, and there's no runtime configuration that you're applying.
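The compression effect James describes can be illustrated with zlib as a rough stand-in for ClickHouse's column codecs: a low-cardinality column compresses dramatically, while unique 32-character identifiers barely compress at all. The column contents below are invented for the demonstration:

```python
import zlib
import uuid

def ratio(data: bytes) -> float:
    """Compressed size as a fraction of the original size."""
    return len(zlib.compress(data)) / len(data)

# A low-cardinality column (e.g. repeated environment names)
# compresses to a tiny fraction of its original size.
environments = ("production\n" * 10000).encode()

# Unique 32-character hex event IDs are close to random data,
# so they compress very poorly.
event_ids = "".join(uuid.uuid4().hex for _ in range(10000)).encode()
```

This asymmetry is why the bulk of a denormalized event data set can shrink by an order of magnitude while the unique-ID columns stay near full size.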
There's no magic distribution where you write into an options store and watch it cascade into a cluster, and no auto scaling.
Yeah, I'm not cramming it into Kubernetes pods or anything like that. One thing I'd be remiss not to say: you did mention Cloudflare is running ClickHouse, and shout out to Cloudflare, they run real hardware, and I'll never do that again in my life. But one of the things they mention in one of their kick-ass blog posts about ClickHouse is that it replicates so fast that they found it more performant, when a disk in a RAID 10 dies, to just wipe all the data, rebuild the disk essentially empty, and have ClickHouse refill it itself. It is crazy fast in terms of replication; since everything is compressed, it really just sends that straight across the wire. Some of the other stuff we've found completely great operationally: since it is CPU bound, it's mostly bound by reads. When you're a write-heavy company and you're now bound by reads in terms of cost of goods sold, I can throw around a million high fives after that. It's great to watch people log in and actually look at their data and see our graphs tick up, instead of saying, well, we spent a lot of money on this, and people are only reading 1% of their data. One other piece I'd be remiss not to mention: a nicety of ClickHouse that sets it apart from a few of the databases I've worked with is the ability to set some very quick throttling, or turbo, settings on the client side. For example, if we know a query is going to be expensive, we can sacrifice a little bit of resources and get the answer back fast. There is literally a setting called max_threads, where I can say, you know what, I really want this to run faster: set max_threads to eight instead of four.
And it does exactly what it says it does: it'll run twice as fast if you give it twice as many threads. So there are pretty easy knobs we turn operationally. As far as databases go, one of the hardest things is just reading all of the settings to figure out what they do, but after you get versed in it, you'll understand what applying a setting does, or at what threshold you might set something. It's not very magical; some of these settings, realistically, are for very explicit types of queries that you'd only supply from the client side if you really needed them. I wouldn't go so far as to call it simple, but the configuration is almost, like, dumb. Straightforward, I mean. Very straightforward. Yeah.
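Per-query settings like max_threads can be passed to ClickHouse's HTTP interface as URL parameters alongside the query. A minimal helper might look like the following; the host and the default port 8123 are assumptions for a stock setup:

```python
from urllib.parse import urlencode

def clickhouse_url(host, query, **settings):
    """Build a ClickHouse HTTP-interface URL that passes per-query
    settings, such as max_threads, alongside the query itself."""
    params = {"query": query, **settings}
    return f"http://{host}:8123/?{urlencode(params)}"
```

For example, `clickhouse_url("localhost", "SELECT count() FROM events", max_threads=8)` yields a request that runs that one query with eight threads, without touching any server-side configuration.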