[Tech] Separation of Storage and Compute (Databases) - Nikita Shamgunov

The CEO of Neon Database explains how to split Postgres.
Listen to the Changelog: https://changelog.com/podcast/510

Transcript

So elastic compute makes sense, and scaling down because you have like ephemeral on-demand resource usage, right? Like, all of a sudden, I have to answer a bunch of HTTP requests, and so my server has to do stuff, and then everybody leaves, and my website doesn’t get any hits, and I could scale that down. With databases, if I’ve got a one-gigabyte database, it’s just like, it’s always there. I mean, all that data is there, and I could access any part of it at any time, or I need to… And we don’t know which parts. So I have a hard time with database scaling to zero, unless you’re – I don’t know, just like stomaching the cost… Or tell us how that works with Neon. Are you just stomaching the costs of keeping that online, or are you actually scaling it down?
We’re actually scaling that down. Let me explain how this works, and it may get quite technical. The first thing is: what should be the enabling technology for scaling that down? If you’re just thinking, “How would I build serverless Postgres?” and you ask a person who is not familiar with database internals, they would say something like, “Well, I would put it in a VM maybe, or I would put it in a container, I would put that stuff into Kubernetes… Maybe I can change the size of the containers…” The issue with all that is, as you start moving those containers around, you will start breaking connections, because databases like to have persistent connections to them. And then you will be impacting your cache. Databases like to have a working set in memory, and if you don’t have the working set in memory, you’re paying the performance hit of bringing that data from cold storage into memory.

The third thing you will find out is that if the database is large enough, it’s really, really hard to move a database from host to host, because that involves data transfer, and data transfers are just long and expensive. And now you need to do it live, while the application is running and hitting the system. And so naively, you would arrive at something like what you proposed - just stomach the costs. There is a better approach, though… And the better approach starts with an architectural change: separating storage and compute.

If you look at how database storage works at a high level, it’s what is called page-based storage; all the data in the database is split into 8-kilobyte pages. And the storage subsystem basically reads and writes those pages from disk, and caches those pages in memory. And then the upper-level system in the database lays out data on pages.
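To make "page-based storage" concrete, here is a minimal sketch in Rust with hypothetical types (not Postgres internals): data lives in fixed-size 8 KB pages, the storage layer reads and writes whole pages by page number at fixed file offsets, and hot pages are cached in memory.

```rust
use std::collections::HashMap;
use std::fs::File;
use std::io::{Read, Seek, SeekFrom, Write};

const PAGE_SIZE: usize = 8192; // Postgres uses 8 KB pages

type Page = [u8; PAGE_SIZE];

/// Toy page store: a data file plus an in-memory page cache.
struct PageStore {
    file: File,
    cache: HashMap<u64, Page>, // page number -> cached page
}

impl PageStore {
    /// Read a page, serving it from the cache when possible.
    fn read_page(&mut self, page_no: u64) -> std::io::Result<Page> {
        if let Some(page) = self.cache.get(&page_no) {
            return Ok(*page);
        }
        let mut page = [0u8; PAGE_SIZE];
        self.file.seek(SeekFrom::Start(page_no * PAGE_SIZE as u64))?;
        self.file.read_exact(&mut page)?;
        self.cache.insert(page_no, page);
        Ok(page)
    }

    /// Write a page back at its fixed offset and keep the cache coherent.
    fn write_page(&mut self, page_no: u64, page: Page) -> std::io::Result<()> {
        self.file.seek(SeekFrom::Start(page_no * PAGE_SIZE as u64))?;
        self.file.write_all(&page)?;
        self.cache.insert(page_no, page);
        Ok(())
    }
}
```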

So now you can separate that storage subsystem and move it away from compute into a cloud service. And because that storage subsystem is relatively simple from the API standpoint - the API is “read a page, write a page” - you can make that part multi-tenant. And so now you start amortizing costs across all your clients. So if you make that multi-tenant, and you make that distributed - and distributed key-value stores, you know, we’ve been building them forever, so it’s not rocket science anymore - then you can make that key-value store very, very efficient, including being cost-efficient. And cost efficiency comes from taking some of the data that’s stored there and offloading the cold data into S3.
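The "read a page, write a page" interface he describes is narrow enough to sketch as a service API. This is only an illustrative Rust trait with hypothetical names, not Neon's actual pageserver interface: pages are addressed per tenant, which is what lets one storage fleet serve many databases.

```rust
/// Identifies one customer database in a multi-tenant storage service.
struct TenantId(u128);

/// Addresses a single 8 KB page within a tenant's database.
struct PageKey {
    tenant: TenantId,
    relation: u32, // which table or index file
    block_no: u32, // page number within that relation
}

/// Placeholder error type for the sketch.
struct StorageError(String);

/// The whole storage side boils down to two operations on pages,
/// which is what makes it practical to pull out of compute and share.
trait PageService {
    fn read_page(&self, key: &PageKey) -> Result<Vec<u8>, StorageError>;
    fn write_page(&mut self, key: &PageKey, data: &[u8]) -> Result<(), StorageError>;
}
```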

[20:13">20:13] Now, then it leaves out compute. And compute is the SQL query processor, and caching. So that, you can put in a VM. We actually started with containers, but we quickly realized that micro VMs such as Firecracker or Cloud-hypervisor is the right answer here. And those micro VMs have very, very nice properties to them. First of all, we can scale them to zero, and preserve the state. And they come back up really, really quickly. And so that allows to us to even preserve caches, if we shut that down.

The second thing that allows us to do is live-change the amount of CPU and RAM we’re allocating to the VM. That’s where it gets really tricky, because we need to modify Postgres as well, so it can adjust to suddenly having more memory, or shrink down to “Oh, all of a sudden, I have less memory now.” And if you all of a sudden have less memory, you need to release some of the caches, release that memory back to the operating system, and then we change the amount of memory available to the VM. And there’s a lot of cool technology there, with live-changing the amount of CPU, and there’s another one called memory ballooning, that allows you to, at the end of the day, adjust the amount of memory available to Postgres.
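The ordering he describes when shrinking can be sketched roughly as follows. The function names are hypothetical placeholders (the real mechanism involves virtio memory ballooning plus Neon's Postgres changes): first get Postgres to give memory back inside the guest, then let the balloon reclaim those pages so the host can lower the VM's footprint.

```rust
/// Illustrative only: these are placeholder calls, not a real API.
fn shrink_compute(target_bytes: u64, current_bytes: u64) {
    if target_bytes >= current_bytes {
        return;
    }
    let to_release = current_bytes - target_bytes;

    // 1. Ask Postgres to shrink its caches so those pages become free
    //    inside the guest (shared buffers, per-backend work memory, ...).
    postgres_release_cache_memory(to_release);

    // 2. Inflate the balloon: the guest hands the now-free pages back
    //    to the hypervisor, which can then reduce the VM's memory size.
    balloon_inflate(to_release);
}

// Placeholder stubs so the sketch stands alone.
fn postgres_release_cache_memory(_bytes: u64) {}
fn balloon_inflate(_bytes: u64) {}
```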

And then you can live-migrate VMs from host to host. Obviously, if you put multiple VMs on a host and they all start growing, at some point you don’t have enough space on the host. Now you have to make a decision - which ones do you want to remove from the host? Maybe you have a brand new host available for them, with the space… But there is an application running, with a TCP connection, hitting that system. Storage is separate, so you only need to move the compute. And so now you’re not moving terabytes of data when you move Postgres, you’re just moving the compute part, which is really the caches, and caches only. But you need to perform a live migration here. So that’s what we’re doing with this technology called Cloud Hypervisor, that supports live migrations. And the coolest part is, as you’re performing the live migration, you’re not even terminating the TCP connection. So you can have the workload keep hitting the system as you change the size of the VM for the compute up and down, as well as change the host for that VM, and the application just keeps running… So yeah, that’s kind of super-exciting technology.
So do you have your own infrastructure that this is running on, or are you on top of a public cloud, or how does that all work?
So we are on top of AWS. We know that we need to be on every public cloud, because that’s where the users are… Now, this question kind of hits home a little bit; the cost could be at least ten times lower if we used something like, I don’t know, Hetzner, or OVH. And in our architecture, it’s super-important to have an object store as part of the architecture. So Amazon S3. And in the past, there was no alternative to S3. Like, no real alternative. But just a few weeks ago, Cloudflare released R2 and made it GA. And all of a sudden, you can put cold data onto R2; we still don’t know what the real reliability of R2 is, but I trust that Cloudflare will get it up there eventually. And that opens up all sorts of possibilities.

The other one that we’re looking into closely is Fly. We even have a shared Slack channel with Fly.io. I think it’s a fantastic company, and I see a day where Neon will be running on Fly infrastructure as well.

[23:52">23:52] Now, all that said, as of right now, right now we’re only on Amazon, and we’ll be adding another cloud. In which order, and what’s going to come sooner, Fly or Google, for example, I can’t really commit to, because we are continuously evaluating.
Yeah. So when you say move data off to S3, how do you deem data as cold on your customers’ behalf? Because there have got to be some smarts in there.
Yeah, there are a lot of known algorithms, and they’re mostly caching algorithms. So it’s already happening today a little bit in Postgres; there is a buffer manager, a buffer pool… I’m maybe mixing SQL Server and Postgres terminology here, because my background is SQL Server. But the architecture is similar, where the buffer pool has a counter for every page, and it bumps the counter when a page is touched… And then the algorithm kind of sweeps the cache, decides which pages haven’t been touched for a while, and then evicts them from the cache.
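What he's describing is essentially the clock-sweep approach Postgres uses in its buffer pool. A compact sketch (not the actual Postgres code): each buffer carries a usage counter that is bumped on access, and the sweeping hand decrements counters and picks a victim once a counter has decayed to zero.

```rust
struct Buffer {
    page_no: u64,
    usage_count: u8, // bumped on every access, capped at a small maximum
}

struct BufferPool {
    buffers: Vec<Buffer>,
    hand: usize, // position of the clock hand
}

impl BufferPool {
    /// Record that a page was touched.
    fn touch(&mut self, idx: usize) {
        let b = &mut self.buffers[idx];
        b.usage_count = (b.usage_count + 1).min(5);
    }

    /// Sweep until we find a victim whose usage count has decayed to zero.
    fn find_victim(&mut self) -> usize {
        loop {
            let idx = self.hand;
            self.hand = (self.hand + 1) % self.buffers.len();
            if self.buffers[idx].usage_count == 0 {
                return idx; // not touched for a while: evict this one
            }
            self.buffers[idx].usage_count -= 1; // give it another lap
        }
    }
}
```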

Here we’re adding another tier, in the remote storage. We also track pages, and we see which pages have been touched recently and which have not been, and then we offload those pages onto S3. There is a caveat, however; S3 does not like small objects, and a page is 8 kilobytes. So we need to organize those pages into some sort of data structure that buckets those pages together, so when we throw those pages onto S3, we throw a bunch of them together in a chunk. That data structure is an LSM tree - an implementation of an LSM tree that we built from scratch in Rust - and its S3 integration offloads older data to S3. There are kind of several use cases. One use case is a very large database; if you have a very large database, chances are large portions of that database are never even touched. So over time, some of that data - maybe it’s data from, I don’t know, five years ago - you don’t really need it, but you’re keeping it there because it doesn’t cost you much, and it’s better to have it for occasional use than not have it at all, or put it in a different system.
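The batching concern is easy to illustrate. This is a hedged sketch of the idea only, not Neon's actual LSM-tree layers: 8 KB pages get accumulated into a much larger chunk, and only the full chunk is shipped to object storage as a single object.

```rust
const PAGE_SIZE: usize = 8192;
// Illustrative target object size; S3 is much happier with a few
// large objects than with millions of 8 KB ones.
const CHUNK_TARGET: usize = 128 * 1024 * 1024;

struct ColdChunk {
    keys: Vec<u64>, // which pages this chunk contains
    data: Vec<u8>,  // their concatenated contents
}

impl ColdChunk {
    fn new() -> Self {
        Self { keys: Vec::new(), data: Vec::with_capacity(CHUNK_TARGET) }
    }

    /// Append one cold page; returns true once the chunk is big enough
    /// to be worth uploading as a single S3 object.
    fn push(&mut self, page_no: u64, page: &[u8; PAGE_SIZE]) -> bool {
        self.keys.push(page_no);
        self.data.extend_from_slice(page);
        self.data.len() >= CHUNK_TARGET
    }
}

/// Placeholder for the actual upload (e.g. an S3 PutObject call).
fn upload_chunk(_chunk: &ColdChunk) { /* PUT to object storage */ }
```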

And the other use case is where you have a big fleet of databases; a lot of them are scaled down to zero, because you just have them for occasional usage, and if you keep them hot, that will start to add up both on the compute side and on the storage side. Storing all that data on SSDs is very different economics from storing it in S3 in compressed form. So that’s the second place where integration with S3 can drive much better economics.