
I'm curious as to how you plan to scale PostHog to larger users. As the person who scaled Heap, here is my honest opinion of this. I think there is going to be a huge challenge ahead in scaling query performance. This was perpetually a challenge at Heap and was for a long time the main limitation on Heap's growth.

The challenge was tough enough for Heap, and PostHog is going to be at a huge disadvantage due to the lack of multi-tenancy. When you use Heap, your data is stored across Heap's entire cluster of machines. When you run a query, that query is run simultaneously against every single machine in Heap's cluster. Even though your data may take up something like 0.1% of the total disk space, when you run a query, 100% of the disk throughput of Heap's cluster goes to processing your query. It's not an overstatement to say this alone results in a >50x improvement in query performance.
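To make that concrete, here's a toy back-of-the-envelope model (the numbers are made up purely for illustration, not Heap's actual figures): query latency is roughly data scanned divided by the aggregate scan throughput of the nodes doing the work.

```python
# Toy model of the multi-tenancy advantage. All numbers here are
# hypothetical, chosen only to show the shape of the math.
NODE_SCAN_GB_PER_SEC = 2.0  # assumed per-node disk scan throughput


def query_seconds(data_gb: float, nodes: int) -> float:
    """Seconds to scan one customer's data when `nodes` machines share the work."""
    return data_gb / (nodes * NODE_SCAN_GB_PER_SEC)


# Single-tenant: 2 TB on a cluster sized for that one customer alone.
dedicated = query_seconds(2000, nodes=4)

# Multi-tenant: the same 2 TB striped across a 100-node shared cluster,
# where every node participates in every query.
shared = query_seconds(2000, nodes=100)

print(f"speedup: {dedicated / shared:.0f}x")  # 25x in this toy setup
```

The exact multiplier depends entirely on the assumed cluster sizes, but the point stands: in a shared cluster, each customer's query borrows throughput that was paid for by everyone's data.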

I honestly think Heap wouldn't be possible without multi-tenancy. It's hard enough as is to get queries that process multiple terabytes of data to return in seconds when you have a fleet of dozens of i3s available. I'm not sure how you would do that with a fleet a tiny fraction of that size. If you're curious about Heap's infrastructure, Heap's CTO, Dan Robinson, has given a number of talks on how it works[0][1].

That's not to say that PostHog won't work for anyone. I previously tried (and failed) to start a company based on optimizing people's Postgres instances. One of the big takeaways I had was that no matter how you use it, Postgres will work completely fine as long as you have <5GB of data. I think if you have a modest amount of data, something like PostHog would work perfectly fine for you. Since the Postgres optimization business didn't work out, I wound up pivoting to freshpaint.io, which eliminates the need to set up event tracking for your analytics and marketing tools by automatically instrumenting your entire site. Since I started working on it, things have been going a lot better.

[0] https://www.youtube.com/watch?v=NVl9_6J1G60

[1] https://www.youtube.com/watch?v=iJLq3GV1Dyk



For larger volumes of events, we wouldn't recommend using Postgres.

The nice thing about single-tenancy is that in reality lots of users have small enough datasets that scaling isn't a problem. Heap et al have to scale to all of their users combined (as you said, terabytes); we just have to scale to the biggest user. Postgres also allows you to get started very quickly and run lots of queries yourself.

In our docs we explain our thinking more. Postgres is great for the vast majority of use-cases, and we're working hard to optimise those queries. Once users get beyond Postgres, we have integrations with databases that can scale well across many hosts, and we provide services around this to help people size their servers correctly.


> Heap et al have to scale to all of their users combined.

The hard part wasn't scaling the system to handle all users combined. The hard part was designing the system such that when an individual user runs a query, they would get their results back in a reasonable amount of time.

Having every user in a single cluster made this easier because an individual customer could make use of the compute power of a cluster that was sized to fit the data for everyone in it. In other words, if Heap doubled the number of customers, Heap would get twice as fast for everyone. That's not true for PostHog.

> Heap et al have to scale to all of their users combined (as you said, terabytes), we just have to scale to the biggest user.

A decent-sized Heap customer had multiple terabytes of data, with the largest being well beyond that. You're going to have to figure out how to scale PostHog to that point without the benefits of multi-tenancy.

> Once users get beyond Postgres, we have integrations with databases that can scale well across many hosts, and we provide services around this to help people size their servers correctly.

I think a cluster of servers that could churn through terabytes of data in seconds would be prohibitively expensive for any individual customer to purchase.


"One of the big takeaways I had was that no matter how you use it, Postgres will work completely fine as long as you have <5GB of data."

Surely you meant "5TB", not "5GB"?


> Surely you meant "5TB", not "5GB"?

I meant what I said. You can literally just set up a PG instance and it will work perfectly fine up to a few GB. At that point, you will probably start to see certain slow queries due to bad query plans. All you need to do is learn the basics of EXPLAIN ANALYZE and create a few indexes. That will get you to ~100GB, at which point you will start to have to deploy more serious optimizations like partitioning, denormalization, etc. Once you get to multi-TB Postgres instances, you will have to look at ways to horizontally scale your DB. This can be done in the Postgres world with something like Citus, but you would probably also want to look at non-Postgres-based alternatives.
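As a tiny illustration of that first step (spot a sequential scan in the plan, add an index), here's a sketch using SQLite from the Python standard library so it runs anywhere; the same workflow on Postgres uses EXPLAIN ANALYZE and CREATE INDEX, though the plan output looks different.

```python
import sqlite3

# Sketch of the "look at the plan, add an index" workflow. SQLite is
# used here only because it ships with Python; the table and index
# names are made up for the example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [(i % 100, "pageview") for i in range(10_000)],
)


def plan(sql: str) -> str:
    """Return SQLite's query plan as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " ".join(row[-1] for row in rows)


query = "SELECT COUNT(*) FROM events WHERE user_id = 42"
before = plan(query)  # full table scan (something like "SCAN events")

conn.execute("CREATE INDEX idx_events_user ON events (user_id)")
after = plan(query)   # now an index search mentioning idx_events_user

print(before)
print(after)
```

The same query goes from reading every row to touching only the matching index entries; on a table with millions of rows that's the difference between seconds and milliseconds.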


This is kind of shifty... is it 5GB or 100GB?

Yes, if you are dealing with large databases, you need to learn about... dealing with the large databases. 5GB is something that a small laptop will do.



