
I'm curious as to how you plan to scale PostHog to larger users. As the person who scaled Heap, here is my honest opinion of this. I think there is going to be a huge challenge ahead in scaling query performance. This was perpetually a challenge at Heap and was for a long time the main limitation on Heap's growth.

The challenge was tough enough for Heap, and PostHog is going to be at a huge disadvantage due to the lack of multi-tenancy. When you use Heap, your data is stored across Heap's entire cluster of machines. When you run a query, that query is run simultaneously against every single machine in Heap's cluster. Even though your data may take up something like 0.1% of the total disk space, when you run a query, 100% of the disk throughput of Heap's cluster goes to processing your query. It's not an overstatement to say this alone results in a >50x improvement in query performance.
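To make that concrete, here's a toy back-of-the-envelope model (the numbers are made up purely for illustration, not Heap's actual figures): query latency is roughly data scanned divided by the aggregate scan throughput of the nodes doing the work.

```python
# Toy model of the multi-tenancy advantage. All numbers here are
# hypothetical, chosen only to show the shape of the math.
NODE_SCAN_GB_PER_SEC = 2.0  # assumed per-node disk scan throughput


def query_seconds(data_gb: float, nodes: int) -> float:
    """Seconds to scan one customer's data when `nodes` machines share the work."""
    return data_gb / (nodes * NODE_SCAN_GB_PER_SEC)


# Single-tenant: 2 TB on a cluster sized for that one customer alone.
dedicated = query_seconds(2000, nodes=4)

# Multi-tenant: the same 2 TB striped across a 100-node shared cluster,
# where every node participates in every query.
shared = query_seconds(2000, nodes=100)

print(f"speedup: {dedicated / shared:.0f}x")  # 25x in this toy setup
```

The exact multiplier depends entirely on the assumed cluster sizes, but the point stands: in a shared cluster, each customer's query borrows throughput that was paid for by everyone's data.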

I honestly think Heap wouldn't be possible without multi-tenancy. It's hard enough as is to get queries that process multiple terabytes of data to return in seconds when you have a fleet of dozens of i3s available. I'm not sure how you would do that with a fleet a tiny fraction of that size. If you're curious about Heap's infrastructure, Heap's CTO, Dan Robinson, has given a number of talks on how it works[0][1].

That's not to say that PostHog won't work for anyone. I previously tried (and failed) to start a company based on optimizing people's Postgres instances. One of the big takeaways I had was that no matter how you use it, Postgres will work completely fine as long as you have <5GB of data. I think if you have a modest amount of data, something like PostHog would work perfectly fine for you. Since the Postgres optimization business didn't work out, I wound up pivoting to freshpaint.io, which eliminates the need to set up event tracking for your analytics and marketing tools by automatically instrumenting your entire site. Since I started working on it, things have been going a lot better.

[0] https://www.youtube.com/watch?v=NVl9_6J1G60

[1] https://www.youtube.com/watch?v=iJLq3GV1Dyk



For larger volumes of events, we wouldn't recommend using Postgres.

The nice thing about single-tenancy is that in reality lots of users have small enough datasets that scaling isn't a problem. Heap et al have to scale to all of their users combined (as you said, terabytes); we just have to scale to the biggest user. Postgres also allows you to get started very quickly and run lots of queries yourself.

In our docs we explain our thinking more. Postgres is great for the vast majority of use-cases, and we're working hard to optimise those queries. Once users get beyond Postgres, we have integrations with databases that can scale well across many hosts, and we provide services around this to help people size their servers correctly.


> Heap et al have to scale to all of their users combined.

The hard part wasn't scaling the system to handle all users combined. The hard part was designing the system such that when an individual user runs a query, they would get their results back in a reasonable amount of time.

Having every user in a single cluster made this easier because an individual customer could make use of the compute power of a cluster that was sized to fit the data for everyone in it. In other words, if Heap doubled the number of customers, Heap would get twice as fast for everyone. That's not true for PostHog.

> Heap et al have to scale to all of their users combined (as you said, terabytes), we just have to scale to the biggest user.

A decent-sized Heap customer had multiple terabytes of data, with the largest being well beyond that. You're going to have to figure out how to scale PostHog to that point without the benefits of multi-tenancy.

> Once users get beyond Postgres, we have integrations with databases that can scale well across many hosts, and we provide services around this to help people size their servers correctly.

I think a cluster of servers that could churn through terabytes of data in seconds would be prohibitively expensive for any individual customer to purchase.


"One of the big takeaways I had was that no matter how you use it, Postgres will work completely fine as long as you have <5GB of data."

Surely you meant "5TB", not "5GB"?


> Surely you meant "5TB", not "5GB"?

I meant what I said. You can literally just set up a PG instance and it will work perfectly fine up to a few GB. At that point, you will probably start to see certain slow queries due to bad query plans. All you need to do is learn the basics of EXPLAIN ANALYZE and create a few indexes. That will get you to ~100GB, at which point you will start to have to deploy more serious optimizations like partitioning, denormalization, etc. Once you get to multi-TB Postgres instances, you will have to look at ways to horizontally scale your DB. This can be done in the Postgres world with something like Citus, but you would probably also want to look at non-Postgres-based alternatives.
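As a tiny illustration of that first step (spot a sequential scan in the plan, add an index), here's a sketch using SQLite from the Python standard library so it runs anywhere; the same workflow on Postgres uses EXPLAIN ANALYZE and CREATE INDEX, though the plan output looks different.

```python
import sqlite3

# Sketch of the "look at the plan, add an index" workflow. SQLite is
# used here only because it ships with Python; the table and index
# names are made up for the example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [(i % 100, "pageview") for i in range(10_000)],
)


def plan(sql: str) -> str:
    """Return SQLite's query plan as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " ".join(row[-1] for row in rows)


query = "SELECT COUNT(*) FROM events WHERE user_id = 42"
before = plan(query)  # full table scan (something like "SCAN events")

conn.execute("CREATE INDEX idx_events_user ON events (user_id)")
after = plan(query)   # now an index search mentioning idx_events_user

print(before)
print(after)
```

The same query goes from reading every row to touching only the matching index entries; on a table with millions of rows that's the difference between seconds and milliseconds.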


This is kind of shifty... is it 5GB or 100GB?

Yes, if you are dealing with large databases, you need to learn about... dealing with the large databases. 5GB is something that a small laptop will do.



