
Yes, we need search engines, but they don't need to be monolithic. Imagine that indexing the text of your average web page takes up 10 KB. Then you get 100,000 pages per gigabyte, which means that if you spend ~270 USD on a consumer 10 TB drive you can index a billion web pages. Google no longer says how many pages it indexes, but it's estimated to be within one order of magnitude of that.
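
To make the back-of-envelope math explicit, here is a quick sketch in Python (the 10 KB-per-page figure and the drive size are the assumptions from above, not measured numbers):

    # Back-of-envelope storage math, assuming ~10 KB of indexed text
    # per page and a 10 TB consumer drive (assumptions from above).
    bytes_per_page = 10 * 1024            # ~10 KB per page
    drive_bytes = 10 * 10**12             # 10 TB drive, ~270 USD
    pages = drive_bytes // bytes_per_page
    print(f"{pages:,} pages")             # ~976,562,500 -> roughly a billion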

This means that in terms of hardware, you can build your own Google. Then you get to decide how it ranks things, you don't have to worry about ads, and SEO becomes much harder because there is no longer one target to SEO against. Google obviously doesn't want you to do this (and in fairness Google indexes a lot of stuff beyond keywords from web pages), but it would be very possible to build an open-source, configurable search engine that anyone could install, run, and get good results out of.

(Example: the Pirate Bay database, which arguably indexes the vast majority of available music/TV/film/software, was/is small enough to be downloaded and cloned by users.)



Google's paper on Percolator from 2010 says there are more than 1 trillion web pages. Nine years later there are surely far more than that.

https://ai.google/research/pubs/pub36726

The real issue would be crawling and indexing all those pages. How long would it take an average user's computer with a 10 Mb/s internet connection to crawl the entire web? It's not as easy a problem as you make it seem.


I'm not saying it's easy, it's not, but people tend to think that because Google is so huge, you have to be that huge to do what Google does. My argument is that Google needs expensive hardware because it has so many users, not because delivering the service to one or a few users requires that hardware.

I have a gigabit link to my apartment (go Swedish infrastructure!). At that theoretical speed I get 450 GB an hour, so I could download 10 TB in about a day. We could easily slow that down by an order of magnitude and it's still a very viable thing to do. If someone wrote the software to do this, one could imagine some kind of federated solution for downloading the data, so that every user doesn't have to hit every web server.
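
Rough crawl-time arithmetic, using the link speeds mentioned in this thread and treating them as sustained throughput (optimistic, since politeness delays and slow servers dominate a real crawl):

    # How long does 10 TB of page text take to download at the
    # speeds mentioned above? Assumes sustained, ideal throughput.
    total_bytes = 10 * 10**12
    for name, bits_per_sec in [("10 Mb/s", 10e6), ("1 Gb/s", 1e9)]:
        days = total_bytes * 8 / bits_per_sec / 86400
        print(f"{name}: {days:.1f} days")
    # 1 Gb/s : ~0.9 days  (matches the ~450 GB/hour figure)
    # 10 Mb/s: ~92.6 days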


Could be done with a p2p "swarm". Peers get assigned pages to index, then share the result.
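
A minimal sketch of how that assignment could work: hash each URL to one of the peers so every peer only crawls and indexes its own share (peer names and the URL are hypothetical, and a real swarm would also need to handle peers joining and leaving, e.g. with consistent hashing):

    # Toy work assignment for a p2p crawl swarm: hash each URL to a
    # peer so the crawl/index load is split. Peer ids are made up.
    import hashlib

    peers = ["peer-a", "peer-b", "peer-c"]

    def assign(url: str) -> str:
        h = int(hashlib.sha1(url.encode()).hexdigest(), 16)
        return peers[h % len(peers)]

    print(assign("https://example.org/some/page"))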


How would you begin indexing everything?


This is a good question. Crawling and storing the pages is the easy part... searching them with a sub-second response time is much harder. Which current DB platforms can handle the size of data that Google indexes?
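
For a sense of what the lookup side involves, here is a toy in-memory inverted index; the hard part at Google's scale is sharding this structure across many machines and keeping intersections fast, but the core idea is just term -> document ids:

    # Toy inverted index: term -> set of document ids.
    # Real engines shard this across thousands of machines to keep
    # query latency under a second at web scale.
    from collections import defaultdict

    index = defaultdict(set)

    def add_document(doc_id, text):
        for term in text.lower().split():
            index[term].add(doc_id)

    def search(query):
        postings = [index[t] for t in query.lower().split()]
        return set.intersection(*postings) if postings else set()

    add_document(1, "open source search engine")
    add_document(2, "distributed search index")
    print(search("search index"))   # {2}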



