Hacker News

It kind of is a lot of data... archiving entire websites and keeping multiple snapshots is pretty close to what Google and other search engines do, except search engines also need to generate distributed indexes along with some versioning information.

Not to mention that, historically speaking, you cannot trust Google to keep this information preserved, or public. Look at what happened to any number of other tools Google once offered. TBH, I would like to see funding via a grant from the Library of Congress towards archive.org.

I'm curious how many copies of a piece of data are needed for it to be "safe" in something as flexible and unknown as end-user/volunteer storage. It's one thing for compute jobs that can be re-queued in a day or two if abandoned... it's another when every copy of a record happens to walk away. Let alone the communications protocol... this goes way beyond most Bigtable implementations.
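For a rough sense of the numbers, here's a back-of-envelope sketch (my own assumptions, not from the thread): if each volunteer node independently vanishes within some time window with probability p, a record with k replicas is lost only when every copy walks away at once.

```python
# Toy model, assuming independent node failures (real volunteer nodes
# are correlated -- power outages, software bugs -- so this is optimistic).

def loss_probability(p_node_loss: float, replicas: int) -> float:
    """P(all replicas gone in the window), assuming independence."""
    return p_node_loss ** replicas

def replicas_needed(p_node_loss: float, target_loss: float) -> int:
    """Smallest replica count keeping loss probability under target."""
    k = 1
    while loss_probability(p_node_loss, k) > target_loss:
        k += 1
    return k

# Flaky volunteer storage: say a 30% chance a node disappears per month.
# Hitting nine nines of durability then takes a lot of copies:
print(replicas_needed(0.30, 1e-9))  # → 18
```

The point of the sketch is that with unreliable, uncoordinated nodes the replica count needed is far beyond the 3x a datacenter uses, before even considering repair traffic when copies are detected missing.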



On a purely numerical comparison, the article mentions 27TB, which really isn't much in terms of size, especially compared to what some of the companies using AWS produce daily.

EDIT: according to comments below, it looks like it's about 20+ petabytes, which is actually a fairly large amount.


The article mentions 27TB so far at a point where they appear to still be focused on making their tools better. 27TB is a tiny proportion of IA.


Cite: https://en.wikipedia.org/wiki/Internet_Archive

> As of October 2012, its collection topped 10 petabytes.

I'd be curious what it's grown to since, but Google didn't immediately tell me.



Not for Google or S3.


Seeing as how the out-the-door price for 12PB of data hosted on S3 RRS storage (99.99% durability, versus standard S3's 99.999999999%) is roughly $300k/month, I'd hardly call it trivial.
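For what it's worth, that figure is consistent with a blended rate around $0.025/GB-month (my assumption for illustration; actual S3 RRS pricing was tiered and varied over time):

```python
# Back-of-envelope check of the ~$300k/month figure. The per-GB rate
# is an assumed blended Reduced Redundancy price, not an official quote.
petabytes = 12
gigabytes = petabytes * 1_000_000      # decimal GB, as S3 bills
rate_per_gb_month = 0.025              # assumed $/GB-month
monthly_cost = gigabytes * rate_per_gb_month
print(f"${monthly_cost:,.0f}/month")   # → $300,000/month
```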


A few things:

1) They have it sitting unused presently. The nominal cost of providing it would be zero.

2) RRS is fine. So would Glacier be. This reduces costs further.

3) There is significant PR benefit to such a move.


I'm not sure how you've figured they have 12PB of disk (or possibly more like 36PB, given replication for 11 9s of durability) just lying around. The whole meme that EC2 came about because of spare capacity is incorrect, and the same goes for all of the other services. Running a business whose MO is to keep lowering costs and make a profit on razor-thin margins doesn't lend itself to lots of unused infrastructure. They aren't going to sink costs into infrastructure without realizing a return as soon as possible.

Server amortization, future cost planning, depreciation, power consumption, etc. are all closely calculated and factored into budgets. Thinking they can support that much data for free, or at minimal cost, because it feels good is naive.


Google and Amazon could very well run nodes within the archive.org network once it is up... and could very well offload a significant amount of data. But the nominal cost is anything but near zero; the wear on hard drives alone would be costly.

My point was that you can't rely on them to keep said data available. Not that they couldn't participate... but I wouldn't trust anything less than a "lifetime" (of the company) or a 25-year minimum commitment as anything but transient.


> Not to mention, that historically speaking, you cannot trust google to keep this information preserved, or public. Look at what happened to any number of other tools google once offered.

Google is hardly going to abandon cloud storage anytime soon. Google's willingness to release and kill small projects has no rational impact on the reliability of their core services.

Historically speaking, Google helped preserve Usenet archives via Google Groups and digitized library archives via Google Books. Both projects have seen their share of ups and downs, but they're still publicly available.


It's odd that you'd use the Deja News archive as an example. The way Google handled that was pretty terrible.

Arguably, moving away from the Google Groups search page to the general search page with the site:groups.google.com term is better, but now it's really hard to find stuff and really hard to search by different parameters.


Yes, and Google Books was hampered by backlash from authors. That's why I noted both projects had a history of ups and downs. However, both are still going.

The point is that, if we're talking about Google's history, they have shown an interest in preserving data. And in regards to the Internet Archive, they'd only be serving as a backup storage provider, not in a frontend role.

This is not to say that Google (or Amazon or Microsoft) should be the only backup provider for the Internet Archive, for example, but it would hardly be a bad thing, as tracker1 suggested, if they cared to donate the resources.


> Google is hardly going to abandon cloud storage anytime soon.

I wonder what "soon" means in this context. If I wanted to archive anything I would think in the timescale of decades, and that's only for my personal photos and videos.



