Hacker News

It kind of is a lot of data... archiving entire websites and keeping multiple snapshots is pretty close to what Google and other search engines do, except search engines also need to generate distributed indexes along with some versioning information.

Not to mention that, historically speaking, you cannot trust Google to keep this information preserved, or public. Look at what happened to any number of other tools Google once offered. TBH, I would like to see funding via a grant from the Library of Congress towards archive.org.

I'm curious how many copies of a piece of data are needed for it to be "safe" in something as flexible and unknown as end-user/volunteer storage. It's one thing for compute jobs that can be re-queued in a day or two if abandoned... it's another when every copy of a record happens to walk away. Let alone the communications protocol... this goes way beyond most Bigtable implementations.
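For a rough sense of the numbers, here's a back-of-envelope sketch (my own assumptions, not from the thread): if each volunteer node independently vanishes within some time window with probability p, a record with k replicas is lost only when every copy walks away at once.

```python
# Toy model, assuming independent node failures (real volunteer nodes
# are correlated -- power outages, software bugs -- so this is optimistic).

def loss_probability(p_node_loss: float, replicas: int) -> float:
    """P(all replicas gone in the window), assuming independence."""
    return p_node_loss ** replicas

def replicas_needed(p_node_loss: float, target_loss: float) -> int:
    """Smallest replica count keeping loss probability under target."""
    k = 1
    while loss_probability(p_node_loss, k) > target_loss:
        k += 1
    return k

# Flaky volunteer storage: say a 30% chance a node disappears per month.
# Hitting nine nines of durability then takes a lot of copies:
print(replicas_needed(0.30, 1e-9))  # → 18
```

The point of the sketch is that with unreliable, uncoordinated nodes the replica count needed is far beyond the 3x a datacenter uses, before even considering repair traffic when copies are detected missing.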



On a purely numerical comparison, the article mentions 27TB, which really isn't much in terms of size, especially compared to what some of the companies using AWS produce daily.

EDIT: according to comments below, it looks like it's about 20+ petabytes, which is actually a fairly large amount.


The article mentions 27TB so far at a point where they appear to still be focused on making their tools better. 27TB is a tiny proportion of IA.


Cite: https://en.wikipedia.org/wiki/Internet_Archive

> As of October 2012, its collection topped 10 petabytes.

I'd be curious what it's grown to since, but Google didn't immediately tell me.



Not for Google or S3.


Seeing as how the out-the-door price for 12PB of data hosted on S3 RRS storage (99.99% durability, versus standard S3's 99.999999999%) is roughly $300k/month, I'd hardly call it trivial.
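For what it's worth, that figure is consistent with a blended rate around $0.025/GB-month (my assumption for illustration; actual S3 RRS pricing was tiered and varied over time):

```python
# Back-of-envelope check of the ~$300k/month figure. The per-GB rate
# is an assumed blended Reduced Redundancy price, not an official quote.
petabytes = 12
gigabytes = petabytes * 1_000_000      # decimal GB, as S3 bills
rate_per_gb_month = 0.025              # assumed $/GB-month
monthly_cost = gigabytes * rate_per_gb_month
print(f"${monthly_cost:,.0f}/month")   # → $300,000/month
```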


A few things:

1) They have it sitting unused presently. The nominal cost of providing it would be zero.

2) RRS is fine. So would Glacier be. This reduces costs further.

3) There is significant PR benefit to such a move.


I'm not sure how you've figured they have 12PB of disk (or possibly more like 36PB, given replication for 11 9s of durability) just lying around. The whole meme that EC2 came about because of spare capacity is incorrect, and the same goes for all of the other services. Running a business whose MO is to keep lowering costs and make a profit on razor-thin margins doesn't lend itself to lots of unused infrastructure. They aren't going to sink costs into infrastructure without realizing a return as soon as possible.

Server amortization, future cost planning, depreciation, power consumption, etc. are all closely calculated and factored into budgets. Thinking they can support that much data for free, or at minimal cost, because it feels good is naive.


Google and Amazon could very well run nodes within the archive.org network once it is up... and could very well offload a significant amount of data. But the nominal cost is anything but near zero; the wear on hard drives alone would be costly.

My point was that you can't rely on them to keep said data available. Not that they couldn't participate... but I wouldn't trust anything less than a "lifetime" (of the company) or a 25-year minimum commitment as anything but transient.


> Not to mention, that historically speaking, you cannot trust google to keep this information preserved, or public. Look at what happened to any number of other tools google once offered.

Google is hardly going to abandon cloud storage anytime soon. Google's willingness to release and kill small projects has no rational impact on the reliability of their core services.

Historically speaking, Google helped preserve Usenet archives via Google Groups and digitized library archives via Google Books. Both projects have seen their share of ups and downs, but they're still publicly available.


It's odd that you'd use the Deja News archive as an example. The way Google handled that was pretty terrible.

Arguably, moving away from the Google Groups search page to the general search page with the site:groups.google.com term is better, but now it's really hard to find stuff and really hard to search by different parameters.


Yes, and Google Books was hampered by backlash from authors. That's why I noted both projects had a history of ups and downs. However, both are still going.

The point is that, if we're talking about Google's history, they have shown an interest in preserving data. And in regards to the Internet Archive, they'd only be serving as a backup storage provider, not in a frontend role.

This is not to say that Google (or Amazon or Microsoft) should be the only backup provider for the Internet Archive, for example, but it would hardly be a bad thing, as tracker1 suggested, if they cared to donate the resources.


> Google is hardly going to abandon cloud storage anytime soon.

I wonder what "soon" means in this context. If I wanted to archive anything I would think in the timescale of decades, and that's only for my personal photos and videos.



