
The article compares raw vs base64+gzip, but I'd be interested to see gzip vs base64+gzip


Did some further investigation:

I included some text files, hex encoding, and other compressors as well :)

raw file list: https://hastebin.com/vazonowuvo.txt

tsv (copy/paste into a spreadsheet): https://hastebin.com/ewohafucem.tsv

---

(Interesting) With bzip2, compression is better when the following files are first encoded with base64 or hex: bing.png, googlelogo.png, peppers_color.jpg

Useless takeaways:

- prefer base64 over hex when encoding already compressed images before further compression

- prefer hex over base64 when encoding plain text / low entropy data before further compression
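These takeaways are easy to sanity-check with Python's stdlib. A minimal sketch (the sample inputs below are stand-ins I made up, not the files from the list above):

```python
import base64
import binascii
import bz2

def sizes(data: bytes) -> dict:
    """bzip2-compressed size of the raw, base64, and hex encodings of data."""
    return {
        "raw": len(bz2.compress(data)),
        "base64": len(bz2.compress(base64.b64encode(data))),
        "hex": len(bz2.compress(binascii.hexlify(data))),
    }

# Low-entropy plain text (the case where hex reportedly does relatively well).
text = b"the quick brown fox jumps over the lazy dog\n" * 500
# An already-compressed blob stands in for a JPEG/PNG payload.
blob = bz2.compress(text)

print("text:", sizes(text))
print("compressed blob:", sizes(blob))
```

Swapping in the actual image files from the raw file list would reproduce the comparison directly.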


Heh, so gzipped PNGs are generally larger than non-gzipped ones. The samples are probably heavily compressed, though.


Not just that, PNG and gzip use the same compression algorithm:

https://en.wikipedia.org/wiki/DEFLATE


The formats he uses as raw are already compressed by default, so gzipping them wouldn't help. Gzipping the new encoding makes sense.


Theoretically base64+gzip and gzip should be almost equivalent. For more aggressive compressors they should be equivalent.
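The "almost" can be measured. A hedged sketch using `zlib` as the DEFLATE compressor (the input string is made up): base64 expands the input 4:3, and a decent compressor claws most of that back, leaving a small residual overhead.

```python
import base64
import zlib

data = b"hello, deflate! " * 2048  # 16-byte period, easy for LZ77

gz_raw = zlib.compress(data, 9)
gz_b64 = zlib.compress(base64.b64encode(data), 9)

print(len(gz_raw), len(gz_b64))
```

Running this shows both outputs are tiny relative to the input, with the base64 path slightly larger.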


Well, they will never be exactly equivalent, since the compressor has to learn and encode the set of 64 characters used, and passing along that information has some cost. In practice, that cost is either sending the probabilities up front (for a compressor that does that) or the ramp-up cost of starting with the wrong probabilities (for a compressor that simply uses the observed frequencies as implicit probabilities).

Otherwise we could pass along some information "for free" in any base-64 encoding scheme by choosing some set of 64 characters (there are lots of choices and whichever set we choose encodes a message), encoding the original message with it, and then compressing it back to the original size - leading to "infinite" compression.
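A back-of-the-envelope check of how much information the alphabet choice could actually smuggle (assuming, say, a pool of 95 printable ASCII characters to draw the alphabet from; that pool size is my assumption):

```python
import math

# Ordered choice: which character maps to which 6-bit value.
ordered_bits = math.log2(math.perm(95, 64))
# Unordered choice: just the set of 64 characters.
unordered_bits = math.log2(math.comb(95, 64))

print(f"{ordered_bits:.0f} bits ordered, {unordered_bits:.0f} bits unordered")
```

Either way it's a fixed few hundred bits at most, which is why this "free" side channel can't yield unbounded compression: the per-file gain is constant and vanishes relative to any large file.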

Other reasons base64 can't be compressed exactly back to its original form include the presence of arbitrary newlines, and trailing padding characters.

This doesn't matter in practice when compressing one large base64-encoded file, where the overhead goes to approximately zero. But for a larger non-base64 file (e.g., an HTML file) that contains embedded base64 chunks, it is a real problem: the compressor has to delimit the base64-encoded regions and somehow communicate the new symbol probabilities for each region.


paging BeeOnRope. Can you email me? address in profile. you answered a question i had a while ago and wanted to follow up. thx



