Why and how to search two years back in your Elastic search logs [pdf]

hinkley · on Feb 25, 2020

I don't get many 'new toy' moments with the unix CLI, and writing those words down makes me want to change that situation.

A number of years ago we had your typical low-disk-space-server problem and someone had bought us time by shortening the log rotation interval and compressing the logs. This is how I (re-)discovered zless and zgrep.

Streaming compression dovetails nicely with a number of kinds of tools, but seems to work particularly well on anything with pipe semantics. I'm certain that phenomenon informed the rather long tenure of the tgz file format.
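To make the pipe point concrete, here is a minimal Python sketch of what a zgrep-style tool does: decompress a gzip stream line by line and filter it, never holding the whole file in memory. The function name `zgrep_lines` and the sample log lines are made up for illustration; this is not how zgrep itself is implemented.

```python
import gzip
import io
import re
from typing import Iterator


def zgrep_lines(stream, pattern: str) -> Iterator[str]:
    """Yield lines that match `pattern` from a gzip-compressed binary stream."""
    regex = re.compile(pattern)
    # gzip.open accepts a path or an open binary file object; "rt" decodes
    # to text lazily, so only a small buffer is decompressed at a time.
    with gzip.open(stream, mode="rt", encoding="utf-8", errors="replace") as text:
        for line in text:
            if regex.search(line):
                yield line.rstrip("\n")


if __name__ == "__main__":
    # Hypothetical log lines, compressed in memory to stand in for a rotated log.
    raw = b"GET /health 200\nGET /login 500\nPOST /login 200\n"
    compressed = io.BytesIO(gzip.compress(raw))
    for hit in zgrep_lines(compressed, r" 500$"):
        print(hit)
```

Because the function is a generator, it composes the way a pipe stage does: the consumer pulls one matching line at a time, and decompression only advances as far as needed.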

hinkley · on Feb 25, 2020

I was trying to refresh my memory of the BWT algorithm for compression the other day and stumbled on a guy doing a tutorial on how they use it for gene analysis/searches. One of his assertions was that suffix trees and BWT aren't that far apart, and it has me wondering.

Compressing text and searching text are both about identifying patterns. How much R&D have we done on trying to do both at the same time? Is searching for text in a compressed file in O(log n) to O(sqrt(n)) time a solved problem?
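For anyone who wants to see the suffix-array/BWT connection, here is a rough Python sketch of BWT plus the backward search that FM-index style structures build on. It is deliberately naive: the suffix sort and the `rank` helper are both slow, the function names are made up, and it assumes the text contains no NUL character (used as the sentinel). Real implementations replace `rank` with sampled tables or wavelet trees, so the number of steps depends on the pattern length rather than the text length. Note that the BWT on its own does not shrink anything; it is a transform that makes the text easier to compress.

```python
from collections import Counter


def bwt(text: str) -> tuple[str, list[int]]:
    """Return the BWT of text plus sentinel, and the suffix array (naive sort)."""
    s = text + "\0"  # assumption: "\0" is absent from text and sorts first
    sa = sorted(range(len(s)), key=lambda i: s[i:])
    # The BWT is the character preceding each suffix in sorted order.
    last = "".join(s[i - 1] for i in sa)
    return last, sa


def count_occurrences(last: str, pattern: str) -> int:
    """Count occurrences of a non-empty pattern using backward search."""
    # c_table[ch] = number of characters in the text strictly smaller than ch.
    counts = Counter(last)
    c_table, total = {}, 0
    for ch in sorted(counts):
        c_table[ch] = total
        total += counts[ch]

    def rank(ch: str, i: int) -> int:
        # Occurrences of ch in last[:i]; O(n) here, O(1) in a real index.
        return last.count(ch, 0, i)

    lo, hi = 0, len(last)  # current range of matching suffixes, half-open
    for ch in reversed(pattern):
        if ch not in c_table:
            return 0
        lo = c_table[ch] + rank(ch, lo)
        hi = c_table[ch] + rank(ch, hi)
        if lo >= hi:
            return 0
    return hi - lo


if __name__ == "__main__":
    last, _ = bwt("banana")
    print(count_occurrences(last, "ana"))
```

The search never looks at the original text, only at the transformed string and two small tables, which is the sense in which the same structure serves both compression and search.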

rolling_roland · on Feb 25, 2020

There's literature on this topic; it's called succinct data structures. Wikipedia has you covered as usual: https://en.wikipedia.org/wiki/Succinct_data_structure

bleonard · on Feb 25, 2020

This seems to use https://cloudvyzor.com/downloads.html. Is it only on Windows?

gdm85 · on Feb 25, 2020

Is this an ad?

kevrone · on Feb 25, 2020

S3 Select

marcinzm · on Feb 25, 2020

One issue with the title: the presentation says CSV and not JSON.

dang · on Feb 25, 2020

We changed the title to that of the article, as the site guidelines ask.

(Submitted title was "Convert: Elastic Search snapshots to zipped JSONs. 60TB to 3TB searchable [pdf]")