
Saving to text file is inefficient. I save sqlite databases using VACUUM INTO, like this:

  sqlite3 -readonly /path/db.sqlite "VACUUM INTO '/path/backup.sqlite';"
From https://sqlite.org/lang_vacuum.html :

  The VACUUM command with an INTO clause is an alternative to the backup API for generating backup copies of a live database. The advantage of using VACUUM INTO is that the resulting backup database is minimal in size and hence the amount of filesystem I/O may be reduced.


It's cool but it does not address the issue of indexes, mentioned in the original post. Not carrying index data over the slow link was the key idea. The VACUUM INTO approach keeps indexes.
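If the goal is to avoid shipping index data over the slow link, one approach is to copy the database into memory, drop the indexes there, and VACUUM INTO from the copy, sending the saved CREATE INDEX DDL alongside so the receiver can rebuild. A sketch using Python's built-in sqlite3 module (the function name and the in-memory step are my own; VACUUM INTO needs SQLite >= 3.27):

```python
import sqlite3

def copy_without_indexes(src_path, dst_path):
    """Copy src_path to dst_path with all indexes dropped.

    Returns the CREATE INDEX DDL so the receiver can rebuild the
    indexes. The source database is never modified.
    """
    src = sqlite3.connect(src_path)
    mem = sqlite3.connect(":memory:")
    src.backup(mem)  # full copy into RAM; the on-disk source stays untouched
    src.close()
    # sqlite_master holds the original DDL; auto-indexes have sql IS NULL.
    ddl = [sql for (sql,) in mem.execute(
        "SELECT sql FROM sqlite_master"
        " WHERE type = 'index' AND sql IS NOT NULL")]
    for (name,) in mem.execute(
            "SELECT name FROM sqlite_master"
            " WHERE type = 'index' AND sql IS NOT NULL").fetchall():
        mem.execute(f'DROP INDEX "{name}"')
    mem.commit()
    mem.execute("VACUUM INTO ?", (dst_path,))  # needs SQLite >= 3.27
    mem.close()
    return ddl
```

Replaying the returned DDL on the receiving side rebuilds the indexes locally instead of transferring them.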

A text file may be inefficient as is, but it's perfectly compressible, even with primitive tools like gzip. I'm not sure the SQLite binary format compresses equally well, though it might.


SQLite tosses out the SQL once it is parsed into bytecode. Using text is just going to take longer, even though I’m sure it works great.

You can modify the database before vacuuming by making a new in-memory database, copying selected tables into it, and then vacuuming that to disk.
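A sketch of that with Python's built-in sqlite3 module (the function and table names are illustrative; VACUUM INTO needs SQLite >= 3.27):

```python
import sqlite3

def export_tables(src_path, dst_path, tables):
    """Write a minimal database containing only the named tables."""
    mem = sqlite3.connect(":memory:")
    mem.execute("ATTACH DATABASE ? AS src", (src_path,))
    for t in tables:
        # CREATE TABLE ... AS SELECT copies columns and rows, but not
        # indexes, constraints, or triggers -- recreate those if needed.
        mem.execute(f'CREATE TABLE "{t}" AS SELECT * FROM src."{t}"')
    mem.execute("DETACH DATABASE src")
    mem.commit()
    mem.execute("VACUUM INTO ?", (dst_path,))  # needs SQLite >= 3.27
    mem.close()
```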


This should be the accepted answer.


> A text file may be inefficient as is, but it's perfectly compressible, even with primitive tools like gzip. I'm not sure the SQLite binary format compresses equally well, though it might.

I hope you’re saying that because of indexes? I think you may want to revisit how compression works to fix your intuition. Text+compression will always be larger and slower than equivalent binary+compression, assuming the text and binary represent the same contents. Why? Binary is less compressible as a percentage, but it starts off smaller in absolute terms, which results in a smaller absolute compressed size.

A way to think about it is information theory: binary should generally represent the data more compactly already, because the structure lives in the code. Compression works by replacing redundant structure with shorter codes, and it works better when there’s a lot of redundant structure. However, while text has a lot of redundant structure, that’s actually bad for the compressor, because it has to find that structure and process more data to do so. Additionally, compression uses generic mathematical techniques to remove that structure, which are generically applicable but not as effective as removing the structure by hand via a binary format.

There’s some nuance here because the text represents slightly different things than the raw binary SQLite (how to restore the data in the db vs. the precise relationships and data structures for allowing insertion/retrieval). But I’d still expect the binary to end up smaller compressed for non-trivial databases.


Below I'm discussing compressed size rather than how "fast" it is to copy databases.

Yeah, there are indexes. And even without indexes there's an entire b-tree sitting above the data. So we're weighing the benefits of domain-dependent compression (the binary format) against dropping all of the derived data. I'm not sure how that will go, but let's try one.

Here is an SQLite file containing metadata for Apple's Photos application:

    767979520 May  1 07:28 Photos.sqlite
Doing a VACUUM INTO:

    719785984 May  1 08:56 photos.sqlite
gzip -k photos.sqlite (this took 20 seconds):

    303360460 May  1 08:56 photos.sqlite.gz
sqlite3 -readonly photos.sqlite .dump > photos.dump (10 seconds):

    1277903237 May  1 09:01 photos.dump
gzip -k photos.dump (21 seconds):

    285086642 May  1 09:01 photos.dump.gz
About 6% smaller for the dump vs. the original binary (but there are a bunch of indexes in this one). For me, the small space savings wouldn't be worth the extra time spent doing the dump.

With indexes dropped and vacuumed, the compressed binary is 8% smaller than compressed text (despite btree overhead):

    566177792 May  1 09:09 photos_noindex.sqlite
    262067325 May  1 09:09 photos_noindex.sqlite.gz
About 13.5% smaller than compressed binary with indices. And one could re-add the indices on the other side.


Yup, these results are pretty consistent with what I'd expect (and why I noted the impact of indices), because even string data carries a lot of superfluous information when expressed as SQL statements ("INSERT INTO foo ...") - I would expect all of that to exceed any bookkeeping within the btree. And non-string values like blobs or numbers are stored more efficiently in the binary format than in the dump, which is a text encoding (or even hex for blobs) that blows things up further.
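The blob blow-up is easy to demonstrate with Python's built-in sqlite3 module (this is just the hex-encoding arithmetic, not a benchmark):

```python
import os
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (b BLOB)")
con.execute("INSERT INTO t VALUES (?)", (os.urandom(1000),))
con.commit()

# iterdump() is the library's equivalent of the CLI's .dump: the
# 1000-byte blob comes out as an X'...' literal, i.e. ~2000 hex chars,
# roughly doubling its size before compression even enters the picture.
dump = "\n".join(con.iterdump())
blob_literal = "X'" in dump
```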


Some more anecdata - from this it looks like you could `VACUUM INTO` + `zstd --long -12` using 19.1s and get 109% of the size you'd get from `dump` + `zstd --long -5` using 32.8s. Saves 13.7s at the cost of 76M. YMMV, obvs.

sqlite3 3.49.1, zstd 64bit 1.5.7, gzip (Apple) 457.120.3

Original file (3.3G)

    3264290816 Photos.sqlite
VACUUM INTO (10.3s, 3.1G, 94.3%)

    3078881280 test.sqlite
gzip -k (76s, 1.1G, 33.1%)

    1080119337 test.sqlite.gz
zstd --long (3.2s, 987M, 30.2%)

     986252298 test.sqlite.zst
zstd --long -9 (8.8s, 903M, 27.6%)

     902282663 test.sqlite.9.zst
zstd --long -12 (21.5s, 885M, 27.1%)

     884863443 test.sqlite.12.zst
.dump (27.6s, 4.7G)

    4693437307 photos.dump
gzip -k (72s, 942M, 28.8%)

     941018021 photos.dump.gz
zstd --long (5.2s, 860M, 26.3%)

     859204016 photos.dump.zst
zstd --long -12 (31.7s, 827M, 25.3%)

     826776415 photos.dump.12.zst
(edited to fix a typo in a size and a conclusion that came from that)


Brilliant. >60% savings. 700mb? wow.


Is that really necessary?


That depends on the bandwidth at the target site, which may be pretty remote, and on not wanting to expose a public internet service.


Ah no, I meant “is the snark necessary?” to the parent comment. I enjoyed the read!


Does that preserve the indexes? As TFA mentioned, the indexes are why the sqlite files are huge.


You're right. It does. I never thought about it until you asked.


I think it won't preserve the indexes as stored, but running the dumped SQL will recreate them, since the dump includes the CREATE INDEX statements.
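Right - the dump is schema + data as SQL text, so indexes come back because their CREATE INDEX statements are replayed, not because any btree pages are copied. This can be checked with Python's built-in sqlite3 module, whose iterdump() mirrors the CLI's .dump:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (x INTEGER)")
con.execute("CREATE INDEX t_x ON t (x)")
con.commit()

# The dump contains the index definition as a CREATE INDEX statement;
# replaying the dump elsewhere rebuilds the index from scratch.
schema = list(con.iterdump())
has_index_ddl = any("CREATE INDEX" in line for line in schema)
```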



