
Saving to text file is inefficient. I save sqlite databases using VACUUM INTO, like this:

  sqlite3 -readonly /path/db.sqlite "VACUUM INTO '/path/backup.sqlite';"
From https://sqlite.org/lang_vacuum.html :

  The VACUUM command with an INTO clause is an alternative to the backup API for generating backup copies of a live database. The advantage of using VACUUM INTO is that the resulting backup database is minimal in size and hence the amount of filesystem I/O may be reduced.


It's cool but it does not address the issue of indexes, mentioned in the original post. Not carrying index data over the slow link was the key idea. The VACUUM INTO approach keeps indexes.
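If the goal is to avoid shipping index data over the slow link, one approach is to copy the database into memory, drop the indexes there, and VACUUM INTO from the copy, sending the saved CREATE INDEX DDL alongside so the receiver can rebuild. A sketch using Python's built-in sqlite3 module (the function name and the in-memory step are my own; VACUUM INTO needs SQLite >= 3.27):

```python
import sqlite3

def copy_without_indexes(src_path, dst_path):
    """Copy src_path to dst_path with all indexes dropped.

    Returns the CREATE INDEX DDL so the receiver can rebuild the
    indexes. The source database is never modified.
    """
    src = sqlite3.connect(src_path)
    mem = sqlite3.connect(":memory:")
    src.backup(mem)  # full copy into RAM; the on-disk source stays untouched
    src.close()
    # sqlite_master holds the original DDL; auto-indexes have sql IS NULL.
    ddl = [sql for (sql,) in mem.execute(
        "SELECT sql FROM sqlite_master"
        " WHERE type = 'index' AND sql IS NOT NULL")]
    for (name,) in mem.execute(
            "SELECT name FROM sqlite_master"
            " WHERE type = 'index' AND sql IS NOT NULL").fetchall():
        mem.execute(f'DROP INDEX "{name}"')
    mem.commit()
    mem.execute("VACUUM INTO ?", (dst_path,))  # needs SQLite >= 3.27
    mem.close()
    return ddl
```

Replaying the returned DDL on the receiving side rebuilds the indexes locally instead of transferring them.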

A text file may be inefficient as is, but it's perfectly compressible, even with primitive tools like gzip. I'm not sure the SQLite binary format compresses equally well, though it might.


SQLite tosses out the SQL once it is parsed into bytecode. Using text is just going to take longer, even though I’m sure it works great.

You can modify the database before vacuuming by making a new in-memory database, copying selected tables into it, and then vacuuming that to disk.
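A sketch of that with Python's built-in sqlite3 module (the function and table names are illustrative; VACUUM INTO needs SQLite >= 3.27):

```python
import sqlite3

def export_tables(src_path, dst_path, tables):
    """Write a minimal database containing only the named tables."""
    mem = sqlite3.connect(":memory:")
    mem.execute("ATTACH DATABASE ? AS src", (src_path,))
    for t in tables:
        # CREATE TABLE ... AS SELECT copies columns and rows, but not
        # indexes, constraints, or triggers -- recreate those if needed.
        mem.execute(f'CREATE TABLE "{t}" AS SELECT * FROM src."{t}"')
    mem.execute("DETACH DATABASE src")
    mem.commit()
    mem.execute("VACUUM INTO ?", (dst_path,))  # needs SQLite >= 3.27
    mem.close()
```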


This should be the accepted answer.


> A text file may be inefficient as is, but it's perfectly compressible, even with primitive tools like gzip. I'm not sure the SQLite binary format compresses equally well, though it might.

I hope you’re saying that because of indexes? I think you may want to revisit how compression works to fix your intuition. Text+compression will always be larger and slower than equivalent binary+compression, assuming the text and binary represent the same contents. Why? Binary is less compressible as a percentage, but it starts off smaller in absolute terms, which results in a smaller absolute compressed size.

A way to think about it is information theory: binary should generally represent the data more compactly already, because the structure lives in the code. Compression works by replacing redundant structure with shorter codes, and it works better when there’s a lot of redundant structure. However, while text has a lot of redundant structure, that’s actually bad for the compressor, because it has to find that structure and process more data to do so. Additionally, compression uses generic mathematical techniques to remove that structure, which are generically applicable but not as effective as removing the structure by hand via a binary format.

There’s some nuance here because the text represents slightly different things than the raw binary SQLite (how to restore the data in the db vs. the precise relationships and data structures for allowing insertion/retrieval). But I’d still expect the binary to end up smaller compressed for non-trivial databases.


Below I'm discussing compressed size rather than how "fast" it is to copy databases.

Yeah, there are indexes. And even without indexes there's an entire b-tree sitting above the data. So we're weighing the benefits of domain-dependent compression (the binary format) against dropping all of the derived data. I'm not sure how that will go, but let's try one.

Here is an SQLite file containing metadata for Apple's Photos application:

    767979520 May  1 07:28 Photos.sqlite
Doing a VACUUM INTO:

    719785984 May  1 08:56 photos.sqlite
gzip -k photos.sqlite (this took 20 seconds):

    303360460 May  1 08:56 photos.sqlite.gz
sqlite3 -readonly photos.sqlite .dump > photos.dump (10 seconds):

    1277903237 May  1 09:01 photos.dump
gzip -k photos.dump (21 seconds):

    285086642 May  1 09:01 photos.dump.gz
About 6% smaller for the dump vs. the original binary (but there are a bunch of indexes in this one). For me, the small space savings wouldn't be worth the extra time spent doing the dump.

With indexes dropped and vacuumed, the compressed binary is 8% smaller than compressed text (despite btree overhead):

    566177792 May  1 09:09 photos_noindex.sqlite
    262067325 May  1 09:09 photos_noindex.sqlite.gz
About 13.5% smaller than compressed binary with indices. And one could re-add the indices on the other side.


Yup, these results are pretty consistent with what I'd expect (and why I noted the impact of indices), because even string data carries a lot of superfluous information when expressed as SQL statements ("INSERT INTO foo ...") - I would expect all of that to exceed any bookkeeping within the btree. And non-string values like blobs or numbers are stored more efficiently in the binary format than in the dump, which is a text encoding (or even hex for blobs) that blows things up further.
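The blob blow-up is easy to demonstrate with Python's built-in sqlite3 module (this is just the hex-encoding arithmetic, not a benchmark):

```python
import os
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (b BLOB)")
con.execute("INSERT INTO t VALUES (?)", (os.urandom(1000),))
con.commit()

# iterdump() is the library's equivalent of the CLI's .dump: the
# 1000-byte blob comes out as an X'...' literal, i.e. ~2000 hex chars,
# roughly doubling its size before compression even enters the picture.
dump = "\n".join(con.iterdump())
blob_literal = "X'" in dump
```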


Some more anecdata - from this it looks like you could `VACUUM INTO` + `zstd --long -12` using 19.1s and get 109% of the size you'd get from `dump` + `zstd --long -5` using 32.8s. Saves 13.7s at the cost of 76M. YMMV, obvs.

sqlite3 3.49.1, zstd 64bit 1.5.7, gzip (Apple) 457.120.3

Original file (3.3G)

    3264290816 Photos.sqlite
VACUUM INTO (10.3s, 3.1G, 94.3%)

    3078881280 test.sqlite
gzip -k (76s, 1.1G, 33.1%)

    1080119337 test.sqlite.gz
zstd --long (3.2s, 987M, 30.2%)

     986252298 test.sqlite.zst
zstd --long -9 (8.8s, 903M, 27.6%)

     902282663 test.sqlite.9.zst
zstd --long -12 (21.5s, 885M, 27.1%)

     884863443 test.sqlite.12.zst
.dump (27.6s, 4.7G)

    4693437307 photos.dump
gzip -k (72s, 942M, 28.8%)

     941018021 photos.dump.gz
zstd --long (5.2s, 860M, 26.3%)

     859204016 photos.dump.zst
zstd --long -12 (31.7s, 827M, 25.3%)

     826776415 photos.dump.12.zst
(edited to fix a typo in a size and a conclusion that came from that)


Brilliant. >60% savings. 700mb? wow.


Is that really necessary?


That depends on the bandwidth at the target site, which may be pretty remote, and on not wanting to expose a public internet service.


Ah no, I meant “is the snark necessary?” to the parent comment. I enjoyed the read!


Does that preserve the indexes? As TFA mentioned, the indexes are why the sqlite files are huge.


You're right. It does. I never thought about it until you asked.


I think it won't preserve the indexes as stored, but running the dumped SQL will recreate them, since the dump includes the CREATE INDEX statements.
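Right - the dump is schema + data as SQL text, so indexes come back because their CREATE INDEX statements are replayed, not because any btree pages are copied. This can be checked with Python's built-in sqlite3 module, whose iterdump() mirrors the CLI's .dump:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (x INTEGER)")
con.execute("CREATE INDEX t_x ON t (x)")
con.commit()

# The dump contains the index definition as a CREATE INDEX statement;
# replaying the dump elsewhere rebuilds the index from scratch.
schema = list(con.iterdump())
has_index_ddl = any("CREATE INDEX" in line for line in schema)
```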



