
... are Apple's manpages never read?

https://developer.apple.com/library/mac/documentation/Darwin...

"For applications that require tighter guarantees about the integrity of their data, Mac OS X provides the F_FULLFSYNC fcntl. The F_FULLFSYNC fcntl asks the drive to flush all buffered data to permanent storage. Applications, such as databases, that require a strict ordering of writes should use F_FULLFSYNC to ensure that their data is written in the order they expect. Please see fcntl(2) for more detail."

https://developer.apple.com/library/mac/documentation/Darwin...

"F_FULLFSYNC - Does the same thing as fsync(2) then asks the drive to flush all buffered data to the permanent storage device (arg is ignored). This is currently implemented on HFS, MS-DOS (FAT), and Universal Disk Format (UDF) file systems. The operation may take quite a while to complete. Certain FireWire drives have also been known to ignore the request to flush their buffered data."

OS X has aggressive file buffering in memory, and it's getting more aggressive all the time. For example, cfprefsd, introduced in 10.8 (https://developer.apple.com/library/mac/releasenotes/DataMan...) made it so that when a system application read a preferences file, it stayed in memory and ignored the disk version, until cfprefsd eventually synced it back to disk. In 10.9, the behavior is much worse to the point that as soon as a pref is in cfprefsd, it's unlikely to leave it until the user logs out / the machine reboots.

In this instance, OS X has, for quite some time, had "defrag on the fly" for files under 20MB in size. On access of the file, it's read into memory and kept there in its entirety until memory pressure from other processes triggers a sync back to disk. When it comes to writing a small file back to disk, OS X will "get around to it" when it's damned well ready unless you force its hand using the fcntl options above.
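For anyone who wants to see what "forcing its hand" looks like, here is a minimal sketch in Python, not taken from any of the projects discussed. It assumes that Python's fcntl module exposes F_FULLFSYNC only on platforms that define it (OS X), and it falls back to a plain fsync() elsewhere. The names durable_sync and write_durably are hypothetical helpers made up for the illustration.

```python
import fcntl
import os

# F_FULLFSYNC is only present in the fcntl module on platforms that define it
# (OS X); getattr keeps this sketch importable on other systems.
F_FULLFSYNC = getattr(fcntl, "F_FULLFSYNC", None)


def durable_sync(fd):
    """Flush fd as far toward permanent storage as the platform allows.

    Returns the name of the mechanism that was used.
    """
    if F_FULLFSYNC is not None:
        try:
            fcntl.fcntl(fd, F_FULLFSYNC)
            return "F_FULLFSYNC"
        except OSError:
            # The filesystem may not implement the request; fall back to fsync().
            pass
    os.fsync(fd)
    return "fsync"


def write_durably(path, data):
    """Write data to path and sync it before returning (hypothetical helper)."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while view:
            # os.write may write fewer bytes than requested, so loop.
            written = os.write(fd, view)
            view = view[written:]
        return durable_sync(fd)
    finally:
        os.close(fd)
```

The fallback matters because, per the man page quoted above, the request is only implemented on some filesystems, and a drive can still ignore it.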

Unfortunately, the bit about "This is currently implemented on HFS, MS-DOS (FAT), and Universal Disk Format (UDF) file systems" covers pretty much the range of filesystem types that OS X can natively read+write on - but one that might get past this is ExFAT. I'd be surprised if that was the case, but it is natively supported read+write on OS X and would be something quick and easy to test (set up an ExFAT volume for the database) and possibly verify this is the root cause.

(Additionally, third-party read+write access to filesystems like NTFS via Paragon / Tuxera may be able to confirm this as well.)

More reading material (MySQL has been dealing with this since 2005): http://lists.apple.com/archives/darwin-dev/2005/Feb/msg00072...



This appears to be benchmark gaming - the POSIX Rationale for fsync(2) says:

The fsync() function is intended to force a physical write of data from the buffer cache, and to assure that after a system crash or other failure that all data up to the time of the fsync() call is recorded on the disk. Since the concepts of "buffer cache", "system crash", "physical write", and "non-volatile storage" are not defined here, the wording has to be more abstract.


Unfortunately, it's often not that simple on POSIX systems; it's quite common for disk controllers to disobey fsync, for instance, either in their drivers or in their hardware caches or both. I'm actually surprised Apple is even willing to make the guarantees above about F_FULLFSYNC; I'd read it as, at best, only applying to Apple hardware, as if a third-party controller is doing something silly they can't really do much about that.


In other words, what are the OS X default fsync() semantics useful for? I had the same discussion on Twitter a few days ago...


It is useful for forcing all writes out to the storage device. If your device is battery/UPS backed and has enough capacity to flush its buffers (to disk or to flash memory) after a power loss, that is sufficient to (eventually) get your data on disk. (Yes, the drive may fail, but if that happens after the data has hit the platter, you have no guarantees either.)

From what I understand, that behaviour is in spec (for me, borderline, at best, but I don't make that spec) according to http://pubs.opengroup.org/onlinepubs/009695399/functions/fsy... ("physical write from the buffer cache", not "physical write to the disk") and, AFAIK, is what others do, too (http://ridiculousfish.com/blog/posts/mystery.html)

Edit: http://lists.apple.com/archives/darwin-dev/2005/Feb/msg00072..., referenced from that ridiculous fish post, gives more background info.


OK, that makes sense in special cases, but it is a really unsafe default...


There are two patches linked in the OP that switch to using F_FULLFSYNC on OSX. The OP says that people are still encountering db corruption even on branches with these fixes.

https://github.com/sipa/bitcoin/commit/b28d8b423bddc860c5858... https://github.com/gmaxwell/bitcoin/commit/e7bad10c12ce9b5d4...


Well, I'm glad someone apparently IS reading :)

But again - I'd point to the work of other longstanding database projects that are available on OS X as a source of "how we ensured data correctness".


Additional reply here since my other is too old to edit:

Something a tester experiencing corruption at startup may want to try is using the 'purge' command from the Terminal.

While a restart will indeed trigger flushing cached files back to disk before the system goes down, the 'purge' command will simulate a "cold boot"-like empty disk buffer by dropping existing file caches.

https://developer.apple.com/library/mac/documentation/Darwin...

If using the command solves the problem of the database corruption without a restart, then you're definitely suffering from disk cache. Easy to confirm.

Additionally, since this is an error on boot up of LevelDB, this is probably in regards to file reading - especially since on second bootup after a restart, no error is detected. F_FULLFSYNC is for ensuring that a particular file's changes are 'fully written to disk' ...

... but the cache works both ways. A program could also end up reading the disk cache (which sounds like what's happening here), unless you used F_NOCACHE or F_GLOBAL_NOCACHE. Mind you, these don't prevent the accessing of files already in disk cache - they prevent a file from getting into the disk cache in the first place.
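A rough sketch of the read side, in Python, for anyone who wants to experiment. It assumes that Python's fcntl module exposes F_NOCACHE only where the platform defines it (OS X); on other systems the flag is simply skipped. The helper name open_uncached is made up for the illustration. As noted above, this does not evict data that is already in the cache.

```python
import fcntl
import os

# F_NOCACHE is an OS X fcntl command; getattr keeps the sketch importable
# on platforms that do not define it.
F_NOCACHE = getattr(fcntl, "F_NOCACHE", None)


def open_uncached(path):
    """Open path read-only and ask the kernel not to cache its data.

    Returns (fd, applied), where applied says whether F_NOCACHE was set.
    """
    fd = os.open(path, os.O_RDONLY)
    applied = False
    if F_NOCACHE is not None:
        try:
            # A non-zero argument turns data caching off for this descriptor.
            fcntl.fcntl(fd, F_NOCACHE, 1)
            applied = True
        except OSError:
            applied = False
    return fd, applied
```

The caller owns the returned descriptor and should os.close() it when done.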

http://lists.apple.com/archives/darwin-dev/2009/Oct/msg00165...

(Disclaimer: I can't think of a situation where the disk cache would get out of sync with the on-disk version of the file if F_FULLFSYNC is used when accessed by a subsequent launch of a program except for in the case of faulty RAM on a machine which flipped bits. Your average file operation done by your average application in OS X isn't performing a checksum of data and generally wouldn't notice a single bit flip. It would be interesting to see which of these machines are using ECC RAM.)
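To illustrate the checksum point, here is a minimal sketch of how a program could notice a single flipped bit by storing a CRC alongside each record. It uses the plain CRC32 from Python's zlib as a stand-in and makes no claim about what LevelDB or Bitcoin actually do; the names frame and unframe are hypothetical.

```python
import zlib


def frame(payload):
    """Prefix payload with its CRC32 as 4 big-endian bytes."""
    return zlib.crc32(payload).to_bytes(4, "big") + payload


def unframe(record):
    """Return the payload, or raise ValueError if the checksum does not match."""
    stored = int.from_bytes(record[:4], "big")
    payload = record[4:]
    if zlib.crc32(payload) != stored:
        raise ValueError("checksum mismatch")
    return payload
```

A record that comes back from disk (or from a bad page of cache) with one bit changed fails the comparison in unframe, whereas a plain read would hand the corrupted bytes to the caller unnoticed.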


I'm not sure how you reached that conclusion, as leveldb already received that fix some months ago; see https://code.google.com/p/leveldb/issues/detail?id=197


Bitcoin is using an older version of leveldb (although, as mentioned, this fix is backported in a pull request).


Looks like a free $10k for you if you're right! Let's see!


I believe that to claim the reward you have to reproduce the issue; there is already a patch out for the F_FULLFSYNC change.

https://code.google.com/p/leveldb/issues/attachmentText?id=1...



