This F_FULLFSYNC behaviour has been like this on OSX for as long as I can remember. It's a hint that ensures the data in the drive's write buffer has been flushed to stable storage; historically this is a limitation of fsync that's being accounted for. Are you 1000% sure it does what you expect on other OSes?
Maybe it's an unrealistic expectation for all OSes to behave like Linux.
Maybe Linux's fsync is more like F_BARRIERFSYNC than F_FULLFSYNC. You could retry your benchmarks with those.
Also note that 3rd-party drives are known to ignore F_FULLFSYNC, which is why there is an approved list of drives for Mac Pros. This could explain why you are seeing different figures if your benchmarks issue F_FULLFSYNC on those 3rd-party drives.
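For anyone wanting to try the comparison themselves, here's a minimal sketch (mine, not the article's benchmark); Python's fcntl module only exposes F_FULLFSYNC on macOS, so this falls back to plain fsync() elsewhere. F_BARRIERFSYNC isn't exposed by Python at all, so it's not shown.

```python
import fcntl
import os
import tempfile

def durable_sync(fd):
    """Request a flush all the way to stable storage.

    On macOS, fsync() only pushes data to the drive; F_FULLFSYNC
    additionally asks the drive to flush its own write cache. On other
    platforms fcntl.F_FULLFSYNC is absent, and fsync() is assumed to
    already issue a device cache flush (as Linux does).
    """
    if hasattr(fcntl, "F_FULLFSYNC"):
        fcntl.fcntl(fd, fcntl.F_FULLFSYNC)  # macOS: full cache flush
    else:
        os.fsync(fd)  # Linux and others: fsync is expected to flush

fd, path = tempfile.mkstemp()
try:
    os.write(fd, b"critical data")
    durable_sync(fd)
finally:
    os.close(fd)
    os.remove(path)
```

Timing a loop of writes with and without the `durable_sync` call is enough to see the flush penalty being discussed here.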
Last time I checked (which was a while ago at this point, pre-SSD), nearly all consumer drives and even most enterprise drives would lie in response to commands to flush the drive cache. Working on a storage appliance at the time, the specifics of a major drive manufacturer's secret SCSI vendor-page knock to actually flush their cache was one of the things under their deepest NDAs. Apparently ignoring cache flushing was so ubiquitous that any drive manufacturer looking to have correct semantics would take a beating in benchmarks and lose market share. : \
So, as of about 2014, any difference here not being backed by per manufacturer secret knocks or NDAed, one-off drive firmware was just a magic show, with perhaps Linux at least being able to say "hey, at least the kernel tried and it's not our fault". The cynic in me thinks that the BSDs continuing to define fsync() as only hitting the drive cache is to keep a semantically clean pathway for "actually flush" for storage appliance vendors to stick on the side of their kernels that they can't upstream because of the NDAs. A sort of dotted line around missing functionality that is obvious 'if you know to look for it'.
It wouldn't surprise me at all if Apple's NVMe controller is the only drive you can easily get your hands on that actually does the correct thing on flush, since they're pretty much the only ones without the perverse market pressure to intentionally not implement it correctly.
Since this is getting updoots: Sort of in defense of the drive manufacturers (or at least stating one of the defenses I heard), they try to spec out the capacitance on the drive so that when the controller gets a power loss NMI, they generally have enough time to flush then. That always seemed like a stretch for spinning rust (the drive motor itself was quite a chonker in the watt/ms range being talked about particularly considering seeks are in the 100ms range to start with, but also they have pretty big electrolytic caps on spinning rust so maybe they can go longer?), but this might be less of a white lie for SSDs. If they can stay up for 200ms after power loss, I can maybe see them being able to flush cache. Gods help those HMB drives though, I don't know how you'd guarantee access to the host memory used for cache on power loss without a full system approach to what power loss looks like.
On at least one drive I saw, the flush command was instead interpreted as a barrier to commands being committed to the log in controller DRAM, which could cut into parallelization, and therefore throughput, looking like a latency spike but not a flush out of the cache.
The drive controller is internally parallel. The write is just a job queue submission, so the next write hits while it's still processing previous requests.
People have tested this stuff on storage devices with torture tests. Can you point at an example of a modern (directly attached) NVMe drive from a reputable vendor that cheats at this?
FWIW, macOS also has F_BARRIERFSYNC, which is still much slower than full syncs on the competition.
In my benchmarking of some consumer HDDs, back in 2013 or so, the flush time was always what you'd expect based on the drive's RPM. I saw no evidence the drives were lying to me. These were all 2.5" drives.
My understanding was that the capacitor thing on HDDs is to ensure the drive completely writes out a whole sector, so it passes the checksum. I only heard the flush-cache thing with respect to enterprise SSDs. But I haven't been staying on top of things.
You definitely weren't testing the cache in a meaningful way if you were hovering over the same track.
WRT the capacitor thing being about a single sector, think about the time spans. You should be able to cut power to the drive motor entirely and still stay up for hundreds of ms. In that time you can seek to a config track and blit out the whole cache. If you're already writing a sector, you'll be done in microseconds. The whole track spins around every ~8ms at 7200 RPM.
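The arithmetic here checks out; a quick back-of-the-envelope (the sectors-per-track figure below is just a made-up round number for illustration):

```python
# Sanity-check the figures above: one full revolution at 7200 RPM,
# and roughly how long one 512-byte sector takes to pass under the head
# (assuming 1000 sectors per track, a hypothetical round number).
rpm = 7200
rev_ms = 60_000 / rpm                      # milliseconds per revolution
sectors_per_track = 1000                   # hypothetical track density
sector_us = rev_ms * 1000 / sectors_per_track

print(f"revolution: {rev_ms:.2f} ms")      # ~8.33 ms, matching the comment
print(f"one sector: {sector_us:.1f} us")   # single-digit microseconds
```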
Tangential thinking out loud: this makes me think of a sort of interleaving or striping mechanism that tries to leave a small proportion of every track empty, such that ideal power loss flush scenarios would involve simply waiting for the disk to spin around to the empty/reserved area in the current track. On drives that aren't completely full, it's probably statistically reasonable that for any given track position there's going to be a track with some reserved space very close by, such that the amount of movement/power needed to seek there is smaller.
Of course, this approach describes a completely inverted complexity scenario in terms of sector remapping management, with the size of the associated tables probably being orders of magnitude larger. :<
Now I wonder how much power is needed for flash writes. The chances are an optimal-and-viable strategy would probably involve a bit of multichannel flash on the controller (and some FEC because why not).
Oooh... I just realized things'll get interesting if the non-volatile RAM thing moves beyond the vaporware stage before HDDs become irrelevant. Last-millimeter write caching will basically cease to be a concern.
But thinking about the problem slightly more laterally, I don't understand why nobody's made inline SATA adapters with RAM, batteries and some flash in them. If they intercept all writes they can remember what blocks made it to the disk, then flush anything in the flash at next power on. Surely this could be made both solidly/efficiently and cheaply...?
> But thinking about the problem slightly more laterally, I don't understand why nobody's made inline SATA adapters with RAM, batteries and some flash in them.
Hardware RAID controllers with battery backup units were really popular starting in the mid '90s until maybe the mid 2010s. Software caught up in a lot of features, and the batteries failed often and required a lot more maintenance. Supercaps were to replace the batteries, but I think SSDs and software negated a ton of the value-add. You can still buy them, but they're pretty rare to see in the wild.
I've heard of those; they sound mildly interesting to play with, if just to go "huh" at and move on. I get the impression the main reason they developed a... strained reputation was their strong tendency to want to do RAID things (involving custom metadata and other proprietariness) even for single disks, making data recovery scenarios that much more complicated and stressful if it hadn't been turned off. That's my naive projection though; I (obviously) have no real-world experience with these cards, I just knew to steer far away from them (heh)
An inline widget (SATA on both sides) that just implements a write cache and state machine ("push this data to these blocks on next power on") seems so much simpler. You could even have one for each disk and connect to a straightforward RAID/SAS controller. (Hmm, and if you externalize the battery component, you could have one battery connect to several units...)
You are indeed right about the battery/capacitor situation ("you have to open the case?!"), I wouldn't be surprised if the battery level reporting in those RAID cards was far from ideal too lol
With all this being said, a UPS is by far the simplest solution, naturally, but also the most transiently expensive.
Probably more like downvoted because missing the point.
Sure, fsync allows that behavior, but it's also so widely misunderstood that a lot of programs which should do a "full" flush only do an fsync, including benchmarks. In which case they are not comparable, and doing so is cheating.
But that's not the point!
The point is that with the M1 Macs' SSDs, performance when fully flushing to disk is abysmally bad.
And as such, any application that cares about data integrity and does a full flush can expect noticeable performance degradation.
The fact that Apple neither forces frequent full syncs nor at least a full sync when an application is closed doesn't make it better.
Though it is also not surprising, as it's not the first time Apple has set things up under the assumption that their hardware is infallible.
And maybe for desktop-focused high-end designs, where most devices sold are battery powered, that is a reasonable design choice.
"And maybe for desktop-focused high-end designs, where most devices sold are battery powered, that is a reasonable design choice"
Does the battery last forever? Do they never shut down from overheating or from being too cold? Never freeze up? Are they water- and coffee-proof?
Talk to anyone who repairs Macs about how high-end and reliable their designs truly are. They're better than bottom-of-the-barrel craptops, sure, but not particularly amazing, and they have some astounding design flaws.
As the article points out, a lot of those cases can be detected with advance notice (a dying battery, overheating, probably even being too cold). In those cases the OS makes sure all the caches are flushed.
Spilled drinks are a viable cause for concern, but if they do enough damage to cause an unexpected shutdown, you've probably got bigger issues than unflushed cache.
On many laptops, even with water damage, you can recover your local data fully. Not so for Macs (for more reasons than just data loss/corruption due to non-flushing).
Especially if you are already in a bad situation you don't want your OS to make it worse.
The CPU can't possibly get too cold. See for example overclocking performed by cooling the CPU with liquid nitrogen. Condensation is a factor, as is loss of ductility of plastics at low temperature, making them brittle; so is expansion and contraction of materials, especially when different materials expand to different degrees.
"The CPU can't possibly get too cold" - Untrue. There are plenty of chips with what overclockers like to call "cold bugs".
Sequential logic (flipflops) has a setup time requirement. This means the combinatorial computation between any two connected pairs of flops (output of flop A to input of flop B) has to do its job fast enough such that the input of B stops toggling some amount of time before the next clock edge arrives at the flipflop. Violate that timing, and B will sometimes sample the wrong value, leading to an error.
Setup time is what most people are thinking about when they use LN2 or other exotic forms of cooling. By cooling things down, you usually improve the performance of combinatorial logic, which provides more setup time margin, allowing you to increase clock speed until setup time margin is small again.
But flops also have hold time requirements - their inputs have to remain stable for some amount of time after the clock edge, not just before. It's here where we can run into problems if the circuit is too cold. Imagine a path with relatively little combinatorial logic, and not much wire delay. If you make that path too fast, it might start violating hold time on the destination flop. Boom, shit doesn't work.
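To make the setup-vs-hold asymmetry concrete, here's a toy slack calculation. Every number below is invented for illustration (real static timing analysis accounts for far more than this), but it shows why cooling that speeds logic up helps a long path's setup slack while pushing a short path's hold slack negative:

```python
# Toy static-timing numbers; every value here is illustrative, in ns.
CLOCK_PERIOD = 1.0   # 1 GHz clock
SETUP_TIME   = 0.05  # flop B: input must settle this long BEFORE the edge
HOLD_TIME    = 0.08  # flop B: input must stay stable this long AFTER the edge

def setup_slack(clk_to_q, comb_delay):
    # Positive slack: data settles in time before the next clock edge.
    return CLOCK_PERIOD - (clk_to_q + comb_delay) - SETUP_TIME

def hold_slack(clk_to_q, comb_delay):
    # Positive slack: data stays put long enough after the same edge.
    return (clk_to_q + comb_delay) - HOLD_TIME

# Model cooling as a flat 40% delay reduction on all gates.
speedup = 0.6

# Long path (lots of combinatorial logic): setup slack improves when cold.
print(setup_slack(0.10, 0.80), setup_slack(0.10 * speedup, 0.80 * speedup))

# Short path (almost no logic): hold slack goes NEGATIVE when cold,
# which is the "cold bug" described above.
print(hold_slack(0.10, 0.02), hold_slack(0.10 * speedup, 0.02 * speedup))
```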
Many phones, laptops, cameras, and similar devices only guarantee functionality above 0 degrees....
Luckily they often work at lower temperatures too, but frequently only by hoping they don't get cooled down that much themselves (because they are, e.g., in your pocket).
Incidentally, CPUs do get too cold: not at any reasonable temperature, but sufficiently low temperatures do change the characteristics of semiconductors. Not something to worry about if you're not using liquid nitrogen (or colder).
I've had my phone shut off on me from being out in the Chicago cold for a couple hours. Battery over 50% when I brought it back inside and warmed it up.
I mean the Apple hardware in question is usually a laptop, which has its own very well instrumented battery backup. In most cases the hardware knows well in advance if the battery is gonna run dry.
And yes, the hardware is fallible. But the kind of failure that would cause the device to completely lose power is extremely rare. The OS has many chances to take the hint and flush the cache before powering down.
A simple test would be to see the degree of data loss you incur with a hard power-off.
I think the author did that test for the M1 Mac, but idk if they did it with the other laptops.
But then the M1 Mac is slower when flushing than most SSDs out there, and even some HDDs. I think if most SSDs didn't flush data at all, we would know of it, and I should have run into problems with the few dozen hard resets I've had in the last few years. (And sure, there are probably some SSDs which cheap out on cache flushing in a dangerous way, but most shouldn't, as far as I can tell.)
We’d see data loss only if the power loss or hard reset happened before the data is actually flushed. After the data is accepted into the buffer there would be a narrow time window when it could occur. Also, a hard reset on the computer side may not be reflected on the storage embedded electronics.
It isn't, because otherwise it would be showing the ~same performance with and without sync commands, as I showed in the thread. There is a significant performance loss for every drive, but Apple's is way worse.
There is no real excuse for a single sector write to take ~20ms to flush to NAND, all the while the NAND controller is generating some 10MB/s of DRAM traffic. This is a dumb firmware design issue.
This affects T2 Macs too, which use the same NVMe controller design as M1 Macs.
We've looked at NVMe command traces from running macOS under a transparent hypervisor. We've issued NVMe commands outside of Linux from a bare-metal environment. The 20ms flush penalty is there for Apple's NVMe implementation. It's not some OS thing. And other drives don't have it. And I checked and Apple's NVMe controller is doing 10MB/s of DRAM memory traffic when issued flushes, for some reason (yes, we can get those stats). And we know macOS does not properly flush with just fsync() because it actively loses data on hard shutdowns. We've been fighting this issue for a while now, it's just that it only just hit us yesterday/today that there is no magic in macOS - it just doesn't flush, and doesn't guarantee data persistence, on fsync().
I've just been scanning through Linux kernel code (inc. ext4). Are you sure that it's not issuing a PREFLUSH? What are your barrier options on the mount? I think you will find these are going to be more like F_BARRIERFSYNC.
Those are Linux concepts. What you're looking for is the actual NVMe commands. There are two things: FLUSH (which flushes the whole cache), and a WRITE with the FUA bit set (which basically turns that write into write-through, but does not guarantee anything about other commands). The latter isn't very useful for most cases, since you usually want at least barrier semantics, if not a full flush, for previously completed writes. And that leaves you with FLUSH. Which is the one that takes 20ms on these drives.
> Those are Linux concepts. What you're looking for is the actual NVMe commands.
I'm not sure what commands are being sent to the NVMe drive. But what you are describing as a flush would be F_BARRIERFSYNC - NOT the F_FULLFSYNC which you've been benchmarking.
Sigh, no. A barrier is not a full flush. A barrier does not guarantee data persistence; it guarantees write ordering. A barrier will not make sure the data hits disk and is not lost on power failure. It just makes sure that, on power failure, subsequent data won't show up without the prior data. NVMe doesn't even have a concept of barriers in this sense. An OS-level barrier can be faster than a full sync only because it doesn't need to wait for the FLUSH to actually complete; it can just maintain a concept of ordering within the OS and make sure it is maintained with interleaved FLUSH calls.
I don't know why you keep pressing on this issue. macOS has the same performance with F_FULLFSYNC as Linux does with fsync(). Why would they be different things? We're getting the same numbers. This entire thing started because fsync() on these Macs on Linux was dog slow and we couldn't figure out why macOS was fast. Then we found F_FULLFSYNC, which has the same semantics as fsync() on Linux. And now both OSes perform equally slowly on this hardware. They're obviously doing the same thing. And the same thing on Linux on non-Apple SSDs is faster. I'm sure I could install macOS on this x86 iMac again and show you how F_FULLFSYNC on macOS also gives better performance on this WD drive than on the M1, but honestly, I don't have the time for that; the issue has been thoroughly proven already.
Actually, I have a better one that won't waste as much of my time.
Plugs in a shitty USB3 flash drive into the M1.
224 IOPS with F_FULLFSYNC. On a shitty flash drive.
58 IOPS with F_FULLFSYNC. On internal NVMe.
Both FAT32.
Are you convinced there's a problem yet?
(I'm pretty sure the USB flash drive has no write cache, so of course it is equally fast/slow with just fsync(), but my point still stands - committing writes to persistent storage is slower on this NVMe controller than on a random USB drive)
It seems to be pretty apples to apples, they're running the same benchmark using equivalent data storage APIs on both systems. What are you thinking might be different? The Linux+WD drive isn't making the data durable? Or that OSX does something stupid which could be the cause of the slowdown rather than the drive? Both seem implausible.
Something that is not quite clear to me yet (I did read the discussion below, thank you Hector for indulging us, very informative): isn't the end behaviour up to the drive controller? That is, how can we be sure that Linux actually does push to the storage or is it possible that the controller cheats? For example, you mention the USB drive test on a Mac — how can we know that the USB stick controller actually does the full flush?
Regardless, I certainly agree that the performance hit seems excessive. Hopefully it's just an algorithmic issue and Apple can fix this with a software update.
MacOS was really just FreeBSD with a fancier UI. Not sure what is the behavior now, but I'm pretty sure FreeBSD behaved almost exactly the same as a power loss rendered my system unbootable over 10 years ago.
I'm sorry but this is incorrect. NeXTSTEP was the primary foundation for Mac OS X, and the XNU kernel was derived from Mach and IIRC 4.4BSD. FreeBSD source was certainly an important jumping-off point for a number of Unix components of the kernel and CLI userland; there was some code sharing going on for a while (still?), but large components of the kernel and core frameworks were unique (for better or worse).
4.3; only Rhapsody incorporated elements from 4.4, but that was the tail end of NeXTSTEP, essentially the initial preview of macOS (it was released as OS X Server 1.0, then forked to Darwin, from which the actual OS X 10.0 would be built; two major pieces missing from Rhapsody were Classic and Carbon, so it really was NeXTSTEP with an OS 9 skin).
Thanks for the correction; man, has it been a long, long time. I had the Public Beta and then got on the OS X train pretty fast on a good old B&W G3. Even with the slowness, the multitasking still made it workable, and having all of Unix right there during the big rush of initial porting was really interesting. Good times. I remember calling Apple for help getting Apache compiled and got forwarded right out of the regular call system to some dev whose name I sadly forget, and we worked through it.
Everything is a million times more refined and overall better now but I do have a bit of nostalgia for the community and really getting your hands dirty back then while still having a fairly decent fallback. I haven't actually needed to mess with kernel stuff since 10.5 or so but thinking back makes me wonder about paths not taken.
> so it [Rhapsody] really was nextstep with an OS9 skin
Sorry to be pedantic, but Rhapsody's user interface is modeled after the Mac OS 8 "Platinum" design language. Though 9 also was modeled on Platinum, Rhapsody's interface appears nearly identical to Mac OS 8's except for the Workspace Manager which doesn't exist in 8.
I like that. Fsync() was designed with the block cache in mind. IMO how the underlying hardware handles durability is its own business. I think a hack to issue a “full fsync” when battery is below some threshold is a good compromise.
It's important to read the entire document including the notes, which informs the reader of a pretty clear intent (emphasis mine):
> The fsync() function is intended to force a physical write of data from the buffer cache, and to assure that after a system crash or other failure that all data up to the time of the fsync() call is recorded on the disk.
This seems consistent with user expectations - fsync() completion should mean data is fully recorded and therefore power-cycle- or crash-safe.
That particular implementation seems inconsistent with the following requirement:
> The fsync() function shall request that all data for the open file descriptor named by fildes is to be transferred to the storage device associated with the file described by fildes.
If I wrote that requirement in a classroom programming assignment and you presented me with that code, you'd get a failing grade. Similarly, if I were a product manager and put that in the spec and you submitted the above code, it wouldn't be merged.
> You are quoting the non-normative informative part
Indeed, I am! It is important. Context matters, both in law and in programming. As a legal analogy, if you study Supreme Court rulings, you will find that in addition to examining the text of legislation or regulatory rules, the court frequently looks to legislative history, including Congressional findings and statements by regulators and legislators in order to figure out how to best interpret the law - especially when the text is ambiguous.
> If I wrote that requirement in a classroom programming assignment and you presented me with that code, you'd get a failing grade.
It's a good thing operating systems aren't made up entirely of classroom programming assignments.
Picture an OS which always runs on fully-synchronized storage (perhaps a custom Linux or BSD or QNX kernel). If there's no write cache and all writes are synchronous, then fsync() doesn't need to do anything at all; therefore `int fsync(int fd) { return 0; }` is valid because fsync()'s method is implementation-specific.
This allows you to have no software or hardware write cache and not implement fsync() and still be POSIX-compliant.
> Context matters, both in law and in programming. As a legal analogy, if you study Supreme Court rulings, you will find that in addition to examining the text of legislation or regulatory rules, the court frequently looks to legislative history, including Congressional findings and statements by regulators and legislators in order to figure out how to best interpret the law - especially when the text is ambiguous.
The POSIX specification is not a court of law, and the context is pretty clear: fsync() should do whatever it needs to do to request that pending writes are written to the storage device. In some valid cases, that could be nothing.
> Picture an OS which always runs on fully-synchronized storage (perhaps a custom Linux or BSD or QNX kernel). If there's no write cache and all writes are synchronous, then fsync() doesn't need to do anything at all; therefore `int fsync(int fd) { return 0; }` is valid because fsync()'s method is implementation-specific.
Sure, I'll give you that, in a corner case where all writes are synchronized to storage before completing. However, most modern computers cache writes for performance, and the speed/security tradeoff is the context of this discussion. We wouldn't be having this debate in the first place if computers and storage devices didn't cache writes.
> The POSIX specification is not a court of law
Indeed, it isn't; nor is legislative text (the closest analogy in law). Hence the need for interpretation.
> fsync() should do whatever it needs to do to request that pending writes are written to the storage device
The wording here is quite subtle. Without SIO, fsync is merely a request, returning an error if one occurred. As the informative section points out, this means that the request may be ignored, which is not an error.
> If _POSIX_SYNCHRONIZED_IO is not defined, the wording relies heavily on the conformance document to tell the user what can be expected from the system. It is explicitly intended that a null implementation is permitted.
Compare this to e.g. the wording for write(2):
> The write() function shall attempt to write nbyte bytes from the buffer pointed to by buf to the file associated with the open file descriptor, fildes. [yadadada]
This actually specifies that an action needs to be performed. fsync(2) sans SIO is merely a request that the OS can honor or not. And because macOS does not define SIO, you have to go out and find out what that particular implementation is actually doing, and the answer is: essentially nothing for fsync.
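Whether a platform advertises Synchronized I/O can be probed at runtime via sysconf. A small sketch (note a positive value only means the option is claimed; it doesn't prove fsync() actually reaches stable storage, which is the whole point of this thread):

```python
import os

def has_synchronized_io():
    """Check whether this system advertises POSIX Synchronized I/O.

    sysconf(_SC_SYNCHRONIZED_IO) returns the supported option version,
    or -1 if unsupported; some platforms instead raise ValueError or
    OSError for names they don't know about.
    """
    try:
        return os.sysconf("SC_SYNCHRONIZED_IO") > 0
    except (ValueError, OSError):
        return False

print("Synchronized I/O advertised:", has_synchronized_io())
```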
It makes sense that a null implementation is permitted to cover cases such as the one illustrated above where all writes are always synchronized. However, it violates the spirit of the law (so to speak), as discussed in the informative section, to have a null implementation where writes are not always synchronized (i.e., cached). As another commenter noted, the wording was not intended to give the implementor a get-out-of-jail-free card ("it was merely a request; I didn't actually have to even try to fulfill it").
There’s also the very likely possibility that the storage is lying to the OS, that the data that was accepted and which is in the buffer has been written somewhere durable while it’s actually waiting for an erase to finish or a head to get wherever it needs to be. There are disk controllers with batteries precisely for those situations.
And, if cheating will give better numbers on benchmarks, I’m willing to bet money most manufacturers will cheat.
Since crashes and power failures are out of scope for POSIX, even F_FULLFSYNC's behavior description would of necessity be informative rather than normative.
But, the reality is that all operating systems provide some way to make writes to persistent storage complete, and to wait for them. All of them. It doesn't matter what POSIX says, or that it leaves crashes and power failure out of scope.
POSIX's model is not a get-out-of-jail-free card for actual operating systems.
At least it's also implemented by Windows, which makes apt-get slower in a Hyper-V VM.
And it's also unbearably slow for loopback-device-backed Docker containers in the VM, due to the double layer of cache. I just add eatmydata happily, because you can't save a half-finished Docker image anyway.
OSX defines _POSIX_SYNCHRONIZED_IO though, doesn't it? I don't have one at hand but IIRC it did.
At least the OSX man page admits to the detail.
The rationale in the POSIX document for a null implementation seems reasonable (or at least plausible), but it does not really seem to apply to general OSX systems at all. So even if they didn't define _POSIX_SYNCHRONIZED_IO it would be against the spirit of the specification.
I'm actually curious why they made fsync do anything at all though.
No problem - sorry if I came off harsh; I thought you were being pedantic :D
TBH, I'm not so sure it's that different. Scanning through the Linux docs, it seems this behaviour can be configured as part of the mount options (e.g. barrier on ext4). At least it's explicit on macOS (with compliant hardware).
> No problem - sorry if I came off harsh; I thought you were being pedantic :D
No, I just did a Ctrl+F, Ctrl+C, Ctrl+V without thinking enough. No need to apologize though; my reply was actually flippant, and I should have been more respectful of your (correct) point.
> TBH, I'm not so sure it's that different. Scanning through the Linux docs, it seems this behaviour can be configured as part of the mount options (e.g. barrier on ext4). At least it's explicit on macOS (with compliant hardware).
I disagree (unless Linux short-cuts this by default). The reason is in the POSIX rationale:
*RATIONALE*
> The fsync() function is intended to force a physical write of data from the buffer cache, and to assure that after a system crash or other failure that all data up to the time of the fsync() call is recorded on the disk. Since the concepts of "buffer cache", "system crash", "physical write", and "non-volatile storage" are not defined here, the wording has to be more abstract.
The first paragraph gives the intention of the interface. It's clearly to persist data.
> If _POSIX_SYNCHRONIZED_IO is not defined, the wording relies heavily on the conformance document to tell the user what can be expected from the system. It is explicitly intended that a null implementation is permitted. This could be valid in the case where the system cannot assure non-volatile storage under any circumstances or when the system is highly fault-tolerant and the functionality is not required. In the middle ground between these extremes, fsync() might or might not actually cause data to be written where it is safe from a power failure. The conformance document should identify at least that one configuration exists (and how to obtain that configuration) where this can be assured for at least some files that the user can select to use for critical data. It is not intended that an exhaustive list is required, but rather sufficient information is provided so that if critical data needs to be saved, the user can determine how the system is to be configured to allow the data to be written to non-volatile storage.
Now this gives a rationale for why you might not include it. And lists three examples of where it could be valid to water down the intended semantics. The system can not support it; the functionality is not required because data durability is guaranteed in other ways; the functionality is traded off in cases where major risks have been reduced.
OSX on a consumer Mac doesn't fit those cases.
Linux with that option is violating POSIX even by the letter, because presumably mounting the drive with `-o nobarrier` does not cause all your applications to be recompiled with the property undefined. But it's not that unreasonable an option; it's clearly not feasible to have two sets of all your software compiled and select one or the other depending on whether your UPS is operational or not.
Oh yeah, I definitely agree with you on this. If anything, you should be able to pass in flags to reduce resiliency, not have the default be that way. Maybe that's how the actual SIO spec reads (I haven't read it).
The implication (in fact no, it's explicitly stated) is that this fsync() behaviour on OSX will be a surprise for developers working on cross-platform code or coming from other OSes, and will catch them out.
However, if it's in fact quite common for other OSes to exhibit the same or similar behaviour (BSD, for example, does this too, which makes sense as OSX has a lot of BSD lineage), that argument of least surprise falls a bit flat.
That's not to say this is good behaviour; I think Linux does this right. The real issue is the appalling performance for flushing writes.
> If _POSIX_SYNCHRONIZED_IO is not defined, the wording relies heavily on the conformance document to tell the user what can be expected from the system.
> fsync() might or might not actually cause data to be written where it is safe from a power failure.
> The fsync() function shall request that all data for the open file descriptor named by fildes is to be transferred to the storage device associated with the file described by fildes. The nature of the transfer is implementation-defined. The fsync() function shall not return until the system has completed that action or until an error is detected.
then:
> The fsync() function is intended to force a physical write of data from the buffer cache, and to assure that after a system crash or other failure that all data up to the time of the fsync() call is recorded on the disk. Since the concepts of "buffer cache", "system crash", "physical write", and "non-volatile storage" are not defined here, the wording has to be more abstract.
The only reason to doubt the clarity of the above is that POSIX does not consider crashes and power failures to be in scope. It says so right in the quoted text.
Crashes and power failures are just not part of the POSIX worldview, so in POSIX there can be no need for sync(2) or fsync(2), or fcntl(2) w/ F_FULLFSYNC! Why even bother having those system calls? Why even bother having the spec refer to the concept at all?
Well, the reality is that some allowance must be made for crashes and power failures, and that includes some mechanism for flushing caches all the way to persistent storage. POSIX is a standard that some real-life operating systems aim to meet, but those operating systems have to deal with crashes and power failures because those things happen in real life, and because their users want the operating systems to handle those events as gracefully as possible. Some data loss is always inescapable, but data corruption would be very bad, which is why filesystems and applications try to do things like write-ahead logging and so on.
That is why sync(2), fsync(2), fdatasync(2), and F_FULLFSYNC exist. It's why they [well, some of them] existed in Unix, it's why they still exist in Unix derivatives, it's why they exist in Unix-alike systems, it's why they exist in Windows and other not-remotely-POSIX operating systems, and it's why they exist in POSIX.
If they must exist in POSIX, then we should read the quoted and linked page, and it is pretty clear: "transferred to the storage device" and "intended to force a physical write" can only mean... what that says.
It would be fairly outrageous for an operating system to say that since crashes and power failures are outside the scope of POSIX, the operating system will not provide any way to save data persistently other than to shut down!
> the fsync() function is intended to force a physical write of data from the buffer cache
If they define _POSIX_SYNCHRONIZED_IO, which they don't.
fsync wasn't defined as requiring a flush until version 5 of the spec. It was implemented in the BSDs long before then, and Apple introduced F_FULLFSYNC before fsync acquired that new definition.
I don't disagree with you, but it is what it is. History is a thing; legacy support is a thing. Apple likely didn't want to change people's expectations of the behaviour on OSX. They have their own mechanism after all, which is well documented, actively used by lots of portable software and libraries, and built into the higher-level APIs that Mac devs consume.
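That portable usage typically boils down to a small shim: ask for the real flush where the OS offers one, and degrade gracefully where it doesn't (SQLite's fullfsync option works along these lines). A sketch, with `full_fsync` as an illustrative name rather than any library's actual API:

```c
#include <fcntl.h>
#include <unistd.h>

/* Request that data for `fd` really reach stable storage.
 * Returns 0 on success, -1 on error (errno set). */
int full_fsync(int fd) {
#if defined(__APPLE__) && defined(F_FULLFSYNC)
    /* F_FULLFSYNC can fail on filesystems that don't support it
     * (e.g. some network or third-party volumes), so fall back to
     * the weaker fsync() rather than reporting failure outright. */
    if (fcntl(fd, F_FULLFSYNC) == 0)
        return 0;
#endif
    /* On Linux, fsync() already implies a device cache flush on
     * typical configurations; elsewhere it is the best we can ask for. */
    return fsync(fd);
}
```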
Depends on the definition of "storage device", I guess. If it's physical media, then OS X doesn't. If it's the controller, then OS X does. But since the intent is to have the data reach persistent storage, it has to be the physical media.
My guess is that since people know all of this, they'll just keep working around it as they already do. Newbies to OS X development will get bitten unless they know what to look for.
If you need to run software/servers with any kind of data consistency/reliability on OS X this is definitely something you should be aware of and will be a footgun if you're used to Linux.
Macs in datacentres are becoming increasingly common for CI, MDM, etc.
I believe it’s at least 5s. Marcan didn’t specify how long it was, but gave an example of at least 5s. That could cause a device to think it’s allowed to do something via MDM but not actually have a record in the database allowing it to do so.
You don't just lose seconds of data. When you drop seconds of writes, that can effectively corrupt any data or metadata that was touched during that time period. Which means you can lose data that had been safe.
And with those hundreds of databases, we’re only learning about this behavior now instead of any of the previous decades where abundant errors would have caused a conversation.
Doesn’t seem like an issue worthy of hundreds of HN comments and upvotes, just people raising a stink over a non-issue.
If a system loses power a hundred times and gets corrupted once, it's a choice whether you accept that or not. I do not. I want quality, and quality is a system that does not get corrupted.
I've used Macs for years and never been aware of it.
Note: the tweeter couldn't provoke actual problems under any sort of normal usage. To make data loss show up he had to use weird USB hacks. If you know you have a battery and can forcibly shut down the machine 'cleanly' it's not really clear what the need for a hard fsync is.
"Macs in datacentres are becoming increasingly common for CI, MDM, etc."
CI machines are the definition of disposable data. Nobody is running Oracle on macOS and Apple don't care about that market.
These days, best practice for data consistency / reliability in that environment, IIUC, is to write to multiple redundant shards and checksum, not to assume any particular field-pattern spat at the hard drive will make for a reliability guarantee.
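That pattern can be sketched as follows, with a toy FNV-1a checksum standing in for the CRC32C or stronger hash a real system would use: each replica of a record is prefixed with its checksum, and on read any copy that verifies is acceptable, so a torn or dropped write to one replica is detectable rather than silently corrupting.

```c
#include <stdint.h>
#include <stdio.h>

/* Toy 64-bit FNV-1a hash; illustrative only, not collision-resistant. */
static uint64_t fnv1a(const void *p, size_t n) {
    const unsigned char *b = p;
    uint64_t h = 14695981039346656037ULL;
    while (n--) { h ^= *b++; h *= 1099511628211ULL; }
    return h;
}

/* Write one replica of a record: checksum header, then the payload.
 * Returns 0 on success, -1 on error. */
int write_copy(FILE *f, const char *rec, size_t len) {
    uint64_t h = fnv1a(rec, len);
    if (fwrite(&h, sizeof h, 1, f) != 1) return -1;
    if (fwrite(rec, 1, len, f) != len) return -1;
    return 0;
}

/* Read one replica back; returns 1 if it verifies, 0 if torn/corrupt. */
int read_copy(FILE *f, char *out, size_t len) {
    uint64_t h;
    if (fread(&h, sizeof h, 1, f) != 1) return 0;
    if (fread(out, 1, len, f) != len) return 0;
    return fnv1a(out, len) == h;
}
```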
POSIX spec says no: https://pubs.opengroup.org/onlinepubs/9699919799/functions/f...