
Both KeyKOS and EUMEL used periodic globally consistent checkpoints supplemented by an explicit fsync() call for use on things that required durability like database transaction journals. So when the system crashed you would lose your last few minutes of work; EUMEL ran on floppies, so the interval could be as long as 15 minutes, but I think even KeyKOS only took a checkpoint every minute or so, perhaps because it would pause the whole system for close to a second on the IBM 370s it ran on. (Many years later, Norm Hardy showed me a SPARC workstation with I think 32 or 64 MiB of RAM; he said it was the biggest machine KeyKOS had ever run on, twice the size of the mainframe that was the previous record holder. Physically it was small enough to hold in one hand.)

I assume L3 used the same approach as EUMEL. I don't know enough about OS/400 or Multics to comment intelligently, but OS/400 at least is evidently reasonably performant and reliable, so I'd be surprised if it didn't use a similar strategy.

There's a latency/bandwidth tradeoff: if you take checkpoints less often, more of the data you would have checkpointed gets overwritten, so your checkpoint bandwidth gets smaller. NetApp's WAFL (not a single-level store, but a filesystem using a similar globally-consistent-snapshot strategy) originally (01996) took a checkpoint ("snapshot") once a second and stored its journals in battery-backed RAM, but with modern SSDs a snapshot every 10 milliseconds or so seems feasible, depending on write bandwidth.
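The tradeoff can be sketched numerically. Here's a toy model (all parameters are hypothetical, not measurements of WAFL, KeyKOS, or any real drive): writes land uniformly at random on a hot working set of pages, so repeated writes to the same page within one checkpoint interval coalesce into a single page write at checkpoint time.

```python
import math

def checkpoint_bandwidth(interval_s, writes_per_s, hot_pages, page_bytes=4096):
    """Expected checkpoint write bandwidth (bytes/s) under the toy model.

    With writes spread uniformly over `hot_pages` pages, the expected number
    of *distinct* pages dirtied in one interval is N * (1 - exp(-w*t/N)),
    so longer intervals coalesce more overwrites into fewer page writes.
    """
    n = hot_pages
    expected_dirty = n * (1 - math.exp(-writes_per_s * interval_s / n))
    return expected_dirty * page_bytes / interval_s
```

With, say, 10,000 page writes/s against a 100,000-page hot set, a 10 ms interval costs nearly the raw write bandwidth (~41 MB/s), while a 60 s interval drops the checkpoint bandwidth to roughly 7 MB/s, because most writes within the minute hit already-dirty pages.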

Most of these systems did have "volatile segments" actually, but not for the reason you suggest, which is performance; rather, it was for device drivers. You can't swap out your device drivers (in most cases) and they don't have state that would be useful across power failures anyway. So they would be, as EUMEL called it, "resident": exempted from both the snapshot mechanism and from paging.



> So when the system crashed you would lose your last few minutes of work

So, pretty much how disk storage is managed today? But then, if you're going to checkpoint application programs from time to time, it might be better to design them so that they explicitly serialize their program state to a succinct representation that they can reload from, and only have OS-managed checkpoint and restore as a last resort for programs that aren't designed to do this... which is again quite distinct from having everything on a "single level".
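Sketched concretely (a hypothetical app; the `save_state`/`load_state` names, the JSON format, and the atomic-rename scheme are all illustrative assumptions, not any real system's API):

```python
import json
import os
import tempfile

def save_state(state, path):
    """Serialize explicit program state to a succinct on-disk representation.

    Write-then-rename so that a crash mid-save leaves the previous snapshot
    intact; fsync() is the explicit-durability step the thread mentions.
    """
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)   # atomic on POSIX filesystems

def load_state(path, default=None):
    """Reload serialized state on startup, or fall back to a fresh default."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return default
```

The point of the design is that the saved representation is small and schema-explicit, rather than a raw image of the process's memory.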

> but with modern SSDs a snapshot every 10 milliseconds or so seems feasible, depending on write bandwidth.

Not so. Even swapping to SSD is generally avoided, because every single write causes wear and tear that impacts the lifetime of the device.


Well, I don't think it's "pretty much how disk storage is managed today," among other things because you don't have a multi-minute reboot process, and you usually lose some workspace context in a power failure.

The bigger advantage of carefully designed formats for persistent data is usually not succinctness but schema upgrade.

It's true that too much write bandwidth will destroy your SSDs prematurely, and that might indeed be a drawback of a single-level store, because you might end up spending that bandwidth on things that aren't worth it. This is especially true now that DWPD has collapsed from 10 or more down to, say, 0.2. But they're fast enough that you can do submillisecond durable commits, potentially hundreds of them per second. In the limit where each snapshot writes a single 4 KiB page, 100 snapshots per second is 0.4 MB/s.
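A quick back-of-envelope check of that last claim (the numbers are the ones above, not from any particular drive's datasheet):

```python
# In the best case, a snapshot dirties only a single 4 KiB page.
page_bytes = 4 * 1024
snapshots_per_second = 100

bandwidth_bytes_per_s = page_bytes * snapshots_per_second
print(bandwidth_bytes_per_s / 1e6)  # 0.4096, i.e. ~0.4 MB/s
```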

If you have a 3-year-life 1 TiB SSD rated for 0.2 DWPD, you can write about 220 TiB to it before it fails. So if you have 4 GiB of RAM, in the worst case where every checkpoint dirties every page, you can only take about 56,000 checkpoints before killing the drive: about nine minutes' worth if you're taking one every 10 ms (which would anyway require an implausible 400 GiB/s of sustained write bandwidth). If you instead limit yourself to an Ultra DMA 2 speed of 33 MB/s on average, while still permitting bursts at the SSD's maximum speed, the drive lasts 2.8 months, but the maximum checkpoint latency is more than 2 minutes. You have to scale back to 7.7 MB/s to guarantee a disk life of a year, at which point your worst-case checkpoint time rises to over 9 minutes.
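The arithmetic above can be reproduced directly (using the same figures; the byte-unit constants are the only things I'm supplying):

```python
TiB, GiB, MB = 2**40, 2**30, 10**6

# Endurance budget: 0.2 drive-writes-per-day on a 1 TiB drive over 3 years.
endurance_bytes = 0.2 * 1 * TiB * 3 * 365            # ~220 TiB

# Worst case: every checkpoint dirties all 4 GiB of RAM.
checkpoint_bytes = 4 * GiB
max_checkpoints = endurance_bytes / checkpoint_bytes  # ~56,000 checkpoints

# At one checkpoint per 10 ms, that budget lasts ~560 s of wall clock.
budget_seconds = max_checkpoints * 0.010

# Capping average write bandwidth at the Ultra DMA 2 speed of 33 MB/s:
drive_life_days = endurance_bytes / (33 * MB) / 86400  # ~85 days, ~2.8 months
worst_latency_s = checkpoint_bytes / (33 * MB)         # ~130 s, > 2 minutes

# Bandwidth that stretches the budget to one year, and the resulting latency:
year_s = 365 * 86400
bw_for_year = endurance_bytes / year_s                 # ~7.7 MB/s
latency_for_year = checkpoint_bytes / bw_for_year      # ~560 s, > 9 minutes
```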



