Bug 40607 - EasyHack: call fsync when we can when writing documents to certain linux file-systems
Summary: EasyHack: call fsync when we can when writing documents to certain linux file...
Status: RESOLVED FIXED
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: LibreOffice (show other bugs)
Version:
(earliest affected)
3.3.3 release
Hardware: x86-64 (AMD64) Linux (All)
: high enhancement
Assignee: Michael Meeks
URL:
Whiteboard: target:3.6.0 target:3.5.2
Keywords: difficultyInteresting, easyHack, skillCpp
Depends on:
Blocks:
 
Reported: 2011-09-03 11:13 UTC by Tristan Schmelcher
Modified: 2015-12-16 00:21 UTC (History)
8 users (show)

See Also:
Crash report or crash signature:


Attachments
Small utility to test if fsync() is desirable for a certain path (2.64 KB, text/x-csrc)
2012-01-15 14:05 UTC, Tristan Schmelcher
Details
sal fsync patch ... (4.22 KB, patch)
2012-01-16 08:42 UTC, Michael Meeks
Details
Small utility to test if fsync() is desirable for a certain path (2.75 KB, text/x-csrc)
2012-01-17 21:29 UTC, Tristan Schmelcher
Details
gdb trace of all the calls to close(2) when saving a document (241.31 KB, text/plain)
2012-01-21 14:49 UTC, Tristan Schmelcher
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Tristan Schmelcher 2011-09-03 11:13:44 UTC
I am using LibreOffice 3.3.3 on Ubuntu 11.04 amd64 (distro-provided packages). Recently I was working on a document in LibreOffice while my battery was low and so I was frequently saving, which I thought would help me if I lost power. However, when I eventually did lose power and later rebooted, the document had become 0 bytes long. LibreOffice was not able to restore the auto-saved copy either. As a result, I have lost a whole week of notes for one of my courses.

After researching online, it seems that this is caused by the application not calling fsync() (or fdatasync()) when saving files. Due to delayed allocation in modern filesystems, there is no guarantee that the new file's data has actually been written to disk unless the application calls fsync. So if an app writes a new file and replaces the old one with it without fsync'ing the new one first then there is a window of opportunity during which a power failure will result in the loss of BOTH versions of the file. In ext4 this window is also much larger than in ext3.

Theodore Tso blogged about this at http://ldn.linuxfoundation.org/blog-entry/delayed-allocation-and-zero-length-file-problem and http://www.linuxfoundation.org/news-media/blogs/browse/2009/03/don%E2%80%99t-fear-fsync. He strongly recommends to call fsync in this situation.

Please update LibreOffice to fsync() saved files so that other users do not lose their data like I did.

Forwarding upstream from Ubuntu bug at https://bugs.launchpad.net/ubuntu/+source/libreoffice/+bug/817326

You can see evidence of more users encountering this at:
http://ubuntuforums.org/showthread.php?p=11215058
http://user.services.openoffice.org/en/forum/viewtopic.php?f=6&t=39666
Comment 1 Urmas 2011-09-08 03:15:54 UTC
Since when filesystem issues are application concern?
Comment 2 Tristan Schmelcher 2011-09-08 08:47:54 UTC
This is not a filesystem issue, the filesystem is behaving correctly. In order to persist data to disk (on _any_ filesystem), the application must call fsync(). This is part of the POSIX spec. By not calling fsync(), LibreOffice is basically _telling_ the filesystem that it is not important to retain this file in the event of a power outage.
Comment 3 Tristan Schmelcher 2011-09-23 02:42:37 UTC
Here's a decent example of how to fsync() from Java: http://android-developers.blogspot.com/2010/12/saving-data-safely.html.

Note that if using buffered I/O then you also need to flush the data before the sync().
Comment 4 John Iglar 2011-10-27 04:32:32 UTC
I'm adding a "me too" here. Our school is using Ubuntu and LibreOffice and we have had many students lose work due to this bug.

Repeat Tristan's comment - not a filesystem issue, but application. 
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/317781/comments/45
Comment 5 Christopher M. Penalver 2011-12-10 11:12:05 UTC
Severity > Critical since this scenario causes data loss.
Comment 6 Don't use this account, use tml@iki.fi 2011-12-10 13:53:51 UTC
Why do you think an example using Java, on Android even (comment #3) has relevance for LibreOffice? Just asking in case you think LibreOffice is written in Java...

Also a disk crash causes data loss. If you are working on important documents, without doing backups to another device (preferrably a removable or remote one), they can't be that important after all.

Is LibreOffice the new dog that eats homework?
Comment 7 Michael Meeks 2011-12-10 14:01:56 UTC
This is a tricky area. It is fine to ask people to call fsync like drunken sailors on ext4 or btrfs - the costs are low; however for ext3 systems an 'fsync' can take many seconds to complete (if one has not been run recently) during which the entire system is un-responsive to I/O and the user feels like everything has crashed/locked. There is a monster kernel I/O bug about this IIRC that never goes anywhere.

So - it is a nuanced issue. We should look at what sequence we are using to re-write that file.

IIRC, if we use an atomic 'rename' then we should be guarenteed to get either the old or the new file, and not a zero length one; so prolly we should do that. Though of course this interacts nastily with the tangled locking code (no doubt).

Anyhow, I'm sympathetic to the idea of an fsync; except for the still rather widely deployed ext3 world where it is a potential "LibreOffice wedges my machine when I save" problem that is a royal PITA. Of course, for ourselves, we could do the 'fsync' aynchronously; with some horrible 'fsync thread', but ... it's all rather a mess.

We'd need to see if 'rename' will work at all with the locking semantics, and whatever odd-ball remote file-systems we're supposed to work with.
Comment 8 Tristan Schmelcher 2011-12-10 18:43:41 UTC
Yeah sorry about the Java comment, I thought for some reason that LibreOffice was written in Java. Ignore that.

Doing the atomic rename as Michael suggests is a good middle-ground between robustness and performance. ext4 in recent builds supposedly adds an fsync to that if it is missing.
Comment 9 Björn Michaelsen 2011-12-23 12:38:46 UTC
[This is an automated message.]
This bug was filed before the changes to Bugzilla on 2011-10-16. Thus it
started right out as NEW without ever being explicitly confirmed. The bug is
changed to state NEEDINFO for this reason. To move this bug from NEEDINFO back
to NEW please check if the bug still persists with the 3.5.0 beta1 or beta2 prereleases.
Details on how to test the 3.5.0 beta1 can be found at:
http://wiki.documentfoundation.org/QA/BugHunting_Session_3.5.0.-1

more detail on this bulk operation: http://nabble.documentfoundation.org/RFC-Operation-Spamzilla-tp3607474p3607474.html
Comment 10 Jean-Baptiste Faure 2011-12-26 08:05:47 UTC
According to comments 6, 7 and 8, the relevant category seems to be enhancement.
LibreOffice works as designed but something could be improved when saving files.

Best regards. JBF
Comment 11 Tristan Schmelcher 2011-12-30 17:58:47 UTC
I'm not sure the enhancement category is the right fit here. I'm sure loss of data was not one of the design criteria, so it doesn't seem to me like this aspect of LibreOffice can be considered to be working as designed.

I'll try out 3.5.0 beta 1 when I get a chance.
Comment 12 Tristan Schmelcher 2012-01-07 19:38:32 UTC
Tested 3.5.0 Beta 2 and the bug persists. My repro steps:

1) Open LibreOffice, type something, and save it.
2) Open GEdit Text Editor, type something, and save it.
3) In a terminal, run "sync". The previously-saved versions are now on disk.
4) Change the LibreOffice document and save again.
5) Change the Text Editor document and save again.
6) Pull out the power cord.

After powering up, the GEdit document is intact and contains the latest content, whereas the LibreOffice document is gone forever.
Comment 13 Michael Meeks 2012-01-09 03:19:42 UTC
> After powering up, the GEdit document is intact and contains the latest
> content, whereas the LibreOffice document is gone forever.

Thanks for testing. This is as expected, unless we use 'rename' - and the code needs fixing to do that. *Unfortunately* the code is far worse than a rats nest here.

Using 'rename' seems like an easy & trivial thing to do right ? it is the elegant solution to avoiding 'fsync' - but this is not really so. It is fairly easy to create a file-system for which we have no ability to create files in the same directory that can be renamed over the top - and finding other directories on similar file-systems for which 'rename' can validly replace a file in another directory is a twisted mess of nastiness. In these cases we -have- to open that file, write stuff to it - and then we can't call fsync because it is a "bust your system interactivity, break your burning DVD, jitter your sound playback, and make people think LibreOffice kills babies" syscall on old file-systems, where 'old' means 'widely deployed' :-)

Worse - there is no standard library out there that wraps 'fsync' into a 'sane_fsync' that will tell us what file-systems underlies the file, and ensure that we are only calling fsync where it is safe (cf. above). If some enterprising kernel hacker wants to create a nice, ultra-liberally licensed library that turns a dev_t into a boolean: int is_it_safe_to_fsync (dev_t *t); then I'd be more than happy to see it used un-conditionally in our system-abstraction for Unix / Linux. Which would prolly solve the problem. AFAICS dev_t's are pretty useless; perhaps we'd need to parse /proc/mounts and combine that with the horrible stat walk we already do to determine file-system type, and then start doing string munging: "ext4" - safe to 'fsync', "btrfs" - prolly a good idea to 'fsync' - or is it ? perhaps only > some version of btrfs, and for ext3/ext2 etc. there is no need - they tend not to wantonly loose your data ;-)
Comment 14 Tristan Schmelcher 2012-01-09 07:38:56 UTC
(In reply to comment #13)
> If some
> enterprising kernel hacker wants to create a nice, ultra-liberally licensed
> library that turns a dev_t into a boolean: int is_it_safe_to_fsync (dev_t *t);
> then I'd be more than happy to see it used un-conditionally in our
> system-abstraction for Unix / Linux.

I was actually looking into that recently as part of another project and it's pretty easy. Basically,

1) Stat the file to get the st_dev.
2) Stat each file in /dev/disk/by-uuid to find one with a matching st_rdev.
3) Run realpath() on that file. Now you have the device file holding the filesystem.

From there you can easily look up the filesystem type in many places, e.g. /etc/mtab, /proc/fs, or /sys/fs. Probably /etc/mtab is your best bet since it is a generic UNIX thing.

On non-Linux or on Linux without udev you could fall back to stat'ing each file in /dev rather than /dev/disk/by-uuid.
Comment 15 Michael Meeks 2012-01-10 02:26:43 UTC
Hi Tristan,

> I was actually looking into that recently as part of another project
> and it's pretty easy. Basically ...

Sounds cool :-) any chance you can knock up a nice fragment of tested C code that would do this for Linux under MPL/LGPLv3+ that we can couple up ? We'd want to add that to sal/osl/unx/file.cxx (osl_syncFile) inside some #ifdef LINUX guard I guess.

Then we'd want to add an osl_syncFile to the end of our export logic, which is slightly more involved; I suspect ucb/source/ucp/file/filstr.cxx /closeOutput/ - anyhow - first steps first.

Turning this into an intermediate / easy hack with the above pointers :-)
Comment 16 Tristan Schmelcher 2012-01-14 16:25:18 UTC
Actually on second thought there is a simpler and more robust approach, just run realpath() on the document file and then scan /etc/mtab for the longest mount point that is a prefix of the document's path. That way tmpfs and such can be detected correctly. That appears to be how df works.
Comment 17 Tristan Schmelcher 2012-01-15 14:05:34 UTC
Created attachment 55617 [details]
Small utility to test if fsync() is desirable for a certain path

Here you go. Tested on ext3, ext4, tmpfs, "none" filesystems, procfs, sysfs, and bind mounts, including mount points that are renamed after they are mounted and mount points with whitespace in their paths. I hereby release this file into the public domain.
Comment 18 Michael Meeks 2012-01-16 08:42:06 UTC
Created attachment 55644 [details]
sal fsync patch ...

Thanks Tristan ! you provoked me into doing something here, and thanks for the tips. As you say it shouldn't be so hard. I attach a sample patch to LibreOffice. The only real problem that remains (beyond removing the debug), is to work out -which- of the umpteen different entry points is actually doing the writing of our documents.

It is possible / probable that there is some 'transfer' or 'move' thing happening whereby we write data into /tmp/ and then in a second step zip it over the target [ prolly in the package/ code ].

With a bit more digging to find that, we should be able to hook this up (cf. my failures on that score in ucb/ ;-).
Comment 19 Tristan Schmelcher 2012-01-16 15:52:23 UTC
FYI, the parse_mounts in your patch won't work for paths containing whitespace. The whitespace characters are translated into octal escape sequences in the file. You'd be better off using setmntent and friends, which handle that for you.
Comment 20 Tristan Schmelcher 2012-01-17 21:29:02 UTC
Created attachment 55708 [details]
Small utility to test if fsync() is desirable for a certain path

Discovered a small bug in my code, attaching a fixed version.
Comment 21 Michael Meeks 2012-01-18 04:22:32 UTC
Hi Tristan,

> FYI, the parse_mounts in your patch won't work for paths containing
> whitespace. The whitespace characters are translated into octal escape 
> sequences in the file. You'd be better off using setmntent and friends,
> which handle that for you.

Doh - I should have read your code much more carefully; silly me. Any chance of a cleanup / simplify of my patch ? & a bit of hunting with gdb to find which bit of code is doing the final close on the output file so we can osl_syncFile it there ? :-)

Of course, you'd need a build - but it'd be much appreciated ! :-)
Comment 22 Tristan Schmelcher 2012-01-19 18:13:05 UTC
Not sure when I could fit that in ... I'll see.
Comment 23 Michael Meeks 2012-01-20 03:47:15 UTC
Doing a bit more code reading; it seems that the likeliest place to add an osl_syncFile - or somesuch UNO call to either the ucb/ or package/ code is the SfxMedium code:

sfx2/source/doc/docfile.cxx (Transfer_Impl)

perhaps this, or hereabouts:

                // copy the temporary storage into the disk spanned package
                GetStorage()->copyToStorage( xStor );
                uno::Reference < embed::XTransactedObject > xTrans( pImp->xStorage, uno::UNO_QUERY );
                if ( xTrans.is() )
                    xTrans->commit();

but some stepping through in the debugger is needed there I guess.
Comment 24 Tristan Schmelcher 2012-01-21 14:49:09 UTC
Created attachment 55932 [details]
gdb trace of all the calls to close(2) when saving a document

Seems reasonable. I used gdb to find all the stacks that call close when saving a file (in 3.4.4, since I didn't have time to build my own package with symbols). There's a lot of calls (27), but most are very similar. Transfer_Impl indeed appears in several, and SfxMedium in almost all of them.

I noticed that some of the calls are for backing up stuff. Ideally the backup(s) should be fsync'ed too so that a usable backup will be present even if a power outage occurs in the middle of saving the real file, since copying is not atomic like rename is.
Comment 25 Michael Meeks 2012-01-24 08:32:40 UTC
A really useful trace - thanks for that ! :-) it seems that a load of these are related to the flurry of /tmp files that we create (apparently for no really good reason).

Re-running with:

diff --git a/sal/osl/unx/file.cxx b/sal/osl/unx/file.cxx
index aa6cc26..6b3c56c 100644
--- a/sal/osl/unx/file.cxx
+++ b/sal/osl/unx/file.cxx
@@ -1064,6 +1064,9 @@ SAL_CALL osl_closeFile( oslFileHandle Handle )
 {
     FileHandle_Impl* pImpl = static_cast<FileHandle_Impl*>(Handle);
 
+    fprintf (stderr, "Close on '%s'\n", pImpl && pImpl->m_strFilePath ?
+             pImpl->m_strFilePath->buffer : "<null>");
+
     if (pImpl == 0)
         return osl_File_E_INVAL;
 

Might show us which one is actually the file you saved as :-) interesting too to see the backup stuff there. It was great to see the osl_copyFile method being used too - there is a comment in the impl. saying it isn't used ;-)
Comment 26 Michael Meeks 2012-01-24 12:41:45 UTC
Using my own patch for a save I got:

Close on '/tmp/lu4uaci5.tmp/lu4uaci9.tmp'
Close on '/tmp/lu4uaci5.tmp/lu4uaci9.tmp'
Close on '/tmp/lu4uaci5.tmp/lu4uacia.tmp'
Close on '/tmp/lu4uaci5.tmp/lu4uacib.tmp'
Close on '/tmp/lu4uaci5.tmp/lu4uacib.tmp'
Close on '/tmp/lu4uaci5.tmp/lu4uacib.tmp'
Close on '/tmp/.~lock.test.odt#'
Close on '/tmp/lu4uaci5.tmp/lu4uacia.tmp'
Close on '/tmp/test.odt'
Close on '/tmp/lu4uaci5.tmp/lu4uacia.tmp'
Close on '/tmp/lu4uaci5.tmp/lu4uacic.tmp'
Close on '/tmp/test.odt'
Close on '/tmp/lu4uaci5.tmp/lu4uaci8.tmp'
Close on '/home/michael/.config/libreoffice/3/user/5znZDX'

Makes you wonder what is going on: looks like an 'open and shut' case to me ;-)

I was saving as /tmp/test.odt.

HTH.
Comment 27 Robert Steiner 2012-03-08 02:47:58 UTC
I have my odt-files on a NAT system, which is accessed via NFS.

Today, my day was ruined because one odt-file which was open (by LibreOffice  3.5.0rc3 on a debian "Lenny" system) when I shut down the system yesterday was gone today without a trace. Thank god another odt-file that was also open is still there. Please note that I did not have a crash or power-outage, but just made a normal shutdown.

I'm not sure whether this is the same bug, because other users have reported that they had an empty file, but my file vanished completely. Maybe the other affected users could clarify this.

I've been using OpenOffice 2.4 for several years with the same NAT, the same NFS-setup, the same OS and the same odt-files - and I have never had dataloss.

Also, I might add that to categorize this bug as "enhancement" is really an insult - no bug can be worse than any bug that causes dataloss especially if complete files are deleted.
Comment 28 Michael Meeks 2012-03-08 04:37:09 UTC
Hi Robert, sorry to hear about your data loss:

> I've been using OpenOffice 2.4 for several years with the same NAT, the same
> NFS-setup, the same OS and the same odt-files - and I have never had dataloss.

It is worth pointing out that this (missing fsync) is -exactly- the same in the OpenOffice code, so ... while it's absence might well be the cause of your grief, it's not different to what you had before.

GregKH managed to 3/4 persuade me that we should just call fsync() wherever and risk the wrath, frustration etc. of people blaming the kernel /file-system bugs for ext2, ext3 etc. and to some extent ext4/btrfs on us.

I'm inclined to do that I guess - and skip the fstab parsing etc.
Comment 29 Robert Steiner 2012-03-08 23:15:27 UTC
Hello Michael, thanks for your reply.

Please use fsync, even if it may cause some frozen GUI for a few milliseconds, keeping your data safe should be priority to *everything else*. And if only older Linux-systems are caught, even more so. (And I say that as a debian-Lenny user. Older installations are not upgraded to the latest LibreOffice versions that often anyway - and to be honest I was only interested in SVG, otherwise I would still be happily using OpenOffice 2.4) While I was somewhat "lucky" to lose only a few hours work (if the other file were lost, it would have been a week though...) other people on https://bugs.launchpad.net/ubuntu/+source/libreoffice/+bug/817326 reported many days and weeks worth of lost data. I don't think that a faster GUI can make up for that, even if you add it up over a user's lifetime.

In fact it would be a good idea to keep a backup copy somewhere (for example in  ~/.libreoffice/3/user/backup) for all recently used documents. If you keep several versions, that would be a great feature because it would mitigate user-mistakes (most deletions are caused by user-mistakes).

I am a software developer myself and while it may be technically true that OpenOffice 2.4 also contains this bug (it's a *BUG*, not an enhancement proposal), it does not seem to appear as often. Maybe it's the combination with my NFS-setup (which has sometimes a delay of a few seconds because it has to spin up) or whatever, but 2.4 didn't do that for years, while 3.5 did it within a week after installation. In any case it's unacceptable.

So please use fsync. Most Linux-installation which will install LibreOffice >3.5 will be using ext4 anyway. (But as I said, even I as an ext3 user would prefer a frozen GUI to dataloss any day of the week.)
Comment 30 Michael Meeks 2012-03-09 09:01:37 UTC
pushed fix to master.
Comment 31 Not Assigned 2012-03-09 09:07:06 UTC
Michael Meeks committed a patch related to this issue.
It has been pushed to "master":

http://cgit.freedesktop.org/libreoffice/core/commit/?id=d3192948fe968fc4d6a8ec0e6fda232f265b3c4c

fdo#40607 - osl_syncFile having written, and avoid doing that on start
Comment 32 Not Assigned 2012-03-19 08:40:27 UTC
Michael Meeks committed a patch related to this issue.
It has been pushed to "libreoffice-3-5":

http://cgit.freedesktop.org/libreoffice/core/commit/?id=c6e22c0fc0cc4ce1508f8401c4b0c14fc89df942&g=libreoffice-3-5

fdo#40607 - osl_syncFile having written, and avoid doing that on start


It will be available in LibreOffice 3.5.2.
Comment 33 Michael Meeks 2012-03-19 10:48:38 UTC
Cor reports some significant slowdowns at first-ever-start it seems we're doing a load of fsyncs on that code-path (140 or so) - but it'd be good to get some strace -ttt -f data from him on that.
Comment 34 Rainer Bielefeld Retired 2012-04-05 07:58:34 UTC
I added Fix submitter as assignee because this will ease queries and bug tracking.
Comment 35 Robinson Tryon (qubit) 2015-12-16 00:21:16 UTC
Migrating Whiteboard tags to Keywords: (EasyHack,DifficultyInteresting,SkillCpp)
[NinjaEdit]