Bug 167170 - Turn off RSIDs by default and warn about their privacy implications
Status: UNCONFIRMED
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: LibreOffice
Version (earliest affected): Inherited From OOo
Hardware: All All
Importance: medium normal
Assignee: Not Assigned
Blocks: Privacy
Reported: 2025-06-23 04:59 UTC by Eyal Rozenberg
Modified: 2025-06-23 12:41 UTC (History)
Description Eyal Rozenberg 2025-06-23 04:59:27 UTC
tl;dr: RSIDs leak private information, directly and contextually, without author awareness (and are also not fully reliable), so we should reconsider having them enabled by default.

Now for the long version.


Introduction - about RSIDs
---------------------------

The officeooo:rsid attribute is a LibreOffice (OpenOffice?) extension of the ODF standard, based on a similar feature in Microsoft's Office Open XML (OOXML); see

https://learn.microsoft.com/en-us/dotnet/api/documentformat.openxml.wordprocessing.rsid?view=openxml-3.0.1

https://wiki.documentfoundation.org/Development/ODF_Implementer_Notes/List_of_LibreOffice_ODF_Extensions

It is a mostly-random value which distinguishes editing sessions of a document; if we open a document in LO with RSID generation enabled, and make edits to styles - these styles will be marked with a new generated RSID, which must be higher than previous RSIDs in the document.

It is added to various entities in ODF and Flat ODF documents, particularly character and paragraph styles, including automatic styles. It is intended to assist in document comparison, helping to identify originally-identical entities in the documents.

Here is an example of how they look, when generated:

 <office:automatic-styles>
  <style:style style:name="T1" style:family="text">
   <style:text-properties officeooo:rsid="001e7e69"/>
  </style:style>
 </office:automatic-styles>
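
To see which automatic styles in a document carry an RSID, one can parse content.xml directly. The following is a minimal Python sketch, not LibreOffice code: the function name rsids_by_style and the SAMPLE string are made up for illustration, and it assumes the officeooo prefix is bound to the namespace http://openoffice.org/2009/office, as in files written by LibreOffice.

```python
import xml.etree.ElementTree as ET

# Namespace URIs as they appear in LibreOffice-written ODF files.
NS = {
    "office": "urn:oasis:names:tc:opendocument:xmlns:office:1.0",
    "style": "urn:oasis:names:tc:opendocument:xmlns:style:1.0",
    "officeooo": "http://openoffice.org/2009/office",
}


def rsids_by_style(content_xml: str) -> dict[str, str]:
    """Map each automatic style name to its officeooo:rsid value, if it has one."""
    root = ET.fromstring(content_xml)
    result = {}
    for style in root.iterfind(".//office:automatic-styles/style:style", NS):
        name = style.get(f"{{{NS['style']}}}name")
        # The attribute sits on a child element such as style:text-properties.
        for props in style:
            rsid = props.get(f"{{{NS['officeooo']}}}rsid")
            if rsid is not None:
                result[name] = rsid
    return result


# Illustrative input, wrapping the fragment shown above in a root element.
SAMPLE = """<office:document-content
    xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"
    xmlns:style="urn:oasis:names:tc:opendocument:xmlns:style:1.0"
    xmlns:officeooo="http://openoffice.org/2009/office">
 <office:automatic-styles>
  <style:style style:name="T1" style:family="text">
   <style:text-properties officeooo:rsid="001e7e69"/>
  </style:style>
 </office:automatic-styles>
</office:document-content>"""

print(rsids_by_style(SAMPLE))  # {'T1': '001e7e69'}
```

For a real .odt file, the input string can be obtained with zipfile.ZipFile(path).read("content.xml").decode("utf-8"), since an ODF document is a ZIP archive.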

RSID generation is turned on by default in LO, and controlled via Tools > Options > LibreOffice Writer > Comparison.


RSIDs divulge private information I
-----------------------------------

Let us begin with the 'dry' technicality of this claim, then add usage scenarios as context.

RSIDs induce an ordered breakdown of the creation (or modification) of styles in a document. While the exact dates and times are not listed - if the styles are applied, one can tell which styled parts of a document were styled in the same session.

A particular kind of style, and the one which users are least cognizant of, is the automatic style. Automatic styles are applied not only when a user directly formats content, but sometimes completely automatically. In fact, when adding text to a paragraph from a previous editing session, it is likely (perhaps certain?) that an automatic text style would be generated and applied to the new edit, with a different RSID.

Thus, through an induction from styles, mainly automatic styles, to the styled text, RSIDs allow a partial time-ordered decomposition of the text into editing sessions.

This means, among others:

* The ability to identify relations between disparate pieces of the text, which happen to have been edited in the same session, separately from their surrounding text which had been edited earlier.
* The ability to partition/section the text along different lines than its formatting or explicit sectioning indicates.
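
The induction from styles to styled text can be sketched in a few lines of Python. This is an illustration under assumptions, not a forensic tool: the function name text_by_session, the sample document, and the second RSID value (0021a3b4) are invented; only text:span elements are considered; and sessions are ordered by numeric RSID value on the assumption, stated in the introduction, that later sessions get larger values.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

STYLE = "urn:oasis:names:tc:opendocument:xmlns:style:1.0"
TEXT = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
OOO = "http://openoffice.org/2009/office"  # assumed binding of the officeooo prefix


def text_by_session(content_xml: str) -> list[tuple[str, list[str]]]:
    """Group span text by the RSID of its style, ordered by RSID value."""
    root = ET.fromstring(content_xml)

    # Step 1: style name -> RSID, read from the style's property children.
    rsid_of = {}
    for style in root.iter(f"{{{STYLE}}}style"):
        for props in style:
            rsid = props.get(f"{{{OOO}}}rsid")
            if rsid:
                rsid_of[style.get(f"{{{STYLE}}}name")] = rsid

    # Step 2: RSID -> text of every span using a style with that RSID.
    sessions = defaultdict(list)
    for span in root.iter(f"{{{TEXT}}}span"):
        rsid = rsid_of.get(span.get(f"{{{TEXT}}}style-name"))
        if rsid:
            sessions[rsid].append("".join(span.itertext()))

    # Step 3: order sessions by RSID, read as a hexadecimal number.
    return sorted(sessions.items(), key=lambda kv: int(kv[0], 16))


# Hypothetical document: two later sessions added spans to older paragraphs.
SAMPLE_DOC = """<office:document-content
    xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"
    xmlns:style="urn:oasis:names:tc:opendocument:xmlns:style:1.0"
    xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0"
    xmlns:officeooo="http://openoffice.org/2009/office">
 <office:automatic-styles>
  <style:style style:name="T1" style:family="text">
   <style:text-properties officeooo:rsid="001e7e69"/>
  </style:style>
  <style:style style:name="T2" style:family="text">
   <style:text-properties officeooo:rsid="0021a3b4"/>
  </style:style>
 </office:automatic-styles>
 <office:body><office:text>
  <text:p>Original clause. <text:span text:style-name="T2">Late addition.</text:span></text:p>
  <text:p><text:span text:style-name="T1">Early addition.</text:span></text:p>
  <text:p>More original text. <text:span text:style-name="T2">Another late addition.</text:span></text:p>
 </office:text></office:body>
</office:document-content>"""

for rsid, pieces in text_by_session(SAMPLE_DOC):
    print(rsid, pieces)
```

In this made-up document, the two "late" spans sit in different paragraphs yet end up in the same group, which is exactly the kind of relation between disparate pieces of text described in the first bullet.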


RSIDs divulge private information II
------------------------------------

Alice and Bob are negotiating a contract with many articles and terms. Alice sends over a draft, and Bob makes all sorts of changes, in different editing sessions. In particular, Bob inserts two separate clauses in different parts of the contract; each appears to belong where it has been added. It so happens that one clause creates conditions in which the other is likely to apply. Also, Bob visits Charlie and consults him about the contract; Charlie is unhappy, and Bob makes some changes based on Charlie's comments; some of them can be clearly understood to relate to Charlie, some cannot. Finally, Bob sends a new draft to Alice.

Normally, Alice would simply be faced with the new contents of the draft, possibly in track-changes form, and/or be able to diff the previous and current drafts. However, given that the above-mentioned changes have likely resulted in new automatic styles created for added paragraphs in different editing sessions, she can reconstruct - possibly with near-perfection - a sequence of editing sessions in which the different changes were made.

This would let her discern, among other things:

1. The two seemingly-unrelated clauses are actually related, and thus their combined effect must be scrutinized.
2. One of the editing sessions is a "Charlie-related session", having changes which obviously relate to his interests; the rest of the changes in that session are likely to also be influenced or inspired by his comments and interests.
3. What Bob had immediately started working on (and is readily on his mind), and what he only changed later, after some thought. The ordering of work is particularly useful when it does not correspond to the order of the paragraphs in the contract.
4. Which clauses and terms Bob has written in "one sitting", and which he wrote gradually.
5. Which clauses and terms Bob struggled with, making edits at different times rather than changing them once.

Note that even if items (1.) and (2.) seem to you somewhat contrived or niche situations - items (3.) through (5.) are quite general.


Consequences
-------------

Much of what we put in a document divulges information about the user or author; however, when this happens:

* without conscious user intent
* without the user being reasonably notified of the information being stored
* with it being non-trivial to realize how more sensitive information can be inferred from seemingly "dry" and uninteresting information

this is problematic. One might go so far as to say that adding this information to documents, under the above conditions, is somewhat unfaithful to the user - especially considering how LibreOffice prides itself on a commitment to user privacy.

At this point one may object, citing the benefits of RSIDs for document comparison. While it is difficult for me to estimate how significant these are, I believe that the balance of considerations should lean in favor of better privacy at the expense of efficiency, by default.

Moreover, in the Tools > Options tree branch for controlling whether RSIDs are stored (LibreOffice > LibreOffice Writer > Comparison), we should include a warning about the potential leak of private information through this mechanism. We do not need to go into specific details (although the documentation could go into them - both the benefits and the detriments).
Comment 1 Mike Kaganski 2025-06-23 05:23:03 UTC
(In reply to Eyal Rozenberg from comment #0)
> if we open a document in LO with RSID generation enabled, and make
> edits to styles - these styles will be marked with a new generated RSID,
> which must be higher than previous RSIDs in the document.

A note: the "which must be higher than previous RSIDs in the document" is wrong. The values are absolutely not required to have their numerical values in any particular order.
Comment 2 Buovjaga 2025-06-23 07:02:10 UTC
Some research:

Examining and detecting academic misconduct in written documents using revision save identifier numbers in MS Word as exemplified by multiple scenarios
https://www.sciencedirect.com/science/article/pii/S2666281724001458

Establishing Genealogies of Born Digital Content: The Suitability of Revision Identifier (RSID) Numbers in MS Word for Forensic Enquiry
https://www.researchgate.net/profile/Dirk-Spennemann/publication/371872153_Establishing_Genealogies_of_Born_Digital_Content_The_Suitability_of_Revision_Identifier_RSID_Numbers_in_MS_Word_for_Forensic_Enquiry/links/649a5279c41fb852dd349ac3/Establishing-Genealogies-of-Born-Digital-Content-The-Suitability-of-Revision-Identifier-RSID-Numbers-in-MS-Word-for-Forensic-Enquiry.pdf
Comment 3 Eyal Rozenberg 2025-06-23 08:08:06 UTC
I should mention that it seems we also save RSIDs when we save DOCX files, so this bug is probably relevant to OOXML as well.

(In reply to Mike Kaganski from comment #1)
> A note: the "which must be higher than previous RSIDs in the document" is
> wrong. The values are absolutely not required to have their numerical values
> in any particular order.

Here: https://learn.microsoft.com/en-us/dotnet/api/documentformat.openxml.wordprocessing.rsid?view=openxml-3.0.1

it says that "Every editing session shall be assigned a revision save ID that is larger than all earlier ones in the same file". Do we not observe that? Or, do we observe that only for OOXML but not for ODF?
Comment 4 Mike Kaganski 2025-06-23 08:16:47 UTC
(In reply to Eyal Rozenberg from comment #3)
> it says that "Every editing session shall be assigned a revision save ID
> that is larger than all earlier ones in the same file". Do we not observe
> that? Or, do we observe that only for OOXML but not for ODF?

We don't, just as MS doesn't.
Comment 5 Mike Kaganski 2025-06-23 08:30:11 UTC
(In reply to Mike Kaganski from comment #4)
> We don't, just as MS doesn't.

Hmm, I am likely wrong about us. I must have been confused by what MS does; they actually do not keep that order, but Writer does. So yes, your point is true at least for Writer.