Bug 127073 - REDACTION: Make redacted PDF-document searchable
Status: NEW
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Printing and PDF export
Version: 6.3.0.4 release (earliest affected)
Hardware: All All
Importance: medium enhancement
Assignee: Not Assigned
URL:
Whiteboard:
Keywords:
Depends on:
Blocks: Redaction
Reported: 2019-08-21 08:13 UTC by Ulrich Windl
Modified: 2019-09-08 19:35 UTC

See Also:
Crash report or crash signature:


Attachments
ZIP-Archive with test documents (ODT, PDFs) (503.34 KB, application/zip)
2019-08-27 12:00 UTC, Ulrich Windl

Description Ulrich Windl 2019-08-21 08:13:47 UTC
Converting a document to a pixel image for redaction just seems plain wrong (the current implementation is basically the digital version of "print it, blacken it, scan it again to make a new document"). Having used a similar function in Adobe's Acrobat Pro, I feel it could be done much better:
Instead of pixelizing the whole document (and blowing it up), the portions to be redacted should be replaced by rectangular shapes anchored at the proper places (which will require that anchoring shapes works correctly). Then the user would have the choice whether to fill those shapes black, white, with patterns, etc.

Now the resulting document would still be searchable (except for the redacted portions).

Metadata redaction is another challenge, however.
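
A minimal sketch of the "replace, don't rasterize" idea described above, using a toy page model. The names `Rect`, `TextRun`, `Page` and `redact_by_replacement` are hypothetical and exist only for illustration; they are not LibreOffice or PDF library APIs, and a real implementation would have to operate on the actual PDF content streams.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Rect:
    x0: float
    y0: float
    x1: float
    y1: float

    def intersects(self, other: "Rect") -> bool:
        # Axis-aligned rectangles overlap when they overlap on both axes.
        return (self.x0 < other.x1 and other.x0 < self.x1
                and self.y0 < other.y1 and other.y0 < self.y1)


@dataclass(frozen=True)
class TextRun:
    text: str
    box: Rect


@dataclass
class Page:
    runs: list[TextRun] = field(default_factory=list)
    shapes: list[Rect] = field(default_factory=list)

    def search(self, needle: str) -> bool:
        # Searching only sees text that is still present as text.
        return any(needle in run.text for run in self.runs)


def redact_by_replacement(page: Page, area: Rect) -> Page:
    """Drop every text run touching `area` and put a shape in its place."""
    kept = [run for run in page.runs if not run.box.intersects(area)]
    return Page(runs=kept, shapes=page.shapes + [area])


page = Page(runs=[
    TextRun("Quarterly report", Rect(10, 10, 200, 30)),
    TextRun("Account number 12345", Rect(10, 40, 200, 60)),
])
redacted = redact_by_replacement(page, Rect(0, 35, 210, 65))
```

In this toy model, `redacted.search("Quarterly")` still succeeds while `redacted.search("12345")` does not, which is the behaviour the proposal asks for. The sketch removes whole runs; redacting part of a run, images, and metadata are not covered.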
Comment 1 Roman Kuznetsov 2019-08-22 19:12:01 UTC
But then any enemy (or journalist) could delete your shapes using a good PDF editor and see your secret info. Exporting to a bitmap PDF guarantees that our secret info remains secret.

I think it's WONTFIX

Muhammet, what do you think?
Comment 2 Muhammet Kara 2019-08-22 20:40:03 UTC
(In reply to Ulrich Windl from comment #0)
> (the current implementation is basically the digital version of "print it,
> blacken it, scan it again to make a new document").

That's exactly how it was designed.

(In reply to Roman Kuznetsov from comment #1)
> I think it's WONTFIX
> 
> Muhammet, what do you think?

What the reporter describes might be a nice (but risky; check the internet for security incidents with similar implementations) implementation of a redaction feature, but not this one. Unless there is a strong will and the necessary resources to implement another version of the redaction feature, this is a WONTFIX for me.
Comment 3 Ulrich Windl 2019-08-26 07:25:23 UTC
(In reply to Roman Kuznetsov from comment #1)
> But then any enemy (or journalists) will delete your shapes using good PDF
> Editor and will see your secret info. Export to bitmap PDF will guarantee us
> that our secret info will remain secret.

You missed what I wrote: I did not write "overlaid by" but "replaced with".
Of course, the implementation will be harder. So if an enemy deletes the rectangles, there will just be "holes" in the document where the rectangles were.
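
The difference between "overlaid by" and "replaced with" can be illustrated with a small sketch. It assumes a deliberately simplified page model (a dict with a text layer and a list of shape positions); all function names are hypothetical and not taken from any real PDF tool.

```python
def redact_overlay(lines, secret):
    # Overlay: the text layer is untouched, a shape merely covers the line.
    hits = [i for i, line in enumerate(lines) if secret in line]
    return {"text": list(lines), "shapes": hits}


def redact_replace(lines, secret):
    # Replace: matching lines are removed from the text layer,
    # and a shape marks each resulting gap.
    hits = [i for i, line in enumerate(lines) if secret in line]
    text = [None if i in hits else line for i, line in enumerate(lines)]
    return {"text": text, "shapes": hits}


def strip_shapes(page):
    # What someone with a PDF editor could do: delete every rectangle.
    return {"text": page["text"], "shapes": []}


def recoverable(page, secret):
    # True if the secret is still present somewhere in the text layer.
    return any(line is not None and secret in line for line in page["text"])


lines = ["Public heading", "Password: hunter2"]
overlay_leaks = recoverable(strip_shapes(redact_overlay(lines, "hunter2")), "hunter2")
replace_leaks = recoverable(strip_shapes(redact_replace(lines, "hunter2")), "hunter2")
```

Under these assumptions only the overlay variant leaks after the shapes are deleted; the replace variant leaves a hole, as described above.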
Comment 4 Ulrich Windl 2019-08-26 07:29:59 UTC
(In reply to Muhammet Kara from comment #2)
> What the reporter describes might be a nice (but risky -check the internet
> for security incidents of similar implementations-) implementation of a
> redaction feature, but not this one. Unless there is a strong will, and
> necessary resources to implement another version of the redaction feature,
> this is a WONTFIX for me.

Well, I think another "poor man's solution" was selected for this interesting feature. I can make an offer: provide a simple test document in which the parts to be redacted are marked in some color, and I'll return a version "redacted" with Adobe's product. Then you can look at what you find in the document and compare it to the "pixel version". If you don't know better, you might think things have to be that way.
Comment 5 Ulrich Windl 2019-08-27 12:00:12 UTC
Created attachment 153686
ZIP-Archive with test documents (ODT, PDFs)

The ZIP contains:
Testpage.odt: A simple text document with images used for PDF export
Testpage.pdf: PDF produced from LO6.2
Testpage-Distiller.pdf: PDF produced by Adobe's Distiller

Note: For some reason the Adobe tool cannot redact parts of an image for the PDF created by LO, but it can for the PDF created by Distiller. Thus the extra file.

Testpage-Distiller-redacted.pdf: Parts of the text and parts of the images being redacted.

  Length      Date    Time    Name
---------  ---------- -----   ----
   312544  2019-08-27 13:41   Testpage-Distiller-redacted.pdf
    89760  2019-08-27 13:33   Testpage-Distiller.pdf
    17055  2019-08-27 13:18   Testpage.odt
   135472  2019-08-27 13:19   Testpage.pdf
---------                     -------

Try the redacted PDF, searching for text: you can find any text that has not been redacted, while you cannot find text that has been redacted. I think THAT's the way to go.

Note: I just realized that the original ODT had images linked; thus they are missing in the ZIP. However, any image will do...
Comment 6 Dieter 2019-09-01 11:40:41 UTC
Ulrich, I agree that a redacted PDF should be searchable (currently it is not). Is this, in short, your proposal in this bug report?
Comment 7 Ulrich Windl 2019-09-02 06:31:43 UTC
(In reply to Dieter Praas from comment #6)
> Ulrich, I agree that redacted PDF should be searchable (currently it is not).
> Is this - in short - your proposal in this bug report?

My basic point was that a PDF page is much more than a pixel image, but redaction as implemented now just seems to create such.
Being able to search is probably the point most noticeable to the user, but document structure and file size are also important issues. Finally, I wonder whether you'll be able to make good-looking high-resolution prints of such redacted documents.
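
To give the file-size and print-quality concern a rough scale, here is a back-of-the-envelope calculation of how many pixels a rasterized A4 page needs. The resolutions (300 and 600 DPI) and the 3 bytes per pixel are assumptions chosen for illustration; the thread does not state what resolution LibreOffice's redaction export uses, and image compression inside the PDF would reduce the stored size.

```python
MM_PER_INCH = 25.4


def raster_size(width_mm, height_mm, dpi, bytes_per_pixel=3):
    """Return (width_px, height_px, uncompressed_bytes) for one page."""
    width_px = round(width_mm / MM_PER_INCH * dpi)
    height_px = round(height_mm / MM_PER_INCH * dpi)
    return width_px, height_px, width_px * height_px * bytes_per_pixel


# A4 is 210 mm x 297 mm.
a4_300 = raster_size(210, 297, 300)
a4_600 = raster_size(210, 297, 600)
print(a4_300, a4_600)
```

Doubling the resolution roughly quadruples the uncompressed pixel data, whereas text kept as text prints sharply at any resolution without growing the file.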
Comment 8 Dieter 2019-09-03 09:18:21 UTC
(In reply to Ulrich Windl from comment #7)
> My basic point was that a PDF page is much more than a pixel image, but
> redaction as implemented now just seems to create such.
> Being able to search is probably the point most noticeable to the user, but
> document structure, and file size are also important issues. Finally I
> wonder whether you'll be able to make good-looking high-resolution prints of
> such redacted documents.

I agree with all of that, but you should focus on one topic per bug. So my suggestion is to change the bug summary to "Make redacted document readable", because I suppose that this would also solve the other problems. I changed the bug summary. Please feel free to make a different proposal if you don't agree.
Comment 9 Cor Nouws 2019-09-08 19:35:31 UTC
(In reply to Ulrich Windl from comment #0)
> Converting a document to a pixel image for redaction just seems plain wrong
> ...
As an intermediate solution, OCR could be used for these cases?