👩‍💻 chrismanbrown.gitlab.io

PDF Crimes

and what to do about them

2025-02-11

Contents🔗

  1. Doom
  2. PDF Extensions
  3. Too Many PDF/A
  4. Choose One
  5. Converting and Saving
  6. References

Doom🔗

I am deeply disturbed and borderline offended by the proliferation of PDF crimes we have been seeing lately.

These crimes take the form of turning a PDF into a platform that can run arbitrary code.

Just look:

This has made me curious about how to ensure that the PDF I’m about to open is, you know, a portable document. And not an operating system.

I think what I’m interested in is some strict subset of PDF.

PDF Extensions🔗

This led me to learn a little bit about some subsets / extensions of the PDF standard:

What I am interested in is the PDF/A extension. Because its goal is long-term digital preservation, it prohibits a lot of PDF features such as audio and video content, JavaScript, and linked (as opposed to embedded) content such as fonts.

This sounds like exactly what I want! A self-contained PDF that is just some text and images with no runtime for JavaScript or WebAssembly.

Far less opportunity for PDF crime.
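As a rough first look before opening a file, you can scan its raw bytes for the PDF name tokens that tend to accompany active content. This is a minimal sketch, not a validator: the token list and the function name `count_suspicious_names` are my own assumptions, and anything tucked inside a compressed object stream will not show up in a plain byte scan. A zero count is a hint, not a guarantee.

```python
import re
from collections import Counter

# Name tokens that often accompany active content. The list is
# illustrative (an assumption of this sketch), not exhaustive.
SUSPICIOUS_NAMES = (
    b"/JavaScript",
    b"/JS",
    b"/OpenAction",
    b"/Launch",
    b"/RichMedia",
)


def count_suspicious_names(data: bytes) -> Counter:
    """Count occurrences of each suspicious name token in raw PDF bytes."""
    counts = Counter()
    for name in SUSPICIOUS_NAMES:
        # Require the token to end here, so that /JS does not match /JSFoo.
        pattern = re.escape(name) + rb"(?![A-Za-z0-9_.#-])"
        counts[name.decode("ascii")] = len(re.findall(pattern, data))
    return counts
```

To use it, read the file in binary mode (`open(path, "rb").read()`), pass the bytes in, and eyeball any non-zero counts.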

Too Many PDF/A🔗

Over the years there have been several versions of PDF/A published under ISO 19005.

Additionally! Each version has its own levels of conformance. The two you will see most often are Level B (Basic) and Level A (Accessible). PDF/A-2 and PDF/A-3 add a third, Level U (Unicode).

Two levels for A-1 plus three each for A-2 and A-3 makes for 8 different possible targets.

Well, 11. PDF/A-4 drops the A/B/U levels and instead comes in three flavors: plain PDF/A-4, PDF/A-4e (E for Engineering; PDF/A-4e supersedes PDF/E) and PDF/A-4f (F for Files; as in “Files, the embedding of arbitrary”; and I’m not sure what PDF/A-4f brings to the table in terms of embedding files that A-3 and A-4 don’t already have…)

But let’s not dwell on any of that.

Finally, no subsequent version of PDF/A is meant to obsolete any previous version. They are all meant to coexist side by side. Newer versions simply support newer features. So it’s not like you can just choose the latest version based on the assumption that the older versions are no longer supported.

So which version should we choose?

Choose One🔗

PDF/A-2b

A-1 is the simplest (most restrictive) format, but it prohibits transparency, which a whole lot of PDFs have. A-2 is probably the most restrictive version that still supports most of the modern features you are likely to run into.

Also, A-1 is based on PDF 1.4, an Adobe Systems format. A-2 is based on PDF 1.7 as published in ISO 32000-1, an ISO standard. And I want to support open standards.

Level A conformance isn’t really something you can add to a document after the fact. Not without a lot of tedious manual work. If you’re creating a bespoke PDF then PDF/A-2a is a fine choice. But if you have a PDF you found somewhere in the wild and are preparing it for archiving then PDF/A-2b is probably the easiest route to go.

Converting and Saving🔗

There are some online tools you can use to convert PDF to PDF/A. You can find them by duckducking go “PDF to PDF/A”.

I just tried one on an ebook I have that was PDF 1.5 and 54.6 MB. Now it is PDF 1.7 (so probably PDF/A-2) and 28.6 MB.
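The "PDF 1.5" and "PDF 1.7" numbers come from the `%PDF-x.y` comment at the very top of the file, which is easy to read yourself. A small sketch follows; the function name `pdf_header_version` is mine. Two caveats: the document catalog can carry a `/Version` entry that overrides the header, and the header version only narrows things down (PDF/A-1 builds on PDF 1.4, A-2 and A-3 on PDF 1.7, A-4 on PDF 2.0). It does not prove PDF/A conformance.

```python
import re

HEADER_RE = re.compile(rb"%PDF-(\d)\.(\d)")


def pdf_header_version(data: bytes):
    """Return (major, minor) from the %PDF-x.y header, or None if absent."""
    # The header is expected at or near the start of the file,
    # so only the first kilobyte is searched.
    match = HEADER_RE.search(data[:1024])
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))
```

Call it with the first chunk of the file read in binary mode, for example `pdf_header_version(open(path, "rb").read(1024))`.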

I fed the result through an online PDF/A validator. And it didn’t pass. I think because it couldn’t successfully autodetect the PDF/A version. But when I explicitly told it the version (PDF/A-2b), it validated.
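A PDF/A file is supposed to declare which flavor it claims to be in its XMP metadata, using the `pdfaid:part` and `pdfaid:conformance` properties. If a validator can't autodetect the version, checking for that declaration is a reasonable first step. Here is a rough sketch, assuming the XMP packet is stored uncompressed so a byte search can see it; the function name `declared_pdfa_flavour` is hypothetical. It only reports what the file claims. Whether the claim holds up is still the validator's job.

```python
import re

# XMP may serialize a property as an element (<pdfaid:part>2</pdfaid:part>)
# or as an attribute (pdfaid:part="2"); both forms are accepted here.
PART_RE = re.compile(rb"pdfaid:part(?:>|=[\"'])\s*(\d)")
CONF_RE = re.compile(rb"pdfaid:conformance(?:>|=[\"'])\s*([A-Za-z])")


def declared_pdfa_flavour(data: bytes):
    """Return the declared flavour, e.g. 'PDF/A-2b', or None if undeclared."""
    part = PART_RE.search(data)
    if part is None:
        return None
    flavour = "PDF/A-" + part.group(1).decode("ascii")
    conformance = CONF_RE.search(data)
    if conformance is not None:
        flavour += conformance.group(1).decode("ascii").lower()
    return flavour
```

If this returns `None` for a file that a converter just produced, a missing or hidden declaration would be one plausible explanation for the failed autodetection.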

Preview.app has the ability to export to PDF/A but I have had mixed success with it.

LibreOffice seems to have the best support of anything I’ve tried so far for exporting a document directly to PDF/A.
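The export can also be scripted instead of clicked through. The sketch below builds a headless `soffice` command line using the JSON filter-options syntax. It assumes a reasonably recent LibreOffice (I believe 7.4 or later) with `soffice` on your PATH, a Writer document as input, and that `SelectPdfVersion` value 2 selects PDF/A-2b. Treat those as assumptions and check them against the LibreOffice help for your version. The helper names are hypothetical.

```python
import json
import shutil
import subprocess

# Assumed SelectPdfVersion values for LibreOffice's PDF export filter:
# 1 = PDF/A-1b, 2 = PDF/A-2b, 3 = PDF/A-3b.
PDFA_2B = 2


def build_export_command(source, outdir=".", pdf_version=PDFA_2B, soffice="soffice"):
    """Build the argument list for a headless PDF/A export of a Writer document."""
    options = {"SelectPdfVersion": {"type": "long", "value": str(pdf_version)}}
    convert_to = "pdf:writer_pdf_Export:" + json.dumps(options, separators=(",", ":"))
    return [soffice, "--headless", "--convert-to", convert_to, "--outdir", outdir, source]


def export_to_pdfa(source, outdir="."):
    """Run the export; raises if soffice is missing or exits with an error."""
    if shutil.which("soffice") is None:
        raise FileNotFoundError("soffice not found on PATH")
    subprocess.run(build_export_command(source, outdir), check=True)
```

Building the command separately from running it means you can print it and inspect it before anything touches your files. Either way, run the output through a validator afterwards.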

References🔗

PDF Family (Library of Congress)
https://www.loc.gov/preservation/digital/formats/fdd/fdd000030.shtml
PDF/A
https://en.wikipedia.org/wiki/PDF/A
White Paper: PDF/A – the standard for long-term archiving
https://www.pdf-tools.com/public/downloads/whitepapers/whitepaper-pdfa.pdf
How to Pick the Right Version of PDF/A
https://apryse.com/blog/pdfa-format/how-to-pick-right-version-of-pdfa
Digitization of Text Documents Using PDF/A
https://ital.corejournals.org/index.php/ital/article/view/9878
PDF/A in a Nutshell 2.0
https://pdfa.org/wp-content/uploads/2013/05/PDFA_in_a_Nutshell_211.pdf
A Guide to Choosing the Right PDF Format
https://smallpdf.com/blog/pdfx-vs-pdfa-a-guide-to-choosing-the-right-pdf-format
lab6 pdf experiments
https://lab6.com/
Minimal PDF
https://brendanzagaeski.appspot.com/0004.html
Hand-coded PDF tutorial
https://brendanzagaeski.appspot.com/0005.html
pdftools
https://github.com/uroesch/pdftools