Creating PDF/A Documents for Long-Term Archiving

by Josef Spillner

In the Service Engineering research area, we aim to produce high-quality output: software, publications, lecture materials and other results. From time to time, this means departing from old habits and investing a bit of extra effort to reach new quality levels. For publications, excellent tools like LaTeX deliver compelling layout and typesetting. Using the standard templates and the rubber tool is enough to produce a distributable PDF quickly. At this point, quality and effort seem to be in good balance.

But PDF, while being the lingua franca for digital scientific publications, is a complex format. Apart from its lack of a proper open specification, many issues arise when the wrong contents, or mere references instead of the full contents, are placed into PDF files. Therefore, libraries increasingly prefer PDF/A, a strict subset, for long-term archiving. PDF/A is itself still a complex family of standards and interpretations. There are many articles available on how to set up a LaTeX source structure so that it produces PDF/A-compliant output. Useful ones include this one and this one. While helpful, the last micro-mile was apparently missing and it took some tinkering on a sample document to get the validation right. This blog post intends to fill the gap.

The main issues in non-compliant documents are: references to external fonts, raster graphics with an alpha channel used for figures, missing metadata, and a missing colour profile. Further trouble is often caused by the need for hyperlinks in combination with Unicode. The first two can be checked and fixed rather easily:

pdffonts <pdf>
pdfimages -list <pdf>

The first command should show yes in the “emb” column for every font, and the second command should only show “image” entries without any “smask” (ideally there are no raster graphics in the first place, although for screenshots this is mostly unavoidable). Here’s a good example:

$ pdffonts foo.pdf
name           type     encoding   emb sub uni object ID
-------------- -------- ---------- --- --- --- ---------
FMXBJF+CMR17   Type 1   Builtin    yes yes yes     381 0

$ pdfimages -list foo.pdf 
page num type  width height color comp bpc enc   interp object ID x-ppi y-ppi size  ratio
------------------------------------------------------------------------------------------
 48  0   image 300   180    rgb   3    8   image no        1041 0 210   210   7827B 4.8%
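For documents with many fonts and figures, the two checks can be automated. The following is a minimal sketch, assuming the column layout shown in the example above (other poppler versions may print different columns); the function names are made up for illustration. It works on the text that the two commands print, so the output has to be captured first.

```python
def parse_rows(listing):
    """Return the whitespace-split data rows that follow the dashed rule."""
    rows, seen_rule = [], False
    for line in listing.splitlines():
        if line.startswith("---"):
            seen_rule = True
        elif seen_rule and line.strip():
            rows.append(line.split())
    return rows


def unembedded_fonts(pdffonts_output):
    """Names of fonts whose "emb" column is not "yes"."""
    # The "type" column may contain spaces ("Type 1"), so count from the
    # right: ... emb sub uni object-number generation.
    return [row[0] for row in parse_rows(pdffonts_output)
            if len(row) >= 6 and row[-5] != "yes"]


def soft_masks(pdfimages_output):
    """(page, num) pairs of entries whose type column reads "smask"."""
    return [(row[0], row[1]) for row in parse_rows(pdfimages_output)
            if len(row) >= 3 and row[2] == "smask"]


# Sample listings copied from the good example above.
FONTS = """\
name           type     encoding   emb sub uni object ID
-------------- -------- ---------- --- --- --- ---------
FMXBJF+CMR17   Type 1   Builtin    yes yes yes     381 0
"""

IMAGES = """\
page num type  width height color comp bpc enc   interp object ID x-ppi y-ppi size  ratio
------------------------------------------------------------------------------------------
 48  0   image 300   180    rgb   3    8   image no        1041 0 210   210   7827B 4.8%
"""

if __name__ == "__main__":
    print("fonts not embedded:", unembedded_fonts(FONTS))
    print("soft masks:", soft_masks(IMAGES))
```

Both lists should be empty for a document that passes these two checks; any entry points to a font to embed or a figure to re-export without alpha channel.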

The metadata and colour profile additions can be realised with the pdfx package. However, depending on the version, the choice between inputenc’s/grffile’s/lstlisting’s utf8 and ucs/utf8x as well as the need for a metadata schema extension become important. To keep the instructions simple, they only consider the latest pdfx (2016-05-11, listed copyright years are 2015 for pdfx.sty and 2016 for pdfa.xmp) in combination with utf8.

The following files need to be added to the LaTeX source: 8bit.def, foo.xmpdata (to be customised), glyphtounicode-cmr.tex, pdfa.xmp, pdfx.sty, and sRGB_IEC61966-2-1_black_scaled.icc. Four of these files should be downloaded from point of this source while pdfx.sty and pdfa.xmp should be retrieved from the package link above.
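The foo.xmpdata file holds the metadata as a handful of macro calls. As a rough sketch of what "to be customised" means, the following Python helper renders such a file from plain values. The helper names are hypothetical; the field macros (\Title, \Author, \Keywords, \Subject) and the \sep separator are the ones I recall from the pdfx documentation, so please check them against the version you use. Escaping of special characters in the values is not handled here.

```python
def field(name, value):
    """Render one metadata macro call, e.g. \\Title{...}."""
    return "\\" + name + "{" + value + "}"


def xmpdata(title, authors, keywords, subject=""):
    """Render the contents of an .xmpdata file as a string."""
    sep = "\\sep "  # separator macro between list items (assumed from the pdfx docs)
    lines = [
        field("Title", title),
        field("Author", sep.join(authors)),
        field("Keywords", sep.join(keywords)),
    ]
    if subject:
        lines.append(field("Subject", subject))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Placeholder values; replace them with the data of your publication.
    print(xmpdata("A Sample Title",
                  ["First Author", "Second Author"],
                  ["PDF/A", "archiving"]), end="")
```

The printed text would be saved as foo.xmpdata next to foo.tex, since the base name has to match the main document.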

Even the latest pdfa.xmp does not seem to match what the validator expects, as two of its extension schema definitions lack a pdfaSchema:valueType entry. Therefore, the following patch needs to be applied to pdfa.xmp.

@@ -62,6 +62,10 @@
 </rdf:li>
 </rdf:Seq>
 </pdfaSchema:property>
+ <pdfaSchema:valueType> <!-- make validator happy -->
+ <rdf:Seq>
+ </rdf:Seq>
+ </pdfaSchema:valueType>
 </rdf:li>
 %% RRM: this declares the namespace resource for PRISM metadata
 <rdf:li rdf:parseType="Resource">
@@ -104,6 +108,10 @@
 % <pdfaProperty:description></pdfaProperty:description>
 % </rdf:li>
 </rdf:Seq></pdfaSchema:property>
+ <pdfaSchema:valueType> <!-- make validator happy -->
+ <rdf:Seq>
+ </rdf:Seq>
+ </pdfaSchema:valueType>
 </rdf:li>
 </rdf:Bag>
 </pdfaExtension:schemas>
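If the patch does not apply cleanly because your copy of pdfa.xmp differs slightly, the same change can be approximated with a short script. This is only a sketch under an assumption of mine: it adds the empty valueType block wherever a closing pdfaSchema:property is directly followed by the closing rdf:li, which is the pattern of both hunks above. The function name is hypothetical, and the result should be compared against the patch before use.

```python
VALUE_TYPE = [
    "<pdfaSchema:valueType> <!-- make validator happy -->",
    "<rdf:Seq>",
    "</rdf:Seq>",
    "</pdfaSchema:valueType>",
]


def add_empty_value_types(xmp_text):
    """Insert an empty valueType block after each property list that
    is immediately followed by the closing </rdf:li> of its schema."""
    lines = xmp_text.splitlines()
    out = []
    for i, line in enumerate(lines):
        out.append(line)
        is_comment = line.lstrip().startswith("%")
        closes_property = line.rstrip().endswith("</pdfaSchema:property>")
        next_line = lines[i + 1].strip() if i + 1 < len(lines) else ""
        if closes_property and not is_comment and next_line == "</rdf:li>":
            out.extend(VALUE_TYPE)
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    # Tiny stand-in for the relevant part of pdfa.xmp.
    sample = "</rdf:Seq></pdfaSchema:property>\n</rdf:li>\n"
    print(add_empty_value_types(sample), end="")
```

Schemas that already declare a valueType are left alone, and running the function twice does not add a second block.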

Furthermore, the .xmpi files that get generated during the build should be added to the list of ignored files of your version control system.

The main .tex file then just needs to have the following:

\usepackage[a-1b]{pdfx}

Finally, the document should be validated. A free PDF/A validator is available online. For a stock LaTeX document built without the instructions above, the output will be similar to the following:

Validating file "foo.pdf" for conformance level pdfa-1b
The separator after an 'obj' must be an EOL. (163)
The separator before an 'endobj' must be an EOL. (163)
The key Metadata is required but missing.
A device-specific color space (DeviceGray) without an appropriate output intent is used.
A device-specific color space (DeviceRGB) without an appropriate output intent is used.
The value of the key SMask is an image but must be None. (9)
The key S has a value Transparency which is prohibited. (11)
The separator before 'endstream' must be an EOL.
The document does not conform to the requested standard.
The file format (header, trailer, objects, xref, streams) is corrupted.
The document contains device-specific color spaces.
The document contains transparency.
The document's meta data is either missing or inconsistent or corrupt.
Done.

After following the instructions, in contrast, the output will eventually be as desired:

foo.pdf validated successfully.

For your convenience, all six files including the applied patch are made available for download as a bundle from a dedicated Service Prototyping Lab repository. The remaining effort then boils down to using the files and curating the metadata of your publications. It is then up to the publishers of scientific papers to reward the higher production quality.

Tags: publishing

7 Comments

  • Thanks for that tutorial. However, as soon as I include the pdfx package as described, I get tons of errors all over the document: “Missing number, treated as zero. …” and “Illegal unit of measure (pt inserted)” every time listings (e.g., lstinline) are used. Is this a known incompatibility? How can it be fixed?

  • Dear Josef,

    Thank you very much for the tutorial!
    Unfortunately, it is not working well in my case. I use the program Kile on Debian. I think I have an additional program with the packages. After I included the files you suggested and made the additional changes, I kept getting the following error:

    ! Undefined control sequence.
    \pdfx@xmpunimarkup …perscript \LIIXUmapTeXnames
    \csname psdmapshortnames\e…

    I then tried with your working example and it was the same. Please, do you have any idea what could be the source of this error?

    Thank you very much in advance!

    Fermin

