In the Service Engineering research area, we aim at producing high-quality output in terms of software, publications, lecture materials and other results. From time to time, this implies departing from old habits and taking a bit of extra effort to reach new quality levels. For publications, there are excellent tools like LaTeX to achieve a compelling layout and typesetting. Using the standard templates and the rubber tool is enough to produce a distributable PDF quickly. Now, quality and effort are seemingly in a good balance.
But PDF, while being the lingua franca for digital scientific publications, is a complex format. Apart from the lack of a proper open specification, many issues arise from placing the wrong contents, or just references instead of full contents, into PDF files. Therefore, libraries increasingly prefer PDF/A as a strict subset for long-term archiving. PDF/A is still a complex family of standards and interpretations. There are many articles available on how to convert a LaTeX source structure into a PDF/A-compliant one. Useful ones include this one and this one. While helpful, they apparently leave out the last micro-mile, and it took some tinkering on a sample document to get the validation right. This blog post intends to fill the gap.
The main issues of non-compliant documents are: references to external fonts, raster graphics with an alpha channel used for figures, missing metadata, and a missing colour profile. Further trouble is often caused by the need for hyperlinks in combination with Unicode. The first two can be checked and fixed rather easily:
    pdffonts <pdf>
    pdfimages -list <pdf>
The first command should show “yes” in the “emb” column for every font, and the second command should only show “image” entries without any “smask” (and ideally no raster graphics in the first place, although for screenshots this is mostly unavoidable). Here’s a good example:
    $ pdffonts foo.pdf
    name           type     encoding   emb sub uni object ID
    -------------- -------- ---------- --- --- --- ---------
    FMXBJF+CMR17   Type 1   Builtin    yes yes yes    381  0

    $ pdfimages -list foo.pdf
    page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
    --------------------------------------------------------------------------------------------
      48     0 image     300   180  rgb     3   8  image  no      1041  0   210   210 7827B 4.8%
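For documents with many fonts and figures, the two checks can be automated. The following is a minimal sketch, not part of the original workflow: the helper names non_embedded_fonts and smask_count are hypothetical, and both read the tool output from standard input. It assumes the column layout shown above, where the last five fields of each pdffonts row are emb, sub, uni and the two-part object ID, and the third field of each pdfimages row is the type.

```shell
# Hypothetical helpers (not from the original post); both filter tool output on stdin.

# Print the name of every font whose "emb" column is not "yes".
# The type column may contain spaces ("Type 1"), so the emb column
# is located by counting fields from the right-hand side.
non_embedded_fonts() {
    awk 'NR > 2 && NF >= 6 && $(NF-4) != "yes" { print $1 }'
}

# Print the number of rows whose type column is "smask".
smask_count() {
    awk 'NR > 2 && $3 == "smask" { n++ } END { print n + 0 }'
}
```

Typical use would be `pdffonts foo.pdf | non_embedded_fonts`, which should print nothing, and `pdfimages -list foo.pdf | smask_count`, which should print 0.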
The metadata and colour profile additions can be realised with the pdfx package. However, depending on the version, the choice between inputenc’s/grffile’s/lstlisting’s utf8 and ucs/utf8x as well as the need for a metadata schema extension become important. To keep the instructions simple, they only consider the latest pdfx (2016-05-11, listed copyright years are 2015 for pdfx.sty and 2016 for pdfa.xmp) in combination with utf8.
The following files need to be added to the LaTeX source: 8bit.def, foo.xmpdata (to be customised), glyphtounicode-cmr.tex, pdfa.xmp, pdfx.sty, and sRGB_IEC61966-2-1_black_scaled.icc. Four of these files should be downloaded from point of this source while pdfx.sty and pdfa.xmp should be retrieved from the package link above.
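As an illustration of the file to be customised, a minimal foo.xmpdata could be created as sketched here. The metadata values are placeholders, and the selection of fields is an assumption about what a typical paper needs; \Title, \Author, \Keywords, \Subject and the \sep separator are macros provided by pdfx. The file name has to match the main document, so foo.xmpdata belongs to foo.tex.

```shell
# Write a placeholder metadata file next to foo.tex.
# Every value below is an example and must be replaced with the
# actual data of the publication.
cat > foo.xmpdata <<'EOF'
\Title{An Example Paper Title}
\Author{Jane Doe\sep John Roe}
\Keywords{PDF/A\sep LaTeX\sep archiving}
\Subject{A one-sentence summary of the paper.}
EOF
```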
Even the latest pdfa.xmp does not seem to match what the validation expects, as two of its schema definitions lack a pdfaSchema:valueType element. Therefore, the following patch needs to be applied to pdfa.xmp.
    @@ -62,6 +62,10 @@
     </rdf:li>
     </rdf:Seq>
     </pdfaSchema:property>
    + <pdfaSchema:valueType> <!-- make validator happy -->
    + <rdf:Seq>
    + </rdf:Seq>
    + </pdfaSchema:valueType>
     </rdf:li>
     %% RRM: this declares the namespace resource for PRISM metadata
     <rdf:li rdf:parseType="Resource">
    @@ -104,6 +108,10 @@
     % <pdfaProperty:description></pdfaProperty:description>
     % </rdf:li>
     </rdf:Seq></pdfaSchema:property>
    + <pdfaSchema:valueType> <!-- make validator happy -->
    + <rdf:Seq>
    + </rdf:Seq>
    + </pdfaSchema:valueType>
     </rdf:li>
     </rdf:Bag>
     </pdfaExtension:schemas>
Furthermore, the generated .xmpi files should be added to the list of files ignored by version control.
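If the sources live in a Git repository (an assumption; any other version control system has its own ignore mechanism), the ignore rule can be added as follows. The guard only keeps the entry from being appended twice.

```shell
# Append '*.xmpi' to .gitignore unless that exact line is already present.
# Assumes Git and that the command runs in the directory of the LaTeX sources.
grep -qxF '*.xmpi' .gitignore 2>/dev/null || printf '%s\n' '*.xmpi' >> .gitignore
```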
The main .tex file then just needs to have the following:
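A minimal sketch of such a main file, created from the shell for convenience, might look like this. It assumes the document is called foo.tex, so that pdfx picks up foo.xmpdata, and it assumes conformance level PDF/A-1b, which is the level requested in the validation run further down. The a-1b option and the order of the two packages are assumptions about a typical setup, not a copy of the sample document's preamble.

```shell
# Write a minimal main file; the body text is a placeholder.
cat > foo.tex <<'EOF'
\documentclass{article}
% UTF-8 input, the only encoding variant considered in this post
\usepackage[utf8]{inputenc}
% Assumed option: a-1b requests PDF/A-1b; pdfx reads \jobname.xmpdata
\usepackage[a-1b]{pdfx}
\begin{document}
Hello, PDF/A.
\end{document}
EOF
```

Since pdfx loads hyperref on its own, a separate hyperref line is usually not needed. The file can then be built as usual, for example with rubber --pdf foo or pdflatex foo.tex.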
Finally, the document should be validated. A free PDF/A validator is available online. Without the instructions above, on a stock LaTeX document, the output will be similar to the following one:
    Validating file "foo.pdf" for conformance level pdfa-1b
    The separator after an 'obj' must be an EOL. (163)
    The separator before an 'endobj' must be an EOL. (163)
    The key Metadata is required but missing.
    A device-specific color space (DeviceGray) without an appropriate output intent is used.
    A device-specific color space (DeviceRGB) without an appropriate output intent is used.
    The value of the key SMask is an image but must be None. (9)
    The key S has a value Transparency which is prohibited. (11)
    The separator before 'endstream' must be an EOL.
    The document does not conform to the requested standard.
    The file format (header, trailer, objects, xref, streams) is corrupted.
    The document contains device-specific color spaces.
    The document contains transparency.
    The document's meta data is either missing or inconsistent or corrupt.
    Done.

By following the instructions, however, the output will eventually be as desired:
foo.pdf validated successfully.
For your convenience, all six files including the applied patch are made available for download as a bundle from a dedicated Service Prototyping Lab repository. The remaining effort then boils down to using the files, to curating the metadata of your publications, and to publishers of scientific papers rewarding the higher production quality.