{"id":10463,"date":"2016-08-09T12:31:54","date_gmt":"2016-08-09T10:31:54","guid":{"rendered":"https:\/\/blog.zhaw.ch\/icclab\/?p=10463"},"modified":"2019-08-05T14:35:27","modified_gmt":"2019-08-05T12:35:27","slug":"creating-pdfa-documents-for-long-term-archiving","status":"publish","type":"post","link":"https:\/\/blog.zhaw.ch\/icclab\/creating-pdfa-documents-for-long-term-archiving\/","title":{"rendered":"Creating PDF\/A Documents for Long-Term Archiving"},"content":{"rendered":"\n<p>by <a href=\"https:\/\/blog.zhaw.ch\/icclab\/josef-spillner\/\">Josef Spillner<\/a><\/p>\n\n\n<p>In the Service Engineering research area, we aim to produce high-quality output in terms of software, publications, lecture materials and other results. From time to time, this implies departing from old habits and taking a bit of extra effort to reach new quality levels. For publications, there are excellent tools like <a href=\"https:\/\/en.wikibooks.org\/wiki\/LaTeX\/Introduction\">LaTeX<\/a> to achieve a compelling layout and typesetting. Using the standard templates and the <a href=\"https:\/\/launchpad.net\/rubber\">rubber<\/a> tool is enough to produce a distributable PDF quickly. Now, quality and effort are seemingly in a good balance.<\/p>\n<p><!--more--><\/p>\n<p>But PDF, while being the lingua franca for digital scientific publications, is a <a href=\"https:\/\/www.adobe.com\/devnet\/pdf\/pdf_reference.html\">complex format<\/a>. Apart from lacking a proper open specification, many issues arise from placing the wrong contents or just references instead of full contents into PDF files. Therefore, libraries increasingly prefer <a href=\"https:\/\/en.wikipedia.org\/wiki\/PDF\/A\">PDF\/A<\/a> as a strict subset for long-term archiving. PDF\/A is still a complex family of standards and interpretations. There are many articles available on how to convert a LaTeX source structure into a PDF\/A-compliant one. 
Useful ones include <a href=\"http:\/\/kulturreste.blogspot.ch\/2014\/06\/grrrr-oder-wie-man-mit-latex-vielleicht.html\">this one<\/a> and <a href=\"http:\/\/www.mathstat.dal.ca\/~selinger\/pdfa\/\">this one<\/a>. While they are helpful, the last micro-mile was still missing, and it took some tinkering with a sample document to get the validation right. This blog post intends to fill that gap.<\/p>\n<p>The main issues with non-compliant documents are references to external fonts, raster graphics with an alpha channel, missing metadata, and a missing colour profile. Further trouble is often caused by the need for hyperlinks in combination with Unicode. The first two can be checked and fixed rather easily:<\/p>\n<pre>pdffonts &lt;pdf&gt;\npdfimages -list &lt;pdf&gt;<\/pre>\n<p>The first command should report every font as embedded (&#8220;yes&#8221; in the &#8220;emb&#8221; column), and the second command should list only &#8220;image&#8221; entries without any &#8220;smask&#8221; (ideally there are no raster graphics in the first place, although for screenshots they are mostly unavoidable). 
Here&#8217;s a good example:<\/p>\n<pre>$ pdffonts foo.pdf\nname           type     encoding   <strong><span style=\"text-decoration: underline\">emb<\/span><\/strong> sub uni object ID\n-------------- -------- ---------- --- --- --- ---------\nFMXBJF+CMR17   Type 1   Builtin    <span style=\"text-decoration: underline\"><strong>yes<\/strong><\/span> yes yes     381 0\n\n$ pdfimages -list foo.pdf \npage num <span style=\"text-decoration: underline\"><strong>type<\/strong><\/span>  width height color comp bpc enc   interp object ID x-ppi y-ppi size  ratio\n------------------------------------------------------------------------------------------\n 48  0   <span style=\"text-decoration: underline\"><strong>image<\/strong><\/span> 300   180    rgb   3    8   image no        1041 0 210   210   7827B 4.8%<\/pre>\n<p>The metadata and colour profile additions can be realised with the <a href=\"http:\/\/www.ctan.org\/tex-archive\/macros\/latex\/contrib\/pdfx\/\">pdfx<\/a> package. However, depending on the version, the choice between inputenc&#8217;s\/grffile&#8217;s\/lstlisting&#8217;s utf8 and ucs\/utf8x as well as the need for a metadata schema extension become important. To keep the instructions simple, they only consider the latest pdfx (2016-05-11, listed copyright years are 2015 for pdfx.sty and 2016 for pdfa.xmp) in combination with utf8.<\/p>\n<p>The following files need to be added to the LaTeX source: 8bit.def, foo.xmpdata (to be customised), glyphtounicode-cmr.tex, pdfa.xmp, pdfx.sty, and sRGB_IEC61966-2-1_black_scaled.icc. Four of these files should be downloaded from point 4 of <a href=\"http:\/\/www.mathstat.dal.ca\/~selinger\/pdfa\/#4\">this source<\/a>, while pdfx.sty and pdfa.xmp should be retrieved from the package link above.<\/p>\n<p>Even the latest pdfa.xmp does not seem to match what the validation expects, as it does not define two pdfaSchema:valueType entries in its XML Schema definition. 
Therefore, the following patch needs to be applied to pdfa.xmp.<\/p>\n<pre>@@ -62,6 +62,10 @@\n &lt;\/rdf:li&gt;\n &lt;\/rdf:Seq&gt;\n &lt;\/pdfaSchema:property&gt;\n+ &lt;pdfaSchema:valueType&gt; &lt;!-- make validator happy --&gt;\n+ &lt;rdf:Seq&gt;\n+ &lt;\/rdf:Seq&gt;\n+ &lt;\/pdfaSchema:valueType&gt;\n &lt;\/rdf:li&gt;\n %% RRM: this declares the namespace resource for PRISM metadata\n &lt;rdf:li rdf:parseType=\"Resource\"&gt;\n@@ -104,6 +108,10 @@\n % &lt;pdfaProperty:description&gt;&lt;\/pdfaProperty:description&gt;\n % &lt;\/rdf:li&gt;\n &lt;\/rdf:Seq&gt;&lt;\/pdfaSchema:property&gt;\n+ &lt;pdfaSchema:valueType&gt; &lt;!-- make validator happy --&gt;\n+ &lt;rdf:Seq&gt;\n+ &lt;\/rdf:Seq&gt;\n+ &lt;\/pdfaSchema:valueType&gt;\n &lt;\/rdf:li&gt;\n &lt;\/rdf:Bag&gt;\n &lt;\/pdfaExtension:schemas&gt;<\/pre>\n<p>Furthermore, .xmpi files which get generated should be added to the list of ignored files.<\/p>\n<p>The main .tex file then just needs to have the following:<\/p>\n<pre>\\usepackage[a-1b]{pdfx}<\/pre>\n<p>Finally, the document should be validated. A free <a href=\"http:\/\/www.pdf-tools.com\/pdf\/validate-pdfa-online.aspx\">PDF\/A validator<\/a> is available online. Without the instructions above, on a stock LaTeX document, the output will be similar to the following one:<\/p>\n<pre>Validating file \"foo.pdf\" for conformance level pdfa-1b\nThe separator after an 'obj' must be an EOL. (163)\nThe separator before an 'endobj' must be an EOL. (163)\nThe key Metadata is required but missing.\nA device-specific color space (DeviceGray) without an appropriate output intent is used.\nA device-specific color space (DeviceRGB) without an appropriate output intent is used.\nThe value of the key SMask is an image but must be None. (9)\nThe key S has a value Transparency which is prohibited. 
(11)\nThe separator before 'endstream' must be an EOL.\nThe document does not conform to the requested standard.\nThe file format (header, trailer, objects, xref, streams) is corrupted.\nThe document contains device-specific color spaces.\nThe document contains transparency.\nThe document's meta data is either missing or inconsistent or corrupt.\nDone.<\/pre>\n<p>Whereas, by following the instructions, the output will eventually be as desired:<\/p>\n<pre>foo.pdf validated successfully.<\/pre>\n<p>For your convenience, all six files including the applied patch are made <a href=\"https:\/\/github.com\/serviceprototypinglab\/latex-pdfa\/archive\/master.zip\">available for download<\/a> as a bundle from a dedicated Service Prototyping Lab <a href=\"https:\/\/github.com\/serviceprototypinglab\/latex-pdfa\">repository<\/a>. The remaining effort then boils down to using the files and curating the metadata of your publications; it is then up to publishers of scientific papers to reward the higher production quality.<\/p><div class=\"pt-sm\">Tags: <a href=\"https:\/\/blog.zhaw.ch\/icclab\/tag\/publishing\/\">publishing<\/a><br><\/div>","protected":false},"excerpt":{"rendered":"<p>In the Service Engineering research area, we aim at producing high-quality output in terms of software, publications, lecture materials and other results. From time to time, this implies departing from old habits and taking a bit of extra effort to reach new quality levels. 
For publications, there are excellent tools like LaTeX to achieve a [&hellip;]<\/p>\n","protected":false},"author":486,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"ngg_post_thumbnail":0,"footnotes":""},"categories":[1,15],"tags":[844],"features":[],"class_list":["post-10463","post","type-post","status-publish","format-standard","hentry","category-allgemein","category-howtos","tag-publishing"],"yoast_head":"<!-- This site is optimized with the Yoast SEO Premium plugin v27.2 (Yoast SEO v27.2) - https:\/\/yoast.com\/product\/yoast-seo-premium-wordpress\/ -->\n<title>Creating PDF\/A Documents for Long-Term Archiving - Service Engineering (ICCLab &amp; SPLab)<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/blog.zhaw.ch\/icclab\/creating-pdfa-documents-for-long-term-archiving\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Creating PDF\/A Documents for Long-Term Archiving\" \/>\n<meta property=\"og:description\" content=\"In the Service Engineering research area, we aim at producing high-quality output in terms of software, publications, lecture materials and other results. From time to time, this implies departing from old habits and taking a bit of extra effort to reach new quality levels. 
For publications, there are excellent tools like LaTeX to achieve a [&hellip;]\" \/>\n<meta property=\"og:url\" content=\"https:\/\/blog.zhaw.ch\/icclab\/creating-pdfa-documents-for-long-term-archiving\/\" \/>\n<meta property=\"og:site_name\" content=\"Service Engineering (ICCLab &amp; SPLab)\" \/>\n<meta property=\"article:published_time\" content=\"2016-08-09T10:31:54+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2019-08-05T12:35:27+00:00\" \/>\n<meta name=\"author\" content=\"icclab\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"icclab\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"4 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/blog.zhaw.ch\/icclab\/creating-pdfa-documents-for-long-term-archiving\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/blog.zhaw.ch\/icclab\/creating-pdfa-documents-for-long-term-archiving\/\"},\"author\":{\"name\":\"icclab\",\"@id\":\"https:\/\/blog.zhaw.ch\/icclab\/#\/schema\/person\/045c6bde7e681e689e4fc051d8932563\"},\"headline\":\"Creating PDF\/A Documents for Long-Term 
Archiving\",\"datePublished\":\"2016-08-09T10:31:54+00:00\",\"dateModified\":\"2019-08-05T12:35:27+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/blog.zhaw.ch\/icclab\/creating-pdfa-documents-for-long-term-archiving\/\"},\"wordCount\":599,\"commentCount\":7,\"keywords\":[\"publishing\"],\"articleSection\":[\"*.*\",\"HowTos\"],\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\/\/blog.zhaw.ch\/icclab\/creating-pdfa-documents-for-long-term-archiving\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/blog.zhaw.ch\/icclab\/creating-pdfa-documents-for-long-term-archiving\/\",\"url\":\"https:\/\/blog.zhaw.ch\/icclab\/creating-pdfa-documents-for-long-term-archiving\/\",\"name\":\"Creating PDF\/A Documents for Long-Term Archiving - Service Engineering (ICCLab &amp; SPLab)\",\"isPartOf\":{\"@id\":\"https:\/\/blog.zhaw.ch\/icclab\/#website\"},\"datePublished\":\"2016-08-09T10:31:54+00:00\",\"dateModified\":\"2019-08-05T12:35:27+00:00\",\"author\":{\"@id\":\"https:\/\/blog.zhaw.ch\/icclab\/#\/schema\/person\/045c6bde7e681e689e4fc051d8932563\"},\"breadcrumb\":{\"@id\":\"https:\/\/blog.zhaw.ch\/icclab\/creating-pdfa-documents-for-long-term-archiving\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/blog.zhaw.ch\/icclab\/creating-pdfa-documents-for-long-term-archiving\/\"]}]},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/blog.zhaw.ch\/icclab\/creating-pdfa-documents-for-long-term-archiving\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Startseite\",\"item\":\"https:\/\/blog.zhaw.ch\/icclab\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Creating PDF\/A Documents for Long-Term Archiving\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/blog.zhaw.ch\/icclab\/#website\",\"url\":\"https:\/\/blog.zhaw.ch\/icclab\/\",\"name\":\"Service Engineering (ICCLab &amp; SPLab)\",\"description\":\"A Blog of the 
ZHAW Zurich University of Applied Sciences\",\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/blog.zhaw.ch\/icclab\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Person\",\"@id\":\"https:\/\/blog.zhaw.ch\/icclab\/#\/schema\/person\/045c6bde7e681e689e4fc051d8932563\",\"name\":\"icclab\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/secure.gravatar.com\/avatar\/7b13169e03783f50e96b96fa2ff222b9c530d13c3125f077c7c44f729b857a51?s=96&d=mm&r=g\",\"url\":\"https:\/\/secure.gravatar.com\/avatar\/7b13169e03783f50e96b96fa2ff222b9c530d13c3125f077c7c44f729b857a51?s=96&d=mm&r=g\",\"contentUrl\":\"https:\/\/secure.gravatar.com\/avatar\/7b13169e03783f50e96b96fa2ff222b9c530d13c3125f077c7c44f729b857a51?s=96&d=mm&r=g\",\"caption\":\"icclab\"},\"url\":\"https:\/\/blog.zhaw.ch\/icclab\/author\/icclab\/\"}]}<\/script>\n<!-- \/ Yoast SEO Premium plugin. -->","yoast_head_json":{"title":"Creating PDF\/A Documents for Long-Term Archiving - Service Engineering (ICCLab &amp; SPLab)","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/blog.zhaw.ch\/icclab\/creating-pdfa-documents-for-long-term-archiving\/","og_locale":"en_US","og_type":"article","og_title":"Creating PDF\/A Documents for Long-Term Archiving","og_description":"In the Service Engineering research area, we aim at producing high-quality output in terms of software, publications, lecture materials and other results. From time to time, this implies departing from old habits and taking a bit of extra effort to reach new quality levels. 
For publications, there are excellent tools like LaTeX to achieve a [&hellip;]","og_url":"https:\/\/blog.zhaw.ch\/icclab\/creating-pdfa-documents-for-long-term-archiving\/","og_site_name":"Service Engineering (ICCLab &amp; SPLab)","article_published_time":"2016-08-09T10:31:54+00:00","article_modified_time":"2019-08-05T12:35:27+00:00","author":"icclab","twitter_card":"summary_large_image","twitter_misc":{"Written by":"icclab","Est. reading time":"4 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/blog.zhaw.ch\/icclab\/creating-pdfa-documents-for-long-term-archiving\/#article","isPartOf":{"@id":"https:\/\/blog.zhaw.ch\/icclab\/creating-pdfa-documents-for-long-term-archiving\/"},"author":{"name":"icclab","@id":"https:\/\/blog.zhaw.ch\/icclab\/#\/schema\/person\/045c6bde7e681e689e4fc051d8932563"},"headline":"Creating PDF\/A Documents for Long-Term Archiving","datePublished":"2016-08-09T10:31:54+00:00","dateModified":"2019-08-05T12:35:27+00:00","mainEntityOfPage":{"@id":"https:\/\/blog.zhaw.ch\/icclab\/creating-pdfa-documents-for-long-term-archiving\/"},"wordCount":599,"commentCount":7,"keywords":["publishing"],"articleSection":["*.*","HowTos"],"inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/blog.zhaw.ch\/icclab\/creating-pdfa-documents-for-long-term-archiving\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/blog.zhaw.ch\/icclab\/creating-pdfa-documents-for-long-term-archiving\/","url":"https:\/\/blog.zhaw.ch\/icclab\/creating-pdfa-documents-for-long-term-archiving\/","name":"Creating PDF\/A Documents for Long-Term Archiving - Service Engineering (ICCLab &amp; 
SPLab)","isPartOf":{"@id":"https:\/\/blog.zhaw.ch\/icclab\/#website"},"datePublished":"2016-08-09T10:31:54+00:00","dateModified":"2019-08-05T12:35:27+00:00","author":{"@id":"https:\/\/blog.zhaw.ch\/icclab\/#\/schema\/person\/045c6bde7e681e689e4fc051d8932563"},"breadcrumb":{"@id":"https:\/\/blog.zhaw.ch\/icclab\/creating-pdfa-documents-for-long-term-archiving\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/blog.zhaw.ch\/icclab\/creating-pdfa-documents-for-long-term-archiving\/"]}]},{"@type":"BreadcrumbList","@id":"https:\/\/blog.zhaw.ch\/icclab\/creating-pdfa-documents-for-long-term-archiving\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Startseite","item":"https:\/\/blog.zhaw.ch\/icclab\/"},{"@type":"ListItem","position":2,"name":"Creating PDF\/A Documents for Long-Term Archiving"}]},{"@type":"WebSite","@id":"https:\/\/blog.zhaw.ch\/icclab\/#website","url":"https:\/\/blog.zhaw.ch\/icclab\/","name":"Service Engineering (ICCLab &amp; SPLab)","description":"A Blog of the ZHAW Zurich University of Applied 
Sciences","potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/blog.zhaw.ch\/icclab\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Person","@id":"https:\/\/blog.zhaw.ch\/icclab\/#\/schema\/person\/045c6bde7e681e689e4fc051d8932563","name":"icclab","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/secure.gravatar.com\/avatar\/7b13169e03783f50e96b96fa2ff222b9c530d13c3125f077c7c44f729b857a51?s=96&d=mm&r=g","url":"https:\/\/secure.gravatar.com\/avatar\/7b13169e03783f50e96b96fa2ff222b9c530d13c3125f077c7c44f729b857a51?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/7b13169e03783f50e96b96fa2ff222b9c530d13c3125f077c7c44f729b857a51?s=96&d=mm&r=g","caption":"icclab"},"url":"https:\/\/blog.zhaw.ch\/icclab\/author\/icclab\/"}]}},"_links":{"self":[{"href":"https:\/\/blog.zhaw.ch\/icclab\/wp-json\/wp\/v2\/posts\/10463","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/blog.zhaw.ch\/icclab\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blog.zhaw.ch\/icclab\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blog.zhaw.ch\/icclab\/wp-json\/wp\/v2\/users\/486"}],"replies":[{"embeddable":true,"href":"https:\/\/blog.zhaw.ch\/icclab\/wp-json\/wp\/v2\/comments?post=10463"}],"version-history":[{"count":5,"href":"https:\/\/blog.zhaw.ch\/icclab\/wp-json\/wp\/v2\/posts\/10463\/revisions"}],"predecessor-version":[{"id":12513,"href":"https:\/\/blog.zhaw.ch\/icclab\/wp-json\/wp\/v2\/posts\/10463\/revisions\/12513"}],"wp:attachment":[{"href":"https:\/\/blog.zhaw.ch\/icclab\/wp-json\/wp\/v2\/media?parent=10463"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blog.zhaw.ch\/icclab\/wp-json\/wp\/v2\/categories?post=10463"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blog.zhaw.ch\/icclab\/wp-json\/wp\/v2\/tag
s?post=10463"},{"taxonomy":"features","embeddable":true,"href":"https:\/\/blog.zhaw.ch\/icclab\/wp-json\/wp\/v2\/features?post=10463"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}