This article shows how to convert Markdown documents into PDF files using Pandoc and its extensions. The Markdown document is first transformed into LaTeX code, which is then rendered as a PDF. The plain Pandoc template is customized to get a reusable set of settings.

Markdown

You might have encountered Markdown formatting while writing a comment on a forum, or editing a link in a chat application. It’s a formatting language that specifies just the basic elements: where a paragraph should end, which words should be in bold, which section of text represents a link, and so on. The format started in 2004 as an informal collection of formatting rules, and in 2016 two RFC documents describing it were published (RFC 7763 and RFC 7764).

Since Markdown files are essentially text files, all the tooling associated with text files is available from the start (think of grep, diff or git). Entire workflows dedicated to collaborating on text files become usable immediately, without any vendor lock-in.

On top of that, there’s a large ecosystem of tools for processing Markdown files (which are “pure content”) into other formats: PDF documents, presentations, websites, and so on.

Because the format is so simple, it encourages even novice programmers to create, share and adapt scripts and automations for whatever task they find repetitive, in whatever language they feel comfortable using, without having to learn any complex framework.

Pandoc

Pandoc is a converter between various document formats. It transforms documents written in a markup language (such as Markdown or MediaWiki) to other formats (such as LaTeX, HTML or docx). Using Pandoc we’ve already expanded the range of things we can do with our Markdown files, and we can push it even further by making formatting choices and customisations.
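As a rough illustration of such a conversion, the Python sketch below builds the argument list for a single Pandoc run and only executes it if Pandoc is installed. The file names (notes.md, notes.tex, notes.pdf) and the function names are made up for this example, and the choice of xelatex as PDF engine is an assumption; Pandoc picks the output format from the file extension.

```python
import shutil
import subprocess


def pandoc_command(source, output, pdf_engine=None):
    """Build the argument list for one Pandoc conversion (hypothetical helper)."""
    argv = [
        "pandoc", source,
        "--from", "markdown",   # input format
        "--standalone",         # emit a complete document, not a fragment
        "--output", output,     # output format is inferred from the extension
    ]
    if pdf_engine is not None:
        # Only meaningful when the output is a PDF file.
        argv.append("--pdf-engine=" + pdf_engine)
    return argv


def convert(source, output, pdf_engine=None):
    """Run the conversion, failing early if Pandoc is not on the PATH."""
    if shutil.which("pandoc") is None:
        raise RuntimeError("pandoc is not installed or not on PATH")
    subprocess.run(pandoc_command(source, output, pdf_engine), check=True)


# Step 1: Markdown to LaTeX source. Step 2: Markdown straight to PDF.
print(pandoc_command("notes.md", "notes.tex"))
print(pandoc_command("notes.md", "notes.pdf", pdf_engine="xelatex"))
```

Building the command as a list rather than a single string avoids shell quoting problems when file names contain spaces.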

LaTeX and TeX

LaTeX is the gold standard when it comes to academic publishing. It is built on top of TeX, a typesetting system started in the late ’70s, and over time it grew in functionality and popularity in academic circles.

It also relies on plaintext documents, but its philosophy is the polar opposite of Markdown’s: while Markdown tries to allow as few formatting options as possible in order to focus on representing the content, TeX allows you to write content, make formatting choices, and even include code that controls how the document should be formatted. To achieve consistent formatting, you are not limited to a few formatting options, but are invited to write short replacement rules or fragments of code, and reuse those as often as possible.

In fact, TeX is a Turing-complete system, meaning that any computer program can be expressed as TeX code (and the program will run when the TeX document is converted to PDF). Over time, users of the LaTeX system began curating a library of macros at CTAN (the Comprehensive TeX Archive Network).

The LaTeX syntax is not as convenient as Markdown, but it’s a tradeoff that’s worth making in academia: think of how often academic papers contain specialized notation, such as advanced math, electrical diagrams, chemistry or music notation. Once you become proficient with the macros needed in your field, it’s simpler than using a general-purpose document editor.

Pandoc ecosystem

Since Pandoc is a tool focused just on converting documents from one format to another, producing good-looking documents is considered out of scope. Instead, it relies on community members to build and share templates, each of which is a “reasonable set of formatting choices”. One such template geared towards the LaTeX format is Eisvogel, which in turn can be customized even further (see the README.md file and the examples directory).
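Selecting a template comes down to a couple of extra command-line arguments. The sketch below, in Python, assembles them; the helper name is hypothetical, and the titlepage variable is used here on the assumption that the installed Eisvogel version supports it, so check the template’s README.md before relying on it.

```python
def template_arguments(template, variables=None):
    """Return the Pandoc arguments that select a template and set its variables.

    Hypothetical helper: `template` is a template name or path, `variables`
    maps template variable names to string values.
    """
    args = ["--template", template]
    # Sort the keys so that the generated command is reproducible.
    for key, value in sorted((variables or {}).items()):
        args += ["--variable", f"{key}={value}"]
    return args


# Assumed example: Eisvogel with a title page switched on.
command = ["pandoc", "notes.md", "--output", "notes.pdf"]
command += template_arguments("eisvogel", {"titlepage": "true"})
print(" ".join(command))
```

Keeping the template settings in one function means every document built by the same script gets the same formatting choices.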

Pandoc also supports “filters”: small extensions that change the way the input is processed in order to introduce new formatting elements that are not handled by the Pandoc core. Using just Markdown and a good LaTeX template covers about 90% of the use-cases, but sometimes that’s not enough: sometimes we want to add a glossary page at the end of the document, or write text-boxes with tips that are formatted differently from the rest of the paragraphs. These things would be easy while working in pure LaTeX code, but Markdown is too simple to express them. We can cover these use-cases by adding the following filters:

  • pandoc-latex-environment: maps the ::: syntax element of the Markdown format to a pair of \begin{...} \end{...} tags inside the LaTeX output.
  • pandoc-gls: adds a custom syntax element of (+x) to the Markdown format, which is mapped to \gls{x} inside the LaTeX output.
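
To make the two mappings concrete, here is a minimal Python sketch that imitates them with regular expressions. It is not how the filters are implemented (real filters operate on Pandoc’s parsed document rather than on raw text, and pandoc-latex-environment needs configuration that is skipped here); it only shows the input and output shapes described above. The function names, the tip environment and the pdf glossary key are made up for the example.

```python
import re

GLS_PATTERN = re.compile(r"\(\+([A-Za-z0-9_-]+)\)")    # (+x)
DIV_OPEN = re.compile(r"^:::+\s*([A-Za-z][\w-]*)\s*$")  # ::: name
DIV_CLOSE = re.compile(r"^:::+\s*$")                    # :::


def expand_glossary_refs(text):
    """Rewrite every (+x) as \\gls{x}."""
    return GLS_PATTERN.sub(lambda m: "\\gls{" + m.group(1) + "}", text)


def convert_fenced_divs(lines):
    """Rewrite '::: name' ... ':::' blocks as \\begin{name} ... \\end{name}."""
    output = []
    open_environments = []  # stack, so nested blocks close in the right order
    for line in lines:
        opened = DIV_OPEN.match(line)
        if opened:
            open_environments.append(opened.group(1))
            output.append("\\begin{" + opened.group(1) + "}")
        elif DIV_CLOSE.match(line) and open_environments:
            output.append("\\end{" + open_environments.pop() + "}")
        else:
            output.append(expand_glossary_refs(line))
    return output


markdown = ["::: tip", "Use (+pdf) for archiving.", ":::"]
print("\n".join(convert_fenced_divs(markdown)))
```

A text-level rewrite like this would also fire inside code blocks and other places where it should not, which is one reason real filters work on the parsed document instead.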

PDF documents

The PDF format is a standard file format (ISO 32000) for documents, and it has achieved a large degree of adoption. It’s a complex binary format that encodes documents with high fidelity, “as printed”.

One restricted variant of this format, PDF/A, is considered suitable for archival and digital preservation.

Putting it all together

Dealing with all this complexity and tweaking might sound daunting, but it’s just a one-time setup activity. Once you get a template and a set of extensions that work for your use-case, starting a new document is just a matter of copying the customized template to a new folder. Consistent formatting and the use of brand identity elements happen automatically, and you can focus on the content.
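
The copying step can itself be scripted. The following Python sketch assumes the customized template lives in a local directory; the function name and the directory names in the usage line are placeholders, and skipping the .git directory is a choice made for this example.

```python
import shutil
from pathlib import Path


def start_document(template_dir, new_dir):
    """Copy a customized template directory into a fresh folder.

    Raises FileExistsError if `new_dir` already exists, so an existing
    document is never overwritten by accident.
    """
    shutil.copytree(
        Path(template_dir),
        Path(new_dir),
        ignore=shutil.ignore_patterns(".git"),  # don't carry over version history
    )
    return Path(new_dir)


# Usage (placeholder paths):
# start_document("doc-template/basic-template", "quarterly-report")
```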

The template used by this website is available at https://github.com/PersonalCompute-net/doc-template/ in two versions: with support for a glossary (the “glossary-template” directory) and without (the “basic-template” directory). Check out the main.pdf document inside the chosen directory for the steps involved in setting it up. The main.pdf also serves as a preview of how the final documents will look. Feel free to tweak it to match your use-case.

This is a good example of open-source collaboration, where a good solution can be built using standardized interfaces and formats - combining the ease of Markdown editing with the quality of the TeX layout engine. There was no central plan or architecture, no single set of assumptions on how the final documents should look, and no vendor lock-in to a specific set of tools. The setup experience is not 100% streamlined, but this is because you have the choice of picking the components, customizing them, or replacing them with better alternatives, if those appear.

A note on computer security

Let’s also look at how various formats and ecosystems deal with computer security, since “allowing any valid program” implies that running malware is possible.

  • Markdown has a clear separation between the content (which can be shared without worries) and external tools (which have to be installed by the user).
  • Pandoc by itself does not distribute third-party extensions (templates and filters). Those have to be installed separately, and it’s the user’s responsibility to decide which tools are trustworthy enough to run.
  • LaTeX allows third-party code to be distributed and executed in two ways: through the library of macros contributed by the community (the texlive-full package, where macros have to pass a basic review by the package maintainers), and through the documents themselves. For documents received from collaborators, security rests on three things: the collaborators themselves have to reach a minimal level of trust, the macros are expected to be readable code (obfuscated macros should raise concerns), and the scope of what a macro can do is mostly limited to reading local files and controlling the contents of the generated PDF.
  • Microsoft Office had a problem with embedded macros because it took none of these approaches. It did not require explicit user action to install a script (as the Markdown and Pandoc ecosystems do), rely on a central library of trusted and reviewed extensions, or allow the code to be inspected before running it (as LaTeX does). Instead, macros were included directly in Word documents and could be launched before the user had a chance to review them. To make matters worse, the macros didn’t run inside a tight sandbox, so they gained control of email accounts and network connections.