Pandoc conversion tips
Pandoc helps convert documents, such as Word .DOCX files, to alternate formats, like Markdown. It can't read PDF documents directly, so export those to a supported format such as .DOCX first.
Conversion can require a bit of tinkering. The following tips can help.
Simple conversion
To use Pandoc to convert a document to a new format:
$ pandoc source.docx -o target.md
By default, many details are inferred from the file extensions. In this example:
- The input file (source) is assumed to be a Microsoft Word .DOCX file.
- The output file (target) is assumed to be a Markdown .MD file.
Use parameters to control these decisions.
For example, --to (-t) sets the output format, including the variation (or flavor) of Markdown used to format the output file. Valid values include markdown, markdown_strict, gfm, and more.
Other supported options include:
- --from (-f) <input_format>
- --read (-r) is an alias for --from
- --write (-w) is an alias for --to
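As a minimal sketch of combining these options, the helper below states both formats explicitly instead of relying on file extensions. The function name docx_to_gfm is hypothetical and chosen only for illustration; the flags are the ones described above.

```shell
# docx_to_gfm is a hypothetical helper name, not part of Pandoc.
# It converts a Word document to GitHub-Flavored Markdown and names
# both the input format (docx) and the output format (gfm) explicitly.
docx_to_gfm() {
  pandoc --from docx --to gfm "$1" -o "$2"
}

# Usage (file names are placeholders):
#   docx_to_gfm source.docx target.md
```

Naming the formats can help when a file has an unusual or missing extension.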
HTML and Markdown
When converting Markdown mixed with HTML, use --from to improve conversion results:
$ pandoc markdown.md -f markdown+raw_html
To learn more, see the Markdown section of the pandoc manual.
Extract images
To extract images from your source doc, add the --extract-media parameter.
This requires a value specifying the directory where extracted images should be saved. (Pandoc creates the directory if it doesn't already exist.)
Examples:
$ pandoc spec.docx -o output.md -t commonmark --extract-media=img/
$ pandoc api.docx -o api/_index.md -t gfm --extract-media=./img
Be sure to validate your input files before converting them. Images cannot be extracted from files that don’t include them.
This may seem obvious, but export results can vary by format. In 2020, for example, Jira’s “Export to Word” feature excluded images, while its PDF export included them.
In contrast, Acrobat Pro DC (v2020.013.20074) required specific settings to include images when exporting PDFs to .DOCX files. (From View Results, select the Settings icon and then locate Layout Settings. Here, activate Retain Flowing Text and deactivate Include Comments.)
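One way to validate a .DOCX file before converting it is to check whether it contains any embedded media. The sketch below assumes the unzip tool is installed and relies on the fact that a .DOCX file is a ZIP archive in which Word typically stores images under word/media/. The function name has_images is hypothetical.

```shell
# has_images is a hypothetical helper name.
# It succeeds (exit status 0) when the archive listing of the given
# .docx file contains at least one entry under word/media/.
# Assumes the unzip command is available.
has_images() {
  unzip -l "$1" | grep -q 'word/media/'
}

# Usage (spec.docx is a placeholder):
#   if has_images spec.docx; then
#     echo "images found"
#   else
#     echo "no embedded images; check your export settings"
#   fi
```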
Table conversion
Pandoc may convert tables to HTML tables, especially when the source tables include line breaks or multiple paragraphs.
In such cases, you can:
- Try using --to gfm (or an alternate Markdown syntax).
- Try alternate conversion tools, such as h2m.
- Modify the source doc for conversion and then later revert the problematic content.
The last option is a multi-step process:
- Use Word’s search and replace feature to replace paragraph marks (^P) with text symbols that don’t otherwise appear in the content (example: <*>).
- Convert the modified document.
- Update the converted results to replace your text symbols with <br> tags or other choices.
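The final replacement step can be scripted. The sketch below uses sed and assumes the <*> placeholder from the example above. Pandoc may escape those characters in Markdown output (for example, as \<\*\>), so the pattern allows an optional backslash before each one. The function name restore_breaks is hypothetical.

```shell
# restore_breaks is a hypothetical helper name.
# It reads converted Markdown on stdin and writes it to stdout with every
# <*> placeholder replaced by a <br> tag. Each \\? allows an optional
# backslash, in case the converter escaped the placeholder characters.
restore_breaks() {
  sed -E 's/\\?<\\?\*\\?>/<br>/g'
}

# Usage (file names are placeholders):
#   restore_breaks < target.md > target.fixed.md
```

Writing to a new file keeps the original conversion output available if the placeholder turns out to appear in legitimate content.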
Vital statistics
- 23 May 2024: First post