👩‍💻 chrismanbrown.gitlab.io

How to create a PDF archive of some blog posts with pandoc

pandoc is awesome

2025-02-15

Here are two things that happened to me this weekend:

  1. I remembered about a series of blog posts that I have referenced and re-read several times over the past few years. I wanted to read it again, but when I went to look for a local copy, I discovered that I didn’t have one!

  2. I read man pandoc and discovered (or was reminded) that pandoc accepts URLs as input: you can convert from html to markdown on the fly by feeding it URLs on the command line!

So here’s the deal. I’ve got three blog posts I want to consolidate into a single file.

Here’s how we do it:

pandoc \
  -f html \
  -t markdown \
  --standalone \
  --extract-media=assets \
  -o blog1.md \
  https://example.com/blog/awesome-article/

Make special note of the --standalone and --extract-media flags. As you might guess, --extract-media extracts images and other media from the post and saves them to the specified directory, and --standalone makes a standalone document out of the source. For markdown output, that mostly means extracting information from the html’s meta tags and putting it in a yaml frontmatter block.

I repeated this for each article in the series, saving the files as blog1.md, blog2.md, and blog3.md.
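If retyping the command for every article gets tedious, the repetition can be scripted. Here is a minimal sketch, not what I actually ran: fetch_posts is a made-up helper name, and it assumes the output files should be numbered in the order the URLs are given.

```shell
# fetch_posts URL...: convert each URL to blogN.md, numbered in argument order.
# (fetch_posts is a hypothetical helper, not part of pandoc.)
fetch_posts() {
  n=1
  for url in "$@"; do
    # Same flags as the single-article command above.
    pandoc \
      -f html \
      -t markdown \
      --standalone \
      --extract-media=assets \
      -o "blog$n.md" \
      "$url" || return 1   # stop at the first failed download/conversion
    n=$((n + 1))
  done
}
```

Call it with the article URLs in reading order, e.g. fetch_posts https://example.com/blog/awesome-article/ followed by the URLs of the other posts in the series.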

A little bit of cleanup is required. I deleted the breadcrumbs and nav elements from the beginning of the doc, and the comments from the end of the doc. And I adjusted the depth of a couple of headings to make the table of contents look good.
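I did the heading adjustments by hand, but if a whole file needs pushing down a level, sed can do it. This is a rough sketch under a few assumptions: demote_headings is a made-up name, the file uses ATX-style headings (lines starting with #), and no code block in the file has lines that start with "# ", since those would get an extra # too.

```shell
# demote_headings FILE: print FILE with every level 1-5 heading pushed down one level.
# (Hypothetical helper; level-6 headings are left alone because there is no level 7.)
demote_headings() {
  sed -E 's/^(#{1,5}) /#\1 /' "$1"
}
```

It writes to stdout, so redirect to a temporary file and move it into place, e.g. demote_headings blog2.md > blog2.tmp && mv blog2.tmp blog2.md.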

I made an html doc to test it all out:

pandoc \
  -f markdown \
  -t html \
  -s \
  --toc \
  -V toc-title:"Contents" \
  -o blog.html \
  blog1.md blog2.md blog3.md

It looked good! The only problem I noticed is that including yaml frontmatter in each markdown document means that the final html doc always uses the frontmatter of whichever markdown doc comes last. So I deleted all of those yaml blocks and made a separate metadata.yaml file.
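For illustration, a metadata file along these lines would work. The values are placeholders, not the real series details; title, author, and lang are standard pandoc metadata fields.

```shell
# Write a shared metadata file for the combined document.
# All values below are placeholders; fill in the real title and author.
cat > metadata.yaml <<'EOF'
title: "Awesome Article: The Complete Series"
author: "Original Author"
lang: en
EOF
```

Pass it to pandoc with --metadata-file=metadata.yaml so the combined document gets one title and author instead of inheriting them from an individual post.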

Now I’m ready for the real deal! The real deal is making a pdf of the content.

pandoc \
  -f markdown \
  -t pdf \
  --pdf-engine=typst \
  -s \
  --toc \
  -V toc-title:"Contents" \
  --metadata-file=metadata.yaml \
  -o blog.pdf \
  blog1.md blog2.md blog3.md

Perfect! There are lots of pdf engines to choose from. This blog series happened to include some Unicode characters that broke my default engine. So I tried a couple of others and found that typst is the fastest to compile, and also looks good.
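Trying engines one at a time can also be scripted. The sketch below is an assumption-laden version of what I did manually: try_engines is a made-up name, the pandoc command is trimmed to the essentials, and each engine you list has to be installed separately.

```shell
# try_engines ENGINE...: build blog.pdf with the first engine that succeeds.
# (Hypothetical helper; engine names are passed straight to --pdf-engine.)
try_engines() {
  for engine in "$@"; do
    if pandoc \
        -f markdown \
        -s \
        --toc \
        --pdf-engine="$engine" \
        --metadata-file=metadata.yaml \
        -o blog.pdf \
        blog1.md blog2.md blog3.md; then
      echo "built blog.pdf with $engine"
      return 0
    fi
    echo "$engine failed, trying the next one" >&2
  done
  return 1   # no engine produced a pdf
}
```

For example, try_engines xelatex typst attempts xelatex first and falls back to typst. It only tells you which engine succeeded; judging which output looks best is still a job for your eyes.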

I added this pdf to my archive and now I have it forever and ever.