How to create a PDF archive of some blog posts with pandoc
pandoc is awesome
2025-02-15
Here are two things that happened to me this weekend:
I remembered about a series of blog posts that I have referenced and re-read several times over the past few years. I wanted to read it again, but when I went to look for a local copy, I discovered that I didn’t have one!
I read man pandoc and discovered / was reminded that Pandoc supports urls: you can convert from html to markdown on the fly by feeding it urls on the command line!
So here’s the deal. I’ve got three blog posts I want to consolidate into a single file.
Here’s how we do it:
pandoc \
-f html \
-t markdown \
--standalone \
--extract-media=assets \
-o blog1.md \
https://example.com/blog/awesome-article/
Make special note of the --standalone and --extract-media flags. As you might guess, --extract-media extracts images and media from the post and saves them to the specified directory, and --standalone makes a standalone document out of the source. For markdown documents, that mostly means extracting information from the html’s meta tags and putting it in a yaml frontmatter block.
I repeated this for each article in the series, saving the files as blog1.md, blog2.md, and blog3.md.
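If you have more than a handful of posts, the repetition can be wrapped in a small loop. This is only a sketch of the same command run once per url: the fetch_posts function name is made up for illustration, and the urls in the usage comment are placeholders rather than the real series.

```shell
# fetch_posts is a hypothetical helper, not part of pandoc.
# It converts each url to blog1.md, blog2.md, ... in the order given,
# using the same flags as the single-post command above.
fetch_posts() {
  n=1
  for url in "$@"; do
    pandoc \
      -f html \
      -t markdown \
      --standalone \
      --extract-media=assets \
      -o "blog${n}.md" \
      "$url"
    n=$((n + 1))
  done
}

# Usage (placeholder urls):
#   fetch_posts https://example.com/blog/part-1/ https://example.com/blog/part-2/
```

The output files are numbered by argument order, so pass the urls in the order the series should be read.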
A little bit of cleanup is required. I deleted the breadcrumbs and nav elements from the beginning of the doc, and the comments from the end of the doc. And I adjusted the depth of a couple of headings to make the table of contents look good.
I made an html doc to test it all out:
pandoc \
-f markdown \
-t html \
-s \
--toc \
-V toc-title:"Contents" \
-o blog.html \
blog1.md blog2.md blog3.md
It looked good! The only problem I noticed is that including yaml
frontmatter in each markdown document means that the final html doc
always uses the frontmatter of whatever the final markdown doc is. So I
deleted all of those yaml blocks and made a separate metadata.yaml file.
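For reference, a metadata file along these lines should be enough. The values below are placeholders I made up for illustration, not the real series metadata; title, author, and lang are ordinary pandoc metadata fields.

```shell
# Write a minimal metadata.yaml for the combined document.
# All values are placeholders: swap in the real title and author.
cat > metadata.yaml <<'EOF'
title: "Awesome Article Series"
author: "Original Author"
lang: en
EOF
```

Because the metadata now lives in one file, it no longer matters which markdown doc comes last on the command line.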
Now I’m ready for the real deal! The real deal is making a pdf of the content.
pandoc \
-f markdown \
-t pdf \
--pdf-engine=typst \
-s \
--toc \
-V toc-title:"Contents" \
--metadata-file=metadata.yaml \
-o blog.pdf \
blog1.md blog2.md blog3.md
Perfect! There are lots of pdf engines to choose from. This blog series happened to include some Unicode characters that broke my default engine. So I tried a couple of others and found that typst is the fastest to compile, and also looks good.
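If you want to run the same comparison yourself, something like this rough sketch would do it. The compare_engines function name is invented for illustration, and it assumes each engine you pass is one pandoc supports and is installed; an engine that is missing or chokes on the input just gets reported as failed.

```shell
# compare_engines is a hypothetical helper, not part of pandoc.
# It builds blog-<engine>.pdf once per engine and prints the
# wall-clock time in whole seconds, or "failed" if pandoc exits non-zero.
compare_engines() {
  for engine in "$@"; do
    start=$(date +%s)
    if pandoc \
         -f markdown \
         --pdf-engine="$engine" \
         -s \
         --toc \
         -V toc-title:"Contents" \
         --metadata-file=metadata.yaml \
         -o "blog-$engine.pdf" \
         blog1.md blog2.md blog3.md; then
      echo "$engine: ok in $(( $(date +%s) - start ))s"
    else
      echo "$engine: failed"
    fi
  done
}

# Usage:
#   compare_engines typst xelatex lualatex
```

Each engine writes to its own file, so you can open the results side by side and judge the looks as well as the speed.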
I added this pdf to my archive and now I have it forever and ever.