Download PDFs from websites

Things I've learned trying to download websites as PDFs

2019-12-10

From the browser, it’s very straightforward: right-click, print, and save as PDF. That’s the whole story there.

If you want to automate it or get a PDF from the CLI, it’s a different story.

There’s puppeteer, the first tool I used for downloading PDFs, since I was somewhat familiar with it. The catch was that I had to run a Chrome instance, and I wanted a lighter, simpler solution.

One thing I loved about saving a PDF directly from the browser is that it respects the site’s @media print CSS rules. So I’d avoid the unnecessary parts, like sidebars and headers. Instead, I’d get only the content as a PDF, with the styles of the site. Awesome!

So now that I’m using pandoc, I’d like to use it for downloading the content of a website too.

$ pandoc http://localhost:8000/posts/why-using-rss-feed.html -o out.pdf

Not so fast.

It throws an error telling me to install a PDF engine. Fair enough.

default --pdf-engine

With pandoc you have to install a PDF engine to be able to convert HTML to PDF; it doesn’t come bundled. So I downloaded MacTeX (a LaTeX distribution for macOS).
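A quick check like the one below can tell you up front whether an engine is available, instead of waiting for pandoc to complain. This is only a sketch: the function name find_pdf_engine is made up, and the candidate list (pdflatex, which is pandoc's default, plus the two other engines covered in this post) is an assumption you should adapt.

```shell
# Print the first PDF engine found on PATH, or fail if none is installed.
# The candidate list is an assumption; edit it to match your setup.
find_pdf_engine() {
    for engine in pdflatex wkhtmltopdf prince; do
        if command -v "$engine" >/dev/null 2>&1; then
            printf '%s\n' "$engine"
            return 0
        fi
    done
    echo "no PDF engine found; install one (e.g. MacTeX) first" >&2
    return 1
}
```

It could then be used as `engine=$(find_pdf_engine) && pandoc page.html --pdf-engine="$engine" -o out.pdf`, where page.html is a placeholder input.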

Then I ran the command again:

$ pandoc http://localhost:8000/posts/why-using-rss-feed.html -o out.pdf

And it worked! It was fast, but…

It didn’t include the styles, nor did it respect the @media print query to drop the unnecessary content.

Ok, let’s try something else.

prince

Searching for a way to include the styles in my downloads, I came across prince. It’s simple to install and very straightforward to use:

$ prince http://localhost:8000/posts/why-using-rss-feed.html --page-margin=10mm -o out.pdf

The output is as expected. It includes the styles and excludes the unnecessary content according to the @media print query. Awesome.

But there are a couple of things I don’t like.

It’s not completely free to use. A license costs 500 dollars for 1 user on a desktop 🙃.
It adds an ugly logo on the pdf.
It doesn’t recognize css grid styles.
I had to install another program.

So I kept searching, and I remembered that with pandoc I had other pdf engine options.

wkhtmltopdf

That’s a real name.

Looking at the wkhtmltopdf options in pandoc, you can pass --css to specify the CSS file. Great! The catch is that it had to be a local file… I couldn’t reference it over http. But I can always curl it first and then run the command. OK, another step, but nothing too crazy.

So I installed wkhtmltopdf from their site, and it worked well:

$ curl http://localhost:8000/css/styles.css > styles.css
$ pandoc http://localhost:8000/posts/why-using-rss-feed.html --pdf-engine wkhtmltopdf --css styles.css -V margin-left=10 -V margin-top=10 -V margin-right=10 -V margin-bottom=10 -o out.pdf

Something to consider is that I had to specify the margin on every side for it to look OK, so the command got longer.

The output was OK. It didn’t handle the CSS grid layout but, as with prince, specifying a margin was enough.
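To keep those two steps and the repeated margin flags manageable, they can be wrapped in a small function. The following is a sketch under assumptions, not something from my actual setup: the name page_to_pdf, the temporary directory, and the default margin of 10 (taken from the command above) are all illustrative, and it assumes pandoc handles an absolute path for --css the same way it handled the relative one.

```shell
# Sketch: fetch the stylesheet, then run pandoc with wkhtmltopdf.
# Usage: page_to_pdf PAGE_URL CSS_URL OUT_PDF [MARGIN]
page_to_pdf() {
    page_url=$1
    css_url=$2
    out_pdf=$3
    margin=${4:-10}   # one value, applied to all four sides

    # Keep the downloaded stylesheet out of the working directory.
    tmp_dir=$(mktemp -d) || return 1
    css_file=$tmp_dir/styles.css

    # -f: fail on HTTP errors; -sS: no progress bar, but still show errors.
    if ! curl -fsS "$css_url" -o "$css_file"; then
        rm -rf "$tmp_dir"
        return 1
    fi

    pandoc "$page_url" \
        --pdf-engine wkhtmltopdf \
        --css "$css_file" \
        -V margin-left="$margin" -V margin-top="$margin" \
        -V margin-right="$margin" -V margin-bottom="$margin" \
        -o "$out_pdf"
    status=$?

    # Clean up the temporary stylesheet whether or not pandoc succeeded.
    rm -rf "$tmp_dir"
    return "$status"
}
```

For the post above, that would be `page_to_pdf http://localhost:8000/posts/why-using-rss-feed.html http://localhost:8000/css/styles.css out.pdf`.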

The last thing that was bothering me about this was that it kept including the extra info hidden by the print media query…

The wkhtmltopdf docs list a --print-media-type option that renders the page with the print media type instead of the screen one.

I tried to use it from pandoc but it didn’t work, so I tried using wkhtmltopdf directly…

$ wkhtmltopdf http://localhost:8000/posts/why-using-rss-feed.html --print-media-type out.pdf

Wow. Much simpler than using wkhtmltopdf through pandoc, who would have thought?

And I have all the things I want! The same output as prince but without paying 500 dollars.

Conclusion

One thing I’ve learned from this is to use tools directly, not through other tools that may limit their interface, as pandoc did with wkhtmltopdf.

And that’s it. Now that I’ve figured out how to download websites as PDFs in a simple way, I’ll probably automate something for myself.

If you’re going to use wkhtmltopdf, I recommend making an alias or script with a better name.
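As a rough idea of what that could look like, here is a small shell function you could drop into your shell config. The name web2pdf and the default output file name (the last part of the URL with .html swapped for .pdf) are my own assumptions; the wkhtmltopdf call itself is the same one used earlier in this post.

```shell
# web2pdf: a hypothetical, easier-to-type wrapper around wkhtmltopdf
# that always renders with the print media type.
# Usage: web2pdf URL [OUT_PDF]
web2pdf() {
    if [ $# -lt 1 ]; then
        echo "usage: web2pdf URL [OUT_PDF]" >&2
        return 2
    fi
    url=$1
    # Default output name: last path segment, with a trailing .html removed.
    default_name=$(basename "${url%.html}").pdf
    wkhtmltopdf "$url" --print-media-type "${2:-$default_name}"
}
```

With that in place, `web2pdf http://localhost:8000/posts/why-using-rss-feed.html` would write why-using-rss-feed.pdf in the current directory. URLs with query strings would produce odd default names, so pass the output file explicitly for those.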

Happy reading!