A few days ago, I wrote about batch converting video files using ffmpeg. A few days later, I faced a similar problem: needing to convert a directory of .html files. “Need” is perhaps too strong a word. I was experimenting with how to save pages from a PBworks wiki.
PBworks allows the user to download a .zip file of all of the pages from a wiki. My downloaded backup contained 44 .html files, many of which were nested in subfolders. Instead of figuring out how to recursively loop through the subfolders, I used a find command, which searches subfolders by default. In my script below, the
find command is inserted using command substitution. The converted files are saved to the original subdirectory, keeping .html in the filename, but adding .md as the file extension.
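To illustrate what that substitution expands to (the ~/pbworks-backup path and page names here are just placeholders), find lists every .html file beneath the directory, and each converted copy is written next to its source:

find ~/pbworks-backup -name '*.html'
# ~/pbworks-backup/FrontPage.html        -> FrontPage.html.md
# ~/pbworks-backup/Projects/Notes.html   -> Projects/Notes.html.md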
I tried out two scripts to do the text conversion. First, I tried html2text, which worked great. Out of curiosity, I also tried Pandoc and ended up preferring how it formatted the final Markdown text. However, one feature of html2text I liked was the option to use --ignore-links, since most of the links were relative to the PBworks domain and would be broken when used offline. In the end, though, it seemed useful to be able to see where each original link pointed, so I skipped that option.
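For a single file, the two conversions look roughly like this (assuming the Python html2text command-line tool; page.html is a placeholder filename):

html2text --ignore-links page.html > page.md      # drops the link targets entirely
pandoc -f html -t markdown -o page.md page.html   # keeps the links in the Markdown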
Here is the script I created:
#!/bin/bash

# Usage: html2md /path/to/file

# Set $IFS so that filenames with spaces don't break the loop
SAVEIFS=$IFS
IFS=$(echo -en "\n\b")

# Loop through path provided as argument
for x in $(find "$@" -name '*.html')
do
    pandoc -f html -t markdown -o "$x.md" "$x"
done

# Restore original $IFS
IFS=$SAVEIFS
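Saved as html2md and made executable, it can be pointed at the unzipped backup folder (the path below is a placeholder):

chmod +x html2md
./html2md ~/pbworks-backup

Because the loop passes "$@" straight to find, more than one directory can be given as arguments and they will all be searched.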