Checked Exception: January 2011

Thursday, January 6

How to extract ISBN from PDF e-books

Here's a super simple script to extract the majority of US/English ISBNs from PDF files on Linux. The only dependency is pdftk, which you probably already have if you do anything with PDF files. If you need to look for other formats or you have suggestions for improving the script, please feel free to comment.

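# Scan every PDF in the current directory for ISBN-like strings.
for file in *.pdf
do
    echo "--- $file"
    # Uncompress the PDF streams so the text is searchable; -a makes grep treat the output as text.
    pdftk "$file" output - uncompress | grep -a -o '\([0-9]\{3\}-\)\?[0-9]-[0-9]\{3,5\}-[0-9]\{3,5\}-[0-9X]'
done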
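
To check what the pattern picks up without a PDF at hand, you can feed grep a line of sample text. This is only a sketch: it assumes the optional three-digit prefix is meant to be a \(...\)\? group (GNU grep syntax), and the variable names and the sample line are made up for illustration.

```shell
# Assumed form of the pattern: optional 3-digit prefix, 1-digit group,
# publisher, title, and a check digit (0-9 or X).
isbn_pattern='\([0-9]\{3\}-\)\?[0-9]-[0-9]\{3,5\}-[0-9]\{3,5\}-[0-9X]'

# Illustrative sample line, not taken from a real e-book.
sample='ISBN 978-0-306-40615-7 (print), ISBN 0-306-40615-2 (older edition)'

# grep -o prints each match on its own line.
matches=$(printf '%s\n' "$sample" | grep -o "$isbn_pattern")
printf '%s\n' "$matches"
```

Both the 13-digit and the 10-digit form should come out, one per line. Anything that merely looks like an ISBN will match too, since the pattern does not verify the check digit.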

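Watch out for spaces in file names, though: if the shell splits a name such as "my book.pdf" into separate words, pdftk gets handed pieces that don't exist. If that bites you, you can remove the spaces from the file names first, like this: http://techgurulive.com/2008/09/22/how-to-remove-spaces-from-filenames-linux/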
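
If you would rather not follow the link, the renaming step can be sketched roughly like this. The function name remove_spaces_from_pdf_names is made up here, replacing spaces with underscores is just one possible convention, and this is not necessarily the approach the linked article takes.

```shell
# Hypothetical helper: rename every PDF in directory $1 whose name contains
# a space, replacing each space with an underscore.
remove_spaces_from_pdf_names() {
    for file in "$1"/*\ *.pdf
    do
        [ -e "$file" ] || continue                    # glob matched nothing
        name=$(basename "$file")
        new_name=$(printf '%s' "$name" | tr ' ' '_')
        mv -i -- "$file" "$1/$new_name"               # -i asks before overwriting
    done
}
```

Calling remove_spaces_from_pdf_names . renames the files in the current directory; names without spaces are left alone.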