Thursday, January 6

How to extract ISBN from PDF e-books

Here's a super simple script that extracts most US/English ISBNs from the PDF files in the current directory on Linux.  The only dependency is pdftk, which you probably already have if you do anything with PDF files.  If you need to match other formats or have suggestions for improving the script, please feel free to comment.

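# $(ls) splits on whitespace, so file names must not contain spaces.
for file in $(ls); do
    echo "--- $file"
    # Uncompress the PDF to stdout and print anything shaped like a hyphenated ISBN.
    pdftk "$file" output - uncompress | grep -o "\([0-9]\{3\}-\)\?[0-9]-[0-9]\{3,5\}-[0-9]\{3,5\}-[0-9X]"
done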
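
To get a feel for what the pattern catches and what it misses, it can help to run it against a few sample strings outside of any PDF. The sketch below assumes GNU grep (the `\?` operator is a GNU extension). The helper name `isbn_candidates` is made up for this illustration, and the ISBNs are just sample input.

```shell
#!/bin/sh
# Hypothetical helper (not part of the original script): print every
# ISBN-shaped string found on standard input, using the same pattern.
isbn_candidates() {
    grep -o "\([0-9]\{3\}-\)\?[0-9]-[0-9]\{3,5\}-[0-9]\{3,5\}-[0-9X]"
}

# Hyphenated ISBN-13 and ISBN-10 strings with 3-5 digit groups are matched.
printf 'ISBN 978-0-306-40615-7\n' | isbn_candidates
printf 'ISBN 0-306-40615-2\n' | isbn_candidates

# An ISBN printed without hyphens is not matched.
printf 'ISBN 9780306406157\n' | isbn_candidates || echo "no match (no hyphens)"

# An ISBN with a two-digit group is not matched either.
printf 'ISBN 0-19-852663-6\n' | isbn_candidates || echo "no match (two-digit group)"
```

The last two cases show why the script only finds the majority of ISBNs: the pattern expects hyphens and groups of three to five digits, so anything formatted differently slips through.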

Make sure you remove spaces from the file names first, though, because the loop splits the file list on whitespace and any name containing a space will break it.  You can do that like this: