Thursday, January 6

How to extract ISBN from PDF e-books

Here's a super simple script to extract most US/English ISBNs from PDF files on Linux.  The only dependency is pdftk; if you do anything with PDF files, you probably already have it.  If you need to match other formats or have suggestions for improving the script, please feel free to comment.

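for file in *.pdf
do
    echo "--- $file"
    # pdftk writes the uncompressed PDF to stdout; -a makes grep treat that stream as text
    pdftk "$file" output - uncompress | grep -a -o "\([0-9]\{3\}-\)\?[0-9]-[0-9]\{3,5\}-[0-9]\{3,5\}-[0-9X]"
done


The loop globs for *.pdf and quotes "$file", so file names with spaces are handled.  If you loop over $(ls) with an unquoted $file instead, spaces really leave a mark: each name gets split into pieces.  In that case, remove the spaces first.  You can do that like this: http://techgurulive.com/2008/09/22/how-to-remove-spaces-from-filenames-linux/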
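
The pattern will also match strings that only look like ISBNs, so one possible improvement is to verify the check digit of each hit.  The sketch below is not part of the original script: isbn_candidates and isbn_valid are hypothetical helper names, it assumes bash and GNU grep (for the \? operator), and it uses the standard ISBN-10 (mod 11) and ISBN-13 (mod 10) check digit rules.  It reads plain text on standard input so you can try it without pdftk.

```shell
#!/bin/bash

# Hypothetical helper: print ISBN-like strings from stdin, one per line,
# using the same pattern as the script above.
isbn_candidates() {
    grep -a -o "\([0-9]\{3\}-\)\?[0-9]-[0-9]\{3,5\}-[0-9]\{3,5\}-[0-9X]"
}

# Hypothetical helper: succeed only if the check digit is correct.
isbn_valid() {
    local digits=${1//-/}   # strip the hyphens
    local sum=0 i c
    if [[ $digits =~ ^[0-9]{9}[0-9X]$ ]]; then
        # ISBN-10: weights 10 down to 1, a final X counts as 10,
        # and the weighted sum must be divisible by 11.
        for ((i = 0; i < 10; i++)); do
            c=${digits:i:1}
            [ "$c" = "X" ] && c=10
            sum=$((sum + c * (10 - i)))
        done
        [ $((sum % 11)) -eq 0 ]
    elif [[ $digits =~ ^[0-9]{13}$ ]]; then
        # ISBN-13: weights alternate 1, 3, 1, 3, ...
        # and the weighted sum must be divisible by 10.
        for ((i = 0; i < 13; i++)); do
            c=${digits:i:1}
            if ((i % 2 == 0)); then
                sum=$((sum + c))
            else
                sum=$((sum + 3 * c))
            fi
        done
        [ $((sum % 10)) -eq 0 ]
    else
        return 1
    fi
}

# Try it on a sample line instead of a PDF.
printf 'ISBN 978-0-306-40615-7 and ISBN 0-306-40615-3\n' | isbn_candidates |
while read -r isbn; do
    if isbn_valid "$isbn"; then
        echo "$isbn ok"
    else
        echo "$isbn bad check digit"
    fi
done
```

In the sample line the first number passes and the second fails, because its last digit was changed on purpose.  To use this with the script above, pipe the pdftk output into isbn_candidates instead of calling grep directly.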

2 comments:

Franck P. said...

You might consider the command-line utility pdfgrep.

It works like grep for PDF files, for instance:

pdfgrep -h --color none ISBN "your_pdf_filename"

It is consistent and works amazingly well.

-- Franck Porcher, Ph.D.

Michael said...

Sounds like a great tool. I'll give it a shot. Thanks for the tip.