Checked Exception: How to extract ISBN from PDF e-books

Thursday, January 6

How to extract ISBN from PDF e-books

Here's a super simple script to extract most US/English ISBNs from PDF files on Linux. The only dependency is pdftk, which you probably already have if you do anything with PDF files. If you need to look for other formats or have suggestions for improving the script, please feel free to comment.

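# Run from the directory that holds the PDFs. $(ls) splits on whitespace,
# so filenames must not contain spaces.
for file in $(ls)
do
    echo "--- $file"
    # Uncompress the PDF streams to stdout, then print anything shaped like an ISBN:
    # optional 3-digit prefix, 1-digit group, publisher, title, check digit.
    # -a makes grep treat the partly binary output as text.
    pdftk "$file" output - uncompress | grep -a -o '\([0-9]\{3\}-\)\?[0-9]-[0-9]\{3,5\}-[0-9]\{3,5\}-[0-9X]'
done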
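
If you want to see what the pattern picks up before pointing it at real e-books, you can try it on a few lines of plain text. This is just a rough sketch: the function name extract_isbn and the sample strings are made up for illustration, and the `\?` operator assumes GNU grep, which is what you get on most Linux systems.

```shell
#!/bin/sh
# Pattern: optional 3-digit prefix, 1-digit group, 3-5 digit publisher,
# 3-5 digit title, then a check digit (0-9 or X).
isbn_pattern='\([0-9]\{3\}-\)\?[0-9]-[0-9]\{3,5\}-[0-9]\{3,5\}-[0-9X]'

# Hypothetical helper: read text on stdin, print each ISBN-shaped match on its own line.
extract_isbn() {
    grep -o "$isbn_pattern"
}

# Illustrative input: a 13-digit form, a 10-digit form, and a line that should not match.
printf '%s\n' \
    'ISBN 978-0-596-52068-7 (print)' \
    'ISBN: 0-306-40615-2' \
    'Phone: 555-1234' | extract_isbn
# prints:
# 978-0-596-52068-7
# 0-306-40615-2
```

Note that the pattern only checks the shape, not the check digit, and it skips hyphenations outside the {3,5} ranges. A two-digit publisher field such as 0-19-852663-6 is not matched, which is part of why the loop above only catches the majority of ISBNs.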

Make sure you remove spaces from the filenames first, though, because they break this script. You can do that like this: http://techgurulive.com/2008/09/22/how-to-remove-spaces-from-filenames-linux/
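
If you'd rather not follow a link, here is one way to do the renaming straight from the shell. It's a sketch under my own assumptions, not necessarily the method from that page: strip_spaces is a made-up function name, spaces become underscores, and a file is skipped if the new name is already taken.

```shell
#!/bin/sh
# Hypothetical helper: replace spaces with underscores in the names of all
# entries directly inside a directory (defaults to the current directory).
strip_spaces() {
    dir=${1:-.}
    for path in "$dir"/*\ *; do
        [ -e "$path" ] || continue          # glob matched nothing
        name=${path##*/}                    # filename without the directory part
        new=$(printf '%s' "$name" | tr ' ' '_')
        if [ -e "$dir/$new" ]; then
            echo "skipping '$name': '$new' already exists" >&2
            continue
        fi
        mv -- "$path" "$dir/$new"
    done
}

# Usage (assumed path): strip_spaces /path/to/ebooks
```

Run it once on the e-book directory before the ISBN loop, so that every name is a single word by the time the loop reads the file list.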

2 comments:

Franck P. said...: You might consider the following command-line utility: pdfgrep

It works like grep for PDF files, for instance:

pdfgrep -h --color none ISBN "your_pdf_filename"

It is consistent and works amazingly well.

-- Franck Porcher, Ph.D.; February 25, 2013 at 10:36 PM
Michael said...: Sounds like a great tool. I'll give it a shot. Thanks for the tip.; February 28, 2013 at 9:10 PM