Here's a super simple script to extract most US/English ISBNs from PDF files on Linux. The only dependency is pdftk, and if you do anything with PDF files you probably already have it. If you need to look for other formats, or you have suggestions for improving the script, please feel free to comment.
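for file in *.pdf
do
    # skip the literal pattern if the directory has no PDFs
    [ -e "$file" ] || continue
    echo "--- $file"
    # uncompress the PDF streams to stdout and print only the ISBN-shaped matches
    pdftk "$file" output - uncompress | grep -o "\([0-9]\{3\}-\)\?[0-9]-[0-9]\{3,5\}-[0-9]\{3,5\}-[0-9X]"
done
The loop globs *.pdf instead of parsing the output of ls, and "$file" is quoted, so filenames with spaces are passed to pdftk in one piece. Without those two things, spaces really leave a mark with this script. If you would rather strip the spaces from your filenames anyway, you can do that like this: http://techgurulive.com/2008/09/22/how-to-remove-spaces-from-filenames-linux/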
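If you want to see what the pattern does and doesn't catch before pointing it at a real PDF, here is a small sketch that runs the same grep expression over a few sample strings. The function name extract_isbns and the sample lines are made up for illustration, and it assumes GNU grep, since \? is a GNU extension to basic regular expressions.

```shell
# Same pattern as in the script above (GNU grep basic-regex syntax).
isbn_pattern='\([0-9]\{3\}-\)\?[0-9]-[0-9]\{3,5\}-[0-9]\{3,5\}-[0-9X]'

# extract_isbns is a hypothetical helper: reads text on stdin
# and prints each ISBN-shaped match on its own line.
extract_isbns() {
    grep -o "$isbn_pattern"
}

# Sample input lines, not taken from any particular PDF.
printf '%s\n' \
    'ISBN 978-0-596-52068-7 (ISBN-13)' \
    'ISBN 0-596-52068-9 (ISBN-10)' \
    'ISBN 0-8044-2957-X (X check digit)' \
    'ISBN 9780596520687 (no hyphens)' \
    | extract_isbns
```

This prints the first three ISBNs and nothing for the last line. The pattern only matches hyphenated ISBNs with a one-digit group after the optional three-digit prefix, which is why it is limited to "the majority" of US/English ones.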
2 comments:
You might consider the following command-line utility: pdfgrep
It works like grep for PDF files, for instance:
pdfgrep -h --color never ISBN "your_pdf_filename"
It is consistent and works amazingly well.
-- Franck Porcher, Ph.D.
Sounds like a great tool. I'll give it a shot. Thanks for the tip.