pdfsearch - search tool for PDF, PS files
PDF indexer README
7 sep 2002
jose nazario
i had this problem in grad school. i love reading papers, they're such a great way to learn stuff. however, i wind up with piles and stacks of papers. so i try and keep PDFs on my laptop, but i find that they're hard to sift through to find the ones i need to read.

so, after some discussion with another of scooter's groomsmen, bob, i hacked a bit of shell scripting magic to make an index of the PDF and PS files in my home directory and allow me to search them. they're in two parts: the first is mk_pdf_index, a small shell script to reformat PDFs and PS files into text; the second is search, which does the actual searching.

some notes:

you'll need the xpdf package, which contains pdftotext, and ghostscript 5.5 or later, which contains ps2pdf.

if you have "antiword" or "pptHtml", you can also index word docs and powerpoint presentations, respectively. the index maker detects these (in /usr/local/bin) and indexes them.

it works by converting the files it finds into ascii text and then splitting it into words. you then look for these words in the index file. it keeps the file location and the first 20 lines at the top of the index file.

it doesn't work for all files, but for most.

this has only been tested on openbsd.

lastly, it needs some refinement, which maybe i'll do soon. the search is doing a boolean OR, and maybe boolean AND would be more useful.

however, it works:

    $ search paxson
    matches  filename
    1        /home/jose/papers/SP-supplement.pdf
    4        /home/jose/papers/norm-usenix-sec-01.ps
    17       /home/jose/papers/stationarity-May00.ps
    4        /home/jose/papers/tbit.ps

so, i found some papers i didn't even realize i had. how cool is that?

so, no more printing out PDF papers for me, i can keep them organized. i run the index generator every week or so, it takes about 30 minutes to fully run (i have a very full home directory). it doesn't work on all papers: some have protection embedded, and some have been made by scanning images of pages.
however, it works for most PDFs out there you'll run across.

INSTALLATION

you'll need to compile "wsplit", a small utility to split text files into their component words, for this. in the directory 'wsplit' run the Makefile (via make). you will need "flex" to build this.

copy the three files, mk_pdf_index, search, and wsplit, into a directory in your path. i use ~/bin; you can use that or /usr/local/bin, for example.

now run the indexer: mk_pdf_index. this will take a while.

now you can search your PDF and PS files using "search".

CHANGELOG

18 jul 02   initial version released
7 sep 02    version 0.2 released
            now supports indexing of word docs
            now supports ppt presentations
            handles spaces in names - support done by Anton Chuvakin, PhD, with tweaks
            case insensitive filename extensions with  matching
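
a sketch of the boolean AND idea from the notes above, for anyone who wants to experiment. this is not how mk_pdf_index or search actually work. the one-file-per-document layout, the .idx suffix, and the function names index_text and search_and are all made up for illustration, and tr stands in for wsplit.

```shell
#!/bin/sh
# hypothetical sketch: the index layout below is an assumption,
# not the format mk_pdf_index really writes.
# assumed layout: one .idx file per document; line 1 is the path of
# the original document, every following line is one lowercase word.

# index_text PATH
# reads ascii text on stdin and writes an index entry on stdout.
index_text() {
    printf '%s\n' "$1"
    # turn every run of non-alphanumerics into a single newline
    # (stand-in for wsplit), lowercase, drop blanks, keep each word once
    tr -cs '[:alnum:]' '[\n*]' |
        tr '[:upper:]' '[:lower:]' |
        grep -v '^$' |
        sort -u
}

# search_and IDXDIR WORD...
# prints the path of every document whose index holds ALL the words.
search_and() {
    idxdir=$1
    shift
    for idx in "$idxdir"/*.idx; do
        [ -f "$idx" ] || continue
        hit=1
        for word in "$@"; do
            # skip the path line, then look for the word as a whole line
            # -i ignore case, -x whole line, -F fixed string, -q quiet
            if ! tail -n +2 "$idx" | grep -qixF -e "$word"; then
                hit=0
                break
            fi
        done
        if [ "$hit" -eq 1 ]; then
            head -n 1 "$idx"
        fi
    done
}
```

with pdftotext installed, something like "pdftotext paper.pdf - | index_text paper.pdf > ~/.pdfidx/paper.idx" would feed it (~/.pdfidx is just a made-up location), and "search_and ~/.pdfidx paxson floyd" would then list only the papers containing both words. stopping at the first missing word is what makes it an AND; an OR would print the path on the first hit instead.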
LICENSE: BSD type.
PDF, Portable Document Format, PS, and PostScript are all registered trademarks of Adobe Systems, Inc. Word and PowerPoint are trademarks of Microsoft.