Convert your PDF library to SQLite with Full Text Search

This is a small snippet I wrote on a Friday afternoon. It’s only 80 lines.

I wanted a way to put all my PDFs and Word/OpenOffice documents into a database and search them all for a pattern of text, so I would know exactly which document(s) contain the information I need.

This is how I learned to use full text search in SQLite. It’s not that hard once you grasp the theory.

Converting the whole library to a SQLite database doesn’t take that long if you have a good computer. Multithreading could be implemented to do it even faster.

Here is a comparison of a normal SQL search using LIKE against the FTS MATCH operator, each timed over 100 iterations:

time {booksDB2 eval {SELECT rowid FROM BooksData WHERE data LIKE '%Good%'}} 100
2699524.55 microseconds per iteration = 2.70 seconds
time {booksDB eval {SELECT rowid FROM BooksData WHERE data MATCH 'Good'}} 100
15978.82 microseconds per iteration = 0.016 seconds

So MATCH is almost 169 times faster :D
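For anyone who wants to try the two query styles on something small, here is a minimal sketch. It is not the original 80-line script: the "path" column and the sample rows are made up for illustration, and it assumes your SQLite build has FTS3 compiled in. Only the table name BooksData and the "data" column come from the queries above.

```tcl
package require sqlite3

# In-memory database so nothing is written to disk.
sqlite3 booksDB :memory:

# FTS virtual table. "path" is an assumed column, "data" matches the
# queries shown above. Requires an SQLite build with FTS3 enabled.
booksDB eval {CREATE VIRTUAL TABLE BooksData USING fts3(path, data)}

# Made-up sample rows: file name followed by its extracted text.
foreach {path data} {
    a.txt {A Good book about Tcl}
    b.txt {Goodness gracious}
    c.txt {Nothing relevant here}
} {
    booksDB eval {INSERT INTO BooksData(path, data) VALUES ($path, $data)}
}

# MATCH uses the full text index and compares whole words (tokens).
set matchHits [booksDB eval {
    SELECT path FROM BooksData WHERE data MATCH 'Good' ORDER BY path
}]

# LIKE scans every row and matches any substring.
set likeHits [booksDB eval {
    SELECT path FROM BooksData WHERE data LIKE '%Good%' ORDER BY path
}]

puts "MATCH: $matchHits"   ;# prints "MATCH: a.txt"
puts "LIKE:  $likeHits"    ;# prints "LIKE:  a.txt b.txt"

booksDB close
```

Note that the two queries are not strictly equivalent: MATCH finds the word "Good" as a whole token, while LIKE also hits "Goodness". Part of the speedup comes from MATCH consulting the word index instead of reading all the text.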

Well, think how much time you’d waste searching every file manually. I had 1240 files in that folder :D

Some statistics:

  • One of my book folders, including its subfolders, was almost 2.2 GB in size.

  • After the script automatically converted each book to a plain .txt file containing only the text, the total size was 287 MB.

  • After importing everything into a SQLite database, it was around 517 MB. This is because of all the indexes and references FTS keeps for each word it knows.

  • The tar.bz2 archive was only 181 MB.

How to use

You need a recent version of Tcl (8.5). Either use the ActiveTcl distribution, use a Tclkit, or compile it yourself.

Be sure you have the following programs installed on your Linux box, or on Windows if you prefer to use Cygwin: antiword and pdftotext (xpdf).

You need a recent version of SQLite for Tcl and the fileutil package from tcllib. Both are very easy to download with the teacup mechanism provided by ActiveTcl. Just type:

teacup install sqlite
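To give an idea of how these pieces fit together, here is a rough sketch of an import loop. It is not the original script, just one way it could look: the procs converterFor and importLibrary are hypothetical names, the "path" column is an assumption, and it assumes that pdftotext and antiword are on your PATH and print the text to stdout (pdftotext does this when "-" is given as the output file).

```tcl
package require sqlite3
package require fileutil

# Hypothetical helper: map a file to the external command that prints
# its plain text on stdout, or an empty list for unsupported types.
proc converterFor {file} {
    switch -- [string tolower [file extension $file]] {
        .pdf    { return [list pdftotext $file -] }
        .doc    { return [list antiword $file] }
        default { return {} }
    }
}

# Hypothetical helper: walk $dir recursively and store the text of each
# supported document in an FTS table. Assumes a fresh database and an
# SQLite build with FTS3 enabled. Returns the number of imported files.
proc importLibrary {db dir} {
    $db eval {CREATE VIRTUAL TABLE BooksData USING fts3(path, data)}
    set count 0
    # One transaction for all inserts keeps the import reasonably fast.
    $db transaction {
        foreach file [::fileutil::find $dir] {
            set cmd [converterFor $file]
            if {[llength $cmd] == 0} continue
            # Forward converter warnings to our stderr so that only a
            # failing exit status makes exec raise an error.
            if {[catch {exec {*}$cmd 2>@stderr} data]} {
                puts stderr "skipped $file: $data"
                continue
            }
            $db eval {INSERT INTO BooksData(path, data) VALUES ($file, $data)}
            incr count
        }
    }
    return $count
}

# Usage (the paths are placeholders):
#   sqlite3 booksDB books.db
#   puts "imported [importLibrary booksDB /path/to/books] documents"
#   booksDB close
```

Files that fail to convert are skipped with a message instead of aborting the whole run, which seems the friendlier choice for a folder with over a thousand documents.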

The snippet/project is available under the GPL 3.0 license.
