Convert your PDF library to SQLite with Full Text Search

This is a small snippet I wrote on a Friday afternoon. It's only 80 lines.

I wanted a way to put all my PDFs and Word/OpenOffice documents into a database and search all of them for a pattern of text, so I would know exactly which document(s) contain the information I need.

This is how I learned to use full text search (FTS) in SQLite. It's not that hard once you grasp the theory.

Converting the whole library to a SQLite database doesn't take that long if you have a good computer. Multithreading could be implemented to do it even faster.

Comparing a normal SQL search using LIKE with the FTS MATCH operator:

time {booksDB2 eval {SELECT rowid FROM BooksData WHERE data LIKE '%Good%'}} 100
2699524.55 microseconds per iteration = 2.70 seconds
time {booksDB eval {SELECT rowid FROM BooksData WHERE data MATCH '*Good*'}} 100
15978.82 microseconds per iteration = 0.016 seconds

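So MATCH is almost 169 times faster. :D Keep in mind that the two queries are not strictly equivalent: LIKE '%Good%' finds the substring anywhere, while FTS MATCH works on indexed words and word prefixes.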
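To make the difference between the two queries concrete, here is a minimal sketch with a tiny in-memory database. It is not the original script: the table names PlainBooks and FtsBooks and the sample rows are made up for illustration, and it assumes your SQLite build has FTS3 compiled in.

```tcl
package require sqlite3

# In-memory database for illustration only.
sqlite3 db :memory:

# A plain table for LIKE and an FTS3 virtual table for MATCH.
# Both table names are assumptions, not taken from the original script.
db eval {CREATE TABLE PlainBooks(name TEXT, data TEXT)}
db eval {CREATE VIRTUAL TABLE FtsBooks USING fts3(name, data)}

foreach {name data} {
    a.pdf "A good book about databases"
    b.pdf "Goodness is its own reward"
    c.pdf "Nothing relevant here"
    d.pdf "A feelgood story"
} {
    # $name and $data are bound as SQL parameters by the Tcl interface.
    db eval {INSERT INTO PlainBooks(name, data) VALUES($name, $data)}
    db eval {INSERT INTO FtsBooks(name, data) VALUES($name, $data)}
}

# LIKE scans every row and matches the substring anywhere,
# so it also finds "feelgood" in d.pdf.
set likeHits [db eval {
    SELECT name FROM PlainBooks WHERE data LIKE '%good%' ORDER BY name
}]

# MATCH looks words up in the full text index; 'good*' is a prefix query,
# so it finds "good" and "Goodness" but not "feelgood".
set matchHits [db eval {
    SELECT name FROM FtsBooks WHERE data MATCH 'good*' ORDER BY name
}]

puts "LIKE:  $likeHits"
puts "MATCH: $matchHits"
db close
```

Wrapping either query in Tcl's time command, as in the measurements above, is an easy way to compare the two on your own data.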

Well, think how much time you'd waste searching every file manually: I had 1240 files in that folder. :D

Some statistics:

  • One of my book folders, including its subfolders with more books, was almost 2.2 GB in size.
  • After the script automatically converted each book to a plain .txt file containing only the text, the total size was 287 MB.
  • After importing everything into an SQLite database, the file was around 517 MB. The increase comes from all the indexes and references FTS keeps for each word it knows.
  • The tar.bz2 archive was only 181 MB.

How to use

You need a recent version of Tcl (8.5). Either use the ActiveTcl distribution, use a Tclkit, or compile it yourself.

Be sure you have the following programs installed on your Linux box (or on Windows, if you prefer to use Cygwin): antiword and pdftotext (part of xpdf).

You also need a recent version of the SQLite package for Tcl and the fileutil package from tcllib. Both are very easy to download with the teacup mechanism provided by ActiveTcl. Just type:

teacup install sqlite
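With those pieces installed, the overall flow can be sketched roughly as follows. This is not the downloadable script, only a minimal illustration under a few assumptions: the procs extractText and importLibrary are hypothetical names, the FTS table layout (name, data) is a guess based on the queries shown earlier, the database is assumed to be fresh, and the SQLite build is assumed to include FTS3.

```tcl
package require Tcl 8.5
package require sqlite3
package require fileutil

# Hypothetical helper: return the plain text of one document, choosing the
# external converter by file extension. "pdftotext FILE -" writes the text
# to stdout; -ignorestderr keeps converter warnings from raising a Tcl error.
proc extractText {path} {
    switch -- [string tolower [file extension $path]] {
        .pdf    { return [exec -ignorestderr -- pdftotext $path -] }
        .doc    { return [exec -ignorestderr -- antiword $path] }
        .txt    { return [fileutil::cat $path] }
        default { return "" }
    }
}

# Hypothetical helper: walk everything below $root and store the text of each
# supported file in an FTS3 table. Returns the number of documents imported.
proc importLibrary {db root} {
    # Assumes a fresh database; the table name mirrors the queries above.
    $db eval {CREATE VIRTUAL TABLE BooksData USING fts3(name, data)}
    set count 0
    # One transaction for the whole import is much faster than one per row.
    $db transaction {
        foreach path [fileutil::find $root] {
            if {![file isfile $path]} continue
            # Skip unreadable or unsupported files instead of aborting.
            if {[catch {extractText $path} text] || $text eq ""} continue
            $db eval {INSERT INTO BooksData(name, data) VALUES($path, $text)}
            incr count
        }
    }
    return $count
}

# Example usage (file and folder names are placeholders):
#   sqlite3 booksDB books.db
#   puts "imported [importLibrary booksDB ~/books] documents"
#   booksDB eval {SELECT name FROM BooksData WHERE data MATCH 'sqlite'} {
#       puts $name
#   }
```

Storing the file path in the name column is what lets a MATCH query report which document(s) contain the text.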

The snippet is available under the GPL 3.0 license.

Get it here: http://lba.im/snippets/convertToTxt.tcl
