Convert your PDF library to SQLite with Full Text Search

This is a small snippet I wrote on a Friday afternoon. It's only 80 lines.

I wanted a way to put all my PDFs and Word/OpenOffice documents into a database and search them all for a piece of text, so I'd know exactly which document(s) contain the information I need.

Along the way I learned how to use full text search (FTS) in SQLite. It's not that hard once you grasp the theory.
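To make the theory a bit more concrete, here is a minimal sketch of an FTS table in Tcl. It is not taken from the actual script: the database handle demoDB, the path column, and the sample rows are made up for illustration, and it assumes your SQLite build has FTS3 enabled.

```tcl
package require sqlite3

# In-memory database, so the sketch leaves no files behind.
sqlite3 demoDB :memory:

# An FTS3 virtual table; "path" and "data" are illustrative column names.
demoDB eval {CREATE VIRTUAL TABLE BooksData USING fts3(path, data)}

# Insert two tiny "documents".
foreach {path text} {
    /books/a.pdf "A good index makes searching fast"
    /books/b.doc "Plain tables need a full scan for every query"
} {
    demoDB eval {INSERT INTO BooksData(path, data) VALUES($path, $text)}
}

# MATCH looks words up in the full-text index; "good*" is a prefix query.
set hits [demoDB eval {SELECT path FROM BooksData WHERE data MATCH 'good*'}]
puts $hits   ;# prints: /books/a.pdf

demoDB close
```

Every word of every row goes into an index when it is inserted, so a MATCH query can jump straight to the rows containing a word instead of reading all the text again.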

Converting the whole library to a SQLite database doesn't take that long if you have a good computer. Multithreading could be implemented to do it even faster.

Comparing a normal SQL search using LIKE with the FTS MATCH operator:

time {booksDB2 eval {SELECT rowid FROM BooksData WHERE data LIKE '%Good%'}} 100
2699524.55 microseconds per iteration = 2.70 seconds
time {booksDB eval {SELECT rowid FROM BooksData WHERE data MATCH '*Good*'}} 100
15978.82 microseconds per iteration = 0.01597882 seconds

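So almost 169 times faster :D Keep in mind the two queries aren't strictly equivalent: LIKE '%Good%' finds the substring anywhere, even inside longer words, while MATCH looks up whole indexed words (a trailing * turns a term into a prefix search).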
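If you want to try the comparison yourself without a real library, something like the following sketch should do. The handle benchDB, the table names, and the synthetic rows are all assumptions made for the example, and the numbers it prints will depend on your machine and on how much text you load.

```tcl
package require sqlite3

sqlite3 benchDB :memory:
benchDB eval {
    CREATE TABLE PlainData(data TEXT);
    CREATE VIRTUAL TABLE FtsData USING fts3(data);
}

# Fill both tables with the same synthetic rows inside one transaction.
benchDB transaction {
    for {set i 0} {$i < 5000} {incr i} {
        if {$i % 100 == 0} {
            set text "row $i has a good word in it"
        } else {
            set text "row $i is filler text"
        }
        benchDB eval {INSERT INTO PlainData(data) VALUES($text)}
        benchDB eval {INSERT INTO FtsData(data) VALUES($text)}
    }
}

# Run each query once to check that both find the same number of rows.
set likeRows  [benchDB eval {SELECT rowid FROM PlainData WHERE data LIKE '%good%'}]
set matchRows [benchDB eval {SELECT rowid FROM FtsData WHERE data MATCH 'good'}]

# LIKE has to scan every row; MATCH consults the full-text index.
puts [time {benchDB eval {SELECT rowid FROM PlainData WHERE data LIKE '%good%'}} 20]
puts [time {benchDB eval {SELECT rowid FROM FtsData WHERE data MATCH 'good'}} 20]

benchDB close
```

With rows this short the gap will be much smaller than on whole books, since LIKE gets slower as the amount of text it has to scan grows.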

Well, think how much time you'd waste searching every file manually. I had 1240 files in that folder :D

Some statistics:

  • One of my book folders, including subfolders, was almost 2.2 GB in size.
  • After the script automatically converted each book to a plain .txt file containing only the text, the size was 287 MB.
  • After importing everything into a SQLite db it was around 517 MB. This is because of all the indexes and references FTS keeps for each word it knows.
  • The tar.bz2 archive was only 181 MB.

How to use

You need a recent version of Tcl (8.5). Either use the ActiveTcl distribution, use a Tclkit, or compile it yourself.

Be sure you have the following programs installed on your Linux box, or on Windows if you like to use Cygwin: antiword and pdftotext (xpdf).
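To show how those two programs fit in, here is a rough sketch of calling them from Tcl. The procs converterCommand and convertFolder are hypothetical names for this example, and the real script may do it differently. It assumes pdftotext takes an input and an output file name, and that antiword prints the text to standard output.

```tcl
package require fileutil

# Map a document to the external command that extracts its text into a
# .txt file next to it. Returns an empty list for unsupported extensions.
proc converterCommand {file} {
    set txt [file rootname $file].txt
    switch -- [string tolower [file extension $file]] {
        .pdf    { return [list pdftotext $file $txt] }
        .doc    { return [list antiword $file > $txt] }
        default { return {} }
    }
}

# Walk a folder recursively and convert every supported document.
# Returns the list of files that were converted without an error.
proc convertFolder {dir} {
    set converted {}
    foreach file [fileutil::findByPattern $dir -glob {*.pdf *.PDF *.doc *.DOC}] {
        set cmd [converterCommand $file]
        if {[llength $cmd] == 0} continue
        # catch keeps one unreadable file from aborting the whole run;
        # -ignorestderr stops mere warnings from counting as failures.
        if {[catch {exec -ignorestderr -- {*}$cmd} err]} {
            puts stderr "skipped $file: $err"
        } else {
            lappend converted $file
        }
    }
    return $converted
}
```

Calling convertFolder on your library folder would leave a .txt file beside each PDF or Word document, ready to be imported into the database.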

You need a recent version of SQLite for Tcl and the fileutil package from tcllib. They're both very easy to download with the teacup mechanism provided by ActiveTcl. Just type:

teacup install sqlite
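A quick way to check that both packages are usable is to load them from a Tcl shell. This is just a sanity check, assuming the packages are named sqlite3 and fileutil as in current distributions:

```tcl
# Both commands raise an error if the package cannot be found.
package require sqlite3
package require fileutil

# Print the versions that were actually loaded.
puts "sqlite3 [package present sqlite3], fileutil [package present fileutil]"
```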

The snippet is available under the GPL 3.0 license.

Get it here: http://lba.im/snippets/convertToTxt.tcl
