This is a small snippet I wrote on a Friday afternoon. It's only 80 lines long.
I wanted a way to put all my PDFs and Word/OpenOffice documents into a database and search all of them for a text pattern, so I would know exactly which document(s) contain the information I need.
Along the way I learned how to use full-text search (FTS) in SQLite. It's not that hard once you grasp the theory.
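For readers new to FTS, here is a minimal sketch of the idea, not the code from the script itself. It assumes the sqlite3 Tcl package was built with the FTS3 module; the handle name demoDB and the path column are made up for illustration, while BooksData and data follow the queries shown in this post.

```tcl
package require sqlite3

# In-memory database, so the sketch leaves no file behind.
sqlite3 demoDB :memory:

# An FTS3 virtual table: every word stored in it is indexed.
# "path" is an illustrative column name (assumption).
demoDB eval {CREATE VIRTUAL TABLE BooksData USING fts3(path, data)}

demoDB eval {INSERT INTO BooksData(path, data) VALUES('a.txt', 'A good book about Tcl')}
demoDB eval {INSERT INTO BooksData(path, data) VALUES('b.txt', 'Notes on SQLite internals')}

# MATCH looks the word up in the full-text index instead of
# scanning the text of every row.
set hits [demoDB eval {SELECT path FROM BooksData WHERE data MATCH 'good'}]
puts $hits   ;# prints: a.txt

demoDB close
```

The default tokenizer folds ASCII letters to lower case, so 'good' and 'Good' find the same rows.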
Converting the whole library to a SQLite database doesn't take that long if you have a good computer. Multithreading could be implemented to do it even faster.
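The conversion step could look roughly like the sketch below. It is a simplified guess at the approach, not the published script: it pipes the converter output straight into the database instead of writing intermediate .txt files, and the proc names converterFor and importLibrary, the path column and the file name books.db are all hypothetical. It assumes pdftotext and antiword are on the PATH and that the FTS table already exists.

```tcl
package require fileutil
package require sqlite3

# Map a file name to the external command that writes its text to stdout.
# Returns an empty list for file types this sketch does not handle.
proc converterFor {path} {
    switch -- [string tolower [file extension $path]] {
        .pdf    { return [list pdftotext $path -] }
        .doc    { return [list antiword $path] }
        default { return {} }
    }
}

# Walk a directory tree and store the text of every convertible file
# in the FTS table. Returns the number of files imported.
proc importLibrary {db dir} {
    set count 0
    # One transaction for the whole import keeps the inserts fast.
    $db transaction {
        foreach path [fileutil::find $dir] {
            set cmd [converterFor $path]
            if {[llength $cmd] == 0} continue
            # Skip files whose converter is missing or fails,
            # rather than aborting the whole import.
            if {[catch {exec -ignorestderr -- {*}$cmd} text]} continue
            $db eval {INSERT INTO BooksData(path, data) VALUES($path, $text)}
            incr count
        }
    }
    return $count
}

# Hypothetical usage:
#   sqlite3 booksDB books.db
#   booksDB eval {CREATE VIRTUAL TABLE BooksData USING fts3(path, data)}
#   puts "imported [importLibrary booksDB /path/to/books] files"
```

Since each file is converted independently, the exec calls are the natural place to add parallelism later.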
Comparing a normal SQL search using LIKE with the FTS MATCH operator (100 iterations each):
time {booksDB2 eval {SELECT rowid FROM BooksData WHERE data LIKE '%Good%'}} 100
2699524.55 microseconds per iteration = 2.70 seconds
time {booksDB eval {SELECT rowid FROM BooksData WHERE data MATCH '*Good*'}} 100
15978.82 microseconds per iteration = 0.01597882 seconds
So MATCH is roughly 169 times faster :D
Think how much time you'd waste searching every file manually: I had 1240 files in that folder :D
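One caveat about the comparison: the two queries do not ask exactly the same question. LIKE '%Good%' finds the substring anywhere, even inside a longer word, while MATCH works on whole indexed words. As far as I know, FTS3/FTS4 only honours a trailing * (a prefix query), so the leading * in '*Good*' probably has no effect. The sketch below shows the difference on a few made-up rows and how the timings can be reproduced with Tcl's time command; the handle name cmpDB and the sample texts are assumptions.

```tcl
package require sqlite3

sqlite3 cmpDB :memory:
cmpDB eval {CREATE VIRTUAL TABLE BooksData USING fts3(data)}

# Rows get rowids 1..4 in insertion order.
foreach text {
    "a good read"
    "goodness me"
    "a feelgood movie"
    "nothing here"
} {
    cmpDB eval {INSERT INTO BooksData(data) VALUES($text)}
}

# Substring search: scans every row, also hits "feelgood".
set likeRows [cmpDB eval {
    SELECT rowid FROM BooksData WHERE data LIKE '%good%' ORDER BY rowid
}]

# Prefix search on indexed words: "good" and "goodness" match,
# "feelgood" does not because it does not start with "good".
set matchRows [cmpDB eval {
    SELECT rowid FROM BooksData WHERE data MATCH 'good*' ORDER BY rowid
}]

puts "LIKE:  $likeRows"
puts "MATCH: $matchRows"

# Same measurement style as in the post; the numbers depend on the machine.
puts [time {cmpDB eval {SELECT rowid FROM BooksData WHERE data LIKE '%good%'}} 100]
puts [time {cmpDB eval {SELECT rowid FROM BooksData WHERE data MATCH 'good*'}} 100]
```

On a table this small the timings mean little; the gap only shows up with a library-sized database.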
Some statistics:
- One of my book folders, including its subfolders, was almost 2.2 GB in size.
- After the script automatically converted each book to a plain .txt file containing only the text, the size was 287 MB.
- After importing everything into a SQLite db it was around 517 MB. This is because of all the indexes and references FTS keeps for each word it knows.
- The tar.bz2 archive was only 181 MB.
How to use
You need a recent version of Tcl (8.5). Either use the ActiveTcl distribution, use a Tclkit, or compile it yourself.
Be sure you have the following programs installed on your Linux box, or on Windows if you prefer to use Cygwin: antiword and pdftotext (part of xpdf).
You also need a recent version of SQLite for Tcl and the fileutil package from tcllib. Both are easy to download with the teacup tool provided by ActiveTcl. Just type:
teacup install sqlite
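To check that everything is in place before running the script, something like the following sketch can help. The proc name missingPrereqs is hypothetical; it assumes the SQLite binding is loaded as sqlite3 and that the two converters are expected on the PATH.

```tcl
# Return a list describing every Tcl package or external program
# that cannot be found; an empty list means everything is available.
proc missingPrereqs {pkgs tools} {
    set missing {}
    foreach pkg $pkgs {
        if {[catch {package require $pkg}]} {
            lappend missing "package $pkg"
        }
    }
    foreach tool $tools {
        # auto_execok returns an empty string when the program
        # is not found on the PATH.
        if {[auto_execok $tool] eq ""} {
            lappend missing "program $tool"
        }
    }
    return $missing
}

set missing [missingPrereqs {sqlite3 fileutil} {antiword pdftotext}]
if {[llength $missing] == 0} {
    puts "all prerequisites found"
} else {
    puts "missing: [join $missing {, }]"
}
```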
The snippet/project is available under the GPL 3.0 license.
Get it here: http://lba.im/snippets/convertToTxt.tcl