// = Convert your PDF library to SQLITE with Full Text Search
:id: 9b538f8a-c3b8-44d9-84da-0c6f9ae0cf54
:author: Andrei Clinciu
:website: https://andreiclinciu.net/
:publish_at: 2012-08-15 18:18:00Z
:heading_image: \N
:description: \N
:type: article
:tags:
:keywords:
:toc: left
:imagesdir: ../assets/

image::{heading_image}[]

This is a small snippet I’ve written on a Friday afternoon. It’s only 80 lines.
I wanted a way to put all my PDFs and Word/OpenOffice documents into a database and search them all for a pattern of text, so I would know exactly which document(s) contain the information I need.
This is how I learned http://sqlite.org/fts3.html[how to use full text search in SQLite]. It’s not that hard once you grasp the theory.
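To make the idea concrete, here is a minimal sketch of an FTS table in Tcl. It is not the original script: the database handle `demoDB`, the table `DemoBooks`, its columns and the sample rows are all made up for illustration, and it assumes your SQLite build has FTS3 compiled in.

```tcl
package require sqlite3

# In-memory database, so the sketch leaves no file behind.
sqlite3 demoDB :memory:

# An FTS3 virtual table: every column is full-text indexed.
# Table and column names are illustrative only.
demoDB eval {CREATE VIRTUAL TABLE DemoBooks USING fts3(filename, data)}

foreach {name text} {
    a.pdf "A good book about Tcl scripting"
    b.doc "Notes on database indexes"
    c.pdf "Good practices for full text search"
} {
    # $name and $text are bound as SQL parameters by the Tcl interface.
    demoDB eval {INSERT INTO DemoBooks(filename, data) VALUES ($name, $text)}
}

# MATCH looks the word up in the index; the default tokenizer
# folds ASCII case, so 'good' also finds "Good".
set hits [demoDB eval {
    SELECT filename FROM DemoBooks WHERE data MATCH 'good' ORDER BY filename
}]
puts $hits
demoDB close
```

The query returns the file names whose text contains the word, which is exactly the "which document has it" question from above.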
Converting the whole library to a SQLite database doesn’t take that long if you have a good computer. Multithreading could be implemented to do it even faster.
Here is a comparison of a normal SQL search using LIKE with the FTS MATCH operator, each run 100 times:

time {booksDB2 eval {SELECT rowid FROM BooksData WHERE data LIKE '%Good%'}} 100 +
2699524.55 microseconds per iteration = 2.70 seconds +
time {booksDB eval {SELECT rowid FROM BooksData WHERE data MATCH 'Good'}} 100 +
15978.82 microseconds per iteration = 0.01597882 seconds

So MATCH was roughly 169 times faster :D
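If you want to try the comparison yourself without a book library, the following sketch builds a plain table and an FTS3 table with the same synthetic rows and times both queries. The handle `cmpDB`, the table names and the filler text are assumptions for illustration, and the timings will differ from the numbers above. Keep in mind that `LIKE '%Good%'` is a substring scan while `MATCH 'Good'` is a whole-word lookup, so on real text the two can return different rows (for example "Goodness" matches only the LIKE query).

```tcl
package require sqlite3

sqlite3 cmpDB :memory:
cmpDB eval {
    CREATE TABLE PlainData(data TEXT);
    CREATE VIRTUAL TABLE FtsData USING fts3(data);
}

# Fill both tables with the same made-up rows inside one transaction.
cmpDB transaction {
    for {set i 0} {$i < 2000} {incr i} {
        set text "filler text number $i"
        if {$i % 100 == 0} { append text " with one Good word" }
        cmpDB eval {
            INSERT INTO PlainData(data) VALUES ($text);
            INSERT INTO FtsData(data) VALUES ($text);
        }
    }
}

# Same question asked two ways: full scan vs. index lookup.
set likeRows  [cmpDB eval {SELECT rowid FROM PlainData WHERE data LIKE '%Good%'}]
set matchRows [cmpDB eval {SELECT rowid FROM FtsData WHERE data MATCH 'Good'}]

# time reports the average microseconds per iteration.
puts [time {cmpDB eval {SELECT rowid FROM PlainData WHERE data LIKE '%Good%'}} 100]
puts [time {cmpDB eval {SELECT rowid FROM FtsData WHERE data MATCH 'Good'}} 100]
```

With so little data the gap is much smaller than on a 500 MB database; the difference grows with the amount of text LIKE has to scan.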
Now think how much time you’d waste searching every file manually. I had 1240 files in that folder :D
== Some statistics:
- One of my book folders, including its subfolders, was almost 2.2 GB in size.
- After the script automatically converted each book to a simple .txt file containing only the text, the size was 287 MB.
- After importing everything into a SQLite db, it was around 517 MB. This is because of all the indexes and references FTS keeps for each word it knows.
- The tar.bz2 archive was only 181 MB.
== How to use
You need a recent version of Tcl (8.5). Either use the http://www.activestate.com/activetcl[ActiveTcl] distribution, use a http://code.google.com/p/tclkit/[Tclkit] or http://sourceforge.net/projects/tcl/[compile it yourself].
Be sure you have the following programs installed on your Linux box, or on Windows if you like to use Cygwin: antiword and pdftotext (xpdf).
You also need a recent version of SQLite for Tcl and the fileutil package from tcllib. Both are very easy to download with the teacup mechanism provided by ActiveTcl. Just type:
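As a rough sketch of how these two tools can be driven from Tcl (not the code from the script itself), the hypothetical helpers below pick a converter by file extension and capture its output. `converterCommand` and `extractText` are names invented for this example; the only assumptions about the tools are that `pdftotext file.pdf -` and `antiword file.doc` print the extracted text to standard output.

```tcl
# Hypothetical helper: returns the command (as a Tcl list) that prints the
# plain text of $path to stdout, or an empty list for unsupported types.
proc converterCommand {path} {
    switch -- [string tolower [file extension $path]] {
        .pdf    { return [list pdftotext $path -] }
        .doc    { return [list antiword $path] }
        default { return {} }
    }
}

# Hypothetical helper: runs the converter and returns the extracted text,
# or an empty string when the file type is not supported.
proc extractText {path} {
    set cmd [converterCommand $path]
    if {[llength $cmd] == 0} { return "" }
    # exec raises an error if the tool fails or writes to stderr,
    # so callers may want to wrap this in catch.
    return [exec {*}$cmd]
}

# Show the command that would be run for a PDF (nothing is executed here).
puts [converterCommand "Books/Some Book.PDF"]
```

Building the command as a list keeps file names with spaces intact, and `{*}` (available since Tcl 8.5) expands that list into separate arguments for `exec`.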
teacup install sqlite
The snippet/project is available under the GPL 3.0 license.
Get it here: http://lba.im/snippets/convertToTxt.tcl