[plug] importing a large text database for fast search

Andrew Cooks acooks at gmail.com
Fri Sep 2 15:32:45 WST 2011


Sounds almost like you're analyzing the WikiLeaks db.  :)

How much structure is there in this data and what kind of operations
would you be doing? If you need to list all messages containing a
certain word, written by A. Person and sent between two dates, then
that sounds like you want an RDBMS like SQLite, MySQL, or PostgreSQL.
If you're only interested in browsing the top 10-50 messages
containing the word "Libya", then Lucene might be better. My guess is
that Perl with DBD::SQLite or, if your Perl is much better than your
SQL, DBIx::Class would work nicely.
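
As a rough sketch of the SQLite route (untested; the file name, the
column names and their order are just placeholders, and it assumes a
DBD::SQLite built with FTS support -- swap fts4 for fts3 on older
builds), loading the CSV once into a full-text virtual table would
look something like this:

#!/usr/bin/perl
use strict;
use warnings;
use DBI;
use Text::CSV_XS;

# File and column names are placeholders -- adjust to the real layout.
my $csv_file = 'cables.csv';
my $db_file  = 'cables.db';

my $dbh = DBI->connect("dbi:SQLite:dbname=$db_file", '', '',
                       { RaiseError => 1, AutoCommit => 0 });

# One FTS virtual table; every column becomes full-text searchable.
# Needs an SQLite with FTS compiled in (use fts3 on older builds).
$dbh->do(q{
    CREATE VIRTUAL TABLE cables USING fts4(
        date, author, origin, classification, refs, tags, subject, body
    )
});

my $csv = Text::CSV_XS->new({ binary => 1, auto_diag => 1 });
open my $fh, '<:encoding(utf8)', $csv_file or die "open $csv_file: $!";
$csv->getline($fh);    # skip the header row, if there is one

my $ins = $dbh->prepare('INSERT INTO cables VALUES (?,?,?,?,?,?,?,?)');
while (my $row = $csv->getline($fh)) {
    $ins->execute(@$row);    # assumes exactly 8 fields per record
}
$dbh->commit;                # one transaction keeps 250k inserts fast
$dbh->disconnect;

After that the searching stays in SQL, either from the sqlite3 shell
or through the same $dbh, e.g. (again with made-up column names, and
assuming the dates are stored in a sortable form like YYYY-MM-DD):

  SELECT date, origin, subject
  FROM cables
  WHERE body MATCH 'libya'
    AND author = 'A. Person'
    AND date BETWEEN '2010-01-01' AND '2010-12-31';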

On Fri, Sep 2, 2011 at 8:30 AM, Michael Holland
<michael.holland at gmail.com> wrote:
>
> Suppose you had a large database - 1.7GB, with about 250k records in a CSV file.
> Each record has 8 fields - 7 headers plus a body.
> You might use a Perl script to split it into files, sort into folders by
> embassy name, convert the ALLCAPS to more legible case, and remove the
> quote escaping from the body.
> Maybe add links to a glossary for the more obscure military/diplomatic
> terms and acronyms.
> But grepping all this data is still slow. What is a good way to store
> it in Linux, with a full text index?
> _______________________________________________
> PLUG discussion list: plug at plug.org.au
> http://lists.plug.org.au/mailman/listinfo/plug
> Committee e-mail: committee at plug.org.au
> PLUG Membership: http://www.plug.org.au/membership
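
For the clean-up pass Michael describes above, a throwaway script
along these lines would do the splitting and the case conversion
(again only a sketch: the column positions, the by-embassy/ output
layout and the ALLCAPS heuristic are all guesses):

#!/usr/bin/perl
use strict;
use warnings;
use Text::CSV_XS;
use File::Path qw(make_path);

# Guessed column positions -- change to match the real header order.
my ($I_ORIGIN, $I_SUBJECT, $I_BODY) = (2, 6, 7);

my $csv = Text::CSV_XS->new({ binary => 1, auto_diag => 1 });
open my $in, '<:encoding(utf8)', 'cables.csv' or die "open: $!";
$csv->getline($in);    # skip the header row, if present

my $n = 0;
while (my $row = $csv->getline($in)) {
    my $embassy = $row->[$I_ORIGIN] || 'UNKNOWN';
    my $body    = $row->[$I_BODY];

    $body =~ s/\\"/"/g;                        # drop backslash-escaped quotes
    $body =~ s/\b([A-Z])([A-Z]+)\b/$1\L$2/g;   # crude ALLCAPS -> Mixed Case

    make_path("by-embassy/$embassy");
    open my $out, '>:encoding(utf8)',
        sprintf('by-embassy/%s/%06d.txt', $embassy, ++$n)
        or die "write: $!";
    print {$out} "$row->[$I_SUBJECT]\n\n$body\n";
    close $out;
}

Note that Text::CSV_XS already undoes the standard CSV quote-doubling
when it parses each record, so the regex only matters if the bodies
carry backslash escapes on top of that.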


