I am really impressed both by this OpenNews post about how to tackle a huge pile of documents and by the tools it recommends. After all:
What I received a month later from Nash County, N.C., were two boxes filled with thousands of printed pages of emails. Double-sided.
One of the problems it solves is this: your filesystem is usually very, very good at finding files, fast and on all kinds of criteria (just look at any Unix/Linux find examples page), but that presupposes that the information you have is broken out into files whose boundaries map roughly to a logical structure within the underlying data.
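To make the "finding files on all kinds of criteria" point concrete, here is a rough Python sketch of the kind of search that find does. The function name find_files and the particular criteria (extension, minimum size, recency) are my own picks for illustration, not anything from the OpenNews post, and it only works if your documents are already separate files.

```python
import time
from pathlib import Path


def find_files(root, suffix=".txt", min_bytes=0, modified_within_days=None):
    """Yield files under root that match a few find-style criteria.

    Hypothetical helper for illustration only.
    """
    cutoff = None
    if modified_within_days is not None:
        # Oldest modification time (seconds since epoch) we will accept.
        cutoff = time.time() - modified_within_days * 86400

    for path in Path(root).rglob("*"):
        if not path.is_file():
            continue
        # Compare extensions case-insensitively (.TXT matches .txt).
        if path.suffix.lower() != suffix:
            continue
        info = path.stat()
        if info.st_size < min_bytes:
            continue
        if cutoff is not None and info.st_mtime < cutoff:
            continue
        yield path
```

Something like list(find_files("emails", suffix=".txt", modified_within_days=30)) would then list recent text files under a folder called emails (again, a made-up name). None of this helps if everything you have is one giant scan.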
One of the best things is also the simplest: Overview has a feature that pulls a randomly selected sample of documents.
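I have no idea how Overview implements this internally, but the idea is easy to sketch in Python with the standard library. The function name sample_documents and the seed parameter are my own assumptions, there only to show the technique.

```python
import random


def sample_documents(documents, k, seed=None):
    """Return up to k documents chosen at random, without repeats.

    Illustrative sketch only; not Overview's actual code.
    """
    rng = random.Random(seed)  # a fixed seed makes the sample repeatable
    docs = list(documents)
    # Never ask for more items than exist, or random.sample raises ValueError.
    return rng.sample(docs, min(k, len(docs)))
```

Reading a few dozen randomly picked documents is a cheap way to get a feel for what is in the pile before you commit to searching or tagging all of it.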
The blog is crazy good, too. Interestingly, I remember IBM announcing their big investment in big data the other year and giving “Computational Journalism” as one of the use cases.
Did I say the blog was good? The blog is good.