July 18th, 2008

Gobs of Data

I did some simple arithmetic about LJ today -- by generalizing with a broad brush stroke -- and found yet more confirmation that we generate just gobs of data, and most of it is useless (from the perspective of knowledge).

Greg Hewgill wrote a beautifully practical script that exports one's LJ entries to XML -- with comments and metadata! It's called ljdump, of course. (This supplants my previous use of LJArchive and solves the problem of importing my journal into Mediawiki, but that's another post.)  What disappointed me was the paltry size of my more-than-three-year-old journal: a mere 4MiB.  This number shouldn't be that shocking, because text files are almost always smaller than other datatypes; enormously smaller when compared to multimedia.  But something that is large is the number of LJers, and churning this amount with the assumption that each LJer writes the same amount of data as me -- hah! -- we arrive at gobs of data:

Assume each LJer writes ~1MiB/year ( = Justin's 4MiB/3.5years)
Assume there are ~1M such average LJers (where by "average" I mean something LJ calls "active".  See http://www.livejournal.com/stats.bml)
We arrive at 1TiB of data written per year.
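The back-of-envelope math above can be checked in a few lines. This is just a sketch of the same estimate -- the per-user rate and user count are the assumptions stated above, not LiveJournal's actual figures:

```python
# Back-of-envelope estimate of LJ's yearly text output.
# All figures are assumptions from the post, not measured data.
MIB = 1024 ** 2

per_user_per_year = 4 * MIB / 3.5   # my 4 MiB over ~3.5 years ~= 1 MiB/year
active_users = 1_000_000            # "average"/active LJers, per LJ's stats page

total_bytes = per_user_per_year * active_users
print(f"{total_bytes / 1024**4:.2f} TiB/year")  # roughly 1 TiB/year
```

Halving either assumption, as mentioned below, still leaves hundreds of GiB of prose committed to disk each year.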

That's a substantial amount of data when one considers the time it takes to read a few dozen friends' journals, and considering my observation that readers typically skim through entries.  LJ's database, on the other hand, is committing every character to disk.  I cannot vouch for the accuracy of the above numbers, yet even halving these assumptions yields waste.  Why maintain such an accurate and cumbersome thing over the years unless we intend the data to be used for more than a mere instant reading?  I see all this information, yet it's void of knowledge, because we'd rather keep writing than reflecting.  This is why semantic applications and a system for assimilating knowledge are vital to our future.