A Billion Words: Today's language modeling standard should be higher
googleresearch.blogspot.com
The GZ file is only 1.7GB, so I imagine a densely packed model would almost fit on a machine with 8GB of RAM, which is surprising.
Along similar lines, all of the English Wikipedia is < 10GB compressed, and about 45GB uncompressed: http://en.wikipedia.org/wiki/Wikipedia:Database_download#Eng.... That omits all the history (just the current pages), but it's still surprising to me how small it seems now.