Last December we were having all kinds of fun with the Google Books Ngram Viewer, playing around in Google’s digital library comparing word usage over the past couple of centuries. But eight months is a long time in tech years, and the fine art of text analysis hasn’t stood still—nor has it remained the jurisdiction of bored workers comparing literary instances of cats vs. dogs, dames vs. broads, the New Jersey Turnpike vs. Route 66. In the New York Times’ Mechanic Muse column a few weeks ago, Ben Zimmer brought up some new, more sophisticated text comparison engines.
The Corpus of Contemporary American English (COCA) and the Corpus of Historical American English (COHA), both maintained by Mark Davies at Brigham Young University, sample not just fiction but periodicals and journals, academic works, and transcripts of the spoken word. COHA takes its data from 1800 to the present; COCA from the last two decades. Both corpora (yes, that’s the plural) are free to the general public and relatively easy to navigate, although taking the guided tour isn’t a bad idea. Data can be analyzed according to source or date, and also by part of speech and collocations—combinations and augmentations of the words in question, which is to say a word’s most common “neighbors.” Studying the way certain collocations trend in fiction especially, says Zimmer, can give armchair lexicographers insight into a kind of genre semiotics:
When we see a character in contemporary fiction “bolt upright” or “draw a breath,” we join in this silent game, picking up the subtle cues that telegraph a literary style. The game works best when the writer’s idiomatic English does not scream “This is a novel!” but instead provides a kind of comfortable linguistic furniture to settle into as we read a novel or short story.
Even if you have no vested interest in how often Dan Brown’s characters raised their eyebrows in his last book (although I can’t imagine why not), the sites are still a lot of fun to noodle around with. Or you can check out Martin Hilpert’s Motion Chart Resource Page, where he’s crunched some of the data very attractively—I don’t pretend to understand all of what’s going on, but it’s pretty.
If you’re looking for an even more contemporary source for language usage there’s Newswordy, which takes a new media buzzword every day and examines its usage (or misusage, as the case may be), defines it, and gives samples from Google News and Twitter feeds. “News words,” explains proprietor Josh Smith, “are accepted by audiences for their implied meaning. But often loaded words are misused or used out of context. The actual definitions can be different than what is implied.” The site is big and bright and slick in a good way, and there’s an archive that might turn up a useful new word or two (Bourse: n., a stock market in a non-English-speaking country. “The Hong Kong stock exchange said Thursday it is in talks with bourses on mainland China to set up a joint venture in the city, in the latest step to boost economic integration.”—AFP). Plus it’s just good to see the word “opprobrium” used in a tweet now and then.
(Above image is a chart of proportions of random instances of the phrase “like fire” from 1810 through 2000, via the Corpus of Historical American English. Obviously if usage dates had extended to 2009, we would have run away with it.)