Culturnomics: the quantitative
measurement of the printed word.
...with analysis & insight...
You may or may not have come across the new word of the moment, “culturnomics,” a term coined in
a research paper just published in the journal Science. The paper uses data based on a subset of
over two trillion words so far scanned by Google in their race to digitise all of the world’s
information. These words have been acquired by scanning images from books and then
processing those images using Optical Character Recognition (OCR) technology to create a
searchable index of words. To get Google’s official view on what drives their behaviour, consider
reading their philosophy: “Ten things we know to be true.” All this work forms the basis of Google
Books, a very useful library service that, of course, just happens to also be a very good platform
from which to serve text advertisements based on the content being searched.
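The scan-and-index pipeline described above can be sketched briefly. The OCR step itself is assumed to have already run on the page images, so this sketch starts from plain text; the book titles and snippets are hypothetical:

```python
from collections import defaultdict

# Hypothetical OCR output: plain text keyed by book title.
ocr_output = {
    "A History of Banking": "banking panics shaped the modern banking system",
    "Political Economy": "capitalism and socialism compete in the free market",
}

def build_index(corpus):
    """Build an inverted index: word -> set of titles containing that word."""
    index = defaultdict(set)
    for title, text in corpus.items():
        for word in text.lower().split():
            index[word].add(title)
    return index

index = build_index(ocr_output)
```

A query is then a simple set lookup, and intersecting the sets for several words gives the books that contain all of them.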
We can all appreciate that there are negative as well as positive aspects to Google but, however
critically you view the company, they are creating some amazing “first-of-their-kind” services.
Perhaps being served a relentless stream of advertisements is the price we have to be prepared to
pay for the development of such innovative “free” services. Or perhaps that price is too high?
Certainly the company needs to work within a clearly defined and transparent legal framework, and
eventually it will probably be forced to do so. Consider, for example, its blatant invasion of
privacy through I.P. address tracking in Europe, where this isn’t allowed because an I.P. address is
considered to be personal data. The only European country to challenge Google was Germany,
which exerted legal pressure to prevent I.P. address tracking unless people chose to opt in. Or
think about Google’s monopoly status in terms of Internet search in many countries: in the UK it’s
currently running at 92% of all searches made. But Google also opaquely controls the ranking of
organic, or natural, search results as well, of course, as paid search. One result of this is that competing
companies can bid for adverts that appear when people search using a specific brand term.
There has just been a European Parliament resolution about this.
The plus side of Google is the development of original tools like the one used to create the data for
this chart (derived from Google Labs’ Ngram Viewer, launched a couple of days ago) and for the
academic paper in Science. This software tool would probably not have been
created without the advertising revenue that Google has obtained from its digital tracking and data
harvesting of consumers.
We can examine what the data tells us before going on to consider the limitations of the Ngram
Viewer. I have chosen keyword terms that should be on everyone’s mind at the moment as the
western world undergoes a series of systemic economic crises. The Ngram Viewer uses 500 billion
words, derived from over 5.2 million books dating from 1800 through to 2008, just before the
latest economic crisis occurred. The tool shows the frequency with which a word or phrase is used
in books published over a period of time, hence the term “culturnomics.”
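The frequency measure behind a chart like this can be sketched in a few lines: for each year, count occurrences of the term and divide by the total number of words published that year. This is a toy version, with a hypothetical three-book corpus standing in for the Ngram dataset’s 500 billion words:

```python
from collections import Counter

# Toy corpus: (year, text) pairs standing in for the scanned books.
books = [
    (1837, "the banking panic spread through every banking house"),
    (1837, "trade and commerce slowed"),
    (1950, "the post-war boom lifted trade"),
]

def term_frequency_by_year(corpus, term):
    """Relative frequency of `term` per year: occurrences / total words."""
    hits, totals = Counter(), Counter()
    for year, text in corpus:
        words = text.lower().split()
        totals[year] += len(words)
        hits[year] += words.count(term.lower())
    return {year: hits[year] / totals[year] for year in totals}

freq = term_frequency_by_year(books, "banking")
```

Plotting these per-year ratios over time gives a chart of the same shape as the Ngram Viewer’s, though the real tool works on n-grams extracted in advance rather than raw text.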
The chart shows that the term “banking” was used with increasing frequency from about 1820,
reaching a peak around 1938, shortly before the Second World War. “Banking” was clearly used at
a higher frequency than “socialism” and “capitalism”, and at a much greater frequency than the
term “free market.” There is a spike in the frequency of content about “banking” in 1840 that
closely follows on from the banking panic in the United States in 1837. We see another spike in the
frequency of “banking” matching the market collapse of 1873 and the subsequent long depression.
Frequency for “banking” continues to climb rapidly through the 1907 Bankers’ Panic, which eventually
resulted in the creation of the Federal Reserve System in 1913. The frequency of literary
discussion about “banking” peaks in the mid-1930s, after the 1929 Wall Street Crash and the
depression that followed, then dramatically falls away during and after the Second
World War, rising again during the Savings and Loan Crisis of the 1980s, which fed into the Western
economic recession of the early 1990s. From then on the frequency of use of the
keyword “banking” falls away to match the low levels of the early 1900s. Based on this analysis it
would be fairly safe to predict that since the collapse of Lehman Brothers on the 15th September
2008 there will have been a dramatic rise in the use of the term “banking” in literary works but,
unfortunately, the Google data stops at 2008.
This kind of keyword frequency analysis is used to gauge the sentiments expressed about brands
by consumers using social networks like Facebook. Although that application is highly suspect, given
the limited ability of language-analysis software to interpret meaning, it does seem valid to use the
frequency of use in literature over time as a measure of public interest. But there are limitations
which should be considered. Firstly, although over 500 billion words have been derived from the
5.2 million books that the Ngram dataset uses, these are but a small subset of the two trillion
words scanned by Google so far. And even that scanned corpus represents only about 11
per cent of all published books, so the full set of data scanned to date is itself a subset.
Secondly, we have no idea about the categories of books that have been scanned to date: are
books about banking and politics, for example, represented in equal proportion? Thirdly, we have no data
on the error rate of the scanning and OCR process although, to be fair, having viewed a small
sample of Google Books the error rate would appear to be low. Fourthly, word usage can change
over time: has this been accounted for across the time frame being observed? If a word
has multiple meanings, then any analysis would plainly be nonsense. Words are often
used incorrectly and they continually mutate. Indeed, the term “culturnomics”, used in the
academic paper published in Science, morphed into the word “culturenomics” within 24 hours in
many press reports. For my chart, I deliberately chose keywords whose meaning is unequivocal
and whose usage since 1800 has remained fairly constant. Finally, there could be a difference in
word frequency for other possible keyword terms, for example “banking” could be substituted by
“bank,” “banks,” “bank account”, or even “finance”, “financial service” or “services”. So we should,
as always, tread carefully when interpreting such data.
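One way to test sensitivity to keyword choice is to sum the frequencies over a set of variants rather than track a single term. A minimal sketch under the same toy-corpus assumption as before (not the real Ngram data):

```python
def variant_frequency(corpus, variants):
    """Combined relative frequency per year of any term in `variants`."""
    variants = {v.lower() for v in variants}
    totals = {}  # year -> (variant hits, total words)
    for year, text in corpus:
        words = text.lower().split()
        hits = sum(1 for w in words if w in variants)
        h, t = totals.get(year, (0, 0))
        totals[year] = (h + hits, t + len(words))
    return {year: h / t for year, (h, t) in totals.items()}

# Hypothetical single-year corpus for illustration.
corpus = [
    (1907, "the bank failed and other banks followed"),
    (1907, "banking reform was debated"),
]
combined = variant_frequency(corpus, ["bank", "banks", "banking"])
```

Note this only matches single words; phrases such as “bank account” or “financial service” would need matching over adjacent word pairs, as the Ngram Viewer itself does for multi-word queries.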
These limitations notwithstanding, with a dataset of books large enough that any given keyword
stands a good chance of appearing, the data in this chart can provide some useful
insights. It’s interesting that up until the Second World War, literary discussion of “banking” was
considerably more frequent than either “socialism,” or “capitalism,” and very much greater than
thoughts about the “free market.” In the great boom (bubble?) that followed the Second World
War, it seems that much less thought was given in literature to “banking” than there had been
earlier in the 20th century. The writing and subsequent publication of books would seem to follow
the popular sentiment of the day. This data seems to confirm that in the midst of an economic
boom there is little concern about “banking” or interest in political thoughts, but after a financial
crisis there’s much greater curiosity about the behaviour that led up to the crash. Of course we
didn’t need a dataset of 500 billion words to provide this insight, but at least quantitative analysis
would seem to confirm this aspect of human behaviour rather than the opposite.