Culturnomics: the quantitative measurement of the printed word.

You may or may not have come across the new word of the moment, “culturnomics”, a term used by a research paper just published in the journal Science. The paper uses data based on a subset of the more than two trillion words scanned so far by Google in their race to digitise all of the world’s information. These words have been acquired by scanning images of book pages and then processing those images using Optical Character Recognition (OCR) technology to create a searchable index of words. To get Google’s official view on what drives their behaviour, consider reading their philosophy: “Ten things we know to be true.” All this work forms the basis of Google Books, a very useful library service that, of course, just happens to also be a very good platform from which to serve text advertisements based on the content being searched.

We can all appreciate that there are negative as well as positive aspects to Google but, however critically you view the company, they are creating some amazing “first-of-their-kind” services. Perhaps being served a relentless stream of advertisements is the price we have to be prepared to pay for the development of such innovative “free” services. Or perhaps that price is too high? Certainly the company needs to work within a clearly defined and transparent legal framework. Eventually it will probably be forced to do so. Consider, for example, their blatant invasion of privacy in the use of I.P. address tracking in Europe, where this isn’t allowed because an I.P. address is considered to be personal data. The only European country to challenge Google was Germany, which exerted legal pressure to prevent I.P. address tracking unless people chose to opt in. Or think about Google’s monopoly status in Internet search in many countries: in the UK their share is currently running at 92% of all searches made. But Google also opaquely controls the position of organic, or natural, search results as well as, of course, paid search.
One result of this is that competing companies can bid to display their adverts when people search using a specific brand term. There has just been a European Parliament resolution about this.

The plus side of Google is the development of the kind of original tool used to create the data for this chart (derived from Google Labs’ Ngram Viewer, launched a couple of days ago), as well as being used for the academic paper in Science. This software tool would probably not have been created without the advertising revenue that Google has obtained from its digital tracking and data harvesting of consumers.

We can examine what the data tells us before going on to consider the limitations of the Ngram Viewer. I have chosen keyword terms that should be on everyone’s mind at the moment as the western world undergoes a series of systemic economic crises. The Ngram Viewer uses 500 billion words, derived from over 5.2 million books dating from 1800 through to 2008, just before the latest economic crisis occurred. The tool shows how frequently a word, or term, is used in books published over a period of time, hence the term “culturnomics.”

The chart shows that the term “banking” was used with increasing frequency from about 1820, reaching a peak around 1938, shortly before the Second World War. “Banking” was clearly used at a higher frequency than “socialism” and “capitalism”, and at a much greater frequency than the term “free market.” There is a spike in the frequency of content about “banking” in 1840 that follows closely on from the banking panic in the United States in 1837. We see another spike in the frequency of “banking” matching the market collapse of 1873 and the subsequent long depression. Frequency for “banking” continues to climb rapidly through the 1907 Bankers’ Panic that eventually resulted in the creation of the Federal Reserve System in 1913.
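For readers who want to see what a “frequency” of this kind means in practice, here is a minimal sketch in Python. It assumes that the frequency of a term in a given year is simply the number of times the term occurs divided by the total number of words published that year. That is an illustration only, not a description of Google's actual processing, and the function names and the three-book toy corpus are invented for the example.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def term_frequency_by_year(books, term):
    """Return {year: occurrences of term / total words} for (year, text) pairs.

    Assumption for this sketch: a term of n words is counted wherever those
    n words appear consecutively, and the total is the word count for the year.
    """
    target = tokenize(term)
    n = len(target)
    hits = defaultdict(int)
    totals = defaultdict(int)
    for year, text in books:
        words = tokenize(text)
        totals[year] += len(words)
        # Slide a window of n words over the text and count exact matches.
        for i in range(len(words) - n + 1):
            if words[i:i + n] == target:
                hits[year] += 1
    return {year: hits[year] / totals[year] for year in sorted(totals) if totals[year]}


# Invented toy corpus: (year of publication, text of the book).
books = [
    (1837, "The banking panic closed the banks and banking houses failed"),
    (1837, "A novel about the sea"),
    (1950, "The free market and the boom"),
]

freq = term_frequency_by_year(books, "banking")
for year, value in freq.items():
    print(year, round(value, 4))
```

Because the match is exact, searching this toy corpus for “bank” finds nothing even though “banks” and “banking” are present, so the choice of keyword matters a great deal.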
The frequency of literary discussion about “banking” peaks in the mid-1930s, after the 1929 Wall Street Crash and the depression that followed, then falls away dramatically during and after the Second World War until rising again during the Savings and Loan Crisis of the 1980s, which resulted in the Western economic recession experienced in the early 1990s. From then on the frequency of use of the keyword “banking” falls away to match the low levels of the early 1900s. Based on this analysis it would be fairly safe to predict that, since the collapse of Lehman Brothers on 15 September 2008, there will have been a dramatic rise in the use of the term “banking” in literary works but, unfortunately, the Google data stops at 2008.

This kind of keyword frequency analysis is used to gauge the sentiments expressed about brands by consumers using social networks like Facebook. Although that is highly suspect because of the limited ability of language analysis software to interpret meaning, it does seem valid to use the frequency of use in literature over time as a measure of public interest.

But there are limitations which should be considered. Firstly, although over 500 billion words have been derived from the 5.2 million books that the Ngram dataset uses, these are but a small subset of the two trillion words scanned by Google so far. And even that body of scanned text represents only about 11 per cent of the total number of published books, so the full set of data up to now is itself a subset. Secondly, we have no idea which categories of book have been scanned to date: for example, are books about banking and books about politics represented in equal proportion? Thirdly, we have no data on the error rate of the scanning and OCR process although, to be fair, having viewed a small sample of Google Books, the error rate would appear to be low. Fourthly, word usage can change over time: has this factor been considered in the time frame being observed?
If the words have multiple different meanings, then any analysis would plainly be nonsense. Words are often used incorrectly and they continually mutate. Indeed, the term “culturnomics”, used in the academic paper published in Science, morphed into the word “culturenomics” within 24 hours in many press reports. For my chart, I deliberately chose keywords whose meaning is unequivocal and whose usage since 1800 has remained fairly constant. Finally, there could be a difference in word frequency for other possible keyword terms: for example, “banking” could be substituted by “bank”, “banks”, “bank account”, or even “finance”, “financial service” or “services”. So we should, as always, tread carefully when interpreting such data.

These limitations notwithstanding, and given that the dataset consists of books in which there is a good chance of finding all the keywords within any single volume, the data in this chart can provide some useful insights. It’s interesting that up until the Second World War, literary discussion of “banking” was considerably more frequent than that of either “socialism” or “capitalism”, and very much more frequent than thoughts about the “free market.” In the great boom (bubble?) that followed the Second World War it seems that much less thought was given in literature to “banking” than there had been earlier in the 20th Century. The writing and subsequent publication of books would seem to follow the popular sentiment of the day. This data seems to confirm that in the midst of an economic boom there is little concern about “banking” or interest in political thought, but after a financial crisis there’s much greater curiosity about the behaviour that led up to the crash. Of course we didn’t need a dataset of 500 billion words to provide this insight, but at least quantitative analysis would seem to confirm this aspect of human behaviour rather than the opposite.

December 2010