Culturomics - Analyzing Language with N-grams and Google Books

Christopher Schmit



Brigham Young University’s corpus projects provide an excellent tool for researchers to examine how the English language has evolved in the United States over the past two hundred years. BYU has made three collections available that draw on the enormous amount of content located within the Google Books library. The Corpus of Historical American English was one of its first projects; it covers an impressively long time period, from 1810 to 2009, and contains more than 400 million words (Brigham Young University, n.d.). BYU later made fuller use of the Google Books resource with its advanced Google Books corpus, which it refers to as Culturomics and which became available in May 2011. This corpus reportedly draws on 155 billion words from 1.3 million books and has an easy-to-use search interface (Davies, 2011a). There is also a plan to add content from non-English languages in December 2011, if funding for the expansion is received (Davies, 2011c). This huge source gives researchers a view of how American English has evolved, and it shows the results both in table form and in a chart.
Making Use of Technology
Before the use of computing technologies, the creation of such a large database would have been a very time-consuming process. One can imagine how long it would take to index the words of just one book, or to create a concordance of the collected works of one particular author. The Google Books digitization project has scanned a great amount of material, which enables projects like Culturomics to sort through the words from such a huge timespan and catalogue. This has produced a great tool for researchers to use when looking at how language has evolved over nearly two hundred years. The catalogue used by BYU’s Culturomics can also show how the meanings of words have changed, and it highlights the frequency of particular words and phrases as society has progressed (Brigham Young University, n.d.). The Culturomics database searches the n-grams that have been provided by Google Books, not the full text of the books themselves (Davies, 2011d). N-grams are sequences of consecutive words that represent the content of the books, and they were created by Google to allow people to search its digitized books more easily (Davies, 2011d).
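To make the idea of an n-gram concrete, the following is a minimal Python sketch that splits a sentence into runs of n consecutive words and counts them. The function name extract_ngrams, the sample sentence, and the whitespace tokenization are assumptions made for illustration; this is not how Google or BYU actually built their n-gram data.

```python
from collections import Counter


def extract_ngrams(text, n):
    """Return every run of n consecutive words in text as a tuple."""
    # Simplified tokenization (an assumption): lowercase, split on whitespace.
    tokens = text.lower().split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Invented sample sentence, used only to show the counting step.
sample = "the blue eyes of the child met the blue eyes of the stranger"
bigram_counts = Counter(extract_ngrams(sample, 2))
print(bigram_counts[("blue", "eyes")])
```

Counting n-grams in this way, rather than storing whole books, is what lets a corpus answer frequency questions without exposing the full text.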
Useful Features
Linking to Google Books
One of the most important features of Culturomics is the ability to link any search term to results that are located in the Google Books site. This is a great feature, as it can show the researcher which specific books their term or phrase came from. As an example, the term “Prussia” was entered into the system. The results showed that the term was found most often within texts that were published in the first decade of the 1900s, with 64 tokens being listed. The user can then click on the number under that decade to link to the items within the Google Books system that featured the search term. This gives the user the specific resources that were used when displaying the search results, offering the researcher the ability to look at these original items and gain further insights into the information.
Frequency
One of the first pieces of information available through Culturomics is word frequency. This is displayed using a table, which shows how many results were found as well as the number of words that are included in each decade. It also presents the information in a chart, which allows the user to see the frequency of a particular word or phrase in each decade at a glance. The Culturomics database has also set a minimum occurrence amount of 40, so that if a word or phrase occurs fewer than 40 times in the 155 billion words, it is not included (Davies, 2011d). The reason for this threshold is so that the researcher “can be quite sure that they are not typos or other anomalies” (Davies, 2011).
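The effect of a minimum-occurrence threshold can be sketched in a few lines of Python. Only the value 40 comes from the source; the words, the per-decade counts, and the function name apply_threshold are invented for illustration and do not reflect real corpus data or the corpus's actual implementation.

```python
MIN_TOKENS = 40  # minimum occurrence amount reported for the corpus

# Hypothetical per-decade counts, invented purely for illustration.
counts_by_decade = {
    "lantern": {1890: 31, 1900: 52, 1910: 48},
    "lanterrn": {1900: 2, 1910: 1},  # a likely typo with very few tokens
}


def apply_threshold(counts, minimum=MIN_TOKENS):
    """Keep only the n-grams whose total count meets the minimum."""
    return {
        ngram: decades
        for ngram, decades in counts.items()
        if sum(decades.values()) >= minimum
    }


kept = apply_threshold(counts_by_decade)
# Print a small frequency table: n-gram, decade, count.
for ngram, decades in kept.items():
    for decade, count in sorted(decades.items()):
        print(f"{ngram}\t{decade}s\t{count}")
```

In this sketch the misspelled form falls below the threshold and is dropped, which is the kind of anomaly the threshold is meant to exclude.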
Display
Users may choose between two options to display their results: list and chart. The list feature allows users to see a listing of each term, as well as its collocates. This is particularly useful when searching for phrases, as it will show a variety of matching terms based on whether they contain nouns, adjectives, verbs, or other parts of speech. The chart feature displays the results in a bar chart, which highlights the overall frequency of the term that was found in the database.
Collocation
Another great feature of the interface is the ability to search for collocates of words and phrases. This allows users to narrow their searches to information that will be more relevant to their particular question. An example of this technique is given using the term “thick” with a collocation span of three. This enables the search to find items that occur within three words of each occurrence of “thick”. This is helpful when users want more context while searching for terms that they believe to have some kind of relation to each other (Davies, 2011e).
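A collocate search with a span of three can be approximated with a short Python sketch. The function name collocates, the sample sentence, and the symmetric window (three words on each side) are assumptions for illustration; the interface itself works from n-gram data and may define the span differently.

```python
from collections import Counter


def collocates(tokens, node, span=3):
    """Count words occurring within `span` positions of each `node` hit."""
    found = Counter()
    for i, token in enumerate(tokens):
        if token != node:
            continue
        # Take up to `span` words before and after the node word.
        start = max(0, i - span)
        window = tokens[start:i] + tokens[i + 1:i + 1 + span]
        found.update(window)
    return found


# Invented sample sentence, used only to show the windowing step.
text = "a thick fog rolled over the hills and a thick blanket of snow followed"
result = collocates(text.split(), "thick", span=3)
print(result.most_common(5))
```

Widening or narrowing the span trades context for precision: a larger window finds more related words but also more incidental neighbors.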
Part of Speech
A user may also search by utilizing the “part of speech” function that Culturomics provides. This allows the user to choose from a variety of word types, including nouns, adjectives and verbs, so that the results include only phrases that contain a specific part of speech. The example given by Culturomics shows that a user would be able to search for the term “eyes” while only displaying phrases that have an adjective before the word “eyes”. The results then include phrases such as “blue eyes” and “hazel eyes” (Davies, 2011f).
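The “adjective before eyes” search can be illustrated with a small Python sketch over tokens that already carry part-of-speech tags. The tag names (ADJ, NOUN, and so on), the sample data, and the function name adjective_before are hypothetical; they are not the tag set or query syntax used by the corpus.

```python
from collections import Counter

# Hypothetical pre-tagged tokens as (word, tag) pairs; the tag names are
# illustrative and are not the tag set used by the corpus.
tagged = [
    ("her", "PRON"), ("blue", "ADJ"), ("eyes", "NOUN"),
    ("met", "VERB"), ("his", "PRON"), ("hazel", "ADJ"), ("eyes", "NOUN"),
    ("and", "CONJ"), ("the", "DET"), ("eyes", "NOUN"),
]


def adjective_before(tagged_tokens, target):
    """Count two-word phrases where an adjective directly precedes target."""
    hits = Counter()
    # Pair each token with the token that follows it.
    for (word, tag), (next_word, _) in zip(tagged_tokens, tagged_tokens[1:]):
        if tag == "ADJ" and next_word == target:
            hits[f"{word} {next_word}"] += 1
    return hits


print(adjective_before(tagged, "eyes"))
```

Here “the eyes” is skipped because its first word is tagged as a determiner, while “blue eyes” and “hazel eyes” are kept.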
Decades
The search interface of Culturomics also allows users to restrict the results to particular decades. This is a very useful feature for anyone who wants to compare or examine phrases or words from specific ranges of time only. A user can focus their search on a very specific time period, which could help to uncover potential patterns and make a comparative study more fruitful. This can be a great tool to see how literature may have changed in its coverage of a specific topic over time.
Synonyms
Another valuable option available to a user is the ability to search for a word and include synonyms in the results. When utilizing this method, a user is presented with their search term first, and then synonyms are listed below it. This can be quite useful when looking for trends in language, as well as topics that may have been related during a specific time period. It can also show how some words have fallen out of favor and been replaced by others over time, which would be of great use to those interested in the history of language.
Sorting
Sorting is an important feature for any service that displays information. Culturomics allows users to sort by frequency, by relevance, or alphabetically. Sorting by frequency is straightforward; the most frequent results are displayed first. Sorting alphabetically is also just as the name implies. Sorting by relevance relies on Culturomics’ “Mutual Information score”, which looks at the frequency of node words and collocated words and then compares how closely they were located to each other (Davies, 2011b).
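As a rough illustration of what a mutual information score measures, the sketch below computes standard pointwise mutual information: how much more often two words occur together than would be expected if they were independent. This is the general textbook formulation, not necessarily the exact formula the interface uses (which may, for example, adjust for the collocate span). The function name and all counts are hypothetical.

```python
import math


def mutual_information(pair_count, node_count, collocate_count, corpus_size):
    """Pointwise mutual information (base 2) of a node/collocate pair."""
    p_pair = pair_count / corpus_size
    p_node = node_count / corpus_size
    p_collocate = collocate_count / corpus_size
    # Ratio of observed co-occurrence to the rate expected by chance.
    return math.log2(p_pair / (p_node * p_collocate))


# Hypothetical counts, invented for illustration only.
score = mutual_information(
    pair_count=50, node_count=1_000, collocate_count=2_000,
    corpus_size=1_000_000,
)
print(round(score, 2))
```

A score near zero means the pair co-occurs about as often as chance predicts, while higher scores indicate a stronger association, which is why such a score is useful for ranking collocates by relevance rather than by raw frequency.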
Example Searches
A few example searches show the great value that Culturomics provides. Searching for the term “holocaust” shows a stark increase in the usage of the term starting in the 1940s, which then steadily increases until the present. This is also a good demonstration of the graphical results, which make it easy to see in which decade the results are located. This information could be used to gain insight into how the word has evolved throughout its use in the United States.
The term “Security Council” was also entered into the Culturomics system. As expected, it also shows a sharp increase in use in the 1940s, due to the Council’s creation at that time. Unlike the previous term, “Security Council” had only a handful of hits before the 1940s and then steadily increased as the body became more relevant and influential. This type of information would be quite valuable to historians studying the role of the Security Council, and it could help determine the amount of coverage that the Council received in the American English sources contained in Google Books.
Conclusion
The Culturomics interface presents a fantastic tool for researchers from a variety of fields to use in order to unearth and investigate themes in the development of American culture and language. By drawing upon such an impressive amount of material, users are able to utilize a platform that would not have been even remotely possible without computing power. The graphical representation of the results is also very useful, as it allows users to glance through the results to determine in which decade their search received the most hits. This helps to speed up the search process and can lead their investigation to the most relevant material in the shortest amount of time. This resource looks like it will only improve as it adds content from other languages and as Google Books adds more content from American English. It will be fantastic to witness this tool’s evolution, and also to see the results that researchers are able to obtain by using this interface.


Bibliography
Brigham Young University. (n.d.). Corpus of Historical American English. Retrieved from Corpus of Historical American English: http://corpus.byu.edu/coha/
Davies, M. (2011). 40 token threshold for n-grams. Retrieved from Google Books (American English) Corpus (155 billion words, 1810-2009): http://googlebooks.byu.edu/
Davies, M. (2011). Compare: Corpus of Historical American English (COHA) and Google Books (Culturomics). Retrieved from Google Books (American English) Corpus (155 billion words, 1810-2009): http://googlebooks.byu.edu/compare-googleBooks.asp
Davies, M. (2011). Corpora: 45-425 million words each: free online access. Retrieved from Corpus.BYU.Edu: http://corpus.byu.edu/MutualInformation.asp
Davies, M. (2011). Five minute tour. Retrieved from Google Books: American English (155 billion words): http://googlebooks.byu.edu/
Davies, M. (2011). N-grams: overview. Retrieved from Google Books (American English) Corpus (155 billion words, 1810-2009): http://googlebooks.byu.edu
Davies, M. (2011). Searches: collocates. Retrieved from Google Books (American English) Corpus (155 billion words, 1810-2009): http://googlebooks.byu.edu
Davies, M. (2011). Searches: part of speech. Retrieved from Google Books (American English) Corpus (155 billion words, 1810-2009): http://googlebooks.byu.edu