Digital Scholarship

A guide to digital scholarship tools, methods, and best practices across the digital humanities and data-driven fields

Text Analysis

Text analysis is the act of pulling unstructured data out of large bodies, or corpora, of texts and organizing it in machine-readable ways. This organization helps the researcher make connections within the pool of data and quantify it using different analysis techniques. It is also known as text mining, because you mine the text to pull out previously unknown patterns and trends.

Text analysis can be an important part of personal research or a useful skill for classroom instruction. It is beneficial as a starting point for a topic as it can answer questions such as:

  • What kind of keywords are being used? (word cloud)
  • What emotions can be pulled from the texts? (sentiment analysis)
  • What places are associated with your data? (geographic analysis)
  • What kind of connections or subject clusters exist? (topic modeling) 

The text analysis process has three main components: data, tools, and analysis. The data is the material you draw from. The tools process the data. The analysis is the set of conclusions you draw from the data with help from the tools. You may already have in mind what you want to use for one or more of the components, but each section of this guide will walk you through ideas to get you started.

An example of word cloud analysis of the Beatles' Wikipedia page. 
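
A word cloud is driven by word counts: the more often a word appears, the larger it is drawn. The following is a minimal sketch of that counting step, using only the Python standard library. The sample text and the very short stop-word list are made up for illustration; a real project would use a full corpus and a much longer stop-word list.

```python
import re
from collections import Counter

# A tiny, hypothetical stop-word list; real projects use a much longer one.
STOP_WORDS = {"the", "a", "an", "and", "of", "in", "to", "is", "it", "was", "were"}


def word_frequencies(text, top_n=5):
    """Return the top_n most common non-stop words as (word, count) pairs."""
    # Lowercase the text and treat runs of letters or apostrophes as words.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    return counts.most_common(top_n)


# Made-up sample text standing in for a real corpus.
sample = (
    "The band recorded the album in London. "
    "The album was a hit, and the band toured after the album was released."
)

print(word_frequencies(sample, top_n=2))  # [('album', 3), ('band', 2)]
```

A word cloud tool would then scale each word's font size by its count.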

Data

Depending on the project that you are working on, your data may come from a variety of sources. Paid tools available through the library include built-in data sources covered under copyright. These proprietary tools linked with existing databases are as follows:

For examples of what is included in the databases, you can search the providers on the BU Libraries database list.

However, depending on your project, you may choose a corpus from another source. Several sites host datasets, such as Re3data, Kaggle, and Awesome Data. Large organizations also provide open data, including U.S. Census Bureau Data, Data.gov, Google Books, Internet Archive, and Wikidata.

Web Scraping

If there is no existing corpus or dataset for your topic, there is also the option of web scraping. Web scraping, including social media scraping, involves crawling specific websites to collect information. This process is much more involved than using existing data and requires more technical skill. It also has legal and ethical limitations depending on the sites you are scraping. Trends and allowances often shift, so check the Terms of Use of each site you plan to scrape. Two of the most popular web scraping tools are Python libraries: Beautiful Soup (recommended for beginners) and Scrapy (for more advanced projects).
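
To give a sense of what the extraction step of scraping looks like, here is a minimal sketch that pulls paragraph text out of a page. It uses Python's built-in html.parser module as a stand-in so that it runs without installing anything; it is not Beautiful Soup or Scrapy. The HTML is a made-up sample held in a string, and no website is contacted. With Beautiful Soup, the same step is typically a call to find_all("p") on the parsed page.

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collect the text found inside every <p> element."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._in_p = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        # Start a fresh buffer each time a paragraph opens.
        if tag == "p":
            self._in_p = True
            self._buffer = []

    def handle_endtag(self, tag):
        # When the paragraph closes, store its accumulated text.
        if tag == "p" and self._in_p:
            self.paragraphs.append("".join(self._buffer).strip())
            self._in_p = False

    def handle_data(self, data):
        # Keep text only while inside a paragraph (nested tags included).
        if self._in_p:
            self._buffer.append(data)


# Made-up page content; a real scraper would download this from a site
# whose Terms of Use allow it.
SAMPLE_HTML = """
<html><body>
  <h1>Archive</h1>
  <p>First <b>post</b> text.</p>
  <div>Navigation</div>
  <p>Second post text.</p>
</body></html>
"""


def extract_paragraphs(html):
    """Return a list with the text of each paragraph in the HTML string."""
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    return parser.paragraphs


print(extract_paragraphs(SAMPLE_HTML))
# ['First post text.', 'Second post text.']
```

The heading and navigation text are skipped, which is the main point of scraping: keeping the content you want and discarding the rest of the page.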

APIs 

Some sites may offer APIs (application programming interfaces) as a more direct and predefined way to access their data. These are often more reliable and consistent than simply scraping. However, they come with their own challenges, and some sites put this option behind a paywall. Examples of sites with APIs include GitHub, YouTube, and Spotify.
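
Working with an API usually comes down to two steps: building a request URL with your search parameters, and reading the structured (often JSON) response. The sketch below illustrates both with the Python standard library. The endpoint, the parameter names (q, per_page), and the response fields are all hypothetical and do not describe any real service; consult the documentation of the API you plan to use. No network request is made here.

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint and parameter names, for illustration only.
BASE_URL = "https://api.example.org/v1/search"


def build_request_url(query, per_page=2):
    """Build a search URL with properly encoded query parameters."""
    return f"{BASE_URL}?{urlencode({'q': query, 'per_page': per_page})}"


# A hard-coded stand-in for the JSON body such an API might return.
SAMPLE_RESPONSE = """
{"total": 2,
 "items": [
   {"title": "Text mining basics", "year": 2021},
   {"title": "Topic modeling notes", "year": 2023}
 ]}
"""


def extract_titles(response_text):
    """Parse the JSON text and return the title of each item."""
    payload = json.loads(response_text)
    return [item["title"] for item in payload["items"]]


print(build_request_url("text analysis"))
# https://api.example.org/v1/search?q=text+analysis&per_page=2
print(extract_titles(SAMPLE_RESPONSE))
# ['Text mining basics', 'Topic modeling notes']
```

Because the response already has named fields, there is no need to pick content out of page markup, which is why APIs tend to be more consistent than scraping.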

Library Tools 

If this is your first time doing text analysis, we recommend starting with the options available through the library. Most of these tools are created with beginners in mind, with room to grow. They also have the bonus of including built-in data. Using these tools may also help you understand the power of text analysis in general.

The tools available through Binghamton University and their login steps are as follows:

Constellate 

  • Go to the main page
  • Click "Log in" in the upper right corner
  • You need a free JSTOR account to log in.
    • Log in if you already have one
    • If not, select the option to register for one
  • You will need to be affiliated with Binghamton University as an institution to access the Constellate Lab. In the upper right-hand corner, there should be a banner that says "We think you are at SUNY, Binghamton University." 
  • From the Dashboard page, you should be able to Build Datasets and open the Constellate Lab to analyze them using Jupyter Notebooks. 

Gale Digital Scholar Lab

  • Go to the main page
  • Click on the red "Log In/Create Account" button in the upper left corner
  • Select the "Sign in with Google" option
  • Use your binghamton.edu account to connect
  • Click "Personal" for workspace options if asked 

HathiTrust Research Center

  • Go to the main page 
  • Click on the blue “Sign In” button in the top right corner.
  • Choose your institution from the drop-down menu.
  • Use your Binghamton credentials/authentication to log in.
  • On your first sign-in, you may be required to confirm that you created the account. The confirmation request will go to your Binghamton email.

TDM Studio

  • Go to the main page
  • Click "Create Account" in the upper right corner
  • Input your Binghamton email and create a password
  • Access Visualization Tools 
  • Contact the DS Team for Workbench access 

Some of these tools allow you to upload your own corpus, so you can use them without being limited to the built-in data.

Other options 

While there are a great many text analysis tools available, choosing the best fit will depend on your goals and level of expertise. For example, a very simple and free tool is Voyant. Starting with Voyant may give you an idea of what is possible with text analysis. However, if you are familiar with programming languages, you may want to use libraries in Python or R to conduct text analysis. Constellate, one of our paid text analysis services, has free and openly available tutorials on getting started with text analysis with Python.

Analysis 

The analysis is the information you choose to pull from your data. Your options will depend on your tool. Library products will have set parameters for popular analysis functions. Here is a breakdown of what each offers:

Constellate

  • In the "Datasets" option, Constellate offers prebuilt visualizations when building datasets through their documents. These include:
  • The Constellate Lab employs Python and Jupyter Notebooks for more user flexibility. Constellate offers tutorials and the code needed to do more complex and specific analyses, but you may need baseline coding knowledge to analyze texts in the Lab.

Gale Digital Scholar Lab

HathiTrust Research Center

TDM Studio

  • Geographic Analysis
  • Topic Modeling
  • Sentiment Analysis
  • See Help & Learn tab for more information

Results from all the services listed here can be exported in different formats, including CSV data files, PNG image files (for visualizations), and PDF files that can be used in research.

Other Analysis Options 

The library tools listed above cover some of the most common functions for text analysis. However, many other options (in addition to those above) are available when you work in Python or R. Depending on your needs, you will need to install different packages to complete the analyses. If you are interested in going this route, Constellate has free and openly available tutorials on getting started with text analysis with Python. For more information, including using R, the University of Pennsylvania Libraries also has an extensive guide on the different package options and methods.
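
As a rough illustration of what a hand-coded analysis involves, the sketch below scores short texts with a simple word-list approach to sentiment analysis: count the positive words, subtract the negative ones. The two word lists and the sample sentences are made up and far too small for real research; dedicated Python or R packages ship with larger, tested lexicons and handle negation, intensity, and similar details that this sketch ignores.

```python
import re

# Tiny hypothetical word lists, for illustration only.
POSITIVE = {"good", "great", "love", "happy", "excellent"}
NEGATIVE = {"bad", "poor", "hate", "sad", "terrible"}


def sentiment_score(text):
    """Return the number of positive words minus the number of negative words."""
    words = re.findall(r"[a-z']+", text.lower())
    positive_hits = sum(w in POSITIVE for w in words)
    negative_hits = sum(w in NEGATIVE for w in words)
    return positive_hits - negative_hits


# Made-up sample texts standing in for documents in a corpus.
reviews = [
    "I love this great album",
    "A sad and terrible ending",
    "It arrived on Tuesday",
]

scores = [sentiment_score(review) for review in reviews]
print(scores)  # [2, -2, 0]
```

A score above zero suggests a positive text, below zero a negative one, and zero means no listed words were found.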