Digital Scholarship

A guide to digital scholarship tools, methods, and best practices across the digital humanities and data-driven fields

Text Analysis

Text analysis is the process of extracting information from large bodies of text (corpora), which are unstructured data, and organizing it in machine-readable ways. This organization helps researchers find connections within the data and quantify them using different analysis techniques. It is also known as text mining, because you mine the text to uncover previously unknown patterns and trends.

Text analysis can be an important part of personal research or a useful skill for classroom instruction. It is beneficial as a starting point for a topic as it can answer questions such as:

What kind of keywords are being used? (word cloud)
What emotions can be pulled from the texts? (sentiment analysis)
What places are associated with your data? (geographic analysis)
What kind of connections or subject clusters exist? (topic modeling)

The text analysis process has three main components: data, tools, and analysis. The data is the source material you draw from. The tools process that data. The analysis is the set of conclusions you draw from the data with help from the tools. You may already have in mind what you want to use for one or more of these components, but each tab will walk you through ideas to get you started.

An example of word cloud analysis of the Beatles' Wikipedia page.
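
A word cloud is built on word counts: the more often a word appears, the larger it is drawn. The sketch below shows that counting step using only Python's standard library. The stop word list, the sample text, and the function name word_frequencies are assumptions made for illustration; they do not reflect how any particular tool builds its word clouds.

```python
import re
from collections import Counter

# A small, hand-picked stop word list for illustration only; real projects
# usually rely on a longer list supplied by their tool or library.
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "on", "the", "to", "were"}


def word_frequencies(text, top_n=5):
    """Return the top_n most common non-stop words as (word, count) pairs."""
    # Lowercase the text and treat runs of letters/apostrophes as tokens.
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(t for t in tokens if t not in STOP_WORDS)
    return counts.most_common(top_n)


# Made-up sample text standing in for a real corpus.
sample = (
    "The band recorded songs in the studio. "
    "The songs were played on the radio, and the band toured."
)

for word, count in word_frequencies(sample, top_n=3):
    print(word, count)
```

A word cloud tool would then scale each word's font size by its count. Removing stop words first keeps very common words such as "the" from dominating the picture.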

Data

Depending on the project that you are working on, your data may come from a variety of sources. Paid tools available through the library include built-in data sources covered under copyright. These proprietary tools and the databases they are linked with are as follows:

Constellate - JSTOR
Gale Digital Scholar Lab - Gale
HathiTrust Research Center - HathiTrust
TDM Studio - ProQuest

For examples of what is included in the databases, you can search the providers on the BU libraries database list.

However, depending on your project, you may obtain a corpus from other sources. Several sites host datasets, such as Re3data, Kaggle, and Awesome Data. Large organizations also provide open data, including U.S. Census Bureau Data, Data.gov, Google Books, Internet Archive, and Wikidata.

Web Scraping

If there is no existing corpus or dataset for your topic, web scraping is another option. Web scraping, including social media scraping, involves crawling specific websites to collect information. This process is much more involved than using existing data and requires more technical skill. It also has legal and ethical limitations that depend on the sites you are scraping. Trends and allowances often shift, so you must stay informed of the Terms of Use of each site you plan to scrape. Two of the most popular web scraping tools are Python libraries: Beautiful Soup (recommended for beginners) and Scrapy (for more advanced projects).
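
To give a sense of what the extraction step of scraping involves, here is a minimal sketch that pulls paragraph text out of an HTML string. It uses Python's built-in html.parser module as a stand-in for Beautiful Soup so that it runs without installing anything. The sample HTML and the names ParagraphExtractor and extract_paragraphs are invented for illustration, and no website is actually contacted.

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collect the text found inside <p> elements."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []  # finished paragraph strings
        self._in_p = False    # True while the parser is inside a <p>
        self._buffer = []     # text pieces of the current paragraph

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_p = True
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "p" and self._in_p:
            # Join the pieces and collapse runs of whitespace.
            text = " ".join("".join(self._buffer).split())
            if text:
                self.paragraphs.append(text)
            self._in_p = False

    def handle_data(self, data):
        if self._in_p:
            self._buffer.append(data)


def extract_paragraphs(html):
    """Return a list with the text of each <p> element in html."""
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    return parser.paragraphs


# Hand-written sample page; a real project would download the HTML first.
SAMPLE_HTML = """
<html><body>
  <h1>Archive</h1>
  <p>First <b>letter</b> from the collection.</p>
  <p>Second letter,
     dated 1921.</p>
</body></html>
"""

for paragraph in extract_paragraphs(SAMPLE_HTML):
    print(paragraph)
```

Beautiful Soup wraps this kind of parsing in a friendlier interface, which is one reason it is often suggested to beginners. Downloading pages is a separate step, and it is the step where a site's Terms of Use apply.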

APIs

Some sites may offer APIs (application programming interfaces) as a more direct and predefined way to access their data. These are often more reliable and consistent than scraping. However, they come with their own challenges, and some sites put API access behind a paywall. Examples of sites with APIs include GitHub, YouTube, and Spotify.
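
Most APIs return structured data, often as JSON, which is much easier to work with than raw web pages. The sketch below separates the two steps involved: requesting the data and picking out the fields you need. The sample record is hand-written to resemble a GitHub repository record, and the helper names fetch_json and summarize_repo are assumptions for illustration. The live request is left commented out because it needs network access and is subject to the provider's rate limits and terms.

```python
import json
import urllib.request


def fetch_json(url):
    """Request a URL and decode its JSON body (requires network access)."""
    request = urllib.request.Request(url, headers={"Accept": "application/json"})
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))


def summarize_repo(payload):
    """Keep only the fields of interest from a repository record."""
    return {
        "name": payload.get("full_name"),
        "description": payload.get("description"),
        "stars": payload.get("stargazers_count", 0),
    }


# Hand-written sample, not real data, shaped like a repository record.
SAMPLE_RESPONSE = (
    '{"full_name": "example-owner/example-repo", '
    '"description": "Demo corpus tools", '
    '"stargazers_count": 12, "fork": false}'
)

print(summarize_repo(json.loads(SAMPLE_RESPONSE)))

# With network access, a live call might look like this
# (OWNER and REPO are placeholders to replace):
# data = fetch_json("https://api.github.com/repos/OWNER/REPO")
# print(summarize_repo(data))
```

Keeping the parsing in its own function means it can be tested on saved responses without sending repeated requests to the API.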

Library Tools

If this is your first time doing text analysis, we recommend starting with the options available through the library. Most of these tools are created with beginners in mind, with room to grow. They also have the bonus of including built-in data. Using these tools may also help you understand the power of text analysis in general.

The tools available through Binghamton University and their login steps are as follows:

Constellate (available until June 2025)

Go to the main page
Click "Log in" in the upper right corner
You need a free JSTOR account to log in.
- Log in if you already have one
- If not, select the option to register for one.
You will need to be affiliated with Binghamton University as an institution to access the Constellate Lab. In the upper right-hand corner, there should be a banner that says "We think you are at SUNY, Binghamton University."
From the Dashboard page, you should be able to Build Datasets and open the Constellate Lab to analyze them using Jupyter Notebooks.

Gale Digital Scholar Lab

Go to the main page
Click on the red "Log In/Create Account" button in the upper left corner
Select the "Sign in with Google" option
Use your binghamton.edu account to connect
Click "Personal" for workspace options if asked

HathiTrust Research Center (available until December 2026)

Go to the main page
Click on the blue “Sign In” button in the top right corner.
Choose your institution from the drop-down menu, then log in with your Binghamton credentials/authentication.
On your first sign-in, you may be required to confirm that you created the account; the confirmation message will go to your Binghamton email.

TDM Studio

Go to the main page
Click "Create Account" in the upper right corner
Enter your Binghamton email and create a password
Access Visualization Tools
Contact the DS Team for Workbench access

Some of these tools allow you to upload your own corpus, which means you can use the tools without being limited to their built-in data.

Other options

While there are a great many text analysis tools available, choosing the best fit will depend on your goals and level of expertise. For example, Voyant is a very simple and free tool. Starting with Voyant may give you an idea of what is possible with text analysis. However, if you are familiar with programming languages, you may want to use libraries in Python or R to conduct text analysis. Constellate, one of our paid text analysis services, has free and openly available tutorials on getting started with Text Analysis with Python.

Analysis

The analysis is the information you choose to pull from your data. Your options will depend on your tool. Library products have set parameters for popular analysis functions. Here is a breakdown of what each offers:

Constellate

Under the "Datasets" option, Constellate offers prebuilt visualizations when you build a dataset from its documents. These include:
- Number of documents and metadata
- Keyphrases
- Word Frequency
- Treemaps
The Constellate Lab employs Python and Jupyter Notebooks for more user flexibility. They offer tutorials and the code needed to do more complex and specific analyses, but you may need a baseline coding knowledge to analyze texts in the Lab.

Gale Digital Scholar Lab

HathiTrust Research Center

TDM Studio

Geographic Analysis
Topic Modeling
Sentiment Analysis
See Help & Learn tab for more information

Results from all the services listed here can be exported in different formats, including CSV data files, PNG image files (for visualizations), and PDF files that can be used in research.

Other Analysis Options

The library tools listed above offer some of the most common functions for text analysis. However, many other options (plus those above) are available through Python or R. Depending on your needs, you will need to install different packages to complete the analyses. If you are interested in going this route, Constellate has free and openly available tutorials on getting started with Text Analysis with Python. For more information, including using R, the University of Pennsylvania Libraries also has an extensive guide on the different package options and methods.
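
As one small illustration of the Python route, sentiment analysis can be approximated by counting words from positive and negative word lists. The sketch below is a toy version of that idea: the two word lists are invented for this example and are far smaller than any published lexicon, and the function names sentiment_score and sentiment_label are assumptions. Dedicated packages handle negation, intensity, and context, which this sketch ignores.

```python
import re

# Toy word lists invented for this sketch; published lexicons are far larger.
POSITIVE = {"good", "great", "love", "wonderful", "happy"}
NEGATIVE = {"bad", "terrible", "hate", "awful", "sad"}


def sentiment_score(text):
    """Return (positive count - negative count) divided by total tokens."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    positive = sum(token in POSITIVE for token in tokens)
    negative = sum(token in NEGATIVE for token in tokens)
    return (positive - negative) / len(tokens)


def sentiment_label(text):
    """Map the score to a coarse label."""
    score = sentiment_score(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


# Made-up sentences standing in for documents from a corpus.
for sentence in [
    "I love this wonderful album",
    "What a terrible, sad ending",
    "The record was released in spring",
]:
    print(sentiment_label(sentence), "-", sentence)
```

Because the score only counts listed words, a sentence such as "not good" would be labeled positive. That limitation is a useful reminder to check how any sentiment tool defines its scores before interpreting the results.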

Last Updated: Mar 17, 2025 10:35 AM
URL: https://libraryguides.binghamton.edu/digitalScholarship