Text and Data Mining (TDM) at University of Toronto

This guide introduces U of T researchers and students to the resources available to them if they wish to undertake a TDM project, outlining available datasets and platforms, corpus creation, and APIs.

Building a Corpus

If you are undertaking a research project that requires assembling a corpus of materials from multiple sources, this guide can help you.

First, create a list of the required materials.

  • This may be a number of journal titles, specific articles from multiple journals based on the results from searches, or the name of a publication over a period of time  
  • If you have conducted a search to identify materials for inclusion, it’s a good idea to keep a record of your search terms and methods 
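One way to keep the search record suggested above is a small CSV log with one row per search. The sketch below is only an illustration: the field names and the sample row are placeholder assumptions, not a format prescribed by the library, so adapt them to your own project.

```python
import csv
import io
from dataclasses import dataclass, asdict, fields
from datetime import date


@dataclass
class SearchRecord:
    """One search run while identifying materials (field names are illustrative)."""
    run_on: str    # ISO date the search was run
    source: str    # catalogue or database that was searched
    query: str     # exact search string as entered
    filters: str   # limits applied, such as a date range or language
    results: int   # number of hits returned


def write_search_log(records, handle):
    """Write the search records as CSV to any open text handle."""
    writer = csv.DictWriter(handle, fieldnames=[f.name for f in fields(SearchRecord)])
    writer.writeheader()
    for record in records:
        writer.writerow(asdict(record))


# Placeholder entry for illustration only; not a real search result.
log = [
    SearchRecord(date(2024, 1, 15).isoformat(), "Library catalogue",
                 '"Journal of Psychology"', "1990-2000", 12),
]

buffer = io.StringIO()
write_search_log(log, buffer)
print(buffer.getvalue())
```

To keep the log on disk, pass a file opened with `open("search_log.csv", "w", newline="", encoding="utf-8")` instead of the in-memory buffer.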

There are two issues we need to tackle to assemble a usable corpus: 

  1. Permissions: ensuring you can acquire and text mine the data from a legal and licensing perspective
  2. Acquisition: figuring out how to get the data

Permissions 

Once you’ve identified what materials you’d like to analyze, it’s vital to check that you have permission to do so. 

Complying with our licensing agreements is important because it 

  • Ensures you don’t violate copyright or licensing agreements 
  • Protects your research from possible retraction or publisher response  
  • Allows all U of T users continued access to these materials

If you have any questions about copyright or licensing considerations, please contact our Scholarly Communications and Copyright Office at scco@library.utoronto.ca 

How to Check Permissions 

  1. Identify the publisher(s) of the materials you are hoping to include. This might require looking the journal title up in the catalog or conducting Web searches.  
    • If publishers are listed in our available resources guide, you can use these materials for TDM and often can acquire them using an API 
  2. Search for journal titles in the catalog and see if TDM permissions are listed  

Example

For example, if we search for “Journal of Psychology,” we’ll find this entry. Below we can see a number of access options--databases where you can find issues and articles from The Journal of Psychology. Next to each of those is a link to "Show License."  

[Screenshot: Journal of Psychology record in our library catalog]

Clicking that link reveals which uses are permitted for this journal from that source. In this case, I've clicked "Show License" for the Periodicals Archive Online Collection 2. We can see that text and data mining is allowed.

[Screenshot: List of permitted uses for the Journal of Psychology]

Contact the Library 

If you cannot determine whether TDM is allowed for certain materials or need help reaching out to a publisher to request permission, contact us at mdl@library.utoronto.ca 

Acquiring the Data 

Once you’ve determined that you have permission to conduct TDM with the materials, the next step is finding a practical way to obtain them.

If you’re using a small number of articles, it may be easiest to manually download them through a user interface. In some cases, publishers will deliver files directly to you. Often, the easiest way to acquire textual data is using APIs.

Consult the Journals section of this guide for how to acquire materials from some major publishers. Consult the API section of this guide to find information and resources on using APIs to acquire journal articles. 
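As a rough illustration of what acquiring articles through an API can look like, the sketch below downloads one text file per article identifier. Everything publisher-specific here is an assumption: the `api.example.org` URL pattern, the `X-API-Key` header, and the idea that the response body is plain text are hypothetical stand-ins, so check the publisher's API documentation for the real endpoint, authentication method, and rate limits.

```python
import time
import urllib.request
from pathlib import Path

# Hypothetical URL pattern; replace with the one from the publisher's API documentation.
API_TEMPLATE = "https://api.example.org/articles/{article_id}/fulltext"


def fetch_text(url, api_key):
    """Default fetcher: GET the URL with a hypothetical API-key header, return the body."""
    request = urllib.request.Request(url, headers={"X-API-Key": api_key})
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.read().decode("utf-8")


def download_corpus(article_ids, out_dir, api_key, fetch=fetch_text, delay=1.0):
    """Save one text file per article ID, skipping files that already exist."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    saved = []
    for article_id in article_ids:
        target = out_dir / f"{article_id}.txt"
        if not target.exists():
            url = API_TEMPLATE.format(article_id=article_id)
            target.write_text(fetch(url, api_key), encoding="utf-8")
            time.sleep(delay)  # pause between requests to stay within rate limits
        saved.append(target)
    return saved
```

A call such as `download_corpus(["id-001", "id-002"], "corpus", api_key="...")` would write `corpus/id-001.txt` and `corpus/id-002.txt`. Skipping existing files lets an interrupted run be resumed without requesting the same article twice. Identifiers that contain slashes, such as DOIs, would need to be converted to safe file names first.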
