Text and Data Mining (TDM) at University of Toronto

This guide introduces U of T researchers and students to the resources available to them for undertaking a TDM project, outlining available datasets and platforms, corpus creation, and APIs.

Books

HathiTrust

  • Materials in the public domain are available for text mining 
  • HathiTrust Research Analytics links to datasets and gives an overview of the collection  

Internet Archive (Free) 

Project Gutenberg (Free) 

Text Creation Partnership  

  • Full text versions of early print books  
  • Three major collections: 
    • Early English Books Online  
    • Eighteenth Century Collections Online
    • Early American Imprints 
  • These collections include transcriptions. Some downloading and data wrangling may be required before the texts can be mined.
  • Existing dataset: Eighteenth Century Collections Online texts, 2,198 plain-text English documents from Eighteenth Century Collections Online [TCP-ECCO] (zip file)
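
As an illustration of the kind of data wrangling a zipped plain-text dataset may need, the following minimal Python sketch reads every .txt file in a zip archive and tallies word frequencies. It uses only the standard library. The function name, the UTF-8 encoding, and the simple letters-only tokenizer are assumptions made for illustration; they are not drawn from the dataset's documentation, and the real archive may be organized differently.

```python
import io
import re
import zipfile
from collections import Counter

# Assumption: a crude tokenizer that keeps runs of ASCII letters only.
TOKEN_RE = re.compile(r"[a-z]+")


def count_words_in_zip(archive, encoding="utf-8"):
    """Return word frequencies across all .txt members of a zip archive.

    `archive` may be a file path or a binary file-like object.
    """
    counts = Counter()
    with zipfile.ZipFile(archive) as zf:
        for name in zf.namelist():
            if not name.lower().endswith(".txt"):
                continue  # skip directories and non-text members
            # Assumption: files are UTF-8; undecodable bytes are replaced.
            text = zf.read(name).decode(encoding, errors="replace")
            counts.update(TOKEN_RE.findall(text.lower()))
    return counts


# Small in-memory archive standing in for a real download.
demo = io.BytesIO()
with zipfile.ZipFile(demo, "w") as zf:
    zf.writestr("a.txt", "The quick fox. The end.")
    zf.writestr("notes.md", "ignored")
demo.seek(0)
print(count_words_in_zip(demo).most_common(3))
```

To run this against a downloaded archive, pass its local path in place of the in-memory demo object. For real projects, a proper tokenizer and stop-word handling would likely replace the regular expression used here.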

TXT Lab (Free) 

  • Datasets produced by the TXT Lab at McGill University, including:
    • Novel 450
    • Race in Cinema
    • Academic Publishing
  • Note the citation instructions

Other Book Datasets 

 
