Research Guides

Text and Data Mining (TDM) at University of Toronto

This guide introduces U of T researchers and students to the resources available to them if they wish to undertake a TDM project, outlining available datasets and platforms, corpus creation, and APIs.

Web Archives

Common Crawl (Free)

  • Web page data, metadata, and text from regular crawls of the internet
  • Learn more on the Common Crawl overview page

Internet Archive 

  • The Wayback Machine allows users to view past versions of websites 
  • Archive-It hosts web archive collections
  • WARC (Web ARChive) files are generally not analysis-ready
  • If you have WARC files and would like to extract text from them, the Archives Unleashed Toolkit provides code for doing so
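
To give a sense of why WARC files need processing before analysis, here is a minimal sketch of text extraction written with the Python standard library only. It is not the Archives Unleashed Toolkit and does not use its API; it is a simplified stand-in that assumes uncompressed (or already decompressed) WARC data, UTF-8 HTML pages, and HTTP responses without chunked transfer encoding. The function names (iter_warc_records, extract_text, html_to_text) are hypothetical and chosen for illustration.

```python
import io
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Return the visible text of an HTML string, joined by single spaces."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def iter_warc_records(stream):
    """Yield (headers, block_bytes) for each record in a binary WARC stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if not line.startswith(b"WARC/"):
            continue  # blank separator lines between records
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # blank line ends the WARC header section
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip().lower()] = value.strip()
        length = int(headers.get("content-length", 0))
        yield headers, stream.read(length)


def extract_text(stream):
    """Yield (target_uri, text) for each HTML 'response' record."""
    for headers, block in iter_warc_records(stream):
        if headers.get("warc-type") != "response":
            continue
        # A response block holds the HTTP headers, a blank line, then the body.
        http_head, _, body = block.partition(b"\r\n\r\n")
        if b"text/html" not in http_head.lower():
            continue
        text = html_to_text(body.decode("utf-8", "replace"))
        yield headers.get("warc-target-uri", ""), text


if __name__ == "__main__":
    # Tiny in-memory record, made up for demonstration purposes.
    http = b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<p>Hello</p>"
    sample = (b"WARC/1.0\r\nWARC-Type: response\r\n"
              b"WARC-Target-URI: http://example.org/\r\n"
              b"Content-Length: " + str(len(http)).encode() + b"\r\n\r\n"
              + http + b"\r\n\r\n")
    for uri, text in extract_text(io.BytesIO(sample)):
        print(uri, text)
```

For a gzip-compressed file, the stream could be opened with gzip.open(path, "rb") from the standard library, where path is a placeholder for your own file. For real collections, a dedicated tool such as the Archives Unleashed Toolkit is likely a better choice, since it handles encodings, non-HTML content, and large-scale processing that this sketch ignores.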

ARCH Datasets

Other Web Archive Datasets 
