Text and Data Mining (TDM) at University of Toronto

This guide introduces U of T researchers and students to the resources available to them for undertaking a TDM project. It outlines available datasets and platforms, corpus creation, and APIs.

Web Archives

Common Crawl (Free)

Web page data, metadata, and text from regular crawls of the internet
Learn more on their overview page

Internet Archive

The Wayback Machine allows users to view past versions of websites
Archive-It hosts web archive collections built by partner institutions
WARC files (Web archive files) are generally not analysis-ready
If you have WARC files and would like to extract text from them, the Archives Unleashed Toolkit provides code for doing so
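
To illustrate why WARC files need processing before analysis, here is a minimal sketch in Python that reads records from a WARC file and pulls the visible text out of HTML responses. It uses only the standard library and is not the Archives Unleashed Toolkit's method. The function names (iter_warc_records, extract_texts, html_to_text) are hypothetical, and the sketch assumes an uncompressed WARC file whose HTTP bodies are neither chunked nor compressed, which is often not true of real crawls.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html):
    """Return the text fragments of an HTML string joined by single spaces."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def iter_warc_records(stream):
    """Yield (headers, block) for each record in an uncompressed WARC stream.

    `stream` must be a binary file-like object. Header names are lowercased.
    """
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if not line.startswith(b"WARC/"):
            continue  # blank separator lines between records
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # blank line ends the record header
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip().lower()] = value.strip()
        # The record block is exactly Content-Length bytes long.
        block = stream.read(int(headers.get("content-length", 0)))
        yield headers, block


def extract_texts(stream):
    """Yield (url, text) for each HTML response record in the stream."""
    for headers, block in iter_warc_records(stream):
        if headers.get("warc-type") != "response":
            continue  # skip warcinfo, request, metadata records, etc.
        # A response block is an HTTP message: head, blank line, body.
        http_head, _, body = block.partition(b"\r\n\r\n")
        if b"text/html" not in http_head.lower():
            continue
        # Assumption: the page is UTF-8; undecodable bytes are replaced.
        text = html_to_text(body.decode("utf-8", "replace"))
        yield headers.get("warc-target-uri", ""), text
```

To try it, open a file in binary mode (for example open("example.warc", "rb"), where the filename is a placeholder) and loop over extract_texts(f). For collections of any real size, a purpose-built tool such as the Archives Unleashed Toolkit is likely the better choice.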

ARCH Datasets

Textual datasets (in CSV format) derived from the University of Toronto's Archive-It collections
Available collections:
- Canadian Political Parties and Political Interest Groups
- Covid-19 in Ontario
Contact us for access to collections or to request a plain-text collection
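
Once you have a CSV dataset, a first exploratory step might look like the sketch below, which counts the most frequent words in one text column using Python's standard library. The column name "content" and the function name top_terms are assumptions made for illustration. Check the header row of the file you receive and adjust accordingly.

```python
import csv
import re
from collections import Counter

# Assumption: the text lives in a column named "content"; the real
# column name in a given dataset may differ.
TEXT_COLUMN = "content"

# Full-page text can exceed the csv module's default field size limit.
csv.field_size_limit(10**9)


def top_terms(csv_file, text_column=TEXT_COLUMN, n=10):
    """Return the n most frequent lowercase word tokens in one CSV column.

    `csv_file` is an open text-mode file (or any iterable of lines)
    whose first row is a header.
    """
    counts = Counter()
    for row in csv.DictReader(csv_file):
        text = (row.get(text_column) or "").lower()
        # Very rough tokenization: runs of ASCII letters and apostrophes.
        counts.update(re.findall(r"[a-z']+", text))
    return counts.most_common(n)
```

When reading from disk, open the file with newline="" as the csv module recommends, for example open("dataset.csv", newline="", encoding="utf-8"), where the filename is a placeholder. The tokenization here ignores accented characters and stop words, so treat the output as a quick sanity check and not as an analysis.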

Other Web Archive Datasets

Web Archive for Historical Research Dataverse
- Web Archive datasets on a variety of topics
Web Archive for Longitudinal Knowledge (WALK)