Unlocking and Analyzing Historical Texts

Goals
The human record is enormous, ranging from the text we produce on the internet today to ancient writings on clay, such as cuneiform tablets. Unlike text that is “born digital” today, many historical texts and their metadata remain locked up in various inscrutable archival states, rendering computational analysis of these texts impossible. The goal of this project is to build an ecosystem of computational (AI) tools to identify, obtain, scrape, and process these ancient documents in varying states of digitization from a multitude of libraries and publicly available archival sources.
Issues Involved or Addressed
Historical texts are publicly available across museums and libraries, but they exist in many states that are not machine readable: a) the text is completely undigitized (only physical relics are to be found in museums); b) archival images exist in varying formats, resolutions, and quality across libraries with different conventions for storing, cataloging, and archiving (for example, the cuneiform tablets have been 3D-imaged, and ancient Greek and Latin texts are available as microfilm images); or c) partially machine-readable versions of certain editions of archived documents exist. However, the conventions for storing metadata, treating paratext such as footnotes and marginalia, and handling digitized outputs vary greatly among libraries. Getting the text into a machine-readable state for computational analysis involves a lot of manual effort that we aim to alleviate: navigating non-standardized and varying conventions for cataloging and archiving information, requesting and scraping archived data from libraries and public archives, developing and deploying image analysis algorithms to recognize the layout, format, and other visual characteristics of documents, developing and deploying optical character recognition (OCR) systems, and working with the diverse structure of the printed text and its metadata. Many of these steps, such as OCR, tend to be imperfect and noisy, so developing evaluation schemes for these systems is also necessary.
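As one illustration of what an evaluation scheme for noisy OCR output might look like, the sketch below computes a character error rate (CER): the edit distance between a hand-checked reference transcription and the OCR output, divided by the length of the reference. This is only a minimal sketch of a common metric, not the project's chosen method; the function names and the sample strings are hypothetical.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, and substitutions turning one string
    into the other."""
    # previous[j] holds the distance between the reference prefix
    # processed so far and the first j characters of the hypothesis.
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_char != hyp_char)
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            current.append(min(substitution, insertion, deletion))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance normalized by the reference length."""
    if not reference:
        raise ValueError("reference transcription must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # Hypothetical example: a reference line and a noisy OCR reading of it.
    reference_line = "arma virumque cano"
    ocr_line = "arma uirumque cauo"
    print(f"CER: {character_error_rate(reference_line, ocr_line):.3f}")
```

A lower CER means the OCR output is closer to the reference. In practice such a score would be computed over many pages and read alongside qualitative inspection, since a single number hides which kinds of errors (layout, diacritics, marginalia) dominate.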
Partners/Sponsors
No partners or sponsors as of now, but I am working on a grant proposal for a project on the analysis of ancient Greek and Latin texts and their historical translations, which would require the ecosystem and tools mentioned above as the starting point for acquiring and processing these texts.
Methods and Technologies
- Computer vision
- Natural language processing
- Catalog management
- Dataset curation
- Qualitative and quantitative evaluation of AI systems
- Optical Character Recognition
- Information retrieval
- Bibliographic methods
- Unit testing
- Data science
Majors Sought
Computing: Computer Science
Liberal Arts: Global Media and Cultures; History, Technology, and Society
Preferred Interests and Preparation
Basic familiarity with shell scripting, Python, and programming in general will be advantageous. Some knowledge of catalogs, library practices, archives, or historical texts would also be a plus.
Advisor
Kartik Goyal
kartikgo@gatech.edu
Day, Time & Location
Full Team Meeting:
5:00 – 5:50 Wednesday
Klaus 2446
Subteam meetings scheduled after classes begin.