Automating the recognition of historical Chinese handwritten texts

Hosted by the British Library

This fellowship sits within the British Library’s Digital Scholarship Department. It will engage with new digital tools and techniques in order to explore possible solutions to automate the transcription of historical Chinese handwritten texts. The fellowship will focus on material from Dunhuang (China), part of the Stein collection, which is being digitised through the Lotus Sutra Manuscripts Digitisation Project as part of the digitisation activities conducted by the British Library to make the collections under its custodianship accessible to all. The digitised content will be accessible through the International Dunhuang Project (IDP) platform.

 

The Stein Collection

The British Library’s Stein collection, gathered by Aurel Stein in the early 20th century, is one of the most outstanding collections of manuscripts and printed books from China and Central Asia. It is of immense historical and cultural significance, containing over 45,000 items written on paper, wood and other materials in many languages, such as Chinese, Tibetan, Sanskrit, Tangut, Khotanese, Kuchean, Sogdian, Uighur, Turkic and Mongolian. It notably holds some of the most important surviving Buddhist texts, such as the famous printed copy of the Diamond Sutra from the Dunhuang Library Cave dated to 868 AD.

 

The International Dunhuang Project

Established by the British Library in 1994, the International Dunhuang Project is an international collaborative programme including institutions from Europe, Asia and the US holding collections related to Dunhuang and other Silk Road sites. All partners aim to conserve, catalogue and digitise manuscripts, printed texts, paintings, textiles and artefacts under their custodianship and make them freely available online on a web platform. As part of this effort, and thanks to the generous support of a number of institutions and foundations, a large number of manuscripts from the Stein collection have been digitised and images have been made available on the IDP website (over 170,000 to date).


Project scope and objectives

Building upon this vast and well-curated digitised resource, the Library’s Digital Scholarship Department aims to promote the collection, enhance its searchability, and actively engage with innovative research using its data, through methods such as text mining and data visualisations. As part of this work, members of the Digital Scholarship team are engaging closely with the development of Optical Character Recognition (OCR) and Handwritten Text Recognition (HTR) systems for non-Western scripts.

The Chevening Fellow will contribute to these efforts. They will research the current landscape of Chinese handwritten text recognition – looking into methods, challenges, tools and software. They will test our material with existing tools and demonstrate digital research opportunities arising from the availability of texts in machine-readable format.

The Library’s ongoing Lotus Sutra Manuscripts Digitisation Project aims to conserve, catalogue and digitise nearly 800 Lotus Sutra manuscripts from Dunhuang in the Chinese language. This corpus of texts constitutes an ideal test case: not only because the Lotus Sutra is one of the main Buddhist scriptures and the canonical edition has already been transcribed, but also because the manuscripts present minor variations, such as variant characters, handwriting and scribal errors. The fellow could therefore use the project’s digitised content as a starting point to examine approaches, opportunities and possible solutions to automate the transcription of our Chinese historical collections.
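One common way to compare a tool's output against an existing transcription is the character error rate (CER): the number of character insertions, deletions and substitutions needed to turn the output into the reference, divided by the length of the reference. The following is a minimal sketch of that measure, not a method prescribed by the project. The function names and the sample strings are illustrative assumptions, and a real evaluation would also need to decide how to treat punctuation and legitimate variant characters.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    # previous[j] holds the distance between the reference prefix processed
    # so far and the first j characters of the hypothesis.
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_char != hyp_char)
            deletion = previous[j] + 1      # reference character missing from output
            insertion = current[j - 1] + 1  # extra character in output
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance divided by the number of characters in the reference."""
    if not reference:
        raise ValueError("reference transcription must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # Illustrative strings only (assumption): an 8-character reference and a
    # hypothetical tool output that differs from it in one character.
    reference = "如是我聞一時佛住"
    hypothesis = "如是我闻一時佛住"
    print(character_error_rate(reference, hypothesis))
```

Because the measure is computed per character, a variant character that a scribe actually wrote would count as an error against the canonical edition. A manuscript-specific reference transcription would therefore give a fairer score than the canonical text alone.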


Key Responsibilities

  • To develop an in-depth understanding of the content digitised as part of the Lotus Sutra Manuscripts Digitisation Project
  • To research existing digitised materials available on the IDP website and identify different scripts and challenges for text recognition tools
  • To identify key stakeholders and research existing market solutions, tools and methods for Chinese OCR/HTR
  • To train text recognition systems with IDP materials, and to evaluate and compare the results
  • In collaboration with the relevant British Library colleagues, to increase awareness of the Stein and other Central Asian collections at the British Library and other digitised content available on the IDP platform, and to promote their research potential when in machine-readable format, e.g. text mining and data visualisation
  • To develop the Library’s engagement in a global network working with Chinese OCR/HTR systems and foster relationships with Chinese Digital Humanities research communities, which could form the basis for future partnerships

Deliverables

  • Creating or joining a network of scholars and professionals exploring OCR/HTR solutions for historical Chinese documents
  • A recommended platform, software or tool for the Library to work with using digitised materials available on the IDP platform
  • A report on the types of texts, scripts and potential challenges that OCR/HTR tools may face with digitised collection items available on the IDP website, including an overview of tested systems and outcomes
  • A suggested operational workflow to produce, proofread, correct and feed transcriptions back into Library strategic systems
  • Promoting the project internally and externally, including posts on the British Library’s Digital Scholarship, Asian and African Collections and IDP blogs, using other British Library social media platforms, and giving a talk for Library staff members about the project, its aims and outcomes
  • Contributing to the Library’s 2021 workshop/conference concluding the Lotus Sutra Manuscripts Digitisation Project
  • Sharing experience and lessons learnt, and participating in other related activities of the Digital Scholarship Department

Candidate requirements

  • Degree in a relevant subject e.g. digital humanities, computer science and/or cultural history
  • Knowledge of Chinese language, ideally with the ability to read/recognise several variants of historical Chinese scripts and calligraphic styles
  • Excellent written and spoken English
  • Familiarity with OCR/HTR systems
  • Demonstrable knowledge of tools and methods useful for digital humanities research e.g. text and data mining, named entity recognition, data modelling and linking, data visualisation
  • Interest in archival material, library collections and digitisation
  • Excellent writing skills and experience of networking and partnership building

Individuals must be resident in their home country at the time of making their application. Applicants from Mainland China will be eligible.

Applications are open

If you are eligible and believe you would be a strong candidate for a place on this programme, we encourage you to apply now.
