Automating the recognition of historical Chinese handwritten texts
Hosted by the British Library
This fellowship sits within the British Library’s Digital Research Team. It will engage with new digital tools and techniques to explore possible solutions for automating the transcription of historical Chinese handwritten texts. The fellowship will focus on material from Dunhuang (China), part of the Stein collection, which is being conserved and digitised through the Lotus Sutra Manuscripts Digitisation Project. This project is part of the British Library’s wider digitisation activities, which aim to make the collections under its custodianship accessible to all. The digitised content will be accessible through the International Dunhuang Project (IDP) platform.
This fellowship offers a unique professional development opportunity which would be particularly suited to candidates at the early or middle stages of their careers and with interests in and knowledge of both cultural heritage and digital humanities. The British Library is keen to explore opportunities for long-term partnership with the successful applicant’s home institution and will be happy to actively enable and encourage opportunities for dialogue with digital humanities networks in the UK. See below for full candidate requirements and eligibility criteria.
The Stein Collection
The British Library’s Stein collection, gathered by Aurel Stein in the early 20th century, is one of the most outstanding collections of manuscripts and printed books from China and Central Asia. It is of immense historical and cultural significance, containing over 45,000 items written on paper, wood and other materials in many languages, such as Chinese, Tibetan, Sanskrit, Tangut, Khotanese, Kuchean, Sogdian, Uighur, Turkic and Mongolian. It notably holds some of the most important surviving Buddhist texts, such as the famous printed copy of the Diamond Sutra from the Dunhuang Library Cave dated to 868 AD.
The International Dunhuang Project
Established by the British Library in 1994, the International Dunhuang Project is an international collaborative programme including institutions from Europe, Asia and the US holding collections related to Dunhuang and other Silk Road sites. All partners aim to conserve, catalogue and digitise manuscripts, printed texts, paintings, textiles and artefacts under their custodianship and make them freely available online on a web platform. The National Library of China and the Dunhuang Academy, in China, are amongst the project’s key contributors. As part of this effort, and thanks to the generous support of a number of institutions and foundations, a large number of manuscripts from the Stein collection have been digitised, and over 170,000 images have been made available on the IDP website to date.
Project scope and objectives
Building upon this vast and well-curated digitised resource, the Library’s Digital Research Team aims to promote the collection, enhance its searchability, and actively engage with innovative research using its data, through methods such as text mining and data visualisations. As part of this work, members of the Digital Research Team are engaging closely with the development of Optical Character Recognition (OCR) and Handwritten Text Recognition (HTR) systems for non-Western scripts.
The Chevening Fellow will contribute to these efforts. They will research the current landscape of Chinese handwritten text recognition – looking into methods, challenges, tools and software. They will test our material with existing tools and demonstrate digital research opportunities arising from the availability of texts in machine-readable format.
The Library’s ongoing Lotus Sutra Manuscripts Digitisation Project aims to conserve, catalogue and digitise nearly 800 Chinese-language Lotus Sutra manuscripts from Dunhuang. This corpus constitutes an ideal test case, not only because the Lotus Sutra is one of the main Buddhist scriptures and the canonical edition has already been transcribed, but also because the manuscripts present minor variations, such as variant characters, differences in handwriting and scribal errors. The fellow could therefore use the project’s digitised content as a starting point to examine approaches, opportunities and possible solutions for automating the transcription of our Chinese historical collections.
- To develop an in-depth understanding of the content digitised as part of the Lotus Sutra Manuscripts Digitisation Project
- To research existing digitised materials available on the IDP website and identify different scripts and challenges for text recognition tools
- To identify key stakeholders and research existing market solutions, tools and methods for Chinese OCR/HTR
- To train text recognition systems with IDP materials, evaluate and compare results
- In collaboration with the relevant British Library colleagues, to increase awareness of the Stein and other Central Asian collections at the British Library and of other digitised content available on the IDP platform, and to promote their research potential once in machine-readable format, for example through text mining and data visualisation
- To develop the Library’s engagement in a global network working with Chinese OCR/HTR systems and foster relationships with the Fellow’s home institution and Chinese Digital Humanities research communities, with a view to forming the basis for future partnerships and collaborations
- A recommended platform, software or tool for the Library to work with using digitised materials available on the IDP platform
- A report on the types of texts, scripts and potential challenges that OCR/HTR tools may face with digitised collection items available on the IDP website, including an overview of tested systems and outcomes
- A suggested operational workflow to produce, proofread, correct and feed transcriptions into Library strategic systems
- Promoting the project internally and externally, including posts on the British Library’s Digital Scholarship, Asian and African Collections and IDP blogs, using other British Library social media platforms, and giving a talk for Library staff members about the project, its aims and outcomes
- Contributing to the Library’s workshop/conference concluding the Lotus Sutra Manuscripts Digitisation Project
- Becoming an active member of a network of scholars and professionals exploring OCR/HTR solutions for historical Chinese documents, fostering longer-term working relationships
- Exchanging experiences and lessons learnt within UK, Chinese and global DH networks, laying the foundations for future inter-regional collaborations
- Participating in other related activities of the Digital Research Team
Candidate requirements
- Degree in a relevant subject e.g. digital humanities, computer science and/or cultural history
- Knowledge of Chinese language, ideally with the ability to read/recognise several variants of historical Chinese scripts and calligraphic styles
- Excellent written and spoken English
- Familiarity with OCR/HTR systems
- Demonstrable knowledge of tools and methods useful for digital humanities research e.g. text and data mining, Named Entity Recognition, data modelling and linking, data visualisation, etc.
- Interest in archival material, library collections and digitisation
- Excellent writing skills and experience of networking and partnership building
Eligibility criteria
Individuals must be resident in their home country at the time of making their application.
Note: This fellowship is only open to applicants from Mainland China.
- Staff-level access to unique British Library collections and research resources, including access to staff training opportunities
- Staff-level access to the Digital Scholarship Training Programme courses, workshops, talks and reading group
- Opportunity to network and exchange ideas with digital scholarship staff, East Asia section curators and the wider Asian & African Collections department and other colleagues across the Library, as well as externally within the UK and wider professional DH communities
- Opportunity to gain experience in disseminating project outcomes and engaging different audiences through various communication channels
- Opportunity to become familiar with the activities of the International Dunhuang Project and the work of the Endangered Archives Programme, which has helped digitise manuscripts and archival material in and around China
- Opportunity to enhance spoken and written English through work practice and collaboration with colleagues
Open for applications until 2 November 2021, at 12:00 (GMT)
If you are eligible and believe you would be a strong candidate for a place on this programme, we encourage you to apply now.