Archive Card Index vs. Transkribus: machine recognition of handwritten text
DOI:
https://doi.org/10.28925/2311-259x.2021.3.9
Keywords:
lexicography, Archive Card Index, Electronic System «Archive Card Index», machine recognition, Transkribus, lexicographical toloka (crowdsourcing)
Abstract
The subject of the research is machine recognition of the handwritten materials of the Archival Card Index (ACI), the lexical and phraseological materials of the dictionary commission of the All-Ukrainian Academy of Sciences, in particular the card index of the “Russian-Ukrainian Dictionary” (1924–1933), edited by A. Krymsky and S. Yefremov. The study of the ACI should be considered in the context of the cultural and national revival in Ukraine in the 20th and early 21st centuries. The relevance and value of the ACI became a prerequisite for transferring its materials to digital format. In 2018 the Institute of Ukrainian Language of the NAS of Ukraine created the computer system “Archival Card Index”, which makes the materials accessible primarily in the form of scanned images. The problem that needs urgent resolution is the conversion of the handwriting into typed text. The complexity of manual recognition, which requires considerable effort and time, encourages the study and application of the capabilities of the Transkribus resource, which relies on machine learning. The aim of the study is to clarify, by analyzing, systematizing, classifying and describing the material, the features of preparing ACI cards for machine processing of texts. The scientific novelty of the study is that, for the first time, it addresses the issue of providing the HTR engine with ACI training data (loading images to the platform, segmenting them into text regions and lines, and transcribing the content of each page).
The main result is a definition of the content of the preparatory stage, whose task is to eliminate the flaws of automatic segmentation: non-text elements, non-substantial text elements, and incorrect automatic detection of a text region or line. The prospects of lexicographic toloka (crowdsourcing) in the process of card recognition are outlined; it is envisaged that this will rely on collective access to the collection of transcribed documents in Transkribus. To recognize cards manually, and later to check and correct automatically recognized ones, readers can join the new project “All-Ukrainian Toloka: Archival Card Index”, an online platform on the “ACI” website.
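The kind of segmentation check described above can be illustrated with a minimal sketch. It is not the procedure used in the study. It assumes that the segmented cards are available as PAGE XML (the layout format Transkribus can export) in the 2013-07-15 schema version, and it flags two simple symptoms: text regions that contain no lines, and lines whose bounding box is suspiciously small. The function name, the size thresholds and the sample card are hypothetical and chosen only for illustration.

```python
import xml.etree.ElementTree as ET

# PAGE XML namespace; the schema version is an assumption about the export.
NS = {"pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15"}

# Hypothetical, hand-written sample of one segmented card.
SAMPLE_PAGE = """
<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page imageFilename="card_0001.jpg" imageWidth="1000" imageHeight="700">
    <TextRegion id="r1">
      <Coords points="50,50 950,50 950,400 50,400"/>
      <TextLine id="r1l1">
        <Coords points="60,60 900,60 900,120 60,120"/>
      </TextLine>
      <TextLine id="r1l2">
        <Coords points="60,130 70,130 70,136 60,136"/>
      </TextLine>
    </TextRegion>
    <TextRegion id="r2">
      <Coords points="50,500 300,500 300,650 50,650"/>
    </TextRegion>
  </Page>
</PcGts>
"""


def bounding_box(points):
    """Return (width, height) of a polygon given as 'x,y x,y ...'."""
    pairs = [tuple(int(v) for v in p.split(",")) for p in points.split()]
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    return max(xs) - min(xs), max(ys) - min(ys)


def find_segmentation_flaws(page_xml, min_width=30, min_height=15):
    """List (element id, reason) pairs for elements a human should review.

    The thresholds are illustrative guesses in pixels, not values from the study.
    """
    root = ET.fromstring(page_xml)
    flaws = []
    for region in root.iterfind(".//pc:TextRegion", NS):
        lines = region.findall("pc:TextLine", NS)
        if not lines:
            # A region with no lines is often a stamp, stain or other non-text element.
            flaws.append((region.get("id"), "region without lines"))
        for line in lines:
            coords = line.find("pc:Coords", NS)
            if coords is None:
                continue
            width, height = bounding_box(coords.get("points"))
            if width < min_width or height < min_height:
                flaws.append((line.get("id"), "suspiciously small line"))
    return flaws


if __name__ == "__main__":
    for element_id, reason in find_segmentation_flaws(SAMPLE_PAGE):
        print(element_id, reason)
```

A check like this only produces a review list. Deciding whether a flagged element is a non-text mark, a non-substantial note or a wrongly detected line still requires a person looking at the card image.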
References
Arkhivna kartoteka [Archival Card Index] (2018–2021). https://ak.iul-nasu.org.ua
Danli, R. (2018). Mashiny chitayut arkhivnye dokumenty: programmnoe obespechenie dlya raspoznavaniya rukopisnogo teksta [The machines read archival documents: handwriting recognition software]. Blog Natsionalnykh Arkhivov Velikobritanii. http://blog.nationalarchives.gov.uk/blog/machines-reading-the-archivehandwritten-text-recognition-software/ Cited from the translation: https://tsdea.archives.gov.ua/wp-content/uploads/2018/03/26032018_st.pdf
Krymskyi, A., Yefremov, S. (Ed.). (1924–1933). Rosiisko-ukrainskyi slovnyk [Russian-Ukrainian Dictionary]. Vol. I–III.
Pozdran, Yu. (2018). “Rosiisko-ukrainskyi slovnyk” za redaktsiieiu A.Yu.Krymskoho ta S.O.Yefremova v istoryko-linhvistychnomu konteksti [“Russian-Ukrainian Dictionary” edited by A. Yu. Krymsky and S. O. Yefremov in the historical-linguistic context].
Rosiisko-ukrainski slovnyky [Russian-Ukrainian Dictionaries] (2021). https://r2u.org.ua
Transkribus (2021). https://readcoop.eu/transkribus/
Tyshchenko, O. (2016). Arkhivna kartoteka yak leksyko-iliustratyvna baza “Rosiisko-ukrainskoho slovnyka” za red. A. Yu. Krymskoho ta S. O. Yefremova. I. Leksychna kartoteka: istoriia stvorennia ta represii; II. Mikro- ta makrostruktura arkhivnoi kartoteky [The archival card index as the lexical and illustrative base of the “Russian-Ukrainian Dictionary” ed. A. Krymsky and S. Yefremov. I. Lexical card index: history of creation and repression; II. Micro- and macrostructure of the archival lexical card index]. Ukrainska mova, 2, 44–71; 3, 57–78.
Tyshchenko, O. (2020). Arkhivna kartoteka ukrainskoi movy v tsyfrovomu formati: vid pamiatky movy do suchasnoho leksykohrafichnoho instrumentariiu [Archival card index of the Ukrainian language in digital format: from a language monument to modern lexicographic tools]. Rocznik Slawistyczny, LXIX, 185–197.
Useukrainska toloka: Arkhivna kartoteka [All-Ukrainian Toloka: Archival Card Index] (2020). http://work.iul-nasu.org.ua
License
Copyright (c) 2021 Oksana Tyshchenko
This work is licensed under a Creative Commons Attribution 4.0 International License.