UNIVERSITY OF HERTFORDSHIRE
COMPUTER SCIENCE RESEARCH COLLOQUIUM

presents

"High Accuracy Recognition of Connected Script Languages in Old Printed Books"

Professor William Clocksin
(School of Computer Science, University of Hertfordshire, UK)

8 December 2010 (Wednesday)
Lecture Theatre A161
Hatfield, College Lane Campus
1-2 pm

Everyone is Welcome to Attend
Refreshments will be available

Abstract:

I will describe Quroyo, a high-performance system implemented in Java for the recognition of Syriac, Arabic and Mandaic script. The script sources are scans of old printed books that were typeset by hand between the 17th and early 20th centuries. These sources are plentiful -- some 20,000 of these books are in the British Library alone -- and scholars would like to have machine-readable transcriptions to aid research. Unlike previous optical character recognition systems, Quroyo is not based on a trainable classifier. No statistical language model is used, nor is a training set required. Instead, exemplars model the graphic appearance of each character in the alphabet. A contour matcher based on dynamic space warping finds the candidate matches of exemplar characters to deformed characters within each connected group of pixels. In connected script languages, the baseline extender provides important but unreliable context information. Without requiring any segmentation information, a knapsack decoder (similar to a Viterbi decoder) then finds an optimal sequence of characters within each connected group, and words are then detected by maximisation of spatial expectation. The numerous diacritic marks used in the target languages are decoded by an efficient hypergraph matching algorithm. Except for a few cases (approximately 1 in every 2000 characters) where characters are seriously broken by typesetting faults, the character recognition rate is 100% at a speed of about 4 characters per second.
The system has been tested on hundreds of pages of Syriac in all three of its ancient script forms, and on scores of pages of Modern Standard Arabic.

---------------------------------------------------
Hertfordshire Computer Science Research Colloquium
http://homepages.stca.herts.ac.uk/~nehaniv/colloq