Book chapter

OCR Nachkorrektur des Royal Society Corpus

We present an approach for the automatic detection and correction of OCR-induced misspellings in historical texts. The main objective is the post-correction of the digitized Royal Society Corpus, a set of historical documents from 1665 to 1869. Because of the age of the material, the OCR procedure made mistakes, leaving the files corrupted by thousands of misspellings. This motivates a post-processing step. The current correction technique is a pattern-based approach which, due to its lack of generalization, suffers from low recall. To generalize from the patterns, we propose to use the noisy channel model. From the pattern-based substitutions we train a corpus-specific error model, complemented with a language model. With an F1-score of 0.61, the presented technique significantly outperforms the pattern-based approach, which has an F1-score of 0.28. Due to its more accurate error model, it also outperforms other implementations of the noisy channel model.
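The noisy channel model named in the abstract picks, for an observed token o, the candidate word w that maximizes P(o | w) · P(w), where P(o | w) is the error model and P(w) is the language model. The following is only a minimal sketch of that idea, not the authors' implementation: the word counts, the substitution probabilities, and the function names (`channel_log_prob`, `language_log_prob`, `correct`) are all hypothetical, and the error model is restricted to same-length character substitutions for brevity.

```python
import math
from collections import Counter

# Hypothetical toy data; none of these values come from the paper.
# Unigram counts standing in for a language model P(w).
WORD_COUNTS = Counter({"same": 50, "fame": 5, "some": 40, "such": 30})
TOTAL = sum(WORD_COUNTS.values())

# Assumed substitution probabilities P(observed char | intended char),
# standing in for an error model derived from correction patterns.
# Key order is (intended, observed): a long s misread as "f".
SUB_PROBS = {("s", "f"): 0.2}
KEEP_PROB = 0.8      # assumed probability that a character is read correctly
DEFAULT_SUB = 0.001  # assumed probability of any other substitution


def channel_log_prob(observed: str, intended: str) -> float:
    """log P(observed | intended); substitutions only, not normalized."""
    if len(observed) != len(intended):
        return float("-inf")
    log_p = 0.0
    for obs_ch, int_ch in zip(observed, intended):
        if obs_ch == int_ch:
            log_p += math.log(KEEP_PROB)
        else:
            log_p += math.log(SUB_PROBS.get((int_ch, obs_ch), DEFAULT_SUB))
    return log_p


def language_log_prob(word: str) -> float:
    """log P(word) with add-one smoothing and one slot for unseen words."""
    return math.log((WORD_COUNTS[word] + 1) / (TOTAL + len(WORD_COUNTS) + 1))


def correct(observed: str) -> str:
    """Return the candidate maximizing log P(observed | w) + log P(w)."""
    candidates = set(WORD_COUNTS) | {observed}
    return max(
        candidates,
        key=lambda w: channel_log_prob(observed, w) + language_log_prob(w),
    )


if __name__ == "__main__":
    for token in ["fuch", "fome", "such"]:
        print(token, "->", correct(token))
```

In this toy setup the misread token "fuch" is mapped to "such" because the assumed s-to-f substitution is cheap and "such" is frequent, while a correctly read token is left unchanged. A realistic system would also need insertions, deletions, and a context-aware language model.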

Creators: Klaus, Carsten; Fankhauser, Peter; Klakow, Dietrich

Attribution - NoDerivatives 4.0 International

Language
English

Subject
OCR font
Correction
Natural language processing
Digital Humanities
Language

Event
Intellectual creation
(who)
Klaus, Carsten
Fankhauser, Peter
Klakow, Dietrich
Event
Publication
(who)
Frankfurt am Main : Zenodo
(when)
2019-02-27

URN
urn:nbn:de:bsz:mh39-85353
Last updated
06.03.2025, 09:00 CET

Data partner

This object is provided by:
Leibniz-Institut für Deutsche Sprache - Bibliothek. If you have questions about the object, please contact the data partner.

