Book chapter

OCR Nachkorrektur des Royal Society Corpus

We present an approach for the automatic detection and correction of OCR-induced misspellings in historical texts. The main objective is the post-correction of the digitized Royal Society Corpus, a set of historical documents from 1665 to 1869. Because of the age of the source material, the OCR procedure made many mistakes, leaving the files corrupted by thousands of misspellings, which motivates a post-processing step. The current correction technique is pattern-based and, because it does not generalize, suffers from poor recall. To generalize from the patterns, we propose to use the noisy channel model: from the pattern-based substitutions we train a corpus-specific error model and complement it with a language model. With an F1-score of 0.61, the presented technique significantly outperforms the pattern-based approach, which has an F1-score of 0.28. Owing to its more accurate error model, it also outperforms other implementations of the noisy channel model.
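
The noisy channel idea named in the abstract can be illustrated with a small sketch: each candidate correction is scored by a language model probability times an error model probability of the observed OCR string given that candidate. The abstract does not specify the authors' actual models, so everything below is an assumption made for illustration: a unigram language model with add-one smoothing, a character-level substitution-only error model, same-length candidates, and the invented toy data and function names (`train_error_model`, `train_language_model`, `correct`).

```python
import math
from collections import Counter

ALPHABET_SIZE = 26  # assumption: lowercase a-z only

def train_error_model(aligned_pairs, smoothing=0.01):
    """Estimate P(observed_char | intended_char) from equal-length word pairs."""
    pair_counts, intended_totals = Counter(), Counter()
    for intended, observed in aligned_pairs:
        for i, o in zip(intended, observed):
            pair_counts[(i, o)] += 1
            intended_totals[i] += 1

    def prob(observed_char, intended_char):
        # Additive smoothing so unseen substitutions keep a small probability.
        num = pair_counts[(intended_char, observed_char)] + smoothing
        den = intended_totals[intended_char] + smoothing * ALPHABET_SIZE
        return num / den
    return prob

def train_language_model(words):
    """Unigram model with add-one smoothing; one extra slot covers unknown words."""
    counts = Counter(words)
    total, vocab_size = sum(counts.values()), len(counts)

    def prob(word):
        return (counts[word] + 1) / (total + vocab_size + 1)
    return prob, set(counts)

def correct(observed, vocabulary, lm_prob, error_prob):
    """Return argmax over candidates c of log P(c) + log P(observed | c)."""
    # Simplification: only same-length candidates, plus the observed word itself.
    candidates = {w for w in vocabulary if len(w) == len(observed)} | {observed}

    def score(candidate):
        channel = sum(math.log(error_prob(o, i)) for i, o in zip(candidate, observed))
        return math.log(lm_prob(candidate)) + channel
    return max(candidates, key=score)

# Invented toy data: (intended, observed) pairs such as pattern-based
# substitutions might yield, and a tiny "clean" text for the language model.
PAIRS = [("same", "fame"), ("substance", "fubftance"), ("vessel", "veffel"),
         ("first", "first"), ("found", "found")]
CLEAN = "the same substance was found in the first vessel and the second vessel".split()

error_prob = train_error_model(PAIRS)
lm_prob, vocabulary = train_language_model(CLEAN)

print(correct("veffel", vocabulary, lm_prob, error_prob))  # vessel
print(correct("found", vocabulary, lm_prob, error_prob))   # found
```

In this toy setup the error model learns that an intended "s" is frequently observed as "f", so "veffel" is mapped to "vessel", while "found" is left unchanged because the language model and the identity substitutions both favour it. A realistic system would also need insertions, deletions, multi-character confusions, and a context-sensitive language model.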

Author(s): Klaus, Carsten; Fankhauser, Peter; Klakow, Dietrich

Attribution-NoDerivatives 4.0 International (CC BY-ND 4.0)

Language: English

Subject: OCR font
Correction
Natural language processing
Digital Humanities
Language

Event: Intellectual creation

(who): Klaus, Carsten
Fankhauser, Peter
Klakow, Dietrich

Event: Publication

(who): Frankfurt am Main : Zenodo

(when): 2019-02-27

URN: urn:nbn:de:bsz:mh39-85353

Last updated: 06.03.2025, 09:00 CET

Data partner

This object is provided by:
Leibniz-Institut für Deutsche Sprache - Bibliothek. If you have questions about the object, please contact the data partner.

Similar objects (12)

SCyDia – OCR for Serbian Cyrillic with diacritics (book chapter)
OCR Nachkorrektur des Royal Society Corpus
Corpus REDEWIEDERGABE (book chapter)
Das Mannheimer Corpus (book chapter)
CorpusExplorer v2.0 – Visualisierung prozessorientiert gestalten (book chapter)
Transcription Bottleneck of Speech Corpus Exploitation (book chapter)
Trendi - a monitor corpus of Slovene (book chapter)
Argument omissions in multiple German corpora (book chapter)
Extracting specialized terminology from linguistic corpora (book chapter)
"Corpus-Driven": Linguistische Interpretation von Kookkurrenzbeziehungen (book chapter)
Grammar and Corpora – past, present, and future (book chapter)
Corpus Query Lingua Franca part II: Ontology (book chapter)
