Book chapter
Tokenizing on scale. Preprocessing large text corpora on the lexical and sentence level
When comparing different tools in the field of natural language processing (NLP), the quality of their results usually takes first priority. This also holds for tokenization. In the context of large and diverse corpora for linguistic research purposes, however, other criteria play a role as well – not least sufficient speed to process the data in an acceptable amount of time. In this paper we evaluate several state-of-the-art tokenization tools for German – including our own – with regard to these criteria. We conclude that while not all tools are applicable in this setting, no compromises regarding quality need to be made.
Language
- English
Subject
- Corpus <linguistics>
- English, Old English
Event
- Intellectual creation
(who)
- Diewald, Nils
- Kupietz, Marc
- Lüngen, Harald
Event
- Publication
(who)
- Mannheim : IDS-Verlag
- Mannheim : Leibniz-Institut für Deutsche Sprache (IDS)
(when)
- 2022-07-20
URN
- urn:nbn:de:bsz:mh39-111464
Last update
- 06.03.2025, 9:00 AM CET
Data provider
- Leibniz-Institut für Deutsche Sprache - Bibliothek