Book chapter

Tokenizing on scale. Preprocessing large text corpora on the lexical and sentence level

When comparing different tools in the field of natural language processing (NLP), the quality of their results usually has first priority. This is also true for tokenization. In the context of large and diverse corpora for linguistic research purposes, however, other criteria also play a role – not least sufficient speed to process the data in an acceptable amount of time. In this paper we evaluate several state-of-the-art tokenization tools for German – including our own – with regard to these criteria. We conclude that while not all tools are applicable in this setting, no compromises regarding quality need to be made.

Creator: Diewald, Nils; Kupietz, Marc; Lüngen, Harald

License: Attribution - ShareAlike 4.0 International (CC BY-SA 4.0)

Language: English

Subject: Corpus <linguistics>
English, Old English

Event: Intellectual creation

(who): Diewald, Nils
Kupietz, Marc
Lüngen, Harald

Event: Publication

(who): Mannheim : IDS-Verlag
Mannheim : Leibniz-Institut für Deutsche Sprache (IDS)

(when): 2022-07-20

URN: urn:nbn:de:bsz:mh39-111464

Last update: 06.03.2025, 9:00 AM CET

Data provider

This object is provided by:
Leibniz-Institut für Deutsche Sprache - Bibliothek. If you have any questions about the object, please contact the data provider.

Object type

Book chapter

Associated

Diewald, Nils
Kupietz, Marc
Lüngen, Harald
Mannheim : IDS-Verlag
Mannheim : Leibniz-Institut für Deutsche Sprache (IDS)

Time of origin

2022-07-20

Other Objects (12)

Book chapter

Tokenizing on scale. Preprocessing large text corpora on the lexical and sentence level

Conference paper

CMC Corpora in DeReKo

Article

Das Deutsche Referenzkorpus DEREKO im Jubiläumsjahr 2014

Conference paper

Recent developments in DeReKo

Book chapter

IBK- und Social Media-Korpora am Leibniz-Institut für Deutsche Sprache

Article

Zum Nutzen von Korpusauszeichnungen für die Lexikographie

Conference paper

Igel: Comparing document grammars using XQuery

Book chapter

Zwischen Empirie und Ästhetik – Ansätze zur korpuslinguistischen Untersuchung und Bewertung von Sprachwandel

Book chapter

The German reference corpus DeReKo: new developments – new opportunities

Book chapter

The Morphosyntactic Annotation of DeReKo: Interpretation, Opportunities, and Pitfalls

Book chapter

Maximizing the potential of very large corpora: 50 years of big language data at IDS Mannheim

Article

DeReKo-Archiv jetzt mit fünf Milliarden Textwörtern
