Conference paper

A harmonised testsuite for POS tagging of German social media data

We present a testsuite for POS tagging German web data. Our testsuite provides the original raw text as well as the gold tokenisations and is annotated for parts-of-speech. The testsuite includes a new dataset for German tweets, with a current size of 3,940 tokens. To increase the size of the data, we harmonised the annotations in already existing web corpora, based on the Stuttgart-Tübingen Tag Set. The current version of the corpus has an overall size of 48,344 tokens of web data, around half of it from Twitter. We also present experiments, showing how different experimental setups (training set size, additional out-of-domain training data, self-training) influence the accuracy of the taggers. All resources and models will be made publicly available to the research community.


Authors: Rehbein, Ines; Ruppenhofer, Josef; Zimmermann, Victor

Copyright protection


Language
English

Subject
Corpus <linguistics>
German
Social software
Language

Event
Intellectual creation
(who)
Rehbein, Ines
Ruppenhofer, Josef
Zimmermann, Victor
Event
Publication
(who)
Vienna, Austria : Austrian academy of sciences
(when)
2018-09-20

URN
urn:nbn:de:bsz:mh39-79318
Last updated
2025-03-06, 09:00 CET

Data partner

This object is provided by:
Leibniz-Institut für Deutsche Sprache - Bibliothek. If you have questions about this object, please contact the data partner.

Object type

  • Conference paper

Participants

  • Rehbein, Ines
  • Ruppenhofer, Josef
  • Zimmermann, Victor
  • Vienna, Austria : Austrian academy of sciences

Created

  • 2018-09-20
