Conference paper
Automatic Classification by Topic Domain for Meta Data Generation, Web Corpus Evaluation, and Corpus Comparison
In this paper, we describe preliminary results from an ongoing experiment in which we classify two large unstructured text corpora, a web corpus and a newspaper corpus, by topic domain (or subject area). Our primary goal is to develop a method that allows for the reliable annotation of large crawled web corpora with meta data required by many corpus linguists. We are especially interested in designing an annotation scheme whose categories are both intuitively interpretable by linguists and firmly rooted in the distribution of lexical material in the documents. Since we use data from a web corpus and a more traditional corpus, we also contribute to the important field of corpus comparison and corpus evaluation. Technically, we use (unsupervised) topic modeling to automatically induce topic distributions over gold standard corpora that were manually annotated for 13 coarse-grained topic domains. In a second step, we apply supervised machine learning to learn the manually annotated topic domains using the previously induced topics as features. We achieve around 70% accuracy in 10-fold cross-validation. An analysis of the errors clearly indicates, however, that a revised classification scheme and larger gold standard corpora will likely lead to a substantial increase in accuracy.
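The second step and its evaluation, as summarized in the abstract, can be sketched roughly as follows. This is a minimal illustration only, not the authors' implementation: the topic induction step is not shown, the per-document topic proportions are synthetic toy data, the nearest-centroid classifier is a simple stand-in for whatever supervised learner the paper uses, and all function names (`make_toy_data`, `centroids`, `predict`, `cross_validate`) are hypothetical.

```python
import random


def make_toy_data(n_per_class=20, n_topics=4, seed=0):
    """Synthetic stand-in for induced topic proportions; not data from the paper."""
    rng = random.Random(seed)
    docs = []
    for label in range(n_topics):
        for _ in range(n_per_class):
            weights = [rng.random() for _ in range(n_topics)]
            weights[label] += 2.0  # assumption: each toy domain leans on one topic
            total = sum(weights)
            docs.append(([w / total for w in weights], label))
    rng.shuffle(docs)
    return docs


def centroids(train):
    """Mean topic vector per annotated domain, computed on the training folds."""
    sums, counts = {}, {}
    for vec, label in train:
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, value in enumerate(vec):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {lab: [s / counts[lab] for s in acc] for lab, acc in sums.items()}


def predict(cents, vec):
    """Assign the domain whose centroid is closest in squared Euclidean distance."""
    def dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(centroid, vec))
    return min(cents, key=lambda lab: dist(cents[lab]))


def cross_validate(docs, k=10):
    """k-fold cross-validation: every document is tested exactly once."""
    accuracies = []
    for fold in range(k):
        test = docs[fold::k]
        train = [d for i, d in enumerate(docs) if i % k != fold]
        cents = centroids(train)
        correct = sum(predict(cents, vec) == label for vec, label in test)
        accuracies.append(correct / len(test))
    return accuracies


docs = make_toy_data()
scores = cross_validate(docs, k=10)
print(f"mean accuracy over {len(scores)} folds: {sum(scores) / len(scores):.2f}")
```

The toy data are deliberately easy to separate, so the accuracy printed here says nothing about the roughly 70% reported in the abstract; the sketch only shows how topic proportions can serve as features and how the folds are formed.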
- Language: English
- Subject: Corpus <linguistics>; Text linguistics; Annotation; Linguistics
- Event: Intellectual creation
- (who): Schäfer, Roland; Bildhauer, Felix
- Event: Publication
- (who): Berlin : Association for Computational Linguistics
- (when): 2016-09-26
- URN: urn:nbn:de:bsz:mh39-52979
- Last updated: 06.03.2025, 09:00 CET
Data partner
Leibniz-Institut für Deutsche Sprache - Bibliothek. For questions about this object, please contact the data partner.
Object type
- Conference paper
Contributors
- Schäfer, Roland
- Bildhauer, Felix
- Berlin : Association for Computational Linguistics
Created
- 2016-09-26