Article

Building linguistic corpora from Wikipedia articles and discussions

Wikipedia is a valuable resource that can serve as a linguistic corpus or as a dataset for many kinds of research. We built corpora from Wikipedia articles and talk pages in the I5 format, a TEI customisation used in the German Reference Corpus (Deutsches Referenzkorpus, DeReKo). Our approach is a two-stage conversion that combines parsing with the Sweble parser and transformation with XSLT stylesheets. The conversion generates rich and valid corpora regardless of language. We also introduce a method for segmenting user contributions on talk pages into postings.

Creator: Margaretha, Eliza; Lüngen, Harald

Attribution 4.0 International

Language: German

Subject: Wikipedia
Corpus <linguistics>
Computational linguistics
Germanic languages; German

Event: Intellectual creation

(who): Margaretha, Eliza
Lüngen, Harald

Event: Publication

(when): 2014-12-16

URN: urn:nbn:de:bsz:mh39-33306

Last update: 06.03.2025, 9:00 AM CET

Data provider

This object is provided by:
Leibniz-Institut für Deutsche Sprache - Bibliothek. If you have any questions about the object, please contact the data provider.


Object type

Article

Associated

Margaretha, Eliza
Lüngen, Harald

Time of origin

2014-12-16

Other Objects (12)

Building linguistic corpora from Wikipedia articles and discussions

Article

DeReKo-Archiv jetzt mit fünf Milliarden Textwörtern

Conference paper

Mining corpora of computer-mediated communication: analysis of linguistic features in Wikipedia talk pages using machine learning methods

Mining corpora of computer-mediated communication: Analysis of linguistic features in Wikipedia talk pages using machine learning methods

Article

A TEI P5 Document Grammar for the IDS Text Model

Article

DEREKO - Das Deutsche Referenzkorpus. Schriftkorpora der deutschen Gegenwartssprache am Institut für Deutsche Sprache in Mannheim

Two-dimensional moving image

Interview mit Dr. Harald Lüngen zu den Wikipedia-Korpora in DeReKo

Book chapter

Reply relations in CMC: types and annotation

Book chapter

Linguistische Annotationen für die Analyse von Gliederungsstrukturen wissenschaftlicher Texte

Conference paper

Recent developments in DeReKo

Book chapter

Zur Erstellung und Interpretation der Zeitverlaufsgrafiken


