Book chapter

Maximizing the potential of very large corpora: 50 years of big language data at IDS Mannheim

Very large corpora have been built and used at the IDS since its foundation in 1964. Since the early 1990s they have been available on the Internet, currently to over 30,000 researchers worldwide. The Institute provides the largest archive of written German (Deutsches Referenzkorpus, DeReKo), which has recently been extended to 24 billion words. DeReKo has been managed and analysed by engines known as COSMAS and, subsequently, COSMAS II, which is currently being replaced by a new, scalable analysis platform called KorAP. KorAP makes it possible to manage and analyse texts that are accompanied by multiple, potentially conflicting, grammatical and structural annotation layers, and it can handle resources that are distributed across different, possibly geographically distant, storage systems. The majority of texts in DeReKo are not licensed for free redistribution; hence, the COSMAS and KorAP systems offer technical solutions to facilitate research on very large corpora that are not available (and not suitable) for download. For the new KorAP system, it is also planned to provide sandboxed environments that support non-remote-API access “near the data”, through which users can run their own analysis programs.

Authors: Kupietz, Marc; Lüngen, Harald; Bański, Piotr; Belica, Cyril

Attribution - NonCommercial 4.0 International

Language
English

Subject
German
Corpus <Linguistics>
Text corpus
Germanic languages; German

Event
Intellectual creation
(who)
Kupietz, Marc
Lüngen, Harald
Bański, Piotr
Belica, Cyril
Event
Publication
(who)
Reykjavik : ELRA
(when)
2014-10-24

URN
urn:nbn:de:bsz:mh39-31634
Last update
06.03.2025, 9:00 AM CET

Data provider

This object is provided by:
Leibniz-Institut für Deutsche Sprache - Bibliothek. If you have any questions about the object, please contact the data provider.

Object type

  • Book chapter

Associated

  • Kupietz, Marc
  • Lüngen, Harald
  • Bański, Piotr
  • Belica, Cyril
  • Reykjavik : ELRA

Time of origin

  • 2014-10-24
