Conference paper

Organizing corpora at the Stanford Literary Lab. Balancing simplicity and flexibility in metadata management

This article describes a series of ongoing efforts at the Stanford Literary Lab to manage a large collection of literary corpora (~40 billion words). This work is marked by a tension between two competing requirements: the corpora need to be merged into higher-order collections that can be analyzed as units, but it is also necessary to preserve granular access to the original metadata and relational organization of each individual corpus. We describe a set of data management practices that try to accommodate both requirements, in which Apache Spark is used to index data as Parquet tables on an HPC cluster at Stanford. Crucially, the approach distinguishes between what we call “canonical” and “combined” corpora, a variation on the well-established notion of a “virtual corpus” (Kupietz et al., 2014; Jakubíček et al., 2014; van Uytvanck, 2010).

Authors: McClure, David; Algee-Hewitt, Mark; Douris, Steele; Fredner, Erik; Walser, Hannah

Attribution - NonCommercial - NoDerivatives 4.0 International

Language
English

Subject
Corpus <linguistics>
English
Text technology
Data management
Metadata
Language

Event
Intellectual creation
(who)
McClure, David
Algee-Hewitt, Mark
Douris, Steele
Fredner, Erik
Walser, Hannah
Event
Publication
(who)
Mannheim : Institut für Deutsche Sprache
(when)
2017-07-05

URN
urn:nbn:de:bsz:mh39-62617
Last updated
2025-03-06, 09:00 CET

Data partner

This object is provided by:
Leibniz-Institut für Deutsche Sprache - Bibliothek. For questions about this object, please contact the data partner.

Object type

  • Conference paper

Contributors

  • McClure, David
  • Algee-Hewitt, Mark
  • Douris, Steele
  • Fredner, Erik
  • Walser, Hannah
  • Mannheim : Institut für Deutsche Sprache

Created

  • 2017-07-05
