Conference paper

Organizing corpora at the Stanford Literary Lab. Balancing simplicity and flexibility in metadata management

This article describes a series of ongoing efforts at the Stanford Literary Lab to manage a large collection of literary corpora (~40 billion words). This work is marked by a tension between two competing requirements: the corpora need to be merged into higher-order collections that can be analyzed as units, but at the same time it is necessary to preserve granular access to the original metadata and relational organization of each individual corpus. We describe a set of data management practices that try to accommodate both of these requirements: Apache Spark is used to index data as Parquet tables on an HPC cluster at Stanford. Crucially, the approach distinguishes between what we call “canonical” and “combined” corpora, a variation on the well-established notion of a “virtual corpus” (Kupietz et al., 2014; Jakubíček et al., 2014; van Uytvanck, 2010).

Author: McClure, David; Algee-Hewitt, Mark; Douris, Steele; Fredner, Erik; Walser, Hannah

Attribution - NonCommercial - NoDerivatives 4.0 International

Language: English

Subject: Corpus <linguistics>
English
Text technology
Data management
Metadata
Language

Event: Intellectual creation

(who): McClure, David
Algee-Hewitt, Mark
Douris, Steele
Fredner, Erik
Walser, Hannah

Event: Publication

(who): Mannheim : Institut für Deutsche Sprache

(when): 2017-07-05

URN: urn:nbn:de:bsz:mh39-62617

Last update: 06.03.2025, 9:00 AM CET

Data provider

This object is provided by:
Leibniz-Institut für Deutsche Sprache - Bibliothek. If you have any questions about the object, please contact the data provider.
