Conference paper

Improving Sentence Boundary Detection for Spoken Language Transcripts

This paper presents experiments on sentence boundary detection in transcripts of spoken dialogues. Segmenting spoken language into sentence-like units is challenging because of disfluencies, ungrammatical or fragmented structures, and the lack of punctuation. In addition, one of the main bottlenecks for many NLP applications for spoken language is the small size of the training data, as transcribing and annotating spoken language is far more time-consuming and labour-intensive than processing written language. We therefore investigate the benefits of data expansion and transfer learning and test different ML architectures for this task. Our results show that data expansion is not straightforward and that even data from the same domain does not always improve results. They also highlight the importance of modelling, i.e. of finding the best architecture and data representation for the task at hand. For the detection of boundaries in spoken language transcripts, we achieve a substantial improvement when framing the boundary detection problem as a sentence pair classification task rather than as a sequence tagging task.
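The difference between the two framings named in the abstract can be illustrated with a small data-preparation sketch. This is only one plausible construction under stated assumptions, not the authors' pipeline: the abstract does not say how instances are built, so the function names, the "B"/"O" label set, the fixed context window, and the toy utterances are all hypothetical.

```python
from typing import List, Tuple


def to_tagging_instance(units: List[List[str]]) -> Tuple[List[str], List[str]]:
    """Sequence tagging view: one label per token.

    The last token of each sentence-like unit gets "B" (a boundary follows),
    every other token gets "O". The label names are an assumption.
    """
    tokens: List[str] = []
    labels: List[str] = []
    for unit in units:
        for i, tok in enumerate(unit):
            tokens.append(tok)
            labels.append("B" if i == len(unit) - 1 else "O")
    return tokens, labels


def to_pair_instances(
    tokens: List[str], labels: List[str], window: int = 3
) -> List[Tuple[str, str, int]]:
    """Pair classification view: one instance per gap between two tokens.

    Each instance pairs up to `window` tokens of left context with up to
    `window` tokens of right context; the label is 1 if the gap is a boundary.
    The windowing scheme is an illustrative assumption.
    """
    pairs: List[Tuple[str, str, int]] = []
    for i in range(len(tokens) - 1):
        left = " ".join(tokens[max(0, i - window + 1): i + 1])
        right = " ".join(tokens[i + 1: i + 1 + window])
        pairs.append((left, right, 1 if labels[i] == "B" else 0))
    return pairs


# Toy transcript with two sentence-like units (invented for illustration).
units = [["yeah", "i", "think", "so"], ["uh", "what", "about", "you"]]
tokens, labels = to_tagging_instance(units)
pairs = to_pair_instances(tokens, labels, window=3)

for left, right, label in pairs:
    print(f"{label}  [{left}] | [{right}]")
```

In the tagging view a model labels the whole token sequence at once, whereas in the pair view each candidate gap becomes a separate binary decision over two text segments, which is the input shape that sentence pair classifiers expect.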

Creator: Rehbein, Ines; Ruppenhofer, Josef; Schmidt, Thomas

Attribution-NonCommercial 4.0 International

Language
English

Subject
Automatic speech recognition
Spoken language
Corpus <linguistics>
Sentence end
Machine learning
Language

Event
Intellectual creation
(who)
Rehbein, Ines
Ruppenhofer, Josef
Schmidt, Thomas
Event
Publication
(who)
Paris : European Language Resources Association
(when)
2020-05-19

URN
urn:nbn:de:bsz:mh39-98382
Last updated
6 March 2025, 09:00 CET

Data partner

This object is provided by:
Leibniz-Institut für Deutsche Sprache - Bibliothek. If you have questions about the object, please contact the data partner.

Object type

  • Conference paper

Contributors

  • Rehbein, Ines
  • Ruppenhofer, Josef
  • Schmidt, Thomas
  • Paris : European Language Resources Association

Created

  • 2020-05-19
