Dissertation or habilitation thesis

Danish Stød and Automatic Speech Recognition

Stød is a prosodic feature of spoken Danish that can distinguish lexemes. This distinction can also identify word class and has the potential to improve the performance of automatic speech recognisers for spoken Danish. The manifestation of stød is highly variable and may be perceptual in nature, because in some cases stød can be heard yet is not visible in a spectrogram. This variability is the primary reason there is currently no agreed-upon acoustic or phonetic definition of stød. The working definitions of stød are “... a kind of creaky voice” (Grønnum, 2005) and “stød is not just creak” (Hansen, 2015). In the present work, we investigate whether stød can be exploited in automatic speech recognition (ASR). To exploit stød without an acoustic or phonetic definition, we need an (almost) zero-knowledge, data-driven approach, which rests on a number of assumptions that we investigate before conducting ASR experiments. We assume that stød can be detected in audio input using acoustic features. To detect stød, we need to identify features that signal stød, which requires annotated data. To select the right features, the stød annotation must be reliable and accurate. We therefore conduct a reliability study of stød annotation with inter-annotator agreement measures, rank acoustic features for stød detection by feature importance using a forest of randomised decision trees, and experiment with stød detection as a binary and as a multi-class classification task. The experiments identify a set of features important for stød detection and confirm that we can detect stød in audio. Lastly, we model stød in automatic speech recognition and show that significant improvements in word error rate can be gained simply by annotating stød in the phonetic dictionary, at the expense of decoding speed.
Extending the acoustic feature vectors with pitch-related features and other voice-quality features also gives significant performance improvements on both read-aloud and spontaneous speech. Extending the feature vectors also increases decoding speed, which then exceeds that of the baseline in which stød is not modelled.
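The abstract reports its recognition results as word error rate (WER). As a minimal sketch of how that metric is commonly computed, and not the scoring tool used in the thesis, the function below counts word-level substitutions, deletions and insertions with a Levenshtein alignment and divides by the number of reference words. The function name `word_error_rate` and the toy sentences are assumptions made for illustration only.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]  # i deletions turn i reference words into nothing
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


# Illustrative toy sentence: "hun" (she) and "hund" (dog) differ in stød,
# so confusing them costs one substitution out of four reference words.
print(word_error_rate("hun ser en hund", "hund ser en hund"))  # 0.25
```

Because insertions are counted against the reference length, WER can exceed 1.0 when the hypothesis is much longer than the reference.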

ISBN
9788793483132
Language
English

Bibliographic citation
Series: PhD Series ; No. 24.2016

Classification
Management

Event
Intellectual creation
(who)
Kirkedal, Andreas Søeborg
Event
Publication
(who)
Copenhagen Business School (CBS)
(where)
Frederiksberg
(when)
2016

Last update
10.03.2025, 11:41 AM CET

Data provider

This object is provided by:
ZBW - Deutsche Zentralbibliothek für Wirtschaftswissenschaften - Leibniz-Informationszentrum Wirtschaft. If you have any questions about the object, please contact the data provider.
