Symbolic Phonetic Features for Modeling of Pronunciation Variation
A significant source of variation in spontaneous speech is intra-speaker pronunciation change, often realized as small feature changes (e.g., nasalized vowels or affricated stops) rather than full phone transformations. Previous computational modeling of pronunciation variation has typically involved transformations from one phone to another, in part because most speech processing systems use phone-based units. Here, a phonetic-feature-based prediction model is presented in which each phone is represented by a vector of symbolic features, each of which can be on, off, unspecified, or unused. Feature interaction is examined using different groupings of possibly dependent features; a hierarchical grouping with conditional dependencies gives the best results. Feature-based models are shown to be more efficient than phone-based models, in the sense that they require fewer parameters to predict variation while giving smaller distance and perplexity values when predictions are compared to the hand-labeled reference. A parsimonious model is better suited to incorporating new conditioning factors, and this work investigates high-level information sources, including both text (syntax, discourse) and prosody cues. Experiments show that feature-based models benefit from prosody cues but not from text cues, and that phone-based models do not benefit from any of the high-level cues explored here.
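The four-valued feature representation described in the abstract can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration: the feature inventory (`voiced`, `nasal`, `continuant`, `high`), the helper `make_phone`, and the mismatch-count `feature_distance` are hypothetical and are not the feature set or the distance measure used in the paper.

```python
from enum import Enum


class FeatureValue(Enum):
    """The four symbolic values a feature can take, per the abstract."""
    ON = "+"
    OFF = "-"
    UNSPECIFIED = "?"
    UNUSED = "x"


# Hypothetical feature inventory, chosen only for illustration.
FEATURES = ("voiced", "nasal", "continuant", "high")


def make_phone(**values):
    """Build a feature vector; features not named default to UNUSED."""
    unknown = set(values) - set(FEATURES)
    if unknown:
        raise ValueError(f"unknown features: {sorted(unknown)}")
    return tuple(values.get(name, FeatureValue.UNUSED) for name in FEATURES)


def feature_distance(a, b):
    """Count the positions at which two feature vectors differ.

    This is a simple Hamming-style count, assumed here for illustration;
    it is not claimed to be the distance used in the paper.
    """
    return sum(1 for x, y in zip(a, b) if x != y)


ON, OFF = FeatureValue.ON, FeatureValue.OFF

# An oral vowel and its nasalized variant differ in a single feature.
oral_vowel = make_phone(voiced=ON, nasal=OFF, continuant=ON, high=OFF)
nasalized_vowel = make_phone(voiced=ON, nasal=ON, continuant=ON, high=OFF)

# A nasal stop leaves the vowel-height feature unused.
nasal_stop = make_phone(voiced=ON, nasal=ON, continuant=OFF)

print(feature_distance(oral_vowel, nasalized_vowel))
print(feature_distance(oral_vowel, nasal_stop))
```

In this toy setup, nasalization changes one entry of the vector, whereas a phone-based model would have to treat the nasalized vowel as an entirely separate unit. That is the intuition behind modeling small feature changes rather than whole-phone substitutions.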
Bates, R., Ostendorf, M., & Wright, R. (2007). Symbolic phonetic features for modeling of pronunciation variation. Speech Communication, 49(2), 83-97. doi:10.1016/j.specom.2006.10.007
Publisher's Copyright and Source
Copyright © 2006 Elsevier B.V. Article published by Elsevier in Speech Communication, volume 49, issue 2, February 2007, pages 83-97. Available online at https://doi.org/10.1016/j.specom.2006.10.007.