Linguistics and TTS
Information on Linguistics and TTS
Intro
Why solving TTS is difficult
A fundamental reason text-to-speech is difficult is the one-to-many problem: a single text string corresponds to an infinite number of possible acoustic realizations!
Rime is fairly opinionated about its default acoustic realizations, but the API also lets you make very low-level adjustments for your particular use case.
Homographs
Written English, like many written languages, contains homographs. A homograph is a word that shares its written form with another word but has a different meaning, a different pronunciation, or both. For the purposes of the Rime TTS API, we are only concerned with differences in pronunciation. For example, the words sow (verb, to plant seed) and sow (noun, a female pig) are spelled identically but pronounced differently. This poses a problem for any speech synthesis technology, but Rime uses both contextual syntactic analysis and frequency information to predict the most likely pronunciation.
We also provide a straightforward way of overriding these predicted pronunciations:
A complete table of the homographs for which you can specify a disambiguated variant can be found in the Appendix.
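To make the idea of combining syntactic context with frequency information concrete, here is a toy sketch in Python. It is not Rime's implementation: the lexicon, the context rule, the function names, and the frequency counts are all hypothetical placeholders chosen for illustration, and the pronunciations are written in ARPAbet-style symbols.

```python
# Toy homograph disambiguation: try a context rule first,
# then fall back to the most frequent variant.
# All data below is illustrative and not taken from Rime.

# Hypothetical lexicon: part of speech -> ARPAbet-style pronunciation.
LEXICON = {
    "sow": {"verb": "S OW", "noun": "S AW"},
}

# Placeholder counts standing in for corpus frequencies (made up for the demo).
FREQUENCY = {
    "sow": {"verb": 60, "noun": 40},
}

DETERMINERS = {"a", "an", "the", "this", "that"}


def guess_pos(previous_word):
    """Very rough syntactic cue: 'to sow' is a verb, 'the sow' is a noun."""
    if previous_word == "to":
        return "verb"
    if previous_word in DETERMINERS:
        return "noun"
    return None


def pronounce(words, index):
    """Pick a pronunciation for the homograph at words[index]."""
    word = words[index].lower()
    variants = LEXICON[word]
    previous_word = words[index - 1].lower() if index > 0 else None
    pos = guess_pos(previous_word)
    if pos not in variants:
        # No usable context: fall back to the most frequent variant.
        counts = FREQUENCY[word]
        pos = max(counts, key=counts.get)
    return variants[pos]
```

A production system would rely on a real part-of-speech tagger and measured frequencies; the sketch only shows how the two signals can be layered.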
Punctuation
Punctuation serves many purposes in normal writing. It marks sentence structure, such as sentence breaks and questions, and it also gives pronunciation cues, such as commas for pauses and exclamation points for excitement.
In Rime text-to-speech, these uses are even more flexible. Users can employ punctuation not only for its traditional structural purposes but also to modulate prosody by varying the punctuation. Below are some basic ways the engine alters prosody in response to punctuation. These are just a few examples, so feel free to play around and see what you can create!
Questions
Sentence | Notes |
---|---|
what do you mean. | a simple period at the end of the sentence renders it a non-question |
what do you mean? | a simple question mark indicates an unmarked question |
what do you mean?! | adding an exclamation point makes the question more excited |
what do you mean!? | changing the order of the exclamation point and question mark makes a different sort of question |
what do you mean?? | multiple question marks can also change the type of question prosody |
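One convenient way to audition these options is to generate every variant of a sentence and synthesize each in turn. The sketch below only builds the text strings; `punctuation_variants` and `QUESTION_ENDINGS` are hypothetical names, and sending the strings to the TTS API is left out.

```python
# Endings taken from the table above.
QUESTION_ENDINGS = [".", "?", "?!", "!?", "??"]


def punctuation_variants(sentence, endings=QUESTION_ENDINGS):
    """Return the sentence once per ending, replacing any existing final punctuation."""
    base = sentence.rstrip(".?! ")
    return [base + ending for ending in endings]


for text in punctuation_variants("what do you mean"):
    print(text)
```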
False Starts
Sentence | Notes |
---|---|
i i think it’s pretty cool | putting a word twice in a row can create more realistic, flawed human speech |
i- i think it’s pretty cool | adding a dash immediately after some words can give a cut-off, false start sort of realism |
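If you generate text programmatically, a false start can be added with a small preprocessing step before synthesis. This is a minimal sketch; `add_false_start` is a hypothetical helper, not part of the Rime API.

```python
def add_false_start(sentence, cut_off=True):
    """Repeat the first word, optionally with a trailing dash for a cut-off effect."""
    first = sentence.split(" ", 1)[0]
    if not first:
        return sentence  # nothing to repeat
    repeated = first + "-" if cut_off else first
    return f"{repeated} {sentence}"
```

For example, `add_false_start("i think it's pretty cool")` produces the dashed form shown in the table, and passing `cut_off=False` produces the plain repetition.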
Pauses
Sentence | Notes |
---|---|
so it’s kind of funny. | without any comma, there will be no pause |
so, it’s kind of funny. | adding a comma creates a slight pause |
so. it’s kind of funny | adding a period creates a longer pause |
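The same pattern can be expressed in code when you want to control pause length from a parameter. In this sketch, `PAUSE_MARKS` and `pause_after_first_word` are hypothetical names, and the mapping simply mirrors the three rows of the table.

```python
# Punctuation to place after the first word, keyed by desired pause length.
PAUSE_MARKS = {"none": "", "short": ",", "long": "."}


def pause_after_first_word(sentence, pause="short"):
    """Insert the punctuation for the requested pause after the first word."""
    first, separator, rest = sentence.partition(" ")
    if not separator:
        return sentence  # single word: nowhere to insert a pause
    return f"{first}{PAUSE_MARKS[pause]} {rest}"
```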