First of all, DO NOT get someone else to finish recording for you! Synthetic voices made from a mixture of the speech of two people almost never sound good, even if you think the two people sound very much alike. It is okay to have someone else record a complete voice for you, but in that case, they should register their own account (as a “Proxy Voice Banker”) and record their own voice from the start.
As our technology has improved over the years, the amount of recording needed has decreased significantly, and it now depends on the inventory you choose to record. Our latest “Gen3” inventory consists of about 300 sentences that are designed to elicit more expressive speech that we can model with our new generative AI technology. We could easily create very realistic personal voices with fewer than half that number of recordings, but with the 300 sentences we are able to capture more of each individual’s expressive speech qualities. If you started with the “standard” inventory, you are welcome to switch to the Gen3 inventory at any time. A voice built from 300 of the standard sentences would lack the expressiveness of voices built from our Gen3 inventory.
Our older “standard” inventory has 3155 sentences, and while it is best to record all of them, that sometimes turns out to be too difficult. We will try to build a voice from as many recordings as you are able to complete. Our sentence material is ordered so that the most important material is recorded earliest. In studies we’ve run with these sentences, we have found the following to be a rough guide to the tradeoff between the number of sentences recorded and the intelligibility of the resulting TTS voice:
- 200 sentences: Using only the first 200 sentences, it is possible to get a voice that will work some of the time, but it will not generally be usable for communication, particularly with strangers.
- 400 sentences: Voices made with the first 400 sentences can be usable, but there will still be many words that are mangled and hard or impossible to understand. The prosody (speech timing and intonation) will be quite robotic. This is the smallest number of sentences we recommend attempting to use as a real TTS voice.
- 800 sentences: With 800 sentences recorded, the synthetic voice will be approaching its maximum intelligibility. That is, recording more sentences will probably improve the intelligibility of the voice only slightly. However, speech prosody will still be awkward and frequently sound incorrect. For example, questions are more likely to sound like statements, or statements like questions, because the intonation is not appropriate.
- 1600 sentences: As you go from 800 to 1600 sentences, the majority of the changes in voice quality will be changes in the naturalness of the speech. Sentences will more frequently sound like they have the correct rhythm and intonation. Effects like the way we indicate phrase and sentence boundaries will more often be correct.
- 3155 sentences: After the first 1600 sentences, nearly all of the remaining changes in voice quality will again be changes in the naturalness of the speech. Rhythm, intonation, and the marking of phrase and sentence boundaries will be correct more often still.
Note that the studies we conducted to determine these guidelines were run with voices created from speech recorded under studio conditions by American English-speaking voice talent. For speakers of other English dialects, speech recorded under less-than-ideal audio conditions, and speech recorded by talkers who are dysarthric or less able to produce exactly the correct sentences with a consistent speaking rate and style, these breakpoints are likely to be optimistic. Your experience may differ considerably.