Auditioning Your Voice
After you have finished recording, we build several versions of your synthetic voice and present them for you to audition. The audition process lets you use some of the same techniques we use in the lab to decide which version of your voice you personally prefer. There are two reasons for doing this. First, because your concept of how your synthetic voice should sound might differ from ours, we believe it is important for you to have a say in selecting the voice and setting some of its controls. Second, now that we charge clients for their final voice, we want to give you an opportunity to hear the voice and be certain you want to use it before you buy it.
As of October 15, 2023, we are introducing a new type of TTS voice based on our third generation of ModelTalker TTS technology, which we refer to as Gen3 voices. Gen3 voices are based on AI technology and generally do an excellent job of capturing your vocal identity, including the sound of your voice as well as the typical flow of your speech, though they sometimes have a buzzy quality we are working to improve. In addition, we continue to provide examples of your voice as rendered by our first generation (Gen1) TTS technology (also called unit selection voices) and our second generation (Gen2) TTS technology (also called hybrid DNN voices).
Another change to our audition process is that we now preselect which voice versions to present to each client, based on our initial listening to multiple versions of your voice. In the audition, you might hear as few as two or as many as six different versions of your voice, depending on how well the different versions have come out. This will, in turn, affect the number of trials you experience in the second (voice comparison) audition step described below.
There are three steps in the process of selecting your voice:
- Set synthesis parameters
- Compare the different voice versions
- Make a final choice
In the following, we will give you a little more detail regarding each step. When you are ready, click the Begin button below to start Step 1.
1. Set synthesis parameters
We provide controls for three voice parameters: speaking rate, sentence intonation, and syllable timing. Speaking rate is the overall rate at which the voice speaks and can be adjusted from very slow to very fast. This setting applies to all voice versions. By default, your ModelTalker voice will speak at about the same rate that you spoke when recording the speech inventory.

Sentence intonation and syllable timing are settings that apply only to our Gen1 and Gen2 voices. Intonation refers to the tonal pattern of your voice, for example, whether the pitch rises (as with some questions) or falls (as with most statements) at the end of a sentence. For Gen1 and Gen2 voices, ModelTalker attempts to find examples of your recorded speech that match the best intonation pattern for a sentence, but it does not attempt to modify the intonation of the speech, so sentences may sometimes have inappropriate or disjointed intonation. If you enable synthetic intonation, ModelTalker will use Digital Signal Processing (DSP) techniques to make the sentence match normal intonation more closely. However, that may also reduce the naturalness of your voice quality to some extent. You need to decide whether, on balance, enabling synthetic intonation helps or hurts your voice.
Syllable timing refers to the way the rhythm of your speech varies within a phrase. For example, the same syllable may be spoken more rapidly at the beginning of a phrase or more drawn out at the end of a phrase. Similarly, stressed or emphasized syllables tend to be longer than unstressed syllables. As with intonation, by default ModelTalker attempts to find syllables that match what is appropriate for normal syllable timing. If you enable synthetic syllable timing, ModelTalker will use DSP to modify syllable durations so they match typical speech timing patterns more closely. As with intonation, the DSP may reduce the naturalness of your voice quality; however, it sometimes helps make the speech easier to understand, particularly if you have some dysarthria or if your speaking rate varied a lot when you recorded your inventory.
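For readers who are curious, here is a minimal conceptual sketch of what enabling these two options does: the recording's measured pitch and syllable durations are nudged toward values typical for the sentence, at the cost of some naturalness. The function name, data values, and blending strategy below are illustrative assumptions only, not ModelTalker's actual DSP implementation.

```python
# Conceptual sketch only: NOT ModelTalker's real DSP code. It illustrates the
# general idea behind the "synthetic intonation" and "synthetic syllable
# timing" options: moving measured prosody toward typical target values.

def blend(measured, target, strength):
    """Move each measured value part way toward its target.

    strength = 0.0 keeps the recording's own prosody (option disabled);
    strength = 1.0 imposes the target pattern fully.
    """
    return [m + strength * (t - m) for m, t in zip(measured, target)]

# Hypothetical pitch values (Hz) at a few points in a statement, and a
# smoothly falling contour as the target expected for a statement.
measured_f0 = [182.0, 190.0, 201.0, 188.0]   # slightly disjointed as recorded
target_f0   = [200.0, 190.0, 175.0, 150.0]   # typical statement-final fall

# Hypothetical syllable durations (seconds): the last syllable of a phrase
# is typically lengthened relative to phrase-initial syllables.
measured_dur = [0.16, 0.14, 0.12, 0.13]
target_dur   = [0.14, 0.13, 0.15, 0.22]

print(blend(measured_f0, target_f0, strength=0.8))
print(blend(measured_dur, target_dur, strength=0.8))
```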
Unlike our Gen1 and Gen2 voices, Gen3 voices are not affected by the syllable timing and intonation parameters. Instead, Gen3 voices use AI to learn generalizable characteristics of your intonation patterns and syllable timing from your recorded speech and apply that learned knowledge when synthesizing sentences.
In Step 1, there is a text box where you can enter a sentence, or try several different sentences, while experimenting with different combinations of the speaking rate, timing, and intonation controls. The point of this step is to determine the combination of settings that will be used in the second step, when you compare different versions of your voice, all with the same synthesis parameters.
2. Compare the voice versions
Step 2 consists of between 30 and 40 sentence comparison “trials,” depending on the number of different voices being compared. On each trial, you will listen to the same sentence generated by two different versions of your voice and choose which of the two versions you prefer. The sentences we use are chosen to challenge the limits of your synthetic voice. Some may be hard or nearly impossible to understand. More often, it may be difficult to decide which of the two versions sounds better, but you must pick one version even if you feel the choice is arbitrary.
Over the trials, you will hear at least 10 sentences made by each version of your voice, and each version will be paired an equal number of times with every other version. From these “paired comparison” trials, we will identify the two versions you chose most frequently and use those two versions in the third step.
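To make the arithmetic concrete, the short sketch below shows how the trial count follows from the number of voice versions being compared, assuming every pair of versions is presented equally often. The specific numbers used (4 versions, 5 presentations per pair) are illustrative assumptions, not a guarantee of your exact schedule.

```python
# Illustrative arithmetic only (not ModelTalker's actual scheduling code):
# relating the number of paired-comparison trials to the number of voice
# versions, assuming every pair of versions is presented equally often.

from itertools import combinations

def trial_plan(num_versions, trials_per_pair):
    # All distinct pairs of voice versions, e.g. 4 versions -> 6 pairs.
    pairs = list(combinations(range(1, num_versions + 1), 2))
    total_trials = len(pairs) * trials_per_pair
    # Each version appears in (num_versions - 1) pairs, so it is heard in
    # trials_per_pair * (num_versions - 1) sentences over the whole step.
    sentences_per_version = trials_per_pair * (num_versions - 1)
    return total_trials, sentences_per_version

# Example: 4 versions, each pair presented 5 times ->
# 6 pairs * 5 = 30 trials, and each version heard in 15 sentences.
print(trial_plan(4, 5))   # (30, 15)
```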
3. Make a final choice
In Step 3, you will again have access to the speaking rate, timing, and intonation controls, along with a text box as in Step 1. You may enter short sentences in the text box, or copy and paste much longer passages (up to about 2000 characters). We encourage you to use this step to test your synthetic voice with things you may actually want to say.
In addition to the same controls from Step 1, you will also have radio buttons that let you choose which of the two voices selected in Step 2 is used to render the text in the text box. You should try both voice versions with multiple texts, and also try adjusting the rate, timing, and intonation controls again. When you have decided on the combination of controls and the voice that you prefer, you may complete the audition process.

Most users at this point will have the option to either Accept the voice or Decline the voice. If you Decline the voice, we will not build an installer for the voice and you will not be charged for the service. If you Accept the voice, the current page settings will be saved as the ones we will use in building your voice installer. A notice will be sent to you as soon as the voice is ready for download. After clicking Accept, you will have the option to pay for the voice immediately by credit card. If you do not pay immediately, you will still need to pay by credit card before you can download your voice.
Note: If you see a Done button instead of the Accept and Decline buttons, either we have already received payment or payment is not required. When you click Done, we will build your voice installer and send you a notice as soon as the voice is ready for download.