Starting from a piece of recorded speech, voices with different speaker identities and different characters can be generated through voice transformation. Here are some examples. The original speaker, bdl, is a tenor. Nine different human-like voices are generated. Whispering and hoarse voices can also be generated. In addition, the emotion level of the voices can be changed, from completely calm to excited. Cartoon voices can also be generated.
Here is the original voice.
The basic transformed voices are as follows:
The whispering voices are as follows:
The hoarse voices are as follows:
The emotional level can be changed for each of the basic voices:
Giant: flat, calm, indifferent, emotional, excited.
Contrabass: flat, calm, indifferent, emotional, excited.
Bass: flat, calm, indifferent, emotional, excited.
Baritone: flat, calm, indifferent, emotional, excited.
Tenor: flat, calm, indifferent, emotional, excited.
Contralto: flat, calm, indifferent, emotional, excited.
Mezzo-soprano: flat, calm, indifferent, emotional, excited.
Soprano: flat, calm, indifferent, emotional, excited.
Child: flat, calm, indifferent, emotional, excited.
Voices of cartoon characters can also be generated. There are a large number of ways to generate cartoon voices. Here are two simple examples using mismatched pitch and vocal-tract size:
Using timbre interpolation, the speed of speech can be changed dramatically. Because the speed is changed in the same way that human speakers change it, the result is a clear and natural voice at various speaking speeds, especially at high speed. Fast speech is particularly useful for visually impaired people, because it increases the efficiency of listening. In customer tests, many blind listeners rated this type of fast speech the best of all technologies. Slow speech is useful for language education. The number indicates words per minute.
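The relation between a target speaking rate and the required change in duration is simple arithmetic. The sketch below only illustrates that relation; it does not perform timbre interpolation. The function name `duration_scale` and the 150 words-per-minute starting rate are assumptions made for the example, not values taken from the demonstration.

```python
def duration_scale(original_wpm: float, target_wpm: float) -> float:
    """Return the factor by which the utterance duration must be multiplied
    to move from original_wpm to target_wpm (words per minute)."""
    if original_wpm <= 0 or target_wpm <= 0:
        raise ValueError("speaking rates must be positive")
    # Doubling the rate halves the duration, so the factor is the inverse ratio.
    return original_wpm / target_wpm


# Hypothetical example: a recording read at 150 words per minute.
for target in (75, 150, 300, 600):
    scale = duration_scale(150, target)
    print(f"{target} wpm -> duration x {scale:.2f}")
```

A factor below 1 shortens the utterance (fast speech); a factor above 1 lengthens it (slow speech).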
Using the pitch control capabilities of the new theory, various prosody modifications can be made. The example is sentence a0207, spoken by the female speaker slt.
The original speech has a rather flat tone and sounds like a statement rather than a question. The first modified speech,
puts a rising tone on the last word, which makes it a typical question. The second variation,
places the emphasis on the time of purchase. The third variation,
puts the emphasis on the amount paid, whether it was overpaid or underpaid.
The output of the vocoder can be changed by changing some parameters. Here are the results of certain simple voice transformation operations.
The original voice, slt, is an alto (A). It can be transformed into soprano (S), tenor (T), and baritone (B). The pitch contour can also be modified to change the emotion level. Here, four emotion levels are displayed: calm (C), normal (N), emotional (E), and excited (X). Two of the above sentences, a0006 and a0008, are shown here.
a0006_S_C, a0006_S_N, a0006_S_E, a0006_S_X
a0006_A_C, a0006_A_N, a0006_A_E, a0006_A_X
a0006_T_C, a0006_T_N, a0006_T_E, a0006_T_X
a0006_B_C, a0006_B_N, a0006_B_E, a0006_B_X
a0008_S_C, a0008_S_N, a0008_S_E, a0008_S_X
a0008_A_C, a0008_A_N, a0008_A_E, a0008_A_X
a0008_T_C, a0008_T_N, a0008_T_E, a0008_T_X
a0008_B_C, a0008_B_N, a0008_B_E, a0008_B_X
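The sample names above follow the pattern sentence, voice code, emotion code. As a small illustration, the following sketch generates and decodes names of that form. The helper names `sample_names` and `describe` are hypothetical; only the letter codes come from the text.

```python
from itertools import product

# Letter codes as defined in the text above.
VOICES = {"S": "soprano", "A": "alto", "T": "tenor", "B": "baritone"}
EMOTIONS = {"C": "calm", "N": "normal", "E": "emotional", "X": "excited"}


def sample_names(sentence: str) -> list[str]:
    """List all voice/emotion combinations for one sentence ID."""
    return [f"{sentence}_{v}_{e}" for v, e in product(VOICES, EMOTIONS)]


def describe(name: str) -> str:
    """Expand a name such as 'a0006_S_C' into a readable description."""
    sentence, voice, emotion = name.split("_")
    return f"sentence {sentence}, {VOICES[voice]} voice, {EMOTIONS[emotion]}"


for name in sample_names("a0006"):
    print(name, "=", describe(name))
```

Each sentence yields 4 voices times 4 emotion levels, which gives the 16 samples per sentence listed above.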
A classic example of voice transformation is the prosody variations of the expression "OK!". Different pitch contours and stress patterns of the same expression convey different meanings. The level of emotion is represented by the range of pitch variations. Further, the same expression can be spoken by different speaker types, such as soprano, alto, tenor and bass.
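One simple way to picture "range of pitch variations" is to widen or narrow the excursions of a pitch contour around its average. The sketch below does this on a logarithmic frequency scale. It is an assumption made for illustration, not the method used to produce the samples; the function name `scale_pitch_range`, the convention that 0 marks an unvoiced frame, and the example contour are all hypothetical.

```python
import math


def scale_pitch_range(f0_hz, factor):
    """Scale pitch excursions about the mean log-pitch of the voiced frames.

    f0_hz  : pitch values in Hz, with 0 marking an unvoiced frame (assumed convention)
    factor : 0 flattens the contour, 1 leaves it unchanged, >1 widens the range
    """
    voiced = [f for f in f0_hz if f > 0]
    if not voiced:
        return list(f0_hz)
    mean_log = sum(math.log(f) for f in voiced) / len(voiced)
    return [
        math.exp(mean_log + factor * (math.log(f) - mean_log)) if f > 0 else 0.0
        for f in f0_hz
    ]


# Hypothetical rising contour with one unvoiced frame in the middle.
contour = [180.0, 200.0, 0.0, 220.0, 260.0]
for label, factor in [("calm", 0.5), ("neutral", 1.0), ("excited", 2.0)]:
    print(label, [round(f, 1) for f in scale_pitch_range(contour, factor)])
```

Working in the logarithm of frequency keeps the scaling symmetric in musical intervals, so the same factor applies equally well to a bass and a soprano contour.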
Here is the original recording.
The following variations are characterized by a diminishing duration and intensity of the first syllable "O" and an increasing duration and intensity of the second syllable "kay". The pitch variations are as follows:
First prosodic variation. Low-rising.
Different speaker identities and emotion levels:
Soprano_calm Soprano_neutral Soprano_emotional Soprano_excited
Alto_calm Alto_neutral Alto_emotional Alto_excited
Tenor_calm Tenor_neutral Tenor_emotional Tenor_excited
Bass_calm Bass_neutral Bass_emotional Bass_excited
Second prosodic variation. Low-falling.
Different speaker identities and emotion levels:
Soprano_calm Soprano_neutral Soprano_emotional Soprano_excited
Alto_calm Alto_neutral Alto_emotional Alto_excited
Tenor_calm Tenor_neutral Tenor_emotional Tenor_excited
Bass_calm Bass_neutral Bass_emotional Bass_excited
Third prosodic variation. High-falling.
Different speaker identities and emotion levels:
Soprano_calm Soprano_neutral Soprano_emotional Soprano_excited
Alto_calm Alto_neutral Alto_emotional Alto_excited
Tenor_calm Tenor_neutral Tenor_emotional Tenor_excited
Bass_calm Bass_neutral Bass_emotional Bass_excited
Fourth prosodic variation. High-rising.
Different speaker identities and emotion levels:
Soprano_calm Soprano_neutral Soprano_emotional Soprano_excited
Alto_calm Alto_neutral Alto_emotional Alto_excited
Tenor_calm Tenor_neutral Tenor_emotional Tenor_excited
Bass_calm Bass_neutral Bass_emotional Bass_excited
Starting with lyrics narrated by a speaker, the speech can be converted into singing using the timbron parameterization. Although a high-quality solo voice is difficult to mimic because personal style is crucial, good-quality choral music can be synthesized. Here is a sample, the comic version of Richard Wagner's Bridal March, which starts from words read by a male speaker and is synthesized as a chorus of small children. The lyrics are:
Here comes the bride
Big fat and wide
See how she wobbles
from side to side.
Here comes the groom
Skinny as a broom
He'd wobble too
if he had any room.
Another sample is the beginning phrases of George Frideric Handel's Messiah, with a typical four-part chorus.