Demonstration 1: Voice transformation for character creation

Starting from a piece of recorded speech, voices with different speaker identities and characters can be generated through voice transformation. Here are some examples. The original speaker, bdl, is a tenor. Nine different human-like voices are generated. Whispering and hoarse voices can also be generated. In addition, the emotion level of the voices can be changed, from completely calm to excited. Cartoon voices can also be generated.

Here is the original voice.

The basic transformed voices are as follows:

Giant

Contrabass

Bass

Baritone

Contralto

Mezzo-soprano

Soprano

Child

Baby

The whispering voices are as follows:

Giant

Contrabass

Bass

Baritone

Tenor

Contralto

Mezzo-soprano

Soprano

Child

Baby

The hoarse voices are as follows:

Giant

Contrabass

Bass

Baritone

Tenor

Contralto

Mezzo-soprano

Soprano

Child

Baby

The emotional level can be changed for each of the basic voices:

Giant: flat, calm, indifferent, emotional, excited.

Contrabass: flat, calm, indifferent, emotional, excited.

Bass: flat, calm, indifferent, emotional, excited.

Baritone: flat, calm, indifferent, emotional, excited.

Tenor: flat, calm, indifferent, emotional, excited.

Contralto: flat, calm, indifferent, emotional, excited.

Mezzo-soprano: flat, calm, indifferent, emotional, excited.

Soprano: flat, calm, indifferent, emotional, excited.

Child: flat, calm, indifferent, emotional, excited.

Voices of cartoon characters can also be generated. There are a large number of ways to generate cartoon voices. Here are two simple examples using mismatched pitch and vocal-tract size:

Sample 1

Sample 2

Demonstration 2: Speed variation for a sentence in the ARCTIC databases

Using timbre interpolation, the speed of speech can be changed dramatically. Because the speed is changed in the same way that human speakers change it, the result is a clear and natural voice at various speaking rates, especially at high speed. Fast speech is particularly useful for vision-impaired people, as it makes listening more efficient. In customer tests, many blind listeners praised this type of fast speech as the best of all technologies. Slow speech is useful for language education. The number in each label indicates words per minute.

speed_100_wpm

speed_200_wpm

speed_300_wpm

speed_400_wpm

speed_500_wpm

speed_600_wpm

speed_700_wpm

speed_800_wpm

speed_900_wpm

speed_1000_wpm
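To give a feel for what the words-per-minute labels above imply, here is a minimal Python sketch of the arithmetic. It is not the timbre-interpolation method itself; the function names and the ten-word sentence length are assumptions made only for illustration.

```python
# Illustrative arithmetic only: what a words-per-minute label implies
# for duration. This is NOT the timbre-interpolation method itself.


def duration_seconds(word_count: int, words_per_minute: float) -> float:
    """Time needed to speak `word_count` words at the given rate."""
    if words_per_minute <= 0:
        raise ValueError("words_per_minute must be positive")
    return word_count * 60.0 / words_per_minute


def time_scale(source_wpm: float, target_wpm: float) -> float:
    """Factor by which a recording's duration is multiplied
    when its rate changes from source_wpm to target_wpm."""
    if source_wpm <= 0 or target_wpm <= 0:
        raise ValueError("rates must be positive")
    return source_wpm / target_wpm


if __name__ == "__main__":
    words = 10  # hypothetical sentence length, chosen for illustration
    for wpm in range(100, 1001, 100):
        print(f"speed_{wpm}_wpm: {duration_seconds(words, wpm):.2f} s")
```

Under these assumptions, going from 200 to 800 words per minute shortens a recording to a quarter of its duration.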

 

Demonstration 3: Prosody variation for a sentence in the ARCTIC databases

Using the pitch control capabilities of the new theory, various prosody modifications can be made. The example is sentence a0207, spoken by the female speaker slt.

How much was it?

The original speech has a rather flat tone and sounds like a statement rather than a question. The first modified version,

How much was IT?

puts a rising tone on the last word, which makes it a typical question. The second variation,

How much WAS it?

places the emphasis on the time of purchase. The third variation,

How MUCH was it?

puts the emphasis on the amount paid, that is, whether it was too much or too little.

 

Demonstration 4: Examples of simple voice transformation operations

The output of the vocoder can be changed by changing some parameters. Here are the results of certain simple voice transformation operations.

The original voice, slt, is an alto (A). It can be transformed into soprano (S), tenor (T), and baritone (B). The pitch contour can also be modified to change the emotion level. Here, four emotion levels are displayed: calm (C), normal (N), emotional (E), and excited (X). Two sentences, a0006 and a0008, are shown here.

a0006_S_C, a0006_S_N, a0006_S_E, a0006_S_X

a0006_A_C, a0006_A_N, a0006_A_E, a0006_A_X

a0006_T_C, a0006_T_N, a0006_T_E, a0006_T_X

a0006_B_C, a0006_B_N, a0006_B_E, a0006_B_X

a0008_S_C, a0008_S_N, a0008_S_E, a0008_S_X

a0008_A_C, a0008_A_N, a0008_A_E, a0008_A_X

a0008_T_C, a0008_T_N, a0008_T_E, a0008_T_X

a0008_B_C, a0008_B_N, a0008_B_E, a0008_B_X
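The sample labels above follow a regular pattern: sentence ID, voice letter, emotion letter. As a minimal Python sketch, assuming only that pattern, the full grid can be enumerated and decoded as follows. The function names are hypothetical and are not part of the demonstration software.

```python
from itertools import product

# Letter codes as given in the text above.
VOICES = {"S": "soprano", "A": "alto", "T": "tenor", "B": "baritone"}
EMOTIONS = {"C": "calm", "N": "normal", "E": "emotional", "X": "excited"}


def sample_labels(sentence_id: str) -> list[str]:
    """All voice/emotion labels for one sentence, in the listed order."""
    return [f"{sentence_id}_{v}_{e}" for v, e in product(VOICES, EMOTIONS)]


def describe(label: str) -> tuple[str, str, str]:
    """Split a label such as 'a0006_S_C' into its three parts."""
    sentence_id, voice, emotion = label.split("_")
    return sentence_id, VOICES[voice], EMOTIONS[emotion]


if __name__ == "__main__":
    for sid in ("a0006", "a0008"):
        print(", ".join(sample_labels(sid)))
```

Each sentence yields sixteen labels: four voices times four emotion levels.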

Demonstration 5: A voice transformation study of the expression "OK!"

A classic example of voice transformation is the prosodic variation of the expression "OK!". Different pitch contours and stress patterns give the same expression different meanings. The level of emotion is represented by the range of pitch variations. Further, the same expression can be spoken by different speaker types, such as soprano, alto, tenor and bass.
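One simple way to picture "range of pitch variations" is to expand or compress a pitch contour around its mean on a logarithmic scale. The Python sketch below illustrates that idea only; it is an assumption, not the method used to produce these samples. The function name, the convention that 0 marks an unvoiced frame, and the range factors per emotion level are all hypothetical.

```python
import math

# Hypothetical range factors, chosen only for illustration.
EMOTION_RANGE = {"calm": 0.5, "neutral": 1.0, "emotional": 1.5, "excited": 2.0}


def scale_pitch_range(f0_hz: list[float], factor: float) -> list[float]:
    """Expand (factor > 1) or compress (factor < 1) a pitch contour
    around its geometric mean. Frames with f0 == 0 are treated as
    unvoiced and left at 0."""
    voiced = [f for f in f0_hz if f > 0]
    if not voiced:
        return list(f0_hz)
    mean_log = sum(math.log(f) for f in voiced) / len(voiced)
    return [
        math.exp(mean_log + factor * (math.log(f) - mean_log)) if f > 0 else 0.0
        for f in f0_hz
    ]


if __name__ == "__main__":
    contour = [100.0, 0.0, 200.0, 400.0]  # made-up contour in Hz
    for name, factor in EMOTION_RANGE.items():
        print(name, [round(f, 1) for f in scale_pitch_range(contour, factor)])
```

Working in the log domain keeps the scaling symmetric in musical intervals, so a factor of 2 doubles the distance of every voiced frame from the mean pitch, measured in octaves.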

Here is the original recording.

The following variations are characterized by a diminishing duration and intensity of the first syllable "O" and an increasing duration and intensity of the second syllable "kay". The pitch variations are as follows:

First prosodic variation. Low-rising.

Different speaker identities and emotion levels:

Soprano_calm Soprano_neutral Soprano_emotional Soprano_excited

Alto_calm Alto_neutral Alto_emotional Alto_excited

Tenor_calm Tenor_neutral Tenor_emotional Tenor_excited

Bass_calm Bass_neutral Bass_emotional Bass_excited

Second prosodic variation. Low-falling.

Different speaker identities and emotion levels:

Soprano_calm Soprano_neutral Soprano_emotional Soprano_excited

Alto_calm Alto_neutral Alto_emotional Alto_excited

Tenor_calm Tenor_neutral Tenor_emotional Tenor_excited

Bass_calm Bass_neutral Bass_emotional Bass_excited

Third prosodic variation. High-falling.

Different speaker identities and emotion levels:

Soprano_calm Soprano_neutral Soprano_emotional Soprano_excited

Alto_calm Alto_neutral Alto_emotional Alto_excited

Tenor_calm Tenor_neutral Tenor_emotional Tenor_excited

Bass_calm Bass_neutral Bass_emotional Bass_excited

Fourth prosodic variation. High-rising.

Different speaker identities and emotion levels:

Soprano_calm Soprano_neutral Soprano_emotional Soprano_excited

Alto_calm Alto_neutral Alto_emotional Alto_excited

Tenor_calm Tenor_neutral Tenor_emotional Tenor_excited

Bass_calm Bass_neutral Bass_emotional Bass_excited

 

Demonstration 6: Singing synthesis

Starting with lyrics narrated by a speaker and using the timbron parameterization, the speech can be converted into singing. Although a high-quality solo voice is difficult to mimic because personal style is crucial, good-quality choral music can be synthesized. Here is a sample, the comic version of Richard Wagner's Bridal March, which starts from words read by a male speaker and is synthesized as a chorus of small children. The lyrics are:

Here comes the bride

Big fat and wide

See how she wobbles

from side to side.

Here comes the groom

Skinny as a broom

He'd wobble too

if he had any room.

Bridal_march

 

Another sample is the beginning phrases of George Frideric Handel's Messiah, with a typical four-part chorus.

Hallelujah