WO2008018653A1 - Voice color conversion system using glottal waveform - Google Patents

Voice color conversion system using glottal waveform

Info

Publication number
WO2008018653A1
WO2008018653A1 (PCT/KR2006/004478)
Authority
WO
WIPO (PCT)
Prior art keywords
voice
glottal wave
speaker
glottal
converting
Prior art date
Application number
PCT/KR2006/004478
Other languages
French (fr)
Inventor
Yung-Hwan Oh
Jae-Hyun Bae
Original Assignee
Korea Advanced Institute Of Science And Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Korea Advanced Institute Of Science And Technology filed Critical Korea Advanced Institute Of Science And Technology
Publication of WO2008018653A1 publication Critical patent/WO2008018653A1/en

Links

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00Speech synthesis; Text to speech systems
    • G10L13/02Methods for producing synthetic speech; Speech synthesisers
    • G10L13/033Voice editing, e.g. manipulating the voice of the synthesiser
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00Speech synthesis; Text to speech systems
    • G10L13/02Methods for producing synthetic speech; Speech synthesisers
    • G10L13/04Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/003Changing voice quality, e.g. pitch or formants
    • G10L21/007Changing voice quality, e.g. pitch or formants characterised by the process used
    • G10L21/013Adapting to target pitch
    • G10L2021/0135Voice conversion or morphing

Definitions

  • the present invention relates to a speaker conversion system which can express various voice colors, and more particularly to a voice color conversion method and system which can generate a converted voice having various voice colors according to the utterance situation by converting the glottal wave.
  • FIG. 1 shows a process of generating a voice upon speaking.
  • when a man speaks, air from the bronchial tube passes through the vocal cords and thus a glottal wave is formed.
  • noise (aspirated sound) due to out-breathing is contained in the glottal wave.
  • while the glottal wave passes through the vocal tract, an articulation phenomenon occurs. Then, the glottal wave is radiated into the air and thus generates a voice.
  • the present invention relates to the conversion of glottal wave generated when a man speaks and grows out of a fact that a shape of glottal wave is changed according to environmental status, feelings and the like when a man speaks and thus the voice having various voice colors can be generated.
  • As a conventional method of mimicking the voice of a specific person (hereinafter called the "target speaker"), there are a mimicking method employing an expert dubbing artist and a mimicking method using a computer.
  • the computer-based methods, which convert a voice of a certain speaker (hereinafter called the "original speaker") into that of the target speaker, include mimicking methods using the Hidden Markov Model (HMM), the Gaussian Mixture Model (GMM) and the Neural Network.
  • in these methods, vocal tract characteristic parameters of the voice, such as LPC (Linear Prediction Coefficient), LSP (Line Spectral Pair), MFCC (Mel-Frequency Cepstral Coefficient) and HNM (Harmonic and Noise Model) characteristics, are extracted from the voices of the original speaker and the target speaker; the HMM or GMM is trained using the characteristic parameters of each speaker; and then a conversion function between the trained models is calculated so that the vocal tract characteristic of the original speaker is converted into that of the target speaker.
  • the prosody of the target speaker is modeled and then applied to the converted voice.
  • An object of the present invention is to provide a method of converting voice color, which can generate a voice of a certain speaker without employing an expert dubbing artist who can mimic the voice, and which can also generate voices having various voice colors anywhere and at any time according to the feelings of the certain person.
  • Another object of the present invention is to provide a method of converting voice color, which can generate voices having various voice colors according to situations and conditions.
  • a voice color conversion method for converting a speaker comprising the steps of analyzing a glottal wave signal of a voice of an original speaker; converting the analyzed glottal wave signal; and re-composing the converted glottal wave signal.
  • the step of analyzing the glottal wave signal comprises the steps of extracting a glottal wave from the voice of the original speaker; and extracting voice color characteristic parameters of the extracted glottal wave.
  • the step of converting the glottal wave signal comprises the steps of collecting glottal wave characteristic parameters for various voice colors from the voice of the original speaker and creating a voice color database of the original speaker; and collecting glottal wave characteristic parameters for various voice colors from a voice of a target speaker and creating a voice color database of the target speaker.
  • the step of converting the glottal wave signal further comprises the steps of collecting glottal wave characteristic parameters for various voice colors from voices of various general speakers and creating a voice color database of the general speakers.
  • the step of converting the glottal wave signal further comprises the steps of creating a corresponding relationship among the voice color databases of the original speaker, the target speaker and the general speakers on the basis of a quantity of change of the voice color characteristic parameters stored in each voice color database .
  • the characteristic parameters comprise an OQ section in which the vocal cords are opened, a CQ section in which the vocal cords are closed, EE (Effective Excitation) which is an effective excitation value, and T0 which is the period of the signal.
  • the quantity of change of the characteristic parameters is an NAQ parameter, which is defined as NAQ = E0 / (T0 · EE).
  • a voice color conversion system for converting a speaker, comprising glottal wave extracting means for extracting a glottal wave from a voice of an original speaker; voice color parameter extracting means for extracting voice color parameters from the extracted glottal wave; glottal wave converting means for converting the glottal wave using the voice color parameters; and converted voice generating means for generating a converted voice using the converted glottal wave.
  • the glottal wave extracting means comprises an A/D converter and an input buffer
  • the voice color parameter extracting means and the glottal wave converting means comprises a command storing unit and a main controller
  • the converted voice generating means comprises a D/A converter and an output buffer.
  • the system further comprises a storing unit in which voice color databases of an original speaker, a target speaker and general speakers are stored.
  • according to the voice color conversion method of the present invention, it is possible to convert a voice of an original speaker into a voice having the various voice colors of a target speaker according to situations, feelings and context. Furthermore, the speaker conversion system of the present invention can readily substitute for an expert dubbing artist who can mimic a voice of a target speaker. Further, since the speaker conversion system can optionally generate a virtual voice by combining the voice color databases of the general speakers, it can be used to make voices of various virtual characters.
  • Fig. 1 is a view showing a mechanism for generating a voice when a man speaks
  • Fig. 2 is a view showing a process of converting glottal wave to express various voice colors according to the present invention
  • Fig. 3 is a view showing a glottal wave obtained by using an LF (Liljencrants-Fant) model and a derivative glottal wave thereof according to the present invention
  • Fig. 4 is a view showing a derivative glottal wave of a KLGLOT88 model according to the present invention.
  • Fig. 5 is a view showing characteristic parameters of glottal wave for converting the voice color according to the present invention
  • Fig. 6 is a view showing a manner of storing voice color data in voice color databases of an original speaker, a target speaker and a general speaker according to the present invention
  • Fig. 7 is a graph showing distribution of NAQ parameters for various voice colors according to the present invention
  • Fig. 8 is a view showing a change of vocal glottal waves having various voice colors according to the present invention
  • Fig. 9 is a view showing a process in a step of converting the glottal wave according to the present invention.
  • Fig. 10 is a block diagram explaining a system for converting glottal wave signal according to the present invention.
  • Fig. 2 is a view showing a process of converting glottal wave to express various voice colors according to the present invention.
  • a method of converting a voice according to the present invention includes a step of analyzing glottal wave signal, a step of converting the glottal wave signal and a step of re-composing the glottal wave signal.
  • the step of analyzing the glottal wave signal includes a step of extracting the glottal wave from the voice signal of the input original speaker and a step of extracting parameters of various voice colors from the extracted glottal wave.
  • the parameters of voice colors extracted from the glottal wave are converted into glottal wave signals using data of voice color database.
  • the voice color database includes voice color database of an original speaker, voice color database of a target speaker and voice color database of a general speaker.
  • the voice color database of the target speaker, as a model representing the native voice color of the target speaker only, is a database in which characteristic parameters of the target speaker are collected.
  • the voice color database of the general speaker is a database in which characteristic parameters of various voice colors extracted from general speakers are collected.
  • the characteristic parameters of the voice colors of the original speaker, which are extracted in the step of analyzing the glottal wave signal, are converted into the glottal wave having the voice color of the target speaker using voice color characteristic parameters in which the voice color models of the target speaker and the general speaker are weighted and combined with each other.
  • the glottal wave having the voice color of the target speaker is reconstructed using vocal tract characteristic parameters, obtained by using the HMM and the like, and the glottal wave generated in the step of converting the glottal wave signal, so as to generate a finally converted voice.
  • the glottal wave is extracted from input signal which is converted into digital signal through a sampling process, and the characteristic parameters of voice colors are extracted from the glottal wave.
  • the vocal tract characteristic is removed as much as possible so that only the glottal wave remains.
  • the glottal wave may be extracted by inverse-filtering the excitation signal using a linear prediction algorithm of the voice.
  • the glottal wave is differentiated along the time axis to obtain the derivative glottal signal.
  • FIG. 3 is a view showing a glottal wave obtained by using an LF (Liljencrants-Fant) model and a derivative glottal wave thereof according to the present invention.
  • E0 is the maximum amplitude of the glottal wave signal
  • EE is the maximum closing speed of the vocal cords, as an effective excitation value
  • Te is the time period while the vocal cords are opened
  • Tc is the time period while the vocal cords are closed
  • T0 is the period while the vocal cords are opened and closed.
  • To represent the glottal wave, a KLGLOT88 model may also be used.
  • the equation is as follows:
  • a and b are defined in terms of the voicing amplitude (AV) and the open quotient Oq, and may be represented as follows:
  • the AV is the same parameter as the E0 of the LF model, Oq is the same parameter as the OQ of the LF model, and T0 is the basic period of the vocal cords.
  • Fig. 4 is a view showing a derivative glottal wave of the KLGLOT88 model. As shown in Figs. 3 and 4, although the models used are different, the shapes of the glottal waves are equal or similar to each other, because the same characteristic parameters are used.
  • Fig. 5 is a view showing characteristic parameters of glottal wave for converting the voice color according to the present invention.
  • the characteristic parameters necessary to express various voice colors include OQ, which indicates the opening section of the vocal cords, CQ, which indicates the closing section of the vocal cords, EE (Effective Excitation), which is an effective excitation value, and T0, which is the period of the signal.
  • the characteristic parameters can be obtained during a process of matching the derivative glottal wave with the LF model of Figs. 3 and 4. And after modeling the LF derivative glottal wave, an error signal which indicates the difference between the modeled derivative glottal wave and the actual derivative glottal wave can be obtained.
  • Fig. 6 is a view showing a manner of storing voice color data in the voice color databases of the original speaker, the target speaker and the general speaker according to the present invention.
  • the voice color database of the target speaker stores, for each voice color, the characteristic parameters of the glottal wave analyzed in the signal analyzing step for the actual voice of the target speaker, together with the error signal.
  • the voice color database of the general speaker stores, for each voice color, the characteristic parameters of the glottal wave analyzed in the signal analyzing step for the actual voices of ordinary people, together with the error signal.
  • a breathy voice, a pressed voice and a normal voice are representative of the voice colors which are expressed by the glottal wave.
  • the voice color databases of the target speaker and the general speaker store the characteristic parameters of various voice colors in the storing manner shown in Fig. 6.
  • the OQ and CQ sections are not divided clearly and the EE section has a flat shape.
  • the EE section has a peak shape, and the CQ section is much longer than the OQ section.
  • the normal voice has average OQ, CQ and EE sections.
  • the glottal wave can have various voice colors by adjusting the characteristic parameters such as the OQ, the CQ, the EE, the E 0 and the like.
  • the voice color database of the original speaker stores various voice colors of the original speaker, which are divided according to situations, feelings and the like.
  • the various voice colors correspond to each other according to the divided conditions.
  • the corresponding relationship is set by the histogram of changes in the parameters. Therefore, the various voice colors of the original speaker can be converted into the voice colors of the target speaker via the voice color databases of the target speaker and the general speaker.
  • the three voice color databases are constructed so that their voice colors correspond to each other according to voice color numbers.
  • the NAQ parameter is a combination of the above characteristic parameters of the glottal wave, as represented in Equation 4.
  • the characteristic parameters of the glottal wave are extracted from the glottal wave shown in Fig. 5, and the NAQ parameter is obtained by using Equation 4 to show the distribution thereof, whereby it is possible to grasp the corresponding relationship among the above characteristic parameters.
  • Fig. 7 is a graph showing distribution of NAQ parameters obtained by combining the characteristic parameters of glottal wave, wherein the glottal waves of breathy (Bre) , normal (Nor) and pressed (Pre) voices pronounced by thirteen people are analyzed and then the distribution of NAQ parameters is obtained.
  • the NAQ parameters are distributed consistently for a common voice color, and thus the quantity of change of the parameter values is consistent within each voice color. Therefore, it is possible to set the corresponding relationship for each voice color.
  • Fig. 8 is a view showing a shape of derivative glottal waves for the three voice colors, in which the change of the characteristic parameters OQ, CQ and EE for each voice color is observed.
  • the OQ section and the CQ section are not clearly divided and the EE section has a gentle shape.
  • the CQ section is relatively longer than the OQ section and the EE section has a peak shape.
  • the characteristic parameters of the glottal wave and the error signal which have the desired voice color of the target speaker are extracted, and then the parameter values are converted so as to generate a glottal wave model having the desired voice color of the target speaker. If the desired voice color is not in the voice color database of the target speaker, the characteristic parameters of the desired voice color and the characteristic parameters of the normal voice color are extracted from the voice color database of the general speaker, and the quantity of change between the two sets of characteristic parameters is calculated. Then, the calculated quantity of change is applied to the glottal wave having the normal voice color of the target speaker, thereby generating a glottal wave model having the desired voice of the target speaker.
  • the generated glottal wave model is combined with a basic frequency, a duration time and an energy parameter calculated in an existing prosody converting step to generate an actual glottal wave g(t), and the generated glottal wave is passed through a filter r(t) which has a radiation effect from the lips when a man speaks and then passed through a linear prediction filter v(t) comprised of the vocal tract characteristic parameters calculated in an existing vocal tract characteristic converting step, thereby finally generating a desired voice of the target speaker.
  • Fig. 9 is a view showing a process in a step of converting the glottal wave according to the present invention, which is represented as follows:
  • the present invention can provide various converted voices compared with the conventional speaker conversion system which can express only one voice color.
  • Fig. 10 is a block diagram explaining a system for converting glottal wave signal according to the present invention.
  • the voice of the speaker is input to an A/D converter and an input buffer to extract a glottal wave
  • voice color parameters of the extracted glottal wave are extracted by a command storing unit and a main controller and converted into a glottal wave, and then the final voice is output through a D/A converter and an output buffer.
  • according to the voice color conversion method of the present invention, it is possible to convert a voice of an original speaker into a voice having the various voice colors of a target speaker according to situations, feelings and context.
  • the speaker conversion system of the present invention can be widely applied to animations, movies, plays, commercial films and the like where voices having various and plentiful voice colors are needed.
  • the speaker conversion system of the present invention can readily substitute for an expert dubbing artist who can mimic a voice of a target speaker. Further, since the speaker conversion system can optionally generate a virtual voice by combining the voice color databases of the general speakers, it can be used to make voices of various virtual characters.
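The glottal wave extraction described above (inverse-filtering the excitation signal using a linear prediction algorithm) can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names are invented, a plain autocorrelation-method LPC is used, and practical glottal inverse filtering typically adds pre-emphasis and iterative refinement.

```python
import numpy as np

def lpc_coefficients(x, order):
    """Autocorrelation-method LPC via the Levinson-Durbin recursion.
    Returns a = [1, a1, ..., ap] of the inverse filter A(z)."""
    n = len(x)
    r = np.array([np.dot(x[:n - k], x[k:]) for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err
        a_prev = a.copy()
        for j in range(1, i):
            a[j] = a_prev[j] + k * a_prev[i - j]
        a[i] = k
        err *= (1.0 - k * k)
    return a

def inverse_filter(x, a):
    """Pass the speech signal through the FIR inverse filter A(z) to
    remove the vocal tract contribution; the residual approximates the
    (derivative) glottal excitation."""
    return np.convolve(x, a)[:len(x)]
```

On a voiced frame, the residual returned by `inverse_filter` is what the analysis step would then differentiate and parameterize (OQ, CQ, EE, T0).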

Abstract

Disclosed is a speaker conversion system which can express various voice colors, and more particularly a voice color conversion method and system which can generate a converted voice having various voice colors according to the utterance situation by converting the glottal wave.

Description

VOICE COLOR CONVERSION SYSTEM USING GLOTTAL WAVEFORM
[Technical Field] The present invention relates to a speaker conversion system which can express various voice colors, and more particularly to a voice color conversion method and system which can generate a converted voice having various voice colors according to the utterance situation by converting the glottal wave.
[Background Art]
FIG. 1 shows a process of generating a voice upon speaking. When a man speaks, air from the bronchial tube passes through the vocal cords, and thus a glottal wave is formed. At this time, since the man pronounces while breathing out, noise (aspirated sound) due to out-breathing is contained in the glottal wave. While the glottal wave passes through the vocal tract, an articulation phenomenon occurs. Then, the glottal wave is radiated into the air and thus generates a voice. The present invention relates to the conversion of the glottal wave generated when a man speaks, and grows out of the fact that the shape of the glottal wave changes according to environmental status, feelings and the like when a man speaks, so that voices having various voice colors can be generated.
As a conventional method of mimicking a voice of a specific person (hereinafter, called "target speaker"), there are a mimicking method employing an expert dubbing artist and a mimicking method using a computer.
In case of the mimicking method employing an expert dubbing artist, it is possible to mimic a prosodic characteristic with respect to a certain part of the voice; however, it is difficult to express various and natural voice colors.
And in case of the method using a computer, which can convert a voice of a certain speaker (hereinafter called the "original speaker") into that of the target speaker, there are mimicking methods using the Hidden Markov Model (HMM), the Gaussian Mixture Model (GMM) and the Neural Network.
In the conventional methods using the HMM, the GMM and the Neural Network, firstly, vocal tract characteristic parameters of the voice, such as LPC (Linear Prediction Coefficient), LSP (Line Spectral Pair), MFCC (Mel-Frequency Cepstral Coefficient) and HNM (Harmonic and Noise Model) characteristics, are extracted from the voices of the original speaker and the target speaker; the HMM or GMM is trained using the characteristic parameters of each speaker; and then a conversion function between the trained models is calculated so that the vocal tract characteristic of the original speaker is converted into that of the target speaker. In case of the prosody, the prosody of the target speaker is modeled and then applied to the converted voice. In a method of mimicking the prosody of the target speaker, after forming a pitch histogram of the original speaker and the target speaker, an excitation signal matched with the histogram is used. In case of the glottal wave, the excitation signal, from which the basic frequency information of the target speaker is excluded, is applied to the converted voice. In the conventional speaker conversion method, it is possible to convert the prosody and the vocal tract characteristic. However, since the excitation signal of the target speaker is used for the vocal cord characteristic, there is a problem that the voice color contained in the given excitation signal of the target speaker has to be used as it is.
Therefore, in the conventional method it is difficult to exactly express various voice colors according to situations, it is impossible to reflect the change of voice color according to feelings, environments, and the context when the original speaker speaks, and it is possible only to convert the voice of the original speaker into a predetermined voice color.
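For illustration, the simplest instance of the statistical feature mapping that underlies the GMM-based conversion described above is a per-dimension linear conversion function between two speakers' feature distributions (the single-Gaussian special case). The sketch below is hypothetical; it is not the training procedure of any particular conventional system.

```python
import numpy as np

def train_conversion(source, target):
    """Fit y = mu_t + (sigma_t / sigma_s) * (x - mu_s) per feature
    dimension, mapping source-speaker features onto the target
    speaker's mean and spread."""
    mu_s, mu_t = source.mean(axis=0), target.mean(axis=0)
    sd_s, sd_t = source.std(axis=0), target.std(axis=0)

    def convert(x):
        return mu_t + (sd_t / sd_s) * (x - mu_s)

    return convert
```

A GMM-based system generalizes this by fitting a mixture of such local linear maps and weighting them by posterior probability.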
[Disclosure] [Technical Problem]
An object of the present invention is to provide a method of converting voice color, which can generate a voice of a certain speaker without employing an expert dubbing artist who can mimic the voice, and which can also generate voices having various voice colors anywhere and at any time according to the feelings of the certain person.
Another object of the present invention is to provide a method of converting voice color, which can generate voices having various voice colors according to situations and conditions.
[Technical Solution]
According to the present invention, there is provided a voice color conversion method for converting a speaker, comprising the steps of analyzing a glottal wave signal of a voice of an original speaker; converting the analyzed glottal wave signal; and re-composing the converted glottal wave signal. And the step of analyzing the glottal wave signal comprises the steps of extracting a glottal wave from the voice of the original speaker; and extracting voice color characteristic parameters of the extracted glottal wave.
Further, the step of converting the glottal wave signal comprises the steps of collecting glottal wave characteristic parameters for various voice colors from the voice of the original speaker and creating a voice color database of the original speaker; and collecting glottal wave characteristic parameters for various voice colors from a voice of a target speaker and creating a voice color database of the target speaker. The step of converting the glottal wave signal further comprises the steps of collecting glottal wave characteristic parameters for various voice colors from voices of various general speakers and creating a voice color database of the general speakers. The step of converting the glottal wave signal further comprises the steps of creating a corresponding relationship among the voice color databases of the original speaker, the target speaker and the general speakers on the basis of a quantity of change of the voice color characteristic parameters stored in each voice color database .
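The three voice color databases and the correspondence set by the quantity of change of the characteristic parameters might be organized as sketched below. All parameter values here are illustrative placeholders, not taken from the patent, and the NAQ formula assumes the reading NAQ = E0 / (T0 · EE) of Equation (4).

```python
# Hypothetical entries for two of the voice color databases of Fig. 6;
# each entry holds the glottal characteristic parameters of one voice color.
target_db = {
    "normal":  {"E0": 1.0, "OQ": 0.55, "CQ": 0.45, "EE": 1.2, "T0": 0.006},
    "breathy": {"E0": 0.8, "OQ": 0.70, "CQ": 0.30, "EE": 0.7, "T0": 0.006},
}
general_db = {
    "normal":  {"E0": 1.0, "OQ": 0.55, "CQ": 0.45, "EE": 1.0, "T0": 0.008},
    "pressed": {"E0": 1.1, "OQ": 0.35, "CQ": 0.65, "EE": 1.9, "T0": 0.008},
}

def naq(p):
    # NAQ combines the characteristic parameters (assumed form of Equation 4).
    return p["E0"] / (p["T0"] * p["EE"])

def change_from_normal(db, color):
    """Quantity of change of a voice color relative to the normal voice,
    used to set the correspondence between the databases."""
    return naq(db[color]) / naq(db["normal"])
```

A missing voice color of the target speaker could then be synthesized by applying `change_from_normal(general_db, color)` to the target speaker's normal-voice parameters, as the conversion step describes.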
Further, the characteristic parameters comprise an OQ section in which the vocal cords are opened, a CQ section in which the vocal cords are closed, EE (Effective Excitation) which is an effective excitation value, and T0 which is the period of the signal. The quantity of change of the characteristic parameters is an NAQ parameter, which is defined as follows:

NAQ = E0 / (T0 · EE)    (4)

In addition, there is provided a voice color conversion system for converting a speaker, comprising glottal wave extracting means for extracting a glottal wave from a voice of an original speaker; voice color parameter extracting means for extracting voice color parameters from the extracted glottal wave; glottal wave converting means for converting the glottal wave using the voice color parameters; and converted voice generating means for generating a converted voice using the converted glottal wave. Preferably, the glottal wave extracting means comprises an A/D converter and an input buffer, the voice color parameter extracting means and the glottal wave converting means comprise a command storing unit and a main controller, and the converted voice generating means comprises a D/A converter and an output buffer. Further, the system comprises a storing unit in which the voice color databases of the original speaker, the target speaker and the general speakers are stored.
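Assuming Equation (4) reads NAQ = E0 / (T0 · EE), the parameter can be estimated from one pitch period of an extracted glottal wave as follows. The function name and the simple peak-picking are my own; this is a sketch, not the patent's procedure.

```python
import numpy as np

def naq_from_pulse(g, fs):
    """Estimate NAQ = E0 / (T0 * EE) from one pitch period of the
    glottal wave g sampled at fs Hz."""
    dg = np.diff(g) * fs               # derivative glottal wave
    e0 = g.max() - g.min()             # maximum amplitude of the pulse
    ee = -dg.min()                     # effective excitation: max closing speed
    t0 = len(g) / fs                   # pitch period in seconds
    return e0 / (t0 * ee)
```

Because E0 and EE·T0 scale together under amplitude and duration changes, NAQ is dimensionless, which is what makes it usable for comparing voice colors across speakers.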
[Advantageous Effects]
According to the voice color conversion method of the present invention, it is possible to convert a voice of an original speaker into a voice having the various voice colors of a target speaker according to situations, feelings and context. Furthermore, the speaker conversion system of the present invention can readily substitute for an expert dubbing artist who can mimic a voice of a target speaker. Further, since the speaker conversion system can optionally generate a virtual voice by combining the voice color databases of the general speakers, it can be used to make voices of various virtual characters.
[Description of Drawings]
The above and other objects, features and advantages of the present invention will become apparent from the following description of preferred embodiments given in conjunction with the accompanying drawings, in which:
Fig. 1 is a view showing a mechanism for generating a voice when a man speaks; Fig. 2 is a view showing a process of converting the glottal wave to express various voice colors according to the present invention;
Fig. 3 is a view showing a glottal wave obtained by using an LF (Liljencrants-Fant) model and a derivative glottal wave thereof according to the present invention;
Fig. 4 is a view showing a derivative glottal wave of a KLGLOT88 model according to the present invention;
Fig. 5 is a view showing characteristic parameters of glottal wave for converting the voice color according to the present invention; Fig. 6 is a view showing a manner of storing voice color data in voice color databases of an original speaker, a target speaker and a general speaker according to the present invention; Fig. 7 is a graph showing distribution of NAQ parameters for various voice colors according to the present invention;
Fig. 8 is a view showing a change of vocal glottal waves having various voice colors according to the present invention; Fig. 9 is a view showing a process in a step of converting the glottal wave according to the present invention; and
Fig. 10 is a block diagram explaining a system for converting glottal wave signal according to the present invention.
[Best Mode]
Hereinafter, the embodiments of the present invention will be described in detail with reference to accompanying drawings .
Fig. 2 is a view showing a process of converting glottal wave to express various voice colors according to the present invention. As shown in Fig. 2, a method of converting a voice according to the present invention includes a step of analyzing glottal wave signal, a step of converting the glottal wave signal and a step of re-composing the glottal wave signal.
The step of analyzing the glottal wave signal includes a step of extracting the glottal wave from the voice signal of the input original speaker and a step of extracting parameters of various voice colors from the extracted glottal wave. In the step of converting the glottal wave signal, the parameters of the voice colors extracted from the glottal wave are converted into glottal wave signals using the data of the voice color databases. At this time, the voice color databases include a voice color database of the original speaker, a voice color database of the target speaker and a voice color database of the general speaker.
The voice color database of the target speaker, as a model representing the native voice color of the target speaker only, is a database in which characteristic parameters of the target speaker are collected. The voice color database of the general speaker is a database in which characteristic parameters of various voice colors extracted from general speakers are collected. In the step of converting the glottal wave signal, the characteristic parameters of the voice colors of the original speaker, which are extracted in the step of analyzing the glottal wave signal, are converted into the glottal wave having the voice color of the target speaker using voice color characteristic parameters in which the voice color models of the target speaker and the general speaker are weighted and combined with each other.
In the step of re-composing the glottal wave, the glottal wave having the voice color of the target speaker is reconstructed using vocal tract characteristic parameters, obtained by using an HMM and the like, together with the glottal wave generated in the step of converting the glottal wave signal, so as to generate a finally converted voice.
Meanwhile, in the step of analyzing the glottal wave, the glottal wave is extracted from the input signal, which is converted into a digital signal through a sampling process, and the characteristic parameters of voice colors are extracted from the glottal wave. When extracting the glottal wave in this step, the vocal tract characteristic is removed as much as possible so that only the glottal wave remains. The glottal wave may be extracted by inverse-filtering the excitation signal using a linear prediction algorithm for voice. To analyze the extracted glottal wave, the glottal wave is differentiated along the time axis to obtain the derivative glottal signal. Fig. 3 is a view showing a glottal wave obtained by using an LF (Liljencrants-Fant) model and a derivative glottal wave thereof according to the present invention. The derivative glottal wave may be represented by using the LF model as follows:

g(t) = E0·e^(αt)·sin(ωg·t) ,  0 < t ≤ Te
g(t) = −(EE/(ε·Ta))·[e^(−ε(t−Te)) − e^(−ε(Tc−Te))] ,  Te < t ≤ Tc    (1)

wherein E0 is a maximum amplitude of the glottal wave signal, EE is a maximum closing speed of the vocal cords as an effective excitation value, Te is a time period while the vocal cords are opened, Tc is a time period while the vocal cords are closed and T0 is a period while the vocal cords are opened and closed; α, ωg, ε and Ta are the remaining shape parameters of the LF model.
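The LF derivative glottal wave above can be sampled numerically. The following is a minimal sketch, not the patent's implementation: the shape parameters alpha, Tp and Ta are illustrative assumptions, since the text fixes only E0, EE, Te, Tc and T0.

```python
import numpy as np

def lf_derivative(E0=1.0, Te=5e-3, Tc=6e-3, T0=10e-3, fs=16000,
                  alpha=200.0, Tp=None, Ta=0.4e-3):
    """One period of an LF-style derivative glottal wave (illustrative).

    Open phase (0 < t <= Te):    g(t) = E0 * exp(alpha*t) * sin(pi*t/Tp)
    Return phase (Te < t <= Tc): exponential recovery toward zero
    Closed phase (Tc < t < T0):  zero
    Tp (peak-flow instant) and Ta (return time constant) are assumed
    shape parameters, not named in the text.
    """
    if Tp is None:
        Tp = 0.8 * Te                       # assumed peak-flow instant
    n = int(round(T0 * fs))
    t = np.arange(n) / fs
    g = np.zeros_like(t)

    op = t <= Te                            # open phase
    g[op] = E0 * np.exp(alpha * t[op]) * np.sin(np.pi * t[op] / Tp)

    # EE: magnitude of the negative peak at the closing instant Te
    EE = -E0 * np.exp(alpha * Te) * np.sin(np.pi * Te / Tp)
    eps = 1.0 / Ta                          # return-phase decay rate
    ret = (t > Te) & (t <= Tc)              # return phase
    g[ret] = -(EE / (eps * Ta)) * (np.exp(-eps * (t[ret] - Te))
                                   - np.exp(-eps * (Tc - Te)))
    return t, g
```

The negative peak of the resulting wave lands at the closing instant Te, which is what the EE parameter measures.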
To represent the glottal wave, a KLGLOT88 model may also be used. The equation is as follows:

g(t) = a·t² − b·t³ ,  for 0 ≤ t ≤ Oq·T0
g(t) = 0 ,  for Oq·T0 < t < T0    (2)

wherein a and b are defined by the voicing amplitude (AV) and by Oq, and may be represented as follows:

a = 27·AV / (4·Oq²·T0²) ,  b = 27·AV / (4·Oq³·T0³)    (3)

wherein the AV is the same parameter as the E0 of the LF model, the Oq is the same parameter as the OQ of the LF model and the T0 is a basic period of the vocal cords.
Fig. 4 is a view showing a derivative glottal wave of the KLGLOT88 model. As shown in Figs. 3 and 4, although the used models are different, the shapes of the glottal waves are equal or similar to each other, because the same characteristic parameters are used.
Fig. 5 is a view showing characteristic parameters of the glottal wave for converting the voice color according to the present invention. The characteristic parameters which are necessary to express various voice colors include OQ, which indicates the opening section of the vocal cords, CQ, which indicates the closing section of the vocal cords, EE (Effective Excitation), which is an effective excitation value, and T0, which is a period of the signal. The characteristic parameters can be obtained during a process of matching the derivative glottal wave with the LF model of Figs. 3 and 4. After modeling the LF derivative glottal wave, an error signal, which indicates the difference between the modeled derivative glottal wave and the actual derivative glottal wave, can be obtained.
Fig. 6 is a view showing a manner of storing voice color data in the voice color databases of the original speaker, the target speaker and the general speaker according to the present invention. The voice color database of the target speaker stores, for each voice color, the characteristic parameters of the glottal wave analyzed in the signal analyzing step for the actual voice of the target speaker, together with the error signal. The voice color database of the general speaker stores, for each voice color, the characteristic parameters of the glottal wave analyzed in the signal analyzing step for the actual voices of ordinary people, together with the error signal. A breathy voice, a pressed voice and a normal voice are representative of the voice colors which are expressed by the glottal wave. On the basis of this fact, the voice color databases of the target speaker and the general speaker store the characteristic parameters of various voice colors in the storing manner shown in Fig. 6. In the case of the breathy voice, the OQ and CQ sections are not divided clearly and the EE section has a flat shape. In the case of the pressed voice, the EE section has a peak shape, and the CQ section is much longer than the OQ section. The normal voice has average OQ, CQ and EE sections. When generating the converted glottal wave, the glottal wave can have various voice colors by adjusting the characteristic parameters such as the OQ, the CQ, the EE, the E0 and the like. The voice color database of the original speaker stores various voice colors of the original speaker, which are divided according to situations, feelings and the like. In the above three voice color databases, the various voice colors correspond to each other according to the divided conditions. The corresponding relationship is set by the histogram of changes in the parameters.
Therefore, the various voice colors of the original speaker can be converted into the voice colors of the target speaker by using the voice color databases of the target speaker and the general speaker. The three voice color databases are constructed to have voice colors corresponding to each other according to voice color numbers.
Herein, to explain the corresponding relationship among the original speaker, the target speaker and the general speaker, it will be shown that the speakers have common characteristics for various voice colors. To this end, the present invention introduces an NAQ parameter. The NAQ parameter is a combination of the above characteristic parameters of the glottal wave and is represented as follows:

NAQ = E0 / (T0·EE)    (4)

That is, the characteristic parameters of the glottal wave are extracted from the glottal wave shown in Fig. 5, and the NAQ parameter is obtained by using Equation 4 to show the distribution thereof, whereby it is possible to grasp the corresponding relationship among the above characteristic parameters.
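Equation 4 can be measured directly from a glottal cycle. The sketch below assumes the reconstructed form NAQ = E0/(T0·EE), taking E0 as the peak of the glottal flow pulse and EE as the magnitude of the steepest closing slope of its derivative:

```python
import numpy as np

def naq(glottal_wave, fs, f0):
    """Normalized amplitude quotient of one glottal cycle (Eq. 4,
    assumed form NAQ = E0 / (T0 * EE)).

    E0: peak amplitude of the glottal flow pulse.
    EE: magnitude of the negative peak of its time derivative
        (maximum closing speed of the vocal folds).
    """
    T0 = 1.0 / f0
    E0 = np.max(glottal_wave)
    d = np.diff(glottal_wave) * fs     # finite-difference derivative
    EE = -np.min(d)                    # effective excitation
    return E0 / (T0 * EE)
```

For a KLGLOT88-shaped pulse the value works out to roughly 4·Oq/27, so NAQ grows with the open quotient, which matches the higher NAQ values reported for breathy voices in Fig. 7.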
Fig. 7 is a graph showing the distribution of NAQ parameters obtained by combining the characteristic parameters of the glottal wave, wherein the glottal waves of breathy (Bre), normal (Nor) and pressed (Pre) voices pronounced by thirteen people are analyzed and then the distribution of NAQ parameters is obtained. Referring to Fig. 7, the NAQ parameters are consistently distributed for a common voice color; that is, the quantity of change of the parameter values is constant for each voice color. Therefore, it is possible to set the corresponding relationship for each voice color.
Fig. 8 is a view showing the shapes of the derivative glottal waves for the three voice colors, in which the change of the characteristic parameters OQ, CQ and EE for each voice color can be observed. Referring to Fig. 8, in the case of the breathy voice, the OQ section and the CQ section are not clearly divided and the EE section has a gentle shape. However, in the case of the pressed voice, the CQ section is relatively longer than the OQ section and the EE section has a peak shape.
In the step of converting the glottal wave, the characteristic parameters of the glottal wave and the error signal which have a desired voice color of the target speaker are extracted, and then the parameter values are converted so as to generate a glottal wave model having the desired voice color of the target speaker. If the desired voice color is not in the voice color database of the target speaker, the characteristic parameters of the desired voice color and the characteristic parameters of the normal voice color are extracted from the voice color database of the general speaker, and the quantity of change between the two sets of characteristic parameters is calculated. Then, the calculated quantity of change is applied to the glottal wave having the normal voice color of the target speaker, thereby generating a glottal wave model having the desired voice of the target speaker. The generated glottal wave model is combined with a fundamental frequency, a duration time and an energy parameter calculated in an existing prosody converting step to generate an actual glottal wave g(t). The generated glottal wave is passed through a filter r(t), which models the radiation effect from the lips when a person speaks, and then through a linear prediction filter v(t) composed of the vocal tract characteristic parameters calculated in an existing vocal tract characteristic converting step, thereby finally generating a desired voice of the target speaker.
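The fallback rule described above — apply the general speakers' normal-to-desired parameter change to the target speaker's normal-color parameters — can be sketched as follows. The dictionary layout and parameter names are assumptions for illustration; the patent does not fix a data format:

```python
def convert_voice_color(target_db, general_db, desired_color):
    """Select or synthesize glottal-wave parameters for a voice color.

    target_db / general_db map a voice color name to a dict of the
    characteristic parameters OQ, CQ, EE and E0 (assumed layout).
    If the target speaker's database lacks the desired color, the
    change observed between the general speakers' "normal" and desired
    colors is added to the target speaker's "normal" parameters.
    """
    params = ("OQ", "CQ", "EE", "E0")
    if desired_color in target_db:
        return dict(target_db[desired_color])
    # quantity of change measured on the general-speaker database
    delta = {p: general_db[desired_color][p] - general_db["normal"][p]
             for p in params}
    # applied to the target speaker's normal voice color
    return {p: target_db["normal"][p] + delta[p] for p in params}
```

For example, if general speakers' breathy voices raise OQ by 0.2 relative to normal, the same +0.2 shift is applied to the target speaker's normal OQ.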
Fig. 9 is a view showing a process in a step of converting the glottal wave according to the present invention, which is represented as follows:
s(t) = g(t) * r(t) * v(t)    (5)

wherein * indicates convolution. The lip radiation filter r(t) may be represented, for example, as a first-order difference:

r(n) = δ(n) − α·δ(n−1) ,  α ≈ 1    (6)

As described above, since the final voice which is generated through the signal conversion step can reproduce the voice of the target speaker as it is, the present invention can provide various converted voices compared with the conventional speaker conversion system which can express only one voice color.
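Equation 5 can be sketched numerically. This minimal implementation assumes r(t) is a first-order difference and v(t) is an all-pole filter built from LPC coefficients; both are standard choices, not details fixed by the text:

```python
import numpy as np

def synthesize(g, lpc_coeffs, alpha=0.98):
    """Sketch of s(t) = g(t) * r(t) * v(t) (Eq. 5).

    r(t): lip radiation, modeled as y[n] = x[n] - alpha*x[n-1] (assumed).
    v(t): all-pole vocal tract from LPC coefficients a_k:
          y[n] = x[n] - sum_k a_k * y[n-k].
    """
    # radiation: first-order difference applied as an FIR convolution
    x = np.convolve(g, [1.0, -alpha])[:len(g)]

    # vocal tract: all-pole IIR filter, computed sample by sample
    a = np.asarray(lpc_coeffs, dtype=float)
    y = np.zeros_like(x)
    for n in range(len(x)):
        acc = x[n]
        for k, ak in enumerate(a, start=1):
            if n - k >= 0:
                acc -= ak * y[n - k]
        y[n] = acc
    return y
```

With an empty coefficient list, v(t) is the identity and only the radiation difference remains, which is a convenient sanity check on the pipeline.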
Fig. 10 is a block diagram explaining a system for converting a glottal wave signal according to the present invention. As shown in Fig. 10, the voice of the speaker is input to an A/D converter and an input buffer to extract a glottal wave; the voice color parameters of the extracted glottal wave are extracted by a command storing unit and a main controller and converted into a glottal wave; and then the final voice is output through a D/A converter and an output buffer.
[Industrial Applicability]
According to the voice color conversion method of the present invention, it is possible to convert a voice of an original speaker into a voice having various voice colors of a target speaker according to situations, feelings and context.
Further, the speaker conversion system of the present invention can be widely applied to animations, movies, plays, commercial films and the like where voices having various and plentiful voice colors are needed.
Furthermore, the speaker conversion system of the present invention can readily substitute for an expert dubbing artist who can mimic a voice of a target speaker. Further, since the speaker conversion system can optionally generate a virtual voice by combining data in the voice color database of the general speaker, it can be used to make voices of various virtual characters.

Claims

[CLAIMS]
[Claim 1]
A voice color conversion method for converting a speaker, comprising the steps of: analyzing a glottal wave signal of a voice of an original speaker; converting the analyzed glottal wave signal; and re-composing the converted glottal wave signal.
[Claim 2]
The method according to claim 1, wherein the step of analyzing the glottal wave signal comprises the steps of: extracting a glottal wave from the voice of the original speaker; and extracting voice color characteristic parameters of the extracted glottal wave.
[Claim 3]
The method according to claim 1, wherein the step of converting the glottal wave signal comprises the steps of: collecting glottal wave characteristic parameters for various voice colors from the voice of the original speaker and creating a voice color database of the original speaker; and collecting glottal wave characteristic parameters for various voice colors from a voice of a target speaker and creating a voice color database of the target speaker.
[Claim 4]
The method according to claim 3, wherein the step of converting the glottal wave signal further comprises the step of collecting glottal wave characteristic parameters for various voice colors from voices of various general speakers and creating a voice color database of the general speakers.
[Claim 5]
The method according to claim 4, wherein the step of converting the glottal wave signal further comprises the steps of creating a corresponding relationship among the voice color databases of the original speaker, the target speaker and the general speakers on the basis of a quantity of change of the voice color characteristic parameters stored in each voice color database.
[Claim 6]
The method according to claim 5, wherein the characteristic parameters comprise an OQ section in which vocal cords are opened, a CQ section in which the vocal cords are closed, EE (Effective Excitation) which is an effective excitation value and T0 which is a period of the signal.
[Claim 7]
The method according to claim 5 or 6, wherein the quantity of change of the characteristic parameters is an NAQ parameter which is defined as follows:

NAQ = E0 / (T0·EE)
[Claim 8]
A voice color conversion system for converting a speaker, comprising: glottal wave extracting means for extracting a glottal wave from a voice of an original speaker; voice color parameter extracting means for extracting voice color parameters from the extracted glottal wave; glottal wave converting means for converting the glottal wave using the voice color parameters; and converted voice generating means for generating a converted voice using the converted glottal wave.
[Claim 9]
The system according to claim 8, wherein the glottal wave extracting means comprises an A/D converter and an input buffer.
[Claim 10]
The system according to claim 8, wherein the voice color parameter extracting means and the glottal wave converting means comprise a command storing unit and a main controller.
[Claim 11]
The system according to claim 8, wherein the converted voice generating means comprises a D/A converter and an output buffer.
[Claim 12]
The system according to claim 8, further comprising a storing unit in which voice color databases of an original speaker, a target speaker and general speakers are stored.
PCT/KR2006/004478 2006-08-09 2006-10-31 Voice color conversion system using glottal waveform WO2008018653A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
KR10-2006-0075140 2006-08-09
KR1020060075140A KR100809368B1 (en) 2006-08-09 2006-08-09 Voice Color Conversion System using Glottal waveform

Publications (1)

Publication Number Publication Date
WO2008018653A1 true WO2008018653A1 (en) 2008-02-14

Family

ID=39033161

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/KR2006/004478 WO2008018653A1 (en) 2006-08-09 2006-10-31 Voice color conversion system using glottal waveform

Country Status (2)

Country Link
KR (1) KR100809368B1 (en)
WO (1) WO2008018653A1 (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2010031437A1 (en) * 2008-09-19 2010-03-25 Asociacion Centro De Tecnologias De Interaccion Visual Y Comunicaciones Vicomtech Method and system of voice conversion
CN103730117A (en) * 2012-10-12 2014-04-16 中兴通讯股份有限公司 Self-adaptation intelligent voice device and method
WO2014058270A1 (en) * 2012-10-12 2014-04-17 Samsung Electronics Co., Ltd. Voice converting apparatus and method for converting user voice thereof
US20160005403A1 (en) * 2014-07-03 2016-01-07 Google Inc. Methods and Systems for Voice Conversion
CN109147758A (en) * 2018-09-12 2019-01-04 科大讯飞股份有限公司 A kind of speaker's sound converting method and device
WO2023242445A1 (en) 2022-06-17 2023-12-21 The Provost, Fellows, Scholars And Other Members Of The Board Of Trinity College Dublin Glottal features extraction using neural networks

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6615174B1 (en) * 1997-01-27 2003-09-02 Microsoft Corporation Voice conversion system and methodology
KR20040061709A (en) * 2002-12-31 2004-07-07 (주) 코아보이스 Voice Color Converter using Transforming Vocal Tract Characteristic and Method
US6950799B2 (en) * 2002-02-19 2005-09-27 Qualcomm Inc. Speech converter utilizing preprogrammed voice profiles

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB8901423D0 (en) * 1989-01-23 1989-03-15 Fujisawa Pharmaceutical Co Pyrazolopyridine compound and processes for preparation thereof
JP4362953B2 (en) * 2000-07-03 2009-11-11 日本軽金属株式会社 Bumpy stay
JP4060562B2 (en) * 2001-05-02 2008-03-12 日本碍子株式会社 Electrode body evaluation method
KR100639968B1 (en) * 2004-11-04 2006-11-01 한국전자통신연구원 Apparatus for speech recognition and method therefor
JP2008002003A (en) * 2006-06-21 2008-01-10 Toray Ind Inc Ground fabric for airbag and method for producing the ground fabric

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6615174B1 (en) * 1997-01-27 2003-09-02 Microsoft Corporation Voice conversion system and methodology
US6950799B2 (en) * 2002-02-19 2005-09-27 Qualcomm Inc. Speech converter utilizing preprogrammed voice profiles
KR20040061709A (en) * 2002-12-31 2004-07-07 (주) 코아보이스 Voice Color Converter using Transforming Vocal Tract Characteristic and Method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
KAIN A. ET AL.: "Spectral voice conversion for text-to-speech synthesis", PROC. ICASSP, 1998, pages 285 - 288 *
STYLIANOU Y. ET AL.: "Continuous probabilistic transform for voice conversion", IEEE TRANS. ON SPEECH AND AUDIO PROCESSING, vol. 6, no. 2, March 1998 (1998-03-01) *

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2010031437A1 (en) * 2008-09-19 2010-03-25 Asociacion Centro De Tecnologias De Interaccion Visual Y Comunicaciones Vicomtech Method and system of voice conversion
CN103730117A (en) * 2012-10-12 2014-04-16 中兴通讯股份有限公司 Self-adaptation intelligent voice device and method
WO2014058270A1 (en) * 2012-10-12 2014-04-17 Samsung Electronics Co., Ltd. Voice converting apparatus and method for converting user voice thereof
EP2908312A4 (en) * 2012-10-12 2015-12-02 Zte Corp Self-adaptive intelligent voice device and method
US9552813B2 (en) 2012-10-12 2017-01-24 Zte Corporation Self-adaptive intelligent voice device and method
US9564119B2 (en) 2012-10-12 2017-02-07 Samsung Electronics Co., Ltd. Voice converting apparatus and method for converting user voice thereof
US10121492B2 (en) 2012-10-12 2018-11-06 Samsung Electronics Co., Ltd. Voice converting apparatus and method for converting user voice thereof
US20160005403A1 (en) * 2014-07-03 2016-01-07 Google Inc. Methods and Systems for Voice Conversion
US9613620B2 (en) * 2014-07-03 2017-04-04 Google Inc. Methods and systems for voice conversion
CN109147758A (en) * 2018-09-12 2019-01-04 科大讯飞股份有限公司 A kind of speaker's sound converting method and device
WO2023242445A1 (en) 2022-06-17 2023-12-21 The Provost, Fellows, Scholars And Other Members Of The Board Of Trinity College Dublin Glottal features extraction using neural networks

Also Published As

Publication number Publication date
KR20080013524A (en) 2008-02-13
KR100809368B1 (en) 2008-03-05

Similar Documents

Publication Publication Date Title
JP4355772B2 (en) Force conversion device, speech conversion device, speech synthesis device, speech conversion method, speech synthesis method, and program
US8898055B2 (en) Voice quality conversion device and voice quality conversion method for converting voice quality of an input speech using target vocal tract information and received vocal tract information corresponding to the input speech
Boril et al. Unsupervised equalization of Lombard effect for speech recognition in noisy adverse environments
JP4456537B2 (en) Information transmission device
JP5581377B2 (en) Speech synthesis and coding method
US20060129399A1 (en) Speech conversion system and method
JP5961950B2 (en) Audio processing device
JP2004522186A (en) Speech synthesis of speech synthesizer
JP5039865B2 (en) Voice quality conversion apparatus and method
CN110648684B (en) Bone conduction voice enhancement waveform generation method based on WaveNet
JP2006171750A (en) Feature vector extracting method for speech recognition
WO2008018653A1 (en) Voice color conversion system using glottal waveform
JP2005266349A (en) Device, method, and program for voice quality conversion
JP2010014913A (en) Device and system for conversion of voice quality and for voice generation
JP4382808B2 (en) Method for analyzing fundamental frequency information, and voice conversion method and system implementing this analysis method
JP2002268660A (en) Method and device for text voice synthesis
Aryal et al. Articulatory inversion and synthesis: towards articulatory-based modification of speech
JP2017520016A (en) Excitation signal generation method of glottal pulse model based on parametric speech synthesis system
KR101560833B1 (en) Apparatus and method for recognizing emotion using a voice signal
Pietruch et al. Methods for formant extraction in speech of patients after total laryngectomy
Furui Robust methods in automatic speech recognition and understanding.
Arslan et al. 3-d face point trajectory synthesis using an automatically derived visual phoneme similarity matrix
Singh et al. Features and techniques for speaker recognition
Del Pozo Voice source and duration modelling for voice conversion and speech repair
Qavi et al. Voice morphing based on spectral features and prosodic modification

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 06812317

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

NENP Non-entry into the national phase

Ref country code: RU

122 Ep: pct application non-entry in european phase

Ref document number: 06812317

Country of ref document: EP

Kind code of ref document: A1