WO2008018653A1 - Voice color conversion system using glottal waveform - Google Patents
- Publication number
- WO2008018653A1 (PCT/KR2006/004478)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- voice
- glottal wave
- speaker
- glottal
- converting
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
- G10L13/033—Voice editing, e.g. manipulating the voice of the synthesiser
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
- G10L13/04—Details of speech synthesis systems, e.g. synthesiser structure or memory management
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/003—Changing voice quality, e.g. pitch or formants
- G10L21/007—Changing voice quality, e.g. pitch or formants characterised by the process used
- G10L21/013—Adapting to target pitch
- G10L2021/0135—Voice conversion or morphing
Definitions
- Fig. 5 is a view showing characteristic parameters of glottal wave for converting the voice color according to the present invention.
- the characteristic parameters which are necessary to express various voice colors include OQ which indicates an opening section of the vocal cords, CQ which indicates a closing section of the vocal cords, EE (Effective Excitation) which is an effective excitation value and T0 which is a period of the signal.
- the characteristic parameters can be obtained during a process of matching the derivative glottal wave with the LF model shown in Figs. 3 and 4. And after modeling the LF derivative glottal wave, an error signal which indicates the difference between the modeled derivative glottal wave and the actual derivative glottal wave can be obtained.
- Fig. 6 is a view showing a manner of storing voice color data in the voice color databases of the original speaker, the target speaker and the general speaker according to the present invention.
- the voice color database of the target speaker stores the characteristic parameters of the glottal wave, which are analyzed in the signal analyzing step for the actual voice of the target speaker, and the error signal for each voice color.
- the voice color database of the general speaker stores the characteristic parameters of the glottal wave, which are analyzed in the signal analyzing step for the actual voices of ordinary people, and the error signal for each voice color.
- a breathy voice, a pressed voice and a normal voice are representative of the voice colors which are expressed by the glottal wave.
- the voice color databases of the target speaker and the general speaker store the characteristic parameters of various voice colors in the manner shown in Fig. 6.
- in a breathy voice, the OQ and CQ sections are not divided clearly and the EE section has a flat shape.
- in a pressed voice, the EE section has a peak shape, and the CQ section is much longer than the OQ section.
- the normal voice has average OQ, CQ and EE sections.
- the glottal wave can have various voice colors by adjusting the characteristic parameters such as the OQ, the CQ, the EE, the E0 and the like.
- the voice color database of the original speaker stores various voice colors of the original speaker, which are divided according to situations, feelings and the like.
- the various voice colors correspond to each other according to the divided conditions.
- the corresponding relationship is set by the histogram of changes in the parameters. Therefore, the various voice colors of the original speaker can be converted into the voice colors of the target speaker by the voice color databases of the target speaker and the general speaker.
- the three voice color databases are constructed to have voice colors correspondent to each other according to voice color numbers.
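As an illustrative sketch (not part of the patent disclosure), the three databases above can be modeled as tables keyed by a shared voice color number, each entry holding the glottal-wave characteristic parameters (OQ, CQ, EE, T0) and the error signal; all names and values below are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class VoiceColorEntry:
    oq: float                 # OQ: section in which the vocal cords are opened
    cq: float                 # CQ: section in which the vocal cords are closed
    ee: float                 # EE: effective excitation (maximum closing speed)
    t0: float                 # T0: period of the signal, in seconds
    error: list = field(default_factory=list)  # error signal vs. the LF model

# One database per speaker role; shared voice color numbers keep the
# databases in correspondence (e.g. 0 = normal, 1 = breathy, 2 = pressed).
target_db = {0: VoiceColorEntry(0.60, 0.40, 1.00, 0.008)}
general_db = {
    0: VoiceColorEntry(0.55, 0.45, 0.90, 0.008),
    1: VoiceColorEntry(0.75, 0.25, 0.50, 0.008),
}
```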
- the NAQ parameter is a combination of the above characteristic parameters of the glottal wave and is represented as follows: NAQ = E0 / (T0 · EE)
- the characteristic parameters of the glottal wave are extracted from the glottal wave shown in Fig. 5, and the NAQ parameter is obtained by using Equation 4 to show the distribution thereof, whereby it is possible to grasp the corresponding relationship among the above characteristic parameters.
- Fig. 7 is a graph showing the distribution of NAQ parameters obtained by combining the characteristic parameters of the glottal wave, wherein the glottal waves of breathy (Bre), normal (Nor) and pressed (Pre) voices pronounced by thirteen people are analyzed and then the distribution of NAQ parameters is obtained.
- the NAQ parameters are narrowly distributed within each voice color, which means that the quantity of change of the parameter values is nearly constant for a given voice color. Therefore, it is possible to set the corresponding relationship for each voice color.
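The NAQ combination of the parameters above can be sketched as a small helper; the formula NAQ = E0 / (T0 · EE) is reconstructed from the garbled equation in this document, and the sample values are illustrative assumptions, not measurements from the patent.

```python
def naq(e0: float, ee: float, t0: float) -> float:
    """Normalized amplitude quotient: NAQ = E0 / (T0 * EE)."""
    return e0 / (t0 * ee)

# A gently closing (breathy) wave has a small EE and hence a larger NAQ
# than a sharply closing (pressed) wave with a large EE.
breathy = naq(e0=1.0, ee=2.0, t0=0.01)
pressed = naq(e0=1.0, ee=10.0, t0=0.01)
```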
- Fig. 8 is a view showing a shape of derivative glottal waves for the three voice colors, in which the change of the characteristic parameters OQ, CQ and EE for each voice color is observed.
- in the breathy voice, the OQ section and the CQ section are not clearly divided and the EE section has a gentle shape.
- in the pressed voice, the CQ section is relatively longer than the OQ section and the EE section has a peak shape.
- the characteristic parameters of the glottal wave and the error signal which have a desired voice color of the target speaker are extracted from the voice color database of the target speaker, and then the parameter values are converted so as to generate a glottal wave model having the desired voice color of the target speaker. If the desired voice color is not in the voice color database of the target speaker, the characteristic parameters of the desired voice color and the characteristic parameters of the normal voice color are extracted from the voice color database of the general speaker, and the quantity of change between the two sets of characteristic parameters is calculated. Then, the calculated quantity of change is applied to the glottal wave having the normal voice color of the target speaker, thereby generating a glottal wave model having the desired voice color of the target speaker.
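The conversion step above, in which the quantity of change between two general-speaker voice colors is applied to the target speaker's normal parameters, can be sketched as follows. Applying the change additively is an assumption (the patent does not fix the exact arithmetic), and all parameter values are hypothetical.

```python
def convert_parameters(target_normal, general_normal, general_desired):
    """Apply the general speakers' quantity of change (desired voice
    color vs. normal voice color) to the target speaker's normal
    parameters, yielding the desired voice color for the target."""
    return {k: target_normal[k] + (general_desired[k] - general_normal[k])
            for k in target_normal}

# Hypothetical parameter sets (OQ, CQ, EE); values are illustrative only.
target_normal = {"OQ": 0.60, "CQ": 0.40, "EE": 1.00}
general_normal = {"OQ": 0.55, "CQ": 0.45, "EE": 0.90}
general_breathy = {"OQ": 0.75, "CQ": 0.25, "EE": 0.50}

target_breathy = convert_parameters(target_normal, general_normal, general_breathy)
# OQ becomes 0.60 + (0.75 - 0.55) = 0.80
```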
- the generated glottal wave model is combined with the basic frequency, duration and energy parameters calculated in the existing prosody converting step to generate an actual glottal wave g(t). The generated glottal wave is passed through a filter r(t), which models the radiation effect from the lips when a man speaks, and then through a linear prediction filter v(t) composed of the vocal tract characteristic parameters calculated in the existing vocal tract characteristic converting step, thereby finally generating the desired voice of the target speaker.
- Fig. 9 is a view showing a process in a step of converting the glottal wave according to the present invention, which is represented as follows:
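Assuming the process of Fig. 9 cascades the glottal wave g(t) through the radiation filter r(t) and then the vocal tract filter v(t) as discrete convolutions, a minimal sketch is given below; the filter values are toy assumptions, not taken from the patent.

```python
def convolve(x, h):
    """Discrete convolution of two finite sequences."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y

def synthesize(g, r, v):
    """s(t) = g(t) * r(t) * v(t): the glottal wave through the
    lip-radiation filter r(t), then the vocal tract filter v(t)."""
    return convolve(convolve(g, r), v)

g = [0.0, 1.0, 0.5]   # toy glottal wave samples
r = [1.0, -1.0]       # first-difference radiation filter (an assumption)
v = [1.0]             # identity vocal tract filter, as a placeholder
s = synthesize(g, r, v)
```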
- the present invention can provide various converted voices compared with the conventional speaker conversion system which can express only one voice color.
- Fig. 10 is a block diagram explaining a system for converting glottal wave signal according to the present invention.
- the voice of the speaker is input to an A/D converter and an input buffer to extract a glottal wave.
- voice color parameters of the extracted glottal wave are extracted by a command storing unit and a main controller and converted into a glottal wave, and then the final voice is output through a D/A converter and an output buffer.
- the speaker conversion system of the present invention can be widely applied to animations, movies, plays, commercial films and the like where voices having various and plentiful voice colors are needed.
Abstract
Disclosed are a speaker conversion system which can express various voice colors and, more particularly, a voice color conversion method and system which can generate a converted voice having various voice colors according to an utterance situation by converting the glottal wave.
Description
VOICE COLOR CONVERSION SYSTEM USING GLOTTAL WAVEFORM
[Technical Field] The present invention relates to a speaker conversion system which can express various voice colors and, more particularly, to a voice color conversion method and system which can generate a converted voice having various voice colors according to an utterance situation by converting the glottal wave.
[Background Art]
FIG. 1 shows a process of generating a voice upon speaking. When a man speaks, air from a bronchial tube passes through the vocal cords and thus a glottal wave is formed. At this time, since the man pronounces while breathing out, noise (aspirated sound) due to out-breathing is contained in the glottal wave. While the glottal wave passes through a vocal tract, an articulating phenomenon occurs. Then, the glottal wave is radiated into the air and thus generates a voice. The present invention relates to the conversion of the glottal wave generated when a man speaks and grows out of the fact that the shape of the glottal wave changes according to environmental status, feelings and the like when a man speaks, and thus a voice having various voice colors can be generated.
As a conventional method of mimicking a voice of a specific person (hereinafter, called "target speaker"), there are a mimicking method employing an expert dubbing artist and a mimicking method using a computer.
In the case of the mimicking method employing an expert dubbing artist, it is possible to mimic a prosodic characteristic with respect to a certain part of the voice; however, it is difficult to express various and natural voice colors.
In the case of the method using a computer, which converts a voice of a certain speaker (hereinafter called "original speaker") into that of the target speaker, there are mimicking methods using the Hidden Markov Model (HMM), the Gaussian Mixture Model (GMM) and the Neural Network.
In the conventional methods using the HMM, the GMM and the Neural Network, firstly, vocal tract characteristic parameters of voice such as LPC (Linear Prediction Coefficient), LSP (Line Spectral Pair), MFCC (Mel-Frequency Cepstral Coefficient) and HNM (Harmonic and Noise Model) characteristics are extracted from the voices of the original speaker and the target speaker, and the HMM or GMM is trained by using the characteristic parameters of each speaker, and then a conversion function between the trained models is calculated so that the vocal tract characteristic of the
original speaker is converted into that of the target speaker. In the case of the prosody, the prosody of the target speaker is modeled and then applied to the converted voice. In a method of mimicking the prosody of the target speaker, after forming a pitch histogram of the original speaker and the target speaker, an excitation signal matched with the histogram is used. In the case of the glottal wave, an excitation signal from which the basic frequency information of the target speaker is excluded is applied to the converted voice. In the conventional speaker conversion method, it is possible to convert the prosody and the vocal tract characteristic. However, since the excitation signal of the target speaker is used for the vocal cord characteristic, there is a problem that the voice color contained in the given excitation signal of the target speaker has to be used as it is.
Therefore, in the conventional method it is difficult to exactly express various voice colors according to situations; it is impossible to reflect the change of voice color according to feelings, environments and context when the original speaker speaks; and it is possible only to convert the voice of the original speaker into a predetermined voice color.
[Disclosure]
[Technical Problem]
An object of the present invention is to provide a method of converting voice color, which can generate a voice of a certain speaker without employing an expert dubbing artist who can mimic the voice and which can also generate voices having various voice colors anywhere and at any time according to the feelings of the certain person.
Another object of the present invention is to provide a method of converting voice color, which can generate voices having various voice colors according to situations and conditions.
[Technical Solution]
According to the present invention, there is provided a voice color conversion method for converting a speaker, comprising the steps of analyzing a glottal wave signal of a voice of an original speaker; converting the analyzed glottal wave signal; and re-composing the converted glottal wave signal. And the step of analyzing the glottal wave signal comprises the steps of extracting a glottal wave from the voice of the original speaker; and extracting voice color characteristic parameters of the extracted glottal wave.
Further, the step of converting the glottal wave signal comprises the steps of collecting glottal wave characteristic
parameters for various voice colors from the voice of the original speaker and creating a voice color database of the original speaker; and collecting glottal wave characteristic parameters for various voice colors from a voice of a target speaker and creating a voice color database of the target speaker. The step of converting the glottal wave signal further comprises the steps of collecting glottal wave characteristic parameters for various voice colors from voices of various general speakers and creating a voice color database of the general speakers. The step of converting the glottal wave signal further comprises the step of creating a corresponding relationship among the voice color databases of the original speaker, the target speaker and the general speakers on the basis of a quantity of change of the voice color characteristic parameters stored in each voice color database.
Further, the characteristic parameters comprise an OQ section in which the vocal cords are opened, a CQ section in which the vocal cords are closed, EE (Effective Excitation) which is an effective excitation value and T0 which is a period of the signal. The quantity of change of the characteristic parameters is an NAQ parameter which is defined as follows:
NAQ = E0 / (T0 · EE)    (4)
In addition, there is provided a voice color conversion system for converting a speaker, comprising glottal wave extracting means for extracting a glottal wave from a voice of an original speaker; voice color parameter extracting means for extracting voice color parameters from the extracted glottal wave; glottal wave converting means for converting the glottal wave using the voice color parameters; and converted voice generating means for generating a converted voice using the converted glottal wave. Preferably, the glottal wave extracting means comprises an A/D converter and an input buffer, the voice color parameter extracting means and the glottal wave converting means comprise a command storing unit and a main controller, and the converted voice generating means comprises a D/A converter and an output buffer. Further, the system comprises a storing unit in which voice color databases of an original speaker, a target speaker and general speakers are stored.
[Advantageous Effects]
According to the voice color conversion method of the present invention, it is possible to convert a voice of an original speaker into a voice having various voice colors of a target speaker according to situations, feelings and context. Furthermore, the speaker conversion system of the present invention can readily substitute for an expert dubbing artist who can mimic a voice of a target speaker. Further, since the speaker conversion system can optionally generate a virtual voice by combining the voice color databases of the general speakers, it can be used to make voices of various virtual characters.
[Description of Drawings]
The above and other objects, features and advantages of the present invention will become apparent from the following description of preferred embodiments given in conjunction with the accompanying drawings, in which:
Fig. 1 is a view showing a mechanism for generating a voice when a man speaks; Fig. 2 is a view showing a process of converting the glottal wave to express various voice colors according to the present invention;
Fig. 3 is a view showing a glottal wave obtained by using an LF (Liljencrants-Fant) model and a derivative glottal wave thereof according to the present invention;
Fig. 4 is a view showing a derivative glottal wave of a KLGLOT88 model according to the present invention;
Fig. 5 is a view showing characteristic parameters of glottal wave for converting the voice color according to the present invention;
Fig. 6 is a view showing a manner of storing voice color data in voice color databases of an original speaker, a target speaker and a general speaker according to the present invention; Fig. 7 is a graph showing distribution of NAQ parameters for various voice colors according to the present invention;
Fig. 8 is a view showing a change of vocal glottal waves having various voice colors according to the present invention; Fig. 9 is a view showing a process in a step of converting the glottal wave according to the present invention; and
Fig. 10 is a block diagram explaining a system for converting glottal wave signal according to the present invention.
[Best Mode]
Hereinafter, the embodiments of the present invention will be described in detail with reference to accompanying drawings .
Fig. 2 is a view showing a process of converting glottal wave to express various voice colors according to the present invention. As shown in Fig. 2, a method of converting a voice according to the present invention includes a step of analyzing glottal wave signal, a step of converting the
glottal wave signal and a step of re-composing the glottal wave signal.
The step of analyzing the glottal wave signal includes a step of extracting the glottal wave from an input voice signal of the original speaker and a step of extracting parameters of various voice colors from the extracted glottal wave. In the step of converting the glottal wave signal, the parameters of voice colors extracted from the glottal wave are converted into glottal wave signals using data of the voice color databases. At this time, the voice color databases include a voice color database of an original speaker, a voice color database of a target speaker and a voice color database of a general speaker.
The voice color database of the target speaker, as a model representing the native voice color of only the target speaker, is a database in which characteristic parameters of the target speaker are collected. The voice color database of the general speaker is a database in which characteristic parameters of various voice colors extracted from general speakers are collected. In the step of converting the glottal wave signal, the characteristic parameters of voice colors of the original speaker, which are extracted in the step of analyzing the glottal wave signal, are converted into the glottal wave having the voice color of the target speaker using the voice color characteristic parameters in which the voice color models of the target speaker and the general speaker are weighted and combined with each other.
In the step of re-composing the glottal wave, the glottal wave having the voice color of the target speaker is reconstructed using vocal tract characteristic parameters obtained by using the HMM and the like, together with the glottal wave generated in the step of converting the glottal wave signal, so as to generate a finally converted voice.
Meanwhile, in the step of analyzing the glottal wave, the glottal wave is extracted from an input signal which is converted into a digital signal through a sampling process, and the characteristic parameters of voice colors are extracted from the glottal wave. When extracting the glottal wave in this step, the vocal tract characteristic is removed as much as possible so that only the glottal wave remains. The glottal wave may be extracted by inverse-filtering the excitation signal using a linear prediction algorithm of the voice. To analyze the extracted glottal wave, the glottal wave is differentiated along a time axis to obtain a glottal derivative signal. Fig. 3 is a view showing a glottal wave obtained by using an LF (Liljencrants-Fant) model and a derivative glottal wave thereof according to the present invention. The derivative glottal wave may be represented by using the LF model as follows:
g(t) = E0·e^(a·t)·sin(ωg·t), 0 < t ≤ Te    (1)
wherein E0 is a maximum amplitude of the glottal wave signal, EE is a maximum closing speed of the vocal cords as an effective excitation value, Te is a time period during which the vocal cords are opened, Tc is a time period during which the vocal cords are closed and T0 is a period in which the vocal cords are opened and closed.
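As a rough numeric illustration of the LF-model open phase above, the growing sinusoid can be generated directly; the sketch below assumes Python with numpy, and the parameter values (E0, a, ωg, Te, sampling rate) are illustrative assumptions, not values from the patent.

```python
# A minimal numeric sketch of the LF-model open phase:
# g(t) = E0 * exp(a*t) * sin(wg*t) for 0 < t <= Te.
# All parameter values here are illustrative assumptions.
import numpy as np

def lf_open_phase(E0, a, wg, Te, fs=16000):
    """Derivative glottal wave over the open phase of the LF model."""
    n = int(round(Te * fs))          # number of samples in the open phase
    t = np.arange(n) / fs
    return E0 * np.exp(a * t) * np.sin(wg * t)

# A 6 ms open phase: an exponentially growing sinusoid scaled by E0.
g = lf_open_phase(E0=1.0, a=200.0, wg=2 * np.pi / 0.008, Te=0.006)
```

A full LF implementation would also append the return phase between Te and Tc, so that EE, the maximum closing speed, appears at the glottal closing instant.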
To represent the glottal wave, a KLGLOT88 model may also be used. The equation is as follows:
g(t) = a·t^2 - b·t^3, for 0 < t ≤ Oq·T0
g(t) = 0, for Oq·T0 < t < T0    (2)

wherein a and b are defined in terms of the voicing amplitude (AV) and Oq, and may be represented as follows:

a = 27·AV / (4·Oq^2·T0), b = 27·AV / (4·Oq^3·T0^2)    (3)

wherein AV is the same parameter as E0 of the LF model, Oq is the same parameter as OQ of the LF model, and T0 is the basic period of the vocal cords.
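Using the reconstructed coefficients a = 27·AV/(4·Oq²·T0) and b = 27·AV/(4·Oq³·T0²), the KLGLOT88 pulse of Equations (2) and (3) can be sketched as follows; numpy is assumed, and the AV, Oq and T0 values are illustrative, not taken from the patent.

```python
# Sketch of the KLGLOT88 glottal pulse: g(t) = a*t**2 - b*t**3 over the open
# phase 0 <= t < Oq*T0 and zero over the closed phase. The a and b formulas
# follow the reconstruction in the text; AV, Oq, T0 values are illustrative.
import numpy as np

def klglott88(t, AV, Oq, T0):
    a = 27.0 * AV / (4.0 * Oq**2 * T0)
    b = 27.0 * AV / (4.0 * Oq**3 * T0**2)
    return np.where((t >= 0) & (t < Oq * T0), a * t**2 - b * t**3, 0.0)

AV, Oq, T0 = 1.0, 0.6, 0.01               # amplitude, open quotient, period
t = np.arange(0, T0, T0 / 200.0)
g = klglott88(t, AV, Oq, T0)              # one pitch period of the pulse
```

With these coefficients the pulse peaks at t = (2/3)·Oq·T0 and returns to zero exactly at t = Oq·T0, the boundary between the OQ and CQ sections.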
Fig. 4 is a view showing a derivative glottal wave of the KLGLOT88 model. As shown in Figs. 3 and 4, although the models used are different, the shapes of the glottal waves are equal or similar to each other, because the same characteristic parameters are used.
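The extraction by inverse filtering mentioned in the analysis step can be sketched with a linear-prediction inverse filter. This assumes Python with numpy/scipy, uses a toy one-pole "vocal tract" rather than real speech, and omits the lip-radiation compensation and iteration that a practical glottal inverse filter would add.

```python
# Sketch of glottal extraction by LP inverse filtering: estimate an all-pole
# vocal tract model from the speech, then filter the speech through the
# inverse filter A(z) so that mainly the excitation remains.
import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import lfilter

def lpc(x, order):
    """Autocorrelation-method LP coefficients [1, -a1, ..., -ap] of A(z)."""
    r = np.correlate(x, x, mode="full")[len(x) - 1:len(x) + order]
    a = solve_toeplitz((r[:-1], r[:-1]), r[1:])   # Yule-Walker solve
    return np.concatenate(([1.0], -a))

def inverse_filter(x, order=12):
    """Residual obtained by filtering x through the inverse filter A(z)."""
    return lfilter(lpc(x, order), [1.0], x)

# Toy example: an impulse train shaped by a one-pole 'vocal tract'.
fs = 8000
excitation = np.zeros(fs // 10)
excitation[::80] = 1.0                             # 100 Hz pulse train
speech = lfilter([1.0], [1.0, -0.9], excitation)   # pole near z = 0.9
residual = inverse_filter(speech, order=4)         # approximates excitation
```

The residual here recovers something close to the pulse train; on real speech, differentiating this residual along the time axis would give the derivative glottal signal that is matched against the LF or KLGLOT88 model.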
Fig. 5 is a view showing characteristic parameters of the glottal wave for converting the voice color according to the present invention. The characteristic parameters which are necessary to express various voice colors include OQ, which indicates an opening section of the vocal cords; CQ, which indicates a closing section of the vocal cords; EE (Effective Excitation), which is an effective excitation value; and T0, which is a period of the signal. The characteristic parameters can be obtained during a process of matching the derivative glottal wave with the LF model, as shown in Figs. 3 and 4. After modeling the LF derivative glottal wave, an error signal which indicates a difference between the modeled derivative glottal wave and the actual derivative glottal wave can be obtained.
Fig. 6 is a view showing a manner of storing voice color data in the voice color databases of the original speaker, the target speaker and the general speaker according to the present invention. The voice color database of the target speaker stores the characteristic parameters of the glottal wave analyzed in the signal analyzing step for the actual voice of the target speaker, together with the error signal for each voice color. The voice color database of the general speaker stores the characteristic parameters of the glottal wave analyzed in the signal analyzing step for the actual voices of ordinary people, together with the error signal for each voice color. A breathy voice, a pressed voice and a normal voice are representative of the voice colors which can be expressed by the glottal wave. On this basis, the voice color databases of the target speaker and the general speaker store the characteristic parameters of various voice colors in the storing manner shown in Fig. 6. In the case of the breathy voice, the OQ and CQ sections are not clearly divided and the EE section has a flat shape. In the case of the pressed voice, the EE section has a peak shape, and the CQ section is much longer than the OQ section. The normal voice has average OQ, CQ and EE sections. When generating the converted glottal wave, the glottal wave can be given various voice colors by adjusting the characteristic parameters such as OQ, CQ, EE, E0 and the like. The voice color database of the original speaker stores various voice colors of the original speaker, which are classified according to situations, feelings and the like. In the above three voice color databases, the various voice colors correspond to each other according to the classified conditions. The corresponding relationship is set by the histogram of changes in the parameters. Therefore, the various voice colors of the original speaker can be converted into the voice colors of the target speaker by means of the voice color databases of the target speaker and the general speaker. The three voice color databases are constructed to have voice colors corresponding to each other according to voice color numbers.
Herein, to explain the corresponding relationship among the original speaker, the target speaker and the general speaker, it will be shown that the speakers have common characteristics for various voice colors. To this end, the present invention introduces an NAQ parameter. The NAQ parameter is a combination of the above characteristic parameters of the glottal wave and is represented as follows:
NAQ = E0 / (T0·EE)    (4)
That is, the characteristic parameters of the glottal wave are extracted from the glottal wave shown in Fig. 5, and the NAQ parameter is obtained by using Equation 4 to show the distribution thereof, whereby it is possible to grasp the corresponding relationship among the above characteristic parameters.
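Taking the NAQ quotient as E0/(T0·EE), as reconstructed from Equation (4) and claim 7, the computation per analysis frame is a one-liner. The frame values below are invented solely to illustrate the tendency that a flatter EE (breathy voice) yields a larger NAQ than a sharp EE peak (pressed voice).

```python
# NAQ from the glottal parameters of one analysis frame, assuming the
# reconstructed form NAQ = E0 / (T0 * EE). Frame values are made up.
def naq(E0, EE, T0):
    """Normalized amplitude quotient of a glottal-wave frame."""
    return E0 / (T0 * EE)

breathy = naq(E0=0.8, EE=40.0, T0=0.008)    # weak closing peak -> larger NAQ
pressed = naq(E0=0.8, EE=400.0, T0=0.008)   # sharp closing peak -> smaller NAQ
```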
Fig. 7 is a graph showing the distribution of NAQ parameters obtained by combining the characteristic parameters of the glottal wave, wherein the glottal waves of breathy (Bre), normal (Nor) and pressed (Pre) voices pronounced by thirteen people are analyzed and then the distribution of NAQ parameters is obtained. Referring to Fig. 7, the NAQ parameters are distributed consistently for a common voice color, and thus the quantity of change of the parameter values between voice colors is constant for each voice color. Therefore, it is possible to set the corresponding relationship for each voice color.
Fig. 8 is a view showing shapes of derivative glottal waves for the three voice colors, in which the change of the characteristic parameters OQ, CQ and EE for each voice color can be observed. Referring to Fig. 8, in the case of the breathy voice, the OQ section and the CQ section are not clearly divided and the EE section has a gentle shape. However, in the case of the pressed voice, the CQ section is relatively longer than the OQ section and the EE section has a peak shape.
In the step of converting the glottal wave, the characteristic parameters of the glottal wave and the error signal which have a desired voice color of the target speaker are extracted, and then the parameter values are converted so as to generate a glottal wave model having the desired voice color of the target speaker. If the desired voice color is not present in the voice color database of the target speaker, the characteristic parameters of the desired voice color and the characteristic parameters of the normal voice color are extracted from the voice color database of the general speaker, and the quantity of change between the two sets of characteristic parameters is calculated. Then, the calculated quantity of change is applied to the glottal wave having the normal voice color of the target speaker, thereby generating a glottal wave model having the desired voice of the target speaker. The generated glottal wave model is combined with a basic frequency, a duration time and an energy parameter calculated in an existing prosody converting step to generate an actual glottal wave g(t). The generated glottal wave is passed through a filter r(t) which models the radiation effect from the lips when a man speaks, and then passed through a linear prediction filter v(t) comprised of the vocal tract characteristic parameters calculated in an existing vocal tract characteristic converting step, thereby finally generating a desired voice of the target speaker.
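The fallback rule described above — shifting the target speaker's normal-voice parameters by the change observed between the general speakers' normal and desired colors — can be sketched as follows. The dictionary fields and numbers are illustrative stand-ins, not contents of the patent's databases.

```python
# Apply the general speakers' voice-color shift to the target speaker's
# normal-color glottal parameters when the target database lacks the color.
def convert_params(target_normal, general_normal, general_desired):
    """target_desired[k] = target_normal[k] + (general_desired[k] - general_normal[k])"""
    return {k: target_normal[k] + (general_desired[k] - general_normal[k])
            for k in target_normal}

target_normal   = {"OQ": 0.55, "CQ": 0.45, "EE": 200.0}   # illustrative values
general_normal  = {"OQ": 0.50, "CQ": 0.50, "EE": 210.0}
general_breathy = {"OQ": 0.70, "CQ": 0.30, "EE": 90.0}
target_breathy = convert_params(target_normal, general_normal, general_breathy)
```

The result widens OQ, shortens CQ and flattens EE relative to the target's normal voice, moving the pulse toward the breathy shape of Fig. 6.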
Fig. 9 is a view showing a process in a step of converting the glottal wave according to the present invention, which is represented as follows:
s(t) = g(t) * r(t) * v(t)    (5)

wherein * indicates convolution.
The filter r(t) may be represented as follows:
(6)

As described above, since the final voice generated through the signal conversion step can reproduce the voice of the target speaker as it is, the present invention can provide various converted voices, in contrast with the conventional speaker conversion system which can express only one voice color.
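The synthesis chain of Equation (5) — the converted glottal wave g(t) passed through the lip-radiation filter r(t) and the vocal-tract filter v(t) — reduces to two discrete convolutions. In the sketch below, a first-order differencer stands in for lip radiation and a short FIR stands in for the vocal tract; both are illustrative assumptions, not the patent's filters.

```python
# Discrete realization of s = g * r * v: a toy glottal pulse train, a
# first-order differencer standing in for lip radiation, and a short FIR
# standing in for the vocal tract. All coefficients are illustrative.
import numpy as np

g = np.zeros(160)
g[::40] = 1.0                            # toy glottal pulse train
r = np.array([1.0, -0.95])               # lip radiation as a differencer
v = np.array([1.0, 0.5, 0.25, 0.125])    # toy vocal-tract impulse response
s = np.convolve(np.convolve(g, r), v)    # s(t) = g(t) * r(t) * v(t)
```

In the system described here, v(t) would instead be the linear prediction synthesis filter built from the converted vocal tract parameters, applied recursively rather than as an FIR.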
Fig. 10 is a block diagram explaining a system for converting a glottal wave signal according to the present invention. As shown in Fig. 10, the voice of the speaker is input through an A/D converter and an input buffer to extract a glottal wave; voice color parameters of the extracted glottal wave are extracted and converted into a glottal wave by a command storing unit and a main controller; and then the final voice is output through a D/A converter and an output buffer.
[Industrial Applicability]
According to the voice color conversion method of the present invention, it is possible to convert a voice of an original speaker into a voice having various voice colors of a target speaker according to situations, feelings and context.
Further, the speaker conversion system of the present invention can be widely applied to animations, movies, plays, commercial films and the like where voices having various and plentiful voice colors are needed.
Furthermore, the speaker conversion system of the present invention can readily substitute for an expert dubbing artist who can mimic a voice of a target speaker. Further, since the speaker conversion system can optionally generate a virtual voice by combining entries of the voice color database of the general speaker, it can be used to make voices of various virtual characters.
Claims
[CLAIMS]
[Claim 1]
A voice color conversion method for converting a speaker, comprising the steps of: analyzing a glottal wave signal of a voice of an original speaker; converting the analyzed glottal wave signal; and re-composing the converted glottal wave signal.
[Claim 2]
The method according to claim 1, wherein the step of analyzing the glottal wave signal comprises the steps of: extracting a glottal wave from the voice of the original speaker; and extracting voice color characteristic parameters of the extracted glottal wave.
[Claim 3]
The method according to claim 1, wherein the step of converting the glottal wave signal comprises the steps of: collecting glottal wave characteristic parameters for various voice colors from the voice of the original speaker and creating a voice color database of the original speaker; and collecting glottal wave characteristic parameters for
various voice colors from a voice of a target speaker and creating a voice color database of the target speaker.
[Claim 4]
The method according to claim 3, wherein the step of converting the glottal wave signal further comprises the steps of collecting glottal wave characteristic parameters for various voice colors from voices of various general speakers and creating a voice color database of the general speakers.
[Claim 5]
The method according to claim 4, wherein the step of converting the glottal wave signal further comprises the step of creating a corresponding relationship among the voice color databases of the original speaker, the target speaker and the general speakers on the basis of a quantity of change of the voice color characteristic parameters stored in each voice color database.
[Claim 6]
The method according to claim 5, wherein the characteristic parameters comprise an OQ section in which vocal cords are opened, a CQ section in which the vocal cords are closed, EE (Effective Excitation) which is an effective excitation value and T0 which is a period of the signal.
[Claim 7]
The method according to claim 5 or 6, wherein the quantity of change of the characteristic parameters is an NAQ parameter which is defined as follows:
NAQ = E0 / (T0·EE)
[Claim 8] A voice color conversion system for converting a speaker, comprising: glottal wave extracting means for extracting a glottal wave from a voice of an original speaker; voice color parameter extracting means for extracting voice color parameters from the extracted glottal wave; glottal wave converting means for converting the glottal wave using the voice color parameters; and converted voice generating means for generating a converted voice using the converted glottal wave.
[Claim 9]
The system according to claim 8, wherein the glottal wave extracting means comprises an A/D converter and an input buffer.
[Claim 10]
The system according to claim 8, wherein the voice color parameter extracting means and the glottal wave converting means comprise a command storing unit and a main controller.
[Claim 11]
The system according to claim 8, wherein the converted voice generating means comprises a D/A converter and an output buffer.
[Claim 12]
The system according to claim 8, further comprising a storing unit in which voice color databases of an original speaker, a target speaker and general speakers are stored.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
KR10-2006-0075140 | 2006-08-09 | ||
KR1020060075140A KR100809368B1 (en) | 2006-08-09 | 2006-08-09 | Voice Color Conversion System using Glottal waveform |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2008018653A1 true WO2008018653A1 (en) | 2008-02-14 |
Family
ID=39033161
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/KR2006/004478 WO2008018653A1 (en) | 2006-08-09 | 2006-10-31 | Voice color conversion system using glottal waveform |
Country Status (2)
Country | Link |
---|---|
KR (1) | KR100809368B1 (en) |
WO (1) | WO2008018653A1 (en) |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2010031437A1 (en) * | 2008-09-19 | 2010-03-25 | Asociacion Centro De Tecnologias De Interaccion Visual Y Comunicaciones Vicomtech | Method and system of voice conversion |
CN103730117A (en) * | 2012-10-12 | 2014-04-16 | 中兴通讯股份有限公司 | Self-adaptation intelligent voice device and method |
WO2014058270A1 (en) * | 2012-10-12 | 2014-04-17 | Samsung Electronics Co., Ltd. | Voice converting apparatus and method for converting user voice thereof |
US20160005403A1 (en) * | 2014-07-03 | 2016-01-07 | Google Inc. | Methods and Systems for Voice Conversion |
CN109147758A (en) * | 2018-09-12 | 2019-01-04 | 科大讯飞股份有限公司 | A kind of speaker's sound converting method and device |
WO2023242445A1 (en) | 2022-06-17 | 2023-12-21 | The Provost, Fellows, Scholars And Other Members Of The Board Of Trinity College Dublin | Glottal features extraction using neural networks |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6615174B1 (en) * | 1997-01-27 | 2003-09-02 | Microsoft Corporation | Voice conversion system and methodology |
KR20040061709A (en) * | 2002-12-31 | 2004-07-07 | (주) 코아보이스 | Voice Color Converter using Transforming Vocal Tract Characteristic and Method |
US6950799B2 (en) * | 2002-02-19 | 2005-09-27 | Qualcomm Inc. | Speech converter utilizing preprogrammed voice profiles |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
GB8901423D0 (en) * | 1989-01-23 | 1989-03-15 | Fujisawa Pharmaceutical Co | Pyrazolopyridine compound and processes for preparation thereof |
JP4362953B2 (en) * | 2000-07-03 | 2009-11-11 | 日本軽金属株式会社 | Bumpy stay |
JP4060562B2 (en) * | 2001-05-02 | 2008-03-12 | 日本碍子株式会社 | Electrode body evaluation method |
KR100639968B1 (en) * | 2004-11-04 | 2006-11-01 | 한국전자통신연구원 | Apparatus for speech recognition and method therefor |
JP2008002003A (en) * | 2006-06-21 | 2008-01-10 | Toray Ind Inc | Ground fabric for airbag and method for producing the ground fabric |
-
2006
- 2006-08-09 KR KR1020060075140A patent/KR100809368B1/en not_active IP Right Cessation
- 2006-10-31 WO PCT/KR2006/004478 patent/WO2008018653A1/en active Application Filing
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6615174B1 (en) * | 1997-01-27 | 2003-09-02 | Microsoft Corporation | Voice conversion system and methodology |
US6950799B2 (en) * | 2002-02-19 | 2005-09-27 | Qualcomm Inc. | Speech converter utilizing preprogrammed voice profiles |
KR20040061709A (en) * | 2002-12-31 | 2004-07-07 | (주) 코아보이스 | Voice Color Converter using Transforming Vocal Tract Characteristic and Method |
Non-Patent Citations (2)
Title |
---|
KAIN A. ET AL.: "Spectral voice conversion for text-to-speech synthesis", PROC. ICASSP, 1998, pages 285 - 288 * |
STYLIANOU Y. ET AL.: "Continuous probabilistic transform for voice conversion", IEEE TRANS. ON SPEECH AND AUDIO PROCESSING, vol. 6, no. 2, March 1998 (1998-03-01) * |
Cited By (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2010031437A1 (en) * | 2008-09-19 | 2010-03-25 | Asociacion Centro De Tecnologias De Interaccion Visual Y Comunicaciones Vicomtech | Method and system of voice conversion |
CN103730117A (en) * | 2012-10-12 | 2014-04-16 | 中兴通讯股份有限公司 | Self-adaptation intelligent voice device and method |
WO2014058270A1 (en) * | 2012-10-12 | 2014-04-17 | Samsung Electronics Co., Ltd. | Voice converting apparatus and method for converting user voice thereof |
EP2908312A4 (en) * | 2012-10-12 | 2015-12-02 | Zte Corp | Self-adaptive intelligent voice device and method |
US9552813B2 (en) | 2012-10-12 | 2017-01-24 | Zte Corporation | Self-adaptive intelligent voice device and method |
US9564119B2 (en) | 2012-10-12 | 2017-02-07 | Samsung Electronics Co., Ltd. | Voice converting apparatus and method for converting user voice thereof |
US10121492B2 (en) | 2012-10-12 | 2018-11-06 | Samsung Electronics Co., Ltd. | Voice converting apparatus and method for converting user voice thereof |
US20160005403A1 (en) * | 2014-07-03 | 2016-01-07 | Google Inc. | Methods and Systems for Voice Conversion |
US9613620B2 (en) * | 2014-07-03 | 2017-04-04 | Google Inc. | Methods and systems for voice conversion |
CN109147758A (en) * | 2018-09-12 | 2019-01-04 | 科大讯飞股份有限公司 | A kind of speaker's sound converting method and device |
WO2023242445A1 (en) | 2022-06-17 | 2023-12-21 | The Provost, Fellows, Scholars And Other Members Of The Board Of Trinity College Dublin | Glottal features extraction using neural networks |
Also Published As
Publication number | Publication date |
---|---|
KR20080013524A (en) | 2008-02-13 |
KR100809368B1 (en) | 2008-03-05 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
JP4355772B2 (en) | Force conversion device, speech conversion device, speech synthesis device, speech conversion method, speech synthesis method, and program | |
US8898055B2 (en) | Voice quality conversion device and voice quality conversion method for converting voice quality of an input speech using target vocal tract information and received vocal tract information corresponding to the input speech | |
Boril et al. | Unsupervised equalization of Lombard effect for speech recognition in noisy adverse environments | |
JP4456537B2 (en) | Information transmission device | |
JP5581377B2 (en) | Speech synthesis and coding method | |
US20060129399A1 (en) | Speech conversion system and method | |
JP5961950B2 (en) | Audio processing device | |
JP2004522186A (en) | Speech synthesis of speech synthesizer | |
JP5039865B2 (en) | Voice quality conversion apparatus and method | |
CN110648684B (en) | Bone conduction voice enhancement waveform generation method based on WaveNet | |
JP2006171750A (en) | Feature vector extracting method for speech recognition | |
WO2008018653A1 (en) | Voice color conversion system using glottal waveform | |
JP2005266349A (en) | Device, method, and program for voice quality conversion | |
JP2010014913A (en) | Device and system for conversion of voice quality and for voice generation | |
JP4382808B2 (en) | Method for analyzing fundamental frequency information, and voice conversion method and system implementing this analysis method | |
JP2002268660A (en) | Method and device for text voice synthesis | |
Aryal et al. | Articulatory inversion and synthesis: towards articulatory-based modification of speech | |
JP2017520016A (en) | Excitation signal generation method of glottal pulse model based on parametric speech synthesis system | |
KR101560833B1 (en) | Apparatus and method for recognizing emotion using a voice signal | |
Pietruch et al. | Methods for formant extraction in speech of patients after total laryngectomy | |
Furui | Robust methods in automatic speech recognition and understanding. | |
Arslan et al. | 3-d face point trajectory synthesis using an automatically derived visual phoneme similarity matrix | |
Singh et al. | Features and techniques for speaker recognition | |
Del Pozo | Voice source and duration modelling for voice conversion and speech repair | |
Qavi et al. | Voice morphing based on spectral features and prosodic modification |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 06812317 Country of ref document: EP Kind code of ref document: A1 |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
NENP | Non-entry into the national phase |
Ref country code: RU |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 06812317 Country of ref document: EP Kind code of ref document: A1 |