WO2008018653A1 - Voice color conversion system using glottal waveform - Google Patents

Voice color conversion system using glottal waveform

Info

Publication number
WO2008018653A1
WO2008018653A1 (PCT/KR2006/004478)
Authority
WO
WIPO (PCT)
Prior art keywords
voice
glottal wave
speaker
glottal
converting
Prior art date
Application number
PCT/KR2006/004478
Other languages
French (fr)
Inventor
Yung-Hwan Oh
Jae-Hyun Bae
Original Assignee
Korea Advanced Institute Of Science And Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Korea Advanced Institute Of Science And Technology filed Critical Korea Advanced Institute Of Science And Technology
Publication of WO2008018653A1 publication Critical patent/WO2008018653A1/en

Links

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00Speech synthesis; Text to speech systems
    • G10L13/02Methods for producing synthetic speech; Speech synthesisers
    • G10L13/033Voice editing, e.g. manipulating the voice of the synthesiser
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00Speech synthesis; Text to speech systems
    • G10L13/02Methods for producing synthetic speech; Speech synthesisers
    • G10L13/04Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/003Changing voice quality, e.g. pitch or formants
    • G10L21/007Changing voice quality, e.g. pitch or formants characterised by the process used
    • G10L21/013Adapting to target pitch
    • G10L2021/0135Voice conversion or morphing

Definitions

  • the present invention relates to a speaker conversion system which can express various voice colors, and more particularly to a voice color conversion method and system which can generate a converted voice having various voice colors according to the utterance situation by converting the glottal wave.
  • FIG. 1 shows a process of generating a voice upon speaking.
  • when a man speaks, air from the bronchial tube passes through the vocal cords and thus a glottal wave is formed.
  • noise (aspirated sound) due to out-breathing is contained in the glottal wave.
  • while the glottal wave passes through the vocal tract, an articulation phenomenon occurs. Then, the glottal wave is radiated into the air and thus generates a voice.
  • the present invention relates to the conversion of glottal wave generated when a man speaks and grows out of a fact that a shape of glottal wave is changed according to environmental status, feelings and the like when a man speaks and thus the voice having various voice colors can be generated.
  • As a conventional method of mimicking the voice of a specific person (hereinafter called the "target speaker"), there are a mimicking method employing an expert dubbing artist and a mimicking method using a computer.
  • the computer-based methods, which convert a voice of a certain speaker (hereinafter called the "original speaker") into that of the target speaker, include mimicking methods using the Hidden Markov Model (HMM), the Gaussian Mixture Model (GMM) and the Neural Network.
  • in these methods, vocal tract characteristic parameters of the voice, such as LPC (Linear Prediction Coefficient), LSP (Line Spectral Pair), MFCC (Mel-Frequency Cepstral Coefficient) and HNM (Harmonic and Noise Model) characteristics, are extracted from the voices of the original speaker and the target speaker; the HMM or GMM is trained using the characteristic parameters of each speaker; and then a conversion function between the trained models is calculated so that the vocal tract characteristic of the original speaker is converted into that of the target speaker.
  • the prosody of the target speaker is modeled and then applied to the converted voice.
  • An object of the present invention is to provide a method of converting voice color, which can generate a voice of a certain speaker without employing an expert dubbing artist who can mimic the voice, and which can also generate voices having various voice colors anywhere and at any time according to the feelings of the certain person.
  • Another object of the present invention is to provide a method of converting voice color, which can generate voices having various voice colors according to situations and conditions.
  • a voice color conversion method for converting a speaker comprising the steps of analyzing a glottal wave signal of a voice of an original speaker; converting the analyzed glottal wave signal; and re-composing the converted glottal wave signal.
  • the step of analyzing the glottal wave signal comprises the steps of extracting a glottal wave from the voice of the original speaker; and extracting voice color characteristic parameters of the extracted glottal wave.
  • the step of converting the glottal wave signal comprises the steps of collecting glottal wave characteristic parameters for various voice colors from the voice of the original speaker and creating a voice color database of the original speaker; and collecting glottal wave characteristic parameters for various voice colors from a voice of a target speaker and creating a voice color database of the target speaker.
  • the step of converting the glottal wave signal further comprises the steps of collecting glottal wave characteristic parameters for various voice colors from voices of various general speakers and creating a voice color database of the general speakers.
  • the step of converting the glottal wave signal further comprises the steps of creating a corresponding relationship among the voice color databases of the original speaker, the target speaker and the general speakers on the basis of a quantity of change of the voice color characteristic parameters stored in each voice color database .
  • the characteristic parameters comprise an OQ section in which the vocal cords are opened, a CQ section in which the vocal cords are closed, EE (Effective Excitation) which is an effective excitation value, and T0 which is the period of the signal.
  • the quantity of change of the characteristic parameters is an NAQ parameter, which is defined as NAQ = E0 / (T0 · EE).
  • a voice color conversion system for converting a speaker, comprising glottal wave extracting means for extracting a glottal wave from a voice of an original speaker; voice color parameter extracting means for extracting voice color parameters from the extracted glottal wave; glottal wave converting means for converting the glottal wave using the voice color parameters; and converted voice generating means for generating a converted voice using the converted glottal wave.
  • the glottal wave extracting means comprises an A/D converter and an input buffer
  • the voice color parameter extracting means and the glottal wave converting means comprises a command storing unit and a main controller
  • the converted voice generating means comprises a D/A converter and an output buffer.
  • the system further comprises a storing unit in which voice color databases of an original speaker, a target speaker and general speakers are stored.
  • according to the voice color conversion method of the present invention, it is possible to convert a voice of an original speaker into a voice having the various voice colors of a target speaker according to situations, feelings and context. Furthermore, the speaker conversion system of the present invention can readily substitute for an expert dubbing artist who can mimic a voice of a target speaker. Further, since the speaker conversion system can optionally generate a virtual voice by combining the voice color databases of the general speakers, it can be used to make voices of various virtual characters.
  • Fig. 1 is a view showing a mechanism for generating a voice when a man speaks
  • Fig. 2 is a view showing a process of converting glottal wave to express various voice colors according to the present invention
  • Fig. 3 is a view showing a glottal wave obtained by using an LF (Liljencrants-Fant) model and a derivative glottal wave thereof according to the present invention
  • Fig. 4 is a view showing a derivative glottal wave of a KLGLOT88 model according to the present invention.
  • Fig. 5 is a view showing characteristic parameters of glottal wave for converting the voice color according to the present invention
  • Fig. 6 is a view showing a manner of storing voice color data in voice color databases of an original speaker, a target speaker and a general speaker according to the present invention
  • Fig. 7 is a graph showing distribution of NAQ parameters for various voice colors according to the present invention
  • Fig. 8 is a view showing a change of vocal glottal waves having various voice colors according to the present invention
  • Fig. 9 is a view showing a process in a step of converting the glottal wave according to the present invention.
  • Fig. 10 is a block diagram explaining a system for converting glottal wave signal according to the present invention.
  • Fig. 2 is a view showing a process of converting glottal wave to express various voice colors according to the present invention.
  • a method of converting a voice according to the present invention includes a step of analyzing glottal wave signal, a step of converting the glottal wave signal and a step of re-composing the glottal wave signal.
  • the step of analyzing the glottal wave signal includes a step of extracting the glottal wave from the voice signal of the input original speaker and a step of extracting parameters of various voice colors from the extracted glottal wave.
  • the parameters of voice colors extracted from the glottal wave are converted into glottal wave signals using data of voice color database.
  • the voice color database includes voice color database of an original speaker, voice color database of a target speaker and voice color database of a general speaker.
  • the voice color database of the target speaker, as a model representing the native voice color of the target speaker only, is a database in which characteristic parameters of the target speaker are collected.
  • the voice color database of the general speaker is a database in which characteristic parameters of various voice colors extracted from general speakers are collected.
  • the characteristic parameters of the voice colors of the original speaker, which are extracted in the step of analyzing the glottal wave signal, are converted into the glottal wave having the voice color of the target speaker using voice color characteristic parameters in which the voice color models of the target speaker and the general speaker are weighted and combined with each other.
  • the glottal wave having the voice color of the target speaker is reconstructed using vocal tract characteristic parameters, obtained by using the HMM and the like, and the glottal wave generated in the step of converting the glottal wave signal, so as to generate a finally converted voice.
  • the glottal wave is extracted from input signal which is converted into digital signal through a sampling process, and the characteristic parameters of voice colors are extracted from the glottal wave.
  • the vocal tract characteristic is removed as much as possible so that only the glottal wave remains.
  • the glottal wave may be extracted by inverse-filtering the excitation signal using a linear prediction algorithm of the voice.
  • the glottal wave is differentiated along the time axis to obtain the derivative glottal signal.
  • FIG. 3 is a view showing a glottal wave obtained by using an LF (Liljencrants-Fant) model and a derivative glottal wave thereof according to the present invention.
  • E0 is the maximum amplitude of the glottal wave signal
  • EE is the maximum closing speed of the vocal cords, as an effective excitation value
  • Te is the time period while the vocal cords are opened
  • Tc is the time period while the vocal cords are closed
  • T0 is the period while the vocal cords are opened and closed.
  • To represent the glottal wave, a KLGLOT88 model may also be used.
  • the equation is as follows:
  • a and b are defined in terms of the voicing amplitude (AV) and the open quotient Oq, and may be represented as follows:
  • the AV is the same parameter as the E0 of the LF model, Oq is the same parameter as the OQ of the LF model, and T0 is the basic period of the vocal cords.
  • Fig. 4 is a view showing a derivative glottal wave of the KLGLOT88 model. As shown in Figs. 3 and 4, although the models used are different, the shapes of the glottal waves are equal or similar to each other, because the same characteristic parameters are used.
  • Fig. 5 is a view showing characteristic parameters of glottal wave for converting the voice color according to the present invention.
  • the characteristic parameters necessary to express various voice colors include OQ, which indicates the opening section of the vocal cords, CQ, which indicates the closing section of the vocal cords, EE (Effective Excitation), which is an effective excitation value, and T0, which is the period of the signal.
  • the characteristic parameters can be obtained during a process of matching the derivative glottal wave with the LF model of Figs. 3 and 4. And after modeling the LF derivative glottal wave, an error signal which indicates the difference between the modeled derivative glottal wave and the actual derivative glottal wave can be obtained.
  • Fig. 6 is a view showing a manner of storing voice color data in the voice color databases of the original speaker, the target speaker and the general speaker according to the present invention.
  • the voice color database of the target speaker stores, for each voice color, the characteristic parameters of the glottal wave analyzed in the signal analyzing step for the actual voice of the target speaker, together with the error signal.
  • the voice color database of the general speaker stores, for each voice color, the characteristic parameters of the glottal wave analyzed in the signal analyzing step for the actual voices of ordinary people, together with the error signal.
  • a breathy voice, a pressed voice and a normal voice are representative of the voice colors which are expressed by the glottal wave.
  • the voice color databases of the target speaker and the general speaker store the characteristic parameters of various voice colors in the storing manner shown in Fig. 6.
  • the OQ and CQ sections are not divided clearly and the EE section has a flat shape.
  • the EE section has a peak shape, and the CQ section is much longer than the OQ section.
  • the normal voice has average OQ, CQ and EE sections.
  • the glottal wave can have various voice colors by adjusting the characteristic parameters such as the OQ, the CQ, the EE, the E 0 and the like.
  • the voice color database of the original speaker stores various voice colors of the original speaker, which are divided according to situations, feelings and the like.
  • the various voice colors correspond to each other according to the divided conditions.
  • the corresponding relationship is set by the histogram of changes in the parameters. Therefore, the various voice colors of the original speaker can be converted into the voice colors of the target speaker via the voice color databases of the target speaker and the general speaker.
  • the three voice color databases are constructed so that their voice colors correspond to each other according to voice color numbers.
  • the NAQ parameter is a combination of the above characteristic parameters of the glottal wave, as represented in Equation 4.
  • the characteristic parameters of the glottal wave are extracted from the glottal wave shown in Fig. 5, and the NAQ parameter is obtained by using Equation 4 to show the distribution thereof, whereby it is possible to grasp the corresponding relationship among the above characteristic parameters.
  • Fig. 7 is a graph showing distribution of NAQ parameters obtained by combining the characteristic parameters of glottal wave, wherein the glottal waves of breathy (Bre) , normal (Nor) and pressed (Pre) voices pronounced by thirteen people are analyzed and then the distribution of NAQ parameters is obtained.
  • the NAQ parameters are distributed consistently for a common voice color, and thus the quantity of change of the parameter values is consistent within each voice color. Therefore, it is possible to set the corresponding relationship for each voice color.
  • Fig. 8 is a view showing a shape of derivative glottal waves for the three voice colors, in which the change of the characteristic parameters OQ, CQ and EE for each voice color is observed.
  • the OQ section and the CQ section are not clearly divided and the EE section has a gentle shape.
  • the CQ section is relatively longer than the OQ section and the EE section has a peak shape.
  • the characteristic parameters of the glottal wave and the error signal which have the desired voice color of the target speaker are extracted, and then the parameter values are converted so as to generate a glottal wave model having the desired voice color of the target speaker. If the desired voice color is not in the voice color database of the target speaker, the characteristic parameters of the desired voice color and the characteristic parameters of the normal voice color are extracted from the voice color database of the general speaker, and the quantity of change between the two sets of characteristic parameters is calculated. Then, the calculated quantity of change is applied to the glottal wave having the normal voice color of the target speaker, thereby generating a glottal wave model having the desired voice of the target speaker.
  • the generated glottal wave model is combined with a basic frequency, a duration time and an energy parameter calculated in an existing prosody converting step to generate an actual glottal wave g(t), and the generated glottal wave is passed through a filter r(t) which has a radiation effect from the lips when a man speaks and then passed through a linear prediction filter v(t) comprised of the vocal tract characteristic parameters calculated in an existing vocal tract characteristic converting step, thereby finally generating a desired voice of the target speaker.
  • Fig. 9 is a view showing a process in a step of converting the glottal wave according to the present invention, which is represented as follows:
  • the present invention can provide various converted voices compared with the conventional speaker conversion system which can express only one voice color.
  • Fig. 10 is a block diagram explaining a system for converting glottal wave signal according to the present invention.
  • the voice of the speaker is input to an A/D converter and an input buffer to extract a glottal wave
  • voice color parameters of the extracted glottal wave are extracted by a command storing unit and a main controller and converted into a glottal wave, and then the final voice is output through a D/A converter and an output buffer.
  • according to the voice color conversion method of the present invention, it is possible to convert a voice of an original speaker into a voice having the various voice colors of a target speaker according to situations, feelings and context.
  • the speaker conversion system of the present invention can be widely applied to animations, movies, plays, commercial films and the like where voices having various and plentiful voice colors are needed.
  • the speaker conversion system of the present invention can readily substitute for an expert dubbing artist who can mimic a voice of a target speaker. Further, since the speaker conversion system can optionally generate a virtual voice by combining the voice color databases of the general speakers, it can be used to make voices of various virtual characters.
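The glottal wave extraction described above (inverse-filtering the excitation signal using a linear prediction algorithm) can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names are invented, a plain autocorrelation-method LPC is used, and practical glottal inverse filtering typically adds pre-emphasis and iterative refinement.

```python
import numpy as np

def lpc_coefficients(x, order):
    """Autocorrelation-method LPC via the Levinson-Durbin recursion.
    Returns a = [1, a1, ..., ap] of the inverse filter A(z)."""
    n = len(x)
    r = np.array([np.dot(x[:n - k], x[k:]) for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err
        a_prev = a.copy()
        for j in range(1, i):
            a[j] = a_prev[j] + k * a_prev[i - j]
        a[i] = k
        err *= (1.0 - k * k)
    return a

def inverse_filter(x, a):
    """Pass the speech signal through the FIR inverse filter A(z) to
    remove the vocal tract contribution; the residual approximates the
    (derivative) glottal excitation."""
    return np.convolve(x, a)[:len(x)]
```

On a voiced frame, the residual returned by `inverse_filter` is what the analysis step would then differentiate and parameterize (OQ, CQ, EE, T0).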

Abstract

Disclosed is a speaker conversion system which can express various voice colors, and more particularly a voice color conversion method and system which can generate a converted voice having various voice colors according to the utterance situation by converting the glottal wave.

Description

VOICE COLOR CONVERSION SYSTEM USING GLOTTAL WAVEFORM
[Technical Field] The present invention relates to a speaker conversion system which can express various voice colors, and more particularly to a voice color conversion method and system which can generate a converted voice having various voice colors according to the utterance situation by converting the glottal wave.
[Background Art]
FIG. 1 shows a process of generating a voice upon speaking. When a man speaks, air from the bronchial tube passes through the vocal cords, and thus a glottal wave is formed. At this time, since the man pronounces while breathing out, noise (aspirated sound) due to out-breathing is contained in the glottal wave. While the glottal wave passes through the vocal tract, an articulation phenomenon occurs. Then, the glottal wave is radiated into the air and thus generates a voice. The present invention relates to the conversion of the glottal wave generated when a man speaks, and grows out of the fact that the shape of the glottal wave changes according to environmental status, feelings and the like when a man speaks, so that voices having various voice colors can be generated.
As a conventional method of mimicking a voice of a specific person (hereinafter, called "target speaker"), there are a mimicking method employing an expert dubbing artist and a mimicking method using a computer.
In case of the mimicking method employing an expert dubbing artist, it is possible to mimic a prosodic characteristic with respect to a certain part of the voice; however, it is difficult to express various and natural voice colors.
And in case of the method using a computer, which can convert a voice of a certain speaker (hereinafter called the "original speaker") into that of the target speaker, there are mimicking methods using the Hidden Markov Model (HMM), the Gaussian Mixture Model (GMM) and the Neural Network.
In the conventional methods using the HMM, the GMM and the Neural Network, firstly, vocal tract characteristic parameters of the voice, such as LPC (Linear Prediction Coefficient), LSP (Line Spectral Pair), MFCC (Mel-Frequency Cepstral Coefficient) and HNM (Harmonic and Noise Model) characteristics, are extracted from the voices of the original speaker and the target speaker; the HMM or GMM is trained using the characteristic parameters of each speaker; and then a conversion function between the trained models is calculated so that the vocal tract characteristic of the original speaker is converted into that of the target speaker. In case of the prosody, the prosody of the target speaker is modeled and then applied to the converted voice. In a method of mimicking the prosody of the target speaker, after forming a pitch histogram of the original speaker and the target speaker, an excitation signal matched with the histogram is used. In case of the glottal wave, the excitation signal, from which the basic frequency information of the target speaker is excluded, is applied to the converted voice. In the conventional speaker conversion method, it is possible to convert the prosody and the vocal tract characteristic. However, since the excitation signal of the target speaker is used for the vocal cord characteristic, there is a problem that the voice color contained in the given excitation signal of the target speaker has to be used as it is.
Therefore, in the conventional method it is difficult to exactly express various voice colors according to situations, it is impossible to reflect the change of voice color according to feelings, environments, and the context when the original speaker speaks, and it is possible only to convert the voice of the original speaker into a predetermined voice color.
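For illustration, the simplest instance of the statistical feature mapping that underlies the GMM-based conversion described above is a per-dimension linear conversion function between two speakers' feature distributions (the single-Gaussian special case). The sketch below is hypothetical; it is not the training procedure of any particular conventional system.

```python
import numpy as np

def train_conversion(source, target):
    """Fit y = mu_t + (sigma_t / sigma_s) * (x - mu_s) per feature
    dimension, mapping source-speaker features onto the target
    speaker's mean and spread."""
    mu_s, mu_t = source.mean(axis=0), target.mean(axis=0)
    sd_s, sd_t = source.std(axis=0), target.std(axis=0)

    def convert(x):
        return mu_t + (sd_t / sd_s) * (x - mu_s)

    return convert
```

A GMM-based system generalizes this by fitting a mixture of such local linear maps and weighting them by posterior probability.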
[Disclosure] [Technical Problem]
An object of the present invention is to provide a method of converting voice color, which can generate a voice of a certain speaker without employing an expert dubbing artist who can mimic the voice, and which can also generate voices having various voice colors anywhere and at any time according to the feelings of the certain person.
Another object of the present invention is to provide a method of converting voice color, which can generate voices having various voice colors according to situations and conditions.
[Technical Solution]
According to the present invention, there is provided a voice color conversion method for converting a speaker, comprising the steps of analyzing a glottal wave signal of a voice of an original speaker; converting the analyzed glottal wave signal; and re-composing the converted glottal wave signal. And the step of analyzing the glottal wave signal comprises the steps of extracting a glottal wave from the voice of the original speaker; and extracting voice color characteristic parameters of the extracted glottal wave.
Further, the step of converting the glottal wave signal comprises the steps of collecting glottal wave characteristic parameters for various voice colors from the voice of the original speaker and creating a voice color database of the original speaker; and collecting glottal wave characteristic parameters for various voice colors from a voice of a target speaker and creating a voice color database of the target speaker. The step of converting the glottal wave signal further comprises the steps of collecting glottal wave characteristic parameters for various voice colors from voices of various general speakers and creating a voice color database of the general speakers. The step of converting the glottal wave signal further comprises the steps of creating a corresponding relationship among the voice color databases of the original speaker, the target speaker and the general speakers on the basis of a quantity of change of the voice color characteristic parameters stored in each voice color database .
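The three voice color databases and the correspondence set by the quantity of change of the characteristic parameters might be organized as sketched below. All parameter values here are illustrative placeholders, not taken from the patent, and the NAQ formula assumes the reading NAQ = E0 / (T0 · EE) of Equation (4).

```python
# Hypothetical entries for two of the voice color databases of Fig. 6;
# each entry holds the glottal characteristic parameters of one voice color.
target_db = {
    "normal":  {"E0": 1.0, "OQ": 0.55, "CQ": 0.45, "EE": 1.2, "T0": 0.006},
    "breathy": {"E0": 0.8, "OQ": 0.70, "CQ": 0.30, "EE": 0.7, "T0": 0.006},
}
general_db = {
    "normal":  {"E0": 1.0, "OQ": 0.55, "CQ": 0.45, "EE": 1.0, "T0": 0.008},
    "pressed": {"E0": 1.1, "OQ": 0.35, "CQ": 0.65, "EE": 1.9, "T0": 0.008},
}

def naq(p):
    # NAQ combines the characteristic parameters (assumed form of Equation 4).
    return p["E0"] / (p["T0"] * p["EE"])

def change_from_normal(db, color):
    """Quantity of change of a voice color relative to the normal voice,
    used to set the correspondence between the databases."""
    return naq(db[color]) / naq(db["normal"])
```

A missing voice color of the target speaker could then be synthesized by applying `change_from_normal(general_db, color)` to the target speaker's normal-voice parameters, as the conversion step describes.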
Further, the characteristic parameters comprise an OQ section in which the vocal cords are opened, a CQ section in which the vocal cords are closed, EE (Effective Excitation) which is an effective excitation value, and T0 which is the period of the signal. The quantity of change of the characteristic parameters is an NAQ parameter, which is defined as follows:

NAQ = E0 / (T0 · EE)    (4)

In addition, there is provided a voice color conversion system for converting a speaker, comprising glottal wave extracting means for extracting a glottal wave from a voice of an original speaker; voice color parameter extracting means for extracting voice color parameters from the extracted glottal wave; glottal wave converting means for converting the glottal wave using the voice color parameters; and converted voice generating means for generating a converted voice using the converted glottal wave. Preferably, the glottal wave extracting means comprises an A/D converter and an input buffer, the voice color parameter extracting means and the glottal wave converting means comprise a command storing unit and a main controller, and the converted voice generating means comprises a D/A converter and an output buffer. Further, the system comprises a storing unit in which the voice color databases of the original speaker, the target speaker and the general speakers are stored.
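Assuming Equation (4) reads NAQ = E0 / (T0 · EE), the parameter can be estimated from one pitch period of an extracted glottal wave as follows. The function name and the simple peak-picking are my own; this is a sketch, not the patent's procedure.

```python
import numpy as np

def naq_from_pulse(g, fs):
    """Estimate NAQ = E0 / (T0 * EE) from one pitch period of the
    glottal wave g sampled at fs Hz."""
    dg = np.diff(g) * fs               # derivative glottal wave
    e0 = g.max() - g.min()             # maximum amplitude of the pulse
    ee = -dg.min()                     # effective excitation: max closing speed
    t0 = len(g) / fs                   # pitch period in seconds
    return e0 / (t0 * ee)
```

Because E0 and EE·T0 scale together under amplitude and duration changes, NAQ is dimensionless, which is what makes it usable for comparing voice colors across speakers.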
[Advantageous Effects]
According to the voice color conversion method of the present invention, it is possible to convert a voice of an original speaker into a voice having the various voice colors of a target speaker according to situations, feelings and context. Furthermore, the speaker conversion system of the present invention can readily substitute for an expert dubbing artist who can mimic a voice of a target speaker. Further, since the speaker conversion system can optionally generate a virtual voice by combining the voice color databases of the general speakers, it can be used to make voices of various virtual characters.
[Description of Drawings]
The above and other objects, features and advantages of the present invention will become apparent from the following description of preferred embodiments given in conjunction with the accompanying drawings, in which:
Fig. 1 is a view showing a mechanism for generating a voice when a man speaks; Fig. 2 is a view showing a process of converting the glottal wave to express various voice colors according to the present invention;
Fig. 3 is a view showing a glottal wave obtained by using an LF (Liljencrants-Fant) model and a derivative glottal wave thereof according to the present invention;
Fig. 4 is a view showing a derivative glottal wave of a KLGLOT88 model according to the present invention;
Fig. 5 is a view showing characteristic parameters of glottal wave for converting the voice color according to the present invention; Fig. 6 is a view showing a manner of storing voice color data in voice color databases of an original speaker, a target speaker and a general speaker according to the present invention; Fig. 7 is a graph showing distribution of NAQ parameters for various voice colors according to the present invention;
Fig. 8 is a view showing a change of vocal glottal waves having various voice colors according to the present invention; Fig. 9 is a view showing a process in a step of converting the glottal wave according to the present invention; and
Fig. 10 is a block diagram explaining a system for converting glottal wave signal according to the present invention.
[Best Mode]
Hereinafter, the embodiments of the present invention will be described in detail with reference to accompanying drawings .
Fig. 2 is a view showing a process of converting glottal wave to express various voice colors according to the present invention. As shown in Fig. 2, a method of converting a voice according to the present invention includes a step of analyzing glottal wave signal, a step of converting the glottal wave signal and a step of re-composing the glottal wave signal.
The step of analyzing the glottal wave signal includes a step of extracting the glottal wave from the voice signal of the input original speaker and a step of extracting parameters of various voice colors from the extracted glottal wave. In the step of converting the glottal wave signal, the parameters of the voice colors extracted from the glottal wave are converted into glottal wave signals using the data of the voice color databases. At this time, the voice color databases include a voice color database of the original speaker, a voice color database of the target speaker and a voice color database of the general speaker.
The voice color database of the target speaker, as a model representing the native voice color of the target speaker only, is a database in which characteristic parameters of the target speaker are collected. The voice color database of the general speaker is a database in which characteristic parameters of various voice colors extracted from general speakers are collected. In the step of converting the glottal wave signal, the characteristic parameters of the voice colors of the original speaker, which are extracted in the step of analyzing the glottal wave signal, are converted into the glottal wave having the voice color of the target speaker using voice color characteristic parameters in which the voice color models of the target speaker and the general speaker are weighted and combined with each other.
In the step of re-composing the glottal wave, the glottal wave having the voice color of the target speaker is reconstructed using vocal tract characteristic parameters, obtained by using an HMM and the like, together with the glottal wave generated in the step of converting the glottal wave signal, so as to generate a finally converted voice.
Meanwhile, in the step of analyzing the glottal wave, the glottal wave is extracted from the input signal, which is converted into a digital signal through a sampling process, and the characteristic parameters of voice colors are extracted from the glottal wave. When extracting the glottal wave in this step, the vocal tract characteristic is removed as much as possible so that only the glottal wave remains. The glottal wave may be extracted by inverse-filtering the excitation signal using a linear prediction algorithm for voice. To analyze the extracted glottal wave, the glottal wave is differentiated along the time axis to obtain the derivative glottal signal. Fig. 3 is a view showing a glottal wave obtained by using an LF (Liljencrants-Fant) model and a derivative glottal wave thereof according to the present invention. The derivative glottal wave may be represented by using the LF model as follows:

g(t) = E0·e^(αt)·sin(ωg·t) ,  0 < t ≤ Te
g(t) = −(EE/(ε·Ta))·[e^(−ε(t−Te)) − e^(−ε(Tc−Te))] ,  Te < t ≤ Tc    (1)

wherein E0 is a maximum amplitude of the glottal wave signal, EE is a maximum closing speed of the vocal cords as an effective excitation value, Te is a time period while the vocal cords are opened, Tc is a time period while the vocal cords are closed and T0 is a period while the vocal cords are opened and closed; α, ωg, ε and Ta are the remaining shape parameters of the LF model.
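The LF derivative glottal wave above can be sampled numerically. The following is a minimal sketch, not the patent's implementation: the shape parameters alpha, Tp and Ta are illustrative assumptions, since the text fixes only E0, EE, Te, Tc and T0.

```python
import numpy as np

def lf_derivative(E0=1.0, Te=5e-3, Tc=6e-3, T0=10e-3, fs=16000,
                  alpha=200.0, Tp=None, Ta=0.4e-3):
    """One period of an LF-style derivative glottal wave (illustrative).

    Open phase (0 < t <= Te):    g(t) = E0 * exp(alpha*t) * sin(pi*t/Tp)
    Return phase (Te < t <= Tc): exponential recovery toward zero
    Closed phase (Tc < t < T0):  zero
    Tp (peak-flow instant) and Ta (return time constant) are assumed
    shape parameters, not named in the text.
    """
    if Tp is None:
        Tp = 0.8 * Te                       # assumed peak-flow instant
    n = int(round(T0 * fs))
    t = np.arange(n) / fs
    g = np.zeros_like(t)

    op = t <= Te                            # open phase
    g[op] = E0 * np.exp(alpha * t[op]) * np.sin(np.pi * t[op] / Tp)

    # EE: magnitude of the negative peak at the closing instant Te
    EE = -E0 * np.exp(alpha * Te) * np.sin(np.pi * Te / Tp)
    eps = 1.0 / Ta                          # return-phase decay rate
    ret = (t > Te) & (t <= Tc)              # return phase
    g[ret] = -(EE / (eps * Ta)) * (np.exp(-eps * (t[ret] - Te))
                                   - np.exp(-eps * (Tc - Te)))
    return t, g
```

The negative peak of the resulting wave lands at the closing instant Te, which is what the EE parameter measures.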
To represent the glottal wave, a KLGLOT88 model may also be used. The equation is as follows:

g(t) = a·t² − b·t³ ,  for 0 ≤ t ≤ Oq·T0
g(t) = 0 ,  for Oq·T0 < t < T0    (2)

wherein a and b are defined by the voicing amplitude (AV) and by Oq, and may be represented as follows:

a = 27·AV / (4·Oq²·T0²) ,  b = 27·AV / (4·Oq³·T0³)    (3)

wherein the AV is the same parameter as the E0 of the LF model, the Oq is the same parameter as the OQ of the LF model and the T0 is a basic period of the vocal cords.
Fig. 4 is a view showing a derivative glottal wave of the KLGLOT88 model. As shown in Figs. 3 and 4, although the used models are different, the shapes of the glottal waves are equal or similar to each other, because the same characteristic parameters are used.
Fig. 5 is a view showing characteristic parameters of the glottal wave for converting the voice color according to the present invention. The characteristic parameters which are necessary to express various voice colors include OQ, which indicates the opening section of the vocal cords, CQ, which indicates the closing section of the vocal cords, EE (Effective Excitation), which is an effective excitation value, and T0, which is a period of the signal. The characteristic parameters can be obtained during a process of matching the derivative glottal wave with the LF model of Figs. 3 and 4. After modeling the LF derivative glottal wave, an error signal, which indicates the difference between the modeled derivative glottal wave and the actual derivative glottal wave, can be obtained.
Fig. 6 is a view showing a manner of storing voice color data in the voice color databases of the original speaker, the target speaker and the general speaker according to the present invention. The voice color database of the target speaker stores, for each voice color, the characteristic parameters of the glottal wave analyzed in the signal analyzing step for the actual voice of the target speaker, together with the error signal. The voice color database of the general speaker stores, for each voice color, the characteristic parameters of the glottal wave analyzed in the signal analyzing step for the actual voices of ordinary people, together with the error signal. A breathy voice, a pressed voice and a normal voice are representative of the voice colors which are expressed by the glottal wave. On the basis of this fact, the voice color databases of the target speaker and the general speaker store the characteristic parameters of various voice colors in the storing manner shown in Fig. 6. In the case of the breathy voice, the OQ and CQ sections are not divided clearly and the EE section has a flat shape. In the case of the pressed voice, the EE section has a peak shape, and the CQ section is much longer than the OQ section. The normal voice has average OQ, CQ and EE sections. When generating the converted glottal wave, the glottal wave can have various voice colors by adjusting the characteristic parameters such as the OQ, the CQ, the EE, the E0 and the like. The voice color database of the original speaker stores various voice colors of the original speaker, which are divided according to situations, feelings and the like. In the above three voice color databases, the various voice colors correspond to each other according to the divided conditions. The corresponding relationship is set by the histogram of changes in the parameters.
Therefore, the various voice colors of the original speaker can be converted into the voice colors of the target speaker by using the voice color databases of the target speaker and the general speaker. The three voice color databases are constructed to have voice colors corresponding to each other according to voice color numbers.
Herein, to explain the corresponding relationship among the original speaker, the target speaker and the general speaker, it will be shown that the speakers have common characteristics for various voice colors. To this end, the present invention introduces an NAQ parameter. The NAQ parameter is a combination of the above characteristic parameters of the glottal wave and is represented as follows:

NAQ = E0 / (T0·EE)    (4)

That is, the characteristic parameters of the glottal wave are extracted from the glottal wave shown in Fig. 5, and the NAQ parameter is obtained by using Equation 4 to show the distribution thereof, whereby it is possible to grasp the corresponding relationship among the above characteristic parameters.
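Equation 4 can be measured directly from a glottal cycle. The sketch below assumes the reconstructed form NAQ = E0/(T0·EE), taking E0 as the peak of the glottal flow pulse and EE as the magnitude of the steepest closing slope of its derivative:

```python
import numpy as np

def naq(glottal_wave, fs, f0):
    """Normalized amplitude quotient of one glottal cycle (Eq. 4,
    assumed form NAQ = E0 / (T0 * EE)).

    E0: peak amplitude of the glottal flow pulse.
    EE: magnitude of the negative peak of its time derivative
        (maximum closing speed of the vocal folds).
    """
    T0 = 1.0 / f0
    E0 = np.max(glottal_wave)
    d = np.diff(glottal_wave) * fs     # finite-difference derivative
    EE = -np.min(d)                    # effective excitation
    return E0 / (T0 * EE)
```

For a KLGLOT88-shaped pulse the value works out to roughly 4·Oq/27, so NAQ grows with the open quotient, which matches the higher NAQ values reported for breathy voices in Fig. 7.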
Fig. 7 is a graph showing the distribution of NAQ parameters obtained by combining the characteristic parameters of the glottal wave, wherein the glottal waves of breathy (Bre), normal (Nor) and pressed (Pre) voices pronounced by thirteen people are analyzed and then the distribution of NAQ parameters is obtained. Referring to Fig. 7, the NAQ parameters are consistently distributed for a common voice color; that is, the quantity of change of the parameter values is constant for each voice color. Therefore, it is possible to set the corresponding relationship for each voice color.
Fig. 8 is a view showing the shapes of the derivative glottal waves for the three voice colors, in which the change of the characteristic parameters OQ, CQ and EE for each voice color can be observed. Referring to Fig. 8, in the case of the breathy voice, the OQ section and the CQ section are not clearly divided and the EE section has a gentle shape. However, in the case of the pressed voice, the CQ section is relatively longer than the OQ section and the EE section has a peak shape.
In the step of converting the glottal wave, the characteristic parameters of the glottal wave and the error signal which have a desired voice color of the target speaker are extracted, and then the parameter values are converted so as to generate a glottal wave model having the desired voice color of the target speaker. If the desired voice color is not in the voice color database of the target speaker, the characteristic parameters of the desired voice color and the characteristic parameters of the normal voice color are extracted from the voice color database of the general speaker, and the quantity of change between the two sets of characteristic parameters is calculated. Then, the calculated quantity of change is applied to the glottal wave having the normal voice color of the target speaker, thereby generating a glottal wave model having the desired voice of the target speaker. The generated glottal wave model is combined with a fundamental frequency, a duration time and an energy parameter calculated in an existing prosody converting step to generate an actual glottal wave g(t). The generated glottal wave is passed through a filter r(t), which models the radiation effect from the lips when a person speaks, and then through a linear prediction filter v(t) composed of the vocal tract characteristic parameters calculated in an existing vocal tract characteristic converting step, thereby finally generating a desired voice of the target speaker.
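The fallback rule described above — apply the general speakers' normal-to-desired parameter change to the target speaker's normal-color parameters — can be sketched as follows. The dictionary layout and parameter names are assumptions for illustration; the patent does not fix a data format:

```python
def convert_voice_color(target_db, general_db, desired_color):
    """Select or synthesize glottal-wave parameters for a voice color.

    target_db / general_db map a voice color name to a dict of the
    characteristic parameters OQ, CQ, EE and E0 (assumed layout).
    If the target speaker's database lacks the desired color, the
    change observed between the general speakers' "normal" and desired
    colors is added to the target speaker's "normal" parameters.
    """
    params = ("OQ", "CQ", "EE", "E0")
    if desired_color in target_db:
        return dict(target_db[desired_color])
    # quantity of change measured on the general-speaker database
    delta = {p: general_db[desired_color][p] - general_db["normal"][p]
             for p in params}
    # applied to the target speaker's normal voice color
    return {p: target_db["normal"][p] + delta[p] for p in params}
```

For example, if general speakers' breathy voices raise OQ by 0.2 relative to normal, the same +0.2 shift is applied to the target speaker's normal OQ.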
Fig. 9 is a view showing a process in a step of converting the glottal wave according to the present invention, which is represented as follows:
s(t) = g(t) * r(t) * v(t)    (5)

wherein * indicates convolution. The lip radiation filter r(t) may be represented, for example, as a first-order difference:

r(n) = δ(n) − α·δ(n−1) ,  α ≈ 1    (6)

As described above, since the final voice which is generated through the signal conversion step can reproduce the voice of the target speaker as it is, the present invention can provide various converted voices compared with the conventional speaker conversion system which can express only one voice color.
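Equation 5 can be sketched numerically. This minimal implementation assumes r(t) is a first-order difference and v(t) is an all-pole filter built from LPC coefficients; both are standard choices, not details fixed by the text:

```python
import numpy as np

def synthesize(g, lpc_coeffs, alpha=0.98):
    """Sketch of s(t) = g(t) * r(t) * v(t) (Eq. 5).

    r(t): lip radiation, modeled as y[n] = x[n] - alpha*x[n-1] (assumed).
    v(t): all-pole vocal tract from LPC coefficients a_k:
          y[n] = x[n] - sum_k a_k * y[n-k].
    """
    # radiation: first-order difference applied as an FIR convolution
    x = np.convolve(g, [1.0, -alpha])[:len(g)]

    # vocal tract: all-pole IIR filter, computed sample by sample
    a = np.asarray(lpc_coeffs, dtype=float)
    y = np.zeros_like(x)
    for n in range(len(x)):
        acc = x[n]
        for k, ak in enumerate(a, start=1):
            if n - k >= 0:
                acc -= ak * y[n - k]
        y[n] = acc
    return y
```

With an empty coefficient list, v(t) is the identity and only the radiation difference remains, which is a convenient sanity check on the pipeline.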
Fig. 10 is a block diagram explaining a system for converting a glottal wave signal according to the present invention. As shown in Fig. 10, the voice of the speaker is input to an A/D converter and an input buffer to extract a glottal wave; the voice color parameters of the extracted glottal wave are extracted by a command storing unit and a main controller and converted into a glottal wave; and then the final voice is output through a D/A converter and an output buffer.
[Industrial Applicability]
According to the voice color conversion method of the present invention, it is possible to convert a voice of an original speaker into a voice having various voice colors of a target speaker according to situations, feelings and context.
Further, the speaker conversion system of the present invention can be widely applied to animations, movies, plays, commercial films and the like where voices having various and plentiful voice colors are needed.
Furthermore, the speaker conversion system of the present invention can readily substitute for an expert dubbing artist who can mimic a voice of a target speaker. Further, since the speaker conversion system can optionally generate a virtual voice by combining data in the voice color database of the general speaker, it can be used to make voices of various virtual characters.

Claims

[CLAIMS]
[Claim 1]
A voice color conversion method for converting a speaker, comprising the steps of: analyzing a glottal wave signal of a voice of an original speaker; converting the analyzed glottal wave signal; and re-composing the converted glottal wave signal.
[Claim 2]
The method according to claim 1, wherein the step of analyzing the glottal wave signal comprises the steps of: extracting a glottal wave from the voice of the original speaker; and extracting voice color characteristic parameters of the extracted glottal wave.
[Claim 3]
The method according to claim 1, wherein the step of converting the glottal wave signal comprises the steps of: collecting glottal wave characteristic parameters for various voice colors from the voice of the original speaker and creating a voice color database of the original speaker; and collecting glottal wave characteristic parameters for various voice colors from a voice of a target speaker and creating a voice color database of the target speaker.
[Claim 4]
The method according to claim 3, wherein the step of converting the glottal wave signal further comprises the step of collecting glottal wave characteristic parameters for various voice colors from voices of various general speakers and creating a voice color database of the general speakers.
[Claim 5]
The method according to claim 4, wherein the step of converting the glottal wave signal further comprises the steps of creating a corresponding relationship among the voice color databases of the original speaker, the target speaker and the general speakers on the basis of a quantity of change of the voice color characteristic parameters stored in each voice color database.
[Claim 6]
The method according to claim 5, wherein the characteristic parameters comprise an OQ section in which vocal cords are opened, a CQ section in which the vocal cords are closed, EE (Effective Excitation) which is an effective excitation value and T0 which is a period of the signal.
[Claim 7]
The method according to claim 5 or 6, wherein the quantity of change of the characteristic parameters is an NAQ parameter which is defined as follows:

NAQ = E0 / (T0·EE)
[Claim 8]
A voice color conversion system for converting a speaker, comprising: glottal wave extracting means for extracting a glottal wave from a voice of an original speaker; voice color parameter extracting means for extracting voice color parameters from the extracted glottal wave; glottal wave converting means for converting the glottal wave using the voice color parameters; and converted voice generating means for generating a converted voice using the converted glottal wave.
[Claim 9]
The system according to claim 8, wherein the glottal wave extracting means comprises an A/D converter and an input buffer.
[Claim 10]
The system according to claim 8, wherein the voice color parameter extracting means and the glottal wave converting means comprise a command storing unit and a main controller.
[Claim 11]
The system according to claim 8, wherein the converted voice generating means comprises a D/A converter and an output buffer.
[Claim 12]
The system according to claim 8, further comprising a storing unit in which voice color databases of an original speaker, a target speaker and general speakers are stored.
PCT/KR2006/004478 2006-08-09 2006-10-31 Voice color conversion system using glottal waveform WO2008018653A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
KR10-2006-0075140 2006-08-09
KR1020060075140A KR100809368B1 (en) 2006-08-09 2006-08-09 Voice Color Conversion System using Glottal waveform

Publications (1)

Publication Number Publication Date
WO2008018653A1 true WO2008018653A1 (en) 2008-02-14

Family

ID=39033161

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/KR2006/004478 WO2008018653A1 (en) 2006-08-09 2006-10-31 Voice color conversion system using glottal waveform

Country Status (2)

Country Link
KR (1) KR100809368B1 (en)
WO (1) WO2008018653A1 (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2010031437A1 (en) * 2008-09-19 2010-03-25 Asociacion Centro De Tecnologias De Interaccion Visual Y Comunicaciones Vicomtech Method and system of voice conversion
CN103730117A (en) * 2012-10-12 2014-04-16 中兴通讯股份有限公司 Self-adaptation intelligent voice device and method
WO2014058270A1 (en) * 2012-10-12 2014-04-17 Samsung Electronics Co., Ltd. Voice converting apparatus and method for converting user voice thereof
US20160005403A1 (en) * 2014-07-03 2016-01-07 Google Inc. Methods and Systems for Voice Conversion
CN109147758A (en) * 2018-09-12 2019-01-04 科大讯飞股份有限公司 A kind of speaker's sound converting method and device
WO2023242445A1 (en) 2022-06-17 2023-12-21 The Provost, Fellows, Scholars And Other Members Of The Board Of Trinity College Dublin Glottal features extraction using neural networks

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6615174B1 (en) * 1997-01-27 2003-09-02 Microsoft Corporation Voice conversion system and methodology
KR20040061709A (en) * 2002-12-31 2004-07-07 (주) 코아보이스 Voice Color Converter using Transforming Vocal Tract Characteristic and Method
US6950799B2 (en) * 2002-02-19 2005-09-27 Qualcomm Inc. Speech converter utilizing preprogrammed voice profiles

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB8901423D0 (en) * 1989-01-23 1989-03-15 Fujisawa Pharmaceutical Co Pyrazolopyridine compound and processes for preparation thereof
JP4362953B2 (en) * 2000-07-03 2009-11-11 日本軽金属株式会社 Bumpy stay
JP4060562B2 (en) * 2001-05-02 2008-03-12 日本碍子株式会社 Electrode body evaluation method
KR100639968B1 (en) * 2004-11-04 2006-11-01 한국전자통신연구원 Apparatus for speech recognition and method therefor
JP2008002003A (en) * 2006-06-21 2008-01-10 Toray Ind Inc Ground fabric for airbag and method for producing the ground fabric

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6615174B1 (en) * 1997-01-27 2003-09-02 Microsoft Corporation Voice conversion system and methodology
US6950799B2 (en) * 2002-02-19 2005-09-27 Qualcomm Inc. Speech converter utilizing preprogrammed voice profiles
KR20040061709A (en) * 2002-12-31 2004-07-07 (주) 코아보이스 Voice Color Converter using Transforming Vocal Tract Characteristic and Method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
KAIN A. ET AL.: "Spectral voice conversion for text-to-speech synthesis", PROC. ICASSP, 1998, pages 285 - 288 *
STYLIANOU Y. ET AL.: "Continuous probabilistic transform for voice conversion", IEEE TRANS. ON SPEECH AND AUDIO PROCESSING, vol. 6, no. 2, March 1998 (1998-03-01) *

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2010031437A1 (en) * 2008-09-19 2010-03-25 Asociacion Centro De Tecnologias De Interaccion Visual Y Comunicaciones Vicomtech Method and system of voice conversion
CN103730117A (en) * 2012-10-12 2014-04-16 中兴通讯股份有限公司 Self-adaptation intelligent voice device and method
WO2014058270A1 (en) * 2012-10-12 2014-04-17 Samsung Electronics Co., Ltd. Voice converting apparatus and method for converting user voice thereof
EP2908312A4 (en) * 2012-10-12 2015-12-02 Zte Corp Self-adaptive intelligent voice device and method
US9552813B2 (en) 2012-10-12 2017-01-24 Zte Corporation Self-adaptive intelligent voice device and method
US9564119B2 (en) 2012-10-12 2017-02-07 Samsung Electronics Co., Ltd. Voice converting apparatus and method for converting user voice thereof
US10121492B2 (en) 2012-10-12 2018-11-06 Samsung Electronics Co., Ltd. Voice converting apparatus and method for converting user voice thereof
US20160005403A1 (en) * 2014-07-03 2016-01-07 Google Inc. Methods and Systems for Voice Conversion
US9613620B2 (en) * 2014-07-03 2017-04-04 Google Inc. Methods and systems for voice conversion
CN109147758A (en) * 2018-09-12 2019-01-04 科大讯飞股份有限公司 A kind of speaker's sound converting method and device
WO2023242445A1 (en) 2022-06-17 2023-12-21 The Provost, Fellows, Scholars And Other Members Of The Board Of Trinity College Dublin Glottal features extraction using neural networks

Also Published As

Publication number Publication date
KR20080013524A (en) 2008-02-13
KR100809368B1 (en) 2008-03-05

Similar Documents

Publication Publication Date Title
JP4355772B2 (en) Force conversion device, speech conversion device, speech synthesis device, speech conversion method, speech synthesis method, and program
US8898055B2 (en) Voice quality conversion device and voice quality conversion method for converting voice quality of an input speech using target vocal tract information and received vocal tract information corresponding to the input speech
Boril et al. Unsupervised equalization of Lombard effect for speech recognition in noisy adverse environments
JP4456537B2 (en) Information transmission device
JP5581377B2 (en) Speech synthesis and coding method
US20060129399A1 (en) Speech conversion system and method
JP5961950B2 (en) Audio processing device
JP2004522186A (en) Speech synthesis of speech synthesizer
JP5039865B2 (en) Voice quality conversion apparatus and method
CN110648684B (en) Bone conduction voice enhancement waveform generation method based on WaveNet
JP2006171750A (en) Feature vector extracting method for speech recognition
WO2008018653A1 (en) Voice color conversion system using glottal waveform
JP2005266349A (en) Device, method, and program for voice quality conversion
JP2010014913A (en) Device and system for conversion of voice quality and for voice generation
JP4382808B2 (en) Method for analyzing fundamental frequency information, and voice conversion method and system implementing this analysis method
JP2002268660A (en) Method and device for text voice synthesis
Aryal et al. Articulatory inversion and synthesis: towards articulatory-based modification of speech
JP2017520016A (en) Excitation signal generation method of glottal pulse model based on parametric speech synthesis system
KR101560833B1 (en) Apparatus and method for recognizing emotion using a voice signal
Pietruch et al. Methods for formant extraction in speech of patients after total laryngectomy
Furui Robust methods in automatic speech recognition and understanding.
Arslan et al. 3-d face point trajectory synthesis using an automatically derived visual phoneme similarity matrix
Singh et al. Features and techniques for speaker recognition
Del Pozo Voice source and duration modelling for voice conversion and speech repair
Qavi et al. Voice morphing based on spectral features and prosodic modification

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 06812317

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

NENP Non-entry into the national phase

Ref country code: RU

122 Ep: pct application non-entry in european phase

Ref document number: 06812317

Country of ref document: EP

Kind code of ref document: A1