WO2001078064A1

WO2001078064A1 - Voice character converting device

Info

Publication number: WO2001078064A1
Application number: PCT/JP2001/002388
Authority: WO
Inventors: Shin Kamiya
Original assignee: Sharp Kabushiki Kaisha
Priority date: 2000-04-03
Filing date: 2001-03-26
Publication date: 2001-10-18
Also published as: JP2001282300A; JP3631657B2

Abstract

The speech load on the conversion-target speaker is lightened and conversion of voice character with high accuracy is effected. A nonlinear frequency axis spectrum matching unit (8) determines a frequency warping function relating to the spectrum envelope of the conversion-source speaker and the spectrum envelope of the conversion-target speaker. A frequency warp table memory (9) holds the averages of the frequency warping function for each phoneme , each similar phoneme , each voiced sound section/voiceless sound section and each whole voice section of the frequency warping fucntion. When the voice character is converted, the spectrum envelope of the conversion-source speaker is converted to the spectrum envelope of the conversion-target speaker by using the average frequency warping function. Thus high-accuracy voice character conversion is effected. If the amount of data on speech of the conversion-target speaker is a little, average frequency warping functions such as of the similar phoneme and the voiced sound section/voiceless sound section are used, thereby dealing with the conversion even if the amount of data on speech of the conversion-target speaker is a little and lightening the speech load on the conversion-target speaker.

Description

Technical field

The present invention relates to a voice conversion device and a voice conversion method for converting a synthesized voice or a bright input voice into a specific speaker's voice and outputting the converted voice, and a program recording medium storing a voice conversion program. book

Background art

To date, many text-to-speech synthesizers have been developed with the aim of realizing synthesized speech that is more natural and closer to human speech. Once this goal has been achieved to some extent, it is naturally expected that the need for a text-to-speech synthesizer that utters according to the voice quality and prosody of a particular speaker, such as a favorite voice actor or actress, or a family or lover, will increase. You. Also, it is desirable that the voice data required by the voice synthesizer for 'voice quality' prosody conversion be as small as possible in consideration of the utterance burden of the provider.

Conventionally, as a method of converting voice quality, a method of extracting and converting formant frequencies from the spectral envelope (for example, Kuwahara, Ogushi, “Independent control of formant frequency and band width and judgment of individuality”, Electronic Communication Transactions of the Society, Vol. J69-A No. 4, pp. 509-517 (1986)). In addition, the peak point of the spectrum envelope is obtained, each spectrum envelope is band-divided based on the frequency of the peak point, and the frequency difference and the intensity difference obtained for these division points are used to make a spectrum. There is a method of deforming the vector envelope (for example, Japanese Patent Application Laid-Open No. Hei 9-244649).

On the other hand, in the field of speech recognition technology for unspecified speakers, by performing simultaneous nonlinear expansion and contraction of the frequency axis and the intensity axis of the speech spectrum, a remarkable effect on speaker normalization is seen, and the speech recognition performance is improved. (Eg, Nakagawa, Kamiya, Sakai, "Recognition of unspecified speaker's word speech based on simultaneous nonlinear expansion and contraction of time axis, frequency axis, and intensity axis of speech spectrum", Journal of the Society, Vol. J64-D No. 2, pp. 116 -123 (1981)).

In addition, DP (Dynamic Programming) matching in the frequency domain is performed between spectrum envelope sequences (n-dimensional vector sequences) of a plurality of vowels in the speech uttered by the source speaker and the target speaker in advance. A method has been proposed in which the spectrum envelope of the above-mentioned source speaker is converted into the spectrum envelope of the destination speaker by using one obtained optimal DP path (see, for example, Japanese Patent Laid-Open No. 4-1 / 1991). No. 4730000).

However, the conventional voice conversion method has the following problems. That is, the method of extracting and converting the formant frequency has a problem that the sound quality is affected by the extraction accuracy of the formant frequency. Further, in the method of deforming the spectrum envelope based on the frequency difference and the intensity difference between the division points of the spectrum envelope based on the frequency of the peak point, the spectrum is divided by the frequency of the peak point There is a problem that the band of the spectrum is affected, and there is also a problem that the sound quality is affected by the extraction accuracy of the low-frequency peak point when the pitch frequency is high.

Also, in the above method of performing speaker normalization by simultaneous non-linear expansion and contraction of the frequency axis and the intensity axis of the speech spectrum, unless the constraint conditions for the non-linear expansion and expansion are set quite well, not only individual differences but also phonemes. The problem is that the difference is normalized, resulting in poor performance.

Also, there is a method of performing DP matching between a plurality of vowels in a speech uttered by a conversion source speaker and a conversion target speaker, and an envelope sequence (n-dimensional vector sequence). If the optimal DP path for each vowel is different due to individual differences (soft differences) caused by vocal habits such as the opening of the mouth and the degree of mouth opening, the members of similar optimal DP paths (for example, the back vowel) However, there is a problem that DP paths that are slightly inappropriate for other groups are extracted, and DP paths that are not optimal as a whole are selected. If the vowels for training can be selected so that the optimal DP path is not biased, only the individual differences (hard differences) caused by physical differences such as vocal tract shape and vocal tract length are normalized. Since it is a DP pass, there is a problem that the recognition performance is not sufficiently improved by the regular pass. Furthermore, the source speaker and the destination speaker have the same content (word or sentence: for example, ) Is premised on the restriction of utterance, so if the source speaker's utterance is different or the voice data is insufficient, there is also a problem that it cannot be used. is there.

Thus, the conventional voice quality conversion method described above is not sufficient in terms of voice quality conversion performance. Disclosure of the invention

Accordingly, an object of the present invention is to provide a voice conversion apparatus and a voice conversion method capable of reducing the utterance load of the conversion-target speaker and performing more accurate voice conversion, and a program recording a voice conversion processing program. To provide a medium.

To achieve the above object, according to a first aspect of the present invention, there is provided a voice conversion device for converting a voice in a voice quality of a first speaker into a voice in a voice quality of a second speaker. A spectral envelope extracting means for extracting a first spectral envelope from a first voice uttered by a speaker, and extracting a second spectral envelope from a second voice uttered by a second speaker; First memory means for storing the extracted first spectrum envelope and second spectrum envelope with a label for each voice, and for the same label, the first spectrum stored in the first memory; A nonlinear frequency expansion and contraction is performed by performing nonlinear frequency expansion / contraction matching on the vector envelope and the second vector envelope, and obtaining a frequency-moving function representing the correspondence between the frequency axes of the two envelopes. Vector matching means, and the frequency The second memory means for storing a speech function with a label of a voice unit and storing the first spectrum envelope of the specified voice unit name from the first memory, while reading the specified voice unit name. The frequency sorbing function of the unit name is read out from the second memory, and based on the read out frequency convergence function, the above read out first spectrum envelope is used as the spectrum envelope for the second speaker. There is provided a voice quality conversion device provided with a spectrum envelope conversion means for converting into a voice.

According to the above configuration, the correspondence between the frequency axes of the first spectrum envelope obtained from the voice of the first speaker and the second spectrum envelope obtained from the voice of the second speaker is represented. The first speaker by the first speaker of the specified voice unit name uses the frequency The frequency axis of the vector envelope is nonlinearly expanded and contracted and converted into a vector envelope by the second speaker, and the voice of the second speaker with the specified voice unit name is obtained. Therefore, there is no need to extract the specific position of the first spectral envelope by the first speaker, and highly accurate voice quality conversion is performed without the sound quality being affected by the extraction accuracy of the specific position. In one embodiment, the nonlinear frequency axis spectrum matching means performs the nonlinear spectrum expansion / contraction matching with respect to the first spectrum envelope and the second spectrum envelope. The difference in output value between adjacent channels when the envelope is divided into multiple channels in the frequency band is used.

In one embodiment, the first memory means stores the inclination of the first spectrum envelope and the inclination of the second spectrum envelope with a label for each voice. The slope of the first spectrum envelope is extracted from the first speech uttered by the first speaker, and the slope of the second spectrum envelope is extracted from the second speech uttered by the second speaker. Reading out the spectrum inclination extracting means to be stored in the first memory means, and reading out the inclination of the first spectrum envelope and the inclination of the second spectrum envelope of the designated voice unit name from the first memory means; And a spectral inclination correcting means for correcting the inclination of the spectral envelope of the second speaker obtained by the spectral envelope conversion means based on the difference between the two inclinations.

According to the above embodiment, based on the difference between the slopes of the first and second spectral envelopes in the specified voice unit name, the slope of the spectral envelope of the second speaker obtained above is obtained. Is corrected, and a voice closer to the voice quality of the second speaker is obtained.

In one embodiment, the speech unit is a phoneme, and the frequency sorbing function stored in the second memory means is divided into phonemes, similar phonemes, voiced / unvoiced sections, and speakers based on the labels. Averaging means for grouping, calculating an average value of the frequency-mapping functions belonging to each group, assigning a label of each group name to the obtained average frequency-mapping function, and storing the label in the second memory means; The spectrum envelope conversion means uses an average frequency rubbing function of any group to which a specified phoneme belongs as the frequency rubbing function. According to the above-described embodiment, the average frequency rubbing function is obtained for each group of “phonemes”, “similar phonemes”, “voiced sound sections / unvoiced sound sections”, and “speakers”. Therefore, according to the amount of utterance data of the second speaker stored in the first memory means, an appropriate group of average frequency averaging functions is selected and used instead of the above frequency averaging function. Can be. For example, if there is little or no utterance data of the back tongue vowel / o /, the mean frequency voving function of the back tongue vowel / a /, which is a similar phoneme of the phoneme / o /, or a voiced interval The average frequency ビング ving function is selected. In this way, even when the amount of utterance data of the second speaker is small, it can be handled. Further, by calculating the average frequency rubbing function for each phoneme and each similar phoneme, individual differences due to vocal habits are normalized.

In one embodiment, the speech unit is a phoneme, and the slope of the first spectrum envelope and the slope of the second spectrum envelope stored in the first memory unit are based on the label. Phoneme 'similar phoneme' voiced section / unvoiced section and each speaker, calculate the average value of the slope of the spectrum envelope belonging to each group, and calculate the obtained average spectrum slope for each speaker. And an averaging means for assigning a label to each group name and storing the same in the first memory means, and the spectrum inclination correcting means comprises an average memory for any of the groups to which the specified phoneme belongs. The vector inclination is used as the inclination of the above-mentioned spectral envelope.

According to the above-described embodiment, the average spectral gradient is determined for each of the groups of “phoneme”, “similar phoneme”, “voiced sound section / unvoiced sound section”, and “speaker”. Therefore, according to the amount of utterance data of the second speaker stored in the first memory means, an appropriate average spectral slope of a group is selected and used in place of the spectral envelope slope. Can be. In this way, even if the amount of utterance data of the second speaker is small, it can be handled. Further, by calculating the average spectral gradient for each phoneme or each similar phoneme, individual differences due to vocal habits are normalized.

Also, in one embodiment, the time series of the extracted first spectral envelope or the second spectral envelope is recognized by an unspecified speaker voice recognition method, and the voice unit name of the recognition result is referred to as the first voice envelope. A voice recognition unit is provided for sending to the memory unit. According to the above embodiment, the speech unit name for the label is automatically obtained from the first and second spectrum envelopes extracted from the utterances of the first and second speakers. In this way, the above-described spectrum envelope or the labeling process for the inclination of the spectrum envelope is easily performed.

In one embodiment, the speech recognition means can supply a time series of the obtained speech unit names to the spectrum envelope conversion means or the spectrum inclination correction means. The envelope conversion means or the spectrum inclination correction means is adapted to use the time series of the speech unit names obtained by the above speech recognition means as the designated speech unit names.

According to the above embodiment, the speech unit name when converting the first spectrum envelope by the first speaker's utterance into the second speaker's spectrum envelope is determined by the speech recognition means. It is specified by the time series of the obtained voice unit name. In this way, the uttered sound of the first speaker is directly converted to the sound of the second speaker in real time without inputting the voice unit name sequence to be converted from the keyboard or the like.

In one embodiment, the averaging means calculates the average frequency-diving function by performing a linear conversion between frequency-averaging functions to be averaged. -Further, in one embodiment, the above-mentioned frequency averaging function is obtained by dividing the first spectrum envelope and the second spectrum envelope into a plurality of channels in the same frequency band and dividing the first and second spectrum envelopes into a plurality of channels. It has a matrix form of data in which different element values are given to the grid points corresponding to the DP path and other grid points on the plane consisting of the vector channels, and the linear transformation between the frequency Then, the sum of the element values at the same grid point in a plurality of matrices corresponding to the frequency averaging function for which the average value is to be calculated is determined, and a matrix having the obtained values as element values is defined as the average frequency averaging function.

In one embodiment, the spectrum envelope converting means converts the second frequency in the matrix of the average frequency-moving function to be used when converting the intensity into the intensity in a certain frequency band of the second static envelope. To the corresponding channel of the vector envelope At the grid point of the relevant row or column, the product sum of the element value of each grid point and the intensity in the channel of the first spectrum envelope corresponding to the grid point is determined, and the value of this product sum is calculated as the second The intensity of the spectrum envelope in the frequency band is assumed.

According to a second aspect of the present invention, there is provided a voice quality conversion device for converting a voice of a voice quality of a first speaker into a voice of a voice quality of a second speaker, wherein the voice of the first speaker is uttered. A vocal tract cross-sectional area extracting means for extracting a first vocal tract cross-sectional area from a second voice uttered by a second speaker while extracting a first vocal tract cross-sectional area from the first voice, A first memory means for storing a vocal tract cross-sectional area and a second vocal tract cross-sectional area with a label assigned to each speech unit; and, for the same label, the first vocal tract cross-sectional area stored in the first memory. Nonlinear vocal tract axis matching that performs nonlinear vocal tract axis expansion and contraction matching with the second vocal tract cross section and obtains the vocal tract axis ヮ one-Bing function representing the correspondence between the vocal tract axes of both vocal tract cross sections Means, and a second memory means for storing the vocal tract axis mapping function with a speech unit label attached thereto, and specifying The first vocal tract cross-sectional area of the specified voice unit name is read out from the first memory unit, while the vocal tract axis mapping function of the specified voice unit name is read out from the second memory, and this read-out is performed. A voice quality conversion device is provided that includes a vocal tract cross-sectional area conversion unit that converts the read first vocal tract cross-sectional area into a vocal tract cross-sectional area for a second speaker based on a vocal tract axis moving function.

According to the above configuration, the voice representing the correspondence of the vocal tract axis between the first vocal tract cross-sectional area obtained from the voice of the first speaker and the second vocal tract cross-sectional area obtained from the voice of the second speaker The vocal axis mapping function is used, the vocal tract axis of the first vocal tract cross section by the first speaker of the specified voice unit name is nonlinearly expanded and contracted and converted to the vocal tract cross section of the second speaker, and The voice of the second speaker with the specified voice unit name is obtained. Therefore, it is not necessary to extract the specific position of the first spectral envelope by the first speaker, and the sound quality is not affected by the extraction accuracy of the specific position, and highly accurate voice quality conversion is performed.

According to a third aspect of the present invention, there is provided a voice conversion method for converting a voice of a first speaker into a voice of a second speaker, wherein the voice of the first speaker is uttered. Extracting the first spectrum envelope from the first speech while extracting the second spectrum envelope from the second speech uttered by the second speaker; Performing a nonlinear frequency expansion / contraction matching on the extracted first and second spectral envelopes to obtain a frequency-moving function representing the correspondence between the frequency axes of the two spectral envelopes; And a step of converting the first spectral envelope of the specified voice unit name into a spectral envelope for the second speaker based on the frequency-moving function of the specified voice unit name. A voice conversion method is provided.

According to the above configuration, the frequency axis of the first spectrum envelope by the first speaker of the specified voice unit name is nonlinearly expanded and contracted and converted into the spectrum envelope by the second speaker, and the above designation is performed. The voice of the second speaker with the specified voice unit name is obtained. Therefore, there is no need to extract the specific position of the first spectral envelope by the first speaker, and highly accurate voice quality conversion is performed without the sound quality being affected by the extraction accuracy of the specific position.

In one embodiment, the gradient of the first studio envelope is extracted from the first speech uttered by the first speaker, while the second speech is extracted from the second speech uttered by the second speaker. Extracting the slope of the vector envelope; and calculating the second spectrum based on the difference between the slope of the first spectrum envelope and the slope of the second spectrum envelope of the specified speech unit name. The method includes the step of correcting the inclination of the spectral envelope of the speaker.

According to the above-described embodiment, the inclination of the spectrum envelope regarding the second speaker obtained above is corrected, and a voice closer to the voice quality of the second speaker can be obtained.

According to a fourth aspect of the present invention, a computer is connected to the above-mentioned spectral envelope extracting means, nonlinear frequency axis spectral matching means, spectral envelope transforming means, and spectral gradient extracting means. (4) A program recording medium on which a voice quality conversion processing program for functioning as a spectrum inclination correcting means is provided. According to the above configuration, the frequency axis of the first spectrum envelope by the first speaker of the specified voice unit name is nonlinearly expanded and contracted, and is converted into the spectrum envelope of the second speaker. Furthermore, it is corrected based on the difference between the slopes of the first and second spectral envelopes of the obtained spectral envelope of the second speaker. In this way, highly accurate voice quality conversion is performed without the sound quality being affected by the extraction accuracy of the specific position of the first spectral envelope by the first speaker. BRIEF DESCRIPTION OF THE FIGURES

FIG. 1 is a block diagram of a voice quality conversion apparatus according to the present invention.

Figures 2A to 2F show examples of spectral envelope, spectral tilt, and sound source characteristics.

FIG. 3 is a flowchart of the voice quality conversion processing operation by the voice quality conversion device shown in FIG.

FIG. 4 is a flowchart of the voice quality conversion processing operation following FIG.

FIG. 5 is a flowchart of the voice quality conversion processing operation following FIG.

6A to 6C are diagrams showing the concept of nonlinear frequency axis spectrum matching by dynamic programming.

7A and 7B are conceptual diagrams of spectral envelope normalization.

FIG. 8 is a block diagram of a voice conversion device different from FIG.

9A to 9D are diagrams showing examples of the vocal tract cross-sectional area and the sound source characteristics.

FIGS. 108 to 100C are diagrams showing the concept of vocal tract axis matching by dynamic programming.

Fig. 11 is a conceptual diagram of the modified vocal tract cross-sectional area. BEST MODE FOR CARRYING OUT THE INVENTION

Hereinafter, the present invention will be described in detail with reference to the illustrated embodiments. In the following description, the above-mentioned voice unit is referred to as “phoneme”, but the present invention is not limited to this.

(No.; L embodiment)

FIG. 1 is a block diagram of a voice conversion device according to the present embodiment. The waveform analyzer 1 extracts a cepstrum and prosody information from an input speech waveform. The star envelope extraction unit 2 extracts a spectrum envelope as shown in FIGS. 2C and 2F based on the low-order cepstrum coefficients extracted by the waveform analysis unit 1. The spectral gradient extraction unit 3 performs the approximation in the case where the above-mentioned spectral envelope is approximated by a least square approximation line. Extract the slope of the line as shown in Fig. 2B and Fig. 2E. The sound source Tokusei extraction unit 4 extracts sound source characteristics as shown in FIGS. 2A and 2D based on the higher-order cepstrum coefficients extracted by the waveform analysis unit 1. The speech recognition unit 5 performs HMM (Hidden Markov Model) based on the time series of the spectrum envelope extracted by the spectrum envelope extraction unit 2 and the prosodic information (power, pitch frequency, etc.) extracted by the waveform analysis unit 1. ) Is used for speech recognition. Then, the phoneme (speech unit) sequence of the recognition result is output together with the prosody information (phoneme duration, average power, average pitch frequency, etc.) between the phoneme segments. The extracted spectral envelope, spectral tilt, and sound source characteristics are stored in the feature memory 6 with a phoneme label, which is the recognition result of each speaker by the voice recognition unit 5, attached.

The averaging unit 7 converts the spectral envelope, spectral slope, and sound source characteristics of each phoneme for each speaker stored in the feature memory 6 into phonemes, similar phonemes, Calculate the average value by classifying each voice section / unvoiced section and the entire voice section (speaker). Then, the obtained average spectral envelope, average spectral slope, and average sound source characteristics are labeled with the corresponding phoneme name, similar phoneme name, voiced / unvoiced sound section, or the label of the entire voice section (speaker). Store it in memory 6. Further, as described later in detail, the similar phoneme, voiced / unvoiced sound section, and speech are obtained by performing a linear transformation on the frequency-moving function of each phoneme stored in the frequency warp table memory 9 for each speaker. An average value is calculated for each of the sections. Then, the obtained average frequency rubbing function is stored in the frequency warp table memory 9 with a corresponding similar phoneme name, a label of a voiced section / unvoiced section or an entire voice section.

Here, the calculation of the frequency averaging function stored in the frequency warp table memory 9 is performed by the nonlinear frequency axis spectrum matching unit 8 as follows. That is, the nonlinear frequency axis spectrum matching unit 8 uses the nonlinear frequency axis spectrum matching based on the dynamic planning method to calculate, for each phoneme, the average spectrum envelope of the source speaker S stored in the feature memory 6 for each phoneme. And the average spectral envelope of the speaker T to be converted. Then, a frequency averaging function corresponding to the optimal DP path is obtained, and a phoneme name is assigned to the function and stored in the frequency warp table memory 9. It is.

The spectrum-envelope conversion unit 10 reads out the spectrum envelope of the source speaker s of the phoneme corresponding to the utterance instruction from the feature memory 6, and reads out the frequency-moving function of the phoneme from the frequency warp table memory 9. . In that case, the data of the corresponding phoneme of the pre-speaker stored in the feature memory 6 and the frequency warp table memory 9 have little power, and if there is no data, the similar phoneme of the phoneme or the same section as the phoneme Reads the average frequency rubbing function of the voice section or unvoiced section) or the entire voice section. Then, the spectrum envelope of the source speaker S is converted into the spectrum envelope of the destination speaker T by using the above (average) frequency eave function. Hereinafter, the spectrum envelope of the destination speaker T obtained by this conversion is referred to as a “deformed vector envelope”.

The spectrum slope conversion unit 11 calculates the average spectrum slope of the conversion source speaker S and the average spectrum slope of the conversion destination speaker T of the phoneme corresponding to the utterance instruction from the six feature memories. , And performs a transformed spectral slope transformation for correcting the transformed spectrum slope from the spectrum envelope transformation unit 10 by an amount corresponding to a difference between the two mean spectrum slopes, and obtains a normalized spectrum envelope. Ask. The sound source characteristic conversion unit 12 reads out the average sound source characteristic corresponding to the utterance instruction from the feature memory 6, and obtains the deformed sound source characteristic by deforming by linear transformation or the like as necessary. The spectrum synthesis unit 13 uses the normalized spectrum envelope from the spectrum inclination conversion unit 11 and the deformed sound source characteristics from the sound source characteristic conversion unit 12 to generate a high fundamental frequency. The combined spectrum is determined by determining the spectrum intensity over the entire spectrum. The waveform synthesizing unit 14 synthesizes a voice waveform by a sine wave weight method based on the spectrum intensity of the synthesized spectrum.

FIG. 3 to FIG. 5 are flow charts of the voice conversion processing operation by the voice conversion apparatus having the above configuration. Hereinafter, the operation of the voice conversion device will be described to Itoda with reference to FIGS.

In step S1, an initial value "1" is set for the speaker number s. The speaker number s, the subsequent phoneme number X, the conversion destination speaker number sT, and the conversion source speaker number sS are set in a working memory (not shown). Further, as the above-mentioned speaker, a speaker that can be a conversion source speaker S and a conversion destination speaker T when performing voice quality conversion is selected. In step S2, the waveform The speech waveform is input to the analyzer 1.

In step S3, the waveform analysis unit 1 performs a waveform analysis on the input speech waveform to extract cepstrum and prosody information. In step S4, the spectrum envelope extraction unit 2 extracts a spectrum envelope based on the low-order cepstrum coefficient from the waveform analysis unit 1. In step S5, the slope of the approximate straight line when the above-mentioned spectral envelope is approximated by the least squares approximate straight line is extracted as the spectral slope by the spectrum slope extracting unit 3. In step S6, the sound source characteristics are extracted by the sound source characteristics extraction unit 4 based on the higher-order cepstrum coefficients from the waveform analysis unit 1. In step S7, the input speech is recognized by the speech recognition unit 5, and the phoneme number (phoneme name) sequence and the prosodic information of each phoneme section (phoneme duration, average power, average pitch frequency, etc.) are recognized. Is output. Here, it is assumed that the phoneme number is determined in advance in association with the phoneme name and stored in a RAM (random 'access' memory) (not shown).

In the present embodiment, the speech waveform analysis by the waveform analysis unit 1 is referred to as cepstrum analysis, and a spectrum envelope, a spectrum slope, and a sound source characteristic are extracted based on the cepstrum analysis result. I have. However, the audio waveform analysis method by the waveform analysis unit 1 is not limited to this, and any audio waveform analysis method such as LPC (linear predictive analysis) can be used as long as it can extract the spectrum envelope and sound source characteristics. It can be a law.

In step S8, the spectrum envelope extracted by the spectrum envelope extraction unit 2 and the spectrum inclination extracted by the spectrum inclination extraction unit 3 and the sound source characteristic extraction unit 4 extract The sound source characteristics are stored in the feature memory 6 with a label formed by a pair of the speaker number s and the phoneme number X from the voice recognition unit 5. In step S9, it is determined whether or not the learning voice, which is an utterance by the speaker of the speaker number s, has a certain power, that is, whether or not there is a voice input by the same speaker. As a result, if there is any, the process returns to step S2, and shifts to the above-described spectrum envelope, vector inclination and sound source characteristic extraction and speech recognition for the next speech. Otherwise, go to step S10.

In step S10, the phoneme number X is set to an initial value “1”. Step S 11 Then, the averaging unit 7 reads out the spectrum envelope, the vector inclination, and the sound source characteristics to which the speaker number s and the phoneme number X are assigned from the characteristic feature. Then, the read spectrum envelope, vector slope, and sound source characteristics are classified into “phonemes”, “similar phonemes”, “voiced / unvoiced sound segments”, and “entire speech segments”, respectively. You. In step S12, it is determined whether the phoneme number X is equal to or greater than the maximum value x _MAX . As a result, if it is equal to or more than the maximum value x _MAX , the process proceeds to step S14, and if not, the process proceeds to step S13. In step S13, the phoneme number X is incremented. After that, the process returns to step S11, and shifts to the classification of the spectral envelope of the next phoneme, the spectrum W, and the sound source characteristics.

In step S14, the averaging unit 7 sets the "phoneme", "similar phoneme", and "voiced sound section / unvoiced sound section" with respect to the studio envelope, vector slope, and sound source characteristics to which the speaker number s is assigned. ”And“ the entire voice section ”are averaged by linear transformation or the like. The obtained average spectrum envelope, average spectrum slope and average sound source characteristics, and the corresponding phoneme name, similar phoneme name, voiced section / unvoiced section, and labels of the entire voice section are given by the feature memory 6. Stored.

In step S 15, the speaker number s is either force not at maximum value s _MAX or not is decided. As a result, the process proceeds to step S 17 if the maximum value s _MAX or more, the process proceeds to step S 16 if Kere such case. In step S16, the speaker number s is incremented. After that, the process proceeds to step S2 above, and for the next speaker, extraction of the spectrum envelope, spectrum inclination and sound source characteristics, phoneme recognition, spectrum envelope, spectrum inclination and sound source characteristics are performed. Move on to classification and average calculation. When it is determined in step S15 that the speaker number s is equal to or greater than the maximum value sMAX, the process _proceeds to step S17.

In this way, the spectrum envelope, spectrum slope, and sound source characteristics extracted from a large amount of data of the source speaker S and a small amount of data of the destination speaker T are represented by the speaker number s and the phoneme number. It is labeled X and stored. In addition, the average spectral envelope, average spectral slope, and average sound source characteristics for each of “phonemes”, “similar phonemes”, “voiced / unvoiced sound segments”, and “entire speech segments” are represented by speaker number s Phoneme name, similar phoneme name, voiced sound Labels are added to the section / unvoiced section and the entire voice section and stored.

In step S17, the conversion destination speaker number designated from outside is set in the conversion destination speaker number sT. The conversion source speaker number s S is also set to the conversion source speaker number specified externally. In step S18, an initial value “1” is set to the phoneme number X. In step S19, the nonlinear frequency axis spectrum matching unit 8 searches the feature memory 6 for an average spectrum envelope to which the speaker number s corresponding to the conversion destination speaker number sT and the phoneme number are assigned. Then, based on the search result, it is determined whether or not the phoneme data for the conversion destination speaker is stored in the feature memory 6. As a result, if stored, the process proceeds to step S20; otherwise, the process proceeds to step S24. In step S20, the non-linear frequency axis spectrum matching unit 8 obtains, from the feature memory 6, the average spectrum envelope to which the speaker number s corresponding to the source speaker number s S and the phoneme number X are assigned. Searched. Then, based on the search result, it is determined whether or not the data of the phoneme for the conversion source speaker is stored in the feature memory 6. As a result, if it is stored, the process proceeds to step S21; otherwise, the process proceeds to step S24.

In step S21, the nonlinear frequency axis spectrum matching unit 8 uses the nonlinear frequency axis spectrum matching by dynamic programming to apply the average spectrum envelope of the source speaker S and the destination Matching with the average spectral envelope of speaker T is performed. Then, a frequency ビング ving function corresponding to the optimal DP path is obtained.

FIG. 6A shows the concept of nonlinear frequency axis spectrum matching by dynamic programming executed by the nonlinear frequency axis spectrum matching unit 8 described above. With respect to the average spectrum envelope S of the source speaker S and the average spectrum envelope T of the destination speaker T for the same phoneme, the spectrum envelope is divided into L equal parts in the band, and both spectrum envelopes S, Element values representing the output value (spectral intensity) of each channel of T are defined as an element value T i and an element value S j (l≤i, j≤L). Then, the frequency axis is nonlinearly expanded and contracted by dynamic programming so that the two spectral envelopes correspond to each other. In other words, the two A series of lattice points c = (i, j) on a plane consisting of the vector envelope S and T forces

ί — C ₂ , "J ^c ', ^C L

think of. Then, the sequence Fmin that minimizes the sum D along the sequence F of the distance d (i, j) = d (c) between the element value Ti and the element value Sj for the lattice point c = (i, j) is The optimal DP path (frequency rubbing function) is used.

In step S22, the above-described frequency averaging function is transmitted to the frequency warp table memory 9 together with the phoneme number X by the nonlinear frequency axis spectrum matching unit 8. Then, the label of the phoneme number X is given by the frequency warp table memory ⁹ and stored.

As shown in FIG. 6B, the data format of the frequency sorbing function used in the present embodiment is such that the element value of grid point c (i, j) on the DP path is an integer larger than “0”, and The element value of the grid point c (i, j) other than is a matrix with L rows and L columns such that it is "0". It is desirable that the number of band divisions L be large, because the rubbing accuracy is increased. However, if the number is too large, the storage capacity of the frequency warp table memory 9 increases, and the processing time also increases.

In the above description, the nonlinear frequency axis spectrum matching unit 8 calculates the average spectrum envelope S of the conversion source speaker S and the average spectrum envelope T of the conversion target speaker T for the same phoneme. Matching is performed using the element values (spectral intensity) Si and Tj of each channel, but the matching target is not limited to the output value (spectral intensity) of each channel of the spectral envelope. Absent. For example, matching may be performed using the difference (spectral local slope) between the output values between adjacent channels with respect to the mean spectrum envelope S and the mean spectrum envelope T (ΔS and ΔΤ). .

Where Δ Sj = Sj— S (j-1)

ΔΤί = Τί-Τ (ί-1)

Where 2≤ i, j ≤L

In step S23, it is determined whether or not the phoneme number X is equal to or greater than the maximum value x _MAX . As a result, if it is equal to or more than the maximum value x _MAX , the process proceeds to step S25, and if not, the process proceeds to step S24. In step S24, the phoneme number X is incremented. So After that, the process returns to the step S19, and the process shifts to the process of matching the spectral envelope between the source speaker S and the destination speaker T of the next phoneme, and storing the obtained frequency sorbing function.

In step S25, the averaging unit 7 reads out the frequency mapping function for each speaker from the frequency warp table memory 9, and classifies the “similar phonemes” and “voiced sounds” in step S11. The average for each “section / unvoiced sound section” and “entire voice section” is calculated by linear conversion or the like. Then, the obtained average frequency rubbing function (the sum of the frequency rubbing function may be substituted as shown in Fig. 6C) is used to calculate the corresponding analog phoneme name, voiced / unvoiced sound section, and the entire voice section. A label is added and stored by the frequency warp table memory 9.

Thereafter, the process shifts to speech synthesis processing based on the voice quality of the conversion-target speaker based on the utterance instruction. In step S26, the phoneme number X corresponding to the utterance indicating phoneme is input to the spectrum envelope conversion unit 10, the spectrum inclination conversion unit 11 and the sound source characteristic conversion unit 12. In step S27, the spectrum envelope to which the speaker number s corresponding to the conversion source speaker number sS and the phoneme number X are assigned is read from the feature memory 6 by the spectrum envelope converter 10. Further, the average frequency averaging function (the average frequency averaging function between the source speaker number s S and the destination speaker number s) to which the phoneme number X is assigned is read from the frequency warp table memory 9. Then, the spectrum envelope S of the source speaker S is expressed by the following equation using the average frequency rubbing function (element value c (i, j)).

Ti = ∑ Sj * c (i, j) / ∑ c (i, j)

However, it is transformed according to 1 ≤ j ≤ L (or i-a; j ≤ i + Q ;, a is a positive integer), and the transformed vector envelope T (i-channel element The prime value Ti) is determined. As a result, as shown in FIG. 7A, the channel position (j = 4) of the peak Sa of the spectrum envelope S of the source speaker S is changed to the channel position (i) in the modified spectrum envelope T. = 3).

Here, in the present embodiment, the frequency warping table memory 9 stores a plurality of average frequency averaging functions for each phoneme, each similar phoneme, each voiced / unvoiced voice section, and each voice section. Have been. Therefore, as shown below, It is possible to select an appropriate average frequency mapping function according to the amount of utterance data of the speaker T. That is, if there is little or no utterance data for a phoneme (eg, back vowel / ο /), the average frequency of similar phonemes (eg, back vowel / a /) of the phoneme (/ ο /) Select the rubbing function or the average frequency rubbing function of the voiced section. Alternatively, if the utterance data of the phoneme (/ o /) is sufficiently large, the average frequency sorbing function of the phoneme (/ o /) is selected. By doing so, it is possible to cope with the case where the amount of training utterance data of the conversion predecessor T is small, and it is possible to reduce the utterance burden of the conversion predecessor T.

In addition, by calculating the average value of the above-described frequency rubbing function for each phoneme and each similar phoneme, individual differences (soft differences) due to vocal habits such as articulation points and mouth opening are correctly normalized. ing. Therefore, an optimal frequency-binging function can be obtained.

In step S28, the spectrum tilt converter 11 generates an average spectrum from the feature memory 6 to which the speaker number s corresponding to the conversion source speaker number s S and the phoneme number X are assigned. The slope and the average spectrum slope to which the speaker number s corresponding to the conversion destination speaker number sT and the phoneme number X are assigned are read out. Then, as shown in FIG. 7B, the slope of the modified spectrum envelope obtained in step S27 is corrected by the difference between the two average spectrum slopes, and the normalized spectrum envelope is obtained. In this case, too, by selecting an appropriate average spectral slope according to the amount of utterance data of the conversion-target speaker T for learning, the amount of learning utterance data of the conversion-target speaker T is selected. Can be dealt with even if there are few.

In step S29, the sound source characteristic conversion unit 12 reads out the average sound source characteristic to which the speaker number s corresponding to the conversion destination speaker number sT and the phoneme number X are assigned from the feature memory 6. Then, if necessary, the sound source is deformed by linear transformation or the like, and the deformed sound source characteristic is obtained. In step S30, the spectrum combining unit 13 obtains a combined spectrum using the normalized shadow envelope and the deformed sound source characteristics obtained as described above. In this spectrum synthesis method, the spectrum envelope over a high frequency of the fundamental frequency is obtained by synthesizing the normal envelope spectrum envelope and the deformed sound source characteristic. This is done. In step S31, a speech waveform is synthesized by the waveform synthesizing unit 14 by the sine wave weight method based on the spectrum intensity of the synthesized spectrum. Note that the method of synthesizing the speech waveform is not limited to the sine wave weighing method using the synthesized spectrum, but a method in which the normalized spectrum envelope is zero-phased and superimposed for each fundamental frequency. Also, a synthesized waveform can be obtained by a method of performing an inverse Fourier transform on the synthesized spectrum.

In step S32, it is determined whether or not the utterance instruction phoneme designated with the phoneme number X in step S26 is the last utterance instruction phoneme. As a result, if it is not the last utterance indicating phoneme, the process returns to step S26 to shift to the synthesis of the speech waveform relating to the next utterance indicating phoneme. On the other hand, if it is the last utterance indicating phoneme, the voice conversion processing operation ends.

As described above, in the present embodiment, the input voices of the conversion target speaker T and the conversion source speaker S are subjected to cepstrum analysis by the waveform analysis unit 1 and the spectrum envelope is extracted by the spectrum envelope extraction unit 2. Then, the spectrum gradient is extracted by the spectrum gradient extraction unit 3, and the sound source characteristics are extracted by the sound source characteristic extraction unit 4. Then, the averaging unit 7 calculates the average value of the above-mentioned spectral envelope, spectral gradient and sound source characteristic for each of “phonemes”, “similar phonemes”, “voiced / unvoiced sound sections”, and “entire voice section”. The phoneme number of the recognition result by the voice recognition unit 5 is assigned and stored by the feature memory 6. ,

Further, in the nonlinear frequency axis spectrum matching section 8, the average spectrum envelope of the source speaker S and the average spectrum envelope of the destination speaker T are calculated for all phonemes stored in the feature memory 6. The nonlinear frequency axis spectrum matching is performed to find the frequency averaging function corresponding to the optimal DP path. Then, the averaging unit 7 calculates the average value of the frequency mapping function for each of “similar phoneme”, “voiced sound section / unvoiced sound section”, and “entire voice section”, assigns a phoneme number, and stores the frequency warp table memory. Store in 9. Then, when performing voice synthesis with the voice quality of the conversion-target speaker in accordance with the utterance instruction, the following procedure is performed. That is, first, the spectrum envelope conversion unit 10 uses the spectrum envelope of the corresponding phoneme of the conversion source speaker S, and uses the average frequency eving function between the conversion source speaker S and the conversion target story T of the relevant phoneme. The transform envelope of the speaker T (transformation To the envelope. Next, the spectrum slope converter 11 converts the envelope of the transformed spectrum by the difference between the average spectrum slope of the source speaker s and the average spectrum slope of the destination speaker T. The inclination is corrected to obtain the regular envelope spectrum envelope. Then, in the sound source characteristic conversion unit 1 ^2, by modifying the average sound characteristics of the conversion-target speaker τ Ru sought deformation sound characteristics.

After that, the spectrum synthesizer 13 calculates a synthesized spectrum from the normalized spectrum envelope and the deformed sound source characteristic, and the waveform synthesizer 14 generates a speech based on the synthesized spectrum. It combines the waveforms.

That is, in the present embodiment, the frequency envelope of the spectrum envelope of the source speaker is nonlinearly expanded and contracted to obtain the spectrum envelope of the destination speaker, and its slope is corrected to obtain the normalized spectrum envelope. I want to ask. Therefore, it is not necessary to extract the specific position of the spectrum envelope as in the conventional voice conversion method based on the formant frequency or the voice conversion method based on the division point between the peak points of the spectrum envelope. The sound quality is not affected by the accuracy of the extraction.

In addition, the average frequency averaging function used when transforming the above-mentioned spectrum envelope, the average vector slope used when correcting the above-mentioned modified vector envelope slope, and the average used when obtaining the above-mentioned deformed sound source characteristics. The sound source characteristics are obtained for each of "phonemes", "similar phonemes", "voiced / unvoiced sound segments", and "entire speech segments". Therefore, according to the volume of the utterance data of the pre-speaker T stored in the feature memory 6 and the frequency warp table memory 9, the average frequency rubbing function, the average spectrum slope, and the average sound source characteristics in appropriate sections are obtained. By using, it is possible to cope with the case where the amount of the training utterance data of the conversion destination speaker T is small. That is, according to the present embodiment, it is possible to reduce the utterance burden of the conversion destination speaker. Furthermore, individual differences caused by vocal habits can be normalized by obtaining an average frequency rubbing function, an average spectrum slope, and an average sound source characteristic for each phoneme and each similar phoneme.

Note that, in the above embodiment, during speech synthesis with the voice quality of the conversion-target speaker, utterances are made to the spectrum envelope conversion unit 10, the spectrum inclination conversion unit 11, and the sound source characteristic conversion unit 12. The designated phoneme is specified. In the present invention, The method of specifying the utterance indicating phoneme is not limited to this, and it is also possible to directly specify the utterance indicating phoneme by the utterance by the source speaker as follows.

That is, the utterance indicating phoneme is input to the waveform analysis unit 1 by the utterance of the conversion source speaker. Then, the speech recognition unit 5 performs speech recognition based on the time series of the spectrum envelope from the spectrum envelope extraction unit 2 and the prosody information from the waveform analysis unit 1, and generates a phoneme sequence of the recognition result. The prosodic information of the phoneme section and the prosodic information are input to the spectrum envelope converter 1◦, the spectrum tilt converter 11 and the sound source characteristic converter 12 as utterance instruction information. By doing so, the spectrum envelope conversion unit 10 and the spectrum inclination conversion unit 11 read the average frequency-moving function and the average spectrum inclination of the corresponding phoneme according to the input phoneme sequence. On the other hand, the sound source characteristic conversion unit 12 reads out the sound source characteristics of the corresponding phoneme according to the input prosody information. In this way, the utterance of the source speaker is converted in real time.

Also, in the above embodiment, the average spectrum envelope of the same phoneme of the source speaker and the destination speaker is determined in advance, and nonlinear frequency axis spectrum matching is performed using the average spectrum envelope. To obtain the average frequency rubbing function. However, nonlinear frequency axis spectral matching is performed using the individual spectral envelopes of the same phoneme to find the frequency rubbing function, and the frequency rubbing function is averaged within the same phoneme to find the average frequency rubbing function. No problem.

(Second embodiment)

FIG. 8 is a block diagram of the voice quality conversion device according to the present embodiment. In FIG. 8, a waveform analysis unit 21, a sound source characteristic extraction unit 23, a sound source characteristic conversion unit 30, and a waveform synthesis unit 32 are provided in the first embodiment in a waveform analysis unit 1 of the voice quality conversion apparatus shown in FIG. The sound source characteristic extraction unit 4, the sound source characteristic conversion unit 12, and the waveform synthesis unit 14 have the same configuration and operate similarly.

Based on the autocorrelation analysis or covariance analysis extracted by the waveform analysis unit 1, the vocal tract cross-sectional area extraction unit 22 cuts the vocal tract from the glottis to the lips as shown in Figs. 9B and 9D. Extract the area. 9A and 9C show the sound source extraordinary extracted by the sound source characteristic extraction unit 23. The voice recognition unit 24 is a vocal tract cross-sectional area extracted by the vocal tract cross-sectional area extraction unit 22. Speech recognition is performed based on the time series of the prosodic information (such as the pitch and the pitch frequency) extracted by the waveform analysis unit 21 and the prosody information. The feature memory 25 stores the extracted vocal tract cross-sectional area and sound source characteristics with a phoneme label added thereto.

The averaging unit 26 compares the vocal tract cross-sectional area and sound source characteristics of each phoneme for each speaker stored in the feature memory 25 with respect to the phonemes, similar phonemes, voiced / unvoiced sound sections, and the entire voice section. The average value is calculated for each (speaker). Then, the obtained average vocal tract cross-sectional area and average sound source characteristics are labeled with the corresponding phoneme name, similar phoneme name, voiced sound section / unvoiced sound section, or the label of the entire voice section (speaker), and the feature memory 2 is assigned. Store in 5. Furthermore, for the vocal tract axis-ving function of each phoneme stored in the vocal tract axis warp table memory 28 later, for each of the above similar phonemes, voiced / unvoiced sections, and the entire speech section, Calculate the average value. The obtained average vocal tract axis moving function is stored in the vocal tract axis warping table memory 28 with a corresponding similar phoneme name, a label of a voiced section / unvoiced section or the entire voice section.

The nonlinear vocal tract axis matching unit 27 performs the nonlinear vocal tract axis matching by the dynamic programming method for each phoneme similarly to the case of the nonlinear frequency axis spectrum matching unit 8 in the first embodiment. As shown in Fig. 1 OA, matching is performed between the average vocal tract cross-sectional area of source speaker S and the average vocal tract cross-sectional area of target speaker T stored in feature memory 25. Then, a vocal tract axis moving function as shown in FIG. 1OB is obtained, a phoneme name is given, and the vocal tract axis warping table memory 28 is stored. FIG. 10C shows the average vocal tract axis moving function calculated by the averaging unit 26 (substitution of the added value).

The vocal tract cross section converter 29 reads the vocal tract cross section of the source speaker S corresponding to the utterance instruction from the feature memory 25, while reading the vocal tract of the phoneme from the vocal tract axis warp table memory 28. Read the axis rubbing function. Then, using the above vocal tract axis moving function, as shown in FIG. 11, the vocal tract cross-sectional area of the source speaker S is converted to the vocal tract cross-sectional area (transformed vocal tract cross-sectional area) of the destination speaker T, as shown in FIG. Convert. Then, the spectrum synthesizer 31 uses the modified vocal tract cross-sectional area from the vocal tract cross-section converter 29 and the deformed sound source and characteristics from the sound source characteristic converter 30 to calculate the fundamental frequency. By determining the spectrum intensity over high frequencies Then, we seek the composite spectrum.

As described above, in the second embodiment, the vocal tract cross-sectional area having a high relation with the spectrum envelope is used instead of the spectrum envelope in the first embodiment, and the voice of the source speaker is used. The vocal tract axis of the road cross-sectional area is nonlinearly expanded and contracted to obtain the vocal tract cross-sectional area of the predecessor speaker. Therefore, as in the case of the first embodiment, the spectral envelope is specified as in the conventional voice quality conversion method based on the formant frequency or the voice quality conversion method based on the division point between the peak points of the spectrum envelope. There is no need to extract the position, and the sound quality is not affected by the extraction accuracy of the specific position.

In each of the above embodiments, a phoneme is used as the speech unit, but the present invention can be applied to a syllable.

By the way, in each of the above-described embodiments, the waveform analyzers 1 and 2 1, the spectral envelope extractor 2, the spectral gradient extractor 3, the vocal tract cross-sectional area extractor 2 2, the sound source characteristic extractors 4 and 2 3, Speech Recognition Unit 5-24, Averaging Unit 7-26, Nonlinear Frequency Axis Spectrum Matching Unit 8, Nonlinear Vocal Tract Axis Matching Unit 27, Spectral Envelope Transformation Unit 10, Spectral Slope Transform The voice quality conversion processing function by the unit 11, the vocal tract cross-sectional area conversion unit 29, the sound source characteristic conversion unit 12, 30, the spectrum synthesis unit 13-31, and the waveform synthesis unit 14, 32 is as follows. The program is realized by a voice quality conversion processing program recorded on a recording medium. The program recording medium in each of the above embodiments is a program medium such as a ROM (Read Only Memory). Alternatively, it may be a program medium that is attached to and read from an external auxiliary storage device. In any case, the program reading means for reading the voice quality conversion processing program from the program medium may have a configuration of directly accessing and reading the program medium, or a program provided in the RAM. A configuration may be adopted in which the program is downloaded to a storage area (not shown), and the program storage area is accessed and read. It is assumed that a download program for downloading from the program medium to the program storage area of the RAM is stored in the main device in advance.

Here, the above-mentioned program medium is configured to be separable from the main body side, Tapes such as tape cassette tapes, magnetic disks such as floppy disks and hard disks, CDs (compact disks), optical disks such as ROMs, MOs (magneto-optical) disks, MDs (mini disks), DVDs (digital video disks), etc. Discs, card systems such as IC (integrated circuit) cards and optical cards, mask ROMs, EPR OMs (ultraviolet erasable ROMs), EEPR OMs (electrically erasable ROMs), and semiconductor memories such as flash ROMs It is a medium for fixedly carrying programs, including systems. Further, the program medium may be a medium that carries a program in a fluid manner by downloading from a communication network or the like. In this case, it is assumed that the download program for downloading from the communication network is stored in the main unit in advance. Alternatively, it shall be installed from another recording medium.

It should be noted that what is recorded on the recording medium is not limited to a program, but can also be data.

Claims

The scope of the claims

1. A voice conversion device that converts voice of the first speaker's voice into voice of the second speaker's voice,

Extracting the first spectral envelope from the first voice uttered by the first speaker and extracting the second spectral envelope from the second voice uttered by the second speaker Delivery means,

First memory means for storing the extracted first spectrum envelope and second spectrum envelope with a label for each voice unit,

For the same label, a nonlinear frequency expansion / contraction matching is performed on the first spectrum envelope and the second spectrum envelope stored in the first memory, and the correspondence between the frequency axes of both spectrum envelopes is obtained. A non-linear frequency axis spectrum matching means for obtaining a frequency sorbing function representing

A second memory for storing the above-described frequency sorbing function with a sound unit label attached;

The first spectrum envelope of the specified voice unit name is read from the first memory, while the frequency sorbing function of the specified voice unit name is read from the second memory and read out. A voice envelope conversion means for converting the read first spectrum envelope into a spectrum envelope relating to a second speaker based on the obtained frequency envelope function.

2. The voice conversion device according to claim 1,

The non-linear frequency axis spectrum matching means, when performing the non-linear frequency expansion and contraction matching, relates each of the first spectrum envelope and the second spectrum envelope to each other. A voice quality conversion device characterized by using a difference in output value between adjacent channels when the frequency band is divided into a plurality of channels.

3. The voice conversion device according to claim 1,

The first memory means stores the inclination of the first spectrum envelope and the inclination of the second spectrum envelope with a label for each sound unit, and stores the inclination. While extracting the slope of the first spectrum envelope from the first voice uttered by the first speaker, extracting the slope of the second spectrum envelope from the second voice uttered by the second speaker Extracting means to be stored in the first memory means,

The slope of the first spectrum envelope and the slope of the second spectrum envelope of the specified voice unit name are read from the first memory means, and the spectrum envelope is determined based on the difference between the two slopes. A voice quality conversion device comprising a spectrum inclination correction means for correcting the inclination of the spectrum envelope of the second speaker obtained by the conversion means.

4. The voice conversion device according to claim 1,

The speech unit is a phoneme,

The frequency sorbing functions stored in the second memory means are grouped by phoneme, similar phoneme, voiced / unvoiced sound interval and speaker based on the label, and the average value of the frequency sorbing functions belonging to each group is calculated. Calculating means for averaging means for assigning a label to each group name and storing the obtained average frequency working function in the second memory means,

A voice quality conversion device, wherein the spectrum envelope conversion means uses an average frequency rubbing function of any group to which a specified phoneme belongs as the frequency rubbing function.

5. The voice conversion device according to claim 3,

The speech unit is a phoneme,

The slope of the first spectrum envelope and the slope of the second spectrum envelope stored in the first memory means are calculated for each phoneme, similar phoneme, voiced / unvoiced sound section, and speaker based on the label. Grouping, calculating the average value of the slope of the spectrum envelope belonging to each group, assigning the obtained average spectrum slope to the label of each speaker name and each group name, and assigning the label to the first memory means. In addition to the averaging means for storing, the spectral inclination correction means uses the average spectral inclination of any group to which the specified phoneme belongs as the inclination of the spectral envelope. A voice quality conversion device characterized by the following.

6. The voice conversion device according to claim 1, A time series of the extracted first or second spectrum envelope is recognized by an unspecified speaker voice recognition method, and the voice unit name of the recognition result is transmitted to the first memory means. A voice quality conversion device comprising recognition means.

7. The voice conversion device according to claim 6,

The voice recognition means can supply the obtained time series of voice unit names to the spectrum envelope conversion means or the spectrum inclination correction means,

The spectral envelope converting means or the spectral inclination correcting means is adapted to set the time series of the voice unit name obtained by the voice recognition means as the specified voice unit name. Characteristic voice conversion device.

8. The voice conversion device according to claim 4,

The voice quality conversion device according to claim 1, wherein the averaging means calculates the average frequency-diving function by performing linear conversion between frequency-averaging functions to be averaged.

9. The voice conversion device according to claim 8,

The frequency averaging function is obtained by dividing the first spectrum envelope and the second spectrum envelope into a plurality of channels in the same frequency band on a plane formed by the first and second spectrum channels. It has a matrix data format in which different element values are given to grid points corresponding to the DP path and other grid points,

The linear transformation between the frequency-mapping functions is performed by calculating the sum of the element values of the same grid point in a plurality of matrices corresponding to the frequency-mapping function for which the average value is to be calculated, and using the obtained value as the element value. Is the average frequency ヮ ving function.

10. The voice conversion device according to claim 9,

When converting into the intensity in the frequency band having the second spectral envelope, the spectrum envelope converting means corresponds to the corresponding channel of the second spectral envelope in the matrix of the average frequency-moving function to be used. At the grid point in the row or column related to the above, the product sum of the element value of each grid point and the intensity in the channel of the first spectrum envelope corresponding to the grid point is calculated, and the value of the product sum is calculated as the second space. Vector envelope A voice quality conversion device characterized in that the intensity is in the frequency band.

1 1. A voice conversion device for converting voice of the first speaker's voice into voice of the second speaker's voice,

A vocal tract cross section extracting means for extracting the first vocal tract cross section from the first voice uttered by the first speaker, and extracting the second vocal tract cross section from the second voice uttered by the second speaker; A first memory means for storing the extracted first vocal tract cross-sectional area and the second vocal tract cross-sectional area with a voice unit label added thereto;

For the same label, non-linear vocal tract axis expansion and contraction is performed on the first vocal tract cross-sectional area and the second vocal tract cross-sectional area stored in the first memory to obtain a vocal tract axis of both vocal tract cross-sectional areas. Nonlinear vocal tract axis matching means for obtaining a vocal tract axis moving function representing the correspondence between

Second memory means for storing the vocal tract axis moving function with a voice unit label added thereto,

The first vocal tract cross-sectional area of the specified voice unit name is read from the first memory, while the vocal tract axis moving function of the specified voice unit name is read from the second memory. Vocal tract cross-sectional area conversion means for converting the read first vocal tract cross-sectional area into a vocal tract cross-sectional area for the second speaker based on the vocal tract axis moving function. apparatus.

1 2. A voice quality conversion method for converting voice in voice quality of a first speaker into voice in voice quality of a second speaker,

Extracting the first spectrum envelope from the first speech uttered by the first speaker and extracting the second spectrum envelope from the second speech uttered by the second speaker; and For the unit name, nonlinear frequency expansion / contraction matching is performed on the extracted first and second spectral envelopes, and a frequency mapping representing the correspondence between the frequency axes of the two spectral envelopes is performed. Obtaining the function and converting the first spectral envelope of the specified voice unit name into a spectral envelope for the second speaker based on the frequency-moving function of the specified voice unit name. A voice quality conversion method characterized by comprising:

1 3. Regarding the voice quality conversion method according to claim 12,

While extracting the slope of the first spectrum envelope from the first voice uttered by the first speaker, extracting the slope of the second spectrum envelope from the second voice uttered by the second speaker Based on the difference between the slope of the first spectrum envelope and the slope of the second spectrum envelope of the specified voice unit name, the slope of the spectrum envelope for the second speaker obtained above is calculated as A voice quality conversion method comprising a step of correcting.

1 4.

The spectral envelope extracting means, the nonlinear frequency axis spectral matching means, and the spectral envelope converting means according to claim 1. The spectral inclination extracting means and the spectral inclination correcting means according to claim 2.

A program recording medium readable by a computer, in which a voice quality conversion processing program for functioning as a program is recorded.