WO2001078064A1 - Voice character converting device - Google Patents

Voice character converting device

Info

Publication number
WO2001078064A1
WO2001078064A1 (PCT/JP2001/002388)
Authority
WO
WIPO (PCT)
Prior art keywords
voice
speaker
frequency
spectrum
envelope
Prior art date
Application number
PCT/JP2001/002388
Other languages
French (fr)
Japanese (ja)
Inventor
Shin Kamiya
Original Assignee
Sharp Kabushiki Kaisha
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sharp Kabushiki Kaisha filed Critical Sharp Kabushiki Kaisha
Publication of WO2001078064A1 publication Critical patent/WO2001078064A1/en


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/003 Changing voice quality, e.g. pitch or formants
    • G10L21/007 Changing voice quality, e.g. pitch or formants characterised by the process used
    • G10L21/013 Adapting to target pitch
    • G10L2021/0135 Voice conversion or morphing

Definitions

  • The present invention relates to a voice quality conversion device and a voice quality conversion method for converting a synthesized voice or an input voice into the voice quality of a specific speaker and outputting the converted voice, and a program recording medium storing a voice quality conversion processing program.
  • Conventionally, methods of converting voice quality include a method of extracting and converting formant frequencies from the spectrum envelope (for example, Kuwahara and Ogushi, "Independent control of formant frequency and bandwidth and judgment of individuality", Transactions of the Institute of Electronics and Communication Engineers of Japan, Vol. J69-A, No. 4, pp. 509-517 (1986)).
  • There is also a method in which the peak points of the spectrum envelope are obtained, each spectrum envelope is band-divided with reference to the frequencies of the peak points, and the spectrum envelope is deformed using the frequency differences and intensity differences obtained for these division points (for example, Japanese Patent Application Laid-Open No. Hei 9-244694).
  • Furthermore, a method has been proposed in which DP (dynamic programming) matching in the frequency domain is performed between the spectrum envelope sequences of a plurality of vowels uttered in advance by the conversion-source speaker and the conversion-target speaker, and the single optimal DP path thus obtained is used to convert the spectrum envelope of the conversion-source speaker into the spectrum envelope of the conversion-target speaker (for example, Japanese Patent Laid-Open No. Hei 4-147300).
  • However, these conventional voice quality conversion methods have the following problems. In the method of extracting and converting formant frequencies, the sound quality is affected by the extraction accuracy of the formant frequencies. In the method of deforming the spectrum envelope based on the frequency differences and intensity differences between division points determined with reference to the peak-point frequencies, the spectrum bands obtained by the division are affected by the peak-point frequencies, and the sound quality is also affected by the extraction accuracy of low-frequency peak points when the pitch frequency is high.
  • In the method based on a single optimal DP path, the optimal DP path differs for each vowel because of individual differences (soft differences) caused by vocal habits such as how the mouth is opened. If the training vowels are biased toward a group of similar optimal DP paths (for example, the back vowels), a DP path that is slightly inappropriate for the other groups is extracted, and a DP path that is not optimal as a whole is selected. Even if the training vowels can be selected so that the optimal DP paths are not biased, only the individual differences (hard differences) caused by physical factors such as vocal tract shape and vocal tract length are normalized. Moreover, the method is premised on the restriction that the conversion-source speaker and the conversion-target speaker utter the same content (words or sentences), so it cannot be used when the conversion-source speaker's utterance differs or when the voice data is insufficient.
  • Accordingly, an object of the present invention is to provide a voice quality conversion device and a voice quality conversion method capable of reducing the utterance load on the conversion-target speaker and performing more accurate voice quality conversion, and a program recording medium storing a voice quality conversion processing program.
  • In order to achieve the above object, the present invention provides a voice quality conversion device for converting a voice in the voice quality of a first speaker into a voice in the voice quality of a second speaker, comprising: spectrum envelope extracting means for extracting a first spectrum envelope from a first voice uttered by the first speaker and extracting a second spectrum envelope from a second voice uttered by the second speaker; first memory means for storing the extracted first spectrum envelope and second spectrum envelope with a label attached for each voice; nonlinear frequency axis spectrum matching means for performing, for the same label, nonlinear frequency expansion/contraction matching between the first spectrum envelope and the second spectrum envelope stored in the first memory means and obtaining a frequency warping function representing the correspondence between the frequency axes of the two spectrum envelopes; second memory means for storing the frequency warping function with a voice unit label attached; and spectrum envelope conversion means for reading the first spectrum envelope of a specified voice unit name from the first memory means while reading the frequency warping function of the specified voice unit name from the second memory means, and converting the read first spectrum envelope into a spectrum envelope for the second speaker based on the read frequency warping function.
  • With the above configuration, the frequency warping function representing the correspondence between the frequency axes of the first spectrum envelope obtained from the voice of the first speaker and the second spectrum envelope obtained from the voice of the second speaker is used, the frequency axis of the first spectrum envelope of the first speaker for the specified voice unit name is nonlinearly expanded and contracted and converted into a spectrum envelope for the second speaker, and the voice of the second speaker with the specified voice unit name is obtained. Therefore, there is no need to extract a specific position of the first spectrum envelope of the first speaker, and highly accurate voice quality conversion is performed without the sound quality being affected by the extraction accuracy of such a specific position.
  • In one embodiment, the nonlinear frequency axis spectrum matching means performs the nonlinear frequency expansion/contraction matching on the first spectrum envelope and the second spectrum envelope using the difference in output value between adjacent channels when each envelope is divided into a plurality of channels in the frequency band.
  • In one embodiment, spectrum slope extracting means extracts the slope of the first spectrum envelope from the first voice uttered by the first speaker and extracts the slope of the second spectrum envelope from the second voice uttered by the second speaker; the first memory means stores the slope of the first spectrum envelope and the slope of the second spectrum envelope with a label attached for each voice; and spectrum slope correcting means corrects, based on the difference between the two slopes, the slope of the spectrum envelope for the second speaker obtained by the spectrum envelope conversion means.
  • According to this embodiment, the slope of the spectrum envelope obtained for the second speaker is corrected, and a voice closer to the voice quality of the second speaker is obtained.
  • In one embodiment, the speech unit is a phoneme, and averaging means groups the frequency warping functions stored in the second memory means into phonemes, similar phonemes, voiced/unvoiced sound sections, and speakers based on the labels, calculates an average value of the frequency warping functions belonging to each group, assigns a label of each group name to the obtained average frequency warping function, and stores it in the second memory means; the spectrum envelope conversion means then uses, as the frequency warping function, the average frequency warping function of any group to which the specified phoneme belongs.
  • According to this embodiment, the average frequency warping function is obtained for each of the groups "phoneme", "similar phoneme", "voiced/unvoiced sound section", and "speaker". Therefore, according to the amount of utterance data of the second speaker, the average frequency warping function of an appropriate group is selected and used in place of the frequency warping function, so that even a small amount of utterance data of the second speaker can be handled.
  • In one embodiment, the speech unit is a phoneme, and averaging means groups the slope of the first spectrum envelope and the slope of the second spectrum envelope stored in the first memory means into phonemes, similar phonemes, voiced/unvoiced sound sections, and speakers based on the labels, calculates an average value of the spectrum envelope slopes belonging to each group, assigns a label of each group name to the obtained average spectrum slope for each speaker, and stores it in the first memory means; the spectrum slope correcting means then uses, as the slope of the spectrum envelope, the average spectrum slope of any group to which the specified phoneme belongs.
  • According to this embodiment, the average spectrum slope is obtained for each of the groups "phoneme", "similar phoneme", "voiced/unvoiced sound section", and "speaker". Therefore, according to the amount of utterance data of the second speaker stored in the first memory means, the average spectrum slope of an appropriate group is selected and used in place of the slope of the spectrum envelope, so that even a small amount of utterance data of the second speaker can be handled. Further, by calculating the average spectrum slope for each phoneme or each similar phoneme, individual differences due to vocal habits are normalized.
  • In one embodiment, voice recognition means recognizes the time series of the extracted first spectrum envelope or second spectrum envelope by an unspecified-speaker voice recognition method and sends the voice unit name of the recognition result to the first memory means. According to this embodiment, the voice unit name for the label is automatically obtained from the first and second spectrum envelopes extracted from the utterances of the first and second speakers, so the labeling of the spectrum envelopes and of the spectrum envelope slopes is performed easily.
  • In one embodiment, the voice recognition means supplies the time series of the obtained voice unit names to the spectrum envelope conversion means or the spectrum slope correcting means, and the spectrum envelope conversion means or the spectrum slope correcting means uses this time series of voice unit names as the designated voice unit names. According to this embodiment, the voice unit names used when converting the first spectrum envelope of the first speaker's utterance into a spectrum envelope for the second speaker are specified by the time series of voice unit names obtained by the voice recognition means, so the uttered voice of the first speaker is converted directly, in real time, into the voice of the second speaker without inputting the voice unit name sequence to be converted from a keyboard or the like.
  • In one embodiment, the averaging means calculates the average frequency warping function by performing a linear transformation between the frequency warping functions to be averaged.
  • In one embodiment, the frequency warping function is obtained as a matrix whose grid points represent the correspondence between the channels obtained by dividing the first spectrum envelope and the second spectrum envelope into a plurality of channels in the same frequency band. When converting to the intensity in a certain frequency band of the second spectrum envelope, the spectrum envelope conversion means obtains the product sum of the element value of each grid point of the average frequency warping function matrix and the intensity in the channel of the first spectrum envelope corresponding to that grid point, and takes the value of this product sum as the intensity of the second spectrum envelope in that frequency band.
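  • A minimal sketch of the product-sum conversion described above is given below. It is an illustration, not the patent's implementation; the normalization by the column weight is an added assumption (the text only specifies the product sum), and names such as apply_warp are hypothetical.

```python
import numpy as np

def apply_warp(source_envelope: np.ndarray, warp_matrix: np.ndarray) -> np.ndarray:
    """Convert a source spectrum envelope into a deformed envelope for the target speaker.

    source_envelope: shape (L,), channel intensities of the first (source) spectrum envelope.
    warp_matrix:     shape (L, L); element c[i, j] is positive only on the DP path and maps
                     source channel i onto target channel j.
    For each target channel j, the product sum over i of c[i, j] * source_envelope[i] is
    taken as the intensity; dividing by the column weight (an assumption) keeps the result
    on the original intensity scale when several source channels map to one target channel.
    """
    weights = warp_matrix.sum(axis=0)
    weights[weights == 0] = 1.0  # avoid division by zero for channels off the DP path
    return (warp_matrix.T @ source_envelope) / weights

# Example: an identity warp leaves the envelope unchanged.
L = 8
envelope = np.linspace(1.0, 0.2, L)
assert np.allclose(apply_warp(envelope, np.eye(L)), envelope)
```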
  • The present invention also provides a voice quality conversion device for converting a voice in the voice quality of a first speaker into a voice in the voice quality of a second speaker, comprising: vocal tract cross-sectional area extracting means for extracting a first vocal tract cross-sectional area from a first voice uttered by the first speaker while extracting a second vocal tract cross-sectional area from a second voice uttered by the second speaker; first memory means for storing the extracted first vocal tract cross-sectional area and second vocal tract cross-sectional area with a label attached for each speech unit; nonlinear vocal tract axis matching means for performing, for the same label, nonlinear vocal tract axis expansion/contraction matching between the first vocal tract cross-sectional area and the second vocal tract cross-sectional area stored in the first memory means and obtaining a vocal tract axis warping function representing the correspondence between the vocal tract axes of the two vocal tract cross-sectional areas; second memory means for storing the vocal tract axis warping function with a speech unit label attached; and vocal tract cross-sectional area conversion means for reading the first vocal tract cross-sectional area of a specified voice unit name from the first memory means while reading the vocal tract axis warping function of the specified voice unit name from the second memory means, and converting the read first vocal tract cross-sectional area into a vocal tract cross-sectional area for the second speaker based on the read vocal tract axis warping function.
  • With the above configuration, the vocal tract axis warping function representing the correspondence between the vocal tract axes of the first vocal tract cross-sectional area obtained from the voice of the first speaker and the second vocal tract cross-sectional area obtained from the voice of the second speaker is used, the vocal tract axis of the first vocal tract cross-sectional area of the first speaker for the specified voice unit name is nonlinearly expanded and contracted and converted into a vocal tract cross-sectional area for the second speaker, and the voice of the second speaker with the specified voice unit name is obtained. Therefore, it is not necessary to extract a specific position of the first speaker's spectrum envelope, and highly accurate voice quality conversion is performed without the sound quality being affected by the extraction accuracy of such a specific position.
  • The present invention further provides a voice quality conversion method for converting a voice in the voice quality of a first speaker into a voice in the voice quality of a second speaker, comprising the steps of: extracting a first spectrum envelope from a first voice uttered by the first speaker while extracting a second spectrum envelope from a second voice uttered by the second speaker; performing nonlinear frequency expansion/contraction matching on the extracted first and second spectrum envelopes to obtain a frequency warping function representing the correspondence between the frequency axes of the two spectrum envelopes; and converting the first spectrum envelope of a specified voice unit name into a spectrum envelope for the second speaker based on the frequency warping function of the specified voice unit name.
  • According to this method, the frequency axis of the first spectrum envelope of the first speaker for the specified voice unit name is nonlinearly expanded and contracted and converted into a spectrum envelope for the second speaker, and the voice of the second speaker with the specified voice unit name is obtained. Therefore, there is no need to extract a specific position of the first spectrum envelope of the first speaker, and highly accurate voice quality conversion is performed without the sound quality being affected by the extraction accuracy of such a specific position.
  • In one embodiment, the method includes the steps of extracting the slope of the first spectrum envelope from the first voice uttered by the first speaker while extracting the slope of the second spectrum envelope from the second voice uttered by the second speaker, and correcting the slope of the spectrum envelope for the second speaker based on the difference between the slope of the first spectrum envelope and the slope of the second spectrum envelope of the specified voice unit name. According to this embodiment, the slope of the spectrum envelope obtained for the second speaker is corrected, and a voice closer to the voice quality of the second speaker can be obtained.
  • The present invention also provides a program recording medium storing a voice quality conversion processing program for causing a computer to function as the above-mentioned spectrum envelope extracting means, nonlinear frequency axis spectrum matching means, spectrum envelope conversion means, and spectrum slope extracting means.
  • According to this configuration, as in the device described above, the frequency axis of the first spectrum envelope of the first speaker for the specified voice unit name is nonlinearly expanded and contracted and converted into a spectrum envelope for the second speaker, and the slope of the obtained spectrum envelope for the second speaker is corrected based on the difference between the slopes of the first and second spectrum envelopes. In this way, highly accurate voice quality conversion is performed without the sound quality being affected by the extraction accuracy of a specific position of the first speaker's spectrum envelope.
  • FIG. 1 is a block diagram of a voice quality conversion apparatus according to the present invention.
  • FIGS. 2A to 2F are diagrams showing examples of the spectrum envelope, spectrum slope, and sound source characteristics.
  • FIG. 3 is a flowchart of the voice quality conversion processing operation by the voice quality conversion device shown in FIG.
  • FIG. 4 is a flowchart of the voice quality conversion processing operation following FIG.
  • FIG. 5 is a flowchart of the voice quality conversion processing operation following FIG.
  • FIGS. 6A to 6C are diagrams showing the concept of nonlinear frequency axis spectrum matching by dynamic programming.
  • FIGS. 7A and 7B are conceptual diagrams of spectrum envelope normalization.
  • FIG. 8 is a block diagram of a voice quality conversion device different from that of FIG. 1.
  • FIGS. 9A to 9D are diagrams showing examples of the vocal tract cross-sectional area and the sound source characteristics.
  • FIGS. 10A to 10C are diagrams showing the concept of vocal tract axis matching by dynamic programming.
  • FIG. 11 is a conceptual diagram of the deformed vocal tract cross-sectional area.
  • the present invention will be described in detail with reference to the illustrated embodiments.
  • the above-mentioned voice unit is referred to as “phoneme”, but the present invention is not limited to this.
  • FIG. 1 is a block diagram of a voice conversion device according to the present embodiment.
  • the waveform analyzer 1 extracts a cepstrum and prosody information from an input speech waveform.
  • The spectrum envelope extraction unit 2 extracts a spectrum envelope as shown in FIGS. 2C and 2F, based on the low-order cepstrum coefficients extracted by the waveform analysis unit 1.
  • The spectrum slope extraction unit 3 extracts, as the spectrum slope, the slope of the approximation line obtained when the above spectrum envelope is approximated by a least-squares straight line, as shown in FIGS. 2B and 2E.
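  • As a rough illustration of the least-squares slope extraction just described (a sketch under the assumption that the envelope is handled as log-magnitude values over linearly spaced frequency points; the names are hypothetical):

```python
import numpy as np

def spectral_slope(envelope_db: np.ndarray, freqs_hz: np.ndarray) -> float:
    """Slope (dB per Hz) of the least-squares straight line fitted to a spectrum envelope."""
    # np.polyfit with degree 1 returns [slope, intercept] of the best-fit line.
    slope, _intercept = np.polyfit(freqs_hz, envelope_db, 1)
    return slope

# Example: a linearly tilted envelope recovers its own tilt.
freqs = np.linspace(0.0, 4000.0, 64)
tilted_envelope = 60.0 - 0.005 * freqs
print(spectral_slope(tilted_envelope, freqs))  # roughly -0.005
```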
  • The sound source characteristic extraction unit 4 extracts sound source characteristics, as shown in FIGS. 2A and 2D, based on the higher-order cepstrum coefficients extracted by the waveform analysis unit 1.
  • The speech recognition unit 5 performs speech recognition using an HMM (Hidden Markov Model), based on the time series of the spectrum envelope extracted by the spectrum envelope extraction unit 2 and the prosodic information (power, pitch frequency, etc.) extracted by the waveform analysis unit 1. Then, the phoneme (speech unit) sequence of the recognition result is output together with the prosodic information of each phoneme section (phoneme duration, average power, average pitch frequency, etc.).
  • the extracted spectral envelope, spectral tilt, and sound source characteristics are stored in the feature memory 6 with a phoneme label, which is the recognition result of each speaker by the voice recognition unit 5, attached.
  • The averaging unit 7 classifies the spectrum envelope, spectrum slope, and sound source characteristics of each phoneme of each speaker stored in the feature memory 6 into phonemes, similar phonemes, voiced/unvoiced sound sections, and the entire voice section (speaker), and calculates an average value for each class. Then, the obtained average spectrum envelope, average spectrum slope, and average sound source characteristics are labeled with the corresponding phoneme name, similar phoneme name, voiced/unvoiced sound section, or entire voice section (speaker) and stored in the feature memory 6.
  • Likewise, the averaging unit 7 calculates, by linear transformation, an average value of the frequency warping functions of the individual phonemes stored in the frequency warp table memory 9 for each speaker, for each similar phoneme, each voiced/unvoiced sound section, and the entire voice section. The obtained average frequency warping function is then stored in the frequency warp table memory 9 with the label of the corresponding similar phoneme name, voiced/unvoiced sound section, or entire voice section attached.
  • The frequency warping functions stored in the frequency warp table memory 9 are calculated by the nonlinear frequency axis spectrum matching unit 8 as follows. That is, the nonlinear frequency axis spectrum matching unit 8 uses nonlinear frequency axis spectrum matching based on dynamic programming to match, for each phoneme, the average spectrum envelope of the conversion-source speaker S and the average spectrum envelope of the conversion-target speaker T stored in the feature memory 6. The frequency warping function corresponding to the optimal DP path is then obtained, assigned the phoneme name, and stored in the frequency warp table memory 9.
  • The spectrum envelope conversion unit 10 reads the spectrum envelope of the conversion-source speaker S for the phoneme corresponding to the utterance instruction from the feature memory 6, and reads the frequency warping function of that phoneme from the frequency warp table memory 9.
  • If there is little or no data for the corresponding phoneme of the conversion-target speaker in the feature memory 6 and the frequency warp table memory 9, the average frequency warping function of a similar phoneme of that phoneme, of the voiced or unvoiced sound section to which the phoneme belongs, or of the entire voice section is read instead.
  • Then, using the (average) frequency warping function, the spectrum envelope of the conversion-source speaker S is converted into the spectrum envelope of the conversion-target speaker T. The spectrum envelope of the conversion-target speaker T obtained by this conversion is referred to as the "deformed spectrum envelope".
  • The spectrum slope conversion unit 11 reads the average spectrum slope of the conversion-source speaker S and the average spectrum slope of the conversion-target speaker T for the phoneme corresponding to the utterance instruction from the feature memory 6, and corrects the slope of the deformed spectrum envelope from the spectrum envelope conversion unit 10 by an amount corresponding to the difference between the two average spectrum slopes, thereby obtaining a normalized spectrum envelope.
  • the sound source characteristic conversion unit 12 reads out the average sound source characteristic corresponding to the utterance instruction from the feature memory 6, and obtains the deformed sound source characteristic by deforming by linear transformation or the like as necessary.
  • The spectrum synthesis unit 13 uses the normalized spectrum envelope from the spectrum slope conversion unit 11 and the deformed sound source characteristics from the sound source characteristic conversion unit 12 to determine the spectrum intensities over the harmonics of the fundamental frequency, thereby obtaining a synthesized spectrum.
  • The waveform synthesis unit 14 synthesizes a speech waveform by the sinusoidal superposition method based on the spectrum intensities of the synthesized spectrum.
  • FIG. 3 to FIG. 5 are flow charts of the voice conversion processing operation by the voice conversion apparatus having the above configuration.
  • The operation of the voice quality conversion device will now be described in detail with reference to FIGS. 3 to 5.
  • In step S1, an initial value "1" is set for the speaker number s. The speaker number s, the phoneme number x described later, the conversion-target speaker number sT, and the conversion-source speaker number sS are held in a working memory (not shown). As the speakers, speakers who can become the conversion-source speaker S and the conversion-target speaker T when voice quality conversion is performed are selected.
  • In step S2, a speech waveform is input to the waveform analysis unit 1.
  • In step S3, the waveform analysis unit 1 performs waveform analysis on the input speech waveform to extract cepstrum and prosody information.
  • In step S4, the spectrum envelope extraction unit 2 extracts a spectrum envelope based on the low-order cepstrum coefficients from the waveform analysis unit 1.
  • In step S5, the spectrum slope extraction unit 3 extracts, as the spectrum slope, the slope of the approximation line obtained when the spectrum envelope is approximated by a least-squares straight line.
  • In step S6, the sound source characteristic extraction unit 4 extracts the sound source characteristics based on the higher-order cepstrum coefficients from the waveform analysis unit 1.
  • In step S7, the input speech is recognized by the speech recognition unit 5, and the phoneme number (phoneme name) sequence of the recognition result and the prosodic information of each phoneme section (phoneme duration, average power, average pitch frequency, etc.) are output. The phoneme numbers are determined in advance in association with the phoneme names and stored in a RAM (random access memory) (not shown).
  • In this embodiment, cepstrum analysis is used as the speech waveform analysis by the waveform analysis unit 1, and the spectrum envelope, spectrum slope, and sound source characteristics are extracted based on the cepstrum analysis result. However, the speech waveform analysis method used by the waveform analysis unit 1 is not limited to this; any speech waveform analysis method, such as LPC (linear predictive) analysis, can be used as long as it can extract the spectrum envelope and sound source characteristics.
  • In step S8, the spectrum envelope extracted by the spectrum envelope extraction unit 2, the spectrum slope extracted by the spectrum slope extraction unit 3, and the sound source characteristics extracted by the sound source characteristic extraction unit 4 are stored in the feature memory 6 with a label formed by the pair of the speaker number s and the phoneme number x from the voice recognition unit 5.
  • In step S9, it is determined whether or not there remains learning voice uttered by the speaker of speaker number s, that is, whether or not there is further voice input by the same speaker. If there is, the process returns to step S2 and shifts to the extraction of the spectrum envelope, spectrum slope, and sound source characteristics and the speech recognition for the next voice. If not, the process proceeds to step S10.
  • In step S10, the phoneme number x is set to an initial value "1".
  • In step S11, the averaging unit 7 reads the spectrum envelope, spectrum slope, and sound source characteristics labeled with the speaker number s and the phoneme number x from the feature memory 6. The read spectrum envelope, spectrum slope, and sound source characteristics are then each classified into "phoneme", "similar phoneme", "voiced/unvoiced sound section", and "entire voice section".
  • In step S12, it is determined whether or not the phoneme number x is equal to or greater than the maximum value x_MAX. If so, the process proceeds to step S14; otherwise, the process proceeds to step S13.
  • In step S13, the phoneme number x is incremented. The process then returns to step S11 and shifts to the classification of the spectrum envelope, spectrum slope, and sound source characteristics of the next phoneme.
  • In step S14, the averaging unit 7 averages, by linear transformation or the like, the spectrum envelopes, spectrum slopes, and sound source characteristics labeled with the speaker number s, for each of "phoneme", "similar phoneme", "voiced/unvoiced sound section", and "entire voice section". The obtained average spectrum envelope, average spectrum slope, and average sound source characteristics are stored in the feature memory 6 with the labels of the corresponding phoneme name, similar phoneme name, voiced/unvoiced sound section, and entire voice section attached (a simple sketch of this per-group averaging follows).
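  • The per-group averaging in step S14 amounts to a simple mean over the members of each group; below is a sketch under the assumption that every member of a group is sampled on the same grid (the names are hypothetical, not the patent's):

```python
import numpy as np

def group_average(features_by_label: dict, labels_in_group: list) -> np.ndarray:
    """Mean of the feature vectors (e.g. spectrum envelopes or spectrum slopes)
    whose labels belong to the given group ("phoneme", "similar phoneme",
    "voiced/unvoiced sound section", or "entire voice section")."""
    members = [np.asarray(features_by_label[label])
               for label in labels_in_group if label in features_by_label]
    return np.mean(members, axis=0)
```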
  • In step S15, it is determined whether or not the speaker number s is equal to or greater than the maximum value s_MAX. If so, the process proceeds to step S17; otherwise, the process proceeds to step S16.
  • In step S16, the speaker number s is incremented. The process then returns to step S2 above, and for the next speaker the extraction of the spectrum envelope, spectrum slope, and sound source characteristics, the phoneme recognition, and the classification and averaging of the spectrum envelope, spectrum slope, and sound source characteristics are performed.
  • In this way, the spectrum envelope, spectrum slope, and sound source characteristics extracted from a large amount of data of the conversion-source speaker S and from a small amount of data of the conversion-target speaker T are labeled with the speaker number s and the phoneme number x and stored. Furthermore, the average spectrum envelope, average spectrum slope, and average sound source characteristics for each of "phoneme", "similar phoneme", "voiced/unvoiced sound section", and "entire voice section" are stored with the speaker number s and the labels of the phoneme name, similar phoneme name, voiced/unvoiced sound section, and entire voice section attached.
  • In step S17, the conversion-target speaker number specified from outside is set as the conversion-target speaker number sT. Likewise, the conversion-source speaker number specified from outside is set as the conversion-source speaker number sS.
  • In step S18, an initial value "1" is set for the phoneme number x.
  • In step S19, the nonlinear frequency axis spectrum matching unit 8 searches the feature memory 6 for an average spectrum envelope labeled with the speaker number s corresponding to the conversion-target speaker number sT and with the phoneme number x. Based on the search result, it is determined whether or not the data of that phoneme for the conversion-target speaker is stored in the feature memory 6. If it is stored, the process proceeds to step S20; otherwise, the process proceeds to step S24.
  • In step S20, the nonlinear frequency axis spectrum matching unit 8 likewise searches the feature memory 6 for an average spectrum envelope labeled with the speaker number s corresponding to the conversion-source speaker number sS and with the phoneme number x. Based on the search result, it is determined whether or not the data of that phoneme for the conversion-source speaker is stored in the feature memory 6. If it is stored, the process proceeds to step S21; otherwise, the process proceeds to step S24.
  • In step S21, the nonlinear frequency axis spectrum matching unit 8 performs nonlinear frequency axis spectrum matching by dynamic programming between the average spectrum envelope of the conversion-source speaker S and the average spectrum envelope of the conversion-target speaker T. The frequency warping function corresponding to the optimal DP path is then obtained.
  • FIG. 6A shows the concept of nonlinear frequency axis spectrum matching by dynamic programming executed by the nonlinear frequency axis spectrum matching unit 8 described above.
  • As shown in FIG. 6A, each spectrum envelope is divided into L equal parts in the frequency band, and the output values (spectrum intensities) of the channels of the two spectrum envelopes T and S are defined as element values T_i and S_j (1 ≤ i, j ≤ L). The frequency axes are nonlinearly expanded and contracted by dynamic programming so that the two spectrum envelopes correspond to each other, and a series of lattice points c(i, j) on the plane formed by the two spectrum envelopes S and T, namely the optimal DP path, is obtained.
  • In step S22, the frequency warping function is sent by the nonlinear frequency axis spectrum matching unit 8 to the frequency warp table memory 9 together with the phoneme number x, and is stored there with the label of the phoneme number x attached.
  • The data format of the frequency warping function used in this embodiment is an L-row by L-column matrix in which the element value of a grid point c(i, j) on the DP path is an integer greater than "0" and the element value of every other grid point c(i, j) is "0". A larger number of band divisions L gives higher warping accuracy; however, if it is too large, the storage capacity of the frequency warp table memory 9 increases and the processing time also increases.
  • In this embodiment, the nonlinear frequency axis spectrum matching unit 8 performs matching between the average spectrum envelope S of the conversion-source speaker S and the average spectrum envelope T of the conversion-target speaker T for the same phoneme using the element values (spectrum intensities) S_i and T_j of the channels, but the matching target is not limited to the output value (spectrum intensity) of each channel of the spectrum envelope. For example, matching may be performed using the differences between the output values of adjacent channels (the local spectral slopes ΔS and ΔT) of the average spectrum envelope S and the average spectrum envelope T.
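  • The dynamic-programming matching can be sketched as follows. This is a minimal illustration of nonlinear frequency axis matching, not the patent's exact local path constraints or weights; symmetric single-step moves and an absolute-difference channel distance are assumed, and all names are hypothetical.

```python
import numpy as np

def dp_frequency_match(env_s: np.ndarray, env_t: np.ndarray):
    """Nonlinear frequency-axis matching of two L-channel spectrum envelopes.

    Returns the lattice points (i, j) of the minimum-cost path from (0, 0) to
    (L-1, L-1); moves that advance only i, only j, or both are allowed."""
    L = len(env_s)
    cost = np.abs(env_s[:, None] - env_t[None, :])   # local distance between channels
    acc = np.full((L, L), np.inf)
    acc[0, 0] = cost[0, 0]
    for i in range(L):
        for j in range(L):
            if i == 0 and j == 0:
                continue
            prev = min(acc[i - 1, j] if i > 0 else np.inf,                  # advance only the source index
                       acc[i, j - 1] if j > 0 else np.inf,                  # advance only the target index
                       acc[i - 1, j - 1] if i > 0 and j > 0 else np.inf)    # advance both indices
            acc[i, j] = cost[i, j] + prev
    # Backtrack from (L-1, L-1) to (0, 0) to recover the optimal DP path.
    i, j, path = L - 1, L - 1, [(L - 1, L - 1)]
    while (i, j) != (0, 0):
        candidates = []
        if i > 0 and j > 0:
            candidates.append((acc[i - 1, j - 1], (i - 1, j - 1)))
        if i > 0:
            candidates.append((acc[i - 1, j], (i - 1, j)))
        if j > 0:
            candidates.append((acc[i, j - 1], (i, j - 1)))
        _, (i, j) = min(candidates, key=lambda c: c[0])
        path.append((i, j))
    return list(reversed(path))

def path_to_warp_matrix(path, L):
    """L-row by L-column warping matrix with a positive element at every lattice point on the path."""
    warp = np.zeros((L, L))
    for i, j in path:
        warp[i, j] = 1.0
    return warp
```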
  • In step S23, it is determined whether or not the phoneme number x is equal to or greater than the maximum value x_MAX. If so, the process proceeds to step S25; otherwise, the process proceeds to step S24. In step S24, the phoneme number x is incremented, the process returns to step S19, and the matching of the spectrum envelopes of the conversion-source speaker S and the conversion-target speaker T for the next phoneme and the storage of the obtained frequency warping function are carried out.
  • In step S25, the averaging unit 7 reads the frequency warping functions for each speaker from the frequency warp table memory 9 and, based on the classification in step S11, calculates the averages for each "similar phoneme", each "voiced/unvoiced sound section", and the "entire voice section" by linear transformation or the like. The obtained average frequency warping function (for which the sum of the frequency warping functions may be substituted, as shown in FIG. 6C) is stored in the frequency warp table memory 9 with the label of the corresponding similar phoneme name, voiced/unvoiced sound section, or entire voice section attached.
  • In step S26, the phoneme number x corresponding to the utterance-instruction phoneme is input to the spectrum envelope conversion unit 10, the spectrum slope conversion unit 11, and the sound source characteristic conversion unit 12.
  • In step S27, the spectrum envelope conversion unit 10 reads from the feature memory 6 the spectrum envelope labeled with the speaker number s corresponding to the conversion-source speaker number sS and with the phoneme number x. It also reads from the frequency warp table memory 9 the average frequency warping function labeled with the phoneme number x (the average frequency warping function between the conversion-source speaker number sS and the conversion-target speaker number sT). Then, using the element values c(i, j) of the average frequency warping function, the spectrum envelope S of the conversion-source speaker S is converted into the deformed spectrum envelope.
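  • The equation for this conversion is not legible in the present text. Based on the product-sum description given earlier, it can plausibly be reconstructed as follows (a reconstruction, not a verbatim quotation of the original), where S(i) is the intensity of channel i of the conversion-source envelope, c(i, j) is the element value of the average frequency warping function, and T'(j) is the intensity of channel j of the deformed envelope:

```latex
T'(j) = \sum_{i=1}^{L} c(i, j)\, S(i), \qquad 1 \le j \le L
```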
  • The frequency warp table memory 9 stores a plurality of average frequency warping functions: one for each phoneme, each similar phoneme, each voiced/unvoiced sound section, and the entire voice section. Therefore, an appropriate average frequency warping function can be selected according to the amount of utterance data of the conversion-target speaker T. That is, if there is little or no utterance data for a phoneme (for example, the back vowel /o/), the average frequency warping function of a similar phoneme (for example, the back vowel /a/) of that phoneme /o/, or the average frequency warping function of the voiced sound section, is selected. In this way, the case where the amount of training utterance data of the conversion-target speaker T is small can be handled, and the utterance burden on the conversion-target speaker T can be reduced.
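  • The fallback from a phoneme to a similar phoneme, then to the voiced or unvoiced sound section, and finally to the entire voice section can be sketched as a simple lookup (illustrative only; the table keys and function names are hypothetical and not taken from the patent):

```python
def select_warp_function(phoneme, warp_table, similar_phoneme_group, voicing_of):
    """Pick the most specific average frequency warping function available.

    warp_table maps a key (phoneme name, similar-phoneme group name, "voiced",
    "unvoiced", or "all") to a stored average frequency warping function."""
    if phoneme in warp_table:                       # 1. the phoneme itself, e.g. /o/
        return warp_table[phoneme]
    group = similar_phoneme_group.get(phoneme)      # 2. a similar-phoneme group, e.g. back vowels
    if group in warp_table:
        return warp_table[group]
    voicing = voicing_of.get(phoneme, "voiced")     # 3. the voiced/unvoiced sound section
    if voicing in warp_table:
        return warp_table[voicing]
    return warp_table["all"]                        # 4. the entire voice section as the last resort
```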
  • In step S28, the spectrum slope conversion unit 11 reads from the feature memory 6 the average spectrum slope labeled with the speaker number s corresponding to the conversion-source speaker number sS and with the phoneme number x, and the average spectrum slope labeled with the speaker number s corresponding to the conversion-target speaker number sT and with the phoneme number x. The slope of the deformed spectrum envelope obtained in step S27 is then corrected by the difference between the two average spectrum slopes to obtain the normalized spectrum envelope. Here, too, an average spectrum slope of an appropriate group is selected, so even a small amount of training utterance data of the conversion-target speaker T can be handled.
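  • The slope correction of step S28 can be illustrated as follows (a sketch assuming log-magnitude envelopes over linearly spaced frequencies and slopes expressed in dB per Hz; the names are hypothetical):

```python
import numpy as np

def correct_tilt(deformed_envelope_db: np.ndarray, freqs_hz: np.ndarray,
                 avg_slope_source: float, avg_slope_target: float) -> np.ndarray:
    """Re-tilt the deformed spectrum envelope by the difference of the average slopes,
    so that its overall inclination matches the conversion-target speaker."""
    delta = avg_slope_target - avg_slope_source
    return deformed_envelope_db + delta * freqs_hz
```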
  • In step S29, the sound source characteristic conversion unit 12 reads from the feature memory 6 the average sound source characteristic labeled with the speaker number s corresponding to the conversion-target speaker number sT and with the phoneme number x. If necessary, it is then deformed by linear transformation or the like to obtain the deformed sound source characteristic.
  • In step S30, the spectrum synthesis unit 13 obtains a synthesized spectrum using the normalized spectrum envelope and the deformed sound source characteristics obtained as described above. In this spectrum synthesis, the spectrum intensities over the harmonics of the fundamental frequency are determined by combining the normalized spectrum envelope and the deformed sound source characteristic.
  • Next, a speech waveform is synthesized by the waveform synthesis unit 14 by the sinusoidal superposition method based on the spectrum intensities of the synthesized spectrum. The method of synthesizing the speech waveform is not limited to the sinusoidal superposition method using the synthesized spectrum; a synthesized waveform can also be obtained, for example, by a method in which the normalized spectrum envelope is zero-phased and superimposed at every fundamental period, or by a method of performing an inverse Fourier transform on the synthesized spectrum.
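  • One of the alternatives mentioned above, obtaining the waveform by an inverse Fourier transform of the synthesized spectrum, can be sketched as a frame-wise overlap-add (an illustration under the assumption of zero-phase magnitude spectra per frame; it is not the sinusoidal superposition method of the embodiment, and all names are hypothetical):

```python
import numpy as np

def frames_to_waveform(magnitude_frames: np.ndarray, hop: int) -> np.ndarray:
    """Overlap-add synthesis from per-frame magnitude spectra (zero phase assumed).

    magnitude_frames: shape (n_frames, n_fft // 2 + 1), synthesized spectrum intensities."""
    n_frames, n_bins = magnitude_frames.shape
    n_fft = 2 * (n_bins - 1)
    window = np.hanning(n_fft)
    out = np.zeros(hop * (n_frames - 1) + n_fft)
    for k, magnitude in enumerate(magnitude_frames):
        frame = np.fft.irfft(magnitude, n=n_fft)        # zero-phase inverse FFT
        frame = np.roll(frame, n_fft // 2) * window     # center the response in the frame
        out[k * hop:k * hop + n_fft] += frame
    return out
```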
  • In step S32, it is determined whether or not the utterance-instruction phoneme designated by the phoneme number x in step S26 is the last utterance-instruction phoneme. If it is not the last, the process returns to step S26 and shifts to the synthesis of the speech waveform for the next utterance-instruction phoneme. If it is the last, the voice quality conversion processing operation ends.
  • As described above, in this embodiment, the input voices of the conversion-target speaker T and the conversion-source speaker S are subjected to cepstrum analysis by the waveform analysis unit 1, the spectrum envelope is extracted by the spectrum envelope extraction unit 2, the spectrum slope is extracted by the spectrum slope extraction unit 3, and the sound source characteristics are extracted by the sound source characteristic extraction unit 4. The averaging unit 7 then calculates the average values of the spectrum envelope, spectrum slope, and sound source characteristics for each of "phoneme", "similar phoneme", "voiced/unvoiced sound section", and "entire voice section", and they are stored in the feature memory 6 with the phoneme numbers of the recognition results obtained by the voice recognition unit 5 attached.
  • Furthermore, for all phonemes stored in the feature memory 6, nonlinear frequency axis spectrum matching is performed between the average spectrum envelope of the conversion-source speaker S and the average spectrum envelope of the conversion-target speaker T to find the frequency warping function corresponding to the optimal DP path. The averaging unit 7 then calculates the average value of the frequency warping functions for each of "similar phoneme", "voiced/unvoiced sound section", and "entire voice section", attaches the corresponding label, and stores it in the frequency warp table memory 9.
  • When voice is synthesized with the voice quality of the conversion-target speaker in accordance with an utterance instruction, the following procedure is performed.
  • The spectrum envelope conversion unit 10 converts the spectrum envelope of the corresponding phoneme of the conversion-source speaker S into the deformed spectrum envelope for the conversion-target speaker T, using the average frequency warping function between the conversion-source speaker S and the conversion-target speaker T for that phoneme.
  • The spectrum slope conversion unit 11 corrects the slope of the deformed spectrum envelope by the difference between the average spectrum slope of the conversion-source speaker S and the average spectrum slope of the conversion-target speaker T, thereby obtaining the normalized spectrum envelope.
  • The sound source characteristic conversion unit 12 obtains the deformed sound source characteristic by modifying the average sound source characteristic of the conversion-target speaker T.
  • The spectrum synthesis unit 13 then calculates a synthesized spectrum from the normalized spectrum envelope and the deformed sound source characteristic, and the waveform synthesis unit 14 synthesizes a speech waveform based on the synthesized spectrum.
  • That is, the frequency axis of the spectrum envelope of the conversion-source speaker is nonlinearly expanded and contracted to obtain the spectrum envelope of the conversion-target speaker, and its slope is corrected to obtain the normalized spectrum envelope. Therefore, it is not necessary to extract a specific position of the spectrum envelope as in the conventional voice quality conversion method based on formant frequencies or the conventional method based on division points between peak points of the spectrum envelope, and the sound quality is not affected by the accuracy of such extraction.
  • Further, the average frequency warping function used when deforming the spectrum envelope, the average spectrum slope used when correcting the slope of the deformed spectrum envelope, and the average sound source characteristic used when obtaining the deformed sound source characteristic are obtained for each of "phoneme", "similar phoneme", "voiced/unvoiced sound section", and "entire voice section". Therefore, by using the average frequency warping function, average spectrum slope, and average sound source characteristic of an appropriate group according to the amount of utterance data of the conversion-target speaker T stored in the feature memory 6 and the frequency warp table memory 9, the case where the amount of training utterance data of the conversion-target speaker T is small can be handled.
  • In the above description, the utterance-instruction phoneme is specified by inputting the phoneme number x to the spectrum envelope conversion unit 10, the spectrum slope conversion unit 11, and the sound source characteristic conversion unit 12. However, the method of specifying the utterance-instruction phoneme is not limited to this; it is also possible to specify the utterance-instruction phoneme directly by an utterance of the conversion-source speaker, as follows.
  • The utterance-instruction phoneme is input to the waveform analysis unit 1 as an utterance of the conversion-source speaker. The speech recognition unit 5 performs speech recognition based on the time series of the spectrum envelope from the spectrum envelope extraction unit 2 and the prosody information from the waveform analysis unit 1, and the phoneme sequence of the recognition result and the prosodic information of each phoneme section are input to the spectrum envelope conversion unit 10, the spectrum slope conversion unit 11, and the sound source characteristic conversion unit 12 as utterance instruction information. The spectrum envelope conversion unit 10 and the spectrum slope conversion unit 11 read the average frequency warping function and the average spectrum slope of the corresponding phoneme according to the input phoneme sequence, and the sound source characteristic conversion unit 12 reads the sound source characteristic of the corresponding phoneme according to the input prosody information. In this way, the utterance of the conversion-source speaker is converted in real time.
  • In the above embodiment, the average spectrum envelopes of the same phoneme of the conversion-source speaker and the conversion-target speaker are determined in advance, and nonlinear frequency axis spectrum matching is performed using these average spectrum envelopes. However, there is no problem in performing nonlinear frequency axis spectrum matching on the individual spectrum envelopes of the same phoneme to find frequency warping functions and then averaging these frequency warping functions within the same phoneme to find the average frequency warping function.
  • FIG. 8 is a block diagram of the voice quality conversion device according to the present embodiment. The waveform analysis unit 21, sound source characteristic extraction unit 23, sound source characteristic conversion unit 30, and waveform synthesis unit 32 have the same configuration as, and operate in the same way as, the waveform analysis unit 1, sound source characteristic extraction unit 4, sound source characteristic conversion unit 12, and waveform synthesis unit 14 of the voice quality conversion device of the first embodiment shown in FIG. 1.
  • The vocal tract cross-sectional area extraction unit 22 extracts the vocal tract cross-sectional area from the glottis to the lips, as shown in FIGS. 9B and 9D. FIGS. 9A and 9C show the sound source characteristics extracted by the sound source characteristic extraction unit 23.
  • The voice recognition unit 24 performs speech recognition based on the time series of the vocal tract cross-sectional area extracted by the vocal tract cross-sectional area extraction unit 22 and the prosodic information (power, pitch frequency, etc.) extracted by the waveform analysis unit 21.
  • the feature memory 25 stores the extracted vocal tract cross-sectional area and sound source characteristics with a phoneme label added thereto.
  • The averaging unit 26 calculates the average values of the vocal tract cross-sectional area and sound source characteristics of each phoneme of each speaker stored in the feature memory 25, for each phoneme, similar phoneme, voiced/unvoiced sound section, and the entire voice section (speaker). The obtained average vocal tract cross-sectional area and average sound source characteristics are then labeled with the corresponding phoneme name, similar phoneme name, voiced/unvoiced sound section, or entire voice section (speaker) and stored in the feature memory 25.
  • Likewise, the averaging unit 26 calculates the average value of the vocal tract axis warping functions of the individual phonemes stored in the vocal tract axis warp table memory 28, for each similar phoneme, each voiced/unvoiced sound section, and the entire voice section. The obtained average vocal tract axis warping function is stored in the vocal tract axis warp table memory 28 with the label of the corresponding similar phoneme name, voiced/unvoiced sound section, or entire voice section attached.
  • The nonlinear vocal tract axis matching unit 27 performs nonlinear vocal tract axis matching by dynamic programming for each phoneme, in the same way as the nonlinear frequency axis spectrum matching unit 8 of the first embodiment. As shown in FIG. 10A, matching is performed between the average vocal tract cross-sectional area of the conversion-source speaker S and the average vocal tract cross-sectional area of the conversion-target speaker T stored in the feature memory 25. A vocal tract axis warping function as shown in FIG. 10B is then obtained, assigned the phoneme name, and stored in the vocal tract axis warp table memory 28. FIG. 10C shows the average vocal tract axis warping function (for which the summed value may be substituted) calculated by the averaging unit 26.
  • The vocal tract cross-sectional area conversion unit 29 reads the vocal tract cross-sectional area of the conversion-source speaker S for the phoneme corresponding to the utterance instruction from the feature memory 25, while reading the vocal tract axis warping function of that phoneme from the vocal tract axis warp table memory 28. Then, using the vocal tract axis warping function, the vocal tract cross-sectional area of the conversion-source speaker S is converted into the vocal tract cross-sectional area of the conversion-target speaker T (the deformed vocal tract cross-sectional area), as shown in FIG. 11. The spectrum synthesis unit 31 then determines the spectrum intensities over the harmonics of the fundamental frequency using the deformed vocal tract cross-sectional area from the vocal tract cross-sectional area conversion unit 29 and the deformed sound source characteristics from the sound source characteristic conversion unit 30, thereby obtaining the synthesized spectrum.
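  • The vocal tract axis warping is structurally the same operation as the frequency warping of the first embodiment, only applied along the glottis-to-lips axis. A sketch under that assumption (hypothetical names; the normalization is an added assumption, as before):

```python
import numpy as np

def warp_vocal_tract_area(source_area: np.ndarray, axis_warp: np.ndarray) -> np.ndarray:
    """Map a source vocal tract cross-sectional area function (glottis to lips,
    sampled in L sections) onto the conversion-target speaker's vocal tract axis
    using an L-by-L warping matrix, analogously to the spectrum envelope conversion."""
    weights = axis_warp.sum(axis=0)
    weights[weights == 0] = 1.0
    return (axis_warp.T @ source_area) / weights
```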
  • That is, in this embodiment, the vocal tract cross-sectional area, which is closely related to the spectrum envelope, is used in place of the spectrum envelope of the first embodiment, and the vocal tract axis of the vocal tract cross-sectional area of the conversion-source speaker is nonlinearly expanded and contracted to obtain the vocal tract cross-sectional area of the conversion-target speaker. Therefore, as in the first embodiment, it is not necessary to extract a specific position of the spectrum envelope as in the conventional voice quality conversion method based on formant frequencies or the conventional method based on division points between peak points of the spectrum envelope, and the sound quality is not affected by the extraction accuracy of such a specific position.
  • a phoneme is used as the speech unit, but the present invention can be applied to a syllable.
  • The voice quality conversion processing functions of the above units, including the spectrum slope conversion unit 11, the vocal tract cross-sectional area conversion unit 29, the sound source characteristic conversion units 12 and 30, the spectrum synthesis units 13 and 31, and the waveform synthesis units 14 and 32, are realized by a voice quality conversion processing program recorded on a program recording medium.
  • the program recording medium in each of the above embodiments is a program medium such as a ROM (Read Only Memory). Alternatively, it may be a program medium that is attached to and read from an external auxiliary storage device.
  • The program reading means for reading the voice quality conversion processing program from the program medium may be configured to directly access and read the program medium, or may be configured to download the program to a program storage area (not shown) provided in the RAM and to access and read that program storage area. It is assumed that a download program for downloading from the program medium to the program storage area of the RAM is stored in the main device in advance.
  • The above program medium is configured to be separable from the main body and fixedly carries a program; it includes tape systems such as magnetic tapes and cassette tapes; disk systems including magnetic disks such as floppy disks and hard disks, and optical disks such as CD (compact disk)-ROMs, MO (magneto-optical) disks, MDs (mini disks), and DVDs (digital video disks); card systems such as IC (integrated circuit) cards and optical cards; and semiconductor memory systems such as mask ROMs, EPROMs (ultraviolet-erasable ROMs), EEPROMs (electrically erasable ROMs), and flash ROMs.
  • Alternatively, the program medium may be a medium that carries the program dynamically by downloading it from a communication network or the like. In that case, it is assumed that the download program for downloading from the communication network is stored in the main device in advance.

Abstract

The speech load on the conversion-target speaker is lightened, and voice character conversion with high accuracy is effected. A nonlinear frequency axis spectrum matching unit (8) determines a frequency warping function relating the spectrum envelope of the conversion-source speaker to the spectrum envelope of the conversion-target speaker. A frequency warp table memory (9) holds averages of the frequency warping function for each phoneme, each similar phoneme, each voiced/voiceless sound section, and the whole voice section. When the voice character is converted, the spectrum envelope of the conversion-source speaker is converted to the spectrum envelope of the conversion-target speaker by using the average frequency warping function, so that high-accuracy voice character conversion is effected. If the amount of speech data of the conversion-target speaker is small, average frequency warping functions such as those of the similar phoneme and of the voiced/voiceless sound section are used, so that the conversion can be handled even with little speech data and the speech load on the conversion-target speaker is lightened.

Description

技術分野 Technical field
この発明は、 合成音声または明入力音声を特定話者の音質に変換して出力する声 質変換装置および声質変換方法、 並びに、 声質変換処理プログラムを記録したプ 口グラム記録媒体に関する。 書  The present invention relates to a voice conversion device and a voice conversion method for converting a synthesized voice or a bright input voice into a specific speaker's voice and outputting the converted voice, and a program recording medium storing a voice conversion program. book
Background Art
To date, many text-to-speech synthesizers have been developed with the aim of realizing synthesized speech that is more natural and closer to human utterance. Once this goal has been achieved to some extent, it is naturally expected that demand will grow for text-to-speech synthesizers that speak with the voice quality and prosody of a particular speaker, such as a favorite voice actor or actress, a family member or a partner. It is also desirable that the amount of speech data the synthesizer requires for voice quality and prosody conversion be as small as possible, in consideration of the utterance burden on the person who provides it.
Conventional methods of converting voice quality include a method of extracting formant frequencies from the spectrum envelope and converting them (for example, Kuwahara, Ogushi, "Independent control of formant frequency and bandwidth and judgment of individuality", Transactions of the Institute of Electronics and Communication Engineers, Vol. J69-A, No. 4, pp. 509-517 (1986)), and a method of obtaining the peak points of the spectrum envelope, dividing each spectrum envelope into bands with the peak-point frequencies as references, and deforming the spectrum envelope by using the frequency differences and intensity differences obtained at these division points (for example, Japanese Patent Application Laid-Open No. Hei 9-244694).
On the other hand, in the field of speaker-independent speech recognition, it has been reported that simultaneous nonlinear warping of the frequency axis and the intensity axis of the speech spectrum has a remarkable effect on speaker normalization and improves recognition performance (for example, Nakagawa, Kamiya, Sakai, "Recognition of spoken words of unspecified speakers based on simultaneous nonlinear warping of the time, frequency and intensity axes of the speech spectrum", Transactions of the Institute of Electronics and Communication Engineers, Vol. J64-D, No. 2, pp. 116-123 (1981)).
A method has also been proposed in which DP (dynamic programming) matching in the frequency domain is performed in advance between the spectrum envelope sequences (n-dimensional vector sequences) of a plurality of vowels uttered by the conversion-source speaker and the conversion-target speaker, and the spectrum envelope of the conversion-source speaker is converted into the spectrum envelope of the conversion-target speaker by using the single optimal DP path thus obtained (for example, Japanese Patent Application Laid-Open No. Hei 4-147300).
However, the conventional voice quality conversion methods described above have the following problems. In the method of extracting and converting formant frequencies, the sound quality is affected by the accuracy with which the formant frequencies are extracted. In the method of deforming the spectrum envelope on the basis of the frequency differences and intensity differences at division points defined with the peak-point frequencies as references, the bands into which the spectrum is divided are affected by the peak-point frequencies, and the sound quality may also be affected by the extraction accuracy of the low-frequency peak points when the pitch frequency is high.
In the method of performing speaker normalization by simultaneous nonlinear warping of the frequency axis and the intensity axis of the speech spectrum, unless the constraints on the nonlinear warping are set very carefully, not only the individual differences but also the phonemic differences are normalized away, with the result that performance is degraded.
In the method of performing DP matching between the spectrum envelope sequences (n-dimensional vector sequences) of a plurality of vowels uttered by the conversion-source speaker and the conversion-target speaker, when the optimal DP path differs from vowel to vowel because of individual differences (soft differences) caused by speaking habits such as the place of articulation or the degree of mouth opening, the result is biased toward the group of similar optimal DP paths with the most members (for example, the back vowels), a DP path that is somewhat inappropriate for the other groups is extracted, and a DP path that is not optimal as a whole is selected. Even when the training vowels can be chosen so that the optimal DP path is not biased, the resulting DP path normalizes only the individual differences (hard differences) caused by physical differences such as vocal tract shape and vocal tract length, so the improvement in recognition performance obtained by the normalization is not sufficient. Furthermore, since the method presupposes that the conversion-source speaker and the conversion-target speaker utter the same content (a word or a sentence, for example "aiueo, ieaou"), it cannot be used when the content uttered by the conversion-source speaker differs or when the speech data are insufficient.
Thus, the conventional voice quality conversion methods described above cannot be said to be sufficient in terms of voice quality conversion performance.
Disclosure of the Invention
Accordingly, an object of the present invention is to provide a voice quality conversion device and a voice quality conversion method that lighten the utterance burden on the conversion-target speaker and perform voice quality conversion with higher accuracy, as well as a program recording medium on which a voice quality conversion processing program is recorded.
To achieve the above object, according to a first aspect of the present invention, there is provided a voice quality conversion device for converting speech having the voice quality of a first speaker into speech having the voice quality of a second speaker, comprising: spectrum envelope extraction means for extracting a first spectrum envelope from first speech uttered by the first speaker and extracting a second spectrum envelope from second speech uttered by the second speaker; first memory means for storing the extracted first and second spectrum envelopes with labels given in speech units; nonlinear frequency-axis spectrum matching means for performing, for the same label, nonlinear frequency-warping matching between the first spectrum envelope and the second spectrum envelope stored in the first memory means and obtaining a frequency warping function that represents the correspondence between the frequency axes of the two spectrum envelopes; second memory means for storing the frequency warping function with a label given in speech units; and spectrum envelope conversion means for reading the first spectrum envelope of a designated speech unit name from the first memory means, reading the frequency warping function of the designated speech unit name from the second memory means, and converting the read first spectrum envelope into a spectrum envelope for the second speaker on the basis of the read frequency warping function.
According to this configuration, a frequency warping function representing the correspondence between the frequency axes of the first spectrum envelope obtained from the first speaker's speech and the second spectrum envelope obtained from the second speaker's speech is used: the frequency axis of the first speaker's first spectrum envelope for the designated speech unit name is nonlinearly warped and converted into a spectrum envelope for the second speaker, and speech of the designated speech unit name in the second speaker's voice quality is obtained. Therefore, there is no need to extract specific positions from the first speaker's first spectrum envelope, and highly accurate voice quality conversion is performed without the sound quality being affected by the extraction accuracy of such specific positions. In one embodiment, when performing the nonlinear frequency-warping matching, the nonlinear frequency-axis spectrum matching means uses, for the first spectrum envelope and the second spectrum envelope, the differences between the output values of adjacent channels obtained when each spectrum envelope is divided into a plurality of channels along the frequency band.
In one embodiment, the first memory means also stores the tilts of the first spectrum envelope and of the second spectrum envelope with labels given in speech units, and the device further comprises: spectrum tilt extraction means for extracting the tilt of the first spectrum envelope from the first speech uttered by the first speaker, extracting the tilt of the second spectrum envelope from the second speech uttered by the second speaker, and storing them in the first memory means; and spectrum tilt correction means for reading the tilt of the first spectrum envelope and the tilt of the second spectrum envelope of the designated speech unit name from the first memory means and correcting, on the basis of the difference between the two tilts, the tilt of the spectrum envelope for the second speaker obtained by the spectrum envelope conversion means.
According to this embodiment, the tilt of the obtained spectrum envelope for the second speaker is corrected on the basis of the difference between the tilts of the first and second spectrum envelopes for the designated speech unit name, so that speech closer to the voice quality of the second speaker is obtained.
In one embodiment, the speech unit is a phoneme, and the device further comprises averaging means for grouping the frequency warping functions stored in the second memory means by phoneme, similar phoneme, voiced/unvoiced section and speaker on the basis of the labels, calculating the average of the frequency warping functions belonging to each group, and storing the obtained average frequency warping functions in the second memory means with labels of the respective group names; the spectrum envelope conversion means uses, as the frequency warping function, the average frequency warping function of one of the groups to which the designated phoneme belongs. According to this embodiment, average frequency warping functions are obtained for the groups "phoneme", "similar phoneme", "voiced/unvoiced section" and "speaker". Therefore, an average frequency warping function of an appropriate group can be selected and used in place of the frequency warping function, according to the amount of utterance data of the second speaker stored in the first memory means. For example, when there is little or no utterance data for the back vowel /o/, the average frequency warping function of the back vowel /a/, which is a phoneme similar to /o/, or the average frequency warping function of the voiced sections is selected. In this way, the conversion can be handled even when the amount of the second speaker's utterance data is small. Furthermore, by obtaining average frequency warping functions per phoneme and per similar phoneme, individual differences caused by speaking habits are normalized.
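As a rough illustration of this fallback from a phoneme-specific average to coarser group averages, the following Python sketch selects whichever average warping function is available; the function and table layout are hypothetical and are not taken from the patent.

    def select_average_warp(tables, phoneme, similar_phoneme, is_voiced):
        """Sketch of the fallback order described above: the phoneme's own
        average first, then a similar phoneme, then the voiced/unvoiced
        section average, then the whole-speech (speaker) average."""
        candidates = [
            ("phoneme", phoneme),
            ("phoneme", similar_phoneme),
            ("section", "voiced" if is_voiced else "unvoiced"),
            ("speaker", "all"),
        ]
        for key in candidates:
            if key in tables:
                return tables[key]
        raise KeyError("no average frequency warping function available")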
In one embodiment, the speech unit is a phoneme, and the device further comprises averaging means for grouping the tilts of the first spectrum envelope and of the second spectrum envelope stored in the first memory means by phoneme, similar phoneme, voiced/unvoiced section and speaker on the basis of the labels, calculating the average of the spectrum envelope tilts belonging to each group, and storing the obtained average spectrum tilts in the first memory means with labels of the respective speaker names and group names; the spectrum tilt correction means uses, as the spectrum envelope tilt, the average spectrum tilt of one of the groups to which the designated phoneme belongs.
According to this embodiment, average spectrum tilts are obtained for the groups "phoneme", "similar phoneme", "voiced/unvoiced section" and "speaker". Therefore, an average spectrum tilt of an appropriate group can be selected and used in place of the spectrum envelope tilt, according to the amount of utterance data of the second speaker stored in the first memory means. In this way, the conversion can be handled even when the amount of the second speaker's utterance data is small. Furthermore, by obtaining average spectrum tilts per phoneme and per similar phoneme, individual differences caused by speaking habits are normalized.
In one embodiment, the device further comprises speech recognition means for recognizing the time series of the extracted first or second spectrum envelope by a speaker-independent speech recognition method and sending the speech unit names of the recognition result to the first memory means. According to this embodiment, the speech unit names used for labeling are obtained automatically from the first and second spectrum envelopes extracted from the utterances of the first and second speakers, so that labeling of the spectrum envelopes and of the spectrum envelope tilts is performed easily.
In one embodiment, the speech recognition means can supply the time series of the obtained speech unit names to the spectrum envelope conversion means or to the spectrum tilt correction means, and the spectrum envelope conversion means or the spectrum tilt correction means uses the time series of speech unit names obtained by the speech recognition means as the designated speech unit names.
According to this embodiment, the speech unit names used when converting the first spectrum envelope of the first speaker's utterance into a spectrum envelope for the second speaker are designated by the time series of speech unit names obtained by the speech recognition means. Thus, the utterance of the first speaker is converted directly and in real time into speech with the second speaker's voice quality, without entering the sequence of speech unit names to be converted from a keyboard or the like.
In one embodiment, the averaging means calculates the average frequency warping function by performing a linear transformation between the frequency warping functions for which the average is to be calculated. In one embodiment, the frequency warping function has a matrix data format in which, on the plane formed by the channels of the first and second spectra obtained when the first and second spectrum envelopes are divided into a plurality of channels over the same frequency band, different element values are given to the grid points corresponding to the DP path and to the other grid points; the linear transformation between frequency warping functions consists in obtaining the sum of the element values at the same grid point in the plurality of matrices corresponding to the frequency warping functions for which the average is to be calculated, and the matrix whose element values are the obtained values is used as the average frequency warping function.
In one embodiment, when converting to the intensity of the second spectrum envelope in a given frequency band, the spectrum envelope conversion means obtains, over the grid points of the row or column of the average frequency warping matrix that corresponds to the relevant channel of the second spectrum envelope, the sum of the products of the element value of each grid point and the intensity of the channel of the first spectrum envelope corresponding to that grid point, and takes the value of this sum of products as the intensity of the second spectrum envelope in that frequency band.
According to a second aspect of the present invention, there is provided a voice quality conversion device for converting speech having the voice quality of a first speaker into speech having the voice quality of a second speaker, comprising: vocal tract cross-sectional area extraction means for extracting a first vocal tract cross-sectional area from first speech uttered by the first speaker and extracting a second vocal tract cross-sectional area from second speech uttered by the second speaker; first memory means for storing the extracted first and second vocal tract cross-sectional areas with labels given in speech units; nonlinear vocal-tract-axis matching means for performing, for the same label, nonlinear vocal-tract-axis warping matching between the first vocal tract cross-sectional area and the second vocal tract cross-sectional area stored in the first memory means and obtaining a vocal-tract-axis warping function that represents the correspondence between the vocal tract axes of the two vocal tract cross-sectional areas; second memory means for storing the vocal-tract-axis warping function with a label given in speech units; and vocal tract cross-sectional area conversion means for reading the first vocal tract cross-sectional area of a designated speech unit name from the first memory means, reading the vocal-tract-axis warping function of the designated speech unit name from the second memory means, and converting the read first vocal tract cross-sectional area into a vocal tract cross-sectional area for the second speaker on the basis of the read vocal-tract-axis warping function.
According to this configuration, a vocal-tract-axis warping function representing the correspondence between the vocal tract axes of the first vocal tract cross-sectional area obtained from the first speaker's speech and the second vocal tract cross-sectional area obtained from the second speaker's speech is used: the vocal tract axis of the first speaker's first vocal tract cross-sectional area for the designated speech unit name is nonlinearly warped and converted into a vocal tract cross-sectional area for the second speaker, and speech of the designated speech unit name in the second speaker's voice quality is obtained. Therefore, there is no need to extract specific positions from the first speaker's spectrum envelope, and highly accurate voice quality conversion is performed without the sound quality being affected by the extraction accuracy of such specific positions.
According to a third aspect of the present invention, there is provided a voice quality conversion method for converting speech having the voice quality of a first speaker into speech having the voice quality of a second speaker, comprising the steps of: extracting a first spectrum envelope from first speech uttered by the first speaker and extracting a second spectrum envelope from second speech uttered by the second speaker; performing, for the same speech unit name, nonlinear frequency-warping matching between the extracted first spectrum envelope and second spectrum envelope and obtaining a frequency warping function that represents the correspondence between the frequency axes of the two spectrum envelopes; and converting the first spectrum envelope of a designated speech unit name into a spectrum envelope for the second speaker on the basis of the frequency warping function of the designated speech unit name.
According to this configuration, the frequency axis of the first speaker's first spectrum envelope for the designated speech unit name is nonlinearly warped and converted into a spectrum envelope for the second speaker, and speech of the designated speech unit name in the second speaker's voice is obtained. Therefore, there is no need to extract specific positions from the first speaker's first spectrum envelope, and highly accurate voice quality conversion is performed without the sound quality being affected by the extraction accuracy of such specific positions.
In one embodiment, the method further comprises the steps of: extracting the tilt of the first spectrum envelope from the first speech uttered by the first speaker and extracting the tilt of the second spectrum envelope from the second speech uttered by the second speaker; and correcting the tilt of the obtained spectrum envelope for the second speaker on the basis of the difference between the tilt of the first spectrum envelope and the tilt of the second spectrum envelope of the designated speech unit name.
According to this embodiment, the tilt of the obtained spectrum envelope for the second speaker is corrected, and speech closer to the voice quality of the second speaker is obtained.
According to a fourth aspect of the present invention, there is provided a program recording medium on which a voice quality conversion processing program is recorded for causing a computer to function as the above spectrum envelope extraction means, nonlinear frequency-axis spectrum matching means, spectrum envelope conversion means, spectrum tilt extraction means and spectrum tilt correction means. According to this configuration, the frequency axis of the first speaker's first spectrum envelope for the designated speech unit name is nonlinearly warped and converted into a spectrum envelope for the second speaker, and the tilt of the obtained spectrum envelope for the second speaker is corrected on the basis of the difference between the tilts of the first and second spectrum envelopes. Thus, highly accurate voice quality conversion is performed without the sound quality being affected by the extraction accuracy of specific positions of the first speaker's first spectrum envelope.
BRIEF DESCRIPTION OF THE FIGURES
FIG. 1 is a block diagram of a voice quality conversion device according to the present invention.
FIGS. 2A to 2F are diagrams showing examples of the spectrum envelope, the spectrum tilt and the sound source characteristics.
FIG. 3 is a flowchart of the voice quality conversion processing operation performed by the voice quality conversion device shown in FIG. 1.
FIG. 4 is a flowchart of the voice quality conversion processing operation continued from FIG. 3.
FIG. 5 is a flowchart of the voice quality conversion processing operation continued from FIG. 4.
FIGS. 6A to 6C are diagrams showing the concept of nonlinear frequency-axis spectrum matching by dynamic programming.
FIGS. 7A and 7B are conceptual diagrams of spectrum envelope normalization.
FIG. 8 is a block diagram of a voice quality conversion device different from that of FIG. 1.
FIGS. 9A to 9D are diagrams showing examples of the vocal tract cross-sectional area and the sound source characteristics.
FIGS. 10A to 10C are diagrams showing the concept of vocal tract axis matching by dynamic programming.
FIG. 11 is a conceptual diagram of a deformed vocal tract cross-sectional area.
BEST MODE FOR CARRYING OUT THE INVENTION
Hereinafter, the present invention will be described in detail with reference to the illustrated embodiments. In the following description, the speech unit is taken to be the "phoneme", but the present invention is not limited to this.
(First Embodiment)
FIG. 1 is a block diagram of the voice quality conversion device of this embodiment. A waveform analysis unit 1 extracts a cepstrum and prosodic information from an input speech waveform. A spectrum envelope extraction unit 2 extracts a spectrum envelope, as shown in FIGS. 2C and 2F, on the basis of the low-order cepstrum coefficients extracted by the waveform analysis unit 1. A spectrum tilt extraction unit 3 extracts the spectrum tilt, as shown in FIGS. 2B and 2E, which is the slope of the approximating straight line obtained when the spectrum envelope is approximated by a least-squares line. A sound source characteristic extraction unit 4 extracts the sound source characteristics, as shown in FIGS. 2A and 2D, on the basis of the high-order cepstrum coefficients extracted by the waveform analysis unit 1. A speech recognition unit 5 performs speech recognition using an HMM (hidden Markov model) on the basis of the time series of the spectrum envelope extracted by the spectrum envelope extraction unit 2 and of the prosodic information (power, pitch frequency and so on) extracted by the waveform analysis unit 1, and outputs the phoneme (speech unit) sequence of the recognition result together with the prosodic information of each phoneme section (phoneme duration, average power, average pitch frequency and so on). The extracted spectrum envelope, spectrum tilt and sound source characteristics are stored in a feature memory 6 with phoneme labels, which are the recognition results of the speech recognition unit 5 for each speaker.
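The cepstrum-based split into a spectrum envelope (low-order coefficients), sound source characteristics (high-order coefficients) and a least-squares spectrum tilt can be sketched as follows in Python; the frame length, FFT size and lifter order are illustrative assumptions, not values taken from the patent.

    import numpy as np

    def analyze_frame(frame, fs=16000, n_fft=1024, n_low=30):
        # Real cepstrum of the windowed frame.
        spectrum = np.fft.rfft(frame * np.hanning(len(frame)), n_fft)
        log_mag = np.log(np.abs(spectrum) + 1e-12)
        cepstrum = np.fft.irfft(log_mag, n_fft)
        # Low-quefrency lifter -> spectrum envelope; the remainder -> source part.
        lifter = np.zeros(n_fft)
        lifter[:n_low] = 1.0
        lifter[-(n_low - 1):] = 1.0
        envelope = np.fft.rfft(cepstrum * lifter, n_fft).real
        source = log_mag - envelope
        # Spectrum tilt: slope of the least-squares line fitted to the log envelope.
        freqs = np.linspace(0.0, fs / 2.0, len(envelope))
        tilt = np.polyfit(freqs, envelope, 1)[0]
        return envelope, tilt, source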
An averaging unit 7 classifies the spectrum envelope, spectrum tilt and sound source characteristics of each phoneme of each speaker stored in the feature memory 6 by phoneme, similar phoneme, voiced/unvoiced section and whole speech section (speaker), and calculates their averages by linear transformation or the like. The obtained average spectrum envelopes, average spectrum tilts and average sound source characteristics are stored in the feature memory 6 with labels of the corresponding phoneme name, similar phoneme name, voiced/unvoiced section or whole speech section (speaker). Furthermore, as described in detail later, the averaging unit 7 classifies the frequency warping functions of each phoneme of each speaker stored in a frequency warp table memory 9 by similar phoneme, voiced/unvoiced section and whole speech section, calculates their averages by linear transformation or the like, and stores the obtained average frequency warping functions in the frequency warp table memory 9 with labels of the corresponding similar phoneme name, voiced/unvoiced section or whole speech section.
The frequency warping functions stored in the frequency warp table memory 9 are calculated by a nonlinear frequency-axis spectrum matching unit 8 as follows. For each phoneme, the nonlinear frequency-axis spectrum matching unit 8 matches the average spectrum envelope of the conversion-source speaker S stored in the feature memory 6 against the average spectrum envelope of the conversion-target speaker T by nonlinear frequency-axis spectrum matching based on dynamic programming, obtains the frequency warping function corresponding to the optimal DP path, and stores it in the frequency warp table memory 9 with the phoneme name attached.
A spectrum envelope conversion unit 10 reads from the feature memory 6 the spectrum envelope of the conversion-source speaker S for the phoneme corresponding to an utterance instruction, and reads the frequency warping function of that phoneme from the frequency warp table memory 9. When the feature memory 6 and the frequency warp table memory 9 contain little or no data on the corresponding phoneme for the conversion-target speaker, the average frequency warping function of a similar phoneme, of the section of the same kind as the phoneme (voiced or unvoiced section), or of the whole speech section is read instead. The spectrum envelope of the conversion-source speaker S is then converted into a spectrum envelope of the conversion-target speaker T by using this (average) frequency warping function. The spectrum envelope of the conversion-target speaker T obtained by this conversion is hereinafter referred to as the "deformed spectrum envelope".
A spectrum tilt conversion unit 11 reads from the feature memory 6 the average spectrum tilt of the conversion-source speaker S and the average spectrum tilt of the conversion-target speaker T for the phoneme corresponding to the utterance instruction, performs a deformed-spectrum-tilt conversion in which the tilt of the deformed spectrum envelope from the spectrum envelope conversion unit 10 is corrected by the difference between the two average spectrum tilts, and obtains a normalized spectrum envelope. A sound source characteristic conversion unit 12 reads the average sound source characteristics corresponding to the utterance instruction from the feature memory 6 and, if necessary, deforms them by linear transformation or the like to obtain deformed sound source characteristics. A spectrum synthesis unit 13 obtains a synthesized spectrum by determining the spectrum intensities over the harmonics of the fundamental frequency, using the normalized spectrum envelope from the spectrum tilt conversion unit 11 and the deformed sound source characteristics from the sound source characteristic conversion unit 12. A waveform synthesis unit 14 synthesizes a speech waveform by sinusoidal superposition on the basis of the spectrum intensities of the synthesized spectrum.
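A minimal sketch of this synthesis side, assuming the envelope and source characteristics are combined in the log-spectral domain and the waveform is rebuilt by superposing sinusoids at multiples of the fundamental frequency; the patent does not spell out its sinusoidal synthesis in this passage, so the parameter names and the zero-phase choice are assumptions.

    import numpy as np

    def synthesize_frame(log_envelope, log_source, f0, fs=16000, length=400):
        # Combined log spectrum = normalized envelope + deformed source part.
        log_spec = np.asarray(log_envelope) + np.asarray(log_source)
        freqs = np.linspace(0.0, fs / 2.0, len(log_spec))
        t = np.arange(length) / fs
        out = np.zeros(length)
        k = 1
        while k * f0 < fs / 2.0:
            amp = np.exp(np.interp(k * f0, freqs, log_spec))  # harmonic amplitude
            out += amp * np.sin(2.0 * np.pi * k * f0 * t)     # zero phase assumed
            k += 1
        return out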
FIGS. 3 to 5 are flowcharts of the voice quality conversion processing operation performed by the voice quality conversion device having the above configuration. The operation of the voice quality conversion device is described in detail below with reference to FIGS. 3 to 5.
In step S1, the speaker number s is set to an initial value of "1". The speaker number s, and the phoneme number x, conversion-target speaker number sT and conversion-source speaker number sS that appear later, are held in a working memory (not shown) or the like. The speakers are chosen from those who can become the conversion-source speaker S or the conversion-target speaker T when voice quality conversion is performed. In step S2, a speech waveform is input to the waveform analysis unit 1.
In step S3, the waveform analysis unit 1 performs waveform analysis on the input speech waveform and extracts a cepstrum and prosodic information. In step S4, the spectrum envelope extraction unit 2 extracts a spectrum envelope on the basis of the low-order cepstrum coefficients from the waveform analysis unit 1. In step S5, the spectrum tilt extraction unit 3 extracts, as the spectrum tilt, the slope of the approximating straight line obtained when the spectrum envelope is approximated by a least-squares line. In step S6, the sound source characteristic extraction unit 4 extracts the sound source characteristics on the basis of the high-order cepstrum coefficients from the waveform analysis unit 1. In step S7, the speech recognition unit 5 recognizes the input speech and outputs the phoneme number (phoneme name) sequence of the recognition result and the prosodic information of each phoneme section (phoneme duration, average power, average pitch frequency and so on). The phoneme numbers are determined in advance in association with phoneme names and are stored in a RAM (random access memory) (not shown).
In this embodiment, the speech waveform analysis performed by the waveform analysis unit 1 is cepstrum analysis, and the spectrum envelope, the spectrum tilt and the sound source characteristics are extracted from the results of this cepstrum analysis. However, the speech waveform analysis method used by the waveform analysis unit 1 is not limited to this; any analysis method capable of extracting a spectrum envelope and sound source characteristics, such as LPC (linear prediction analysis), may be used.
In step S8, the spectrum envelope extracted by the spectrum envelope extraction unit 2, the spectrum tilt extracted by the spectrum tilt extraction unit 3 and the sound source characteristics extracted by the sound source characteristic extraction unit 4 are stored in the feature memory 6 with a label consisting of the pair of the speaker number s and the phoneme number x from the speech recognition unit 5. In step S9, it is determined whether there is further training speech uttered by the speaker with speaker number s, that is, whether there is further speech input by the same speaker. If there is, the processing returns to step S2 and proceeds to the extraction of the spectrum envelope, spectrum tilt and sound source characteristics and to the speech recognition for the next speech; otherwise the processing proceeds to step S10.
In step S10, the phoneme number x is set to an initial value of "1". In step S11, the averaging unit 7 reads from the feature memory 6 the spectrum envelope, spectrum tilt and sound source characteristics labeled with the speaker number s and the phoneme number x, and classifies each of them by "phoneme", "similar phoneme", "voiced/unvoiced section" and "whole speech section". In step S12, it is determined whether the phoneme number x is equal to or greater than the maximum value xMAX. If so, the processing proceeds to step S14; otherwise it proceeds to step S13. In step S13, the phoneme number x is incremented, and the processing then returns to step S11 and proceeds to the classification of the spectrum envelope, spectrum tilt and sound source characteristics of the next phoneme.
In step S14, the averaging unit 7 calculates, by linear transformation or the like, the averages per "phoneme", "similar phoneme", "voiced/unvoiced section" and "whole speech section" of the spectrum envelopes, spectrum tilts and sound source characteristics labeled with the speaker number s. The obtained average spectrum envelopes, average spectrum tilts and average sound source characteristics are stored in the feature memory 6 with labels of the corresponding phoneme name, similar phoneme name, voiced/unvoiced section and whole speech section.
In step S15, it is determined whether the speaker number s is equal to or greater than the maximum value sMAX. If so, the processing proceeds to step S17; otherwise it proceeds to step S16. In step S16, the speaker number s is incremented, and the processing then returns to step S2 and proceeds, for the next speaker, to the extraction of the spectrum envelope, spectrum tilt and sound source characteristics, the phoneme recognition, the classification of these features and the calculation of their averages. When it is determined in step S15 that the speaker number s is equal to or greater than the maximum value sMAX, the processing moves to step S17.
In this way, the spectrum envelopes, spectrum tilts and sound source characteristics extracted from a large amount of data of the conversion-source speaker S and a small amount of data of the conversion-target speaker T are accumulated with labels of the speaker number s and the phoneme number x. In addition, the average spectrum envelopes, average spectrum tilts and average sound source characteristics per "phoneme", "similar phoneme", "voiced/unvoiced section" and "whole speech section" are accumulated with labels of the speaker number s and of the phoneme name, similar phoneme name, voiced/unvoiced section and whole speech section.
In step S17, the conversion-target speaker number sT is set to the conversion-target speaker number designated externally, and the conversion-source speaker number sS is likewise set to the conversion-source speaker number designated externally. In step S18, the phoneme number x is set to an initial value of "1". In step S19, the nonlinear frequency-axis spectrum matching unit 8 searches the feature memory 6 for an average spectrum envelope labeled with the speaker number s corresponding to the conversion-target speaker number sT and with the phoneme number x, and determines from the search result whether data of that phoneme for the conversion-target speaker are stored in the feature memory 6. If they are, the processing proceeds to step S20; otherwise it proceeds to step S24. In step S20, the nonlinear frequency-axis spectrum matching unit 8 searches the feature memory 6 for an average spectrum envelope labeled with the speaker number s corresponding to the conversion-source speaker number sS and with the phoneme number x, and determines from the search result whether data of that phoneme for the conversion-source speaker are stored in the feature memory 6. If they are, the processing proceeds to step S21; otherwise it proceeds to step S24.
In step S21, the nonlinear frequency-axis spectrum matching unit 8 matches, for the phoneme in question, the average spectrum envelope of the conversion-source speaker S against the average spectrum envelope of the conversion-target speaker T by nonlinear frequency-axis spectrum matching based on dynamic programming, and obtains the frequency warping function corresponding to the optimal DP path.
FIG. 6A shows the concept of the nonlinear frequency-axis spectrum matching by dynamic programming executed by the nonlinear frequency-axis spectrum matching unit 8. For the average spectrum envelope S of the conversion-source speaker S and the average spectrum envelope T of the conversion-target speaker T for the same phoneme, each spectrum envelope is divided into L equal bands, and the element values representing the output value (spectrum intensity) of each channel of the two spectrum envelopes S and T are denoted Ti and Sj (1 ≤ i, j ≤ L). The frequency axes are then nonlinearly warped by dynamic programming so that the two spectrum envelopes correspond to each other. That is, a sequence of grid points c = (i, j) on the plane formed by the two spectrum envelopes S and T to be matched,
F = c1, c2, ..., ck, ..., cK,
is considered, and the sequence Fmin that minimizes the cumulative sum D, along the sequence F, of the distances d(i, j) = d(c) between the element values Ti and Sj at the grid points c = (i, j) is taken as the optimal DP path (frequency warping function).
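A minimal dynamic-programming sketch in Python of this matching: S and T are the L-channel envelopes, and the returned L x L matrix marks the grid points of the optimal path with ones, matching the matrix format described next. The local path constraints (vertical, horizontal and diagonal moves) are an assumption, since the patent does not state them explicitly here.

    import numpy as np

    def frequency_warping_function(S, T):
        L = len(S)
        d = np.abs(np.subtract.outer(np.asarray(T, float), np.asarray(S, float)))
        g = np.full((L, L), np.inf)          # cumulative distance g(i, j)
        g[0, 0] = d[0, 0]
        for i in range(L):
            for j in range(L):
                if i == 0 and j == 0:
                    continue
                best = min(g[i - 1, j] if i > 0 else np.inf,
                           g[i, j - 1] if j > 0 else np.inf,
                           g[i - 1, j - 1] if i > 0 and j > 0 else np.inf)
                g[i, j] = d[i, j] + best
        # Trace the optimal DP path back from (L-1, L-1) and mark it with 1s
        # (any positive integer works for the matrix format described below).
        C = np.zeros((L, L), dtype=int)
        i = j = L - 1
        C[i, j] = 1
        while i > 0 or j > 0:
            moves = []
            if i > 0 and j > 0:
                moves.append((g[i - 1, j - 1], i - 1, j - 1))
            if i > 0:
                moves.append((g[i - 1, j], i - 1, j))
            if j > 0:
                moves.append((g[i, j - 1], i, j - 1))
            _, i, j = min(moves)
            C[i, j] = 1
        return C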
In step S22, the nonlinear frequency-axis spectrum matching unit 8 sends this frequency warping function, together with the phoneme number x, to the frequency warp table memory 9, where it is stored with the label of the phoneme number x.
The data format of the frequency warping function used in this embodiment is, as shown in FIG. 6B, an L-row, L-column matrix in which the element value of a grid point c(i, j) on the DP path is an integer greater than "0" and the element value of a grid point c(i, j) off the DP path is "0". A larger number of band divisions L is desirable because it raises the warping accuracy; if it is too large, however, the storage capacity required of the frequency warp table memory 9 becomes large and the processing time becomes long.
In the above description, the nonlinear frequency-axis spectrum matching unit 8 performs the matching using the element values (spectrum intensities) Si and Tj of the channels of the average spectrum envelope S of the conversion-source speaker S and the average spectrum envelope T of the conversion-target speaker T for the same phoneme; however, the matching target is not limited to the output value (spectrum intensity) of each channel of the spectrum envelopes. For example, the matching may be performed using the differences between the output values of adjacent channels (local spectrum slopes) ΔS and ΔT of the average spectrum envelopes S and T, where
ΔSj = Sj − S(j−1)
ΔTi = Ti − T(i−1)
with 2 ≤ i, j ≤ L.
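The local spectrum slope mentioned here is simply the adjacent-channel difference; in Python it can be computed and passed to the same matching routine in place of the raw channel intensities (a sketch, assuming the routine given above):

    import numpy as np

    def local_slope(envelope):
        # delta[k] = envelope[k] - envelope[k - 1], for 2 <= k <= L
        return np.diff(np.asarray(envelope, dtype=float))

    # e.g. C = frequency_warping_function(local_slope(S), local_slope(T))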
In step S23, it is determined whether the phoneme number x is equal to or greater than the maximum value xMAX. If so, the processing proceeds to step S25; otherwise it proceeds to step S24. In step S24, the phoneme number x is incremented, and the processing then returns to step S19 and proceeds to the matching of the spectrum envelopes of the conversion-source speaker S and the conversion-target speaker T for the next phoneme and to the storage of the obtained frequency warping function.
In step S25, the averaging unit 7 reads the frequency warping functions of each speaker from the frequency warp table memory 9 and calculates, by linear transformation or the like, the averages per "similar phoneme", "voiced/unvoiced section" and "whole speech section" classified in step S11. The obtained average frequency warping functions (the summed values of the frequency warping functions may be used instead, as shown in FIG. 6C) are stored in the frequency warp table memory 9 with labels of the corresponding similar phoneme name, voiced/unvoiced section and whole speech section.
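The per-group averaging of the warping matrices amounts to summing the element values at the same grid points of the matrices belonging to a group (and, as noted above, the summed matrix itself may stand in for the average). A short Python sketch under that reading:

    import numpy as np

    def average_warping_function(matrices, use_sum=False):
        total = np.zeros_like(np.asarray(matrices[0], dtype=float))
        for m in matrices:
            total += np.asarray(m, dtype=float)   # element-wise sum at each grid point
        return total if use_sum else total / len(matrices)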
以降、 発声指示に基づく変換先話者の声質での音声合成処理に移行する。 ステ ップ S 26で、 スぺクトル包絡変換部 1 0,スぺクトル傾き変換部 1 1および音源 特性変換部 1 2に対して、 発声指示音素に該当する音素番号 Xが入力される。 ス テツプ S27で、 スペクトル包絡変換部 1 0によって、 特徴メモリ 6から変換元話 者番号 sSに該当する話者番号 sと当該音素番号 Xとが付与されたスぺクトル包 絡が読み出される。 さらに、 周波数ワープ表メモリ 9から当該音素番号 Xが付与 された平均周波数ヮービング関数 (変換元話者番号 s Sと変換先話者番号 s丁との 間の平均周波数ヮービング関数)が読み出される。 そして、 変換元話者 Sのスぺ クトル包絡 Sが、 平均周波数ヮービング関数 (要素値 c (i , j ))を用いて次式  Thereafter, the process shifts to speech synthesis processing based on the voice quality of the conversion-target speaker based on the utterance instruction. In step S26, the phoneme number X corresponding to the utterance indicating phoneme is input to the spectrum envelope conversion unit 10, the spectrum inclination conversion unit 11 and the sound source characteristic conversion unit 12. In step S27, the spectrum envelope to which the speaker number s corresponding to the conversion source speaker number sS and the phoneme number X are assigned is read from the feature memory 6 by the spectrum envelope converter 10. Further, the average frequency averaging function (the average frequency averaging function between the source speaker number s S and the destination speaker number s) to which the phoneme number X is assigned is read from the frequency warp table memory 9. Then, the spectrum envelope S of the source speaker S is expressed by the following equation using the average frequency rubbing function (element value c (i, j)).
Ti = Σj Sj · c(i, j) / Σj c(i, j),  where 1 ≤ j ≤ L (or i − α ≤ j ≤ i + α, with α a positive integer),
yielding the transformed spectral envelope T (element value Ti of channel i) for the target speaker T. As a result, as shown in Fig. 7A, the channel position (j = 4) of the peak Sa of the spectral envelope S of the source speaker S is warped to channel position (i = 3) in the transformed spectral envelope T.
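A minimal Python sketch of this transformation, assuming the warping function is held as an L x L matrix of element values c(i, j), might look as follows; the zero-row guard is an added assumption not stated in the text.

```python
import numpy as np

def warp_envelope(source_env, c):
    """Warp a source spectral envelope with an (average) frequency warping matrix.

    source_env: length-L array of source-channel intensities S_j.
    c:          L x L matrix of warping-function element values c(i, j).
    Returns the length-L array T_i = sum_j S_j * c(i, j) / sum_j c(i, j).
    """
    source_env = np.asarray(source_env, dtype=float)
    c = np.asarray(c, dtype=float)
    weights = c.sum(axis=1)
    weights[weights == 0.0] = 1.0        # guard against all-zero rows (assumption)
    return (c @ source_env) / weights
```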
Here, in the present embodiment, the frequency warp table memory 9 stores a plurality of average frequency warping functions: one for each phoneme, each similar-phoneme group, each voiced/unvoiced section, and the entire speech section. An appropriate average frequency warping function can therefore be selected according to the amount of training utterance data available from the target speaker T, as follows. If there is little or no utterance data for a given phoneme (for example, the back vowel /o/), the average frequency warping function of a similar phoneme (for example, the back vowel /a/) or the average frequency warping function of the voiced sections is selected. If, on the other hand, there is sufficient utterance data for the phoneme (/o/), the average frequency warping function of that phoneme (/o/) itself is selected. In this way, the device can cope even when the amount of training utterance data from the target speaker T is small, and the utterance burden on the target speaker T can be reduced.
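The fallback selection described here could be organized as in the following sketch; the group labels, the similar-phoneme table, and the frame-count threshold are all hypothetical.

```python
# Hypothetical tables: similar-phoneme mapping and the set of voiced phonemes.
SIMILAR = {"/o/": "/a/", "/e/": "/i/"}
VOICED = {"/a/", "/i/", "/u/", "/e/", "/o/", "/m/", "/n/"}

def select_warp_function(phoneme, warp_table, frame_counts, min_frames=50):
    """Pick the most specific average warping function that has enough data.

    warp_table maps labels such as "phoneme:/o/", "similar:/a/", "voiced",
    or "all" to average warping functions; frame_counts gives the number of
    target-speaker frames behind each entry.
    """
    candidates = [
        "phoneme:" + phoneme,
        "similar:" + SIMILAR.get(phoneme, phoneme),
        "voiced" if phoneme in VOICED else "unvoiced",
        "all",
    ]
    for label in candidates:
        if label in warp_table and frame_counts.get(label, 0) >= min_frames:
            return warp_table[label]
    return warp_table["all"]    # the entire-speech-section entry always exists
```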
In addition, by averaging the frequency warping functions for each phoneme and for each similar-phoneme group, individual differences (soft differences) caused by speaking habits such as articulation points and the degree of mouth opening are correctly normalized, so an optimal frequency warping function is obtained.
In step S28, the spectral tilt conversion unit 11 reads from the feature memory 6 the average spectral tilt labeled with the speaker number s corresponding to the source speaker number sS and with the phoneme number x, and the average spectral tilt labeled with the speaker number s corresponding to the target speaker number sT and with the phoneme number x. Then, as shown in Fig. 7B, the tilt of the transformed spectral envelope obtained in step S27 is corrected by the difference between the two average spectral tilts, yielding the normalized spectral envelope. In this case as well, by selecting an appropriate average spectral tilt according to the amount of utterance data available from the target speaker T, the device can cope even when the amount of training utterance data from the target speaker T is small.
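Assuming the average spectral tilt is stored as a single slope value per phoneme (a simplification; the embodiment does not fix the representation), the correction of step S28 might be sketched as follows.

```python
import numpy as np

def correct_tilt(warped_env_db, src_tilt, tgt_tilt):
    """Shift the tilt of a warped envelope by the difference of average tilts.

    warped_env_db: length-L envelope in the log/dB domain, one value per channel.
    src_tilt, tgt_tilt: average spectral tilts (dB per channel) of the source
    and target speakers for this phoneme -- the scalar form is an assumption.
    """
    warped_env_db = np.asarray(warped_env_db, dtype=float)
    channels = np.arange(warped_env_db.size)
    return warped_env_db + (tgt_tilt - src_tilt) * channels
```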
In step S29, the sound source characteristic conversion unit 12 reads from the feature memory 6 the average sound source characteristic labeled with the speaker number s corresponding to the target speaker number sT and with the phoneme number x, and deforms it by linear transformation or the like as necessary to obtain the deformed sound source characteristic. In step S30, the spectrum synthesis unit 13 obtains a synthesized spectrum using the normalized spectral envelope and the deformed sound source characteristic obtained as described above. This spectrum synthesis is performed by combining the normalized spectral envelope and the deformed sound source characteristic to obtain the spectral intensities over the harmonics of the fundamental frequency. In step S31, the waveform synthesis unit 14 synthesizes a speech waveform by sinusoidal superposition based on the spectral intensities of the synthesized spectrum. The waveform synthesis method is not limited to sinusoidal superposition using the synthesized spectrum; a synthesized waveform can also be obtained, for example, by zero-phasing the normalized spectral envelope and overlap-adding it at every fundamental period, or by applying an inverse Fourier transform to the synthesized spectrum.
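One possible reading of the sinusoidal superposition of step S31 is sketched below; the frame length, sampling rate, and zero-phase assumption are illustrative and not prescribed by the embodiment.

```python
import numpy as np

def sinusoidal_synthesis(harmonic_amps, f0, sr=16000, duration=0.02):
    """Synthesize one frame by summing sinusoids at harmonics of f0.

    harmonic_amps: amplitude at each harmonic k*f0, taken from the synthesized
    spectrum; using zero phase for every harmonic is a simplifying assumption.
    """
    t = np.arange(int(sr * duration)) / sr
    frame = np.zeros_like(t)
    for k, amp in enumerate(harmonic_amps, start=1):
        frame += amp * np.cos(2 * np.pi * k * f0 * t)
    return frame
```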
In step S32, it is determined whether or not the phoneme whose phoneme number x was designated in step S26 is the last phoneme designated by the utterance instruction. If it is not, the process returns to step S26 and moves on to the synthesis of the speech waveform for the next designated phoneme. If it is the last designated phoneme, the voice quality conversion processing ends.
As described above, in the present embodiment, the input speech of the target speaker T and the source speaker S is subjected to cepstrum analysis by the waveform analysis unit 1, the spectral envelope is extracted by the spectral envelope extraction unit 2, the spectral tilt is extracted by the spectral tilt extraction unit 3, and the sound source characteristics are extracted by the sound source characteristic extraction unit 4. The averaging unit 7 then computes the averages of the spectral envelope, spectral tilt, and sound source characteristics for each "phoneme", "similar phoneme", "voiced section/unvoiced section", and "entire speech section"; these are labeled with the phoneme numbers obtained as recognition results by the speech recognition unit 5 and stored in the feature memory 6.
Furthermore, the nonlinear frequency axis spectrum matching unit 8 performs nonlinear frequency axis spectrum matching between the average spectral envelope of the source speaker S and the average spectral envelope of the target speaker T for all phonemes stored in the feature memory 6, and obtains the frequency warping function corresponding to the optimal DP path. The averaging unit 7 then computes the averages of these frequency warping functions for each "similar phoneme", "voiced section/unvoiced section", and "entire speech section", assigns phoneme numbers, and stores them in the frequency warp table memory 9. When speech is synthesized in the voice quality of the target speaker in accordance with an utterance instruction, the following procedure is used. First, the spectral envelope conversion unit 10 converts the spectral envelope of the designated phoneme of the source speaker S into the spectral envelope of the target speaker T (the transformed spectral envelope), using the average frequency warping function between the source speaker S and the target speaker T for that phoneme. Next, the spectral tilt conversion unit 11 corrects the tilt of the transformed spectral envelope by the difference between the average spectral tilt of the source speaker S and the average spectral tilt of the target speaker T, yielding the normalized spectral envelope. Next, the sound source characteristic conversion unit 12 deforms the average sound source characteristic of the target speaker T to obtain the deformed sound source characteristic.
After that, the spectrum synthesis unit 13 obtains a synthesized spectrum from the normalized spectral envelope and the deformed sound source characteristic, and the waveform synthesis unit 14 synthesizes a speech waveform based on the synthesized spectrum.
That is, in the present embodiment, the frequency axis of the spectral envelope of the source speaker is nonlinearly expanded and contracted to obtain the spectral envelope of the target speaker, and its tilt is corrected to obtain the normalized spectral envelope. Therefore, unlike the conventional voice quality conversion method based on formant frequencies or the method based on division points between peak points of the spectral envelope, there is no need to extract specific positions of the spectral envelope, and the sound quality is not affected by the extraction accuracy of such specific positions.
In addition, the average frequency warping functions used for converting the spectral envelope, the average spectral tilts used for correcting the tilt of the transformed spectral envelope, and the average sound source characteristics used for obtaining the deformed sound source characteristic are computed for each "phoneme", "similar phoneme", "voiced section/unvoiced section", and "entire speech section". Therefore, by using the average frequency warping function, average spectral tilt, and average sound source characteristic of an appropriate category according to the amount of utterance data of the target speaker T stored in the feature memory 6 and the frequency warp table memory 9, the device can cope even when the amount of training utterance data from the target speaker T is small. That is, according to the present embodiment, the utterance burden on the target speaker T can be reduced. Furthermore, by computing the average frequency warping function, average spectral tilt, and average sound source characteristic for each phoneme and each similar phoneme, individual differences caused by speaking habits can be normalized.
In the above embodiment, at the time of speech synthesis in the voice quality of the target speaker, the phonemes to be uttered are designated to the spectral envelope conversion unit 10, the spectral tilt conversion unit 11, and the sound source characteristic conversion unit 12. However, the method of designating the phonemes to be uttered in the present invention is not limited to this; they can also be designated directly by an utterance of the source speaker, as follows.
That is, the phonemes to be uttered are input to the waveform analysis unit 1 as an utterance by the source speaker. The speech recognition unit 5 then performs speech recognition based on the time series of the spectral envelope from the spectral envelope extraction unit 2 and the prosodic information from the waveform analysis unit 1, and inputs the phoneme sequence of the recognition result and the prosodic information of the corresponding phoneme sections, as utterance instruction information, to the spectral envelope conversion unit 10, the spectral tilt conversion unit 11, and the sound source characteristic conversion unit 12. The spectral envelope conversion unit 10 and the spectral tilt conversion unit 11 then read out the average frequency warping function and the average spectral tilt of the corresponding phoneme according to the input phoneme sequence, while the sound source characteristic conversion unit 12 reads out the sound source characteristic of the corresponding phoneme according to the input prosodic information. In this way, the voice quality of the source speaker's utterance is converted in real time.
Also, in the above embodiment, the average spectral envelopes of the same phoneme of the source speaker and the target speaker are obtained in advance, and nonlinear frequency axis spectrum matching is performed on those average spectral envelopes to obtain the average frequency warping function. However, it is also acceptable to perform nonlinear frequency axis spectrum matching on the individual spectral envelopes of the same phoneme to obtain frequency warping functions, and then to average those frequency warping functions within the same phoneme to obtain the average frequency warping function.
(Second Embodiment)
Fig. 8 is a block diagram of the voice quality conversion device of the present embodiment. In Fig. 8, the waveform analysis unit 21, sound source characteristic extraction unit 23, sound source characteristic conversion unit 30, and waveform synthesis unit 32 have the same configurations as, and operate in the same way as, the waveform analysis unit 1, sound source characteristic extraction unit 4, sound source characteristic conversion unit 12, and waveform synthesis unit 14 of the voice quality conversion device of the first embodiment shown in Fig. 1.
The vocal tract cross-sectional area extraction unit 22 extracts the vocal tract cross-sectional areas from the glottis to the lips, as shown in Figs. 9B and 9D, based on the autocorrelation analysis or covariance analysis performed by the waveform analysis unit 21. Figs. 9A and 9C show the sound source characteristics extracted by the sound source characteristic extraction unit 23. The speech recognition unit 24 performs speech recognition based on the time series of the vocal tract cross-sectional areas extracted by the vocal tract cross-sectional area extraction unit 22 and the prosodic information (power, pitch frequency, and the like) extracted by the waveform analysis unit 21. The feature memory 25 stores the extracted vocal tract cross-sectional areas and sound source characteristics with phoneme labels attached.
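For orientation only, a vocal tract area function can be estimated from autocorrelation coefficients via Levinson-Durbin recursion and reflection (PARCOR) coefficients, roughly as sketched below; the model order, the sign convention of the reflection coefficients, the unit area at the lip end, and the lip-to-glottis direction of the recursion are assumptions, and the embodiment itself does not specify the computation.

```python
import numpy as np

def vocal_tract_areas(autocorr, order=12, lip_area=1.0):
    """Estimate a vocal tract area function from autocorrelation coefficients.

    Reflection coefficients are obtained by Levinson-Durbin recursion, and
    successive tube-section areas are chained from the lip end using
    A_next = A_prev * (1 - k) / (1 + k)  (sign convention is an assumption).
    """
    r = np.asarray(autocorr, dtype=float)
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    ks = []
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err
        ks.append(k)
        a[1:i + 1] += k * a[i - 1::-1][:i]   # a_new[j] = a[j] + k * a[i - j]
        err *= (1.0 - k * k)
    areas = [lip_area]
    for k in ks:
        areas.append(areas[-1] * (1.0 - k) / (1.0 + k))
    return np.array(areas)
```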
The averaging unit 26 computes, for the vocal tract cross-sectional area and sound source characteristic of each phoneme of each speaker stored in the feature memory 25, average values for each phoneme, similar phoneme, voiced/unvoiced section, and entire speech section (speaker). The obtained average vocal tract cross-sectional areas and average sound source characteristics are labeled with the corresponding phoneme name, similar phoneme name, voiced/unvoiced section, or entire speech section (speaker) and stored in the feature memory 25. Furthermore, for the vocal tract axis warping functions of each phoneme of each speaker, which are later stored in the vocal tract axis warp table memory 28, the averaging unit 26 computes average values for each similar phoneme, voiced/unvoiced section, and entire speech section. The obtained average vocal tract axis warping functions are labeled with the corresponding similar phoneme name, voiced/unvoiced section, or entire speech section and stored in the vocal tract axis warp table memory 28.
As in the case of the nonlinear frequency axis spectrum matching unit 8 of the first embodiment, the nonlinear vocal tract axis matching unit 27 performs, for each phoneme, matching between the average vocal tract cross-sectional area of the source speaker S and the average vocal tract cross-sectional area of the target speaker T stored in the feature memory 25, by nonlinear vocal tract axis matching based on dynamic programming, as shown in Fig. 10A. It then obtains a vocal tract axis warping function as shown in Fig. 10B, assigns the phoneme name, and stores it in the vocal tract axis warp table memory 28. Fig. 10C shows the average vocal tract axis warping function computed by the averaging unit 26 (with the sum used as a substitute).
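The dynamic-programming matching used here (and, analogously, by the nonlinear frequency axis spectrum matching unit 8 of the first embodiment) can be sketched as follows; the squared-difference local cost and the symmetric step pattern are illustrative assumptions, not taken from the embodiment.

```python
import numpy as np

def dp_warping_path(a, b):
    """Monotonic alignment of two profiles by dynamic programming.

    a, b: 1-D arrays sampled along their respective axes (e.g. average vocal
    tract cross-sectional areas, or average spectral envelope channels).
    Returns the optimal path as a list of index pairs (i, j).
    """
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    n, m = len(a), len(b)
    cost = np.full((n, m), np.inf)
    cost[0, 0] = (a[0] - b[0]) ** 2
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            prev = min(cost[i - 1, j] if i > 0 else np.inf,
                       cost[i, j - 1] if j > 0 else np.inf,
                       cost[i - 1, j - 1] if i > 0 and j > 0 else np.inf)
            cost[i, j] = (a[i] - b[j]) ** 2 + prev
    # Backtrack from the last cell to recover the warping path.
    path, i, j = [(n - 1, m - 1)], n - 1, m - 1
    while i > 0 or j > 0:
        moves = []
        if i > 0 and j > 0:
            moves.append((cost[i - 1, j - 1], i - 1, j - 1))
        if i > 0:
            moves.append((cost[i - 1, j], i - 1, j))
        if j > 0:
            moves.append((cost[i, j - 1], i, j - 1))
        _, i, j = min(moves)
        path.append((i, j))
    return path[::-1]
```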
The vocal tract cross-sectional area conversion unit 29 reads from the feature memory 25 the vocal tract cross-sectional area of the source speaker S for the phoneme corresponding to the utterance instruction, and reads the vocal tract axis warping function of that phoneme from the vocal tract axis warp table memory 28. Using the vocal tract axis warping function, it converts the vocal tract cross-sectional area of the source speaker S into the vocal tract cross-sectional area of the target speaker T (the transformed vocal tract cross-sectional area), as shown in Fig. 11. The spectrum synthesis unit 31 then obtains a synthesized spectrum by computing the spectral intensities over the harmonics of the fundamental frequency using the transformed vocal tract cross-sectional area from the vocal tract cross-sectional area conversion unit 29 and the deformed sound source characteristic from the sound source characteristic conversion unit 30.
As described above, in the second embodiment, the vocal tract cross-sectional area, which is closely related to the spectral envelope, is used in place of the spectral envelope of the first embodiment, and the vocal tract axis of the vocal tract cross-sectional area of the source speaker is nonlinearly expanded and contracted to obtain the vocal tract cross-sectional area of the target speaker. Therefore, as in the first embodiment, there is no need to extract specific positions of the spectral envelope as in the conventional voice quality conversion method based on formant frequencies or the method based on division points between peak points of the spectral envelope, and the sound quality is not affected by the extraction accuracy of such specific positions.
In each of the above embodiments, phonemes are used as the speech units, but syllables are also applicable.
The voice quality conversion processing functions of the waveform analysis units 1 and 21, the spectral envelope extraction unit 2, the spectral tilt extraction unit 3, the vocal tract cross-sectional area extraction unit 22, the sound source characteristic extraction units 4 and 23, the speech recognition units 5 and 24, the averaging units 7 and 26, the nonlinear frequency axis spectrum matching unit 8, the nonlinear vocal tract axis matching unit 27, the spectral envelope conversion unit 10, the spectral tilt conversion unit 11, the vocal tract cross-sectional area conversion unit 29, the sound source characteristic conversion units 12 and 30, the spectrum synthesis units 13 and 31, and the waveform synthesis units 14 and 32 in the above embodiments are realized by a voice quality conversion processing program recorded on a program recording medium. The program recording medium in each of the above embodiments is a program medium such as a ROM (read-only memory). Alternatively, it may be a program medium that is mounted on and read by an external auxiliary storage device. In either case, the program reading means for reading the voice quality conversion processing program from the program medium may be configured to access and read the program medium directly, or may be configured to download the program into a program storage area (not shown) provided in the RAM and to access and read that program storage area. It is assumed that a download program for downloading from the program medium into the program storage area of the RAM is stored in the main device in advance.
Here, the program medium is a medium configured to be separable from the main body and carrying a program in a fixed manner, including tape media such as magnetic tapes and cassette tapes; disk media such as magnetic disks (floppy disks and hard disks) and optical disks (CD-ROM, MO (magneto-optical) disks, MD (MiniDisc), and DVD (digital video disc)); card media such as IC (integrated circuit) cards and optical cards; and semiconductor memory media such as mask ROM, EPROM (ultraviolet-erasable ROM), EEPROM (electrically erasable ROM), and flash ROM. The program medium may also be a medium that carries a program in a fluid manner, for example by downloading from a communication network. In that case, it is assumed that a download program for downloading from the communication network is stored in the main device in advance, or is installed from another recording medium.
Note that what is recorded on the recording medium is not limited to programs; data can also be recorded.

Claims

1. A voice quality conversion device for converting speech in the voice quality of a first speaker into speech in the voice quality of a second speaker, comprising:
spectral envelope extraction means for extracting a first spectral envelope from first speech uttered by the first speaker and extracting a second spectral envelope from second speech uttered by the second speaker;
first memory means for storing the extracted first spectral envelope and second spectral envelope with labels of speech units attached;
nonlinear frequency axis spectrum matching means for performing, for the same label, nonlinear frequency expansion/contraction matching between the first spectral envelope and the second spectral envelope stored in the first memory means, to obtain a frequency warping function representing the correspondence between the frequency axes of the two spectral envelopes;
second memory means for storing the frequency warping function with a label of a speech unit attached; and
spectral envelope conversion means for reading the first spectral envelope of a designated speech unit name from the first memory means, reading the frequency warping function of the designated speech unit name from the second memory means, and converting the read first spectral envelope into a spectral envelope for the second speaker based on the read frequency warping function.
2. The voice quality conversion device according to claim 1, wherein
the nonlinear frequency axis spectrum matching means uses, when performing the nonlinear frequency expansion/contraction matching, the differences in output value between adjacent channels obtained when each of the first spectral envelope and the second spectral envelope is divided into a plurality of channels along the frequency band.
3. The voice quality conversion device according to claim 1, wherein
the first memory means also stores the tilts of the first spectral envelope and the second spectral envelope with labels of speech units attached, and the device further comprises:
spectral tilt extraction means for extracting the tilt of the first spectral envelope from the first speech uttered by the first speaker, extracting the tilt of the second spectral envelope from the second speech uttered by the second speaker, and storing them in the first memory means; and
spectral tilt correction means for reading the tilt of the first spectral envelope and the tilt of the second spectral envelope of the designated speech unit name from the first memory means, and correcting, based on the difference between the two tilts, the tilt of the spectral envelope for the second speaker obtained by the spectral envelope conversion means.
4. The voice quality conversion device according to claim 1, wherein
the speech units are phonemes,
the device further comprises averaging means for grouping the frequency warping functions stored in the second memory means by phoneme, similar phoneme, voiced/unvoiced section, and speaker based on the labels, computing the average value of the frequency warping functions belonging to each group, and storing the obtained average frequency warping functions in the second memory means with labels of the respective group names attached, and
the spectral envelope conversion means uses, as the frequency warping function, the average frequency warping function of one of the groups to which a designated phoneme belongs.
5. The voice quality conversion device according to claim 3, wherein
the speech units are phonemes,
the device further comprises averaging means for grouping the tilts of the first spectral envelope and the tilts of the second spectral envelope stored in the first memory means by phoneme, similar phoneme, voiced/unvoiced section, and speaker based on the labels, computing the average value of the spectral envelope tilts belonging to each group, and storing the obtained average spectral tilts in the first memory means with labels of the respective speaker names and group names attached, and
the spectral tilt correction means uses, as the tilt of the spectral envelope, the average spectral tilt of one of the groups to which a designated phoneme belongs.
6. The voice quality conversion device according to claim 1, further comprising speech recognition means for recognizing the time series of the extracted first spectral envelope or second spectral envelope by a speaker-independent speech recognition method and sending the speech unit names of the recognition result to the first memory means.
7. The voice quality conversion device according to claim 6, wherein
the speech recognition means is capable of supplying the obtained time series of speech unit names to the spectral envelope conversion means or the spectral tilt correction means, and
the spectral envelope conversion means or the spectral tilt correction means uses the time series of speech unit names obtained by the speech recognition means as the designated speech unit names.
8. The voice quality conversion device according to claim 4, wherein
the averaging means computes the average frequency warping function by performing linear transformation between the frequency warping functions to be averaged.
9. The voice quality conversion device according to claim 8, wherein
the frequency warping function has a matrix-like data format in which, on a plane formed by the channels of the first and second spectral envelopes obtained by dividing the first spectral envelope and the second spectral envelope into a plurality of channels in the same frequency band, different element values are given to the lattice points corresponding to the DP path and to the other lattice points, and
the linear transformation between the frequency warping functions computes, for the plurality of matrices corresponding to the frequency warping functions to be averaged, the sum of the element values at each common lattice point, and uses the matrix having the obtained values as its elements as the average frequency warping function.
10. The voice quality conversion device according to claim 9, wherein,
when converting to the intensity of the second spectral envelope in a certain frequency band, the spectral envelope conversion means computes, over the lattice points in the row or column of the matrix of the average frequency warping function to be used that corresponds to the relevant channel of the second spectral envelope, the sum of the products of the element value at each lattice point and the intensity in the channel of the first spectral envelope corresponding to that lattice point, and uses the value of this sum of products as the intensity of the second spectral envelope in that frequency band.
11. A voice quality conversion device for converting speech in the voice quality of a first speaker into speech in the voice quality of a second speaker, comprising:
vocal tract cross-sectional area extraction means for extracting a first vocal tract cross-sectional area from first speech uttered by the first speaker and extracting a second vocal tract cross-sectional area from second speech uttered by the second speaker;
first memory means for storing the extracted first vocal tract cross-sectional area and second vocal tract cross-sectional area with labels of speech units attached;
nonlinear vocal tract axis matching means for performing, for the same label, nonlinear vocal tract axis expansion/contraction matching between the first vocal tract cross-sectional area and the second vocal tract cross-sectional area stored in the first memory means, to obtain a vocal tract axis warping function representing the correspondence between the vocal tract axes of the two vocal tract cross-sectional areas;
second memory means for storing the vocal tract axis warping function with a label of a speech unit attached; and
vocal tract cross-sectional area conversion means for reading the first vocal tract cross-sectional area of a designated speech unit name from the first memory means, reading the vocal tract axis warping function of the designated speech unit name from the second memory means, and converting the read first vocal tract cross-sectional area into a vocal tract cross-sectional area for the second speaker based on the read vocal tract axis warping function.
12. A voice quality conversion method for converting speech in the voice quality of a first speaker into speech in the voice quality of a second speaker, comprising the steps of:
extracting a first spectral envelope from first speech uttered by the first speaker and extracting a second spectral envelope from second speech uttered by the second speaker;
performing, for the same speech unit name, nonlinear frequency expansion/contraction matching between the extracted first spectral envelope and second spectral envelope to obtain a frequency warping function representing the correspondence between the frequency axes of the two spectral envelopes; and
converting the first spectral envelope of a designated speech unit name into a spectral envelope for the second speaker based on the frequency warping function of the designated speech unit name.
13. The voice quality conversion method according to claim 12, further comprising the steps of:
extracting the tilt of the first spectral envelope from the first speech uttered by the first speaker and extracting the tilt of the second spectral envelope from the second speech uttered by the second speaker; and
correcting the tilt of the obtained spectral envelope for the second speaker based on the difference between the tilt of the first spectral envelope and the tilt of the second spectral envelope of the designated speech unit name.
14. A computer-readable program recording medium on which a voice quality conversion processing program is recorded, the program causing a computer to function as the spectral envelope extraction means, the nonlinear frequency axis spectrum matching means, and the spectral envelope conversion means according to claim 1, and the spectral tilt extraction means and the spectral tilt correction means according to claim 2.
PCT/JP2001/002388 2000-04-03 2001-03-26 Voice character converting device WO2001078064A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2000100801A JP3631657B2 (en) 2000-04-03 2000-04-03 Voice quality conversion device, voice quality conversion method, and program recording medium
JP2000-100801 2000-04-03

Publications (1)

Publication Number Publication Date
WO2001078064A1 true WO2001078064A1 (en) 2001-10-18

Family

ID=18614947

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2001/002388 WO2001078064A1 (en) 2000-04-03 2001-03-26 Voice character converting device

Country Status (2)

Country Link
JP (1) JP3631657B2 (en)
WO (1) WO2001078064A1 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090089063A1 (en) * 2007-09-29 2009-04-02 Fan Ping Meng Voice conversion method and system
CN102822888A (en) * 2010-03-25 2012-12-12 日本电气株式会社 Speech synthesizer, speech synthesis method and the speech synthesis program
CN103886859A (en) * 2014-02-14 2014-06-25 河海大学常州校区 Voice conversion method based on one-to-many codebook mapping

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP4432893B2 (en) * 2004-12-15 2010-03-17 ヤマハ株式会社 Voice quality determination device, voice quality determination method, and voice quality determination program
WO2008142836A1 (en) * 2007-05-14 2008-11-27 Panasonic Corporation Voice tone converting device and voice tone converting method
JP5038995B2 (en) 2008-08-25 2012-10-03 株式会社東芝 Voice quality conversion apparatus and method, speech synthesis apparatus and method
CN103370743A (en) * 2011-07-14 2013-10-23 松下电器产业株式会社 Voice quality conversion system, voice quality conversion device, method therefor, vocal tract information generating device, and method therefor
JP6827004B2 (en) * 2018-01-30 2021-02-10 日本電信電話株式会社 Speech conversion model learning device, speech converter, method, and program
CN111179905A (en) * 2020-01-10 2020-05-19 北京中科深智科技有限公司 Rapid dubbing generation method and device

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPS64998A (en) * 1987-06-24 1989-01-05 A T R Jido Honyaku Denwa Kenkyusho:Kk Spectrogram normalizing system
JPH0193796A (en) * 1987-10-06 1989-04-12 Nippon Hoso Kyokai <Nhk> Voice quality conversion
JPH04147300A (en) * 1990-10-11 1992-05-20 Fujitsu Ltd Speaker's voice quality conversion and processing system
JPH0816183A (en) * 1994-06-29 1996-01-19 Mitsubishi Electric Corp Sound quality adaption device
JPH08248994A (en) * 1995-03-10 1996-09-27 Atr Onsei Honyaku Tsushin Kenkyusho:Kk Voice tone quality converting voice synthesizer
JPH09258779A (en) * 1996-03-22 1997-10-03 Atr Onsei Honyaku Tsushin Kenkyusho:Kk Speaker selecting device for voice quality converting voice synthesis and voice quality converting voice synthesizing device
JPH10307594A (en) * 1997-03-07 1998-11-17 Seiko Epson Corp Recognition result processing method and recognition result processor
JPH10312195A (en) * 1997-05-13 1998-11-24 Seiko Epson Corp Method and device and converting speaker tone

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090089063A1 (en) * 2007-09-29 2009-04-02 Fan Ping Meng Voice conversion method and system
US8234110B2 (en) * 2007-09-29 2012-07-31 Nuance Communications, Inc. Voice conversion method and system
CN102822888A (en) * 2010-03-25 2012-12-12 日本电气株式会社 Speech synthesizer, speech synthesis method and the speech synthesis program
CN103886859A (en) * 2014-02-14 2014-06-25 河海大学常州校区 Voice conversion method based on one-to-many codebook mapping
CN103886859B (en) * 2014-02-14 2016-08-17 河海大学常州校区 Phonetics transfer method based on one-to-many codebook mapping

Also Published As

Publication number Publication date
JP2001282300A (en) 2001-10-12
JP3631657B2 (en) 2005-03-23

Similar Documents

Publication Publication Date Title
US10347238B2 (en) Text-based insertion and replacement in audio narration
US7035791B2 (en) Feature-domain concatenative speech synthesis
JP4302788B2 (en) Prosodic database containing fundamental frequency templates for speech synthesis
JP4176169B2 (en) Runtime acoustic unit selection method and apparatus for language synthesis
US7668717B2 (en) Speech synthesis method, speech synthesis system, and speech synthesis program
US4979216A (en) Text to speech synthesis system and method using context dependent vowel allophones
US20050071163A1 (en) Systems and methods for text-to-speech synthesis using spoken example
US20060259303A1 (en) Systems and methods for pitch smoothing for text-to-speech synthesis
JP5148026B1 (en) Speech synthesis apparatus and speech synthesis method
JP2000509157A (en) Speech synthesizer with acoustic elements and database
WO2011151956A1 (en) Voice quality conversion device, method therefor, vowel information generating device, and voice quality conversion system
JP3576840B2 (en) Basic frequency pattern generation method, basic frequency pattern generation device, and program recording medium
JP3631657B2 (en) Voice quality conversion device, voice quality conversion method, and program recording medium
US7765103B2 (en) Rule based speech synthesis method and apparatus
JP3050832B2 (en) Speech synthesizer with spontaneous speech waveform signal connection
JP2002215198A (en) Voice quality converter, voice quality conversion method, and program storage medium
JP3281266B2 (en) Speech synthesis method and apparatus
GB2313530A (en) Speech Synthesizer
KR100259777B1 (en) Optimal synthesis unit selection method in text-to-speech system
JP2583074B2 (en) Voice synthesis method
JP2975586B2 (en) Speech synthesis system
JP3109778B2 (en) Voice rule synthesizer
JP3060276B2 (en) Speech synthesizer
JPH11249679A (en) Voice synthesizer
Salor et al. Dynamic programming approach to voice transformation

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): CN KR US

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE TR

121 Ep: the epo has been informed by wipo that ep was designated in this application
DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
122 Ep: pct application non-entry in european phase