JPH04147300A - Speaker's voice quality conversion and processing system - Google Patents

Speaker's voice quality conversion and processing system

Info

Publication number
JPH04147300A
JPH04147300A (application numbers JP2273088A / JP27308890A)
Authority
JP
Japan
Prior art keywords
speaker
voice quality
frequency
voice
spectral envelope
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
JP2273088A
Other languages
Japanese (ja)
Inventor
Toru Sanada
真田 徹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fujitsu Ltd
Original Assignee
Fujitsu Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fujitsu Ltd filed Critical Fujitsu Ltd
Priority to JP2273088A priority Critical patent/JPH04147300A/en
Publication of JPH04147300A publication Critical patent/JPH04147300A/en
Pending legal-status Critical Current


Abstract

PURPOSE: To enable high-quality voice conversion by vectorizing a plurality of phonemes on the frequency axis, obtaining a corresponding path by dynamic-programming matching, and using this correspondence for frequency conversion.

CONSTITUTION: Multi-phoneme vectorization means 1 and 2 obtain, for each of a plurality of phonemes contained in the voice produced by each speaker, the spectral envelope at the point of maximum power. Corresponding-path extraction means 3 obtains the corresponding path by matching the spectral envelopes of each pair of corresponding phonemes by dynamic programming. Frequency conversion means 4 performs frequency conversion, using the corresponding path, on each phoneme contained in the voice of speaker A. Since the spectral envelopes of a plurality of phonemes are transformed nonlinearly, voice conversion of higher quality is obtained.

Description

[Detailed Description of the Invention]

[Summary]

This invention relates to a speaker voice-quality conversion processing system in which matching by dynamic programming (hereinafter, DP matching) is performed on the frequency axis between the voices of two speakers with different voice qualities and, based on the resulting dynamic-programming path (hereinafter, DP path), the voice quality of one speaker is converted into that of the other.

Its object is to enable voice-quality conversion of higher quality when converting a speaker's voice quality.

To this end, DP matching is performed on vector sequences, on the frequency axis, of the spectral patterns of a plurality of phonemes; the resulting DP path is taken as the correspondence between the speakers on the frequency axis, and voice-quality conversion is performed using this correspondence. In particular, the spectral patterns of the plurality of phonemes are used as vector sequences in the frequency domain.

[Industrial Field of Application]

The present invention relates to a speaker voice-quality conversion processing system in which DP matching is performed on the frequency axis for the voices of two speakers with different voice qualities and, based on the resulting DP path, the voice quality of one speaker is converted into that of the other.

Such voice-quality conversion processing is used for speaker adaptation in speech recognition and for voice-quality conversion in speech synthesis.

[Prior Art]

FIG. 2 shows a conventional configuration. In FIG. 2, reference numeral 11 denotes a control unit that controls the process of obtaining spectral envelopes and the process of performing DP matching; 12 and 14 denote maximum-power-point spectral-envelope extraction units; 13 and 15 denote envelope storage units that store the obtained spectral envelopes; 16 denotes a frequency-axis DP (dynamic programming) unit that performs DP matching between two spectral envelopes to obtain a DP path; 17 denotes a frequency conversion table that holds the information needed to perform frequency conversion based on the obtained DP path; and 18 denotes a frequency conversion unit that nonlinearly transforms the spectral envelope of speaker A's voice.

First, the control unit 11 selects speaker A. A single vowel uttered by speaker A (for example /a/) is read in, and the maximum-power-point spectral-envelope extraction unit A 12 extracts the spectral envelope at the point of maximum power in the utterance of that single vowel (for example /a/). This spectral envelope is stored in envelope storage unit A 13.
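The extraction step just described — framing the utterance, locating the frame of maximum power, and taking a smoothed spectrum there — can be sketched as follows. The patent does not specify how the envelope itself is estimated, so cepstral smoothing is used here as one common choice; the function name, frame length, and lifter order are illustrative assumptions.

```python
import numpy as np

def max_power_spectral_envelope(signal, frame_len=256, hop=128, n_lifter=24):
    # Slice the signal into overlapping frames (50% overlap).
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop:i * hop + frame_len] for i in range(n_frames)])
    # Pick the frame of maximum power, as extraction units 12/14 do.
    frame = frames[np.argmax((frames ** 2).sum(axis=1))] * np.hanning(frame_len)
    # Log-magnitude spectrum, then cepstral smoothing: zeroing the high
    # quefrencies keeps only the slowly varying spectral envelope.
    log_mag = np.log(np.abs(np.fft.fft(frame)) + 1e-10)
    cep = np.fft.ifft(log_mag).real
    cep[n_lifter:-n_lifter] = 0.0
    envelope = np.fft.fft(cep).real[:frame_len // 2 + 1]
    return envelope
```

The returned array is the log-magnitude envelope over the positive-frequency bins of the chosen frame, which is what the envelope storage units would hold.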

Next, the control unit 11 selects speaker B. A single vowel uttered by speaker B is read in, and the maximum-power-point spectral-envelope extraction unit B 14 extracts the spectral envelope at the point of maximum power in speaker B's utterance of the same single vowel as speaker A (for example /a/). This spectral envelope is stored in envelope storage unit B 15. The extraction unit B 14 and storage unit B 15 operate in the same way as the extraction unit A 12 and storage unit A 13.

Next, the control unit 11 selects DP. In response to this instruction, the frequency-axis DP unit 16 reads the spectral envelopes of the same single vowel of speaker A and speaker B from envelope storage units A 13 and B 15 and performs DP matching in the frequency domain. The resulting DP path gives the correspondence between speaker A and speaker B in the frequency domain, which is written into the frequency conversion table 17.
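As a sketch of what the frequency-axis DP unit computes, the following performs DP matching between two envelopes sampled on the frequency axis and backtracks the DP path. The distance measure (absolute difference) and the symmetric step pattern are illustrative assumptions; the patent does not fix these details.

```python
import numpy as np

def dp_match(env_a, env_b):
    """DP matching over frequency bins; returns the path as (i, j) pairs."""
    n, m = len(env_a), len(env_b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(env_a[i - 1] - env_b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1])
    # Backtrack from the top-right corner to recover the DP path.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]
```

The path always runs from bin pair (0, 0) to (n-1, m-1); each pair states which frequency bin of speaker A corresponds to which bin of speaker B.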

When the spectral envelope of speaker A is input, the frequency conversion unit 18 refers to the frequency-domain correspondence written in the frequency conversion table 17, nonlinearly stretches and compresses speaker A's spectral envelope, and outputs the spectral envelope after voice-quality conversion.
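A sketch of this conversion step: given the stored frequency-domain correspondence (the DP path), each bin of the output envelope takes its value from the matched bin of speaker A's envelope, with linear interpolation between path points. The function name and the use of linear interpolation are assumptions for illustration, not details from the patent.

```python
import numpy as np

def apply_frequency_warp(env_a, path, n_out):
    """Nonlinearly stretch/compress env_a onto an n_out-bin target axis."""
    i_pts = np.array([p[0] for p in path], dtype=float)
    j_pts = np.array([p[1] for p in path], dtype=float)
    # A DP path may visit the same target bin j more than once; keep the
    # first occurrence so the interpolation grid is strictly increasing.
    _, first = np.unique(j_pts, return_index=True)
    # For each target bin j, find the (possibly fractional) source bin i.
    src = np.interp(np.arange(n_out), j_pts[first], i_pts[first])
    # Read speaker A's envelope at those source positions.
    return np.interp(src, np.arange(len(env_a)), np.asarray(env_a, dtype=float))
```

With the identity path this returns the input unchanged; with a stretching path it resamples speaker A's envelope onto speaker B's frequency axis.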

[Problems to Be Solved by the Invention]

Because the phoneme used in the conventional method is a single phoneme (for example /a/), the frequency-domain correspondence is likely to differ greatly for other phonemes. A voice-quality conversion method that yields an appropriate frequency-domain correspondence for other phonemes as well was therefore needed.

The object of the present invention is to enable voice-quality conversion of higher quality when converting a speaker's voice quality.

[Means for Solving the Problems]

FIG. 1 shows the principle configuration of the present invention. In FIG. 1, reference numerals 1 and 2 denote multi-phoneme spectrum vectorization means; 3 denotes corresponding-path extraction means that obtains the corresponding path by DP matching on the frequency axis; and 4 denotes frequency conversion means using the corresponding path.

The multi-phoneme spectrum vectorization means 1 and 2 obtain, for each of a plurality of phonemes in the speech uttered by the respective speakers, the spectral envelope at the point of maximum power.

The corresponding-path extraction means 3 performs DP matching between the spectral envelopes of each corresponding pair of phonemes and obtains the corresponding DP path. The frequency conversion means 4 then performs frequency conversion, using the corresponding path, on each phoneme in the speech of speaker A.

[Operation]

Speaker A's voice is input to the multi-phoneme spectrum vectorization means 1, which generates a frequency-domain vector sequence from its spectral patterns; speaker B's voice is input to the multi-phoneme spectrum vectorization means 2, which generates a frequency-domain vector sequence in the same way. The corresponding-path extraction means 3 matches the two frequency-domain vector sequences by DP matching on the frequency axis and obtains the correspondence between the two voices in the frequency domain. Using this frequency-domain correspondence, the frequency conversion means 4 reads in speaker A's frequency features and outputs the frequency features after voice-quality conversion.

That is, when the phonemes Ai of speaker A and Bi of speaker B correspond on the frequency axis as shown by 101, 102, 103, ... in the figure, the frequency of a point P on speaker A's phoneme Ai is converted into the frequency of the corresponding point P' on speaker B's phoneme Bi.

[Embodiment]

FIG. 3 shows the configuration of an embodiment of the present invention. In FIG. 3, reference numeral 21 denotes a control unit corresponding to the control unit 11 of FIG. 2; 22 and 28 denote maximum-power-point spectral-envelope extraction units; and 23 and 29 denote storage selection units that select where the spectral envelope extracted for each phoneme is to be stored.

Reference numerals 24 to 26 and 30 to 32 denote envelope storage units corresponding to the envelope storage units 13 and 15 of FIG. 2; 27 and 33 denote vector generation units that take out the spectral envelopes stored in the envelope storage units 24 to 26 or 30 to 32 and supply them to the frequency-axis DP unit 34; 34 denotes a frequency-axis DP unit that performs DP matching on the frequency axis for the given spectral envelopes; 35 denotes a frequency conversion table that holds conversion tables corresponding to the plurality of phonemes; and 36 denotes a frequency conversion unit that performs frequency conversion for each of the plurality of phonemes.

First, the control unit 21 selects speaker A. One of speaker A's single vowels (for example /a/) is read in, and the maximum-power-point spectral-envelope extraction unit A 22 extracts the spectral envelope at the point of maximum power in the utterance of that single vowel (for example /a/). Via the storage selection unit A 23, this spectral envelope is stored in envelope storage unit A1 24, which corresponds to the input vowel (for example /a/). The next of speaker A's single vowels (for example /i/) is then read in, the extraction unit A 22 extracts the spectral envelope at the point of maximum power of its utterance, and the storage selection unit A 23 stores it in envelope storage unit A2 25, which corresponds to that vowel (for example /i/). In the same way, further single vowels are input, up to n in total, and stored through envelope storage unit An 26. Here n is the number of basic vowels of the language used.

Next, the control unit 21 selects speaker B. The single vowels read in are now those uttered by speaker B; thereafter the operation is the same as when speaker A is selected. The components 28 to 32 correspond to the components 22 to 26 above.

Next, the control unit 21 selects DP matching. In response to this instruction, the frequency-axis DP unit 34 reads the n spectral patterns from envelope storage units A1 24 through An 26 via the vector generation unit A 27, converted into an n-dimensional vector sequence in the frequency domain, and likewise reads the n spectral patterns from envelope storage units B1 30 through Bn 32 via the vector generation unit B 33 as an n-dimensional vector sequence in the frequency domain, and then performs DP matching between the two in the frequency domain. The resulting DP path gives the correspondence between speaker A and speaker B in the frequency domain, which is written into the frequency conversion table 35.
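The point of the embodiment is that the frequency-axis DP unit matches n-dimensional vectors — at every frequency bin, the stacked envelope values of all n vowels — rather than a single envelope. A minimal sketch of that matching follows, assuming Euclidean distance between bin vectors (the patent does not fix a distance measure) and the same symmetric step pattern as before.

```python
import numpy as np

def dp_match_vectors(seq_a, seq_b):
    """DP matching over frequency for n-dimensional vector sequences.

    seq_a, seq_b: arrays of shape (n_bins, n_phonemes) -- at each frequency
    bin, the vector of the n vowel envelopes. Returns the DP path.
    """
    a, b = np.asarray(seq_a, dtype=float), np.asarray(seq_b, dtype=float)
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])  # distance between bin vectors
            cost[i, j] = d + min(cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1])
    # Backtrack to recover the path of (bin of A, bin of B) pairs.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        k = int(np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]]))
        i, j = (i - 1, j - 1) if k == 0 else (i - 1, j) if k == 1 else (i, j - 1)
    return path[::-1]
```

Because each bin now carries evidence from all n vowels at once, the resulting path tends to generalize better to phonemes outside the training set — the improvement the invention claims over single-vowel matching.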

When the spectral envelope of speaker A is input, the frequency conversion unit 36 refers to the frequency-domain correspondence written in the frequency conversion table 35, nonlinearly stretches and compresses speaker A's spectral envelope, and outputs the spectral envelope after voice-quality conversion.

In the above description, spectral envelopes are extracted for a plurality of phonemes. In the present invention it is particularly effective to obtain spectral envelopes (i) for a plurality of steady vowels or (ii) for a plurality of basic vowels, although the invention is of course not limited to these cases.

[Effects of the Invention]

As explained above, according to the present invention the spectral envelopes of a plurality of phonemes are transformed nonlinearly, so voice-quality conversion of higher quality than in the conventional case becomes possible.

[Brief Description of the Drawings]

FIG. 1 shows the principle configuration of the present invention, FIG. 2 shows the conventional configuration, and FIG. 3 shows the configuration of an embodiment of the present invention. In the figures, 1 and 2 denote multi-phoneme spectrum vectorization means, 3 denotes the means that obtains the corresponding path by DP matching on the frequency axis, and 4 denotes the frequency conversion means using the corresponding path.

Applicant: Fujitsu Limited. Agent: patent attorney Hiroshi Mori (and two others)

Claims (4)

[Claims]

(1) A speaker voice-quality conversion processing system in which a spectral envelope is extracted for the phoneme corresponding to a reading uttered by speaker A and a spectral envelope is extracted for the phoneme corresponding to the same reading uttered by speaker B; the spectral envelopes of the two speakers for the same phoneme are matched by dynamic programming in the frequency domain, and the frequency-domain correspondence between the two speakers is obtained from the dynamic-programming path resulting from that matching; and the spectral envelope of speech uttered by speaker A is nonlinearly stretched and compressed so as to convert its voice quality into a form corresponding to the spectral envelope of speech uttered by speaker B, the system comprising: multi-phoneme spectrum vectorization means (1, 2) for extracting the respective spectral envelopes of a plurality of phonemes for speaker A and speaker B; corresponding-path extraction means (3) for performing dynamic-programming matching in the frequency domain for each extracted phoneme and obtaining the dynamic-programming path; and frequency conversion means (4) for nonlinearly stretching and compressing, based on the dynamic-programming path, the spectral envelope corresponding to each phoneme in the speech uttered by speaker A, whereby the voice quality of speech uttered by speaker A is converted into the voice quality of speaker B.

(2) The speaker voice-quality conversion processing system according to claim 1, wherein the dynamic-programming paths for the respective phonemes obtained by the corresponding-path extraction means (3) are stored and held together in a frequency conversion table (35), and the frequency conversion means (4) reads out and uses the contents of the frequency conversion table (35).

(3) The speaker voice-quality conversion processing system according to claim 2, wherein the frequency conversion means (4) nonlinearly stretches and compresses the spectral envelope corresponding to each phoneme in the speech uttered by speaker A based on the contents of the frequency conversion table (35).

(4) The speaker voice-quality conversion processing system according to claim 1, wherein the plurality of phonemes correspond to a plurality of vowels.
JP2273088A 1990-10-11 1990-10-11 Speaker's voice quality conversion and processing system Pending JPH04147300A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2273088A JPH04147300A (en) 1990-10-11 1990-10-11 Speaker's voice quality conversion and processing system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
JP2273088A JPH04147300A (en) 1990-10-11 1990-10-11 Speaker's voice quality conversion and processing system

Publications (1)

Publication Number Publication Date
JPH04147300A true JPH04147300A (en) 1992-05-20

Family

ID=17522976

Family Applications (1)

Application Number Title Priority Date Filing Date
JP2273088A Pending JPH04147300A (en) 1990-10-11 1990-10-11 Speaker's voice quality conversion and processing system

Country Status (1)

Country Link
JP (1) JPH04147300A (en)


Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2001078064A1 (en) * 2000-04-03 2001-10-18 Sharp Kabushiki Kaisha Voice character converting device
WO2002063610A1 (en) * 2001-02-02 2002-08-15 Nec Corporation Voice code sequence converting device and method
US7505899B2 (en) 2001-02-02 2009-03-17 Nec Corporation Speech code sequence converting device and method in which coding is performed by two types of speech coding systems
JP2006251375A (en) * 2005-03-10 2006-09-21 Yamaha Corp Voice processor and program
JP4645241B2 (en) * 2005-03-10 2011-03-09 ヤマハ株式会社 Voice processing apparatus and program
US7945446B2 (en) 2005-03-10 2011-05-17 Yamaha Corporation Sound processing apparatus and method, and program therefor

Similar Documents

Publication Publication Date Title
CN112102811B (en) Optimization method and device for synthesized voice and electronic equipment
JPS62231998A (en) Voice synthesization method and apparatus
JPH04147300A (en) Speaker's voice quality conversion and processing system
KR100259777B1 (en) Optimal synthesis unit selection method in text-to-speech system
JP2583074B2 (en) Voice synthesis method
JPH08335096A (en) Text voice synthesizer
JPH06318094A (en) Speech rule synthesizing device
JPH11249679A (en) Voice synthesizer
EP1589524B1 (en) Method and device for speech synthesis
JPH09319394A (en) Voice synthesis method
JP2980382B2 (en) Speaker adaptive speech recognition method and apparatus
JP3503862B2 (en) Speech recognition method and recording medium storing speech recognition program
US6502074B1 (en) Synthesising speech by converting phonemes to digital waveforms
JP3241582B2 (en) Prosody control device and method
JP2839488B2 (en) Speech synthesizer
JP3438293B2 (en) Automatic Word Template Creation Method for Speech Recognition
JP2002358091A (en) Method and device for synthesizing voice
JP2003108170A (en) Method and device for voice synthesis learning
JPH01211799A (en) Regular synthesizing device for multilingual voice
JPH11282484A (en) Voice synthesizer
JP2001117577A (en) Voice synthesizing device
JPH037999A (en) Voice output device
JP2003108180A (en) Method and device for voice synthesis
JPH04298794A (en) Voice data correction system
JP2001249678A (en) Device and method for outputting voice, and recording medium with program for outputting voice