JPH0197997A

JPH0197997A - Voice quality conversion system

Info

Publication number: JPH0197997A
Application number: JP62255498A
Authority: JP
Inventors: Masanobu Abe (阿部匡伸); Kiyohiro Shikano (鹿野清宏); Satoshi Nakamura (中村哲); Hisao Kuwabara (桑原尚夫)
Original assignee: A T R JIDO HONYAKU DENWA KENKYUSHO KK
Current assignee: A T R JIDO HONYAKU DENWA KENKYUSHO KK
Priority date: 1987-10-09
Filing date: 1987-10-09
Publication date: 1989-04-17
Anticipated expiration: 2013-02-04
Also published as: JP2709926B2

Abstract

PURPOSE: To perform voice quality conversion efficiently by vector-quantizing speech, establishing a correspondence between the vector quantization codebooks of a reference speaker and a target speaker, and converting the voice based on that correspondence. CONSTITUTION: The voice quality converter consists of an amplifier 1, a low-pass filter (LPF) 2, an A/D converter 3, and a processor 4; the processor 4 includes a computer 5, magnetic disks 6, terminals 7, and a printer 8. Speech is vector-quantized, a correspondence between the reference speaker and the target speaker is established in the vector quantization codebooks, and voice quality is converted based on that correspondence. Consequently, efficient voice quality conversion can be attained.

Description

[Detailed Description of the Invention] [Field of Industrial Application] This invention relates to a voice quality conversion method, and in particular to voice quality conversion using vector quantization, which makes it possible to diversify rule-based speech synthesis systems.

[Prior Art and Problems to Be Solved by the Invention] The human voice is produced by an individual and therefore carries individuality. This individuality is distributed jointly across the spectrum, power, pitch frequency, and other properties of speech. Conventional techniques, however, converted voice quality by controlling only a small subset of these parameters, such as the formant frequencies among the spectral parameters or the overall tilt of the spectrum.

These techniques allow only coarse voice quality conversion (for example, conversion between male and female voices). Moreover, even for coarse conversion, no method has been established for deriving the conversion rules for the parameters that characterize voice quality, so a heuristic procedure of repeated trial and error is required.

Therefore, the main object of this invention is to provide a voice quality conversion method that represents each individual's spectral space using vector quantization and converts voice quality by establishing a correspondence between these spaces.

[Means for Solving the Problems] This invention is a voice quality conversion method that digitizes speech, extracts parameter values by digital signal processing, and converts the voice quality of the speech by changing those parameter values. The speech is vector-quantized, a correspondence between the vector quantization codebooks of a reference speaker and a target speaker is established, and voice quality is converted based on this correspondence.

[Operation] Given that vector quantization is a technique that can represent the speech spectrum efficiently, the voice quality conversion method according to this invention establishes a correspondence between the vector quantization codebooks of the reference speaker and the target speaker, and performs voice quality conversion efficiently based on this correspondence.

[Embodiments of the Invention] FIG. 1 is a schematic block diagram of a voice quality conversion device to which this invention is applied.

In FIG. 1, the voice quality conversion device comprises an amplifier 1, a low-pass filter 2, an A/D converter 3, and a processing device 4. The amplifier 1 amplifies the input speech signal, and the low-pass filter 2 removes aliasing noise from the amplified signal. The A/D converter 3 converts the speech signal into a 16-bit digital signal using a 12 kHz sampling signal. The processing device 4 includes a computer 5, a magnetic disk 6, terminals 7, and a printer 8. Based on the digital speech signal input from the A/D converter 3, the computer 5 performs voice quality conversion using the procedures shown in FIGS. 2 to 5, described below.

FIGS. 2 to 5 are flowcharts showing the overall flow, in one embodiment of this invention, from speech input to output of the voice-quality-converted speech. Specifically, FIG. 2 shows the procedure for creating the separate codebooks, FIGS. 3 and 4 show the procedure for creating the conversion codebook, and FIG. 5 shows the voice quality conversion synthesis procedure.

Next, the specific operation of one embodiment of this invention will be described with reference to FIGS. 1 to 5.

The voice quality conversion method in this embodiment consists of three steps: creation of the separate codebooks, creation of the conversion codebook, and voice quality conversion synthesis.

First, the procedure for creating the separate codebooks will be described with reference to FIG. 2. The individuality in speech is contained in its power, pitch frequency, and spectrum, and these parameters must be controlled appropriately to convert voice quality. To represent individuality well, clustering is therefore performed separately for each of these parameters to create the codebooks. The input speech is first amplified by the amplifier 1, aliasing noise is removed by the low-pass filter 2, and in step 101 the signal is converted into a digital signal by the A/D converter 3.

Then, in step 102, LPC analysis is performed to obtain three kinds of parameters: power, pitch frequency, and spectral information (autocorrelation coefficients and LPC cepstral coefficients). After a sufficiently large number of these parameters has been collected, clustering is performed in steps 103, 104, and 105. Clustering uses the LBG algorithm with the following distance measures: equation (1) for power, equation (2) for pitch frequency, and the WLR measure of equation (3) for spectral information.

$$D_{\mathrm{power}} = \frac{P}{P'} + \frac{P'}{P} - 2 \qquad (1)$$

$$D_{\mathrm{pitch}} = \left| f - f' \right| \qquad (2)$$

$$D_{\mathrm{spectrum}} = \sum_{n} \left[ \{ C(n) - C'(n) \} \times \{ R(n) - R'(n) \} \right] \qquad (3)$$

Here, P is the power of speaker A, P' is the power of speaker B, f is the pitch frequency of speaker A, f' is the pitch frequency of speaker B, C is the cepstral coefficient of speaker A, C' is the cepstral coefficient of speaker B, R is the autocorrelation coefficient of speaker A, and R' is the autocorrelation coefficient of speaker B.
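The three distance measures of equations (1) to (3) can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, the absolute value in the pitch distance is an assumption (the source text of equation (2) is not fully legible), and the summation range in the WLR measure is simply the length of the coefficient lists passed in.

```python
def power_distance(p, p_ref):
    """Equation (1): symmetric ratio distance, zero when the two powers are equal."""
    return p / p_ref + p_ref / p - 2.0


def pitch_distance(f, f_ref):
    """Equation (2): difference of pitch frequencies (absolute value assumed)."""
    return abs(f - f_ref)


def wlr_distance(c, c_ref, r, r_ref):
    """Equation (3): sum over n of (C(n) - C'(n)) * (R(n) - R'(n)).

    c, c_ref: cepstral coefficients; r, r_ref: autocorrelation coefficients.
    All four sequences are assumed to have the same length.
    """
    return sum((cn - cn_ref) * (rn - rn_ref)
               for cn, cn_ref, rn, rn_ref in zip(c, c_ref, r, r_ref))
```

The power distance is symmetric in its two arguments, so doubling or halving the power relative to the reference gives the same distance (0.5 in both cases).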

The LBG algorithm is described in detail in Linde, Buzo, Gray; "An algorithm for Vector Quantization Design", IEEE COM-28 (1980-01). The WLR measure emphasizes the characteristic features of speech and shows high performance in spoken word recognition; it is described in Sugiyama and Shikano, "LPC spectral matching measure with weighting on peaks", Transactions of the Institute of Electronics and Communication Engineers of Japan (A), J64-A, 5 (1981-05).

Using the distance measures of equations (1) to (3) above, the power codebook of step 106, the pitch frequency codebook of step 107, and the spectral information codebook of step 108 are obtained.
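As a rough illustration of how a codebook can be trained by LBG-style clustering, the following sketch grows a codebook by repeated splitting and nearest-neighbor refinement. It is a simplified sketch under stated assumptions, not the procedure of the embodiment: it uses squared Euclidean distance and arithmetic-mean centroids instead of the measures of equations (1) to (3), the names `lbg`, `eps`, and `iters` are hypothetical, and the requested codebook size is assumed to be a power of two.

```python
def sq_euclid(a, b):
    """Squared Euclidean distance between two equal-length vectors (assumed measure)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(vectors):
    """Arithmetic mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def nearest(codebook, x, dist):
    """Index of the codeword closest to x."""
    return min(range(len(codebook)), key=lambda i: dist(x, codebook[i]))


def lbg(train, size, dist=sq_euclid, eps=1e-3, iters=20):
    """Train a codebook of `size` codewords (power of two) from `train` vectors."""
    codebook = [centroid(train)]
    while len(codebook) < size:
        # Split every codeword into two slightly perturbed copies.
        codebook = [[v + s for v in cw] for cw in codebook for s in (eps, -eps)]
        for _ in range(iters):
            # Assign each training vector to its nearest codeword.
            cells = [[] for _ in codebook]
            for x in train:
                cells[nearest(codebook, x, dist)].append(x)
            # Move each codeword to the centroid of its cell; keep it if the cell is empty.
            codebook = [centroid(cell) if cell else cw
                        for cell, cw in zip(cells, codebook)]
    return codebook
```

For scalar parameters such as power or pitch frequency, each training item can be passed as a one-element list, for example `lbg([[0.0], [0.1], [10.0], [10.1]], 2)`.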

Next, the procedure for creating the conversion codebook will be described with reference to FIGS. 3 and 4. The conversion codebook is created using a set of training words uttered by both speaker A and speaker B. In step 201, the speech of speaker A is quantized separately for power, pitch frequency, and spectrum, using the separate codebooks obtained by the procedure of FIG. 2. Next, in step 202, a conversion codebook B' from speaker A to speaker B is created using the quantized codes; this procedure is described later. In step 203, conversion to speaker B is performed by substituting codebook B' for codebook A. In step 205, the features represented by codebook B' are compared with the features represented by codebook B. If it is determined in step 204 that the comparison result exceeds a certain threshold, the conversion codebook B' is regarded as complete in step 206; if it is determined in step 205 that the threshold has not been reached, the process returns to step 202 and the above operations are repeated.

Next, the procedure for obtaining the conversion codebook B' will be described with reference to FIG. 4. First, in steps 301 and 302, LPC analysis is applied to the speech of speaker A and speaker B, respectively, to obtain power, pitch frequency, and spectral parameters. Next, the spectral parameters are vector-quantized in steps 303 and 304, the power is scalar-quantized in steps 305 and 306, and the pitch frequency is scalar-quantized in steps 307 and 308.
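The separate quantization of each frame can be pictured as three independent nearest-codeword searches. The sketch below is a minimal illustration, assuming squared Euclidean distance for the spectral vector and absolute difference for the scalars (the embodiment uses the measures of equations (1) to (3)); the function names and the frame layout are hypothetical.

```python
def quantize_scalar(value, codebook):
    """Index of the scalar codeword closest to value (absolute difference assumed)."""
    return min(range(len(codebook)), key=lambda i: abs(value - codebook[i]))


def quantize_vector(vec, codebook):
    """Index of the vector codeword closest to vec (squared Euclidean distance assumed)."""
    def dist(cw):
        return sum((x - y) ** 2 for x, y in zip(vec, cw))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))


def quantize_frame(frame, spectrum_cb, power_cb, pitch_cb):
    """Map one analysis frame to a (spectrum, power, pitch) triple of code indices."""
    return (quantize_vector(frame["spectrum"], spectrum_cb),
            quantize_scalar(frame["power"], power_cb),
            quantize_scalar(frame["pitch"], pitch_cb))
```

Applying `quantize_frame` to every frame of an utterance yields three parallel code sequences, one per parameter.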

To establish the time correspondence between the utterances of speaker A and speaker B, DP matching by the Double Split method is performed in step 309 using the spectral parameters. Based on the resulting time alignment, the correspondence between speaker A and speaker B is obtained for each feature in steps 310, 311, and 312, and histograms are created. The conversion codebooks for the spectral parameters and power are obtained as linear combinations of speaker B's feature vectors, weighted by these histograms. The conversion codebook for pitch frequency is built from the feature vector of speaker B that gives the maximum value of the histogram.
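The histogram-based construction of the conversion codebooks can be sketched as follows. The sketch assumes the time alignment is already available as a list of (code of speaker A, code of speaker B) index pairs; the DP matching itself is not shown. Dividing the weighted sum by the row total is an assumption made here so that the result stays in the range of speaker B's codewords, and all function names are hypothetical.

```python
def build_histogram(aligned_pairs, size_a, size_b):
    """hist[i][j] counts how often code i of speaker A was aligned with code j of speaker B."""
    hist = [[0] * size_b for _ in range(size_a)]
    for i, j in aligned_pairs:
        hist[i][j] += 1
    return hist


def mapping_codebook_linear(hist, codebook_b):
    """Spectrum and power: histogram-weighted linear combination of B's codewords.

    Rows with no aligned frames map to None (no evidence for that code).
    """
    dim = len(codebook_b[0])
    mapped = []
    for row in hist:
        total = sum(row)
        if total == 0:
            mapped.append(None)
            continue
        mapped.append([sum(w * cw[d] for w, cw in zip(row, codebook_b)) / total
                       for d in range(dim)])
    return mapped


def mapping_codebook_argmax(hist, codebook_b):
    """Pitch frequency: take B's codeword with the largest histogram count."""
    mapped = []
    for row in hist:
        if sum(row) == 0:
            mapped.append(None)
            continue
        best = max(range(len(row)), key=lambda j: row[j])
        mapped.append(codebook_b[best])
    return mapped
```

For example, if code 0 of speaker A was aligned twice with code 0 and once with code 1 of speaker B, its converted codeword under the linear rule is two thirds of B's codeword 0 plus one third of B's codeword 1, while the argmax rule picks B's codeword 0.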

Next, the voice quality conversion synthesis method using the codebooks will be described with reference to FIG. 5. In step 401, the speech of speaker A is subjected to LPC analysis, and the power, pitch frequency, and spectral parameters are extracted.
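To make the spectral part of the LPC analysis more concrete, the sketch below computes autocorrelation coefficients, solves for the predictor coefficients with the Levinson-Durbin recursion, and converts them to LPC cepstral coefficients. It is a textbook-style sketch, not the analysis conditions of the embodiment: windowing, analysis order, and pitch extraction are not specified in the text and are omitted, the function names are hypothetical, and the sign convention assumes the inverse filter A(z) = 1 + a[1] z^-1 + ... + a[p] z^-p.

```python
def autocorrelation(frame, order):
    """Autocorrelation coefficients r[0..order] of one frame of samples."""
    n = len(frame)
    return [sum(frame[t] * frame[t + k] for t in range(n - k))
            for k in range(order + 1)]


def levinson_durbin(r, order):
    """Solve for predictor coefficients a[0..order] (a[0] = 1) and residual energy.

    Assumes r[0] > 0, i.e. the frame is not silent.
    """
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                      # reflection coefficient
        prev = a[:]
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        err *= 1.0 - k * k
    return a, err


def lpc_cepstrum(a, n_ceps):
    """Cepstral coefficients c[1..n_ceps] of the all-pole model 1 / A(z)."""
    p = len(a) - 1
    c = [0.0] * (n_ceps + 1)
    for n in range(1, n_ceps + 1):
        acc = a[n] if n <= p else 0.0
        for k in range(1, n):
            if n - k <= p:
                acc += (k / n) * c[k] * a[n - k]
        c[n] = -acc
    return c[1:]
```

The frame power can be taken from `r[0]`, for instance as `r[0] / len(frame)`; this normalization is likewise an assumption rather than something stated in the text.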

Next, using the separate codebooks of speaker A obtained in FIG. 2, the spectral parameters are vector-quantized in step 402, the power is scalar-quantized in step 403, and the pitch frequency is scalar-quantized in step 404. The conversion codebooks described with reference to FIG. 3 are used when decoding these quantized parameters: the spectrum conversion codebook from speaker A to speaker B is used in step 405, the power conversion codebook in step 406, and the pitch frequency conversion codebook in step 407. Speech is then synthesized in step 408 using the converted parameters.
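The encode-then-decode structure of the conversion can be sketched as below: each frame of speaker A is encoded with A's own codebooks, and the resulting indices are decoded with the conversion codebooks instead. This is only a sketch; the distance functions, the dictionary layout, and the name `convert_frames` are assumptions, and the final speech synthesis from the converted parameters (step 408) is not shown.

```python
def sq_euclid(a, b):
    """Squared Euclidean distance between two vectors (assumed measure)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def abs_diff(a, b):
    """Absolute difference between two scalars (assumed measure)."""
    return abs(a - b)


def nearest_index(value, codebook, dist):
    """Index of the codeword closest to value under dist."""
    return min(range(len(codebook)), key=lambda i: dist(value, codebook[i]))


def convert_frames(frames, codebooks_a, conversion_codebooks):
    """Convert (spectrum, power, pitch) frames of speaker A toward speaker B.

    codebooks_a and conversion_codebooks are dicts with the keys
    "spectrum", "power", and "pitch"; entry i of a conversion codebook is the
    converted value for code i of speaker A.
    """
    converted = []
    for spectrum, power, pitch in frames:
        s = nearest_index(spectrum, codebooks_a["spectrum"], sq_euclid)
        p = nearest_index(power, codebooks_a["power"], abs_diff)
        f = nearest_index(pitch, codebooks_a["pitch"], abs_diff)
        converted.append((conversion_codebooks["spectrum"][s],
                          conversion_codebooks["power"][p],
                          conversion_codebooks["pitch"][f]))
    return converted
```

Because only the decoding table changes, the same encoder can serve several target speakers by swapping in a different set of conversion codebooks.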

[Effects of the Invention] As described above, according to this invention, in a voice quality conversion method that digitizes speech, extracts parameter values by digital signal processing, and converts the voice quality of the speech by changing those parameter values, the speech is vector-quantized, a correspondence between the vector quantization codebooks of the reference speaker and the target speaker is established, and voice quality is converted based on this correspondence. Because vector quantization can represent the speech spectrum efficiently, the characteristics of the spectral information as a whole can be controlled well, enabling more detailed voice quality conversion than conventional methods that control only part of the spectral information. In addition, the individuality contained in speech is represented by a codebook for each individual, and because the algorithm for creating such codebooks is already established, the individuality of the voices of an unspecified number of speakers can be obtained easily. Furthermore, once an individual's codebook has been created, voice quality conversion can be performed easily by following the algorithm of this invention.

[Brief Description of the Drawings]

FIG. 1 is a schematic block diagram of a voice quality conversion device to which this invention is applied. FIG. 2 is a flowchart showing the procedure for creating the separate codebooks. FIGS. 3 and 4 are flowcharts showing the procedure for creating the conversion codebook. FIG. 5 is a flowchart explaining the voice quality conversion synthesis procedure. In the figures, 1 denotes an amplifier, 2 a low-pass filter, 3 an A/D converter, 4 a processing device, and 5 a computer.

[Fair copy of the drawings (no change in content): FIG. 2, FIG. 3]

Procedural Amendment (formal), February 4, 1988
2. Title of the invention: Voice quality conversion method
3. Person making the amendment
Relationship to the case: Patent applicant
Address: 5 Sanpeidani, Inuidani, Seika-cho, Soraku-gun, Kyoto
Name: K.K. ATR Jido Honyaku Denwa Kenkyusho (株式会社エイ・ティ・アール自動翻訳電話研究所)
Representative: Akira Kurematsu
4. Agent
Address: Sumitomo Bank Minamimorimachi Building, 2-1-29 Minamimorimachi, Kita-ku, Osaka
6. Subject of the amendment: Drawings
7. Content of the amendment: Drawings drawn clearly in sufficiently dense black on proper paper, as per the attached sheets.

Claims

[Claims]

(1) A voice quality conversion method that digitizes speech, extracts parameter values by digital signal processing, and converts the voice quality of the speech by changing the parameter values, characterized in that the speech is vector-quantized, a correspondence between the vector quantization codebooks of a reference speaker and a target speaker is established, and voice quality is converted based on this correspondence.

(2) The voice quality conversion method according to claim 1, characterized in that separate vector quantization is applied to three kinds of speech features, namely power, pitch frequency, and spectrum, and voice quality is converted by establishing a correspondence for each feature.

(3) The voice quality conversion method according to claim 1, characterized in that the correspondence between the reference speaker and the target speaker is established by learning, from given training words, the correspondence between vectors in the two speakers' codebooks, and voice quality is converted based on that correspondence.

(4) The voice quality conversion method according to claim 3, characterized in that a histogram is created by DP matching during the learning, the correspondence between the reference speaker and the target speaker is determined using the histogram, and voice quality is converted.

(5) The voice quality conversion method according to claim 4, characterized in that, when the correspondence is determined using the histogram, voice quality is converted for power and spectrum by replacing the feature vector of the reference speaker with a linear combination of the feature vectors of the target speaker weighted by the histogram.

(6) The voice quality conversion method according to claim 4, characterized in that, when the correspondence is determined using the histogram, voice quality is converted for pitch frequency by replacing the feature vector of the reference speaker with the feature vector of the target speaker that gives the maximum value of the histogram.