JPH08248994A

JPH08248994A - Voice tone quality converting voice synthesizer

Info

Publication number: JPH08248994A
Application number: JP7051039A
Authority: JP
Inventors: Makoto Hashimoto; 誠橋本; Norio Higuchi; 宜男樋口
Original assignee: ATR ONSEI HONYAKU TSUSHIN KENKYUSHO KK; ATR Interpreting Telecommunications Research Laboratories
Current assignee: ATR ONSEI HONYAKU TSUSHIN KENKYUSHO KK; ATR Interpreting Telecommunications Research Laboratories
Priority date: 1995-03-10
Filing date: 1995-03-10
Publication date: 1996-09-27
Anticipated expiration: 2014-06-02
Also published as: JP2898568B2

Abstract

PURPOSE: To enable training with a small amount of training data and to perform voice quality conversion with high precision, by generating and outputting speech signals of a target speaker corresponding to a character string based on the acoustic feature parameters of the target speaker's speech signals. CONSTITUTION: Based on an input character string to be synthesized, a spectrum mapping processing section 22 quantizes the acoustic feature parameters of the speech of a selected speaker, stored in a speech database 10, using the codebook of that selected speaker. Based on the correspondence between the selected speaker's codebook and a mapping codebook, the section 22 then generates the acoustic feature parameters of the target speaker's speech signal corresponding to the character string. A speech synthesis section 24 generates and outputs the target speaker's speech signal corresponding to the character string, based on the acoustic feature parameters generated by the section 22. Because the speech to be converted may differ from the training speech, the conversion can be applied, for example, from Japanese words to English words.

Description

Detailed Description of the Invention

[0001]

[Field of Industrial Application] The present invention relates to a voice quality conversion speech synthesizer.

[0002]

[Prior Art] Realizing a speech synthesis system that can generate a wide variety of synthesized speech is very important, both for improving the quality of synthesized speech and for the wider adoption of synthesis systems. Voice quality conversion is one of the technologies needed to generate varied synthesized speech, and it has been the subject of various research and development efforts.

[0003] For example, Reference 1, Hiroshi Matsumoto et al., "Voice quality conversion by supervised/unsupervised spectral mapping", Journal of the Acoustical Society of Japan, Vol. 50, No. 7, pp. 549-555, July 1994 (hereinafter, the first conventional example), aims to improve the accuracy and quality of voice quality conversion. It discloses training a mapping under a criterion that minimizes the squared error between the spectral sequence of the converted speech and the spectral sequence of the target speaker, and computing the untrained portions by interpolation.

[0004] FIG. 3 is a block diagram of a second conventional example, showing the method of generating a pitch frequency conversion codebook disclosed in Reference 2, Masanobu Abe et al., "Voice quality conversion by vector quantization", Proceedings of the Acoustical Society of Japan, 2-6-14, October 1987 (hereinafter, the second conventional example). FIG. 4 is a block diagram showing voice quality conversion by vector quantization, using the pitch frequency conversion codebook generated by the method of FIG. 3 together with a spectral parameter conversion codebook generated in the same way. The second conventional example obtains a mapping between speakers by associating the codebooks of the individual speakers, and uses that mapping for voice quality conversion. That is, a conversion codebook from speaker A to speaker B is created in advance from a large amount of training data, and voice quality conversion is then performed with it. The conversion codebook is created by the following procedure.
(I) Correspondences are established between the clustered codebooks.
(II) The mapping is performed using the frequencies of corresponding codes.

[0005] The process of creating the pitch frequency conversion codebook between speakers A and B is described below with reference to FIG. 3.
(1) Pitch frequency sample data 30 and 40 of speaker A and speaker B are read in and clustered (31, 41) to create pitch frequency codebooks 32 and 42. The spectral parameters are likewise clustered to create codebooks.
(2) The pitch frequencies of the training data are coded using the pitch frequency codebooks 32 and 42, that is, scalar quantized (33, 43). The spectral parameters are likewise coded, that is, vector quantized.
(3) Using the coded parameters, DP matching (matching by dynamic programming) is performed for each training word to establish the time correspondence 34.
(4) A histogram 35 is created from the pitch codes of speaker A and the pitch codes of speaker B that correspond in time.
(5) Each pitch code of speaker A is associated with the pitch code of speaker B for which the histogram is largest, creating the conversion codebook 36 from speaker A to speaker B.
For the mapping of the spectral parameters, weighting by the histogram is applied, and the conversion codebook (36a in FIG. 4) is created according to the procedure described in Reference 3, Nakamura et al., "Normalization of spectrograms using vector quantization", Acoustical Society of Japan Speech Research Group materials, SP87-17, 1987.
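Steps (4) and (5) above can be illustrated with a minimal Python sketch. It assumes the DP matching of step (3) has already produced time-aligned pairs of pitch codes; the function name `build_conversion_codebook` and the sample data are hypothetical and are not taken from Reference 2.

```python
from collections import Counter, defaultdict

def build_conversion_codebook(aligned_pairs):
    """Map each speaker-A code to the speaker-B code it co-occurs with most often.

    aligned_pairs: iterable of (code_a, code_b) pairs that correspond in time.
    """
    # Step (4): histogram of speaker-B codes for every speaker-A code.
    histogram = defaultdict(Counter)
    for code_a, code_b in aligned_pairs:
        histogram[code_a][code_b] += 1
    # Step (5): keep the speaker-B code with the largest count.
    return {a: counts.most_common(1)[0][0] for a, counts in histogram.items()}

# Hypothetical aligned pitch codes for illustration only.
pairs = [(0, 2), (0, 2), (0, 1), (1, 3), (1, 3), (2, 0)]
conversion = build_conversion_codebook(pairs)
```

The sketch covers only the pitch codebook; the histogram-weighted spectral mapping of Reference 3 is not reproduced here.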

[0006] FIG. 4 shows the voice quality conversion method of the second conventional example, which uses the conversion codebooks created above. As shown in FIG. 4, the speech of speaker A is first subjected to LPC analysis 50 to obtain spectral parameters and pitch parameters. These are vector quantized (52) and scalar quantized (62) using speaker A's spectral parameter codebook 51 and pitch frequency codebook 61, respectively. When decoding (53, 63), the conversion codebooks 36 and 36a created above are used in place of speaker A's codebooks 51 and 61. The parameters are thereby converted to the voice of speaker B, after which the synthesis filter 54, which serves as the speech synthesis means, generates and outputs the speech signal of speaker B.
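The quantize-then-decode idea of FIG. 4 can be sketched as follows, assuming codebooks are plain lists of vectors and that entry i of the conversion codebook holds the speaker-B vector paired with entry i of speaker A's codebook. The names `quantize` and `convert` are illustrative, and LPC analysis and the synthesis filter are outside this sketch.

```python
def quantize(vector, codebook):
    """Return the index of the codebook entry nearest to vector (squared error)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((v - c) ** 2 for v, c in zip(vector, codebook[i])),
    )

def convert(frames, codebook_a, conversion_codebook):
    """Encode with speaker A's codebook, decode with the conversion codebook."""
    return [conversion_codebook[quantize(f, codebook_a)] for f in frames]

# Hypothetical one-dimensional codebooks for illustration only.
codebook_a = [[0.0], [10.0]]
conversion_codebook = [[1.0], [11.0]]
converted = convert([[2.0], [9.0]], codebook_a, conversion_codebook)
```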

[0007]

[Problems to Be Solved by the Invention] In the first conventional example, however, training is extremely difficult when the spectral difference between the speakers is relatively large. In the second conventional example, a conversion codebook between the different speakers must be created for all of the speech data, which requires a large amount of training data. Both approaches are therefore difficult to put into practical use.

[0008] An object of the present invention is to solve the above problems and to provide a voice quality conversion speech synthesizer that selects the conversion source speaker so that the spectral difference between speakers does not become relatively large, and that can perform voice quality conversion after training with less training data than the conventional examples.

[0009]

[Means for Solving the Problems] The voice quality conversion speech synthesizer according to claim 1 of the present invention comprises: storage means for storing in advance a speech database containing acoustic feature parameters of a plurality of registered speakers, together with their codebooks; selection means for selecting, from the plurality of registered speakers, the speaker closest to the target speaker of the voice quality conversion, based on an input speech signal of at least one word of the target speaker; generation means for calculating a mapping codebook from the selected speaker to the target speaker by calculating the difference between the acoustic space of the speaker selected by the selection means and the acoustic space of the target speaker; mapping processing means for quantizing, based on an input character string to be synthesized, the acoustic feature parameters of the selected speaker's speech stored in the speech database using the selected speaker's codebook, and for generating the acoustic feature parameters of the target speaker's speech signal corresponding to the character string based on the correspondence between the selected speaker's codebook and the mapping codebook; and speech synthesis means for generating and outputting the target speaker's speech signal corresponding to the character string, based on the acoustic feature parameters of the target speaker's speech signal generated by the mapping processing means.

[0010] The voice quality conversion speech synthesizer according to claim 2 is the synthesizer according to claim 1, wherein the generation means calculates the mapping codebook from the selected speaker to the target speaker using the moving vector field smoothing method.

[0011] The voice quality conversion speech synthesizer according to claim 3 is the synthesizer according to claim 1 or 2, wherein the acoustic feature parameters include spectral data. The voice quality conversion speech synthesizer according to claim 4 is the synthesizer according to claim 3, wherein the acoustic feature parameters further include pitch frequency data.

[0012]

[Operation] In the voice quality conversion speech synthesizer according to claim 1 configured as described above, the selection means selects, from the plurality of registered speakers, the speaker closest to the target speaker of the voice quality conversion, based on the input speech signal of at least one word of the target speaker. The generation means calculates the mapping codebook from the selected speaker to the target speaker by calculating the difference between the acoustic space of the selected speaker and the acoustic space of the target speaker. Next, based on the input character string to be synthesized, the mapping processing means quantizes the acoustic feature parameters of the selected speaker's speech stored in the speech database using the selected speaker's codebook, and generates the acoustic feature parameters of the target speaker's speech signal corresponding to the character string based on the correspondence between the selected speaker's codebook and the mapping codebook. The speech synthesis means then generates and outputs the target speaker's speech signal corresponding to the character string, based on the acoustic feature parameters generated by the mapping processing means.

In the second conventional example, mapping from a registered speaker of the speech data to a target speaker required a large amount of training data, because the correspondences of all codes in the codebooks of the different speakers were obtained by training, without interpolation. In contrast, according to the present invention, the mapping function from the registered speaker to the target speaker can be obtained from a very small amount of training data, on the order of one word, and the invention can be put into practical use with, for example, a digital computer. The voice quality can also be converted with higher accuracy than in the conventional examples, regardless of the utterance content. That is, the speech to be converted may differ from the training speech, so the present invention can be applied, for example, to voice quality conversion from Japanese words to English words, or from English words to Japanese words.

[0013] In the voice quality conversion speech synthesizer according to claim 2, the generation means calculates the mapping codebook from the selected speaker to the target speaker using the moving vector field smoothing method. This allows voice quality conversion and speech synthesis to be performed more simply and more accurately.

[0014] In the voice quality conversion speech synthesizer according to claim 3, the acoustic feature parameters preferably include spectral data. In the voice quality conversion speech synthesizer according to claim 4, the acoustic feature parameters preferably further include pitch frequency data.

[0015]

[Embodiment] An embodiment of the present invention is described below with reference to the drawings. FIG. 1 is a block diagram of a mapping codebook generation device 100 and a voice quality conversion speech synthesizer 200 according to one embodiment of the present invention. The system of this embodiment is characterized in that the mapping codebook generation device 100 includes a speaker selection unit 5 and a mapping codebook generation unit 6, and in that the voice quality conversion speech synthesizer 200 includes a spectrum mapping processing unit 22. Because a practical voice quality conversion system should need as little training data as possible, this embodiment discloses a new voice quality conversion method based on spectrum mapping that uses speaker selection and the moving vector field smoothing method (VFS: Vector Field Smoothing). The method has the particular advantage that conversion is possible even with little training data. In this specification, the speakers for whom a speech database is prepared in advance are called registered speakers, the speaker to be converted to is called the target speaker, and the one speaker chosen from the registered speakers is called the selected speaker.

[0016] As shown in FIG. 1, the speech database in the speech database memory 10 and the spectrum codebook in the spectrum codebook memory 11 are created and stored in advance. The speech database contains acoustic feature parameters of the registered speakers, such as pitch frequency, cepstrum coefficient data, and power data. The spectrum codebook holds the clustered cepstrum data vectors of the registered speakers, labeled frame by frame, and is stored in the memory 11.

[0017] An utterance of any one word by the target speaker is input to the microphone 1 and converted into an analog speech signal, converted into a digital speech signal by the A/D converter 2, and then input to the feature extraction unit 3. In the A/D converter 2, the speech signal data is labeled frame by frame at a predetermined frame interval corresponding to the sampling frequency, for example 20 milliseconds, and the subsequent processing is executed for each frame. The feature extraction unit 3 analyzes the input speech signal by, for example, cepstrum analysis, and extracts 32-dimensional feature parameters comprising 30th-order cepstrum coefficients, power, and pitch frequency. The time series of extracted feature parameters is input to the speaker selection unit 5 via the buffer memory 4.

[0018] The speaker selection unit 5 time-aligns the input spectrum time series of the target speaker with the spectrum time series of each registered speaker in the speech database in the memory 10, using the DTW (Dynamic Time Warping) method so that their durations match. It then calculates the distance between the target speaker's spectrum time series and that of each registered speaker, and selects the single registered speaker with the smallest distance under a minimum squared error criterion. Here, the spectrum time series corresponds to the cepstrum time series.
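As a rough illustration of this selection step, the sketch below aligns two feature sequences with a textbook DTW recursion and picks the registered speaker with the smallest accumulated squared error. The step pattern, the normalization by the combined sequence length, and the names `dtw_distance` and `select_speaker` are assumptions made for the example; the patent does not specify them.

```python
import math

def frame_distance(a, b):
    """Squared Euclidean distance between two feature frames."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def dtw_distance(seq_a, seq_b):
    """Accumulated DTW cost between two frame sequences, length-normalized."""
    n, m = len(seq_a), len(seq_b)
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(seq_a[i - 1], seq_b[j - 1])
            # Symmetric step pattern: insertion, deletion, or match.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m] / (n + m)

def select_speaker(target_seq, registered):
    """Return the name of the registered speaker closest to the target utterance."""
    return min(registered, key=lambda name: dtw_distance(target_seq, registered[name]))

# Hypothetical one-dimensional "cepstrum" sequences for illustration only.
target = [[0.0], [1.0], [2.0]]
registered = {"A": [[0.0], [1.1], [1.9]], "B": [[5.0], [6.0], [7.0], [7.0]]}
chosen = select_speaker(target, registered)
```

In the embodiment the frames would be cepstrum vectors of the same word uttered by each speaker; the toy data above only exercises the logic.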

[0019] FIG. 2 is a flowchart showing the mapping codebook generation process executed by the mapping codebook generation unit 6 of FIG. 1.

[0020] The mapping codebook generation unit 6 maps the selected speaker's spectrum codebook C^s into the acoustic space of the target speaker, converting it into the target speaker's spectrum codebook C^t. The codebook mapped into the target speaker's acoustic space is defined here as the mapping codebook C^t. The mapping codebook C^t is generated with the moving vector field smoothing method. This method maps the acoustic space of one speaker onto the acoustic space of another speaker under the assumption that the difference vectors between speakers vary continuously across the acoustic space. The procedure is as follows.

[0021] First, in step S1, the selected speaker's spectrum codebook C^s is read from the spectrum codebook memory 11 and used as the initial state of the mapping codebook C^t. Next, in step S2, the selected speaker's training speech spectrum time series is vector quantized using the mapping codebook C^t, and the resulting code sequence is associated with the input speech spectrum time series of the target speaker using the DTW (Dynamic Time Warping) method. Then, in step S3, the difference vector V_m between the m-th vector C_m^s (m being a natural number) and the mean vector /C_m^s of the input spectra x associated with it is calculated as shown in Equation 1 below, and is taken as the movement vector V_m. Because the overline (bar) over C_m^s cannot be typeset in this specification, the mean vector is written /C_m^s. The / in (1/N_m) on the right-hand side of Equation 2 denotes a fraction.

[0022]

[Equation 1] V_m = /C_m^s - C_m^s

where

[Equation 2] /C_m^s = (1/N_m) Σ_{x ∈ M} x
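Equations 1 and 2 can be sketched in Python as below. The sketch assumes the DTW association of step S2 is available as a list of (code index, target frame) pairs; the function name `movement_vectors` and the data layout are illustrative choices, not part of the patent.

```python
def movement_vectors(codebook, aligned):
    """Compute V_m = mean(x associated with C_m^s) - C_m^s for every associated code.

    codebook: list of code vectors C_m^s of the selected speaker.
    aligned:  list of (m, x) pairs, x being a target-speaker frame aligned to code m.
    Returns (vectors, counts): vectors[m] is V_m, counts[m] is N_m.
    """
    dim = len(codebook[0])
    sums, counts = {}, {}
    for m, x in aligned:
        acc = sums.setdefault(m, [0.0] * dim)
        for d in range(dim):
            acc[d] += x[d]
        counts[m] = counts.get(m, 0) + 1
    vectors = {}
    for m, acc in sums.items():
        mean = [value / counts[m] for value in acc]               # Equation 2
        vectors[m] = [mean[d] - codebook[m][d] for d in range(dim)]  # Equation 1
    return vectors, counts

# Hypothetical two-dimensional data for illustration only.
codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]
aligned = [(0, [1.0, 0.0]), (0, [3.0, 2.0]), (1, [1.5, 1.0])]
vectors, counts = movement_vectors(codebook, aligned)
```

Code 2 receives no aligned frames in this example, so it has no movement vector yet; that is the case handled by the interpolation of step S4 onward.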

[0023] In Equation 2, N_m is the number of input spectrum vectors associated with the m-th vector C_m^s of the selected speaker, and M is the set of vectors of the input spectrum time series associated with the vector C_m^s. Then, in step S4, the fuzzy membership function μ_{n,k} between the n-th vector C_n^s of the selected speaker, for which no association was obtained in training, and each element C_k^s of a set of a predetermined number of nearby code vectors for which an association was obtained, is calculated using Equation 3 below.

[0024]

[Equation 3] (equation not reproduced in the source text)
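Equation 3 itself is not reproduced in this text. The sketch below therefore assumes the membership form commonly used in fuzzy vector quantization, μ_{n,k} = 1 / Σ_{j∈K} (d_{n,k} / d_{n,j})^{ma} with ma = 1/(m-1), which is consistent with the symbols defined in the following paragraph but should be treated as a reconstruction rather than the patent's exact formula. The function name `fuzzy_membership` is hypothetical.

```python
def fuzzy_membership(distances, fuzziness):
    """Return {k: mu_nk} for one vector n, given {k: d_nk} over the neighbour set K.

    Assumed form: mu_nk = 1 / sum_j (d_nk / d_nj) ** ma, with ma = 1 / (m - 1).
    """
    ma = 1.0 / (fuzziness - 1.0)
    # If the vector coincides with one or more neighbours, give them all the weight.
    zeros = [k for k, d in distances.items() if d == 0.0]
    if zeros:
        return {k: (1.0 / len(zeros) if k in zeros else 0.0) for k in distances}
    return {
        k: 1.0 / sum((d_k / d_j) ** ma for d_j in distances.values())
        for k, d_k in distances.items()
    }

# With fuzziness m = 2 the weights are inversely proportional to distance.
mu = fuzzy_membership({1: 1.0, 2: 3.0}, fuzziness=2.0)
```

Under this assumed form the memberships over K sum to one, and a smaller fuzziness m concentrates the weight on the nearest neighbour.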

[0025] Here, ma = 1/(m-1). Further, d_{n,k} is the distance between the vector C_n^s and the vector C_k^s, m is a control parameter (the fuzziness), and K is the set of vectors for which an association was obtained. In step S5, the movement vector V_n of each unassociated vector C_n^s is calculated with Equation 4 below from the movement vectors V_k of the associated code vectors C_k^s and the fuzzy membership function μ_{n,k}. All vectors C^s of the mapping codebook are then updated to target-speaker vectors C^t using the set V of movement vectors V_n, as shown in Equation 5 below, and the process proceeds to step S6.

[0026]

[Equation 4] (equation not reproduced in the source text)

[Equation 5] C^t = C^s + V
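Step S5 can be sketched as follows. Equation 4 is not reproduced in this text, so the sketch assumes the usual membership-weighted sum V_n = Σ_{k∈K} μ_{n,k} V_k; only Equation 5 is taken directly from the specification. The function names are illustrative.

```python
def interpolate_movement(memberships, known_vectors):
    """Assumed Equation 4: V_n as the membership-weighted sum of neighbouring V_k.

    memberships:   {k: mu_nk} for the neighbours k of the unassociated vector n.
    known_vectors: {k: V_k} for code vectors that were associated in training.
    """
    dim = len(next(iter(known_vectors.values())))
    v_n = [0.0] * dim
    for k, mu in memberships.items():
        for d in range(dim):
            v_n[d] += mu * known_vectors[k][d]
    return v_n

def apply_movement(codebook, vectors):
    """Equation 5: C^t = C^s + V, applied to every code vector."""
    return [
        [c + v for c, v in zip(code, vectors[i])]
        for i, code in enumerate(codebook)
    ]

# Hypothetical values for illustration only.
v_n = interpolate_movement({0: 0.75, 1: 0.25}, {0: [2.0, 0.0], 1: [0.0, 4.0]})
mapped = apply_movement([[0.0, 0.0], [1.0, 1.0]], {0: [1.5, 1.0], 1: [0.5, 0.5]})
```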

[0027] In step S6, if the distance obtained in the time alignment by the DTW method has not converged, the process returns to step S2. If it has converged, the process proceeds to step S7.

[0028] With the processing up to step S6 alone, a problem remains: when training data is scarce, the movement vectors cannot represent the true correspondence between the different speakers, and their error becomes large. In step S7, therefore, the moving vector field smoothing method (VFS method) is used to impose a continuity constraint on the movement vectors, and a smoothing process consisting of the following three steps SS1 to SS3 is performed to absorb the error.
(SS1) The fuzzy membership function μ_{l,k} between the l-th vector C_l^s of the selected speaker in the mapping codebook and each nearby vector C_k^s is calculated.
(SS2) Using the fuzzy membership function μ_{l,k}, the smoothed movement vector V_l is calculated with Equation 6 below.

[0029]

[Equation 6] (equation not reproduced in the source text)

[0030] Here, N_k^α represents the reliability of the movement vector V_k and serves as a weight on the movement vector, controlled by the constant α. When k = l, the fuzzy membership function is set to μ_{l,k} = 1.
(SS3) Using the smoothed movement vector V_l, every vector C_l^s of the mapping codebook in the mapping codebook memory 12 is updated to the vector C_l^t as shown in Equation 7 below. This mapping codebook is used by the spectrum mapping processing unit 22 in the voice quality conversion speech synthesizer 200.

[0031]

[Equation 7] C_l^t = C_l^s + V_l
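The smoothing of steps SS2 and SS3 might look like the sketch below. Because Equation 6 is not reproduced in this text, the normalized weighted average V_l = Σ_k μ_{l,k} N_k^α V_k / Σ_k μ_{l,k} N_k^α is an assumption, chosen to agree with the stated roles of N_k^α as a reliability weight and of μ_{l,l} = 1. The function name and the treatment of zero total weight are likewise hypothetical.

```python
def smooth_movement(l, neighbours, vectors, counts, memberships, alpha):
    """Return the smoothed movement vector for code l (assumed form of Equation 6).

    neighbours:  indices k of nearby code vectors (excluding l itself).
    vectors:     {k: V_k} movement vectors before smoothing.
    counts:      {k: N_k} number of training frames associated with code k.
    memberships: {k: mu_lk} for k in neighbours; mu_ll is fixed to 1.
    """
    dim = len(vectors[l])
    numerator = [0.0] * dim
    denominator = 0.0
    for k in [l] + list(neighbours):
        mu = 1.0 if k == l else memberships[k]
        weight = mu * counts[k] ** alpha      # reliability weight N_k ** alpha
        denominator += weight
        for d in range(dim):
            numerator[d] += weight * vectors[k][d]
    if denominator == 0.0:                    # no reliable neighbour: leave V_l as is
        return list(vectors[l])
    return [value / denominator for value in numerator]

# Hypothetical values for illustration only.
smoothed = smooth_movement(
    0, [1], {0: [2.0], 1: [0.0]}, {0: 1, 1: 1}, {1: 1.0}, alpha=0.05
)
```

The smoothed vector would then be added to C_l^s as in Equation 7.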

[0032] Next, the configuration and operation of the voice quality conversion speech synthesizer 200 of FIG. 1 are described. As shown in FIG. 1, when a character string to be synthesized in the target speaker's voice is input with the keyboard 21, the spectrum mapping processing unit 22 reads the selected speaker's speech spectrum data corresponding to the character string from the speech database 10. It then vector quantizes the vector sequence X_p^s of that speech spectrum using the generated mapping codebook 12, and performs decoding by spectrum mapping as follows.

[0033] The spectrum mapping processing unit 22 calculates the fuzzy membership function μ_{p,q}, a weighting function between the vector sequence X_p^s of the selected speaker's speech spectrum and a predetermined number k of nearby vectors C_q^s (where q = 1, 2, ..., k). It then calculates the converted vector sequence X_p^t of the target speaker from the target-speaker vectors C_q^t associated with the vectors C_q^s and the fuzzy membership function μ_{p,q}. From the vector sequence X_p^t, it calculates the speech spectrum time series mapped from the selected speaker to the target speaker, and outputs it to the parameter sequence generation unit 23.
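A minimal sketch of this decoding step for a single frame is given below. It assumes the converted frame is the membership-weighted sum of the mapped code vectors, X_p^t = Σ_q μ_{p,q} C_q^t, and reuses the assumed membership form 1 / Σ_j (d_q / d_j)^{1/(m-1)}; neither formula is given explicitly in this text. The defaults k = 4 and fuzziness 1.5 follow the simulation settings reported later in the specification, and the function name `convert_frame` is hypothetical.

```python
import math

def convert_frame(x, source_codebook, mapped_codebook, k=4, fuzziness=1.5):
    """Map one selected-speaker frame x into the target speaker's acoustic space."""
    # The k nearest source code vectors C_q^s, as (distance, index) pairs.
    nearest = sorted(
        (math.dist(x, c), q) for q, c in enumerate(source_codebook)
    )[:k]
    if nearest[0][0] == 0.0:                  # x coincides with a code vector
        return list(mapped_codebook[nearest[0][1]])
    ma = 1.0 / (fuzziness - 1.0)
    weights = {
        q: 1.0 / sum((d / d_j) ** ma for d_j, _ in nearest)
        for d, q in nearest
    }
    # Weighted sum of the corresponding target-speaker vectors C_q^t.
    return [
        sum(w * mapped_codebook[q][i] for q, w in weights.items())
        for i in range(len(x))
    ]

# Hypothetical one-dimensional codebooks for illustration only.
source = [[0.0], [2.0]]
mapped = [[10.0], [20.0]]
midpoint = convert_frame([1.0], source, mapped, k=2, fuzziness=2.0)
```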

[0034] The description above covers only the spectrum-related processing in the mapping codebook generation device 100 and the voice quality conversion speech synthesizer 200. The pitch frequency is processed in the same way: a mapping codebook is created, and the time series of the target speaker's pitch frequency is calculated with it and output to the parameter sequence generation unit 23. The pitch frequency processing is not limited to this. Alternatively, the difference between the mean logarithmic pitch frequencies of the target speaker and the selected speaker may be calculated in advance, and the time series of the target speaker's pitch frequency may be calculated by adding that difference to the logarithm of the selected speaker's pitch frequency.
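The alternative pitch conversion described above amounts to a constant shift in the log-frequency domain. The sketch below assumes pitch contours in hertz with unvoiced frames marked as 0; that convention and the function names are assumptions made for the example.

```python
import math

def mean_log_f0(f0_values):
    """Mean of log F0 over voiced frames (frames with F0 > 0)."""
    voiced = [f for f in f0_values if f > 0]
    return sum(math.log(f) for f in voiced) / len(voiced)

def convert_f0(selected_f0, offset):
    """Add the log-domain offset to every voiced frame; pass unvoiced frames through."""
    return [math.exp(math.log(f) + offset) if f > 0 else 0.0 for f in selected_f0]

# Hypothetical pitch values in hertz for illustration only.
offset = mean_log_f0([200.0, 200.0]) - mean_log_f0([100.0, 100.0])
converted_f0 = convert_f0([100.0, 0.0, 150.0], offset)
```

Adding a constant in the log domain scales every pitch value by the same ratio, so the shape of the selected speaker's intonation contour is preserved.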

[0035] Finally, the parameter sequence generation unit 23 collects the input spectrum time series and pitch frequency time series, stores them temporarily in its built-in buffer memory, converts them into time-series data for speech synthesis corresponding to the input character string, and outputs the result to the speech synthesis unit 24. The time-series data comprises the pitch, the voiced/unvoiced switching, the amplitude, and the filter coefficients used for speech synthesis. The speech synthesis unit 24 consists of a pulse generator, a noise generator, a switch, a variable-amplitude amplifier, and a filter. It synthesizes the speech signal from the input time-series data and outputs it to the loudspeaker 25, so that the synthesized speech of the target speaker corresponding to the input character string is output from the loudspeaker 25.

[0036] The inventors carried out the following simulation of the system configured as described above. Out of 216 phoneme-balanced words used as speech samples, one word, "uchiawase" (うちあわせ), was used for training and 50 words were used for evaluation. Four men and four women who are announcers or narrators served as registered speakers, and a different set of four men and four women served as target speakers. The codebook of each registered speaker was created in advance from 503 phoneme-balanced sentences. The codebook size was 512. The fuzziness during smoothing was varied from 1.1 to 5.0, and the fuzziness during interpolation was set to the same value. The fuzziness during decoding was set to 1.5, the weighting coefficient α during smoothing was set to 0.05, and the number of neighbors in each of these processes was 4. The spectral parameter was the 30th-order FFT cepstrum, and the following Equation 8 was used to calculate the distance D.

【００３７】[0037]

【数８】 (Equation 8)

[0038] Here, CEP_ij^s is the j-th order cepstrum coefficient of the i-th frame of the selected speaker after time alignment by the DTW method, CEP_ij^t is the j-th order cepstrum coefficient of the i-th frame of the target speaker, and fr is the number of frames. To examine the basic performance of the method of this embodiment, two cepstrum distances were calculated: the distance between the converted speech and the target speaker's speech, and the distance between the selected speaker's speech and the target speaker's speech. Averaged over the 50 words, the distance between the converted speech and the target speaker's speech was smaller than the distance between the selected speaker's speech and the target speaker's speech for both male and female target speakers, which demonstrates the effectiveness of the method of this embodiment.
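Equation 8 itself is not reproduced in this text, so its exact form is not known here. As an illustration only, the sketch below computes one common form of cepstrum distance from the symbols defined above: the per-frame Euclidean distance between time-aligned cepstrum vectors, averaged over the fr frames. The actual Equation 8 may differ, for example by a scaling factor. The function name and the sample data are hypothetical.

```python
import math

def cepstrum_distance(cep_s, cep_t):
    """Assumed form: mean over frames of the Euclidean distance between
    cep_s[i] (selected speaker, after DTW alignment) and cep_t[i] (target)."""
    fr = len(cep_s)
    if fr == 0 or fr != len(cep_t):
        raise ValueError("both sequences must have the same non-zero frame count")
    total = 0.0
    for frame_s, frame_t in zip(cep_s, cep_t):
        total += math.sqrt(sum((cs - ct) ** 2 for cs, ct in zip(frame_s, frame_t)))
    return total / fr

# Hypothetical 2-frame, 2-coefficient example (the simulation used 30 coefficients).
cep_selected = [[0.0, 0.0], [1.0, 1.0]]
cep_target = [[3.0, 4.0], [1.0, 1.0]]
d = cepstrum_distance(cep_selected, cep_target)
```

The same function can be applied to the pair (converted speech, target speech) and to the pair (selected speech, target speech) to reproduce the kind of comparison described in paragraph [0038].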

[0039] Next, to examine whether the method of this embodiment is perceptually effective, a listening test using the known ABX method was carried out for one male and one female target speaker. A and B are the analysis-synthesis speech of the target speaker and of the selected speaker, respectively, and X is either the converted speech with fuzziness 5 or the analysis-synthesis speech of the selected speaker. For the converted speech, three samples were drawn from the 50 words: one whose reduction ratio of the cepstrum distance was smaller than the 50-word average, one whose ratio was larger, and one whose ratio was about the same. To evaluate only the accuracy of the spectral mapping, the fundamental frequency, phoneme duration, and power were matched to those of the target speaker. The subjects were forced to judge whether the speaker of X was closer to speaker A or to speaker B. There were six subjects, and each sample was presented four times. For the evaluation, the judgment rate CR was calculated according to the following Equation 9, and the results were compared using this value.

【００４０】[0040]

[Equation 9] CR = (P_j / P_all) × 100 [%]

[0041] Here, P_j is the number of times X was judged to be close to the target speaker, and P_all is the total number of presentations.
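As a worked illustration of Equation 9, the sketch below computes the judgment rate from a count of responses. The total of 24 presentations follows from six subjects and four presentations per sample; the count of 16 "close to target" judgments is a hypothetical value chosen for the example, not a figure from the patent.

```python
def judgment_rate(p_j, p_all):
    """CR = (P_j / P_all) * 100 [%], as in Equation 9."""
    if p_all <= 0:
        raise ValueError("p_all must be positive")
    if not 0 <= p_j <= p_all:
        raise ValueError("p_j must lie between 0 and p_all")
    return p_j / p_all * 100.0

# 6 subjects x 4 presentations = 24 presentations for one sample.
presentations = 6 * 4
# Hypothetical: X judged close to the target speaker 16 times out of 24.
cr = judgment_rate(16, presentations)
```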

[0042] According to the evaluation results, the rate at which the converted speech was judged to be close to the target speaker was about 67% for the male target speaker and 65% for the female target speaker. The rate at which the selected speaker was judged to be close to the target speaker was about 18% for the male target speaker and 25% for the female target speaker. In both cases the converted speech was judged to be close to the target speaker at a high rate, which shows that the method is also perceptually effective. The rate at which the selected speaker was judged to be close to the target speaker was higher for the female target speaker than for the male target speaker, presumably because the selected speaker was closer to the target speaker than in the male case. This indicates that a speaker close to the target speaker among the registered speakers was appropriately chosen by speaker selection. The rate at which the converted speech was judged to be close to the target speaker was higher for the male target speaker, presumably because the smoothing process of the VFS method had a larger effect than for the female target speaker. From the above, the two steps have a synergistic effect: the larger the distance between the selected speaker and the target speaker, the greater the effect of the smoothing process of the VFS method, and the smaller the distance, the greater the effect of speaker selection.

[0043] As described above, to realize voice quality conversion with a small amount of training data, a voice quality conversion method is disclosed that performs spectral mapping from the selected speaker to the target speaker by means of speaker selection and the moving vector field smoothing method. In the evaluation by spectral distance and by listening test, training was performed with only one word and evaluation with 50 words. As a result, the spectral distance between the converted speech and the target speaker's speech was smaller than the distance between the selected speaker's speech and the target speaker's speech, and good results were also obtained in the listening test, which demonstrates the effectiveness of the method of this embodiment.

[0044] In the second conventional example, when mapping speech data from a registered speaker to a target speaker, a large amount of training data was needed to learn the correspondence between the codebooks of different speakers, and complicated processing was needed to improve the accuracy of the synthesized speech. In contrast, according to this embodiment of the present invention, the mapping function from a registered speaker to the target speaker can be obtained with a very small amount of training data, about one word, and the method can be put to practical use with, for example, a digital computer. In addition, because only the speech database needs to be stored in advance, voice quality can be converted with higher accuracy than in the conventional example regardless of the content of the utterance. That is, the words stored in the speech database may differ from the words whose voice quality is to be converted, so this embodiment can be applied, for example, to voice quality conversion from Japanese words to English words or from English words to Japanese words.

[0045] In the above embodiment, the A/D converter 2, the feature extraction unit 3, the speaker selection unit 5, the mapping codebook generation unit 6, the spectrum mapping processing unit 22, and the parameter sequence generation unit 23 are implemented by, for example, a digital computer.

[0046] In the above embodiment, speaker selection, mapping codebook generation, and spectrum mapping processing are performed on the spectrum data and the pitch frequency, but other acoustic feature parameters may be processed in the same way. In the above embodiment, it is sufficient to input at least one word to the microphone 1. The speech database stored in advance in the speech database memory 10 may consist of speech database data of a plurality of registered speakers.

【００４７】[0047]

[Effects of the Invention] As described in detail above, the voice quality conversion speech synthesizer according to claim 1 of the present invention comprises: storage means for storing in advance a speech database including acoustic feature parameters of speech signals of at least one word of a plurality of registered speakers; selecting means for selecting, from among the plurality of registered speakers, the speaker closest to the target speaker whose voice quality is to be converted, based on an input speech signal of at least one word of the target speaker; generating means for calculating a mapping codebook from the selected speaker to the target speaker by calculating the difference between the acoustic space of the speaker selected by the selecting means and the acoustic space of the target speaker; mapping processing means for quantizing, based on an input character string to be synthesized, the acoustic feature parameters of the speech of the selected speaker stored in the speech database using the codebook of the selected speaker, and for generating the acoustic feature parameters of the speech signal of the target speaker corresponding to the character string based on the correspondence between the codebook of the selected speaker and the mapping codebook; and speech synthesis means for generating and outputting the speech signal of the target speaker corresponding to the character string based on the acoustic feature parameters of the speech signal of the target speaker generated by the mapping processing means. In the second conventional example, when mapping speech data from a registered speaker to a target speaker, a large amount of training data was needed to learn the correspondence between the codebooks of different speakers, and complicated processing was needed to improve the accuracy of the synthesized speech. In contrast, according to the present invention, the mapping function from a registered speaker to the target speaker can be obtained with a very small amount of training data, about one word, and the invention can be put to practical use with, for example, a digital computer. In addition, because only the speech database needs to be stored in advance, voice quality can be converted with higher accuracy than in the conventional example regardless of the content of the utterance. That is, the words stored in the speech database may differ from the words whose voice quality is to be converted, so the present invention can be applied, for example, to voice quality conversion from Japanese words to English words or from English words to Japanese words.
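The mapping step summarized above (quantize with the selected speaker's codebook, then look up the corresponding entry of the mapping codebook) can be sketched in Python. This is a deliberately simplified illustration: it uses hard nearest-neighbor vector quantization and builds the mapping codebook by adding one difference vector per code vector, whereas the embodiment uses fuzzy vector quantization with smoothing and interpolation. All function names and the toy two-dimensional data are assumptions for this sketch.

```python
def nearest_index(vec, codebook):
    """Index of the code vector with the smallest squared Euclidean distance."""
    return min(range(len(codebook)),
               key=lambda k: sum((v - c) ** 2 for v, c in zip(vec, codebook[k])))

def build_mapping_codebook(codebook, move_vectors):
    """Assumed construction: each mapped code vector is the selected speaker's
    code vector plus a difference (movement) vector toward the target speaker."""
    return [[c + m for c, m in zip(code, mv)]
            for code, mv in zip(codebook, move_vectors)]

def map_frames(frames, codebook, mapping_codebook):
    """Quantize each frame with the selected speaker's codebook and emit the
    corresponding entry of the mapping codebook."""
    return [mapping_codebook[nearest_index(f, codebook)] for f in frames]

# Hypothetical toy data: a 2-entry codebook of 2-dimensional vectors.
selected_codebook = [[0.0, 0.0], [10.0, 10.0]]
move_vectors = [[1.0, -1.0], [2.0, 2.0]]
mapping_codebook = build_mapping_codebook(selected_codebook, move_vectors)

selected_frames = [[0.5, 0.2], [9.0, 11.0]]
target_frames = map_frames(selected_frames, selected_codebook, mapping_codebook)
```

Because the lookup depends only on the codebook index, the text being synthesized does not need to match the word used for training, which is consistent with the point made above that the stored words and the converted words may differ.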

[0048] In the voice quality conversion speech synthesizer according to claim 2, the generating means calculates the mapping codebook from the selected speaker to the target speaker using the moving vector field smoothing method. This makes it possible to convert voice quality and synthesize speech more simply and accurately.

[Brief description of drawings]

FIG. 1 is a block diagram of a mapping codebook generation device 100 and a voice quality conversion speech synthesizer 200 according to an embodiment of the present invention.

FIG. 2 is a flowchart showing the mapping codebook generation process executed by the mapping codebook generation unit 6 of FIG. 1.

FIG. 3 is a block diagram of the second conventional example, showing a method for generating a pitch frequency conversion codebook.

FIG. 4 is a block diagram showing a voice quality conversion method based on vector quantization that uses the pitch frequency conversion codebook generated by the method of FIG. 3 and a spectral parameter conversion codebook generated by a similar method.

[Explanation of symbols]

1: microphone, 2: A/D converter, 3: feature extraction unit, 4: buffer memory, 5: speaker selection unit, 6: mapping codebook generation unit, 10: speech database, 11: spectrum codebook, 21: keyboard, 22: spectrum mapping processing unit, 23: parameter sequence generation unit, 24: speech synthesis unit, 25: loudspeaker, 100: mapping codebook generation device, 200: voice quality conversion speech synthesizer.

Claims

[Claims]

1. A voice quality conversion speech synthesizer comprising: storage means for storing in advance a speech database including acoustic feature parameters of a plurality of registered speakers, together with codebooks thereof; selecting means for selecting, from among the plurality of registered speakers, the speaker closest to a target speaker whose voice quality is to be converted, based on an input speech signal of at least one word of the target speaker; generating means for calculating a mapping codebook from the selected speaker to the target speaker by calculating the difference between the acoustic space of the speaker selected by the selecting means and the acoustic space of the target speaker; mapping processing means for quantizing, based on an input character string to be synthesized, the acoustic feature parameters of the speech of the selected speaker stored in the speech database using the codebook of the selected speaker, and for generating acoustic feature parameters of the speech signal of the target speaker corresponding to the character string based on the correspondence between the codebook of the selected speaker and the mapping codebook; and speech synthesis means for generating and outputting a speech signal of the target speaker corresponding to the character string based on the acoustic feature parameters of the speech signal of the target speaker generated by the mapping processing means.

2. The voice quality conversion speech synthesizer according to claim 1, wherein the generating means calculates the mapping codebook from the selected speaker to the target speaker using a moving vector field smoothing method.

3. The voice quality conversion speech synthesizer according to claim 1, wherein the acoustic feature parameters include spectral data.

4. The voice quality conversion speech synthesizer according to claim 3, wherein the acoustic feature parameter further includes pitch frequency data.