JPH0816183A

JPH0816183A - Sound quality adaption device

Info

Publication number: JPH0816183A
Application number: JP6147876A
Authority: JP
Inventors: Mitsuru Ebihara (海老原 充); Yasushi Ishikawa (石川 泰)
Original assignee: Mitsubishi Electric Corp
Current assignee: Mitsubishi Electric Corp
Priority date: 1994-06-29
Filing date: 1994-06-29
Publication date: 1996-01-19
Anticipated expiration: 2017-03-04
Also published as: JP3261870B2

Abstract

PURPOSE: To achieve high-quality edit synthesis by converting adapted input speech into the spectrum parameters of the reference speech in an optimum section, chosen by evaluating the distance value between the adapted input speech and the reference speech. CONSTITUTION: A distance value calculating means 14 calculates and outputs a spectrum parameter distance value 15, such as spectral distortion, between a partial reference speech spectrum parameter 12 and a partial adapted input speech spectrum parameter 13, using a frame-to-frame correspondence obtained, for example, by DP matching. An optimum section determining means 16 receives the spectrum parameter distance value 15, detects the optimum section of the reference speech, and outputs it as an optimum reference speech partial section 17. A converting means 18 then converts the partial adapted input speech spectrum parameter 13 into the partial reference speech spectrum parameter 12 of the optimum reference speech partial section 17, using the frame correspondence obtained by DP matching, and outputs the result as a converted speech spectrum parameter 19.

Description

Detailed Description of the Invention

[0001]

[Field of Industrial Application] The present invention relates to a speech quality adaptation device that establishes a correspondence between input speech and reference speech by learning, and in particular to a speech quality adaptation device applicable to edit synthesis, in which stored speech data recorded by different speakers or at different times are concatenated.

[0002]

[Prior Art] Edit synthesis is a technique that outputs fixed-form messages as speech by concatenating, according to the purpose, speech data that has been recorded and stored in advance in large quantities in units of words, phrases, or sentences. When unregistered words must be output, for example because the message content has been updated, new speech data must be added. However, the additionally recorded speech data may differ in speech quality from the existing data because the speaker is different or, even for the same speaker, because the recording environment or the speaking condition is different. In that case the synthesized speech sounds unnatural at the junctions in edit synthesis. Avoiding this by matching the quality of the existing data to that of the additional data would require re-recording all of the existing data, which takes enormous effort. A technique is therefore needed that makes the quality of the added speech data (input speech data) uniform with that of the existing data (reference speech data). One conventional approach is spectrum adaptation by codebook mapping, as applied to voice conversion and similar tasks, proposed in "Voice Conversion by Vector Quantization" (Abe, Nakamura, Shikano, Kuwabara, Proceedings of the Acoustical Society of Japan, October 1987, pp. 215-216).

[0003] FIG. 5 is a block diagram showing an example configuration of a conventional spectrum adaptation scheme by codebook mapping based on this technique. In the figure, 3 is a spectrum adapting means that performs spectrum adaptation by codebook mapping; 2 is an input speech spectrum parameter, which is the input to the spectrum adapting means 3; 4 is an adapted input speech spectrum parameter, which is the output of the spectrum adapting means 3 and has been spectrally adapted to the reference speech; 32 is a learning reference speech spectrum parameter, a small amount of reference speech data stored in the spectrum adapting means 3 and used for learning between the input speech and the reference speech; and 33 is a learning input speech spectrum parameter, a small amount of input speech data that, like the learning reference speech spectrum parameter 32, is used for learning and has the same utterance content as the learning reference speech data.

Further, 24 is a vector quantizing means that searches the codebook for the code vector most similar to the input spectrum parameter and outputs its code number; 22 is a reference speech vector quantization codebook used for vector quantization of the reference speech; 34 is an input speech vector quantization codebook used for vector quantization of the input speech; 25 is a reference speech vector quantization code number sequence obtained by vector-quantizing the spectrum parameters of the reference speech with the reference speech vector quantization codebook 22; 35 is an input speech vector quantization code number sequence obtained by vector-quantizing the spectrum parameters of the input speech with the input speech vector quantization codebook 34; 36 is a learning means that establishes the correspondence between the learning input speech data and the learning reference speech data and creates a codebook for converting input speech data into reference speech data; 23 is an adapted input speech vector quantization codebook, which is the output of the learning means 36; 26 is an adapted input speech vector quantization code number sequence obtained by quantizing the spectrum parameters of the input speech with the adapted input speech vector quantization codebook 23; and 37 is a decoding means that receives a vector quantization code number sequence and outputs the code vector stored at each code number in the codebook.
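As an illustration of the vector quantizing means 24 and the decoding means 37, the following is a minimal sketch, not the patented implementation. It assumes Euclidean distance as the similarity measure (the text only says "most similar"), and the names `nearest_code`, `vector_quantize`, `decode`, and the toy data are hypothetical.

```python
import math

def nearest_code(frame, codebook):
    """Return the code number of the code vector closest to `frame`.

    Assumption: similarity is measured by Euclidean distance.
    """
    best_index, best_dist = 0, math.inf
    for index, code_vector in enumerate(codebook):
        dist = math.dist(frame, code_vector)
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index

def vector_quantize(frames, codebook):
    """Quantize every frame; the result is a code number sequence."""
    return [nearest_code(frame, codebook) for frame in frames]

def decode(code_numbers, codebook):
    """Look up the code vector stored at each code number."""
    return [codebook[number] for number in code_numbers]

# Toy two-dimensional "spectrum parameters" (hypothetical values).
toy_codebook = [(0.0, 0.0), (1.0, 1.0), (4.0, 0.0)]
toy_frames = [(0.1, -0.2), (0.9, 1.2), (3.5, 0.4)]

codes = vector_quantize(toy_frames, toy_codebook)
restored = decode(codes, toy_codebook)
```

Decoding with a different codebook than the one used for quantization is what turns this pair of operations into a mapping from one speaker's spectral space to another's.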

[0004] The operation of the conventional spectrum adaptation scheme by codebook mapping is described below with reference to FIGS. 5 and 6. In the spectrum adapting means 3, the vector quantizing means 24 vector-quantizes the learning reference speech spectrum parameter 32, which is learning data, with the reference speech vector quantization codebook 22 and outputs the reference speech vector quantization code number sequence 25. The learning input speech spectrum parameter 33, also learning data, is likewise vector-quantized by the vector quantizing means 24 with the input speech vector quantization codebook 34 and output as the input speech vector quantization code number sequence 35.

The learning means 36 decodes the reference speech vector quantization code number sequence 25 and the input speech vector quantization code number sequence 35 with the reference speech vector quantization codebook 22 and the input speech vector quantization codebook 34, respectively, and aligns the quantized reference speech data and input speech data along the time axis. A common method for performing this alignment efficiently is nonlinear time warping by DP (Dynamic Programming) matching: as shown in FIG. 6, a path is selected that minimizes the sum of the spectral distance values between the two data sequences, which yields a correspondence such as the j-th frame of the input speech data corresponding to the i-th frame of the reference speech data. Then, for each vector quantization code number of the reference speech, the vector quantization code number of the input speech that corresponds to it most often is determined; on that basis, the code vectors of the reference speech vector quantization codebook 22 are converted into code vectors of the input speech vector quantization codebook 34 to create and output the adapted input speech vector quantization codebook 23. The codebook learning up to this point is carried out in advance.

The vector quantizing means 24 then vector-quantizes the incoming input speech spectrum parameter 2 frame by frame with the adapted input speech vector quantization codebook 23 and outputs the adapted input speech vector quantization code number sequence 26. The decoding means 37 receives the adapted input speech vector quantization code number sequence 26, refers to the reference speech vector quantization codebook 22, derives from it the code vector corresponding to each vector quantization code number, and outputs the result as the adapted input speech spectrum parameter 4.
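The DP matching step and the "most frequent correspondence" step can be sketched as follows. This is a simplified illustration under stated assumptions: Euclidean frame distance, the basic symmetric step pattern (horizontal, vertical, diagonal) with no slope constraints, and hypothetical names `dp_match` and `most_frequent_mapping`. The cited paper and FIG. 6 may use different local constraints.

```python
import math
from collections import Counter

def dp_match(ref, inp):
    """Align two frame sequences by dynamic programming.

    Returns (total_distance, path), where path is a list of (i, j) pairs
    meaning reference frame i corresponds to input frame j.
    """
    n, m = len(ref), len(inp)
    cost = [[math.inf] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            d = math.dist(ref[i], inp[j])
            if i == 0 and j == 0:
                cost[i][j] = d
                continue
            best_prev = min(
                cost[i - 1][j] if i > 0 else math.inf,
                cost[i][j - 1] if j > 0 else math.inf,
                cost[i - 1][j - 1] if i > 0 and j > 0 else math.inf,
            )
            cost[i][j] = d + best_prev
    # Backtrack from the end to recover the minimum-cost path.
    i, j = n - 1, m - 1
    path = [(i, j)]
    while (i, j) != (0, 0):
        candidates = []
        if i > 0 and j > 0:
            candidates.append((cost[i - 1][j - 1], i - 1, j - 1))
        if i > 0:
            candidates.append((cost[i - 1][j], i - 1, j))
        if j > 0:
            candidates.append((cost[i][j - 1], i, j - 1))
        _, i, j = min(candidates)
        path.append((i, j))
    path.reverse()
    return cost[n - 1][m - 1], path

def most_frequent_mapping(ref_codes, inp_codes, path):
    """For each reference code number, pick the input code number that is
    aligned with it most often along the DP path."""
    counts = {}
    for i, j in path:
        counts.setdefault(ref_codes[i], Counter())[inp_codes[j]] += 1
    return {ref: c.most_common(1)[0][0] for ref, c in counts.items()}

# Toy one-dimensional frames: the input lingers on the first sound.
ref_frames = [(0.0,), (1.0,), (2.0,)]
inp_frames = [(0.0,), (0.0,), (1.0,), (2.0,)]
total, path = dp_match(ref_frames, inp_frames)
mapping = most_frequent_mapping([5, 6, 7], [1, 1, 2, 3], path)
```

In this toy case the path repeats reference frame 0 for the first two input frames, which is the nonlinear stretching of the time axis that the text describes.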

[0005]

[Problems to Be Solved by the Invention] Because the conventional speech quality adaptation device based on codebook mapping is configured as described above, the values of the code vectors in each codebook are discrete. In particular, because the adapted input speech vector quantization codebook uses the input speech code vector that correlates most strongly with each reference speech code vector under the DP matching operation, the discreteness between codes becomes even greater. As a result, the converted spectrum parameters become discontinuous between frames, and the continuous characteristics that natural speech retains along the time axis cannot be reproduced.

The use of a codebook is itself also a problem. Because a codebook uses, as each code vector, the centroid of the corresponding set of spectrum parameters, phonemic features such as formants are flattened when the variance of the spectrum parameters for a code is large. Phonemic clarity is therefore blurred, which degrades sound quality.

As a way of solving these problems of codebook mapping, a voice conversion scheme based on associating speech in segment units has also been proposed, as disclosed in Japanese Unexamined Patent Publication No. H4-158397. However, because that method uses pre-segmented data as standard patterns, the parameters carry time-direction characteristics, so quantization distortion tends to increase compared with codebook mapping; avoiding this requires preparing an enormous number of standard patterns of various durations. Furthermore, because there are more standard patterns than in codebook mapping, the number of learning samples corresponding to each standard pattern decreases, and as a result the influence of idiosyncrasies caused by bias in the sample distribution becomes stronger.
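The flattening effect of centroids can be seen in a small numeric illustration. The two "spectra" below are invented toy values, each with a single sharp peak standing in for a formant at a slightly different frequency bin; `centroid` is a hypothetical helper that averages them element by element, as a code vector would.

```python
def centroid(vectors):
    """Element-wise mean of equal-length vectors (the code vector)."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]

# Two toy spectral envelopes assigned to the same code (hypothetical values).
frame_a = [0.0, 1.0, 4.0, 1.0, 0.0, 0.0]  # peak at bin 2
frame_b = [0.0, 0.0, 1.0, 4.0, 1.0, 0.0]  # peak at bin 3

code_vector = centroid([frame_a, frame_b])
peak_before = max(frame_a)
peak_after = max(code_vector)
```

Each original frame has a peak of 4.0, while the centroid has a broad plateau of 2.5 across bins 2 and 3: the peak is both lower and wider, which is the blurring of phonemic features described above.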

[0006] The present invention was made to solve these problems, and its object is to provide a speech quality adaptation device applicable to high-quality edit synthesis.

[0007]

[Means for Solving the Problems] The speech quality adaptation device according to the first invention comprises: a spectrum adapting means that adapts the spectral features of each frame of input speech using learning data; a labeling means that uses phoneme models to output, for each of the adapted input speech (the output of the spectrum adapting means) and the reference speech, phoneme labels including phoneme boundary information; a partial section determining means that determines an input speech partial section from the phoneme label sequence of the adapted input speech given by the labeling means and detects, in the phoneme label sequence of the reference speech given by the labeling means, sections that match the phoneme label sequence of the input speech partial section; a clipping means that clips the adapted input speech and the reference speech on the basis of the partial section information output by the partial section determining means; a distance value calculating means that obtains a distance value between the adapted input speech and the reference speech clipped by the clipping means; an optimum section determining means that selects the optimum section of the reference speech by evaluating the distance values obtained by the distance value calculating means; and a converting means that converts the adapted input speech clipped by the clipping means into the reference speech in the optimum section determined by the optimum section determining means.

[0008] The speech quality adaptation device according to the second invention uses the input speech, instead of the adapted input speech, as the input to the labeling means and the clipping means of the first invention.

[0009] The speech quality adaptation device according to the third invention comprises: a vector quantizing means that vector-quantizes the reference speech and the input speech frame by frame with a reference speech vector quantization codebook and an adapted input speech vector quantization codebook, respectively; a code number sequence clipping means that determines a partial section of the input speech vector quantization code number sequence output by the vector quantizing means and outputs an input speech partial vector quantization code number sequence, and that detects and clips, from the reference speech vector quantization code number sequence output by the vector quantizing means, sections that match or resemble the input speech partial vector quantization code number sequence; a distance value calculating means that obtains a distance value between the input speech and the reference speech for the partial code number sequences output by the code number sequence clipping means; an optimum section determining means that evaluates the distance values obtained by the distance value calculating means and selects the optimum section of the reference speech; and a converting means that converts the input speech partial vector quantization code number sequence output by the code number sequence clipping means into the reference speech in the optimum section determined by the optimum section determining means.

[0010] The speech quality adaptation device according to the fourth invention comprises: a spectrum adapting means that adapts the spectral features of each frame of speech; an input speech clipping means that clips a section of the adapted input speech output by the spectrum adapting means; a reference section determining means that obtains distance values between the adapted input speech clipped by the input speech clipping means and every partial section of the reference speech and finds the optimum section of the reference speech by evaluating those distance values; and a converting means that converts the adapted input speech clipped by the input speech clipping means into the reference speech in the optimum section determined by the reference section determining means.

[0011]

[Operation] In the speech quality adaptation device according to the present invention, the labeling means assigns phoneme labels including phoneme boundary information to the incoming adapted input speech using phoneme models. The partial section determining means determines and outputs an input speech partial section from the phoneme label sequence of the adapted input speech, then searches the phoneme label sequence of the reference speech for sections that match the phoneme label sequence of the input speech partial section and outputs them as reference speech partial sections. The clipping means clips the adapted input speech at the input speech partial section and outputs partial adapted input speech; in the same way, it clips the reference speech at the reference speech partial sections and outputs partial reference speech. The distance value calculating means calculates the spectral distance value between the partial adapted input speech and the partial reference speech, and the optimum section selecting means evaluates the distance values, detects the optimum section of the reference speech, and outputs it as the optimum reference speech partial section. The converting means then converts the partial adapted input speech into the partial reference speech in the optimum reference speech partial section.
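The search for reference sections whose phoneme labels match the input partial section can be sketched as a simple subsequence search. The label representation `(phoneme, start_frame, end_frame)` with an exclusive end frame, the function name `find_matching_sections`, and the toy labels are all assumptions made for illustration; the text does not specify a data format.

```python
def find_matching_sections(ref_labels, target_phonemes):
    """Find every run of reference labels whose phonemes equal the target.

    ref_labels: list of (phoneme, start_frame, end_frame), end exclusive.
    Returns a list of (start_frame, end_frame) reference partial sections.
    """
    k = len(target_phonemes)
    sections = []
    for start in range(len(ref_labels) - k + 1):
        window = ref_labels[start:start + k]
        if [phoneme for phoneme, _, _ in window] == list(target_phonemes):
            sections.append((window[0][1], window[-1][2]))
    return sections

# Toy reference labeling (hypothetical): "k a s a k a".
ref_labels = [
    ("k", 0, 5), ("a", 5, 12), ("s", 12, 18),
    ("a", 18, 26), ("k", 26, 30), ("a", 30, 39),
]
candidates = find_matching_sections(ref_labels, ["k", "a"])
```

Here the syllable "ka" occurs twice in the reference, so two candidate sections are returned; the distance evaluation that follows in the text is what chooses between them.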

[0012] Also, in the speech quality adaptation device according to the present invention, the labeling means assigns phoneme labels including phoneme boundary information to the incoming adapted input speech using phoneme models. The partial section determining means determines and outputs an input speech partial section from the phoneme label sequence of the adapted input speech, then searches the phoneme label sequence of the reference speech for sections that match the phoneme label sequence of the input speech partial section and outputs them as reference speech partial sections. The clipping means clips the input speech at the input speech partial section and outputs partial input speech; in the same way, it clips the reference speech at the reference speech partial sections and outputs partial reference speech. The distance value calculating means calculates the spectral distance value between the partial input speech and the partial reference speech, and the optimum section selecting means evaluates the distance values, detects the optimum section of the reference speech, and outputs it as the optimum reference speech partial section. The converting means then converts the partial input speech into the partial reference speech in the optimum reference speech partial section.

[0013] Further, in the speech quality adaptation device according to the present invention, the code number sequence clipping means determines and outputs an input speech partial vector quantization code number sequence from the incoming vector quantization code number sequence of the input speech. It then searches the incoming vector quantization code number sequence of the reference speech for sections that match or resemble the input speech partial vector quantization code number sequence, clips the reference speech vector quantization code number sequence at those sections, and outputs reference speech partial vector quantization code number sequences. The distance value calculating means refers to the reference speech vector quantization codebook and the adapted input speech vector quantization codebook and calculates the codebook-based spectral distance value between the input speech partial vector quantization code number sequence and each reference speech partial vector quantization code number sequence. The optimum section determining means evaluates the distance values, detects the optimum section of the reference speech, and outputs it as the optimum reference speech partial section, and the converting means converts the input speech partial vector quantization code number sequence into the reference speech in the optimum reference speech partial section.
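The text does not define what makes a code number section "similar". The sketch below assumes one simple possibility: slide a window of the same length over the reference code number sequence and accept it when at most `max_mismatches` positions differ. Both that criterion and the name `find_similar_code_sections` are hypothetical.

```python
def find_similar_code_sections(ref_codes, inp_partial, max_mismatches=1):
    """Return (start, end, mismatches) for each reference window that
    matches the input partial code number sequence closely enough.

    Assumption: similarity is the count of differing positions in a
    same-length window; `end` is exclusive.
    """
    k = len(inp_partial)
    hits = []
    for start in range(len(ref_codes) - k + 1):
        window = ref_codes[start:start + k]
        mismatches = sum(1 for a, b in zip(window, inp_partial) if a != b)
        if mismatches <= max_mismatches:
            hits.append((start, start + k, mismatches))
    return hits

# Toy code number sequences (hypothetical values).
ref_codes = [3, 3, 7, 7, 2, 9, 3, 7, 7, 1]
inp_partial = [3, 7, 7]

similar = find_similar_code_sections(ref_codes, inp_partial, max_mismatches=1)
exact = find_similar_code_sections(ref_codes, inp_partial, max_mismatches=0)
```

Because the search works on integer code numbers rather than spectrum parameters, candidate sections can be found without any distance computation; the spectral distance is only needed afterwards to rank the candidates.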

[0014] Further, in the speech quality adaptation device according to the present invention, the input speech clipping means clips a section of the incoming adapted input speech. The reference section determining means clips the reference speech appropriately, calculates the spectral distance value between every clipped piece of reference speech and the partial adapted input speech, evaluates the distance values, detects the optimum section of the reference speech, and outputs it as the optimum reference speech partial section. The converting means then converts the partial adapted input speech into the partial reference speech in the optimum reference speech partial section.
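An exhaustive search over reference sections, with no phoneme labels, might look like the sketch below. It is deliberately simplified: it only tries windows of the same length as the clipped input and uses the mean Euclidean frame distance, whereas the text leaves both the clipping of the reference speech and the distance measure open. The name `best_reference_section` and the toy frames are hypothetical.

```python
import math

def best_reference_section(ref_frames, inp_frames):
    """Scan every same-length window of the reference and return
    (mean_distance, start, end) for the closest one, or None if the
    reference is shorter than the input. `end` is exclusive."""
    k = len(inp_frames)
    best = None
    for start in range(len(ref_frames) - k + 1):
        window = ref_frames[start:start + k]
        dist = sum(math.dist(r, x) for r, x in zip(window, inp_frames)) / k
        if best is None or dist < best[0]:
            best = (dist, start, start + k)
    return best

# Toy one-dimensional frames (hypothetical values).
ref_frames = [(0.0,), (1.0,), (5.0,), (6.0,), (2.0,)]
inp_frames = [(5.2,), (5.9,)]

best = best_reference_section(ref_frames, inp_frames)
```

The cost of this approach grows with the length of the reference speech, which is the trade-off for not needing a labeling step.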

[0015]

[Embodiments]

Embodiment 1. A first embodiment of the present invention is described with reference to FIG. 1. FIG. 1 is a block diagram showing the configuration of the speech quality adaptation device. In the figure, 1 is a reference speech spectrum parameter input to the device for speech quality adaptation of the input speech spectrum parameter 2; 5 is a labeling means that assigns phoneme labels to input spectrum parameters; 6 is a reference speech phoneme label sequence assigned to the reference speech spectrum parameter 1 by the labeling means 5; 7 is an adapted input speech phoneme label sequence assigned to the adapted input speech spectrum parameter 4 by the labeling means 5; 8 is a partial section determining means that, on the basis of the input phoneme label sequences, determines partial sections of the speech data roughly the length of a syllable or a speech unit; 9 is a reference speech partial section, a partial section of the reference speech; and 10 is an input speech partial section, a partial section of the input speech.

Further, 11 is a clipping means that extracts the spectrum parameters in a partial section; 12 is a partial reference speech spectrum parameter extracted from the reference speech spectrum parameter 1 by the clipping means 11; 13 is a partial adapted input speech spectrum parameter extracted from the adapted input speech spectrum parameter 4 by the clipping means 11; 14 is a distance value calculating means that obtains the spectral distance value between two pieces of speech data in partial sections; 15 is a spectrum parameter distance value; 16 is an optimum section selecting means that selects the partial reference speech spectrum parameter 12 considered optimal on the basis of an evaluation of the spectrum parameter distance value 15; 17 is an optimum reference speech partial section; 18 is a converting means that converts the speech spectrum parameters of an input partial section into the partial reference speech spectrum parameter 12 in the optimum reference speech partial section 17; and 19 is a converted spectrum parameter, the output of the converting means 18. Items 2, 3, and 4 in the figure are the same as in the conventional example.

【００１６】次に、動作について説明する。スペクトル
適応化手段３は、入力音声スペクトルパラメータ２を入
力し、例えば従来例における学習データを用いたコード
ブックマッピングの手法により各フレームで入力音声か
ら参照音声へのスペクトルの変換を行い、適応化入力音
声スペクトルパラメータ４を出力する。ラベリング手段
５は、各々参照音声スペクトルパラメータ１、および適
応化入力音声スペクトルパラメータ４を入力し、音素モ
デルを用いてそれぞれについて音素境界情報を含む参照
音声音素ラベル列６と、適応化入力音声音素ラベル列７
を出力する。部分区間決定手段８は、前記参照音声音素
ラベル列６と前記適応化入力音声音素ラベル列７を入力
し、適応化入力音声音素ラベル列７の視察により音素境
界によって区切られた区間を決定して入力音声部分区間
１０を出力し、一方参照音声音素ラベル列６の中から前
記入力音声部分区間１０における適応化入力音素ラベル
列７と一致する区間を検索して参照音声部分区間９とし
て出力する。切り出し手段１１は、参照音声スペクトル
パラメータ１を前記参照音声部分区間９に基づいて切り
出し、部分参照音声スペクトルパラメータ１２を出力す
る。また、同様にして適応化入力音声スペクトルパラメ
ータ４を前記入力音声部分区間１０に基づいて切り出
し、部分適応化入力音声スペクトルパラメータ１３を出
力する。距離値算出手段１４は、前記部分参照音声スペ
クトルパラメータ１２と前記部分適応化入力音声スペク
トルパラメータ１３の間で、例えば図６に示されるよう
なＤＰマッチングによるフレーム間の対応付けによりス
ペクトル歪のようなスペクトルパラメータ距離値１５を
計算し出力する。最適区間決定手段１６は前記スペクト
ルパラメータ距離値１５を入力し、その距離値が最小か
つある一定の閾値以下であるものを検出するような距離
値判定方法により最適な参照音声の区間を検出し、最適
参照音声部分区間１７として出力する。変換手段１８は
部分適応化入力音声スペクトルパラメータ１３を、前記
最適参照音声部分区間１７の部分参照音声スペクトルパ
ラメータ１２に図６に示されるＤＰマッチングによるフ
レーム対応で変換し、変換音声スペクトルパラメータ１
９として出力する。Next, the operation will be described. The spectrum adaptation means 3 receives the input speech spectrum parameter 2, converts the spectrum of the input speech toward that of the reference speech frame by frame, for example by the codebook mapping method using learning data described for the conventional example, and outputs the adapted input speech spectrum parameter 4. The labeling means 5 receives the reference speech spectrum parameter 1 and the adapted input speech spectrum parameter 4 and, using phoneme models, outputs for each a phoneme label sequence including phoneme boundary information: the reference speech phoneme label sequence 6 and the adapted input speech phoneme label sequence 7. The partial section determination means 8 receives the reference speech phoneme label sequence 6 and the adapted input speech phoneme label sequence 7, inspects the adapted input speech phoneme label sequence 7 to determine a section delimited by phoneme boundaries, and outputs it as the input speech partial section 10. It also searches the reference speech phoneme label sequence 6 for a section that matches the adapted input speech phoneme label sequence 7 within the input speech partial section 10, and outputs it as the reference speech partial section 9. The clipping means 11 clips the reference speech spectrum parameter 1 on the basis of the reference speech partial section 9 and outputs the partial reference speech spectrum parameter 12. In the same way, it clips the adapted input speech spectrum parameter 4 on the basis of the input speech partial section 10 and outputs the partially adapted input speech spectrum parameter 13. The distance value calculation means 14 calculates and outputs a spectrum parameter distance value 15, such as spectral distortion, between the partial reference speech spectrum parameter 12 and the partially adapted input speech spectrum parameter 13, using a frame-to-frame correspondence obtained, for example, by DP matching as shown in FIG. 6. The optimum section determination means 16 receives the spectrum parameter distance value 15, detects the optimum reference speech section by a distance value decision method, such as detecting the section whose distance value is the smallest and does not exceed a certain threshold, and outputs it as the optimum reference speech partial section 17. The conversion means 18 converts the partially adapted input speech spectrum parameter 13 into the partial reference speech spectrum parameter 12 of the optimum reference speech partial section 17, using the frame correspondence obtained by the DP matching shown in FIG. 6, and outputs the result as the converted speech spectrum parameter 19.
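
The frame-to-frame correspondence by DP matching used by the distance value calculation means 14 can be illustrated roughly as follows. This is only a minimal sketch: FIG. 6 is not reproduced here, so the local path constraint (diagonal, vertical, horizontal steps), the squared Euclidean frame distance, and the names `frame_distance` and `dp_match` are all assumptions made for illustration, not the method fixed by the specification.

```python
def frame_distance(a, b):
    """Squared Euclidean distance between two spectral parameter vectors (assumed measure)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def dp_match(ref, inp):
    """Align two non-empty frame sequences by dynamic programming.

    Returns (total_distance, path), where path is a list of
    (ref_index, inp_index) pairs running from the first frames to the last.
    """
    n, m = len(ref), len(inp)
    cost = [[float("inf")] * m for _ in range(n)]
    back = [[None] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            d = frame_distance(ref[i], inp[j])
            if i == 0 and j == 0:
                cost[i][j] = d
                continue
            # Assumed local constraint: diagonal, vertical, or horizontal step.
            candidates = []
            if i > 0 and j > 0:
                candidates.append((cost[i - 1][j - 1], (i - 1, j - 1)))
            if i > 0:
                candidates.append((cost[i - 1][j], (i - 1, j)))
            if j > 0:
                candidates.append((cost[i][j - 1], (i, j - 1)))
            best_cost, best_prev = min(candidates)
            cost[i][j] = best_cost + d
            back[i][j] = best_prev
    # Trace the optimal path back from the final frame pair.
    path = []
    node = (n - 1, m - 1)
    while node is not None:
        path.append(node)
        node = back[node[0]][node[1]]
    path.reverse()
    return cost[n - 1][m - 1], path
```

The accumulated cost would play the role of the spectrum parameter distance value 15, and the returned path gives the frame correspondence that the conversion means 18 could reuse.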

[0017] According to this embodiment, the problems that the spectrum parameters become discontinuous between frames and that the phonemic quality becomes blurred can be solved without using segment-level standard patterns, and high-quality voice quality adaptation can be realized.

[0018] Embodiment 2. A second embodiment of the present invention will be described with reference to FIG. 2. FIG. 2 is a block diagram showing the configuration of the voice quality adaptation device. Reference numeral 2 is the same as in the conventional example, and 1, 5, 6, 8, 9, 10, 11, 12, 14, 15, 16, 17, 18, and 19 are the same as those described in Embodiment 1. Reference numeral 20 is the input speech phoneme label sequence given to the input speech spectrum parameter 2 by the labeling means 5, and 21 is the partial input speech spectrum parameter extracted from the input speech spectrum parameter 2 by the clipping means 11.

[0019] Next, the operation will be described. The labeling means 5 receives the reference speech spectrum parameter 1 and the input speech spectrum parameter 2 and, using phoneme models, outputs for each a phoneme label sequence including phoneme boundary information: the reference speech phoneme label sequence 6 and the input speech phoneme label sequence 20. The partial section determination means 8 receives the reference speech phoneme label sequence 6 and the input speech phoneme label sequence 20, inspects the input speech phoneme label sequence 20 to determine a section delimited by phoneme boundaries, and outputs it as the input speech partial section 10. It also searches the reference speech phoneme label sequence 6 for a section that matches the input speech phoneme label sequence 20 within the input speech partial section 10, and outputs it as the reference speech partial section 9. The clipping means 11 clips the reference speech spectrum parameter 1 on the basis of the reference speech partial section 9 and outputs the partial reference speech spectrum parameter 12. In the same way, it clips the input speech spectrum parameter 2 on the basis of the input speech partial section 10 and outputs the partial input speech spectrum parameter 21. The distance value calculation means 14 calculates and outputs a spectrum parameter distance value 15, such as spectral distortion, between the partial reference speech spectrum parameter 12 and the partial input speech spectrum parameter 21, using a frame-to-frame correspondence obtained, for example, by DP matching as shown in FIG. 6. The optimum section determination means 16 receives the spectrum parameter distance value 15, detects the optimum reference speech section by a distance value decision method, such as detecting the section whose distance value is the smallest and does not exceed a certain threshold, and outputs it as the optimum reference speech partial section 17. The conversion means 18 converts the partial input speech spectrum parameter 21 into the partial reference speech spectrum parameter 12 of the optimum reference speech partial section 17, using the frame correspondence obtained by the DP matching shown in FIG. 6, and outputs the result as the converted speech spectrum parameter 19.
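
The search performed by the partial section determination means 8 and the decision made by the optimum section determination means 16 can be sketched as follows. This is an illustrative sketch under stated assumptions: label sequences are plain lists of phoneme labels with boundary times omitted, the function names are hypothetical, and returning `None` when no candidate is within the threshold is an assumption, since the specification does not say what happens in that case.

```python
def find_matching_sections(ref_labels, inp_labels):
    """Return every start index at which inp_labels occurs contiguously in ref_labels."""
    k = len(inp_labels)
    return [s for s in range(len(ref_labels) - k + 1)
            if ref_labels[s:s + k] == inp_labels]


def select_optimum_section(candidates, threshold):
    """Pick the candidate section with the smallest distance value.

    candidates is a list of (section, distance) pairs. The section is
    returned only if its distance does not exceed the threshold;
    otherwise None is returned (assumed behavior).
    """
    if not candidates:
        return None
    section, distance = min(candidates, key=lambda c: c[1])
    return section if distance <= threshold else None
```

For example, each start index returned by `find_matching_sections` could be turned into a candidate reference speech partial section 9, scored with a distance value, and then passed to `select_optimum_section`.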

[0020] According to this embodiment, the problems that the spectrum parameters become discontinuous between frames and that the phonemic quality becomes blurred can be solved without using segment-level standard patterns, and high-quality voice quality adaptation can be realized.

[0021] Embodiment 3. A third embodiment of the present invention will be described with reference to FIG. 3. FIG. 3 is a block diagram showing the configuration of the voice quality adaptation device. Reference numerals 2, 22, 23, 24, 25, and 26 are the same as in the conventional example, and 1, 14, 15, 16, 17, 18, and 19 are the same as those described in Embodiment 1. Reference numeral 27 is the code number sequence clipping means, which extracts the code number sequence of a partial section of a certain length from an input code number sequence; 28 is the reference speech partial vector quantization code number sequence extracted from the reference speech vector quantization code number sequence 25 by the code number sequence clipping means 27; and 29 is the input speech partial vector quantization code number sequence extracted from the adapted input speech vector quantization code number sequence 26 by the code number sequence clipping means 27.

[0022] Next, the operation will be described. The vector quantization means 24 receives the reference speech spectrum parameter 1, performs vector quantization with the reference speech vector quantization codebook 22, and outputs the reference speech vector quantization code number sequence 25. It also receives the input speech spectrum parameter 2, performs vector quantization with the adapted input speech vector quantization codebook 23, which, as in the conventional example, has been associated with the reference speech vector quantization codebook 22, and outputs the adapted input speech vector quantization code number sequence 26. The code number sequence clipping means 27 receives the reference speech vector quantization code number sequence 25 and the adapted input speech vector quantization code number sequence 26, inspects the adapted input speech vector quantization code number sequence 26 to determine the input speech partial vector quantization code number sequence 29, and outputs it. It also searches the reference speech vector quantization code number sequence 25 for a section that matches or is similar to the input speech partial vector quantization code number sequence 29, clips the reference speech vector quantization code number sequence 25 in that section, and outputs it as the reference speech partial vector quantization code number sequence 28. The distance value calculation means 14 receives the reference speech partial vector quantization code number sequence 28 and the input speech partial vector quantization code number sequence 29, decodes each code number sequence by referring to the reference speech vector quantization codebook 22 and the adapted input speech vector quantization codebook 23, and calculates and outputs a spectrum parameter distance value 15, such as spectral distortion, between the decoded partial reference speech spectrum parameter and the decoded partial input speech spectrum parameter, using a frame-to-frame correspondence obtained, for example, by DP matching as shown in FIG. 6. The optimum section determination means 16 receives the spectrum parameter distance value 15, detects the optimum reference speech section by a distance value decision method, such as detecting the section whose distance value is the smallest and does not exceed a certain threshold, and outputs it as the optimum reference speech partial section 17. The conversion means 18 converts the input speech partial vector quantization code number sequence 29 into the reference speech spectrum parameter 1 of the optimum reference speech partial section 17, using the frame correspondence obtained by the DP matching shown in FIG. 6, and outputs the result as the converted speech spectrum parameter 19.
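
The vector quantization, the decoding of code number sequences, and the search for a matching or similar section can be sketched as follows. This is a hedged illustration only: the specification does not define how similarity between code number sequences is judged, so the mismatch-count criterion, the squared Euclidean distance, and the names `quantize`, `decode`, and `find_similar_sections` are assumptions introduced here.

```python
def quantize(frames, codebook):
    """Map each frame to the index (code number) of its nearest code vector."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [min(range(len(codebook)), key=lambda k: sqdist(f, codebook[k]))
            for f in frames]


def decode(codes, codebook):
    """Recover a spectrum parameter sequence from a code number sequence."""
    return [codebook[c] for c in codes]


def find_similar_sections(ref_codes, inp_codes, max_mismatches=0):
    """Return start indices in ref_codes whose section matches inp_codes.

    A section is accepted if at most max_mismatches code numbers differ
    (assumed similarity criterion); max_mismatches=0 means exact match.
    """
    k = len(inp_codes)
    hits = []
    for s in range(len(ref_codes) - k + 1):
        mismatches = sum(1 for r, c in zip(ref_codes[s:s + k], inp_codes) if r != c)
        if mismatches <= max_mismatches:
            hits.append(s)
    return hits
```

Because the two codebooks are associated with each other, code numbers from the reference side and the adapted input side can be compared directly in this sketch.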

[0023] With this configuration, this embodiment can solve the problems that the spectrum parameters become discontinuous between frames and that the phonemic quality becomes blurred, without using segment-level standard patterns, and can realize high-quality voice quality adaptation.

[0024] Embodiment 4. A fourth embodiment of the present invention will be described with reference to FIG. 4. FIG. 4 is a block diagram showing the configuration of the voice quality adaptation device. Reference numerals 2, 3, and 4 are the same as in the conventional example, and 1, 13, 17, 18, and 19 are the same as those described in Embodiment 1. Reference numeral 30 is the input speech clipping means, which determines, for the input speech data, a partial section roughly the length of a syllable or a speech unit and outputs the adapted input speech spectrum parameter in that section; 31 is the reference section determination means, which determines the partial section of the reference speech judged to be optimum on the basis of a distance value decision against the partially adapted input speech spectrum parameter 13.

[0025] The operation will now be described. The spectrum adaptation means 3 receives the input speech spectrum parameter 2, converts the spectrum of the input speech toward that of the reference speech frame by frame, for example by the codebook mapping method using learning data described for the conventional example, and outputs the adapted input speech spectrum parameter 4. The input speech clipping means 30 receives the adapted input speech spectrum parameter 4, selects boundaries at, for example, positions where the spectrum changes, and outputs the result as the partially adapted input speech spectrum parameter 13. The reference section determination means 31 clips the reference speech spectrum parameter 1, likewise selecting boundaries at, for example, positions where the spectrum changes. Between every clipped portion of the reference speech spectrum parameter 1 and the partially adapted input speech spectrum parameter 13, it calculates a spectrum parameter distance value, such as spectral distortion, using a frame-to-frame correspondence obtained, for example, by DP matching as shown in FIG. 6. It then detects the optimum reference speech section by a distance value decision method, such as detecting the section whose distance value is the smallest and does not exceed a certain threshold, and outputs it as the optimum reference speech partial section 17. The conversion means 18 converts the partially adapted input speech spectrum parameter 13 into the reference speech spectrum parameter 1 of the determined reference speech section 17, using the frame correspondence obtained by the DP matching shown in FIG. 6, and outputs the result as the converted speech spectrum parameter 19.
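
Clipping at positions where the spectrum changes can be sketched as follows. This is only one possible reading: the specification gives spectral change positions merely as an example of a boundary criterion, so the frame-to-frame squared distance, the parameter `change_threshold`, and the function name are assumptions for illustration.

```python
def segment_by_spectral_change(frames, change_threshold):
    """Split a frame sequence into sections at large frame-to-frame changes.

    Returns a list of (start, end) index pairs with end exclusive. A new
    section starts at frame t when the squared distance between frame t
    and frame t-1 exceeds change_threshold (assumed criterion).
    """
    if not frames:
        return []
    boundaries = [0]
    for t in range(1, len(frames)):
        change = sum((x - y) ** 2 for x, y in zip(frames[t], frames[t - 1]))
        if change > change_threshold:
            boundaries.append(t)
    boundaries.append(len(frames))
    return [(boundaries[i], boundaries[i + 1]) for i in range(len(boundaries) - 1)]
```

The same routine could be applied to both the adapted input speech and the reference speech, after which every reference section would be scored against the clipped input section.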

[0026] With this configuration, this embodiment can solve the problems that the spectrum parameters become discontinuous between frames and that the phonemic quality becomes blurred, without using segment-level standard patterns, and can realize high-quality voice quality adaptation.

[0027] Embodiment 5. In the distance value calculation means of Embodiment 3, instead of decoding the vector quantization code number sequences of the input speech and the reference speech with the vector quantization codebooks and then calculating the spectrum parameter distance value, a table may be provided in advance that lists the spectrum parameter distance values between all code vectors of the reference speech codebook and the adapted input speech codebook. The table is then looked up according to the frame correspondence between the reference speech and the input speech, and the distance value is calculated as the sum of the inter-frame distance values thus selected.
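
The table lookup described in this embodiment can be sketched as follows. This is an illustrative sketch: the squared Euclidean distance and the names `build_distance_table` and `path_distance` are assumptions, and the frame correspondence `path` is taken to be a list of (reference index, input index) pairs supplied by whatever alignment is in use.

```python
def build_distance_table(ref_codebook, inp_codebook):
    """Precompute distances between every pair of code vectors.

    table[r][c] is the distance between code vector r of the reference
    codebook and code vector c of the adapted input codebook.
    """
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [[sqdist(r, c) for c in inp_codebook] for r in ref_codebook]


def path_distance(table, ref_codes, inp_codes, path):
    """Sum table entries along a frame correspondence.

    path is a list of (ref_frame_index, inp_frame_index) pairs; no code
    vector is decoded, only code numbers are used to index the table.
    """
    return sum(table[ref_codes[i]][inp_codes[j]] for i, j in path)
```

The table is built once per codebook pair, so each section comparison reduces to index lookups and additions.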

[0028] Embodiment 6. In the partial section determination means of Embodiments 1 and 2 and the code number sequence clipping means of Embodiment 3, the search for a reference speech section that matches an input speech section may be performed by the longest match method, so that matching sections are selected and detected starting from the longest.
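
One way to read the longest match method is as a greedy search that, from each position in the input sequence, takes the longest run that also occurs in the reference sequence. The sketch below follows that reading; it is an assumption rather than the procedure fixed by the specification, and the function name and the handling of labels absent from the reference (skipped one at a time) are hypothetical.

```python
def longest_match_sections(ref, inp):
    """Greedy longest-match segmentation of inp against ref.

    Returns a list of (inp_start, length, ref_starts) tuples, where
    ref_starts lists every start index in ref at which the matched run
    occurs. A label absent from ref yields (pos, 1, []).
    """
    result = []
    pos = 0
    while pos < len(inp):
        found = None
        # Try the longest remaining run first, then progressively shorter ones.
        for length in range(len(inp) - pos, 0, -1):
            target = inp[pos:pos + length]
            starts = [s for s in range(len(ref) - length + 1)
                      if ref[s:s + length] == target]
            if starts:
                found = (pos, length, starts)
                break
        if found is None:
            result.append((pos, 1, []))
            pos += 1
        else:
            result.append(found)
            pos += found[1]
    return result
```

The sequences may be phoneme label sequences (Embodiments 1 and 2) or code number sequences (Embodiment 3); the routine only relies on element equality.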

[0029] Embodiment 7. In the distance value calculation means of Embodiments 1, 2, 3, and 5, the reference section determination means of Embodiment 4, and the conversion means of Embodiments 1, 2, 3, and 4, the frame correspondence between the input speech and the reference speech may be obtained by linear time scaling instead of the nonlinear time scaling of DP matching.
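
Linear time scaling can be sketched as a direct proportional mapping of frame indices. The following is a minimal sketch, assuming that each input frame is mapped to the nearest proportionally placed reference frame; the function name and the rounding rule (halves rounded up, done in integer arithmetic) are assumptions.

```python
def linear_alignment(n_ref, n_inp):
    """Map each input frame to a reference frame by linear scaling.

    Returns a list of (ref_index, inp_index) pairs, one per input frame.
    Both lengths must be at least 1. The first and last frames of the two
    sections are aligned with each other when n_inp > 1.
    """
    if n_inp == 1:
        return [(0, 0)]
    pairs = []
    for j in range(n_inp):
        # Nearest integer to j * (n_ref - 1) / (n_inp - 1), halves rounded up.
        i = (2 * j * (n_ref - 1) + (n_inp - 1)) // (2 * (n_inp - 1))
        pairs.append((i, j))
    return pairs
```

Unlike DP matching, this correspondence depends only on the two section lengths, so it needs no frame distance computations to find the alignment itself.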

[0030]

[Effects of the Invention] As described above, according to the present invention, a partial section is determined by comparing phoneme sequences of the adapted input speech, a decision is made on the distance value between the adapted input speech and the reference speech in that section, and conversion is performed to the spectrum parameters of the reference speech in the optimum section determined from the result. This reduces effects such as the discontinuity in the time direction and the phonemic blurring of the conventional codebook mapping method, and a voice quality adaptation device that enables high-quality editing synthesis can be realized.

[0031] Further, according to the present invention, a partial section is determined by comparing phoneme sequences of the input speech, a decision is made on the distance value between the input speech and the reference speech in that section, and conversion is performed to the spectrum parameters of the reference speech in the optimum section determined on that basis. This reduces effects such as the discontinuity in the time direction and the phonemic blurring of the conventional codebook mapping method, and a voice quality adaptation device that enables high-quality editing synthesis can be realized.

[0032] Furthermore, according to the present invention, a partial section is determined by, for example, comparing vector quantization code number sequences of the adapted input speech, a decision is made on the distance value between the adapted input speech and the reference speech in that section, and conversion is performed to the spectrum parameters of the reference speech in the optimum section determined on that basis. This reduces effects such as the discontinuity in the time direction and the phonemic blurring of the conventional codebook mapping method, and a voice quality adaptation device that enables high-quality editing synthesis can be realized.

[0033] In addition, according to the present invention, spectral distance values are calculated between the adapted input speech and every clipped portion of the reference speech, a distance value decision is made to determine the reference speech of the optimum partial section, and conversion is performed to the spectrum parameters of the reference speech in that section. This reduces effects such as the discontinuity in the time direction and the phonemic blurring of the conventional codebook mapping method, and a voice quality adaptation device that enables high-quality editing synthesis can be realized.

[Brief description of drawings]

FIG. 1 is a block diagram showing a first embodiment of the present invention.

FIG. 2 is a block diagram showing a second embodiment of the present invention.

FIG. 3 is a block diagram showing a third embodiment of the present invention.

FIG. 4 is a block diagram showing a fourth embodiment of the present invention.

FIG. 5 is a block diagram showing an example of a conventional spectrum adaptation method.

FIG. 6 is an explanatory diagram of the operation of DP matching.

[Explanation of symbols]

1 reference speech spectrum parameter
2 input speech spectrum parameter
3 spectrum adaptation means
4 adapted input speech spectrum parameter
5 labeling means
6 reference speech phoneme label sequence
7 adapted input speech phoneme label sequence
8 partial section determination means
9 reference speech partial section
10 input speech partial section
11 clipping means
12 partial reference speech spectrum parameter
13 partially adapted input speech spectrum parameter
14 distance value calculation means
15 spectrum parameter distance value
16 optimum section determination means
17 optimum reference speech partial section
18 conversion means
19 converted speech spectrum parameter
20 input speech phoneme label sequence
21 partial input speech spectrum parameter
22 reference speech vector quantization codebook
23 adapted input speech vector quantization codebook
24 vector quantization means
25 reference speech vector quantization code number sequence
26 adapted input speech vector quantization code number sequence
27 code number sequence clipping means
28 reference speech partial vector quantization code number sequence
29 input speech partial vector quantization code number sequence
30 input speech clipping means
31 reference section determination means

Claims


1. A voice quality adaptation device that adapts between a reference speech and an input speech differing in speaker or in time of utterance so as to bring the input speech closer to the voice quality of the reference speech, comprising: spectrum adaptation means for adapting, using learning data, the spectral features of each frame of the input speech; first and second labeling means for outputting, by means of phoneme models, phoneme labels including phoneme boundary information for the adapted input speech output from the spectrum adaptation means and for the reference speech, respectively; partial section determination means for determining an input speech partial section from the phoneme label sequence of the adapted input speech given by the first labeling means, and for detecting, in the phoneme label sequence of the reference speech given by the second labeling means, a section that matches the phoneme label sequence of the input speech partial section; first and second clipping means for clipping the adapted input speech and the reference speech, respectively, on the basis of the partial section information output from the partial section determination means; distance value calculation means for obtaining a distance value between the adapted input speech and the reference speech clipped by the first and second clipping means; optimum section determination means for selecting an optimum section of the reference speech on the basis of a decision on the distance value obtained by the distance value calculation means; and conversion means for converting the adapted input speech clipped by the first clipping means into the reference speech in the optimum section determined by the optimum section determination means.

2. The voice quality adaptation device according to claim 1, wherein the input speech is used in place of the adapted input speech as the input to the first labeling means and the clipping means.

3. A voice quality adaptation device that adapts between a reference speech and an input speech differing in speaker or in time of utterance so as to bring the input speech closer to the voice quality of the reference speech, comprising: first and second vector quantization means for vector-quantizing the reference speech and the input speech frame by frame with a reference speech vector quantization codebook and an adapted input speech vector quantization codebook; code number sequence clipping means for outputting an input speech partial vector quantization code number sequence from the input speech vector quantization code number sequence output from the first vector quantization means, and for detecting, in the reference speech vector quantization code number sequence output from the second vector quantization means, a section that matches or is similar to the input speech partial vector quantization code number sequence; distance value calculation means for obtaining a distance value between the input speech and the reference speech for the partial code number sequences output from the code number sequence clipping means; optimum section determination means for selecting an optimum section of the reference speech on the basis of a decision on the distance value obtained by the distance value calculation means; and conversion means for converting the input speech partial vector quantization code number sequence output from the code number sequence clipping means into the reference speech in the optimum section determined by the optimum section determination means.

4. A voice quality adaptation device that adapts between a reference speech and an input speech differing in speaker or in time of utterance so as to bring the input speech closer to the voice quality of the reference speech, comprising: spectrum adaptation means for adapting the spectral features of each frame of the speech; input speech clipping means for clipping the adapted input speech output from the spectrum adaptation means; reference section determination means for obtaining distance values between the adapted input speech clipped by the input speech clipping means and all of the reference speech, and for determining an optimum section of the reference speech by a decision on the distance values; and conversion means for converting the adapted input speech clipped by the input speech clipping means into the reference speech in the optimum section determined by the reference section determination means.