JP2005189483A

JP2005189483A - Sound quality model generation method, sound quality conversion method, computer program for them, recording medium with program recorded thereon, and computer programmed with program

Info

Publication number: JP2005189483A
Application number: JP2003430209A
Authority: JP
Inventors: Parham Mokhtari; Carlos Toshinori Ishii; Hartmut Pfitzinger; Nick Campbell
Original assignee: Japan Science and Technology Agency; ATR Advanced Telecommunications Research Institute International
Current assignee: Japan Science and Technology Agency; ATR Advanced Telecommunications Research Institute International
Priority date: 2003-12-25
Filing date: 2003-12-25
Publication date: 2005-07-14
Anticipated expiration: 2023-12-25
Also published as: JP4177751B2

Abstract

PROBLEM TO BE SOLVED: To control voice quality with a small number of parameters.

SOLUTION: A voice quality conversion method converts the voice quality of an input speech waveform 50 using a glottal waveform model 36. The model consists of pairs of a prototype vocal cord wave associated with a predetermined voice quality and a predetermined number of leading principal component representations obtained by principal component analysis of the prototype vocal cord waves. The method includes a unit waveform extraction step 60 of extracting unit waveforms of the vocal cord wave from the portions of the input speech waveform that satisfy predetermined conditions, and speech waveform generation steps 62 and 64 of generating an output speech waveform by converting the input speech waveform 50 to the voice quality specified by a user, based on the glottal waveform model corresponding to the voice quality specified as that of the input speech waveform and the glottal waveform model corresponding to the voice quality specified by the user.

COPYRIGHT: (C)2005,JPO&NCIPI

Description

The present invention relates to a method of controlling the voice quality of speech, and more particularly to a method and apparatus that represent voice quality by parameters and change the voice quality using the values of those parameters.

Over the last two or three decades, computer-based speech processing technology has advanced markedly. Speech recognition now achieves considerably high accuracy, and speech synthesis can produce speech that is reasonably easy to understand.

However, the speech processing that humans perform routinely still differs in many respects from speech processing by computer. A typical example is the handling of paralinguistic information.

Paralinguistic information refers to those elements of spoken language that cannot be expressed in writing. Gestures, facial expressions, and tone of voice during an utterance, for example, all constitute paralinguistic information. A human listener can sense a speaker's feelings from subtle changes in tone of voice. Speech recognition, by contrast, yields only the elements that can be expressed in writing and cannot capture paralinguistic information. Likewise, a human speaker can convey a variety of feelings through tone of voice even when the content of the utterance is the same, but it is difficult to synthesize such speech with speech synthesis.

Voice quality is a representative kind of paralinguistic information. Voice quality can be defined in various domains (for example, articulatory, acoustic, and perceptual) and at various levels (for example, the functional aspects of speech). In a broad sense, voice quality refers to an attribute of human speech that is produced by a speaker and perceived by a listener over a span of multiple speech units (for example, phonemes).

With current technology, neither speech recognition nor speech synthesis can readily perform processing that accounts for the changes in voice quality that occur when a human speaks. This area is considered well suited to joint research between phonology and speech processing technology.

Patent Document 1: JP 2003-330478 A
Non-Patent Document 1: Laver, J. (1980), "The Phonetic Description of Voice Quality", Cambridge: Cambridge University Press.

Speech processing technology will undoubtedly be used in more and more situations as an interface between humans and computers. If paralinguistic information can also be used in that communication, something close to communication between humans can be achieved.

So far, however, it has been difficult to determine the voice quality of human speech by speech recognition, and far more difficult to infer the speaker's emotion from the voice quality. In speech synthesis, likewise, it is not yet known how voice quality should be controlled in order to convey a given piece of paralinguistic information.

In addition, the number of parameters used to control voice quality should be as small as possible. Ideally, those parameters should also be meaningful from both a physiological and a perceptual point of view, so that they deepen the understanding of the phenomenon of voice quality in both of these domains.

Conventionally, however, it has been unclear what parameters should be used to determine or control voice quality, and consequently it has not been known how the parameters should be changed to give synthesized speech a desired voice quality.

Therefore, an object of the present invention is to provide a voice quality model generation method for generating a voice quality model that represents voice quality.

A further object of the present invention is to provide a voice quality model generation method for generating a voice quality model that represents voice quality with a small number of parameters.

Another object of the present invention is to provide a voice quality conversion apparatus and method capable of converting voice quality to a desired one.

Another object of the present invention is to provide a voice quality conversion apparatus and method capable of converting voice quality to a desired one with a small number of parameters.

A voice quality model generation method according to a first aspect of the present invention includes: a vocal cord waveform estimation step of estimating, from the portions that satisfy a predetermined condition of a plurality of reference speech waveforms each prepared in advance for a predetermined voice quality, the unit waveform of the vocal cord wave at the time each portion was uttered; a parameterization step of parameterizing each unit waveform of the vocal cord wave according to a predetermined parameterization method; a principal component analysis step of obtaining a principal component representation of each unit waveform by performing principal component analysis on the parameterized unit waveforms; and a step of outputting each unit waveform of the vocal cord wave, together with the principal component representation corresponding to it, as a model of the voice quality corresponding to the speech waveform from which that vocal cord wave was obtained.

Preferably, the vocal cord waveform estimation step includes: a step of extracting syllable nuclei from the plurality of speech waveforms, each prepared in advance for a predetermined voice quality; a step of applying to each extracted syllable nucleus an inverse filter that removes the influence of the vocal tract so as to detect the volume velocity waveform of the glottal airflow at the time the speech was produced; and a unit waveform extraction step of extracting a unit waveform of the vocal cord wave from each syllable nucleus after the inverse filter has been applied.

More preferably, the unit waveform extraction step includes a step of taking as a starting point a local minimum of the volume velocity waveform located in the central part of the syllable nucleus, and extracting as the unit waveform the span from that point back by one period, the period being determined by the fundamental frequency of a predetermined region containing the syllable nucleus.

Still more preferably, the method further includes, prior to the unit waveform extraction step, a step of normalizing the volume velocity waveform of the glottal airflow according to a predetermined normalization method.

Preferably, the principal component analysis step includes a step of obtaining, for each unit waveform of the vocal cord wave, a principal component representation in terms of the leading principal components up to a predetermined number, by performing principal component analysis on the parameterized unit waveforms.

More preferably, the principal components up to the predetermined number are the first through fourth principal components.

Still more preferably, the parameterization step includes a resampling step of resampling the unit waveform of the vocal cord wave at a predetermined number of sampling points that divide the unit waveform into a plurality of equal-length portions.

More preferably, the method further includes a differentiation step of obtaining a differential data sequence for each unit waveform of the vocal cord wave by taking differences of the unit waveform resampled in the resampling step, and the principal component analysis step includes a step of obtaining a principal component representation of the differential of each unit waveform by performing principal component analysis on the differential data sequences.

Still more preferably, each differential data sequence obtained in the differentiation step includes pairs of a difference in resampling time and the corresponding difference in the unit waveform of the vocal cord wave, and the voice quality model generation method further includes, prior to the principal component analysis step, a step of applying to each differential data sequence a predetermined standardization process that equalizes the influence of variation along the time axis and the influence of variation along the amplitude axis.

A voice quality conversion method according to a second aspect of the present invention converts the voice quality of an input speech waveform using glottal waveform models, each consisting of pairs of the unit waveforms of a plurality of prototype vocal cord waves associated with a predetermined voice quality and a predetermined number of leading principal component representations obtained by a predetermined principal component analysis of each of those unit waveforms. The method includes: a unit waveform extraction step of extracting unit waveforms of the vocal cord wave from the portions of the input speech waveform that satisfy a predetermined condition; and a speech waveform generation step of generating an output speech waveform by converting the unit waveforms of the vocal cord wave extracted from the input speech waveform to the voice quality specified by a user, based on the glottal waveform model corresponding to the voice quality specified in advance as that of the input speech waveform and the glottal waveform model corresponding to the voice quality specified by the user.

Preferably, the speech waveform generation step includes: a step of selecting a first prototype vocal cord wave (the first waveform) from the glottal waveform model corresponding to the voice quality of the input speech waveform; a step of selecting a second prototype vocal cord wave (the second waveform) from the glottal waveform model corresponding to the voice quality specified by the user; a conversion function calculation step of calculating, by a predetermined operation between the first waveform and the second waveform, a conversion function for converting the input speech waveform into a speech waveform of the voice quality specified by the user; and a step of generating the output speech waveform by applying the conversion function to the unit waveforms of the vocal cord wave of the input speech waveform.

More preferably, the conversion function calculation step includes a step of calculating the conversion function by subtracting the first waveform from the second waveform.

Still more preferably, the speech waveform generation step includes a step of generating the output speech waveform by adding the conversion function to the unit waveforms of the vocal cord wave of the input speech waveform.
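The subtraction and addition described in the two preceding paragraphs can be sketched as follows. This is a minimal illustration, assuming that the prototype waves and the input unit waveform have already been parameterized as sequences of equal length; the function names and the toy values are hypothetical and do not come from the specification.

```python
def conversion_function(first_waveform, second_waveform):
    """Conversion function: second (target) prototype minus first (source) prototype."""
    if len(first_waveform) != len(second_waveform):
        raise ValueError("prototype waveforms must have the same length")
    return [t - s for s, t in zip(first_waveform, second_waveform)]


def apply_conversion(unit_waveform, conversion):
    """Add the conversion function to one unit waveform of the input speech."""
    if len(unit_waveform) != len(conversion):
        raise ValueError("unit waveform and conversion function must have the same length")
    return [x + c for x, c in zip(unit_waveform, conversion)]


# Toy values only; real prototypes would come from the glottal waveform model.
source_prototype = [0.0, 0.5, 1.0, 0.5, 0.0]
target_prototype = [0.0, 0.75, 1.0, 0.75, 0.25]
delta = conversion_function(source_prototype, target_prototype)
converted = apply_conversion([0.0, 0.25, 1.0, 0.5, 0.0], delta)
```

Applying the conversion function to the source prototype itself reproduces the target prototype, which is a quick sanity check on the sign convention.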

Preferably, the unit waveform extraction step includes: a step of extracting syllable nuclei from the input speech waveform; a step of applying to each extracted syllable nucleus an inverse filter that removes the influence of the vocal tract so as to detect the volume velocity waveform of the glottal airflow at the time the speech was produced; and a step of extracting a unit waveform of the vocal cord wave from each syllable nucleus after the inverse filter has been applied.

More preferably, the step of extracting the unit waveform includes a step of taking as a starting point a local minimum of the volume velocity waveform located in the central part of the syllable nucleus, and extracting as the unit waveform the span from that point back by one period, the period being determined by the fundamental frequency of a predetermined region containing the syllable nucleus.

Still more preferably, the method further includes, prior to the step of extracting the waveform, a step of normalizing the volume velocity waveform of the glottal airflow according to a predetermined normalization method.

Preferably, the predetermined number of leading principal component representations are those given by the first through fourth principal components.

A computer program according to a third aspect of the present invention is configured so that, when executed by a computer, it causes the computer to carry out all the steps of any of the methods described above.

A computer according to a fourth aspect of the present invention is programmed with the computer program described above.

A computer-readable recording medium according to a fifth aspect of the present invention has the computer program described above recorded on it.

-Configuration-
FIG. 1 is a block diagram of a voice quality conversion system 30 according to an embodiment of the present invention. Referring to FIG. 1, voice quality conversion system 30 includes a model creation unit 34 and a voice quality conversion device 52. Model creation unit 34 uses principal component analysis (PCA) to create a PCA parameter model 36, a glottal waveform model representing the parameters that control voice quality, from reference speech waveforms 32, each selected as speech having a specific voice quality in order to define reference values of those parameters. Voice quality conversion device 52 receives an input speech waveform 50 and voice quality specifying information 51 that identifies the voice quality of input speech waveform 50, applies to input speech waveform 50 the same analysis as model creation unit 34 to extract the vocal cord waveform, and uses PCA parameter model 36 to regenerate a speech waveform 54 in the target voice quality, based on voice quality specifying information 51 and the target voice quality set by the user.

In the present embodiment, reference speech waveforms 32 are 13 kinds of human speech waveforms, each selected in advance as speech with a characteristic voice quality. Each waveform is labeled in advance with its voice quality. The present embodiment uses the speech data accompanying Non-Patent Document 1 as these waveforms. The recordings and their voice qualities are described later with reference to FIG. 6. The speech data used in the present embodiment is assumed to be prepared in advance as digital data sampled frame by frame at a predetermined sampling rate.

FIG. 2 is a block diagram showing the detailed configuration of model creation unit 34. Referring to FIG. 2, model creation unit 34 includes a syllable nucleus extraction unit 80 for extracting, from a speech waveform, the regions that the speaker's vocal mechanism utters stably (hereinafter "syllable nuclei"). More specifically, syllable nucleus extraction unit 80 computes the distribution of acoustic energy over time, applies a convex hull algorithm to the contour of that distribution to detect the valleys in the acoustic energy contour, and divides the input speech into pseudo-syllables at those valleys. It then takes the point of maximum acoustic energy within each pseudo-syllable as the starting point of the syllable nucleus. Finally, wherever frames to the left or right of the nucleus lie in the same pseudo-syllable, are judged to be voiced, and have acoustic energy above a predetermined threshold (0.8 × the maximum acoustic energy), it adds them one frame at a time, and extracts the resulting continuous region as the syllable nucleus.
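The frame-by-frame growth of a syllable nucleus described above can be sketched as follows. This is an illustrative sketch only: it assumes the pseudo-syllable has already been segmented (the convex hull step is not shown) and that per-frame energy values and voicing decisions are available; the function name and argument layout are hypothetical.

```python
def extract_syllable_nucleus(energy, voiced, ratio=0.8):
    """Return (start, end), the inclusive frame range of the nucleus in one pseudo-syllable.

    energy -- per-frame acoustic energy of a single pseudo-syllable
    voiced -- per-frame voicing decisions (True when the frame is voiced)
    ratio  -- threshold as a fraction of the maximum energy (0.8 in the text)
    """
    if not energy or len(energy) != len(voiced):
        raise ValueError("energy and voiced must be non-empty and equally long")
    # The frame of maximum energy is the starting point of the nucleus.
    peak = max(range(len(energy)), key=energy.__getitem__)
    threshold = ratio * energy[peak]
    start = end = peak
    # Grow to the left, one frame at a time, while the conditions hold.
    while start > 0 and voiced[start - 1] and energy[start - 1] > threshold:
        start -= 1
    # Grow to the right in the same way.
    while end < len(energy) - 1 and voiced[end + 1] and energy[end + 1] > threshold:
        end += 1
    return start, end
```

Because growth stops at the first frame that fails a condition, a later frame that exceeds the threshold again is not included, which keeps the nucleus contiguous.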

Model creation unit 34 further includes a formant estimation unit 81 for estimating, for each syllable nucleus extracted by syllable nucleus extraction unit 80, the frequencies and bandwidths of the first four formants by linear prediction using the linear prediction (LP) cepstrum. Formant estimation unit 81 uses a predetermined linear cepstrum-to-formant mapping that has been trained in advance on vowel formants. Syllable nucleus extraction unit 80 and formant estimation unit 81 are the same as those disclosed in JP 2003-330478 A cited above.

Model creation unit 34 further includes an inverse filter processing unit 82 and a volume velocity waveform detection unit 84. For each syllable nucleus obtained by syllable nucleus extraction unit 80 and formant estimation unit 81, inverse filter processing unit 82 generates an inverse filter that removes the influence of the vocal tract on the speech and applies it to the speech waveform. Volume velocity waveform detection unit 84 detects, from the output of inverse filter processing unit 82, the volume velocity waveform at the glottis of the speaker's vocal cords at the time the syllable nucleus was uttered.

Model creation unit 34 further includes: a normalization unit 86 for normalizing the glottal volume velocity waveform detected by volume velocity waveform detection unit 84; a waveform extraction unit 87 for extracting, from the glottal volume velocity waveform normalized by normalization unit 86, the waveform data of one cycle (a vocal cord wave) near the center of each syllable nucleus; and a PCA analysis unit 88 for performing the PCA analysis described later on the one-cycle vocal cord wave data extracted by waveform extraction unit 87 and computing the principal components up to the fourth.

The principal component values output by PCA analysis unit 88 are associated with the corresponding vocal cord waveform (called a prototype vocal cord wave) to form PCA parameter model 36. The speech waveform data must be parameterized before PCA analysis unit 88 can analyze it; the details are described later. As also described later, the PCA parameter model obtained in this way is considered to represent well the voice quality of each speech waveform in reference speech waveforms 32.

FIG. 3 is a more detailed block diagram of inverse filter processing unit 82 shown in FIG. 2. Referring to FIG. 3, inverse filter processing unit 82 includes: an inverse filter generation unit 120 that, for each syllable nucleus, refines the formants estimated by the cepstrum-to-formant mapping through optimization by analysis and synthesis, and generates a time-varying inverse filter for removing the influence of the vocal tract; a high-pass filter 122 for attenuating the indistinct low-frequency components of the speech waveform of the input syllable nucleus; a low-pass filter 124 for attenuating, in the output of high-pass filter 122, the spectral components above the fourth formant; and an inverse filter application unit 126 for applying the inverse filter generated by inverse filter generation unit 120 to the speech signal output by low-pass filter 124, thereby removing the influence of the first four resonances of the vocal tract.
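The specification does not state the form of the inverse filter in this section. As a rough illustration of how resonances given by formant frequency and bandwidth can be cancelled, the sketch below cascades one second-order anti-resonator per formant. The filter form, the unity-gain normalization at 0 Hz, the fixed (not time-varying) coefficients, and all names are assumptions made for this example; the analysis-and-synthesis refinement and filters 122 and 124 are not shown.

```python
import math


def antiresonator(freq_hz, bandwidth_hz, fs):
    """Coefficients of a 3-tap FIR section with zeros at one formant (assumed form)."""
    r = math.exp(-math.pi * bandwidth_hz / fs)   # zero radius from the bandwidth
    theta = 2.0 * math.pi * freq_hz / fs         # zero angle from the frequency
    taps = [1.0, -2.0 * r * math.cos(theta), r * r]
    gain = sum(taps)                             # response at 0 Hz
    return [c / gain for c in taps]              # normalize to unity gain at 0 Hz


def fir_filter(signal, taps):
    """Direct-form FIR filtering with zero initial state."""
    out = []
    for n in range(len(signal)):
        acc = 0.0
        for k, c in enumerate(taps):
            if n - k >= 0:
                acc += c * signal[n - k]
        out.append(acc)
    return out


def inverse_filter(signal, formants, fs):
    """Apply one anti-resonator per (frequency, bandwidth) pair, in cascade."""
    for freq_hz, bandwidth_hz in formants:
        signal = fir_filter(signal, antiresonator(freq_hz, bandwidth_hz, fs))
    return signal
```

In this sketch a sinusoid at a formant frequency is strongly attenuated while a constant input passes unchanged once the short start-up transient has ended.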

Volume velocity waveform detection unit 84 shown in FIG. 2 integrates the speech signal output by inverse filter application unit 126, from which the influence of the vocal tract has been removed. This removes the effect of radiation at the lips and yields an estimated waveform of the volume velocity of the glottal airflow.

Normalization unit 86 shown in FIG. 2 normalizes the estimated glottal volume velocity waveform output by volume velocity waveform detection unit 84. Normalization is necessary because the amplitude of this waveform is not known in advance. Normalization unit 86 in the present embodiment computes the mean amplitude of the volume velocity wave over the whole syllable nucleus and subtracts it from the original values.
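The integration and mean subtraction described in the two preceding paragraphs can be sketched as below. The running sum is a simple discrete stand-in for integration and is an assumption of this example, as are the function names; the specification does not say how the integration is implemented.

```python
from itertools import accumulate


def glottal_volume_velocity(inverse_filtered):
    """Integrate the inverse-filtered signal with a running sum."""
    return list(accumulate(inverse_filtered))


def remove_mean(waveform):
    """Subtract the mean taken over the whole syllable nucleus."""
    mean = sum(waveform) / len(waveform)
    return [v - mean for v in waveform]


flow = glottal_volume_velocity([1.0, 2.0, -1.0, -2.0])
normalized = remove_mean(flow)
```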

Waveform extraction unit 87 shown in FIG. 2 extracts a vocal cord wave near the center of the syllable nucleus as follows. It searches for a local minimum of the waveform near the center of the syllable nucleus, takes that point as the starting point, and extracts the span from there back by one period as one cycle of the vocal cord wave. The period is defined as the reciprocal of the fundamental frequency F0.
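A minimal sketch of this extraction is given below, assuming the normalized volume velocity waveform of one syllable nucleus, the sampling rate, and an F0 estimate are available. The width of the search window around the center (half a period on each side) is an assumption of this example, as are the names.

```python
def extract_unit_waveform(flow, fs, f0, search_fraction=0.5):
    """Return one cycle of `flow` ending at a minimum near the center of the nucleus.

    flow -- normalized volume velocity samples of one syllable nucleus
    fs   -- sampling rate in Hz
    f0   -- fundamental frequency in Hz; the period is its reciprocal
    """
    period = round(fs / f0)                      # period in samples
    center = len(flow) // 2
    half = max(1, int(period * search_fraction))
    lo = max(period, center - half)              # leave room to go back one period
    hi = min(len(flow) - 1, center + half)
    if lo > hi:
        raise ValueError("syllable nucleus too short for one full period")
    # Starting point: the smallest value in the search window.
    end = min(range(lo, hi + 1), key=flow.__getitem__)
    # Go back one period from the starting point (both end points included).
    return flow[end - period:end + 1]
```

The returned slice has `period + 1` samples because it includes both the minimum and the sample one period before it.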

FIG. 4 is a detailed block diagram of PCA analysis unit 88 shown in FIG. 2. PCA analysis requires that each waveform be represented by a fixed number of parameters. To allow a PCA analysis that reflects both the period and the amplitude of one cycle of the vocal cord wave, PCA analysis unit 88 parameterizes the vocal cord wave by the specific method described below.

Referring to FIG. 4, PCA analysis unit 88 includes a low-pass filter 140 whose cutoff frequency is determined by the 15th harmonic of the vocal cord waveform under analysis, and a resampling unit 142 for resampling the vocal cord waveform, after low-pass filter 140 has attenuated its components above the cutoff, so that it consists of 30 partial waveforms of equal length. Resampling unit 142 places the sampling points so that the distances between them, measured along the waveform itself, are equal. Sampling in this way takes the covariance between the amplitude axis and the time axis of the waveform into account, so that changes involving both dimensions at once can be modeled flexibly. Each sampling point is therefore a pair of values: one along the time axis and one along the amplitude axis.

FIG. 5 conceptually shows an example vocal cord wave 160 and example sampling points on it (labeled 0 to 30). As shown in FIG. 5, there are 31 sampling points, which divide vocal cord wave 160 into 30 partial waveforms of equal length.
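Sampling at equal distances along the waveform itself can be sketched as an arc-length resampling of the (time, amplitude) curve. The sketch below treats the waveform as a polyline and interpolates linearly; the relative scaling of the time and amplitude axes affects the arc length and is not specified in this section, so it is left to the caller. The function name is hypothetical.

```python
import math


def resample_equal_arc_length(times, amplitudes, n_points=31):
    """Return n_points (time, amplitude) pairs equally spaced along the curve."""
    pts = list(zip(times, amplitudes))
    if len(pts) < 2:
        raise ValueError("need at least two input samples")
    # Cumulative length of the polyline through the input samples.
    cum = [0.0]
    for (t0, a0), (t1, a1) in zip(pts, pts[1:]):
        cum.append(cum[-1] + math.hypot(t1 - t0, a1 - a0))
    total = cum[-1]
    out = []
    seg = 0
    for i in range(n_points):
        target = total * i / (n_points - 1)
        # Advance to the segment that contains the target arc length.
        while seg < len(pts) - 2 and cum[seg + 1] < target:
            seg += 1
        seg_len = cum[seg + 1] - cum[seg]
        frac = 0.0 if seg_len == 0.0 else (target - cum[seg]) / seg_len
        (t0, a0), (t1, a1) = pts[seg], pts[seg + 1]
        out.append((t0 + frac * (t1 - t0), a0 + frac * (a1 - a0)))
    return out
```

With the default of 31 points the curve is divided into 30 parts of equal length, matching the count used in the embodiment.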

In the present embodiment there are 31 sampling points. This number was chosen to keep the number of parameters as small as possible while still preserving the details of the waveform. The number of sampling points is of course not limited to 31 and can be chosen according to the performance of the equipment used, the accuracy required, and so on.

According to the sampling theorem, sampling at 31 equally spaced points preserves the spectrum of each vocal cord waveform up to the 15th harmonic. To avoid aliasing, the cutoff frequency of low-pass filter 140 is therefore set to the 15th harmonic of the vocal cord waveform.

Referring again to FIG. 4, PCA analysis unit 88 further includes a difference calculation unit 144 that computes the first-order difference of the waveform sampled by resampling unit 142. One reason is that the amplitude offset of the vocal fold wave estimated by inverse filtering is unknown, and differencing removes its influence. Another is to keep amplitude differences among the various vocal fold waveforms from producing unnatural results in the PCA. The result is a derivative of the vocal fold wave, sampled at 30 coordinate points and consisting of 60 parameters, on which PCA can be performed.
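
As a minimal sketch of this step, the following Python code takes the first-order difference of 31 (time, amplitude) pairs and flattens the 30 resulting pairs into 60 parameters. The function names and the ordering of the 60 parameters (all time differences followed by all amplitude differences) are assumptions made here for illustration.

```python
def first_difference(points):
    """Differences between consecutive (time, amplitude) pairs."""
    return [(t1 - t0, a1 - a0)
            for (t0, a0), (t1, a1) in zip(points, points[1:])]

def to_parameter_vector(diffs):
    """Flatten 30 (dt, da) pairs into one 60-element vector (assumed ordering)."""
    return [d[0] for d in diffs] + [d[1] for d in diffs]
```

Because only differences are kept, adding a constant to every amplitude leaves the parameter vector unchanged, which is why the unknown offset no longer matters.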

Referring to FIG. 4, PCA analysis unit 88 further includes a normalization processing unit 146 that normalizes the derivative of the vocal fold wave computed by difference calculation unit 144. The time and amplitude dimensions of the derivative are unrelated to each other, so the PCA could be unduly dominated by whichever dimension varies more; it is desirable to equalize their influence. Therefore, before the PCA, each dimension is normalized by subtracting its overall mean from the value at each sampling point and then dividing by the standard deviation of that dimension. This is the processing performed by normalization processing unit 146.
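
The normalization can be sketched in Python as a per-dimension z-score. The function names are hypothetical, and the choice of the population standard deviation (rather than the sample standard deviation) is an assumption, since the text does not say which is used.

```python
import statistics

def normalize_dimension(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0.0:
        # A constant dimension carries no variation; map it to zeros.
        return [0.0] * len(values)
    return [(v - mean) / sd for v in values]

def normalize_pairs(diffs):
    """Normalize the time and amplitude dimensions of (dt, da) pairs separately."""
    dts = normalize_dimension([d[0] for d in diffs])
    das = normalize_dimension([d[1] for d in diffs])
    return list(zip(dts, das))
```

After this step both dimensions have zero mean and unit variance, so neither dominates the PCA merely because of its units.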

PCA analysis unit 88 further includes a PCA calculation unit 148 that performs PCA on the total of 60 values at the 30 sampling points normalized by normalization processing unit 146 and computes the principal components up to the fourth.
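
For illustration, the leading principal components of a set of parameter vectors can be computed as sketched below. The text does not state which numerical method is used; this sketch substitutes power iteration with deflation on the sample covariance matrix, written with the Python standard library only. The function name and its arguments are hypothetical.

```python
import math
import random

def top_principal_components(rows, n_components=4, n_iter=500, seed=0):
    """Return (components, explained_variance_ratios) for the leading PCs."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    total_var = sum(cov[i][i] for i in range(d))
    if total_var == 0.0:
        raise ValueError("data has no variance")

    rng = random.Random(seed)
    components, ratios = [], []
    for _ in range(n_components):
        v = [rng.uniform(-1.0, 1.0) for _ in range(d)]
        for _ in range(n_iter):
            w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break
            v = [x / norm for x in w]
        # Rayleigh quotient gives the eigenvalue (variance along v).
        eigval = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d))
                     for i in range(d))
        components.append(v)
        ratios.append(eigval / total_var)
        # Deflate so the next pass converges to the next component.
        for i in range(d):
            for j in range(d):
                cov[i][j] -= eigval * v[i] * v[j]
    return components, ratios
```

In the setting of the embodiment, each row would be the 60-element normalized vector of one vocal fold waveform, and `n_components=4` corresponds to stopping at the fourth principal component.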

FIG. 6 shows vocal fold waves obtained from reference speech waveforms 32, together with their sampling results. Each waveform in FIG. 6 was obtained by averaging, for each voice quality, the waveforms computed for a plurality of syllable nuclei judged to exhibit that voice quality. (What is actually obtained is the derivative of the waveform, so FIG. 6 shows the waveforms obtained by integrating it.) These are hereinafter called prototype vocal fold waves.

In FIG. 6, the one to three letters written above each waveform indicate the voice quality of that prototype vocal fold wave. The letter combinations and their meanings are listed in Table 1 below.

As described above, the PCA by model creation unit 34 is performed on the derivative of the vocal fold wave. In what follows, however, so that the resulting prototype vocal fold waves can be compared easily, the PCA results are integrated back to the amplitude scale before being discussed. In doing so, the amplitude of the first sample of each vocal fold waveform was set to 0. FIG. 7 shows the result.
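
Integration back to the amplitude scale amounts to a cumulative sum of the differences, starting from zero. The following Python sketch illustrates this; the function name is hypothetical, and the sketch assumes that any normalization applied to the differences has already been undone.

```python
from itertools import accumulate

def integrate_differences(diffs, t0=0.0, a0=0.0):
    """Rebuild (time, amplitude) points from (dt, da) pairs.

    The first sample is placed at (t0, a0); a0 = 0 matches the convention
    of setting the first amplitude to zero.
    """
    times = list(accumulate((d[0] for d in diffs), initial=t0))
    amps = list(accumulate((d[1] for d in diffs), initial=a0))
    return list(zip(times, amps))
```

Thirty difference pairs yield 31 points again, so the integrated waveform has the same sampling layout as the waveform before differencing.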

The four graphs in FIG. 7 show, for the first, second, third, and fourth principal components respectively, the mean over all vocal fold waves tested (solid line) and the mean plus and minus one standard deviation (lines marked with "+" and "□", respectively). Here the test set consists of 77 vocal fold waves.

The cumulative proportions of the total variance explained by these first four principal components are 57.6%, 80.8%, 88.2%, and 92.1%, respectively. Thus, although PCA of data in a 60-dimensional space yields a full set of orthogonal basis functions, just four of them suffice to explain more than 90% of the variance.
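
The relation between the cumulative figures and the share of each component is simple arithmetic, shown here as a short Python check. The function names are introduced for illustration only.

```python
def individual_shares(cumulative):
    """Per-component shares recovered from cumulative percentages."""
    previous = [0.0] + list(cumulative[:-1])
    return [round(c - p, 1) for p, c in zip(previous, cumulative)]

def components_needed(cumulative, threshold):
    """Smallest number of leading components reaching the threshold, or None."""
    for count, value in enumerate(cumulative, start=1):
        if value >= threshold:
            return count
    return None

cumulative = [57.6, 80.8, 88.2, 92.1]
print(individual_shares(cumulative))        # [57.6, 23.2, 7.4, 3.9]
print(components_needed(cumulative, 90.0))  # 4
```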

The first graph in FIG. 7 shows the waveforms obtained from the first principal component, which explains 57.6% of the variance. The graph shows that this component mainly represents the duration of the waveform, that is, the fundamental period and hence the fundamental frequency of the vocal fold wave. The first principal component also accounts for concomitant changes in shape: as the period shortens the waveform becomes more symmetric with a more rounded peak, and as the period lengthens the waveform becomes wider with a flatter top. The first principal component thus does not reflect changes in the rising portion (glottal opening) or the falling portion (glottal closing) of the waveform.

The second principal component is shown in the second graph of FIG. 7. It explains 23.2% of the original variance and mainly accounts for variation in the waveform while the glottis is open. In particular, the central part of the waveform either has a high amplitude with a single peak skewed slightly to the right of center, or has a low amplitude corresponding to the dip between the two pulses of diplophonic (double-pulsed) phonation. Unlike the first principal component, the second has little relation to the fundamental period of the waveform.

The third principal component is shown in the third graph of FIG. 7. It explains 7.4% of the original variance and appears mainly to reflect the slope of the waveform at glottal opening and the shape of the peak. At one extreme, for example, the opening slope is steep and is followed by a relatively flat top; at the other, the opening slope is gentle and is followed by a still gentler slope leading up to the peak.

The fourth graph in FIG. 7 shows the waveforms for the fourth principal component. It explains only 3.9% of the original variance, but reflects the skew of the pulse and the speed of closure. At one extreme the vocal fold waveform is relatively symmetric with a gentler closing slope; at the other the pulse is skewed slightly to the right with a steeper closing slope.

The fifth and higher principal components describe finer details of the waveform, but each explains less than 2% of the original variance. They are therefore not considered in the present embodiment.

The fifth and higher principal components may of course be taken into account as well. How many components to use can be decided by weighing the available computing resources against the speed the application requires. As noted above, however, the first four components explain most of the variation in the waveform, so there appears to be little practical benefit in going beyond them.

Referring again to FIG. 1, voice quality conversion device 52 includes: a prototype data storage unit 68 that stores PCA parameter model 36 generated by model creation unit 34 together with the waveform data of each prototype vocal fold wave; a vocal fold waveform extraction unit 60 that extracts one-cycle vocal fold waveforms from input speech waveform 50 in the same manner as model creation unit 34; a conversion function generation unit 62 that, based on voice quality specifying information 51, the prototype vocal fold wave data stored in prototype data storage unit 68, and the target voice quality entered by the user, generates a waveform conversion function for converting the vocal fold waveform extracted from input speech waveform 50 into a vocal fold waveform of the voice quality specified by the user; and a waveform regeneration unit 64 that converts the vocal fold waveform output by vocal fold waveform extraction unit 60 using the conversion function obtained by conversion function generation unit 62, thereby generating a speech waveform 54 of the voice quality specified by the user.

Vocal fold waveform extraction unit 60 extracts vocal fold waveforms by the same processing as model creation unit 34, except that it operates on input speech waveform 50. It is therefore not described in detail here.

FIG. 8 is a more detailed block diagram of conversion function generation unit 62. Referring to FIG. 8, conversion function generation unit 62 includes an input/output device 184, such as a keyboard and monitor, for interacting with the user, and a target setting unit 182 that presents to the user, via input/output device 184, the PCA parameters corresponding to the voice quality of input speech waveform 50 as determined from voice quality specifying information 51, and that receives, via input/output device 184, the values the user specifies as targets for the PCA parameters. In doing so, target setting unit 182 refers to the PCA parameter model stored in prototype data storage unit 68.

Conversion function generation unit 62 further includes a waveform subtraction processing unit 188 that generates the waveform conversion function by subtracting the prototype vocal fold waveform corresponding to the voice quality of input speech waveform 50, as determined from voice quality specifying information 51, from the prototype vocal fold waveform corresponding to the target PCA parameters set by target setting unit 182.
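
The subtraction can be sketched in Python as an element-wise difference of two prototype waveforms. The representation of each waveform as a list of (time, amplitude) difference pairs and the function name are assumptions made for this illustration, as are the numeric values in the usage example.

```python
def conversion_function(target_prototype, source_prototype):
    """Target prototype minus source prototype, pair by pair."""
    if len(target_prototype) != len(source_prototype):
        raise ValueError("prototypes must have the same number of points")
    return [(tt - st, ta - sa)
            for (tt, ta), (st, sa) in zip(target_prototype, source_prototype)]

# Hypothetical two-point prototypes, for illustration only.
source = [(1.0, 0.5), (1.0, -0.5)]
target = [(1.5, 0.75), (1.5, -0.75)]
print(conversion_function(target, source))  # [(0.5, 0.25), (0.5, -0.25)]
```

Adding the result back to the source prototype reproduces the target prototype, which is the sense in which the difference acts as a conversion function.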

FIG. 9 shows one way in which target setting unit 182 displays the PCA parameters and sets the target PCA parameters. Referring to FIG. 9, two PCA parameter setting areas 202 and 204 are displayed on output screen 200 of input/output device 184 shown in FIG. 8. PCA parameter setting area 202 is for setting the values of the first principal component (PC1) and the second principal component (PC2); PCA parameter setting area 204 is for setting the values of the third principal component (PC3) and the fourth principal component (PC4).

PCA parameter setting areas 202 and 204 can display points given by the two-dimensional coordinates (PC1, PC2) and (PC3, PC4), respectively. Once the voice quality of input speech waveform 50 is specified, PC1 to PC4 are determined from the glottal waveform model stored in prototype data storage unit 68, and the corresponding points 210 and 214 can be displayed in PCA parameter setting areas 202 and 204. Together these two points identify the first to fourth principal components of input speech waveform 50.

On this display, when the user specifies a new point 212 in PCA parameter setting area 202, for example, the target values of PC1 and PC2 are set to the values on the respective axes corresponding to point 212. Likewise, when the user specifies a new point 216 in PCA parameter setting area 204, the target values of PC3 and PC4 are set. Target setting unit 182 shown in FIG. 8 thus obtains the target values for the first to fourth principal components as set by the user.
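
How a waveform is obtained from user-chosen target values is not spelled out at this point in the text. One plausible reading, given here purely as an assumption, is ordinary PCA reconstruction: the mean vector plus each principal component weighted by its target value. The Python sketch below illustrates that reading; all names are hypothetical.

```python
def target_scores(point_pc1_pc2, point_pc3_pc4):
    """Combine the two on-screen points into [PC1, PC2, PC3, PC4]."""
    return [point_pc1_pc2[0], point_pc1_pc2[1],
            point_pc3_pc4[0], point_pc3_pc4[1]]

def reconstruct(mean, components, scores):
    """Assumed reconstruction: mean + sum of scores[k] * components[k]."""
    out = list(mean)
    for score, component in zip(scores, components):
        for i, value in enumerate(component):
            out[i] += score * value
    return out
```

Under this reading, `mean` and each entry of `components` would be 60-element vectors in the normalized difference domain, and the reconstructed vector would still need to be de-normalized before use.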

The method shown in FIG. 9 is of course only one way of setting the target. Various other methods may be used, such as entering a value directly for each principal component, or displaying the prototypes prepared in advance and having the user pick the target prototype from among them.

FIG. 10 is a detailed block diagram of waveform regeneration unit 64 shown in FIG. 1. Referring to FIG. 10, waveform regeneration unit 64 includes a waveform addition unit 240 that modifies each vocal fold wave extracted from input speech waveform 50 by adding to it the conversion function generated by conversion function generation unit 62, and a waveform adjustment unit 242 that adjusts the pitch and utterance duration of the converted vocal fold waveform output by waveform addition unit 240 to appropriate values and also prevents physically unrealistic results, such as unnatural fold-back of the vocal fold waveform caused by extreme deformation.

As described earlier, the derivative of each normalized vocal fold waveform is represented by 30 pairs of time and amplitude coordinates. The conversion function generated by conversion function generation unit 62 therefore deforms the waveform along the time axis as well as the amplitude axis, which could change the voice quality inappropriately. Waveform adjustment unit 242 adjusts the waveform so that such problems do not arise.
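
The addition of the conversion function and one possible safeguard against fold-back can be sketched in Python as follows. The text does not specify how the adjustment is carried out; the rule used here, clamping every time step to a small positive minimum so that time never runs backwards, is an assumption, as are the function names and the value of `min_step`.

```python
def apply_conversion(diffs, conversion):
    """Add the conversion function to the extracted (dt, da) pairs."""
    return [(dt + ct, da + ca)
            for (dt, da), (ct, ca) in zip(diffs, conversion)]

def prevent_foldback(diffs, min_step=1e-6):
    """Keep every time step positive (assumed adjustment rule)."""
    return [(max(dt, min_step), da) for dt, da in diffs]

# Hypothetical values: the first converted time step would go backwards.
converted = apply_conversion([(1.0, 0.5), (1.0, -0.5)],
                             [(-1.5, 0.25), (0.5, 0.0)])
adjusted = prevent_foldback(converted)
```

A negative time step would make the reconstructed waveform double back on itself, so a monotonic time axis is the minimum condition any such adjustment has to restore.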

Waveform regeneration unit 64 further includes an inverse-of-inverse filter 244 that applies, to the derivative of the converted speech waveform output by waveform adjustment unit 242, the inverse of the inverse filter generated in vocal fold waveform extraction unit 60 shown in FIG. 1 (the counterpart of inverse filter processing unit 82 shown in FIG. 3), thereby restoring the original formants and outputting the converted speech signal.
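
The relationship between an inverse filter and its inverse can be illustrated with a conventional formant model, which is an assumption made here and not a statement of the filter actually used in the embodiment. In the Python sketch below, each formant is removed by a second-order anti-resonator and restored by the matching resonator; the formant frequencies and bandwidths are held fixed for simplicity, and all names are hypothetical.

```python
import math

def antiresonator_coeffs(freq_hz, bw_hz, fs_hz):
    """Coefficients a1, a2 of A(z) = 1 + a1 z^-1 + a2 z^-2 for one formant."""
    r = math.exp(-math.pi * bw_hz / fs_hz)
    theta = 2.0 * math.pi * freq_hz / fs_hz
    return -2.0 * r * math.cos(theta), r * r

def inverse_filter(signal, formants, fs_hz):
    """Cancel each (frequency, bandwidth) formant with an FIR anti-resonator."""
    x = list(signal)
    for freq, bw in formants:
        a1, a2 = antiresonator_coeffs(freq, bw, fs_hz)
        y = []
        for n in range(len(x)):
            x1 = x[n - 1] if n >= 1 else 0.0
            x2 = x[n - 2] if n >= 2 else 0.0
            y.append(x[n] + a1 * x1 + a2 * x2)
        x = y
    return x

def reinstate_formants(signal, formants, fs_hz):
    """Inverse of inverse_filter: all-pole resonators in reverse order."""
    y = list(signal)
    for freq, bw in reversed(formants):
        a1, a2 = antiresonator_coeffs(freq, bw, fs_hz)
        x = []
        for n in range(len(y)):
            x1 = x[n - 1] if n >= 1 else 0.0
            x2 = x[n - 2] if n >= 2 else 0.0
            x.append(y[n] - a1 * x1 - a2 * x2)
        y = x
    return y
```

Applying `reinstate_formants` to the output of `inverse_filter` with the same formant list returns the original signal up to rounding error; in the conversion setting, the signal is modified between the two steps, so the formants are restored around a changed source.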

-Operation-
Voice quality conversion system 30, whose configuration has been described above, operates as follows. Its operation has two phases. The first phase is the creation of PCA parameter model 36; the second is the use of PCA parameter model 36 to change the voice quality of input speech waveform 50 according to user input and generate speech waveform 54. The first phase is described first, followed by the second.

The first phase is described first. Referring to FIG. 1, it is assumed that reference speech waveforms 32 have been prepared in advance and that each has been labeled beforehand with the voice quality of its voice.

In model creation unit 34, syllable nucleus extraction unit 80 and formant estimation unit 81 apply the processing described earlier to each utterance in reference speech waveforms 32 to extract syllable nuclei. Specifically, referring to FIG. 2, syllable nucleus extraction unit 80 extracts syllable nuclei based on, among other things, the power contour of the speech waveform over time, and formant estimation unit 81 estimates the formant frequencies and bandwidths in each syllable nucleus. The syllable nuclei extracted in this way are the portions of reference speech waveforms 32 that the speaker's speech production mechanism utters stably.

Inverse filter processing unit 82 shown in FIG. 2 removes the influence of the vocal tract by inverse-filtering each syllable nucleus extracted by syllable nucleus extraction unit 80 and formant estimation unit 81. Specifically, referring to FIG. 3, inverse filter generation unit 120 generates, for each syllable nucleus, the parameters of an inverse filter that removes the vocal tract influence, using analysis-by-synthesis optimization. These parameters vary with time. The speech signal, from which high-pass filter 122 and low-pass filter 124 have removed the low-frequency components and the components above the fourth formant, is supplied to inverse filter application unit 126, which removes the influence of the first four vocal tract resonances from it. The output of inverse filter application unit 126 is supplied to volume velocity waveform detection unit 84 shown in FIG. 2.

Volume velocity waveform detection unit 84 detects, from the output of inverse filter processing unit 82, the volume velocity waveform of the glottal airflow in the syllable nucleus of each utterance. The detected volume velocity waveform is normalized by normalization unit 86 and supplied to waveform extraction unit 87.

Waveform extraction unit 87 extracts from the normalized volume velocity waveform the one-cycle waveform located near the center of the syllable nucleus and supplies it to PCA analysis unit 88.
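
Selecting the cycle near the center can be sketched in Python as below. The sketch assumes that the sample indices at which successive cycles begin are already known; how those boundaries are found is outside its scope, and the function and argument names are hypothetical.

```python
from bisect import bisect_right

def central_cycle(samples, cycle_starts):
    """Return the one-cycle slice that contains the middle sample.

    cycle_starts must be a sorted list of at least two indices marking
    where successive cycles begin (assumed to be given).
    """
    mid = len(samples) // 2
    k = bisect_right(cycle_starts, mid) - 1
    # Keep k on a cycle that has both a start and an end boundary.
    k = min(max(k, 0), len(cycle_starts) - 2)
    return samples[cycle_starts[k]:cycle_starts[k + 1]]

print(len(central_cycle(list(range(100)), [5, 30, 55, 80])))  # 25
```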

Referring to FIG. 4, low-pass filter 140 of PCA analysis unit 88 removes the frequency components above the cutoff frequency determined by the 15th harmonic of the waveform being processed and supplies the result to resampling unit 142. Resampling unit 142 samples the input waveform at 31 points chosen so as to divide it into 30 partial waveforms of equal length along the waveform, producing 31 time-amplitude pairs. Difference calculation unit 144 takes the first-order difference of these 31 pairs and outputs the derivative of the vocal fold wave sampled at 30 points. Normalization processing unit 146 subtracts from the time and amplitude values of this derivative their respective means over the whole of the waveform being processed, divides the results by their respective standard deviations, and supplies the resulting 60 values (30 pairs of time and amplitude derivatives) to PCA calculation unit 148. PCA calculation unit 148 performs PCA on these parameters, computes the first to fourth principal components for the speech representing each voice quality, and creates PCA parameter model 36 together with the vocal fold waveforms of the corresponding reference speech. PCA parameter model 36 is stored in prototype data storage unit 68 of voice quality conversion device 52.

This completes the creation of PCA parameter model 36.

Next, the operation of voice quality conversion device 52 in the second phase is described. Referring to FIG. 1, vocal fold waveform extraction unit 60 extracts the vocal fold waveform of input speech waveform 50 by applying the same processing as model creation unit 34, and supplies it to waveform regeneration unit 64.

Referring to FIG. 8, target setting unit 182 receives voice quality specifying information 51, which identifies the voice quality of input speech waveform 50, refers to the PCA model stored in prototype data storage unit 68, and presents to the user, in the form shown in FIG. 9, the values PC1 to PC4 of the first to fourth principal components for that voice quality. Using input/output device 184, the user changes these values, by the operations described above, to values corresponding to the desired voice quality. Target setting unit 182 sets the changed values as the PCA target values and supplies them to waveform subtraction processing unit 188. Waveform subtraction processing unit 188 generates the conversion function for converting the waveform by subtracting the prototype glottal waveform designated as the voice quality of the input speech from the prototype glottal waveform corresponding to the PCA target values set by target setting unit 182, and supplies it to waveform regeneration unit 64 shown in FIG. 1.

Referring to FIG. 10, waveform addition unit 240 of waveform regeneration unit 64 adds the conversion function supplied by waveform subtraction processing unit 188 to the vocal fold waveform obtained from input speech waveform 50 and supplies the result to waveform adjustment unit 242. Waveform adjustment unit 242 adjusts the output of waveform addition unit 240 so that it does not become unnatural, as described above, and supplies the result to inverse-of-inverse filter 244. Inverse-of-inverse filter 244 applies to its input the inverse of the inverse filter generated in vocal fold waveform extraction unit 60 shown in FIG. 1. This reapplies the effect of the vocal tract to the glottal waveform produced by waveform adjustment unit 242, yielding the speech waveform with the changed voice quality. In this way speech waveform 54 is output, with the same utterance content as input speech waveform 50 and with its voice quality converted to that determined by the PCA parameters set by the user.

-Experimental results-
FIG. 11 shows an example of a result of processing according to the present embodiment. It contrasts a spectrogram 260 of part of an utterance in Modal voice by Laver (Non-Patent Document 1) with a spectrogram 262 of the same utterance after conversion to a creakier voice. In this example the conversion function was generated from the Modal prototype, with the target set to Creaky.

The present embodiment assumes that the voice quality of any input speech is sufficiently close to that of one of the prototypes prepared in advance, so that selecting that prototype as the basis of the conversion function converts the input voice quality to the target more or less correctly. In the example of FIG. 11, the voice quality of the input speech is assumed to be sufficiently close to Modal.

In practice, however, the glottal waveform varies considerably even within an utterance perceived as having one particular voice quality overall, so this assumption does not always hold. Even so, FIG. 11 shows that the conversion clearly preserves the acoustic information and the duration of the utterance. Moreover, as the vertical striations show, the conversion has lengthened the vocal fold wave cycles. This is as expected, since F0 is shifted in the direction of a creakier voice. Synthesizing speech from this waveform confirms that the converted speech carries the same speech information as the original and that its voice quality is clearly closer to Creaky.

Model creation unit 34 and voice quality conversion device 52, which make up voice quality conversion system 30 described above, can both be implemented by computer hardware and computer programs running on it. General-purpose computer hardware can be used, provided it has facilities for handling audio signals. Each functional block of the devices described above can be implemented as a program by a person skilled in the art on the basis of this specification. Such a program is itself data and can be stored on a storage medium and distributed.

The embodiment disclosed herein is merely illustrative, and the present invention is not limited to it. The scope of the present invention is defined by the claims, read in light of the detailed description of the invention, and includes all modifications within the meaning and range of equivalents of the wording of the claims.

FIG. 1 is a block diagram of voice quality conversion system 30 according to an embodiment of the present invention.
FIG. 2 is a detailed block diagram of model creation unit 34 of voice quality conversion system 30 shown in FIG. 1.
FIG. 3 is a more detailed block diagram of inverse filter processing unit 82 shown in FIG. 2.
FIG. 4 is a more detailed block diagram of PCA analysis unit 88 shown in FIG. 2.
FIG. 5 illustrates the resampling method used by resampling unit 142 shown in FIG. 4.
FIG. 6 shows the prototype vocal fold waveforms and their sampling results.
FIG. 7 illustrates the changes in the waveform represented by the first to fourth principal components.
FIG. 8 is a detailed block diagram of conversion function generation unit 62 shown in FIG. 1.
FIG. 9 illustrates how target setting unit 182 shown in FIG. 8 sets the target.
FIG. 10 is a detailed block diagram of waveform regeneration unit 64 shown in FIG. 1.
FIG. 11 is a spectrogram showing experimental results according to an embodiment of the present invention.

Explanation of symbols

30 voice quality conversion system, 32 reference speech waveform, 34 model creation unit, 36 PCA parameter model, 50 input speech waveform, 51 voice quality identification information, 52 voice quality conversion device, 54 speech waveform, 60 input waveform extraction unit, 62 conversion function generation unit, 64 waveform regeneration unit, 80 syllable nucleus extraction unit, 81 formant estimation unit, 82 inverse filter processing unit, 84 volume velocity waveform detection unit, 86 normalization unit, 87 vocal cord waveform extraction unit, 88 PCA analysis unit, 120 inverse filter generation unit, 122 high-pass filter, 124, 140 low-pass filters, 126 inverse filter application unit, 142 resampling unit, 144 difference calculation unit, 146 normalization processing unit, 148 PCA calculation unit, 182 target setting unit, 184 input/output device, 188 waveform subtraction processing unit, 240 waveform addition unit, 242 waveform adjustment unit, 244 inverse-inverse filter, 246 integration processing unit

Claims

A voice quality model generation method comprising:
a vocal cord waveform estimation step of estimating, for each portion satisfying a predetermined condition in a plurality of reference speech waveforms each prepared in advance in correspondence with a predetermined voice quality, a unit waveform of the vocal cord wave at the time that portion was uttered;
a parameterization step of parameterizing each of the unit waveforms of the vocal cord wave according to a predetermined parameterization method;
a principal component analysis step of obtaining a principal component representation of each of the unit waveforms of the vocal cord wave by performing principal component analysis on the parameterized unit waveforms; and
a step of outputting each unit waveform of the vocal cord wave, together with the principal component representation corresponding to it, as a voice quality model corresponding to the speech waveform from which the vocal cord wave was obtained.

The voice quality model generation method according to claim 1, wherein the vocal cord waveform estimation step includes:
a step of extracting syllable nuclei from the plurality of reference speech waveforms, each prepared in advance in correspondence with a predetermined voice quality;
a step of applying, to each of the extracted syllable nuclei, an inverse filter that removes the influence of the vocal tract in order to detect the volume velocity waveform of the glottal airflow during speech production; and
a unit waveform extraction step of extracting a unit waveform of the vocal cord wave from each of the syllable nuclei after the inverse filter is applied.

The voice quality model generation method according to claim 2, wherein the unit waveform extraction step includes a step of extracting, as the unit waveform, the portion that starts at a minimum of the volume velocity waveform located in the central part of the syllable nucleus and extends back from there by one period determined by the fundamental frequency of a predetermined region including the syllable nucleus.
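
As a rough illustration of the extraction described in the preceding claim, the following Python sketch cuts out one period that ends at the minimum nearest the centre of a syllable nucleus. The function name `extract_unit_waveform`, the one-period search window used to define the "central part", and the synthetic flow values are assumptions made for this example; the claim itself does not fix them.

```python
def extract_unit_waveform(flow, sample_rate, f0):
    """Return one period of `flow` ending at the minimum near its centre.

    flow        -- volume velocity samples of one syllable nucleus
    sample_rate -- sampling rate in Hz
    f0          -- fundamental frequency of the surrounding region in Hz
    """
    period = round(sample_rate / f0)          # samples per glottal cycle
    centre = len(flow) // 2
    # Assumed definition of "central part": one period around the midpoint.
    lo = max(0, centre - period // 2)
    hi = min(len(flow), centre + period // 2 + 1)
    end = min(range(lo, hi), key=lambda i: flow[i])
    start = end - period                      # go back by exactly one period
    if start < 0:
        raise ValueError("syllable nucleus is shorter than one period")
    return flow[start:end + 1]

# Synthetic stand-in for an inverse-filtered syllable nucleus (period = 8 samples).
cycle = [0.0, 1.0, 3.0, 4.0, 3.0, 2.0, 1.0, 0.5]
flow = cycle * 5
unit = extract_unit_waveform(flow, sample_rate=800, f0=100)
```

The returned slice includes both end points, so a cycle of 8 samples yields 9 values that begin and end at a minimum.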

The voice quality model generation method according to claim 2, further comprising a step of normalizing a volume velocity waveform of the glottal airflow according to a predetermined normalization method prior to the unit waveform extraction step.

The voice quality model generation method according to any one of claims 1 to 4, wherein the principal component analysis step includes a step of obtaining, for each of the unit waveforms of the vocal cord wave, principal component representations for a predetermined number of leading principal components by performing principal component analysis on the parameterized unit waveforms.

The voice quality model generation method according to claim 5, wherein the predetermined number of principal components is from a first principal component to a fourth principal component.
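
To make the role of the leading principal components concrete, the following Python sketch fits a principal component analysis to a set of equal-length waveform vectors and keeps the first four components. It is a minimal sketch under stated assumptions: the helper names `fit_pca` and `reconstruct`, the use of NumPy's singular value decomposition, and the synthetic waveforms are all choices made for this example and are not taken from the patent.

```python
import numpy as np

def fit_pca(waveforms, n_components=4):
    """Return (mean, components, scores) for waveform vectors stored row-wise."""
    X = np.asarray(waveforms, dtype=float)
    mean = X.mean(axis=0)
    centered = X - mean
    # Rows of vt are the principal axes, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_components]
    scores = centered @ components.T          # principal component representation
    return mean, components, scores

def reconstruct(mean, components, scores):
    """Rebuild waveforms from their principal component representation."""
    return mean + scores @ components

# Synthetic stand-ins for parameterized unit waveforms (20 waveforms, 32 points).
rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 32)
basis = np.stack([np.sin(np.pi * t), np.sin(2 * np.pi * t), t * (1 - t)])
waveforms = rng.normal(size=(20, 3)) @ basis

mean, components, scores = fit_pca(waveforms, n_components=4)
approx = reconstruct(mean, components, scores)
```

Each row of `scores` holds four numbers per waveform, which is the sense in which a small number of parameters can describe a unit waveform. The synthetic data here has only three degrees of freedom, so four components reproduce it exactly; real glottal waveforms would generally be approximated rather than reproduced.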

The voice quality model generation method according to any one of claims 1 to 6, wherein the parameterization step includes a resampling step of resampling the unit waveform of the vocal cord wave at a predetermined number of sampling points that divide the unit waveform of the vocal cord wave into a plurality of portions of equal length.

The voice quality model generation method further comprising a differentiation step of obtaining a differential data string of the unit waveform of the vocal cord wave by taking differences of the unit waveform of the vocal cord wave resampled in the resampling step,
wherein the principal component analysis step includes a step of obtaining a principal component representation of the differences of each unit waveform of the vocal cord wave by performing the principal component analysis on the differential data string.

The voice quality model generation method according to claim 8, wherein each of the differential data strings obtained in the differentiation step consists of pairs of a difference in resampling time and the corresponding difference in the unit waveform of the vocal cord wave,
and wherein the method further comprises, prior to the principal component analysis step, a step of applying to each of the differential data strings a predetermined normalization process that equalizes the influence of variation along the time axis and the influence of variation along the amplitude axis.
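
The resampling, differencing and normalization described in the three preceding claims can be sketched in Python as follows. This is an illustrative reading only: it assumes that "equal length portions" means equal length along the waveform curve, and it assumes a simple normalization that scales the time differences and the amplitude differences to unit root-mean-square value. The function names and the triangular test waveform are hypothetical.

```python
import math

def resample_equal_arc_length(times, values, n_points):
    """Pick n_points on the polyline (times, values), equally spaced along its length."""
    cum = [0.0]                                # cumulative curve length
    for i in range(1, len(times)):
        cum.append(cum[-1] + math.hypot(times[i] - times[i - 1],
                                        values[i] - values[i - 1]))
    total, seg, points = cum[-1], 0, []
    for k in range(n_points):
        target = total * k / (n_points - 1)
        while seg < len(cum) - 2 and cum[seg + 1] < target:
            seg += 1
        span = cum[seg + 1] - cum[seg]
        frac = 0.0 if span == 0 else (target - cum[seg]) / span
        points.append((times[seg] + frac * (times[seg + 1] - times[seg]),
                       values[seg] + frac * (values[seg + 1] - values[seg])))
    return points

def difference_pairs(points):
    """Pairs of (time difference, amplitude difference) between neighbours."""
    return [(t1 - t0, v1 - v0) for (t0, v0), (t1, v1) in zip(points, points[1:])]

def normalize_pairs(pairs):
    """Assumed normalization: scale each axis to unit RMS so neither dominates."""
    rms_t = math.sqrt(sum(dt * dt for dt, _ in pairs) / len(pairs)) or 1.0
    rms_v = math.sqrt(sum(dv * dv for _, dv in pairs) / len(pairs)) or 1.0
    return [(dt / rms_t, dv / rms_v) for dt, dv in pairs]

# Hypothetical triangular unit waveform: two straight segments of length 5 each.
points = resample_equal_arc_length([0.0, 3.0, 6.0], [0.0, 4.0, 0.0], n_points=5)
pairs = difference_pairs(points)
normalized = normalize_pairs(pairs)
```

Because the sampling points follow the curve rather than the time axis, the time differences are no longer constant, which is why each entry of the differential data string carries both a time difference and an amplitude difference.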

A voice quality conversion method for converting the voice quality of an input speech waveform using a glottal waveform model consisting of pairs of unit waveforms of a plurality of prototype vocal cord waves, each associated with a predetermined voice quality, and a predetermined number of leading principal component representations obtained by a predetermined principal component analysis of each of the unit waveforms of the plurality of prototype vocal cord waves, the method comprising:
a unit waveform extraction step of extracting a unit waveform of the vocal cord wave from each portion of the input speech waveform that satisfies a predetermined condition; and
a speech waveform generation step of generating an output speech waveform by converting the unit waveform of the vocal cord wave extracted from the input speech waveform to the voice quality specified by the user, based on the glottal waveform model corresponding to the voice quality specified in advance as the voice quality of the input speech waveform and the glottal waveform model corresponding to the voice quality specified by the user.

The voice quality conversion method according to claim 10, wherein the speech waveform generation step includes:
a step of selecting a first prototype vocal cord wave from the glottal waveform model corresponding to the voice quality of the input speech waveform;
a step of selecting a second prototype vocal cord wave from the glottal waveform model corresponding to the voice quality specified by the user;
a conversion function calculation step of calculating a conversion function for converting the input speech waveform into a speech waveform of the voice quality specified by the user by performing a predetermined calculation between the first waveform and the second waveform; and
a step of generating the output speech waveform by applying the conversion function to the unit waveform of the vocal cord wave of the input speech waveform.

The voice quality conversion method according to claim 11, wherein the conversion function calculation step includes a step of calculating the conversion function by subtracting the first waveform from the second waveform.

The voice quality conversion method according to claim 12, wherein the voice waveform generation step includes the step of generating the output voice waveform by adding the conversion function to a unit waveform of a vocal cord wave of the input voice waveform.
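
The subtraction and addition described in the two preceding claims amount to a point-by-point offset between waveforms. The Python sketch below illustrates this under the assumption that all waveforms have already been brought to the same number of samples; the function names and the sample values are hypothetical.

```python
def conversion_function(first_waveform, second_waveform):
    """Conversion function: second (target) prototype minus first (source) prototype."""
    if len(first_waveform) != len(second_waveform):
        raise ValueError("prototype waveforms must have the same length")
    return [b - a for a, b in zip(first_waveform, second_waveform)]

def apply_conversion(unit_waveform, conversion):
    """Add the conversion function to a unit waveform of the input speech."""
    if len(unit_waveform) != len(conversion):
        raise ValueError("waveform and conversion function must have the same length")
    return [x + d for x, d in zip(unit_waveform, conversion)]

# Hypothetical prototypes for the input voice quality and the user-specified one.
first = [0.0, 0.5, 1.0, 0.25]
second = [0.0, 0.75, 0.5, 0.0]
delta = conversion_function(first, second)

input_unit = [0.0, 0.5, 1.0, 0.5]
converted = apply_conversion(input_unit, delta)
```

Applying `delta` to the first prototype itself returns the second prototype, so an input waveform is shifted by the same amount that separates the two prototypes.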

The voice quality conversion method according to claim 1, wherein the unit waveform extraction step includes:
a step of extracting syllable nuclei of the input speech waveform;
a step of applying, to each of the extracted syllable nuclei, an inverse filter that removes the influence of the vocal tract in order to detect the volume velocity waveform of the glottal airflow during speech production; and
a step of extracting a unit waveform of the vocal cord wave from each of the syllable nuclei after the inverse filter is applied.

The voice quality conversion method according to claim 14, wherein the step of extracting the unit waveform includes a step of extracting, as the unit waveform, the portion that starts at a minimum of the volume velocity waveform located in the central part of the syllable nucleus and extends back from there by one period determined by the fundamental frequency of a predetermined region including the syllable nucleus.

The voice quality conversion method according to claim 14 or 15, further comprising a step of normalizing the volume velocity waveform of the glottal airflow according to a predetermined normalization method prior to the step of extracting the unit waveform.

The voice quality conversion method according to any one of claims 1 to 16, wherein the predetermined number of principal component representations from the top are based on a first principal component to a fourth principal component.

A computer program configured to, when executed by a computer, operate the computer so as to realize all the steps according to any one of claims 1 to 17.

A computer programmed by the computer program according to claim 18.

A computer-readable recording medium on which the computer program according to claim 18 is recorded.