JP2005134685A

JP2005134685A - Vocal tract shaped parameter estimation device, speech synthesis device and computer program

Info

Publication number: JP2005134685A
Application number: JP2003371398A
Authority: JP
Inventors: Hiroyuki Hirai; 啓之平井; Hironori Takemoto; 浩典竹本; Tatsuya Kitamura; 達也北村; Kiyoshi Honda; 清志本多
Original assignee: ATR Advanced Telecommunications Research Institute International; Sanyo Electric Co Ltd
Current assignee: ATR Advanced Telecommunications Research Institute International; Sanyo Electric Co Ltd
Priority date: 2003-10-31
Filing date: 2003-10-31
Publication date: 2005-05-26

Abstract

PROBLEM TO BE SOLVED: To estimate a vocal tract shape parameter as accurately as possible.

SOLUTION: A vocal tract shape parameter estimation device 20 includes an initial value calculation part 40 that calculates an initial value of the vocal tract shape parameter of a living body from image information 22 representing the internal structure of the living body during a prescribed utterance; a standard spectral parameter calculation part 44 that calculates a spectral parameter from a speech waveform 24 equivalent to this utterance; and a spectral parameter conversion part 42, a spectral parameter comparison part 46, and a vocal tract shape parameter correction part 48 that optimize the vocal tract shape parameter calculated by the initial value calculation part 40 by a prescribed optimization technique, using the spectral parameter calculated by the standard spectral parameter calculation part 44 as the standard.

COPYRIGHT: (C)2005, JPO&NCIPI

Description

The present invention relates to a speech synthesizer, and more particularly to a technique for making synthesized speech clearer in a method that synthesizes speech from vocal tract shape parameters.

Various interfaces have been proposed between humans and machine systems, of which computer systems are representative. Among them, speech has long been regarded as promising and has recently come into particularly frequent use. Using speech makes it possible for humans and machine systems to communicate in much the same way as humans communicate with one another.

One of the speech synthesis techniques used in spoken communication synthesizes speech by simulating the physiological movement of the human speech organs. This approach divides the human speech production model into a sound source model, which generates the sound wave, and a vocal tract model, which represents articulation by the shape of the vocal tract. The vocal tract model is often described by vocal tract shape parameters that indicate the shape of the vocal tract, in particular its cross-sectional shape.

With this approach, if the characteristics of a person's sound source and the vocal tract shape parameters corresponding to that person's vocal tract shape are known accurately, clear speech closely resembling that person's voice can be generated. For example, for a person who has lost a speech organ for some reason and can no longer speak, it is possible to synthesize speech that closely resembles the person's original voice.

Speech synthesis using vocal tract shape parameters is a promising technique because the parameters can be compressed at a high ratio, because it is highly robust to parameter deformation, and because it may be applicable to voice quality conversion.

The problem with this approach is how to estimate the vocal tract shape parameters. In this regard, Patent Document 1 discloses a technique that estimates a speaker's vocal tract shape parameters by referring to a correspondence between vocal tract shape parameters determined in advance from a standard speaker's vocal tract model and the formant frequencies of actual speech.

JP-A-11-327592, paragraphs 0020 to 0042

However, it is generally known that, in human speech, similar sounds can be produced with different vocal tract shapes. Consequently, when vocal tract shape parameters are derived from the speech waveform, as in the technique of Patent Document 1, the solution is not uniquely determined. As a result, the synthesized speech may not be satisfactory.

An object of the present invention is therefore to provide a vocal tract shape parameter estimation device capable of estimating vocal tract shape parameters as accurately as possible, and a speech synthesizer that synthesizes speech using the vocal tract shape parameters so estimated.

A vocal tract shape parameter estimation device according to a first aspect of the present invention includes: vocal tract shape parameter calculation means for calculating a vocal tract shape parameter of a living body from image information representing the internal structure of the living body during a predetermined utterance; spectral parameter calculation means for calculating a spectral parameter from a speech waveform equivalent to the predetermined utterance; and vocal tract shape parameter optimization means for optimizing, by a predetermined optimization method, the vocal tract shape parameter calculated by the vocal tract shape parameter calculation means, using the spectral parameter calculated by the spectral parameter calculation means as a reference.

The vocal tract shape parameter is calculated from image information representing the actual internal structure of the living body and, with that value as the initial value, is optimized against the spectral parameter obtained from the actual speech waveform. Because the initial value is a vocal tract shape parameter obtained from actual image information, the optimization can find a value that is optimal as a vocal tract shape parameter for generating a speech waveform close to the actual one. As a result, speech synthesized with the vocal tract shape parameter estimated in this way is clear and close to the speaker's own voice.

Preferably, the spectral parameter calculation means includes first spectral envelope calculation means for calculating the spectral envelope of the speech waveform equivalent to the predetermined utterance. The vocal tract shape parameter optimization means includes: second spectral envelope calculation means for calculating the spectral envelope of speech synthesized from a given vocal tract shape parameter; spectral envelope comparison means for comparing the spectral envelope calculated by the first spectral envelope calculation means with that calculated by the second spectral envelope calculation means and calculating their difference; vocal tract shape parameter correction means for correcting the vocal tract shape parameter by the optimization method so that the difference calculated by the spectral envelope comparison means approaches zero, and supplying the corrected parameter to the second spectral envelope calculation means; and end condition detection means for outputting the vocal tract shape parameter corrected by the vocal tract shape parameter correction means in response to a predetermined end condition being satisfied. The second spectral envelope calculation means calculates the spectral envelope using the vocal tract shape parameter calculated by the vocal tract shape parameter calculation means in the first round of the calculation, and the vocal tract shape parameter supplied by the vocal tract shape parameter correction means thereafter.

The optimization is based on the spectral envelope, which reflects the vocal tract shape. This removes the influence of the spectral fine structure caused by the sound source, so the vocal tract shape parameter can be estimated accurately.

Preferably, the vocal tract shape parameter correction means includes means for correcting the vocal tract shape parameter by simulated annealing so that the difference approaches zero.

Using simulated annealing increases the likelihood of finding a globally optimal solution rather than a local one.
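To illustrate why simulated annealing is less likely to get stuck in a local solution, the following is a minimal sketch of the Metropolis acceptance rule commonly used with it. The patent does not specify the acceptance rule or any temperature schedule; the function name `accept_move` and its parameters are hypothetical.

```python
import math
import random


def accept_move(current_error, candidate_error, temperature, rng=random):
    """Metropolis rule (an assumed, commonly used choice).

    Improvements are always accepted. A worse candidate is accepted with
    probability exp(-increase / temperature), so the search can climb out
    of a local minimum while the temperature is still high.
    """
    if candidate_error <= current_error:
        return True
    if temperature <= 0.0:
        return False
    probability = math.exp(-(candidate_error - current_error) / temperature)
    return rng.random() < probability
```

At a high temperature most worsening moves are accepted, which lets the search explore; as the temperature is lowered, the rule approaches plain descent.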

More preferably, the end condition detection means includes means for outputting the vocal tract shape parameter corrected by the vocal tract shape parameter correction means in response to the magnitude of the difference calculated by the spectral envelope comparison means falling below a predetermined threshold.

The end condition detection means may instead include means for outputting the vocal tract shape parameter corrected by the vocal tract shape parameter correction means in response to either of two events: the magnitude of the difference calculated by the spectral envelope comparison means falling below a predetermined threshold, or the processing by the second spectral envelope calculation means, the spectral envelope comparison means, and the vocal tract shape parameter correction means having been repeated a predetermined number of times.

Ending the processing on a predetermined end condition ensures that an estimate is obtained within a bounded amount of time.

More preferably, the vocal tract shape parameter calculation means includes: vocal tract cross-sectional area function specifying means for determining a vocal tract cross-sectional area function by calculating, from a plurality of three-dimensional cross-sectional images of the living body during the predetermined utterance, the cross-sectional area of the vocal tract at a plurality of locations; and means for extracting the time-varying pattern of the vocal tract cross-sectional area function so specified and outputting it as the vocal tract shape parameter.

A speech synthesizer according to a second aspect of the present invention includes any of the vocal tract shape parameter estimation devices described above, and speech synthesis means for synthesizing speech using the vocal tract shape parameter estimated by that device.

The vocal tract shape parameter estimation device described above can accurately estimate a vocal tract shape parameter for synthesizing speech close to the actual speech. Because speech is synthesized with that vocal tract shape parameter, the synthesized speech is both clear and close to the speaker's voice.

A computer program according to a third aspect of the present invention, when executed by a computer, causes the computer to operate as any of the vocal tract shape parameter estimation devices described above.

A computer program according to a fourth aspect of the present invention, when executed by a computer, causes the computer to operate as the speech synthesizer described above.

As described above, the present invention makes it possible to find a value that is optimal as a vocal tract shape parameter for generating a speech waveform close to the actual speech waveform. Synthesizing speech with the estimated vocal tract shape parameter therefore yields clear speech that is close to the speaker's actual voice.

In the vocal tract shape parameter estimation device and speech synthesizer of the present embodiment, the vocal tract shape parameters are obtained by simulating the propagation of sound waves in the vocal tract. Two inputs are used: the speaker's vocal tract shape while speaking a predetermined task, imaged by an apparatus capable of imaging three-dimensional shapes inside a living body, such as MRI (Magnetic Resonance Imaging) or CT (Computed Tomography); and speech uttered for the same task as during imaging. The three-dimensional image may be, for example, an image reconstructed from multiple cross-sectional images of the living body, as in the embodiment described below. Alternatively, an apparatus that can image a three-dimensional shape directly may be used.

In this method, the vocal tract shape parameter is first obtained from the vocal tract shape during the utterance, as imaged by MRI, CT, or the like. A spectral parameter is then computed from the speech waveform obtained by synthesizing speech from this vocal tract shape parameter. The difference between this spectral parameter and the spectral parameter obtained by analyzing actual speech equivalent to the utterance made during imaging is used as the evaluation value, and the vocal tract shape parameter is corrected by an optimization method such as simulated annealing. This process is repeated; correction stops when a predetermined end condition is satisfied, and the vocal tract shape parameter at that point is taken as the final estimate. Speech is then synthesized from this vocal tract shape parameter by a conventionally known method. When simulated annealing is used, there is a high likelihood of obtaining a globally optimal solution rather than a local one.

In the present embodiment, as described later, the time-varying pattern of the vocal tract cross-sectional area function is used as the vocal tract shape parameter. However, any parameter capable of representing the vocal tract shape for speech synthesis may be adopted, chosen to suit the speech synthesis method.

FIG. 1 is a block diagram of a speech synthesizer 20 according to the present embodiment. The speech synthesizer 20 obtains the speaker's vocal tract shape parameter 26 from two inputs: MRI imaging data 22 of the inside of the speaker's body, captured by MRI while the speaker is speaking a certain task, and a speech waveform 24 of speech actually uttered by the speaker for the same task. It then uses this vocal tract shape parameter 26 to generate a synthesized speech signal 32. The nature of the MRI imaging data 22 is described later.

Referring to FIG. 1, the speech synthesizer 20 includes an initial value calculation unit 40 for calculating the vocal tract shape parameter of the living body from the MRI imaging data 22, and a spectral parameter conversion unit 42 for receiving a vocal tract shape parameter and calculating the spectral parameter of the speech that would be synthesized from its value. As described later, the spectral parameter conversion unit 42 uses the vocal tract shape parameter output by the initial value calculation unit 40 at the start of the estimation process, and the corrected vocal tract shape parameter in subsequent processing. In the present embodiment, the spectral envelope is used as the spectral parameter.

The speech synthesizer 20 further includes: a reference spectral parameter calculation unit 44 for calculating, from the speech waveform 24 obtained from the speaker's utterance, a reference spectral parameter that serves as the reference for estimating the vocal tract shape parameter; a spectral parameter comparison unit 46 for comparing the spectral parameter calculated by the spectral parameter conversion unit 42 with the reference spectral parameter obtained by the reference spectral parameter calculation unit 44 and outputting their difference; and a vocal tract shape parameter correction unit 48 for correcting the vocal tract shape parameter by simulated annealing, based on the difference output by the spectral parameter comparison unit 46, so that the difference becomes smaller (approaches zero), and supplying the corrected parameter to the spectral parameter conversion unit 42. In the present embodiment, the spectral parameter comparison unit 46 judges the end condition to be satisfied when the magnitude of the difference falls below a predetermined threshold or when the vocal tract shape parameter has been corrected a predetermined number of times. It then ends the processing and outputs the vocal tract shape parameter at that point (the one last output by the vocal tract shape parameter correction unit 48) as the optimized vocal tract shape parameter 26.

The speech synthesizer 20 further includes a sound source 30 and a speech synthesis unit 28 that uses the sound source 30 and the optimized vocal tract shape parameter 26 obtained by the above processing to generate the synthesized speech signal 32 in accordance with a target phoneme sequence 50. The speech synthesis unit 28 may use any synthesis method that works from vocal tract shape parameters. The sound source 30 is preferably one optimized for the speaker's voice, but another sound source may be used.

FIG. 2 shows, as an example, a cross-sectional image of the midsagittal plane of the human body at one point during an utterance, obtained by MRI. In addition to the midsagittal plane, the MRI imaging data 22 includes cross-sectional shapes of the body in multiple planes parallel to it at the same point in time. The internal shape of the living body can therefore be reconstructed in three dimensions from these images.

Based on the three-dimensional shape of the inside of the living body reconstructed in this way, the initial value calculation unit 40 determines the relationship between the cross-sectional area of the vocal tract at a given time and the distance along the vocal tract from a predetermined position (for example, the glottis) to that cross section. This relationship is called the vocal tract cross-sectional area function. The vocal tract cross-sectional area function changes over time depending on the speech being uttered. A time-varying pattern of the vocal tract cross-sectional area function can therefore be obtained for a speaker speaking a given task, as described above.
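As a rough illustration of the data involved, a vocal tract cross-sectional area function and its time-varying pattern could be held in structures like the following. This is a sketch only: the class names, field names, and units (centimetres, milliseconds) are assumptions for illustration and do not come from the patent.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AreaFunction:
    """Cross-sectional area sampled along the vocal tract at one instant."""
    distances_cm: List[float]  # distance from the glottis along the tract (assumed unit)
    areas_cm2: List[float]     # cross-sectional area at each distance (assumed unit)

    def __post_init__(self):
        if len(self.distances_cm) != len(self.areas_cm2):
            raise ValueError("distances and areas must have the same length")


@dataclass
class AreaFunctionTrack:
    """Time-varying pattern: one AreaFunction per time stamp."""
    times_ms: List[float]
    frames: List[AreaFunction]

    def area_at(self, frame_index, section_index):
        """Area of one section in one frame."""
        return self.frames[frame_index].areas_cm2[section_index]
```

For example, `AreaFunctionTrack([0.0, 10.0], [frame_a, frame_b]).area_at(1, 2)` would return the area of the third section in the second frame.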

In the present embodiment, the cross-sectional area of the vocal tract is calculated at 12 locations, shown as cross sections A to L in FIG. 2, and these areas are used as the vocal tract shape parameters. The position of the vocal cords, which serves as the origin of the vocal tract cross-sectional area function, is shown as point 50 in FIG. 2.

FIG. 3 shows cross-sectional views of the vocal tract and its surroundings at cross sections A to L. These views are obtained from the three-dimensional shape built by combining multiple MRI cross sections, as described above. In FIG. 3, the vocal tract cross sections are indicated by reference numerals 60 to 82. Because cross-sectional images of the vocal tract can be synthesized from the MRI images in this way, the vocal tract area at each cross section can be measured from those images. In the images, the vocal tract appears as a space containing nothing but air, which makes it distinguishable from other parts.
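One simple way to measure the area from a cross-sectional image, assuming the air-filled lumen has already been segmented into a binary mask, is to count the air pixels and multiply by the area of one pixel. The patent only says that known image processing techniques are used; the function below, its name, and the pixel spacing values are illustrative assumptions.

```python
def cross_section_area(mask, pixel_width_mm, pixel_height_mm):
    """Area of the air-filled region in one cross-sectional image.

    mask: 2-D list of 0/1 values where 1 marks a pixel classified as air
          (the segmentation step is assumed to have been done already).
    Returns the area in square millimetres.
    """
    air_pixels = sum(sum(1 for value in row if value) for row in mask)
    return air_pixels * pixel_width_mm * pixel_height_mm
```

Repeating this for each of the cross sections along the tract gives one sample of the area function per section.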

Known image processing techniques exist for reconstructing a three-dimensional image from multiple MRI images, for obtaining from that three-dimensional image cross-sectional images containing the vocal tract cross section at each position along the vocal tract, and for obtaining the cross-sectional area of the vocal tract from each such image. The present embodiment uses these techniques.

FIG. 4 shows, in graph form, the time-varying pattern of the vocal tract cross-sectional area function obtained from the MRI images. The graph arranges along the time axis the vocal tract cross-sectional area functions obtained at each time while the speaker pronounced "aiueo" (アイウエオ). In FIG. 4, the horizontal axis is the distance of the vocal tract cross section from the vocal cords (point 50 in FIG. 2), and the vertical axis is the vocal tract cross-sectional area at each cross section. The axis labeled "aiueo" is the time axis, marked with the elapsed time from the start of the utterance in milliseconds. The cross-sectional area profile represented in this way, together with its time-varying pattern, is the vocal tract shape parameter extracted by the initial value calculation unit 40.

FIG. 5 shows an example of the vocal tract cross-sectional area function at one point in time, extracted from the MRI imaging data 22. A function of this kind is obtained for each time step.

Once the vocal tract cross-sectional area function has been obtained, the spectral distribution of the speech that would result from synthesizing speech with it can be calculated using a known speech processing tool. FIG. 6 shows an example of the result. This conversion is performed by the spectral parameter conversion unit 42 shown in FIG. 1. In the first pass of the iterative process, the spectral parameter conversion unit 42 converts the vocal tract shape parameter from the initial value calculation unit 40 into a spectral parameter; in subsequent passes it converts the corrected vocal tract shape parameter supplied by the vocal tract shape parameter correction unit 48, and in each case supplies the result to the spectral parameter comparison unit 46.
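The patent leaves the conversion to "a known speech processing tool". As one textbook way such a conversion can work, the sketch below treats the vocal tract as a chain of lossless uniform tube sections and evaluates the gain of the volume-velocity transfer function from glottis to lips, with an ideal open end at the lips. The model, the constants, and all names are assumptions for illustration, not the patent's method.

```python
import math

SOUND_SPEED_CM_S = 35000.0   # assumed speed of sound in warm, moist air
AIR_DENSITY_G_CM3 = 0.00114  # assumed air density


def chain_matrix(area_cm2, length_cm, freq_hz):
    """2x2 chain matrix of one lossless uniform tube section."""
    k = 2.0 * math.pi * freq_hz / SOUND_SPEED_CM_S
    z = AIR_DENSITY_G_CM3 * SOUND_SPEED_CM_S / area_cm2
    c, s = math.cos(k * length_cm), math.sin(k * length_cm)
    return ((c, 1j * z * s), (1j * s / z, c))


def matmul(a, b):
    """Product of two 2x2 matrices stored as nested tuples."""
    return (
        (a[0][0] * b[0][0] + a[0][1] * b[1][0], a[0][0] * b[0][1] + a[0][1] * b[1][1]),
        (a[1][0] * b[0][0] + a[1][1] * b[1][0], a[1][0] * b[0][1] + a[1][1] * b[1][1]),
    )


def gain_db(areas_cm2, section_length_cm, freq_hz, floor=1e-6):
    """Gain |U_lips / U_glottis| in dB with zero pressure at the lips.

    areas_cm2 is ordered from glottis to lips. Because the model is
    lossless, the gain is unbounded at resonances; `floor` caps it.
    """
    total = ((1.0, 0.0), (0.0, 1.0))
    for area in areas_cm2:
        total = matmul(total, chain_matrix(area, section_length_cm, freq_hz))
    return -20.0 * math.log10(max(abs(total[1][1]), floor))
```

For a uniform tube 17.5 cm long, this model places the first resonance near 500 Hz, the familiar quarter-wavelength result; evaluating `gain_db` over a grid of frequencies gives a spectrum whose peaks move as the areas change. A practical tool would also model losses and lip radiation.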

FIG. 7 shows an example of the spectrum obtained by analyzing the actual speech corresponding to the utterance made when the MRI images were captured. FIG. 8 shows the spectral envelope of this spectrum. In general, the spectrum of speech consists of a spectral envelope determined by the vocal tract shape, with a fine structure determined by the vibration waveform of the sound source superimposed on it. The spectral envelope shown in FIG. 8 can therefore be regarded as corresponding to the vocal tract shape during the actual utterance.

The function of the reference spectral parameter calculation unit 44 is to extract from the speech waveform 24 a spectral envelope such as that shown in FIG. 8 as the reference spectral parameter.
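The patent does not say how the envelope is extracted. One common technique is cepstral smoothing: take the inverse DFT of the log-magnitude spectrum, keep only the low-quefrency coefficients, and transform back, which discards the fast ripple contributed by the source. The sketch below is a minimal, unoptimized illustration of that idea using a naive DFT; the function name and the `cutoff` parameter are hypothetical.

```python
import cmath
import math


def cepstral_envelope(log_spectrum, cutoff):
    """Smooth a full-length (symmetric) log-magnitude spectrum.

    log_spectrum: log magnitudes for all N DFT bins.
    cutoff: number of low-quefrency coefficients to keep (>= 1).
    """
    n = len(log_spectrum)
    # Real cepstrum: inverse DFT of the log-magnitude spectrum.
    cepstrum = [
        sum(log_spectrum[k] * cmath.exp(2j * math.pi * k * q / n) for k in range(n)) / n
        for q in range(n)
    ]
    # Lifter: keep quefrencies near zero and their mirrored counterparts.
    liftered = [
        c if (q < cutoff or q > n - cutoff) else 0.0
        for q, c in enumerate(cepstrum)
    ]
    # Forward DFT of the liftered cepstrum gives the smoothed log spectrum.
    return [
        sum(liftered[q] * cmath.exp(-2j * math.pi * k * q / n) for q in range(n)).real
        for k in range(n)
    ]
```

A small `cutoff` keeps only the slowly varying shape of the spectrum; a larger one lets more detail through. Linear prediction is another widely used way to obtain an envelope.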

The spectral parameter comparison unit 46 compares the spectral parameter output by the spectral parameter conversion unit 42 with the spectral parameter output by the reference spectral parameter calculation unit 44 and calculates their difference. When the difference falls below a predetermined threshold, or when a predetermined number of iterations has been completed, it outputs the vocal tract shape parameter at that point as the optimized vocal tract shape parameter 26. FIG. 9 shows the spectrum obtained when the optimization is carried out so that the envelope of the spectrum produced by the spectral parameter conversion unit 42 approaches the envelope of the reference spectrum shown in FIG. 8. The vocal tract cross-sectional area function at that point is shown as graph 100 in FIG. 10. For comparison, the initial vocal tract cross-sectional area function shown in FIG. 5 is plotted as broken-line graph 102.
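The patent does not define how the difference between two envelopes is reduced to a single magnitude that can be compared with a threshold. A root-mean-square difference over frequency bins, with both envelopes expressed in dB, is one plausible choice; the sketch below assumes that measure, and the function name is hypothetical.

```python
import math


def envelope_distance_db(synth_db, reference_db):
    """RMS difference between two spectral envelopes sampled on the
    same frequency grid, both in dB (an assumed distance measure)."""
    if len(synth_db) != len(reference_db):
        raise ValueError("envelopes must be sampled on the same grid")
    squared = [(s - r) ** 2 for s, r in zip(synth_db, reference_db)]
    return math.sqrt(sum(squared) / len(squared))
```

Identical envelopes give a distance of zero, so "difference approaching zero" corresponds directly to this value shrinking.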

The vocal tract shape parameter correction unit 48 optimizes the vocal tract shape parameter by simulated annealing, based on the spectral parameter difference calculated by the spectral parameter comparison unit 46. Optimization methods other than simulated annealing may also be used.

The speech synthesizer 20 according to the present embodiment operates as follows. The MRI imaging data 22 and the speech waveform 24 shown in FIG. 1 are assumed to have been prepared before the processing below begins.

Given the MRI imaging data 22, the initial value calculation unit 40 calculates the initial value of the vocal tract shape parameter (the vocal tract cross-sectional area function) from it and supplies this to the spectral parameter conversion unit 42. The spectral parameter conversion unit 42 converts this initial value into a spectral parameter and supplies it to the spectral parameter comparison unit 46.

Meanwhile, the reference spectral parameter calculation unit 44 calculates the reference spectral parameter from the speech waveform 24 and supplies it to the spectral parameter comparison unit 46.

The spectral parameter comparison unit 46 compares the spectral parameter supplied by the spectral parameter conversion unit 42 with the one supplied by the reference spectral parameter calculation unit 44. If the difference is smaller than the predetermined threshold, it outputs the vocal tract shape parameter at that point (that is, the initial vocal tract shape parameter) as the optimized vocal tract shape parameter 26. Otherwise, it passes the difference obtained from the comparison to the vocal tract shape parameter correction unit 48.

On receiving the spectral parameter difference data from the spectral parameter comparison unit 46, the vocal tract shape parameter correction unit 48 corrects the initial value of the vocal tract shape parameter by simulated annealing so that the difference approaches zero, and supplies the corrected parameter to the spectral parameter conversion unit 42.

This time, the spectral parameter conversion unit 42 calculates a spectral parameter from the vocal tract shape parameter supplied by the vocal tract shape parameter correction unit 48 and supplies it to the spectral parameter comparison unit 46.
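
As an illustration of how an area function can be turned into a spectrum, the sketch below uses the textbook lossless tube model with equal-length sections. This is an assumption: this section does not specify the conversion used by the spectral parameter conversion unit 42. The lip reflection coefficient of 0.9, the omission of the gain term, and all function names are likewise hypothetical.

```python
import cmath
import math


def reflection_coefficients(areas, lip_reflection=0.9):
    """Junction reflection coefficients for sections ordered glottis -> lips.

    r_k = (A_{k+1} - A_k) / (A_{k+1} + A_k); the last entry models the lips.
    """
    r = [(areas[k + 1] - areas[k]) / (areas[k + 1] + areas[k])
         for k in range(len(areas) - 1)]
    r.append(lip_reflection)
    return r


def denominator_polynomial(r):
    """Build D(z) with D_k(z) = D_{k-1}(z) + r_k z^-k D_{k-1}(z^-1), D_0(z) = 1."""
    d = [1.0]
    for rk in r:
        padded = d + [0.0]        # D_{k-1}(z), extended to degree k
        flipped = padded[::-1]    # coefficients of z^-k D_{k-1}(z^-1)
        d = [p + rk * q for p, q in zip(padded, flipped)]
    return d


def tube_spectrum_db(areas, n_points, lip_reflection=0.9):
    """Sample -20*log10|D(e^{jw})| at n_points frequencies in [0, pi); gain omitted."""
    d = denominator_polynomial(reflection_coefficients(areas, lip_reflection))
    spectrum = []
    for m in range(n_points):
        w = math.pi * m / n_points
        value = sum(c * cmath.exp(-1j * w * k) for k, c in enumerate(d))
        spectrum.append(-20.0 * math.log10(abs(value)))
    return spectrum
```

For a uniform tube all interior reflection coefficients are zero, so the resonances fall at odd multiples of the lowest one, which is the expected quarter-wavelength pattern and a convenient sanity check for the sign convention.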

From then on, the optimization of the vocal tract shape parameter by the spectral parameter comparison unit 46, the vocal tract shape parameter correction unit 48, and the spectral parameter conversion unit 42 is repeated. As soon as the difference between the spectral parameter calculated by the spectral parameter conversion unit 42 and the reference spectral parameter calculated by the reference spectral parameter calculation unit 44 falls below the threshold, the spectral parameter comparison unit 46 outputs the vocal tract shape parameter at that point as the optimized vocal tract shape parameter 26. Even if the difference never falls below the threshold, the spectral parameter comparison unit 46 likewise outputs the vocal tract shape parameter at that point as the optimized vocal tract shape parameter 26 once the iteration has been executed a predetermined number of times.
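
The iterative procedure described above can be sketched as a small simulated annealing loop. Everything specific here is an assumption made for illustration: the RMS difference measure, the Gaussian perturbation of one parameter at a time, the geometric cooling schedule, the default values, and the names `optimize_shape` and `to_spectrum`. The sketch also returns the best parameters seen so far, whereas the text outputs the parameters held when the end condition is met.

```python
import math
import random


def optimize_shape(initial, reference, to_spectrum, threshold=0.05,
                   max_iterations=5000, start_temp=1.0, cooling=0.999,
                   step=0.1, seed=0):
    """Adjust shape parameters so that to_spectrum(params) approaches reference.

    to_spectrum is a caller-supplied stand-in for the conversion unit.
    Returns (best_params, best_difference).
    """
    rng = random.Random(seed)

    def difference(params):
        spec = to_spectrum(params)
        return math.sqrt(sum((s - r) ** 2 for s, r in zip(spec, reference))
                         / len(reference))

    current = list(initial)
    current_diff = difference(current)
    best, best_diff = list(current), current_diff
    temp = start_temp
    for _ in range(max_iterations):
        if best_diff < threshold:          # end condition 1: close enough
            break
        candidate = list(current)
        i = rng.randrange(len(candidate))
        # Perturb one parameter; keep areas strictly positive.
        candidate[i] = max(1e-3, candidate[i] + rng.gauss(0.0, step))
        cand_diff = difference(candidate)
        delta = cand_diff - current_diff
        # Metropolis rule: always accept improvements, sometimes accept worse.
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            current, current_diff = candidate, cand_diff
            if current_diff < best_diff:
                best, best_diff = list(current), current_diff
        temp *= cooling
    # end condition 2: the loop also ends after max_iterations
    return best, best_diff
```

A toy call might pass `to_spectrum=lambda p: [10.0 * math.log10(a) for a in p]` with an initial vector taken from the image-derived area function. Starting from that initial value rather than a random one is what keeps the search near the speaker's actual vocal tract shape.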

In this way, an optimized vocal tract shape parameter 26 is obtained for a given speaker. At synthesis time, when the speech synthesis unit 28 is given the phoneme string 50 to be synthesized, it applies the vocal tract filtering defined by the optimized vocal tract shape parameter 26 to the signal from the sound source 30 and outputs the synthesized speech signal 32.
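
The vocal tract filtering step can be pictured as an all-pole filter applied to a source signal. The sketch below assumes that the optimized shape parameter has already been converted to denominator coefficients with a leading 1, and it uses an impulse train as a stand-in for the sound source 30. Both choices, and the function names, are hypothetical.

```python
def all_pole_filter(source, denominator):
    """Apply y[n] = x[n] - sum_{k>=1} d[k] * y[n-k], assuming denominator[0] == 1."""
    output = []
    for n, x in enumerate(source):
        y = x
        for k in range(1, len(denominator)):
            if n - k >= 0:
                y -= denominator[k] * output[n - k]
        output.append(y)
    return output


def impulse_train(length, period):
    """Crude voiced excitation: a unit impulse every `period` samples."""
    return [1.0 if n % period == 0 else 0.0 for n in range(length)]
```

Calling `all_pole_filter(impulse_train(800, 80), d)` with a coefficient list `d` would produce a short voiced segment whose spectral envelope follows `1 / |D|`.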

As described above, the speech synthesizer 20 according to the present embodiment takes the speaker's actual vocal tract shape parameter as the initial value and optimizes it so that speech synthesized from that parameter has a spectrum close to the reference spectrum obtained from the actual speech waveform. A value close to the vocal tract shape parameter during the speaker's actual utterance can therefore be estimated, which makes possible clearly articulated synthesized speech that is close to the speaker's own voice.

According to the speech synthesizer 20 of the present embodiment, speech close to the speaker's voice can be synthesized from captured cross-sectional images of the living body. Therefore, even for a patient whose speech organs have been surgically removed for some reason, natural speech closely resembling the patient's original voice before the removal can be synthesized on the basis of such captured cross-sectional images.

The speech synthesizer 20 described above can be implemented with computer hardware and a computer program executed on it. Ordinary computer hardware can be used, provided that it includes a device capable of receiving image information such as MRI images and speech waveform data.

Furthermore, digital signal processing toolkits for implementing the individual components of the speech synthesizer 20 are readily available. The distinguishing feature of the present embodiment lies in how those components are combined. Based on the above description, a person skilled in the art can therefore readily write a computer program that implements the speech synthesizer 20 according to the present embodiment.

The embodiment disclosed here is merely illustrative, and the present invention is not limited to it. The scope of the present invention is indicated by each claim, read in light of the detailed description of the invention, and includes all modifications within the meaning and scope equivalent to the wording of the claims.

FIG. 1 is a block diagram of a speech synthesizer according to an embodiment of the present invention.
FIG. 2 is a diagram showing an example of an MRI image.
FIG. 3 is a diagram showing an example of a vocal tract cross-section obtained from an MRI image.
FIG. 4 is a graph showing the time-variation pattern of the vocal tract cross-sectional area function.
FIG. 5 is a graph showing an example of a vocal tract cross-sectional area function obtained from an MRI image.
FIG. 6 is a graph showing an example of a spectrum obtained from the vocal tract cross-sectional area function.
FIG. 7 is a graph showing an example of a spectrum obtained from an actual speech waveform.
FIG. 8 is a graph showing the spectral envelope of the spectrum shown in FIG. 7.
FIG. 9 is a graph showing an example of a spectrum obtained from the vocal tract shape parameter optimized by the speech synthesizer according to an embodiment of the present invention.
FIG. 10 is a graph comparing the initial vocal tract cross-sectional area function with the vocal tract cross-sectional area function after optimization.

Explanation of symbols

20: speech synthesizer; 22: MRI imaging data; 24: speech waveform; 26: optimized vocal tract shape parameter; 28: speech synthesis unit; 30: sound source; 32: synthesized speech signal; 40: initial value calculation unit; 42: spectral parameter conversion unit; 44: reference spectral parameter calculation unit; 46: spectral parameter comparison unit; 48: vocal tract shape parameter correction unit

Claims

A vocal tract shape parameter estimation device comprising:
vocal tract shape parameter calculating means for calculating a vocal tract shape parameter of a living body from image information representing the internal structure of the living body during a predetermined utterance;
spectral parameter calculation means for calculating a spectral parameter from a speech waveform corresponding to the predetermined utterance; and
vocal tract shape parameter optimization means for optimizing the vocal tract shape parameter calculated by the vocal tract shape parameter calculating means, using the spectral parameter calculated by the spectral parameter calculation means as a reference.

The vocal tract shape parameter estimation device according to claim 1, wherein
the spectral parameter calculation means includes first spectral envelope calculating means for calculating a spectral envelope of the speech waveform corresponding to the predetermined utterance,
the vocal tract shape parameter optimization means includes:
second spectral envelope calculating means for calculating a spectral envelope of speech synthesized on the basis of a given vocal tract shape parameter;
spectral envelope comparison means for comparing the spectral envelope calculated by the first spectral envelope calculating means with the spectral envelope calculated by the second spectral envelope calculating means and calculating their difference;
vocal tract shape parameter correcting means for correcting the vocal tract shape parameter by the optimization method so that the difference calculated by the spectral envelope comparison means approaches zero, and supplying the corrected parameter to the second spectral envelope calculating means; and
end condition detecting means for outputting the vocal tract shape parameter corrected by the vocal tract shape parameter correcting means in response to a predetermined end condition being satisfied, and
the second spectral envelope calculating means calculates the spectral envelope using the vocal tract shape parameter calculated by the vocal tract shape parameter calculating means on the first calculation, and using the vocal tract shape parameter supplied by the vocal tract shape parameter correcting means on each subsequent calculation.

The vocal tract shape parameter estimation device according to claim 2, wherein the vocal tract shape parameter correcting means includes means for correcting the vocal tract shape parameter by simulated annealing so that the difference approaches zero.

The vocal tract shape parameter estimation device according to claim 2, wherein the end condition detecting means includes means for outputting the vocal tract shape parameter corrected by the vocal tract shape parameter correcting means in response to the magnitude of the difference calculated by the spectral envelope comparison means being smaller than a predetermined threshold.

The termination condition detection means is configured such that the magnitude of the difference calculated by the spectrum envelope comparison means is smaller than a predetermined threshold, or the second spectrum envelope calculation means, the spectrum envelope comparison means, and 3. A means for outputting the vocal tract shape parameter modified by the vocal tract shape parameter modifying means in response to the repetition of the processing by the vocal tract shape parameter modifying means being completed a predetermined number of times. Or the estimation apparatus of the vocal tract shape parameter of Claim 3.

The vocal tract shape parameter estimation device according to claim 1, wherein the vocal tract shape parameter calculating means includes:
vocal tract cross-sectional area function specifying means for specifying a vocal tract cross-sectional area function by calculating, from a plurality of stereoscopic cross-sectional images of the living body during the predetermined utterance, the cross-sectional areas at a plurality of locations along the vocal tract of the living body; and
means for extracting a time-variation pattern of the vocal tract cross-sectional area function specified by the vocal tract cross-sectional area function specifying means and outputting it as the vocal tract shape parameter.

A speech synthesizer comprising:
the vocal tract shape parameter estimation device according to any one of claims 1 to 6; and
speech synthesis means for performing speech synthesis using the vocal tract shape parameter estimated by the vocal tract shape parameter estimation device.

A computer program that, when executed by a computer, causes the computer to operate as the vocal tract shape parameter estimation device according to any one of claims 1 to 6.

A computer program that, when executed by a computer, causes the computer to operate as the speech synthesizer according to claim 7.