JP4644879B2 - Data generator for articulation parameter interpolation and computer program

Info

Publication number: JP4644879B2
Application number: JP2005329011A
Authority: JP
Inventors: Hironori Takemoto; Tatsuya Kitamura; Kiyoshi Honda; Parham Mokhtari
Original assignee: ATR Advanced Telecommunications Research Institute International
Current assignee: ATR Advanced Telecommunications Research Institute International
Priority date: 2005-11-14
Filing date: 2005-11-14
Publication date: 2011-03-09
Anticipated expiration: 2025-11-14
Also published as: JP2007133328A

Abstract

PROBLEM TO BE SOLVED: To provide an interpolation data generation apparatus capable of generating articulation parameter interpolation data that reflects the shape change of the speech organs during speech, using simpler processing than transmission imaging methods.

SOLUTION: The apparatus generates interpolation data used when articulation parameters for synthesizing speech that changes continuously from a first phoneme to a second phoneme are created by interpolating between the known articulation parameters for synthesizing the first and second phonemes. It includes a change rate calculation means 60 for calculating, from an input speech signal containing the consecutive first and second phonemes, the transition of the change rate of a predetermined acoustic feature from the first phoneme to the second phoneme, and an interpolation data creation means 64 for creating the interpolation data using the change rate.

COPYRIGHT: (C)2007,JPO&INPIT

Description

The present invention relates to a technique for generating parameters that modulate a sound source during speech synthesis so that the synthesized speech is smoother.

In recent years, various interfaces between humans and machine systems, typified by computer systems, have been proposed. Among them, speech has come into especially frequent use. Speech allows communication between humans and machines to take a form close to communication between humans.

There are two approaches to speech synthesis for realizing spoken communication. One extracts phoneme segments from the waveforms of prerecorded speech and concatenates them. The other synthesizes speech by simulating changes in the shape of the human speech organs.

Because the shape of the speech organs changes more gradually than the speech waveform does, the latter approach can synthesize speech that sounds smoother. For this reason it has attracted particular attention in recent years.

In this approach, a speech signal is synthesized by passing a signal from a sound source through a filter, formed by an electric circuit, whose characteristics change according to parameters representing the shape of the speech organs (for example, the vocal tract cross-sectional area function, vocal tract length, and opening area; hereinafter "articulation parameters"). For example, the articulation parameters differ between uttering the sound "a" and uttering the sound "i". Various sounds can therefore be synthesized by changing the articulation parameters. Today, of course, this approach is realized digitally with a computer and software.

To synthesize more natural continuous speech, the articulation parameters for one sound and those for the next sound must be interpolated in a way that matches the temporal course of the shape change of the speech organs during continuous speech. This requires some method of calculating a physical quantity that reflects the speed of that shape change during actual continuous speech, together with its change over time.

Non-Patent Document 1 discloses prior art that uses the speed of the shape change of the speech organs. To synthesize continuous speech, it interpolates the vocal tract cross-sectional area function over time, with the parameters for the temporal interpolation obtained from the temporal change of the average luminance of an MRI (Magnetic Resonance Imaging) video. Because the synthesis is based on actual articulatory movement, this method has the advantage of producing natural continuous sound.
Non-Patent Document 1: Hironori Takemoto, Kiyoshi Honda, Tatsuya Kitamura, Parham Mokhtari, Hiroyuki Hirai, "Speech synthesis by temporal interpolation of the vocal tract cross-sectional area function," Proceedings of the Acoustical Society of Japan, pp. 157-158, March 2005.

The technique disclosed in Non-Patent Document 1 has the great advantage of yielding natural continuous sound based on actual articulatory movement. However, it requires capturing MRI video while a person speaks. As is well known, capturing MRI video takes time. The required equipment is large, and using it involves cumbersome procedures. Moreover, processing MRI video images is laborious and costly.

It is desirable to obtain the same effect as the technique disclosed in Non-Patent Document 1 more easily, with simpler processing.

Accordingly, an object of the present invention is to provide an interpolation data generation device that can generate articulation parameter interpolation data reflecting the shape change of the speech organs during actual speech, using simpler processing than transmission imaging methods such as MRI video.

An articulation parameter interpolation data generation device according to a first aspect of the present invention generates interpolation data used when articulation parameters for synthesizing speech that changes continuously from a first phoneme to a second phoneme are generated by interpolating between the known articulation parameters for synthesizing the first and second phonemes. The device includes change rate calculation means for calculating, from an input speech signal containing the consecutive first and second phonemes, the transition of the change rate of a predetermined acoustic feature from the first phoneme to the second phoneme, and interpolation data creation means for creating the interpolation data using the change rate.

With this device, interpolation data that continuously connects the independent first and second phonemes can be generated from speech. Unlike MRI video, speech is easy to acquire. The change rate of the acoustic feature is also considered to reflect the shape change of the speech organs well. Interpolation data reflecting the shape change of the speech organs during actual speech can therefore be generated more simply than with a transmission imaging method such as MRI video.

Preferably, the change rate calculation means includes framing means for dividing the input speech signal into frames at predetermined time intervals, and spectral change rate calculation means for calculating a predetermined spectral change rate for each frame from the framed speech signal.

With this device, the spectral change rate of each frame can be calculated as the change rate described above. The spectral change rate is considered to reflect the speed of the shape change of the speech organs well, and established techniques can be used to calculate it. Interpolation data reflecting the shape change of the speech organs during actual speech can therefore be generated from acoustic features, which is simpler than using a transmission imaging method such as MRI video.
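The framing and per-frame change rate calculation described above can be sketched as follows. This is only an illustrative sketch: this passage does not give the formula for the spectral change rate, so the Euclidean distance between the log-magnitude spectra of adjacent frames is used here as an assumed stand-in, and the names `frame_signal`, `log_spectrum`, `spectral_change_rate`, `frame_len`, and `hop` are hypothetical.

```python
import cmath
import math


def frame_signal(signal, frame_len, hop):
    """Split a list of samples into fixed-length frames taken every `hop` samples."""
    return [signal[s:s + frame_len]
            for s in range(0, len(signal) - frame_len + 1, hop)]


def log_spectrum(frame):
    """Log-magnitude spectrum of one Hann-windowed frame (naive O(N^2) DFT)."""
    n = len(frame)
    windowed = [x * (0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)))
                for i, x in enumerate(frame)]
    spectrum = []
    for b in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * b * i / n)
                  for i, x in enumerate(windowed))
        # Small floor avoids log(0) for empty bins.
        spectrum.append(math.log(abs(acc) + 1e-9))
    return spectrum


def spectral_change_rate(signal, frame_len=64, hop=32):
    """Assumed measure: distance between the log spectra of adjacent frames."""
    spectra = [log_spectrum(f) for f in frame_signal(signal, frame_len, hop)]
    return [math.dist(a, b) for a, b in zip(spectra, spectra[1:])]
```

With a steady tone the value stays near zero, and it rises in the frames that straddle a change of sound, which is the behaviour the text relies on.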

More preferably, the spectral change rate calculation means includes frame combination means and function approximation means. From among the frames obtained by the framing means from the speech signal spanning the first phoneme to the second phoneme, the frame combination means combines a plurality of frames including first and second frames that give adjacent local minima of the predetermined spectral change rate and a third frame, located between them, that gives a local maximum of the predetermined spectral change rate. The function approximation means approximates the transition of the predetermined spectral change rate with a predetermined continuous function determined by the local maximum and local minima of the combined frames.

With this device, the transition of the predetermined spectral change rate is approximated by a continuous function whose two ends are the two local minima corresponding to the sound centers. An approximating function, which requires less data than the spectral change rate itself, can therefore be used when generating interpolation data, reducing the resources required. Interpolation data reflecting the shape change of the speech organs during actual speech can thus be generated by a simpler method.

More preferably, the frame combination means includes local minimum detection means for detecting the first and second frames based on the change rate, local maximum detection means for detecting the third frame based on the change rate, sound assignment means for assigning the first phoneme to the first frame and the second phoneme to the second frame, and frame sequence creation means for creating a frame sequence by combining the frames from the first frame to the second frame.

With this device, specific phonemes are assigned to the first and second frames, and a frame sequence is created so as to include both frames. The interpolation target can therefore be determined automatically when generating interpolation data. Interpolation data reflecting the shape change of the speech organs during actual speech can thus be generated by a simpler method.
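One way to picture the frame combination step is the sketch below. It assumes the spectral change rate has already been smoothed enough that each phoneme center shows up as exactly one interior local minimum; the function names and the dictionary layout are hypothetical and are not taken from the patent.

```python
def local_extrema(rates):
    """Return index lists (minima, maxima) of the interior local extrema."""
    minima, maxima = [], []
    for i in range(1, len(rates) - 1):
        if rates[i] < rates[i - 1] and rates[i] <= rates[i + 1]:
            minima.append(i)
        elif rates[i] > rates[i - 1] and rates[i] >= rates[i + 1]:
            maxima.append(i)
    return minima, maxima


def build_frame_sequences(rates, phonemes):
    """Pair adjacent minima, label them with phonemes, and find the peak between."""
    minima, maxima = local_extrema(rates)
    if len(minima) != len(phonemes):
        raise ValueError("expected one local minimum per phoneme")
    sequences = []
    for idx in range(len(minima) - 1):
        first, second = minima[idx], minima[idx + 1]
        peaks = [i for i in maxima if first < i < second]
        if not peaks:
            continue  # no usable maximum between these two minima
        sequences.append({
            "first": (first, phonemes[idx]),        # first frame and its phoneme
            "second": (second, phonemes[idx + 1]),  # second frame and its phoneme
            "third": max(peaks, key=lambda i: rates[i]),  # frame of the maximum
            "frames": list(range(first, second + 1)),
        })
    return sequences
```

For example, `build_frame_sequences([3, 1, 2, 5, 3, 1, 4, 6, 2, 0.5, 2], ["a", "i", "u"])` yields two sequences, one from /a/ to /i/ and one from /i/ to /u/.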

More preferably, the function approximation means includes first approximation means and second approximation means. Based on the change rates in the first and third frames, the first approximation means obtains a first approximating function that takes a predetermined constant in the first frame and, in the third frame, a value having a predetermined relationship to the change rate in the third frame. Based on the change rates in the third and second frames, the second approximation means obtains a second approximating function that takes a predetermined constant in the second frame and, in the third frame, the same value as the first approximating function.

With this device, the type of approximating function can be changed appropriately based on the change rate in the relevant frames. The point where the change rate peaks is not necessarily midway between the two phonemes being interpolated and often lies before or after the midpoint. Even then, applying different approximating functions before and after the peak allows the change rate to be approximated accurately with simple functions. Approximation accuracy therefore improves, and the time needed to generate interpolation data using the approximating functions can be shortened. Interpolation data reflecting the shape change of the speech organs during actual speech can thus be generated by a simpler method.

More preferably, the interpolation data creation means includes normalized integration means for obtaining a normalized integral function by normalizing the integral, from the first frame to the second frame, of the predetermined function obtained by the function approximation means.

With this device, only the normalized integral function is calculated as interpolation data, so generating the interpolation data takes less time than when the articulation parameters themselves are also interpolated. When the interpolation data is stored in a database (hereinafter "DB"), the DB capacity can also be reduced. Interpolation data reflecting the shape change of the speech organs during actual speech can thus be generated by simpler processing.

More preferably, the interpolation data creation means further includes means for calculating the mixing ratio between the first phoneme and the second phoneme at a predetermined time by interpolating between the articulation parameters for the first phoneme and those for the second phoneme using the normalized integral function.

With this device, only the mixing ratio is calculated as interpolation data, so generating the interpolation data takes less time. When the interpolation data is stored in the DB, the DB capacity can also be reduced. Furthermore, compared with stopping at the calculation of the normalized integral function, the mixing ratio need not be calculated at synthesis time. Interpolation data can therefore be generated by simpler processing that both reflects the shape change of the speech organs during actual speech and reduces the processing load at synthesis time.

More preferably, the interpolation data creation means further includes means for calculating the vocal tract cross-sectional area function, which is an articulation parameter at a predetermined time, by interpolating between the articulation parameters for the first phoneme and those for the second phoneme using the mixing ratio.

With this device, the vocal tract cross-sectional area function is calculated as interpolation data from the acoustic features of speech. When this function is used in subsequent speech synthesis, the time required for synthesis can be reduced, so speech can be synthesized efficiently. As a result, interpolation data reflecting the shape change of the speech organs during actual speech can be generated by processing that is simpler than processing based on MRI video or the like.

A computer program according to a second aspect of the present invention, when executed by a computer, causes the computer to operate as any of the articulation parameter interpolation data generation devices described above. It therefore provides the same effects as those devices.

<Interpolation method based on spectral change rate>
In a speech synthesis system according to an embodiment of the present invention, the vocal tract cross-sectional area function, which serves as the interpolation parameter, is calculated from the temporal change of a speech signal, and speech is synthesized using the result. A speech signal is far easier to obtain than MRI video, and its signal processing is also easy. Before the details of the embodiment are described, the principle of the interpolation method it employs is explained. In the following description and drawings, the same parts are given the same reference numbers.

(1) Calculation of mixing ratio
The mixing ratio is the proportion in which the articulation parameters of the first and second phonemes are mixed when the articulation parameters for the intermediate sound, produced while the utterance moves from the first phoneme to the second, are generated by interpolating between the articulation parameters of the two phonemes. In an actual utterance moving from "a" to "i", for example, the "a" component gradually decreases while the "i" component increases, so the sound changes continuously. With the mixing ratio, the articulation parameters for generating continuous sound between vowel centers can be interpolated smoothly. In the present embodiment, the mixing ratio is approximated by a value obtained from the spectral change rate of the speech. The spectral change rate indicates how fast the spectrum of the speech is changing (its formula is given later) and reflects the movement of the speech organs during speech. The method of calculating the mixing ratio from the spectral change rate is described below.

FIG. 1 shows an example speech waveform and the spectral change rate obtained from it. In FIG. 1, the frames labeled /a/, /i/, /u/, /e/, and /o/ mark the positions of the vowel centers of the respective vowels. While a vowel is being uttered, the shape of the speech organs is stable, so the spectral change rate is small. During the transition from one vowel to another, the shape of the speech organs changes. The speed of this change is known not to be constant: it is small at the beginning and end of the transition and peaks somewhere in between. In the example of FIG. 1, the speed of the shape change peaks at the local maxima 30, 32, 34, and 36 of the spectral change rate.

As described above, the mixing ratio is a parameter for interpolating between the articulation parameters of adjacent vowel centers to synthesize continuous, natural speech. To calculate it, a frame sequence is taken so that the frames corresponding to the centers of two adjacent vowels (the frames where the spectral change rate is at a local minimum) lie at its two ends, and the spectral change rate over that sequence is used to calculate the mixing ratio for interpolating the speech in between. The last frame of the frame sequence, however, is also the first frame of the next frame sequence and is therefore excluded from the mixing ratio calculation.

The speech signal to be processed often contains noise, which distorts the graph of the spectral change rate obtained from it. To remove the influence of this noise and to simplify computation, the curve of the spectral change rate is approximated by a function. In the present embodiment, a Gaussian function is used.

FIG. 2 shows an example of the spectral change rate obtained from a speech signal in which the sound changes from the vowel /a/ to the vowel /i/ (spectral change rate 42), together with the Gaussian function 44 that approximates its curve. In FIG. 2, the vowel center of /a/ is placed at the left end of the graph and the vowel center of /i/ at the right end. For the reason given above, the right end of the graph is the frame immediately before the frame containing the vowel center of /i/.

A Gaussian distribution is symmetric. The spectral change rate, however, generally has different distributions to the left and right of the local-maximum frame 40, so different Gaussian functions are assigned to the two sides. The Gaussian functions used for this approximation are determined from the local maximum and local minima of the spectral change rate. A Gaussian function approaches the x-axis asymptotically but never reaches 0. In the Gaussian approximation, on the other hand, it is desirable that the articulation parameters at the vowel positions at both ends of the frame sequence coincide with the articulation parameters of the respective vowels, which requires the approximating function to be 0 at those positions. Therefore, after the Gaussian function is obtained, a constant is subtracted from it, or it is multiplied by a predetermined factor and a constant is then subtracted, so that the function is 0 at both ends of the frame sequence (the vowel positions).
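The two-sided Gaussian fit with zero endpoints might be sketched as follows. The sketch assumes the vowel centers sit at frames 0 and m and the maximum at frame n (with 0 < n < m); the parameter `sigma_ratio`, which sets the Gaussian width relative to each side, is a hypothetical choice and does not come from the patent.

```python
import math


def half_gaussian(distance, width, sigma_ratio=0.4):
    """Gaussian of the distance from the peak, shifted and rescaled so that
    it equals 1 at the peak (distance 0) and 0 at the vowel center (distance == width)."""
    sigma = sigma_ratio * width
    raw = math.exp(-0.5 * (distance / sigma) ** 2)
    floor = math.exp(-0.5 * (width / sigma) ** 2)  # value the raw Gaussian has at the end
    return (raw - floor) / (1.0 - floor)


def approximate_change_rate(m, n, peak):
    """Approximated spectral change rate y(k) for k = 0 .. m-1.

    A different half-Gaussian is used on each side of the maximum at k = n,
    so the curve may be asymmetric while still meeting at the same peak value.
    """
    if not 0 < n < m:
        raise ValueError("the maximum must lie strictly between the vowel centers")
    y = []
    for k in range(m):
        width = n if k < n else m - n
        y.append(peak * half_gaussian(abs(k - n), width))
    return y
```

Subtracting the floor and rescaling corresponds to the adjustment described above that forces the function to 0 at the vowel positions.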

After the Gaussian function approximating the spectral change rate is obtained in this way, the mixing ratio is calculated from it as follows.

The upper part of FIG. 3 shows the curve of the spectral change rate 50 shown in FIG. 2 and its approximation by the corresponding Gaussian functions (approximating function 52). Integrating the approximating function 52 and normalizing so that the final value of the integral is 1 yields the curve 54 (lower part of FIG. 3), which gives the mixing ratio during the transition from the vowel /a/ to the vowel /i/.
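In discrete form, integrating and normalizing can be approximated by a running sum divided by the total. The sketch below assumes the approximating function is available as a list of per-frame values; the name `mixing_ratio` is hypothetical, and the running sum is a simple stand-in for whatever integration rule the equations of the patent use.

```python
from itertools import accumulate


def mixing_ratio(y):
    """Normalized running sum of the approximated change rate.

    Starts at y[0] / total (0 when the curve is 0 at the first vowel center)
    and reaches exactly 1 at the last frame.
    """
    total = sum(y)
    if total <= 0:
        raise ValueError("the approximated change rate must have a positive sum")
    return [c / total for c in accumulate(y)]
```

Because the change rate is non-negative, the result never decreases, so the contribution of the second vowel only grows over the transition.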

The calculation from the assignment of the Gaussian functions to the mixing ratio can be summarized as follows. Suppose there are m frames between adjacent vowel centers and the local maximum of the spectral change rate lies at the n-th frame. If the frame number between the vowels is k (0 ≤ k < n) and the function corresponding to the curve 60 (that is, the mixing ratio) is br(k), then br(k) is expressed by the following equations (1) and (2).

Here, y(k) is expressed by the following equations (3) and (4).

Here, gw_n(k) is the k-th point of an n-point Gaussian window.

(2) Calculation of vocal tract cross-sectional area function
Next, a method for calculating the vocal tract cross-sectional area function using the mixing ratio br is described.

FIGS. 4 and 5 show examples of the vocal tract cross-sectional area function. The vocal tract cross-sectional area function is a function of time and gives the relationship, at a given time, between the distance from the glottis and the cross-sectional area of the vocal tract. In FIGS. 4 and 5, the vertical axis is the vocal tract cross-sectional area and the horizontal axis is the distance from the glottis. FIG. 4 shows the function at time t1, and FIG. 5 shows it at time t2. Obtaining the vocal tract cross-sectional area function requires, for each time, the distance from the glottis and the cross-sectional area of the vocal tract at that position. In the present embodiment, the vocal tract is divided into a predetermined number of sections, and the function is represented by the cross-sectional area of each section at each time. The number of sections is fixed, and changes in vocal tract length are handled by changing the section length.

In the following, the mixing ratio br is treated as a function br(t) of time t. The distance from the glottis (xpos) of the n-th section counted from the glottis is then expressed by the following equation (5).

The cross-sectional area of the vocal tract (area) is expressed by the following equation (6).

Using equations (1) to (6) above, the cross-sectional area of each section of the vocal tract at any point during the transition from one vowel to another can be calculated by interpolating between the articulation parameters of the two vowels.
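Equations (5) and (6) are not reproduced in this text, so the following is only a sketch of one plausible reading: a linear blend, weighted by the mixing ratio, of the section areas and of the vocal tract lengths of the two vowels, with a fixed number of sections. The function name, its arguments, and the convention that `xpos` is measured to the far edge of each section are all assumptions.

```python
def interpolate_area_function(areas_1, areas_2, length_1, length_2, br):
    """Blend two vocal tract area functions with mixing ratio br (0 <= br <= 1).

    areas_1, areas_2: per-section cross-sectional areas of the two vowels,
                      ordered from the glottis (same number of sections).
    length_1, length_2: vocal tract lengths of the two vowels.
    Returns (xpos, area), one value per section.
    """
    if len(areas_1) != len(areas_2):
        raise ValueError("both vowels must use the same number of sections")
    sections = len(areas_1)
    # The section count is fixed, so a change in tract length changes section length.
    tract_length = (1.0 - br) * length_1 + br * length_2
    section_length = tract_length / sections
    xpos = [(n + 1) * section_length for n in range(sections)]
    area = [(1.0 - br) * a1 + br * a2 for a1, a2 in zip(areas_1, areas_2)]
    return xpos, area
```

With `br = 0` the result equals the first vowel's area function and with `br = 1` the second vowel's, which matches the requirement that the interpolated parameters coincide with those of the vowels at the two ends of the frame sequence.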

＜第１の実施の形態＞
以下、上記した補間方法を利用した、本発明の一実施の形態に係る音声合成装置について詳述する。 <First Embodiment>
Hereinafter, a speech synthesizer according to an embodiment of the present invention using the above-described interpolation method will be described in detail.

［構成］
（１）音声合成システム７０
図６に、本発明の第１の実施の形態に係る音声合成システム７０のブロック図を示す。なお、本実施の形態では、ある話者が連続音声を発声したものを録音し、その録音から各母音の調音パラメータと母音間の混合比とを算出する。その連続音声の発話テキスト８４は予め与えられるものとする。 [Configuration]
(1) Speech synthesis system 70
FIG. 6 shows a block diagram of the speech synthesis system 70 according to the first embodiment of the present invention. In the present embodiment, continuous speech uttered by a speaker is recorded, and the articulation parameters of each vowel and the mixing ratio between vowels are calculated from the recording. The utterance text 84 of the continuous speech is assumed to be given in advance.

図６を参照して、音声合成システム７０は、音声信号と発話テキスト８４とを用いて、音声合成のための調音パラメータを補間するために用いられるデータを生成するための補間用データ生成装置８６と、補間用データ生成装置８６によって生成されたデータを保持するための補間用ＤＢ８８と、入力されたテキスト９０に対し、補間用ＤＢ８８内のデータを用いて合成音声信号９４を出力するための音声合成装置９２とを含む。 Referring to FIG. 6, the speech synthesis system 70 includes: an interpolation data generation device 86 for generating, from a speech signal and the utterance text 84, data used to interpolate articulation parameters for speech synthesis; an interpolation DB 88 for holding the data generated by the interpolation data generation device 86; and a speech synthesizer 92 for outputting a synthesized speech signal 94 for an input text 90 using the data in the interpolation DB 88.

（２）補間用データ生成装置８６
図７に、補間用データ生成装置８６のブロック図を示す。補間用データ生成装置８６は、音声信号からそのスペクトル変化率を生成するためのスペクトル変化率生成部６０と、生成されたスペクトル変化率を保持するためのスペクトル変化率ＤＢ６２と、スペクトル変化率と発話テキスト８４とから、各母音中心の調音パラメータと補間用データとを生成するための補間用データ生成部６４とを含む。母音中心の調音パラメータも補間比が０又は１の補間用データと考える事ができるので、本実施の形態では母音中心の調音パラメータも含めて補間用データと呼ぶ。 (2) Interpolation data generator 86
FIG. 7 shows a block diagram of the interpolation data generation device 86. The interpolation data generation device 86 includes a spectrum change rate generation unit 60 for generating a spectrum change rate from the speech signal, a spectrum change rate DB 62 for holding the generated spectrum change rate, and an interpolation data generation unit 64 for generating the articulation parameters at each vowel center and the interpolation data from the spectrum change rate and the utterance text 84. Since the articulation parameters at a vowel center can be regarded as interpolation data with an interpolation ratio of 0 or 1, in this embodiment the term "interpolation data" also covers the articulation parameters at the vowel centers.

（３）スペクトル変化率生成部６０
図８に、スペクトル変化率生成部６０のブロック図を示す。スペクトル変化率生成部６０は、与えられた音声信号をフレーム化するためのフレーム化部１００と、フレーム化された音声信号のスペクトルを抽出するためのスペクトル抽出部１０２と、抽出されたスペクトルからケプストラムを算出するためのケプストラム算出部１０４と、算出されたケプストラムから音声特徴量の差であるデルタケプストラムを算出するためのデルタケプストラム算出部１０６と、算出されたデルタケプストラムからスペクトル変化率を算出し、その算出結果をスペクトル変化率ＤＢ６２に与えるためのスペクトル変化率算出部１０８とを含む。 (3) Spectrum change rate generation unit 60
FIG. 8 shows a block diagram of the spectrum change rate generation unit 60. The spectrum change rate generation unit 60 includes a framing unit 100 for dividing a given speech signal into frames, a spectrum extraction unit 102 for extracting the spectrum of the framed speech signal, a cepstrum calculation unit 104 for calculating a cepstrum from the extracted spectrum, a delta cepstrum calculation unit 106 for calculating, from the calculated cepstrum, a delta cepstrum, which is a difference in the speech feature amount, and a spectrum change rate calculation unit 108 for calculating the spectrum change rate from the calculated delta cepstrum and supplying the result to the spectrum change rate DB 62.

ここで、デルタケプストラムはフレーム間の音声特徴量であるケプストラムの差を表わすパラメータであるので、デルタケプストラムの値が大きいほどフレーム間のスペクトル変化が大きいと言える。デルタケプストラムｄｃｅｐ（ｎ，ｍ）は次の式（７）で表わされる。ここでｍはフレーム番号を表わし、ｎはケプストラムの次数を表わす。 Here, since the delta cepstrum is a parameter representing a difference in cepstrum, which is an audio feature amount between frames, it can be said that the larger the value of the delta cepstrum, the larger the spectrum change between frames. The delta cepstrum dcep (n, m) is expressed by the following equation (7). Here, m represents the frame number, and n represents the order of the cepstrum.

このデルタケプストラムを使用して、スペクトル変化率ｓ（ｍ）は次の式（８）で表わされる。

Using this delta cepstrum, the spectral change rate s (m) is expressed by the following equation (8).
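Equations (7) and (8) are not reproduced in this text. As a rough illustration only, the sketch below assumes that the delta cepstrum is a symmetric difference of cepstral coefficients over neighboring frames and that the spectral change rate s(m) is the sum of squared delta-cepstral coefficients in frame m. Both choices, the window parameter `k`, and the function names are assumptions, and the patent's actual formulas may differ (for example, by using a regression over several frames or a different norm).

```python
def delta_cepstrum(cep, k=1):
    """Return dcep[m][n] for a cepstrum cep[m][n] (frame m, order n).

    Assumed form: symmetric difference over +/-k frames, with the frame
    index clamped at the first and last frame.
    """
    n_frames = len(cep)
    dcep = []
    for m in range(n_frames):
        prev = cep[max(m - k, 0)]
        nxt = cep[min(m + k, n_frames - 1)]
        dcep.append([(b - a) / (2.0 * k) for a, b in zip(prev, nxt)])
    return dcep


def spectral_change_rate(dcep):
    """Assumed form of s(m): sum over orders n of dcep(n, m) squared."""
    return [sum(d * d for d in frame) for frame in dcep]
```

Under these assumptions a stretch of frames with a constant cepstrum gives s(m) = 0, which is consistent with the description's reading of small s(m) as a vowel center and large s(m) as a transition between vowels.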

（４）補間用データ生成部６４
図９に、補間用データ生成部６４のブロック図を示す。補間用データ生成部６４は、スペクトル変化率ＤＢ６２に保持されたスペクトル変化率を参照して、母音中心を示す調音運動極小値を含むフレームを検出するための極小値検出部１１０と、スペクトル変化率ＤＢ６２に保持されたスペクトル変化率を参照して、発話母音間の境界を示す調音運動極大値を含むフレームを検出するための極大値検出部１１２と、発話テキストを比較参照して、調音運動が極小値になるフレームに順に母音を割当てるための母音割当部１１４と、全フレームのうち、第１の音素の母音中心であるスペクトル変化率最小値を含むフレームから始まって、スペクトル変化率最大値を含むフレームを経て、第２の音素の母音中心であるスペクトル最小値を含むフレームの一つ前のフレームで終わる様に複数のフレームを一組のフレームシーケンスとして組合せるためのフレームシーケンス作成部１１６とを含む。

(4) Interpolation data generation unit 64
FIG. 9 shows a block diagram of the interpolation data generation unit 64. The interpolation data generation unit 64 includes: a minimum value detection unit 110 for detecting, with reference to the spectrum change rate held in the spectrum change rate DB 62, frames containing a minimum of articulatory movement, which indicates a vowel center; a maximum value detection unit 112 for detecting, with reference to the spectrum change rate held in the spectrum change rate DB 62, frames containing a maximum of articulatory movement, which indicates a boundary between uttered vowels; a vowel assignment unit 114 for assigning vowels in order to the frames in which articulatory movement is at a minimum, by comparison with the utterance text; and a frame sequence creation unit 116 for combining frames into one frame sequence that starts at the frame containing the spectrum change rate minimum at the vowel center of the first phoneme, passes through the frame containing the spectrum change rate maximum, and ends at the frame immediately before the frame containing the spectrum change rate minimum at the vowel center of the second phoneme.

補間用データ生成部６４はさらに、フレームシーケンス作成部１１６により組合わされたフレームシーケンス中で先頭のフレームに極小値を、スペクトル変化率の極大値のフレームにガウス関数の極大値を、それぞれ持ち、かつ極小値が０となるガウス関数を割当てるためのガウス関数割当部１１８と、スペクトル変化率の極大値のフレームに極大値を、次のフレームシーケンスの先頭のフレームに極小値を、それぞれ持ち、かつ極小値が０となるガウス関数を割当てるためのガウス関数割当部１２０とを含む。ガウス関数割当部１１８により割当てられたガウス関数のうちの前半部（極小値から極大値まで）と、ガウス関数割当部１２０により割当てられたガウス関数のうち後半部（極大値から極小値の１フレーム前まで）とを極大値部分で接続する事により、発話器官の動きを表す関数が得られる。以下、この関数を「発話器官変化率関数」と呼ぶ。 The interpolation data generation unit 64 further includes: a Gaussian function assignment unit 118 for assigning a Gaussian function that has its minimum at the first frame of the frame sequence assembled by the frame sequence creation unit 116 and its maximum at the frame where the spectrum change rate is at its maximum, with the minimum value equal to 0; and a Gaussian function assignment unit 120 for assigning a Gaussian function that has its maximum at the frame where the spectrum change rate is at its maximum and its minimum at the first frame of the next frame sequence, with the minimum value equal to 0. Joining the first half (from the minimum to the maximum) of the Gaussian function assigned by the Gaussian function assignment unit 118 and the second half (from the maximum to one frame before the minimum) of the Gaussian function assigned by the Gaussian function assignment unit 120 at the maximum yields a function representing the movement of the speech organs. Hereinafter, this function is referred to as the “speech organ change rate function”.

補間用データ生成部６４はさらに、この発話器官変化率関数を先頭フレームから最終フレーム（次のフレームシーケンスの１つ前のフレーム）まで積分して最大値が１になる様に正規化して得られる関数（以下これを「発話器官形状関数」と呼ぶ。）を出力するための積分・正規化部１２２と、積分・正規化部１２２により出力された発話器官形状関数から各フレームにおける混合比を算出するための混合比算出部１２４と、種々の音素に対して予め算出された調音パラメータを保持するための調音パラメータＤＢ１２８と、母音割当部１１４で割当てられた音素に対応する調音パラメータを調音パラメータＤＢ１２８から取り出し、対応する混合比を用いた補間により各時刻における各音素間の声道断面積関数を算出し、その結果を補間用ＤＢ８８に与えるための声道断面積関数算出部１２６とを含む。この声道断面積関数が、本実施の形態における補間用データである。 The interpolation data generation unit 64 further includes: an integration/normalization unit 122 for outputting a function obtained by integrating the speech organ change rate function from the first frame to the last frame (the frame immediately before the start of the next frame sequence) and normalizing the result so that its maximum value is 1 (hereinafter referred to as the “speech organ shape function”); a mixing ratio calculation unit 124 for calculating the mixing ratio in each frame from the speech organ shape function output by the integration/normalization unit 122; an articulation parameter DB 128 for holding articulation parameters calculated in advance for various phonemes; and a vocal tract cross-sectional area function calculation unit 126 for retrieving from the articulation parameter DB 128 the articulation parameters corresponding to the phonemes assigned by the vowel assignment unit 114, calculating the vocal tract cross-sectional area function between the phonemes at each time by interpolation using the corresponding mixing ratio, and supplying the result to the interpolation DB 88. This vocal tract cross-sectional area function is the interpolation data in the present embodiment.

（５）音声合成装置９２
図１０に、音声合成装置９２のブロック図を示す。図１０を参照して、テキスト９０には、各音素を発声すべき時間情報が付されている。音声合成装置９２は、入力されたテキスト９０を音素単位に分割し、隣接する２音素ごとに、当該２音素間の補間用データを補間用ＤＢ８８から抽出して、補間調音パラメータ１３２として出力するための補間パラメータ抽出部１３０と、所定周期のクロック信号を発生するためのクロック部１４０と、合成する連続音声の発音長等に応じて、クロック部１４０からのクロックにより定まるタイミングで、補間調音パラメータ１３２を順番に出力してフィルタ１３８に与えるための出力部１３４と、発話のための音源となる音源信号を出力するための音源１３６と、出力部１３４によって与えられる調音パラメータに従って変化する特性で音源１３６からの音源信号を変調し、合成音声信号９４を出力するためのフィルタ１３８とを含む。 (5) Speech synthesizer 92
FIG. 10 shows a block diagram of the speech synthesizer 92. Referring to FIG. 10, the text 90 is annotated with timing information indicating when each phoneme is to be uttered. The speech synthesizer 92 includes: an interpolation parameter extraction unit 130 for dividing the input text 90 into phonemes, extracting from the interpolation DB 88, for each pair of adjacent phonemes, the interpolation data between the two phonemes, and outputting it as interpolated articulation parameters 132; a clock unit 140 for generating a clock signal of a predetermined period; an output unit 134 for supplying the interpolated articulation parameters 132 in order to a filter 138 at timings determined by the clock from the clock unit 140, in accordance with, for example, the duration of the continuous speech to be synthesized; a sound source 136 for outputting a sound source signal for the utterance; and the filter 138 for modulating the sound source signal from the sound source 136 with characteristics that change according to the articulation parameters supplied by the output unit 134, and outputting the synthesized speech signal 94.

なお、音声にイントネーションを付けたり音の高さを調整する必要がある場合、音源１３６からの音源信号の周波数を変化させる必要がある。 Note that when it is necessary to add intonation to the sound or adjust the pitch, it is necessary to change the frequency of the sound source signal from the sound source 136.

［動作］
本実施の形態に係る音声合成システム７０の動作には、２つの局面がある。すなわち、第１の局面は、音声信号及び対応する発話テキストから、調音パラメータ補間用のデータ（補間後の調音パラメータ）を生成し、補間用ＤＢを作成する局面（補間用データ生成装置８６の動作に相当する。）である。第２の局面は、補間用ＤＢ８８のデータを用いて、入力テキスト９０の連続音声を合成する局面（音声合成装置９２の動作に相当する。）である。以下、順に説明する。 [Operation]
The operation of the speech synthesis system 70 according to the present embodiment has two phases. In the first phase, data for articulation parameter interpolation (the interpolated articulation parameters) is generated from a speech signal and the corresponding utterance text, and the interpolation DB is created; this corresponds to the operation of the interpolation data generation device 86. In the second phase, continuous speech for the input text 90 is synthesized using the data in the interpolation DB 88; this corresponds to the operation of the speech synthesizer 92. The two phases are described in order below.

（１）補間用データ生成装置８６の動作
本実施の形態に係る補間用データ生成装置８６は以下の様に動作する。 (1) Operation of Interpolation Data Generation Device 86 The interpolation data generation device 86 according to the present embodiment operates as follows.

図８を参照して、音声信号が与えられると、フレーム化部１００は、音声信号をフレーム化する。スペクトル抽出部１０２は、フレーム化された音声信号の各フレームから発話スペクトルを抽出する。ケプストラム算出部１０４が、各フレームから抽出された発話スペクトルに基づいてフレームごとのケプストラムを算出する。デルタケプストラム算出部１０６は、算出されたケプストラムから前述した式（７）によってデルタケプストラムを算出する。スペクトル変化率算出部１０８は、算出されたデルタケプストラムから前述した式（８）によってスペクトル変化率を算出する。算出されたスペクトル変化率は、スペクトル変化率ＤＢ６２に与えられ、保持される。 Referring to FIG. 8, when an audio signal is given, framing section 100 frames the audio signal. The spectrum extraction unit 102 extracts a speech spectrum from each frame of the framed voice signal. The cepstrum calculation unit 104 calculates a cepstrum for each frame based on the speech spectrum extracted from each frame. The delta cepstrum calculation unit 106 calculates a delta cepstrum from the calculated cepstrum according to the above-described equation (7). The spectrum change rate calculation unit 108 calculates the spectrum change rate from the calculated delta cepstrum according to the above-described equation (8). The calculated spectrum change rate is given to the spectrum change rate DB 62 and held.

次に、図９を参照して、極小値検出部１１０は、スペクトル変化率ＤＢ６２に保持されたスペクトル変化率から、変化率が極小であるフレームを検出する。同様に、極大値検出部１１２は、スペクトル変化率ＤＢ６２に保持されたスペクトル変化率から、変化率が極大であるフレームを検出する。 Next, referring to FIG. 9, the minimum value detection unit 110 detects a frame having a minimum change rate from the spectrum change rate held in the spectrum change rate DB 62. Similarly, the maximum value detection unit 112 detects a frame having a maximum change rate from the spectrum change rate held in the spectrum change rate DB 62.

母音割当部１１４は、検出された調音運動極小値に発話テキスト８４に含まれた音を順番に対応させて、極小値である母音中心に該当する音を割当てる。 The vowel assignment unit 114 assigns the sound corresponding to the vowel center that is the minimum value by sequentially matching the sounds included in the utterance text 84 to the detected articulation minimum value.

フレームシーケンス作成部１１６は、スペクトル変化率が極小となるフレームのうち、隣接する極小値を与えるフレームの対と、その間のすべてのフレームとからなるフレームシーケンスを作成する。その中にはスペクトル変化率が極大になるフレームも含まれる。この処理により、一つのフレームシーケンスの組が作成される。この処理を繰返す事により、すべてのフレームシーケンスの組を作る。 The frame sequence creation unit 116 creates a frame sequence composed of a pair of frames that give adjacent minimum values among frames in which the spectrum change rate is minimum, and all frames in between. This includes frames in which the spectral change rate is maximized. By this process, a set of one frame sequence is created. By repeating this process, a set of all frame sequences is created.
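The detection of minima and the grouping of frames into frame sequences can be sketched as follows. This is a minimal illustration, assuming that a local minimum is a frame whose spectrum change rate is strictly lower than both neighbors, and following the configuration section in ending each sequence one frame before the next minimum. The function names are hypothetical, and a practical implementation would likely need smoothing or thresholds to reject spurious minima.

```python
def local_minima(s):
    """Indices m where s[m] is strictly lower than both neighbours."""
    return [m for m in range(1, len(s) - 1) if s[m - 1] > s[m] < s[m + 1]]


def frame_sequences(s):
    """Return (start, peak, end) triples, one per pair of adjacent minima.

    start: frame of a minimum (vowel center of the first phoneme).
    peak:  frame with the largest change rate before the next minimum.
    end:   frame immediately before the next minimum.
    """
    minima = local_minima(s)
    sequences = []
    for start, nxt in zip(minima, minima[1:]):
        peak = max(range(start, nxt), key=lambda m: s[m])
        sequences.append((start, peak, nxt - 1))
    return sequences
```

For example, for s = [3, 1, 2, 5, 2, 0.5, 4, 1, 3] the minima fall at frames 1, 5, and 7, so two frame sequences are produced, covering frames 1 to 4 and frames 5 to 6.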

ガウス関数割当部１１８は、組合された各組のフレームシーケンスのうちで、先頭のフレームと極大値を含むフレームとの間のスペクトル変化率にガウス関数を割当てる。同様に、ガウス関数割当部１２０は、極大値を含むフレームと、当該フレームシーケンスの最後のフレームより一つ前のフレームまでに含まれるスペクトル変化率にガウス関数を割当てる。積分・正規化部１２２は、この二つのガウス関数を接続して得られる発話器官変化率関数についてフレームシーケンスの先頭から最後（最後の極小値を与えるフレームの一つ前のフレーム）まで積分し、さらに最大値が１になる様に正規化する事により、発話器官形状関数を生成する。 The Gaussian function assigning unit 118 assigns a Gaussian function to the spectrum change rate between the first frame and the frame including the maximum value among the combined frame sequences. Similarly, the Gaussian function assignment unit 120 assigns a Gaussian function to the spectrum change rate included in the frame including the maximum value and the frame immediately before the last frame of the frame sequence. The integration / normalization unit 122 integrates the speech organ change rate function obtained by connecting the two Gaussian functions from the beginning to the end of the frame sequence (the frame immediately before the frame that gives the last minimum value), Further, by normalizing so that the maximum value becomes 1, a speech organ shape function is generated.
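The joining of two Gaussian halves followed by integration and normalization can be sketched as below. Equations (1) to (4) are not reproduced in this text, so the details here are assumptions: each half is a Gaussian centered on the peak frame whose width is chosen so that the far end lies three standard deviations away, and the curve is shifted and rescaled so that it is exactly 0 at that far end and 1 at the peak. Integration is approximated by a cumulative sum over frames. The names `half_gaussian` and `shape_function` are hypothetical.

```python
import math


def half_gaussian(distance, span):
    """Gaussian bump: 1 at the peak, shifted so it is 0 at `span` frames away."""
    if span == 0:
        return 1.0
    sigma = span / 3.0  # assumed width: the edge sits three sigmas from the peak
    floor = math.exp(-0.5 * (span / sigma) ** 2)
    value = math.exp(-0.5 * (distance / sigma) ** 2)
    return (value - floor) / (1.0 - floor)


def shape_function(start, peak, next_min):
    """Normalised cumulative curve for frames start .. next_min - 1.

    Assumes start <= peak < next_min. The first half rises from 0 at
    `start` to 1 at `peak`; the second half falls toward 0 at `next_min`.
    """
    rate = []
    for m in range(start, next_min):
        if m <= peak:
            rate.append(half_gaussian(peak - m, peak - start))
        else:
            rate.append(half_gaussian(m - peak, next_min - peak))
    # Integrate (cumulative sum) and normalise so that the maximum is 1.
    total, curve = 0.0, []
    for r in rate:
        total += r
        curve.append(total)
    return [c / curve[-1] for c in curve]
```

The resulting curve starts at 0, rises most steeply around the peak frame, and reaches 1 at the last frame of the sequence, which is the qualitative behavior the description attributes to the speech organ shape function.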

混合比算出部１２４は、この様にして得られた発話器官形状関数から、前述した式（１）、（２）、（３）、及び（４）によって混合比を算出する。声道断面積関数算出部１２６は、調音パラメータＤＢ１２８に保持された調音パラメータから、母音割当部１１４で割当てられた音素に該当する調音パラメータを読出す。そして読出された調音パラメータを使用して、算出された混合比から、前述した式（５）及び（６）によってフレームシーケンスに含まれる各フレームでの各音素間の声道断面積関数を算出する。算出された声道断面積関数は補間用ＤＢ８８に与えられ、保持される。 The mixing ratio calculation unit 124 calculates the mixing ratio from the utterance organ shape function obtained in this way by the above-described equations (1), (2), (3), and (4). The vocal tract cross-sectional area function calculation unit 126 reads out the articulation parameters corresponding to the phonemes allocated by the vowel allocation unit 114 from the articulation parameters held in the articulation parameter DB 128. Then, using the read articulation parameters, the vocal tract cross-sectional area function between each phoneme in each frame included in the frame sequence is calculated from the calculated mixing ratio by the above-described equations (5) and (6). . The calculated vocal tract cross-sectional area function is given to the interpolation DB 88 and held.

この処理がすべてのフレームシーケンスに対して実行される。 This process is executed for all frame sequences.

（２）音声合成装置９２の動作
本実施の形態に係る音声合成装置９２は以下の様に動作する。 (2) Operation of Speech Synthesizer 92 The speech synthesizer 92 according to the present embodiment operates as follows.

図１０を参照して、テキスト９０が入力されると、補間パラメータ抽出部１３０は、入力テキスト９０を音素単位に分割する。さらに補間パラメータ抽出部１３０は、入力テキスト９０内において隣接する２音素の組の各々について、その２音素を補間するための補間用データ（補間後の調音パラメータ）である声道断面積関数を補間用ＤＢ８８から抽出する。この抽出作業を入力テキスト９０内において隣接する２音素の組合わせのすべてについて行ない、補間調音パラメータ１３２として出力する。 Referring to FIG. 10, when the text 90 is input, the interpolation parameter extraction unit 130 divides the input text 90 into phonemes. For each pair of adjacent phonemes in the input text 90, the interpolation parameter extraction unit 130 then extracts from the interpolation DB 88 the vocal tract cross-sectional area function, which is the interpolation data (the interpolated articulation parameters) for interpolating between the two phonemes. This extraction is performed for every pair of adjacent phonemes in the input text 90, and the results are output as the interpolated articulation parameters 132.

出力部１３４は、出力された補間調音パラメータ１３２を順に読込み、補間調音パラメータ１３２に付された、そのパラメータの２音素間における位置情報及び合成すべき音声の長さ等から、クロック部１４０からのクロックに従って適切な時期に各補間調音パラメータをフィルタ１３８に与える。フィルタ１３８は、与えられた補間調音パラメータに従ってその特性を変化させて音源１３６からの音源信号を変調し、合成音声信号９４を出力する。この合成音声信号９４を図示しない増幅器を介してスピーカに与える事により、連続音声が発生される。 The output unit 134 sequentially reads the output interpolated articulation parameters 132, and from the position information between the two phonemes of the parameters, the length of the speech to be synthesized, and the like attached to the interpolated articulation parameters 132, from the clock unit 140. Each interpolation articulation parameter is given to the filter 138 at an appropriate time according to the clock. The filter 138 modulates the sound source signal from the sound source 136 by changing its characteristic according to the given interpolation articulation parameter, and outputs a synthesized speech signal 94. By applying this synthesized voice signal 94 to a speaker via an amplifier (not shown), continuous voice is generated.

なお、フレームシーケンス作成部１１６（図９参照）で行なわれる、フレームシーケンスの組を作る処理は、手動で行なわれてもよい。 Note that the process of creating a set of frame sequences performed by the frame sequence creation unit 116 (see FIG. 9) may be performed manually.

［第１の実施の形態の効果］
この様にして、本発明の第１の実施の形態に係る音声合成システム７０によれば、実際の人間の発話における発話器官の動きと一致する様な方法で調音パラメータである声道断面積関数を補間する。そのため、聴覚上、滑らかで自然な連続音声を合成する事ができる。さらに、この第１の実施の形態では、声道断面積関数算出まで予め行ない、実際の音声合成時にはこの声道断面積関数を読出すだけでよい。その結果、実際の音声合成時の計算量が削減されるという効果がある。また、従来技術と異なり、音声信号から補間パラメータの抽出を行なう。その結果、ＭＲＩ動画像の撮影に伴う手間及びコストを削減する事ができる。 [Effect of the first embodiment]
In this way, the speech synthesis system 70 according to the first embodiment of the present invention interpolates the vocal tract cross-sectional area function, which is the articulation parameter, in a manner that matches the movement of the speech organs in actual human speech. It can therefore synthesize continuous speech that sounds smooth and natural. Further, in the first embodiment, everything up to the calculation of the vocal tract cross-sectional area function is done in advance, and at synthesis time the function only needs to be read out. This reduces the amount of computation at synthesis time. Also, unlike the prior art, the interpolation parameters are extracted from a speech signal, which saves the labor and cost of capturing MRI moving images.

［コンピュータによる実現］
本発明の第１の実施の形態に係る音声合成システム７０は、コンピュータと、当該コンピュータ上で実行されるコンピュータプログラムとにより実現できる。以下、図１１〜図１５を参照して音声合成システム７０を実現するコンピュータプログラムの制御構造を説明する。 [Realization by computer]
The speech synthesis system 70 according to the first embodiment of the present invention can be realized by a computer and a computer program executed on the computer. The control structure of the computer program that implements the speech synthesis system 70 is described below with reference to FIGS. 11 to 15.

（１）補間用データ生成部６４を実現するプログラム
図１１に、補間パラメータを算出する処理の詳細なフローチャートを示す。図１１を参照して、補間パラメータ算出処理が開始されると、まずステップ１５０にて初期処理を行なう。すなわち、ワークエリアのクリア、使用する変数のクリア等を行なう。ここで、極小値の配列の添字となる変数ｉには１を代入しておく。続いて、ステップ１５２では、変数ｉに１を加算した値を変数ｊに代入して、ステップ１５４へ進む。 (1) Program for Implementing Interpolation Data Generation Unit 64 FIG. 11 shows a detailed flowchart of processing for calculating interpolation parameters. Referring to FIG. 11, when the interpolation parameter calculation process is started, an initial process is first performed at step 150. In other words, the work area is cleared, variables to be used are cleared, and the like. Here, 1 is assigned to a variable i which is a subscript of the array of minimum values. Subsequently, in step 152, a value obtained by adding 1 to the variable i is substituted into the variable j, and the process proceeds to step 154.

ステップ１５４では、極小値の配列において、ｊ番目のデータがセットされているかどうかを判断する。セットされていれば、処理はステップ１５６に進み、さもなければ、補間パラメータ算出処理を終了する。 In step 154, it is determined whether or not the jth data is set in the array of local minimum values. If it is set, the process proceeds to step 156; otherwise, the interpolation parameter calculation process is terminated.

ステップ１５６では、極小値の配列のｉ番目の調音パラメータ（これを調音パラメータ（ｉ）と呼ぶ。）と調音パラメータ（ｊ）との間を、前述した式（１）〜式（６）を用いて補間する。具体的には、まず、極小値の配列のｉ番目のフレーム番号（フレーム番号（ｉ）と呼ぶ。）とフレーム番号（ｊ）とを参照して、フレーム番号（ｉ）とフレーム番号（ｊ）との間のフレームのスペクトル変化率に２種類のガウス関数を割当てる。すなわち、フレーム番号（ｉ）とフレーム番号（ｊ）との間のスペクトル変化率と、その間のスペクトル変化率の極大値と、その極大値を与えるフレーム番号とを特定し、フレーム番号（ｉ）から極大値を与えるフレームまでについてのガウス関数と、極大値を与えるフレームからフレーム番号（ｊ）−１までについてのガウス関数とを割当てる。それらガウス関数を極大値部分で接続して発話器官変化率関数を得る。この発話器官変化率関数に基づき、前述の式（１）〜式（４）を用いて各フレームにおける混合比を算出する。さらに、調音パラメータ（ｉ）及び調音パラメータ（ｊ）と、算出された混合比とを用いて、前述の式（５）及び（６）によってそのフレームにおける声道断面積関数を算出する。この様にして、フレーム番号（ｉ）とフレーム番号（ｊ）との間の全てのフレームにおいて、補間された声道断面積関数を算出する。 In step 156, interpolation is performed between the i-th articulation parameter in the array of local minima (referred to as articulation parameter (i)) and articulation parameter (j), using the above-described equations (1) to (6). Specifically, first, with reference to the i-th frame number in the array of local minima (referred to as frame number (i)) and frame number (j), two Gaussian functions are assigned to the spectrum change rates of the frames between frame number (i) and frame number (j). That is, the spectrum change rates between frame number (i) and frame number (j), their maximum value, and the frame number giving that maximum are identified; one Gaussian function is assigned from frame number (i) to the frame giving the maximum, and another from the frame giving the maximum to frame number (j) - 1. These Gaussian functions are joined at the maximum to obtain the speech organ change rate function. Based on this speech organ change rate function, the mixing ratio in each frame is calculated using the above-described equations (1) to (4).
Furthermore, using the articulation parameter (i), the articulation parameter (j), and the calculated mixing ratio, the vocal tract cross-sectional area function in each frame is calculated by the above-described equations (5) and (6). In this way, the interpolated vocal tract cross-sectional area function is calculated for all frames between frame number (i) and frame number (j).

続いて、ステップ１５８において、変数ｉの値に１を加算し、再びステップ１５２の処理に戻る。 Subsequently, in step 158, 1 is added to the value of the variable i, and the process returns to step 152 again.

この様にして、すべての極小値の間のフレームについて、補間用データを算出する処理を繰返す。 In this way, the process of calculating the interpolation data is repeated for the frames between all the minimum values.
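The control flow of steps 150 to 158 can be sketched as a loop over adjacent pairs of minima. The sketch below is only an illustration of that loop: the per-pair work of step 156 is passed in as a callable, `interpolate_pair`, which is a hypothetical placeholder, and indices start at 0 rather than at 1 as in the flowchart.

```python
def build_interpolation_data(minima_frames, interpolate_pair):
    """Walk adjacent pairs of vowel-center frames as in steps 150 to 158.

    minima_frames: frame numbers of the spectrum change rate minima, in order.
    interpolate_pair: callable(frame_i, frame_j) returning the per-frame
        data for one pair (a hypothetical stand-in for step 156).
    """
    results = []
    i = 0                                # step 150 (the flowchart counts from 1)
    while True:
        j = i + 1                        # step 152
        if j >= len(minima_frames):      # step 154: no j-th minimum, so stop
            break
        # step 156: interpolate between the i-th and j-th minima
        results.append(interpolate_pair(minima_frames[i], minima_frames[j]))
        i += 1                           # step 158
    return results
```

Passing the step-156 computation as a parameter keeps the loop independent of how the Gaussian fitting and mixing ratio calculation are actually implemented.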

（２）音声合成装置９２を実現するプログラム
図１２に、音声合成装置９２を実現するコンピュータプログラムのフローチャートを示す。図１２を参照して、音声合成処理が開始されると、ステップ１６０において、入力発話テキスト９０に応じて、補間用ＤＢから補間パラメータである声道断面積関数を抽出する処理が行なわれる。続いて、ステップ１６２において、ステップ１６０で抽出された声道断面積関数をクロックに従ってフィルタに出力し、合成音声信号を発生させる出力処理を行なう。 (2) Program for Implementing Speech Synthesizer 92 FIG. 12 shows a flowchart of a computer program for realizing speech synthesizer 92. Referring to FIG. 12, when the speech synthesis process is started, in step 160, a process for extracting a vocal tract cross-sectional area function, which is an interpolation parameter, from the interpolation DB is performed in accordance with input speech text 90. Subsequently, in step 162, the vocal tract cross-sectional area function extracted in step 160 is output to the filter according to the clock, and output processing for generating a synthesized speech signal is performed.

図１３に、ステップ１６０の補間パラメータを抽出する処理の詳細なフローチャートを示す。図１３を参照して、補間パラメータ抽出処理が開始されると、まずステップ１７０で初期処理を行なう。すなわち、ワークエリアのクリア、使用する変数のクリア等を行なう。ここで、後述する音素の配列の添字となる変数ｉには１を代入しておく。ステップ１７２で、入力テキスト９０を読出す。ステップ１７４では、テキスト９０を音素単位に分割し、それらの音素を順に配列にセットする。処理はステップ１７６へ進む。 FIG. 13 shows a detailed flowchart of the process of extracting the interpolation parameter in step 160. Referring to FIG. 13, when the interpolation parameter extraction process is started, an initial process is first performed at step 170. In other words, the work area is cleared, variables to be used are cleared, and the like. Here, 1 is assigned to a variable i which is a subscript of the phoneme array described later. In step 172, the input text 90 is read. In step 174, the text 90 is divided into phonemes, and these phonemes are sequentially set in the array. Processing proceeds to step 176.

ステップ１７６では、変数ｉに１を加算した値を変数ｊに代入する。ステップ１７８で、ステップ１７４でセットした音素の配列から、ｉ番目の音素（これを音素（ｉ）と呼ぶ。）及び音素（ｊ）を参照する。このとき、音素（ｊ）に音素がセットされているかどうかを判定する（ステップ１８０）。音素（ｊ）に値がなければ（すなわち終了であれば）、補間パラメータ抽出処理を終了し、さもなければ、処理はステップ１８２へ進む。 In step 176, a value obtained by adding 1 to the variable i is substituted into the variable j. In step 178, the i-th phoneme (referred to as phoneme (i)) and phoneme (j) are referenced from the phoneme array set in step 174. At this time, it is determined whether or not a phoneme is set in the phoneme (j) (step 180). If the phoneme (j) has no value (that is, if it is terminated), the interpolation parameter extraction process is terminated. Otherwise, the process proceeds to Step 182.

ステップ１８２では、音素（ｉ）・音素（ｊ）間の補間調音パラメータを、補間用ＤＢよりすべて抽出し、ワークエリアに順に蓄積していく。続いてステップ１８４で、変数ｉに１を加算し、処理はステップ１７６へ戻る。 In step 182, all interpolation articulation parameters between phonemes (i) and phonemes (j) are extracted from the interpolation DB and stored in the work area in order. Subsequently, at step 184, 1 is added to the variable i, and the process returns to step 176.

この様にして、入力テキストに係る補間調音パラメータを全て順に抽出して、ワークエリアに順に出力し、蓄積していく。 In this way, all the interpolation articulation parameters related to the input text are extracted in order, and are sequentially output and stored in the work area.
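The extraction loop of steps 176 to 184 amounts to looking up stored data for each adjacent phoneme pair and appending it to the work area in order. A minimal sketch follows, assuming that the text has already been split into phonemes (steps 172 and 174) and that the interpolation DB can be modeled as a dictionary keyed by phoneme pair. The dictionary layout and the function name are assumptions, not the patent's storage format.

```python
def extract_interpolation_parameters(phonemes, interpolation_db):
    """Collect stored data for every adjacent phoneme pair (steps 176 to 184).

    phonemes: phonemes of the input text, in order.
    interpolation_db: dict mapping (phoneme_i, phoneme_j) to a list of
        per-frame entries; a hypothetical stand-in for the interpolation DB.
    """
    work_area = []
    # zip stops when there is no next phoneme, which mirrors the check in
    # step 180 that phoneme (j) is set.
    for first, second in zip(phonemes, phonemes[1:]):
        work_area.extend(interpolation_db[(first, second)])
    return work_area
```

A missing pair raises `KeyError` in this sketch; how an actual system should handle a pair that is absent from the DB is not stated in the description.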

図１２に示すステップ１６２の出力処理では、ステップ１６０の補間パラメータ抽出処理で抽出されワークエリアに蓄積された補間調音パラメータを使用して、合成音声信号を発生させる。なお、この処理の詳細については、前述の音声合成装置９２の構成・動作の説明から処理内容が明らかであるため、ここでは詳細な説明は繰返さない。 In the output process of step 162 shown in FIG. 12, a synthesized speech signal is generated using the interpolation articulation parameters extracted in the interpolation parameter extraction process of step 160 and accumulated in the work area. Note that the details of this processing are obvious from the description of the configuration and operation of the speech synthesizer 92 described above, and therefore detailed description thereof will not be repeated here.

［コンピュータハードウェア構成］
上記したコンピュータプログラムを実行するコンピュータシステムの外観の一例を図１４に、そのブロック図の例を図１５に、それぞれ示す。 [Computer hardware configuration]
FIG. 14 shows an example of the external appearance of a computer system that executes the above-described computer program, and FIG. 15 shows an example of a block diagram thereof.

図１４を参照して、このコンピュータシステム３３０は、ＦＤ（フレキシブルディスク）ドライブ３５２及びＣＤ−ＲＯＭ（コンパクトディスク読出専用メモリ）ドライブ３５０を有するコンピュータ３４０と、キーボード３４６と、マウス３４８と、モニタ３４２と、スピーカ３７２とを含む。 Referring to FIG. 14, the computer system 330 includes a computer 340 having an FD (flexible disk) drive 352 and a CD-ROM (compact disk read only memory) drive 350, a keyboard 346, a mouse 348, a monitor 342, and a speaker 372.

図１５を参照して、コンピュータ３４０は、ＦＤドライブ３５２及びＣＤ−ＲＯＭドライブ３５０に加えて、ＣＰＵ（中央処理装置）３５６と、ＣＰＵ３５６、ＦＤドライブ３５２及びＣＤ−ＲＯＭドライブ３５０に接続されたバス３６６と、ブートアッププログラム等を記憶する読出専用メモリ（ＲＯＭ）３５８と、バス３６６に接続され、プログラム命令、システムプログラム、及び作業データ等を記憶するランダムアクセスメモリ（ＲＡＭ）３６０と、バス３６６に接続され、スピーカ３７２に接続されるサウンドボード３６８を含む。コンピュータシステム３３０はさらに、図示しないプリンタを含んでいる。 Referring to FIG. 15, in addition to the FD drive 352 and the CD-ROM drive 350, the computer 340 includes: a CPU (central processing unit) 356; a bus 366 connected to the CPU 356, the FD drive 352, and the CD-ROM drive 350; a read only memory (ROM) 358 for storing a boot-up program and the like; a random access memory (RAM) 360 connected to the bus 366 for storing program instructions, a system program, work data, and the like; and a sound board 368 connected to the bus 366 and to the speaker 372. The computer system 330 further includes a printer (not shown).

ここでは示さないが、コンピュータ３４０はさらにローカルエリアネットワーク（ＬＡＮ）への接続を提供するネットワークアダプタボードを含んでもよい。 Although not shown here, the computer 340 may further include a network adapter board that provides a connection to a local area network (LAN).

コンピュータシステム３３０に、図７に示す補間用データ生成装置８６又は図１０に示す音声合成装置９２としての動作を行なわせるためのコンピュータプログラムは、ＣＤＲＯＭドライブ３５０又はＦＤドライブ３５２に挿入されるＣＤ−ＲＯＭ３６２又はＦＤ３６４に記憶され、さらにハードディスク３５４に転送される。又は、プログラムは図示しないネットワークを通じてコンピュータ３４０に送信されハードディスク３５４に記憶されてもよい。プログラムは実行の際にＲＡＭ３６０にロードされる。ＣＤ−ＲＯＭ３６２から、ＦＤ３６４から、又はネットワークを介して、直接にＲＡＭ３６０にプログラムをロードしてもよい。 A computer program for causing the computer system 330 to operate as the interpolation data generation device 86 shown in FIG. 7 or the speech synthesizer 92 shown in FIG. 10 is stored on a CD-ROM 362 or an FD 364 inserted into the CD-ROM drive 350 or the FD drive 352, and is then transferred to a hard disk 354. Alternatively, the program may be transmitted to the computer 340 through a network (not shown) and stored on the hard disk 354. The program is loaded into the RAM 360 when executed. The program may also be loaded directly into the RAM 360 from the CD-ROM 362, from the FD 364, or via the network.

このプログラムは、コンピュータ３４０にこの実施の形態の補間用データ生成装置８６又は音声合成装置９２としての動作を行なわせる複数の命令を含む。この方法を行なわせるのに必要な基本的機能のいくつかはコンピュータ３４０上で動作するＯＳ又はサードパーティのプログラム、もしくはコンピュータ３４０にインストールされる各種ツールキットのモジュールにより提供される。従って、このプログラムはこの実施の形態の補間用データ生成装置８６又は音声合成装置９２を実現するのに必要な機能全てを必ずしも含まなくてよい。このプログラムは、命令のうち、所望の結果が得られる様に制御されたやり方で適切な機能又は「ツール」を呼出す事により、上記した補間用データ生成装置８６又は音声合成装置９２を実現する命令のみを含んでいればよい。コンピュータシステム３３０の動作は周知であるので、ここでは繰返さない。 This program includes a plurality of instructions that cause the computer 340 to operate as the interpolation data generation device 86 or the speech synthesizer 92 of this embodiment. Some of the basic functions required for this operation are provided by the OS running on the computer 340, by third-party programs, or by modules of various toolkits installed on the computer 340. Therefore, the program does not necessarily have to include all the functions needed to realize the interpolation data generation device 86 or the speech synthesizer 92 of this embodiment. The program only needs to include those instructions that realize the interpolation data generation device 86 or the speech synthesizer 92 by calling the appropriate functions or “tools” in a controlled manner so as to obtain the desired result. The operation of the computer system 330 is well known and is not repeated here.

＜第２の実施の形態＞
上記した第１の実施の形態では、補間用ＤＢ８８には、実際に補間式を適用して補間パラメータである声道断面積関数を算出して補間用データとして蓄積し、それを使用して連続音声を合成した。しかし、本発明はこのような実施の形態には限定されない。例えば、補間用ＤＢ８８には、声道断面積関数ではなく、声道断面積関数を算出するための混合比（前述した式（１）〜式（４）によって算出されるもの）を蓄積しておき、それを連続音声合成の際に使用する方法も考えられる。以下、この方法を適用した、第２の実施の形態について説明する。 <Second Embodiment>
In the first embodiment described above, the vocal tract cross-sectional area function, which is the interpolation parameter, is calculated by actually applying the interpolation equations and is stored in the interpolation DB 88 as interpolation data, which is then used to synthesize continuous speech. However, the present invention is not limited to such an embodiment. For example, instead of the vocal tract cross-sectional area function, the interpolation DB 88 may store the mixing ratio used to calculate it (the ratio calculated by the above-described equations (1) to (4)), and that ratio may be used at the time of continuous speech synthesis. A second embodiment to which this method is applied is described below.

[Configuration]
The speech synthesis system according to the second embodiment of the present invention has a configuration similar to that of the speech synthesis system 70 according to the first embodiment shown in FIG. 6. The interpolation data generation device of this second embodiment is similar in configuration to the interpolation data generation device 86 of the first embodiment, but differs in that it includes an interpolation data generation unit 190 having the configuration shown in FIG. 16 in place of the interpolation data generation unit 64 (see FIG. 9). The system according to the second embodiment also includes a speech synthesizer 210 having the configuration shown in FIG. 17 in place of the speech synthesizer 92 (see FIG. 10) of the first embodiment.

(1) Interpolation data generation unit 190
FIG. 16 is a block diagram of the interpolation data generation unit 190 according to the present embodiment. Referring to FIG. 16, the interpolation data generation unit 190 has the same configuration as the interpolation data generation unit 64 according to the first embodiment, except that it does not include the vocal tract cross-sectional area function calculation unit 126 shown in FIG. 9 and instead stores the output of the mixing ratio calculation unit 124 directly in the interpolation DB 208. In other words, unlike the first embodiment, the second embodiment does not go as far as interpolating the vocal tract cross-sectional area function, which is the articulation parameter, and stops at calculating the mixing ratio for each frame.

(2) Speech synthesizer 210
FIG. 17 is a block diagram of the speech synthesizer 210 according to the present embodiment. Referring to FIG. 17, the speech synthesizer 210 has a configuration similar to that of the speech synthesizer 92 according to the first embodiment shown in FIG. 10. It differs in that, in place of the interpolation parameter extraction unit 130 shown in FIG. 10, it includes: a mixing ratio extraction unit 212 that extracts from the interpolation DB 208, for each pair of adjacent phonemes, the mixing ratios for interpolating the vocal tract cross-sectional area function between the two phonemes; an articulation parameter DB 216 that holds articulation parameters calculated in advance for various phonemes; and a vocal tract cross-sectional area function calculation unit 214 that retrieves from the articulation parameter DB 216 the articulation parameters corresponding to the phonemes given by the text 90, calculates the vocal tract cross-sectional area function between the phonemes by interpolation using the corresponding mixing ratios, and outputs the result as the interpolated articulation parameter 132.

[Operation]
In the second embodiment, the spectral change rate generation unit 60 operates in the same manner as the spectral change rate generation unit 60 included in the speech synthesis system according to the first embodiment. A detailed description of it is therefore not repeated.

(1) Operation of the interpolation data generation unit 190
The interpolation data generation unit 190 shown in FIG. 16 differs from the interpolation data generation unit 64 of the first embodiment only in whether it includes the vocal tract cross-sectional area function calculation unit 126. A detailed description of its operation is therefore not repeated. Note, however, that here the interpolation data consists only of the mixing ratio calculated by the mixing ratio calculation unit 124, so the interpolation DB 208 holds mixing ratios instead of vocal tract cross-sectional area functions.

(2) Operation of the speech synthesizer 210
In the speech synthesizer 210, the output unit 134, the sound source 136, the filter 138, and the clock unit 140 operate in the same manner as in the speech synthesizer 92 according to the first embodiment. A detailed description of them is therefore not repeated.

Referring to FIG. 17, the text 90 includes text consisting of the phoneme string to be uttered and time information for uttering each phoneme. When the text 90 is input, the text and the time information are supplied to the mixing ratio extraction unit 212 and the vocal tract cross-sectional area function calculation unit 214.

For each pair of adjacent phonemes in the phoneme string given by the text 90, the mixing ratio extraction unit 212 extracts from the interpolation DB 208 all of the mixing ratios, which are the interpolation data for interpolating between those two phonemes. The vocal tract cross-sectional area function calculation unit 214 reads, from the articulation parameters held in the articulation parameter DB 216, the articulation parameters corresponding to the phonemes given by the text 90. Using the articulation parameters thus read, it calculates the vocal tract cross-sectional area function between the phonemes from the extracted mixing ratios by equations (5) and (6). It then interpolates between the two phonemes in the text 90 with this vocal tract cross-sectional area function and outputs the interpolated articulation parameter 132. This process is carried out for every pair of adjacent phonemes in the phoneme string of the text 90, and the results are supplied to the output unit 134 as the interpolated articulation parameter 132. The subsequent operation is the same as that of the speech synthesizer 92.
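The synthesis-time flow described above can be sketched as follows. This is only a minimal illustration, not the patented implementation: equations (5) and (6) are not reproduced in this section, so the sketch assumes a simple linear blend in which a ratio of 0 yields the first phoneme's area function and a ratio of 1 yields the second's. The names `interpolate_area`, `synthesize_parameters`, `articulation_db`, and `interpolation_db` are hypothetical stand-ins for the calculation unit 214, the articulation parameter DB 216, and the interpolation DB 208.

```python
from typing import Dict, List, Tuple

# Assumed layout: one cross-sectional area value per vocal tract section.
AreaFunction = List[float]


def interpolate_area(a_first: AreaFunction, a_second: AreaFunction,
                     ratio: float) -> AreaFunction:
    """Blend two area functions linearly (assumed form, not equations (5)/(6))."""
    if len(a_first) != len(a_second):
        raise ValueError("area functions must have the same number of sections")
    return [(1.0 - ratio) * x + ratio * y for x, y in zip(a_first, a_second)]


def synthesize_parameters(
    phonemes: List[str],
    articulation_db: Dict[str, AreaFunction],            # stand-in for DB 216
    interpolation_db: Dict[Tuple[str, str], List[float]],  # stand-in for DB 208
) -> List[AreaFunction]:
    """Return one interpolated area function per frame for the whole phoneme string."""
    frames: List[AreaFunction] = []
    # Walk over every pair of adjacent phonemes in the input string.
    for first, second in zip(phonemes, phonemes[1:]):
        ratios = interpolation_db[(first, second)]  # per-frame mixing ratios
        a_first, a_second = articulation_db[first], articulation_db[second]
        frames.extend(interpolate_area(a_first, a_second, r) for r in ratios)
    return frames


# Toy data: two-section area functions and three frames of mixing ratios.
demo_articulation = {"a": [1.0, 2.0], "i": [3.0, 0.0]}
demo_interpolation = {("a", "i"): [0.0, 0.5, 1.0]}
demo_frames = synthesize_parameters(["a", "i"], demo_articulation, demo_interpolation)
```

With the toy data, the middle frame is the midpoint of the two area functions. Handling of the time information in the text 90 (for example, resampling the ratios to a requested duration) is deliberately left out of this sketch.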

Note that the process of creating sets of frame sequences, performed by the frame sequence creation unit 116 (see FIG. 16), may instead be performed manually.

[Effects of the second embodiment]
The speech synthesis system according to the present embodiment also interpolates the vocal tract cross-sectional area function, which is the articulation parameter, in a way that matches the movement of the speech organs in actual human speech. It can therefore synthesize continuous speech that sounds smooth and natural. In addition, the interpolation data generation device calculates only the mixing ratio for each frame between two phonemes and does not go on to calculate the vocal tract cross-sectional area function. As a result, generating the interpolation data takes less time, and the capacity required for the interpolation DB 208 is also reduced. As in the first embodiment, and unlike the prior art, the interpolation parameters are extracted from a speech signal. This eliminates the labor and cost of capturing MRI moving images.
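The capacity saving mentioned above can be made concrete with a small worked count. The figures used here (20 frames per phoneme pair, 40 vocal tract sections) are purely illustrative assumptions and do not come from the specification; `stored_values_per_pair` is a hypothetical helper.

```python
def stored_values_per_pair(num_frames: int, num_sections: int) -> dict:
    """Count the numbers kept per phoneme pair under each storage scheme."""
    return {
        # First-embodiment style: a full area function is stored for every frame.
        "area_function_per_frame": num_frames * num_sections,
        # Second-embodiment style: a single mixing ratio is stored for every frame.
        "mixing_ratio_per_frame": num_frames,
    }


# Illustrative, assumed sizes only.
counts = stored_values_per_pair(num_frames=20, num_sections=40)
```

Under these assumed sizes the interpolation DB shrinks by a factor equal to the number of sections. The second embodiment does additionally rely on the articulation parameter DB 216, but that holds one area function per phoneme rather than one per frame of every phoneme pair.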

[Realization by computer]
Like the first embodiment, the speech synthesis system according to the second embodiment of the present invention can be realized by a computer and a computer program executed on that computer. Those skilled in the art should be able to implement the control structure of the computer program that realizes the speech synthesis system of this embodiment without difficulty on the basis of the description of the first embodiment. The hardware configuration of the computer is also the same as that described for the first embodiment. A detailed description of these is therefore not repeated here.

<Third Embodiment>
In the second embodiment described above, the interpolation DB 208 stores the mixing ratios for interpolating the vocal tract cross-sectional area function between two phonemes, and at the time of actual interpolation these are used to calculate the vocal tract cross-sectional area function and synthesize continuous speech. However, the present invention is not limited to such an embodiment. For example, the interpolation DB may store, instead of the vocal tract cross-sectional area function or the mixing ratio used to calculate it, the speech organ shape function obtained by assigning Gaussian functions to the spectral change rate and then integrating and normalizing the result, and this function may be used at the time of continuous speech synthesis to calculate the vocal tract cross-sectional area function, which is the articulation parameter. A third embodiment to which this method is applied is described below.

[Configuration]
The speech synthesis system according to the third embodiment of the present invention also has a configuration similar to that of the speech synthesis system 70 according to the first embodiment shown in FIG. 6. The interpolation data generation device of this third embodiment is similar in configuration to the interpolation data generation device 86 of the first embodiment, but differs in that it includes the interpolation data generation unit 230 shown in FIG. 18 in place of the interpolation data generation unit 64 (see FIG. 9). The system according to the third embodiment also includes a speech synthesizer 250 having the configuration shown in FIG. 19 in place of the speech synthesizer 92 (see FIG. 10) of the first embodiment.

(1) Interpolation data generation unit 230
FIG. 18 is a block diagram of the interpolation data generation unit 230 according to the present embodiment. Referring to FIG. 18, the interpolation data generation unit 230 includes the same local minimum detection unit 110, local maximum detection unit 112, vowel assignment unit 114, frame sequence creation unit 116, Gaussian function assignment unit 118, Gaussian function assignment unit 120, and integration/normalization unit 122 as the interpolation data generation unit 64 of the first embodiment. Unlike the first embodiment, the third embodiment does not go as far as calculating the mixing ratio or the vocal tract cross-sectional area function. It stops at integrating and normalizing the speech organ change rate function to obtain the speech organ shape function, and outputs this speech organ shape function to the interpolation DB 246 as the interpolation data.
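The processing chain of the interpolation data generation unit 230 (Gaussian approximation of the spectral change rate between two minima, followed by integration and normalization) can be sketched roughly as below. The exact Gaussian definition in equations (1) to (4) is not reproduced in this section, so the sketch makes its own assumptions: each half Gaussian is centred on the maximum frame, and its standard deviation is set to one third of the distance to the neighbouring minimum frame (the `width_factor` parameter). Integration is done as a plain running sum over frames. All function and parameter names are hypothetical.

```python
import math
from typing import List


def piecewise_gaussian(start: int, peak: int, end: int,
                       peak_value: float = 1.0,
                       width_factor: float = 3.0) -> List[float]:
    """Two half Gaussians joined at the peak frame (assumed parameterization).

    start and end are the frames of adjacent minima; peak is the frame of the
    maximum lying between them.
    """
    if not start < peak < end:
        raise ValueError("expected start < peak < end")
    sigma_left = (peak - start) / width_factor    # assumed width, left half
    sigma_right = (end - peak) / width_factor     # assumed width, right half
    values = []
    for frame in range(start, end + 1):
        sigma = sigma_left if frame <= peak else sigma_right
        values.append(peak_value * math.exp(-0.5 * ((frame - peak) / sigma) ** 2))
    return values


def shape_function(change_rate: List[float]) -> List[float]:
    """Running-sum integration followed by normalization by the maximum value."""
    running = 0.0
    integral = []
    for value in change_rate:
        running += value
        integral.append(running)
    maximum = max(integral)
    return [value / maximum for value in integral]


# Toy frame indices: minima at frames 0 and 10, maximum at frame 4.
demo_rate = piecewise_gaussian(start=0, peak=4, end=10)
demo_shape = shape_function(demo_rate)
```

Because the approximated change rate is non-negative, the resulting curve rises monotonically and ends at 1, which is what makes it usable as a per-frame weight between the first and second phoneme.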

(2) Speech synthesizer 250
FIG. 19 is a block diagram of the speech synthesizer 250 according to the present embodiment. Referring to FIG. 19, the speech synthesizer 250 has a configuration similar to that of the speech synthesizer 92 according to the first embodiment shown in FIG. 10. It differs in that, in place of the interpolation parameter extraction unit 130 shown in FIG. 10, it includes: an interpolation data extraction unit 252 that extracts from the interpolation DB 246, for each pair of adjacent phonemes in the input text 90, the integrated and normalized speech organ shape function; a mixing ratio calculation unit 254 that calculates the mixing ratio from the extracted interpolation data; an articulation parameter DB 216 that holds articulation parameters calculated in advance for various phonemes; and the same vocal tract cross-sectional area function calculation unit 214 as used in the second embodiment (see FIG. 17), which retrieves from the articulation parameter DB 216 the articulation parameters corresponding to the phonemes given by the text 90, calculates the vocal tract cross-sectional area function between the phonemes by interpolation using the mixing ratio, and outputs the result as the interpolated articulation parameter 132.

[Operation]
In the speech synthesis system according to the third embodiment, the spectral change rate generation unit 60 operates in the same manner as the spectral change rate generation unit 60 included in the speech synthesis systems according to the first and second embodiments. A detailed description of it is therefore not repeated.

(1) Operation of the interpolation data generation unit 230
The interpolation data generation unit 230 shown in FIG. 18 differs from the interpolation data generation unit 64 of the first embodiment only in whether it includes the mixing ratio calculation unit 124 and the vocal tract cross-sectional area function calculation unit 126 (see FIG. 9). A detailed description of its operation is therefore not repeated. Note, however, that here the interpolation data consists only of the speech organ shape function obtained when the integration/normalization unit 122 integrates and normalizes the speech organ change rate function, so the interpolation DB 246 holds speech organ shape functions instead of mixing ratios.

(2) Operation of the speech synthesizer 250
Referring to FIG. 19, the text 90 includes text consisting of the phoneme string to be uttered and time information for uttering each phoneme. When the text 90 is input, the text and the time information are supplied to the interpolation data extraction unit 252 and the vocal tract cross-sectional area function calculation unit 214.

The interpolation data extraction unit 252 refers to the text 90 and, for each pair of adjacent phonemes in the text, extracts the speech organ shape function held in the interpolation DB 246. The mixing ratio calculation unit 254 calculates the mixing ratio of the articulation parameters corresponding to the two phonemes from the data calculated by equations (1) to (4) described above, and supplies it to the vocal tract cross-sectional area function calculation unit 214. The subsequent operation is the same as that of the speech synthesizer 210 in the second embodiment.

Note that the process of creating sets of frame sequences, performed by the frame sequence creation unit 116 (see FIG. 18), may instead be performed manually.

[Effects of the third embodiment]
The speech synthesis system according to the present embodiment also interpolates the vocal tract cross-sectional area function, which is the articulation parameter, in a way that matches the movement of the speech organs in actual human speech. It can therefore synthesize continuous speech that sounds smoother and more natural. The interpolation data generation unit 230 only calculates the speech organ shape function, by integrating and normalizing the speech organ change rate function obtained by approximating the spectral change rate with Gaussian functions. It calculates neither the mixing ratio nor the vocal tract cross-sectional area function. As a result, generating the interpolation data takes less time, and the capacity required for the interpolation DB 246 is also reduced. However, because the speech synthesizer 250 performs both the mixing ratio calculation and the vocal tract cross-sectional area function calculation, its amount of computation is larger than in the first or second embodiment. As in the other embodiments, and unlike the prior art, the interpolation parameters are extracted from a speech signal. This eliminates the labor and cost of capturing MRI moving images.

[Realization by computer]
Like the first embodiment, the speech synthesis system according to the third embodiment of the present invention can be realized by a computer and a computer program executed on that computer. Those skilled in the art should be able to implement the control structure of the computer program that realizes the speech synthesis system of this embodiment without difficulty on the basis of the description of the first embodiment. The hardware configuration of the computer is also the same as that described for the first embodiment. A detailed description of these is therefore not repeated here.

The embodiments disclosed herein are merely examples, and the present invention is not limited to the embodiments described above. The scope of the present invention is indicated by each claim in the claims, read in light of the detailed description of the invention, and includes all modifications within the meaning and scope equivalent to the wording of those claims.

FIG. 1 is a graph showing an example of an uttered speech waveform and an example of the spectral change rate.
FIG. 2 is a graph showing an example of the spectral change rate used for mixing ratio calculation.
FIG. 3 is a graph showing an example of the spectral change rate 50, the normal distribution 52 obtained by assigning Gaussian functions to it, and the Gaussian function integrated for mixing ratio calculation.
FIG. 4 is a diagram showing an example of a vocal tract cross-sectional area function.
FIG. 5 is a diagram showing an example of a vocal tract cross-sectional area function.
FIG. 6 is a block diagram of the speech synthesis system 70 according to the first embodiment of the present invention.
FIG. 7 is a block diagram of the interpolation data generation device 86.
FIG. 8 is a block diagram of the spectral change rate generation unit 60.
FIG. 9 is a block diagram of the interpolation data generation unit 64 according to the first embodiment of the present invention.
FIG. 10 is a block diagram of the speech synthesizer 92 according to the first embodiment of the present invention.
FIG. 11 is a detailed flowchart of the process of calculating the interpolation parameters.
FIG. 12 is a flowchart of a computer program that realizes the speech synthesizer 92.
FIG. 13 is a detailed flowchart of the process of extracting the interpolation parameters in step 160.
FIG. 14 is a diagram showing an example of the external appearance of a computer system.
FIG. 15 is a block diagram of the computer system shown in FIG. 14.
FIG. 16 is a block diagram of the interpolation data generation unit 190 according to the second embodiment of the present invention.
FIG. 17 is a block diagram of the speech synthesizer 210 according to the second embodiment of the present invention.
FIG. 18 is a block diagram of the interpolation data generation unit 230 according to the third embodiment of the present invention.
FIG. 19 is a block diagram of the speech synthesizer 250 according to the third embodiment of the present invention.

Explanation of symbols

60 Spectral change rate generation unit
64 Interpolation data generation unit
100 Framing unit
108 Spectral change rate calculation unit
110 Local minimum detection unit
112 Local maximum detection unit
116 Frame sequence creation unit
118 Gaussian function assignment unit
120 Gaussian function assignment unit
122 Integration/normalization unit
124 Mixing ratio calculation unit
126 Vocal tract cross-sectional area function calculation unit

Claims

A data generation device for articulation parameter interpolation that generates interpolation data for use when articulation parameters for synthesizing speech that changes continuously from a first phoneme to a second phoneme are generated by interpolation between known articulation parameters for synthesizing the first and second phonemes, the device comprising:
change rate calculation means for calculating, from an input speech signal containing the first phoneme and the second phoneme in succession, a transition of a predetermined spectral change rate from the first phoneme to the second phoneme; and
interpolation data creation means for creating the interpolation data using the change rate,
wherein the change rate calculation means includes:
framing means for dividing the input speech signal into frames at predetermined time intervals; and
spectral change rate calculation means for calculating the predetermined spectral change rate for each frame from the speech signal framed by the framing means,
wherein the spectral change rate calculation means includes:
frame combination means for combining, from among the frames of the speech signal framed by the framing means that span from the first phoneme to the second phoneme, a plurality of frames including first and second frames that give adjacent minimum values of the predetermined spectral change rate and a third frame that lies between those two frames and gives a maximum value of the predetermined spectral change rate; and
function approximation means for approximating the transition of the predetermined spectral change rate with a Gaussian function defined by the maximum value and the minimum values of the frames combined by the frame combination means,
and wherein the interpolation data creation means includes:
integration/normalization means for integrating the Gaussian function obtained by the function approximation means from the first frame to the second frame to obtain an integral function, and normalizing the integral function by the maximum value to obtain a normalized integral function.

The articulation parameter interpolation data generation device according to claim 1, wherein the function approximation means includes:
  means for obtaining a first Gaussian function that approximates the predetermined spectral change rate between the first frame and the third frame among the frames combined by the frame combination means;
  means for obtaining a second Gaussian function that approximates the predetermined spectral change rate between the third frame and the second frame among the frames combined by the frame combination means; and
  means for obtaining a Gaussian function that approximates the transition of the predetermined spectral change rate between the maximum value and the two minimum values by joining the first Gaussian function and the second Gaussian function at the third frame.

The articulation parameter interpolation data generation device according to claim 1 or 2, wherein the interpolation data creation means further includes means for calculating, by means of the normalized integral function, a mixing ratio between the first phoneme and the second phoneme at a predetermined time, the mixing ratio being used to interpolate between the articulation parameter for the first phoneme and the articulation parameter for the second phoneme.

The articulation parameter interpolation data generation device according to claim 3, wherein the interpolation data creation means further includes means for calculating a vocal tract cross-sectional area function as the articulation parameter at the predetermined time by interpolating, with the mixing ratio, between the articulation parameter for the first phoneme and the articulation parameter for the second phoneme.

A computer program that, when executed by a computer, causes the computer to operate as the articulation parameter interpolation data generation device according to any one of claims 1 to 4.