JP2003271171A

JP2003271171A - Method, device and program for voice synthesis

Info

Publication number: JP2003271171A
Application number: JP2002069434A
Authority: JP
Inventors: Toshimitsu Minowa; 利光蓑輪
Original assignee: Matsushita Electric Industrial Co Ltd
Current assignee: Panasonic Holdings Corp
Priority date: 2002-03-14
Filing date: 2002-03-14
Publication date: 2003-09-25

Abstract

PROBLEM TO BE SOLVED: To provide a voice synthesis method and a voice synthesis device capable of producing a synthesized voice of high quality with a small memory capacity.

SOLUTION: The voice synthesis method comprises: an attribute vector applying step of applying an attribute vector, including predetermined attribute factors and paralanguage information, to each element piece of a voice corpus; a clustering step of clustering the element pieces; a cluster representative value calculating step of calculating a cluster representative value; an explanation vector generating step of generating an explanation vector associated with the cluster representative value; a target attribute vector generating step of generating a target attribute vector for each element piece unit of a synthesized voice; an optimum approximate coefficient calculating step of calculating, for each element piece of the voice corpus, an optimum approximate coefficient with which the explanation vectors optimally approximate the target attribute vector; and a synthesized voice element piece generating step of generating the element pieces of the synthesized voice on the basis of the optimum approximate coefficients. The explanation vector represents the attributes common to the element pieces in the same cluster, and these attributes are applied to voice synthesis.

COPYRIGHT: (C)2003, JPO

Description

Detailed Description of the Invention

[0001]

[Technical Field of the Invention] The present invention relates to a speech synthesis method, a speech synthesis apparatus, and a speech synthesis program.

[0002]

[Prior Art] A conventional speech synthesis method and speech synthesis apparatus are disclosed in, for example, Japanese Unexamined Patent Application Publication No. 2000-250570. This conventional speech synthesis method is described below with reference to FIG. 10.

[0003] In FIG. 10, a pitch pattern database 11 stores pitch pattern data in units of accent phrases. Each item of pitch pattern data has a pitch value for every frame and is annotated with its prosody category. The text to be synthesized is supplied in units of accent phrases.

[0004] First, in step S31, the pitch pattern database 11 is searched for pitch pattern data belonging to the same prosody category as the text to be synthesized. Next, in step S31, if such pitch pattern data exists in the pitch pattern database 11, the process proceeds to step S33; if it does not, the process proceeds to step S34. In step S33, pitch pattern data is selected from the prosody category equal to that of the text to be synthesized. In step S34, on the other hand, the prosody category whose pitch pattern shape is considered closest to that of the prosody category of the text to be synthesized is estimated from among the prosody categories contained in the pitch pattern database 11. Next, in step S35, pitch pattern data is selected from the estimated prosody category in the same way as in step S33. Next, in step S36, the difference vector between the selected prosody category and the prosody category of the text is applied to the selected pitch pattern data to transform it. Next, in step S37, the pitch pattern data is linearly stretched or compressed along the time axis in units of morae, so that its duration is corrected to the given duration. Next, in step S38, the height of each pitch pattern is set so that the midpoint between the heights of its start point and end point equals the average speech-tone component height obtained from the speech-tone component determination algorithm of the point pitch pattern.

[0005] As described above, pitch pattern data suited to the text to be synthesized is obtained from the pitch pattern database 11, and speech synthesis can then be performed.

[0006]

[Problems to be Solved by the Invention] However, in such a conventional speech synthesis method, the segments in the speech corpus are classified not by the segments themselves but by prosody categories composed of many attributes, so the memory capacity required for speech synthesis becomes very large. In addition, when the variance of the data belonging to a prosody category is large, the representative value of the category may fail to approximate all of its data, and concatenating representative values then yields synthesized speech that sounds uneven.

[0007] The present invention has been made to solve these problems. It provides a speech synthesis method, a speech synthesis apparatus, and a speech synthesis program that can reduce the memory capacity required for speech synthesis and can generate high-quality synthesized speech even when the variance of the data belonging to a prosody category is large.

[0008]

[Means for Solving the Problems] The speech synthesis method of the present invention includes: an attribute vector assigning step of assigning, to each segment of a speech corpus containing prosodic segments and speech waveform segments, an attribute vector that includes predetermined attribute factors and paralinguistic information; a clustering step of clustering the segments to which the attribute vectors have been assigned; a cluster representative value calculating step of calculating a cluster representative value of the segments belonging to each cluster obtained by the clustering; an explanation vector generating step of generating an explanation vector based on the attribute vectors of the segments belonging to each cluster obtained by the clustering; a target attribute vector generating step of generating a target attribute vector for each segment unit of the synthesized speech; an optimal approximation coefficient calculating step of calculating, for each segment of the speech corpus, an optimal approximation coefficient with which the explanation vectors optimally approximate the target attribute vector; and a synthesized speech segment generating step of generating a segment of the synthesized speech based on the optimal approximation coefficients. With this configuration, after the prosodic segments and speech waveform segments are clustered, the attributes common to the segments in a cluster are expressed by an explanation vector generated from the attribute vectors of the segments in that cluster, and are applied to speech synthesis.

[0009] In the speech synthesis method of the present invention, the attribute vector assigned in the attribute vector assigning step includes, for each of the predetermined attribute factors, an indication of whether or not that attribute factor is present. With this configuration, the attributes of each prosodic segment and speech waveform segment are expressed in a simple form.

[0010] In the speech synthesis method of the present invention, the clustering step includes a step of performing the clustering based on auditory detection limits. With this configuration, the variation among the prosodic segments and speech waveform segments is taken into account.

[0011] In the speech synthesis method of the present invention, the explanation vector generating step includes a step of, for each cluster, summing the attribute vectors of the segments belonging to the cluster, dividing each attribute factor of the resulting vector by the total number of segments belonging to the cluster, and using the vector of the resulting values as the explanation vector of the cluster. With this configuration, many attribute factors are taken into account for each prosodic segment and speech waveform segment on a per-segment basis.
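The per-cluster averaging described in paragraph [0011] can be sketched as follows. This is a minimal illustration, assuming the attribute vectors are lists of 0/1 values of equal length; the function name `explanation_vector` and the sample data are hypothetical and do not come from the patent.

```python
def explanation_vector(attribute_vectors):
    """Average the 0/1 attribute vectors of the segments in one cluster."""
    if not attribute_vectors:
        raise ValueError("cluster must contain at least one segment")
    n = len(attribute_vectors)          # total number of segments in the cluster
    dim = len(attribute_vectors[0])     # number of attribute factors
    sums = [0] * dim
    for vec in attribute_vectors:
        for i, value in enumerate(vec):
            sums[i] += value            # element-wise sum of the attribute vectors
    return [s / n for s in sums]        # divide each factor by the segment count


# Hypothetical cluster of four segments with three attribute factors each.
cluster = [
    [1, 0, 1],
    [1, 1, 0],
    [1, 0, 0],
    [1, 0, 1],
]
print(explanation_vector(cluster))  # [1.0, 0.25, 0.5]
```

Each element of the result is the fraction of segments in the cluster that have the corresponding attribute factor.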

[0012] In the speech synthesis method of the present invention, the explanation vector generating step includes a step of, for each cluster, summing the attribute vectors of the segments belonging to the cluster, dividing each attribute factor of the resulting vector by the total number of data items in the speech corpus in which that attribute factor occurs, and using the vector of the resulting values as the explanation vector of the cluster. With this configuration, many attribute factors are taken into account for each prosodic segment and speech waveform segment on a per-segment basis.
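The alternative normalization in paragraph [0012], which divides by the corpus-wide occurrence count of each attribute factor instead of the cluster size, might look like the sketch below. The function name and sample data are hypothetical, and returning 0.0 for a factor that never occurs in the corpus is an assumption, since the text does not cover that case.

```python
def corpus_normalized_explanation_vector(cluster_vectors, corpus_vectors):
    """Sum a cluster's 0/1 attribute vectors and divide each factor by the
    number of corpus segments in which that factor occurs."""
    dim = len(corpus_vectors[0])
    corpus_counts = [sum(vec[i] for vec in corpus_vectors) for i in range(dim)]
    cluster_sums = [sum(vec[i] for vec in cluster_vectors) for i in range(dim)]
    # Assumption: a factor that never occurs in the corpus contributes 0.0.
    return [s / c if c else 0.0 for s, c in zip(cluster_sums, corpus_counts)]


# Hypothetical corpus made of two clusters with three attribute factors.
cluster_a = [[1, 0, 1], [1, 1, 0]]
cluster_b = [[0, 0, 1], [1, 0, 1]]
corpus = cluster_a + cluster_b
print(corpus_normalized_explanation_vector(cluster_a, corpus))
```

With this normalization, each element is the share of a factor's corpus-wide occurrences that falls inside the cluster, so a factor found only in one cluster gets the value 1.0 there.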

[0013] In the speech synthesis method of the present invention, the cluster representative value calculating step includes a step of using the centroid of the cluster as the representative segment of the cluster. With this configuration, the representative value of the cluster is given by the centroid of the cluster.

[0014] In the speech synthesis method of the present invention, the cluster representative value calculating step includes a step of using the mode of the cluster as the representative segment of the cluster. With this configuration, the representative value of the cluster is given by the mode of the cluster.
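The two choices of cluster representative in paragraphs [0013] and [0014] can be compared with a short sketch. It assumes each segment is reduced to a tuple of numeric features (here, a hypothetical duration in ms and pitch in Hz); the patent does not specify this representation.

```python
from collections import Counter


def centroid(segments):
    """Component-wise mean of the feature tuples in a cluster."""
    n = len(segments)
    dim = len(segments[0])
    return tuple(sum(seg[i] for seg in segments) / n for i in range(dim))


def mode(segments):
    """Most frequent feature tuple in a cluster (first one seen wins a tie)."""
    return Counter(segments).most_common(1)[0][0]


# Hypothetical cluster: (duration in ms, pitch in Hz) per segment.
cluster = [(80, 120), (85, 120), (85, 120), (90, 130)]
print(centroid(cluster))  # (85.0, 122.5)
print(mode(cluster))      # (85, 120)
```

The centroid need not coincide with any stored segment, whereas the mode is always a value that actually occurs in the cluster.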

[0015] The speech synthesis method of the present invention includes: an attribute vector assigning step of assigning, to each segment of a speech corpus containing prosodic segments and speech waveform segments, an attribute vector that includes predetermined attribute factors and paralinguistic information; a clustering step of clustering the segments to which the attribute vectors have been assigned; a cluster representative value calculating step of calculating a cluster representative value of the segments belonging to each cluster obtained by the clustering; an explanation vector generating step of generating an explanation vector based on the attribute vectors of the segments belonging to each cluster obtained by the clustering; an explanation vector attribute factor comparing step of comparing the attribute factors of the explanation vectors with one another; a target attribute vector generating step of generating a target attribute vector for each segment unit of the synthesized speech; an optimal approximation coefficient calculating step of calculating, for each segment of the speech corpus, an optimal approximation coefficient with which the explanation vectors optimally approximate the target attribute vector; and a synthesized speech segment generating step of generating a segment of the synthesized speech based on the optimal approximation coefficients. In the explanation vector attribute factor comparing step, when an attribute factor can be regarded as identical across all of the explanation vectors generated in the explanation vector generating step at a predetermined statistical significance level, that attribute factor is removed from the attribute factors of the explanation vectors and of the attribute vectors. With this configuration, clusters with the same representative value no longer arise, and the optimal segment is selected from the speech corpus and applied to speech synthesis.

[0016] In the speech synthesis method of the present invention, in the explanation vector attribute factor comparing step, when a plurality of explanation vectors generated in the explanation vector generating step can be regarded as identical at a predetermined statistical significance level, the clusters associated with those explanation vectors are merged into a single cluster. With this configuration, clusters with the same representative value no longer arise, and the optimal segment is selected from the speech corpus and applied to speech synthesis.

[0017] In the speech synthesis method of the present invention, when a plurality of explanation vectors generated in the explanation vector generating step can be regarded as identical at a predetermined statistical significance level, the explanation vector attribute factor comparing step includes a procedure of obtaining the number of explanation vectors regarded as identical, a procedure of calculating the base-2 logarithm of that number, and a procedure of provisionally adding to the segments a number of new attribute factors equal to the integer obtained from the logarithm. In the attribute vector assigning step, attribute vectors including the added attribute factors are then reassigned to the segments. With this configuration, clusters with the same representative value no longer arise, and the optimal segment is selected from the speech corpus and applied to speech synthesis.
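The count of provisional attribute factors in paragraph [0017] can be illustrated with a few lines of code. The text only says that the base-2 logarithm is converted to an integer; rounding up is an assumption made here, on the reasoning that n indistinguishable vectors need at least ceil(log2 n) binary factors to be told apart. The function name is hypothetical.

```python
import math


def extra_attribute_factor_count(num_identical_vectors):
    """Number of new 0/1 attribute factors to add for a group of explanation
    vectors that were judged identical."""
    if num_identical_vectors < 1:
        raise ValueError("need at least one explanation vector")
    # Assumption: "integerized" is taken to mean rounding up.
    return math.ceil(math.log2(num_identical_vectors))


for n in (2, 3, 4, 5):
    print(n, extra_attribute_factor_count(n))
```

For example, five explanation vectors that cannot be told apart would receive three additional binary factors under this reading.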

[0018] The speech synthesis method of the present invention includes: an inner product calculating step of calculating the inner product of the target attribute vector, generated for each segment unit of the synthesized speech in the target attribute vector generating step, with the explanation vector of each cluster of the speech corpus; and a segment selecting step of selecting the representative segment of the cluster whose explanation vector gives the largest of the calculated inner products. With this configuration, a segment corpus from which the redundancy of the attribute vectors has been removed is generated and applied to speech synthesis.
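The selection rule in paragraph [0018] amounts to an argmax over inner products. The sketch below assumes the target attribute vector and explanation vectors are plain lists of numbers; the labels `seg_A` to `seg_C` stand in for stored representative segments and are purely illustrative.

```python
def inner_product(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def select_representative(target, explanation_vectors, representatives):
    """Return the representative whose explanation vector has the largest
    inner product with the target attribute vector, plus all scores."""
    scores = [inner_product(target, e) for e in explanation_vectors]
    best = max(range(len(scores)), key=scores.__getitem__)
    return representatives[best], scores


target = [1, 0, 1]                      # hypothetical target attribute vector
explanations = [
    [1.0, 0.25, 0.5],
    [0.0, 1.0, 0.5],
    [0.5, 0.0, 0.75],
]
segment, scores = select_representative(target, explanations, ["seg_A", "seg_B", "seg_C"])
print(segment, scores)  # seg_A [1.5, 0.5, 1.25]
```

Only the representative segments and their explanation vectors need to be kept in memory for this lookup, which is consistent with the memory-saving aim stated in the document.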

[0019] The speech synthesis method of the present invention includes: an inner product calculating step of calculating the inner product of the target attribute vector, generated for each segment unit of the synthesized speech in the target attribute vector generating step, with the explanation vector of each cluster of the speech corpus; a step of calculating the sum of these inner products; and a synthesized speech segment generating step of generating a synthesized speech segment by taking a weighted average of the representative segments of the clusters, using as weights the values obtained by dividing each calculated inner product by the sum. With this configuration, a segment corpus from which the redundancy of the attribute vectors has been removed is generated and applied to speech synthesis.
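The weighted averaging in paragraph [0019] can be sketched as follows, assuming each representative segment is a fixed-length list of numeric samples (for instance, a short pitch contour). The function name and the data are hypothetical.

```python
def blend_representatives(target, explanation_vectors, representatives):
    """Weight each cluster representative by its normalized inner product
    with the target attribute vector and return the weighted average."""
    products = [sum(t * e for t, e in zip(target, vec)) for vec in explanation_vectors]
    total = sum(products)
    if total == 0:
        raise ValueError("target is orthogonal to every explanation vector")
    weights = [p / total for p in products]          # weights sum to 1
    dim = len(representatives[0])
    blended = [
        sum(w * rep[i] for w, rep in zip(weights, representatives))
        for i in range(dim)
    ]
    return blended, weights


target = [1, 0]
explanations = [[0.75, 0.25], [0.25, 0.75]]
contours = [[100.0, 120.0], [200.0, 160.0]]          # hypothetical pitch values in Hz
print(blend_representatives(target, explanations, contours))
# ([125.0, 130.0], [0.75, 0.25])
```

Compared with picking only the best-matching cluster, the blend lets several clusters contribute in proportion to how well their explanation vectors match the target.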

[0020] The speech synthesis method of the present invention includes: an optimal approximation coefficient calculating step of calculating the optimal approximation coefficients with which the explanation vectors of the clusters of the speech corpus optimally approximate the target attribute vector generated for each segment unit of the synthesized speech in the target attribute vector generating step; and a synthesized speech segment generating step of generating a synthesized speech segment by taking a weighted average of the representative segments based on the calculated optimal approximation coefficients. With this configuration, a segment corpus from which the redundancy of the attribute vectors has been removed is generated and applied to speech synthesis.
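Paragraph [0020] does not state in what sense the approximation is optimal. One plausible reading, assumed here, is ordinary least squares: choose coefficients w that minimize the squared distance between the target attribute vector and the weighted sum of explanation vectors. The sketch solves the normal equations with a small Gaussian elimination; all names are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def solve_linear_system(matrix, rhs):
    """Gaussian elimination with partial pivoting for a small square system."""
    n = len(rhs)
    aug = [list(row) + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("explanation vectors are linearly dependent")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    solution = [0.0] * n
    for row in range(n - 1, -1, -1):                 # back substitution
        tail = sum(aug[row][k] * solution[k] for k in range(row + 1, n))
        solution[row] = (aug[row][n] - tail) / aug[row][row]
    return solution


def optimal_coefficients(target, explanation_vectors):
    """Least-squares coefficients (assumed meaning of 'optimal approximation')."""
    gram = [[dot(ei, ej) for ej in explanation_vectors] for ei in explanation_vectors]
    rhs = [dot(ei, target) for ei in explanation_vectors]
    return solve_linear_system(gram, rhs)


explanations = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
target = [0.5, 0.25, 0.25]
print(optimal_coefficients(target, explanations))  # [0.5, 0.25]
```

The resulting coefficients would then serve as the weights for averaging the representative segments. Unlike the normalized inner products of paragraph [0019], least-squares coefficients are not guaranteed to be non-negative or to sum to one.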

[0021] The speech synthesis apparatus of the present invention comprises: representative segment storage means for storing the representative segments of clusters of segments taken from a speech corpus containing prosodic segments and speech waveform segments; explanation vector storage means for storing the explanation vectors of the representative segments; pointer storage means for storing pointers indicating the correspondence between the representative segments and the explanation vectors; text input means for inputting text; paralanguage input means for inputting paralinguistic information; text analysis means for analyzing the input text; target attribute vector generating means for generating a target attribute vector for each segment unit of the synthesized speech based on the analysis result of the text analysis means and the input paralinguistic information; inner product calculating means for calculating the inner products of the generated target attribute vector with all of the explanation vectors; maximum inner product segment selecting means for selecting the representative prosodic segment and the representative speech waveform segment that give the maximum inner product; speech waveform segment transforming means for transforming the speech waveform segment in accordance with the selected prosodic segment; and speech waveform segment connecting means for connecting the transformed speech waveform segments to one another. With this configuration, after the prosodic segments and speech waveform segments are clustered, the attributes common to the segments in a cluster are expressed by an explanation vector generated from the attribute vectors of the segments in that cluster, and are applied to speech synthesis.

[0022] The speech synthesis apparatus of the present invention comprises: representative segment storage means for storing the representative segments of clusters of segments taken from a speech corpus containing prosodic segments and speech waveform segments; explanation vector storage means for storing the explanation vectors of the representative segments; pointer storage means for storing pointers indicating the correspondence between the representative segments and the explanation vectors; text input means for inputting text; paralanguage input means for inputting paralinguistic information; text analysis means for analyzing the input text; target attribute vector generating means for generating a target attribute vector for each segment unit of the synthesized speech based on the analysis result of the text analysis means and the input paralinguistic information; inner product calculating means for calculating the inner products of the generated target attribute vector with all of the explanation vectors; segment weighted averaging means for taking weighted averages of the representative prosodic segments and of the representative speech waveform segments based on the calculated inner products; speech waveform segment transforming means for transforming the weighted-average speech waveform segment in accordance with the weighted-average prosodic segment; and speech waveform segment connecting means for connecting the transformed speech waveform segments to one another. With this configuration, after the prosodic segments and speech waveform segments are clustered, the attributes common to the segments in a cluster are expressed by an explanation vector generated from the attribute vectors of the segments in that cluster, and are applied to speech synthesis.

[0023] The speech synthesis apparatus of the present invention comprises: segment storage means for storing segments taken from a speech corpus containing prosodic segments and speech waveform segments; explanation vector storage means for storing the explanation vectors of the segments; pointer storage means for storing pointers indicating the correspondence between the segments and the explanation vectors; text input means for inputting text; paralanguage input means for inputting paralinguistic information; text analysis means for analyzing the input text; target attribute vector generating means for generating a target attribute vector for each segment unit of the synthesized speech based on the analysis result of the text analysis means and the input paralinguistic information; optimal approximation coefficient calculating means for calculating the optimal approximation coefficients with which the explanation vectors of the segments optimally approximate the target attribute vector of each segment unit of the synthesized speech; segment weighted averaging means for taking weighted averages of the prosodic segments and of the speech waveform segments based on the optimal approximation coefficients; speech waveform segment transforming means for transforming the weighted-average speech waveform segment in accordance with the weighted-average prosodic segment; and speech waveform segment connecting means for connecting the transformed speech waveform segments to one another. With this configuration, after the prosodic segments and speech waveform segments are clustered, the attributes common to the segments in a cluster are expressed by an explanation vector generated from the attribute vectors of the segments in that cluster, and are applied to speech synthesis.

[0024] The speech synthesis program of the present invention causes a computer to execute: a segment storing step of storing segments from a segment database; an explanation vector storing step of storing the explanation vectors of the segments; a pointer storing step of storing pointers indicating the correspondence between the segments and the explanation vectors; a text input step of inputting text; a paralanguage input step of inputting paralinguistic information; a text analysis step of analyzing the input text; a target attribute vector generating step of generating a target attribute vector for each segment unit of the synthesized speech based on the analysis result of the text analysis step and the input paralinguistic information; an inner product calculating step of calculating the inner products of the generated target attribute vector with all of the explanation vectors; a maximum inner product segment selecting step of selecting the prosodic segment and the speech waveform segment that give the maximum inner product; a speech waveform segment transforming step of transforming the speech waveform segment in accordance with the selected prosodic segment; and a speech waveform segment connecting step of connecting the transformed speech waveform segments to one another. With this configuration, the computer clusters the prosodic segments and speech waveform segments, after which the attributes common to the segments in a cluster are expressed by an explanation vector generated from the attribute vectors of the segments in that cluster and are applied to speech synthesis.

[0025]

[Embodiments of the Invention] Embodiments of the present invention are described below.

[0026] (Embodiment 1) A speech synthesis method according to a first embodiment of the present invention is described with reference to the flowchart of FIG. 1. First, in step S101, an attribute vector is assigned to each segment of the speech corpus. The attribute vector contains items obtained automatically from linguistic analysis as well as paralinguistic information. Examples of the former include part of speech, dependency relation, dependency target, number of morae in the accent phrase, and accent type. Examples of the latter include tone of voice, speaking style, and emotion, which are judged by a human listener. Table 1 shows an example of this attribute vector. As with a_k in Equation (1) below, each entry of the attribute vector is 1 if the segment has the corresponding attribute factor and 0 if it does not.

[Equation 1]

where k = 1, 2, 3, ..., N; N is the number of data items belonging to the cluster; δ_ki = 1 if the segment has the specified attribute; and δ_ki = 0 if the segment does not have the specified attribute.
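The 0/1 encoding of step S101 can be illustrated with a short sketch. The list of attribute factors below is invented for the example (the patent's Table 1 is not reproduced in this text), as is the function name.

```python
# Hypothetical attribute factors: some from linguistic analysis, some paralinguistic.
ATTRIBUTE_FACTORS = [
    "noun", "verb", "accent_type_0", "accent_type_1", "polite_tone", "joyful",
]


def attribute_vector(segment_attributes, factors=ATTRIBUTE_FACTORS):
    """Return 1 for each factor the segment has and 0 for each it lacks."""
    present = set(segment_attributes)
    return [1 if factor in present else 0 for factor in factors]


print(attribute_vector({"noun", "accent_type_1", "joyful"}))  # [1, 0, 0, 1, 0, 1]
```

Each position in the list plays the role of one δ_ki value for the segment.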

[0027] Next, in step S102, the segments of the speech corpus are clustered according to auditory detection limits. Here, the auditory detection limit means, for example, a time of about 5 ms for phoneme duration, which is one kind of prosodic segment, and an average spectral difference of about 3 dB for speech segments. Next, in step S103, an explanation vector is assigned to each cluster. The explanation vector is defined by Equation (2) and is associated with the representative value of the cluster.

[Equation 2]

where

[Equation 3]

and r_i is the normalization coefficient for the i-th element and represents the total number of data items in the cluster.

[0028] As equation (2) shows, the segment database is built from explanation vectors in which the representative value of each cluster is taken to be the centroid of the cluster.
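
The body of equation (2) is not reproduced in this text, so the following is only a sketch under an assumption: each element of the explanation vector is taken to be the number of segments in the cluster that have attribute i, divided by the normalization coefficient r_i (the number of data items in the cluster). The function name and the sample cluster are hypothetical.

```python
def explanation_vector(attribute_vectors):
    """Per-attribute relative frequency within one cluster (assumed form of eq. (2))."""
    n_data = len(attribute_vectors)        # r_i: total number of data items in the cluster
    n_factors = len(attribute_vectors[0])
    return [
        sum(a[i] for a in attribute_vectors) / n_data
        for i in range(n_factors)
    ]

# Four segments in one cluster, three attribute factors (hypothetical data).
cluster = [[1, 0, 1], [1, 1, 0], [1, 0, 0], [1, 0, 1]]
e = explanation_vector(cluster)
print(e)  # [1.0, 0.25, 0.5]
```

Under this reading, one vector per cluster replaces the attribute vectors of all its members, which is where the memory saving claimed by the method would come from.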

[0029] Next, in step S104, linguistic processing, utterance style, tone-of-voice instructions, and the like are input. Next, in step S105, a target vector g_j is generated for each segment of the synthesized speech from the linguistic processing, utterance style, tone-of-voice instructions, and the like input in S104. Next, in step S106, the inner product of the target attribute vector of the segment and the explanation vector is calculated as shown in equation (4).

[Equation 4] Next, in step S107, the inner product p₁ calculated in step S106 is normalized by the sum c of the inner products to obtain the coefficient w₁.

[Equation 5] Here, the sum c of the inner products is given by equation (6).

[Equation 6] Further, in step S107, the coefficient w₁ is calculated for each segment c₁ of the segment database, each segment c₁ is multiplied by its coefficient w₁, and the products are summed to generate the segment u_k of the synthesized speech as shown in equation (7).

[Equation 7] Next, in step S108, speech segments are modified according to the target frequency and phoneme duration, and prosodic segments are joined smoothly with a taper window at the boundaries between segments, so that the desired synthesized speech is generated.
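
Steps S106 and S107 can be sketched as follows. Since the equation images are not available here, the exact forms of equations (4) to (7) are assumed from the prose: inner products, normalization by their sum, then a weighted sum of cluster representatives. The function names and the scalar "representatives" (phoneme durations in msec) are illustrative choices.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))

def blend_segment(target, explanation_vectors, representatives):
    products = [dot(target, e) for e in explanation_vectors]   # S106, eq. (4)
    total = sum(products)                                      # sum c, eq. (6)
    if total == 0:
        raise ValueError("target shares no attribute with any cluster")
    weights = [p / total for p in products]                    # S107, eq. (5)
    blended = sum(w * r for w, r in zip(weights, representatives))  # eq. (7)
    return weights, blended

target = [1, 0, 1]                                   # target vector g_j
explanation_vectors = [[1.0, 0.0, 0.5], [0.5, 1.0, 0.0]]
representatives = [20.0, 30.0]                       # hypothetical durations in msec
weights, duration = blend_segment(target, explanation_vectors, representatives)
print(weights, duration)  # [0.75, 0.25] 22.5
```

Because the weights sum to 1, the generated value always lies between the smallest and largest representative, which is consistent with the stated aim of reducing variation between segments.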

[0030] As described above, according to the speech synthesis method of this embodiment, the prosodic segments and the speech waveform segments are clustered, and the attributes common to the segments in a cluster are then expressed by an explanation vector generated from the attribute vectors of the segments in that cluster. Synthesized speech can therefore be generated with a small memory capacity, and because the variation between segments can be reduced, high-quality synthesized speech can be generated.

[0031] By implementing the above speech synthesis method as a program, a computer can be made to execute the speech synthesis of this embodiment.

[0032] (Embodiment 2) A speech synthesis method according to a second embodiment of the present invention will be described with reference to the flowchart of FIG. 2. First, in step S101, an attribute vector is assigned to each segment of the speech corpus. The attribute vector consists of items obtained automatically from linguistic analysis together with paralinguistic information. Examples of the former include part of speech, dependency relation, dependency target, the number of morae in the accent phrase, and accent type. Examples of the latter include tone of voice, utterance style, and emotion, which a human judges by listening. Table 1 shows an example of this attribute vector. As with a_k in equation (1) above, each element of the attribute vector is set to 1 when the segment has the corresponding attribute factor and to 0 when it does not.

[0033] Next, in step S102, the segments of the speech corpus are clustered according to the auditory detection limit. Here, the auditory detection limit is, for example, a time of about 5 msec for phoneme duration, which is one kind of prosodic segment, and an average spectral difference of about 3 dB for speech segments. Next, in step S103, an explanation vector is assigned to each cluster. The explanation vector is defined by equation (2) above and is associated with the representative value of the cluster. As equation (2) shows, the segment database is built from explanation vectors in which the representative value of each cluster is taken to be the centroid of the cluster. Next, in step S104, linguistic processing, utterance style, tone-of-voice instructions, and the like are input. Next, in step S105, a target vector g_j is generated for each segment of the synthesized speech from the linguistic processing, utterance style, tone-of-voice instructions, and the like input in S104. Next, in step S106, the inner product of the target attribute vector of the segment and the explanation vector is calculated as shown in equation (4) above.

[0034] Next, in step S201, among the inner products obtained in step S106, the segment that gives the maximum inner product is adopted as the segment for generating the synthesized speech. That is, because the segment that maximizes the inner product has the attributes closest to those targeted for the synthesized speech, it is used as the segment u_k of the synthesized speech, as shown in equation (8).

[Equation 8] Next, in step S108, speech segments are modified according to the target frequency and phoneme duration, and prosodic segments are joined smoothly with a taper window at the boundaries between segments, so that the desired synthesized speech is generated.
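
The selection in step S201 amounts to an argmax over the inner products. The sketch below assumes that equation (8) picks the representative of the cluster whose explanation vector gives the largest inner product with the target vector. The function name and the data are hypothetical.

```python
def select_segment(target, explanation_vectors, representatives):
    """Return (index, representative) of the cluster with the largest inner product."""
    products = [
        sum(t * e for t, e in zip(target, ev))
        for ev in explanation_vectors
    ]
    best = max(range(len(products)), key=products.__getitem__)  # first maximum on ties
    return best, representatives[best]

target = [1, 0, 1]
explanation_vectors = [[1.0, 0.0, 0.5], [0.5, 1.0, 0.0]]
representatives = [20.0, 30.0]   # hypothetical phoneme durations in msec
index, segment = select_segment(target, explanation_vectors, representatives)
print(index, segment)  # 0 20.0
```

Unlike the weighted sum of Embodiment 1, this variant returns one stored segment unchanged, so no blending arithmetic is needed at synthesis time.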

[0035] As described above, according to the speech synthesis method of this embodiment, the prosodic segments and the speech waveform segments are clustered, and the attributes common to the segments in a cluster are then expressed by an explanation vector generated from the attribute vectors of the segments in that cluster. In addition, the segment that maximizes the inner product is used as the segment of the synthesized speech, so the attributes of the prosodic segments and speech waveform segments are expressed simply. Synthesized speech can therefore be generated with a small memory capacity, and because the variation between segments can be reduced, high-quality synthesized speech can be generated.

[0036] By implementing the above speech synthesis method as a program, a computer can be made to execute the speech synthesis of this embodiment.

[0037] (Embodiment 3) A speech synthesis method according to a third embodiment of the present invention will be described with reference to the flowchart of FIG. 3. First, in step S101, an attribute vector is assigned to each segment of the speech corpus. The attribute vector consists of items obtained automatically from linguistic analysis together with paralinguistic information. Examples of the former include part of speech, dependency relation, dependency target, the number of morae in the accent phrase, and accent type. Examples of the latter include tone of voice, utterance style, and emotion, which a human judges by listening. Table 1 shows an example of this attribute vector. As with a_k in equation (1) above, each element of the attribute vector is set to 1 when the segment has the corresponding attribute factor and to 0 when it does not.

[0038] Next, in step S102, the segments of the speech corpus are clustered according to the auditory detection limit. Here, the auditory detection limit is, for example, a time of about 5 msec for phoneme duration, which is one kind of prosodic segment, and an average spectral difference of about 3 dB for speech segments. Next, in step S103, an explanation vector is assigned to each cluster. The explanation vector is defined by equation (2) above and is associated with the representative value of the cluster. As equation (2) shows, the segment database is built from explanation vectors in which the representative value of each cluster is taken to be the centroid of the cluster. Next, in step S104, linguistic processing, utterance style, tone-of-voice instructions, and the like are input. Next, in step S105, a target vector g_j is generated for each segment of the synthesized speech from the linguistic processing, utterance style, tone-of-voice instructions, and the like input in S104. Next, in step S301, the coefficients w₁ to w_n that best approximate the target vector by the explanation vectors of the segments in the least-squares sense are calculated by equation (9).

[Equation 9] Next, in step S302, the segments are multiplied by these optimal approximation coefficients and summed to generate the segment u_k of the synthesized speech, as shown in equation (10).

[Equation 10] Next, in step S108, speech segments are modified to match the target frequency and phoneme duration, and prosodic segments are joined smoothly with a taper window at the boundaries between segments, so that the desired synthesized speech is generated.
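
The least-squares fit of step S301 can be sketched in plain Python by solving the normal equations. This is one standard way to obtain least-squares coefficients and is not necessarily the formulation of equation (9), whose body is not reproduced here. The sketch assumes the explanation vectors are linearly independent, and the helper names and data are hypothetical.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("explanation vectors are linearly dependent")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):                     # back substitution
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x

def least_squares_weights(target, explanation_vectors):
    """Weights w minimizing || target - sum_l w_l * e_l ||^2 (normal equations)."""
    def dot(u, v):
        return sum(p * q for p, q in zip(u, v))
    gram = [[dot(ei, ej) for ej in explanation_vectors] for ei in explanation_vectors]
    rhs = [dot(ei, target) for ei in explanation_vectors]
    return solve_linear(gram, rhs)

explanation_vectors = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
target = [1.0, 1.0, 0.0]
weights = least_squares_weights(target, explanation_vectors)   # approximately [1.0, 0.5]

# S302: multiply the cluster representatives by the weights and add them up.
representatives = [20.0, 30.0]   # hypothetical values
segment = sum(w * r for w, r in zip(weights, representatives))
```

Note that these weights are not constrained to sum to 1. Whether the original equation (9) imposes such a constraint cannot be determined from the surviving text.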

[0039] As described above, according to the speech synthesis method of this embodiment, the prosodic segments and the speech waveform segments are clustered, and the attributes common to the segments in a cluster are then expressed by an explanation vector generated from the attribute vectors of the segments in that cluster. In addition, the segment of the synthesized speech is generated using the optimal approximation coefficients that best approximate the target vector by the explanation vectors of the segments. Synthesized speech can therefore be generated with a small memory capacity, and because the variation between segments can be reduced, high-quality synthesized speech can be generated.

[0040] By implementing the above speech synthesis method as a program, a computer can be made to execute the speech synthesis of this embodiment.

[0041] (Embodiment 4) In the speech synthesis method according to a fourth embodiment of the present invention, the mode of the cluster is used as the cluster representative value in place of the centroid used in the first to third embodiments.

[0042] As an example of using the mode as the cluster representative value, consider the phoneme duration data contained in prosodic segments. Table 2 shows example representative values of phoneme duration: the left column gives the phoneme duration and the right column gives the number of data items within a given cluster. For example, the cluster contains two data items with a phoneme duration of 12 msec and one data item with a phoneme duration of 15 msec. In Table 2, the centroid, calculated as the mean of the phoneme durations weighted by their counts, is 18.4 msec. The mode is 20 msec, because the 10 data items with a phoneme duration of 20 msec form the largest group. When the data in a cluster contain outliers and are skewed, as in Table 2, using the mode rather than the centroid yields more stable synthesized speech, so it is preferable to use the mode as the cluster representative value in place of the centroid described in the first to third embodiments. Conversely, when the data in a cluster are not skewed, the mode and the centroid often coincide, so it is generally preferable to use the mode as the cluster representative value.
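
The comparison between centroid and mode can be checked with a few lines of Python. Table 2 itself is not reproduced in this text, so the histogram below uses only the three counts quoted in the paragraph (12 msec: 2, 15 msec: 1, 20 msec: 10). Treating these as the whole cluster is an assumption, although it does reproduce the stated centroid of 18.4 msec.

```python
from collections import Counter

def centroid(counts):
    """Count-weighted mean of the values in a {value: count} histogram."""
    total = sum(counts.values())
    return sum(value * n for value, n in counts.items()) / total

def mode(counts):
    """Value with the largest count."""
    return max(counts, key=counts.get)

# {phoneme duration in msec: number of data items}, from the counts quoted in the text.
durations = Counter({12: 2, 15: 1, 20: 10})

print(round(centroid(durations), 1))  # 18.4
print(mode(durations))                # 20
```

The few short outliers pull the centroid below the value that almost every item actually has, which is the effect the embodiment avoids by using the mode.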

[0043] As described above, according to this embodiment, stable synthesized speech can be generated by representing each cluster by its mode.

[0044] By implementing the above speech synthesis method as a program, a computer can be made to execute the speech synthesis of this embodiment.

[0045] (Embodiment 5) A speech synthesis method according to a fifth embodiment of the present invention will be described with reference to the flowchart of FIG. 4. The method of this embodiment concerns the processing performed when generating the attribute vectors and explanation vectors described in the first to third embodiments. In place of the normalization coefficients r₁ to r_n of the explanation vector elements in the segment database, the total number of occurrences of the corresponding factor in the original speech corpus is used.

[0046] First, in step S401, the i-th elements of the explanation vectors are compared with one another. Next, in step S402, a predetermined statistical method, for example the chi-square test, is used to judge whether the difference between the i-th elements compared in step S401 is significantly large. The judgment is based on a predetermined threshold. When the difference in the i-th element is significantly large, the process proceeds to step S404 and moves on to the next element to be compared. When the difference is not significantly large, the process proceeds to step S403, where it is judged whether all the explanation vectors have been compared. If all the explanation vectors have been compared, the process proceeds to step S406; if not, it proceeds to step S405 and the explanation vector being compared is changed.

[0047] Next, in step S406, an i-th element whose difference is not significantly large is excluded from the attribute vectors and the explanation vectors. That is, when the element can be regarded as identical at a predetermined significance level, the attribute of that element occurred with a common frequency in every cluster and therefore did not contribute to the formation of the clusters. The element is accordingly treated as meaningless and is not used in the explanation vectors or attribute vectors. Next, in step S407, it is judged whether all elements have been compared. If not, the process proceeds to step S408 and moves on to the next element to be compared; if so, the process ends.
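
The pruning of steps S401 to S408 can be sketched as below. For brevity the sketch replaces the pairwise loop of the flowchart with a single Pearson chi-square test per element across all clusters, which is a simplification. The threshold is passed in by the caller, matching the "predetermined threshold" of the text, and the counts are hypothetical. The value 3.841 is the 5% critical value of the chi-square distribution with one degree of freedom, which applies here because two clusters are compared.

```python
def chi_square_statistic(present_counts, cluster_sizes):
    """Pearson chi-square for a 2 x C table (attribute present/absent per cluster)."""
    total = sum(cluster_sizes)
    total_present = sum(present_counts)
    total_absent = total - total_present
    if total_present == 0 or total_absent == 0:
        return 0.0   # attribute is constant everywhere: no difference between clusters
    stat = 0.0
    for present, size in zip(present_counts, cluster_sizes):
        expected_present = size * total_present / total
        expected_absent = size * total_absent / total
        stat += (present - expected_present) ** 2 / expected_present
        stat += ((size - present) - expected_absent) ** 2 / expected_absent
    return stat

def significant_elements(count_matrix, cluster_sizes, threshold):
    """Indices of elements whose frequency differs significantly between clusters."""
    keep = []
    for i in range(len(count_matrix[0])):
        column = [row[i] for row in count_matrix]
        if chi_square_statistic(column, cluster_sizes) > threshold:
            keep.append(i)
    return keep

# count_matrix[c][i]: number of segments in cluster c that have attribute i (hypothetical).
count_matrix = [[18, 10, 2], [3, 10, 17]]
cluster_sizes = [20, 20]
kept = significant_elements(count_matrix, cluster_sizes, threshold=3.841)
print(kept)  # [0, 2]
```

Element 1 occurs equally often in both clusters, so it is dropped from the attribute and explanation vectors, shortening every stored vector by one element.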

[0048] As described above, according to this embodiment, elements that can be regarded as identical at a predetermined statistical significance level are treated as meaningless and are not used in the explanation vectors or attribute vectors. The explanation vectors and attribute vectors can thus be made an optimal size, so synthesized speech can be generated with a small memory capacity, and because the variation between segments can be reduced, high-quality synthesized speech can be generated.

[0049] By implementing the above speech synthesis method as a program, a computer can be made to execute the speech synthesis of this embodiment.

[0050] (Embodiment 6) A speech synthesis method according to a sixth embodiment of the present invention will be described with reference to the flowchart of FIG. 5. The method of this embodiment concerns the handling of the explanation vectors described in the first to third embodiments. In place of the normalization coefficients r₁ to r_n of the explanation vector elements in the segment database, the total number of occurrences of the corresponding factor in the original speech corpus is used.

[0051] First, in step S501, the explanation vectors are compared with one another. Next, in step S502, a predetermined statistical method, for example the chi-square test, is used to judge whether the difference between the explanation vectors compared in step S501 is significantly large. The judgment is based on a predetermined threshold. When the difference between the explanation vectors is significantly large, the process proceeds to step S504 and moves on to the next explanation vector to be compared. When the difference is not significantly large, the process proceeds to step S503, where it is judged whether all the explanation vectors have been compared. If all the explanation vectors have been compared, the process proceeds to step S506; if not, it proceeds to step S505 and the explanation vector being compared is changed. Next, in step S506, the clusters whose explanation vectors do not differ significantly are merged, and a new segment database is built. That is, when clusters can be regarded as identical at a predetermined significance level, the clustering by the auditory detection limit described in the first embodiment was too strict: the clusters are considered to have been forcibly split because of the large variation in the original segment data. The size of the segment database can therefore be reduced by aggregating the segment data that can be regarded as identical at the predetermined significance level and using their mean as the cluster representative value.
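
The merging of step S506 can be sketched as a greedy pass over the clusters. To keep the example short, the significance test is replaced by a simple stand-in (largest gap in relative attribute frequency against a tolerance); this stand-in is not the chi-square test named in the text, and the dictionary layout, function names, and data are all hypothetical.

```python
def relative_frequencies(cluster):
    return [count / cluster["size"] for count in cluster["counts"]]

def differ(a, b, tolerance=0.3):
    """Stand-in for the significance test: largest gap in relative frequency."""
    gaps = [abs(x - y) for x, y in zip(relative_frequencies(a), relative_frequencies(b))]
    return max(gaps) > tolerance

def merge_clusters(clusters, differ=differ):
    """Merge each cluster into the first earlier cluster it does not differ from."""
    merged = []
    for c in clusters:
        for m in merged:
            if not differ(m, c):
                total = m["size"] + c["size"]
                # The mean of the pooled data becomes the new representative value.
                m["rep"] = (m["rep"] * m["size"] + c["rep"] * c["size"]) / total
                m["counts"] = [x + y for x, y in zip(m["counts"], c["counts"])]
                m["size"] = total
                break
        else:
            merged.append(dict(c))   # no similar cluster found: keep it separate
    return merged

clusters = [
    {"counts": [18, 2], "size": 20, "rep": 20.0},
    {"counts": [9, 1], "size": 10, "rep": 23.0},
    {"counts": [1, 9], "size": 10, "rep": 40.0},
]
merged = merge_clusters(clusters)
print([(m["size"], m["rep"]) for m in merged])  # [(30, 21.0), (10, 40.0)]
```

The first two clusters have the same attribute profile and collapse into one entry, so the database shrinks from three representatives to two.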

[0052] As described above, according to this embodiment, the size of the segment database can be reduced by merging the clusters whose explanation vectors can be regarded as identical at a predetermined statistical significance level. Synthesized speech can therefore be generated with a small memory capacity, and because the variation between segments can be reduced, high-quality synthesized speech can be generated.

[0053] Note that the processing of the explanation vectors described above may cause the variation between representative values to exceed the auditory detection limit, which can give the synthesized speech a rough sound quality. Which categories to merge and remove should therefore be decided while checking the result by listening.

[0054] By implementing the above speech synthesis method as a program, a computer can be made to execute the speech synthesis of this embodiment.

[0055] (Embodiment 7) A speech synthesis method according to a seventh embodiment of the present invention will be described with reference to the flowchart of FIG. 6. The method of this embodiment concerns the attribute factors described in the first to third embodiments. In place of the normalization coefficients r₁ to r_n of the explanation vector elements in the segment database, the total number of occurrences of the corresponding factor in the original speech corpus is used.

[0056] First, in step S601, two explanation vectors are selected arbitrarily. Next, in step S602, the i-th elements of the explanation vectors are compared. Next, in step S603, a predetermined statistical method, for example the chi-square test, is used to judge whether the difference between the i-th elements compared in step S602 is significantly large. The judgment is based on a predetermined threshold. If the difference in the i-th element is significantly large, the process proceeds to step S604; if not, it proceeds to step S609. Next, in step S604, it is judged whether the comparison with all the explanation vectors has been completed. If so, the process proceeds to step S606; if not, it proceeds to step S605. Next, in step S606, the number of new attributes to add is set to log₂N converted to an integer, where N is the number of explanation vectors for which the difference in the i-th element was judged significantly large in step S603. Next, in step S607, the new attributes are determined by comparative observation of the data belonging to those N clusters, the attribute vectors of these data are reassigned and updated, and the explanation vectors are updated as a result. Next, in step S608, the presence or absence of the new attributes is also reviewed for all data in the other clusters, the attribute vectors are reassigned and updated, and the explanation vectors are updated as a result. When step S608 is complete, the process returns to step S602.
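
The count used in step S606 follows from the fact that k binary attributes can tell apart at most 2^k clusters. The text only says that log₂N is converted to an integer, so rounding up, as done in this sketch, is an assumption. The function name is hypothetical.

```python
import math

def number_of_new_attributes(n_vectors):
    """Smallest k with 2**k >= n_vectors (log2 N rounded up; rounding is assumed)."""
    if n_vectors < 1:
        raise ValueError("n_vectors must be at least 1")
    return math.ceil(math.log2(n_vectors))

print(number_of_new_attributes(4))  # 2
print(number_of_new_attributes(5))  # 3
```

With rounding up, five clusters that differ in the i-th element would call for three new 0/1 attributes, since two attributes can separate only four clusters.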

[0057] If, on the other hand, the difference in the i-th element between the explanation vectors is judged not significantly large in step S603, the process moves on to the next element to be compared in step S609 and proceeds to step S610. In step S610, it is judged whether the comparison has been completed for all elements. If not, the process returns to step S602; if so, the process ends. If the comparison with all the explanation vectors has not been completed in step S604, the process proceeds to step S605, where the explanation vector to be compared is changed, i is set to 1, and the process returns to step S602.

[0058] As described above, according to this embodiment, comparing the i-th elements of the explanation vectors makes it possible to reliably find attribute factors that had been overlooked and to build a more effective database, so synthesized speech of good sound quality can be generated.

[0059] By implementing the above speech synthesis method as a program, a computer can be made to execute the speech synthesis of this embodiment.

[0060] (Embodiment 8) A speech synthesis device according to an eighth embodiment of the present invention will be described with reference to the block diagram of FIG. 7. First, the configuration of the device is described. As shown in FIG. 7, the speech synthesis device of this embodiment comprises: text input means 701 for inputting text; text analysis means 703 for analyzing the text; target vector generation means 704 for generating a target vector; speech waveform segment modification means 705 for modifying speech waveform segments; speech waveform segment connection means 706 for connecting speech waveform segments; paralanguage input means 702 for inputting paralinguistic information; explanation vector storage means 707 for storing explanation vectors; inner product calculation means 708 for calculating inner products of vectors; pointer storage means 709 for storing pointers; maximum inner product segment selection means 710 for selecting the segment that gives the maximum inner product; and segment storage means 711 for storing segments.

[0061] Next, the operation of the speech synthesis device of this embodiment is described. First, text is input to the text input means 701. The text is then passed to the text analysis means 703 and, after analysis, to the target vector generation means 704. Meanwhile, paralinguistic information is input to the paralanguage input means 702 in units of roughly a phrase or an accent phrase and is then passed to the target vector generation means 704. This paralinguistic information may be determined by a human or may be determined uniquely from the sentence pattern. The target vector generation means 704 then generates, for each segment of the speech to be synthesized, an attribute vector expressing the attributes of that segment.

[0062] Meanwhile, the segments stored in the segment storage means 711 and the explanation vectors stored in the explanation vector storage means 707 are associated with each other by the pointers stored in the pointer storage means 709. For each segment, the inner product calculating means 708 calculates the inner product of the attribute vector with every explanation vector in the segment database. The inner product maximum value segment selection means 710 then refers to the pointer of the segment that gave the maximum inner product and selects that segment from the prosodic segment database and the speech waveform segment database. Of the selected segments, the speech waveform segment is transformed by the speech waveform segment transforming means 705 to match the prosodic segment. The speech waveform segments are then connected to one another by the speech waveform segment connecting means 706, and synthesized speech is generated.
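
The selection described above can be sketched in Python as follows. This is only an illustrative sketch, not the implementation of the embodiment: the function names, the dictionary used as the segment store, the string pointers, and the toy vectors are all assumptions introduced for the example.

```python
from typing import Dict, List, Sequence


def inner_product(a: Sequence[float], b: Sequence[float]) -> float:
    """Plain dot product of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x * y for x, y in zip(a, b))


def select_max_inner_product(
    target: Sequence[float],
    explanation_vectors: List[Sequence[float]],
    pointers: List[str],
    segment_store: Dict[str, dict],
) -> dict:
    """Return the stored segment whose explanation vector gives the
    largest inner product with the target attribute vector."""
    scores = [inner_product(target, e) for e in explanation_vectors]
    best = max(range(len(scores)), key=scores.__getitem__)
    # pointers[i] links explanation vector i to its stored segment
    return segment_store[pointers[best]]


# Hypothetical toy data: three explanation vectors, four attribute factors.
explanation_vectors = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.5],
    [0.5, 0.5, 1.0, 0.0],
]
pointers = ["seg_a", "seg_b", "seg_c"]
segment_store = {
    "seg_a": {"prosody": [100.0, 80.0], "waveform": [0.1, 0.2]},
    "seg_b": {"prosody": [200.0, 50.0], "waveform": [0.3, 0.1]},
    "seg_c": {"prosody": [130.0, 110.0], "waveform": [0.2, 0.4]},
}

target = [1.0, 0.0, 1.0, 0.0]
selected = select_max_inner_product(
    target, explanation_vectors, pointers, segment_store
)
```

With these toy values the inner products are 1.0, 0.0 and 1.5, so the segment reached through the third pointer is selected. Transformation and concatenation of the waveform segments are outside the scope of the sketch.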

[0063] As described above, the speech synthesizer of this embodiment finds the segment that maximizes the inner product between the target vector of the input text and the explanation vectors generated from the attribute vectors of the segments, and uses it as a segment of the synthesized speech. The segment closest to the attributes targeted for the synthesized speech is therefore obtained, and high-quality synthesized speech can be generated.

[0064] (Embodiment 9) A speech synthesizer according to a ninth embodiment of the present invention will be described with reference to the block diagram of FIG. 8. First, the configuration of the speech synthesizer of this embodiment will be described. As shown in FIG. 8, the speech synthesizer of this embodiment includes text input means 701 for inputting text, text analysis means 703 for analyzing the text, target vector generation means 704 for generating a target vector, speech waveform segment transforming means 705 for transforming speech waveform segments, speech waveform segment connecting means 706 for connecting speech waveform segments, paralanguage input means 702 for inputting paralinguistic information, explanation vector storage means 707 for storing explanation vectors, inner product calculating means 708 for calculating inner products of vectors, pointer storage means 709 for storing pointers, segment weighted averaging means 801 for averaging segments using weighting coefficients normalized by dividing by the sum of the inner products, and segment storage means 711 for storing segments.

[0065] Next, the operation of the speech synthesizer of this embodiment will be described. First, text is input to the text input means 701. It is then input to the text analysis means 703 and subsequently to the target vector generation means 704. Meanwhile, paralinguistic information is input to the paralanguage input means 702 for each unit about the size of a phrase or an accent phrase, and is then input to the target vector generation means 704. This paralinguistic information may be determined by a human, or may be determined uniquely from the sentence pattern. The target vector generation means 704 then generates, for each segment of the speech to be synthesized, an attribute vector expressing the attributes of that segment.

[0066] Meanwhile, the segments stored in the segment storage means 711 and the explanation vectors stored in the explanation vector storage means 707 are associated with each other by the pointers stored in the pointer storage means 709. For each segment, the inner product calculating means 708 calculates the inner product of the attribute vector with every explanation vector in the segment database. The segment weighted averaging means 801 then averages the segments using weighting coefficients normalized by dividing each inner product by the sum of the inner products, and a weighted-average segment is generated. Next, the speech waveform segment is transformed by the speech waveform segment transforming means 705, the speech waveform segments are connected to one another by the speech waveform segment connecting means 706, and synthesized speech is generated.
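
The weighting step above can be sketched in Python as follows. This is a sketch under stated assumptions: segments are represented as fixed-length parameter lists (here a pitch value and a duration, chosen only for illustration), the inner products are assumed non-negative with a non-zero sum, and all names are hypothetical.

```python
from typing import List, Sequence, Tuple


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def weighted_average_segment(
    target: Sequence[float],
    explanation_vectors: List[Sequence[float]],
    segments: List[Sequence[float]],
) -> Tuple[List[float], List[float]]:
    """Average the segments with weights equal to each inner product
    divided by the sum of all inner products."""
    scores = [dot(target, e) for e in explanation_vectors]
    total = sum(scores)
    if total == 0:
        raise ValueError("sum of inner products is zero; weights undefined")
    weights = [s / total for s in scores]
    dim = len(segments[0])
    averaged = [
        sum(w * seg[i] for w, seg in zip(weights, segments))
        for i in range(dim)
    ]
    return weights, averaged


# Hypothetical toy data: three explanation vectors and the segment
# parameters (pitch, duration) that the pointers associate with them.
explanation_vectors = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 1.0, 0.0],
]
segments = [[100.0, 80.0], [200.0, 50.0], [130.0, 110.0]]
target = [1.0, 0.0, 1.0, 0.0]

weights, averaged = weighted_average_segment(
    target, explanation_vectors, segments
)
```

Here the inner products are 1, 0 and 2, giving weights 1/3, 0 and 2/3. Unlike selecting only the maximum, every segment with a non-zero inner product contributes to the generated segment.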

[0067] As described above, the speech synthesizer of this embodiment calculates the inner products between the target vector of the input text and the explanation vectors generated from the attribute vectors of the segments, and takes a weighted average of the segments. This prevents the occurrence of clusters having the same representative value and reduces the variation between segments, so high-quality synthesized speech can be generated.

[0068] (Embodiment 10) A speech synthesizer according to a tenth embodiment of the present invention will be described with reference to the block diagram of FIG. 9. First, the configuration of the speech synthesizer of this embodiment will be described. As shown in FIG. 9, the speech synthesizer of this embodiment includes text input means 701 for inputting text, text analysis means 703 for analyzing the text, target vector generation means 704 for generating a target vector, speech waveform segment transforming means 705 for transforming speech waveform segments, speech waveform segment connecting means 706 for connecting speech waveform segments, paralanguage input means 702 for inputting paralinguistic information, explanation vector storage means 707 for storing explanation vectors, optimum approximation coefficient calculating means 901 for calculating optimum approximation coefficients that optimally approximate the target vector with the explanation vectors of the segments, pointer storage means 709 for storing pointers, segment weighted averaging means 902 for taking a weighted average of the segments using the optimum approximation coefficients, and segment storage means 711 for storing segments.

[0069] Next, the operation of the speech synthesizer of this embodiment will be described. First, text is input to the text input means 701. It is then input to the text analysis means 703 and subsequently to the target vector generation means 704. Meanwhile, paralinguistic information is input to the paralanguage input means 702 for each unit about the size of a phrase or an accent phrase, and is then input to the target vector generation means 704. This paralinguistic information may be determined by a human, or may be determined uniquely from the sentence pattern. The target vector generation means 704 then generates, for each segment of the speech to be synthesized, an attribute vector expressing the attributes of that segment.

[0070] Meanwhile, the segments stored in the segment storage means 711 and the explanation vectors stored in the explanation vector storage means 707 are associated with each other by the pointers stored in the pointer storage means 709. The optimum approximation coefficient calculating means 901 calculates the optimum approximation coefficients that optimally approximate the target vector with the explanation vectors of the segments, and outputs them to the segment weighted averaging means 902. The segment weighted averaging means 902 generates a segment that is a weighted average taken with the optimum approximation coefficients. Next, the speech waveform segment is transformed by the speech waveform segment transforming means 705, the speech waveform segments are connected to one another by the speech waveform segment connecting means 706, and synthesized speech is generated.
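
The sketch below illustrates one possible reading of this step in Python. It assumes that "optimally approximate" means a least-squares fit of the target vector by a linear combination of the explanation vectors, solved through the normal equations, and that the resulting coefficients are normalized by their sum before averaging. Both choices are assumptions made for the example, as are all function names and toy values.

```python
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def solve_linear(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("explanation vectors are linearly dependent")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def approximation_coefficients(
    target: Sequence[float], explanation_vectors: List[Sequence[float]]
) -> List[float]:
    """Least-squares coefficients c minimizing |target - sum_k c_k e_k|^2
    (assumed meaning of the optimum approximation coefficients)."""
    gram = [[dot(ei, ej) for ej in explanation_vectors]
            for ei in explanation_vectors]
    rhs = [dot(ei, target) for ei in explanation_vectors]
    return solve_linear(gram, rhs)


def average_with_coefficients(
    coeffs: Sequence[float], segments: List[Sequence[float]]
) -> List[float]:
    """Weighted average of segments, normalized by the coefficient sum
    (the normalization is an assumption of this sketch)."""
    total = sum(coeffs)
    if total == 0:
        raise ValueError("coefficients sum to zero; average undefined")
    dim = len(segments[0])
    return [
        sum(c * seg[i] for c, seg in zip(coeffs, segments)) / total
        for i in range(dim)
    ]


# Hypothetical toy data: two explanation vectors, three attribute factors.
explanation_vectors = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
segments = [[100.0, 80.0], [200.0, 40.0]]
target = [1.0, 0.25, 0.3]

coeffs = approximation_coefficients(target, explanation_vectors)
averaged = average_with_coefficients(coeffs, segments)
```

For the toy data the fitted coefficients are 0.75 and 0.25. The third attribute factor of the target cannot be represented by either explanation vector, so it remains as approximation error.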

[0071] As described above, the speech synthesizer of this embodiment calculates the optimum approximation coefficients that optimally approximate the target vector of the input text with the explanation vectors generated from the attribute vectors of the segments, and takes a weighted average of the segments using these coefficients to obtain a segment of the synthesized speech. This prevents the occurrence of clusters having the same representative value and reduces the variation between segments, so high-quality synthesized speech can be generated.

[0072]

[Effects of the Invention] As described above, according to the present invention, the prosodic segments and speech waveform segments are clustered, and the attributes common to the segments in a cluster are then expressed by an explanation vector generated from the attribute vectors of the segments in that cluster. Synthesized speech can therefore be generated with a small memory capacity, and because the variation between segments can be reduced, high-quality synthesized speech can be generated.
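
As a small illustration of how one explanation vector can summarize a cluster, the Python sketch below follows the variant recited in claim 4: the attribute vectors of the segments in a cluster are added, and each attribute factor is divided by the number of segments in the cluster. The function name, the 0/1 encoding of attribute factors, and the toy cluster are assumptions made for the example.

```python
from typing import List, Sequence


def explanation_vector(
    cluster_attribute_vectors: List[Sequence[float]],
) -> List[float]:
    """Sum the attribute vectors of a cluster and divide each attribute
    factor by the number of segments in the cluster."""
    n = len(cluster_attribute_vectors)
    if n == 0:
        raise ValueError("cluster is empty")
    dim = len(cluster_attribute_vectors[0])
    summed = [
        sum(v[i] for v in cluster_attribute_vectors) for i in range(dim)
    ]
    return [s / n for s in summed]


# Hypothetical cluster of four segments; each attribute factor is
# encoded as 1.0 (present) or 0.0 (absent).
cluster = [
    [1.0, 0.0, 1.0],
    [1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0],
    [1.0, 0.0, 1.0],
]
explanation = explanation_vector(cluster)
```

Only one vector and one representative segment need to be kept per cluster, instead of every segment and its attribute vector.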

[Brief description of drawings]

FIG. 1 is a flowchart of the speech synthesis method according to the first embodiment of the present invention.

FIG. 2 is a flowchart of the speech synthesis method according to the second embodiment of the present invention.

FIG. 3 is a flowchart of the speech synthesis method according to the third embodiment of the present invention.

FIG. 4 is a flowchart of the speech synthesis method according to the fifth embodiment of the present invention.

FIG. 5 is a flowchart of the speech synthesis method according to the sixth embodiment of the present invention.

FIG. 6 is a flowchart of the speech synthesis method according to the seventh embodiment of the present invention.

FIG. 7 is a block diagram of the speech synthesizer according to the eighth embodiment of the present invention.

FIG. 8 is a block diagram of the speech synthesizer according to the ninth embodiment of the present invention.

FIG. 9 is a block diagram of the speech synthesizer according to the tenth embodiment of the present invention.

FIG. 10 is a flowchart of a conventional speech synthesis method.

[Explanation of symbols]

701 Text input means
702 Paralanguage input means
703 Text analysis means
704 Target vector generation means
705 Speech waveform segment transforming means
706 Speech waveform segment connecting means
707 Explanation vector storage means
708 Inner product calculating means
709 Pointer storage means
710 Inner product maximum value segment selection means
711 Segment storage means
801, 902 Segment weighted averaging means
901 Optimum approximation coefficient calculating means

Claims

[Claims]

1. A speech synthesis method comprising: an attribute vector assigning step of assigning an attribute vector including predetermined attribute factors and paralinguistic information to each segment of a speech corpus including prosodic segments and speech waveform segments; a clustering step of clustering the segments to which the attribute vectors have been assigned; a cluster representative value calculating step of calculating a cluster representative value of the segments belonging to each cluster obtained by the clustering; an explanation vector generating step of generating an explanation vector based on the attribute vectors of the segments belonging to each cluster obtained by the clustering; a target attribute vector generating step of generating a target attribute vector for each segment of the synthesized speech; an optimum approximation coefficient calculating step of calculating, for each segment of the speech corpus, an optimum approximation coefficient that optimally approximates the target attribute vector with the explanation vectors; and a synthesized speech segment generating step of generating a segment of the synthesized speech based on the optimum approximation coefficient.

2. The speech synthesis method according to claim 1, wherein the attribute vector assigned in the attribute vector assigning step includes, for each of the predetermined attribute factors, an expression of whether or not that attribute factor is present.

3. The speech synthesis method according to claim 1, wherein the clustering step includes a step of performing clustering based on an auditory detection limit.

4. The speech synthesis method according to claim 1, wherein the explanation vector generating step includes a step of generating, for each cluster, a vector whose attribute factors are obtained by dividing each attribute factor of the vector obtained by adding the attribute vectors of the segments belonging to that cluster by the total number of segments belonging to that cluster, and setting the generated vector as the explanation vector of that cluster.

5. The speech synthesis method according to claim 1, wherein the explanation vector generating step includes a step of generating, for each cluster, a vector whose attribute factors are obtained by dividing each attribute factor of the vector obtained by adding the attribute vectors of the segments belonging to that cluster by the total number of data items in the speech corpus in which that attribute factor occurs, and setting the generated vector as the explanation vector of that cluster.

6. The speech synthesizing method according to claim 1, wherein the cluster representative value calculating step includes a step of using a center of gravity of the cluster as a representative segment of the cluster.

7. The speech synthesis method according to claim 1, wherein the cluster representative value calculating step includes a step of setting a mode value of the cluster as a representative segment of the cluster.

8. A speech synthesis method comprising: an attribute vector assigning step of assigning an attribute vector including predetermined attribute factors and paralinguistic information to each segment of a speech corpus including prosodic segments and speech waveform segments; a clustering step of clustering the segments to which the attribute vectors have been assigned; a cluster representative value calculating step of calculating a cluster representative value of the segments belonging to each cluster obtained by the clustering; an explanation vector generating step of generating an explanation vector based on the attribute vectors of the segments belonging to each cluster obtained by the clustering; an explanation vector attribute factor comparing step of comparing the attribute factors of the explanation vectors with one another; a target attribute vector generating step of generating a target attribute vector for each segment of the synthesized speech; an optimum approximation coefficient calculating step of calculating, for each segment of the speech corpus, an optimum approximation coefficient that optimally approximates the target attribute vector with the explanation vectors; and a synthesized speech segment generating step of generating a segment of the synthesized speech based on the optimum approximation coefficient, wherein, in the explanation vector attribute factor comparing step, when there is an attribute factor that is common to all the explanation vectors generated in the explanation vector generating step and can be regarded as identical at a predetermined statistical significance level, the attribute factor regarded as identical is excluded from the attribute factors of the explanation vectors and the attribute vectors.

9. The speech synthesis method according to claim 8, wherein, in the explanation vector attribute factor comparing step, when a plurality of the explanation vectors generated in the explanation vector generating step can be regarded as identical at a predetermined statistical significance level, the clusters corresponding to the explanation vectors regarded as identical are merged into one cluster.

10. The speech synthesis method according to claim 8, wherein the explanation vector attribute factor comparing step includes, when a plurality of the explanation vectors generated in the explanation vector generating step can be regarded as identical at a predetermined statistical significance level: a procedure of obtaining the number of explanation vectors regarded as identical; a procedure of calculating the base-2 logarithm of that number; and a procedure of provisionally adding to the segments a number of new attribute factors equal to the integer obtained by converting the result of the logarithm calculation into an integer, and wherein an attribute vector including the added attribute factors is assigned to the segments again in the attribute vector assigning step.

11. The speech synthesis method according to claim 1 or claim 8, further comprising: an inner product calculating step of calculating the inner products of the target attribute vector of each synthesized speech segment generated in the target attribute vector generating step and the explanation vectors of the clusters of the speech corpus; and a segment selecting step of selecting the representative segment of the cluster whose explanation vector gives the maximum inner product among the calculated inner products.

12. The speech synthesis method according to claim 1 or claim 8, further comprising: an inner product calculating step of calculating the inner products of the target attribute vector of each synthesized speech segment generated in the target attribute vector generating step and the explanation vectors of the clusters of the speech corpus; and a step of generating a synthesized speech segment by taking a weighted average of the representative segments of the clusters, using as weights the values obtained by dividing each calculated inner product by the sum of the inner products.

13. The speech synthesis method according to claim 1 or claim 8, further comprising: an optimum approximation coefficient calculating step of calculating optimum approximation coefficients that optimally approximate the target attribute vector of each synthesized speech segment generated in the target attribute vector generating step with the explanation vectors of the clusters of the speech corpus; and a synthesized speech segment generating step of generating a synthesized speech segment by taking a weighted average of the representative segments based on the calculated optimum approximation coefficients.

14. A speech synthesizer comprising: representative segment storage means for storing representative segments of clusters of segments from a speech corpus including prosodic segments and speech waveform segments; explanation vector storage means for storing explanation vectors of the representative segments; pointer storage means for storing pointers indicating the correspondence between the representative segments and the explanation vectors; text input means for inputting text; paralanguage input means for inputting paralinguistic information; text analysis means for analyzing the input text; target attribute vector generating means for generating a target attribute vector for each segment of the synthesized speech based on the analysis result of the text analysis means and the input paralinguistic information; inner product calculating means for calculating the inner products of the generated target attribute vector and all the explanation vectors; inner product maximum value segment selection means for selecting the representative prosodic segment and the representative speech waveform segment that give the maximum inner product; speech waveform segment transforming means for transforming the speech waveform segment according to the selected prosodic segment; and speech waveform segment connecting means for connecting the transformed speech waveform segments.

15. A speech synthesizer comprising: representative segment storage means for storing representative segments of clusters of segments from a speech corpus including prosodic segments and speech waveform segments; explanation vector storage means for storing explanation vectors of the representative segments; pointer storage means for storing pointers indicating the correspondence between the representative segments and the explanation vectors; text input means for inputting text; paralanguage input means for inputting paralinguistic information; text analysis means for analyzing the input text; target attribute vector generating means for generating a target attribute vector for each segment of the synthesized speech based on the analysis result of the text analysis means and the input paralinguistic information; inner product calculating means for calculating the inner products of the generated target attribute vector and all the explanation vectors; segment weighted averaging means for taking a weighted average of the representative prosodic segments and the representative speech waveform segments based on the calculated inner products; speech waveform segment transforming means for transforming the weighted-average speech waveform segment according to the weighted-average prosodic segment; and speech waveform segment connecting means for connecting the transformed speech waveform segments.

16. A speech synthesizer comprising: segment storage means for storing segments; explanation vector storage means for storing explanation vectors of the segments; pointer storage means for storing pointers indicating the correspondence between the segments and the explanation vectors; text input means for inputting text; paralanguage input means for inputting paralinguistic information; text analysis means for analyzing the input text; target attribute vector generating means for generating a target attribute vector for each segment of the synthesized speech based on the analysis result of the text analysis means and the input paralinguistic information; optimum approximation coefficient calculating means for calculating optimum approximation coefficients that optimally approximate the target attribute vector of each synthesized speech segment with the explanation vectors of the segments; segment weighted averaging means for taking a weighted average of the prosodic segments and the speech waveform segments based on the optimum approximation coefficients; speech waveform segment transforming means for transforming the weighted-average speech waveform segment according to the weighted-average prosodic segment; and speech waveform segment connecting means for connecting the transformed speech waveform segments.

17. A speech synthesis program for causing a computer to execute: a segment storing step of storing segments from a segment database; an explanation vector storing step of storing explanation vectors of the segments; a pointer storing step of storing pointers indicating the correspondence between the segments and the explanation vectors; a text inputting step of inputting text; a paralanguage inputting step of inputting paralinguistic information; a text analyzing step of analyzing the input text; a target attribute vector generating step of generating a target attribute vector for each segment of the synthesized speech based on the analysis result of the text analyzing step and the input paralinguistic information; an inner product calculating step of calculating the inner products of the generated target attribute vector and all the explanation vectors; an inner product maximum value segment selecting step of selecting the prosodic segment and the speech waveform segment that give the maximum inner product; a speech waveform segment transforming step of transforming the speech waveform segment according to the selected prosodic segment; and a speech waveform segment connecting step of connecting the transformed speech waveform segments.