JP4403996B2

JP4403996B2 - Prosody pattern generation apparatus, prosody pattern generation method, and prosody pattern generation program

Info

Publication number: JP4403996B2
Application number: JP2005096228A
Authority: JP
Inventors: 康行三井; 聡塚田; 玲史近藤
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2005-03-29
Filing date: 2005-03-29
Publication date: 2010-01-27
Anticipated expiration: 2025-03-29
Also published as: JP2006276493A

Abstract

PROBLEM TO BE SOLVED: To dynamically generate prosodic patterns with high naturalness for input phonetic symbol strings.

SOLUTION: An attribute information extraction means 12 extracts attribute information from the input phonetic symbol string. A similarity calculation means 13 calculates the distance between the attribute information of each prosodic pattern in a prosody database 16 and the attribute information of the phonetic symbol string, and from it derives a weight for each prosodic pattern. A prosodic pattern generation means 14 generates a new prosodic pattern by continuously combining the prosodic patterns according to the calculated weights. A waveform generation means 15 controls prosody with the newly generated prosodic pattern to generate the waveform of a synthesized sound.

COPYRIGHT: (C)2007,JPO&INPIT

Description

The present invention relates to speech processing technology, and more particularly to a technique for generating prosodic patterns.

In text-to-speech synthesis, synthesized speech is generated by controlling the speech waveform with a prosodic pattern estimated for the input text. A prosodic pattern is a pattern of the temporal changes in voice pitch (fundamental frequency), duration, intensity, and the like. Prosodic patterns strongly affect the naturalness of synthesized speech, so speech synthesis research has long studied prosodic pattern generation in order to improve that naturalness. In addition, because a prosodic pattern determines the tone of speech and is unique to the speaker, speech recognition research has also studied techniques that use prosodic patterns to identify the meaning of speech or the speaker, as seen for example in Non-Patent Document 1.

In the field of speech synthesis, research has long been conducted on techniques that extract prosodic parameters from the waveforms of natural utterances, store them in a database, and use the stored natural-utterance prosodic parameters at synthesis time. Because the prosodic patterns are extracted from natural speech waveforms, this approach has the advantage of using highly natural prosodic patterns. However, it is physically impossible to store a matching pattern for every possible input text, so the prosodic patterns must be transformed, and this transformation degrades their naturalness. Techniques for improving the naturalness of prosodic patterns have been proposed to address this problem.

For example, Patent Document 1 discloses a learning method in which, so that a prosodic pattern for the synthesized sound can be selected at synthesis time, a representative pattern is generated in advance for each cluster of prosodic patterns in a prosody database. The representative pattern is generated by evaluating the error between a transformed version of the representative pattern and the prosodic patterns extracted from speech data (see FIG. 13).

Patent Document 2 discloses a method of creating a representative prosodic pattern from a fixed number of prosodic patterns, taken in order of closeness to a specific prosodic pattern, either by averaging them or by weighting them according to their closeness.

Furthermore, Patent Document 3 discloses a method of calculating the prosodic parameters for a clause of input text data from the prosodic parameters of clauses whose similarity to the attribute information of that input clause is at or above a reference value (see FIG. 14). Unlike the conventional examples above, this method calculates the similarity between the input text and the prosodic parameters in the speech corpus at synthesis time, and is therefore said to yield more natural prosodic patterns.

Sadaoki Furui, "Digital Speech Processing", Tokai University Press, 1985
JP-A-11-95783
JP 2004-117663 A
JP 2000-56788 A

The techniques in Patent Documents 1 and 2 use representative patterns created in advance at synthesis time. The prosodic pattern of the synthesized sound can therefore only be chosen from patterns that represent a category, and when the optimal prosodic pattern for the text actually entered by the user differs greatly from the representative patterns prepared in advance, the naturalness and stability of the synthesized sound suffer.

With the technique in Patent Document 3, the prosodic pattern is generated at synthesis time, so a large mismatch with the optimal prosodic pattern for the input text is less likely. However, the calculation of the prosodic pattern does not use the similarity between the attribute information of the input text data and that of the selected prosodic patterns. Consequently, even if the prosody database contains a pitch pattern whose attribute information is particularly similar to that of the input phonetic symbol string, a pitch pattern resembling it is not generated, and it is difficult to obtain highly natural synthesized sound.

Accordingly, an object of the present invention is to provide a prosody pattern generation apparatus, a prosody pattern generation method, and a prosody pattern generation program that dynamically generate prosodic patterns that are highly natural for the input text while maintaining high stability, and, more desirably, that also reduce the computational load.

To achieve this object, the prosody pattern generation apparatus of the present invention comprises:
a prosody database that stores, for each segment serving as a constituent unit of a sentence, a prosodic pattern in association with attribute information, divided in advance into categories;
attribute information extraction means for extracting attribute information from an input phonetic symbol string;
category selection means for identifying which category in the category-divided prosody database the input phonetic symbol string belongs to;
similarity calculation means for calculating, only over the database of the category identified by the category selection means, a similarity between the attribute information of the prosodic patterns in the prosody database and the attribute information extracted from the input phonetic symbol string, in accordance with the importance of the attribute information; and
prosodic pattern generation means for generating a new prosodic pattern by combining the prosodic patterns in the database of the category identified by the category selection means, weighted according to the similarity,
wherein the weighting assigns a larger weight to prosodic patterns with a larger similarity and a smaller weight to prosodic patterns with a smaller similarity.

In this way, an optimal prosodic pattern is generated each time according to the attribute information of the input phonetic symbol string, without using representative patterns created in advance, so prosodic patterns with higher naturalness can be reproduced while stability is maintained.
In particular, when a new prosodic pattern is generated, the similarity between the attribute information of the prosodic patterns in the prosody database and the attribute information extracted from the input phonetic symbol string is calculated in accordance with the importance of the attribute information, and the prosodic patterns in the prosody database are combined with weights that are larger for patterns with a larger similarity and smaller for patterns with a smaller similarity. Therefore, when the prosody database contains a prosodic pattern whose attribute information is particularly similar to that of the input phonetic symbol string, a prosodic pattern closely resembling it is generated, and a highly natural prosodic pattern for the input phonetic symbol string can be reproduced.
Even when the prosody database contains no prosodic pattern whose attribute information is similar to that of the phonetic symbol string, a prosodic pattern resembling an average of multiple prosodic patterns in the prosody database is generated, so speech synthesis remains stable.
As a result, synthesized sound that is highly natural for the input phonetic symbol string can be created while high stability is maintained.

In particular, because the candidate prosodic patterns used for combination can be limited to those within the selected category, the computational load is greatly reduced, which improves processing speed and reduces the required storage capacity.

The speech synthesizer of the present invention shares its main part with the prosody pattern generation apparatus described above, and further has waveform generation means for generating a speech waveform by controlling prosody with the prosodic pattern generated by the prosodic pattern generation means.

This provides a speech synthesizer that can create synthesized sound that is highly natural for the input phonetic symbol string while maintaining high stability.

To achieve the same object, the prosody pattern generation method of the present invention comprises:
a step of extracting attribute information from an input phonetic symbol string;
a determination step of determining which category in a prosody database divided in advance into categories the input phonetic symbol string belongs to;
a step of calculating a similarity between the attribute information stored in advance for each prosodic pattern in the prosody database of the category identified in the determination step and the attribute information extracted from the input phonetic symbol string, in accordance with the importance of the attribute information; and
a step of generating a new prosodic pattern by combining the prosodic patterns in the prosody database of the category identified in the determination step, weighted according to the similarity,
wherein the weighting in the step of generating the new prosodic pattern assigns a larger weight to prosodic patterns with a larger similarity and a smaller weight to prosodic patterns with a smaller similarity.

Because an optimal prosodic pattern is generated each time according to the attribute information of the input phonetic symbol string, without using representative patterns created in advance, prosodic patterns with higher naturalness can be reproduced while stability is maintained. Moreover, the similarity between the attribute information of the prosodic patterns in the prosody database and the attribute information extracted from the input phonetic symbol string is calculated in accordance with the importance of the attribute information, and the prosodic patterns in the prosody database are combined with weights that are larger for patterns with a larger similarity and smaller for patterns with a smaller similarity. Therefore, when the prosody database contains a prosodic pattern whose attribute information is particularly similar to that of the input phonetic symbol string, a prosodic pattern closely resembling it is generated, and a highly natural prosodic pattern for the input phonetic symbol string can be reproduced. Even when the prosody database contains no prosodic pattern whose attribute information is similar to that of the phonetic symbol string, a prosodic pattern resembling an average of multiple prosodic patterns in the prosody database is generated, so speech synthesis remains stable. Synthesized sound that is highly natural for the input phonetic symbol string can thus be created while high stability is maintained.

In particular, because the candidate prosodic patterns used for combination can be limited to those within the selected category, the computational load can be greatly reduced, which improves processing speed and reduces the required storage capacity.

The speech synthesis method of the present invention shares its main part with the prosody pattern generation method described above, and further includes a step of generating a speech waveform by controlling prosody with the generated prosodic pattern.

This speech synthesis method makes it possible to create synthesized sound that is highly natural for the input phonetic symbol string while maintaining high stability.

To achieve the same object, the prosody pattern generation program of the present invention causes a computer constituting a prosody pattern generation apparatus to execute:
a process of extracting attribute information from an input phonetic symbol string;
a determination process of determining which category in a prosody database divided in advance into categories the input phonetic symbol string belongs to;
a process of calculating a similarity between the attribute information stored in advance for each prosodic pattern in the prosody database of the category identified in the determination process and the attribute information extracted from the input phonetic symbol string, in accordance with the importance of the attribute information; and
a process of generating a new prosodic pattern by combining the prosodic patterns in the database of the category identified in the determination process, with weights set according to the similarity so that they are larger for prosodic patterns with a larger similarity and smaller for prosodic patterns with a smaller similarity.

A computer on which this prosody pattern generation program is installed functions as the prosody pattern generation apparatus described above.

The speech synthesis program of the present invention shares its main part with the prosody pattern generation program described above, and further gives the computer the function of generating a speech waveform by controlling prosody with the prosodic pattern generated by the prosodic pattern generation means.

A computer on which this speech synthesis program is installed functions as the speech synthesizer described above.

The prosody pattern generation apparatus, prosody pattern generation method, and prosody pattern generation program of the present invention generate an optimal prosodic pattern each time according to the attribute information of the input phonetic symbol string, without using representative patterns created in advance, and can therefore reproduce prosodic patterns with higher naturalness while maintaining stability.

In particular, when a new prosodic pattern is generated, the similarity between the attribute information of the prosodic patterns in the prosody database and the attribute information extracted from the input phonetic symbol string is calculated in accordance with the importance of the attribute information, and the prosodic patterns in the prosody database are combined with weights that are larger for patterns with a larger similarity and smaller for patterns with a smaller similarity. Therefore, when the prosody database contains a prosodic pattern whose attribute information is particularly similar to that of the input phonetic symbol string, a prosodic pattern closely resembling it is generated, and a highly natural prosodic pattern for the input phonetic symbol string can be reproduced. Furthermore, because the candidate prosodic patterns used for combination are limited to those within the selected category, the computational load is greatly reduced, which improves processing speed and reduces the required storage capacity.

Even when the prosody database contains no prosodic pattern whose attribute information is similar to that of the phonetic symbol string, a prosodic pattern resembling an average of multiple prosodic patterns in the prosody database is generated, so speech synthesis remains stable.

Synthesized sound that is highly natural for the input phonetic symbol string can therefore be created while high stability is maintained.

Next, the best mode for carrying out the present invention will be described with reference to the drawings. FIG. 1 is a block diagram outlining the configuration of a computer 1 that functions as a speech synthesizer when a speech synthesis program implementing the speech synthesis method of the present invention is installed on it.

The main part of this speech synthesis program is a prosody pattern generation program to which the present invention is applied. The computer 1 on which the speech synthesis program is installed therefore also functions as a prosody pattern generation apparatus that applies the prosody pattern generation method of the present invention.

The computer 1 is an ordinary workstation, personal computer, or the like. It includes a microprocessor (hereinafter simply called the CPU) 2 as arithmetic means, a ROM 3 storing the basic control program for the CPU 2, a RAM 4 used for temporary storage of calculation data and the like, a hard disk 5 as a mass storage device, and an interface 6 for connecting to various external devices, networks, and the like.

The speech synthesis program is installed on the hard disk 5 and is loaded into the RAM 4 as needed. The CPU 2, driven and controlled according to the speech synthesis program, implements the functions of the attribute information extraction means, the similarity calculation means, the prosodic pattern generation means, the waveform generation means, and so on. The prosody database is assumed to be stored on the hard disk 5 in advance.

A keyboard 7 and a monitor 8, which serve as the man-machine interface, are connected to the CPU 2 via an input/output circuit 9, and a speaker 11 serving as audio output means is connected to the input/output circuit 9 via a driver 10.

FIG. 2 is a functional block diagram outlining the functions of the CPU 2 as driven and controlled by the speech synthesis program. Its main parts are the attribute information extraction means 12, the similarity calculation means 13, the prosodic pattern generation means 14, and the waveform generation means 15.

The attribute information extraction means 12 analyzes a phonetic symbol string input from the RAM 4, from the hard disk 5, or from an external device via the interface 6, and extracts attribute information for each segment.
The phonetic symbol string consists mainly of reading information, and additionally contains at least attribute information such as accent position information, together with information on the segment boundaries of the sentence.
A segment here means a linguistic or acoustic division within a sentence, such as a clause or an accent phrase, and is the basic constituent unit of a sentence.
Attribute information means linguistic or acoustic parameters such as the accent position information mentioned above, linguistic information, and duration information.
The prosody database 16 on the hard disk 5 stores, segment by segment, a large number of prosodic pattern parameters extracted from natural utterances such as the human voice, the contents of their attribute information, and the correspondence between the two.
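The per-segment storage described above could be modeled roughly as follows. This is only an illustrative sketch: the class names, the choice of attributes (mora count, accent position, duration), and the representation of a prosodic pattern as a list of F0 values are assumptions, not details given in the patent.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SegmentAttributes:
    # Hypothetical attribute set; the patent only names accent position,
    # linguistic information and duration information as examples.
    mora_count: int
    accent_position: int
    duration_ms: float


@dataclass
class ProsodyEntry:
    # One segment: attribute information paired with its prosodic pattern.
    attributes: SegmentAttributes
    f0_contour: List[float]  # assumed representation: F0 samples in Hz


@dataclass
class ProsodyDatabase:
    entries: List[ProsodyEntry] = field(default_factory=list)

    def add(self, entry: ProsodyEntry) -> None:
        self.entries.append(entry)


db = ProsodyDatabase()
db.add(ProsodyEntry(SegmentAttributes(4, 1, 520.0), [220.0, 240.0, 210.0, 180.0]))
```

Keeping the attributes and the pattern in one record mirrors the stated requirement that the database hold the correspondence between the two for every segment.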

The similarity calculation means 13 calculates the distance in feature space between the attribute information of each segment of the input phonetic symbol string, as extracted by the attribute information extraction means 12, and the attribute information of each segment-level prosodic pattern in the prosody database 16.
A feature is a numerical encoding of attribute information that allows the similarity between pieces of attribute information to be measured as a distance. In what follows, the similarity between pieces of attribute information is expressed as a feature-space distance: the smaller the distance, the greater the similarity, and the larger the distance, the smaller the similarity.
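As a rough illustration of a feature-space distance, the sketch below uses a weighted Euclidean distance in which each attribute is scaled by an importance factor. The patent states only that the similarity reflects the importance of the attribute information; the distance formula, the attribute names, and the importance values here are assumptions.

```python
import math
from typing import Dict


def feature_space_distance(a: Dict[str, float],
                           b: Dict[str, float],
                           importance: Dict[str, float]) -> float:
    """Weighted Euclidean distance over the attributes listed in `importance`."""
    return math.sqrt(sum(importance[k] * (a[k] - b[k]) ** 2 for k in importance))


# Hypothetical attributes and importance values.
importance = {"mora_count": 2.0, "accent_position": 1.0}
query = {"mora_count": 4.0, "accent_position": 1.0}
stored = {"mora_count": 4.0, "accent_position": 3.0}

d = feature_space_distance(query, stored, importance)
```

A smaller value of `d` corresponds to a greater similarity, matching the convention in the text.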

The prosodic pattern generation means 14 uses the feature-space distances obtained by the similarity calculation means 13 to combine the prosodic patterns in the prosody database 16, weighted according to similarity, and thereby generates a new prosodic pattern. One possible combination method is to add the prosodic patterns in the prosody database 16 together with weights that depend on the feature-space distance.
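The weighted addition could look like the following sketch. The inverse-distance weighting, the normalization of the weights, the small `eps` that avoids division by zero, and the assumption that all patterns have already been brought to the same length are illustrative choices; the patent does not specify the weighting function.

```python
from typing import List, Tuple


def combine_patterns(patterns: List[List[float]],
                     distances: List[float],
                     eps: float = 1e-6) -> Tuple[List[float], List[float]]:
    """Combine equal-length patterns with normalized inverse-distance weights."""
    length = len(patterns[0])
    if any(len(p) != length for p in patterns):
        raise ValueError("patterns must have the same length")
    # Smaller distance (greater similarity) gives a larger weight.
    raw = [1.0 / (d + eps) for d in distances]
    total = sum(raw)
    weights = [w / total for w in raw]
    combined = [sum(w * p[t] for w, p in zip(weights, patterns))
                for t in range(length)]
    return combined, weights


# Two stored patterns at equal distance are simply averaged.
averaged, _ = combine_patterns([[100.0, 200.0], [200.0, 400.0]], [1.0, 1.0])

# A stored pattern at distance 0 dominates the result.
dominated, weights = combine_patterns([[100.0, 200.0], [200.0, 400.0]], [0.0, 10.0])
```

With this weighting, a near-exact attribute match reproduces the stored pattern almost unchanged, while a set of equally distant patterns yields something close to their average, which is the behavior the description attributes to the invention.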

The waveform generation means 15 generates synthesized sound using the new prosodic pattern generated by the prosodic pattern generation means 14.

The operation of the best mode for carrying out the present invention will now be described with reference to FIGS. 2 and 3.

First, the user inputs the phonetic symbol string of the synthesized sound to be created into the attribute information extraction means 12 (step a1 in FIG. 3).

Next, the attribute information extraction means 12 analyzes the input phonetic symbol string and extracts attribute information for each segment (step a2).

The similarity calculation means 13 then obtains the feature-space distance between the extracted attribute information and the attribute information of each prosodic pattern in the prosody database 16 (step a3).

Next, using the feature-space distances obtained by the similarity calculation means 13, the prosodic pattern generation means 14 weights and combines the prosodic patterns in the prosody database 16 to generate a new prosodic pattern (step a4).

Finally, the waveform generation means 15 controls prosody based on the new prosodic pattern created by the prosodic pattern generation means 14 and generates the synthesized sound (step a5).
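Steps a1 to a5 can be strung together as in the sketch below. Everything concrete in it is an assumption made for illustration: the input is taken to be already split into segments given as a list of morae plus an accent position, the distance is an unweighted Euclidean distance, the weights are normalized inverse distances, and the "waveform generation" is a stand-in that emits a sine wave whose frequency follows the generated F0 pattern rather than a real synthesizer.

```python
import math


def extract_attributes(segment):
    # Step a2 (toy version): derive attributes from (morae, accent position).
    morae, accent_position = segment
    return {"mora_count": float(len(morae)), "accent_position": float(accent_position)}


def distance(a, b):
    # Step a3: feature-space distance (unweighted Euclidean, as an assumption).
    return math.sqrt(sum((a[k] - b[k]) ** 2 for k in a))


def generate_pattern(attrs, database, eps=1e-6):
    # Step a4: combine stored patterns with normalized inverse-distance weights.
    raw = [1.0 / (distance(attrs, e["attrs"]) + eps) for e in database]
    total = sum(raw)
    length = len(database[0]["pattern"])
    return [sum((w / total) * e["pattern"][t] for w, e in zip(raw, database))
            for t in range(length)]


def synthesize(pattern, sample_rate=8000, samples_per_frame=400):
    # Step a5 (stand-in): a sine wave whose frequency follows the F0 pattern.
    samples, phase = [], 0.0
    for f0 in pattern:
        for _ in range(samples_per_frame):
            phase += 2.0 * math.pi * f0 / sample_rate
            samples.append(math.sin(phase))
    return samples


# Hypothetical database entries with three-frame F0 patterns in Hz.
database = [
    {"attrs": {"mora_count": 4.0, "accent_position": 1.0}, "pattern": [240.0, 220.0, 190.0]},
    {"attrs": {"mora_count": 4.0, "accent_position": 3.0}, "pattern": [180.0, 230.0, 200.0]},
]

# Step a1: toy input, one four-mora segment with the accent on the first mora.
segments = [(["ka", "ma", "ki", "ri"], 1)]

waveform = []
for segment in segments:
    attrs = extract_attributes(segment)
    pattern = generate_pattern(attrs, database)
    waveform.extend(synthesize(pattern))
```

Because the input segment's attributes match the first database entry exactly, the generated pattern stays very close to that entry's stored pattern.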

In this embodiment, the prosodic pattern that controls the prosody of the synthesized sound is generated dynamically each time a phonetic symbol string is input, so a prosodic pattern that is optimal for the input phonetic symbol string can be generated.

In addition, because the natural-utterance prosodic patterns in the prosody database 16 are weighted and combined, when the prosody database 16 contains a prosodic pattern for an utterance whose attribute information is identical or very similar to that of the input phonetic symbol string, the result is a prosodic pattern that strongly reflects the characteristics of that pattern, in other words one that fits the input phonetic symbol string extremely well. Conversely, when the prosody database 16 contains no prosodic pattern that closely resembles the input phonetic symbol string, the result resembles an average of the characteristics of multiple prosodic patterns, drawn from those registered in the prosody database 16, whose attribute information is comparatively similar to that of the input. A speech synthesizer that secures both high naturalness and high stability of the prosodic pattern can therefore be realized.

FIG. 4 shows a functional block diagram of another embodiment. In this embodiment, the CPU 2 functions as the attribute information extraction means 12, the similarity calculation means 13, the prosodic pattern generation means 14, and the waveform generation means 15 described above, and additionally as a category selection means 17.
In place of the prosody database 16 described above, a category-divided prosody database 18 is registered on the hard disk 5.

カテゴリ分割された韻律データベース１８は、前述した韻律データベース１６と同等の韻律データベースを予めカテゴリ分割して記憶している。カテゴリ分割の方法としては、モーラ数やアクセント型等の属性情報で分割する方法がある。 The category-divided prosody database 18 stores a prosody database equivalent to the prosody database 16 described above in advance by category division. As a category division method, there is a method of dividing by attribute information such as the number of mora and accent type.

Based on the attribute information that the attribute information extraction means 12 extracts from the input phonetic symbol string, the category selection means 17 determines which category in the category-divided prosodic database 18 each segment of the input phonetic symbol string belongs to, and identifies the database of the corresponding category.

The similarity calculation means 13 then calculates the distance in the feature space between the attribute information of each segment of the input phonetic symbol string and the attribute information of each segment-level prosodic pattern in the prosodic database 18, considering only the database of the category identified by the category selection means 17. The prosodic pattern generation means 14 performs weighting according to the result calculated by the similarity calculation means 13 and generates a prosodic pattern by combining only the prosodic patterns that belong to the category selected by the category selection means 17.

図５を参照して、この実施形態の動作を説明する。但し、ステップｂ１〜ステップｂ２およびステップｂ４〜ステップｂ６の処理は、夫々、前述した実施形態のステップａ１〜ステップａ２およびステップａ３〜ステップａ５の処理（図３参照）と同様であるので、重複する部分の説明は省略し、ここでは、新たに追加した構成に関わるステップｂ３の動作のみを説明する。 The operation of this embodiment will be described with reference to FIG. However, since the processing of step b1 to step b2 and step b4 to step b6 is the same as the processing of step a1 to step a2 and step a3 to step a5 (see FIG. 3) of the above-described embodiment, they overlap. Description of the portion is omitted, and only the operation of step b3 related to the newly added configuration will be described here.

In step b3, the category selection means 17 determines, based on the attribute information extracted from the input phonetic symbol string, which category in the category-divided prosodic database 18 the phonetic symbol string belongs to. The subsequent steps b4 and b5 are performed only within the category selected by this determination.

In this embodiment, a category, that is, a set of prosodic patterns with similar features, is selected before a new prosodic pattern is generated, so the calculation for the new pattern can use only prosodic patterns that already have a certain degree of similarity. This improves processing speed and makes it easier to obtain a stable prosodic pattern.
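The effect of restricting the calculation to one category can be sketched as follows. This is only an illustrative sketch: the record layout, the pattern identifiers, and the choice of (mora count, accent type) as the category key are assumptions made for the example, not details taken from the specification.

```python
from collections import defaultdict

# Hypothetical database records: (mora_count, accent_type, pattern_id).
records = [
    (5, 1, "p1"),
    (5, 1, "p2"),
    (7, 6, "p3"),
    (4, 0, "p4"),
]


def build_categories(records):
    """Partition pattern ids in advance by a (mora_count, accent_type) key."""
    categories = defaultdict(list)
    for mora_count, accent_type, pattern_id in records:
        categories[(mora_count, accent_type)].append(pattern_id)
    return dict(categories)


def candidates_for(categories, mora_count, accent_type):
    """Return only the patterns in the category that matches the input segment."""
    return categories.get((mora_count, accent_type), [])


categories = build_categories(records)
selected = candidates_for(categories, 5, 1)
# Distances are then computed for len(selected) patterns instead of len(records).
print(selected, len(selected), "of", len(records))
```

Because the partition is built once in advance, each input segment needs distance calculations only for the patterns of its own category.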

FIG. 6 is a block diagram of a speech synthesis apparatus according to an example corresponding to the embodiment shown in FIGS. 2 and 3. The configuration of the computer 1 operating as the speech synthesis apparatus and the functions of the CPU 2 are as already described, so the procedure of the speech synthesis method is described here from the method side.

In this example, it is assumed that the prosodic database 16, in which prosodic patterns are divided and stored segment by segment, has been created in advance. One accent phrase in a sentence constitutes one segment, and each registered prosodic pattern is a time series of the fundamental frequency normalized in both the time direction and the frequency direction. For each prosodic pattern, the prosodic database 16 also stores in advance attribute information including at least the number of morae of the accent phrase, its accent type, its position in the sentence, its phoneme string, and the accent type of the preceding accent phrase.
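One possible shape for such a database entry is sketched below. The specification does not say how the normalization is carried out, so the linear resampling to a fixed number of points (time direction) and the min-max scaling to the range 0 to 1 (frequency direction) are assumptions, as are the names `ProsodyEntry` and `normalize_contour` and the sample values.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ProsodyEntry:
    """One segment (accent phrase) of the prosodic database; field names are illustrative."""
    contour: List[float]            # normalized fundamental-frequency time series
    mora_count: int
    accent_type: int
    position: str                   # e.g. "initial", "medial", "final"
    phonemes: str
    prev_accent_type: Optional[int]


def normalize_contour(f0_hz, num_points=8):
    """Resample to num_points (time) and min-max scale to [0, 1] (frequency).

    Both normalization choices are assumptions made for this sketch.
    """
    if len(f0_hz) < 2 or num_points < 2:
        raise ValueError("need at least two samples and two output points")
    resampled = []
    for k in range(num_points):
        pos = k * (len(f0_hz) - 1) / (num_points - 1)
        lo = int(pos)
        hi = min(lo + 1, len(f0_hz) - 1)
        frac = pos - lo
        resampled.append(f0_hz[lo] * (1 - frac) + f0_hz[hi] * frac)
    low, high = min(resampled), max(resampled)
    span = (high - low) or 1.0      # avoid division by zero for a flat contour
    return [(value - low) / span for value in resampled]


contour = normalize_contour([100, 200, 150], num_points=5)
entry = ProsodyEntry(contour, 5, 1, "initial", "o N s e e o", None)
print(entry.contour)  # [0.0, 0.5, 1.0, 0.75, 0.5]
```

Storing every contour at the same length and scale is what allows patterns from different utterances to be combined point by point later.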

Here, prosodic patterns and attribute information are briefly explained with reference to FIG. 7. As a concrete example, take the sentence 「音声を合成します」 ("[I] will synthesize speech"; referred to below as example sentence S). Dividing example sentence S into accent phrases yields two accent phrases: 「音声を」 (first accent phrase) and 「合成します」 (second accent phrase).

Further, converting example sentence S into a hiragana string of reading information gives 「お’んせーを／ごーせーしま’す」 (o'nse-o / go-se-shima'su), where ’ is the accent mark and ／ is the accent phrase delimiter.

Extracting part of the attribute information of the first accent phrase gives a mora count of 5, an accent type of 1, and a sentence-initial position; for the second accent phrase, the mora count is 7, the accent type is 6, and the position is sentence-final.

The position in the sentence is classified as sentence-initial, sentence-final, sentence-medial, single word, and so on, and each class is assumed to be expressed numerically as a feature value representing the position in the sentence.
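The attribute values quoted for example sentence S can be reproduced from its reading string with a short sketch. The function `extract_attributes` and the position labels are hypothetical, the ASCII characters `'` and `/` stand in for the accent mark and the accent phrase delimiter, and the mora-counting rule (every kana counts as one mora except small kana such as ゃ, ゅ, ょ, which merge with the preceding kana) is a simplifying assumption.

```python
# Small kana that merge with the preceding kana into a single mora (assumed rule).
SMALL_KANA = set("ゃゅょぁぃぅぇぉゎ")


def extract_attributes(reading, accent_mark="'", boundary="/"):
    """Return mora count, accent type and sentence position for each accent phrase."""
    phrases = reading.split(boundary)
    result = []
    for index, phrase in enumerate(phrases):
        mora_count = 0
        accent_type = 0  # 0 when no accent mark is present (assumed convention)
        for ch in phrase:
            if ch == accent_mark:
                accent_type = mora_count   # the mark follows the accented mora
            elif ch not in SMALL_KANA:
                mora_count += 1
        if len(phrases) == 1:
            position = "word"
        elif index == 0:
            position = "initial"
        elif index == len(phrases) - 1:
            position = "final"
        else:
            position = "medial"
        result.append({"mora_count": mora_count,
                       "accent_type": accent_type,
                       "position": position})
    return result


segments = extract_attributes("お'んせーを/ごーせーしま'す")
for segment in segments:
    print(segment)
```

For example sentence S this gives a mora count of 5 and accent type 1 for the first accent phrase, and a mora count of 7 and accent type 6 for the second, in agreement with the values stated above.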

本実施例における一連の処理を、Ｎ個のアクセント句から成る或る１文の発音記号列が入力されたときの、ｎ番目のアクセント句（「第nアクセント句」と呼ぶ）に注目して説明する。この発音記号列には、少なくともアクセントの位置とアクセント句の区切りが属性情報として明記されている。 In the series of processes in this embodiment, paying attention to the nth accent phrase (referred to as the “nth accent phrase”) when a phonetic symbol string consisting of N accent phrases is input. explain. In this phonetic symbol string, at least an accent position and an accent phrase break are specified as attribute information.

First, attribute information is extracted from the n-th accent phrase of the input phonetic symbol string, as shown in FIG. 6 (step of extracting attribute information). The types of attribute information to be extracted include at least all of those stored for the prosodic patterns in the prosodic database 16. The value of this attribute information is denoted

$$a_j \quad (j = 1, 2, 3, \ldots, J)$$

where J is the number of types of attribute information. For example, if the attribute information consists of the four types shown in the example of FIG. 6, namely the mora count, the accent type, the position in the sentence, and the preceding accent type, then J = 4. The symbol j is the index of the type of attribute information; in this example, j = 1 denotes the mora count, j = 2 the accent type, j = 3 the position in the sentence, and j = 4 the preceding accent type.

A weight α_j is set for each type of attribute information so that, when the distance in the feature space is calculated, the distance becomes large (the similarity becomes small) when, for example, highly important attribute information such as the mora count or the accent type does not match. The weights α_j are set so as to satisfy Equation (1), that is, so that they sum to 1:

$$\sum_{j=1}^{J} \alpha_j = 1 \tag{1}$$

Next, one item of attribute information stored for a segment of a prosodic pattern in the prosodic database 16 is read out, as shown in FIG. 6. The value of this attribute information is denoted a_ij (i = 1, 2, 3, ..., I; j = 1, 2, 3, ..., J), where I is the total number of segment prosodic patterns registered in the prosodic database 16 and i is their index.

Next, the relative distance in the feature space between the attribute information of the input phonetic symbol string and the attribute information of the prosodic pattern read from the prosodic database 16 is calculated, as shown in FIG. 6 (step of calculating similarity). The relative distance is obtained by multiplying the difference in each parameter between the two sets of attribute information by its weight, summing the results, and dividing by the mora count of the accent phrase.

The relative distance d_i in the feature space between the input phonetic symbol string and the prosodic pattern with index i in the prosodic database 16 is therefore given by Equation (2):

$$d_i = \frac{1}{M} \sum_{j=1}^{J} \alpha_j \left| a_j - a_{ij} \right| \tag{2}$$

where M is the number of morae in the segment of the input phonetic symbol string.
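The relative distance described above, a weighted sum of absolute attribute differences divided by the mora count M, can be sketched in a few lines. The function name `relative_distance`, the numeric coding of the sentence position, and the sample weights are assumptions chosen for illustration.

```python
def relative_distance(input_attrs, entry_attrs, alphas, mora_count):
    """Weighted sum of |a_j - a_ij| over all attribute types, divided by M."""
    if not (len(input_attrs) == len(entry_attrs) == len(alphas)):
        raise ValueError("attribute and weight vectors must have the same length")
    total = sum(alpha * abs(a - b)
                for alpha, a, b in zip(alphas, input_attrs, entry_attrs))
    return total / mora_count


# Attribute order: mora count, accent type, position code, preceding accent type.
alphas = (0.4, 0.4, 0.1, 0.1)           # hypothetical weights that sum to 1
input_attrs = (5, 1, 0, 0)
near = relative_distance(input_attrs, (6, 1, 0, 0), alphas, mora_count=5)
far = relative_distance(input_attrs, (7, 6, 1, 1), alphas, mora_count=5)
print(near, far)
```

With these sample values, a database entry that differs only by one mora ends up much closer to the input than an entry that differs in every attribute.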

Next, the weights w_i for combining the prosodic patterns are calculated from the relative distances d_i obtained in this way. The weight w_i is given by Equation (3):

$$w_i = \frac{D}{d_i}, \qquad D = \sum_{k=1}^{I} d_k \tag{3}$$

where D is the sum of distances, that is, the value obtained by adding up the relative distances in the feature space calculated for each of the prosodic patterns i = 1 to I stored in the prosodic database 16.
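The weight calculation can be sketched as follows, with each weight equal to the sum of distances D divided by the individual distance. The text does not say what happens when a distance is exactly zero (an exact attribute match), so the small floor `eps` is an assumption added to avoid division by zero; the function name is also hypothetical.

```python
def combination_weights(distances, eps=1e-6):
    """Return w_i = D / d_i, where D is the sum of all distances.

    eps is an assumed floor for zero distances; it is not part of the
    described method.
    """
    floored = [max(d, eps) for d in distances]
    total = sum(floored)                 # D
    return [total / d for d in floored]


weights = combination_weights([0.1, 0.2, 0.4])
print(weights)  # the smallest distance receives the largest weight
```

Because the weight is inversely proportional to the distance, halving the distance of a pattern doubles its influence on the generated prosodic pattern.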

Next, each prosodic pattern with index i stored in the prosodic database 16 is multiplied by its corresponding weight w_i, and a new prosodic pattern is generated as the weighted linear combination of these patterns (step of generating a new prosodic pattern).
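A weighted linear combination of stored contours can be sketched as below. It assumes all contours have already been normalized to the same length. Dividing by the sum of the weights, so that the result stays within the range of the stored contours, is an assumption of this sketch and is not stated in the text; passing `normalize=False` gives the plain weighted sum.

```python
def combine_patterns(patterns, weights, normalize=True):
    """Point-by-point weighted sum of equal-length contours.

    normalize=True divides by the sum of the weights (an assumption of
    this sketch); normalize=False returns the raw weighted sum.
    """
    if not patterns or len(patterns) != len(weights):
        raise ValueError("need one weight per pattern")
    length = len(patterns[0])
    if any(len(p) != length for p in patterns):
        raise ValueError("all patterns must have the same length")
    scale = sum(weights) if normalize else 1
    return [sum(w * p[t] for w, p in zip(weights, patterns)) / scale
            for t in range(length)]


new_pattern = combine_patterns([[0, 1, 0], [1, 1, 1]], [3, 1])
print(new_pattern)  # [0.25, 1.0, 0.25]
```

The first pattern carries three times the weight of the second, so the result stays close to the first pattern's shape.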

そして、これと同様の処理をｎ＝１〜Ｎの第ｎアクセント句に対して繰り返し実行し、各アクセント句に対応して生成された新規の韻律パターンによって音声波形の韻律をアクセント句毎に制御し、更に、時間長等を修正して各セグメントを接続し、最終的に、Ｎ個のアクセント句から成る１文全体の合成音を生成する（音声波形を生成するステップ）。 The same processing is repeated for the nth accent phrase of n = 1 to N, and the prosody of the speech waveform is controlled for each accent phrase by the new prosodic pattern generated corresponding to each accent phrase. Further, the segments are connected by correcting the time length and the like, and finally a synthesized sound of one whole sentence composed of N accent phrases is generated (step of generating a speech waveform).

具体例として例文Ｓを参照して本実施例を説明する。本実施例の概要を図８に示す。図８（ａ）は韻律データベース１６内に登録された個々の韻律パターンの特性（正規化された基本周波数と時系列の関係）を視覚化して示したもので、また、図８（ｂ）では、特徴量空間内での韻律パターンの位置、つまり、各韻律パターンに対応して登録された属性情報の相対的な類似度に関わる情報を視覚化して示している。 As a specific example, the present embodiment will be described with reference to an example sentence S. An outline of this embodiment is shown in FIG. FIG. 8A visually shows the characteristics (relationship between normalized fundamental frequency and time series) of individual prosodic patterns registered in the prosodic database 16, and FIG. The position of the prosodic pattern in the feature amount space, that is, information related to the relative similarity of the attribute information registered corresponding to each prosodic pattern is visualized and shown.

図８（ｂ）において、点Ａ，点Ｂはそれぞれ第１アクセント句，第２アクセント句の特徴量空間内における位置である。また、図８（ａ）に示されているＡ１〜Ａ３，Ｂ１〜Ｂ３，Ｃ，Ｄの各パターンは韻律データベース１６内に記憶されている韻律パターンの内の幾つかを表したものであり、図８（ｂ）における特徴量空間の図に示されているＡ１〜Ａ３，Ｂ１〜Ｂ３と対応している。 In FIG. 8B, points A and B are positions in the feature amount space of the first accent phrase and the second accent phrase, respectively. Each pattern of A1 to A3, B1 to B3, C, and D shown in FIG. 8A represents some of the prosodic patterns stored in the prosodic database 16, This corresponds to A1 to A3 and B1 to B3 shown in the feature space diagram in FIG.

First, as shown in FIG. 8(b), for point A, which corresponds to the attribute information of the first accent phrase, the relative distances in the feature space d_AA1, d_AA2, d_AA3, d_AB1, d_AB2, d_AB3, d_AC, d_AD, ... to the attribute information of each prosodic pattern stored in the prosodic database 16 are calculated, and the weights w_AA1, w_AA2, w_AA3, w_AB1, w_AB2, w_AB3, w_AC, w_AD, ... of the prosodic patterns used to generate the new prosodic pattern are determined in accordance with Equation (2).

Similarly, for point B, which corresponds to the attribute information of the second accent phrase, d_BA1, d_BA2, d_BA3, d_BB1, d_BB2, d_BB3, d_BC, d_BD, ... are calculated, and the weights w_BA1, w_BA2, w_BA3, w_BB1, w_BB2, w_BB3, w_BC, w_BD, ... are determined in accordance with Equation (2).

このとき、図８（ｂ）に示されるように、Ａ−Ａ１間，Ａ−Ａ２間，Ａ−Ａ３間は他の点に比べて距離が短いため、この３点Ａ１，Ａ２，Ａ３に対応する韻律パターンの重みｗ_ＡＡ１，ｗ_ＡＡ２，ｗ_ＡＡ３が他に比べて大きく、しかも、特にＡ−Ａ１間の距離が際立って短いため、結果として、点Ａ１で示される韻律パターンに極めて類似した韻律パターンとなるような重み付けが決定されることになる。 At this time, as shown in FIG. 8 (b), the distance between A-A1, A-A2, and A-A3 is shorter than the other points, so it corresponds to these three points A1, A2, A3. Prosody pattern weights w _AA1 , w _AA2 , and w _AA3 are larger than others, and the distance between A-A1 is particularly short, resulting in a prosody very similar to the prosody pattern indicated by point A1. The weighting to be a pattern is determined.

同様に、Ｂ−Ｂ１間，Ｂ−Ｂ２間，Ｂ−Ｂ３間は他の点に比べて距離が短いため、この３点Ｂ１，Ｂ２，Ｂ３に対応する韻律パターンを平均化した韻律パターンが得られるような重み付けが決定される。この場合、入力された韻律パターンの属性情報に著しく類似する属性情報を有する韻律パターンは韻律データベース１６内に存在しないことになる。 Similarly, since the distance between B-B1, B-B2, and B-B3 is shorter than other points, a prosodic pattern obtained by averaging the prosodic patterns corresponding to these three points B1, B2, and B3 is obtained. The weights to be determined are determined. In this case, no prosodic pattern having attribute information remarkably similar to the input prosodic pattern attribute information exists in the prosodic database 16.

Finally, a new prosodic pattern for the first accent phrase of example sentence S is generated by weighted linear combination using the obtained weights w_AA1, w_AA2, w_AA3, w_AB1, w_AB2, w_AB3, w_AC, w_AD, ..., and a new accent phrase prosodic pattern for the second accent phrase of example sentence S is generated using the obtained weights w_BA1, w_BA2, w_BA3, w_BB1, w_BB2, w_BB3, w_BC, w_BD, ....

Next, the processing operations of the CPU 2 functioning as the attribute information extraction means 12, the similarity calculation means 13, the prosodic pattern generation means 14, and the waveform generation means 15 are described concretely in terms of the internal processing of the CPU 2, with reference to the flowchart of FIG. 9, which outlines the speech synthesis program installed on the hard disk 5.

ＣＰＵ２は、まず、ＲＡＭ４上あるいはハードディスク５上もしくは外部装置を対象として発音記号列の読み込みを開始し（ステップｃ１）、読み込んだ発音記号列を先頭から順にアクセント句に分割する（ステップｃ２）。 First, the CPU 2 starts reading a phonetic symbol string on the RAM 4, the hard disk 5, or an external device (step c1), and divides the read phonetic symbol string into accent phrases in order from the top (step c2).

従って、前述の例に従えば、Ｎ個のアクセント句から成る或る１文の第ｎ＝１番目のアクセント句が最初に読み込まれることになる。 Therefore, according to the above example, the n = 1st accent phrase of a certain sentence consisting of N accent phrases is read first.

次いで、属性情報抽出手段１２として機能するＣＰＵ２が、このアクセント句からｊ＝１〜Ｊの各属性情報ａ_ｊを抽出する（ステップｃ３）。 Next, the CPU 2 functioning as the attribute information extraction unit 12 extracts each attribute information a _j of j = 1 to J from this accent phrase (step c3).

In the example above, the attribute information a_1 (j = 1) is the mora count, a_2 (j = 2) is the accent type, a_3 (j = 3) is the position in the sentence, and a_4 (j = 4) is the preceding accent type, so four types of attribute information in total are extracted.

Next, the CPU 2 initializes to 0 the distance accumulation register D used to obtain the sum of distances D in Equation (3) (step c4). It then sets to 0 the database prosodic pattern index i, which identifies the data to be read from the prosodic database 16 (step c5), and immediately increments it by 1, giving the initial value 1 for reading the data of the first prosodic pattern registered in the prosodic database 16 (step c6).

Next, the CPU 2 initializes to 0 the accumulation register X (step c7). Register X accumulates the value of Σα_j|a_j - a_ij| in Equation (2), that is, the sum over all attribute information j = 1 to J of the difference between one item of attribute information a_j of the input phonetic symbol string and the corresponding item a_ij of the prosodic pattern with index i read from the prosodic database 16, multiplied by the predetermined weighting coefficient α_j. The CPU 2 then sets to 0 the attribute information index j, which identifies the type of attribute information of the prosodic pattern with index i (step c8), and immediately increments it by 1, giving the value 1 that denotes the first attribute information of each prosodic pattern in the prosodic database 16, which in this example is the mora count (step c9).

次いで、ＣＰＵ２は、データベース内韻律パターン特定指標ｉの現在値と属性情報特定指標ｊの現在値に基いて韻律データベース１６からインデックスｉの韻律パターンが有する第ｊ番目の属性情報ａ_ｉｊを読み出し（ステップｃ１０）、前述の式（２）におけるα_ｊ｜ａ_ｊ−ａ_ｉｊ｜の値を求め（ステップｃ１１）、この値を積算値記憶レジスタＸに加算する（ステップｃ１２）。 Next, the CPU 2 reads out the jth attribute information a _ij of the prosodic pattern of the index i from the prosodic database 16 based on the current value of the prosodic pattern specifying index i in the database and the current value of the attribute information specifying index j (step c10), the value of α _j | a _j −a _ij | in the above equation (2) is obtained (step c11), and this value is added to the integrated value storage register X (step c12).

Thus, at this point, with i = 1 and j = 1, the first attribute information a_11 of the prosodic pattern with index 1, that is, the mora count a_11 of the prosodic pattern registered first in the prosodic database 16, is read from the prosodic database 16. Its difference from the first attribute information a_1 of the input phonetic symbol string, that is, its mora count, is obtained, this difference is multiplied by the weighting coefficient α_1 (a preset value) for the mora count, and the resulting value is added to the accumulation register X.

次いで、ＣＰＵ２は、属性情報特定指標ｊの現在値が属性情報の種類の総数Ｊに達しているか否か、要するに、インデックスｉの韻律パターンが有するｊ＝１〜Ｊの全ての属性情報について、これに対応する入力された発音記号列の属性情報との差分を求めて重み付けの係数α_ｊを掛ける処理が完了しているか否かを判定する（ステップｃ１３）。 Next, the CPU 2 determines whether or not the current value of the attribute information identification index j has reached the total number J of attribute information types, that is, for all attribute information of j = 1 to J included in the prosodic pattern of the index i. It is determined whether or not the process of finding the difference from the attribute information of the input phonetic symbol string corresponding to is multiplied by the weighting coefficient α _j is completed (step c13).

そして、ステップｃ１３の判定結果が真となった場合、つまり、インデックスｉの韻律パターンが有するｊ＝１〜Ｊの全ての属性情報に関する処理が一通り終わっていない場合には、ＣＰＵ２は、属性情報特定指標ｊの値を１ずつインクリメントしながら前記と同様の処理を繰り返し実行する（ステップｃ９〜ステップｃ１３）。 If the determination result in step c13 is true, that is, if the processing regarding all attribute information of j = 1 to J included in the prosodic pattern of index i has not been completed, the CPU 2 The same processing as described above is repeatedly executed while incrementing the value of the specific index j by 1 (step c9 to step c13).

そして、最終的に、ステップｃ１３の判定結果が偽となって属性情報特定指標ｊの値が属性情報の種類の総数Ｊに達した時点で、式（２）におけるΣα_ｊ｜ａ_ｊ−ａ_ｉｊ｜の値が積算値記憶レジスタＸによって求められることになる。 Finally, when the determination result in step c13 is false and the value of the attribute information identification index j reaches the total number J of attribute information types, Σα _j | a _j −a _ij in equation (2) The value of | is obtained by the integrated value storage register X.

Accordingly, when the determination result in step c13 becomes false, the CPU 2 functioning as the similarity calculation means 13 divides the value of the accumulation register X, that is, the value of Σα_j|a_j - a_ij| in Equation (2), by the mora count M of the segment of the input phonetic symbol string. This gives the value of d_i in Equation (2), that is, the relative distance d_i in the feature space between the input phonetic symbol string and the prosodic pattern with index i in the prosodic database 16 (step c14).

次いで、ＣＰＵ２は、今回求められた特徴量空間内相対距離ｄ_ｉの値を距離積算値レジスタＤに加算し（ステップｃ１５）、データベース内韻律パターン特定指標ｉの現在値が、韻律データベース１６内に登録されている韻律パターンの総数Ｉに達しているか否か、要するに、韻律データベース１６内に登録されているインデックスｉ＝１〜Ｉの全ての韻律パターンについて特徴量空間内相対距離ｄ_ｉの値が求められているか否かを判定する（ステップｃ１６）。 Next, the CPU 2 adds the value of the relative distance d _i in the feature amount space obtained this time to the distance integrated value register D (step c15), and the current value of the prosodic pattern specifying index i in the database is stored in the prosodic database 16. Whether or not the total number I of the prosodic patterns registered has been reached, in short, the value of the relative distance d _i in the feature amount space for all the prosodic patterns of indexes i = 1 to I registered in the prosodic database 16 It is determined whether or not it has been obtained (step c16).

そして、ステップｃ１６の判定結果が真となった場合、つまり、特徴量空間内相対距離ｄ_ｉの値が求められていない韻律パターンが韻律データベース１６内に残っていると判定された場合には、ＣＰＵ２は、データベース内韻律パターン特定指標ｉの値を１ずつインクリメントしながら前記と同様の処理を繰り返し実行する（ステップｃ６〜ステップｃ１６）。 If the determination result in step c16 is true, that is, if it is determined that the prosodic pattern for which the value of the relative distance d _i in the feature amount space is not found remains in the prosodic database 16, The CPU 2 repeatedly executes the same processing as described above while incrementing the value of the prosodic pattern identification index i in the database by 1 (step c6 to step c16).

そして、最終的にステップｃ１６の判定結果が偽となった時点で、韻律データベース１６内に登録されているインデックスｉ＝１〜Ｉの全ての韻律パターンについて特徴量空間内相対距離ｄ_ｉの値が求められ、同時に、前述の式（３）における距離の総和Ｄの値が距離積算値レジスタＤによって求められることになる。 Then, when the determination result in step c16 is finally false, the value of the relative distance d _i in the feature amount space for all the prosodic patterns of indexes i = 1 to I registered in the prosodic database 16 is obtained. At the same time, the distance sum value D in the above equation (3) is obtained by the distance integrated value register D.

従って、韻律パターン生成手段１４として機能するＣＰＵ２は、ステップｃ１６の判定結果が偽となった時点で、インデックスｉ＝１〜Ｉの全ての韻律パターンについて前述の式（３）における重みｗ_ｉ＝Ｄ／ｄ_ｉの値を個別に求め、インデックスｉ＝１〜Ｉの全ての韻律パターンについてｗ_ｉによる重み付けで線形結合の処理を施し、当該１アクセント句のための新規の韻律パターンを生成し、その内容をＲＡＭ４に一時記憶する（ステップｃ１７）。 Therefore, the CPU 2 functioning as the prosodic pattern generation means 14 has the weights w _i = D in the above-described equation (3) for all the prosodic patterns of indexes i = 1 to I when the determination result in step c16 is false. / D _i is obtained individually, and all prosodic patterns with indices i = 1 to I are subjected to linear combination processing by weighting with w _i to generate a new prosodic pattern for the one accent phrase, The contents are temporarily stored in the RAM 4 (step c17).

この処理は、簡単に言えば、韻律データベース１６内におけるインデックスｉ＝１〜Ｉの韻律パターンの各々に、対応する重みｗ_ｉを乗じ、ｉ＝１〜Ｉに亘って加算するといったものである。 In short, this processing is such that each of the prosodic patterns of indexes i = 1 to I in the prosodic database 16 is multiplied by the corresponding weight w _i and added over i = 1 to I.

このようにして、分割された１つのアクセント句に対する新規の韻律パターンの生成が完了すると、ＣＰＵ２は、分割されたアクセント句の全てについて韻律パターンの生成が完了しているか否かを判定し（ステップｃ１８）、全てのアクセント句についての韻律パターンの生成が完了していなければ、ＣＰＵ２は、再びステップｃ２の処理に復帰して前述の１文から次のアクセント句を分割し、このアクセント句に対して前記と同様の処理を繰り返し実行することで、新たに分割されたアクセント句に対応した韻律パターンを生成する（ステップｃ２〜ステップｃ１８）。 Thus, when the generation of a new prosodic pattern for one divided accent phrase is completed, the CPU 2 determines whether or not the generation of the prosodic pattern has been completed for all of the divided accent phrases (step c18) If generation of prosodic patterns for all accent phrases has not been completed, the CPU 2 returns to the process of step c2 again to divide the next accent phrase from the above-mentioned one sentence, and for this accent phrase By repeating the same process as described above, a prosodic pattern corresponding to a newly divided accent phrase is generated (step c2 to step c18).

そして、最終的にステップｃ１８の判定結果が真となり、Ｎ個のアクセント句から成る１文の第ｎ＝１〜Ｎ番目のアクセント句の全てについて新規の韻律パターンが生成されると、波形生成手段１５として機能するＣＰＵ２が、第ｎ＝１〜Ｎ番目のアクセント句の各々に対応したＮ個の新規の韻律パターンをＲＡＭ４から読み出し、時間長等を修正した上でこれらのセグメントを接続し（ステップｃ１９）、最終的にＮ個のアクセント句から成る１文全体の合成音を生成する（ステップｃ２０）。 When the determination result in step c18 is finally true and new prosodic patterns are generated for all of the n = 1st to Nth accent phrases of one sentence composed of N accent phrases, the waveform generating means The CPU 2 functioning as 15 reads out N new prosodic patterns corresponding to each of the n = 1 to Nth accent phrases from the RAM 4, corrects the time length, etc., and connects these segments (step c19) Finally, a synthesized sound of one whole sentence composed of N accent phrases is generated (step c20).
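The loop structure of steps c2 to c18 can be sketched as nested loops. This is a simplified illustration rather than the program of FIG. 9: the data layout (each database entry as a pair of attribute tuple and contour, with the mora count as the first attribute), the floor `eps` for zero distances, and the division by the sum of the weights are assumptions. Segment connection and waveform generation (steps c19 and c20) are left out.

```python
def generate_prosody(phrases, database, alphas, eps=1e-6):
    """Return one combined contour per accent phrase (steps c2 to c18, simplified)."""
    results = []
    for attrs in phrases:                          # c2 / c18: next accent phrase
        mora_count = attrs[0]                      # M (first attribute, j = 1)
        total_d = 0.0                              # c4: register D
        distances = []
        for entry_attrs, _contour in database:     # c5, c6, c16: i = 1..I
            x = 0.0                                # c7: register X
            for j in range(len(alphas)):           # c8, c9, c13: j = 1..J
                x += alphas[j] * abs(attrs[j] - entry_attrs[j])  # c10 to c12
            d = max(x / mora_count, eps)           # c14 (eps is an assumption)
            distances.append(d)
            total_d += d                           # c15
        weights = [total_d / d for d in distances]  # c17: w_i = D / d_i
        norm = sum(weights)                        # normalization is an assumption
        length = len(database[0][1])
        results.append([
            sum(w * contour[t] for w, (_, contour) in zip(weights, database)) / norm
            for t in range(length)
        ])
    return results


database = [((5, 1, 0, 0), [0.0, 1.0]), ((7, 6, 1, 1), [1.0, 0.0])]
contours = generate_prosody([(5, 1, 0, 0)], database, (0.4, 0.4, 0.1, 0.1))
print(contours[0])
```

In this run the input attributes match the first database entry exactly, so the generated contour is almost identical to that entry's contour.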

この実施例を用いれば、入力された発音記号列の属性情報に非常に類似した韻律パターンが韻律データベース１６内に存在する場合には、自然発声から抽出した韻律パターンに非常に近い韻律パターンが生成されるので、非常に高い自然性を有する合成音が生成される。 According to this embodiment, when a prosodic pattern very similar to the attribute information of the input phonetic symbol string exists in the prosodic database 16, a prosodic pattern very close to the prosodic pattern extracted from the natural utterance is generated. Therefore, a synthesized sound having very high naturalness is generated.

また、韻律データベース１６内に類似する属性情報を持つ韻律パターンが存在しない場合であっても、韻律データベース１６内の複数の韻律パターンを平均したような韻律パターンが生成されるため、安定して音声合成を行うことができる。 Even if there is no prosodic pattern having similar attribute information in the prosodic database 16, a prosodic pattern that is an average of a plurality of prosodic patterns in the prosodic database 16 is generated. Synthesis can be performed.

この実施例では、各韻律パターンの属性情報と入力された発音記号列の属性情報の特徴量空間内相対距離を各々の属性情報に基いて計算するようにしたが、特徴量空間内に距離の基準となる原点を定義し、特徴量空間内における原点と各韻律パターンとの距離を予め計算して韻律データベース１６内に記憶しておき、入力された発音記号列の属性情報と原点との間の距離を類似度計算手段１３により其の都度に求め、この距離と各韻律パターンの属性情報における原点との距離との差分により特徴量空間内距離ｄ_ｉを求める方法もある。この方法によって、新規の韻律パターンの生成時における距離計算の回数を減らすことができ更なる計算時間の削減が可能となる。 In this embodiment, the relative distance in the feature amount space of the attribute information of each prosodic pattern and the attribute information of the input phonetic symbol string is calculated based on each attribute information. A reference origin is defined, and the distance between the origin and each prosodic pattern in the feature amount space is calculated in advance and stored in the prosodic database 16, and between the input phonetic symbol string attribute information and the origin There is also a method of obtaining the distance d _i in the feature amount space by the difference between this distance and the distance from the origin in the attribute information of each prosodic pattern. By this method, the number of distance calculations when generating a new prosodic pattern can be reduced, and the calculation time can be further reduced.
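The origin-based variant can be sketched as follows. The text does not specify how the distance to the origin is defined, so this sketch assumes a weighted sum of absolute differences from an origin at zero and omits the division by the mora count; the function names are hypothetical. With this definition, the difference of the two origin distances never exceeds the direct weighted distance between the attribute vectors, so the shortcut is an approximation that trades accuracy for speed.

```python
def origin_distance(attrs, alphas, origin=None):
    """Weighted distance of an attribute vector from the origin (assumed at zero)."""
    origin = origin or [0] * len(attrs)
    return sum(alpha * abs(a - o) for alpha, a, o in zip(alphas, attrs, origin))


def precompute(database_attrs, alphas):
    """Done once when the database is built: one stored number per pattern."""
    return [origin_distance(attrs, alphas) for attrs in database_attrs]


def approximate_distances(input_attrs, stored, alphas):
    """One distance calculation for the input, then one subtraction per pattern."""
    r = origin_distance(input_attrs, alphas)
    return [abs(r - r_i) for r_i in stored]


alphas = (0.5, 0.5)
stored = precompute([(5, 1), (7, 6)], alphas)       # [3.0, 6.5]
approx = approximate_distances((5, 1), stored, alphas)
print(stored, approx)                                # [3.0, 6.5] [0.0, 3.5]
```

At generation time only one attribute-level distance is calculated, for the input itself; every database pattern then costs a single subtraction.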

In this example, the input is a phonetic symbol string containing accent phrase boundaries and accent positions, but linguistic information such as part-of-speech information, dependency information, and okurigana information can also be included. Including linguistic information allows feature values that cannot be extracted, or are difficult to extract, from acoustic information alone to serve as parameters for prosodic pattern generation. Naturally, linguistic information can also be attached to the prosodic patterns in the prosodic database.

次に、図４および図５で示した実施形態に相当する実施例について図面を参照して簡単に説明する。 Next, an example corresponding to the embodiment shown in FIGS. 4 and 5 will be briefly described with reference to the drawings.

FIG. 10 is a block diagram of a speech synthesis apparatus according to an example corresponding to the embodiment shown in FIGS. 4 and 5. In addition to the configuration of Example 1 described above, this example includes a category selection means 17 that selects a category in the prosodic database 18 based on the relative distance in the feature space obtained by the similarity calculation means 13. The configuration of the computer 1 operating as the speech synthesis apparatus and the functions of the CPU 2 are as already described, so the procedure of the speech synthesis method is described here from the method side.

今、発音記号列が入力され、実施例１で説明した方法と同様にして、入力発音記号列から属性情報が抽出されているものとする。 Assume now that a phonetic symbol string has been input and that attribute information has been extracted from it in the same way as described in Example 1.

カテゴリ選択手段１７は、入力発音記号列から抽出された属性情報の一部が韻律データベース１８におけるカテゴリ分割の属性情報と一致していた場合に、そのカテゴリに属するものとして判定し、韻律パターン生成時には選択されたカテゴリに属する韻律パターンを用いて新規韻律パターンを生成する（入力された発音記号列がカテゴリ分割された韻律データベース内のどのカテゴリに属するかを判定するステップ）。 When part of the attribute information extracted from the input phonetic symbol string matches the attribute information that defines a category in the prosodic database 18, the category selection means 17 determines that the string belongs to that category. At prosodic pattern generation time, a new prosodic pattern is generated using the prosodic patterns belonging to the selected category (the step of determining which category of the category-divided prosodic database the input phonetic symbol string belongs to).

前述の例文Ｓを例にとって、詳細に説明する。例文Ｓの第１アクセント句の「音声を」からは、前述の通り、属性情報としてモーラ数が５，アクセント型が１，文中の位置が文頭といった情報が抽出される。 This is explained in detail using the example sentence S described above. From the first accent phrase of example sentence S, 「音声を」 (onsei o), the following attribute information is extracted as described above: the number of morae is 5, the accent type is 1, and the position in the sentence is sentence-initial.

これに対し、図１１に示されるように、韻律データベース１８内のカテゴリ１がモーラ数＝５，アクセント型＝１，文中の位置＝文頭という情報を持つ韻律パターンの属するカテゴリであるとすると、例文Ｓの第１アクセント句はカテゴリ１に属するものであるとカテゴリ選択手段１７によって判定される。 Suppose, as shown in FIG. 11, that category 1 in the prosodic database 18 is the category of prosodic patterns with number of morae = 5, accent type = 1, and position in sentence = sentence-initial. The category selection means 17 then determines that the first accent phrase of example sentence S belongs to category 1.

同様に、韻律データベース内のカテゴリ２がモーラ数＝７，アクセント型＝６，文中の位置＝文末という情報を持つ韻律パターンの属するカテゴリであるとすると、第２アクセント句の「合成します」は、カテゴリ２に属するものであるとカテゴリ選択手段１７によって判定される。 Similarly, if category 2 in the prosodic database is the category of prosodic patterns with number of morae = 7, accent type = 6, and position in sentence = sentence-final, the category selection means 17 determines that the second accent phrase, 「合成します」 (gousei shimasu), belongs to category 2.

ここで、カテゴリの分割方法としては、例えば、「モーラ数６以上，アクセント型が５または６」といった或る程度の幅を持たせたカテゴリにすることも可能である。 Here, as a category dividing method, for example, a category having a certain range such as “the number of mora is 6 or more and the accent type is 5 or 6” may be used.
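The category definitions above, including a category with some latitude, can be sketched as a table of per-attribute conditions. The dictionary layout, the attribute keys, the labels "head" and "tail", and category 3 as a stand-in for the ranged category are all hypothetical choices made for illustration; the patent does not specify a data structure.

```python
# Each category maps an attribute name to a predicate on that attribute's value.
CATEGORIES = {
    1: {"mora": lambda m: m == 5, "accent": lambda a: a == 1,
        "position": lambda p: p == "head"},
    2: {"mora": lambda m: m == 7, "accent": lambda a: a == 6,
        "position": lambda p: p == "tail"},
    # A ranged category: "6 or more morae, accent type 5 or 6".
    3: {"mora": lambda m: m >= 6, "accent": lambda a: a in (5, 6)},
}


def belongs(attrs, conditions):
    """True if every condition registered for the category holds for attrs."""
    return all(check(attrs[name]) for name, check in conditions.items())


def select_category(attrs, categories=CATEGORIES):
    """Return the first category whose conditions all hold, or None if none does."""
    for category_id, conditions in categories.items():
        if belongs(attrs, conditions):
            return category_id
    return None


# The two accent phrases of example sentence S.
first_phrase = {"mora": 5, "accent": 1, "position": "head"}
second_phrase = {"mora": 7, "accent": 6, "position": "tail"}
print(select_category(first_phrase), select_category(second_phrase))
```

Because categories are tested in order, an accent phrase that satisfies both an exact category and a ranged one is assigned to whichever is listed first; the text does not say how such overlaps should be resolved.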

このようにしてカテゴリが選択されると、次に、そのカテゴリに属する各韻律パターンの重みを算出する。選択されたカテゴリに属する各韻律パターンに対する重みｗ_ｉ’は次の式（４）によって表される。 When a category is selected in this way, the weight of each prosodic pattern belonging to that category is calculated next. The weight w_i′ for each prosodic pattern belonging to the selected category is expressed by the following equation (4).

ここで、ｄ_ｉ’は選択されたカテゴリに属する韻律パターンの特徴量空間内相対距離、また、Ｄ'は選択されたカテゴリに属する韻律パターンの特徴量空間内相対距離の合計、そして、Ｉ'は選択されたカテゴリに属する韻律パターンの総数である。 Here, d_i′ is the relative distance in the feature space of a prosodic pattern belonging to the selected category, D′ is the sum of the relative distances in the feature space of the prosodic patterns belonging to the selected category, and I′ is the total number of prosodic patterns belonging to the selected category.

最終的に、重みｗ_ｉ’を用い、選択されたカテゴリに属するインデックスｉ＝１からＩ'の韻律パターンの夫々に各々に対応する重みｗ_ｉ’を掛けて線形結合を行うことにより、実施例１の場合と同様にして求めるべき新規の韻律パターンを生成する。 Finally, each of the prosodic patterns with indexes i = 1 to I′ belonging to the selected category is multiplied by its corresponding weight w_i′ and the results are linearly combined, generating the required new prosodic pattern in the same way as in Example 1.

具体例として例文Ｓを参照して本実施例を説明する。本実施例の概要を図１１に示す。なお、図１１において、図８と同一の符号は同一物あるいは相当物を示し、その説明を省略するものとする。 As a specific example, the present embodiment will be described with reference to an example sentence S. An outline of this embodiment is shown in FIG. In FIG. 11, the same reference numerals as those in FIG. 8 denote the same or corresponding parts, and the description thereof will be omitted.

韻律データベース１８内の韻律パターンは多数のカテゴリに分割されているが、図１１では、モーラ数＝５，アクセント型＝１の韻律パターンが属するカテゴリ１とモーラ数＝７，アクセント型＝６の韻律パターンが属するカテゴリ２のみを示している。 The prosodic patterns in the prosodic database 18 are divided into many categories, but FIG. 11 shows only category 1, to which prosodic patterns with number of morae = 5 and accent type = 1 belong, and category 2, to which prosodic patterns with number of morae = 7 and accent type = 6 belong.

上述のように、点Ａおよび点Ｂはそれぞれカテゴリ１およびカテゴリ２に属することが判明しているとし、カテゴリ１にはＡ１，Ａ２，Ａ３が、また、カテゴリ２にはＢ１，Ｂ２，Ｂ３が属しているものとする。 As described above, points A and B are assumed to have been found to belong to category 1 and category 2, respectively; A1, A2, and A3 belong to category 1, and B1, B2, and B3 belong to category 2.

まず、第１アクセント句のＡについて、カテゴリ１から外れた韻律パターンＢ１，Ｂ２，Ｂ３やＣ，Ｄは無視し、カテゴリ１に属する韻律パターンＡ１，Ａ２，Ａ３との間の特徴量空間内相対距離ｄ_ＡＡ１，ｄ_ＡＡ２，ｄ_ＡＡ３のみを計算し、式（４）に従って重みｗ_ＡＡ１，ｗ_ＡＡ２，ｗ_ＡＡ３を決定する。 First, for A of the first accent phrase, the prosodic patterns outside category 1 (B1, B2, B3, C, and D) are ignored. Only the relative distances in the feature space d_AA1, d_AA2, and d_AA3 to the prosodic patterns A1, A2, and A3 belonging to category 1 are calculated, and the weights w_AA1, w_AA2, and w_AA3 are determined according to equation (4).

次に、第２アクセント句のＢについて、カテゴリ２から外れた韻律パターンＡ１，Ａ２，Ａ３やＣ，Ｄは無視し、カテゴリ２に属する韻律パターンＢ１，Ｂ２，Ｂ３との間の特徴量空間内相対距離ｄ_ＢＢ１，ｄ_ＢＢ２，ｄ_ＢＢ３のみを計算し、式（４）に従って重みｗ_ＢＢ１，ｗ_ＢＢ２，ｗ_ＢＢ３を決定する。 Next, for B of the second accent phrase, the prosodic patterns outside category 2 (A1, A2, A3, C, and D) are ignored. Only the relative distances in the feature space d_BB1, d_BB2, and d_BB3 to the prosodic patterns B1, B2, and B3 belonging to category 2 are calculated, and the weights w_BB1, w_BB2, and w_BB3 are determined according to equation (4).

このとき、実施例１の場合と略同様に、第１アクセント句のＡについてはＡ１に極めて類似した韻律パターンが、また、第２アクセント句のＢについてはＢ１，Ｂ２，Ｂ３を平均した韻律パターンが生成されるような重みが決定されることになる。 At this point, much as in Example 1, the weights are determined so that a prosodic pattern very similar to A1 is generated for A of the first accent phrase, and a prosodic pattern that averages B1, B2, and B3 is generated for B of the second accent phrase.

最終的に、このようにして求められた重みｗ_ＡＡ１，ｗ_ＡＡ２，ｗ_ＡＡ３とｗ_ＢＢ１，ｗ_ＢＢ２，ｗ_ＢＢ３を用いて、重み付き線形結合により例文Ｓの第１アクセント句および第２アクセント句のための新規な韻律パターンを生成する。 Finally, using the weights w_AA1, w_AA2, w_AA3 and w_BB1, w_BB2, w_BB3 obtained in this way, new prosodic patterns for the first and second accent phrases of example sentence S are generated by weighted linear combination.
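The weighting and linear combination within one category can be illustrated with a short sketch. Equation (4) is not reproduced in this text, so the weights below use a normalized inverse-distance rule as a stand-in; it is an assumption, not the patent's formula, chosen only because it gives larger weights to closer patterns and sums to 1. The representation of a prosodic pattern as an equal-length list of pitch values and the function names are likewise hypothetical.

```python
def inverse_distance_weights(distances, eps=1e-9):
    """Stand-in for equation (4): closer patterns get larger weights, sum is 1."""
    inverse = [1.0 / (d + eps) for d in distances]
    total = sum(inverse)
    return [v / total for v in inverse]


def combine_patterns(patterns, weights):
    """Weighted linear combination of equal-length prosodic patterns."""
    length = len(patterns[0])
    return [sum(w * p[t] for w, p in zip(weights, patterns))
            for t in range(length)]


# Hypothetical pitch contours (Hz) for three patterns in the selected category.
category_patterns = [[100.0, 120.0], [110.0, 130.0], [120.0, 140.0]]

# One pattern far closer than the rest dominates the result (like A and A1).
near_one = inverse_distance_weights([0.001, 10.0, 10.0])
print(combine_patterns(category_patterns, near_one))

# Equal distances give an average of the three patterns (like B and B1 to B3).
equal = inverse_distance_weights([1.0, 1.0, 1.0])
print(combine_patterns(category_patterns, equal))
```

Only the patterns of the selected category are passed in, which is what limits the work to I′ patterns rather than the whole database.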

次に、属性情報抽出手段１２，類似度計算手段１３，韻律パターン生成手段１４，波形生成手段１５，カテゴリ選択手段１７として機能するＣＰＵ２の処理動作について、ハードディスク５にインストールされた音声合成プログラムの概要を示す図１２のフローチャートを参照してＣＰＵ２の内部処理の面から具体的に説明する。 Next, the processing operations of the CPU 2, which functions as the attribute information extraction means 12, similarity calculation means 13, prosodic pattern generation means 14, waveform generation means 15, and category selection means 17, are described concretely in terms of the internal processing of the CPU 2, with reference to the flowchart of FIG. 12, which outlines the speech synthesis program installed on the hard disk 5.

ステップｄ１〜ステップｄ３の処理は図９におけるステップｃ１〜ステップｃ３の処理と同様であり、これらの処理により、まず、読み込まれた発音記号列の１つのアクセント句を対象として、属性情報抽出手段１２として機能するＣＰＵ２が、ｊ＝１〜Ｊの各属性情報ａ_ｊを抽出する。 The processing of steps d1 to d3 is the same as that of steps c1 to c3 in FIG. 9. Through this processing, the CPU 2 functioning as the attribute information extraction means 12 first extracts each item of attribute information a_j (j = 1 to J) for one accent phrase of the read phonetic symbol string.

次いで、ＣＰＵ２は、韻律データベース１８におけるカテゴリを指定するカテゴリ指定指標ｋに一旦０をセットした後（ステップｄ４）、該指標ｋを直ちに１インクリメントして、韻律データベース１８内に設定された最初のカテゴリを指定するための初期値１に更新する（ステップｄ５）。 Next, the CPU 2 first sets the category designation index k, which designates a category in the prosodic database 18, to 0 (step d4), then immediately increments k by 1, updating it to the initial value 1 that designates the first category set in the prosodic database 18 (step d5).

次いで、カテゴリ選択手段１７として機能するＣＰＵ２は、カテゴリ指定指標ｋの現在値に基づいて、カテゴリｋとして登録されている属性情報の種別を韻律データベース１８から読み出し（ステップｄ６）、カテゴリｋに対して設定された属性情報が、今回の処理で読み込まれた発音記号列のアクセント句の属性情報と設定値以上の割合で一致しているか否かを判定する（ステップｄ７）。 Next, the CPU 2 functioning as the category selection means 17 reads out the type of attribute information registered as the category k from the prosodic database 18 based on the current value of the category designation index k (step d6), and for the category k It is determined whether or not the set attribute information matches the accent phrase attribute information of the phonetic symbol string read in the current process at a rate equal to or greater than the set value (step d7).

この実施例におけるカテゴリ選択処理は、実際に演算処理の対象とする韻律パターンの数を減らして処理操作の効率を高めるのが目的であり、カテゴリに登録する属性情報の数は任意に決定し得るが、多種多様なカテゴリが韻律データベース１８内に氾濫することを避けるため、カテゴリ毎に登録する属性情報の総数は、通常、発音記号列のアクセント句から抽出されるアクセント句の種類数Ｊよりも少なくしている。つまり、ステップｄ７で利用される設定値はＪよりも少ない数である。 The purpose of the category selection processing in this example is to reduce the number of prosodic patterns actually subjected to computation and thereby make processing more efficient. The number of attribute information items registered for a category can be chosen freely, but to keep the prosodic database 18 from being flooded with a wide variety of categories, the total number of attribute information items registered per category is normally smaller than J, the number of kinds of attribute information extracted from an accent phrase of the phonetic symbol string. That is, the set value used in step d7 is a number smaller than J.

ここで、ステップｄ７の判定結果が真となった場合、カテゴリ選択手段１７として機能するＣＰＵ２は、今回の処理で読み込まれた発音記号列のアクセント句がカテゴリｋに属するものと見做し、また、ステップｄ７の判定結果が偽となった場合には、今回の処理で読み込まれた発音記号列のアクセント句がカテゴリｋに属していないと見做す。 Here, when the determination result in step d7 is true, the CPU 2 functioning as the category selection means 17 considers that the accent phrase of the phonetic symbol string read in this processing belongs to the category k, and If the determination result in step d7 is false, it is assumed that the accent phrase of the phonetic symbol string read in this processing does not belong to the category k.

そして、ステップｄ７の判定結果が偽となった場合、つまり、今回の処理で読み込まれた発音記号列のアクセント句がカテゴリｋに属していないと判定された場合には、カテゴリ選択手段１７として機能するＣＰＵ２は、再びステップｄ５の処理に復帰してカテゴリ指定指標ｋの値を１インクリメントし、前記と同様の処理を繰り返し実行することにより、今回の処理で読み込まれた発音記号列のアクセント句が属すると見做し得るカテゴリｋの値を求める。 If the determination result in step d7 is false, that is, if the accent phrase of the phonetic symbol string read in this pass is determined not to belong to category k, the CPU 2 functioning as the category selection means 17 returns to step d5, increments the category designation index k by 1, and repeats the same processing to find the value of the category k to which the accent phrase can be regarded as belonging.

このようにして、今回の処理で読み込まれた発音記号列のアクセント句が属すると見做し得るカテゴリｋの値が求められると、カテゴリ選択手段１７として機能するＣＰＵ２は、この時点で、データの読み込みの対象とするカテゴリをカテゴリｋのみに制限して（ステップｄ８）、類似度計算手段１３，韻律パターン生成手段１４の動作を許容する。 In this way, when the value of the category k that can be regarded as belonging to the accent phrase of the phonetic symbol string read in the current process is obtained, the CPU 2 functioning as the category selection means 17 at this time, The category to be read is limited to only category k (step d8), and the operations of the similarity calculation means 13 and the prosody pattern generation means 14 are allowed.
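Steps d4 to d8 amount to a linear scan over the categories with a match-rate test. The sketch below is a minimal reading of that loop under stated assumptions: the set value is treated as a fraction between 0 and 1, categories are plain dictionaries of registered attribute values, and the scan returns None when no category qualifies (a case the text does not discuss). All names are hypothetical.

```python
def find_category(phrase_attrs, category_attrs, threshold):
    """Return the 1-based index k of the first matching category, else None."""
    k = 0                                   # step d4: reset the category index
    while k < len(category_attrs):
        k += 1                              # step d5: advance to the next category
        registered = category_attrs[k - 1]  # step d6: read category k's attributes
        matches = sum(1 for name, value in registered.items()
                      if phrase_attrs.get(name) == value)
        if matches / len(registered) >= threshold:  # step d7: match-rate test
            return k                        # step d8: restrict reads to category k
    return None                             # assumption: no category qualified


# Attribute information registered for categories 1 and 2 of the example.
categories = [
    {"mora": 5, "accent": 1, "position": "head"},
    {"mora": 7, "accent": 6, "position": "tail"},
]
second_phrase = {"mora": 7, "accent": 6, "position": "tail"}
print(find_category(second_phrase, categories, threshold=1.0))
```

Lowering the threshold lets an accent phrase join a category that it matches only in part, which corresponds to the "part of the attribute information" wording used for the category selection means 17.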

ステップｄ９の処理は、図９におけるステップｃ４〜ステップｃ１７の処理と同様であり、類似度計算手段１３および韻律パターン生成手段１４として機能するＣＰＵ２によって、前記と同様にして特徴量空間内相対距離ｄ_ｉ’や重みｗ_ｉ’等を求めるための処理が繰り返し実行されることになる（但し、図９のｄ_ｉ，ｗ_ｉ，Ｉ，Ｄは各々ｄ_ｉ’，ｗ_ｉ’，Ｉ’，Ｄ’と読み替えるものとする）。 The processing of step d9 is the same as that of steps c4 to c17 in FIG. 9: the CPU 2 functioning as the similarity calculation means 13 and the prosodic pattern generation means 14 repeatedly executes the processing for obtaining the relative distances in the feature space d_i′, the weights w_i′, and so on, in the same way as described above (where d_i, w_i, I, and D in FIG. 9 are to be read as d_i′, w_i′, I′, and D′, respectively).

しかし、本実施例においては、前述したステップｄ４〜ｄ８の処理、つまり、カテゴリ選択手段１７の機能によってカテゴリが制限され、特徴量空間内相対距離ｄ_ｉ’や重みｗ_ｉ’等を求める対象となる韻律パターンの数が、韻律データベース１６内に登録されている全ての韻律パターンの総数Ｉではなく、選択されたカテゴリｋに属する韻律パターンの総数Ｉ’とされているので、特に、類似度計算手段１３，韻律パターン生成手段１４として機能するＣＰＵ２の負荷が大幅に軽減され、全体としての処理効率が大幅に向上する格別の効果が奏される。 In this example, however, the category is restricted by the processing of steps d4 to d8 described above, that is, by the function of the category selection means 17. The number of prosodic patterns for which the relative distance in the feature space d_i′, the weight w_i′, and so on are obtained is therefore not the total number I of all prosodic patterns registered in the prosodic database 16 but the total number I′ of prosodic patterns belonging to the selected category k. This greatly reduces the load on the CPU 2, particularly in its roles as the similarity calculation means 13 and the prosodic pattern generation means 14, and greatly improves overall processing efficiency.

ステップｄ１０〜ステップｄ１２の処理は図９におけるステップｃ１８〜ステップｃ２０の処理と同様であるので、説明を省略する。 The processing of steps d10 to d12 is the same as that of steps c18 to c20 in FIG. 9, so its description is omitted.

このように、本実施例を用いれば、重み付き線形結合に用いられる韻律パターンが限定されるため、計算負荷の軽減を図りつつ実施例１とほぼ同等の安定性と自然性を実現することができるようになる。 Thus, with this example, the prosodic patterns used in the weighted linear combination are limited, so stability and naturalness nearly equivalent to those of Example 1 can be achieved while reducing the computational load.

本発明の音声合成方法を実現するための音声合成プログラムをインストールすることにより音声合成装置として機能するコンピュータの構成の概略を示したブロック図である。 FIG. 1 is a block diagram outlining the configuration of a computer that functions as a speech synthesizer when a speech synthesis program implementing the speech synthesis method of the present invention is installed.
音声合成プログラムによって駆動制御されるＣＰＵの機能の概略を示した機能ブロック図である（第１の実施形態）。 FIG. 2 is a functional block diagram outlining the functions of the CPU driven and controlled by the speech synthesis program (first embodiment).
同実施形態の動作を簡略化して示したフローチャートである。 FIG. 3 is a flowchart showing the operation of that embodiment in simplified form.
音声合成プログラムによって駆動制御されるＣＰＵの機能の概略を示した他の実施形態の機能ブロック図である（第２の実施形態）。 FIG. 4 is a functional block diagram of another embodiment, outlining the functions of the CPU driven and controlled by the speech synthesis program (second embodiment).
同実施形態の動作を簡略化して示したフローチャートである。 FIG. 5 is a flowchart showing the operation of that embodiment in simplified form.
図２および図３で示した実施形態に相当する実施例の音声合成装置のブロック図である（実施例１）。 FIG. 6 is a block diagram of the speech synthesizer of the example corresponding to the embodiment shown in FIGS. 2 and 3 (Example 1).
韻律パターンと属性情報の関連について示した概念図である（実施例１）。 FIG. 7 is a conceptual diagram showing the relationship between prosodic patterns and attribute information (Example 1).
図８（ａ）は韻律データベース内に登録された個々の韻律パターンの特性を視覚化して示した概念図、図８（ｂ）は特徴量空間内での韻律パターンの位置を視覚化して示した概念図である（実施例１）。 FIG. 8(a) is a conceptual diagram visualizing the characteristics of the individual prosodic patterns registered in the prosodic database, and FIG. 8(b) is a conceptual diagram visualizing the positions of the prosodic patterns in the feature space (Example 1).
ハードディスクにインストールされた音声合成プログラムの概要を示したフローチャートである（実施例１）。 FIG. 9 is a flowchart outlining the speech synthesis program installed on the hard disk (Example 1).
図４および図５で示した実施形態に相当する実施例の音声合成装置のブロック図である（実施例２）。 FIG. 10 is a block diagram of the speech synthesizer of the example corresponding to the embodiment shown in FIGS. 4 and 5 (Example 2).
カテゴリ分割して韻律データベース内に登録された個々の韻律パターンの特性と特徴量空間内での韻律パターンの位置を視覚化して示した概念図である（実施例２）。 FIG. 11 is a conceptual diagram visualizing the characteristics of the individual prosodic patterns registered in the category-divided prosodic database and their positions in the feature space (Example 2).
ハードディスクにインストールされた音声合成プログラムの概要を示したフローチャートである（実施例２）。 FIG. 12 is a flowchart outlining the speech synthesis program installed on the hard disk (Example 2).
代表パターンを変形した変形パターンと音声データから抽出された韻律パターンとの誤差を評価して代表パターンを生成する従来の韻律パターン生成方法について示した概念図である。 FIG. 13 is a conceptual diagram of a conventional prosodic pattern generation method that generates a representative pattern by evaluating the error between a pattern obtained by deforming the representative pattern and a prosodic pattern extracted from speech data.
入力されたテキストデータの文節の属性情報と比較して基準値以上の類似度を持つ文節に関する韻律パラメータから入力テキストデータの文節に対する韻律パラメータを計算する従来の韻律パターン生成方法について示した概念図である。 FIG. 14 is a conceptual diagram of a conventional prosodic pattern generation method that calculates the prosodic parameters for a phrase of the input text data from the prosodic parameters of phrases whose similarity to the attribute information of that phrase is at or above a reference value.

Explanation of symbols

１コンピュータ
２ＣＰＵ
３ＲＯＭ
４ＲＡＭ
５ハードディスク
６インターフェイス
７キーボード
８モニタ
９入出力回路
１０ドライバ
１１スピーカ
１２属性情報抽出手段
１３類似度計算手段
１４韻律パターン生成手段
１５波形生成手段
１６韻律データベース
１７カテゴリ選択手段
１８韻律データベース

1 Computer
2 CPU
3 ROM
4 RAM
5 Hard disk
6 Interface
7 Keyboard
8 Monitor
9 Input/output circuit
10 Driver
11 Speaker
12 Attribute information extraction means
13 Similarity calculation means
14 Prosodic pattern generation means
15 Waveform generation means
16 Prosodic database
17 Category selection means
18 Prosodic database

Claims

A prosody pattern generation apparatus comprising:
a prosodic database, divided into categories in advance, that stores prosodic patterns in association with attribute information for each segment that is a constituent unit of a sentence;
attribute information extraction means for extracting attribute information of an input phonetic symbol string;
category selection means for specifying which category in the category-divided prosodic database the input phonetic symbol string belongs to;
similarity calculation means for calculating, only for the database of the category specified by the category selection means, a similarity between the attribute information of each prosodic pattern in the prosodic database and the attribute information extracted from the input phonetic symbol string, in accordance with the importance of the attribute information; and
prosody pattern generation means for generating a new prosodic pattern by combining the prosodic patterns in the database of the category specified by the category selection means with weighting according to the similarity,
wherein the weighting is performed by increasing the weight for a prosodic pattern having a large similarity and decreasing the weight for a prosodic pattern having a small similarity.

A speech synthesizer comprising:
a prosodic database, divided into categories in advance, that stores prosodic patterns in association with attribute information for each segment that is a constituent unit of a sentence;
attribute information extraction means for extracting attribute information of an input phonetic symbol string;
category selection means for determining which category in the category-divided prosodic database the input phonetic symbol string belongs to;
similarity calculation means for calculating, only for the database of the category specified by the category selection means, a similarity between the attribute information of each prosodic pattern in the prosodic database and the attribute information extracted from the input phonetic symbol string, in accordance with the importance of the attribute information;
prosody pattern generation means for generating a new prosodic pattern by combining the prosodic patterns in the database of the category specified by the category selection means with weighting according to the similarity; and
waveform generation means for generating a speech waveform by controlling the prosody with the generated prosodic pattern,
wherein the weighting is performed by increasing the weight for a prosodic pattern having a large similarity and decreasing the weight for a prosodic pattern having a small similarity.

A prosodic pattern generation method for generating a prosodic pattern by a prosodic pattern generating device, comprising:
a step of extracting attribute information of an input phonetic symbol string;
a determination step of determining which category in a prosodic database, divided into categories in advance, the input phonetic symbol string belongs to;
a step of calculating a similarity between the attribute information of each prosodic pattern stored in advance in the prosodic database of the category specified in the determination step and the attribute information extracted from the input phonetic symbol string, in accordance with the importance of the attribute information; and
a step of generating a new prosodic pattern by combining the prosodic patterns in the prosodic database of the category specified in the determination step with weighting according to the similarity,
wherein the weighting in the step of generating a new prosodic pattern is performed by increasing the weight for a prosodic pattern having a high similarity and decreasing the weight for a prosodic pattern having a low similarity.

A speech synthesis method for generating synthesized speech by a speech synthesizer, comprising:
a step of extracting attribute information of an input phonetic symbol string;
a determination step of determining which category in a prosodic database, divided into categories in advance, the input phonetic symbol string belongs to;
a step of calculating a similarity between the attribute information of each prosodic pattern stored in advance in the database of the category specified in the determination step and the attribute information extracted from the input phonetic symbol string, in accordance with the importance of the attribute information;
a step of generating a new prosodic pattern by combining the prosodic patterns in the database of the category specified in the determination step with weighting according to the similarity; and
a step of generating a speech waveform by controlling the prosody with the generated prosodic pattern,
wherein the weighting in the step of generating the new prosodic pattern is performed by increasing the weight for a prosodic pattern having a large similarity and decreasing the weight for a prosodic pattern having a small similarity.

A prosody pattern generation program that causes a computer constituting a prosody pattern generation device to execute:
a process of extracting attribute information of an input phonetic symbol string;
a determination process of determining which category in a prosodic database, divided into categories in advance, the input phonetic symbol string belongs to;
a process of calculating a similarity between the attribute information of each prosodic pattern stored in advance in the prosodic database of the category specified in the determination process and the attribute information extracted from the input phonetic symbol string, in accordance with the importance of the attribute information; and
a process of generating a new prosodic pattern by assigning weights according to the similarity, large for a prosodic pattern having a high similarity and small for a prosodic pattern having a low similarity, and combining the prosodic patterns in the database of the category specified in the determination process with those weights.

A speech synthesis program that causes a computer constituting a speech synthesizer to execute:
a process of extracting attribute information of an input phonetic symbol string;
a determination process of determining which category in a prosodic database, divided into categories in advance, the input phonetic symbol string belongs to;
a process of calculating a similarity between the attribute information of each prosodic pattern stored in advance in the database of the category specified in the determination process and the attribute information extracted from the input phonetic symbol string, in accordance with the importance of the attribute information;
a process of generating a new prosodic pattern by assigning weights according to the similarity, large for a prosodic pattern having a high similarity and small for a prosodic pattern having a low similarity, and combining the prosodic patterns in the database of the category specified in the determination process with those weights; and
a process of generating a speech waveform by controlling the prosody with the generated prosodic pattern.