JP2005181998A

JP2005181998A - Speech synthesizer and speech synthesizing method

Info

Publication number: JP2005181998A
Application number: JP2004340738A
Authority: JP
Inventors: Yoshifumi Hirose; 良文廣瀬
Original assignee: Matsushita Electric Industrial Co Ltd
Current assignee: Panasonic Holdings Corp
Priority date: 2003-11-28
Filing date: 2004-11-25
Publication date: 2005-07-07

Abstract

PROBLEM TO BE SOLVED: To provide a speech synthesizer which can output natural synthesized speech. SOLUTION: The synthesizer comprises a feature parameter DB 106 for storing, for every voice unit, voice unit data indicating the foreign origin word attribute and the sound characteristics of the voice unit; a language analyzer 104 and a prosody estimator 109 which acquire text data and estimate the foreign origin word attribute and the sound characteristics of each of the multiple voice units representing the text of the text data; a voice unit selector 108 for selecting from the feature parameter DB 106, for every voice unit representing the text, the voice unit data whose contents are similar to the estimated foreign origin word attribute and sound characteristics of that voice unit; and a speech synthesizer 110 for generating and outputting synthesized speech by using the selected multiple voice unit data. COPYRIGHT: (C)2005, JPO&NCIPI

Description

The present invention relates to a speech synthesis apparatus and a speech synthesis method for converting a given character string (text) into speech.

A conventional speech synthesizer selects a segment sequence from a segment database according to a minimum-cost criterion using a cost function based on acoustic features, and generates synthesized speech from the selected segment sequence (see, for example, Patent Document 1).

FIG. 17 is a block diagram showing the configuration of the conventional speech synthesizer.
The speech analysis unit 10 labels the speech data stored in the speech waveform database 21 using a text database 22 and a phoneme HMM (Hidden Markov Model) 23, and extracts acoustic features for each phoneme unit (segment unit). The acoustic features include, for example, the fundamental frequency, power, duration, and cepstrum coefficients obtained by cepstrum analysis. Information indicating the extracted acoustic features is stored as segments in the feature parameter 30, which serves as the segment database. The speech unit selection unit 12 refers to the feature parameter 30, which holds the information indicating the acoustic features, and searches for the segment that is acoustically closest to the target segment. If there are a plurality of target segments, a plurality of segments are searched for as a segment sequence. Here, the speech unit selection unit 12 selects the segment sequence in consideration of the error from the target segment with respect to the fundamental frequency, power, and duration, and of the distortion that arises when segments are concatenated. The speech synthesis unit 13 generates synthesized speech by acquiring from the speech waveform database 21 the plural pieces of speech data corresponding to the segment sequence selected by the speech unit selection unit 12 and connecting them.
Japanese Patent No. 3050832

However, the conventional speech synthesizer described above has the problem that it outputs unnatural synthesized speech. That is, because the conventional speech synthesizer selects segments based only on acoustic features, it cannot select appropriate segments, and as a result the synthesized speech generated from those segments is unnatural. Furthermore, when the conventional speech synthesizer fails to extract the acoustic features of the target segment correctly, that failure strongly affects segment selection, so that even less appropriate segments are selected.

The present invention has been made in view of this problem, and an object thereof is to provide a speech synthesizer and a speech synthesis method capable of outputting natural synthesized speech.

To achieve the above object, a speech synthesizer according to the present invention is a speech synthesizer that acquires text data and converts the text indicated by the text data into speech, comprising: storage means storing, for each speech unit, speech unit data indicating a foreign word attribute, which indicates whether or not the speech unit belongs to a foreign word, and the acoustic features of the speech unit; feature estimation means for acquiring text data and estimating, for each of a plurality of speech units representing the text of the text data, the foreign word attribute and the acoustic features of that speech unit; selection means for selecting from the storage means, for each speech unit representing the text, speech unit data whose content is similar to the foreign word attribute and acoustic features estimated by the feature estimation means; and speech output means for generating and outputting synthesized speech using the plurality of speech unit data selected by the selection means. For example, when the feature estimation means estimates, as the foreign word attribute of a speech unit, that the speech unit belongs to a foreign word, the selection means preferentially selects speech unit data whose foreign word attribute indicates that it belongs to a foreign word.

Thus, for example, when a speech unit of the text data belongs to a foreign word, speech unit data exhibiting features characteristic of a foreign word is selected for that speech unit, so that synthesized speech that sounds like a foreign word, as the text data indicates, can be generated and output. Conventionally, even if a speech unit of the text data belongs to a foreign word, speech unit data is selected only from the acoustic features of that speech unit, so synthesized speech that does not sound like a foreign word is output. In the present invention, by contrast, natural synthesized speech can be output as the text data indicates.

The speech unit data may further indicate a final particle attribute, which indicates whether or not the speech unit belongs to a final particle; the feature estimation means may estimate, for each of the plurality of speech units representing the text of the text data, the foreign word attribute, the acoustic features, and the final particle attribute of that speech unit; and the selection means may select from the storage means speech unit data whose content is similar to the foreign word attribute, acoustic features, and final particle attribute estimated by the feature estimation means. For example, when the feature estimation means estimates, as the final particle attribute of a speech unit, that the speech unit belongs to a final particle, the selection means preferentially selects speech unit data whose final particle attribute indicates that it belongs to a final particle.

Thus, for example, when a speech unit of the text data belongs to a final particle, speech unit data expressing the nuance associated with that final particle, such as a questioning tone, is selected for that speech unit, so that synthesized speech expressing that nuance, as the text data indicates, can be generated and output.

The selection means may comprise: first derivation means for deriving a first sub-cost by quantitatively evaluating the similarity between the foreign word attribute of the speech unit estimated by the feature estimation means and the foreign word attribute of the speech unit data stored in the storage means; second derivation means for deriving a second sub-cost by quantitatively evaluating the similarity between the acoustic features of the speech unit estimated by the feature estimation means and the acoustic features of the speech unit data stored in the storage means; cost derivation means for deriving a cost using the first and second sub-costs derived by the first and second derivation means; and data selection means for selecting speech unit data from the storage means based on the cost derived by the cost derivation means. For example, the cost derivation means derives the cost by weighting and summing the first and second sub-costs derived by the first and second derivation means.

Because each of the first and second sub-costs is weighted, the relative contributions of acoustic-feature similarity and foreign-word-attribute similarity to the selection of speech unit data can be adjusted through the weights.
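The weighted combination of the two sub-costs described above can be sketched as follows. This is only an illustrative sketch: the function names, the dictionary keys, the 0/1 mismatch penalty for the foreign word attribute, and the relative-error form of the acoustic sub-cost are all assumptions made for the example and are not taken from the specification.

```python
def foreign_word_subcost(target_is_foreign: bool, candidate_is_foreign: bool) -> float:
    """First sub-cost (assumed form): 0 when the attributes agree, 1 otherwise."""
    return 0.0 if target_is_foreign == candidate_is_foreign else 1.0


def acoustic_subcost(target: dict, candidate: dict) -> float:
    """Second sub-cost (assumed form): mean relative error over three features.

    Assumes every target value is non-zero.
    """
    keys = ("f0", "duration", "power")
    return sum(abs(target[k] - candidate[k]) / abs(target[k]) for k in keys) / len(keys)


def total_cost(c_foreign: float, c_acoustic: float,
               w_foreign: float = 1.0, w_acoustic: float = 1.0) -> float:
    """Weight each sub-cost and sum them into a single cost."""
    return w_foreign * c_foreign + w_acoustic * c_acoustic


def select_unit(target: dict, candidates: list,
                w_foreign: float = 1.0, w_acoustic: float = 1.0) -> dict:
    """Return the candidate with the smallest weighted cost."""
    def cost(c: dict) -> float:
        return total_cost(
            foreign_word_subcost(target["is_foreign"], c["is_foreign"]),
            acoustic_subcost(target, c),
            w_foreign,
            w_acoustic,
        )
    return min(candidates, key=cost)


# Hypothetical data: candidate A matches acoustically but is not a foreign word;
# candidate B is a foreign word with a slightly different fundamental frequency.
target = {"is_foreign": True, "f0": 120.0, "duration": 80.0, "power": 60.0}
candidates = [
    {"name": "A", "is_foreign": False, "f0": 120.0, "duration": 80.0, "power": 60.0},
    {"name": "B", "is_foreign": True, "f0": 130.0, "duration": 80.0, "power": 60.0},
]
```

With equal weights the attribute mismatch dominates and candidate B is chosen; setting `w_foreign=0.0` reduces the selection to a purely acoustic one and candidate A is chosen instead.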

The speech synthesizer may further comprise weight determination means for identifying the reliability of the acoustic features estimated by the feature estimation means and determining the weights for the first and second sub-costs according to that reliability, and the cost derivation means may apply the weights determined by the weight determination means to the first and second sub-costs. For example, when the reliability of the acoustic features is low, the weight determination means determines the weights for the first and second sub-costs so that the similarity of the foreign word attribute contributes more than the similarity of the acoustic features to the selection of speech unit data by the data selection means.

Because the weights applied to the first and second sub-costs change according to the reliability of the acoustic features, the relative contributions of acoustic-feature similarity and foreign-word-attribute similarity to the selection of speech unit data can be changed appropriately.
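One way to picture reliability-dependent weighting is the sketch below. The specification only states that a low reliability should make the foreign word attribute contribute more than the acoustic features; the linear mapping, the assumption that reliability lies in the range 0 to 1, and the function names are hypothetical choices made for this example.

```python
def weights_from_reliability(reliability: float) -> tuple:
    """Map an acoustic-feature reliability to (w_foreign, w_acoustic).

    Assumed linear rule: the less reliable the estimated acoustic features,
    the more weight goes to the foreign word attribute sub-cost.
    """
    r = max(0.0, min(1.0, reliability))  # clamp to the assumed [0, 1] range
    return (1.0 - r, r)


def weighted_cost(c_foreign: float, c_acoustic: float, reliability: float) -> float:
    """Combine the two sub-costs using reliability-dependent weights."""
    w_foreign, w_acoustic = weights_from_reliability(reliability)
    return w_foreign * c_foreign + w_acoustic * c_acoustic
```

Under this rule a reliability of 0.25 yields weights of 0.75 for the foreign word attribute and 0.25 for the acoustic features, so an attribute mismatch is penalized three times as heavily as an equally sized acoustic mismatch.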

The selection means may further comprise third derivation means for deriving a connection cost by quantitatively evaluating the acoustic distortion that arises when a plurality of speech unit data stored in the storage means are connected, and the cost derivation means may derive the cost using the first and second sub-costs derived by the first and second derivation means and the connection cost derived by the third derivation means.

This suppresses acoustic distortion, so that more natural synthesized speech can be output.
A data creation device according to the present invention is a data creation device that creates speech unit data used for speech synthesis, comprising: speech storage means for storing a speech waveform signal representing speech as a waveform; text storage means for storing text data indicating the text corresponding to the speech of the speech waveform signal; language analysis means for acquiring the text data from the text storage means, dividing the text of the text data into speech units, and analyzing, for each speech unit, a foreign word attribute indicating whether or not the speech unit belongs to a foreign word; acoustic analysis means for acquiring the speech waveform signal from the speech storage means, dividing the speech indicated by the speech waveform signal into speech units, and analyzing the acoustic features of each speech unit; and creation means for creating, for each speech unit, speech unit data indicating the foreign word attribute analyzed by the language analysis means and the acoustic features analyzed by the acoustic analysis means, and storing the speech unit data in storage means.

Because speech unit data indicating the foreign word attribute and the acoustic features is thus stored in the storage means for each speech unit, speech unit data can be selected from the storage means based on the foreign word attribute and the acoustic features. That is, the storage means in which the speech unit data is stored can be used by a speech synthesizer. As a result, if the speech synthesizer estimates the foreign word attribute and acoustic features of each speech unit from the text of the text data and selects from the storage means speech unit data whose content is similar to them, it can generate natural synthesized speech as the text data indicates.

The present invention can be realized not only as such a speech synthesizer, but also as a method by which the speech synthesizer synthesizes speech, as a program, and as a storage medium storing the program.

The speech synthesizer of the present invention has the effect of being able to output natural synthesized speech as the text data indicates.

(First embodiment)
FIG. 1 is a block diagram showing the configuration of the speech synthesizer according to the first embodiment of the present invention. This speech synthesizer is a text-to-speech synthesizer that converts input text into speech, and includes a feature parameter database (feature parameter DB) 106, a language analysis unit 104, a prosody estimation unit 109, a speech unit selection unit 108, a speech synthesis unit 110, and a speaker 111.

The feature parameter DB 106 is a database that holds speech unit data indicating the features of a plurality of speech units. The language analysis unit 104 acquires text data 100t indicating text, extracts the linguistic features of the text from the text data 100t, and outputs language information 104d indicating those linguistic features.

The prosody estimation unit 109 estimates the prosody of the text based on the linguistic features extracted by the language analysis unit 104, and generates prosody information 109d indicating the estimation result. Based on the language information 104d and the prosody information 109d input from the language analysis unit 104 and the prosody estimation unit 109, the speech unit selection unit 108 selects from the feature parameter DB 106 the series of speech unit data best suited to the text as a speech unit sequence. The speech unit selection unit 108 then notifies the speech synthesis unit 110 of the selected speech unit sequence.

Based on the features (for example, formants and sound source information) indicated by the speech unit data selected by the speech unit selection unit 108, the speech synthesis unit 110 generates a speech waveform signal that represents those features as a speech waveform. The speech synthesis unit 110 then connects the speech waveform signals of the speech unit data included in the speech unit sequence to generate a synthesized speech signal. The speaker 111 outputs the synthesized speech signal generated by the speech synthesis unit 110 to the outside as audible sound (synthesized speech).

Next, each component of the speech synthesizer shown in FIG. 1 will be described in detail.
FIG. 2 is a block diagram showing an example of the internal configuration of the language analysis unit 104.

The language analysis unit 104 includes a morpheme analysis unit 301, a syntax analysis unit 302, a reading assignment unit 303, and an accent phrase estimation unit 304.

The morpheme analysis unit 301 performs morphological analysis on the text indicated by the text data 100t. The syntax analysis unit 302 analyzes dependency relations and the like among the morphemes obtained by the morpheme analysis unit 301. Hereinafter, such analysis is referred to as syntax analysis. When a morpheme obtained by the morpheme analysis unit 301 has more than one possible reading, the reading assignment unit 303 assigns an appropriate reading to that morpheme. The accent phrase estimation unit 304 performs accent phrase division and accent combination on the morphemes obtained by the morpheme analysis unit 301.

On acquiring the text data 100t, the language analysis unit 104 performs morphological analysis, syntax analysis, assignment of appropriate readings, and so on, and outputs language information 104d such as that shown in FIG. 3.

FIG. 3 is a diagram showing an example of the content of the language information 104d.
The language information 104d output from the language analysis unit 104 indicates the text, the series of phonemes (phoneme notation) corresponding to the text, each morpheme contained in the text, each phrase (bunsetsu) contained in the text, the part of speech of each morpheme, the position within the morpheme, the position within the accent phrase, and the dependency destination position. The text is, for example, the Japanese sentence "今日の天気は晴れですよ。" ("It is fine today."). The morphemes are, for example, "今日" (today) and "の" (of), which are delimited by vertical broken lines in the text of FIG. 3. The phrases are "今日の", "天気は", and "晴れですよ。", which are delimited by the vertical bar "|" in the text of FIG. 3. The part of speech of the morpheme "今日" is "common noun", and the part of speech of the morpheme "の" is "particle". A particle is one of the Japanese parts of speech: a word that is always attached after another word and that does not conjugate. A particle indicates how the preceding word relates to other words, adds a certain meaning such as the speaker's feelings, or completes the sentence. In the present embodiment, the part of speech also indicates an attribute as to whether or not the morpheme is a foreign word, and an attribute as to whether or not the morpheme is a final particle. Here, a final particle is a particle that is used at the end of a sentence or phrase to express a question, prohibition, exclamation, emotion, and the like.

The dependency destination position indicates the phrase on which each phrase depends. For example, the number "1" given as the dependency destination position of the phrase "今日の" in FIG. 3 indicates that the phrase "今日の" depends on the phrase one position after it, namely "天気は". The same applies to the other phrases. The position within the morpheme indicates the position of each phoneme within its morpheme, and the position within the accent phrase indicates the position of each phoneme within its accent phrase.

The phoneme notation represents the text as phonemes and additionally marks the accent phrases and the beginning and end of the sentence. For example, in the phoneme notation of FIG. 3, the span between one symbol "/" and the next symbol "/" is one accent phrase. The symbol "^" in the phoneme notation indicates the beginning of the sentence, and the symbol "$" indicates the end of the sentence.
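As an illustration of the delimiter symbols just described, the following sketch splits a phoneme notation into accent phrases. FIG. 3 is not reproduced here, so the whitespace-separated layout of the notation string and the phoneme symbols in the example are assumptions; only the meaning of "^", "$", and "/" comes from the description.

```python
def parse_phoneme_notation(notation: str) -> list:
    """Split a phoneme notation into accent phrases (lists of phonemes).

    Assumed layout: whitespace-separated tokens, beginning with "^"
    (sentence start), ending with "$" (sentence end), with "/" marking
    accent phrase boundaries.
    """
    tokens = notation.split()
    if len(tokens) < 2 or tokens[0] != "^" or tokens[-1] != "$":
        raise ValueError("notation must start with '^' and end with '$'")

    phrases = []
    current = []
    for token in tokens[1:-1]:
        if token == "/":
            # A boundary closes the accent phrase collected so far, if any.
            if current:
                phrases.append(current)
                current = []
        else:
            current.append(token)
    if current:
        phrases.append(current)
    return phrases


# Hypothetical notation for the first two accent phrases of the example sentence.
example = "^ / ky o: n o / t e N k i w a / $"
accent_phrases = parse_phoneme_notation(example)
```

The position of a phoneme within its accent phrase then follows directly from its index in the corresponding inner list.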

The language information 104d may indicate parts of speech hierarchically. For example, in the example shown in FIG. 3, the part of speech of the morpheme "の" is "particle", and among particles this morpheme belongs to the "case particles". The language information 104d therefore indicates "particle" and "case particle" as the parts of speech of the morpheme "の". A case particle is one class of Japanese particles; it attaches mainly to nominals and indicates the case relation between the nominal and other words.

The language analysis unit 104 may also be configured to estimate the field (sports, news, entertainment, etc.) to which the text belongs. For example, the text data 100t may carry information about the field of the text in advance, or the field may be estimated by extracting keywords from the text.

Along with estimating the field, the language analysis unit 104 may also be configured to estimate an emotion appropriate to the text, such as joy, anger, sorrow, or pleasure. For example, the text data 100t may carry in advance information indicating the emotion to be expressed in the text (standards that make this possible include VoiceXML).

Based on the language information 104d sent from the language analysis unit 104, the prosody estimation unit 109 estimates the most likely prosody for the text indicated by the text data 100t and generates prosody information 109d as the estimation result. The prosody information 109d indicates at least the duration, fundamental frequency, and power of each phoneme unit. The prosody estimation unit 109 may instead be designed to estimate the duration, fundamental frequency, and power per mora or per syllable rather than per phoneme. The prosody estimation unit 109 may use any estimation method; for example, it may use the widely used method based on Quantification Theory Type I.

Although the prosody information 109d is described here as indicating the duration, fundamental frequency, and power of each phoneme unit, it may additionally indicate the reliability of the estimated prosody.

The feature parameter DB 106 stores a plurality of speech unit data. Each speech unit data includes acoustic feature information indicating the acoustic features of the speech unit and linguistic feature information indicating the linguistic features of the speech unit.

FIG. 4 is a diagram showing the content of the acoustic feature information.
For each speech unit, the acoustic feature information 106a indicates at least the fundamental frequency, duration, and power as acoustic features, and may further indicate cepstrum coefficients obtained by cepstrum analysis.

FIG. 5 is a diagram showing the content of the linguistic feature information.
As shown in FIG. 5, the linguistic feature information 106b indicates, as linguistic features, the phonological environment, morpheme information, accent phrase information, and syntax information. The phonological environment indicates the phoneme currently under consideration (target phoneme), the phoneme immediately before it (preceding phoneme), and the phoneme immediately after it (following phoneme). Such a phonological environment is information that has conventionally been used. The morpheme information indicates the morpheme that contains the target phoneme; specifically, it indicates the notation and the part of speech of that morpheme. The part of speech indicated by the morpheme information is subdivided (hierarchized) as necessary, and when the part of speech conjugates, the morpheme information also indicates the conjugated form. The accent phrase information indicates where the target phoneme is located within the accent phrase; specifically, it indicates the distance from the head of the accent phrase, the distance to the end of the accent phrase, and the distance from the accent nucleus. The syntax information indicates the dependency relation of the phrase that contains the target phoneme.

FIG. 6 is a diagram showing the content of one piece of speech unit data stored in the feature parameter DB 106.

When the speech unit is a phoneme, for example, the feature parameter DB 106 holds, for each phoneme, speech unit data that represents the features of that phoneme as a vector, as shown in FIG. 6. This speech unit data includes the linguistic feature information 106b and the acoustic feature information 106a described above for that phoneme. For example, the speech unit data indicates, as features of the phoneme u_i, the notation "今日" (today), the phoneme "ky", the part of speech "common noun", and so on. The speech unit data may also indicate the emotion of the speaker when the speech unit was uttered and the field (domain) to which the source text of the speech unit belongs.
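A single piece of speech unit data, combining the linguistic feature information 106b and the acoustic feature information 106a, could be represented along the following lines. The field names, types, units, and all numeric values are hypothetical and chosen only for illustration; the notation "今日", the phoneme "ky", and the part of speech "common noun" are the example values given in the description.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class SpeechUnitData:
    # Linguistic feature information (106b)
    target_phoneme: str
    preceding_phoneme: Optional[str]   # None at the beginning of a sentence
    following_phoneme: Optional[str]   # None at the end of a sentence
    morpheme_notation: str
    part_of_speech: Tuple[str, ...]    # hierarchical, coarse to fine
    is_foreign_word: bool
    is_final_particle: bool
    dist_from_phrase_head: int
    dist_to_phrase_end: int
    dist_from_accent_nucleus: int
    dependency_distance: int
    # Acoustic feature information (106a)
    f0_hz: float
    duration_ms: float
    power_db: float


# Example record for the phoneme "ky" of the morpheme "今日";
# the positional and acoustic values are made up for the example.
example_unit = SpeechUnitData(
    target_phoneme="ky",
    preceding_phoneme=None,
    following_phoneme="o:",
    morpheme_notation="今日",
    part_of_speech=("common noun",),
    is_foreign_word=False,
    is_final_particle=False,
    dist_from_phrase_head=0,
    dist_to_phrase_end=3,
    dist_from_accent_nucleus=0,
    dependency_distance=1,
    f0_hz=120.0,
    duration_ms=80.0,
    power_db=60.0,
)
```

Keeping the foreign word and final particle attributes as explicit boolean fields makes them directly comparable during selection, even though the description derives them from the part of speech.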

FIG. 7 is a diagram showing a specific configuration example of the speech unit selection unit 108.
The speech unit selection unit 108 includes a speech unit candidate extraction unit 401, a search unit 402, and a cost calculation unit 403.

For each speech unit (phoneme) indicated by the language information 104d sent from the language analysis unit 104, the speech unit candidate extraction unit 401 extracts from the feature parameter DB 106, taking into account the prosody information 109d sent from the prosody estimation unit 109, a set of speech unit data that are candidates for use in speech synthesis. From the candidates extracted by the speech unit candidate extraction unit 401, the search unit 402 searches for the speech unit data that best matches the language information 104d sent from the language analysis unit 104 and the prosody information 109d sent from the prosody estimation unit 109. The search unit 402 searches, in a single pass, for the series of speech unit data arranged in time order according to the phoneme notation of the language information 104d, as a speech unit sequence.

コスト計算部４０３は、探索部４０２が最尤な音声単位系列を探索する為の基準となるコストを計算する。このようなコスト計算部４０３は、目標コスト計算部４０４と、接続コスト計算部４０５とを備える。 The cost calculation unit 403 calculates a cost as a reference for the search unit 402 to search for the most likely speech unit sequence. Such a cost calculation unit 403 includes a target cost calculation unit 404 and a connection cost calculation unit 405.

目標コスト計算部４０４は、言語情報１０４ｄが示す各音声単位（音素）に対して、言語情報１０４ｄ及び韻律情報１０９ｄと、音声単位候補抽出部４０１により抽出された候補の言語的特徴情報１０６ｂ及び音響的特徴情報１０６ａとの整合性をコスト（目標コスト）として計算する。 For each speech unit (phoneme) indicated by the language information 104d, the target cost calculation unit 404 calculates, as a cost (target cost), how well the language information 104d and the prosodic information 109d match the linguistic feature information 106b and the acoustic feature information 106a of each candidate extracted by the speech unit candidate extraction unit 401.

言語情報１０４ｄと言語的特徴情報１０６ｂによって示される両言語的特徴に基づくコストの算出は、品詞、形態素内での位置、アクセント句内での位置、構文情報、音韻環境、及び形態素表記のそれぞれの一致度に基づいて行われる。品詞の一致度とは、言語情報１０４ｄの示す音素が属する形態素の品詞と、言語的特徴情報１０６ｂの示す品詞との一致度である。形態素内での位置の一致度とは、言語情報１０４ｄの示す音素の形態素内位置と、言語的特徴情報１０６ｂの示す音素の形態素内位置（形態素頭からの距離や、形態素末までの距離）との一致度である。アクセント句内での位置の一致度とは、言語情報１０４ｄの示す音素のアクセント句内位置と、言語的特徴情報１０６ｂの示す音素のアクセント句内位置（アクセント句先頭からの距離や、アクセント句末までの距離）との一致度である。構文情報の一致度とは、言語情報１０４ｄの示す音素が含まれる文節の係り先位置と、言語的特徴情報１０６ｂに含まれる構文情報により示される係り先距離との一致度である。また、音韻環境の一致度とは、言語情報１０４ｄの示す音素、及びその音素の前後にある音素と、言語的特徴情報１０６ｂの示す前方音素、対象音素、及び後方音素との一致度である。 The cost based on the two sets of linguistic features indicated by the language information 104d and the linguistic feature information 106b is calculated from the degree of coincidence of each of the following: part of speech, position within the morpheme, position within the accent phrase, syntax information, phonological environment, and morpheme notation. The degree of coincidence of part of speech is that between the part of speech of the morpheme to which the phoneme indicated by the language information 104d belongs and the part of speech indicated by the linguistic feature information 106b. The degree of coincidence of position within the morpheme is that between the in-morpheme position of the phoneme indicated by the language information 104d and the in-morpheme position (distance from the morpheme head and distance to the morpheme end) of the phoneme indicated by the linguistic feature information 106b. The degree of coincidence of position within the accent phrase is that between the in-accent-phrase position of the phoneme indicated by the language information 104d and the in-accent-phrase position (distance from the head of the accent phrase and distance to the end of the accent phrase) of the phoneme indicated by the linguistic feature information 106b. The degree of coincidence of syntax information is that between the dependency destination position of the clause containing the phoneme indicated by the language information 104d and the dependency distance indicated by the syntax information included in the linguistic feature information 106b. The degree of coincidence of phonological environment is that between the phoneme indicated by the language information 104d, together with the phonemes immediately before and after it, and the preceding phoneme, target phoneme, and following phoneme indicated by the linguistic feature information 106b.

なお、品詞の一致度に対して、図３に示すように、細分類された品詞による一致度を考慮することにより、より高精度な構成とすることが可能である。 Note that the part-of-speech matching can be made more accurate by also taking into account the degree of coincidence of the finely subdivided parts of speech, as shown in FIG. 3.

韻律情報１０９ｄと音響的特徴情報１０６ａによって示される両音響的特徴に基づくコストの算出は、継続時間長、基本周波数、及びパワーのそれぞれの一致度に基づいて行われる。継続時間長の一致度とは、韻律情報１０９ｄの示す音素の継続時間長と、音響的特徴情報１０６ａの示す継続時間長との一致度である。基本周波数の一致度とは、韻律情報１０９ｄの示す音素の基本周波数と、音響的特徴情報１０６ａの示す基本周波数との一致度である。また、パワーの一致度とは、韻律情報１０９ｄの示す音素のパワーと、音響的特徴情報１０６ａの示すパワーの一致度である。 The calculation of the cost based on both acoustic features indicated by the prosodic information 109d and the acoustic feature information 106a is performed based on the degree of coincidence of the duration time, the fundamental frequency, and the power. The degree of coincidence of the duration length is the degree of coincidence between the duration length of the phoneme indicated by the prosody information 109d and the duration length indicated by the acoustic feature information 106a. The coincidence degree of the fundamental frequency is the coincidence degree between the fundamental frequency of the phoneme indicated by the prosody information 109d and the fundamental frequency indicated by the acoustic feature information 106a. The power coincidence is the degree of coincidence between the phoneme power indicated by the prosody information 109d and the power indicated by the acoustic feature information 106a.

このような目標コスト計算部４０４は、上述のように算出される言語的特徴に基づくコストと、音響的特徴に基づくコストとを合算して、最終的なコスト（目標コスト）を算出する。 The target cost calculation unit 404 calculates the final cost (target cost) by adding the cost based on the linguistic feature calculated as described above and the cost based on the acoustic feature.

接続コスト計算部４０５は、候補同士を結合した場合に生じる歪みを接続コストとして計算する。 The connection cost calculation unit 405 calculates, as a connection cost, the distortion that occurs when candidates are joined to each other.

ここで、図１に示した本実施形態における音声合成装置の動作について詳しく説明する。ここでは、図３に示したテキスト「今日の天気は晴れですよ。」を示すテキストデータ１００ｔが入力された時の動作について説明する。なお、以下の説明では音声単位の一例として音素を用いているが、本発明は音声単位を音素に限定するものではない。 Here, the operation of the speech synthesizer in the present embodiment shown in FIG. 1 will be described in detail. Here, an operation when the text data 100t indicating the text “Today's weather is sunny” shown in FIG. 3 is input will be described. In the following description, phonemes are used as an example of speech units, but the present invention does not limit speech units to phonemes.

まず、言語解析部１０４は、テキストデータ１００ｔの示すテキストを音素表記で表すとともに、その音素表記を形態素に分割する。また、言語解析部１０４は構文解析を行うことで、構文情報（係り先位置を示す情報）を得る。さらに、言語解析部１０４は、読みの付与、及びアクセント句の付与などを行う。その結果、図３に示したような言語情報１０４ｄが生成される。 First, the language analysis unit 104 expresses the text indicated by the text data 100t in phoneme notation and divides that phoneme notation into morphemes. The language analysis unit 104 also performs syntactic analysis to obtain syntax information (information indicating the dependency destination position). In addition, the language analysis unit 104 assigns readings and accent phrases. As a result, language information 104d such as that shown in FIG. 3 is generated.

韻律推定部１０９は、図３に示す言語情報１０４ｄを基に、各音素に対して、継続時間長、基本周波数、及びパワーを推定する。その結果、韻律情報１０９ｄが生成される。 The prosody estimation unit 109 estimates the duration, fundamental frequency, and power for each phoneme based on the language information 104d shown in FIG. As a result, prosodic information 109d is generated.

音声単位選択部１０８の音声単位候補抽出部４０１は、図６に示したベクトル表記の音声単位データのように、取得した言語情報１０４ｄと韻律情報１０９ｄを要素とする目標ベクトルｔ_iを音声単位（ここでは音素）ごとに構成する。この場合では、音素「ｋy」が無声子音である為に基本周波数に関する情報はない。しかし、音素が有声音の場合には、その音素の継続時間長を４分割し、各区間での中間点の基本周波数を４点で表す（基本周波数１〜４）ことにより、基本周波数の動的な特徴も音声単位データや目標ベクトルｔ_iに表すことが可能である。なお、本発明は基本周波数の表現を上述のような表現とすることに限定されない。 The speech unit candidate extraction unit 401 of the speech unit selection unit 108 constructs, for each speech unit (here, each phoneme), a target vector t_i whose elements are the acquired language information 104d and prosodic information 109d, in the same form as the vector-notation speech unit data shown in FIG. 6. In this example, the phoneme "ky" is an unvoiced consonant, so there is no fundamental frequency information. When a phoneme is voiced, however, its duration is divided into four sections and the fundamental frequency at the midpoint of each section is recorded, giving four points (fundamental frequencies 1 to 4); in this way the dynamic characteristics of the fundamental frequency can also be represented in the speech unit data and in the target vector t_i. The present invention is not limited to this representation of the fundamental frequency.
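
The four-point fundamental frequency representation described above can be sketched in Python as follows. This is a minimal illustration, assuming the F0 contour is available as a list of per-frame values; the function name `four_point_f0` and the frame-based input are hypothetical and do not appear in the specification.

```python
def four_point_f0(f0_frames):
    """Return the F0 at the midpoint of each quarter of a phoneme's duration.

    f0_frames: per-frame F0 values (Hz) covering the phoneme (assumed input format).
    Returns None when there is no F0 information, as for an unvoiced consonant.
    """
    if not f0_frames:
        return None
    n = len(f0_frames)
    points = []
    for k in range(4):
        # Midpoint of the k-th quarter as a fraction of the duration: 1/8, 3/8, 5/8, 7/8.
        frac = (2 * k + 1) / 8
        idx = min(int(frac * n), n - 1)
        points.append(f0_frames[idx])
    return points


# Hypothetical rising contour sampled at eight frames.
contour = [100, 110, 120, 130, 140, 150, 160, 170]
f0_1_to_4 = four_point_f0(contour)
```

Sampling at the section midpoints keeps the vector length fixed regardless of the phoneme's duration, which is what allows it to sit alongside the other elements of the target vector.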

次に、音声単位候補抽出部４０１は、特徴パラメータＤＢ１０６より、候補となる音声単位データの集合を取り出す。具体的には、音声単位候補抽出部４０１は、言語情報１０４ｄに示される選択対象の音素と同じ音素を示す全ての音声単位データを取り出す。 Next, the speech unit candidate extraction unit 401 extracts a set of candidate speech unit data from the feature parameter DB 106. Specifically, the speech unit candidate extraction unit 401 extracts all speech unit data indicating the same phoneme as the selection target phoneme indicated in the language information 104d.

なお、特徴パラメータＤＢ１０６に蓄積された音声単位データが十分に多い場合は、音声単位候補抽出部４０１は、音韻環境（前方音素及び後方音素）による制限を加えて候補を取得しても良い。 Note that when the speech unit data stored in the feature parameter DB 106 is sufficiently large, the speech unit candidate extraction unit 401 may acquire candidates by adding restrictions based on the phoneme environment (forward phoneme and backward phoneme).
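
The candidate extraction step, with its optional restriction by phonological environment, might look like the following Python sketch. The record layout (`UnitData`), the function name, the `"$"` end marker, and the sample entries are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UnitData:
    """Hypothetical, reduced stand-in for one entry of the feature parameter DB."""
    phoneme: str
    prev_phoneme: str
    next_phoneme: str
    notation: str


def extract_candidates(db, phoneme, prev_phoneme=None, next_phoneme=None):
    """Return all entries for `phoneme`, optionally restricted by context."""
    candidates = [u for u in db if u.phoneme == phoneme]
    # The context restriction is only worthwhile when the DB is large enough
    # that candidates remain after filtering.
    if prev_phoneme is not None:
        candidates = [u for u in candidates if u.prev_phoneme == prev_phoneme]
    if next_phoneme is not None:
        candidates = [u for u in candidates if u.next_phoneme == next_phoneme]
    return candidates


db = [
    UnitData("u", "g", "r", "グラス"),
    UnitData("u", "g", "r", "忠臣蔵"),
    UnitData("u", "s", "$", "グラス"),  # "$" marks the end of the utterance (assumed)
    UnitData("e", "n", "$", "ね"),
]
all_u = extract_candidates(db, "u")
restricted_u = extract_candidates(db, "u", prev_phoneme="g", next_phoneme="r")
```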

目標コスト計算部４０４は、目標ベクトルｔ_iと、候補ｕ_iとの一致度を目標コストベクトルＣ_i ^tとして計算する。 The target cost calculation unit 404 calculates the degree of coincidence between the target vector t_i and a candidate u_i as the target cost vector C_i^t.

図８は、目標ベクトルと候補と目標コストベクトルとを示す図である。 FIG. 8 is a diagram showing a target vector, a candidate, and a target cost vector.
目標コスト計算部４０４は、例えば図８に示すように、候補ｕ_iと目標ベクトルｔ_iが与えられた場合には、ベクトルの各要素について一致度を計算し、その結果を目標コストベクトルＣ_i ^tとする。目標コスト計算部４０４は、この目標コストベクトルＣ_i ^tから目標コストを算出する。即ち、目標コストベクトルＣ_i ^tに示される各要素をサブコストとし、それらのサブコストに重みを付けて積算することで目標コストを計算する。 For example, as shown in FIG. 8, given a candidate u_i and a target vector t_i, the target cost calculation unit 404 calculates the degree of coincidence for each element of the vectors and takes the result as the target cost vector C_i^t. The target cost calculation unit 404 then calculates the target cost from this target cost vector C_i^t. That is, each element of the target cost vector C_i^t is treated as a sub-cost, and the target cost is calculated as the weighted sum of those sub-costs.

各サブコストの重みは、経験により設定すればよいが、例えば、次のような方法により決定するように目標コスト計算部４０４を構成してもよい。例えば、目標コスト計算部４０４は、各パラメータにより計算されるコスト値と、代表的な音素でのターゲットからの距離を用いて重回帰分析を行い、回帰モデルの各コスト値の係数を重みとして用いる。また、ターゲットからの距離の推定には、ケプストラム距離を用いればよい。また、等重みなど他の重みを用いても良い。 The weight of each sub-cost may be set empirically, but the target cost calculation unit 404 may instead be configured to determine the weights as follows. For example, the target cost calculation unit 404 performs multiple regression analysis using the cost values calculated for each parameter and the distance from the target for representative phonemes, and uses the coefficient of each cost value in the regression model as its weight. The cepstrum distance may be used to estimate the distance from the target. Other weightings, such as equal weights, may also be used.

なお、言語的特徴に対するサブコストの重みを、形態素情報、アクセント句情報、構文情報の順に小さくなるようにしても良い。即ち、形態素情報の一致度、アクセント句情報の一致度、構文情報の一致度の順に、音声単位データの選択に寄与する優先度が小さくなる。また、アクセント句情報の中でも、アクセント核からの距離、アクセント句末までの距離、アクセント句頭からの距離の順にサブコストの重みを小さくしても良い。即ち、アクセント核からの距離の一致度、アクセント句末までの距離の一致度、アクセント句頭からの距離の一致度の順に、音声単位データの選択に寄与する優先度が小さくなる。 Note that the sub-cost weights for linguistic features may be reduced in the order of morpheme information, accent phrase information, and syntax information. That is, the priority that contributes to the selection of the voice unit data decreases in the order of the matching degree of the morpheme information, the matching degree of the accent phrase information, and the matching degree of the syntax information. In the accent phrase information, the weight of the sub cost may be decreased in the order of the distance from the accent nucleus, the distance to the end of the accent phrase, and the distance from the beginning of the accent phrase. That is, the priority that contributes to the selection of audio unit data decreases in the order of the degree of coincidence of distance from the accent core, the degree of coincidence of the distance to the end of the accent phrase, and the degree of coincidence of the distance from the beginning of the accent phrase.

図８に示した例の場合、表記に関しては目標ベクトルｔ_iの「今日」と候補ｕ_iの「消去」とで不一致、前方音素に関しては目標ベクトルｔ_iの「＾（文頭）」と候補ｕ_iの「ｕ」とで不一致、後方音素に関しては目標ベクトルｔ_iの「ｏ」と候補ｕ_iの「ｏ」とで一致、品詞に関しては目標ベクトルｔ_iの「普通名詞」と候補ｕ_iの「サ変名詞」とで不一致となる。目標コスト計算部４０４は、非数値的な要素に関する項目において、一致をサブコスト「０」とし、不一致をサブコスト「１」とする。また、目標コスト計算部４０４は、数値的な要素に関する項目において、要素間の差分の絶対値をサブコストとする。したがって、形態素頭からの距離に対するサブコストは４−１＝３、形態素末までの距離に対するサブコストは４−３＝１、アクセント句頭からの距離に対するサブコストは４−１＝３、アクセント句末までの距離に対するサブコストは８−５＝３、継続時間長に対するサブコストは３２−２５＝７、パワーに対するサブコストは３０００−２９１０＝９０となる。目標コスト計算部４０４は、例えば経験によって定められた上記重みを各サブコストにつけて合算することで、トータルの目標コストを算出する。 In the example shown in FIG. 8, the notation does not match (「今日」 for the target vector t_i versus 「消去」 for the candidate u_i); the preceding phoneme does not match ("^" (sentence start) for t_i versus "u" for u_i); the following phoneme matches ("o" for both t_i and u_i); and the part of speech does not match ("common noun" for t_i versus "sa-hen noun" (verbal noun) for u_i). For non-numeric elements, the target cost calculation unit 404 assigns a sub-cost of 0 to a match and 1 to a mismatch. For numeric elements, it takes the absolute value of the difference between the elements as the sub-cost. Accordingly, the sub-cost for the distance from the morpheme head is 4 - 1 = 3, that for the distance to the morpheme end is 4 - 3 = 1, that for the distance from the accent phrase head is 4 - 1 = 3, that for the distance to the accent phrase end is 8 - 5 = 3, that for the duration is 32 - 25 = 7, and that for the power is 3000 - 2910 = 90. The target cost calculation unit 404 calculates the total target cost by multiplying each sub-cost by its weight, determined empirically for example, and summing the results.
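
The sub-cost rules and the FIG. 8 arithmetic can be reproduced with a short Python sketch. The dictionary keys and function names are hypothetical. The passage gives each numeric pair only as a difference (for example 4 - 1), so which operand belongs to the target and which to the candidate is an assumption here; the absolute difference is the same either way. Equal weights are used only because the text permits them as one option.

```python
def sub_costs(target, candidate):
    """Per-element sub-costs: |difference| for numbers, 0/1 for everything else."""
    costs = {}
    for key, t in target.items():
        u = candidate[key]
        if isinstance(t, (int, float)):
            costs[key] = abs(t - u)          # numeric element
        else:
            costs[key] = 0 if t == u else 1  # non-numeric element
    return costs


def target_cost(costs, weights):
    """Weighted sum of the sub-costs."""
    return sum(weights[key] * value for key, value in costs.items())


# Values taken from the FIG. 8 example; the target/candidate assignment of the
# numeric operands is assumed.
target = {
    "notation": "今日", "prev_phoneme": "^", "next_phoneme": "o",
    "pos": "common noun",
    "dist_from_morpheme_head": 4, "dist_to_morpheme_end": 4,
    "dist_from_accent_head": 4, "dist_to_accent_end": 8,
    "duration": 32, "power": 3000,
}
candidate = {
    "notation": "消去", "prev_phoneme": "u", "next_phoneme": "o",
    "pos": "sa-hen noun",
    "dist_from_morpheme_head": 1, "dist_to_morpheme_end": 3,
    "dist_from_accent_head": 1, "dist_to_accent_end": 5,
    "duration": 25, "power": 2910,
}

costs = sub_costs(target, candidate)
equal_weights = {key: 1.0 for key in costs}
total = target_cost(costs, equal_weights)
```

With equal weights the power sub-cost (90) dominates the total of 110, which illustrates why the choice of weights matters when elements are on different scales.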

接続コスト計算部４０５は、２つの音声単位データを接続した時の歪みを接続コストとして計算する。計算方法は特に何でも良いが、例えば、接続コスト計算部４０５は、２つの音声単位データの接続フレームでのケプストラム距離を接続コストとする。 The connection cost calculation unit 405 calculates the distortion when the two audio unit data are connected as the connection cost. The calculation method may be anything, but for example, the connection cost calculation unit 405 uses the cepstrum distance in the connection frame of two voice unit data as the connection cost.
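
As one possible reading of "cepstrum distance at the connection frame", the Python sketch below takes the Euclidean distance between the cepstral coefficients of the last frame of the left unit and the first frame of the right unit. The Euclidean metric, the frame layout, and the function name are assumptions; the text leaves the calculation method open.

```python
import math


def connection_cost(left_unit_frames, right_unit_frames):
    """Euclidean cepstral distance across the join between two units.

    Each argument is a list of frames; each frame is a list of cepstral
    coefficients of equal length (assumed representation).
    """
    last_left = left_unit_frames[-1]
    first_right = right_unit_frames[0]
    return math.dist(last_left, first_right)


# Hypothetical two-coefficient frames.
left = [[0.0, 0.0], [1.0, 2.0]]
right = [[4.0, 6.0], [5.0, 5.0]]
join_cost = connection_cost(left, right)
```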

探索部４０２は、音声単位候補抽出部４０１により抽出された候補の中から目標コストと接続コストを用いて最適な音声単位データを選択する。具体的に、探索部４０２は、以下の数１に示す数式に基づいて最適な音声単位系列を探索する。 The search unit 402 selects optimal speech unit data from the candidates extracted by the speech unit candidate extraction unit 401 using the target cost and the connection cost. Specifically, the search unit 402 searches for an optimal speech unit sequence based on the following mathematical formula 1.
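
The image of Equation 1 is not reproduced in this text. A form consistent with the symbol definitions given for it (n, u, t, C^t, C^c) and with the minimum-total-cost search described here would be the following. It is a reconstruction offered as an assumption, not the published formula; in particular, starting the connection-cost sum at i = 2 is a guess.

```latex
C = \min_{u_1, \dots, u_n}
    \left[ \sum_{i=1}^{n} C^{t}(t_i, u_i)
         + \sum_{i=2}^{n} C^{c}(u_{i-1}, u_i) \right]
```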

数１において、ｎはテキスト（言語情報１０４ｄの音素表記）に含まれる音素の数（音素数）である。例えば、テキスト「今日の天気は晴れですよ。」に対してｎは２１である。ｕは候補として挙げられる音声単位データ、ｔは目標ベクトル、Ｃtは目標コスト、Ｃcは接続コストを示す。

In Equation 1, n is the number of phonemes (phoneme number) included in the text (phoneme representation of the language information 104d). For example, n is 21 for the text “Today's weather is sunny.” u represents voice unit data listed as candidates, t represents a target vector, Ct represents a target cost, and Cc represents a connection cost.

探索部４０２は、テキスト全体でのコストＣが最小になる音声単位系列を特定して音声合成部１１０に通知する。 The search unit 402 identifies a speech unit sequence that minimizes the cost C in the entire text and notifies the speech synthesis unit 110 of it.
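
One common way to find the sequence that minimizes a sum of per-unit target costs and pairwise connection costs is dynamic programming (a Viterbi-style search). The text does not state which search algorithm the search unit 402 uses, so the following Python sketch is an assumption; the function name, the callback signatures, and the toy costs are hypothetical.

```python
def search_unit_sequence(candidates, target_cost, connection_cost):
    """Return (total_cost, sequence) minimizing target plus connection costs.

    candidates: non-empty list with one non-empty list of candidates per phoneme.
    target_cost(i, u): target cost of candidate u at position i.
    connection_cost(p, u): cost of joining candidate p to the following candidate u.
    """
    # best holds, for each candidate at the current position, the cheapest
    # (cost, path) ending in that candidate.
    best = [(target_cost(0, u), [u]) for u in candidates[0]]
    for i in range(1, len(candidates)):
        new_best = []
        for u in candidates[i]:
            cost, path = min(
                ((c + connection_cost(p[-1], u), p) for c, p in best),
                key=lambda cp: cp[0],
            )
            new_best.append((cost + target_cost(i, u), path + [u]))
        best = new_best
    return min(best, key=lambda cp: cp[0])


# Toy example with two phonemes and two candidates each.
T = {"a1": 1, "a2": 2, "b1": 5, "b2": 1}
J = {("a1", "b1"): 0, ("a1", "b2"): 10, ("a2", "b1"): 0, ("a2", "b2"): 1}
total, sequence = search_unit_sequence(
    [["a1", "a2"], ["b1", "b2"]],
    lambda i, u: T[u],
    lambda p, u: J[(p, u)],
)
```

In this toy case, picking the lowest target cost at each position independently would give a1 then b2 at a total of 12, whereas searching the whole sequence at once finds a2 then b2 at a total of 4.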

次に、本実施形態の音声合成装置が、外来語を示すテキストデータ１００ｔを取得したときの具体的な動作について説明する。 Next, a specific operation when the speech synthesizer of this embodiment acquires text data 100t indicating a foreign word will be described.

例えば、音声合成装置の言語解析部１０４は、日本語のテキスト「これはグラウンドです。」（This is a ground.）を示すテキストデータ１００ｔを取得する。上記テキスト中の「グラウンド」（ground）の品詞は外来語である。 For example, the language analysis unit 104 of the speech synthesizer acquires text data 100t indicating the Japanese text 「これはグラウンドです。」 ("This is a ground."). The part of speech of 「グラウンド」 ("ground") in this text is foreign word.

このようなテキストデータ１００ｔを取得した言語解析部１０４は、そのテキストデータ１００ｔに基づいて言語情報１０４ｄを生成する。 The language analysis unit 104 that has acquired such text data 100t generates language information 104d based on the text data 100t.

図９は、外来語を示すテキストデータ１００ｔから生成された言語情報１０４ｄの内容を示す図である。 FIG. 9 is a diagram showing the contents of the language information 104d generated from the text data 100t indicating a foreign word.

この言語情報１０４ｄは、図３に示す言語情報１０４ｄと同様、テキストと、そのテキストに相当する一連の音素（音素表記）と、テキストに含まれる各形態素と、テキストに含まれる各文節と、各形態素の品詞と、形態素内位置と、アクセント句内位置と、係り先位置とを示す。そして、この言語情報１０４ｄは、形態素「グラウンド」の品詞として外来語を示す。 Similar to the language information 104d shown in FIG. 3, the language information 104d includes a text, a series of phonemes (phoneme notation) corresponding to the text, morphemes included in the text, clauses included in the text, A morpheme part of speech, a morpheme position, an accent phrase position, and a dependency point position are shown. The language information 104d indicates a foreign word as a part of speech of the morpheme “ground”.

音声単位選択部１０８は、言語情報１０４ｄに示される音素ごとに最適な音声単位データを特徴パラメータＤＢ１０６から選択する。 The speech unit selection unit 108 selects optimal speech unit data for each phoneme indicated in the language information 104d from the feature parameter DB 106.

図１０（ａ）は、音声単位データの選択対象となる音素を説明するための説明図である。 FIG. 10A is an explanatory diagram for explaining a phoneme to be selected for voice unit data.

例えば、音声単位選択部１０８は、「グラウンド」という外来語の「グ」の母音部分の音素「ｕ」に最適な音声単位データを選択する。 For example, the speech unit selection unit 108 selects the speech unit data best suited to the phoneme "u", the vowel part of 「グ」 (gu) in the foreign word 「グラウンド」 ("ground").

具体的に、音声単位選択部１０８は、まず、音素「ｕ」に対して目標ベクトルｔ_iを生成するとともに、音素「ｕ」に対応する候補ｕ₁，ｕ₂を特徴パラメータＤＢ１０６から選択する。 Specifically, the speech unit selection unit 108 first generates a target vector t _i for the phoneme “u” and selects candidates u ₁ and u ₂ corresponding to the phoneme “u” from the feature parameter DB 106.

図１０（ｂ）は、音素「ｕ」の目標ベクトルｔ_i及び候補ｕ₁，ｕ₂を示す図である。候補ｕ₁の表記「忠臣蔵」の品詞は固有名詞である。候補ｕ₂の示す表記「グラス」の品詞は外来語であって、その表記「グラス」は"glass#"を意味する。 FIG. 10(b) shows the target vector t_i and the candidates u₁ and u₂ for the phoneme "u". The notation of candidate u₁ is 「忠臣蔵」 (Chūshingura), whose part of speech is proper noun. The notation of candidate u₂ is 「グラス」, whose part of speech is foreign word; it means "glass".

音声単位選択部１０８は、候補ｕ₁，ｕ₂のうち、目標ベクトルｔ_iに最も近い候補を、音声合成に用いる最適な音声単位データとして選択する。 From the candidates u₁ and u₂, the speech unit selection unit 108 selects the one closest to the target vector t_i as the optimal speech unit data to use for speech synthesis.

ここで、従来の音声合成装置は、音韻環境（前方音素、対象音素、及び後方音素）および音響的特徴（継続時間、パワー、及び基本周波数など）を利用することにより、候補ｕ₁と候補ｕ₂のうち、候補ｕ₁を選択する。その選択の理由は、音響的特徴において、候補ｕ₁の方が、候補ｕ₂よりも目標ベクトルｔ_iに近いからである。しかしながら、日本語の固有名詞に含まれる音素「ｕ」と外来語に含まれる音素「ｕ」との間には前述の音響的特徴では表現できない違いが存在する。その結果、従来の音声合成装置は、不適切な音声単位データを選択して音声合成に用いるため、不自然な合成音声を出力してしまう。 Here, a conventional speech synthesizer, which uses the phonological environment (preceding phoneme, target phoneme, and following phoneme) and acoustic features (duration, power, fundamental frequency, and so on), selects candidate u₁ out of candidates u₁ and u₂. It does so because, in terms of acoustic features, candidate u₁ is closer to the target vector t_i than candidate u₂ is. However, there is a difference between the phoneme "u" in a Japanese proper noun and the phoneme "u" in a foreign word that these acoustic features cannot express. As a result, the conventional speech synthesizer selects inappropriate speech unit data for synthesis and outputs unnatural synthesized speech.

一方、本実施形態の音声合成装置は、言語的特徴の１つである品詞を用いることにより、最適な音声単位データを選択することができる。即ち、音声合成装置の音声単位選択部１０８は、目標ベクトルｔ_iの品詞が外来語であれば、外来語であることを考慮して、品詞として外来語を示す候補ｕ₂を選択する。その結果、本実施形態の音声合成装置は、テキストデータ１００ｔにより示される外来語を外来語らしい自然な合成音声に変換することができる。 On the other hand, the speech synthesizer of the present embodiment can select optimal speech unit data by using the part of speech that is one of linguistic features. That is, if the part of speech of the target vector t _i is a foreign word, the speech unit selection unit 108 of the speech synthesizer selects the candidate u ₂ that indicates the foreign word as a part of speech in consideration of the foreign word. As a result, the speech synthesizer of the present embodiment can convert a foreign word indicated by the text data 100t into a natural synthesized speech that seems to be a foreign word.
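
The contrast between the conventional selection and the part-of-speech-aware selection can be illustrated with a small Python sketch. All numeric values and the weight of 50 are hypothetical, chosen only so that u₁ is acoustically closer to the target while u₂ matches its part of speech; FIG. 10(b) is not reproduced here, and the cost function is reduced to two acoustic elements plus the part of speech.

```python
def cost(target, cand, pos_weight):
    """Acoustic distance plus a weighted 0/1 part-of-speech mismatch."""
    acoustic = (abs(target["duration"] - cand["duration"])
                + abs(target["power"] - cand["power"]))
    pos_mismatch = 0 if target["pos"] == cand["pos"] else 1
    return acoustic + pos_weight * pos_mismatch


def select(target, candidates, pos_weight):
    """Pick the candidate with the lowest cost."""
    return min(candidates, key=lambda c: cost(target, c, pos_weight))


# Hypothetical values for the phoneme "u" in 「グラウンド」.
target = {"pos": "foreign word", "duration": 30, "power": 3000}
u1 = {"notation": "忠臣蔵", "pos": "proper noun", "duration": 29, "power": 2990}
u2 = {"notation": "グラス", "pos": "foreign word", "duration": 27, "power": 2980}

conventional = select(target, [u1, u2], pos_weight=0)   # acoustic features only
proposed = select(target, [u1, u2], pos_weight=50)      # part of speech considered
```

With the part-of-speech weight at zero the acoustically closer u₁ wins (11 versus 23); once the mismatch is penalized, u₂ is selected (23 versus 61).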

次に、本実施形態の音声合成装置が、終助詞を示すテキストデータ１００ｔを取得したときの具体的な動作について説明する。 Next, a specific operation when the speech synthesizer of this embodiment acquires text data 100t indicating a final particle will be described.

例えば、音声合成装置の言語解析部１０４は、日本語のテキスト「これはグラウンドですよね。」（This is a ground, isn't it?）を示すテキストデータ１００ｔを取得する。上記テキスト中の「ね」の品詞は終助詞である。 For example, the language analysis unit 104 of the speech synthesizer acquires text data 100t indicating the Japanese text 「これはグラウンドですよね。」 ("This is a ground, isn't it?"). The part of speech of 「ね」 (ne) in this text is final particle.

図１１（ａ）は、音声単位データの選択対象となる音素を説明するための説明図である。 FIG. 11A is an explanatory diagram for explaining a phoneme to be selected for speech unit data.

例えば、音声単位選択部１０８は、「ね」という終助詞の母音部分の音素「ｅ」に最適な音声単位データを選択する。 For example, the speech unit selection unit 108 selects speech unit data that is optimal for the phoneme “e” of the vowel part of the final particle “ne”.

具体的に、音声単位選択部１０８は、まず、音素「ｅ」に対して目標ベクトルｔ_iを生成するとともに、音素「ｅ」に対応する候補ｕ₁，ｕ₂を特徴パラメータＤＢ１０６から選択する。 Specifically, the speech unit selection unit 108 first generates a target vector t _i for the phoneme “e” and selects candidates u ₁ and u ₂ corresponding to the phoneme “e” from the feature parameter DB 106.

図１１（ｂ）は、音素「ｅ」の目標ベクトルｔ_i及び候補ｕ₁，ｕ₂を示す図である。候補ｕ₁の表記「根」の品詞は普通名詞であって、その表記「根」は"root"を意味する。候補ｕ₂の表記「ね」の品詞は終助詞である。 FIG. 11(b) shows the target vector t_i and the candidates u₁ and u₂ for the phoneme "e". The notation of candidate u₁ is 「根」 (ne), a common noun meaning "root". The notation of candidate u₂ is 「ね」 (ne), whose part of speech is final particle.

ここで、従来の音声合成装置は、音韻環境（前方音素、対象音素、及び後方音素）および音響的特徴（継続時間、パワー、及び基本周波数など）を利用することにより、候補ｕ₁と候補ｕ₂のうち、候補ｕ₁を選択する。その選択の理由は、音響的特徴において、候補ｕ₁の方が、候補ｕ₂よりも目標ベクトルｔ_iに近いからである。しかしながら、日本語の終助詞「ね」に含まれる音素「ｅ」には特有の特徴が存在し、その特徴は他の品詞の「ね」に含まれる音素「ｅ」の特徴と大きく異なる。したがって、従来の音声合成装置によって選択される音声単位データは、音響的特徴について目標ベクトルｔ_iと一致度が高いが、実際の合成音声に用いる音声単位データとしては必ずしも適切でない可能性が存在する。 Here, a conventional speech synthesizer, which uses the phonological environment (preceding phoneme, target phoneme, and following phoneme) and acoustic features (duration, power, fundamental frequency, and so on), selects candidate u₁ out of candidates u₁ and u₂. It does so because, in terms of acoustic features, candidate u₁ is closer to the target vector t_i than candidate u₂ is. However, the phoneme "e" in the Japanese final particle 「ね」 has characteristics of its own, and these differ greatly from those of the phoneme "e" in a 「ね」 belonging to any other part of speech. Therefore, although the speech unit data selected by the conventional speech synthesizer agrees closely with the target vector t_i in terms of acoustic features, it may not be appropriate as the speech unit data actually used for the synthesized speech.

一方、本実施形態の音声合成装置は、言語的特徴の１つである品詞を用いることにより、最適な音声単位データを選択することができる。即ち、音声合成装置の音声単位選択部１０８は、目標ベクトルｔ_iの品詞が終助詞であれば、終助詞であることを考慮して、品詞として終助詞を示す候補ｕ₂を選択する。その結果、本実施形態の音声合成装置は、テキストデータ１００ｔにより示される終助詞を、その終助詞が示す疑問感などを表現するような自然な合成音声に変換することができる。 On the other hand, the speech synthesizer of the present embodiment can select optimal speech unit data by using the part of speech that is one of linguistic features. That is, if the part of speech of the target vector t _i is a final particle, the speech unit selection unit 108 of the speech synthesizer selects the candidate u ₂ indicating the final particle as a part of speech considering that it is a final particle. As a result, the speech synthesizer according to the present embodiment can convert the final particle indicated by the text data 100t into a natural synthesized speech that expresses the feeling of doubt indicated by the final particle.

図１２（ａ）、図１２（ｂ）、図１２（ｃ）、及び図１２（ｄ）は、本発明の効果を説明するための説明図である。これらの図は、４つの単語に属する音韻「よ」（音素表記「ｙｏ」）の音声区間について、文献「Ohtsuka, Kasuya, "An Improved Speech Analysis-Synthesis Algorithm based on the Autoregressive with Exogenous Input Speech Production Model", ICSLP2000」に示されている分析を行った結果を示している。このような分析は、音声信号を音声生成モデルに当てはめて声帯（音源）と声道（フィルタ）に分離するものである。 FIGS. 12(a), 12(b), 12(c), and 12(d) are explanatory diagrams for explaining the effect of the present invention. These figures show the results of applying the analysis described in Ohtsuka, Kasuya, "An Improved Speech Analysis-Synthesis Algorithm based on the Autoregressive with Exogenous Input Speech Production Model", ICSLP2000, to the speech segments of the phoneme 「よ」 (phoneme notation "yo") belonging to four words. This analysis fits the speech signal to a speech production model and separates it into a vocal cord (sound source) component and a vocal tract (filter) component.

図１２（ａ）は、副詞「よく」（fully）の音韻「よ」に対する分析結果を示す。図１２（ｂ）は、普通名詞「よる」（night）の音韻「よ」に対する分析結果を示す。図１２（ｃ）は、一文に含まれる終助詞である音韻「よ」に対する分析結果を示す。図１２（ｄ）は、他の文に含まれる終助詞である音韻「よ」に対する分析結果を示す。 FIG. 12(a) shows the analysis result for the phoneme 「よ」 in the adverb 「よく」 ("fully"). FIG. 12(b) shows the analysis result for the phoneme 「よ」 in the common noun 「よる」 ("night"). FIG. 12(c) shows the analysis result for the phoneme 「よ」 used as a final particle in one sentence. FIG. 12(d) shows the analysis result for the phoneme 「よ」 used as a final particle in another sentence.

これらの分析結果は、第１ホルマントの中心周波数Ｆ１と、第２ホルマントの中心周波数Ｆ２と、第３ホルマントの中心周波数Ｆ３と、各ホルマントの帯域幅とを示す。なお、各図中において帯域幅は、上記各中心周波数Ｆ１，Ｆ２，Ｆ３を示す線に重ねて書かれた垂直の線分により表されている。上記各ホルマントの中心周波数は声道の共鳴によるピークを示し、帯域幅は共鳴の強さを表す。広い帯域幅は弱い共鳴を意味し、狭い帯域幅は強い共鳴を意味する。 These analysis results show the center frequency F1 of the first formant, the center frequency F2 of the second formant, the center frequency F3 of the third formant, and the bandwidth of each formant. In each figure, the bandwidth is represented by a vertical line segment written over the lines indicating the center frequencies F1, F2, and F3. The center frequency of each formant indicates a peak due to resonance of the vocal tract, and the bandwidth indicates the strength of the resonance. A wide bandwidth means a weak resonance and a narrow bandwidth means a strong resonance.

各図によって示される４つの分析結果とも、時間軸の前半から後半にかけて第１ホルマントの中心周波数Ｆ１が上昇し、第２ホルマントの中心周波数Ｆ２が下降するという、音韻「よ」に起因する共通の特徴を示す。したがって、音韻「よ」を含む形態素の品詞にかかわらず、音韻「よ」に対する各ホルマントの中心周波数の軌跡（以下、ホルマント軌跡という）は類似している。 All four analysis results share a feature attributable to the phoneme 「よ」: from the first half to the second half of the time axis, the center frequency F1 of the first formant rises and the center frequency F2 of the second formant falls. Thus, regardless of the part of speech of the morpheme containing 「よ」, the trajectories of the formant center frequencies for 「よ」 (hereinafter, formant trajectories) are similar.

このように、音韻「よ」のホルマント軌跡は、共通の特徴を持っているが、その音韻「よ」を含む形態素の品詞に応じて、音韻「よ」を耳で聞いた感じは明確に異なる。即ち、図１２（ｃ）及び図１２（ｄ）に示す終助詞の音韻「よ」に対して、人は近い印象を受ける。一方、図１２（ｃ）に示す終助詞の音韻「よ」と、図１２（ａ）に示す副詞の形態素に含まれる音韻「よ」とに対して、人は異なる印象を受ける。同様に、図１２（ｃ）に示す終助詞の音韻「よ」と、図１２（ｂ）に示す普通名詞の形態素に含まれる音韻「よ」とに対して、人は異なる印象を受ける。 Thus, although the formant trajectories of the phoneme 「よ」 share common features, how 「よ」 sounds to the ear clearly differs depending on the part of speech of the morpheme that contains it. That is, listeners receive similar impressions from the final-particle 「よ」 shown in FIG. 12(c) and FIG. 12(d). On the other hand, listeners receive different impressions from the final-particle 「よ」 shown in FIG. 12(c) and the 「よ」 in the adverb morpheme shown in FIG. 12(a). Similarly, listeners receive different impressions from the final-particle 「よ」 shown in FIG. 12(c) and the 「よ」 in the common-noun morpheme shown in FIG. 12(b).

このような印象の違いはホルマント軌跡だけでは説明が困難である。 Such a difference in impression is difficult to explain by formant trajectories alone.
終助詞が発声される文末においてはリラックスした発声が行われるため、声帯は振動しながらも閉じ方が弱くなる傾向がある。このように声門（声帯の両側のひだの間にできた隙間）が広い状況では、気管や肺などの声帯の下の空間（以下、声門下部空間）による共鳴の影響が強く表れることが知られている。このことは文献「D.Klatt, L.Klatt, "Analysis, Synthesis, and Perception of Voice Quality Variations among Female and Male talkers", J. Acoust. Soc. Am. 87(2), February 1990, pp.820-857」に記載されている。 At the end of a sentence, where a final particle is uttered, speech is relaxed, so the vocal cords tend to close less firmly even while they vibrate. It is known that when the glottis (the gap between the folds on either side of the vocal cords) is wide in this way, resonance from the space below the vocal cords, such as the trachea and lungs (hereinafter, the subglottal space), has a strong effect. This is described in D. Klatt, L. Klatt, "Analysis, Synthesis, and Perception of Voice Quality Variations among Female and Male talkers", J. Acoust. Soc. Am. 87(2), February 1990, pp. 820-857.

この文献の「D. Results III: Tracheal coupling」（p832）によると、声門下部空間の共鳴の影響により、極と零点の出現という現象と、第１ホルマントの帯域幅の増加という現象とが生じる。図１２（ｃ）に示す分析結果には、ホルマントとは異なる周波数の弱いピーク１４１が現れている。また、図１２（ｄ）に示す分析結果においても、上述と同様に、ホルマントとは異なる周波数の弱いピーク１４２が現れている。上記文献では女性の場合１７００Ｈｚ付近に極が出現することが記述されていることから、ピーク１４１，１４２は上記極の出現現象と推測される。また、図１２（ｃ）の分析結果に示される第１ホルマントの帯域幅と、図１２（ｄ）の分析結果に示される第１ホルマントの帯域幅とは、それぞれ比較的広いという共通の特徴を有する。 According to "D. Results III: Tracheal coupling" (p. 832) of this document, resonance in the subglottal space gives rise to two phenomena: the appearance of poles and zeros, and an increase in the bandwidth of the first formant. In the analysis result shown in FIG. 12(c), a weak peak 141 appears at a frequency different from those of the formants. Likewise, in the analysis result shown in FIG. 12(d), a weak peak 142 appears at a frequency different from those of the formants. Because the document states that a pole appears near 1700 Hz for female speakers, the peaks 141 and 142 are presumed to be instances of this pole. In addition, the first formant bandwidths shown in the analysis results of FIG. 12(c) and FIG. 12(d) share the feature of being relatively wide.

これに対して、図１２（ａ）および図１２（ｂ）に示す分析結果には、上述した２つの現象が大きく現れていない。 On the other hand, the two phenomena described above do not appear greatly in the analysis results shown in FIGS. 12 (a) and 12 (b).

このように、人が各音韻を聞き取る印象は、その各音韻のホルマント軌跡などによって示される音響的特徴が類似していても、各音韻の属する品詞に応じて大きく異なる。特に、品詞が終助詞であるか否か、又は品詞が外来語であるか否かによって大きく異なる。 In this way, the impression that a person hears each phoneme varies greatly depending on the part of speech to which each phoneme belongs, even if the acoustic features indicated by the formant trajectory of each phoneme are similar. In particular, it largely differs depending on whether the part of speech is a final particle or whether the part of speech is a foreign word.

そこで本実施形態の音声合成装置は、形態素に含まれる音素に対して、その形態素の品詞（特に終助詞又は外来語）に応じた音声単位データを選択するため、自然な合成音声を出力することができるのである。つまり、本実施形態の音声合成装置は、テキストデータ１００ｔのテキストに示す通りの自然な合成音声を出力することができる。 The speech synthesizer of the present embodiment therefore selects, for each phoneme in a morpheme, speech unit data that matches the part of speech of that morpheme (in particular, final particle or foreign word), and can thus output natural synthesized speech. In other words, the speech synthesizer of the present embodiment can output natural synthesized speech that is faithful to the text of the text data 100t.

また、本実施形態の音声合成装置では、音響的特徴のみではなく、外来語又は終助詞であるか否かといった言語的特徴も考慮することにより、韻律推定部１０９による音響的特徴の推定精度が十分でない場合においても、特徴パラメータＤＢ１０６に格納されている音声単位データの言語的特徴に基づいて、より信頼度の高い音声単位データを選択することが可能になる。 In the speech synthesizer according to the present embodiment, not only the acoustic features but also linguistic features such as whether or not a foreign word or a final particle is taken into account, so that the accuracy of the estimation of the acoustic features by the prosody estimation unit 109 is improved. Even when it is not sufficient, it is possible to select speech unit data with higher reliability based on the linguistic features of the speech unit data stored in the feature parameter DB 106.

また、本発明による音声合成装置は、カーナビゲーションシステムやエンターテイメント分野における読み上げ装置等として有用である。 The speech synthesizer according to the present invention is useful as a car navigation system, a reading device in the entertainment field, and the like.

(Modification 1)
The speech synthesis unit 110 of the above embodiment generates a synthesized speech signal based on the series of speech unit data held in the feature parameter DB 106. The speech synthesis unit according to this modification instead acquires a signal indicating the speech waveform corresponding to each piece of speech unit data and generates the synthesized speech signal by concatenating those signals.

FIG. 13 is a configuration diagram showing the configuration of the speech synthesizer according to this modification.
The speech synthesizer according to this modification includes the feature parameter DB 106, the language analysis unit 104, the prosody estimation unit 109, the speech unit selection unit 108, a speech synthesis unit 110a, the speaker 111, and a speech waveform signal DB 101.

The speech waveform signal DB 101 holds, for each piece of speech unit data stored in the feature parameter DB 106, a speech waveform signal indicating the corresponding speech waveform.

When the speech synthesis unit 110a identifies the series of speech unit data (speech unit sequence) selected by the speech unit selection unit 108, it acquires the speech waveform signal corresponding to each piece of speech unit data from the speech waveform signal DB 101. The speech synthesis unit 110a then generates a synthesized speech signal by concatenating those speech waveform signals.
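The lookup-and-concatenate behavior of the speech synthesis unit 110a can be sketched as follows. This is a minimal illustration only: the dictionary standing in for the speech waveform signal DB 101, the unit IDs, and the function name `concatenate_units` are all assumptions, and no smoothing at the joins is attempted.

```python
from typing import Dict, List, Sequence

# Hypothetical stand-in for the speech waveform signal DB 101:
# it maps a speech unit ID to the waveform samples of that unit.
WaveformDB = Dict[str, List[float]]


def concatenate_units(unit_ids: Sequence[str], waveform_db: WaveformDB) -> List[float]:
    """Look up the waveform of each selected unit and join them in order."""
    signal: List[float] = []
    for unit_id in unit_ids:
        signal.extend(waveform_db[unit_id])  # KeyError if a unit has no waveform
    return signal


# Illustrative data only; real waveforms would hold many more samples.
waveform_db: WaveformDB = {"y_03": [0.0, 0.1], "o_11": [0.3, 0.2, 0.1]}
synthesized = concatenate_units(["y_03", "o_11"], waveform_db)
```

Here `synthesized` holds the samples of the two units back to back, in the order given by the selected speech unit sequence.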

(Modification 2)
The cost calculation unit 403 of the above embodiment calculates the target cost by applying a predetermined weight to each sub-cost and summing the weighted sub-costs. The cost calculation unit according to this modification is characterized in that it varies these weights.

FIG. 14 is a configuration diagram showing the internal configuration of the cost calculation unit according to this modification.
The cost calculation unit 403a according to this modification includes a target cost calculation unit 404, a connection cost calculation unit 405, and a weight determination unit 501.

When the target cost calculation unit 404 calculates a cost, the weight determination unit 501 changes the weight of the linguistic features and the weight of the acoustic features based on the reliability of the prosody information 109d output by the prosody estimation unit 109. The weight determination unit 501 then notifies the target cost calculation unit 404 of the changed weights, and the target cost calculation unit 404 calculates the target cost based on the notified weights.

For example, when the reliability of the prosody information 109d is low, the weight determination unit 501 increases the weights for the linguistic-feature sub-costs and decreases the weights for the acoustic-feature sub-costs. As a result, the target cost calculation unit 404 calculates a target cost that reflects the degree of linguistic-feature matching more strongly than the degree of acoustic-feature matching, and the search unit 402 selects speech unit data giving more consideration to linguistic-feature matching than to acoustic-feature matching. That is, if a candidate matches the target vector t_i in its acoustic features but not in its linguistic features, the search unit 402 does not select that candidate and instead selects a candidate that matches in its linguistic features.

In this way, the cost calculation unit 403a according to this modification varies the weights for the sub-costs according to the reliability of the prosody information 109d, which is the estimation result of the prosody estimation unit 109. Therefore, when it is difficult for the prosody estimation unit 109 to estimate, for example, emotion, speech unit data can be selected robustly by emphasizing the degree of linguistic-feature matching rather than relying on the direct estimates of fundamental frequency, duration, power, and the like.
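The effect of reliability-dependent weights can be illustrated with a small sketch. The source states only that the weights change with the reliability of the prosody information 109d; the linear mapping in `determine_weights`, the reliability range of 0 to 1, and the sub-cost values below are assumptions chosen for illustration.

```python
from typing import Dict, Tuple


def determine_weights(reliability: float) -> Tuple[float, float]:
    """Return (linguistic_weight, acoustic_weight) for a reliability in [0, 1].

    Assumption: a linear mapping. Low reliability favors linguistic features.
    """
    r = min(max(reliability, 0.0), 1.0)
    return 1.0 - 0.5 * r, 0.5 + 0.5 * r


def target_cost(linguistic_subcost: float, acoustic_subcost: float,
                reliability: float) -> float:
    """Weighted sum of the two sub-costs."""
    w_ling, w_acou = determine_weights(reliability)
    return w_ling * linguistic_subcost + w_acou * acoustic_subcost


# Hypothetical candidates as (linguistic sub-cost, acoustic sub-cost).
# A matches acoustically but not linguistically; B is the reverse.
candidates: Dict[str, Tuple[float, float]] = {"A": (1.0, 0.0), "B": (0.0, 0.6)}


def pick(reliability: float) -> str:
    """Return the name of the candidate with the lowest target cost."""
    return min(candidates, key=lambda name: target_cost(*candidates[name], reliability))
```

With these assumed numbers, `pick(0.0)` returns candidate B (the linguistic match) and `pick(1.0)` returns candidate A (the acoustic match), which mirrors the behavior described for the search unit 402.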

For example, suppose the prosody estimation unit 109 acquires language information 104d indicating a foreign word, as shown in FIG. 9, and generates prosody information 109d based on that language information 104d. In this case, the weight determination unit 501 judges that the reliability of the prosody information 109d is low and increases the weight of the sub-cost corresponding to the part of speech (for example, foreign word). As a result, appropriate speech unit data is selected and more natural synthesized speech can be output. When evaluating the degree of matching of phoneme features, not only exact matches but also matches at the level of phoneme-group features may be considered. This makes it possible to cope flexibly with orthographic variation such as the Japanese spellings 「バ」 (ba) and 「ヴァ」 (va).
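Group-level phoneme matching could look like the sketch below. The group table, the group names, and the intermediate cost of 0.5 are hypothetical; the source does not specify how phoneme groups are defined or scored.

```python
from typing import Dict

# Hypothetical phoneme groups; the source does not define the grouping.
PHONEME_GROUP: Dict[str, str] = {
    "b": "voiced_labial",
    "v": "voiced_labial",
    "p": "voiceless_labial",
}


def phoneme_subcost(target: str, candidate: str) -> float:
    """0.0 for an exact match, 0.5 for a same-group match, 1.0 otherwise."""
    if target == candidate:
        return 0.0
    group = PHONEME_GROUP.get(target)
    if group is not None and group == PHONEME_GROUP.get(candidate):
        return 0.5  # assumed partial credit for a group-level match
    return 1.0
```

Under these assumptions a target "v" and a candidate "b" receive a lower sub-cost than two unrelated phonemes, so a unit recorded for 「バ」 remains a usable candidate for text spelled 「ヴァ」.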

(Modification 3)
A modification of the method for selecting speech unit data in this embodiment is described here.

The speech unit selection unit 108 of the above embodiment selects speech unit data by considering linguistic features and acoustic features simultaneously. The speech unit selection unit according to this modification first selects speech unit data with priority given to linguistic features.

FIG. 15 is a flowchart showing the operation of the speech unit selection unit according to this modification.
First, the speech unit selection unit selects candidate speech unit data from the feature parameter DB 106 (step S100).

Next, from the candidates selected in step S100, the speech unit selection unit further selects the speech unit data whose linguistic features match those of the speech unit indicated by the language information 104d (step S102). The speech unit selection unit then calculates the cost of the selected speech unit data (step S104).

The speech unit selection unit then determines whether the calculated cost is smaller than a threshold (step S106). If it is smaller than the threshold (Y in step S106), the speech unit selection unit notifies the speech synthesis unit 110 of the speech unit data selected in step S102 (step S108). If it is equal to or greater than the threshold (N in step S106), the speech unit selection unit calculates a cost for each candidate selected in step S100, as in the above embodiment (step S110), and notifies the speech synthesis unit 110 of the candidate with the lowest cost, that is, the corresponding speech unit data (step S112).
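Steps S100 to S112 can be sketched as a two-stage selection. The `Unit` record, the use of the foreign word attribute as the only linguistic feature, and the F0-difference cost are simplifications assumed for illustration. The source does not say what happens when several candidates, or none, match in step S102; this sketch takes the lowest-cost match and falls back to the full search when there is no match.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Unit:
    unit_id: str
    is_foreign_word: bool  # the only linguistic feature modeled here
    f0: float              # the only acoustic feature modeled here


def select_unit(candidates: List[Unit], target: Unit,
                cost: Callable[[Unit, Unit], float], threshold: float) -> Unit:
    """Prefer a linguistically matching unit; otherwise minimize cost overall."""
    # S102: keep candidates whose linguistic feature matches the target.
    matched = [c for c in candidates if c.is_foreign_word == target.is_foreign_word]
    if matched:
        # S104: cost of the linguistically matching unit (lowest one, by assumption).
        best = min(matched, key=lambda c: cost(target, c))
        # S106/S108: accept it if the cost is below the threshold.
        if cost(target, best) < threshold:
            return best
    # S110/S112: fall back to the lowest-cost unit among all candidates.
    return min(candidates, key=lambda c: cost(target, c))


def f0_cost(target: Unit, candidate: Unit) -> float:
    """Hypothetical cost: absolute F0 difference in Hz."""
    return abs(target.f0 - candidate.f0)


pool = [Unit("u1", True, 150.0), Unit("u2", False, 121.0)]  # S100 candidates
goal = Unit("target", True, 120.0)
```

With a threshold of 40.0 the linguistically matching unit `u1` is accepted. With a threshold of 10.0 its cost is too high, so the fallback picks `u2`, the acoustically closest candidate.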

(Second Embodiment)
A data creation device that creates the speech unit data used in the first embodiment is described below.

FIG. 16 is a configuration diagram showing the overall configuration of the data creation device according to the second embodiment.
The data creation device creates the speech unit data to be stored in the feature parameter DB 106 of the speech synthesizer, and includes a text storage unit 701, a speech waveform storage unit 702, a speech analysis unit 703, and a language analysis unit 704.

The speech waveform storage unit 702 is a database that stores speech waveform signals representing recorded speech as waveforms. The text storage unit 701 stores transcripts of the recorded speech as text data. That is, the content of a speech waveform signal and the content of the corresponding text data are the same. The phoneme HMM storage unit 705 stores a phoneme HMM created for each phoneme.

The language analysis unit 704 linguistically analyzes the text indicated by the text data held in the text storage unit 701 and thereby extracts linguistic features from the text for each speech unit (for example, each phoneme). Here, the linguistic features include the phonological environment, morpheme information, syntax information, accent phrases, and the like. The language analysis unit 704 stores linguistic feature information indicating these linguistic features in the feature parameter DB 106 of the speech synthesizer and also outputs it to the speech analysis unit 703.

The speech analysis unit 703 acquires the linguistic feature information output from the language analysis unit 704 and acquires the speech waveform signal corresponding to the text from the speech waveform storage unit 702. The speech analysis unit 703 then divides the acquired speech waveform signal into the corresponding phonemes according to the phoneme notation given in the linguistic feature information, using the phoneme HMMs stored in the phoneme HMM storage unit 705 for this division. Further, the speech analysis unit 703 extracts acoustic features for each phoneme from the divided speech waveform signal. Here, the acoustic features include the fundamental frequency, duration, cepstrum, and the like. The acoustic features may also indicate the speaker's emotion at the time of utterance.

The speech analysis unit 703 then generates acoustic feature information indicating these acoustic features and stores it in the feature parameter DB 106 of the speech synthesizer.

The operation of the data creation device of this embodiment is described next, taking as an example the procedure by which the data creation device adds speech unit data for the text 「今日の天気は晴れですよ。」 (“Today's weather is sunny.”) to the feature parameter DB 106.

First, the language analysis unit 704 reads the text data from the text storage unit 701, performs morphological analysis and syntactic analysis on the text indicated by the text data, and analyzes the domain, reading, emotion, and the like. For example, as the analysis result, the language analysis unit 704 generates linguistic feature information with the same kind of content as the language information 104d shown in FIG. 3, and stores it in the feature parameter DB 106.

Next, the speech analysis unit 703 acquires the speech waveform signal corresponding to the text 「今日の天気は晴れですよ。」 from the speech waveform storage unit 702 and acquires the linguistic feature information from the language analysis unit 704. Based on the phoneme notation indicated by the linguistic feature information, the speech analysis unit 703 segments the speech waveform signal into phonemes using the phoneme HMMs of the phoneme HMM storage unit 705. Although the speech unit is a phoneme in this example, the speech unit is not limited to phonemes.

After segmenting the speech waveform signal into phonemes, the speech analysis unit 703 analyzes the fundamental frequency, duration, and power of each phoneme. The analysis method is not particularly limited; any analysis method may be used. The speech analysis unit 703 stores the analysis results in the feature parameter DB 106 as acoustic feature information.

Note that, instead of the language analysis unit 104 analyzing emotion, the speech analysis unit 703 may analyze emotion and attach the result to the acoustic feature information. When the speech waveform signal already includes information indicating emotion, that information may be attached to the linguistic feature information or the acoustic feature information.

By operating as described above, the data creation device creates, in the feature parameter DB 106, vector-form speech unit data such as that shown in FIG. 6 for each phoneme. For the text 「今日の天気は晴れですよ。」, 21 pieces of vector-form speech unit data are created.

In this way, this embodiment makes it easy to build, in the feature parameter DB 106, speech unit data that contains both linguistic feature information and acoustic feature information for each phoneme.
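The way linguistic and acoustic feature information are combined into one record per phoneme can be sketched as follows. The record layouts and the function `create_unit_data` are hypothetical. Segmentation by phoneme HMMs is assumed to have been done already, and fundamental-frequency analysis is omitted for brevity, so only duration and mean power are computed.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class LinguisticFeature:
    phoneme: str
    part_of_speech: str


@dataclass
class SpeechUnitData:
    phoneme: str
    part_of_speech: str
    duration_s: float  # duration of the segment in seconds
    power: float       # mean squared amplitude of the segment


def create_unit_data(linguistic: Sequence[LinguisticFeature],
                     segments: Sequence[Sequence[float]],
                     sample_rate: int) -> List[SpeechUnitData]:
    """Pair each phoneme's linguistic features with its acoustic analysis."""
    if len(linguistic) != len(segments):
        raise ValueError("need exactly one waveform segment per phoneme")
    units: List[SpeechUnitData] = []
    for feat, samples in zip(linguistic, segments):
        duration_s = len(samples) / sample_rate
        power = sum(x * x for x in samples) / len(samples) if samples else 0.0
        units.append(SpeechUnitData(feat.phoneme, feat.part_of_speech, duration_s, power))
    return units


# Illustrative input: the final particle "yo" split into two phoneme segments.
features = [LinguisticFeature("y", "final particle"),
            LinguisticFeature("o", "final particle")]
segments = [[0.5, -0.5, 0.5, -0.5], [1.0, 0.0]]
units = create_unit_data(features, segments, sample_rate=8000)
```

Each element of `units` then carries both kinds of information for one phoneme, which is the property that lets the selection stage compare candidates on linguistic and acoustic features together.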

The present invention has been described above using the first and second embodiments and the modifications, but the present invention is not limited to them.

For example, although Japanese text is converted into speech in the first and second embodiments, the present invention may also convert text in other languages into speech, and is particularly effective for text in languages that have foreign words or final particles.

Also, although phonemes are treated as the speech units in the first and second embodiments, other units may be treated as speech units.

The speech synthesizer of the present invention has the effect of being able to output synthesized speech that is natural and faithful to the text data, and is useful as, for example, a reading device in car navigation systems and in the entertainment field.

FIG. 1 is a configuration diagram showing the configuration of the speech synthesizer according to the first embodiment of the present invention.
FIG. 2 is a configuration diagram showing an example of the internal configuration of the language analysis unit of the same embodiment.
FIG. 3 is a diagram showing an example of the content of the language information of the same embodiment.
FIG. 4 is a diagram showing the content of the acoustic feature information of the same embodiment.
FIG. 5 is a diagram showing the content of the linguistic feature information of the same embodiment.
FIG. 6 is a diagram showing the content of one piece of speech unit data stored in the feature parameter DB of the same embodiment.
FIG. 7 is a configuration diagram showing a specific configuration example of the speech unit selection unit of the same embodiment.
FIG. 8 is a diagram showing a target vector, candidates, and target cost vectors of the same embodiment.
FIG. 9 is a diagram showing the content of language information generated from text data indicating a foreign word in the same embodiment.
FIG. 10(a) is an explanatory diagram for explaining the phoneme for which speech unit data is to be selected, and FIG. 10(b) is a diagram showing the target vector and candidates corresponding to the phoneme “u”.
FIG. 11(a) is an explanatory diagram for explaining the phoneme for which speech unit data is to be selected, and FIG. 11(b) is a diagram showing the target vector and candidates corresponding to the phoneme “e”.
FIG. 12(a) is a diagram showing the analysis result for the phoneme 「よ」 (yo) of the adverb 「よく」 (fully), FIG. 12(b) is a diagram showing the analysis result for the phoneme 「よ」 of the common noun 「よる」 (night), FIG. 12(c) is a diagram showing the analysis result for the phoneme 「よ」 that is a final particle in one sentence, and FIG. 12(d) is a diagram showing the analysis result for the phoneme 「よ」 that is a final particle in another sentence.
FIG. 13 is a configuration diagram showing the configuration of the speech synthesizer according to Modification 1 of the first embodiment.
FIG. 14 is a configuration diagram showing the internal configuration of the cost calculation unit according to Modification 2 of the first embodiment.
FIG. 15 is a flowchart showing the operation of the speech unit selection unit according to Modification 3 of the first embodiment.
FIG. 16 is a configuration diagram showing the overall configuration of the data creation device according to the second embodiment of the present invention.
FIG. 17 is a configuration diagram showing the configuration of a conventional speech synthesizer.

Explanation of symbols

100t text data
104 language analysis unit
104d language information
106 feature parameter DB
108 speech unit selection unit
109 prosody estimation unit
109d prosody information
110 speech synthesis unit
111 speaker

Claims

A speech synthesizer that acquires text data and converts text indicated by the text data into speech, comprising:
storage means for storing, for each speech unit, speech unit data indicating a foreign word attribute as to whether or not the speech unit belongs to a foreign word and an acoustic feature of the speech unit;
feature estimation means for acquiring the text data and estimating, for each of a plurality of speech units representing the text of the text data, the foreign word attribute and acoustic feature of the speech unit;
selection means for selecting from the storage means, for each speech unit representing the text, speech unit data indicating contents similar to the foreign word attribute and acoustic feature of the speech unit estimated by the feature estimation means; and
speech output means for generating and outputting synthesized speech using the plurality of pieces of speech unit data selected by the selection means.

The speech synthesizer according to claim 1, wherein, when the feature estimation means estimates, as the foreign word attribute of a speech unit, that the speech unit belongs to a foreign word, the selection means preferentially selects speech unit data whose foreign word attribute indicates that it belongs to a foreign word.

The speech unit data further indicates a final particle attribute as to whether the speech unit belongs to a final particle;
The feature estimation means estimates, for each of a plurality of speech units representing the text of the text data, the foreign word attribute, acoustic feature, and final particle attribute of the speech unit,
The selecting means selects, from the storage means, speech unit data showing contents similar to the foreign word attribute, acoustic feature, and final particle attribute of the speech unit estimated by the feature estimating means. The speech synthesizer according to claim 1.

The speech synthesizer according to claim 3, wherein, when the feature estimation means estimates, as the final particle attribute of a speech unit, that the speech unit belongs to a final particle, the selection means preferentially selects speech unit data whose final particle attribute indicates that it belongs to a final particle.

The speech synthesizer according to claim 3, wherein the acoustic feature indicates at least one of a duration length of a speech unit, a fundamental frequency, and power.

The speech synthesizer according to claim 5, wherein
the speech unit data further indicates the phonological environment to which the speech unit belongs, syntax information about the syntax of the speech unit, and accent phrase information about the accent phrase of the speech unit,
the feature estimation means estimates, for each of a plurality of speech units representing the text of the text data, the foreign word attribute, acoustic feature, final particle attribute, phonological environment, syntax information, and accent phrase information of the speech unit, and
the selection means selects, from the storage means, speech unit data indicating contents similar to the foreign word attribute, acoustic feature, final particle attribute, phonological environment, syntax information, and accent phrase information of the speech unit estimated by the feature estimation means.

The speech synthesizer according to claim 1, wherein the selection means includes:
first derivation means for deriving a first sub-cost by quantitatively evaluating the similarity between the foreign word attribute of the speech unit estimated by the feature estimation means and the foreign word attribute of the speech unit data stored in the storage means;
second derivation means for deriving a second sub-cost by quantitatively evaluating the similarity between the acoustic feature of the speech unit estimated by the feature estimation means and the acoustic feature of the speech unit data stored in the storage means;
cost deriving means for deriving a cost using the first and second sub-costs derived by the first and second derivation means; and
data selection means for selecting speech unit data from the storage means based on the cost derived by the cost deriving means.

The speech synthesizer according to claim 7, wherein the cost deriving means derives the cost by weighting and summing the first and second sub-costs derived by the first and second derivation means.

The speech synthesizer according to claim 8, further comprising weight determining means for identifying the reliability of the acoustic feature estimated by the feature estimation means and determining a weight for each of the first and second sub-costs according to the reliability,
wherein the cost deriving means applies the weights determined by the weight determining means to the first and second sub-costs.

The speech synthesizer according to claim 9, wherein, when the reliability of the acoustic feature is low, the weight determining means determines the weights for the first and second sub-costs so that the similarity of the foreign word attribute contributes more to the selection of speech unit data by the data selection means than the similarity of the acoustic feature.

The speech synthesizer according to claim 10, wherein
the selection means further includes third derivation means for deriving a connection cost by quantitatively evaluating the acoustic distortion that arises when a plurality of pieces of speech unit data stored in the storage means are connected, and
the cost deriving means derives the cost using the first and second sub-costs derived by the first and second derivation means and the connection cost derived by the third derivation means.

A speech synthesis method for acquiring text data and converting text indicated by the text data into speech using data stored in storage means,
wherein the storage means stores, for each speech unit, speech unit data indicating a foreign word attribute as to whether or not the speech unit belongs to a foreign word and an acoustic feature of the speech unit,
the speech synthesis method comprising:
a feature estimation step of acquiring the text data and estimating, for each of a plurality of speech units representing the text of the text data, the foreign word attribute and acoustic feature of the speech unit;
a selection step of selecting from the storage means, for each speech unit representing the text, speech unit data indicating contents similar to the foreign word attribute and acoustic feature of the speech unit estimated in the feature estimation step; and
a speech output step of generating and outputting synthesized speech using the plurality of pieces of speech unit data selected in the selection step.

The speech unit data further indicates a final particle attribute as to whether the speech unit belongs to a final particle;
In the feature estimation step, for each of a plurality of speech units representing the text of the text data, the foreign word attribute, acoustic feature, and final particle attribute of the speech unit are estimated,
In the selection step, speech unit data showing content similar to the foreign word attribute, acoustic feature, and final particle attribute of the speech unit estimated in the feature estimation step is selected from the storage unit. The speech synthesis method according to claim 12.

A program for acquiring text data and converting text indicated by the text data into speech using data stored in storage means,
wherein the storage means stores, for each speech unit, speech unit data indicating a foreign word attribute as to whether or not the speech unit belongs to a foreign word and an acoustic feature of the speech unit,
the program causing a computer to execute:
a feature estimation step of acquiring the text data and estimating, for each of a plurality of speech units representing the text of the text data, the foreign word attribute and acoustic feature of the speech unit;
a selection step of selecting from the storage means, for each speech unit representing the text, speech unit data indicating contents similar to the foreign word attribute and acoustic feature of the speech unit estimated in the feature estimation step; and
a speech output step of generating and outputting synthesized speech using the plurality of pieces of speech unit data selected in the selection step.

A data creation device for creating speech unit data used for speech synthesis, comprising:
speech storage means for storing a speech waveform signal indicating speech as a waveform;
text storage means for storing text data indicating text corresponding to the speech of the speech waveform signal;
language analysis means for acquiring the text data from the text storage means, dividing the text of the text data into speech units, and analyzing a foreign word attribute as to whether or not each speech unit belongs to a foreign word;
acoustic analysis means for acquiring the speech waveform signal from the speech storage means, dividing the speech indicated by the speech waveform signal into speech units, and analyzing an acoustic feature of each speech unit; and
creation means for creating, for each speech unit, the speech unit data so as to indicate the foreign word attribute analyzed by the language analysis means and the acoustic feature analyzed by the acoustic analysis means, and storing the speech unit data in storage means.

The language analysis means further analyzes a final particle attribute as to whether or not each speech unit belongs to a final particle,
The creation means creates the speech unit data for each speech unit so as to indicate the foreign word attribute and final particle attribute analyzed by the language analysis means and the acoustic features analyzed by the acoustic analysis means. The data creation device according to claim 15.

The data creation apparatus according to claim 16, wherein the acoustic feature indicates at least one of a duration length of a voice unit, a fundamental frequency, and power.

A data creation method for creating speech unit data used for speech synthesis using data stored in storage means,
wherein the storage means stores a speech waveform signal indicating speech as a waveform, and text data indicating text corresponding to the speech of the speech waveform signal,
the data creation method comprising:
a language analysis step of acquiring the text data from the storage means, dividing the text of the text data into speech units, and analyzing a foreign word attribute as to whether or not each speech unit belongs to a foreign word;
an acoustic analysis step of acquiring the speech waveform signal from the storage means, dividing the speech indicated by the speech waveform signal into speech units, and analyzing an acoustic feature of each speech unit; and
a creation step of creating, for each speech unit, the speech unit data so as to indicate the foreign word attribute analyzed in the language analysis step and the acoustic feature analyzed in the acoustic analysis step, and storing the speech unit data in storage means.