JP4751230B2

JP4751230B2 - Prosodic segment dictionary creation method, speech synthesizer, and program

Info

Publication number: JP4751230B2
Application number: JP2006115885A
Authority: JP
Inventors: 典夫澁澤; 智洋谷
Original assignee: Asahi Kasei Corp
Current assignee: Asahi Kasei Corp
Priority date: 2006-04-19
Filing date: 2006-04-19
Publication date: 2011-08-17
Anticipated expiration: 2026-04-19
Also published as: JP2007286507A

Abstract

PROBLEM TO BE SOLVED: To provide a method and a program for creating a prosodic segment dictionary, together with a speech synthesizer, a speech synthesis program, and a speech synthesis method, with which a phrase that is likely to be used repeatedly, such as a fixed-form phrase, can be concatenated with other phrases while keeping a natural intonation.

SOLUTION: The speech synthesizer 100 includes a standard prosody dictionary 11, which holds pitch frequency information per speech unit, and a prosodic segment dictionary 12, which holds pitch frequency information for each predetermined phrase together with prosodic feature information describing the prosodic features of the predetermined prosodic units that precede and follow that phrase in the source speech data. In a prosody information sequence generated from the standard prosody dictionary 11, the part that matches a predetermined phrase is replaced with the pitch frequency information for that phrase from the prosodic segment dictionary 12, and the standard pitch frequencies of the predetermined prosodic units before and after it are adjusted on the basis of the prosodic feature information.

COPYRIGHT: (C)2008, JPO&INPIT

Description

The present invention relates to a prosodic segment dictionary creation method and a prosodic segment dictionary creation program used in speech synthesis, in which synthesized speech waveform data is generated from sentence text obtained by analyzing a text sentence, and to a speech synthesizer, a speech synthesis program, and a speech synthesis method.

Conventionally, the speech synthesizers described in Patent Document 1 and Patent Document 2 are known as techniques for synthesizing speech from sentence text generated from text data such as sentences.
The speech synthesizer of Patent Document 1 registers in a language dictionary the phrases (text) that should be uttered naturally. The fundamental frequency pattern (hereinafter, F0 pattern) extracted when the whole registered text is uttered naturally is stored in a prosody dictionary, separately from the F0 patterns for ordinary text. When the input text is registered in the language dictionary, prosodic information is generated by selecting an F0 pattern from the F0 pattern group for registered text; when it is not registered, prosodic information is generated by selecting an F0 pattern from the F0 pattern group for ordinary text. In this way, synthesized speech close to natural utterance can be output for registered phrases.

The speech synthesizer of Patent Document 2 handles the case where only part of a sentence can be changed by rule-based synthesis and the remaining parts are synthesized from synthesis parameters or speech waveform data created by analysis. During rule-based synthesis it takes the surrounding sentence environment into account, such as the text of the fixed part adjacent to the variable part, and then extracts and uses only the portion that serves as the rule-synthesized part. This improves the prosodic continuity between the rule-synthesized part and the analyzed part, so that synthesis with good naturalness becomes possible.
JP 2004-198917 A
JP-A-9-188515

However, in the speech synthesizer described in Patent Document 1, the F0 pattern of natural utterance is reflected only in the phrase part of the input text that matched. The naturalness of the matched part can therefore be reproduced, but the prosody at the boundary where the F0 pattern for ordinary text meets the F0 pattern for natural utterance, and the prosody of the sentence as a whole, may not be well balanced, so that naturalness may not be reproduced.

In the speech synthesizer described in Patent Document 2, one phrase is extracted from synthesized speech waveform data that contains two phrases generated by rule-based synthesis, and that phrase is combined with a phrase of analyzed speech. If the speech segments at the joining ends of the two phrases are such that, for example, the phonemic environment on the analyzed-speech side and the phonemic environment on the extracted rule-synthesized side do not satisfy a predetermined condition, the two phrases may not connect smoothly after joining, and naturalness may be impaired.

The present invention was made in view of these unsolved problems of the conventional techniques. Its object is to provide a prosodic segment dictionary creation method and a prosodic segment dictionary creation program, as well as a speech synthesizer, a speech synthesis program, and a speech synthesis method, that are suitable for joining a phrase likely to be used repeatedly, such as a fixed phrase, to other phrases with more natural intonation.

To achieve the above object, the prosodic segment dictionary creation method according to claim 1 of the present invention is a method for creating a prosody dictionary used for speech synthesis, characterized by including:
a first prosodic information extraction step of extracting, from speech waveform data corresponding to an utterance sentence uttered by a predetermined speaker, first prosodic information, which is the prosodic information of the speech waveform data part of a predetermined phrase contained in that speech waveform data;
a second prosodic information extraction step of extracting second prosodic information, which is the prosodic information of at least one of the speech waveform data parts corresponding to the predetermined prosodic units that precede and follow the speech waveform data part of the predetermined phrase;
a third prosodic information generation step of generating third prosodic information, which is prosodic information corresponding to the predetermined prosodic unit, by referring to a standard prosody dictionary composed of standard prosodic information, which is prosodic information for each speech unit;
a prosodic feature information generation step of generating, on the basis of the second prosodic information extracted in the second prosodic information extraction step and the third prosodic information generated in the third prosodic information generation step, prosodic feature information indicating the characteristic difference between the second prosodic information and the third prosodic information; and
a prosodic segment dictionary creation step of creating a prosodic segment dictionary on the basis of the first prosodic information for each predetermined phrase extracted in the first prosodic information extraction step and the prosodic feature information generated in the prosodic feature information generation step.
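The five steps of claim 1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: prosodic information is reduced to a plain list of pitch values in Hz, the waveform analysis is assumed to have been done already, and the "characteristic difference" is supplied by the caller as `feature_fn`. The names `SegmentEntry`, `build_entry`, and `mean_ratio` are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Sequence


@dataclass
class SegmentEntry:
    """One prosodic segment dictionary entry (hypothetical layout)."""
    phrase: str
    phrase_f0: list[float]  # first prosodic information, in Hz
    feature: float          # prosodic feature information


def mean_ratio(natural: Sequence[float], standard: Sequence[float]) -> float:
    """One possible characteristic difference: ratio of mean pitch."""
    return mean(natural) / mean(standard)


def build_entry(
    phrase: str,
    phrase_f0: Sequence[float],            # step 1: phrase part of the recording
    context_f0: Sequence[float],           # step 2: neighbouring unit in the recording
    standard_context_f0: Sequence[float],  # step 3: same unit from the standard dictionary
    feature_fn: Callable[[Sequence[float], Sequence[float]], float],
) -> SegmentEntry:
    feature = feature_fn(context_f0, standard_context_f0)  # step 4
    return SegmentEntry(phrase, list(phrase_f0), feature)  # step 5


# Illustrative numbers only; they are not taken from the patent.
entry = build_entry(
    "left-turn phrase",
    [180.0, 200.0, 160.0],
    [220.0, 240.0],
    [200.0, 200.0],
    mean_ratio,
)
```

Passing the feature function as a parameter keeps the sketch neutral about which characteristic difference is used; the dependent claims name specific choices.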

That is, if the prosodic information is, for example, pitch frequency information, the prosodic segment dictionary is created from (a) the pitch frequency information of a predetermined phrase in the speech waveform data corresponding to an utterance sentence uttered by a predetermined speaker, and (b) prosodic feature information indicating the characteristic difference between the third prosodic information and the pitch frequency information of at least one of the predetermined prosodic units that precede and follow that phrase in the same speech waveform data.

Accordingly, given for example a prosodic information sequence generated from standard prosodic information alone (pitch frequency information per speech unit), the part of the sequence corresponding to the predetermined phrase can be replaced with the pitch frequency information held for the same phrase in the prosodic segment dictionary created as above. In addition, the parts of the sequence that correspond to the predetermined prosodic units preceding and following the replaced phrase, which were generated from standard prosodic information, can be adjusted to fit the replaced part on the basis of the prosodic feature information associated with the replaced phrase. By generating synthesized speech waveform data from a prosodic information sequence adjusted in this way, a sentence containing the predetermined phrase and the phrases before and after it is uttered (reproduced) with prosody (intonation, speech rate, volume, and so on) closer to natural speech. Here, a prosodic unit is a unit of prosodic pattern generation; various units can be used, such as a phoneme, a mora, an accent phrase, a unit obtained by dividing one accent phrase into several sections, a group of several accent phrases, or a breath group.

The extent of the sentence that is uttered (reproduced) with prosody closer to natural speech depends on the range covered by the prosodic feature information and on the range of the adjustment. For example, if the range is the one breath group preceding or following the phrase of the first prosodic information, the prosody of that breath group and of the adjoining phrase of the first prosodic information can be brought closer to natural speech. Moreover, the prosodic units need not be contiguous with the phrase of the first prosodic information: prosodic feature information may be generated for prosodic units at positions that have a large prosodic influence on the phrase, and the parts of the prosodic information sequence generated from standard prosodic information for those units may be adjusted.

Here, the first, second, and third prosodic information each consist of information related to prosody, such as pitch frequency, volume (power), and phoneme duration (speech rate). Pitch frequency is the reciprocal of the period of the periodic portion of speech and represents the pitch of the voice. Volume indicates how strongly a speech unit is uttered; in an isolated word utterance, for example, the beginning of the word is generally uttered strongly and the volume falls toward the end. Phoneme duration indicates, for example, the length (time) for which a syllable is uttered; a word-final syllable, for example, is uttered somewhat longer than a word-internal one. In addition, short words tend to be uttered slowly (each syllable is long) and long words quickly (each syllable is short). The same applies hereinafter to the prosodic segment dictionary creation program according to claim 5, the speech synthesizers according to claims 6 to 9, the speech synthesis programs according to claims 11 to 14, and the speech synthesis methods according to claims 15 to 18.

The first prosodic information consists of prosodic information for each predetermined phrase in the speech waveform data and is not modeled by an HMM (hidden Markov model) or the like. The standard prosodic information consists of prosodic information for each item of speech unit data in the speech waveform data, and may include information modeled by an HMM or the like. The same applies hereinafter to the prosodic segment dictionary creation program according to claim 5, the speech synthesizers according to claims 6 to 9, the speech synthesis programs according to claims 11 to 14, and the speech synthesis methods according to claims 15 to 18.

The standard prosody dictionary is composed of the standard prosodic information described above, that is, pitch frequency information for each speech unit. By combining the standard prosodic information of several speech units, a prosodic information sequence can be generated for any sentence. The standard prosodic information consists, for example, of a phoneme label and the mean and variance of the pitch frequency at one or more representative points within the speech unit corresponding to that label. The same applies hereinafter to the prosodic segment dictionary creation program according to claim 5, the speech synthesizers according to claims 6 to 9, the speech synthesis programs according to claims 11 to 14, and the speech synthesis methods according to claims 15 to 18.
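A standard prosody dictionary entry of the kind described here (a phoneme label plus the mean and variance of pitch at representative points) might be laid out as below. This is only a sketch: the class name `StandardProsody`, the function `generate_series`, the labels, and all numeric values are assumptions made for illustration, and the series is built simply by concatenating the mean values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StandardProsody:
    label: str                  # phoneme label of the speech unit
    f0_mean: tuple[float, ...]  # mean pitch (Hz) at each representative point
    f0_var: tuple[float, ...]   # pitch variance at each representative point


# Illustrative values only; they are not taken from the patent.
STANDARD_DICT = {
    "a": StandardProsody("a", (200.0, 210.0), (25.0, 30.0)),
    "me": StandardProsody("me", (190.0, 180.0), (20.0, 22.0)),
}


def generate_series(labels, dictionary):
    """Concatenate the mean pitch of each speech unit into one series."""
    series = []
    for label in labels:
        series.extend(dictionary[label].f0_mean)
    return series


series = generate_series(["a", "me"], STANDARD_DICT)
```

The variance values are stored but unused in this sketch; a real generator would presumably use them, for example when smoothing across unit boundaries.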

The second prosodic information consists of prosodic information for each predetermined prosodic unit in the speech waveform data, that is, prosodic information for one or more items of speech unit data. The same applies hereinafter to the prosodic segment dictionary creation program according to claim 5, the speech synthesizers according to claims 6 to 9, the speech synthesis programs according to claims 11 to 14, and the speech synthesis methods according to claims 15 to 18.

A speech unit can be defined as a predetermined unit built from, for example, phonemes or morae. That is, a phoneme or mora may itself be a speech unit, or phonemes and morae may be combined into a word such as "わたし" (watashi, "I") or into a phrase such as "わたしは" (watashi wa) or "わたしが" (watashi ga) and used as a speech unit. A phoneme is the smallest unit of sound in a given language that distinguishes meaning; a sound is recognized as a phoneme when it is distinctive from the other sounds of that language. A mora is a combination of phonemes equal in length to one consonant phoneme plus one short vowel phoneme. In Japanese, a mora corresponds to a kana character and differs slightly from a syllable. It is the unit counted as 5, 7, 5, 7, 7 in haiku and waka; a long vowel "ー", a geminate consonant "ッ" (sokuon), and a moraic nasal "ン" (hatsuon) each count as one mora. The same applies hereinafter to the prosodic segment dictionary creation program according to claim 5, the speech synthesizers according to claims 6 to 9, the speech synthesis programs according to claims 11 to 14, and the speech synthesis methods according to claims 15 to 18.
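The mora counting rule can be illustrated with a short sketch. It assumes the input is written entirely in kana, and it adds one detail that the text above does not spell out: small kana such as "ゃ", "ゅ", "ょ" combine with the preceding character into a single mora. The function name `count_morae` is hypothetical.

```python
# Small kana that attach to the preceding kana instead of forming their own mora.
SMALL_KANA = set("ゃゅょぁぃぅぇぉゎャュョァィゥェォヮ")


def count_morae(kana: str) -> int:
    """Count morae in a kana string.

    The long-vowel mark, the small tsu (geminate), and the moraic nasal
    are ordinary characters here, so each counts as one mora.
    """
    return sum(1 for ch in kana if ch not in SMALL_KANA)


examples = {word: count_morae(word) for word in ["がっこう", "きょう", "コーヒー", "にっぽん"]}
```

For instance, "きょう" is two morae ("きょ" and "う") even though it is a single syllable, which is the difference between mora and syllable mentioned above.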

The predetermined prosodic unit includes an accent phrase, a breath group, and the like. For example, when a predetermined phrase is contained in a breath group, the parts of that breath group preceding and following the phrase can be predetermined prosodic units. A breath group preceding the predetermined phrase, or an accent phrase preceding it, can also be a predetermined prosodic unit. A breath group is, for example, the stretch of text on either side of a Japanese comma (読点). In "明日の天気は、晴れでしょう。" ("As for tomorrow's weather, it will be sunny."), the breath groups are "明日の天気は" and "晴れでしょう". Accent phrases are described later. The same applies hereinafter to the prosodic segment dictionary creation program according to claim 5, the speech synthesizers according to claims 6 to 9, the speech synthesis programs according to claims 11 to 14, and the speech synthesis methods according to claims 15 to 18.
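The breath group example can be reproduced with a very small sketch that splits a sentence at the Japanese comma and full stop. This is a simplification introduced for illustration: the text only says that a breath group is "for example" delimited by a comma, and real breath group boundaries need not coincide with punctuation. The function name `split_breath_groups` is hypothetical.

```python
import re


def split_breath_groups(sentence: str) -> list[str]:
    """Split at the Japanese comma and full stop, dropping empty pieces."""
    return [part for part in re.split(r"[、。]", sentence) if part]


groups = split_breath_groups("明日の天気は、晴れでしょう。")
```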

The prosodic segment dictionary creation method according to claim 2 of the present invention is the method according to claim 1, characterized in that the predetermined phrase includes a phrase with a relatively high frequency of appearance.
For example, the first and second prosodic information are extracted and the prosodic feature information is generated for phrases that are used relatively often (that appear relatively frequently), such as "〜を左折です" ("turn left at ...") and "〜を右折です" ("turn right at ..."), which are common in car navigation systems, and fixed expressions such as "おはようございます" ("good morning"), "こんにちは" ("hello"), "〜でございます" ("this is ..."), and "〜して下さい" ("please ..."). The prosodic segment dictionary is then generated from this first prosodic information and prosodic feature information.

Accordingly, by using such a prosodic segment dictionary to adjust a prosodic information sequence generated from the standard prosody dictionary alone, the part of the sequence corresponding to a phrase that is likely to appear in a sentence can be replaced with a part that reflects the prosody uttered by the predetermined speaker, and the parts corresponding to the predetermined prosodic units before and after it can be adjusted appropriately. As a result, synthesized speech waveform data can be generated in which a sentence containing the replaced phrase and the phrases before and after it is uttered (reproduced) with prosody (intonation, speech rate, volume, and so on) closer to natural speech.

The prosodic segment dictionary creation method according to claim 3 of the present invention is the method according to claim 1 or claim 2, characterized in that:
the second prosodic information and the third prosodic information include pitch frequency information; and
the prosodic feature information includes information indicating the ratio between the pitch frequency indicated by the second prosodic information and the pitch frequency indicated by the third prosodic information.
That is, the prosodic feature information is information indicating the ratio between the pitch frequency of a predetermined prosodic unit extracted from the speech waveform data of a naturally uttered sentence (second prosodic information) and the corresponding standard pitch frequency: for example, the ratio of the two pitch frequencies itself or, when there are several predetermined prosodic units, the ratio of the respective average values of the two pitch frequencies.
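The two variants mentioned here, the ratio itself and the ratio of averages, can be written out as a short sketch. The function names and the sample values are hypothetical, and pitch is assumed to be given in Hz.

```python
from statistics import mean
from typing import Sequence


def pitch_ratio(natural_hz: float, standard_hz: float) -> float:
    """Ratio of one natural pitch value to the corresponding standard one."""
    return natural_hz / standard_hz


def pitch_ratio_of_means(natural_hz: Sequence[float],
                         standard_hz: Sequence[float]) -> float:
    """Ratio of the average natural pitch to the average standard pitch,
    for the case where several prosodic units are involved."""
    return mean(natural_hz) / mean(standard_hz)


single = pitch_ratio(220.0, 200.0)
averaged = pitch_ratio_of_means([220.0, 240.0], [200.0, 200.0])
```

A ratio above 1 means the speaker's pitch around the phrase was higher than the standard dictionary predicts; a ratio below 1 means it was lower.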

Accordingly, when such a prosodic segment dictionary is used to replace the part of a prosodic information sequence, generated from the standard prosody dictionary alone, that corresponds to a predetermined phrase in the dictionary, and the parts corresponding to the predetermined prosodic units before and after it are then adjusted, the adjustment can be made by changing the standard pitch frequency in those parts to a frequency determined by the ratio that the prosodic feature information indicates. The adjustment to fit the prosody of the replaced part can therefore be carried out simply.
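One simple reading of "a frequency determined by the ratio" is to multiply each standard pitch value by the stored ratio. The sketch below assumes exactly that; the patent text does not fix the formula, and the function name `scale_pitch` and the values are hypothetical.

```python
from typing import Sequence


def scale_pitch(standard_hz: Sequence[float], ratio: float) -> list[float]:
    """Multiply every standard pitch value by the stored pitch ratio."""
    return [f * ratio for f in standard_hz]


adjusted = scale_pitch([200.0, 210.0, 190.0], 1.25)
```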

Furthermore, the invention according to claim 4 is the prosodic segment dictionary creation method according to any one of claims 1 to 3, characterized in that:
the second prosodic information and the third prosodic information include pitch frequency information; and
the prosodic feature information includes information indicating the ratio between the magnitude of intonation obtained from the difference between high and low pitch frequencies indicated by the second prosodic information and the magnitude of intonation obtained from the difference between high and low pitch frequencies indicated by the third prosodic information.

That is, the prosodic feature information is information indicating the ratio between the magnitude of intonation obtained from the high-low difference of the pitch frequency of a predetermined prosodic unit extracted from the speech waveform data of a naturally uttered sentence (second prosodic information) and the magnitude of intonation obtained from the high-low difference of the corresponding standard pitch frequency (third prosodic information). For example, the difference between the maximum and minimum pitch frequency is taken as the magnitude of intonation, and the ratio of the two magnitudes itself is used as the prosodic feature information.
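Taking the magnitude of intonation as maximum minus minimum pitch, as in the example above, the ratio can be computed as follows. The function names and sample values are hypothetical; the guard against a flat standard contour is an addition of this sketch.

```python
from typing import Sequence


def intonation_size(f0_hz: Sequence[float]) -> float:
    """Magnitude of intonation: highest pitch minus lowest pitch."""
    return max(f0_hz) - min(f0_hz)


def intonation_ratio(natural_hz: Sequence[float],
                     standard_hz: Sequence[float]) -> float:
    """Ratio of the natural intonation magnitude to the standard one."""
    standard_size = intonation_size(standard_hz)
    if standard_size == 0:
        raise ValueError("standard contour is flat; ratio is undefined")
    return intonation_size(natural_hz) / standard_size


ratio = intonation_ratio([150.0, 230.0, 170.0], [180.0, 220.0, 190.0])
```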

Accordingly, when such a prosodic segment dictionary is used to replace the part of a prosodic information sequence, generated from the standard prosody dictionary alone, that corresponds to a predetermined phrase in the dictionary, and the parts corresponding to the predetermined prosodic units before and after it are then adjusted, the adjustment can be made, for example, by changing the pitch frequencies that determine the magnitude of intonation in those parts to frequencies determined by the ratio that the prosodic feature information indicates. The adjustment to fit the prosody of the replaced part can therefore be carried out simply.
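One way to adjust a contour so that its intonation magnitude changes by a given ratio is to scale each value's deviation from the contour mean. That choice of pivot is an assumption of this sketch; the text does not say around which value the range is stretched. The function name `scale_intonation` is hypothetical.

```python
from statistics import mean
from typing import Sequence


def scale_intonation(standard_hz: Sequence[float], ratio: float) -> list[float]:
    """Stretch or compress the pitch range around the contour mean.

    The mean pitch is unchanged, while (max - min) is multiplied by `ratio`.
    """
    center = mean(standard_hz)
    return [center + (f - center) * ratio for f in standard_hz]


adjusted = scale_intonation([180.0, 220.0, 200.0], 2.0)
```

Scaling around the mean keeps the overall pitch level of the neighbouring unit intact, so this adjustment can be combined with a separate pitch-level ratio if both are stored.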

To achieve the above object, the prosodic segment dictionary creation program according to claim 5 of the present invention is a program for creating a prosodic segment dictionary used for speech synthesis, characterized by including a program that causes a computer to execute processing consisting of:
a first prosodic information extraction step of extracting, from speech waveform data corresponding to an utterance sentence uttered by a predetermined speaker, first prosodic information, which is the prosodic information of the speech waveform data part of a predetermined phrase contained in that speech waveform data;
a second prosodic information extraction step of extracting second prosodic information, which is the prosodic information of at least one of the speech waveform data parts corresponding to the predetermined prosodic units that precede and follow the speech waveform data part of the predetermined phrase;
a third prosodic information generation step of generating third prosodic information, which is prosodic information corresponding to the predetermined prosodic unit, by referring to a standard prosody dictionary composed of standard prosodic information, which is prosodic information for each speech unit;
a prosodic feature information generation step of generating, on the basis of the second prosodic information extracted in the second prosodic information extraction step and the third prosodic information generated in the third prosodic information generation step, prosodic feature information indicating the characteristic difference between the second prosodic information and the third prosodic information; and
a prosodic segment dictionary creation step of creating a prosodic segment dictionary on the basis of the first prosodic information for each predetermined phrase extracted in the first prosodic information extraction step and the prosodic feature information generated in the prosodic feature information generation step.
With this configuration, when the program is read by a computer and the computer executes processing according to it, the same effect as the prosodic segment dictionary creation method according to claim 1 is obtained.

また、上記目的を達成するために、本発明に係る請求項６記載の音声合成装置は、
文章テキストに対応した合成音声波形データを生成する音声合成装置であって、
請求項１乃至請求項４のいずれか１項に記載の韻律素片辞書作成方法又は請求項５記載の韻律素片辞書作成プログラムによって作成された、前記第１韻律情報及び前記韻律特徴情報を含んで成る韻律素片辞書と、
音声単位毎の韻律情報である標準韻律情報から構成される標準韻律辞書と、
音声素片毎のスペクトル情報と前記音声素片毎の励振源情報とを含んで成る素片辞書と、
前記文章テキストに対してアクセント解析及び形態素解析を行うテキスト解析手段と、
前記テキスト解析手段の解析結果と、前記標準韻律辞書の有する標準韻律情報とに基づき、前記文章テキストに対応する韻律情報系列を生成する韻律情報系列生成手段と、
前記文章テキストに対応するフレーズの中に、前記韻律素片辞書の有する前記第１韻律情報に対応したフレーズが含まれているときに、当該フレーズの前記韻律情報系列部分を、前記第１韻律情報に基づき生成された韻律情報系列部分に変更する変更手段と、
前記変更手段で変更されたフレーズ部分に先行及び後続する所定の韻律単位の少なくとも一方の韻律単位に対応する韻律情報系列部分に対して、前記変更された韻律情報系列部分に対応する前記韻律素片辞書の有する前記韻律特徴情報に基づき所定の調整処理を行う韻律情報調整手段と、
前記韻律情報系列と、前記素片辞書の有する前記スペクトル情報及び前記励振源情報とに基づき、前記文章テキストに対応した合成音声波形データを生成する音声波形生成手段と、を備えることを特徴としている。 In order to achieve the above object, a speech synthesizer according to claim 6 according to the present invention comprises:
A speech synthesizer that generates synthesized speech waveform data corresponding to sentence text, comprising:
A prosodic segment dictionary that is created by the prosodic segment dictionary creation method according to any one of claims 1 to 4 or by the prosodic segment dictionary creation program according to claim 5, and that includes the first prosodic information and the prosodic feature information;
A standard prosodic dictionary composed of standard prosodic information, which is prosodic information for each speech unit;
A segment dictionary including spectral information for each speech segment and excitation source information for each speech segment;
Text analysis means for performing accent analysis and morphological analysis on the sentence text;
Prosodic information sequence generation means for generating a prosodic information sequence corresponding to the sentence text, based on the analysis result of the text analysis means and the standard prosodic information of the standard prosodic dictionary;
Changing means for, when the phrases corresponding to the sentence text include a phrase corresponding to the first prosodic information of the prosodic segment dictionary, changing the prosodic information sequence portion of that phrase to a prosodic information sequence portion generated based on the first prosodic information;
Prosodic information adjusting means for performing predetermined adjustment processing on the prosodic information sequence portion corresponding to at least one of the predetermined prosodic units preceding and following the phrase portion changed by the changing means, based on the prosodic feature information that the prosodic segment dictionary holds for the changed prosodic information sequence portion; and
Speech waveform generation means for generating synthesized speech waveform data corresponding to the sentence text, based on the prosodic information sequence and on the spectral information and the excitation source information of the segment dictionary.
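The changing means and the prosodic information adjusting means can be pictured with a minimal Python sketch. Everything concrete here is an assumption made for illustration: the text does not specify the form of the prosodic feature information or the "predetermined adjustment processing", so the sketch supposes that the feature information is the mean pitch of the units that surrounded the phrase in the original recording, and that the adjustment rescales the neighboring standard pitch contours to those means. The names `PhraseEntry` and `apply_phrase_prosody` are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List, Tuple


@dataclass
class PhraseEntry:
    """One hypothetical entry of the prosodic segment dictionary."""
    pitch: List[float]       # first prosodic information: phrase pitch contour (Hz)
    prev_mean_pitch: float   # assumed feature: mean pitch of the preceding unit
    next_mean_pitch: float   # assumed feature: mean pitch of the following unit


def scale_to_mean(contour: List[float], target_mean: float) -> List[float]:
    """Rescale a contour so that its mean equals target_mean (assumed adjustment)."""
    factor = target_mean / mean(contour)
    return [p * factor for p in contour]


def apply_phrase_prosody(
    units: List[Tuple[str, List[float]]],
    dictionary: Dict[str, PhraseEntry],
) -> List[Tuple[str, List[float]]]:
    """Replace matching phrases with dictionary prosody, then adjust neighbors."""
    result = [(text, list(contour)) for text, contour in units]
    for i, (text, _) in enumerate(units):
        entry = dictionary.get(text)
        if entry is None:
            continue
        # Changing means: swap in the phrase-level contour.
        result[i] = (text, list(entry.pitch))
        # Adjusting means: fit standard-prosody neighbors to the stored features.
        if i > 0 and units[i - 1][0] not in dictionary:
            t, c = result[i - 1]
            result[i - 1] = (t, scale_to_mean(c, entry.prev_mean_pitch))
        if i + 1 < len(units) and units[i + 1][0] not in dictionary:
            t, c = result[i + 1]
            result[i + 1] = (t, scale_to_mean(c, entry.next_mean_pitch))
    return result
```

In this sketch the input list is left untouched and only units adjacent to a replaced phrase are rescaled, which mirrors the "at least one of the preceding and following prosodic units" wording.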

With this configuration, the text analysis means can perform accent analysis and morphological analysis on the sentence text, and the prosodic information sequence generation means can generate a prosodic information sequence corresponding to the sentence text based on the analysis result of the text analysis means and the standard prosodic information of the standard prosodic dictionary. When the phrases corresponding to the sentence text include a phrase corresponding to the first prosodic information of the prosodic segment dictionary, the changing means can change the prosodic information sequence portion of that phrase to a prosodic information sequence portion generated based on the first prosodic information. The prosodic information adjusting means can then perform predetermined adjustment processing on the prosodic information sequence portion corresponding to at least one of the predetermined prosodic units preceding and following the changed phrase portion, based on the prosodic feature information that the prosodic segment dictionary holds for the changed prosodic information sequence portion. Finally, the speech waveform generation means can generate synthesized speech waveform data corresponding to the sentence text based on the prosodic information sequence and on the spectral information and the excitation source information of the segment dictionary.
Therefore, by making adjustments based on the prosodic feature information so as to suit the changed prosodic information sequence portion, it is possible to generate synthesized speech waveform data in which a sentence including the predetermined phrase and the phrases before and after it is uttered (reproduced) with prosody (intonation, speech rate, volume, etc.) closer to natural speech.
Here, based on the accent analysis result and the morphological analysis result, the text analysis means determines, for example in the case of Japanese, the reading, accent, and intonation of the input sentence written in mixed kanji and kana (the sentence text). Based on this analysis result, it may also generate a pronunciation/prosodic symbol string (intermediate language), which is reading information with prosodic symbols generated for the sentence text. The intermediate language is a language that concisely describes the control of the reading required for synthesized output when converting sentence text into speech waveform data. The same applies hereinafter to the speech synthesizers according to claims 7 and 8, the speech synthesis programs according to claims 10 to 12, and the speech synthesis methods according to claims 13 to 15.
The prosodic symbols consist of breath group boundaries, accent phrase boundaries, accent nucleus positions, and the like. For example, accent in Japanese is determined for each word and, like stress in English, is used to identify vocabulary. Homophones such as 橋 (hashi, "bridge") and 箸 (hashi, "chopsticks") are distinguished by accent. They are discriminated phonetically by the position of the point at which the pitch of the voice drops sharply (called the accent nucleus). An accent phrase is a phrase having at most one accent nucleus. The accent type is determined by the position of the accent nucleus. For example, in 橋 (hashi'), "shi" carries the accent nucleus; this is called type 2 or the tail-high type. In 箸 (ha'shi), "ha" carries the accent nucleus; this is called type 1 or the head-high type. A word with no accent nucleus is called type 0 or the flat type. When words are uttered as a phrase or a sentence, the accent position of each word is affected when it combines with an attached word or forms a compound with another word. It is known that such cases can be classified into four types: subordinate, incompletely dominant, fused, and dominant. Kanji also generally have multiple readings, known as on-readings and kun-readings. In addition, some readings change depending on the combination of words, as in rendaku (sequential voicing) and affixation. For example, rendaku gives mezamashi + tokei → mezamashidokei, and affixation gives ichi (一) + hon (本) → ippon. The same applies hereinafter to the speech synthesizers according to claims 7 and 8, the speech synthesis programs according to claims 10 to 12, and the speech synthesis methods according to claims 13 to 15.
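The relation between accent nucleus position and accent type described above can be sketched as a small Python function. This is only an illustration: the function name `accent_type_name` is hypothetical, the nucleus position is assumed to be counted in morae from the start of the word, and the "middle-high" label for a nucleus that falls between the first and last mora is the conventional term for that case, not one used in this text.

```python
def accent_type_name(nucleus: int, mora_count: int) -> str:
    """Name the accent type for a nucleus position (0 means no nucleus)."""
    if mora_count < 1 or not 0 <= nucleus <= mora_count:
        raise ValueError("nucleus must lie between 0 and mora_count")
    if nucleus == 0:
        return "flat"          # type 0: no accent nucleus
    if nucleus == 1:
        return "head-high"     # type 1: nucleus on the first mora, as in ha'shi
    if nucleus == mora_count:
        return "tail-high"     # nucleus on the last mora, as in hashi'
    return "middle-high"       # assumed label; not named in the text


# The two homophones from the text, both two morae long.
print(accent_type_name(2, 2))  # hashi' (bridge)
print(accent_type_name(1, 2))  # ha'shi (chopsticks)
```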
A speech segment consists of a phoneme, a semiphone, a sequence of multiple phonemes (for example, consonant-vowel, vowel-consonant-vowel, or consonant-vowel-consonant), or the like. The same applies hereinafter to the speech synthesizers according to claims 7 and 8, the speech synthesis programs according to claims 10 to 12, and the speech synthesis methods according to claims 13 to 15.
The spectral information expresses the acoustic resonance characteristics of the vocal tract, and the excitation source information expresses vocal cord vibration during voiced speech and the turbulent flow of air from the lungs during unvoiced speech. The excitation source information is represented, for example, by a periodic pulse train when voiced and by white noise when unvoiced. The same applies hereinafter to the speech synthesizers according to claims 7 and 8, the speech synthesis programs according to claims 10 to 12, and the speech synthesis methods according to claims 13 to 15.
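As a rough illustration of the two excitation representations mentioned above, the following Python sketch produces a periodic pulse train for a voiced frame and Gaussian white noise for an unvoiced frame. The function name `excitation`, the default pitch and sample rate, and the unit pulse amplitude are all assumptions chosen for the example; the text does not fix any of them.

```python
import random
from typing import List


def excitation(
    n_samples: int,
    voiced: bool,
    f0_hz: float = 120.0,       # assumed pitch frequency for the voiced case
    sample_rate: int = 16000,   # assumed sampling rate
    seed: int = 0,
) -> List[float]:
    """Return an excitation signal: pulse train if voiced, white noise if not."""
    if voiced:
        # One unit pulse per pitch period, zeros in between.
        period = max(1, round(sample_rate / f0_hz))
        return [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]
    # Unvoiced: zero-mean Gaussian white noise from a seeded generator.
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
```

In a source-filter synthesizer this signal would then be shaped by a filter derived from the spectral information; that step is outside the scope of the sketch.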
Further, each speech segment may have one corresponding set of spectral information and excitation source information, or more than one. The same applies hereinafter to the speech synthesizers according to claims 7 and 8, the speech synthesis programs according to claims 10 to 12, and the speech synthesis methods according to claims 13 to 15.

In order to achieve the above object, the speech synthesizer according to claim 7 of the present invention is configured as follows.
A speech synthesizer that generates synthesized speech waveform data corresponding to sentence text, comprising:
A first segment dictionary including spectral information for each speech segment and excitation source information for each speech segment;
A second segment dictionary including spectral information for each predetermined phrase and excitation source information for each predetermined phrase, both extracted from speech waveform data corresponding to an utterance spoken by a predetermined speaker, together with information on the predetermined phoneme environments preceding and following each predetermined phrase in that speech waveform data;
A standard prosodic dictionary composed of standard prosodic information, which is prosodic information for each speech unit;
Text analysis means for performing accent analysis and morphological analysis on the sentence text;
Prosodic information sequence generation means for generating a prosodic information sequence corresponding to the sentence text, based on the analysis result of the text analysis means and the standard prosodic information of the standard prosodic dictionary;
Determination means for determining, when the phrases corresponding to the sentence text include a phrase that matches a phrase corresponding to the spectral information and excitation source information of the second segment dictionary, whether the predetermined phoneme environments held in the second segment dictionary, which preceded and followed the matching phrase in the speech waveform data from which the spectral information and excitation source information were extracted, match the predetermined phoneme environments preceding and following the portion of the prosodic information sequence corresponding to the matching phrase; and
Speech waveform generation means for generating synthesized speech waveform data corresponding to the sentence text, based on the prosodic information sequence, the determination result of the determination means, and the spectral information and excitation source information of the first and second segment dictionaries,
Wherein, when the determination means determines that the environments do not match, the speech waveform generation means generates the synthesized speech waveform data corresponding to the sentence text by combining first synthesized speech waveform data with second synthesized speech waveform data, the first synthesized speech waveform data being the synthesized speech waveform data generated from the spectral information and excitation source information of the second segment dictionary corresponding to the matching phrase, excluding the speech segment data portion at the end on the side where the phoneme environments do not match, and the second synthesized speech waveform data being the synthesized speech waveform data generated for the entire sentence text from the spectral information and excitation source information of the first segment dictionary, excluding the portion corresponding to the first synthesized speech waveform data.

With this configuration, the text analysis means can perform accent analysis and morphological analysis on the sentence text, and the prosodic information sequence generation means can generate a prosodic information sequence corresponding to the sentence text based on the analysis result of the text analysis means and the standard prosodic information of the standard prosodic dictionary. When the phrases corresponding to the sentence text include a phrase that matches a phrase corresponding to the spectral information and excitation source information of the second segment dictionary, the determination means can determine whether the predetermined phoneme environments held in the second segment dictionary, which preceded and followed the matching phrase in the speech data from which the spectral information and excitation source information were extracted, match the predetermined phoneme environments preceding and following the portion of the prosodic information sequence corresponding to the matching phrase. The speech waveform generation means can then generate synthesized speech waveform data corresponding to the sentence text based on the prosodic information sequence, the determination result of the determination means, and the spectral information and excitation source information of the first and second segment dictionaries.

Furthermore, when the determination means determines that the environments do not match, the speech waveform generation means can generate the synthesized speech waveform data corresponding to the sentence text by combining two parts. The first synthesized speech waveform data is the synthesized speech waveform data generated from the spectral information and excitation source information of the second segment dictionary corresponding to the matching phrase, excluding the speech segment data portion at the mismatched end. The second synthesized speech waveform data is the synthesized speech waveform data generated for the entire sentence text from the spectral information and excitation source information of the first segment dictionary, excluding the portion corresponding to the first synthesized speech waveform data.

In other words, consider the speech data corresponding to the sentence text that is generated by combining the segment-level spectral information and excitation source information of the first segment dictionary (hereinafter, full-sentence speech data). When the determination means finds a mismatch, the speech data of the predetermined phrase portion in the full-sentence speech data (hereinafter, first predetermined-phrase speech data) is replaced with speech data generated from the phrase-level spectral information and excitation source information of the second segment dictionary that matches this phrase (hereinafter, second predetermined-phrase speech data). In this replacement, for the speech segment data at the position where the phrase joins the speech segment data of another phrase in the full-sentence speech data, the speech segment data of the first predetermined-phrase speech data is used rather than that of the second predetermined-phrase speech data.

For example, suppose that the second segment dictionary holds spectral information and excitation source information corresponding to 「お待ちでございます。」 (*1), the part that follows the comma in 「○×さんが、お待ちでございます。」 ("Mr./Ms. ○× is waiting."), and that synthesized speech waveform data is to be generated for the same sentence without the comma, 「○×さんがお待ちでございます。」. In this case, the above configuration determines that the phoneme environment 「が」 at the end of the 「○×さんが」 portion of the full-sentence speech data 「○×さんがお待ちでございます。」 (*2), generated using the first segment dictionary, does not match the phoneme environment 「、」 that preceded 「お待ちでございます。」 (*1). It can then generate the synthesized speech waveform data 「○×さんがお待ちでございます。」 (*3) by combining the 「○×さんがお」 portion of (*2) with the 「待ちでございます。」 portion of (*1), that is, (*1) without its leading 「お」.

Therefore, when the phrases corresponding to the sentence text include a phrase that matches a phrase in the second segment dictionary, even if the predetermined phoneme environments preceding and following the phrase corresponding to the spectral information and excitation source information of the second segment dictionary do not match the phoneme environment of the corresponding portion of the prosodic information sequence generated by the prosodic information sequence generation means, the mismatched portion can be replaced with speech segment data generated from segment-level spectral information and excitation source information before synthesis. This makes it possible to generate synthesized speech waveform data in which the matching portion of the sentence text and the text before and after it are connected more smoothly.

Here, the spectral information and excitation source information for each predetermined phrase in the second segment dictionary comprise the spectral information and excitation source information for each speech segment constituting the predetermined phrase as uttered by the predetermined speaker, and time information for each speech segment constituting the phrase. The same applies hereinafter to the speech synthesizer according to claim 8, the speech synthesis programs according to claims 11 and 12, and the speech synthesis methods according to claims 14 and 15.

When the second segment dictionary contains a phrase that matches a phrase in the sentence text, and the predetermined phoneme environments preceding and following that phrase in the speech waveform data at the time of extraction match the phoneme environments of the corresponding portion of the generated prosodic information sequence, the speech waveform generation means can also generate the synthesized speech waveform data for the whole of the matching phrase using the second segment dictionary, generate the remaining portions using the first segment dictionary, and combine them to produce the synthesized speech waveform data corresponding to the sentence text. When the second segment dictionary contains no phrase that matches a phrase in the sentence text, the synthesized speech waveform data corresponding to the sentence text can be created using only the first segment dictionary. The same applies hereinafter to the speech synthesizer according to claim 8, the speech synthesis programs according to claims 11 and 12, and the speech synthesis methods according to claims 14 and 15.
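The three situations distinguished so far (no matching phrase, a matching phrase whose phoneme environments agree on both sides, and a matching phrase with a mismatch on at least one side) can be summarized in a small Python sketch. The function name `choose_strategy` and the returned labels are hypothetical and serve only to make the case analysis explicit.

```python
def choose_strategy(phrase_found: bool, prev_matches: bool, next_matches: bool) -> str:
    """Pick how the first and second segment dictionaries are combined."""
    if not phrase_found:
        # No phrase of the sentence text is in the second segment dictionary.
        return "first_dictionary_only"
    if prev_matches and next_matches:
        # Environments agree on both sides: use the whole recorded phrase.
        return "second_dictionary_for_whole_phrase"
    # Mismatch on at least one side: drop the mismatched edge segment(s)
    # of the recorded phrase and take them from the first dictionary instead.
    return "second_dictionary_without_mismatched_edges"
```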

The speech segment data at the mismatched end is at least one of the first speech segment data from the beginning and the first speech segment data from the end on the time axis of the synthesized speech waveform data. That is, if the phoneme environment mismatches only on the preceding side, it is the first speech segment data from the beginning; if only on the following side, it is the first from the end; and if on both the preceding and following sides, it is the first from the beginning and the first from the end. The same applies hereinafter to the speech synthesizer according to claim 8, the speech synthesis programs according to claims 11 and 12, and the speech synthesis methods according to claims 14 and 15.
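The splice of first and second synthesized speech waveform data can be sketched in Python at the level of segment lists. This is a simplified model under stated assumptions: each speech segment is represented as a (label, source) tuple rather than waveform samples, the recorded phrase and the full-sentence data are assumed to be segmented identically over the phrase span, and the name `splice` is hypothetical.

```python
from typing import List, Tuple

Segment = Tuple[str, str]  # (label, dictionary the data came from)


def splice(
    full_sentence: List[Segment],  # whole sentence from the first dictionary
    phrase: List[Segment],         # matching phrase from the second dictionary
    start: int,                    # index of the phrase's first segment
    prev_matches: bool,            # preceding phoneme environment matches
    next_matches: bool,            # following phoneme environment matches
) -> List[Segment]:
    """Use the recorded phrase, minus any edge segment on a mismatched side."""
    end = start + len(phrase)
    lo = start + (0 if prev_matches else 1)  # skip first segment on mismatch
    hi = end - (0 if next_matches else 1)    # skip last segment on mismatch
    if lo >= hi:
        # Nothing of the recorded phrase is left to use.
        return list(full_sentence)
    return full_sentence[:lo] + phrase[lo - start:hi - start] + full_sentence[hi:]


# The example sentence, romanized; "D1"/"D2" mark the first/second dictionary.
labels = ["XX", "sa", "n", "ga", "o", "ma", "chi", "de", "go", "za", "i", "ma", "su"]
full = [(s, "D1") for s in labels]
recorded = [(s, "D2") for s in labels[4:]]
mixed = splice(full, recorded, start=4, prev_matches=False, next_matches=True)
print(mixed)
```

With the preceding side mismatched, the leading "o" stays with the first dictionary and the remainder of the phrase comes from the recording, as in the example sentence discussed earlier.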

In order to achieve the above object, the speech synthesizer according to claim 8 is configured as follows.
A speech synthesizer that generates synthesized speech waveform data corresponding to sentence text, comprising:
A prosodic segment dictionary that is created by the prosodic segment dictionary creation method according to any one of claims 1 to 4 or by the prosodic segment dictionary creation program according to claim 5, and that includes the first prosodic information and the prosodic feature information;
A standard prosodic dictionary composed of standard prosodic information, which is prosodic information for each speech unit;
A first segment dictionary including spectral information for each speech segment and excitation source information for each speech segment;
A second segment dictionary including spectral information for each predetermined phrase and excitation source information for each predetermined phrase, both extracted from speech waveform data corresponding to an utterance spoken by the predetermined speaker, together with information on the predetermined phoneme environments preceding and following each predetermined phrase in that speech waveform data;
Text analysis means for performing accent analysis and morphological analysis on the sentence text;
Prosodic information sequence generation means for generating a prosodic information sequence corresponding to the sentence text, based on the analysis result of the text analysis means and the standard prosodic information of the standard prosodic dictionary;
Changing means for, when the phrases corresponding to the sentence text include a phrase corresponding to the first prosodic information of the prosodic segment dictionary, changing the prosodic information sequence portion of that phrase to a prosodic information sequence portion generated based on the first prosodic information;
Prosodic information adjusting means for performing predetermined adjustment processing on the prosodic information sequence portion corresponding to at least one of the predetermined prosodic units preceding and following the phrase portion changed by the changing means, based on the prosodic feature information that the prosodic segment dictionary holds for the changed prosodic information sequence portion;
Determination means for determining, when the phrases corresponding to the sentence text include a phrase that matches a phrase corresponding to the spectral information and excitation source information of the second segment dictionary, whether the predetermined phoneme environments held in the second segment dictionary, which preceded and followed the matching phrase in the speech waveform data from which the spectral information and excitation source information were extracted, match the predetermined phoneme environments preceding and following the portion of the prosodic information sequence corresponding to the matching phrase; and
Speech waveform generation means for generating synthesized speech waveform data corresponding to the sentence text, based on the prosodic information sequence, the determination result of the determination means, and the spectral information and excitation source information of the first and second segment dictionaries,
Wherein, when the determination means determines that the environments do not match, the speech waveform generation means generates the synthesized speech waveform data corresponding to the sentence text by combining first synthesized speech waveform data with second synthesized speech waveform data, the first synthesized speech waveform data being the synthesized speech waveform data generated from the spectral information and excitation source information of the second segment dictionary corresponding to the matching phrase, excluding the speech segment data portion at the mismatched end, and the second synthesized speech waveform data being the synthesized speech waveform data generated for the entire sentence text from the spectral information and excitation source information of the first segment dictionary, excluding the portion corresponding to the first synthesized speech waveform data.

With this configuration, the text analysis means can perform accent analysis and morphological analysis on the sentence text, and the prosodic information sequence generation means can generate a prosodic information sequence corresponding to the sentence text based on the analysis result of the text analysis means and the standard prosodic information of the standard prosodic dictionary. The prosodic information adjusting means can perform predetermined adjustment processing on the prosodic information sequence portion corresponding to at least one of the predetermined prosodic units preceding and following the phrase portion changed by the changing means, based on the prosodic feature information that the prosodic segment dictionary holds for the changed prosodic information sequence portion.

Furthermore, when the phrases corresponding to the sentence text include a phrase that matches a phrase corresponding to the spectral information and excitation source information of the second segment dictionary, the determination means can determine whether the predetermined phoneme environments held in the second segment dictionary, which preceded and followed the matching phrase in the speech data from which the spectral information and excitation source information were extracted, match the predetermined phoneme environments preceding and following the portion of the prosodic information sequence corresponding to the matching phrase. The speech waveform generation means can then generate synthesized speech waveform data corresponding to the sentence text based on the prosodic information sequence, the determination result of the determination means, and the spectral information and excitation source information of the first and second segment dictionaries.

Still further, when the determination means determines that the environments do not match, the speech waveform generation means can generate the synthesized speech waveform data corresponding to the sentence text by combining two parts. The first synthesized speech waveform data is the synthesized speech waveform data generated from the spectral information and excitation source information of the second segment dictionary corresponding to the matching phrase, excluding the speech segment data portion at the mismatched end. The second synthesized speech waveform data is the synthesized speech waveform data generated for the entire sentence text from the spectral information and excitation source information of the first segment dictionary, excluding the portion corresponding to the first synthesized speech waveform data.

従って、韻律特徴情報系列に基づき、前記変更されたフレーズ部分前後の韻律情報系列部分に対して、変更されたフレーズの韻律情報系列部分に合わせた適切な調整を行うことが可能であり、これにより、所定フレーズ及びその前後のフレーズを含む文章の発話をより自然な韻律（抑揚、話速、音量など）にできる韻律情報を生成することができるので、より自然音声に近い韻律（抑揚、話速、音量など）で発話（再生）される合成音声波形データを生成することができるという効果が得られる。 Therefore, based on the prosodic feature information sequence, the prosodic information sequence portions before and after the changed phrase portion can be adjusted appropriately to fit the prosodic information sequence portion of the changed phrase. This makes it possible to generate prosodic information that gives the utterance of a sentence containing the predetermined phrase and the phrases before and after it a more natural prosody (intonation, speech rate, volume, and so on), and hence to generate synthesized speech waveform data that is uttered (reproduced) with prosody closer to natural speech.

更に、上記の効果に加えて、文章テキストに対応するフレーズの中に、第２素片辞書のフレーズと一致するフレーズがある場合に、前記音声波形データにおける、前記フレーズに先行及び後続する所定の音韻環境が、前記生成した韻律情報系列における対応する箇所の音韻環境と一致しないような場合でも、その部分を音声素片単位のスペクトル情報及び励振源情報から生成される音声素片データに置き換えて合成することができるので、文章テキストに対応する前記一致する部分とその前後の文章とが、更に自然音声に近い韻律（抑揚、話速、音量など）で発話（再生）される合成音声波形データを生成することができるという効果が得られる。 Furthermore, in addition to the above effects, when the phrases corresponding to the sentence text include a phrase that matches a phrase in the second segment dictionary, the non-matching part can be replaced, before synthesis, with speech segment data generated from per-segment spectrum information and excitation source information, even if the predetermined phoneme environment preceding and following that phrase in the speech waveform data does not match the phoneme environment at the corresponding location in the generated prosodic information sequence. As a result, synthesized speech waveform data can be generated in which the matching part of the sentence text and the text before and after it are uttered (reproduced) with prosody (intonation, speech rate, volume, and so on) still closer to natural speech.
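The replacement described above can be pictured as a unit-by-unit splice of two waveforms. The Python sketch below is only an illustration under stated assumptions: the function name `splice_units`, the representation of a waveform as a list of per-segment sample lists, and the one-to-one alignment between phrase segments and sentence segments are all hypothetical and are not taken from the patent text.

```python
from typing import List

def splice_units(
    standard_units: List[List[float]],  # whole sentence, from the first segment dictionary
    phrase_units: List[List[float]],    # matched phrase, from the second segment dictionary
    start: int,                         # index of the first phrase segment in the sentence
    left_ok: bool,                      # preceding phoneme environment matched
    right_ok: bool,                     # following phoneme environment matched
) -> List[float]:
    """Combine phrase segments with standard segments (hypothetical sketch).

    Assumes phrase_units[i] lines up with standard_units[start + i].
    A phrase segment at a non-matching end is dropped, so the standard
    segment is kept at that position instead.
    """
    out = list(standard_units)
    lo = 0 if left_ok else 1
    hi = len(phrase_units) if right_ok else len(phrase_units) - 1
    for i in range(lo, hi):
        out[start + i] = phrase_units[i]
    # Flatten the per-segment lists into one sample sequence.
    return [sample for unit in out for sample in unit]
```

With both flags true the whole phrase is taken from the phrase-level data; with one flag false, the edge segment on that side falls back to the standard segment.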

更に、請求項９記載の音声合成装置は、請求項６又は請求項８記載の音声合成装置において、
前記変更手段は、前記韻律素片辞書に、前記文章テキストを構成するフレーズと同じフレーズに対応する第１韻律情報が複数含まれているときに、当該フレーズの前記韻律情報系列部分を、当該韻律情報系列部分における該当フレーズに先行及び後続する韻律情報系列部分との接続性が最も良い第１韻律情報に基づき生成された韻律情報系列部分に変更することを特徴としている。 Furthermore, the speech synthesizer according to claim 9 is the speech synthesizer according to claim 6 or 8 ,
wherein, when the prosodic segment dictionary contains a plurality of pieces of first prosodic information corresponding to the same phrase as a phrase constituting the sentence text, the changing means changes the prosodic information sequence portion of that phrase to a prosodic information sequence portion generated from the first prosodic information that connects best with the prosodic information sequence portions preceding and following the phrase.

このような構成であれば、例えば、変更するフレーズ前後の所定音韻環境の標準韻律情報と、複数の変更候補の第１韻律情報との間のコスト（調和の尺度）が最小となる第１韻律情報を選択して、標準韻律情報だけの韻律情報系列を変更することができるので、適切な第１韻律情報及び韻律特徴情報によって変更及び調整された韻律情報系列から、合成音声波形データを生成することができるので、置換対象のフレーズ及びその前後のフレーズを含む文章がより自然音声に近い韻律（抑揚、話速、音量など）で発話（再生）される合成音声波形データを生成することができるという効果が得られる。 With such a configuration, for example, the first prosodic information that minimizes the cost (a measure of harmony) between the standard prosodic information of the predetermined phoneme environment before and after the phrase to be changed and the first prosodic information of each change candidate can be selected and used to change a prosodic information sequence consisting only of standard prosodic information. Synthesized speech waveform data can therefore be generated from a prosodic information sequence changed and adjusted with appropriate first prosodic information and prosodic feature information, so that a sentence containing the replaced phrase and the phrases before and after it is uttered (reproduced) with prosody (intonation, speech rate, volume, and so on) closer to natural speech.
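The minimum-cost selection mentioned above can be sketched as follows. The text does not define the cost function, so this Python example assumes a hypothetical one: the absolute pitch jump at the two phrase boundaries. The names `Candidate`, `connection_cost`, and `select_candidate` are likewise illustrative only.

```python
from dataclasses import dataclass
from typing import List, Sequence

@dataclass
class Candidate:
    """One stored pitch contour (first prosodic information) for a phrase."""
    name: str
    pitch_hz: List[float]  # pitch frequency per frame, in Hz

def connection_cost(prev_hz: Sequence[float], cand: Candidate,
                    next_hz: Sequence[float]) -> float:
    """Assumed cost: pitch discontinuity at the left and right boundaries."""
    cost = 0.0
    if prev_hz:
        cost += abs(prev_hz[-1] - cand.pitch_hz[0])
    if next_hz:
        cost += abs(cand.pitch_hz[-1] - next_hz[0])
    return cost

def select_candidate(prev_hz: Sequence[float], candidates: Sequence[Candidate],
                     next_hz: Sequence[float]) -> Candidate:
    """Pick the candidate whose contour joins its neighbours with least cost."""
    return min(candidates, key=lambda c: connection_cost(prev_hz, c, next_hz))
```

A real system would likely weigh more than boundary pitch (for example slope or duration), but the selection itself remains a plain arg-min over the candidates.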

また、上記目的を達成するために、請求項１０記載の音声合成プログラムは、
文章テキストに対応した合成音声波形データを生成する音声合成プログラムであって、
前記文章テキストに対してアクセント解析及び形態素解析を行うテキスト解析ステップと、
前記テキスト解析ステップにおける解析結果と、音声単位毎の韻律情報である標準韻律情報から構成される標準韻律辞書の有する前記標準韻律情報とに基づき、前記文章テキストに対応する韻律情報を生成する韻律情報系列生成ステップと、
前記文章テキストに対応するフレーズの中に、請求項１乃至請求項４のいずれか１項に記載の韻律素片辞書作成方法又は請求項５記載の韻律素片辞書作成プログラムによって作成された、前記第１韻律情報及び前記韻律特徴情報を含んで成る韻律素片辞書の有する前記第１韻律情報に対応したフレーズが含まれているときに、当該フレーズの前記韻律情報系列部分を、前記第１韻律情報に基づき生成された韻律情報系列部分に変更する変更ステップと、
前記変更ステップで変更されたフレーズ部分に先行及び後続する所定の韻律単位の少なくとも一方の韻律単位に対応する韻律情報系列部分に対して、前記変更された韻律情報系列部分に対応する前記韻律素片辞書の有する前記韻律特徴情報に基づき所定の調整処理を行う韻律情報調整ステップと、
前記韻律情報系列と、音声素片毎のスペクトル情報と前記音声素片毎の励振源情報とを含んで成る素片辞書の有する前記スペクトル情報及び前記励振源情報とに基づき、前記文章テキストに対応した合成音声波形データを生成する音声波形生成ステップとからなる処理をコンピュータに実行させるためのプログラムを含むことを特徴としている。
このような構成であれば、コンピュータによってプログラムが読み取られ、読み取られたプログラムに従ってコンピュータが処理を実行すると、請求項６記載の音声合成装置と同等の作用および効果が得られる。 In order to achieve the above object, the speech synthesis program according to claim 10 is
A speech synthesis program for generating synthesized speech waveform data corresponding to sentence text,
A text analysis step for performing accent analysis and morphological analysis on the sentence text;
A prosodic information sequence generation step of generating prosodic information corresponding to the sentence text, based on the analysis result of the text analysis step and the standard prosodic information held in a standard prosody dictionary composed of standard prosodic information, which is prosodic information for each speech unit;
A change step of, when the phrases corresponding to the sentence text include a phrase corresponding to first prosodic information held in a prosodic segment dictionary that comprises the first prosodic information and the prosodic feature information and was created by the prosodic segment dictionary creation method according to any one of claims 1 to 4 or the prosodic segment dictionary creation program according to claim 5, changing the prosodic information sequence portion of that phrase to a prosodic information sequence portion generated based on the first prosodic information;
A prosodic information adjustment step of performing a predetermined adjustment process, based on the prosodic feature information held in the prosodic segment dictionary for the changed prosodic information sequence portion, on the prosodic information sequence portion corresponding to at least one of the predetermined prosodic units preceding and following the phrase portion changed in the change step;
A speech waveform generation step of generating synthesized speech waveform data corresponding to the sentence text, based on the prosodic information sequence and on the spectrum information and excitation source information held in a segment dictionary comprising spectrum information for each speech segment and excitation source information for each speech segment; the program being characterized by causing a computer to execute processing consisting of these steps.
With such a configuration, when the program is read by the computer and the computer executes processing according to the read program, the same operation and effect as those of the speech synthesizer according to claim 6 can be obtained.
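The change step and the adjustment step of claim 10 can be pictured on a toy pitch contour. The Python sketch below is illustrative only. The function names `change_step` and `adjust_step` and the frame-indexed list representation are assumptions made for this example. The adjustment rule in particular (shifting a neighbouring unit so that its mean pitch equals a stored target) is hypothetical, since the text says only that a predetermined adjustment process is performed based on the prosodic feature information.

```python
from statistics import mean
from typing import List, Sequence

def change_step(seq: List[float], start: int, end: int,
                phrase_pitch: Sequence[float]) -> List[float]:
    """Replace seq[start:end] (standard pitch) with the stored phrase contour."""
    return seq[:start] + list(phrase_pitch) + seq[end:]

def adjust_step(seq: List[float], unit_start: int, unit_end: int,
                target_mean_hz: float) -> List[float]:
    """Shift one neighbouring unit so that its mean pitch hits the target.

    target_mean_hz stands in for the prosodic feature information; using a
    mean-pitch target is an assumption of this sketch.
    """
    if unit_start >= unit_end:
        return list(seq)
    unit = seq[unit_start:unit_end]
    offset = target_mean_hz - mean(unit)
    return seq[:unit_start] + [v + offset for v in unit] + seq[unit_end:]

# Usage: replace frames 2..3 with a stored contour, then adjust frames 0..1.
standard = [100.0, 110.0, 120.0, 130.0, 140.0]
changed = change_step(standard, 2, 4, [150.0, 160.0, 170.0])
adjusted = adjust_step(changed, 0, 2, 125.0)
```

Both functions return new lists, so the sequence built from the standard prosody dictionary stays available for comparison.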

また、上記目的を達成するために、請求項１１記載の音声合成プログラムは、
文章テキストに対応した合成音声波形データを生成する音声合成プログラムであって、
前記文章テキストに対してアクセント解析及び形態素解析を行うテキスト解析ステップと、
前記テキスト解析ステップにおける解析結果と、音声単位毎の韻律情報である標準韻律情報から構成される標準韻律辞書の有する標準韻律情報とに基づき、前記文章テキストに対応する韻律情報系列を生成する韻律情報系列生成ステップと、
前記文章テキストに対応するフレーズの中に、所定話者の発話した発話文に対応する音声波形データから抽出された、所定フレーズ毎のスペクトル情報と前記所定フレーズ毎の励振源情報と前記音声波形データにおける各前記所定フレーズに先行及び後続する所定の音韻環境の情報とを含んで成る第２素片辞書の有するスペクトル情報及び励振源情報に対応するフレーズと一致するものが含まれているときに、前記スペクトル情報及び前記励振源情報を抽出時の前記音声データにおける前記一致するフレーズに先行及び後続する前記第２素片辞書の有する所定の音韻環境と、前記韻律情報系列における前記一致するフレーズに対応する部分に先行及び後続する所定の音韻環境とが一致するか否かを判定する判定ステップと、
前記韻律情報系列と、前記判定ステップにおける判定結果と、文章テキストの全体に対して音声素片毎のスペクトル情報と前記音声素片毎の励振源情報とを含んで成る第１素片辞書及び前記第２素片辞書の有する前記スペクトル情報及び前記励振源情報とに基づき、前記文章テキストに対応した合成音声波形データを生成する音声波形生成ステップとからなる処理をコンピュータに実行させるためのプログラムを含み、
前記音声波形生成ステップにおいては、前記判定ステップにおいて一致しないと判定されたときに、前記第２素片辞書の有する前記一致するフレーズに対応するスペクトル情報及び励振源情報から生成した合成音声波形データの前記一致しない側端部の音声素片データ部分を除いて成る第１合成音声波形データと、前記第１素片辞書の有するスペクトル情報及び励振源情報に基づき生成した合成音声波形データから、前記第１合成音声波形データに対応する部分を除いて成る第２合成音声波形データとが合成されて成る、前記文章テキストに対応した合成音声波形データを生成することを特徴としている。
このような構成であれば、コンピュータによってプログラムが読み取られ、読み取られたプログラムに従ってコンピュータが処理を実行すると、請求項７記載の音声合成装置と同等の作用および効果が得られる。 In order to achieve the above object, the speech synthesis program according to claim 11 is
A speech synthesis program for generating synthesized speech waveform data corresponding to sentence text,
A text analysis step for performing accent analysis and morphological analysis on the sentence text;
A prosodic information sequence generation step of generating a prosodic information sequence corresponding to the sentence text, based on the analysis result of the text analysis step and the standard prosodic information held in a standard prosody dictionary composed of standard prosodic information, which is prosodic information for each speech unit;
A determination step of, when the phrases corresponding to the sentence text include one that matches a phrase corresponding to spectrum information and excitation source information held in a second segment dictionary, the second segment dictionary comprising spectrum information for each predetermined phrase, excitation source information for each predetermined phrase, and information on the predetermined phoneme environment preceding and following each predetermined phrase in the speech waveform data, all extracted from speech waveform data corresponding to utterances spoken by a predetermined speaker, determining whether the predetermined phoneme environment held in the second segment dictionary, which preceded and followed the matching phrase in the speech data when the spectrum information and excitation source information were extracted, matches the predetermined phoneme environment preceding and following the portion of the prosodic information sequence that corresponds to the matching phrase;
A speech waveform generation step of generating synthesized speech waveform data corresponding to the sentence text, based on the prosodic information sequence, the determination result of the determination step, and the spectrum information and excitation source information held in the second segment dictionary and in a first segment dictionary comprising, for the entire sentence text, spectrum information for each speech segment and excitation source information for each speech segment; the program causing a computer to execute processing consisting of these steps,
In the speech waveform generation step, when the determination step determines that the environments do not match, synthesized speech waveform data corresponding to the sentence text is generated by combining first synthesized speech waveform data, obtained by excluding the speech segment data portion at the non-matching end from the synthesized speech waveform data generated from the spectrum information and excitation source information of the second segment dictionary for the matching phrase, with second synthesized speech waveform data, obtained by excluding the portion corresponding to the first synthesized speech waveform data from the synthesized speech waveform data generated from the spectrum information and excitation source information of the first segment dictionary.
With such a configuration, when the program is read by the computer and the computer executes processing according to the read program, the same operation and effect as those of the speech synthesizer according to claim 7 can be obtained.
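The determination step can be illustrated as a comparison of neighbouring phonemes. In the Python sketch below, the `PhraseEntry` record, the use of a single phoneme on each side as the "predetermined phoneme environment", and the `"#"` boundary marker for sentence start and end are all assumptions made for illustration; the claim does not fix the width or encoding of the environment.

```python
from dataclasses import dataclass
from typing import List, Tuple

BOUNDARY = "#"  # hypothetical marker for "no neighbouring phoneme"

@dataclass(frozen=True)
class PhraseEntry:
    """Phrase-level entry of the second segment dictionary (assumed layout)."""
    phrase: str
    prev_phoneme: str  # phoneme before the phrase in the recorded utterance
    next_phoneme: str  # phoneme after the phrase in the recorded utterance

def judge_environment(entry: PhraseEntry, phonemes: List[str],
                      start: int, end: int) -> Tuple[bool, bool]:
    """Return (left_match, right_match) for the phrase at phonemes[start:end]."""
    prev_ph = phonemes[start - 1] if start > 0 else BOUNDARY
    next_ph = phonemes[end] if end < len(phonemes) else BOUNDARY
    return (prev_ph == entry.prev_phoneme, next_ph == entry.next_phoneme)
```

The two booleans tell the waveform generation step which end of the phrase-level waveform, if any, has to be dropped in favour of standard segments.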

また、上記目的を達成するために、請求項１２記載の音声合成プログラムは、
文章テキストに対応した合成音声波形データを生成する音声合成プログラムであって、
前記文章テキストに対してアクセント解析及び形態素解析を行うテキスト解析ステップと、
前記テキスト解析ステップにおける解析結果と、音声単位毎の韻律情報である標準韻律情報から構成される標準韻律辞書の有する標準韻律情報とに基づき、前記文章テキストに対応する韻律情報系列を生成する韻律情報系列生成ステップと、
前記文章テキストに対応するフレーズの中に、請求項１乃至請求項４のいずれか１項に記載の韻律素片辞書作成方法又は請求項５記載の韻律素片辞書作成プログラムによって作成された、前記第１韻律情報及び前記韻律特徴情報を含んで成る韻律素片辞書の有する前記第１韻律情報に対応したフレーズが含まれているときに、当該フレーズの前記韻律情報系列部分を、前記第１韻律情報に基づき生成された韻律情報系列部分に変更する変更ステップと、
前記変更ステップで変更されたフレーズ部分に先行及び後続する所定の韻律単位の少なくとも一方の韻律単位に対応する韻律情報系列部分に対して、前記変更された韻律情報系列部分に対応する前記韻律素片辞書が有する前記韻律特徴情報に基づき所定の調整処理を行う韻律情報調整ステップと、
前記文章テキストに対応するフレーズの中に、所定話者の発話した発話文に対応する音声波形データから抽出された、所定フレーズ毎のスペクトル情報と前記所定フレーズ毎の励振源情報と前記音声波形データにおける各前記所定フレーズに先行及び後続する所定の音韻環境の情報とを含んで成る第２素片辞書の有するスペクトル情報及び励振源情報に対応するフレーズと一致するものが含まれているときに、前記スペクトル情報及び前記励振源情報を抽出時の前記音声データにおける前記一致するフレーズに先行及び後続する所定の音韻環境と、前記韻律情報系列における前記一致するフレーズに対応する部分に先行及び後続する前記第２素片辞書の有する所定の音韻環境とが一致するか否かを判定する判定ステップと、
前記韻律情報系列と、前記判定ステップにおける判定結果と、文章テキストの全体に対して音声素片毎のスペクトル情報と前記音声素片毎の励振源情報とを含んで成る第１素片辞書及び前記第２素片辞書の有する前記スペクトル情報及び前記励振源情報とに基づき、前記文章テキストに対応した合成音声波形データを生成する音声波形生成ステップとからなる処理をコンピュータに実行させるためのプログラムを含み、
前記音声波形生成ステップにおいては、前記判定ステップにおいて一致しないと判定されたときに、前記第２素片辞書の有する前記一致するフレーズに対応するスペクトル情報及び励振源情報から生成した合成音声波形データの前記一致しない側端部の音声素片データ部分を除いて成る第１合成音声波形データと、前記第１素片辞書の有するスペクトル情報及び励振源情報に基づき生成した合成音声波形データから、前記第１合成音声波形データに対応する部分を除いて成る第２合成音声波形データとが合成されて成る、前記文章テキストに対応した合成音声波形データを生成することを特徴としている。
このような構成であれば、コンピュータによってプログラムが読み取られ、読み取られたプログラムに従ってコンピュータが処理を実行すると、請求項８記載の音声合成装置と同等の作用および効果が得られる。 In order to achieve the above object, the speech synthesis program according to claim 12 is
A speech synthesis program for generating synthesized speech waveform data corresponding to sentence text,
A text analysis step for performing accent analysis and morphological analysis on the sentence text;
A prosodic information sequence generation step of generating a prosodic information sequence corresponding to the sentence text, based on the analysis result of the text analysis step and the standard prosodic information held in a standard prosody dictionary composed of standard prosodic information, which is prosodic information for each speech unit;
A change step of, when the phrases corresponding to the sentence text include a phrase corresponding to first prosodic information held in a prosodic segment dictionary that comprises the first prosodic information and the prosodic feature information and was created by the prosodic segment dictionary creation method according to any one of claims 1 to 4 or the prosodic segment dictionary creation program according to claim 5, changing the prosodic information sequence portion of that phrase to a prosodic information sequence portion generated based on the first prosodic information;
A prosodic information adjustment step of performing a predetermined adjustment process, based on the prosodic feature information held in the prosodic segment dictionary for the changed prosodic information sequence portion, on the prosodic information sequence portion corresponding to at least one of the predetermined prosodic units preceding and following the phrase portion changed in the change step;
A determination step of, when the phrases corresponding to the sentence text include one that matches a phrase corresponding to spectrum information and excitation source information held in a second segment dictionary, the second segment dictionary comprising spectrum information for each predetermined phrase, excitation source information for each predetermined phrase, and information on the predetermined phoneme environment preceding and following each predetermined phrase in the speech waveform data, all extracted from speech waveform data corresponding to utterances spoken by a predetermined speaker, determining whether the predetermined phoneme environment that preceded and followed the matching phrase in the speech data when the spectrum information and excitation source information were extracted matches the predetermined phoneme environment, held in the second segment dictionary, preceding and following the portion of the prosodic information sequence that corresponds to the matching phrase;
A speech waveform generation step of generating synthesized speech waveform data corresponding to the sentence text, based on the prosodic information sequence, the determination result of the determination step, and the spectrum information and excitation source information held in the second segment dictionary and in a first segment dictionary comprising, for the entire sentence text, spectrum information for each speech segment and excitation source information for each speech segment; the program causing a computer to execute processing consisting of these steps,
In the speech waveform generation step, when the determination step determines that the environments do not match, synthesized speech waveform data corresponding to the sentence text is generated by combining first synthesized speech waveform data, obtained by excluding the speech segment data portion at the non-matching end from the synthesized speech waveform data generated from the spectrum information and excitation source information of the second segment dictionary for the matching phrase, with second synthesized speech waveform data, obtained by excluding the portion corresponding to the first synthesized speech waveform data from the synthesized speech waveform data generated from the spectrum information and excitation source information of the first segment dictionary.
With such a configuration, when the program is read by the computer and the computer executes processing according to the read program, the same operation and effect as those of the speech synthesizer according to claim 8 can be obtained.

また、上記目的を達成するために、請求項１３記載の音声合成方法は、
文章テキストに対応した合成音声波形データを生成する音声合成方法であって、
前記文章テキストに対してアクセント解析及び形態素解析を行うテキスト解析ステップと、
前記テキスト解析ステップにおける解析結果と、音声単位毎の韻律情報である標準韻律情報から構成される標準韻律辞書の有する前記標準韻律情報とに基づき、前記文章テキストに対応する韻律情報を生成する韻律情報系列生成ステップと、
前記文章テキストに対応するフレーズの中に、請求項１乃至請求項４のいずれか１項に記載の韻律素片辞書作成方法又は請求項５記載の韻律素片辞書作成プログラムによって作成された、前記第１韻律情報及び前記韻律特徴情報を含んで成る韻律素片辞書の有する前記第１韻律情報に対応したフレーズが含まれているときに、当該フレーズの前記韻律情報系列部分を、前記第１韻律情報に基づき生成された韻律情報系列部分に変更する変更ステップと、
前記変更ステップで変更されたフレーズ部分に先行及び後続する所定の韻律単位の少なくとも一方の韻律単位に対応する韻律情報系列部分に対して、前記変更された韻律情報系列部分に対応する前記韻律素片辞書の有する前記韻律特徴情報に基づき所定の調整処理を行う韻律情報調整ステップと、
前記韻律情報系列と、音声素片毎のスペクトル情報と前記音声素片毎の励振源情報とを含んで成る素片辞書の有する前記スペクトル情報及び前記励振源情報とに基づき、前記文章テキストに対応した合成音声波形データを生成する音声波形生成ステップと、を含むことを特徴としている。
これにより、請求項６記載の音声合成装置と同等の効果が得られる。 In order to achieve the above object, the speech synthesis method according to claim 13 is
A speech synthesis method for generating synthesized speech waveform data corresponding to sentence text,
A text analysis step for performing accent analysis and morphological analysis on the sentence text;
A prosodic information sequence generation step of generating prosodic information corresponding to the sentence text, based on the analysis result of the text analysis step and the standard prosodic information held in a standard prosody dictionary composed of standard prosodic information, which is prosodic information for each speech unit;
A change step of, when the phrases corresponding to the sentence text include a phrase corresponding to first prosodic information held in a prosodic segment dictionary that comprises the first prosodic information and the prosodic feature information and was created by the prosodic segment dictionary creation method according to any one of claims 1 to 4 or the prosodic segment dictionary creation program according to claim 5, changing the prosodic information sequence portion of that phrase to a prosodic information sequence portion generated based on the first prosodic information;
A prosodic information adjustment step of performing a predetermined adjustment process, based on the prosodic feature information held in the prosodic segment dictionary for the changed prosodic information sequence portion, on the prosodic information sequence portion corresponding to at least one of the predetermined prosodic units preceding and following the phrase portion changed in the change step;
A speech waveform generation step of generating synthesized speech waveform data corresponding to the sentence text, based on the prosodic information sequence and on the spectrum information and excitation source information held in a segment dictionary comprising spectrum information for each speech segment and excitation source information for each speech segment; the method being characterized by including these steps.
Thereby, the same effect as that of the speech synthesizer described in claim 6 can be obtained.

また、上記目的を達成するために、請求項１４記載の音声合成方法は、
文章テキストに対応した合成音声波形データを生成する音声合成方法であって、
前記文章テキストに対してアクセント解析及び形態素解析を行うテキスト解析ステップと、
前記テキスト解析ステップにおける解析結果と、音声単位毎の韻律情報である標準韻律情報から構成される標準韻律辞書の有する標準韻律情報とに基づき、前記文章テキストに対応する韻律情報系列を生成する韻律情報系列生成ステップと、
前記文章テキストに対応するフレーズの中に、所定話者の発話した発話文に対応する音声波形データから抽出された、所定フレーズ毎のスペクトル情報と前記所定フレーズ毎の励振源情報と前記音声波形データにおける各前記所定フレーズに先行及び後続する所定の音韻環境の情報とを含んで成る第２素片辞書の有するスペクトル情報及び励振源情報に対応するフレーズと一致するものが含まれているときに、前記スペクトル情報及び前記励振源情報を抽出時の前記音声データにおける前記一致するフレーズに先行及び後続する所定の音韻環境と、前記韻律情報系列における前記一致するフレーズに対応する部分に先行及び後続する前記第２素片辞書の有する所定の音韻環境とが一致するか否かを判定する判定ステップと、
前記韻律情報系列と、前記判定ステップにおける判定結果と、文章テキストの全体に対して音声素片毎のスペクトル情報と前記音声素片毎の励振源情報とを含んで成る第１素片辞書及び前記第２素片辞書の有する前記スペクトル情報及び前記励振源情報とに基づき、前記文章テキストに対応した合成音声波形データを生成する音声波形生成ステップと、を含み、
前記音声波形生成ステップにおいては、前記判定ステップにおいて一致しないと判定されたときに、前記第２素片辞書の有する前記一致するフレーズに対応するスペクトル情報及び励振源情報から生成した合成音声波形データの前記一致しない側端部の音声素片データ部分を除いて成る第１合成音声波形データと、前記第１素片辞書の有するスペクトル情報及び励振源情報に基づき生成した合成音声波形データから、前記第１合成音声波形データに対応する部分を除いて成る第２合成音声波形データとが合成されて成る、前記文章テキストに対応した合成音声波形データを生成することを特徴としている。
これにより、請求項７記載の音声合成装置と同等の効果が得られる。 In order to achieve the above object, the speech synthesis method according to claim 14 is
A speech synthesis method for generating synthesized speech waveform data corresponding to sentence text,
A text analysis step for performing accent analysis and morphological analysis on the sentence text;
A prosodic information sequence generation step of generating a prosodic information sequence corresponding to the sentence text, based on the analysis result of the text analysis step and the standard prosodic information held in a standard prosody dictionary composed of standard prosodic information, which is prosodic information for each speech unit;
A determination step of, when the phrases corresponding to the sentence text include one that matches a phrase corresponding to spectrum information and excitation source information held in a second segment dictionary, the second segment dictionary comprising spectrum information for each predetermined phrase, excitation source information for each predetermined phrase, and information on the predetermined phoneme environment preceding and following each predetermined phrase in the speech waveform data, all extracted from speech waveform data corresponding to utterances spoken by a predetermined speaker, determining whether the predetermined phoneme environment that preceded and followed the matching phrase in the speech data when the spectrum information and excitation source information were extracted matches the predetermined phoneme environment, held in the second segment dictionary, preceding and following the portion of the prosodic information sequence that corresponds to the matching phrase;
A speech waveform generation step of generating synthesized speech waveform data corresponding to the sentence text, based on the prosodic information sequence, the determination result of the determination step, and the spectrum information and excitation source information held in the second segment dictionary and in a first segment dictionary comprising, for the entire sentence text, spectrum information for each speech segment and excitation source information for each speech segment,
In the speech waveform generation step, when the determination step determines that the environments do not match, synthesized speech waveform data corresponding to the sentence text is generated by combining first synthesized speech waveform data, obtained by excluding the speech segment data portion at the non-matching end from the synthesized speech waveform data generated from the spectrum information and excitation source information of the second segment dictionary for the matching phrase, with second synthesized speech waveform data, obtained by excluding the portion corresponding to the first synthesized speech waveform data from the synthesized speech waveform data generated from the spectrum information and excitation source information of the first segment dictionary.
Thereby, the same effect as that of the speech synthesizer described in claim 7 can be obtained.

また、上記目的を達成するために、請求項１５記載の音声合成方法は、
文章テキストに対応した合成音声波形データを生成する音声合成方法であって、
前記文章テキストに対してアクセント解析及び形態素解析を行うテキスト解析ステップと、
前記テキスト解析ステップにおける解析結果と、音声単位毎の韻律情報である標準韻律情報から構成される標準韻律辞書の有する標準韻律情報とに基づき、前記文章テキストに対応する韻律情報系列を生成する韻律情報系列生成ステップと、
前記文章テキストに対応するフレーズの中に、請求項１乃至請求項４のいずれか１項に記載の韻律素片辞書作成方法又は請求項５記載の韻律素片辞書作成プログラムによって作成された、前記第１韻律情報及び前記韻律特徴情報を含んで成る韻律素片辞書の有する前記第１韻律情報に対応したフレーズが含まれているときに、当該フレーズの前記韻律情報系列部分を、前記第１韻律情報に基づき生成された韻律情報系列部分に変更する変更ステップと、
前記変更ステップで変更されたフレーズ部分に先行及び後続する所定の韻律単位の少なくとも一方の韻律単位に対応する韻律情報系列部分に対して、前記変更された韻律情報系列部分に対応する前記韻律素片辞書の有する前記韻律特徴情報に基づき所定の調整処理を行う韻律情報調整ステップと、
前記文章テキストに対応するフレーズの中に、所定話者の発話した発話文に対応する音声波形データから抽出された、所定フレーズ毎のスペクトル情報と前記所定フレーズ毎の励振源情報と前記音声波形データにおける各前記所定フレーズに先行及び後続する所定の音韻環境の情報とを含んで成る第２素片辞書の有するスペクトル情報及び励振源情報に対応するフレーズと一致するものが含まれているときに、前記スペクトル情報及び前記励振源情報を抽出時の前記音声データにおける前記一致するフレーズに先行及び後続する所定の音韻環境と、前記韻律情報系列における前記一致するフレーズに対応する部分に先行及び後続する前記第２素片辞書の有する所定の音韻環境とが一致するか否かを判定する判定ステップと、
前記韻律情報系列と、前記判定ステップにおける判定結果と、文章テキストの全体に対して音声素片毎のスペクトル情報と前記音声素片毎の励振源情報とを含んで成る第１素片辞書及び前記第２素片辞書の有する前記スペクトル情報及び前記励振源情報とに基づき、前記文章テキストに対応した合成音声波形データを生成する音声波形生成ステップと、を含み、
前記音声波形生成ステップにおいては、前記判定ステップにおいて一致しないと判定されたときに、前記第２素片辞書の有する前記一致するフレーズに対応するスペクトル情報及び励振源情報から生成した合成音声波形データの前記一致しない側端部の音声素片データ部分を除いて成る第１合成音声波形データと、前記第１素片辞書の有するスペクトル情報及び励振源情報に基づき生成した合成音声波形データから、前記第１合成音声波形データに対応する部分を除いて成る第２合成音声波形データとを合成して、前記文章テキストに対応した合成音声波形データを生成することを特徴としている。
これにより、請求項８記載の音声合成装置と同等の効果が得られる。 In order to achieve the above object, the speech synthesis method according to claim 15 is
A speech synthesis method for generating synthesized speech waveform data corresponding to sentence text,
A text analysis step for performing accent analysis and morphological analysis on the sentence text;
A prosodic information sequence generation step of generating a prosodic information sequence corresponding to the sentence text, based on the analysis result of the text analysis step and the standard prosodic information held in a standard prosody dictionary composed of standard prosodic information, which is prosodic information for each speech unit;
A change step of, when the phrases corresponding to the sentence text include a phrase corresponding to first prosodic information held in a prosodic segment dictionary that comprises the first prosodic information and the prosodic feature information and was created by the prosodic segment dictionary creation method according to any one of claims 1 to 4 or the prosodic segment dictionary creation program according to claim 5, changing the prosodic information sequence portion of that phrase to a prosodic information sequence portion generated based on the first prosodic information;
A prosodic information adjustment step of performing a predetermined adjustment process, based on the prosodic feature information held in the prosodic segment dictionary for the changed prosodic information sequence portion, on the prosodic information sequence portion corresponding to at least one of the predetermined prosodic units preceding and following the phrase portion changed in the change step;
A determination step of, when the phrases corresponding to the sentence text include one that matches a phrase corresponding to spectrum information and excitation source information held in a second segment dictionary, the second segment dictionary comprising spectrum information for each predetermined phrase, excitation source information for each predetermined phrase, and information on the predetermined phoneme environment preceding and following each predetermined phrase in the speech waveform data, all extracted from speech waveform data corresponding to utterances spoken by a predetermined speaker, determining whether the predetermined phoneme environment that preceded and followed the matching phrase in the speech data when the spectrum information and excitation source information were extracted matches the predetermined phoneme environment, held in the second segment dictionary, preceding and following the portion of the prosodic information sequence that corresponds to the matching phrase;
A speech waveform generation step of generating synthesized speech waveform data corresponding to the sentence text, based on the prosodic information sequence, the determination result of the determination step, and the spectrum information and excitation source information held in the second segment dictionary and in a first segment dictionary comprising, for the entire sentence text, spectrum information for each speech segment and excitation source information for each speech segment,
In the speech waveform generation step, when the determination step determines that the environments do not match, synthesized speech waveform data corresponding to the sentence text is generated by combining first synthesized speech waveform data, obtained by excluding the speech segment data portion at the non-matching end from the synthesized speech waveform data generated from the spectrum information and excitation source information of the second segment dictionary for the matching phrase, with second synthesized speech waveform data, obtained by excluding the portion corresponding to the first synthesized speech waveform data from the synthesized speech waveform data generated from the spectrum information and excitation source information of the first segment dictionary.
Thereby, the same effect as that of the speech synthesizer described in claim 8 can be obtained.

〔第１の実施の形態〕
以下、本発明に係る韻律素片辞書作成方法及び韻律素片辞書作成プログラム、並びに音声合成装置、音声合成プログラム及び音声合成方法の第１の実施の形態を図面に基づき説明する。図１〜図８は、本発明に係る韻律素片辞書作成方法及び韻律素片辞書作成プログラム、並びに音声合成装置、音声合成プログラム及び音声合成方法の第１の実施の形態を示す図である。 [First Embodiment]
A first embodiment of a prosodic segment dictionary creation method, a prosodic segment dictionary creation program, a speech synthesizer, a speech synthesis program, and a speech synthesis method according to the present invention will be described below with reference to the drawings. FIGS. 1 to 8 show the first embodiment of the prosodic segment dictionary creation method, prosodic segment dictionary creation program, speech synthesizer, speech synthesis program, and speech synthesis method according to the present invention.

まず、本発明の第１の実施の形態に係る音声合成装置の構成を図１に基づき説明する。図１は、本発明の第１の実施の形態に係る音声合成装置１００の構成を示すブロック図である。
図１に示すように、音声合成装置１００は、音声合成対象の文章テキストを解析して、発音・韻律記号列を生成するテキスト解析部１０と、音声単位毎のピッチ周波数情報から構成される標準韻律辞書１１と、所定フレーズ単位のピッチ周波数情報及び後述する韻律特徴情報から構成される韻律素片辞書１２と、発音・韻律記号列に基づき、標準韻律辞書１１と、韻律素片辞書１２とを用いて韻律情報系列を生成する韻律生成部１３と、韻律情報系列に基づき、後述する標準波形素片辞書１５を用いて合成音声波形データを生成する波形合成部１４と、音声単位毎のスペクトル情報及び励振源情報から構成される標準波形素片辞書１５とを含んだ構成となっている。 First, the configuration of the speech synthesizer according to the first embodiment of the present invention will be described with reference to FIG. FIG. 1 is a block diagram showing a configuration of a speech synthesizer 100 according to the first embodiment of the present invention.
As shown in FIG. 1, the speech synthesizer 100 includes: a text analysis unit 10 that analyzes the sentence text to be synthesized and generates a pronunciation/prosodic symbol string; a standard prosody dictionary 11 composed of pitch frequency information for each speech unit; a prosodic segment dictionary 12 composed of pitch frequency information for each predetermined phrase and prosodic feature information described later; a prosody generation unit 13 that generates a prosodic information sequence from the pronunciation/prosodic symbol string using the standard prosody dictionary 11 and the prosodic segment dictionary 12; a waveform synthesis unit 14 that generates synthesized speech waveform data from the prosodic information sequence using a standard waveform segment dictionary 15 described later; and the standard waveform segment dictionary 15, composed of spectrum information and excitation source information for each speech unit.
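The data flow between the units in FIG. 1 can be summarized as a three-stage pipeline. The Python sketch below shows only the wiring. The class name `SpeechSynthesizer`, the field names, and the list-of-floats types are assumptions for this example, and the three stages are passed in as plain callables because their internals are described elsewhere in the embodiment.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class SpeechSynthesizer:
    """Wiring of units 10, 13 and 14 (hypothetical sketch)."""
    # Text analysis unit 10: sentence text -> pronunciation/prosodic symbol string.
    analyze_text: Callable[[str], str]
    # Prosody generation unit 13: symbol string -> prosodic information sequence.
    # In the embodiment this stage consults dictionaries 11 and 12.
    generate_prosody: Callable[[str], List[float]]
    # Waveform synthesis unit 14: prosodic sequence -> waveform samples.
    # In the embodiment this stage consults dictionary 15.
    synthesize_waveform: Callable[[List[float]], List[float]]

    def synthesize(self, sentence_text: str) -> List[float]:
        symbols = self.analyze_text(sentence_text)
        prosody = self.generate_prosody(symbols)
        return self.synthesize_waveform(prosody)
```

Keeping the stages as injected callables mirrors the block diagram: each unit can be swapped or tested on its own without touching the others.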

The text analysis unit 10 performs accent analysis and morphological analysis on the sentence text to be synthesized, using a word dictionary or the like (not shown), and determines the reading, accent, and intonation of the input sentence text (for example, in Japanese, text written in mixed kana and kanji). It then generates a pronunciation/prosodic symbol string, which is reading information with prosodic symbols (an intermediate language), and outputs the generated string to the prosody generation unit 13.
The standard prosody dictionary 11 holds standard pitch frequency information, that is, pitch frequency information corresponding to each item of speech unit label information. When a speech unit label sequence corresponding to a pronunciation/prosodic symbol string is input from the standard prosody generation unit 13a described later, the dictionary outputs the corresponding standard pitch frequency information to the standard prosody generation unit 13a.

The prosodic segment dictionary 12 holds prosodic segment pitch frequency information, that is, pitch frequency information in units of predetermined phrases, together with the corresponding prosodic feature information. Both are created, by the prosodic segment dictionary creation process described later, from speech waveform data corresponding to utterances that are spoken by a predetermined speaker and composed of a plurality of phrases. In response to a read request from the prosodic segment selection unit 13c described later, the dictionary outputs the prosodic segment pitch frequency information and prosodic feature information of the requested phrase to the prosodic segment selection unit 13c.
Here, the prosodic feature information is information indicating prosodic features generated from the pitch frequency information of a predetermined phonological environment preceding or following a predetermined phrase when the pitch frequency information of that phrase is extracted from the speech waveform data.

The prosody generation unit 13 includes a standard prosody generation unit 13a, a registered phrase matching unit 13b, a prosodic segment selection unit 13c, and a prosody replacement shaping unit 13d.
The standard prosody generation unit 13a generates a speech unit label sequence based on the pronunciation/prosodic symbol string input from the text analysis unit 10, outputs the speech unit label sequence to the standard prosody dictionary 11, and acquires the standard pitch frequency information for the speech unit sequence corresponding to that label sequence. It then generates a standard prosodic information sequence that includes the standard pitch frequency sequence corresponding to the speech unit label sequence.

The registered phrase matching unit 13b determines, based on the pronunciation/prosodic symbol string input from the text analysis unit 10, whether that string contains a phrase that matches a phrase registered in the prosodic segment dictionary 12. The determination result is output to the prosodic segment selection unit 13c.
When the determination result input from the registered phrase matching unit 13b indicates that a matching phrase is included, the prosodic segment selection unit 13c reads from the prosodic segment dictionary 12 all the prosodic segment pitch frequency information and prosodic feature information corresponding to the matched phrase, selects the candidate that connects best with the standard prosodic information sequence input from the standard prosody generation unit 13a, and outputs the selected prosodic segment pitch frequency information and prosodic feature information to the prosody replacement shaping unit 13d. When the determination result indicates that no matching phrase is included, the prosodic segment selection unit 13c notifies the prosody replacement shaping unit 13d of that fact.

When the prosodic segment pitch frequency information and prosodic feature information corresponding to the matching phrase are input from the prosodic segment selection unit 13c, the prosody replacement shaping unit 13d replaces the standard pitch frequency information of the matching phrase in the standard prosodic information sequence input from the standard prosody generation unit 13a with the prosodic segment pitch frequency information. Based on the prosodic feature information, it also shapes the portions of the standard prosodic information sequence that correspond to the predetermined phonological environments before and after the matching phrase, and thereby generates the final prosodic information sequence. The generated prosodic information sequence is output to the waveform synthesis unit 14.

On the other hand, when the prosody replacement shaping unit 13d receives from the prosodic segment selection unit 13c a notification indicating that there is no matching phrase, it outputs the standard prosodic information sequence input from the standard prosody generation unit 13a to the waveform synthesis unit 14 as is.
The waveform synthesis unit 14 acquires from the standard waveform segment dictionary 15 the spectrum information and excitation source information corresponding to the prosodic information sequence input from the prosody generation unit 13, generates a spectrum information sequence from the acquired spectrum information, and generates an excitation source information sequence from the acquired excitation source information. The waveform synthesis unit 14 also includes a synthesis filter. Using the generated spectrum information sequence as the parameters of the synthesis filter, it filters the excitation source signal generated from the excitation source information sequence to produce synthesized speech waveform data corresponding to the sentence text, and then outputs synthesized speech based on that data.

The standard waveform segment dictionary 15 holds spectrum information and excitation source information for each speech unit. In the present embodiment, spectral envelope parameters extracted from speech frame by frame are used as the spectrum information, and the degree of voicing of the excitation source signal (the mixing ratio of pulse and noise) is used as the excitation source information.
In the present embodiment, the speech synthesizer 100 includes, although not shown, a storage medium storing programs for controlling the above components, a processor for executing those programs, and a RAM for storing the data needed to execute them. The processing of each component is realized by the processor reading out and executing the programs stored in the storage medium.

The storage medium may be a semiconductor storage medium such as a RAM or ROM, a magnetic storage medium such as an FD or HD, an optically read storage medium such as a CD, CDV, LD, or DVD, or a magnetically stored and optically read storage medium such as an MO. It includes any computer-readable storage medium, regardless of whether the reading method is electronic, magnetic, optical, or otherwise.

Next, the method by which the prosody generation unit 13 generates the speech unit label sequence will be described.
The following description uses as an example the case where the text of the utterance is 「あらゆる現実を、すべて自分の方へねじ曲げたのだ。」 (roughly, "every reality was twisted toward oneself"). Here, the speech unit is the phoneme, and the following combination of phonological environments is used.
- Position of the breath group
- Position of the accent phrase
- Presence or absence of pauses before and after
- Mora length of the accent phrase
- Accent type of the accent phrase
- Accent type of the preceding accent phrase
- Mora position of the phoneme within the accent phrase
- The phoneme itself

When phonological environments are associated with the above utterance example, then for the first half of the utterance, for example, the phonemes are "a", "r", "a", "y", "u", "r", "u", "g", "e", "N", "j", "i", "ts", "u", "o"; the breath group is 「あらゆる現実を」; the accent phrases (readings) are 「あらゆる」 and 「現実を」; and the position of the breath group within the sentence is "1".

Further, for 「あらゆる」, for example, the position of the accent phrase within the breath group is "1"; the presence or absence of a pause is "x" immediately before the accent phrase and "0" immediately after it; the mora length of the accent phrase is "4"; the accent type is "3" for the accent phrase itself and "x" for the preceding accent phrase; and the mora positions within the accent phrase are "1" for "a", "2" for "r" and "a", "3" for "y" and "u", and "4" for "r" and "u".
Although the speech unit is the phoneme here, it is not limited to this and may be a half-phoneme, a mora, or the like. The phonological environments are likewise not limited to those listed above. The mora length of the sentence, the mora length of the breath group, the part of speech, the conjugated form, the conjugation type, and so on may also be used, as may the second-preceding, preceding, following, and second-following phonological environments.

In the present embodiment, phoneme labels are assigned according to the following rule.
P_Bpos_Apos_Ppau_Npau_Alen_ACtype_APtype_Mpos
In this rule, P corresponds to "the phoneme itself", Bpos to "the position of the breath group", Apos to "the position of the accent phrase", Ppau to "the presence or absence of a pause immediately before the accent phrase", Npau to "the presence or absence of a pause immediately after the accent phrase", Alen to "the mora length of the accent phrase", ACtype to "the accent type of the accent phrase", APtype to "the accent type of the preceding accent phrase", and Mpos to "the mora position of the phoneme within the accent phrase".

When a phoneme label is assigned to each phoneme of the utterance using this rule, the label for "a", the phoneme of 「あ」 in 「あらゆる」, for example, is "a_1_1_x_0_4_3_x_1". Here, x in a phoneme label indicates that no information is available. In addition, the start time (seconds) and end time (seconds) of each phoneme are extracted from the speech data corresponding to the utterance.
Then, a speech unit label sequence corresponding to the speech/prosodic symbol string is generated from timed label data consisting of the phoneme label and the time information for each phoneme. For each phoneme label included in this speech unit label sequence, the corresponding standard pitch frequency information can be acquired from the standard prosody dictionary 11.
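The labeling rule can be illustrated with a minimal Python sketch. The function name `make_label` and the use of `None` to stand for missing information are assumptions made for illustration; only the field order and the "x" placeholder are taken from the description above.

```python
def make_label(p, bpos, apos, ppau, npau, alen, actype, aptype, mpos):
    """Join the nine fields in the order
    P_Bpos_Apos_Ppau_Npau_Alen_ACtype_APtype_Mpos.

    None stands for "no information" and is written as "x"
    (an illustrative convention, not part of the source).
    """
    fields = (p, bpos, apos, ppau, npau, alen, actype, aptype, mpos)
    return "_".join("x" if f is None else str(f) for f in fields)


# Phonemes of the accent phrase "あらゆる" paired with their mora positions,
# as given in the description: a=1, r a=2, y u=3, r u=4.
phonemes = [("a", 1), ("r", 2), ("a", 2), ("y", 3), ("u", 3), ("r", 4), ("u", 4)]

labels = [
    make_label(p, bpos=1, apos=1, ppau=None, npau=0,
               alen=4, actype=3, aptype=None, mpos=mpos)
    for p, mpos in phonemes
]
print(labels[0])  # a_1_1_x_0_4_3_x_1
```

The remaining labels differ from the first only in the phoneme and the mora position, since every phoneme of 「あらゆる」 shares the same accent phrase context.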

Next, the method of creating the prosodic segment dictionary will be described with reference to FIGS. 2 to 4. FIG. 2 is a diagram showing the flow of the prosodic segment dictionary creation process. FIG. 3 is a diagram showing an example of how the pitch ratio and intonation ratio are calculated. FIG. 4(a) is a diagram showing a configuration example of the prosodic segment dictionary, and FIG. 4(b) is a diagram showing an example of the pitch frequency data registered in the prosodic segment dictionary.
In the present embodiment, the prosodic segment dictionary 12 is created from a large amount of speech waveform data corresponding to many kinds of utterances spoken by one specific speaker. This speech waveform data is collected in advance, associated with information on the utterance content, and stored in a storage device (such as an HDD).

The flow of the process of creating the prosodic segment dictionary 12 will now be described with reference to FIG. 2. In the present embodiment, a dedicated prosodic segment dictionary creation program is started on an information processing device such as a PC (not shown), and the prosodic segment dictionary 12 is created by executing the creation process shown in the flowchart of FIG. 2. Every utterance corresponding to the collected speech waveform data consists of two or more phrases.

As shown in FIG. 2, the process first proceeds to step S100, where predetermined speech waveform data of the specific speaker is acquired from the storage device (not shown), and then proceeds to step S102.
In step S102, it is determined from the utterance content information of the speech waveform data acquired in step S100 whether the speech waveform data contains a fixed phrase registered in advance. If it does (Yes), the process proceeds to step S104; otherwise (No), the process proceeds to step S122. The registered fixed phrases are, for example, phrases used in car navigation systems and the like, such as 「〜を右折です」 ("turn right at ..."), 「〜を右方向です」 ("bear right at ..."), 「〜を左折です」 ("turn left at ..."), 「〜を左方向です」 ("bear left at ..."), and 「〜を直進です」 ("go straight at ..."), as well as other phrases that are used relatively often (that is, appear relatively frequently), such as 「〜でございます」 ("this is ..."), 「〜してください」 ("please do ..."), and 「〜ください」 ("please ...").

In step S104, the pitch frequency of the phrase portion corresponding to the registered phrase is extracted from the speech waveform data as first prosodic information, and the process proceeds to step S106.
In step S106, pitch frequency information in units of phrases (prosodic segment pitch frequency information) is generated from the pitch frequency extracted in step S104, and the process proceeds to step S108.

In step S108, it is determined whether the speech waveform data contains a phrase preceding the phrase portion that corresponds to the registered phrase. If it does (Yes), the process proceeds to step S110; otherwise (No), the process proceeds to step S112.
In step S110, the pitch frequency is extracted as second prosodic information from the speech waveform data portion of the predetermined prosodic unit preceding that phrase portion, and the process proceeds to step S112.

In step S112, it is determined whether the speech waveform data contains a phrase following the phrase portion that corresponds to the registered phrase. If it does (Yes), the process proceeds to step S114; otherwise (No), the process proceeds to step S116.
In step S114, the pitch frequency is extracted as second prosodic information from the speech waveform data portion of the predetermined prosodic unit following that phrase portion, and the process proceeds to step S116.

Here, the predetermined prosodic units preceding or following the registered phrase, from which pitch frequencies are extracted in steps S110 and S114, will be described. For example, when the phonetic symbol string of the registered phrase is one such as 「イタシマ’ス＋ノデ」, uttered by the predetermined speaker with its preceding boundary inside an accent phrase and its following boundary at the end of a breath group, the predetermined prosodic units preceding the registered phrase are the preceding part within the breath group and the preceding part outside the breath group, and the predetermined prosodic units following it are the following part within the breath group and the following part outside the breath group. When the phonetic symbol string of the registered phrase is one such as 「モーシアゲマ’ス」, uttered with its preceding boundary at an accent phrase or inside an accent phrase and its following boundary at the end of a declarative sentence, the predetermined prosodic units preceding the registered phrase are the preceding part within the breath group and the preceding part outside the breath group, and there is no predetermined prosodic unit following it.
When the phonetic symbol string of the registered phrase is one such as 「ミギホ’ーコーデス＋」, uttered with its preceding boundary at the end of a breath group or at an accent phrase and its following boundary at the end of a declarative sentence, the predetermined prosodic unit preceding the registered phrase is the preceding part outside the breath group, and there is no predetermined prosodic unit following it.

The predetermined prosodic unit is an accent phrase, a breath group, or the like, depending on the prosodic unit that contains the registered phrase and on the position of the registered phrase relative to that prosodic unit.
In step S116, standard pitch frequency information corresponding to the pitch frequency of the prosodic units preceding or following the registered phrase is generated as third prosodic information by referring to a standard prosody dictionary (here, the standard prosody dictionary 11), and the process proceeds to step S118.

In step S118, prosodic feature information corresponding to the registered phrase is generated from the standard pitch frequency information acquired in step S116 and the pitch frequencies of the phonological environments preceding or following the registered phrase, and the process proceeds to step S120.
Two quantities are generated as the prosodic feature information. The first is the ratio of the pitch frequency of the second prosodic information to the pitch frequency of the corresponding third prosodic information (hereinafter, the pitch ratio). The second is the ratio of the magnitude of intonation obtained from the pitch frequency of the second prosodic information to the magnitude of intonation obtained from the pitch frequency of the corresponding third prosodic information (hereinafter, the intonation ratio).

For example, as shown in FIG. 3, suppose the registered phrase is the phrase 「テ’ンキワ」 in an utterance such as 「今日の天気は、晴れです。」 ("Today's weather is sunny."). The pitch ratio is then the ratio of the average pitch frequency of 「キ’ョーノ」, the accent phrase forming the preceding part within the breath group, as extracted from the speech waveform data, to the average of the corresponding standard pitch frequency for 「キ’ョーノ」. As shown in FIG. 3, the pitch ratio for the predetermined prosodic unit 「キ’ョーノ」 is "1.2".

The intonation ratio, as indicated by the arrows in FIG. 3, is the ratio of the difference between the maximum and minimum pitch frequency (the magnitude of intonation) of 「キ’ョーノ」, the accent phrase forming the preceding part within the breath group, as extracted from the speech waveform data, to the difference between the maximum and minimum of the corresponding standard pitch frequency (its magnitude of intonation). As shown in FIG. 3, the intonation ratio for the predetermined prosodic unit 「キ’ョーノ」 is "1.5".

As also shown in FIG. 3, there is a following part outside the breath group, 「ハレ’デス」, so the pitch ratio and intonation ratio are calculated for it in the same way. As shown in FIG. 3, the pitch ratio and intonation ratio for the predetermined prosodic unit 「ハレ’デス」 are "1.0" and "1.1", respectively.
In other words, the prosodic feature information consists of the pitch ratio and intonation ratio for each position of a predetermined prosodic unit relative to the registered phrase.
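The two ratios can be sketched in Python as follows. This is a minimal illustration under stated assumptions: each contour is a plain list of pitch values in Hz for one prosodic unit, the function names are hypothetical, and the sample values are invented so that the results coincide with the ratios quoted for 「キ’ョーノ」. They are not data from FIG. 3.

```python
def pitch_ratio(observed_hz, standard_hz):
    """Ratio of the mean observed pitch to the mean standard pitch."""
    mean_obs = sum(observed_hz) / len(observed_hz)
    mean_std = sum(standard_hz) / len(standard_hz)
    return mean_obs / mean_std


def intonation_ratio(observed_hz, standard_hz):
    """Ratio of the observed pitch range (max - min) to the standard pitch range."""
    range_obs = max(observed_hz) - min(observed_hz)
    range_std = max(standard_hz) - min(standard_hz)
    if range_std == 0:
        raise ValueError("standard contour is flat; intonation ratio is undefined")
    return range_obs / range_std


# Hypothetical contours (Hz) for one prosodic unit.
standard = [100.0, 120.0, 110.0]   # from the standard prosody dictionary
observed = [117.0, 147.0, 132.0]   # extracted from the recorded speech

print(pitch_ratio(observed, standard))       # 1.2
print(intonation_ratio(observed, standard))  # 1.5
```

How a real implementation would handle unvoiced frames or a flat standard contour is not stated in the description; the sketch simply rejects the flat case.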

In step S120, the prosodic segment pitch frequency information generated in step S106 and the prosodic feature information generated in step S118 are associated with each other and stored in the storage device (not shown), and the process proceeds to step S122.
In step S122, it is determined whether all the collected speech waveform data has been processed. If it has (Yes), the process proceeds to step S124; otherwise (No), the process returns to step S100.
In step S124, the prosodic segment dictionary is created from the prosodic segment pitch frequency information and prosodic feature information stored in the storage device, and the process ends.

In the present embodiment, when the prosodic segment dictionary is created, various items of information are generated in addition to the prosodic segment pitch frequency information and prosodic feature information. These include the prosodic segment start pitch (Hz), which is the pitch frequency at the start position of the phrase corresponding to the prosodic segment pitch frequency information; the prosodic segment end pitch (Hz), which is the pitch frequency at its end position; and the prosodic segment pitch information address, which indicates where the prosodic segment pitch frequency data (waveform data) is stored. In addition, based on the speech/prosodic symbol string corresponding to the speech waveform data, the speech/prosodic symbol string of the registered phrase and its preceding and following boundary positions are generated.

The data table shown in FIG. 4(a) is then generated from the prosodic feature information, the speech/prosodic symbol string, the preceding boundary position, the following boundary position, the prosodic segment start pitch (Hz), the prosodic segment end pitch (Hz), and the prosodic segment pitch frequency information address.
In other words, the prosodic segment dictionary consists of this data table and the prosodic segment pitch frequency data (waveform data). The prosodic segment pitch frequency data (waveform data) is stored in the storage area of the storage device indicated by the prosodic segment pitch information address registered in the data table of FIG. 4(a). The prosodic segment dictionary of the present embodiment can therefore be searched with the speech/prosodic symbol string as an index to retrieve the items of information described above.
In the present embodiment, the prosodic segment dictionary 12 of the speech synthesizer 100 is created by the prosodic segment dictionary creation method described above.
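One way to picture the data table indexed by the symbol string is the following Python sketch. The class names, field names, boundary strings, and pitch and address values are all hypothetical; only the set of stored items and the ratios 1.2/1.5 and 1.0/1.1 for 「テ’ンキワ」 are taken from the description. The actual layout in FIG. 4(a) may differ.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class ProsodicSegmentEntry:
    symbol_string: str        # speech/prosodic symbol string of the registered phrase
    preceding_boundary: str   # preceding boundary position
    following_boundary: str   # following boundary position
    start_pitch_hz: float     # prosodic segment start pitch
    end_pitch_hz: float       # prosodic segment end pitch
    pitch_data_address: int   # where the pitch frequency data is stored
    # Prosodic feature information:
    # unit position -> (pitch ratio, intonation ratio)
    features: dict = field(default_factory=dict)


class ProsodicSegmentDictionary:
    """Entries grouped by symbol string; one phrase may have several entries."""

    def __init__(self):
        self._table = defaultdict(list)

    def register(self, entry):
        self._table[entry.symbol_string].append(entry)

    def lookup(self, symbol_string):
        # .get avoids creating an empty list for unknown phrases
        return list(self._table.get(symbol_string, []))


dictionary = ProsodicSegmentDictionary()
dictionary.register(ProsodicSegmentEntry(
    symbol_string="テ’ンキワ",
    preceding_boundary="accent phrase",          # hypothetical value
    following_boundary="end of breath group",    # hypothetical value
    start_pitch_hz=130.0,                        # hypothetical value
    end_pitch_hz=105.0,                          # hypothetical value
    pitch_data_address=0x1000,                   # hypothetical value
    features={
        "preceding, within breath group": (1.2, 1.5),
        "following, outside breath group": (1.0, 1.1),
    },
))

print(len(dictionary.lookup("テ’ンキワ")))  # 1
```

Keeping a list per symbol string reflects the fact that the same phrase can be registered several times, which is what makes a later selection step meaningful.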

Next, the flow of the operation of the speech synthesizer 100 will be described with reference to FIG. 5. FIG. 5 is a flowchart showing the operation of the speech synthesizer 100. In FIG. 5, "standard PF" denotes the standard pitch frequency and "prosodic segment PF" denotes the prosodic segment pitch frequency.
As shown in FIG. 5, the process first proceeds to step S200, where the text analysis unit 10 determines whether sentence text has been input via an external device or the like (not shown) or an input device (such as a keyboard). If it has (Yes), the process proceeds to step S202; otherwise (No), the determination is repeated until text is input.

In step S202, the text analysis unit 10 performs accent analysis and morphological analysis on the sentence text input in step S200, generates a pronunciation/prosodic symbol string from the analysis result, outputs the generated string to the prosody generation unit 13, and the process proceeds to step S204.
In step S204, the standard prosody generation unit 13a acquires the corresponding standard pitch frequency information from the standard prosody dictionary 11 based on the pronunciation/prosodic symbol string input from the text analysis unit 10, generates a standard prosodic information sequence for the entire sentence text from the acquired information, and the process proceeds to step S206.

In step S206, the registered phrase matching unit 13b determines whether the prosodic segment dictionary 12 contains a phrase (symbol string) that matches a phrase (symbol string) included in the pronunciation/prosodic symbol string input from the text analysis unit 10, and the process proceeds to step S208.
In step S208, if the registered phrase matching unit 13b finds, based on the determination in step S206, that a matching phrase exists (Yes), it notifies the prosodic segment selection unit 13c of the match and the process proceeds to step S210; otherwise (No), the process proceeds to step S224.
Although step S204 and steps S206 and S208 are performed sequentially here, this is not a limitation, and these processes may be performed in parallel.
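
The matching in steps S206 and S208 can be illustrated with a minimal Python sketch. It assumes that the pronunciation/prosodic symbol string and the registered phrases are plain strings and that matching is exact substring search; the function name `find_registered_phrases` and the romanized example strings are hypothetical and are not taken from the specification.

```python
def find_registered_phrases(symbol_string, registered_phrases):
    """Return sorted (start, end, phrase) tuples for each registered phrase found.

    Assumption: phrases are matched as exact substrings of the symbol string.
    """
    matches = []
    for phrase in registered_phrases:
        start = symbol_string.find(phrase)
        while start != -1:
            matches.append((start, start + len(phrase), phrase))
            start = symbol_string.find(phrase, start + 1)
    return sorted(matches)


# Hypothetical romanized symbol string containing one registered phrase.
symbols = "kakubetsunoohi+kitateniazuka'ri, atsu+kuonrēmōshiagema'su."
found = find_registered_phrases(symbols, ["mōshiagema'su"])
# An empty result corresponds to the "No" branch of step S208.
has_match = bool(found)
```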

When the process proceeds to step S210, the prosodic segment selection unit 13c acquires, from the prosodic segment dictionary 12, the prosodic segment pitch frequency information and prosodic feature information that correspond to the phrase matching the registered phrase and that connect best with the preceding and following standard pitch frequencies. It outputs the acquired information to the prosody replacement shaping unit 13d, and the process proceeds to step S212. Connectivity is judged from the prosodic segment start pitch (Hz) and prosodic segment end pitch (Hz) held in the prosodic segment dictionary 12 and the standard pitch frequencies immediately before and after the matching phrase: the candidate with the lowest connection cost (the smallest frequency difference) is judged to connect best.
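
The selection in step S210 can be sketched as follows in Python. The text only states that the candidate with the smallest frequency difference is chosen; summing the absolute differences at the start and end of the phrase is an assumption made for this sketch, and the names `Candidate`, `connection_cost` and `select_candidate` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Candidate:
    label: str
    start_pitch_hz: float  # prosodic segment start pitch
    end_pitch_hz: float    # prosodic segment end pitch


def connection_cost(c: Candidate, prev_hz: Optional[float], next_hz: Optional[float]) -> float:
    """Assumed cost: sum of absolute pitch differences at both junctions.

    A junction is skipped when there is no neighbouring standard pitch
    (for example, when the phrase ends the sentence).
    """
    cost = 0.0
    if prev_hz is not None:
        cost += abs(c.start_pitch_hz - prev_hz)
    if next_hz is not None:
        cost += abs(c.end_pitch_hz - next_hz)
    return cost


def select_candidate(cands: List[Candidate], prev_hz: Optional[float], next_hz: Optional[float]) -> Candidate:
    """Pick the candidate with the lowest connection cost."""
    return min(cands, key=lambda c: connection_cost(c, prev_hz, next_hz))
```

Other cost definitions (for example, weighting the start junction more heavily) would fit the same interface.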

In step S212, the prosody replacement shaping unit 13d adjusts, based on the prosodic feature information input from the prosodic segment selection unit 13c, the portions of the standard prosodic information (standard pitch frequency) sequence that correspond to the predetermined prosodic units preceding and following the phrase matching the registered phrase, and the process proceeds to step S214. In this embodiment the prosodic feature information consists of the pitch ratio and intonation ratio described above, so each relevant standard pitch frequency portion is adjusted such that the pitch ratio and intonation ratio between the standard pitch frequency and the adjusted pitch frequency equal the pitch ratio and intonation ratio in the prosodic feature information.
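
A minimal Python sketch of the adjustment in step S212 follows. The exact definitions of the pitch ratio and intonation ratio are given elsewhere in the specification and are not reproduced here, so the sketch assumes that the pitch ratio scales the mean pitch of the prosodic unit and that the intonation ratio scales the excursion of each frame around that mean. The function name `adjust_contour` is hypothetical.

```python
def adjust_contour(standard_hz, pitch_ratio, intonation_ratio):
    """Adjust a standard pitch contour for one prosodic unit.

    Assumptions (not confirmed by the source text):
      - pitch_ratio      = adjusted mean pitch / standard mean pitch
      - intonation_ratio = adjusted excursion around the mean / standard excursion
    """
    mean = sum(standard_hz) / len(standard_hz)
    target_mean = mean * pitch_ratio
    return [target_mean + (f - mean) * intonation_ratio for f in standard_hz]


# Example with the ratios quoted for the preceding breath group (1.1 and 1.5);
# the contour values themselves are made up for illustration.
adjusted = adjust_contour([100.0, 120.0, 140.0], pitch_ratio=1.1, intonation_ratio=1.5)
```

Under these assumptions the adjusted contour has a mean 1.1 times the standard mean and a pitch range 1.5 times the standard range.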

In step S214, the prosody replacement shaping unit 13d replaces the phrase portion of the standard prosodic information sequence that matches the registered phrase with the prosodic segment pitch frequency information input from the prosodic segment selection unit 13c, and the process proceeds to step S216.
In step S216, the prosody replacement shaping unit 13d applies shaping so that the prosodic segment pitch frequency information input from the prosodic segment selection unit 13c connects smoothly with the standard pitch frequencies adjusted in step S212, generates the final prosodic information sequence, and the process proceeds to step S218.
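
The shaping in step S216 is not specified in detail at this point. As one possible illustration, the Python sketch below joins two pitch contours and applies a three-point moving average only to the frames near the junction. The choice of a moving average, the window size, and the name `smooth_junction` are assumptions for this sketch.

```python
def smooth_junction(left_hz, right_hz, half_width=2):
    """Join two contours and smooth only the frames around the junction.

    Assumption: a 3-point moving average stands in for the unspecified
    shaping process. Frames outside the window are left unchanged.
    """
    joined = list(left_hz) + list(right_hz)
    boundary = len(left_hz)
    lo = max(1, boundary - half_width)
    hi = min(len(joined) - 1, boundary + half_width)
    smoothed = joined[:]
    for i in range(lo, hi):
        # Average over the unsmoothed neighbours so the result is order-independent.
        smoothed[i] = (joined[i - 1] + joined[i] + joined[i + 1]) / 3.0
    return smoothed


# A 30 Hz step between the adjusted standard contour and the phrase contour
# is turned into a gradual transition.
result = smooth_junction([100.0] * 4, [130.0] * 4)
```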

In step S218, the prosody replacement shaping unit 13d outputs the prosodic information sequence generated in step S216 to the waveform synthesis unit 14, and the process proceeds to step S220.
In step S220, the waveform synthesis unit 14 generates synthesized speech waveform data corresponding to the sentence text input in step S200, based on the prosodic information sequence input from the prosody replacement shaping unit 13d and using the spectrum information and excitation source information held in the standard waveform segment dictionary 15. The process then proceeds to step S222.

In step S222, the waveform synthesis unit 14 outputs synthesized speech for the sentence text input in step S200 from an output device such as a speaker (not shown), based on the synthesized speech waveform data generated in step S220, and the process ends.
If, on the other hand, no matching phrase is found in step S208 and the process proceeds to step S224, the standard prosodic information sequence generated in step S204 is output to the waveform synthesis unit 14 as the final prosodic information sequence, and the process proceeds to step S220.
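
The branch between the replacement path (step S214) and the fallback path (step S224) can be sketched in Python as follows. The sketch assumes the prosodic information sequence is a list of per-frame pitch values and that the matched phrase is identified by a frame index range; the name `final_contour` is hypothetical, and the adjustment and smoothing of steps S212 and S216 are deliberately left out.

```python
def final_contour(standard_hz, match_span, segment_hz):
    """Return the prosodic information sequence passed on to waveform synthesis.

    match_span is None when no registered phrase matched (step S224), otherwise
    a (start, end) frame range to be replaced by the dictionary contour (step S214).
    """
    if match_span is None:
        # No registered phrase: the standard sequence is used as is.
        return list(standard_hz)
    start, end = match_span
    # The replacement contour may have a different length than the span it replaces.
    return list(standard_hz[:start]) + list(segment_hz) + list(standard_hz[end:])
```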

Next, the operation of this embodiment is described with reference to FIGS. 6 to 8. FIG. 6(a) shows an example of the pitch frequency sequence in the speech waveform data of an actually uttered sentence containing a fixed phrase, and FIG. 6(b) shows an example of a pitch frequency sequence generated for the sentence of FIG. 6(a) using only standard pitch frequencies. FIG. 7 shows an example of a standard prosodic information sequence for a sentence text containing a fixed phrase. FIG. 8(a) shows an example of a prosodic information sequence in which the prosodic segment pitch frequency is used for the fixed phrase in the sentence text and adjustment based on the prosodic feature information has been applied, and FIG. 8(b) shows an example of a prosodic information sequence generated for the same sentence text as in (a) by the conventional technique.

The following describes the case in which a specific speaker A utters a sentence whose pronunciation/prosodic symbol string is "Kono'tabi, itensurukoto'ninarima'shitanode, oshirasemōshiagema'su." (コノ’タビ、イテンスルコト’ニナリマ’シタノデ、オシラセモーシアゲマ’ス。), the registered phrase is "mōshiagema'su" (モーシアゲマ’ス), and the prosodic segment dictionary 12 holds the prosodic segment pitch frequency information and prosodic feature information generated from the speech waveform data of this utterance.

As can be seen from the prosodic information sequence generated from the speech waveform data of the actual utterance shown in FIG. 6(a), when speaker A utters the registered phrase "mōshiagema'su", the intonation of the breath group preceding the phrase ("Kono'tabi, itensurukoto'ninarima'shitanode" in FIG. 6(a)) tends to be somewhat stronger, whereas the intonation of the phrase immediately preceding it within the same breath group ("oshirase" in FIG. 6(a)) tends to be suppressed.

By contrast, when the prosodic information sequence for the same sentence as in FIG. 6(a) is generated using only the standard pitch frequency information in the standard prosody dictionary 11, the prosodic information sequence portion of each phrase is generated independently of the registered phrase. As shown in FIG. 6(b), the intonation is therefore rather monotonous compared with FIG. 6(a). The waveform drawn with a broken line in FIG. 6(b) is the standard pitch.

Here, prosodic feature information is generated for the breath group preceding the registered phrase and for the phrase immediately preceding the registered phrase within the same breath group. That is, as shown in FIG. 6(a), prosodic feature information is generated for "Kono'tabi, itensurukoto'ninarima'shitanode" (pitch ratio 1.1, intonation ratio 1.5) and for "oshirase" (pitch ratio 0.85, intonation ratio 0.6).
The prosodic segment dictionary 12 therefore holds the prosodic segment pitch frequency information of "mōshiagema'su", the prosodic feature information for the breath group "Kono'tabi, itensurukoto'ninarima'shitanode" preceding this phrase, and the prosodic feature information for "oshirase", which precedes the phrase within the same breath group.
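
One way to picture the resulting dictionary entry is the Python sketch below. The field and class names are hypothetical, and the pitch contour values are placeholders invented for illustration; only the four ratios (1.1, 1.5, 0.85, 0.6) come from the example above.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ProsodicFeature:
    pitch_ratio: float
    intonation_ratio: float


@dataclass
class ProsodicSegmentEntry:
    phrase: str                              # registered phrase (symbol string)
    pitch_contour_hz: List[float]            # prosodic segment pitch frequency information
    preceding_breath_group: ProsodicFeature  # feature info for the preceding breath group
    preceding_phrase: ProsodicFeature        # feature info for the preceding phrase in the same breath group

    @property
    def start_pitch_hz(self) -> float:
        return self.pitch_contour_hz[0]

    @property
    def end_pitch_hz(self) -> float:
        return self.pitch_contour_hz[-1]


entry = ProsodicSegmentEntry(
    phrase="mōshiagema'su",
    pitch_contour_hz=[210.0, 230.0, 190.0, 150.0],  # placeholder values, not from the source
    preceding_breath_group=ProsodicFeature(pitch_ratio=1.1, intonation_ratio=1.5),
    preceding_phrase=ProsodicFeature(pitch_ratio=0.85, intonation_ratio=0.6),
)
```

Exposing the start and end pitch as properties keeps the entry usable for the connectivity check performed when several entries exist for the same phrase.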

The actual operation of the speech synthesizer 100 is described below for the case in which the prosodic segment dictionary 12 holds the above prosodic segment pitch frequency information and prosodic feature information for "mōshiagema'su".
First, the sentence text 「格別のお引き立てにあずかり、厚く御礼申しあげます。」 ("We are deeply grateful for your exceptional patronage.") is input to the text analysis unit 10 of the speech synthesizer 100 (step S200).
The text analysis unit 10 performs accent analysis and morphological analysis on the input sentence text with reference to a word dictionary, and generates a pronunciation/prosodic symbol string for the sentence text based on the analysis result (step S202). In this case the symbol string is "Kakubetsunoohi+kitateniazuka'ri, atsu+kuonrēmōshiagema'su." (カクベツノオヒ＋キタテニアズカ’リ、アツ＋クオンレーモーシアゲマ’ス。). This symbol string is input to both the standard prosody generation unit 13a and the registered phrase matching unit 13b of the prosody generation unit 13.

The standard prosody generation unit 13a generates a speech unit label sequence from the input pronunciation/prosodic symbol string and acquires, from the standard prosody dictionary 11, the standard pitch frequency information corresponding to the label information of that sequence. From the acquired standard pitch frequency information it generates the standard prosodic information sequence corresponding to the symbol string (step S204). As shown in FIG. 7, the standard prosodic information sequence (standard pitch frequency sequence) generated in this way has rather monotonous intonation overall. The generated standard prosodic information sequence is input to the prosody replacement shaping unit 13d. The waveform drawn with a broken line in FIG. 7 is the standard pitch.

Meanwhile, the registered phrase matching unit 13b determines whether the input pronunciation/prosodic symbol string contains a registered phrase (step S206). Because the symbol string contains the registered phrase "mōshiagema'su" (the "Yes" branch of step S208), the unit notifies the prosodic segment selection unit 13c accordingly.

On receiving the notification from the registered phrase matching unit 13b, the prosodic segment selection unit 13c checks the prosodic segment dictionary 12. If the dictionary holds multiple sets of prosodic segment pitch frequency information and prosodic feature information for the registered phrase "mōshiagema'su", the unit selects the set that connects best with the standard pitch frequencies in the standard prosodic information sequence generated by the standard prosody generation unit 13a. As described above, the selection compares the prosodic segment start pitch (Hz) and prosodic segment end pitch (Hz) of each "mōshiagema'su" candidate in the prosodic segment dictionary 12 with the standard pitch frequencies immediately before and after the phrase "mōshiagema'su" in the standard prosodic information sequence, and the candidate with the smallest frequency difference is judged to connect best and is selected. Assume here that the selected set is the one described as the premise above, namely the "mōshiagema'su" entry whose prosodic feature information is pitch ratio 1.1 and intonation ratio 1.5 for the preceding breath group, and pitch ratio 0.85 and intonation ratio 0.6 for the preceding phrase within the same breath group. The prosodic segment selection unit 13c then acquires the selected prosodic segment pitch frequency information and prosodic feature information from the prosodic segment dictionary 12 and outputs them to the prosody replacement shaping unit 13d (step S210).

When the prosodic segment pitch frequency information and prosodic feature information for "mōshiagema'su" are input from the prosodic segment selection unit 13c, the prosody replacement shaping unit 13d adjusts the standard pitch frequencies corresponding to "Kakubetsunoohi+kitateniazuka'ri", the breath group preceding "mōshiagema'su" in the standard prosodic information sequence, using the corresponding prosodic feature information, namely pitch ratio 1.1 and intonation ratio 1.5 (step S212).

As in the example of FIG. 6(a) described above, when speaker A utters the selected phrase "mōshiagema'su", the intonation of the breath group preceding the phrase (here "Kakubetsunoohi+kitateniazuka'ri,") tends to be somewhat stronger, while the intonation of the immediately preceding phrase within the same breath group (here "atsu+kuonrē") tends to be suppressed. Consequently, as shown in FIG. 8(a), simply joining the prosodic information sequence (prosodic segment pitch frequency sequence) of "mōshiagema'su" generated from actual utterance data to the standard prosodic information sequence (standard pitch frequency sequence) of "Kakubetsunoohi+kitateniazuka'ri, atsu+kuonrē" yields unnatural intonation overall. The reason is that the standard prosodic information sequence of each preceding phrase is, as before, generated independently of the registered phrase, so a standard sequence whose intonation differs from speaker A's intonation for the preceding phrases is joined to a sequence taken from actual speech. The standard prosodic information sequence is therefore adjusted based on the prosodic feature information so that the unnatural intonation becomes natural. The waveform drawn with a broken line in FIG. 8(a) is the standard pitch.

Specifically, each standard pitch frequency in the sequence corresponding to "Kakubetsunoohi+kitateniazuka'ri" is adjusted so that the pitch ratio between the standard pitch frequency sequence and the adjusted pitch frequency sequence is 1.1 and the intonation ratio is 1.5. Likewise, each standard pitch frequency in the sequence corresponding to "atsu+kuonrē", the phrase preceding "mōshiagema'su" within the same breath group in the standard prosodic information sequence, is adjusted so that the pitch ratio between the standard pitch frequency sequence and the adjusted pitch frequency sequence is 0.8 and the intonation ratio is 0.6.

Further, the prosody replacement shaping unit 13d replaces the standard pitch frequency sequence corresponding to "mōshiagema'su" in the standard prosodic information sequence with the prosodic segment pitch frequency information input from the prosodic segment selection unit 13c (step S214). It then applies shaping such as smoothing so that the substituted prosodic segment pitch frequency information (prosodic segment pitch frequency sequence) connects smoothly with the adjusted standard prosodic information sequence, and generates the final prosodic information sequence for "Kakubetsunoohi+kitateniazuka'ri, atsu+kuonrēmōshiagema'su." as shown in FIG. 8(b) (step S216). The generated prosodic information sequence is output to the waveform synthesis unit 14 (step S218). The waveform drawn with a broken line in FIG. 8(b) is the adjusted standard pitch.

The waveform synthesis unit 14 acquires, from the standard waveform segment dictionary 15, the spectrum information and excitation source information corresponding to the prosodic information sequence input from the prosody generation unit 13, generates a spectrum information sequence from the acquired spectrum information, and generates an excitation source information sequence from the acquired excitation source information. Using the generated spectrum information sequence as the parameters of a synthesis filter, it filters the excitation source signal generated from the excitation source information sequence to produce synthesized speech waveform data corresponding to the sentence text (step S220). Synthesized speech is then output from a speaker or the like (not shown) based on the generated synthesized speech waveform data (step S222).
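
The source-filter structure described here (an excitation source signal passed through a synthesis filter whose parameters come from the spectrum information) can be illustrated with the Python sketch below. The specification does not state the filter type at this point, so the sketch assumes a simple all-pole filter and a pulse-train excitation for voiced speech; the function names, the sample rate and the coefficients are hypothetical.

```python
def pulse_train(pitch_hz, n_samples, sample_rate=8000):
    """Voiced excitation: one unit pulse per pitch period (assumed excitation model)."""
    period = max(1, round(sample_rate / pitch_hz))
    return [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]


def all_pole_filter(excitation, coeffs):
    """Assumed synthesis filter: y[n] = x[n] - sum_k coeffs[k-1] * y[n-k]."""
    out = []
    for n, x in enumerate(excitation):
        y = x
        for k, a in enumerate(coeffs, start=1):
            if n - k >= 0:
                y -= a * out[n - k]
        out.append(y)
    return out


# 25 ms of a 100 Hz voiced sound through a one-pole filter (coefficient chosen arbitrarily).
waveform = all_pole_filter(pulse_train(100.0, 200), [-0.5])
```

In an actual synthesizer the filter coefficients would change frame by frame following the spectrum information sequence, and the pulse spacing would follow the final prosodic information sequence.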

As described above, the speech synthesizer 100 of this embodiment uses the prosodic segment dictionary 12, which holds the pitch frequency information (prosodic segment pitch frequency information) of a registered phrase (fixed phrase) portion extracted from the speech waveform data of a sentence uttered by a given speaker, together with prosodic feature information indicating the prosodic characteristics of predetermined prosodic units. The prosodic feature information is generated from the pitch frequencies of the predetermined prosodic units preceding and following the fixed phrase in the speech waveform data and from the corresponding standard pitch frequencies. Using this dictionary, the fixed phrase portion of a standard prosodic information sequence, which is generated from standard pitch frequency information (pitch frequency information for each speech unit label), is replaced with the prosodic segment pitch frequency information held in the prosodic segment dictionary 12.
Furthermore, the standard pitch frequency sequence portions of the predetermined prosodic units preceding and following the fixed phrase in the standard prosodic information sequence are adjusted based on the prosodic feature information associated with that prosodic segment pitch frequency information in the prosodic segment dictionary 12, and the final prosodic information sequence is generated. This makes it possible to generate synthesized speech waveform data in which the sentence text is spoken (reproduced) with intonation closer to natural speech.

In the first embodiment described above, the text analysis unit 10 corresponds to the text analysis means of claim 6; the prosodic segment dictionary 12 corresponds to the prosodic segment dictionary of any one of claims 1 to 5; and the prosody generation unit 13 corresponds to the prosodic information sequence generation means of claim 6. The processing by the registered phrase matching unit 13b, the prosodic segment selection unit 13c and the prosody replacement shaping unit 13d that replaces the standard prosodic information sequence portion corresponding to the registered phrase with a prosodic information sequence portion composed of the prosodic segment pitch frequency information of the prosodic segment dictionary 12 corresponds to the changing means of claim 6 or 9. The processing by the prosody replacement shaping unit 13d that adjusts, based on the prosodic feature information, the prosodic information sequence portions of the predetermined prosodic units preceding and following the replaced portion corresponds to the prosodic information adjustment means of claim 6. The waveform synthesis unit 14 corresponds to the speech waveform generation means of claim 6, and the standard waveform segment dictionary 15 corresponds to the segment dictionary of any one of claims 6, 10 and 13.

Also in the first embodiment, steps S100 to S104 correspond to the first prosodic information extraction step of claim 1 or 5; steps S108 to S114 correspond to the second prosodic information extraction step; step S116 corresponds to the third prosodic information generation step of claim 1 or 5; step S118 corresponds to the prosodic feature information generation step of claim 1 or 5; and step S124 corresponds to the prosodic segment dictionary creation step of claim 1 or 5.

Also in the first embodiment, steps S200 to S202 correspond to the text analysis step of claim 10 or 13; steps S204 and S216 correspond to the prosodic information sequence generation step of claim 10 or 13; steps S208, S210 and S214 correspond to the changing step of claim 10 or 13; step S212 corresponds to the prosodic information adjustment step of claim 10 or 12; and step S220 corresponds to the speech waveform generation step of claim 10 or 13.

[Second Embodiment]
Next, a second embodiment of the speech synthesizer, speech synthesis program, and speech synthesis method according to the present invention is described with reference to the drawings. FIGS. 9 to 13 show the second embodiment of the speech synthesizer, speech synthesis program, and speech synthesis method according to the present invention.
First, the configuration of the speech synthesizer according to the second embodiment is described with reference to FIGS. 9 and 10. FIG. 9 is a block diagram showing the configuration of the speech synthesizer 200 according to the second embodiment of the present invention. FIG. 10 is a block diagram showing the detailed configuration of the waveform synthesis unit 23 of the speech synthesizer 200.

As shown in FIG. 9, the speech synthesizer 200 comprises: a text analysis unit 20 that analyzes the sentence text to be synthesized and generates a pronunciation/prosodic symbol string; a standard prosody dictionary 21 composed of pitch frequency information for each speech unit; a standard prosody generation unit 22 that generates a standard prosodic information sequence from the symbol string using the standard prosody dictionary 21; a waveform synthesis unit 23 that generates synthesized speech waveform data from the standard prosodic information sequence using a standard waveform segment dictionary 24 and a large waveform segment dictionary 25, both described later; the standard waveform segment dictionary 24, composed of spectrum information and excitation source information for each speech unit; and the large waveform segment dictionary 25, composed of spectrum information and excitation source information for each phrase unit.

The text analysis unit 20 performs accent analysis and morphological analysis on the sentence text to be synthesized using a word dictionary or the like (not shown), determines the reading, accent and intonation of the input sentence text (for Japanese, for example, mixed kana-kanji text), and generates a pronunciation/prosodic symbol string, which is reading information with prosodic symbols (an intermediate language). It then outputs the generated symbol string to the standard prosody generation unit 22.

The standard prosody dictionary 21 holds standard pitch frequency information, which is pitch frequency information associated with the label information of each speech unit. When a speech unit label sequence corresponding to a pronunciation/prosodic symbol string is input from the standard prosody generation unit 22, the dictionary outputs the corresponding standard pitch frequency information to the standard prosody generation unit 22.
The standard prosody generation unit 22 generates a speech unit label sequence from the pronunciation/prosodic symbol string input from the text analysis unit 20, outputs the label sequence to the standard prosody dictionary 21 to acquire the corresponding standard pitch frequency information, and generates a standard prosodic information sequence that includes the standard pitch frequency sequence corresponding to the label sequence.

As shown in FIG. 10, the waveform synthesis unit 23 includes a segment selection unit 23a, a segment connection unit 23b, and a synthesis unit 23c.
The segment selection unit 23a selects, from the standard waveform segment dictionary 24 and the large waveform segment dictionary 25, the spectrum information and excitation source information for each speech segment corresponding to the standard prosodic information sequence, based on the standard prosodic information sequence input from the standard prosody generation unit 22 and the phoneme environment information contained in the large waveform segment dictionary 25. Information on the selection result is output to the segment connection unit 23b.

Specifically, the segment selection unit 23a determines whether the pronunciation/prosodic symbol string of the standard prosodic information sequence contains a phrase that matches a registered phrase in the large waveform segment dictionary 25. When a matching phrase is found, the unit checks the first and last moras of that phrase: it determines whether the phoneme environment that precedes the first mora and follows the last mora in the speech waveform data from which the phrase's spectrum information and excitation source information were extracted (hereinafter, the original sound data) matches the phoneme environment that precedes the first mora and follows the last mora of the matching phrase in the standard prosodic information sequence. When the phoneme environments match, the spectrum information and excitation source information for the first or last mora are selected from the large waveform segment dictionary 25; when they do not match, they are selected from the standard waveform segment dictionary 24. For everything outside the matching phrase, spectrum information and excitation source information are selected from the standard waveform segment dictionary 24 for every mora. Consequently, when the pronunciation/prosodic symbol string contains no matching phrase, spectrum information and excitation source information are selected from the standard waveform segment dictionary 24 for the entire standard prosodic information sequence.
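The selection rule described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: the function names, the representation of moras as plain strings, the comparison of a single neighbouring mora as the "phoneme environment", and the "pau" padding at the sentence edges are all hypothetical choices made for the example.

```python
from typing import List, Optional


def find_phrase(moras: List[str], phrase: List[str]) -> Optional[int]:
    """Return the start index of the first occurrence of phrase in moras, or None."""
    if not phrase:
        return None
    n = len(phrase)
    for i in range(len(moras) - n + 1):
        if moras[i:i + n] == phrase:
            return i
    return None


def select_dictionaries(moras: List[str], phrase: List[str],
                        orig_prev: str, orig_next: str) -> List[str]:
    """Return "large" or "standard" for every mora of the sentence.

    orig_prev / orig_next are the moras that surrounded the registered
    phrase in the original recording (assumed representation).
    """
    choice = ["standard"] * len(moras)      # default: standard dictionary
    start = find_phrase(moras, phrase)
    if start is None:
        return choice                       # no registered phrase in the sentence
    end = start + len(phrase) - 1
    for i in range(start, end + 1):
        choice[i] = "large"                 # phrase moras come from the large dictionary
    padded = ["pau"] + moras + ["pau"]      # assumed: a pause surrounds the sentence
    if padded[start] != orig_prev:          # environment before the first mora differs
        choice[start] = "standard"
    if padded[end + 2] != orig_next:        # environment after the last mora differs
        choice[end] = "standard"
    return choice
```

Only the two edge moras of a matched phrase can fall back to the standard dictionary; interior moras are never affected by the environment check.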

Based on the selection result from the segment selection unit 23a, the segment connection unit 23b acquires the corresponding spectrum information and excitation source information for each speech unit from the standard waveform segment dictionary 24 and the large waveform segment dictionary 25. It then concatenates the acquired per-unit spectrum information and excitation source information to generate a spectrum information sequence and an excitation source information sequence, which are output to the synthesis unit 23c.

The synthesis unit 23c includes a synthesis filter. It uses the spectrum information sequence input from the segment connection unit 23b as the parameters of the synthesis filter and filters the excitation source signal generated from the excitation source information sequence, thereby generating synthesized speech waveform data corresponding to the sentence text. Synthesized speech is then output based on the generated synthesized speech waveform data.

The standard waveform segment dictionary 24 holds spectrum information and excitation source information for each speech unit. In the present embodiment, spectral envelope parameters extracted from speech frame by frame are used as the spectrum information, and the degree of voicing of the excitation source signal is used as the excitation source information. Here, the speech unit is the mora.

The large waveform segment dictionary 25 holds spectrum information and excitation source information for each phrase (large segment). In the present embodiment, spectral envelope parameters extracted from speech frame by frame are used as the spectrum information, and the degree of voicing of the excitation source signal is used as the excitation source information. The phrase-level spectrum information and excitation source information carry time information for the spectrum information of each speech unit making up the phrase and duration information for the excitation source information of each speech unit (mora) making up the phrase. It is therefore possible to extract spectrum information and excitation source information for each individual speech unit from the phrase-level data.
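As a rough sketch of how per-mora data could be recovered from phrase-level data, the following Python function cuts a frame sequence into consecutive slices using per-mora frame counts. The function name and the assumption that durations are stored as integer frame counts in phrase order are illustrative only.

```python
from itertools import accumulate
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def split_by_duration(frames: Sequence[T], durations: Sequence[int]) -> List[Sequence[T]]:
    """Cut phrase-level frames into one slice per mora.

    durations[i] is the number of frames belonging to the i-th mora.
    """
    if sum(durations) != len(frames):
        raise ValueError("durations must add up to the number of frames")
    bounds = [0, *accumulate(durations)]    # cumulative frame boundaries
    return [frames[a:b] for a, b in zip(bounds, bounds[1:])]
```

The same boundaries would apply to both the spectrum frames and the excitation frames of a phrase, so one call per information type is enough.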

The large waveform segment dictionary 25 also holds mora information describing the predetermined phoneme environment that precedes the first mora and follows the last mora of each phrase in the original sound data. The segment selection unit 23a uses this information to determine whether the phoneme environment in the original sound data matches the phoneme environment in the standard prosodic information sequence.

The detailed structure of the large waveform segment dictionary 25 is described below with reference to FIG. 11, which shows a configuration example of the large waveform segment dictionary.

As shown in FIG. 11, the large waveform segment dictionary 25 holds, for example, the spectrum information and excitation source information of the phrase portion "右方向です。" (migi hōkō desu, "It is to the right."), a registered (fixed-form) phrase, extracted from original sound data of the sentence "次の交差点を、右方向です。" (tsugi no kōsaten o, migi hōkō desu, "At the next intersection, it is to the right."), together with information on the predetermined phoneme environment preceding and following the phrase. Because this phrase follows a comma and also ends the sentence, the information on the moras preceding and following the phrase in the original sound data is "pau" (pause), and this becomes the phoneme environment information. In FIG. 11, n and m are the subband orders of the spectrum information and the excitation source information, respectively. As FIG. 11 also shows, the large waveform segment dictionary 25 further holds mora sequence information, which describes the speech units (moras) making up the registered phrase, and the duration information (number of frames) for each speech unit (mora) mentioned above. In short, for each of a plurality of registered phrases, the large waveform segment dictionary 25 holds spectrum information, excitation source information, mora sequence information that includes the preceding and following moras, and duration information for each speech unit (mora).
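One way to picture a single entry of such a dictionary is the Python dataclass below. It is only a sketch: the class name, field names, and the nested-list layout (one row per frame, with n or m values per row) are assumptions for illustration and are not taken from FIG. 11.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LargeSegmentEntry:
    """Hypothetical layout of one registered phrase in the large dictionary."""
    moras: List[str]               # moras of the registered phrase itself
    durations: List[int]           # frames per mora, same order as moras
    preceding: str                 # mora before the phrase in the original sound data
    following: str                 # mora after the phrase in the original sound data
    spectrum: List[List[float]]    # one row per frame, n subband values each
    excitation: List[List[float]]  # one row per frame, m subband values each

    def __post_init__(self) -> None:
        if len(self.moras) != len(self.durations):
            raise ValueError("one duration is required per mora")
        total = sum(self.durations)
        if len(self.spectrum) != total or len(self.excitation) != total:
            raise ValueError("frame count must equal the sum of durations")

    def mora_sequence(self) -> List[str]:
        """Mora sequence including the preceding and following context."""
        return [self.preceding, *self.moras, self.following]
```

For the comma-delimited, sentence-final example above, both `preceding` and `following` would be "pau".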

Although not shown, the speech synthesizer 200 of the present embodiment includes a storage medium storing programs that control the above components, a processor that executes these programs, and a RAM that stores data needed during execution. The processing of each component is realized by the processor reading the programs from the storage medium and executing them.

Next, the flow of the operation processing of the speech synthesizer 200 is described with reference to FIG. 12, which is a flowchart of that processing.

As shown in FIG. 12, the process first proceeds to step S300, where the text analysis unit 20 determines whether sentence text has been input via an external device (not shown), an input device such as a keyboard, or the like. If it has (Yes), the process proceeds to step S302; if not (No), the determination is repeated until text is input.

In step S302, the text analysis unit 20 performs accent analysis and morphological analysis on the sentence text input in step S300, generates a pronunciation/prosodic symbol string from the analysis result, and outputs it to the standard prosody generation unit 22; the process then proceeds to step S304.

In step S304, the standard prosody generation unit 22 acquires the corresponding standard pitch frequency information from the standard prosody dictionary 21 based on the pronunciation/prosodic symbol string input from the text analysis unit 20, generates a standard prosodic information sequence for the entire sentence text from the acquired information, and outputs it to the waveform synthesis unit 23; the process then proceeds to step S306.

In step S306, the segment selection unit 23a selects, from the standard prosodic information sequence input from the standard prosody generation unit 22, a mora-unit portion for which segment selection has not yet been performed, and the process proceeds to step S308.

In step S308, the segment selection unit 23a determines, based on the standard prosodic information sequence input from the standard prosody generation unit 22, whether the mora selected in step S306 is a mora within a registered phrase. If it is (Yes), the process proceeds to step S310; if not (No), the process proceeds to step S326.

In step S310, the segment selection unit 23a determines whether the mora selected in step S306 is the first or last mora of the registered phrase. If it is (Yes), the process proceeds to step S312; if not (No), the process proceeds to step S314.

In step S312, the segment selection unit 23a determines whether the predetermined phoneme environment preceding or following the selected mora in the original sound data of the registered phrase containing that mora matches the predetermined phoneme environment preceding or following the selected mora in the standard prosodic information sequence. If they match (Yes), the process proceeds to step S314; if not (No), the process proceeds to step S326.

In step S314, the segment selection unit 23a decides to select the spectrum information and excitation source information corresponding to the standard prosodic information sequence portion of the mora selected in step S306 from the large waveform segment dictionary 25, and the process proceeds to step S316.

In step S316, the segment connection unit 23b acquires, based on the selection result of step S314, that spectrum information and excitation source information from the large waveform segment dictionary 25, and the process proceeds to step S318.

In step S318, the segment connection unit 23b concatenates the spectrum information and excitation source information acquired in step S316 or step S328 to generate a spectrum information sequence and an excitation source information sequence, and the process proceeds to step S320.

In step S320, the segment connection unit 23b determines whether the mora selected in step S306 is the last mora. If it is (Yes), the spectrum information sequence and excitation source information sequence generated in step S318 are output to the synthesis unit 23c and the process proceeds to step S322; if not (No), the process returns to step S306.

In step S322, the synthesis unit 23c generates synthesized speech waveform data based on the spectrum information sequence and excitation source information sequence input from the segment connection unit 23b, and the process proceeds to step S324.

In step S324, the waveform synthesis unit 23 outputs synthesized sound based on the synthesized speech waveform data generated in step S322, and the process ends.

On the other hand, when the process has proceeded to step S326, either because the mora selected in step S306 is not within a registered phrase or because the phoneme environment preceding or following the selected mora does not match, the decision is made to select the spectrum information and excitation source information corresponding to the standard prosodic information sequence portion of that mora from the standard waveform segment dictionary 24, and the process proceeds to step S328.

In step S328, the segment connection unit 23b acquires, based on the selection result of step S326, that spectrum information and excitation source information from the standard waveform segment dictionary 24, and the process proceeds to step S318.
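The branching in steps S306 to S328 can be summarized by tracing which steps a single mora passes through. The Python sketch below does only that; the function name and its three boolean inputs are hypothetical, and the actual dictionary lookups and concatenation are left out.

```python
from typing import List


def trace_steps(in_phrase: bool, is_edge: bool, env_matches: bool) -> List[str]:
    """Return the flowchart steps visited for one mora.

    in_phrase:   the mora belongs to a registered phrase (S308)
    is_edge:     it is the first or last mora of that phrase (S310)
    env_matches: the phoneme environment matches the original sound data (S312)
    """
    path = ["S306", "S308"]
    if in_phrase:
        path.append("S310")
        if is_edge:
            path.append("S312")
        if not is_edge or env_matches:
            path += ["S314", "S316"]    # large waveform segment dictionary 25
        else:
            path += ["S326", "S328"]    # standard waveform segment dictionary 24
    else:
        path += ["S326", "S328"]        # standard waveform segment dictionary 24
    path += ["S318", "S320"]            # concatenate, then check for the last mora
    return path
```

The environment check in S312 is reached only for edge moras of a registered phrase, which is why `env_matches` is ignored in every other case.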

Next, the operation of the present embodiment is described with reference to FIG. 13. FIG. 13(a) shows the structure of the original sound data for an example sentence; FIG. 13(b) shows a synthesis example in which the phoneme environment of the mora preceding the registered phrase matches that of the example sentence in (a); and FIG. 13(c) shows a synthesis example in which it does not match.

The following describes the actual operation of the speech synthesizer 200 when, as shown in FIG. 13(a), the large waveform segment dictionary 25 holds the spectrum information, the excitation source information, the mora sequence information including the preceding and following moras, and the duration information for each speech unit (mora), all extracted from the phrase "右方向です。" (migi hōkō desu) in natural-voice data of the sentence "次の交差点を右方向です。" (tsugi no kōsaten o migi hōkō desu) mentioned above. In this case, the preceding mora information is "o_m". Here, the "_m" in "o_m" indicates that the mora "o" is followed by "m".
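The "o_m" notation can be reproduced with a few lines of Python. This sketch assumes romanized mora strings and takes the suffix from the first letter of the next mora; the rule that a mora directly before a pause, and the pause itself, carry no suffix is inferred from the label sequences given for the example sentences in this description rather than stated as a rule.

```python
from typing import List


def context_labels(moras: List[str]) -> List[str]:
    """Attach '_x' to each mora, where x is the first letter of the next mora.

    Assumed conventions: 'pau' marks a pause; a mora directly before a
    pause (or at the end of the list) and the pause itself get no suffix.
    """
    labels = []
    for cur, nxt in zip(moras, moras[1:] + [None]):
        if cur == "pau" or nxt is None or nxt == "pau":
            labels.append(cur)
        else:
            labels.append(f"{cur}_{nxt[0]}")
    return labels


# Mora "o" followed by the phrase "mi gi ho o ko o de su" and a pause.
print(context_labels(["o", "mi", "gi", "ho", "o", "ko", "o", "de", "su", "pau"]))
```

With a different preceding mora, for instance "te" instead of "o", the first label becomes "te_m", which is what makes the two environments distinguishable.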

First, the sentence text "小学校前を右方向です。" (shōgakkō-mae o migi hōkō desu, "It is to the right in front of the elementary school.") is input to the text analysis unit 20 of the speech synthesizer 200 (step S300).

The text analysis unit 20 performs accent analysis and morphological analysis on the input sentence text with reference to a word dictionary and, based on the analysis result, generates a pronunciation/prosodic symbol string for the sentence text (step S302). This pronunciation/prosodic symbol string is input to the standard prosody generation unit 22.

The standard prosody generation unit 22 generates a speech unit label sequence from the input pronunciation/prosodic symbol string and acquires, from the standard prosody dictionary 21, the standard pitch frequency information corresponding to the label information of that sequence. From the acquired standard pitch frequency information it then generates the standard prosodic information sequence corresponding to the pronunciation/prosodic symbol string (step S304). The standard prosodic information sequence (standard pitch frequency sequence) generated in this way is input to the segment selection unit 23a of the waveform synthesis unit 23.

When the standard prosodic information sequence is input, the segment selection unit 23a selects, in order from the start of the sequence, the mora-unit portion (standard pitch frequency sequence portion) that has not yet undergone segment selection (step S306), and determines whether the mora of the selected portion is included in a registered phrase held in the large waveform segment dictionary 25 (step S308). Assuming that the phrase "小学校前を" (shōgakkō-mae o) contains no registered phrase, the moras of this portion are determined not to belong to a registered phrase ("No" branch of step S308), and the decision is made to select the spectrum information and excitation source information corresponding to each selected portion from the standard waveform segment dictionary 24 (step S326). Based on this selection result, the segment connection unit 23b acquires the spectrum information and excitation source information corresponding to each selected portion from the standard waveform segment dictionary 24 (step S328) and concatenates them in order to generate the spectrum information sequence and excitation source information sequence (step S318).

In contrast, the phrase "右方向です。" (migi hōkō desu) in the input sentence text is a registered phrase, so the moras of the standard prosodic information sequence portions selected for it are determined to be moras within a registered phrase ("Yes" branch of step S308). The mora sequence information for "を右方向です。" held in the large waveform segment dictionary 25, including the preceding and following moras, is "o_m", "mi", "gi", "ho", "o", "ko", "o", "de", "su", "pau". When the portion corresponding to "mi" is selected from the standard prosodic information sequence, this mora is the first mora of the registered phrase ("Yes" branch of step S310), so the segment selection unit 23a determines whether the phoneme environment information preceding the phrase "右方向です。" in the large waveform segment dictionary 25 (the "o_m" above) matches the phoneme environment information preceding the phrase in the standard prosodic information sequence (step S312).

Here, the mora sequence information for the standard prosodic information sequence of "を右方向です。" in the input sentence text "小学校前を右方向です。" is "o_m", "mi_g", "gi_h", "ho_o", "o_k", "ko_o", "o_d", "de_s", "su", "pau", so the preceding mora is "o_m". This matches "o_m", the preceding mora of "右方向です。" in the large waveform segment dictionary 25 ("Yes" branch of step S312), so the decision is made to select the spectrum information and excitation source information for the portion corresponding to "mi" from the large waveform segment dictionary 25 (step S314). The segment connection unit 23b then acquires that spectrum information and excitation source information from the large waveform segment dictionary 25 (step S316) and connects it to the spectrum information sequence and excitation source information sequence already generated for "小学校前を" (step S318).

The moras "gi", "ho", "o", "ko", "o", and "de" that follow "mi" are neither the first nor the last mora of the registered phrase, so the decision is made to select their spectrum information and excitation source information from the large waveform segment dictionary 25 (step S314); the segment connection unit 23b acquires this information from the large waveform segment dictionary 25 (step S316) and connects it to the spectrum information sequence and excitation source information sequence generated so far (step S318). The final "su" falls at the end of the sentence; in this case step S312 unconditionally judges the phoneme environments to match, and the decision is made to select its spectrum information and excitation source information from the large waveform segment dictionary 25. Once the spectrum information and excitation source information for the sentence-final mora "su" have been acquired from the large waveform segment dictionary 25 and connected to the sequences generated so far, the result is as shown in FIG. 13(b): for "小学校前を", the spectrum information sequence and excitation source information sequence are generated from spectrum information and excitation source information acquired from the standard waveform segment dictionary 24 (the mora string enclosed by the dotted line in the figure), and for "右方向です。" they are generated from spectrum information and excitation source information acquired from the large waveform segment dictionary 25 (the mora string enclosed by the solid line in the figure). Because "su" is the last mora of the registered phrase, which here is also the end of the sentence, the segment connection unit 23b outputs the generated spectrum information sequence and excitation source information sequence to the synthesis unit 23c ("Yes" branch of step S320).

Next, consider the case where the sentence text "向かって右方向です。" (mukatte migi hōkō desu, "It is to the right as you face it.") is input to the text analysis unit 20. When the standard prosodic information sequence for this sentence has been generated by the same procedure as above (step S304), the segment selection unit 23a selects, in order from the start of the sequence, the mora-unit portion (standard pitch frequency sequence portion) that has not yet undergone segment selection (step S306), and determines whether the mora of the selected portion is included in a registered phrase held in the large waveform segment dictionary 25 (step S308). Assuming that the phrase "向かって" (mukatte) contains no registered phrase, then, as with "小学校前を" above, the segment connection unit 23b acquires the spectrum information and excitation source information corresponding to each selected portion from the standard waveform segment dictionary 24 (step S328) and concatenates them in order to generate the spectrum information sequence and excitation source information sequence (step S318).

In contrast, the phrase "右方向です。" (migi hōkō desu) in the input sentence text is a registered phrase, so the moras of the standard prosodic information sequence portions selected for it are determined to be moras within a registered phrase ("Yes" branch of step S308). Here, the mora sequence information for the standard prosodic information sequence of the input sentence text "向かって右方向です。" is "te_m", "mi_g", "gi_h", "ho_o", "o_k", "ko_o", "o_d", "de_s", "su", "pau", so the preceding mora is "te_m". This does not match "o_m", the preceding mora of "右方向です。" in the large waveform segment dictionary 25 ("No" branch of step S312), so the decision is made to select the spectrum information and excitation source information for the portion corresponding to "mi" from the standard waveform segment dictionary 24 (step S326).

The segment connection unit 23b then acquires the spectrum information and excitation source information for the portion corresponding to "mi" from the standard waveform segment dictionary 24 (step S328) and connects it to the spectrum information sequence and excitation source information sequence already generated for "向かって" (step S318). The moras "gi", "ho", "o", "ko", "o", "de", and "su" that follow "mi" are, as above, either not the first or last mora of the registered phrase or the sentence-final mora, so the decision is made to select their spectrum information and excitation source information from the large waveform segment dictionary 25 (step S314); the segment connection unit 23b acquires this information from the large waveform segment dictionary 25 (step S316) and connects it to the spectrum information sequence and excitation source information sequence generated so far (step S318).

As shown in FIG. 13(c), the result of this connection is as follows. For the "向かってみ" (mukatte mi) portion of the sentence text, that is, "向かって" plus the first mora "mi" of the registered phrase, the spectrum information sequence and excitation source information sequence are generated from spectrum information and excitation source information acquired from the standard waveform segment dictionary 24 (the mora string enclosed by the dotted line in the figure). For the remaining "ぎ方向です。" (gi hōkō desu) portion, they are generated from spectrum information and excitation source information acquired from the large waveform segment dictionary 25 (the mora string enclosed by the solid line in the figure). The segment connection unit 23b then outputs the generated spectrum information sequence and excitation source information sequence to the synthesis unit 23c ("Yes" branch of step S320).
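The dotted and solid boxes in FIG. 13 correspond to runs of consecutive moras taken from the same dictionary. As an illustrative sketch, the Python function below groups a per-mora source list into such runs; the function name, the "standard"/"large" labels, and the use of "q" for the geminate mora in "mukatte" are assumptions made for the example.

```python
from itertools import groupby
from typing import List, Tuple


def group_by_source(moras: List[str], sources: List[str]) -> List[Tuple[str, List[str]]]:
    """Collapse per-mora dictionary choices into runs of consecutive moras."""
    if len(moras) != len(sources):
        raise ValueError("one source label is required per mora")
    runs = []
    for source, items in groupby(zip(sources, moras), key=lambda pair: pair[0]):
        runs.append((source, [mora for _, mora in items]))
    return runs


# Mismatched-environment case: the first phrase mora "mi" falls back to
# the standard dictionary together with the preceding "mukatte".
moras = ["mu", "ka", "q", "te", "mi", "gi", "ho", "o", "ko", "o", "de", "su"]
sources = ["standard"] * 5 + ["large"] * 7
for source, run in group_by_source(moras, sources):
    print(source, run)
```

Each run boundary is a point where segments from different dictionaries are joined, so the number of runs minus one is the number of cross-dictionary joins in the sentence.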

The synthesis unit 23c uses the spectrum information sequence input from the segment connection unit 23b as the parameters of a synthesis filter, and filters the excitation source signal generated from the excitation source information sequence with that filter to generate synthesized speech waveform data corresponding to the sentence text (step S322). Synthesized speech is then output from a speaker or the like (not shown) based on the generated synthesized speech waveform data (step S324).
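The filtering in step S322 can be pictured with a minimal source-filter sketch. The text here does not state the filter type, so the all-pole (LPC-style) recursion, the per-frame layout, and the names `synthesize`, `coeffs`, and `excitation` are assumptions made purely for illustration.

```python
def synthesize(frames):
    """Filter per-frame excitation through a per-frame all-pole filter.

    frames: list of (coeffs, excitation) pairs. coeffs = [a1, ..., ap]
    stands in for the spectrum information (assumed LPC-style), and
    excitation is a list of excitation source samples.
    Assumed recursion: y[n] = x[n] - sum_k a_k * y[n-k].
    """
    max_order = max((len(coeffs) for coeffs, _ in frames), default=0)
    history = []  # previous outputs, newest first; carried across frames
    output = []
    for coeffs, excitation in frames:
        for x in excitation:
            # Feed back past outputs weighted by the filter coefficients.
            y = x - sum(a * past for a, past in zip(coeffs, history))
            output.append(y)
            history = ([y] + history)[:max_order]
    return output
```

Carrying `history` across frames keeps the output continuous where segments taken from different dictionaries are joined; whether the actual apparatus does this is not stated in the text.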

As described above, the speech synthesizer 200 of the present embodiment uses the large waveform segment dictionary 25, which holds the spectrum information and excitation source information of the registered phrase (fixed phrase) portion extracted from speech waveform data of an utterance spoken by a predetermined speaker, information on the phoneme environment preceding or following the phrase portion, duration information for each speech unit constituting the phrase, and mora sequence information of the phrase. With this dictionary, the synthesized waveform data for a fixed phrase portion contained in the standard prosody information sequence can be generated from the spectrum information and excitation source information held in the large waveform segment dictionary 25. Furthermore, when the phoneme environment preceding or following the fixed phrase in the dictionary does not match the phoneme environment preceding or following the fixed phrase in the standard prosody information sequence, the first or last mora of the fixed phrase on the mismatched side can be generated from the spectrum information and excitation source information held in the standard waveform segment dictionary 24, and the remaining morae from those held in the large waveform segment dictionary 25.
As a result, even when the phoneme environments do not match, it is possible to generate synthesized speech waveform data in which the sentence text is uttered (reproduced) with intonation closer to natural speech.

In the second embodiment, the text analysis unit 20 corresponds to the text analysis means of claim 7, the standard prosody generation unit 22 corresponds to the prosody information sequence generation means of claim 7, the waveform synthesis unit 23 corresponds to the speech waveform generation means of claim 7, the standard waveform segment dictionary 24 corresponds to the first segment dictionary of any one of claims 7, 11, and 14, and the large waveform segment dictionary 25 corresponds to the second segment dictionary of any one of claims 7, 11, and 14.

Also in the second embodiment, the following processing in the segment selection unit 23a corresponds to the determination means of claim 7: when the mora selected from the standard prosody information sequence is the first or last mora of a registered phrase, determining whether the predetermined phoneme environment preceding or following the selected mora in the original speech data of the registered phrase (the data registered in the large waveform segment dictionary 25) matches the predetermined phoneme environment preceding or following the selected mora in the standard prosody information sequence.

In the second embodiment, steps S300 to S302 correspond to the text analysis step of claim 11, step S304 corresponds to the prosody information sequence generation step of claim 11 or 14, and steps S306 to S328 correspond to the speech waveform generation step of claim 11 or 14.
Step S312 of the second embodiment corresponds to the determination step of claim 11 or 14.

[Third Embodiment]
Next, a third embodiment of the prosodic segment dictionary creation method, prosodic segment dictionary creation program, speech synthesizer, speech synthesis program, and speech synthesis method according to the present invention will be described with reference to the drawings. FIGS. 14 to 17 show this third embodiment.

FIG. 14 is a block diagram showing the configuration of a speech synthesizer 300 according to the third embodiment of the present invention.
As shown in FIG. 14, the speech synthesizer 300 includes a text analysis unit 10, a standard prosody dictionary 11, a prosodic segment dictionary 12, a prosody generation unit 13, a waveform synthesis unit 23, a standard waveform segment dictionary 24, and a large waveform segment dictionary 25.

That is, the speech synthesizer 300 combines the text analysis unit 10, standard prosody dictionary 11, prosodic segment dictionary 12, and prosody generation unit 13, which are the same as in the speech synthesizer 100 of the first embodiment, with the waveform synthesis unit 23, standard waveform segment dictionary 24, and large waveform segment dictionary 25, which are the same as in the speech synthesizer 200 of the second embodiment.
Accordingly, based on the prosody information sequence generated by the prosody generation unit 13, the waveform synthesis unit 23 generates synthesized speech waveform data using the standard waveform segment dictionary 24 and the large waveform segment dictionary 25, and outputs synthesized speech. Only the points that differ from the first and second embodiments are described in detail below.

When prosodic segment pitch frequency information and prosodic feature information corresponding to a matching phrase are input from the prosodic segment selection unit 13c, the prosody replacement shaping unit 13d of the prosody generation unit 13 replaces the standard pitch frequency information of the matching phrase in the standard prosody information sequence input from the standard prosody generation unit 13a with the prosodic segment pitch frequency information. Based on the prosodic feature information, it also shapes the portions of the standard prosody information sequence that correspond to the predetermined prosodic units before and after the matching phrase, and thereby generates the final prosody information sequence. The generated prosody information sequence is output to the waveform synthesis unit 23.

On the other hand, when the prosody replacement shaping unit 13d is notified by the prosodic segment selection unit 13c that there is no matching phrase, it outputs the standard prosody information sequence input from the standard prosody generation unit 13a to the waveform synthesis unit 23 as is.
Based on the prosody information sequence input from the prosody generation unit 13 and the phoneme environment information contained in the large waveform segment dictionary 25, the segment selection unit 23a of the waveform synthesis unit 23 selects the spectrum information and excitation source information for each speech segment corresponding to the standard prosody information sequence from the standard waveform segment dictionary 24 and the large waveform segment dictionary 25. The selection result is output to the segment connection unit 23b.

In other words, in the prosody information sequence input from the prosody generation unit 13 to the waveform synthesis unit 23, the fixed phrase portion of the standard prosody information sequence, which is generated from standard pitch frequency information (pitch frequency information for each label of a speech unit), has been replaced with the prosodic segment pitch frequency information held in the prosodic segment dictionary 12. In addition, the standard pitch frequency sequences of the predetermined prosodic units preceding and following the registered (fixed) phrase in the standard prosody information sequence have been adjusted based on the prosodic feature information associated with that prosodic segment pitch frequency information in the prosodic segment dictionary 12. For such a prosody information sequence, the synthesized waveform data for the fixed phrase portion is generated from the spectrum information and excitation source information extracted from the original speech data. However, when the phoneme environment preceding or following the fixed phrase in the original speech data does not match the phoneme environment preceding or following the fixed phrase in the standard prosody information sequence, the first or last mora of the fixed phrase on the mismatched side is generated from the spectrum information and excitation source information held in the standard waveform segment dictionary 24.

In the present embodiment, although not shown, the speech synthesizer 300 includes a storage medium storing programs that control the above components, a processor for executing these programs, and a RAM for storing data needed to execute them. The processing of each component is realized by the processor reading and executing the programs stored in the storage medium.

Next, the flow of operation of the speech synthesizer 300 will be described with reference to FIG. 15, which is a flowchart of that operation. In FIG. 15, "standard PF" denotes the standard pitch frequency and "prosodic segment PF" denotes the prosodic segment pitch frequency.
Steps S400 to S418 and step S424 are the same as steps S200 to S218 and step S224 in the flowchart of FIG. 5 of the first embodiment, so their description is omitted.

In step S420, the waveform synthesis unit 23 generates synthesized speech waveform data based on the prosody information sequence input from the prosody generation unit 13, and the process proceeds to step S422.
In step S422, the waveform synthesis unit 23 outputs synthesized speech of the sentence text input in step S400 from an output device such as a speaker (not shown), based on the synthesized speech waveform data generated in step S420, and the process ends.

Next, the flow of the synthesized speech waveform data generation process of step S420 will be described with reference to FIG. 16, which is a flowchart of that process in the waveform synthesis unit 23.
As shown in FIG. 16, the process first proceeds to step S500, where the segment selection unit 23a selects, from the prosody information sequence input from the prosody generation unit 13, a mora-unit prosody information sequence portion for which segment selection has not yet been performed, and the process proceeds to step S502.

In step S502, the segment selection unit 23a determines, based on the prosody information sequence input from the prosody generation unit 13, whether the mora selected in step S500 is a mora within a registered phrase. If it is (Yes), the process proceeds to step S504; otherwise (No), it proceeds to step S518.
In step S504, the segment selection unit 23a determines whether the mora selected in step S500 is the first or last mora of the registered phrase. If it is (Yes), the process proceeds to step S506; otherwise (No), it proceeds to step S508.

In step S506, the segment selection unit 23a determines whether the predetermined phoneme environment preceding or following the selected mora in the original speech data of the registered phrase containing that mora matches the predetermined phoneme environment preceding or following the selected mora in the prosody information sequence. If they match (Yes), the process proceeds to step S508; otherwise (No), it proceeds to step S518.

In step S508, the segment selection unit 23a decides to select the spectrum information and excitation source information corresponding to the prosody information sequence portion of the mora selected in step S500 from the large waveform segment dictionary 25, and the process proceeds to step S510.
In step S510, the segment connection unit 23b acquires, based on the selection result of step S508, that spectrum information and excitation source information from the large waveform segment dictionary 25, and the process proceeds to step S512.

In step S512, the segment connection unit 23b connects the spectrum information and excitation source information acquired in step S510 or step S520 to generate a spectrum information sequence and an excitation source information sequence, and the process proceeds to step S514.
In step S514, the segment connection unit 23b determines whether the mora selected in step S500 is the last mora. If it is (Yes), the spectrum information sequence and excitation source information sequence generated in step S512 are output to the synthesis unit 23c and the process proceeds to step S516; otherwise (No), it returns to step S500.
In step S516, the synthesis unit 23c generates synthesized speech waveform data based on the spectrum information sequence and excitation source information sequence input from the segment connection unit 23b, ends the series of processing, and returns to the calling process.

On the other hand, when the mora selected in step S500 is not within a registered phrase, or the phoneme environment preceding or following the selected mora does not match, and the process has proceeded to step S518, it is decided that the spectrum information and excitation source information corresponding to the prosody information sequence portion of that mora are to be selected from the standard waveform segment dictionary 24, and the process proceeds to step S520.
In step S520, the segment connection unit 23b acquires, based on the selection result of step S518, that spectrum information and excitation source information from the standard waveform segment dictionary 24, and the process proceeds to step S512.
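The branching of steps S502 to S518 can be summarized in a short sketch. The `Mora` record, its field names, and the string constants are hypothetical and exist only for this illustration; the sketch also simplifies by checking a single neighbor per edge mora, so a one-mora phrase (which is both first and last) is not modeled.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

STANDARD = "standard waveform segment dictionary 24"
LARGE = "large waveform segment dictionary 25"


@dataclass
class Mora:
    label: str
    in_registered_phrase: bool = False       # result of step S502
    is_phrase_edge: bool = False             # result of step S504
    recorded_neighbor: Optional[str] = None  # environment in the original speech data
    actual_neighbor: Optional[str] = None    # environment in the prosody information sequence


def choose_dictionary(mora: Mora) -> str:
    """Return which dictionary supplies spectrum/excitation data for one mora."""
    if not mora.in_registered_phrase:
        return STANDARD                      # S502 "No" -> S518
    if mora.is_phrase_edge and mora.recorded_neighbor != mora.actual_neighbor:
        return STANDARD                      # S506 "No" -> S518
    return LARGE                             # S504 "No" or S506 "Yes" -> S508


def select_segments(moras: List[Mora]) -> List[Tuple[str, str]]:
    """Walk the mora sequence in order (the S500/S514 loop)."""
    return [(m.label, choose_dictionary(m)) for m in moras]
```

A sentence-initial or sentence-final edge mora can be passed with both neighbors left as `None`, which this sketch treats as a match, mirroring the unconditional match the description applies at sentence boundaries.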

Next, the operation of the present embodiment will be described with reference to FIG. 17. FIG. 17(a) shows a synthesized speech waveform generated with the conventional technique when the phoneme environment of the mora preceding a registered phrase does not match, and FIG. 17(b) shows a synthesized speech waveform generated with the present invention in the same case.
First, the sentence text 「その先、平河門前を右方向です。」 ("Sono saki, Hirakawa-mon mae o migi hōkō desu.", roughly "Ahead, turn right in front of Hirakawa Gate.") is input to the text analysis unit 10 of the speech synthesizer 300 (step S400).

As in the first embodiment, the text analysis unit 10 performs accent analysis and morphological analysis on the input sentence text with reference to a word dictionary, and generates a pronunciation/prosodic symbol string for the sentence text from the analysis result (step S402). Here the pronunciation/prosodic symbol string is 「ソノサキ＋、ヒラカワモンマ’エオミギホ’ーコーデス＋。」. This string is input to both the standard prosody generation unit 13a and the registered phrase collation unit 13b of the prosody generation unit 13.

The standard prosody generation unit 13a generates a speech unit label sequence from the input pronunciation/prosodic symbol string and acquires the standard pitch frequency information corresponding to the label information of that sequence from the standard prosody dictionary 11. From the acquired standard pitch frequency information, it generates the standard prosody information sequence corresponding to the pronunciation/prosodic symbol string (step S404). The generated standard prosody information sequence is input to the prosody replacement shaping unit 13d.

Meanwhile, the registered phrase collation unit 13b determines whether the input pronunciation/prosodic symbol string contains a registered phrase (step S406). Here the string contains the registered phrases 「ソノサキ＋」 and 「ミギホ’ーコーデス＋」 ("Yes" branch of step S408), and the prosodic segment selection unit 13c is notified of this.
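The collation in step S406 amounts to looking for registered phrases inside the pronunciation/prosodic symbol string. The following is a minimal sketch that assumes plain substring matching; the function name `find_registered_phrases` is hypothetical, and the actual collation method of the registered phrase collation unit 13b is not specified here.

```python
def find_registered_phrases(symbol_string, registered_phrases):
    """Return (start_index, phrase) pairs for every registered phrase found.

    Assumption: a registered phrase matches when it occurs verbatim,
    accent and boundary marks included, in the symbol string.
    """
    hits = []
    for phrase in registered_phrases:
        start = symbol_string.find(phrase)
        while start != -1:
            hits.append((start, phrase))
            start = symbol_string.find(phrase, start + 1)
    return sorted(hits)  # ordered by position in the sentence
```

An empty result corresponds to the case where the prosodic segment selection unit 13c reports that there is no matching phrase.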

On receiving the notification from the registered phrase collation unit 13b, the prosodic segment selection unit 13c looks up the prosodic segment dictionary 12. If the dictionary holds multiple sets of prosodic segment pitch frequency information and prosodic feature information for the registered phrases 「ソノサキ＋」 and 「ミギホ’ーコーデス＋」, it selects, as in the first embodiment, the set for each phrase that connects best with the standard pitch frequencies in the standard prosody information sequence. The prosodic segment selection unit 13c then acquires the selected prosodic segment pitch frequency information and prosodic feature information from the prosodic segment dictionary 12 and outputs them to the prosody replacement shaping unit 13d (step S410).

When the prosodic segment pitch frequency information and prosodic feature information for 「ソノサキ＋」 and 「ミギホ’ーコーデス＋」 are input from the prosodic segment selection unit 13c, the prosody replacement shaping unit 13d adjusts the standard pitch frequencies of 「ヒラカワモンマ’エオ」, the breath group that follows 「ソノサキ＋」 in the standard prosody information sequence, using the corresponding prosodic feature information, namely the pitch ratio and the inflection ratio (step S412). Because 「ヒラカワモンマ’エオ」 is also the breath group that precedes 「ミギホ’ーコーデス＋」, two sets of prosodic feature information are used to adjust its standard pitch frequencies.

Specifically, suppose for example that the pitch ratio and inflection ratio for 「ソノサキ＋」 are 0.8 and 0.6, and those for 「ミギホ’ーコーデス＋」 are also 0.8 and 0.6. Since the two sets agree, each standard pitch frequency in the sequence is adjusted so that the pitch ratio between the standard pitch frequency sequence for 「ヒラカワモンマ’エオ」 and the adjusted pitch frequency sequence is 0.8 and the inflection ratio is 0.6. When the two sets do not agree, it is possible, for example, to use the averages of the two pitch ratios and of the two inflection ratios, or to split the breath group 「ヒラカワモンマ’エオ」 into two phrases and adjust one with the pitch ratio and inflection ratio for 「ソノサキ＋」 and the other with those for 「ミギホ’ーコーデス＋」.
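The adjustment in step S412 can be illustrated with a small sketch. The exact definitions of the pitch ratio and inflection ratio are given elsewhere in the specification and are not restated here, so this sketch assumes that the pitch ratio scales the mean pitch of the breath group and the inflection ratio scales the deviation of each value from that mean, both on a linear Hz scale. The function names are hypothetical.

```python
def adjust_pitch(f0_hz, pitch_ratio, inflection_ratio):
    """Rescale a standard pitch frequency sequence (assumed definitions).

    After adjustment: mean(adjusted) == pitch_ratio * mean(original), and
    every deviation from the mean is multiplied by inflection_ratio.
    """
    mean = sum(f0_hz) / len(f0_hz)
    return [pitch_ratio * mean + inflection_ratio * (f - mean) for f in f0_hz]


def average_features(first, second):
    """Average two (pitch_ratio, inflection_ratio) pairs that disagree."""
    return ((first[0] + second[0]) / 2, (first[1] + second[1]) / 2)
```

With the values from the example (0.8 and 0.6), a contour of 100, 200, 300 Hz would be lowered to a mean of 160 Hz and its 200 Hz span narrowed to 120 Hz under these assumed definitions.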

Furthermore, the prosody replacement shaping unit 13d replaces the standard pitch frequency sequences for 「ソノサキ＋」 and 「ミギホ’ーコーデス＋」 in the standard prosody information sequence with the prosodic segment pitch frequency information input from the prosodic segment selection unit 13c (step S414). It then applies shaping such as smoothing so that the replaced prosodic segment pitch frequency information (prosodic segment pitch frequency sequence) connects smoothly with the adjusted standard prosody information, and generates the final prosody information sequence for 「ソノサキ＋、ヒラカワモンマ’エオミギホ’ーコーデス＋。」 (step S416). The generated prosody information sequence is output to the segment selection unit 23a of the waveform synthesis unit 23 (step S418).

When the prosody information sequence is input, the segment selection unit 23a selects, in order from the start of the sequence, a mora-unit prosody information sequence portion (standard pitch frequency sequence portion) for which segment selection has not yet been performed (step S500), and determines whether the mora of the selected portion is contained in a registered phrase held in the large waveform segment dictionary 25 (step S502). Here 「ソノサキ＋」 and 「ミギホ’ーコーデス＋」 are assumed to be registered phrases, so the morae of the prosody information sequence portions selected for these phrases are determined to be morae within a registered phrase ("Yes" branch of step S502). The mora sequence information held in the large waveform segment dictionary 25 is "so", "no", "sa", "ki", "pau" (pause) for 「ソノサキ＋、」 and "pau", "mi", "gi", "ho", "o", "ko", "o", "de", "su" for 「オミギホ’ーコーデス＋」.

When multiple prosodic segments correspond to a registered phrase, an individual large waveform segment (spectrum information and excitation source information) derived from the same speech waveform data may be stored in the large waveform segment dictionary 25 for each of those prosodic segments. In that case, as in the example of FIG. 4, the storage address in the large waveform segment dictionary 25 of the large waveform segment that was extracted, together with the prosodic segment pitch frequency information, from the actual utterance of the prosodic segment is stored in advance in the prosodic segment dictionary 12. The appropriate large waveform segment for each item of prosodic segment pitch frequency information registered in the prosodic segment dictionary 12 can then be identified, which makes it possible to obtain more natural synthesized speech that reflects the speech waveform data of the original utterance more faithfully.
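The link between the two dictionaries described here can be sketched as a stored address that is resolved at synthesis time. The record types, field names, and the use of an integer key as the "storage address" are all assumptions for illustration; the actual layout of FIG. 4 is not reproduced.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class LargeWaveformSegment:
    spectrum: List[float]    # spectrum information (placeholder values)
    excitation: List[float]  # excitation source information (placeholder values)


@dataclass
class ProsodicSegmentEntry:
    phrase: str
    pitch_hz: List[float]    # prosodic segment pitch frequency information
    waveform_address: int    # assumed key into large waveform segment dictionary 25


def waveform_for(entry: ProsodicSegmentEntry,
                 large_dictionary: Dict[int, LargeWaveformSegment]) -> LargeWaveformSegment:
    """Resolve the large waveform segment recorded with this prosodic segment."""
    return large_dictionary[entry.waveform_address]
```

Because each prosodic segment carries its own address, two recordings of the same phrase with different pitch contours resolve to the waveform data taken from their own utterances.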

When the prosody information sequence portion for "so" is selected from the prosody information sequence, this mora is the first mora of a registered phrase ("Yes" branch of step S504). Because it is also the sentence-initial mora, the phoneme environments are unconditionally determined to match ("Yes" branch of step S506), and it is decided that the spectrum information and excitation source information for this portion are to be selected from the large waveform segment dictionary 25 (step S508). The segment connection unit 23b acquires the spectrum information and excitation source information for "so" from the large waveform segment dictionary 25 (step S510) and places them at the head of the spectrum information sequence and excitation source information sequence (step S512).
The morae "no" and "sa" that follow "so" are within the registered phrase but are neither its first nor its last mora, so it is decided that the spectrum information and excitation source information for these portions are to be selected from the large waveform segment dictionary 25 (step S508). The segment connection unit 23b acquires them from the large waveform segment dictionary 25 (step S510) and connects them to the spectrum information sequence and excitation source information sequence for "so" (step S512).

Next, when the prosodic information sequence portion corresponding to "ki" is selected from the prosodic information sequence, this mora is the last mora of the registered phrase ("Yes" branch of step S504). It is therefore determined whether the phonemic environment information that follows the phrase "Sonosaki+" in the large waveform segment dictionary 25 (the "pau" mentioned above) matches the phonemic environment information that follows the phrase in the prosodic information sequence (step S506). Here, the mora sequence information for the prosodic information sequence of the input sentence text 「その先、」 ("Sono saki,") is "so_n", "no_s", "sa_k", "ki", "pau", so the following mora is "pau". Because this matches "pau", the mora that follows "Sonosaki+" in the large waveform segment dictionary 25 ("Yes" branch of step S506), it is decided that the spectrum information and excitation source information corresponding to the "ki" portion of the selected prosodic information sequence are to be selected from the large waveform segment dictionary 25 (step S508). The segment connection unit 23b then acquires the spectrum information and excitation source information corresponding to the standard prosodic information sequence portion for "ki" from the large waveform segment dictionary 25 (step S510) and connects them to the already generated spectrum information sequence and excitation source information sequence for "Sonosa" (step S512).
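
The environment check of step S506 for the last mora "ki" can be sketched as follows. This is only a minimal illustration under stated assumptions, not the implementation described in the patent: the function names are hypothetical, and reading the part after the underscore in labels such as "so_n" as the following phonemic context is an assumption drawn from the labels listed above.

```python
def split_label(label):
    """Split a mora label such as "so_n" into (mora, context).

    Assumption: the text after "_" is the following phonemic context.
    Labels without "_" (e.g. "ki", "pau") have no context part.
    """
    mora, _, context = label.partition("_")
    return mora, (context or None)


def following_environment(labels, index):
    """Return the mora after labels[index], or None at the end of the sentence."""
    if index + 1 >= len(labels):
        return None
    return split_label(labels[index + 1])[0]


# Mora sequence information for the input text, as listed in the description.
input_labels = ["so_n", "no_s", "sa_k", "ki", "pau"]

# Following environment stored with the registered phrase "Sonosaki+".
registered_following = "pau"

last_index = 3  # position of "ki", the last mora of the registered phrase
env_matches = following_environment(input_labels, last_index) == registered_following
print("use large waveform segment dictionary for 'ki':", env_matches)
```

When the two environments agree, the last mora is taken from the large waveform segment dictionary; otherwise it would fall back to the standard one.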

The prosodic information sequence portions corresponding to the moras of "Hirakawamonma'eo", which follows "Sonosaki+", are not included in any registered phrase ("No" branch of step S502). It is therefore decided that the spectrum information and excitation source information corresponding to the prosodic information sequence portions of these moras are to be selected from the standard waveform segment dictionary 24 (step S518). Based on this selection result, the segment connection unit 23b acquires the spectrum information and excitation source information for each selected portion from the standard waveform segment dictionary 24 (step S520) and connects them in order after the spectrum information sequence and excitation source information sequence for "Sonosaki+" to generate the spectrum information sequence and excitation source information sequence (step S512).

Next, when the prosodic information sequence portion corresponding to the mora "mi" of "Migiho'okoodesu+" is selected from the prosodic information sequence, this mora is the first mora of a registered phrase, so it is determined to be a mora within a registered phrase ("Yes" branch of step S502). The mora sequence information held in the large waveform segment dictionary 25 for 「、右方向です」 (", migi hōkō desu"), including the preceding mora, is "pau", "mi", "gi", "ho", "o", "ko", "o", "de", "su". In contrast, the mora sequence information for the "Omigiho'okoodesu+" portion of the prosodic information sequence for the input sentence text 「その先、平河門前を右方向です。」 ("Ahead, go right at Hirakawamon-mae.") is "o_m", "mi_g", "gi_h", "ho_o", "o_k", "ko_o", "o_d", "de_s", "su". That is, "pau", the preceding mora in the large waveform segment dictionary 25, does not match "o_m", the preceding mora for the "Migiho'okoodesu+" portion of the prosodic information sequence ("No" branch of step S506).

Accordingly, it is decided that the spectrum information and excitation source information corresponding to the "mi" portion of the selected prosodic information sequence are to be selected from the standard waveform segment dictionary 24 (step S518). The segment connection unit 23b then acquires the spectrum information and excitation source information corresponding to the "mi" portion from the standard waveform segment dictionary 24 (step S520) and connects them to the already generated spectrum information sequence and excitation source information sequence for "Sonosaki+, Hirakawamonma'eo" (step S512). The moras "gi", "ho", "o", "ko", "o", "de", and "su" that follow "mi" are either not the first or last mora of the registered phrase or are the sentence-final mora (here, "su"). It is therefore decided that the spectrum information and excitation source information corresponding to them are to be selected from the large waveform segment dictionary 25 (step S508); the segment connection unit 23b acquires the corresponding spectrum information and excitation source information from the large waveform segment dictionary 25 (step S510) and connects them to the already generated spectrum information sequence and excitation source information sequence (step S512).

In the resulting sequences, the portion for "Sonosaki+" of the pronunciation/prosody symbol string is generated using spectrum information and excitation source information acquired from the large waveform segment dictionary 25; the portion for "Hirakawamonma'eomi" (that is, "Hirakawamonma'eo" followed by the first mora "mi" of the registered phrase) is generated using spectrum information and excitation source information acquired from the standard waveform segment dictionary 24; and the portion for "giho'okoodesu+" is generated using spectrum information and excitation source information acquired from the large waveform segment dictionary 25. The segment connection unit 23b then outputs the generated spectrum information sequence and excitation source information sequence to the synthesis unit 23c ("Yes" branch of step S514).
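
The per-mora dictionary choice walked through above (steps S502 to S518) can be sketched as follows. This is a hedged illustration, not the patent's implementation: `RegisteredPhrase` and `choose_dictionaries` are hypothetical names, the mora segmentation of "Hirakawamonma'eo" is assumed, and the dictionary that supplies the pause "pau" is not stated in the text, so it simply falls through to the standard dictionary here.

```python
from dataclasses import dataclass
from typing import List, Optional

LARGE = "large (25)"
STANDARD = "standard (24)"


@dataclass
class RegisteredPhrase:
    moras: List[str]
    preceding: Optional[str]  # environment before the phrase in the original recording
    following: Optional[str]  # environment after the phrase in the original recording


def choose_dictionaries(moras, phrases):
    """Return, for each mora, which dictionary supplies its segment."""
    choice = [STANDARD] * len(moras)          # S502 "No": not in a registered phrase
    for phrase in phrases:
        n = len(phrase.moras)
        for start in range(len(moras) - n + 1):
            if moras[start:start + n] != phrase.moras:
                continue
            end = start + n - 1
            for i in range(start, end + 1):   # S508: moras inside a registered phrase
                choice[i] = LARGE
            # S504/S506: the environment is checked only at the phrase edges.
            # A sentence-initial or sentence-final mora matches unconditionally.
            if start > 0 and moras[start - 1] != phrase.preceding:
                choice[start] = STANDARD      # S518: mismatching first mora
            if end < len(moras) - 1 and moras[end + 1] != phrase.following:
                choice[end] = STANDARD        # S518: mismatching last mora
    return choice


# Assumed mora segmentation of the example sentence.
sentence = ("so no sa ki pau hi ra ka wa mo n ma e o "
            "mi gi ho o ko o de su").split()
phrases = [
    # preceding=None: not given in the text and unused, the phrase is sentence-initial.
    RegisteredPhrase("so no sa ki".split(), preceding=None, following="pau"),
    RegisteredPhrase("mi gi ho o ko o de su".split(), preceding="pau", following=None),
]
for mora, source in zip(sentence, choose_dictionaries(sentence, phrases)):
    print(f"{mora:>3}  {source}")
```

Only the edge moras of a registered phrase are ever demoted to the standard dictionary, which is why "mi" alone comes from dictionary 24 while "gi" through "su" still come from dictionary 25.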

Based on the spectrum information sequence and excitation source information sequence input from the segment connection unit 23b, the synthesis unit 23c uses the spectrum information sequence as the parameters of a synthesis filter and, with that synthesis filter, filters the excitation source signal generated from the excitation source information sequence, thereby generating synthesized speech waveform data corresponding to the sentence text (step S420).
In the synthesized speech waveform data generated in this way, as shown in FIG. 17(b), the synthesized speech waveform of the "o" of "Hirakawamonma'eo" and that of the "mi" of "Migiho'okoodesu+" are smoothly connected. When synthesized speech is output from a speaker or the like (not shown) based on this synthesized speech waveform data (step S422), the sentence text is pronounced with a prosody closer to natural utterance, and the "o" of "Hirakawamonma'eo" and the "mi" of "Migiho'okoodesu+" are pronounced continuously.
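
The filtering in step S420 can be illustrated with a small source-filter sketch. The text does not specify the form of the synthesis filter, so the all-pole (LPC-style) filter below is an assumption, as are the function name `synthesize` and the representation of each speech unit as a pair of filter coefficients and excitation samples.

```python
def synthesize(frames):
    """Filter each unit's excitation with that unit's filter coefficients.

    frames: list of (coeffs, excitation) pairs, one per speech unit.
    Assumed all-pole filter: y[n] = x[n] - sum_k coeffs[k-1] * y[n-k].
    The output history is shared across units so that the waveform
    continues from one unit into the next.
    """
    out = []
    for coeffs, excitation in frames:
        for x in excitation:
            y = x
            for k, a in enumerate(coeffs, start=1):
                if len(out) >= k:
                    y -= a * out[-k]
            out.append(y)
    return out


# Two toy units with placeholder values: a decaying resonance, then a pass-through.
frames = [
    ([-0.5], [1.0, 0.0, 0.0]),
    ([0.0], [1.0, 0.0]),
]
waveform = synthesize(frames)
print(waveform)
```

Keeping one output history across units is a design choice made for this sketch; it avoids resetting the filter state at every unit boundary.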

In contrast, when, as in the conventional method, the spectrum information sequence and excitation source information sequence are generated using spectrum information and excitation source information acquired from the large waveform segment dictionary 25 for the whole of the registered phrase "Migiho'okoodesu+", the synthesized speech waveform of the "o" of "Hirakawamonma'eo" and that of the "mi" of "Migiho'okoodesu+" become discontinuous, as shown in the synthesized speech waveform of FIG. 17(a). When synthesized sound with such a waveform is output, the "o" of "Hirakawamonma'eo" and the "mi" of "Migiho'okoodesu+" are pronounced discontinuously, which sounds unnatural.

As described above, the speech synthesizer 300 of the present embodiment uses the prosodic segment dictionary 12. This dictionary holds the pitch frequency information of registered phrase (fixed phrase) portions extracted from speech waveform data corresponding to utterance sentences uttered by a predetermined speaker (prosodic segment pitch frequency information), together with prosodic feature information that indicates the prosodic features of predetermined prosodic units and is generated from the pitch frequencies of the predetermined prosodic units preceding and following the fixed phrase in the speech waveform data and the corresponding standard pitch frequencies. Using this dictionary, the fixed phrase portion contained in the standard prosodic information sequence, which is generated from standard pitch frequency information (pitch frequency information for each item of speech-unit label information), is replaced with the prosodic segment pitch frequency information held in the prosodic segment dictionary 12. In addition, the standard pitch frequency sequences of the predetermined prosodic units preceding and following the fixed phrase in the standard prosodic information sequence are adjusted based on the prosodic feature information corresponding to that prosodic segment pitch frequency information, so that the final prosodic information sequence can be generated.
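
The replacement and adjustment summarized above can be sketched as follows. The exact adjustment formula is not given in this part of the text, so the one used here is an assumption: the mean pitch of a neighbouring unit is scaled by the pitch ratio and its deviation from the mean by the intonation ratio. The function names, the one-value-per-mora representation, and all numbers are hypothetical placeholders.

```python
def adjust_unit(pitches, pitch_ratio, intonation_ratio):
    """Scale the unit's mean by pitch_ratio and its excursions by intonation_ratio."""
    mean = sum(pitches) / len(pitches)
    return [mean * pitch_ratio + (p - mean) * intonation_ratio for p in pitches]


def replace_and_adjust(standard, start, end, segment, unit_len,
                       feature_before=None, feature_after=None):
    """Replace standard[start:end] with segment and adjust the neighbouring units.

    feature_before / feature_after: (pitch_ratio, intonation_ratio) pairs taken
    from the prosodic segment dictionary, or None if no adjustment is stored.
    """
    if len(segment) != end - start:
        raise ValueError("segment must cover exactly standard[start:end]")
    result = list(standard)
    result[start:end] = segment
    if feature_before is not None and start > 0:
        lo = max(0, start - unit_len)
        result[lo:start] = adjust_unit(result[lo:start], *feature_before)
    if feature_after is not None and end < len(result):
        hi = min(len(result), end + unit_len)
        result[end:hi] = adjust_unit(result[end:hi], *feature_after)
    return result


# Placeholder pitch values in Hz, one per mora.
standard = [100.0, 110.0, 120.0, 130.0, 140.0, 150.0]
final = replace_and_adjust(standard, 2, 4, [200.0, 210.0], unit_len=2,
                           feature_before=(1.25, 2.0), feature_after=(0.75, 1.0))
print(final)
```

The neighbouring units keep their standard contour shape but are shifted and stretched toward the recorded phrase, which is the intent of the adjustment described above.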

Furthermore, the speech synthesizer uses the large waveform segment dictionary 25, which holds the spectrum information and excitation source information of registered phrase (fixed phrase) portions extracted from speech waveform data corresponding to utterance sentences uttered by a predetermined speaker, information on the phonemic environment preceding or following each phrase portion, duration information for each speech unit constituting the phrase, and the mora sequence information of the phrase. With this dictionary, the synthesized waveform data for a fixed phrase portion contained in the prosodic information sequence can be generated from the spectrum information and excitation source information held in the large waveform segment dictionary 25. Synthesized speech waveform data can thus be generated using, in addition to a prosody close to natural utterance, the spectrum information and excitation source information of naturally uttered original sound data. The combined effect makes it possible to generate synthesized speech waveform data in which the sentence text is uttered (reproduced) with intonation closer to natural speech.
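
One possible in-memory layout for an entry of the large waveform segment dictionary, holding the items listed above, is sketched below. The class name, field names, and the choice of one speech unit per mora are assumptions made for illustration, and every value is a placeholder rather than data from the patent.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class LargeSegmentEntry:
    moras: List[str]                    # mora sequence information of the phrase
    durations_ms: List[float]           # duration of each speech unit
    spectrum: List[Sequence[float]]     # spectrum information per speech unit
    excitation: List[Sequence[float]]   # excitation source information per speech unit
    preceding_env: Optional[str] = None # phonemic environment before the phrase
    following_env: Optional[str] = None # phonemic environment after the phrase

    def __post_init__(self):
        n = len(self.moras)
        lengths = (len(self.durations_ms), len(self.spectrum), len(self.excitation))
        if any(length != n for length in lengths):
            raise ValueError("every per-unit list must have one item per mora")

    def total_duration_ms(self):
        return sum(self.durations_ms)


# Placeholder entry for the phrase "Sonosaki+", followed by a pause in the recording.
entry = LargeSegmentEntry(
    moras=["so", "no", "sa", "ki"],
    durations_ms=[120.0, 110.0, 130.0, 140.0],
    spectrum=[[0.1], [0.2], [0.3], [0.4]],
    excitation=[[1.0], [1.0], [1.0], [1.0]],
    following_env="pau",
)
print(entry.total_duration_ms())
```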

Furthermore, when the information on the phonemic environment preceding or following a fixed phrase does not match the information on the phonemic environment preceding or following that fixed phrase in the prosodic information sequence, the mora at the mismatching end of the fixed phrase (its first or last mora) can be generated using spectrum information and excitation source information from the standard waveform segment dictionary 24, and the remaining moras can be generated using spectrum information and excitation source information from the large waveform segment dictionary 25. As a result, even when the phonemic environments do not match, synthesized speech waveform data can be generated in which the sentence text is uttered (reproduced) with intonation closer to natural speech.

In the third embodiment described above, the text analysis unit 10 corresponds to the text analysis means according to claim 8; the prosodic segment dictionary 12 corresponds to the prosodic dictionary according to any one of claims 1 to 5; the prosody generation unit 13 corresponds to the prosodic information sequence generation means according to claim 8; the processing by the registered phrase matching unit 13b, the prosodic segment selection unit 13c, and the prosody replacement shaping unit 13d that replaces the standard prosodic information sequence portion corresponding to a registered phrase with a prosodic information sequence portion composed of the prosodic segment pitch frequency information of the prosodic segment dictionary 12 corresponds to the changing means according to claim 8; the processing by the prosody replacement shaping unit 13d that adjusts, based on the prosodic feature information, the prosodic information sequence portions of the predetermined prosodic units preceding and following the replaced portion corresponds to the prosodic information adjusting means according to claim 8; the waveform synthesis unit 23 corresponds to the speech waveform generation means according to claim 8; the standard waveform segment dictionary 24 corresponds to the first segment dictionary according to claim 8; and the large waveform segment dictionary 25 corresponds to the second segment dictionary according to claim 8.

Also in the third embodiment, the following processing in the segment selection unit 23a corresponds to the determination means according to claim 8: when the mora selected from the standard prosodic information sequence is the first or last mora of a registered phrase, determining whether the predetermined phonemic environment preceding or following the selected mora in the original sound data of the registered phrase (the data registered in the large waveform segment dictionary 25) matches the predetermined phonemic environment preceding or following the selected mora in the standard prosodic information sequence.

Also in the third embodiment, steps S400 to S402 correspond to the text analysis step according to claim 12 or 15; steps S404 and S416 correspond to the prosodic information sequence generation step according to claim 12 or 15; steps S408, S410, and S414 correspond to the changing step according to claim 12 or 15; step S412 corresponds to the prosodic information adjustment step according to claim 12 or 15; and step S420 corresponds to the speech waveform generation step according to claim 12 or 15.

Also in the third embodiment, step S506 corresponds to the determination step according to claim 12 or 15.
Although the first to third embodiments have been described using Japanese as an example, the present invention is not limited to Japanese and may be applied to other languages.
In the first and third embodiments, an example was described in which the prosodic information held in the prosodic segment dictionary is the pitch frequency. However, the present invention is not limited to this: the prosodic segment dictionary may hold, in addition to or in place of the pitch frequency information, other prosody-related information such as speech rate information and volume information.

In the first and third embodiments, an example was described in which the prosodic feature information held in the prosodic segment dictionary is the pitch ratio and the intonation ratio. However, the present invention is not limited to this: the prosodic segment dictionary may hold, in addition to or in place of the pitch ratio and intonation ratio, other information obtained from the pitch frequency of the original sound data and the standard pitch frequency. When, as noted above, the prosodic information includes speech rate information, volume information, and the like, the dictionary may hold information obtained from the speech rate information and volume information of the original sound data and from the standard speech rate information and standard volume information. In other words, by adjusting the relevant prosodic information sequence portion using speech rate information and volume information in addition to pitch frequency information, synthesized speech waveform data can be generated in which the sentence text is uttered (reproduced) with intonation even closer to natural speech than with adjustment by pitch frequency alone.
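
As a rough illustration of how a pitch ratio and an intonation ratio might be derived from the original sound data and the standard pitch frequencies, consider the sketch below. The calculation method of the patent is not reproduced in this part of the text, so both definitions are assumptions: the pitch ratio is taken as the ratio of mean pitch, and the intonation ratio as the ratio of pitch range (maximum minus minimum) within the prosodic unit. The function names and the sample values are hypothetical.

```python
from statistics import fmean


def pitch_ratio(original, standard):
    """Ratio of the mean pitch of the original sound data to the standard mean pitch."""
    return fmean(original) / fmean(standard)


def intonation_ratio(original, standard):
    """Ratio of pitch ranges (max - min), used here as the size of the intonation."""
    standard_range = max(standard) - min(standard)
    if standard_range == 0:
        return 1.0  # flat standard contour: leave the intonation unchanged (a choice made here)
    return (max(original) - min(original)) / standard_range


# Placeholder pitch values in Hz for one prosodic unit next to a fixed phrase.
original_unit = [220.0, 260.0, 240.0]
standard_unit = [110.0, 130.0, 120.0]

feature = (pitch_ratio(original_unit, standard_unit),
           intonation_ratio(original_unit, standard_unit))
print(feature)
```

The same pattern would extend to speech rate or volume by computing a ratio between the original value and the standard value for the unit.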

FIG. 1 is a block diagram showing the configuration of the speech synthesizer 100 according to the first embodiment of the present invention.
FIG. 2 is a diagram showing the flow of the prosodic segment dictionary creation process.
FIG. 3 is a diagram showing an example of how the pitch ratio and the intonation ratio are calculated.
FIG. 4(a) is a diagram showing a configuration example of the prosodic segment dictionary, and FIG. 4(b) is a diagram showing an example of the pitch frequency data registered in the prosodic segment dictionary.
FIG. 5 is a flowchart showing the operation of the speech synthesizer 100.
FIG. 6(a) is a diagram showing an example of the pitch frequency sequence in the speech waveform data of an actual utterance containing a fixed phrase, and FIG. 6(b) is a diagram showing an example of the pitch frequency sequence generated for the utterance of (a) using only standard pitch frequencies.
FIG. 7 is a diagram showing an example of the standard prosodic information sequence for sentence text containing a fixed phrase.
FIG. 8(a) is a diagram showing an example of a prosodic information sequence in which the prosodic segment pitch frequency is used for the fixed phrase in the sentence text and adjustment processing based on the prosodic feature information has been applied, and FIG. 8(b) is a diagram showing an example of a prosodic information sequence generated for the same sentence text by the prior-art method.
FIG. 9 is a block diagram showing the configuration of the speech synthesizer 200 according to the second embodiment of the present invention.
FIG. 10 is a block diagram showing the detailed configuration of the waveform synthesis unit 23 of the speech synthesizer 200.
FIG. 11 is a diagram showing a configuration example of the large waveform segment dictionary.
FIG. 12 is a flowchart showing the operation of the speech synthesizer 200.
FIG. 13(a) is a diagram showing the structure of the original sound data for an example sentence, FIG. 13(b) is a diagram showing a synthesis example for the example sentence of (a) when the phonemic environment of the mora preceding the registered phrase matches, and FIG. 13(c) is a diagram showing a synthesis example for the example sentence of (a) when that phonemic environment does not match.
FIG. 14 is a block diagram showing the configuration of the speech synthesizer 300 according to the third embodiment of the present invention.
FIG. 15 is a flowchart showing the operation of the speech synthesizer 300.
FIG. 16 is a flowchart showing the process of generating synthesized speech waveform data in the waveform synthesis unit 23.
FIG. 17(a) is a diagram showing a synthesized speech waveform generated using the prior art when the phonemic environment of the mora preceding the registered phrase does not match, and FIG. 17(b) is a diagram showing a synthesized speech waveform generated using the present invention in the same case.

Explanation of symbols

10, 20 Text analysis unit
11, 21 Standard prosody dictionary
12 Prosodic segment dictionary
13 Prosody generation unit
13a, 22 Standard prosody generation unit
13b Registered phrase matching unit
13c Prosodic segment selection unit
13d Prosody replacement shaping unit
14, 23 Waveform synthesis unit
23a Segment selection unit
23b Segment connection unit
23c Synthesis unit
15, 24 Standard waveform segment dictionary
25 Large waveform segment dictionary
100 to 300 Speech synthesizer

Claims

A prosodic segment dictionary creation method for creating a prosodic segment dictionary used for speech synthesis, the method comprising:
a first prosodic information extraction step of extracting, from speech waveform data corresponding to an utterance sentence uttered by a predetermined speaker, first prosodic information that is prosodic information of the speech waveform data portion of a predetermined phrase included in the speech waveform data;
a second prosodic information extraction step of extracting second prosodic information that is prosodic information from at least one of the speech waveform data portions corresponding to predetermined prosodic units preceding and following the speech waveform data portion of the predetermined phrase;
a third prosodic information generation step of generating third prosodic information that is prosodic information corresponding to the predetermined prosodic unit with reference to a standard prosodic dictionary composed of standard prosodic information that is prosodic information for each speech unit;
a prosodic feature information generation step of generating, based on the second prosodic information extracted in the second prosodic information extraction step and the third prosodic information generated in the third prosodic information generation step, prosodic feature information indicating a characteristic difference between the second prosodic information and the third prosodic information; and
a prosodic segment dictionary creation step of creating the prosodic segment dictionary based on the first prosodic information for each predetermined phrase extracted in the first prosodic information extraction step and the prosodic feature information generated in the prosodic feature information generation step.

The prosody segment dictionary creation method according to claim 1, wherein the predetermined phrase includes a phrase having a relatively high appearance frequency.

The prosodic segment dictionary creation method according to claim 1, wherein the second prosodic information and the third prosodic information include pitch frequency information, and
the prosodic feature information includes information indicating a ratio between the pitch frequency indicated by the second prosodic information and the pitch frequency indicated by the third prosodic information.

The prosodic segment dictionary creation method according to any one of claims 1 to 3, wherein the second prosodic information and the third prosodic information include pitch frequency information, and
the prosodic feature information includes information indicating a ratio between the magnitude of intonation obtained from the pitch frequency difference indicated by the second prosodic information and the magnitude of intonation obtained from the pitch frequency difference indicated by the third prosodic information.

A prosodic segment dictionary creation program for creating a prosodic segment dictionary used for speech synthesis, the program causing a computer to execute processing comprising:
a first prosodic information extraction step of extracting, from speech waveform data corresponding to an utterance sentence uttered by a predetermined speaker, first prosodic information that is prosodic information of the speech waveform data portion of a predetermined phrase included in the speech waveform data;
a second prosodic information extraction step of extracting second prosodic information that is prosodic information from at least one of the speech waveform data portions corresponding to predetermined prosodic units preceding and following the speech waveform data portion of the predetermined phrase;
a third prosodic information generation step of generating third prosodic information that is prosodic information corresponding to the predetermined prosodic unit with reference to a standard prosodic dictionary composed of standard prosodic information that is prosodic information for each speech unit;
a prosodic feature information generation step of generating, based on the second prosodic information extracted in the second prosodic information extraction step and the third prosodic information generated in the third prosodic information generation step, prosodic feature information indicating a characteristic difference between the second prosodic information and the third prosodic information; and
a prosodic segment dictionary creation step of creating the prosodic segment dictionary based on the first prosodic information for each predetermined phrase extracted in the first prosodic information extraction step and the prosodic feature information generated in the prosodic feature information generation step.

A speech synthesizer that generates synthesized speech waveform data corresponding to sentence text,
The prosodic segment dictionary creation method according to any one of claims 1 to 4 or the prosodic segment dictionary creation program according to claim 5 includes the first prosodic information and the prosodic feature information. A prosodic segment dictionary consisting of
Standard prosodic dictionary composed of standard prosodic information that is prosodic information for each voice unit;
A segment dictionary comprising spectral information for each speech unit and excitation source information for each speech unit;
Text analysis means for performing accent analysis and morphological analysis on the sentence text;
Prosodic information sequence generation means for generating a prosodic information sequence corresponding to the sentence text based on the analysis result of the text analysis means and the standard prosodic information of the standard prosodic dictionary;
Changing means for, when the phrases corresponding to the sentence text include a phrase corresponding to first prosodic information contained in the prosodic segment dictionary, changing the prosodic information sequence portion of that phrase to a prosodic information sequence portion generated based on the first prosodic information;
Prosodic information adjusting means for performing a predetermined adjustment process on the prosodic information sequence portion corresponding to at least one of the predetermined prosodic units preceding and following the phrase portion changed by the changing means, based on the prosodic feature information that the prosodic segment dictionary holds for the changed prosodic information sequence portion; and
Speech waveform generation means for generating synthesized speech waveform data corresponding to the sentence text based on the prosodic information sequence and on the spectrum information and the excitation source information of the segment dictionary.
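The replace-then-adjust flow recited in this claim can be illustrated with a minimal Python sketch. Everything named here is hypothetical: `ProsodicSegment`, `build_pitch_sequence`, the choice of pitch in Hz as the only prosodic information, and the linear blend controlled by `weight`. The claim only requires "a predetermined adjustment process"; the blend below is one assumed possibility, not the patented method.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple


@dataclass
class ProsodicSegment:
    """One entry of a (hypothetical) prosodic segment dictionary."""
    pitch: List[float]  # first prosodic information: pitch (Hz) per speech unit
    prev_pitch: float   # feature info: pitch of the unit before the phrase in the source speech
    next_pitch: float   # feature info: pitch of the unit after the phrase in the source speech


def build_pitch_sequence(
    phrases: Sequence[Sequence[str]],
    standard_pitch: Dict[str, float],
    segment_dict: Dict[Tuple[str, ...], ProsodicSegment],
    weight: float = 0.5,
) -> List[float]:
    # Step 1: generate the sequence from the standard prosodic dictionary.
    contour = [[standard_pitch[u] for u in phrase] for phrase in phrases]

    for i, phrase in enumerate(phrases):
        seg = segment_dict.get(tuple(phrase))
        if seg is None:
            continue
        # Step 2 (changing means): replace the matched phrase portion.
        contour[i] = list(seg.pitch)
        # Step 3 (adjusting means): pull the neighbouring standard units
        # toward the pitch observed next to the phrase in the source speech.
        # The linear blend is an assumption made for this sketch.
        if i > 0 and tuple(phrases[i - 1]) not in segment_dict:
            old = contour[i - 1][-1]
            contour[i - 1][-1] = (1 - weight) * old + weight * seg.prev_pitch
        if i + 1 < len(phrases) and tuple(phrases[i + 1]) not in segment_dict:
            old = contour[i + 1][0]
            contour[i + 1][0] = (1 - weight) * old + weight * seg.next_pitch

    # Flatten to one pitch value per speech unit.
    return [p for phrase in contour for p in phrase]
```

Only the unit immediately adjacent to the replaced phrase is adjusted here; the "predetermined prosodic unit" of the claim could equally be a longer span.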

A speech synthesizer that generates synthesized speech waveform data corresponding to sentence text,
A first segment dictionary comprising spectral information for each speech unit and excitation source information for each speech unit;
A second segment dictionary comprising spectrum information for each predetermined phrase and excitation source information for each predetermined phrase, both extracted from speech waveform data corresponding to utterance sentences uttered by a predetermined speaker, together with information on the predetermined phoneme environment preceding and following each predetermined phrase in the speech waveform data;
A standard prosodic dictionary composed of standard prosodic information, which is prosodic information for each speech unit;
Text analysis means for performing accent analysis and morphological analysis on the sentence text;
Prosodic information sequence generation means for generating a prosodic information sequence corresponding to the sentence text based on the analysis result of the text analysis means and the standard prosodic information of the standard prosodic dictionary;
Determining means for determining, when the phrases corresponding to the sentence text include a phrase that matches a phrase corresponding to the spectrum information and excitation source information of the second segment dictionary, whether the predetermined phoneme environment recorded in the second segment dictionary, which preceded and followed the matching phrase in the speech waveform data when the spectrum information and the excitation source information were extracted, matches the predetermined phoneme environment preceding and following the portion corresponding to the matching phrase in the prosodic information sequence; and
Speech waveform generation means for generating synthesized speech waveform data corresponding to the sentence text based on the prosodic information sequence, the determination result of the determining means, and the spectrum information and the excitation source information of the first and second segment dictionaries,
Wherein, when the determining means determines that the phoneme environments do not match, the speech waveform generation means generates the synthesized speech waveform data corresponding to the sentence text by combining first synthesized speech waveform data, obtained by excluding the speech segment data portion at the edge on the side where the phoneme environments do not match from the synthesized speech waveform data generated from the spectrum information and excitation source information corresponding to the matching phrase in the second segment dictionary, with second synthesized speech waveform data, obtained by excluding the portion corresponding to the first synthesized speech waveform data from the synthesized speech waveform data generated for the entire sentence text based on the spectrum information and excitation source information of the first segment dictionary.
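The splicing rule in this claim can be sketched as follows. This is an illustrative simplification with assumed names (`splice_waveforms`, `left_env_ok`, `right_env_ok`): each waveform is modeled as a list of per-unit segments, and the two dictionaries are assumed to yield unit-aligned segments, so that trimming one edge unit of the recorded phrase exposes exactly one unit of the whole-sentence synthesis. The claim itself does not fix how wide the excluded edge portion is.

```python
from typing import List, TypeVar

T = TypeVar("T")  # a per-unit waveform segment (e.g. a list of samples)


def splice_waveforms(
    sentence_units: List[T],  # whole sentence synthesized from the first segment dictionary
    phrase_units: List[T],    # matched phrase synthesized from the second segment dictionary
    phrase_start: int,        # index of the phrase's first unit within the sentence
    left_env_ok: bool,        # does the preceding phoneme environment match?
    right_env_ok: bool,       # does the following phoneme environment match?
) -> List[T]:
    lo, hi = 0, len(phrase_units)
    if not left_env_ok:
        lo += 1               # drop the edge unit on the mismatching left side
    if not right_env_ok:
        hi -= 1               # drop the edge unit on the mismatching right side
    hi = max(lo, hi)          # very short phrase: nothing recorded survives

    # "First" data: recorded phrase minus the mismatching edge portions.
    first = phrase_units[lo:hi]
    # "Second" data: whole-sentence synthesis minus the span covered by `first`.
    head = sentence_units[: phrase_start + lo]
    tail = sentence_units[phrase_start + hi:]
    return head + first + tail
```

For example, with a five-unit sentence whose units 1 to 3 are covered by a recorded phrase and a mismatch only on the left, unit 1 comes from the first segment dictionary and units 2 and 3 from the recorded phrase.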

A speech synthesizer that generates synthesized speech waveform data corresponding to sentence text,
A prosodic segment dictionary created by the prosodic segment dictionary creation method according to any one of claims 1 to 4 or by the prosodic segment dictionary creation program according to claim 5, the dictionary comprising the first prosodic information and the prosodic feature information;
A standard prosodic dictionary composed of standard prosodic information, which is prosodic information for each speech unit;
A first segment dictionary comprising spectral information for each speech unit and excitation source information for each speech unit;
A second segment dictionary comprising spectrum information for each predetermined phrase and excitation source information for each predetermined phrase, both extracted from speech waveform data corresponding to utterance sentences uttered by a predetermined speaker, together with information on the predetermined phoneme environment preceding and following each predetermined phrase in the speech waveform data;
Text analysis means for performing accent analysis and morphological analysis on the sentence text;
Prosodic information sequence generation means for generating a prosodic information sequence corresponding to the sentence text based on the analysis result of the text analysis means and the standard prosodic information of the standard prosodic dictionary;
Changing means for, when the phrases corresponding to the sentence text include a phrase corresponding to first prosodic information contained in the prosodic segment dictionary, changing the prosodic information sequence portion of that phrase to a prosodic information sequence portion generated based on the first prosodic information;
Prosodic information adjusting means for performing a predetermined adjustment process on the prosodic information sequence portion corresponding to at least one of the predetermined prosodic units preceding and following the phrase portion changed by the changing means, based on the prosodic feature information that the prosodic segment dictionary holds for the changed prosodic information sequence portion;
Determining means for determining, when the phrases corresponding to the sentence text include a phrase that matches a phrase corresponding to the spectrum information and excitation source information of the second segment dictionary, whether the predetermined phoneme environment recorded in the second segment dictionary, which preceded and followed the matching phrase in the speech waveform data when the spectrum information and the excitation source information were extracted, matches the predetermined phoneme environment preceding and following the portion corresponding to the matching phrase in the prosodic information sequence; and
Speech waveform generation means for generating synthesized speech waveform data corresponding to the sentence text based on the prosodic information sequence, the determination result of the determining means, and the spectrum information and the excitation source information of the first and second segment dictionaries,
Wherein, when the determining means determines that the phoneme environments do not match, the speech waveform generation means generates the synthesized speech waveform data corresponding to the sentence text by combining first synthesized speech waveform data, obtained by excluding the speech segment data portion on the non-matching side from the synthesized speech waveform data generated from the spectrum information and excitation source information corresponding to the matching phrase in the second segment dictionary, with second synthesized speech waveform data, obtained by excluding the portion corresponding to the first synthesized speech waveform data from the synthesized speech waveform data generated for the entire sentence text based on the spectrum information and excitation source information of the first segment dictionary.

The speech synthesizer according to claim 6 or claim 8, wherein, when the prosodic segment dictionary includes a plurality of pieces of first prosodic information corresponding to the same phrase as a phrase constituting the sentence text, the changing means changes the prosodic information sequence portion of that phrase to a prosodic information sequence portion generated based on the piece of first prosodic information that connects best with the prosodic information sequence portions preceding and following the corresponding phrase in the prosodic information sequence.
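Selecting the best-connecting candidate can be sketched as a minimum-cost choice. The claim does not define how connectivity is measured, so the cost used here (the sum of the absolute pitch jumps at the two junctions) and the name `pick_best_candidate` are assumptions made purely for illustration.

```python
from typing import List, Sequence


def pick_best_candidate(
    candidates: Sequence[List[float]],  # pitch contours (Hz) stored for the same phrase
    prev_pitch: float,                  # last pitch of the preceding sequence portion
    next_pitch: float,                  # first pitch of the following sequence portion
) -> List[float]:
    def junction_cost(contour: List[float]) -> float:
        # Assumed connectivity measure: smaller pitch jumps at both
        # junctions mean the candidate connects better.
        return abs(contour[0] - prev_pitch) + abs(contour[-1] - next_pitch)

    return min(candidates, key=junction_cost)
```

A real system might also weigh duration or power at the junctions; this sketch considers pitch only.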

A speech synthesis program for generating synthesized speech waveform data corresponding to sentence text,
A text analysis step for performing accent analysis and morphological analysis on the sentence text;
A prosodic information sequence generation step of generating a prosodic information sequence corresponding to the sentence text based on the analysis result of the text analysis step and on standard prosodic information contained in a standard prosodic dictionary composed of standard prosodic information, which is prosodic information for each speech unit;
A changing step of, when the phrases corresponding to the sentence text include a phrase corresponding to first prosodic information contained in a prosodic segment dictionary that is created by the prosodic segment dictionary creation method according to any one of claims 1 to 4 or by the prosodic segment dictionary creation program according to claim 5 and that comprises the first prosodic information and the prosodic feature information, changing the prosodic information sequence portion of that phrase to a prosodic information sequence portion generated based on the first prosodic information;
A prosodic information adjustment step of performing a predetermined adjustment process on the prosodic information sequence portion corresponding to at least one of the predetermined prosodic units preceding and following the phrase portion changed in the changing step, based on the prosodic feature information that the prosodic segment dictionary holds for the changed prosodic information sequence portion; and
A speech waveform generation step of generating synthesized speech waveform data corresponding to the sentence text based on the prosodic information sequence and on the spectrum information and excitation source information of a segment dictionary comprising spectrum information for each speech unit and excitation source information for each speech unit; the speech synthesis program including a program for causing a computer to execute a process comprising the above steps.

A speech synthesis program for generating synthesized speech waveform data corresponding to sentence text,
A text analysis step for performing accent analysis and morphological analysis on the sentence text;
A prosodic information sequence generation step of generating a prosodic information sequence corresponding to the sentence text based on the analysis result of the text analysis step and on standard prosodic information contained in a standard prosodic dictionary composed of standard prosodic information, which is prosodic information for each speech unit;
A determination step of determining, when the phrases corresponding to the sentence text include a phrase that matches a phrase corresponding to the spectrum information and excitation source information of a second segment dictionary, whether the predetermined phoneme environment recorded in the second segment dictionary, which preceded and followed the matching phrase in the speech waveform data when the spectrum information and the excitation source information were extracted, matches the predetermined phoneme environment preceding and following the portion corresponding to the matching phrase in the prosodic information sequence, the second segment dictionary comprising spectrum information for each predetermined phrase and excitation source information for each predetermined phrase, both extracted from speech waveform data corresponding to utterance sentences uttered by a predetermined speaker, together with information on the predetermined phoneme environment preceding and following each predetermined phrase in the speech waveform data; and
A speech waveform generation step of generating synthesized speech waveform data corresponding to the sentence text based on the prosodic information sequence, the determination result of the determination step, the spectrum information and excitation source information, for the entire sentence text, of a first segment dictionary comprising spectrum information for each speech unit and excitation source information for each speech unit, and the spectrum information and excitation source information of the second segment dictionary; the speech synthesis program including a program for causing a computer to execute a process comprising the above steps,
Wherein, in the speech waveform generation step, when the determination step determines that the phoneme environments do not match, the synthesized speech waveform data corresponding to the sentence text is generated by combining first synthesized speech waveform data, obtained by excluding the speech segment data portion at the non-matching side edge from the synthesized speech waveform data generated from the spectrum information and excitation source information corresponding to the matching phrase in the second segment dictionary, with second synthesized speech waveform data, obtained by excluding the portion corresponding to the first synthesized speech waveform data from the synthesized speech waveform data generated based on the spectrum information and excitation source information of the first segment dictionary.

A speech synthesis program for generating synthesized speech waveform data corresponding to sentence text,
A text analysis step for performing accent analysis and morphological analysis on the sentence text;
A prosodic information sequence generation step of generating a prosodic information sequence corresponding to the sentence text based on the analysis result of the text analysis step and on standard prosodic information contained in a standard prosodic dictionary composed of standard prosodic information, which is prosodic information for each speech unit;
A changing step of, when the phrases corresponding to the sentence text include a phrase corresponding to first prosodic information contained in a prosodic segment dictionary that is created by the prosodic segment dictionary creation method according to any one of claims 1 to 4 or by the prosodic segment dictionary creation program according to claim 5 and that comprises the first prosodic information and the prosodic feature information, changing the prosodic information sequence portion of that phrase to a prosodic information sequence portion generated based on the first prosodic information;
A prosodic information adjustment step of performing a predetermined adjustment process on the prosodic information sequence portion corresponding to at least one of the predetermined prosodic units preceding and following the phrase portion changed in the changing step, based on the prosodic feature information that the prosodic segment dictionary holds for the changed prosodic information sequence portion;
A determination step of determining, when the phrases corresponding to the sentence text include a phrase that matches a phrase corresponding to the spectrum information and excitation source information of a second segment dictionary, whether the predetermined phoneme environment recorded in the second segment dictionary, which preceded and followed the matching phrase in the speech waveform data when the spectrum information and the excitation source information were extracted, matches the predetermined phoneme environment preceding and following the portion corresponding to the matching phrase in the prosodic information sequence, the second segment dictionary comprising spectrum information for each predetermined phrase and excitation source information for each predetermined phrase, both extracted from speech waveform data corresponding to utterance sentences uttered by a predetermined speaker, together with information on the predetermined phoneme environment preceding and following each predetermined phrase in the speech waveform data; and
A speech waveform generation step of generating synthesized speech waveform data corresponding to the sentence text based on the prosodic information sequence, the determination result of the determination step, the spectrum information and excitation source information, for the entire sentence text, of a first segment dictionary comprising spectrum information for each speech unit and excitation source information for each speech unit, and the spectrum information and excitation source information of the second segment dictionary; the speech synthesis program including a program for causing a computer to execute a process comprising the above steps,
Wherein, in the speech waveform generation step, when the determination step determines that the phoneme environments do not match, the synthesized speech waveform data corresponding to the sentence text is generated by combining first synthesized speech waveform data, obtained by excluding the speech segment data portion at the non-matching side edge from the synthesized speech waveform data generated from the spectrum information and excitation source information corresponding to the matching phrase in the second segment dictionary, with second synthesized speech waveform data, obtained by excluding the portion corresponding to the first synthesized speech waveform data from the synthesized speech waveform data generated based on the spectrum information and excitation source information of the first segment dictionary.

A speech synthesis method for generating synthesized speech waveform data corresponding to sentence text,
A text analysis step for performing accent analysis and morphological analysis on the sentence text;
A prosodic information sequence generation step of generating a prosodic information sequence corresponding to the sentence text based on the analysis result of the text analysis step and on standard prosodic information contained in a standard prosodic dictionary composed of standard prosodic information, which is prosodic information for each speech unit;
A changing step of, when the phrases corresponding to the sentence text include a phrase corresponding to first prosodic information contained in a prosodic segment dictionary that is created by the prosodic segment dictionary creation method according to any one of claims 1 to 4 or by the prosodic segment dictionary creation program according to claim 5 and that comprises the first prosodic information and the prosodic feature information, changing the prosodic information sequence portion of that phrase to a prosodic information sequence portion generated based on the first prosodic information;
A prosodic information adjustment step of performing a predetermined adjustment process on the prosodic information sequence portion corresponding to at least one of the predetermined prosodic units preceding and following the phrase portion changed in the changing step, based on the prosodic feature information that the prosodic segment dictionary holds for the changed prosodic information sequence portion; and
A speech waveform generation step of generating synthesized speech waveform data corresponding to the sentence text based on the prosodic information sequence and on the spectrum information and excitation source information of a segment dictionary comprising spectrum information for each speech unit and excitation source information for each speech unit.

A speech synthesis method for generating synthesized speech waveform data corresponding to sentence text,
A text analysis step for performing accent analysis and morphological analysis on the sentence text;
A prosodic information sequence generation step of generating a prosodic information sequence corresponding to the sentence text based on the analysis result of the text analysis step and on standard prosodic information contained in a standard prosodic dictionary composed of standard prosodic information, which is prosodic information for each speech unit;
A determination step of determining, when the phrases corresponding to the sentence text include a phrase that matches a phrase corresponding to the spectrum information and excitation source information of a second segment dictionary, whether the predetermined phoneme environment recorded in the second segment dictionary, which preceded and followed the matching phrase in the speech waveform data when the spectrum information and the excitation source information were extracted, matches the predetermined phoneme environment preceding and following the portion corresponding to the matching phrase in the prosodic information sequence, the second segment dictionary comprising spectrum information for each predetermined phrase and excitation source information for each predetermined phrase, both extracted from speech waveform data corresponding to utterance sentences uttered by a predetermined speaker, together with information on the predetermined phoneme environment preceding and following each predetermined phrase in the speech waveform data; and
A speech waveform generation step of generating synthesized speech waveform data corresponding to the sentence text based on the prosodic information sequence, the determination result of the determination step, the spectrum information and excitation source information, for the entire sentence text, of a first segment dictionary comprising spectrum information for each speech unit and excitation source information for each speech unit, and the spectrum information and excitation source information of the second segment dictionary,
Wherein, in the speech waveform generation step, when the determination step determines that the phoneme environments do not match, the synthesized speech waveform data corresponding to the sentence text is generated by combining first synthesized speech waveform data, obtained by excluding the speech segment data portion at the non-matching side edge from the synthesized speech waveform data generated from the spectrum information and excitation source information corresponding to the matching phrase in the second segment dictionary, with second synthesized speech waveform data, obtained by excluding the portion corresponding to the first synthesized speech waveform data from the synthesized speech waveform data generated based on the spectrum information and excitation source information of the first segment dictionary.

A speech synthesis method for generating synthesized speech waveform data corresponding to sentence text,
A text analysis step for performing accent analysis and morphological analysis on the sentence text;
A prosodic information sequence generation step of generating a prosodic information sequence corresponding to the sentence text based on the analysis result of the text analysis step and on standard prosodic information contained in a standard prosodic dictionary composed of standard prosodic information, which is prosodic information for each speech unit;
A changing step of, when the phrases corresponding to the sentence text include a phrase corresponding to first prosodic information contained in a prosodic segment dictionary that is created by the prosodic segment dictionary creation method according to any one of claims 1 to 4 or by the prosodic segment dictionary creation program according to claim 5 and that comprises the first prosodic information and the prosodic feature information, changing the prosodic information sequence portion of that phrase to a prosodic information sequence portion generated based on the first prosodic information;
A prosodic information adjustment step of performing a predetermined adjustment process on the prosodic information sequence portion corresponding to at least one of the predetermined prosodic units preceding and following the phrase portion changed in the changing step, based on the prosodic feature information that the prosodic segment dictionary holds for the changed prosodic information sequence portion;
A determination step of determining, when the phrases corresponding to the sentence text include a phrase that matches a phrase corresponding to the spectrum information and excitation source information of a second segment dictionary, whether the predetermined phoneme environment recorded in the second segment dictionary, which preceded and followed the matching phrase in the speech waveform data when the spectrum information and the excitation source information were extracted, matches the predetermined phoneme environment preceding and following the portion corresponding to the matching phrase in the prosodic information sequence, the second segment dictionary comprising spectrum information for each predetermined phrase and excitation source information for each predetermined phrase, both extracted from speech waveform data corresponding to utterance sentences uttered by a predetermined speaker, together with information on the predetermined phoneme environment preceding and following each predetermined phrase in the speech waveform data; and
A speech waveform generation step of generating synthesized speech waveform data corresponding to the sentence text based on the prosodic information sequence, the determination result of the determination step, the spectrum information and excitation source information, for the entire sentence text, of a first segment dictionary comprising spectrum information for each speech unit and excitation source information for each speech unit, and the spectrum information and excitation source information of the second segment dictionary,
Wherein, in the speech waveform generation step, when the determination step determines that the phoneme environments do not match, the synthesized speech waveform data corresponding to the sentence text is generated by combining first synthesized speech waveform data, obtained by excluding the speech segment data portion at the non-matching side edge from the synthesized speech waveform data generated from the spectrum information and excitation source information corresponding to the matching phrase in the second segment dictionary, with second synthesized speech waveform data, obtained by excluding the portion corresponding to the first synthesized speech waveform data from the synthesized speech waveform data generated based on the spectrum information and excitation source information of the first segment dictionary.