JP4736524B2

JP4736524B2 - Speech synthesis apparatus and speech synthesis program

Info

Publication number: JP4736524B2
Application number: JP2005133419A
Authority: JP
Inventors: 慈明小松; 亜紀子大和
Original assignee: Brother Industries Ltd
Current assignee: Brother Industries Ltd
Priority date: 2005-04-28
Filing date: 2005-04-28
Publication date: 2011-07-27
Anticipated expiration: 2025-04-28
Also published as: JP2006308998A

Description

The present invention relates to a speech synthesis apparatus and a speech synthesis program, and more particularly to a speech synthesis apparatus and a speech synthesis program capable of outputting speech that does not sound unnatural.

Conventionally, in speech synthesis, sentences whose sentence-final intonation differs from that of a declarative sentence and whose pitch rises, such as questions, sentences seeking agreement, and sentences urging an action, have been handled in one of two ways. In the pitch pattern generation method of the invention described in Patent Document 1, a plurality of rising correction patterns are prepared in advance, and one of them is aligned with the end of the basic pattern and added to it to generate the pitch pattern of the rising sentence. In the speech synthesizer of the invention described in Patent Document 2, a pitch correction amount is set for each mora position according to the number of morae in the breath group, and the original pitch is corrected by that amount.
Patent Document 1: JP 2004-226505 A
Patent Document 2: JP 2002-196800 A

However, the pitch pattern generation method and the speech synthesizer of the inventions described in Patent Documents 1 and 2 merely add a correction amount to the original pitch, so the resulting speech may sound unnatural when it is output.

The present invention has been made to solve the above problem, and its object is to provide a speech synthesis apparatus and a speech synthesis program that can output speech that does not sound unnatural.

To solve the above problem, the speech synthesis apparatus of the invention according to claim 1 comprises: acoustic dictionary storage means for storing an acoustic dictionary, which is a set of acoustic models each including at least a phoneme model created from phoneme data obtained by analyzing speech into an acoustic parameter sequence and a prosody model created from fundamental frequency data obtained by analyzing speech; interrogative acoustic dictionary storage means for storing an interrogative acoustic dictionary, which is separate from the acoustic dictionary and is a set of interrogative acoustic models each including at least an interrogative phoneme model created from the phoneme data of speech uttering sentences whose sentence-final intonation differs from that of a declarative sentence, such as questions, sentences seeking agreement, and sentences urging an action, and an interrogative prosody model created from the fundamental frequency data of such speech; language analysis means for decomposing the sentence to be synthesized into words, determining their parts of speech, determining for each accent phrase an accent type indicating its accent position, and determining the reading of the sentence; acoustic model selection means for selecting acoustic models from the acoustic dictionary based on the analysis result of the language analysis means; and speech generation means for generating speech from the phoneme models and prosody models constituting the acoustic models selected by the acoustic model selection means. When the end of the sentence to be synthesized matches a predetermined sentence-end pattern, when the sentence contains an interrogative word, or both, the acoustic model selection means selects the acoustic models for one of the sentence-final phoneme, a predetermined number of sentence-final morae, the sentence-final accent phrase, or the whole sentence from the interrogative acoustic models of the interrogative acoustic dictionary instead of from the acoustic dictionary.

In the speech synthesis apparatus of the invention according to claim 2, in addition to the configuration of the invention according to claim 1, the predetermined sentence-end pattern is that the last character of the sentence is a question mark.

In the speech synthesis apparatus of the invention according to claim 3, in addition to the configuration of the invention according to claim 1 or 2, the predetermined sentence-end pattern is that the sentence ends with a word that asks a question, a word that seeks agreement, or a word that urges an action.
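The trigger conditions of claims 1 to 3 can be sketched in Python as follows. This is only an illustration under stated assumptions, not the patented implementation: the function name, the pattern list, and the interrogative-word list are hypothetical, and plain substring matching stands in for the morphological analysis a real system would use (so a word such as "いつも" would wrongly match "いつ").

```python
# Hypothetical sentence-end patterns: a question mark (claim 2) and a few
# sentence-final words that ask, seek agreement, or urge action (claim 3).
SENTENCE_END_PATTERNS = ("？", "?", "ですか", "よね", "ましょう")

# Hypothetical interrogative words (what, who, where, when).
INTERROGATIVES = ("何", "誰", "どこ", "いつ")


def needs_question_dictionary(sentence: str) -> bool:
    """Return True if the interrogative acoustic dictionary should be used.

    The condition is the one stated in claim 1: the sentence end matches a
    predetermined pattern, or the sentence contains an interrogative word.
    """
    # Drop a trailing full stop so that word patterns can match the end.
    body = sentence.strip().rstrip("。．.")
    ends_with_pattern = any(body.endswith(p) for p in SENTENCE_END_PATTERNS)
    has_interrogative = any(w in body for w in INTERROGATIVES)
    return ends_with_pattern or has_interrogative
```

For example, `needs_question_dictionary("そういえば、京都に行った？")` is true because of the final question mark, while the same sentence ending in "。" is not flagged.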

The speech synthesis program of the invention according to claim 4 causes a computer to execute: a language analysis step of decomposing the sentence to be synthesized into words, determining their parts of speech, determining for each accent phrase an accent type indicating its accent position, and determining the reading of the sentence; an acoustic model selection step of selecting acoustic models, based on the analysis result of the language analysis step, from an acoustic dictionary that is a set of acoustic models each including at least a phoneme model created from phoneme data obtained by analyzing speech into an acoustic parameter sequence and a prosody model created from fundamental frequency data obtained by analyzing speech; and a speech generation step of generating speech from the phoneme models and prosody models constituting the acoustic models selected in the acoustic model selection step. In the acoustic model selection step, when the end of the sentence to be synthesized matches a predetermined sentence-end pattern, when the sentence contains an interrogative word, or both, the acoustic models for one of the sentence-final phoneme, a predetermined number of sentence-final morae, the sentence-final accent phrase, or the whole sentence are selected not from the acoustic dictionary but from the interrogative acoustic models of an interrogative acoustic dictionary. The interrogative acoustic dictionary is a set of interrogative acoustic models each including at least an interrogative phoneme model created from the phoneme data of speech uttering sentences whose sentence-final intonation differs from that of a declarative sentence, such as questions, sentences seeking agreement, and sentences urging an action, and an interrogative prosody model created from the fundamental frequency data of such speech.

In the speech synthesis program of the invention according to claim 5, in addition to the configuration of the invention according to claim 4, the predetermined sentence-end pattern handled by the computer is that the last character of the sentence is a question mark.

In the speech synthesis program of the invention according to claim 6, in addition to the configuration of the invention according to claim 4 or 5, the predetermined sentence-end pattern handled by the computer is that the sentence ends with a word that asks a question, a word that seeks agreement, or a word that urges an action.

In the speech synthesis apparatus of the invention according to claim 1, the acoustic dictionary storage means stores the acoustic dictionary, which is a set of acoustic models each including at least a phoneme model created from phoneme data obtained by analyzing speech into an acoustic parameter sequence and a prosody model created from fundamental frequency data obtained by analyzing speech. The interrogative acoustic dictionary storage means stores the interrogative acoustic dictionary, which is separate from the acoustic dictionary and is a set of interrogative acoustic models each including at least an interrogative phoneme model created from the phoneme data of speech uttering sentences whose sentence-final intonation differs from that of a declarative sentence, such as questions, sentences seeking agreement, and sentences urging an action, and an interrogative prosody model created from the fundamental frequency data of such speech. The language analysis means decomposes the sentence to be synthesized into words, determines their parts of speech, determines for each accent phrase an accent type indicating its accent position, and determines the reading of the sentence. The acoustic model selection means selects acoustic models from the acoustic dictionary based on the analysis result of the language analysis means, and the speech generation means can generate speech from the phoneme models and prosody models constituting the selected acoustic models. Furthermore, when the end of the sentence to be synthesized matches a predetermined sentence-end pattern, when the sentence contains an interrogative word, or both, the acoustic model selection means can select the acoustic models for one of the sentence-final phoneme, a predetermined number of sentence-final morae, the sentence-final accent phrase, or the whole sentence from the interrogative acoustic models of the interrogative acoustic dictionary instead of from the acoustic dictionary. Speech is therefore generated using acoustic models that were built in the first place from the phoneme data of speech uttering sentences whose sentence-final intonation differs from that of a declarative sentence, so natural speech that does not sound out of place can be synthesized.

In the speech synthesis apparatus of the invention according to claim 2, in addition to the effect of the invention according to claim 1, the predetermined sentence-end pattern can be that the last character of the sentence is a question mark. A sentence with a question mark can thus be judged to have a sentence-final intonation different from that of a declarative sentence, so whether to use the interrogative acoustic dictionary can be decided easily.

In the speech synthesis apparatus of the invention according to claim 3, in addition to the effect of the invention according to claim 1 or 2, the predetermined sentence-end pattern can be that the sentence ends with a word that asks a question, a word that seeks agreement, or a word that urges an action. A sentence ending with such a word can thus be judged to have a sentence-final intonation different from that of a declarative sentence, so whether to use the interrogative acoustic dictionary can be decided easily.

The speech synthesis program of the invention according to claim 4 can cause a computer to execute: a language analysis step of decomposing the sentence to be synthesized into words, determining their parts of speech, determining for each accent phrase an accent type indicating its accent position, and determining the reading of the sentence; an acoustic model selection step of selecting acoustic models, based on the analysis result of the language analysis step, from an acoustic dictionary that is a set of acoustic models each including at least a phoneme model created from phoneme data obtained by analyzing speech into an acoustic parameter sequence and a prosody model created from fundamental frequency data obtained by analyzing speech; and a speech generation step of generating speech from the phoneme models and prosody models constituting the acoustic models selected in the acoustic model selection step. In the acoustic model selection step, when the end of the sentence to be synthesized matches a predetermined sentence-end pattern, when the sentence contains an interrogative word, or both, the acoustic models for one of the sentence-final phoneme, a predetermined number of sentence-final morae, the sentence-final accent phrase, or the whole sentence can be selected not from the acoustic dictionary but from the interrogative acoustic models of the interrogative acoustic dictionary, which is a set of interrogative acoustic models each including at least an interrogative phoneme model created from the phoneme data of speech uttering sentences whose sentence-final intonation differs from that of a declarative sentence, such as questions, sentences seeking agreement, and sentences urging an action, and an interrogative prosody model created from the fundamental frequency data of such speech. Speech is therefore generated using acoustic models that were built in the first place from the phoneme data of speech uttering sentences whose sentence-final intonation differs from that of a declarative sentence, so natural speech that does not sound out of place can be synthesized.

In the speech synthesis program of the invention according to claim 5, in addition to the effect of the invention according to claim 4, the predetermined sentence-end pattern can be that the last character of the sentence is a question mark. A sentence with a question mark can thus be judged to have a sentence-final intonation different from that of a declarative sentence, so whether to use the interrogative acoustic dictionary can be decided easily.

In the speech synthesis program of the invention according to claim 6, in addition to the effect of the invention according to claim 4 or 5, the predetermined sentence-end pattern can be that the sentence ends with a word that asks a question, a word that seeks agreement, or a word that urges an action. A sentence ending with such a word can thus be judged to have a sentence-final intonation different from that of a declarative sentence, so whether to use the interrogative acoustic dictionary can be decided easily.

Embodiments of the present invention are described below with reference to the drawings. First, the speech synthesis apparatus 1 of this embodiment is described with reference to FIG. 1, which is a block diagram showing the electrical configuration of the speech synthesis apparatus 1. As shown in FIG. 1, the speech synthesis apparatus 1 is provided with a CPU 2 that controls the apparatus. Connected to the CPU 2 are a keyboard 3, a RAM 4 that temporarily stores various data, a ROM 5 that stores an acoustic dictionary 50, a sentence-end pattern dictionary 55, an interrogative word dictionary 56, and the like, a digital-to-analog converter (DAC) 6, and a timer 9. An amplifier (AMP) 7 is connected to the DAC 6, and a speaker 8 is connected to the AMP 7.

The RAM 4 is provided with various storage areas for the variables and generated data used during speech synthesis processing. For example, the text storage area 41 stores the text that is entered from the keyboard 3 and is to be synthesized. The analysis result storage area 42 stores, among other things, the result of analyzing the text in the text storage area 41 by language analysis 11 (see FIG. 2).

The mcep sequence storage area 43 stores the mcep sequence generated by mcep sequence generation 15 (see FIG. 2), and the pitch sequence storage area 44 stores the pitch sequence generated by pitch sequence generation 16 (see FIG. 2). The sound source signal storage area 45 stores the sound source signal generated by sound source signal generation 17 (see FIG. 2), and the output speech waveform storage area 46 stores the waveform of the output speech generated by the MLSA filter 23 (see FIG. 2).

The acoustic model information storage area 61 stores, for each phoneme, the acoustic model for the text held in the text storage area 41, and the interrogative flag storage area 62 stores, for each phoneme, the interrogative flag for that text.

Next, the functional configuration of the speech synthesis apparatus and speech synthesis program of this embodiment is described with reference to FIG. 2, which is a functional configuration diagram of this embodiment. As shown in FIG. 2, the text to be synthesized first undergoes language analysis 11. In language analysis 11, the input text is analyzed and its reading and accent types are output.

For example, suppose that the sentence 「そういえば、京都に行った？」 ("Come to think of it, did you go to Kyoto?") is stored in the text storage area 41, as shown in FIG. 3. FIG. 3 is a schematic diagram of a text storage area 411, which is an example of the text storage area 41. First, a language dictionary (not shown) holding part-of-speech information, reading information, connection information, accent information, and the like is consulted, and morphological analysis is performed by the well-known longest-match method. The sentence is split into 「そう」, 「いえ」, 「ば」, 「京都」, 「に」, 「行っ」, 「た」, and 「？」, and the part of speech of each is determined. The connection information in the language dictionary is then consulted to group the morphemes into compound words, giving 「そういえば」, 「京都に」, and 「行った？」. During morphological analysis, the accent position is also determined from the accent information in the language dictionary, and when grouping into compound words moves the accent position of a word, the accent position is updated accordingly. Finally, the reading information is consulted and the character string is replaced by a katakana string, so that the analysis result 「ソーイエバ（４）｜キョートニ（１）／イッタ（０）？」 is output and stored in the analysis result storage area 42, as shown in FIG. 4. FIG. 4 is a schematic diagram of an analysis result storage area 421, which is an example of the result of performing language analysis 11 on the information in the text storage area 411. The part-of-speech information for each word is also stored in the analysis result storage area 42. Here, 「｜」 indicates a breath-group boundary, 「（）」 marks the end of an accent phrase, and the number inside 「（）」 is the accent position of that accent phrase.
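As a rough illustration of the notation, the analysis result string can be split back into breath groups and accent phrases with a few lines of Python. This is a sketch under assumptions, not part of the patent: the function name is hypothetical, and it assumes that 「／」 separates accent phrases within a breath group, as in the example string.

```python
import re


def parse_analysis(result: str):
    """Split an analysis string into breath groups of (reading, accent_type).

    Breath groups are separated by the full-width bar. Each accent phrase is
    a katakana reading followed by its accent type in full-width parentheses.
    """
    groups = []
    # Drop a trailing question mark or full stop before splitting.
    for chunk in result.rstrip("？?。").split("｜"):
        phrases = [
            (m.group(1), int(m.group(2)))  # int() accepts full-width digits
            for m in re.finditer(r"([^（）／]+)（(\d+)）", chunk)
        ]
        groups.append(phrases)
    return groups
```

Applied to the example string, this yields two breath groups: one containing ソーイエバ with accent type 4, and one containing キョートニ with accent type 1 followed by イッタ with accent type 0.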

Acoustic model selection 12 is then performed based on the accent types and reading obtained by language analysis 11. Here, the reading is further decomposed into phonemes, and an acoustic model is selected from the acoustic dictionary 50 for each phoneme. As shown in FIG. 2, the acoustic dictionary 50 consists of phoneme models 51 and prosody models 52, and the combination of a phoneme model and a prosody model is called an acoustic model. The phoneme models 51 comprise normal phoneme models 511 and interrogative phoneme models 512, and the prosody models 52 comprise normal prosody models 521 and interrogative prosody models 522.

The normal phoneme models 511 and the normal prosody models 521 are a set of phoneme models and a set of prosody models created from recordings of a voice uttering declarative sentences. The interrogative phoneme models 512 and the interrogative prosody models 522 are a set of phoneme models and a set of prosody models created from recordings of a voice uttering sentences that, unlike declarative sentences, rise in pitch at the end, such as questions, sentences seeking agreement, and sentences urging an action.

In this embodiment, the models of the last accent phrase are changed when the sentence ends with "?", when the sentence end matches a predetermined sentence-end pattern, or when a predetermined interrogative word appears in the sentence. In that case, the phoneme model of the last accent phrase is selected from the interrogative phoneme models 512 and its prosody model is selected from the interrogative prosody models 522 to form an interrogative acoustic model.
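One way to picture this step is as a per-phoneme flag that marks which phonemes should take their models from the interrogative sets. The Python sketch below is a hypothetical illustration only: the function name and the list-of-lists input are assumptions, and it covers just the case used in this embodiment, where the last accent phrase is switched.

```python
def question_flags(accent_phrases, is_question):
    """Return one boolean per phoneme, in order.

    accent_phrases: list of accent phrases, each a list of phoneme labels.
    is_question: True when one of the trigger conditions holds.
    A flag is True only for phonemes in the last accent phrase of a
    sentence that triggered interrogative handling.
    """
    flags = []
    last = len(accent_phrases) - 1
    for index, phonemes in enumerate(accent_phrases):
        use_question_models = is_question and index == last
        flags.extend([use_question_models] * len(phonemes))
    return flags
```

With the phrases `[["ky", "o", "o", "t", "o", "n", "i"], ["i", "cl", "t", "a"]]` and `is_question=True`, only the last four phonemes are flagged.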

The normal phoneme models 511 hold phoneme models for the 38 phonemes "a, b, by, ch, cl, d, dy, e, f, fy, g, gy, h, hy, i, j, k, ky, m, my, n, N, ny, o, p, pau, py, r, ry, s, sh, t, ts, ty, u, w, y, z". These phoneme models are obtained, for example, by mel-cepstral analysis of natural speech. The duration of each phoneme model is divided into frames (one frame being 10 ms), and for each frame the mel-cepstral coefficients, information on whether the frame is voiced or unvoiced, and the like are stored. Note that "pau" denotes a pause (a breath-group boundary).
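The layout described here can be pictured as a small data structure. The following Python sketch is illustrative only: the class and field names are hypothetical, and the number of mel-cepstral coefficients per frame is left open because the text does not specify it.

```python
from dataclasses import dataclass
from typing import List

FRAME_MS = 10  # one frame covers 10 ms, as stated in the text

# The 38 phoneme labels listed in the text.
PHONEMES = (
    "a b by ch cl d dy e f fy g gy h hy i j k ky m my n N ny "
    "o p pau py r ry s sh t ts ty u w y z"
).split()


@dataclass
class Frame:
    mcep: List[float]  # mel-cepstral coefficients for this frame
    voiced: bool       # True if the frame is voiced, False if unvoiced


@dataclass
class PhonemeModel:
    phoneme: str
    frames: List[Frame]

    def duration_ms(self) -> int:
        """Duration implied by the number of 10 ms frames."""
        return len(self.frames) * FRAME_MS
```

A model holding three frames therefore represents 30 ms of speech.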

The normal prosody models 521 hold, for each phoneme, prosody models that supply the information for pitch sequence generation, indexed by accent type and by mora position within the accent phrase. These prosody models are obtained, for example, by fundamental frequency analysis of natural speech, and they store pitch data for each frame.

In this embodiment, the combined information of the phoneme model and prosody model selected for each phoneme is called an acoustic model, and it is written by putting the phoneme in parentheses; for example, the acoustic model of the phoneme "a" is written "(a)".

The interrogative phoneme models 512 and the interrogative prosody models 522 are a set of phoneme models and a set of prosody models created from recordings of a voice uttering sentences that, unlike declarative sentences, rise in pitch at the end, such as questions, sentences seeking agreement, and sentences urging an action. The recordings used to create them are utterances of sentences covering a variety of patterns of the phoneme type of the character whose pitch rises, the accent type, and the mora position within the accent phrase. Consequently, an interrogative acoustic model, which consists of an interrogative phoneme model and an interrogative prosody model selected from the interrogative phoneme models 512 and the interrogative prosody models 522, does not exist as a single model per phoneme. Instead, there are as many as there are combinations of phoneme type, accent type of the accent phrase containing the phoneme, and mora position within the accent phrase. In this embodiment, to distinguish interrogative acoustic models from normal acoustic models, they are written with q, q1, q2, and so on appended to the phoneme type, as in "(aq)", "(aq2)", and "(aq3)".

For example, the sentence 「そういえば、京都に行った。」 ("Come to think of it, I went to Kyoto.") is decomposed into the phonemes "s_o_o_i_e_b_a_pau_ky_o_o_t_o_n_i_i_cl_t_a", as shown in FIG. 5. FIG. 5 is a schematic diagram of an acoustic model storage area 611, which is an example of the acoustic model information storage area 61 and stores the acoustic models for this sentence. In phoneme model selection 13, the mel-cepstral coefficients for each frame, the information on whether each frame is voiced or unvoiced, and the like are selected from the normal phoneme models 511 for each phoneme. In prosody model selection 14, the accent types and mora counts of the accent phrases are arranged as "(5,4), pau, (4,1), (3,0)", and for each phoneme the prosody model that supplies the information for pitch sequence generation is selected from the normal prosody models 521 according to the accent type and the mora position within the accent phrase. Here "(5,4), pau, (4,1), (3,0)" means a 5-mora (beat) prosody model of accent type 4, followed by a pause, then a 4-mora model of accent type 1, and then a 3-mora model of accent type 0.
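The "(5,4), pau, (4,1), (3,0)" summary can be reproduced from the katakana readings and accent types with a short Python sketch. This is a hedged illustration rather than the patent's procedure: the function names are hypothetical, and the mora count uses the common simplification that every katakana character is one mora except the small kana that combine with the preceding character.

```python
# Small kana that merge with the preceding character into one mora.
# The long-vowel mark and the small tsu each count as a mora of their own.
SMALL_KANA = set("ャュョァィゥェォヮ")


def count_moras(reading: str) -> int:
    """Count moras in a katakana reading."""
    return sum(1 for ch in reading if ch not in SMALL_KANA)


def prosody_summary(breath_groups):
    """Build the (mora_count, accent_type) list with "pau" between groups.

    breath_groups: list of breath groups, each a list of
    (katakana_reading, accent_type) pairs.
    """
    summary = []
    for index, group in enumerate(breath_groups):
        if index > 0:
            summary.append("pau")  # pause at each breath-group boundary
        summary.extend((count_moras(r), accent) for r, accent in group)
    return summary
```

For the example sentence, the input `[[("ソーイエバ", 4)], [("キョートニ", 1), ("イッタ", 0)]]` gives `[(5, 4), "pau", (4, 1), (3, 0)]`, matching the arrangement in the text.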

The acoustic model of each phoneme is then stored in the acoustic model information storage area 61. FIG. 5 is a schematic diagram of one example, the acoustic model information storage area 611. The schematic diagram of FIG. 5 also lists the accent type and the mora position, which are information contained in the acoustic model. The entries for the first accent phrase are, in order: phoneme "s", acoustic model "(s)", accent type 4, mora position 1; phoneme "o", acoustic model "(o)", accent type 4, mora position 1; phoneme "o", acoustic model "(o)", accent type 4, mora position 2; phoneme "i", acoustic model "(i)", accent type 4, mora position 3; phoneme "e", acoustic model "(e)", accent type 4, mora position 4; phoneme "b", acoustic model "(b)", accent type 4, mora position 5; and phoneme "a", acoustic model "(a)", accent type 4, mora position 5.

The entries for the second accent phrase are: phoneme "ky", acoustic model "(ky)", accent type 1, mora position 1; phoneme "o", acoustic model "(o)", accent type 1, mora position 1; phoneme "o", acoustic model "(o)", accent type 1, mora position 2; phoneme "t", acoustic model "(t)", accent type 1, mora position 3; phoneme "o", acoustic model "(o)", accent type 1, mora position 3; phoneme "n", acoustic model "(n)", accent type 1, mora position 4; and phoneme "i", acoustic model "(i)", accent type 1, mora position 4. The entries for the third accent phrase are: phoneme "i", acoustic model "(i)", accent type 0, mora position 1; phoneme "cl", acoustic model "(cl)", accent type 0, mora position 2; phoneme "t", acoustic model "(t)", accent type 0, mora position 3; and phoneme "a", acoustic model "(a)", accent type 0, mora position 3.
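As a cross-check, the per-phoneme entries described for FIG. 5 can be collapsed back into the "(5,4), pau, (4,1), (3,0)" summary used by prosody model selection 14. The following Python sketch is illustrative only: the row layout and the accent phrase index are assumptions introduced here, and the pause is left out because no entry for it is listed in the description of FIG. 5.

```python
from itertools import groupby

# (phoneme, accent_phrase_index, accent_type, mora_position)
# The accent_phrase_index column is a hypothetical helper, not part of FIG. 5.
ROWS = [
    ("s", 0, 4, 1), ("o", 0, 4, 1), ("o", 0, 4, 2), ("i", 0, 4, 3),
    ("e", 0, 4, 4), ("b", 0, 4, 5), ("a", 0, 4, 5),
    ("ky", 1, 1, 1), ("o", 1, 1, 1), ("o", 1, 1, 2), ("t", 1, 1, 3),
    ("o", 1, 1, 3), ("n", 1, 1, 4), ("i", 1, 1, 4),
    ("i", 2, 0, 1), ("cl", 2, 0, 2), ("t", 2, 0, 3), ("a", 2, 0, 3),
]


def phrase_summary(rows):
    """Return one (mora_count, accent_type) pair per accent phrase."""
    summary = []
    for _, group in groupby(rows, key=lambda row: row[1]):
        group = list(group)
        mora_count = max(row[3] for row in group)  # highest mora position
        accent_type = group[0][2]                  # constant within a phrase
        summary.append((mora_count, accent_type))
    return summary


print(phrase_summary(ROWS))  # [(5, 4), (4, 1), (3, 0)]
```

The result agrees with the summary given in the text once the pause between the first and second accent phrases is inserted.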

Next, consider a sentence that ends with "?", such as "そういえば、京都に行った？" ("Come to think of it, did you go to Kyoto?"). This sentence is also decomposed into the phonemes "s_o_o_i_e_b_a_pau_ky_o_o_t_o_n_i_i_cl_t_a". FIG. 6 is a schematic diagram of an acoustic model information storage area 612, an example of the acoustic model information storage area 61 that stores the acoustic models for this sentence. Because the sentence ends with "?", the phoneme models for the phonemes "i_cl_t_a" of the last accent phrase are selected from the question-sentence phoneme model 512 rather than the normal phoneme model 511, and their prosody models are selected from the question-sentence prosody model 522 rather than the normal prosody model 521.

FIG. 6 is a schematic diagram of the acoustic model information storage area 612 in this case. The entries for the first accent phrase are: phoneme "s", acoustic model "(s)", accent type 4, mora position 1; phoneme "o", acoustic model "(o)", accent type 4, mora position 1; phoneme "o", acoustic model "(o)", accent type 4, mora position 2; phoneme "i", acoustic model "(i)", accent type 4, mora position 3; phoneme "e", acoustic model "(e)", accent type 4, mora position 4; phoneme "b", acoustic model "(b)", accent type 4, mora position 5; and phoneme "a", acoustic model "(a)", accent type 4, mora position 5.

The entries for the second accent phrase are: phoneme "ky", acoustic model "(ky)", accent type 1, mora position 1; phoneme "o", acoustic model "(o)", accent type 1, mora position 1; phoneme "o", acoustic model "(o)", accent type 1, mora position 2; phoneme "t", acoustic model "(t)", accent type 1, mora position 3; phoneme "o", acoustic model "(o)", accent type 1, mora position 3; phoneme "n", acoustic model "(n)", accent type 1, mora position 4; and phoneme "i", acoustic model "(i)", accent type 1, mora position 4. The entries for the last accent phrase use question-sentence acoustic models: phoneme "i", acoustic model "(iq)", accent type 0, mora position 1; phoneme "cl", acoustic model "(clq)", accent type 0, mora position 2; phoneme "t", acoustic model "(tq)", accent type 0, mora position 3; and phoneme "a", acoustic model "(aq)", accent type 0, mora position 3.

The creation and selection of question-sentence acoustic models will now be described with reference to FIGS. 7 to 10, taking as examples the question-sentence acoustic models "(iq)", "(clq)", "(tq)", and "(aq)" selected from the question-sentence phoneme model 512 and the question-sentence prosody model 522. FIG. 7 is a schematic diagram 711 of the phonemes, accent type, and mora positions within the accent phrase of the example sentence "もう買った？" ("Did you buy it already?"), which is used to create the question-sentence acoustic model "(aq)". FIG. 8 is a schematic diagram 712 of the same items for the example sentence "本貸して？" ("Will you lend me the book?"), used to create "(tq)". FIG. 9 is a schematic diagram 713 of the same items for the example sentence "なんて言った？" ("What did you say?"), used to create "(clq)". FIG. 10 is a schematic diagram 714 of the same items for the example sentence "彼女いない？" ("Don't you have a girlfriend?"), used to create "(iq)".

For example, "(aq)" is the question-sentence acoustic model for the case in which the sentence ends with "?", the sentence-final phoneme is "a", the accent phrase to which that phoneme "a" belongs has accent type 0, and the mora position within the accent phrase is 3. It consists of a phoneme model and a prosody model generated from a recording of the example sentence "もう買った？", which satisfies these conditions. Besides "(aq)", other question-sentence acoustic models exist for the phoneme "a". One covers the case in which the sentence ends with "?", the sentence-final phoneme is "a", the accent type of its accent phrase is 1, and the mora position within the accent phrase is 2. Another covers the case in which the sentence contains the interrogative "なぜ" ("why"), the phoneme "a" is not the sentence-final phoneme, the accent type of its accent phrase is 0, and the mora position within the accent phrase is 1. In this way, question-sentence acoustic models are created from recordings for combinations of various accent types, mora positions within the accent phrase, and sentence kinds (the sentence ends with "?", the sentence end matches a predetermined sentence-end pattern, or the sentence contains an interrogative). When a question-sentence acoustic model is selected, the model that satisfies these conditions, such as the accent type of the accent phrase and the mora position within the accent phrase, is chosen.

Similarly, "(tq)" is the question-sentence acoustic model for the case in which the sentence ends with "?", the phoneme in the first mora from the end of the sentence is "t", the accent phrase to which that phoneme "t" belongs has accent type 0, and the mora position within the accent phrase is 3. It consists of a phoneme model and a prosody model generated from a recording of the example sentence "本貸して？", which satisfies these conditions. "(clq)" is the question-sentence acoustic model for the case in which the sentence ends with "?", the phoneme in the second mora from the end of the sentence is "cl", the accent type of its accent phrase is 0, and the mora position within the accent phrase is 2. It is generated from a recording of the example sentence "なんて言った？". "(iq)" is the question-sentence acoustic model for the case in which the sentence ends with "?", the phoneme in the third mora from the end of the sentence is "i", the accent type of its accent phrase is 0, and the mora position within the accent phrase is 1. It is generated from a recording of the example sentence "彼女いない？".
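The condition-based selection described above can be pictured as a table lookup keyed by phoneme, accent type, and mora position. The sketch below is a minimal illustration under that assumption: the `Context` type, the dictionary, and the function name are hypothetical, and the inventory holds only the four models named in the text, not the full set an actual acoustic dictionary would contain.

```python
from typing import NamedTuple, Optional


class Context(NamedTuple):
    phoneme: str
    accent_type: int
    mora_position: int


# Hypothetical inventory: only the four models named in the description.
QUESTION_MODELS = {
    Context("a", 0, 3): "(aq)",    # built from "もう買った？"
    Context("t", 0, 3): "(tq)",    # built from "本貸して？"
    Context("cl", 0, 2): "(clq)",  # built from "なんて言った？"
    Context("i", 0, 1): "(iq)",    # built from "彼女いない？"
}


def select_question_model(phoneme: str, accent_type: int,
                          mora_position: int) -> Optional[str]:
    """Return the label of the matching question-sentence model, if any."""
    return QUESTION_MODELS.get(Context(phoneme, accent_type, mora_position))


print(select_question_model("a", 0, 3))   # (aq)
print(select_question_model("cl", 0, 2))  # (clq)
```

A context that is missing from this small table returns `None`; how the apparatus handles a context with no recorded model is not stated in the description.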

When the acoustic models have been selected by acoustic model selection 12 in this way, pitch sequence generation 16 connects the prosody model sequence to generate a pitch sequence, as shown in FIG. 2. At connection time, the mora lengths are stretched or compressed to match the length of each phoneme in the phoneme model sequence, so that the pitch sequence is synchronized with the phoneme models. Next, based on the phoneme models selected by phoneme model selection 13, the phoneme models of the individual phonemes are concatenated to generate a mel-cepstrum sequence and a voiced/unvoiced information sequence (hereinafter, the "mcep sequence") (mcep sequence generation 15).

Sound source signal generation 17 is then performed based on the voiced/unvoiced information of the mcep sequence generated by mcep sequence generation 15 and on the pitch sequence generated by pitch sequence generation 16. For the sound source signal, a pulse train signal is generated for voiced portions based on the pitch sequence, and a noise signal is generated for unvoiced portions. The sound source signal is then output as speech through the MLSA filter 23.
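The sound source signal described here (a pulse train in voiced portions, noise in unvoiced portions) can be sketched as follows. This is a simplified illustration, not the apparatus's implementation: the sample rate, frame length, unit pulse amplitude, Gaussian noise, and the convention that a pitch of 0 marks an unvoiced frame are all assumptions, and the MLSA filtering stage is not shown.

```python
import random


def excitation(f0_per_frame, frame_len=80, sample_rate=16000, seed=0):
    """Build an excitation signal from per-frame pitch values in Hz.

    A pitch of 0 marks an unvoiced frame (assumed convention).
    """
    rng = random.Random(seed)
    signal = []
    samples_to_next_pulse = 0.0
    for f0 in f0_per_frame:
        for _ in range(frame_len):
            if f0 > 0:
                # Voiced: emit one pulse per pitch period, zeros otherwise.
                if samples_to_next_pulse <= 0:
                    signal.append(1.0)
                    samples_to_next_pulse += sample_rate / f0
                else:
                    signal.append(0.0)
                samples_to_next_pulse -= 1
            else:
                # Unvoiced: white Gaussian noise; restart the pulse phase.
                signal.append(rng.gauss(0.0, 1.0))
                samples_to_next_pulse = 0.0
    return signal


sig = excitation([200, 200, 0])  # two voiced frames at 200 Hz, one unvoiced
print(len(sig))                  # 240
print(sig[0], sig[1], sig[80])   # 1.0 0.0 1.0
```

With these assumed values the pitch period at 200 Hz is exactly 80 samples, so each voiced frame starts with one pulse.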

Next, the sentence-end pattern dictionary 55 and the interrogative dictionary 56 will be described with reference to FIGS. 11 and 12. FIG. 11 is a schematic diagram of the sentence-end pattern dictionary 55, and FIG. 12 is a schematic diagram of the interrogative dictionary 56. These dictionaries are used to decide whether the acoustic model should be switched from a normal acoustic model to a question-sentence acoustic model.

As shown in FIG. 11, the sentence-end pattern dictionary 55 stores, together with part-of-speech information, the sentence-final words and phrases used in sentences whose tone rises at the end, such as questions, sentences seeking agreement, and sentences urging an action. The example of FIG. 11 shows the auxiliary verb "でしょ"; the adverb "どう"; the verb "し" followed by the particle "たら"; the particle "て" followed by the verb "い" and the particle "て"; the particle "かな" (including the form with a trailing "ぁ"); the particle "ね" (including the form with a trailing "ぇ"); the particle "よ" (including the form with a trailing "ぉ"); the particle "の" (including the form with a trailing "ぉ"); and the particle "さ" (including the form with a trailing "ぁ"). Other sentence-end patterns are omitted from the figure.

As shown in FIG. 12, the interrogative dictionary 56 stores the interrogatives used in questions, sentences seeking agreement, sentences urging an action, and the like. The example of FIG. 12 shows "何" (what), "いつ" (when), "誰" (who), "どこ" (where), "どれ" (which), "どう" (how), "いくら" (how much), "いくつ" (how many), "どうして" (why), and "何故" (why). Other interrogatives are omitted from the figure.
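One way to picture the two dictionaries is as plain data checked against the output of morphological analysis. The Python sketch below assumes that analysis yields (surface form, part of speech) pairs with punctuation already removed; that token format, the English part-of-speech labels, the explicit listing of the small-kana variants, and the function names are all assumptions made for illustration.

```python
# Sentence-end patterns from the FIG. 11 example, as (surface, part of speech).
SENTENCE_END_PATTERNS = [
    [("でしょ", "auxiliary verb")],
    [("どう", "adverb")],
    [("し", "verb"), ("たら", "particle")],
    [("て", "particle"), ("い", "verb"), ("て", "particle")],
    [("かな", "particle")], [("かなぁ", "particle")],
    [("ね", "particle")], [("ねぇ", "particle")],
    [("よ", "particle")], [("よぉ", "particle")],
    [("の", "particle")], [("のぉ", "particle")],
    [("さ", "particle")], [("さぁ", "particle")],
]

# Interrogatives from the FIG. 12 example.
INTERROGATIVES = {"何", "いつ", "誰", "どこ", "どれ", "どう",
                  "いくら", "いくつ", "どうして", "何故"}


def matches_sentence_end(morphemes):
    """True if the morpheme sequence ends with one of the stored patterns."""
    return any(morphemes[-len(p):] == p for p in SENTENCE_END_PATTERNS)


def has_interrogative(morphemes):
    """True if any morpheme is a registered interrogative."""
    return any(surface in INTERROGATIVES for surface, _ in morphemes)


tokens = [("明日", "noun"), ("行く", "verb"), ("でしょ", "auxiliary verb")]
print(matches_sentence_end(tokens))  # True
print(has_interrogative(tokens))     # False
```

Comparing the tail of the sequence as a whole keeps multi-morpheme patterns such as "し" plus "たら" from matching on the last morpheme alone.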

Next, the operation of the processing related to acoustic model selection will be described with reference to the schematic diagram of FIG. 13 and the flowcharts of FIGS. 14 and 15. FIG. 13 is a schematic diagram of a question flag storage area 621. FIGS. 14 and 15 are flowcharts of the processing performed in acoustic model selection 12, and FIG. 15 is a continuation of the flowchart shown in FIG. 14.

First, the question flag storage area 621, an example of the question flag storage area 62 in the RAM 4, will be described with reference to the schematic diagram of FIG. 13. The question flag storage area 621 shows the question flags for the sentence "そういえば、京都に行った？", with one question flag stored for each phoneme. In the example of FIG. 13, the question flags of the phonemes "s", "o", "o", "i", "e", "b", "a", "ky", "o", "o", "t", "o", "n", and "i" are "0", and the question flags of the phonemes "i", "cl", "t", and "a" are "1".

First, initialization is performed: the phonemes stored in the phoneme column of the acoustic model information storage area are set in the phoneme column of the question flag storage area 62, and the initial value "0" is set in every entry of the question flag column (S1). Next, it is checked whether the symbol at the end of the sentence stored in the text storage area 41 is "?", that is, whether the sentence matches this sentence-end pattern (S2). If the final symbol is "?" and the sentence therefore matches (S3: YES), "1" is set in the question flags of the phonemes belonging to the last accent phrase (S4). If it does not match (S3: NO), nothing is set in the question flags.

Next, it is checked whether the analysis result stored in the analysis result storage area 42, such as the part-of-speech information, contains an entry of the sentence-end pattern dictionary 55, that is, whether the sentence matches a sentence-end pattern (S5). If it matches (S6: YES), "1" is set in the question flags of the phonemes belonging to the last accent phrase (S7). If it does not match (S6: NO), nothing is set in the question flags.

Next, it is checked whether the analysis result stored in the analysis result storage area 42, such as the part-of-speech information, contains an interrogative registered in the interrogative dictionary 56 (S8). If there is an interrogative (S9: YES), "1" is set in the question flags of the phonemes belonging to the last accent phrase (S10). If there is none (S9: NO), nothing is set in the question flags.
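Steps S1 to S10 can be summarized in a few lines of Python. The sketch assumes that each phoneme carries the index of its accent phrase and that the two dictionary checks (S5 and S8) have already been reduced to booleans by the caller; the function and parameter names are hypothetical.

```python
def set_question_flags(phrase_index, text, end_pattern_hit, interrogative_hit):
    """Return one question flag (0 or 1) per phoneme, following S1 to S10.

    phrase_index: accent phrase index of each phoneme (assumed input format).
    """
    flags = [0] * len(phrase_index)                      # S1: initialize to 0
    if not phrase_index:
        return flags
    last_phrase = max(phrase_index)

    def mark_last_phrase():
        for i, phrase in enumerate(phrase_index):
            if phrase == last_phrase:
                flags[i] = 1

    if text.rstrip().endswith(("?", "？")):              # S2, S3
        mark_last_phrase()                               # S4
    if end_pattern_hit:                                  # S5, S6
        mark_last_phrase()                               # S7
    if interrogative_hit:                                # S8, S9
        mark_last_phrase()                               # S10
    return flags


# "そういえば、京都に行った？": 7 + 7 + 4 phonemes in three accent phrases.
phrases = [0] * 7 + [1] * 7 + [2] * 4
print(set_question_flags(phrases, "そういえば、京都に行った？", False, False))
# [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
```

The three checks are independent, so a sentence that satisfies more than one of them simply sets the same flags again. The output matches the flag values described for FIG. 13.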

Then, referring to the question flags set in S1 to S10, the acoustic model of each phoneme is selected from the acoustic dictionary 50 (S11 to S17). First, the pointer is placed on the first phoneme (S11). It is then determined whether the question flag of that phoneme is "1" (S12). If the question flag is not "1" (S12: NO), there is no need to select from the question-sentence phoneme model 512 and the question-sentence prosody model 522, so the acoustic model (phoneme model and prosody model) is selected from the normal phoneme model 511 and the normal prosody model 521 (S15). The pointer is then advanced to the next phoneme (S16).

If the question flag is "1" (S12: YES), it is determined whether the accent phrase to which the phoneme belongs is the last accent phrase in the sentence (S13). If it is not the last accent phrase (S13: NO), there is no need to select from the question-sentence phoneme model 512 and the question-sentence prosody model 522, so the acoustic model (phoneme model and prosody model) is selected from the normal phoneme model 511 and the normal prosody model 521 (S15). The pointer is then advanced to the next phoneme (S16).

If it is the last accent phrase (S13: YES), the tone must be that of a question, with the pitch rising at the end, so the acoustic model (phoneme model and prosody model) is selected from the question-sentence phoneme model 512 and the question-sentence prosody model 522 (S14). The pointer is then advanced to the next phoneme (S16). If all phonemes have been processed (S17: YES), the acoustic model selection processing ends. If not all phonemes have been processed (S17: NO), the processing returns to S12, and the acoustic model is selected for the phoneme to which the pointer now points (S12 to S16).
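The loop of S11 to S17 can be sketched as below. It is a simplified illustration: a model is represented only by its display label, with "q" appended for a question-sentence model, whereas the actual selection also depends on accent type and mora position. The input lists and the function name are assumptions.

```python
def select_models(phonemes, flags, phrase_index):
    """Pick a normal or question-sentence model label for each phoneme."""
    last_phrase = max(phrase_index)
    selected = []
    pointer = 0                                              # S11
    while pointer < len(phonemes):                           # S17
        is_flagged = flags[pointer] == 1                     # S12
        in_last_phrase = phrase_index[pointer] == last_phrase  # S13
        if is_flagged and in_last_phrase:
            selected.append(f"({phonemes[pointer]}q)")       # S14
        else:
            selected.append(f"({phonemes[pointer]})")        # S15
        pointer += 1                                         # S16
    return selected


phonemes = ["s", "o", "o", "i", "e", "b", "a",
            "ky", "o", "o", "t", "o", "n", "i",
            "i", "cl", "t", "a"]
phrase_index = [0] * 7 + [1] * 7 + [2] * 4
flags = [0] * 14 + [1] * 4   # flag values described for FIG. 13

models = select_models(phonemes, flags, phrase_index)
print(models[:3])   # ['(s)', '(o)', '(o)']
print(models[-4:])  # ['(iq)', '(clq)', '(tq)', '(aq)']
```

Because S4, S7, and S10 only ever flag phonemes of the last accent phrase, the S13 test never changes the outcome in this embodiment; the sketch keeps it so that the code follows the flowchart step by step.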

For example, for the sentence "そういえば、京都に行った？", the pointer is placed on the first phoneme "s" in S11. As shown in FIG. 13, the question flag of this phoneme "s" is "0" (S12: NO), so the acoustic model "(s)" selected from the normal phoneme model 511 and the normal prosody model 521 is stored in the acoustic model information storage area 61 (S15). The pointer is then advanced to the next phoneme "o" (S16). Not all phonemes have been processed yet (S17: NO), so the phoneme "o" indicated by the pointer is processed. The question flag of this phoneme "o" is also "0" (S12: NO), so the acoustic model "(o)" selected from the normal phoneme model 511 and the normal prosody model 521 is stored in the acoustic model information storage area 61 (S15). The pointer is then advanced to the next phoneme "o" (S16).

The following phonemes "o", "i", "e", "b", "a", "ky", "o", "o", "t", "o", "n", and "i" are processed in the same way. The question flags of all these phonemes are "0", so their acoustic models are selected from the normal phoneme model 511 and the normal prosody model 521.

For the next phoneme "i", the question flag is "1" (S12: YES) and the phoneme belongs to the last accent phrase (S13: YES), so the acoustic model (phoneme model and prosody model) is selected from the question-sentence phoneme model 512 and the question-sentence prosody model 522 (S14). The pointer is then advanced to the next phoneme "cl" (S16). Not all phonemes have been processed (S17: NO), so the processing returns to S12. The phoneme "cl" has a question flag of "1" (S12: YES) and belongs to the last accent phrase (S13: YES), so its acoustic model (phoneme model and prosody model) is selected from the question-sentence phoneme model 512 and the question-sentence prosody model 522 (S14).

Likewise, the following phonemes "t" and "a" have a question flag of "1" (S12: YES) and belong to the last accent phrase (S13: YES), so their acoustic models (phoneme model and prosody model) are selected from the question-sentence phoneme model 512 and the question-sentence prosody model 522 (S14). All phonemes have then been processed (S17: YES), and the processing ends.

As described above, the question-sentence phoneme model 512 and the question-sentence prosody model 522 are created in advance from recorded speech of sentences whose sentence-final tone differs from that of declarative sentences, such as questions, sentences seeking agreement, and sentences urging an action. When the sentence end matches a predetermined pattern (a trailing "?", or predetermined words that ask a question, seek agreement, or urge an action), or when the sentence contains an interrogative, the acoustic model (phoneme model and prosody model) is selected from the question-sentence phoneme model 512 and the question-sentence prosody model 522. This makes it possible to output natural speech that is closer to the sentence-final tone of questions, sentences seeking agreement, sentences urging an action, and the like.

In this embodiment, the normal phoneme model 511 and the normal prosody model 521 stored in the acoustic dictionary 50 of the ROM 5 correspond to the "acoustic dictionary storage means", and the question-sentence phoneme model 512 and the question-sentence prosody model 522 stored in the acoustic dictionary 50 of the ROM 5 correspond to the "question-sentence acoustic dictionary storage means". The CPU 2 performing the processing of language analysis 11 corresponds to the "language analysis means", the CPU 2 performing the processing of acoustic model selection 12 corresponds to the "acoustic model selection means", and the CPU 2 performing the processing of mcep sequence generation 15, pitch sequence generation 16, sound source signal generation 17, and the MLSA filter 23 corresponds to the "speech generation means".

The speech synthesis apparatus and speech synthesis program of the present invention are not limited to the embodiment described above, and various modifications can of course be made without departing from the gist of the present invention.

In the embodiment described above, question-sentence data is used for both the phoneme model and the prosody model of the acoustic model, but question-sentence data may be used for the phoneme model only or for the prosody model only. In creating the question-sentence models, the embodiment prepares the recordings and creates the question-sentence data in consideration of the phoneme type, the accent type of the accent phrase to which the phoneme belongs, and the mora position within the accent phrase. In addition to these, the data may be created in consideration of the sentence length, the length of the breath group, the distance from the accent position of the accent phrase (how many phonemes away), the presence or absence of a breath-group boundary within the accent phrase, the degree of dependency between accent phrases, the part of speech, the conjugation type, the conjugated form, and so on. The surrounding phonemes, such as the preceding phoneme, the second preceding phoneme, the following phoneme, and the second following phoneme, may also be considered. Conversely, the phoneme type need not be considered.

Also, in the embodiment described above, when the sentence end matches a predetermined pattern (a trailing "?", or predetermined words that ask a question, seek agreement, or urge an action) or the sentence contains an interrogative, the acoustic models (phoneme model and prosody model) for the last accent phrase are selected from the question-sentence phoneme model 512 and the question-sentence prosody model 522. However, the phonemes whose models are selected from the question-sentence acoustic models need not be the phonemes of the last accent phrase. They may be only the last phoneme, or only the phonemes of the last mora.
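The three alternatives for which phonemes receive question-sentence models (the last accent phrase, the last mora, or the last phoneme) differ only in how the target set is computed. The sketch below shows one possible way to express that choice; the row format, the scope names, and the function name are hypothetical.

```python
def question_scope(rows, scope):
    """Return the indexes of the phonemes to switch to question models.

    rows: (phoneme, accent_phrase_index, mora_position) per phoneme.
    scope: "accent_phrase", "mora", or "phoneme" (assumed names).
    """
    last_phrase = rows[-1][1]
    last_mora = rows[-1][2]
    if scope == "accent_phrase":
        return [i for i, r in enumerate(rows) if r[1] == last_phrase]
    if scope == "mora":
        return [i for i, r in enumerate(rows)
                if r[1] == last_phrase and r[2] == last_mora]
    if scope == "phoneme":
        return [len(rows) - 1]
    raise ValueError(f"unknown scope: {scope}")


# Tail of "そういえば、京都に行った？": "...n_i" then "i_cl_t_a".
rows = [("n", 1, 4), ("i", 1, 4),
        ("i", 2, 1), ("cl", 2, 2), ("t", 2, 3), ("a", 2, 3)]
print(question_scope(rows, "accent_phrase"))  # [2, 3, 4, 5]
print(question_scope(rows, "mora"))           # [4, 5]
print(question_scope(rows, "phoneme"))        # [5]
```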

The speech synthesis apparatus and speech synthesis program of the present invention are applicable to speech synthesis apparatuses and speech synthesis programs that output speech for sentences whose tone differs from that of declarative sentences.

A block diagram showing the electrical configuration of the speech synthesizer 1.
A functional configuration diagram of the present embodiment.
A schematic diagram of the text storage area 411.
A schematic diagram of the analysis result storage area 421.
A schematic diagram of the acoustic model information storage area 611.
A schematic diagram of the acoustic model information storage area 612.
A schematic diagram 711 of the phonemes, accent types, and mora positions within the accent phrases of the example sentence "もう買った？", used when creating the interrogative-sentence acoustic model "(aq)".
A schematic diagram 712 of the phonemes, accent types, and mora positions within the accent phrases of the example sentence "本貸して？", used when creating the interrogative-sentence acoustic model "(tq)".
A schematic diagram 713 of the phonemes, accent types, and mora positions within the accent phrases of the example sentence "なんて言った？", used when creating the interrogative-sentence acoustic model "(clq)".
A schematic diagram 714 of the phonemes, accent types, and mora positions within the accent phrases of the example sentence "彼女いない？", used when creating the interrogative-sentence acoustic model "(iq)".
A schematic diagram of the sentence-end pattern dictionary 55.
A schematic diagram of the interrogative word dictionary 56.
A schematic diagram of the interrogative flag storage area 621.
A flowchart of the processing performed in the acoustic model selection 12.
The flowchart shown in FIG. 14.

Explanation of symbols

1 Speech synthesizer
2 CPU
4 RAM
5 ROM
11 Language analysis
12 Acoustic model selection
13 Phonological model selection
14 Prosodic model selection
41 Text storage area
42 Analysis result storage area
50 Acoustic dictionary
51 Phonological model
52 Prosodic model
55 Sentence-end pattern dictionary
56 Interrogative word dictionary
61 Acoustic model information storage area
62 Interrogative flag storage area
511 Normal phonological model
512 Interrogative-sentence phonological model
521 Normal prosodic model
522 Interrogative-sentence prosodic model

Claims

1. A speech synthesizer comprising:
acoustic dictionary storage means for storing an acoustic dictionary that is a set of acoustic models each including at least a phonological model created from phonological data obtained by analyzing speech into an acoustic parameter sequence and a prosodic model created from fundamental frequency data obtained by analyzing speech;
interrogative-sentence acoustic dictionary storage means for storing an interrogative-sentence acoustic dictionary, different from the acoustic dictionary, that is a set of interrogative-sentence acoustic models each including at least an interrogative-sentence phonological model created from the phonological data of speech uttering a sentence whose sentence-final tone differs from that of a declarative sentence, such as a question, a sentence seeking agreement, or a sentence prompting an action, and an interrogative-sentence prosodic model created from the fundamental frequency data of speech uttering a sentence whose sentence-final tone differs from that of a declarative sentence;
language analysis means for decomposing a sentence for which speech is to be generated into words to determine parts of speech, determining an accent type indicating the accent position for each accent phrase, and determining the reading of the sentence for which speech is to be generated;
acoustic model selection means for selecting the acoustic model from the acoustic dictionary based on the analysis result of the language analysis means; and
speech generation means for generating speech based on the phonological model and the prosodic model constituting the acoustic model selected by the acoustic model selection means,
wherein, in at least one of the case where the end of the sentence for which speech is to be generated matches a predetermined sentence-end pattern and the case where the sentence for which speech is to be generated contains an interrogative word, the acoustic model selection means selects the acoustic model for the sentence-final phoneme, a predetermined number of sentence-final morae, the sentence-final accent phrase, or the entire sentence from the interrogative-sentence acoustic models of the interrogative-sentence acoustic dictionary instead of from the acoustic dictionary.

2. The speech synthesizer according to claim 1, wherein the predetermined sentence-end pattern is a question mark character at the end of the sentence.

3. The speech synthesizer according to claim 1, wherein the predetermined sentence-end pattern is a sentence-final word that asks a question, a word that seeks agreement, or a word that prompts an action.

4. A speech synthesis program for causing a computer to execute:
a language analysis step of decomposing a sentence for which speech is to be generated to determine parts of speech, determining an accent type indicating the accent position for each accent phrase, and determining the reading of the sentence for which speech is to be generated;
an acoustic model selection step of selecting, based on the result of the language analysis step, the acoustic model from an acoustic dictionary that is a set of acoustic models each including at least a phonological model created from phonological data obtained by analyzing speech into an acoustic parameter sequence and a prosodic model created from fundamental frequency data obtained by analyzing speech; and
a speech generation step of generating speech based on the phonological model and the prosodic model constituting the acoustic model selected in the acoustic model selection step,
wherein, in the acoustic model selection step, when the end of the sentence for which speech is to be generated matches a predetermined sentence-end pattern, or when the sentence for which speech is to be generated contains an interrogative word, the acoustic model for the sentence-final phoneme, a predetermined number of sentence-final morae, the sentence-final accent phrase, or the entire sentence is selected not from the acoustic dictionary but from the interrogative-sentence acoustic models of an interrogative-sentence acoustic dictionary, which is a set of interrogative-sentence acoustic models each including at least an interrogative-sentence phonological model created from the phonological data of speech uttering a sentence whose sentence-final tone differs from that of a declarative sentence, such as a question, a sentence seeking agreement, or a sentence prompting an action, and an interrogative-sentence prosodic model created from the fundamental frequency data of speech uttering a sentence whose sentence-final tone differs from that of a declarative sentence.

5. The speech synthesis program according to claim 4, wherein the predetermined sentence end pattern is a question mark character at the end of the sentence.

6. The speech synthesis program according to claim 4, wherein the predetermined sentence-end pattern is a sentence-final word that asks a question, a word that seeks agreement, or a word that prompts an action.