JP4648878B2

JP4648878B2 - Style designation type speech synthesis method, style designation type speech synthesis apparatus, program thereof, and storage medium thereof

Info

Publication number: JP4648878B2
Application number: JP2006189291A
Authority: JP
Inventors: 昇宮崎
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2006-07-10
Filing date: 2006-07-10
Publication date: 2011-03-09
Anticipated expiration: 2026-07-10
Also published as: JP2008015424A

Description

The present invention relates to a speech synthesis method that takes a text together with an utterance style as input and produces synthesized speech corresponding to both, and to an apparatus therefor, a program therefor, and a storage medium storing the program.

In conventional speech synthesis, which takes text as input and outputs speech, pronunciation information corresponding to the text is first created using dictionaries, rules, and the like. Here, pronunciation information refers not only to the pronunciation itself, as might be written in kana, but also to accent positions, accent phrase boundaries, vowel devoicing information, and so on.
Next, depending on the method, prosodic information is generated from the pronunciation information. Here, prosody means, for example, the average values of voice pitch, loudness, and speech rate, or the patterns by which they change over time.
Next, a speech waveform corresponding to the pronunciation information and prosodic information is generated. A well-known recent approach is waveform segment concatenation: a large amount of speech uttered by a specific speaker is collected into a speech database, and waveform segments that match the pronunciation information and have values close to the prosodic information are extracted from the database and joined together for output. This approach is known to yield high-quality synthesized speech, and the present invention also presupposes its use.
Non-Patent Document 1 discloses a conventional text-to-speech technique that takes the utterance style into account. In Non-Patent Document 1, the utterance style is the type of emotion, such as joy or sadness, conveyed in the synthesized speech. The technique of Non-Patent Document 1 is briefly described with reference to FIG. 13.

For example, a prosody learning unit 132 generates prosodic information from the difference between speech in a read-speech database 131 (hereinafter "database" is abbreviated as "DB"), which stores lines delivered without emotion, such as narration by an NHK announcer, and speech in an emotion-specific speech DB 130, which stores the same lines uttered by a narrator with a designated emotion. The prosodic information is registered in an emotion-specific prosody dictionary 133. Here, prosody means, for example, the average values of voice pitch, loudness, and speech rate, or the patterns by which they change over time. At the same time as the prosodic information is registered in the emotion-specific prosody dictionary 133, a segment learning unit 134 extracts speech waveform segments and registers them in a segment dictionary 135.
When text is input to a language analysis unit 136, the text is divided into words, each word is given a pronunciation (reading), and a single piece of pronunciation information is generated. A prosody generation unit 137 receives the pronunciation information and assigns it a single piece of prosodic information based on the emotion-specific prosody dictionary 133 for the designated utterance style. A waveform generation unit 138 extracts from the segment dictionary the speech waveform segments that match the pronunciation information and have values close to the prosodic information, joins them, and outputs the result as synthesized speech.
Non-Patent Document 1: "Fundamental frequency control for emotional speech synthesis," Proceedings of the Acoustical Society of Japan, March 2003, p. 265

Here, the utterance style is defined as the form of an utterance that reflects the speaker's manner of speaking (for example, that of a polite person or a brusque person), the situation at hand, the speaker's emotions, and so on. The present invention focuses on the fact that, compared with calm speech that reflects no emotion or manner, speech reflecting an utterance style differs markedly, especially in pronunciation and prosody. When pronunciation information and prosodic information are varied to reflect the utterance style, expression becomes more diverse, so a wide variety of pronunciation and prosodic information is required, and it becomes more likely that the speech DB contains no waveform sufficiently close to the required prosodic information. In such a case, a conventional speech synthesis method substitutes a waveform that matches the pronunciation information but deviates widely from the prosodic information. Large discontinuities can then arise where waveform segments are joined, and the quality of the synthesized speech may degrade fatally.
In the conventional example above, the synthesized speech is of high quality when the input text is exactly the same as the text used when the data was registered in the emotion-specific prosody dictionary 133 and the segment dictionary 135. When a different text is input, however, the diversified expression makes it likely that the speech DB contains no waveform segments matching both the text and the utterance style. This cannot be completely avoided however much the number of segments registered in the segment dictionary is increased.

In other words, because the conventional method generates synthesized speech from a single piece of pronunciation information and a single piece of prosodic information for it, the required waveform segments are likely to be missing from the speech DB, and in that case the substituted segments fatally degrade the quality of the synthesized speech.
The present invention has been made in view of the above. Its object is to provide a style designation type speech synthesis method that synthesizes speech in a designated utterance style while reducing the likelihood of outputting synthesized speech of fatally degraded quality, together with an apparatus, a program, and a storage medium therefor.

In the style designation type speech synthesis apparatus according to the present invention, pronunciation information generating means receives a text and utterance style information, which is a factor that alters the speech apart from the content expressed by the text, and outputs one or more pieces of pronunciation information together with, for each piece, a pronunciation style score representing the degree to which the utterance style is reflected. Prosodic information generating means receives the utterance style information and the pronunciation information and outputs, for each piece of pronunciation information, one or more pieces of prosodic information together with, for each piece, a prosodic style score representing the degree to which the utterance style is reflected. Speech synthesis means receives the pronunciation information from the pronunciation information generating means and the prosodic information from the prosodic information generating means, and outputs a plurality of synthesized speech candidates that differ in pronunciation information and/or prosodic information, together with a quality score representing the quality of each candidate. From among these candidates, synthesized speech selection means selects and outputs the synthesized speech whose quality score exceeds a threshold and whose style score, based on the pronunciation style score and the prosodic style score, is the highest. If no synthesized speech has a quality score exceeding the threshold, it selects and outputs the synthesized speech with the highest quality score.

In evaluating speech synthesis, synthesized speech that is of high quality and conveys the desired utterance style is of course required. More importantly, however, it is known that even a small amount of fatally degraded output, such as speech with large concatenation distortion or extraneous noise, strongly affects the subjective impression. The style designation type speech synthesis apparatus according to the present invention selects, from a plurality of synthesized speech candidates that differ in pronunciation information and/or prosodic information, the one whose quality score exceeds the threshold and whose style score is the highest, so the output is of good quality and matches the designated utterance style well. Moreover, even when no candidate has a quality score exceeding the threshold, the candidate with the highest quality score is output, which reduces the likelihood of outputting synthesized speech with a fatally low quality score.

Embodiments of the present invention are described below with reference to the drawings. Identical elements in the drawings are given the same reference numerals, and their description is not repeated.

FIG. 1 shows the functional block configuration of Embodiment 1 of the style designation type speech synthesis apparatus 300 of the present invention. A text α written in mixed kanji and kana, and utterance style information β, which is a factor that alters the speech apart from the content expressed by the text α, are input to pronunciation information generating means 10. Based on, for example, the text α "私は傘をさした" ("I put up an umbrella") and the utterance style information β "polite", the pronunciation information generating means 10 outputs a plurality of pieces of pronunciation information and their pronunciation style scores to prosodic information generating means 12. Possible pronunciation information includes, for example, "Watakushiwa, kasa o sashita" and "Watashiwa, kasa o sashita". For the utterance style information "polite", "Watakushiwa, kasa o sashita" receives the higher pronunciation style score.
The prosodic information generating means 12 assigns a plurality of pieces of prosodic information to each piece of input pronunciation information. For example, speech reflecting the "polite" utterance style is uttered with clearer intonation than ordinary speech and sounds more formal than usual. For the pronunciation information "Watakushiwa, kasa o sashita", if the range of variation in loudness and pitch, which expresses intonation, is larger than average without impairing naturalness, the prosodic style score for the "polite" utterance style is high.

One or more pieces of such prosodic information are assigned to each piece of pronunciation information. The prosodic information generating means 12 outputs to speech synthesis means 14 the pronunciation information and pronunciation style scores received from the pronunciation information generating means 10, together with the prosodic information and prosodic style scores generated for that pronunciation information.
The speech synthesis means 14 synthesizes speech for every combination of the input pronunciation information and prosodic information, and generates for each synthesized speech a quality score representing its quality. The quality score is described in detail later; it is, for example, a value reflecting how closely the fundamental frequency of the synthesized speech agrees with the prosodic information. The speech synthesis means 14 outputs the synthesized speech and its quality scores, together with the pronunciation style scores and prosodic style scores received from the prosodic information generating means 12, to synthesized speech selection means 16.
The synthesized speech selection means 16 selects and outputs, from the input synthesized speech, the one whose quality score exceeds a threshold and whose style score, based on the pronunciation style score and the prosodic style score, is the highest. If none has a quality score exceeding the threshold, it selects and outputs the one with the highest quality score. In the example shown, the pronunciation style score generated by the pronunciation information generating means 10 is passed in turn through the prosodic information generating means 12, the speech synthesis means 14, and the synthesized speech selection means 16. Alternatively, as indicated by the broken line in FIG. 1, it may be output directly to the synthesized speech selection means 16, which uses it to select the synthesized speech. Likewise, the prosodic style score generated by the prosodic information generating means 12 may be output directly to the synthesized speech selection means 16.

Thus, in this embodiment, the pronunciation information and prosodic information are not fixed to a single choice. Multiple candidates are created, speech is synthesized for all of them, and the output is either a candidate whose quality score exceeds the threshold or, failing that, the candidate with the highest quality score. This reduces the likelihood of outputting synthesized speech of extremely degraded quality.
To summarize the operation of Embodiment 1, FIG. 2 shows its operation flow. The text α in mixed kanji and kana and the utterance style information β are input to the pronunciation information generating means 10 (step S11). The pronunciation information generating means 10 generates one or more pieces of pronunciation information reflecting the utterance style, together with pronunciation style scores (step S12). The prosodic information generating means 12 generates, for the generated pronunciation information, prosodic information reflecting the utterance style information β, together with prosodic style scores (step S13). The speech synthesis means 14 generates a plurality of synthesized speech candidates from the pronunciation information and prosodic information, together with their quality scores (step S14). If any candidate has a quality score exceeding the threshold, the synthesized speech selection means 16 selects from among those the one with the highest style score based on the pronunciation style score and the prosodic style score, that is, the one that best reflects the style; if none exceeds the threshold, it selects the candidate with the highest quality score (step S15). The selected synthesized speech is then output (step S16).
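The selection rule of steps S15 and S16 can be sketched in a few lines of Python. This is only an illustrative sketch, not the patent's implementation: the `Candidate` class, the `select_speech` function, and the example score values are hypothetical, and the style score is assumed here to be the simple sum of the pronunciation style score and the prosodic style score.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    """One synthesized speech candidate (hypothetical container)."""
    name: str          # illustrative label for the candidate
    pron_style: float  # pronunciation style score
    pros_style: float  # prosodic style score
    quality: float     # quality score of the synthesized speech

    @property
    def style(self) -> float:
        # Assumption: the style score is the sum of the two style scores.
        return self.pron_style + self.pros_style


def select_speech(candidates: List[Candidate], threshold: float) -> Candidate:
    """Pick the candidate to output, following the rule in steps S15-S16."""
    if not candidates:
        raise ValueError("at least one candidate is required")
    # Candidates whose quality score exceeds the threshold.
    acceptable = [c for c in candidates if c.quality > threshold]
    if acceptable:
        # Among acceptable candidates, the highest style score wins.
        return max(acceptable, key=lambda c: c.style)
    # Fallback: nothing exceeds the threshold, so take the best quality.
    return max(candidates, key=lambda c: c.quality)


candidates = [
    Candidate("A", pron_style=0.33, pros_style=1.0, quality=0.4),
    Candidate("B", pron_style=0.33, pros_style=0.5, quality=0.8),
    Candidate("C", pron_style=0.0, pros_style=1.0, quality=0.9),
    Candidate("D", pron_style=0.0, pros_style=0.5, quality=0.95),
]
chosen = select_speech(candidates, threshold=0.7)
fallback = select_speech(candidates, threshold=0.99)
```

With a threshold of 0.7, candidate A is excluded despite having the highest style score, which is the behavior that keeps badly degraded speech out of the output.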

[Description of each functional block]
[Pronunciation information generating means]
FIG. 3 shows an example functional configuration of the pronunciation information generating means 10, whose operation is described below. The text α and the utterance style information β are input to a morphological analysis unit 30. The morphological analysis unit 30 divides the input text α, for example "私は傘をさした" ("I put up an umbrella"), into words and attaches word information such as part of speech and reading. If the input utterance style information β is, for example, "polite", the morphological analysis unit 30 looks up, in a style-dependent dictionary 31 such as that shown in FIG. 4, the readings of words whose reading changes with the style, and generates a plurality of pieces of pronunciation information.

The pronunciation information takes the form of a katakana string with embedded accent nucleus information. For example, pronunciation information 1h, Watakushiwa [00] Kasao [01] Sashita [01], and pronunciation information 2h, Watashiwa [00] Kasao [01] Sashita [01], are generated. The number in brackets indicates the accent nucleus position of the immediately preceding accent phrase. The [00] after Watashiwa means a flat tone with no accent; this accent type is also called type 0. The [01] after Kasao means that the accent falls on the first sound, "ka"; this accent type is also called type 1. In this example, a pronunciation style score generation unit assigns a pronunciation style score to each of pronunciation information 1h and 2h. The pronunciation style score may be, for example, the proportion of accent phrases taken from the style-dependent dictionary, as shown in FIG. 5. For pronunciation information 1h, Watakushiwa [00] Kasao [01] Sashita [01], one of the three accent phrases is taken from the style-dependent dictionary 31, so its pronunciation style score 1s is set to, for example, 0.33.
For pronunciation information 2h, Watashiwa [00] Kasao [01] Sashita [01], by contrast, the readings of all three accent phrases are obtained from a word dictionary (not shown) in the morphological analysis unit 30, so it is regarded as not depending on the utterance style information β and its pronunciation style score 2s is set to 0.0.
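As a rough illustration of the bracketed notation and the ratio-based score, the following Python sketch parses a romanized pronunciation string and computes the fraction of accent phrases whose reading comes from a style-dependent dictionary. The function names, the romanized input format, and the use of a plain set to stand in for the style-dependent dictionary 31 are all assumptions made for this example.

```python
import re
from typing import List, Set, Tuple

# An accent phrase is a reading followed by its accent nucleus position,
# e.g. "Kasao [01]". Whitespace around the parts is optional.
ACCENT_PHRASE = re.compile(r"\s*([^\[\]\s]+)\s*\[(\d+)\]")


def parse_pronunciation(text: str) -> List[Tuple[str, int]]:
    """Split a pronunciation string into (reading, accent nucleus) pairs."""
    return [(reading, int(pos)) for reading, pos in ACCENT_PHRASE.findall(text)]


def pronunciation_style_score(
    phrases: List[Tuple[str, int]], style_readings: Set[str]
) -> float:
    """Fraction of accent phrases whose reading is style-dependent."""
    if not phrases:
        return 0.0
    hits = sum(1 for reading, _ in phrases if reading in style_readings)
    return hits / len(phrases)


# Hypothetical stand-in for the "polite" entries of the style-dependent dictionary.
polite_readings = {"Watakushiwa"}

info_1h = parse_pronunciation("Watakushiwa [00] Kasao [01] Sashita [01]")
info_2h = parse_pronunciation("Watashiwa [00] Kasao [01] Sashita [01]")
score_1s = pronunciation_style_score(info_1h, polite_readings)
score_2s = pronunciation_style_score(info_2h, polite_readings)
```

Here `score_1s` is one third, which the text rounds to 0.33, and `score_2s` is 0.0 because none of its readings are style-dependent.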

This example shows the case where two pieces of pronunciation information, 1hs and 2hs, are generated; depending on the input text, n pieces of pronunciation information and n pronunciation style scores are generated.
Pronunciation information 1hs and 2hs and the utterance style information β are input to a reading assignment unit 33, which adjusts for rendaku (sequential voicing) caused by the joining of accent phrases. No such adjustment is needed in the example above, but the reading is adjusted here when, for example, two words combine into one and the initial unvoiced sound of the second word must become voiced.
The reading assignment unit 33 can also adjust the pronunciation information according to the utterance style information β. For example, information on lengthening the ends of accent phrases may be stored in style-dependent phrase-final lengthening frequency information 34, and the reading assignment unit 33 may alter phrase endings based on it. Lengthening the end of a phrase, for example saying "sorede-" instead of "sorede" ("and so"), is another way to make the expression depend on the utterance style information β.
The reading assignment unit 33 can also accommodate the utterance style information β by converting the pronunciation based on style-dependent pronunciation conversion information 35, which stores different pronunciation information for the same meaning: for example, the more casual "yatchaimashita" for "yarimashita" ("I did it"), or "itchattasuyo" for "ikimashita" ("I went"). In this case the pronunciation style score depends on the content of the pronunciation. For example, using "Kinō, nijikai ni itchattasuyo" in place of "Kinō, nijikai ni ikimashita" ("I went to the after-party yesterday") lowers the pronunciation style score for the "polite" utterance style information β, not through the accent phrase proportion described above but because that pronunciation was applied. In other words, the pronunciation information itself may carry a score weight.

For the pronunciation information whose readings have been adjusted for rendaku and the like, an accent assignment unit 36 determines the accent type, that is, where the accent is placed in the pronunciation information as a whole, and the pronunciation information is output with its pronunciation style score to the prosodic information generating means 12. The pronunciation information and its pronunciation style score may be stored in a pronunciation information storage unit 37 within the pronunciation information generating means 10, or may be output sequentially to the prosodic information generating means 12. In the example described, pronunciation information 1h, Watakushiwa [00] Kasao [01] Sashita [01], paired with its pronunciation style score 1s of 0.33, is denoted pronunciation information 1hs, and pronunciation information 2h, Watashiwa [00] Kasao [01] Sashita [01], paired with its pronunciation style score 2s of 0.0, is denoted pronunciation information 2hs. Pronunciation information 1hs and 2hs are input to the prosodic information generating means 12.
[Prosodic information generating means]
FIG. 6 shows an example configuration of the prosodic information generating means 12, whose operation is described below. Pronunciation information 1hs and 2hs generated by the pronunciation information generating means 10 are taken in sequentially by a pronunciation information acquisition unit 60, and the pronunciation information is input to a prosody generation unit 61. Based on the utterance style information β, the prosody generation unit 61 generates prosodic information by referring to a prosody DB 62 created from speech in which representative utterance styles were uttered with several degrees of emphasis. A feature of this embodiment is that the prosody DB 62 provides several levels for each representative utterance style.

As shown for the prosody DB 62, for the utterance style information β "polite", for example, two levels are provided: level 1.0, which emphasizes the style more strongly, and level 0.5, which reflects it only weakly. The same applies to other utterance style information β such as "joy" and "anger". The prosodic information here consists of the pattern of change of the fundamental frequency of the speech and the lengths of its pauses, as illustrated in FIG. 7, where the horizontal axis is time and the vertical axis is fundamental frequency. The prosody DB 62 is created in the same way as the emotion-specific prosody dictionary 133 described under the prior art and is stored in advance on a hard disk or the like.
For the "polite" utterance style, as can be understood from the example prosodic information at level 1.0 and level 0.5 (the prosodic style scores) shown as prosodic information 1a_R and 1b_R in FIG. 8, speech that reflects the style strongly is taken to be, for example, speech with large intonation and slightly longer pauses. Level 1.0 in the prosody DB 62 therefore has a wider range of fundamental frequency variation and longer pauses than level 0.5. Given this construction, the prosody DB 62 is not limited to two levels; levels such as 0.7 or 0.8 can easily be provided. For level 0.7, for example, with level 1.0 taken as 100% and level 0.5 as 50%, the range of fundamental frequency variation and the pause duration are simply set to 70%. Increasing the number of levels for the utterance style information β in the prosody DB 62 in this way correspondingly increases the amount of prosodic information that the prosody generation unit 61 can generate.
The prosody generation unit 61 generates a plurality of pieces of prosodic information for each piece of pronunciation information. For pronunciation information 1h, it generates, for example, prosodic information 1a_R and 1b_R as shown in FIG. 8; for pronunciation information 2h, it generates prosodic information 2a_R and 2b_R. A prosodic style score assignment unit 63 assigns a prosodic style score to each piece of generated prosodic information: 1.0 to prosodic information 1a_R and 0.5 to 1b_R, and likewise 1.0 to 2a_R and 0.5 to 2b_R.
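The idea of deriving an intermediate level such as 0.7 from the level-1.0 prosody can be sketched as follows. This is a hedged illustration only: the text says that the range of fundamental frequency variation and the pause duration are scaled to 70%, but it does not say around which reference value the contour is scaled, so scaling about the mean F0 is an assumption made here, as are the function name and the sample values.

```python
from typing import List, Tuple


def scale_prosody(
    f0_full: List[float], pause_full: float, level: float
) -> Tuple[List[float], float]:
    """Derive prosody at `level` from the level-1.0 contour and pause.

    f0_full    -- fundamental frequency contour (Hz) at level 1.0
    pause_full -- pause duration (seconds) at level 1.0
    level      -- target level in [0.0, 1.0], e.g. 0.7 for 70%
    """
    if not f0_full:
        raise ValueError("the F0 contour must not be empty")
    if not 0.0 <= level <= 1.0:
        raise ValueError("level must lie between 0.0 and 1.0")
    mean_f0 = sum(f0_full) / len(f0_full)
    # Assumption: shrink the excursions around the mean so that the
    # variation range becomes `level` times the level-1.0 range.
    contour = [mean_f0 + level * (f0 - mean_f0) for f0 in f0_full]
    return contour, pause_full * level


# Hypothetical level-1.0 prosody: F0 samples in Hz and one pause in seconds.
full_contour = [100.0, 200.0, 150.0, 150.0]
contour_07, pause_07 = scale_prosody(full_contour, pause_full=0.4, level=0.7)
```

In this example the variation range shrinks from 100 Hz to 70 Hz and the pause from 0.4 s to 0.28 s, while the mean F0 stays at 150 Hz.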

Although the time-varying pattern of the fundamental frequency and the pause length are used here as prosody information, a prosody generation scheme that also takes into account the patterns of change in speech rate, speech power, and the like is conceivable.
The multiple pieces of prosody information and their prosodic style scores generated by the prosody generation unit 61 are paired with the corresponding pronunciation information and output sequentially to the speech synthesis means 14 as speech synthesis information. In this example there are two pieces of speech synthesis information, 1g and 2g, each of which carries one piece of pronunciation information with its pronunciation style score, plus two pieces of prosody information with their respective prosodic style scores.
Speech synthesis information is generated for every combination of pronunciation information and prosody information. In this example, two pieces of prosody information are assigned to each of the pronunciation information 1 and 2, so four pieces of speech synthesis information, 1g_1, 1g_2, 2g_1, and 2g_2, are generated. The n pieces of speech synthesis information may be stored in a speech synthesis information storage unit 65 provided in the prosody information generation means 12.
The pronunciation style score and the prosodic style score may also be added together by the style score generation unit 64 and stored in the speech synthesis information storage unit 65 as a single style score representing the degree to which the utterance style is reflected.
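The enumeration of speech synthesis information described above can be sketched as follows. This is only an illustrative sketch: the record layout `SynthesisInfo`, the function `build_synthesis_info`, and the dictionary names are hypothetical and do not appear in the specification. The score values are those of the running example (pronunciation style scores 0.33 and 0.0, prosodic style scores 1.0 and 0.5), and the style score is the plain sum that the style score generation unit 64 may compute.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SynthesisInfo:
    """One piece of speech synthesis information (hypothetical layout)."""
    label: str                  # e.g. "1g_1"
    pronunciation: str          # pronunciation information, e.g. "1h"
    prosody: str                # prosody information, e.g. "1a_R"
    pronunciation_style: float  # pronunciation style score
    prosodic_style: float       # prosodic style score

    @property
    def style_score(self) -> float:
        # Plain sum of the two style scores.
        return self.pronunciation_style + self.prosodic_style


# Values from the running example in the text.
PRONUNCIATION_STYLE = {"1h": 0.33, "2h": 0.0}
PROSODY = {
    "1h": [("1a_R", 1.0), ("1b_R", 0.5)],
    "2h": [("2a_R", 1.0), ("2b_R", 0.5)],
}


def build_synthesis_info(pronunciation_style, prosody):
    """Create one record per (pronunciation, prosody) combination."""
    infos = []
    for i, (pron, pron_score) in enumerate(pronunciation_style.items(), start=1):
        for j, (pros, pros_score) in enumerate(prosody[pron], start=1):
            infos.append(
                SynthesisInfo(f"{i}g_{j}", pron, pros, pron_score, pros_score)
            )
    return infos


INFOS = build_synthesis_info(PRONUNCIATION_STYLE, PROSODY)
for info in INFOS:
    print(info.label, info.prosody, round(info.style_score, 2))
```

With two pieces of pronunciation information and two pieces of prosody information each, the sketch yields the four records 1g_1, 1g_2, 2g_1, and 2g_2.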

[Speech synthesis means]
A configuration example of the speech synthesis means 14 is shown in FIG. 9, and its operation is described below. The speech synthesis information acquisition unit 90 sequentially acquires the speech synthesis information 1g_* to Ng_* (where * stands for 1, 2, ..., n) from the prosody information generation means 12 and outputs it to the phoneme unit selection unit 91. The phoneme unit selection unit 91 reads, from the speech DB 92, the speech-waveform phoneme units that match the pronunciation information and prosody information in the speech synthesis information 1g_* to Ng_*, and outputs them to the phoneme unit connection unit 93. The phoneme unit connection unit 93 concatenates the phoneme units to generate synthesized speech and outputs it to the synthesized speech selection means 16.
The quality score generation unit 94 assigns a quality score, an index of synthesized speech quality, to each synthesized speech generated by the phoneme unit connection unit 93. Possible ways of computing the quality score include quantifying the degree of agreement of the fundamental frequency, quantifying the degree of agreement of the average spectrum, quantifying spectral discontinuity, or integrating these, as described for example in the reference "Optimization of segment selection sub-cost function based on perceptual evaluation in waveform connected speech synthesis, IEICE Technical Report SP2003-81".

Suppose that the speech-waveform phoneme units held in the speech DB 92 cover almost all of the units corresponding to the prosody information 1a_R and 1b_R of the speech synthesis information 1g_1 and 1g_2 and to the prosody information 2a_R and 2b_R of the speech synthesis information 2g_1 and 2g_2, but do not include units corresponding to the "taku" portion of "watakushi" in the prosody information 1a_R of the speech synthesis information 1g_1.
In this case, the synthesized speech 1a_O, synthesized from the prosody information 1a_R of the speech synthesis information 1g_1, uses substitute phoneme units in the portion for which no matching unit exists, for example "ta" and "ku" units with a different fundamental frequency. As a result, the quality score, which represents the degree of agreement between the fundamental frequencies of the prosody information 1a_R and the synthesized speech, decreases. Assume, for example, that the quality score 1a_QS of the synthesized speech 1a_O is 0.7, and that the quality scores 1b_QS, 2a_QS, and 2b_QS of the synthesized speech 1b_O (based on the prosody information 1b_R of the speech synthesis information 1g_2) and of the synthesized speech 2a_O and 2b_O (for the speech synthesis information 2g_1 and 2g_2) are all 0.95. Here, 0.7 means that the fundamental frequencies of the phoneme units and the prosody information agree at a rate of 70%, and 0.95 means a rate of 95%.
The synthesized speech, its quality score, the pronunciation style score, and the prosodic style score together form one piece of synthesized speech information, and the resulting pieces of synthesized speech information are output to the synthesized speech selection means 16. In other words, each piece of synthesized speech information carries a pronunciation style score, a prosodic style score, and a quality score, so both the quality of each synthesized speech and the degree to which it reflects the utterance style can be read from it.
The synthesized speech information may be stored in a speech synthesis storage unit 95 provided in the speech synthesis means 14, or may be output sequentially to the synthesized speech selection means 16.
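As an illustration of a quality score that expresses the rate of fundamental-frequency agreement, a minimal sketch is given below. It is not the method of the cited reference: the function `f0_agreement_score`, the per-unit comparison, the tolerance of 10 Hz, and all F0 values are assumptions introduced only to show how a score such as 0.7 could arise when a few units have no good match in the speech DB.

```python
def f0_agreement_score(target_f0, unit_f0, tolerance_hz=10.0):
    """Fraction of units whose F0 lies within tolerance_hz of the target F0.

    target_f0: F0 values (Hz) requested by the prosody information.
    unit_f0:   F0 values (Hz) of the phoneme units actually selected.
    The tolerance-based comparison is an assumption of this sketch.
    """
    if len(target_f0) != len(unit_f0):
        raise ValueError("target_f0 and unit_f0 must have the same length")
    if not target_f0:
        return 0.0
    matched = sum(
        1 for t, u in zip(target_f0, unit_f0) if abs(t - u) <= tolerance_hz
    )
    return matched / len(target_f0)


# Made-up values: ten units, three of which had to be substituted by units
# with a clearly different fundamental frequency.
TARGET_F0 = [200.0] * 10
UNIT_F0 = [200.0, 205.0, 195.0, 240.0, 250.0, 198.0, 202.0, 160.0, 200.0, 203.0]

SCORE = f0_agreement_score(TARGET_F0, UNIT_F0)
print(SCORE)  # 7 of 10 units agree, so the score is 0.7
```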

[Synthesized speech selection means]
A configuration example of the synthesized speech selection means 16 is shown in FIG. 10, and its operation is described below. The synthesized speech information acquisition unit 100 acquires the synthesized speech information from the speech synthesis means 14 and stores it in the synthesized speech storage unit 101. At this time, the style score generation unit 100a in the synthesized speech information acquisition unit 100 combines the pronunciation style score and the prosodic style score attached to each piece of synthesized speech information, for example by adding them, into a style score, and stores it in the synthesized speech storage unit 101 together with the corresponding synthesized speech.
Instead of simply adding the pronunciation style score and the prosodic style score, each score may be weighted to adjust how strongly it influences the style score. For example, if the pronunciation style score reflects the utterance style more strongly, the pronunciation style score may be multiplied by 0.8 and the prosodic style score by 0.2 before the two are added.
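The plain and weighted combinations of the two style scores can be sketched as follows. The function name `style_score` and its parameter names are hypothetical; the weights 0.8 and 0.2 are the example values given in the text, and the per-candidate scores are those of the running example.

```python
def style_score(pronunciation_style, prosodic_style,
                w_pronunciation=1.0, w_prosodic=1.0):
    """Weighted sum of the two style scores; weights of 1.0 give the plain sum."""
    return w_pronunciation * pronunciation_style + w_prosodic * prosodic_style


# (pronunciation style score, prosodic style score) per synthesized speech.
SCORES = {
    "1a_O": (0.33, 1.0),
    "1b_O": (0.33, 0.5),
    "2a_O": (0.0, 1.0),
    "2b_O": (0.0, 0.5),
}

PLAIN = {name: style_score(p, r) for name, (p, r) in SCORES.items()}
WEIGHTED = {name: style_score(p, r, 0.8, 0.2) for name, (p, r) in SCORES.items()}

for name in SCORES:
    print(name, round(PLAIN[name], 3), round(WEIGHTED[name], 3))
```

With the weights 0.8 and 0.2, the ranking of the candidates changes: 1b_O (0.364) ranks above 2a_O (0.2), whereas the plain sum ranks 2a_O (1.0) above 1b_O (0.83).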

The synthesized speech selection unit 102 selects and outputs, from the synthesized speech information stored in the synthesized speech storage unit 101, the synthesized speech whose quality score exceeds the threshold γ held in the register 102a and whose style score is the highest. If no synthesized speech has a quality score exceeding the threshold γ, the synthesized speech with the highest quality score is selected and output.
The operation flow of the synthesized speech selection unit 102 is shown in FIG. 11. First, the synthesized speech information with the highest style score is selected from the synthesized speech information stored in the synthesized speech storage unit 101 (step S21). It is then determined whether the quality score of the selected synthesized speech information exceeds the threshold γ (step S22). If it does (Yes), the synthesized speech selection unit 102 outputs the synthesized speech of the selected synthesized speech information (step S25).
If it does not (No), the synthesized speech with the next highest style score is selected (step S24), and step S22 again determines whether its quality score exceeds the threshold γ; if so, that synthesized speech is output (step S25).

The synthesized speech selection unit 102 repeats this search in order of decreasing style score, determining in step S23 whether all the synthesized speech information in the synthesized speech storage unit 101 has been examined. If all of it has been examined (Yes), the synthesized speech with the highest quality score among them is output (step S26).
In short, if the quality score of the synthesized speech information with the highest style score exceeds the threshold γ, that synthesized speech is output; otherwise, the synthesized speech with the next highest style score is selected and is output if its quality score exceeds the threshold γ. This is repeated, and if no synthesized speech has a quality score exceeding the threshold γ, the synthesized speech with the highest quality score is output.
The score values used in the example so far are summarized here. The style scores of the synthesized speech 1a_O and 1b_O are 1.33 and 0.83, respectively. This is because the pronunciation style scores of 1a_O and 1b_O are both 0.33 (see FIG. 5), the prosodic style score 1a_RS of 1a_O is 1.0 (see FIG. 8), and the prosodic style score 1b_RS of 1b_O is 0.5.
The style scores of the synthesized speech 2a_O and 2b_O are 1.0 and 0.5, respectively. This is because the pronunciation style scores of 2a_O and 2b_O are both 0.0, the prosodic style score 2a_RS of 2a_O is 1.0, and the prosodic style score 2b_RS of 2b_O is 0.5.
As for the quality scores of the synthesized speech 1a_O to 2b_O, the synthesized speech 1a_O, for which some phoneme units matching the prosody information 1a_R are missing as described above, has the lowest quality score, 0.7, and the other synthesized speech 1b_O, 2a_O, and 2b_O each have a quality score of 0.95.

In this situation, if the threshold γ is set to, for example, 0.8, the synthesized speech with the highest style score is 1a_O. However, its quality score 1a_QS is 0.7, so the synthesized speech selection unit 102 determines that its quality does not meet the criterion and does not select it as the synthesized speech output.
The synthesized speech with the next highest style score is 2a_O, whose style score is 1.0, and its quality score 2a_QS is 0.95, which exceeds the threshold γ. The synthesized speech 2a_O is therefore selected and output.
In an example like this one, the conventional technique would have output the synthesized speech 1a_O, which has a low quality score and is synthesized from a single piece of pronunciation information and a single piece of prosody information.
In contrast, the style designation type speech synthesizer of Embodiment 1 computes a quality score for each of the multiple synthesized speech generated from multiple pieces of prosody information for each of multiple pieces of pronunciation information, and selects one synthesized speech taking both the quality score and the style score into account. This reduces the possibility of outputting synthesized speech with a fatally low quality score.
Although Embodiment 1 has been described using an example in which synthesized speech is generated from multiple pieces of prosody information for one piece of pronunciation information, synthesized speech may instead be generated by applying one piece of prosody information to multiple pieces of pronunciation information. Put concisely, the candidates are multiple synthesized speech that differ in pronunciation information and/or prosody information.
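The selection flow of FIG. 11 and the worked example above can be sketched as follows. The class `Candidate` and the function `select_synthesized_speech` are hypothetical names introduced for illustration; the step labels in the comments refer to the steps described in the text, and the score values are those of the example. How ties are resolved is not stated in the text; in this sketch they fall back to the input order.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """One piece of synthesized speech information (hypothetical layout)."""
    name: str
    style_score: float
    quality_score: float


def select_synthesized_speech(candidates, gamma):
    """Return the candidate chosen by the flow of steps S21 to S26."""
    if not candidates:
        raise ValueError("no synthesized speech to choose from")
    # S21 / S24: visit candidates in order of decreasing style score.
    for cand in sorted(candidates, key=lambda c: c.style_score, reverse=True):
        # S22: does the quality score exceed the threshold?
        if cand.quality_score > gamma:
            return cand  # S25: output this synthesized speech
    # S23 -> S26: every candidate was examined and none passed the threshold,
    # so fall back to the candidate with the highest quality score.
    return max(candidates, key=lambda c: c.quality_score)


CANDIDATES = [
    Candidate("1a_O", style_score=1.33, quality_score=0.70),
    Candidate("1b_O", style_score=0.83, quality_score=0.95),
    Candidate("2a_O", style_score=1.00, quality_score=0.95),
    Candidate("2b_O", style_score=0.50, quality_score=0.95),
]

CHOSEN = select_synthesized_speech(CANDIDATES, gamma=0.8)
print(CHOSEN.name)  # 2a_O: 1a_O is skipped because 0.7 does not exceed 0.8
```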

According to the present invention, whether to prioritize the quality of the synthesized speech or the desired utterance style can be controlled through the value of the threshold γ used by the synthesized speech selection means 16. Whether one wants synthesized speech as close as possible to the desired utterance style, or wants to avoid quality degradation even if the utterance style is reflected less, depends strongly on the application that uses speech synthesis. With the present invention, setting the threshold γ higher avoids outputting low-quality synthesized speech, while setting it lower yields synthesized speech that strongly reflects the desired utterance style even if its quality is somewhat lower. The style designation type speech synthesizer of the present invention can therefore easily change its behavior to suit the requirements of the application.
The style designation type speech synthesis method described above is summarized here with reference to the operation flow shown in FIG. 12. First, in the pronunciation information generation process 120, the pronunciation information generation means 10 generates, based on the input text α and the utterance style information β, one or more pieces of pronunciation information and, for each piece of pronunciation information, a pronunciation style score representing the degree to which the utterance style is reflected. The utterance style information expresses a factor, other than the content expressed by the text, that changes the speech.
Next, in the prosody information generation process 121, the prosody information generation means 12 receives the pieces of pronunciation information and the pronunciation style scores from the pronunciation information generation means 10 and generates, for each piece of pronunciation information, one or more pieces of prosody information and, for each piece of prosody information, a prosodic style score representing the degree to which the utterance style is reflected.
Next, in the speech synthesis process 122, the speech synthesis means 14 receives the pronunciation information, the pronunciation style scores, the prosody information, and the prosodic style scores from the prosody information generation means 12 and generates, for each piece of pronunciation information, multiple synthesized speech according to the respective pieces of prosody information.
Next, in the synthesized speech selection process 123, the pronunciation style scores, the prosodic style scores, and the quality scores are taken as input. From among the synthesized speech whose quality score exceeds the threshold γ, the one with the highest style score, based on the pronunciation style score and the prosodic style score, is selected and output; if no synthesized speech has a quality score exceeding the threshold, the synthesized speech with the highest quality score is selected and output.
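The effect of the threshold γ on the synthesized speech selection process 123 can be illustrated with the short sketch below, which states the selection rule as "filter by quality, then take the highest style score". The function `select` and the dictionary layout are hypothetical, the score values are those of the example in Embodiment 1, and breaking ties in the fallback by style score is an assumption of this sketch, since the text does not specify tie handling.

```python
# name -> (style score, quality score), values from the example in Embodiment 1.
CANDIDATES = {
    "1a_O": (1.33, 0.70),
    "1b_O": (0.83, 0.95),
    "2a_O": (1.00, 0.95),
    "2b_O": (0.50, 0.95),
}


def select(candidates, gamma):
    """Pick the highest style score among candidates whose quality exceeds gamma."""
    passing = {n: s for n, s in candidates.items() if s[1] > gamma}
    if passing:
        return max(passing, key=lambda n: passing[n][0])
    # No candidate exceeds the threshold: take the highest quality score.
    # Ties are broken by style score (an assumption of this sketch).
    return max(candidates, key=lambda n: (candidates[n][1], candidates[n][0]))


for gamma in (0.6, 0.8, 0.97):
    print(gamma, select(CANDIDATES, gamma))
# 0.6  -> 1a_O (a low threshold favors the utterance style)
# 0.8  -> 2a_O (1a_O is rejected for its quality score of 0.7)
# 0.97 -> 2a_O (nothing passes, so the highest quality score is used)
```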

Beyond Embodiment 1 above, the means, apparatus, and method of the present invention are not limited to the embodiment described and may be modified as appropriate without departing from the spirit of the invention. For example, as an alternative to the method shown in Embodiment 1 for assigning prosody information to pronunciation information, the prosody DB 62 may hold two levels, an upper limit and a lower limit, for the utterance style information, and the prosody generation unit 61 may calculate prosody information lying between those two levels.
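One way to calculate prosody information lying between two stored levels is linear interpolation, sketched below. The text states only that the prosody generation unit 61 calculates values between the two levels; the use of linear interpolation, the function `interpolate_prosody`, the representation of prosody as a list of F0 values, and the numbers themselves are all assumptions of this sketch.

```python
def interpolate_prosody(lower, upper, ratio):
    """Blend two F0 contours (Hz) of equal length.

    ratio = 0.0 returns the lower-limit contour, ratio = 1.0 the upper-limit
    contour. Linear interpolation is an assumption of this sketch.
    """
    if len(lower) != len(upper):
        raise ValueError("contours must have the same length")
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must lie in [0, 1]")
    return [lo + ratio * (hi - lo) for lo, hi in zip(lower, upper)]


# Made-up contours standing in for the two levels held in the prosody DB 62.
LOWER = [100.0, 120.0, 110.0]
UPPER = [140.0, 180.0, 130.0]

print(interpolate_prosody(LOWER, UPPER, 0.5))  # [120.0, 150.0, 120.0]
```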
The processes described for the above means, apparatus, and method need not be executed only in time series in the order described; they may also be executed in parallel or individually, depending on the processing capacity of the device executing them or as needed.
When the processing functions of the above means, apparatus, and method are implemented by a computer, the processing content of the functions that the style designation type speech synthesizer should have is described by a program. Executing this program on a computer implements the processing functions of the style designation type speech synthesizer on that computer.
The program describing this processing content can be recorded on a computer-readable storage medium. The computer-readable storage medium may be of any kind, for example a magnetic storage device, an optical disc, a magneto-optical storage medium, or a semiconductor memory. Specifically, a hard disk device, a flexible disk, a magnetic tape, or the like can be used as the magnetic storage device; a DVD (Digital Versatile Disc), DVD-RAM (Random Access Memory), CD-ROM (Compact Disc Read Only Memory), CD-R (Recordable)/RW (ReWritable), or the like as the optical disc; an MO (Magneto-Optical disc) or the like as the magneto-optical storage medium; and an EEP-ROM (Electronically Erasable and Programmable-Read Only Memory) or the like as the semiconductor memory.
The program is distributed, for example, by selling, transferring, or lending a portable storage medium such as a DVD or CD-ROM on which the program is recorded. Alternatively, the program may be stored in a storage device of a server computer and distributed by transferring it from the server computer to other computers over a network.

A computer that executes such a program first stores, for example, the program recorded on the portable storage medium or transferred from the server computer in its own storage device. When executing the processing, the computer reads the program stored in its own storage device and executes processing according to the program it has read. As another form of execution, the computer may read the program directly from the portable storage medium and execute processing according to it, or it may execute processing according to the received program each time the program is transferred to it from the server computer. The processing described above may also be executed by a so-called ASP (Application Service Provider) type service, in which the program is not transferred from the server computer to the computer and the processing functions are realized only through execution instructions and retrieval of results. The program in this embodiment includes information that is provided for processing by an electronic computer and is equivalent to a program (such as data that is not a direct command to the computer but has the property of defining the computer's processing).
Although in this embodiment each apparatus is configured by executing a predetermined program on a computer, at least part of the processing content may be implemented in hardware.

FIG. 1 is a diagram showing the functional configuration blocks of Embodiment 1 of the style designation type speech synthesizer of the present invention.
FIG. 2 is a diagram showing the operation flow of Embodiment 1.
FIG. 3 is a diagram showing a functional configuration example of the pronunciation information generation means 10 of the present invention.
FIG. 4 is a diagram showing an example of the style-dependent dictionary 31.
FIG. 5 is a diagram showing an example of the pronunciation information of the present invention.
FIG. 6 is a diagram showing a functional configuration example of the prosody information generation means 12 of the present invention.
FIG. 7 is a diagram showing an example of prosody information.
FIG. 8 is a diagram showing an example of the prosody information generated by the prosody generation unit 60 of the present invention.
FIG. 9 is a diagram showing a functional configuration example of the speech synthesis means 14 of the present invention.
FIG. 10 is a diagram showing a functional configuration example of the synthesized speech selection means 16 of the present invention.
FIG. 11 is a diagram showing an example of the operation flow of the synthesized speech selection unit 102.
FIG. 12 is a diagram showing the flow of the style designation type speech synthesis method of the present invention.
FIG. 13 is a diagram showing the conventional style-dependent speech synthesizer disclosed in Non-Patent Document 1.

Claims

Pronunciation information generation means that receives text and utterance style information, the utterance style information being a factor, other than the content expressed by the text, that changes the speech, and that generates and outputs one or more pieces of pronunciation information and, for each piece of pronunciation information, a pronunciation style score representing the degree to which the utterance style is reflected;
Prosody information generation means that receives the utterance style information and the pronunciation information from the pronunciation information generation means, and that generates and outputs one or more pieces of prosody information for each piece of pronunciation information and, for each piece of prosody information, a prosodic style score representing the degree to which the utterance style is reflected;
Speech synthesis means that receives the pronunciation information from the pronunciation information generation means and the prosody information from the prosody information generation means, and that generates and outputs a plurality of synthesized speech differing in pronunciation information and/or prosody information, together with a quality score representing the degree of quality of each synthesized speech signal;
Synthesized speech selection means that receives the synthesized speech and the quality scores from the speech synthesis means, that selects and outputs, from among the synthesized speech whose quality score exceeds a threshold, the synthesized speech with the highest style score based on the pronunciation style score and the prosodic style score, and that, if no synthesized speech has a quality score exceeding the threshold, selects and outputs the synthesized speech with the highest quality score;
A style designation type speech synthesizer.

The style designation type speech synthesizer according to claim 1,
wherein the prosody information is generated by referring to a prosody database in which the utterance style of the utterance style information is emphasized at ratios in several stages.

The style designation type speech synthesizer according to claim 1 or 2,
further comprising a style score generation unit that obtains the style score as a weighted sum of the pronunciation style score and the prosodic style score.

The style designation type speech synthesizer according to any one of claims 1 to 3,
wherein the pronunciation style score in the pronunciation information generation means is generated as a degree based on the number of words to which a pronunciation is applied from a style-dependent dictionary that records pronunciations differing depending on the utterance style, and
the prosodic style score in the prosody information generation means is generated from the emphasis ratio of the utterance style of the utterance style information.

The style designation type speech synthesizer according to any one of claims 1 to 4,
wherein the threshold can be set externally.

A pronunciation information generation process in which pronunciation information generation means generates, from input text and utterance style information that is a factor that changes the speech, one or more pieces of pronunciation information and, for each piece of pronunciation information, a pronunciation style score representing the degree to which the utterance style is reflected;
A prosody information generation process in which prosody information generation means generates, from the utterance style information and the pronunciation information, one or more pieces of prosody information for each piece of pronunciation information and, for each piece of prosody information, a prosodic style score representing the degree to which the utterance style is reflected;
A speech synthesis process in which speech synthesis means generates, from the pronunciation information and the prosody information, a plurality of synthesized speech signals differing in pronunciation information and/or prosody information, and a quality score representing the degree of quality of each synthesized speech signal;
A synthesized speech selection process in which synthesized speech selection means selects, from among the synthesized speech whose quality score exceeds a threshold, the synthesized speech with the highest style score based on the pronunciation style score and the prosodic style score, and, if no synthesized speech has a quality score exceeding the threshold, selects the synthesized speech with the highest quality score;
A style designation type speech synthesis method comprising the above processes.

6. A format designation type speech synthesis program for causing a computer to function as each device according to claim 1.

A computer-readable storage medium storing the program according to claim 7.