JP2007212884A - Speech synthesizer, speech synthesizing method, and computer program

Info

Publication number: JP2007212884A
Application number: JP2006034270A
Authority: JP
Inventors: Kentaro Murase; 健太郎村瀬; Nobuyuki Katae; 伸之片江; Kazuhiro Watanabe; 一宏渡辺
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2006-02-10
Filing date: 2006-02-10
Publication date: 2007-08-23

Abstract

PROBLEM TO BE SOLVED: To provide a speech synthesizer, a speech synthesis method, and a computer program that expand the predetermined range of a variable part, based on whether the accent at the end of the word in the variable part is high or low and/or whether the accent of the following fixed part is high or low, up to a position where the variable part can be connected consistently to the pitch pattern of the fixed part.

SOLUTION: Template data and the text data of the variable part are acquired, and reading, accent, and prosody information is extracted from the acquired template data. The acquired text data is inserted into the variable part, and reading, accent, and prosody information is generated for the synthesized speech including the fixed part. When the accent extracted from the template data does not match the generated accent at the start of the initially set fixed part, the variable part is expanded to the position where the two match, and the synthesized speech is generated.

COPYRIGHT: (C)2007, JPO&INPIT

Description

The present invention relates to a speech synthesizer, a speech synthesis method, and a computer program capable of generating synthesized speech with natural accents based on text data. In particular, it relates to a speech synthesizer, a speech synthesis method, and a computer program capable of generating natural synthesized speech for text data composed of a variable part, which is replaced by an arbitrary word such as a place name or a personal name, and a fixed part, which is always the same regardless of the variable part.

The quality of synthesized speech in TTS (text-to-speech) systems, which read arbitrary text data aloud, is improving day by day. It has not, of course, reached the quality of natural human speech. However, when the text data to be synthesized is fixed in advance, prosody information representing features such as rhythm, intonation, and accent can be extracted beforehand from an actual human utterance. Synthesizing speech according to the extracted prosody information then yields natural speech close to the human voice.

Such a speech synthesis system is suited to generating voice response messages that are output automatically by machines, such as bank automated teller machines and the automatic voice response systems used for first-line reception at call centers. A voice response message output by a machine is often a sentence composed of a variable part, which specifies an amount of money or the like, and an otherwise fixed part, as in "Is XX yen correct?". A high-quality voice response message can therefore be generated at low cost by applying a TTS system to the variable part and generating the fixed part using prosody information extracted from a human voice.

However, because the variable part and the fixed part are synthesized by different methods, the prosody information must be adjusted so that the prosody is smooth and natural at the boundary between them. Without such adjustment, the prosody becomes unnatural at the boundary between the variable part and the fixed part.

To solve this problem, Patent Document 1, for example, discloses a method of appropriately adjusting the prosody of the variable part so that it connects smoothly to the fixed part, and a method of adjusting the prosody of the fixed part in advance to suit the prosody of the variable part. Patent Document 2 discloses a method of connecting response speech naturally and smoothly by describing in detail the order and method of combining the text data to be joined to a response template (control information).

Furthermore, Patent Document 3 discloses a speech synthesis method in which only part of a sentence can be changed by rule-based synthesis: TTS is run on the variable part together with the surrounding character strings, and only the prosody of the variable part is taken out and connected to the fixed part. The remaining parts are synthesized using synthesis parameters or speech waveform data obtained by analysis-synthesis.
Japanese Patent Application Laid-Open No. 11-126087
Japanese Patent No. 3578961
Japanese Patent Application Laid-Open No. 11-038989

However, with the methods described above, the accent can be unnatural even when the prosody is connected smoothly at the boundary between the variable part and the fixed part. For example, suppose the prosody information of the fixed part has been prepared using, as a template, a human utterance of "(Tokyo) no tenki wa hare desu." ("The weather in (Tokyo) is sunny."; the place name in parentheses is the variable part). When "Yamagata" is inserted into the variable part of this template, the accent becomes unnatural. FIG. 5 shows the difference in accent between the utterances "(Tokyo) no tenki wa hare desu." and "(Yamagata) no tenki wa hare desu."

That is, the accent of "Tokyo" is of a type that ends high, and the beginning of the following "no tenki wa" also has a relatively high pitch pattern (see FIG. 5(a)). In particular, the accent of "no" is classified as "high". By contrast, the accent of "Yamagata" is of a type that ends low, and the beginning of the following "no tenki wa" also has a relatively low pitch pattern (see the broken line in FIG. 5(b)). In this case, the accent of "no" is classified as "low". Thus, the pitch pattern of the fixed part "no tenki wa" differs depending on the accent type of the variable part immediately before it; for example, the accent of "no" is classified as "high" in one case and "low" in the other.
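The two patterns described above can be illustrated with a short Python sketch that derives a high/low level per mora from an accent type, using a simplified textbook rule for Tokyo-dialect pitch accent. The function name `pitch_levels`, the "H"/"L" encoding, and the accent types assumed for "Tokyo" (flat, nucleus 0) and "Yamagata" (nucleus on the second mora) are assumptions made for this example and are not taken from the specification.

```python
from typing import List


def pitch_levels(morae: int, nucleus: int, particle: bool = True) -> List[str]:
    """Return "H"/"L" per mora for a word, optionally followed by a one-mora particle.

    nucleus is the 1-based mora after which the pitch falls; 0 means no fall
    (flat type). This is a simplified rule used only for illustration.
    """
    total = morae + (1 if particle else 0)
    levels = []
    for i in range(1, total + 1):
        if nucleus == 0:
            # Flat type: low first mora, high afterwards, including the particle.
            levels.append("L" if i == 1 else "H")
        elif nucleus == 1:
            # Fall right after the first mora.
            levels.append("H" if i == 1 else "L")
        else:
            # Low first mora, high up to the nucleus, low afterwards.
            levels.append("H" if 1 < i <= nucleus else "L")
    return levels


# Assumed accent types: "Tokyo" = 4 morae, flat; "Yamagata" = 4 morae, fall after mora 2.
tokyo_no = pitch_levels(4, 0)      # last element is the particle "no"
yamagata_no = pitch_levels(4, 2)   # last element is the particle "no"
```

Under these assumptions, the last element (the particle "no") is "H" after "Tokyo" and "L" after "Yamagata", which is the mismatch the text describes.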

Therefore, even when only the pitch pattern of the variable part is adjusted, as disclosed in Patent Documents 1 and 3, the pitch pattern of the fixed part itself differs, so the synthesized speech sounds unnatural at the boundary because of the difference in pitch. In Japanese in particular, some words change meaning with accent, such as ame "candy" (a pitch pattern that is low in the first half and high in the second) and ame "rain" (a pitch pattern that is high in the first half and low in the second). When the meaning changes with the accent in this way, it can become difficult to grasp the meaning of the sentence.

Moreover, when the pitch pattern of a fixed part is deformed on the assumption of a different accent type, as disclosed in Patent Document 2, the deformation processing becomes complex. Specifying a deformation method for every template sentence, variable part, and so on also takes enormous effort, so forming natural prosody is difficult in practice.

The present invention has been made in view of these circumstances. Its object is to provide a speech synthesizer, a speech synthesis method, and a computer program that can output synthesized speech as natural speech by expanding the predetermined range of the variable part, based on whether the accent at the end of the word in the variable part is high or low and/or whether the accent of the following fixed part is high or low, up to a position where the variable part can be connected consistently to the pitch pattern of the fixed part.

To achieve the above object, the speech synthesizer according to the first invention generates synthesized speech for text data composed of a variable part and a fixed part, using text-to-speech synthesis for the variable part and, for the fixed part, template data in which reading, accent, and prosody information is stored in advance. The speech synthesizer comprises: acquisition means for acquiring the template data and the text data of the variable part; extraction means for extracting reading, accent, and prosody information from the acquired template data; generation means for inserting the acquired text data into the variable part and generating reading, accent, and prosody information for the synthesized speech including the fixed part; determination means for determining whether the accent extracted from the template data matches the generated accent at the start of the initially set fixed part; and variable part expansion means for expanding the variable part, when the determination means determines that they do not match, to the position where the extracted accent and the generated accent match. Synthesized speech is generated by text-to-speech synthesis for the expanded variable part and based on the template data for the fixed part excluding the expanded variable part.
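The determination and expansion steps of the first invention can be sketched as follows. This is a minimal illustration, not the claimed implementation: it assumes that the accents of the fixed part are available as two sequences of "H"/"L" labels aligned mora by mora, one extracted from the template and one generated by TTS, and the function name `expand_variable_part` and the sample values are hypothetical.

```python
from typing import Sequence


def expand_variable_part(template_accents: Sequence[str],
                         generated_accents: Sequence[str]) -> int:
    """Return how many leading units of the fixed part move into the variable part.

    Both sequences describe the initially set fixed part only, aligned unit by
    unit. 0 means the accents already match at the start of the fixed part,
    so no expansion is needed.
    """
    if len(template_accents) != len(generated_accents):
        raise ValueError("accent sequences must be aligned and of equal length")
    pos = 0
    # Walk forward while the template accent and the generated accent disagree.
    while pos < len(template_accents) and template_accents[pos] != generated_accents[pos]:
        pos += 1
    return pos


# Illustrative values for the fixed part "no te n ki wa" (assumed, not from the figures):
template = ["H", "H", "L", "L", "L"]    # template recorded after "Tokyo"
generated = ["L", "H", "L", "L", "L"]   # TTS result after "Yamagata"
shift = expand_variable_part(template, generated)
```

With these sample values the first unit ("no") is moved into the variable part and synthesized by TTS, and the template is used from the second unit onward.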

The speech synthesizer according to the second invention generates synthesized speech for text data composed of a variable part and a fixed part, using text-to-speech synthesis for the variable part and, for the fixed part, template data in which reading, accent, and prosody information is stored in advance. The speech synthesizer comprises: acquisition means for acquiring the template data and the text data of the variable part; extraction means for extracting reading, accent, and prosody information from the acquired template data; generation means for inserting the acquired text data into the variable part and generating reading, accent, and prosody information for the synthesized speech including the fixed part; determination means for determining whether the accent extracted from the template data matches the generated accent at the end of the initially set variable part and at the start of the fixed part; and variable part expansion means for expanding the variable part, when the determination means determines that they do not match, to the position where the extracted accent and the generated accent match. Synthesized speech is generated by text-to-speech synthesis for the expanded variable part and based on the template data for the reduced fixed part.
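The determination in the second invention differs from the first in that it looks at both sides of the boundary. One possible reading, sketched below, is that expansion is triggered when the template and generated accents disagree at either the last unit of the variable part or the first unit of the fixed part. The function name `needs_expansion` and the "H"/"L" labels are assumptions made for this example.

```python
def needs_expansion(template_var_end: str, generated_var_end: str,
                    template_fixed_start: str, generated_fixed_start: str) -> bool:
    """Return True if the accents disagree on either side of the boundary.

    The *_var_end arguments are the accents at the end of the variable part,
    the *_fixed_start arguments are the accents at the start of the fixed part;
    "template" values come from the template data, "generated" values from TTS.
    """
    end_mismatch = template_var_end != generated_var_end
    start_mismatch = template_fixed_start != generated_fixed_start
    return end_mismatch or start_mismatch


# Template recorded with a word ending high; inserted word ends low (assumed values).
expand = needs_expansion("H", "L", "H", "L")
```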

In the speech synthesizer according to the third invention, which depends on the first or second invention, the variable part expansion means comprises means for extracting accent phrases that have a portion in which the accent generated by the generation means matches the accent extracted by the extraction means. Among the accent phrases extracted by this means, the variable part is expanded to the end of the accent phrase closest to the initially set variable part.
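The accent-phrase rule of the third invention can be sketched as below, following a literal reading of the text: find the accent phrases that contain at least one position where the template and generated accents agree, take the one nearest the variable part, and expand the variable part to the end of that phrase. The representation of accent phrases as `(start, end)` index pairs over the fixed part (end exclusive, ordered from the variable part outward) and the name `expansion_end` are assumptions for this example.

```python
from typing import List, Sequence, Tuple


def expansion_end(template_accents: Sequence[str],
                  generated_accents: Sequence[str],
                  phrases: List[Tuple[int, int]]) -> int:
    """Return the index in the fixed part up to which the variable part is expanded.

    phrases lists accent phrases of the fixed part as (start, end) with end
    exclusive, ordered by distance from the variable part (nearest first).
    """
    for start, end in phrases:
        # A phrase qualifies if the two accent sequences agree somewhere inside it.
        if any(template_accents[i] == generated_accents[i] for i in range(start, end)):
            return end
    # No phrase qualifies: the whole fixed part would be synthesized by TTS.
    return len(template_accents)


# Illustrative values (assumed): unit 0 is "no", units 1-4 are "te n ki wa".
template = ["H", "H", "L", "L", "L"]
generated = ["L", "H", "L", "L", "L"]
end = expansion_end(template, generated, [(0, 1), (1, 5)])
```

Ending the expansion on an accent phrase boundary keeps each phrase under a single synthesis method, so the switch from TTS to template prosody does not fall inside a phrase.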

The speech synthesis method according to the fourth invention generates synthesized speech for text data composed of a variable part and a fixed part, using text-to-speech synthesis for the variable part and, for the fixed part, template data in which reading, accent, and prosody information is stored in advance. The method comprises: acquiring the template data and the text data of the variable part; extracting reading, accent, and prosody information from the acquired template data; inserting the acquired text data into the variable part and generating reading, accent, and prosody information for the synthesized speech including the fixed part; determining whether the accent extracted from the template data matches the generated accent at the start of the initially set fixed part; expanding the variable part, when they are determined not to match, to the position where the extracted accent and the generated accent match; and generating synthesized speech by text-to-speech synthesis for the expanded variable part and based on the template data for the fixed part excluding the expanded variable part.

The computer program according to the fifth invention is executable by a computer that generates synthesized speech for text data composed of a variable part and a fixed part, using text-to-speech synthesis for the variable part and, for the fixed part, template data in which reading, accent, and prosody information is stored in advance. The program causes the computer to function as: acquisition means for acquiring the template data and the text data of the variable part; extraction means for extracting reading, accent, and prosody information from the acquired template data; generation means for inserting the acquired text data into the variable part and generating reading, accent, and prosody information for the synthesized speech including the fixed part; determination means for determining whether the accent extracted from the template data matches the generated accent at the start of the initially set fixed part; variable part expansion means for expanding the variable part, when the determination means determines that they do not match, to the position where the extracted accent and the generated accent match; and means for generating synthesized speech by text-to-speech synthesis for the expanded variable part and based on the template data for the fixed part excluding the expanded variable part.

In the first, fourth, and fifth inventions, synthesized speech is generated for text data composed of a variable part and a fixed part, using text-to-speech synthesis for the variable part and, for the fixed part, template data in which reading, accent, and prosody information is stored in advance. The template data and the text data of the variable part are acquired; reading, accent, and prosody information is extracted from the acquired template data; and the acquired text data is inserted into the variable part to generate reading, accent, and prosody information for the synthesized speech including the fixed part. It is then determined whether the accent extracted from the template data matches the generated accent at the start of the initially set fixed part. If they are determined not to match, the variable part is expanded to the position where the extracted accent and the generated accent match, and synthesized speech is generated by text-to-speech synthesis for the expanded variable part and based on the template data for the fixed part excluding the expanded variable part. Thus, when the accent generated at the start of the fixed part does not match the accent extracted from the template data, for example when the generated accent has a low pitch pattern while the template data has a high one, the variable part is expanded to the next position where the accents match. This preserves the continuity of the accent from the variable part into the fixed part of the template data, and makes it possible to generate synthesized speech with natural accent and prosody close to the human voice over the whole sentence, including both the variable part and the fixed part.

In the second invention, synthesized speech is generated for text data composed of a variable part and a fixed part, using text-to-speech synthesis for the variable part and, for the fixed part, template data in which reading, accent, and prosody information is stored in advance. The template data and the text data of the variable part are acquired; reading, accent, and prosody information is extracted from the acquired template data; and the acquired text data is inserted into the variable part to generate reading, accent, and prosody information for the synthesized speech including the fixed part. It is then determined whether the accent extracted from the template data matches the generated accent at the end of the initially set variable part and at the start of the fixed part. If they are determined not to match, the variable part is expanded to the position where the extracted accent and the generated accent match, and synthesized speech is generated by text-to-speech synthesis for the expanded variable part and based on the template data for the reduced fixed part. Thus, when the accents do not match at the end of the variable part and the start of the fixed part, for example when the accent at the end of the variable part has a low pitch pattern and the accent at the start of the fixed part has a high one, the variable part is expanded to the next position where the accents match. This ensures smoother continuity of the accent from the variable part into the fixed part of the template data, and makes it possible to generate synthesized speech with natural accent and prosody close to the human voice over the whole sentence, including both the variable part and the fixed part.

In the third invention, accent phrases that have a portion in which the generated accent matches the accent extracted from the template are extracted, and the variable part is expanded to the end of the extracted accent phrase closest to the initially set variable part. Expanding the variable part to the end of an accent phrase, where the accent levels are likely to match, ensures a natural change of accent from the variable part into the fixed part of the template data. In addition, the prosody information contained in the high-quality template data of the fixed part can be used to the fullest, so synthesized speech with natural accent and prosody close to the human voice can be generated.

According to the first, fourth, and fifth inventions, when the accent generated at the start of the fixed part does not match the accent extracted from the template data, for example when the generated accent has a low pitch pattern while the template data has a high one, the variable part is expanded to the next position where the accents match. This preserves the continuity of the accent from the variable part into the fixed part of the template data, and makes it possible to generate synthesized speech with natural accent and prosody close to the human voice over the whole sentence, including both the variable part and the fixed part.

According to the second invention, when the accents do not match at the end of the variable part and the start of the fixed part, for example when the accent at the end of the variable part has a low pitch pattern and the accent at the start of the fixed part has a high one, the variable part is expanded to the next position where the accents match. This ensures smoother continuity of the accent from the variable part into the fixed part of the template data, and makes it possible to generate synthesized speech with natural accent and prosody close to the human voice over the whole sentence, including both the variable part and the fixed part.

According to the third invention, expanding the variable part to the end of an accent phrase, where the accent levels are likely to match, ensures a natural change of accent from the variable part into the fixed part of the template data. In addition, the prosody information contained in the high-quality template data of the fixed part can be used to the fullest, so synthesized speech with natural accent and prosody close to the human voice can be generated.

The present invention is described in detail below with reference to the drawings showing its embodiments.

(Embodiment 1)
FIG. 1 is a block diagram showing the configuration of a computer that embodies the speech synthesizer 1 according to Embodiment 1 of the present invention. The computer implementing the speech synthesizer 1 according to Embodiment 1 comprises at least: an arithmetic processing unit 11 such as a CPU or DSP; a ROM 12; a RAM 13; a communication interface unit 14 capable of data communication with an external computer; a storage unit 15 including a template storage unit 151, which stores reading, accent, prosody information, and the like for each piece of template data obtained by templating fixed-form sentences; and an audio output unit 16 that outputs the synthesized speech.

"Templating" means generating template data by the following procedure. First, for a fixed-form sentence composed of a variable part and a fixed part (for example, "The weather in XX is sunny.", where XX is the variable part and the rest is the fixed part), a template text is created by inserting a suitable word into the variable part (for example, "The weather in Tokyo is sunny."), and a person reading this text aloud is recorded. Next, prosody information such as the length of each phoneme, pitch data representing the height of the voice, and the amplitude of each phoneme is extracted from the recorded speech. Finally, the reading and the accent information of each syllable (accent high/low, accent phrase positions, and so on) are attached to the template text, and the result is stored as template data. The template data contains at least the reading and accent of the variable part at the time the template was created (for example, "Tokyo"), and the reading, accent, phoneme lengths, and pitch data of the fixed part.
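The contents of one piece of template data, as listed above, can be pictured as a simple record. The following Python sketch is only an illustration: the class name `TemplateData`, the field names, the units, and all sample values are hypothetical placeholders and do not come from the specification.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TemplateData:
    """One template: variable-part info at recording time plus fixed-part prosody."""
    template_id: str
    text: str                        # template text with a sample word in the variable part
    variable_reading: str            # reading of the variable part at template creation
    variable_accent: List[str]       # "H"/"L" per unit of the variable part
    fixed_reading: str               # reading of the fixed part
    fixed_accent: List[str]          # "H"/"L" per unit of the fixed part
    phoneme_lengths_ms: List[float]  # duration of each phoneme in the fixed part
    pitch_hz: List[float]            # pitch data for the fixed part

    def __post_init__(self) -> None:
        # Reject accent labels outside the two levels used in this sketch.
        for label in self.variable_accent + self.fixed_accent:
            if label not in ("H", "L"):
                raise ValueError(f"unknown accent label: {label!r}")


# Placeholder values only; real values would come from analysis of the recording.
example = TemplateData(
    template_id="T001",
    text="Tokyo no tenki wa hare desu.",
    variable_reading="to o kyo o",
    variable_accent=["L", "H", "H", "H"],
    fixed_reading="no te n ki wa ha re de su",
    fixed_accent=["H", "H", "L", "L", "L", "L", "H", "H", "L"],
    phoneme_lengths_ms=[],
    pitch_hz=[],
)
```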

The arithmetic processing unit 11 is connected via an internal bus 17 to the hardware units of the speech synthesizer 1 described above. It controls these hardware units and executes various software functions according to the processing programs stored in the ROM 12. These include, for example, a program that extracts the reading, accent, and prosody information of template data from the template storage unit 151; a program that inserts text data into the variable part and generates reading, accent, and prosody information for the synthesized speech including the fixed part; a program that determines whether the extracted accent matches the generated accent at the end of the initially set variable part and the start of the fixed part; and a program that expands the variable part depending on whether the accents match.

The ROM 12 is composed of flash memory or the like and stores the processing programs needed for the computer to function as the speech synthesizer 1. The RAM 13 is composed of SRAM or the like and stores temporary data generated while software is running. The communication interface unit 14 receives from outside the text data to be synthesized and the information identifying the template data of the fixed part, or downloads the above programs from an external computer.

The storage unit 15 is a fixed storage device such as a hard disk and stores the information needed to generate synthesized speech. For example, it includes the template storage unit 151, which stores reading, accent, prosody information, and the like for each piece of template data obtained by templating fixed-form sentences, and a word dictionary 152, which is needed to generate readings and accents by TTS for the text from the start of the variable part onward. The storage unit 15 is not limited to a fixed storage device; it may be an auxiliary storage device using a portable storage medium such as a CD or DVD, or a storage device on an external computer connectable via the communication interface unit 14. The audio output unit 16 is an audio output device such as a speaker and outputs the synthesized speech.

FIG. 2 is a flowchart showing the procedure of the speech synthesis processing performed by the arithmetic processing unit 11 of the speech synthesizer 1 according to Embodiment 1 of the present invention. The speech synthesizer 1 according to Embodiment 1 is characterized in that it determines whether to expand the range set in advance as the variable part and, when it determines that the range should be expanded, expands the variable part to a position where it can be connected without conflicting with the accent of the template data of the fixed part, and then generates the synthesized speech.

音声合成装置１の演算処理部１１は、読上げ対象となるテキストデータ及び固定部分の情報を含むテンプレートデータを識別する識別情報、例えばテンプレートＩＤを取得する（ステップＳ２０１）。ここで、読上げ対象となるテキストデータは、可変部分のみのテキストであっても良いし、可変部分と固定部分とを識別することが可能であることを条件として両方の部分を含むテキストで構成されていても良い。また、テンプレートデータを識別する識別情報の替わりに、テンプレートデータの読み、アクセント及び韻律情報を直接取得する構成であっても良い。また、読上げ対象となるテキストデータ及びテンプレートデータを識別する識別情報の取得方法は、特に限定されるものではなく、ユーザによる入力であっても良いし、合成音声を出力するアプリケーションからデータとして取得するものであっても良い。 The arithmetic processing unit 11 of the speech synthesizer 1 acquires the text data to be read out and identification information, for example a template ID, that identifies the template data containing the fixed-part information (step S201). The text data to be read out may consist of the variable part only, or may contain both the variable part and the fixed part, provided the two can be distinguished. Instead of identification information for the template data, the reading, accent, and prosodic information of the template data may be acquired directly. How the text data and the identification information are acquired is not particularly limited: they may be entered by a user or received as data from an application that outputs synthesized speech.

演算処理部１１は、取得したテンプレートＩＤに基づいてテンプレート記憶部１５１を照会し、対応するテンプレートデータの読み、アクセント、及び韻律情報を抽出する（ステップＳ２０２）。ただし、ステップＳ２０１で、テンプレートデータを識別する識別情報を取得する替わりに、直接、テンプレートデータの読み、アクセント、及び韻律情報を取得する場合は、テンプレート記憶部１５１を照会せずに、ステップＳ２０１で取得した情報をそのまま利用する。図３は、本発明の実施の形態１に係る音声合成装置１のテンプレート記憶部１５１に記憶されているテンプレートデータのデータ構成の一例を示す図である。図３（ａ）に示すように、テンプレートデータは、テンプレートデータを識別するテンプレートＩＤに対応付けて、可変部分及び固定部分の読み及びアクセントに関する情報を記憶してある。例えば「’」は、アクセントが「高」から「低」へ変化する位置を示しており、「＿」はアクセント句の境界を示している。 The arithmetic processing unit 11 inquires the template storage unit 151 based on the acquired template ID, and extracts reading of the corresponding template data, accent, and prosody information (step S202). However, instead of acquiring the identification information for identifying the template data in step S201, if the reading of the template data, the accent, and the prosody information are acquired directly, the template storage unit 151 is not queried and the step S201 is performed. Use the acquired information as it is. FIG. 3 is a diagram illustrating an example of a data configuration of template data stored in the template storage unit 151 of the speech synthesizer 1 according to Embodiment 1 of the present invention. As shown in FIG. 3A, the template data stores information related to the reading and accent of the variable part and the fixed part in association with the template ID for identifying the template data. For example, “′” indicates a position where the accent changes from “high” to “low”, and “_” indicates the boundary of the accent phrase.

ここでアクセント句とは、日本語共通語の語句のアクセントを示す最小単位であり、例えばアクセントのタイプとして、最初のアクセントが「低」であり、以降「高」となるタイプ、アクセントが「高」から「低」へ変化する部分を１箇所のみ含み（最初の音節が「高」の場合は２番目以降の音節が「低」、それ以外の場合は、最初の音節が「低」、２番目以降低に変わる音節まで「高」）「高」から「低」へ変化する音節番号で表されるタイプのいずれかで区切られた語句の最小単位を意味している。 Here, an accent phrase is the smallest unit that carries the accent of a phrase in standard Japanese. It is the smallest unit of a phrase that falls into one of two accent types: (1) a type in which the first syllable is "low" and all following syllables are "high"; or (2) a type that contains exactly one change from "high" to "low" and is identified by the number of the syllable at which the change occurs (if the first syllable is "high", the second and later syllables are "low"; otherwise the first syllable is "low" and the syllables from the second up to the one where the pitch drops are "high").
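
As a rough illustration of the two accent types just described, the following Python sketch derives a per-syllable high/low pattern from the syllable count and the number of the syllable after which the pitch drops. The function name `accent_pattern` and the convention that `drop_after = 0` means "no drop" are assumptions made for this example only; they do not appear in the disclosure.

```python
def accent_pattern(n_syllables: int, drop_after: int) -> list[str]:
    """Return 'H' (high) or 'L' (low) for each syllable of one accent phrase.

    drop_after == 0 : no high-to-low change (first syllable low, rest high)
    drop_after == 1 : first syllable high, all later syllables low
    drop_after == k : first syllable low, syllables 2..k high, rest low
    (The encoding of drop_after is a hypothetical convention.)
    """
    if n_syllables < 1 or not 0 <= drop_after <= n_syllables:
        raise ValueError("invalid syllable count or drop position")
    if drop_after == 1:
        return ["H"] + ["L"] * (n_syllables - 1)
    # Index of the first syllable that is low again after the high stretch.
    last_high = n_syllables if drop_after == 0 else drop_after
    return ["L" if i == 0 or i >= last_high else "H" for i in range(n_syllables)]


if __name__ == "__main__":
    print(accent_pattern(4, 0))  # phrase that ends high
    print(accent_pattern(4, 2))  # phrase that ends low
```

A four-syllable phrase with no drop ends on a high syllable, whereas one whose pitch drops after the second syllable ends low; this difference at the end of the phrase is what the later boundary checks compare.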

固定部分の韻律情報は、図３（ｂ）に示すようにテンプレートＩＤに対応付けてアルファベット単位で記憶されており、各音韻の時間長をミリ秒単位で、そのピッチの時系列変化を一定の時間間隔毎の周波数で表した数値列で示している。 As shown in FIG. 3(b), the prosodic information of the fixed part is stored in association with the template ID, one entry per alphabetic phoneme symbol. Each entry gives the duration of the phoneme in milliseconds and the time-series change of its pitch as a sequence of frequency values sampled at fixed time intervals.

テンプレート記憶部１５１に記憶されているデータ構成は、図３の構成に限定されるものではない。図４は、本発明の実施の形態１に係る音声合成装置１のテンプレート記憶部１５１に記憶されているデータ構成の他の例を示す図である。図４（ａ）に示すように、アクセント情報は、それぞれの音素に「高」又は「低」の情報を付与したデータ構成であっても良い。また、図４（ｂ）に示すように、固定部分の韻律情報は、各音韻の中心のピッチの値を代表値として記憶しておいても良い。 The data configuration stored in the template storage unit 151 is not limited to the configuration shown in FIG. FIG. 4 is a diagram showing another example of the data configuration stored in the template storage unit 151 of the speech synthesizer 1 according to Embodiment 1 of the present invention. As shown in FIG. 4A, the accent information may have a data configuration in which “high” or “low” information is given to each phoneme. Further, as shown in FIG. 4B, the fixed portion prosody information may store a pitch value at the center of each phoneme as a representative value.

演算処理部１１は、可変部分の始端以降のテキストデータ、すなわち入力された可変部分のテキストデータ及びテンプレートデータの固定部分のテキストデータとを接続したテキストデータについて単語辞書１５２を参照して読み、アクセント、及び韻律情報（音韻長とピッチパタン）を生成する（ステップＳ２０３）。単語辞書１５２は単語単位で読み及びアクセントを記憶してあり、可変部分の始端以降のテキストデータの読み、アクセント、及び韻律情報を単語単位で生成する処理は通常のＴＴＳと同様の処理となる。例えばかな漢字表記のテキストデータに対して形態素解析処理及びアクセント付与処理を実行し、それに基づいて各音韻の音韻長及びピッチパターンを韻律生成処理によって生成する。 The arithmetic processing unit 11 refers to the word dictionary 152 and generates the reading, accent, and prosodic information (phoneme durations and pitch pattern) for the text data from the start of the variable part onward, that is, the text data obtained by joining the input text data of the variable part to the text data of the fixed part of the template data (step S203). The word dictionary 152 stores readings and accents word by word, and generating the reading, accent, and prosodic information for the text from the start of the variable part onward is the same processing as in ordinary TTS. For example, morphological analysis and accent assignment are performed on text data written in kana and kanji, and the duration and pitch pattern of each phoneme are then produced by prosody generation.

図５は、可変部分に含まれるテキストの相違によりアクセントの連続性が変化する状態を示す図である。図５では、テンプレートデータの初期固定部分のピッチパタンは、初期設定された固定部分５２（実線）で示すようになっているものとする。図５（ａ）は、初期設定された可変部分５１に「東京」が入力された場合のアクセントの変化を示す図である。「東京」の語尾は「高い」アクセントを有することから、初期設定された可変部分５１と初期設定された固定部分５２との境界にてアクセントのずれが生じない。したがって、違和感の無い合成音声を生成することができる。 FIG. 5 is a diagram illustrating a state in which the continuity of accents changes due to a difference in text included in the variable part. In FIG. 5, it is assumed that the pitch pattern of the initial fixed portion of the template data is as indicated by the initial fixed portion 52 (solid line). FIG. 5A is a diagram illustrating a change in accent when “Tokyo” is input to the initially set variable portion 51. Since the ending of “Tokyo” has a “high” accent, there is no accent shift at the boundary between the initial variable portion 51 and the initial fixed portion 52. Therefore, it is possible to generate a synthesized voice without a sense of incongruity.

図５（ｂ）は、初期設定された可変部分５１に「山形」が入力された場合のアクセントの変化を示す図である。「山形」の語尾は「低い」アクセントであるため、本来の自然な発声では、破線で示すように、初期設定された固定部分の最初の部分のアクセントは「低」とならなければならない。しかし、テンプレートデータ上で初期設定された固定部分の最初の部分のアクセントは「高」であるため、初期設定された可変部分５１と初期設定された固定部分５２との境界にてアクセントのずれが生じている。すなわち、本来は破線部５３のようにアクセントが変動すべきところ、初期設定された可変部分５１と初期設定された固定部分５２との境界にて実線のピッチパタンを接続するため、アクセントの高低の違いから周波数の乖離が生じており、このままでは合成音声にて違和感が生じる。 FIG. 5(b) shows the change in accent when "山形" (Yamagata) is input to the initially set variable part 51. Because "Yamagata" ends on a "low" accent, in natural speech the first portion of the initially set fixed part must also be "low", as the broken line shows. In the template data, however, the first portion of the initially set fixed part is "high", so an accent mismatch arises at the boundary between the initially set variable part 51 and the initially set fixed part 52. In other words, although the accent should follow the broken line 53, the solid-line pitch pattern is connected at that boundary, and the difference in accent level produces a gap in frequency. Left as is, the synthesized speech would sound unnatural.

そこで演算処理部１１は、初期設定された可変部分を可変部分として設定し（ステップＳ２０４）、可変部分とその他の残りの固定部分との境界において、ステップＳ２０２でテンプレート記憶部１５１から取得したアクセントと、ステップＳ２０３で新たに生成したアクセントの高低が一致しているか否かを判断する（ステップＳ２０５）。演算処理部１１が、両者のアクセントが一致していないと判断した場合（ステップＳ２０５：ＮＯ）、演算処理部１１は、可変部分をさらに拡張できるか否かを判断し（ステップＳ２０６）、演算処理部１１が拡張可能であると判断した場合（ステップＳ２０６：ＹＥＳ）、演算処理部１１は、可変部分を拡張して（ステップＳ２０７）、ステップＳ２０５へ処理を戻す。 Therefore, the arithmetic processing unit 11 sets the initially set variable part as the variable part (step S204), and the accent acquired from the template storage unit 151 in step S202 at the boundary between the variable part and the remaining fixed part. Then, it is determined whether or not the accent levels newly generated in step S203 match (step S205). When the arithmetic processing unit 11 determines that the two accents do not match (step S205: NO), the arithmetic processing unit 11 determines whether or not the variable part can be further expanded (step S206). When it is determined that the unit 11 can be expanded (step S206: YES), the arithmetic processing unit 11 expands the variable part (step S207) and returns the process to step S205.

可変部分の拡張は、例えば次のようにすれば良い。ステップＳ２０３で新たに生成したピッチとステップＳ２０２で抽出したテンプレートデータのピッチとの差を、現在の可変部分と固定部分との境界部分から１音節ずつ右側へシフトしつつ算出し、算出したピッチ差が閾値以下になった位置を新たに拡張した可変部分と固定部分との境界とすれば良い。 For example, the variable portion may be expanded as follows. The difference between the pitch newly generated in step S203 and the pitch of the template data extracted in step S202 is calculated while shifting to the right by one syllable from the boundary portion between the current variable portion and the fixed portion, and the calculated pitch difference What is necessary is just to make the position which became below the threshold value into the boundary of the newly expanded variable part and fixed part.
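
The extension rule in the preceding paragraph can be sketched as a simple left-to-right scan. The sketch below assumes that each syllable is represented by a single pitch value in Hz and that both lists are indexed from the first syllable of the initially set fixed part; the function name `extend_boundary` and the return convention are hypothetical.

```python
from typing import Optional, Sequence


def extend_boundary(generated_hz: Sequence[float],
                    template_hz: Sequence[float],
                    threshold_hz: float,
                    start: int = 0) -> Optional[int]:
    """Shift the boundary right one syllable at a time from `start`.

    Returns the index of the first syllable at which the newly generated
    pitch and the template pitch differ by no more than `threshold_hz`;
    syllables before that index are absorbed into the variable part.
    Returns None when no such syllable exists.
    """
    limit = min(len(generated_hz), len(template_hz))
    for pos in range(start, limit):
        if abs(generated_hz[pos] - template_hz[pos]) <= threshold_hz:
            return pos
    return None


if __name__ == "__main__":
    generated = [120.0, 140.0, 180.0, 200.0]  # illustrative TTS pitch values
    template = [200.0, 190.0, 185.0, 200.0]   # illustrative template pitch values
    print(extend_boundary(generated, template, threshold_hz=10.0))
```

A `None` result corresponds to the case where the variable part cannot be extended any further, which the flowchart handles separately.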

演算処理部１１が、両者のアクセントが一致していると判断した場合（ステップＳ２０５：ＹＥＳ）、演算処理部１１は、違和感の無い合成音声を生成することができるものと判断して、拡張した可変部分及びその他の残りの固定部分についてそれぞれ、ステップＳ２０３で新たに生成した韻律情報とステップＳ２０２で抽出したテンプレートデータの韻律情報を接続する（ステップＳ２０８）。演算処理部１１は、接続された韻律情報に基づいて音声データを生成し（ステップＳ２０９）、音声出力部１６から音声出力する（ステップＳ２１０）。音声データの生成には、ＴＴＳシステムで従来から使われている、波形編集方式、波形接続方式などを用いれば良い。 If the arithmetic processing unit 11 determines that the two accents match (step S205: YES), the arithmetic processing unit 11 determines that the synthesized speech can be generated without any sense of incongruity and has expanded. The prosodic information newly generated in step S203 and the prosodic information of the template data extracted in step S202 are connected to each of the variable part and other remaining fixed parts (step S208). The arithmetic processing unit 11 generates audio data based on the connected prosodic information (step S209), and outputs the audio from the audio output unit 16 (step S210). For the generation of audio data, a waveform editing method, a waveform connection method, or the like conventionally used in the TTS system may be used.

一方、演算処理部１１が、これ以上可変部が拡張できないと判断した場合（ステップＳ２０６：ＮＯ）、演算処理部１１は、テンプレートデータの固定部分と違和感なくアクセントを接続することができないものと判断し、初期設定された可変部分の始端以降のすべてのテキストをＴＴＳで生成するために、初期設定された可変部分の始端以降のすべてのテキストを可変部分と設定し（ステップＳ２１１）、処理をステップＳ２０８へ移行する。 On the other hand, if the arithmetic processing unit 11 determines that the variable unit cannot be expanded any more (step S206: NO), the arithmetic processing unit 11 determines that the accent cannot be connected to the fixed portion of the template data without a sense of incongruity. Then, in order to generate all text after the start of the initially set variable part with TTS, all text after the start of the initially set variable part is set as a variable part (step S211), and the process is stepped. The process proceeds to S208.
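
The control flow of steps S204 to S211 can be summarized in a few lines. In this sketch the accent comparison (S205) and the extension step (S206/S207) are passed in as callables, and positions are plain integers; these interfaces and the name `choose_variable_end` are assumptions for illustration, not the patented implementation.

```python
from typing import Callable, Optional


def choose_variable_end(initial_end: int,
                        text_end: int,
                        accents_match: Callable[[int], bool],
                        extend: Callable[[int], Optional[int]]) -> int:
    """Return the position at which the variable part should end."""
    end = initial_end                    # S204: start from the initial variable part
    while not accents_match(end):        # S205: compare accents at the boundary
        nxt = extend(end)                # S206: can the variable part be extended?
        if nxt is None or nxt <= end:
            return text_end              # S211: synthesize everything by TTS
        end = nxt                        # S207: extend and check again
    return end                           # continue with S208


if __name__ == "__main__":
    match = lambda pos: pos >= 6                       # hypothetical match test
    step = lambda pos: pos + 1 if pos < 9 else None    # hypothetical extension
    print(choose_variable_end(4, 9, match, step))
```

Whatever position is returned, the prosody generated by TTS is used up to that position and the template prosody after it.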

ここで、ステップＳ２０８において韻律をより滑らかにするために、可変部分及び固定部分の韻律情報を修正しても良い。この場合、演算処理部１１は、例えば可変部分の終端の周波数と、固定部分の始端の周波数が一致するように、固定部分のピッチを変更する倍率を特定し、次のポーズ区切りまでの時間に比例した係数をピッチ変更倍率とともに固定部分のピッチに乗算する。これにより、より連続性を担保した状態で合成音声を生成することができ、違和感の無い音声出力を行うことが可能となる。 To make the prosody smoother in step S208, the prosodic information of the variable part and the fixed part may be modified. In this case the arithmetic processing unit 11, for example, determines a scaling factor for the pitch of the fixed part such that the frequency at the end of the variable part matches the frequency at the start of the fixed part, and multiplies the pitch of the fixed part by that scaling factor together with a coefficient proportional to the time remaining until the next pause. Synthesized speech can thus be generated with better continuity, and speech can be output without sounding unnatural.
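
The text does not spell out how the scaling factor and the time-proportional coefficient are combined, so the following is only one possible reading: the correction is applied in full at the start of the fixed part and fades linearly to nothing at the next pause. The frame-based pitch list, the linear fade, and the name `smooth_fixed_pitch` are all assumptions of this sketch.

```python
from typing import List, Sequence


def smooth_fixed_pitch(fixed_hz: Sequence[float],
                       variable_end_hz: float,
                       frame_ms: float,
                       pause_ms: float) -> List[float]:
    """Scale the fixed-part pitch so that it starts at `variable_end_hz`.

    fixed_hz : pitch of the fixed part, one value per frame of `frame_ms`
    pause_ms : time from the start of the fixed part to the next pause
    """
    if not fixed_hz or fixed_hz[0] <= 0 or pause_ms <= 0:
        raise ValueError("need a positive starting pitch and pause time")
    ratio = variable_end_hz / fixed_hz[0]        # pitch-change factor at the boundary
    smoothed = []
    for i, hz in enumerate(fixed_hz):
        remaining = max(0.0, 1.0 - (i * frame_ms) / pause_ms)  # 1 at boundary, 0 at pause
        smoothed.append(hz * (1.0 + (ratio - 1.0) * remaining))
    return smoothed


if __name__ == "__main__":
    print(smooth_fixed_pitch([200.0] * 5, 150.0, frame_ms=10.0, pause_ms=40.0))
```

With this reading the first frame of the fixed part lands exactly on the final frequency of the variable part, and the template pitch is left untouched from the pause onward.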

以上のように本実施の形態１によれば、初期設定された可変部分の終端と初期設定された固定部分の終端との境界において、テンプレートデータのアクセントと、初期設定された可変部分の始端以降のテキストデータに対して新たに生成されたアクセントとが一致していない場合、初期設定された可変部分を拡張することにより、可変部分の始端以降の新たに生成されたアクセントとテンプレートの固定部分のアクセントとを連続的に接続することができ、肉声に近い自然なアクセント・韻律を有する合成音声を生成することが可能となる。 As described above, according to Embodiment 1, when the accent of the template data and the accent newly generated for the text data from the start of the initially set variable part onward do not match at the boundary between the initially set variable part and the initially set fixed part, the initially set variable part is extended. The newly generated accent can then be connected continuously to the accent of the fixed part of the template, so that synthesized speech with natural accent and prosody close to a human voice can be generated.

（実施の形態２） (Embodiment 2)
以下、本発明の実施の形態２に係る音声合成装置１につき図面を参照しながら説明する。実施の形態２に係る音声合成装置１を具現化するコンピュータの構成は、実施の形態１と同様であることから、同一の符号を付することにより詳細な説明を省略する。 Hereinafter, the speech synthesizer 1 according to Embodiment 2 of the present invention will be described with reference to the drawings. Since the configuration of the computer that embodies the speech synthesizer 1 according to Embodiment 2 is the same as in Embodiment 1, the same reference numerals are used and a detailed description is omitted.

本実施の形態２に係る音声合成装置１は、可変部分として事前に設定されている範囲を拡張するか否かを判断し、拡張すると判断した場合、テンプレートデータに事前に指定されている、拡張後の可変部分とそれにともなって縮小される固定部分との境界となり得る可変部分拡張候補位置まで、初期設定された可変部分を拡張する点に特徴を有している。 The speech synthesizer 1 according to Embodiment 2 determines whether to extend the range set in advance as the variable part and, when it determines to extend, extends the initially set variable part to a variable-part extension candidate position. These candidate positions are specified in advance in the template data and mark where the boundary between the extended variable part and the correspondingly reduced fixed part may fall.

図６は、本発明の実施の形態２に係る音声合成装置１のテンプレート記憶部１５１に記憶されているデータ構成の一例を示す図である。本実施の形態２では、図６に示すように、拡張後の可変部分の終端となり得る可変部分拡張候補位置を、初期設定された可変部分の終端に近い位置から昇順に付与した音節番号列として付与している。ここでは、可変部分拡張候補位置は、音節番号で示される音節の右側の境界で表されており、１音節目の「ノ」の右境界、５音節目の「ワ」の右境界が、可変部分拡張候補位置となる（「’」、「＿」はそれぞれ、アクセントが高から低に変わる場所を表す記号とアクセント句境界を表す記号で、音節数には含めない）。 FIG. 6 shows an example of the data configuration stored in the template storage unit 151 of the speech synthesizer 1 according to Embodiment 2 of the present invention. In Embodiment 2, as shown in FIG. 6, the variable-part extension candidate positions, each of which can become the end of the extended variable part, are stored as a sequence of syllable numbers in ascending order, starting from the position closest to the end of the initially set variable part. Each candidate position is the right-hand boundary of the syllable with the given number: here the right boundary of the first syllable "ノ" (no) and the right boundary of the fifth syllable "ワ" (wa) are candidate positions. (The symbols "'" and "_" mark, respectively, where the accent changes from high to low and where an accent phrase boundary lies; they are not counted as syllables.)

図７は、本発明の実施の形態２に係る音声合成装置１の演算処理部１１の音声合成処理の手順を示すフローチャートである。本実施の形態２に係る音声合成装置１は、可変部分として初期設定されている範囲を拡張するか否かを判断し、拡張すると判断した場合、テンプレートデータに設定されている可変部分拡張候補位置へと順次可変部分を拡張し、アクセントの連続性を担保可能な拡張候補位置まで可変部分を拡張する点に特徴を有する。 FIG. 7 is a flowchart showing the procedure of the speech synthesis process performed by the arithmetic processing unit 11 of the speech synthesizer 1 according to Embodiment 2 of the present invention. The speech synthesizer 1 according to Embodiment 2 determines whether to extend the range initially set as the variable part and, when it determines to extend, extends the variable part step by step through the variable-part extension candidate positions set in the template data until it reaches a candidate position at which accent continuity can be ensured.

音声合成装置１の演算処理部１１は、読上げ対象となるテキストデータ及びテンプレートデータを識別する識別情報、例えばテンプレートＩＤを取得する（ステップＳ７０１）。ここで、読上げ対象となるテキストデータは、可変部分のみのテキストであっても良いし、可変部分と固定部分とが識別できることが可能であることを条件として両方の部分のテキストで構成されていても良い。また、テンプレートデータを識別する識別情報の替わりに、テンプレートデータの読み、アクセント、及び韻律情報を直接取得する構成であっても良い。読上げ対象となるテキストデータ及びテンプレートデータを識別する識別情報の取得方法は、特に限定されるものではなく、ユーザによる入力であっても良いし、合成音声を出力するアプリケーションからデータとして取得するものであっても良い。 The arithmetic processing unit 11 of the speech synthesizer 1 acquires identification information for identifying text data and template data to be read out, for example, a template ID (step S701). Here, the text data to be read out may be text of only the variable part, and is composed of text of both parts on condition that the variable part and the fixed part can be identified. Also good. Further, instead of the identification information for identifying the template data, a configuration in which the reading of the template data, the accent, and the prosody information are directly acquired may be used. The method of acquiring identification information for identifying text data and template data to be read out is not particularly limited, and may be input by a user or acquired as data from an application that outputs synthesized speech. There may be.

演算処理部１１は、取得したテンプレートＩＤに基づいてテンプレート記憶部１５１を照会して、対応するテンプレートデータの読み、アクセント、韻律情報及び記憶されている拡張候補位置を記した音節番号列を抽出する（ステップＳ７０２）。演算処理部１１は、可変部分の始端以降のテキストデータについて単語辞書１５２を参照して読み、アクセント、及び韻律情報を生成する（ステップＳ７０３）。ただし、ステップＳ７０１で、テンプレートデータを識別する識別情報を取得する替わりに、直接、テンプレートデータの読み、アクセント、及び韻律情報を取得する場合は、テンプレート記憶部１５１を照会せずに、ステップＳ７０１で取得した情報をそのまま利用する。 The arithmetic processing unit 11 inquires of the template storage unit 151 based on the acquired template ID, and extracts the syllable number string describing the reading of the corresponding template data, the accent, the prosodic information, and the stored extended candidate position. (Step S702). The arithmetic processing unit 11 reads the text data after the beginning of the variable part with reference to the word dictionary 152, and generates accent and prosody information (step S703). However, instead of acquiring the identification information for identifying the template data in step S701, if the template data reading, accent, and prosody information are to be acquired directly, the template storage unit 151 is not queried, and step S701 is performed. Use the acquired information as it is.

演算処理部１１は、初期設定された可変部分を拡張可変部分候補として設定し（ステップＳ７０４）、拡張可変部分候補とその他の残りの固定部分との境界近傍において、新たに生成したアクセントと、テンプレートデータから抽出したアクセントとが一致しているか否かを判断する（ステップＳ７０５）。演算処理部１１が、両者のアクセントが一致していないと判断した場合（ステップＳ７０５：ＮＯ）、演算処理部１１は、次の拡張候補位置が存在するか否かを判断する（ステップＳ７０６）。 The arithmetic processing unit 11 sets the initially set variable part as an extended variable part candidate (step S704), and the newly generated accent and template in the vicinity of the boundary between the extended variable part candidate and other remaining fixed parts. It is determined whether or not the accent extracted from the data matches (step S705). When the arithmetic processing unit 11 determines that the two accents do not match (step S705: NO), the arithmetic processing unit 11 determines whether or not the next extension candidate position exists (step S706).

演算処理部１１が、次の拡張候補位置が存在すると判断した場合（ステップＳ７０６：ＹＥＳ）、演算処理部１１は、次の拡張候補位置までを拡張可変部分候補として（ステップＳ７０７）、処理をステップＳ７０５へ戻し、拡張した可変部分候補に基づいて上述した処理を繰り返す。演算処理部１１が、次の拡張候補位置が存在しないと判断した場合（ステップＳ７０６：ＮＯ）、演算処理部１１は、アクセントを滑らかにすることができなかったものと判断し、初期設定された可変部分の始端以降のテキスト全てを可変部分として設定する（ステップＳ７０９）。 If the arithmetic processing unit 11 determines that the next extension candidate position exists (step S706: YES), the arithmetic processing unit 11 sets the next extension candidate position as an extension variable part candidate (step S707), and performs the process step. Returning to S705, the above-described processing is repeated based on the expanded variable part candidate. When the arithmetic processing unit 11 determines that there is no next extension candidate position (step S706: NO), the arithmetic processing unit 11 determines that the accent could not be smoothed and was initially set. All the text after the beginning of the variable part is set as the variable part (step S709).

演算処理部１１が、拡張可変部分候補とその他の残りの固定部分との境界近傍において、新たに生成したアクセントと、テンプレートデータから抽出したアクセントとが一致していると判断した場合（ステップＳ７０５：ＹＥＳ）、演算処理部１１は、違和感の無い合成音声を生成することができるものと判断して、現在の拡張可変部分候補まで可変部分を拡張し（ステップＳ７０８）、拡張した可変部分とその他の残りの固定部分についてそれぞれ、ステップＳ７０３で新たに生成した韻律情報とステップＳ７０２で抽出したテンプレートデータの韻律情報とを接続する（ステップＳ７１０）。演算処理部１１は、接続された韻律情報に基づいて音声データを生成し（ステップＳ７１１）、音声出力部１６から音声出力する（ステップＳ７１２）。 When the arithmetic processing unit 11 determines that the newly generated accent matches the accent extracted from the template data in the vicinity of the boundary between the extended variable part candidate and the other remaining fixed part (step S705: YES), the arithmetic processing unit 11 determines that it is possible to generate a synthesized speech without a sense of incongruity, extends the variable part to the current extension variable part candidate (step S708), and expands the variable part and other For the remaining fixed parts, the prosodic information newly generated in step S703 and the prosodic information of the template data extracted in step S702 are connected (step S710). The arithmetic processing unit 11 generates audio data based on the connected prosodic information (step S711), and outputs the audio from the audio output unit 16 (step S712).
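
The candidate-driven loop of steps S704 to S709 can be sketched as follows. Candidate positions are given as syllable numbers counted from the start of the initially set fixed part, with 0 standing for the initial boundary; the accent comparison is passed in as a callable. The name `select_boundary` and this interface are assumptions for illustration.

```python
from typing import Callable, Optional, Sequence


def select_boundary(candidates: Sequence[int],
                    accents_match: Callable[[int], bool]) -> Optional[int]:
    """Return the first boundary position at which the accents match.

    The initial boundary (0) is tried first (S704), then each stored
    candidate in ascending order (S706/S707). None means that no
    candidate matched, so the whole text from the start of the variable
    part onward is treated as the variable part (S709).
    """
    for position in [0, *candidates]:
        if accents_match(position):      # S705
            return position              # S708: extend up to this position
    return None


if __name__ == "__main__":
    # Candidates after the 1st and 5th syllables, as in the FIG. 6 description;
    # the match test below is a hypothetical stand-in.
    print(select_boundary([1, 5], lambda pos: pos == 5))
```

Because the candidates are stored in ascending order, the first match is also the smallest extension that satisfies the check.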

ここで、ステップＳ７０５におけるアクセント一致の判断は、例えば以下のようにすればよい。拡張可変部分候補とその他の残りの固定部分の境界の両側の音節において、ステップＳ７０３で新たに生成したアクセントと、ステップＳ７０２で抽出したテンプレートデータのアクセントの高低がそれぞれ一致する場合には「一致」していると判断し、１つでも異なる場合には「一致しない」と判断する。または、境界の両側の音節において、ステップＳ７０３で生成した韻律情報（ピッチデータの値）と、Ｓ７０２で抽出したテンプレートデータの韻律情報（ピッチデータの値）との差が、閾値以下に収まっている場合に「一致」していると判断し、閾値を超えている場合には「一致しない」と判断する。ここでピッチの値の差の計算方法は、代表的な時刻のピッチの差を用いても良いし、音節内における一定間隔で記述されているピッチの値の差の平均的な値を用いても良い。 The accent match in step S705 may be determined, for example, as follows. For the syllables on both sides of the boundary between the extended variable part candidate and the remaining fixed part, the accents are judged to "match" if the accent level newly generated in step S703 and the accent level of the template data extracted in step S702 agree for each syllable, and "not to match" if they differ for even one syllable. Alternatively, for the syllables on both sides of the boundary, the accents are judged to "match" if the difference between the prosodic information (pitch values) generated in step S703 and the prosodic information (pitch values) of the template data extracted in step S702 is at or below a threshold, and "not to match" if it exceeds the threshold. The pitch difference may be computed from the pitch at a representative time, or as the average of the differences between the pitch values recorded at fixed intervals within the syllable.
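
Both matching criteria can be written as small predicates. The sketch assumes one 'H'/'L' label or one representative pitch value per syllable, with `boundary` being the index of the first syllable to the right of the boundary; the function names are hypothetical.

```python
from typing import Sequence


def match_by_level(generated: Sequence[str], template: Sequence[str],
                   boundary: int) -> bool:
    """True if the high/low labels agree on both sides of the boundary."""
    if not 1 <= boundary < min(len(generated), len(template)):
        raise ValueError("boundary must have a syllable on each side")
    return all(generated[i] == template[i] for i in (boundary - 1, boundary))


def match_by_pitch(generated_hz: Sequence[float], template_hz: Sequence[float],
                   boundary: int, threshold_hz: float) -> bool:
    """True if the pitch difference is within the threshold on both sides."""
    if not 1 <= boundary < min(len(generated_hz), len(template_hz)):
        raise ValueError("boundary must have a syllable on each side")
    return all(abs(generated_hz[i] - template_hz[i]) <= threshold_hz
               for i in (boundary - 1, boundary))


if __name__ == "__main__":
    generated = ["L", "H", "L", "L", "H"]   # illustrative labels only
    template = ["L", "H", "H", "H", "H"]
    print(match_by_level(generated, template, boundary=4))
```

Replacing the single representative value with the mean of several samples per syllable would give the averaged variant mentioned above without changing the comparison itself.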

以上のように本実施の形態２によれば、可変部分からテンプレートデータの固定部分へとアクセントの高低が確実に一致する可変部分拡張候補位置であって、拡張範囲を最小限に止めた可変部分拡張候補位置まで可変部分を拡張することができる。拡張された可変部分はテキスト音声合成で、残りの固定部分はテンプレートデータとして予め用意されている韻律情報に基づいた合成音声で音声が生成されるため、可変部分と固定部分とでアクセントの連続性が担保され、かつ文章全般に亘って肉声に近い自然なアクセント・韻律を有する合成音声を生成することが可能となる。 As described above, according to the second embodiment, the variable part extension candidate position in which the heights of the accents are surely matched from the variable part to the fixed part of the template data, and the variable part in which the extension range is minimized The variable part can be expanded to the expansion candidate position. The expanded variable part is text-to-speech synthesis, and the remaining fixed part is generated with synthesized speech based on prosodic information prepared in advance as template data, so the continuity of accents between the variable part and the fixed part Can be generated, and it is possible to generate a synthesized speech having natural accents and prosody close to the real voice over the whole sentence.

（実施の形態３） (Embodiment 3)
以下、本発明の実施の形態３に係る音声合成装置１につき図面を参照しながら説明する。実施の形態３に係る音声合成装置１を具現化するコンピュータの構成は、実施の形態１と同様であることから、同一の符号を付することにより詳細な説明を省略する。 Hereinafter, the speech synthesizer 1 according to Embodiment 3 of the present invention will be described with reference to the drawings. Since the configuration of the computer that embodies the speech synthesizer 1 according to Embodiment 3 is the same as in Embodiment 1, the same reference numerals are used and a detailed description is omitted.

本実施の形態３に係る音声合成装置１は、初期設定された可変部分の範囲を拡張するか否かを判断し、拡張すると判断した場合、可変部分の始端以降のテキストデータについて新たに生成したアクセントの各アクセント句に対して、アクセント句の終端近傍にて、予め用意してあるテンプレートデータのアクセントと、新たに生成したアクセントとが一致するアクセント句であり、初期設定された可変部分に最も近接したアクセント句の終端まで、事前に設定されていた可変部分を拡張し、拡張された可変部分はテキスト音声合成で、残りの固定部分は予め用意されているテンプレートデータの韻律情報に基づいた合成音声で音声を生成する点に特徴を有する。 The speech synthesizer 1 according to Embodiment 3 determines whether to extend the range of the initially set variable part. When it determines to extend, it examines each accent phrase of the accent newly generated for the text data from the start of the variable part onward, finds the accent phrase closest to the initially set variable part whose newly generated accent matches the accent of the prepared template data near the end of that accent phrase, and extends the variable part to the end of that accent phrase. The extended variable part is then produced by text-to-speech synthesis, and the remaining fixed part by synthesized speech based on the prosodic information of the prepared template data.

図８は、アクセントの高低及びアクセント句の一例を示す図である。図８（ａ）は、初期設定された可変部分として「ヤマガタ（山形）」を挿入した場合に、可変部分の始端以降の全てのテキストデータ（山形の天気をお知らせします）に対して、形態素解析とアクセント付与を行った結果を、アクセントの高低及びアクセント句境界で示している。図８（ｂ）は、テンプレートデータとして用意されている初期設定された固定部分「ノテンキハハレデス（の天気は晴れです）」のアクセントの高低及びアクセント句境界を示している（ここでは、可変部分に「東京」を挿入したテンプレートデータが使われているものとする）。図８（ａ）及び（ｂ）では、アクセント句境界は一点鎖線Ａ、Ｂで、初期設定された可変部分と初期設定された固定部分との境界は実線Ｙで示している。 FIG. 8 shows an example of accent levels and accent phrases. FIG. 8(a) shows, as accent levels and accent phrase boundaries, the result of morphological analysis and accent assignment on all text data from the start of the variable part onward (山形の天気をお知らせします, "Here is the weather for Yamagata") when "ヤマガタ" (Yamagata) is inserted as the initially set variable part. FIG. 8(b) shows the accent levels and accent phrase boundaries of the initially set fixed part "ノテンキハハレデス" (no tenki wa hare desu, "the weather is sunny") prepared as template data (here the template data is assumed to have been created with "東京" (Tokyo) in the variable part). In FIGS. 8(a) and 8(b), the accent phrase boundaries are drawn as dash-dot lines A and B, and the boundary between the initially set variable part and the initially set fixed part as the solid line Y.

音声合成装置１の演算処理部１１は、初期設定された可変部分「ヤマガタ（山形）」を含んだ状態で形態素解析とアクセント付与を新規に行い、初期設定された可変部分と初期設定された固定部分との境界Ｙにて、境界Ｙの両側のアクセントの高低が、新規に生成されたアクセント（図８（ａ））とテンプレートデータのアクセント（図８（ｂ））との間で一致しているか否かを判断する。この場合、図８（ａ）で示される新規に生成されたアクセントによると、初期設定された可変部分と初期設定された固定部分との境界Ｙの両隣の音節のアクセントは「タ」が低、「ノ」が高となっており、一方、図８（ｂ）で示されるテンプレートデータの初期設定された可変部分と初期設定された固定部分の境界Ｙの両隣の音節のアクセントは「ウ」が高、「ノ」が高となっており、生成されたアクセントとテンプレートデータのアクセントとでは、境界Ｙ両側のアクセントの高低が相違している。 The arithmetic processing unit 11 of the speech synthesizer 1 newly performs morphological analysis and accent assignment on the text including the initially set variable part "ヤマガタ" (Yamagata), and determines whether, at the boundary Y between the initially set variable part and the initially set fixed part, the accent levels on both sides of Y agree between the newly generated accent (FIG. 8(a)) and the accent of the template data (FIG. 8(b)). In the newly generated accent of FIG. 8(a), the syllables on either side of boundary Y are "タ" (ta), which is low, and "ノ" (no), which is high. In the template data of FIG. 8(b), the syllables on either side of boundary Y are "ウ" (u), which is high, and "ノ" (no), which is high. The accent levels on the two sides of boundary Y therefore differ between the generated accent and the template accent.

したがって、演算処理部１１は、次のアクセント句境界である境界Ａにて、判断を継続する。まず、演算処理部１１は、境界Ａがテンプレートデータでもアクセント句境界の位置となっているか否かを判断する。図８（ｂ）のテンプレートデータでは境界Ａはアクセント句境界ではないことから、演算処理部１１は、次のアクセント句境界である境界Ｂにて判断を継続する。 Therefore, the arithmetic processing unit 11 continues the determination at the boundary A that is the next accent phrase boundary. First, the arithmetic processing unit 11 determines whether the boundary A is the position of the accent phrase boundary even in the template data. Since the boundary A is not an accent phrase boundary in the template data of FIG. 8B, the arithmetic processing unit 11 continues the determination at the boundary B that is the next accent phrase boundary.

演算処理部１１は、前回と同様に、アクセント句境界Ｂがテンプレートデータでもアクセント句境界の位置となっているか否かを判断する。図８（ｂ）の境界Ｂはテンプレートデータのアクセント句境界の位置と一致していることから、続いて、次にアクセント句境界Ｂの両側において、図８（ａ）と図８（ｂ）とのアクセントの高低が一致しているか否かを判断する。アクセント句境界Ｂの両側において、図８（ａ）では、「ワ」、「ハ」のアクセントは「低低」となっており、図８（ｂ）では、「ワ」、「ハ」のアクセントは「低低」となっているため、図８（ａ）と図８（ｂ）において、アクセント句境界Ｂの両側のアクセントの高低が一致している。したがって、演算処理部１１は、アクセント句境界Ｂまで可変部分を拡張する。 The arithmetic processing unit 11 determines whether the accent phrase boundary B is the position of the accent phrase boundary even in the template data, as in the previous time. Since the boundary B in FIG. 8 (b) coincides with the position of the accent phrase boundary of the template data, next, on both sides of the accent phrase boundary B, FIG. 8 (a) and FIG. 8 (b) It is determined whether or not the accent levels of the two match. On both sides of the accent phrase boundary B, the accent of “wa” and “ha” is “low” in FIG. 8A, and the accent of “wa” and “ha” in FIG. 8B. Is “low”, the accent heights on both sides of the accent phrase boundary B are the same in FIGS. 8A and 8B. Therefore, the arithmetic processing unit 11 extends the variable part to the accent phrase boundary B.
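
The walk through boundaries Y, A, and B can be reproduced with a short search over accent phrase boundaries. In the sketch below, only the labels next to Y and B follow the description of FIG. 8; the remaining 'H'/'L' values, the syllable index chosen for boundary A, and the function name are placeholders assumed for illustration.

```python
from typing import Iterable, Optional, Sequence


def extend_to_accent_phrase(generated: Sequence[str], template: Sequence[str],
                            generated_bounds: Iterable[int],
                            template_bounds: Iterable[int],
                            initial: int) -> Optional[int]:
    """Return the boundary index up to which the variable part is extended.

    A boundary index is the index of the first syllable to its right.
    The initial boundary is accepted if the labels on both sides agree;
    otherwise each later accent phrase boundary of the generated accent is
    accepted only if the template also has an accent phrase boundary there
    and the labels on both sides agree. None means no boundary qualifies.
    """
    def levels_match(b: int) -> bool:
        return all(generated[i] == template[i] for i in (b - 1, b))

    if levels_match(initial):
        return initial
    shared = set(template_bounds)
    for b in sorted(x for x in generated_bounds if x > initial):
        if b in shared and levels_match(b):
            return b
    return None


if __name__ == "__main__":
    # Syllables 0-3: variable part; 4-8: ノテンキワ; 9-12: ハレデス.
    generated = list("LHLLHHLLLLHLL")   # with "Yamagata" inserted (partly placeholder)
    template = list("LHHHHHLLLLHLL")    # recorded with "Tokyo" (partly placeholder)
    Y, A, B = 4, 5, 9                   # index of A is an assumption
    print(extend_to_accent_phrase(generated, template, [A, B], [B], Y))
```

With this data the check fails at Y, skips A because the template has no accent phrase boundary there, and succeeds at B.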

図９及び図１０は、本発明の実施の形態３に係る音声合成装置１の演算処理部１１の音声合成処理の手順を示すフローチャートである。音声合成装置１の演算処理部１１は、読上げ対象となるテキストデータ及びテンプレートデータを識別する識別情報、例えばテンプレートＩＤを取得する（ステップＳ９０１）。ここで、読上げ対象となるテキストデータは、可変部分のみのテキストであっても良いし、可変部分と固定部分とを識別することが可能であることを条件として両方の部分のテキストで構成されていても良い。また、テンプレートデータを識別する識別情報の替わりに、テンプレートデータの読み、アクセント、及び韻律情報を直接取得する構成であっても良い。読上げ対象となるテキストデータ及びテンプレートデータを識別する識別情報の取得方法は、特に限定されるものではなく、ユーザによる入力であっても良いし、合成音声を出力するアプリケーションからデータとして取得するものであっても良い。 9 and 10 are flowcharts showing the procedure of the speech synthesis process of the arithmetic processing unit 11 of the speech synthesis apparatus 1 according to Embodiment 3 of the present invention. The arithmetic processing unit 11 of the speech synthesizer 1 acquires identification information for identifying text data and template data to be read out, for example, a template ID (step S901). Here, the text data to be read out may be text of only the variable part, and is composed of text of both parts on condition that the variable part and the fixed part can be identified. May be. Further, instead of the identification information for identifying the template data, a configuration in which the reading of the template data, the accent, and the prosody information are directly acquired may be used. The method of acquiring identification information for identifying text data and template data to be read out is not particularly limited, and may be input by a user or acquired as data from an application that outputs synthesized speech. There may be.

演算処理部１１は、取得したテンプレートＩＤに基づいてテンプレート記憶部１５１を照会して、対応するテンプレートデータの読み、アクセント、及び記憶されているアクセント句境界の位置を抽出する（ステップＳ９０２）。ただし、ステップＳ９０１で、テンプレートデータを識別する識別情報を取得する替わりに、直接、テンプレートデータの読み、アクセント、及び韻律情報を取得する場合は、テンプレート記憶部１５１を照会せずに、ステップＳ９０１で取得した情報をそのまま利用する。演算処理部１１は、初期設定された可変部分の始端以降のテキストデータについて単語辞書１５２を参照して、新規に、読み、アクセント及び韻律情報を生成し（ステップＳ９０３）、アクセント句境界を設定する（ステップＳ９０４）。 The arithmetic processing unit 11 queries the template storage unit 151 using the acquired template ID and extracts the reading, the accent, and the stored accent phrase boundary positions of the corresponding template data (step S902). If the reading, accent, and prosody information of the template data were acquired directly in step S901 instead of identification information, the template storage unit 151 is not queried and the information acquired in step S901 is used as is. For the text data from the beginning of the initially set variable part onward, the arithmetic processing unit 11 refers to the word dictionary 152, newly generates reading, accent, and prosody information (step S903), and sets accent phrase boundaries (step S904).

演算処理部１１は、初期設定された可変部分を拡張可変部分候補として設定し（ステップＳ９０５）、拡張可変部分候補とその他の残りの固定部分の境界の両側の音節において、ステップＳ９０２で抽出したテンプレートデータのアクセントと、ステップＳ９０３で新規に生成したアクセントとが一致しているか否かを判断する（ステップＳ９０６）。演算処理部１１が、境界の近傍において、テンプレートデータのアクセントと、新規に作成したアクセントの高低が一致していないと判断した場合（ステップＳ９０６：ＮＯ）、演算処理部１１は、新たな拡張可変部分候補となる次のアクセント句境界が存在するかどうかを判断する（ステップＳ９０７）。 The arithmetic processing unit 11 sets the initially set variable part as the extended variable part candidate (step S905) and determines, for the syllables on both sides of the boundary between the extended variable part candidate and the remaining fixed part, whether the accent of the template data extracted in step S902 matches the accent newly generated in step S903 (step S906). If the arithmetic processing unit 11 determines that the accent heights of the template data and of the newly generated accent do not match near the boundary (step S906: NO), it determines whether there is a next accent phrase boundary that could define a new extended variable part candidate (step S907).

演算処理部１１が、新たに可変部分を拡張できる次のアクセント句境界が存在すると判断した場合（ステップＳ９０７：ＹＥＳ）、演算処理部１１は、次のアクセント句までを新たな拡張可変部分候補として設定し直し（ステップＳ９０８）、そのアクセント句の右側の境界が、テンプレートデータに保存されているアクセント句境界の位置と一致するか否かを判断する（ステップＳ９０９）。演算処理部１１が、ステップＳ９０８で新たに設定された拡張可変部分候補の右側の境界の位置が、テンプレートデータに保存されているアクセント句境界の位置と一致しないと判断した場合（ステップＳ９０９：ＮＯ）、演算処理部１１は、処理をステップＳ９０７へ戻し、上述した処理を繰り返す。 If the arithmetic processing unit 11 determines that there is a next accent phrase boundary up to which the variable part can be expanded (step S907: YES), it resets the extended variable part candidate so that it extends through the next accent phrase (step S908) and determines whether the right-hand boundary of that accent phrase coincides with an accent phrase boundary position stored in the template data (step S909). If it determines that the right-hand boundary of the candidate newly set in step S908 does not coincide with an accent phrase boundary position stored in the template data (step S909: NO), it returns the process to step S907 and repeats the above processing.

演算処理部１１が、新たに設定された拡張可変部分候補の右側の境界の位置が、テンプレートに保存されているアクセント句境界の位置と一致すると判断した場合（ステップＳ９０９：ＹＥＳ）、演算処理部１１は、処理をステップＳ９０６へ戻す。 When the arithmetic processing unit 11 determines that the position of the right boundary of the newly set extension variable part candidate matches the position of the accent phrase boundary stored in the template (step S909: YES), the arithmetic processing unit 11 returns the process to step S906.

演算処理部１１が、現在設定されている拡張可変部分候補とその他の残りの固定部分との境界の両側において、ステップＳ９０２で抽出したテンプレートデータのアクセントと、ステップＳ９０３で新規に作成したアクセントの高低が一致していると判断した場合（ステップＳ９０６：ＹＥＳ）、現在設定されている拡張可変部分候補まで可変部分を拡張し（ステップＳ９１０）、拡張した可変部分及びその他の残りの固定部分についてそれぞれ、ステップＳ９０３で新たに生成した韻律情報とステップＳ９０２で抽出したテンプレートデータの韻律情報とを接続する（ステップＳ９１１）。演算処理部１１は、接続された韻律情報に基づいて音声データを生成し（ステップＳ９１２）、音声出力部１６から音声出力する（ステップＳ９１３）。なお、ステップＳ９０７において、演算処理部１１が、新たに可変部分を拡張できる次のアクセント句境界が存在しないと判断した場合（ステップＳ９０７：ＮＯ）は、初期設定された可変部分の始端以降の全てのテキストを可変部分に設定し（ステップＳ９１４）、上述したステップＳ９１１以降の処理を行う。 If the arithmetic processing unit 11 determines that, on both sides of the boundary between the currently set extended variable part candidate and the remaining fixed part, the accent heights of the template data extracted in step S902 and of the accent newly generated in step S903 match (step S906: YES), it expands the variable part up to the currently set extended variable part candidate (step S910) and connects the prosody information newly generated in step S903 for the expanded variable part with the prosody information of the template data extracted in step S902 for the remaining fixed part (step S911). The arithmetic processing unit 11 generates speech data from the connected prosody information (step S912) and outputs it from the speech output unit 16 (step S913). If, in step S907, the arithmetic processing unit 11 determines that there is no next accent phrase boundary up to which the variable part can be expanded (step S907: NO), it sets all of the text from the beginning of the initially set variable part onward as the variable part (step S914) and performs the processing from step S911 onward.
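The search loop of steps S905 to S910 and S914 can be sketched as below. This is a minimal sketch under assumptions, not the patented implementation: boundary positions are integers in a coordinate shared by the template and the newly generated analysis (for example, a mora offset from the start of the initially set fixed part), the accent comparison of step S906 is passed in as a callable, and the name `choose_variable_end` and the `None` return value standing for the step S914 fallback are hypothetical.

```python
from typing import Callable, List, Optional, Set


def choose_variable_end(
    initial_end: int,
    generated_boundaries: List[int],
    template_boundaries: Set[int],
    accents_match: Callable[[int], bool],
) -> Optional[int]:
    """Return the position up to which the variable part is expanded.

    Returns None when no usable boundary remains, meaning that all text from
    the start of the variable part onward is treated as variable (S914).
    """
    candidate = initial_end                                   # S905
    later = [b for b in sorted(generated_boundaries) if b > initial_end]
    i = 0
    while True:
        if accents_match(candidate):                          # S906: YES
            return candidate                                  # S910
        while True:
            if i >= len(later):                               # S907: NO
                return None                                   # S914
            candidate = later[i]                              # S908
            i += 1
            if candidate in template_boundaries:              # S909: YES
                break                                         # back to S906


# Hypothetical positions (assumption): the accents only agree at boundary 8.
end = choose_variable_end(0, [3, 5, 8], {5, 8}, lambda pos: pos == 8)
print(end)
```

Boundary 3 is skipped without an accent comparison because it is not a boundary in the template, which corresponds to the S909: NO branch returning to S907.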

以上のように本実施の形態３によれば、音声の専門技術を有する者でない場合には設定することができない可変部分の拡張候補位置を、事前に定めておくことなく、可変部分からテンプレートデータの固定部分へとアクセントの高低が一致する可能性が高いアクセント句境界の終端まで可変部分を拡張することができ、しかも品質の良い固定部分のテンプレートデータに含まれる韻律情報を最大限活用することができることから、音声の専門技術を有していない者であっても肉声に近い自然なアクセント・韻律を有する合成音声を生成することが可能となる。 As described above, according to Embodiment 3, there is no need to define in advance the candidate positions for expanding the variable part, which cannot be set by anyone without expertise in speech. The variable part can be expanded to the end of an accent phrase boundary at which the accent heights are likely to match from the variable part into the fixed part of the template data, and the high-quality prosody information contained in the template data of the fixed part is used to the fullest. Even a person without speech expertise can therefore generate synthesized speech with natural accent and prosody close to a real voice.

（実施の形態４） (Embodiment 4)
以下、本発明の実施の形態４に係る音声合成装置１につき図面を参照しながら説明する。実施の形態４に係る音声合成装置１を具現化するコンピュータの構成は、実施の形態１と同様であることから、同一の符号を付することにより詳細な説明を省略する。本実施の形態４に係る音声合成装置１は、初期設定されている可変部分について、ＴＴＳの対象となる入力された可変部分のアクセント句の数と、テンプレート記憶部１５１に記憶されているテンプレートデータが想定している初期設定された可変部分のアクセント句の数との差異に応じて、初期設定された可変部分を拡張するか否かを判断する点に特徴を有する。
Hereinafter, the speech synthesizer 1 according to Embodiment 4 of the present invention is described with reference to the drawings. The configuration of the computer that embodies the speech synthesizer 1 according to Embodiment 4 is the same as in Embodiment 1, so the same reference numerals are used and a detailed description is omitted. The speech synthesizer 1 according to Embodiment 4 is characterized in that it decides whether to expand the initially set variable part according to the difference between the number of accent phrases in the input variable part subject to TTS and the number of accent phrases in the initially set variable part assumed by the template data stored in the template storage unit 151.

図１１は、本発明の実施の形態４に係る音声合成装置１のテンプレート記憶部１５１に記憶されているデータ構成の一例を示す図である。図１１に示すように、初期固定部分である「ノテンキワハレデス」の読み及びアクセント（アクセント句の位置含む）を記憶しているだけでなく、テンプレート化の際に用いた可変部分「トウキョウ」の読み及びアクセント（アクセント句の位置と数も含む）も記憶している。 FIG. 11 shows an example of the data structure stored in the template storage unit 151 of the speech synthesizer 1 according to Embodiment 4 of the present invention. As shown in FIG. 11, the template stores not only the reading and accent (including the accent phrase positions) of the initial fixed part "ノテンキワハレデス" (no tenki wa hare desu), but also the reading and accent (including the positions and number of accent phrases) of the variable part "トウキョウ" (Tokyo) that was used when the template was created.

例えば初期設定された可変部分に「大阪」が入力される場合、可変部分の始端以降のテキストデータ「大阪の天気は晴れです。」に対して、形態素解析及びアクセント付与処理を行い、初期設定された可変部分のアクセントとして「オオサカ」を得る。ここで、「オオサカ」と「トウキョウ」とのアクセント句の数を比較したとき、アクセント句の数は両方とも‘１’であることから、初期設定された可変部分を拡張しない。 For example, when “Osaka” is input to the default variable part, the text data “Osaka weather is sunny” after the start of the variable part is subjected to morphological analysis and accenting processing, and the initial value is set. "Osaka" is obtained as an accent of the variable part. Here, when the numbers of accent phrases of “Osaka” and “Tokyo” are compared, the number of accent phrases is both “1”, so the initially set variable part is not expanded.

次に、例えば初期設定された可変部分に「大阪府の大阪市」が入力される場合、形態素解析及びアクセント付与処理を行い、初期設定された可変部分のアクセントとして「オオサカ’フノオオサカ’シ」を得る。「オオサカ’フノオオサカ’シ」のアクセント句の数は‘２’であり、テンプレートデータで設定されている初期設定された可変部分のアクセント句の数が‘１’と相違する。したがって、例えばアクセント句の数の判断基準値を‘１’と設定しているときには、アクセント句の数の差が判断基準値‘１’以上であることから、可変部分を拡張する。なお、可変部分を拡張するか否かの判断基準値は、事前に設定しておいても良いし、テンプレートデータに設定しておいても良い。また、拡張の方法は、実施の形態１〜３で説明した方法を用いればよい。 Next, when "大阪府の大阪市" (Osaka City in Osaka Prefecture) is input to the initially set variable part, morphological analysis and accent assignment yield "オオサカ’フノオオサカ’シ" as the accent of the initially set variable part. Its number of accent phrases is 2, which differs from the number of accent phrases, 1, of the initially set variable part in the template data. If, for example, the criterion value for the number of accent phrases is set to 1, the difference in the number of accent phrases is equal to or greater than the criterion value of 1, so the variable part is expanded. The criterion value for deciding whether to expand the variable part may be set in advance or may be stored in the template data. The expansion itself may use any of the methods described in Embodiments 1 to 3.

さらに、アクセント句の数が同一であっても、初期設定された可変部分と固定部分との境界近傍では、生成されたアクセントとテンプレートデータのアクセントとが一致していないときにも、実施の形態１〜３で説明した方法により可変部分を拡張しても良い。 Furthermore, even when the number of accent phrases is the same, the variable part may be expanded by the methods described in Embodiments 1 to 3 if the generated accent and the accent of the template data do not match near the boundary between the initially set variable part and the fixed part.

本実施の形態４によれば、初期設定された可変部分の単語数が、テンプレートデータで想定されている可変部分の単語数と大きく異なる場合であっても、可変部分からテンプレートデータの固定部分へとアクセントの高低の相違による違和感が生じない自然なアクセントを有する合成音声を生成することが可能となる。 According to Embodiment 4, even when the number of words in the initially set variable part differs greatly from the number of words in the variable part assumed by the template data, it is possible to generate synthesized speech with a natural accent, free of the unnaturalness caused by a mismatch in accent height between the variable part and the fixed part of the template data.

なお、比較の対象となるのはアクセント句の数に限定されるものではなく、例えばモーラ数を用い、同じアクセント句の数を有している場合であっても、モーラ数が異なる場合には可変部分を拡張するようにしても良い。 The quantity compared is not limited to the number of accent phrases. For example, the number of morae may be used, so that the variable part is expanded when the number of morae differs even if the number of accent phrases is the same.
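The expansion decision of Embodiment 4 can be summarized in a short sketch. The function name, parameter names, and the use of an absolute difference are assumptions made for illustration. The text only states that the variable part is expanded when the difference in accent phrase count is at or above the criterion value, and that a mora count comparison or a boundary accent mismatch may also trigger expansion.

```python
from typing import Optional


def should_expand(
    input_phrases: int,
    template_phrases: int,
    threshold: int = 1,
    input_morae: Optional[int] = None,
    template_morae: Optional[int] = None,
    boundary_accents_match: bool = True,
) -> bool:
    """Decide whether the initially set variable part should be expanded."""
    # Accent phrase counts differ by at least the criterion value.
    if abs(input_phrases - template_phrases) >= threshold:
        return True
    # Optional variant: same phrase count but different mora count.
    if (input_morae is not None and template_morae is not None
            and input_morae != template_morae):
        return True
    # Optional variant: counts agree but the accents clash at the boundary.
    return not boundary_accents_match


# Template variable part "トウキョウ" has 1 accent phrase.
print(should_expand(1, 1))  # "オオサカ": 1 accent phrase
print(should_expand(2, 1))  # "オオサカ’フノオオサカ’シ": 2 accent phrases
```

When this returns true, the boundary itself would still be chosen by one of the methods of Embodiments 1 to 3.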

以上の実施の形態１乃至４に関し、さらに以下の付記を開示する。 Regarding the above first to fourth embodiments, the following additional notes are disclosed.

（付記１）
可変部分と固定部分とで構成されたテキストデータに対して、可変部分はテキスト音声合成で、固定部分は事前に読み、アクセント、及び韻律情報を記憶してあるテンプレートデータに基づいて合成音声を生成する音声合成装置において、
前記テンプレートデータ及び可変部分のテキストデータを取得する取得手段と、
取得したテンプレートデータから、読み、アクセント、及び韻律情報を抽出する抽出手段と、
取得したテキストデータを可変部分に挿入して、固定部分を含めて合成音声の読み、アクセント、及び韻律情報を生成する生成手段と、
初期設定されている固定部分の始端で、前記テンプレートデータから抽出されたアクセントと生成されたアクセントとが一致しているか否かを判断する判断手段と、
該判断手段で一致していないと判断された場合、抽出されたアクセントと生成されたアクセントとが一致する位置まで可変部分を拡張する可変部分拡張手段と
を備え、拡張された可変部分はテキスト音声合成で、拡張された可変部分を除く固定部分は前記テンプレートデータに基づいて合成音声を生成するようにしてあることを特徴とする音声合成装置。
(Appendix 1)
A speech synthesizer that generates synthesized speech for text data composed of a variable part and a fixed part, using text-to-speech synthesis for the variable part and template data, in which reading, accent, and prosody information are stored in advance, for the fixed part, the speech synthesizer comprising:
acquisition means for acquiring the template data and the text data of the variable part;
extraction means for extracting the reading, accent, and prosody information from the acquired template data;
generation means for inserting the acquired text data into the variable part and generating the reading, accent, and prosody information of the synthesized speech including the fixed part;
determination means for determining whether the accent extracted from the template data matches the generated accent at the beginning of the initially set fixed part; and
variable part expansion means for expanding the variable part, when the determination means determines that the accents do not match, to a position at which the extracted accent matches the generated accent,
wherein synthesized speech is generated by text-to-speech synthesis for the expanded variable part and on the basis of the template data for the fixed part excluding the expanded variable part.

（付記２）
可変部分と固定部分とで構成されたテキストデータに対して、可変部分はテキスト音声合成で、固定部分は事前に読み、アクセント、及び韻律情報を記憶してあるテンプレートデータに基づいて合成音声を生成する音声合成装置において、
前記テンプレートデータ及び可変部分のテキストデータを取得する取得手段と、
取得したテンプレートデータから、読み、アクセント、及び韻律情報を抽出する抽出手段と、
取得したテキストデータを可変部分に挿入して、固定部分を含めて合成音声の読み、アクセント、及び韻律情報を生成する生成手段と、
初期設定されている可変部分の終端と固定部分の始端とで、前記テンプレートデータから抽出されたアクセントと生成されたアクセントとが一致しているか否かを判断する判断手段と、
該判断手段で一致していないと判断された場合、抽出されたアクセントと生成されたアクセントとが一致する位置まで可変部分を拡張する可変部分拡張手段と
を備え、拡張された可変部分はテキスト音声合成で、縮小された固定部分は前記テンプレートデータに基づいて合成音声を生成するようにしてあることを特徴とする音声合成装置。
(Appendix 2)
A speech synthesizer that generates synthesized speech for text data composed of a variable part and a fixed part, using text-to-speech synthesis for the variable part and template data, in which reading, accent, and prosody information are stored in advance, for the fixed part, the speech synthesizer comprising:
acquisition means for acquiring the template data and the text data of the variable part;
extraction means for extracting the reading, accent, and prosody information from the acquired template data;
generation means for inserting the acquired text data into the variable part and generating the reading, accent, and prosody information of the synthesized speech including the fixed part;
determination means for determining whether the accent extracted from the template data matches the generated accent at the end of the initially set variable part and at the beginning of the fixed part; and
variable part expansion means for expanding the variable part, when the determination means determines that the accents do not match, to a position at which the extracted accent matches the generated accent,
wherein synthesized speech is generated by text-to-speech synthesis for the expanded variable part and on the basis of the template data for the reduced fixed part.

（付記３）
前記テンプレートデータは、可変部分の拡張により変動する可変部分と固定部分との境界となり得る可変部分拡張候補位置に関する情報を含んでおり、
前記可変部分拡張手段は、
前記テンプレートデータから前記可変部分拡張候補位置を抽出する手段と、
抽出された可変部分拡張候補位置にて、前記生成手段で生成されたアクセントと、前記抽出手段で抽出されたアクセントとが一致するか否かを判断する手段と
を備え、
該手段で一致すると判断された可変部分拡張候補位置のうち、初期設定されている前記可変部分に最も近接している可変部分拡張候補位置まで該可変部分を拡張するようにしてあることを特徴とする付記１又は２記載の音声合成装置。 (Appendix 3)
The template data includes information on variable part extension candidate positions that can be a boundary between a variable part and a fixed part that change due to the extension of the variable part,
The variable part expanding means includes
Means for extracting the variable partial extension candidate position from the template data;
Means for determining whether the accent generated by the generating means matches the accent extracted by the extracting means at the extracted variable partial extension candidate position;
Of the variable partial extension candidate positions determined to match by the means, the variable part is extended to the variable partial extension candidate position closest to the initially set variable part. The speech synthesizer according to supplementary note 1 or 2.
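The selection described in Appendix 3 can be sketched as follows. This is an illustration only: candidate positions are assumed to be integers that increase with distance from the initially set variable part, the accent comparison is supplied as a callable, and the function name is hypothetical.

```python
from typing import Callable, Iterable, Optional


def nearest_matching_candidate(
    candidates: Iterable[int],
    initial_end: int,
    accents_match: Callable[[int], bool],
) -> Optional[int]:
    """Pick the matching candidate position closest to the initial variable part.

    Returns None when no stored candidate position has matching accents.
    """
    matching = [p for p in candidates if p > initial_end and accents_match(p)]
    return min(matching, default=None)


# Hypothetical candidate positions stored in the template (assumption).
print(nearest_matching_candidate([4, 7, 11], 0, lambda pos: pos >= 7))
```

Choosing the nearest matching position keeps as much of the template's fixed part as possible.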

（付記４）
前記可変部分拡張手段は、
前記生成手段で生成されたアクセントと、前記抽出手段で抽出されたアクセントとが一致する部分を有するアクセント句を抽出する手段
を備え、
該手段で抽出されたアクセント句のうち、初期設定されている可変部分に最も近接しているアクセント句の終端まで該可変部分を拡張するようにしてあることを特徴とする付記１又は２記載の音声合成装置。
(Appendix 4)
The speech synthesizer according to appendix 1 or 2, wherein the variable part expansion means comprises
means for extracting accent phrases having a portion in which the accent generated by the generation means matches the accent extracted by the extraction means,
and expands the variable part to the end of the accent phrase that, among the accent phrases extracted by said means, is closest to the initially set variable part.

（付記５）
前記可変部分拡張手段は、
前記生成手段で生成されたアクセントと、前記抽出手段で抽出されたアクセントとが一致する部分を有するアクセント句を抽出する手段と、
抽出されたアクセント句の終端及び次のアクセント句の始端の両方において、前記生成手段で生成されたアクセントと、前記抽出手段で抽出されたアクセントとが一致するか否かを判断する手段と
を備え、
該手段で一致すると判断されたアクセント句のうち、初期設定されている前記可変部分に最も近接しているアクセント句の終端まで該可変部分を拡張するようにしてあることを特徴とする付記１又は２記載の音声合成装置。
(Appendix 5)
The speech synthesizer according to appendix 1 or 2, wherein the variable part expansion means comprises:
means for extracting accent phrases having a portion in which the accent generated by the generation means matches the accent extracted by the extraction means; and
means for determining whether the accent generated by the generation means matches the accent extracted by the extraction means at both the end of an extracted accent phrase and the beginning of the next accent phrase,
and expands the variable part to the end of the accent phrase that, among the accent phrases determined to match by said means, is closest to the initially set variable part.

（付記６）
前記可変部分拡張手段は、
抽出されたアクセント句の数が所定値より大きいか否かを判断する手段
を備え、
該手段で大きいと判断した場合にのみ初期設定された前記可変部分を拡張するようにしてあることを特徴とする付記１乃至５のいずれか一項に記載の音声合成装置。
(Appendix 6)
The speech synthesizer according to any one of appendices 1 to 5, wherein the variable part expansion means comprises
means for determining whether the number of extracted accent phrases is greater than a predetermined value,
and expands the initially set variable part only when said means determines that the number is greater.

（付記７）
可変部分と固定部分とで構成されたテキストデータに対して、可変部分はテキスト音声合成で、固定部分は事前に読み、アクセント、及び韻律情報を記憶してあるテンプレートデータに基づいて合成音声を生成する音声合成方法において、
前記テンプレートデータ及び可変部分のテキストデータを取得し、
取得したテンプレートデータから、読み、アクセント、及び韻律情報を抽出し、
取得したテキストデータを可変部分に挿入して、固定部分を含めて合成音声の読み、アクセント、及び韻律情報を生成し、
初期設定されている固定部分の始端で、前記テンプレートデータから抽出されたアクセントと生成されたアクセントとが一致しているか否かを判断し、
一致していないと判断された場合、抽出されたアクセントと生成されたアクセントとが一致する位置まで可変部分を拡張し、
拡張された可変部分はテキスト音声合成で、拡張された可変部分を除く固定部分は前記テンプレートデータに基づいて合成音声を生成することを特徴とする音声合成方法。
(Appendix 7)
A speech synthesis method for generating synthesized speech for text data composed of a variable part and a fixed part, using text-to-speech synthesis for the variable part and template data, in which reading, accent, and prosody information are stored in advance, for the fixed part, the method comprising:
acquiring the template data and the text data of the variable part;
extracting the reading, accent, and prosody information from the acquired template data;
inserting the acquired text data into the variable part and generating the reading, accent, and prosody information of the synthesized speech including the fixed part;
determining whether the accent extracted from the template data matches the generated accent at the beginning of the initially set fixed part;
expanding the variable part, when it is determined that the accents do not match, to a position at which the extracted accent matches the generated accent; and
generating synthesized speech by text-to-speech synthesis for the expanded variable part and on the basis of the template data for the fixed part excluding the expanded variable part.

（付記８）
可変部分と固定部分とで構成されたテキストデータに対して、可変部分はテキスト音声合成で、固定部分は事前に読み、アクセント、及び韻律情報を記憶してあるテンプレートデータに基づいて合成音声を生成する音声合成方法において、
前記テンプレートデータ及び可変部分のテキストデータを取得し、
取得したテンプレートデータから、読み、アクセント、及び韻律情報を抽出し、
取得したテキストデータを可変部分に挿入して、固定部分を含めて合成音声の読み、アクセント、及び韻律情報を生成し、
初期設定されている可変部分の終端と固定部分の始端とで、前記テンプレートデータから抽出されたアクセントと生成されたアクセントとが一致しているか否かを判断し、
一致していないと判断された場合、抽出されたアクセントと生成されたアクセントとが一致する位置まで可変部分を拡張し、
拡張された可変部分はテキスト音声合成で、縮小された固定部分は前記テンプレートデータに基づいて合成音声を生成することを特徴とする音声合成方法。
(Appendix 8)
A speech synthesis method for generating synthesized speech for text data composed of a variable part and a fixed part, using text-to-speech synthesis for the variable part and template data, in which reading, accent, and prosody information are stored in advance, for the fixed part, the method comprising:
acquiring the template data and the text data of the variable part;
extracting the reading, accent, and prosody information from the acquired template data;
inserting the acquired text data into the variable part and generating the reading, accent, and prosody information of the synthesized speech including the fixed part;
determining whether the accent extracted from the template data matches the generated accent at the end of the initially set variable part and at the beginning of the fixed part;
expanding the variable part, when it is determined that the accents do not match, to a position at which the extracted accent matches the generated accent; and
generating synthesized speech by text-to-speech synthesis for the expanded variable part and on the basis of the template data for the reduced fixed part.

（付記９）
前記テンプレートデータは、可変部分の拡張により変動する可変部分と固定部分との境界となり得る可変部分拡張候補位置に関する情報を含んでおり、
前記テンプレートデータから前記可変部分拡張候補位置を抽出し、
抽出された可変部分拡張候補位置にて、生成されたアクセントと、前記テンプレートデータから抽出されたアクセントとが一致するか否かを判断し、
一致すると判断された可変部分拡張候補位置のうち、初期設定されている前記可変部分に最も近接している可変部分拡張候補位置まで該可変部分を拡張することを特徴とする付記７又は８記載の音声合成方法。 (Appendix 9)
The template data includes information on variable part extension candidate positions that can be a boundary between a variable part and a fixed part that change due to the extension of the variable part,
Extracting the variable partial extension candidate position from the template data;
It is determined whether or not the generated accent matches the accent extracted from the template data at the extracted variable partial extension candidate position,
The variable part is extended to the variable part extension candidate position that is closest to the initially set variable part among the variable part extension candidate positions determined to be coincident with each other, Speech synthesis method.

（付記１０）
生成されたアクセントと、前記テンプレートデータから抽出されたアクセントとが一致する部分を有するアクセント句を抽出し、
抽出されたアクセント句のうち、初期設定されている可変部分に最も近接しているアクセント句の終端まで該可変部分を拡張することを特徴とする付記７又は８記載の音声合成方法。
(Appendix 10)
The speech synthesis method according to appendix 7 or 8, comprising:
extracting accent phrases having a portion in which the generated accent matches the accent extracted from the template data; and
expanding the variable part to the end of the accent phrase that, among the extracted accent phrases, is closest to the initially set variable part.

（付記１１）
生成されたアクセントと、前記テンプレートデータから抽出されたアクセントとが一致する部分を有するアクセント句を抽出し、
抽出されたアクセント句の終端及び次のアクセント句の始端の両方において、生成されたアクセントと、前記テンプレートデータから抽出されたアクセントとが一致するか否かを判断し、
一致すると判断されたアクセント句のうち、初期設定されている前記可変部分に最も近接しているアクセント句の終端まで該可変部分を拡張することを特徴とする付記７又は８記載の音声合成方法。
(Appendix 11)
The speech synthesis method according to appendix 7 or 8, comprising:
extracting accent phrases having a portion in which the generated accent matches the accent extracted from the template data;
determining whether the generated accent matches the accent extracted from the template data at both the end of an extracted accent phrase and the beginning of the next accent phrase; and
expanding the variable part to the end of the accent phrase that, among the accent phrases determined to match, is closest to the initially set variable part.

（付記１２）
抽出されたアクセント句の数が所定値より大きいか否かを判断し、
大きいと判断した場合にのみ初期設定された前記可変部分を拡張することを特徴とする付記７乃至１１のいずれか一項に記載の音声合成方法。
(Appendix 12)
The speech synthesis method according to any one of appendices 7 to 11, comprising:
determining whether the number of extracted accent phrases is greater than a predetermined value; and
expanding the initially set variable part only when the number is determined to be greater.

（付記１３）
可変部分と固定部分とで構成されたテキストデータに対して、可変部分はテキスト音声合成で、固定部分は事前に読み、アクセント、及び韻律情報を記憶してあるテンプレートデータに基づいて合成音声を生成するコンピュータで実行することが可能なコンピュータプログラムにおいて、
前記コンピュータを、
前記テンプレートデータ及び可変部分のテキストデータを取得する取得手段、
取得したテンプレートデータから、読み、アクセント、及び韻律情報を抽出する抽出手段、
取得したテキストデータを可変部分に挿入して、固定部分を含めて合成音声の読み、アクセント、及び韻律情報を生成する生成手段、
初期設定されている固定部分の始端で、前記テンプレートデータから抽出されたアクセントと生成されたアクセントとが一致しているか否かを判断する判断手段、
該判断手段で一致していないと判断された場合、抽出されたアクセントと生成されたアクセントとが一致する位置まで可変部分を拡張する可変部分拡張手段、及び
拡張された可変部分はテキスト音声合成で、拡張された可変部分を除く固定部分は前記テンプレートデータに基づいて合成音声を生成する手段
として機能させることを特徴とするコンピュータプログラム。
(Appendix 13)
A computer program executable on a computer that generates synthesized speech for text data composed of a variable part and a fixed part, using text-to-speech synthesis for the variable part and template data, in which reading, accent, and prosody information are stored in advance, for the fixed part, the computer program causing the computer to function as:
acquisition means for acquiring the template data and the text data of the variable part;
extraction means for extracting the reading, accent, and prosody information from the acquired template data;
generation means for inserting the acquired text data into the variable part and generating the reading, accent, and prosody information of the synthesized speech including the fixed part;
determination means for determining whether the accent extracted from the template data matches the generated accent at the beginning of the initially set fixed part;
variable part expansion means for expanding the variable part, when the determination means determines that the accents do not match, to a position at which the extracted accent matches the generated accent; and
means for generating synthesized speech by text-to-speech synthesis for the expanded variable part and on the basis of the template data for the fixed part excluding the expanded variable part.

（付記１４）
可変部分と固定部分とで構成されたテキストデータに対して、可変部分はテキスト音声合成で、固定部分は事前に読み、アクセント、及び韻律情報を記憶してあるテンプレートデータに基づいて合成音声を生成するコンピュータで実行することが可能なコンピュータプログラムにおいて、
前記コンピュータを、
前記テンプレートデータ及び可変部分のテキストデータを取得する取得手段、
取得したテンプレートデータから、読み、アクセント、及び韻律情報を抽出する抽出手段、
取得したテキストデータを可変部分に挿入して、固定部分を含めて合成音声の読み、アクセント、及び韻律情報を生成する生成手段、
初期設定されている可変部分の終端と固定部分の始端とで、前記テンプレートデータから抽出されたアクセントと生成されたアクセントとが一致しているか否かを判断する判断手段、
該判断手段で一致していないと判断された場合、抽出されたアクセントと生成されたアクセントとが一致する位置まで可変部分を拡張する可変部分拡張手段、及び
拡張された可変部分はテキスト音声合成で、縮小された固定部分は前記テンプレートデータに基づいて合成音声を生成する手段
として機能させることを特徴とするコンピュータプログラム。
(Appendix 14)
A computer program executable on a computer that generates synthesized speech for text data composed of a variable part and a fixed part, using text-to-speech synthesis for the variable part and template data, in which reading, accent, and prosody information are stored in advance, for the fixed part, the computer program causing the computer to function as:
acquisition means for acquiring the template data and the text data of the variable part;
extraction means for extracting the reading, accent, and prosody information from the acquired template data;
generation means for inserting the acquired text data into the variable part and generating the reading, accent, and prosody information of the synthesized speech including the fixed part;
determination means for determining whether the accent extracted from the template data matches the generated accent at the end of the initially set variable part and at the beginning of the fixed part;
variable part expansion means for expanding the variable part, when the determination means determines that the accents do not match, to a position at which the extracted accent matches the generated accent; and
means for generating synthesized speech by text-to-speech synthesis for the expanded variable part and on the basis of the template data for the reduced fixed part.

（付記１５）
前記テンプレートデータは、可変部分の拡張により変動する可変部分と固定部分との境界となり得る可変部分拡張候補位置に関する情報を含んでおり、
前記コンピュータを、
前記テンプレートデータから前記可変部分拡張候補位置を抽出する手段、
抽出された可変部分拡張候補位置にて、前記生成手段で生成されたアクセントと、前記抽出手段で抽出されたアクセントとが一致するか否かを判断する手段、及び
該手段で一致すると判断された可変部分拡張候補位置のうち、初期設定されている前記可変部分に最も近接している可変部分拡張候補位置まで該可変部分を拡張する手段
として機能させることを特徴とする付記１３又は１４記載のコンピュータプログラム。
(Appendix 15)
The computer program according to appendix 13 or 14, wherein the template data includes information on variable part expansion candidate positions that can become the boundary between the variable part and the fixed part as the variable part is expanded, the computer program causing the computer to function as:
means for extracting the variable part expansion candidate positions from the template data;
means for determining whether the accent generated by the generation means matches the accent extracted by the extraction means at each extracted variable part expansion candidate position; and
means for expanding the variable part to the variable part expansion candidate position that, among the candidate positions determined to match by said means, is closest to the initially set variable part.

（付記１６）
前記コンピュータを、
前記生成手段で生成されたアクセントと、前記抽出手段で抽出されたアクセントとが一致する部分を有するアクセント句を抽出する手段、及び
該手段で抽出されたアクセント句のうち、初期設定されている可変部分に最も近接しているアクセント句の終端まで該可変部分を拡張する手段
として機能させることを特徴とする付記１３又は１４記載のコンピュータプログラム。
(Appendix 16)
The computer program according to appendix 13 or 14, causing the computer to function as:
means for extracting accent phrases having a portion in which the accent generated by the generation means matches the accent extracted by the extraction means; and
means for expanding the variable part to the end of the accent phrase that, among the accent phrases extracted by said means, is closest to the initially set variable part.

（付記１７）
前記コンピュータを、
前記生成手段で生成されたアクセントと、前記抽出手段で抽出されたアクセントとが一致する部分を有するアクセント句を抽出する手段、
抽出されたアクセント句の終端及び次のアクセント句の始端の両方において、前記生成手段で生成されたアクセントと、前記抽出手段で抽出されたアクセントとが一致するか否かを判断する手段、及び
該手段で一致すると判断されたアクセント句のうち、初期設定されている前記可変部分に最も近接しているアクセント句の終端まで該可変部分を拡張する手段
として機能させることを特徴とする付記１３又は１４記載のコンピュータプログラム。
(Appendix 17)
The computer program according to appendix 13 or 14, causing the computer to function as:
means for extracting accent phrases having a portion in which the accent generated by the generation means matches the accent extracted by the extraction means;
means for determining whether the accent generated by the generation means matches the accent extracted by the extraction means at both the end of an extracted accent phrase and the beginning of the next accent phrase; and
means for expanding the variable part to the end of the accent phrase that, among the accent phrases determined to match by said means, is closest to the initially set variable part.

（付記１８）
前記コンピュータを、
抽出されたアクセント句の数が所定値より大きいか否かを判断する手段、及び
該手段で大きいと判断した場合にのみ初期設定された前記可変部分を拡張する手段
として機能させることを特徴とする付記１３乃至１７のいずれか一項に記載のコンピュータプログラム。
(Appendix 18)
The computer program according to any one of appendices 13 to 17, causing the computer to function as:
means for determining whether the number of extracted accent phrases is greater than a predetermined value; and
means for expanding the initially set variable part only when said means determines that the number is greater.

本発明の実施の形態１に係る音声合成装置を具現化するコンピュータの構成を示すブロック図である。 A block diagram showing the configuration of a computer that embodies the speech synthesizer according to Embodiment 1 of the present invention.
本発明の実施の形態１に係る音声合成装置の演算処理部の音声合成処理の手順を示すフローチャートである。 A flowchart showing the procedure of the speech synthesis process of the arithmetic processing unit of the speech synthesizer according to Embodiment 1 of the present invention.
本発明の実施の形態１に係る音声合成装置のテンプレート記憶部に記憶されているデータ構成の一例を示す図である。 A diagram showing an example of the data structure stored in the template storage unit of the speech synthesizer according to Embodiment 1 of the present invention.
本発明の実施の形態１に係る音声合成装置のテンプレート記憶部に記憶されているデータ構成の他の例を示す図である。 A diagram showing another example of the data structure stored in the template storage unit of the speech synthesizer according to Embodiment 1 of the present invention.
可変部分に含まれるテキストの相違によりアクセントの連続性が変化する状態を示す図である。 A diagram showing how the continuity of the accent changes depending on the text contained in the variable part.
本発明の実施の形態２に係る音声合成装置のテンプレート記憶部に記憶されているデータ構成の一例を示す図である。 A diagram showing an example of the data structure stored in the template storage unit of the speech synthesizer according to Embodiment 2 of the present invention.
本発明の実施の形態２に係る音声合成装置の演算処理部の音声合成処理の手順を示すフローチャートである。 A flowchart showing the procedure of the speech synthesis process of the arithmetic processing unit of the speech synthesizer according to Embodiment 2 of the present invention.
アクセントの高低及びアクセント句の一例を示す図である。 A diagram showing an example of accent heights and accent phrases.
本発明の実施の形態３に係る音声合成装置の演算処理部の音声合成処理の手順を示すフローチャートである。 A flowchart showing the procedure of the speech synthesis process of the arithmetic processing unit of the speech synthesizer according to Embodiment 3 of the present invention.
本発明の実施の形態３に係る音声合成装置の演算処理部の音声合成処理の手順を示すフローチャートである。 A flowchart showing the procedure of the speech synthesis process of the arithmetic processing unit of the speech synthesizer according to Embodiment 3 of the present invention.
本発明の実施の形態４に係る音声合成装置のテンプレート記憶部に記憶されているデータ構成の一例を示す図である。 A diagram showing an example of the data structure stored in the template storage unit of the speech synthesizer according to Embodiment 4 of the present invention.

Explanation of symbols

1 Speech synthesizer
11 Arithmetic processing unit
12 ROM
13 RAM
14 Communication interface unit
15 Storage device
16 Audio output unit
17 Internal bus

Claims

A speech synthesizer that, for text data composed of a variable part and a fixed part, generates synthesized speech by text-to-speech synthesis for the variable part and, for the fixed part, based on template data in which reading, accent, and prosody information is stored in advance, the speech synthesizer comprising:
obtaining means for obtaining the template data and text data for the variable part;
extracting means for extracting reading, accent, and prosody information from the obtained template data;
generating means for inserting the obtained text data into the variable part and generating reading, accent, and prosody information for the synthesized speech including the fixed part;
determining means for determining whether the accent extracted from the template data matches the generated accent at the beginning of the initially set fixed part; and
variable part expanding means for expanding the variable part, when the determining means determines that the accents do not match, to a position where the extracted accent matches the generated accent,
wherein the synthesized speech is generated by text-to-speech synthesis for the expanded variable part and based on the template data for the fixed part excluding the expanded variable part.

A speech synthesizer that, for text data composed of a variable part and a fixed part, generates synthesized speech by text-to-speech synthesis for the variable part and, for the fixed part, based on template data in which reading, accent, and prosody information is stored in advance, the speech synthesizer comprising:
obtaining means for obtaining the template data and text data for the variable part;
extracting means for extracting reading, accent, and prosody information from the obtained template data;
generating means for inserting the obtained text data into the variable part and generating reading, accent, and prosody information for the synthesized speech including the fixed part;
determining means for determining whether the accent extracted from the template data matches the generated accent at the start of the variable part and at the start of the fixed part; and
variable part expanding means for expanding the variable part, when the determining means determines that the accents do not match, to a position where the extracted accent matches the generated accent,
wherein the synthesized speech is generated by text-to-speech synthesis for the expanded variable part and based on the template data for the reduced fixed part.

The speech synthesizer, wherein the variable part expanding means includes
means for extracting accent phrases each having a portion in which the accent generated by the generating means matches the accent extracted by the extracting means,
and the variable part is expanded to the end of the accent phrase that, among the accent phrases extracted by that means, is closest to the initially set variable part.

A speech synthesis method that, for text data composed of a variable part and a fixed part, generates synthesized speech by text-to-speech synthesis for the variable part and, for the fixed part, based on template data in which reading, accent, and prosody information is stored in advance, the method comprising:
obtaining the template data and text data for the variable part;
extracting reading, accent, and prosody information from the obtained template data;
inserting the obtained text data into the variable part and generating reading, accent, and prosody information for the synthesized speech including the fixed part;
determining whether the accent extracted from the template data matches the generated accent at the beginning of the initially set fixed part;
expanding the variable part, when it is determined that the accents do not match, to a position where the extracted accent matches the generated accent; and
generating the synthesized speech by text-to-speech synthesis for the expanded variable part and based on the template data for the fixed part excluding the expanded variable part.
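
The determining and expanding steps of the method can be illustrated with a minimal Python sketch. It is not the implementation described in this publication. It assumes that accents are represented as one "H" (high) or "L" (low) label per mora of the initially set fixed part, and that the template accents and the generated accents are already aligned mora by mora. The function names `expansion_length` and `split_fixed_part` and the sample readings are hypothetical.

```python
from typing import List, Tuple


def expansion_length(template_accents: List[str],
                     generated_accents: List[str]) -> int:
    """Return how many leading morae of the fixed part join the variable part.

    Both lists cover the initially set fixed part, one label per mora
    ("H" or "L"; this encoding is an assumption made for illustration).
    The scan starts at the beginning of the fixed part and stops at the
    first position where the template accent matches the generated accent.
    """
    if len(template_accents) != len(generated_accents):
        raise ValueError("accent sequences must be aligned mora by mora")
    for index, (template, generated) in enumerate(
            zip(template_accents, generated_accents)):
        if template == generated:
            return index  # accents agree here, so expansion stops
    # No matching position: the whole fixed part is handed to text-to-speech.
    return len(template_accents)


def split_fixed_part(readings: List[str],
                     template_accents: List[str],
                     generated_accents: List[str]) -> Tuple[List[str], List[str]]:
    """Split the fixed part into (morae for TTS, morae kept from the template)."""
    if len(readings) != len(template_accents):
        raise ValueError("readings and accents must have the same length")
    count = expansion_length(template_accents, generated_accents)
    return readings[:count], readings[count:]


# Hypothetical fixed part following the variable part.
readings = ["ni", "i", "ki", "ma", "su"]
template = ["L", "H", "H", "L", "L"]    # accents stored in the template
generated = ["H", "H", "H", "L", "L"]   # accents generated after insertion

to_tts, from_template = split_fixed_part(readings, template, generated)
print(to_tts, from_template)
```

In this sketch a mismatch at the first mora moves only that mora into the variable part. When the accents already agree at the beginning of the fixed part, the returned length is 0 and the initially set boundary is kept.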

A computer program executable on a computer that, for text data composed of a variable part and a fixed part, generates synthesized speech by text-to-speech synthesis for the variable part and, for the fixed part, based on template data in which reading, accent, and prosody information is stored in advance, the computer program causing the computer to function as:
obtaining means for obtaining the template data and text data for the variable part;
extracting means for extracting reading, accent, and prosody information from the obtained template data;
generating means for inserting the obtained text data into the variable part and generating reading, accent, and prosody information for the synthesized speech including the fixed part;
determining means for determining whether the accent extracted from the template data matches the generated accent at the beginning of the initially set fixed part;
variable part expanding means for expanding the variable part, when the determining means determines that the extracted accent and the generated accent do not match, to a position where they match; and
means for generating the synthesized speech by text-to-speech synthesis for the expanded variable part and based on the template data for the fixed part excluding the expanded variable part.