JP4964695B2 - Speech synthesis apparatus, speech synthesis method, and program

Info

Publication number: JP4964695B2
Application number: JP2007182081A
Authority: JP
Inventors: 健司永松; 亮太鴨志田; 雄介藤田
Original assignee: Hitachi Automotive Systems Ltd
Current assignee: Hitachi Astemo Ltd
Priority date: 2007-07-11
Filing date: 2007-07-11
Publication date: 2012-07-04
Anticipated expiration: 2027-07-11
Also published as: JP2009020264A

Abstract

PROBLEM TO BE SOLVED: To naturally combine a natural voice and a synthesized voice by detecting a natural voice section partially matching a synthesized voice section and by providing the synthesized voice with prosody (intonation/rhythm) information of the natural voice section.

SOLUTION: The voice synthesis device includes a recorded voice storage means, an input text analysis means, a recorded voice selection means, a connection boundary calculation means, a rule synthesis means, and a connection synthesis means. The voice synthesis device further includes a natural voice prosody section determination means determining a section partially matching the recorded natural voice in the synthesized voice section, a natural voice prosody extraction means extracting the natural voice prosody of the matching part, and a hybrid prosody generation means generating prosody information of the entire synthesized voice section using the extracted natural voice prosody.

COPYRIGHT: (C)2009, JPO&INPIT

Description

The present invention relates to a speech synthesis apparatus and a speech synthesis method. In particular, it relates to a hybrid speech synthesis apparatus that creates announcement speech by using synthesized speech together with recorded real voice.

Systems that provide information by voice, such as automatic announcements at railways and public facilities or telephone information services at banks and securities companies, have become widespread. Voice messages used in these applications are characterized by many fixed expressions. In railway broadcasting, for example, the announcement "The train bound for Tokyo will shortly arrive at Track 5" (「まもなく、５番線に東京行きがまいります」) is often reused in many variations by changing only the "Track 5" and "Tokyo" portions.

For this reason, announcement systems used in fields with many fixed expressions prepare the fixed portions as recorded real-voice components and create the announcement by combining them as needed. In the example above, the announcement sentence is created by concatenating the voice components "shortly" (まもなく), "at Track 5" (５番線に), "the train bound for Tokyo" (東京行きが), and "will arrive" (まいります). This is called the recording and editing method, and it is currently the mainstream method in such announcement fields.

Because the recording and editing method concatenates real-voice components, its quality is high in terms of sounding like a real voice. However, because small fragments of speech are joined together, intonation and rhythm are hard to match, and quality in terms of naturalness suffers. Furthermore, the voice components must be recorded in advance, so re-recording is required whenever a new phrase is added, which makes the method costly and inconvenient.

On the other hand, synthesizing speech with a rule-based synthesis method, known as speech synthesis or TTS (Text-to-Speech) technology, makes it possible to generate speech data that reads out arbitrary text. Rule-based synthesis is described in detail in "Digital Speech Processing" (Sadaoki Furui, Tokai University Press) and "Progress in Speech Synthesis" (van Santen et al., Springer). However, although this method can convert arbitrary text into speech, it falls short of the recording and editing method in terms of real-voice quality and naturalness.

To solve the problems of the recording and editing method, a hybrid method that combines the recording and editing method with rule-based synthesis has been considered. In this method, recorded real-voice components are used for the fixed portions, such as "shortly" (まもなく) and "will arrive" (まいります) in the example sentence above, while synthesized speech components generated by TTS technology are used for the portions whose content may change, such as "at Track 5" (５番線に) and "the train bound for Tokyo" (東京行きが). These components are then joined to create the voice message. This makes it possible to keep the real-voice quality of the recording and editing method while gaining the flexibility of TTS technology to handle arbitrary phrases.

However, the problem of the recording and editing method, namely the low naturalness of intonation and rhythm, remains in this hybrid method. One conceivable way to address it is a hybrid method with variable join positions, as follows. Instead of joining the synthesized portion and the real-voice portion at word or phrase boundaries as in the hybrid example above, the join position is determined more freely, for example by dynamically searching for unvoiced consonants or low-power phoneme positions, so that the joins between voice components become inconspicuous. In addition, the overall naturalness can be improved by adjusting the intonation and rhythm of the synthesized portion to match the real-voice portions before and after it.

From the opposite viewpoint, techniques have also been invented that improve the real-voice quality and naturalness of speech synthesis when there are many fixed expressions. For example, the invention of Patent Document 1 discloses a technique that, when performing speech synthesis, uses information obtained from the real voice itself as the prosody (intonation and rhythm) information of the fixed-expression portions. With this technique, although the method is still speech synthesis, the fixed-expression portions attain intonation and rhythm nearly as natural as those of a real voice.

Patent Document 1: JP-A-11-249677

With the variable-position hybrid method described above, the join positions between the real-voice and synthesized portions become inconspicuous, and naturalness improves because the intonation and rhythm of the synthesized portion are adjusted to match the real voice. However, because this adjustment technique is still insufficient, listeners can tell that the synthesized portion is synthesized speech, and as a result the perceived quality drops considerably.

One conceivable solution is to apply the technique disclosed in Patent Document 1 to the synthesized portion so that its naturalness becomes nearly equal to that of a real voice. However, simply combining the two techniques does not solve the problem. When a voice message is created by a system implementing the variable-position hybrid method, the technique of Patent Document 1 can be applied only when a real voice with the same content (more precisely, the same phonemes) as the section to be synthesized has been recorded, and only to the portion with the same phonemes. In other words, if no real voice with the same content has been recorded, the technique cannot be applied at all. Since the sections that use synthesized speech consist of arbitrary phrases, the likelihood that a real voice with the same content has been recorded must be assumed to be quite low.

The present invention has been made in view of the above problems. It solves them by detecting a real-voice section that partially matches the synthesized speech section produced under the variable-position hybrid method, and by giving the prosody (intonation and rhythm) information of that real-voice section to the synthesized speech. Its object is to provide a hybrid speech synthesis apparatus that joins real voice and synthesized speech more naturally than the prior art and can thereby create voice messages with both high real-voice quality and high naturalness.

The present invention is a hybrid speech synthesis apparatus comprising: an input text analysis unit that accepts text to be converted into speech and converts it into pronunciation text; a recorded voice storage unit that stores in advance real voice data, in which preset sentences are recorded in a real voice, together with those sentences; a recorded voice selection unit that compares the sentences stored in the recorded voice storage unit with the pronunciation text and selects, from the recorded voice storage unit, the real voice data and sentence to be used for speech synthesis; a connection boundary calculation unit that determines, from the pronunciation text and the selected sentence, the boundary between a synthesized speech section, in which speech is generated by speech synthesis, and a real voice section, in which speech is generated from the real voice data; a rule synthesis unit that generates speech synthesis data for the determined synthesized speech section using preset speech segments and a prosody model; and a connection synthesis unit that joins the real voice data corresponding to the real voice section with the generated speech synthesis data to generate a synthesized speech sentence corresponding to the input text.

The apparatus further comprises: a real voice prosody section determination unit that determines, within the synthesized speech section determined by the connection boundary calculation unit, a real voice prosody section in which the prosody of the real voice data is used; a real voice prosody extraction unit that extracts, from the selected real voice data, the prosody of the section determined by the real voice prosody section determination unit; and a hybrid prosody generation unit that generates prosody information for the entire synthesized speech section from the extracted real-voice prosody and the prosody model.

The rule synthesis unit generates the speech synthesis data for the synthesized speech section from the speech segments and this prosody information. The recorded voice selection unit compares the sentences in the recorded voice storage unit with the pronunciation text and outputs the real voice data and sentence having the largest number of syllables whose pronunciation matches the pronunciation text. The connection synthesis unit joins the real voice section, the real voice prosody section, and the synthesized speech section for which the hybrid prosody generation unit generated prosody information. When comparing the pronunciation text with the selected sentence, the real voice prosody section determination unit determines where to use the prosody of the real voice data on the basis of the longest match in syllable units.

Accordingly, in a hybrid speech synthesis apparatus that joins real voice data and synthesized speech to create voice messages with high real-voice quality and naturalness, the present invention further improves the naturalness of the synthesized speech section. This makes it possible to create voice messages of even higher naturalness and quality.

Hereinafter, an embodiment of the present invention will be described with reference to the accompanying drawings.

FIG. 1 is a block diagram of a speech synthesis apparatus according to a first embodiment of the present invention. The speech synthesis apparatus 1 comprises a CPU (processor) 3 that performs arithmetic processing, a memory 2 that stores data and programs, a storage device 5 that stores data and programs, a display device 4 that displays calculation results and the like, and an audio playback device 6 that outputs speech. A text generation unit 7, which is a program that generates text, and a speech synthesis unit 8, which converts the text output by the text generation unit 7 into speech, are loaded into the memory 2 and executed by the CPU 3. The speech synthesis unit 8 converts the text into speech data and outputs it, and the CPU 3 sends the speech data to the audio playback device 6 and instructs it to output the speech.

As described later, the text generation unit 7 functions as a software module that generates, as text, guidance information for a car navigation device or the like. The storage device 5 stores the various data used by the speech synthesis unit 8, also as described later.

FIG. 2 is a functional block diagram of the speech synthesis unit 8 shown in FIG. 1.

In FIG. 2, the basic configuration of the speech synthesis apparatus 1 and the speech synthesis method of the present invention consists of the following elements.

The input text analysis unit 20 analyzes the content of the text 10 input from the text generation unit 7 and generates the pronunciation text 21, which is used to separate the portions that use recorded real-voice components (real voice recorded voice data) from the portions that use synthesized speech components. The recorded voice storage unit 30 stores a large number of recorded sentence utterances (= real voice recorded voice data) used for speech synthesis. The recorded voice selection unit 40 determines, from the recorded voice storage unit 30, a recorded sentence that can be used as the material for real-voice components (= part or all of the real voice recorded voice data). The connection boundary calculation unit 50 determines, within the selected recorded sentence, the boundary between the synthesized speech section produced by speech synthesis and the real voice section taken from the real voice recorded voice data.

The real voice prosody section determination unit 60 determines the places within the synthesized speech section where the prosody of the real voice recorded voice data can be used as it is. The real voice prosody extraction unit 70 extracts, from the real voice recorded voice data, the prosody information corresponding to those places. The hybrid prosody generation unit 80 complements the prosody of the newly synthesized portion on the basis of the extracted prosody information and generates hybrid prosody information for the entire section to be synthesized. The rule synthesis unit 100 performs speech synthesis on the basis of the generated hybrid prosody information, using the data in the speech segment database 110 and the prosody model 120. The connection synthesis unit 90 joins the real-voice component sections and the synthesized speech sections to generate the entire synthesized speech sentence corresponding to the input text.

Next, for the basic configuration of the present invention shown in FIG. 2, we describe concretely how each element may be implemented. The description here concerns the case in which the speech synthesis apparatus 1 is implemented as a device that synthesizes guidance speech in a car navigation system.

First, the input text 10 is entered using, for example, a keyboard or a digital pen on a touch panel, and is passed to the input text analysis unit 20 as electronic data. Various other input devices, such as a character recognition device (OCR), are also conceivable. The input text to be synthesized may also be stored in advance in a database on the storage device 5, or new text may be generated dynamically using a text processing device or text processing technique (not shown).

In the first embodiment, the processing flow is described on the assumption that the data 「まもなく、渋谷南バイパスの先を右折です。」 ("Shortly, turn right just past the Shibuya-Minami Bypass."), shown in FIG. 3, is input to the speech synthesis apparatus 1 as the input text 10.

Next, the recorded voice storage unit 30 is a database that stores a large number of real voice recorded voice data that can be used as real-voice components for the sentences to be synthesized. The recorded voice storage unit 30, also called a speech corpus, can readily be implemented with a variety of database devices and data storage technologies.

In this embodiment, the real voice recorded voice data is a prerecorded utterance of a predetermined sentence spoken by a person (a recorded sentence utterance). As one example, the recorded voice storage unit 30 can store its data in the form of a table (or relational database) with the structure shown in FIG. 4. The real voice recorded voice data itself is stored in advance in the audio file 220 in the recorded voice storage unit 30. In addition, the recorded voice storage unit 30 stores a text transcription of the audio file 220 as the recorded voice text 210, and stores the pronunciation of the recorded voice text 210, converted into text, as the recorded voice pronunciation text 230. The audio file 220 and the recorded voice pronunciation text 230 that correspond to a recorded voice text 210 are associated with it as one set of data by the ID 200. The recorded voice text 210, audio file 220, and recorded voice pronunciation text 230 corresponding to an ID 200 are together referred to as recorded voice information.
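As a minimal sketch of the table structure described above, the following Python models one row of recorded voice information. The class name, field names, file names, and both sample rows are hypothetical illustrations; only the four columns (ID 200, recorded voice text 210, audio file 220, recorded voice pronunciation text 230) come from the description, and the actual contents of FIG. 4 are not reproduced.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecordedVoiceInfo:
    """One row of the recorded voice storage unit 30 (names are illustrative)."""
    id: int              # ID 200
    text: str            # recorded voice text 210
    audio_file: str      # audio file 220 (hypothetical path to the recording)
    pronunciation: str   # recorded voice pronunciation text 230 (kana syllables only)


# Hypothetical sample rows, not the actual contents of FIG. 4.
ROWS = [
    RecordedVoiceInfo(1, "まもなく、右方向です。", "rec_0001.wav",
                      "マモナクミギホウコウデス"),
    RecordedVoiceInfo(2, "まもなく、目黒南バイパスの先を右折です。", "rec_0002.wav",
                      "マモナクメグロミナミバイパスノサキオウセツデス"),
]

# Index by ID so that text, audio file, and pronunciation stay associated.
STORE = {row.id: row for row in ROWS}

print(STORE[2].audio_file)
```

Keeping the three fields in one record keyed by ID mirrors the way the ID 200 ties the text, the audio file, and the pronunciation text into a single unit of recorded voice information.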

In the recorded voice pronunciation text 230 shown in FIG. 4, symbols indicating accent positions, phrase boundaries, and the like are removed, and the text is written only in kana characters that represent syllables (units composed of combinations of vowels and consonants, such as ア, カ, and ギ). It can, however, also be stored in the format of a phonetic symbol string that includes accent and other symbols. Alternatively, any information sufficient to indicate which phonemes (the units covering both vowels and consonants, such as A, K, G, and I) make up the speech may be stored, for example a phoneme character string or a phoneme ID string. In the following, syllable units and phoneme units are not distinguished, and both are referred to more abstractly as phonemes (音韻).

Next, the input text analysis unit 20 applies analysis processing, known as text analysis or natural language analysis, to the text 10 input from the text generation unit 7. Its purpose is to extract or convert the information into a form that is easy to use in the later recorded voice selection unit 40, connection boundary calculation unit 50, and so on. The specific processing depends on the implementation, for example on what kinds of real voice prosody sections are to be selected. In the first embodiment, the input text 10 shown in FIG. 3 is converted into the pronunciation text 21 shown in FIG. 5. This conversion can be realized by natural language analysis, specifically by morphological analysis using word dictionary data. The technique is disclosed in, for example, "Natural Language Processing" (Makoto Nagao, ed., Iwanami Shoten).

As another approach, a pattern matching technique that requires no dictionary data could also be used. In that case, the recorded voice storage unit 30 holds matching patterns such as those shown in FIG. 6 in place of the recorded voice pronunciation text 230 of FIG. 4. The matching patterns of FIG. 6 consist of a recorded voice text 2101 corresponding to an ID 2001 and a matching pattern 2102 that contains the main part of the recorded voice text 2101. Here, a character string matching process (widely disclosed in the above reference and elsewhere) is applied in which the symbol "＊" is treated as a wildcard that can match any character string, and the recorded voice information that matches best, that is, with the shortest wildcard portion, is retrieved.
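The wildcard matching described above can be sketched as follows. This is only an illustration under stated assumptions: the three patterns are hypothetical (the actual contents of FIG. 6 are not reproduced), an ASCII `*` stands in for the wildcard symbol, and the function names are invented for this example. Each pattern is turned into a regular expression, and the pattern whose wildcard portion absorbs the fewest characters is chosen.

```python
import re


def wildcard_length(pattern, text, wildcard="*"):
    """Return how many characters the wildcards absorb, or None if no match."""
    literals = [re.escape(part) for part in pattern.split(wildcard)]
    regex = "(.*?)".join(literals)  # each wildcard becomes a capture group
    match = re.fullmatch(regex, text)
    if match is None:
        return None
    return sum(len(group) for group in match.groups())


def best_pattern(patterns, text):
    """Return (id, wildcard_length) for the pattern with the shortest wildcard part."""
    scored = []
    for rec_id, pattern in patterns.items():
        length = wildcard_length(pattern, text)
        if length is not None:
            scored.append((length, rec_id))
    if not scored:
        return None
    length, rec_id = min(scored)
    return rec_id, length


# Hypothetical matching patterns keyed by ID.
PATTERNS = {
    1: "まもなく、*を右折です。",
    2: "まもなく、*南バイパスの先を右折です。",
    3: "*を左折です。",
}
TEXT = "まもなく、渋谷南バイパスの先を右折です。"

print(best_pattern(PATTERNS, TEXT))  # pattern 2 wins: its wildcard covers only "渋谷"
```

Because the whole input must match, the wildcard length is simply the part of the input not covered by the literal text of the pattern, so a shorter wildcard means more of the sentence is available as recorded voice.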

Next, based on the information analyzed by the input text analysis unit 20 (the pronunciation text 21 of FIG. 5), the recorded voice selection unit 40 selects from the recorded voice storage unit 30 the real voice recorded voice data (audio file 220) that is closest to the input text 10, that is, the recording that contains the most real-voice portions with the same content.

This processing can be realized by counting the number of syllables that the pronunciation text 21 and the recorded voice pronunciation text 230 have in common. In the first embodiment, the input text analysis result of FIG. 5 (pronunciation text 21) is compared with the recorded voice pronunciation text 230 of each real voice recorded voice data item stored in the recorded voice storage unit 30.

As a result of this comparison, a matching syllable count such as that shown in FIG. 7 can be calculated for each real voice recorded voice data item. FIG. 7 is a table showing, for the recorded voice pronunciation text 230 corresponding to each recorded voice text 210 in the recorded voice storage unit 30, the number of syllables 240 whose pronunciation matches the pronunciation text 21. The recorded voice information containing the real voice recorded voice data with the largest matching syllable count 240 is taken as the output of the recorded voice selection unit 40.

The syllable comparison must not change the order of the syllables. For example, FIG. 8(a) shows the result of comparing the recorded voice pronunciation text 230 of ID = 2 in FIG. 4 with the pronunciation text 21. In FIG. 8(a), the syllables correspond one-to-one up to "マモナク", there are no matching syllables for "シ", "ブ", and "ヤ", and from "ミナミ" onward the syllables again correspond one-to-one, so the comparison is simple. Depending on the pronunciation text, however, matching the syllables on the left side of the text first can reduce the number of matching syllables in the latter half. In such cases a more appropriate match can be selected, for example, by using a leftmost-shortest matching scheme for the string comparison, or by generating multiple matching patterns and selecting the one with the largest number of matching syllables.
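One way to count matching syllables without reordering them, and without letting an early left-side match reduce the total, is the longest common subsequence (LCS). The sketch below is an assumption-laden illustration rather than the method of the patent: LCS is a stand-in for "generate multiple matching patterns and keep the one with the most matches", each kana character is treated as one syllable (the examples contain no small kana), and the candidate recordings are hypothetical.

```python
def matched_syllables(target, recorded):
    """Length of the longest common subsequence of two syllable sequences."""
    prev = [0] * (len(recorded) + 1)
    for t in target:
        cur = [0]
        for j, r in enumerate(recorded, start=1):
            if t == r:
                cur.append(prev[j - 1] + 1)           # extend a match, order preserved
            else:
                cur.append(max(prev[j], cur[j - 1]))  # skip a syllable on either side
        prev = cur
    return prev[-1]


def select_recording(target, candidates):
    """Return the ID of the candidate with the largest matching syllable count."""
    return max(candidates, key=lambda rec_id: matched_syllables(target, candidates[rec_id]))


# Pronunciation text for the example sentence (assumed kana rendering).
TARGET = "マモナクシブヤミナミバイパスノサキオウセツデス"

# Hypothetical recorded voice pronunciation texts keyed by ID.
CANDIDATES = {
    1: "マモナクミギホウコウデス",
    2: "マモナクメグロミナミバイパスノサキオウセツデス",
}

print(matched_syllables(TARGET, CANDIDATES[2]))
print(select_recording(TARGET, CANDIDATES))
```

With these hypothetical inputs, candidate 2 differs from the target only in the three syllables corresponding to "シブヤ", so it shares 20 of the 23 syllables and is selected.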

FIG. 8(b) shows the result of comparing the pronunciation text 21 with a recorded voice pronunciation text 230 that differs in part from that of ID = 2. As in this example, there are cases where the number of matching syllables is larger than in FIG. 8(a) but the non-matching portion, that is, the section for which a synthesized speech component will later be generated, becomes extremely short. Some speech synthesis methods are not suited to generating short stretches of synthesized speech, so in such a case the comparison results may be ranked according to the processing of the speech synthesis unit in use, for example by giving priority to the matching pattern of FIG. 8(a).

One way to decide the priority among matching patterns such as those of FIG. 8(a) and FIG. 8(b) is as follows. The pronunciation text 21 obtained from the input text 10 is compared with the recorded voice pronunciation text 230, and for each mismatched portion of the recorded voice pronunciation text 230, the mismatch cost for its number of characters is looked up in the table shown in FIG. 16. Summing these gives the total mismatch cost shown in FIG. 15. By comparing total mismatch costs, the matching pattern of FIG. 15(a), which has the smaller mismatch cost, can be given priority even though it has fewer matching syllables. FIG. 16 is a table in which the relationship between the number of mismatched characters and the mismatch cost is set in advance.
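The total mismatch cost can be sketched as below. Everything numeric here is an assumption: the cost table is a hypothetical stand-in for FIG. 16 (chosen so that very short mismatched runs are penalized most), the default cost for longer runs is invented, and the alignment uses Python's `difflib.SequenceMatcher`, which is a convenient approximation and not necessarily the comparison used in the patent. The candidate strings are also hypothetical.

```python
from difflib import SequenceMatcher

# Hypothetical stand-in for the FIG. 16 table: mismatched run length -> cost.
MISMATCH_COST = {1: 10, 2: 5, 3: 2}
DEFAULT_COST = 1  # assumed cost for runs longer than any table entry


def mismatch_runs(target, recorded):
    """Lengths of the mismatched runs inside the recorded pronunciation text."""
    matcher = SequenceMatcher(None, target, recorded, autojunk=False)
    return [j2 - j1
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()
            if tag != "equal" and j2 > j1]


def total_mismatch_cost(target, recorded):
    """Sum the table cost of every mismatched run."""
    return sum(MISMATCH_COST.get(n, DEFAULT_COST)
               for n in mismatch_runs(target, recorded))


TARGET = "マモナクシブヤミナミバイパスノサキオウセツデス"
CANDIDATES = {
    "a": "マモナクメグロミナミバイパスノサキオウセツデス",  # one 3-syllable mismatch
    "b": "マモナクシブヤミナミバイパスノサキオサセツデス",  # one 1-syllable mismatch
}

best = min(CANDIDATES, key=lambda k: total_mismatch_cost(TARGET, CANDIDATES[k]))
print(best)  # "a": fewer matching syllables, but a lower total mismatch cost
```

Under this assumed table, candidate "b" matches more syllables but forces a one-syllable synthesized segment, so its cost (10) exceeds that of candidate "a" (2), and "a" is preferred.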

In the present invention, only the portions of the recorded voice information (the real voice recorded voice data corresponding to the recorded voice pronunciation text 230 shown in FIG. 15) that match the syllables of the input text 10 (pronunciation text 21) are used as they are. The mismatch costs shown in FIG. 8 therefore need to be set according to the length of the synthesized speech section that is joined between the real-voice portions (real voice recorded voice data). In the example of FIG. 8, the mismatch cost is defined simply by the number of mismatched characters, but it can also be set according to the phoneme environment, that is, the kinds of the mismatched characters and of the phonemes before and after them. With such a setting, when the join falls in a silent section of the real voice recorded voice data, the mismatch cost can be made small even if the number of mismatched characters is extremely small, so that a smoother matching pattern is given priority.

How this ordering is adjusted is settled when the characteristics of the speech synthesis unit (rule synthesis unit 100) to be adopted are decided, that is, when the system of the present invention is implemented. The processing in the recorded voice selection unit 40 may also be performed in finer phoneme units rather than syllable units, using the same technique as for syllables.

Whether processing is done in syllable units or phoneme units depends on how small a unit the speech synthesis unit (rule synthesis unit 100) supports. If the rule synthesis unit 100 supports synthesis only down to syllable units, then the recorded voice selection unit 40, the associated recorded voice storage unit 30, and everything from the connection boundary calculation unit 50 through the hybrid prosody generation unit 80 must operate in syllable units.

If, on the other hand, the rule synthesis unit 100 supports synthesis in phoneme units, the processing from the recorded voice storage unit 30 through the hybrid prosody generation unit 80 may use either syllable units or phoneme units. Given that the object of the present invention is to join real voice (recorded real-voice data) and synthesized speech more smoothly, it is desirable to base the processing on the finer phoneme unit.

Next, the connection boundary calculation unit 50 decides, for the recorded real-voice data (audio file 220) selected by the recorded voice selection unit 40, which parts are used unchanged as real voice components and which parts are replaced by synthesized speech components generated by speech synthesis. The simplest approach takes the result of the syllable comparison performed by the recorded voice selection unit 40: the real voice of the recorded real-voice data (audio file 220) is used for the matching syllables, and synthesized speech components are used for the remaining non-matching parts.

In actual speech (real voice), however, syllables run smoothly into one another, so real voice and synthesized speech cannot be joined smoothly at every syllable boundary. One technique that addresses this problem is what may be called the variable position hybrid method, described next.

In this hybrid method, a connection cost indicating how easily a join can be made (how easily real voice and synthesized speech can be combined) is calculated at every syllable boundary or every phoneme boundary, and the synthesized speech component is lengthened so that the join between real voice and synthesized speech is made where the connection cost is smallest.

More specifically, a pause position at the start of an unvoiced consonant, or a phoneme boundary at which the speech power is sufficiently small, is selected, and the synthesized speech portion is extended to that phoneme boundary. In other words, the position at which real voice and synthesized speech are joined is not fixed but is changed dynamically according to the content.

For example, consider the case in the first embodiment where the recorded real-voice data ID2 has been selected by the comparison of FIG. 8(a) (FIG. 9). The real voice portions determined by phoneme/syllable matching alone are "mamonaku" and "minamibaipasuousetsudesu", and the "shibuya" between them uses a synthesized speech component generated by speech synthesis. However, the "ya" of "shibuya" and the "mi" of "minamibaipasu" are both voiced sounds, and joining the speech between them would produce noise.

The synthesized speech portion is therefore extended to a silent section or a point of low speech power. In the example of FIG. 9, the position immediately before "shibuya" is a silent section, so the join position on that side does not change. On the side after "shibuya", searching for the next silent section or point of low speech power finds the "pa" of "baipasu". The syllable "pa" begins with the plosive phoneme "p", which produces a brief silent section in which the speech signal drops to zero. Joining the real voice (recorded real-voice data) and the synthesized speech at this silent point produces no noise. As a result, as shown in FIG. 10, the connection boundary calculation unit 50 outputs the selected recorded voice ID = 2, the sections "mamonaku" and "pasunosakiousetsudesu" that use real voice components, and the section "shibuyaminamibai" that uses a synthesized speech component.
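The boundary extension in this example can be sketched as below. The sketch makes simplifying assumptions that are not in the description: syllables are romanized strings, a join is treated as "safe" when it falls on a known pause or before a syllable whose onset is p, t, or k, and no actual speech power is measured. All names are hypothetical.

```python
VOICELESS_PLOSIVES = ("p", "t", "k")  # assumed set of "safe" onsets


def is_safe_boundary(syllables, pauses, i):
    """True if a join placed just before syllable i falls on silence."""
    if i <= 0 or i >= len(syllables):
        return True  # the edges of the utterance are silent
    return i in pauses or syllables[i].startswith(VOICELESS_PLOSIVES)


def extend_section(syllables, pauses, start, end):
    """Widen the half-open range [start, end) to the nearest safe boundaries."""
    while not is_safe_boundary(syllables, pauses, start):
        start -= 1
    while not is_safe_boundary(syllables, pauses, end):
        end += 1
    return start, end


syllables = ["ma", "mo", "na", "ku", "shi", "bu", "ya",
             "mi", "na", "mi", "ba", "i", "pa", "su"]
pauses = {4}  # a pause precedes "shi" (index 4)
start, end = extend_section(syllables, pauses, 4, 7)
extended = "".join(syllables[start:end])
```

Starting from the mismatched section "shibuya", the left edge stays at the pause and the right edge moves forward to the "pa" of "baipasu", which gives the section "shibuyaminamibai" from the text.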

Next, the real voice prosody section determination unit 60 determines which of the syllable sections to be synthesized as synthesized speech components can use the prosody information of the original recorded real-voice data. This processing is the core of the present invention. Within the synthesized speech section that has been extended by the connection boundary calculation unit 50, using a technique such as the conventional variable position hybrid method described above, to points where real voice (recorded real-voice data) and synthesized speech can be joined smoothly, it identifies the portions that can use prosody information extracted from the recorded real-voice data.

This is described concretely below for the case of the first embodiment. By the time the data reaches the real voice prosody section determination unit 60, the connection boundary calculation unit 50 has reduced the matching syllable portion shown in FIG. 8(a) (the solid lines linking the upper and lower rows) to the matching syllable portion shown in FIG. 9. In other words, the non-matching portion, which becomes synthesized speech, has been extended from "shibuya" to "shibuyaminamibai".

When the processing result information shown in FIG. 10 is input to the real voice prosody section determination unit 60, the synthesized speech section "shibuyaminamibai" is compared with the corresponding section "nakanominamibai" of the recorded real-voice data. Here too the matching portion can be determined with a character string matching technique such as the leftmost shortest match described above.

Using the longest match method in syllable units, the real voice prosody section determination unit 60 can determine that, within the synthesized speech section "shibuyaminamibai", the section whose phonemes (syllables) match the original recorded real-voice data is "minamibai", as indicated by the broken lines in FIG. 11.
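A syllable-level longest match of this kind can be illustrated with Python's standard `difflib`. This is a sketch of the idea rather than the matching procedure of the embodiment; the romanized syllable lists and the function name are assumptions made for the example.

```python
from difflib import SequenceMatcher


def longest_common_run(synth, recorded):
    """Return (start_in_synth, start_in_recorded, length) of the longest
    contiguous run of identical syllables."""
    matcher = SequenceMatcher(None, synth, recorded, autojunk=False)
    m = matcher.find_longest_match(0, len(synth), 0, len(recorded))
    return m.a, m.b, m.size


synth = ["shi", "bu", "ya", "mi", "na", "mi", "ba", "i"]
recorded = ["na", "ka", "no", "mi", "na", "mi", "ba", "i"]
a, b, size = longest_common_run(synth, recorded)
prosody_section = "".join(synth[a:a + size])
```

For the two sections in the text, the longest shared run is the five syllables "mi na mi ba i", which corresponds to the real voice prosody section "minamibai".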

Through the above processing, the real voice prosody section determination unit 60 outputs "minamibai" as the real voice prosody section, that is, the part of the synthesized speech section "shibuyaminamibai" that uses the prosody of the recorded real-voice data, as shown in FIG. 12. In other words, in addition to the real voice sections and the synthesized speech section, the real voice prosody section determination unit 60 adds information on the real voice prosody section, a section that uses synthesized speech but takes only its prosody from the recorded real-voice data.

Next, the real voice prosody extraction unit 70 extracts prosody information for the section of the recorded real voice that corresponds to the synthesized speech section output from the real voice prosody section determination unit 60. Prosody information here means information describing the fundamental frequency of the speech, the durations of phonemes and syllables, and the change in speech power over time. This prosody extraction can be realized, for example, by automatic segmentation using speech recognition technology to determine which phonemes or syllables make up the input speech and where they are located. The fundamental frequency and speech power can be obtained with the general F0 (fundamental frequency) extraction and power calculation processes used in speech signal processing. Alternatively, the prosody information can be extracted in advance for the whole of the recorded real-voice data, with the real voice prosody extraction unit 70 simply picking out the part corresponding to the synthesized speech section. FIG. 12 shows an example of the information output from the real voice prosody extraction unit 70 in the case of the first embodiment. Here the prosody information (start and end points of the fundamental frequency, and duration) of the section "nakanominamibai" of the recorded real-voice data, which corresponds to the synthesized speech section "shibuyaminamibai", is extracted for each syllable.

Next, the hybrid prosody generation unit 80 generates prosody information for the synthesized speech section from the prosody information for the partial section of the recorded real-voice data output from the real voice prosody extraction unit 70. Of the prosody information extracted by the real voice prosody extraction unit 70, the parts where the recorded real-voice data and the synthesized speech match are used as they are. For the parts that do not match, prosody information for the corresponding part of the synthesized speech is generated either ignoring the extracted information or using it as a reference.

This is described concretely for the case of the first embodiment. When the prosody information for the recorded real-voice data section "nakanominamibai" shown in FIG. 12 is input from the real voice prosody extraction unit 70, the hybrid prosody generation unit 80 determines which part of the corresponding synthesized speech section "shibuyaminamibai" can use the extracted prosody information. This determination can also be realized with the character string matching used in the processes described above. In this example the syllables of "minamibai" match, so the prosody information extracted from the recorded real-voice data can be used for that part. For the section "shibuya", whose syllables do not match, prosody information may be newly generated using the prosody generation processing included in the rule synthesis unit, or it may be derived from the prosody information of "nakano" by some prosody conversion (for example, uniformly stretching, compressing, or shifting the fundamental frequency and phoneme durations so that they are continuous with the neighboring portions).
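The selection between extracted and generated prosody can be sketched as follows. The sketch assumes the two sections have the same number of syllables and are compared position by position, as in the example above. The prosody values, the dictionary keys, and the `flat_prosody` placeholder are invented for illustration and do not reproduce FIG. 12 or FIG. 13.

```python
def hybrid_prosody(synth, recorded, recorded_prosody, generate):
    """Build per-syllable prosody for the synthesized section.

    Positions whose syllable equals the recorded syllable at the same
    position reuse the extracted prosody; the rest call `generate`.
    """
    result = []
    for syl, rec_syl, pros in zip(synth, recorded, recorded_prosody):
        if syl == rec_syl:
            result.append((syl, pros, "recorded"))
        else:
            result.append((syl, generate(syl), "generated"))
    return result


def flat_prosody(syllable):
    # Placeholder for the rule synthesizer's own prosody generation.
    return {"f0_start": 180.0, "f0_end": 180.0, "duration_ms": 120}


synth = ["shi", "bu", "ya", "mi", "na", "mi", "ba", "i"]
recorded = ["na", "ka", "no", "mi", "na", "mi", "ba", "i"]
# Invented per-syllable values standing in for extracted prosody.
recorded_prosody = [
    {"f0_start": 200.0 - 5 * i, "f0_end": 195.0 - 5 * i, "duration_ms": 110}
    for i in range(len(recorded))
]
plan = hybrid_prosody(synth, recorded, recorded_prosody, flat_prosody)
sources = [source for _, _, source in plan]
```

Here "shi", "bu", and "ya" receive generated prosody, while the five syllables of "minamibai" carry over the values taken from the recording. A conversion of the "nakano" prosody, the other option mentioned above, could be substituted for `flat_prosody`.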

FIG. 13 shows an example of a syllable string with prosody information in which the hybrid prosody generation unit 80 has generated prosody information for "shibuya" by prosody generation processing. Such prosody generation processing is described in, for example, "Digital Speech Processing" (Sadaoki Furui, Tokai University Press) and "Progress in Speech Synthesis" (van Santen et al., Springer).

Next, the rule synthesis unit 100 takes the syllable string with prosody information (FIG. 13) output from the hybrid prosody generation unit 80 and performs speech synthesis so that the prosody specified in that string is realized. In doing so it converts the string to synthesized speech with reference to the speech segment database 110 and the prosody model 120, which are the building blocks of synthesized speech. This rule synthesis processing is also widely described in the literature cited above, so its description is omitted. In the case of the first embodiment, the result is the synthesized speech component "shibuyaminamibai" realizing the prosody of FIG. 13, generated as shown in FIG. 14.

Finally, the connection synthesis unit 90 joins the recorded real-voice data components output from the recorded voice selection unit 40 and the connection boundary calculation unit 50 with the synthesized speech components output from the rule synthesis unit 100, and outputs the result as hybrid synthesized speech 130. This connection synthesis can be realized by simple concatenation of the speech, or the joins can be made smoother by applying waveform superposition signal processing such as TD-PSOLA (Time Domain Pitch Synchronous Overlap Add) at the junctions.
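As a rough illustration of smoothing a junction, the sketch below joins two sample sequences with a plain linear cross-fade. This is not TD-PSOLA, which aligns the overlap to pitch periods; it is a much simpler overlap-add stand-in, and the function name and parameters are assumptions made for the example.

```python
def crossfade_join(left, right, overlap):
    """Join two sample lists, linearly cross-fading `overlap` samples."""
    overlap = min(overlap, len(left), len(right))
    if overlap <= 0:
        return list(left) + list(right)  # simple concatenation
    head = list(left[:-overlap])
    tail = list(right[overlap:])
    mixed = []
    for i in range(overlap):
        w = (i + 1) / (overlap + 1)  # weight of the right-hand signal
        mixed.append((1 - w) * left[len(left) - overlap + i] + w * right[i])
    return head + mixed + tail


joined = crossfade_join([1.0] * 4, [0.0] * 4, 2)
```

With `overlap=0` the function reduces to the simple concatenation mentioned in the text; a larger overlap trades a little duration for a gentler transition.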

In the case of the first embodiment, the recorded real-voice data components "mamonaku" and "pasunosakiousetsudesu" output from the connection boundary calculation unit 50 are joined with the synthesized speech component "shibuyaminamibai" output from the rule synthesis unit 100, and hybrid synthesized speech corresponding to "mamonaku", "shibuyaminamibai", "pasunosakiousetsudesu" is output.

In the hybrid synthesized speech output here, the sections "mamonaku" and "pasunosakiousetsudesu" are entirely recorded real-voice data. The section "shibuyaminamibai" is synthesized speech, but its "minamibai" part reproduces the prosody of the recorded real-voice data as is, so the synthesized speech is prosodically natural and its prosody connects continuously to the following "pasunosakiousetsudesu". Thus, according to the present invention, a real voice section that partially matches the synthesized speech section produced by the variable position hybrid synthesis method described above is detected, and the prosody (intonation and rhythm) information of that real voice section is given to the synthesized speech. Real voice and synthesized speech are thereby joined more naturally than in the prior art, which makes it possible to provide a hybrid speech synthesizer that can create voice messages with both a strong real-voice quality and high naturalness.

In the above, the real voice prosody section determination unit 60 was described as using the prosody of the recorded real-voice data on the basis of the longest match, in syllable units, between the pronunciation text 21 converted from the input text 10 and the recorded voice pronunciation text 230. The comparison between the pronunciation text 21 and the recorded voice pronunciation text 230 may instead use the longest match in phoneme units.

<Embodiment 2>
Next, an embodiment in which the present invention is specialized for a car navigation system is described.

FIG. 17 is a configuration diagram of the present invention implemented as a car navigation system. The input text 10 of FIG. 2 in the first embodiment is, in FIG. 17, passed from the car navigation device (in particular its utterance content determination unit) 310. The hybrid synthesized speech 130, which in FIG. 2 was output from the connection synthesis unit 90, is in FIG. 17 of the second embodiment output directly from an audio playback device 320 such as a speaker (including an amplifier). The rest of the configuration is the same as FIG. 2 of the first embodiment, and the processing in each unit and the flow of that processing are basically the same as described for the first embodiment.

The processing flow in the speech synthesis unit 8 of the present invention is therefore described here with reference to the flowchart of FIG. 18.

When the hybrid synthesis processing shown in FIG. 18 starts, it waits until the text to be read aloud (uttered), as decided by the utterance content determination unit of the car navigation device, is input. When text is input, the input text 10 is passed to input text analysis processing 410 and converted to the internal representation used for speech synthesis. Details of this processing are as described for the input text analysis unit 20 of the first embodiment.

The internal representation data is then passed to recorded voice selection processing 420, which selects, from the recorded real-voice data (audio files 220) stored in the recorded voice storage unit 30, the recorded real-voice data (voice information) whose recorded voice pronunciation text 230 best matches the input text 10 (pronunciation text 21). Details of this selection are as described for the recorded voice selection unit 40 of the first embodiment.

If recorded voice selection processing 420 cannot select a suitable match pattern, the input text 10 and the internal representation data are passed to rule synthesis processing 430, and the whole input text is converted to synthesized speech and output. That is, when there is no suitable match pattern, the entire text to be read aloud is output as synthesized speech.

If, on the other hand, recorded voice selection processing 420 selects a suitable match pattern from the recorded voice pronunciation texts 230, the matching recorded voice information (one row of the data shown in FIG. 4) is passed to connection boundary calculation processing. Details of this processing are as described for the connection boundary calculation unit 50 of the first embodiment.

Next, real voice prosody section determination processing 440 is started. For every connection boundary determined by the connection boundary calculation processing (boundaries within the selected recorded real-voice data), this processing repeatedly decides whether the section is a real voice prosody section, which uses real voice prosody, or a synthesized speech section, which uses synthesized speech. Details of this processing are as described for the real voice prosody section determination unit 60 of the first embodiment.

Next, real voice prosody extraction processing 450 is started. This processing repeats prosody extraction for every section judged to be a real voice prosody section by real voice prosody section determination processing 440. Details of this processing are as described for the real voice prosody extraction unit 70 of the first embodiment.

Next, hybrid prosody generation processing 460 is started. This processing repeats prosody information generation for every section judged to be a synthesized speech section by real voice prosody section determination processing 440, and for every phoneme within each such section. Details of this processing are as described for the hybrid prosody generation unit 80 of the first embodiment.

Next, rule synthesis processing 470 is started. This processing converts every synthesized speech section to synthesized speech in accordance with the prosody information generated by the hybrid prosody generation processing. Details of this processing are as described for the rule synthesis unit 100 of the first embodiment.

Next, real voice section cutout processing 480 is started. This processing divides the recorded real-voice data (audio file 220) that was selected from the voice storage unit as a good match for the input text, and cuts out and outputs only the portions of the recorded real-voice data that correspond to the real voice prosody sections determined by real voice prosody section determination processing 440.

Finally, connection synthesis processing 490 is started. This processing takes the synthesized speech data for the synthesized speech sections and the recorded real-voice data for the real voice prosody sections, output from rule synthesis processing 470 and real voice section cutout processing 480 respectively, and joins them one after another in section order for output. What this processing finally outputs is therefore hybrid synthesized speech data corresponding to the input text (partly synthesized speech and partly recorded real-voice data).
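The overall control flow of FIG. 18, including the fallback to full synthesis, can be summarized in the sketch below. It is deliberately schematic: the step functions are injected as parameters with hypothetical names, the prosody steps (440 to 460) are assumed to be folded into `synthesize`, and audio is represented by placeholder strings.

```python
def hybrid_synthesize(text, select, split_sections, synthesize, cut_out):
    """Sketch of the FIG. 18 control flow with injected step functions."""
    recording = select(text)
    if recording is None:
        # No suitable match pattern: synthesize the whole text.
        return [synthesize(text)]
    pieces = []
    for kind, span in split_sections(text, recording):
        if kind == "synth":
            pieces.append(synthesize(span))
        else:
            pieces.append(cut_out(recording, span))
    return pieces  # pieces are in utterance order, ready to be joined


# Stub steps standing in for the real processing units.
steps = dict(
    split_sections=lambda t, r: [("real", "mamonaku"), ("synth", "shibuya")],
    synthesize=lambda s: f"tts({s})",
    cut_out=lambda r, s: f"{r}[{s}]",
)
matched = hybrid_synthesize("mamonaku shibuya", select=lambda t: "ID2", **steps)
fallback = hybrid_synthesize("mamonaku shibuya", select=lambda t: None, **steps)
```

When a recording is selected, the output alternates between cut-out recorded pieces and synthesized pieces in utterance order; when none is selected, the single output piece is the fully synthesized text.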

<Embodiment 3>
Next, an embodiment in which the present invention is implemented as a hybrid synthesized speech editing tool with a user interface is described with reference to FIG. 19.

FIG. 19 shows a configuration in which a text input unit 510, a user input unit 520, and an information display unit 530 are added to the basic configuration of the present invention shown in FIGS. 1 and 2.

The text input unit 510 is an input device for entering the text to be read aloud (uttered) into the speech synthesizer 1 of the present invention; a conventional user interface device such as a keyboard can be used.

When text is entered through the text input unit 510, the processing described in the first or second embodiment is executed and hybrid synthesized speech 130 is output.

In the third embodiment, however, the results produced by each processing unit from the recorded voice selection unit 40 through the hybrid prosody generation unit 80 are additionally passed to the information display unit 530 (dotted arrows) so that they can be presented to the user. Likewise, information specified by the user through the user input unit 520 is passed to each processing unit from the recorded voice selection unit 40 through the hybrid prosody generation unit 80, which makes it possible to change the information each unit outputs to specific content.

The information display unit 530 is a device for presenting various information to the user; a graphical display device such as a display can be used. For example, the display device 4 shown in FIG. 1 of the first embodiment may serve as the information display unit 530. FIG. 20 shows an example of the information displayed on the information display unit 530.

FIG. 20 displays the analysis result information (pronunciation text 21) obtained by passing the text entered in the input text field 531 at the top to the input text analysis unit 20, together with the text of the recorded voice ID2 (recorded voice pronunciation text 230) that the recorded voice selection unit 40 automatically judged to match and selected. The degree of phoneme match on which the match judgment is based is indicated by the number of connecting lines. In this way it is possible to show which recorded voice the recorded voice selection unit 40 selected and on what match judgment.

FIG. 20 also displays in italic characters the sections for which, as a result of the connection boundary calculation unit 50 and the real voice prosody section determination unit 60, recorded real-voice data is to be used. This shows graphically which sections of the input text become synthesized speech and which become recorded real-voice data. Many other display methods are conceivable, such as distinguishing the sections by color or enclosing them in rectangles or rounded rectangles.

The central part of FIG. 20 shows the prosody information for the real voice sections extracted by the real voice prosody extraction unit 70 and the hybrid prosody information 532 for the synthesized speech sections generated by the hybrid prosody generation unit 80. Displaying these as a graph of frequency (F0) against time gives an intuitive picture of how the output hybrid synthesized speech will sound.

The user input unit 520, on the other hand, is a user interface device. Through a mouse, keyboard, or similar device, it lets the user enter or specify information (for example, the ID of the recorded voice that the recorded voice selection unit should select) and passes that information to the appropriate processing unit (for example, a recorded voice ID to the recorded voice selection unit 40, and connection boundary information to the connection boundary calculation unit 50). The information the user can specify includes a recorded voice ID to be output in place of the one selected by the recorded voice selection unit 40, and a division into real voice and synthesized speech sections to be output in place of the one determined by the connection boundary calculation unit 50 and the real voice prosody section determination unit 60. For example, clicking the recorded voice text at the bottom of the screen with the mouse displays a menu of alternative recorded voice texts, from which the user specifies the one that should actually be selected.

Similarly, the user can click or drag with the mouse on the real voice sections (shown in italics) and the synthesized speech sections (shown in normal type) in the analysis result of the information display unit 530 in FIG. 20 to specify which parts are to be synthesized speech and which are to be real voice.

Furthermore, in the prosody information graph displayed above it, the user can directly specify the prosody information to be generated by dragging, with the mouse or a similar device, the prosody information that the hybrid prosody generation unit 80 output for the synthesized speech section (the curve drawn as a dotted line in the graph).

The information specified directly by the user in this way is passed to the corresponding processing unit and is output in place of the result that the processing unit calculated automatically. With this configuration, the user can directly specify the content of the hybrid synthesized speech through the information display unit 530 and the user input unit 520.
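The override behavior described here, where a user-specified value replaces the automatically calculated one, can be sketched as follows. This is only an illustration: the `UserOverrides` container, its field names, and the `resolve` helper are hypothetical and do not come from the patent.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class UserOverrides:
    """Hypothetical container for values entered through the user input unit."""
    recording_id: Optional[int] = None                         # replaces the selected recording
    real_voice_spans: Optional[List[Tuple[int, int]]] = None   # replaces the computed sections

def resolve(auto_value, user_value):
    """Return the user's value when one was given, otherwise the automatic result."""
    return auto_value if user_value is None else user_value

# The selection unit picked ID 2 automatically, but the user chose ID 5.
overrides = UserOverrides(recording_id=5)
chosen_id = resolve(2, overrides.recording_id)
# No user input for the sections, so the automatic result is kept.
chosen_spans = resolve([(0, 4)], overrides.real_voice_spans)
```

Using `None` to mean "no user input" keeps each processing unit's automatic result as the default, so the user only needs to specify the parts they want to change.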

In each of the above embodiments, the real voice prosody section determination unit 60 was described as using the prosody of the recorded real voice data based on the longest match, in syllable units, between the pronunciation text 21 converted from the input text 10 and the recorded voice pronunciation text 230. However, the section that uses the prosody of the recorded real voice data may instead be determined based on a match of the linguistic information associated with each phoneme or syllable. As this linguistic information, the accent nucleus (the position at which the accent falls) contained in the recorded voice pronunciation text 230 may be used as position information to determine the section that uses the prosody of the recorded real voice data. Linguistic information including the accent nucleus can be obtained as intermediate information of the analysis processing in the input text analysis unit 20. For the voice data stored in the recorded voice storage unit 30 (FIG. 4), the linguistic information obtained by applying text analysis in advance can be added and stored, which makes it possible to determine the real voice prosody section by judging matches on linguistic information such as the accent nucleus.
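The difference between matching on pronunciation alone and matching that also requires agreement of the accent nucleus can be sketched as below. This is a simplified illustration under stated assumptions, not the patented procedure: each syllable is modeled as a hypothetical `(pronunciation, carries_accent_nucleus)` pair, and the longest contiguous run is found with a standard longest-common-substring dynamic program.

```python
from typing import List, Tuple

Syllable = Tuple[str, bool]  # (pronunciation, carries_accent_nucleus), an assumed layout

def longest_match(target: List[Syllable], recorded: List[Syllable],
                  use_accent: bool = False) -> Tuple[int, int, int]:
    """Return (target_start, recorded_start, length) of the longest contiguous
    run of matching syllables. With use_accent=True, two syllables match only
    if their accent-nucleus flags agree as well."""
    def same(a: Syllable, b: Syllable) -> bool:
        return a == b if use_accent else a[0] == b[0]

    best = (0, 0, 0)
    # prev[j] holds the length of the matching run ending at
    # target[i-2] and recorded[j-1] from the previous row.
    prev = [0] * (len(recorded) + 1)
    for i in range(1, len(target) + 1):
        cur = [0] * (len(recorded) + 1)
        for j in range(1, len(recorded) + 1):
            if same(target[i - 1], recorded[j - 1]):
                cur[j] = prev[j - 1] + 1
                if cur[j] > best[2]:
                    best = (i - cur[j], j - cur[j], cur[j])
        prev = cur
    return best

# Illustrative data: same pronunciation, but the accent nucleus differs on "chi".
target = [("ma", False), ("chi", True), ("da", False)]
recorded = [("ma", False), ("chi", False), ("da", False)]
by_sound = longest_match(target, recorded)
by_sound_and_accent = longest_match(target, recorded, use_accent=True)
```

In this example the pronunciation-only match covers all three syllables, while the accent-aware match stops at the first syllable, so less of the recorded prosody would be reused where the accent pattern differs.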

As described above, the present invention detects a real voice section that partially matches the synthesized speech section produced when the variable-position hybrid synthesis method is applied, and gives the prosody (intonation and rhythm) information of that real voice section to the synthesized speech. Because real voice and synthesized speech are joined more naturally than in the prior art, it is possible to provide a hybrid speech synthesizer that can create voice messages with both a strong real-voice quality and high naturalness. The invention is particularly applicable to car navigation devices that give guidance by voice and to other devices that provide voice guidance.

FIG. 1 is a block diagram showing the system configuration of the speech synthesizer according to the first embodiment of the present invention.
FIG. 2 is a block diagram showing the flow of processing in the speech synthesis unit according to the first embodiment.
FIG. 3 is an explanatory diagram showing an example of input text.
FIG. 4 is an explanatory diagram showing an example of the data stored in the recorded voice storage unit.
FIG. 5 is an explanatory diagram showing an example of pronunciation text converted from input text.
FIG. 6 is an explanatory diagram showing an example of a matching pattern.
FIG. 7 is a table showing, for the recorded voice pronunciation texts, the number of syllables whose pronunciation matches the pronunciation text 21.
FIG. 8 is an explanatory diagram showing the number of matching syllables between the pronunciation text and the recorded voice pronunciation texts; (a) shows the comparison between the recorded voice pronunciation text with ID = 2 and the pronunciation text, and (b) shows the comparison between another recorded voice pronunciation text and the pronunciation text.
FIG. 9 is an explanatory diagram showing the number of matching syllables between the pronunciation text and the recorded voice pronunciation text, illustrating the process of extending the synthesized speech portion to a silent section or a point of low speech power.
FIG. 10 is an explanatory diagram showing an example of the recorded voice, real voice sections, and synthesized speech sections output by the connection boundary calculation unit.
FIG. 11 is an explanatory diagram showing how the sections within the synthesized speech section whose phonemes (syllables) match the original recorded real voice are determined.
FIG. 12 is an explanatory diagram showing an example of the recorded voice, real voice sections, synthesized speech sections, and real voice prosody output by the real voice prosody section determination unit.
FIG. 13 is a table showing the analysis results for syllables, fundamental frequency (Hz), and duration (msec).
FIG. 14 is an explanatory diagram showing the output of the hybrid prosody generation unit.
FIG. 15 is an explanatory diagram showing the number of matching syllables between the pronunciation text and the recorded voice pronunciation texts; (a) shows the comparison between the recorded voice pronunciation text with ID = 2 and the pronunciation text, and (b) shows the comparison between another recorded voice pronunciation text and the pronunciation text.
FIG. 16 is a table showing the relationship between the number of mismatched characters and the mismatch cost.
FIG. 17 is a block diagram showing the flow of processing in the speech synthesis unit according to the second embodiment of the present invention.
FIG. 18 is a flowchart showing an example of the processing in the speech synthesis unit according to the second embodiment.
FIG. 19 is a block diagram showing the flow of processing in the speech synthesis unit according to the third embodiment of the present invention.
FIG. 20 is an explanatory diagram showing an example of the information displayed on the information display unit according to the third embodiment.

Explanation of symbols

20 Input text analysis unit
30 Recorded voice storage unit
40 Recorded voice selection unit
50 Connection boundary calculation unit
60 Real voice prosody section determination unit
70 Real voice prosody extraction unit
80 Hybrid prosody generation unit
90 Connection synthesis unit
100 Rule synthesis unit
110 Speech segment database
120 Prosody model
130 Hybrid synthesized speech
200 Recorded voice ID
210 Recorded voice text
220 Recorded voice file
230 Recorded voice pronunciation text

Claims

An input text analysis unit that accepts text to be converted into speech and converts it into pronunciation text;
A recorded voice storage unit that stores in advance real voice data in which a preset sentence is recorded in real voice, together with that sentence;
A recorded voice selection unit that compares the pronunciation text with the sentence stored in the recorded voice storage unit, and selects the voice data and sentence used for speech synthesis from the recorded voice storage unit;
A connection boundary calculation unit for determining a boundary between a synthesized speech section for generating speech by speech synthesis from the pronunciation text and the selected sentence and a real voice section for generating speech from the real voice data;
Based on the determined synthesized speech section, a rule synthesizer that generates speech synthesis data based on a preset speech segment and a prosodic model;
In a hybrid speech synthesizer comprising a connection synthesizer that generates a synthesized speech sentence corresponding to a text inputted by connecting the real voice data corresponding to the real voice interval and the generated speech synthesis data,
A real voice prosody section determination unit that determines a real voice prosody section that uses the prosody of the real voice data in a synthetic voice section that uses the synthesized speech determined by the connection boundary calculation unit;
A real voice prosody extraction unit that extracts the prosody of the section determined by the real voice prosody section determination unit from the selected real voice data;
A hybrid prosody generation unit that generates prosody information for the entire synthesized speech section from the prosody of the extracted real voice and the prosody model,
The recorded voice selection unit
Compare the sentence and the pronunciation text in the recording voice storage unit, and output the voice data and sentence with the largest number of syllables whose pronunciation coincides with the pronunciation text in the sentence,
The rule synthesizer
For the synthesized speech section, generates speech synthesis data from the prosody information and the speech segments,
The connection synthesizer
Connecting the real voice section, the real voice prosody section, and the synthesized voice section in which the prosody information is generated by the hybrid prosody generation unit;
The real voice prosody section determining unit
A hybrid speech synthesizer characterized in that, when comparing the pronunciation text with the selected sentence, the location at which to use the prosody of the real voice data is determined based on the longest match in syllable units.

An input text analysis unit that accepts text to be converted into speech and converts it into pronunciation text;
A recorded voice storage unit that stores in advance real voice data in which a preset sentence is recorded in real voice, together with that sentence;
A recorded voice selection unit that compares the pronunciation text with the sentence stored in the recorded voice storage unit, and selects the voice data and sentence used for speech synthesis from the recorded voice storage unit;
A connection boundary calculation unit for determining a boundary between a synthesized speech section for generating speech by speech synthesis from the pronunciation text and the selected sentence and a real voice section for generating speech from the real voice data;
Based on the determined synthesized speech section, a rule synthesizer that generates speech synthesis data based on a preset speech segment and a prosodic model;
In a hybrid speech synthesizer comprising a connection synthesizer that generates a synthesized speech sentence corresponding to a text inputted by connecting the real voice data corresponding to the real voice interval and the generated speech synthesis data,
A real voice prosody section determination unit that determines a real voice prosody section that uses the prosody of the real voice data in a synthetic voice section that uses the synthesized speech determined by the connection boundary calculation unit;
A real voice prosody extraction unit that extracts the prosody of the section determined by the real voice prosody section determination unit from the selected real voice data;
A hybrid prosody generation unit that generates prosody information for the entire synthesized speech section from the prosody of the extracted real voice and the prosody model,
The recorded voice selection unit
Compare the sentence and the pronunciation text in the recording voice storage unit, and output the voice data and sentence with the largest number of syllables whose pronunciation coincides with the pronunciation text in the sentence,
The rule synthesizer
For the synthesized speech section, generates speech synthesis data from the prosody information and the speech segments,
The connection synthesizer
Connecting the real voice section, the real voice prosody section, and the synthesized voice section in which the prosody information is generated by the hybrid prosody generation unit;
The real voice prosody section determining unit
A hybrid speech synthesizer characterized in that, when comparing the pronunciation text with the selected sentence, the location at which to use the prosody of the real voice data is determined based on the longest match in phoneme units.

The hybrid speech synthesizer according to claim 1 or 2, wherein the real voice prosody section determination unit determines the section that uses the prosody of the real voice data based on a match of linguistic information associated with the syllable or phoneme.

4. The hybrid speech synthesizer according to claim 3, wherein the linguistic information is position information of an accent nucleus.

The hybrid speech synthesizer according to any one of claims 1 to 4, wherein the connection boundary calculation unit receives information on the boundary between the synthesized speech section and the real voice section and determines the boundary based on that information.

The hybrid speech synthesizer according to any one of claims 1 to 4, wherein the connection boundary calculation unit receives information on the boundary between the synthesized speech section and the real voice section and determines the boundary based on that information, and
the real voice prosody section determination unit receives information on a real voice prosody section that uses the prosody of the real voice data within the synthesized speech section and determines the real voice prosody section based on that information.

The hybrid speech synthesizer according to claim 1, wherein the hybrid prosody generation unit generates prosody information for the entire synthesized speech section by setting, within the synthesized speech section, the speech segments and the prosody of the extracted real voice for the real voice prosody section, and the prosody of the prosody model for the synthesized speech section excluding the real voice prosody section.

Input text analysis processing that accepts text to be converted to speech and converts it to pronunciation text;
  A recorded voice selection process that compares the pronunciation text with the sentences in a recorded voice storage unit, which stores in advance real voice data in which preset sentences are recorded in real voice together with those sentences, and selects from the recorded voice storage unit the real voice data and sentence to be used for speech synthesis;
  A connection boundary calculation process for determining a boundary between a synthesized speech section for generating speech by speech synthesis from the pronunciation text and the selected sentence and a real voice section for generating speech from the real voice data;
  Based on the determined synthesized speech section, a rule synthesis process for generating speech synthesis data using a predetermined speech segment and a prosodic model;
  In a hybrid speech synthesis method in which a computer generates a synthesized speech sentence by executing a connection synthesis process that generates the synthesized speech sentence corresponding to the input text by connecting the real voice data corresponding to the real voice section and the generated speech synthesis data,
  A real voice prosody section determination process for determining a real voice prosody section that uses the prosody of the real voice data in a synthetic voice section using the synthesized speech determined in the connection boundary calculation process;
  A real voice prosody extraction process for extracting the prosody of the section determined by the real voice prosody section determination process from the selected real voice data;
  Including a hybrid prosody generation process for generating prosody information for the entire synthesized speech section from the prosody of the extracted real voice and the prosody model,
  The recorded voice selection process is:
  Compare the sentence and the pronunciation text in the recording voice storage unit, and output the voice data and sentence with the largest number of syllables whose pronunciation coincides with the pronunciation text in the sentence,
  The rule synthesis process is:
  For the synthesized speech section, generating speech synthesis data from the speech segments and the prosody information,
  The connection synthesis process is:
  Connecting the real voice section, the real voice prosody section, and the synthesized speech section for which prosody information was generated in the hybrid prosody generation process,
  The real voice prosody section determination process includes:
  A hybrid speech synthesizing method characterized in that, when comparing the pronunciation text and the selected sentence, a portion using the prosody of the real voice data is determined based on the longest match in syllable units.

Input text analysis processing that accepts text to be converted to speech and converts it to pronunciation text;
A recorded voice selection process that compares the pronunciation text with the sentences in a recorded voice storage unit, which stores in advance real voice data in which preset sentences are recorded in real voice together with those sentences, and selects from the recorded voice storage unit the real voice data and sentence to be used for speech synthesis;
A connection boundary calculation process for determining a boundary between a synthesized speech section for generating speech by speech synthesis from the pronunciation text and the selected sentence and a real voice section for generating speech from the real voice data;
Based on the determined synthesized speech section, a rule synthesis process for generating speech synthesis data using a predetermined speech segment and a prosodic model;
In a hybrid speech synthesis method in which a computer generates a synthesized speech sentence by executing a connection synthesis process that generates the synthesized speech sentence corresponding to the input text by connecting the real voice data corresponding to the real voice section and the generated speech synthesis data,
A real voice prosody section determination process for determining a real voice prosody section that uses the prosody of the real voice data in a synthetic voice section using the synthesized speech determined in the connection boundary calculation process;
A real voice prosody extraction process for extracting the prosody of the section determined by the real voice prosody section determination process from the selected real voice data;
Including a hybrid prosody generation process for generating prosody information for the entire synthesized speech section from the prosody of the extracted real voice and the prosody model,
The recorded voice selection process is:
Compare the sentence and the pronunciation text in the recording voice storage unit, and output the voice data and sentence with the largest number of syllables whose pronunciation coincides with the pronunciation text in the sentence,
The rule synthesis process is:
For the synthesized speech section, generating speech synthesis data from the prosody information and the speech segments,
The connection synthesis process is:
Connecting the real voice section, the real voice prosody section, and the synthesized speech section for which prosody information was generated in the hybrid prosody generation process,
The real voice prosody section determination process includes:
A hybrid speech synthesis method characterized in that, when comparing the pronunciation text with the selected sentence, the location at which to use the prosody of the real voice data is determined based on the longest match in phoneme units.

The hybrid speech synthesis method according to claim 8 or 9, wherein the real voice prosody section determination process determines the section that uses the prosody of the real voice data based on a match of linguistic information associated with the syllable or phoneme.

The hybrid speech synthesis method according to claim 10, wherein the linguistic information is position information of an accent nucleus.

The hybrid speech synthesis method according to any one of claims 8 to 11, wherein the connection boundary calculation process receives information on the boundary between the synthesized speech section and the real voice section and determines the boundary based on that information.

The connection boundary calculation process includes:
Receiving information on the boundary between the synthesized speech section and the real voice section, and determining the boundary based on that information;
The real voice prosody section determination process includes:
Receiving information on a real voice prosody section that uses the prosody of the real voice data within the synthesized speech section, and determining the real voice prosody section based on that information, in the hybrid speech synthesis method as described.

The hybrid speech synthesis method according to claim 8 or claim 9, wherein the hybrid prosody generation process generates prosody information for the entire synthesized speech section by setting, within the synthesized speech section, the speech segments and the prosody of the extracted real voice for the real voice prosody section, and the prosody of the prosody model for the synthesized speech section excluding the real voice prosody section.

A program that converts received text into synthesized speech,
  Input text analysis processing that accepts text to be converted to speech and converts it to pronunciation text;
  A recorded voice selection process that compares the pronunciation text with the sentences in a recorded voice storage unit, which stores in advance real voice data in which preset sentences are recorded in real voice together with those sentences, and selects from the recorded voice storage unit the real voice data and sentence to be used for speech synthesis;
  A connection boundary calculation process for determining a boundary between a synthesized speech section for generating speech by speech synthesis from the pronunciation text and the selected sentence and a real voice section for generating speech from the real voice data;
  A real voice prosody section determination process for determining a real voice prosody section that uses the prosody of the real voice data in a synthetic voice section using the synthesized speech determined in the connection boundary calculation process;
  A real voice prosody extraction process for extracting the prosody of the section determined by the real voice prosody section determination process from the selected real voice data;
  A hybrid prosody generation process for generating prosody information for the entire synthesized speech section from the prosody of the extracted real voice and the prosody model;
  Based on the determined synthesized speech section, a rule synthesis process for generating speech synthesis data from a predetermined speech segment and the prosodic information;
  A connection synthesis process for generating a synthesized speech sentence corresponding to the input text by connecting the real voice section, the real voice prosody section, and the synthesized speech section for which prosody information was generated by the hybrid prosody generation process; the program causing a computer to execute these processes,
  The recorded voice selection process is:
  Compare the sentence and the pronunciation text in the recording voice storage unit, and output the voice data and sentence with the largest number of syllables whose pronunciation coincides with the pronunciation text in the sentence,
  The real voice prosody section determination process includes:
  A program characterized in that, when comparing the pronunciation text with the selected sentence, the location at which to use the prosody of the real voice data is determined based on the longest match in syllable units.

A program that converts received text into synthesized speech,
Input text analysis processing that accepts text to be converted to speech and converts it to pronunciation text;
A recorded voice selection process that compares the pronunciation text with the sentences in a recorded voice storage unit, which stores in advance real voice data in which preset sentences are recorded in real voice together with those sentences, and selects from the recorded voice storage unit the real voice data and sentence to be used for speech synthesis;
A connection boundary calculation process for determining a boundary between a synthesized speech section for generating speech by speech synthesis from the pronunciation text and the selected sentence and a real voice section for generating speech from the real voice data;
A real voice prosody section determination process for determining a real voice prosody section that uses the prosody of the real voice data in a synthetic voice section using the synthesized speech determined in the connection boundary calculation process;
A real voice prosody extraction process for extracting the prosody of the section determined by the real voice prosody section determination process from the selected real voice data;
A hybrid prosody generation process for generating prosody information for the entire synthesized speech section from the prosody of the extracted real voice and the prosody model;
Based on the determined synthesized speech section, a rule synthesis process for generating speech synthesis data from a predetermined speech segment and the prosodic information;
A connection synthesis process for generating a synthesized speech sentence corresponding to the input text by connecting the real voice section, the real voice prosody section, and the synthesized speech section for which prosody information was generated by the hybrid prosody generation process; the program causing a computer to execute these processes,
The recorded voice selection process is:
Compare the sentence and the pronunciation text in the recording voice storage unit, and output the voice data and sentence with the largest number of syllables whose pronunciation coincides with the pronunciation text in the sentence,
The real voice prosody section determination process includes:
A program characterized in that, when comparing the pronunciation text with the selected sentence, the location at which to use the prosody of the real voice data is determined based on the longest match in phoneme units.