JP3862300B2

JP3862300B2 - Information processing method and apparatus for use in speech synthesis

Info

Publication number: JP3862300B2
Application number: JP02028095A
Authority: JP
Inventors: 泰夫奥谷; 俊明深田; 恭則大洞
Original assignee: Canon Inc
Current assignee: Canon Inc
Priority date: 1995-02-08
Filing date: 1995-02-08
Publication date: 2006-12-27
Anticipated expiration: 2021-12-27
Also published as: JPH08211896A

Description

[0001]
[Industrial application fields]
The present invention relates to a speech synthesis editing method and apparatus, and more particularly to a method and apparatus for processing information used for speech synthesis in a method and apparatus for speech synthesis based on sentence information decomposed into accent phrase unit information.
[0002]
[Prior art]
In general, a speech synthesizer feeds its acoustic processing unit with phonetic text, a textual representation of acoustic information.
In order to generate the phonetic text, sentence analysis processing such as morphological analysis and phrase relation analysis is necessary.
[0003]
Using the result of this sentence analysis, speech language processing is performed to generate synthesized speech from the information obtained by linguistic analysis: generation of pause information (how much is uttered in one breath (a pause phrase) and the length of the silence placed between pause phrases), assignment of readings (handling changes in reading), accent generation (accent combination), and so on. The result is output as phonetic text. An example of phonetic text is shown in FIG. 1.
[0004]
In the figure, the symbols and numbers enclosed in “{}”, such as H85 and T300, are control signals and their values; the alphabetic characters enclosed in “[]” are the phoneme string; and “/” and “\” mark the rising and falling points of the accent.
The prosody control unit analyzes (interprets) this phonetic text and extracts from it the information needed to generate prosody information, such as accent information, pause information, and pitch frequency. Accent information is obtained from “/” and “\”, pause information as 300 (milliseconds) from T300, and pitch frequency as 85 (Hz) from H85. The prosody control unit generates a pitch pattern from this information (see FIG. 6).
As described above, the phonetic text includes acoustic information such as phoneme information, accent information, and control information.
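As a rough illustration of the extraction described above, the following Python sketch pulls control values, phoneme strings, and accent marks out of a phonetic-text string. FIG. 1 is not reproduced here, so the sample string, the function name `parse_phonetic_text`, and the assumption that every control code is one capital letter followed by digits are hypothetical; only the meanings of `H`, `T`, `/`, and `\` come from the description.

```python
import re

# Assumed token shapes (hypothetical; FIG. 1 is not available):
#   {H85}   control code plus integer value
#   [ ... ] phoneme string, with "/" (accent rise) and "\" (accent fall)
CONTROL = re.compile(r"\{([A-Z])(\d+)\}")
PHONEMES = re.compile(r"\[([^\]]*)\]")


def parse_phonetic_text(text):
    """Extract pitch, pause, and accent marks from a phonetic-text string."""
    controls = {code: int(value) for code, value in CONTROL.findall(text)}
    segments = []
    for body in PHONEMES.findall(text):
        rise = fall = None
        position = 0  # index into the phoneme string with marks removed
        for ch in body:
            if ch == "/":
                rise = position
            elif ch == "\\":
                fall = position
            else:
                position += 1
        phonemes = body.replace("/", "").replace("\\", "")
        segments.append({"phonemes": phonemes, "rise": rise, "fall": fall})
    return {
        "pitch_hz": controls.get("H"),   # H85  -> 85 (Hz)
        "pause_ms": controls.get("T"),   # T300 -> 300 (milliseconds)
        "segments": segments,
    }


# Hypothetical sample input, not taken from FIG. 1.
sample = "{H85}[kyo/owa\\]{T300}"
parsed = parse_phonetic_text(sample)
```

Note that nothing in the parsed result says which words are nouns or particles, which is the limitation the next section raises.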
[0005]
[Problems to be solved by the invention]
However, even if a user edits the phonetic text to create synthesized speech to his or her liking, almost no linguistic information is written in the phonetic text, so the information that can be manipulated (changed or corrected) is limited.
Moreover, linguistic information cannot be used for prosody control and the like, because it has already been lost at that point. If the linguistic information were instead written into the phonetic text, the phonetic text would become hard to read and understand.
[0006]
The present invention solves the above drawbacks of the prior art. Its object is to provide a speech synthesis editing method and apparatus that, instead of generating phonetic text from the sentence analysis result, provide the linguistic information obtained by analysis to the acoustic processing unit, so that speech synthesis can be controlled more finely and diverse, high-quality synthesized speech can be output.
[0007]
[Means for Solving the Problems]
To achieve the above object, the method and apparatus of the present invention for processing information used for speech synthesis have the following configuration. Namely, a method for processing information used for speech synthesis, for performing speech synthesis based on sentence information decomposed into accent phrase units, comprises: a part-of-speech list holding step of holding a list of the parts of speech of the words included in an accent phrase; a part-of-speech determination step of determining a part of speech present in the list held in the part-of-speech list holding step; a prosody control step of controlling prosody based on the part of speech determined in the part-of-speech determination step; and a generation step of generating synthesized speech parameters based on the prosody controlled in the prosody control step.
[0008]
Another invention is an apparatus for processing information used for speech synthesis, for performing speech synthesis based on sentence information decomposed into accent phrase units, comprising: part-of-speech list holding means for holding a list of the parts of speech of the words included in an accent phrase; part-of-speech determination means for determining a part of speech present in the list held by the part-of-speech list holding means; prosody control means for controlling prosody based on the part of speech determined by the part-of-speech determination means; and generation means for generating synthesized speech parameters based on the prosody controlled by the prosody control means.
[0009]
[Action]
In the above configuration, the prosody is controlled based on the accent phrase unit information, and synthesized speech parameters are generated based on the prosody controlled in the prosody control step.
[0010]
In another invention, the prosody control means controls the prosody based on the accent phrase unit information, and the generation means generates synthesized speech parameters based on the controlled prosody.
[0011]
[Embodiments]
First, one of the key points of the speech synthesizer according to an embodiment of the present invention is summarized, followed by a detailed description.
The speech synthesizer of this embodiment basically comprises: an input holding unit that accepts as input a speech language processing result having a frame structure as shown in FIG. 4; a prosody control unit that generates prosody control information such as pitch, duration, and power; a prosody information holding unit that holds the prosody information; a speech parameter generation unit that generates synthesized speech parameters by phoneme generation, parameter connection, waveform connection, or the like; a speech parameter holding unit that holds the speech parameters; and a DA converter that converts the speech parameters into synthesized speech.
[0012]
With this basic configuration, the output of sentence analysis is expressed in the frame structure described later, and fine-grained control can be performed based on the frame structure data, enabling a variety of synthesized speech output.
The speech synthesizer of this embodiment is described in detail below with reference to the drawings.
FIG. 2 is a block diagram showing the configuration of the speech synthesizer of this embodiment. In the figure, reference numeral 201 denotes an input holding unit that receives and holds frame structure data, which is the result of sentence analysis. Reference numeral 202 denotes a prosody control unit that receives the frame structure data from the input holding unit 201 and controls the prosody, including pitch, duration, and power. Reference numeral 203 denotes a prosody information holding unit that holds the prosody control information. Reference numeral 204 denotes a speech generation unit that performs processing such as phoneme generation, parameter connection, and waveform connection. Reference numeral 205 denotes a speech parameter holding unit that holds the parameters for generating synthesized speech. Reference numeral 206 denotes a DA converter that converts the speech parameters into synthesized speech.
[0013]
Next, the operation of the present apparatus will be described with reference to the flowchart shown in FIG.
First, in step S301, it is determined whether a speech language processing result expressed in the frame structure has been input. Here, the frame structure refers to a structure such as that shown in FIG. 4.
The frame structure is a set of key-value pairs. In FIG. 4, the frame number, notation, accent type, and so on are keys, and their respective values are “1”, “今日は” (“today”), and “1”. A key can also serve as a pointer, so that its value refers to the next frame (corresponding to the arrows in FIG. 4). Each frame in this figure corresponds to one accent phrase. An accent phrase is a unit that has a single accent nucleus.
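The key-value frames and the pointer to the next frame can be pictured with a small Python sketch. This is only an illustration under assumptions: the dictionary layout, the key names, the helpers `make_frame` and `iter_frames`, and the contents of the second frame are hypothetical, while the first frame's values (frame number 1, notation 今日は, accent type 1) are the ones cited from FIG. 4.

```python
def make_frame(frame_number, notation, accent_type, next_frame=None):
    """Build one accent-phrase frame as a set of key-value pairs.

    The "next" key plays the role of the pointer (the arrow in FIG. 4).
    """
    return {
        "frame_number": frame_number,
        "notation": notation,
        "accent_type": accent_type,
        "next": next_frame,
    }


def iter_frames(head):
    """Follow the "next" pointers from the first frame to the last."""
    frame = head
    while frame is not None:
        yield frame
        frame = frame["next"]


# The second frame's contents are invented for illustration only.
second = make_frame(2, "晴れです", 1)
first = make_frame(1, "今日は", 1, next_frame=second)

numbers = [frame["frame_number"] for frame in iter_frames(first)]
```

A chained structure like this lets the prosody control stage consume one accent phrase at a time and stop when no next frame exists.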
[0014]
The “phonetic notation” is the phoneme string uttered when generating synthesized speech, written in katakana. The “accent type” expresses the position of the accent nucleus as the number of morae from the beginning of each accent phrase. Here, the “number of morae” is the total number of vowels, geminate consonants (sokuon), and moraic nasals (hatsuon) contained in the accent phrase. The “constituent parts of speech” is a list of the parts of speech of the morphemes that make up the accent phrase. Finally, the “pause length” is the length of the silent interval generated after the accent phrase.
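Because the phonetic notation is written in katakana, the mora count defined above can be approximated by counting katakana characters while skipping the small kana that merge with the preceding character. The sketch below is an assumption-laden illustration, not the patent's procedure: the function name `count_morae` is hypothetical, the input is assumed to contain only katakana, and the long-vowel mark ー is assumed to count as one (vowel) mora.

```python
# Small kana that combine with the preceding character into a single mora
# (for example キョ is one mora). The sokuon ッ and the moraic nasal ン are
# deliberately NOT in this set, because each counts as its own mora.
SMALL_KANA = set("ァィゥェォャュョヮ")


def count_morae(katakana):
    """Count morae in a katakana string (assumes pure katakana input)."""
    return sum(1 for ch in katakana if ch not in SMALL_KANA)


def is_valid_accent_type(accent_type, katakana):
    """An accent nucleus position cannot exceed the mora count.

    Type 0 is the conventional label for a phrase with no accent fall.
    """
    return 0 <= accent_type <= count_morae(katakana)


morae_kyoowa = count_morae("キョーワ")   # キョ + ー + ワ
morae_gakkoo = count_morae("ガッコー")   # ガ + ッ + コ + ー
```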
[0015]
Returning to FIG. 3, if there is an input in step S301, the process proceeds to step S302. If there is no input, step S301 is repeated.
In step S302, the prosody control unit 202 generates prosody control information such as pitch and duration. The generated prosodic information is held in the prosodic information holding unit 203, and the process proceeds to step S303.
[0016]
The details of the prosody control method in this step are as follows. FIG. 7 shows the control flow of the prosody control unit 202, which is described step by step below. FIG. 8 shows the detailed configuration of the prosody control unit 202.
Here, it is assumed that the first of the accent phrase frames input in step S301 above has been set in the accent phrase information holding unit 801.
[0017]
In step S701, the accent phrase existence determination unit 802 determines whether an accent phrase is set in the accent phrase information holding unit 801. If none is set, the process ends. If one is set, the process proceeds to step S702.
In step S702, the noun determination unit 803 determines whether there is a noun among the morphemes constituting the accent phrase. If not, the process proceeds to step S704. If there is, the process proceeds to step S703.
[0018]
In step S703, prosodic information is generated. The pitch frequency setting unit 807 adds +10 to the difference between the maximum and minimum values of the pitch frequency given as defaults. This widens the pitch frequency range and produces a pseudo emphasis effect. To make the speech easier to hear, the phoneme time length setting unit 808 lengthens the phoneme time length, and the power setting unit 809 adjusts the power level. The next accent phrase frame is then set in the accent phrase information holding unit 801, and the process returns to step S701.
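The pitch adjustment in this step is simple arithmetic on the default pitch range. The description does not say how the widened range is positioned or what unit the +10 is in, so the sketch below makes two labeled assumptions: the values are in Hz, and the range is widened symmetrically about its midpoint. The function name `widen_pitch_range` and the default values 80 and 120 are hypothetical.

```python
def widen_pitch_range(f_min, f_max, delta):
    """Return (new_min, new_max) whose difference is (f_max - f_min) + delta.

    Assumption: the midpoint of the range is kept fixed. The patent only
    states that delta is added to the max-min difference.
    """
    new_range = (f_max - f_min) + delta
    if new_range < 0:
        raise ValueError("delta would make the pitch range negative")
    midpoint = (f_min + f_max) / 2
    return midpoint - new_range / 2, midpoint + new_range / 2


# Hypothetical defaults of 80 Hz and 120 Hz (difference 40).
emphasized = widen_pitch_range(80, 120, +10)   # difference becomes 50
reduced = widen_pitch_range(80, 120, -10)      # difference becomes 30
```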
[0019]
Subsequently, in steps S704 to S710, the pitch frequency, phoneme time length, and power are varied according to the part of speech of each constituent morpheme.
In step S704, the interjection determination unit 804 determines whether there is an interjection among the morphemes constituting the accent phrase. If not, the process proceeds to step S706. If there is, the process proceeds to step S705.
[0020]
In step S705, prosodic information is generated. The pitch frequency setting unit 807 adds +30 to the difference between the maximum and minimum values of the pitch frequency given as defaults. This widens the pitch frequency range and produces a pseudo emphasis effect. To make the speech easier to hear, the phoneme time length setting unit 808 lengthens the phoneme time length, and the power setting unit 809 adjusts the power level. The next accent phrase frame is then set in the accent phrase information holding unit 801, and the process returns to step S701.
[0021]
In step S706, the numeral determination unit 805 determines whether there is a numeral among the morphemes constituting the accent phrase. If not, the process proceeds to step S708. If there is, the process proceeds to step S707.
In step S707, prosodic information is generated. The pitch frequency setting unit 807 adds +10 to the difference between the maximum and minimum values of the pitch frequency given as defaults. This widens the pitch frequency range and produces a pseudo emphasis effect. To make the speech easier to hear, the phoneme time length setting unit 808 lengthens the phoneme time length, and the power setting unit 809 adjusts the power level. The next accent phrase frame is then set in the accent phrase information holding unit 801, and the process returns to step S701.
[0022]
In step S708, the particle determination unit 806 determines whether there is a particle among the morphemes constituting the accent phrase. If not, the process proceeds to step S710. If there is, the process proceeds to step S709.
In step S709, prosodic information is generated. The pitch frequency setting unit 807 adds −10 to the difference between the maximum and minimum values of the pitch frequency given as defaults. This narrows the pitch frequency range. The phoneme time length setting unit 808 shortens the phoneme time length, and the power setting unit 809 adjusts the power level. The next accent phrase frame is then set in the accent phrase information holding unit 801, and the process returns to step S701.
[0023]
In step S710, prosodic information is generated. In the pitch frequency setting unit 807, the difference between the maximum value and the minimum value of the pitch frequency given as a default value is used as it is. The phoneme time length setting unit 808 sets the phoneme time length as a default value, and the power setting unit 809 sets the default power level. Then, the next frame of the accent phrase is set in the accent phrase information holding unit 801, and the process returns to step S701.
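Steps S701 to S710 amount to a loop over the accent phrase frames with an ordered series of part-of-speech checks. The following Python sketch shows one way to express that flow. It is an illustration under assumptions, not the patented implementation: the class and field names, the English part-of-speech labels, and the string labels for duration and power are hypothetical. The pitch offsets (+10, +30, +10, −10) and the order of the checks come from the description.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Frame:
    parts_of_speech: List[str]            # "constituent parts of speech"
    next_frame: Optional["Frame"] = None  # pointer to the next accent phrase


@dataclass(frozen=True)
class Prosody:
    pitch_range_delta: int  # added to the default (max - min) pitch difference
    duration: str           # "longer", "shorter", or "default"
    power: str              # "adjusted" or "default"


# Checked in this order, mirroring S702, S704, S706, S708.
RULES = [
    ("noun", Prosody(+10, "longer", "adjusted")),          # S703
    ("interjection", Prosody(+30, "longer", "adjusted")),  # S705
    ("numeral", Prosody(+10, "longer", "adjusted")),       # S707
    ("particle", Prosody(-10, "shorter", "adjusted")),     # S709
]
DEFAULT = Prosody(0, "default", "default")                 # S710


def control_prosody(head):
    """Return one Prosody per accent phrase frame, in order."""
    results = []
    frame = head
    while frame is not None:                   # S701: is a frame set?
        for pos, prosody in RULES:
            if pos in frame.parts_of_speech:   # first matching check wins
                results.append(prosody)
                break
        else:
            results.append(DEFAULT)            # no listed part of speech found
        frame = frame.next_frame               # set the next frame, back to S701
    return results


# Hypothetical input: four accent phrases.
f4 = Frame(["verb"])
f3 = Frame(["interjection"], f4)
f2 = Frame(["particle"], f3)
f1 = Frame(["noun", "particle"], f2)
deltas = [p.pitch_range_delta for p in control_prosody(f1)]
```

Because the noun check comes first, an accent phrase containing both a noun and a particle receives the noun treatment; the particle rule applies only when no noun, interjection, or numeral is present.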
[0024]
Appropriate prosodic control information can be generated by the processing of each step described above.
Next, returning to FIG. 3, the processing from step S303 will be described.
In step S303, the speech generation unit 204 uses the prosodic information held in the prosodic information holding unit 203, among other data, to generate phonemes, generate waveforms from parameters, or connect waveforms. The generated speech synthesis signal is held in the speech parameter holding unit 205, and the process proceeds to step S304.
[0025]
In step S304, the speech parameter information is D/A-converted into actual synthesized speech. Once the synthesized speech has been output, all processing ends.
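The overall flow of FIG. 3 (S301 to S304) can be summarized as a four-stage pipeline. In the hedged sketch below, the function name `synthesize_once` and the three callables passed into it are hypothetical placeholders for the prosody control unit, the speech generation unit, and the DA converter. A blocking `queue.Queue.get()` stands in for repeating step S301 until input arrives.

```python
import queue


def synthesize_once(inputs, control_prosody, generate_parameters, da_convert):
    """Run one pass of the S301-S304 flow.

    inputs              -- queue.Queue holding frame-structure data
    control_prosody     -- callable for S302 (placeholder)
    generate_parameters -- callable for S303 (placeholder)
    da_convert          -- callable for S304 (placeholder)
    """
    frames = inputs.get()                              # S301: wait for input
    prosody = control_prosody(frames)                  # S302: prosody control
    parameters = generate_parameters(frames, prosody)  # S303: speech parameters
    return da_convert(parameters)                      # S304: D/A conversion


# Toy stand-ins so the flow can be exercised end to end.
pending = queue.Queue()
pending.put(["frame-1"])
output = synthesize_once(
    pending,
    control_prosody=lambda frames: {"count": len(frames)},
    generate_parameters=lambda frames, prosody: [prosody["count"]] * 2,
    da_convert=lambda parameters: bytes(parameters),
)
```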
FIG. 5 shows the configuration of an embodiment of the present invention. Reference numeral 501 denotes a control memory, which stores a control program following a control procedure such as that shown in the flowchart of FIG. 3. Reference numeral 502 denotes a central processing unit, which performs decisions, calculations, and the like in accordance with the control procedure held in the control memory 501. Reference numeral 503 denotes a memory, which contains the input holding unit 201, the prosody control unit 202, the prosody information holding unit 203, the speech generation unit 204, and the speech parameter holding unit 205. Reference numeral 504 denotes a disk device, and 505 denotes a bus.
(Other examples)
1. In the above embodiment, speech parameter generation is performed after prosody control in FIG. 3; however, the invention is not limited to this, and prosody control and speech parameter generation may be performed in parallel.
2. In the above embodiment, a frame structure is used as input; however, the invention is not limited to this, and any input format having a similar structure may be used.
3. In the above embodiment, prosody generation was described for the case where the accent component and the natural effect component are separated; however, the invention is not limited to this, and pitch pattern data may instead be prepared for each accent type and mora count.
4. In the above embodiment, prosodic information is generated according to the constituent parts of speech; however, the invention is not limited to this, and other factors such as the presence or absence of a pause, the presence or absence of emphasis information, and the number of morae may also be used to generate prosodic information.
[0026]
The present invention may be applied to a system composed of a plurality of devices or to an apparatus consisting of a single device. Needless to say, the present invention is also applicable where it is achieved by supplying a program to a system or apparatus. As described above, according to the present invention, frame-structured data is input instead of phonetic text, so more information can be handled than before and more finely controlled synthesized speech can be generated.
[0027]
[Effect of the Invention]
As described above, according to the present invention, providing the linguistic information obtained by analysis to the acoustic processing unit allows speech synthesis to be controlled more finely and enables diverse, high-quality synthesized speech output.
[Brief description of the drawings]
FIG. 1 is a block diagram showing an example of phonetic text.
FIG. 2 is a block diagram showing a configuration of an embodiment of the present invention.
FIG. 3 is an operation flowchart showing a processing procedure according to the embodiment of the present invention.
FIG. 4 is a diagram illustrating an example of a frame structure.
FIG. 5 is a basic configuration diagram of a synthetic speech creation support apparatus according to the present invention.
FIG. 6 is a diagram showing a general pitch pattern generation method.
FIG. 7 is a flowchart showing a flow of processing in the prosody control unit.
FIG. 8 is a diagram illustrating a configuration of a prosody control unit.
[Explanation of symbols]
201 Input holding unit
202 Prosody control unit
203 Prosody information holding unit
204 Speech generation unit
205 Speech parameter holding unit
206 DA converter
501 Control memory
502 Central processing unit
503 Memory
504 Disk device
505 Bus
801 Accent phrase information holding unit
802 Accent phrase presence determination unit
803 Noun presence determination unit
804 Interjection presence determination unit
805 Numeral presence determination unit
806 Particle presence determination unit
807 Pitch frequency setting unit
808 Phoneme time length setting unit
809 Power setting unit

Claims

A method for processing information used for speech synthesis, for performing speech synthesis based on sentence information decomposed into accent phrase units, the method comprising:
a part-of-speech list holding step of holding a list of the parts of speech of the words included in an accent phrase;
a part-of-speech determination step of determining a part of speech present in the list of parts of speech held in the part-of-speech list holding step;
a prosody control step of controlling prosody based on the part of speech determined in the part-of-speech determination step; and
a generation step of generating synthesized speech parameters based on the prosody controlled in the prosody control step.

The method for processing information used for speech synthesis according to claim 1, wherein the prosody control step controls the pitch frequency, which is one element of the prosody, based on the type of the part of speech determined in the part-of-speech determination step.

The method for processing information used for speech synthesis according to claim 1, wherein the prosody is a phoneme duration.

The information processing method used for speech synthesis according to claim 1, wherein the prosody is sound power.

The method for processing information used for speech synthesis according to claim 1, further comprising a conversion step of converting into synthesized speech based on the synthesized speech parameter generated in the generating step.

The method for processing information used for speech synthesis according to claim 2, wherein the prosody control step performs control to increase the maximum value and the minimum value of the pitch frequency if the part of speech determined in the part-of-speech determination step is a noun.

The method for processing information used for speech synthesis according to claim 3, wherein the prosody control step performs control to increase the phoneme duration if the part of speech determined in the part-of-speech determination step is a noun.

The method for processing information used for speech synthesis according to claim 4, wherein the prosody control step performs control to increase the sound power if the part of speech determined in the part of speech determination step is a noun.

The method for processing information used for speech synthesis according to claim 1, wherein, when a noun is present in the list of parts of speech held in the part-of-speech list holding step, the part-of-speech determination step determines the noun as the part of speech to be used for prosody generation.

An apparatus for processing information used for speech synthesis, for performing speech synthesis based on sentence information decomposed into accent phrase units, the apparatus comprising:
part-of-speech list holding means for holding a list of the parts of speech of the words included in an accent phrase;
part-of-speech determination means for determining a part of speech present in the list of parts of speech held by the part-of-speech list holding means;
prosody control means for controlling prosody based on the part of speech determined by the part-of-speech determination means; and
generation means for generating synthesized speech parameters based on the prosody controlled by the prosody control means.

The apparatus for processing information used for speech synthesis according to claim 10, wherein the prosody control means controls the pitch frequency, which is one element of the prosody, based on the type of the part of speech determined by the part-of-speech determination means.

The information processing apparatus used for speech synthesis according to claim 10, wherein the prosody is a phoneme duration.

The information processing apparatus used for speech synthesis according to claim 10, wherein the prosody is sound power.

The apparatus for processing information used for speech synthesis according to claim 10, further comprising conversion means for converting into synthesized speech based on the synthesized speech parameter generated by the generating means.

The apparatus for processing information used for speech synthesis according to claim 11, wherein the prosody control means performs control to increase the maximum value and the minimum value of the pitch frequency if the part of speech determined by the part-of-speech determination means is a noun.

The apparatus for processing information used for speech synthesis according to claim 12, wherein the prosody control means performs control to increase the phoneme duration if the part of speech determined by the part-of-speech determination means is a noun.

The apparatus for processing information used for speech synthesis according to claim 13, wherein the prosody control means performs control to increase the sound power if the part of speech determined by the part-of-speech determination means is a noun.

The apparatus for processing information used for speech synthesis according to claim 10, wherein, when a noun is present in the list of parts of speech held by the part-of-speech list holding means, the part-of-speech determination means determines the noun as the part of speech to be used for prosody generation.