JPH08211896A

JPH08211896A - System and device for editing speech synthesis

Info

Publication number: JPH08211896A
Application number: JP7020280A
Authority: JP
Inventors: Yasuo Okuya; 泰夫奥谷; Toshiaki Fukada; 俊明深田; Yasunori Ohora; 恭則大洞
Original assignee: Canon Inc
Current assignee: Canon Inc
Priority date: 1995-02-08
Filing date: 1995-02-08
Publication date: 1996-08-20
Anticipated expiration: 2021-12-27
Also published as: JP3862300B2

Abstract

PURPOSE: To provide a speech synthesis editing method and apparatus that can control speech synthesis finely and output diverse, high-quality synthetic speech by supplying the linguistic information obtained by analysis to an acoustic processing unit without loss. CONSTITUTION: A speech synthesis editing method that performs speech synthesis editing based on sentence information decomposed into accent-phrase unit information comprises a prosody control step S302 of controlling prosody based on the accent-phrase unit information, and a generation step S303 of generating synthetic speech parameters based on the prosody controlled in the prosody control step.

Description

Detailed Description of the Invention

[0001]

[Industrial Field of Application] The present invention relates to a speech synthesis editing method and apparatus, and more particularly to a speech synthesis editing method and apparatus that perform speech synthesis editing based on sentence information decomposed into accent-phrase unit information.

[0002]

[Prior Art] In general, a speech synthesizer feeds its acoustic processing unit with what is called phonetic text, that is, acoustic information written out as text. Generating this phonetic text requires sentence analysis such as morphological analysis and phrase-relation analysis.

[0003] Using the result of this sentence analysis, spoken-language processing for generating synthetic speech is performed on the basis of information obtained from linguistic analysis. This processing includes generation of pause information (how far to utter in one breath, i.e. the pause phrase, and the length of the silence placed between pause phrases), assignment of readings (to handle changes in reading), and generation of accents (accent combination). The result is output as phonetic text. FIG. 1 shows an example of phonetic text.

[0004] In FIG. 1, symbols and numbers enclosed in "{}", such as H85 and T300, are control signals and their values. The letters enclosed in "[]" are the phoneme string, and "/" and "\" mark the points where the accent rises and falls. The prosody control unit parses (interprets) this phonetic text and extracts from it the accent information, pause information, pitch frequency, and other information needed to generate prosody information. The accent information is obtained from "/" and "\", a pause of 300 (milliseconds) from T300, and a pitch frequency of 85 (Hz) from H85. From this information the prosody control unit generates a pitch pattern (see FIG. 6). Phonetic text thus carries acoustic information such as phoneme information, accent information, and control information.
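As a rough illustration of the extraction described in this paragraph, the following Python sketch pulls control values, phoneme strings, and accent marks out of a phonetic-text string. The exact syntax of the text in FIG. 1 is not reproduced here, so the sample string, the function name `parse_phonetic_text`, and the layout of the returned dictionary are all assumptions made for this example.

```python
import re

# Matches either a control signal such as {H85} or a bracketed phoneme string.
# This token grammar is an assumption based on the description of FIG. 1.
TOKEN = re.compile(r"\{([A-Z])(\d+)\}|\[([^\]]*)\]")


def parse_phonetic_text(text):
    """Extract pitch, pause, phoneme, and accent information (hypothetical layout)."""
    result = {"pitch_hz": [], "pause_ms": [], "segments": []}
    for match in TOKEN.finditer(text):
        code, value, body = match.groups()
        if code is not None:
            if code == "H":      # pitch frequency in Hz, e.g. H85 -> 85
                result["pitch_hz"].append(int(value))
            elif code == "T":    # pause length in milliseconds, e.g. T300 -> 300
                result["pause_ms"].append(int(value))
            continue             # other control codes are ignored in this sketch
        phonemes, rises, falls = [], [], []
        for ch in body:
            if ch == "/":        # accent rise point
                rises.append(len(phonemes))
            elif ch == "\\":     # accent fall point
                falls.append(len(phonemes))
            else:
                phonemes.append(ch)
        # Rise and fall positions are character offsets into the stripped string.
        result["segments"].append(
            {"phonemes": "".join(phonemes), "rises": rises, "falls": falls}
        )
    return result


# Hypothetical input; not the actual content of FIG. 1.
parsed = parse_phonetic_text(r"{H85}[kyo/o\wa]{T300}")
```

Note that nothing about parts of speech or morpheme boundaries survives in `parsed`, which is the limitation the following paragraphs address.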

[0005]

[Problem to Be Solved by the Invention] However, even when a user tries to edit the phonetic text to create synthetic speech to his or her liking, hardly any linguistic information is written in the phonetic text, so the information that can be manipulated (changed or corrected) is limited. Moreover, linguistic information cannot be used for prosody control or the like, because it has already been lost by that stage. If the linguistic information were instead written into the phonetic text, the phonetic text would become hard to read and hard to understand.

[0006] The present invention solves the above drawbacks of the prior art. Its object is to provide a speech synthesis editing method and apparatus that, rather than generating phonetic text from the sentence analysis result, supply the linguistic information obtained by the analysis to the acoustic processing unit, so that speech synthesis can be controlled more finely and diverse, high-quality synthetic speech can be output.

[0007]

[Means for Solving the Problem] To achieve the above object, the speech synthesis editing method and apparatus of the present invention have the following configuration. The method performs speech synthesis editing based on sentence information decomposed into accent-phrase unit information, and comprises a prosody control step of controlling prosody based on the accent-phrase unit information, and a generation step of generating synthetic speech parameters based on the prosody controlled in the prosody control step.

[0008] Another aspect of the invention is a speech synthesis editing apparatus that performs speech synthesis editing based on sentence information decomposed into accent-phrase unit information, comprising prosody control means for controlling prosody based on the accent-phrase unit information, and generation means for generating synthetic speech parameters based on the prosody controlled by the prosody control means.

[0009]

[Operation] In the above configuration, prosody is controlled based on the accent-phrase unit information, and synthetic speech parameters are generated based on the prosody controlled in the prosody control step.

[0010] In the other aspect of the invention, the prosody control means controls prosody based on the accent-phrase unit information, and the generation means generates synthetic speech parameters based on the controlled prosody.

[0011]

[Embodiment] First, one of the key points of the speech synthesis apparatus according to an embodiment of the present invention is summarized, followed by a detailed description. The speech synthesis apparatus of this embodiment basically comprises: an input holding unit that accepts as input a spoken-language processing result having a frame structure such as that shown in FIG. 4; a prosody control unit that generates prosody control information such as pitch, duration, and power; a prosody information holding unit that holds the prosody information; a speech parameter generation unit that generates synthetic speech parameters through phoneme generation, parameter concatenation, waveform concatenation, or the like; a speech parameter holding unit that holds the speech parameters; and a DA converter that converts the speech parameters into synthetic speech.

[0012] With this basic configuration, the output of sentence analysis is expressed in the frame structure described later, and fine-grained control based on the frame-structure data becomes possible, enabling diverse synthetic speech output. The speech synthesis apparatus of this embodiment is described in detail below with reference to the drawings. FIG. 2 is a block diagram showing the configuration of the speech synthesis apparatus of this embodiment. In FIG. 2, 201 is an input holding unit that receives and holds frame-structure data, which is the result of sentence analysis. 202 is a prosody control unit that takes the frame-structure data from the input holding unit 201 and controls prosody, pitch, duration, power, and the like. 203 is a prosody information holding unit that holds the prosody control information. 204 is a speech generation unit that performs processing such as prosody, parameter concatenation, and waveform concatenation. 205 is a speech parameter holding unit that holds the parameters for generating synthetic speech. 206 is a DA converter that converts the speech parameters into synthetic speech.
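The data flow between blocks 201 to 206 might be modeled as in the following Python sketch. It is only an illustration of how the holding units sit between the processing units. The class name `SpeechSynthesisPipeline`, its method names, and the three callables passed to the constructor are hypothetical; the patent does not specify any programming interface.

```python
from collections import deque


class SpeechSynthesisPipeline:
    """Chains the blocks of FIG. 2; the three callables are placeholders."""

    def __init__(self, control_prosody, generate_speech, convert_da):
        self.input_holding = deque()            # 201: holds frame-structure data
        self.control_prosody = control_prosody  # 202: prosody control unit
        self.prosody_info = None                # 203: prosody information holding unit
        self.generate_speech = generate_speech  # 204: speech generation unit
        self.speech_parameters = None           # 205: speech parameter holding unit
        self.convert_da = convert_da            # 206: DA converter

    def submit(self, frames):
        """Store one sentence-analysis result in the input holding unit."""
        self.input_holding.append(frames)

    def run_once(self):
        """Process one held input, or return None if nothing is waiting."""
        if not self.input_holding:
            return None
        frames = self.input_holding.popleft()
        self.prosody_info = self.control_prosody(frames)
        self.speech_parameters = self.generate_speech(frames, self.prosody_info)
        return self.convert_da(self.speech_parameters)
```

Keeping the intermediate results in attributes mirrors the separate holding units 203 and 205, so each stage reads only what the previous stage stored.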

[0013] Next, the operation of the apparatus is described with reference to the flowchart shown in FIG. 3. First, step S301 determines whether a spoken-language processing result expressed in the frame structure has been input. Here, the frame structure refers to something like that shown in FIG. 4. A frame structure is a collection of pairs, each consisting of a key and its value. In FIG. 4, the frame number, notation, accent type, and so on are keys, and their respective values are "1", "今日は", and "1". By making a key a pointer, its value can also refer to the next frame (this corresponds to the arrows in FIG. 4). The frames in this figure take the accent phrase as their unit. An accent phrase is a unit that has one accent nucleus.

[0014] The "phonetic notation" is the phoneme string uttered when the synthetic speech is generated, written in katakana. The "accent type" expresses the position of the accent nucleus as the number of morae from the beginning of each accent phrase. Here, the "number of morae" is the total number of vowels, geminate consonants (sokuon), and moraic nasals (hatsuon) contained in the accent phrase. The "constituent parts of speech" is a list of the parts of speech of the morphemes that make up the accent phrase. Finally, the "pause length" represents the length of the silent interval generated after the accent phrase.
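One way to picture such a frame is as a small record type, as in the following Python sketch. The field names, the millisecond unit for the pause length, the sample values other than frame number "1", notation "今日は", and accent type "1", and the simplified mora-counting rule are assumptions for illustration. The rule counts every katakana character except small combining kana, so the long-vowel mark "ー" is treated as a vowel mora.

```python
from dataclasses import dataclass
from typing import List, Optional

# Small kana that combine with the preceding kana and add no mora of their own.
SMALL_KANA = frozenset("ャュョァィゥェォヮ")


def count_morae(katakana: str) -> int:
    """Count morae in a katakana string using a simplified rule."""
    return sum(1 for ch in katakana if ch not in SMALL_KANA)


@dataclass
class AccentPhraseFrame:
    """One accent-phrase frame; field names are hypothetical."""
    frame_number: int
    notation: str                 # orthographic form
    phonetic_notation: str        # katakana phoneme string
    accent_type: int              # accent nucleus position in morae from the start
    parts_of_speech: List[str]    # parts of speech of the constituent morphemes
    pause_length_ms: int = 0      # silence after the phrase (unit assumed)
    next_frame: Optional["AccentPhraseFrame"] = None  # pointer to the next frame

    @property
    def mora_count(self) -> int:
        return count_morae(self.phonetic_notation)


# Hypothetical two-phrase sentence; only the first three values come from FIG. 4.
second = AccentPhraseFrame(2, "晴れです", "ハレデス", 2, ["noun", "auxiliary verb"])
first = AccentPhraseFrame(1, "今日は", "キョウワ", 1, ["noun", "particle"], 300, second)
```

Because `parts_of_speech` travels with each frame, later stages can still branch on linguistic information that phonetic text would have discarded.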

[0015] Returning to FIG. 3, if there is input in step S301, the process proceeds to step S302. If there is no input, step S301 is repeated. In step S302, the prosody control unit 202 generates prosody control information, including prosody control, pitch generation, and duration. The generated prosody information is held in the prosody information holding unit 203, and the process proceeds to step S303.

[0016] The prosody control method in this step is detailed next. FIG. 7 shows the control flow of the prosody control unit 202, which is described step by step below. FIG. 8 shows the detailed configuration of the prosody control unit 202. It is assumed here that the first element of the accent phrase frames input in step S301 described above has been set in the accent phrase information holding unit 801.

[0017] In step S701, the accent phrase presence determination unit 802 determines whether an accent phrase is set in the accent phrase information holding unit 801. If none is present, the process ends. If one is present, the process proceeds to step S702. In step S702, the noun determination unit 803 determines whether there is a noun among the morphemes that make up the accent phrase. If there is none, the process proceeds to step S704. If there is one, the process proceeds to step S703.

[0018] In step S703, prosody information is generated. The pitch frequency setting unit 807 adds a width of +10 to the difference between the maximum and minimum values of the pitch frequency given as defaults. This enlarges the amplitude of the pitch frequency and produces a pseudo-emphasis effect. To make the speech easier to hear, the phoneme duration setting unit 808 lengthens the phoneme duration and the power setting unit 809 adjusts the power. The next accent phrase frame is then set in the accent phrase information holding unit 801, and the process returns to step S701.
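The text states only that a width is added to the difference between the maximum and minimum pitch frequency. It does not say how the extra width is distributed or in what unit it is expressed. The following Python sketch assumes the width is in the same unit as the pitch values and is split evenly around the midpoint of the default range; the function name `widen_pitch_range` and the sample defaults are hypothetical.

```python
def widen_pitch_range(pitch_min, pitch_max, delta):
    """Return a (min, max) pair whose width is the original width plus delta.

    Assumption: the range is widened (or narrowed, for negative delta)
    symmetrically around its midpoint, and never below zero width.
    """
    width = max(0.0, (pitch_max - pitch_min) + delta)
    center = (pitch_min + pitch_max) / 2.0
    return center - width / 2.0, center + width / 2.0


# Hypothetical default range of 85 to 125: width 40 becomes width 50.
emphasized = widen_pitch_range(85.0, 125.0, 10.0)
```

Under these assumptions the example yields a range of 80 to 130, that is, the same center with a larger excursion, which corresponds to the pseudo-emphasis effect described above.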

[0019] Steps S704 through S710 below vary the pitch frequency, phoneme duration, and power according to the part of speech of each constituent morpheme. In step S704, the interjection determination unit 804 determines whether there is an interjection among the morphemes that make up the accent phrase. If there is none, the process proceeds to step S706. If there is one, the process proceeds to step S705.

[0020] In step S705, prosody information is generated. The pitch frequency setting unit 807 adds a width of +30 to the difference between the maximum and minimum values of the pitch frequency given as defaults. This enlarges the amplitude of the pitch frequency and produces a pseudo-emphasis effect. To make the speech easier to hear, the phoneme duration setting unit 808 lengthens the phoneme duration and the power setting unit 809 adjusts the power. The next accent phrase frame is then set in the accent phrase information holding unit 801, and the process returns to step S701.

[0021] In step S706, the numeral determination unit 805 determines whether there is a numeral among the morphemes that make up the accent phrase. If there is none, the process proceeds to step S708. If there is one, the process proceeds to step S707. In step S707, prosody information is generated. The pitch frequency setting unit 807 adds a width of +10 to the difference between the maximum and minimum values of the pitch frequency given as defaults. This enlarges the amplitude of the pitch frequency and produces a pseudo-emphasis effect. To make the speech easier to hear, the phoneme duration setting unit 808 lengthens the phoneme duration and the power setting unit 809 adjusts the power. The next accent phrase frame is then set in the accent phrase information holding unit 801, and the process returns to step S701.

[0022] In step S708, the particle determination unit 806 determines whether there is a particle among the morphemes that make up the accent phrase. If there is none, the process proceeds to step S710. If there is one, the process proceeds to step S709. In step S709, prosody information is generated. The pitch frequency setting unit 807 adds a width of -10 to the difference between the maximum and minimum values of the pitch frequency given as defaults. This reduces the amplitude of the pitch frequency. The phoneme duration setting unit 808 shortens the phoneme duration, and the power setting unit 809 adjusts the power. The next accent phrase frame is then set in the accent phrase information holding unit 801, and the process returns to step S701.

[0023] In step S710, prosody information is generated. The pitch frequency setting unit 807 uses the difference between the maximum and minimum values of the pitch frequency given as defaults without change. The phoneme duration setting unit 808 uses the default phoneme duration, and the power setting unit 809 sets the default power. The next accent phrase frame is then set in the accent phrase information holding unit 801, and the process returns to step S701.
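The decision sequence of steps S701 through S710 can be summarized in a short Python sketch. The pitch-width offsets (+10, +30, +10, -10) and the order of the checks come from the description. The names `ProsodyInfo`, `RULES`, and `control_prosody`, the English part-of-speech labels, and the use of plain strings for the duration and power settings are assumptions, since the text gives no concrete amounts for those two settings.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class ProsodyInfo:
    pitch_width_delta: int  # added to the default (max - min) pitch width
    duration: str           # "lengthen", "shorten", or "default"
    power: str              # "adjust" or "default"


# Checked in the same order as steps S702, S704, S706, and S708.
RULES = [
    ("noun", ProsodyInfo(+10, "lengthen", "adjust")),          # S703
    ("interjection", ProsodyInfo(+30, "lengthen", "adjust")),  # S705
    ("numeral", ProsodyInfo(+10, "lengthen", "adjust")),       # S707
    ("particle", ProsodyInfo(-10, "shorten", "adjust")),       # S709
]
DEFAULT = ProsodyInfo(0, "default", "default")                 # S710


def control_prosody_for_phrase(parts_of_speech: Sequence[str]) -> ProsodyInfo:
    """Apply the first matching rule to one accent phrase."""
    for part_of_speech, info in RULES:
        if part_of_speech in parts_of_speech:
            return info
    return DEFAULT


def control_prosody(phrases: Sequence[Sequence[str]]) -> List[ProsodyInfo]:
    """Process accent phrases in order until none remain (the S701 loop)."""
    return [control_prosody_for_phrase(parts) for parts in phrases]


# Hypothetical input: one list of part-of-speech labels per accent phrase.
result = control_prosody(
    [["noun", "particle"], ["interjection"], ["particle"], ["verb"]]
)
```

Because the checks are ordered, an accent phrase that contains both a noun and a particle takes the noun branch and is emphasized rather than attenuated.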

[0024] The processing of the steps described above makes it possible to generate appropriate prosody control information. Returning to FIG. 3, the processing from step S303 onward is described next. In step S303, the speech generation unit 204 uses the prosody information held in the prosody information holding unit 203 and other data to generate phonemes, generate waveforms from parameters, or concatenate waveforms. The generated speech synthesis signal is held in the speech signal holding unit 205, and the process proceeds to step S304.

[0025] In step S304, the speech parameter information is DA-converted into actual synthetic speech. Once the synthetic speech has been uttered, all processing ends. FIG. 5 shows the configuration of the embodiment of the present invention. 501 is a control memory, which stores a control program that follows a control procedure such as that shown in the flowchart of FIG. 3. 502 is a central processing unit that performs judgments, calculations, and the like in accordance with the control procedure held in the control memory 501. 503 is a memory, which contains the input holding unit 201, the prosody control unit 202, the prosody information holding unit 203, the speech generation unit 204, and the speech signal holding unit 205. 504 is a disk device, and 505 is a bus.

(Other embodiments)

1. The above embodiment described the case in FIG. 3 where speech parameter generation is performed after prosody control, but the invention is not limited to this; prosody control and speech parameter generation may be performed in parallel.

2. The above embodiment described the case where a frame structure is the input, but the invention is not limited to this; any input format with a comparable structure may be used.

3. The above embodiment described an example of prosody generation in which the accent component and the natural effect component are separated, but the invention is not limited to this; pitch pattern data may instead be prepared for each accent type and number of morae.

4. The above embodiment described the case where prosody information is generated according to the constituent parts of speech, but the invention is not limited to this; prosody information may also be generated using other factors such as the presence or absence of a pause, the presence or absence of emphasis information, and the number of morae.

[0026] The present invention may be applied to a system made up of a plurality of devices or to an apparatus consisting of a single device. It goes without saying that the invention is also applicable where it is achieved by supplying a program to a system or apparatus. As described above, according to the present invention, frame-structure data rather than phonetic text is taken as input, so more information can be handled than before and finer-grained synthetic speech can be generated.

[0027]

[Effect of the Invention] As described above, according to the present invention, supplying the linguistic information obtained by analysis to the acoustic processing unit allows speech synthesis to be controlled more finely and enables diverse, high-quality synthetic speech output.

[Brief description of drawings]

FIG. 1 is a block diagram showing an example of phonetic text.

FIG. 2 is a block diagram showing the configuration of an embodiment of the present invention.

FIG. 3 is an operation flowchart showing the processing procedure of the embodiment of the present invention.

FIG. 4 is a diagram showing an example of the frame structure.

FIG. 5 is a basic configuration diagram of a synthetic speech creation support apparatus according to the present invention.

FIG. 6 is a diagram showing a general method of generating a pitch pattern.

FIG. 7 is a flowchart showing the flow of processing in the prosody control unit.

FIG. 8 is a diagram showing the configuration of the prosody control unit.

[Explanation of symbols]

201: input holding unit
202: prosody control unit
203: prosody information holding unit
204: speech generation unit
205: speech parameter holding unit
206: DA converter
501: control memory
502: central processing unit
503: memory
504: disk device
505: bus
801: accent phrase information holding unit
802: accent phrase presence determination unit
803: noun presence determination unit
804: interjection presence determination unit
805: numeral presence determination unit
806: particle presence determination unit
807: pitch frequency setting unit
808: phoneme duration setting unit
809: power setting unit

Claims

1. A speech synthesis editing method for performing speech synthesis editing based on sentence information decomposed into accent phrase unit information, comprising: a prosody control step of controlling prosody based on the accent phrase unit information; and a generation step of generating a synthetic speech parameter based on the prosody controlled in the prosody control step.

2. The speech synthesis editing method according to claim 1, wherein the accent phrase unit information includes a part of speech of the accent phrase.

3. The speech synthesis editing method according to claim 1, wherein the accent phrase unit information includes the number of mora of the accent phrase.

4. The speech synthesis editing method according to claim 1, wherein the accent phrase unit information includes an accent type of the accent phrase.

5. The speech synthesis editing method according to claim 1, wherein the accent phrase unit information includes a pause length of the accent phrase.

6. The speech synthesis editing method according to claim 2, wherein the prosody control step controls prosody based on the type of part of speech.

7. The speech synthesis editing method according to claim 6, wherein the prosody control step controls the pitch frequency, which is one of the prosody, based on the type of part of speech.

8. The speech synthesis editing method according to claim 3, wherein the prosody control step controls prosody based on the number of mora.

9. The speech synthesis editing method according to claim 8, wherein the prosody control step controls a pitch frequency, which is one of prosody, based on the number of mora.

10. The speech synthesis editing method according to claim 4, wherein the prosody control step controls a pitch frequency, which is one of prosody, based on the accent type.

11. The speech synthesis editing method according to claim 10, wherein the prosody control step controls prosody based on the accent type.

12. The speech synthesis editing method according to claim 5, wherein the prosody control step controls prosody based on the pause length.

13. The speech synthesis editing method according to claim 12, wherein the prosody control step controls a pitch frequency, which is one of prosody, based on the pause length.

14. The speech synthesis editing method according to claim 1, wherein the prosody is a phoneme duration.

15. The speech synthesis editing method according to claim 1, wherein the prosody is sound power.

16. The speech synthesis editing method according to claim 1, further comprising a conversion step of converting to a synthetic speech based on the synthetic speech parameter generated in the generation step.

17. The speech synthesis editing method according to claim 7, wherein the prosody control step performs control to increase the width between the maximum value and the minimum value of the pitch frequency if the part of speech is a noun.

18. The speech synthesis editing method according to claim 14, wherein the prosody control step performs control to lengthen the phoneme duration if the part of speech is a noun.

19. The speech synthesis editing method according to claim 15, wherein in the prosody control step, if the part of speech is a noun, the sound power is increased.

20. A speech synthesis editing apparatus for performing speech synthesis editing based on sentence information decomposed into accent phrase unit information, comprising: prosody control means for controlling prosody based on the accent phrase unit information; and generation means for generating a synthetic speech parameter based on the prosody controlled by the prosody control means.

21. The speech synthesis editing apparatus according to claim 20, wherein the accent phrase unit information includes a part of speech of the accent phrase.

22. The speech synthesis editing apparatus according to claim 20, wherein the accent phrase unit information includes the number of mora of the accent phrase.

23. The speech synthesis editing apparatus according to claim 20, wherein the accent phrase unit information includes an accent type of the accent phrase.

24. The speech synthesis editing apparatus according to claim 20, wherein the accent phrase unit information includes a pause length of the accent phrase.

25. The speech synthesis editing apparatus according to claim 21, wherein the prosody control means controls prosody based on the type of part of speech.

26. The speech synthesis editing apparatus according to claim 25, wherein the prosody control means controls a pitch frequency, which is one of prosody, based on the type of the part of speech.

27. The speech synthesis editing apparatus according to claim […], wherein the prosody control means controls prosody based on the number of mora.

28. The speech synthesis editing apparatus according to claim 27, wherein the prosody control means controls a pitch frequency, which is one of prosody, based on the number of mora.

29. The speech synthesis editing apparatus according to claim 23, wherein the prosody control means controls prosody based on the accent type.

30. The speech synthesis editing apparatus according to claim 29, wherein the prosody control means controls a pitch frequency, which is one of prosody, based on the accent type.

31. The speech synthesis editing apparatus according to claim […], wherein the prosody control means controls prosody based on the pause length.

32. The speech synthesis editing apparatus according to claim 31, wherein the prosody control means controls a pitch frequency, which is one of prosody, based on the pause length.

33. The speech synthesis editing apparatus according to claim 20, wherein the prosody is a phoneme duration.

34. The speech synthesis editing apparatus according to claim 20, wherein the prosody is sound power.

35. The speech synthesis editing apparatus according to claim 20, further comprising conversion means for converting to synthetic speech based on the synthetic speech parameter generated by the generation means.

36. The speech synthesis editing apparatus according to claim 7, wherein the prosody control means performs control to increase the width between the maximum value and the minimum value of the pitch frequency if the part of speech is a noun.

37. The speech synthesis editing apparatus according to claim 33, wherein the prosody control means performs control to lengthen the phoneme duration if the part of speech is a noun.

38. The speech synthesis editing apparatus according to claim 34, wherein the prosody control means controls to increase the sound power if the part of speech is a noun.