JPH05224689A

JPH05224689A - Speech synthesizing device

Info

Publication number: JPH05224689A
Application number: JP4026800A
Authority: JP
Inventors: Hiroshi Hamada; 洋浜田; Jinichi Chiba; 仁一千葉
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 1992-02-13
Filing date: 1992-02-13
Publication date: 1993-09-03

Abstract

PURPOSE: To emphasize a specified part of synthesized speech at an emphasis level expressed in perceptual terms such as 'large' or 'small'. CONSTITUTION: A character string from a text memory 2 is assigned a reading by a text-speech conversion part 3 and converted into a speech parameter string corresponding to the character string. An input/output device 1 is used to specify the part of the character string to be emphasized and to specify, in terms of human perception, whether the emphasis level is high or low. According to the information specifying the emphasized part, the corresponding portion of the speech parameter string is input to an emphasizing process part 9. The specified perceptual emphasis level is converted by a sensory quantity-physical quantity conversion part 8 into a volume conversion value indicating how much the volume is to be increased, a speaking speed conversion value indicating how much the speaking speed is to be slowed, a conversion value indicating how much the fundamental frequency is to be raised, and values indicating how long a pause is to be inserted before and after the emphasized part. The emphasizing process is applied to the speech parameters of the emphasis-specified part, and the processed parameters are combined with the speech parameter string of the unemphasized parts and output.

Description

Detailed Description of the Invention

[0001]

[Field of Industrial Application] The present invention relates to a speech synthesis device that synthesizes and outputs speech stored in advance, or that converts an input character string into speech and outputs it, and in particular to a speech synthesis device that emphasizes a specific portion of the output speech so as to produce highly expressive speech.

[0002]

[Prior Art] Advances in speech processing, language processing, and semiconductor technology such as microprocessors have brought speech analysis-synthesis, bandwidth compression, and rule-based speech synthesis, which synthesizes speech from an input character string, to the stage of practical use. For example, speech synthesis is used for guidance output and for providing data and information in services such as order taking, information provision, and database search over the telephone and other communication channels.

[0003] Speech synthesis output can be divided into the recording-editing-playback method, in which speech stored in advance is edited as needed and output, and the rule-based synthesis method, in which text is input, readings are automatically assigned to the given characters, and speech is synthesized. Research and development in speech synthesis has conventionally focused on how to improve the quality of the output speech. However, as speech synthesis technology comes into more general use, there is a growing demand for a synthesis method with which anyone, even without detailed knowledge of speech technology, can easily produce expressive speech.

[0004] Various methods are conceivable for improving the expressiveness of synthesized speech, such as controlling voice quality to obtain a desired sound or varying the speaking rhythm. The present invention concerns speech emphasis, in which specific words or phrases are output with emphasis. Methods for emphasizing a specific portion of speech in synthesis have been developed before. In these methods, a way of modifying the parameters needed for speech synthesis when emphasis is applied is prepared in advance, and the parameters of an emphasized portion, either extracted automatically or designated in advance, are modified to produce the synthesized speech. The parameters modified include the fundamental frequency (pitch) of the speech, which conveys intonation, the intensity of the speech (sound pressure), and pauses added before and after the emphasized portion. Such speech emphasis methods are described in, for example, Takeda and Ichikawa, "Creation and Evaluation of Prominence Generation Rules for Japanese Sentence Speech" (Journal of the Acoustical Society of Japan, vol. 47, no. 6, pp. 397-404, 1991).

[0005]

[Problems to Be Solved by the Invention] However, these methods have the following problems: (1) emphasis inherently has various levels, ranging from a way of speaking that expresses strong emphasis to one that expresses weak emphasis, yet only two states, emphasized or not emphasized, can be realized; and (2) changing the level of emphasis requires specialized knowledge of speech production and synthesis, so ordinary users unfamiliar with these technologies cannot add new controls or expressions. To solve these problems, there has been a demand for a synthesis output method that automatically controls the physical parameters in response to a person's perceptual input regarding emphasis, and outputs speech emphasized in a way that matches human perception.

[0006]

[Means for Solving the Problems] According to the present invention, there are provided means for designating a portion to be emphasized; means for specifying the degree of emphasis of that portion at various levels on the basis of a person's perceptual designation; means for automatically generating, from the specified emphasis level, the physical control parameters needed for speech synthesis control; and means for synthesizing and outputting speech in which the designated portion is emphasized at the specified level using those physical control parameters. This makes it possible to synthesize and output speech in which a specific portion has been emphasized in a way that matches human perception.

[0007]

[Embodiment] An embodiment of the present invention is described in detail below with reference to the drawings. The invention can be applied either to a rule-based synthesis method, which synthesizes speech from text, or to an analysis-synthesis method, which outputs speech stored in advance. The embodiment is described below taking the rule-based synthesis method as an example.

[0008] In FIG. 1, the user designates the emphasized portion and the emphasis level with an input/output device 1. The input/output device 1 displays the text to be synthesized, which is stored in a text memory 2, and the user indicates which part of the displayed character string is to be emphasized. That is, the user designates the portion of the displayed text to be emphasized using an input means such as a mouse or keyboard. The user then specifies an emphasis level indicating how strongly the designated portion is to be emphasized. The emphasis level is expressed in terms that can be specified by human perception, such as "normal emphasis", "stronger emphasis", "emphasis level: high", and "emphasis level: medium", so that even users unfamiliar with specialized speech knowledge or terminology can use the device easily. FIG. 2A shows an example of a display screen 21 when a display, keyboard, mouse, and the like are used as the input/output device 1. The user designates the portion to be emphasized within a character string 22 displayed on the screen 21 (the underlined portion in the example of FIG. 2A), turns ON an emphasis switch 23 on the screen 21, and then enters the emphasis level for that portion by operating a slide volume control 24 displayed on the screen 21 with the mouse or the like. An emphasis-level scale 25 indicating high and low is displayed near the slide volume control 24 along its slide direction.

[0009] The text-to-speech conversion unit 3 in FIG. 1 assigns a reading to the character string read from the text memory 2, concatenates basic speech units, and generates the speech parameters corresponding to the character string. Various text-to-speech conversion methods have been proposed (for example, Hakoda, Hirokawa, Mizuno, and Nakajima, "Prototype of a Text Synthesis Board Using the COC Method", Institute of Electronics, Information and Communication Engineers, Speech Study Group materials, SP 90-55 (1990)), and any suitable method may be chosen. However, because emphasis processing must be performed later, it is desirable to use a method in which the loudness, fundamental frequency, speed, and so on can be controlled through the generated speech parameters (for example, a method employing LPC synthesis or LSP synthesis). The speech parameters generated by the text-to-speech conversion unit 3 are stored in a speech parameter memory 4.

[0010] The control signal corresponding to the emphasized-portion designation information entered at the input/output device 1 is converted by an emphasized-portion designation information input unit 5 into a position information parameter indicating which part of the character string is to be emphasized, and is output to a speech-parameter correspondence conversion unit 6. The speech-parameter correspondence conversion unit 6 relates the input position information parameter to the readout of the text memory 2 and the speech parameter memory 4, and extracts the parameters of the portion to be emphasized from the speech parameter string, stored in the speech parameter memory 4, into which the character string was converted.
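The position lookup performed by conversion unit 6 can be sketched as follows. This is only an illustration under a stated assumption: that the text-to-speech stage records, for each character, the half-open range of speech-parameter frames it produced. The patent does not describe the alignment data, and the names `char_frames` and `frames_for_selection` are hypothetical.

```python
from typing import List, Tuple

def frames_for_selection(char_frames: List[Tuple[int, int]],
                         sel_start: int, sel_end: int) -> Tuple[int, int]:
    """Return the (first, end) frame range covering characters
    sel_start .. sel_end - 1, where end is exclusive.

    char_frames[i] is the assumed (first, end) frame range generated
    for character i of the displayed string.
    """
    if not 0 <= sel_start < sel_end <= len(char_frames):
        raise ValueError("selection lies outside the character string")
    return char_frames[sel_start][0], char_frames[sel_end - 1][1]

# Hypothetical alignment for a four-character string.
char_frames = [(0, 12), (12, 20), (20, 35), (35, 41)]
# The user selected characters 1 and 2.
print(frames_for_selection(char_frames, 1, 3))
```

The returned frame range identifies which stored parameters are handed to the emphasis processing stage and which are left unchanged.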

[0011] Meanwhile, the perceptual emphasis-level designation entered at the input/output device 1 is converted by an emphasis level input unit 7 into a sensory level, a numerical value representing the perceptual emphasis level. Various schemes are conceivable for the sensory level: for example, mapping the lowest emphasis level to −1, the highest to 1, and normal emphasis to 0; or mapping the lowest to 0, the highest to 1, and normal emphasis to 0.5. The description below uses the −1 to 1 scheme as an example.
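As a minimal sketch of the two numbering schemes just described, the functions below turn a slider position into a level on the −1 to 1 scale and convert between that scale and the 0 to 1 scale. The linear slider mapping, the 0 to 100 slider range, and all function names are assumptions for illustration; the patent does not specify how unit 7 performs the conversion.

```python
def slider_to_level(position: float, lo: float = 0.0, hi: float = 100.0) -> float:
    """Map a slider position in [lo, hi] linearly onto the -1..1 scale."""
    if not lo <= position <= hi:
        raise ValueError("slider position out of range")
    return 2.0 * (position - lo) / (hi - lo) - 1.0

def to_unit_scale(level: float) -> float:
    """Convert a level on the -1..1 scale to the 0..1 scale."""
    return (level + 1.0) / 2.0

def from_unit_scale(u: float) -> float:
    """Convert a level on the 0..1 scale to the -1..1 scale."""
    return 2.0 * u - 1.0

print(slider_to_level(50.0))   # mid-position corresponds to normal emphasis
print(to_unit_scale(0.0))      # normal emphasis on the 0..1 scale
```

Either scale carries the same information, so the choice only affects how the conversion tables or formulas downstream are indexed.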

[0012] The sensory quantity-physical quantity conversion unit 8 derives, from the sensory level output by the emphasis level input unit 7, the physical parameter conversion values (physical control parameters) used to physically modify the speech parameters. It is known that when a person emphasizes a particular part of an utterance, they make the intonation stand out, pronounce the part strongly and slowly, leave a pause before the emphasized word, and articulate the consonants and vowels carefully (see, for example, Minoru Wada, "Accent, Intonation, Prominence", in Munemasa Tokugawa (ed.), "Accent" (Ronshu Nihongo Kenkyu 2), Yuseido (Showa 55 [1980])). In this embodiment, the following physical parameters are used to express emphasis: (1) volume (sound pressure), (2) fundamental frequency (pitch), (3) speaking rate, (4) the pause inserted before the emphasized portion, and (5) the pause inserted after the emphasized portion.

[0013] The sensory quantity-physical quantity conversion unit 8 derives the physical parameter conversion values from the sensory level according to rules determined in advance. For example, a conversion table giving the correspondence between sensory level and physical parameter conversion value may be prepared, and the conversion values looked up from the input sensory level. FIG. 3 shows, in graph form, an example of the contents of such tables. FIG. 3A is an example of a graph for obtaining the value used to convert the volume of the emphasized portion from the sensory level; for example, when the sensory level is 0.5, the volume of the emphasized portion is made 7 dB higher than without emphasis. FIG. 3B is an example of a graph of the coefficient used to convert the speaking rate of the emphasized portion; for example, when the sensory level is 0.5, the speaking rate in the emphasized section is multiplied by 0.7 so that it is spoken slowly. Concretely, this can be realized by, for example, setting the frame period of the emphasized portion to 11.4 ms where the frame period of synthesis without emphasis is 8 ms. Similarly, FIG. 3C is an example of a graph of the correspondence between the sensory level and the coefficient used to convert the fundamental frequency of the emphasized portion; when the sensory level is 0.5, the fundamental frequency of the emphasized portion is 1.1 times that without emphasis. FIGS. 3D and 3E show the correspondence between the sensory level and the length of the pause inserted before and after the emphasized portion, respectively; for example, when the sensory level is 0.5, a 420 ms pause is inserted before the emphasized portion and a 260 ms pause after it.
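The figures quoted for a sensory level of 0.5 can be checked with a little arithmetic. The sketch below assumes that a volume increase in dB is applied as an amplitude gain of 10^(dB/20), which is the standard decibel relation for sound pressure, and that slowing speech by a rate factor is done by dividing the frame period by that factor, which agrees with the 8 ms to 11.4 ms example in the text. The function names and the 120 Hz pitch value are illustrative only.

```python
def db_to_amplitude_gain(db: float) -> float:
    """Amplitude (sound-pressure) gain corresponding to a level change in dB."""
    return 10.0 ** (db / 20.0)

def stretched_frame_period(base_ms: float, rate_factor: float) -> float:
    """Frame period that slows speech to rate_factor times the normal rate."""
    if rate_factor <= 0.0:
        raise ValueError("rate factor must be positive")
    return base_ms / rate_factor

def scaled_f0(f0_hz: float, coeff: float) -> float:
    """Fundamental frequency after applying the conversion coefficient."""
    return f0_hz * coeff

# Values quoted for sensory level 0.5: +7 dB, rate x0.7, F0 x1.1.
print(round(db_to_amplitude_gain(7.0), 2))         # 2.24
print(round(stretched_frame_period(8.0, 0.7), 1))  # 11.4
print(round(scaled_f0(120.0, 1.1), 1))             # 132.0 (120 Hz is a made-up example pitch)
```

A 7 dB increase therefore corresponds to a little more than doubling the amplitude, and the 0.7 rate factor lengthens each 8 ms frame by about 43 percent.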

[0014] The correspondence between sensory level and physical parameter conversion value varies with the gender of the speaker whose voice is the source material for synthesis, the content of the document being handled, and so on. General values can be obtained by methods such as: analyzing speech uttered by people at various emphasis levels to find the correspondence; having multiple subjects compare speech synthesized with various physical parameter values and analyzing the results with psychometric methods to find the correspondence between physical parameter values and perceived emphasis level; or having multiple subjects create synthesized speech at different emphasis levels with a speech control device or speech editing device that can manipulate the physical parameters needed for synthesis, and finding the correspondence between emphasis level and parameter values. As shown in FIG. 3, it is desirable that the physical parameters be converted, over the perceptual emphasis levels −1 to 1, within the following ranges: 0.6 to 0.9 times for speaking rate, 3 to 9 dB for volume, 1.05 to 1.15 times for fundamental frequency, 200 to 550 ms for the pause inserted before the emphasized portion, and 150 to 350 ms for the pause inserted after it.

[0015] The above describes an example in which the sensory level is converted into physical quantities using conversion tables, but the same result can be achieved by storing conversion formulas that approximate the table contents and computing the physical control parameters from the sensory level. In the example of FIG. 3, let the volume increase (dB), the speaking-rate conversion coefficient, the fundamental-frequency conversion coefficient, the length of the pause inserted before the emphasized portion (ms), and the length of the pause inserted after the emphasized portion (ms) be a, α₁, α₂, p₁, and p₂, respectively. These values can then be obtained from the sensory level x by evaluating the following formulas.

[0016]

a = 0.6x² + 2.6x + 6.0 [dB]
α₁ = 0.77x² − 0.11x − 0.03
α₂ = 0.02x² + 0.04x + 1.07
p₁ = 44x² + 154x + 331 [ms]
p₂ = 20x² + 76x + 221 [ms]

In FIG. 1, the emphasis processing unit 9 extracts from the speech parameter memory 4 the speech parameters corresponding to the emphasis target position obtained by the speech-parameter correspondence conversion unit 6, and applies emphasis processing to the speech parameters of the target speech portion using the physical parameter conversion values (physical control parameters) for the emphasized portion obtained by the sensory quantity-physical quantity conversion unit 8. Emphasis processing can be realized by common techniques, such as changing the synthesis frame period for the speaking rate and changing the interval of the synthesis excitation pulses for the fundamental frequency (for example, Japanese Patent Application No. H3-180812, "Speech Emphasis Device").
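The approximating formulas for a, α₁, α₂, p₁, and p₂ can be evaluated directly, as in the sketch below. The coefficients are transcribed exactly as printed, and the class and function names are hypothetical. At x = 0.5 the formulas for a, α₂, p₁, and p₂ give about 7.45 dB, 1.095, 419 ms, and 264 ms, close to the values read from FIG. 3. The α₁ formula as printed gives about 0.11 at x = 0.5 rather than the 0.7 quoted for FIG. 3B, so its coefficients may have been garbled in reproduction; the code keeps them as printed and does not attempt a correction.

```python
from dataclasses import dataclass

@dataclass
class ControlParams:
    volume_db: float        # a: volume increase in dB
    rate_coeff: float       # alpha_1: speaking-rate coefficient (as printed; see note above)
    f0_coeff: float         # alpha_2: fundamental-frequency coefficient
    pause_before_ms: float  # p_1: pause before the emphasized portion
    pause_after_ms: float   # p_2: pause after the emphasized portion

def _quad(c2: float, c1: float, c0: float, x: float) -> float:
    """Evaluate c2*x^2 + c1*x + c0 by Horner's rule."""
    return (c2 * x + c1) * x + c0

def control_params(x: float) -> ControlParams:
    """Physical control parameters for a sensory level x in [-1, 1]."""
    if not -1.0 <= x <= 1.0:
        raise ValueError("sensory level must lie in [-1, 1]")
    return ControlParams(
        volume_db=_quad(0.6, 2.6, 6.0, x),
        rate_coeff=_quad(0.77, -0.11, -0.03, x),  # printed coefficients, unverified
        f0_coeff=_quad(0.02, 0.04, 1.07, x),
        pause_before_ms=_quad(44.0, 154.0, 331.0, x),
        pause_after_ms=_quad(20.0, 76.0, 221.0, x),
    )

print(control_params(0.5))
```

Storing five sets of three coefficients in place of five tables trades a small approximation error for a much smaller memory footprint and a continuous response to the slider.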

[0017] The speech synthesis unit 10 first refers to and joins the emphasis-processed speech parameters of the emphasized portion output from the emphasis processing unit 9 and, from among the speech parameters stored in the speech parameter memory 4, those of the portions other than the emphasized portion. Because the speech parameters of the emphasized portion have been modified by the emphasis processing, discontinuities arise at its boundaries. The speech parameters before and after the emphasized portion are therefore smoothed next to remove these discontinuities. Finally, the speech signal to be output is synthesized from the resulting final speech parameters and converted from digital to analog by an output device 11, and synthesized speech in which the section designated by the user has been emphasized to the degree specified as a sensory quantity is output from a loudspeaker or the like.
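The join-and-smooth step can be sketched as follows, assuming a single scalar parameter track (for example, a per-frame gain) held in a Python list. The patent does not specify the smoothing method; the three-point moving average applied to a few frames around each boundary is a stand-in chosen for illustration, and `smooth_boundary` and `splice_and_smooth` are hypothetical names.

```python
from typing import List

def smooth_boundary(track: List[float], boundary: int, width: int = 2) -> List[float]:
    """Apply a 3-point moving average to frames within `width` of `boundary`."""
    out = list(track)
    lo = max(1, boundary - width)
    hi = min(len(track) - 1, boundary + width)
    for i in range(lo, hi):
        out[i] = (track[i - 1] + track[i] + track[i + 1]) / 3.0
    return out

def splice_and_smooth(track: List[float], start: int, end: int,
                      emphasized: List[float], width: int = 2) -> List[float]:
    """Replace track[start:end] with the emphasized frames, then smooth both joins.

    The emphasized segment may be longer than the segment it replaces,
    for example after the speaking rate has been slowed.
    """
    joined = track[:start] + emphasized + track[end:]
    for boundary in (start, start + len(emphasized)):
        joined = smooth_boundary(joined, boundary, width)
    return joined

# Ten unemphasized frames; frames 3..6 are replaced by six louder frames.
result = splice_and_smooth([1.0] * 10, 3, 7, [2.0] * 6)
print([round(v, 3) for v in result])
```

In this example the step from 1.0 to 2.0 at each join is spread over two intermediate frames, while frames well inside the emphasized and unemphasized regions are unchanged.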

[0018] The invention has been described above as applied to rule-based speech synthesis. When it is applied to an analysis-synthesis method, which synthesizes and outputs speech analyzed and stored in advance, it can be realized with the same configuration except that a speech memory is used in place of the text memory 2 and the text-to-speech conversion unit 3 in FIG. 1; the speech memory stores the raw speech waveform, its power, or an analyzed and coded representation; and the input/output device 1 displays the contents of the speech memory (after synthesis, in the case of coded storage) so that the user can designate the portion to be emphasized. FIG. 2B shows an example of the screen 21 of the input/output device 1 for the analysis-synthesis method. In this case, the user indicates the portion to be emphasized within a speech waveform 26 or speech power 27 displayed on the screen 21 using an input means such as a mouse (the hatched portion below the speech waveform in FIG. 2B), turns the emphasis switch 23 ON, and then, as in the example of FIG. 2A, enters the perceptual emphasis level by operating the slide volume control 24 displayed on the screen with the mouse or the like. The emphasized section and emphasis level can also be entered by various input means other than those shown in this example, such as a keyboard, mouse, volume control, or joystick.

[0019]

[Effect of the Invention] As described above, the speech synthesis device of the present invention can output synthesized speech while freely varying, for a designated keyword or phrase, the emphasis level expressed in perceptual terms. As a result, the expressiveness of synthesized speech, which tends to be monotonous, is increased, and intentions and emotions can be conveyed by it. Moreover, even users without specialized knowledge of speech synthesis can freely control the output of richly expressive synthesized speech that carries their intent.

[Brief description of drawings]

FIG. 1 is a block diagram showing an embodiment of the speech synthesis device according to the present invention.

FIG. 2A is a diagram showing an example of the input screen when the invention is applied to the rule-based synthesis method, and FIG. 2B is a diagram showing an example of the input screen when it is applied to the analysis-synthesis method.

FIG. 3 shows graphs of the various correspondences between the sensory level and the physical parameter conversion values.

Claims

[Claims]

1. A speech synthesis device that synthesizes and outputs speech from stored speech or from text, comprising: emphasized-portion designation input means for designating and inputting a portion of the output speech to be emphasized; emphasis level input means for designating and inputting the degree of emphasis of the emphasized portion; physical control parameter conversion means for converting the input emphasis level into a plurality of physical control parameters for speech synthesis control; and speech output means for synthesizing and outputting speech in which the designated portion is emphasized at the designated emphasis level using the physical control parameters.