JP3575919B2

JP3575919B2 - Text-to-speech converter

Info

Publication number: JP3575919B2
Application number: JP16288696A
Authority: JP
Inventors: 薫塚本
Original assignee: Oki Electric Industry Co Ltd
Current assignee: Oki Electric Industry Co Ltd
Priority date: 1996-06-24
Filing date: 1996-06-24
Publication date: 2004-10-13
Anticipated expiration: 2016-06-24
Also published as: JPH1011083A

Description

【０００１】
【発明の属する技術分野】
本発明は、入力された文字列情報を基に音声を合成して出力するテキスト音声変換装置に関する。
【０００２】
【従来の技術】
テキストデータ等の文字情報を入力とし、それを音声に変換して出力するテキスト音声変換装置は、出力語彙の制限がないことから、録音再生型の音声合成技術にとって代わる音声合成技術として種々の利用分野での応用が期待できる。
【０００３】
例えば、ワードプロセッサ等で作成されたテキストデータを音声に変換して出力させ文章の校正に利用することもできる。また、テキストを編集するだけで、簡単に応答メッセージを作成、変更をすることができる特徴を生かして、電話等の通信サービスなどでも利用することができる。
【０００４】
図２は、日本語（漢字かな混じり文）を入力とした従来のテキスト音声変換装置（日本語テキスト音声変換）の構成を示している。以下、この図２を参照しながら、従来装置の概要を説明する。
【０００５】
図２において、テキスト解析部１０１では発音辞書１０２を利用して、文字情報入力部１００より入力された漢字かな混じり文から、音韻韻律記号列を生成する。ここで、音韻韻律記号列とは、入力文の読み、アクセント、イントネーション等を文字列として記述したもので、中間言語と呼ばれる。各単語の読みとアクセントは、発音辞書１０２に登録されており、テキスト解析部１０１はこの発音辞書１０２を参照しながら、音韻韻律記号列を生成する。
【０００６】
合成パラメータ生成部１０３では、音韻韻律記号列に基づき、音声素片（音の種類）を取り出し、予め定められた規則より、音韻継続時間（音の長さ）、基本周波数（音の高さ）パターンといった音声合成用のパラメータ（以下、合成パラメータと呼ぶ）を生成する。
【０００７】
このうち音声素片は、単語等を発声したときの発声データから分析生成されるもので、合成のための音声の基本単位であり、これらを重ね合わせて行くことによって、合成波形が生成される。なお、以下ではＣＶ（子音−母音）、ＶＣＶ（母音−子音−母音）等の音声の基本要素の組み合わせ自体を音声単位と呼び、その音声単位の波形を実現する要素を音声素片と呼ぶ。各音声単位は、例えば複数の音声素片でなる組に対応する。音声素片データはＲＯＭ等でなる音声素片データ記憶部１０４に格納されており、合成パラメータ生成部１０３は、音韻韻律記号列から音声単位を認識して対応する音声素片データを取り出す。
【０００８】
音声合成部１０５は、合成パラメータ生成部１０３が生成した合成パラメータに基づいて、合成波形（音声信号）を生成する。このような合成音声信号が、スピーカ−を通して音声出力されたり、通信回線を介して他の装置に伝送されたりする。
【０００９】
第２の従来例として、上述した従来例（第１の従来例と呼ぶ）では、予め定められた規則によって基本周波数パタン、音の高さ、ポーズの長さ、音韻継続時間等の合成パラメータを決定していたものを、自然性を高めるために、実音声の韻律的特徴を分析した結果を統計的に処理して抽出した韻律的特徴パラメータを用いて、合成音声パラメータを与える方法がある。
【００１０】
【発明が解決しようとする課題】
第１の従来例においては、合成パラメータは、入力されたテキストが変換された音韻記号列に応じて、予め定められた規則に従って決定されるものであり、自然音声に比べると単調である。
【００１１】
また、第２の従来例においても、自然音声を分析して、基本周波数パターン、音の高さ、ポーズの長さ、パワー、音韻継続時間等の韻律的特徴パラメータを抽出して用いてはいるが、一つの発話スタイルの韻律的特徴パラメータの組を用いるだけでは、論文、小説、会話等の多様な発話スタイルを自由に表現できないという課題があった。
【００１２】
そこで、複数の発話スタイルの韻律的特徴パラメータを用意し、合成パラメータ生成の際に切り替えて用いるものも既に提案されているが、相互の発話スタイルの関係は明らかではなかったので、発話スタイルの違いを強調したり、弱めたりとユーザが調節することはできなかった。
【００１３】
そのため、予め定められた基準を用いて、合成音声の韻律的特徴を生成する手段を持つテキスト音声変換装置において、発話スタイルの違いによって現れる韻律的な特徴量の違いを強調あるいは弱め、多様な発話スタイルの合成音の生成を可能とし、ユーザの好みに合った韻律パターンで読み上げることのできるテキスト音声変換装置が求められている。
【００１８】
【課題を解決するための手段】
上記課題を解決するため、第２の本発明のテキスト音声変換装置では、入力された文字情報を音声信号に変換するテキスト音声変換装置において、少なくとも通常スタイル、朗読調を含む複数の発話スタイルにおける特徴を保持する韻律パラメータテーブルと、発話スタイルを選択する発話スタイル指定部と、発話スタイルの強調度を指定する強調度指定部と、発話スタイル指定部によって選択された発話スタイルと基準発話スタイルのそれぞれの韻律パラメータの差分を計算する差分計算部と、強調度指定部によって指定された強調度及び差分に応じて韻律パラメータを補正する韻律パラメータ調整手段とを備える。
【００１９】
ここで、韻律パラメータ調整手段によって補正される韻律パラメータは少なくとも音韻継続時間もしくはピッチパターンにすることが望ましい。
【００２０】
このように本発明のテキスト音声変換装置では、韻律パラメータをユーザの好みに応じて変更度合いを調整しながら変更することができ、よりユーザの好みに合った合成音声を得ることができる。
【００２１】
【発明の実施の形態】
以下、本発明によるテキスト音声変換装置を、日本語文を対象とした装置に適用した第１の実施形態を図面を参照しながら詳述する。ここで、図１が、この第１の実施形態のテキスト音声変換装置の全体構成を示すブロック図である。
【００２２】
図１において、第１の実施形態のテキスト音声変換装置は、文字情報入力部１０、テキスト解析部１１、発音辞書１２、合成パラメータ生成部１３、音声素片データ記憶部１４、音声合成部１５、発話スタイル変更手段としての合成パラメータ変更手段１６及び発話スタイル指定部１７を備えている。
【００２３】
ここで、文字情報入力部１０、テキスト解析部１１、発音辞書１２、合成パラメータ生成部１３、音声素片データ記憶部１４、音声合成部１５は、従来のテキスト音声変換装置と同一の動作を行なうものであり、詳細な説明は省略する。
【００２４】
この実施形態では、朗読調から会話調へ変化させる場合を例にして説明する。なお、発話スタイルとしては、通常スタイル、朗読調スタイル、会話調スタイル、アナウンサー調スタイル等が他にもあげられる。
【００２５】
合成パラメータ生成部１３は音韻記号列に基づいて対応する音声素片データを音声素片データ記憶部１４から取り出し、音韻の継続時間や、ポーズ長、パワーや基本周波数パターンといった音声合成用韻律パラメータを生成する。
【００２６】
そして、発話スタイル指定部１７には、朗読スタイルから会話スタイル度までの複数の発話スタイルから使用したい１つの発話スタイルを指定できるスイッチが設けられている。
【００２７】
図４に示すのは発話スタイル指定部１７をソフトウェア的に形成した例であり、スクロールバーの左端が最も朗読調の発話スタイルを示すスタイル１を示し、右に行くに従って会話調の度合い（会話調度と定義する）が高くなり、右端が最も会話調に近い発話スタイルを示すスタイル１０を示している。１０段階のスクロールバーのバーをスライドさせ、目的の発話スタイルを選択できる。この図４ではバーはスタイル６のところを示している。
【００２８】
合成パラメータ変更手段１６では発話スタイル指定部１７でのユーザの指定に従って、音声合成用韻律パラメータを変形する。この第１の実施形態の場合、変更される合成パラメータは１モーラ当りの平均の長さである。合成パラメータ変更手段１６では他にも、音韻継続時間、基本周波数パターン、音の高さ、パワーといった韻律的特徴を変形することが可能である。
【００２９】
次に、第１の実施形態のテキスト音声変換装置の詳細動作を図３のフローチャートを用いて説明する。
【００３０】
まず、文字情報（漢字かな混じり文等のテキストデータ）を取り込み（ステップ２０１）、その文字情報を解析して、１フレーズ毎に、音韻韻律記号列に変換する（ステップ２０２）。
【００３１】
次に、音韻韻律記号列に従って音声素片データ記憶部１４より順次使用する音声素片データを取り出す（ステップ２０３）。そして、フレーズ毎に、音韻韻律記号列に基づいて韻律パラメータ（音韻継続時間、基本周波数パターン、パワー等を規定するパラメータ）を生成する（ステップ２０４）。次に、合成パラメータ変更手段１６では、ステップ２０４で生成された合成パラメータを発話スタイル指定部１７の指定に従って変更する（ステップ２０５）。
【００３２】
合成パラメータの変更方法を説明する。朗読音声と会話音声を比較した際、両者の間には様々な韻律的特徴が存在する。まず、朗読音声と会話音声では、会話音声の方が、韻律パラメータの変動が大きい。例えば、ピッチ、パワー、１モーラ当りの平均的な継続時間や、ポーズ長が会話音声の方が朗読音声よりも大きく変動する。
【００３３】
一例として、朗読音声と、会話もしくは対話音声の、韻律句（フレーズ）内モーラ数毎のモーラ長を比較した場合、朗読調はモーラ長がほぼ一定であるのに対し、会話調では１フレーズ内のモーラ数が少なくなるほど１モーラ当りの平均継続時間が長くなる傾向がある。
【００３４】
このことに対しては、日本音響学会講演論文集１９９５．３１−４−６に記載された渡辺等の「朗読及び対話音声における時間構造の検討」と題する論文に記載されている。
【００３５】
ここで、モーラとは、ほぼ仮名１文字に相当するなど時間的なリズムの単位である。
【００３６】
第１の実施形態では、この特徴を基に、会話調度が高いほど、フレーズ内のモーラ数毎の平均モーラ長が長くなるように合成パラメータを変更する。例えば、１０モーラのフレーズ長を基準にモーラ長継続時間を±１．５倍差をつけたいときには、ｔを朗読調の１モーラ当りの平均継続時間、ｎを１フレーズのモーラ数として、求める継続時間ｔ’は、
ｔ’＝−（ｔ／２０）×ｎ＋１．５ｔ
として、各々のモーラの継続時間長を変換する。また、より会話らしく変化をつけたいときには、１、２割伸縮させるなどし、その度合いをユーザが任意に指定できる。
【００３７】
以上のようにして、韻律パラメータと音声素片データからなる合成パラメータが決定されると、音声信号を合成して（ステップ２０６）出力する（ステップ２０７）。出力方法は、スピーカーからでも通信回線を通じた他の装置への伝送でも良い。
【００３８】
以上のようにして、第１の実施態様のテキスト音声変換装置では、予め定められた基準を用いて、合成音声の韻律的特徴を生成する手段を持つテキスト音声変換装置において、通常の読み上げ調（ないしは朗読調）と会話調などの他の発話スタイルとの違いによって現れる韻律的特徴量を、強調ないしは弱め、通常発話スタイルから、ある選択された度合いの発話スタイルの合成音の生成を可能とし、ユーザの好みに合った韻律パターンで読み上げることのできるテキスト音声変換装置を実現できる。
【００３９】
次に、本発明によるテキスト音声変換装置を、日本語文を対象とした装置に適用した第２の実施形態を説明する。
【００４０】
第２の実施形態においては、入力された文字情報を、複数の発話スタイルで発声された自然音声を、音韻の種類別継続時間、ポーズ長、パワー変動量、ピッチパターン変動量（音の高低の差等）などの、韻律パラメータ毎に分析して作成した韻律パラメータテーブルを用いて、合成パラメータを生成し、音声信号に変換するテキスト音声変換装置において、ユーザが選択した発話スタイルに従って決定された韻律パラメータを、朗読調の韻律パラメータと比較し、その差分を求め、発話スタイルの持つ韻律パラメータの特徴を強調ないしは弱める手段を設けたものである。
【００４１】
韻律パラメータの例としては、音韻継続時間であれば、各音韻の種類毎に、前後の音韻の環境や、語頭、語中、文末などのフレーズ位置、モーラ位置毎に分析し、それぞれの音韻継続時間を分析したものとなる。
【００４２】
以下、この第２の実施形態にかかるテキスト音声変換装置を図５を用いて説明する。なお、この第２の実施形態については、音韻の種類別継続時間、ポーズ長、パワー変動量、ピッチパターンなどの韻律パラメータのうち音韻継続時間を変更する場合を例にして説明する。また、この第２の実施形態では基準発話スタイルとして朗読調の発話スタイルを用いている。
【００４３】
第２の実施形態のテキスト音声変換装置は、文字情報入力部１０、テキスト解析部１１、発音辞書１２、合成パラメータ生成部１３、音声素片データ記憶部１４、音声合成部１５、複数継続時間テーブル１６、発話スタイル指定部１７、音韻継続時間の変更を行なう発話スタイル強調部２０、発話スタイル強調度指定部１９を備えている。
【００４４】
文字情報入力部１０、テキスト解析部１１、発音辞書１２、合成パラメータ生成部１３、音声素片データ記憶部１４、音声合成部１５は、従来の構成と同一動作を行なうものであるので、詳細な説明は省略する。
【００４５】
合成パラメータ生成部１３は、音韻記号列に基づいて対応する音声素片データを音声素片データ記憶部１４から取り出し、発話スタイル指定部１７によって指定された発話スタイルの音韻継続時間テーブルを参照して音韻の継続時間を決定し、ポーズ長、パワーや基本周波数パターンといった、音声合成用韻律パラメータを生成する。
【００４６】
そして、発話スタイル強調度指定部１９には、朗読スタイルから発話スタイル指定部１７で指定した発話スタイル度を強調できるスイッチが設けられており、朗読継続時間テーブルを参照して定められた音韻継続時間と指定された発話スタイルでの音韻継続時間を比較して、発話スタイル強調度指定部１９によって指定された度合いによってその差分を発話スタイル強調部２０で強調する。
【００４７】
次に、第２の実施形態のテキスト音声変換装置の動作を図６のフローチャートを用いて説明する。
【００４８】
まず、文字情報（漢字かな混じり文等の、テキストデータ）を取り込み（ステップ６０１）、その文字情報を解析して、１フレーズ毎に音韻韻律記号列に変換する（ステップ６０２）。次に、音韻韻律記号列に従って、音声素片データ記憶部１４より順次使用する音声素片を取り出す（ステップ６０３）。そして、フレーズ毎に、音韻韻律記号列に基づいて、発話スタイル指定部１７によって指定された発話スタイルの継続時間テーブルと、基準発話スタイルである朗読調の発話スタイルの継続時間テーブルを参照して、音韻継続時間を決定し、合成パラメータ（音韻継続時間、基本周波数パターン、パワー等を規定するパラメータ）を指定スタイルと朗読調の２種類生成する（ステップ６０４）。このとき、継続時間テーブルは、予め自然音声を分析した要因（当確音韻の種類、前後環境、フレーズ位置、フレーズ内モーラ位置等）で継続時間が参照され決定される。
【００４９】
次に、発話スタイル強調部２０では、発話スタイル強調度指定部１９で指定された度合いによって、指定発話スタイル継続時間（Ｔｎとする）と、朗読調継続時間（Ｔｓとする）の差分を、強調して音韻継続時間を変更する。例えば、強調係数をαとして、最終的音韻継続時間Ｔは
Ｔ＝Ｔｓ＋α（Ｔｎ／Ｔｓ−１）Ｔｓ
と計算できる。強調係数αは強調部指定部１９で指定された度合いによって０から数倍まで変化させて用いれば良い（ステップ６０５）。
【００５０】
以上のようにして、韻律パラメータと音声素片データからなる合成パラメータが決定されると、音声信号を合成して（ステップ６０６）出力する（ステップ６０７）。出力方法は、スピーカ−からの出力でも、通信回線を通じた他の装置への伝送でも良い。
【００５１】
以上の第２の実施形態のテキスト音声変換装置によれば、ユーザの好みに応じて、音韻継続時間を変更して発話スタイルを変更させることができる。
【００５２】
なお、上記各実施形態においては、日本語文を対象としたテキスト音声変換装置を示したが、他の言語文を対象としたテキスト音声変換装置に本発明を適用できることは勿論である。
【００５３】
【発明の効果】
以上のように、本発明によれば、発話スタイルの違いによって現れる韻律的な特徴量の違いを強調、あるいは弱め、多様な発話スタイルの合成音の生成を可能とし、ユーザの好みに合った韻律パターンで読み上げることのできるテキスト音声変換装置を実現できる。
【図面の簡単な説明】
【図１】第１の実施形態のテキスト音声変換装置を示すブロック図である。
【図２】従来のテキスト音声変換装置を示す図である。
【図３】図１のテキスト音声変換装置の動作を示すフローチャートである。
【図４】図１のテキスト音声変換装置の会話スタイル指定部１７の説明図である。
【図５】第２の実施形態のテキスト音声変換装置を示すブロック図である。
【図６】図５のテキスト音声変換装置の動作を示すフローチャートである。
【符号の説明】
１０…文字情報入力部、１１…テキスト解析部、１２…発音辞書、１３…合成パラメータ生成部、１４…音声素片データ記憶部、１５…音声合成部、１６…合成パラメータ変更手段、１７…発話スタイル指定部、１８…韻律パラメータテーブル、１９…発話スタイル強調度指定部、２０…発話スタイル強調部。
[0001]
[Technical field of the invention]
The present invention relates to a text-to-speech converter that synthesizes and outputs speech based on input character string information.
[0002]
[Prior art]
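A text-to-speech converter takes character information such as text data as input, converts it into speech, and outputs it. Because it places no restriction on the output vocabulary, it is expected to find application in various fields as a speech synthesis technology that replaces recording-and-playback speech synthesis.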
[0003]
For example, text data created by a word processor or the like can be converted into voice and output, and used for proofreading of sentences. Also, by utilizing the feature that a response message can be easily created and changed only by editing the text, it can be used in communication services such as telephones.
[0004]
FIG. 2 shows the configuration of a conventional text-to-speech conversion device (Japanese text-to-speech conversion) that takes Japanese text (sentences mixing kanji and kana) as input. Hereinafter, an outline of the conventional apparatus will be described with reference to FIG. 2.
[0005]
In FIG. 2, a text analysis unit 101 uses a pronunciation dictionary 102 to generate a phoneme prosody symbol string from a mixed kanji-kana sentence input from a character information input unit 100. Here, the phoneme prosody symbol string describes the reading, accent, intonation, and the like of the input sentence as a character string, and is called an intermediate language. The reading and accent of each word are registered in the pronunciation dictionary 102, and the text analysis unit 101 generates the phoneme prosody symbol string while referring to the pronunciation dictionary 102.
[0006]
The synthesis parameter generation unit 103 retrieves speech units (the type of sound) based on the phoneme prosody symbol string and, following predetermined rules, generates parameters for speech synthesis (hereinafter referred to as synthesis parameters) such as the phoneme duration (the length of a sound) and the fundamental frequency (the pitch of a sound) pattern.
[0007]
Of these, speech units are obtained by analyzing utterance data recorded when words or the like are uttered. They are the basic units of speech for synthesis, and a synthesized waveform is generated by superimposing them. In the following, a combination of basic elements of speech such as CV (consonant-vowel) or VCV (vowel-consonant-vowel) is itself called a phonetic unit, and an element that realizes the waveform of a phonetic unit is called a speech unit. Each phonetic unit corresponds to, for example, a set of a plurality of speech units. The speech unit data is stored in a speech unit data storage unit 104 such as a ROM, and the synthesis parameter generation unit 103 recognizes the phonetic units in the phoneme prosody symbol string and retrieves the corresponding speech unit data.
[0008]
The speech synthesis unit 105 generates a synthesized waveform (speech signal) based on the synthesis parameters generated by the synthesis parameter generation unit 103. Such a synthesized voice signal is output as voice through a speaker, or transmitted to another device via a communication line.
[0009]
In the conventional example described above (referred to as the first conventional example), synthesis parameters such as the fundamental frequency pattern, pitch, pause length, and phoneme duration are determined by predetermined rules. As a second conventional example, there is a method that, in order to improve naturalness, instead supplies the synthesis parameters using prosodic feature parameters extracted by statistically processing the results of analyzing the prosodic features of real speech.
[0010]
[Problems to be solved by the invention]
In the first conventional example, the synthesis parameters are determined by predetermined rules from the phoneme symbol string into which the input text is converted, so the resulting speech is monotonous compared with natural speech.
[0011]
Also in the second conventional example, natural speech is analyzed to extract and use prosodic feature parameters such as a fundamental frequency pattern, a pitch, a pause length, power, and a phoneme duration. However, there is a problem that various utterance styles such as papers, novels, and conversations cannot be freely expressed by using only a set of prosodic feature parameters of one utterance style.
[0012]
Systems that prepare prosodic feature parameters for a plurality of utterance styles and switch among them when generating synthesis parameters have therefore already been proposed. However, because the relationship between the utterance styles was not clear, the user could not adjust the output to emphasize or weaken the differences between utterance styles.
[0013]
Therefore, for a text-to-speech conversion device having means for generating the prosodic features of synthesized speech using predetermined criteria, there is a need for a device that emphasizes or weakens the differences in prosodic feature quantities arising from differences in utterance style, that can generate synthesized speech in a variety of utterance styles, and that can read text aloud with a prosodic pattern suited to the user's preference.
[0018]
[Means for Solving the Problems]
To solve the above problems, the text-to-speech converter of the second aspect of the present invention, which converts input character information into a speech signal, comprises: a prosodic parameter table that holds the characteristics of a plurality of utterance styles including at least a normal style and a reading style; an utterance style designation unit for selecting an utterance style; an emphasis degree designation unit for designating the degree of emphasis of the utterance style; a difference calculation unit that calculates the difference between the prosodic parameters of the utterance style selected by the utterance style designation unit and those of a reference utterance style; and prosodic parameter adjustment means that corrects the prosodic parameters according to the difference and the emphasis degree designated by the emphasis degree designation unit.
[0019]
Here, it is desirable that the prosody parameter corrected by the prosody parameter adjusting means be at least a phoneme duration or a pitch pattern.
[0020]
As described above, in the text-to-speech conversion apparatus of the present invention, the prosodic parameters can be changed while adjusting the degree of change according to the user's preference, and a synthesized speech more suited to the user's preference can be obtained.
[0021]
[Embodiments of the invention]
Hereinafter, a first embodiment in which a text-to-speech conversion device according to the present invention is applied to a device for Japanese sentences will be described in detail with reference to the drawings. Here, FIG. 1 is a block diagram showing the entire configuration of the text-to-speech converter of the first embodiment.
[0022]
In FIG. 1, the text-to-speech conversion apparatus according to the first embodiment includes a character information input unit 10, a text analysis unit 11, a pronunciation dictionary 12, a synthesis parameter generation unit 13, a speech unit data storage unit 14, a speech synthesis unit 15, synthesis parameter changing means 16 serving as utterance style changing means, and an utterance style designation unit 17.
[0023]
Here, the character information input unit 10, the text analysis unit 11, the pronunciation dictionary 12, the synthesis parameter generation unit 13, the speech unit data storage unit 14, and the speech synthesis unit 15 perform the same operations as the conventional text-to-speech conversion device. Therefore, detailed description is omitted.
[0024]
In this embodiment, a case where the utterance style is changed from reading style to conversational style will be described as an example. Possible utterance styles include a normal style, a reading style, a conversational style, an announcer style, and others.
[0025]
The synthesis parameter generation unit 13 retrieves the corresponding speech unit data from the speech unit data storage unit 14 based on the phoneme symbol string, and generates prosodic parameters for speech synthesis such as the phoneme duration, the pause length, the power, and the fundamental frequency pattern.
[0026]
The utterance style designation unit 17 is provided with a switch for designating the one utterance style to be used from among a plurality of utterance styles ranging from reading style to conversational style.
[0027]
FIG. 4 shows an example in which the utterance style designation unit 17 is implemented in software. The left end of the scroll bar corresponds to style 1, the most reading-like utterance style. The degree of conversational style (defined as the conversation degree) increases toward the right, and the right end corresponds to style 10, the utterance style closest to conversational speech. A desired utterance style is selected by sliding the bar of the 10-step scroll bar. In FIG. 4, the bar is positioned at style 6.
[0028]
The synthesis parameter changing means 16 modifies the prosodic parameters for speech synthesis in accordance with the user's designation in the utterance style designation unit 17. In the first embodiment, the synthesis parameter that is changed is the average length per mora. The synthesis parameter changing means 16 can also modify other prosodic features such as the phoneme duration, the fundamental frequency pattern, the pitch, and the power.
[0029]
Next, the detailed operation of the text-to-speech conversion apparatus of the first embodiment will be described with reference to the flowchart of FIG. 3.
[0030]
First, character information (text data such as mixed kanji-kana sentences) is read in (step 201), and the character information is analyzed and converted into a phoneme prosody symbol string phrase by phrase (step 202).
[0031]
Next, the speech unit data to be used is retrieved in sequence from the speech unit data storage unit 14 in accordance with the phoneme prosody symbol string (step 203). Then, for each phrase, prosodic parameters (parameters defining the phoneme duration, fundamental frequency pattern, power, and so on) are generated based on the phoneme prosody symbol string (step 204). Next, the synthesis parameter changing means 16 changes the synthesis parameters generated in step 204 in accordance with the designation of the utterance style designation unit 17 (step 205).
[0032]
The method for changing the synthesis parameters will now be described. When reading speech and conversational speech are compared, various prosodic differences appear between them. First, the prosodic parameters vary more in conversational speech than in reading speech. For example, the pitch, the power, the average duration per mora, and the pause length all fluctuate more in conversational speech than in reading speech.
[0033]
As an example, when the mora length is compared between reading speech and conversational or dialogue speech as a function of the number of morae in a prosodic phrase, the mora length is almost constant in reading style, whereas in conversational style the average duration per mora tends to become longer as the number of morae in a phrase decreases.
[0034]
This is described in the paper by Watanabe et al. entitled "Examination of time structure in reading and dialogue speech", published in the Proceedings of the Acoustical Society of Japan, 1995.31-4-6.
[0035]
Here, a mora is a unit of temporal rhythm that corresponds roughly to one kana character.
[0036]
In the first embodiment, based on this feature, the synthesis parameters are changed so that the higher the conversation degree, the longer the average mora length becomes as a function of the number of morae in the phrase. For example, to give the mora duration a difference of up to ±1.5 times around a reference phrase length of 10 morae, let t be the average duration per mora in reading style and n be the number of morae in one phrase. The target duration t' is then
t' = −(t / 20) × n + 1.5t
and the duration of each mora is converted accordingly. To make the speech sound more conversational, the durations may additionally be stretched or compressed by 10 to 20 percent, for example, and the user can freely specify the degree.
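As a minimal sketch of the conversion above, the following Python functions evaluate t' = −(t / 20) × n + 1.5t. The function names, the millisecond unit, the lower clamp, and the linear blending by slider level are assumptions added for illustration; the text itself specifies only the formula and that a higher conversation degree should lengthen the average mora.

```python
def conversational_mora_duration(t_ms: float, n_morae: int,
                                 floor_ratio: float = 0.1) -> float:
    """Return t' = -(t/20)*n + 1.5*t for a phrase of n morae.

    t_ms is the reading-style average duration per mora. The formula
    yields t' = t at n = 10, longer morae for shorter phrases, and
    shorter morae for longer phrases.
    """
    if t_ms <= 0 or n_morae < 1:
        raise ValueError("t_ms must be positive and n_morae at least 1")
    t_prime = -(t_ms / 20.0) * n_morae + 1.5 * t_ms
    # Assumption: the formula reaches zero at n = 30, so clamp to a
    # small positive floor instead of returning a non-physical value.
    return max(t_prime, floor_ratio * t_ms)


def blend_by_style_level(t_ms: float, n_morae: int, level: int) -> float:
    """Hypothetical mapping of the 10-step style selector to a duration.

    Level 1 (most reading-like) returns t unchanged; level 10 (most
    conversational) returns the full t'. Linear blending in between is
    an assumption, not something the text prescribes.
    """
    if not 1 <= level <= 10:
        raise ValueError("level must be between 1 and 10")
    degree = (level - 1) / 9.0
    t_prime = conversational_mora_duration(t_ms, n_morae)
    return t_ms + degree * (t_prime - t_ms)


if __name__ == "__main__":
    for n in (2, 10, 20):
        print(n, conversational_mora_duration(100.0, n))
```

With t = 100 ms, the formula gives 140 ms per mora for a 2-mora phrase, 100 ms for a 10-mora phrase, and 50 ms for a 20-mora phrase, which matches the tendency described for conversational speech.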
[0037]
When the synthesis parameters, consisting of the prosodic parameters and the speech unit data, have been determined as described above, the speech signal is synthesized (step 206) and output (step 207). The output may be delivered through a speaker or transmitted to another device via a communication line.
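The flow of steps 201 to 207 can be sketched as a chain of stages applied phrase by phrase. The following Python skeleton is only an illustration: the `Phrase` type, the `run_pipeline` function, and the callable signatures are hypothetical, and each stage is passed in by the caller because the text does not specify how the individual units are implemented.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Phrase:
    symbols: str      # phoneme prosody symbol string for one phrase
    mora_count: int   # number of morae in the phrase


Params = Dict[str, float]


def run_pipeline(
    text: str,                                               # step 201 input
    analyze: Callable[[str], List[Phrase]],
    fetch_units: Callable[[Phrase], List[str]],
    generate_params: Callable[[Phrase], Params],
    modify_params: Callable[[Phrase, Params], Params],
    synthesize: Callable[[List[str], Params], List[float]],
) -> List[List[float]]:
    """Run steps 202 to 206 phrase by phrase and return the waveforms."""
    waveforms = []
    for phrase in analyze(text):                      # step 202
        units = fetch_units(phrase)                   # step 203
        params = generate_params(phrase)              # step 204
        params = modify_params(phrase, params)        # step 205
        waveforms.append(synthesize(units, params))   # step 206
    return waveforms                                  # step 207 is left to the caller


if __name__ == "__main__":
    # Toy stages, purely for demonstration.
    result = run_pipeline(
        "ab/cde",
        analyze=lambda s: [Phrase(p, len(p)) for p in s.split("/")],
        fetch_units=lambda p: list(p.symbols),
        generate_params=lambda p: {"mora_ms": 100.0},
        modify_params=lambda p, q: {"mora_ms": q["mora_ms"] * 2.0},
        synthesize=lambda u, q: [q["mora_ms"]] * len(u),
    )
    print(result)
```

Keeping step 205 as a separate callable mirrors the description, in which the synthesis parameter changing means 16 is the only stage that differs from the conventional device.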
[0038]
As described above, in a text-to-speech conversion apparatus having means for generating the prosodic features of synthesized speech using predetermined criteria, the first embodiment emphasizes or weakens the prosodic feature quantities that arise from the difference between the normal read-aloud style (or reading style) and other utterance styles such as conversational style. This makes it possible to generate synthesized speech in an utterance style of a selected degree, starting from the normal utterance style, and thus realizes a text-to-speech conversion device that can read text aloud with a prosodic pattern suited to the user's preference.
[0039]
Next, a description will be given of a second embodiment in which the text-to-speech conversion device according to the present invention is applied to a device for Japanese sentences.
[0040]
The second embodiment concerns a text-to-speech conversion device that generates synthesis parameters for the input character information using a prosodic parameter table and converts them into a speech signal. The table is created by analyzing natural speech uttered in a plurality of utterance styles for each prosodic parameter, such as the duration by phoneme type, the pause length, the amount of power variation, and the amount of pitch pattern variation (for example, the difference between high and low pitch). In this device, means is provided that compares the prosodic parameters determined for the utterance style selected by the user with the prosodic parameters of the reading style, obtains the difference, and emphasizes or weakens the prosodic features of the selected utterance style.
[0041]
As an example of a prosodic parameter, the phoneme duration is obtained by analyzing, for each phoneme type, the durations according to the preceding and following phoneme environment, the phrase position (such as word-initial, word-medial, or sentence-final), and the mora position.
[0042]
Hereinafter, the text-to-speech converter according to the second embodiment will be described with reference to FIG. 5. The second embodiment is described taking as an example the case where, among prosodic parameters such as the duration by phoneme type, the pause length, the amount of power variation, and the pitch pattern, the phoneme duration is changed. In the second embodiment, the reading style is used as the reference utterance style.
[0043]
The text-to-speech conversion device according to the second embodiment includes a character information input unit 10, a text analysis unit 11, a pronunciation dictionary 12, a synthesis parameter generation unit 13, a speech unit data storage unit 14, a speech synthesis unit 15, a plurality of duration tables 16, an utterance style designation unit 17, an utterance style emphasis unit 20 that changes the phoneme duration, and an utterance style emphasis degree designation unit 19.
[0044]
The character information input unit 10, the text analysis unit 11, the pronunciation dictionary 12, the synthesis parameter generation unit 13, the speech unit data storage unit 14, and the speech synthesis unit 15 perform the same operations as in the conventional configuration, so detailed description is omitted.
[0045]
The synthesis parameter generation unit 13 retrieves the corresponding speech unit data from the speech unit data storage unit 14 based on the phoneme symbol string, determines the phoneme durations by referring to the phoneme duration table of the utterance style designated by the utterance style designation unit 17, and generates prosodic parameters for speech synthesis such as the pause length, the power, and the fundamental frequency pattern.
[0046]
The utterance style emphasis degree designation unit 19 is provided with a switch for emphasizing, relative to the reading style, the degree of the utterance style designated by the utterance style designation unit 17. The phoneme duration determined by referring to the reading-style duration table is compared with the phoneme duration in the designated utterance style, and the utterance style emphasis unit 20 emphasizes the difference according to the degree designated by the utterance style emphasis degree designation unit 19.
[0047]
Next, the operation of the text-to-speech converter of the second embodiment will be described with reference to the flowchart of FIG. 6.
[0048]
First, character information (text data such as mixed kanji-kana sentences) is read in (step 601), and the character information is analyzed and converted into a phoneme prosody symbol string phrase by phrase (step 602). Next, the speech units to be used are retrieved in sequence from the speech unit data storage unit 14 in accordance with the phoneme prosody symbol string (step 603). Then, for each phrase, the phoneme durations are determined based on the phoneme prosody symbol string by referring to both the duration table of the utterance style designated by the utterance style designation unit 17 and the duration table of the reading style, which is the reference utterance style. Two sets of synthesis parameters (parameters defining the phoneme duration, fundamental frequency pattern, power, and so on) are thus generated, one for the designated style and one for the reading style (step 604). At this time, each duration is looked up in the duration table according to factors obtained by analyzing natural speech in advance (the type of the phoneme in question, its preceding and following environment, its phrase position, its mora position within the phrase, and so on).
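The table lookup in step 604 can be pictured as a dictionary keyed by the analysis factors listed above. The sketch below is a hypothetical illustration in Python: the `PhonemeContext` fields, the style names, the fallback default, and all numeric values are assumptions, since the text does not give the table layout or any measured durations.

```python
from typing import Dict, NamedTuple, Tuple


class PhonemeContext(NamedTuple):
    phoneme: str          # the phoneme in question
    prev_phoneme: str     # preceding environment
    next_phoneme: str     # following environment
    phrase_position: str  # e.g. "initial", "medial", "final"
    mora_position: int    # mora index within the phrase


DurationTable = Dict[PhonemeContext, float]


def lookup_durations(
    tables: Dict[str, DurationTable],
    style: str,
    reference_style: str,
    context: PhonemeContext,
    default_ms: float = 80.0,
) -> Tuple[float, float]:
    """Return (Tn, Ts): the duration in the designated style and in the
    reference style for the same context.

    Assumption: a context missing from a table falls back to default_ms.
    """
    tn = tables[style].get(context, default_ms)
    ts = tables[reference_style].get(context, default_ms)
    return tn, ts


if __name__ == "__main__":
    ctx = PhonemeContext("a", "k", "s", "medial", 2)
    tables = {
        "reading": {ctx: 80.0},        # made-up value
        "conversation": {ctx: 100.0},  # made-up value
    }
    print(lookup_durations(tables, "conversation", "reading", ctx))
```

Returning both values together reflects the description, in which the designated-style and reading-style durations are generated as a pair so that their difference can be emphasized in the next step.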
[0049]
Next, the utterance style emphasis unit 20 changes the phoneme duration by emphasizing the difference between the duration in the designated utterance style (Tn) and the duration in reading style (Ts), according to the degree designated by the utterance style emphasis degree designation unit 19. For example, with α as the emphasis coefficient, the final phoneme duration T can be calculated as
T = Ts + α(Tn / Ts − 1)Ts
The emphasis coefficient α may be varied from 0 up to a value of several, according to the degree designated by the utterance style emphasis degree designation unit 19 (step 605).
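A short worked example may make the role of α clearer. The expression T = Ts + α(Tn / Ts − 1)Ts is algebraically equal to Ts + α(Tn − Ts), so α = 0 reproduces the reading-style duration, α = 1 reproduces the designated style, and α > 1 exaggerates the difference. The Python sketch below evaluates the formula as written; the function name, the millisecond unit, and the sample values are assumptions for illustration.

```python
def emphasized_duration(ts_ms: float, tn_ms: float, alpha: float) -> float:
    """Return T = Ts + alpha * (Tn / Ts - 1) * Ts.

    ts_ms: phoneme duration in the reference (reading) style.
    tn_ms: phoneme duration in the designated style.
    alpha: emphasis coefficient, 0 or greater.
    """
    if ts_ms <= 0:
        raise ValueError("ts_ms must be positive")
    if alpha < 0:
        raise ValueError("alpha must be 0 or greater")
    return ts_ms + alpha * (tn_ms / ts_ms - 1.0) * ts_ms


if __name__ == "__main__":
    # Made-up durations: 80 ms in reading style, 100 ms in the designated style.
    for alpha in (0.0, 0.5, 1.0, 2.0):
        print(alpha, emphasized_duration(80.0, 100.0, alpha))
```

With the sample values above, α = 0.5 gives 90 ms and α = 2 gives 120 ms, that is, half and twice the 20 ms difference between the two styles.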
[0050]
As described above, when the synthesis parameter including the prosodic parameter and the speech unit data is determined, the speech signal is synthesized (step 606) and output (step 607). The output method may be output from a speaker or transmission to another device via a communication line.
[0051]
According to the text-to-speech converter of the second embodiment described above, it is possible to change the utterance style by changing the phoneme duration according to the user's preference.
[0052]
In each of the above embodiments, the text-to-speech conversion apparatus for Japanese sentences has been described, but the present invention can of course be applied to a text-to-speech conversion apparatus for other language sentences.
[0053]
[Effects of the invention]
As described above, the present invention emphasizes or weakens the differences in prosodic feature quantities that arise from differences in utterance style, makes it possible to generate synthesized speech in a variety of utterance styles, and realizes a text-to-speech converter that can read text aloud with a prosodic pattern suited to the user's preference.
[Brief description of the drawings]
FIG. 1 is a block diagram illustrating a text-to-speech conversion apparatus according to a first embodiment.
FIG. 2 is a diagram showing a conventional text-to-speech conversion device.
FIG. 3 is a flowchart showing the operation of the text-to-speech conversion apparatus of FIG. 1.
FIG. 4 is an explanatory diagram of the utterance style designation unit 17 of the text-to-speech converter of FIG. 1.
FIG. 5 is a block diagram illustrating a text-to-speech conversion apparatus according to a second embodiment.
FIG. 6 is a flowchart showing the operation of the text-to-speech conversion apparatus of FIG. 5.
[Explanation of symbols]
10: character information input unit; 11: text analysis unit; 12: pronunciation dictionary; 13: synthesis parameter generation unit; 14: speech unit data storage unit; 15: speech synthesis unit; 16: synthesis parameter changing means; 17: utterance style designation unit; 18: prosodic parameter table; 19: utterance style emphasis degree designation unit; 20: utterance style emphasis unit.

Claims

1. A text-to-speech converter for converting input character information into a speech signal, comprising:
a prosodic parameter table that holds the characteristics of a plurality of utterance styles including at least a normal style and a reading style;
an utterance style designation unit for selecting an utterance style;
an emphasis degree designation unit for designating the degree of emphasis of the utterance style;
a difference calculation unit that calculates the difference between the prosodic parameters of the utterance style selected by the utterance style designation unit and those of a reference utterance style; and
prosodic parameter adjustment means that corrects the prosodic parameters according to the difference and the emphasis degree designated by the emphasis degree designation unit.

2. The text-to-speech converter according to claim 1, wherein the prosodic parameter corrected by the prosodic parameter adjusting means is at least a phoneme duration or a pitch pattern.