JP2004246085A

JP2004246085A - Method, device, and program for speech synthesis

Info

Publication number: JP2004246085A
Application number: JP2003035954A
Authority: JP
Inventors: Masanobu Abe; 匡伸阿部
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2003-02-14
Filing date: 2003-02-14
Publication date: 2004-09-02

Abstract

PROBLEM TO BE SOLVED: To synthesize speech that is not misheard, by extracting named entities from a text and changing the manner of pronunciation for the named-entity portions. SOLUTION: (S1) Using a reading dictionary and an accent dictionary 2, kana readings are assigned to the kanji in a text 1 of mixed kana-kanji sentences, and accent types are set. (S2) Named entities (e.g., a telephone number) are extracted using the morpheme information and reading information output by the text analysis. (S3) Prosodic parameters (fundamental frequency pattern of the speech, phoneme durations, speech power, etc.) are generated. (S4) For ordinary text, speech synthesis units are selected from the general-purpose (continuously uttered speech) synthesis units 6; for a named entity (the portion enclosed by labels or tags), speech synthesis units of monosyllabic utterances (each monosyllable uttered in isolation) are selected. (S5) The selected speech synthesis units are signal-processed to match the prosodic parameters, and synthesized speech is output. COPYRIGHT: (C)2004,JPO&NCIPI

Description

【０００１】
【発明の属する技術分野】
本発明は、テキストを入力として、それを音声に変換する音声合成方法、装置及び音声合成プログラムに関する。
テキストを媒体とする情報は、単行本、新聞、雑誌、インターネット等により、これまで大量に発信され蓄積されてきたと共に、今後も大量に発信されていく。このような情報を、単にテキストとしてユーザに提供するだけでなく、音声に変換して提供することができれば、ユーザの情報へのアクセス手段を多様化できるばかりでなく、テキスト情報の有効活用が可能となる。例えばＷｅｂ上でテキストとして提供しているイベント情報は、テキストから音声に変換できる技術があれば、電話でアクセスして、音声で聞くことができる。
【０００２】
【従来の技術】
テキストから音声に変換する技術として、テキストからの音声合成が開発されている。
この技術は入力されたテキストに以下の処理を行い音声に変換する。
▲１▼テキスト解析し、漢字の読み仮名を付与するとともに、アクセント型を設定。
▲２▼音声の基本周波数パタン、音素の継続時間長、音声のパワー等の韻律パラメータの生成。
▲３▼読み仮名に適した音声セグメントを音声データベースから選択。
▲４▼選択した音声セグメントを韻律パラメータに合うように信号処理。
現状では、テキストの内容を何とか聞き取れるレベルにあるが、イントネーションが不自然であったり、よく注意して聞かないと聞き間違ったりする。
【０００３】
（従来例１）
従来例１の音声合成装置を図２を参照して説明する。
テキスト入力部２０からの入力テキストに対し、日本語処理部２１では形態素解析によってポーズの位置、単語・分節の区切りや辞書を参照した読み仮名変換とアクセント付与がなされる。
フレーズパターン算出部２２においてはテキストに含まれるモーラ数（子音＋母音で表される音の最小単位）から得られるフレーズ成分（ポーズで挟まれた一息で話すときの音の高低）を算出する（フレーズ成分の抑揚制御パターン）。
アクセントパターン生成処理部２３は、基本アクセントパターン生成処理部２４と基本アクセントパターンテーブル２５と補正処理部２６を備える。基本アクセントパターンテーブル２５には実音声のピッチ分析によって得られたパターンデータをテーブル化して予め記憶するもので、複数点のアクセント量が当該モーラのアクセントとその前後アクセントとの組み合わせ別にテーブル化される。基本アクセントパターン生成処理部２４には入力テキストの日本語処理結果として区切り内モーラ数と当該モーラのアクセントパターンが与えられ、入力されたアクセント環境に対応する基本アクセントパターンを基本アクセントパターンテーブル２５から読み出す処理を行う。補正処理部２６は、基本アクセントパターン生成処理部２４で生成された基本アクセントパターンを用い、このパターンを日本語処理されたアクセントパターンの区切り内モーラ数とモーラ位置等による補正量を求め、この補正量により基本アクセントパターンを補正してアクセントパターンデータとして出力する（アクセント成分の抑揚制御パターン）。
次にフレーズ成分とアクセント成分の抑揚制御パターンを重ね合わせて入力テキストの抑揚制御パターンを得て音声合成部２７に入力する。音声合成部２７は抑揚制御パターンに従って各音節のピッチを調整し、また各音節に対応付けた音源波パターンと調音フィルタのパラメータから調音フィルタの応答出力として合成音声を得る（特許文献１参照）。
【０００４】
（従来例２）
従来例２のはめ込み型併用音声合成装置を図３を参照して説明する。
入力された文字データ３０は文字データの正規化部３１において漢字仮名混じり文字を正規化文字データに変換して言語処理解析部３３に出力する。変換テーブル３２は漢字仮名混じり文を正規化文字データに変換する変換テーブルであり、文字データの正規化部３１により参照される。
言語処理解析部３３は正規化文字データを単語の系列にし、品詞や活用形の同定を行い、ポーズ設定、読み付与、アクセント設定等の韻律情報の他に特定の単語の属性を解析して、属性及び韻律記号付き（カナ）文字を出力する。単語辞書部３４は、数万から十数万語の単語を有し、文法情報、読み、アクセント形などが記述されており、特定の単語には例えば、Ａ，Ｂ，Ｃ，Ｄ，・・・等のように種別コード（後述する）が付与されており、言語処理解析部３３により参照される。
制御部３５は文例パターンテーブル３６により、言語処理解析部３３から出力される属性及び韻律記号付き（カナ）文字が規則合成用表音文字列であるか又ははめ込みフィーマット表音文字列であるかを判定する（判定については後述する）。すなわち、言語処理解析部３３から得られる単語識別属性列が文例パターンテーブル３６の文例パターンと一致するかを判定する。もし、規則合成用表音文字列と判定されると、通常のテキスト音声合成の制御が以下のように行われる。
【０００５】
（規則合成型韻律生成処理）
規則合成型韻律生成部５１は規則合成用表音文字列のポーズの位置からイントネーションを生成し、イントネーションとアクセントの重畳によりピッチパターンを生成する。韻律テーブル５０はポーズ、アクセント等を設定する規則が記述されており、規則合成型韻律生成部５１により参照される。
音素継続時間生成部６１は表音文字を構成する音素の継続時間を設定して、発話の自然なタイミング（リズム）を実現する。音素長テーブル６０は隣接音素の影響を考慮する音素の固有の時間長規則が記述され、音素継続時間生成部６１により参照される。
波形接続部６２は、規則合成用表音文字列を構成する音素を接続して連続音声を合成するに際し、前記音素継続時間、前記ピッチパターン等の特徴パラメータを音素間で表された補間処理が施され、補間処理された連続音声がＰＣＭデータ６４として出力される。また、音素辞書部６３は音素の波形データを合成単位として記憶し、波形接続部６２により参照される。
【０００６】
（はめ込み型韻律生成処理）
判定と、はめ込みフォーマット表音文字列と判定された場合について説明する。
例えば、入力文字データ（交通情報のデータ）が「第二神明道路の西行きは、玉津付近で、渋滞２ｋｍ」であるとき、言語処理解析部３３への正規化文字データでは、「第二神明道路」は「ダイニシンメーロード」とカナに変換され、道路名を意味し、「西行き」は「ニシユキ」とカナに変換され、方向を意味し、「玉津」は「タマツ」とカナに変換され、地名を意味し、「渋滞」は「ジュータイ」とカナに変換され、程度を意味し、「２ｋｍ」は「ニキロ」とカナに変換され、距離を意味する。この場合、単語辞書部は「ダイニシンメーロード」，「ニシユキ」，「タマツ」，「ジュータイ」，「ニキロ」等に単語を有するが、これらの単語に、例えば、道路名、方向、地名、程度、距離を意味する単語にＡ，Ｂ，Ｃ，Ｄ，Ｅという属性のコードを与える。
文例パターンテーブル３６では、限定されている文例の単語列、例えば、Ａ＋Ｂ＋Ｃ＋Ｄ＋Ｅ、Ａ＋Ｃ＋Ｂ＋Ｄ＋Ｅ、・・・を作成して持っている。
【０００７】
制御部３５では、属性及び韻律記号付きカナ文字で構成される１つの文章が入力したら、文例パターンテーブル３６と照合し、単語の属性列が文例パターンテーブルが有する文例と全く一致しない場合にはピッチ生成処理に規則合成型韻律生成処理を選択する。一方、一致した場合にははめ込み型韻律生成処理を選択するとともにはめ込みフォーマットの表音文字列を生成する。
はめ込みフォーマット表音文字列は、文章番号とはめ込み単語の読みから形成される。はめ込み文例テーブル４０は、文例パターンテーブル３６の各文例に対応して、文例を構成する単語を接続する助詞、助動詞、動詞等の接続語の表音文字列で作成され、上記文例では「ノ」「ハ、」「フキンデ、」「デス。」からなる。この場合、はめ込み型韻律生成部４２は、はめ込みフォーマット表音文字列を構成する単語の属性列の各単語を文例パラメータテーブルの文例の助詞、助動詞、動詞等の接続語の表音文字列の間にはめ込む。すなわち、上記例では『Ａ「ノ」Ｂ「ハ、」Ｃ「フキンデ、」ＤＥ「デス。」』となる。
はめ込み型韻律生成部４２ははめ込み単語のアクセント、ポーズに関しては韻律テーブル５１と同様な構成の韻律テーブル４１を参照し、文章全体のイントネーションについてはめ込み文例テーブル４０を参照して、イントネーションとアクセントを重畳してピッチパターンを形成する。波形接続部６２では、前述と同様に、はめ込み表音文字列の音素に対して音素継続時間を設定し、はめ込み型韻律生成部４２からのピッチパターンを設定して音素の波形が接続され、ＰＣＭデータ６４として出力される。
【０００８】
【特許文献１】
特許第３０７０１２７号（図１、段落（００１６））
【特許文献２】
特許第３１９２９８１号（図２，図３、段落（０００８）〜（００１４））
【０００９】
【発明が解決しようとする課題】
従来のテキストからの音声合成では、発音の仕方をテキストの内容に応じて制御することはしていなかった。しかしながら、人間はテキストの内容に応じて発音の仕方を変えている。簡単な例としては、電話番号の読み上げがある。電話番号を人に伝える場合には、番号の伝え間違いが生じないように、一音、一音区切りながら、かつ、ゆっくりと発声する。このようにすることによって、音韻の明瞭性を挙げている。また、同様な発音の仕方は、待合い場所のレストラン名を伝える場合にも使われる。従来のテキストからの音声合成では、このような場合に発音の仕方を変えていなかったため、電話番号やレストラン名が正しくユーザに伝わらないという問題があった。
本発明は、このようなテキストの内容に応じて、発音の仕方を変えることにより、聞き取り間違えのない音声合成を実現し、スムーズな情報伝達を可能とするものである。また、さらに高度な例としては、ニュース読み上げや、緊急事態を伝える読み上げなど、テキストの内容に応じて発音の仕方を変えることも考えられる。
【００１０】
【課題を解決するための手段】
上記課題を解決するために、本発明はテキストの中から、テキストの内容に応じて発音の仕方を変えるべき部分を推定し、この部分を合成する際に、韻律パラメータや音声データベースを切り替えることによって、適切な音声合成を得る。
【００１１】
【発明の実施の形態】
本発明の実施例を、図１に示した音声合成装置を参照して先にあげた電話番号読み上げを例として説明する。
入力は、ワープロ等で作成したかな漢字混じり文のテキスト１である。
Ｓ１：読み辞書、アクセント辞書２は、テキスト解析部３のテキスト解析で利用される辞書である。テキスト解析部３は、読み辞書、アクセント辞書２を用いてかな漢字混じり文のテキスト１を漢字の読み仮名を付与すると共に、アクセント型を設定する。これらの構成は、通常のテキストからの音声合成の処理と同等であればよく、ここに述べた構成だけにとどまらない。
Ｓ２：固有表現抽出部４は、テキスト解析で出力された形態素情報と読み情報を利用して、電話番号と考えられる表現を抽出（固有表現を抽出）するモジュールである。抽出した固有表現の前後にはラベルあるいはタグを付与する。
【００１２】
固有表現抽出について説明する。
固有表現抽出規則生成システムは、予め用意された訓練用文書（テキスト）を形態素解析して単語に分割し、品詞名や構成文字種などの情報を各単語に付与する。こうして得られた単語列から、固有表現を構成する単語列を取り出し、訓練用文書に対応して予め用意された正解リストを参照して経験則や最小汎化などの一般化手段によって多数の固有表現抽出用の規則（ルール）を生成する。そして、これらの規則をそれぞれ独立に訓練用文書に適用して、その規則が、訓練用文書のどの位置にマッチしたかの記録を記憶しておく。この記録に入っているものは、訓練用文書に対してシステムが出力する固有表現の候補となる。そして、複数の規則を組み合わせる場合には、それらの規則に対応する記録に入っている全ての候補の中から、競合関係と優先順位を考慮して、最終的に出力する候補の例を一定の明快な基準で選び出す。この結果、訓練用文書における不正解の頻度あるいは割合が非常に多い規則があれば、それを削除する。ただし、その規則が訓練用文書のどの位置で正解し、どの位置で不正解になっているかがわかる。そこで、正解の個所の前後の単語列と不正解の個所の前後の単語列を比較して制約を加えることによって、訓練用文書における成績が良くなる規則が作れるかどうか判断できるので、成績が良くなる場合は制約を加えた規則を加える。
このように作成された固有表現抽出規則生成システムを用いて、生成された規則に基づき任意の文書中の固有表現を抽出すると共に、抽出した複数の固有表現に部分的に重なりがあれば、文書における記載開始位置が早いものを優先して抽出し、また、記載開始位置が同じであれば記載終了位置が遅いものを優先して抽出し、さらに、表現は同じであるが種類の異なる固有表現であれば、各固有表現の抽出に用いた各々の規則に予め付与された優先度の大きいものを優先して抽出する。（特開２００１ー３１８７９２段落（０００９），（００１０）参照）
【００１３】
電話番号と認定された数字列に対して、以後、一音、一音をはっきりと発音させる処理を加える。なお、固有表現抽出とは、人名、地名、金額、日時などの重要な要素（固有表現）をテキストから抜き出し分類する技術であるため、本実施例の電話番号に限らず、人名やレストラン名などを抽出し、これらに対して以下に述べる発音の仕方を適用し、聞き違えの少ない音声を合成することもできる。また、電話番号などのように、表記上パタン化されている場合には、固有表現抽出によらずに表記のパタンマッチングによって、電話番号と認定する処理も考えられる。
Ｓ３：韻律生成部５の韻律生成では通常のテキストからの音声合成と同等な処理、（すなわち、韻律パラメータ（音声の基本周波数パタン、音素の継続時間長、音声のパワー等）の生成）を行うが、特に、電話番号の部分に対して、音韻の継続時間長を長めに設定することでゆっくり発音させる。このことにより、明瞭性が向上する。
Ｓ４：音声データベースは、連続発生音声に対応した波形を蓄積した汎用音声（連続発声音声）合成単位６と単音節に対応した音声波形を蓄積した単音節発声音声合成単位７からなる。汎用音声合成単位６は、任意の音声を合成するために用意された音声合成単位を記憶する。
【００１４】
音声合成単位選択部８では汎用音声合成単位６から合成したい音韻系列に適する音声合成単位を選択する。通常のテキストからの音声合成では、音声データベースの汎用音声合成単位６を用いてアナウンサーがニュースを読み上げるような滑らかな音韻の連なりを実現するために、音韻の連なりを考慮した音声合成単位などが使われる。このような音声合成単位では、一音、一音の明瞭性が低下するという欠点があるが、ニュースなどの場合は、人間は前後の文脈などから適当に単語を推定しながら聞いているので明瞭性の低下は問題とはならない。しかし、電話番号のような場合には、文脈情報がないので、間違って聞き取ってしまうことがある。そこで固有表現抽出部４で抽出された固有表現（ラベル、タグで挟まれた部分）は、単音節発声音声合成単位７に前後の音韻環境の影響を考慮せず単音節発声（単音節を単独で発声した）の音声合成単位を準備し、これを用いて電話番号を合成する。単音節発声は、一音、一音、とつとつとした発声であるが、明瞭性が高い特徴があるので、聞き取り間違いが生じにくい音声合成が得られる。
Ｓ５：音声信号処理部９では、音声合成単位選択部８で選択された音声合成単位に対して、音声信号処理によって、基本周波数、継続時間長などを、韻律生成部５で設定された値（韻律パラメータ）に変換する。なお、韻律生成部５に設定した値に設定せずに、音声を接続するだけの場合も考えられる。これは、音声信号処理による変換が、音質の劣化を生じる場合があるためである。
【００１５】
本発明の音声合成装置は、ＣＰＵやメモリ等を有するコンピュータとユーザが利用する端末とＣＤ−ＲＯＭ、磁気ディスク装置、半導体メモリ等の機械読み取り可能な記録媒体とから構成される。
記録媒体に記録された音声合成プログラム、あるいは回線を介して伝送された音声合成プログラムはコンピュータに読み取られ、コンピュータ上に前述した各構成要素を実現し、各処理を実行する。
【００１６】
【発明の効果】
以上説明したように、本発明は、読み上げたいテキストに適した発音の仕方で音声を合成することができる。そのため、テキストからの音声合成の全体としてパフォーマンスを向上させることができる。
【図面の簡単な説明】
【図１】本発明の音声合成装置の構成例を示す図。
【図２】従来例１の音声合成装置の構成を示す図。
【図３】従来例２の音声合成装置の構成を示す図。
【符号の説明】
１・・・テキスト、２・・・読み辞書、アクセント辞書、３・・・テキスト解析部、４・・・固有表現抽出部、５・・・韻律生成部、６・・・汎用音声合成単位、７・・・単音節発声音声合成単位、８・・・音声合成単位選択部、９・・・音声信号処理部、１０・・・合成音声
[0001]
[Technical Field of the Invention]
The present invention relates to a speech synthesis method, a speech synthesis apparatus, and a speech synthesis program that take text as input and convert it into speech.
Information conveyed as text has been published and accumulated in large volumes through books, newspapers, magazines, the Internet, and so on, and will continue to be published in large volumes. If such information can be provided to users not only as text but also converted into speech, the means by which users access information can be diversified and text information can be used more effectively. For example, event information provided as text on the Web could be accessed by telephone and listened to as speech, given a technology that converts text to speech.
[0002]
[Prior art]
As a technique for converting text to speech, speech synthesis from text has been developed.
This technology converts the input text into speech by performing the following processing.
(1) Analyze the text, assign kana readings to kanji, and set accent types.
(2) Generate prosodic parameters such as the fundamental frequency pattern of the speech, phoneme durations, and speech power.
(3) Select speech segments suited to the kana readings from a speech database.
(4) Apply signal processing to the selected speech segments so that they match the prosodic parameters.
At present, the content of the text can just about be understood, but the intonation is unnatural, and the speech is easily misheard unless one listens carefully.
[0003]
(Conventional example 1)
The speech synthesizer of Conventional Example 1 will be described with reference to FIG. 2.
For the input text from the text input unit 20, the Japanese processing unit 21 uses morphological analysis to determine pause positions and word and phrase boundaries, converts the text to kana readings by referring to a dictionary, and assigns accents.
The phrase pattern calculation unit 22 calculates a phrase component (the rise and fall of pitch when speaking in one breath between pauses) from the number of morae in the text (a mora being the minimal sound unit, represented by a consonant plus a vowel). This yields the intonation control pattern of the phrase component.
The accent pattern generation processing unit 23 comprises a basic accent pattern generation processing unit 24, a basic accent pattern table 25, and a correction processing unit 26. The basic accent pattern table 25 stores in advance, in tabular form, pattern data obtained by pitch analysis of real speech; accent amounts at a plurality of points are tabulated for each combination of the accent of the mora in question and the accents before and after it. The basic accent pattern generation processing unit 24 receives, as the Japanese processing result of the input text, the number of morae in the segment and the accent pattern of the mora in question, and reads the basic accent pattern corresponding to the input accent environment from the basic accent pattern table 25. The correction processing unit 26 takes the basic accent pattern generated by the basic accent pattern generation processing unit 24, obtains a correction amount from the number of morae in the segment, the mora position, and the like of the accent pattern produced by the Japanese processing, corrects the basic accent pattern by this amount, and outputs the result as accent pattern data (the intonation control pattern of the accent component).
Next, the intonation control patterns of the phrase component and the accent component are superimposed to obtain the intonation control pattern of the input text, which is input to the speech synthesis unit 27. The speech synthesis unit 27 adjusts the pitch of each syllable according to the intonation control pattern and obtains synthesized speech as the response output of an articulation filter, using the sound source wave pattern and the articulation filter parameters associated with each syllable (see Patent Document 1).
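The superposition step of Conventional Example 1 can be illustrated with a minimal sketch. The source states only that the phrase-component and accent-component patterns are superimposed; the linearly declining phrase contour, the per-mora accent values, and the function names below are assumptions made for illustration.

```python
def phrase_component(num_morae, start=1.0, end=0.5):
    """Hypothetical phrase contour: declines linearly across one breath group."""
    if num_morae == 1:
        return [start]
    step = (end - start) / (num_morae - 1)
    return [start + i * step for i in range(num_morae)]


def superimpose(phrase, accent):
    """Add the accent component to the phrase component, mora by mora."""
    if len(phrase) != len(accent):
        raise ValueError("both contours must cover the same morae")
    return [p + a for p, a in zip(phrase, accent)]


phrase = phrase_component(5)
accent = [0.0, 0.5, 0.5, 0.0, 0.0]   # assumed accent-component values per mora
contour = superimpose(phrase, accent)
```

The values are in arbitrary units; a real system would work with pitch in Hz or log-Hz and with table-derived accent amounts rather than the constants assumed here.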
[0004]
(Conventional example 2)
An inset-type combined speech synthesizer of Conventional Example 2 will be described with reference to FIG. 3.
The input character data 30 is converted by the character data normalization unit 31 from mixed kanji-kana text into normalized character data, which is output to the language processing analysis unit 33. The conversion table 32 is a table for converting mixed kanji-kana sentences into normalized character data and is referred to by the character data normalization unit 31.
The language processing analysis unit 33 segments the normalized character data into a sequence of words, identifies parts of speech and inflected forms, analyzes the attributes of specific words in addition to prosodic information such as pause setting, reading assignment, and accent setting, and outputs (kana) characters annotated with attributes and prosodic symbols. The word dictionary unit 34 holds from tens of thousands to over a hundred thousand words, together with grammatical information, readings, accent types, and the like; specific words are given type codes such as A, B, C, D, ... (described later). The dictionary is referred to by the language processing analysis unit 33.
Using the sentence example pattern table 36, the control unit 35 determines whether the (kana) characters with attributes and prosodic symbols output from the language processing analysis unit 33 form a phonetic character string for rule synthesis or an inset-format phonetic character string (the determination is described later). That is, it determines whether the word attribute sequence obtained from the language processing analysis unit 33 matches a sentence example pattern in the sentence example pattern table 36. If the string is determined to be a phonetic character string for rule synthesis, ordinary text-to-speech synthesis is controlled as follows.
[0005]
(Rule synthesis type prosody generation processing)
The rule synthesis type prosody generation unit 51 generates intonation from the pause positions in the phonetic character string for rule synthesis, and generates a pitch pattern by superimposing the intonation and the accent. The prosody table 50 describes rules for setting pauses, accents, and the like, and is referred to by the rule synthesis type prosody generation unit 51.
The phoneme duration generation unit 61 sets the duration of each phoneme constituting the phonetic characters so as to realize natural timing (rhythm) of the utterance. The phoneme length table 60 describes rules for the intrinsic duration of each phoneme, taking into account the influence of adjacent phonemes, and is referred to by the phoneme duration generation unit 61.
When connecting the phonemes constituting the phonetic character string for rule synthesis to synthesize continuous speech, the waveform connection unit 62 interpolates feature parameters such as the phoneme durations and the pitch pattern between phonemes, and outputs the interpolated continuous speech as PCM data 64. The phoneme dictionary unit 63 stores phoneme waveform data as synthesis units and is referred to by the waveform connection unit 62.
[0006]
(Inset type prosody generation processing)
The determination, and the case where the string is determined to be an inset-format phonetic character string, are described next.
For example, suppose the input character data (traffic information data) is "Daini-Shinmei Road westbound, near Tamatsu, congestion 2 km". In the normalized character data passed to the language processing analysis unit 33, "Daini-Shinmei Road" is converted to the kana reading "dainishinmē rōdo" and denotes a road name; "westbound" is converted to "nishiyuki" and denotes a direction; "Tamatsu" is converted to "tamatsu" and denotes a place name; "congestion" is converted to "jūtai" and denotes a degree; and "2 km" is converted to "nikiro" and denotes a distance. The word dictionary unit contains entries for these words, and words denoting a road name, direction, place name, degree, and distance are given the attribute codes A, B, C, D, and E, respectively.
The sentence example pattern table 36 holds the attribute sequences of a limited set of sentence examples, for example A+B+C+D+E, A+C+B+D+E, and so on.
[0007]
When one sentence composed of kana characters with attributes and prosodic symbols is input, the control unit 35 checks it against the sentence example pattern table 36. If the attribute sequence of the words matches none of the sentence examples in the table, rule synthesis type prosody generation processing is selected for pitch generation. If it matches, inset-type prosody generation processing is selected and an inset-format phonetic character string is generated.
The inset-format phonetic character string is formed from a sentence number and the readings of the words to be inserted. The inset sentence example table 40 holds, for each sentence example in the sentence example pattern table 36, the phonetic character strings of the connecting words (particles, auxiliary verbs, verbs, and the like) that join the words of the sentence example; for the example above these are "no", "wa,", "fukinde,", and "desu.". The inset-type prosody generation unit 42 inserts each word of the attribute sequence constituting the inset-format phonetic character string between the phonetic character strings of the connecting words in the sentence example parameter table. In the example above, the result is: A "no" B "wa," C "fukinde," D E "desu."
For the accents and pauses of the inserted words, the inset-type prosody generation unit 42 refers to the prosody table 41, which has the same structure as the prosody table 50; for the intonation of the whole sentence it refers to the inset sentence example table 40. It superimposes the intonation and the accents to form a pitch pattern. As described above, the waveform connection unit 62 sets the phoneme durations for the phonemes of the inset phonetic character string, applies the pitch pattern from the inset-type prosody generation unit 42, connects the phoneme waveforms, and outputs the result as PCM data 64.
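The selection between rule synthesis and inset-type processing, and the insertion of words between connecting strings, can be sketched as follows. This is a minimal illustration, not the implementation of Patent Document 2: the table layouts, the use of `None` to mark a word slot, the romanized readings, and all function names are assumptions.

```python
# Hypothetical attribute codes: A=road name, B=direction, C=place name,
# D=degree, E=distance. Values are sentence numbers.
PATTERN_TABLE = {
    ("A", "B", "C", "D", "E"): 1,
    ("A", "C", "B", "D", "E"): 2,
}

# Connecting strings for each sentence number; None marks a slot for a word.
INSET_TABLE = {
    1: [None, "no", None, "wa,", None, "fukinde,", None, None, "desu."],
}


def select_processing(words):
    """words: list of (reading, attribute). Returns the processing type."""
    attrs = tuple(attr for _, attr in words)
    number = PATTERN_TABLE.get(attrs)
    return ("inset", number) if number is not None else ("rule", None)


def fill_template(words, sentence_no):
    """Insert the word readings, in order, into the slots of the template."""
    readings = iter(reading for reading, _ in words)
    return [next(readings) if slot is None else slot
            for slot in INSET_TABLE[sentence_no]]


words = [("dainishinmeiroodo", "A"), ("nishiyuki", "B"), ("tamatsu", "C"),
         ("juutai", "D"), ("nikiro", "E")]
kind, number = select_processing(words)
filled = fill_template(words, number)
```

A sentence whose attribute sequence is not in the table falls through to `("rule", None)`, which corresponds to selecting rule synthesis type prosody generation.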
[0008]
[Patent Document 1]
Patent No. 3070127 (FIG. 1, paragraph (0016))
[Patent Document 2]
Patent No. 3192981 (FIG. 2, FIG. 3, paragraphs (0008) to (0014))
[0009]
[Problems to be solved by the invention]
In conventional speech synthesis from text, the manner of pronunciation was not controlled according to the content of the text. Humans, however, change their manner of pronunciation according to the content. A simple example is reading out a telephone number. When telling someone a telephone number, one speaks slowly, separating each sound, so that the number is not conveyed incorrectly. This raises the clarity of the phonemes. The same manner of pronunciation is used when telling someone the name of a restaurant where one is to meet. Because conventional speech synthesis from text did not change the manner of pronunciation in such cases, telephone numbers and restaurant names were not conveyed correctly to the user.
The present invention changes the manner of pronunciation according to the content of the text, thereby realizing speech synthesis that is not misheard and enabling smooth transmission of information. As more advanced examples, the manner of pronunciation could also be changed according to content such as news reading or emergency announcements.
[0010]
[Means for Solving the Problems]
To solve the above problem, the present invention estimates which portions of a text should be pronounced differently according to their content and, when synthesizing those portions, switches the prosodic parameters and the speech database, thereby obtaining appropriate synthesized speech.
[0011]
[Embodiments of the Invention]
An embodiment of the present invention will be described with reference to the speech synthesizer shown in FIG. 1, using the telephone number reading mentioned above as an example.
The input is a text 1 of mixed kana-kanji sentences created with a word processor or the like.
S1: The reading dictionary and accent dictionary 2 are dictionaries used in the text analysis performed by the text analysis unit 3. Using them, the text analysis unit 3 assigns kana readings to the kanji in the mixed kana-kanji text 1 and sets accent types. These components need only be equivalent to the processing used in ordinary text-to-speech synthesis and are not limited to the configuration described here.
S2: The named entity extraction unit 4 is a module that uses the morpheme information and reading information output by the text analysis to extract expressions considered to be telephone numbers (that is, it extracts named entities). Labels or tags are added before and after each extracted named entity.
[0012]
The named entity extraction will be described.
The named entity extraction rule generation system morphologically analyzes training documents (text) prepared in advance, splits them into words, and attaches information such as part-of-speech names and constituent character types to each word. From the resulting word sequences it takes out the word sequences that constitute named entities and, referring to a list of correct answers prepared in advance for the training documents, generates a large number of named entity extraction rules by generalization means such as heuristics and minimal generalization. Each rule is then applied independently to the training documents, and a record of the positions at which it matched is stored. The entries in this record are the named entity candidates that the system would output for the training documents. When multiple rules are combined, the candidates to be finally output are chosen, by a fixed and clear criterion, from all candidates in the records of those rules, taking conflicts and priorities into account. As a result, any rule whose frequency or proportion of incorrect answers on the training documents is very high is deleted. However, the positions at which such a rule was correct and incorrect are known. By comparing the word sequences around the correct positions with those around the incorrect positions and adding constraints, it can be judged whether a rule that performs better on the training documents can be created; if so, the constrained rule is added.
Using the named entity extraction rule generation system built in this way, named entities in an arbitrary document are extracted on the basis of the generated rules. If extracted named entities partially overlap, the one that starts earlier in the document is given preference; if the start positions are the same, the one that ends later is given preference; and if the expressions are identical but of different types, the one whose extraction rule has the higher pre-assigned priority is given preference. (See JP-A-2001-318792, paragraphs (0009) and (0010).)
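The overlap-resolution order just described (earlier start, then later end, then higher rule priority) can be sketched as a sort followed by a greedy pass. The `Candidate` fields, the half-open character offsets, and the function name are hypothetical; the cited publication may organize this differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    start: int      # offset of the first character
    end: int        # offset one past the last character
    kind: str       # entity type, e.g. "PHONE"
    priority: int   # priority of the rule that produced this candidate


def resolve_overlaps(candidates):
    """Keep non-overlapping candidates, preferring earlier start,
    then later end, then higher rule priority."""
    ordered = sorted(candidates,
                     key=lambda c: (c.start, -c.end, -c.priority))
    chosen = []
    for cand in ordered:
        # Candidates are visited in start order, so comparing against the
        # most recently chosen one is enough to detect an overlap.
        if not chosen or cand.start >= chosen[-1].end:
            chosen.append(cand)
    return chosen


candidates = [
    Candidate(0, 3, "PERSON", 1),
    Candidate(0, 5, "ORG", 1),      # same start as PERSON, ends later
    Candidate(4, 8, "PLACE", 2),    # overlaps ORG, starts later
    Candidate(6, 9, "PHONE", 1),
    Candidate(9, 12, "DATE", 1),
    Candidate(9, 12, "TIME", 3),    # same span as DATE, higher priority
]
kinds = [c.kind for c in resolve_overlaps(candidates)]
```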
[0013]
The digit string identified as a telephone number is subsequently processed so that each sound is pronounced distinctly. Named entity extraction is a technology for extracting and classifying important elements (named entities) such as personal names, place names, amounts of money, and dates and times from text. It is therefore not limited to the telephone numbers of this embodiment: personal names, restaurant names, and the like can also be extracted and given the manner of pronunciation described below, so that speech with little risk of being misheard is synthesized. In addition, when the notation follows a fixed pattern, as telephone numbers do, the telephone number may be identified by pattern matching on the notation instead of by named entity extraction.
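The pattern-matching alternative mentioned above can be sketched with a regular expression. The notation assumed here (three hyphen-separated digit groups such as 03-1234-5678) and the `<tel>` tag names are illustrative assumptions; the source does not specify either.

```python
import re

# Assumed notation: digit groups joined by hyphens, e.g. 03-1234-5678.
# The lookarounds prevent matching inside a longer run of digits.
PHONE_PATTERN = re.compile(r"(?<!\d)\d{2,4}-\d{2,4}-\d{4}(?!\d)")


def tag_phone_numbers(text, open_tag="<tel>", close_tag="</tel>"):
    """Wrap each matched telephone number in hypothetical tags."""
    return PHONE_PATTERN.sub(
        lambda m: f"{open_tag}{m.group(0)}{close_tag}", text)


tagged = tag_phone_numbers("Call 03-1234-5678 today.")
```

The tagged span plays the role of the label or tag that step S2 places around an extracted named entity, so later steps can treat it differently from the surrounding text.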
S3: The prosody generation unit 5 performs the same processing as in ordinary text-to-speech synthesis, that is, it generates prosodic parameters (fundamental frequency pattern of the speech, phoneme durations, speech power, etc.). For the telephone number portion in particular, it sets longer phoneme durations so that the portion is pronounced slowly. This improves clarity.
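The duration adjustment in S3 can be sketched as scaling the durations of phonemes that fall inside a tagged named entity. The record layout and the factor of 1.5 are assumptions; the source says only that the durations are set longer.

```python
def lengthen_durations(phonemes, factor=1.5):
    """Return a copy of the phoneme records in which phonemes inside a
    named entity have their duration multiplied by `factor`.

    phonemes: list of dicts with keys 'symbol', 'duration_ms', 'in_entity'.
    The factor of 1.5 is an arbitrary illustrative value.
    """
    return [
        {**p, "duration_ms": p["duration_ms"] * factor}
        if p["in_entity"] else dict(p)
        for p in phonemes
    ]


phonemes = [
    {"symbol": "wa", "duration_ms": 80, "in_entity": False},
    {"symbol": "ze", "duration_ms": 100, "in_entity": True},
    {"symbol": "ro", "duration_ms": 120, "in_entity": True},
]
slowed = lengthen_durations(phonemes)
```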
S4: The speech database consists of general-purpose (continuously uttered speech) synthesis units 6, which store waveforms corresponding to continuously uttered speech, and monosyllable utterance speech synthesis units 7, which store speech waveforms corresponding to single syllables. The general-purpose speech synthesis units 6 store speech synthesis units prepared for synthesizing arbitrary speech.
[0014]
The speech synthesis unit selection unit 8 selects, from the general-purpose speech synthesis units 6, the speech synthesis units suited to the phoneme sequence to be synthesized. In ordinary text-to-speech synthesis, speech synthesis units that take the phoneme sequence into account are used from the general-purpose speech synthesis units 6 in the speech database, in order to realize a smooth sequence of phonemes like that of an announcer reading the news. Such units have the drawback that the clarity of each individual sound is reduced, but for news and the like this is not a problem, because listeners infer words from the surrounding context. For a telephone number, however, there is no contextual information, so it may be misheard. Therefore, for a named entity extracted by the named entity extraction unit 4 (the portion enclosed by labels or tags), speech synthesis units of monosyllabic utterances (each monosyllable uttered in isolation), which do not take the preceding and following phonological environment into account, are prepared as the monosyllable utterance speech synthesis units 7, and the telephone number is synthesized using them. Monosyllabic utterance sounds halting, one sound at a time, but it is highly clear, so synthesized speech that is unlikely to be misheard is obtained.
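The switch between the two unit inventories in S4 can be sketched as a per-syllable lookup. Real unit selection in the general-purpose inventory would consider phonetic context; this sketch reduces both inventories to dictionaries keyed by syllable, and the file names and function name are hypothetical.

```python
def select_units(syllables, general_db, monosyllable_db):
    """Pick a unit for each syllable: from the monosyllable inventory
    inside a named entity, from the general-purpose inventory otherwise.

    syllables: list of (syllable, in_entity) pairs.
    """
    selected = []
    for syllable, in_entity in syllables:
        db = monosyllable_db if in_entity else general_db
        selected.append(db[syllable])
    return selected


# Hypothetical inventories mapping syllables to waveform identifiers.
general_db = {"wa": "general/wa.wav", "ze": "general/ze.wav",
              "ro": "general/ro.wav"}
monosyllable_db = {"ze": "mono/ze.wav", "ro": "mono/ro.wav"}

units = select_units([("wa", False), ("ze", True), ("ro", True)],
                     general_db, monosyllable_db)
```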
S5: The speech signal processing unit 9 applies speech signal processing to the speech synthesis units selected by the speech synthesis unit selection unit 8, converting their fundamental frequency, duration, and the like to the values (prosodic parameters) set by the prosody generation unit 5. Alternatively, the units may simply be concatenated without being converted to the values set by the prosody generation unit 5, because conversion by speech signal processing can degrade sound quality.
[0015]
The speech synthesizer of the present invention includes a computer having a CPU and a memory, a terminal used by a user, and a machine-readable recording medium such as a CD-ROM, a magnetic disk device, and a semiconductor memory.
The speech synthesis program, whether recorded on the recording medium or transmitted over a communication line, is read by the computer, realizes the components described above on the computer, and executes each process.
[0016]
[Effects of the Invention]
As described above, the present invention can synthesize speech in a manner of pronunciation suited to the text to be read out. As a result, the overall performance of speech synthesis from text can be improved.
[Brief description of the drawings]
FIG. 1 is a diagram showing a configuration example of a speech synthesizer of the present invention.
FIG. 2 is a diagram showing a configuration of a speech synthesizer of Conventional Example 1.
FIG. 3 is a diagram showing a configuration of a speech synthesizer of Conventional Example 2.
[Explanation of symbols]
1 ... Text, 2 ... Reading dictionary, accent dictionary, 3 ... Text analysis unit, 4 ... Named entity extraction unit, 5 ... Prosody generation unit, 6 ... General-purpose speech synthesis units, 7 ... Monosyllable utterance speech synthesis units, 8 ... Speech synthesis unit selection unit, 9 ... Speech signal processing unit, 10 ... Synthesized speech

Claims

In a speech synthesis method that takes arbitrary text as input, converts it into speech, and reads it out,
A speech synthesis method characterized by extracting a named entity from the text and reading out the named entity portion in a manner of pronunciation different from that of portions other than the named entity.

The speech synthesis method according to claim 1,
A speech synthesis method characterized in that the manner of pronunciation is changed by controlling prosodic parameters.

The speech synthesis method according to claim 1 or 2,
A speech synthesis method characterized in that, as the way of changing the manner of pronunciation, speech data suited to the reading is specially held, and speech is synthesized using that speech data.

In a speech synthesizer that takes arbitrary text as input, converts it into speech, and reads it out,
Named entity extraction means for extracting a named entity from the text, and
Means for reading out the named entity portion in a manner of pronunciation different from that of portions other than the named entity.

The speech synthesizer according to claim 4,
A speech synthesizer characterized in that the means for reading out the named entity portion in a manner of pronunciation different from that of portions other than the named entity operates by controlling prosodic parameters.

The speech synthesizer according to claim 4 or 5,
A speech synthesizer characterized in that the means for reading out the named entity portion in a manner of pronunciation different from that of portions other than the named entity specially holds speech data suited to the reading and synthesizes speech using that speech data.

A speech synthesis program that causes a computer to execute a speech synthesis method that takes arbitrary text as input, converts it into speech, and reads it out,
A process of extracting a named entity from the text, and
A speech synthesis program that causes the computer to execute a process of reading out the named entity portion in a manner of pronunciation different from that of portions other than the named entity.

The speech synthesis program according to claim 7,
A speech synthesis program in which the process of reading out the named entity portion in a manner of pronunciation different from that of portions other than the named entity is performed by controlling prosodic parameters.

The speech synthesis program according to claim 7 or 8,
A speech synthesis program in which, in the process of reading out the named entity portion in a manner of pronunciation different from that of portions other than the named entity, speech data suited to the reading is specially held and is used to synthesize the named entity portion in a manner of pronunciation different from that of portions other than the named entity.