JP3981619B2 - Recording list acquisition device, speech segment database creation device, and device program thereof

Info

Publication number: JP3981619B2
Application number: JP2002300714A
Authority: JP
Inventors: 秀之水野; 匡伸阿部; 理水野
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2002-10-15
Filing date: 2002-10-15
Publication date: 2007-09-26
Anticipated expiration: 2022-10-15
Also published as: JP2004138661A

Description

【０００１】
【発明の属する技術分野】
本発明は、音声合成方法に用いる収録リスト取得装置と音声素片データベース作成装置、及びそれらの装置プログラムに関する。
【０００２】
【従来の技術】
従来の音声合成技術において、近年では大容量な記憶装置の使用コストの低下と計算機の計算能力の向上に伴って、数十分から数時間に及ぶ音声をそのまま大容量の記憶装置に蓄積しておき、入力されたテキスト及び韻律情報に応じて音声データから音声素片を適切に選択し、そのまま接続するか又は韻律情報に応じてそれらを変形して接続することで高品質な音声を合成する音声合成方法が提案されている（特許文献１、非特許文献１）。
しかしながら、いかに大容量の記憶装置に数時間に及ぶ音声データを蓄積することが可能になったとしても、音声を録音しかつ音声合成に利用できるように音声素片としてセグメンテーションするなどにより音声データベースとして整備する必要があるため、そのための時間的、費用的なコストから現実的に集めることが可能な音声の量は決まってくるため、高品質な合成音声のためにいかに短期間に小コストで音声を収集するかというのは大きな課題であった。
【０００３】
そのため、入力テキストを音声合成する際に使用すべき音声素片が収録されている確率が音響的に見て最大となるように音声データベースを設計する方法（非特許文献２）や、合成処理による劣化を避けるため同一内容の発声を韻律的に多重化する方法（非特許文献３）などが提案されている。
その他の公知文献として、
基本周波数パターンを精密に決定することができる音声基本周波数パターン生成装置に関しては例えば（特許文献２）に記載されている。
更に、合成音声パワーを効率よく、しかも精度良く制御でき、波形構成型の音声合成方式など、肉声に近い品質の合成音を得ることが可能な音声合成方法に関しては（特許文献３）に記載されている。
【０００４】
更に、文法を殆ど知らないユーザでも書き換えのための経験則がなくても、なるべくそのままの形で簡単に記述でき、更に、経験則の追加や削除を容易に行うことが可能な文章書き換え方法に関しては（特許文献４）に記載されている。
更に、重要文の摘出手法としては、特に知識（辞書）を用いないLead法や単語の出現頻度に基づく手法が（非特許文献４）に記載されている。また、テキスト構造に基づく手段が（非特許文献５）に記載されている。また、機械学習の１手法であるSupport Vector Machine（以下SVMと表記）に基づく重要文の摘出手法が（非特許文献６）に記載されている。
更に、意味的に重要な単語の分類については（非特許文献７）に記載されている。
更に、テキストから音韻系列、ピッチパターン、音韻長等の音韻情報及び韻律情報を求める手法は（非特許文献８）に記載されている。
更に、統計的言語モデルに関しては（非特許文献９）に記載されている。
【０００５】
【特許文献１】
特許第２７６１５５２号明細書
【特許文献２】
特開平５−８８６９０号公報
【特許文献３】
特開平６−９５６９６号公報
【特許文献４】
特開２０００−５７１４２公報
【非特許文献１】
M.Beutnagel，A.Conkie，J.Schroeter，Y.Stylianou，and A.Syrdal，“Choose the best to modify the least：A new generation concatenative synthesis system”，Proc.Eurospeech'99
【非特許文献２】
Chu,M.，Yang,H.and Chang，E.，“Selecting Non-uniform Units From a Very Large Corpus for Concatenative Speech Synthesizer"，ICASSP 2001，Vol.2,SPEECH−L2.2,2001.
【非特許文献３】
枡田他、“韻律的に多重なデータベースの設計と評価”、音響学会講演論文集、pp.291−292、2001
【非特許文献４】
Edmundson,H.1969.New methods in automatic abstracting. Journal of ACM,16(2),264−285; Zechner,K.1996.Fast Generation of Abstracts from General Domain Text Corpora by Extracting Relevant Sentences. In Proc.of the 16th International Conference on Computational Linguistics,986−989
【非特許文献５】
Miike,S.,Itoh,E.,Ono,K.,Sumita,K.1994.A full-text Retrieval System with a Dynamic Abstract Generation Function. In Proc.of the 17th Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval,152−161
【非特許文献６】
平尾、前田、松本、“Support Vector Machineによる重要文抽出”、情処研報、2001−Fi−63,Vol.2001,No.74,pp.121−127
【非特許文献７】
日本語語彙体系（ＮＴＴコミュニケーション科学研究所監修：日本語語彙体系、岩波書店、１９９９）
【非特許文献８】
電子通信学会論文誌“規則による音声合成のための音韻時間長制御”、匂坂他、Vol.67−A,629−636（1984）
【非特許文献９】
北研二、“確率的言語モデル”、東京大学出版会、1999.p.24
【０００６】
【発明が解決しようとする課題】
前述した従来の音響的・韻律的な面から音声データベースを設計する方法では、言語的な面で見たときに重要な単語や言いまわしに対する考慮が全くないため、心理的に非常に重要な音響を与える意味的に重要な単語や意味的なまとまりをもつ言いまわしを収録できる保証は全く無い。
そのため、前記手法に基づく収録リストに基づいて音声を収録した場合は、前記収集した音声を利用した音声合成において、音響面や韻律面というミクロで見て平均的には高品質な合成音を実現可能とは言えるものの、言語的に重要な部分において高品質な合成音が実現できない場合があり、実際の音声によるコミュニケーションという意味において問題があった。
【０００７】
また、言語が持つ表層的な文字表現の多様性を考慮すると、音響面・韻律面から統計的な情報だけで音声を収録することは、一般的な表現形式のみを重視する傾向があり、いかなる表現に対しても高品質な合成音声を生成することを保証することはほぼ不可能と言えた。
この発明の目的は、いかなる表現に対しても高品質な合成音声を生成することを保証することができる音声素片データベース作成方法、この音声素片データベースを用いた音声合成方法、音声素片データベース作成装置、音声合成装置、音声素片データベース作成プログラム、音声合成プログラムを提案しようとするものである。
【０００８】
【課題を解決するための手段】
この発明では更に音声素片を接続して入力された文章に対応する音声を合成する際の音声素片を記録した音声素片データベースを作成する音声素片データベース作成方法において、テキストデータ格納したテキストデータベースにおいて、各テキストの言語的重要度を求め言語的重要度の高い言語的重要文を抽出する言語的重要文抽出過程と、テキストデータベース中の各テキストから各テキストの形態素解析処理と韻律推定により音韻系列及びピッチパターン、テンポ、ポーズ等の韻律的な特徴量を推定する韻律解析過程と、音韻系列及び韻律特徴量によって各テキストの音響的重要度を求め、言語的重要文と一致しない音響的重要度の高い音響的重要度文を抽出する音響的重要文抽出過程と、言語的重要文と音響的重要文に対応した音声収録する音声収録過程と、音声収録過程で収録した音声データに音韻ラベルを付し、音声素片データベースに記録するデータベース記録過程とを有する音声素片データベース作成方法を提案する。
【０００９】
この発明では更に音声素片を接続して入力された文章に対応する音声を合成する際の音声素片を記録した音声素片データベースを作成する音声素片データベース作成方法において、テキストデータを格納したテキストデータベースにおいて、各テキストの表現を言い換え処理により意味的に等価な別の表現に変換する言い換え過程と、言い換え過程により言い換えられたテキストから、各テキストの言語的重要度を求め言語的重要度の高い言語的重要文を抽出する言語的重要文抽出過程と、テキストデータベース中の各テキストから各テキストの形態素解析処理と韻律推定により音韻系列及びピッチパターン、テンポ、ポーズ等の韻律的な特徴量を推定する韻律解析過程と、音韻系列及び韻律特徴量によって各テキストの音響的重要度を求め、言語的な重要文と一致しない音響的重要文を抽出する音響的重要文抽出過程と、言語的重要文と音響的重要文に対応した音声を収録する音声収録過程と、音声収録過程で収録した音声データに音韻ラベルを付し、音声素片データベースに記録するデータベース記録過程とを有する音声素片データベース作成方法を提案する。
【００１０】
この発明では更に前記音声素片データベース作成方法の何れかにより作成された音声素片データベースから複数の音声素片を選択し、選択された音声素片を接続することにより音声を合成する音声合成方法において、
入力テキストの表現を言い換え処理により意味的に等価な別の表現に変換する言い換え過程と、言い換えられたテキストを解析するテキスト解析過程と、テキスト解析過程から得られた読み、及び韻律情報に基づいて、音声素片データベースから最適な音声素片を検索し、それらの音声素片を接続することにより音声を合成する音声合成過程とを有する音声合成方法を提案する。
【００１１】
この発明では更に音声素片を接続して入力された文章に対応する音声を合成する際の音声素片を記録した音声素片データベースを作成する音声素片データベース作成装置において、テキストデータを格納したテキストデータベースにおいて、各テキストの言語的重要度を求め言語的重要度の高い言語的重要文を抽出する言語的重要文抽出手段と、テキストデータベース中の各テキストから各テキスト形態素解析処理と韻律推定により音韻系列及びピッチパターン、テンポ、ポーズ等の韻律的な特徴量を推定する韻律推定手段と、音韻系列及び韻律特徴量によって各テキストの音響的重要度を求め、前記言語的重要文と一致しない音響的重要度の高い音響的重要文を抽出する音響的重要文抽出手段と、言語的重要文と音響的重要文に対応した音声を収録する音声収録手段と、音声収録手段で収録した音声データに音韻ラベルを付し、音声素片データベースに記録するデータベース記録手段とを有する音声素片データベース作成装置を提案する。
【００１２】
この発明では更に音声素片を接続して入力された文章に対応する音声を合成する際の音声素片を記録した音声素片データベースを作成する音声素片データベース作成装置において、テキストデータを格納したテキストデータベースにおいて、各テキストの表現を言い換え処理により意味的に等価な別の表現に変換する言い換え手段と、言い換え手段により言い換えられたテキストから、各テキストの言語的重要度を求め言語的重要度の高い言語的重要文を抽出する言語的重要文抽出手段と、テキストデータベース中の各テキストからテキスト解析処理と韻律推定により音韻系列及びピッチパターン、テンポ、ポーズ等の韻律的な特徴量によって各テキストの音響的重要度を求め前記言語的な重要文と一致しない音響的重要文を抽出する音響的重要文抽出手段と、言語的重要文と音響的重要文に対応した音声を収録する音声収録手段と、音声収録手段で収録した音声データに音韻ラベルを付し、音声素片データベースに記録するデータベース記録手段とを有する音声素片データベース作成装置を提案する。
【００１３】
この発明では更に前記音声素片データベース作成装置の何れかにより作成された音声素片データベースから複数の音声素片を選択し、選択された音声素片を接続することにより音声を合成する音声合成装置において、入力テキストの表現を言い換え処理により意味的に等価な別の表現に変換する言い換え手段と、言い換えられたテキストを解析するテキスト解析手段と、テキスト解析手段から得られた読み、及び韻律情報に基づいて、音声素片データベースから最適な音声素片を検索し、それらの音声素片を接続することにより音声を合成する音声合成手段とを有する音声合成装置を提案する。
この発明では更にコンピュータが解読可能な符号によって記述され、コンピュータに請求項１又は２記載の音声素片データベース作成方法の少なくとも何れか一方を実行させる音声データベース作成プログラムを提案する。
この発明では更にコンピュータが解読可能な符号によって記述され、コンピュータに請求項３に記載の音声合成方法を実行させる音声合成プログラムを提案する。
【００１４】
作用
この発明による音声素片テキストデータベース作成方法及び装置により作成された音声素片テキストデータベースは言語的な尺度で重要なテキストに基づいて音声を収集したから、言語的に重要な言いまわしなどのテキスト表現に対して高品質な合成音声が生成可能である。更に、音響的な尺度において重要なテキストに基づいた多音声も音声素片データベースに記録したから、一般的な内容のテキストにおいても高品質な合成音声を生成することができる。
更に、音声合成の際に入力テキストを言い換え処理により意味的に等価なまま、予め決められたテキスト表現形式に変換することを前提とすることで、合成すべきテキスト表現を予め決められた表現形式にまで圧縮することが可能となる。このため、言い換え処理を行なったテキストにおいて、言語的及び音響的に重要なテキストに基づいて音声を収集し、音声素片データベースに記録することで飛躍的に音声の収集効率を上げることができる。
【００１５】
【発明の実施の形態】
図１にこの発明による音声素片データベース作成装置の一実施例を示す。この図１に示した音声素片データベース作成装置の構成及び動作をこの発明による音声素片データベース作成方法と共に説明する。
図中１はテキストデータベースを示す。このテキストデータベースには例えば日本語テキストが多量に収納されている。形態素解析手段２はテキストデータベース１から日本語テキストを取り出し、テキストの形態素解析を行ない、単語境界の決定と、単語の品詞の付与、単語の読み、アクセント等の形態素を抽出する。
【００１６】
次に、前記抽出された形態素に基づいて言語的重要文抽出手段３により言語的重要文を抽出する。言語的重要文の抽出手段としては、特に知識を用いないLead法や単語の出現頻度に基づく手法、（例えば非特許文献４）あるいはテキスト構造に基づく手法（例えば非特許文献５）などいろいろあるが、ここでは機械学習の１手法であるSupport Vector Machine （以下SVMと表記）に基づく（非特許文献６に記載されている手法）を１例に説明する。
図２にSVMに基づく言語的重要文抽出処理の概念図を示す。まず予めSVMを学習しておく。学習過程を図２Ａに示す。始めに、ステップＳ２１−１でテキストの種別として重要文と非重要文とに分類してある学習用テキストデータを入力する。
【００１７】
次に、ステップＳ２２−１で前記の学習用テキストデータに対してテキスト属性分析処理により属性を求める。属性とは、例えば下記のようなものである。
文の位置（文章中における当該文の出現位置）、文の長さ、単語重要度の総和、キーワードの密度、固有表現の有無（固有名詞、数値等の単語の有無）、各形態素の有無（各種形態素の文中での有無）、重要単語の有無（文中に含まれる重要な単語の有無）
ここで、前記単語重要度は例えばＴＦ・ＩＤＦ法等既存の簡単な方法によるものでも求めることができる。また、前記キーワードも単語重要度の値の大きいものをキーワードとすればよく、キーワードの密度は下記のように求めることができる。
$FD = \sum_{k} w(k, l)\, a(k)$
$a(k) = w(t)$（単語ｔが位置ｋに出現するとき）、$a(k) = 0$（それ以外）
$w(k, l)$：ｌを窓の中心とする窓関数 $w(k)$
また意味的に重要な単語については、非特許文献７に記載されているシソーラスにおける階層の深さなどによって求めることができる。
【００１８】
次に、ステップＳ２３−１でSVMにより学習を行なう。ここでいう学習とは、訓練データとして、
（ｘ１，ｙ１），…，（ｘｍ，ｙｍ）、$x_i \in \mathbb{R}^n$、$y_i \in \{1, -1\}$（ｘｉは事例ｉにおけるｎ次元の属性のベクトル、ｙｉは正例のとき１、負例のとき−１）が与えられたとき、ｘｉを以下のような分離平面で正例（例えば重要文）、負例（例えば非重要文）に分類したときマージン（最も負例より正例側の境界面と最も正例よりの負例の境界面の距離）が最大となるように次式のｗとｂを決定することを意味する。
$w \cdot x + b = 0,\quad w \in \mathbb{R}^n,\ b \in \mathbb{R}$
言語的重要文抽出処理では、前記記述したステップＳ２３−１で学習したSVMを用いる。図２Ｂに抽出過程を示す。まず、ステップＳ２１−２でテキストデータベースから判別対象のテキストを取り出し、ステップＳ２２−２でテキスト属性分析処理により前記のようにテキストの属性を求める。次にステップＳ２３−２でSVM分類処理により重要文かどうかを判別する。判別方法は、前記の学習過程で求めたｗとｂを利用して下記の判別関数を構成し、
ｆ（ｘ）＝ｓｇｎ（ｗ・ｘ＋ｂ）
例えば、学習過程で、重要文を正例とした場合は、ｆ（ｘ）＝１なら重要文、−１なら非重要文として判別する。
【００１９】
重要文として判別されたテキストを重要文リストに加える。後は単純にテキストデータベースに含まれる全てのテキストを前記のように判別することで、重要文リストを取得することができる。
以上、述べたように言語的重要文を抽出し、言語的重要文リストを取得することが可能である。もちろん重要文抽出の方法は前記で述べたようにSVMに基づく方法には限らない。
次に、音響的重要文抽出方法について説明する。韻律解析手段４により、テキストから音韻系列、ピッチパターン、音韻長等の音韻情報及び韻律情報を求める。これは読み・アクセント解析と韻律解析（参考文献：特許文献２、特許文献３、非特許文８）により求めることができる。次に、前記音韻情報と韻律情報に基づいて音響的統計分析手段５で音響的統計分析処理を行い音響的に異なるパターンの統計的な分析を行なう。例えば、図３に示すような音韻種別、音韻の長さ、前後の音韻環境、ピッチの高さ、音韻長といった属性で分類した音韻属性について頻度分布を求める。
【００２０】
次に、前記統計的分析により得られた結果に基づいて、音響的重要文抽出手段６で音響的重要文抽出処理を行い前記言語的重要文で抽出済みでない文を音韻属性の頻度から決定しテキストの音響的重要度を決定する。具体的にはｉ番目の音韻の重みＷｉを下式
$W_i = A_{jf} / N$
（Ｗｉ：ｉ番目の音韻の重み、Ａｊｆ：ｉ番目の音韻の音韻属性Ａｊの頻度、Ｎ：全音韻属性出現数）で
定義した場合、Ｌ個の音韻を含む文の音響的重要度Ｓｗは
【数１】
（数式１は画像として提供されており、本文には再現されていない）
で求められ、音響的重要文は、前記音響的重要度で全文をソートし、既に言語的重要文として得られた分を除いて、重要度最大の文から、予め決められた全文数以内又は重要度となる文になるまでを音響的重要文として抽出し、前記言語的重要文とを併せて収録リストを収録リスト取得手段７で取得する。
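The weighting and ranking described in this paragraph can be sketched in Python as follows. This is a minimal illustration, not the patent's implementation: the function names are hypothetical, each sentence is assumed to be already reduced to a list of phoneme-attribute labels, and because the formula for S_w (【数１】) is not reproduced in this text, the sketch assumes a plain sum of the phoneme weights W_i.

```python
from collections import Counter


def phoneme_weights(corpus):
    """Return W = A_jf / N for every phoneme attribute in the corpus.

    corpus: list of sentences, each a list of phoneme-attribute labels.
    """
    freq = Counter(attr for sentence in corpus for attr in sentence)
    total = sum(freq.values())  # N: total number of attribute occurrences
    return {attr: count / total for attr, count in freq.items()}


def sentence_score(sentence, weights):
    # ASSUMPTION: S_w is taken as the sum of W_i over the sentence's phonemes.
    # The source formula is an image and may differ (e.g. a normalized sum).
    return sum(weights[attr] for attr in sentence)


def select_acoustic_sentences(corpus, linguistic_ids, max_sentences):
    """Rank sentences by score, skipping those already chosen linguistically."""
    weights = phoneme_weights(corpus)
    candidates = [i for i in range(len(corpus)) if i not in linguistic_ids]
    ranked = sorted(candidates,
                    key=lambda i: sentence_score(corpus[i], weights),
                    reverse=True)
    return ranked[:max_sentences]


corpus = [["a", "a", "b"], ["a", "c"], ["b"], ["a", "a"]]
linguistic_ids = {0}  # sentence 0 was already extracted as linguistically important
acoustic_ids = select_acoustic_sentences(corpus, linguistic_ids, max_sentences=2)
recording_list = sorted(linguistic_ids | set(acoustic_ids))
print(acoustic_ids, recording_list)
```

The cutoff here is a fixed sentence count; the text also allows stopping at an importance threshold, which would replace the final slice with a filter on the score.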
【００２１】
次に、取得した収録リストに従って例えば発声者に音声を発声してもらい音声収録手段８で音声を収録する。
音声収録後、ラベリング手段９で、音声に音韻ラベルを付加すると共にその他にピッチマーク等音声合成に必要なデータを付与し、音韻ラベルが付加された音声データをデータベース記録手段１０により音声素片データベース１１に記録する。
図４乃至図６に音声素片データベース１１に記録した音声素片データの一例を示す。この例では各音声データにテキストタグを付加して記録した場合を示す。つまり音声領域データと、音声領域データの発音内容に対応した単語分類されたテキストタグデータと、各単語の形態素（品詞データ）、を各単語が発声されている音声データ中での音声データ対応位置（ｍｓ）、ラベルデータ領域等で構成される。
【００２２】
ラベルデータ領域は例えば図５に示すように音韻単位で音韻種別、前音韻環境、後音韻環境、平均周波数Ｆ₀（Ｈｚ）、平均周波数の傾斜（Ｈｚ／ｍｓ）、時間長（ｍｓ）、パワー（ｄＢ）等で構成される。
ここで音声領域データに関しては他のデータと一緒に格納するのではなく、分離して別のデータ領域に格納してもよい。テキストタグ付き音声素片データベースの他の例としては図６に示すように、音声領域データと、音声領域データの発声内容に対応して単語分類されたテキストタグデータと、形態素（品詞データ）、掛かり受けデータ、音声データ対応位置（ｍｓ）と、図５に示したラベルデータ等で構成することができる。
【００２３】
図７に本発明の音声素片データベース作成装置の他の実施例を示す。この実施例では大量のテキストデータベース１の日本語テキストからテキストを取り出し音声素片データベース作成用言い換え処理手段１２で言い換え処理を行う。ここで、言い換え処理とは、ある文の文字表現を文のもつ内容を変えずに別の文字表現に変換する処理を言う。言い換え処理の処理フローの一例を図８に示す。入力テキスト文に対して、まずステップＳ８１で形態素解析を行なって形態素を抽出し、次にステップＳ８２で構文解析を行なうことで文の構造を求める。
例えば入力文“彼女は大きな犬に噛まれた。”に対してステップＳ８１とＳ８２で実行した形態素解析と構文解析により、図９に示すような解析木を得る。
【００２４】
次に、ステップＳ８３で変換規則の適用により文を変換する。
例えば下記のような変換規則を適用すると、
名詞句１：“は”＋名詞句２：“に”＋動詞句（受動）―＞名詞句２：“が”＋名詞句１：“を”＋動詞句（標準）入力文“彼女は大きな犬に噛まれた。”は、
“大きな犬”：“が”＋“彼女”：“を”＋“噛んだ。”＝＞“大きな犬が彼女を噛んだ。”
と変換することができる。
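The conversion rule above can be illustrated with a small Python sketch. It assumes the sentence has already been split into its two noun phrases and its passive verb phrase by the morphological and syntactic analysis of steps S81 and S82; the lookup table and function name are hypothetical stand-ins for that analysis.

```python
# Hypothetical passive-to-active verb table. A real system would derive the
# active form from morphological analysis rather than from a fixed lookup.
PASSIVE_TO_ACTIVE = {"噛まれた": "噛んだ", "選ばれた": "選んだ"}


def passive_to_active(np1, np2, passive_verb):
    """Apply: NP1 は + NP2 に + VP(passive) -> NP2 が + NP1 を + VP(active)."""
    active_verb = PASSIVE_TO_ACTIVE[passive_verb]
    return f"{np2}が{np1}を{active_verb}。"


converted = passive_to_active("彼女", "大きな犬", "噛まれた")
print(converted)
```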
【００２５】
上記の変換ルールは人手で作ることもできるし、変換例文から解析的な手法により求める（参考文献：特許文献４）こともできる。
更に、ステップＳ８４で、言語モデルの適用を行なって、上記変換された文の調整を行う。この言語モデル処理は単語の意味的な関係や部分的な変換規則により上記変換された文が言語的適格性を保証されないため、言語モデルに基づいて文の修正や書き換えの無効化などを行ない言語的適格性を保証するために実行される。
ここで用いる言語モデルとしては、例えば統計的言語モデル（参考文献：非特許文献９）等を用いることができ、代表的な手法であるＮ単語の連鎖確率に基づくＮｇｒａｍモデル（非特許文献９）等により、変換文の適格性を確率として求め、確率の低い文に対しては確率が高くなるようなＮ単語の順序の入れ換えによる文の修正や、修正不可能な確率の低い文は棄却すること等処理後、言い換え分として出力する。
【００２６】
例えば、入力文“僕は１位に選ばれた”に上記変換規則を適用すると、
僕は１位に選ばれた−＞１位が僕を選んだ
となる。
ここで、予め大量の文章から学習しておいた図１０に示す単語トライグラム表から“１位”、“が”、“僕”、“を”、“選んだ”に対するトライグラム確率は、０．２・０．０２・０．０１・０．６・０．３５＝８．４×１０^-6となるが、このなかで、“１位”、“が”、“僕”のトライグラム確率が低いことがわかる。
ここで“１位”、“僕”を含む他のトライグラムを調べると、“１位”、“に”、“僕”が０．７と高いため、“が”を“に”に修正することで、トライグラム確率は、０．２・０．３・０．７・０．５・０．３５＝７．３５×１０^-3となり、３桁程確率が向上することがわかる。従って、“１位が僕を選んだ”という文は“１位に僕を選んだ”に修正することができる。
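The arithmetic in this example can be checked with a few lines of Python. The factor lists are copied from the text; the variable names are illustrative only.

```python
from math import log10, prod

# Factors quoted in the text for the converted sentence and the corrected one.
before = [0.2, 0.02, 0.01, 0.6, 0.35]  # "１位 が 僕 を 選んだ"
after = [0.2, 0.3, 0.7, 0.5, 0.35]     # "１位 に 僕 を 選んだ"

p_before = prod(before)  # about 8.4e-6
p_after = prod(after)    # about 7.35e-3
gain = log10(p_after / p_before)  # improvement in orders of magnitude

print(f"{p_before:.2e} {p_after:.2e} {gain:.2f}")
```

The ratio is 875, so the improvement is a little under three orders of magnitude.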
変換後の文に対する以下の処理については図１の場合と同様であるので省略するが、この実施例のように予め言い換え処理を行なうことにより文字表現のばらつきが減るため、後の処理過程における言語的重要文や音響重要文の抽出において抽出精度が高くなり結果的に非常に効率のよい音声素片データベースが作成可能となる。
【００２７】
図１１に音声合成装置の一実施例を示す。
入力テキストに対し、始めに音声合成用言い換え手段１３で言い換え処理を行ない入力テキストの表現を変換する。
次に、変換されたテキストに対して、テキスト解析手段１４でテキスト解析用辞書１８を用いてテキスト解析を行ない、読み・アクセントの解析を行う。
次に、前記読み・アクセントに基づいて韻律生成手段１５で平均周波数Ｆ₀、パワー、音韻長を求める。
次に前記平均周波数Ｆ₀、パワー、音韻長及び、前記読みから決まる音韻系列に基づいて、音声素片選択手段１６で適切な音声素片を前記図１又は図７で示したような処理によって作成された音声素片データベース１９から選択する。
【００２８】
最後に音声合成手段１７において前記選択された音声素片をそのまま、又は変形して接続し合成音として出力する。
図１及び図７を用いて説明したこの発明による音声素片データベース作成装置のブロック図において形態素解析手段２、言語的重要文抽出手段３、韻律解析手段４、音響的統計分析手段５、音響的重要文抽出手段６、収録リスト取得手段７、音声収録手段８、ラベリング手段９、データベース記録手段１０、音声素片データベース作成用言い換え手段１２を全て処理ステップと読み換えることによりこの発明による音声素片データベース作成方法の処理手順を説明することができる。
【００２９】
この発明による音声素片データベース作成方法をコンピュータが解読可能な符号によって記述された音声素片データベース作成プログラムをコンピュータのＣＰＵによって解読させ、実行させることにより実現することができる。この発明による音声素片データベース作成プログラムはコンピュータが読み取り可能な例えば磁気ディスク或はＣＤ―ＲＯＭのような記録媒体に記録され、記録媒体からコンピュータにインストールされるか、又は通信回線を通じてコンピュータにインストールされて実行される。
また、図１１に示した音声合成装置のブロック図においても、音声合成用言い換え手段１３、テキスト解析手段１４、韻律生成手段１５、音声素片選択手段１６、音声合成手段１７を全て処理ステップとして読み換えることにより、この発明による音声合成方法の処理手順を説明することができる。
【００３０】
この発明による音声合成方法もコンピュータが解読可能な符号によって記述された音声合成プログラムをコンピュータに実行させることによって実現される。この発明による音声合成プログラムも上述と同様にコンピュータが読み取り可能な例えば磁気ディスク或はＣＤ―ＲＯＭのような記録媒体に記録され、これらの記録媒体からコンピュータにインストールされるか、又は通信回線を通じてコンピュータにインストールされ、ＣＰＵに解読されて実行される。
【００３１】
【発明の効果】
以上説明したように、この発明によれば言語的な尺度で重要なテキストに基づいて音声を収録し音声データベースに記録するから、言語的に重要な言いまわしなどのテキスト表現に対して高品質な合成音声が生成可能である。更に音響的な尺度において重要なテキストに基づいて音声をも音声データベースに記録することにより、一般的なテキストにおいても高品質な合成音声が生成可能となる。
更に、音声合成の際に入力テキストを言い換え処理により意味的に等価なまま予め決められたテキスト表現形式に変換することを前提とすることで、合成すべきテキスト表現を予め決められた表現形式にまで圧縮することが可能となる。そのため、前記言い換え処理を行ったテキストにおいて、言語的及び音響的に重要なテキストに基づいて音声を収録し音声データベースに記録することで飛躍的に音声の収録率を上げることが可能となる。
【図面の簡単な説明】
【図１】この発明による音声素片データベース作成装置の一実施例を説明するためのブロック図。
【図２】図１に示した実施例に用いた言語的重要文抽出手段で実行する言語的重要文抽出処理の手順を説明するためのフローチャート図。
【図３】図１に示した実施例で用いた音響的統計分析手段の処理で得られる頻度分布表を説明するための図。
【図４】この説明の音声素片データベース作成装置で作成される音声素片データベースの一例を説明するための図。
【図５】図４に示した音声素片データベースに格納されるラベルデータ領域の構成を説明するための図。
【図６】図４に示した音声素片データベースの他の例を示す図。
【図７】この発明の音声素片データベース作成装置の他の例を説明するためのブロック図。
【図８】図７に示した実施例に用いた言い換え手段１２の動作を説明するためのフローチャート。
【図９】図７に示した実施例に用いた言い換え処理で用いる構文木の一例を説明するための図。
【図１０】図７に示した実施例に用いた言い換え処理で用いる単語トライグラム表を説明するための図。
【図１１】この発明の音声合成装置及び音声合成方法を説明するためのブロック図。
【符号の説明】
１　テキストデータベース
２　形態素解析手段
３　言語的重要文抽出手段
４　韻律解析手段
５　音響的統計分析手段
６　音響的重要文抽出手段
７　収録リスト取得手段
８　音声収録手段
９　ラベリング手段
１０　データベース記録手段
１１　音声素片データベース
１２　音声素片データベース作成用言い換え手段
１３　音声合成用言い換え手段
１４　テキスト解析手段
１５　韻律生成手段
１６　音声素片選択手段
１７　音声合成手段
１８　テキスト解析用辞書
１９　音声素片データベース
[0001]
BACKGROUND OF THE INVENTION
The present invention relates to a recording list acquisition device and a speech segment database creation device used in a speech synthesis method, and to programs for those devices.
[0002]
[Prior art]
In recent years, as the cost of large-capacity storage has fallen and the computing power of computers has improved, speech synthesis methods have been proposed in which speech ranging from several tens of minutes to several hours is stored as-is in a large-capacity storage device, speech segments are selected appropriately from the speech data according to the input text and prosodic information, and high-quality speech is synthesized by concatenating the segments either unchanged or after modifying them according to the prosodic information (Patent Document 1, Non-Patent Document 1).
However, even though it has become possible to store several hours of speech data in a large-capacity storage device, the speech must still be recorded and prepared as a speech database, for example by segmenting it into speech segments so that it can be used for synthesis. The time and monetary cost of this work limits the amount of speech that can realistically be collected, so how to collect speech for high-quality synthesis in a short period and at low cost has been a major issue.
[0003]
For this reason, methods have been proposed such as designing the speech database so that the probability that the speech segments needed to synthesize an input text have been recorded is maximized from an acoustic standpoint (Non-Patent Document 2), and prosodically multiplexing utterances of the same content to avoid degradation caused by synthesis processing (Non-Patent Document 3).
Other known documents include the following.
A speech fundamental frequency pattern generation device capable of precisely determining a fundamental frequency pattern is described in, for example, Patent Document 2.
Furthermore, a speech synthesis method that can control synthesized speech power efficiently and accurately, and that can produce synthesized speech of a quality close to natural voice in approaches such as waveform-construction speech synthesis, is described in Patent Document 3.
[0004]
Furthermore, a text rewriting method in which even a user who knows almost no grammar can write down empirical rules for rewriting simply, in as direct a form as possible, and in which empirical rules can easily be added and deleted, is described in Patent Document 4.
As methods for extracting important sentences, the Lead method, which uses no particular knowledge (dictionary), and methods based on word frequency are described in Non-Patent Document 4. A method based on text structure is described in Non-Patent Document 5. A method for extracting important sentences based on the Support Vector Machine (hereinafter "SVM"), a machine learning technique, is described in Non-Patent Document 6.
Classification of semantically important words is described in Non-Patent Document 7.
A method for obtaining phoneme information and prosodic information, such as the phoneme sequence, pitch pattern, and phoneme duration, from text is described in Non-Patent Document 8.
Statistical language models are described in Non-Patent Document 9.
[0005]
[Patent Document 1]
Japanese Patent No. 2761552
[Patent Document 2]
JP-A-5-88690
[Patent Document 3]
JP-A-6-95696
[Patent Document 4]
JP 2000-57142 A
[Non-Patent Document 1]
M. Beutnagel, A. Conkie, J. Schroeter, Y. Stylianou, and A. Syrdal, “Choose the best to modify the least: A new generation concatenative synthesis system”, Proc. Eurospeech'99
[Non-Patent Document 2]
Chu, M., Yang, H. and Chang, E., “Selecting Non-uniform Units From a Very Large Corpus for Concatenative Speech Synthesizer”, ICASSP 2001, Vol.2, SPEECH-L2.2, 2001.
[Non-Patent Document 3]
Masuda et al., “Design and Evaluation of a Prosodically Multiplexed Database”, Proceedings of the Meeting of the Acoustical Society of Japan, pp. 291-292, 2001
[Non-Patent Document 4]
Edmundson, H. 1969. New methods in automatic abstracting. Journal of ACM, 16(2), 264-285; Zechner, K. 1996. Fast Generation of Abstracts from General Domain Text Corpora by Extracting Relevant Sentences. In Proc. of the 16th International Conference on Computational Linguistics, 986-989
[Non-Patent Document 5]
Miike, S., Itoh, E., Ono, K., Sumita, K. 1994. A full-text Retrieval System with a Dynamic Abstract Generation Function. In Proc. of the 17th Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval, 152-161
[Non-Patent Document 6]
Hirao, Maeda, Matsumoto, “Important sentence extraction by Support Vector Machine”, IPSJ SIG Technical Report, 2001-Fi-63, Vol.2001, No.74, pp.121-127
[Non-Patent Document 7]
Japanese vocabulary system (supervised by NTT Communication Science Laboratories: Japanese vocabulary system, Iwanami Shoten, 1999)
[Non-Patent Document 8]
Sagisaka et al., “Phoneme duration control for speech synthesis by rule”, Transactions of the IECE (Institute of Electronics and Communication Engineers of Japan), Vol. 67-A, 629-636 (1984)
[Non-patent document 9]
Kenji Kita, “Probabilistic Language Model”, The University of Tokyo Press, 1999.p.24
[0006]
[Problems to be solved by the invention]
In the conventional methods described above, which design a speech database from acoustic and prosodic viewpoints, no consideration at all is given to words and phrasings that are important from a linguistic viewpoint. There is therefore no guarantee that semantically important words, or phrasings that form a semantic unit, both of which have a very strong psychological effect on the listener, will be recorded.
Consequently, when speech is recorded from a recording list based on those methods, speech synthesis using the collected speech can be said to achieve high-quality synthesized speech on average at the micro level of acoustics and prosody, but it may fail to achieve high-quality synthesized speech in linguistically important parts, which is a problem for actual spoken communication.
[0007]
In addition, given the diversity of surface written expressions in a language, recording speech using only statistical information from the acoustic and prosodic viewpoints tends to favor common forms of expression, and it was nearly impossible to guarantee high-quality synthesized speech for every expression.
SUMMARY OF THE INVENTION
An object of the present invention is to propose a speech segment database creation method capable of guaranteeing high-quality synthesized speech for any expression, a speech synthesis method using that speech segment database, a speech segment database creation device, a speech synthesis device, a speech segment database creation program, and a speech synthesis program.
[0008]
[Means for Solving the Problems]
The present invention proposes a speech segment database creation method for creating a speech segment database that records the speech segments used when speech corresponding to an input sentence is synthesized by concatenating speech segments, the method comprising: a linguistically important sentence extraction step of, for a text database storing text data, obtaining the linguistic importance of each text and extracting linguistically important sentences having high linguistic importance; a prosody analysis step of estimating, for each text in the text database, the phoneme sequence and prosodic features such as pitch pattern, tempo, and pauses by morphological analysis and prosody estimation; an acoustically important sentence extraction step of obtaining the acoustic importance of each text from the phoneme sequence and prosodic features and extracting acoustically important sentences that have high acoustic importance and do not coincide with the linguistically important sentences; a speech recording step of recording speech corresponding to the linguistically important sentences and the acoustically important sentences; and a database recording step of attaching phoneme labels to the speech data recorded in the speech recording step and recording it in the speech segment database.
[0009]
The present invention further proposes a speech segment database creation method for creating a speech segment database that records the speech segments used when speech corresponding to an input sentence is synthesized by concatenating speech segments, the method comprising: a paraphrasing step of, for a text database storing text data, converting the expression of each text into another semantically equivalent expression by paraphrase processing; a linguistically important sentence extraction step of obtaining the linguistic importance of each text paraphrased in the paraphrasing step and extracting linguistically important sentences having high linguistic importance; a prosody analysis step of estimating, for each text in the text database, the phoneme sequence and prosodic features such as pitch pattern, tempo, and pauses by morphological analysis and prosody estimation; an acoustically important sentence extraction step of obtaining the acoustic importance of each text from the phoneme sequence and prosodic features and extracting acoustically important sentences that do not coincide with the linguistically important sentences; a speech recording step of recording speech corresponding to the linguistically important sentences and the acoustically important sentences; and a database recording step of attaching phoneme labels to the speech data recorded in the speech recording step and recording it in the speech segment database.
[0010]
The present invention further proposes a speech synthesis method that synthesizes speech by selecting a plurality of speech segments from a speech segment database created by either of the above speech segment database creation methods and concatenating the selected speech segments, the method comprising: a paraphrasing step of converting the expression of the input text into another semantically equivalent expression by paraphrase processing; a text analysis step of analyzing the paraphrased text; and a speech synthesis step of searching the speech segment database for the optimal speech segments based on the reading and prosodic information obtained in the text analysis step and synthesizing speech by concatenating those speech segments.
[0011]
The present invention further proposes a speech segment database creation device for creating a speech segment database that records the speech segments used when speech corresponding to an input sentence is synthesized by concatenating speech segments, the device comprising: linguistically important sentence extraction means for, for a text database storing text data, obtaining the linguistic importance of each text and extracting linguistically important sentences having high linguistic importance; prosody estimation means for estimating, for each text in the text database, the phoneme sequence and prosodic features such as pitch pattern, tempo, and pauses by morphological analysis and prosody estimation; acoustically important sentence extraction means for obtaining the acoustic importance of each text from the phoneme sequence and prosodic features and extracting acoustically important sentences that have high acoustic importance and do not coincide with the linguistically important sentences; speech recording means for recording speech corresponding to the linguistically important sentences and the acoustically important sentences; and database recording means for attaching phoneme labels to the speech data recorded by the speech recording means and recording it in the speech segment database.
[0012]
The present invention further proposes a speech segment database creation device for creating a speech segment database that records the speech segments used when speech corresponding to an input sentence is synthesized by concatenating speech segments, the device comprising: paraphrasing means for, for a text database storing text data, converting the expression of each text into another semantically equivalent expression by paraphrase processing; linguistically important sentence extraction means for obtaining the linguistic importance of each text paraphrased by the paraphrasing means and extracting linguistically important sentences having high linguistic importance; acoustically important sentence extraction means for obtaining the acoustic importance of each text in the text database from the phoneme sequence and prosodic features such as pitch pattern, tempo, and pauses, which are estimated by text analysis and prosody estimation, and extracting acoustically important sentences that do not coincide with the linguistically important sentences; speech recording means for recording speech corresponding to the linguistically important sentences and the acoustically important sentences; and database recording means for attaching phoneme labels to the speech data recorded by the speech recording means and recording it in the speech segment database.
[0013]
The present invention further proposes a speech synthesis device that synthesizes speech by selecting a plurality of speech segments from a speech segment database created by either of the above speech segment database creation devices and concatenating the selected speech segments, the device comprising: paraphrasing means for converting the expression of the input text into another semantically equivalent expression by paraphrase processing; text analysis means for analyzing the paraphrased text; and speech synthesis means for searching the speech segment database for the optimal speech segments based on the reading and prosodic information obtained from the text analysis means and synthesizing speech by concatenating those speech segments.
The present invention further proposes a speech segment database creation program that is written in computer-readable code and causes a computer to execute at least one of the speech segment database creation methods according to claim 1 or 2.
The present invention further proposes a speech synthesis program which is described by a computer-readable code and causes the computer to execute the speech synthesis method according to claim 3.
[0014]
Action
Because a speech segment database created by the speech segment database creation method and device of the present invention collects speech based on texts that are important on a linguistic scale, high-quality synthesized speech can be generated for text expressions such as linguistically important phrasings. Furthermore, because speech based on texts that are important on an acoustic scale is also recorded in the speech segment database, high-quality synthesized speech can be generated for texts with general content as well.
Furthermore, by assuming that, at synthesis time, the input text is converted by paraphrase processing into a predetermined text expression format while remaining semantically equivalent, the text expressions to be synthesized can be narrowed down to the predetermined expression formats. Consequently, collecting speech based on linguistically and acoustically important texts from the paraphrased texts and recording it in the speech segment database dramatically improves the efficiency of speech collection.
[0015]
DETAILED DESCRIPTION OF THE INVENTION
FIG. 1 shows an embodiment of a speech segment database creation apparatus according to the present invention. The configuration and operation of the speech unit database creation apparatus shown in FIG. 1 will be described together with the speech unit database creation method according to the present invention.
In the figure, 1 denotes a text database, which stores, for example, a large amount of Japanese text. The morphological analysis means 2 retrieves Japanese text from the text database 1 and performs morphological analysis on it, determining word boundaries, assigning parts of speech to words, and extracting morphological information such as word readings and accents.
[0016]
Next, the linguistically important sentence extraction means 3 extracts linguistically important sentences based on the extracted morphemes. There are various ways to extract linguistically important sentences, such as the Lead method, which uses no particular knowledge, methods based on word frequency (for example, Non-Patent Document 4), and methods based on text structure (for example, Non-Patent Document 5). Here, the method described in Non-Patent Document 6, which is based on the Support Vector Machine (hereinafter "SVM"), a machine learning technique, is used as an example.
FIG. 2 shows a conceptual diagram of linguistically important sentence extraction based on an SVM. The SVM is first trained in advance. The training process is shown in FIG. 2A. First, in step S21-1, training text data whose texts have been classified in advance as important sentences or unimportant sentences is input.
[0017]
Next, in step S22-1, attributes are obtained for the training text data by text attribute analysis. Examples of attributes are as follows.
Sentence position (the position at which the sentence appears in the document), sentence length, sum of word importance, keyword density, presence of named entities (whether proper nouns, numerical values, and similar words are present), presence of each morpheme (whether each kind of morpheme occurs in the sentence), and presence of important words (whether the sentence contains important words).
Here, the word importance can be obtained even by a simple existing method such as TF·IDF. Keywords can simply be the words with large word-importance values, and the keyword density can be obtained as follows.
$FD = \sum_{k} w(k, l)\, a(k)$
$a(k) = w(t)$ when word t appears at position k, and $a(k) = 0$ otherwise
$w(k, l)$: the window function $w(k)$ with its window centered at l
Semantically important words can be determined from, for example, the depth of the hierarchy in the thesaurus described in Non-Patent Document 7.
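The keyword density formula can be sketched in Python as follows. This is only an illustration under stated assumptions: the text does not specify the shape of the window function, so a rectangular window of a chosen half-width is assumed, the window center is read as the position l, and the function and parameter names are hypothetical.

```python
def keyword_density(tokens, importance, center, half_width=2):
    """Compute FD = sum_k w(k, l) * a(k) for a window centered at `center`.

    tokens:     the words of the text, in order (position k is the list index)
    importance: word -> word importance w(t) for keywords; other words count as 0
    """
    total = 0.0
    for k, token in enumerate(tokens):
        # ASSUMPTION: rectangular window, 1 inside the window and 0 outside.
        window = 1.0 if abs(k - center) <= half_width else 0.0
        a_k = importance.get(token, 0.0)  # a(k) = w(t) if keyword t is at k, else 0
        total += window * a_k
    return total


tokens = ["speech", "synthesis", "uses", "a", "segment", "database"]
importance = {"speech": 0.5, "segment": 0.25, "database": 0.25}
fd_end = keyword_density(tokens, importance, center=4, half_width=1)
fd_start = keyword_density(tokens, importance, center=0, half_width=1)
print(fd_end, fd_start)
```

A tapered window (for example triangular) would weight keywords near the center more heavily; only the `window` line would change.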
[0018]
Next, learning is performed by the SVM in step S23-1. Learning here means the following. The training data are given as
(x1, y1), ..., (xm, ym), xi ∈ R^n, yi ∈ {1, −1}, where xi is the n-dimensional attribute vector of case i and yi is 1 for a positive example and −1 for a negative example. When the xi are classified into positive examples (for example, important sentences) and negative examples (for example, unimportant sentences) by the separation plane below, w and b in the following equation are determined so that the margin (the distance between the boundary surface of the positive examples closest to the negative examples and the boundary surface of the negative examples closest to the positive examples) is maximized.
w · x + b = 0, w ∈ R^n, b ∈ R
In the linguistic important sentence extraction process, the SVM learned in step S23-1 described above is used. FIG. 2B shows the extraction process. First, in step S21-2, the text to be determined is extracted from the text database, and in step S22-2, the text attribute is obtained as described above by text attribute analysis processing. Next, in step S23-2, it is determined whether the sentence is an important sentence by SVM classification processing. The discriminating method uses w and b obtained in the above learning process to construct the following discriminant function,
f (x) = sgn (w · x + b)
For example, when important sentences were used as positive examples in the learning process, a sentence with f(x) = 1 is determined to be an important sentence, and a sentence with f(x) = −1 is determined to be an unimportant sentence.
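The discrimination step can be sketched as below. The weight vector `w`, the bias `b` and the attribute vectors are invented values for illustration (a real w and b would come from the SVM learning in step S23-1), and treating a score of exactly 0 as positive is an assumed convention.

```python
def sgn(value):
    """Sign function; mapping 0 to +1 is an assumed convention."""
    return 1 if value >= 0 else -1


def classify(w, b, x):
    """Discriminant function f(x) = sgn(w . x + b)."""
    return sgn(sum(wi * xi for wi, xi in zip(w, x)) + b)


def important_sentence_list(w, b, sentences):
    """Keep the sentences whose attribute vector is classified as +1."""
    return [text for text, x in sentences if classify(w, b, x) == 1]


# Hypothetical parameters and attribute vectors, for illustration only.
w = [0.5, -0.25, 1.0]
b = -0.6
sentences = [
    ("sentence A", [0.9, 0.2, 0.7]),  # score 0.5   -> important
    ("sentence B", [0.1, 0.8, 0.1]),  # score -0.65 -> unimportant
]
important = important_sentence_list(w, b, sentences)
```

Applying `important_sentence_list` to every text in the text database corresponds to building the important sentence list.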
[0019]
The text determined as the important sentence is added to the important sentence list. After that, it is possible to acquire the important sentence list by simply discriminating all the texts included in the text database as described above.
As described above, it is possible to extract linguistic important sentences and acquire a linguistic important sentence list. Of course, the important sentence extraction method is not limited to the SVM-based method as described above.
Next, an acoustic important sentence extraction method will be described. The prosody analysis means 4 obtains, from the text, phoneme information and prosodic information such as the phoneme sequence, pitch pattern, and phoneme length. These can be obtained by reading/accent analysis and prosodic analysis (reference documents: Patent Document 2, Patent Document 3, Non-Patent Document 8). Next, based on the phoneme information and prosodic information, the acoustic statistical analysis means 5 performs acoustic statistical analysis processing, that is, statistical analysis of acoustically different patterns. For example, as shown in FIG. 3, a frequency distribution is obtained for phoneme attributes classified by attributes such as phoneme type, preceding and following phoneme environment, pitch height, and phoneme length.
[0020]
Next, on the basis of the result obtained by the statistical analysis, acoustic important sentence extraction processing is performed by the acoustic important sentence extraction means 6. For sentences that have not been extracted as linguistic important sentences, the acoustic importance of the text is determined based on the frequency of phoneme attributes. Specifically, the weight Wi of the i-th phoneme is defined by the following formula:
Wi = Ajf / N
where Wi is the weight of the i-th phoneme, Ajf is the frequency of the phoneme attribute Aj of the i-th phoneme, and N is the number of all phoneme attributes.
With this definition, the acoustic importance Sw of a sentence containing L phonemes is
[Expression 1]
Sw = Σ_{i=1}^{L} Wi (the sum of the weights of the phonemes in the sentence, as stated in claim 1)
The acoustic important sentences are obtained by sorting all sentences by acoustic importance and, excluding those already obtained as linguistic important sentences, extracting sentences in descending order of importance until a predetermined total number of sentences is reached or down to a predetermined importance level. The recording list acquisition means 7 then acquires the recording list by combining these acoustic important sentences with the linguistic important sentences.
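A minimal sketch of this selection is given below. Several points are assumptions: the acoustic importance is taken to be the plain sum of the phoneme weights (as claim 1 words it), N is interpreted as the total number of phoneme attribute occurrences in the text database, each phoneme is reduced to one hashable attribute tuple, and all function and variable names are hypothetical.

```python
from collections import Counter


def acoustic_importance(phonemes, counts, total):
    """Sw = sum of Wi, with Wi = (frequency of the phoneme's attribute) / N."""
    return sum(counts[attr] / total for attr in phonemes)


def build_recording_list(sentences, linguistic, max_total):
    """sentences maps a sentence id to its list of phoneme attribute tuples."""
    # Frequency distribution over the whole text database.
    counts = Counter(attr for ph in sentences.values() for attr in ph)
    total = sum(counts.values())  # assumed meaning of N
    # Sentences already chosen as linguistic important sentences are skipped.
    candidates = [s for s in sentences if s not in linguistic]
    ranked = sorted(
        candidates,
        key=lambda s: acoustic_importance(sentences[s], counts, total),
        reverse=True,
    )
    acoustic = ranked[: max(0, max_total - len(linguistic))]
    return list(linguistic) + acoustic


# Toy data: each phoneme is (phoneme type, pitch class).
sentences = {
    "s1": [("a", "high"), ("k", "low")],
    "s2": [("a", "high"), ("a", "high")],
    "s3": [("k", "low")],
}
recording_list = build_recording_list(sentences, linguistic=["s1"], max_total=2)
```

Here "s2" is picked ahead of "s3" because its phonemes carry the more frequent attribute combination. Stopping at a predetermined importance level instead of a total count would only change the final slicing step.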
[0021]
Next, according to the acquired recording list, for example, a voice is uttered by a speaker, and the voice recording means 8 records the voice.
After the voice is recorded, the labeling means 9 adds phoneme labels to the voice together with data necessary for speech synthesis, such as pitch marks, and the database recording means 10 records the labeled voice data in the speech segment database 11.
FIG. 4 to FIG. 6 show examples of speech unit data recorded in the speech unit database 11. These examples show the case where a text tag is added to each piece of audio data before recording. That is, each entry consists of the voice area data, the text tag data divided into words corresponding to the utterance content of the voice area data, the morpheme (part-of-speech data) of each word, the voice data corresponding position (ms) at which each word is uttered within the voice data, and a label data area.
[0022]
For example, as shown in FIG. 5, the label data area consists of a phoneme type, a phoneme environment, a preceding phoneme environment, a following phoneme environment, an average frequency F₀ (Hz), the slope of the average frequency (Hz/ms), a time length (ms), power (dB), and the like.
Here, the voice area data need not be stored together with the other data and may instead be stored separately in another data area. As another example of a speech unit database with text tags, as shown in FIG. 6, the database can be composed of voice area data, text tag data divided into words corresponding to the utterance content of the voice area data, the morpheme (part-of-speech data), the data receiving position, the voice data corresponding position (ms), and the label data shown in FIG. 5.
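One way to picture a database entry with text tags is the record layout sketched below. The class and field names are hypothetical, the field set follows the items listed for the label data area, and the sample values are invented for illustration; the audio itself is referenced by a path, reflecting the option of storing it in a separate data area.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PhonemeLabel:
    """One row of the label data area (field names are illustrative)."""
    phoneme: str
    preceding: str
    following: str
    mean_f0_hz: float
    f0_slope_hz_per_ms: float
    duration_ms: float
    power_db: float


@dataclass
class WordEntry:
    """Text tag for one word, with its position in the recorded audio."""
    text_tag: str
    part_of_speech: str
    start_ms: float
    labels: List[PhonemeLabel] = field(default_factory=list)


@dataclass
class SpeechSegmentRecord:
    """One recorded utterance; the audio is kept in a separate data area."""
    audio_path: str
    words: List[WordEntry] = field(default_factory=list)


# Invented sample values, for illustration only.
record = SpeechSegmentRecord(
    audio_path="utterance_0001.wav",
    words=[
        WordEntry(
            text_tag="inu",
            part_of_speech="noun",
            start_ms=120.0,
            labels=[PhonemeLabel("i", "#", "n", 210.0, 0.05, 80.0, 62.0)],
        )
    ],
)
```

Keeping the word-level tags and the phoneme-level labels in one nested record lets a unit selector look up segments either by text or by acoustic attributes.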
[0023]
FIG. 7 shows another embodiment of the speech segment database creation apparatus of the present invention. In this embodiment, text is extracted from a large amount of Japanese text in the text database 1, and paraphrase processing is performed by the paraphrase means 12 for speech segment database creation. Here, paraphrase processing refers to processing that converts the character expression of a sentence into another character expression without changing the content of the sentence. An example of the processing flow of the paraphrase processing is shown in FIG. 8. First, in step S81, morphological analysis is performed on the input text sentence to extract morphemes, and then in step S82 syntax analysis is performed to obtain the sentence structure.
For example, for the input sentence "She was bitten by a big dog", a parse tree as shown in FIG. 9 is obtained by the morphological analysis and syntactic analysis executed in steps S81 and S82.
[0024]
Next, in step S83, the sentence is converted by applying a conversion rule.
For example, suppose the following conversion rule is applied:
Noun phrase 1: "wa" + noun phrase 2: "ni" + verb phrase (passive) -> noun phrase 2: "ga" + noun phrase 1: "o" + verb phrase (standard)
Then the input sentence "She was bitten by a big dog." can be converted as follows:
"Big dog": "ga" + "she": "o" + "bit." => "A big dog bit her."
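The passive-to-active conversion rule can be sketched on an already analysed sentence as follows. The chunk representation (a list of phrase and particle pairs ending in a verb and voice pair) and the function name are assumptions made for illustration, and the object particle is written "o"; a real system would operate on the parse tree from steps S81 and S82.

```python
def passive_to_active(chunks):
    """Apply: NP1 "wa" + NP2 "ni" + VP(passive) -> NP2 "ga" + NP1 "o" + VP(active).

    chunks is [(np1, particle1), (np2, particle2), (verb, voice)].
    Input that does not match the rule is returned unchanged.
    """
    if len(chunks) != 3:
        return chunks
    (np1, p1), (np2, p2), (verb, voice) = chunks
    if (p1, p2, voice) != ("wa", "ni", "passive"):
        return chunks
    return [(np2, "ga"), (np1, "o"), (verb, "active")]


# "She was bitten by a big dog." in chunked form (English glosses).
source = [("she", "wa"), ("big dog", "ni"), ("bite", "passive")]
converted = passive_to_active(source)
```

The result corresponds to "A big dog bit her."; a sentence that does not fit the pattern passes through untouched, so several such rules could be tried in turn.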
[0025]
The above conversion rule can be created manually, or can be obtained from the converted example sentence by an analytical method (reference document: Patent Document 4).
In step S84, a language model is applied to adjust the converted sentence. Because of the semantic relationships between words and the partial nature of the conversion rules, the linguistic well-formedness of a converted sentence is not guaranteed. The language model processing is therefore performed to correct or reject rewritten sentences on the basis of the language model and thereby ensure their linguistic well-formedness.
As the language model used here, for example, a statistical language model (reference document: Non-Patent Document 9) can be used, such as the N-gram model (Non-Patent Document 9), a representative technique based on the chain probability of N words. With this model, the well-formedness of the converted sentence is obtained as a probability. A sentence with a low probability is corrected, for example by changing the order of the N words so that the probability becomes high, and a low-probability sentence that cannot be corrected is rejected. After this processing, the result is output as the paraphrase.
[0026]
For example, applying the above conversion rule to the input sentence "I was chosen as the representative" gives
I was chosen 1st -> 1st chose me
Here, from the word trigram table shown in FIG. 10, the trigram probability for "1st place", "ga", "me", "choose", "selected" is 0.2 · 0.02 · 0.01 · 0.6 · 0.35 = 8.4 × 10^-6. Within this, the trigram probability of "1st place", "ga", "me" is low.
Examining other trigrams that include "1st place" and "me" shows that the probability of "1st place", "ni", "me" is as high as 0.7. Changing "ga" to "ni" therefore gives a trigram probability of 0.2 · 0.3 · 0.7 · 0.5 · 0.35 = 7.35 × 10^-3, so the trigram probability is improved. Accordingly, the sentence "1st place (ga) chose me" can be corrected to "1st place (ni) chose me".
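The arithmetic in this example can be checked with a few lines of code. The factor lists are copied from the text; which word triple each factor belongs to is not spelled out there, so the lists are treated simply as the chain of trigram probabilities before and after replacing the particle "ga" with "ni". The variable names are illustrative.

```python
from math import prod

# Trigram probability chain for the converted sentence with particle "ga".
before = [0.2, 0.02, 0.01, 0.6, 0.35]
# Chain after replacing "ga" with "ni" (three factors change).
after = [0.2, 0.3, 0.7, 0.5, 0.35]

p_before = prod(before)  # about 8.4e-6
p_after = prod(after)    # about 7.35e-3

# The correction is accepted because it raises the sentence probability.
accept_correction = p_after > p_before
improvement = p_after / p_before  # roughly a factor of 875
```

A correction step of this kind would try candidate substitutions and keep the one with the highest chained probability, rejecting the sentence when no candidate clears a chosen threshold.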
The subsequent processing of the converted sentences is the same as in FIG. 1 and is therefore omitted. By performing paraphrase processing in advance as in this embodiment, however, variation in character expression is reduced, so the extraction accuracy becomes high in the extraction of the linguistic important sentences and the acoustic important sentences, and as a result a very efficient speech segment database can be created.
[0027]
FIG. 11 shows an embodiment of a speech synthesizer.
The input text is first subjected to paraphrase processing by the speech synthesizing paraphrase means 13 to convert the expression of the input text.
Next, the text analysis unit 14 performs text analysis on the converted text using the text analysis dictionary 18 to perform reading / accent analysis.
Next, based on the reading and accent, the prosody generation means 15 obtains the average frequency F₀, power, and phoneme length.
Next, based on the average frequency F₀, power, and phoneme length, and on the phoneme sequence determined from the reading, the speech unit selection means 16 selects appropriate speech units from the speech unit database 19, which has been created by the process described above.
[0028]
Finally, the speech synthesis unit 17 connects the selected speech segments as they are or after being transformed and outputs them as synthesized speech.
In the block diagrams of the speech segment database creation apparatus according to the present invention described with reference to FIGS. 1 and 7, reading the morpheme analysis means 2, the linguistic important sentence extraction means 3, the prosody analysis means 4, the acoustic statistical analysis means 5, the acoustic important sentence extraction means 6, the recording list acquisition means 7, the voice recording means 8, the labeling means 9, the database recording means 10, and the paraphrase means 12 for speech segment database creation as processing steps gives a description of the processing procedure of the speech segment database creation method according to the present invention.
[0029]
The speech segment database creation method according to the present invention can be realized by causing the CPU of a computer to interpret and execute a speech segment database creation program written in computer-readable code. The speech segment database creation program according to the present invention is recorded on a computer-readable recording medium such as a magnetic disk or CD-ROM, and is installed in the computer from the recording medium or through a communication line and then executed.
Likewise, in the block diagram of the speech synthesizer shown in FIG. 11, reading the speech synthesis paraphrase means 13, the text analysis means 14, the prosody generation means 15, the speech segment selection means 16, and the speech synthesis means 17 as processing steps gives a description of the processing procedure of the speech synthesis method according to the present invention.
[0030]
The speech synthesis method according to the present invention is also realized by causing a computer to execute a speech synthesis program written in computer-readable code. As described above, the speech synthesis program according to the present invention is recorded on a computer-readable recording medium such as a magnetic disk or a CD-ROM, is installed in the computer from the recording medium or through a communication line, and is interpreted and executed by the CPU.
[0031]
[Effects of the invention]
As described above, according to the present invention, speech is recorded based on text that is important on a linguistic scale and stored in the speech database, so high-quality synthesized speech can be generated for text expressions such as linguistically important phrases. Furthermore, by recording speech based on text that is important on an acoustic scale and storing it in the speech database, high-quality synthesized speech can be generated for general text as well.
Furthermore, on the premise that the input text is converted at the time of speech synthesis, by paraphrase processing, into a predetermined text representation format while remaining semantically equivalent, the text representations to be synthesized can be compressed into the predetermined representation format. For this reason, for text subjected to the paraphrase processing, recording speech based on the linguistically and acoustically important text and storing it in the speech database makes it possible to dramatically increase the voice recording rate.
[Brief description of the drawings]
FIG. 1 is a block diagram for explaining an embodiment of a speech segment database creating apparatus according to the present invention.
FIG. 2 is a flowchart for explaining the procedure of linguistic important sentence extraction processing executed by the linguistic important sentence extracting means used in the embodiment shown in FIG. 1;
FIG. 3 is a diagram for explaining a frequency distribution table obtained by processing of the acoustic statistical analysis means used in the embodiment shown in FIG. 1;
FIG. 4 is a diagram for explaining an example of a speech unit database created by the speech unit database creation device of the present invention.
FIG. 5 is a diagram for explaining the configuration of a label data area stored in the speech segment database shown in FIG. 4;
FIG. 6 is a diagram showing another example of the speech segment database shown in FIG. 4.
FIG. 7 is a block diagram for explaining another example of the speech segment database creating apparatus of the present invention.
FIG. 8 is a flowchart for explaining the operation of the paraphrase means 12 used in the embodiment shown in FIG. 7;
FIG. 9 is a diagram for explaining an example of a syntax tree used in the paraphrase processing used in the embodiment shown in FIG. 7;
FIG. 10 is a diagram for explaining a word trigram table used in the paraphrasing process used in the embodiment shown in FIG. 7;
FIG. 11 is a block diagram for explaining a speech synthesis apparatus and speech synthesis method according to the present invention.
[Explanation of symbols]
1 Text database
2 Morphological analysis means
3 Linguistic important sentence extraction means
4 Prosody analysis means
5 Acoustic statistical analysis means
6 Acoustic important sentence extraction means
7 Recording list acquisition means
8 Voice recording means
9 Labeling means
10 Database recording means
11 Speech segment database
12 Paraphrase means for creating speech segment database
13 Paraphrase means for speech synthesis
14 Text analysis means
15 Prosody generation means
16 Speech segment selection means
17 Voice synthesis means
18 Text analysis dictionary
19 Speech segment database

Claims

In a recording list acquisition device for creating a recording list of texts to be uttered, the list being necessary for creating, by recording uttered voice data, a speech segment database to be used for speech synthesis,
Linguistic important text extracting means for obtaining the linguistic importance of each text in a text database storing text data and extracting linguistic important text having a high linguistic importance;
Prosody analysis means for obtaining, from each text in the text database, phoneme information and prosody information such as a phoneme series, pitch pattern, tempo, and pauses by morphological analysis processing and prosody estimation of each text;
Acoustic statistical analysis means for determining, based on the phoneme information and prosody information, the frequency of occurrence of combinations of phoneme attributes such as phoneme type, preceding and following phoneme environment, pitch height, and phoneme length from the texts in the text database;
Acoustic important text extracting means for determining, based on the frequency of occurrence of the combinations of phoneme attributes, the acoustic importance of each text as the sum of the weights of its phonemes, the weight of each phoneme being a value proportional to the frequency of occurrence of the combination of phoneme attributes of that phoneme, and for extracting acoustic important text that has a high acoustic importance and does not match the linguistic important text;
Recording list acquisition means for combining the linguistic important text and the acoustic important text into a recording list;
A recording list acquisition device characterized by comprising the above.

In the recording list acquisition device according to claim 1,
Paraphrase means is provided for converting each text representation into another semantically equivalent representation by paraphrase processing,
and the linguistic important text extracting means extracts linguistic important text having a high linguistic importance from the text paraphrased by the paraphrase means.

The recording list acquisition device according to claim 1 or 2;
Means for recording voice data uttered according to the recording list acquired by the recording list acquisition device;
Labeling means for attaching a phonological label to the voice data;
Database recording means for recording the voice data with the phonological label in a speech segment database;
A speech segment database creation apparatus characterized by comprising the above.

An apparatus program for causing a computer to operate as the apparatus according to claim 1.