JP4454780B2 - Audio information processing apparatus, method and storage medium

Info

Publication number: JP4454780B2
Application number: JP2000099420A
Authority: JP
Inventors: 泰夫奥谷; 康弘小森
Original assignee: Canon Inc
Current assignee: Canon Inc
Priority date: 2000-03-31
Filing date: 2000-03-31
Publication date: 2010-04-21
Anticipated expiration: 2020-03-31
Also published as: JP2001282273A

Description

【０００１】
【発明の属する技術分野】
本発明は、音声合成で使用される素片辞書を作成する音声情報処理装置及びその方法と記憶媒体に関するものである。
【０００２】
【従来の技術】
近年、音声素片を１ピッチ波形単位で複製及び、或いは削除しながら所望のピッチ間隔で貼り合わせて編集し（PSOLA：ピッチ同期波形重畳法）、それらの音声素片を接続して音声合成する音声合成方法が主流となっている。
【０００３】
【発明が解決しようとする課題】
このような技術を利用して音声合成された音声には、音声素片を編集することによる歪（以下、変形歪）と、音声素片を接続することによる歪（以下、接続歪）とが含まれる。これら２つの歪が、合成された音声の品質劣化を引き起こす大きな要因となる。中でも、素片辞書に登録できる音声素片の数が制限される状況下では、音声合成時に、このような歪が小さくなるように音声素片を選択する余地がほとんど残されていない場合がある。特に、一つの音韻環境について１つの音声素片しか素片辞書に登録できない場合には、歪が小さくなるように音声素片を選択する余地は全くなく、このような素片辞書を用いると、変形歪や接続歪による合成音声の品質劣化は避けられないものとなる。
【０００４】
本発明は上記従来例に鑑みてなされたもので、接続歪や変形歪に基づき歪の影響を考慮して、素片辞書に登録する音声素片を選択することによって音声合成の音質劣化を抑制する音声情報処理装置及びその方法と記憶媒体を提供することを目的とする。
【０００５】
【課題を解決するための手段】
上記目的を達成するために本発明の音声情報処理装置は以下のような構成を備える。即ち、
音韻系列に含まれる音韻に対応する音声素片を変形することによって生じる変形歪及び当該音韻系列に含まれる隣接する音韻に対応する音声素片同士を接続することによって生じる接続歪の少なくともいずれかの歪を求める歪出力手段と、
前記歪出力手段で求められた前記歪を基準として音声素片のＮｂｅｓｔ系列を求めるＮｂｅｓｔ決定手段と、
前記Ｎｂｅｓｔ決定手段で求められたＮｂｅｓｔ系列を基に音声合成用素片辞書に登録する音声素片を選択する選択手段とを有し、
前記Ｎｂｅｓｔ系列のＮが複数である場合、前記Ｎｂｅｓｔ決定手段は、前記音韻系列を構成する複数の音声素片の前記歪に対して、前記歪の総和が小さいＮｂｅｓｔ系列から順に前記Ｎ個のＮｂｅｓｔ系列を求め、前記選択手段は、前記Ｎ個のＮｂｅｓｔ系列を構成する音声素片のうち、上位の頻度に相当する音声素片を、音声合成用素片辞書に登録する音声素片として選択することを特徴とする。
【０００７】
上記目的を達成するために本発明の音声情報処理方法は以下のような工程を備える。即ち、
音声合成用素片辞書に登録する音声素片として選択することを特徴とする音声情報処理装置の音声情報処理方法であって、
音韻系列に含まれる音韻に対応する音声素片を変形することによって生じる変形歪及び当該音韻系列に含まれる隣接する音韻に対応する音声素片同士を接続することによって生じる接続歪の少なくともいずれかの歪を求める歪出力工程と、
前記歪出力工程で求められた前記歪を基準として音声素片のＮｂｅｓｔ系列を求めるＮｂｅｓｔ決定工程と、
前記Ｎｂｅｓｔ決定工程で求められたＮｂｅｓｔ系列を基に音声合成用素片辞書に登録する音声素片を選択する選択工程とを有し、
前記Ｎｂｅｓｔ決定工程は、前記音韻系列を構成する複数の音声素片の前記歪に対して、前記歪の総和が小さいＮｂｅｓｔ系列から順に前記Ｎ個のＮｂｅｓｔ系列を求め、前記選択工程は、前記Ｎ個のＮｂｅｓｔ系列を構成する音声素片のうち、上位の頻度に相当する音声素片を、音声合成用素片辞書に登録する音声素片として選択することを特徴とする。
【０００８】
【発明の実施の形態】
以下、添付図面を参照して本発明の好適な実施の形態を詳細に説明する。
【０００９】
［実施の形態１］
図１は、本発明の実施の形態に係る音声情報処理装置のハードウェア構成を示すブロック図である。尚、本実施の形態では、一般的なパーソナルコンピュータを音声合成装置として用いる場合について説明するが、本発明は専用の音声情報処理装置であっても、また他の形態の装置であっても良い。
【００１０】
図１において、１０１は制御メモリ（ＲＯＭ）で、中央処理装置（ＣＰＵ）１０２で使用される各種制御データを記憶している。ＣＰＵ１０２は、ＲＡＭ１０３に記憶された制御プログラムを実行して、この装置全体の動作を制御している。１０３はメモリ（ＲＡＭ）で、ＣＰＵ１０２による各種制御処理の実行時、ワークエリアとして使用されて各種データを一時的に保存するとともに、ＣＰＵ１０２による各種処理の実行時、外部記憶装置１０４から制御プログラムをロードして記憶している。この外部記憶装置は、例えばハードディスク、ＣＤ−ＲＯＭ等を含んでいる。１０５はＤ／Ａ変換器で、音声信号を示すデジタルデータが入力されると、これをアナログ信号に変換してスピーカ１０９に出力して音声を再生する。１０６は入力部で、オペレータにより操作される、例えばキーボードや、マウス等のポインティングデバイスを備えている。１０７は表示部で、例えばＣＲＴや液晶等の表示器を有している。１０８はバスで、これら各部を接続している。１１０は音声合成ユニットである。
【００１１】
以上の構成において、本実施の形態の音声合成ユニット１１０を制御するための制御プログラムは外部記憶装置１０４からロードされてＲＡＭ１０３に記憶され、その制御プログラムで用いる各種データは、制御メモリ１０１に記憶されている。これらのデータは、中央処理装置１０２の制御の下にバス１０８を通じて適宜メモリ１０３に取り込まれ、中央処理装置１０２による制御処理で使用される。Ｄ／Ａ変換器１０５は、制御プログラムを実行することによって作成される音声波形データ（ディジタル信号）をアナログ信号に変換してスピーカ１０９に出力する。
【００１２】
図２は、本実施の形態に係る音声合成ユニット１１０のモジュール構成を示すブロック図で、この音声合成ユニット１１０は、大きく分けて、素片辞書２０６に音声素片を登録するための処理を実行する素片辞書作成モジュールと、テキストデータを入力し、そのテキストデータに対応する音声を合成して出力する処理を行なう音声合成モジュールの２つのモジュールを有している。
【００１３】
図２において、２０１は、入力部１０６又は外部記憶装置１０４から任意のテキストデータを入力するテキスト入力部、２０２は解析辞書、２０３は言語解析部、２０４は韻律生成規則保持部、２０５は韻律生成部、２０６は素片辞書、２０７は音声素片選択部、２０８は音声素片編集・接続部、２０９は音声波形出力部、２１０は音声データベース、２１１は素片辞書作成部、２１２はテキストコーパスである。このテキストコーパス２１２には、入力部１０６などを介して種々の内容のテキストを入力することができる。
【００１４】
まず、音声合成モジュールについて説明する。この音声合成モジュールでは、言語解析部２０３が、解析辞書２０２を参照して、テキスト入力部２０１から入力されるテキストの言語解析を行なう。こうして解析された結果が韻律生成部２０５に入力される。韻律生成部２０５は、言語解析部２０３における解析結果と、韻律生成規則保持部２０４に保持されている韻律生成規則に関する情報とを基に、音韻系列と韻律情報を生成して音声素片選択部２０７及び音声素片編集・接続部２０８に出力する。続いて、音声素片選択部２０７は、韻律生成部２０５から入力される韻律生成結果を用いて、素片辞書２０６に保持されている音声素片から対応する音声素片を選択する。音声素片編集・接続部２０８は、韻律生成部２０５から入力される韻律生成結果に従って、音声素片選択部２０７から出力される音声素片を編集及び接続して音声波形を生成する。こうして生成された音声波形は、音声波形出力部２０９で出力される。
【００１５】
次に、素片辞書作成モジュールについて説明する。
【００１６】
このモジュールでは、素片辞書作成部２１１が、後述する手順に基づいて、音声データベース２１０の中から音声素片を選び出して素片辞書２０６に登録する。
【００１７】
次に、上記構成を備えた本実施の形態の音声合成処理について説明する。
【００１８】
図３は、図２の音声合成モジュールにおける音声合成処理（オンライン処理）の流れを示すフローチャートである。
【００１９】
まずステップＳ３０１で、テキスト入力部２０１は、文、文節、単語等の単位毎にテキストデータを入力してステップＳ３０２に移る。ステップＳ３０２では、言語解析部２０３により当該テキストデータの言語解析を行う。次にステップＳ３０３に進み、音韻生成部２０５はステップＳ３０２で解析された結果と所定の韻律規則とに基づいて、音韻系列と韻律情報を生成する。次にステップＳ３０４に進み、各音韻毎にステップＳ３０３で得られた韻律情報と所定の音韻環境とに基づいて、音声素片選択部２０７が素片辞書２０６に登録されている音声素片を選択する。次にステップＳ３０５に進み、その選択された音声素片及びステップＳ３０３で生成された韻律情報とに基づいて、音声素片編集・接続部２０８により音声素片の編集および接続を行なってステップＳ３０６に進む。ステップＳ３０６では、音声素片編集・接続部２０８によって生成された音声波形を、音声波形出力部２０９が音声信号として出力する。このようにして、入力されたテキストに対応する音声が出力されることになる。
【００２０】
図４は、図２で示した素片辞書作成モジュールの、より詳細な構成を示すブロック図で、前述の図２と共通する部分は同じ番号で示し、かつ本実施の形態の特徴である素片辞書作成部２１１の構成をより詳細に示している。
【００２１】
図４において、４０１はテキスト入力部、４０２は言語解析部、４０３は解析辞書、４０４は韻律生成規則保持部、４０５は韻律生成部、４０６は音声素片検索部、４０７は音声素片保持部、４０８は音声素片編集部、４０９は変形歪決定部、４１０は接続歪決定部、４１１は歪決定部、４１２は歪保持部、４１３はＮbest決定部、４１４はＮbest保持部、４１５は登録素片決定部、４１６は登録素片保持部である。
【００２２】
以下、詳しく説明する。
【００２３】
テキスト入力部４０１は、テキストコーパス２１２から、例えば文単位にテキストデータを取り出して言語解析部４０２に出力する。言語解析部４０２は、解析辞書４０３を参照してテキスト入力部４０１から入力されたテキストデータを解析する。韻律生成部４０５は、言語解析部４０２で解析された解析結果に基づいて音韻系列を生成し、韻律生成規則保持部４０４が保持する韻律生成規則（アクセントパターン、自然降下成分、ピッチパターン等）を参照して韻律情報を生成する。音声素片検索部４０６は、韻律生成部４０５で生成される韻律情報と音韻系列とに従って音声データベース２１０から、各音韻毎に、所定の音韻環境を考慮した音声素片を検索する。こうして検索された音声素片は一旦、音声素片保持部４０７に保持される。音声素片編集部４０８は、韻律生成部４０５で生成された韻律情報に合わせて音声素片保持部４０７に保持されている音声素片を編集する。この編集には、韻律情報に合わせて音声素片同士を接続する処理や、またその音声素片同士の接続に際して音声素片の一部を削除する等して変形する処理などが含まれる。
【００２４】
変形歪決定部４０９は、各音声素片の変形前と変形後の音響的特徴の変化から変形歪を決定する。接続歪決定部４１０は、音韻系列において一つ前の音声素片の終端付近の音響的特徴と当該音声素片の始端付近の音響的特徴から、これら音声素片同士が接続された場合の接続歪を決定する。歪決定部４１１は、変形歪決定部４０９で決定された変形歪と、接続歪決定部４１０で決定された接続歪とを考慮し、音韻系列ごとにトータルの歪（歪値ともいう）を決定する。歪保持部４１２は、歪決定部４１１で決定された各音声素片に至る歪の値を保持する。Ｎbest決定部４１３は、Ａ*（エースター）探索アルゴリズムを用いて、音韻系列毎に歪が最小となる上位Ｎ通りの最適パスを求める。Ｎbest保持部４１４は、Ｎbest決定部４１３で求めたＮ個の最適パスを入力テキストごとに保持する。登録素片決定部４１５は、Ｎbest保持部４１４に保持されている、各音韻ごとにＮbestの結果から、その頻度順に、素片辞書２０６に登録する音声素片を選び出す。登録素片保持部４１６は、登録素片決定部４１５により選ばれた音声素片を保持する。
【００２５】
図５は、図４で示す素片辞書作成モジュールにおける処理の流れを示すフローチャートである。
【００２６】
まずステップＳ５０１で、テキスト入力部４０１がテキストコーパス２１２から一文ずつテキストデータを取り出す。取り出せるテキストデータが存在しなくなると、最終的に登録する音声素片を決定するステップＳ５１２に進む。テキストデータが存在する場合はステップＳ５０２に進み、言語解析部４０２において、解析辞書４０３を使って、その入力されたテキストデータの言語解析を行なってステップＳ５０３に進む。ステップＳ５０３では、韻律生成部４０５により、韻律生成規則保持部４０４が保持する韻律生成規則と、ステップＳ５０２における言語解析結果とに基づいて韻律情報並びに音韻系列を生成する。次にステップＳ５０４に進み、ステップＳ５０３で生成された音韻系列内の各音韻を順次処理する。このステップＳ５０４で未処理の音韻が存在しない場合はステップＳ５１１に進むが、未処理の音韻が存在する場合はステップＳ５０５に進む。ステップＳ５０５において、音声素片検索部４０６は、各音韻毎に音韻環境及び韻律規則を満足する音声素片を音声データベース２１０から検索して音声素片保持部４０７に保存する。
【００２７】
例えば具体例で説明すると、テキストデータとして「こんにちわ」が入力されると、それが言語解析され、アクセントやイントネーション等を含む韻律情報が生成される。そして、この「こんにちわ」は、例えばｄｉｐｈｏｎｅを音韻の単位として用いた場合、以下のような音韻系列に分解される。
【００２８】

なお、ここで「Ｘ」は、音声「ん」を示し、「/」は無声音を示す。
【００２９】
次にステップＳ５０６に進み、その検索された複数の音声素片について順次処理する。未処理の音声素片が存在しない場合はステップＳ５０４に戻って次の音韻の処理に進むが、存在する場合はステップＳ５０７に進んで、現在の音韻の音声素片を処理する。ステップＳ５０７では、音声素片編集部４０８が、上述の音声合成処理時と同じ手法を用いて音声素片の編集を行なう。ここでいう音声素片の編集とは、例えばピッチ同期波形重畳法（PSOLA）などの処理である。この音声素片の編集には、その音声素片と韻律情報を用いる。音声素片の編集が終了したらステップＳ５０８に進み、変形歪決定部４０９により、現在の音声素片の変形前と変形後における音響的特徴の変化を変形歪として算出する（この詳細は後述する）。次にステップＳ５０９に進み、接続歪決定部４１０により、現在の音声素片とその一つ前の音韻の音声素片の全てとの接続歪を算出する（この処理についても詳しく後述する）。次にステップＳ５１０に進み、歪決定部４１１は、変形歪と接続歪から現在の音声素片に至るパスの全てについて歪値を決定する（後述する）。そして現在の音声素片に至るパスの歪値の上位Ｎ個（Ｎ：求めたいＮbestの個数）と、そのパスを表わす一つ前の音韻の音声素片へのポインタを歪保持部４１２に保持してステップＳ５０６に戻り、現在の音韻において未処理の音声素片が存在するかどうかを調べる。
【００３０】
こうしてステップＳ５０６で、各音韻における全ての音声素片が処理され、更にステップＳ５０４で全ての音韻が処理されるとステップＳ５１１に進む。ステップＳ５１１において、Ｎbest決定部４１３は、Ａ*探索アルゴリズムを用いたＮbest探索を行ない、上位Ｎ位までの最適パス（音声素片系列ともいう）を求め、これをＮbest保持部４１４に保持してステップＳ５０１に戻る。
【００３１】
こうして全テキストに対する処理が終了するとステップＳ５０１からステップＳ５１２に進み、登録素片決定部４１５は、音韻ごとに全テキストのＮbest結果に基づいて所定の頻度以上を選択して音声素片を素片辞書２０６に登録する。尚、このＮbestにおけるＮの値は、予備実験などから経験的に与えておく。こうして決定された音声素片は、登録素片保持部４１６を介して素片辞書２０６に登録される。
【００３２】
図６は、本実施の形態に係る図５のステップＳ５０８における変形歪の求め方を説明する図である。
【００３３】
ここでは、PSOLA法によりピッチ間隔を広げる場合について図示している。矢印はピッチマーク、点線は変形前と変形後のピッチ素片の対応関係を表わしている。本実施の形態では、各ピッチ素片（微細素片ともいう）の変形前後のケプストラム距離に基づいて変形歪を表わす。具体的には、まず変形後のあるピッチ素片（例えば６０で示す）のピッチマーク６１を中心にハニング窓６２（窓長２５.６ミリ秒）をかけ、そのピッチ素片６０を周辺のピッチ素片を含めて切り出す。こうして切り出したピッチ素片６０をケプストラム分析する。次に、ピッチマーク６１に対応する変形前のピッチ素片６３のピッチマーク６４を中心にして同じ窓長のハニング窓６５でピッチ素片を切り出し、変形後の場合と同様にしてケプストラムを求める。このようにして求めたケプストラム同士の距離を、着目しているピッチ素片６０の変形歪として、変形後のピッチ素片とそれに対応する変形前のピッチ素片間の変形歪の総和をPSOLAで採用されるピッチ素片数Ｎpで割った値を、その音声素片の変形歪とする。こうして求められる変形歪を式で記述すると以下のようになる。
【００３４】
$$D_t = \frac{1}{N_p} \sum_{i=1}^{N} \sum_{j=0}^{16} \left| C^{org}_{i,j} - C^{tar}_{i,j} \right|$$
ここで最初のΣは、ｉ＝１からＮまでの総和を示し、次のΣはｊ＝０〜１６までの総和を示している。またＣtar i,jは、変形後のｉ番目のピッチ素片のケプストラムのｊ次元目の要素を表わし、同様に、Ｃorg i,jは、変形後に対応する変形前のピッチ素片のケプストラムのｊ次元目の要素を表わしている。
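The deformation distortion D_t of paragraph 【００３４】 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: it assumes the cepstra of the pitch segments before and after PSOLA editing have already been computed and paired, and the function name `deformation_distortion` and the list-of-lists layout are hypothetical.

```python
def deformation_distortion(c_org, c_tar):
    """Mean absolute cepstral difference over paired pitch segments.

    c_tar[i][j]: j-th cepstral coefficient of the i-th edited pitch segment.
    c_org[i][j]: j-th cepstral coefficient of the original pitch segment
                 that corresponds to the i-th edited pitch segment.
    Both layouts are assumptions made for this sketch.
    """
    if len(c_org) != len(c_tar) or not c_tar:
        raise ValueError("need the same non-zero number of pitch segments")
    total = 0.0
    for org_vec, tar_vec in zip(c_org, c_tar):
        total += sum(abs(o - t) for o, t in zip(org_vec, tar_vec))
    # Divide by Np, the number of pitch segments used by PSOLA.
    return total / len(c_tar)
```

Identical cepstra before and after editing give a distortion of zero, which matches the intent that an unedited unit carries no deformation distortion.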
【００３５】
図７は、本実施の形態における接続歪の求め方を説明する図である。
【００３６】
この接続歪は、一つ前の音韻の音声素片と現在の音声素片との接続箇所において生じる歪を示し、ここではケプストラム距離を用いて表わす。具体的には、音声素片境界が存在するフレーム７０，７１（フレーム長５ミリ秒、分析窓幅２５.６ミリ秒）と、それを挟む前後それぞれ２フレームからなる計５フレームを接続歪の算出対象としている。ここでケプストラムは、０次（パワー）〜１６次（パワー）までの計１７次元ベクトルとする。そして、このケプストラムベクトルの各要素の差の絶対値の和を、現在注目している音声素片における接続歪とする。即ち、図７の７００で示すように、一つ前の音韻の音声素片における終端部のケプストラムベクトルの各要素をＣpre i,j（ｉ：フレーム番号、フレーム番号の“０”が音声素片境界があるフレームを示し、ｊがベクトルの要素番号を示す）とする。また、図７の７０１で示すように、注目音声素片における始端部のケプストラムベクトルの各要素をＣcur i,jとすると、現在注目している音声素片の接続歪Ｄcは、
$$D_c = \sum_{i=-2}^{2} \sum_{j=0}^{16} \left| C^{pre}_{i,j} - C^{cur}_{i,j} \right|$$
となる。ここで最初のΣはｉ＝−２〜２の総和を、次のΣはｊ＝０〜１６までの総和を示す。
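The connection distortion D_c of paragraph 【００３６】 can be sketched in the same way. The sketch assumes that the five cepstral vectors around the unit boundary (frames -2 to 2, orders 0 to 16) are already available for both units; the function name and argument layout are hypothetical.

```python
def connection_distortion(c_pre, c_cur):
    """Sum of absolute cepstral differences around a unit boundary.

    c_pre: cepstral vectors for frames -2..2 at the end of the preceding unit.
    c_cur: cepstral vectors for frames -2..2 at the start of the current unit.
    Embodiment 1 uses orders 0..16 per vector; any equal lengths work here.
    """
    if len(c_pre) != len(c_cur):
        raise ValueError("both units must supply the same number of frames")
    return sum(
        abs(p - c)
        for pre_vec, cur_vec in zip(c_pre, c_cur)
        for p, c in zip(pre_vec, cur_vec)
    )
```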
【００３７】
図８は、本実施の形態に係る歪決定部４１１による、音声素片における歪の決定過程を図示したものである。本実施の形態において、音韻単位はdiphone（ダイフォン）とする。
【００３８】
図中、一つの円がある音韻における１つの音声素片を示し、円内の数字は、この音声素片に至る歪値の総和の最小値を示している。また四角で囲まれた数字は、一つ前の音韻の音声素片と現在注目している音韻の音声素片との間の歪値を示している。また矢印は、現在注目している音韻の音声素片と一つ前の音韻の音声素片との関連を示している。ここでは説明のため、ｎ番目の音韻（現在注目している音韻）のｍ番目の音声素片をＰn,mとする。この音声素片Ｐn,mの最も小さい歪値から上位Ｎ個（Ｎ：求めたいＮbestの数）までに対応する音声素片を一つ前の音韻の中から取り出し、その中のｋ番目の歪値をＤn,m,kとし、その歪値に対応するの一つ前の音韻の音声素片をＰＲＥn,m,kとすると、ＰＲＥn,m,kを介して音声素片Ｐn,mに至るパスにおける歪値の総和Ｓn,m,kは、
$S_{n,m,k} = S_{n-1,x,0} + D_{n,m,k}$ （但し、$x = \mathrm{PRE}_{n,m,k}$）
となる。
【００３９】
本実施の形態における歪値について説明する。本実施の形態では歪値Ｄtotal（上記説明におけるＤn,m,kに相当する）を、上述の接続歪Ｄcと変形歪Ｄtの重み付き和として定義する。
【００４０】
$$D_{total} = w \times D_c + (1 - w) \times D_t \quad (0 \le w \le 1)$$
ここで重み係数ｗは、予備実験など経験的に求められる係数で、ｗ＝０の場合は、歪値が変形歪Ｄtのみで説明され、ｗ＝１の場合は、歪値が接続歪Ｄcのみに依存することになる。
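The weighted sum of paragraph 【００４０】 is small enough to state directly in code. The following is only a sketch; the function name `total_distortion` is an assumption, and the range check simply enforces the stated condition on w.

```python
def total_distortion(d_c, d_t, w):
    """Weighted sum of connection distortion d_c and deformation distortion d_t."""
    if not 0.0 <= w <= 1.0:
        raise ValueError("w must lie in [0, 1]")
    return w * d_c + (1.0 - w) * d_t
```

With w = 0 the result depends only on the deformation distortion, and with w = 1 only on the connection distortion, as the text describes.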
【００４１】
歪保持部４１２では、各音韻の音声素片Ｐn,m毎に、上位Ｎ個の歪値Ｄn,m,kと、それらに対応する一つ前の音韻の音声素片ＰＲＥn,m,kと、ＰＲＥn,m,kを介してＤn,m,kに至るパスの歪値の総和Ｓn,m,kをそれぞれ保持する。
【００４２】
図８では、現在注目している音声素片Ｐn,mに至るパスの総和の最小値が「２２２」となる例を示す。この時の音声素片Ｐn,mの歪値は、Ｄn,m,1(k=1)であり、この歪値Ｄn,m,1に対応する一つ前の音韻の音声素片は、ＰＲＥn,m,1（図８のＰn-1,m８１に相当する）である。８０は、音声素片ＰＲＥn,m,1と音声素片Ｐn,mとを接続するパスである。
【００４３】
図９は、Ｎbestの決定過程を図示したものである。
【００４４】
ステップＳ５１０の終了時点で、各音声素片において、上位Ｎ個の情報がそれぞれ求まっている（フォワード探索）。Ｎbest決定部４１３では、音韻系列の末尾の音声素片９０から逆順に枝を伸ばしながらＮbestパスを求める（バックワード探索）。この枝を伸ばすノードの選択は、予測値（線の横の数字）とそこに至る総歪値の和（歪値は四角の中の数字で示される）が最小となるものである。ここでいう予測値とは、音声素片Ｐn,mにおけるフォワード探索結果の最小歪Ｓn,m,0に相当する。この場合、予測値と実際に左端までに至る最小パスの歪が等しいので、Ａ*探索アルゴリズムの性質により最適パスが求まることが保証される。
【００４５】
図９は、第１位の最適パスが決定された状態を示す図である。
【００４６】
図中、丸が音声素片を示し、その丸の中の数字が歪み予測値、太い実線が第一位のパス、四角の中の数字が歪値、線の横の数字が予測歪み値を示している。次に第２位のパスを求めるために、二重丸のノードの中で、予測値とそこに至る総歪値の和が最小となるノードを選択し、それに繋がる一つ前の音韻の音声素片の全て（最大Ｎ個）に枝を伸ばす。この伸ばした先のノードが二重丸で表現されている。この操作を繰り返すことにより、上位Ｎ個のパスが総歪値の順に決定される。この図９は、Ｎ＝２として枝を伸ばした場合の例を示す図である。
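The backward N-best search of paragraphs 【００４４】 to 【００４６】 can be sketched as an A* search that uses the exact forward minimum as its estimate. This is an illustrative sketch, not the patent's code: the cost layout (`start_cost`, `step_cost`) is hypothetical, and for simplicity the sketch expands every predecessor of a node rather than only the N stored ones.

```python
import heapq


def n_best_paths(start_cost, step_cost, n_best):
    """Return the n_best cheapest unit sequences as (total, [unit indices]).

    start_cost[m]: cost of starting at candidate m of the first phoneme.
    step_cost[n][m][x]: cost of moving from candidate x of phoneme n to
    candidate m of phoneme n + 1 (hypothetical layout for this sketch).
    """
    # Forward pass: best[n][m] is the cheapest total from the start to (n, m).
    best = [list(start_cost)]
    for layer in step_cost:
        prev = best[-1]
        best.append([min(p + c for p, c in zip(prev, row)) for row in layer])

    last = len(best) - 1
    # Backward A*: entries are (estimate, cost so far, partial path from the end).
    heap = [(best[last][m], 0.0, (m,)) for m in range(len(best[last]))]
    heapq.heapify(heap)
    results = []
    while heap and len(results) < n_best:
        estimate, g, path = heapq.heappop(heap)
        n = last - (len(path) - 1)  # position of the leftmost unit in `path`
        if n == 0:
            # The path reaches the first phoneme, so the estimate is exact.
            results.append((estimate, list(path)))
            continue
        m = path[0]
        for x, cost in enumerate(step_cost[n - 1][m]):
            new_g = g + cost
            heapq.heappush(heap, (new_g + best[n - 1][x], new_g, (x,) + path))
    return results
```

Because the estimate equals the true remaining cost, complete paths leave the queue in ascending order of total distortion, so the first N completed paths are the N best.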
【００４７】
このようにして本実施の形態１によれば、歪の最も小さいパスを形成する音声素片を選択して、それを素片辞書に登録することができる。
【００４８】
［実施の形態２］
前述の実施の形態１では、音韻の単位としてdiphoneを用いる場合について記述したが、本発明はこれに限定されるものではなく、音素や半diphoneなどを単位としてもよい。半diphoneとは、diphoneを音素境界で２つに分割したもののことである。この半diphoneを単位とした場合のメリットについて簡単に説明する。任意のテキストを合成する場合、素片辞書２０６は全種類のdiphoneを用意しておく必要がある。これに対して、半diphoneを単位とした場合は、足りない半diphoneを別の半diphoneで代替できる。例えば、半diphoneの「/a.b.0/(diphone a.bの左側）」の代わりに「/a.n.0/」を利用しても、音質の劣化を少なくして良好に音声を再生できる。これにより、素片辞書２０６のサイズをより小さくできる。
【００４９】
［実施の形態３］
前述の実施の形態１、２では、音韻の単位としてdiphoneや音素や半diphoneを用いる場合について説明したが、本発明はこれに限定されるものではなく、これらを混合して用いてもよい。例えば、利用頻度が高い音韻については、diphoneを単位とし、利用頻度が低い音韻については、２つの半diphoneを用いて表現するようにしても良い。
【００５０】
図１０は、音声素片単位を混合した場合の一例を示した図で、ここでは音韻「o.w」がdiphoneで表され、その前後の音韻は半diphoneで表されている。
【００５１】
［実施の形態４］
実施の形態３において、元のデータベース中で連続する場所から取り出されたかどうかの情報を持ち、連続していた場合は、半diphoneの組を仮想的にdiphoneとして扱うようにしてもよい。つまり、元のデータベース中で連続するということは接続歪が“０”であるため、この場合には変形歪だけを考慮すればよいことになり計算量を大幅に軽減できる。
【００５２】
図１１は、この様子を表わした概念図である。図中の線上の数字は接続歪を表している。
【００５３】
図１１において、１１００で示される半diphoneの組は、元のデータベース中で連続する場所から取り出されたものであり、その接続歪みは“０”に一義的に決定されている。また１１０１で示された半diphoneの組は、元のデータベース中で連続する場所から取り出されたものではないため、それぞれに対して接続歪みが計算される。
【００５４】
［実施の形態５］
前述の実施の形態１では，１単位のテキストデータから得られた音韻系列全体を歪計算の対象とする場合について説明したが、本発明はこれに限定されるものでない。例えば、ポーズや無音部分までを一つの区間として音韻系列を分割し、各区間ごとに歪計算を行ってもよい。ここで言う無音部分とは、例えばp,t,kなどの無音部分のことである。ポーズや無音部分では接続歪が“０”であると考えられるため、このような分割が有効となる。これにより、各区間毎に最適な音声素片の選択が可能となる。
【００５５】
［実施の形態６］
前述の実施の形態１では、接続歪の計算にケプストラムを用いる場合について説明したが、本発明はこれに限定されるものではない。例えば、接続点の前後に亙る波形の差分の和を用いて接続歪を求めても良い。またスペクトル距離などを用いて接続歪を求めてもよい。この場合、接続点はピッチマークに同期させるのが、より好ましい。
【００５６】
［実施の形態７］
前述の実施の形態１では、接続歪の計算において、窓長、シフト長、ケプストラムの次数、フレーム数などを具体的数字を使って説明したが、本発明はこれに限定されるものではない。任意の窓長、シフト長、次数、フレーム数を使って接続歪を算出してもよい。
【００５７】
［実施の形態８］
前述の実施の形態１では、接続歪の計算にケプストラムの次数ごとに差分を取ったものの総和を用いる場合について説明したが、本発明はこれに限定されるものではない。例えば、各次数を統計的性質などを使って正規化（正規化係数ｒj）してもよい。この場合の接続歪Ｄcは、
$$D_c = \sum_{i=-2}^{2} \sum_{j=0}^{16} r_j \left| C^{pre}_{i,j} - C^{cur}_{i,j} \right|$$
となる。ここで、最初のΣはｉ＝−２〜２の総和を、次のΣはｊ＝０〜１６までの総和を示す。
【００５８】
［実施の形態９］
実施の形態１では、ケプストラムの次数ごとの差分の絶対値をベースに接続歪の算出を行なう場合について説明したが、本発明はこれに限定されるものではない。例えば、差分の絶対値の累乗（累数が偶数の場合は絶対値でなくてもよい）をベースに接続歪の算出を行なってもよい。ここで累数をＮとすると、接続歪Ｄcは、
$$D_c = \sum_{i} \sum_{j} \left| C^{pre}_{i,j} - C^{cur}_{i,j} \right|^{N}$$
となる。ここで“^N”はＮの累乗を示す。ここでＮの値を大きくすることは、大きな差分について敏感になることを意味しているので、その結果、接続歪が平均的に小さくなるように働くことになる。
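The variants of Embodiments 8 and 9 can be sketched with one function that accepts optional per-order normalization coefficients r_j and an exponent. Combining both options in a single function is a choice made for this sketch; the patent describes them separately, and the names `weights` and `power` are hypothetical.

```python
def connection_distortion_general(c_pre, c_cur, weights=None, power=1):
    """Connection distortion with optional normalization and exponent.

    weights[j]: normalization coefficient r_j for cepstral order j
                (Embodiment 8); None means every r_j is 1.
    power:      exponent applied to each absolute difference (Embodiment 9).
    """
    total = 0.0
    for pre_vec, cur_vec in zip(c_pre, c_cur):
        for j, (p, c) in enumerate(zip(pre_vec, cur_vec)):
            r_j = 1.0 if weights is None else weights[j]
            total += r_j * abs(p - c) ** power
    return total
```

With the defaults the function reduces to the plain sum of absolute differences used in Embodiment 1.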
【００５９】
［実施の形態１０］
前述の実施の形態１では、変形歪としてケプストラムを用いる場合について説明したが、本発明はこれに限定されるものではない。例えば、変形前後の一定区間の波形の差分の和を用いて変形歪を求めてもよい。また、スペクトル距離などを用いて変形歪を求めてもよい。
【００６０】
［実施の形態１１］
前述の実施の形態１では、変形歪を波形から得られる情報を基に算出する場合について説明したが、本発明はこれに限定されるものではない。例えば、PSOLAによるピッチ素片の削除および複製の回数などを変形歪を算出する要素としても良い。
【００６１】
［実施の形態１２］
前述の実施の形態１では、音声素片を読み出すごとに接続歪を計算する場合について説明したが、本発明はこれに限定されるものではない。例えば、接続歪を予め計算しておき、テーブル化して保持してもよいものとする。
【００６２】
図１２は、diphone「/a.r/」とdiphone「/r.i/」との間の接続歪を記憶したテーブルの一例を示す図である。ここでは縦軸に「/a.r/」の音声素片、横軸に「/r.i/」の音声素片をとっている。例えば、「/a.r/」の「id3」の音声素片と「/r.i/」の「id2」の音声素片との接続歪は“３.６”で表されている。このように接続可能なdiphone間の接続歪を全てテーブル化して用意することにより、音声素片同士の合成時の接続歪の算出がテーブルの参照だけで済むため、その計算量を大幅に軽減でき、算出時間を大幅に短縮できる。
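The precomputed table of Embodiment 12 can be sketched as a dictionary keyed by pairs of unit identifiers. The identifiers and the cost function passed in are placeholders assumed for this example; only the idea of replacing a calculation with a lookup comes from the text.

```python
def build_connection_table(left_units, right_units, connect_cost):
    """Precompute the connection distortion for every (left, right) unit pair."""
    return {
        (left, right): connect_cost(left, right)
        for left in left_units
        for right in right_units
    }


def lookup_connection(table, left, right):
    """Return the stored connection distortion without recomputing it."""
    return table[(left, right)]
```

In the example of FIG. 12, a lookup with the "id3" unit of /a.r/ and the "id2" unit of /r.i/ would return the stored value 3.6.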
【００６３】
［実施の形態１３］
前述の実施の形態１では、音声素片を編集する毎に変形歪を計算する場合について説明したが、本発明はこれに限定されるものではない。例えば、変形歪を予め計算しておき、テーブルとして保持しておいても良い。
【００６４】
図１３は、あるdiphoneを基本周波数と音韻時間長について変化させた場合の変形歪をテーブルで表した図である。
【００６５】
図中、μは、そのdiphoneの統計的な平均値を示し、σは標準偏差である。具体的な表の作成方法としては、次のような作成方法が考えられる。まず、基本周波数と音韻時間長に関して統計的に平均値と分散を求める。次に、それらを基に（５×５＝）２５通りの基本周波数と音韻時間長をターゲットとしてPSOLA法をそれぞれ適用し、テーブルの変形歪を一つずつ求めていけばよい。合成時は、ターゲットの基本周波数と音韻時間長が決まれば、テーブルの近傍の値で内挿（もしくは外挿）することによって、変形歪を推定することが可能である。
【００６６】
図１４は、合成時に変形歪を推定するための具体例を示した図である。
【００６７】
図中、黒丸がターゲットの基本周波数と音韻時間長であり、このとき、各格子点の変形歪がテーブルからＡ，Ｂ，Ｃ，Ｄと求まっていると仮定すると、変形歪Ｄtは、以下の式により求めることができる。
$$D_t = \{A(1 - y) + C y\}(1 - x) + \{B(1 - y) + D y\} x$$
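The interpolation formula of paragraph 【００６７】 is a bilinear interpolation between the four surrounding table entries. A minimal sketch follows, assuming that x and y are the target's fractional positions (0 to 1) inside the grid cell; which axis corresponds to fundamental frequency and which to duration is not stated in the text, so the argument names are kept neutral.

```python
def interpolate_deformation(a, b, c, d, x, y):
    """Bilinear interpolation of deformation distortion inside one grid cell.

    a, b, c, d: table values at the four grid points A, B, C, D.
    x, y:       fractional position of the target inside the cell (assumed 0..1).
    """
    return (a * (1 - y) + c * y) * (1 - x) + (b * (1 - y) + d * y) * x
```

At the corners the function returns the table values themselves: (x, y) = (0, 0) gives A, (1, 0) gives B, (0, 1) gives C, and (1, 1) gives D.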
【００６８】
［実施の形態１４］
前述の実施の形態１３では、変形歪テーブルの格子点として、そのdiphoneの統計的な平均値と標準偏差を基に５×５のテーブルを作成したが、本発明はこれに限定されるものではなく、任意の格子点を持つテーブルとしてもよい。また、格子点を平均値などに依らず決定的に与えてもよいものとする。例えば、韻律推定で推定されうる範囲を等分割するなどもよいものとする。
【００６９】
［実施の形態１５］
前述の実施の形態１では、接続歪と変形歪の重み和で歪を定量化する場合について説明したが本発明はこれに限定されるものではなく、接続歪と変形歪それぞれに閾値を設定しておき、どちらか一方でもその閾値を越えた場合はその音声素片が選択されないようにして、十分大きな歪の値を与えるようにしてもよい。
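The thresholding of Embodiment 15 can be sketched as follows. The constant `LARGE_DISTORTION` and the threshold parameters are hypothetical, and falling back to the weighted sum when neither threshold is exceeded is an assumption of this sketch; the text only says that a sufficiently large distortion is assigned when a threshold is exceeded.

```python
LARGE_DISTORTION = 1.0e9  # hypothetical "sufficiently large" value


def thresholded_distortion(d_c, d_t, w, max_d_c, max_d_t):
    """Return a prohibitive value when either distortion exceeds its threshold."""
    if d_c > max_d_c or d_t > max_d_t:
        # The unit is effectively excluded from every low-distortion path.
        return LARGE_DISTORTION
    # Assumption: otherwise fall back to the weighted sum of Embodiment 1.
    return w * d_c + (1.0 - w) * d_t
```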
【００７０】
上記実施の形態においては、各部を同一の計算機上で構成する場合について説明したが本発明はこれに限定されるものではなく、例えばネットワーク上に分散した計算機や処理装置などに分かれて各部を構成してもよい。
【００７１】
上記実施の形態においては、プログラムを制御メモリ（ＲＯＭ）に保持する場合について説明したが本発明はこれに限定されるものではなく、外部記憶など任意の記憶媒体を用いて実現してもよい。また、同様の動作をする回路で実現してもよい。
【００７２】
なお本発明は、複数の機器から構成されるシステムに適用しても、１つの機器からなる装置に適用してもよい。前述した実施の形態の機能を実現するソフトウエアのプログラムコードを記録した記録媒体を、システム或いは装置に供給し、そのシステム或いは装置のコンピュータ（またはＣＰＵやＭＰＵ）が記録媒体に格納されたプログラムコードを読み出し実行することによっても達成される。
【００７３】
この場合、記録媒体から読み出されたプログラムコード自体が前述した実施の形態の機能を実現することになり、そのプログラムコードを記録した記録媒体は本発明を構成することになる。プログラムコードを供給するための記録媒体としては、例えば、フロッピーディスク、ハードディスク、光ディスク、光磁気ディスク，ＣＤ−ＲＯＭ，ＣＤ−Ｒ，磁気テープ、不揮発性のメモリカード、ＲＯＭなどを用いることができる。
【００７４】
また、コンピュータが読み出したプログラムコードを実行することにより、前述した実施の形態の機能が実現されるだけでなく、そのプログラムコードの指示に基づき、コンピュータ上で稼働しているＯＳなどが実際の処理の一部または全部を行ない、その処理によって前述した実施の形態の機能が実現される場合も含まれる。
【００７５】
更に、記録媒体から読み出されたプログラムコードが、コンピュータに挿入された機能拡張ボードやコンピュータに接続された機能拡張ユニットに備わるメモリに書き込まれた後、そのプログラムコードの指示に基づき、その機能拡張ボードや機能拡張ユニットに備わるＣＰＵなどが実際の処理の一部または全部を行ない、その処理によって前述した実施の形態の機能が実現される場合も含まれるものとする。
【００７６】
以上説明したように本実施の形態によれば、接続歪と変形歪を考慮して素片辞書に登録する音声素片を選択することにより、少数の音声素片を登録した辞書を用いても、音質の劣化が少ない合成音声を生成できるという効果がある。
【００７７】
【発明の効果】
以上説明したように本発明によれば、接続歪や変形歪に基づく歪の影響を考慮して素片辞書に登録する音声素片を選択することによって、そのような素片辞書を用いた合成音声の質を向上できるという効果がある。
【００７８】
また本発明によれば、素片辞書に登録する音声素片の数を少なく抑えて、かつその素片辞書を用いて良好な音声を再生できるという効果がある。
【図面の簡単な説明】
【図１】本発明の実施の形態に係る音声情報処理装置のハードウェア構成を示すブロック図である。
【図２】本発明の実施の形態１に係る音声情報処理装置のモジュール構成を示すブロック図である。
【図３】本実施の形態に係るオンラインモジュールにおける処理の流れを示すフローチャートである。
【図４】本実施の形態に係るオフラインモジュールの詳細な構成を示すブロック図である。
【図５】本実施の形態１に係るオフラインモジュールにおける処理の流れを示すフローチャートである。
【図６】本発明の実施の形態に係る音声素片の変形を説明する図である。
【図７】本発明の実施の形態に係る音声素片の接続歪を説明する図である。
【図８】音声素片における歪の決定過程を説明する図である。
【図９】Ｎbestによる決定過程を説明する図である。
【図１０】本発明の実施の形態３に係る音声素片の単位をdiphoneと半diphoneとで混合した場合を説明する図である。
【図１１】本発明の実施の形態４に係る音声素片の単位を取り出した半diphoneによって混合した例を示した図である。
【図１２】本発明の実施の形態１２に係るdiphoneの /a.r/ と/r.i/間の接続歪を決定するテーブル構成例を示す図である。
【図１３】本発明の実施の形態１３に係る変形歪を表わすテーブル例を示す図である。
【図１４】本発明の実施の形態１３に係る変形歪を推定する具体例を示した図である。
[0001]
[Technical Field of the Invention]
The present invention relates to a speech information processing apparatus and method for creating a unit dictionary used in speech synthesis, and to a storage medium.
[0002]
[Prior art]
In recent years, the mainstream approach to speech synthesis has been to edit speech units by duplicating and/or deleting them in units of one pitch waveform and overlapping them at the desired pitch intervals (PSOLA: pitch-synchronous overlap-add), and then to connect the edited speech units to synthesize speech.
[0003]
[Problems to be solved by the invention]
Speech synthesized with such a technique contains distortion caused by editing the speech units (hereinafter, deformation distortion) and distortion caused by connecting the speech units (hereinafter, connection distortion). These two distortions are major causes of quality degradation in synthesized speech. In particular, when the number of speech units that can be registered in the unit dictionary is limited, there may be almost no room at synthesis time to select speech units so that these distortions are small. Above all, when only one speech unit can be registered in the unit dictionary for each phoneme environment, there is no room at all to select a speech unit that reduces the distortion, and with such a unit dictionary, degradation of synthesized speech quality due to deformation distortion and connection distortion is unavoidable.
[0004]
The present invention has been made in view of the above conventional example. Its object is to provide a speech information processing apparatus, a method thereof, and a storage medium that suppress degradation of synthesized speech quality by selecting the speech units to be registered in the unit dictionary while taking into account the effect of distortion, based on connection distortion and deformation distortion.
[0005]
[Means for Solving the Problems]
  In order to achieve the above object, the speech information processing apparatus of the present invention has the following configuration. That is, it comprises:
  distortion output means for obtaining at least one of a deformation distortion caused by deforming the speech units corresponding to the phonemes included in a phoneme sequence and a connection distortion caused by connecting the speech units corresponding to adjacent phonemes included in the phoneme sequence;
  Nbest determination means for obtaining Nbest sequences of speech units based on the distortion obtained by the distortion output means; and
  selection means for selecting the speech units to be registered in a unit dictionary for speech synthesis based on the Nbest sequences obtained by the Nbest determination means,
  wherein, when N of the Nbest sequences is plural, the Nbest determination means obtains the N Nbest sequences in ascending order of the total of the distortion over the plurality of speech units constituting the phoneme sequence, and the selection means selects, from among the speech units constituting the N Nbest sequences, the speech units that rank high in frequency as the speech units to be registered in the unit dictionary for speech synthesis.
[0007]
  In order to achieve the above object, the speech information processing method of the present invention comprises the following steps. That is,
  a speech information processing method for a speech information processing apparatus that selects speech units to be registered in a unit dictionary for speech synthesis comprises:
  a distortion output step of obtaining at least one of a deformation distortion caused by deforming the speech units corresponding to the phonemes included in a phoneme sequence and a connection distortion caused by connecting the speech units corresponding to adjacent phonemes included in the phoneme sequence;
  an Nbest determination step of obtaining Nbest sequences of speech units based on the distortion obtained in the distortion output step; and
  a selection step of selecting the speech units to be registered in the unit dictionary for speech synthesis based on the Nbest sequences obtained in the Nbest determination step,
  wherein the Nbest determination step obtains the N Nbest sequences in ascending order of the total of the distortion over the plurality of speech units constituting the phoneme sequence, and the selection step selects, from among the speech units constituting the N Nbest sequences, the speech units that rank high in frequency as the speech units to be registered in the unit dictionary for speech synthesis.
[0008]
DETAILED DESCRIPTION OF THE INVENTION
Preferred embodiments of the present invention will be described below in detail with reference to the accompanying drawings.
[0009]
[Embodiment 1]
FIG. 1 is a block diagram showing the hardware configuration of a speech information processing apparatus according to an embodiment of the present invention. This embodiment describes the case where a general-purpose personal computer is used as the speech synthesis apparatus, but the present invention may also be implemented as a dedicated speech information processing apparatus or as an apparatus of another form.
[0010]
In FIG. 1, reference numeral 101 denotes a control memory (ROM) that stores various control data used by a central processing unit (CPU) 102. The CPU 102 executes a control program stored in a RAM 103 to control the operation of the entire apparatus. Reference numeral 103 denotes a memory (RAM) that is used as a work area to temporarily hold various data while the CPU 102 executes control processing, and that stores the control program loaded from an external storage device 104 when the CPU 102 executes processing. The external storage device 104 includes, for example, a hard disk, a CD-ROM, and the like. Reference numeral 105 denotes a D/A converter that, when digital data representing a speech signal is input, converts it into an analog signal and outputs it to a speaker 109 to reproduce the speech. Reference numeral 106 denotes an input unit operated by an operator and provided with, for example, a keyboard and a pointing device such as a mouse. Reference numeral 107 denotes a display unit having a display such as a CRT or a liquid crystal display. Reference numeral 108 denotes a bus that connects these units. Reference numeral 110 denotes a speech synthesis unit.
[0011]
In the above configuration, the control program for controlling the speech synthesis unit 110 of the present embodiment is loaded from the external storage device 104 and stored in the RAM 103, and the various data used by the control program are stored in the control memory 101. These data are read into the memory 103 as needed through the bus 108 under the control of the central processing unit 102 and are used in the control processing performed by the central processing unit 102. The D/A converter 105 converts the speech waveform data (a digital signal) created by executing the control program into an analog signal and outputs it to the speaker 109.
[0012]
FIG. 2 is a block diagram showing the module configuration of the speech synthesis unit 110 according to the present embodiment. The speech synthesis unit 110 broadly consists of two modules: a unit dictionary creation module that performs the processing for registering speech units in a unit dictionary 206, and a speech synthesis module that takes text data as input and synthesizes and outputs the speech corresponding to that text data.
[0013]
In FIG. 2, reference numeral 201 denotes a text input unit that inputs arbitrary text data from the input unit 106 or the external storage device 104; 202, an analysis dictionary; 203, a language analysis unit; 204, a prosody generation rule holding unit; 205, a prosody generation unit; 206, the unit dictionary; 207, a speech unit selection unit; 208, a speech unit editing/connection unit; 209, a speech waveform output unit; 210, a speech database; 211, a unit dictionary creation unit; and 212, a text corpus. Text of various kinds can be entered into the text corpus 212 via the input unit 106 or the like.
[0014]
First, the speech synthesis module will be described. In this module, the language analysis unit 203 performs language analysis of the text input from the text input unit 201 with reference to the analysis dictionary 202. The analysis result is input to the prosody generation unit 205. Based on the analysis result of the language analysis unit 203 and the prosody generation rules held in the prosody generation rule holding unit 204, the prosody generation unit 205 generates a phoneme sequence and prosody information and outputs them to the speech unit selection unit 207 and the speech unit editing/connection unit 208. The speech unit selection unit 207 then uses the prosody generation result input from the prosody generation unit 205 to select the corresponding speech units from those held in the unit dictionary 206. The speech unit editing/connection unit 208 edits and connects the speech units output from the speech unit selection unit 207 in accordance with the prosody generation result input from the prosody generation unit 205, and generates a speech waveform. The speech waveform generated in this way is output by the speech waveform output unit 209.
[0015]
Next, the unit dictionary creation module will be described.
[0016]
In this module, the unit dictionary creation unit 211 selects a speech unit from the speech database 210 and registers it in the unit dictionary 206 based on a procedure described later.
[0017]
Next, speech synthesis processing according to the present embodiment having the above configuration will be described.
[0018]
FIG. 3 is a flowchart showing the flow of speech synthesis processing (online processing) in the speech synthesis module of FIG. 2.
[0019]
First, in step S301, the text input unit 201 inputs text data in units such as sentences, clauses, or words, and the process proceeds to step S302. In step S302, the language analysis unit 203 performs language analysis of the text data. Next, in step S303, the prosody generation unit 205 generates a phoneme sequence and prosody information based on the analysis result of step S302 and predetermined prosody rules. Next, in step S304, the speech unit selection unit 207 selects, for each phoneme, a speech unit registered in the unit dictionary 206 based on the prosody information obtained in step S303 and a predetermined phoneme environment. Next, in step S305, the speech unit editing/connection unit 208 edits and connects the speech units based on the selected speech units and the prosody information generated in step S303, and the process proceeds to step S306. In step S306, the speech waveform output unit 209 outputs the speech waveform generated by the speech unit editing/connection unit 208 as a speech signal. In this way, speech corresponding to the input text is output.
[0020]
FIG. 4 is a block diagram showing the unit dictionary creation module of FIG. 2 in more detail. Parts common to FIG. 2 are denoted by the same reference numerals, and the configuration of the unit dictionary creation unit 211, which is the characteristic feature of the present embodiment, is shown in more detail.
[0021]
In FIG. 4, reference numeral 401 denotes a text input unit; 402, a language analysis unit; 403, an analysis dictionary; 404, a prosody generation rule holding unit; 405, a prosody generation unit; 406, a speech unit search unit; 407, a speech unit holding unit; 408, a speech unit editing unit; 409, a deformation distortion determination unit; 410, a connection distortion determination unit; 411, a distortion determination unit; 412, a distortion holding unit; 413, an Nbest determination unit; 414, an Nbest holding unit; 415, a registered unit determination unit; and 416, a registered unit holding unit.
[0022]
This will be described in detail below.
[0023]
The text input unit 401 extracts text data from the text corpus 212, for example one sentence at a time, and outputs it to the language analysis unit 402. The language analysis unit 402 analyzes the text data input from the text input unit 401 with reference to the analysis dictionary 403. The prosody generation unit 405 generates a phoneme sequence based on the analysis result of the language analysis unit 402, and generates prosody information with reference to the prosody generation rules (accent patterns, declination components, pitch patterns, and the like) held by the prosody generation rule holding unit 404. The speech unit search unit 406 searches the speech database 210, for each phoneme, for speech units that take a predetermined phoneme environment into account, in accordance with the prosody information and phoneme sequence generated by the prosody generation unit 405. The speech units found in this way are temporarily held in the speech unit holding unit 407. The speech unit editing unit 408 edits the speech units held in the speech unit holding unit 407 to match the prosody information generated by the prosody generation unit 405. This editing includes processing that connects speech units to one another in accordance with the prosody information, and processing that deforms a speech unit, for example by deleting part of it, when the speech units are connected.
[0024]
The deformation distortion determination unit 409 determines the deformation distortion from the change in the acoustic features of each speech unit before and after deformation. The connection distortion determination unit 410 determines the connection distortion that arises when two speech units are connected, from the acoustic features near the end of the preceding speech unit in the phoneme sequence and the acoustic features near the start of the current speech unit. The distortion determination unit 411 determines the total distortion (also called the distortion value) for each phoneme sequence, taking into account the deformation distortion determined by the deformation distortion determination unit 409 and the connection distortion determined by the connection distortion determination unit 410. The distortion holding unit 412 holds the distortion values, determined by the distortion determination unit 411, of the paths reaching each speech unit. The Nbest determination unit 413 uses the A* (A-star) search algorithm to obtain, for each phoneme sequence, the top N optimal paths with the smallest distortion. The Nbest holding unit 414 holds the N optimal paths obtained by the Nbest determination unit 413 for each input text. The registered unit determination unit 415 selects the speech units to be registered in the unit dictionary 206, in order of frequency, from the Nbest results for each phoneme held in the Nbest holding unit 414. The registered unit holding unit 416 holds the speech units selected by the registered unit determination unit 415.
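The frequency-based selection performed by the registered unit determination unit 415 can be sketched as a simple count over the stored Nbest paths. This is an illustrative sketch only: the data layout (each path as a list of `(phoneme, unit_id)` pairs) and the parameter `units_per_phoneme` are assumptions, and keeping the top-ranked units per phoneme is one possible reading of "in order of frequency".

```python
from collections import Counter, defaultdict


def select_units(n_best_results, units_per_phoneme=1):
    """Pick the most frequently used units for each phoneme.

    n_best_results: iterable of paths; each path is a list of
    (phoneme, unit_id) pairs taken from the Nbest result of one text.
    """
    counts = defaultdict(Counter)
    for path in n_best_results:
        for phoneme, unit_id in path:
            counts[phoneme][unit_id] += 1
    return {
        phoneme: [unit for unit, _ in counter.most_common(units_per_phoneme)]
        for phoneme, counter in counts.items()
    }
```

A unit that appears in many low-distortion paths across the corpus is the one most likely to keep distortion small at synthesis time, which is why the count is taken over all Nbest paths rather than only the single best path.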
[0025]
FIG. 5 is a flowchart showing the flow of processing in the unit dictionary creation module shown in FIG. 4.
[0026]
First, in step S501, the text input unit 401 extracts text data from the text corpus 212 one sentence at a time. When no more text data can be extracted, the process proceeds to step S512, where the speech units to be registered are finally determined. If text data exists, the process proceeds to step S502, where the language analysis unit 402 performs language analysis of the input text data using the analysis dictionary 403, and then proceeds to step S503. In step S503, the prosody generation unit 405 generates prosody information and a phoneme sequence based on the prosody generation rules held by the prosody generation rule holding unit 404 and the language analysis result of step S502. Next, in step S504, the phonemes in the phoneme sequence generated in step S503 are processed one at a time. If no unprocessed phoneme remains in step S504, the process proceeds to step S511; if an unprocessed phoneme remains, the process proceeds to step S505. In step S505, the speech unit search unit 406 searches the speech database 210, for each phoneme, for speech units that satisfy the phoneme environment and the prosody rules, and stores them in the speech unit holding unit 407.
[0027]
As a specific example, when “Konchiwa” is input as text data, it is linguistically analyzed, and prosodic information including accent and intonation is generated. When diphones are used as the speech unit, this “Konchiwa” is decomposed into the following phoneme sequence.
[0028]

Here, “X” indicates voice “n”, and “/” indicates unvoiced sound.
[0029]
In step S506, the plurality of retrieved speech units are processed in order. If no unprocessed speech unit remains, the process returns to step S504 to proceed to the next phoneme. If one remains, the process proceeds to step S507 to process that speech unit of the current phoneme. In step S507, the speech unit editing unit 408 edits the speech unit using the same method as that used in the speech synthesis process described above. The editing here is, for example, processing such as the pitch-synchronous overlap-add method (PSOLA), and uses the speech unit and the prosodic information. When editing of the speech unit is completed, the process proceeds to step S508, where the deformation distortion determination unit 409 calculates the change in acoustic features before and after the deformation of the current speech unit as the deformation distortion (described in detail later). In step S509, the connection distortion determination unit 410 calculates the connection distortion between the current speech unit and every speech unit of the previous phoneme (described in detail later). In step S510, the distortion determination unit 411 determines, from the deformation distortion and the connection distortion, the distortion values of all paths reaching the current speech unit (described later). The distortion holding unit 412 holds the top N distortion values (N: the number of Nbest to be obtained) of the paths reaching the current speech unit, together with pointers to the speech units of the previous phoneme that represent those paths. The process then returns to step S506 to check whether an unprocessed speech unit remains in the current phoneme.
[0030]
In this way, all speech units of each phoneme are processed in step S506, and when all phonemes have been processed in step S504, the process proceeds to step S511. In step S511, the Nbest determination unit 413 performs an Nbest search using the A* search algorithm, obtains the top N optimal paths (also referred to as speech unit sequences), and stores them in the Nbest holding unit 414. The process then returns to step S501.
[0031]
When the processing for all texts is completed in this way, the process proceeds from step S501 to step S512, where the registered segment determination unit 415 selects, for each phoneme, the speech units that rank highest in frequency of occurrence in the Nbest results of all texts, and registers them in the segment dictionary 206. Note that the value of N in Nbest is given empirically from preliminary experiments and the like. The speech units determined in this way are registered in the segment dictionary 206 via the registered segment holding unit 416.
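The frequency-based selection of step S512 can be sketched as follows. This is only an illustrative sketch: the function name `select_units`, the representation of a path as a list of `(phoneme, unit_id)` pairs, and the parameter `top_k` (how many units to keep per phoneme) are assumptions introduced here and do not appear in the embodiment.

```python
from collections import Counter


def select_units(nbest_paths, top_k=1):
    """Pick, for each phoneme, the unit ids that occur most often in the Nbest paths.

    nbest_paths: iterable of paths pooled over all texts; each path is a list
                 of (phoneme, unit_id) pairs (hypothetical representation).
    top_k:       number of units to register per phoneme (assumed parameter).
    """
    counts = {}
    for path in nbest_paths:
        for phoneme, unit_id in path:
            counts.setdefault(phoneme, Counter())[unit_id] += 1
    # most_common() orders the unit ids by descending frequency.
    return {ph: [u for u, _ in c.most_common(top_k)] for ph, c in counts.items()}
```

For instance, if unit "id1" of "/a.r/" appears in two of three Nbest paths and "id3" in one, `select_units` with `top_k=1` would keep only "id1" for that phoneme.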
[0032]
FIG. 6 is a diagram for explaining how to obtain the deformation distortion in step S508 of FIG. 5 according to the present embodiment.
[0033]
Here, a case where the pitch interval is widened by the PSOLA method is illustrated. The arrows indicate pitch marks, and the dotted lines indicate the correspondence between the pitch segments before and after deformation. In the present embodiment, the deformation distortion is expressed based on the cepstrum distance before and after deformation of each pitch segment (also referred to as a fine segment). Specifically, first, a Hanning window 62 (window length 25.6 milliseconds) is applied centered on the pitch mark 61 of a certain pitch segment after deformation (for example, the one indicated by 60), and the pitch segment 60 is cut out together with its neighboring pitch segments. Cepstrum analysis is performed on the pitch segment 60 thus cut out. Next, a pitch segment is cut out with a Hanning window 65 of the same window length centered on the pitch mark 64 of the pitch segment 63 before deformation that corresponds to the pitch mark 61, and a cepstrum is obtained in the same manner as after deformation. The distance between the cepstra thus obtained is taken as the deformation distortion of the pitch segment 60 of interest. The sum of the deformation distortions between the pitch segments after deformation and the corresponding pitch segments before deformation, divided by the number of pitch segments Np employed in PSOLA, is taken as the deformation distortion of the speech unit. The deformation distortion thus obtained can be written as the following equation.
[0034]
$$D_t = \frac{1}{N_p} \sum_{i=1}^{N_p} \sum_{j=0}^{16} \left| C^{org}_{i,j} - C^{tar}_{i,j} \right|$$
Here, the first Σ represents the sum over i = 1 to $N_p$, and the second Σ represents the sum over j = 0 to 16. $C^{tar}_{i,j}$ represents the j-th dimension element of the cepstrum of the i-th pitch segment after deformation, and similarly, $C^{org}_{i,j}$ represents the j-th dimension element of the cepstrum of the corresponding pitch segment before deformation.
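As a minimal sketch of the equation above, the deformation distortion can be computed from two lists of cepstrum vectors. The function name `deformation_distortion` and the list-of-lists representation are assumptions for illustration; the cepstrum analysis itself (windowing at the pitch marks) is taken as already done.

```python
def deformation_distortion(c_org, c_tar):
    """Average per-pitch-segment cepstrum distance between original and deformed unit.

    c_org: cepstrum vectors of the pitch segments before deformation, already
           paired with the segments in c_tar (hypothetical input layout).
    c_tar: cepstrum vectors of the pitch segments after deformation.
    """
    if len(c_org) != len(c_tar) or not c_tar:
        raise ValueError("c_org and c_tar must be non-empty and of equal length")
    n_p = len(c_tar)  # number of pitch segments employed in PSOLA
    total = sum(
        abs(o - t)
        for org_vec, tar_vec in zip(c_org, c_tar)
        for o, t in zip(org_vec, tar_vec)
    )
    return total / n_p
```

In the embodiment each vector would have 17 elements (orders 0 to 16), but the sketch accepts any common dimension.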
[0035]
FIG. 7 is a diagram for explaining how to obtain the connection distortion in the present embodiment.
[0036]
This connection distortion indicates the distortion generated at the connection point between the speech unit of the previous phoneme and the current speech unit, and is expressed here using a cepstrum distance. Specifically, for each of the frames 70 and 71 containing the speech unit boundary (frame length: 5 ms, analysis window width: 25.6 ms), a total of five frames, consisting of that frame and the two frames before and after it, are used in the calculation. Here, the cepstrum is a 17-dimensional vector in total, from the 0th order (power) to the 16th order. The sum of the absolute values of the differences between the elements of the cepstrum vectors is taken as the connection distortion of the speech unit currently of interest. That is, as indicated by 700 in FIG. 7, let each element of the cepstrum vector at the end of the speech unit of the previous phoneme be $C^{pre}_{i,j}$ (i: frame number, where frame number 0 is the frame containing the speech unit boundary; j: element number of the vector). Likewise, as indicated by 701 in FIG. 7, let each element of the cepstrum vector at the beginning of the speech unit of interest be $C^{cur}_{i,j}$. The connection distortion Dc of the speech unit of interest is then
$$D_c = \sum_{i=-2}^{2} \sum_{j=0}^{16} \left| C^{pre}_{i,j} - C^{cur}_{i,j} \right|$$
Here, the first Σ represents the sum over i = −2 to 2, and the second Σ represents the sum over j = 0 to 16.
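The connection distortion Dc defined above can be sketched in the same way. The function name `connection_distortion` and the input layout (one cepstrum vector per frame, with the frames of both units listed in matching order) are assumptions for illustration.

```python
def connection_distortion(c_pre, c_cur):
    """Sum of absolute cepstrum differences over the frames around the boundary.

    c_pre: cepstrum vectors of the frames at the end of the previous unit.
    c_cur: cepstrum vectors of the frames at the start of the current unit,
           in the same frame order as c_pre (hypothetical input layout).
    """
    if len(c_pre) != len(c_cur):
        raise ValueError("c_pre and c_cur must contain the same number of frames")
    return sum(
        abs(p - c)
        for pre_vec, cur_vec in zip(c_pre, c_cur)
        for p, c in zip(pre_vec, cur_vec)
    )
```

In the embodiment, each argument would hold five frames (i = −2 to 2) of 17-dimensional vectors.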
[0037]
FIG. 8 illustrates a distortion determination process in the speech unit by the distortion determination unit 411 according to the present embodiment. In this embodiment, the phoneme unit is diphone.
[0038]
In the figure, each circle represents one speech unit of a phoneme, and the number in the circle indicates the minimum total distortion value of the paths reaching that speech unit. The numbers enclosed in squares indicate the distortion values between a speech unit of the previous phoneme and a speech unit of the phoneme currently of interest. The arrows indicate the correspondence between the speech units of the phoneme currently of interest and the speech units of the previous phoneme. Here, for explanation, the m-th speech unit of the n-th phoneme (the phoneme currently of interest) is denoted $P_{n,m}$. For the speech unit $P_{n,m}$, the speech units of the previous phoneme corresponding to the N smallest distortion values (N: the number of Nbest to be obtained) are taken out. If the k-th of these distortion values is $D_{n,m,k}$ and the corresponding speech unit of the previous phoneme is $PRE_{n,m,k}$, the total distortion value $S_{n,m,k}$ of the path reaching $P_{n,m}$ via $PRE_{n,m,k}$ is
$$S_{n,m,k} = S_{n-1,x,0} + D_{n,m,k} \quad (x = PRE_{n,m,k})$$
[0039]
The distortion value in the present embodiment will be described. In the present embodiment, the distortion value Dtotal (corresponding to Dn, m, k in the above description) is defined as a weighted sum of the above-described connection distortion Dc and deformation distortion Dt.
[0040]
$$D_{total} = w \times D_c + (1 - w) \times D_t \quad (0 \le w \le 1)$$
Here, the weight coefficient w is a coefficient obtained empirically, for example through preliminary experiments. When w = 0, the distortion value is determined only by the deformation distortion Dt, and when w = 1, the distortion value depends only on the connection distortion Dc.
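The weighted sum can be written directly as a small function. The name `total_distortion` is an assumption for illustration; the range check on w simply enforces the stated constraint 0 ≦ w ≦ 1.

```python
def total_distortion(d_c, d_t, w):
    """Weighted sum of connection distortion d_c and deformation distortion d_t."""
    if not 0.0 <= w <= 1.0:
        raise ValueError("w must satisfy 0 <= w <= 1")
    return w * d_c + (1.0 - w) * d_t
```

With w = 0 the function returns d_t alone, and with w = 1 it returns d_c alone, matching the two limiting cases described above.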
[0041]
The distortion holding unit 412 holds, for each speech unit $P_{n,m}$ of each phoneme, the top N distortion values $D_{n,m,k}$, the corresponding speech units $PRE_{n,m,k}$ of the previous phoneme, and the total distortion values $S_{n,m,k}$ of the paths reaching $P_{n,m}$ via $PRE_{n,m,k}$.
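The forward search that fills the distortion holding unit can be sketched as follows. This is a sketch under stated assumptions, not the embodiment's implementation: the names `forward_search`, `num_units` and `dist` are hypothetical, the lattice is addressed by integer indices, and the N entries kept per speech unit are ranked by the path total (best predecessor total plus the local distortion value), which is one possible reading of the description.

```python
import heapq


def forward_search(num_units, dist, n_best):
    """Keep, for every unit, the n_best (path_total, previous_unit) entries.

    num_units: num_units[n] is the number of candidate units of phoneme n.
    dist:      dist(n, m, x) is the distortion value of unit m of phoneme n
               when reached from unit x of phoneme n-1 (x is None for n == 0).
    Returns table[n][m], a list of (S, x) pairs sorted by ascending S.
    """
    table = []
    for n, count in enumerate(num_units):
        column = []
        for m in range(count):
            if n == 0:
                entries = [(dist(0, m, None), None)]
            else:
                # S = best total reaching predecessor x + local distortion value
                candidates = (
                    (table[n - 1][x][0][0] + dist(n, m, x), x)
                    for x in range(num_units[n - 1])
                )
                entries = heapq.nsmallest(n_best, candidates)
            column.append(entries)
        table.append(column)
    return table
```

The first entry of `table[n][m]` is the minimum total distortion of the paths reaching that unit, which is the value used as the predicted value in the subsequent backward search.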
[0042]
FIG. 8 shows an example in which the minimum total distortion of the paths reaching the speech unit $P_{n,m}$ currently of interest is “222”. The distortion value of the speech unit $P_{n,m}$ at this time is $D_{n,m,1}$ (k = 1), and the speech unit of the previous phoneme corresponding to this distortion value $D_{n,m,1}$ is $PRE_{n,m,1}$ (corresponding to $P_{n-1,m}$ 81 in FIG. 8). Reference numeral 80 denotes the path connecting the speech unit $PRE_{n,m,1}$ and the speech unit $P_{n,m}$.
[0043]
FIG. 9 illustrates the Nbest determination process.
[0044]
At the end of step S510, the top N pieces of information have been obtained for each speech unit (forward search). The Nbest determination unit 413 obtains the Nbest paths by extending branches in reverse order from the last speech unit 90 of the phoneme sequence (backward search). The node from which a branch is extended is chosen so that the sum of its predicted value (the number next to the line) and the total distortion value accumulated up to it (the individual distortion values are the numbers in the squares) is minimized. The predicted value here corresponds to the minimum distortion $S_{n,m,0}$ obtained by the forward search for the speech unit $P_{n,m}$. In this case, since the predicted value equals the distortion of the minimum path that actually reaches the left end, the A* search algorithm guarantees that the optimal paths are obtained.
[0045]
FIG. 9 is a diagram illustrating a state in which the first optimal path is determined.
[0046]
In the figure, circles indicate speech segments, the numbers in the circles are the predicted distortion values, the thick solid line is the first pass, the numbers in the squares are the distortion values, and the numbers next to the lines are the predicted distortion values. Show. Next, in order to find the second path, a node with the smallest sum of the predicted value and the total distortion value reaching it is selected from the double circle nodes, and the speech of the previous phoneme connected to it is selected. Extend branches to all of the pieces (maximum N). This extended node is represented by a double circle. By repeating this operation, the top N paths are determined in the order of the total distortion value. FIG. 9 is a diagram illustrating an example in which a branch is extended with N = 2.
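The combination of forward search and backward A* search can be sketched as below. This is an illustrative sketch only: the names `nbest_paths`, `num_units` and `dist` are hypothetical, and, unlike the embodiment, which extends branches only to the (at most N) predecessors held for each unit, this simplified version extends branches to every unit of the previous phoneme. Because the predicted value is the exact minimum forward total, complete paths are popped from the priority queue in ascending order of total distortion.

```python
import heapq


def nbest_paths(num_units, dist, n_best):
    """Return up to n_best (total_distortion, path) pairs, smallest total first.

    num_units: num_units[n] is the number of candidate units of phoneme n.
    dist:      dist(n, m, x) is the distortion value of unit m of phoneme n
               when reached from unit x of phoneme n-1 (x is None for n == 0).
    """
    last = len(num_units) - 1
    # Forward search: h[n][m] is the minimum total distortion reaching unit (n, m).
    h = [[dist(0, m, None) for m in range(num_units[0])]]
    for n in range(1, last + 1):
        h.append([
            min(h[n - 1][x] + dist(n, m, x) for x in range(num_units[n - 1]))
            for m in range(num_units[n])
        ])
    # Backward search: heap items are (predicted + accumulated, accumulated, suffix).
    heap = [(h[last][m], 0.0, (m,)) for m in range(num_units[last])]
    heapq.heapify(heap)
    results = []
    while heap and len(results) < n_best:
        f, g, suffix = heapq.heappop(heap)
        n = last - len(suffix) + 1  # phoneme index of the first unit in suffix
        if n == 0:
            results.append((f, list(suffix)))  # complete path reached the left end
            continue
        m = suffix[0]
        for x in range(num_units[n - 1]):
            g2 = g + dist(n, m, x)
            heapq.heappush(heap, (g2 + h[n - 1][x], g2, (x,) + suffix))
    return results
```

Each returned path lists one unit index per phoneme, from the first phoneme to the last.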
[0047]
In this way, according to the first embodiment, it is possible to select a speech unit that forms a path with the smallest distortion and register it in the unit dictionary.
[0048]
[Embodiment 2]
In the first embodiment, the case where diphone is used as a phoneme unit has been described. However, the present invention is not limited to this, and a phoneme or a half diphone may be used as a unit. A half-diphone is a diphone divided into two at the phoneme boundary. The advantages of using half a diphone as a unit will be briefly described. When synthesizing arbitrary text, the segment dictionary 206 needs to prepare all kinds of diphones. On the other hand, when a half diphone is used as a unit, a short half diphone can be replaced with another half diphone. For example, even if “/a.n.0/” is used instead of “/a.b.0/(left side of diphone a.b)” of a half-diphone, the sound can be reproduced satisfactorily with less deterioration in sound quality. Thereby, the size of the segment dictionary 206 can be further reduced.
[0049]
[Embodiment 3]
In the first and second embodiments described above, the case where diphone, phoneme, or half diphone is used as a phoneme unit has been described. However, the present invention is not limited to this, and a mixture of these may be used. For example, a phoneme having a high usage frequency may be expressed using diphone as a unit, and a phoneme having a low usage frequency may be expressed using two half diphones.
[0050]
FIG. 10 is a diagram showing an example of a case where speech unit units are mixed. Here, the phoneme “o.w” is represented by diphone, and the phonemes before and after the phoneme are represented by half diphone.
[0051]
[Embodiment 4]
In the third embodiment, if information is held on whether or not two half-diphones were taken from consecutive positions in the original database, a pair of consecutive half-diphones may be treated virtually as a diphone. That is, since the connection distortion is “0” when the half-diphones are consecutive in the original database, only the deformation distortion needs to be considered in this case, and the amount of calculation can be greatly reduced.
[0052]
FIG. 11 is a conceptual diagram showing this state. The numbers on the lines in the figure represent the connection distortion.
[0053]
In FIG. 11, a set of half-diphones indicated by 1100 is taken from a continuous place in the original database, and the connection distortion is uniquely determined to be “0”. In addition, since the half-diphone set indicated by 1101 is not extracted from a continuous place in the original database, the connection distortion is calculated for each.
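The shortcut of this embodiment can be sketched as a check performed before the connection distortion is calculated. The `Unit` record with the fields `utterance_id`, `start` and `end`, and the function name `connection_distortion_or_zero`, are hypothetical; the embodiment only states that information on whether the units were taken from consecutive positions in the original database is available.

```python
from typing import Callable, NamedTuple


class Unit(NamedTuple):
    utterance_id: str  # hypothetical id of the source recording in the database
    start: int         # hypothetical start position within that recording
    end: int           # hypothetical end position within that recording


def connection_distortion_or_zero(
    prev: Unit, cur: Unit, compute: Callable[[Unit, Unit], float]
) -> float:
    """Return 0 for units that were adjacent in the database, else compute."""
    if prev.utterance_id == cur.utterance_id and prev.end == cur.start:
        return 0.0  # consecutive in the original database: no connection distortion
    return compute(prev, cur)
```

The costly `compute` callback is invoked only for pairs such as those indicated by 1101, which were not taken from consecutive positions.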
[0054]
[Embodiment 5]
In the first embodiment described above, the case has been described where the entire phoneme sequence obtained from one unit of text data is subjected to distortion calculation, but the present invention is not limited to this. For example, a phoneme sequence may be divided into a pause and a silent part as one section, and distortion calculation may be performed for each section. The silent part here is a silent part such as p, t, k, for example. Since it is considered that the connection distortion is “0” in the pause or silent part, such division is effective. Thereby, it is possible to select an optimum speech segment for each section.
[0055]
[Embodiment 6]
In the first embodiment described above, the case where the cepstrum is used for the calculation of the connection distortion has been described, but the present invention is not limited to this. For example, the connection distortion may be obtained using the sum of the differences between waveforms before and after the connection point. Further, the connection distortion may be obtained using a spectral distance or the like. In this case, it is more preferable that the connection point is synchronized with the pitch mark.
[0056]
[Embodiment 7]
In the first embodiment described above, the window length, the shift length, the cepstrum order, the number of frames, and the like have been described using specific numbers in the connection distortion calculation. However, the present invention is not limited to this. The connection distortion may be calculated using an arbitrary window length, shift length, order, and number of frames.
[0057]
[Embodiment 8]
In the first embodiment described above, a case has been described in which the sum of differences obtained for each cepstrum order is used for calculating the connection distortion, but the present invention is not limited to this. For example, each order may be normalized using a statistical property (normalization coefficient rj). The connection distortion Dc in this case is
$$D_c = \sum_{i=-2}^{2} \sum_{j=0}^{16} \left( r_j \times \left| C^{pre}_{i,j} - C^{cur}_{i,j} \right| \right)$$
Here, the first Σ represents the sum over i = −2 to 2, and the second Σ represents the sum over j = 0 to 16.
[0058]
[Embodiment 9]
In the first embodiment, the case where the connection distortion is calculated based on the absolute value of the difference for each order of the cepstrum has been described. However, the present invention is not limited to this. For example, the connection distortion may be calculated based on a power of the absolute value of the difference (the absolute value may be omitted when the exponent is an even number). Here, assuming that the exponent is N, the connection distortion Dc is
$$D_c = \sum_{i=-2}^{2} \sum_{j=0}^{16} \left| C^{pre}_{i,j} - C^{cur}_{i,j} \right|^{N}$$
Here, the superscript N denotes raising to the N-th power. Increasing the value of N makes the measure more sensitive to large differences, and as a result, it works to reduce the connection distortion on average.
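The variations of Embodiments 8 and 9 can be sketched with one function that takes optional per-order weights and an exponent. Combining the two variations in a single function is an assumption made here for brevity, as are the names `generalized_connection_distortion`, `weights` and `power`; the embodiments describe the two variations separately.

```python
def generalized_connection_distortion(c_pre, c_cur, weights=None, power=1):
    """Sum over frames i and orders j of weights[j] * |c_pre[i][j] - c_cur[i][j]| ** power.

    weights: per-order normalization coefficients r_j (all 1 when omitted).
    power:   exponent applied to each absolute difference (1 gives Embodiment 1).
    """
    if len(c_pre) != len(c_cur):
        raise ValueError("c_pre and c_cur must contain the same number of frames")
    total = 0.0
    for pre_vec, cur_vec in zip(c_pre, c_cur):
        r = weights if weights is not None else [1.0] * len(pre_vec)
        for r_j, p, c in zip(r, pre_vec, cur_vec):
            total += r_j * abs(p - c) ** power
    return total
```

With the default arguments the function reduces to the plain sum of absolute differences used in the first embodiment.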
[0059]
[Embodiment 10]
In Embodiment 1 described above, the case where a cepstrum is used to calculate the deformation distortion has been described, but the present invention is not limited to this. For example, the deformation distortion may be obtained by using the sum of the waveform differences in a certain section before and after the deformation. Further, the deformation distortion may be obtained using a spectral distance or the like.
[0060]
[Embodiment 11]
In the first embodiment described above, the case where the deformation distortion is calculated based on information obtained from the waveform has been described, but the present invention is not limited to this. For example, the number of pitch pieces deleted and duplicated by PSOLA may be used as an element for calculating deformation distortion.
[0061]
[Embodiment 12]
In the above-described first embodiment, the case where the connection distortion is calculated every time the speech unit is read has been described, but the present invention is not limited to this. For example, the connection distortion may be calculated in advance and stored in a table.
[0062]
FIG. 12 is a diagram illustrating an example of a table storing connection distortions between diphone “/a.r/” and diphone “/r.i/”. Here, the speech units of “/a.r/” are taken on the vertical axis, and the speech units of “/r.i/” are taken on the horizontal axis. For example, the connection distortion between the speech unit “id3” of “/a.r/” and the speech unit “id2” of “/r.i/” is “3.6”. By preparing all connection distortions between connectable diphones as a table in this way, the connection distortion needed when synthesizing speech units can be obtained simply by referring to the table, so the amount of calculation and the calculation time can be greatly reduced.
[0063]
[Embodiment 13]
In the first embodiment described above, the case where the deformation distortion is calculated every time the speech segment is edited has been described, but the present invention is not limited to this. For example, the deformation distortion may be calculated in advance and held as a table.
[0064]
FIG. 13 is a table showing the deformation distortion when a certain diphone is changed with respect to the fundamental frequency and the phoneme duration.
[0065]
In the figure, μ represents a statistical average value of the diphone, and σ is a standard deviation. As a specific table creation method, the following creation method can be considered. First, the average value and variance are statistically obtained for the fundamental frequency and the phoneme time length. Next, based on these, (5 × 5 =) 25 different fundamental frequencies and phoneme time lengths are used as targets, and the PSOLA method is applied to determine the deformation distortion of the table one by one. At the time of synthesis, if the target fundamental frequency and the phoneme time length are determined, the deformation distortion can be estimated by interpolation (or extrapolation) with values near the table.
[0066]
FIG. 14 is a diagram showing a specific example for estimating deformation distortion at the time of synthesis.
[0067]
In the figure, the black circle represents the target fundamental frequency and phoneme duration. Assuming that the deformation distortions at the surrounding lattice points are determined as A, B, C, and D from the table, the deformation distortion Dt can be obtained by the following expression.
$$D_t = \{A(1-y) + Cy\}(1-x) + \{B(1-y) + Dy\}\,x$$
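The interpolation expression above can be sketched as follows. The function name `interpolate_distortion` is hypothetical, and x and y are assumed here to be the normalized position (between 0 and 1) of the target within the lattice cell, since their definition is given only in FIG. 14.

```python
def interpolate_distortion(a, b, c, d, x, y):
    """Bilinear interpolation of the deformation distortion inside one lattice cell.

    a, b, c, d: deformation distortions at the four surrounding lattice points.
    x, y:       position of the target inside the cell (assumed to be normalized
                to 0..1; values outside that range would extrapolate).
    """
    return (a * (1 - y) + c * y) * (1 - x) + (b * (1 - y) + d * y) * x
```

At the corners of the cell the expression returns the tabulated values themselves, for example A at x = 0, y = 0 and D at x = 1, y = 1.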
[0068]
[Embodiment 14]
In the thirteenth embodiment described above, a 5 × 5 table is created with the lattice points of the deformation distortion table based on the statistical average value and standard deviation of the diphone, but the present invention is not limited to this, and a table having arbitrary lattice points may be used. The lattice points may also be given deterministically, without depending on the average value or the like. For example, the range that can be produced by prosody estimation may be divided equally.
[0069]
[Embodiment 15]
In the first embodiment described above, the case where the distortion is quantified by the weighted sum of the connection distortion and the deformation distortion has been described. However, the present invention is not limited to this. For example, a threshold may be set for each of the connection distortion and the deformation distortion, and if either one exceeds its threshold, a sufficiently large distortion value may be given so that the speech unit is not selected.
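The threshold variation can be sketched as a wrapper around the weighted sum. The function name `thresholded_distortion`, the parameter names `max_c` and `max_t`, and the default penalty value are assumptions for illustration; the embodiment only requires that the value given be sufficiently large.

```python
def thresholded_distortion(d_c, d_t, w, max_c, max_t, large=1e9):
    """Weighted distortion, replaced by a large penalty when a threshold is exceeded.

    max_c, max_t: thresholds for connection and deformation distortion (assumed names).
    large:        penalty that effectively excludes the unit (assumed value).
    """
    if d_c > max_c or d_t > max_t:
        return large
    return w * d_c + (1.0 - w) * d_t
```

A unit whose connection or deformation distortion exceeds its threshold then never appears on a minimum-distortion path unless no alternative exists.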
[0070]
In the above embodiments, the case where each unit is configured on the same computer has been described. However, the present invention is not limited to this; for example, the units may be distributed among computers and processing devices on a network.
[0071]
In the above embodiment, the case where the program is held in the control memory (ROM) has been described, but the present invention is not limited to this, and may be realized using an arbitrary storage medium such as an external storage. Further, it may be realized by a circuit that performs the same operation.
[0072]
The present invention may be applied to a system composed of a plurality of devices or to an apparatus composed of a single device. The present invention is also achieved by supplying a recording medium, on which the program code of software that realizes the functions of the above-described embodiments is recorded, to the system or apparatus, and having the computer (or CPU or MPU) of the system or apparatus read out and execute the program code stored in the recording medium.
[0073]
In this case, the program code itself read from the recording medium realizes the functions of the above-described embodiment, and the recording medium on which the program code is recorded constitutes the present invention. As a recording medium for supplying the program code, for example, a floppy disk, a hard disk, an optical disk, a magneto-optical disk, a CD-ROM, a CD-R, a magnetic tape, a nonvolatile memory card, a ROM, or the like can be used.
[0074]
Further, the present invention covers not only the case where the functions of the above-described embodiments are realized by the computer executing the read program code, but also the case where the OS running on the computer performs part or all of the actual processing based on the instructions of the program code, and the functions of the above-described embodiments are realized by that processing.
[0075]
Further, the present invention also covers the case where, after the program code read from the recording medium is written into a memory provided in a function expansion board inserted into the computer or in a function expansion unit connected to the computer, the CPU of the function expansion board or function expansion unit performs part or all of the actual processing based on the instructions of the program code, and the functions of the above-described embodiments are realized by that processing.
[0076]
As described above, according to the present embodiment, by selecting the speech units to be registered in the segment dictionary in consideration of connection distortion and deformation distortion, synthesized speech with little deterioration in sound quality can be generated even when using a dictionary in which only a small number of speech units are registered.
[0077]
[Effect of the Invention]
As described above, according to the present invention, the speech units to be registered in the segment dictionary are selected in consideration of the influence of distortion, based on connection distortion and deformation distortion, which has the effect of improving the quality of speech synthesized using such a segment dictionary.
[0078]
Further, according to the present invention, it is possible to reduce the number of speech units registered in the unit dictionary and to reproduce good speech using the unit dictionary.
[Brief description of the drawings]
FIG. 1 is a block diagram showing a hardware configuration of a speech information processing apparatus according to an embodiment of the present invention.
FIG. 2 is a block diagram showing a module configuration of the audio information processing apparatus according to Embodiment 1 of the present invention.
FIG. 3 is a flowchart showing a flow of processing in the online module according to the present embodiment.
FIG. 4 is a block diagram showing a detailed configuration of an offline module according to the present embodiment.
FIG. 5 is a flowchart showing a flow of processing in the offline module according to the first embodiment.
FIG. 6 is a diagram for explaining a modification of a speech element according to the embodiment of the present invention.
FIG. 7 is a diagram for explaining connection distortion of a speech element according to an embodiment of the present invention.
FIG. 8 is a diagram illustrating a process of determining distortion in a speech unit.
FIG. 9 is a diagram illustrating a determination process by Nbest.
FIG. 10 is a diagram illustrating a case where units of speech units according to Embodiment 3 of the present invention are mixed in diphone and half diphone.
FIG. 11 is a diagram showing an example in which speech unit units according to Embodiment 4 of the present invention are mixed by half diphones taken out.
FIG. 12 is a diagram showing a table configuration example for determining connection distortion between /a.r/ and /r.i/ of diphone according to Embodiment 12 of the present invention.
FIG. 13 is a diagram showing an example of a table representing deformation distortion according to Embodiment 13 of the present invention.
FIG. 14 is a diagram showing a specific example for estimating deformation distortion according to Embodiment 13 of the present invention.

Claims

A speech information processing apparatus comprising: distortion output means for obtaining at least one of a deformation distortion caused by deforming a speech unit corresponding to a phoneme included in a phoneme sequence and a connection distortion caused by connecting speech units corresponding to adjacent phonemes included in the phoneme sequence;
Nbest determination means for obtaining Nbest sequences of speech units on the basis of the distortion obtained by the distortion output means; and
selection means for selecting speech units to be registered in a speech synthesis unit dictionary based on the Nbest sequences obtained by the Nbest determination means,
wherein, when N of the Nbest sequences is plural, the Nbest determination means obtains, for the distortions of the plurality of speech units constituting the phoneme sequence, the N Nbest sequences in ascending order of the sum of the distortions, and the selection means selects, from among the speech units constituting the N Nbest sequences, the speech units ranking highest in frequency as the speech units to be registered in the speech synthesis unit dictionary.

The speech information processing apparatus according to claim 1, wherein the Nbest determination means uses the A* search algorithm to obtain, for the distortions of the plurality of speech units constituting the phoneme sequence, the N Nbest sequences in ascending order of the sum of the distortions.

The speech information processing apparatus according to claim 1, wherein the distortion output means obtains the deformation distortion and the connection distortion,
and the selection means selects the speech units to be registered in the speech synthesis unit dictionary based on a weighted sum of the deformation distortion and the connection distortion.

The speech information processing apparatus according to claim 1, wherein the distortion output unit determines the connection distortion using a cepstrum distance of each speech element.

The speech information processing apparatus according to claim 1 , wherein the distortion output unit determines the deformation distortion using a cepstrum distance between a speech element before deformation and a speech element after deformation.

The audio information processing apparatus according to claim 1 , wherein the distortion output unit includes a table storing the deformation distortion, and determines the deformation distortion with reference to the table.

The audio information processing apparatus according to claim 1 , wherein the distortion output unit includes a table that stores the connection distortion, and determines the connection distortion with reference to the table.

The speech information processing apparatus according to any one of claims 1 to 7 , further comprising speech synthesis means for synthesizing text data using the speech synthesis segment dictionary.

A speech information processing method for a speech information processing apparatus that selects speech units to be registered in a speech synthesis unit dictionary, the method comprising:
a distortion output step of obtaining at least one of a deformation distortion caused by deforming a speech unit corresponding to a phoneme included in a phoneme sequence and a connection distortion caused by connecting speech units corresponding to adjacent phonemes included in the phoneme sequence;
an Nbest determination step of obtaining Nbest sequences of speech units on the basis of the distortion obtained in the distortion output step; and
a selection step of selecting speech units to be registered in the speech synthesis unit dictionary based on the Nbest sequences obtained in the Nbest determination step,
wherein the Nbest determination step obtains, for the distortions of the plurality of speech units constituting the phoneme sequence, the N Nbest sequences in ascending order of the sum of the distortions, and the selection step selects, from among the speech units constituting the N Nbest sequences, the speech units ranking highest in frequency as the speech units to be registered in the speech synthesis unit dictionary.

A computer-readable storage medium storing a program for causing a computer to execute the audio information processing method according to claim 9 .