JP3846300B2 - Recording manuscript preparation apparatus and method

Info

Publication number: JP3846300B2
Application number: JP2001382100A
Authority: JP
Inventors: 裕司平山; ゆみ堤; 賢大谷; 和人糀谷
Original assignee: Omron Corp
Current assignee: Omron Corp
Priority date: 2001-12-14
Filing date: 2001-12-14
Publication date: 2006-11-15
Anticipated expiration: 2021-12-14
Also published as: JP2003186489A

Abstract

PROBLEM TO BE SOLVED: To generate high-quality voice information data without requiring an expert in language processing.

SOLUTION: A voice information database generation system comprises: manuscript creating means 5 and 5A, which create a recording manuscript covering all voice units in an original manuscript; recording managing means 6 and 6A, which store a spoken voice as voice waveform data in a voice waveform database 12 and generate instruction information to be given to a speaker; labeling means 8 and 8A, which generate label information (a label representing a voice unit plus time information indicating its section) by matching the recording manuscript against the voice waveform data, and which correct or invalidate the time information of the generated label information; and feature quantity extracting means 14, which generates feature quantities from the voice waveform data, generates index information including the feature quantities and the label information, and stores it in a voice information database 15 in association with the voice waveform data.

COPYRIGHT: (C)2003,JPO

Description

【０００１】
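【Technical Field】
This invention relates to a voice information database creation apparatus in the field of spoken dialogue, mainly for speech synthesis, and further to a recording manuscript creation apparatus, a recording management apparatus, and a labeling apparatus, each positioned as part of that system, and to the corresponding methods.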
【０００２】
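Conventionally, creating a voice information database required people with specialized knowledge and skills in language processing, recording, labeling of voice waveforms, and similar fields, who spent considerable time and effort on the work. As a result, only particular companies and institutions that employed, or could assemble, experts in each field were able to create such databases. The cost of creating a voice information database was also enormous in terms of the skills and time required.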
【０００３】
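This problem has also kept the strength of waveform-concatenation speech synthesis from being fully exploited, and has been a factor preventing speech synthesis from spreading widely. That strength is that, once a voice information database is created from a desired voice, synthesized speech with a natural voice quality is obtained, as if that person were speaking.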
【０００４】
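【Disclosure of the Invention】
This invention provides an apparatus and a method for creating, from an original manuscript containing all the character strings to be produced by speech synthesis, a recording manuscript that holds the minimum set of character strings whose voice information should be stored in the voice information database.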
【０００５】
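This invention also provides an apparatus and a method that, when a voice information database already exists that was created from the voice waveforms obtained by a speaker reading the above recording manuscript aloud, create an additional recording manuscript holding the minimum set of character strings for an additional manuscript containing character strings to be added.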
【０００６】
This invention also provides a recording management apparatus and method that assist a speaker in reading the recording manuscript aloud.
【０００７】
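This invention further provides a labeling apparatus and method that use the voice waveform obtained by recording and the corresponding character string to create label information for the voice waveform (consisting of the labels of the voice units and their time information), and that can raise the reliability of the created label information.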
【０００８】
Ultimately, this invention provides a system with which even a person without specialized knowledge can create a voice information database relatively easily.
【０００９】
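The recording manuscript creation apparatus according to this invention comprises: means for setting an original manuscript containing a plurality of character strings; original manuscript analysis means for extracting all the voice units that make up the character strings in the original manuscript; and first character string selection means for creating a recording manuscript by selecting character strings from the original manuscript so that all the voice units extracted by the original manuscript analysis means are included.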
【００１０】
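The original manuscript setting means may accept a manually entered original manuscript, or may be a reader that reads one stored on a recording medium such as an FD (floppy disk). A character string is a concept that covers words, phrases, clauses, and sentences. In either case, the original manuscript contains a plurality of character strings. The first character string selection means selects character strings so that the number selected from the original manuscript is minimal (or as small as possible).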
【００１１】
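In an embodiment of this invention for minimizing the character strings selected from the original manuscript and added to the recording manuscript, the original manuscript analysis means detects, for every voice unit making up the character strings in the original manuscript, the number of times it appears in the original manuscript, and the character string selection means selects character strings in order, starting from those containing the voice units with the fewest appearances, until all the voice units are covered.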
【００１２】
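In this way, according to this invention, an operator without specialized knowledge of language processing can create, from the original manuscript, the recording manuscript that a speaker must read aloud in order to create the voice information database. Moreover, the recording manuscript contains only a minimal (or near-minimal) set of character strings.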
【００１３】
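In one embodiment, the character string selection means selects character strings under the condition that a given specification for the voice information database to be created is satisfied.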
【００１４】
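Here, the specification covers the quality level of the synthesized speech, the capacity of the voice information database, the time for creating the voice information database including the recording work, and so on. The operation of the character string selection means is controlled so as to satisfy the requirements derived from these specifications.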
【００１５】
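This invention further provides an additional recording manuscript creation apparatus. It may be attached to the recording manuscript creation apparatus described above or may be independent of it.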
【００１６】
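The additional recording manuscript creation apparatus comprises: voice information database analysis means for extracting all first voice units contained in an existing voice information database; additional original manuscript analysis means for extracting all second voice units making up the character strings in an additional original manuscript; comparison means for detecting those second voice units that are not included in the first voice units; and second character string selection means for creating an additional recording manuscript by selecting, from the additional original manuscript, character strings containing the voice units detected by the comparison means.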
【００１７】
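The data of the already created voice information database is thus used effectively, and voice information data capable of synthesizing the character strings of the additional manuscript can be created from an additional recording manuscript containing a minimal (or near-minimal) set of character strings.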
【００１８】
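The recording manuscript creation apparatus (including the additional recording manuscript creation apparatus) is generally positioned as part of a voice information database creation apparatus, and both are realized by a computer system. The program for recording manuscript creation that controls this computer system causes the computer to extract all the voice units making up the character strings in a given original manuscript, detect the number of appearances of each of those voice units in the original manuscript, and select character strings from the original manuscript and add them to the recording manuscript, in order starting from those containing the voice units with the fewest appearances, until all the extracted voice units are covered.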
【００１９】
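Further, the recording manuscript creation method according to this invention extracts all the voice units making up the character strings in a given original manuscript, detects the number of appearances of each of those voice units in the original manuscript, and selects character strings from the original manuscript and adds them to the recording manuscript, in order starting from those containing the voice units with the fewest appearances, until all the extracted voice units are covered.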
【００２０】
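The additional recording manuscript creation method according to this invention extracts all first voice units contained in an existing voice information database, extracts all second voice units making up the character strings in an additional original manuscript, detects those second voice units that are not included in the first voice units, and creates an additional recording manuscript by selecting, from the additional original manuscript, character strings containing the detected voice units.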
【００２１】
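The recording management apparatus according to this invention comprises: a display device that sequentially displays the plurality of character strings contained in a recording manuscript, one for each given display trigger; recording means that temporarily stores the voice signal input by the speaker for the character string shown on the display device; voice signal analysis means that analyzes the voice signal; voice acceptance judgment means that judges whether to accept the voice on the basis of the analysis result of the voice signal analysis means and, when it judges acceptance, performs control so that the voice signal temporarily stored in the recording means is stored in a voice waveform database and gives a display trigger to the display device; and speaker management means that creates instruction information to be given to the speaker on the basis of the analysis result of the voice signal analysis means or the judgment result of the voice acceptance judgment means.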
【００２２】
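The display device shows the character string that the speaker should read aloud (pronounce), so the speaker simply speaks according to the display. Instruction information based on the voice analysis result is given to the speaker, who simply acts on it. The instruction information includes the voice analysis results (for example, pitch, volume, and speaking rate), cautions concerning them, rest instructions, and the like. In addition, each utterance is judged for acceptance, and the speaker's voice is stored in the database only when it is accepted, so good-quality voice waveform data can be secured. Because the apparatus decides, from the voice analysis results and their history, whether to register the speaker's reading of the recording manuscript in the database, and feeds instruction information about the utterance back to the speaker, the speaker can carry out the recording work alone, without a recording director in attendance, and still collect high-quality voice data.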
【００２３】
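In a preferred embodiment, the apparatus further includes speech synthesis means that creates a synthesized speech signal representing the character string shown on the display device, and voice output means that outputs the synthesized speech signal created by the speech synthesis means.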
【００２４】
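Outputting a standard voice as a model of the appropriate reading prevents the speaker from reading the character strings of the recording manuscript inappropriately and improves the quality of the recorded voice.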
【００２５】
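The recording management method according to this invention sequentially displays the plurality of character strings contained in a recording manuscript, one for each given display trigger; temporarily stores the voice signal input by the speaker for the displayed character string; analyzes the voice signal; judges whether to accept the voice on the basis of the analysis result; when acceptance is judged, stores the temporarily stored voice signal in a voice waveform database and generates a display trigger; and creates and outputs instruction information to be given to the speaker on the basis of the analysis result of the voice signal or the acceptance judgment result.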
【００２６】
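The program for recording management according to this invention controls a computer so as to sequentially display the plurality of character strings contained in a recording manuscript, one for each given display trigger; analyze the voice signal input by the speaker for the character string shown on the display device; judge whether to accept the voice on the basis of the analysis result; when acceptance is judged, store the temporarily stored voice signal in a voice waveform database and generate a display trigger; and create instruction information to be given to the speaker on the basis of the analysis result of the voice signal or the acceptance judgment result.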
【００２７】
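The labeling apparatus according to this invention comprises: first labeling means that, by matching a character string in the recording manuscript against the voice waveform data obtained by pronouncing that character string, divides the voice waveform data into voice units and creates first label information containing labels representing the voice units and time information representing their boundaries; and labeling error removal means that corrects or invalidates the time information in the first label information created by the first labeling means.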
【００２８】
In one embodiment, the labeling error removal means corrects the time information on the basis of correction rules provided for each voice unit.
【００２９】
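In another embodiment, the labeling error removal means calculates the difference between the time information contained in second label information, created for the character string in the recording manuscript by second labeling means different from the first labeling means, and the corresponding time information in the first label information, and attaches invalidation information to that time information when the difference exceeds a predetermined value.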
【００３０】
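In yet another embodiment, the labeling error removal means uses a statistical method on already created label information to build a confidence interval of duration for each voice unit, compares the duration of each voice unit derived from the time information in the first label information with the corresponding confidence interval, and, when a duration falls outside its confidence interval, attaches invalidation information to the time information that produced that duration.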
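The confidence-interval check above can be sketched as follows. This is a minimal illustration, not the patent's procedure: the text only says "a statistical method", so the interval used here (mean plus or minus `k` standard deviations), the parameter `k`, and all sample durations are assumptions.

```python
from statistics import mean, stdev

def duration_intervals(existing, k=2.0):
    """Build a (low, high) duration interval per label from existing durations.

    The mean +/- k * stdev form is an assumption made for illustration.
    """
    intervals = {}
    for label, durations in existing.items():
        if len(durations) >= 2:            # stdev needs at least two samples
            m, s = mean(durations), stdev(durations)
            intervals[label] = (m - k * s, m + k * s)
    return intervals

def flag_invalid(labels, intervals):
    """labels: list of (label, end_time). Returns (label, end_time, valid) triples."""
    result, start = [], 0.0
    for label, end in labels:
        duration = end - start             # each unit starts where the previous one ended
        low, high = intervals.get(label, (float("-inf"), float("inf")))
        result.append((label, end, low <= duration <= high))
        start = end
    return result

# Hypothetical durations (seconds) taken from already created label information.
EXISTING = {"s": [0.05, 0.06, 0.055, 0.045], "a": [0.08, 0.09, 0.10, 0.07, 0.085]}
checked = flag_invalid([("s", 0.05), ("a", 0.30)], duration_intervals(EXISTING))
flags = [valid for _label, _end, valid in checked]
```

Here the second unit lasts 0.25 s, far outside the interval learned for "a", so it is the one that would receive invalidation information.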
【００３１】
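According to this invention, the time information of the created label information is checked for appropriateness and is corrected or invalidated as necessary, so the label information finally obtained is highly reliable.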
【００３２】
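The labeling method according to this invention, by matching a character string in the recording manuscript against the voice waveform data obtained by pronouncing that character string, divides the voice waveform data into voice units, creates label information containing labels representing the voice units and time information representing their boundaries, and corrects or invalidates the time information in the created label information.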
【００３３】
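The program for labeling according to this invention controls a computer so as to divide the voice waveform data into voice units by matching a character string in the recording manuscript against the voice waveform data obtained by pronouncing that character string, create label information containing labels representing the voice units and time information representing their boundaries, and correct or invalidate the time information in the created label information.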
【００３４】
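The voice information database creation system according to this invention comprises:
means for creating a recording manuscript by analyzing the voice units of the character strings in an original manuscript, which contains the character strings to be produced by speech synthesis, and selecting the character strings whose voice information should be stored in the voice information database so that all voice units are covered with as few character strings as possible;
a display device that sequentially displays the plurality of character strings contained in the recording manuscript created by the recording manuscript creation means, one for each given display trigger;
recording means that temporarily stores the voice signal input by the speaker for the character string shown on the display device;
a recording management apparatus that analyzes the voice signal, judges whether to accept the voice on the basis of the analysis result, performs control, when acceptance is judged, so that the voice signal temporarily stored in the recording means is stored in a voice waveform database, gives a display trigger to the display device, and creates instruction information to be given to the speaker on the basis of the analysis result or the acceptance judgment result;
a labeling apparatus that, by matching the character strings in the recording manuscript created by the recording manuscript creation means against the voice waveform data stored in the voice waveform database, divides the voice waveform data into voice units, creates first label information containing labels representing the voice units and time information representing their boundaries, and corrects or invalidates the time information in the created label information;
feature quantity creation means that creates feature quantities from the voice waveforms stored in the voice waveform database; and
voice information database creation means that stores the voice waveform data held in the voice waveform database in association with index information containing the label information created by the labeling apparatus and the feature quantities created by the feature quantity creation means.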
【００３５】
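The voice information database creation method according to this invention: creates a recording manuscript by analyzing the voice units of the character strings in an original manuscript, which contains the character strings to be produced by speech synthesis, and selecting the character strings whose voice information should be stored in the voice information database so that all voice units are covered with as few character strings as possible; sequentially displays the plurality of character strings contained in the created recording manuscript on a display device, one for each given display trigger; temporarily stores the voice signal input by the speaker for the displayed character string; analyzes the voice signal and judges whether to accept the voice on the basis of the analysis result; when acceptance is judged, performs control so that the temporarily stored voice signal is stored in a voice waveform database and gives a display trigger to the display device; creates instruction information to be given to the speaker on the basis of the analysis result or the acceptance judgment result; divides the voice waveform data into voice units by matching the character strings in the created recording manuscript against the voice waveform data stored in the voice waveform database, creates label information containing labels representing the voice units and time information representing their boundaries, and corrects or invalidates the time information in the created label information; creates feature quantities from the voice waveforms stored in the voice waveform database; and stores the voice waveform data held in the voice waveform database in the voice information database in association with index information containing the created label information and the created feature quantities.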
【００３６】
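According to this invention, even an ordinary user without specialized knowledge can create a relatively high-quality voice information database relatively easily and in a relatively short time. In waveform-concatenation speech synthesis, ordinary users can therefore readily create natural synthesized speech in a desired voice, and waveform-concatenation speech synthesis can be expected to spread widely.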
【００３７】
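This invention is aimed in particular at creating voice information databases for waveform-concatenation speech synthesis, but it can also be applied to creating databases for other synthesis methods (such as waveform superposition). Beyond speech synthesis, a voice database created by this invention can also be used as training data for statistical acoustic models (HMMs) for speech recognition or as sample data for speech analysis.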
【００３８】
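【Embodiments】
(1) Waveform-concatenation speech synthesis
Waveform-concatenation speech synthesis prepares voice waveform data for many words, phrases, clauses, and sentences in advance, cuts the required portions out of that data (a cut-out portion is called a waveform segment), and joins several waveform segments together to create the voice waveform of synthesized speech representing a new word, phrase, clause, or sentence. The voice waveform data prepared in advance is called source waveform data. As described later, index information accompanies the source waveform data, and the set of source waveform data and index information (called waveform information) is stored in the voice information database. The unit in which the required portions are cut from the source waveform data for synthesis is the voice unit.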
【００３９】
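In this specification, voice units include words, syllables, phonemes, and divided phonemes. A word expresses one coherent meaning and is the smallest unit of language that has a grammatical function. For example, in the sentence "ねこが寝る" (neko ga neru), "ねこ (neko)", "が (ga)", and "寝る (neru)" are each words. A syllable is a linguistic unit of pronunciation, for example "ね (ne)" or "こ (ko)". In Japanese, each kana character corresponds to a syllable, and there are roughly 100 to 300 kinds. A syllable consists of one or more phonemes. A phoneme is the basic minimum unit of speech, for example "n", "e", "k", or "o". Phonemes are classified into vowels (denoted V) and consonants (denoted C). Japanese has five vowels (a, i, u, e, o) and about 20 consonants (n, k, s, t, m, r, and so on). A divided phoneme is a phoneme divided further; the number of parts does not matter. The phoneme is the voice unit most commonly used in waveform-concatenation speech synthesis. The syllable is also a commonly used voice unit.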
【００４０】
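Based on the above, a "voice unit" is defined as follows: a voice unit is a single divided phoneme, or a run of consecutive divided phonemes, obtained by dividing phonemes (vowels or consonants). In other words, every voice unit consists of one divided phoneme or several consecutive divided phonemes.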
【００４１】
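In waveform-concatenation speech synthesis, voice units that take the phonemic environment into account, such as VCV segments and CVC segments, are also commonly used in addition to syllables and phonemes. A voice unit that takes the phonemic environment into account distinguishes kinds of a given voice unit according to the voice units before and after it (both or either). The two kinds named above (VCV and CVC segments) each consist of three consecutive phonemes, but many other such units exist, including ones made of one or more consecutive syllables and ones made of one or more consecutive divided phonemes. A VCV segment treats three consecutive phonemes (vowel, consonant, vowel) as one voice unit, for example "e-k-o" or "o-g-a"; there are about 700 to 800 kinds. A CVC segment treats three consecutive phonemes (consonant, vowel, consonant) as one voice unit, for example "n-e-k" or "k-o-g"; there are about 5000 to 6000 kinds.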
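As a small illustration of VCV and CVC segments, the sketch below slides a three-phoneme window over a phoneme sequence and classifies each triple. It assumes one-letter phoneme symbols and the five Japanese vowels; the function name `context_units` is hypothetical.

```python
VOWELS = set("aiueo")

def context_units(phonemes):
    """Return (vcv, cvc): the three-phoneme units found in a phoneme sequence."""
    vcv, cvc = [], []
    for a, b, c in zip(phonemes, phonemes[1:], phonemes[2:]):
        # Build a pattern such as "VCV" from the three phonemes in the window.
        pattern = "".join("V" if p in VOWELS else "C" for p in (a, b, c))
        unit = f"{a}-{b}-{c}"
        if pattern == "VCV":
            vcv.append(unit)
        elif pattern == "CVC":
            cvc.append(unit)
    return vcv, cvc

# "neko ga" as a phoneme sequence: n, e, k, o, g, a
vcv_units, cvc_units = context_units(list("nekoga"))
```

For this input the VCV list holds "e-k-o" and "o-g-a" and the CVC list holds "n-e-k" and "k-o-g", matching the examples in the text.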
【００４２】
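Fig. 1 shows a voice waveform with phoneme, syllable, and word boundaries marked in correspondence with the waveform. Fig. 2 shows voice units that take the phonemic environment into account, in correspondence with a voice waveform.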
【００４３】
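A voice waveform represents, as a change over time, the compression and rarefaction of air caused by air vibration (sound). In waveform figures such as Figs. 1 and 2, the horizontal axis is time and the vertical axis is air density. On a computer, a voice waveform is normally handled as a voice waveform file of time-series data obtained by sampling, and the file is recorded (saved), written, read, and so on. The start point, end point, and duration of each voice unit can be expressed as elapsed time from the start of the voice waveform data.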
【００４４】
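In Fig. 1, the section from the start of the voice waveform to the onset of sound is given the label "pau", indicating a pause (silence) (labels are explained later), and the waveform is divided into the voice units (phonemes) "t", "a", "n", "a", "k", "a". Below the waveform, the phoneme boundaries, syllable boundaries, and word boundaries are shown in correspondence with the waveform.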
【００４５】
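Fig. 2 shows the voice waveform of a VCV-type voice unit centered on the consonant "k", with the vowels "a" before and after it taken into account as the phonemic environment. The upper row shows the context-dependent voice unit at the phoneme level, and the lower row shows it at the divided-phoneme level (here, each phoneme is divided into two). In the lower row of Fig. 2, the first half of the phoneme "a" is written "a|" and the second half "|a".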
【００４６】
Fig. 3 shows the relationship between a voice waveform and label information.
【００４７】
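Label information is provided for each voice unit obtained when a voice waveform is divided into voice units (that is, for each voice unit making up the waveform). It consists of a code for the voice unit, called a label (for example, alphabetic symbols such as n, e, k, o when the voice unit is a phoneme, or kana such as ne, ko when it is a syllable), and the time position of that voice unit in the voice waveform (simply called time information). The time information indicates where the voice unit ends (its end point) or where it begins (its start point).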
【００４８】
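On a computer, label information is handled as a text file listing, in time order, pairs of a label (written in alphabetic symbols) for each voice unit and the time information of its end point. In this case the start point of each voice unit equals the end point of the preceding unit, and its duration is the difference between the end time of the preceding unit and its own end time. The section from the start of the voice waveform file to the onset of sound is given the label "pau", indicating a pause (silence). Because it is difficult to start and stop recording exactly at the onset and end of the sound, the beginning and end of a voice waveform normally contain pauses. In Fig. 3, holding the end points of the voice units (0.160, 0.250, and so on) as time information requires the end point of the leading pause (0.120) to mark where the sound starts. Conversely, the end of the sound equals the end point of the last voice unit, so no time information is needed for the end of the trailing pause.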
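A label file of this kind can be read and turned into start points and durations with a few lines of code. The sketch below assumes a simple "label end_time" line format, which the text does not specify; the end times 0.120, 0.160, and 0.250 come from the description of Fig. 3, while the labels "n" and "e" attached to them are placeholders.

```python
def parse_labels(text):
    """Parse lines of '<label> <end_time>' into a list of (label, end) pairs."""
    pairs = []
    for line in text.strip().splitlines():
        label, end = line.split()
        pairs.append((label, float(end)))
    return pairs

def with_durations(pairs):
    """Attach start and duration: each unit starts where the previous one ended."""
    rows, start = [], 0.0
    for label, end in pairs:
        rows.append((label, start, end, round(end - start, 3)))
        start = end
    return rows

# Assumed file layout; the leading 'pau' entry marks where the sound begins.
LABEL_FILE = """
pau 0.120
n 0.160
e 0.250
"""
rows = with_durations(parse_labels(LABEL_FILE))
```

Each row is (label, start, end, duration), so the first voiced unit starts at the end of the leading pause, 0.120 s.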
【００４９】
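As described above, the voice information database stores waveform information for a plurality of voice waveforms. Waveform information consists of voice waveform data and index information. Index information describes, for each voice waveform (source waveform), the label information and the feature quantities of the waveform for each voice unit that makes up that waveform.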
【００５０】
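The feature quantities comprise phonological features and prosodic features of the voice waveform (per voice unit). Phonological features include the cepstrum and vector-quantized data. The cepstrum is the inverse Fourier transform of the logarithm of the short-time amplitude spectrum of the voice waveform. Vector-quantized data represents a vector of several parameter values of the voice waveform by the code of a representative vector. Prosodic features include the fundamental frequency, the power, and the duration described above. The fundamental frequency is the frequency at which the vocal cords, the sound source, vibrate, and is an index of the height (pitch) of the voice: the higher the fundamental frequency, the higher the voice. Power is the amplitude of the voice waveform and corresponds to loudness. Duration is, in other words, the time length of the voice waveform corresponding to a voice unit. A short duration (for a single voice waveform, a short average duration) indicates a fast speaking rate.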
【００５１】
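Fig. 4 shows waveform-concatenation speech synthesis using waveform information (sets of voice waveform data and index information). To create the voice waveform of synthesized speech pronouncing "さかた" (sakata), two source waveforms are used: a voice waveform of the utterance "さとう" (satoo), called voice 1, and a voice waveform of the utterance "たなか" (tanaka), called voice 2. The index information stored in the voice information database containing these source waveforms is shown on the left of Fig. 4. For each voice (including voices 1 and 2), the index information contains the label and start point of each voice unit (here, phoneme) making up the waveform (the label information), together with its length (time length), height (frequency), and magnitude (amplitude) (the waveform feature quantities).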
【００５２】
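When the character string "さかた" representing the synthesized speech to be created is given, the index information is consulted to select the voice units needed to synthesize the voice waveform of "sakata". "s" and "a" are selected from voice 1, and "t" and "a" as well as "k" and "a" are selected from voice 2.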
【００５３】
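The waveform segment corresponding to each selected voice unit is cut from its source waveform on the basis of the start point and length described in the index information. The segments for "s" and "a" are cut from the source waveform of voice 1, and the segments for "t" and "a" and for "k" and "a" are cut from the source waveform of voice 2. These segments are then joined (synthesized) in the order "s", "a", "k", "a", "t", "a".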
【００５４】
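Because the waveform segments cut from the source waveforms are joined in the given order without any signal processing, the voice waveform of the synthesized speech can be created without degrading sound quality.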
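The cut-and-join step for "sakata" can be sketched as below. This is an illustration under stated assumptions: waveforms are plain lists of sample values, the index stores start and length in samples rather than seconds, and every number is made up (the real values would come from Fig. 4).

```python
# Hypothetical source waveforms: plain lists of sample values.
WAVEFORMS = {
    "voice1": list(range(0, 50)),      # "satoo"
    "voice2": list(range(100, 160)),   # "tanaka"
}
# Hypothetical index information: (label, start sample, length in samples).
INDEX = {
    "voice1": [("s", 0, 10), ("a", 10, 10), ("t", 20, 10), ("o", 30, 10), ("o", 40, 10)],
    "voice2": [("t", 0, 10), ("a", 10, 10), ("n", 20, 10),
               ("a", 30, 10), ("k", 40, 10), ("a", 50, 10)],
}

def cut(source, position):
    """Cut one waveform segment using the start and length held in the index."""
    _label, start, length = INDEX[source][position]
    return WAVEFORMS[source][start:start + length]

def concatenate(selection):
    """Join the selected segments in order, with no signal processing."""
    out = []
    for source, position in selection:
        out.extend(cut(source, position))
    return out

# "sakata": s, a from voice 1; then k, a and t, a from voice 2.
SELECTION = [("voice1", 0), ("voice1", 1),
             ("voice2", 4), ("voice2", 5),
             ("voice2", 0), ("voice2", 1)]
sakata = concatenate(SELECTION)
```

The samples are copied unchanged, which mirrors the point made in the text: the segments are only reordered and joined.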
【００５５】
Fig. 5 shows the flow of waveform-concatenation speech synthesis processing.
【００５６】
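A character string representing the pronunciation (utterance) to be produced by speech synthesis is given. This input character string is converted into a label sequence of voice units. For example, for Japanese input consisting of a sentence in mixed kanji and kana, the sentence is divided into words, several words are grouped and accent positions are determined, and the lengths of the pauses to be inserted between word groups are determined. The label sequence of voice units may also be input directly.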
【００５７】
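Prosody prediction processing 92 predicts the prosodic features of each voice unit from the voice unit label sequence. Specifically, it uses the patterns of pitch, strength, and length extracted for each voice unit by the feature quantity extraction performed when the voice information was created. The prosodic features may also be specified and input directly.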
【００５８】
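Voice unit selection processing 93 selects, from the voice information database 97, voice units that match the labels of the voice unit label sequence. When several voice units match, the index information of the voice information database is consulted and the voice unit whose prosodic features match best is chosen.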
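One way to picture this selection step is as a nearest-match search over the index entries that carry the wanted label. The sketch below is only an illustration: the text does not define how "best match" is measured, so the cost function (sum of relative differences in duration, fundamental frequency, and power), the field names, and the data are all assumptions.

```python
def select_unit(label, target, index):
    """Pick the index entry with the given label whose prosody is closest to target."""
    candidates = [u for u in index if u["label"] == label]
    if not candidates:
        return None  # a unit absent from the database cannot be synthesized

    def cost(unit):
        # Assumed distance: sum of relative differences over three features.
        return sum(abs(unit[k] - target[k]) / target[k]
                   for k in ("duration", "f0", "power"))

    return min(candidates, key=cost)

# Hypothetical index entries for the phoneme "a" from two source waveforms.
INDEX = [
    {"label": "a", "source": "voice1", "duration": 0.09, "f0": 120.0, "power": 0.5},
    {"label": "a", "source": "voice2", "duration": 0.07, "f0": 150.0, "power": 0.6},
]
# Hypothetical output of prosody prediction for the unit to be synthesized.
TARGET = {"duration": 0.08, "f0": 145.0, "power": 0.6}
best = select_unit("a", TARGET, INDEX)
```

With these numbers the entry from voice 2 wins, because its pitch and power are much closer to the predicted values.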
【００５９】
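Waveform concatenation processing 94 consults the index information of the selected voice units, cuts the waveform segment corresponding to each voice unit from the source waveform data (as is, without signal processing), and joins the segments in the order of the voice unit label sequence.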
【００６０】
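Voice output processing 95 sends the voice waveform of the synthesized speech produced by concatenation to an audio device 96 (for example, a speaker) and outputs the sound.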
【００６１】
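Because waveform-concatenation speech synthesis applies no signal processing to the voice waveform data, it has the following advantages.
・There is no degradation of sound quality from signal processing. In general, applying signal processing to a voice waveform degrades sound quality, for example by making the voice unnatural.
・Synthesized speech retains the vocal characteristics of the original voice waveform data. Synthesized speech with the same vocal characteristics as a particular person, such as an announcer or a celebrity, can be created.
・The voice of the synthesized speech can be changed freely by swapping the voice information database.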
【００６２】
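On the other hand, because synthesized speech is created from voice waveforms prepared in advance, the following points must be considered.
・Voice waveform data (source waveform data) must be prepared that contains every sound to be synthesized, while the amount of source waveform data must not become too large. Sounds not prepared as source waveforms cannot be synthesized, and if the source waveform data grows too large it will not fit in the voice information database.
・Source waveform data of sufficiently good sound quality must be prepared, with no variation in quality across the data.
・Information describing the content of the source waveforms (index information) must be created so that the required portions can be located in and cut from the source waveform data.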
【００６３】
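(2) First embodiment
Fig. 6 is a block diagram showing the hardware configuration of the voice information database creation system. Most typically, this system can be realized by a so-called personal computer or workstation and its peripherals, but it may of course have a hardware architecture dedicated to the voice information database creation system.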
【００６４】
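The voice information database creation system includes an arithmetic unit (CPU) 20, a work memory (RAM) 21, a communication I/F unit 22, an input I/F unit 23, an output I/F unit 24, a database 25, a screen data memory 26, a processing program memory 27, an input device 28, an output device 29, and a synthesized speech output device 30.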
【００６５】
The arithmetic unit 20 executes the programs for voice information database creation processing and other system management processing.
【００６６】
The work memory 21 stores input/output data and intermediate data of the voice information database creation processing.
【００６７】
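The communication I/F unit 22 is used when connecting hardware such as input/output devices, or for communicating with external equipment directly or over a network, and performs noise removal, synchronization, and the like. Any network suited to the application may be used.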
【００６８】
The database 25 stores the various databases created in the voice information database creation system (described in detail later).
【００６９】
The screen data memory 26 holds the screen data output to the screen display device included in the output device.
【００７０】
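The processing program memory 27 stores the various executable programs (including the OS) for voice information database creation processing (these programs are described in detail later). The memories described above are realized by semiconductor memory, magnetic disks, optical disks, magneto-optical disks, or other storage media.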
【００７１】
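The input device 28 is used by the operator to enter information into the voice information database creation system. It includes, for example, a keyboard, a mouse, a microphone, an FD drive, and a display screen, and is connected to the arithmetic unit 20 via the input I/F 23.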
【００７２】
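The output device 29 outputs information to the operator of the voice information database creation system. It includes, for example, a display (display device) and a speaker, and is connected to the arithmetic unit 20 via the output I/F 24.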
【００７３】
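When this voice information database creation system has a function for synthesizing desired speech using the created voice information database (shown in Fig. 6), the waveform data representing the synthesized speech is recorded on a recording medium 31 by the synthesized speech output device 30. Recording media include CD-ROMs, floppy disks, DVDs, and the like.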
【００７４】
Fig. 7 is a functional block diagram that groups the functions achieved mainly by the arithmetic unit 20 in the above voice information database creation system.
【００７５】
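This voice information database creation system contains four databases: a manuscript database 11, a voice waveform database 12, a label information database 13, and the voice information database 15 that is the final product. These databases are basically created in the course of operating the system and correspond to the database 25 shown in Fig. 6.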
【００７６】
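The specification input unit (means) 4 takes in the specifications that the operator OP of this system sets when creating a voice information database (voice information database capacity, voice information database quality, creation time, and original manuscript file name). It is realized by the input device 28 shown in Fig. 6, and its details are shown in Fig. 8.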
【００７７】
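The manuscript creation unit (means) 5 creates a recording manuscript from an original manuscript in the manuscript database 11, or from an original manuscript supplied by the specification input unit 4, in accordance with the specification information entered through the specification input unit 4. The recording manuscript is the manuscript that the speaker SP reads aloud (that is, the manuscript to be recorded). The speaker SP is the person who reads the recording manuscript aloud. The operator OP and the speaker SP may be the same person or different people. The manuscript creation unit 5 is realized by the arithmetic unit 20 executing the manuscript creation program (see Fig. 11) stored in the processing program memory 27 shown in Fig. 6; details are given later with reference to Fig. 8.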
【００７８】
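The recording management unit (means) 6 uses the analysis results of the speaker SP's uttered (or recorded) voice and their history to decide whether the voice should be entered into the voice information database, to give utterance instructions to the speaker SP, to set the rest periods that are indispensable in long recording sessions, and so on. This allows the speaker SP to proceed with recording alone, without a recording director (operator OP) in attendance, and still collect high-quality voice waveform data. The recording management unit 6 is realized by the recording management program (see Figs. 15 and 16) in the processing program memory 27 and the arithmetic unit 20 operating in accordance with it; its details are shown in Fig. 9.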
【００７９】
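The display device 9 displays the manuscript created by the manuscript creation unit 5 and the rest instructions, utterance cautions, and the like output by the recording management unit 6. It is included in the output device 29 of Fig. 6.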
【００８０】
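The voice input device (means) 10 converts the voice produced by the speaker (the uttered voice) into an electrical signal (voice waveform) and is realized by a microphone. It is included in the input device 28 of Fig. 6.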
【００８１】
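The recording unit (means) 7 detects the start and end of an utterance from the voice waveform input from the voice input device 10 and temporarily records the voice waveform between the detected start and end on a recording medium (magnetic tape, magnetic disk, semiconductor memory, or the like). The voice waveform is preferably converted to digital data but may be held temporarily in analog form. Details of the recording unit 7 are shown in Fig. 9; it corresponds to the input I/F 23 of Fig. 6.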
【００８２】
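The labeling unit (means) 8 creates label information for the voice waveform data in which the recording manuscript created by the manuscript creation unit 5 was recorded. It also detects labeling errors in the created label information and corrects or removes the erroneous portions. This makes it possible to create label information at the level of a skilled practitioner without requiring such skill. The labeling unit 8 is realized by the arithmetic unit 20 executing the labeling error removal program (see Fig. 18) stored in the processing program memory 27 shown in Fig. 6; details are given later with reference to Fig. 10.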
【００８３】
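The feature quantity extraction unit (means) 14 refers to the label information, calculates prosodic or phonological features for each voice waveform or each voice unit, and creates the index information of the voice information database 15. The feature quantity extraction unit 14 is realized by the feature quantity extraction program in the processing program memory 27 and the arithmetic unit 20 operating in accordance with it.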
【００８４】
The output device 16 records the voice information stored in the voice information database 15 on a recording medium 17 such as a CD-ROM, floppy disk, or DVD.
【００８５】
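The operator OP enters the specifications for the voice information database to be created using the specification input unit 4. As shown in Fig. 8, the specification input unit 4 includes an FD drive (recording medium reader) 41 and an input device 42. The input device 42 includes a display device that shows a specification input screen such as that of Fig. 12, a keyboard for entering characters and numbers into the boxes on the screen, a mouse for various operations, and so on.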
【００８６】
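The specification items are the upper-limit capacity of the voice information database to be created, the quality of the database, the upper-limit (allowable) time for creating the database, and the original manuscript file name. The upper-limit capacity is used when the memory available for the voice information database is restricted, typically by the operating environment or the limits of an application's data area. The higher the quality, the larger the capacity of the voice information database 15, but the higher the quality of the synthesized speech (details are given later). The creation time is mainly the time the speaker SP spends on voice input.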
【００８７】
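The longer the creation time of the voice information database, the larger its capacity. The upper-limit creation time therefore limits the database capacity. Because the creation time may be regarded as proportional to the capacity of the database created, the entered upper-limit creation time can be converted to a database capacity with the following formula.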
【００８８】
Database capacity = database creation time × conversion coefficient
【００８９】
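The conversion coefficient is a value expressing the ratio between database creation time and database capacity. It may be prepared in advance or adjusted on the basis of actual results. That is, when creation of an actual voice information database is finished, the conversion coefficient is adjusted from the capacity of the completed database and the time its creation took, using the following formula.
【００９０】
Conversion coefficient = capacity of the completed voice information database / time required to create it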
【００９１】
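The quality of the voice information database is expressed as an integer level. The higher the quality level, the more kinds of voice units there are and the higher the quality of the synthesized speech generated with that database. This embodiment has three quality levels: for example, "level 1" means that all phonemes in the original manuscript are included, "level 2" that all syllables are included, and "level 3" that syllables distinguished by the presence or absence of accent are included. For example, the utterance "すずき (suzuki)" is classified into five kinds of unit at level 1 (s, u, z, k, i) and three kinds at level 2 (su, zu, ki). A higher quality level increases the database capacity and lengthens the creation time. The original manuscript file name is the name of the original manuscript file, created in text file format.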
【００９２】
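When the operator OP enters the specifications of the voice information database, the specification input screen shown in Fig. 12 appears on the display device of the voice information database creation system.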
【００９３】
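At the left edge of this specification input screen, the steps of voice information database creation are listed in order (start, specification input, manuscript creation, recording, labeling, feature quantity extraction, end), and the step currently in progress is shown in a color different from its surroundings and the other steps. The specification input area in the upper part of the screen has boxes for entering the desired values of the voice information database capacity (DB capacity), the database quality (DB quality) level, and the creation time, and a box for entering the manuscript file name. A "Set" button confirms the input. The attribute display area in the lower part of the screen shows, for the completed voice information database, the preset default values and the values entered by the operator OP for DB capacity, DB quality level, and creation time.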
【００９４】
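The DB capacity, DB quality, and creation time entered on this specification input screen are passed from the input device 42 of the specification input unit 4 to the character string selection processing 53 of the manuscript creation unit 5. At least one of DB capacity and creation time must be entered.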
【００９５】
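If an original manuscript file name has been entered on the specification input screen, that file name is passed from the input device 42 to the FD drive 41. The FD drive 41 reads, from the files stored on the inserted FD, the original manuscript file with the entered name and passes it to the original manuscript setting processing 51 of the manuscript creation unit 5.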
【００９６】
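In Fig. 8, the manuscript creation unit 5 includes original manuscript setting processing (means) 51, original manuscript analysis processing (means) 52, and character string selection processing (means) 53. The operation of each is described with reference to Fig. 11.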
【００９７】
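When specification data is supplied from the specification input unit 4, the manuscript creation unit 5 starts manuscript creation processing (step S1).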
【００９８】
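The original manuscript setting processing 51 determines whether an original manuscript file has been supplied from the FD drive 41 (step S2). If so, it loads that file into the work area (step S3). If not, it reads an existing original manuscript file from the manuscript database 11 and sets it in the work area (step S4).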
【００９９】
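When the manuscript database 11 holds several original manuscript files (already created and stored), an appropriate one may be chosen on the basis of the DB capacity and DB quality in the specification information. A combination of an original manuscript file read from a recording medium such as an FD and one read from the manuscript database 11 may also be set as the original manuscript. The original manuscript (original manuscript file) holds the words, phrases, clauses, sentences, and so on that are the source of the recording manuscript; the required words, phrases, clauses, and sentences are taken from it to create the recording manuscript as described below.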
【０１００】
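The original manuscript analysis processing 52 analyzes the character strings in the original manuscript set in the work area and counts the number of times each voice unit making up those character strings appears in the original manuscript (step S5).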
【０１０１】
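Fig. 13(A) shows an example of an original manuscript. It lists many Japanese surnames (only part is shown in the figure); that is, it is a list of character strings each representing a surname.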
【０１０２】
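An original manuscript of this kind is analyzed. Analysis means decomposing the words, phrases, clauses, sentences, and so on in the original manuscript into voice units according to the quality level. In this embodiment, the voice unit at quality level 1 is the phoneme, at quality level 2 the syllable, and at quality level 3 the syllable including accent. Analysis into the corresponding voice units is performed for every quality level up to and including the one set. If quality level 3 is set, decomposition into phonemes (level 1), into syllables (level 2), and into syllables including accent (level 3) are all performed.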
【０１０３】
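For all the voice units obtained in this way, the number of times each voice unit appears in the original manuscript is counted per quality level, and a voice unit list is created as the original manuscript analysis result. Fig. 13(B) shows an original manuscript analysis result. The result is written as a voice unit list for each quality level, sorted in ascending order of appearance count, with units of equal count in alphabetical order. A syllable consisting only of a vowel is a phoneme and is already listed at quality level 1, so it is not included in the lists for quality levels 2 and 3.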
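The analysis result can be sketched as a sorted count table. The code below is a rough illustration that works on romanized strings: the simple splitter `syllables` (a run of consonants closed by a vowel) is an assumption and ignores accent, so only quality levels 1 and 2 are covered.

```python
from collections import Counter

VOWELS = set("aiueo")

def syllables(word):
    """Split a romanized word into syllables: consonants followed by one vowel."""
    out, current = [], ""
    for ch in word:
        current += ch
        if ch in VOWELS:
            out.append(current)
            current = ""
    if current:                # trailing consonant, e.g. a final 'n'
        out.append(current)
    return out

def unit_list(words, level):
    """Count voice units for a quality level; sort by count, then alphabetically."""
    counts = Counter()
    for word in words:
        if level == 1:
            units = list(word)                              # phonemes
        else:
            # Vowel-only syllables are phonemes and belong to the level-1 list.
            units = [s for s in syllables(word) if s not in VOWELS]
        counts.update(units)
    return sorted(counts.items(), key=lambda item: (item[1], item[0]))

level1 = unit_list(["suzuki"], 1)
level2 = unit_list(["suzuki"], 2)
```

For "suzuki" this yields five kinds of phoneme at level 1 and three kinds of syllable at level 2, in line with the quality-level example given earlier in the description.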
【０１０４】
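The character string selection processing 53 of the manuscript creation unit 5 works on the words, phrases, clauses, and sentences in the original manuscript (collectively, character strings) and, referring to the original manuscript analysis result created earlier, builds a recording manuscript that contains as many voice units as possible in as few character strings as possible. To do so, it selects the character strings to be added to the recording manuscript from the original manuscript as follows. First, referring to the analysis result list for the lowest quality level, it selects from the original manuscript a character string (surname) containing the voice unit with the fewest appearances and moves (adds) it to the recording manuscript (step S8). It deletes from the analysis result list all voice units contained in the character string just added (step S9). It then deletes the selected character string from the original manuscript (step S10). This is repeated, taking voice units in ascending order of appearance count, until no voice units remain in the analysis result list (step S7).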
【０１０５】
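When the lowest quality level is finished, the analysis result list for the next quality level is consulted and the character strings (surnames) to be added (moved) to the recording manuscript are selected from the original manuscript. This is repeated until the set quality level is reached.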
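The selection loop of steps S7 to S10, repeated per quality level, can be sketched as a greedy cover. This is a minimal sketch under stated assumptions: strings are romanized, level 3 (accent) is omitted, the first string containing the rarest unit is chosen when several qualify, and units already covered by strings picked at a lower level are skipped. The text does not spell out the last two choices.

```python
from collections import Counter

VOWELS = set("aiueo")

def phonemes(word):
    return list(word)

def syllables(word):
    """Assumed splitter: consonants followed by one vowel form a syllable."""
    out, current = [], ""
    for ch in word:
        current += ch
        if ch in VOWELS:
            out.append(current)
            current = ""
    if current:
        out.append(current)
    return out

def build_manuscript(original, decomposers):
    """Greedy selection, lowest quality level first (decomposers in level order)."""
    pool = list(original)              # strings still left in the original manuscript
    manuscript = []
    for decompose in decomposers:
        counts = Counter(u for w in original for u in decompose(w))
        covered = {u for w in manuscript for u in decompose(w)}
        todo = sorted((u for u in counts if u not in covered),
                      key=lambda u: (counts[u], u))          # rarest unit first
        while todo:                                          # S7
            rare = todo[0]
            pick = next(w for w in pool if rare in decompose(w))
            manuscript.append(pick)                          # S8
            got = set(decompose(pick))
            todo = [u for u in todo if u not in got]         # S9
            pool.remove(pick)                                # S10
    return manuscript

ORIGINAL = ["sato", "suzuki", "tanaka", "sano"]
recording = build_manuscript(ORIGINAL, [phonemes, syllables])
```

In this toy run three names cover every phoneme, and "sano" is added at level 2 only because the syllable "no" is still missing, which parallels how Fig. 14(B) extends Fig. 14(A).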
【０１０６】
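Fig. 14(A) shows an example of the recording manuscript created for quality level 1. It lists four surnames, which together contain all the voice units in the analysis result list for quality level 1 shown in Fig. 13(B).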
【０１０７】
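Fig. 14(B) shows an example of the recording manuscript obtained when processing for quality level 2 is finished. Compared with the recording manuscript of Fig. 14(A), two surnames (しみず, Shimizu, and みやもと, Miyamoto) have been added. These surnames were additionally selected so that all the voice units (syllables) in the analysis result list for quality level 2 shown in Fig. 13(B) are included.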
【０１０８】
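When quality level 3 is set, character strings satisfying the requirement of quality level 3 are further selected and added, giving a recording manuscript such as that of Fig. 14(C). Here surnames were extracted from the original manuscript so that all the accented syllables in the analysis result list for quality level 3 shown in Fig. 13(B) are included.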
【０１０９】
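As described above, the voice information DB capacity, DB quality, and creation time are entered at the specification input unit 4. The processing above is carried out so as to satisfy the requested DB quality (quality level 1 to 3). That is, if the requested DB quality is quality level 2, processing ends when the recording manuscript of Fig. 14(B) is obtained; if quality level 3 is requested, processing continues until the recording manuscript of Fig. 14(C) is obtained.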
【０１１０】
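The requested DB capacity and creation time are also used to control the repetition of steps S8 to S10. As described above, the creation time can be converted to a DB capacity. The smaller of the DB capacity entered at the specification input unit 4 and the DB capacity converted from the entered creation time is set in the work area (step S6). Each time a character string (surname) is selected from the original manuscript and moved (added) to the recording manuscript, the voice information capacity for that character string (the data capacity, including waveform data, that will be stored in the voice information database 15) is subtracted from the DB capacity in the work area. The result of this subtraction is called the remaining DB capacity. When the remaining DB capacity reaches zero, the recording manuscript creation processing ends, even if it is only partway through (step S7).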
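The capacity control of steps S6 and S7 can be sketched as a simple budget. The units, the conversion coefficient, and the per-string size estimate below are hypothetical; the text only states that creation time converts to capacity by multiplication and that the smaller limit is used.

```python
def capacity_budget(db_capacity=None, creation_time=None, coeff=1.0):
    """Smaller of the entered capacity and the capacity converted from creation time.

    At least one of the two specifications must be given (S6).
    """
    limits = []
    if db_capacity is not None:
        limits.append(db_capacity)
    if creation_time is not None:
        limits.append(creation_time * coeff)   # capacity = time x coefficient
    return min(limits)

def take_within_budget(candidates, size_of, budget):
    """Add strings in order, subtracting each one's data size, until nothing remains."""
    chosen, remaining = [], budget
    for string in candidates:
        if remaining <= 0:                     # S7: stop even if strings are left
            break
        chosen.append(string)
        remaining -= size_of(string)
    return chosen, remaining

# Hypothetical numbers: 100 units of capacity, 30 minutes, 2 units per minute.
budget = capacity_budget(db_capacity=100, creation_time=30, coeff=2.0)
chosen, remaining = take_within_budget(
    ["suzuki", "tanaka", "sato", "sano"], lambda s: 20, budget)
```

Here the time limit is the tighter one (60 units), so only three of the four candidate strings are added before the remaining capacity reaches zero.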
【０１１１】
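In Fig. 7, the recording manuscript created by the manuscript creation unit 5 as described above is passed to the recording management unit 6. As described later, the recording management unit 6 has the character strings (surnames) in the recording manuscript shown one after another on the display device 9 and, as needed, generates and displays rest instructions and utterance cautions.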
【０１１２】
The speaker SP reads aloud (utters) the character strings shown on the display device 9 in the order displayed.
【０１１３】
The voice uttered by the speaker SP is input to the voice input device 10 and converted to an electrical signal.
【０１１４】
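The electrical signal representing the voice output from the voice input device 10 is input to the recording unit 7 and the recording management unit 6 as a voice waveform signal. The voice waveform signal input to the recording unit 7 is recorded (saved) as voice waveform data. The recording management unit 6 analyzes the input voice waveform as described later. If the analysis shows that the voice waveform is of good quality, the recording management unit 6 commands the recording unit 7 to save the voice waveform data in the voice waveform database 12.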
【０１１５】
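Functionally, the recording management unit 6 broadly comprises speaker management processing (means) 6a, voice analysis processing (means) 6b, voice acceptance judgment processing (means) 6c, and recording management processing (means) 6d. The speaker management processing 6a comprises utterance caution generation processing (means) 61, rest instruction generation processing (means) 62, and voice analysis result holding processing (means) 63. The voice analysis processing 6b comprises fundamental frequency detection processing (means) 64, volume detection processing (means) 65, and speaking rate detection processing (means) 66. The voice acceptance judgment processing 6c comprises voice analysis result comparison processing (means) 67 and voice acceptance judgment processing (means) 68.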
【０１１６】
The recording unit 7 includes utterance start/end detection processing (means) 71 and recording processing (means) 72.
【０１１７】
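Following the display on the display device 9, the speaker reads the character strings (surnames) in the recording manuscript aloud one at a time. The voice signal for one character string is passed from the voice input device 10 to the recording management unit 6 and the recording unit 7.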
【０１１８】
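For the voice signal of one character string input from the voice input device 10, the voice analysis processing 6b detects the fundamental frequency (pitch), volume (power), and speaking rate in processing 64, 65, and 66 respectively, and passes these detection results, as the voice waveform analysis result, to the voice analysis result comparison processing 67 of the voice acceptance judgment processing 6c and to the voice analysis result holding processing 63 of the speaker management processing 6a.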
【０１１９】
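The voice analysis result comparison processing 67 of the voice acceptance judgment processing 6c reads the voice acceptance criteria, which were set in advance and stored in the voice waveform database 12, compares the supplied voice waveform analysis result with them, and judges whether the voice input from the voice input device 10 to the recording unit 7 should be registered in the voice waveform database 12 as voice waveform data. If every attribute of the analysis result (fundamental frequency, volume, speaking rate) falls within the range of the acceptance criteria, the voice waveform data held in the recording unit 7 is saved to the voice waveform database 12 (accepted); otherwise, the recording unit 7 is made to erase the voice waveform data (rejected). This operation is performed in turn for the voice signal of each character string.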
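The acceptance rule reduces to an all-attributes-in-range check. The sketch below assumes each criterion is a closed interval; the numeric ranges and the key names `f0`, `volume`, and `rate` are hypothetical placeholders for the criteria stored in the voice waveform database 12.

```python
# Hypothetical acceptance criteria: (low, high) per analyzed attribute.
CRITERIA = {
    "f0": (100.0, 200.0),     # fundamental frequency, Hz
    "volume": (0.3, 0.9),     # normalized power
    "rate": (4.0, 8.0),       # speaking rate, units per second
}

def accept(analysis, criteria=CRITERIA):
    """Accept only if every analyzed attribute lies inside its criterion range."""
    return all(low <= analysis[name] <= high
               for name, (low, high) in criteria.items())

def out_of_range(analysis, criteria=CRITERIA):
    """List the attributes that caused a rejection (useful for speaker feedback)."""
    return [name for name, (low, high) in criteria.items()
            if not low <= analysis[name] <= high]

good = {"f0": 150.0, "volume": 0.6, "rate": 6.0}
quiet = {"f0": 150.0, "volume": 0.2, "rate": 6.0}
```

With these values `accept(good)` holds, while `quiet` is rejected because its volume is below the lower bound.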
【０１２０】
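The voice analysis result holding processing 63 keeps a history of the voice waveform analysis results output from the voice analysis processing 6b. It also receives the acceptance judgment result from the voice acceptance judgment processing 68. When the result is a rejection, it gives a repeat command to the recording management processing 6d so that the character string corresponding to the rejected voice is displayed again on the display device 9.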
【０１２１】
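The utterance caution generation processing 61 and the rest instruction generation processing 62 use the history of voice waveform analysis results held by the voice analysis result holding processing 63, or information on the acceptance judgment results, to generate utterance cautions or rest instructions as needed and pass them to the recording management processing 6d, as follows.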
【０１２２】
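The utterance caution generation processing 61 continually calculates the average of the waveform analysis results (frequency, volume, speaking rate). It compares the current waveform analysis result with this average and generates an utterance caution according to the comparison. For example, it compares the current volume with the average volume, and if the current volume is well below the average (the difference is at or above a predetermined threshold), it generates the utterance caution "Your voice is getting quieter".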
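The volume caution can be sketched as a comparison against a running average. The class name, the threshold value, and the choice to average over all previous utterances are assumptions; the text only says that an average is maintained and that a caution is issued when the current value falls below it by a threshold or more.

```python
class CautionGenerator:
    """Keeps a history of volumes and warns when the current one drops too far."""

    def __init__(self, threshold=0.2):
        self.history = []
        self.threshold = threshold     # hypothetical threshold on the volume drop

    def check(self, volume):
        caution = None
        if self.history:
            average = sum(self.history) / len(self.history)
            if average - volume >= self.threshold:
                caution = "Your voice is getting quieter"
        self.history.append(volume)    # the current result joins the history
        return caution

generator = CautionGenerator()
messages = [generator.check(v) for v in (0.60, 0.62, 0.30)]
```

The first two utterances pass silently; the third is well below the average of the earlier ones, so it triggers the caution.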
【０１２３】
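The rest instruction generation processing 62 issues rest instructions on the basis of how often the voice acceptance judgment processing 68 has judged rejection. For example, if the current rejection is close to the previous rejection, frequent rejections are probably being caused by speaker fatigue, so a rest instruction is issued.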
【０１２４】
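The recording management processing 6d holds the recording manuscript supplied by the manuscript creation unit 5 and displays the character strings to be uttered on the display device 9 one after another. An example of the screen shown on the display device 9 appears in Fig. 17, where "佐藤" (Sato) is displayed as the 31st character string (surname).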
【０１２５】
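The acceptance judgment result of the voice acceptance judgment processing 68 is passed to the recording management processing 6d via the voice analysis result holding processing 63. The recording management processing 6d controls the display device 9 so that the next character string (surname) is displayed on acceptance and the same character string (surname) as before is displayed on rejection.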
【０１２６】
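The recording management processing 6d also has the display device 9 show the utterance cautions supplied by the utterance caution generation processing 61 and the rest instructions supplied by the rest instruction generation processing 62. On the screen of Fig. 17, the rest instruction "Please take a 10-minute break" and the utterance caution "Your voice is getting quieter" are displayed as advice.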
【０１２７】
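The display device 9 also shows, as graphs for each voice attribute (volume, speaking rate, pitch, utterance content), the average of the voice analysis results calculated by the utterance caution generation processing 61 (shown hatched) and the current voice analysis result. Utterance content is a score indicating the confidence given by speech recognition.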
【０１２８】
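After outputting a rest instruction, the rest instruction generation processing 62 gives a resume instruction to the recording management processing 6d when the indicated rest time has elapsed. In response, the recording management processing 6d continues displaying the character strings to be uttered.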
【０１２９】
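In Fig. 17, the "Record" button is used when the speaker explicitly signals the start of an utterance; it is unnecessary when an utterance start detection function is provided. The "Play" button is used when the speaker plays back the recorded voice to check it.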
【０１３０】
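The voice signal from the voice input device 10 is input to the recording unit 7. The utterance start/end detection processing 71 detects the start and end points of the incoming voice signal, and the voice signal between those points is passed to the recording processing 72 and recorded.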
【０１３１】
Figs. 15 and 16 are flowcharts showing the recording management processing performed by the recording management unit 6.
【０１３２】
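The recording management processing 6d reads the recording manuscript created by the manuscript creation unit 5 (step S21). At this time it resets the number of recorded character strings (surnames), the recorded count (a variable or counter), to 0 and sets the number of character strings in the recording manuscript (the total number of character strings, or surnames, it contains) as the total recording count (a variable or counter) (step S22).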
【０１３３】
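The recording management processing 6d determines whether the recorded count is smaller than the total recording count (step S23). When the recorded count has reached the total recording count, the recording processing ends (No at step S23).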
【０１３４】
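When the recorded count is smaller than the total recording count, the recording management processing 6d sets the (recorded count + 1)-th character string in the character string list of the recording manuscript as the string to be read (for example, by storing it in a buffer) (step S24) and outputs it to the display device 9 (step S25).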
【０１３５】
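The display device 9 shows a recording screen such as that of Fig. 17. As on the specification input screen described above, the steps of voice information database creation are listed on the left of the screen; at this stage "Recording" is highlighted. The upper part of the screen contains the recording manuscript character string display area, which shows the character string the speaker should read ("佐藤 (さとう)"). The middle part of the screen contains the voice waveform analysis result area described above.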
【０１３６】
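When the speaker SP utters (reads aloud) the character string to be read, the voice is input to the voice input device 10, which passes it as a voice waveform to the recording unit 7 and to the voice analysis processing 6b of the recording management unit 6 (YES at step S26). The voice waveform input to the recording unit 7 is recorded as voice waveform data.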
【０１３７】
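The voice analysis processing 6b analyzes the input voice waveform for pitch (fundamental frequency), loudness (power), and speed (duration) as described above (step S27) and outputs the voice waveform analysis result to the voice acceptance judgment processing 6c and the speaker management processing 6a.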
【０１３８】
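The voice acceptance judgment processing 6c reads the voice acceptance criteria set in advance and stored in the voice waveform database 12, as described above, and uses them to judge whether the pitch (fundamental frequency), loudness (power), and speed (duration) indicated by the voice waveform analysis result all fall within the criteria (accept) or not (reject) (step S28).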
【０１３９】
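When the pitch (fundamental frequency), loudness (power), and speed (duration) all fall within the voice acceptance criteria (YES at step S28), the voice acceptance judgment processing 6c outputs an accept signal to the recording unit 7 and the speaker management processing 6a (and on to the recording management processing 6d). On receiving the accept signal, the recording unit 7 registers the previously recorded voice waveform data in the voice waveform database 12. On receiving the accept signal, the recording management processing 6d adds 1 to the recorded count, since the voice waveform data has been registered in the voice waveform database 12. That is, it sets (recorded count + 1) as the recorded count (step S29).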
【０１４０】
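When any of the pitch (fundamental frequency), loudness (power), or speed (duration) falls outside the voice acceptance criteria (NO at step S28), the voice is treated as rejected (a failed recording), and the voice acceptance judgment processing 6c outputs a reject signal to the speaker management processing 6a and the recording unit 7.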
【０１４１】
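On receiving the reject signal, the speaker management processing 6a reads the previous failure number, which indicates the position of the character string rejected last time, and (recorded count + 1), which indicates the position of the character string just uttered, and judges whether the difference between them is less than a preset rest judgment value (step S30).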
【０１４２】
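When the difference between (recorded count + 1) and the previous failure number is at or above the rest judgment value, no rest is needed and the recording is simply redone. In this case the speaker management processing 6a sets (recorded count + 1) as the previous failure number and, for the retake, outputs (recorded count + 1) to the recording management processing 6d as the number of the string to be recorded. The recording management processing 6d displays the (recorded count + 1)-th character string on the display device 9, and that string is recorded again (return from step S34 to step S25).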
【０１４３】
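When the difference between (recorded count + 1) and the previous failure number is less than the rest judgment value, rejections have been frequent, so the speaker management processing 6a judges that a rest is needed, generates a rest instruction, and outputs it to the recording management processing 6d (step S31). The recording management processing 6d displays the rest instruction on the display device 9. The speaker SP sees the rest instruction on the display device 9 and rests.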
【０１４４】
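The rest instruction generation processing 62 of the speaker management processing 6a starts measuring elapsed time from the moment the rest instruction is displayed and waits until the predetermined rest time has elapsed (step S32). It measures the elapsed time (step S33), and when the rest time has elapsed (YES at step S32) it proceeds to step S34 and again makes the (recorded count + 1)-th character string the string to be read.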
【０１４５】
Recording is repeated in this way until the recorded count equals the total recording count (step S23).
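The control flow of steps S21 to S34 can be simulated without any audio. In the sketch below the acceptance judgment is replaced by a scripted `judge` callback, a rest is only logged rather than timed, and the initial state of the previous failure number (here `None`, meaning no rest on the first failure) is an assumption, since the text does not state it.

```python
def run_session(manuscript, judge, rest_threshold=3):
    """Simulate the recording loop; returns a log of display and rest events."""
    recorded = 0                  # recorded count (S22)
    last_fail = None              # previous failure number (initial value assumed)
    attempts = 0
    log = []
    while recorded < len(manuscript):                 # S23
        number = recorded + 1                         # S24
        text = manuscript[recorded]
        log.append(("show", number, text))            # S25
        attempts += 1
        if judge(text, attempts):                     # S26-S28
            recorded += 1                             # S29
            continue
        if last_fail is not None and number - last_fail < rest_threshold:  # S30
            log.append(("rest", number))              # S31-S33
        else:
            last_fail = number
        # S34: the same string is shown again on the next pass
    return log

# Scripted judgments per attempt: accept, reject, reject, accept, accept.
SCRIPT = [True, False, False, True, True]
log = run_session(["sato", "suzuki", "tanaka"],
                  lambda text, attempt: SCRIPT[attempt - 1])
```

The second string fails twice in a row, so the second failure triggers a rest before it is shown a third time.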
【０１４６】
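The labeling unit 8 is supplied with the recording manuscript created by the manuscript creation unit 5 and the voice waveform data saved in the voice waveform database 12. For each item of voice waveform data, the labeling unit 8 determines the boundaries of the voice units making up the corresponding character string and creates label information consisting of labels representing the voice units and time information indicating the boundaries. The labeling unit 8 also performs labeling error removal (correction of time information and invalidation of time information) on the created label information. The labeling unit 8 saves the label information in the label information database 13.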
【０１４７】
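As an example, take the character string (surname) "さとう (satoo)" in the recording manuscript created by the manuscript creation unit 5. The voice waveform database 12 already holds the voice waveform data of the speaker uttering this character string. When the voice unit is the phoneme, the character string is represented by the phoneme label sequence s, a, t, o, o. When the voice unit is the syllable, the labels are sa, to, o. Labeling means associating each voice unit of such a label sequence with the voice waveform data, that is, dividing the voice waveform data into voice units. For the phoneme case, see Fig. 3 again.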
【０１４８】
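Fig. 10 is a functional block diagram of the labeling unit 8. The labeling unit 8 consists of labeling processing (means) 8a and labeling error removal processing (means) 8b. The labeling processing 8a includes statistical model creation processing (means) 81, voice unit boundary determination processing (means) 82, and label information generation processing (means) 83. The labeling error removal processing 8b includes time information error correction processing (means) 84, time information comparison processing (means) 85, and label information invalidation processing (means) 86.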
【０１４９】
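The voice unit boundary determination processing 82 of the labeling processing 8a reads the recording manuscript supplied by the manuscript creation unit 5 and the voice waveform data saved in the voice waveform database 12. The recording manuscript is also passed to the statistical model creation processing 81. The following processing is performed for each character string in the recording manuscript (for example, "satoo").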
【０１５０】
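The statistical model creation processing 81 uses statistical models prepared in advance (models of the acoustic features of each voice unit, for example hidden Markov models) to create, following the label sequence for one particular character string in the input recording manuscript, a sequence of acoustic feature quantities corresponding to a voice waveform representing that label sequence. The voice unit boundary determination processing 82 matches this sequence against the sequence of acoustic feature quantities of the voice waveform actually recorded for that character string, and thereby extracts the voice unit boundaries in the actually recorded voice waveform (from the voice waveform database 12).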
【０１５１】
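The extracted boundary information (time information) of each voice unit is paired with the label of that voice unit and passed from the label information generation processing 83 to the label information database 13. Label information lists, in character string order (time order), pairs of a label representing a voice unit and the end time of that voice unit (time information, with the start of the voice waveform data taken as 0).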
【０１５２】
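Details of labeling are disclosed in, for example, Japanese Unexamined Patent Publication No. H10-49193. Besides automatic labeling using HMMs, automatic labeling by DP matching may also be used.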
【０１５３】
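For voice units in the generated label information that are likely to contain labeling errors, the labeling error removal processing 8b either corrects their time information (end time) (time information error correction) or attaches information that invalidates the voice unit itself in the database (label information invalidation). Error removal thus falls broadly into two kinds: time information error correction based on correction rules, and label information invalidation based on the differences between several separately created sets of label information.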
【０１５４】
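The time information error correction processing 84 corrects the time information of the label information using correction rules prepared in advance.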
【０１５５】
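For label information invalidation, the time information comparison processing 85 compares the label information generated with the earlier statistical model (for example, an HMM), called the first label information (the one stored in the database 13), with second label information created using a different statistical model. Then, in the label information invalidation processing 86, when a difference in time information exceeds a preset threshold, invalidation information is attached to the corresponding portion of the first label information. Label information carrying invalidation information is excluded from the subsequent feature quantity extraction, so a labeling error, even if present, does not harm the quality of the voice information database.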
【０１５６】
Fig. 18 is a flowchart showing the operation of the labeling error removal processing 8b in the labeling unit 8.
【０１５７】
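The labeling error removal processing 8b reads from the label information database 13 the label information for one character string (one surname) created and saved by the labeling processing 8a (step S41). This label information is the first label information.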
【０１５８】
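The number of labels in the first label information is counted and set in the variable "total label count", the number of label correction rules is set in the variable "total rule count", and the variable "processed label count" is reset to 0 (step S42). The variable "processed rule count" is reset to 0 (step S44). The label correction rules are described later.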
【０１５９】
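The correction rules are applied in order to the (processed label count + 1)-th item of label information (step S46). If the condition of a correction rule is not met, the label information is not updated. If it is met, the label information is updated as described in the action part of the rule (step S47).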
【０１６０】
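Fig. 19(A) shows an example of label information. It lists the voice units (phonemes) s, a, t, o, and o together with the boundary information corresponding to each. The boundary information is the end time of each voice unit, with the start of the voice waveform data for "satoo" taken as zero.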
【０１６１】
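Fig. 19(B) shows examples of correction rules. A correction rule is set for each voice unit. Rules are written in the form "if (condition part), then (action part)", and the processing in the action part is executed only when the condition in the condition part is satisfied.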
【０１６２】
Let us apply the correction rules of Fig. 19(B) to the label information of Fig. 19(A).
【０１６３】
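The duration of the third label "a" in Fig. 19(A) is 0.076 s (0.101 − 0.025 = 0.076). The condition part of the correction rule for label "a" in Fig. 19(B) is "if (duration < 30)" (30 means 0.030 s), so the duration of label "a" does not satisfy the condition (NO at step S46). The action part of the rule is therefore not executed.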
【０１６４】
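The duration of the fifth label "o" in the label information is 0.028 s (0.191 − 0.163), so it satisfies the condition part of the correction rule for voice unit "o" (if (duration < 40)) (YES at step S46). The action part "corrected duration = duration × 1.5" is therefore executed. Since the duration is 0.028, the corrected duration is 0.042 (= 0.028 × 1.5). The end time of the fifth label "o" is corrected to 0.205 (= end time of the preceding voice unit, 0.163, + 0.042). Fig. 19(C) shows the label information after correction.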
【０１６５】
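While adding 1 to the processed rule count (step S48), all correction rules are applied to the (processed label count + 1)-th item of label information (repetition via step S45).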
【０１６６】
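When all correction rules have been applied to one item of label information, 1 is added to the processed label count (step S49) and processing returns to step S43. When steps S44 to S49 have been completed for all labels of one character string (NO at step S43), the time information error correction processing ends.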
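The worked example above can be reproduced with a short rule-application loop. This sketch makes several assumptions: times are held as integer milliseconds, each label has a single rule of the form (duration limit, stretch factor), the leading "pau" entry with its end time and the end time of the last "o" are invented to complete the list, and the stretch factor for "a" is a guess because only its condition is quoted in the text.

```python
# (duration limit in ms, stretch factor); the rule for "o" follows the text,
# the factor for "a" is an assumption.
RULES = {
    "a": (30, 1.5),
    "o": (40, 1.5),
}

def apply_rules(labels, rules=RULES):
    """labels: list of (label, end_ms). Returns a corrected copy of the list."""
    fixed, prev_end = [], 0
    for label, end in labels:
        duration = end - prev_end          # start equals the previous unit's end
        rule = rules.get(label)
        if rule and duration < rule[0]:    # condition part
            end = prev_end + round(duration * rule[1])   # action part
        fixed.append((label, end))
        prev_end = end
    return fixed

# End times 25, 101, 163 and 191 ms come from the worked example;
# 10 ms for 'pau' and 260 ms for the last 'o' are placeholders.
FIRST = [("pau", 10), ("s", 25), ("a", 101), ("t", 163), ("o", 191), ("o", 260)]
corrected = apply_rules(FIRST)
```

The third label "a" (76 ms) is left alone, while the fifth label "o" (28 ms) is stretched to 42 ms, moving its end time from 191 ms to 205 ms as in the text.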
【０１６７】
Processing then moves on to label information invalidation.
【０１６８】
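Using a statistical model different from the one used to create the first label information, automatic labeling is executed in the same way as for the first label information to create second label information (step S50). An example of the second label information is shown in Fig. 20(A).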
【０１６９】
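The processed label count is returned to 0 and the label invalidation threshold is set (step S51). An example of the label invalidation threshold is shown in Fig. 20(C).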
【０１７０】
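For the corrected first label information (Fig. 19(C)) and the second label information (Fig. 20(A)), the difference in time information between corresponding labels is calculated (step S53), and it is judged whether this difference exceeds the label invalidation threshold (step S54). An example of the differences for each label is shown in Fig. 20(B). If any difference exceeds the threshold (YES at step S54), invalidation information is attached to the corresponding first label information (step S55). For example, in Fig. 20(B), the difference in time information for the second label "a" is 0.014 s, which is within the label invalidation threshold of 0.050 s, so no invalidation information is needed. In contrast, the difference for the fifth label "o" is 0.051 s, which exceeds the threshold, so invalidation information is attached to the fifth label "o". Invalidation information is also attached automatically to the label immediately after it (the sixth label "o"). In Fig. 20(D), the fifth and sixth labels "o" are marked with the invalidation mark ×. The above processing is repeated for all labels while adding 1 to the processed label count (steps S56, S52). The first label information, after invalidation processing, is stored in the database 13 again.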
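The comparison and invalidation of steps S52 to S56 can be sketched as follows. Times are integer milliseconds; the 50 ms threshold and the 14 ms and 51 ms differences follow the text, while the individual end times in both label lists are placeholders chosen to produce those differences.

```python
def invalidate(first, second, threshold_ms=50):
    """Compare end times label by label; flag units whose difference exceeds the threshold.

    first, second: lists of (label, end_ms) with the same label sequence.
    Returns (label, end_ms, invalid) triples for the first label information.
    """
    invalid = [False] * len(first)
    for i, ((_lab1, t1), (_lab2, t2)) in enumerate(zip(first, second)):
        if abs(t1 - t2) > threshold_ms:
            invalid[i] = True
            if i + 1 < len(first):
                invalid[i + 1] = True   # the next unit starts at the doubtful boundary
    return [(label, end, bad) for (label, end), bad in zip(first, invalid)]

# Placeholder label lists; only the differences (14 ms for 'a', 51 ms for the
# fifth label 'o') are taken from the description.
FIRST = [("pau", 10), ("s", 25), ("a", 101), ("t", 163), ("o", 205), ("o", 260)]
SECOND = [("pau", 12), ("s", 30), ("a", 115), ("t", 170), ("o", 256), ("o", 262)]
result = invalidate(FIRST, SECOND)
flags = [bad for _label, _end, bad in result]
```

Only the fifth and sixth labels end up flagged, which corresponds to the two × marks described for Fig. 20(D).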
【０１７１】
Fig. 21 shows an example of the index information contained in the voice information database 15 (Fig. 21(A)) and the corresponding voice waveform data (Fig. 21(B)).
【０１７２】
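The feature quantity extraction unit 14 reads the label information saved in the label information database 13 and reads the corresponding voice waveform data from the voice waveform database 12. For the label information and voice waveform thus read, it calculates feature quantities (length, height, magnitude, and so on) for each voice unit and lists them together with the label information to create the index information. Feature quantities are not calculated for voice units carrying the invalidation information described above. The feature quantity extraction unit 14 then pairs the index information with the voice waveform data and saves them in the voice information database 15 as voice information data.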
【０１７３】
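(3) Second embodiment
Fig. 22 is a functional block diagram showing the overall configuration of a second embodiment of the voice information data creation system. Items identical to those in Fig. 7 carry the same reference numerals and are not described again. The recording management unit 6A has a function for creating a standard voice that shows the speaker SP the appropriate way to read the recording manuscript aloud. The standard voice is output from a speaker 18. The manuscript creation unit 5A has a function for creating, when an original manuscript is added, an additional recording manuscript to be added to the recording manuscript already created. This additional recording manuscript can be kept to a minimum. To create it, voice information (index information) is supplied from the voice information database 15 to the manuscript creation unit 5A. Unlike in the first embodiment, the labeling unit 8A has a function for removing errors from the created label information (in particular the time information) on the basis of a statistical analysis of label information.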
【０１７４】
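Fig. 23 is a block diagram showing the functional configuration of the manuscript creation unit 5A. Fig. 24 is a flowchart showing how the manuscript creation unit 5A creates an additional recording manuscript. The processing for creating the additional recording manuscript is described below. Recording manuscript creation is as described for the first embodiment; additional recording manuscript creation should be understood as a function added to it.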
【０１７５】
以下の説明では，第１実施例において既に作成された苗字についての追加録音原稿の存在を前提とする。
【０１７６】
音声情報データベース15には苗字についての録音原稿を話者ＳＰが読上げて，これを録音して得られる音声情報が既に格納されているものとする。音声情報データベース分析処理（手段）54は，データベース15から苗字についての音声情報中のインデックス情報を読出し，このインデックス情報を分析してインデックス情報に含まれる音声単位のリストを，品質レベルごとに作成する（図24，ステップＳ61）。音声情報データベースの分析結果の一例が図25(A) に示されている。これは図13(B) に示す元原稿分析結果と全く同じである（音声単位のリストにおける配列順序が異なっているが）。
【０１７７】
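The specification input unit 4 accepts from the operator OP the input of a manuscript (called the additional source manuscript) listing the character strings (including words, phrases, clauses, sentences, and so on) that are newly to be obtained by synthesis, taking the current speech information database 15 as given (step S62). The unit waits until input is complete (step S63). Alternatively, only the name of the text file corresponding to the additional source manuscript may be entered on the input device 42, and the contents of the additional source manuscript may be read from that file by the FD drive 41. Of course, the additional source manuscript may be entered from the keyboard, or one stored in the manuscript database 11 may be used.
【０１７８】
An example of the additional source manuscript is shown in FIG. 25(B). This additional source manuscript is a list of place names.
【０１７９】
When the additional source manuscript is set in the original manuscript setting process 51A, the original manuscript analysis process 52A decomposes all character strings contained in the additional source manuscript into labels (speech units) for each quality level, counts their occurrences, and creates a speech unit list (step S64). This is the additional source manuscript analysis result; a concrete example for the additional source manuscript of FIG. 25(B) is shown in FIG. 25(C).
【０１８０】
The analysis result comparison process 55A compares the additional source manuscript analysis result from the original manuscript analysis process 52A with the speech information database analysis result from the speech information database analysis process 54, and extracts, for each quality level, the speech units that are present in the additional source manuscript analysis result (FIG. 25(C)) but absent from the speech information database analysis result (FIG. 25(A)). An example of this difference extraction result is shown in FIG. 25(D).
【０１８１】
The character string selection process (means) 53A adds to the recording manuscript character strings that contain the speech units in the difference extraction result, doing so for each quality level in order from the lowest quality level to the highest. Once the additional recording manuscript covers all speech units of a quality level, processing for that level ends and moves on to the next level. In the example of FIG. 25(D), no speech units were extracted as differences at quality level 1, so processing starts from quality level 2. Processing at quality level 2 adds “きょうと” (Kyoto), and processing at quality level 3 further adds “なら” (Nara), so that in the end the two character strings “きょうと” and “なら” are added to the additional recording manuscript. Merely by adding the speech for these two character strings to the speech information database, all place names in the place-name list of FIG. 25(B) can be synthesized with high quality.
【０１８２】
In this way, the analysis result of the additional source manuscript is compared with the index information analysis result, the speech units present in the additional source manuscript but absent from the index information (missing speech units) are extracted, and character strings containing the missing speech units are added to the recording manuscript, so there is no need to recreate the recording manuscript from scratch.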
【０１８３】
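In FIG. 24, the flow of operation of the analysis result comparison process 55A and the character string selection process 53A is as follows.
【０１８４】
Referring to the speech information database analysis result and the additional source manuscript analysis result, all speech units that are present in the additional source manuscript but absent from the speech information database are enumerated to form a speech unit list. In addition, all character strings contained in the additional source manuscript are added to a character string list (step S65).
【０１８５】
If speech units remain in the speech unit list (YES in step S66), the one speech unit with the smallest number of occurrences is selected from the speech unit list, just one character string containing that speech unit is selected from the character string list, and that character string is added to the additional recording manuscript (step S67).
【０１８６】
Of the speech units contained in the character string added to the additional recording manuscript, all those remaining in the speech unit list are deleted from it (step S68). The character string added to the additional recording manuscript is also deleted from the character string list (step S69).
【０１８７】
Steps S67 to S69 are repeated until the speech unit list is empty. This completes creation of the additional recording manuscript.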
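Steps S65 to S69 amount to a greedy covering procedure, which the following sketch illustrates for a single quality level. Everything concrete here is assumed for illustration: the function name, the romanized place names, their decomposition into units, and the contents of the existing database are made up and are not the data of FIG. 25.

```python
from collections import Counter


def build_additional_manuscript(additional_source, db_units):
    """Greedy selection of strings covering the units missing from the database.

    additional_source: dict string -> list of its speech units (one quality level).
    db_units: set of speech units already present in the database.
    """
    counts = Counter(u for units in additional_source.values() for u in units)
    missing = {u for u in counts if u not in db_units}   # speech unit list (S65)
    remaining = dict(additional_source)                  # character string list (S65)
    manuscript = []
    while missing:                                       # S66
        # Rarest missing unit first; ties broken alphabetically (an assumption).
        rarest = min(missing, key=lambda u: (counts[u], u))
        text = next(s for s, units in remaining.items() if rarest in units)
        manuscript.append(text)                          # S67
        missing -= set(remaining.pop(text))              # S68 and S69
    return manuscript


source = {
    "kyoto": ["kyo", "o", "to"],
    "nara": ["na", "ra"],
    "osaka": ["o", "o", "sa", "ka"],
}
existing = {"o", "to", "sa", "ka", "na"}
print(build_additional_manuscript(source, existing))
```

Choosing the rarest unit first tends to pick up strings that are unavoidable anyway, which keeps the manuscript short, although a greedy pass of this kind is not guaranteed to be minimal.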
【０１８８】
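Needless to say, if there are requirements on database capacity or database creation time, the limits they impose are also taken into account in this additional recording manuscript creation process.
【０１８９】
FIG. 26 is a block diagram showing the configuration of the recording management unit 6A.
【０１９０】
The recording management unit 6A is the recording management unit 6 of the first embodiment described above with a speech synthesis process (means) 6e added.
【０１９１】
The speech synthesis process 6e creates synthesized speech that voices the character strings in the recording manuscript read from the manuscript creation unit 5A via the recording management process 6d. That is, for each character string of the recording manuscript, the speech synthesis process 6e creates synthesized speech that reads the character string aloud either in the correct manner (in terms of accent position, pausing, intonation, and so on) or (based on the history of recorded-speech analysis results held by the speaker management means 6c) with a loudness, pitch, and speed appropriate to that speaker. The synthesized speech created by the speech synthesis means 6e (a recording prepared in advance may be used instead) is output as the standard voice from a sound output device 18 such as a loudspeaker. The speaker SP can thus listen to the synthesized speech of the character string to be uttered and use its pitch, loudness, and speed as a reference, which prevents inappropriate readings of the character string and improves the quality of the recorded speech (speech information data).
【０１９２】
FIG. 27 is a flowchart showing the recording management process performed by the recording management unit 6A. Processes identical to those shown in FIG. 15 are given the same reference numerals and duplicate description is omitted. FIG. 16 applies as it is.
【０１９３】
The speech synthesis means 6e sets the prosodic feature parameters based on target values for the appropriate pitch, loudness, speed, intonation, and so on for the (number of recorded items + 1)-th character string input from the recording management process 6d, or on the analysis results of the speech recorded so far (step S36). Using the parameters thus set, the speech synthesis means 6e creates synthesized speech for the (number of recorded items + 1)-th character string to be read, and outputs it as the standard voice to the sound output device (loudspeaker) 18 (step S37). Thus, not only is the character string displayed on the display screen (step S25), but its standard voice is also output.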
【０１９４】
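FIG. 28 is a block diagram showing the functional configuration of the labeling unit 8A. Compared with FIG. 10, a labeling error removal process (means) 8c and a label information statistical analysis process (means) 8d are provided in place of the labeling error removal process 8b. The labeling error removal process 8c includes a label information reliability check process (means) 87 and a label information invalidation process (means) 86. The label information statistical analysis process 8d includes a confidence interval calculation process (means) 88 and a statistical analysis process (means) 89.
【０１９５】
The label information statistical analysis process 8d statistically analyzes existing label information (the label information in the label information database 13), calculates for each speech unit a confidence interval for the duration from the mean and standard deviation of the duration, and creates confidence interval information. Because speech characteristics differ between speakers and the confidence interval of the duration often changes accordingly, the existing label information to be analyzed is preferably that of the same speaker as the label information from which labeling errors are about to be removed.
【０１９６】
The labeling error removal process 8c refers to the confidence interval information for each speech unit obtained by the label information statistical analysis process 8d and checks whether the duration of each speech unit contained in the label information subject to error removal falls within the confidence interval of the corresponding label. The labeling error removal process 8c attaches invalidation information to label information that does not fall within the confidence interval; it may additionally correct the label time information so that it falls within the confidence interval. For each speech unit contained in the label information generated by the labeling process 8a, if the duration of that speech unit lies outside the confidence interval of the duration calculated for that speech unit by the label information statistical analysis process 8d, the labeling error removal process 8c invalidates that portion (judging the labeling to be of low reliability, that is, likely to be a labeling error). Label information judged to be statistically unreliable can thus be invalidated automatically, and as a result the quality of the labeling result can be improved.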
【０１９７】
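FIG. 29 is a flowchart showing the procedure of labeling error removal performed by the label information statistical analysis process 8d and the labeling error removal process 8c of the labeling unit 8A.
【０１９８】
The statistical analysis process 89 of the label information statistical analysis process 8d reads label information that was created by the labeling process 8a and stored in the label information database 13, preferably a set of label information obtained from speech waveforms recorded by the same speaker SP (step S71).
【０１９９】
The statistical analysis process 89 calculates the mean and standard deviation of the duration for each speech unit and counts the number of times that speech unit appears in the label information (statistical analysis of the label information) (step S72).
【０２００】
FIG. 30(A) shows an example of the label information read by the statistical analysis process 89. FIG. 30(B) shows an example of the result of the statistical analysis by the statistical analysis process 89.
【０２０１】
Based on the statistical analysis result from the statistical analysis process 89, the confidence interval calculation process 88 calculates the confidence interval of the duration for each speech unit using the following formula (step S73).
【０２０２】
Confidence interval = mean ± Z × [(standard deviation)² / (number of occurrences)]^(1/2) ... (Equation 1)
【０２０３】
Here, Z is a constant based on the normal distribution.
【０２０４】
FIG. 30(C) shows an example of the confidence interval of the duration for each speech unit calculated with the above formula.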
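Equation 1 can be evaluated directly, as in the sketch below. The function name and the sample statistics are assumptions; the text does not state the value of Z, so the default of 1.96 (the usual two-sided 95% value of the standard normal distribution) is only one plausible choice.

```python
import math


def confidence_interval(mean, std_dev, count, z=1.96):
    """Evaluate Equation 1: mean +/- Z * sqrt(std_dev**2 / count).

    z defaults to 1.96 as an assumed example value; the source only says
    that Z is a constant based on the normal distribution.
    """
    half_width = z * math.sqrt(std_dev ** 2 / count)
    return mean - half_width, mean + half_width


# Made-up statistics for one speech unit (durations in ms).
low, high = confidence_interval(mean=80.0, std_dev=20.0, count=16)
print(f"{low:.1f} to {high:.1f} ms")
```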
【０２０５】
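The confidence interval data thus obtained is supplied to the label information reliability check process 87. The label information reliability check process 87 also reads from the label information database 13 the same label information that the statistical analysis process 89 acquired (called the label information subject to error removal).
【０２０６】
The label information reliability check process 87 counts the number of labels contained in the label information subject to error removal and sets it in the variable “total number of labels”. It also sets the “number of processed labels” to 0 (step S74).
【０２０７】
The duration of the speech unit corresponding to the (number of processed labels + 1)-th label is calculated (the duration is obtained as the difference between the time information indicating the end point of that speech unit and the time information indicating the end point of the immediately preceding speech unit) (step S76).
【０２０８】
If the duration of the speech unit corresponding to the (number of processed labels + 1)-th label does not fall within the confidence interval for that speech unit, the label information invalidation process 86 attaches invalidation information to the (number of processed labels + 1)-th label (step S78). FIG. 30(D) shows an example of label information after invalidation information has been attached. According to FIG. 30(C), the confidence interval of the duration of the speech unit “o” is 46.8 to 115.2 (ms). In FIG. 30(D), the durations of the fifth and sixth speech units (labels) “o” are 0.191 (s) and 0.312 (s) respectively, which are not within the confidence interval. Invalidation information (indicated by ×) is therefore attached to both “o” labels. The durations of the other labels s, a, and t are within their corresponding confidence intervals, so no invalidation information is attached to them.
【０２０９】
1 is added to the number of processed labels and processing returns via step S75 to step S76 (step S79); steps S76 to S78 are repeated until the number of processed labels equals the total number of labels (step S75).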
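The loop of steps S74 to S79 might be sketched as below. The function name is hypothetical, and so is the interval given for “t”; only the interval for “o” (46.8 to 115.2 ms) is quoted from the text. The sketch also treats 0.163, 0.191, and 0.312 s as end-point times, as in the earlier correction-rule example, which is an assumption about how the figures relate.

```python
def invalidate_outside_interval(labels, intervals_ms):
    """Flag labels whose duration lies outside the unit's confidence interval.

    labels: time-ordered list of (unit, end_time_s).
    intervals_ms: dict unit -> (low_ms, high_ms).
    """
    flags = []
    prev_end = 0.0  # assumption: the first unit starts at time 0
    for unit, end in labels:
        # Duration = own end point minus the preceding unit's end point (S76).
        duration_ms = (end - prev_end) * 1000
        low, high = intervals_ms[unit]
        flags.append(not (low <= duration_ms <= high))  # S78
        prev_end = end
    return flags


labels = [("t", 0.163), ("o", 0.191), ("o", 0.312)]
intervals = {"t": (0.0, 200.0), "o": (46.8, 115.2)}  # "t" interval is made up
print(invalidate_outside_interval(labels, intervals))
```

Under these assumptions the two “o” labels last about 28 ms and 121 ms, so both fall outside the interval and are flagged.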
【０２１０】
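When the labeling error removal process has finished as described above, the processed label information is stored again in the label information database 13.
【Brief Description of the Drawings】
【FIG. 1】 Shows phoneme, syllable, and word boundaries in association with a speech waveform.
【FIG. 2】 Shows speech units that take the phonological environment into account, in association with a speech waveform.
【FIG. 3】 Shows the relationship between a speech waveform and label information.
【FIG. 4】 Shows how waveform-connected speech synthesis is performed using waveform information.
【FIG. 5】 Shows the flow of the waveform-connected speech synthesis process.
【FIG. 6】 A block diagram showing the hardware configuration of the speech information database creation system.
【FIG. 7】 A block diagram showing the overall configuration of the speech unit data creation system in the first embodiment.
【FIG. 8】 A block diagram showing the functional configuration of the manuscript creation unit.
【FIG. 9】 A block diagram showing the functional configuration of the recording management unit.
【FIG. 10】 A block diagram showing the functional configuration of the labeling unit.
【FIG. 11】 A flowchart showing the recording manuscript creation process performed by the manuscript creation unit.
【FIG. 12】 Shows the specification input display screen.
【FIG. 13】 (A) shows an example of an original manuscript. (B) shows an example of an original manuscript analysis result.
【FIG. 14】 (A) shows an example of the recording manuscript after level 1 processing. (B) shows an example of the recording manuscript after level 2 processing. (C) shows an example of the recording manuscript after level 3 processing.
【FIG. 15】 A flowchart showing the recording process performed by the recording management unit.
【FIG. 16】 A flowchart showing the recording process performed by the recording management unit.
【FIG. 17】 Shows the recording screen.
【FIG. 18】 A flowchart showing the labeling error removal process performed by the labeling unit.
【FIG. 19】 (A) shows the first label information. (B) shows the correction rules. (C) shows the first label information after correction.
【FIG. 20】 (A) shows the second label information. (B) shows the label difference information. (C) shows the invalidation threshold. (D) shows the first label information after invalidation information has been attached.
【FIG. 21】 (A) shows index information. (B) shows speech waveform data.
【FIG. 22】 A block diagram showing the overall configuration of the speech unit data creation system in the second embodiment.
【FIG. 23】 A block diagram showing the functional configuration of the manuscript creation unit in the second embodiment.
【FIG. 24】 A flowchart showing the additional recording manuscript creation process performed by the manuscript creation unit in the second embodiment.
【FIG. 25】 (A) shows an example of a speech information database analysis result. (B) shows an example of an additional source manuscript. (C) shows an example of an additional source manuscript analysis result. (D) shows an example of a difference extraction result. (E) shows an example of the additional source manuscript after quality level 2 processing. (F) shows an example of the additional source manuscript after quality level 3 processing.
【FIG. 26】 A block diagram showing the functional configuration of the recording management unit in the second embodiment.
【FIG. 27】 A flowchart showing the recording process performed by the recording management unit in the second embodiment.
【FIG. 28】 A block diagram showing the functional configuration of the labeling unit in the second embodiment.
【FIG. 29】 A flowchart showing the labeling error removal process performed by the labeling unit in the second embodiment.
【FIG. 30】 (A) shows an example of label information. (B) shows an example of a statistical analysis result. (C) shows an example of confidence interval information. (D) shows an example of label information after invalidation information has been attached.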
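【Explanation of Symbols】
4 Specification input unit
5, 5A Manuscript creation unit
5a Manuscript creation process
5b Speech information database analysis process
6, 6A Recording management unit
6a Speaker management process
6b Speech analysis process
6c Speech acceptance/rejection determination process
6d Recording management process
6e Speech synthesis process
7 Recording unit
8, 8A Labeling unit
8a Labeling process
8b, 8c Labeling error removal process
8d Label information statistical analysis process
9 Display device
10 Speech input device
11 Manuscript database
12 Speech waveform database
13 Label information database
14 Feature extraction unit
15 Speech information database
16 Output device
17 Recording medium
[0001]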
【Technical field】
The present invention relates to a speech information database creation apparatus, mainly for speech synthesis in the field of spoken dialogue, and to a recording manuscript preparation apparatus, a recording management apparatus, a labeling apparatus, and methods that are positioned as parts of this system.
[0002]
Conventionally, a speech information database has been created by people with specialized knowledge and skills in fields such as language processing, recording, and speech waveform labeling, at a considerable cost in time and effort. As a result, only particular companies or institutions that had, or could assemble, specialists in each field were able to create one. The cost of creating a speech information database was also enormous in terms of the skills and time required.
[0003]
This problem also meant that a key strength of waveform-connected speech synthesis was not fully exploited: simply by creating a speech information database from a desired voice, one obtains synthesized speech with natural voice quality, as if that person were speaking. It was therefore also one of the factors that kept speech synthesis from becoming widespread.
[0004]
DISCLOSURE OF THE INVENTION
The present invention provides an apparatus and method for creating, from an original manuscript containing all the character strings to be produced by speech synthesis, a recording manuscript consisting of the minimum set of character strings whose speech information is to be stored in a speech information database.
[0005]
The present invention also provides an apparatus and method for creating an additional recording manuscript with a minimum of character strings when an additional source manuscript containing character strings to be added is given and a speech information database, created from speech waveforms obtained by a speaker reading the recording manuscript aloud, already exists.
[0006]
The present invention also provides a recording management apparatus and method that support a speaker in reading a recording manuscript aloud.
[0007]
The present invention further provides a labeling apparatus and method that create label information (consisting of speech unit labels and their time information) for a speech waveform, using the speech waveform obtained by recording and the corresponding character string, and that can improve the reliability of the created label information.
[0008]
The present invention ultimately provides a system that enables even a person who does not have specialized knowledge to create a speech information database relatively easily.
[0009]
A recording manuscript preparation apparatus according to the present invention comprises means for setting an original manuscript containing a plurality of character strings, original manuscript analysis means for extracting all speech units that make up the character strings contained in the original manuscript, and first character string selection means for selecting character strings from the original manuscript so that all speech units extracted by the original manuscript analysis means are included, thereby creating a recording manuscript.
[0010]
The original manuscript setting means may accept an original manuscript that is entered manually, or may be a reader that reads one stored on a recording medium such as an FD. A character string is a concept that includes words, phrases, clauses, and sentences. In either case, the original manuscript contains a plurality of character strings. The first character string selection means keeps the number of character strings selected from the original manuscript to a minimum (or as small as possible).
[0011]
In an embodiment of the present invention that minimizes the character strings selected from the original manuscript and added to the recording manuscript, the original manuscript analysis means detects, for every speech unit making up the character strings contained in the original manuscript, the number of times it appears in the original manuscript, and the character string selection means selects character strings in order, starting from those containing speech units with few appearances, until all the speech units are covered.
[0012]
Thus, according to the present invention, the recording manuscript that a speaker is to read aloud when a speech information database is created can be produced from the original manuscript even by an operator without expertise in language processing. Moreover, the recording manuscript contains only the minimum number of character strings (or as few as possible).
[0013]
In one embodiment, the character string selection means selects a character string under a condition that satisfies a given specification related to the voice information database to be created.
[0014]
Here, the specifications include the quality level of the synthesized speech, the capacity of the speech information database, and the time needed to create the speech information database, including the recording work. The operation of the character string selection means is controlled so as to satisfy the requirements based on these specifications.
[0015]
The present invention further provides an additional recording manuscript preparation apparatus. This additional recording manuscript preparation device may be attached to the above-mentioned recording manuscript preparation device or may be independent.
[0016]
The additional recording manuscript preparation apparatus comprises speech information database analysis means for extracting all first speech units contained in an existing speech information database, additional source manuscript analysis means for extracting all second speech units that make up the character strings contained in an additional source manuscript, comparison means for detecting those second speech units that are not included among the first speech units, and second character string selection means for selecting, from the additional source manuscript, character strings containing the speech units detected by the comparison means and thereby creating an additional recording manuscript.
[0017]
By making effective use of the data in the speech information database already created, speech information data capable of synthesizing the character strings of the additional source manuscript can be created with an additional recording manuscript containing the minimum number of character strings (or as few as possible).
[0018]
A recording manuscript preparation apparatus (including an additional recording manuscript preparation apparatus) is generally positioned as part of a speech information database creation apparatus, and both are realized by a computer system. The recording manuscript preparation program that controls this computer system causes the computer to extract all speech units making up the character strings contained in a given original manuscript, detect the number of times each of those speech units appears in the original manuscript, and select character strings from the original manuscript and add them to the recording manuscript, in order starting from character strings containing the speech units with the fewest appearances, until all the extracted speech units are covered.
[0019]
Further, the recording manuscript preparation method according to the present invention extracts all speech units making up the character strings contained in a given original manuscript, detects the number of times each of those speech units appears in the original manuscript, and selects character strings from the original manuscript and adds them to the recording manuscript, in order starting from character strings containing the speech units with the fewest appearances, until all the extracted speech units are covered.
[0020]
The additional recording manuscript preparation method according to the present invention extracts all first speech units contained in an existing speech information database, extracts all second speech units making up the character strings contained in an additional source manuscript, detects those second speech units that are not included among the first speech units, and selects character strings containing the detected speech units from the additional source manuscript to create an additional recording manuscript.
[0021]
The recording management apparatus according to the present invention comprises: a display device that displays the plurality of character strings contained in a recording manuscript one after another, each time a display trigger is given; recording means for temporarily storing a speech signal input by a speaker for the character string displayed on the display device; speech signal analysis means for analyzing the speech signal; speech acceptance/rejection determination means that determines whether to accept the speech based on the analysis result of the speech signal analysis means and, when it decides to accept, performs control so that the speech signal temporarily stored in the recording means is stored in a speech waveform database and gives the display trigger to the display device; and speaker management means for creating instruction information to be given to the speaker based on the analysis result of the speech signal analysis means or the determination result of the speech acceptance/rejection determination means.
[0022]
Since the display device displays the character string that the speaker should read aloud, the speaker need only speak according to this display. Since instruction information based on the speech analysis result is given to the speaker, the speaker need only act on these instructions. The instruction information includes speech analysis results (for example, pitch, loudness, and speaking rate), cautions concerning them, rest instructions, and the like. Furthermore, whether the speech uttered by the speaker is accepted is determined, and the speech is stored in the database only when it is accepted, so high-quality speech waveform data can be secured. In this way, whether to register in the database the speech of the speaker who read the recording manuscript is decided on the basis of the speech analysis results and history information, and instruction information about the utterance is fed back to the speaker. Therefore, even without a recording director to supervise the recording work, the speaker can carry out the recording alone and high-quality speech data can be recorded.
[0023]
In a preferred embodiment, speech synthesis means for creating a synthesized speech signal representing a character string displayed on the display device and speech output means for outputting the synthesized speech signal created by the speech synthesis means are further provided.
[0024]
By presenting a standard voice as an example of a proper reading, the speaker can be prevented from reading the character strings of the recording manuscript inappropriately, and the quality of the recorded speech can be improved.
[0025]
The recording management method according to the present invention displays the plurality of character strings contained in a recording manuscript one after another, each time a display trigger is given; temporarily stores a speech signal input by a speaker for the displayed character string; analyzes the speech signal; determines whether to accept the speech based on the analysis result; when it decides to accept, stores the temporarily stored speech signal in a speech waveform database and generates a display trigger; and creates and outputs instruction information to be given to the speaker based on the analysis result of the speech signal or the acceptance/rejection determination result.
[0026]
The recording management program according to the present invention controls a computer so as to display the plurality of character strings contained in a recording manuscript on a display device one after another, each time a display trigger is given; temporarily store a speech signal input by a speaker for the character string displayed on the display device; analyze the speech signal; determine whether to accept the speech based on the analysis result; when acceptance is decided, store the temporarily stored speech signal in a speech waveform database and generate a display trigger; and create instruction information to be given to the speaker based on the analysis result or the acceptance/rejection determination result.
[0027]
The labeling apparatus according to the present invention comprises first labeling means that associates a character string in a recording manuscript with the speech waveform data obtained by uttering that character string, thereby dividing the speech waveform data into speech units, and creates first label information including labels representing the speech units and time information representing their boundaries; and labeling error removal means for correcting or invalidating the time information in the first label information created by the first labeling means.
[0028]
In one embodiment, the labeling error removing means corrects time information based on a correction rule provided for each voice unit.
[0029]
In another embodiment, the labeling error removal means calculates the difference between the time information contained in second label information, created for the character string in the recording manuscript by second labeling means different from the first labeling means, and the corresponding time information in the first label information, and attaches invalidation information to the time information when this difference exceeds a predetermined value.
[0030]
In still another embodiment, the labeling error removal means creates a confidence interval for the duration of each speech unit by applying a statistical method to label information already created, compares the duration of each speech unit derived from the time information contained in the first label information with the corresponding confidence interval, and, if the duration lies outside the confidence interval, attaches invalidation information to the time information from which that duration was derived.
[0031]
According to the present invention, the time information of the created label information is checked for appropriateness and is corrected or invalidated as necessary, so the label information becomes highly reliable.
[0032]
The labeling method according to the present invention associates a character string in a recording manuscript with the speech waveform data obtained by uttering that character string, thereby dividing the speech waveform data into speech units; creates label information including labels representing the speech units and time information representing their boundaries; and corrects or invalidates the time information in the created label information.
[0033]
The labeling program according to the present invention controls a computer so as to associate a character string in a recording manuscript with the speech waveform data obtained by uttering that character string, thereby dividing the speech waveform data into speech units; create label information including labels representing the speech units and time information representing their boundaries; and correct or invalidate the time information in the created label information.
[0034]
The speech information database creation system according to the present invention comprises: a recording manuscript preparation apparatus that analyzes the speech units of the character strings contained in an original manuscript, which contains the character strings to be produced by speech synthesis, and creates a recording manuscript by selecting from the original manuscript the character strings whose speech information is to be stored in the speech information database so that all speech units are included with as few character strings as possible; a display device that displays the plurality of character strings contained in the recording manuscript created by the recording manuscript preparation apparatus one after another, each time a display trigger is given; recording means for temporarily storing a speech signal input by a speaker for the character string displayed on the display device; a recording management apparatus that analyzes the speech signal, determines whether to accept the speech based on the analysis result, performs control when acceptance is decided so that the speech signal temporarily stored in the recording means is stored in a speech waveform database and generates the display trigger, and creates instruction information to be given to the speaker based on the analysis result or the acceptance/rejection determination result; a labeling apparatus that associates the character strings in the recording manuscript created by the recording manuscript preparation apparatus with the speech waveform data stored in the speech waveform database, thereby dividing the speech waveform data into speech units, creates label information including labels representing the speech units and time information representing their boundaries, and corrects or invalidates the time information in the created label information; feature quantity creation means for creating feature quantities from the speech waveforms stored in the speech waveform database; and speech information database creation means for storing, in association with each other, the speech waveform data stored in the speech waveform database and index information including the label information created by the labeling apparatus and the feature quantities created by the feature quantity creation means.
[0035]
The speech information database creation method according to the present invention analyzes the speech units of the character strings contained in an original manuscript, which contains the character strings to be produced by speech synthesis, and creates a recording manuscript by selecting from the original manuscript the character strings whose speech information is to be stored in a speech information database so that all speech units are included with as few character strings as possible; displays the plurality of character strings contained in the created recording manuscript on a display device one after another, each time a display trigger is given; temporarily stores a speech signal input by a speaker for the character string displayed on the display device; analyzes the speech signal and determines whether to accept the speech based on the analysis result; when acceptance is decided, performs control so that the temporarily stored speech signal is stored in a speech waveform database and gives a display trigger to the display device; creates instruction information to be given to the speaker based on the analysis result or the acceptance/rejection determination result; associates the character strings in the created recording manuscript with the speech waveform data stored in the speech waveform database, thereby dividing the speech waveform data into speech units, and creates label information including labels representing the speech units and time information representing their boundaries; corrects or invalidates the time information in the created label information; creates feature quantities from the speech waveform data stored in the speech waveform database; and stores in the speech information database, in association with each other, the speech waveform data stored in the speech waveform database and index information including the created label information and the created feature quantities.
[0036]
According to the present invention, even a general user without specialized knowledge can create a speech information database of relatively high quality, relatively easily and in a relatively short time. Therefore, in waveform-connected speech synthesis, even a general user can easily create natural synthesized speech in a desired voice, and waveform-connected speech synthesis can be expected to spread widely.
[0037]
The present invention is particularly intended for the creation of a speech information database used in waveform-connected speech synthesis, but it can also be applied to creation of databases for other synthesis methods (waveform superposition type, etc.). Furthermore, the speech database created by the present invention can be used for learning data of a statistical acoustic model (HMM) for speech recognition and sample data for speech analysis even for purposes other than speech synthesis.
[0038]
【Example】
(1) Waveform-connected speech synthesis
In waveform-connected speech synthesis, speech waveform data for a large number of words, phrases, clauses, and sentences is prepared in advance, the necessary parts (waveform segments) are cut out from this speech waveform data, and a plurality of waveform segments are combined and connected to create a synthesized speech waveform representing a new word, phrase, clause, or sentence. The speech waveform data prepared in advance is called original waveform data. As will be described later, index information is attached to the original waveform data, and a set of original waveform data and index information (called waveform information) is stored in the speech information database. The unit in which a necessary part is extracted from the original waveform data for speech synthesis is the speech unit.
[0039]
In this specification, speech units include words, syllables, phonemes, and divided phonemes. A word represents a unit of meaning and is the smallest unit of a language that has a grammatical function. For example, in the sentence “neko ga neru” (“the cat sleeps”), “neko”, “ga”, and “neru” are words. A syllable is a unit of linguistic pronunciation, for example “ne” or “ko”. In Japanese, each kana character corresponds to a syllable, and there are about 100 to 300 kinds. A syllable is composed of one or more phonemes. A phoneme is the basic minimum unit of speech, for example “n”, “e”, “k”, or “o”. Phonemes are classified into vowels (represented by the symbol V) and consonants (represented by the symbol C). In Japanese, there are five vowels (a, i, u, e, o) and about 20 consonants (n, k, s, t, m, r, and so on). A divided phoneme is a phoneme that has been further subdivided; the number of divisions is arbitrary. Phonemes are the most commonly used speech units in waveform-connected speech synthesis. Syllables are also commonly used speech units.
[0040]
Based on the above, the “speech unit” is defined as follows. A speech unit is either a single divided phoneme, obtained by dividing a vowel or consonant phoneme, or a sequence of consecutive divided phonemes. In other words, every speech unit is composed of one or more consecutive divided phonemes.
[0041]
In waveform-connected speech synthesis, speech units that take the phonological environment into account, such as VCV segments and CVC segments, are generally used in addition to syllables and phonemes. A speech unit that takes the phonological environment into account is one in which a given speech unit is distinguished together with the differences in the speech units before and after it (both or either one). Two kinds of such units consisting of three consecutive phonemes (VCV segments and CVC segments) are mentioned above, but there are various others, such as those composed of one or more consecutive syllables, of one or more consecutive phonemes, or of divided phonemes. A VCV segment treats three consecutive phonemes, vowel, consonant, vowel, as one speech unit; there are about 700 to 800 kinds, for example “eko” and “oga”. A CVC segment treats three consecutive phonemes, consonant, vowel, consonant, as one speech unit; there are about 5000 to 6000 kinds, for example “nek” and “kog”.
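To make the VCV and CVC units concrete, the sketch below slides a three-phoneme window over a phoneme sequence and keeps the windows that match a vowel/consonant pattern. It is only an illustration: the function name is hypothetical, and it assumes that every phoneme is written as a single letter, which does not hold for all Japanese phonemes.

```python
VOWELS = set("aiueo")  # the five Japanese vowels named in the text


def segments(phonemes, pattern):
    """Return all runs of phonemes whose vowel/consonant shape equals `pattern`.

    phonemes: list of single-letter phonemes (a simplifying assumption).
    pattern:  a string such as "VCV" or "CVC".
    """
    found = []
    width = len(pattern)
    for i in range(len(phonemes) - width + 1):
        window = phonemes[i:i + width]
        shape = "".join("V" if p in VOWELS else "C" for p in window)
        if shape == pattern:
            found.append("".join(window))
    return found


phonemes = list("nekoga")  # "neko" followed by "ga"
print(segments(phonemes, "VCV"))
print(segments(phonemes, "CVC"))
```

For the sequence n, e, k, o, g, a this yields the VCV segments “eko” and “oga” and the CVC segments “nek” and “kog”, the same examples as in the paragraph above.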
[0042]
FIG. 1 shows phoneme, syllable, and word boundaries in association with a speech waveform. FIG. 2 shows speech units that take the phonological environment into account, in association with a speech waveform.
[0043]
A speech waveform represents, as a change over time, the variations in air density produced by the vibration of air (sound). In speech waveform diagrams such as FIGS. 1 and 2, the horizontal axis represents time and the vertical axis represents the air density. When a speech waveform is handled on a computer, time-series data obtained by ordinary sampling is handled as a speech waveform file, and operations such as recording (saving the file), writing, and reading are performed on that file. The start point, end point, and duration of each speech unit can be expressed using the elapsed time from the start of the speech waveform data.
[0044]
In FIG. 1, the label “pau”, indicating a pause (silence), is given to the section from the start of the speech waveform to the onset of sound (“label” is explained later), and the speech waveform is then divided into the speech units (phonemes) “t”, “a”, “n”, “a”, “k”, and “a”. Below the speech waveform, the phoneme, syllable, and word boundaries are shown in association with the waveform.
[0045]
FIG. 2 shows a speech waveform as a VCV-format speech unit in which the vowels “a” before and after the consonant “k” form the phonological environment. Above the waveform, the speech units that take the phonological environment into account are shown in phoneme units; below it, they are shown in divided-phoneme units (here, a divided phoneme is one of the two parts into which a phoneme is split). In the lower part of FIG. 2, the first divided phoneme of the phoneme “a” is written “a|” and the second “|a”.
[0046]
FIG. 3 shows the relationship between the speech waveform and the label information.
[0047]
Label information is provided for each speech unit obtained when a speech waveform is divided into the speech units that constitute it. It consists of a code identifying the speech unit, called a label (for example, letters such as n, e, k, and o when the speech unit is a phoneme, or kana such as ne and ko when it is a syllable), and time position information for that speech unit within the speech waveform (simply called time information). The time information indicates either the end position (end point) or the start position (start point) of the speech unit.
[0048]
On a computer, label information is handled as a text file that lists, in time order, pairs consisting of a label (written in alphabetic symbols) for each speech unit and time information giving its end point. In this case, the start point of each speech unit equals the end point of the preceding one, and the duration of each speech unit is the difference between its own end point and the end point of the preceding unit. The label “pau”, indicating a pause (silence), is given to the section from the beginning of the speech waveform file to the onset of sound. Because it is difficult to start and stop recording exactly at the onset and end of the sound, a speech waveform usually contains a pause at both its beginning and its end. In FIG. 3, the end point of each speech unit (0.160, 0.250, and so on) is held as time information, so the end point of the leading pause (0.120) is needed to indicate where the sound starts. Conversely, the end of the sound equals the end point of the last speech unit, so no time information is needed for the end point of the trailing pause.
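As a minimal sketch of how such a label file could be read, the following Python derives the start point and duration of each speech unit from the end points alone. The one-pair-per-line layout, the name `parse_labels`, and the labels “n” and “e” attached to the end points 0.160 and 0.250 are assumptions made for illustration; only the three time values come from the description of FIG. 3.

```python
from typing import List, NamedTuple


class Unit(NamedTuple):
    label: str
    start: float  # seconds from the start of the waveform file
    end: float

    @property
    def duration(self) -> float:
        return self.end - self.start


def parse_labels(text: str) -> List[Unit]:
    """Read '<label> <end time>' lines; each start is the previous end."""
    units: List[Unit] = []
    prev_end = 0.0
    for line in text.splitlines():
        if not line.strip():
            continue
        label, end_str = line.split()
        end = float(end_str)
        units.append(Unit(label, prev_end, end))
        prev_end = end
    return units


# Assumed file contents: leading pause, then two illustrative phonemes.
LABEL_TEXT = """\
pau 0.120
n 0.160
e 0.250
"""

units = parse_labels(LABEL_TEXT)
for u in units:
    print(f"{u.label:>3} start={u.start:.3f} end={u.end:.3f} dur={u.duration:.3f}")
```

Storing only end points keeps the file compact, at the cost of requiring the leading pause entry so that the first real unit has a start point.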
[0049]
As described above, the waveform information database stores waveform information for a plurality of speech waveforms. The waveform information consists of speech waveform data and index information. The index information describes, for each speech waveform (original waveform), the label information and the waveform feature quantities of each speech unit constituting that waveform.
[0050]
The feature quantities comprise phonological features and prosodic features of a speech waveform (or of each speech unit). Phonological features include the cepstrum and vector-quantized data. The cepstrum is the inverse Fourier transform of the logarithm of the short-time amplitude spectrum of a speech waveform. Vector-quantized data expresses a vector of several parameter values of a speech waveform as the code of a representative vector. Prosodic features include the fundamental frequency, the power, and the duration described above. The fundamental frequency is the frequency at which the vocal cords, the sound source, vibrate, and is an index of the pitch (“height”) of the voice: the higher the fundamental frequency, the higher the pitch. Power is the amplitude of the speech waveform and corresponds to the loudness (“volume”) of the sound. Duration is the length in time of the part of the speech waveform that corresponds to a speech unit. A short duration relative to the speech unit (the average duration, when a whole speech waveform is considered) indicates a fast speaking rate.
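The cepstrum definition above can be illustrated with a short, dependency-free Python sketch. It uses a naive discrete Fourier transform on a single frame; the function names, the absence of a window function, and the small floor applied before taking the logarithm are assumptions for illustration, not details given in this description.

```python
import cmath
import math
from typing import List, Sequence


def dft(values: Sequence[complex], inverse: bool = False) -> List[complex]:
    """Naive O(n^2) discrete Fourier transform (forward or inverse)."""
    n = len(values)
    sign = 1.0 if inverse else -1.0
    result = []
    for k in range(n):
        total = sum(
            v * cmath.exp(sign * 2j * math.pi * k * t / n)
            for t, v in enumerate(values)
        )
        result.append(total / n if inverse else total)
    return result


def real_cepstrum(frame: Sequence[float], floor: float = 1e-12) -> List[float]:
    """Inverse DFT of the log amplitude spectrum of one short frame."""
    spectrum = dft(frame)
    # The floor (an assumption) avoids log(0) for empty frequency bins.
    log_amplitude = [math.log(max(abs(c), floor)) for c in spectrum]
    return [c.real for c in dft(log_amplitude, inverse=True)]


# A unit impulse has a flat amplitude spectrum, so its cepstrum is all zeros.
impulse = [1.0] + [0.0] * 7
print(real_cepstrum(impulse))
```

In practice a fast Fourier transform and a window function would normally be used; the naive transform is kept here only so that the sketch stays self-contained.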
[0051]
FIG. 4 shows how waveform-connected speech synthesis is performed using waveform information (a set of speech waveform data and index information). To create the speech waveform of synthesized speech that pronounces “Sakata”, two original waveforms are used: a speech waveform of the utterance “Sato” (speech 1) and a speech waveform of the utterance “Tanaka” (speech 2). The index information stored in the voice information database together with these original waveforms is shown on the left side of the figure. For each speech (including speech 1 and 2), the index information includes the label and start point (the label information) of each speech unit (here, a phoneme) constituting the waveform, together with its length (duration), height (frequency), and size (amplitude) (the waveform feature quantities).
[0052]
When the character string “Sakata” representing the synthesized speech to be created is given, the speech units needed to synthesize the speech waveform of “sakata” are selected by referring to the index information: “s” and “a” are selected from speech 1, and “t”, “a”, “k”, and “a” from speech 2.
[0053]
The waveform segment corresponding to each selected speech unit is cut out of its original waveform on the basis of the start point and length given in the index information. The waveform segments representing “s” and “a” are cut out of the original waveform of speech 1, and the waveform segments representing “t” and “a” and those representing “k” and “a” are cut out of the original waveform of speech 2. These waveform segments are then connected (synthesized) in the order “s”, “a”, “k”, “a”, “t”, “a”.
[0054]
Because the waveform segments cut out of the original waveforms are simply connected in the given order, with no signal processing applied, the speech waveform of the synthesized speech can be created without degrading sound quality.
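The cut-and-connect procedure for “sakata” can be sketched in Python as follows. The index entries give a start point and length in samples, and both these numbers and the placeholder sample values are invented for illustration; only the choice of units (“s”, “a” from speech 1 and “k”, “a”, “t”, “a” from speech 2) follows the example above.

```python
from typing import Dict, List, Tuple

# Hypothetical index information: (label, start_sample, length_in_samples).
INDEX: Dict[str, List[Tuple[str, int, int]]] = {
    "speech1": [("s", 0, 3), ("a", 3, 2), ("t", 5, 2), ("o", 7, 3)],  # "sato"
    "speech2": [("t", 0, 2), ("a", 2, 3), ("n", 5, 2),
                ("a", 7, 2), ("k", 9, 2), ("a", 11, 3)],             # "tanaka"
}

# Placeholder sample values standing in for the original waveforms.
WAVEFORMS: Dict[str, List[int]] = {
    "speech1": list(range(100, 110)),
    "speech2": list(range(200, 214)),
}

# Selected units as (speech id, position in that speech's index).
SELECTION = [("speech1", 0), ("speech1", 1),   # s, a
             ("speech2", 4), ("speech2", 5),   # k, a
             ("speech2", 0), ("speech2", 1)]   # t, a


def cut(speech: str, position: int) -> List[int]:
    """Cut one waveform segment out of an original waveform, unmodified."""
    _label, start, length = INDEX[speech][position]
    return WAVEFORMS[speech][start:start + length]


def concatenate(selection: List[Tuple[str, int]]) -> List[int]:
    """Connect the cut segments in the given order, with no signal processing."""
    out: List[int] = []
    for speech, position in selection:
        out.extend(cut(speech, position))
    return out


synth = concatenate(SELECTION)
labels = "".join(INDEX[s][p][0] for s, p in SELECTION)
print(labels, len(synth))
```

Because the samples are copied unchanged, every value in the output can be traced back to one of the original waveforms.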
[0055]
FIG. 5 shows the flow of the waveform connection type speech synthesis process.
[0056]
A character string representing the pronunciation (utterance) to be produced by speech synthesis is given, and this input character string is converted into a label string of speech units. In the case of Japanese, for example, when a sentence in mixed kanji and kana is input, the sentence is divided into words, several words are grouped together, accent positions are determined, and pauses are inserted between word groups (that is, the length of the interval between them is determined). The label string of speech units may also be input directly.
[0057]
In the prosody prediction process 92, the prosodic features of each speech unit are predicted from the speech unit label string. Specifically, use is made of the pitch, intensity, and length patterns extracted for each speech unit by the feature quantity extraction processing performed when the voice information was created. The prosodic features may also be specified and input directly.
[0058]
In the speech unit selection process 93, speech units that match the labels in the speech unit label string are selected from the voice information database 97. When several speech units match, the one whose prosodic features are closest to the predicted ones is selected by referring to the index information in the voice information database.
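One way to picture this selection step is the Python sketch below, which chooses among candidates with the same label by comparing their prosodic features with the predicted values. The distance measure (a sum of squared relative differences), the class and function names, and all numeric values are assumptions; this description does not specify how closeness is measured.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Candidate:
    speech_id: str
    label: str
    f0_hz: float       # fundamental frequency ("height")
    power: float       # amplitude ("size")
    duration_s: float  # time length


def prosody_distance(c: Candidate, f0_hz: float, power: float,
                     duration_s: float) -> float:
    """Assumed measure: relative differences, so Hz and seconds are comparable."""
    return (((c.f0_hz - f0_hz) / f0_hz) ** 2
            + ((c.power - power) / power) ** 2
            + ((c.duration_s - duration_s) / duration_s) ** 2)


def select_unit(db: List[Candidate], label: str, f0_hz: float,
                power: float, duration_s: float) -> Candidate:
    """Pick the matching-label candidate closest to the predicted prosody."""
    matches = [c for c in db if c.label == label]
    if not matches:
        raise LookupError(f"no speech unit labelled {label!r} in the database")
    return min(matches,
               key=lambda c: prosody_distance(c, f0_hz, power, duration_s))


# Hypothetical index entries for two recordings.
DB = [
    Candidate("speech1", "a", 120.0, 0.50, 0.08),
    Candidate("speech2", "a", 180.0, 0.55, 0.10),
    Candidate("speech2", "k", 150.0, 0.30, 0.05),
]

best = select_unit(DB, "a", f0_hz=175.0, power=0.50, duration_s=0.10)
print(best.speech_id)
```

A label with no candidate raises an error here, which mirrors the point that sounds not prepared as original waveforms cannot be synthesized.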
[0059]
In the waveform connection process 94, the index information of each selected speech unit is referred to, the corresponding waveform segment is cut out of the original waveform data as it is (without signal processing), and the segments are connected in the order of the speech unit label string.
[0060]
In the voice output process 95, the voice waveform of the synthesized voice that has been connected is sent to a voice device (for example, a speaker) 96 to output the sound.
[0061]
Waveform-connected speech synthesis has the following advantages because it does not perform signal processing on speech waveform data.
・ There is no degradation of sound quality due to signal processing. In general, when signal processing is performed on a speech waveform, the sound quality is deteriorated, for example, the voice becomes unnatural.
・ Synthesized speech that retains the voice characteristics of the original speech waveform data is obtained. Synthesized speech with the same voice characteristics as a specific person, such as an announcer or a TV personality, can be created.
・ The voice of the synthesized speech can be changed freely by swapping the voice information database.
[0062]
In addition, because synthesized speech is created from speech waveforms prepared in advance, the following points must be considered.
・ Prepare speech waveform data (original waveform data) that contains every sound to be synthesized, while keeping the amount of original waveform data from growing too large. Sounds not prepared as original waveforms cannot be synthesized, and if the original waveform data grows too large it will not fit in the voice information database.
・ Prepare original waveform data of sufficiently good sound quality, and keep the sound quality uniform across the original waveform data.
・ Create information (index information) describing the content of the original waveforms, so that the necessary parts can be found in the original waveform data and cut out.
[0063]
(2) First embodiment
FIG. 6 is a block diagram showing a hardware configuration of the voice information database creation system. This system can be most typically realized by a so-called personal computer or workstation and its peripheral devices, but of course, it may have a hardware architecture dedicated to a voice information database creation system.
[0064]
The voice information database creation system includes an arithmetic unit (CPU) 20, a work memory (RAM) 21, a communication I/F unit 22, an input I/F unit 23, an output I/F unit 24, a database 25, a screen data memory 26, a processing program memory 27, an input device 28, an output device 29, and a synthesized speech output device 30.
[0065]
The arithmetic unit 20 executes programs for voice information database creation processing and other system management processing.
[0066]
The work memory 21 is a memory for storing input/output data and intermediate processing data in the voice information database creation processing.
[0067]
The communication I/F unit 22 connects hardware such as input/output devices, or communicates with external devices directly or via a network, and performs noise removal, synchronization processing, and the like. Any network suited to the application may be used.
[0068]
The database 25 is for storing various databases (details will be described later) created in the voice information database creation system.
[0069]
The screen data memory 26 is a memory that holds screen data output to a screen display device included in the output device.
[0070]
The processing program memory 27 stores the various executable programs (including the OS) for the voice information database creation processing; these programs are described in detail later. The memories described above are realized by semiconductor memories, magnetic disks, optical disks, magneto-optical disks, and other storage media.
[0071]
The input device 28 is used by the operator to input information to the voice information database creation system. It includes, for example, a keyboard, a mouse, a microphone, an FD drive, and a display screen, and is connected to the arithmetic unit 20 via the input I/F 23.
[0072]
The output device 29 outputs information to the operator of the voice information database creation system. It includes, for example, devices that convey information to the operator, such as a display (display device) and a speaker, and is connected to the arithmetic unit 20 via the output I/F 24.
[0073]
When this voice information database creation system has a function of synthesizing desired speech using the created voice information database (as shown in FIG. 6), waveform data representing the synthesized speech is recorded on a recording medium 31 by the synthesized speech output device 30. The recording medium may be a CD-ROM, a floppy disk, a DVD, or the like.
[0074]
FIG. 7 is a functional block diagram showing various functions mainly achieved by the arithmetic unit 20 in the voice information database creation system.
[0075]
This voice information database creation system includes four databases, namely, a manuscript database 11, a voice waveform database 12, a label information database 13, and a voice information database 15 to be finally created. These databases are basically created in the process of operating this system, and specifically correspond to the database 25 shown in FIG.
[0076]
The specification input unit (means) 4 is used by the operator OP of this voice information database creation system to input (load into the computer) the specifications (items) decided when creating a voice information database: the capacity of the voice information database, the quality of the voice information database, the creation time, and the original manuscript file name. Specifically, it is realized by the input device 28 shown in FIG. 6; its details are described later.
[0077]
The manuscript creation unit (means) 5 creates a recorded manuscript, in accordance with the specification information input from the specification input unit 4, from an original manuscript in the manuscript database 11 or an original manuscript supplied from the specification input unit 4. The recorded manuscript is the manuscript that the speaker SP reads aloud (that is, the manuscript to be recorded). The speaker SP is the person who reads the recorded manuscript aloud. The system operator OP and the speaker SP may be the same person or different persons. The manuscript creation unit (means) 5 is realized by the arithmetic unit 20 executing the manuscript creation program (see FIG. 11) stored in the processing program memory 27 shown in FIG. 6; its details are described later.
[0078]
The recording management unit (means) 6 determines, from the analysis result of the speaker SP's uttered (or recorded) voice and its history information, whether that voice should be recorded in the voice information database, gives the speaker SP notes on utterance, and sets the rest periods that are indispensable when recording over a long time. As a result, the recording work can be carried out by the speaker SP alone, without a recording director (operator OP) being present, and high-quality speech waveform data can still be recorded. The recording management unit (means) 6 is realized by a recording management program (see FIGS. 15 and 16) in the processing program memory 27 and the arithmetic unit 20 operating in accordance with that program; its details are described later.
[0079]
The display device 9 displays the manuscript created by the manuscript creation unit (means) 5, as well as the rest instructions, notes on utterance, and the like output from the recording management unit (means) 6. It is included in the output device.
[0080]
The voice input device (means) 10 converts the voice uttered by the speaker into an electrical signal (speech waveform) and is realized by a microphone. It is included in the input device 28.
[0081]
The recording unit (means) 7 detects the start and end of an utterance from the speech waveform input from the voice input device 10 and records the speech waveform between the detected start and end on a recording medium (magnetic tape, magnetic disk, semiconductor memory, or the like). The speech waveform is preferably converted to digital data, but it may be held temporarily in analog form. Details of the recording unit 7 are shown in FIG. 9; it corresponds to the input I/F 23.
[0082]
The labeling unit (means) 8 creates label information for the speech waveform data obtained by recording the recorded manuscript created by the manuscript creation unit (means) 5. It also detects labeling errors in the created label information and corrects or removes the erroneous portions. As a result, label information of the same standard as an expert's can be created without an expert's skill. The labeling unit (means) 8 is realized by the arithmetic unit 20 executing the labeling error elimination program (see FIG. 18) stored in the processing program memory 27 shown in FIG. 6; its details are described later.
[0083]
The feature quantity extraction unit (means) 14 calculates prosodic or phonological features for each speech waveform or each speech unit while referring to the label information, and creates the index information in the voice information database 15. The feature quantity extraction unit 14 is realized by a feature quantity extraction program in the processing program memory 27 and the arithmetic unit 20 operating in accordance with that program.
[0084]
The output device 16 records audio information recorded in the audio information database 15 on a recording medium 17 such as a CD-ROM, floppy disk, or DVD.
[0085]
The operator OP uses the specification input unit 4 to input specifications related to the voice information database to be created. As shown in FIG. 8, the specification input unit 4 includes an FD drive (recording medium reading device) 41 and an input device 42. The input device 42 includes a display device that displays a specification input screen as shown in FIG. 12, a keyboard for inputting characters, numbers, and the like in a box on the display screen, a mouse for various operations, and the like.
[0086]
The specification items include the upper limit capacity of the voice information database to be created, the quality of the database, the upper limit creation time required to create the database, and the original manuscript file name. The upper limit capacity is generally used when the memory capacity that can be used for the voice information database is limited due to limitations on the operating environment and application data area. The higher the quality, the larger the capacity of the voice information database 15, but the higher the quality of the synthesized voice (details will be described later). The creation time is mainly the time for the speaker SP to input voice.
[0087]
If the creation time of the voice information database is long, the capacity of the database increases. Therefore, the upper limit creation time limits the database capacity. Since the database creation time can be considered to be proportional to the capacity of the database to be created, the input upper limit creation time can be converted into the database capacity using the following equation.
[0088]
Database capacity = database creation time × conversion coefficient
[0089]
The conversion coefficient is the ratio between the database creation time and the database capacity; it is prepared in advance or can be adjusted from actual results. That is, when the creation of a voice information database is actually finished, the conversion coefficient is adjusted using the following equation, based on the capacity of the completed voice information database and the time its creation took.
[0090]
Conversion coefficient = database capacity / database creation time
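The conversion between creation time and capacity described above amounts to a single proportionality, sketched below in Python. The units (hours and megabytes), the function names, and the numbers in the usage lines are assumptions chosen only to make the arithmetic concrete.

```python
def capacity_from_time(creation_time_h: float, coefficient_mb_per_h: float) -> float:
    """Database capacity = database creation time x conversion coefficient."""
    return creation_time_h * coefficient_mb_per_h


def adjusted_coefficient(actual_capacity_mb: float, actual_time_h: float) -> float:
    """Re-estimate the coefficient as the capacity-to-time ratio of a finished database."""
    if actual_time_h <= 0:
        raise ValueError("creation time must be positive")
    return actual_capacity_mb / actual_time_h


# Hypothetical values: a 2-hour limit with a prepared coefficient of 10 MB/h.
print(capacity_from_time(2.0, 10.0))

# After a real run that produced 30 MB in 2 hours, the coefficient is updated.
print(adjusted_coefficient(30.0, 2.0))
```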
[0091]
The quality of the voice information database is expressed as a level represented by an integer value. The higher the quality level, the more types of speech units there are and the higher the quality of the synthesized speech generated with the voice information database. In this embodiment there are three quality levels. For example, “level 1” is the quality at which all phonemes in the original manuscript are included, “level 2” the quality at which all syllables are included, and “level 3” the quality at which all syllables, distinguished by the presence or absence of an accent, are included. For example, the utterance “suzuki” is divided into five types of units (s, u, z, k, i) at level 1 and three types of units (su, zu, ki) at level 2. The higher the quality level, the larger the database capacity and the longer the creation time. The original manuscript file name is the name of the file containing the original manuscript, created in text file format.
[0092]
When the operator OP inputs the specifications of the voice information database, the specification input screen shown in FIG. 12 is displayed on the display device of the voice information database creation system.
[0093]
At the left edge of this specification input screen, the steps of the voice information database creation process are displayed in the order start, specification input, manuscript creation, recording, labeling, feature extraction, and end, and the step currently being executed is shown in a color different from that of the other steps. The specification input area in the upper part of the screen contains boxes for entering the desired values of the capacity of the voice information database (DB capacity), the quality level of the database (DB quality), and the creation time, and a box for entering the original manuscript file name. A “set” button for confirming the input is also provided. The attribute display area for the completed voice information database, in the lower part of the screen, displays the preset default values and the specification values entered by the operator OP for the DB capacity, DB quality level, and creation time.
[0094]
The DB capacity, DB quality, and creation time entered on the specification input screen are passed from the input device 42 of the specification input unit 4 to the character string selection process 53 of the manuscript creation unit 5. It is sufficient for at least one of the DB capacity and the creation time to be entered.
[0095]
If an original manuscript file name has been entered on the specification input screen, the entered file name is passed from the input device 42 to the FD drive 41. The FD drive 41 reads the original manuscript file with that name from the files stored on the mounted FD and supplies it to the original manuscript setting process 51 of the manuscript creation unit 5.
[0096]
In FIG. 8, the manuscript creation unit 5 includes an original manuscript setting process (means) 51, an original manuscript analysis process (means) 52, and a character string selection process (means) 53. The operation of each of these processes is described below.
[0097]
When the specification data is given from the specification input unit 4, the document creating unit 5 starts the document creating process (step S1).
[0098]
The original document setting process 51 determines whether an original document file is given from the FD drive 41 (step S2). If an original document file is given, the original document file is taken into the work area (step S3). If the original document file is not given, the original document setting process 51 reads the existing original document file from the document database 11 and sets the read original document file in the work area (step S4).
[0099]
When the manuscript database 11 contains several original manuscript files (already created and stored), an appropriate file may be selected on the basis of the DB capacity and DB quality included in the specification information. A combination of an original manuscript file read from a recording medium such as an FD and an original manuscript file read from the manuscript database 11 may also be set as the original manuscript. The original manuscript (original manuscript file) stores the words, phrases, clauses, sentences, and so on that serve as the source of the recorded manuscript, and the recorded manuscript is created from them.
[0100]
The original manuscript analysis process 52 analyzes the character string included in the original manuscript set in the work area, and measures the number of times each sound unit constituting the character string appears in the original manuscript (step S5).
[0101]
FIG. 13A shows an example of an original document. This original manuscript lists many Japanese surnames (only a part is shown in the figure). This original manuscript is a list of character strings representing each last name.
[0102]
Such an original manuscript is analyzed. Analysis means decomposing the words, phrases, clauses, sentences, and so on in the original manuscript into speech units according to the quality level. In this embodiment, the speech unit of quality level 1 is the phoneme, that of quality level 2 is the syllable, and that of quality level 3 is the syllable including accent. The manuscript is decomposed into the speech units of every quality level up to and including the set level. If quality level 3 is set, decomposition into phonemes (level 1), into syllables (level 2), and into syllables including accent (level 3) are all performed.
[0103]
For all the speech units obtained in this way, the number of times each speech unit appears in the original manuscript is counted for each quality level, and a speech unit list is created as the original manuscript analysis result. FIG. 13 (B) shows the original manuscript analysis result, given as a speech unit list for each quality level. Each list is arranged in ascending order of the number of appearances, and units with the same number of appearances are arranged in alphabetical order. A syllable consisting only of a vowel is also a phoneme and is listed under quality level 1, so it is not included in the lists for quality levels 2 and 3.
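The analysis step can be sketched in Python for quality levels 1 and 2. The sketch assumes that surnames are given in simple romanized form, that each letter is one phoneme, and that a syllable is a run of consonants ending in a vowel; these simplifications, the function names, and the three-name manuscript are illustrative only. Quality level 3 is omitted because accent information cannot be derived from the spelling alone.

```python
from collections import Counter
from typing import Dict, List, Tuple

VOWELS = set("aiueo")


def phonemes(word: str) -> List[str]:
    """Level 1 (assumed): one letter of the romanized form per phoneme."""
    return list(word)


def syllables(word: str) -> List[str]:
    """Level 2 (assumed): consonant run plus vowel, or a lone vowel."""
    units, onset = [], ""
    for ch in word:
        if ch in VOWELS:
            units.append(onset + ch)
            onset = ""
        else:
            onset += ch
    if onset:  # trailing consonant, e.g. a word-final "n"
        units.append(onset)
    return units


def analyze(manuscript: List[str]) -> Dict[int, List[Tuple[str, int]]]:
    """Count unit appearances per level; sort by count, then alphabetically."""
    level1 = Counter(p for w in manuscript for p in phonemes(w))
    # Vowel-only syllables are already listed at level 1, so skip them here.
    level2 = Counter(s for w in manuscript for s in syllables(w)
                     if s not in VOWELS)

    def ordered(counts: Counter) -> List[Tuple[str, int]]:
        return sorted(counts.items(), key=lambda kv: (kv[1], kv[0]))

    return {1: ordered(level1), 2: ordered(level2)}


result = analyze(["suzuki", "sato", "tanaka"])
for level, units in result.items():
    print(level, units)
```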
[0104]
The character string selection process 53 of the manuscript creation unit 5 refers to the original manuscript analysis result created earlier and, from the words, phrases, clauses, and sentences in the original manuscript (collectively called character strings), creates a recorded manuscript that contains as many speech units as possible in as few character strings as possible. To this end, the character strings to be added to the recorded manuscript are selected from the original manuscript as follows. First, referring to the original manuscript analysis result list for the lowest quality level, a character string (surname) containing the speech unit with the fewest appearances is selected from the original manuscript and moved (added) to the recorded manuscript (step S8). All speech units contained in the character string added to the recorded manuscript are deleted from the original manuscript analysis result list (step S9), and the selected character string is deleted from the original manuscript (step S10). This processing is repeated, taking speech units from the list in ascending order of number of appearances, until no speech units remain in the original manuscript analysis result list (step S7).
[0105]
When the lowest quality level is finished, the original manuscript analysis result list for the next quality level is referred to, and character strings (surnames) to be moved (added) to the recorded manuscript are selected from the original manuscript in the same way. This is repeated until the set quality level is reached.
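The selection loop of steps S8 to S10, repeated level by level, can be sketched in Python as below. Two points are assumptions not stated in this description: when several character strings contain the rarest unit, the one covering the most still-missing units is taken, and units already covered by strings chosen at a lower level are treated as satisfied at the next level. The romanized five-name manuscript and the simplified decomposition functions are likewise illustrative.

```python
from collections import Counter
from typing import Callable, List

VOWELS = set("aiueo")


def phonemes(word: str) -> List[str]:
    return list(word)  # assumed: one romanized letter per phoneme


def syllables(word: str) -> List[str]:
    units, onset = [], ""
    for ch in word:
        if ch in VOWELS:
            units.append(onset + ch)
            onset = ""
        else:
            onset += ch
    if onset:
        units.append(onset)
    return units


def select_strings(manuscript: List[str],
                   decomposers: List[Callable[[str], List[str]]]) -> List[str]:
    """Greedy selection, lowest quality level first."""
    remaining = list(manuscript)  # working copy of the original manuscript
    recorded: List[str] = []      # recorded manuscript being built
    for decompose in decomposers:
        counts = Counter(u for w in manuscript for u in decompose(w))
        for w in recorded:        # assumption: earlier picks already cover these
            for u in decompose(w):
                counts.pop(u, None)
        while counts:
            rarest = min(counts, key=lambda u: (counts[u], u))
            candidates = [w for w in remaining if rarest in decompose(w)]
            if not candidates:    # defensive; should not occur
                del counts[rarest]
                continue
            best = max(candidates, key=lambda w: sum(
                1 for u in set(decompose(w)) if u in counts))
            recorded.append(best)            # step S8
            for u in decompose(best):        # step S9
                counts.pop(u, None)
            remaining.remove(best)           # step S10
    return recorded


NAMES = ["suzuki", "sato", "tanaka", "kato", "sano"]
print(select_strings(NAMES, [phonemes]))
print(select_strings(NAMES, [phonemes, syllables]))
```

With this input, three names suffice to cover every phoneme, and one more name is added at level 2 to cover the remaining syllable, which parallels the way FIG. 14 (B) extends FIG. 14 (A).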
[0106]
FIG. 14A shows an example of a recorded manuscript created for quality level 1. In this recorded manuscript, four last names are listed. These four last names include all speech units in the original manuscript analysis result list for quality level 1 shown in FIG. 13 (B).
[0107]
FIG. 14 (B) shows an example of the recorded manuscript obtained when processing for quality level 2 is complete. Compared with the recorded manuscript shown in FIG. 14 (A), two surnames (Shimizu and Miyamoto) have been added. These surnames were additionally selected so that all the speech units (syllables) in the original manuscript analysis result list for quality level 2 shown in FIG. 13 (B) are included.
[0108]
If quality level 3 is set, a character string that satisfies the requirements of quality level 3 is further selected and added, and a recorded manuscript as shown in FIG. 14C is obtained. This is because the last name was extracted from the original manuscript so as to include all of the syllables including the accents listed in the original manuscript analysis result list for quality level 3 shown in FIG. 13 (B).
[0109]
As described above, the DB capacity, DB quality, and creation time of the voice information database are input at the specification input unit 4. Of these, the requested DB quality (quality level 1 to 3) determines how far the above processing is carried out. That is, if the requested DB quality is quality level 2, the processing ends when the recorded manuscript of FIG. 14 (B) has been obtained; if quality level 3 is requested, the processing continues until the recorded manuscript of FIG. 14 (C) has been obtained.
[0110]
The requested DB capacity and creation time, on the other hand, are used to control the repetition of steps S8 to S10. The creation time can be converted into a DB capacity as described above. The smaller of the DB capacity input at the specification input unit 4 and the DB capacity converted from the input creation time is set in the work area (step S6). Each time a character string (surname) is selected from the original manuscript and moved (added) to the recorded manuscript, the voice information capacity for that character string (the amount of data, including waveform data and the like, that will be stored in the voice information database 15) is subtracted from the DB capacity in the work area. The result of this subtraction is called the remaining DB capacity. When the remaining DB capacity reaches zero, the recorded manuscript creation process ends even if it is only partway through (step S7).
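The capacity control of steps S6 and S7 can be sketched in Python as follows. How the voice information capacity of one character string is estimated is not given here, so the `size_of` callback (a fixed amount per character in the usage lines), the units, and all numbers are assumptions.

```python
from typing import Callable, List, Optional, Tuple


def initial_budget(db_capacity: Optional[float],
                   creation_time: Optional[float],
                   coefficient: float) -> float:
    """Step S6: the smaller of the input capacity and the converted creation time."""
    limits = []
    if db_capacity is not None:
        limits.append(db_capacity)
    if creation_time is not None:
        limits.append(creation_time * coefficient)
    if not limits:
        raise ValueError("at least one of DB capacity or creation time is required")
    return min(limits)


def add_within_budget(strings: List[str], budget: float,
                      size_of: Callable[[str], float]) -> Tuple[List[str], float]:
    """Add strings in order; stop once the remaining DB capacity reaches zero."""
    recorded: List[str] = []
    remaining = budget
    for s in strings:
        if remaining <= 0:        # step S7: end even if partway through
            break
        recorded.append(s)
        remaining -= size_of(s)   # subtract this string's voice information capacity
    return recorded, remaining


# Hypothetical figures: 100 units of capacity, 2 hours at 30 units per hour,
# and 5 units of voice information per character.
budget = initial_budget(100.0, 2.0, 30.0)
recorded, remaining = add_within_budget(
    ["suzuki", "tanaka", "sato", "sano"], budget, lambda s: 5.0 * len(s))
print(budget, recorded, remaining)
```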
[0111]
In FIG. 7, the recorded manuscript created by the manuscript creation unit 5 as described above is given to the recording management unit 6. As described later, the recording management unit 6 displays the character strings (surnames) in the recorded manuscript one after another on the display device 9, and generates and displays rest instructions and notes on utterance as necessary.
[0112]
The speaker SP reads out (speaks) the character string displayed on the display device 9 according to the display order.
[0113]
The voice uttered by the speaker SP is input to the voice input device 10 and converted into an electric signal.
[0114]
The electrical signal representing the voice output from the voice input device 10 is input to the recording unit 7 and the recording management unit 6 as a speech waveform signal. The speech waveform signal input to the recording unit 7 is recorded (saved) as speech waveform data. The recording management unit 6 analyzes the input speech waveform as described later. If the analysis shows the speech waveform to be of good quality, the recording management unit 6 instructs the recording unit 7 to store the speech waveform data in the speech waveform database 12.
[0115]
The recording management unit 6 includes a speaker management process (means) 6a, a voice analysis process (means) 6b, a voice acceptance/rejection determination process (means) 6c, and a recording management process (means) 6d. The speaker management process (means) 6a includes an utterance note generation process (means) 61, a rest instruction generation process (means) 62, and a voice analysis result holding process (means) 63. The voice analysis process (means) 6b includes a fundamental frequency detection process (means) 64, a volume detection process (means) 65, and a speech speed detection process (means) 66. The voice acceptance/rejection determination process (means) 6c includes a voice analysis result comparison process (means) 67 and a voice acceptance/rejection determination process (means) 68.
[0116]
The recording unit 7 includes an utterance start / end detection process (means) 71 and a recording process (means) 72.
[0117]
According to the display on the display device 9, the speaker reads out the character strings (last name) in the recorded manuscript one by one. A voice signal for one character string is given from the voice input device 10 to the recording manager 6 and the recorder 7.
[0118]
The voice analysis process 6b detects the fundamental frequency (height), volume (power) and speech speed of the voice signal of one character string input from the voice input device 10 in processes 64, 65 and 66, respectively. These detection results are given, as the speech waveform analysis result, to the voice analysis result comparison process 67 of the voice acceptance / rejection determination process 6c and to the voice analysis result holding process 63 of the speaker management process 6a.
[0119]
The voice analysis result comparison process 67 of the voice acceptance / rejection determination process 6c reads the voice acceptance / rejection criteria, which are set in advance and stored in the speech waveform database 12, and compares the given speech waveform analysis result with the read criteria. Based on this comparison, the voice acceptance / rejection determination process 68 determines whether or not the voice input from the voice input device 10 to the recording unit 7 is to be registered in the speech waveform database 12 as speech waveform data. When all the attributes (fundamental frequency, volume, speech speed) of the speech waveform analysis result are within the ranges of the acceptance criteria, the speech waveform data held in the recording unit 7 is stored in the speech waveform database 12. Otherwise, the speech waveform data is erased (rejected). This operation is performed sequentially on the audio signal representing each character string.
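The acceptance decision described above can be illustrated with a minimal sketch. The attribute names, units, and numeric ranges below are hypothetical and are not taken from this description; only the rule that every attribute must lie within its range is.

```python
from dataclasses import dataclass


@dataclass
class AnalysisResult:
    """Speech waveform analysis result for one character string."""
    f0_hz: float       # fundamental frequency (height)
    power_db: float    # volume (power)
    duration_s: float  # speech speed, represented here as a duration


# Hypothetical acceptance criteria: (lower bound, upper bound) per attribute.
CRITERIA = {
    "f0_hz": (80.0, 300.0),
    "power_db": (-30.0, -6.0),
    "duration_s": (0.3, 2.0),
}


def is_adopted(result, criteria=CRITERIA):
    """Adopt the utterance only if every attribute lies within its range."""
    return all(
        low <= getattr(result, name) <= high
        for name, (low, high) in criteria.items()
    )
```

An utterance that fails on a single attribute (for example, a volume below the lower bound) is rejected even when the other attributes are in range.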
[0120]
The voice analysis result holding process 63 stores the history of the voice waveform analysis results output from the voice analysis process 6b. It also receives the acceptance / rejection determination result from the voice acceptance / rejection determination process 68. If the result is rejection, the voice analysis result holding process 63 gives a repeat command to the recording management process 6d, causing the display device 9 to display the character string corresponding to the rejected voice again.
[0121]
Based on the history of the voice waveform analysis results held in the voice analysis result holding process 63, or on information about the acceptance / rejection determination results, the utterance attention generation process 61 or the rest instruction generation process 62 generates an utterance attention or a rest instruction as necessary, in the manner described below, and gives it to the recording management process 6d.
[0122]
The utterance attention generation process 61 continuously calculates an average value for each waveform analysis result (fundamental frequency, volume, speech speed). It then compares the current waveform analysis result with this average value and generates an utterance attention according to the comparison result. For example, the current volume is compared with the average volume, and if the current volume is significantly below the average (that is, if the difference is greater than or equal to a predetermined threshold), an utterance attention indicating that the voice has become low is generated.
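As a rough illustration of the volume example, the following sketch keeps a running history and compares each new value with the average. It assumes that the average is taken over earlier utterances only and that the threshold is a fixed value; neither detail is specified in this description. The class name and the threshold value are hypothetical, and the message text follows the utterance note shown on the screen of FIG. 17.

```python
class UtteranceAttention:
    """Generates a note when the current volume falls well below the average."""

    def __init__(self, volume_threshold):
        self.volume_threshold = volume_threshold  # assumed fixed threshold
        self.history = []                         # volumes of earlier utterances

    def update(self, volume):
        """Return a note (or None) for this volume, then add it to the history."""
        message = None
        if self.history:
            average = sum(self.history) / len(self.history)
            # Note is generated when the drop from the average reaches the threshold.
            if average - volume >= self.volume_threshold:
                message = "The voice is getting lower"
        self.history.append(volume)
        return message
```

The same comparison could be applied to the fundamental frequency and the speech speed with their own thresholds.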
[0123]
The rest instruction generation process 62 generates a rest instruction based on how frequently the voice acceptance / rejection determination process 68 rejects utterances. For example, if the current rejection occurs close to the previous rejection, rejections are considered to be occurring frequently because of speaker fatigue, so a rest instruction is generated.
[0124]
The recording management process 6d holds the recorded manuscript given from the manuscript preparation unit 5 and sequentially displays the character strings to be uttered on the display device 9. An example of a screen displayed on the display device 9 is shown in FIG. On this screen, “Sato” is displayed as the 31st character string (surname).
[0125]
The acceptance / rejection determination result of the voice acceptance / rejection determination process 68 is given to the recording management process 6d via the voice analysis result holding process 63. Therefore, if the result is adoption, the recording management process 6d displays the next character string (surname) on the display device 9. If the result is rejection, it controls the display device 9 so that the same character string (surname) as the previous time is displayed again.
[0126]
The recording management process 6 d also controls the display device 9 to display the utterance attention given from the utterance attention generation process 61 and the rest instruction given from the rest instruction generation process 62. On the display screen of FIG. 17, a rest instruction “Please take a 10-minute break” and an utterance note “The voice is getting lower” are displayed as advice.
[0127]
The display device 9 also displays, for each speech attribute (volume, speech speed, height, utterance content), a graph of the average value (indicated by hatching) of the speech analysis results calculated by the utterance attention generation process 61 and of the current speech analysis result. The utterance content is a score indicating the reliability obtained by speech recognition.
[0128]
After outputting the rest instruction, the rest instruction generating process 62 gives a restart instruction to the recording management process 6d when the instructed rest time has elapsed. In response to this, the recording management process 6d continues to display the character string to be uttered.
[0129]
In FIG. 17, the “Record” button is used when the speaker explicitly inputs the start of the utterance, and is unnecessary when the utterance start detection function is provided. The “play” button is used when the speaker plays and confirms the recorded voice.
[0130]
The audio signal from the voice input device 10 is input to the recording unit 7. The utterance start / end detection process 71 detects the start time and end time of the input audio signal, and the audio signal between the start time and the end time is given to the recording process 72 and recorded.
[0131]
FIG. 15 and FIG. 16 are flowcharts showing the recording management processing by the recording management unit 6.
[0132]
The recording management process 6d reads the recorded manuscript created by the manuscript creating unit 5 (step S21). At this time, the number of recorded items (a variable or counter holding the number of character strings (surnames) already recorded) is reset to 0, and the total number of character strings (surnames) included in the recorded manuscript is set as the total number of recorded items (a variable or counter) (step S22).
[0133]
The recording management process 6d determines whether or not the number of recorded items is smaller than the total number of recorded items (step S23). If the number of recorded items has reached the total number of recorded items, the recording process is terminated (NO in step S23).
[0134]
When the number of recorded items is smaller than the total number of recorded items, the recording management process 6d takes the (number of recorded items + 1)-th character string from the character string list of the recorded manuscript, sets it as the read-out character string (for example, by storing it in a buffer) (step S24), and outputs it to the display device 9 (step S25).
[0135]
A recording display screen as shown in FIG. 17 is displayed on the display device 9. As on the specification input screen described above, the steps of the voice unit database creation process are displayed on the left side of the screen. At this stage, “recording” is specified. In the upper part of the screen there is a recorded manuscript character string display area, in which the character string (“Sato”) to be read out by the speaker is displayed. In the middle of the screen there is a speech waveform analysis result area as described above.
[0136]
When the speaker SP reads the displayed character string aloud, the voice is input to the voice input device 10 and is sent from the voice input device 10, as a voice waveform, to the recording unit 7 and to the voice analysis process 6b of the recording management unit 6 (YES in step S26). The voice waveform input to the recording unit 7 is recorded as voice waveform data.
[0137]
The voice analysis process 6b analyzes the input voice waveform with respect to height (fundamental frequency), magnitude (power), and speed (duration) (step S27), and outputs the resulting voice waveform analysis result to the voice acceptance / rejection determination process 6c and the speaker management process 6a.
[0138]
In the voice acceptance / rejection determination process 6c, the acceptance criteria set in advance and stored in the speech waveform database 12 are read as described above, and it is determined whether the height (fundamental frequency), magnitude (power), and speed (duration) indicated by the speech waveform analysis result are all within the criteria (adopted) or not (rejected) (step S28).
[0139]
If the height (fundamental frequency), magnitude (power), and speed (duration) are all within the acceptance criteria (YES in step S28), the voice acceptance / rejection determination process 6c outputs an adoption signal to the recording unit 7 and the speaker management process 6a (and to the recording management process 6d). When the adoption signal is input, the recording unit 7 registers the previously recorded speech waveform data in the speech waveform database 12. When the adoption signal is input to the recording management process 6d, the voice waveform data for that character string has been registered in the speech waveform database 12, so 1 is added to the number of recorded items. That is, (number of recorded items + 1) is set as the new number of recorded items (step S29).
[0140]
If any of the height (fundamental frequency), magnitude (power), and speed (duration) is outside the acceptance criteria (NO in step S28), the speech is rejected, and the voice acceptance / rejection determination process 6c outputs a rejection signal to the speaker management process 6a and the recording unit 7.
[0141]
When the rejection signal is input, the speaker management process 6a reads the previous failure number, which indicates the number of the character string that was rejected last time, and the number of the character string currently being uttered (recorded), namely (number of recorded items + 1). It then determines whether or not the difference between the previous failure number and (number of recorded items + 1) is less than a predetermined rest necessity determination value (step S30).
[0142]
When the difference between (number of recorded items + 1) and the previous failure number is equal to or greater than the rest necessity determination value, the speaker management process 6a judges that no rest is needed and simply has the character string re-recorded. At this time, (number of recorded items + 1) is set as the previous failure number, and (number of recorded items + 1) is output to the recording management process 6d as the number of the character string to be re-recorded. The recording management process 6d displays the (number of recorded items + 1)-th character string on the display device 9, and that character string is recorded again (returning from step S34 to step S25).
[0143]
When the difference between (number of recorded items + 1) and the previous failure number is less than the rest necessity determination value, the speaker management process 6a judges that rejections are occurring frequently and that a rest is necessary, generates a rest instruction, and outputs it to the recording management process 6d (step S31). The recording management process 6d displays the rest instruction on the display device 9. The speaker SP sees the rest instruction displayed on the display device 9 and rests.
[0144]
The rest instruction generation process 62 of the speaker management process 6a starts measuring elapsed time from the time when the rest instruction is displayed, and waits until a predetermined rest time elapses (step S32). The elapsed time is measured (step S33), and when the rest time elapses (YES in step S32), the process proceeds to step S34, and the (number of recorded items + 1) -th character string is set as a read-out character string again.
[0145]
As described above, the recording process is repeatedly performed until the number of recorded records is equal to the total number of recorded records (step S23).
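The control flow of steps S22 to S34 can be sketched as follows. This is only an outline under stated assumptions: `record_and_judge` and `rest` are hypothetical callbacks standing in for display, recording, analysis, and waiting; the initial state of the previous failure number is not given in this description and is assumed here to be "none", so the first rejection never triggers a rest.

```python
def run_recording(strings, record_and_judge, rest_threshold, rest):
    """Record every character string until each has been adopted once.

    record_and_judge(text) -> bool stands in for steps S25 to S28.
    rest() stands in for displaying the rest instruction and waiting
    (steps S31 to S33). Returns a list of (event, string number) tuples.
    """
    recorded = 0                 # number of recorded items (step S22)
    total = len(strings)         # total number of recorded items (step S22)
    previous_failure = None      # assumption: no previous failure at the start
    events = []
    while recorded < total:                      # step S23
        number = recorded + 1
        text = strings[recorded]                 # step S24
        if record_and_judge(text):               # steps S25 to S28
            events.append(("adopted", number))
            recorded += 1                        # step S29
            continue
        events.append(("rejected", number))
        # Step S30: a rest is needed when rejections occur close together.
        needs_rest = (previous_failure is not None
                      and number - previous_failure < rest_threshold)
        previous_failure = number
        if needs_rest:
            events.append(("rest", number))
            rest()                               # steps S31 to S33
        # Step S34: the same string is presented again on the next pass.
    return events
```

With a rest necessity determination value of 3, two consecutive rejections of the same character string (difference 0) lead to a rest before the string is presented a third time.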
[0146]
The labeling unit 8 is provided with the recorded document created by the document creation unit 5 and the voice waveform data stored in the voice waveform database 12. In the speech waveform data, the labeling unit 8 determines the boundary of each speech unit constituting the character string corresponding to the waveform, and creates label information including a label representing each speech unit and time information indicating the boundary. The labeling unit 8 also performs labeling error removal (correction of time information and invalidation of time information) for the created label information. The labeling unit 8 stores the label information in the label information database 13.
[0147]
As an example, the character string (surname) “satoo” in the recorded manuscript created by the manuscript preparation unit 5 is picked up. The speech waveform database 12 already stores speech waveform data when a speaker speaks this character string. When the speech unit is a phoneme, the character string is represented by a label sequence s, a, t, o, o with the phoneme as a unit. If the speech unit is syllable, the labels are sa, to, o. Labeling refers to associating each speech unit of these label sequences with speech waveform data, and dividing speech waveform data into speech units. Please refer to FIG. 3 again when the speech unit is phoneme.
[0148]
FIG. 10 is a functional block diagram of the labeling unit 8. The labeling unit 8 includes a labeling process (means) 8a and a labeling error removal process (means) 8b. The labeling process 8a includes a statistical model creation process (means) 81, a voice unit boundary determination process (means) 82, and a label information generation process (means) 83. The labeling error removal process 8b includes a time information error correction process (means) 84, a time information comparison process (means) 85, and a label information invalidation process (means) 86.
[0149]
The voice unit boundary determination process 82 of the labeling process 8a reads the voice waveform data stored in the speech waveform database 12 and the recorded manuscript provided from the manuscript creating unit 5. The recorded manuscript is also given to the statistical model creation process 81. The following processing is performed for each character string (for example, “satoo”) included in the recorded manuscript.
[0150]
The statistical model creation process 81 uses statistical models prepared in advance (statistical models of acoustic features for each voice unit; for example, hidden Markov models). According to the label sequence corresponding to the character string, it creates a series of acoustic feature quantities corresponding to the speech waveform representing that label sequence. The voice unit boundary determination process 82 matches the created series against the acoustic feature quantity series of the voice waveform actually recorded for that character string (read from the speech waveform database 12), and thereby extracts the voice unit boundaries in the voice waveform.
[0151]
The extracted voice unit boundary information (time information) is paired with a label indicating the voice unit, and is given from the label information generation process 83 to the label information database 13. The label information lists pairs of a label representing a voice unit and the end time (time information) of that voice unit (taking the start time of the voice waveform data as 0) in the order of the character string (time order).
[0152]
Details of labeling are disclosed in Japanese Patent Laid-Open No. 10-49193. In addition to automatic labeling using HMM, an automatic labeling method using DP matching may be used.
[0153]
For voice units in the generated label information that have a high possibility of a labeling error, the labeling error removal process 8b either corrects the time information (end time correction; time information error correction) or assigns information that invalidates the voice unit itself in the database (label information invalidation). That is, error removal processing is roughly divided into two types: time information error correction based on correction rules, and label information invalidation based on the difference between a plurality of separately created sets of label information.
[0154]
In the time information error correction process 84, the time information of the label information is corrected according to a correction rule prepared in advance.
[0155]
To invalidate label information, the time information comparison process 85 compares the label information generated using the previous statistical model (for example, the HMM) and stored in the database 13 (referred to as the first label information) with second label information created using a different statistical model. Then, in the label information invalidation process 86, when the difference in time information exceeds a preset threshold value, invalidation information is assigned to the corresponding part of the first label information. Since label information to which invalidation information has been assigned is excluded from the subsequent feature extraction process, a labeling error, even if present, does not adversely affect the quality of the speech unit database.
[0156]
FIG. 18 is a flowchart showing the operation of the labeling error removal process 8b in the labeling unit 8.
[0157]
The labeling error removal process 8b reads label information about one character string (one surname) created and saved by the labeling process 8a from the label information database 13 (step S41). This label information is set as first label information.
[0158]
The number of labels included in the first label information is counted and this count is set in the variable “total number of labels”, the number of label correction rules is set in the variable “total number of rules”, and the variable “number of processed labels” is reset to 0 (step S42). The label correction rules will be described later. The variable “number of processed correction rules” is reset to 0 (step S44).
[0159]
The correction rules are applied in turn to the (number of processed labels + 1)-th label information (step S46). If the condition of a correction rule is not met, the label information is not updated. If it is met, the label information is updated according to the execution part of the correction rule (step S47).
[0160]
FIG. 19 (A) shows an example of label information. In the label information, s, a, t, o and o speech units (phonemes) and boundary information corresponding to these phonemes are listed. The boundary information is time information at the end time of each voice unit with the start time of the “satoo” voice waveform data set to zero.
[0161]
FIG. 19B shows an example of the correction rules. A correction rule is set for each voice unit. Each correction rule is expressed in the form “if (condition part), then (execution part)”, and the process described in the execution part is executed only when the condition described in the condition part is satisfied.
[0162]
A specific modification rule shown in FIG. 19B is applied to the label information shown in FIG.
[0163]
The duration of the third label “a” in FIG. 19A is 0.076 (seconds) (0.101−0.025 = 0.076). Since the condition part of the correction rule for label “a” in FIG. 19B is “if (duration <30)” (30 means 0.030 seconds), the duration of label “a” does not satisfy the condition part (NO in step S46). Therefore, the execution part of the correction rule is not executed.
[0164]
Since the duration of the fifth label “o” of the label information is 0.028 (0.191−0.163), the condition part (if (duration <40)) of the correction rule for the voice unit “o” is satisfied (YES in step S46). Therefore, the execution part “correction duration = duration × 1.5” is executed. Since the duration is 0.028, the corrected duration is 0.042 (= 0.028 × 1.5). The end point of the fifth label “o” is corrected to 0.205 (= 0.163, the end point of the immediately preceding voice unit, + 0.042). FIG. 19 (C) shows the label information after correction.
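The arithmetic of this example can be reproduced with a short sketch. The label list below is illustrative: the boundaries 0.025, 0.101, 0.163 and 0.191 follow the values quoted above, while the end time of “t” (0.120) and the execution part of the rule for “a” are assumptions, since they are not given in the text. Durations are computed from the original boundaries, which is also an assumption.

```python
# Label information as (label, end time in seconds); start of the waveform is 0.
LABELS = [("s", 0.025), ("a", 0.101), ("t", 0.120), ("o", 0.163), ("o", 0.191)]

# Rules as label -> (minimum duration in ms, stretch factor).
# "o" follows "if (duration < 40) then corrected duration = duration x 1.5";
# the factor for "a" is a placeholder.
RULES = {"a": (30, 1.5), "o": (40, 1.5)}


def apply_correction_rules(labels, rules=RULES):
    """Return label information with end times corrected by the rules."""
    corrected = []
    previous_end = 0.0
    for label, end in labels:
        duration = end - previous_end
        new_end = end
        rule = rules.get(label)
        if rule is not None:
            min_ms, factor = rule
            if duration * 1000 < min_ms:          # condition part
                new_end = previous_end + duration * factor  # execution part
        corrected.append((label, round(new_end, 3)))
        previous_end = end  # assumption: durations use the original boundaries
    return corrected
```

Applied to `LABELS`, only the final “o” (duration 0.028 s) meets its condition, and its end time becomes 0.205.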
[0165]
While adding 1 to the number of processed correction rules (step S48), all correction rules are applied to the (number of processed labels + 1) -th label information (repeated by step S45).
[0166]
When all correction rules have been applied to one label information, 1 is added to the number of processed labels (step S49), and the process returns to step S43. When the processing of steps S44 to S49 is completed for all labels of one character string (NO in step S43), the time information error correction processing is completed.
[0167]
Next, the process proceeds to label information invalidation processing.
[0168]
Using a statistical model different from the statistical model used when the first label information was created, automatic labeling is executed in the same manner as the creation of the first label information to create the second label information (step S50). An example of the created second label information is shown in FIG.
[0169]
The number of processed labels is returned to 0, and a label invalidation threshold is set (step S51). An example of the label invalidation threshold is shown in FIG.
[0170]
For the corrected first label information (FIG. 19 (C)) and the second label information (FIG. 20 (A)), the difference in time information of each corresponding label is calculated (step S53), and it is determined whether this difference exceeds the label invalidation threshold (step S54). An example of the difference for each label is shown in FIG. 20 (B). If a difference exceeds the threshold (YES in step S54), invalidation information is assigned to the corresponding first label information (step S55). For example, in FIG. 20 (B), the time information difference of the second label “a” is 0.014 (s), which is within the label invalidation threshold of 0.050 (s), so no invalidation information needs to be attached. On the other hand, the time information difference of the fifth label “o” is 0.051 (s), which exceeds the label invalidation threshold, so invalidation information is given to the fifth label “o”. Invalidation information is also automatically assigned to the label immediately after it (the sixth label “o”). In FIG. 20D, invalidation information x is attached to the fifth and sixth labels “o”. The above processing is repeated for all labels while adding 1 to the number of processed labels (steps S56 and S52). The first label information after the invalidation process is stored in the database 13 again.
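A minimal sketch of the comparison in steps S53 to S55 follows. It assumes that the two sets of label information contain the same labels in the same order. Marking the following label as well is taken from the text; a plausible reason, not stated there, is that the end time of one voice unit is also the start time of the next. The function name and the sample values are hypothetical.

```python
def invalidate_labels(first, second, threshold):
    """Return one flag per label of `first`; True means invalidated.

    first, second: lists of (label, end time in seconds) produced with two
    different statistical models for the same character string.
    """
    invalid = [False] * len(first)
    for i, ((_, t1), (_, t2)) in enumerate(zip(first, second)):
        if abs(t1 - t2) > threshold:       # step S54
            invalid[i] = True              # step S55
            if i + 1 < len(first):
                invalid[i + 1] = True      # the label immediately after it
    return invalid


# Illustrative values only (not the contents of FIG. 19 or FIG. 20).
first = [("s", 0.025), ("a", 0.101), ("t", 0.120), ("o", 0.163), ("o", 0.205)]
second = [("s", 0.027), ("a", 0.115), ("t", 0.125), ("o", 0.214), ("o", 0.210)]
flags = invalidate_labels(first, second, 0.050)
```

In this sample, the difference for “a” is 0.014 s and stays below the threshold, while the fourth label differs by 0.051 s, so it and the label after it are flagged.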
[0171]
FIG. 21 shows an example of index information (FIG. 21 (A)) included in the speech information database 15 and speech waveform data (FIG. 21 (B)) corresponding thereto.
[0172]
The feature quantity extraction unit 14 reads the label information stored in the label information database 13 and reads the corresponding speech waveform data from the speech waveform database 12. The feature quantity extraction unit 14 calculates feature quantities (length, height, size, etc.) for each voice unit from the label information and the corresponding voice waveform, and creates index information by listing the calculated feature quantities together with the label information. At this time, no feature quantity is calculated for a voice unit to which invalidation information has been added. Further, the feature quantity extraction unit 14 stores the index information and the voice waveform data as a pair in the voice information database 15 as voice information data.
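The following sketch shows one possible shape for the index information, assuming a waveform held as a list of samples. It computes only the length and an RMS value as a stand-in for size; height (fundamental frequency) is omitted because no estimation method is described here. The field names are hypothetical.

```python
import math


def build_index(labels, invalid, samples, sample_rate):
    """List per-unit entries; features stay None for invalidated units."""
    index = []
    start = 0.0
    for (label, end), bad in zip(labels, invalid):
        entry = {"label": label, "end": end, "invalid": bad,
                 "duration": None, "power": None}
        if not bad:
            # Slice the samples belonging to this voice unit.
            segment = samples[int(start * sample_rate):int(end * sample_rate)]
            entry["duration"] = end - start
            if segment:
                entry["power"] = math.sqrt(
                    sum(x * x for x in segment) / len(segment))
        index.append(entry)
        start = end  # the end time of one unit is the start of the next
    return index
```

Keeping invalidated units in the list with empty features, rather than dropping them, is a design choice of this sketch so that the order of the labels stays aligned with the waveform.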
[0173]
(3) Second embodiment
FIG. 22 is a functional block diagram showing the overall configuration of the second embodiment of the audio information data creation system. In this figure, the same components as those shown in FIG. The recording management unit 6A has a function of creating a standard voice that indicates to the speaker SP how the recorded manuscript should be read aloud. The standard voice is output from the speaker 18. The manuscript creating unit 5A has a function of creating, when an original manuscript is added, an additional recording manuscript to be added to the recorded manuscript already created. This additional recording manuscript is kept to a minimum. Audio information (index information) is given from the audio information database 15 to the manuscript creating unit 5A for creation of the additional recording manuscript. Unlike in the first embodiment, the labeling unit 8A has a function of removing errors in the created label information (particularly the time information) based on the result of a statistical analysis of the label information.
[0174]
FIG. 23 is a block diagram showing a functional configuration of the document creating section 5A. FIG. 24 is a flowchart showing an operation of creating an additional recording document by the document creating unit 5A. The process for creating an additional recording manuscript will be described below. It should be understood that the recorded document creating process is as described in the first embodiment, and the additional recorded document creating process is a function added to this.
[0175]
In the following description, it is assumed that an additional recording manuscript is created for the surname recorded manuscript already created in the first embodiment.
[0176]
It is assumed that the speech information database 15 already stores speech information obtained by the speaker SP reading a recorded manuscript about the last name and recording it. The voice information database analysis process (means) 54 reads the index information in the voice information about the last name from the database 15 and analyzes the index information to create a list of voice units included in the index information for each quality level. (FIG. 24, step S61). An example of the analysis result of the voice information database is shown in FIG. This is exactly the same as the original manuscript analysis result shown in FIG. 13B (although the arrangement order in the list of audio units is different).
[0177]
In the specification input unit 4, an input of a manuscript (hereinafter referred to as the “additional source manuscript”) listing character strings (including words, phrases, clauses, sentences, etc.) that are to be newly obtainable by synthesis, on the premise of the current voice information database 15, is received from the operator OP (step S62). The process waits until the input is completed (step S63). Only the text file name corresponding to the additional source manuscript may be input by the input device 42, and the content of the additional source manuscript may be read by the FD drive 41. Of course, the additional source manuscript may be input from the keyboard, or one stored in the original database 11 may be used.
[0178]
An example of the original document for addition is shown in FIG. This additional source manuscript is a place name list.
[0179]
When the additional source manuscript is set in the original document setting process 51A, the original document analysis process 52A decomposes all character strings included in the additional source manuscript into labels (voice units) for each quality level, counts them, and creates a voice unit list (step S64). This is the additional source manuscript analysis result; a specific example for the additional source manuscript shown in FIG. 25 (B) is shown in FIG. 25 (C).
[0180]
The analysis result comparison process 55A compares the additional source manuscript analysis result obtained by the original document analysis process 52A with the voice information database analysis result obtained by the voice information database analysis process 54, and extracts, for each quality level, the voice units that exist in the additional source manuscript analysis result (FIG. 25B) but do not exist in the voice information database analysis result (FIG. 25A). An example of the difference extraction result is shown in FIG.
[0181]
For each voice unit included in the difference extraction result, the character string selection process (means) 53A adds a character string including that voice unit to the recorded manuscript. The quality levels are processed one at a time, in order. When the additional recording manuscript covers all voice units of a quality level, the processing of that quality level is finished and the processing of the next quality level is started. In the example shown in FIG. 25 (D), since no voice unit is extracted as a difference at quality level 1, processing starts from quality level 2. In the processing at quality level 2, “Kyoto” is added, and in the processing at quality level 3, “Nara” is further added. In the end, the two character strings “Kyoto” and “Nara” are added to the additional recording manuscript. All the place names in the place name list of FIG. 25 (A) can be synthesized with high quality simply by adding the voices of these two character strings to the voice information database.
[0182]
In this way, the analysis result of the additional source manuscript is compared with the analysis result of the index information, the voice units that appear in the additional source manuscript but are not included in the index information (missing voice units) are extracted, and character strings including the missing voice units are added to the recorded manuscript. There is therefore no need to recreate the recorded manuscript from the beginning.
[0183]
In FIG. 24, the flow of operations of the analysis result comparison process 55A and the character string selection process 53A is as follows.
[0184]
With reference to the audio information database analysis result and the additional source manuscript analysis result, all the audio units that exist in the additional source manuscript and do not exist in the audio information database are enumerated to form an audio unit list. Further, all character strings included in the original document to be added are added to the character string list (step S65).
[0185]
If a voice unit remains in the voice unit list (YES in step S66), one voice unit having the smallest number of appearances is selected from the voice unit list, and a character string including the voice unit is further selected from the character string list. Only one is selected and the character string is added to the additional recording manuscript (step S67).
[0186]
Of the speech units included in the character string added to the additional recording manuscript, all the speech units remaining in the speech unit list are deleted from the speech unit list (step S68). Further, the character string added to the additional recording manuscript is deleted from the character string list (step S69).
[0187]
Steps S67 to S69 are repeated until the voice unit list becomes empty. Thereby, the additional recording manuscript preparation is completed.
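The flow of steps S65 to S69 can be sketched as a greedy selection. The following Python fragment is only an illustration under stated assumptions: the function name `build_additional_manuscript` and the callable `units_of` (which splits a character string into voice units) are hypothetical and do not appear in the specification, and ties between equally rare voice units are broken alphabetically for reproducibility.

```python
from collections import Counter


def build_additional_manuscript(existing_units, source_strings, units_of):
    """Greedy sketch of steps S65 to S69.

    existing_units: set of voice units already in the voice information database.
    source_strings: character strings of the additional source manuscript.
    units_of: hypothetical callable that maps a string to its voice units.
    """
    # Step S65: analyse every string, count appearances, list missing units.
    unit_lists = {s: list(units_of(s)) for s in source_strings}
    counts = Counter(u for units in unit_lists.values() for u in units)
    missing = {u for u in counts if u not in existing_units}
    candidates = list(unit_lists)
    manuscript = []

    # Step S66: continue while a missing voice unit remains.
    while missing:
        # Step S67: pick the rarest missing unit, then one string containing it.
        rarest = min(missing, key=lambda u: (counts[u], u))
        chosen = next(s for s in candidates if rarest in unit_lists[s])
        manuscript.append(chosen)
        # Step S68: remove every unit that the chosen string covers.
        missing -= set(unit_lists[chosen])
        # Step S69: remove the chosen string from the character string list.
        candidates.remove(chosen)
    return manuscript


# Toy usage: each character is treated as one "voice unit" (an assumption).
print(build_additional_manuscript(set("abc"), ["abd", "cab", "bce", "de"], list))
```

In the toy call, the units "d" and "e" are missing from the existing set, so one string containing each is selected and the remaining strings are left out.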
[0188]
Needless to say, if a constraint is placed on the database capacity or on the database preparation time, this additional recording manuscript preparation process takes that constraint into account.
[0189]
FIG. 26 is a block diagram showing the configuration of the recording manager 6A.
[0190]
The recording manager 6A is obtained by further adding a voice synthesis process (means) 6e to the recording manager 6 of the first embodiment described above.
[0191]
The voice synthesis process 6e creates a synthesized voice of a character string in the recording manuscript read from the manuscript preparation unit 3A via the recording management process 6d. That is, based on information on the correct reading (accent position, pauses, intonation, etc.) of each character string of the recording manuscript, or on the history of recorded voice analysis results held by the speaker management means 6c, the voice synthesis process 6e creates a synthesized voice that reads out the character string at a loudness, pitch, and speed appropriate for the speaker. The synthesized voice created by the voice synthesis process 6e (which may instead be a recorded voice prepared in advance) is output as a standard voice from the voice output device 18, such as a loudspeaker. The speaker SP can thus hear the character string to be uttered and refer to the pitch, loudness, and speed of the voice to be uttered. This helps prevent reading errors and improves the quality of the recorded voice (voice information data).
[0192]
FIG. 27 is a flowchart showing recording management processing by the recording management unit 6A. The same processes as those shown in FIG. Further, FIG. 16 can be applied as it is.
[0193]
The voice synthesis process 6e sets prosodic feature parameters for the (number of recorded items + 1)-th character string input from the recording management process 6d, based on target values for appropriate voice pitch, loudness, speed, and intonation, or on the analysis results of the voices recorded so far (step S36). The voice synthesis process 6e then creates a synthesized voice of the (number of recorded items + 1)-th character string using the set parameters and outputs it as a standard voice to the voice output device (loudspeaker) 18 (step S37). Thus, the character string is not only displayed on the display screen (step S25), but its standard voice is also output.
[0194]
FIG. 28 is a block diagram showing a functional configuration of the labeling unit 8A. Compared with FIG. 10, a labeling error removal process (means) 8c and a label information statistical analysis process (means) 8d are provided in place of the labeling error removal process 8b. The labeling error removal process 8c includes a label information reliability confirmation process (means) 87 and a label information invalidation process (means) 86. The label information statistical analysis process 8d includes a confidence interval calculation process (means) 88 and a statistical analysis process (means) 89.
[0195]
The label information statistical analysis process 8d statistically analyzes existing label information (label information in the label information database 13), calculates a confidence interval for the duration of each voice unit from the mean value and standard deviation of its duration, and creates confidence interval information. Because voice characteristics differ from speaker to speaker and the confidence interval of the duration often differs accordingly, the existing label information to be analyzed is desirably label information of the same speaker as the label information from which labeling errors are about to be removed.
[0196]
The labeling error removal process 8c refers to the confidence interval information obtained by the label information statistical analysis process 8d and checks whether the duration of each voice unit included in the label information subject to error removal falls within the confidence interval for that voice unit. The labeling error removal process 8c may attach invalidation information to label information that does not fall within the confidence interval, and may further correct the label time information so that it falls within the confidence interval. Specifically, for each voice unit included in the label information generated by the labeling process 8a, if the duration of the voice unit is outside the confidence interval of the duration calculated for that voice unit by the label information statistical analysis process 8d, the labeling error removal process 8c judges that the labeling reliability is low (that is, that a labeling error is likely) and invalidates the label. Label information judged to be statistically unreliable can thus be invalidated automatically, which improves the quality of the labeling result.
[0197]
FIG. 29 is a flowchart showing a procedure of labeling error removal processing by the label information statistical analysis processing 8d and labeling error removal processing 8c of the labeling unit 8A.
[0198]
The statistical analysis process 89 of the label information statistical analysis process 8d reads, from the label information created by the labeling process 8a and stored in the label information database 13, a group of label information, preferably one obtained from speech waveforms recorded by the same speaker SP (step S71).
[0199]
The statistical analysis process 89 calculates the average value and standard deviation of the duration for each voice unit, and counts the number of voice units that appear in the label information (statistical analysis of label information) (step S72).
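The statistical analysis of step S72 can be illustrated as a per-unit aggregation. The sketch below is an assumption-laden example rather than the method of the specification: the function name `duration_statistics` and the input layout (pairs of voice unit and duration) are hypothetical, and the population standard deviation is used because the text does not say whether a population or sample formula is intended.

```python
from collections import defaultdict
from statistics import mean, pstdev


def duration_statistics(labels):
    """Return {voice_unit: (mean, standard deviation, count)}.

    labels: iterable of (voice_unit, duration) pairs; the unit of time
    (seconds or milliseconds) is whatever the caller supplies.
    """
    durations = defaultdict(list)
    for unit, duration in labels:
        durations[unit].append(duration)
    # pstdev is an assumption; it also handles units that appear only once.
    return {u: (mean(d), pstdev(d), len(d)) for u, d in durations.items()}


# Toy data in milliseconds (invented values, not taken from FIG. 30).
stats = duration_statistics([("a", 100), ("a", 300), ("o", 200)])
print(stats["a"])
```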
[0200]
FIG. 30 (A) shows an example of label information read into the statistical analysis processing 89. FIG. 30 (B) shows an example of the result of statistical analysis by the statistical analysis processing 89.
[0201]
The confidence interval calculation processing 88 calculates the confidence interval of the duration for each voice unit based on the statistical analysis result by the statistical analysis processing 89 by the following calculation formula (step S73).
[0202]
Confidence interval = mean value ± Z × √( (standard deviation)² / (number of appearances) )   (Formula 1)
[0203]
Here, Z is a constant based on a normal distribution.
[0204]
FIG. 30 (C) shows an example of the confidence interval of the duration for each speech unit calculated from the above calculation formula.
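Formula 1 translates directly into a few lines of code. In the following sketch the function name `confidence_interval` is hypothetical, and the default Z = 1.96 (the usual two-sided 95% value for a normal distribution) is an assumption; the specification only states that Z is a constant based on a normal distribution.

```python
import math


def confidence_interval(mean_value, std_dev, count, z=1.96):
    """Formula 1: mean value ± Z * sqrt(std_dev**2 / count).

    z defaults to 1.96, an assumed value not given in the specification.
    """
    half_width = z * math.sqrt(std_dev ** 2 / count)
    return mean_value - half_width, mean_value + half_width


# Invented numbers: mean 100 ms, standard deviation 20 ms, 4 appearances, Z = 2.
print(confidence_interval(100.0, 20.0, 4, z=2.0))
```

With these invented inputs the half width is 2 × √(400 / 4) = 20, so the interval runs from 80 to 120.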
[0205]
The confidence interval data obtained in this way is given to the label information reliability confirmation processing 87. The label information reliability confirmation processing 87 also reads from the label information database 13 the same label information that the statistical analysis processing 89 acquired (this is called the error removal target label information).
[0206]
The label information reliability confirmation processing 87 counts the number of labels included in the error removal target label information and sets it to the variable “total number of labels”. Further, the “number of processed labels” is set to 0 (step S74).
[0207]
The duration of the voice unit corresponding to the (number of processed labels + 1)-th label is calculated (step S76). The duration is the difference between the time information indicating the end point of that voice unit and the time information indicating the end point of the immediately preceding voice unit.
[0208]
If the duration of the voice unit corresponding to the (number of processed labels + 1)-th label does not fall within the confidence interval for that voice unit, the label information invalidation processing 86 attaches invalidation information to the (number of processed labels + 1)-th label (step S78). FIG. 30 (D) shows an example of label information after the invalidation information is attached. According to FIG. 30 (C), the confidence interval for the duration of the voice unit “o” is 46.8 to 115.2 ms. In FIG. 30 (D), the durations of the fifth and sixth voice units (labels) “o” are 0.191 s and 0.312 s (191 ms and 312 ms), respectively, and do not fall within the confidence interval. Accordingly, invalidation information (indicated by a cross) is attached to both of these “o” labels. Since the durations of the other labels s, a, and t are within their corresponding confidence intervals, no invalidation information is attached to them.
[0209]
The number of processed labels is incremented by 1 (step S79), and the process returns to step S76 via step S75. The processes of steps S76 to S78 are repeated until the number of processed labels becomes equal to the total number of labels (step S75).
[0210]
As described above, when the labeling error removal processing is completed, the processed label information is stored in the label information database 13 again.
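The loop of steps S74 to S79 can be sketched as follows. This is a minimal illustration, not the implementation of the specification: the function name `invalidate_outliers`, the tuple layout, and the assumption that the first voice unit starts at time 0 are all hypothetical. Only the “o” interval (46.8 to 115.2 ms) comes from the description of FIG. 30 (C); the other intervals and the end times are invented.

```python
def invalidate_outliers(labels, intervals, start_time=0.0):
    """Flag labels whose duration lies outside the confidence interval.

    labels: list of (voice_unit, end_time_seconds) in time order.
    intervals: {voice_unit: (lower_seconds, upper_seconds)}.
    start_time: assumed start of the first unit (not in the specification).
    Returns a list of (voice_unit, end_time, invalid) tuples.
    """
    total_labels = len(labels)   # step S74: count the labels
    processed = 0                # step S74: reset the counter
    previous_end = start_time
    result = []
    while processed < total_labels:        # step S75: all labels done?
        unit, end_time = labels[processed]
        duration = end_time - previous_end  # step S76: end minus previous end
        lower, upper = intervals[unit]
        invalid = not (lower <= duration <= upper)
        result.append((unit, end_time, invalid))  # step S78 when invalid
        previous_end = end_time
        processed += 1                      # step S79
    return result


intervals = {"s": (0.05, 0.15), "o": (0.0468, 0.1152), "a": (0.05, 0.15)}
labels = [("s", 0.10), ("o", 0.291), ("a", 0.391)]
for row in invalidate_outliers(labels, intervals):
    print(row)
```

Here the “o” label lasts about 0.191 s, which is outside 46.8 to 115.2 ms, so only that label is flagged.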
[Brief description of the drawings]
FIG. 1 shows phoneme, syllable, and word boundaries in association with a speech waveform.
FIG. 2 shows a speech unit in association with a speech waveform in consideration of a phonological environment.
FIG. 3 shows a relationship between a speech waveform and label information.
FIG. 4 shows how waveform-connected speech synthesis is performed using waveform information.
FIG. 5 shows the flow of a waveform connection type speech synthesis process.
FIG. 6 is a block diagram showing a hardware configuration of a voice information database creation system.
FIG. 7 is a block diagram showing an overall configuration of a voice unit data creation system in the first embodiment.
FIG. 8 is a block diagram illustrating a functional configuration of a document creating unit.
FIG. 9 is a block diagram illustrating a functional configuration of a recording management unit.
FIG. 10 is a block diagram showing a functional configuration of a labeling unit.
FIG. 11 is a flowchart showing a recorded document creation process by a document creation unit.
FIG. 12 shows a specification input display screen.
FIG. 13A shows an example of an original document. (B) shows an example of the original manuscript analysis result.
FIG. 14A shows an example of a recorded manuscript after level 1 processing. (B) shows an example of a recorded manuscript after level 2 processing. (C) shows an example of a recorded manuscript after level 3 processing.
FIG. 15 is a flowchart showing recording processing by a recording management unit.
FIG. 16 is a flowchart showing recording processing by a recording management unit.
FIG. 17 shows a recording screen.
FIG. 18 is a flowchart showing a labeling error removal process by a labeling unit.
FIG. 19A shows first label information. (B) shows the revised rules. (C) shows the corrected first label information.
FIG. 20A shows second label information. (B) shows label difference information. (C) shows the invalidation threshold. (D) shows the first label information after the invalidation information is given.
FIG. 21A shows index information. (B) shows speech waveform data.
FIG. 22 is a block diagram showing an overall configuration of an audio unit data creation system in a second embodiment.
FIG. 23 is a block diagram illustrating a functional configuration of a document creating unit according to the second embodiment.
FIG. 24 is a flowchart showing an additional recording document creation process by a document creation unit in the second embodiment.
FIG. 25A shows an example of a speech information database analysis result. (B) shows an example of an additional source document. (C) shows an example of the analysis results of the additional source document. (D) shows an example of the difference extraction result. (E) shows an example of an additional source document after quality level 2 processing. (F) shows an example of an additional source document after quality level 3 processing.
FIG. 26 is a block diagram showing a functional configuration of a recording manager in the second embodiment.
FIG. 27 is a flowchart showing recording processing by a recording management unit in the second embodiment.
FIG. 28 is a block diagram showing a functional configuration of a labeling unit in the second embodiment.
FIG. 29 is a flowchart showing a labeling error removal process by a labeling unit in the second embodiment.
FIG. 30A shows an example of label information. (B) shows an example of statistical analysis results. (C) shows an example of confidence interval information. (D) shows an example of label information after the invalidation information is given.
[Explanation of symbols]
4 Specification input section
5,5A Manuscript preparation part
5a Manuscript creation process
5b Voice information database analysis processing
6,6A Recording Manager
6a Speaker management process
6b Voice analysis processing
6c Voice acceptance / rejection determination process
6d Recording management process
6e Speech synthesis processing
7 Recording section
8,8A Labeling part
8a Labeling process
8b, 8c Labeling error removal processing
8d Label information statistical analysis processing
9 Display device
10 Voice input device
11 Manuscript database
12 Speech waveform database
13 Label information database
14 Feature extraction unit
15 Voice information database
16 Output device
17 Recording media

Claims

A recording manuscript preparation apparatus comprising:
Means for setting an original manuscript containing a plurality of character strings;
Original manuscript analysis means for extracting all voice units constituting the character strings included in the original manuscript and detecting, for each of those voice units, the number of appearances in the original manuscript; and
Character string selection means for creating a recording manuscript, based on the number of appearances detected by the original manuscript analysis means, by selecting character strings from the original manuscript in order from a character string containing a voice unit with a small number of appearances until all of the extracted voice units are covered.

The recording manuscript preparation apparatus according to claim 1, further comprising:
Voice information database analysis means for extracting all first voice units included in an existing voice information database;
Additional source manuscript analysis means for extracting all second voice units constituting the character strings included in an additional source manuscript;
Comparison means for detecting, among the second voice units, voice units not included in the first voice units; and
Second character string selection means for creating an additional recording manuscript by selecting, from the additional source manuscript, character strings containing the voice units detected by the comparison means.

A method for controlling a recording manuscript preparation apparatus comprising original manuscript analysis means and first character string selection means, wherein
The original manuscript analysis means extracts all voice units constituting the character strings included in an original manuscript and detects, for each of those voice units, the number of appearances in the original manuscript, and
The first character string selection means creates a recording manuscript by selecting character strings from the original manuscript, in order from a character string containing a voice unit with a small number of appearances, until all of the extracted voice units are covered.

The method for controlling a recording manuscript preparation apparatus according to claim 3, wherein
The recording manuscript preparation apparatus further comprises voice information database analysis means, additional source manuscript analysis means, comparison means, and second character string selection means,
The voice information database analysis means extracts all first voice units included in an existing voice information database,
The additional source manuscript analysis means extracts all second voice units constituting the character strings included in an additional source manuscript,
The comparison means detects, among the second voice units, voice units not included in the first voice units, and
The second character string selection means creates an additional recording manuscript by selecting, from the additional source manuscript, character strings containing the detected voice units.

A program for controlling a recording manuscript preparation apparatus provided with original manuscript analysis means and character string selection means, the program performing control such that
The original manuscript analysis means extracts all voice units constituting the character strings included in an original manuscript and detects, for each of those voice units, the number of appearances in the original manuscript, and
The character string selection means creates a recording manuscript by selecting character strings from the original manuscript, in order from a character string containing a voice unit with a small number of appearances, until all of the extracted voice units are covered.