JP3958908B2 - Transcription text automatic generation device, speech recognition device, and recording medium

Info

Publication number: JP3958908B2
Application number: JP35033699A
Authority: JP
Inventors: 庄衛佐藤; 寛之世木; 和穂尾上; 亨今井; 英輝田中; 彰男安藤
Original assignee: Japan Broadcasting Corp
Current assignee: Japan Broadcasting Corp
Priority date: 1999-12-09
Filing date: 1999-12-09
Publication date: 2007-08-15
Anticipated expiration: 2019-12-09
Also published as: JP2001166790A

Description

【０００１】
【発明の属する技術分野】
本発明は、音響モデルの学習に使用する書き起こしテキストを自動生成する書き起こしテキスト自動生成装置、音声データ、対応する書き起こしテキストから学習される音響モデルを使用して音声認識を行う音声認識装置および書き起こしテキストを自動生成するためのプログラムを記録した記録媒体に関する。
【０００２】
【従来の技術】
音声認識システムにおいて使用される音響モデルは入力された音声がどの程度候補音素らしいかを算出するための統計モデルであり、（NHK技研R&D No.５７，１９９９．８ｐｐ３６を参照）モデルのパラメータは大量の音声データとその書き起こしテキストから学習される。このかな漢字混じり書き起こしデータおよび音素表記書き起こしテキストデータの作成は人手に頼っている。従来では書き起こし作業担当者の作業量に限界があることから認識単位ごとに区切り記号が挿入された音素表記書き起こしテキストデータの入手が困難であり、区切り記号が挿入されないまま音響モデルを作成する場合が多かった。
【０００３】
また、一部区切り記号が挿入された書き起こしテキストデータを入手できたとしても適用しようとする音声認識システムで扱う認識単位と同じ認識単位が書き起こしテキストで用いられるとは限らず、音響モデル学習データと認識システムが扱うデータに不一致があるまま書き起こしテキストが使用されてきた。
【０００４】
【発明が解決しようとする課題】
音声認識システムを構成する際に、音声認識デコーダが取り扱う、音声認識単位と言語モデルが扱う音声認識単位および音響モデルを作成する際に取り扱う音声認識単位を統一することは認識精度を確保する上で不可欠である。
【０００５】
従来のシステムでは、上述した理由により最適な位置に区切り記号が挿入されていない音素表記書き起こしテキストを使用しなければならない状況が発生する。
【０００６】
ここで、一般的に用いられている形態素を認識単位とし、形態素中の連続する３音素を考慮し、形態素にまたがる３音素は考慮しない音声認識システムを例とする。
【０００７】
すなわち、このシステムでは、形態素間に使用される音響モデルは後続の音素に依存しない形態素中の最初の２音素を考慮する。
【０００８】
上記のモデルを学習するために、従来技術では音素書き起こしテキスト中のすべての３連続音素を使用した音響モデルと、すべての２連続音素を使用した音響モデルを別々に作成し、使用しなければならない。
【０００９】
しかも、この２連続音素の音響モデルが形態素間に出現する２連続音素という条件を考慮していない場合には、十分な精度を確保することはできない。
【００１０】
このため、音響モデルの精度確保のためには、形態素間にまたがる部分は２音素のみ考慮したモデルとして学習し、それ以外の部分は３音素を考慮したモデルが必要となる。このモデル作成のためには形態素区切り記号を挿入した音素書き起こしテキストが必要であり、このテキスト作成には人手を多く介しない方法が望まれている。
【００１１】
また、上記以外の例、形態素にまたがる３音素を考慮するシステムにおいても短い無音、息継ぎ部など２連続音素を考慮しなければならない位置に区切り記号が挿入された書き起こしテキストが必要であるという点では上述と同様である。
【００１２】
そこで、本発明は、上述の点に鑑みて、音響モデル学習のための書き起こしテキストを自動生成することができる書き起こしテキスト自動生成装置、それを搭載した音声認識装置および記録媒体を提供することにある。
【００１３】
【課題を解決するための手段】
このような目的を達成するために、請求項１の発明は、漢字かな混じり書き起こしテキストデータおよび該漢字かな混じりテキストデータに対応する音素表記を記載した音素表記書き起こしテキストデータを入力する入力手段と、
当該入力された漢字かな混じりテキストデータを所定の認識単位毎に分割する第１の情報処理手段と、文字列およびその読みが記載されており、１つの文字列について複数の読みが許容される読み辞書を記憶した記憶手段と、前記第１の情報処理手段により分割されたテキストデータの分割位置と合致させたまま、前記読み辞書に基づき、前記分割された漢字かな混じりテキストデータ中の文字列を読みに変換し、同一の文字列について複数の読みが存在する場合には、当該複数の読みの中の最適候補を前記入力手段から入力された音素表記書き起こしテキストデータに基づき決定し、当該読みに変換された文字列の前記分割位置に区切り記号を挿入して区切り記号挿入済み音素表記テキストデータを生成する第２の情報処理手段とを具えたことを特徴とする。
【００１４】
請求項２の発明は、請求項１に記載の書き起こしテキスト自動生成装置において、前記所定の認識単位は形態素であり、前記第１の情報処理手段は形態素解析部であることを特徴とする。
【００１５】
請求項３の発明は、請求項１に記載の書き起こしテキスト自動生成装置において、前記第２の情報処理手段は、変換対象の同一の文字列について複数の読みが存在する場合には、ＤＰマッチング処理により最適候補を決定し、前記ＤＰマッチング処理のために前記読みに変換された文字列の中に区切り記号を付したリファレンスパターンを作成し、また、前記音素表記書き起こしテキストデータから所定長さの文字列を抽出してテストパターンを作成し、前記リファレンスパターンと前記テストパターンとをＤＰマッチング処理することを特徴とする。
【００１６】
請求項４の発明は、漢字かな混じり書き起こしテキストデータおよび該漢字かな混じりテキストデータに対応する音素表記を記載した音素表記書き起こしテキストデータを入力する入力手段と、当該入力された漢字かな混じりテキストデータを所定の認識単位毎に分割する第１の情報処理手段と、文字列およびその読みが記載されており、１つの文字列について複数の読みが許容される読み辞書を記憶した記憶手段と、前記第１の情報処理手段により分割されたテキストデータの分割位置と合致させたまま、前記読み辞書に基づき、前記分割された漢字かな混じりテキストデータ中の文字列を読みに変換し、同一の文字列について複数の読みが存在する場合には、当該複数の読みの中の最適候補を前記入力手段から入力された音素表記書き起こしテキストデータに基づき決定すると共に、当該読みに変換された文字列の分割位置と対応させて前記入力手段から入力された音素表記書き起こしテキストデータの分割位置に区切り記号を挿入して区切り記号挿入済み音素表記テキストデータを生成する第２の情報処理手段とを具えたことを特徴とする。
【００１７】
請求項５の発明は、請求項４に記載の書き起こしテキスト自動生成装置において、前記第２の情報処理手段は、前記ＤＰマッチング処理のために前記読みに変換された文字列からなるリファレンスパターンを作成し、また、前記音素表記書き起こしテキストデータから所定長さの文字列を抽出してテストパターンを作成し、前記リファレンスパターンと前記テストパターンとをＤＰマッチング処理すると共に、前記テストパターン上の、ＤＰマッチング処理結果により示される区切れ位置に区切り記号を挿入することを特徴とする。
【００１８】
請求項６の発明は、請求項１または請求項４に記載の書き起こしテキスト自動生成装置において、音響モデルを記憶した音響モデル記憶手段と、学習用音声データを入力する音声データ入力手段と、前記第２の情報処理手段により生成された区切り記号挿入済み音素表記テキストデータとを使用して前記音声データ入力手段から入力された学習用音声データをデコードするデコード手段と、当該デコード結果に基づき、通過時間がゼロである区切り記号を検出し、前記区切り記号挿入済み音素表記テキストデータから検出した区切り記号を削除した書き起こしテキストデータを作成する第３の情報処理手段とをさらに具えたことを特徴とする。
【００１９】
請求項８の発明は、漢字かな混じり書き起こしテキストデータを入力する入力手段と、当該入力された漢字かな混じりテキストデータを所定の認識単位毎に分割する第１の情報処理手段と、文字列およびその読みが記載された読み辞書を記憶した記憶手段と、前記第１の情報処理手段により分割されたテキストデータを、前記読み辞書に基づき、読みに変換し、前記第１の情報処理手段により決定された分割位置に区切り記号を挿入する第２の情報処理手段とを具えたことを特徴とする。
【００２０】
請求項９の発明は、請求項１、請求項４および請求項８のいずれかに記載の書き起こしテキスト自動生成装置と、学習用音声データを入力する音声データ入力手段と、前記自動生成装置により生成された書き起こしテキストと前記音声データ入力手段から入力された学習用音声データに基づき音響モデルを作成する音響モデル作成手段とを具え、当該作成された音響モデルを使用して音声認識を行うことを特徴とする。
【００２１】
請求項１０の発明は、文字列およびその読みが記載されており、１つの文字列について複数の読みが許容される読み辞書を記憶した記憶手段を有する書き起こしテキスト自動生成装置で実行されるプログラムを記録した記録媒体において、前記プログラムは、漢字かな混じり書き起こしテキストデータおよび該漢字かな混じりテキストデータに対応する音素表記を記載した音素表記書き起こしテキストデータを入力する入力ステップと、当該入力された漢字かな混じりテキストデータを所定の認識単位毎に分割する第１の情報処理ステップと、前記第１の情報処理ステップにより分割されたテキストデータの分割位置と合致させたまま、前記読み辞書に基づき、前記分割された漢字かな混じりテキストデータ中の文字列を読みに変換し、同一の文字列について複数の読みが存在する場合には、当該複数の読みの中の最適候補を前記入力ステップから入力された音素表記書き起こしテキストデータに基づき決定し、当該読みに変換された文字列の前記分割位置に区切り記号を挿入して区切り記号挿入済み音素表記テキストデータを生成する第２の情報処理ステップとを具えたことを特徴とする。
【００２２】
請求項１１の発明は、文字列およびその読みが記載されており、１つの文字列について複数の読みが許容される読み辞書を記憶した記憶手段を有する書き起こしテキスト自動生成装置で実行されるプログラムを記録した記録媒体において、前記プログラムは、漢字かな混じり書き起こしテキストデータおよび該漢字かな混じりテキストデータに対応する音素表記を記載した音素表記書き起こしテキストデータを入力する入力ステップと、当該入力された漢字かな混じりテキストデータを所定の認識単位毎に分割する第１の情報処理ステップと、前記第１の情報処理ステップにより分割されたテキストデータの分割位置と合致させたまま、前記読み辞書に基づき、前記分割された漢字かな混じりテキストデータ中の文字列を読みに変換し、同一の文字列について複数の読みが存在する場合には、当該複数の読みの中の最適候補を前記入力ステップから入力された音素表記書き起こしテキストデータに基づき決定すると共に、当該読みに変換された文字列の分割位置と対応させて前記入力ステップから入力された音素表記書き起こしテキストデータの分割位置に区切り記号を挿入して区切り記号挿入済み音素表記テキストデータを生成する第２の情報処理ステップとを具えたことを特徴とする。
【００２３】
【発明の実施の形態】
以下、図面を参照して本発明の実施形態を詳細に説明する。
【００２４】
以下に述べる実施形態の認識単位は形態素であり、使用する音素モデルは形態素中では連続する３音素を考慮したモデルであり、形態素間に使用されるモデルは後続もしくは先行する音素に依存しない２音素を考慮したモデルである。また、区切り記号に対応するモデルとしてはジャンプ可能な短い無音を意味するＳＰの表記を使用する。
【００２５】
図１は本発明第１の実施形態の音声認識システムの機能構成を示す。
【００２６】
なお、図１の構成は音声認識システムの音響モデル作成部の構成を示す。音声認識システムの他の部分の構成は従来と同様であり、たとえば、ＮＨＫ技研Ｒ＆ＤＮｏ．５７１９９９．８ｐｐ３６に記載された音声認識システムを使用することができる。
【００２７】
図１において、１０１は学習用音声データに対応したかな漢字文字列を記載した漢字混じり書き起こしデータである。１０２は区切り記号挿入済み音素表記テキストデータを自動生成するために使用する形態素の読み辞書であり、形態素の文字列の表記とそれに対応する１以上の読み（音素表記）を１組のデータセットとし、複数組のデータセットが読み辞書１０２に記載されている。
【００２８】
１０３は音素表記書き起こしテキストデータであり、かな混じり書き起こしテキストデータの記載内容に対応する音素表記が記載されている。１０４は音響モデルの学習に使用する学習用音声データであり、漢字混じり書き起こしテキストデータに記載された文字列を発声したときに得られる学習用音声データである。
【００２９】
１０５は漢字かな混じり書き起こしテキストデータに記載されている文字列を解析し、上記文字列を形態素単位に分割する形態素解析部である。形態素解析部１０５は周知であり、本実施形態では、日本語解析プログラム“茶筌”ｖｅｒｓｉｏｎ１．５を実行するＣＰＵを形態素解析部１０５として使用する。
【００３０】
１０６は形態素解析部１０５により形態素ごとに分割された漢字混じりテキストデータ（以下、分割済みテキストデータと称す）である。１０７は分割済みテキストデータ１０６、読み辞書１０２および音素表記書き起こしテキストデータ１０３を使用して区切り記号挿入済み音素表記テキストデータ１０８を自動生成するパターンマッチング部である。パターンマッチング部１０７は図２〜図４に示す処理を実行するＣＰＵを使用することができる。
【００３１】
１０８は形態素単位で区切り記号が挿入された音素表記テキストデータ（以下、区切り記号挿入済み音素表記テキストデータと称す）である。１０９は区切り記号挿入済み音素表記テキストデータ１０８と学習用音声データ１０４を使用して音響モデル１１０を作成する音響モデル学習部である。音響モデル学習部１０９としてはＨＴＫ（ＨｉｄｄｅｎＭａｒｋｏｖＭｏｄｅｌＴｏｏｌｋｉｔ，ｈｔｔｐ：／／ｗｗｗ．ｅｎｔｒｏｐｉｃ．ｃｏｍ／ｈｔｋ／ｈｔｍｌ）を搭載し実行するＣＰＵを使用することができる。
【００３２】
図２は図１のパターンマッチング部１０７の詳細を示す。なお、図１と同様の個所には同一の符号を付しており、詳細な説明を省略する。
【００３３】
図２において、２０１はリファレンスパターン作成部であり、リファレンスパターン２０２を作成する。
【００３４】
２０２はリファレンスパターンであり、分割済みテキストデータ１０６に示される形態素の分割位置に対応させて、読み辞書１０２に記載された１以上の読みを並べたデータである。分割位置には区切り記号が挿入される。
【００３５】
リファレンスパターンの一作成方法を紹介しておく。
【００３６】
分割済みテキストデータから最初の区切り記号までの文字列、この場合、「それでは」（句読点を除く）を抽出し、「それでは」に対応する表記を持つデータセットを読み辞書１０２の中で検索する。検索の結果得られるデータセットの中から表記に対応する読み「ｓｏｒｅｄｅｗａ」が取り出され、リファレンスパターンに挿入される。次に、分割済みテキストデータ１０６から２番目の分割位置の形態素の表記「今日」が取り出される。読み辞書１０２により「今日」と同じ表記を有するデータセットが得られると、そのデータセットの中から対応する読み、この場合、「ｋｙｏ」と「ｋｏＮｎｉｃｈｉ」が得られる。
【００３７】
そこで、リファレンスパターンの文字列の最後の文字の次に区切り記号を付した後、２番目に得られた２つの表記を追加する。このとき、１つの表記に対して複数の読み候補があることを示すために{}で２組の文字列をくくり、「，」記号で２組の文字列の区切りを表す。
【００３８】
このようにして、分割済みテキストデータ１０６に記載された形態素毎の文字列表記に対応する読みを読み辞書１０２に基づき取得し、区切り記号を付してリファレンスパターンを作成する。分割済みテキストデータに記載されている表記の文字列全てについて、読み辞書から読みを示す表記を取り出し、区切り記号を付して並べることによりリファレンスパターン２０２を作成することができる。
【００３９】
２０３は音素書き起こしテキストデータからＤＰマッチングのために所定長さ単位で取り出したテストパターンである。
【００４０】
２０４はＤＰ（動的計画法(ＤｙｎａｍｉｃＰｒｏｇｒａｍｉｎｇ)）アルゴリズム、たとえば、２段ＤＰやビルディングレベルアルゴリズム（“音声認識の基礎（下）ＮＴＴアドバンステクノロジ、ｐｐ１８８−２４１）にしたがって、リファレンスパターン２０２とテストパターン２０３とを比較し、リファレンスパターン２０２の中の、１つの形態素の文字列について、複数存在する表記候補の中から最適候補を決定する情報処理部である。
【００４１】
２０５はＤＰ結果処理部であり、リファレンスパターン中に挿入されている区切り記号の位置と合致するように、決定された組み合わせ文字列の中に区切り記号ＳＰを挿入する。
【００４２】
この例では図３に示すようにリファレンスパターン２０２とテストパターン２０３とのＤＰマッチングにより符号３０１で示す文字列の組み合わせ、すなわち、「ｓｏｒｅｄｅｗａ→ｋｙｏ→ｎｏ→ｎｙｕ：ｓｕ→ｄｅｓｕ」の文字列の組み合わせについての距離の累積結果（スコアＡ）と「ｓｏｒｅｄｅｗａ→ｋｏＮｎｉｃｈｉ→ｎｏ→ｎｙｕ：ｓｕ→ｄｅｓｕ」の文字列の組み合わせについての距離の累積結果（スコアＢ）が得られる。この例では前の組み合わせが距離の累積結果が低いので、前の組み合わせが最良パス（文字列の組み合わせ方）として、情報処理部２０４内の最良パス選択部３０２により決定される。この後、ＤＰ結果処理部２０５内の区切り記号置換部３０３によりＤＰ結果３０１内の区切り記号がＳＰの区切り記号に置換される。
【００４３】
図３の例は、分割済みテキストデータを読み辞書１０２に基づき第１の音素表記書き起こしテキストデータに変換し、音素書き起こしテキストデータ１０３と比較して、同一の文字列について複数の音素表記から最適候補を決定する例であった。この例では、読み辞書に記載されている読みが実際の表記と一致しない場合（図３のｎｙｕ：ｓｕ（リファレンスパターン２０２）とｎｉｕ：ｓｕ（テストパターン２０３）とを参照）には、区切り記号挿入済み音素表記テキストデータ１０８には部分的に誤った表記が混在してしまうという欠点がある。
【００４４】
そこで、この欠点を改良した形態を図４に示す。図４において最良パス選択部３０２から出力されるＤＰ結果（最良パス）４０１とテストパターン（漢字かな混じり書き起こしテキストデータの読みを最も忠実に表す音素表記書き起こしデータから順次に取り出した音素表記の文字列）２０３は、区切り挿入位置算出部４０２に入力され、最良パス４０１が得られたＤＰ結果の区切れ記号位置（分割位置）をトレースすることで得られるテストパターン２０３中の区切り記号対応位置を算出する。
【００４５】
区切り記号挿入部（４０３）ではテストパターン２０３の、算出された区切り位置に所望の区切り記号ＳＰを挿入し、区切り記号挿入済み音素表記テキストデータ１０８を出力する。
【００４６】
これは、漢字かな混じり書き起こしテキストを形態素解析して得られる形態素ごとの分割位置に対応するように、音素表記書き起こしテキストデータ中に区切り記号を挿入していくことにほかならない。このため、漢字書き起こしテキストに辞書に登録されていない読みの漢字が混在していても、その漢字の読みが正しく区切り記号挿入済み音素表記テキストデータ１０８に反映される。
【００４７】
以上の機能構成で実行される書き起こしテキスト自動生成処理および音響モデル学習処理を説明する。
【００４８】
漢字かな混じり書き起こしテキストデータ１０１は言語モデル作成などに用いられる形態素解析部１０５に入力され、形態素解析部１０５から形態素毎に分割された分割済みテキストデータ１０６が出力される。パターンマッチング部１０７は分割済みテキストデータ１０６の示す表記に対応する読みの文字列（音素表記）を読み辞書１０２から取得し、分割済みテキストデータ１０６を読みの文字列に並べ替える。このとき、形態素の分割位置と合致するように区切り記号も付されてリファレンスパターン２０２（図２参照）が作成される。
【００４９】
リファレンスパターン２０２と音素表記書き起こしテキストデータとのＤＰマッチングにより、リファレンスパターンの中に含まれる複数の読み候補の中で、テストパターンの文字列に最も類似する文字列が選択される。また、リファレンスパターンの中の区切り文字は、ＳＰの文字に変換されて区切り記号挿入済み音素記号テキストデータ１０８が自動生成される。
【００５０】
学習用音声データ１０４と区切り記号挿入済み音素記号テキストデータ１０８とに基づいて音響モデル学習部１０９は従来と同様にして学習を行って、音響モデル１１０を作成する。作成された音響モデル１１０を使用して、音声認識が実行される。
【００５１】
さらに単語間にまたがる連続する３音素を考慮したモデルを使用して音声認識を行う場合の区切り文字挿入済み音素表記テキストデータを生成する処理について図５を使用して説明する。
【００５２】
図５において、音響モデル１１０は図１の装置により生成された音響モデルである。形態素毎区切り記号挿入済み音素表記テキストデータ１０８は図１の装置により生成されたテキストデータである。
【００５３】
５０１は上述のデータに基づきデコードを行うデコーダである。
【００５４】
５０２はデコーダ５０１のデコードにおいて、通過時間が０（ゼロ）であるＳＰすなわち、ジャンプパスを通った形態素毎区切り記号挿入済み音素表記テキストデータ１０８中のＳＰを削除するジャンプパスＳＰ検出削除部である。
【００５５】
ジャンプパスＳＰ検出削除部５０２の出力が、連続する３音素を考慮したＳＰ毎の区切り記号挿入済み音素表記テキストデータ５０３となる。
【００５６】
このような構成では、作成された音響モデル１１０とこの音響モデル１１０を学習する際に用いた形態素毎に挿入記号が挿入された音素表記テキストデータ１０８を使用して学習用音声データ１０４をデコーダ５０１でデコードする。ジャンプパスＳＰ検出削除部ではデコード結果により通過時間が０（ゼロ）であるＳＰ、すなわち、ジャンプパスを通ったＳＰを検出し、そのＳＰを音素表記テキストデータ１０８から削除してＳＰ毎の区切り記号挿入済み音素表記テキストデータ５０３を生成する。
【００５７】
以上の処理はソフトウェアプログラムで定義できる。また、そのソフトウェアプログラムをＣＰＵにより実行させればよい。
【００５８】
図１の機能構成を持つ書き起こしテキスト自動生成装置およびそれを有する音声認識装置の具体的な一実施例を図６に示す。
【００５９】
図６において、ＣＰＵ１０００、システムメモリ１０１０、入力装置１０２０、ディスプレイ１０３０、入出力インターフェース（Ｉ／Ｏ）１０４０およびハードディスク１０６０がバスを介して接続されている。
【００６０】
ＣＰＵ１０００は、ハードディスク１０６０に記憶された自動生成プログラムをシステムメモリ１０１０にロードした後、自動生成プログラムを実行して、区切り記号挿入済みテキストデータ１０８を生成する。このとき、ＣＰＵ１０００が書き起こしテキストデータ生成装置として機能する。
【００６１】
また、ハードディスク１０６０に記憶された不図示の音声認識プログラム（従来と同一）を使用して音声認識を行う場合には、ＣＰＵ１０００は音声認識装置として機能する。
【００６２】
システムメモリ１０１０はＣＰＵ１０００が実行するプログラム、このプログラムにしたがって実行する情報処理のための入力データ、情報処理結果を一時記憶する。入力装置１０２０はキーボードおよびマウスを有し、実行すべきプログラムを指定したり、文字や文字処理のための動作指示を行う。
【００６３】
ディスプレイ１０３０は入力装置１０２０から入力された指示を表示すると共に、音声認識結果等を表示する。Ｉ／Ｏ１０４０はマイクロホン１０５０から入力されたアナログの音声信号をデジタル音声信号に変換してＣＰＵ１０００に引き渡す。マイクロホン１０５０から入力された音声信号は学習用音声データ１０６６（図１では符号１０４）としてハードディスク１０６０に保存される。
【００６４】
ハードディスク１０６０は自動生成プログラム１０６１、漢字かな混じり書き起こしテキスト１０６２（図１の符号１０１）、音素表記書き起こしテキスト１０６３（図１の符号１０３）、読み辞書１０６４（図１の符号１０２）、音響モデル１０６５、学習用音声データ１０６６を保存する。ハードディスク１０６０にはその他、システム全体の動作制御を行うオペレーティングシステムや音声認識プログラムも保存されている。
【００６５】
テキストデータ１０６２、１０６３、読み辞書１０６４はたとえば、ワードプロセッサやエディタなどの文書処理プログラムを使用して作成し、ハードディスク１０６０に入力（保存）してもよいし、他の情報処理装置で作成したデータをオンライン通信やフロッピーディスク等を介したオフライン通信で予めハードディスク１０６０に転送入力してもよい。
【００６６】
自動生成プログラムはフロッピーディスク、ＣＤＲＯＭなどの携帯用記録媒体からＣＰＵ１０００の制御でハードディスク１０６０にインストールされる。
【００６７】
自動生成プログラム１０６１の内容を図７に示す。自動生成プログラム１０６１はＣＰＵ１０００が実行可能なプログラム言語で記載されているが、説明の便宜上、図７では機能表現している。当業者であれば、明細書の記載に基づいて、自動生成プログラムを作成することは容易であろう。
【００６８】
図７において、ステップＳ１０は、漢字かな混じり書き起こしテキストデータをハードディスク１０６０からシステムメモリ１０２０に読み出す処理である。
【００６９】
ステップＳ２０はシステムメモリ１０２０に記憶された漢字かな混じり書き起こしテキストデータについて形態素解析を行う処理である（図１の形態素解析部１０５に対応）。
【００７０】
ステップＳ３０は形態素解析により得られる分割済みテキストデータ（図１の１０６）をシステムメモリ１０２０に書き込む処理である。
【００７１】
ステップＳ４０は読み辞書１０６４をシステムメモリ１０２０に読み込む処理である。なお、この処理は読み辞書１０６４が大容量の場合には省略でき、後述のパターンマッチ処理の際に直接、ハードディスク１０６０から記載データを読み出すこともできる。
【００７２】
ステップＳ５０はシステムメモリ１０２０に記憶された読み辞書、音素表記書き起こしテキストデータおよび分割済みテキストデータに基づきパターンマッチング処理を行い区切り記号挿入済み音素表記テキストデータを生成する処理である。この処理を実行するときのＣＰＵ１０００が図１のパターンマッチング部１０７に相当する。なお、区切り記号ＳＰの挿入方法は図３または図４で示した方法のいずれか使用者の好適な方法を使用する。
【００７３】
ステップＳ６０は生成された区切り記号挿入済み音素表記テキストデータをハードディスク１０６０に保存のために記憶する処理である。
【００７４】
ステップＳ７０は終了判定であり、複数組のテキストデータ１０６２、１０６３が存在する場合、それらテキスト全てについてステップＳ１０〜Ｓ６０の処理が繰り返されて、区切り記号挿入済み音素表記テキストデータの自動生成が行われる。
【００７５】
ステップＳ８０はハードディスク１０６０からシステムメモリ１０２０に学習用音声データを読み出す処理である。ステップＳ９０はハードディスク１０６０に保存されている区切り記号挿入済み音素表記テキストデータと、システムメモリ１０２０に記憶されている学習用音声データを使用して音響モデルを作成する処理である。この処理を実行するときのＣＰＵ１０００が図１の音響モデル学習部１０９として機能する。
【００７６】
このようにしてＣＰＵ１０００は作成された音響モデルを使用して、マイクロホン１０５０から入力された音声信号を音声認識プログラムにしたがって、音声認識する。この処理は従来と同様であるので、詳細な説明を要しないであろう。
【００７７】
以上、説明した実施形態の他に次の形態を実施できる。
１）上述の実施形態は音声認識処理の前処理として、区切り記号挿入済み音素表記テキストデータを自動生成したが、図１のシステムで区切り記号挿入済み音素表記テキストデータを専用的に自動生成して、他の音声認識装置に引き渡してもよい。
【００７８】
２）図６のシステムはパソコンなどの汎用コンピュータを使用する例であるが、ハードウェアはパソコンに限らず、デジタルプロセッサを使用してもよいし、ワークステーションやサーバなど各種の情報処理装置を使用することができる。
【００７９】
３）上述の実施形態の中で、区切れ位置の算出処理や区切れ記号の置換処理については詳述しなかったが周知の情報処理方法を使用すればよい。たとえば、区切れ位置の算出位置については、オペレーティングシステム（たとえば、マイクロソフト社のウィンドウズ９５、９８）の中に、先頭から任意の単語／文字までの単語数／文字数を検出する関数が用意されているので、この関数を使用して、先頭からの単語数や文字数を区切れ記号位置とすることができる。
また、置換処理は文書編集処理でよく使用されているので、詳細な情報処理内容については説明を要しないであろう。
【００８０】
４）上述の実施形態は、実用を考え、複数の読みを許容した辞書を使用しているが、音声認識の対象とする音声内容が限定されている環境では、予め、かな漢字混じり単語の読みは１組だけを登録しておけばよい。
【００８１】
５）上述の図３および図４の処理方法を組み合わせた方法として、次の処理を行うことができる。すなわち、かな漢字混じり単語を読み辞書を使用し読みに変換する。読み辞書を検索しても読み辞書に該当するかな漢字混じり単語が存在しない場合は、音素表記テキストデータ中の分割位置が対応する音素表記を使用して区切り記号挿入済み音素表記書き起こしテキストデータを作成する。
【００８２】
【発明の効果】
以上、説明したように、本発明によれば、区切れ記号がない音素表記書き起こしテキストデータでは区切れ位置が分からないが漢字かな混じり書き起こしテキストデータを使用すると形態素分析等を行うことで、形態素ごとのかな混じり文字列（単語）の区切れを検出することができることに本願発明者は気がつき、区切れ位置で分割されたかな混じりテキストを読み辞書を使用して読み（音素表記の文字列）に変換する（請求項８の発明に対応）。また、これにより手動でなく自動での区切り記号挿入済み音素表記書き起こしテキストの自動生成が可能となる。
【００８３】
請求項１の発明では、区切れ記号が挿入されていない音素表記書き起こしテキストと、読みに変換され、同一の文字列について複数の読みが存在する分割済みテキストとの比較により、最適な読みを得ることができる。
【００８４】
請求項４の発明では、読みに変化された分割済みの音素表記テキストの示す位置にしたがって、かな漢字混じりテキストデータに対して最も忠実な読みを表し、区切れ記号が挿入されていない音素表記書き起こしテキストデータ（入力手段から入力される音素表記書き起こしテキストデータ）に区切れ記号を挿入するので、読みの精度がよい区切れ記号挿入済み音素表記テキストデータを自動生成することができる。
【００８５】
このようにして音響モデルの作成に使用する書き起こしテキストを自動生成することができるので、書き起こしテキストの生成時間を大幅に短縮することが可能となり、また、音響モデルの作成に関わるユーザの操作労力をも大幅に低減することができる。
【図面の簡単な説明】
【図１】本発明実施形態の機能構成を示すブロック図である。
【図２】本発明実施形態のパターンマッチング部の機能構成を示すブロック図である。
【図３】本発明実施形態のＤＰマッチング処理内容を示すブロック図である。
【図４】本発明実施形態の区切り記号挿入処理内容を示すブロック図である。
【図５】デコーダを使用する区切り記号挿入方法を示すブロック図である。
【図６】本発明実施形態の具体的なシステム構成を示すブロック図である。
【図７】自動生成プログラムの処理内容を示すフローチャートである。
【符号の説明】
１０１漢字かな混じり書き起こしテキストデータ
１０２読み辞書
１０３音素表記書き起こしテキストデータ
１０４学習用音声データ
１０５形態素解析部
１０６分割済みテキストデータ
１０７パターンマッチング部
１０８区切り記号挿入済み音素表記テキストデータ
１０９音響モデル学習部
１１０音響モデル
[0001]
[Technical Field of the Invention]
The present invention relates to a transcription text automatic generation device that automatically generates transcription text used for training an acoustic model, to a speech recognition device that performs speech recognition using an acoustic model trained from speech data and the corresponding transcription text, and to a recording medium on which a program for automatically generating transcription text is recorded.
[0002]
[Prior art]
The acoustic model used in a speech recognition system is a statistical model for calculating how closely input speech resembles a candidate phoneme (see NHK STRL R&D No. 57, 1999.8, pp. 36), and its parameters are learned from a large amount of speech data and the corresponding transcription text. The kana-kanji mixed transcription data and the phoneme-notation transcription text data are created by hand. Because the amount of work a transcriber can do is limited, phoneme-notation transcription text with delimiters inserted at each recognition unit has been hard to obtain, and acoustic models have often been created without delimiters inserted.
[0003]
Moreover, even when transcription text data with some delimiters inserted is available, the transcription text does not necessarily use the same recognition unit as the speech recognition system to which it is applied. Transcription text has therefore been used despite the mismatch between the acoustic model training data and the data handled by the recognition system.
[0004]
[Problems to be solved by the invention]
When constructing a speech recognition system, unifying the recognition unit handled by the speech recognition decoder, the recognition unit handled by the language model, and the recognition unit handled when creating the acoustic model is essential for ensuring recognition accuracy.
[0005]
In conventional systems, for the reasons described above, situations arise in which phoneme-notation transcription text without delimiters inserted at the optimal positions must be used.
[0006]
Consider, as an example, a speech recognition system that uses the commonly used morpheme as its recognition unit, takes into account three consecutive phonemes within a morpheme, and does not take into account three phonemes that span morphemes.
[0007]
That is, in this system, the acoustic model used between morphemes considers the first two phonemes in the morpheme that do not depend on subsequent phonemes.
[0008]
To train the above model, the prior art must separately create and use an acoustic model that uses all sequences of three consecutive phonemes in the phoneme transcription text and an acoustic model that uses all sequences of two consecutive phonemes.
[0009]
Moreover, sufficient accuracy cannot be ensured when the acoustic model of two continuous phonemes does not take into account the condition of two continuous phonemes appearing between morphemes.
[0010]
For this reason, in order to ensure the accuracy of the acoustic model, it is necessary to learn the part that spans between morphemes as a model that considers only two phonemes, and the other part needs a model that considers three phonemes. In order to create this model, phoneme transcription text with morpheme delimiters inserted is necessary, and a method that does not require much human intervention is desired for this text creation.
[0011]
Beyond the above example, a system that does consider three phonemes spanning morphemes is in the same position: it also needs transcription text in which delimiters are inserted at positions where two consecutive phonemes must be considered, such as short silences and breaths.
[0012]
In view of the above, an object of the present invention is to provide a transcription text automatic generation device capable of automatically generating transcription text for acoustic model training, a speech recognition device equipped with that device, and a recording medium.
[0013]
[Means for Solving the Problems]
To achieve this object, the invention of claim 1 comprises: input means for inputting kanji-kana mixed transcription text data and phoneme-notation transcription text data describing the phoneme notation corresponding to the kanji-kana mixed text data; first information processing means for dividing the input kanji-kana mixed text data into predetermined recognition units; storage means storing a reading dictionary in which character strings and their readings are described and in which a plurality of readings are allowed for one character string; and second information processing means for converting, based on the reading dictionary, the character strings in the divided kanji-kana mixed text data into readings while keeping them aligned with the division positions of the text data divided by the first information processing means, determining, when a plurality of readings exist for the same character string, the optimal candidate among the plurality of readings based on the phoneme-notation transcription text data input from the input means, and inserting delimiters at the division positions of the character strings converted into readings to generate delimiter-inserted phoneme-notation text data.
[0014]
According to a second aspect of the present invention, in the transcription text automatic generation device according to the first aspect, the predetermined recognition unit is a morpheme, and the first information processing means is a morpheme analysis unit.
[0015]
The invention of claim 3 is the transcription text automatic generation device according to claim 1, wherein, when a plurality of readings exist for the same character string to be converted, the second information processing means determines the optimal candidate by DP matching processing. For the DP matching processing, it creates a reference pattern in which delimiters are added to the character strings converted into readings, creates a test pattern by extracting a character string of a predetermined length from the phoneme-notation transcription text data, and performs DP matching between the reference pattern and the test pattern.
[0016]
The invention of claim 4 comprises: input means for inputting kanji-kana mixed transcription text data and phoneme-notation transcription text data describing the phoneme notation corresponding to the kanji-kana mixed text data; first information processing means for dividing the input kanji-kana mixed text data into predetermined recognition units; storage means storing a reading dictionary in which character strings and their readings are described and in which a plurality of readings are allowed for one character string; and second information processing means for converting, based on the reading dictionary, the character strings in the divided kanji-kana mixed text data into readings while keeping them aligned with the division positions of the text data divided by the first information processing means, determining, when a plurality of readings exist for the same character string, the optimal candidate among the plurality of readings based on the phoneme-notation transcription text data input from the input means, and inserting delimiters into the phoneme-notation transcription text data input from the input means at positions corresponding to the division positions of the character strings converted into readings, thereby generating delimiter-inserted phoneme-notation text data.
[0017]
The invention of claim 5 is the transcription text automatic generation device according to claim 4, wherein the second information processing means creates, for the DP matching processing, a reference pattern consisting of the character strings converted into readings, creates a test pattern by extracting a character string of a predetermined length from the phoneme-notation transcription text data, performs DP matching between the reference pattern and the test pattern, and inserts a delimiter into the test pattern at each delimiter position indicated by the DP matching result.
[0018]
The invention of claim 6 is the transcription text automatic generation device according to claim 1 or claim 4, further comprising: acoustic model storage means storing an acoustic model; speech data input means for inputting training speech data; decoding means for decoding the training speech data input from the speech data input means using the delimiter-inserted phoneme-notation text data generated by the second information processing means; and third information processing means for detecting, based on the decoding result, delimiters whose transit time is zero and creating transcription text data in which the detected delimiters are deleted from the delimiter-inserted phoneme-notation text data.
[0019]
The invention of claim 8 comprises: input means for inputting kanji-kana mixed transcription text data; first information processing means for dividing the input kanji-kana mixed text data into predetermined recognition units; storage means storing a reading dictionary in which character strings and their readings are described; and second information processing means for converting the text data divided by the first information processing means into readings based on the reading dictionary and inserting delimiters at the division positions determined by the first information processing means.
[0020]
The invention of claim 9 comprises: the transcription text automatic generation device according to any one of claims 1, 4, and 8; speech data input means for inputting training speech data; and acoustic model creation means for creating an acoustic model based on the transcription text generated by the automatic generation device and the training speech data input from the speech data input means, wherein speech recognition is performed using the created acoustic model.
[0021]
The invention of claim 10 is a recording medium on which is recorded a program executed by a transcription text automatic generation device having storage means storing a reading dictionary in which character strings and their readings are described and in which a plurality of readings are allowed for one character string, wherein the program comprises: an input step of inputting kanji-kana mixed transcription text data and phoneme-notation transcription text data describing the phoneme notation corresponding to the kanji-kana mixed text data; a first information processing step of dividing the input kanji-kana mixed text data into predetermined recognition units; and a second information processing step of converting, based on the reading dictionary, the character strings in the divided kanji-kana mixed text data into readings while keeping them aligned with the division positions of the text data divided in the first information processing step, determining, when a plurality of readings exist for the same character string, the optimal candidate among the plurality of readings based on the phoneme-notation transcription text data input in the input step, and inserting delimiters at the division positions of the character strings converted into readings to generate delimiter-inserted phoneme-notation text data.
[0022]
The invention of claim 11 is a recording medium on which is recorded a program executed by a transcription text automatic generation device having storage means storing a reading dictionary in which character strings and their readings are described and in which a plurality of readings are allowed for one character string, wherein the program comprises: an input step of inputting kanji-kana mixed transcription text data and phoneme-notation transcription text data describing the phoneme notation corresponding to the kanji-kana mixed text data; a first information processing step of dividing the input kanji-kana mixed text data into predetermined recognition units; and a second information processing step of converting, based on the reading dictionary, the character strings in the divided kanji-kana mixed text data into readings while keeping them aligned with the division positions of the text data divided in the first information processing step, determining, when a plurality of readings exist for the same character string, the optimal candidate among the plurality of readings based on the phoneme-notation transcription text data input in the input step, and inserting delimiters into the phoneme-notation transcription text data input in the input step at positions corresponding to the division positions of the character strings converted into readings, thereby generating delimiter-inserted phoneme-notation text data.
[0023]
DETAILED DESCRIPTION OF THE INVENTION
Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings.
[0024]
In the embodiment described below, the recognition unit is the morpheme. The phoneme model used within a morpheme considers three consecutive phonemes, and the model used between morphemes considers two phonemes and does not depend on the subsequent or preceding phoneme. The notation SP, meaning a short silence that can be skipped (jumped), is used as the model corresponding to the delimiter.
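The modelling convention in this paragraph can be illustrated with a short sketch. It assumes that each morpheme is already available as a list of phoneme symbols, and it borrows the `left-centre+right` label style commonly used for context-dependent phone models. The function name `word_internal_labels` and the lowercase `sp` symbol are illustrative choices, not taken from the patent.

```python
from typing import List


def word_internal_labels(morphemes: List[List[str]], delimiter: str = "sp") -> List[str]:
    """Build context-dependent labels that never look across a morpheme boundary.

    Inside a morpheme a phoneme sees both neighbours (three-phoneme context).
    At a morpheme edge the missing neighbour is dropped (two-phoneme context).
    A delimiter symbol is emitted between consecutive morphemes.
    """
    labels: List[str] = []
    for m_idx, phones in enumerate(morphemes):
        if m_idx > 0:
            labels.append(delimiter)  # morpheme boundary
        for i, phone in enumerate(phones):
            label = phone
            if i > 0:  # left context exists inside this morpheme
                label = f"{phones[i - 1]}-{label}"
            if i < len(phones) - 1:  # right context exists inside this morpheme
                label = f"{label}+{phones[i + 1]}"
            labels.append(label)
    return labels


# Hypothetical phoneme split of the readings "kyo" and "no".
labels = word_internal_labels([["k", "y", "o"], ["n", "o"]])
print(labels)
```

Without the delimiter between the two morphemes, the last phoneme of "kyo" and the first phoneme of "no" would be indistinguishable from word-internal contexts, which is why the boundary positions matter for training.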
[0025]
FIG. 1 shows a functional configuration of the speech recognition system according to the first embodiment of the present invention.
[0026]
FIG. 1 shows the configuration of the acoustic model creation part of the speech recognition system. The other parts of the speech recognition system are the same as in the prior art; for example, the speech recognition system described in NHK STRL R&D No. 57, 1999.8, pp. 36 can be used.
[0027]
In FIG. 1, reference numeral 101 denotes kanji-kana mixed transcription text data, which describes the kana-kanji character strings corresponding to the training speech data. Reference numeral 102 denotes a morpheme reading dictionary used to automatically generate the delimiter-inserted phoneme-notation text data. Each data set pairs the written form of a morpheme with one or more corresponding readings (phoneme notation), and the reading dictionary 102 contains a plurality of such data sets.
[0028]
Reference numeral 103 denotes phoneme-notation transcription text data, which describes the phoneme notation corresponding to the contents of the kanji-kana mixed transcription text data. Reference numeral 104 denotes the training speech data used for training the acoustic model, that is, the speech data obtained when the character strings written in the kanji-kana mixed transcription text data are uttered.
[0029]
Reference numeral 105 denotes a morphological analysis unit that analyzes the character strings described in the kanji-kana mixed transcription text data and divides them into morphemes. Morphological analysis units are well known; in this embodiment, a CPU that executes the Japanese analysis program "ChaSen" (茶筌) version 1.5 serves as the morphological analysis unit 105.
[0030]
Reference numeral 106 denotes the kanji-kana mixed text data divided into morphemes by the morphological analysis unit 105 (hereinafter referred to as divided text data). Reference numeral 107 denotes a pattern matching unit that automatically generates the delimiter-inserted phoneme-notation text data 108 using the divided text data 106, the reading dictionary 102, and the phoneme-notation transcription text data 103. A CPU that executes the processing shown in FIGS. 2 to 4 can serve as the pattern matching unit 107.
[0031]
Reference numeral 108 denotes phoneme-notation text data in which delimiters are inserted at morpheme boundaries (hereinafter referred to as delimiter-inserted phoneme-notation text data). Reference numeral 109 denotes an acoustic model learning unit that creates the acoustic model 110 using the delimiter-inserted phoneme-notation text data 108 and the training speech data 104. A CPU that runs HTK (Hidden Markov Model Toolkit, http://www.entropic.com/htk/html) can serve as the acoustic model learning unit 109.
[0032]
FIG. 2 shows details of the pattern matching unit 107 of FIG. 1. Parts identical to those in FIG. 1 are given the same reference numerals, and their detailed description is omitted.
[0033]
In FIG. 2, reference numeral 201 denotes a reference pattern creation unit, which creates a reference pattern 202.
[0034]
Reference numeral 202 denotes a reference pattern, which is data in which one or more readings described in the reading dictionary 102 are arranged in correspondence with the division position of the morpheme shown in the divided text data 106. A delimiter is inserted at the division position.
[0035]
One method of creating the reference pattern is described here.
[0036]
The character string up to the first delimiter in the divided text data, in this case "それでは" (excluding punctuation), is extracted, and the reading dictionary 102 is searched for a data set whose notation matches "それでは". The reading "soredewa" corresponding to that notation is taken from the data set found by the search and inserted into the reference pattern. Next, the notation of the morpheme at the second division position, "今日", is taken from the divided text data 106. When a data set with the same notation as "今日" is found in the reading dictionary 102, the corresponding readings are taken from it, in this case "kyo" and "koNnichi".
[0037]
Therefore, a delimiter is appended after the last character of the reference pattern built so far, and the two readings obtained for the second morpheme are then appended. To indicate that one notation has a plurality of reading candidates, the candidate readings are enclosed in {} and separated from each other by a “,” symbol.
[0038]
In this way, the reading corresponding to the notation of each morpheme described in the divided text data 106 is acquired from the reading dictionary 102, and the reference pattern is built up by appending a delimiter after each morpheme. The reference pattern 202 is thus created by taking the readings from the reading dictionary for all the notations described in the divided text data and arranging them with delimiters between them.
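As an illustration only, the reference pattern creation described above might be sketched in Python as follows. The function name build_reference_pattern, the "/" character used as the reference pattern delimiter, and the Japanese surface forms used as dictionary keys are assumptions introduced for this sketch and are not specified in the embodiment; only the readings and the {} and "," notation come from the description.

```python
def build_reference_pattern(morphemes, reading_dict, delimiter="/"):
    """Arrange dictionary readings per morpheme, separated by a delimiter.

    morphemes    -- notations output by morphological analysis (assumed input)
    reading_dict -- maps a notation to a list of one or more readings
    delimiter    -- hypothetical delimiter symbol for the reference pattern
    """
    parts = []
    for notation in morphemes:
        readings = reading_dict[notation]  # raises KeyError if unregistered
        if len(readings) == 1:
            parts.append(readings[0])
        else:
            # Several candidates: enclose in {} and separate with ","
            parts.append("{" + ",".join(readings) + "}")
    return delimiter.join(parts)


# Hypothetical dictionary entries; the readings follow the example in the text.
reading_dict = {
    "それでは": ["soredewa"],
    "今日": ["kyo", "koNnichi"],
    "の": ["no"],
    "ニュース": ["nyu:su"],
    "です": ["desu"],
}
morphemes = ["それでは", "今日", "の", "ニュース", "です"]
pattern = build_reference_pattern(morphemes, reading_dict)
print(pattern)  # soredewa/{kyo,koNnichi}/no/nyu:su/desu
```

A notation missing from the dictionary raises an error in this sketch; how unregistered notations are handled is left open here.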
[0039]
Reference numeral 203 denotes a test pattern, which is extracted from the phoneme notation transcription text data 103 in units of a predetermined length for DP matching.
[0040]
Reference numeral 204 denotes an information processing unit that compares the reference pattern 202 with the test pattern 203 according to a DP (Dynamic Programming) algorithm, for example two-stage DP or the level building algorithm (“Fundamentals of Speech Recognition (Vol. 2)”, NTT Advanced Technology, pp. 188-241), and determines the optimum candidate from among the plurality of reading candidates for one morpheme character string in the reference pattern 202.
[0041]
Reference numeral 205 denotes a DP result processing unit, which inserts the delimiter SP into the determined combination of character strings so as to match the positions of the delimiters inserted in the reference pattern.
[0042]
In this example, as shown in FIG. 3, DP matching between the reference pattern 202 and the test pattern 203 yields a distance accumulation result (score) for each of the character string combinations indicated by reference numeral 301, namely the combination “soredewa → kyo → no → nyu:su → desu” and the combination “soredewa → koNnichi → no → nyu:su → desu”. In this example, the former combination has the lower distance accumulation result, so the best path selection unit 302 in the information processing unit 204 determines the former combination to be the best path (the way of combining the character strings). Thereafter, the delimiter replacing unit 303 in the DP result processing unit 205 replaces the delimiters in the DP result 301 with the SP delimiter.
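The selection of the best path can be illustrated with a small Python sketch. This is not the two-stage DP or level building algorithm named above: it simply enumerates every combination of reading candidates and scores each with a character-level Levenshtein distance, which is assumed here as the accumulated distance. The function names and the use of single characters as matching units are likewise assumptions made for illustration.

```python
from itertools import product


def edit_distance(a, b):
    """Levenshtein distance between strings a and b (row-by-row DP)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # match / substitution
        prev = cur
    return prev[-1]


def best_path(candidates, test_pattern):
    """Return (score, combination) with the lowest accumulated distance.

    candidates -- one list of reading candidates per morpheme
    """
    scored = [(edit_distance("".join(combo), test_pattern), combo)
              for combo in product(*candidates)]
    return min(scored, key=lambda item: item[0])


candidates = [["soredewa"], ["kyo", "koNnichi"], ["no"], ["nyu:su"], ["desu"]]
test_pattern = "soredewakyononiu:sudesu"  # phoneme notation without delimiters
score, combo = best_path(candidates, test_pattern)
print(score, " SP ".join(combo))  # 1 soredewa SP kyo SP no SP nyu:su SP desu
```

Exhaustive enumeration grows exponentially with the number of ambiguous morphemes, which is one reason a DP formulation over the whole pattern is preferable for real data.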
[0043]
In the example of FIG. 3, the divided text data is first converted into phoneme notation text data based on the reading dictionary 102 and then compared with the phoneme notation transcription text data 103 to determine the optimum candidate from among a plurality of phoneme notations for the same character string. In this example, when a reading described in the reading dictionary does not match the actual notation (see nyu:su in the reference pattern 202 and niu:su in the test pattern 203 in FIG. 3), the delimiter-inserted phoneme notation text data 108 has the drawback that partially incorrect notations are mixed in.
[0044]
Therefore, FIG. 4 shows a form in which this drawback is remedied. In FIG. 4, the DP result (best path) 401 output from the best path selection unit 302 and the test pattern 203 (a phoneme notation character string sequentially extracted from the phoneme notation transcription text data, which most faithfully represents the reading of the kanji-kana mixed text data) are input to the delimiter insertion position calculation unit 402. The delimiter insertion position calculation unit 402 traces the delimiter positions (division positions) of the DP result 401 and calculates the corresponding delimiter positions in the test pattern 203.
[0045]
A delimiter insertion unit 403 inserts the desired delimiter SP at the calculated delimiter positions of the test pattern 203 and outputs the delimiter-inserted phoneme notation text data 108.
[0046]
This amounts to inserting delimiters into the phoneme notation transcription text data so as to correspond to the division positions of the morphemes obtained by morphological analysis of the kanji-kana mixed text. For this reason, even if the kanji-kana mixed transcription text contains character strings that are not registered in the reading dictionary, their readings are correctly reflected in the delimiter-inserted phoneme notation text data 108.
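The projection of division positions onto the test pattern can be sketched as follows. The sketch assumes a character-level edit-distance alignment with backtracking as the DP matching and a textual " SP " marker as the delimiter; the function names are hypothetical, and the embodiment does not prescribe these details.

```python
def align_boundaries(reference, test, boundaries):
    """Map positions in `reference` to positions in `test` along the best path."""
    n, m = len(reference), len(test)
    # Full DP table of accumulated distances.
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = reference[i - 1] != test[j - 1]
            dist[i][j] = min(dist[i - 1][j] + 1,
                             dist[i][j - 1] + 1,
                             dist[i - 1][j - 1] + cost)
    # Trace the best path back from the end, noting where each i lands.
    mapped = {}
    i, j = n, m
    while True:
        mapped.setdefault(i, j)
        if i == 0 and j == 0:
            break
        if i > 0 and j > 0 and dist[i][j] == (
                dist[i - 1][j - 1] + (reference[i - 1] != test[j - 1])):
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            i -= 1
        else:
            j -= 1
    return [mapped[b] for b in boundaries]


def insert_delimiters(readings, test, sp=" SP "):
    """Insert `sp` into `test` at positions matching the morpheme boundaries."""
    boundaries, pos = [], 0
    for reading in readings[:-1]:
        pos += len(reading)
        boundaries.append(pos)
    cuts = align_boundaries("".join(readings), test, boundaries)
    pieces, start = [], 0
    for cut in cuts:
        pieces.append(test[start:cut])
        start = cut
    pieces.append(test[start:])
    return sp.join(pieces)


best = ["soredewa", "kyo", "no", "nyu:su", "desu"]  # best path from DP matching
print(insert_delimiters(best, "soredewakyononiu:sudesu"))
# soredewa SP kyo SP no SP niu:su SP desu
```

Because the delimiters are placed in the test pattern itself, the output keeps niu:su as written in the phoneme notation transcription rather than the dictionary reading nyu:su.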
[0047]
The transcription text automatic generation processing and acoustic model learning processing executed with the above functional configuration will be described.
[0048]
The kanji-kana mixed transcription text data 101 is input to the morpheme analysis unit 105 used for language model creation, and the morpheme analysis unit 105 outputs the divided text data 106 divided into morphemes. The pattern matching unit 107 acquires from the reading dictionary 102 the reading character strings (phoneme notation) corresponding to the notations in the divided text data 106 and rewrites the divided text data 106 as reading character strings. At this time, the reference pattern 202 (see FIG. 2) is created with delimiters added at the division positions of the morphemes.
[0049]
By DP matching between the reference pattern 202 and the phoneme notation transcription text data, the character string most similar to the test pattern character string is selected from among the plurality of reading candidates included in the reference pattern. Further, the delimiter characters in the reference pattern are converted into SP characters, and the delimiter-inserted phoneme notation text data 108 is generated automatically.
[0050]
Based on the learning speech data 104 and the delimiter-inserted phoneme notation text data 108, the acoustic model learning unit 109 performs learning in the conventional manner and creates the acoustic model 110. Speech recognition is performed using the created acoustic model 110.
[0051]
Further, a process for generating phoneme notation text data with a delimiter inserted when speech recognition is performed using a model that takes into consideration three consecutive phonemes that span between words will be described with reference to FIG.
[0052]
In FIG. 5, the acoustic model 110 is the acoustic model generated by the apparatus of FIG. 1. The phoneme notation text data 108 with morpheme delimiters inserted is the text data generated by the apparatus of FIG. 1.
[0053]
Reference numeral 501 denotes a decoder that performs decoding based on the above data.
[0054]
Reference numeral 502 denotes a jump path SP detection/deletion unit that detects each SP whose transit time in the decoding by the decoder 501 is 0 (zero) and deletes that SP from the phoneme notation text data 108 into which the morpheme delimiters have been inserted.
[0055]
The output of the jump path SP detection/deletion unit 502 is delimiter-inserted phoneme notation text data 503, in which the SP delimiters remain only at positions appropriate for a model that considers three consecutive phonemes.
[0056]
In such a configuration, the decoder 501 decodes the learning speech data 104 using the created acoustic model 110 and the phoneme notation text data 108, in which a delimiter is inserted for each morpheme and which was used when learning the acoustic model 110. Based on the decoding result, the jump path SP detection/deletion unit 502 detects each SP whose transit time is 0 (zero), that is, each SP that was passed via the jump path, deletes it from the phoneme notation text data 108, and generates the delimiter-inserted phoneme notation text data 503.
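The deletion step can be pictured with the following Python sketch. The alignment format, a list of (label, start_frame, end_frame) tuples, and the frame values are made up for illustration; the actual output format of the decoder 501 is not given in the description.

```python
def drop_skipped_sp(alignment, sp="SP"):
    """Remove SP delimiters whose transit time (end - start) is zero.

    alignment -- hypothetical decoder output: (label, start_frame, end_frame)
    Returns the remaining labels in their original order.
    """
    return [label for label, start, end in alignment
            if not (label == sp and end == start)]


# Illustrative values only: the first SP is skipped, the second is a real pause.
alignment = [
    ("soredewa", 0, 40),
    ("SP", 40, 40),   # zero transit time: passed via the jump path
    ("kyo", 40, 62),
    ("SP", 62, 70),   # non-zero transit time: kept
    ("no", 70, 78),
]
print(drop_skipped_sp(alignment))  # ['soredewa', 'kyo', 'SP', 'no']
```

Only delimiters with zero transit time are removed, so delimiters at short pauses or breaths remain in the output.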
[0057]
The above processing can be defined by a software program. The software program may be executed by the CPU.
[0058]
FIG. 6 shows a specific example of a transcription text automatic generation apparatus having the functional configuration of FIG. 1 and a speech recognition apparatus having the same.
[0059]
In FIG. 6, a CPU 1000, a system memory 1010, an input device 1020, a display 1030, an input/output interface (I/O) 1040, and a hard disk 1060 are connected via a bus.
[0060]
The CPU 1000 loads the automatic generation program stored in the hard disk 1060 into the system memory 1010 and then executes the automatic generation program to generate the text data 108 with the delimiter inserted. At this time, the CPU 1000 functions as a transcription text data generation device.
[0061]
When performing voice recognition using a voice recognition program (not shown) stored in the hard disk 1060 (same as the conventional one), the CPU 1000 functions as a voice recognition device.
[0062]
The system memory 1010 temporarily stores a program executed by the CPU 1000, input data for the information processing executed in accordance with the program, and the information processing results. The input device 1020 has a keyboard and a mouse and is used to designate a program to be executed, to input characters, and to give operation instructions for character processing.
[0063]
The display 1030 displays instructions input from the input device 1020 and also displays voice recognition results and the like. The I/O 1040 converts an analog audio signal input from the microphone 1050 into a digital audio signal and delivers it to the CPU 1000. The audio signal input from the microphone 1050 is stored in the hard disk 1060 as learning audio data 1066 (reference numeral 104 in FIG. 1).
[0064]
The hard disk 1060 stores an automatic generation program 1061, a kanji-kana mixed text 1062 (reference numeral 101 in FIG. 1), a phoneme notation transcription text 1063 (reference numeral 103 in FIG. 1), a reading dictionary 1064 (reference numeral 102 in FIG. 1), an acoustic model 1065, and the learning voice data 1066. In addition, the hard disk 1060 stores an operating system for controlling the operation of the entire system and a voice recognition program.
[0065]
The text data 1062 and 1063 and the reading dictionary 1064 may be created using a document processing program such as a word processor or an editor and saved to the hard disk 1060, or data created by another information processing apparatus may be used. In the latter case, the data may be transferred to the hard disk 1060 in advance by online communication or offline via a floppy disk.
[0066]
The automatic generation program is installed on the hard disk 1060 from a portable recording medium such as a floppy disk or CD-ROM under the control of the CPU 1000.
[0067]
The contents of the automatic generation program 1061 are shown in FIG. 7. The automatic generation program 1061 is written in a programming language that can be executed by the CPU 1000, but for convenience of explanation it is expressed functionally in FIG. 7. A person skilled in the art could easily create the automatic generation program based on this description.
[0068]
In FIG. 7, step S10 is a process of reading the kanji-kana mixed text data from the hard disk 1060 into the system memory 1010.
[0069]
Step S20 is a process of performing morphological analysis on the kanji-kana mixed text data stored in the system memory 1010 (corresponding to the morpheme analysis unit 105 in FIG. 1).
[0070]
Step S30 is a process of writing the divided text data (106 in FIG. 1) obtained by the morphological analysis into the system memory 1010.
[0071]
Step S40 is a process of reading the reading dictionary 1064 into the system memory 1010. When the reading dictionary 1064 is large, this process can be omitted, and the required entries can be read directly from the hard disk 1060 during the pattern matching process described later.
[0072]
In step S50, pattern matching processing is performed based on the reading dictionary, the phoneme notation transcription text data, and the divided text data stored in the system memory 1010 to generate the delimiter-inserted phoneme notation text data. The CPU 1000 executing this processing corresponds to the pattern matching unit 107 in FIG. 1. As the method of inserting the delimiter SP, the user selects either the method shown in FIG. 3 or the method shown in FIG. 4.
[0073]
Step S60 is a process of saving the generated delimiter-inserted phoneme notation text data to the hard disk 1060.
[0074]
Step S70 is an end determination. When there are a plurality of sets of text data 1062 and 1063, the processing of steps S10 to S60 is repeated for all the texts, and the delimiter-inserted phoneme notation text data is generated automatically.
[0075]
Step S80 is a process of reading the learning voice data from the hard disk 1060 into the system memory 1010. Step S90 is a process of creating an acoustic model using the delimiter-inserted phoneme notation text data stored in the hard disk 1060 and the learning speech data stored in the system memory 1010. The CPU 1000 executing this process functions as the acoustic model learning unit 109 in FIG. 1.
[0076]
In this way, the CPU 1000 uses the created acoustic model to recognize a voice signal input from the microphone 1050 according to a voice recognition program. This process is similar to the prior art and will not require detailed description.
[0077]
In addition to the embodiments described above, the following embodiments can be implemented.
1) In the above embodiment, the delimiter-inserted phoneme notation text data is generated automatically as pre-processing for the speech recognition process. However, the delimiter-inserted phoneme notation text data generated automatically by the system of FIG. 6 may instead be delivered to another voice recognition device.
[0078]
2) The system of FIG. 6 is an example using a general-purpose computer such as a personal computer, but the hardware is not limited to a personal computer. A digital processor may be used, and various information processing devices such as a workstation or a server can also be used.
[0079]
3) In the above-described embodiment, the process of calculating the delimiter position and the process of replacing the delimiter symbol have not been described in detail, but known information processing methods may be used. For example, for calculating the delimiter position, operating systems (for example, Microsoft Windows 95, 98) provide a function that counts the number of words or characters from the beginning of a text to an arbitrary word or character. Using this function, the number of words or characters counted from the beginning can be used as the delimiter position.
Further, since replacement processing is commonly used in document editing, its details need no explanation.
[0080]
4) The above-described embodiment uses a dictionary that allows a plurality of readings in consideration of practical use. However, in an environment where the speech content targeted for speech recognition is limited, only one reading needs to be registered in advance for each kanji-kana mixed word.
[0081]
5) The following processing is possible as a method combining the processing methods of FIGS. 3 and 4 described above. That is, a kanji-kana mixed word is converted into its reading using the reading dictionary. If the kanji-kana mixed word is not found in the reading dictionary, its phoneme notation is created from the phoneme notation at the corresponding division position in the phoneme notation text data.
[0082]
[Effects of the invention]
As described above, phoneme notation transcription text data without delimiter symbols does not reveal where the delimiters belong. The inventors of the present application noticed, however, that morphological analysis of the kanji-kana mixed text data makes it possible to detect the boundaries of the kanji-kana mixed character strings (words) for each morpheme. In the present invention, the kanji-kana mixed text divided at those boundaries is converted into readings (phoneme notation character strings) using a dictionary (corresponding to the invention of claim 8). This makes it possible to generate delimiter-inserted phoneme notation transcriptions automatically rather than manually.
[0083]
According to the invention of claim 1, the optimal reading can be obtained by comparing the phoneme notation transcription with no delimiters inserted against the divided text that has been converted into readings and has a plurality of readings for the same character string.
[0084]
According to the invention of claim 4, delimiters are inserted into the phoneme notation transcription text data with no delimiters inserted (the phoneme notation transcription text data input from the input means), which most faithfully represents the reading of the kanji-kana mixed text data, at the positions indicated by the divided phoneme notation text converted into readings. Delimiter-inserted phoneme notation text data with high reading accuracy can therefore be generated automatically.
[0085]
In this way, the transcription text used to create the acoustic model can be generated automatically, so the time needed to produce the transcription text can be greatly shortened, and the user's labor in creating the acoustic model can also be greatly reduced.
[Brief description of the drawings]
FIG. 1 is a block diagram showing a functional configuration of an embodiment of the present invention.
FIG. 2 is a block diagram illustrating a functional configuration of a pattern matching unit according to an embodiment of the present invention.
FIG. 3 is a block diagram showing the contents of DP matching processing according to the embodiment of the present invention.
FIG. 4 is a block diagram showing total delimiter symbol processing contents according to the embodiment of the present invention.
FIG. 5 is a block diagram illustrating a delimiter insertion method using a decoder.
FIG. 6 is a block diagram showing a specific system configuration according to the embodiment of the present invention.
FIG. 7 is a flowchart showing the processing contents of the automatic generation program.
[Explanation of symbols]
101 Kanji-kana mixed transcription text data
102 Reading dictionary
103 Phoneme notation transcription text data
104 Learning speech data
105 Morpheme analysis unit
106 Divided text data
107 Pattern matching unit
108 Delimiter-inserted phoneme notation text data
109 Acoustic model learning unit
110 Acoustic model

Claims

Input means for inputting kanji-kana mixed transcription text data and phoneme-notation transcription text data describing the phonetic notation corresponding to the kanji-kana mixed text data;
First information processing means for dividing the input kanji-kana mixed text data into predetermined recognition units;
A storage means that stores a reading dictionary in which a character string and its reading are described and a plurality of readings are allowed for one character string;
Second information processing means for converting, based on the reading dictionary, the character strings in the divided kanji-kana mixed text data into readings in correspondence with the division positions of the text data divided by the first information processing means, determining, when there are a plurality of readings for the same character string, the optimal candidate among the plurality of readings based on the phoneme-notation transcription text data input from the input means, and inserting a delimiter symbol at the division positions of the character strings converted into readings, thereby generating delimiter-inserted phoneme-notation text data. A transcription text automatic generation apparatus comprising the above means.

2. The transcription text automatic generation apparatus according to claim 1, wherein the predetermined recognition unit is a morpheme, and the first information processing means is a morpheme analysis unit.

2. The transcription text automatic generation apparatus according to claim 1, wherein the second information processing means determines the optimal candidate by DP matching processing when there are a plurality of readings for the same character string to be converted, creates, for the DP matching processing, a reference pattern in which delimiters are included in the character strings converted into readings, creates a test pattern by extracting a character string of a predetermined length from the phoneme-notation transcription text data, and subjects the reference pattern and the test pattern to DP matching processing.

Input means for inputting kanji-kana mixed transcription text data and phoneme-notation transcription text data describing the phonetic notation corresponding to the kanji-kana mixed text data;
First information processing means for dividing the input kanji-kana mixed text data into predetermined recognition units;
A storage means that stores a reading dictionary in which a character string and its reading are described and a plurality of readings are allowed for one character string;
Second information processing means for converting, based on the reading dictionary, the character strings in the divided kanji-kana mixed text data into readings in correspondence with the division positions of the text data divided by the first information processing means, determining, when there are a plurality of readings for the same character string, the optimal candidate among the plurality of readings based on the phoneme-notation transcription text data input from the input means, and inserting a delimiter symbol at the positions in the phoneme-notation transcription text data input from the input means that correspond to the division positions of the character strings converted into readings, thereby generating delimiter-inserted phoneme-notation text data. A transcription text automatic generation apparatus comprising the above means.

5. The transcription text automatic generation apparatus according to claim 4, wherein the second information processing unit creates a reference pattern including a character string converted into the reading for the DP matching processing, and the phoneme A test pattern is created by extracting a character string of a predetermined length from the written transcription text data, and the reference pattern and the test pattern are subjected to DP matching processing and indicated by the DP matching processing result on the test pattern. A transcription text automatic generation apparatus characterized by inserting a delimiter at a delimiter position.

5. The transcription text automatic generation apparatus according to claim 1 or 4, further comprising: acoustic model storage means storing an acoustic model; voice data input means for inputting learning voice data; decoding means for decoding the learning voice data input from the voice data input means using the delimiter-inserted phoneme-notation text data generated by the second information processing means; and third information processing means for detecting, based on the decoding result, a delimiter whose transit time is zero and creating transcription text data in which the detected delimiter is deleted from the delimiter-inserted phoneme-notation text data.

An input means for inputting text data transcribed with kanji and kana,
First information processing means for dividing the input kanji-kana mixed text data into predetermined recognition units;
Storage means for storing a reading dictionary in which character strings and their readings are described;
Second information processing means for converting the text data divided by the first information processing means into readings based on the reading dictionary and inserting a delimiter at the division positions determined by the first information processing means. A transcription text automatic generation apparatus comprising the above means.

The transcription text automatic generation device according to any one of claims 1, 4, and 8;
Voice data input means for inputting voice data for learning;
Acoustic model creation means for creating an acoustic model based on the transcription text generated by the automatic generation device and the learning speech data input from the speech data input means,
A speech recognition apparatus that performs speech recognition using the created acoustic model.

A recording medium on which is recorded a program to be executed by a transcription text automatic generation device having storage means storing a reading dictionary in which character strings and their readings are described and a plurality of readings are allowed for one character string, the program comprising:
An input step for inputting kanji-kana mixed transcription text data and phoneme-notation transcription text data describing a phoneme notation corresponding to the kanji-kana mixed text data;
A first information processing step of dividing the input kanji-kana mixed text data into predetermined recognition units;
A second information processing step of converting, based on the reading dictionary, the character strings in the divided kanji-kana mixed text data into readings in correspondence with the division positions of the text data divided in the first information processing step, determining, when there are a plurality of readings for the same character string, the optimal candidate among the plurality of readings based on the phoneme-notation transcription text data input in the input step, and inserting a delimiter at the division positions of the character strings converted into readings, thereby generating delimiter-inserted phoneme-notation text data. A recording medium characterized by the above.

A recording medium on which is recorded a program to be executed by a transcription text automatic generation device having storage means storing a reading dictionary in which character strings and their readings are described and a plurality of readings are allowed for one character string, the program comprising:
An input step for inputting kanji-kana mixed transcription text data and phoneme-notation transcription text data describing a phoneme notation corresponding to the kanji-kana mixed text data;
A first information processing step of dividing the input kanji-kana mixed text data into predetermined recognition units;
A second information processing step of converting, based on the reading dictionary, the character strings in the divided kanji-kana mixed text data into readings in correspondence with the division positions of the text data divided in the first information processing step, determining, when there are a plurality of readings for the same character string, the optimal candidate among the plurality of readings based on the phoneme-notation transcription text data input in the input step, and inserting a delimiter at the positions in the phoneme-notation transcription text data input in the input step that correspond to the division positions of the character strings converted into readings, thereby generating delimiter-inserted phoneme-notation text data. A recording medium characterized by the above.