JP4037709B2

JP4037709B2 - Speech recognition method and speech recognition system

Info

Publication number: JP4037709B2
Application number: JP2002216493A
Authority: JP
Inventors: 真吾木内
Original assignee: Alpine Electronics Inc
Current assignee: Alpine Electronics Inc
Priority date: 2002-07-25
Filing date: 2002-07-25
Publication date: 2008-01-23
Anticipated expiration: 2022-07-25
Also published as: JP2004061609A

Description

[0001]
TECHNICAL FIELD OF THE INVENTION
The present invention relates to a speech recognition method and a speech recognition system, and more particularly to a speech recognition method and a speech recognition system that use speaker adaptation technology.
[0002]
[Prior art]
To improve recognition performance for a specific speaker, a speech recognition device uses a speaker adaptation technique that learns that speaker's speech. Speaker adaptation techniques fall broadly into two types: supervised and unsupervised. The "teacher" (supervision) here is the phoneme notation string that represents the content of the input utterance. Supervised adaptation is the adaptation method used when the phoneme notation string of the input utterance is known; during adaptation, the unknown speaker must be told in advance which vocabulary to utter.
[0003]
Unsupervised adaptation, on the other hand, is the adaptation method used when the phoneme notation string of the input utterance is unknown. It places no restriction on what the unknown speaker says; that is, the speaker need not be told what to utter, and adaptation can be carried out on speech captured during actual use of the recognizer without the speaker being aware of it, which makes the method convenient for the user.
In general, unsupervised adaptation yields lower recognition performance after adaptation than supervised adaptation, so supervised adaptation is currently the more widely used.
[0004]
A speech recognition device using a conventional supervised adaptation technique is described below with reference to FIG. 2.
The speaker's utterance S input to the speech recognition device 1 is fed to the input pattern creation unit 2, where it passes through A/D conversion, speech analysis, and similar steps and is converted into a time series of feature vectors, one for each unit of fixed duration called a frame. This time series of feature vectors is referred to here as the input pattern. Each feature vector captures the features of the speech spectrum at that time and usually has 10 to 100 dimensions.
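As a rough illustration of how an utterance becomes an input pattern, the following sketch cuts a sampled signal into overlapping frames and computes one log band-energy vector per frame. The function names, the Hamming window, the naive DFT, and the parameter values (`frame_len`, `hop`, `n_bands`) are assumptions chosen for readability; the source does not specify the speech analysis used.

```python
import cmath
import math

def power_spectrum(frame):
    """Naive DFT power spectrum (bins 0..N/2) of one windowed frame."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame))
        spectrum.append(abs(acc) ** 2)
    return spectrum

def make_input_pattern(samples, frame_len=64, hop=32, n_bands=16):
    """Return one log band-energy feature vector per frame (hypothetical features)."""
    # Hamming window to reduce spectral leakage at the frame edges.
    window = [0.54 - 0.46 * math.cos(2 * math.pi * t / (frame_len - 1))
              for t in range(frame_len)]
    pattern = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = [s * w for s, w in zip(samples[start:start + frame_len], window)]
        power = power_spectrum(frame)
        width = len(power) // n_bands          # equal-width bands; leftover top bins are dropped
        bands = [sum(power[b * width:(b + 1) * width]) for b in range(n_bands)]
        pattern.append([math.log(e + 1e-10) for e in bands])
    return pattern

# Usage: a pure tone yields one 16-dimensional vector per frame.
tone = [math.sin(2 * math.pi * 8 * t / 64) for t in range(256)]
input_pattern = make_input_pattern(tone)
```

With these defaults each vector has 16 dimensions, which falls inside the 10 to 100 range mentioned above; a practical front end would use a fast Fourier transform and perceptually motivated bands instead.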
[0005]
Meanwhile, the standard pattern storage unit 3 stores hidden Markov models (HMMs). An HMM is one kind of information-source model of speech, and its parameters can be learned from a speaker's speech. An HMM is usually prepared for each predetermined recognition unit; the phoneme is taken as the recognition unit in this example. Accordingly, the standard pattern storage unit 3 stores phoneme HMMs as standard patterns, as shown in FIG. 3. The phoneme HMMs are, for example, speaker-independent HMMs trained in advance on the utterances of many speakers.
Assume that 1000 words are to be recognized, that is, one correct word is to be chosen from 1000 candidate words. For word recognition, the vocabulary pattern creation unit 5 concatenates the HMMs of the phonemes that make up each word to create an HMM for that candidate word (a word HMM). For 1000-word recognition, word HMMs are created for all 1000 words. That is, as shown in FIG. 4, the vocabulary pattern creation unit 5 has a word storage unit 5a that stores the 1000 words and a word HMM creation unit 5b that concatenates phoneme HMMs to create the word HMM of each word.
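The concatenation performed by the word HMM creation unit 5b can be sketched as follows. The `State` structure, the toy phoneme inventory, and the two-word lexicon are all hypothetical; the sketch assumes left-to-right phoneme HMMs whose states can simply be chained in order.

```python
from dataclasses import dataclass

@dataclass
class State:
    emit: dict    # symbol -> output probability
    stay: float   # self-loop probability; 1 - stay advances to the next state

# Hypothetical phoneme HMMs with toy parameters (stand-in for storage unit 3).
PHONEME_HMMS = {
    "a": [State({"x": 0.8, "y": 0.2}, 0.6)],
    "k": [State({"x": 0.1, "y": 0.9}, 0.3), State({"x": 0.5, "y": 0.5}, 0.4)],
}

def build_word_hmm(phonemes, phoneme_hmms=PHONEME_HMMS):
    """Chain the states of each phoneme HMM, in order, into one word HMM."""
    word_hmm = []
    for p in phonemes:
        word_hmm.extend(phoneme_hmms[p])
    return word_hmm

# Hypothetical word storage (stand-in for unit 5a): word -> phoneme sequence.
LEXICON = {"aka": ["a", "k", "a"], "ka": ["k", "a"]}
WORD_HMMS = {word: build_word_hmm(ph) for word, ph in LEXICON.items()}
```

Because every word is assembled from the same phoneme models, updating a single phoneme HMM changes every word that contains that phoneme.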
[0006]
The recognition unit 4 recognizes the input pattern using the 1000 word HMMs created by the vocabulary pattern creation unit 5. A phoneme HMM is a model of a speech information source, and statistical concepts are introduced into the description of the standard pattern in order to cope with the various fluctuations of speech patterns. A phoneme HMM usually consists of 1 to 10 states and the transitions between them. A start state and an end state are normally defined; at each unit of time, a symbol is output from a state and a state transition takes place. The speech pattern of a phoneme is represented as the time series of symbols output during the transitions from the start state to the end state. A symbol output probability is defined for each state, and a transition probability for each transition between states. The transition probabilities express the temporal fluctuation of the speech pattern, and the output probabilities express the fluctuation of its timbre. By fixing the probability of the start state at some value and multiplying in the output probability and transition probability at each state transition along the input pattern, the probability that the input speech was generated by a given HMM can be calculated.
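The probability calculation described above can be sketched with the standard forward algorithm. This is a minimal sketch under several assumptions that the source does not state: symbols are discrete, the model is left-to-right (each state either loops on itself or advances by one), the start-state probability is fixed at 1, the symbol sequence is non-empty, and the model must leave the end state after the last symbol.

```python
def occurrence_probability(states, symbols):
    """Forward algorithm for a left-to-right HMM.

    states  : list of (emit, stay) pairs; emit maps symbol -> output probability,
              stay is the self-loop probability (1 - stay advances one state).
    symbols : non-empty observed symbol sequence, one symbol per unit time.
    """
    n = len(states)
    # alpha[i]: probability of emitting the symbols so far and being in state i.
    alpha = [0.0] * n
    alpha[0] = states[0][0].get(symbols[0], 0.0)   # start-state probability fixed at 1
    for sym in symbols[1:]:
        nxt = [0.0] * n
        for i, (emit, stay) in enumerate(states):
            arrive = alpha[i] * stay                               # stayed in state i
            if i > 0:
                arrive += alpha[i - 1] * (1.0 - states[i - 1][1])  # advanced from i-1
            nxt[i] = arrive * emit.get(sym, 0.0)
        alpha = nxt
    return alpha[-1] * (1.0 - states[-1][1])       # leave the end state

# Usage: a two-state model that emits "x" and then "y".
toy_hmm = [({"x": 1.0}, 0.5), ({"y": 1.0}, 0.5)]
p = occurrence_probability(toy_hmm, ["x", "y"])
```

Real systems work with continuous feature vectors and log probabilities to avoid underflow; the discrete version is used here only because it matches the symbol-based description in the text.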
[0007]
In HMM-based speech recognition, a word HMM is prepared for each of the 1000 candidate words. When speech is input, the generation probability is computed for the word HMM of each candidate word, the word HMM with the highest probability is taken to be the source, and the candidate word corresponding to that word HMM is taken as the recognition result. The recognized word is sent to the recognition result output unit 6, which displays the result on a screen or sends a control command corresponding to the result to another device. In the description above, the standard pattern storage unit 3 stores HMMs per phoneme, but word HMMs can instead be stored per word, as shown in FIG. 5. In that case the vocabulary pattern creation unit 5 is unnecessary.
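The selection rule in this paragraph amounts to an arg-max over the candidate words. The sketch below shows only that rule: `log_score` is a deliberately simple, memoryless stand-in for the HMM generation probability, and the two word models are hypothetical.

```python
import math

def log_score(model, pattern):
    """Stand-in score: log probability under a memoryless symbol model (not an HMM)."""
    return sum(math.log(model.get(sym, 1e-6)) for sym in pattern)

def recognize(input_pattern, word_models, score=log_score):
    """Return the candidate word whose model scores the input pattern highest."""
    return max(word_models, key=lambda word: score(word_models[word], input_pattern))

# Hypothetical two-word vocabulary.
WORD_MODELS = {
    "left": {"l": 0.9, "r": 0.1},
    "right": {"l": 0.1, "r": 0.9},
}
result = recognize("llr", WORD_MODELS)
```

Any scoring function with the same signature, including an HMM likelihood, can be passed as `score` without changing the selection logic.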
[0008]
Next, supervised speaker adaptation for the speech recognition device 1 is described. In supervised speaker adaptation, the user is told in advance which words to utter, and the parameters of the phoneme HMMs are updated using the word notation and the input speech. It is called supervised adaptation because the correct word for each utterance is known in advance.
[0009]
First, as in recognition, the input pattern creation unit 7 creates an input pattern from the input speech. Because the correct word is known in advance in supervised adaptation, the adaptation dictionary creation unit 8 creates an adaptation dictionary from the supplied correct-word notation (the input speech notation). Next, the vocabulary pattern creation unit 9a of the supervised adaptation unit 9 creates the word HMM corresponding to the input pattern, using the phoneme sequence in the adaptation dictionary and the per-phoneme adaptation-initial HMMs stored in advance in the adaptation-initial standard pattern storage unit 9b. The adaptation unit 9c then computes the likelihood between the input pattern and the adaptation word HMM, performs the adaptation process on one or more input patterns, calculates the adapted mean vectors to obtain the adapted HMMs, and writes the adapted HMMs to the standard pattern storage unit 3, where they replace the previous standard HMMs.
The supervised adaptation technique described above is well known and is described in detail, for example, in Japanese Patent Laid-Open No. 7-230295.
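The text only says that adapted mean vectors are calculated, without giving the formula. As one plausible illustration, the sketch below uses a MAP-style interpolation between the initial mean and the weighted average of the frames aligned to a state. This update rule, the function name, and the prior weight `tau` are assumptions and are not taken from the source or from the cited publication.

```python
def adapt_mean(initial_mean, frames, weights, tau=10.0):
    """MAP-style update of one state's mean vector (assumed rule).

    initial_mean : mean vector of the adaptation-initial HMM state
    frames       : feature vectors aligned to this state
    weights      : per-frame occupancy weights derived from the likelihood calculation
    tau          : hypothetical prior weight; larger values trust the initial mean more
    """
    total = sum(weights)
    dim = len(initial_mean)
    weighted_sum = [sum(w * f[d] for w, f in zip(weights, frames)) for d in range(dim)]
    return [(tau * initial_mean[d] + weighted_sum[d]) / (tau + total) for d in range(dim)]

# Usage: two frames pull the mean halfway toward the observed data when tau equals 2.
new_mean = adapt_mean([0.0, 0.0], [[2.0, 4.0], [2.0, 4.0]], [1.0, 1.0], tau=2.0)
```

With no adaptation frames the function returns the initial mean unchanged, so phonemes that never occur in the adaptation data keep their speaker-independent parameters.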
[0010]
[Problems to be solved by the invention]
However, the supervised adaptation method has the drawback that, apart from the utterances made during ordinary speech recognition, the user must utter words specified by the device as training, which is a heavy burden. In other words, the user must perform a task whose purpose (improving recognition performance) differs from the original purpose of installing a speech recognition device, namely improving the human interface (making the equipment easier to operate); this is tedious and burdens the user.
In view of the above, an object of the present invention is to provide a speech recognition method and a speech recognition system that automatically achieve performance equivalent to that of supervised adaptation without the user being aware of it.
Another object of the present invention is to provide a speech recognition method and a speech recognition system that achieve performance equivalent to that of supervised adaptation with a simple configuration.
[0011]
[Means for Solving the Problems]
A first aspect of the present invention is a speech recognition method that compares registered speech patterns of a plurality of words with an input speech pattern and recognizes the spoken word based on the most similar registered speech pattern. The method comprises: a step of providing a speech recognition device with a speech recognition engine that recognizes input speech using standard patterns (speech patterns) registered in correspondence with phonemes, and sending the speech recognition result of that engine together with the input speech pattern to an external registered speech pattern improvement device; a step in which the registered speech pattern improvement device performs speech recognition based on the input speech pattern sent from the speech recognition device; a step of regarding the input speech notation obtained as that recognition result as the teacher signal for supervised speaker adaptive learning; a step of performing supervised speaker adaptation processing using the teacher signal and the input speech pattern sent from the speech recognition device, comparing the teacher signal with the speech recognition result sent from the speech recognition device to find the phonemes that differ, and generating improvement data for the registered speech patterns of those differing phonemes based on the result of the speaker adaptation processing; a step of transmitting the improvement data for the registered speech patterns to the speech recognition device; and a step in which the speech recognition device updates the registered speech patterns registered in the speech recognition engine with the improvement data received.
[0012]
A second aspect of the present invention is a speech recognition system comprising a speech recognition device that compares registered speech patterns of a plurality of words with an input speech pattern and recognizes the spoken word based on the most similar registered speech pattern, and a registered speech pattern improvement device that improves the registered speech patterns in the speech recognition device. The speech recognition device has a speech recognition engine that recognizes input speech using standard patterns (speech patterns) registered in correspondence with phonemes, and sends the speech recognition result of that engine together with the input speech pattern to the external registered speech pattern improvement device. The registered speech pattern improvement device performs speech recognition based on the input speech pattern sent from the speech recognition device, regards the input speech notation obtained as that recognition result as the teacher signal for supervised speaker adaptive learning, performs supervised speaker adaptation processing using the teacher signal and the input speech pattern sent from the speech recognition device, compares the teacher signal with the speech recognition result sent from the speech recognition device to find the phonemes that differ, generates improvement data for the registered speech patterns of those differing phonemes based on the result of the speaker adaptation processing, and transmits the improvement data to the speech recognition device. The speech recognition device updates the registered speech patterns registered in the speech recognition engine with the improvement data received.
According to the present invention, supervised speaker adaptive learning of the speech recognition device can be performed in the background, so the user no longer needs to supply a teacher signal to the device. Moreover, because the learning itself is supervised, there is no risk of the learning going in the wrong direction; accurate learning becomes possible and the recognition rate improves.
[0013]
DETAILED DESCRIPTION OF THE INVENTION
(A) Configuration of the speech recognition system
FIG. 1 is a configuration diagram of a speech recognition system including the speech recognition device of the present invention. The system comprises a speech recognition device 11 attached to equipment controlled by voice input, and an improvement center 12 that updates the parameters specifying the standard patterns stored in the standard pattern storage unit of the speech recognition device 11 so that recognition performance improves. The speech recognition device 11 and the improvement center 12 can communicate freely by any communication method, for example a LAN or wireless communication via a mobile phone.
[0014]
In the speech recognition device 11, the voice input microphone 21 picks up and outputs the speech uttered by the speaker, the AD converter 22 converts the speech signal to digital form, and the speech recognition engine 23, which has the same configuration as the conventional example of FIG. 2 (speech recognition device 1), recognizes the input speech. The speech data file and recognition result storage unit 24 stores the recognition result of the input speech and stores the input speech pattern data in WAV format. The data transmission unit 25 pairs one or more recognition results stored in the storage unit 24 with the corresponding input speech pattern data and transmits them to the improvement center 12. The data reception unit 26 receives improvement data (the parameters specifying the standard patterns) from the improvement center 12, and the installer 27 updates the standard pattern data of the speech recognition engine 23 with the received improvement data.
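A minimal sketch of what the storage unit 24 and the pairing step of the data transmission unit 25 might look like on a file system follows. The file naming scheme (`<id>.wav` and `<id>.txt`), the function names, and the audio parameters (mono, 16-bit, 16 kHz) are assumptions made for the example; the source only states that speech is kept in WAV format and paired with its recognition result.

```python
import os
import wave

def save_utterance(store_dir, utt_id, pcm_bytes, recognized_text, sample_rate=16000):
    """Store one utterance as <id>.wav and its recognition result as <id>.txt."""
    wav_path = os.path.join(store_dir, f"{utt_id}.wav")
    with wave.open(wav_path, "wb") as wf:
        wf.setnchannels(1)            # mono (assumed)
        wf.setsampwidth(2)            # 16-bit samples (assumed)
        wf.setframerate(sample_rate)
        wf.writeframes(pcm_bytes)
    txt_path = os.path.join(store_dir, f"{utt_id}.txt")
    with open(txt_path, "w", encoding="utf-8") as f:
        f.write(recognized_text)
    return wav_path, txt_path

def collect_pairs(store_dir):
    """Pair every stored WAV file with the recognition result sharing its ID."""
    pairs = []
    for name in sorted(os.listdir(store_dir)):
        if name.endswith(".wav"):
            utt_id = name[:-4]
            with open(os.path.join(store_dir, utt_id + ".txt"), encoding="utf-8") as f:
                pairs.append((utt_id, os.path.join(store_dir, name), f.read()))
    return pairs
```

The list returned by `collect_pairs` is what would be handed to whatever transport carries the data to the improvement center; the transport itself is left out because the source allows any communication method.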
[0015]
In the improvement center 12, the data reception unit 31 receives the multiple sets of recognition results and input pattern data sent from the speech recognition device 11. The teacher signal generation unit 32 has a speech recognition engine of far higher performance than the speech recognition engine 23 of the speech recognition device 11; it performs speech recognition on the input pattern data of each set and outputs the recognition result as the input speech notation (teacher signal) of that input pattern. The improvement unit 33 uses the input speech notation and the input pattern data sent from the speech recognition device 11 to perform the same adaptation process as the supervised adaptation unit 9 of FIG. 2, obtaining adapted phoneme HMMs. It then compares the recognition result sent from the speech recognition device with the teacher signal (input speech notation) to find the phonemes that differ, and outputs the adapted HMMs of those differing phonemes. The improvement data storage unit 34 stores the improvement data (the adapted HMM of each phoneme) output by the improvement unit 33, and the data transmission unit 35 transmits the improvement data to the speech recognition device.
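The comparison step of the improvement unit 33 can be sketched as a sequence alignment between the device's phoneme sequence and the teacher signal. The source does not say how the comparison is done or which side's phonemes count as "different"; this sketch assumes an alignment with Python's `difflib.SequenceMatcher` and collects the teacher-side phonemes in every region where the two sequences disagree. The function names are hypothetical.

```python
from difflib import SequenceMatcher

def differing_phonemes(device_result, teacher_signal):
    """Return the teacher-side phonemes where the device's result disagrees."""
    matcher = SequenceMatcher(a=device_result, b=teacher_signal, autojunk=False)
    differing = set()
    for tag, _, _, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # For a pure deletion j1 == j2, so nothing is added.
            differing.update(teacher_signal[j1:j2])
    return differing

def select_improvement_data(adapted_hmms, device_result, teacher_signal):
    """Keep only the adapted HMMs of phonemes the device got wrong."""
    wanted = differing_phonemes(device_result, teacher_signal)
    return {p: hmm for p, hmm in adapted_hmms.items() if p in wanted}

# Usage: the device heard "kata" where the center recognized "kasa".
device = ["k", "a", "t", "a"]
teacher = ["k", "a", "s", "a"]
improvement = select_improvement_data({"k": 1, "a": 2, "s": 3}, device, teacher)
```

Sending only the HMMs of the differing phonemes keeps the transmitted improvement data small, which fits the stated aim of a simple, inexpensive device.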
[0016]
(B) Overall operation of the speech recognition system
(1) When the user uses the speech recognition device 11, the utterance content (speech pattern) and the recognition result produced by the speech recognition engine 23 are each recorded in the speech data file and recognition result storage unit 24. The utterance content is held, for example, as a WAV file and is assigned an ID code. The recognition result is recorded, for example, in text form in association with that ID code.
(2) When a fixed period has elapsed (it may be chosen by the user or set by the manufacturer), the data transmission unit 25 pairs the speech data files recorded in the speech data file and recognition result storage unit 24 with their recognition results and transmits them to the improvement center 12.
[0017]
(3) In the improvement center 12, the data reception unit 31 stores the information received from the speech recognition device 11 in its built-in memory and sequentially inputs each utterance paired with its recognition result to the teacher signal generation unit 32 and the improvement unit 33. The teacher signal generation unit 32 performs speech recognition on the input speech pattern data and outputs the recognition result as the input speech notation of the input speech pattern; the improvement unit 33 performs the adaptation process using the input speech notation and the input speech pattern data to obtain adapted phoneme HMMs. The improvement unit 33 then compares the recognition result sent from the speech recognition device 11 with the teacher signal (input speech notation) to find the phonemes that differ, and outputs the adapted HMMs of those differing phonemes. The improvement data storage unit 34 stores the improvement data (the adapted HMM of each phoneme) output by the improvement unit 33, and the data transmission unit 35 transmits the improvement data to the speech recognition device 11.
[0018]
(4) On receiving the improvement data, the installer 27 installs it (the adapted HMM of each phoneme) in the standard pattern storage unit of the speech recognition engine 23, where it replaces the previous standard HMMs.
In the above description the standard HMMs are phoneme HMMs, but they may instead be word HMMs.
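Steps (2) to (4) can be tied together in one short sketch. Everything here is schematic: `center_recognize` and `adapt` are caller-supplied stand-ins for the center's recognition engine and the supervised adaptation process, HMMs are represented by opaque values, and the phoneme comparison is a simple position-by-position check that assumes both sequences have the same length.

```python
def run_improvement_cycle(records, center_recognize, adapt, standard_patterns):
    """One background improvement cycle, steps (2) to (4).

    records           : list of (speech_data, device_phonemes) pairs from the device
    center_recognize  : speech_data -> phoneme list (the center's stronger engine)
    adapt             : (speech_data, teacher_phonemes) -> {phoneme: adapted_hmm}
    standard_patterns : {phoneme: hmm} currently installed on the device
    """
    improvement_data = {}
    for speech_data, device_phonemes in records:
        teacher = center_recognize(speech_data)      # teacher signal
        adapted = adapt(speech_data, teacher)        # supervised adaptation
        # Simplified comparison: assumes equal-length phoneme sequences.
        wrong = {t for d, t in zip(device_phonemes, teacher) if d != t}
        improvement_data.update({p: adapted[p] for p in wrong if p in adapted})
    updated = dict(standard_patterns)                # leave the caller's store untouched
    updated.update(improvement_data)                 # installer step
    return updated

# Usage with stub components.
records = [("wav-0001", ["k", "a", "t", "a"])]
standard = {"k": "std-k", "a": "std-a", "s": "std-s", "t": "std-t"}
installed = run_improvement_cycle(
    records,
    center_recognize=lambda speech: ["k", "a", "s", "a"],
    adapt=lambda speech, teacher: {p: "adapted-" + p for p in teacher},
    standard_patterns=standard,
)
```

Only the phoneme the device misrecognized is replaced; all other standard patterns stay as they were.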
[0019]
[Effects of the Invention]
As described above, according to the present invention, supervised speaker adaptive learning of the speech recognition device can be performed in the background, so the user (speaker) does not need to supply a teacher signal to the speech recognition device.
Further, according to the present invention, because the learning itself is supervised, there is no risk of the learning going in the wrong direction and accurate learning becomes possible, so the recognition performance that each user experiences when using the speech recognition device can be improved.
Further, according to the present invention, the user's speech is sent to a center equipped with a high-performance speech recognition engine to generate improvement data, and the improvement data is installed in the speech recognition engine of the user's speech recognition device, so a speaker-independent speech recognition device with an improved recognition rate can be provided. In addition, the user's speech recognition engine need not be a high-performance one, so the device can be inexpensive and compact.
[Brief description of the drawings]
FIG. 1 is a configuration diagram of a speech recognition system including the speech recognition device of the present invention.
FIG. 2 shows a speech recognition device using a conventional supervised adaptation technique.
FIG. 3 shows an example in which phoneme HMMs are stored as standard patterns.
FIG. 4 is a configuration diagram of the vocabulary pattern creation unit.
FIG. 5 shows an example in which word HMMs are stored as standard patterns.
[Explanation of symbols]
11 Speech recognition device
12 Improvement center
21 Voice input microphone
22 AD converter
23 Speech recognition engine
24 Speech data file and recognition result storage unit
27 Installer
31 Data reception unit
32 Teacher signal generation unit
33 Improvement unit
34 Improvement data storage unit

Claims

In a speech recognition method for comparing a registered speech pattern of a plurality of words and an input speech pattern and recognizing a speech input word based on the most similar registered speech pattern,
a speech recognition engine that recognizes input speech using standard patterns (speech patterns) registered in correspondence with phonemes is provided in a speech recognition device, and the speech recognition result of the speech recognition engine and the input speech pattern are sent to an external registered speech pattern improvement device;
in the registered speech pattern improvement device,
speech recognition is performed based on the input speech pattern sent from the speech recognition device,
the input speech notation that is the result of that speech recognition is regarded as a teacher signal in supervised speaker adaptive learning,
supervised speaker adaptation processing is performed using the teacher signal and the input speech pattern sent from the speech recognition device, the teacher signal is compared with the speech recognition result sent from the speech recognition device to obtain the phonemes that differ, and improvement data for the registered speech patterns of the differing phonemes is generated based on the result of the speaker adaptation processing,
the improvement data for the registered speech patterns is transmitted to the speech recognition device; and
in the speech recognition device, the registered speech patterns registered in the speech recognition engine are updated with the improvement data received.
A speech recognition method characterized by the above.

In a speech recognition system comprising a speech recognition device that compares registered speech patterns of a plurality of words with an input speech pattern and recognizes the spoken word based on the most similar registered speech pattern, and a registered speech pattern improvement device that improves the registered speech patterns in the speech recognition device,
the speech recognition device includes a speech recognition engine that recognizes input speech using standard patterns (speech patterns) registered in correspondence with phonemes, and sends the speech recognition result of the speech recognition engine and the input speech pattern to the external registered speech pattern improvement device,
the registered speech pattern improvement device performs speech recognition based on the input speech pattern sent from the speech recognition device, regards the input speech notation that is the result of that speech recognition as a teacher signal in supervised speaker adaptive learning, performs supervised speaker adaptation processing using the teacher signal and the input speech pattern sent from the speech recognition device, compares the teacher signal with the speech recognition result sent from the speech recognition device to obtain the phonemes that differ, generates improvement data for the registered speech patterns of the differing phonemes based on the result of the speaker adaptation processing, and transmits the improvement data for the registered speech patterns to the speech recognition device, and
the speech recognition device updates the registered speech patterns registered in the speech recognition engine with the improvement data received.
A speech recognition system characterized by the above.