JP2004061609A

JP2004061609A - Method and device for speech recognition

Info

Publication number: JP2004061609A
Application number: JP2002216493A
Authority: JP
Inventors: Shingo Kiuchi; 木内　真吾
Original assignee: Alpine Electronics Inc
Current assignee: Alpine Electronics Inc
Priority date: 2002-07-25
Filing date: 2002-07-25
Publication date: 2004-02-26
Anticipated expiration: 2022-07-25
Also published as: JP4037709B2

Abstract

PROBLEM TO BE SOLVED: To perform supervised speaker-adaptive learning of a speech recognition device in the background.

SOLUTION: In a speech recognition method that compares an input speech pattern with the registered speech patterns of a plurality of words and recognizes the spoken word from the most similar registered speech pattern, the recognition result of the speech recognition device and the input speech pattern are sent to an external registered speech pattern improving device. The improving device generates, from the information sent, improvement data for the registered speech patterns registered in the speech recognition engine of the speech recognition device, and transmits that data. The speech recognition device then updates the registered speech pattern data in its speech recognition engine with the improvement data received.

COPYRIGHT: (C)2004,JPO

Description

[0001]
[Technical field of the invention]
The present invention relates to a speech recognition method and a speech recognition device, and more particularly to a speech recognition method and a speech recognition device using speaker adaptation technology.
[0002]
[Prior art]
To improve speech recognition performance for a specific speaker, a speech recognition device uses a speaker adaptation technique that learns the voice of that speaker. Speaker adaptation techniques fall broadly into two types: supervised and unsupervised. Here, the "teacher" (the supervision signal) is the phoneme notation sequence representing the content of the input utterance. Supervised adaptation is the adaptation method used when the phoneme notation sequence of the input utterance is known; it requires instructing the unknown speaker in advance which vocabulary to utter.
[0003]
Unsupervised adaptation, by contrast, is the adaptation method used when the phoneme notation sequence of the input utterance is unknown. It places no restriction on what the unknown speaker utters; that is, the speaker does not have to be told what to say, and adaptation can be carried out on the speech that is input during actual use of the recognizer, without the speaker being aware of it. The method is therefore easy for the user.
In general, however, unsupervised adaptation yields lower recognition performance after adaptation than supervised adaptation, so supervised adaptation is the more widely used at present.
[0004]
Hereinafter, a conventional speech recognition device using a supervised adaptation technique will be described with reference to FIG. 2.
The utterance S of the speaker input to the speech recognition device 1 is passed to the input pattern creation unit 2, where it undergoes AD conversion, speech analysis, and similar processing and is converted into a time series of feature vectors, one for each unit of a certain time length called a frame. This time series of feature vectors is referred to here as the input pattern. Each feature vector is a set of features extracted from the speech spectrum at that time, and usually has 10 to 100 dimensions.
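The conversion of a digitized utterance into an input pattern can be illustrated with a minimal sketch. The function name `frame_features`, the frame length and hop size, and the two toy features (log energy and zero-crossing count) are assumptions made for illustration; the patent only states that each frame yields a spectral feature vector of roughly 10 to 100 dimensions and does not specify the analysis.

```python
import math

def frame_features(samples, frame_len=4, hop=2):
    """Split a digitized signal into overlapping frames and return one
    toy feature vector (log energy, zero-crossing count) per frame.
    The resulting list plays the role of the 'input pattern'."""
    pattern = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        energy = sum(x * x for x in frame)
        log_energy = math.log(energy + 1e-10)  # small offset avoids log(0)
        zero_crossings = sum(
            1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0)
        )
        pattern.append((log_energy, float(zero_crossings)))
    return pattern

# Hypothetical 8-sample signal; a real utterance has thousands of samples.
input_pattern = frame_features([0.0, 1.0, -1.0, 1.0, -1.0, 0.0, 0.5, 0.5])
```

A practical front end would replace the two toy features with spectral features, but the frame-by-frame structure of the input pattern would stay the same.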
[0005]
Meanwhile, the standard pattern storage unit 3 stores hidden Markov models (HMM: Hidden Markov Model). An HMM is one kind of information source model of speech, and its parameters can be learned from a speaker's speech. An HMM is usually prepared for each predetermined recognition unit; here, the phoneme is taken as the recognition unit. Accordingly, phoneme HMMs are stored in the standard pattern storage unit 3 as standard patterns, as shown in FIG. 3. As the phoneme HMMs, for example, speaker-independent HMMs trained in advance on the utterances of many speakers are used.
Assume that 1000 words are to be recognized, that is, that one correct word is to be found among 1000 recognition candidate words. For word recognition, the vocabulary pattern creation unit 5 concatenates the HMMs of the phonemes constituting each word to create the HMM of the recognition candidate word (word HMM). For 1000-word recognition, word HMMs for 1000 words are created. That is, as shown in FIG. 4, the vocabulary pattern creation unit 5 has a word storage unit 5a that stores the 1000 words and a word HMM creation unit 5b that creates the word HMM of each word by concatenating phoneme HMMs.
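The roles of the word storage unit 5a and the word HMM creation unit 5b can be sketched as follows. Here each phoneme HMM is reduced to a list of state labels, and the phoneme inventory, the two words, and the function name `build_word_hmm` are hypothetical; a real word HMM would also carry the transition and appearance probabilities of each state.

```python
# Hypothetical phoneme HMMs: each is a left-to-right list of state labels.
PHONEME_HMMS = {
    "a": ["a1", "a2"],
    "k": ["k1", "k2"],
    "i": ["i1", "i2"],
}

# Hypothetical word storage (unit 5a): word -> phoneme sequence.
WORDS = {"aki": ["a", "k", "i"], "kai": ["k", "a", "i"]}

def build_word_hmm(phonemes, phoneme_hmms):
    """Concatenate phoneme HMM states in pronunciation order (unit 5b)."""
    states = []
    for phoneme in phonemes:
        states.extend(phoneme_hmms[phoneme])
    return states

word_hmms = {word: build_word_hmm(seq, PHONEME_HMMS) for word, seq in WORDS.items()}
```

Because every word HMM is assembled from the shared phoneme HMMs, updating one phoneme HMM changes every word that contains that phoneme.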
[0006]
The recognition unit 4 recognizes the input pattern using the 1000 word HMMs created by the vocabulary pattern creation unit 5. A phoneme HMM is a model of a speech information source, and introduces a statistical approach into the description of the standard pattern in order to cope with the various fluctuations of speech patterns. A phoneme HMM usually consists of 1 to 10 states and the state transitions between them. A start state and an end state are normally defined; at each unit time, a symbol is output from the current state and a state transition takes place. The speech pattern of a phoneme is represented as the time series of symbols output during the state transitions from the start state to the end state. A symbol appearance probability is defined for each state, and a transition probability for each transition between states. The transition probability parameters express the temporal fluctuation of the speech pattern, and the appearance probability parameters express the fluctuation of its timbre. By setting the probability of the start state to a certain value and multiplying together, for each state transition, the appearance probability and the transition probability along the input pattern, the probability that the input speech was generated by a given HMM (the occurrence probability) can be calculated.
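The occurrence probability described above is commonly computed with the forward algorithm. The following is a minimal sketch for a discrete-symbol HMM, not the patent's own implementation: the function name, the two-state model, and its probabilities are assumptions, and for simplicity the sketch sums over all states at the final frame instead of requiring the path to finish in a designated end state.

```python
def occurrence_probability(observations, start, trans, emit):
    """Forward algorithm: P(observations | HMM) for a discrete-symbol HMM.

    start[i]    : probability of starting in state i
    trans[i][j] : transition probability from state i to state j
    emit[i][s]  : appearance probability of symbol s in state i
    """
    n = len(start)
    alpha = [start[i] * emit[i].get(observations[0], 0.0) for i in range(n)]
    for symbol in observations[1:]:
        alpha = [
            sum(alpha[i] * trans[i][j] for i in range(n)) * emit[j].get(symbol, 0.0)
            for j in range(n)
        ]
    return sum(alpha)

# Hypothetical two-state left-to-right model.
start = [1.0, 0.0]
trans = [[0.5, 0.5], [0.0, 1.0]]
emit = [{"x": 0.9, "y": 0.1}, {"x": 0.2, "y": 0.8}]
p = occurrence_probability(["x", "y"], start, trans, emit)
```

For this toy model the two possible paths contribute 0.9 × 0.5 × 0.1 and 0.9 × 0.5 × 0.8, so `p` comes to 0.405.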
[0007]
In HMM-based speech recognition, a word HMM is prepared for each of the 1000 recognition candidate words. When speech is input, the occurrence probability is computed for the word HMM of each recognition candidate word, the word HMM with the highest probability is taken to be the source, and the recognition candidate word corresponding to that word HMM is taken as the recognition result. The recognition result word is sent to the recognition result output unit 6, which performs processing such as displaying the recognition result on a screen or sending a control command corresponding to the recognition result to another device. In the above description, HMMs are stored in the standard pattern storage unit 3 per phoneme, but word HMMs can instead be stored per word, as shown in FIG. 5. In that case, the vocabulary pattern creation unit 5 is unnecessary.
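The decision rule (score every candidate word model, pick the maximum) can be sketched as below. The `toy_likelihood` function is a deliberately simple stand-in that multiplies per-symbol probabilities; it is not an HMM and is used only so that the example is self-contained. The word models and the floor value 0.01 are likewise hypothetical.

```python
def toy_likelihood(pattern, model, floor=0.01):
    """Stand-in score: product of per-symbol probabilities (not an HMM)."""
    p = 1.0
    for symbol in pattern:
        p *= model.get(symbol, floor)
    return p

def recognize(pattern, word_models, likelihood):
    """Return the candidate word whose model gives the highest score."""
    scores = {word: likelihood(pattern, model) for word, model in word_models.items()}
    best_word = max(scores, key=scores.get)
    return best_word, scores

# Hypothetical two-word vocabulary instead of the 1000 words in the text.
word_models = {
    "on": {"o": 0.5, "n": 0.5},
    "off": {"o": 0.4, "f": 0.6},
}
result, scores = recognize(["o", "n"], word_models, toy_likelihood)
```

In a real recognizer the `likelihood` argument would be the occurrence probability of the input pattern under each word HMM, usually computed in the log domain to avoid underflow.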
[0008]
Next, a supervised speaker adaptation technique for the speech recognition device 1 will be described. In the supervised speaker adaptation, a word to be uttered is instructed to the user in advance, and the parameters of the phoneme HMM are updated using the word notation and the input speech. This is called supervised adaptation in the sense that the correct word for the utterance is known in advance.
[0009]
First, as in recognition, the input pattern creation unit 7 creates an input pattern from the input speech. In supervised adaptation the correct word is known in advance, so the adaptation dictionary creation unit 8 creates an adaptation dictionary from the correct word notation that is input (the input speech notation). Next, the vocabulary pattern creation unit 9a of the supervised adaptation unit 9 creates the word HMM corresponding to the input pattern, using the phoneme sequence in the adaptation dictionary and the initial adaptation HMMs for each phoneme stored in advance in the adaptation initial standard pattern storage unit 9b. The adaptation unit 9c then performs likelihood calculation between the input pattern and the adaptation word HMM, carries out adaptation processing for one or more input patterns, computes the adapted mean vectors to obtain the adapted HMMs, and writes the adapted HMMs into the standard pattern storage unit 3, where they are stored in place of the previous standard HMMs.
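The text says only that adapted mean vectors are computed; it does not give the update formula. As one possible illustration, the sketch below uses a MAP-style interpolation between the initial mean of a state and the frames aligned to that state. The function name `adapt_mean`, the prior weight `tau`, and the assumption that the frame-to-state alignment is already available are all assumptions, not details taken from the patent.

```python
def adapt_mean(initial_mean, frames, tau=10.0):
    """MAP-style mean update: (tau * mu0 + sum of frames) / (tau + n).

    initial_mean : mean vector of the speaker-independent state
    frames       : feature vectors aligned to that state via the transcript
    tau          : hypothetical weight given to the initial mean
    """
    n = len(frames)
    dim = len(initial_mean)
    sums = [sum(frame[d] for frame in frames) for d in range(dim)]
    return [(tau * initial_mean[d] + sums[d]) / (tau + n) for d in range(dim)]

# Two hypothetical 2-dimensional frames aligned to one state.
adapted = adapt_mean([0.0, 0.0], [[1.0, 2.0], [3.0, 4.0]], tau=2.0)
```

With few adaptation frames the result stays close to the initial mean, and with many frames it approaches the speaker's own average, which is why this kind of update is a common choice for speaker adaptation.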
The above-described supervised adaptation technique is well known, and is described in detail in, for example, Japanese Patent Application Laid-Open No. 7-230295.
[0010]
[Problems to be solved by the invention]
However, the supervised adaptation method has the drawback that, separately from the utterances made in ordinary speech recognition, the user must utter the words indicated by the device as training, which is a heavy burden. In other words, the user has to perform work for a purpose (improving recognition performance) that differs from the original purpose of installing the speech recognition device, namely improving the human interface (making the equipment easier to operate). This is cumbersome and imposes a burden on the user.
In view of the above, an object of the present invention is to provide a speech recognition method and a speech recognition apparatus capable of automatically providing the same performance as that of the supervised adaptation method without making the user aware.
It is another object of the present invention to provide a speech recognition method and a speech recognition apparatus which can have the same performance as the supervised adaptation method with a simple configuration.
[0011]
[Means for Solving the Problems]
The first aspect of the present invention is a speech recognition method that compares the registered speech patterns of a plurality of words with an input speech pattern and recognizes the spoken word on the basis of the most similar registered speech pattern. The method comprises: (1) a step of sending the recognition result of the speech recognition device and the input speech pattern to an external registered speech pattern improvement device; (2) a step in which the registered speech pattern improvement device generates, on the basis of the information sent, improvement data for the registered speech patterns registered in the speech recognition engine of the speech recognition device, and transmits that data; and (3) a step in which the speech recognition device updates the registered speech pattern data registered in the speech recognition engine with the improvement data received.
[0012]
The second aspect of the present invention is a speech recognition device that recognizes a spoken word, comprising: (1) a speech pattern registration unit that registers speech patterns corresponding to words; (2) a speech recognition engine that compares the registered speech patterns of a plurality of words with an input speech pattern and recognizes the spoken word on the basis of the most similar registered speech pattern; (3) a storage unit that stores each recognition result together with its input speech pattern as a pair; (4) a data transmission unit that periodically sends the recognition results and input speech patterns held in the storage unit to an external registered speech pattern improvement device; (5) a data reception unit that receives registered speech pattern improvement data from the registered speech pattern improvement device, which generates data for improving the registered speech patterns of the speech recognition engine on the basis of the information sent from the speech recognition device; and (6) an improvement unit that improves the registered speech pattern data registered in the speech recognition engine using the improvement data received.
According to the present invention, supervised speaker-adaptive learning of the speech recognition device can be carried out in the background, so the user no longer needs to supply a teacher signal to the speech recognition device. Moreover, because the learning itself is supervised, there is no risk of the learning proceeding in the wrong direction; accurate learning becomes possible and the recognition rate can be improved.
[0013]
[Embodiments of the invention]
(A) Configuration of the speech recognition system
FIG. 1 is a configuration diagram of a speech recognition system that includes the speech recognition device of the present invention. The system comprises a speech recognition device 11 attached to a device controlled by speech input, and an improvement center 12 that updates the parameters specifying the standard patterns stored in the standard pattern storage unit of the speech recognition device 11 so that recognition performance improves. The speech recognition device 11 and the improvement center 12 can communicate freely by any communication method, for example a LAN or wireless communication via a mobile phone.
[0014]
In the speech recognition device 11, the speech input microphone 21 detects and outputs the speech uttered by the speaker, the AD converter 22 converts the speech signal to digital form, and the speech recognition engine 23, which has the same configuration as the conventional example in FIG. 2 (speech recognition device 1), recognizes the input speech. The speech data file / recognition result storage unit 24 stores the recognition result of the input speech and also stores the input speech pattern data in WAV format. The data transmission unit 25 sends one or more recognition results held in the storage unit 24, each paired with its input speech pattern data, to the improvement center 12. The data reception unit 26 receives improvement data (the parameters specifying the standard patterns) from the improvement center 12, and the installer 27 updates the standard pattern data of the speech recognition engine 23 with the improvement data received.
[0015]
In the improvement center 12, the data reception unit 31 receives the multiple sets of recognition results and input pattern data sent from the speech recognition device 11. The teacher signal generation unit 32 has a speech recognition engine of far higher performance than the speech recognition engine 23 of the speech recognition device 11; it performs speech recognition on the input pattern data of each set and outputs the recognition result as the input speech notation (teacher signal) of that input pattern. The improvement unit 33 uses the input speech notation and the input pattern data sent from the speech recognition device 11 to perform the same adaptation processing as the supervised adaptation unit 9 in FIG. 2, and obtains adapted phoneme HMMs. It then compares the recognition result sent from the speech recognition device with the teacher signal (input speech notation) to find the phonemes that differ, and outputs the adapted HMMs of those differing phonemes. The improvement data storage unit 34 stores the improvement data (the adapted HMM of each phoneme) output by the improvement unit 33, and the data transmission unit 35 sends the improvement data to the speech recognition device.
[0016]
(B) Overall operation of the speech recognition system
(1) When the user uses the speech recognition device 11, the utterance (speech pattern) and the recognition result produced by the speech recognition engine 23 are each recorded in the speech data file / recognition result storage unit 24. The utterance is held, for example, as a WAV file and assigned an ID code. The recognition result is recorded, for example in text form, in association with that ID code.
(2) When a certain period has elapsed (the period may be chosen by the user or set by the manufacturer), the data transmission unit 25 sends the speech data files and recognition results recorded in the speech data file / recognition result storage unit 24, as pairs, to the improvement center 12.
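Steps (1) and (2) on the device side amount to keeping ID-keyed pairs of audio and recognized text and handing them over in a batch. The sketch below illustrates this under stated assumptions: the class and field names, the ID format, and the `drain` method are hypothetical, and the timer that decides when "a certain period has elapsed" as well as the actual network transfer are left out.

```python
from dataclasses import dataclass

@dataclass
class Record:
    utterance_id: str      # ID code linking audio and recognition result
    wav_bytes: bytes       # utterance held in WAV format
    recognized_text: str   # recognition result held as text

class ResultStore:
    """Hypothetical stand-in for the storage unit 24."""

    def __init__(self):
        self._records = []
        self._next_id = 0

    def save(self, wav_bytes, recognized_text):
        record = Record(f"utt{self._next_id:04d}", wav_bytes, recognized_text)
        self._next_id += 1
        self._records.append(record)
        return record.utterance_id

    def drain(self):
        """Return all stored pairs for transmission and clear the store."""
        batch, self._records = self._records, []
        return batch

store = ResultStore()
first_id = store.save(b"RIFF...", "radio")
store.save(b"RIFF...", "volume up")
batch = store.drain()  # what the data transmission unit 25 would send
```

Clearing the store on `drain` is one design choice; a device with unreliable connectivity might instead keep the records until the center acknowledges receipt.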
[0017]
(3) In the improvement center 12, the data reception unit 31 stores the information received from the speech recognition device 11 in its built-in memory and feeds the utterances and recognition results, pair by pair, to the teacher signal generation unit 32 and the improvement unit 33. The teacher signal generation unit 32 performs speech recognition on the input speech pattern data and outputs the recognition result as the input speech notation of the input speech pattern. The improvement unit 33 performs adaptation processing using the input speech notation and the input speech pattern data, and obtains adapted phoneme HMMs. The improvement unit 33 then compares the recognition result sent from the speech recognition device 11 with the teacher signal (input speech notation) to find the phonemes that differ, and outputs the adapted HMMs of those phonemes. The improvement data storage unit 34 stores the improvement data (the adapted HMM of each phoneme) output by the improvement unit 33, and the data transmission unit 35 sends the improvement data to the speech recognition device 11.
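The comparison step of the improvement unit 33 can be sketched as a sequence diff between the device's phoneme sequence and the teacher's. The patent does not say how the sequences are aligned; using Python's `difflib.SequenceMatcher` and collecting the teacher-side phonemes of every non-matching region is an assumption, as are the function names and the placeholder strings standing in for adapted HMMs.

```python
from difflib import SequenceMatcher

def differing_phonemes(device_phonemes, teacher_phonemes):
    """Teacher-side phonemes that the device result did not match."""
    differing = set()
    matcher = SequenceMatcher(None, device_phonemes, teacher_phonemes, autojunk=False)
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            differing.update(teacher_phonemes[j1:j2])
    return differing

def select_improvement_data(device_phonemes, teacher_phonemes, adapted_hmms):
    """Keep only the adapted HMMs of phonemes the device got wrong."""
    wanted = differing_phonemes(device_phonemes, teacher_phonemes)
    return {p: adapted_hmms[p] for p in wanted if p in adapted_hmms}

# Placeholder strings stand in for adapted phoneme HMM parameters.
adapted_hmms = {"k": "HMM_k_adapted", "a": "HMM_a_adapted", "e": "HMM_e_adapted"}
improvement = select_improvement_data(
    ["k", "a", "i"],   # device recognition result
    ["k", "a", "e"],   # teacher signal from the high-performance engine
    adapted_hmms,
)
```

Sending only the HMMs of misrecognized phonemes keeps the improvement data small, which matters when the link to the device is a mobile connection.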
[0018]
(4) On receiving the improvement data, the installer 27 installs the improvement data (the adapted HMM of each phoneme) in the standard pattern storage unit of the speech recognition engine 23, where it is stored in place of the previous standard HMMs.
In the above description, the standard HMM is a phoneme HMM, but may be a word HMM.
[0019]
[Effects of the invention]
As described above, according to the present invention, supervised speaker-adaptive learning of the speech recognition device can be carried out in the background, so the user (speaker) no longer needs to supply a teacher signal to the speech recognition device.
Further, because the learning itself is supervised, there is no risk of the learning proceeding in the wrong direction, and accurate learning becomes possible. The recognition performance that each individual user experiences when using the speech recognition device can therefore be improved.
Further, the user's speech is sent to a center equipped with a high-performance speech recognition engine, improvement data is generated there, and the improvement data is installed in the speech recognition engine of the user's speech recognition device. A speech recognition device with an improved recognition rate for unspecified speakers can thus be provided. In addition, the speech recognition engine on the user's side does not need to be a high-performance one, so the device can be inexpensive and compact.
[Brief description of the drawings]
FIG. 1 is a configuration diagram of a speech recognition system including a speech recognition device of the present invention.
FIG. 2 shows a conventional speech recognition device using a supervised adaptation technique.
FIG. 3 shows an example in which phoneme HMMs are stored as standard patterns.
FIG. 4 is a configuration diagram of the vocabulary pattern creation unit.
FIG. 5 shows an example in which word HMMs are stored as standard patterns.
[Explanation of symbols]
11 Speech recognition device
12 Improvement center
21 Speech input microphone
22 AD converter
23 Speech recognition engine
24 Speech data file / recognition result storage unit
27 Installer
31 Data reception unit
32 Teacher signal generation unit
33 Improvement unit
34 Improvement data storage unit

Claims

1. A speech recognition method for comparing registered speech patterns of a plurality of words with an input speech pattern and recognizing a spoken word on the basis of the most similar registered speech pattern, characterized by:
sending the recognition result of a speech recognition device and the input speech pattern to an external registered speech pattern improvement device;
generating, in the registered speech pattern improvement device, on the basis of the information sent, improvement data for the registered speech patterns registered in the speech recognition engine of the speech recognition device, and transmitting the improvement data; and
updating, in the speech recognition device, the registered speech pattern data registered in the speech recognition engine with the improvement data received.

2. The speech recognition method according to claim 1, wherein the step of generating the registered speech pattern improvement data comprises:
performing speech recognition on the input speech pattern, treating the recognition result as the input speech notation for supervised speaker-adaptive learning, and performing supervised speaker-adaptive learning using that input speech notation to generate the registered speech pattern improvement data.

3. A speech recognition device for recognizing a spoken word, comprising:
a speech pattern registration unit that registers speech patterns corresponding to words;
a speech recognition engine that compares the registered speech patterns of a plurality of words with an input speech pattern and recognizes the spoken word on the basis of the most similar registered speech pattern;
a storage unit that stores each recognition result and its input speech pattern as a pair;
a data transmission unit that periodically sends the recognition results and input speech patterns stored in the storage unit to an external registered speech pattern improvement device;
a data reception unit that receives registered speech pattern improvement data from the registered speech pattern improvement device, which generates data for improving the registered speech patterns of the speech recognition engine of the speech recognition device on the basis of the information sent from the speech recognition device; and
an improvement unit that improves the registered speech pattern data registered in the speech recognition engine using the improvement data received.