JP2010175765A

JP2010175765A - Speech recognition device and speech recognition program

Info

Publication number: JP2010175765A
Application number: JP2009017524A
Authority: JP
Inventors: Shinichi Honma; 真一本間; Toru Imai; 亨今井
Original assignee: Nippon Hoso Kyokai NHK; Japan Broadcasting Corp
Current assignee: Japan Broadcasting Corp
Priority date: 2009-01-29
Filing date: 2009-01-29
Publication date: 2010-08-12
Anticipated expiration: 2029-01-29
Also published as: JP5054711B2

Abstract

PROBLEM TO BE SOLVED: To provide a speech recognition device that prevents recognition errors from recurring by feeding back the corrections an operator makes to speech recognition errors.

SOLUTION: The speech recognition device 1 includes: a cache storage means 211 that stores a predetermined number of words from corrected word strings; a cache score calculation means 212 that calculates, as a cache score, the probability with which each stored word appears in the cache storage means 211; a language score correction means 221 that generates a corrected language score by adding the cache score to the language score, which is the appearance probability of the word obtained from a language model 12a; and a search means 232 that, on the basis of the corrected language score, searches the language model 12a for the word string with the maximum connection probability and takes it as the speech recognition result.

COPYRIGHT: (C)2010, JPO&INPIT

Description

The present invention relates to a speech recognition device and a speech recognition program that perform speech recognition using a language model.

At present, to broadcast live programs such as news with subtitles, a speech recognition device converts speech such as a read news script into character data in real time, and the subtitles are created from that data. The character data produced by such a device generally contains errors of several percent. A system has therefore been disclosed in which an operator who listens to the speech visually detects errors in the recognition result and corrects them through a keyboard on a correction device, so that a correct speech recognition result is generated and output in real time (see, for example, Patent Document 1).

Separately, as a technique for raising recognition accuracy, applying a cache to the language model has been disclosed (see, for example, Non-Patent Document 1). This technique exploits the local property that a word used recently is likely to be used again. It weights the language model in consideration of, among other things, the interval at which words in the cache appear, which raises the appearance probability of words that appeared more recently in the cache and thereby improves recognition accuracy.

Patent Document 1: JP 2005-49655 A

Non-Patent Document 1: Yu Yamashita et al., "Effects of language model adaptation considering distances between identical words", Acoustical Society Autumn Meeting Proceedings, 3-1-7, pp. 91-94, September 2008

The system described above, in which an operator corrects the recognition errors of the speech recognition device in real time, can output a correct recognition result because the errors are fixed by hand. However, the operator's corrections are not reflected in the speech recognition device, so the device keeps making the same errors, and the operator has to make the same correction over and over.

Speech recognition with a cache, on the other hand, can raise recognition accuracy within the language model in use, but when a recognition error does occur, the result of correcting it cannot be applied to subsequent recognition. Conventional cache-based speech recognition therefore shares the problem of the system above: it cannot prevent errors from recurring.

The present invention has been made in view of these problems, and its object is to provide a speech recognition device and a speech recognition program that can prevent recognition errors from recurring by feeding back the corrections an operator makes to speech recognition errors.

The present invention was devised to achieve this object. The speech recognition device according to claim 1 is a speech recognition device in a speech recognition system in which a speech signal is recognized using a pronunciation dictionary, an acoustic model, and a language model, an operator corrects recognition errors in the resulting character string with a correction device, and the corrected string is output as the speech recognition result. The device comprises a cache storage means, a cache score calculation means, a language score correction means, and a search means.

In this configuration, the speech recognition device receives the corrected character string produced by the correction device and stores the words that make up that string in the cache storage means, up to a predetermined number of words. As a result, the cache storage means always holds the words of the most recent corrections.
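The bounded storage described above can be sketched as a fixed-length queue. This is only an illustration: the class name `CorrectionCache`, the parameter `capacity`, and the sample words are assumptions, and this section does not say how the actual cache storage means is implemented.

```python
from collections import deque


class CorrectionCache:
    """Hypothetical cache that keeps only the most recent `capacity` words."""

    def __init__(self, capacity):
        # A deque with maxlen discards the oldest entry when a new one
        # is appended to a full queue.
        self.words = deque(maxlen=capacity)

    def add_corrected_string(self, words):
        # `words` is one corrected character string, already split into words.
        self.words.extend(words)


cache = CorrectionCache(capacity=4)
cache.add_corrected_string(["文部", "科学", "省"])
cache.add_corrected_string(["予算", "案"])
print(list(cache.words))
```

With a capacity of four, the oldest word of the first corrected string is pushed out when the second string arrives.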

The cache score calculation means then calculates, for each word stored in the cache storage means, the probability with which that word appears in the cache storage means, and uses it as the cache score. Next, the language score correction means adds this cache score to the language score, which is the appearance probability of the word obtained from the language model, to produce a corrected language score. The language score thus gains a weight for the words most recently received from the correction device, so those words receive a higher language score.
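As a minimal sketch of the two steps above, assume the cache score is the relative frequency of a word among the cached words and that both scores are plain probabilities added directly. The function names and the sample probability are hypothetical, and this section does not state whether the real device works with logarithms or applies an interpolation weight.

```python
from collections import Counter


def cache_score(word, cache_words):
    """Relative frequency of `word` in the cache (0.0 for an empty cache)."""
    if not cache_words:
        return 0.0
    return Counter(cache_words)[word] / len(cache_words)


def corrected_language_score(word, lm_score, cache_words):
    """Language score plus cache score, following the description in the text."""
    return lm_score + cache_score(word, cache_words)


cache_words = ["文部", "科学", "省", "科学"]
base = 0.01  # hypothetical language-model probability
boosted = corrected_language_score("科学", base, cache_words)
unchanged = corrected_language_score("楽章", base, cache_words)
```

A word that is in the cache ("科学") ends up with a higher score than one that is not ("楽章"), even though both start from the same language score.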

Finally, based on the corrected language score generated by the language score correction means, the search means searches the language model for the word string with the maximum connection probability and takes it as the speech recognition result. Because the words most recently received from the correction device have a higher language score, they are more likely to be selected in the word string with the maximum connection probability.
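The effect of the corrected language score on the search can be illustrated with a deliberately simplified sketch. It scores a handful of complete candidate word strings exhaustively, uses unigram probabilities in place of n-gram connection probabilities, and ignores acoustic scores. All of these are simplifying assumptions, and the function name and probability values are made up for the example.

```python
import math


def best_word_string(candidates, lm_prob, cache_words):
    """Return the candidate whose summed log corrected score is largest."""

    def cache_score(word):
        return cache_words.count(word) / len(cache_words) if cache_words else 0.0

    def log_score(words):
        # Corrected language score = language score + cache score, per word.
        return sum(math.log(lm_prob[w] + cache_score(w)) for w in words)

    return max(candidates, key=log_score)


lm_prob = {"文部": 0.02, "か": 0.05, "楽章": 0.01, "科学": 0.01, "省": 0.02}
candidates = [["文部", "か", "楽章"], ["文部", "科学", "省"]]

before = best_word_string(candidates, lm_prob, cache_words=[])
after = best_word_string(candidates, lm_prob, cache_words=["文部", "科学", "省"])
```

With an empty cache the first candidate wins on language score alone. Once the corrected words are in the cache, the second candidate is selected instead.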

The speech recognition device according to claim 2 is the speech recognition device according to claim 1, further comprising an IDF value storage means.

In this configuration, the IDF value storage means stores, for each specific word, its IDF (Inverse Document Frequency) value, a measure of how many documents in a collection that word appears in.
The cache score calculation means then applies to each cache score, as a weighting value, the IDF value stored in the IDF value storage means for the corresponding word. Using IDF values in this way gives a larger weight to keyword-like words that appear only in particular documents.
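The IDF weighting can be sketched as follows, using the common definition log(N / df) and applying the weight by multiplication. Both choices are assumptions, since this section says only that the IDF value is applied as a weighting value. The function names and the toy documents are hypothetical.

```python
import math


def idf(word, documents):
    """log(N / df): large for words that occur in few documents."""
    df = sum(1 for doc in documents if word in doc)
    # A word seen in no document gets 0.0 here; this is an arbitrary choice.
    return math.log(len(documents) / df) if df else 0.0


def weighted_cache_score(word, cache_words, idf_table):
    """Cache score (relative frequency) scaled by the word's IDF value."""
    raw = cache_words.count(word) / len(cache_words) if cache_words else 0.0
    return idf_table.get(word, 0.0) * raw


documents = [{"文部", "の"}, {"予算", "の"}, {"天気", "の"}]
idf_table = {w: idf(w, documents) for w in ("文部", "の")}
cache_words = ["文部", "の", "の"]

keyword_score = weighted_cache_score("文部", cache_words, idf_table)
particle_score = weighted_cache_score("の", cache_words, idf_table)
```

The particle "の" occurs in every document, so its IDF is zero and its cache score is suppressed even though it is the most frequent word in the cache.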

The speech recognition device according to claim 3 is the speech recognition device according to claim 1 or claim 2, wherein the corrected character string is a string in which words corrected by the operator and words left uncorrected are mixed, and the cache score calculation means applies a predetermined weighting value to the cache score of each word that the operator corrected in the corrected character string received from the correction device.

In this configuration, a weight is added to the cache score of the words the operator corrected, so corrected words receive a still higher language score. Whether a word was corrected can be obtained from the correction device. Alternatively, the speech recognition device may obtain the character strings before and after correction from the correction device and decide by comparing them.
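The second option, comparing the strings before and after correction, might look like the sketch below. It uses Python's standard `difflib.SequenceMatcher` on word lists. The function name, the constant `CORRECTED_WEIGHT`, and its value are hypothetical, and the text does not say which comparison method the device actually uses.

```python
from difflib import SequenceMatcher


def corrected_word_flags(before, after):
    """Return one bool per word of `after`: True if the operator changed it."""
    flags = [True] * len(after)
    matcher = SequenceMatcher(a=before, b=after, autojunk=False)
    for block in matcher.get_matching_blocks():
        # Words inside a matching block are identical in both strings.
        for j in range(block.b, block.b + block.size):
            flags[j] = False
    return flags


before = ["今日", "文部", "か", "楽章", "は"]
after = ["今日", "文部", "科学", "省", "は"]
flags = corrected_word_flags(before, after)

CORRECTED_WEIGHT = 2.0  # hypothetical predetermined weighting value
weights = [CORRECTED_WEIGHT if changed else 1.0 for changed in flags]
```

Only "科学" and "省" are flagged as corrected, so only their cache scores would receive the extra weight.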

The speech recognition device according to claim 4 is the speech recognition device according to any one of claims 1 to 3, further comprising a known word conversion means.

In this configuration, the known word conversion means examines each word of the corrected character string received from the correction device and decomposes any unknown word, that is, a word not registered in the language model, into known words that are registered in the language model. An unknown word is thereby converted into known words for which the language model has probability values, and these are stored in the cache storage means.
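One way to picture the decomposition is a dictionary-based segmentation that prefers the fewest known words. This is a sketch under that assumption. The text does not specify the decomposition algorithm, and the function name and vocabulary are made up.

```python
def decompose(word, vocabulary):
    """Split `word` into known words; return None if no full split exists."""
    # best[i] holds a segmentation of word[:i] that uses the fewest known words.
    best = [None] * (len(word) + 1)
    best[0] = []
    for end in range(1, len(word) + 1):
        for start in range(end):
            piece = word[start:end]
            if best[start] is not None and piece in vocabulary:
                candidate = best[start] + [piece]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[len(word)]


vocabulary = {"文部", "科学", "省"}
pieces = decompose("文部科学省", vocabulary)
failed = decompose("文部大臣", vocabulary)
```

An unknown compound made entirely of known words is split into them. A word with a part that is not in the vocabulary cannot be decomposed, and the function returns `None`.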

The speech recognition device according to claim 5 is the speech recognition device according to any one of claims 1 to 3, further comprising an external dictionary storage means and an unknown word substitution means.

In this configuration, the external dictionary storage means stores an external dictionary, a second pronunciation dictionary with more registered words than the pronunciation dictionary. For each word of the corrected character string received from the correction device, the unknown word substitution means obtains from the external dictionary the pronunciation of any unknown word not registered in the language model, and sets the probability with which that unknown word appears by substituting the probability value registered in advance in the language model as the connection probability of unknown words. An unknown word received from the correction device is thus assigned the unknown-word probability value already registered in the language model.
The search means then searches for the word string with the maximum connection probability, based on the connection probabilities in the language model and the substituted connection probability of the unknown word.
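The substitution step can be sketched with plain dictionaries standing in for the external dictionary, the pronunciation lexicon, and the language model. The function name, the `<UNK>` token, and all sample entries are assumptions made for illustration.

```python
def register_unknown_word(word, external_dict, lexicon, lm_prob, unk_token="<UNK>"):
    """Make `word` recognizable by borrowing the unknown-word probability.

    Returns True if the word is usable afterwards, False otherwise.
    """
    if word in lm_prob:
        return True  # already a known word, nothing to do
    pronunciation = external_dict.get(word)
    if pronunciation is None:
        return False  # the external dictionary has no pronunciation
    lexicon[word] = pronunciation
    # Substitute the probability registered in advance for unknown words.
    lm_prob[word] = lm_prob[unk_token]
    return True


external_dict = {"霞が関": "カスミガセキ"}
lexicon = {}
lm_prob = {"<UNK>": 1e-4, "文部": 0.02}

registered = register_unknown_word("霞が関", external_dict, lexicon, lm_prob)
missing = register_unknown_word("未登録", external_dict, lexicon, lm_prob)
```

After registration the word has both a pronunciation and a probability, so a search over the language model can select it.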

The speech recognition device according to claim 6 is the speech recognition device according to any one of claims 1 to 3, further comprising a phoneme recognition means and a second unknown word substitution means.

In this configuration, the phoneme recognition means generates pronunciation data for the speech signal based on the acoustic model. For each word of the corrected character string received from the correction device, the second unknown word substitution means obtains from the phoneme recognition means the pronunciation of any unknown word not registered in the language model, and sets the probability with which that unknown word appears by substituting the probability value registered in advance in the language model as the connection probability of unknown words. An unknown word received from the correction device is thus assigned the unknown-word probability value already registered in the language model.
The search means then searches for the word string with the maximum connection probability, based on the connection probabilities in the language model and the substituted connection probability of the unknown word.

The speech recognition device according to claim 7 is the speech recognition device according to any one of claims 1 to 3, further comprising an external dictionary storage means, a phoneme recognition means, a known word conversion means, an unknown word substitution means, and a second unknown word substitution means.

In this configuration, the external dictionary storage means stores an external dictionary, a second pronunciation dictionary with more registered words than the pronunciation dictionary. The phoneme recognition means generates pronunciation data for the speech signal based on the acoustic model.
The known word conversion means then examines each word of the corrected character string received from the correction device and decomposes any unknown word not registered in the language model into known words that are registered in the language model. An unknown word is thereby converted into known words for which the language model has probability values, and these are stored in the cache storage means.

Next, for any unknown word that the known word conversion means could not decompose into known words, the unknown word substitution means obtains the pronunciation from the external dictionary and sets the probability with which that unknown word appears by substituting the probability value registered in advance in the language model as the connection probability of unknown words. An unknown word received from the correction device is thus assigned the unknown-word probability value already registered in the language model.

Then, for any unknown word whose pronunciation the unknown word substitution means could not obtain from the external dictionary, the second unknown word substitution means obtains the pronunciation from the phoneme recognition means and sets the probability with which that unknown word appears by substituting the probability value registered in advance in the language model as the connection probability of unknown words. An unknown word received from the correction device is thus assigned the unknown-word probability value already registered in the language model.
The search means then searches for the word string with the maximum connection probability, based on the connection probabilities in the language model and the substituted connection probability of the unknown word.
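The three-stage fallback described for claim 7 can be summarized in a short sketch. The three strategies are passed in as hypothetical callables. In particular, `recognize_phonemes` is a stand-in that takes the word itself, whereas the phoneme recognition means in the text works on the speech signal, and how a word is aligned with the signal is not covered in this section.

```python
def handle_unknown_word(word, decompose, lookup_external, recognize_phonemes):
    """Try the three strategies in the order the text gives them."""
    pieces = decompose(word)
    if pieces is not None:
        return ("known_words", pieces)
    pronunciation = lookup_external(word)
    if pronunciation is not None:
        return ("external_dictionary", pronunciation)
    # Last resort: take the pronunciation from phoneme recognition.
    return ("phoneme_recognition", recognize_phonemes(word))


# Hypothetical stand-ins for the three means.
known_splits = {"文部科学省": ["文部", "科学", "省"]}
external = {"霞が関": "カスミガセキ"}


def fake_phonemes(word):
    return "phonemes-for-" + word


results = [
    handle_unknown_word(w, known_splits.get, external.get, fake_phonemes)
    for w in ("文部科学省", "霞が関", "未登録")
]
```

Each word is handled by the first stage that succeeds, so phoneme recognition is used only when both decomposition and the external dictionary fail.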

The speech recognition program according to claim 8 is used in a speech recognition system in which a speech signal is recognized by a speech recognition device using a pronunciation dictionary, an acoustic model, and a language model, an operator corrects recognition errors in the resulting character string with a correction device, and the corrected string is output as the speech recognition result. In order to reduce recognition errors by means of the character string corrected by the correction device, the program causes the computer of the speech recognition device to function as a cache score calculation means, a language score correction means, and a search means.

In this configuration, the cache storage means holds the words of the corrected character strings produced by the correction device, up to a predetermined number of words. For each word stored there, the cache score calculation means of the speech recognition program calculates, as the cache score, the probability with which that word appears in the cache storage means.

The language score correction means then adds this cache score to the language score, which is the appearance probability of the word obtained from the language model, to produce a corrected language score. The language score thus gains a weight for the words most recently received from the correction device, so those words receive a higher language score.

Finally, based on the corrected language score generated by the language score correction means, the search means searches the language model for the word string with the maximum connection probability and takes it as the speech recognition result. Because the words most recently received from the correction device have a higher language score, they are more likely to be selected in the word string with the maximum connection probability.

The present invention provides the following advantageous effects.
According to the inventions of claims 1 and 8, the most recent words received from the correction device are given a higher language score during speech recognition. This raises the appearance probability of words that were misrecognized and then corrected on the correction device, so the same recognition error can be prevented from recurring.

According to the invention of claim 2, the IDF value is applied to the cache score as a weighting value, so the appearance probability can be raised for important words such as keywords. This avoids the harmful side effect of large cache scores for function words (dependent words) such as particles, and suppresses spurious insertion errors of function words.

According to the invention of claim 3, words that the operator corrected on the correction device receive a higher language score, so the appearance probability of the operator-corrected words, that is, the correct words, can be raised and recognition errors can be suppressed.

According to the invention of claim 4, an unknown word received from the correction device that consists of a chain of known words is decomposed into its individual known words, so an accurate appearance probability can be given for the unknown word and recognition accuracy can be improved.

According to the inventions of claims 5 and 6, the appearance probability of unknown words registered in advance in the language model can be substituted for the appearance probability of an unknown word received from the correction device, so that unknown word becomes a word that speech recognition is able to recognize.

According to the invention of claim 7, an unknown word received from the correction device that consists of a chain of known words is decomposed into its individual known words, so an accurate appearance probability can be given for the unknown word and recognition accuracy can be improved. In addition, the appearance probability of unknown words registered in advance in the language model can be substituted for the appearance probability of an unknown word received from the correction device, so that unknown word becomes a word that speech recognition is able to recognize.

FIG. 1 is a block diagram showing the overall configuration of a speech recognition system according to an embodiment of the present invention.
FIG. 2 is an example screen illustrating the operation screen of the speech recognition system according to the embodiment of the present invention.
FIG. 3 is a block diagram showing the configuration of the speech recognition device according to the first embodiment of the present invention.
FIG. 4 is a flowchart showing the operation of the speech recognition device according to the first embodiment of the present invention.
FIG. 5 is a block diagram showing the configuration of the speech recognition device according to the second embodiment of the present invention.
FIG. 6 is a block diagram showing the configuration of the speech recognition device according to the third embodiment of the present invention.
FIG. 7 is a block diagram showing the configuration of the speech recognition device according to the fourth embodiment of the present invention.
FIG. 8 is a block diagram showing the configuration of the speech recognition device according to the fifth embodiment of the present invention.
FIG. 9 is a block diagram showing the configuration of the speech recognition device according to the sixth embodiment of the present invention.

Embodiments of the present invention are described below with reference to the drawings.
<Speech recognition system>
First, the configuration of the speech recognition system according to an embodiment of the present invention is described with reference to FIG. 1. The speech recognition system S recognizes input speech, has an operator correct misrecognized character strings by hand, and outputs the result as the speech recognition result in real time (including the time taken for correction). For example, to broadcast a live program such as news with subtitles, the speech recognition system S converts speech such as a read news script into character data in real time and outputs it as subtitle data. Here, the speech recognition system S comprises a speech recognition device 1 and a correction device 3.

The speech recognition device 1 recognizes an input speech signal (input speech) and outputs the recognition result as a character string. For example, a speaker (re-speaker) who has listened to the audio of a program such as news repeats that audio into a microphone (not shown), and the speech recognition device 1 outputs the program audio as a character string. The character string that is the recognition result of the speech recognition device 1 is output to the correction device 3.
When the recognition result is corrected on the correction device 3, the correction result (corrected character string) is fed back to the speech recognition device 1, which reflects it in subsequent speech recognition. The speech recognition device 1 is described in detail later.

The correction device 3 lets an operator correct the character string recognized by the speech recognition device 1 when it contains errors. For example, the correction device 3 displays a screen G such as the one shown in FIG. 2 on a display device (not shown).
In the example of FIG. 2, the correction device 3 displays on the screen G a recognition result display area 3a, a pre-correction character string display area 3b, a corrected character string input area 3c, and a send button 3d.

The recognition result display area 3a displays the character string of the speech recognition result received from the speech recognition device 1.
The pre-correction character string display area 3b displays the part of the string shown in the recognition result display area 3a in which a recognition error has occurred. The operator compares the speech he or she heard with the recognition result shown in the recognition result display area 3a and selects the misrecognized string there with a mouse or touch panel (not shown). The selected string is then displayed in the pre-correction character string display area 3b.

The corrected character string input area 3c is where the operator enters the correct character string for the string displayed in the pre-correction character string display area 3b.
The example of FIG. 2 shows a recognition result from the speech recognition device 1 in which "文部科学省" (the Ministry of Education, Culture, Sports, Science and Technology) was misrecognized as the identically pronounced "文部か楽章".
In this case, the operator selects "文部か楽章" in the recognition result display area 3a, and "文部か楽章" is displayed in the pre-correction character string display area 3b. The operator then enters "文部科学省" in the corrected character string input area 3c as the correct string.
In this way, the correction device 3 lets the operator correct misrecognized character strings.

また、画面Ｇ上の送出ボタン３ｄは、修正が完了した場合、あるいは修正が必要ない場合に、操作者によって押下されることで、修正後の文字列（修正がない場合は、元の文字列）を出力するためのボタンである。この送出ボタン３ｄを押下されることで、修正装置３は、音声認識装置１の音声認識である文字列（あるいは修正後の文字列）を、字幕用のデータとして出力する。
なお、修正装置３は、音声認識結果に修正が行われた場合、送出ボタン３ｄの押下のタイミングで、修正後の文字列（図２の例では、「文部科学省」）を修正文字列（「文部／科学／省」：“／”は単語の区分を示す）として、音声認識装置１にフィードバックする。 The send button 3d on the screen G is pressed by the operator when correction is complete or when no correction is needed, and outputs the corrected character string (or the original character string if nothing was corrected). When the send button 3d is pressed, the correction device 3 outputs the character string that is the recognition result of the speech recognition device 1 (or the corrected character string) as subtitle data.
When the speech recognition result has been corrected, the correction device 3 feeds the corrected character string (“文部科学省” in the example of FIG. 2) back to the speech recognition apparatus 1 at the moment the send button 3d is pressed, as a segmented corrected character string (“文部/科学/省”, where “/” marks a word boundary).

このように、音声認識システムＳを構成することで、音声認識システムＳは、修正装置３において、音声認識結果の誤りが修正された文字列（修正文字列）が、音声認識装置１にフィードバックされ、音声認識の認識精度が高められることになる。
以下、本発明に係る音声認識装置１（１Ｂ〜１Ｆ）について、詳細に説明を行う。 By configuring the speech recognition system S in this way, the speech recognition system S feeds back the character string (corrected character string) in which the error of the speech recognition result is corrected in the correction device 3 to the speech recognition device 1. As a result, the recognition accuracy of voice recognition is improved.
Hereinafter, the speech recognition apparatus 1 (1B to 1F) according to the present invention will be described in detail.

［第１実施形態］
＜音声認識装置の構成＞
まず、図３を参照して、本発明の第１実施形態に係る音声認識装置の構成について説明する。図３に示した音声認識装置１は、修正装置３からフィードバックされる文字列（修正文字列）を参照して、入力音声の音声認識を行うものである。ここでは、音声認識装置１は、発音辞書記憶手段１０と、音響モデル記憶手段１１と、言語モデル記憶手段１２と、音声分析手段２０と、キャッシュ処理手段２１と、スコア結合手段２２と、単語列生成手段２３と、を備えている。 [First Embodiment]
<Configuration of voice recognition device>
First, the configuration of the speech recognition apparatus according to the first embodiment of the present invention will be described with reference to FIG. 3. The speech recognition device 1 shown in FIG. 3 performs speech recognition of input speech with reference to a character string (corrected character string) fed back from the correction device 3. Here, the speech recognition apparatus 1 includes a pronunciation dictionary storage unit 10, an acoustic model storage unit 11, a language model storage unit 12, a speech analysis unit 20, a cache processing unit 21, a score combining unit 22, and a word string generation unit 23.

発音辞書記憶手段１０は、発音辞書１０ａを記憶するものであって、ハードディスク等の一般的な記憶装置である。発音辞書１０ａは、単語ごとにその発音を示す子音と母音との構成を示したもので、予め複数の単語の発音を登録しておく。 The pronunciation dictionary storage means 10 stores the pronunciation dictionary 10a and is a general storage device such as a hard disk. The pronunciation dictionary 10a shows the structure of consonants and vowels indicating the pronunciation for each word, and the pronunciation of a plurality of words is registered in advance.

音響モデル記憶手段１１は、音響モデル１１ａを記憶するのであって、ハードディスク等の一般的な記憶装置である。音響モデル１１ａは、大量の音声データから予め学習した音素ごとの特徴量を「隠れマルコフモデル」によってモデル化したものである。この音響モデル１１ａは、単一の音響モデルを用いてもよいし、音響の種別（例えば、人物別）ごとに複数のモデルを用いてもよい。 The acoustic model storage unit 11 stores the acoustic model 11a and is a general storage device such as a hard disk. The acoustic model 11a is obtained by modeling a feature amount for each phoneme learned in advance from a large amount of speech data using a “hidden Markov model”. As the acoustic model 11a, a single acoustic model may be used, or a plurality of models may be used for each acoustic type (for example, for each person).

言語モデル記憶手段１２は、言語モデル１２ａを記憶するものであって、ハードディスク等の一般的な記憶装置である。言語モデル１２ａは、大量のテキストから学習した出力系列（単語、形態素等）の出現確率等をモデル化したものである。この言語モデルには、例えば、一般的な「Ｎグラム言語モデル」を用いることができる。
なお、ここでは、発音辞書記憶手段１０と、音響モデル記憶手段１１と、言語モデル記憶手段１２とを、別々の記憶装置として構成しているが、１つの記憶装置内に発音辞書１０ａ、音響モデル１１ａおよび言語モデル１２ａを記憶しておくこととしてもよい。また、ここでは、発音辞書１０ａ、音響モデル１１ａおよび言語モデル１２ａをハードディスクに記憶して構成した例を示しているが、動作時においては、高速化のため、単語列生成手段２３において参照可能な図示を省略したメモリに展開することとする。 The language model storage unit 12 stores a language model 12a and is a general storage device such as a hard disk. The language model 12a models the appearance probability of an output sequence (words, morphemes, etc.) learned from a large amount of text. As this language model, for example, a general “N-gram language model” can be used.
Here, the pronunciation dictionary storage unit 10, the acoustic model storage unit 11, and the language model storage unit 12 are configured as separate storage devices, but the pronunciation dictionary 10a, the acoustic model 11a, and the language model 12a may instead be stored in a single storage device. Also, although this example stores the pronunciation dictionary 10a, the acoustic model 11a, and the language model 12a on a hard disk, during operation they are loaded, for speed, into a memory (not shown) that the word string generation unit 23 can reference.

音声分析手段２０は、外部から入力された音声信号（入力音声）を分析し、その音声信号の特徴量を特徴ベクトルとして抽出するものである。この音声分析手段２０で抽出された特徴ベクトルは、単語列生成手段２３に出力される。
なお、音声分析手段２０は、音声信号の音声波形に窓関数（ハミング窓等）をかけることで、フレーム化された波形を抽出し、その波形を周波数分析することで、種々の特徴量を抽出する。例えば、フレーム化された波形のパワースペクトルの対数を逆フーリエ変換した値であるケプストラム係数等を特徴量とする。この特徴量には、ケプストラム係数以外にも、メル周波数ケプストラム係数（ＭＦＣＣ：Mel Frequency Cepstrum Coefficient）、ＬＰＣ（Linear Predictive Cording）係数、対数パワー等、一般的な音声特徴量を用いることができる。 The voice analysis means 20 analyzes a voice signal (input voice) input from the outside and extracts a feature amount of the voice signal as a feature vector. The feature vector extracted by the voice analysis unit 20 is output to the word string generation unit 23.
Note that the speech analysis unit 20 extracts framed waveforms by applying a window function (such as a Hamming window) to the speech waveform of the speech signal, and extracts various feature quantities by frequency analysis of those waveforms. For example, cepstrum coefficients, which are the values obtained by inverse Fourier transform of the logarithm of the power spectrum of the framed waveform, are used as feature quantities. Besides cepstrum coefficients, common speech features such as mel-frequency cepstral coefficients (MFCC), LPC (Linear Predictive Coding) coefficients, and logarithmic power can be used.
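The cepstrum computation described above can be sketched as follows. This is a minimal illustration, not the apparatus's actual implementation: the function names, the 64-sample frame, the naive O(n²) DFT, and the flooring constant are all assumptions chosen to keep the example dependency-free.

```python
import cmath
import math

def hamming(n):
    # Standard Hamming window coefficients for a frame of n samples.
    return [0.54 - 0.46 * math.cos(2 * math.pi * k / (n - 1)) for k in range(n)]

def dft(x, inverse=False):
    # Naive discrete Fourier transform; adequate for a short illustrative frame.
    n = len(x)
    sign = 1 if inverse else -1
    out = [sum(x[t] * cmath.exp(sign * 2j * math.pi * k * t / n) for t in range(n))
           for k in range(n)]
    return [v / n for v in out] if inverse else out

def cepstrum(frame, floor=1e-12):
    # Window the frame, take the log power spectrum, then inverse-transform it.
    windowed = [s * w for s, w in zip(frame, hamming(len(frame)))]
    power = [abs(c) ** 2 for c in dft(windowed)]
    log_power = [math.log(max(p, floor)) for p in power]  # floor avoids log(0)
    return [c.real for c in dft(log_power, inverse=True)]

# Hypothetical input: one 64-sample frame of a pure tone.
frame = [math.sin(2 * math.pi * 5 * t / 64) for t in range(64)]
coefficients = cepstrum(frame)
```

In practice only the first few coefficients would be kept as the feature vector, and a fast Fourier transform would replace the naive DFT.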

キャッシュ処理手段２１は、修正装置３からフィードバックされた修正文字列を単語ごとに入力し、保存（キャッシュ）するとともに、キャッシュ内の単語の出現確率であるキャッシュスコアを算出するものである。ここでは、キャッシュ処理手段２１は、キャッシュ記憶手段２１１と、キャッシュスコア算出手段２１２と、を備えている。 The cache processing means 21 inputs the corrected character string fed back from the correction device 3 for each word, stores (caches) it, and calculates a cache score that is the appearance probability of the word in the cache. Here, the cache processing unit 21 includes a cache storage unit 211 and a cache score calculation unit 212.

キャッシュ記憶手段２１１は、修正装置３からフィードバックされた修正文字列について、予め定めた個数（Ｍ個とする）分の最新の単語を保存するものであって、一般的なメモリ等で構成される。このキャッシュ記憶手段２１１には、逐次最新のＭ個（例えば、３０００個）の単語が記憶される。 The cache storage unit 211 stores a predetermined number (M) of the latest words for the corrected character string fed back from the correction device 3, and is configured by a general memory or the like. . The cache storage unit 211 sequentially stores the latest M (for example, 3000) words.
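A store that always holds the latest M fed-back words can be sketched with a bounded queue. The class and method names below are hypothetical, and the use of an ASCII "/" as the word delimiter in the fed-back string is an assumption made for illustration.

```python
from collections import deque

class CorrectionCache:
    """Holds the most recent M words fed back from the correction device."""

    def __init__(self, max_words=3000):
        # A deque with maxlen discards the oldest word once the cache is full.
        self._words = deque(maxlen=max_words)

    def add_corrected_string(self, segmented):
        # The feedback is assumed to arrive pre-segmented, e.g. "文部/科学/省".
        for word in segmented.split("/"):
            if word:
                self._words.append(word)

    def words(self):
        return list(self._words)

# Small M so that the eviction of the oldest word is visible.
cache = CorrectionCache(max_words=4)
cache.add_corrected_string("文部/科学/省")
cache.add_corrected_string("ハロー/ワーク")
print(cache.words())  # ['科学', '省', 'ハロー', 'ワーク']
```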

キャッシュスコア算出手段２１２は、キャッシュ（キャッシュ記憶手段２１１）中の単語の出現確率値であるキャッシュスコアを算出するものである。このキャッシュスコア算出手段２１２で算出されたキャッシュスコアは、スコア結合手段２２の言語スコア修正手段２２１に出力される。
ここで、キャッシュされた最新の単語をｗ_ｎ、Ｍ個前の単語をｗ_ｎ−Ｍと表記し、単語列ｗ_ｎ−Ｍ，ｗ_{ｎ−Ｍ＋１}，…，ｗ_ｎ−１をｗ_ｎ−Ｍ ^ｎ−１と表記したとき、キャッシュスコア算出手段２１２は、単語ｗ_ｎのキャッシュスコアＰ_Ｃ（ｗ_ｎ｜ｗ_ｎ−Ｍ ^ｎ−１）を以下の（１）式により算出する。 The cache score calculation unit 212 calculates a cache score that is an appearance probability value of a word in the cache (cache storage unit 211). The cache score calculated by the cache score calculating unit 212 is output to the language score correcting unit 221 of the score combining unit 22.
Here, let the most recently cached word be $w_n$, let the word M positions earlier be $w_{n-M}$, and let the word string $w_{n-M}, w_{n-M+1}, \ldots, w_{n-1}$ be written $w_{n-M}^{n-1}$. The cache score calculation unit 212 then calculates the cache score $P_C(w_n \mid w_{n-M}^{n-1})$ of the word $w_n$ by equation (1) below.

この（１）式において、δ（・）は、クロネッカーのδ関数であり、引数が等しいとき、すなわち、（１）式においてｗ_ｎとｗ_ｎ−ｍとが等しいときは“１”、それ以外のときは“０”となる関数である。なお、この（１）式で与えられる確率モデルを、以下では、キャッシュモデルと呼ぶ。 In equation (1), δ(·) is the Kronecker delta function, which is 1 when its arguments are equal, that is, when $w_n$ and $w_{n-m}$ are equal in equation (1), and 0 otherwise. The probability model given by equation (1) is referred to below as the cache model.
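Equation (1) itself is not reproduced in this text. Given the definitions around it, a plausible form is the usual unigram cache model, $P_C(w_n \mid w_{n-M}^{n-1}) = \frac{1}{M}\sum_{m=1}^{M}\delta(w_n, w_{n-m})$, that is, the fraction of the M cached words equal to $w_n$. The sketch below assumes that form; the function name is hypothetical.

```python
def cache_score(word, cached_words):
    """Fraction of the cached words that equal `word` (assumed form of eq. (1))."""
    m = len(cached_words)
    if m == 0:
        return 0.0
    # Summing the Kronecker delta over the cache is a count of exact matches.
    return sum(1 for w in cached_words if w == word) / m

history = ["文部", "科学", "省", "科学"]
print(cache_score("科学", history))  # 0.5
print(cache_score("楽章", history))  # 0.0
```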

スコア結合手段２２は、言語モデル１２ａから得られる単語の出現確率値である言語スコアに、キャッシュとして記憶されている単語の出現確率値であるキャッシュスコアを結合することで、言語スコアを修正するものである。ここでは、スコア結合手段２２は、言語スコア修正手段２２１を備えている。 The score combining means 22 corrects the language score by combining a cache score, which is a word appearance probability value stored as a cache, with a language score, which is a word appearance probability value obtained from the language model 12a. It is. Here, the score combining unit 22 includes a language score correcting unit 221.

言語スコア修正手段２２１は、言語モデル１２ａから得られる言語スコアに、キャッシュスコア算出手段２１２で算出されたキャッシュスコアを結合することで、言語スコアを修正するものである。この言語スコア修正手段２２１で修正された言語スコア（修正言語スコア）は、単語列生成手段２３の探索手段２３２に出力される。 The language score correcting unit 221 corrects the language score by combining the cache score calculated by the cache score calculating unit 212 with the language score obtained from the language model 12a. The language score (corrected language score) corrected by the language score correcting unit 221 is output to the searching unit 232 of the word string generating unit 23.

ここで、言語モデル１２ａをＮ−ｇｒａｍ言語モデルとし、言語モデル１２ａから得られる単語ｗ_ｎのスコア（言語スコア）を、Ｐ_ＬＭ＝（ｗ_ｎ｜ｗ_{ｎ−Ｎ＋１} ^ｎ−１）としたとき、言語スコア修正手段２２１は、以下の（２）式により、言語モデル１２ａのスコア（言語スコア）と、キャッシュモデルのスコア（キャッシュスコア）とを結合することで、言語スコアを修正した修正言語スコアＰ（ｗ_ｎ｜ｗ_１ ^ｎ−１）を生成する。 Here, let the language model 12a be an N-gram language model, and let the score (language score) of the word $w_n$ obtained from the language model 12a be $P_{LM}(w_n \mid w_{n-N+1}^{n-1})$. The language score correcting unit 221 combines the score of the language model 12a (language score) with the score of the cache model (cache score) by equation (2) below, generating the corrected language score $P(w_n \mid w_1^{n-1})$, that is, the language score after correction.

この（２）式において、λは、０≦λ≦１の定数である。この定数λは、予め実験によって、音声認識誤りが少なくなる値を定めることとしてもよいし、あるいは、予め定めたテキストのパープレキシティ（テキストの予測出力系列数）が最小となる値としてもよい。 In equation (2), λ is a constant satisfying 0 ≤ λ ≤ 1. The constant λ may be set in advance, by experiment, to a value that reduces speech recognition errors, or to a value that minimizes the perplexity (the number of predicted output sequences of the text) of a predetermined text.
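Equation (2) is not reproduced in this text. With a single constant λ in [0, 1], the natural reading is a linear interpolation of the two scores, for example $P = \lambda P_{LM} + (1-\lambda) P_C$. Which term carries λ and which carries 1−λ is an assumption here, as are the function name and the default value of λ.

```python
def corrected_language_score(p_lm, p_cache, lam=0.9):
    """Linear interpolation of the language score and the cache score.

    Assumes lam weights the language model and (1 - lam) the cache model.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * p_lm + (1.0 - lam) * p_cache

# Illustrative numbers only: a word that is rare in the language model
# but frequent in the cache ends up with a noticeably higher score.
score = corrected_language_score(p_lm=0.01, p_cache=0.5, lam=0.9)
```

Because both inputs are probabilities and the weights sum to 1, the interpolated score remains a valid probability.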

単語列生成手段２３は、発音辞書１０ａ、音響モデル１１ａおよび言語モデル１２ａに基づいて、音声分析手段２０で抽出された特徴ベクトルから、音声認識結果となる単語列（認識文字列）を生成するものである。ここでは、単語列生成手段２３は、音響スコア算出手段２３１と、探索手段２３２と、を備えている。 The word string generation means 23 generates a word string (recognized character string) as a voice recognition result from the feature vector extracted by the voice analysis means 20 based on the pronunciation dictionary 10a, the acoustic model 11a, and the language model 12a. It is. Here, the word string generation unit 23 includes an acoustic score calculation unit 231 and a search unit 232.

音響スコア算出手段２３１は、音声分析手段２０で抽出され、時系列に入力される特徴ベクトルと、音響モデル１１ａでモデル化されている音素との類似度（確率値）を音響スコアとして算出するものである。なお、この音響スコア算出手段２３１は、後記する探索手段２３２から逐次出力される出力系列の探索候補ごとに音響スコアを算出する。ここで算出された音響スコアは、探索手段２３２に出力される。 The acoustic score calculation means 231 calculates the similarity (probability value) between the feature vector extracted by the speech analysis means 20 and input in time series and the phoneme modeled by the acoustic model 11a as an acoustic score. It is. The acoustic score calculation unit 231 calculates an acoustic score for each output sequence search candidate sequentially output from the search unit 232 described later. The calculated acoustic score is output to the search means 232.

探索手段２３２は、音響スコア算出手段２３１で算出された音響スコアに基づいて、言語モデル１２ａから、接続される出力系列の候補を探索し、その探索結果である探索候補を音響スコア算出手段２３１に出力するとともに、音響スコアと接続確率（言語スコア）とが最大となる出力系列を入力音声に対する認識結果として外部に出力するものである。
なお、探索手段２３２は、言語モデル１２ａについては、言語スコア修正手段２２１によって修正された言語スコア（修正言語スコア）を用いる。そして、探索手段２３２は、音響スコアと言語スコア（修正言語スコア）との積が最大となる出力系列を言語モデル１２ａ（修正後の言語モデル）から探索する。 Based on the acoustic score calculated by the acoustic score calculating unit 231, the searching unit 232 searches the language model 12 a for a candidate for the output series to be connected, and the search candidate that is the search result is stored in the acoustic score calculating unit 231. In addition to outputting, an output sequence having the maximum acoustic score and connection probability (language score) is output to the outside as a recognition result for the input speech.
The search means 232 uses the language score (corrected language score) corrected by the language score correcting means 221 for the language model 12a. Then, the search unit 232 searches the language model 12a (corrected language model) for an output series that maximizes the product of the acoustic score and the language score (corrected language score).
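The selection criterion described here, maximizing the product of the acoustic score and the corrected language score, can be illustrated on a fixed list of candidates. This is only a sketch of the criterion: a real decoder builds and prunes candidates incrementally rather than scoring a complete list, and the scores below are made-up illustrative values. Working in log space is a common way to avoid numerical underflow and does not change which candidate wins.

```python
import math

def best_hypothesis(candidates):
    """Return the word sequence maximizing acoustic score x language score.

    Each candidate is (words, acoustic_probability, per_word_language_scores).
    """
    def log_score(candidate):
        _, acoustic, lm_scores = candidate
        # Sum of logs equals the log of the product of all scores.
        return math.log(acoustic) + sum(math.log(p) for p in lm_scores)

    return max(candidates, key=log_score)[0]

# Hypothetical scores: the cache has raised the language scores of 科学 and 省.
candidates = [
    (["文部", "か", "楽章"], 1e-5, [0.01, 0.05, 0.001]),
    (["文部", "科学", "省"], 8e-6, [0.01, 0.06, 0.2]),
]
result = best_hypothesis(candidates)
```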

このように音声認識装置１を構成することで、従来の言語モデルを用いた音声認識に対し、キャッシュ（キャッシュ記憶手段２１１）に記憶された修正済みの単語のスコア（キャッシュスコア）を言語スコアに加えることができるため、直近に修正装置３から入力された単語に重みが付与されることになり、音声認識の精度を高め、また、認識誤りの再発を防止することができる。 By configuring the speech recognition apparatus 1 in this manner, the score of the corrected word (cache score) stored in the cache (cache storage unit 211) is used as the language score for speech recognition using the conventional language model. Therefore, a weight is given to the word input from the correction device 3 most recently, so that the accuracy of speech recognition can be improved and the recurrence of recognition errors can be prevented.

また、音声認識装置１は、図示を省略したＣＰＵやメモリを搭載した一般的なコンピュータで実現することができる。このとき、音声認識装置１は、コンピュータを、前記した各手段として機能させる音声認識プログラムによって動作する。 The voice recognition device 1 can be realized by a general computer having a CPU and a memory (not shown). At this time, the speech recognition apparatus 1 operates by a speech recognition program that causes a computer to function as each of the above-described means.

＜音声認識装置の動作＞
次に、図４を参照（構成については適宜図３参照）して、本発明の第１実施形態に係る音声認識装置の音声認識動作について説明する。なお、修正装置３から出力される修正文字列は、音声認識装置１に入力され、キャッシュ記憶手段２１１に記憶されるものとし、以下の説明においては、そのキャッシュ動作についての説明は省略する。 <Operation of voice recognition device>
Next, the speech recognition operation of the speech recognition apparatus according to the first embodiment of the present invention will be described with reference to FIG. 4 (and to FIG. 3, as appropriate, for the configuration). The corrected character string output from the correction device 3 is assumed to be input to the speech recognition device 1 and stored in the cache storage unit 211; the caching operation itself is not described below.

まず、音声認識装置１は、音声分析手段２０によって、外部から入力された音声信号（入力音声）を分析し、その音声信号の特徴量を特徴ベクトルとして抽出する（ステップＳ１）。
そして、音声認識装置１は、探索手段２３２によって、言語モデル１２ａから、接続される出力系列の候補を順次リストアップする（ステップＳ２）。そして、音声認識装置１は、音響スコア算出手段２３１によって、ステップＳ２でリストアップされた出力系列の探索候補ごとに、発音辞書１０ａで示される発音の音響モデル１１ａにおける音素の特徴量と、ステップＳ１で抽出された入力音声の特徴量との類似度（確率値）を音響スコアとして算出する（ステップＳ３）。 First, the speech recognition apparatus 1 analyzes a speech signal (input speech) input from the outside by the speech analysis unit 20, and extracts a feature amount of the speech signal as a feature vector (step S1).
Then, the speech recognition apparatus 1 uses the search unit 232 to list, in order, candidate output sequences to be connected, drawn from the language model 12a (step S2). Next, for each output sequence search candidate listed in step S2, the speech recognition apparatus 1 uses the acoustic score calculation unit 231 to calculate, as an acoustic score, the similarity (probability value) between the phoneme feature quantities in the acoustic model 11a for the pronunciation given by the pronunciation dictionary 10a and the feature quantities of the input speech extracted in step S1 (step S3).

さらに、音声認識装置１は、探索手段２３２によって、ステップＳ２でリストアップした出力系列の候補ごとに、言語モデル１２ａにおいて、接続確率（言語スコア）を算出する。すなわち、音声認識装置１は、キャッシュ処理手段２１のキャッシュスコア算出手段２１２によって、出力系列内の単語のうちでキャッシュ記憶手段２１１に記憶されている単語の出現確率値であるキャッシュスコアを前記（１）式により算出する（ステップＳ４）。 Furthermore, the speech recognition apparatus 1 uses the search unit 232 to calculate a connection probability (language score) from the language model 12a for each output sequence candidate listed in step S2. That is, the speech recognition apparatus 1 uses the cache score calculation unit 212 of the cache processing unit 21 to calculate, by equation (1) above, the cache score, which is the appearance probability value of those words in the output sequence that are stored in the cache storage unit 211 (step S4).

そして、音声認識装置１は、スコア結合手段２２の言語スコア修正手段２２１によって、言語モデル１２ａにおいて、出力系列の候補内の当該単語の出現確率値である言語スコアと、ステップＳ４で算出したキャッシュスコアとを結合することで、言語スコアを修正した修正言語スコアを生成する（ステップＳ５）。
そして、音声認識装置１は、探索手段２３２によって、ステップＳ３で算出された音響スコアと、ステップＳ５で生成された修正言語スコアとの積が最大となる出力系列を音声認識結果として出力する（ステップＳ６）。 Then, the speech recognition device 1 uses the language score correcting unit 221 of the score combining unit 22 to determine the language score that is the appearance probability value of the word in the output series candidate and the cache score calculated in step S4 in the language model 12a. Are combined to generate a corrected language score in which the language score is corrected (step S5).
Then, the speech recognition apparatus 1 outputs, as the speech recognition result, an output sequence in which the product of the acoustic score calculated in step S3 and the corrected language score generated in step S5 is maximized by the search unit 232 (step S6).

以上の動作によって、音声認識装置１は、直近に修正装置３から入力された単語によって、言語スコアを修正し、当該単語に重みを付けることができるため、音声認識の精度を高め、また、認識誤りの再発を防止することができる。さらに、操作者の修正の手間を減らすことができる。 With the above operation, the speech recognition device 1 can correct the language score by using the word input from the correction device 3 most recently and weight the word, thereby improving the accuracy of speech recognition and recognizing The recurrence of errors can be prevented. Furthermore, it is possible to reduce the trouble of correction by the operator.

［第２実施形態］
次に、図５を参照して、本発明の第２実施形態に係る音声認識装置の構成について説明する。図５に示した音声認識装置１Ｂは、図３で説明した音声認識装置１と同様、修正装置３からフィードバックされる文字列（修正文字列）を参照して、入力音声の音声認識を行うものである。この音声認識装置１Ｂは、キャッシュ内の単語に対してさらに重みを付ける点が、音声認識装置１と異なっている。 [Second Embodiment]
Next, the configuration of the speech recognition apparatus according to the second embodiment of the present invention will be described with reference to FIG. 5. Like the speech recognition device 1 described with reference to FIG. 3, the speech recognition device 1B shown in FIG. 5 performs speech recognition of input speech with reference to a character string (corrected character string) fed back from the correction device 3. The speech recognition apparatus 1B differs from the speech recognition apparatus 1 in that it additionally weights the words in the cache.

ここでは、音声認識装置１Ｂは、発音辞書記憶手段１０と、音響モデル記憶手段１１と、言語モデル記憶手段１２と、重み付け値記憶手段１３と、音声分析手段２０と、キャッシュ処理手段２１Ｂと、スコア結合手段２２と、単語列生成手段２３と、を備えている。重み付け値記憶手段１３およびキャッシュ処理手段２１Ｂ以外の構成については、図３で説明した音声認識装置１と同様の構成であるため、同一の符号を付して説明を省略する。また、キャッシュ処理手段２１、スコア結合手段２２および単語列生成手段２３の内部構成については、図３で説明した音声認識装置１と同様の構成であるため図示を省略する。 Here, the speech recognition apparatus 1B includes a pronunciation dictionary storage unit 10, an acoustic model storage unit 11, a language model storage unit 12, a weight value storage unit 13, a speech analysis unit 20, a cache processing unit 21B, a score combining unit 22, and a word string generation unit 23. The components other than the weight value storage unit 13 and the cache processing unit 21B are the same as those of the speech recognition apparatus 1 described with reference to FIG. 3, so they are given the same reference numerals and their description is omitted. The internal configurations of the cache processing unit 21, the score combining unit 22, and the word string generation unit 23 are also the same as in the speech recognition apparatus 1 described with reference to FIG. 3 and are therefore not illustrated.

重み付け値記憶手段（ＩＤＦ値記憶手段）１３は、キャッシュ記憶手段２１１に記憶される単語に対して設定する重み付け値を単語に対応付けて予め記憶しておくものである。この重み付け値は、例えば、助詞（「が」、「は」、「の」…）等の文法的な役割を持つ語である機能語（付属語）に対し、それ以外の一般的な意味を持つ語である内容語の方が大きな重みとなるように設定する。 The weight value storage unit (IDF value storage unit) 13 stores in advance, in association with each word, the weight value to be applied to words stored in the cache storage unit 211. The weight values are set so that content words, which carry general meaning, receive larger weights than function words (dependent words), which play a grammatical role, such as the particles “が” (ga), “は” (wa), and “の” (no).

ここでは、重み付け値として、特定の単語が全文書中のどれくらいの文書に出現するかを示す尺度であるＩＤＦ（Inverse Document Frequency）値を用いることとし、単語にＩＤＦ値を対応付けたＩＤＦテーブル１３ａを重み付け値記憶手段１３に記憶しておくこととする。また、このＩＤＦ値は、外部の記憶手段３０に記憶した予めＩＤＦ学習のために準備した文書の集合（ＩＤＦ学習用文書集合３０ａ）により、予め学習しておくものとする。このＩＤＦ学習用文書集合３０ａは、例えば、ある期間にニュース等で使用したニュース原稿等である。 Here, the IDF (Inverse Document Frequency) value, a measure of how many of the documents in a collection a given word appears in, is used as the weight value, and an IDF table 13a that associates each word with its IDF value is stored in the weight value storage unit 13. The IDF values are learned in advance from a set of documents prepared for IDF learning (IDF learning document set 30a) stored in the external storage unit 30. The IDF learning document set 30a is, for example, a collection of news manuscripts used in news programs over a certain period.

ここで、ＩＤＦ学習用文書集合３０ａに含まれる文書数をＮ_ｄ、文書中に単語ｗ_ｎを含んだ文書数をｄｆ_ｎとしたとき、単語ｗ_ｎのＩＤＦ値（ＩＤＦ（ｗ_ｎ））は、以下の（３）式で与えられ、単語ｗ_ｎにこのＩＤＦ値を対応付けて、ＩＤＦテーブル１３ａとする。 Here, letting $N_d$ be the number of documents in the IDF learning document set 30a and $df_n$ be the number of documents that contain the word $w_n$, the IDF value of the word $w_n$, $\mathrm{IDF}(w_n)$, is given by equation (3) below; the IDF table 13a is formed by associating this IDF value with the word $w_n$.

このＩＤＦ値は、各文書にまんべんなく出現する機能語のような単語については小さな値となり、ある特定の文書にしか出現しないキーワードのような単語については大きな値となる。 This IDF value is a small value for words such as function words that appear evenly in each document, and a large value for words such as keywords that appear only in a specific document.
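Equation (3) is not reproduced in this text. The standard definition consistent with the symbols above is $\mathrm{IDF}(w_n) = \log(N_d / df_n)$; the sketch below assumes that form with a natural logarithm, and the function name and the toy document set are hypothetical.

```python
import math

def build_idf_table(documents):
    """Map each word to log(N_d / df), where df counts documents containing it."""
    n_d = len(documents)
    df = {}
    for doc in documents:
        for word in set(doc):  # count each document at most once per word
            df[word] = df.get(word, 0) + 1
    return {word: math.log(n_d / count) for word, count in df.items()}

# Toy learning set: the particle の appears in every document.
documents = [["文部", "の"], ["科学", "の"], ["省", "の"]]
idf = build_idf_table(documents)
```

A word present in every document gets an IDF of zero, while a word confined to one document gets the largest value, which matches the behavior described above.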

キャッシュ処理手段２１Ｂは、修正装置３からフィードバックされた修正文字列を入力し、予め定めた個数（Ｍ個とする）分の最新の単語を保存（キャッシュ）するとともに、当該キャッシュされた単語に重み付けを行うものである。ここでは、キャッシュ処理手段２１Ｂは、キャッシュ記憶手段２１１と、キャッシュスコア算出手段２１２Ｂと、を備えている。キャッシュ記憶手段２１１は、図３で説明した音声認識装置１と同一の構成であるため説明を省略する。 The cache processing unit 21B inputs the corrected character string fed back from the correction device 3, stores (caches) the latest number of words (assumed to be M), and weights the cached words. Is to do. Here, the cache processing unit 21B includes a cache storage unit 211 and a cache score calculation unit 212B. The cache storage unit 211 has the same configuration as the voice recognition device 1 described in FIG.

キャッシュスコア算出手段２１２Ｂは、キャッシュ（キャッシュ記憶手段２１１）中の単語の出現確率値に重みを付加したキャッシュスコアを算出するものである。このキャッシュスコア算出手段２１２Ｂで算出されたキャッシュスコアは、スコア結合手段２２の言語スコア修正手段２２１に出力される。
ここでは、キャッシュスコア算出手段２１２Ｂは、キャッシュ記憶手段２１１の単語に対し、重み付け値記憶手段１３に記憶されているＩＤＦテーブル１３ａの重み付け値（ＩＤＦ値）によって、重み付けを行った出現確率値であるキャッシュスコアを算出する。
具体的には、キャッシュされた最新の単語をｗ_ｎ、Ｍ個前の単語をｗ_ｎ−Ｍと表記し、単語列ｗ_ｎ−Ｍ，ｗ_{ｎ−Ｍ＋１}，…，ｗ_ｎ−１をｗ_ｎ−Ｍ ^ｎ−１と表記したとき、キャッシュスコア算出手段２１２Ｂは、単語ｗ_ｎのキャッシュスコアＰ_Ｃ（ｗ_ｎ｜ｗ_ｎ−Ｍ ^ｎ−１）を以下の（４）式により算出する。 The cache score calculation unit 212B calculates a cache score by adding a weight to the appearance probability value of the word in the cache (cache storage unit 211). The cache score calculated by the cache score calculating unit 212B is output to the language score correcting unit 221 of the score combining unit 22.
Here, the cache score calculation unit 212B calculates a cache score, which is an appearance probability value in which the words in the cache storage unit 211 are weighted by the weight values (IDF values) of the IDF table 13a stored in the weight value storage unit 13.
Specifically, let the most recently cached word be $w_n$, let the word M positions earlier be $w_{n-M}$, and let the word string $w_{n-M}, w_{n-M+1}, \ldots, w_{n-1}$ be written $w_{n-M}^{n-1}$. The cache score calculation unit 212B then calculates the cache score $P_C(w_n \mid w_{n-M}^{n-1})$ of the word $w_n$ by equation (4) below.

この（４）式において、δ（・）は、前記（１）式と同様、クロネッカーのδ関数であり、引数が等しいとき、すなわち、（４）式においてｗ_ｎとｗ_ｎ−ｍとが等しいときは“１”、それ以外のときは“０”となる関数である。また、Ｚは、確率の公理を満たすための正規化係数であって、以下の（５）式で算出される値である。 In equation (4), as in equation (1) above, δ(·) is the Kronecker delta function, which is 1 when its arguments are equal, that is, when $w_n$ and $w_{n-m}$ are equal in equation (4), and 0 otherwise. Z is a normalization coefficient that makes the scores satisfy the axioms of probability, and is calculated by equation (5) below.
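Equations (4) and (5) are not reproduced in this text, so the following is only one reading consistent with the description: each cache match is weighted by the word's IDF value, and Z is the sum of the IDF values of all cached words, so that the scores over all words sum to 1. The function name and the treatment of words missing from the IDF table (weight 0) are assumptions.

```python
def weighted_cache_score(word, cached_words, idf):
    """IDF-weighted cache score under the assumed form of eqs. (4) and (5)."""
    # Assumed normalizer Z: total IDF mass of everything in the cache.
    z = sum(idf.get(w, 0.0) for w in cached_words)
    if z == 0.0:
        return 0.0
    # Each exact match contributes the IDF value of the matched word.
    return sum(idf.get(w, 0.0) for w in cached_words if w == word) / z

cached = ["科学", "の", "省", "の"]
idf_table = {"科学": 2.0, "省": 1.0, "の": 0.0}  # illustrative values
total = sum(weighted_cache_score(w, cached, idf_table) for w in set(cached))
```

With these illustrative values the particle の scores zero even though it is the most frequent cached word, which is the suppression effect the weighting is meant to achieve.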

このように、音声認識装置１Ｂは、ＩＤＦ値によって、キャッシュ（キャッシュ記憶手段２１１）中の単語の出現確率値に重み付けを行うため、キャッシュ中の単語に含まれる助詞等の機能語のスコアを相対的に低くすることができ、音声認識結果において、機能語の湧き出し誤りを抑制することができる。
なお、本実施形態において、キャッシュスコア算出手段２１２Ｂは、ＩＤＦ値の代わりに、単語の品詞情報を利用して、機能語の重みを小さくすることとしてもよい。ただし、単語の品詞は一意に決まらない場合もあり、音声認識結果のような誤りを含む文字列の文脈から各単語の正しい品詞を推定することは困難であるため、ＩＤＦ値を用いることが望ましいといえる。 As described above, because the speech recognition apparatus 1B weights the appearance probability values of the words in the cache (cache storage unit 211) by their IDF values, the scores of function words such as particles among the cached words can be made relatively low, and spurious insertion of function words in the speech recognition result can be suppressed.
In this embodiment, the cache score calculation unit 212B may use the part-of-speech information of each word, rather than the IDF value, to reduce the weight of function words. However, the part of speech of a word is not always uniquely determined, and it is difficult to estimate the correct part of speech of each word from the context of a character string that contains errors, such as a speech recognition result; using the IDF value is therefore preferable.

また、ここでは、キャッシュスコア算出手段２１２Ｂは、キャッシュ（キャッシュ記憶手段２１１）中の単語の出現確率値にＩＤＦ値のみで重み付けを与えたが、修正単語であることを示す状態を加味して重み付けを行ってもよい。
具体的には、修正装置３からフィードバックされた単語ｗが修正された単語であるか否かを表す重みをｗｅｉｇｈｔ（ｗ）と表記したとき、キャッシュスコア算出手段２１２Ｂは、前記（４）式および（５）式に代えて、以下の（６）式および（７）式により、単語ｗ_ｎのキャッシュスコアＰ_Ｃ（ｗ_ｎ｜ｗ_ｎ−Ｍ ^ｎ−１）を算出する。 Here, the cache score calculation unit 212B weights the appearance probability values of the words in the cache (cache storage unit 211) by the IDF value alone, but the weighting may also take into account whether a word is a corrected word.
Specifically, letting weight(w) denote a weight indicating whether the word w fed back from the correction device 3 is a corrected word, the cache score calculation unit 212B calculates the cache score $P_C(w_n \mid w_{n-M}^{n-1})$ of the word $w_n$ by equations (6) and (7) below instead of equations (4) and (5) above.

このｗｅｉｇｈｔ（ｗ）の値は、単語ｗが認識結果そのものであれば“０”、修正された単語であれば、正の定数を与えるものとする。
ここで、単語ｗが修正された単語であるか否かは、修正装置３において、情報を付加することで、キャッシュ処理手段２１Ｂで判断することができる。例えば、図２の例では、修正装置３は、修正が行われなかった認識結果そのものの単語「文部」については“０”、修正が行われた「科学」については正の定数を付加して、音声認識装置１Ｂにフィードバックする。 The value of weight (w) is “0” if the word w is the recognition result itself, and a positive constant if the word w is a corrected word.
Whether the word w is a corrected word can be determined by the cache processing unit 21B if the correction device 3 attaches that information. In the example of FIG. 2, for instance, the correction device 3 attaches “0” to the word “文部”, which is the uncorrected recognition result itself, and a positive constant to the corrected word “科学”, and feeds them back to the speech recognition apparatus 1B.

また、単語ｗが修正された単語であるか否かは、修正装置３から、修正前の文字列（認識結果）と、修正後の文字列（修正文字列）とをフィードバックしてもらい、図示を省略した修正判定手段によりＤＰマッチングを行うことで、両文字列の比較を行い、差異がある単語については、修正が行われた単語であると判定することとしてもよい。例えば、図２の例では、修正装置３は、修正前の文字列（認識結果）として「文部か楽章」、修正後の文字列（修正文字列）として「文部科学省」を音声認識装置１Ｂにフィードバックする。このように、ＩＤＦ値以外に、修正が行われたか否かによっても重み付けを行うことで、音声認識装置１Ｂは、操作者が修正した単語について言語スコアに大きい重みが付与されるため、同じ認識誤りの発生を防止することができる。 Alternatively, whether the word w is a corrected word may be determined by having the correction device 3 feed back both the character string before correction (the recognition result) and the character string after correction (the corrected character string), comparing the two by DP matching in a correction determination unit (not shown), and judging any word that differs to be a corrected word. In the example of FIG. 2, the correction device 3 feeds back “文部か楽章” as the character string before correction (recognition result) and “文部科学省” as the character string after correction (corrected character string) to the speech recognition apparatus 1B. By weighting according to whether a correction was made, in addition to the IDF value, the speech recognition apparatus 1B gives a larger weight in the language score to words the operator has corrected, and can thus prevent the same recognition error from recurring.
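The comparison of the two character strings can be sketched with a dynamic-programming alignment. The text does not specify the DP matching procedure, so the version below substitutes a character-level longest-common-subsequence alignment as a stand-in, and assumes the corrected string arrives already segmented into words. The function name is hypothetical.

```python
def changed_words(before, after_words):
    """Flag each word of the corrected string that differs from `before`."""
    after = "".join(after_words)
    n, m = len(before), len(after)
    # lcs[i][j] = length of the longest common subsequence of before[i:] and after[j:]
    lcs = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if before[i] == after[j]:
                lcs[i][j] = lcs[i + 1][j + 1] + 1
            else:
                lcs[i][j] = max(lcs[i + 1][j], lcs[i][j + 1])
    # Walk the table to mark which characters of `after` were matched.
    matched = [False] * m
    i = j = 0
    while i < n and j < m:
        if before[i] == after[j]:
            matched[j] = True
            i += 1
            j += 1
        elif lcs[i + 1][j] >= lcs[i][j + 1]:
            i += 1
        else:
            j += 1
    # A word counts as corrected if any of its characters is unmatched.
    flags, pos = [], 0
    for word in after_words:
        flags.append(not all(matched[pos:pos + len(word)]))
        pos += len(word)
    return flags

flags = changed_words("文部か楽章", ["文部", "科学", "省"])
print(flags)  # [False, True, True]
```

The resulting flags could then be mapped to weight(w): 0 for unflagged words and a positive constant for flagged ones.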

なお、この音声認識装置１Ｂの音声認識動作については、基本的に図４で説明した音声認識装置１の動作と同様である。音声認識装置１Ｂと、音声認識装置１の動作の相違点は、図４のステップＳ４において、キャッシュ処理手段２１のキャッシュスコア算出手段２１２Ｂによって、出力系列内の単語のうちでキャッシュ記憶手段２１１に記憶されている単語の出現確率値であるキャッシュスコアを前記（４）式、あるいは（５）式により算出する点である。 The voice recognition operation of the voice recognition device 1B is basically the same as the operation of the voice recognition device 1 described with reference to FIG. Differences in operation between the speech recognition device 1B and the speech recognition device 1 are stored in the cache storage unit 211 among the words in the output series by the cache score calculation unit 212B of the cache processing unit 21 in step S4 of FIG. In other words, the cache score, which is the appearance probability value of the word that is being used, is calculated by the above equation (4) or (5).

以上説明した音声認識装置１Ｂは、図示を省略したＣＰＵやメモリを搭載した一般的なコンピュータで実現することができる。このとき、音声認識装置１Ｂは、コンピュータを、前記した各手段として機能させる音声認識プログラムによって動作する。 The voice recognition device 1B described above can be realized by a general computer having a CPU and a memory (not shown). At this time, the speech recognition apparatus 1B operates by a speech recognition program that causes the computer to function as each of the above-described means.

［第３実施形態］
次に、図６を参照して、本発明の第３実施形態に係る音声認識装置の構成について説明する。図６に示した音声認識装置１Ｃは、図３で説明した音声認識装置１の機能に加え、修正装置３からフィードバックされた修正文字列で、言語モデルに登録されていない未知語を認識可能にするものである。 [Third Embodiment]
Next, with reference to FIG. 6, the configuration of the speech recognition apparatus according to the third embodiment of the present invention will be described. In addition to the functions of the speech recognition apparatus 1 described with reference to FIG. 3, the speech recognition apparatus 1C shown in FIG. 6 makes it possible to recognize unknown words, that is, words not registered in the language model, that appear in the corrected character string fed back from the correction apparatus 3.

ここでは、音声認識装置１Ｃは、発音辞書記憶手段１０と、音響モデル記憶手段１１と、言語モデル記憶手段１２と、音声分析手段２０と、キャッシュ処理手段２１と、スコア結合手段２２と、単語列生成手段２３と、未知語処理手段２４と、を備えている。未知語処理手段２４以外の構成については、図３で説明した音声認識装置１と同様の構成であるため、同一の符号を付して説明を省略する。また、キャッシュ処理手段２１、スコア結合手段２２および単語列生成手段２３の内部構成については、図３で説明した音声認識装置１と同様の構成であるため図示を省略する。 Here, the speech recognition apparatus 1C includes a pronunciation dictionary storage unit 10, an acoustic model storage unit 11, a language model storage unit 12, a speech analysis unit 20, a cache processing unit 21, a score combining unit 22, a word string. A generation unit 23 and an unknown word processing unit 24 are provided. Since the configuration other than the unknown word processing means 24 is the same as the configuration of the speech recognition apparatus 1 described with reference to FIG. 3, the same reference numerals are given and description thereof is omitted. The internal configuration of the cache processing unit 21, the score combining unit 22, and the word string generating unit 23 is the same as that of the speech recognition apparatus 1 described with reference to FIG.

The unknown word processing means 24 converts an unknown word contained in the corrected character string, as corrected by the operator on the correction device 3, into a word string of known words.
Words corrected by the operator on the correction device 3 may include unknown words that are not registered in the language model 12a. The unknown word processing means 24 therefore includes known word conversion means 241, which converts unknown words into known words. The corrected character string, converted into a word string of known words, is output to the cache processing means 21.

That is, for each word of the input corrected character string, the known word conversion means 241 refers to the language model 12a and converts the word into known words by decomposing it into words registered in the language model 12a. For example, when the unknown word "ハローワーク" (Hello Work) is input, the known word conversion means 241 decomposes it into the known words "ハロー" (hello) and "ワーク" (work), which are registered in the language model 12a. These decomposed known words are output to the cache processing means 21 and stored in the cache storage means 211.
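As a rough illustration only, the decomposition performed by the known word conversion means 241 might be sketched in Python as follows. The description does not specify a segmentation algorithm; the function name `decompose_into_known_words`, the use of a plain set to stand in for the words registered in the language model 12a, and the preference for the split with the fewest pieces are all assumptions made for this sketch.

```python
from typing import Optional


def decompose_into_known_words(word: str, vocabulary: set[str]) -> Optional[list[str]]:
    """Split `word` into vocabulary entries; return None if no full split exists."""
    n = len(word)
    # best[i] is the split of word[:i] with the fewest pieces, or None if word[:i]
    # cannot be covered by vocabulary entries (tie-breaking rule is an assumption).
    best: list[Optional[list[str]]] = [None] * (n + 1)
    best[0] = []
    for end in range(1, n + 1):
        for start in range(end):
            prefix = best[start]
            piece = word[start:end]
            if prefix is None or piece not in vocabulary:
                continue
            candidate = prefix + [piece]
            current = best[end]
            if current is None or len(candidate) < len(current):
                best[end] = candidate
    return best[n]


# Hypothetical vocabulary standing in for the words registered in language model 12a.
vocabulary = {"ハロー", "ワーク", "ニュース"}
print(decompose_into_known_words("ハローワーク", vocabulary))
```

A word for which the function returns None has no decomposition into registered words; the third embodiment does not describe that case, which is taken up by the later embodiments.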

With this configuration, in addition to the effects of the first embodiment, the speech recognition apparatus 1C can convert an unknown word input from the correction device 3 as part of the corrected character string into known words, so that the correction is accurately reflected in the language score and speech recognition accuracy is improved.
The speech recognition apparatus 1C may also be configured by replacing the cache processing means 21 with the cache processing means 21B described with reference to FIG. 5 and further including the weighting value storage means 13 described with reference to FIG. 5.

The speech recognition operation of the speech recognition apparatus 1C is the same as that of the speech recognition apparatus 1 described with reference to FIG. 4, so its description is omitted. The cache operation differs from that of the speech recognition apparatus 1 in that, after the corrected character string fed back from the correction device 3 is input to the speech recognition apparatus 1C and before it is stored in the cache storage means 211, the known word conversion means 241 converts unknown words into known words.

The speech recognition apparatus 1C described above can be realized by a general computer equipped with a CPU and memory (not shown). In this case, the speech recognition apparatus 1C is operated by a speech recognition program that causes the computer to function as each of the means described above.

[Fourth Embodiment]
Next, the configuration of a speech recognition apparatus according to the fourth embodiment of the present invention will be described with reference to FIG. 7. In addition to the functions of the speech recognition apparatus 1 described with reference to FIG. 3, the speech recognition apparatus 1D shown in FIG. 7 uses the corrected character string fed back from the correction device 3 to make unknown words that are not registered in the pronunciation dictionary recognizable.

Here, the speech recognition apparatus 1D includes pronunciation dictionary storage means 10, acoustic model storage means 11, language model storage means 12, external dictionary storage means 14, speech analysis means 20, cache processing means 21, score combining means 22, word string generation means 23, unknown word processing means 24B, and unknown word registration means 25. The components other than the external dictionary storage means 14, the unknown word processing means 24B, and the unknown word registration means 25 are the same as those of the speech recognition apparatus 1 described with reference to FIG. 3, so they are given the same reference numerals and their description is omitted. The internal configurations of the cache processing means 21, the score combining means 22, and the word string generation means 23 are also the same as in the speech recognition apparatus 1 described with reference to FIG. 3 and are therefore not illustrated.

The external dictionary storage means 14 stores the external dictionary 14a, which is a pronunciation dictionary (second pronunciation dictionary), and is a general storage device such as a hard disk. In speech recognition, pronunciations of only a predetermined number of words are normally registered in the pronunciation dictionary 10a in order to raise recognition accuracy and recognition speed. The external dictionary 14a stored in the external dictionary storage means 14 is a large dictionary in which pronunciations of more words are registered than in the pronunciation dictionary 10a.

The unknown word processing means 24B assigns an appearance probability value to an unknown word contained in the corrected character string, as corrected by the operator on the correction device 3.
The unknown word processing means 24B assigns this value through unknown word substitution means 242, which gives an unknown word a predetermined substitute appearance probability value. The appearance probability value for the unknown word is output to the unknown word registration means 25.

For each word of the input corrected character string that is an unknown word not registered in the language model 12a, the unknown word substitution means 242 acquires the pronunciation from the external dictionary storage means 14 and outputs it to the unknown word registration means 25 together with the appearance probability value for unknown words that is registered in advance in the language model 12a. The unknown word processing means 24B outputs known words to the cache processing means 21 as they are.

The unknown word registration means 25 registers the pronunciation and appearance probability value of the unknown word output from the unknown word processing means 24B in the pronunciation dictionary 10a and the language model 12a.
That is, the unknown word registration means 25 registers the text data of the unknown word output from the unknown word processing means 24B, together with the pronunciation of that unknown word, in the pronunciation dictionary 10a. The unknown word registration means 25 also registers the text data of the unknown word in the language model 12a, with the appearance probability value associated with that unknown word.
In this embodiment, the pronunciation dictionary 10a and the language model 12a are assumed to be loaded in advance into a memory (not shown) that the word string generation means 23 can refer to, which is why FIG. 7 shows the output of the unknown word registration means 25 going to the word string generation means 23.
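The flow through the unknown word substitution means 242 and the unknown word registration means 25 could be sketched roughly as below. This is a simplified sketch under stated assumptions, not the implementation of the embodiment: the language model 12a is reduced to a word-to-probability dictionary, the pre-registered unknown-word entry is represented by a hypothetical key `<unk>`, pronunciations are plain strings, and the class and method names are invented for illustration.

```python
from dataclasses import dataclass, field

UNK = "<unk>"  # hypothetical key for the pre-registered unknown-word probability


@dataclass
class Recognizer:
    pronunciation_dict: dict[str, str] = field(default_factory=dict)  # stands in for 10a
    language_model: dict[str, float] = field(default_factory=dict)    # stands in for 12a
    external_dict: dict[str, str] = field(default_factory=dict)       # stands in for 14a

    def process_corrected_words(self, words: list[str]) -> list[str]:
        """Register unknown words; return the known words to pass on to the cache."""
        to_cache = []
        for word in words:
            if word in self.language_model:
                to_cache.append(word)  # known words are passed on unchanged
                continue
            pronunciation = self.external_dict.get(word)
            if pronunciation is None:
                continue  # case not covered by this embodiment; skipped in this sketch
            # Registration: pronunciation into the pronunciation dictionary, and the
            # substitute unknown-word probability into the language model.
            self.pronunciation_dict[word] = pronunciation
            self.language_model[word] = self.language_model[UNK]
        return to_cache
```

Because the dictionaries are updated in place, a recognizer that reads them from memory would see the newly registered word on the next utterance, which matches the assumption stated above that both are held in memory.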

With this configuration, in addition to the effects of the first embodiment, the speech recognition apparatus 1D can, even when an unknown word is input from the correction device 3 as part of the corrected character string, substitute the language model's predetermined appearance probability value for unknown words for that word, making speech recognition possible for unknown words as well.

The speech recognition operation of the speech recognition apparatus 1D is the same as that of the speech recognition apparatus 1 described with reference to FIG. 4, so its description is omitted. The speech recognition apparatus 1D differs from the speech recognition apparatus 1 in that, when the corrected character string fed back from the correction device 3 is input to the speech recognition apparatus 1D, the unknown word substitution means 242 refers to the external dictionary 14a and assigns a pronunciation and an appearance probability value to each unknown word.

The speech recognition apparatus 1D described above can be realized by a general computer equipped with a CPU and memory (not shown). In this case, the speech recognition apparatus 1D is operated by a speech recognition program that causes the computer to function as each of the means described above.

[Fifth Embodiment]
Next, the configuration of a speech recognition apparatus according to the fifth embodiment of the present invention will be described with reference to FIG. 8. In addition to the functions of the speech recognition apparatus 1 described with reference to FIG. 3, the speech recognition apparatus 1E shown in FIG. 8 uses the corrected character string fed back from the correction device 3 to make unknown words that are not registered in the pronunciation dictionary recognizable.

Here, the speech recognition apparatus 1E includes pronunciation dictionary storage means 10, acoustic model storage means 11, language model storage means 12, speech analysis means 20, cache processing means 21, score combining means 22, word string generation means 23B, unknown word processing means 24C, unknown word registration means 25, and phoneme recognition means 26. The components other than the word string generation means 23B, the unknown word processing means 24C, and the phoneme recognition means 26 are the same as those of the speech recognition apparatus 1D described with reference to FIG. 7, so they are given the same reference numerals and their description is omitted. The internal configurations of the cache processing means 21 and the score combining means 22 are the same as in the speech recognition apparatus 1 described with reference to FIG. 3 and are therefore not illustrated.

Based on the pronunciation dictionary 10a, the acoustic model 11a, and the language model 12a, the word string generation means 23B generates the word string (recognized character string) that forms the speech recognition result from the feature vectors extracted by the speech analysis means 20. Its internal configuration is the same as that of the word string generation means 23 of the speech recognition apparatus 1 described with reference to FIG. 3 and is therefore not illustrated.
Here, in addition to the functions of the word string generation means 23 described with reference to FIG. 3, the word string generation means 23B attaches to each word time information (a time stamp) measured by timekeeping means (not shown).

The unknown word processing means 24C assigns an appearance probability value to an unknown word contained in the corrected character string, as corrected by the operator on the correction device 3.
Here, the unknown word processing means 24C assigns this value through unknown word substitution means 242B, which gives an unknown word a predetermined substitute appearance probability value. The appearance probability value for the unknown word is output to the unknown word registration means 25.

For each word of the input corrected character string that is an unknown word not registered in the language model 12a, the unknown word substitution means (second unknown word substitution means) 242B acquires the pronunciation from the phoneme recognition means 26 and outputs it to the unknown word registration means 25 together with the appearance probability value for unknown words that is registered in advance in the language model 12a. The unknown word processing means 24C outputs known words to the cache processing means 21 as they are.

The unknown word substitution means 242B can obtain the time stamp of a corrected word by comparing the time-stamped character string generated by the word string generation means 23B with the corrected character string output from the correction device 3. If that word is an unknown word, the unknown word substitution means 242B associates with it the pronunciation output from the phoneme recognition means 26 for that time stamp.

The phoneme recognition means 26 decomposes an externally input speech signal (input speech) into phonemes with reference to the acoustic model 11a and generates pronunciation data. A general phoneme recognition device can be used as the phoneme recognition means 26, which may be provided inside the speech recognition apparatus 1E or connected to it externally. As its recognition result, the phoneme recognition means 26 attaches time information (a time stamp) from timekeeping means (not shown) to each phoneme and outputs it to the unknown word processing means 24C.
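One way to picture the time stamp matching performed by the unknown word substitution means 242B is the following Python sketch. It is a sketch under stated assumptions rather than the method of the embodiment: the comparison of the recognized and corrected strings is done here with the standard library's `difflib.SequenceMatcher`, each recognized word is assumed to carry a start and end time, each phoneme a single time, and all type and function names are hypothetical. The check for whether a corrected word is actually an unknown word is omitted for brevity.

```python
from difflib import SequenceMatcher
from typing import NamedTuple


class TimedWord(NamedTuple):
    text: str
    start: float  # seconds (assumed representation of the time stamp)
    end: float


class TimedPhoneme(NamedTuple):
    symbol: str
    time: float


def phonemes_for_corrected_words(recognized, corrected, phonemes):
    """Map each word that replaced recognized words to the phonemes in that time span."""
    texts = [w.text for w in recognized]
    matcher = SequenceMatcher(a=texts, b=corrected, autojunk=False)
    result: dict[str, list[str]] = {}
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # Only handle the unambiguous case: one corrected word replacing a span.
        if tag != "replace" or j2 - j1 != 1:
            continue
        start, end = recognized[i1].start, recognized[i2 - 1].end
        result[corrected[j1]] = [p.symbol for p in phonemes if start <= p.time < end]
    return result
```

Spans where the operator inserted several words at once are skipped here because the sketch has no basis for dividing the time span among them.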

With this configuration, in addition to the effects of the first embodiment, the speech recognition apparatus 1E can, even when an unknown word is input from the correction device 3 as part of the corrected character string, substitute the language model's predetermined appearance probability value for unknown words for that word, making speech recognition possible for unknown words as well.

The speech recognition operation of the speech recognition apparatus 1E is the same as that of the speech recognition apparatus 1 described with reference to FIG. 4, so its description is omitted. The speech recognition apparatus 1E differs from the speech recognition apparatus 1 in that, when the corrected character string fed back from the correction device 3 is input to the speech recognition apparatus 1E, the unknown word substitution means 242B assigns a pronunciation and an appearance probability value to each unknown word.

The speech recognition apparatus 1E described above can be realized by a general computer equipped with a CPU and memory (not shown). In this case, the speech recognition apparatus 1E is operated by a speech recognition program that causes the computer to function as each of the means described above.

The first to fifth embodiments have been described above as embodiments of the present invention, but the present invention is not limited to these embodiments. For example, the third to fifth embodiments may be combined as appropriate; one such combination is a sixth embodiment that combines the third to fifth embodiments.

[Sixth Embodiment]
Here, a speech recognition apparatus according to a sixth embodiment, which combines the third to fifth embodiments, will be described with reference to FIG. 9.
The speech recognition apparatus 1F shown in FIG. 9 combines the third to fifth embodiments, and its unknown word processing means 24D includes the known word conversion means 241 and the unknown word substitution means 242 and 242B. The individual components have been described in the first to fifth embodiments, so they are given the same reference numerals and their description is omitted. Because the function of the unknown word processing means 24D differs, however, its operation is described here.

When a corrected character string is input from the correction device 3 and it contains an unknown word, the speech recognition apparatus 1F uses the known word conversion means 241 to decompose the unknown word and thereby convert it into a plurality of known words.
If the unknown word cannot be converted into known words, the speech recognition apparatus 1F uses the unknown word substitution means 242 to acquire the pronunciation from the external dictionary 14a in the external dictionary storage means 14, and outputs it to the unknown word registration means 25, together with the appearance probability value for unknown words registered in advance in the language model 12a, for registration.

Furthermore, if the pronunciation of the unknown word cannot be acquired from the external dictionary 14a either, the speech recognition apparatus 1F uses the unknown word substitution means 242B to acquire the pronunciation from the phoneme recognition means 26, and outputs it to the unknown word registration means 25, together with the appearance probability value for unknown words registered in advance in the language model 12a, for registration.
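The order in which the unknown word processing means 24D tries its three strategies can be summarized in a short Python sketch. The function name, the string labels, and the way the decomposer and phoneme source are passed in as callables are assumptions made for illustration; only the order of the fallbacks is taken from the description above.

```python
from typing import Callable, Optional

Decomposer = Callable[[str], Optional[list[str]]]
PhonemeSource = Callable[[str], Optional[str]]


def handle_corrected_word(
    word: str,
    known_words: set[str],
    decompose: Decomposer,
    external_dict: dict[str, str],
    phonemes_for: PhonemeSource,
) -> tuple[str, object]:
    """Return ("cache", words) or ("register", pronunciation) for one corrected word."""
    if word in known_words:
        return ("cache", [word])
    # 1. Known word conversion means 241: try to split into known words.
    pieces = decompose(word)
    if pieces is not None:
        return ("cache", pieces)
    # 2. Unknown word substitution means 242: look up the external dictionary.
    pronunciation = external_dict.get(word)
    # 3. Unknown word substitution means 242B: fall back to phoneme recognition.
    if pronunciation is None:
        pronunciation = phonemes_for(word)
    if pronunciation is None:
        return ("skip", None)  # not covered by the description; assumed behavior
    return ("register", pronunciation)
```

A word labeled "register" would then be handed to the unknown word registration means 25 together with the substitute appearance probability value.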

As a result, the speech recognition apparatus 1F can correct the language score using the words of the most recently input corrected character strings and thereby weight those words, which improves speech recognition accuracy and prevents recognition errors from recurring. Furthermore, even when a word is an unknown word, the speech recognition apparatus 1F can substitute the language model's predetermined appearance probability value for unknown words for it, making speech recognition possible for unknown words as well.
The speech recognition apparatus 1F may also be configured by replacing the cache processing means 21 with the cache processing means 21B described with reference to FIG. 5 and further including the weighting value storage means 13 described with reference to FIG. 5.

The speech recognition apparatus 1F described above can be realized by a general computer equipped with a CPU and memory (not shown). In this case, the speech recognition apparatus 1F is operated by a speech recognition program that causes the computer to function as each of the means described above.

S Speech recognition system
1 (1B to 1F) Speech recognition apparatus
10 Pronunciation dictionary storage means
10a Pronunciation dictionary
11 Acoustic model storage means
11a Acoustic model
12 Language model storage means
12a Language model
13 Weighting value storage means (IDF value storage means)
13a IDF table
14 External dictionary storage means
14a External dictionary (second pronunciation dictionary)
20 Speech analysis means
21 Cache processing means
211 Cache storage means
212 Cache score calculation means
22 Score combining means
221 Language score correction means
23 Word string generation means
231 Acoustic score calculation means
232 Search means
24 Unknown word processing means
241 Known word conversion means
242 Unknown word substitution means
242B Unknown word substitution means (second unknown word substitution means)
25 Unknown word registration means
26 Phoneme recognition means
3 Correction device

Claims

A speech recognition apparatus in a speech recognition system in which an operator uses a correction device to correct recognition errors in a character string obtained when the speech recognition apparatus performs speech recognition using a pronunciation dictionary, an acoustic model, and a language model, and the corrected string is output as the speech recognition result, the speech recognition apparatus comprising:
cache storage means for receiving the corrected character string corrected on the correction device and storing a predetermined number of the words constituting the corrected character string;
cache score calculation means for calculating, as a cache score for a word stored in the cache storage means, the probability value that the word appears in the cache storage means;
language score correction means for generating a corrected language score, in which the language score of a word obtained from the language model is corrected, by adding the cache score, which is the appearance probability value of the word stored in the cache storage means, to the language score, which is the appearance probability value of the word obtained from the language model; and
search means for searching the language model, based on the corrected language score generated by the language score correction means, for the word string with the maximum connection probability value as the speech recognition result.

The speech recognition apparatus according to claim 1, further comprising IDF value storage means for storing, in association with each specific word, an IDF value, which is a measure of how many of a plurality of documents the specific word appears in,
wherein the cache score calculation means adds to the cache score, as a weighting value, the IDF value stored in the IDF value storage means in association with the word to which the cache score corresponds.

The speech recognition apparatus according to claim 1 or claim 2, wherein the corrected character string is a character string in which words corrected by the operator and words not corrected are mixed, and
the cache score calculation means adds a predetermined weighting value to the cache score for a word corrected by the operator in the corrected character string input from the correction device.

The speech recognition apparatus according to claim 1, further comprising known word conversion means for decomposing, for each word of the corrected character string input from the correction device, an unknown word not registered in the language model into known words registered in the language model, wherein the known words are stored in the cache storage means.

The speech recognition apparatus according to claim 3, further comprising:
external dictionary storage means for storing an external dictionary, which is a second pronunciation dictionary with more registered words than the pronunciation dictionary; and
unknown word substitution means for acquiring from the external dictionary, for each word of the corrected character string input from the correction device, the pronunciation of an unknown word not registered in the language model, and for setting as the appearance probability value of the unknown word, by substitution, the probability value registered in advance in the language model as the connection probability value for unknown words,
wherein the search means searches for the word string with the maximum connection probability value based on the connection probability values in the language model and the substituted connection probability value of the unknown word.

The speech recognition apparatus according to claim 3, further comprising:
phoneme recognition means for generating pronunciation data of a speech signal based on the acoustic model; and
second unknown word substitution means for acquiring from the phoneme recognition means, for each word of the corrected character string input from the correction device, the pronunciation of an unknown word not registered in the language model, and for setting as the appearance probability value of the unknown word, by substitution, the probability value registered in advance in the language model as the connection probability value for unknown words,
wherein the search means searches for the word string with the maximum connection probability value based on the connection probability values in the language model and the substituted connection probability value of the unknown word.

The speech recognition apparatus according to claim 3, further comprising:
external dictionary storage means for storing an external dictionary, which is a second pronunciation dictionary with more registered words than the pronunciation dictionary;
phoneme recognition means for generating pronunciation data of a speech signal based on the acoustic model;
known word conversion means for decomposing, for each word of the corrected character string input from the correction device, an unknown word not registered in the language model into known words registered in the language model, and storing the known words in the cache storage means;
unknown word substitution means for acquiring from the external dictionary the pronunciation of an unknown word that the known word conversion means could not decompose into known words, and for setting as the appearance probability value of the unknown word, by substitution, the probability value registered in advance in the language model as the connection probability value for unknown words; and
second unknown word substitution means for acquiring from the phoneme recognition means the pronunciation of an unknown word whose pronunciation the unknown word substitution means could not acquire from the external dictionary, and for setting as the appearance probability value of the unknown word, by substitution, the probability value registered in advance in the language model as the connection probability value for unknown words,
wherein the search means searches for the word string with the maximum connection probability value based on the connection probability values in the language model and the substituted connection probability value of the unknown word.

A speech recognition program for a speech recognition system in which an operator uses a correction device to correct recognition errors in a character string obtained when a speech recognition apparatus performs speech recognition using a pronunciation dictionary, an acoustic model, and a language model, and the corrected string is output as the speech recognition result, the program causing a computer of the speech recognition apparatus, in order to reduce recognition errors by means of the character string corrected on the correction device, to function as:
cache score calculation means for calculating, as a cache score for a word stored in cache storage means that stores a predetermined number of the words of the corrected character string corrected on the correction device, the probability value that the word appears in the cache storage means;
language score correction means for generating a corrected language score, in which the language score of a word obtained from the language model is corrected, by adding the cache score, which is the appearance probability value of the word stored in the cache storage means, to the language score, which is the appearance probability value of the word obtained from the language model; and
search means for searching the language model, based on the corrected language score generated by the language score correction means, for the word string with the maximum connection probability value as the speech recognition result.