JP2008083165A

JP2008083165A - Voice recognition processing program and voice recognition processing method

Info

Publication number: JP2008083165A
Application number: JP2006260477A
Authority: JP
Inventors: Hiroaki Kokubo; 浩明小窪; Nobuo Hataoka; 信夫畑岡; Takeshi Honma; 健本間; Hirohiko Sagawa; 浩彦佐川; Hisashi Takahashi; 久高橋; Takeshi Ono; 健大野; Minoru Togashi; 実冨樫; Daisuke Saito; 大介斎藤; Keiko Katsuragawa; 景子桂川
Original assignee: Xanavi Informatics Corp; Nissan Motor Co Ltd
Current assignee: Nissan Motor Co Ltd; Faurecia Clarion Electronics Co Ltd
Priority date: 2006-09-26
Filing date: 2006-09-26
Publication date: 2008-04-10

Abstract

PROBLEM TO BE SOLVED: To provide a technique for reducing misrecognition in keyword recognition. SOLUTION: Among the keywords defined as words to be recognized, those that are not similar to each other are stored in a storage device. When the voice input is one of the keywords that are not similar to each other, the processing defined by that keyword is performed. When the voice input is a keyword that is similar to another keyword, information for confirming the keyword is output to an output device. COPYRIGHT: (C)2008, JPO&INPIT

Description

The present invention relates to speech recognition.

In recent years, interfaces using speech recognition technology have been spreading in information devices such as mobile phones and car navigation systems. One speech recognition technology is keyword recognition, which extracts keywords (predetermined words) from input speech and performs processing according to the extracted keywords. Keyword recognition is used for extracting a user's intention from free utterances and for creating indexes for voice search.

A technique for realizing this keyword recognition is described in Patent Document 1. Patent Document 1 describes preparing a keyword model generated to accept predetermined keywords and a garbage model that accepts phrases other than the keywords, and extracting keywords by comparing the scores obtained from matching these models against the input speech.

Japanese Laid-Open Patent Publication No. 2005-92310

At present, the accuracy of keyword recognition is not yet sufficient. Keyword misrecognition falls into two types. One is an error in which speech uttered as keyword A is mistakenly recognized as keyword B. The other is an error in which input speech that is not a keyword is accepted as a keyword.

The former error arises because keyword recognition targets only part of a dictionary entry that ordinary speech recognition would register, for example extracting only the registered keywords “airport” (kūkō) or “high school” (kōkō) from the utterances “Haneda Airport” or “Setagaya High School”. Similar phrases therefore tend to end up as keywords. In addition, the more keyword entries there are, the more similar phrases there are, so the error frequency increases.

The latter error is recognizing non-keyword speech as a keyword, for example recognizing the utterance “Chofu Airport” as the registered keyword “Chofu High School”. To prevent this error, Patent Document 1 generates similar words. However, this does not solve the problem when the keywords themselves include words that are similar to each other. Moreover, because phrases outside the keyword set are added, there is also a risk that a keyword utterance is misrecognized as a non-keyword phrase.

The present invention has been made in view of the problems described above, and its object is to reduce misrecognition in keyword recognition.

To achieve the above object, the present invention stores in a storage device those keywords, among the keywords defined as vocabulary to be recognized, that are not similar to each other; executes a first process when the input speech is one of the keywords that are not similar to each other; and executes a second process when the input speech is a keyword that is similar to another keyword.

The present invention also provides a computer-executable speech recognition program for recognizing keywords, which are words contained in input speech. The computer has storage means that stores a plurality of keywords classified into a first keyword group, which contains no keywords that are phonologically similar to each other, and a second keyword group, which consists of the keywords not included in the first keyword group. The program causes the computer to execute: a keyword recognition step of recognizing at least one of the keywords contained in the input speech; a first process execution step of executing a first process when the recognized keyword is included in the first keyword group; and a second process execution step of executing a second process when the recognized keyword is included in the second keyword group.

According to the technique of the present invention, misrecognition in keyword recognition can be reduced.

An embodiment of the present invention will now be described in detail with reference to the drawings.

<First Embodiment>
First, the first embodiment will be described.

FIG. 1 shows a configuration example of the speech recognition processing device 1. The speech recognition processing device 1 implements a voice interface using the keyword recognition of the first embodiment.

The speech recognition processing device 1 is an information processing device such as a navigation system, a portable terminal, or a CTI (Computer Telephony Integration) system. It includes a CPU (Central Processing Unit) 101, a memory 102, a secondary storage device 103, an input device 104, an output device 105, a communication interface 106, and the like, which are connected by a bus 107.

The secondary storage device 103 stores a keyword model 161, a confusion matrix 164, a garbage model 165, and the like.

The keyword model 161 is a sequence of acoustic models of the speech of the keywords to be extracted. As a feature of this embodiment, the keyword model 161 is divided into two classes, keyword class A 162 and keyword class B 163. Keyword class A 162 contains keywords that are not similar to each other. Keyword class B 163 contains the keywords in the keyword model 161 that are not included in keyword class A 162. Keyword class B 163 therefore also contains keywords that are similar to each other.

The data structure of keyword class A 162 and keyword class B 163 is arbitrary. For example, keyword class A 162 and keyword class B 163 may be stored in different tables, databases, or the like. Alternatively, they may be stored in the same table or database, with a flag or the like attached to each keyword indicating whether it belongs to keyword class A 162 or keyword class B 163.
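The single-table variant with a per-keyword flag could be sketched as follows. This is only an illustration: the names `KeywordClass`, `KeywordEntry`, `KEYWORD_TABLE`, and `keywords_in` are hypothetical, and the two entries use the class assignments that the embodiment later derives for “station” and “high school”.

```python
from dataclasses import dataclass
from enum import Enum


class KeywordClass(Enum):
    A = "A"  # keywords that are not similar to any other keyword
    B = "B"  # all remaining keywords


@dataclass(frozen=True)
class KeywordEntry:
    keyword: str
    keyword_class: KeywordClass  # the flag that assigns the class


# One table holds every keyword; the flag distinguishes the two classes.
KEYWORD_TABLE = [
    KeywordEntry("station", KeywordClass.A),
    KeywordEntry("high school", KeywordClass.B),
]


def keywords_in(table, keyword_class):
    """Return the keywords whose flag matches the given class."""
    return [e.keyword for e in table if e.keyword_class is keyword_class]
```

Storing the classes in two separate tables would work equally well; the flag variant simply keeps each keyword and its class in one place.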

The confusion matrix 164 stores, for each of a plurality of speech units, a similarity, that is, the degree to which the unit is recognized as matching itself and each of the other units.

The garbage model 165 stores a sequence of acoustic models for speech other than the keywords. The acoustic model sequence in the garbage model 165 is the same as that used in general speech recognition.

By executing a program (not shown), the CPU 101 implements a class classification unit 151, an analysis unit 152, a matching unit 153, a determination unit 154, a process execution unit 155, a process execution unit A 156, a process execution unit B 157, and the like.

The class classification unit 151 generates keyword class A 162 and keyword class B 163 from the confusion matrix 164. The analysis unit 152 converts an input speech waveform into feature parameters. The matching unit 153 matches the feature parameter sequence of the input speech produced by the analysis unit 152 against the keyword model 161 and the garbage model 165, and finds the model with the highest score. This matching is the same as that used in general speech recognition. The determination unit 154 determines the keyword contained in the input speech based on the result obtained by the matching unit 153. The determination unit 154 also determines whether the keyword belongs to keyword class A 162 or keyword class B 163.

The process execution unit 155 executes processing based on the keyword detected by the determination unit 154. The process execution unit 155 includes the process execution unit A 156, the process execution unit B 157, and the like. If the keyword belongs to keyword class A 162, the process execution unit A 156 executes the processing; specifically, it executes the process corresponding to the recognized keyword. If the keyword belongs to keyword class B 163, the process execution unit B 157 executes the processing; specifically, it outputs information for confirming the keyword.

The input device 104 is, for example, a microphone, a keyboard, a mouse, or a scanner. The output device 105 is, for example, a display, a speaker, or a printer. The speech recognition processing device 1 connects to other communication terminals (not shown) and the like via the communication interface 106 and a communication network (not shown).

Next, the information in the secondary storage device 103 will be described.

First, the keyword model 161 will be described.

The keyword model 161 is a sequence of acoustic models generated from one or more keywords registered in advance. A keyword here is a phrase, such as a search query or a command, that is registered in association with processing in the process execution unit 155 described later. The acoustic model is not particularly limited; for example, a conventional HMM (Hidden Markov Model) may be used. An HMM is a probabilistic model consisting of internal states that transition according to a Markov model and the probability distribution of observed signals in each internal state.

An example of the acoustic model will now be described concretely with reference to FIG. 2.

FIG. 2 shows an example of the acoustic model sequence for the keyword “station” (eki). This example uses three-state triphone HMMs as the acoustic models. In FIG. 2, HMM model 201 is the model for the triphone “*/e/k”, HMM model 202 is the model for the triphone “e/k/i”, and HMM model 203 is the model for the triphone “k/i/*”. Concatenating these three HMM models 201 to 203 forms the acoustic model for the keyword “station”.
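The way a keyword expands into a triphone sequence can be illustrated with a short sketch. The function name `to_triphones` is hypothetical, and the phoneme sequence e, k, i for “station” is inferred from the triphone labels given for FIG. 2; the sketch only produces the labels and does not build the HMMs themselves.

```python
def to_triphones(phonemes):
    """Expand a phoneme list into left/center/right triphone labels.

    "*" stands for an unspecified context at the word boundary.
    """
    padded = ["*"] + list(phonemes) + ["*"]
    return [
        f"{padded[i - 1]}/{padded[i]}/{padded[i + 1]}"
        for i in range(1, len(padded) - 1)
    ]


# Phonemes of the keyword "station" (eki), as implied by FIG. 2.
print(to_triphones(["e", "k", "i"]))
```

Each label in the resulting list corresponds to one of the HMM models 201 to 203 that are concatenated to form the keyword's acoustic model.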

As described above, in this embodiment the keywords registered in the model are classified into keyword class A 162 and keyword class B 163. Only phrases that are not acoustically similar to each other are registered in keyword class A 162; keywords not registered in keyword class A 162 are registered in keyword class B 163.

Next, the confusion matrix 164 will be described with reference to FIGS. 3 and 4.

FIG. 3 shows an example of a word-level confusion matrix 164. In FIG. 3, the confusion matrix 164 consists of a vertical axis 311, a horizontal axis 312, and a matrix portion 313. The vertical axis 311 and the horizontal axis 312 list the uttered keywords. Each cell of the matrix portion 313 is the probability that an utterance of the keyword on the vertical axis 311 is recognized as the keyword on the horizontal axis 312.

Consider an utterance of the keyword “station”. In the confusion matrix 164 of FIG. 3, the probability that the utterance “station” on the vertical axis 311 is correctly recognized as “station” on the horizontal axis 312 is 98.2 percent, as given in the matrix portion 313. The probability that it is mistakenly recognized as “high school” on the horizontal axis 312 is 0.1 percent.

Next, consider an utterance of the keyword “high school”. In the confusion matrix 164 of FIG. 3, the probability that the utterance “high school” on the vertical axis 311 is correctly recognized as “high school” on the horizontal axis 312 is 90.1 percent, as given in the matrix portion 313. The probability that it is mistakenly recognized as “airport” on the horizontal axis 312 is 5.1 percent.

Next, another example of the confusion matrix 164 will be described.

FIG. 4 shows an example of a phoneme-level confusion matrix. In FIG. 4, the confusion matrix 164 consists of a vertical axis 411, a horizontal axis 412, and a matrix portion 413. The vertical axis 411 and the horizontal axis 412 list the uttered phonemes. Each cell of the matrix portion 413 is the probability that an utterance of the phoneme on the vertical axis 411 is recognized as the phoneme on the horizontal axis 412.

Next, the garbage model 165 will be described.

The garbage model 165 is an acoustic model of speech other than the keywords registered in the keyword model 161. The acoustic model is not particularly limited; as with the keyword model 161 described above, a conventional HMM may be used, for example.

An example of this acoustic model will now be described concretely with reference to FIG. 5.

In FIG. 5, HMM model 501 is a model whose center phoneme is “a”, HMM model 502 is a model whose center phoneme is “i”, and HMM model 503 is a model whose center phoneme is “N”. The models for all phonemes are arranged in parallel in this way, and a loop (HMM model 506) is added from the end node (HMM model 505) back to the start node (HMM model 504), so that the resulting HMM models every possible phoneme sequence.

Next, an operation example of the speech recognition processing device 1 will be described with reference to FIG. 6.

Speech data is input to the speech recognition processing device 1 from the input device 104 such as a microphone, from the communication interface 106, or the like. On receiving speech data (S601), the class classification unit 151 of the speech recognition processing device 1 classifies the keywords of the keyword model 161 into keyword class A 162 and keyword class B 163 (S602). To do so, for example, the class classification unit 151 selects a keyword in the keyword model 161, assigns it to keyword class A 162 if the rate at which it is misrecognized as another keyword is below a predetermined threshold, and assigns it to keyword class B 163 if the rate is at or above the threshold.

A concrete example will be described using FIG. 3, for the case where the decision threshold is 4 percent.

First, consider the keyword “station”. In the confusion matrix 164 of FIG. 3, the probabilities that the keyword “station” is misrecognized as any other word are all 4% or less. The class classification unit 151 therefore classifies the keyword “station” into keyword class A 162.

Next, consider the keyword “high school”. In the confusion matrix 164 of FIG. 3, the probability that the keyword “high school” is misrecognized as the keyword “airport” is 5.1%. The class classification unit 151 therefore classifies the keyword “high school” into keyword class B 163.
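The threshold-based classification of S602 can be sketched as follows, under stated assumptions. The function name `classify_keywords` and the nested-dictionary layout are hypothetical; the matrix contains only the four percentages quoted from FIG. 3, and the strict "below threshold" test follows the general rule given for S602.

```python
def classify_keywords(confusion, threshold):
    """Split keywords into class A and class B.

    confusion[spoken][recognized] is the probability (in percent) that an
    utterance of `spoken` is recognized as `recognized`.
    """
    class_a, class_b = [], []
    for spoken, row in confusion.items():
        # Highest probability of being taken for a *different* keyword.
        worst = max(
            (p for recognized, p in row.items() if recognized != spoken),
            default=0.0,
        )
        (class_a if worst < threshold else class_b).append(spoken)
    return class_a, class_b


# Only the values quoted in the description of FIG. 3 are included.
CONFUSION = {
    "station": {"station": 98.2, "high school": 0.1},
    "high school": {"high school": 90.1, "airport": 5.1},
}

class_a, class_b = classify_keywords(CONFUSION, threshold=4.0)
print(class_a, class_b)
```

With the 4 percent threshold, “station” lands in class A and “high school” in class B, matching the two worked cases above.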

With a confusion matrix 164 such as the one shown in FIG. 4, the class classification unit 151 performs the same processing as in the case of FIG. 3 for each phoneme making up a keyword, so a detailed description is omitted.

In FIG. 6, the analysis unit 152 converts the input speech waveform into feature parameters (S603). This conversion is the same as in conventional speech recognition. That is, the speech signal is divided into short intervals (several tens of ms), and the signal in each interval is converted into a multidimensional vector such as MFCCs (mel-frequency cepstral coefficients); this vector is the feature parameter. The analysis unit 152 thus obtains time-series data of feature vectors represented as multidimensional vectors.
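The framing step that precedes feature extraction might look like the sketch below. The function name `split_into_frames`, the 25 ms frame length, and the 10 ms hop with overlapping frames are assumptions chosen as commonly used values; the description only says the intervals are several tens of ms. Computing MFCCs from each frame is left out.

```python
def split_into_frames(samples, sample_rate, frame_ms=25, hop_ms=10):
    """Cut a sample sequence into fixed-length, possibly overlapping frames."""
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)
    frames = []
    # Only full frames are kept; a trailing partial frame is dropped.
    for start in range(0, len(samples) - frame_len + 1, hop_len):
        frames.append(samples[start:start + frame_len])
    return frames


# 100 ms of dummy audio at 16 kHz.
frames = split_into_frames(list(range(1600)), sample_rate=16000)
print(len(frames), len(frames[0]))
```

Each frame would then be converted into one feature vector, giving the time series that the matching step consumes.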

The matching unit 153 matches the feature parameter sequence of the input speech produced by the analysis unit 152 against the keyword model 161 and the garbage model 165, and finds the model with the highest score (S604). This matching is the same as in conventional speech recognition. For example, the matching unit 153 calculates, as a score, the similarity between the feature parameter sequence and each model in the keyword model 161. It likewise calculates, as a score, the similarity between the feature parameter sequence and the model in the garbage model 165, and selects the model with the highest score. In matching that includes a garbage model, it is common to apply a penalty to the garbage model's score in order to adjust the score balance between the keyword model and the garbage model.
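The score comparison with a garbage penalty can be sketched as follows. The function name `best_model`, the use of log-likelihood-style scores (higher is better), and the subtraction of a fixed penalty are assumptions for illustration; the description does not specify how the scores or the penalty are computed.

```python
def best_model(keyword_scores, garbage_score, garbage_penalty=0.0):
    """Return the best-scoring keyword, or None when the garbage model wins.

    keyword_scores must contain at least one keyword.
    """
    adjusted_garbage = garbage_score - garbage_penalty
    keyword, score = max(keyword_scores.items(), key=lambda kv: kv[1])
    return keyword if score > adjusted_garbage else None


scores = {"station": -120.0, "high school": -150.0}
print(best_model(scores, garbage_score=-118.0, garbage_penalty=5.0))
print(best_model(scores, garbage_score=-118.0))
```

Raising the penalty makes the recognizer more willing to accept keywords; lowering it makes the garbage model absorb more of the input.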

Based on the result obtained by the matching unit 153, the determination unit 154 determines whether the model with the highest score belongs to keyword class A 162 (S605).

If the determination in S605 finds that the model with the highest score is the model of a keyword in keyword class A 162, the determination unit 154 instructs the process execution unit 155 to execute the process determined by that keyword. The process execution unit 155 executes processing based on the keyword (S606). Specifically, for example, the process execution unit A 156 of the process execution unit 155 executes the process corresponding to the recognized keyword. For this purpose, for example, a table (not shown) associating each recognized keyword with the process or application to execute when that keyword is recognized may be stored in advance in the memory 102, and the process execution unit 155 may refer to that table to decide which process or application to execute.

Specifically, for example, when the keyword “mail” is recognized, the process execution unit A 156 executes the “mail application” associated in advance with the keyword “mail”. In a search task over facility names nationwide, a large amount of data must be searched with limited computing resources, so processing efficiency is often improved by classifying facility names into categories and narrowing the search target. In this case, when the keyword “high school” is recognized, the search category is narrowed to “high school”.

On the other hand, if the determination in S605 finds that the model with the highest score is not the model of a keyword in keyword class A 162, the determination unit 154 determines whether the model with the highest score is the model of a keyword in keyword class B 163 (S607).

If the determination in S607 finds that the model with the highest score is not the model of a keyword in keyword class B 163, the determination unit 154 ends the processing.

If the determination in S607 finds that the model with the highest score is the model of a keyword in keyword class B 163, the determination unit 154 instructs the process execution unit 155 to execute processing. The process execution unit 155 confirms whether the recognized keyword is correct (S608). To do so, the process execution unit B 157 of the process execution unit 155 outputs the keyword of the model with the highest score to the output device 105 such as a display or speaker, or via the communication interface 106 to a communication terminal or the like, and asks the user to confirm it. Using the input device 104 or an input device (not shown) of the communication terminal, the user inputs information indicating whether the keyword is correct to the speech recognition processing device 1.

An example of the screen displayed to confirm whether the recognized keyword is correct will now be described with reference to FIG. 7. In FIG. 7, the screen 701 is an example displayed based on the information output from the speech recognition processing device 1 in order to confirm the recognized keyword. Using the input device, the user selects either the radio button 711 or the radio button 712 to indicate whether the displayed keyword is correct. In the example of the screen 701, if the displayed keyword is not correct, the user enters the correct keyword in the area 713. When the user presses the button 714, the information indicating whether the keyword is correct, the correct keyword, and so on are input to the speech recognition processing device 1.

The process execution unit B 157 may instead present several keywords similar to the recognized keyword and let the user choose among them, or may calculate a confidence score for the recognition result. To present several keywords similar to the recognized keyword, the process execution unit B 157 may select from the confusion matrix 164 the words whose probability of being misrecognized as the recognized keyword is at or above a predetermined threshold, and output these words together with the recognized keyword. To calculate a confidence score for the recognition result, the confidence score is calculated from the per-model scores computed by the matching unit 153 described above and an arbitrary formula. The process execution unit B 157 may select an arbitrary number of keywords from the top of the confidence ranking and output them together with their confidence scores.
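Selecting the alternatives to present alongside a recognized keyword could be sketched like this. The function name `confusable_candidates` and the dictionary layout are hypothetical, and the matrix again contains only the percentages quoted from FIG. 3.

```python
def confusable_candidates(confusion, recognized, threshold):
    """List the words whose utterances are misrecognized as `recognized`
    with a probability (in percent) at or above `threshold`."""
    return [
        spoken
        for spoken, row in confusion.items()
        if spoken != recognized and row.get(recognized, 0.0) >= threshold
    ]


# Only the values quoted in the description of FIG. 3 are included.
CONFUSION = {
    "station": {"station": 98.2, "high school": 0.1},
    "high school": {"high school": 90.1, "airport": 5.1},
}

# If "airport" was recognized, the user may actually have said "high school".
print(confusable_candidates(CONFUSION, "airport", threshold=4.0))
```

The lookup reads the matrix column for the recognized keyword, since the question at confirmation time is which utterances could have produced this result.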

Returning to FIG. 6, when the process execution unit B 157 obtains the correct keyword, it performs processing using that keyword (S609). This processing is arbitrary: it may be a process determined by the recognized keyword, as in S606 described above, or a process specified by the user during the keyword confirmation described above.
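The branch structure of S605 to S609 can be summarized in a short sketch. The function `handle_recognition` and its `run` and `confirm` callbacks are hypothetical stand-ins for the process execution units and the confirmation dialog, and placing “airport” in class B is an assumption made for the example.

```python
def handle_recognition(keyword, class_a, class_b, run, confirm):
    """Dispatch a recognition result along S605 to S609 of FIG. 6.

    keyword: best-scoring keyword, or None when no keyword was recognized
    run:     callable that executes the process tied to a keyword
    confirm: callable that asks the user and returns the correct keyword
    """
    if keyword in class_a:            # S605: keyword in class A?
        return run(keyword)           # S606: act without confirmation
    if keyword in class_b:            # S607: keyword in class B?
        confirmed = confirm(keyword)  # S608: ask the user
        return run(confirmed)         # S609: act on the confirmed keyword
    return None                       # neither class: end processing


result = handle_recognition(
    "high school",
    class_a={"station"},
    class_b={"high school", "airport"},  # "airport" in class B is assumed
    run=lambda k: f"run:{k}",
    confirm=lambda k: "airport",         # the user corrects the keyword
)
print(result)
```

Only class B results pay the cost of a confirmation round trip; class A results are acted on immediately.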

Instead of performing the process corresponding to the recognized keyword, the process execution unit B 157 may perform a process corresponding to the recognized keyword's class. A concrete example will be described using the search task mentioned above. In that search task, the process corresponding to a keyword narrows the search category based on the recognized keyword. For example, when the keyword “high school” is recognized, the search category is narrowed to “high school”. If the recognized keyword “high school” is in fact a misrecognition of the utterance “airport”, the narrowing of the search category fails. For this reason, when the recognized keyword belongs to keyword class B 163, the search category is not narrowed based on the recognized keyword alone; instead, it is narrowed to the logical sum (OR) of all keywords belonging to keyword class B 163. If this constraint is too loose and leaves too many search targets, the search can be narrowed again by asking a question about something other than the genre set by the keyword, for example “Please specify a prefecture”.
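The OR-based narrowing for class B keywords can be sketched as follows. The function names `search_categories` and `narrow`, the membership of “airport” in class B, and the facility “Tokyo Station” are assumptions introduced for illustration; the other two facility names come from the background discussion.

```python
def search_categories(keyword, class_a, class_b):
    """Return the set of categories to search for a recognized keyword."""
    if keyword in class_a:
        return {keyword}     # trusted result: narrow to that keyword only
    if keyword in class_b:
        return set(class_b)  # uncertain result: OR of all class B keywords
    return set()


def narrow(facilities, categories):
    """Keep the facilities whose category is among the given categories."""
    return [name for name, category in facilities if category in categories]


FACILITIES = [
    ("Haneda Airport", "airport"),
    ("Setagaya High School", "high school"),
    ("Tokyo Station", "station"),  # hypothetical entry
]

categories = search_categories(
    "high school", class_a={"station"}, class_b={"high school", "airport"}
)
print(narrow(FACILITIES, categories))
```

Even if “high school” was a misrecognition of “airport”, the airport entries stay in the search space, at the cost of a larger candidate set.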

As described above, the keywords belonging to the keyword class A162 have no mutually similar entries, so the possibility of recognizing a wrong keyword is low. Therefore, even if the recognition result is trusted and processing is executed without confirmation by the user, there is little risk of executing processing contrary to the user's intention. The keywords belonging to the keyword class B163, on the other hand, may have mutually similar entries. The process execution unit B157 therefore confirms the recognition result, searches across all keyword genres, or the like, regardless of which keyword was recognized, so that processing contrary to the user's intention is not executed even if the keyword recognition is wrong.
<Second Embodiment>
Next, a second embodiment will be described. The second embodiment differs from the first embodiment described above in that a subset dictionary containing predetermined keywords is created and a vocabulary containing the recognized keyword is selected from the subset dictionary. Here, an expression made up of one or more words is referred to as a vocabulary.

Since the second embodiment described below is partly the same as the first embodiment described above, the same components are given the same reference numerals and their description is omitted; only the differing components are described in detail.

A configuration example of the speech recognition processing device 801 will be described with reference to FIG. 8.

The speech recognition processing device 801 of the second embodiment has a speech re-recognition unit 811 in place of the process execution unit 155 of the speech recognition processing device 1 of the first embodiment. It additionally has a main dictionary 851, a subset dictionary A852, a subset dictionary B853, a scale model 854, and the like.

The main dictionary 851 stores the recognition targets of the speech recognition processing device 801. The subset dictionary A852 and the subset dictionary B853 are generated by the processing described later. The scale model 854 stores the acoustic models of the vocabularies in the main dictionary 851.

The speech re-recognition unit 811 performs speech recognition on the feature vector series generated by the analysis unit 152, based on the keyword recognized by the analysis unit 152, the collation unit 153, and the determination unit 154. The speech re-recognition unit 811 has a re-recognition unit A812, a re-recognition unit B813, and the like.

When the recognized keyword is a keyword of a model belonging to the keyword class A162, the re-recognition unit A812 recognizes a vocabulary containing that keyword. The re-recognition unit A812 has a dictionary generation unit 821, a collation unit 822, a determination unit 823, and the like. The dictionary generation unit 821 generates the subset to be stored in the subset dictionary A852 from the main dictionary 851 and the keyword recognized by the analysis unit 152, the collation unit 153, and the determination unit 154. The collation unit 822 collates the input feature vector series using the subset dictionary A852 and the scale model 854 and outputs scores. The determination unit 823 obtains the recognition result for the input speech by searching for the hypothesis with the maximum collation score.

When the recognized keyword is a keyword of a model belonging to the keyword class B163, the re-recognition unit B813 recognizes a vocabulary containing that keyword. The re-recognition unit B813 has a dictionary generation unit 831, a collation unit 832, a determination unit 833, and the like. The dictionary generation unit 831 generates the subset to be stored in the subset dictionary B853 from the keyword class B163 and the main dictionary 851. The collation unit 832 collates the input feature vector series using the subset dictionary B853 and the scale model 854 and outputs scores. The determination unit 833 obtains the recognition result for the input speech by searching for the hypothesis with the maximum collation score.

Next, an operation example will be described with reference to FIG. 9.

Speech data is input to the speech recognition processing device 801 from the input device 104 such as a microphone, the communication interface 106, or the like. On receiving the input of speech data (S901), the class classification unit 151 of the speech recognition processing device 801 classifies the keywords of the model 161 into the keyword class A162 and the keyword class B163 (S902). This processing is the same as S601 and S602 described above.

On receiving the input of speech data, the analysis unit 152 of the speech recognition processing device 801 converts the input speech waveform into feature parameters (S903). This conversion by the analysis unit 152 is the same as S603 described above.

The collation unit 153 collates the feature parameter series of the input speech converted by the analysis unit 152 against the keyword model 161 and the garbage model 165 and finds the model with the maximum score (S904). This processing is the same as S604 described above.

Based on the result obtained by the collation unit 153, the determination unit 154 determines whether the model with the maximum score belongs to the keyword class A162 (S905). This processing is the same as S605 described above.

If the determination in S905 finds that the model with the maximum score is the model of a keyword in the keyword class A162, the determination unit 154 instructs the re-recognition unit A812 to execute its processing. The dictionary generation unit 821 of the re-recognition unit A812 generates the subset dictionary A852 from the recognized keyword and the main dictionary 851 (S906). Specifically, for example, the dictionary generation unit 821 extracts from the main dictionary 851 the vocabularies containing the recognized keyword and stores them in the subset dictionary A852.

An example of the processing of S906 will now be described concretely with reference to FIG. 10.

In FIG. 10, 14 vocabularies are registered in the main dictionary 851: “Atsugi City Hall”, “Atsugi High School”, “Inokashira Park”, “Kinuta Park”, “Kyoto University”, “Kokubunji Station”, “Kokubunji City Hall”, “Kusatsu Onsen”, “Setagaya Park”, “Setagaya Art Museum”, “Setagaya High School”, “Tokyo Station”, “Shinagawa Station”, and “Haneda Airport”. An example in which “park” (公園) is recognized as the keyword is described here.

From the words registered in the main dictionary 851, the dictionary generation unit 821 extracts the vocabularies containing “park” (公園), namely “Inokashira Park” and “Kinuta Park”, and stores them as the subset dictionary A852. At this time, the dictionary generation unit 821 may select not only vocabularies containing the written form “公園” but also vocabularies containing the phoneme string “kouen” (コウエン), for example the homophones 公演 (performance), 後援 (sponsorship), and 講演 (lecture).
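A minimal sketch of S906 follows, assuming that “contains the keyword” means a plain substring match on the written form. The function name `build_subset_dictionary_a` is hypothetical, and the entries are written in their original Japanese notation so that the substring test is meaningful.

```python
# The 14 entries of the main dictionary from FIG. 10, in Japanese notation.
MAIN_DICTIONARY = [
    "厚木市役所", "厚木高校", "井の頭公園", "砧公園", "京都大学",
    "国分寺駅", "国分寺市役所", "草津温泉", "世田谷公園", "世田谷美術館",
    "世田谷高校", "東京駅", "品川駅", "羽田空港",
]


def build_subset_dictionary_a(main_dictionary, keyword):
    """Keep only the entries whose written form contains the keyword."""
    return [entry for entry in main_dictionary if keyword in entry]


subset_a = build_subset_dictionary_a(MAIN_DICTIONARY, "公園")
```

Note that a plain substring filter over the 14 listed entries also returns 世田谷公園 (Setagaya Park), which the text's list of extracted entries does not mention. Matching on phoneme strings instead of notation, as the text allows, would need a pronunciation field per entry, which this sketch does not model.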

In FIG. 9, the collation unit 822 collates the input feature vector series using the subset dictionary A852 and the scale model 854 and calculates scores (S907). This processing is the same as S604 described above.

The determination unit 823 determines the vocabulary with the maximum collation score (S908). This processing is the same as S604 described above.

If, on the other hand, the determination in S905 finds that the model with the maximum score is not the model of a keyword in the keyword class A162, the determination unit 154 determines whether the model with the maximum score belongs to the keyword class B163 (S909). This processing is the same as S608 described above.

If the determination in S909 finds that the model with the maximum score is not the model of a keyword in the keyword class B163, the processing ends.

If the determination in S909 finds that the model with the maximum score is the model of a keyword in the keyword class B163, the determination unit 154 instructs the re-recognition unit B813 to execute its processing. The dictionary generation unit 831 of the re-recognition unit B813 generates the subset dictionary B853 from the keyword class B163 and the main dictionary 851 (S910). Specifically, for example, the dictionary generation unit 831 extracts from the main dictionary 851 the vocabularies containing a keyword in the keyword class B163 and stores them in the subset dictionary B853.

A specific example of S910 will be described with reference to FIG. 11.

In FIG. 11, 14 vocabularies are registered in the main dictionary 851: “Atsugi City Hall”, “Atsugi High School”, “Inokashira Park”, “Kinuta Park”, “Kyoto University”, “Kokubunji Station”, “Kokubunji City Hall”, “Kusatsu Onsen”, “Setagaya Park”, “Setagaya Art Museum”, “Setagaya High School”, “Tokyo Station”, “Shinagawa Station”, and “Haneda Airport”. The keywords “high school” (高校), “hot spring” (温泉), and “airport” (空港) are assumed to be registered in the keyword class B163.

From the words in the main dictionary 851, the dictionary generation unit 831 registers in the subset dictionary B853 every word that contains any of the keywords in the keyword class B163. In the example of FIG. 11, four words are registered in the subset dictionary: “Atsugi High School”, “Kusatsu Onsen”, “Setagaya High School”, and “Haneda Airport”.
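The construction of the subset dictionary B853 in S910 can be sketched in the same way, again assuming a substring match on the written form. The function name `build_subset_dictionary_b` is hypothetical.

```python
# The 14 entries of the main dictionary from FIG. 11, in Japanese notation.
MAIN_DICTIONARY = [
    "厚木市役所", "厚木高校", "井の頭公園", "砧公園", "京都大学",
    "国分寺駅", "国分寺市役所", "草津温泉", "世田谷公園", "世田谷美術館",
    "世田谷高校", "東京駅", "品川駅", "羽田空港",
]
# Keyword class B from FIG. 11: high school, hot spring, airport.
KEYWORD_CLASS_B = ["高校", "温泉", "空港"]


def build_subset_dictionary_b(main_dictionary, class_b_keywords):
    """Keep the entries that contain at least one class B keyword."""
    return [
        entry
        for entry in main_dictionary
        if any(keyword in entry for keyword in class_b_keywords)
    ]


subset_b = build_subset_dictionary_b(MAIN_DICTIONARY, KEYWORD_CLASS_B)
```

Unlike the class A case, the recognized keyword is not an argument here: the same four entries result whichever class B keyword was recognized.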

In FIG. 9, the collation unit 832 collates the input feature vector series using the subset dictionary B853 and the scale model 854 and calculates scores (S911). This processing is the same as S604 described above. For example, in the case of FIG. 11 described above, whichever of the keywords “high school”, “hot spring”, or “airport” is recognized, the collation unit 832 recognizes the speech using the subset dictionary B853 consisting of the four words “Atsugi High School”, “Kusatsu Onsen”, “Setagaya High School”, and “Haneda Airport”.

The determination unit 833 determines the vocabulary with the maximum collation score (S912). This processing is the same as S604 described above.

The determination unit 823 or the determination unit 833 outputs the vocabulary with the maximum collation score to the output device 105 or the communication interface 106 (S913). The vocabulary with the maximum collation score may also be used in processing by a program that is not shown. This processing may be, for example, the same as the processing by the process execution unit A156, the process execution unit B157, and the like described in the first embodiment.
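The branch structure of S905 to S913 can be summarized in a short sketch. Everything here is illustrative: `recognize_vocabulary` is a hypothetical name, and the `score` callable is a stand-in for the acoustic collation against the feature vector series, which is not modeled.

```python
def recognize_vocabulary(keyword, main_dictionary, class_a, class_b, score):
    """Choose a subset dictionary by keyword class, then pick the best entry."""
    if keyword in class_a:
        # S906: subset A depends on the recognized keyword itself.
        subset = [e for e in main_dictionary if keyword in e]
    elif keyword in class_b:
        # S910: subset B depends on the whole class, not on the keyword.
        subset = [e for e in main_dictionary if any(k in e for k in class_b)]
    else:
        # Neither class matched, so processing ends without a result.
        return None
    # S907/S908 or S911/S912: return the entry with the maximum score.
    return max(subset, key=score, default=None)


MAIN = ["厚木高校", "草津温泉", "羽田空港", "東京駅", "品川駅"]
CLASS_A = {"駅"}
CLASS_B = {"高校", "温泉", "空港"}


def toy_score(entry):
    # Stand-in scorer: pretend the utterance was 厚木高校 (Atsugi High School).
    return 1.0 if entry == "厚木高校" else 0.0


result = recognize_vocabulary("空港", MAIN, CLASS_A, CLASS_B, toy_score)
```

In this toy run the keyword passed in is 空港 (airport), yet 厚木高校 is still returned, because the class B subset does not depend on which class B keyword was recognized.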

In the second embodiment, as described above, speech recognition is performed using a subset dictionary based on the keyword, which allows processing with less computation and less memory than speech recognition using the main dictionary.

The keyword class A162 is designed in advance so that keywords with similar phoneme sequences are not mixed in it, whereas the keyword class B163 may contain keywords that are easily misrecognized because their phoneme sequences are similar, such as “high school” (koukou) and “airport” (kuukou). However, because the subset dictionary B853 for the keyword class B163 is created independently of the recognized keyword, misrecognition can be reduced. For example, even if the keyword is wrongly recognized as “airport” for the utterance “Atsugi High School”, the subset dictionary B853 contains the entry “Atsugi High School”, so the correct recognition result can still be obtained.

Because phonologically similar keywords never coexist in the keyword class A162, a subset dictionary is unlikely to be generated from a wrong keyword. For keywords of the keyword class B, even if the keyword recognition is wrong, using a subset dictionary that does not depend on the keyword makes it possible to obtain the correct recognition result.

In addition, a subset dictionary is created and the speech to be recognized is identified from that dictionary. This is particularly effective when the speech to be recognized combines a noun that can take many forms, such as a place name or personal name, with a common noun, as in “Atsugi High School”.

Embodiments of the present invention have been described in detail above with reference to the drawings. The specific configuration is not limited to these embodiments, however, and design changes and the like within a scope that does not depart from the gist of the invention are also included.

For example, in the embodiments described above, the speech recognition processing device classifies similar vocabularies using the confusion matrix, but the invention is not limited to this. The classification into the keyword class A and the keyword class B may be made at the designer's discretion, or may be based on any desired criterion.
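One possible criterion is the threshold rule stated in the claims: a keyword whose similarity to every other keyword is below a threshold goes to class A, and the rest go to class B. The sketch below assumes the confusion matrix is available as a mapping from keyword pairs to similarity values; the function name, the data layout, and the numbers are all hypothetical.

```python
def classify_keywords(keywords, similarity, threshold):
    """Split keywords into class A (no similar peer) and class B (the rest)."""
    class_a, class_b = [], []
    for kw in keywords:
        # Highest similarity between this keyword and any other keyword,
        # looking the pair up in either order; missing pairs count as 0.0.
        highest = max(
            (
                max(similarity.get((kw, other), 0.0),
                    similarity.get((other, kw), 0.0))
                for other in keywords
                if other != kw
            ),
            default=0.0,
        )
        (class_a if highest < threshold else class_b).append(kw)
    return class_a, class_b


# Hypothetical similarity values: only 高校 and 空港 are easily confused.
SIMILARITY = {("高校", "空港"): 0.4}
class_a, class_b = classify_keywords(["駅", "公園", "高校", "空港"], SIMILARITY, 0.2)
```

Under these assumed values, 駅 and 公園 land in class A while 高校 and 空港 land in class B.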

The timing at which a subset dictionary is generated is also arbitrary. For example, it may be generated and stored in advance, or it may be created each time a keyword is recognized, at predetermined intervals, or when an instruction from the designer is input. Alternatively, a plurality of subset dictionaries may be created in advance, and an appropriate one selected from them according to the given keyword.
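The last variant, creating several subset dictionaries in advance and selecting one by keyword, could look like the following sketch. The function name `precompute_subsets` and the shortened dictionary are assumptions for illustration.

```python
def precompute_subsets(main_dictionary, keywords):
    """Build one subset dictionary per keyword ahead of time."""
    return {
        keyword: [entry for entry in main_dictionary if keyword in entry]
        for keyword in keywords
    }


# Shortened, illustrative main dictionary in Japanese notation.
MAIN = ["国分寺駅", "国分寺市役所", "井の頭公園", "東京駅", "品川駅"]
SUBSETS = precompute_subsets(MAIN, ["駅", "公園", "市役所"])

# At recognition time, selection is a plain lookup by the recognized keyword.
station_subset = SUBSETS["駅"]
```

This trades memory for time: the filtering cost is paid once, and each recognition only performs a dictionary lookup.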

As another way of creating a subset dictionary, the subset dictionary may be built from the words that contain no keyword belonging to the keyword class A162. An example is described with reference to FIG. 12. In FIG. 12, the keywords 1201 belonging to the keyword class A162 are the three keywords “station”, “park”, and “city hall”. When the vocabularies 1202 containing these keywords are excluded from the vocabularies of the main dictionary 851, six words remain: “Atsugi High School”, “Kyoto University”, “Kusatsu Onsen”, “Setagaya Art Museum”, “Setagaya High School”, and “Haneda Airport”. These six words are registered in the subset dictionary B853. The subsequent processing is the same as described above.
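This exclusion-based construction can be sketched as the complement of a class A filter, again assuming a substring match on the written form. The function name `build_subset_by_exclusion` is hypothetical.

```python
# The 14 entries of the main dictionary, in Japanese notation.
MAIN_DICTIONARY = [
    "厚木市役所", "厚木高校", "井の頭公園", "砧公園", "京都大学",
    "国分寺駅", "国分寺市役所", "草津温泉", "世田谷公園", "世田谷美術館",
    "世田谷高校", "東京駅", "品川駅", "羽田空港",
]
# Keyword class A from FIG. 12: station, park, city hall.
KEYWORD_CLASS_A = ["駅", "公園", "市役所"]


def build_subset_by_exclusion(main_dictionary, class_a_keywords):
    """Keep the entries that contain none of the class A keywords."""
    return [
        entry
        for entry in main_dictionary
        if not any(keyword in entry for keyword in class_a_keywords)
    ]


remaining = build_subset_by_exclusion(MAIN_DICTIONARY, KEYWORD_CLASS_A)
```

Compared with filtering by class B keywords, this variant also keeps entries such as 京都大学 and 世田谷美術館, which contain no keyword of either class.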

FIG. 1 is a diagram showing a configuration example of the speech recognition processing device of the first embodiment.
FIG. 2 is a diagram for explaining a speech model.
FIG. 3 is a diagram showing an example of a confusion matrix in the same embodiment.
FIG. 4 is a diagram showing an example of a confusion matrix in the same embodiment.
FIG. 5 is a diagram for explaining a speech model.
FIG. 6 is a diagram showing an operation example in the same embodiment.
FIG. 7 is a diagram showing a screen example in the same embodiment.
FIG. 8 is a diagram showing a configuration example of the speech recognition processing device of the second embodiment.
FIG. 9 is a diagram showing an operation example in the same embodiment.
FIG. 10 is a diagram for explaining an operation example of generating a subset dictionary in the same embodiment.
FIG. 11 is a diagram for explaining an operation example of generating a subset dictionary in the same embodiment.
FIG. 12 is a diagram for explaining an operation example of generating a subset dictionary in the same embodiment.

Explanation of symbols

1: speech recognition processing device, 101: CPU, 102: memory, 103: secondary storage device, 104: input device, 105: output device, 106: communication interface, 151: class classification unit, 152: analysis unit, 153: collation unit, 154: determination unit, 155: process execution unit, 156: process execution unit A, 157: process execution unit B, 161: keyword model, 162: keyword class A, 163: keyword class B, 164: confusion matrix, 165: garbage model, 801: speech recognition processing device, 811: speech re-recognition unit, 812: re-recognition unit A, 813: re-recognition unit B, 821: dictionary generation unit, 822: collation unit, 823: determination unit, 831: dictionary generation unit, 832: collation unit, 833: determination unit, 851: main dictionary, 852: subset dictionary A, 853: subset dictionary B, 854: scale model

Claims

A computer-executable speech recognition processing program for recognizing a keyword, which is a word contained in input speech,
the program causing a computer having storage means, which stores a plurality of keywords classified into a first keyword group that contains no phonologically similar keywords and a second keyword group consisting of the keywords not included in the first keyword group, to execute:
a keyword recognition step of recognizing at least one of the keywords contained in the input speech;
a first process execution step of executing a first process when the recognized keyword is included in the first keyword group; and
a second process execution step of executing a second process when the recognized keyword is included in the second keyword group.

The speech recognition processing program according to claim 1,
wherein the storage means further stores a confusion matrix containing, for each of a plurality of speech sounds, a similarity indicating the degree to which the sound is recognized as the same sound or as another sound,
and the program further causes the computer to execute a classification step of selecting, based on the confusion matrix and the keywords, the keywords whose similarity is less than a predetermined threshold, classifying the selected keywords into the first keyword group, and classifying the keywords other than the selected keywords into the second keyword group.

The speech recognition processing program according to claim 1 or 2,
wherein the first process is a process determined according to the recognized keyword,
and the second process is a specific process.

The speech recognition processing program according to claim 3,
wherein the second process outputs, to output means, information for confirming the recognized keyword.

A computer-executable speech recognition processing program for recognizing a keyword, which is a word contained in input speech,
the program causing a computer having storage means, which stores a plurality of keywords classified into a first keyword group that contains no phonologically similar keywords and a second keyword group consisting of the keywords not included in the first keyword group, together with a dictionary containing a plurality of vocabularies each made up of one or more words, to execute:
a keyword recognition step of recognizing at least one of the keywords contained in the input speech;
a first vocabulary recognition step of, when the recognized keyword is included in the first keyword group, selecting from the dictionary the vocabularies containing the recognized keyword, storing the selected vocabularies in the storage means as a first subset dictionary, and extracting from the input speech a vocabulary that is included in the first subset dictionary and contains the recognized keyword;
a second vocabulary recognition step of, when the recognized keyword is included in the second keyword group, selecting from the dictionary the vocabularies containing the plurality of keywords, storing the selected vocabularies in the storage means as a second subset dictionary, and extracting from the input speech a vocabulary that is included in the second subset dictionary and contains the recognized keyword; and
a step of outputting the extracted vocabulary.