JP5037018B2

JP5037018B2 - Speech recognition apparatus and speech recognition method

Info

Publication number: JP5037018B2
Application number: JP2006023229A
Authority: JP
Inventors: 貴志赤坂
Original assignee: Yamaha Motor Co Ltd
Current assignee: Yamaha Motor Co Ltd
Priority date: 2006-01-31
Filing date: 2006-01-31
Publication date: 2012-09-26
Anticipated expiration: 2026-01-31
Also published as: JP2007206239A

Description

The present invention relates to a speech recognition apparatus and a speech recognition method that recognize meaningful recognition targets, such as words, from a speech signal. The invention also relates to a voice instruction device, an information device, and a vehicle system that include such a speech recognition apparatus.

Some car navigation devices mounted on automobiles are equipped with a voice instruction device (voice input interface). The voice instruction device includes a speech recognition apparatus that recognizes the driver's speech and a command generation unit that generates an instruction command corresponding to the recognized speech. In accordance with the instruction command generated by the command generation unit, the car navigation device sets a destination, searches for a route, and performs other operations.

Because the voice instruction device is an eyes-free and hands-free interface, the driver can operate the car navigation device without compromising safe driving.
There is demand for using navigation devices and other information devices not only in automobiles but also on motorcycles. Here too, as with automobiles, an eyes-free and hands-free interface is required, and a voice instruction device of the kind described above is the leading candidate.

A typical word speech recognition apparatus includes a speech analysis unit that sequentially converts the input speech to be recognized into acoustic feature parameters, and a matching unit that compares the obtained acoustic feature parameters with an acoustic library (recognition dictionary), which is a previously prepared set of standard acoustic features for each word, and identifies the word represented by the input speech.
In general, utterance data from many speakers is recorded to create the acoustic library. Learning with the recorded data as training data yields an acoustic library in a form suited to the recognition algorithm employed by the matching unit.
Patent documents: JP 2003-114696 A; JP 2005-134436 A

However, it is impossible to collect training data that covers every situation that can occur. In addition, depending on the feature parameters and the recognition (matching) algorithm employed, the recognition process inevitably develops a kind of "quirk" of its own. As a result, it is impossible to recognize all words with uniform probability, and some words inevitably end up easy to recognize while others are hard to recognize.
Accordingly, an object of the present invention is to provide a speech recognition apparatus and a speech recognition method that reduce the non-uniformity in recognition probability among recognition targets (words, utterance units, and the like) and achieve more reliable speech recognition.

Another object of the present invention is to provide a voice instruction device, an information device, and a vehicle system that include such a speech recognition apparatus.

To achieve the above object, the invention according to claim 1 is a speech recognition apparatus including: forward recognition means that recognizes an input speech signal in the forward direction, following the input time series, and generates forward likelihood information representing the likelihood that the input speech signal corresponds to a recognition candidate (for example, for each of a plurality of recognition candidates); backward recognition means that recognizes the input speech signal in the backward direction, following a reverse time series opposite to the input time series, and generates backward likelihood information representing the likelihood that the input speech signal corresponds to a recognition candidate (for example, for each of a plurality of recognition candidates); and integrated determination means that integrates the outputs of the forward recognition means and the backward recognition means to generate a recognition result corresponding to the input speech signal. The integrated determination means includes likelihood information combining means that combines the forward likelihood information and the backward likelihood information generated by the forward and backward recognition means, respectively, to generate combined likelihood information representing the likelihood that the input speech signal corresponds to a recognition candidate, and recognition result determination means that evaluates the combined likelihood information generated by the likelihood information combining means to obtain the recognition result. The likelihood information combining means includes weighted combining means that weights and combines the forward likelihood information and the backward likelihood information. The weighted combining means in turn includes recognition-candidate-adaptive weighted combining means that applies, to the forward likelihood information and/or the backward likelihood information, a candidate-dependent weight determined in advance by learning for each recognition candidate so that the variation in recognition probability among recognition candidates is reduced, and then combines them.
With this configuration, the speech signal is recognized not only in the forward direction following the input time series but also in the backward direction following the opposite time series, so the non-uniformity in recognition probability among recognition targets can be reduced. This enables more reliable speech recognition. Moreover, because the recognition result is generated from a comprehensive judgment of the forward and backward recognition results, recognition probabilities that vary little among recognition targets can be achieved. Further, because the recognition result is obtained by evaluating combined likelihood information that merges the forward likelihood information and the backward likelihood information, the result reflects both the forward and the backward recognition processing. Finally, because the forward and backward results are combined with appropriate weights, the recognition probability can be improved still further.
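As a rough illustration of the candidate-adaptive weighted combination described above, the following Python sketch combines per-candidate forward and backward scores using per-candidate weights and then selects the best candidate. The function names, the use of log-likelihood-style scores (higher is better), the fallback weights of 0.5, and the sample vocabulary are all assumptions made for illustration; the source does not specify a score scale or how the learned weights are stored.

```python
from typing import Dict, Tuple

Scores = Dict[str, float]                 # candidate -> score (assumed: higher is better)
Weights = Dict[str, Tuple[float, float]]  # candidate -> (forward weight, backward weight)


def combine_scores(forward: Scores, backward: Scores, weights: Weights) -> Scores:
    """Weighted linear combination of forward and backward scores per candidate."""
    combined = {}
    # Only candidates scored by both recognizers are combined.
    for cand in forward.keys() & backward.keys():
        # Hypothetical fallback: equal weights when none were learned for this candidate.
        w_fwd, w_bwd = weights.get(cand, (0.5, 0.5))
        combined[cand] = w_fwd * forward[cand] + w_bwd * backward[cand]
    return combined


def recognize(forward: Scores, backward: Scores, weights: Weights) -> str:
    """Return the candidate with the highest combined score."""
    combined = combine_scores(forward, backward, weights)
    if not combined:
        raise ValueError("no candidate was scored in both directions")
    return max(combined, key=combined.get)


if __name__ == "__main__":
    fwd = {"home": -10.0, "phone": -9.5}
    bwd = {"home": -8.0, "phone": -11.0}
    print(recognize(fwd, bwd, weights={}))
```

In this sketch the weights are simply looked up per candidate; how they would be learned so as to even out recognition probabilities is left open, as it is in the text.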

The forward and backward recognition means each include, for example, speech analysis means that generates a feature sequence of the input speech signal, a recognition dictionary (acoustic library: acoustic model, word dictionary, and the like), and matching means. The recognition dictionary stores standard feature sequences prepared in advance for a plurality of recognition candidates. The matching means compares the feature sequence obtained by speech analysis with the standard feature sequence stored in the recognition dictionary for each recognition candidate, and generates likelihood information representing the likelihood that the input speech signal represents each candidate. More specifically, the matching means obtains the likelihood information according to a recognition algorithm such as DP matching (dynamic programming), a neural network method, a Bayes discriminant function method, or a vector quantization method. Likelihood information is information representing the similarity between the feature sequence obtained by speech analysis and the standard feature sequence of a recognition candidate stored in the recognition dictionary, and its form depends on the recognition algorithm executed by the matching means. Examples include a likelihood and a distance between feature sequence vectors.
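To make the idea of matching a feature sequence against stored standard sequences more concrete, here is a minimal Python sketch of DP matching (dynamic time warping), one of the algorithms the text names. It returns a distance per candidate, which is one of the forms of likelihood information mentioned above (smaller means more similar). The function names, the Euclidean frame distance, the unconstrained step pattern, and the toy dictionary are assumptions for illustration only.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def frame_distance(a: Vector, b: Vector) -> float:
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dp_matching_distance(seq: Sequence[Vector], template: Sequence[Vector]) -> float:
    """Accumulated distance along the best time alignment of seq and template."""
    n, m = len(seq), len(template)
    if n == 0 or m == 0:
        raise ValueError("both sequences must contain at least one frame")
    inf = float("inf")
    # cost[i][j]: best accumulated distance aligning seq[:i] with template[:j].
    cost: List[List[float]] = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(seq[i - 1], template[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def score_candidates(seq: Sequence[Vector],
                     dictionary: Dict[str, Sequence[Vector]]) -> Dict[str, float]:
    """Distance from the input sequence to every candidate's stored sequence."""
    return {word: dp_matching_distance(seq, tmpl) for word, tmpl in dictionary.items()}


if __name__ == "__main__":
    toy_dictionary = {"go": [[0.0], [1.0]], "stop": [[1.0], [0.0]]}
    print(score_candidates([[0.0], [0.0], [1.0]], toy_dictionary))
```

Because this sketch yields distances rather than likelihoods, a combiner built on it would pick the minimum rather than the maximum, which is consistent with the remark that the form of the likelihood information depends on the recognition algorithm.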

The recognition dictionary can be created by learning that uses previously recorded speech data as training data. In this case, the recognition dictionary for the forward recognition means must be created by learning with feature sequences that follow the input time series of the speech data, and the recognition dictionary for the backward recognition means must be created by learning with feature sequences that follow the time series opposite to the input time series.

The weighted combining means includes recognition-candidate-adaptive weighted combining means that applies, to the forward likelihood information and/or the backward likelihood information, a candidate-dependent weight determined in advance by learning for each recognition candidate so that the variation in recognition probability among recognition candidates is reduced, and then combines them. An appropriate weight can therefore be assigned to each recognition candidate, which further reduces the variation in recognition probability among candidates and enables more reliable speech recognition. The weight for each recognition candidate is preferably determined in advance, for example by learning, so that the variation in recognition probability is reduced, and registered in the recognition dictionary in association with the candidate.
The invention according to claim 2 is the speech recognition apparatus according to claim 1, further including reverse playback means that plays back the input speech signal in a time series opposite to the input time series and inputs it to the backward recognition means. With this configuration, the backward recognition means can recognize the input speech signal following the time series opposite to the input time series. More specifically, the backward recognition means can include speech analysis means that analyzes the speech signal played back in reverse by the reverse playback means and outputs its feature sequence, a recognition dictionary, and matching means that matches the feature sequence output by the speech analysis means against the contents of the recognition dictionary. The recognition dictionary in this case is preferably created by learning that uses a process in which the training data is played back in reverse by the reverse playback means and the output is analyzed by the speech analysis means to obtain feature sequences.

The invention according to claim 3 is the speech recognition apparatus according to claim 1, wherein the backward recognition means includes speech analysis means that analyzes the input speech signal following the input time series and generates a feature sequence following the input time series, and feature sequence inversion means that converts the feature sequence generated by the speech analysis means into a feature sequence following the time series opposite to the input time series and inputs it to the backward recognition means. With this configuration, the backward recognition means can recognize the input speech signal following the time series opposite to the input time series. More specifically, the backward recognition means can include the speech analysis means, the feature sequence inversion means, a recognition dictionary, and matching means that matches the feature sequence output by the feature sequence inversion means against the contents of the recognition dictionary. The recognition dictionary in this case is preferably created by learning that uses a process in which the training data is analyzed by the speech analysis means following the input time series and the result is inverted on the time axis by the feature sequence inversion means.

In this configuration, speech analysis is common to both the forward and the backward recognition means, so the two may share a single speech analysis means.
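The two ways of obtaining a time-reversed feature sequence (reverse playback of the waveform before analysis, as in claim 2, and inversion of the feature sequence after analysis, as in claim 3) can be contrasted with a small Python sketch. The toy analyzer below, which computes the energy of non-overlapping frames, stands in for real speech analysis and is purely an assumption, as are all function names and the frame length.

```python
from typing import List, Sequence


def analyze(samples: Sequence[float], frame_len: int = 4) -> List[float]:
    """Toy speech analysis: energy of each non-overlapping frame.

    Trailing samples that do not fill a whole frame are dropped.
    """
    n_frames = len(samples) // frame_len
    return [
        sum(x * x for x in samples[i * frame_len:(i + 1) * frame_len])
        for i in range(n_frames)
    ]


def reverse_playback(samples: Sequence[float]) -> List[float]:
    """Claim 2 style: reverse the waveform itself on the time axis."""
    return list(reversed(samples))


def features_via_reverse_playback(samples: Sequence[float], frame_len: int = 4) -> List[float]:
    """Reverse the waveform first, then analyze it."""
    return analyze(reverse_playback(samples), frame_len)


def features_via_inversion(samples: Sequence[float], frame_len: int = 4) -> List[float]:
    """Claim 3 style: analyze in input order, then reverse the feature sequence.

    The forward features computed here could be shared with the forward recognizer.
    """
    forward_features = analyze(samples, frame_len)
    return list(reversed(forward_features))


if __name__ == "__main__":
    wave = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
    print(features_via_reverse_playback(wave))
    print(features_via_inversion(wave))
```

For this symmetric toy feature the two routes agree whenever the signal length is a whole number of frames, but they already differ when it is not, and with real features they would generally differ further. That is consistent with the recommendation to build each backward dictionary through the same front end that will be used at recognition time.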

The likelihood information combining means may combine the forward likelihood information and the backward likelihood information linearly (a first-order combination) or non-linearly (for example, a combination that includes terms in the second or higher power of the likelihood information).
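One way to write the two kinds of combination is sketched below. The symbols are assumptions introduced here and do not appear in the source: $L^{\mathrm{f}}_c$ and $L^{\mathrm{b}}_c$ denote the forward and backward likelihood information for candidate $c$, $L_c$ the combined likelihood information, and $\alpha_c, \beta_c, \gamma_c, \delta_c$ candidate-dependent weights. The quadratic form is only one example of a non-linear combination containing second-power terms.

```latex
% Linear (first-order) combination
L_c = \alpha_c \, L^{\mathrm{f}}_c + \beta_c \, L^{\mathrm{b}}_c

% One possible non-linear combination with second-power terms
L_c = \alpha_c \, L^{\mathrm{f}}_c + \beta_c \, L^{\mathrm{b}}_c
    + \gamma_c \, \bigl(L^{\mathrm{f}}_c\bigr)^{2} + \delta_c \, \bigl(L^{\mathrm{b}}_c\bigr)^{2}
```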

The invention according to claim 4 is the speech recognition apparatus according to any one of claims 1 to 3, wherein the integrated determination means includes one-way recognition determination means that determines whether the standalone result of at least one of the forward recognition means and the backward recognition means indicates the presence of a recognition candidate having at least a predetermined likelihood, and means that outputs that recognition candidate as the recognition result when the one-way recognition determination means determines that the standalone result indicates the presence of such a candidate.

With this configuration, the recognition result can in some cases be output using the standalone result of the forward recognition means or the backward recognition means alone, which makes processing simpler and faster.
The invention according to claim 5 is the speech recognition apparatus according to claim 4, wherein the one-way recognition determination means includes weighted determination means that applies a candidate-dependent weight to the standalone result and determines whether that result indicates the presence of a recognition candidate having at least the predetermined likelihood. With this configuration, assigning an appropriate weight to each recognition candidate raises the probability of correct recognition based on the standalone result. The weight for each recognition candidate is preferably obtained in advance by learning and registered in the recognition dictionary in association with the candidate.
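The one-way determination of claims 4 and 5 can be pictured with the following Python sketch: a single recognizer's scores are weighted per candidate and compared with a threshold, and only if no candidate clears it are both directions combined. This is one possible reading, not the patent's stated implementation. The function names, the confidence scale (higher is better), the default weight of 1.0, the unweighted sum used as the fallback combination, and the `run_backward` callable that defers the backward pass are all assumptions.

```python
from typing import Callable, Dict, Optional

Scores = Dict[str, float]  # candidate -> confidence (assumed: higher is better)


def one_way_decision(scores: Scores, weights: Dict[str, float],
                     threshold: float) -> Optional[str]:
    """Return the best candidate whose weighted score reaches the threshold, else None."""
    best_cand, best_score = None, threshold
    for cand, score in scores.items():
        weighted = weights.get(cand, 1.0) * score  # hypothetical default weight
        if weighted >= best_score:
            best_cand, best_score = cand, weighted
    return best_cand


def recognize_with_shortcut(forward: Scores,
                            run_backward: Callable[[], Scores],
                            weights: Dict[str, float],
                            threshold: float) -> str:
    """Use the forward result alone when it is decisive; otherwise combine both."""
    early = one_way_decision(forward, weights, threshold)
    if early is not None:
        return early  # the backward pass is never run in this case
    backward = run_backward()
    # Fallback: plain sum of the two scores (an assumed, unweighted combination).
    combined = {c: forward[c] + backward.get(c, 0.0) for c in forward}
    return max(combined, key=combined.get)


if __name__ == "__main__":
    fwd = {"stop": 0.6, "start": 0.5}
    print(recognize_with_shortcut(fwd, lambda: {"stop": 0.1, "start": 0.9}, {}, 0.9))
```

Deferring the backward pass is what would make this path faster in the sketch; the text itself only says that the standalone result can sometimes be used directly.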

The invention according to claim 6 is a voice instruction device including the speech recognition apparatus according to any one of claims 1 to 5, speech signal input means for inputting a speech signal to the speech recognition apparatus, and command data generation means that converts the recognition result of the speech recognition apparatus into instruction command data for input to a predetermined device. In this configuration, reducing the variation in recognition probability among recognition candidates raises the probability that instruction command data matching the user's intention is generated. This provides an excellent voice input interface.

The invention according to claim 7 is an information device including the voice instruction device according to claim 6 and command processing means that operates in accordance with the instruction command data generated by the command data generation means. With this configuration, the information device can be operated comfortably by voice instruction.
The invention according to claim 8 is a vehicle system in which the command processing means of the information device according to claim 7 is mounted on a vehicle body. This configuration provides a vehicle system in which an in-vehicle information device can be operated comfortably by voice instruction.

The invention according to claim 9 is a speech recognition method including: a forward recognition step of recognizing an input speech signal in the forward direction, following the input time series, and generating forward likelihood information representing the likelihood that the input speech signal corresponds to a recognition candidate; a backward recognition step of recognizing the input speech signal in the backward direction, following a reverse time series opposite to the input time series, and generating backward likelihood information representing the likelihood that the input speech signal corresponds to a recognition candidate; and an integrated determination step of integrating the results of the forward recognition step and the backward recognition step to generate a recognition result corresponding to the input speech signal. The integrated determination step includes a likelihood information combining step of combining the forward likelihood information and the backward likelihood information generated in the forward and backward recognition steps, respectively, to generate combined likelihood information representing the likelihood that the input speech signal corresponds to a recognition candidate, and a recognition result determination step of evaluating the combined likelihood information generated in the likelihood information combining step to obtain the recognition result. The likelihood information combining step includes a weighted combining step of weighting and combining the forward likelihood information and the backward likelihood information, and the weighted combining step includes a recognition-candidate-adaptive weighted combining step of applying, to the forward likelihood information and/or the backward likelihood information, a candidate-dependent weight determined in advance by learning for each recognition candidate so that the variation in recognition probability among recognition candidates is reduced, and then combining them. This method achieves the effects described in connection with the invention of claim 1.
The invention according to claim 10 is the speech recognition method according to claim 9, wherein the integrated determination step includes a one-way recognition determination step of determining whether the standalone result of at least one of the forward recognition step and the backward recognition step indicates the presence of a recognition candidate having at least a predetermined likelihood, and a step of outputting that recognition candidate as the recognition result when the one-way recognition determination step determines that the standalone result indicates the presence of such a candidate.
The invention according to claim 11 is the speech recognition method according to claim 10, wherein the one-way recognition determination step includes a weighted determination step of applying a candidate-dependent weight to the standalone result and determining whether that result indicates the presence of a recognition candidate having at least the predetermined likelihood.

Needless to say, the speech recognition method can be modified in the same ways as the speech recognition apparatus.

Embodiments of the present invention are described in detail below with reference to the accompanying drawings.
FIG. 1 shows the overall configuration of a vehicle system according to one embodiment of the present invention. The vehicle system includes a vehicle body 1 of a two-wheeled vehicle as an example of a vehicle; a voice instruction device main body 10A attached to the vehicle body 1; an information processing unit 51, attached to the vehicle body 1, that serves as command processing means operating on commands received from the voice instruction device main body 10A; a microphone 5 provided in a helmet 3 worn by an occupant 2 (usually the driver, but possibly a passenger); and a speaker 6 likewise attached to the helmet 3. The microphone 5 serves as speech signal input means that inputs a speech signal to the voice instruction device main body 10A. The microphone 5 and the voice instruction device main body 10A constitute a voice instruction device 10. The voice instruction device 10, the information processing unit 51, and related components constitute an in-vehicle information device 50. The voice instruction device main body 10A and the information processing unit 51 may be integrated or may be separate devices.

The in-vehicle information device 50 is thus an information device that can be operated by voice instruction. Examples of such information devices include a navigation device (preferably one capable of voice guidance), a mobile telephone, and a sound playback device (for example, an MD player, a CD player, or another audio device).
The microphone 5 is provided at the mouth portion of the helmet 3, and the speaker 6 is provided at the ear portion of the helmet 3. The microphone 5 detects speech uttered by the occupant and inputs a speech signal, an electrical signal corresponding to that speech, to the voice instruction device main body 10A. The speaker 6 receives an audio signal from the information processing unit 51 and converts it into sound. The occupant can thus give spoken instructions to the in-vehicle information device 50 and listen to the sound information it generates. In this way, a hands-free and eyes-free interface is realized.

The voice instruction device main body 10A may be mounted on the helmet 3, but to keep the number of helmet accessories as small as possible, it is preferably configured as a portable device held on the occupant's clothing or the like, or as an in-vehicle device attached to the vehicle.
The microphone 5 and the voice instruction device main body 10A may be connected by wire using a cable included in a harness 7 or, when the voice instruction device main body 10A is configured as an in-vehicle device, by wireless communication. Likewise, the speaker 6 and the information processing unit 51 may be connected by wire using a cable included in the harness 7 or by wireless communication. Bluetooth, infrared communication, or another short-range wireless communication scheme can be used, for example.

FIG. 2 is a block diagram showing the electrical configuration of the in-vehicle information device 50. The voice instruction device main body 10A includes a speech recognition apparatus 11 that recognizes the speech signal input from the microphone 5, and a command data generation unit 12 (command data generation means) that converts the recognition result of the speech recognition apparatus 11 into command data. The command data generation unit 12 generates command data in a format that the in-vehicle information device 50 accepts and supplies it to the information processing unit 51 of the in-vehicle information device 50.
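The role of the command data generation unit can be sketched in Python as a lookup from a recognized word or phrase to a command record for the information processing unit. The phrases, the record fields, and the function name below are hypothetical; the source does not describe the command data format.

```python
from typing import Dict, Optional

# Hypothetical vocabulary and command records, for illustration only.
COMMAND_TABLE: Dict[str, Dict[str, str]] = {
    "set destination": {"device": "navigation", "action": "SET_DESTINATION"},
    "search route": {"device": "navigation", "action": "SEARCH_ROUTE"},
    "play": {"device": "audio", "action": "PLAY"},
}


def generate_command(recognized: Optional[str]) -> Optional[Dict[str, str]]:
    """Convert a recognition result into command data, or None if there is no match."""
    if recognized is None:
        return None
    return COMMAND_TABLE.get(recognized.strip().lower())


if __name__ == "__main__":
    print(generate_command("Search Route"))
```

Returning `None` for an unknown phrase leaves it to the caller to decide whether to ignore the utterance or prompt the occupant again, a choice the text does not address.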

The speech recognition apparatus 11 includes a pair of recognition processing units that recognize input speech according to, for example, an HMM (hidden Markov model) word recognition algorithm: a forward recognition processing unit 20 (forward recognition means) and a backward recognition processing unit 30 (backward recognition means). The speech recognition apparatus 11 further includes a reverse playback unit 15 (reverse playback means) that plays back the speech signal inverted on the time axis, and an integrated determination unit 40 (integrated determination means) that integrates the outputs of the forward and backward recognition processing units 20 and 30 and outputs the recognition result. The speech signal to be recognized is input to the forward recognition processing unit 20 as a forward waveform following the time axis, while the time-reversed waveform produced by the reverse playback unit 15 is input to the backward recognition processing unit 30.
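For readers unfamiliar with how an HMM assigns a likelihood to an observation sequence, the following Python sketch computes it for a small discrete HMM. Note the name clash: the standard procedure used here is called the "forward algorithm", which is unrelated to the forward and backward recognition directions of this document. The discrete observation symbols, the function name, and the toy parameters are assumptions; a recognizer of the kind described would more likely work with continuous feature vectors and one model per word.

```python
from typing import List, Sequence


def hmm_likelihood(obs: Sequence[int],
                   start: List[float],
                   trans: List[List[float]],
                   emit: List[List[float]]) -> float:
    """Probability of an observation sequence under a discrete HMM.

    start[s]    : probability of starting in state s
    trans[p][s] : probability of moving from state p to state s
    emit[s][o]  : probability that state s emits symbol o
    """
    if not obs:
        raise ValueError("observation sequence must not be empty")
    n_states = len(start)
    # alpha[s]: probability of the observations so far, ending in state s.
    alpha = [start[s] * emit[s][obs[0]] for s in range(n_states)]
    for o in obs[1:]:
        alpha = [
            sum(alpha[p] * trans[p][s] for p in range(n_states)) * emit[s][o]
            for s in range(n_states)
        ]
    return sum(alpha)


if __name__ == "__main__":
    # Toy left-to-right model: state 0 emits symbol 0, state 1 emits symbol 1.
    start = [1.0, 0.0]
    trans = [[0.5, 0.5], [0.0, 1.0]]
    emit = [[1.0, 0.0], [0.0, 1.0]]
    print(hmm_likelihood([0, 1], start, trans, emit))
```

In a word recognizer built along these lines, one such likelihood would be computed per candidate word model in each direction, and the per-candidate values would then be handed to the integrated determination stage.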

順方向認識処理部２０は、音声分析部２１（音声分析手段）と、音響ライブラリ２２（認識辞書）と、照合部２３（照合手段）とを備えている。同様に、逆方向認識処理部３０は、音声分析部３１（音声分析手段）と、音響ライブラリ３２（認識辞書）と、照合部３３（照合手段）とを備えている。
音声分析部２１は、マイクロフォン５から入力される音声信号（順方向波形）を分析してその音響的特徴を表す特徴パラメータ系列（たとえば、ＭＦＣＣ（メル周波数ケプストラム係数））を抽出する。同様に、音声分析部３１は、逆再生部１５から入力される音声信号（逆方向波形）を分析してその音響的特徴を表す特徴パラメータ系列（たとえばＭＦＣＣ）を抽出する。 The forward direction recognition processing unit 20 includes a speech analysis unit 21 (speech analysis means), an acoustic library 22 (recognition dictionary), and a collation unit 23 (collation means). Similarly, the backward direction recognition processing unit 30 includes a speech analysis unit 31 (speech analysis means), an acoustic library 32 (recognition dictionary), and a collation unit 33 (collation means).
The speech analysis unit 21 analyzes the speech signal (forward waveform) input from the microphone 5 and extracts a feature parameter sequence (for example, MFCC (mel-frequency cepstral coefficients)) representing its acoustic features. Similarly, the speech analysis unit 31 analyzes the speech signal (backward waveform) input from the reverse playback unit 15 and extracts a feature parameter sequence (for example, MFCC) representing its acoustic features.

音響ライブラリ２２は、音響モデル２４と、単語辞書（言語モデル）２５とを備えている。同様に、音響ライブラリ３２は、音響モデル３４と、単語辞書（言語モデル）３５とを備えている。音響モデル２４，３４は、音声の所定単位（たとえば単語）ごとに標準音声パターンの音響的特徴をモデル化したものであり、入力音声パターンとの音響的な類似性の評価を行うための参照情報である。また、単語辞書２５，３５は、音響モデルの接続に関する制約を与えるための情報である。このような情報の典型は、或る単語（音素）に引き続いて別の単語（音素）が出現する確率である。 The acoustic library 22 includes an acoustic model 24 and a word dictionary (language model) 25. Similarly, the acoustic library 32 includes an acoustic model 34 and a word dictionary (language model) 35. The acoustic models 24 and 34 model the acoustic features of standard speech patterns for each predetermined unit of speech (for example, each word), and serve as reference information for evaluating acoustic similarity to the input speech pattern. The word dictionaries 25 and 35 are information that constrains how the acoustic models may be connected. A typical example of such information is the probability that a given word (phoneme) is followed by another word (phoneme).

音響モデル２４，３４および単語辞書２５，３５は、事前に収録した学習データを用いた学習によって作成される。学習データは、たとえば、多数の話者の様々な状況における発話を収録した音声データである。ただし、順方向認識処理部２０の音響モデル２４および単語辞書２５は、順方向波形の音声データ（たとえば、「yamaha」、「hoNda」、「suzuki」など）を学習データとして用いた学習によって作成されるのに対して、逆方向認識処理部３０の音響モデル３４および単語辞書３５は、逆方向波形の音声データ（すなわち、時間軸上で反転して再生した音声データ。たとえば、前記の例を反転した「ahamay」、「adNoh」、「ikuzus」など）を学習データとして用いた学習によって作成される。 The acoustic models 24 and 34 and the word dictionaries 25 and 35 are created by learning from learning data recorded in advance. The learning data is, for example, speech data recording utterances by many speakers in various situations. The acoustic model 24 and the word dictionary 25 of the forward direction recognition processing unit 20 are created by learning that uses forward-waveform speech data (for example, “yamaha”, “hoNda”, “suzuki”) as learning data, whereas the acoustic model 34 and the word dictionary 35 of the backward direction recognition processing unit 30 are created by learning that uses backward-waveform speech data (that is, speech data played back reversed on the time axis; for example, “ahamay”, “adNoh”, “ikuzus”, the reversals of the above examples) as learning data.
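The reversed transcriptions quoted above are simply the forward phoneme strings read from the end. The short Python sketch below reproduces that notation. It only illustrates the labeling: the learning data itself consists of reversed waveforms, not reversed strings, and the helper name `reversed_label` is hypothetical.

```python
def reversed_label(label: str) -> str:
    """Return the phoneme string read backwards (notation only, not audio)."""
    return label[::-1]


forward_labels = ["yamaha", "hoNda", "suzuki"]
backward_labels = [reversed_label(w) for w in forward_labels]
print(backward_labels)
```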

照合部２３，３３は、音声分析部２１，３１によって抽出された特徴パラメータ系列と音響ライブラリ２２，３２の認識候補とを照合して、ＨＭＭアルゴリズムによって、認識候補の尤もらしさを表す尤度を生成する。
統合判定部４０は、順方向認識処理部２０および逆方向認識処理部３０の出力を総合的に判定して、認識結果を出力し、コマンドデータ生成部１２に引き渡す。 The collation units 23 and 33 collate the feature parameter sequences extracted by the speech analysis units 21 and 31 with the recognition candidates in the acoustic libraries 22 and 32, and use the HMM algorithm to generate a likelihood representing how plausible each recognition candidate is.
The integrated determination unit 40 makes an overall determination from the outputs of the forward direction recognition processing unit 20 and the backward direction recognition processing unit 30, outputs a recognition result, and passes it to the command data generation unit 12.

図３は、統合判定部４０における第１の処理例を説明するためのフローチャートである。
順方向認識処理部２０は、単語辞書２５に登録されているｉ番目の単語（認識候補）に対して順方向尤度ＬＦｉを生成する。同様に、逆方向認識処理部３０は、単語辞書３５に登録されているｉ番目の単語（認識候補）に対して逆方向尤度ＬＲｉを生成する。ただし、単語辞書２５，３５には、共通の複数の単語が同じ順序で登録されているものとする。 FIG. 3 is a flowchart for explaining a first processing example in the integrated determination unit 40.
The forward direction recognition processing unit 20 generates a forward likelihood LFi for the i-th word (recognition candidate) registered in the word dictionary 25. Similarly, the reverse direction recognition processing unit 30 generates a reverse likelihood LRi for the i-th word (recognition candidate) registered in the word dictionary 35. However, it is assumed that a plurality of common words are registered in the word dictionaries 25 and 35 in the same order.

この場合に、統合判定部４０は、順方向および逆方向尤度ＬＦｉ，ＬＲｉを線形結合した結合尤度Ｌｉを次式(1)に従って演算する（ステップＳ１：尤度情報結合手段。重み付け結合手段）。
Ｌi＝αＬＦi＋βＬＲi …… (1)
ただし、重み付け係数α，βは実数（定数）であり、α≠０，β≠０である。これらの重み付け係数α，βを適切に定めることによって、順方向認識結果および逆方向認識結果を適切に結合でき、妥当な認識結果を得ることができる。重み付け係数α，βは、学習によって予め定めることとしてもよい。 In this case, the integrated determination unit 40 calculates a combined likelihood Li, a linear combination of the forward and backward likelihoods LFi and LRi, according to the following expression (1) (step S1: likelihood information combining means, weighted combining means).
Li = α · LFi + β · LRi …… (1)
Here, the weighting coefficients α and β are real numbers (constants) with α ≠ 0 and β ≠ 0. By choosing these weighting coefficients appropriately, the forward and backward recognition results can be combined appropriately and a reasonable recognition result can be obtained. The weighting coefficients α and β may be determined in advance by learning.

また、重み付け係数α，βは、単語毎に予め定めた値αｉ，βｉとしてもよい。この場合の重み付け係数αｉ，βｉは、単語辞書２５，３５に、認識候補の単語とともに格納しておけばよい。この場合には、ステップＳ１では、次式（１ａ）による演算に従って結合尤度Ｌｉが求められる（認識候補適応重み付け結合手段）。
Ｌi＝αｉ・ＬＦi＋βｉ・ＬＲi …… (1a)
次に、統合判定部４０は、次式(2)を満たすｋを見いだす（ステップＳ２）。 Further, the weighting coefficients α and β may be values αi and βi determined in advance for each word. The weighting coefficients αi and βi in this case may be stored in the word dictionaries 25 and 35 together with the recognition candidate words. In this case, in step S1, the joint likelihood Li is obtained according to the calculation by the following equation (1a) (recognition candidate adaptive weighting combining means).
Li = αi · LFi + βi · LRi (1a)
Next, the integrated determination unit 40 finds k that satisfies the following expression (2) (step S2).

ｋ＝arg_imax(Ｌi) …… (2)
すなわち、ｋ番目の単語は、結合尤度Ｌｉが最大となる単語である。
そして、さらに、統合判定部４０は、最大の結合尤度Ｌｋが、次式(3)の条件を満たすかどうかを判断する（ステップＳ３）。
Ｌk ＞ Θ …… (3)
閾値Θは実数であり、定数であってもよいし、単語に応じて異なる値（Θｋ）としてもよい。 k = argmax_i (Li) …… (2)
That is, the k-th word is the word with the maximum combined likelihood Li.
The integrated determination unit 40 then determines whether the maximum combined likelihood Lk satisfies the condition of the following expression (3) (step S3).
Lk > Θ …… (3)
The threshold Θ is a real number; it may be a constant, or it may take a different value (Θk) for each word.

前記式(3)の条件が満たされれば、すなわち、結合尤度Ｌｋが所定の閾値Θを超えているならば、統合判定部４０は、ｋ番目の単語を認識結果とし、この単語を表すデータを出力する（ステップＳ４）。さもなければ、認識結果を棄却し、いずれの単語を表すデータをも出力しない（ステップＳ５）。
このようにして、この例では、順方向認識処理および逆方向認識処理によって得られる順方向および逆方向尤度ＬＦｉ，ＬＲｉを重み付け係数α，βで重み付けして線形結合することによって結合尤度Ｌｉが求められる。そして、この結合尤度Ｌｉに基づいて、認識結果が求められる。こうして、順方向および逆方向の認識結果を結合して用いることによって、認識しやすさに関するばらつきを抑制できる。 If the condition of expression (3) is satisfied, that is, if the combined likelihood Lk exceeds the predetermined threshold Θ, the integrated determination unit 40 takes the k-th word as the recognition result and outputs data representing this word (step S4). Otherwise, it rejects the recognition result and outputs no data representing any word (step S5).
In this way, in this example, the combined likelihood Li is obtained by weighting the forward and backward likelihoods LFi and LRi, obtained by the forward and backward recognition processes, with the weighting coefficients α and β and combining them linearly. The recognition result is then obtained from this combined likelihood Li. By combining the forward and backward recognition results in this way, variation in ease of recognition can be suppressed.
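Steps S1 to S5 can be sketched in a few lines of Python. This is a minimal illustration rather than the patented implementation: the function name, the argument names, and the score values are assumptions, and the likelihoods are taken to be plain lists indexed in the shared dictionary order.

```python
from typing import Optional, Sequence


def decide_by_combined_likelihood(
    lf: Sequence[float],
    lr: Sequence[float],
    alpha: float,
    beta: float,
    big_theta: float,
) -> Optional[int]:
    """Return the index k of the recognized word, or None if rejected."""
    if not lf or len(lf) != len(lr):
        raise ValueError("lf and lr must be non-empty and of equal length")
    # Step S1, expression (1): Li = alpha * LFi + beta * LRi
    combined = [alpha * f + beta * r for f, r in zip(lf, lr)]
    # Step S2, expression (2): k = argmax_i Li
    k = max(range(len(combined)), key=combined.__getitem__)
    # Steps S3 to S5, expression (3): accept only if Lk > Theta
    return k if combined[k] > big_theta else None


# Hypothetical log-likelihood-style scores for three dictionary entries.
lf = [-12.0, -30.0, -25.0]
lr = [-14.0, -28.0, -11.0]
accepted = decide_by_combined_likelihood(lf, lr, 0.5, 0.5, big_theta=-15.0)
rejected = decide_by_combined_likelihood(lf, lr, 0.5, 0.5, big_theta=-10.0)
```

With these assumed values the combined scores are -13, -29 and -18, so the first entry is accepted against a threshold of -15 and rejected against a threshold of -10.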

図４は、統合判定部４０における第２の処理例を説明するためのフローチャートである。この図４において、図３に示された各ステップと同様の処理が行われるステップには、図３の場合と同一の参照符号を付して示す。この例では、統合判定部４０は、一対の認識処理部２０，３０のうちの一方である順方向認識処理部２０の出力を優先的に用いた判定を行う。すなわち、統合判定部４０は、まず、順方向認識処理部２０が生成する順方向尤度ＬＦｉを用いて、次式(4)を満たすｊを見いだす（ステップＳ１０）。 FIG. 4 is a flowchart for explaining a second processing example in the integrated determination unit 40. In FIG. 4, steps in which processing similar to the steps shown in FIG. 3 is performed are denoted by the same reference numerals as in FIG. 3. In this example, the integrated determination unit 40 performs determination using the output of the forward direction recognition processing unit 20 that is one of the pair of recognition processing units 20 and 30 with priority. That is, first, the integrated determination unit 40 uses the forward likelihood LFi generated by the forward direction recognition processing unit 20 to find j that satisfies the following expression (4) (step S10).

ｊ＝arg_imax(ＬＦi) …… (4)
すなわち、ｊ番目の単語は、順方向尤度ＬＦｉが最大となる単語である。
そして、さらに、統合判定部４０は、最大の順方向尤度ＬＦｊが、次式(5)の条件を満たすかどうかを判断する（ステップＳ１１。一方向認識判定手段）。
LＦｊ＞ θ …… (5)
閾値θは実数であり、定数であってもよいし、単語に応じて異なる値（θｊ）としてもよい。 j = argmax_i (LFi) …… (4)
That is, the j-th word is the word with the maximum forward likelihood LFi.
The integrated determination unit 40 then determines whether the maximum forward likelihood LFj satisfies the condition of the following expression (5) (step S11: one-way recognition determination means).
LFj > θ …… (5)
The threshold value θ is a real number and may be a constant or a different value (θj) depending on the word.

前記式(5)の条件が満たされれば、すなわち、順方向尤度ＬＦｊが所定の閾値θを超えているならば、統合判定部４０は、ｊ番目の単語を認識結果とする（ステップＳ１２）。さもなければ、前述の図３を参照して説明した処理（ステップＳ１〜Ｓ５）を実行し、結合尤度Ｌｉに基づく認識処理を実行する。
このようにして、順方向認識処理部２０の単独処理結果のみに基づいて判定が可能な場合には、結合尤度Ｌｉを用いた処理を省くことができるので、認識処理を高速化でき、音声認識装置１１を構成する処理装置の演算負荷を軽減できる。 If the condition of expression (5) is satisfied, that is, if the forward likelihood LFj exceeds the predetermined threshold θ, the integrated determination unit 40 takes the j-th word as the recognition result (step S12). Otherwise, the process described with reference to FIG. 3 (steps S1 to S5) is executed, and recognition based on the combined likelihood Li is performed.
In this way, when a determination can be made from the result of the forward direction recognition processing unit 20 alone, the processing that uses the combined likelihood Li can be omitted, so recognition can be sped up and the computational load on the processing device constituting the speech recognition device 11 can be reduced.
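The forward-priority flow of FIG. 4 can be sketched as follows. This is an illustrative Python sketch under assumptions: the function names, the returned route labels ("forward", "combined", "rejected"), and the numeric scores are hypothetical and are not taken from the patent.

```python
from typing import Optional, Sequence, Tuple


def argmax(values: Sequence[float]) -> int:
    """Index of the largest value (first one on ties)."""
    return max(range(len(values)), key=values.__getitem__)


def decide_forward_first(
    lf: Sequence[float],
    lr: Sequence[float],
    alpha: float,
    beta: float,
    theta: float,
    big_theta: float,
) -> Tuple[Optional[int], str]:
    """Return (word index or None, route taken)."""
    j = argmax(lf)                      # step S10, expression (4)
    if lf[j] > theta:                   # step S11, expression (5)
        return j, "forward"             # step S12
    # Fall back to steps S1 to S5 on the combined likelihood.
    combined = [alpha * f + beta * r for f, r in zip(lf, lr)]
    k = argmax(combined)
    if combined[k] > big_theta:
        return k, "combined"
    return None, "rejected"


lr = [-14.0, -28.0, -11.0]
confident = decide_forward_first([-5.0, -30.0, -25.0], lr, 0.5, 0.5, -8.0, -15.0)
fallback = decide_forward_first([-12.0, -30.0, -25.0], lr, 0.5, 0.5, -8.0, -15.0)
```

In the first call the forward score alone clears θ, so the combined likelihood is never computed. In the second call it does not, and the decision falls through to the combined score.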

むろん、逆方向認識処理部３０の単独処理結果を優先的に用いて同様の処理を行うことも可能である。
図５は、統合判定部４０における第３の処理例を説明するためのフローチャートである。この図５において、図４に示された各ステップと同様の処理が行われるステップには、図４の場合と同一の参照符号を付して示す。 Of course, it is also possible to preferentially use the single processing result of the backward direction recognition processing unit 30 and perform the same processing.
FIG. 5 is a flowchart for explaining a third processing example in the integrated determination unit 40. In FIG. 5, steps in which processing similar to the steps shown in FIG. 4 is performed are denoted by the same reference numerals as in FIG. 4.

この例では、ステップＳ１１において最大順方向尤度ＬＦｊが閾値θ（またはθｉ）を超えていると判断されたときに、さらに、次式(6)による判断が行われる（ステップＳ１５。一方向認識判定手段）。
LFi ≦θ（またはθｉ） for ∀i≠ｊ …… (6)
すなわち、ｊ以外の任意のｉに対して、順方向尤度ＬＦｉが閾値θ（またはθｉ）以下であることを条件に、ｊ番目の単語を認識結果とする（ステップＳ１２）。つまり、閾値θ（またはθｉ）を超える順方向尤度ＬＦｉを有する単語がただ一つに定まるときに、結合尤度Ｌｉによる判定を行うことなく、ｊ番目の単語が認識単語として出力される。 In this example, when it is determined in step S11 that the maximum forward likelihood LFj exceeds the threshold θ (or θi), a further determination according to the following expression (6) is made (step S15: one-way recognition determination means).
LFi ≦ θ (or θi) for ∀i ≠ j (6)
That is, for any i other than j, the j-th word is set as a recognition result on the condition that the forward likelihood LFi is equal to or less than the threshold θ (or θi) (step S12). That is, when only one word having a forward likelihood LFi exceeding the threshold θ (or θi) is determined, the j-th word is output as a recognized word without performing the determination based on the joint likelihood Li.

前記式(6)の条件が満たされない場合（ステップＳ１５：ＮＯ）、すなわち、閾値θ（またはθｉ）を超える順方向尤度を持つ単語が二つ以上存在する場合には、前述の図３を参照して説明した処理（ステップＳ１〜Ｓ５）を実行し、結合尤度Ｌｉに基づく認識処理を実行する。
ただし、この場合、結合尤度Ｌｉに基づく認識処理（ステップＳ１〜Ｓ５）は、ＬＦｉ＞θ（またはθｉ）を満たすｉの範囲で行うことが好ましい。これにより、結合尤度Ｌｉに基づく判定処理を簡単にすることができるから、処理速度を高めることができるとともに、音声認識装置１１を構成する処理装置の演算負荷を一層軽減できる。 When the condition of expression (6) is not satisfied (step S15: NO), that is, when two or more words have a forward likelihood exceeding the threshold θ (or θi), the process described with reference to FIG. 3 (steps S1 to S5) is executed, and recognition based on the combined likelihood Li is performed.
However, in this case, it is preferable that the recognition process (steps S1 to S5) based on the joint likelihood Li is performed in a range of i that satisfies LFi> θ (or θi). Thereby, since the determination process based on the joint likelihood Li can be simplified, the processing speed can be increased, and the calculation load of the processing apparatus constituting the speech recognition apparatus 11 can be further reduced.

このようにして、この処理例でも、順方向認識処理部２０の単独処理結果のみに基づいて判定が可能な場合には、結合尤度Ｌｉを用いた処理を省くことができる。それに加えて、順方向尤度ＬＦｉに基づいてただ一つの単語を特定できる場合にのみ、順方向認識処理部２０の単独処理結果のみに基づく単語認識を許容しているので、認識結果の確実性を高めることができる。 Thus, even in this processing example, when the determination can be made based only on the single processing result of the forward direction recognition processing unit 20, the processing using the joint likelihood Li can be omitted. In addition, only when a single word can be identified based on the forward likelihood LFi, word recognition based on only the single processing result of the forward recognition processing unit 20 is allowed, so the certainty of the recognition result Can be increased.
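The uniqueness test of FIG. 5, together with the preferred restriction of the fallback to candidates with LFi > θ, might look like this in Python. It is a sketch under assumptions: a single shared threshold θ is used instead of per-word θi, the function name is hypothetical, and the scores are invented for illustration.

```python
from typing import Optional, Sequence


def decide_unique_forward(
    lf: Sequence[float],
    lr: Sequence[float],
    alpha: float,
    beta: float,
    theta: float,
    big_theta: float,
) -> Optional[int]:
    """Accept the forward result only if exactly one word exceeds theta."""
    if not lf or len(lf) != len(lr):
        raise ValueError("lf and lr must be non-empty and of equal length")
    above = [i for i, f in enumerate(lf) if f > theta]
    if len(above) == 1:
        # Steps S11 and S15 both pass: the single word above theta is word j.
        return above[0]                                  # step S12
    # Step S15 failed: restrict steps S1 to S5 to the words above theta.
    # Step S11 failed (no word above theta): use every word, as in FIG. 4.
    candidates = above if above else list(range(len(lf)))
    combined = {i: alpha * lf[i] + beta * lr[i] for i in candidates}
    k = max(combined, key=combined.__getitem__)
    return k if combined[k] > big_theta else None


unique = decide_unique_forward([-5.0, -30.0, -25.0], [-14.0, -28.0, -11.0], 0.5, 0.5, -8.0, -15.0)
# Two words exceed theta, so only indices 0 and 1 compete on combined score.
restricted = decide_unique_forward([-5.0, -6.0, -20.0], [-20.0, -40.0, -1.0], 0.5, 0.5, -8.0, -15.0)
```

In the second call, index 2 would have the highest combined score (-10.5) if all words were compared, but it is excluded because its forward likelihood does not exceed θ, so index 0 (combined score -12.5) is returned.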

むろん、図４の処理例の場合と同じく、逆方向認識処理部３０の単独処理結果を優先的に用いて同様の処理を行うことも可能である。
なお、図５において二点鎖線で示すように、ステップＳ１１において、順方向最大尤度ＬＦｊが閾値θ（またはθｊ）を超えていない場合には、結合尤度Ｌｉに基づく判定処理を行うことなく、認識結果を棄却（ステップＳ５）することとしてもよい。 Of course, as in the case of the processing example of FIG. 4, the same processing can be performed using the single processing result of the backward direction recognition processing unit 30 preferentially.
As indicated by a two-dot chain line in FIG. 5, in step S11, when the forward maximum likelihood LFj does not exceed the threshold θ (or θj), the determination process based on the combined likelihood Li is not performed. The recognition result may be rejected (step S5).

図６は、統合判定部４０における第４の処理例を説明するためのフローチャートである。この図６において、図５に示された各ステップと同様の処理が行われるステップには、図５の場合と同一の参照符号を付して示す。
この例では、図５のステップＳ１５の判定に代えて、次式(7)による判断が行われる（ステップＳ１６。一方向認識判定手段）。 FIG. 6 is a flowchart for explaining a fourth processing example in the integrated determination unit 40. In FIG. 6, steps in which processing similar to the steps shown in FIG. 5 is performed are denoted by the same reference numerals as in FIG. 5.
In this example, instead of the determination in step S15 in FIG. 5, determination by the following equation (7) is performed (step S16, one-way recognition determination means).

ＬＦｊ−ＬＦi ＞Δ for ∀i≠ｊ …… (7)
ただし、閾値Δは実数（ここではΔ＞０）であり、予め定める定数である。閾値Δとして、単語ｉごとの閾値Δｉを用いるようにしてもよい。
前記式(7)は、最大順方向尤度ＬＦｊと他の任意の順方向尤度ＬＦｉとの差が閾値Δ（またはΔｉ）を超えるという条件を表している。換言すれば、最大順方向尤度ＬＦｊと、これに次いで大きな順方向尤度との差が閾値Δ（または閾値Δｉ）を超えている、すなわち、最大順方向尤度ＬＦｊが他と比較して突出しているという条件である。 LFj−LFi> Δ for ∀i ≠ j (7)
However, the threshold Δ is a real number (here, Δ> 0) and is a predetermined constant. A threshold value Δi for each word i may be used as the threshold value Δ.
Expression (7) states the condition that the difference between the maximum forward likelihood LFj and every other forward likelihood LFi exceeds the threshold Δ (or Δi). In other words, the difference between the maximum forward likelihood LFj and the next largest forward likelihood exceeds the threshold Δ (or Δi); that is, the maximum forward likelihood LFj stands out from the others.

この条件が満たされている場合には（ステップＳ１６のＹＥＳ）、結合尤度Ｌｉによる判定を行うことなく、ｊ番目の単語が認識単語として出力される（ステップＳ１２）。
前記式(7)の条件が満たされない場合（ステップＳ１６：ＮＯ）、すなわち、閾値θ（またはθｉ）を超える順方向尤度ＬＦｉを持つ単語が二つ以上存在する場合には、前述の図３を参照して説明した処理（ステップＳ１〜Ｓ５）を実行し、結合尤度Ｌｉに基づく認識処理を実行する。 If this condition is satisfied (YES in step S16), the j-th word is output as the recognized word without a determination based on the combined likelihood Li (step S12).
When the condition of expression (7) is not satisfied (step S16: NO), that is, when two or more words have a forward likelihood LFi exceeding the threshold θ (or θi), the process described with reference to FIG. 3 (steps S1 to S5) is executed, and recognition based on the combined likelihood Li is performed.

ただし、この場合、結合尤度Ｌｉに基づく認識処理（ステップＳ１〜Ｓ５）は、ＬＦｉ＞θ（またはθｉ）を満たすｉの範囲で行うことが好ましい。これにより、結合尤度Ｌｉに基づく判定処理を簡単にすることができるから、処理速度を高めることができるとともに、音声認識装置１１を構成する処理装置の演算負荷を一層軽減できる。
このようにして、この処理例でも、順方向認識処理部２０の単独処理結果のみに基づいて判定が可能な場合には、結合尤度Ｌｉを用いた処理を省くことができる。それに加えて、最大順方向尤度ＬＦｊが他と比較して突出していると認められる場合にのみ、順方向認識処理部２０の単独処理結果のみに基づく単語認識を許容しているので、認識結果の確実性を高めることができる。 However, in this case, it is preferable that the recognition process (steps S1 to S5) based on the joint likelihood Li is performed in a range of i that satisfies LFi> θ (or θi). Thereby, since the determination process based on the joint likelihood Li can be simplified, the processing speed can be increased, and the calculation load of the processing apparatus constituting the speech recognition apparatus 11 can be further reduced.
Thus, even in this processing example, when the determination can be made based only on the single processing result of the forward direction recognition processing unit 20, the processing using the joint likelihood Li can be omitted. In addition, the word recognition based on only the single processing result of the forward recognition processing unit 20 is allowed only when it is recognized that the maximum forward likelihood LFj is prominent as compared with others. The certainty can be increased.
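The margin test of expression (7) can be sketched as below. The Python is illustrative only: it assumes a single constant Δ and θ, the function name is hypothetical, and a return value of `None` stands for "fall back to the combined-likelihood steps S1 to S5".

```python
from typing import Optional, Sequence


def forward_margin_shortcut(
    lf: Sequence[float], theta: float, delta: float
) -> Optional[int]:
    """Return j if the forward result alone is decisive, else None."""
    j = max(range(len(lf)), key=lf.__getitem__)      # step S10, expression (4)
    if lf[j] <= theta:                               # step S11 fails
        return None
    # Step S16, expression (7): LFj - LFi > delta for every i != j.
    prominent = all(lf[j] - f > delta for i, f in enumerate(lf) if i != j)
    return j if prominent else None                  # step S12 on success


clear_winner = forward_margin_shortcut([-5.0, -30.0, -25.0], theta=-8.0, delta=10.0)
close_race = forward_margin_shortcut([-5.0, -7.0, -25.0], theta=-8.0, delta=10.0)
```

Checking every other word against Δ is equivalent to checking only the runner-up, since the runner-up has the smallest gap to the maximum.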

むろん、図４の処理例の場合と同じく、逆方向認識処理部３０の単独処理結果を優先的に用いて同様の処理を行うことも可能である。
なお、図５の処理例の場合と同じく、図６において二点鎖線で示すように、ステップＳ１１において、順方向最大尤度ＬＦｊが閾値θ（またはθｊ）を超えていない場合には、結合尤度Ｌｉに基づく判定処理を行うことなく、認識結果を棄却（ステップＳ５）することとしてもよい。 Of course, as in the case of the processing example of FIG. 4, the same processing can be performed using the single processing result of the backward direction recognition processing unit 30 preferentially.
As in the case of the processing example of FIG. 5, as shown by a two-dot chain line in FIG. 6, when the forward maximum likelihood LFj does not exceed the threshold θ (or θj) in step S11, the combined likelihood The recognition result may be rejected (step S5) without performing the determination process based on the degree Li.

図７は、統合判定部４０における第５の処理例を説明するためのフローチャートである。この図７において、図４に示された各ステップと同様の処理が行われるステップには、図４の場合と同一の参照符号を付して示す。
この例では、順方向認識処理部２０による単独処理結果に基づく判定を優先的に行うに当たって、単語ごとに重み付けγｉを付与した重み付き順方向尤度γｉ・ＬＦｉが用いられる。すなわち、重み付き順方向尤度γｉ・ＬＦｉが最大となる単語（ｊ番目の単語）が求められる（ステップＳ２０）。すなわち、
ｊ＝arg_imax(γi・ＬＦi) …… (8)
である。 FIG. 7 is a flowchart for explaining a fifth processing example in the integrated determination unit 40. In FIG. 7, steps in which the same processing as in FIG. 4 is performed are denoted by the same reference numerals as in FIG. 4.
In this example, when the determination based on the result of the forward direction recognition processing unit 20 alone is given priority, a weighted forward likelihood γi · LFi, in which a weight γi is assigned to each word, is used. That is, the word (the j-th word) with the maximum weighted forward likelihood γi · LFi is found (step S20):
j = argmax_i (γi · LFi) …… (8)

そして、最大の重み付き順方向尤度γｊ・ＬＦｊが次式(9)を満たすかどうかが判断される（ステップＳ２１。一方向認識判定手段。重み付け判定手段）。
γｊ・LＦｊ＞ θ …… (9)
閾値θは実数であり、定数である。閾値θを単語に応じて異なる値θｊとしてもよいが、尤度ＬＦｉに重み付けを行っているので、閾値θに対してまで重み付けする実益はない。なお、重み付け係数γｉは、たとえば、学習によって予め求め、単語辞書２５に個々の単語毎に格納しておけばよい。 Then, it is determined whether or not the maximum weighted forward likelihood γj · LFj satisfies the following expression (9) (step S21: one-way recognition determination means; weight determination means).
γj · LFj > θ …… (9)
The threshold value θ is a real number and is a constant. The threshold value θ may be a different value θj depending on the word, but since the likelihood LFi is weighted, there is no practical benefit of weighting the threshold value θ. For example, the weighting coefficient γi may be obtained in advance by learning and stored in the word dictionary 25 for each individual word.

前記式(9)の条件が満たされれば、すなわち、重み付け順方向尤度γｊ・ＬＦｊが所定の閾値θを超えているならば、統合判定部４０は、ｊ番目の単語を認識結果とする（ステップＳ１２）。さもなければ、前述の図３を参照して説明した処理（ステップＳ１〜Ｓ５）を実行し、結合尤度Ｌｉに基づく認識処理を実行する。
このように、図４のステップＳ１０，１１の処理を前述のステップＳ２０，２１の処理に置き換えることにより、単語毎の重みγｉを付与した重み付き順方向尤度γｉ・ＬＦｉによる認識を行うことができる。 If the condition of Equation (9) is satisfied, that is, if the weighted forward likelihood γj · LFj exceeds a predetermined threshold θ, the integrated determination unit 40 uses the jth word as a recognition result ( Step S12). Otherwise, the process (steps S1 to S5) described with reference to FIG. 3 is executed, and the recognition process based on the joint likelihood Li is executed.
In this way, by replacing the processes in steps S10 and S11 in FIG. 4 with the processes in steps S20 and S21 described above, the recognition by the weighted forward likelihood γi · LFi with the weight γi for each word can be performed. it can.
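Steps S20 and S21 differ from steps S10 and S11 only in that each forward likelihood is scaled by its per-word weight before the comparison. A minimal Python sketch follows; the function name and the numbers are assumptions, and positive scores are used so that a larger weight raises the weighted score.

```python
from typing import Optional, Sequence


def weighted_forward_shortcut(
    lf: Sequence[float], gamma: Sequence[float], theta: float
) -> Optional[int]:
    """Return j if gamma_j * LFj exceeds theta, else None (fall back to S1-S5)."""
    if not lf or len(lf) != len(gamma):
        raise ValueError("lf and gamma must be non-empty and of equal length")
    weighted = [g * f for g, f in zip(gamma, lf)]
    j = max(range(len(weighted)), key=weighted.__getitem__)  # step S20, expression (8)
    return j if weighted[j] > theta else None                # step S21, expression (9)


lf = [0.5, 0.625, 0.25]
# A larger weight on word 0 lets it overtake word 1 and clear the threshold.
boosted = weighted_forward_shortcut(lf, gamma=[1.5, 1.0, 1.0], theta=0.7)
unweighted = weighted_forward_shortcut(lf, gamma=[1.0, 1.0, 1.0], theta=0.7)
```

With the assumed weights, word 0 scores 0.75 and is accepted. With uniform weights the best score is 0.625, which does not exceed 0.7, so the decision would fall back to the combined likelihood.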

同様の変形は、図５および図６に示された処理に対しても適用することができる。
図８は、この発明の第２の実施形態に係る音声指示装置の構成を説明するためのブロック図である。この図８において、前述の図２に示された各部に対応する部分には、図２の場合と同一の参照符号を付して示す。
前述の第１の実施形態では、入力音声データを時間軸上で反転して再生出力する逆再生部１５が設けられていて、この逆再生部１５によって逆再生された逆方向音声データに対して音声分析処理が行われるようになっている。 Similar modifications can be applied to the processes shown in FIGS. 5 and 6.
FIG. 8 is a block diagram for explaining the configuration of a voice instruction device according to the second embodiment of the present invention. In FIG. 8, parts corresponding to those shown in FIG. 2 are given the same reference numerals as in FIG. 2.
In the first embodiment described above, the reverse playback unit 15 is provided to reverse the input speech data on the time axis and play it back, and speech analysis is performed on the backward-direction speech data played back in reverse by the reverse playback unit 15.

これに対して、この実施形態では、音声分析処理は、順方向音声データに対してのみ行うこととし、音声分析結果を時間軸上で反転して逆方向認識処理部３０の照合部３３に引き渡すようにしている。
より具体的には、マイクロフォン５からの音声信号は、順方向認識処理部２０の音声分析部２１および逆方向認識処理部３０の音声分析部３１に与えられる。順方向認識処理部２０の構成は、前述の第１の実施形態の場合と同様である。 In contrast, in this embodiment, speech analysis is performed only on the forward-direction speech data, and the speech analysis result is reversed on the time axis and passed to the collation unit 33 of the backward direction recognition processing unit 30.
More specifically, the audio signal from the microphone 5 is given to the audio analysis unit 21 of the forward direction recognition processing unit 20 and the audio analysis unit 31 of the reverse direction recognition processing unit 30. The configuration of the forward direction recognition processing unit 20 is the same as that in the first embodiment described above.

逆方向認識処理部３０は、音声分析部３１が生成する特徴パラメータ系列を時間軸上で反転し、逆方向特徴パラメータ系列に変換する時間軸上反転部３７を備えている。この時間軸上反転部３７が生成する逆方向特徴パラメータ系列が照合部３３に与えられる。逆方向認識処理部３０のその他の構成は、前述の第１の実施形態の場合と同様である。
このような構成によっても、第１の実施形態の場合と同様に、順方向認識処理および逆方向認識処理を併用して、単語辞書２５，３５に登録された単語（認識候補）間の認識確率のばらつきを抑制できる。 The reverse direction recognition processing unit 30 includes a time axis inversion unit 37 that inverts the feature parameter series generated by the speech analysis unit 31 on the time axis and converts the feature parameter series into a reverse direction feature parameter series. The reverse direction feature parameter series generated by the time axis inversion unit 37 is given to the matching unit 33. Other configurations of the reverse direction recognition processing unit 30 are the same as those in the first embodiment.
With this configuration as well, as in the first embodiment, the forward direction recognition process and the backward direction recognition process are used together, so that variation in recognition probability among the words (recognition candidates) registered in the word dictionaries 25 and 35 can be suppressed.
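Reversing the waveform before analysis (first embodiment) and reversing the feature sequence after analysis (second embodiment) need not produce the same backward feature sequence, which is one reason the backward models should be prepared with the same order of operations that is used at recognition time. The toy Python sketch below illustrates this with a stand-in analysis, frame energy over non-overlapping frames; it is an assumption for illustration and is not MFCC extraction.

```python
from typing import List, Sequence


def frame_energies(signal: Sequence[float], frame_len: int) -> List[float]:
    """Toy analysis: energy per non-overlapping frame, partial last frame dropped."""
    n_frames = len(signal) // frame_len
    return [
        sum(x * x for x in signal[i * frame_len:(i + 1) * frame_len])
        for i in range(n_frames)
    ]


signal = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]  # 10 samples, frame length 4

# First embodiment: reverse the waveform, then analyze.
reverse_then_analyze = frame_energies(signal[::-1], 4)
# Second embodiment: analyze the forward waveform, then reverse the features.
analyze_then_reverse = frame_energies(signal, 4)[::-1]

# When the length is a multiple of the frame length, the two orders agree here.
aligned = signal[:8]
aligned_match = frame_energies(aligned[::-1], 4) == frame_energies(aligned, 4)[::-1]
```

Here the two orders give [294, 86] and [174, 30] respectively, because the frames fall on different samples once the signal is reversed. They coincide only in the aligned case.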

ただし、逆方向認識処理部３０の音響ライブラリ３２を構成する音響モデル３４および単語辞書３５を作成する際の学習は、認識処理時と同様の処理手順（音声分析後に時間軸反転する手順）で行う必要がある。
なお、この実施形態では、順方向認識処理部２０および逆方向認識処理部３０にそれぞれ音声分析部２１，３１を設けているが、一つの音声分析部を順方向および逆方向認識処理部２０，３０で共有することとしてもよい。
［実験例］
図９は、音声データを順方向および逆方向の両方で再生し、それぞれについて単語認識実験を行った場合の認識率を示す図である。順方向再生音声データに対する認識では、「単語６」および「単語９」で認識率が低いのに対し、逆方向再生音声データに対する認識では、それらの単語に対して、いずれも８０％を超える認識率が得られた。逆に、「単語１１」では、順方向再生音声データに対する認識のほうが、逆方向再生音声データに対する認識よりも、やや認識率が良い。 However, the learning used to create the acoustic model 34 and the word dictionary 35 that make up the acoustic library 32 of the backward direction recognition processing unit 30 must follow the same processing procedure as recognition (time-axis inversion after speech analysis).
In this embodiment, the forward direction recognition processing unit 20 and the backward direction recognition processing unit 30 are provided with speech analysis units 21 and 31, respectively, but a single speech analysis unit may be shared by the forward and backward direction recognition processing units 20 and 30.
[Experimental example]
FIG. 9 is a diagram showing recognition rates when voice data is reproduced both in the forward direction and in the reverse direction, and a word recognition experiment is performed for each. In the recognition for the forward reproduction voice data, the recognition rate is low for “word 6” and “word 9”, whereas in the recognition for the backward reproduction voice data, the recognition of both of those words exceeds 80%. The rate was obtained. On the other hand, in “word 11”, the recognition rate for the forward reproduction sound data is slightly better than the recognition for the reverse reproduction sound data.

この実験結果から、単語によって順方向認識および逆方向認識の適／不適があり、順方向認識および逆方向認識を組み合わせて単語認識処理を行う前述の実施形態によって、単語認識率の均一性を向上できることが理解される。
［変形例］
前述の実施形態では、ＨＭＭ単語認識を例に挙げたが、認識アルゴリズムは、ＤＰマッチング方式、ニューラルネットワーク方式、ベイズ識別関数方式、ベクトル量子化方式など、任意の方式を適用できる。この場合、尤度情報としては、尤度ではなく、特徴ベクトル間の「距離」などが用いられる場合もあり、前記の条件式(3)(5)(6)(7)(9)などにおいて、適宜不等号の向きを反対にする必要があり得る。 These experimental results show that each word is better suited to either forward or backward recognition, and that the above-described embodiments, which perform word recognition by combining forward and backward recognition, can improve the uniformity of the word recognition rate.
[Modification]
In the above-described embodiment, HMM word recognition is taken as an example, but any method such as a DP matching method, a neural network method, a Bayes discriminant function method, or a vector quantization method can be applied as a recognition algorithm. In this case, as the likelihood information, not “likelihood” but “distance” between feature vectors may be used, and in the conditional expressions (3), (5), (6), (7), (9), etc. It may be necessary to reverse the direction of the inequality sign as appropriate.
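The remark about reversing the inequality can be made concrete with a small Python sketch. The `higher_is_better` flag, the function name, and the sample scores are assumptions for illustration: with a likelihood the largest score must exceed the threshold, while with a distance the smallest score must fall below it.

```python
from typing import Optional, Sequence


def best_candidate(
    scores: Sequence[float], threshold: float, higher_is_better: bool = True
) -> Optional[int]:
    """Pick the best-scoring index and apply the matching threshold test."""
    indices = range(len(scores))
    if higher_is_better:
        k = max(indices, key=scores.__getitem__)   # likelihood: larger is better
        accepted = scores[k] > threshold
    else:
        k = min(indices, key=scores.__getitem__)   # distance: smaller is better
        accepted = scores[k] < threshold
    return k if accepted else None


by_likelihood = best_candidate([0.2, 0.9, 0.4], threshold=0.5)
by_distance = best_candidate([3.0, 0.5, 2.0], threshold=1.0, higher_is_better=False)
```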

また、単語認識でなくても、連続音声認識に対してもこの発明を同様に適用できる。この場合、認識対象は、「単語」の代わりに、「文字」、「音節」等の発話単位になる。
さらに、前述の実施形態では、結合尤度Ｌｉを順方向尤度ＬＦｉおよび逆方向尤度ＬＲｉの線形結合としたが、これらの非線形結合によって結合尤度Ｌｉを表してもよく、一般に、結合尤度Ｌｉは、順方向尤度ＬＦｉおよび逆方向尤度ＬＲｉの関数ｆによって、次式(9)のように表すことができる。 Further, the present invention can be similarly applied to continuous speech recognition even if not word recognition. In this case, the recognition target is an utterance unit such as “character” or “syllable” instead of “word”.
Furthermore, in the above-described embodiments the combined likelihood Li is a linear combination of the forward likelihood LFi and the backward likelihood LRi, but it may instead be a nonlinear combination of them. In general, the combined likelihood Li can be expressed by a function f of the forward likelihood LFi and the backward likelihood LRi, as in the following expression (9).

Li＝f(LFi，LRi) …… (9)
順方向および逆方向尤度ＬＦｉ，ＬＲｉの非線形結合によって結合尤度Ｌｉを定める場合の例は、次の式(10)および(11)のとおりである。
Ｌｉ＝ａ・ＬＦｉ²＋ｂ・ＬＦｉ＋ｃ・ＬＲｉ²＋ｄ・ＬＲｉ ……(10)
Ｌｉ＝ｍ・LFi³＋ｎ・LFi²＋ｐ・LFi＋ｑ・LRi³＋ｒ・LRi²＋ｓ・LRi ……(11)
ただし、ａ，ｂ，ｃ，ｄ，ｍ，ｎ，ｐ，ｑ，ｒ，ｓは、係数（実数）であり、定数であってもよいし、認識候補（単語）毎に予め異なる値を設定しておいてもよい。 Li = f(LFi, LRi) …… (9)
An example in which the joint likelihood Li is determined by the non-linear combination of the forward and backward likelihoods LFi and LRi is as shown in the following equations (10) and (11).
Li = a · LFi² + b · LFi + c · LRi² + d · LRi …… (10)
Li = m · LFi³ + n · LFi² + p · LFi + q · LRi³ + r · LRi² + s · LRi …… (11)
Here, a, b, c, d, m, n, p, q, r, and s are coefficients (real numbers); they may be constants, or different values may be set in advance for each recognition candidate (word).
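As a worked instance of expression (10), the Python sketch below evaluates the quadratic combination and checks that it reduces to the linear form of expression (1) when the quadratic coefficients are zero. The function name and the coefficient values are arbitrary assumptions chosen for easy arithmetic.

```python
def combine_quadratic(
    lf: float, lr: float, a: float, b: float, c: float, d: float
) -> float:
    """Expression (10): Li = a*LFi^2 + b*LFi + c*LRi^2 + d*LRi."""
    return a * lf * lf + b * lf + c * lr * lr + d * lr


# Hypothetical scores and coefficients.
quadratic = combine_quadratic(0.5, 0.25, a=1.0, b=2.0, c=4.0, d=8.0)
# With a = c = 0 this is expression (1) with alpha = b and beta = d.
linear = combine_quadratic(0.5, 0.25, a=0.0, b=2.0, c=0.0, d=8.0)
```

For these values the quadratic form gives 0.25 + 1 + 0.25 + 2 = 3.5 and the linear special case gives 1 + 2 = 3.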

また、前述の実施形態では、いずれの場合にも、順方向認識処理および逆方向認識処理の両方が行われるが、図４〜図７に示した処理のように、順方向認識処理結果を優先する場合に、逆方向認識に関連する処理（音声分析および照合処理。図８の構成の場合にはさらに時間軸上反転処理）を後回し処理（ステップＳ１１，Ｓ１５，Ｓ１６，Ｓ２１で否定判定がされた場合に初めて行う処理）にしてもよい。むろん、逆方向認識処理結果を優先する場合も同様であり、順方向認識に関連する処理（音声分析および照合処理）を後回し処理としてもよい。 In the above-described embodiments, both the forward direction recognition process and the backward direction recognition process are always performed. However, when the forward recognition result is given priority, as in the processes shown in FIGS. 4 to 7, the processing related to backward recognition (speech analysis and collation; in the configuration of FIG. 8, also the time-axis inversion) may be deferred, that is, performed only after a negative determination is made in step S11, S15, S16, or S21. The same naturally applies when the backward recognition result is given priority: the processing related to forward recognition (speech analysis and collation) may be deferred.
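Deferring the backward pass can be sketched by handing the decision function a callable that runs backward recognition only on demand. The Python below is an assumption-laden illustration based on the FIG. 4 flow: `compute_backward` stands in for the whole backward analysis and collation chain, and all names and scores are hypothetical.

```python
from typing import Callable, List, Optional, Sequence


def recognize_with_deferred_backward(
    lf: Sequence[float],
    compute_backward: Callable[[], Sequence[float]],
    theta: float,
    alpha: float,
    beta: float,
    big_theta: float,
) -> Optional[int]:
    """Run backward recognition only if the forward test (step S11) fails."""
    j = max(range(len(lf)), key=lf.__getitem__)
    if lf[j] > theta:
        return j                       # backward pass never executed
    lr = compute_backward()            # deferred work happens only here
    combined = [alpha * f + beta * r for f, r in zip(lf, lr)]
    k = max(range(len(combined)), key=combined.__getitem__)
    return k if combined[k] > big_theta else None


calls: List[int] = []


def fake_backward_pass() -> Sequence[float]:
    """Stand-in for backward analysis and collation; records each invocation."""
    calls.append(1)
    return [-14.0, -28.0, -11.0]


first = recognize_with_deferred_backward(
    [-5.0, -30.0, -25.0], fake_backward_pass, -8.0, 0.5, 0.5, -15.0
)
calls_after_first = len(calls)
second = recognize_with_deferred_backward(
    [-12.0, -30.0, -25.0], fake_backward_pass, -8.0, 0.5, 0.5, -15.0
)
calls_after_second = len(calls)
```

The first utterance is decided on the forward score alone, so the stand-in backward pass is not called. The second one triggers it exactly once.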

また、前述の図４〜７の処理例では、結合尤度Ｌｉを用いた判定の前に、順方向尤度ＬＦｉおよび逆方向尤度ＬＲｉのうちの一方に関する判定のみを行っているが、これらの両方に関する判定を行うようにしてもよい。そして、いずれかの尤度ＬＦｉ，ＬＲｉが十分に大きな値を有する単語が見つかったこと、所定の閾値以上の尤度ＬＦｉ，ＬＲｉを有する単語がただ一つに定まること、他の単語に比較して十分に尤度ＬＦｉ，ＬＲｉが大きいこと、などを条件として認識結果を確定し、その後の結合尤度Ｌｉに基づく判定処理を省くようにしてもよい。 In the processing examples of FIGS. 4 to 7 described above, only one of the forward likelihood LFi and the backward likelihood LRi is examined before the determination using the combined likelihood Li, but both may be examined. The recognition result may then be fixed, and the subsequent determination based on the combined likelihood Li omitted, on a condition such as: a word is found for which either likelihood LFi or LRi is sufficiently large; exactly one word has a likelihood LFi or LRi equal to or greater than a predetermined threshold; or the likelihood LFi or LRi of a word is sufficiently large compared with the other words.

さらに、前述の実施形態では、車載用情報機器５０のための音声指示装置１０を例にとったが、この発明は、車載用情報機器に限らず、他の情報機器に対する音声指示のために適用することもできる。
その他、「課題を解決するための手段」の項で説明した各種の変形に加え、特許請求の範囲に記載された事項の範囲で種々の設計変更を施すことが可能である。 Furthermore, although the above-described embodiments take the voice instruction device 10 for the in-vehicle information device 50 as an example, the present invention is not limited to in-vehicle information devices and can also be applied to voice instruction for other information devices.
In addition to the various modifications described in the section “Means for Solving the Problems”, various design changes can be made within the scope of the matters described in the claims.

この発明の一実施形態に係る車両システムの全体構成を示す図である。 FIG. 1 is a diagram showing the overall configuration of a vehicle system according to an embodiment of the present invention.
前記車両システムの電気的構成を示すブロック図である。 FIG. 2 is a block diagram showing the electrical configuration of the vehicle system.
統合判定部における第１の処理例を説明するためのフローチャートである。 FIG. 3 is a flowchart for explaining a first processing example in the integrated determination unit.
統合判定部における第２の処理例を説明するためのフローチャートである。 FIG. 4 is a flowchart for explaining a second processing example in the integrated determination unit.
統合判定部における第３の処理例を説明するためのフローチャートである。 FIG. 5 is a flowchart for explaining a third processing example in the integrated determination unit.
統合判定部における第４の処理例を説明するためのフローチャートである。 FIG. 6 is a flowchart for explaining a fourth processing example in the integrated determination unit.
統合判定部における第５の処理例を説明するためのフローチャートである。 FIG. 7 is a flowchart for explaining a fifth processing example in the integrated determination unit.
この発明の第２の実施形態に係る音声指示装置の構成を説明するためのブロック図である。 FIG. 8 is a block diagram for explaining the configuration of a voice instruction device according to the second embodiment of the present invention.
音声データを順方向および逆方向の両方で再生し、それぞれについて単語認識実験を行った場合の認識率を示す図である。 FIG. 9 is a diagram showing recognition rates when speech data is played back in both the forward and backward directions and a word recognition experiment is performed for each.

Explanation of symbols

1 Vehicle body
2 Occupant
3 Helmet
5 Microphone
6 Speaker
7 Harness
10 Voice instruction device
10A Voice instruction device main body
11 Speech recognition apparatus
12 Command data generation unit
15 Reverse reproduction unit
20 Forward direction recognition processing unit
21 Voice analysis unit
22 Acoustic library
23 Collation unit
24 Acoustic model
25 Word dictionary
30 Reverse direction recognition processing unit
31 Voice analysis unit
32 Acoustic library
33 Collation unit
34 Acoustic model
35 Word dictionary
37 Time-axis inversion unit
40 Integrated determination unit
50 In-vehicle information device
51 Information processing unit

Claims

A speech recognition apparatus comprising:
forward direction recognition means for recognizing an input speech signal in a forward direction according to an input time series, and generating forward likelihood information representing the likelihood that the input speech signal corresponds to a recognition candidate;
reverse direction recognition means for recognizing the input speech signal in a reverse direction according to a time series opposite to the input time series, and generating reverse likelihood information representing the likelihood that the input speech signal corresponds to a recognition candidate; and
integrated determination means for integrating outputs of the forward direction recognition means and the reverse direction recognition means to generate a recognition result corresponding to the input speech signal,
wherein the integrated determination means includes likelihood information combining means for combining the forward likelihood information and the reverse likelihood information generated by the forward direction recognition means and the reverse direction recognition means, respectively, to generate combined likelihood information representing the likelihood that the input speech signal corresponds to a recognition candidate, and recognition result determination means for evaluating the combined likelihood information generated by the likelihood information combining means to obtain a recognition result,
the likelihood information combining means includes weighted combining means for weighting and combining the forward likelihood information and the reverse likelihood information, and
the weighted combining means includes recognition-candidate-adaptive weighted combining means for combining the forward likelihood information and the reverse likelihood information after applying, to the forward likelihood information and/or the reverse likelihood information, weights that depend on the recognition candidate and are determined in advance for each recognition candidate by learning so as to reduce variation in recognition probability among recognition candidates.
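
The candidate-adaptive weighted combination recited above can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the claim only requires that the likelihoods be weighted and combined, so the linear weighted sum, the function names `combine_likelihoods` and `decide`, the default weights, and all numeric values are hypothetical.

```python
from typing import Dict, Tuple


def combine_likelihoods(
    forward: Dict[str, float],
    reverse: Dict[str, float],
    weights: Dict[str, Tuple[float, float]],
    default: Tuple[float, float] = (0.5, 0.5),
) -> Dict[str, float]:
    """Combine forward and reverse likelihoods per recognition candidate.

    `weights` maps each candidate to a (forward, reverse) weight pair that is
    assumed to have been learned beforehand. A linear weighted sum is an
    assumption of this sketch; the source does not fix the combining formula.
    """
    combined = {}
    # Only candidates scored in both directions can be combined.
    for cand in forward.keys() & reverse.keys():
        w_f, w_r = weights.get(cand, default)
        combined[cand] = w_f * forward[cand] + w_r * reverse[cand]
    return combined


def decide(combined: Dict[str, float]) -> str:
    """Evaluate the combined likelihoods and return the best candidate."""
    return max(combined, key=combined.get)


# Hypothetical likelihoods and per-candidate weights.
forward = {"home": 0.6, "phone": 0.7}
reverse = {"home": 0.8, "phone": 0.4}
weights = {"home": (0.5, 0.5), "phone": (0.7, 0.3)}

combined = combine_likelihoods(forward, reverse, weights)
result = decide(combined)
```

In this sketch the candidate that scores lower in the forward direction alone can still be selected once its reverse-direction likelihood is taken into account.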

The speech recognition apparatus according to claim 1, further comprising reverse reproduction means for reproducing the input speech signal according to a time series opposite to the input time series and inputting the reproduced signal to the reverse direction recognition means.

The speech recognition apparatus according to claim 1, wherein the reverse direction recognition means includes:
voice analysis means for analyzing the input speech signal according to the input time series and generating a feature series according to the input time series; and
feature series inversion means for converting the feature series generated by the voice analysis means into a feature series according to a time series opposite to the input time series and inputting the converted feature series to the reverse direction recognition means.
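
The feature series inversion recited above can be illustrated with a short sketch: the signal is analyzed in its input order, and only the order of the resulting feature vectors is reversed. The function names, the frame length, and the use of per-frame energy as the feature are assumptions made for illustration; the source does not specify the analysis performed by the voice analysis means.

```python
from typing import List, Sequence


def analyze(samples: Sequence[float], frame_len: int = 4) -> List[List[float]]:
    """Split the signal into non-overlapping frames, in input order, and
    compute one feature vector per frame.

    Mean frame energy is a stand-in feature chosen for this sketch only.
    """
    features = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energy = sum(x * x for x in frame) / frame_len
        features.append([energy])
    return features


def invert_feature_series(features: List[List[float]]) -> List[List[float]]:
    """Reverse the order of the feature vectors along the time axis.

    The samples inside each frame are untouched; only the frame order changes.
    """
    return list(reversed(features))


# Hypothetical input: three frames of constant amplitude 1, 2 and 3.
samples = [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3]
forward_series = analyze(samples)
reverse_series = invert_feature_series(forward_series)
```

With this arrangement the analysis runs once on the signal as received, and the reversed series can then be matched against models prepared for the reverse direction.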

The speech recognition apparatus according to any one of claims 1 to 3, wherein the integrated determination means includes:
one-way recognition determination means for determining whether a single processing result by at least one of the forward direction recognition means and the reverse direction recognition means indicates the presence of a recognition candidate having at least a certain likelihood; and
means for outputting the recognition candidate as a recognition result when the one-way recognition determination means determines that the single processing result indicates the presence of a recognition candidate having at least the certain likelihood.

The speech recognition apparatus according to claim 4, wherein the one-way recognition determination means includes means for assigning a weight that depends on the recognition candidate to the single processing result and determining whether the single processing result indicates the presence of a recognition candidate having at least a certain likelihood.
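
The one-way recognition determination with candidate-dependent weights can be sketched as below. This is one possible reading under stated assumptions: the function name `one_way_decision`, the multiplicative application of the weight, the default weight of 1.0, and the threshold value are all hypothetical, and returning `None` stands in for deferring to some other determination path.

```python
from typing import Dict, Optional


def one_way_decision(
    single_result: Dict[str, float],
    candidate_weights: Dict[str, float],
    threshold: float,
) -> Optional[str]:
    """Check whether a single-direction result contains a candidate whose
    weighted likelihood reaches the threshold.

    Returns that candidate if so, otherwise None. Multiplying the likelihood
    by the candidate's weight is an assumption of this sketch.
    """
    if not single_result:
        return None
    scored = {
        cand: candidate_weights.get(cand, 1.0) * lik
        for cand, lik in single_result.items()
    }
    best = max(scored, key=scored.get)
    return best if scored[best] >= threshold else None


# Hypothetical single-direction results and weights.
confident = one_way_decision(
    {"home": 0.9, "phone": 0.5}, {"home": 1.0, "phone": 1.2}, threshold=0.8
)
uncertain = one_way_decision(
    {"home": 0.6, "phone": 0.5}, {"phone": 1.2}, threshold=0.8
)
```

Here `confident` names a candidate that can be output directly, while `uncertain` is `None` because no weighted likelihood reaches the threshold.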

A voice instruction device comprising:
the speech recognition apparatus according to any one of claims 1 to 5;
voice signal input means for inputting a voice signal to the speech recognition apparatus; and
command data generating means for converting a recognition result produced by the speech recognition apparatus into instruction command data to be input to a predetermined device.

An information device comprising:
the voice instruction device according to claim 6; and
command processing means that operates in accordance with the instruction command data generated by the command data generating means.

A vehicle system in which the command processing means of the information device according to claim 7 is mounted on a vehicle body.

A speech recognition method comprising:
a forward direction recognition step of recognizing an input speech signal in a forward direction according to an input time series, and generating forward likelihood information representing the likelihood that the input speech signal corresponds to a recognition candidate;
a reverse direction recognition step of recognizing the input speech signal in a reverse direction according to a time series opposite to the input time series, and generating reverse likelihood information representing the likelihood that the input speech signal corresponds to a recognition candidate; and
an integrated determination step of integrating the results of the forward direction recognition step and the reverse direction recognition step to generate a recognition result corresponding to the input speech signal,
wherein the integrated determination step includes a likelihood information combining step of combining the forward likelihood information and the reverse likelihood information generated in the forward direction recognition step and the reverse direction recognition step, respectively, to generate combined likelihood information representing the likelihood that the input speech signal corresponds to a recognition candidate, and a recognition result determination step of evaluating the combined likelihood information generated in the likelihood information combining step to obtain a recognition result,
the likelihood information combining step includes a weighted combining step of weighting and combining the forward likelihood information and the reverse likelihood information, and
the weighted combining step includes a recognition-candidate-adaptive weighted combining step of combining the forward likelihood information and the reverse likelihood information after applying, to the forward likelihood information and/or the reverse likelihood information, weights that depend on the recognition candidate and are determined in advance for each recognition candidate by learning so as to reduce variation in recognition probability among recognition candidates.

The speech recognition method according to claim 9, wherein the integrated determination step includes:
a one-way recognition determination step of determining whether a single processing result by at least one of the forward direction recognition step and the reverse direction recognition step indicates the presence of a recognition candidate having at least a certain likelihood; and
a step of outputting the recognition candidate as a recognition result when it is determined in the one-way recognition determination step that the single processing result indicates the presence of a recognition candidate having at least the certain likelihood.

The speech recognition method according to claim 10, wherein the one-way recognition determination step includes a step of assigning a weight that depends on the recognition candidate to the single processing result and determining whether the single processing result indicates the presence of a recognition candidate having at least a certain likelihood.