JP2007206501A

JP2007206501A - Device for determining optimum speech recognition system, speech recognition device, parameter calculation device, information terminal device and computer program

Info

Publication number: JP2007206501A
Application number: JP2006026866A
Authority: JP
Inventors: Toshiki Endo; 俊樹遠藤; Shingo Kuroiwa; 眞吾黒岩; Toru Shimizu; 徹清水
Original assignee: ATR Advanced Telecommunications Research Institute International
Current assignee: ATR Advanced Telecommunications Research Institute International
Priority date: 2006-02-03
Filing date: 2006-02-03
Publication date: 2007-08-16

Abstract

PROBLEM TO BE SOLVED: To predict the accuracy of speech recognition without any test utterance.

SOLUTION: The optimum speech recognition system selection device determines, for speech that includes noise, which of a plurality of speech recognition systems can perform the most accurate speech recognition. It comprises: a feature amount calculating section 180 for calculating a predetermined sound feature amount vector from input speech; a transformation processing section 182 for transforming the sound feature amount vector into a feature amount vector of fewer dimensions by a transformation matrix; a prediction accuracy calculating section 184 for predicting the accuracy of speech recognition for each speech recognition system by using the transformed feature amount vector and multiple regression coefficients prepared beforehand; and a maximum value selecting section 185 for selecting the speech recognition system by which speech recognition can be performed with the highest accuracy.

COPYRIGHT: (C)2007, JPO&INPIT

Description

The present invention relates to a system for recognizing speech uttered in an environment where noise of an unpredictable type may be present, and in particular to a speech recognition system that uses a plurality of speech recognition methods to recognize speech accurately even under noise, and to a computer program therefor.

In recent years, devices capable of recording voice information, such as voice recorders, answering machines, and mobile phones, have come into wide use. Because voice information requires a large storage capacity, however, it is desirable to convert it into text by speech recognition before recording it. If such a service were available, a mobile phone user could, for example, enter information such as an address for the phone book by voice and have it converted into text by speech recognition and stored. The information could then not only be played back as audio but also displayed as characters on the screen, and used by other devices.

As one such service, a service has been put into practical use in which speech input on a mobile phone is transmitted to a server and speech recognition is performed on the server. The speech recognition result is returned from the server to the mobile phone, which can then perform various processes using that result.

In a system that uses speech recognition in this way, the accuracy of speech recognition becomes an issue. Various speech recognition methods exist, and accuracy varies both with the method and with the speech being recognized.

As one such approach, a speech recognition device has been proposed that improves recognition accuracy by selecting the result with the best score from among candidate results produced by a plurality of speech recognition methods.

FIG. 1 shows a block diagram of the server 30 of such a conventional speech recognition system. Referring to FIG. 1, the server 30 uses first to third speech recognition units 50 to 54 and includes: a communication unit 40 for exchanging data with a mobile phone or the like; a framing unit 42 for dividing the input speech into frames with a predetermined frame length and a predetermined shift length; and first to third feature extraction units 44 to 48 for extracting, from the framed speech, the acoustic features used by the first to third speech recognition units 50 to 54, respectively, and supplying them to those units.

The first to third speech recognition units 50 to 54 each perform speech recognition using a different speech recognition method. Statistical speech recognition using an acoustic model and a language model has recently become the mainstream, and the first to third speech recognition units 50 to 54, although their methods differ, all perform statistical speech recognition. In statistical speech recognition, various candidate recognition results are generated together with the probability that each is the correct result. This probability is called the "score" of the candidate.

The server 30 further includes a selection unit 56 for selecting, from among the outputs of the first to third speech recognition units 50 to 54, the candidate with the highest score as the speech recognition result, and a transmission data creation unit 58 for creating transmission data from the selected speech recognition result and transmitting it to the mobile phone via the communication unit 40.

The server 30 operates as follows. Referring to FIG. 1, speech uttered by the user of a mobile phone is supplied to the communication unit 40 of the server 30. The framing unit 42 divides the supplied speech into frames with a predetermined frame length and a predetermined shift length. The framed speech is supplied to the first to third feature extraction units 44 to 48.
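The framing step performed by the framing unit 42 can be illustrated with a minimal sketch. The function name `frame_signal`, the choice to drop a trailing partial frame, and the sample values are assumptions made for illustration; the source only states that a predetermined frame length and shift length are used.

```python
from typing import List, Sequence


def frame_signal(samples: Sequence[float], frame_length: int, shift_length: int) -> List[List[float]]:
    """Split samples into frames of frame_length, advancing by shift_length each time.

    A trailing partial frame is dropped (an assumption; padding would also be possible).
    """
    if frame_length <= 0 or shift_length <= 0:
        raise ValueError("frame_length and shift_length must be positive")
    frames: List[List[float]] = []
    start = 0
    while start + frame_length <= len(samples):
        frames.append(list(samples[start:start + frame_length]))
        start += shift_length
    return frames


# Ten dummy samples, frames of 4 samples shifted by 2 samples (values are illustrative only).
frames = frame_signal(list(range(10)), frame_length=4, shift_length=2)
print(frames)
```

With a shift length smaller than the frame length, consecutive frames overlap, which is the usual arrangement when acoustic features are extracted frame by frame.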

The first to third feature extraction units 44 to 48 extract from each frame the acoustic features for the first to third speech recognition units 50 to 54, respectively, and supply them to those units. Each of the first to third speech recognition units 50 to 54 thus receives, from the corresponding feature extraction unit, a sequence of acoustic features extracted from the successive frames.

The first to third speech recognition units 50 to 54 perform speech recognition on the supplied sequences of acoustic features, and each outputs candidate recognition results as text. A score is attached to each candidate. The higher the score, the more accurately the input speech is considered to have been recognized. These candidates are supplied to the selection unit 56 together with their scores.

The selection unit 56 selects the candidate with the highest score from among the plurality of candidate recognition results and supplies it to the transmission data creation unit 58. The transmission data creation unit 58 creates, from this recognition result, transmission data in a form suitable for transmission and supplies it to the communication unit 40, which transmits it to the mobile phone or the like. The mobile phone or other device that receives the transmission data extracts the speech recognition result from it and uses it.
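The behavior of the selection unit 56 in this conventional server amounts to taking the maximum over scored candidates. The following is a minimal sketch; the `Candidate` type, the recognizer names, and the score values are hypothetical and not taken from the source.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    recognizer: str  # which recognition unit produced this candidate (hypothetical label)
    text: str        # recognized text
    score: float     # score reported by the recognizer


def select_best(candidates: List[Candidate]) -> Candidate:
    """Return the candidate with the highest score."""
    if not candidates:
        raise ValueError("at least one candidate is required")
    return max(candidates, key=lambda c: c.score)


candidates = [
    Candidate("recognizer_1", "call home", 0.62),
    Candidate("recognizer_2", "call Holmes", 0.71),
    Candidate("recognizer_3", "cold home", 0.40),
]
best = select_best(candidates)
print(best.recognizer, best.text)
```

Note that every recognizer has to run before the maximum can be taken, which is the source of the server load discussed later in the text.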

In such a scheme, speech recognition is performed by a plurality of speech recognition methods, and the result considered most reliable is adopted. Accuracy can therefore be expected to be higher than with a single speech recognition method, regardless of the type of input speech.

Another important problem in speech recognition is noise. It is generally known that recognition accuracy falls when the speech to be recognized contains noise. In particular, in a system where speech is input via a mobile phone, speech is input in many different places, and various kinds of noise are normally expected to be mixed into the utterance. A speech recognition system using mobile phones therefore needs a mechanism that can recognize, with high accuracy, speech input in environments where such background noise is present.

A system that simply employs a plurality of speech recognition methods as described above cannot, as it stands, cope adequately with noise.

To handle speech recognition when the type of background noise cannot be predicted, there is a scheme that uses acoustic models corresponding to various kinds of noise data. In this scheme, multiple sets of training data are created in advance by superimposing several types of noise data on predetermined utterance data, and an acoustic model for speech recognition is trained on each set, yielding one acoustic model per type of noise data. At recognition time, speech recognition is performed with all of these acoustic models, and the result with the best score is adopted.
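Creating noisy training data by superimposing noise on clean utterances can be sketched as follows. The sample-wise addition, the looping of a short noise clip, the `gain` parameter, and the noise names are all assumptions for illustration; the source does not specify how the superposition is carried out.

```python
from itertools import cycle, islice
from typing import Dict, List, Sequence


def superimpose(utterance: Sequence[float], noise: Sequence[float], gain: float = 1.0) -> List[float]:
    """Add noise to the utterance sample by sample, looping the noise if it is shorter."""
    if not noise:
        raise ValueError("noise must contain at least one sample")
    looped = islice(cycle(noise), len(utterance))
    return [s + gain * n for s, n in zip(utterance, looped)]


def build_training_sets(utterance: Sequence[float], noises: Dict[str, Sequence[float]]) -> Dict[str, List[float]]:
    """Produce one noisy copy of the utterance per noise type."""
    return {name: superimpose(utterance, noise) for name, noise in noises.items()}


# Dummy waveforms; real data would be audio samples.
utterance = [1.0, 2.0, 3.0, 4.0, 5.0]
noises = {"street": [0.5, -0.5], "station": [0.25]}
training_sets = build_training_sets(utterance, noises)
print(training_sets)
```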

Even with this scheme, however, a plurality of speech recognition devices must operate simultaneously, which places a high load on the server. Moreover, if each speech recognition method is run with a plurality of acoustic models, the results may become more reliable, but the server load increases dramatically.

One scheme that attempts to recognize speech accurately under various kinds of noise without increasing the server load uses a test utterance. In this scheme, prior to speech recognition processing, the user of the speech recognition system is asked to utter a predetermined text. Predetermined acoustic features are extracted from the resulting speech, and speech recognition is performed using the acoustic model for each type of noise. Because the content of the test utterance, that is, the correct recognition result, is known in advance, each speech recognition device is made to calculate the score of the recognition result that gives the correct answer. The acoustic model used by the speech recognition device that gave the highest score is regarded as best matching the background noise of the input speech, and this acoustic model is used for subsequent speech recognition.

With this scheme, speech recognition can be performed using the acoustic model most likely to give the best result in that environment, regardless of the type of background noise. In addition, once the acoustic model and speech recognition method to be used have been determined from the test utterance, only one speech recognition device operates for that user, so an increase in server load can be avoided.

This scheme, however, has the problem that the user must make a test utterance before using the speech recognition device. If a test utterance is required every time before speech recognition, the speech recognition device is no longer easy to use casually.

Furthermore, with the conventional techniques listed above, the user cannot know how accurately speech can be recognized in the current environment. Not knowing whether speech recognition will give reliable results, the user may feel uneasy about using it.

Accordingly, one object of the present invention is to provide an optimum speech recognition method selection device and a speech recognition device that can predict the accuracy of speech recognition without a prior test utterance.

Another object of the present invention is to provide an information terminal device that can inform the user of the predicted speech recognition accuracy in an easy-to-understand manner.

Still another object of the present invention is to provide a speech recognition device, an optimum speech recognition method selection device, and a parameter calculation device that can reduce the amount of processing when speech recognition is performed by a plurality of speech recognition methods.

An optimum speech recognition method selection device according to a first aspect of the present invention determines, for speech input in an environment where noise may be present, which of a plurality of predetermined speech recognition methods is predicted to be capable of the most accurate speech recognition. The device includes: means for calculating, from the input speech, a feature vector whose components are a predetermined plurality of types of acoustic features; conversion means for converting the feature vector into a feature vector of fewer dimensions using a transformation matrix calculated in advance; accuracy prediction means for predicting, for each speech recognition method, the accuracy of speech recognition for the speech by a predetermined operation between the converted feature vector and a coefficient vector for accuracy prediction prepared in advance for each of the plurality of speech recognition methods; and means for selecting, based on the predictions calculated by the accuracy prediction means, the speech recognition method capable of speech recognition with the highest accuracy.

This optimum speech recognition method selection device predicts the accuracy of speech recognition for each speech recognition method using the feature vector obtained from the acoustic features of the input speech, the transformation matrix calculated in advance, and the coefficient vectors for accuracy prediction. The highest of the predicted accuracies is selected, and the corresponding speech recognition method can be chosen as the optimum method for recognizing the input speech. No test utterance is needed for the prediction. An optimum speech recognition method selection device that can predict the accuracy of speech recognition without a prior test utterance can therefore be provided.
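The selection procedure of the first aspect (dimension reduction by a transformation matrix, accuracy prediction per method from a coefficient vector, then selection of the maximum) can be sketched in a few lines. This is only an illustration under stated assumptions: the "predetermined operation" is taken to be a linear regression with the intercept stored as the first coefficient, and the matrix, coefficients, feature values, and method names are all made up.

```python
from typing import Dict, List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def reduce_dimensions(matrix: Matrix, features: Vector) -> Vector:
    """Multiply a (reduced_dim x full_dim) transformation matrix by the feature vector."""
    for row in matrix:
        if len(row) != len(features):
            raise ValueError("matrix row length must match feature vector length")
    return [sum(w * x for w, x in zip(row, features)) for row in matrix]


def predict_accuracy(coefficients: Vector, reduced: Vector) -> float:
    """Assumed linear model: coefficients[0] is the intercept, the rest are weights."""
    if len(coefficients) != len(reduced) + 1:
        raise ValueError("expected one intercept plus one weight per reduced feature")
    return coefficients[0] + sum(b * z for b, z in zip(coefficients[1:], reduced))


def select_method(matrix: Matrix, coefficients_by_method: Dict[str, Vector],
                  features: Vector) -> Tuple[str, Dict[str, float]]:
    """Predict accuracy for every method and return the best method with all predictions."""
    reduced = reduce_dimensions(matrix, features)
    predictions = {name: predict_accuracy(coeffs, reduced)
                   for name, coeffs in coefficients_by_method.items()}
    best = max(predictions, key=predictions.get)
    return best, predictions


# Illustrative values only: 4 acoustic features reduced to 2 dimensions.
matrix = [[0.5, 0.5, 0.0, 0.0],
          [0.0, 0.0, 0.5, 0.5]]
coefficients_by_method = {"method_a": [50.0, 10.0, -5.0],
                          "method_b": [60.0, 2.0, 1.0]}
best, predictions = select_method(matrix, coefficients_by_method, [2.0, 4.0, 1.0, 3.0])
print(best, predictions)
```

The point of the design is that prediction needs only one matrix-vector product and one dot product per method, so no recognizer has to run, and no test utterance is required, before a method is chosen.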

A speech recognition device according to a second aspect of the present invention performs speech recognition using a plurality of speech recognition means. The device includes: communication means for exchanging data with another device; separation means for receiving from the other device, via the communication means, speech information and method specifying information that specifies one of the plurality of speech recognition means, and for separating the speech information from the method specifying information; a plurality of speech recognition means, each for performing speech recognition by a specific speech recognition method; method selection means for selectively operating one of the plurality of speech recognition means in accordance with the method specifying information so that it recognizes the speech information; and means for returning to the other device, via the communication means, the output of the speech recognition means that was selected by the method selection means and recognized the speech information.

This speech recognition device can selectively operate one of the plurality of speech recognition means in accordance with the received method specifying information to recognize the speech information. Unlike the conventional art, it is therefore unnecessary to run every speech recognition means. A speech recognition device that reduces the amount of processing during speech recognition can thus be provided.
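The dispatch described for the second aspect can be sketched as a lookup keyed by the method specifying information. The message layout (a dictionary with `"method"` and `"speech"` keys), the method names, and the stand-in recognizers are hypothetical; the source does not define a wire format.

```python
from typing import Callable, Dict, List, Tuple

Recognizer = Callable[[bytes], str]


def separate(message: Dict[str, object]) -> Tuple[bytes, str]:
    """Split a received message into speech data and method specifying information."""
    speech = message["speech"]
    method_id = message["method"]
    if not isinstance(speech, bytes) or not isinstance(method_id, str):
        raise TypeError("unexpected message layout")
    return speech, method_id


def handle_request(message: Dict[str, object], recognizers: Dict[str, Recognizer]) -> str:
    """Run only the recognizer named in the message and return its output."""
    speech, method_id = separate(message)
    if method_id not in recognizers:
        raise KeyError(f"unknown recognition method: {method_id}")
    return recognizers[method_id](speech)


calls: List[str] = []  # records which recognizers actually ran


def make_recognizer(name: str) -> Recognizer:
    """Build a stand-in recognizer that just reports its name and the input size."""
    def recognize(speech: bytes) -> str:
        calls.append(name)
        return f"{name}:{len(speech)} bytes"
    return recognize


recognizers = {name: make_recognizer(name) for name in ("method_1", "method_2", "method_3")}
reply = handle_request({"method": "method_2", "speech": b"\x00\x01\x02"}, recognizers)
print(reply, calls)
```

Only the named recognizer appears in `calls`, which mirrors the stated benefit: the other recognition means do not need to run.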

A parameter calculation device according to a third aspect of the present invention calculates parameters for accuracy prediction that are used in determining, for speech input in an environment where noise may be present, which of a plurality of predetermined speech recognition methods is predicted to be capable of the most accurate speech recognition. The device includes: means for preparing a plurality of training data obtained by superimposing each of a plurality of types of noise data on utterance data prepared in advance; speech recognition accuracy calculating means for calculating, based on the results of recognizing the plurality of training data by the plurality of speech recognition methods, the speech recognition accuracy for each combination of training data and speech recognition method; acoustic feature calculating means for calculating, from each of the plurality of types of noise data, an acoustic feature vector consisting of predetermined acoustic features of a first number of dimensions; and parameter calculating means for calculating, based on the acoustic feature vectors calculated by the acoustic feature calculating means and the speech recognition accuracies calculated by the speech recognition accuracy calculating means, parameters for predicting, from the predetermined acoustic feature vector extracted from an input speech signal, the speech recognition accuracy obtained when that speech signal is recognized by the plurality of speech recognition methods.

This parameter calculation device calculates speech recognition accuracy based on the results of recognizing a plurality of training data, prepared in advance, by a plurality of speech recognition methods. From the calculated speech recognition accuracies and the acoustic feature vectors, it calculates parameters for predicting the speech recognition accuracy obtained when a speech signal is recognized by the plurality of speech recognition methods. Using the parameters obtained by this parameter calculation device, a speech recognition device that can predict the accuracy of speech recognition without a prior test utterance can therefore be provided.

The parameter calculating means includes: means for performing principal component analysis on the acoustic feature vectors obtained from each of the plurality of types of noise data and calculating a conversion function for converting an acoustic feature vector into an acoustic feature vector of a second number of dimensions smaller than the first number of dimensions; and means for converting the acoustic feature vectors obtained from each of the plurality of types of noise into acoustic feature vectors of the second number of dimensions using this conversion function, and calculating a set of parameters for accuracy prediction for each of the plurality of speech recognition means by multiple regression analysis in which the speech recognition accuracy calculated by the speech recognition accuracy calculating means is the objective variable and each component of the acoustic feature vector of the second number of dimensions is an explanatory variable.

With this parameter calculation device, principal component analysis is performed on the acoustic feature vectors obtained from each of the plurality of types of noise data to obtain a conversion function for converting an acoustic feature vector into one of a second number of dimensions smaller than the first. The acoustic feature vectors obtained from the noise data are then converted with this conversion function, and a set of parameters for accuracy prediction is calculated for each of the plurality of speech recognition means by multiple regression analysis in which the calculated speech recognition accuracy is the objective variable and the components of the converted acoustic feature vector are the explanatory variables. Because principal component analysis reduces the number of explanatory variables while preserving the amount of information, the multiple regression analysis is simplified and reliable multiple regression coefficients can be obtained. In addition, because this conversion function and the parameters obtained by the multiple regression analysis can be used when predicting speech recognition accuracy, a parameter calculation device that shortens the time required for prediction can be provided.
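The two-stage parameter calculation described above can be written out as follows. This is one standard formulation of principal component analysis followed by multiple regression, not a formula quoted from the source. The symbols are introduced here for illustration: $x_i \in \mathbb{R}^{D}$ is the acoustic feature vector of noise type $i$ ($i = 1, \dots, N$), $a_{ik}$ is the measured recognition accuracy of method $k$ on the training data built from noise type $i$, and $d < D$ is the reduced number of dimensions. The mean-centering and the intercept term $\beta_{k0}$ are assumptions.

```latex
\[
\begin{aligned}
\mu &= \frac{1}{N}\sum_{i=1}^{N} x_i, \qquad
C = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)(x_i - \mu)^{\top}, \\
W &= [\,w_1, \dots, w_d\,] \in \mathbb{R}^{D \times d}
   \quad \text{(eigenvectors of } C \text{ with the } d \text{ largest eigenvalues)}, \\
z_i &= W^{\top}(x_i - \mu) \in \mathbb{R}^{d}, \\
a_{ik} &\approx \beta_{k0} + \sum_{j=1}^{d} \beta_{kj}\, z_{ij}, \qquad
\beta_k = \bigl(\tilde{Z}^{\top}\tilde{Z}\bigr)^{-1}\tilde{Z}^{\top} a_k,
   \quad \tilde{Z} = [\,\mathbf{1} \;\; Z\,], \\
\hat{a}_k(x) &= \beta_{k0} + \sum_{j=1}^{d} \beta_{kj}\,\bigl[W^{\top}(x - \mu)\bigr]_j, \qquad
k^{*} = \arg\max_{k}\, \hat{a}_k(x).
\end{aligned}
\]
```

Here $Z$ is the $N \times d$ matrix whose rows are $z_i^{\top}$ and $a_k$ is the vector of accuracies for method $k$. The last line is what would be evaluated at prediction time for a new feature vector $x$, using only the stored $W$, $\mu$, and $\beta_k$.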

An information terminal device according to a fourth aspect of the present invention has a microphone and a communication function. The device includes: means for calculating, from speech input via the microphone, a feature vector whose components are a predetermined plurality of types of acoustic features; conversion means for converting the feature vector into a feature vector of fewer dimensions using a transformation matrix calculated in advance; accuracy prediction means for predicting, for each speech recognition method, the accuracy of speech recognition for the speech by a predetermined operation between the converted feature vector and a coefficient vector for accuracy prediction prepared in advance for each of a plurality of speech recognition methods; means for selecting the speech recognition method for which the accuracy calculated by the accuracy prediction means is highest; means for transmitting, via the communication function, the speech input via the microphone and information specifying the speech recognition method selected by the means for selecting to a predetermined destination; and processing execution means for receiving, via the communication function, the speech recognition result returned from the predetermined destination in response to the speech and the information specifying the speech recognition method, and executing predetermined processing on that speech recognition result.

This information terminal device predicts the accuracy of speech recognition for each speech recognition method by a predetermined operation between the feature vector of the speech input via the microphone and the coefficient vector for accuracy prediction prepared in advance for each of the plurality of speech recognition methods. It selects the speech recognition method with the highest predicted accuracy and transmits information specifying that method, together with the speech, to the speech recognition device that is the destination. Based on the transmitted information, speech recognition can be performed by the most suitable method, and there is no need to run the other speech recognition methods. A speech recognition device that reduces the processing required when using speech recognition by a plurality of methods can therefore be provided.

Preferably, the information terminal device includes level notification means for selecting one of a plurality of types of notification methods according to the highest accuracy identified by the means for selecting, and notifying the user of the information terminal of the level of that highest accuracy.

With this information terminal device, the notification tells the user the level of accuracy that can be expected from speech recognition. Because a level of accuracy is reported rather than a raw accuracy figure, the user can easily tell whether speech recognition is usable. An information terminal device that informs the user of the predicted speech recognition accuracy in an easy-to-understand manner can therefore be provided.

More preferably, the information terminal device further includes a display device, and the level notification means includes level display means for selecting one of a plurality of types of level display symbols according to the highest accuracy identified by the means for selecting, and displaying it on the display device.

With this information terminal device, one of a plurality of types of level display symbols is displayed on the display device according to the identified highest accuracy. Because the accuracy that can be expected from speech recognition is conveyed in the easily understood form of a symbol, the user can grasp it at once. An information terminal device that informs the user of the predicted speech recognition accuracy in an immediately understandable form can therefore be provided.
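Mapping the highest predicted accuracy to a level display symbol can be sketched as a threshold table. The thresholds (90 and 70 percent), the three level names, and the function name are hypothetical; the source only says that one of a plurality of symbols is chosen according to the highest predicted accuracy.

```python
from typing import List, Tuple

# (minimum predicted accuracy in percent, symbol name), highest threshold first.
# Both the thresholds and the symbol names are illustrative assumptions.
LEVELS: List[Tuple[float, str]] = [
    (90.0, "high"),
    (70.0, "medium"),
    (0.0, "low"),
]


def level_symbol(predicted_accuracy: float) -> str:
    """Return the symbol for the first level whose threshold the accuracy reaches."""
    for threshold, symbol in LEVELS:
        if predicted_accuracy >= threshold:
            return symbol
    return LEVELS[-1][1]  # accuracies below every threshold fall into the lowest level


for accuracy in (95.0, 70.0, 12.5):
    print(accuracy, level_symbol(accuracy))
```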

本発明の第５の局面に係るコンピュータプログラムは、コンピュータによって実行されると、当該コンピュータを上記のいずれかに記載の装置として動作させる。ゆえに、上記したいずれかの効果をもたらす事ができる。 When executed by a computer, the computer program according to the fifth aspect of the present invention causes the computer to operate as any of the above-described devices. Therefore, any of the above effects can be brought about.

[Configuration]
FIG. 2 is a block diagram of the speech recognition system 60 according to the present embodiment. Referring to FIG. 2, the speech recognition system 60 includes a server 76 capable of performing speech recognition using a plurality of speech recognition methods. The server 76 receives, from outside, a speech signal together with information designating a speech recognition method, performs speech recognition using the designated method, and returns the result to the sender of the speech. The configuration of the server 76 is described in detail later.

The speech recognition system 60 further includes a mobile phone 70 and a parameter calculation device 72. The mobile phone 70 uses parameters calculated in advance to predict the accuracy with which each of the speech recognition methods of the server 76 would recognize the input speech, and transmits information identifying the method with the highest predicted accuracy to the server 76 together with the speech. The parameter calculation device 72 calculates in advance the parameters used for predicting speech recognition accuracy in the mobile phone 70. These parameters are first stored in a storage medium 74 and then copied to each of the many mobile phones that are produced.

In this speech recognition system, the parameter calculation device 72 calculates the parameters used for accuracy prediction in the mobile phone 70 from a large set of learning data, each item of which consists of utterance data with one of many types of noise data superimposed on it. Specifically, it performs the following processing.

First, each item of learning data is recognized using each of the plurality of speech recognition methods performed by the server 76, and the recognition accuracy is calculated for each. Next, a feature vector composed of acoustic features for accuracy prediction is extracted from each item of noise data superimposed on the learning data, and principal component analysis is performed on these acoustic features. The principal component analysis yields a transformation matrix for reducing the number of dimensions of the feature vector. Using this transformation matrix, the feature vectors obtained from the noise data are transformed into feature vectors with fewer dimensions. Then, from the transformed feature vectors and the accuracies of the speech recognition performed beforehand on the learning data, a set of multiple regression coefficients is calculated for each speech recognition method. Each set predicts the accuracy of the speech recognition result as a linear sum of the components of the transformed feature vector.
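The training procedure described above can be sketched roughly as follows. This is a minimal illustration, not the implementation of the parameter calculation device 72: it assumes NumPy, mean-centered PCA computed by SVD, and an ordinary least-squares fit. The function names `fit_pca` and `fit_regression`, the array sizes, and the random stand-in data are all hypothetical.

```python
import numpy as np

def fit_pca(features, n_components):
    """Return (mean, W); W projects N-dim features onto the top principal components."""
    mean = features.mean(axis=0)
    centered = features - mean
    # Rows of vt are the principal directions, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean, vt[:n_components].T            # W has shape (N, i)

def fit_regression(reduced, accuracies):
    """Least-squares fit of accuracy = b0 + sum_n b_n * x'_n, one fit per method."""
    design = np.hstack([np.ones((reduced.shape[0], 1)), reduced])   # (M, i + 1)
    coeffs, *_ = np.linalg.lstsq(design, accuracies, rcond=None)    # (i + 1, J)
    return coeffs.T                              # row k is the coefficient vector B_k

# Synthetic stand-ins (assumptions): M noise types, N raw features,
# i reduced dimensions, J recognition methods.
rng = np.random.default_rng(0)
M, N, i, J = 40, 12, 3, 2
noise_features = rng.normal(size=(M, N))         # one feature vector per noise type
accuracies = rng.uniform(50, 100, size=(M, J))   # measured accuracy per (noise, method)

mean, W = fit_pca(noise_features, i)
reduced = (noise_features - mean) @ W            # (M, i) transformed feature vectors
B = fit_regression(reduced, accuracies)          # (J, i + 1) regression coefficients
```

Because the reduced features are mean-centered in this sketch, the fitted intercept of each method equals that method's average accuracy over the learning data.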

The transformation matrix calculated in this way and the sets of multiple regression coefficients obtained for each speech recognition method are the parameters calculated by the parameter calculation device 72, and they are used for accuracy prediction in the mobile phone 70. In this embodiment, each set of multiple regression coefficients consists of the average of the prediction accuracy and a weight for each element of the transformed feature vector. In this specification, the set of multiple regression coefficients for one speech recognition method is called a multiple regression coefficient vector.

Multiple regression analysis is a statistical technique that, by selecting several appropriate variables, produces a prediction formula that is easy to compute and has little error. Regression analysis fits an equation between a dependent variable (objective variable) and explanatory variables (independent variables) measured on a continuous scale, in order to analyze quantitatively how well the dependent variable can be explained by the explanatory variables; multiple regression analysis is the case with two or more explanatory variables. In this embodiment, the dependent variable is the accuracy of speech recognition, and the explanatory variables are the components of the transformed feature vector. The accuracy can be calculated for each speech recognition method by performing speech recognition in advance, so a multiple regression coefficient vector is also obtained for each speech recognition method.

In this embodiment, the server 76 is assumed to use J speech recognition methods. The multiple regression coefficient vectors obtained for the first to J-th speech recognition methods are denoted B_1 to B_J, respectively. M types of noise data are assumed to be prepared for learning, so superimposing them on the utterance data for learning yields M types of learning data. For each item of learning data, one recognition accuracy is calculated per speech recognition method. With M types of learning data and J speech recognition methods, M × J recognition accuracies are obtained during learning. These M × J recognition accuracies are used as the target values of the dependent variable in the multiple regression analysis.

FIG. 3 is a block diagram of the mobile phone 70. Referring to FIG. 3, the mobile phone 70 includes: a microphone 80 for converting input speech into a speech signal; a framing unit 81 for dividing the speech signal into frames with a predetermined frame length and a predetermined shift length; a parameter storage unit 82 for storing the parameters used in predicting speech recognition accuracy; a speech recognition method determination unit 84 for predicting, from the framed speech signal and the parameters stored in the parameter storage unit 82, the accuracy of each speech recognition method for the input speech signal, determining the speech recognition method estimated to give the highest accuracy, and determining from that result a display symbol representing the accuracy predicted for the input speech; a text display processing unit 98 for converting the text of the speech recognition result supplied from the server 76 into image information; and a display unit 86 for displaying, on a screen or the like, the text display generated by the text display processing unit 98, the symbol determined by the speech recognition method determination unit 84 to indicate the predicted speech recognition accuracy, and other information.

The mobile phone 70 further includes a transmission data creation unit 88 for creating the data to be transmitted to the server 76 from the speech signal supplied from the framing unit 81 and the accuracy prediction result of the speech recognition method determination unit 84, and a communication unit 90 for transmitting and receiving data to and from the server 76. The following description assumes that the speech input is some command for the mobile phone 70.

The mobile phone 70 further includes a command processing unit 92 for causing the mobile phone 70 to perform a predetermined operation according to the data received by the communication unit 90, a voice processing unit 94 for processing ordinary voice call signals received by the communication unit 90, and a speaker 96 for converting the output of the voice processing unit 94 into sound.

FIG. 4 is a block diagram of the speech recognition method determination unit 84. Referring to FIG. 4, the speech recognition method determination unit 84 includes: an accuracy prediction unit 102 for predicting, using the parameters stored in the parameter storage unit 82, the accuracy that each of the speech recognition methods of the server 76 would achieve on the speech signal framed by the framing unit 81; a display symbol table storage unit 104 for storing a table that relates accuracy to the symbols used when the predicted accuracy is displayed on the mobile phone screen or the like; and a display symbol determination unit 106 for determining the display symbol by comparing the accuracy predicted by the accuracy prediction unit 102 with the table stored in the display symbol table storage unit 104.

FIG. 5 shows an example of the display symbol table. Referring to FIG. 5, the left column of the table shows the prediction accuracy and the right column shows the display symbol. The table includes entries 120 to 126.

In FIG. 5, for example, if the predicted speech recognition accuracy is 95% or more, the conditions are judged to be very well suited to speech recognition, and ◎ is displayed on the mobile phone screen (entry 120). If the accuracy is 90% or more and less than 95%, the conditions are judged to be suited to speech recognition, and ○ is displayed (entry 122). If the accuracy is 75% or more and less than 90%, the conditions are judged to be not well suited to speech recognition, and △ is displayed (entry 124). If the accuracy is less than 75%, the conditions are judged to be unsuited to speech recognition, and × is displayed (entry 126).
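The lookup performed against the table of FIG. 5 might be sketched as follows. The thresholds and symbols are those of the example above; the names `SYMBOL_TABLE` and `display_symbol` are hypothetical and do not appear in the original.

```python
# Lower bounds (in percent) and symbols from the FIG. 5 example, highest first.
SYMBOL_TABLE = [
    (95.0, "◎"),   # entry 120: very well suited
    (90.0, "○"),   # entry 122: suited
    (75.0, "△"),   # entry 124: not well suited
]

def display_symbol(predicted_accuracy):
    """Return the symbol for a predicted accuracy given in percent."""
    for lower_bound, symbol in SYMBOL_TABLE:
        if predicted_accuracy >= lower_bound:
            return symbol
    return "×"      # entry 126: below 75%, unsuited
```

Keeping the table as data rather than as nested conditions mirrors the description, in which the table is stored separately in the display symbol table storage unit 104.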

By looking at these symbols on the mobile phone screen, the user can easily see the speech recognition accuracy predicted for the present moment.

FIG. 6 is a block diagram of the accuracy prediction unit 102. Referring to FIG. 6, the accuracy prediction unit 102 includes a feature calculation unit 180 for calculating speech features X_1 to X_N for accuracy prediction from the speech signal framed by the framing unit 81 and outputting a feature vector having these N features as its components, and a transformation processing unit 182 for transforming the feature vector (X_1 to X_N) into a feature vector (X'_1 to X'_i) using the transformation matrix 186 stored in the parameter storage unit 82.

The number of dimensions i of the transformed feature vector is smaller than the number of dimensions N before transformation. However, this feature vector has been transformed by the transformation matrix obtained through the principal component analysis described above. It therefore retains sufficient information of particular relevance to the speech conversion.

The accuracy prediction unit 102 further includes a regression-based prediction accuracy calculation unit 184 for calculating a predicted speech recognition accuracy for each speech recognition method, using the transformed feature vector (X'_1 to X'_i) and the multiple regression coefficient vectors 188 that were obtained by multiple regression analysis and stored in the parameter storage unit 82, and a maximum value selection unit 185 for determining the largest of the calculated prediction accuracies and selecting the speech recognition method that gives it as the method to be used by the server 76.

FIG. 7 is a block diagram of the regression-based prediction accuracy calculation unit 184. Referring to FIG. 7, the unit 184 includes a multiple regression coefficient extraction unit 190 for extracting, from the parameters stored in the parameter storage unit 82, the multiple regression coefficient vectors for predicting the accuracy of the first to J-th speech recognition methods, and first to J-th accuracy calculation units 192 to 198 for calculating the prediction accuracies Z_1 to Z_J of the first to J-th speech recognition methods using the multiple regression coefficient vectors (B_1 to B_J) extracted by the multiple regression coefficient extraction unit 190 and the feature vector (X'_1 to X'_i) transformed by the transformation processing unit 182.

FIG. 8 is a block diagram of the parameter calculation device 72. Referring to FIG. 8, the parameter calculation device 72 (see FIG. 2) includes first to M-th noise storage units 130 to 134 for storing first to M-th noises, an utterance data storage unit 136 for storing utterance data prepared in advance for learning, and a correct data storage unit 142 for storing text transcriptions of the utterance data.

The parameter calculation device 72 further includes a learning-data accuracy calculation unit 138 for calculating the accuracy of speech recognition by each of the plurality of speech recognition methods, using the data stored in the first to M-th noise storage units 130 to 134, the utterance data storage unit 136, and the correct data storage unit 142, and a parameter calculation unit 140 for calculating, by principal component analysis and multiple regression analysis, the parameters used by the accuracy prediction unit 102 (see FIG. 6), based on the accuracies calculated by the accuracy calculation unit 138 and the noises stored in the first to M-th noise storage units 130 to 134.

FIG. 9 is a block diagram of the learning-data accuracy calculation unit 138. Referring to FIG. 9, the unit 138 includes a sound synthesis unit 150 for creating learning data by superimposing the noises stored in the first to M-th noise storage units 130 to 134 on the utterance data stored in the utterance data storage unit 136, and a first learning data storage unit 152 for storing the data obtained by superimposing the first noise on the utterance data. The unit 138 further includes second to M-th learning data storage units 154 to 156 for storing the second to M-th learning data obtained in the same manner.

The learning-data accuracy calculation unit 138 further includes first to J-th speech recognition units 158 to 162 for recognizing the data stored in the first to M-th learning data storage units 152 to 156 by the first to J-th methods, which differ from one another.

The learning-data accuracy calculation unit 138 further includes an accuracy calculation unit 166 that receives the J × M speech recognition results produced by the first to J-th speech recognition units 158 to 162 for the M items of learning data, determines whether each of the J × M results matches the correct data stored in the correct data storage unit 142, and calculates accuracy vectors (Y_1 to Y_J) of the speech recognition results for the respective methods.

FIG. 10 is a table showing the relationship between the first to J-th speech recognition methods, applied by the first to J-th speech recognition units 158 to 162 to the first to M-th learning data stored in the first to M-th learning data storage units 152 to 156, and the resulting accuracies.

Referring to FIG. 10, the leftmost column shows the type of learning data and the top row shows the first to J-th speech recognition methods. In FIG. 10, for example, the accuracy obtained when the first learning data is recognized by the first speech recognition method is written as y_11 %, and the accuracy obtained when it is recognized by the second speech recognition method is written as y_12 %.

In this way, all of the first to M-th learning data are recognized by the first to J-th speech recognition methods.

Of the accuracy vectors (Y_1 to Y_J) calculated by the accuracy calculation unit 166, the accuracy vector Y_1 is expressed as (y_11, y_21, y_31, ..., y_M1). The accuracy vectors from Y_2 onward are expressed in the same way, and the accuracy vector Y_J is expressed as (y_1J, y_2J, y_3J, ..., y_MJ).

FIG. 11 is a block diagram of the parameter calculation unit 140. Referring to FIG. 11, the parameter calculation unit 140 includes a feature extraction unit 170 for extracting feature vectors (X_1 to X_N) from the noises supplied from the first to M-th noise storage units 130 to 134, and a principal component analysis unit 172 for performing principal component analysis on the extracted feature vectors (X_1 to X_N), obtaining a transformation matrix that transforms them into feature vectors (X'_1 to X'_i) with fewer dimensions, and supplying the matrix to the parameter storage medium 74.

The parameter calculation unit 140 further includes a transformation processing unit 174 for transforming the feature vectors (X_1 to X_N) into the reduced-dimension feature vectors (X'_1 to X'_i) using the transformation matrix obtained by the principal component analysis unit 172, and a multiple regression analysis unit 176 for calculating a multiple regression coefficient vector for each speech recognition method using the feature vectors (X'_1 to X'_i) and the plurality of accuracy vectors obtained by the accuracy calculation unit 166 (see FIG. 9).

FIG. 12 is a block diagram of the server 76. Referring to FIG. 12, the server 76 includes: a communication unit 210 for exchanging data with the mobile phone 70; a separation unit 212 for separating the data transmitted from the mobile phone 70 into the speech and the data identifying the speech recognition method; first to J-th speech recognition units 216 to 222 that each receive the data separated by the separation unit 212, extract features from the supplied speech, and recognize the speech by the first to J-th methods, respectively; and a switch 214 for switching the connection to one of the first to J-th speech recognition units 216 to 222 based on the data identifying the speech recognition method.

The server 76 further includes a switch 224 for switching the connection, based on the data identifying the speech recognition method, so as to select the output of the speech recognition unit that performs recognition by the identified method, and a transmission data generation unit 226 for creating the data used to transmit the recognition result to the mobile phone 70 via the communication unit 210.

Although not shown, the data identifying the speech recognition method separated by the separation unit 212 causes only the speech recognition unit of the identified method, among the first to J-th speech recognition units 216 to 222, to operate; the other speech recognition units do not operate.
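The routing role of the separation unit 212 and the switches 214 and 224 can be illustrated with a small dispatch function. This is only a sketch under assumed data shapes: the payload keys `"method"` and `"audio"`, the function name `recognize_on_server`, and the representation of each recognizer as a callable are hypothetical.

```python
def recognize_on_server(payload, recognizers):
    """Route the audio to the single recognizer named in the payload.

    payload: dict with assumed keys "method" (1..J) and "audio".
    recognizers: dict mapping a method number to a callable audio -> text.
    """
    method = payload["method"]
    if method not in recognizers:
        raise ValueError(f"unknown recognition method: {method}")
    # Only the selected recognizer is invoked; the others stay idle.
    return recognizers[method](payload["audio"])
```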

[Operation]
The apparatus according to the present embodiment operates as follows. The operation is described with reference to FIGS. 2 to 12 as appropriate. First, the parameters used to obtain the speech recognition accuracy must be calculated in advance by the parameter calculation device 72.

In the parameter calculation device 72, the sound synthesis unit 150 (see FIG. 9) first superimposes each of the various noises stored in the first to M-th noise storage units 130 to 134 (see FIG. 8) on the utterance data stored in the utterance data storage unit 136.

The sound data (learning data) produced by the sound synthesis unit 150 are stored in the first to M-th learning data storage units 152 to 156, respectively. The M types of learning data stored in the first to M-th learning data storage units 152 to 156 are supplied to the first speech recognition unit 158, where they are recognized using the first method. The same M types of superimposed learning data are also supplied to the second speech recognition unit 160 and recognized using the second method. In this way, the M types of superimposed learning data are supplied to all J speech recognition units up to the J-th speech recognition unit 162, and each unit performs speech recognition on them. The results of this speech recognition are output as text.

The data output as text are supplied to the accuracy calculation unit 166. Using the correct data stored in advance in the correct data storage unit 142, the accuracy calculation unit 166 calculates the speech recognition accuracy of the M × J recognition results produced by the first to J-th methods, for each method and for each item of learning data. The accuracies calculated by the accuracy calculation unit 166 are supplied to the parameter calculation unit 140 as accuracy vectors (Y_1 to Y_J).

The parameter calculation unit 140 operates as follows. Referring to FIG. 11, the noise data stored in the first to M-th noise storage units 130 to 134 are first supplied to the feature extraction unit 170, which extracts acoustic feature vectors (X_1 to X_N) from the noise data. Next, the principal component analysis unit 172 performs principal component analysis on the feature vectors (X_1 to X_N) of the noise data to obtain a matrix for feature transformation. The matrix is supplied to the transformation processing unit 174 and to the parameter storage medium 74.

The transformation processing unit 174 transforms the feature vectors (X_1 to X_N) using the matrix obtained by the principal component analysis, yielding the feature vectors (X'_1 to X'_i). The transformed feature vectors (X'_1 to X'_i) are supplied to the multiple regression analysis unit 176.

The multiple regression analysis unit 176 calculates a multiple regression coefficient vector (B_1 to B_J) for each speech recognition method, using the transformed feature vectors (X'_1 to X'_i) and the plurality of accuracy vectors (Y_1 to Y_J) calculated by the accuracy calculation unit 138. The calculated multiple regression coefficient vectors (B_1 to B_J) are stored in the parameter storage medium 74. Here, the multiple regression coefficient vector obtained for the k-th speech recognition method is written B_k = (b_k0, b_k1, ..., b_ki), where 1 ≤ k ≤ J. Of these elements, b_k0 is the average value of the accuracy, and b_k1, ..., b_ki are the weights for the first to i-th acoustic features, respectively.

The parameters calculated by the parameter calculation device 72 (see FIG. 2) and stored in the parameter storage medium 74 are copied to the parameter storage unit 82 (see FIG. 3) of the mobile phone 70, and the mobile phone 70 operates using these parameters. Referring to FIG. 3, speech is first input to the mobile phone 70. The microphone 80 converts the input speech into a speech signal. The framing unit 81 divides this speech signal into frames with a predetermined frame length and a predetermined shift length.
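Framing with a fixed frame length and shift length, as performed by the framing unit 81, can be sketched as below. This assumes NumPy and a one-dimensional array of samples; the function name `frame_signal`, the choice to drop trailing samples that do not fill a whole frame, and the absence of windowing are assumptions, since the original does not specify them.

```python
import numpy as np

def frame_signal(signal, frame_length, shift_length):
    """Split a 1-D signal into overlapping frames of frame_length samples.

    Consecutive frames start shift_length samples apart. Trailing samples
    that do not fill a complete frame are dropped (an assumption).
    """
    signal = np.asarray(signal)
    if len(signal) < frame_length:
        return np.empty((0, frame_length), dtype=signal.dtype)
    n_frames = 1 + (len(signal) - frame_length) // shift_length
    starts = np.arange(n_frames) * shift_length
    return np.stack([signal[s:s + frame_length] for s in starts])
```

For example, a 10-sample signal with a frame length of 4 and a shift length of 2 yields four frames starting at samples 0, 2, 4, and 6.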

次に、特徴量算出部１８０（図６参照）でフレーム化された音声信号から特徴量ベクトル（Ｘ₁〜Ｘ_N）が算出される。算出された特徴量ベクトル（Ｘ₁〜Ｘ_N）はパラメータ格納部８２に格納されている、主成分分析によって得られた変換行列を用いて特徴量ベクトル（Ｘ₁’〜Ｘ_i’）へ変換される。変換後の特徴量ベクトル（Ｘ₁’〜Ｘ_i’）は第１〜第Ｊの精度算出部１９２〜１９８（図７参照）へ与えられる。 Next, feature amount vectors (X _{1 to} X _N ) are calculated from the speech signal framed by the feature amount calculation unit 180 (see FIG. 6). The calculated feature vector (X _{1 to} X _N ) is converted into a feature vector (X ₁ ′ to X _i ′) using a conversion matrix obtained by principal component analysis stored in the parameter storage unit 82. Is done. The converted feature vector (X ₁ ′ to X _i ′) is given to the first to Jth accuracy calculation units 192 to 198 (see FIG. 7).

重回帰係数抽出部１９０（図７参照）により、パラメータ格納部８２から重回帰係数ベクトルＢ₁〜Ｂ_Jが抽出される。抽出された重回帰係数ベクトルＢ₁は第１の精度算出部１９２に与えられる。重回帰係数ベクトルＢ₂は第２の精度算出部１９４に与えられる。以下、同様の動作が各々の重回帰係数ベクトルについて行なわれ、重回帰係数ベクトルＢ_Jが第Ｊの精度算出部１９８に与えられる。 Multiple regression coefficient vectors B _{1 to} B _J are extracted from the parameter storage unit 82 by the multiple regression coefficient extraction unit 190 (see FIG. 7). The extracted multiple regression coefficient vector B ₁ is given to the first accuracy calculator 192. The multiple regression coefficient vector B ₂ is given to the second accuracy calculation unit 194. Thereafter, the same operation is performed for each of the multiple regression coefficient vectors, and the multiple regression coefficient vector B _J is given to the Jth accuracy calculation unit 198.

The first accuracy calculation unit 192 calculates the predicted speech recognition accuracy Z_1 from the given feature vector (X_1' to X_i') and the multiple regression coefficient vector B_1 = (b_10, b_11, ..., b_1i) by the following equation.

Similarly, the second accuracy calculation unit 194 calculates the predicted speech recognition accuracy Z_2 from the given feature vector (X_1' to X_i') and the multiple regression coefficient vector B_2 = (b_20, b_21, ..., b_2i) by the following equation.
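The equations referred to for Z_1 and Z_2 are not reproduced in this text. Given coefficient vectors of the form B_j = (b_j0, b_j1, ..., b_ji) and i converted features, an ordinary multiple regression with b_j0 as the constant term would take the form below. This is a reconstruction under that assumption, not the equation as originally published.

```latex
Z_j = b_{j0} + \sum_{k=1}^{i} b_{jk}\, X'_k , \qquad j = 1, \dots, J
```

Setting j = 1 and j = 2 gives the expressions for Z_1 and Z_2 respectively.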

In the same way, every accuracy calculation unit calculates a predicted speech recognition accuracy from the feature vector and its multiple regression coefficient vector. The resulting J predicted speech recognition accuracies are given to the maximum value selection unit 185.

The maximum value selection unit 185 (see FIG. 6) selects the largest of the J predicted speech recognition accuracies given to it. The selected maximum predicted accuracy is given to the display symbol determination unit 106. In addition, both the information identifying the speech recognition method that yielded the maximum predicted accuracy and the speech signal converted from the speech by the microphone 80 are given to the transmission data creation unit 88. The transmission data creation unit 88 (see FIG. 3) converts the given data into transmission data, which is transmitted to the server 76 via the communication unit 90.
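The prediction and selection steps can be sketched together as follows. The sketch assumes that the first component of each coefficient vector is a constant term and the remaining components weight the converted features one by one; the function names and all numeric values are hypothetical and chosen only for illustration.

```python
def predict_accuracy(features, coefficients):
    """Return b0 + b1*x1 + ... + bi*xi for one recognition method.

    Treating coefficients[0] as the constant term is an assumption.
    """
    if len(coefficients) != len(features) + 1:
        raise ValueError("expected one constant plus one weight per feature")
    intercept, weights = coefficients[0], coefficients[1:]
    return intercept + sum(b * x for b, x in zip(weights, features))


def select_best_method(features, coefficient_vectors):
    """Return (index of the best method, its predicted accuracy)."""
    predictions = [predict_accuracy(features, b) for b in coefficient_vectors]
    best = max(range(len(predictions)), key=predictions.__getitem__)
    return best, predictions[best]


# Illustrative values: i = 2 converted features, J = 3 methods.
features = [1.0, 2.0]
coefficient_vectors = [
    [50.0, 10.0, 5.0],   # method 0 -> 70.0
    [60.0, 20.0, 0.0],   # method 1 -> 80.0
    [90.0, -5.0, -5.0],  # method 2 -> 75.0
]
best_index, best_accuracy = select_best_method(features, coefficient_vectors)
```

Only J dot products are needed per utterance, which is why the selection can run on the terminal without executing any recognizer.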

Referring to FIG. 12, the data transmitted to the server 76 is given to the separation unit 212 via the communication unit 210. The separation unit 212 separates the given data into the data identifying the speech recognition method and the data relating to the speech. Of the separated data, the data identifying the speech recognition method is given to the switch 214 and the switch 224.

When the switch 214 is given the data identifying a speech recognition method, it switches its connection so that its input is passed to the speech recognition unit that uses that method.

Because the data identifying the speech recognition method is also given to the switch 224, the switch 224 passes the output of the speech recognition unit that uses the identified method to the transmission data generation unit 226. Speech recognition is thus performed by the speech recognition unit that uses the method predicted by the mobile phone 70 (FIG. 3) to give the maximum accuracy, and the result is transmitted to the mobile phone 70. The transmitted speech recognition result is given to the text display processing unit 98 via the communication unit 90. The text of the speech recognition result is converted into image information and given to the display unit 86.
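The server-side routing can be pictured as a lookup from the method identifier to a recognizer. The sketch below is purely illustrative: the message layout, the identifiers, and the placeholder recognizer functions are hypothetical stand-ins for the separation unit 212, the switches 214 and 224, and the speech recognition units 216 to 222.

```python
def recognize_with_method_0(audio):
    # Placeholder for a real recognizer; returns a tag plus the sample count.
    return "method-0:%d" % len(audio)


def recognize_with_method_1(audio):
    # Placeholder for a second recognizer using a different method.
    return "method-1:%d" % len(audio)


# Hypothetical registry playing the role of the two switches.
RECOGNIZERS = {
    0: recognize_with_method_0,
    1: recognize_with_method_1,
}


def handle_request(message):
    """Separate the method identifier from the audio and run one recognizer."""
    method_id = message["method_id"]   # data identifying the method
    audio = message["audio"]           # data relating to the speech
    if method_id not in RECOGNIZERS:
        raise ValueError("unknown speech recognition method: %r" % method_id)
    return {"text": RECOGNIZERS[method_id](audio)}


reply = handle_request({"method_id": 1, "audio": [0, 0, 0]})
```

Only the selected recognizer runs for a given request; the others are never invoked.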

On receiving this transmission data, the communication unit 90 passes it to the command processing unit 92. The command processing unit 92 interprets the given text data as a command and performs the corresponding processing.

The operations of the speech processing unit 94 and the speaker 96 are not directly related to the operation of the speech recognition apparatus according to this embodiment, so their description is omitted.

With this speech recognition apparatus, the speech recognition method estimated to give the best recognition accuracy can be identified using the input speech and parameters obtained from learning data prepared in advance. The parameters for this estimation are calculated so that the speech recognition accuracy for learning data, in which noise is superimposed on utterance data, is predicted from the acoustic feature vector obtained from the noise. Noise data alone is therefore better suited to this prediction than utterance data. Consequently, performing speech recognition with the method identified in this way allows accurate recognition even without a test utterance. For example, the initial period of speech input is likely to contain only background noise and no utterance, and the optimum speech recognition method can be estimated from that noise portion.

[Realization by computer]
The server 76 of this embodiment is in practice realized by computer hardware, a program executed by that computer hardware, and data stored in the computer hardware. FIG. 13 shows the external appearance of this computer system 330, and FIG. 14 shows the internal configuration of the computer system 330.

Referring to FIG. 13, the computer system 330 includes a computer 340 having an FD (flexible disk) drive 352 and a CD-ROM (compact disc read-only memory) drive 350, a keyboard 346, a mouse 348, and a monitor 342.

Referring to FIG. 14, in addition to the FD drive 352 and the CD-ROM drive 350, the computer 340 includes a CPU (central processing unit) 356, a bus 366 connected to the CPU 356, the FD drive 352, and the CD-ROM drive 350, a read-only memory (ROM) 358 that stores a boot-up program and the like, and a random access memory (RAM) 360 that is connected to the bus 366 and stores program instructions, a system program, work data, and the like.

Although not shown here, the computer 340 may further include a network adapter board that provides a connection to a local area network (LAN).

The computer program that causes the computer system 330 to operate as the server 76 is stored on a CD-ROM 362 or FD 364 inserted into the CD-ROM drive 350 or the FD drive 352, and is then transferred to the hard disk 354. Alternatively, the program may be transmitted to the computer 340 over a network (not shown) and stored on the hard disk 354. The program is loaded into the RAM 360 when it is executed. The program may also be loaded into the RAM 360 directly from the CD-ROM 362, from the FD 364, or via the network.

This program includes a plurality of instructions that cause the computer 340 to operate as the server 76 of this embodiment. Some of the basic functions needed for this operation are provided by the operating system (OS) running on the computer 340, by third-party programs, or by modules of various toolkits installed on the computer 340. The program therefore need not include every function required to realize the system and method of this embodiment. It only has to include those instructions that carry out the operation of the server 76 described above by calling the appropriate functions or "tools" in a controlled manner so that the desired result is obtained. The operation of the computer system 330 is well known and is not repeated here.

Although it additionally requires components such as a microphone and a speaker, the hardware of the mobile phone 70 is also substantially the same as that of the computer 340.

[Evaluation experiment]
FIG. 15 is a graph showing the relationship between the SN (signal-to-noise) ratio and the correlation between the speech recognition accuracy actually obtained with the speech recognition method SS (Spectrum Subtraction) and the accuracy predicted using the parameters obtained by the principal component analysis and multiple regression analysis described above. If the correlation is 1, the actually obtained speech recognition accuracy and the predicted accuracy agree completely. Here, the predicted accuracy is regarded as reliable if the correlation is 0.8 or more.
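The reliability check described here can be sketched as follows. The text speaks only of a "correlation"; taking it to be the Pearson correlation coefficient is an assumption. The accuracy values below are invented for illustration and are not the experimental data.

```python
import math


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


RELIABILITY_THRESHOLD = 0.8  # threshold stated in the text

# Invented accuracies in percent, one pair per test condition.
predicted = [60.0, 70.0, 80.0, 90.0]
actual = [62.0, 68.0, 83.0, 88.0]
r = pearson_correlation(predicted, actual)
is_reliable = r >= RELIABILITY_THRESHOLD
```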

Referring to FIG. 15, the horizontal axis shows the SN ratio and the vertical axis shows the correlation. Legend 260 indicates the ideal number of dimensions, and legend 262 indicates the predetermined number of dimensions. The ideal number of dimensions is the number of dimensions of the feature vector that gave the best prediction of speech recognition accuracy when the prediction was carried out with feature vectors of various dimensions.

This graph shows a tendency for the correlation between predicted and measured values to increase as the SN ratio decreases, for both the ideal number of dimensions 260 and the predetermined number of dimensions 262.

The correlation is lowest when the SN ratio is 20 dB. Even in this case, however, the correlation is 0.85 or more for both the ideal number of dimensions 260 and the predetermined number of dimensions 262. The values predicted by the above method can therefore be said to be reliable for speech recognition by the method SS.

FIG. 16 is a graph showing the relationship between the SN ratio and the correlation between the speech recognition accuracy actually obtained with the speech recognition method ADSR (Advanced Distributed Speech Recognition Front End) and the accuracy predicted using the parameters obtained by the principal component analysis and multiple regression analysis described above.

Referring to FIG. 16, the horizontal axis shows the SN ratio and the vertical axis shows the correlation. Legend 230 indicates the ideal number of dimensions, and legend 232 indicates the predetermined number of dimensions.

As in the graph of FIG. 15, this graph also shows a tendency for the correlation to increase as the SN ratio decreases, for both the ideal number of dimensions 230 and the predetermined number of dimensions 232. In addition, the correlation is 0.8 or more when the SN ratio is 10 dB or less, so the predicted accuracy can be said to be reliable to some extent.

FIG. 17 is a graph showing the relationship between the SN ratio and the residual standard deviation of the prediction error for the speech recognition method SS. Referring to FIG. 17, the horizontal axis shows the SN ratio and the vertical axis shows the residual standard deviation of the prediction error. Legend 240 indicates the ideal number of dimensions, and legend 242 indicates the predetermined number of dimensions.

According to this graph, the residual standard deviation of the prediction error is largest when the SN ratio is 5 dB, for both the ideal number of dimensions 240 and the predetermined number of dimensions 242. The residual standard deviation of the prediction error at that point is about 6.5% for either number of dimensions.

Since even the largest residual standard deviation of the prediction error is only about 6.5%, the predicted accuracy for SS can be said to be reliable to some extent.
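A residual standard deviation of this kind could be computed as sketched below. The text does not give the definition used in the experiments; the sketch assumes the common regression form, the square root of the residual sum of squares divided by n - p - 1, where p is the number of explanatory variables. The function name and sample values are hypothetical.

```python
import math


def residual_standard_deviation(actual, predicted, num_explanatory):
    """sqrt(SSE / (n - p - 1)) for a regression with p explanatory variables.

    The divisor n - p - 1 is an assumption; the source does not state
    which definition was used.
    """
    n = len(actual)
    if len(predicted) != n:
        raise ValueError("actual and predicted must have the same length")
    dof = n - num_explanatory - 1
    if dof <= 0:
        raise ValueError("not enough samples for this many variables")
    sse = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(sse / dof)


# Invented accuracies in percent; residuals are 2, -2, 3, -2, 0.
actual = [62.0, 68.0, 83.0, 88.0, 75.0]
predicted = [60.0, 70.0, 80.0, 90.0, 75.0]
rsd = residual_standard_deviation(actual, predicted, num_explanatory=1)
```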

FIG. 18 is a graph showing the relationship between the SN ratio and the residual standard deviation of the prediction error for the speech recognition method ADSR. Referring to FIG. 18, the horizontal axis shows the SN ratio and the vertical axis shows the residual standard deviation of the prediction error. Legend 250 indicates the ideal number of dimensions, and legend 252 indicates the predetermined number of dimensions.

According to this graph, the residual standard deviation of the prediction error is large when the SN ratio is 0 dB and -5 dB, where it is about 8%.

Since even the largest prediction error is only about 8%, the predicted accuracy for ADSR can also be said to be reliable to some extent.

FIG. 19 is a graph showing the correlation between the actual accuracy and the predicted accuracy of speech recognition by the method SS. The graph uses the data for an SN ratio of 5 dB, where the correlation is 0.921. Referring to FIG. 19, the horizontal axis shows the predicted accuracy and the vertical axis shows the actual accuracy. The closer a point lies to the straight line drawn on the graph, the closer the predicted accuracy is to the actual accuracy. The graph shows little scatter between the predicted and actual accuracies. The predicted accuracy is therefore considered highly reliable for the speech recognition method SS.

FIG. 20 is a graph showing the correlation between the actual accuracy and the predicted accuracy of speech recognition by the method ADSR. The graph uses the data for an SN ratio of 15 dB, where the correlation is 0.749. Referring to FIG. 20, the horizontal axis shows the predicted accuracy and the vertical axis shows the actual accuracy. The closer a point lies to the straight line drawn on the graph, the closer the predicted accuracy is to the actual accuracy. Although some scatter is visible in this graph, the scatter between the predicted and actual accuracies is relatively small. The predicted accuracy can therefore be said to be highly reliable for the speech recognition method ADSR as well.

As described above, the system according to this embodiment uses a plurality of noise data and utterance data to obtain parameters for predicting the accuracy of each of a plurality of speech recognition methods in the presence of various kinds of noise. At the time of actual speech recognition, the accuracy of each method is predicted from the input speech signal using these parameters, and the method with the highest predicted accuracy is selected as the speech recognition method.
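The regression half of the parameter calculation can be sketched as an ordinary least-squares fit, one coefficient vector per recognition method. The sketch assumes the noise features have already been reduced by principal component analysis, solves the normal equations by Gaussian elimination, and uses invented data; the function names are hypothetical, and nothing here is taken from the parameter calculation device 72 itself.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs stay untouched.
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system; features may be collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(col + 1, n):
            factor = m[row][col] / m[col][col]
            for k in range(col, n + 1):
                m[row][k] -= factor * m[col][k]
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(m[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (m[row][n] - tail) / m[row][row]
    return x


def fit_regression(feature_rows, accuracies):
    """Least-squares fit of accuracy = b0 + b1*x1 + ... + bi*xi."""
    design = [[1.0] + list(row) for row in feature_rows]
    p = len(design[0])
    # Normal equations: (D^T D) b = D^T y.
    gram = [[sum(r[i] * r[j] for r in design) for j in range(p)]
            for i in range(p)]
    rhs = [sum(r[i] * y for r, y in zip(design, accuracies))
           for i in range(p)]
    return solve_linear_system(gram, rhs)


def fit_all_methods(feature_rows, accuracy_table):
    """accuracy_table[j][m] is the accuracy of method j on learning data m."""
    return [fit_regression(feature_rows, column) for column in accuracy_table]


# Invented data: four noise conditions, two reduced features, two methods.
feature_rows = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
accuracy_table = [
    [50.0, 60.0, 70.0, 80.0],  # generated from 50 + 10*x1 + 20*x2
    [90.0, 85.0, 60.0, 55.0],  # generated from 90 - 5*x1 - 30*x2
]
coefficient_vectors = fit_all_methods(feature_rows, accuracy_table)
```

The fitted vectors play the role of B_1 to B_J: they are computed once offline and only copied to the terminal.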

Because speech recognition by the other speech recognition methods does not actually have to be performed, the load on the server is small. In addition, the predicted speech recognition accuracy is displayed as a level in an easily understood form, so the user can readily tell whether speech recognition is usable. Unreliable speech recognition is thus avoided, and speech recognition can be carried out easily in an environment suited to it.
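The level display could work along the lines sketched below, mapping the maximum predicted accuracy to a symbol through a threshold table. The thresholds, the symbols, and the names are all hypothetical; the actual display symbol table of FIG. 5 is not reproduced in this text.

```python
# Hypothetical table: (minimum predicted accuracy in percent, symbol),
# ordered from the highest threshold to the lowest.
DISPLAY_SYMBOL_TABLE = [
    (90.0, "|||"),
    (70.0, "||"),
    (50.0, "|"),
]
UNUSABLE_SYMBOL = "x"  # shown when recognition is predicted to be unreliable


def display_symbol(predicted_accuracy):
    """Return the symbol for the first threshold the accuracy reaches."""
    for minimum, symbol in DISPLAY_SYMBOL_TABLE:
        if predicted_accuracy >= minimum:
            return symbol
    return UNUSABLE_SYMBOL


symbol = display_symbol(80.0)
```

Such a display resembles a signal-strength indicator: the user sees at a glance whether the current surroundings suit speech recognition.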

In this embodiment, a system made up of three parts, namely the mobile phone 70, the parameter calculation device 72, and the server 76, operates to solve the problem addressed by the present invention. However, the same can also be realized by a single system.

In this embodiment the accuracy prediction and related processing are performed on a mobile phone, but this is not essential. For example, an embodiment in which another kind of small terminal performs the accuracy prediction and related processing is also possible.

The embodiment disclosed here is merely an example, and the present invention is not limited to the embodiment described above. The scope of the present invention is indicated by each of the claims, read in light of the detailed description of the invention, and includes all modifications within the meaning and scope equivalent to the wording of the claims.

FIG. 1 is a block diagram showing the configuration of the server 30 of a prior-art speech recognition apparatus that uses a prior test utterance.
FIG. 2 is a block diagram showing the configuration of the speech recognition system 60 according to this embodiment.
FIG. 3 is a block diagram showing the configuration of the mobile phone 70.
FIG. 4 is a block diagram showing the configuration of the speech recognition method determination unit 84.
FIG. 5 is a table showing an example of the display symbol table.
FIG. 6 is a block diagram showing the configuration of the accuracy prediction unit 102.
FIG. 7 is a block diagram showing the configuration of the prediction accuracy calculation unit 184 using multiple regression coefficients.
FIG. 8 is a block diagram showing the configuration of the parameter calculation device 72.
FIG. 9 is a block diagram showing the configuration of the accuracy calculation unit 138 based on learning data.
FIG. 10 is a table showing the relationship between the first to J-th speech recognition methods applied to the first to M-th learning data and the accuracies calculated as a result.
FIG. 11 is a block diagram showing the configuration of the parameter calculation unit 140.
FIG. 12 is a block diagram showing the configuration of the server 76.
FIG. 13 is an external view of a computer system that realizes the speech recognition system 60 according to an embodiment of the present invention.
FIG. 14 is a block diagram of the computer shown in FIG. 13.
FIG. 15 is a graph showing the relationship between the SN ratio and the correlation between the speech recognition accuracy actually obtained with the speech recognition method SS and the accuracy predicted using the parameters obtained by the principal component analysis and multiple regression analysis described above.
FIG. 16 is a graph showing the relationship between the SN ratio and the correlation between the speech recognition accuracy actually obtained with the speech recognition method ADSR and the accuracy predicted using the parameters obtained by the principal component analysis and multiple regression analysis described above.
FIG. 17 is a graph showing the relationship between the SN ratio and the residual standard deviation of the prediction error for the speech recognition method SS.
FIG. 18 is a graph showing the relationship between the SN ratio and the residual standard deviation of the prediction error for the speech recognition method ADSR.
FIG. 19 is a graph showing the correlation between the actual accuracy and the predicted accuracy of speech recognition by the method SS.
FIG. 20 is a graph showing the correlation between the actual accuracy and the predicted accuracy of speech recognition by the method ADSR.

Explanation of symbols

70 Mobile phone
72 Parameter calculation device
74 Parameter storage medium
76 Server
84 Speech recognition method determination unit
102 Accuracy prediction unit
106 Display symbol determination unit
180 Feature amount calculation unit
182 Conversion processing unit
184 Prediction accuracy calculation unit using multiple regression coefficients
185 Maximum value selection unit
192, 194, 196, and 198 Accuracy calculation units
138 Accuracy calculation unit based on learning data
140 Parameter calculation unit
150 Sound synthesis unit
152 to 156 Learning data storage units
158 to 162 Speech recognition units
166 Accuracy calculation unit based on learning data
170 Feature amount extraction unit
172 Principal component analysis unit
174 Conversion processing unit
176 Multiple regression analysis unit
212 Separation unit
214 and 224 Switches
216 to 222 Speech recognition units

Claims

An optimum speech recognition method selection device for determining, from among a plurality of predetermined speech recognition methods, the method predicted to give the most accurate speech recognition for speech input in an environment where noise may exist, the device including:
means for calculating, from input speech, a feature vector whose components are a predetermined plurality of types of acoustic features;
conversion means for converting the feature vector into a feature vector with a smaller number of dimensions using a transformation matrix calculated in advance;
accuracy prediction means for predicting, for each speech recognition method, the accuracy of speech recognition for the speech by a predetermined calculation between the feature vector converted by the conversion means and a coefficient vector for accuracy prediction prepared in advance for each of the plurality of speech recognition methods; and
means for selecting, based on the predictions calculated by the accuracy prediction means, the speech recognition method capable of speech recognition with the highest accuracy.

A speech recognition apparatus for performing speech recognition using a plurality of speech recognition means, the apparatus including:
communication means for exchanging data with another device;
separation means for receiving, from the other device via the communication means, speech information and method specifying information that specifies one of the plurality of speech recognition means, and for separating the speech information from the method specifying information;
a plurality of speech recognition means, each for performing speech recognition by a specific speech recognition method;
method selection means for selectively operating one of the plurality of speech recognition means in accordance with the method specifying information so that it performs speech recognition of the speech information; and
means for returning, to the other device via the communication means, the output of the speech recognition means that was selected by the method selection means and performed speech recognition of the speech information.

A parameter calculation device for calculating parameters used for accuracy prediction when determining, from among a plurality of predetermined speech recognition methods, the method predicted to give the most accurate speech recognition for speech input in an environment where noise may exist, the device including:
means for preparing a plurality of learning data by superimposing a plurality of types of noise data on utterance data prepared in advance;
speech recognition accuracy calculating means for calculating, based on the results of speech recognition of the plurality of learning data by the plurality of speech recognition methods, a speech recognition accuracy for each combination of the plurality of learning data and the plurality of speech recognition methods;
acoustic feature calculating means for calculating, from each of the plurality of types of noise data, an acoustic feature vector consisting of predetermined acoustic features of a first number of dimensions; and
parameter calculation means for calculating, based on the acoustic feature vectors calculated by the acoustic feature calculating means and the speech recognition accuracies calculated by the speech recognition accuracy calculating means, parameters for predicting, from the predetermined acoustic feature vector extracted from an input speech signal, the speech recognition accuracy obtained when the speech signal is recognized by each of the plurality of speech recognition methods.

The parameter calculation device according to claim 3, wherein the parameter calculation means includes:
means for performing principal component analysis on the acoustic feature vectors obtained from each of the plurality of types of noise data and calculating a conversion function for converting an acoustic feature vector into an acoustic feature vector of a second number of dimensions smaller than the first number of dimensions; and
means for converting the acoustic feature vectors obtained from each of the plurality of types of noise into acoustic feature vectors of the second number of dimensions using the conversion function, and calculating a set of parameters for accuracy prediction for each of the plurality of speech recognition means by multiple regression analysis in which the speech recognition accuracy calculated by the speech recognition accuracy calculating means is the objective variable and each component of the acoustic feature vector of the second number of dimensions is an explanatory variable.

An information terminal device having a microphone and a communication function, the device including:
means for calculating, from speech input via the microphone, a feature vector whose components are a predetermined plurality of types of acoustic features;
conversion means for converting the feature vector into a feature vector with a smaller number of dimensions using a transformation matrix calculated in advance;
accuracy prediction means for predicting, for each speech recognition method, the accuracy of speech recognition for the speech by a predetermined calculation between the feature vector converted by the conversion means and a coefficient vector for accuracy prediction prepared in advance for each of a plurality of speech recognition methods;
means for selecting the speech recognition method for which the accuracy calculated by the accuracy prediction means is highest;
means for transmitting, to a predetermined destination via the communication function, the speech input via the microphone and information identifying the speech recognition method selected by the means for selecting; and
processing execution means for receiving, via the communication function, a speech recognition result returned from the predetermined destination in response to the speech and the information identifying the speech recognition method, and for executing predetermined processing on the speech recognition result.

A computer program that, when executed by a computer, causes the computer to operate as the device according to any one of claims 1 to 5.