
JP6466762B2 - Speech recognition apparatus, speech recognition method, and program

Info

Publication number: JP6466762B2
Application number: JP2015074838A
Authority: JP
Inventors: 祐太河内; 浩和政瀧; 太一浅見
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2015-04-01
Filing date: 2015-04-01
Publication date: 2019-02-06
Anticipated expiration: 2035-04-01
Also published as: JP2016194628A

Description

The present invention relates to a technique for automatically determining the parameters of a speech recognition model and performing speech recognition using those parameters.

A speech recognition model that expresses decoding by a GMM-HMM speech recognition device includes one language weight parameter (model parameter). This model parameter is an empirical (heuristic) parameter whose value must be set in practice (see, for example, Non-Patent Document 1). An ANN-HMM Hybrid speech recognition device is also known; by using a neural network instead of a Gaussian mixture distribution as the acoustic model, it can handle more input information than a GMM-HMM speech recognition device. A speech recognition model that expresses decoding by an ANN-HMM Hybrid speech recognition device includes two heuristic parameters, whose values must likewise be set (see, for example, Non-Patent Document 2).

As a conventional method for automatically setting the heuristic parameter of a GMM-HMM speech recognition device, there is the method described in Non-Patent Document 3.

Non-Patent Document 1: Kiyohiro Shikano, Katsunobu Ito, Tatsuya Kawahara, Mikio Yamamoto, “IT Text Speech Recognition System”, Ohmsha (2001): 104-105.
Non-Patent Document 2: Dahl, George E., et al., “Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition,” Audio, Speech, and Language Processing, IEEE Transactions on 20.1 (2012): 30-42.
Non-Patent Document 3: Mak, Brian, and Tom Ko, “Min-max discriminative training of decoding parameters using iterative linear programming,” INTERSPEECH, 2008.

However, in the method of Non-Patent Document 3, parameter values are set without considering the noise component contained in the input acoustic signal. Consequently, for a speech recognition model whose appropriate parameter values depend on the noise component, the speech recognition accuracy or speech recognition rate may be degraded by the noise component. Applying noise suppression to the input acoustic signal beforehand is conceivable, but noise suppression may introduce distortion that is undesirable from the viewpoint of speech recognition.

An object of the present invention is to automatically set appropriate model parameters according to the noise component contained in an input acoustic signal.

A model parameter of a speech recognition model, or a combination of model parameters of the speech recognition model, is selected according to the relationship between the magnitudes of the speech component and the noise component contained in an input acoustic signal, and the speech recognition model corresponding to the selected model parameter or combination of model parameters is applied to the input acoustic signal.

As a result, appropriate model parameters can be set automatically according to the noise component contained in the input acoustic signal.

FIG. 1 is a block diagram illustrating the functional configuration of the learning device of the embodiment.
FIG. 2 is a block diagram illustrating the functional configuration of the speech recognition device of the embodiment.
FIG. 3 is a flowchart illustrating the correspondence table generation process of the embodiment.
FIG. 4 is a flowchart illustrating the speech recognition process of the embodiment.
FIGS. 5A and 5C are diagrams illustrating correspondence tables of the embodiment, and FIG. 5B is a diagram illustrating the correspondence between the S/N ratios of the correspondence table and the S/N ratio of the input acoustic signal.

Embodiments of the present invention will be described below.
[Overview]
First, an outline of the embodiments will be described.
In each embodiment, the speech recognition device selects a model parameter of a speech recognition model, or a combination of model parameters of the speech recognition model, according to the relationship between the “speech component” and the “noise component” contained in the “input acoustic signal” to be recognized, and applies the speech recognition model corresponding to the selected model parameter or combination of model parameters to that input acoustic signal. Because the model parameters are selected according to the “relationship between the speech component and the noise component”, appropriate model parameters can be set automatically even for a speech recognition model whose appropriate parameter values depend on the “noise component”. The “relationship between the magnitudes of the speech component and the noise component” is, for example, a relative value between the magnitude of the “speech component” and the magnitude of the “noise component”, or a function value of that relative value. For example, it may be the ratio of the magnitude of the “speech component” to the magnitude of the “noise component” (S/N ratio), the ratio of the magnitude of the “noise component” to the magnitude of the “speech component”, the ratio of the magnitude of the “noise component” to the magnitude of the “acoustic signal” containing the “speech component” and the “noise component”, the ratio of the magnitude of the “speech component” to the magnitude of the “acoustic signal”, the value obtained by subtracting the magnitude of the “speech component” from the magnitude of the “acoustic signal”, the value obtained by subtracting the magnitude of the “noise component” from the magnitude of the “acoustic signal”, or a function value of any such ratio or value.
The “speech recognition model” may be, for example, a speech recognition model that expresses decoding by a GMM-HMM speech recognition device (see, for example, Non-Patent Document 1), a speech recognition model that expresses decoding by an ANN-HMM Hybrid speech recognition device (see, for example, Non-Patent Document 2), or another speech recognition model. The “model parameter” is, for example, a “heuristic parameter”. However, at least some of the “model parameters” may be parameters other than “heuristic parameters”.
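The candidate relations listed above can be written out explicitly. The following is a minimal sketch, assuming the three magnitudes (for example, RMS values) are already known; the function name and dictionary keys are hypothetical and do not appear in the patent.

```python
def magnitude_relations(speech: float, noise: float, mixed: float) -> dict[str, float]:
    """Return the example relations named in the text.

    speech, noise, mixed: magnitudes of the speech component, the noise
    component, and the acoustic signal containing both (all assumed > 0).
    """
    return {
        "speech_to_noise": speech / noise,    # S/N ratio
        "noise_to_speech": noise / speech,
        "noise_to_mixed": noise / mixed,
        "speech_to_mixed": speech / mixed,
        "mixed_minus_speech": mixed - speech,
        "mixed_minus_noise": mixed - noise,
    }


# Illustrative magnitudes only.
relations = magnitude_relations(speech=2.0, noise=0.5, mixed=2.5)
```

Any one of these values, or a function of it, could serve as the index used to look up model parameters, provided the same choice is made consistently.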

In each embodiment, a “first index” representing a “relationship between the magnitudes of the speech component and the noise component” is associated in advance with the model parameter or combination of model parameters of the speech recognition model that gives a maximum, or at least a predetermined, speech recognition accuracy or speech recognition rate when applied to an acoustic signal having the relationship represented by that “first index”. The speech recognition device selects the model parameter or combination of model parameters associated with the “first index” corresponding to the “relationship between the magnitudes of the speech component and the noise component” contained in the “input acoustic signal”, and applies a speech recognition model using the selected model parameter or combination of model parameters to that “input acoustic signal”. This achieves a high speech recognition accuracy or speech recognition rate in accordance with the “relationship between the magnitudes of the speech component and the noise component”. Examples of the “first index” are indices representing the examples of the “relationship between the magnitudes of the speech component and the noise component” given above, for example a value representing the S/N ratio.

More specifically, the speech recognition device selects the model parameter or combination of model parameters corresponding to the “first index” closest to a “second index” that represents the “relationship between the magnitudes of the speech component and the noise component” contained in the “input acoustic signal”. Thus, even when no “first index” matches the “second index”, a model parameter or combination of model parameters capable of achieving a high speech recognition accuracy or speech recognition rate can be selected.
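The nearest-index selection described above can be sketched as follows. This is an illustration only: the table layout, the names `select_nearest`, `Params`, and `Entry`, and all numeric values are assumptions rather than the patent's implementation.

```python
from typing import Sequence, Tuple

Params = Tuple[float, ...]      # one parameter, or a combination such as (beta, gamma)
Entry = Tuple[float, Params]    # (first index r_i, associated parameters)


def select_nearest(table: Sequence[Entry], u: float) -> Params:
    """Return the parameters whose first index is closest to the second index u."""
    _, params = min(table, key=lambda entry: abs(entry[0] - u))
    return params


# Hypothetical table: S/N ratio -> (beta, gamma). Values are illustrative only.
table = [(0.0, (10.0, 0.5)), (10.0, (12.0, 0.7)), (20.0, (14.0, 0.9))]
chosen = select_nearest(table, u=12.3)
```

Because the minimum is taken over the absolute distance, a value of u outside the range covered by the table simply maps to the first or last entry.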

Alternatively, the speech recognition device may select, from among the model parameters or combinations of model parameters corresponding to “first indices” that are equal to or in the vicinity of the “second index”, the one that maximizes the speech recognition accuracy or speech recognition rate. A “first index in the vicinity of the second index” means a “first index” whose distance from the “second index” is within a predetermined range. There may be only one such “first index”, two, or three or more. Examples of the “first index in the vicinity of the second index” include the “first index” that is larger than the “second index” and closest to it, the “first index” that is smaller than the “second index” and closest to it, both of these, and three or more “first indices” whose distance from the “second index” is within a predetermined value.
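A minimal sketch of this neighborhood-based variant follows. It assumes that each table entry also carries the accuracy (or rate) measured during learning, and that the neighborhood is a fixed radius around the second index; the `Entry` type, the function name, and the values are hypothetical.

```python
from typing import NamedTuple, Sequence, Tuple


class Entry(NamedTuple):
    r: float                  # first index (e.g. S/N ratio)
    h: Tuple[float, ...]      # model parameter(s)
    accuracy: float           # accuracy or rate measured during learning (assumed stored)


def select_best_in_neighborhood(table: Sequence[Entry], u: float, radius: float) -> Tuple[float, ...]:
    """Among entries with |r - u| <= radius, return the parameters with the highest accuracy."""
    candidates = [e for e in table if abs(e.r - u) <= radius]
    if not candidates:
        raise ValueError("no first index within the given radius of u")
    return max(candidates, key=lambda e: e.accuracy).h


# Illustrative values only.
table = [
    Entry(5.0, (10.0, 0.5), 0.70),
    Entry(10.0, (12.0, 0.7), 0.80),
    Entry(15.0, (14.0, 0.9), 0.85),
    Entry(30.0, (16.0, 1.1), 0.95),
]
chosen = select_best_in_neighborhood(table, u=11.0, radius=5.0)
```

Here the entries at 10.0 and 15.0 fall within the radius, and the one with the higher stored accuracy is returned even though it is farther from u.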

Alternatively, the speech recognition device may select, from among the model parameters or combinations of model parameters corresponding to “first indices” that are equal to or in the vicinity of the “second index”, the one that maximizes a “weighted value” of the speech recognition accuracy or speech recognition rate. The “weighted value” is the speech recognition accuracy or speech recognition rate multiplied by a weight that is larger the smaller the distance between the “second index” and the “first index”. An appropriate model parameter or combination of model parameters can thus be selected on the basis of both the distance between the “second index” and the “first index” and the speech recognition accuracy or speech recognition rate.
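The weighted variant can be sketched in the same way. The text only requires a weight that grows as the distance shrinks; the specific weight 1 / (1 + distance) used here is an assumption chosen for illustration, as are the `Entry` type, the function name, and the values.

```python
from typing import NamedTuple, Sequence, Tuple


class Entry(NamedTuple):
    r: float                  # first index (e.g. S/N ratio)
    h: Tuple[float, ...]      # model parameter(s)
    accuracy: float           # accuracy or rate measured during learning (assumed stored)


def select_weighted(table: Sequence[Entry], u: float, radius: float) -> Tuple[float, ...]:
    """Return the parameters maximizing accuracy * weight, with weight = 1 / (1 + |r - u|)."""
    candidates = [e for e in table if abs(e.r - u) <= radius]
    if not candidates:
        raise ValueError("no first index within the given radius of u")
    return max(candidates, key=lambda e: e.accuracy / (1.0 + abs(e.r - u))).h


# Illustrative values only.
table = [
    Entry(10.0, (12.0, 0.7), 0.80),
    Entry(15.0, (14.0, 0.9), 0.85),
]
chosen = select_weighted(table, u=11.0, radius=5.0)
```

With these values the nearer entry wins (0.80 / 2 against 0.85 / 5), whereas selection by accuracy alone would have picked the farther one.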

Instead of the discrete “first indices” and model parameters or combinations of model parameters, interpolated values of them may be used. That is, the speech recognition device may determine whether a “first index” or an interpolated value thereof matches the “second index”, and select either the model parameter or combination of model parameters associated with the “first index” that matches the “second index”, or the interpolated model parameter value or combination of interpolated model parameter values corresponding to the interpolated “first index” value that matches the “second index”. In other words, the speech recognition device may select the model parameter, the interpolated model parameter value, or the combination thereof that corresponds to the “first index”, or interpolated “first index” value, matching the “second index” that represents the “relationship between the magnitudes of the speech component and the noise component” contained in the input acoustic signal. The interpolation method is not limited; for example, known methods such as linear interpolation, polynomial interpolation, and spline interpolation can be used.
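As one concrete case, linear interpolation between the two table entries that bracket the second index could look like the sketch below. The function name and values are hypothetical, the first indices are assumed to be mutually distinct, and clamping to the end entries when u lies outside the table range is an assumption not stated in the text.

```python
from bisect import bisect_left
from typing import Sequence, Tuple

Params = Tuple[float, ...]


def interpolate_params(table: Sequence[Tuple[float, Params]], u: float) -> Params:
    """Linearly interpolate each parameter at the second index u."""
    entries = sorted(table, key=lambda e: e[0])
    rs = [r for r, _ in entries]
    if u <= rs[0]:                      # assumption: clamp below the table range
        return entries[0][1]
    if u >= rs[-1]:                     # assumption: clamp above the table range
        return entries[-1][1]
    j = bisect_left(rs, u)              # rs[j - 1] < u <= rs[j]
    (r0, h0), (r1, h1) = entries[j - 1], entries[j]
    w = (u - r0) / (r1 - r0)
    return tuple(a + w * (b - a) for a, b in zip(h0, h1))


# Hypothetical table: S/N ratio -> (beta, gamma). Values are illustrative only.
table = [(0.0, (10.0, 0.5)), (10.0, (12.0, 0.7)), (20.0, (14.0, 0.9))]
h = interpolate_params(table, u=5.0)
```

Polynomial or spline interpolation would replace only the final weighting step; the lookup of the bracketing entries stays the same.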

[First Embodiment]
Next, the first embodiment will be described.
<Configuration>
As illustrated in FIG. 1, the learning device 11 of this embodiment includes a speech signal storage unit 111a, a noise signal storage unit 111b, a correct word string storage unit 111c, a noisy speech signal storage unit 111e, a speech recognition result storage unit 111f, a component adjustment addition unit 113, an index generation unit 114, a speech recognition unit 116, a comparison unit 117, and a correspondence table generation unit 118. As illustrated in FIG. 2, the speech recognition device 12 of this embodiment includes a correspondence table storage unit 121a, an input acoustic signal storage unit 121b, a noise component storage unit 121c, an input unit 122, a speech/non-speech section discrimination unit 123, an index generation unit 124, a selection unit 125, and a speech recognition unit 126. Each device is configured, for example, by a general-purpose or dedicated computer, which includes a processor (hardware processor) such as a CPU (central processing unit) and memory such as RAM (random-access memory) and ROM (read-only memory), executing a predetermined program. The computer may include one processor and one memory, or a plurality of processors and memories. The program may be installed on the computer or recorded in advance in ROM or the like. Some or all of the processing units may be configured using electronic circuitry that realizes the processing functions without using a program, rather than electronic circuitry, such as a CPU, that realizes a functional configuration by reading a program. The electronic circuitry constituting one device may include a plurality of CPUs.

<Learning Process>
The processing of the learning device 11 will be described with reference to FIG. 3. The learning device 11 learns in advance how the “relationship between the magnitudes of the speech component and the noise component” (for example, the S/N ratio) and the model parameter or combination of model parameters of the speech recognition model (for example, the combination of heuristic parameters of an ANN-HMM Hybrid speech recognition device) affect the speech recognition accuracy or speech recognition rate. That is, the learning device 11 sets a plurality of “first indices” representing “relationships between the magnitudes of the speech component and the noise component”; for each, it obtains the model parameter or combination of model parameters of the speech recognition model that gives a maximum, or at least a predetermined, speech recognition accuracy or speech recognition rate when applied to an acoustic signal having that relationship; and it obtains and outputs a correspondence table in which these are associated with the corresponding “first indices”.

As a premise of the learning process, a time-series speech signal is stored in the speech signal storage unit 111a, and a noise signal is stored in the noise signal storage unit 111b. The speech signal may be obtained by recording speech (for example, uttered speech) in a quiet environment in advance, or may be generated by speech synthesis. The noise signal may likewise be recorded in advance, or may be generated by a noise generation algorithm (for example, white noise). The correct word string storage unit 111c stores the correct word string for the speech signal stored in the speech signal storage unit 111a.

As illustrated in FIG. 3, the component adjustment addition unit 113 reads the speech signal and the noise signal from the speech signal storage unit 111a and the noise signal storage unit 111b, respectively, and obtains a noisy speech signal X_i, a time-series signal formed by adding them so that the “relationship between the magnitudes of the speech component and the noise component” becomes α_i. Here, i = 0, ..., I-1, and I is an integer of 2 or more. That is, the component adjustment addition unit 113 adds the speech signal and the noise signal at a plurality of “relationships between the magnitudes of the speech component and the noise component” α_0, ..., α_{I-1} to obtain a plurality of noisy speech signals X_0, ..., X_{I-1}. For example, for each of i = 0, ..., I-1, the component adjustment addition unit 113 obtains and outputs a noisy speech signal X_i in which the speech signal and the noise signal are added so that the S/N ratio is α_i; that is, it adds the two signals at a plurality of S/N ratios α_0, ..., α_{I-1} to obtain the noisy speech signals X_0, ..., X_{I-1}. For example, α_0, ..., α_{I-1} are mutually different discrete values. The S/N ratio may be the S/N ratio at each time point, the average S/N ratio in each time interval, or the average S/N ratio over the entire time interval. A “relationship between the magnitudes of the speech component and the noise component” such as the S/N ratio may be determined from the effective (RMS) value of the speech signal and the effective value of the noise signal, or from average values, maximum values, or absolute values instead of effective values. For example, (effective value of the speech signal) / (effective value of the noise signal) may be used as the S/N ratio, or average values, maximum values, or absolute values may be used in place of the effective values. The method of generating the noisy speech signal X_i is not limited; for example, the component adjustment addition unit 113 obtains X_i by adding, to the speech signal read from the speech signal storage unit 111a, a noise component N_i obtained by multiplying the noise signal read from the noise signal storage unit 111b by a coefficient corresponding to α_i (where i = 0, ..., I-1). The component adjustment addition unit 113 stores the noisy speech signal X_i in the noisy speech signal storage unit 111e and sends the noisy speech signal X_i and the noise component N_i (where i = 0, ..., I-1) to the index generation unit 114 (step S113).
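The scaling-and-adding step can be sketched as follows. This is an illustration under stated assumptions: α_i is taken as a linear amplitude ratio of effective (RMS) values over the whole signal, the two signals have equal length, and the names `rms` and `mix_at_snr` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def rms(x: Sequence[float]) -> float:
    """Effective (root-mean-square) value of a signal."""
    return math.sqrt(sum(v * v for v in x) / len(x))


def mix_at_snr(speech: Sequence[float], noise: Sequence[float],
               alpha: float) -> Tuple[List[float], List[float]]:
    """Scale the noise so that rms(speech) / rms(scaled noise) == alpha, then add it.

    Returns (X_i, N_i): the noisy speech signal and the scaled noise component.
    """
    noise_rms = rms(noise)
    if alpha <= 0 or noise_rms == 0:
        raise ValueError("alpha and the noise RMS must be positive")
    coeff = rms(speech) / (alpha * noise_rms)       # coefficient corresponding to alpha_i
    noise_component = [coeff * n for n in noise]
    noisy = [s + n for s, n in zip(speech, noise_component)]
    return noisy, noise_component


# Toy signals, illustrative only.
speech = [1.0, -1.0, 1.0, -1.0]
noise = [0.5, 0.5, -0.5, -0.5]
noisy, noise_component = mix_at_snr(speech, noise, alpha=4.0)
```

Calling `mix_at_snr` once per α_i yields the set of noisy signals X_0, ..., X_{I-1} together with the corresponding noise components.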

The index generation unit 114 receives the noisy speech signal X_i and the noise component N_i (where i = 0, ..., I-1) and, from the effective (RMS) value of the noisy speech signal X_i and the effective value of the noise component N_i, newly obtains and outputs a “first index” r_i representing the “relationship between the magnitudes of the speech component and the noise component”. For example, the index generation unit 114 newly obtains and outputs an S/N ratio r_i from these two effective values. For example, r_i = (effective value of the noisy speech signal X_i - effective value of the noise component N_i) / (effective value of the noise component N_i) may be used, or average values, maximum values, or absolute values may be used in place of the effective values. Obtaining r_i from the noisy speech signal X_i and the noise component N_i yields a value close to the “relationship between the magnitudes of the speech component and the noise component” that would be obtained from the speech-section signal and the non-speech-section signal of the noisy speech signal X_i (step S114). The obtained r_i is stored in the noisy speech signal storage unit 111e in association with the noisy speech signal X_i (step S111e).
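The example formula for r_i translates directly into code. The sketch below assumes effective (RMS) values computed over the whole signal; the function names and toy signals are hypothetical.

```python
import math
from typing import Sequence


def rms(x: Sequence[float]) -> float:
    """Effective (root-mean-square) value of a signal."""
    return math.sqrt(sum(v * v for v in x) / len(x))


def first_index(noisy: Sequence[float], noise_component: Sequence[float]) -> float:
    """r_i = (rms(X_i) - rms(N_i)) / rms(N_i), following the example in the text."""
    n = rms(noise_component)
    if n == 0:
        raise ValueError("noise component has zero RMS")
    return (rms(noisy) - n) / n


# Toy signals, illustrative only.
r = first_index(noisy=[3.0, -3.0, 3.0, -3.0], noise_component=[1.0, -1.0, 1.0, -1.0])
```

Note that r_i is measured on the mixture rather than copied from the target α_i, so that it is comparable with the index later measured on real input.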

The model parameter setting unit 109 sets a plurality of model parameters or combinations of model parameters of a predetermined speech recognition model and outputs them to the speech recognition unit 116. In the following, a model parameter or a combination of model parameters is denoted h_m, where m = 0, ..., M-1 and M is an integer of 2 or more. When h_m is a model parameter, h_m is a scalar representing the parameter value; when h_m is a combination of model parameters, h_m is a vector whose elements are the parameter values.

A specific example of the speech recognition model is the following speech recognition model, which expresses decoding by an ANN-HMM Hybrid speech recognition device (see, for example, Non-Patent Document 2).

(The equation is not reproduced in this text.)

Here, t is the time, x_t is the acoustic feature vector at time t, s is the HMM state sequence, s_t is the HMM state at time t, w is the word string, P(s_t|x_t) is the acoustic model given by a neural network, P(s_t) is the HMM State Unigram model, which is a unigram model over the HMM states, P(s|w) is the dictionary, and P(w) is the language model. β and γ are heuristic parameters (model parameters). In this example, the model parameter setting unit 109 sets a plurality of combinations h_m of β and γ (vectors whose elements are β and γ). For example, it sets as h_m (where m = 0, ..., M-1) all combinations of β and γ that can be formed from the values of β and γ within predetermined ranges.
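Enumerating every combination of β and γ amounts to taking a Cartesian product of the two candidate lists. The sketch below assumes each range has already been discretized into a finite list; the candidate values and the name `parameter_grid` are hypothetical.

```python
from itertools import product
from typing import List, Sequence, Tuple


def parameter_grid(betas: Sequence[float], gammas: Sequence[float]) -> List[Tuple[float, float]]:
    """Return all combinations h_m = (beta, gamma), m = 0, ..., M-1."""
    return list(product(betas, gammas))


# Illustrative candidate values only.
grid = parameter_grid(betas=[8.0, 10.0, 12.0], gammas=[0.5, 1.0])
M = len(grid)
```

The size of the grid is the product of the two list lengths, and the learning process runs one recognition pass per grid point per noisy signal, so the discretization step directly controls the learning cost.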

As the speech recognition model, the following speech recognition model, which expresses decoding by a GMM-HMM speech recognition device, may also be used (see, for example, Non-Patent Document 1).

(The equation is not reproduced in this text.)

Here, P(x_t|s_t) is the acoustic model given by a Gaussian mixture distribution. In this example, the model parameter setting unit 109 sets a plurality of values h_m = β. For example, it sets all values of β within a predetermined range as h_m (where m = 0, ..., M-1) (step S114).
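The two decoding equations referred to above are not reproduced in this text. As a hedged reconstruction only, assuming the usual log-domain formulation with a maximum over state sequences, in which β weights the language model and γ scales the HMM state prior, they could take the following form. The placement of γ and the use of the maximum over s are assumptions; the text states only that β is a language weight and that β and γ are heuristic parameters.

```latex
% ANN-HMM Hybrid decoding (assumed form)
\hat{w} = \arg\max_{w} \max_{s}
  \left[ \sum_{t} \bigl( \log P(s_t \mid x_t) - \gamma \log P(s_t) \bigr)
         + \log P(s \mid w) + \beta \log P(w) \right]

% GMM-HMM decoding (assumed form)
\hat{w} = \arg\max_{w} \max_{s}
  \left[ \sum_{t} \log P(x_t \mid s_t)
         + \log P(s \mid w) + \beta \log P(w) \right]
```

Under this reading, the hybrid model has the two tunable values (β, γ) and the GMM-HMM model has the single value β, which matches the parameter counts given in the text.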

The speech recognition unit 116 reads r_i and X_i (where i = 0, ..., I-1) from the noisy speech signal storage unit 111e, performs speech recognition on X_i with the speech recognition model using h_m (where m = 0, ..., M-1) sent from the model parameter setting unit 109, and outputs the word string that is the speech recognition result. Speech recognition results are obtained for all combinations of (i, m), and each result is stored in the speech recognition result storage unit 111f in association with r_i and (i, m) (step S115).

The comparison unit 117 compares the correct word string read from the correct word string storage unit 111c with the speech recognition results read from the speech recognition result storage unit 111f, and obtains the speech recognition accuracy of the speech recognition result for each (i, m). Alternatively, the comparison unit 117 may obtain the speech recognition rate for each (i, m) instead of the speech recognition accuracy. The obtained speech recognition accuracy or speech recognition rate is sent to the correspondence table generation unit 118 together with the corresponding (i, m) and r_i (step S116).
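The text does not define how the accuracy is computed. One common convention, assumed here and not taken from the patent, is word accuracy = (N - S - D - I) / N, where N is the number of reference words and S + D + I is the word-level edit distance. A minimal sketch with a hypothetical function name:

```python
from typing import Sequence


def word_accuracy(reference: Sequence[str], hypothesis: Sequence[str]) -> float:
    """Word accuracy (N - S - D - I) / N using word-level Levenshtein distance.

    The reference must be non-empty. The value can be negative when the
    hypothesis contains many insertions.
    """
    n, m = len(reference), len(hypothesis)
    prev = list(range(m + 1))                 # distances for the empty reference prefix
    for i in range(1, n + 1):
        curr = [i] + [0] * m
        for j in range(1, m + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            curr[j] = min(prev[j] + 1,        # deletion
                          curr[j - 1] + 1,    # insertion
                          prev[j - 1] + cost) # match or substitution
        prev = curr
    return (n - prev[m]) / n


ref = "the cat sat on the mat".split()
hyp = "the cat sat on mat".split()
acc = word_accuracy(ref, hyp)
```

Evaluating this for every (i, m) pair gives the accuracy matrix from which the correspondence table is built.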

The correspondence table generation unit 118 selects, for each i, the m(i) ∈ {0, ..., M-1} that maximizes the speech recognition accuracy or speech recognition rate. Alternatively, the correspondence table generation unit 118 may select, for each i, one m(i) ∈ {0, ..., M-1} for which the speech recognition accuracy or speech recognition rate is at least a predetermined value. The correspondence table generation unit 118 generates and outputs a correspondence table [r_i, h_{m(i)}] in which r_i and h_{m(i)} are associated. FIG. 5A is an example of the correspondence table [r_i, h_{m(i)}] for the case where I = 8, r_i is the S/N ratio, and h_{m(i)} is a combination of the model parameters β and γ (step S117). The correspondence table [r_i, h_{m(i)}] is stored in the correspondence table storage unit 121a of the speech recognition device 12 (FIG. 2).
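The per-i maximization can be sketched as follows, assuming the accuracies have been collected into a matrix indexed by (i, m). The function name, data layout, and values are hypothetical.

```python
from typing import List, Sequence, Tuple

Params = Tuple[float, ...]


def build_correspondence_table(r: Sequence[float],
                               h: Sequence[Params],
                               accuracy: Sequence[Sequence[float]]) -> List[Tuple[float, Params]]:
    """For each i, pick m(i) = argmax_m accuracy[i][m] and pair r_i with h_{m(i)}."""
    table = []
    for i, r_i in enumerate(r):
        m_best = max(range(len(h)), key=lambda m: accuracy[i][m])
        table.append((r_i, h[m_best]))
    return table


# Illustrative values only: I = 2 first indices, M = 2 parameter combinations.
r = [0.0, 10.0]
h = [(8.0, 0.5), (12.0, 0.5)]
accuracy = [[0.6, 0.7],
            [0.9, 0.8]]
table = build_correspondence_table(r, h, accuracy)
```

When several m tie for the maximum, `max` returns the first one; the text leaves the tie-breaking rule open.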

<Speech Recognition Process>
The processing of the speech recognition device 12 will be described with reference to FIG. 4. The speech recognition device 12 selects a model parameter of the speech recognition model, or a combination of model parameters of the speech recognition model, according to the relationship between the magnitudes of the speech component and the noise component contained in the input acoustic signal. That is, it selects the model parameter or combination of model parameters h_{m(i)} corresponding to the “first index” r_i that corresponds to the relationship between the magnitudes of the speech component and the noise component contained in the input acoustic signal. The speech recognition device 12 then applies the speech recognition model corresponding to the selected model parameter or combination of model parameters to the input acoustic signal and performs speech recognition.

First, an input acoustic signal is input to the input unit 122 and stored in the input acoustic signal storage unit 121b. The input acoustic signal is a time-series signal, for example, a speech signal on which a noise component is superimposed (step S121). The speech/non-speech section discrimination unit 123 reads the input acoustic signal from the input acoustic signal storage unit 121b and discriminates between the speech sections and the non-speech sections of the input acoustic signal. A well-known method is used for this discrimination, for example, that of Reference 1 (Jongseo Sohn, Nam Soo Kim, Wonyong Sung, "A Statistical Model-Based Voice Activity Detection," IEEE Signal Processing Letters, Vol. 6, No. 1, 1999). The signal in the non-speech sections is stored as the noise component in the noise component storage unit 121c, and the input acoustic signal is sent to the index generation unit 124 (step S122).

The index generation unit 124 uses the input acoustic signal and the non-speech section signal read from the noise component storage unit 121c to obtain and output a "second index" u that represents the "relationship between the magnitudes of the speech component and the noise component". The "relationship between the magnitudes of the speech component and the noise component" in the speech recognition processing is preferably based on the same criterion as in the learning processing described above. That is, when the S/N ratio is used as this relationship in the learning processing, it is desirable that the S/N ratio also be used in the speech recognition processing. u may be obtained at each time point, for each predetermined time interval, or for the entire time interval of the input acoustic signal. u may be determined from the effective (RMS) value of the input acoustic signal and the effective value of the non-speech section signal, or from an average value, a maximum value, or an absolute value instead of the effective value. For example, u = (effective value of the input acoustic signal - effective value of the non-speech section signal) / (effective value of the non-speech section signal), where the effective value may again be replaced by an average value, a maximum value, or an absolute value. The obtained u is sent to the selection unit 125 (step S123).
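As an illustration of the example formula for u given in step S123, the following sketch computes u from the effective (RMS) values of the whole input signal and of its non-speech portion. The function names `rms` and `second_index` are hypothetical, and the zero-noise guard is an added assumption that the text does not discuss.

```python
import math


def rms(samples):
    """Effective (root-mean-square) value of a sequence of samples."""
    return math.sqrt(sum(x * x for x in samples) / len(samples))


def second_index(input_signal, non_speech_signal):
    """u = (RMS of input - RMS of non-speech) / (RMS of non-speech)."""
    noise = rms(non_speech_signal)
    if noise == 0.0:
        # Assumption: a silent noise estimate is treated as an error here.
        raise ValueError("non-speech signal has zero effective value")
    return (rms(input_signal) - noise) / noise
```

Calling the same function on successive windows of the signal would give the per-interval variant mentioned in the text.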

The selection unit 125 refers to the correspondence table [r_i, h_m(i)] stored in the correspondence table storage unit 121a and selects the model parameter or combination of model parameters h_m(i) corresponding to the "first index" r_i closest to the "second index" u. For example, in the example of FIGS. 5A and 5B, r_2 is closest to u, so the combination of model parameters h_m(2) = (γ_5, β_3) corresponding to r_2 is selected. If u is the midpoint of two adjacent r_i, there are two r_i closest to u. In that case, for example, the h_m(i) corresponding to a predetermined one of the two r_i is selected. Because r_i is obtained from the noisy speech signal X_i and the noise component N_i, an appropriate h_m(i) can be selected for u, which is obtained from the input acoustic signal (corresponding to the noisy speech signal) and the non-speech section signal (corresponding to the noise component). The selected h_m(i) is sent to the speech recognition unit 126 (step S124).
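The nearest-index lookup of step S124 can be sketched as follows. The table is assumed to be a list of (r_i, h) pairs; the function name `select_parameters` and the `prefer_lower` tie-breaking flag are hypothetical, standing in for the "predetermined one" of two equally close r_i.

```python
def select_parameters(table, u, prefer_lower=True):
    """Return the h_m(i) whose first index r_i is closest to u.

    table: list of (r_i, h) pairs.
    prefer_lower: when u lies exactly midway between two r_i, pick the
                  smaller r_i if True, the larger otherwise (assumed rule).
    """
    def sort_key(entry):
        r_i = entry[0]
        tie_break = r_i if prefer_lower else -r_i
        return (abs(u - r_i), tie_break)

    return min(table, key=sort_key)[1]
```

For instance, with first indices 0, 10 and 20, a value u = 12 selects the entry for 10, while u = 5 is a tie that the flag resolves.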

The speech recognition unit 126 performs speech recognition by applying the speech recognition model that uses the received model parameter or combination of model parameters h_m(i) to the input acoustic signal read from the input acoustic signal storage unit 121b, and outputs the speech recognition result (step S125).

<Features of this embodiment>
In this embodiment, a model parameter of the speech recognition model, or a combination of model parameters of the speech recognition model, is selected according to the relationship between the magnitudes of the speech component and the noise component included in the input acoustic signal. Appropriate model parameters can therefore be set automatically according to the noise component included in the input acoustic signal.

In particular, in ANN-HMM Hybrid speech recognition, the appropriate parameter values are affected by the noise component included in the input acoustic signal. It is therefore difficult to apply to ANN-HMM Hybrid speech recognition a method similar to the conventional automatic heuristic parameter determination method for GMM-HMM speech recognition, which does not consider the noise component (see, for example, Non-Patent Document 3), and the parameter values had to be set manually. Applying noise suppression processing to the input acoustic signal in advance is conceivable, but such processing generally adds distortion to the speech that is undesirable from the viewpoint of speech recognition. For this reason, training a neural network that discriminates the HMM state directly from an input acoustic signal containing a noise component is considered more suitable, because the processing then takes speech recognition into account. With the method of this embodiment, heuristic parameters that are optimal for the speech recognition accuracy or speech recognition rate can be determined automatically from the input acoustic signal, manual setting work is eliminated, and degradation of the speech recognition accuracy or speech recognition rate due to noise components can be prevented.

[Second Embodiment]
Next, a second embodiment will be described. The second embodiment is a modification of the first embodiment. In this embodiment, from among the model parameters or combinations of model parameters corresponding to "first indices" that are the same as or close to the "second index" representing the relationship between the magnitudes of the speech component and the noise component included in the input acoustic signal, the one that maximizes the speech recognition accuracy or speech recognition rate is selected. The following description focuses on the differences from what has been described so far; items already described are given the same reference numerals and their description is simplified.

<Configuration>
As illustrated in FIG. 1, the learning device 21 of this embodiment includes a speech signal storage unit 111a, a noise signal storage unit 111b, a correct word string storage unit 111c, a noisy speech signal storage unit 111e, a speech recognition result storage unit 111f, a component adjustment and addition unit 113, an index generation unit 114, a speech recognition unit 116, a comparison unit 117, and a correspondence table generation unit 218. As illustrated in FIG. 2, the speech recognition device 22 of this embodiment includes a correspondence table storage unit 221a, an input acoustic signal storage unit 121b, a noise component storage unit 121c, an input unit 122, a speech/non-speech section discrimination unit 123, an index generation unit 124, a selection unit 225, and a speech recognition unit 126.

<Learning process>
The only difference from the first embodiment is that the process of step S217 is performed instead of step S117 of FIG. 3. In step S217, the correspondence table generation unit 218 selects, for each i, the m(i) ∈ {0, ..., M-1} that maximizes the speech recognition accuracy or speech recognition rate, and sets this maximum speech recognition accuracy or speech recognition rate as a_i. Alternatively, the correspondence table generation unit 218 selects, for each i, one m(i) ∈ {0, ..., M-1} for which the speech recognition accuracy or speech recognition rate is equal to or greater than a predetermined value, and sets that speech recognition accuracy or speech recognition rate as a_i. The correspondence table generation unit 218 generates and outputs a correspondence table [r_i, h_m(i), a_i] that associates r_i, h_m(i), and a_i. FIG. 5C is an example of the correspondence table [r_i, h_m(i), a_i] when I = 8, r_i is the S/N ratio, h_m(i) is a combination of the model parameters β and γ, and a_i is the speech recognition accuracy (step S217). The correspondence table [r_i, h_m(i), a_i] is stored in the correspondence table storage unit 221a of the speech recognition device 22 (FIG. 2).

<Speech recognition processing>
The only difference from the first embodiment is that the process of step S224 is performed instead of step S124 of FIG. 4. In step S224, the selection unit 225 refers to the correspondence table [r_i, h_m(i), a_i] stored in the correspondence table storage unit 221a and, from among the model parameters or combinations of model parameters h_m(i) corresponding to "first indices" r_i that are the same as or close to the "second index" u, selects the one that maximizes the speech recognition accuracy or speech recognition rate a_i. For example, in the example of FIGS. 5B and 5C, when the r_i near u are r_1, r_2, and r_3, the h_m(i) corresponding to the largest of a_1, a_2, and a_3, which correspond to r_1, r_2, and r_3 respectively, is selected. The selected h_m(i) is sent to the speech recognition unit 126. When several a_i are equal, the h_m(i) corresponding to any of them may be selected. For example, among the equal a_i, the h_m(i) associated with the a_i whose r_i is closest to u may be selected (step S224).
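Step S224 can be sketched as below. The text does not define how wide the "neighborhood" of u is, so this example assumes it is the `num_neighbors` table entries whose r_i are closest to u; that parameter and the function name `select_by_accuracy` are hypothetical. Ties in a_i are resolved in favor of the r_i closest to u, which is one of the options the text allows.

```python
def select_by_accuracy(table, u, num_neighbors=3):
    """Among the entries nearest to u, return the h with the largest a_i.

    table: list of (r_i, h, a_i) triples.
    num_neighbors: size of the neighborhood of u (assumed definition).
    """
    # Neighborhood: the entries whose first index is closest to u.
    neighbors = sorted(table, key=lambda e: abs(u - e[0]))[:num_neighbors]
    # Maximize a_i; on equal a_i prefer the smaller distance to u.
    best = max(neighbors, key=lambda e: (e[2], -abs(u - e[0])))
    return best[1]
```

Note that the entry selected this way is not necessarily the one nearest to u: a more distant neighbor wins whenever its recorded accuracy is higher.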

[Modification of Second Embodiment]
From among the model parameters or combinations of model parameters corresponding to "first indices" that are the same as or close to the "second index", the one that maximizes a weighted value of the speech recognition accuracy or speech recognition rate may be selected. Here, the "weighted value" is the speech recognition accuracy or speech recognition rate multiplied by a weight that is larger the smaller the distance between the "second index" and the "first index".

<Configuration>
As illustrated in FIG. 2, the speech recognition device 22' of this modification includes a correspondence table storage unit 221a, an input acoustic signal storage unit 121b, a noise component storage unit 121c, an input unit 122, a speech/non-speech section discrimination unit 123, an index generation unit 124, a selection unit 225', and a speech recognition unit 126.

<Learning process>
The learning process is the same as in the second embodiment.

<Speech recognition processing>
The only difference from the second embodiment is that the process of step S224' is performed instead of step S224 of FIG. 4. In step S224', the selection unit 225' refers to the correspondence table [r_i, h_m(i), a_i] stored in the correspondence table storage unit 221a and, from among the model parameters or combinations of model parameters h_m(i) corresponding to "first indices" r_i that are the same as or close to the "second index" u, selects the one that maximizes the weighted value obtained by multiplying the speech recognition accuracy or speech recognition rate a_i by a weight c_i. Here, c_i is a positive value that is larger the smaller the distance |u - r_i| between u and r_i. For example, in the example of FIGS. 5B and 5C, when the r_i near u are r_1, r_2, and r_3, the h_m(i) corresponding to the largest of c_1·a_1, c_2·a_2, and c_3·a_3, which correspond to r_1, r_2, and r_3 respectively, is selected. In this example, c_2 > c_3 > c_1. The selected h_m(i) is sent to the speech recognition unit 126 (step S224').
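The weighted selection of step S224' can be sketched as follows. The text only requires c_i to be positive and to grow as |u - r_i| shrinks; the default weight 1 / (1 + |u - r_i|) used here is an assumed example, as are the function name `select_by_weighted_accuracy` and the k-nearest definition of the neighborhood.

```python
def select_by_weighted_accuracy(table, u, num_neighbors=3, weight=None):
    """Among the entries nearest to u, return the h maximizing c_i * a_i.

    table: list of (r_i, h, a_i) triples.
    weight: function mapping the distance |u - r_i| to a positive weight
            c_i; must decrease with distance. Defaults to 1 / (1 + d),
            which is an assumption made for this sketch.
    """
    if weight is None:
        weight = lambda d: 1.0 / (1.0 + d)
    neighbors = sorted(table, key=lambda e: abs(u - e[0]))[:num_neighbors]
    best = max(neighbors, key=lambda e: weight(abs(u - e[0])) * e[2])
    return best[1]
```

Compared with the unweighted rule of the second embodiment, a distant entry now needs a correspondingly higher accuracy to be chosen over a nearby one.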

[Third Embodiment]
Next, a third embodiment will be described. The third embodiment is a modification of the first embodiment. In this embodiment, the device selects the model parameter or interpolated model parameter value, or the combination of model parameters or interpolated model parameter values, that corresponds to the "first index" or interpolated "first index" value matching the "second index".

<Configuration>
As illustrated in FIG. 1, the learning device 31 of this embodiment includes a speech signal storage unit 111a, a noise signal storage unit 111b, a correct word string storage unit 111c, a noisy speech signal storage unit 111e, a speech recognition result storage unit 111f, a component adjustment and addition unit 113, an index generation unit 114, a speech recognition unit 116, a comparison unit 117, and a correspondence table generation unit 318. As illustrated in FIG. 2, the speech recognition device 32 of this embodiment includes a correspondence table storage unit 321a, an input acoustic signal storage unit 121b, a noise component storage unit 121c, an input unit 122, a speech/non-speech section discrimination unit 123, an index generation unit 124, a selection unit 325, and a speech recognition unit 126.

<Learning process>
The only difference from the first embodiment is that the process of step S317 is performed instead of step S117 of FIG. 3. In step S317, the correspondence table generation unit 318 selects, for each i (where i = 0, ..., I-1), the m(i) ∈ {0, ..., M-1} that maximizes the speech recognition accuracy or speech recognition rate. Alternatively, the correspondence table generation unit 318 may select, for each i, one m(i) ∈ {0, ..., M-1} for which the speech recognition accuracy or speech recognition rate is equal to or greater than a predetermined value. The correspondence table generation unit 318 further interpolates r_0, ..., r_(I-1) by linear interpolation or the like, and obtains a series r'_0, ..., r'_(Z-1) consisting of r_0, ..., r_(I-1) and their interpolated values (where Z is an integer greater than I). Likewise, the correspondence table generation unit 318 interpolates h_m(0), ..., h_m(I-1) by linear interpolation or the like, and obtains a series h'_m(0), ..., h'_m(Z-1) consisting of h_m(0), ..., h_m(I-1) and their interpolated values. The correspondence table generation unit 318 generates and outputs a correspondence table [r'_z, h'_m(z)] (where z = 0, ..., Z-1) that associates r'_z with h'_m(z). The correspondence table [r'_z, h'_m(z)] is stored in the correspondence table storage unit 321a of the speech recognition device 32 (FIG. 2).
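The densification in step S317 can be sketched with plain linear interpolation. This example assumes each h is a tuple of numeric parameters (for instance (γ, β)) and inserts a fixed number of evenly spaced points between adjacent table rows; the function name `interpolate_table` and the `points_per_gap` parameter are hypothetical.

```python
def interpolate_table(table, points_per_gap=4):
    """Expand [(r_i, h_i)] into a denser [(r'_z, h'_z)] table.

    table: list of (r_i, h_i) with distinct r_i, where h_i is a tuple of
           numbers.
    points_per_gap: number of rows emitted per original interval,
                    including its left end point (assumed scheme).
    """
    rows = sorted(table)
    dense = []
    for (r0, h0), (r1, h1) in zip(rows, rows[1:]):
        for k in range(points_per_gap):
            t = k / points_per_gap  # 0 <= t < 1 within this interval
            r = r0 + t * (r1 - r0)
            h = tuple(a + t * (b - a) for a, b in zip(h0, h1))
            dense.append((r, h))
    dense.append(rows[-1])  # keep the last original row
    return dense
```

With I original rows the result has Z = (I - 1) * points_per_gap + 1 rows, so Z exceeds I whenever points_per_gap is at least 2.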

<Speech recognition processing>
The only difference from the first embodiment is that the process of step S324 is performed instead of step S124 of FIG. 4. In step S324, the selection unit 325 refers to the correspondence table [r'_z, h'_m(z)] stored in the correspondence table storage unit 321a and selects the model parameter, combination of model parameters, or interpolated value h'_m(z) associated with the r'_z that matches the input u. The selected h'_m(z) is sent to the speech recognition unit 126 (step S324). The subsequent processing is the same as in the first embodiment except that h'_m(z) is used instead of h_m(i).

[Modification of Third Embodiment]
Using the correspondence table [r_i, h_m(i)] generated by the learning process of the first embodiment, a correspondence table [r'_z, h'_m(z)] may be generated by interpolating the correspondence table [r_i, h_m(i)] at the time of the speech recognition processing, and the process of step S324 may then be executed.
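One way to realize interpolation at recognition time is to interpolate only at the observed u rather than building the whole dense table first. The sketch below does this with linear interpolation between the two rows bracketing u. Clamping to the end rows when u falls outside the table range is an assumption the text does not address, and the function name `interpolate_at` is hypothetical.

```python
import bisect


def interpolate_at(table, u):
    """Linearly interpolate the parameter tuple h at first-index value u.

    table: list of (r_i, h_i) with distinct r_i, where h_i is a tuple of
           numbers.
    """
    rows = sorted(table)
    rs = [r for r, _ in rows]
    # Assumption: outside the covered range, reuse the nearest end row.
    if u <= rs[0]:
        return rows[0][1]
    if u >= rs[-1]:
        return rows[-1][1]
    j = bisect.bisect_right(rs, u)  # rs[j-1] <= u < rs[j]
    (r0, h0), (r1, h1) = rows[j - 1], rows[j]
    t = (u - r0) / (r1 - r0)
    return tuple(a + t * (b - a) for a, b in zip(h0, h1))
```

This yields the same value a dense table would hold at r'_z = u, without fixing the grid spacing in advance.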

[Other modifications]
The present invention is not limited to the embodiments described above. For example, instead of exchanging information through a network, at least some of the devices may exchange information via a portable recording medium. Alternatively, at least some of the devices may exchange information via a non-portable recording medium. That is, a combination of some of these devices may be implemented as a single device.

The various processes described above need not be executed in time series in the order described; they may also be executed in parallel or individually, depending on the processing capability of the device that executes them or as required. Needless to say, other modifications may be made as appropriate without departing from the spirit of the present invention.

When the above configuration is implemented by a computer, the processing content of the functions that each device should have is described by a program. Executing this program on the computer implements the above processing functions on the computer. The program describing this processing content can be recorded on a computer-readable recording medium. An example of a computer-readable recording medium is a non-transitory recording medium. Examples of such a recording medium include a magnetic recording device, an optical disc, a magneto-optical recording medium, and a semiconductor memory.

The program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. The program may also be distributed by storing it in a storage device of a server computer and transferring it from the server computer to other computers via a network.

A computer that executes such a program, for example, first stores the program recorded on the portable recording medium, or the program transferred from the server computer, in its own storage device. When executing the processing, the computer reads the program stored in its own storage device and executes processing according to the read program. As another form of execution, the computer may read the program directly from the portable recording medium and execute processing according to it, or it may execute processing according to the received program each time the program is transferred to it from the server computer. The above processing may also be executed by a so-called ASP (Application Service Provider) type service, in which the program is not transferred from the server computer to the computer and the processing functions are realized only through execution instructions and acquisition of results.

In the above embodiments, the processing functions of the devices are implemented by executing a predetermined program on a computer, but at least some of these processing functions may be implemented in hardware.

Learning device 11, 21, 31
Speech recognition device 12, 22, 22', 32

Claims

A speech recognition device that selects a model parameter of a speech recognition model, or a combination of model parameters of the speech recognition model, according to the relationship between the magnitudes of a speech component and a noise component included in an input acoustic signal, and applies the speech recognition model corresponding to the selected model parameter or combination of model parameters to the input acoustic signal, wherein
a first index representing the relationship between the magnitudes of a speech component and a noise component is associated with a model parameter or combination of model parameters of the speech recognition model for which the speech recognition accuracy or speech recognition rate, when applied to an acoustic signal having the relationship represented by the first index, is maximal or is equal to or greater than a predetermined value, and
from among the model parameters or combinations of model parameters corresponding to first indices that are the same as or close to a second index representing the relationship between the magnitudes of the speech component and the noise component included in the input acoustic signal, the speech recognition device selects the one that maximizes the speech recognition accuracy or speech recognition rate.

A speech recognition device that selects a model parameter of a speech recognition model, or a combination of model parameters of the speech recognition model, according to the relationship between the magnitudes of a speech component and a noise component included in an input acoustic signal, and applies the speech recognition model corresponding to the selected model parameter or combination of model parameters to the input acoustic signal, wherein
a first index representing the relationship between the magnitudes of a speech component and a noise component is associated with a model parameter or combination of model parameters of the speech recognition model for which the speech recognition accuracy or speech recognition rate, when applied to an acoustic signal having the relationship represented by the first index, is maximal or is equal to or greater than a predetermined value,
from among the model parameters or combinations of model parameters corresponding to first indices that are the same as or close to a second index representing the relationship between the magnitudes of the speech component and the noise component included in the input acoustic signal, the speech recognition device selects the one that maximizes a weighted value of the speech recognition accuracy or speech recognition rate, and
the weighted value is a value obtained by multiplying the speech recognition accuracy or speech recognition rate by a weight that is larger the smaller the distance between the second index and the first index.

A speech recognition method that selects a model parameter of a speech recognition model, or a combination of model parameters of the speech recognition model, according to the relationship between the magnitudes of a speech component and a noise component included in an input acoustic signal, and applies the speech recognition model corresponding to the selected model parameter or combination of model parameters to the input acoustic signal, wherein
a first index representing the relationship between the magnitudes of a speech component and a noise component is associated with a model parameter or combination of model parameters of the speech recognition model for which the speech recognition accuracy or speech recognition rate, when applied to an acoustic signal having the relationship represented by the first index, is maximal or is equal to or greater than a predetermined value, and
from among the model parameters or combinations of model parameters corresponding to first indices that are the same as or close to a second index representing the relationship between the magnitudes of the speech component and the noise component included in the input acoustic signal, the one that maximizes the speech recognition accuracy or speech recognition rate is selected.

A speech recognition method that selects a model parameter of a speech recognition model, or a combination of model parameters of the speech recognition model, according to the relationship between the magnitudes of a speech component and a noise component included in an input acoustic signal, and applies the speech recognition model corresponding to the selected model parameter or combination of model parameters to the input acoustic signal, wherein
a first index representing the relationship between the magnitudes of a speech component and a noise component is associated with a model parameter or combination of model parameters of the speech recognition model for which the speech recognition accuracy or speech recognition rate, when applied to an acoustic signal having the relationship represented by the first index, is maximal or is equal to or greater than a predetermined value,
from among the model parameters or combinations of model parameters corresponding to first indices that are the same as or close to a second index representing the relationship between the magnitudes of the speech component and the noise component included in the input acoustic signal, the one that maximizes a weighted value of the speech recognition accuracy or speech recognition rate is selected, and
the weighted value is a value obtained by multiplying the speech recognition accuracy or speech recognition rate by a weight that is larger the smaller the distance between the second index and the first index.

A program for causing a computer to function as the speech recognition device according to claim 1 or 2.