JP2006072127A

JP2006072127A - Voice recognition device and voice recognition method

Info

Publication number: JP2006072127A
Application number: JP2004257390A
Authority: JP
Inventors: Akira Baba (馬場 朗)
Original assignee: Matsushita Electric Works Ltd
Current assignee: Panasonic Electric Works Co Ltd
Priority date: 2004-09-03
Filing date: 2004-09-03
Publication date: 2006-03-16

Abstract

PROBLEM TO BE SOLVED: To provide a voice recognition device and a voice recognition method that do not require a significant increase in hardware resources, show little instability in real environments, and deliver high recognition performance in various environments.

SOLUTION: A subtraction processing section comprises a signal storage section 8, which stores the observed signals Y(f, m-1) to Y(f, m-L) of the previous L frames, and a subtracting section 10. The subtracting section 10 multiplies the stored observed signals Y(f, m-1) to Y(f, m-L) by the corresponding coefficients α_1 to α_L held in a subtracting coefficient storage section 9, subtracts the resulting signals from the observed signal Y(f, m) of the current time frame in the power spectrum domain, and outputs the result as the estimated signal S_est(f, m). A voice feature quantity extracting section 4 extracts a voice feature quantity from the estimated signal S_est(f, m).

COPYRIGHT: (C)2006,JPO&NCIPI

Description

The present invention relates to a speech recognition apparatus and a speech recognition method for recognizing speech input by a user in a reverberant environment such as a living room.

Speech recognition technology plays an important role in realizing a good human interface. An apparatus configured as shown in FIG. 7 has conventionally been provided as a speech recognition apparatus applying this technology (see, for example, Patent Document 1).

This speech recognition apparatus comprises: a speech input unit 1 consisting of a microphone that receives speech; an A/D conversion unit 2 that A/D-converts the output signal of the speech input unit 1; a frequency conversion unit 3 that frequency-converts the output signal of the A/D conversion unit 2; a speech feature extraction unit 4 that extracts speech features from the output signal of the frequency conversion unit 3; a standard pattern storage unit 7 that stores standard patterns for speech recognition created from standard speech; and a pattern matching unit 5 that calculates the similarity between the speech features of the input speech output by the speech feature extraction unit 4 and the standard patterns stored in the standard pattern storage unit 7, and outputs a recognition result 6. The standard patterns stored in the standard pattern storage unit 7 are created by collecting feature patterns of standard speech in advance and modeling them with a technique such as a hidden Markov model.

When the environment in which the apparatus is used differs from the environment in which the standard patterns were created, the user's speech deviates from the standard patterns and the recognition rate drops. The speech recognition apparatus disclosed in Patent Document 1 therefore stores a plurality of standard patterns corresponding to different reverberation times in the standard pattern storage unit 7, and selects the standard pattern to use according to the environment.

Also provided is a speech recognition apparatus that analyzes the input speech to estimate an inverse filter of the environment, converts the input speech into a state unaffected by the environment, and then recognizes it (see, for example, Non-Patent Document 1).
Patent Document 1: JP 2004-117724 A (FIG. 1, paragraph 0025)
Non-Patent Document 1: "Speech quality evaluation by clarity and recognition rate of a dereverberation method using harmonic structure", Proceedings of the Acoustical Society of Japan, pp. 611-612, March 2004

A scheme that uses a plurality of standard patterns, such as the speech recognition apparatus disclosed in Patent Document 1, has the problem that the memory capacity grows in order to hold those patterns. Because standard patterns account for a large share of the total memory in a speech recognition apparatus, preparing a plurality of them multiplies the total memory capacity.

The method disclosed in Non-Patent Document 1, which estimates the inverse filter of the environment from the input speech, requires a large amount of input speech to estimate the inverse filter with sufficient accuracy. However, the characteristics of the environment change with the user's position and with the room temperature, so it is difficult to collect the required amount of input speech while those characteristics remain constant. As a result, an unstable inverse filter is learned, and sufficient recognition performance is difficult to obtain.

The present invention has been made in view of the above points, and its object is to provide a speech recognition apparatus and a speech recognition method that require no significant increase in hardware resources, show little instability in real environments, and achieve high recognition performance in various environments.

To achieve the above object, the speech recognition apparatus according to claim 1 comprises: a speech input unit that captures speech in a reverberant environment; an A/D conversion unit that A/D-converts the output signal of the speech input unit; a frequency conversion unit that frequency-converts the output signal of the A/D conversion unit and outputs an observation signal of the current time frame; a subtraction processing unit that stores the observation signals of past predetermined time frames, subtracts from the observation signal of the current time frame a signal obtained by multiplying the observation signals of the past predetermined time frames by predetermined subtraction coefficients, and outputs the result as an estimated signal; a speech feature extraction unit that extracts speech features from the estimated signal output by the subtraction processing unit; a standard pattern storage unit that stores standard patterns created from standard speech in environments with different reverberation times; and a pattern matching unit that obtains the similarity between the features extracted by the speech feature extraction unit and the standard patterns stored in the standard pattern storage unit, and outputs a recognition result.

According to the invention of claim 1, there is no need to store a plurality of standard patterns, so a large-capacity standard pattern storage unit is not required. Moreover, the speech components delayed by reverberation are removed from the speech captured in the reverberant environment, and speech features are extracted from an estimated signal that has the same quality as the standard pattern. A speech recognition apparatus that recognizes speech with high performance in various environments can therefore be realized.

The speech recognition apparatus according to claim 2 comprises: a speech input unit that captures speech in a reverberant environment; an A/D conversion unit that A/D-converts the output signal of the speech input unit; a frequency conversion unit that frequency-converts the output signal of the A/D conversion unit and outputs an observation signal of the current time frame; a subtraction processing unit that subtracts from the observation signal of the current time frame a signal obtained by multiplying predetermined signals corresponding to past predetermined time frames by predetermined subtraction coefficients, outputs the result as an estimated signal, and stores the estimated signal for use as the predetermined signals; a speech feature extraction unit that extracts speech features from the estimated signal output by the subtraction processing unit; a standard pattern storage unit that stores standard patterns created from standard speech in environments with different reverberation times; and a pattern matching unit that obtains the similarity between the features extracted by the speech feature extraction unit and the standard patterns stored in the standard pattern storage unit, and outputs a recognition result.

According to the invention of claim 2, there is no need to store a plurality of standard patterns, so a large-capacity standard pattern storage unit is not required. Moreover, the speech components delayed by reverberation are removed from the speech captured in the reverberant environment, and speech features are extracted from an estimated signal that has the same quality as the standard pattern, so speech can be recognized with high performance in various environments. In particular, the signals corresponding to the past time frames used for the subtraction become more accurate.

In the invention according to claim 3, in the invention of claim 1 or 2, the subtraction coefficient is a value obtained by multiplying the power ratio between the observation signal of the first frame and the observation signal of a predetermined frame of the transfer function of the reverberant environment by a predetermined coefficient.

According to the invention of claim 3, the subtraction coefficient can be derived more accurately, and as a result the speech recognition performance can be further improved.

In the invention according to claim 4, in the invention of any one of claims 1 to 3, the past predetermined time frames used for the subtraction are a plurality of frames; the signal of each predetermined time frame is multiplied by the predetermined subtraction coefficient corresponding to that frame, the products are added together, and the sum is subtracted from the observation signal of the current time frame.

According to the invention of claim 4, reverberation can be suppressed more accurately even for delayed speech components that arrive with various delay times, and as a result the speech recognition performance can be further improved.

In the invention according to claim 5, in the invention of any one of claims 1 to 4, the subtraction processing unit has a function of mitigating fluctuations of the reverberation by smoothing during the subtraction processing.

According to the invention of claim 5, fluctuations of the reverberation can be mitigated, and as a result the speech recognition performance can be improved.

In the invention according to claim 6, in the invention of any one of claims 1 to 3, when the subtraction processing unit uses a single past predetermined time frame to output the estimated signal, it selects that past predetermined time frame for each frequency band or each frequency on which the subtraction is performed.

According to the invention of claim 6, reverberation can be suppressed more accurately even for delayed components whose delay time differs by frequency, and as a result the speech recognition performance can be improved.

The invention according to claim 7 is a speech recognition method that obtains a recognition result from the similarity between features of speech captured in a reverberant environment and a standard pattern created from standard speech, the method comprising: a step of frequency-converting the speech signal captured in the reverberant environment; a step of obtaining an estimated signal by subtracting, from the observation signal of the current time frame obtained by the frequency conversion, a signal obtained by multiplying the observation signal of a past predetermined time frame by a predetermined subtraction coefficient; a step of extracting speech features from the estimated signal; and a step of performing speech recognition from the similarity between the extracted features and the standard pattern.

According to the invention of claim 7, the speech components delayed by reverberation are removed from the speech captured in the reverberant environment, and speech features are extracted from an estimated signal that has the same quality as the standard pattern. Speech recognition performance can therefore be improved in various environments without a significant increase in hardware.

The invention according to claim 8 is a speech recognition method that obtains a recognition result from the similarity between features of speech captured in a reverberant environment and a standard pattern created from standard speech, the method comprising: a step of frequency-converting the speech signal captured in the reverberant environment; a step of obtaining an estimated signal by subtracting, from the observation signal of the current time frame obtained by the frequency conversion, a signal obtained by multiplying a predetermined signal corresponding to a past predetermined time frame by a predetermined subtraction coefficient, and of storing the estimated signal for use as the predetermined signal; a step of extracting speech features from the estimated signal; and a step of performing speech recognition from the similarity between the extracted features and the standard pattern.

According to the invention of claim 8, the speech components delayed by reverberation are removed from the speech captured in the reverberant environment, and speech features are extracted from an estimated signal that has the same quality as the standard pattern. Speech recognition performance can therefore be improved in various environments without a significant increase in hardware, and in particular the signals corresponding to the past time frames used for the subtraction become more accurate.

The speech recognition apparatus of the invention has the following effect: there is no need to store a plurality of standard patterns, so a large-capacity standard pattern storage unit is not required, and because the speech components delayed by reverberation are removed from the speech captured in the reverberant environment and speech features are extracted from an estimated signal of the same quality as the standard pattern, a speech recognition apparatus that recognizes speech with high performance in various environments can be realized.

The speech recognition method of the invention likewise removes the speech components delayed by reverberation from the speech captured in the reverberant environment and extracts speech features from an estimated signal of the same quality as the standard pattern, so speech recognition performance can be improved in various environments without a significant increase in hardware.

Embodiments of the present invention are described below.

(Embodiment 1)
FIG. 1 shows the configuration of this embodiment. Its subtraction processing unit comprises three parts. A signal storage unit 8 stores the observation signals Y(f, m) frequency-converted by the frequency conversion unit 3 for past time frames up to a predetermined frame, for example L frames, that is, Y(f, m-1) to Y(f, m-L). A subtraction coefficient storage unit 9 stores L subtraction coefficients α_1 to α_L, one for each of the L past observation signals Y(f, m-1) to Y(f, m-L) held in the signal storage unit 8. A subtraction unit 10 multiplies each past observation signal Y(f, m-1) to Y(f, m-L) stored in the signal storage unit 8 by the corresponding subtraction coefficient α_1 to α_L stored in the subtraction coefficient storage unit 9, and subtracts the products from the observation signal of the current time frame output by the frequency conversion unit 3, thereby performing power spectrum subtraction. The embodiment is characterized in that the subtraction result output by the subtraction unit 10 is passed to the speech feature extraction unit 4 as the estimated signal from which speech features are extracted.

The signal storage unit 8 stores the observation signals output by the frequency conversion unit 3 for L frames as described above. Each time a new observation signal is input, it erases the observation signal of the oldest frame and stores the observation signal of the new frame.
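The behavior of the signal storage unit 8 resembles a fixed-length ring buffer. The following is a minimal sketch under that assumption; the class name `SignalStore` and its methods are hypothetical and do not come from the patent.

```python
from collections import deque


class SignalStore:
    """Hypothetical stand-in for signal storage unit 8: keeps the last L frames."""

    def __init__(self, num_frames):
        # A deque with maxlen discards the oldest entry automatically on append.
        self._frames = deque(maxlen=num_frames)

    def push(self, frame):
        """Store the newest frame; the oldest one is dropped once L frames are held."""
        self._frames.append(list(frame))

    def past(self, p):
        """Return the frame stored p pushes ago (p = 1 is the most recent)."""
        return self._frames[-p]

    def __len__(self):
        return len(self._frames)


store = SignalStore(num_frames=3)
for m in range(5):
    store.push([float(m)] * 2)  # toy 2-bin spectrum for frame m
```

After the loop only frames 2, 3 and 4 remain, so `store.past(1)` is frame 4 and `store.past(3)` is frame 2.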

As in the conventional example, the speech input unit 1, consisting of a microphone, captures speech in a reverberant environment; the speech signal it outputs is A/D-converted by the A/D conversion unit 2 and then frequency-converted by the frequency conversion unit 3.

The pattern matching unit 5 obtains the similarity between the speech features extracted by the speech feature extraction unit 4 and the standard pattern of the standard speech stored in the standard pattern storage unit 7, and outputs a recognition result 6 corresponding to that similarity. The standard pattern storage unit 7 does not store standard patterns for a plurality of environments; it stores only a single speech pattern based on standard speech, and is therefore built from memory with far less capacity than when a plurality of standard patterns are stored as in the conventional example.

The operation of this embodiment is described next.

First, the user measures the transfer function from the user's position to the speech input unit 1 (the microphone) in the environment where the apparatus is used. FIG. 2 shows an example of a measured transfer function h(t).

Next, the value obtained by multiplying the inter-frame power ratio, for example as given by expression (1), by a predetermined constant β is taken as α_p. Here, the window width W is set, for example, to the same value as the window length used in the frequency conversion unit 3, and the window shift width (the width from t = 0 to t = T_1 in FIG. 2) is likewise set to the same value as the window shift width used in the frequency conversion unit 3.

FIG. 3 shows an example of the relationship between the obtained values of α_p and the frames.

The α_p calculated as described above is stored in the subtraction coefficient storage unit 9 in advance. Any suitable means may be used to store it, so its description is omitted here.

Preparation is complete once the α_p values calculated as described above have been stored in the subtraction coefficient storage unit 9 for L frames (α_1 to α_L).
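Expression (1) itself is not reproduced in this text, so the following is only a sketch of one plausible reading of the preparation step: α_p is taken as β times the ratio of the power of the p-th windowed frame of the measured impulse response h(t) to the power of its first frame. The direction of the ratio, the function names, and the toy impulse response are all assumptions.

```python
def frame_power(h, start, width):
    """Sum of squared samples of h over [start, start + width)."""
    return sum(x * x for x in h[start:start + width])


def subtraction_coefficients(h, width, shift, num_frames, beta):
    """Assumed form: alpha_p = beta * power(frame p) / power(frame 0), p = 1..num_frames."""
    p0 = frame_power(h, 0, width)
    if p0 == 0.0:
        raise ValueError("first frame of the impulse response has zero power")
    return [beta * frame_power(h, p * shift, width) / p0
            for p in range(1, num_frames + 1)]


# Toy impulse response with exponentially decaying taps (illustrative only).
h = [0.5 ** n for n in range(8)]
alphas = subtraction_coefficients(h, width=2, shift=2, num_frames=3, beta=1.0)
```

With this decaying response the coefficients shrink for later frames, which matches the general idea that later reflections carry less energy.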

The speech recognition apparatus of this embodiment recognizes input speech as follows. With the past observation signals Y(f, m-1) to Y(f, m-L) for L frames already stored in the signal storage unit 8, when the observation signal Y(f, m) of the current time frame is input to the subtraction unit 10, the subtraction unit 10 sequentially reads the observation signals Y(f, m-1) to Y(f, m-L) from the signal storage unit 8, reads the subtraction coefficients α_1 to α_L for L frames from the subtraction coefficient storage unit 9, performs subtraction in the power spectrum domain, for example as in expression (2-1), and outputs the estimated signal S_est(f, m).

If the result of the subtraction (given by expression (2-2)) is negative, it may be handled by flooring as in expression (3) or by treating it as a zero signal as in expression (4).

S_est(f, m) = Y(f, m) × 0.5 … (3)
S_est(f, m) = Y(f, m) × 0 … (4)
As described above, the speech recognition apparatus of this embodiment has a subtraction processing unit consisting of the signal storage unit 8, the subtraction coefficient storage unit 9, and the subtraction unit 10, so the speech components that reach the speech input unit 1 late after reflecting off walls, floors and the like can be removed from the observation signal. The estimated signal output by the subtraction unit 10 therefore has the same quality as the standard pattern. The pattern matching unit 5 calculates the similarity between the features extracted from this estimated signal by the speech feature extraction unit 4 and the standard pattern based on standard speech, and the recognition result 6 output on that basis is obtained with a high recognition rate, improving the speech recognition performance of the apparatus.
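Expressions (2-1) and (2-2) are not reproduced in this text. The sketch below assumes the form implied by the abstract, in which the weighted past power spectra are summed and subtracted from the current power spectrum, and then applies the handling of negative results from expressions (3) and (4). The function name and the sample values are hypothetical.

```python
def subtract_reverb(current, past_frames, alphas, floor_factor=0.5):
    """
    current:      power spectrum of the current frame, one value per frequency bin
    past_frames:  past_frames[p-1] is the power spectrum observed p frames ago
    alphas:       alphas[p-1] is the subtraction coefficient for delay p
    floor_factor: 0.5 follows expression (3); 0.0 follows expression (4)
    """
    estimate = []
    for f, y in enumerate(current):
        # Assumed late-reverberation estimate: weighted sum of past frames at bin f.
        late = sum(a * frame[f] for a, frame in zip(alphas, past_frames))
        s = y - late
        if s < 0.0:
            # Negative result: replace it with a scaled copy of the observation.
            s = y * floor_factor
        estimate.append(s)
    return estimate


current = [10.0, 1.0]
past = [[4.0, 8.0], [2.0, 4.0]]
alphas = [0.5, 0.25]
est = subtract_reverb(current, past, alphas)
est_zero = subtract_reverb(current, past, alphas, floor_factor=0.0)
```

In this toy input the first bin stays positive after subtraction, while the second bin goes negative and is replaced by the floored value.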

Because the subtraction uses a plurality of frames, it can handle multiple delayed components arriving over various reflection paths such as floors and walls, which further improves recognition performance.

Furthermore, because the subtraction coefficient α_p is calculated from the transfer function, the subtraction can be performed more accurately, which improves recognition performance.

Although α_p is calculated as described above from the transfer function measured in the environment where the speech recognition apparatus is actually used, it may instead be, for example, the average of the α_p values calculated in a plurality of environments in which the apparatus may be used.

In addition, applying an inverse frequency conversion to the estimated signal obtained by the subtraction yields a speech signal with reduced reverberation, so the technique can also be applied to hands-free telephones, intercoms, and the like, not only to speech recognition apparatuses.
(Embodiment 2)
In addition to the configuration of Embodiment 1, this embodiment is characterized in that, as shown in FIG. 4, a filter unit 11 is provided between the signal storage unit 8 and the subtraction unit 10 to smooth the past-frame observation signal Y(f, m-p) stored in the signal storage unit 8 along the frequency axis.

The rest of the configuration is the same as in Embodiment 1, so components common to Embodiment 1 are given the same reference numerals and their description is omitted.

When the subtraction unit 10 reads the observation signal Y(f, m-p) of a past frame stored in the signal storage unit 8, the filter unit 11 also reads the signals at the adjacent frequencies, Y(f-1, m-p) and Y(f+1, m-p), from the signal storage unit 8 and outputs to the subtraction unit 10 a signal smoothed, for example, as in expression (5).
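Expression (5) is not reproduced in this text. As an illustration only, the sketch below assumes an equally weighted three-point average over the bins f-1, f and f+1; the actual weights, and the treatment of the edge bins, are assumptions, and `smooth_frequency` is a hypothetical name.

```python
def smooth_frequency(frame):
    """Three-point moving average along the frequency axis.

    Edge bins average only the neighbours that exist (an assumed edge rule).
    """
    n = len(frame)
    out = []
    for f in range(n):
        lo, hi = max(0, f - 1), min(n, f + 2)
        window = frame[lo:hi]
        out.append(sum(window) / len(window))
    return out


# Energy concentrated in bin 1 is spread to the neighbouring bins.
smoothed = smooth_frequency([0.0, 3.0, 0.0, 0.0])
```

After smoothing, the neighbouring bins carry part of the energy, so a delayed component that has drifted by one bin can still be covered by the subtraction.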

FIG. 5 shows how frequency components behave as a speech signal travels through a space: FIG. 5(a) shows the case with no frequency fluctuation, and FIG. 5(b) shows the case where the frequency fluctuates when the speech is reflected by a wall or the like.

As shown in (i) of FIG. 5(a), when the frequency-f_0 component (● mark) of speech uttered by the user K frames earlier travels through the space, is reflected by walls and the like, and arrives attenuated with a delay of K frames, it is mixed into the current frame. In the current frame, as shown in (ii) of FIG. 5(a), the frequency-f_0 component (▲ mark) of the speech the user is currently uttering is also observed, so the observation signal is a mixture of the two, and reverberant speech as described above is observed.

When speech is reflected by a wall or the like, however, the frequency may fluctuate: the frequency-f_0 component (● mark) of speech uttered by the user K frames earlier, shown in (i) of FIG. 5(b), is mixed into, for example, the component at frequency f_0 - 1 (△ mark) of the current frame, as shown in (ii) of FIG. 5(b). The filter unit 11 therefore performs the smoothing based on expression (5) described above.

The processing from the estimated signal Sest(f, m) obtained by the subtraction unit 10 through to the recognition result is the same as in Embodiment 1, so its description is omitted.

As described above, in the speech recognition apparatus of the present embodiment, even when the delay component caused by reverberation fluctuates along the frequency axis, the filter unit 11 smooths the observation signal Y(f, m−p) of the past time frame used for subtraction into Yave(f, m−p). The subtraction unit 10 can therefore subtract the delay component from the frequency components of the current observation signal Y(f, m), so the delay component is removed and recognition performance improves.
(Embodiment 3)
Embodiment 1 described above uses the observation signals of past time frames in the subtraction process. The present embodiment is characterized in that it uses past estimated signals for the subtraction instead of the observation signals.

That is, in the present embodiment, as shown in FIG. 6, the estimated signal Sest(f, m) output as the subtraction result of the subtraction unit 10 is output to the speech feature extraction unit 4 and also to the signal storage unit 8.

The signal storage unit 8 stores the estimated signals output from the subtraction unit 10 for the past L frames and outputs them sequentially in response to read requests from the subtraction unit 10.

The rest of the configuration is the same as in Embodiment 1, so common components are given the same reference numerals and their description is omitted.

Thus, the subtraction unit 10 of the speech recognition apparatus of the present embodiment performs the same subtraction process as in Embodiment 1, using the estimated signal Sest(f, m−p) read from the signal storage unit 8 together with the corresponding subtraction coefficient α_p read from the subtraction coefficient storage unit 9.

In this way, the signal used for the subtraction process in the subtraction unit 10 is not the observation signal Y(f, m−p) but the estimated signal Sest(f, m−p), from which the delay component caused by reflection has already been removed. The speech recognition apparatus of the present embodiment can therefore subtract only the delay component more accurately, and recognition performance improves as a result.

In the present embodiment, the signal storage unit 8 initially holds no estimated signals Sest(f, m−p) for past time frames and is in a zero-signal state. The subtraction result at this time is therefore Sest(f, m) = Y(f, m) × e^{j arg(Y(f, m))}, and this is stored in the signal storage unit 8 as the estimated signal of one frame earlier. Thereafter, until the estimated signals Sest(f, m−1) to Sest(f, m−L) for L frames have been stored, a zero signal is used as the estimated signal for any past time frame for which no estimated signal is stored.
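The recursive subtraction of this embodiment, including the zero-signal start-up, might look like the following sketch. It assumes that the subtraction is carried out on power spectra and that the resulting magnitude is recombined with the phase of the observed signal. The flooring of negative differences at zero, the coefficient values, and the name `dereverberate` are hypothetical and not taken from the source.

```python
import cmath
import math
from collections import deque


def dereverberate(frames, alphas):
    """Subtract scaled past *estimated* signals from each observed frame.

    frames -- list of frames; each frame is a list of complex Y(f, m)
    alphas -- [alpha_1, ..., alpha_L], hypothetical subtraction coefficients
    """
    n_bins = len(frames[0]) if frames else 0
    # The history starts as L all-zero frames (the initial zero-signal state).
    history = deque([[0j] * n_bins for _ in alphas], maxlen=len(alphas))
    estimates = []
    for y in frames:
        s = []
        for f, y_f in enumerate(y):
            # history[0] is frame m-1, history[1] is frame m-2, and so on.
            reverb_power = sum(a * abs(past[f]) ** 2
                               for a, past in zip(alphas, history))
            # Flooring at zero is an assumption; the source does not say
            # how negative differences are handled.
            power = max(abs(y_f) ** 2 - reverb_power, 0.0)
            s.append(cmath.rect(math.sqrt(power), cmath.phase(y_f)))
        history.appendleft(s)  # the oldest frame drops off the right end
        estimates.append(s)
    return estimates


# One frequency bin, L = 1, alpha_1 = 0.25 (illustrative values only).
est = dereverberate([[2 + 0j], [1.5 + 0j], [1 + 0j]], [0.25])
print([abs(frame[0]) ** 2 for frame in est])
```

In the first frame the history is all zero, so the estimate equals the observation, which matches the start-up behaviour described above. From the second frame on, the subtracted term is built from earlier estimates rather than earlier observations.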

In the present embodiment, a filter unit 11 similar to that of Embodiment 2 may also be provided to smooth the estimated signal Sest(f, m−p) output from the signal storage unit 8.
(Embodiment 4)
First, in Embodiment 1, when only one frame is used as the past observation signal for the subtraction process, the subtraction processing unit performs the subtraction using the observation signal Y(f, m−K) from K frames earlier, where K is any one of 1, 2, …, L.

The present embodiment differs from Embodiment 1 in that the value of K described above is changed for each frequency band or for each frequency. In the present embodiment, for example, mutually different values K₁, K₂, and K₃ are selected as K.

For frequencies 0 to f₁, the observation signals Y(0, m−K₁), Y(1, m−K₁), …, Y(f₁, m−K₁) from K₁ frames earlier are stored in the signal storage unit 8. For the frequency band f₁+1 to f₂, the observation signals Y(f₁+1, m−K₂), Y(f₁+2, m−K₂), …, Y(f₂, m−K₂) from K₂ frames earlier are stored in the signal storage unit 8. For the frequency f₂+1, the observation signal Y(f₂+1, m−K₃) from K₃ frames earlier is stored in the signal storage unit 8.

Thus, when the signal storage unit 8 receives the frequency-converted signal from the frequency conversion unit 3, it delays each observation signal by the number of time frames specified for its frequency band and stores it as described above. When it receives a read signal from the subtraction unit 10, it outputs the stored observation signals sequentially to the subtraction unit 10. The subtraction coefficients held in the subtraction coefficient storage unit 9 are stored in a corresponding per-band form.
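The per-band choice of delay can be pictured as a lookup from frequency bin to K, followed by a read of the matching past frame. The sketch below is illustrative only: the band edges, the delay values, the zero fill for frames that do not yet exist, and the names `delay_for_bin` and `reverb_reference` are assumptions rather than details from the source.

```python
import bisect


def delay_for_bin(f, band_edges, delays):
    """Return the frame delay K used for frequency bin f.

    band_edges -- upper bin index of every band except the last, e.g. [f1, f2]
    delays     -- one delay per band, e.g. [K1, K2, K3]
    """
    # bisect_left finds the first band whose upper edge is >= f.
    return delays[bisect.bisect_left(band_edges, f)]


def reverb_reference(stored, m, band_edges, delays):
    """Collect Y(f, m - K) for every bin f, with K chosen per band.

    stored -- stored[n][f] holds the observed value of bin f in frame n
    """
    reference = []
    for f in range(len(stored[m])):
        n = m - delay_for_bin(f, band_edges, delays)
        # Frames before the start of the recording are treated as zero.
        reference.append(stored[n][f] if n >= 0 else 0.0)
    return reference


# Five bins, bands 0-1, 2-3 and 4, with delays K1=1, K2=2, K3=3.
stored = [[10 * n + f for f in range(5)] for n in range(4)]
ref = reverb_reference(stored, 3, [1, 3], [1, 2, 3])
print(ref)
```

Each entry of `ref` would then be multiplied by the subtraction coefficient for its band and subtracted from the current frame, as in the single-frame case of Embodiment 1.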

The rest of the configuration and operation is the same as in Embodiment 1, so FIG. 1 is referred to for the configuration and further description is omitted.

As described above, the speech recognition apparatus of the present embodiment can store the signals corresponding to the delay components caused by reflection of speech and the like with a different delay for each frequency band. Delay components that differ from frequency to frequency can therefore be subtracted accurately, and recognition performance improves as a result.

As in Embodiment 3, the estimated signal Sest(f, m−p) corresponding to a past predetermined frame may be used in place of the observation signal Y(f, m−p) as the signal multiplied by the subtraction coefficient α_p.

FIG. 1 is a circuit configuration diagram of Embodiments 1 and 4.
FIG. 2 is a diagram showing a measurement example of a transfer function in the use environment.
FIG. 3 is an explanatory diagram of the relationship between the attenuation coefficient values used in Embodiment 1 and the frames.
FIG. 4 is a circuit configuration diagram of Embodiment 2.
FIG. 5 is an explanatory diagram of frequency fluctuation as a speech signal travels through a space.
FIG. 6 is a circuit configuration diagram of Embodiment 3.
FIG. 7 is a circuit configuration diagram of a conventional example.

Explanation of symbols

1 Speech input unit
2 A/D conversion unit
3 Frequency conversion unit
4 Speech feature extraction unit
5 Pattern matching unit
6 Recognition result
7 Standard pattern storage unit
8 Signal storage unit
9 Subtraction coefficient storage unit
10 Subtraction unit

Claims

1. A speech recognition apparatus comprising: a speech input unit that captures speech in a reverberant environment; an A/D conversion unit that A/D-converts the output signal from the speech input unit; a frequency conversion unit that frequency-converts the output signal from the A/D conversion unit and outputs an observation signal of a time frame; a subtraction processing unit that stores the observation signal of a past predetermined time frame, subtracts from the observation signal of the current time frame a signal obtained by multiplying the observation signal of the past predetermined time frame by a predetermined subtraction coefficient, and outputs the result as an estimated signal; a speech feature extraction unit that extracts a speech feature amount from the estimated signal output from the subtraction processing unit; a standard pattern storage unit that stores a standard pattern created from standard speech; and a pattern matching unit that obtains the similarity between the feature amount extracted by the speech feature extraction unit and the standard pattern stored in the standard pattern storage unit and outputs a recognition result.

2. A speech recognition apparatus comprising: a speech input unit that captures speech in a reverberant environment; an A/D conversion unit that A/D-converts the output signal from the speech input unit; a frequency conversion unit that frequency-converts the output signal from the A/D conversion unit and outputs an observation signal of a time frame; a subtraction processing unit that subtracts from the observation signal of the current time frame a signal obtained by multiplying a predetermined signal corresponding to a past predetermined time frame by a predetermined subtraction coefficient, outputs the result as an estimated signal, and stores the estimated signal for use as the predetermined signal; a speech feature extraction unit that extracts a speech feature amount from the estimated signal output from the subtraction processing unit; a standard pattern storage unit that stores a standard pattern created from standard speech in environments with different reverberation times; and a pattern matching unit that obtains the similarity between the feature amount extracted by the speech feature extraction unit and the standard pattern stored in the standard pattern storage unit and outputs a recognition result.

3. The speech recognition apparatus, wherein the subtraction coefficient is a value obtained by multiplying, by a predetermined coefficient, the power ratio between the observation signal of the first frame and the observation signal of a predetermined frame of the transfer function of the reverberant environment.

4. The speech recognition apparatus according to claim 1, wherein the past predetermined time frame used for subtraction comprises a plurality of frames, the signal corresponding to each predetermined time frame is multiplied by the predetermined subtraction coefficient corresponding to that time frame, the multiplication results are added together, and the sum is subtracted from the observation signal of the current time frame.

5. The speech recognition apparatus according to claim 1, wherein the subtraction processing unit has a function of mitigating fluctuations of reverberation by smoothing in the subtraction process.

6. The speech recognition apparatus according to claim 1, wherein, when one past predetermined time frame is used to output the estimated signal, the subtraction processing unit selects the past predetermined time frame for each frequency band or each frequency on which the subtraction process is performed.

7. A speech recognition method that obtains a recognition result by obtaining the similarity between a feature amount of speech captured in a reverberant environment and a standard pattern created from standard speech, the method comprising: a step of frequency-converting the speech signal captured in the reverberant environment; a step of obtaining an estimated signal by subtracting, from the observation signal of the current time frame obtained by the frequency conversion, a signal obtained by multiplying the observation signal of a past predetermined time frame by a predetermined subtraction coefficient; a step of extracting a speech feature amount from the estimated signal; and a step of performing speech recognition based on the similarity between the extracted feature amount and the standard pattern.

8. A speech recognition method that obtains a recognition result by obtaining the similarity between a feature amount of speech captured in a reverberant environment and a standard pattern created from standard speech, the method comprising: a step of frequency-converting the speech signal captured in the reverberant environment; a step of obtaining an estimated signal by subtracting, from the observation signal of the current time frame obtained by the frequency conversion, a signal obtained by multiplying a predetermined signal corresponding to a past predetermined time frame by a predetermined subtraction coefficient; a step of storing the estimated signal for use as the predetermined signal; a step of extracting a speech feature amount from the estimated signal; and a step of performing speech recognition based on the similarity between the extracted feature amount and the standard pattern.