JP3354252B2

JP3354252B2 - Voice recognition device

Info

Publication number: JP3354252B2
Application number: JP33059193A
Authority: JP
Inventors: 敬有吉
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 1993-12-27
Filing date: 1993-12-27
Publication date: 2002-12-09
Anticipated expiration: 2017-12-09
Also published as: JPH07191696A

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Field of Industrial Application] The present invention relates to a speech recognition device that recognizes input speech by comparing the feature quantities of the input speech with those of standard patterns prepared in advance.

[0002]

[Prior Art] In recent years, speech recognition technology for recognizing the phonological information of human spoken language has been actively studied, and speech recognition devices that apply this technology to concrete equipment are being developed. To recognize phonological information, a commonly used method is to prepare in advance a plurality of standard patterns in units such as words or syllables, compare the unknown input speech with each standard pattern, find the standard pattern most similar to the input speech, and judge that this standard pattern is the speech that was uttered.

[0003] Techniques for recognizing words, syllables, and the like can be classified into two types: isolated word speech recognition, which recognizes words uttered with pauses between them, and continuous word speech recognition, which recognizes specific words within continuously uttered speech. When a speech recognition device is put into practical use, ambient noise and unnecessary words that the speaker may utter must be taken into account, so it is desirable that the device be capable of continuous word speech recognition.

[0004] As a continuous word speech recognition technique that recognizes speech while excluding ambient noise and unnecessary words uttered by the speaker, word spotting such as the continuous DP method has long been known (Sadaoki Furui, "Digital Speech Processing", Tokai University Press, Chapter 8). Here, word spotting is a technique that searches speech for units such as words or syllables and extracts predetermined words. The continuous DP (continuous dynamic programming) method is a form of continuous word speech recognition in which the input speech, converted into a parameter sequence such as a spectrum, is shifted one frame at a time from its start and DP-matched (dynamic programming matching) against standard patterns such as words or syllables. When the distance obtained as the matching result falls to or below a certain threshold, a word or syllable of that standard pattern is judged to exist at that point in time.
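
The frame-by-frame matching and threshold test described above can be sketched as follows. This is a simplified illustration, not the exact continuous DP formulation: the function name `spot_word`, the Euclidean local distance, the three-way path constraint, and the normalisation by template length are all assumptions made for the example.

```python
def spot_word(frames, template, threshold):
    """Return the input-frame indices at which the template is judged to end."""

    def dist(a, b):
        # Euclidean distance between two feature vectors (assumed local distance).
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    n_ref = len(template)
    inf = float("inf")
    prev = [inf] * n_ref  # accumulated distances for the previous input frame
    hits = []
    for t, frame in enumerate(frames):
        cur = [inf] * n_ref
        for j in range(n_ref):
            d = dist(frame, template[j])
            if j == 0:
                # Start-point free: the template may begin at any input frame.
                cur[j] = d
            else:
                cur[j] = d + min(prev[j], prev[j - 1], cur[j - 1])
        # Normalise by template length so the threshold does not depend on it.
        if cur[-1] / n_ref <= threshold:
            hits.append(t)
        prev = cur
    return hits
```

For example, with the one-dimensional template `[[0.0], [1.0], [2.0]]` embedded in a longer input, the function reports only the frame at which the embedded pattern ends.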

[0005] When input speech is compared with standard patterns in speech recognition, it is common not to compare the speech waveforms themselves but to remove phase information from the waveform and convert it into spectrum-related features. This is because comparing the waveforms themselves involves too much information, the phase information of a waveform changes easily with the transmission and recording systems, and such phase information contributes little to human perception of speech.

[0006] The short-time spectrum extracted at fixed intervals is generally used as the spectrum-related feature. The short-time spectrum is the power spectral density of each short-time segment of speech. It can be analyzed by decomposing it into the product (the sum on a logarithmic scale) of the spectral envelope, a component that varies slowly with frequency, and the spectral fine structure, a component that varies rapidly with frequency. Of these, the spectral fine structure is unstable because it is affected by pitch and other factors. For this reason, speech recognition generally extracts the spectral envelope from the short-time spectrum and uses the spectral envelope as the speech feature.
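
The decomposition described above can be written compactly. The symbols are assumptions introduced for illustration ($S$ for the short-time power spectrum, $E$ for the spectral envelope, $F$ for the spectral fine structure); the source does not name them.

```latex
S(\omega) = E(\omega)\,F(\omega)
\quad\Longrightarrow\quad
\log S(\omega) = \log E(\omega) + \log F(\omega)
```

Because the two components add in the log domain and vary at different rates along the frequency axis, a transform applied to $\log S(\omega)$ can separate them approximately.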

[0007] There are various methods for extracting the spectral envelope, one of which is cepstrum analysis. The cepstrum is defined as the inverse Fourier transform of the logarithm of the short-time amplitude spectrum of a waveform, and is characterized by its ability to approximately separate the spectral envelope from the spectral fine structure. As a related method for extracting the spectral envelope, attempts have been made in recent years to use a cepstrum computed from a log spectrum resampled on the mel frequency scale; such a cepstrum is called a mel cepstrum. A special form of cepstrum analysis is LPC cepstrum analysis (LPC stands for linear predictive coding). In contrast to the cepstrum computed directly from the waveform, that is, the FFT cepstrum (FFT stands for fast Fourier transform), the LPC cepstrum is a cepstrum based on a linear prediction model, and is characterized by yielding an envelope spectrum that emphasizes spectral peaks more than the envelope spectrum obtained from the FFT cepstrum. In other words, the method focuses on the fact that information important for speech recognition lies in the spectral peaks, and by emphasizing those peaks it makes the distance measure more sensitive and achieves more accurate speech recognition.
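
The definition of the cepstrum given above (inverse Fourier transform of the log amplitude spectrum) can be illustrated with a minimal sketch. The function name `real_cepstrum` and the `floor` parameter are assumptions for this example, and a naive O(n²) DFT is used only to keep the code dependency-free; a practical implementation would use an FFT.

```python
import cmath
import math


def real_cepstrum(frame, floor=1e-12):
    """Real cepstrum: inverse DFT of the log amplitude spectrum of one frame."""
    n = len(frame)
    # Forward DFT (naive form, to avoid any external dependency).
    spectrum = [
        sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(frame))
        for k in range(n)
    ]
    # Log amplitude spectrum; the floor guards against log(0).
    log_amp = [math.log(max(abs(s), floor)) for s in spectrum]
    # Inverse DFT; for a real input frame the result is real, so keep the real part.
    return [
        (sum(a * cmath.exp(2j * math.pi * k * i / n) for k, a in enumerate(log_amp)) / n).real
        for i in range(n)
    ]
```

A unit impulse has a flat amplitude spectrum, so all of its cepstrum coefficients are zero; scaling the impulse changes only the 0th coefficient.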

[0008]

[Problems to be Solved by the Invention] As described above, techniques such as cepstrum analysis, mel-cepstrum analysis, and LPC cepstrum analysis are used to detect speech feature quantities, that is, to extract the spectral envelope of the short-time spectrum. The speech features used are then the cepstrum coefficients, mel-cepstrum coefficients, and LPC cepstrum coefficients, respectively. However, because these coefficients are features that do not depend on the input level of the speech, in silent intervals, such as those that occur when a geminate consonant is uttered, and in intervals where the speech level is low, the features of ambient noise affect the features of the input speech and can cause misrecognition. For example, in a silent interval, ambient noise may increase the distance to the standard pattern of the word that corresponds to the input speech, so that the word is wrongly judged not to correspond, or may decrease the distance to the standard pattern of a word that does not correspond to the input speech, so that the word is wrongly judged to correspond. This is an obstacle to accurate speech recognition.

[0009]

[Means for Solving the Problems] The invention according to claim 1 comprises: an acoustic analysis unit that performs acoustic analysis on an input acoustic signal frame by frame to obtain the acoustic features of the signal; a voice detection unit that obtains, frame by frame, the degree of voice-likeness of the input acoustic signal; a noise processing unit that converts to white noise the acoustic features of intervals that the voice detection unit judges to have a low degree of voice-likeness; a standard pattern storage unit that stores standard patterns of speech; and a matching processing unit that matches the time series of the standard patterns stored in the standard pattern storage unit against the time series of acoustic features processed by the noise processing unit.

[0010] The invention according to claim 2 is the invention according to claim 1 in which the acoustic analysis unit performs cepstrum analysis to obtain cepstrum coefficients as the acoustic features for each frame, and the noise processing unit converts the acoustic features to white noise by reducing the cepstrum coefficients of frames for which the degree of voice-likeness obtained by the voice detection unit is low.

[0011] The invention according to claim 3 is the invention according to claim 1 in which the acoustic analysis unit performs mel-cepstrum analysis to obtain mel-cepstrum coefficients as the acoustic features for each frame, and the noise processing unit converts the acoustic features to white noise by reducing the mel-cepstrum coefficients of frames for which the degree of voice-likeness obtained by the voice detection unit is low.

[0012] The invention according to claim 4 is the invention according to claim 1 in which the acoustic analysis unit performs spectrum analysis to obtain, for each frame, a short-time spectrum corrected to remove spectral tilt as the acoustic features, and the noise processing unit converts the acoustic features to white noise by reducing the short-time spectrum of frames for which the degree of voice-likeness obtained by the voice detection unit is low.

[0013] The invention according to claim 5 is the invention according to claim 1 in which the voice detection unit judges that the smaller the power of the input acoustic signal, the lower the degree of voice-likeness, and the noise processing unit converts the acoustic features to white noise more strongly the lower the degree of voice-likeness judged by the voice detection unit.

[0014] The invention according to claim 6 is the invention according to claim 1 in which the standard patterns stored in the standard pattern storage unit are standard patterns generated through processing equivalent to the white-noise conversion of acoustic features in the noise processing unit.

[0015] The invention according to claim 7 is the invention according to claim 1 in which the matching processing in the matching processing unit is word spotting.

[0016]

[Operation] In the invention according to claim 1, for each frame, the acoustic analysis unit obtains the acoustic features of the input acoustic signal and the voice detection unit obtains the degree of voice-likeness of that signal. If the acoustic signal contains an interval judged to have a low degree of voice-likeness, the noise processing unit converts the acoustic features of that interval to white noise. An interval with a low degree of voice-likeness is one that is silent or has a low speech level, and converting the acoustic features to white noise means flattening the spectrum that the features represent. Therefore, when the signal is silent or the speech level is low, the spectrum serving as the acoustic feature is flattened and the influence of ambient noise is removed. The matching processing unit compares the time series of features processed in this way with the time series of the standard patterns stored in the standard pattern storage unit and performs matching. As a result, accurate speech recognition is achieved regardless of whether ambient noise is present.

[0017] In the invention according to claim 2, cepstrum analysis, which uses cepstrum coefficients as the acoustic features, is chosen as the acoustic analysis method, and the acoustic features are converted to white noise by reducing the cepstrum coefficients. In the invention according to claim 3, mel-cepstrum analysis, which uses mel-cepstrum coefficients as the acoustic features, is chosen as the acoustic analysis method, and the acoustic features are converted to white noise by reducing the mel-cepstrum coefficients. In the invention according to claim 4, spectrum analysis, which uses a spectrum with the spectral tilt removed as the acoustic features, is chosen as the acoustic analysis method, and the acoustic features are converted to white noise by reducing this spectrum. Therefore, in the inventions according to claims 2, 3, and 4, accurate speech recognition based on stable acoustic features is achieved, and the white-noise conversion of the acoustic features is easy.

[0018] In the invention according to claim 5, when the voice detection unit judges the degree of voice-likeness, it judges that the smaller the power of the input acoustic signal, the lower the degree of voice-likeness, and the noise processing unit determines the degree of white-noise conversion of the acoustic features according to the power of the input acoustic signal. That is, the smaller the power of the acoustic signal, the more strongly its features are converted to white noise. This yields more accurate speech recognition.

[0019] In the invention according to claim 6, the standard patterns stored in the standard pattern storage unit are generated through processing equivalent to the white-noise conversion of acoustic features in the noise processing unit, so the standard patterns are easy to generate. Moreover, standard patterns that closely approximate the features of the acoustic signals actually input can be prepared, which yields more accurate speech recognition.

[0020] In the invention according to claim 7, the matching processing unit performs word spotting when matching the time series of speech features against the time series of the standard patterns. As a result, when an utterance containing a word, syllable, or the like for which a standard pattern has been generated is spoken, that word or the like is extracted from the utterance and recognized.

[0021]

[Embodiment] An embodiment of the present invention is described with reference to FIG. 1. FIG. 1 is a block diagram of the components. An A/D conversion unit 2 (A/D stands for analog-to-digital) is connected to a voice input unit 1 for inputting speech, and an acoustic preprocessing unit 3 and an acoustic analysis unit 4 are connected in that order to the A/D conversion unit 2. A voice detection unit 5 is also connected to the acoustic preprocessing unit 3, and a noise processing unit 6 is connected to the voice detection unit 5 and the acoustic analysis unit 4. A standard pattern storage unit 7 is provided; the standard pattern storage unit 7 and the noise processing unit 6 are connected to a matching processing unit 8, which in turn is connected to a recognition result output unit 9.

[0022] The voice input unit 1 is, for example, a microphone, and outputs the acoustic signal it receives, which is an analog signal, to the A/D conversion unit 2.

[0023] The A/D conversion unit 2 converts the acoustic signal from the voice input unit 1 into a digital signal by performing sampling, quantization, and encoding. The conversion conditions in the A/D conversion unit 2 are, for example, a sampling frequency of 16 kHz and 16 quantization bits. The signal is band-limited to the low-frequency range before sampling, so that sampling obeys the sampling theorem and aliasing distortion is prevented.

[0024] Next, the acoustic preprocessing unit 3 applies high-frequency emphasis (pre-emphasis) to the input acoustic signal that has been converted into a digital signal by the A/D conversion unit 2. The acoustic preprocessing unit 3 consists of, for example, a first-order digital filter with the transfer function

$H(z) = 1 - z^{-1}$ (Equation 1)

or a difference calculation circuit.
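
In the time domain, the transfer function of Equation 1 corresponds to taking the difference between each sample and the previous one. A minimal sketch follows; the function name `pre_emphasize` and the assumption that the signal is zero before the first sample are illustrative choices, not taken from the source.

```python
def pre_emphasize(samples):
    """Apply H(z) = 1 - z^-1: each output is the difference from the previous sample."""
    out = []
    prev = 0.0  # assumption: the signal is zero before the first sample
    for x in samples:
        out.append(x - prev)
        prev = x
    return out
```

A constant (zero-frequency) input is suppressed after the first sample, while abrupt changes pass through, which is the high-frequency emphasis described above.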

[0025] Next, the acoustic analysis unit 4 is a computation unit that extracts the features of the input acoustic signal; it extracts the spectral envelope of the short-time spectrum of the signal. In this embodiment, it performs LPC cepstrum analysis on the acoustic signal and obtains, for each frame, a cepstrum vector $c_t$ (excluding the 0th order) as the cepstrum coefficients. The analysis conditions in the acoustic analysis unit 4 are as follows:

Frame period: 10 ms
Window length: 16 ms
Window function: Hamming window
LPC analysis order: 14
Cepstrum order: 14
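
The framing implied by these conditions can be sketched as follows. At the 16 kHz sampling frequency given for the A/D conversion unit 2, a 10 ms frame period corresponds to 160 samples and a 16 ms window to 256 samples. The function names and the decision to drop a trailing partial frame are assumptions made for this example.

```python
import math

SAMPLE_RATE = 16000                      # Hz, from the A/D conversion conditions
FRAME_SHIFT = SAMPLE_RATE * 10 // 1000   # 10 ms frame period -> 160 samples
WINDOW_LEN = SAMPLE_RATE * 16 // 1000    # 16 ms window       -> 256 samples


def hamming(n):
    """Hamming window of length n."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def split_frames(samples):
    """Cut the signal into overlapping, Hamming-windowed frames."""
    window = hamming(WINDOW_LEN)
    frames = []
    start = 0
    # Assumption: a trailing partial frame is simply dropped.
    while start + WINDOW_LEN <= len(samples):
        segment = samples[start:start + WINDOW_LEN]
        frames.append([s * w for s, w in zip(segment, window)])
        start += FRAME_SHIFT
    return frames
```

Each frame would then be passed to the LPC cepstrum analysis to produce one 14-dimensional cepstrum vector per 10 ms.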

[0026] Next, the voice detection unit 5 obtains the degree of voice-likeness of the acoustic signal on the basis of the average power in each frame of the input acoustic signal that has been high-frequency-emphasized by the acoustic preprocessing unit 3. The frame average power can be obtained from the 0th-order autocorrelation coefficient computed during LPC analysis. In the voice detection unit 5, the relationship between the frame average power p and the degree of voice-likeness v is defined by Equation 2 below.

[0027]

(Equation 2)

[0028] In Equation 2, $p_0$ is an experimentally determined constant, set slightly larger than the power at the start and end points of a speech interval. As the equation shows, the degree of voice-likeness v satisfies 0 ≤ v ≤ 1: v is 1 when the frame average power p is sufficiently large, v is 0 when p is 0, and in between v decreases monotonically as p decreases.
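
Equation 2 itself is not reproduced in this text, so the following is only a hypothetical stand-in that satisfies the properties stated above (0 ≤ v ≤ 1, v = 0 at p = 0, v = 1 for sufficiently large p, non-decreasing in between). The clipped linear ramp and the function name `voice_likeness` are assumptions and should not be read as the patent's actual formula.

```python
def voice_likeness(p, p0):
    """Hypothetical stand-in for Equation 2: 0 at p = 0, rising to 1 once p reaches p0."""
    if p0 <= 0:
        raise ValueError("p0 must be positive")
    # Clipped linear ramp (an assumed shape, not the source's equation).
    return min(1.0, max(0.0, p / p0))
```

Any other monotone mapping with the same end behaviour would serve equally well for illustration; the choice of p0 controls how quickly low-power frames are treated as non-speech.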

[0029] Next, the noise processing unit 6 converts the acoustic features obtained by the acoustic analysis unit 4 to white noise according to the degree of voice-likeness obtained by the voice detection unit 5. The noise processing unit 6 performs the calculation

$c^*_t = v\,c_t$ (Equation 3)

Here, as described above, $c_t$ is the acoustic feature obtained by the acoustic analysis unit 4, namely the cepstrum vector, and v is the degree of voice-likeness obtained by the voice detection unit 5. $c^*_t$ is the cepstrum vector converted to white noise according to the degree of voice-likeness in the input acoustic signal. As the equation shows, the noise processing unit 6 determines the cepstrum vector $c^*_t$ as the product of the cepstrum vector $c_t$ and the degree of voice-likeness v. The cepstrum vector of white noise is 0, that is, the zero vector. Therefore, the lower the degree of voice-likeness v, the more strongly the cepstrum vector $c_t$ is converted to white noise.
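
Equation 3 is a per-frame scalar multiplication of the cepstrum vector. A minimal sketch, in which the function name `whiten` and the list-of-floats representation are assumptions:

```python
def whiten(cepstrum, v):
    """Equation 3: scale one frame's cepstrum vector c_t by the voice-likeness v."""
    return [v * c for c in cepstrum]


def whiten_sequence(cepstra, likeness):
    """Apply Equation 3 frame by frame to a time series of cepstrum vectors."""
    return [whiten(c, v) for c, v in zip(cepstra, likeness)]
```

With v = 1 a frame is passed through unchanged; with v = 0 it becomes the zero vector, which the text identifies with white noise.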

[0030] Next, the standard pattern storage unit 7 stores a large number of standard patterns of the words, syllables, and the like to be recognized. These standard patterns are cepstrum vectors $c^*_r$ whose content is equivalent to the time series of cepstrum vectors $c^*_t$ obtained when speech is input to the voice input unit 1, converted into cepstrum vectors $c_t$ by the acoustic analysis unit 4, and subjected to the prescribed processing by the noise processing unit 6; they are generated through the same processing as the time series of cepstrum vectors $c^*_t$.

[0031] Next, the matching processing unit 8 performs matching between the standard patterns stored in the standard pattern storage unit 7, that is, the time series of cepstrum vectors $c^*_r$, and the acoustic features processed by the noise processing unit 6, that is, the time series of cepstrum vectors $c^*_t$. The matching in the matching processing unit 8 uses the continuous DP method. A distance measure such as the group delay spectrum distance measure is used.

[0032] Next, the recognition result output unit 9 outputs the recognition result of the matching processing unit 8, for example by outputting the presence or absence of the relevant word or the like as a signal or on a display.

[0033] In this configuration, the acoustic signal input to the voice input unit 1 is converted to digital form by the A/D conversion unit 2, where it is sampled, quantized, and encoded. The acoustic preprocessing unit 3 then applies high-frequency emphasis, which flattens the spectral tilt. This compresses the dynamic range of the acoustic signal and raises the effective SNR (signal-to-quantization-noise ratio).

[0034] Next, the acoustic analysis unit 4 performs LPC cepstrum analysis on the high-frequency-emphasized acoustic signal and extracts its features as the cepstrum vector $c_t$. At the same time, the voice detection unit 5 uses Equation 2 to obtain the degree of voice-likeness v of the high-frequency-emphasized acoustic signal from the average power p of each frame.
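
The LPC cepstrum analysis step can be sketched with the textbook autocorrelation method: autocorrelation, the Levinson-Durbin recursion, and the recursion from predictor coefficients to cepstrum coefficients. The source does not specify which LPC algorithm is used, so this choice, the function names, and the sign convention A(z) = 1 + a[1]z⁻¹ + ... are assumptions. The 0th-order autocorrelation r[0] divided by the frame length gives one possible frame average power p (the normalisation is likewise assumed).

```python
def autocorrelation(frame, order):
    """r[k] = sum over n of frame[n] * frame[n + k], for k = 0..order."""
    n = len(frame)
    return [sum(frame[i] * frame[i + k] for i in range(n - k)) for k in range(order + 1)]


def levinson_durbin(r, order):
    """Predictor polynomial A(z) = 1 + a[1] z^-1 + ... + a[order] z^-order."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0.0:
            break  # silent or degenerate frame: leave remaining coefficients at zero
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err
        prev = a[:]
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        err *= 1.0 - k * k
    return a


def lpc_cepstrum(a, order):
    """Cepstrum c[1..order] of the all-pole model 1 / A(z); the 0th order is omitted."""
    c = [0.0] * (order + 1)
    for n in range(1, order + 1):
        acc = sum(k * c[k] * a[n - k] for k in range(1, n))
        c[n] = -a[n] - acc / n
    return c[1:]
```

In this embodiment the order would be 14 for both steps. An autocorrelation sequence that is zero at every non-zero lag, as for ideal white noise, yields an all-zero cepstrum vector, consistent with the statement that the zero vector represents white noise.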

[0035] The cepstrum vector $c_t$ and the degree of voice-likeness v obtained in this way are sent to the noise processing unit 6, which computes Equation 3 to obtain the white-noise-converted cepstrum vector $c^*_t$. The lower the degree of voice-likeness obtained by the voice detection unit 5, the more strongly the cepstrum vector $c^*_t$ is converted to white noise. Because a low degree of voice-likeness means that the interval is silent or has a low speech level, silent intervals and low-level intervals are converted to white noise and their spectra are flattened.
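
Why scaling the cepstrum vector flattens the spectrum can be seen from the standard cosine-series relation between a real cepstrum and the log amplitude spectrum. The following is a sketch under assumptions not stated in the source: $c_n$ denotes the n-th cepstrum coefficient, $N$ the cepstrum order (14 in this embodiment), and the factor 2 follows the real-cepstrum convention. An LPC cepstrum differs by convention-dependent constants but behaves in the same way.

```latex
\log\lvert S(\omega)\rvert \approx c_0 + 2\sum_{n=1}^{N} c_n \cos(n\omega)

c^*_n = v\,c_n \ (n \ge 1)
\;\Longrightarrow\;
\log\lvert S^*(\omega)\rvert - c_0 \approx v\,\bigl(\log\lvert S(\omega)\rvert - c_0\bigr)
```

With v = 1 the spectral shape is unchanged. With v = 0 every frequency-dependent term vanishes and the log spectrum is constant, which is the flat spectrum of white noise.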

[0036] Next, the matching processing unit 8 matches the time series of cepstrum vectors $c^*_t$ processed by the noise processing unit 6 against the time series of cepstrum vectors $c^*_r$, which are the standard patterns stored in the standard pattern storage unit 7. The matching uses the continuous DP method, a form of word spotting, so recognition is performed without fixing the endpoints of the speech. In both the time series of cepstrum vectors $c^*_t$ and the time series of cepstrum vectors $c^*_r$ being matched, intervals with a low degree of voice-likeness, that is, intervals that are silent or have a low speech level, have been converted to white noise and their spectra flattened. Speech recognition is therefore performed free from the influence of ambient noise, and recognition accuracy improves. Consequently, the recognition result output unit 9, which outputs the processing result of the matching processing unit 8, outputs a highly accurate recognition result.

[0037] Modifications of the acoustic analysis unit 4 are now described. In this embodiment, the acoustic analysis unit 4 performs LPC cepstrum analysis to obtain the features of the input audio signal. The method of obtaining the features is not limited to this, however; the acoustic analysis unit may instead use, for example, cepstrum analysis, mel-cepstrum analysis, or spectrum analysis with spectral tilt correction. In short, any type may be used as long as it yields features that make the white noise conversion in the noise processing unit 6 easy. More specifically, for a mel-cepstrum vector, that is, a vector of mel-cepstrum coefficients, the zero vector represents white noise, just as for a cepstrum vector. Spectrum analysis with spectral tilt correction is performed by applying, to a spectrum obtained by an FFT or a band-pass filter bank, a spectral tilt correction such as a logarithmic transformation or a correction that subtracts the least-squares approximation line (exponential transformation). As a result, for the corrected spectrum vector as well, the zero vector represents white noise, just as for a cepstrum vector. Therefore, these analyses can be applied to the apparatus of this embodiment as it is, simply by storing in the standard pattern storage unit 7 standard patterns obtained by mel-cepstrum analysis or by spectrum analysis with spectral tilt correction.
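
As a rough illustration of why a tilt-corrected spectrum maps white noise to the zero vector, the following Python sketch log-transforms a short-time power spectrum and subtracts its least-squares regression line. Combining the two corrections in this way is an assumption made for the example, and the function name `tilt_corrected_log_spectrum` is hypothetical.

```python
import math

def tilt_corrected_log_spectrum(power_spectrum):
    """Log-transform a short-time power spectrum (all values > 0, at least
    two bins), then subtract its least-squares line so that the overall
    level and tilt vanish."""
    n = len(power_spectrum)
    log_spec = [math.log(p) for p in power_spectrum]
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(log_spec) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, log_spec))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return [y - (slope * x + intercept) for x, y in zip(xs, log_spec)]

# A flat (white) spectrum is mapped to approximately the zero vector.
white = tilt_corrected_log_spectrum([4.0] * 8)
```

Under these assumptions a spectrum that is flat, or that has only a constant slope on the log scale, comes out as (numerically) all zeros, while a spectrum with a formant-like peak keeps a clearly nonzero component.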

[0038] Next, modifications of the voice detection unit 5 are described. The voice detection unit 5 obtains the average power P of each frame as the basic data for determining the degree of voice-likeness in the audio signal. Because this frame average power P is obtained by LPC analysis, the voice detection unit 5 may be configured to share part of the structure of the acoustic analysis unit 4, which performs LPC cepstrum analysis. As another modification of the voice detection unit 5, the basic data for judging the degree of voice-likeness need not be the voice power; the number of zero crossings, the pitch frequency, the sharpness of the formants, the distance from each phoneme pattern, or the like may be used instead.
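
Two of the basic quantities mentioned above, the frame average power and the number of zero crossings, can be sketched in Python as follows. Computing the power directly from the samples (rather than through LPC analysis, as in the embodiment) is an assumption made to keep the example short, and both function names are hypothetical.

```python
def frame_average_power(frame):
    """Mean squared amplitude of one frame of samples."""
    return sum(s * s for s in frame) / len(frame)

def zero_crossing_count(frame):
    """Number of sign changes between consecutive samples
    (a sample equal to zero is treated as non-negative)."""
    return sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))

frame = [1.0, -1.0, 1.0, -1.0]
power = frame_average_power(frame)      # 1.0
crossings = zero_crossing_count(frame)  # 3
```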

[0039] Next, modifications of the matching processing unit 8 are described. In this embodiment, the matching processing unit 8 is configured to execute the continuous DP method, but it may instead be configured to perform matching by another method, such as one using a state transition model. Nor is it limited to word spotting methods such as these; it may be configured to perform isolated word speech recognition.

[0040]

[Effects of the Invention] The invention according to claim 1 comprises: an acoustic analysis unit that performs acoustic analysis on an input audio signal frame by frame to obtain the acoustic features of the audio signal; a voice detection unit that obtains the degree of voice-likeness of the input audio signal frame by frame; a noise processing unit that converts to white noise the acoustic features of a section determined by the voice detection unit to have a low degree of voice-likeness; a standard pattern storage unit that stores standard patterns of speech; and a matching processing unit that matches the time series of a standard pattern stored in the standard pattern storage unit against the time series of the acoustic features processed by the noise processing unit. Therefore, when the input audio signal contains a section determined to have a low degree of voice-likeness, that is, a section that is silent or has a low voice level, the noise processing unit converts the acoustic features to white noise and thereby smooths the spectrum that the acoustic features represent. This removes the influence of ambient noise on that section and prevents misrecognition caused by features of the ambient noise mixing into the acoustic features, so that the accuracy of speech recognition is improved.

[0041] In the invention according to claim 2, in the invention according to claim 1, the acoustic analysis unit performs cepstrum analysis to obtain cepstrum coefficients as the acoustic features for each frame, and the noise processing unit converts the acoustic features to white noise by setting to small values the cepstrum coefficients of frames for which the degree of voice-likeness obtained by the voice detection unit is low. In the invention according to claim 3, in the invention according to claim 1, the acoustic analysis unit performs mel-cepstrum analysis to obtain mel-cepstrum coefficients as the acoustic features for each frame, and the noise processing unit converts the acoustic features to white noise by setting to small values the mel-cepstrum coefficients of frames for which the degree of voice-likeness obtained by the voice detection unit is low. In the invention according to claim 4, in the invention according to claim 1, the acoustic analysis unit performs spectrum analysis to obtain, for each frame, a short-time spectrum corrected to remove the spectral tilt as the acoustic features, and the noise processing unit converts the acoustic features to white noise by setting to small values the short-time spectrum of frames for which the degree of voice-likeness obtained by the voice detection unit is low. With these configurations, accurate speech recognition based on stable acoustic features can be performed, so that the accuracy of speech recognition is further improved. In addition, the acoustic features can easily be converted to white noise, so that the computation required for the white noise conversion is simplified.

[0042] In the invention according to claim 5, in the invention according to claim 1, the voice detection unit judges the degree of voice-likeness to be lower as the power of the input audio signal is smaller, and the noise processing unit converts the acoustic features to white noise more strongly as the degree of voice-likeness judged by the voice detection unit is lower. Using the power of the audio signal as the parameter for judging the degree of voice-likeness makes that judgment easy and accurate, so that the computation is simplified. In addition, because the audio signal is converted to white noise more strongly as its power is smaller, the invention contributes to more accurate speech recognition.
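
The power-dependent white noise conversion can be sketched in Python as follows. The piecewise-linear mapping from power to a weight between 0 and 1, the threshold parameters `low` and `high`, and the function names are all assumptions introduced for illustration; they are not the formula used in the embodiment. Scaling the cepstrum toward the zero vector corresponds to flattening the spectrum, since the zero vector represents white noise.

```python
def voice_likeness(power, low, high):
    """Map frame power to a weight in [0, 1]: 0 at or below `low`,
    1 at or above `high`, linear in between (assumed mapping)."""
    if power <= low:
        return 0.0
    if power >= high:
        return 1.0
    return (power - low) / (high - low)

def whiten(cepstrum, power, low, high):
    """Shrink the cepstrum coefficients toward zero more strongly
    the lower the frame power is."""
    w = voice_likeness(power, low, high)
    return [w * c for c in cepstrum]

# A half-weight frame: coefficients are halved.
half = whiten([1.0, -2.0], power=1.0, low=0.0, high=2.0)  # [0.5, -1.0]
```

Under these assumptions, frames with clear speech pass through unchanged, silent frames become the zero vector, and frames in between are attenuated gradually rather than switched abruptly.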

[0043] In the invention according to claim 6, in the invention according to claim 1, the standard patterns stored in the standard pattern storage unit are standard patterns generated through processing equivalent to the white noise conversion of the acoustic features in the noise processing unit. The standard patterns are therefore easy to generate, and standard patterns that closely approximate the features of the audio signals actually input can be prepared, which contributes to more accurate speech recognition.

[0044] In the invention according to claim 7, in the invention according to claim 1, the matching processing in the matching processing unit is word spotting processing. Therefore, when an utterance containing a word, syllable, or the like that has been prepared as a standard pattern is spoken, that word or the like can be extracted from the utterance and recognized. In doing so, an erroneous recognition result is prevented from arising when there is no input speech.

[Brief description of the drawings]

FIG. 1 is a block diagram showing one embodiment of the present invention.

[Explanation of symbols]

4 Acoustic analysis unit
5 Voice detection unit
6 Noise processing unit
7 Standard pattern storage unit
8 Matching processing unit

Continued from the front page

(58) Fields searched (Int. Cl.7, DB name): G10L 15/10, G10L 21/02, G10L 15/20

Claims

(57) [Claims]

1. A speech recognition apparatus comprising: an acoustic analysis unit that performs acoustic analysis on an input audio signal frame by frame to obtain acoustic features of the audio signal; a voice detection unit that obtains a degree of voice-likeness of the input audio signal frame by frame; a noise processing unit that converts to white noise the acoustic features of a section determined by the voice detection unit to have a low degree of voice-likeness; a standard pattern storage unit that stores standard patterns of speech; and a matching processing unit that matches the time series of a standard pattern stored in the standard pattern storage unit against the time series of the acoustic features processed by the noise processing unit.

2. The speech recognition apparatus according to claim 1, wherein the acoustic analysis unit performs cepstrum analysis to obtain cepstrum coefficients as the acoustic features for each frame, and the noise processing unit converts the acoustic features to white noise by setting to small values the cepstrum coefficients of frames for which the degree of voice-likeness obtained by the voice detection unit is low.

3. The speech recognition apparatus according to claim 1, wherein the acoustic analysis unit performs mel-cepstrum analysis to obtain mel-cepstrum coefficients as the acoustic features for each frame, and the noise processing unit converts the acoustic features to white noise by setting to small values the mel-cepstrum coefficients of frames for which the degree of voice-likeness obtained by the voice detection unit is low.

4. The speech recognition apparatus according to claim 1, wherein the acoustic analysis unit performs spectrum analysis to obtain, for each frame, a short-time spectrum corrected to remove the spectral tilt as the acoustic features, and the noise processing unit converts the acoustic features to white noise by setting to small values the short-time spectrum of frames for which the degree of voice-likeness obtained by the voice detection unit is low.

5. The speech recognition apparatus according to claim 1, wherein the voice detection unit judges the degree of voice-likeness to be lower as the power of the input audio signal is smaller, and the noise processing unit converts the acoustic features to white noise more strongly as the degree of voice-likeness judged by the voice detection unit is lower.

6. The speech recognition apparatus according to claim 1, wherein the standard patterns stored in the standard pattern storage unit are standard patterns generated through processing equivalent to the white noise conversion of the acoustic features in the noise processing unit.

7. The speech recognition apparatus according to claim 1, wherein the matching processing in the matching processing unit is word spotting processing.