JPH07191696A

JPH07191696A - Speech recognition device

Info

Publication number: JPH07191696A
Application number: JP5330591A
Authority: JP
Inventors: Takashi Ariyoshi; 敬有吉
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 1993-12-27
Filing date: 1993-12-27
Publication date: 1995-07-28
Anticipated expiration: 2017-12-09
Also published as: JP3354252B2

Abstract

PURPOSE: To prevent misdetection caused by the feature quantities of ambient noise mixing with the feature quantities of speech, by reducing the influence of ambient noise on silent sections of an input speech signal, such as those at geminate (double) consonants, and on sections where the speech level is low. CONSTITUTION: For the input acoustic signal, acoustic feature quantities and the degree of speech-likeness are obtained frame by frame. When there is a section judged to have a low degree of speech-likeness, that is, a section that is silent or low in speech level, a noise processing part 6 converts the acoustic feature quantities of that section to white noise, which smooths the spectrum they represent. A matching process part 8 then matches the time series of acoustic feature quantities processed by the noise processing part 6 against the time series of a standard pattern to perform speech recognition. The influence of ambient noise on sections that are silent or low in speech level is thereby removed, and speech recognition is performed accurately.

Description

Detailed Description of the Invention

[0001]

[Field of Industrial Application] The present invention relates to a speech recognition device that recognizes input speech by comparing the feature quantities of the input speech with the feature quantities of standard patterns prepared in advance.

[0002]

[Prior Art] In recent years, speech recognition techniques for recognizing the phonological information of speech uttered by humans have been actively researched, and speech recognition devices that apply these techniques to concrete equipment are under development. To recognize phonological information, a common approach is to prepare in advance a plurality of standard patterns in units such as words or syllables, compare the unknown input speech with each standard pattern, find the standard pattern most similar to the input speech, and judge that this standard pattern is the speech that was uttered.

[0003] Techniques for recognizing words, syllables, and the like fall into two classes: isolated word speech recognition, which recognizes words uttered separately, and continuous word speech recognition, which recognizes specific words within continuously uttered speech. When a speech recognition device is put into practical use, ambient noise and unnecessary words that the speaker may utter must be taken into account, so a device capable of continuous word speech recognition is desirable.

[0004] As a continuous word speech recognition technique that recognizes speech while excluding ambient noise and unnecessary words uttered by the speaker, word spotting, for example by the continuous DP method, has long been known (Sadaoki Furui, "Digital Speech Processing," Tokai University Press, Chapter 8). Word spotting is a technique that searches speech for units such as words or syllables and extracts predetermined words. The continuous DP (continuous dynamic programming) method is a form of continuous word speech recognition in which the input speech, converted into a parameter sequence such as a spectrum, is matched by DP matching (dynamic programming matching) against standard patterns of words, syllables, and the like while the input is shifted one frame at a time from its start. When the distance resulting from the matching falls to or below a threshold, a word or syllable of that standard pattern is judged to be present at that point in time.
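
The word spotting procedure described above can be illustrated with a minimal Python sketch. This is a simplified variant and not the exact continuous DP formulation of the cited reference: the function name `continuous_dp_spot`, the squared Euclidean local distance, the three-way path choice, and normalization by the reference length are all assumptions made for illustration.

```python
def continuous_dp_spot(frames, reference, threshold):
    """Return the input frame indices at which the reference pattern is judged to end.

    frames, reference: lists of feature vectors (lists of floats of equal length).
    threshold: largest length-normalized distance accepted as a detection (assumed).
    """
    def dist(a, b):
        # Local distance: squared Euclidean distance (an assumption of this sketch).
        return sum((x - y) ** 2 for x, y in zip(a, b))

    n_ref = len(reference)
    inf = float("inf")
    prev = [inf] * n_ref  # accumulated distances for the previous input frame
    hits = []
    for t, frame in enumerate(frames):
        cur = [inf] * n_ref
        for j in range(n_ref):
            d = dist(frame, reference[j])
            if j == 0:
                # Free start point: the pattern may begin at any input frame.
                cur[j] = d
            else:
                # Best of: stay on reference frame j, advance diagonally, or
                # advance within the same input frame.
                cur[j] = d + min(prev[j], prev[j - 1], cur[j - 1])
        # Detection when the normalized distance at the pattern end is small enough.
        if cur[-1] / n_ref <= threshold:
            hits.append(t)
        prev = cur
    return hits
```

Because the first reference frame may be aligned with any input frame, no explicit start or end point of the utterance has to be detected beforehand; the pattern is reported wherever the accumulated distance at its last frame is small.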

[0005] When the input speech is compared with the standard patterns in speech recognition, the speech waveforms themselves are generally not compared. Instead, phase information is removed from the waveform and the input speech is handled after conversion into features related to the spectrum. This is because comparing the waveforms directly involves too much information, because the phase information of a waveform changes easily with the transmission and recording systems, and because such phase information contributes little to human perception of speech.

[0006] As a spectrum-related feature, the short-time spectrum extracted at a fixed period is generally used. The short-time spectrum is the power spectral density of each short-time segment of speech. It can be analyzed by decomposing it into the product (the sum on a logarithmic scale) of the spectral envelope, a component that varies slowly with frequency, and the spectral fine structure, a component that varies rapidly with frequency. Of these, the spectral fine structure is unstable because it is affected by pitch and other factors. For this reason, speech recognition generally extracts the spectral envelope from the short-time spectrum and uses the spectral envelope as the speech feature.

[0007] There are various methods of extracting the spectral envelope, one of which is cepstrum analysis. The cepstrum is defined as the inverse Fourier transform of the logarithm of the short-time amplitude spectrum of the waveform, and its distinguishing property is that it approximately separates the spectral envelope from the spectral fine structure. As an envelope extraction method related to cepstrum analysis, attempts have also been made in recent years to use a cepstrum computed from a log spectrum resampled on a mel-scale frequency axis. Such a cepstrum is called a mel cepstrum. A further special form of cepstrum analysis is LPC cepstrum analysis (LPC stands for linear predictive coding). In contrast to the cepstrum computed directly from the waveform, that is, the FFT cepstrum (FFT stands for fast Fourier transform), the LPC cepstrum is a cepstrum based on a linear prediction model. Its distinguishing property is that it yields an envelope spectrum that weights spectral peaks more heavily than the envelope spectrum obtained from the FFT cepstrum. In other words, noting that information important for speech recognition lies in the peaks of the spectrum, the method emphasizes the spectral peaks so that the distance measure becomes more sensitive and more accurate speech recognition is achieved.
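
As an illustration of how an LPC cepstrum can be obtained from a linear prediction model, the following Python sketch applies the standard recursion that converts predictor coefficients into cepstrum coefficients. The source does not give this recursion; the function name `lpc_to_cepstrum` and the sign convention (all-pole model H(z) = 1 / (1 - Σ a_k z^-k)) are assumptions of the sketch, and the 0th-order (gain) term is omitted.

```python
def lpc_to_cepstrum(a, n_cep):
    """Convert predictor coefficients a_1..a_p into cepstrum coefficients c_1..c_n_cep.

    a: list [a_1, ..., a_p] for the assumed model H(z) = 1 / (1 - sum_k a_k z^-k).
    n_cep: number of cepstrum coefficients to return (0th order excluded).
    """
    p = len(a)
    c = []
    for n in range(1, n_cep + 1):
        # Direct term a_n (zero beyond the prediction order p).
        acc = a[n - 1] if n <= p else 0.0
        # Recursive term: sum over k of (k / n) * c_k * a_(n-k).
        for k in range(1, n):
            if n - k <= p:
                acc += (k / n) * c[k - 1] * a[n - k - 1]
        c.append(acc)
    return c
```

For a single-pole model with a_1 = a, the recursion gives c_n = a^n / n, which matches the power series of -log(1 - a z^-1) and serves as a quick sanity check.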

[0008]

[Problem to Be Solved by the Invention] As described above, methods such as cepstrum analysis, mel cepstrum analysis, and LPC cepstrum analysis are used to detect speech feature quantities, that is, to extract the spectral envelope of the short-time spectrum. The speech feature quantities used are then cepstrum coefficients, mel cepstrum coefficients, and LPC cepstrum coefficients, respectively. However, such cepstrum coefficients are feature quantities that do not depend on the input level of the speech. Consequently, in silent sections, such as those that occur when a geminate (double) consonant is uttered, and in sections where the speech level is low, the feature quantities of the ambient noise affect the feature quantities of the input speech and can cause misrecognition. For example, in a silent section, ambient noise may increase the distance to the standard pattern of the word that corresponds to the input speech, so that the word is wrongly judged not to correspond, or it may reduce the distance to the standard pattern of a word that does not correspond to the input speech, so that the word is wrongly judged to correspond. This is an obstacle to accurate speech recognition.

[0009]

[Means for Solving the Problem] The invention according to claim 1 comprises: an acoustic analysis unit that obtains the acoustic feature quantities of an input acoustic signal by performing acoustic analysis on the signal frame by frame; a speech detection unit that obtains the degree of speech-likeness of the input acoustic signal frame by frame; a noise conversion processing unit that converts to white noise the acoustic feature quantities of sections judged by the speech detection unit to have a low degree of speech-likeness; a standard pattern storage unit that stores standard patterns of speech; and a matching processing unit that matches the time series of the standard patterns stored in the standard pattern storage unit against the time series of the acoustic feature quantities processed by the noise conversion processing unit.

[0010] In the invention according to claim 2, which depends on claim 1, the acoustic analysis unit performs cepstrum analysis to obtain cepstrum coefficients as the acoustic feature quantities for each frame, and the noise conversion processing unit converts the acoustic feature quantities to white noise by reducing the cepstrum coefficients of frames for which the speech detection unit found a low degree of speech-likeness.

[0011] In the invention according to claim 3, which depends on claim 1, the acoustic analysis unit performs mel cepstrum analysis to obtain mel cepstrum coefficients as the acoustic feature quantities for each frame, and the noise conversion processing unit converts the acoustic feature quantities to white noise by reducing the mel cepstrum coefficients of frames for which the speech detection unit found a low degree of speech-likeness.

[0012] In the invention according to claim 4, which depends on claim 1, the acoustic analysis unit performs spectrum analysis to obtain, for each frame, a short-time spectrum corrected to remove spectral tilt as the acoustic feature quantity, and the noise conversion processing unit converts the acoustic feature quantities to white noise by reducing the short-time spectrum of frames for which the speech detection unit found a low degree of speech-likeness.

[0013] In the invention according to claim 5, which depends on claim 1, the speech detection unit judges the degree of speech-likeness to be lower the smaller the power of the input acoustic signal, and the noise conversion processing unit converts the acoustic feature quantities to white noise more strongly the lower the degree of speech-likeness judged by the speech detection unit.

[0014] In the invention according to claim 6, which depends on claim 1, the standard patterns stored in the standard pattern storage unit are standard patterns generated through processing equivalent to the white-noise conversion of the acoustic feature quantities in the noise conversion processing unit.

[0015] In the invention according to claim 7, which depends on claim 1, the matching processing in the matching processing unit is word spotting processing.

[0016]

[Operation] In the invention according to claim 1, for each frame the acoustic analysis unit obtains the acoustic feature quantities of the input acoustic signal and the speech detection unit obtains the degree of speech-likeness of that signal. If the acoustic signal contains a section judged to have a low degree of speech-likeness, the noise conversion processing unit converts the acoustic feature quantities of that section to white noise. A section of the acoustic signal with a low degree of speech-likeness is a section that is silent or has a low speech level, and converting the acoustic feature quantities to white noise means smoothing the spectrum that they represent. Therefore, where the signal is silent or the speech level is low, the spectrum serving as the acoustic feature quantity is smoothed and the influence of ambient noise is removed. The matching processing unit compares the time series of feature quantities processed in this way with the time series of the standard patterns stored in the standard pattern storage unit and performs matching. Accurate speech recognition is thus achieved regardless of whether ambient noise is present.

[0017] In the invention according to claim 2, cepstrum analysis using cepstrum coefficients as the acoustic feature quantities is chosen as the acoustic analysis method, and the white-noise conversion of the acoustic feature quantities is achieved by reducing the cepstrum coefficients. In the invention according to claim 3, mel cepstrum analysis using mel cepstrum coefficients as the acoustic feature quantities is chosen as the acoustic analysis method, and the white-noise conversion is achieved by reducing the mel cepstrum coefficients. In the invention according to claim 4, spectrum analysis using a spectrum from which the spectral tilt has been removed is chosen as the acoustic analysis method, and the white-noise conversion is achieved by reducing this spectrum. Accordingly, in the inventions according to claims 2, 3, and 4, accurate speech recognition is performed on the basis of stable acoustic feature quantities, and the white-noise conversion of the acoustic feature quantities is easy.

[0018] In the invention according to claim 5, when the speech detection unit judges the degree of speech-likeness, it judges the degree to be lower the smaller the power of the input acoustic signal, and the noise conversion processing unit determines the strength of the white-noise conversion of the acoustic feature quantities according to the power of the input acoustic signal. That is, the smaller its power, the more strongly the acoustic signal is converted to white noise. Speech recognition thereby becomes more accurate.

[0019] In the invention according to claim 6, the standard patterns stored in the standard pattern storage unit are generated through processing equivalent to the white-noise conversion of the acoustic feature quantities in the noise conversion processing unit, so the standard patterns are easy to generate. Standard patterns that closely approximate the feature quantities of the acoustic signals actually input can also be prepared, so speech recognition becomes more accurate.

[0020] In the invention according to claim 7, the matching processing unit performs word spotting when matching the time series of speech feature quantities against the time series of the standard patterns. Thus, when an utterance containing a word, syllable, or the like for which a standard pattern has been generated is spoken, that word or the like is extracted from the utterance and recognized.

[0021]

[Embodiment] An embodiment of the present invention will be described with reference to FIG. 1. FIG. 1 is a block diagram of the device. An A/D conversion unit 2 (A/D stands for analog to digital) is connected to a speech input unit 1 to which speech is input, and an acoustic preprocessing unit 3 and an acoustic analysis unit 4 are connected in that order to the A/D conversion unit 2. A speech detection unit 5 is also connected to the acoustic preprocessing unit 3, and a noise conversion processing unit 6 is connected to the speech detection unit 5 and the acoustic analysis unit 4. A standard pattern storage unit 7 is provided. The standard pattern storage unit 7 and the noise conversion processing unit 6 are connected to a matching processing unit 8, which in turn is connected to a recognition result output unit 9.

[0022] The speech input unit 1 is, for example, a microphone, and the analog acoustic signal input through the speech input unit 1 is output to the A/D conversion unit 2.

[0023] The A/D conversion unit 2 converts the acoustic signal from the speech input unit 1 into a digital signal, performing sampling, quantization, and encoding. The conversion conditions in the A/D conversion unit 2 are, for example, a sampling frequency of 16 kHz and 16 quantization bits. Sampling is performed after the signal has been band-limited to the low-frequency range. This ensures that sampling obeys the sampling theorem and prevents aliasing distortion.

[0024] Next, the acoustic preprocessing unit 3 applies high-frequency emphasis (pre-emphasis) to the input acoustic signal converted into a digital signal by the A/D conversion unit 2. The acoustic preprocessing unit 3 is constructed from a first-order digital filter having the transfer function $H(z) = 1 - z^{-1}$ (Equation 1), a difference calculation circuit, or the like.
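
In the time domain, the transfer function of Equation 1 corresponds to the first-order difference y[n] = x[n] - x[n-1]. A minimal Python sketch follows; the function name `pre_emphasize` and the choice of treating the sample before the first one as zero are assumptions for illustration.

```python
def pre_emphasize(samples):
    """Apply H(z) = 1 - z^-1, i.e. y[n] = x[n] - x[n-1], with x[-1] taken as 0."""
    out = []
    prev = 0.0  # assumed initial state of the filter
    for x in samples:
        out.append(x - prev)
        prev = x
    return out
```

A constant (zero-frequency) input is cancelled after the first sample, while rapidly changing inputs pass through, which is the high-frequency emphasis described above.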

[0025] Next, the acoustic analysis unit 4 is a computation unit that extracts the feature quantities of the input acoustic signal by extracting the spectral envelope of its short-time spectrum. In this embodiment, the acoustic signal is subjected to LPC cepstrum analysis, and a cepstrum vector $c_t$ (excluding the 0th order) is obtained as the cepstrum coefficients for each frame. The analysis conditions for the acoustic signal in the acoustic analysis unit 4 are as follows:

Frame period: 10 ms
Window length: 16 ms
Window function: Hamming window
LPC analysis order: 14
Cepstrum order: 14
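
The framing implied by these analysis conditions can be sketched in Python as follows. With the 16 kHz sampling frequency given for the A/D conversion unit 2, a 10 ms frame period corresponds to 160 samples and a 16 ms window to 256 samples. The names `split_frames` and `hamming`, and the decision to drop a trailing partial frame, are assumptions of this sketch.

```python
import math

SAMPLE_RATE = 16000                      # Hz, from the A/D conversion conditions
FRAME_SHIFT = SAMPLE_RATE * 10 // 1000   # 10 ms frame period -> 160 samples
WINDOW_LEN = SAMPLE_RATE * 16 // 1000    # 16 ms window -> 256 samples

def hamming(n):
    """Hamming window of length n (n >= 2)."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]

def split_frames(samples):
    """Cut the signal into overlapping Hamming-windowed frames.

    A trailing segment shorter than one window is discarded (an assumption).
    """
    win = hamming(WINDOW_LEN)
    frames = []
    start = 0
    while start + WINDOW_LEN <= len(samples):
        segment = samples[start:start + WINDOW_LEN]
        frames.append([s * w for s, w in zip(segment, win)])
        start += FRAME_SHIFT
    return frames
```

Each windowed frame would then be passed to the LPC analysis (order 14) to obtain the 14 cepstrum coefficients of that frame.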

[0026] Next, the speech detection unit 5 obtains the degree of speech-likeness of the acoustic signal from the average power of each frame of the input acoustic signal after high-frequency emphasis by the acoustic preprocessing unit 3. The frame average power can be obtained from the 0th-order autocorrelation coefficient computed during LPC analysis. In the speech detection unit 5, the relationship between the frame average power $p$ and the degree of speech-likeness $v$ obtained by the speech detection unit 5 is defined by Equation 2 below.

[0027]

[Equation 2]

[0028] $p_0$ in Equation 2 is an experimentally determined constant, set to a value slightly larger than the power at the start and end points of a speech section. As Equation 2 shows, the degree of speech-likeness $v$ satisfies $0 \le v \le 1$: $v$ is 1 when the frame average power $p$ is sufficiently large, $v$ is 0 when $p$ is 0, and in between $v$ decreases monotonically as $p$ decreases.
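
Equation 2 itself is not reproduced in this text, so the following Python sketch uses a hypothetical stand-in mapping, v = min(1, p / p0), chosen only because it satisfies the properties stated above (v = 0 at p = 0, v = 1 for sufficiently large p, monotonic in between). It should not be read as the patent's actual formula. The function names are likewise assumptions.

```python
def frame_average_power(frame):
    """Frame average power: 0th-order autocorrelation (sum of squares) over frame length."""
    return sum(x * x for x in frame) / len(frame)

def speech_likeness(p, p0):
    """HYPOTHETICAL stand-in for Equation 2 (the real formula is not given here).

    p:  frame average power (p >= 0).
    p0: experimentally chosen constant (p0 > 0).
    Returns v in [0, 1], non-decreasing in p.
    """
    return min(1.0, p / p0)
```

Any other mapping with the same boundary values and monotonic behavior would fit the description equally well; the linear ramp is used here only for simplicity.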

[0029] Next, the noise conversion processing unit 6 converts the acoustic feature quantities obtained by the acoustic analysis unit 4 to white noise according to the degree of speech-likeness obtained by the speech detection unit 5. The noise conversion processing unit 6 performs the computation $c^*_t = v\,c_t$ (Equation 3). As described above, $c_t$ is the acoustic feature quantity obtained by the acoustic analysis unit 4, that is, the cepstrum vector, and $v$ is the degree of speech-likeness obtained by the speech detection unit 5. $c^*_t$ is the cepstrum vector converted to white noise according to the degree of speech-likeness of the input acoustic signal. As Equation 3 shows, the noise conversion processing unit 6 determines the cepstrum vector $c^*_t$ as the product of the cepstrum vector $c_t$ and the degree of speech-likeness $v$. The cepstrum vector of white noise is 0, that is, the zero vector. Therefore, the lower the degree of speech-likeness $v$, the more strongly the cepstrum vector $c_t$ is converted to white noise.
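
Equation 3 is a per-frame scalar multiplication, which can be sketched in Python as below. The function names `whiten_cepstrum` and `whiten_sequence` are assumptions for illustration.

```python
def whiten_cepstrum(c_t, v):
    """Equation 3: c*_t = v * c_t for one frame (c_t is a list of floats)."""
    return [v * c for c in c_t]

def whiten_sequence(cepstra, likeness):
    """Apply Equation 3 to a time series, using each frame's own speech-likeness v."""
    return [whiten_cepstrum(c_t, v) for c_t, v in zip(cepstra, likeness)]
```

With v = 1 a frame is left unchanged, and with v = 0 it becomes the zero vector, which is the cepstrum of white noise (a flat spectrum).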

[0030] Next, the standard pattern storage unit 7 stores many standard patterns of the words, syllables, and the like on which speech recognition is to be performed. These standard patterns are cepstrum vectors $c^*_r$ generated through the same processing as the time series of cepstrum vectors $c^*_t$, and they have equivalent content: speech input to the speech input unit 1 is converted into cepstrum vectors $c_t$ by the acoustic analysis unit 4 and then given the prescribed processing by the noise conversion processing unit 6.

[0031] Next, the matching processing unit 8 performs matching between the standard patterns stored in the standard pattern storage unit 7, that is, the time series of cepstrum vectors $c^*_r$, and the acoustic feature quantities processed by the noise conversion processing unit 6, that is, the time series of cepstrum vectors $c^*_t$. The matching in the matching processing unit 8 uses the continuous DP method. The distance measure used is, for example, a group delay spectrum distance measure.

[0032] Next, the recognition result output unit 9 outputs the recognition result of the matching processing unit 8, for example by outputting the presence or absence of the relevant word or the like as a signal or on a display.

[0033] In this configuration, the acoustic signal input to the speech input unit 1 is converted to digital form by the A/D conversion unit 2, where it is sampled, quantized, and encoded. The acoustic preprocessing unit 3 then applies high-frequency emphasis, flattening the spectral tilt. This compresses the dynamic range of the acoustic signal and raises the effective SNR (signal-to-quantization-noise ratio).

[0034] Next, the feature quantities of the high-frequency-emphasized acoustic signal are extracted as the cepstrum vector $c_t$ by LPC cepstrum analysis in the acoustic analysis unit 4. At the same time, the speech detection unit 5 uses Equation 2 to obtain the degree of speech-likeness $v$ of the high-frequency-emphasized acoustic signal from the average power $p$ of each frame.

[0035] The cepstrum vector $c_t$ and the degree of speech-likeness $v$ obtained in this way are sent to the noise conversion processing unit 6, which computes Equation 3 to obtain the white-noise-converted cepstrum vector $c^*_t$. The lower the degree of speech-likeness obtained by the speech detection unit 5, the more strongly the cepstrum vector $c^*_t$ is converted to white noise. Because a low degree of speech-likeness means that the section is silent or has a low speech level, silent sections and sections with a low speech level are converted to white noise and the spectrum of those sections is flattened.

[0036] Next, the matching processing unit 8 matches the time series of cepstrum vectors c*t processed by the noise processing unit 6 against the time series of cepstrum vectors c*r, which is the standard pattern stored in the standard pattern storage unit 7. This matching is performed by the continuous DP method, a word spotting technique, so speech recognition is performed without fixed speech endpoints (endpoint-free). In both time series to be matched, the cepstrum vectors c*t and the cepstrum vectors c*r, sections with a low degree of voice-likeness, that is, sections that are silent or have a low voice level, have been converted into white noise and their spectrum has been smoothed. Speech recognition is therefore performed without being affected by ambient noise, and recognition accuracy is improved. As a result, the recognition result output unit 9, which outputs the processing result of the matching processing unit 8, outputs a highly accurate recognition result.
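A minimal sketch of endpoint-free matching follows. It is not the exact continuous DP recurrence, whose slope constraints and weights are not given in this passage. Instead it uses a simplified subsequence DTW with a Euclidean frame distance, both of which are assumptions. The names `frame_distance` and `spotting_scores` are hypothetical. For every input frame, the function returns the length-normalized cost of the best alignment of the whole standard pattern ending at that frame, so a word can be spotted wherever the score dips.

```python
from typing import List, Sequence

Vector = Sequence[float]
INF = float("inf")


def frame_distance(a: Vector, b: Vector) -> float:
    """Euclidean distance between two feature vectors (assumed local distance)."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def spotting_scores(inputs: List[Vector], reference: List[Vector]) -> List[float]:
    """Best normalized alignment cost of `reference` ending at each input frame."""
    n_ref = len(reference)
    prev = [INF] * n_ref  # DP column for the previous input frame
    scores = []
    for x in inputs:
        col = [0.0] * n_ref
        for i, r in enumerate(reference):
            d = frame_distance(x, r)
            if i == 0:
                col[i] = d  # free start: a match may begin at any input frame
            else:
                col[i] = d + min(prev[i], prev[i - 1], col[i - 1])
        scores.append(col[-1] / n_ref)  # free end: read the last row at every frame
        prev = col
    return scores


# Hypothetical one-dimensional features: the pattern 1, 2, 3 embedded in silence.
reference = [[1.0], [2.0], [3.0]]
inputs = [[0.0], [0.0], [1.0], [2.0], [3.0], [0.0]]
scores = spotting_scores(inputs, reference)
best_end = scores.index(min(scores))
```

A detection threshold on `scores` (not specified here) would turn the minimum into a recognition decision. Because both time series are whitened in low voice-likeness sections, those sections contribute little distance under this kind of matching.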

[0037] A modification of the acoustic analysis unit 4 will now be described. In this embodiment, the acoustic analysis unit 4 performs LPC cepstrum analysis to obtain the feature quantity of the input acoustic signal, but the method is not limited to this. For example, the acoustic analysis unit may use cepstrum analysis, mel-cepstrum analysis, or spectrum analysis with spectral tilt correction. In short, any type may be used as long as it yields a feature quantity that makes the white noise conversion in the noise processing unit 6 easy. More specifically, for a mel-cepstrum vector of mel-cepstrum coefficients, the zero vector represents white noise, as with the cepstrum vector. Spectrum analysis with spectral tilt correction is carried out by applying, to a spectrum obtained by an FFT or a band-pass filter bank, a spectral tilt correction such as a logarithmic transformation or a correction that subtracts the least-squares approximation line (exponential transformation). In the corrected spectrum vector, the zero vector likewise represents white noise. Therefore, the apparatus of this embodiment can be used as is, simply by storing in the standard pattern storage unit standard patterns obtained by mel-cepstrum analysis or by spectrum analysis with spectral tilt correction.
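The spectral tilt correction can be illustrated with a short Python sketch. Combining the logarithmic transformation with subtraction of a least-squares line fitted over the channel index is an assumption about how the two corrections fit together, and the function name `tilt_corrected_log_spectrum` is hypothetical. The sketch shows why a flat (white) spectrum maps to the zero vector after correction.

```python
import math
from typing import List


def tilt_corrected_log_spectrum(power: List[float]) -> List[float]:
    """Log-transform a power spectrum and subtract its least-squares line.

    Assumption: channels are equally spaced and all power values are positive.
    """
    log_spec = [math.log(p) for p in power]
    n = len(log_spec)
    mean_x = (n - 1) / 2
    mean_y = sum(log_spec) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(log_spec))
    slope = sxy / sxx if sxx else 0.0
    # Residual after removing the fitted line (overall level and tilt).
    return [y - (mean_y + slope * (x - mean_x)) for x, y in enumerate(log_spec)]


flat = tilt_corrected_log_spectrum([2.0] * 8)                        # white spectrum
tilted = tilt_corrected_log_spectrum([math.exp(-0.5 * k) for k in range(8)])
peaked = tilt_corrected_log_spectrum([1.0, 1.0, 8.0, 1.0, 1.0])     # formant-like peak
```

Both the flat spectrum and the purely tilted spectrum reduce to (numerically) zero vectors, while the peaked spectrum keeps its structure, so scaling such vectors toward zero flattens the spectrum in the same way as for cepstrum vectors.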

[0038] Next, modifications of the voice detection unit 5 will be described. The voice detection unit 5 obtains the average power P of each frame as the basic data for determining the degree of voice-likeness in the acoustic signal. Because this frame average power P is obtained through LPC analysis, the voice detection unit 5 may be configured to share part of the structure of the acoustic analysis unit 4, which performs LPC cepstrum analysis. As another modification, the basic data for determining the degree of voice-likeness need not be the voice power; the zero-crossing count, the pitch frequency, the sharpness of the formants, the distance from each phoneme pattern, and the like may be used instead.
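As a rough illustration of these basic data, the Python sketch below computes the frame average power and a zero-crossing count, and maps the power to a degree of voice-likeness. The mapping actually used by the voice detection unit 5 is not given in this passage; the linear ramp between two decibel thresholds, the threshold values, and all function names are hypothetical.

```python
import math
from typing import List


def frame_power(frame: List[float]) -> float:
    """Average power of one frame of samples."""
    return sum(s * s for s in frame) / len(frame)


def zero_crossings(frame: List[float]) -> int:
    """Number of sign changes between adjacent samples."""
    return sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))


def voice_likeness(frame: List[float],
                   low_db: float = -50.0,
                   high_db: float = -30.0) -> float:
    """Map frame power to [0, 1]: 0 at or below low_db, 1 at or above high_db.

    The thresholds are assumed values chosen only for illustration.
    """
    p = frame_power(frame)
    db = 10.0 * math.log10(p) if p > 0 else float("-inf")
    v = (db - low_db) / (high_db - low_db)
    return min(1.0, max(0.0, v))


silence = [0.0] * 160
quiet = [0.01, -0.01] * 80   # about -40 dB
loud = [0.5, -0.5] * 80      # about -6 dB
v_values = [voice_likeness(f) for f in (silence, quiet, loud)]
```

The resulting values could serve as the per-frame degree of voice-likeness that controls how strongly each frame is whitened; a zero-crossing count or another of the listed measures could replace the power in the same role.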

[0039] Next, modifications of the matching processing unit 8 will be described. In this embodiment, the matching processing unit 8 executes the continuous DP method, but it may instead perform matching by another method, such as one using a state transition model. Nor is it limited to word spotting of these kinds; it may be configured to perform isolated word speech recognition.

[0040]

[Effects of the Invention] The invention according to claim 1 comprises: an acoustic analysis unit that obtains acoustic feature quantities of an input acoustic signal by performing acoustic analysis on it frame by frame; a voice detection unit that obtains the degree of voice-likeness of the input acoustic signal frame by frame; a noise processing unit that converts into white noise the acoustic feature quantities of a section determined by the voice detection unit to have a low degree of voice-likeness; a standard pattern storage unit that stores standard patterns of speech; and a matching processing unit that matches the time series of a standard pattern stored in the standard pattern storage unit against the time series of acoustic feature quantities processed by the noise processing unit. Consequently, when the input acoustic signal contains a section determined to have a low degree of voice-likeness, that is, a section that is silent or has a low voice level, the noise processing unit converts the acoustic feature quantities into white noise and smooths the spectrum that they represent. This removes the influence of ambient noise on that section and prevents misrecognition caused by feature quantities of the ambient noise mixing with the acoustic feature quantities, so that speech recognition accuracy can be improved.

[0041] In the invention according to claim 2, in the invention according to claim 1, the acoustic analysis unit performs cepstrum analysis to obtain cepstrum coefficients as the acoustic feature quantity for each frame, and the noise processing unit converts the acoustic feature quantity into white noise by setting the cepstrum coefficients of frames with a low degree of voice-likeness, as obtained by the voice detection unit, to small values. In the invention according to claim 3, in the invention according to claim 1, the acoustic analysis unit performs mel-cepstrum analysis to obtain mel-cepstrum coefficients as the acoustic feature quantity for each frame, and the noise processing unit converts the acoustic feature quantity into white noise by setting the mel-cepstrum coefficients of frames with a low degree of voice-likeness to small values. In the invention according to claim 4, in the invention according to claim 1, the acoustic analysis unit performs spectrum analysis to obtain, for each frame, a short-time spectrum corrected to remove spectral tilt as the acoustic feature quantity, and the noise processing unit converts the acoustic feature quantity into white noise by setting the short-time spectrum of frames with a low degree of voice-likeness to small values. With these configurations, accurate speech recognition based on stable acoustic feature quantities can be performed, which further improves recognition accuracy. In addition, the acoustic feature quantities can easily be converted into white noise, which simplifies the computation required for the conversion.

[0042] In the invention according to claim 5, in the invention according to claim 1, the voice detection unit judges the degree of voice-likeness to be lower as the power of the input acoustic signal is smaller, and the noise processing unit converts the acoustic feature quantity into white noise more strongly as the voice detection unit judges the degree of voice-likeness to be lower. Using the power of the acoustic signal as the parameter for judging the degree of voice-likeness makes that judgment easy and accurate, which simplifies the computation. Moreover, because the acoustic signal is converted into white noise more strongly the smaller its power, this contributes to more accurate speech recognition.

[0043] In the invention according to claim 6, in the invention according to claim 1, the standard patterns stored in the standard pattern storage unit are generated through processing equivalent to the white noise conversion of acoustic feature quantities in the noise processing unit. The standard patterns are therefore easy to generate, and standard patterns that closely approximate the feature quantities of actually input acoustic signals can be prepared, which contributes to more accurate speech recognition.

[0044] In the invention according to claim 7, in the invention according to claim 1, the matching performed by the matching processing unit is word spotting. Therefore, when an utterance contains a word, syllable, or the like for which a standard pattern has been generated, that word or the like can be extracted from the utterance and recognized, and an erroneous recognition result can be prevented from arising when there is no input speech.

[Brief description of drawings]

FIG. 1 is a block diagram showing an embodiment of the present invention.

[Explanation of symbols]

4 Acoustic analysis unit
5 Voice detection unit
6 Noise processing unit
7 Standard pattern storage unit
8 Matching processing unit

Claims

[Claims]

1. A speech recognition apparatus comprising: an acoustic analysis unit that obtains acoustic feature quantities of an input acoustic signal by performing acoustic analysis on the input acoustic signal frame by frame; a voice detection unit that obtains the degree of voice-likeness of the input acoustic signal frame by frame; a noise processing unit that converts into white noise the acoustic feature quantities of a section determined by the voice detection unit to have a low degree of voice-likeness; a standard pattern storage unit that stores standard patterns of speech; and a matching processing unit that matches the time series of a standard pattern stored in the standard pattern storage unit against the time series of acoustic feature quantities processed by the noise processing unit.

2. The speech recognition apparatus according to claim 1, wherein the acoustic analysis unit performs cepstrum analysis to obtain cepstrum coefficients as the acoustic feature quantity for each frame, and the noise processing unit converts the acoustic feature quantity into white noise by setting the cepstrum coefficients of frames with a low degree of voice-likeness, as obtained by the voice detection unit, to small values.

3. The speech recognition apparatus according to claim 1, wherein the acoustic analysis unit performs mel-cepstrum analysis to obtain mel-cepstrum coefficients as the acoustic feature quantity for each frame, and the noise processing unit converts the acoustic feature quantity into white noise by setting the mel-cepstrum coefficients of frames with a low degree of voice-likeness, as obtained by the voice detection unit, to small values.

4. The speech recognition apparatus according to claim 1, wherein the acoustic analysis unit performs spectrum analysis to obtain, for each frame, a short-time spectrum corrected to remove spectral tilt as the acoustic feature quantity, and the noise processing unit converts the acoustic feature quantity into white noise by setting the short-time spectrum of frames with a low degree of voice-likeness, as obtained by the voice detection unit, to small values.

5. The speech recognition apparatus according to claim 1, wherein the voice detection unit judges the degree of voice-likeness to be lower as the power of the input acoustic signal is smaller, and the noise processing unit converts the acoustic feature quantity into white noise more strongly as the voice detection unit judges the degree of voice-likeness to be lower.

6. The speech recognition apparatus according to claim 1, wherein the standard patterns stored in the standard pattern storage unit are standard patterns generated through processing equivalent to the white noise conversion of acoustic feature quantities in the noise processing unit.

7. The speech recognition apparatus according to claim 1, wherein the matching processing in the matching processing unit is word spotting processing.