JP4745837B2

JP4745837B2 - Acoustic analysis apparatus, computer program, and speech recognition system

Info

Publication number: JP4745837B2
Application number: JP2006016172A
Authority: JP
Inventors: 利明内部; 栄二宇都宮; 恒夫加藤; 正樹内藤
Original assignee: KDDI Corp
Current assignee: KDDI Corp
Priority date: 2006-01-25
Filing date: 2006-01-25
Publication date: 2011-08-10
Anticipated expiration: 2026-01-25
Also published as: JP2007199247A

Description

The present invention relates to an acoustic analysis apparatus for speech recognition, a computer program, and a speech recognition system.

In recent years, service systems that provide information through speech recognition have become widespread. The recognition performance of such a system degrades markedly in environments where the background noise around the voice input microphone is loud. Noise suppression methods have therefore been proposed to limit this degradation. A typical noise suppression method estimates the noise component from the input signal and removes it from the input signal based on that estimate. Because noise fluctuates constantly, however, the speech after noise suppression may sound unnatural. This is called speech distortion, and it adversely affects the recognition performance of a speech recognition system. For this reason, in the prior art described in Patent Document 1, for example, noise suppression is stopped when the signal-to-noise ratio (SNR) is low, because the noise interval is then difficult to estimate, and is performed only when the SNR is high.

In addition, as coding schemes for distributed speech recognition (DSR), which reduces traffic in server-client speech recognition systems, the European Telecommunications Standards Institute (ETSI) has standardized a coding scheme without noise suppression (ES201108) and a coding scheme with noise suppression (ES202050).
Patent Document 1: Japanese re-publication of PCT application No. 01/024167

In a speech recognition system, applying noise suppression normally improves recognition performance when the background noise around the voice input microphone is loud. When the background noise is low, however, applying noise suppression can instead degrade recognition performance.

Likewise, when the SNR is low, the coding scheme with noise suppression (ES202050) gives better recognition performance than the coding scheme without noise suppression (ES201108), owing to the effect of noise suppression. When the SNR is high, however, the influence of noise suppression makes its recognition performance worse than that of the coding scheme without noise suppression (ES201108).

Therefore, performing noise suppression only when the SNR is high and not when it is low, on the grounds that the noise interval is difficult to estimate, as in the prior art of Patent Document 1 described above, is not desirable from the viewpoint of improving recognition performance.

The present invention has been made in view of these circumstances. Its object is to provide an acoustic analysis apparatus and a speech recognition system that can improve recognition performance by making it possible to choose effectively between acoustic feature extraction with noise suppression and acoustic feature extraction without noise suppression.

Another object of the present invention is to provide a computer program for realizing the acoustic analysis apparatus of the present invention using a computer.

To solve the above problem, an acoustic analysis apparatus according to the present invention is an acoustic analysis apparatus for speech recognition comprising: buffer means for storing, based on the timing at which a speaker is prompted to speak, the input signal from speech input means separately as the input signal of an utterance section and the input signal of a non-utterance section; determination means for determining, based on the input signal of the non-utterance section stored in the buffer means, whether the magnitude of background noise is at or above a predetermined level; noise suppression means for suppressing a noise component contained in the input signal; and analysis means for extracting acoustic features, according to the determination result of the determination means, either from the input signal whose noise component has been suppressed by the noise suppression means or from the input signal whose noise component has not been suppressed. The analysis means extracts acoustic features from the noise-suppressed input signal when the magnitude of background noise is at or above the predetermined level, and otherwise extracts them from the input signal whose noise component has not been suppressed.

An acoustic analysis apparatus according to the present invention is an acoustic analysis apparatus for speech recognition comprising: determination means for determining, based on an input signal from speech input means, whether the signal-to-noise ratio is below a predetermined level; noise suppression means for suppressing a noise component contained in the input signal; and analysis means for extracting acoustic features, according to the determination result of the determination means, either from the input signal whose noise component has been suppressed by the noise suppression means or from the input signal whose noise component has not been suppressed.

In the acoustic analysis apparatus according to the present invention, the analysis means extracts acoustic features from the input signal whose noise component has been suppressed by the noise suppression means when the signal-to-noise ratio is below the predetermined level, and otherwise extracts them from the input signal whose noise component has not been suppressed.

In the acoustic analysis apparatus according to the present invention, the analysis means has first acoustic feature extraction means dedicated to extracting acoustic features from the input signal whose noise component has been suppressed, and second acoustic feature extraction means dedicated to extracting acoustic features from the input signal whose noise component has not been suppressed.

The acoustic analysis apparatus according to the present invention comprises buffer means for storing, based on the timing at which the speaker is prompted to speak, the input signal of the utterance section and the input signal of the non-utterance section separately.

A speech recognition system according to the present invention comprises the acoustic analysis apparatus described above.

A speech recognition system according to the present invention comprises the acoustic analysis apparatus described above in a client device connected to a speech recognition server device via a communication line.

A computer program according to the present invention is a computer program for performing acoustic analysis for speech recognition. It causes a computer to realize: a switching control function that, based on the timing at which a speaker is prompted to speak, stores the input signal from speech input means in buffer means separately as the input signal of an utterance section and the input signal of a non-utterance section; a determination function that determines, based on the input signal of the non-utterance section stored in the buffer means, whether the magnitude of background noise is at or above a predetermined level; a noise suppression function that suppresses a noise component contained in the input signal; and an analysis function that extracts acoustic features, according to the determination result, either from the input signal whose noise component has been suppressed or from the input signal whose noise component has not been suppressed. The analysis function extracts acoustic features from the noise-suppressed input signal when the magnitude of background noise is at or above the predetermined level, and otherwise extracts them from the input signal whose noise component has not been suppressed.

A computer program according to the present invention is a computer program for performing acoustic analysis for speech recognition. It causes a computer to realize: a determination function that determines, based on an input signal from speech input means, whether the signal-to-noise ratio is below a predetermined level; a noise suppression function that suppresses a noise component contained in the input signal; and an analysis function that extracts acoustic features, according to the determination result, either from the input signal whose noise component has been suppressed or from the input signal whose noise component has not been suppressed.
This allows the acoustic analysis apparatus described above to be realized using a computer.

According to the present invention, acoustic feature extraction with noise suppression and acoustic feature extraction without noise suppression can each be used where it is effective. This makes it possible to improve the recognition performance of a speech recognition system.

Embodiments of the present invention are described below with reference to the drawings.

[First Embodiment]
FIG. 1 is a block diagram showing the configuration of an acoustic analysis apparatus 1 according to the first embodiment of the present invention. In FIG. 1, the acoustic analysis apparatus 1 has a switching unit 11, a background noise buffer 12, an input speech buffer 13, a switching control unit 14, a determination unit 15, a noise suppression unit 16, acoustic feature extraction units 17a and 17b, and switching units 18-1 and 18-2.

A microphone input signal is input to the acoustic analysis apparatus 1. The microphone input signal is the signal captured by a microphone used to input the speaker's voice. It contains the background noise picked up by the microphone along with the speaker's voice.

An utterance instruction timing signal is also input to the acoustic analysis apparatus 1. This signal indicates the timing at which the speaker is prompted to utter speech for recognition. After the timing indicated by the signal (specifically, several hundred milliseconds later), the speaker is prompted to speak by, for example, a screen display or an electronic tone. The speaker speaks in response to that display or tone.

The switching unit 11 switches the buffer in which the microphone input signal is stored. The background noise buffer 12 stores microphone input signal that does not contain the speaker's voice. The input speech buffer 13 stores microphone input signal that contains the speaker's voice.

The switching control unit 14 controls the switching unit 11 based on the utterance instruction timing signal. When the utterance instruction timing signal is input, the switching control unit 14 first instructs the switching unit 11 to switch the output destination of the microphone input signal to the background noise buffer 12. Then, after the timing indicated by the signal (specifically, several hundred milliseconds later, at which point the screen display or other prompt is presented to the speaker), it instructs the switching unit 11 to switch the output destination to the input speech buffer 13. As a result, the microphone input signal captured before the speaker has begun to speak is stored in the background noise buffer 12, and the microphone input signal captured after the prompt, which contains the speaker's voice, is stored in the input speech buffer 13. The background noise buffer 12 therefore holds microphone input signal consisting only of background noise, with none of the speaker's voice. The switching control unit 14 also instructs the switching unit 11 to stop output to the input speech buffer 13 according to the timing at which the speaker finishes speaking. Examples of that timing include the point at which the microphone input signal has stayed at or below a predetermined level derived from the background noise for several hundred milliseconds to several seconds, and the point at which a predetermined timeout expires.
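The two end-of-utterance conditions mentioned above (a sustained quiet period relative to the background noise, or a timeout) can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names, the 10 ms frame length, the `margin` factor, and the 500 ms and 10 s defaults are all assumptions chosen for the example.

```python
from typing import Optional, Sequence


def frame_power(frame: Sequence[float]) -> float:
    """Mean squared amplitude of one frame (0.0 for an empty frame)."""
    return sum(s * s for s in frame) / len(frame) if frame else 0.0


def find_utterance_end(
    frames: Sequence[Sequence[float]],
    noise_power: float,
    frame_ms: int = 10,         # assumed frame length
    margin: float = 2.0,        # assumed: "quiet" means power <= margin * noise_power
    hangover_ms: int = 500,     # assumed value in the "several hundred ms to several s" range
    timeout_ms: int = 10_000,   # assumed overall timeout
) -> Optional[int]:
    """Return the index of the frame at which recording should stop,
    or None if neither stop condition has been met yet."""
    threshold = margin * noise_power
    quiet_ms = 0
    for i, frame in enumerate(frames):
        # Condition 2: the predetermined timeout has expired.
        if (i + 1) * frame_ms >= timeout_ms:
            return i
        # Condition 1: the signal has stayed near the noise level long enough.
        if frame_power(frame) <= threshold:
            quiet_ms += frame_ms
            if quiet_ms >= hangover_ms:
                return i
        else:
            quiet_ms = 0
    return None
```

Deriving the quiet threshold from the measured noise power, rather than using a fixed constant, follows the text's wording that the level is "based on background noise".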

This embodiment is based on the idea of using the timing at which the speaker is prompted to speak, which is specific to speech recognition systems, to identify the noise interval, and it is configured to store the microphone input signal in separate buffers for the noise interval and the non-noise interval. That is, based on the prompt timing, the microphone input signal of the non-utterance section is stored in the background noise buffer 12 and the microphone input signal of the utterance section is stored in the input speech buffer 13. As a result, even when the SNR is low, the noise interval and the non-noise interval can be distinguished, and the microphone input signal of the noise interval can be stored in the background noise buffer 12 and that of the non-noise interval in the input speech buffer 13.
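As a rough sketch of this routing, the class below sends each incoming frame to one of two lists depending on how much time has passed since the utterance instruction timing signal. The class name, method names, and the 300 ms default delay are hypothetical; the text only says the prompt follows the signal by "several hundred milliseconds".

```python
from dataclasses import dataclass, field
from typing import List, Optional

Frame = List[float]


@dataclass
class PromptTimedBuffers:
    """Routes microphone frames by prompt timing (hypothetical helper)."""

    prompt_delay_ms: int = 300  # assumed; the text gives no exact value
    background_noise: List[Frame] = field(default_factory=list)  # plays the role of buffer 12
    input_speech: List[Frame] = field(default_factory=list)      # plays the role of buffer 13
    signal_time_ms: Optional[int] = None
    stopped: bool = False

    def on_timing_signal(self, now_ms: int) -> None:
        """Timing signal received: start a new capture, filling the noise buffer first."""
        self.signal_time_ms = now_ms
        self.stopped = False
        self.background_noise.clear()
        self.input_speech.clear()

    def stop(self) -> None:
        """End of utterance detected: stop storing frames."""
        self.stopped = True

    def push(self, frame: Frame, now_ms: int) -> None:
        """Store one frame in the buffer that matches the current phase."""
        if self.signal_time_ms is None or self.stopped:
            return  # outside a capture: frame is not stored
        if now_ms < self.signal_time_ms + self.prompt_delay_ms:
            self.background_noise.append(frame)  # before the prompt: noise only
        else:
            self.input_speech.append(frame)      # after the prompt: may contain speech
```

Because the split depends only on the clock and not on the signal content, it works the same way at any SNR, which is the point the paragraph above makes.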

The determination unit 15 determines, based on the microphone input signal stored in the background noise buffer 12, whether the magnitude of the background noise is at or above a predetermined level. In this determination, the power level of the microphone input signal stored in the background noise buffer 12 is calculated as the background noise level, and the calculated level is compared with the predetermined level. The determination unit 15 outputs the comparison result to the switching units 18-1 and 18-2.
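A minimal sketch of this comparison, assuming the power level is taken as the mean squared amplitude expressed in decibels. The function names and the idea of passing the threshold in dB are illustrative choices; the text does not specify units or a threshold value.

```python
import math
from typing import Sequence


def power_db(samples: Sequence[float], eps: float = 1e-12) -> float:
    """Mean power of the samples in dB (relative to full scale 1.0)."""
    if not samples:
        raise ValueError("no samples in the background noise buffer")
    mean_sq = sum(s * s for s in samples) / len(samples)
    return 10.0 * math.log10(mean_sq + eps)  # eps avoids log10(0) on digital silence


def noise_is_loud(noise_samples: Sequence[float], threshold_db: float) -> bool:
    """True when the background noise level is at or above the threshold."""
    return power_db(noise_samples) >= threshold_db
```

For example, a constant amplitude of 0.1 corresponds to about -20 dB, so it counts as loud against a hypothetical -30 dB threshold, while an amplitude of 0.001 (about -60 dB) does not.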

The background noise buffer 12 used by the determination unit 15 receives the microphone input signal of the non-utterance section. The magnitude of the background noise is therefore judged from signal that does not contain the speaker's voice, so the judgment is accurate.

The switching units 18-1 and 18-2 switch their input and output connections in tandem according to the comparison result of the determination unit 15. When the switching unit 18-1 connects the output of the input speech buffer 13 to the input of the acoustic feature extraction unit 17a, the switching unit 18-2 takes the output of the acoustic feature extraction unit 17a as its own output. When the switching unit 18-1 connects the output of the input speech buffer 13 to the input of the noise suppression unit 16, the switching unit 18-2 takes the output of the acoustic feature extraction unit 17b as its own output. Of the two paths, namely the acoustic feature extraction unit 17a alone and the pair formed by the noise suppression unit 16 and the acoustic feature extraction unit 17b, only the one selected by the switching units 18-1 and 18-2 operates.

The acoustic feature extraction unit 17a reads the microphone input signal from the input speech buffer 13 and extracts acoustic features from it. The ETSI coding scheme without noise suppression (ES201108), for example, can be used for the acoustic feature extraction unit 17a. The acoustic feature extraction unit 17a outputs the extracted acoustic features to the switching unit 18-2.

The noise suppression unit 16 reads the microphone input signal from the input speech buffer 13 and suppresses the noise component in it. The noise-suppressed microphone input signal is output to the acoustic feature extraction unit 17b.

The acoustic feature extraction unit 17b extracts acoustic features from the noise-suppressed microphone input signal supplied by the noise suppression unit 16. The ETSI coding scheme with noise suppression (ES202050), for example, can be used for the acoustic feature extraction unit 17b. Note that ETSI standardizes both the noise suppression and the coding scheme in ES202050. The acoustic feature extraction unit 17b outputs the extracted acoustic features to the switching unit 18-2.

The switching unit 18-2 switches, according to the determination result of the determination unit 15, between outputting the acoustic features from the acoustic feature extraction unit 17a and outputting those from the acoustic feature extraction unit 17b (the switching unit 18-1 switches the destination of the input speech buffer 13 output in tandem). When the magnitude of the background noise is at or above the predetermined level, it outputs the acoustic features extracted by the acoustic feature extraction unit 17b, that is, features extracted from the noise-suppressed microphone input signal. Otherwise, that is, when the magnitude of the background noise is below the predetermined level, it outputs the acoustic features extracted by the acoustic feature extraction unit 17a, that is, features extracted from the microphone input signal without noise suppression. The acoustic features output from the switching unit 18-2 are the output of the acoustic analysis apparatus 1 and are used for speech recognition processing.
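The selection between the two paths can be sketched as a single function that runs exactly one of them. The three callables are placeholders for the noise suppression unit 16 and the extraction units 17a and 17b; their names and signatures are assumptions for illustration and do not correspond to any real ES201108 or ES202050 library API.

```python
from typing import Callable, List, Sequence

Signal = Sequence[float]
Features = List[float]


def analyze(
    speech: Signal,
    noise_is_loud: bool,
    suppress: Callable[[Signal], Signal],                     # stands in for unit 16
    extract_plain: Callable[[Signal], Features],              # stands in for unit 17a
    extract_after_suppression: Callable[[Signal], Features],  # stands in for unit 17b
) -> Features:
    """Run only the selected path and return its acoustic features."""
    if noise_is_loud:
        # Loud background noise: suppress first, then extract features.
        return extract_after_suppression(suppress(speech))
    # Quiet background: extract features from the unmodified signal.
    return extract_plain(speech)
```

Passing the units in as callables keeps the sketch independent of any particular front end, and the early return mirrors the statement that only the selected path operates.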

As a result, when the background noise around the microphone is loud, recognition performance can be improved by using acoustic features extracted with noise suppression applied. When the background noise is low, a drop in recognition performance can be avoided by using acoustic features extracted without noise suppression. In this way, according to this embodiment, acoustic feature extraction with and without noise suppression can each be used where it is effective, according to the background noise level. This contributes to improved recognition performance.

[Second Embodiment]
FIG. 2 is a block diagram showing the configuration of the acoustic analysis apparatus 1 according to the second embodiment of the present invention. In FIG. 2, parts corresponding to those in FIG. 1 are given the same reference numerals, and their description is omitted.
In the second embodiment, whether to perform acoustic feature extraction with noise suppression or without noise suppression is determined based on the signal-to-noise ratio (SNR).

In FIG. 2, the determination unit 15a determines whether the SNR is below a predetermined level based on the microphone input signal stored in the background noise buffer 12 and the microphone input signal stored in the input speech buffer 13. In this determination, the power level of the microphone input signal stored in the input speech buffer 13 is calculated as the signal level, the power level of the microphone input signal stored in the background noise buffer 12 is calculated as the noise level, and the SNR is calculated from the signal level and the noise level. The calculated SNR is then compared with the predetermined level. The determination unit 15a outputs the comparison result to the switching units 18a-1 and 18a-2.
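The SNR calculation described here can be sketched as below, assuming mean squared amplitude as the power level and a threshold given in decibels. The function names and the dB convention are assumptions. Following the text literally, the signal level is the power of the whole input speech buffer, which still contains background noise.

```python
import math
from typing import Sequence


def mean_power(samples: Sequence[float]) -> float:
    """Mean squared amplitude of a buffer."""
    if not samples:
        raise ValueError("empty buffer")
    return sum(s * s for s in samples) / len(samples)


def snr_db(speech: Sequence[float], noise: Sequence[float], eps: float = 1e-12) -> float:
    """SNR in dB: power of the speech buffer over power of the noise buffer."""
    # The speech buffer holds speech plus noise, as in the text's definition
    # of the signal level; eps guards against division by zero.
    return 10.0 * math.log10((mean_power(speech) + eps) / (mean_power(noise) + eps))


def use_noise_suppression(
    speech: Sequence[float], noise: Sequence[float], snr_threshold_db: float
) -> bool:
    """True when the SNR is below the threshold, i.e. suppression should be applied."""
    return snr_db(speech, noise) < snr_threshold_db
```

With a speech buffer of amplitude 1.0 and a noise buffer of amplitude 0.1, the power ratio is 100, which gives about 20 dB; against a hypothetical 15 dB threshold, noise suppression would then be skipped.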

The background noise buffer 12 used by the determination unit 15a receives the microphone input signal of the non-utterance section, and the input speech buffer 13 receives the microphone input signal of the utterance section. The noise level can therefore be calculated from signal that does not contain the speaker's voice and the signal level from signal that does, so the SNR is calculated accurately and the SNR judgment is accurate as a result.

Like the switching units 18-1 and 18-2 in FIG. 1, the switching units 18a-1 and 18a-2 switch their input and output connections in tandem according to the comparison result of the determination unit 15a. Of the two paths, namely the acoustic feature extraction unit 17a alone and the pair formed by the noise suppression unit 16 and the acoustic feature extraction unit 17b, only the one selected by the switching units 18a-1 and 18a-2 operates.

The switching unit 18a-2 switches, according to the determination result of the determination unit 15a, between outputting the acoustic features from the acoustic feature extraction unit 17a and outputting those from the acoustic feature extraction unit 17b (the switching unit 18a-1 switches the destination of the input speech buffer 13 output in tandem). When the SNR is below the predetermined level, it outputs the acoustic features extracted by the acoustic feature extraction unit 17b, that is, features extracted from the noise-suppressed microphone input signal. Otherwise, that is, when the SNR is at or above the predetermined level, it outputs the acoustic features extracted by the acoustic feature extraction unit 17a, that is, features extracted from the microphone input signal without noise suppression. The acoustic features output from the switching unit 18a-2 are the output of the acoustic analysis apparatus 1 and are used for speech recognition processing.

As a result, when the SNR is low, recognition performance can be improved by using acoustic features extracted with noise suppression applied. When the SNR is high, a drop in recognition performance can be avoided by using acoustic features extracted without noise suppression. In this way, according to this embodiment, acoustic feature extraction with and without noise suppression can each be used where it is effective, according to the SNR. This contributes to improved recognition performance.

Next, examples of speech recognition systems to which the acoustic analysis apparatus 1 of the above embodiments is applied are described.

FIG. 3 shows one example of a speech recognition system to which the acoustic analysis apparatus 1 according to the present invention is applied. In Example 1, shown in FIG. 3, the speech recognition system is realized as a single device.

In FIG. 3, the speech recognition device 100 has the acoustic analysis apparatus 1 according to the present invention, a microphone 101, a speech recognition unit 102, a control unit 103, and a display unit 104. The microphone input signal from the microphone 101 is input to the acoustic analysis apparatus 1. The acoustic analysis apparatus 1 extracts acoustic features from the microphone input signal and outputs them to the speech recognition unit 102. The speech recognition unit 102 performs speech recognition processing based on those features. The recognition result is output to the control unit 103, which displays it on the display unit 104.

The control unit 103 also controls the execution of speech recognition. As part of this control, it instructs the speaker to speak. For example, it signals the speaker to start speaking by displaying a prompt on the display unit 104. The timing at which the prompt is displayed is notified to the acoustic analysis device 1 by an utterance instruction timing signal.
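One way the utterance instruction timing signal could be used to separate the input into a background noise buffer and an input speech buffer is sketched below. This is an illustrative assumption, not the circuit of the embodiment: the class name `PromptSplitBuffer` and its methods are hypothetical, and the sketch simply treats everything received before the prompt as the non-utterance section and everything after it as the utterance section.

```python
from dataclasses import dataclass, field
from typing import List, Sequence


@dataclass
class PromptSplitBuffer:
    """Routes incoming samples to one of two buffers based on prompt timing."""

    background_noise: List[float] = field(default_factory=list)
    input_speech: List[float] = field(default_factory=list)
    prompted: bool = False

    def notify_prompt(self) -> None:
        """Called when the utterance instruction timing signal arrives."""
        self.prompted = True

    def push(self, frame: Sequence[float]) -> None:
        """Store a frame in the buffer that matches the current section."""
        target = self.input_speech if self.prompted else self.background_noise
        target.extend(frame)
```

A caller would push microphone frames continuously and call `notify_prompt()` at the moment the prompt is displayed; the background noise buffer then holds the signal that a determination step can use to judge the noise level.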

Example 1 is applicable to both portable and stationary speech recognition devices.

FIG. 4 shows another example of a speech recognition system to which the acoustic analysis device 1 according to the present invention is applied. In FIG. 4, parts corresponding to those in FIG. 3 are given the same reference numerals, and their description is omitted. Example 2, shown in FIG. 4, implements a client-server speech recognition system.

In FIG. 4, the client device 200 includes a communication unit 201 and exchanges data with the speech recognition server 210 via the communication line 220. The communication unit 201 transmits the speech feature quantities extracted by the acoustic analysis device 1 to the speech recognition server 210 to issue a speech recognition request. In response to the request, the speech recognition server 210 performs speech recognition processing based on the speech feature quantities sent from the client device 200. The speech recognition result is transmitted to the client device 200 via the communication line 220. In the client device 200, the communication unit 201 receives the speech recognition result from the speech recognition server 210 and outputs it to the control unit 103. The control unit 103 causes the display unit 104 to display the speech recognition result.
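The request and response exchange described above can be illustrated with a small Python sketch. Everything in it is an assumption made for illustration: the text does not specify a message format, so the JSON encoding, the field names, and the function names `build_recognition_request`, `handle_recognition_request`, and `parse_recognition_result` are hypothetical, and the recognizer is supplied as a callable.

```python
import json
from typing import Callable, List

FeatureFrames = List[List[float]]


def build_recognition_request(features: FeatureFrames) -> bytes:
    """Client side: wrap extracted feature frames in a request message."""
    message = {"type": "recognition_request", "features": features}
    return json.dumps(message).encode("utf-8")


def handle_recognition_request(
    payload: bytes, recognize: Callable[[FeatureFrames], str]
) -> bytes:
    """Server side: decode the features, run recognition, encode the result."""
    request = json.loads(payload.decode("utf-8"))
    text = recognize(request["features"])
    return json.dumps({"type": "recognition_result", "text": text}).encode("utf-8")


def parse_recognition_result(payload: bytes) -> str:
    """Client side: pull the recognized text out of the response message."""
    return json.loads(payload.decode("utf-8"))["text"]
```

Because only feature quantities cross the communication line, the client never has to run the recognizer itself, which is the property that makes this arrangement attractive for devices with limited processing capability.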

Example 2 is applicable to both portable and stationary speech recognition devices, and is particularly useful when applied to a device for which sufficient processing capability is difficult to secure, such as a mobile communication terminal.

As described above, according to the embodiments of the present invention, acoustic feature extraction with noise suppression and acoustic feature extraction without noise suppression can be selected effectively. This yields the advantageous effect of improving the speech recognition performance of a speech recognition system.

Note that a program for realizing the functions of the acoustic analysis device 1 shown in FIG. 1 or FIG. 2 may be recorded on a computer-readable recording medium, and the acoustic analysis processing may be performed by having a computer system read and execute the program recorded on that medium. The "computer system" referred to here may include an OS and hardware such as peripheral devices.
The "computer-readable recording medium" refers to a flexible disk, a magneto-optical disk, a ROM, a writable nonvolatile memory such as a flash memory, a portable medium such as a CD-ROM, or a storage device such as a hard disk built into a computer system.

The "computer-readable recording medium" further includes a medium that holds the program for a certain period of time, such as a volatile memory (for example, DRAM (Dynamic Random Access Memory)) inside a computer system that acts as a server or a client when the program is transmitted via a network such as the Internet or a communication line such as a telephone line.
The program may be transmitted from a computer system that stores the program in a storage device or the like to another computer system via a transmission medium, or by a transmission wave in the transmission medium. Here, the "transmission medium" that transmits the program refers to a medium having the function of transmitting information, such as a network (communication network) like the Internet or a communication line such as a telephone line.
The program may realize only part of the functions described above. It may also be a so-called difference file (difference program), which realizes the functions described above in combination with a program already recorded in the computer system.

The embodiments of the present invention have been described in detail above with reference to the drawings. However, the specific configuration is not limited to these embodiments and includes design changes and the like within a scope that does not depart from the gist of the present invention.

FIG. 1 is a block diagram showing the configuration of the acoustic analysis device 1 according to the first embodiment of the present invention.
FIG. 2 is a block diagram showing the configuration of the acoustic analysis device 1 according to the second embodiment of the present invention.
FIG. 3 is a block diagram showing one example of a speech recognition system to which the acoustic analysis device 1 according to the present invention is applied.
FIG. 4 is a block diagram showing another example of a speech recognition system to which the acoustic analysis device 1 according to the present invention is applied.

Explanation of symbols

1 ... acoustic analysis device, 11 ... switching unit (buffer means), 12 ... background noise buffer (buffer means), 13 ... input speech buffer (buffer means), 14 ... switching control unit (buffer means), 15, 15a ... determination unit, 16 ... noise suppression unit, 17a, 17b ... acoustic feature extraction unit (analysis means), 18-1 to 18-2, 18a-1 to 18a-2 ... switching unit (analysis means), 100 ... speech recognition device, 101 ... microphone (voice input means), 102 ... speech recognition unit, 103 ... control unit, 104 ... display unit, 200 ... client device, 201 ... communication unit, 210 ... speech recognition server, 220 ... communication line

Claims

An acoustic analysis device for speech recognition, comprising:
buffer means for distinguishing and storing, based on the timing at which a speaker is prompted to speak, the input signal of an utterance section and the input signal of a non-utterance section out of the input signal from voice input means;
determination means for determining, based on the input signal of the non-utterance section stored in the buffer means, whether the magnitude of background noise is at or above a predetermined level;
noise suppression means for suppressing a noise component included in the input signal; and
analysis means for extracting an acoustic feature quantity, according to the determination result of the determination means, either from the input signal in which the noise component has been suppressed by the noise suppression means or from the input signal in which the noise component has not been suppressed,
wherein the analysis means extracts the acoustic feature quantity from the input signal in which the noise component has been suppressed by the noise suppression means when the magnitude of the background noise is at or above the predetermined level, and otherwise extracts the acoustic feature quantity from the input signal in which the noise component has not been suppressed.

The acoustic analysis device according to claim 1, wherein the analysis means includes:
a first acoustic feature extraction calculation unit dedicated to extracting the acoustic feature quantity from the input signal in which the noise component has been suppressed; and
a second acoustic feature extraction calculation unit dedicated to extracting the acoustic feature quantity from the input signal in which the noise component has not been suppressed.

A speech recognition system comprising the acoustic analysis device according to claim 1 or 2.

A speech recognition system comprising the acoustic analysis device according to claim 1 or 2 in a client device connected to a speech recognition server device via a communication line.

A computer program for performing acoustic analysis for speech recognition, the computer program causing a computer to realize:
a switching control function for distinguishing and storing, based on the timing at which a speaker is prompted to speak, the input signal of an utterance section and the input signal of a non-utterance section out of the input signal from voice input means;
a determination function for determining, based on the input signal of the non-utterance section stored in the buffer means, whether the magnitude of background noise is at or above a predetermined level;
a noise suppression function for suppressing a noise component included in the input signal; and
an analysis function for extracting an acoustic feature quantity, according to the determination result of the determination means, either from the input signal in which the noise component has been suppressed by the noise suppression means or from the input signal in which the noise component has not been suppressed,
wherein the analysis function extracts the acoustic feature quantity from the input signal in which the noise component has been suppressed by the noise suppression means when the magnitude of the background noise is at or above the predetermined level, and otherwise extracts the acoustic feature quantity from the input signal in which the noise component has not been suppressed.