JP2019219556A

JP2019219556A - Voice recognition support system

Info

Publication number: JP2019219556A
Application number: JP2018117802A
Authority: JP
Inventors: Yoshinori Kato; Nobumitsu Hirano; Masayuki Sato; So Fujiwara
Original assignee: New Japan Radio Co Ltd
Current assignee: New Japan Radio Co Ltd
Priority date: 2018-06-21
Filing date: 2018-06-21
Publication date: 2019-12-26
Anticipated expiration: 2038-06-21
Also published as: JP7042169B2

Abstract

PROBLEM TO BE SOLVED: To improve the S/N ratio by removing stationary noise in addition to time-varying noise, and to increase the voice recognition rate of a target voice signal even when the level of the input target voice signal is low. SOLUTION: The system includes an input AGC processing unit 3, a subtractive beamforming processing unit 4, a time-varying noise spectrum estimation processing unit 6, a stationary noise spectrum estimation processing unit 7, a target voice signal extraction processing unit 9, a target voice section detection processing unit 10, and an output AGC processing unit 13. The stationary noise spectrum estimation processing unit 7 operates in a noise section detected by the target voice section detection processing unit 10. SELECTED DRAWING: Figure 1

Description

The present invention relates to a voice recognition support system that removes noise components from the received voice signals contained in the signals picked up by a plurality of microphones and extracts a target voice signal.

A voice recognition support system such as that shown in FIG. 10 has been proposed (Non-Patent Document 1). It uses a plurality of microphones and removes noise components from the received voice signals contained in the signals picked up by those microphones in order to extract a target voice signal with a high S/N ratio.

In FIG. 10, 21L and 21R are microphones arranged at a predetermined interval. Reference numeral 22 denotes an A/D conversion processing unit, which A/D-converts the received voice signals picked up by the microphones 21L and 21R and generates frame signals in units of a predetermined time.

Reference numeral 23 denotes a subtractive beamforming processing unit, which takes in the received voice signals picked up by the microphones 21L and 21R and calculates, as noise components, the received voice signals arriving from directions other than a specific direction. Reference numeral 24 denotes an FFT (fast Fourier transform) processing unit that converts the noise components obtained by the subtractive beamforming processing unit 23 from the time domain to the frequency domain.

Reference numeral 25 denotes a time-varying noise spectrum estimation processing unit, which estimates the spectrum of noise components that change over time from the noise components output by the FFT processing unit 24. Reference numeral 26 denotes an FFT processing unit that converts the received voice signal output by the A/D conversion processing unit 22 from the time domain to the frequency domain.

Reference numeral 27 denotes a target voice signal extraction processing unit, which extracts the spectrum of the target voice signal by subtracting the noise component spectrum estimated by the time-varying noise spectrum estimation processing unit 25 from the spectrum of the received voice signal taken in from the FFT processing unit 26.

Reference numeral 28 denotes an IFFT (inverse fast Fourier transform) processing unit that converts the target voice signal extracted by the target voice signal extraction processing unit 27 from the frequency domain back to a time-domain signal.

In this voice recognition support system, the target voice signal can be extracted by removing noise components from the voice signals received by the microphones 21L and 21R.

Non-Patent Document 1: Mizumachi and Akagi, "Noise Reduction Method Using Spectral Subtraction with a Microphone Pair," Transactions of the Institute of Electronics, Information and Communication Engineers A, Vol. J82-A, No. 4, pp. 503-512, April 1999

However, in the voice recognition support system of FIG. 10, the time-varying noise spectrum estimated by the time-varying noise spectrum estimation processing unit 25 makes it possible to track noise components that change over time and remove their spectrum from the spectrum of the received voice signal, but the spectrum of stationary noise such as environmental noise or vehicle running noise cannot be estimated. Stationary noise removal performance is therefore low, and an improvement in the S/N ratio has been desired. In addition, when the level of the target voice signal contained in the received voice signal is low, the target voice signal is buried in the noise components and the voice recognition rate of the target voice signal drops.

An object of the present invention is to provide a voice recognition support system that improves the S/N ratio by removing stationary noise in addition to time-varying noise, and that can achieve a high voice recognition rate for the target voice signal even when the level of the input target voice signal is low.

To achieve the above object, the invention according to claim 1 comprises: an input AGC processing unit that receives a received signal and outputs a received voice signal adjusted to a predetermined signal level; a subtractive beamforming processing unit that takes in the received voice signal output from the input AGC processing unit and extracts, as a noise component, the received voice signal from directions other than a specific direction; a time-varying noise spectrum estimation processing unit that takes in the noise component extracted by the subtractive beamforming processing unit and estimates a noise component spectrum that changes over time; a stationary noise spectrum estimation processing unit that takes in the noise component extracted by the subtractive beamforming processing unit and estimates a spectrum of constantly occurring noise components; a target voice signal extraction processing unit that takes in the received voice signal output from the input AGC processing unit and extracts a target voice signal by removing the time-varying noise spectrum estimated by the time-varying noise spectrum estimation processing unit and the stationary noise spectrum estimated by the stationary noise spectrum estimation processing unit; an output AGC processing unit that takes in the target voice signal extracted by the target voice signal extraction processing unit and adjusts the signal level of the target voice section; and a target voice section detection processing unit that detects the target voice section from the start timing and end timing of the target voice signal extracted by the target voice signal extraction processing unit and detects everything other than the target voice section as a noise section, wherein the stationary noise spectrum estimation processing unit operates in the noise section detected by the target voice section detection processing unit.
The invention according to claim 2 is the voice recognition support system according to claim 1, wherein the stationary noise spectrum estimation processing unit estimates the stationary noise spectrum by accumulating the spectra of the noise components detected in the noise section.
The invention according to claim 3 is the voice recognition support system according to claim 1 or 2, wherein the input AGC processing unit outputs the received signal with its level compressed when the target voice section is longer than a first set time and, when the noise section is longer than a second set time, amplifies the level and outputs the received voice signal within a range in which its level does not exceed a first set value.
The invention according to claim 4 is the voice recognition support system according to claim 1, 2, or 3, wherein the output AGC processing unit selectively amplifies the level of the target voice signal within a range in which the level of the target voice signal input to the output AGC processing unit does not exceed a second set value.
The invention according to claim 5 is the voice recognition support system according to claim 1, 2, 3, or 4, further comprising means for adjusting the start timing of the target voice signal.

According to the present invention, a stationary noise spectrum estimation processing unit is provided and operated in the noise section, so time-varying noise estimation and stationary noise estimation can be processed in parallel, all kinds of noise can be reduced, and the S/N ratio of the target voice signal can be greatly improved. In addition, because the input AGC processing unit and the output AGC processing unit are provided, the voice recognition rate of the target voice signal can be made high even when the level of the input target voice signal is low.

FIG. 1 is a functional block diagram of the voice recognition support system of the first embodiment.
FIG. 2 is a flowchart of the input AGC processing unit.
FIG. 3: (a) and (b) are explanatory diagrams showing the relationship between the microphones, the target voice, and the noise; (c) is a flowchart of the subtractive beamforming processing unit.
FIG. 4 is a flowchart of the time-varying noise spectrum estimation processing.
FIG. 5 is a flowchart of the stationary noise spectrum estimation processing.
FIG. 6 is a flowchart of the target voice signal extraction processing.
FIG. 7 is a flowchart of the target voice section detection processing.
FIG. 8: (a) is a characteristic diagram of the entropy of a normal received voice signal; (b) is a characteristic diagram of the entropy of an excessively large received voice signal; (c) is a characteristic diagram of the entropy of an excessively small received voice signal.
FIG. 9 is a flowchart of the output AGC processing unit.
FIG. 10 is a functional block diagram of a conventional voice recognition support system.

FIG. 1 shows a voice recognition support system according to one embodiment of the present invention. Reference numerals 1L and 1R denote L-channel and R-channel microphones arranged at a predetermined interval. Reference numeral 2 denotes an A/D conversion processing unit, which A/D-converts the received signals picked up by the microphones 1L and 1R and generates frame signals in units of a predetermined time. Reference numeral 3 denotes an input AGC processing unit that adjusts the level of the received signal taken from the A/D conversion processing unit 2 and outputs a received voice signal.

Reference numeral 4 denotes a subtractive beamforming processing unit, which takes in the two received voice signals output by the input AGC processing unit 3 and calculates, as noise components, the received voice signals arriving from directions other than a specific direction. Reference numeral 5 denotes an FFT (fast Fourier transform) processing unit that converts the noise components obtained by the subtractive beamforming processing unit 4 from the time domain to the frequency domain.

Reference numeral 6 denotes a time-varying noise spectrum estimation processing unit, which estimates the time-varying noise spectrum (frequency and level) that changes over time from the noise components taken in from the FFT processing unit 5. Reference numeral 7 denotes a stationary noise spectrum estimation processing unit, which estimates the spectrum of constantly occurring stationary noise from the noise components output by the FFT processing unit 5.

Reference numeral 8 denotes an FFT processing unit that converts the received voice signal output by the input AGC processing unit 3 from the time domain to the frequency domain. Reference numeral 9 denotes a target voice signal extraction processing unit. It takes in the time-varying noise spectrum obtained by the time-varying noise spectrum estimation processing unit 6 and the stationary noise spectrum obtained by the stationary noise spectrum estimation processing unit 7, and extracts the spectrum of the target voice signal by removing both noise spectra from the spectrum of the received voice signal taken in from the FFT processing unit 8.

Reference numeral 10 denotes a target voice section detection processing unit, which takes in the spectrum of the target voice signal obtained by the target voice signal extraction processing unit 9 and detects the boundaries between target voice sections and noise sections. Reference numeral 11 denotes an IFFT (inverse fast Fourier transform) processing unit that converts the target voice signal output by the target voice signal extraction processing unit 9 from the frequency domain back to a time-domain signal.

Reference numeral 12 denotes a delay processing unit, which corrects the error in the timing of the switch from noise section to target voice section that arises when the target voice section detection processing unit 10 detects noise sections and target voice sections. Reference numeral 13 denotes an output AGC processing unit that adjusts the level of the target voice signal after delay correction by the delay processing unit 12.

The individual processing units are described below. FIG. 2 shows the processing flowchart of the input AGC processing unit 3. After A/D conversion (S1), signals outside the voice band contained in the received signal are removed by a high-pass filter and a low-pass filter to obtain the received voice signal (S2). If the level of the received voice signal exceeds a set value A, level compression is performed (S3, S4).

Level compression is also performed when the continuous target voice detection time (target voice section) reported by the target voice section detection processing unit 10 exceeds a set time T1 (S5, S6). Otherwise, it is next determined whether the continuous noise detection time (noise section) exceeds a set time T2 (S7). If it does, the level of the target voice signal is presumed to be low and level amplification is performed (S8). If it does not, the level is left unchanged (S9). The level of the received voice signal is then changed according to steps S4, S6, and S8 (S10).

As a result of the above processing, the received voice signal is level-compressed when the target voice section is longer than the set time T1, and when the noise section is longer than the set time T2 the received voice signal is level-amplified within a range in which its level does not exceed the set value A.
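The decision order of steps S3 to S9 can be sketched in Python as follows. This is only an illustrative sketch: the names `decide_input_gain` and `apply_input_gain`, the peak-based level measure, and the fixed gain step of 1.25 are assumptions made for the example and are not taken from the publication, which does not state how much gain is applied.

```python
from enum import Enum


class GainAction(Enum):
    COMPRESS = "compress"
    AMPLIFY = "amplify"
    HOLD = "hold"


def decide_input_gain(level, voice_run_s, noise_run_s, set_value_a, t1_s, t2_s):
    """Choose the level change for one frame, following the order of S3 to S9."""
    if level > set_value_a:      # S3 -> S4: level above set value A
        return GainAction.COMPRESS
    if voice_run_s > t1_s:       # S5 -> S6: target voice section longer than T1
        return GainAction.COMPRESS
    if noise_run_s > t2_s:       # S7 -> S8: noise section longer than T2
        return GainAction.AMPLIFY
    return GainAction.HOLD       # S9: leave the level as it is


def apply_input_gain(samples, action, set_value_a, step=1.25):
    """Apply the chosen action (S10). The step size is a hypothetical value."""
    if action is GainAction.COMPRESS:
        gain = 1.0 / step
    elif action is GainAction.AMPLIFY:
        peak = max((abs(s) for s in samples), default=0.0)
        # Cap the gain so the amplified peak does not exceed set value A.
        gain = min(step, set_value_a / peak) if peak > 0 else step
    else:
        gain = 1.0
    return [s * gain for s in samples]
```

For example, a quiet frame that follows a noise section longer than T2 is amplified, but only up to the set value A.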

FIGS. 3(a) and 3(b) are explanatory diagrams of the processing of the subtractive beamforming processing unit 4, and FIG. 3(c) is its flowchart. Suppose that the microphone 1L and the microphone 1R are placed a distance L1 apart as shown in FIG. 3(a), and that the target voice indicated by the solid line and the noise indicated by the broken line are received by both microphones 1L and 1R. The target voice then reaches the microphone 1L later than the microphone 1R by a time d, and the noise reaches the microphone 1R later than the microphone 1L by a time τ, so these delays are detected (S11). Using the detected delay times d and τ, the calculation of equation (1) is performed to extract the noise component glr input to the microphone 1L and the noise component grl input to the microphone 1R (S12). These components glr and grl are output as the noise components. Here l is the received voice signal of the microphone 1L and r is the received voice signal of the microphone 1R.
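The idea behind a subtractive beamformer can be illustrated with a short Python sketch. It is not equation (1), which is not reproduced in this text. It assumes integer sample delays and shows only the generic step of aligning the target voice between the two channels and subtracting, so that the target cancels and a noise reference remains. The function names `delay` and `noise_reference` are hypothetical.

```python
def delay(signal, k):
    """Delay a signal by k samples, zero-padding the start and keeping the length."""
    n = len(signal)
    k = max(0, min(k, n))
    return [0.0] * k + list(signal[: n - k])


def noise_reference(l, r, d):
    """Cancel the target voice and keep a noise reference.

    Assumption: the target reaches microphone 1L d samples after microphone 1R,
    so delaying r by d aligns the target in both channels and l(t) - r(t - d)
    removes it. Sound from other directions is not aligned and survives.
    """
    r_delayed = delay(r, d)
    return [a - b for a, b in zip(l, r_delayed)]
```

With a target-only input the output is zero, while a noise source with a different inter-microphone delay leaves a nonzero residual that can be passed on to the FFT stage.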

FIG. 4 is a flowchart of the processing of the time-varying noise spectrum estimation processing unit 6. The noise components glr and grl obtained by the subtractive beamforming processing unit 4 are FFT-processed (S21), and then the time-varying noise spectrum is estimated (S22).

The time-varying noise spectrum estimate is calculated by equation (2). N̂(ω), that is, N(ω) with a circumflex, denotes the estimated noise component in the frequency domain. G(ω) is the spectrum of the noise component immediately after the subtractive beamforming output has been converted from the time domain to the frequency domain, and ε is a value greater than 0 and sufficiently smaller than 1.

FIG. 5 is a flowchart of the processing of the stationary noise spectrum estimation processing unit 7. Whereas time-varying noise spectrum estimation estimates noise components that change in real time, stationary noise spectrum estimation improves noise removal performance by detecting constantly occurring noise components in the noise section detected by the target voice section detection processing unit 10 (a section that excludes the residual target voice components that subtractive beamforming could not fully remove from the estimated noise). The time-varying noise spectrum estimation result is delayed (S31), and when the target voice section detection result indicates a noise section (S32), the stationary noise spectrum estimate is calculated (S33).

The delay processing S31 corrects the timing error of the switch from noise section to target voice section that arises when the target voice section detection processing detects the boundary between the two. By calculating the stationary noise spectrum from noise spectra that precede the timing of the switch to the voice section, it prevents residual target voice components from being included.

The stationary noise spectrum estimate is calculated by an accumulation formula. α is a coefficient for averaging (0 ≤ α ≤ 1). (n) denotes the current frame and (n−1) denotes the frame one before it. The second term on the right-hand side is the accumulated value of the stationary noise spectrum estimates up to the current frame.
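The accumulation formula itself is not reproduced in this text. One common form that matches the description (an averaging coefficient α, a current-frame term, and an accumulated term from frame n−1) is a first-order recursive average. The Python sketch below assumes that form, with α weighting the current frame. Which term α multiplies in the actual formula is not known from the text, and the name `update_stationary_noise` is hypothetical.

```python
def update_stationary_noise(prev_estimate, noise_spectrum, alpha, is_noise_section):
    """Update the stationary noise estimate for one frame.

    Assumed form: N_st(n) = alpha * N(n) + (1 - alpha) * N_st(n - 1),
    applied per frequency bin and only while a noise section is detected (S32).
    Outside noise sections the previous estimate is kept unchanged.
    """
    if not is_noise_section:
        return list(prev_estimate)
    return [
        alpha * n + (1.0 - alpha) * p
        for n, p in zip(noise_spectrum, prev_estimate)
    ]
```

Gating the update on the noise section flag reflects the requirement that the stationary noise spectrum estimation processing unit operate only in noise sections, so target voice does not leak into the estimate.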

FIG. 6 is the processing flowchart of the target voice signal extraction processing unit 9. The target voice signal extraction processing unit 9 takes in three inputs: the output of the input AGC processing unit 3 after the FFT processing unit 8 has converted it from the time domain to the frequency domain (S41), the estimate from the time-varying noise spectrum estimation processing unit 6, and the result from the stationary noise spectrum estimation processing unit 7. From these it calculates the spectrum of the target voice signal (S42).

The estimate of the spectrum of the target voice signal is calculated by equation (4). Ŝ(ω), that is, S(ω) with a circumflex, denotes the target voice signal in the frequency domain, and X(ω) denotes the received voice signal in the frequency domain (containing the target voice signal and noise components) taken in from the FFT processing unit 8. β and γ are coefficients (0 ≤ β ≤ 1, 0 ≤ γ ≤ 1).
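Equation (4) is not reproduced in this text. The following Python sketch assumes the usual magnitude-domain spectral subtraction form, in which the time-varying noise estimate weighted by β and the stationary noise estimate weighted by γ are subtracted from the received spectrum bin by bin, with negative results clamped to a floor. The function name and the `floor` parameter are hypothetical, and the clamping is an assumption rather than something stated in the publication.

```python
def extract_target_magnitude(x_mag, tv_noise_mag, st_noise_mag, beta, gamma, floor=0.0):
    """Subtract both noise estimates from the received magnitude spectrum.

    Assumed form: |S(w)| = max(|X(w)| - beta * |N_tv(w)| - gamma * |N_st(w)|, floor)
    """
    return [
        max(x - beta * nt - gamma * ns, floor)
        for x, nt, ns in zip(x_mag, tv_noise_mag, st_noise_mag)
    ]
```

Setting β or γ to 0 disables the corresponding noise estimate, which makes it easy to compare the contribution of each one.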

FIG. 7 is a flowchart of the processing of the target voice section detection processing unit 10. Here the target voice section and the noise section are distinguished in the received voice signal, which contains both the target voice signal and noise components. As shown in FIG. 8(a), in the normal case a continuous period in which the entropy (power) of the received voice signal exceeds a threshold h is a target voice section, and a continuous period in which it is below the threshold h is a noise section. The target voice section runs from the start timing ta of the target voice signal to its end timing tb, and every other section is a noise section. The target voice section detection processing unit 10 detects these two adjacent timings ta and tb and controls the input AGC processing unit 3, the delay processing unit 12, and the output AGC processing unit 13.

First, the target voice signal extraction result obtained by the target voice signal extraction processing unit 9 is taken in and its entropy is calculated (S51). When the entropy becomes larger than the threshold h, the target voice signal is deemed to have been detected (timing ta) and a hold time is set (S52, S53). The hold time masks noise detection: even if the target voice signal is no longer detected after the detected start timing ta, that is, even if noise is detected, noise detection is suppressed for the duration of the hold time. In this way, from the detected start timing ta until the hold time ends, any detected end of the target voice signal is ignored, which prevents degradation of sound quality. The hold time is set to, for example, about 100 ms to 200 ms.

When the continuous target voice detection time (target voice section), measured from detection of the start timing ta to detection of the end timing tb of the target voice signal, exceeds the set time T1, the input AGC processing unit 3 performs level compression (S54, S55). As shown in FIG. 8(b), when the entropy of the received voice signal is high overall, the noise components as well as the target voice signal exceed the threshold h, so everything is misrecognized as the target voice signal. Therefore, once the continuous target voice detection time exceeds the set time T1, the input AGC processing unit 3 compresses the level of the received voice signal so that the target voice signal and the noise components in it can be distinguished.

When the continuous target voice detection time is shorter than the set time T1, the start timing ta of the target voice signal is corrected (S56). The correction moves the timing ta earlier than the actually detected timing so that the detection of the target voice signal has some margin. The delay time needed for this correction is calculated for the target voice signal of the immediately preceding frame (S57) and is set in the delay processing unit 12.

On the other hand, when the entropy falls below the threshold h in step S52, the target voice signal is deemed to be no longer detected, and the process waits for the hold time set in step S53 to expire (S58). When the hold time has expired, if the continuous noise section detection time (noise section), measured from detection of the target voice section end timing tb to detection of the next target voice section start timing ta, exceeds the set time T2, level amplification is performed in the input AGC processing unit 3 as described for FIG. 2 (S59, S60). As shown in FIG. 8(c), if the entropy of the received voice signal is low overall and does not reach the threshold h even though the target voice signal is present, everything would be misrecognized as noise. The level of the received voice signal is therefore amplified so that the target voice signal and the noise components can be distinguished. If the continuous noise section detection time does not exceed the set time T2, the input AGC processing unit 3 does not change the level (S61).
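The threshold-and-hold behaviour of steps S52, S53, and S58 can be sketched in Python as below. The sketch works on a precomputed per-frame entropy sequence and expresses the hold time as a number of frames. Both choices, the function name `detect_voice_sections`, and the decision to re-arm the hold time on every frame above the threshold are assumptions made for illustration.

```python
def detect_voice_sections(entropy_per_frame, threshold_h, hold_frames):
    """Return one flag per frame: True for target voice section, False for noise.

    A frame above threshold h is voice and (re)arms the hold time (S52, S53).
    A frame below the threshold still counts as voice until the hold time
    has expired (S58), which masks brief dips inside an utterance.
    """
    flags = []
    hold_left = 0
    for entropy in entropy_per_frame:
        if entropy > threshold_h:
            hold_left = hold_frames
            flags.append(True)
        elif hold_left > 0:
            hold_left -= 1
            flags.append(True)
        else:
            flags.append(False)
    return flags
```

With a hold time of 100 ms to 200 ms and, say, 10 ms frames, `hold_frames` would be somewhere between 10 and 20. The frame length is not given in the text, so that figure is only an example.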

FIG. 9 is the processing flowchart of the delay processing unit 12 and the output AGC processing unit 13. The target voice signal is output by the target voice signal extraction processing unit 9 and restored from the frequency domain to a time-domain signal by the IFFT processing unit 11. The delay processing in the delay processing unit 12 then corrects the error in the timing ta of the switch from noise section to target voice section detected by the target voice section detection processing unit 10 (S71). This delay processing is performed so that the signal is aligned with the processing in the output AGC processing unit 13.

When the level of the delayed target voice signal exceeds a set value B, the output AGC processing unit 13 performs level compression (S72, S73, S74). When the target voice section detection processing unit 10 has detected a target voice section, the output AGC processing unit 13 performs level amplification (S75, S76), and when no target voice section is detected the level is left unchanged (S75, S77). In this way, the output AGC processing unit 13 selectively amplifies the level of the target voice signal within a range in which the level of the input target voice signal does not exceed the set value B.
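The output AGC decisions of steps S72 to S77 can be sketched in Python as follows. The peak-based level measure, the gain step of 1.5, and the choice to compress exactly down to the set value B are assumptions made for the example. The publication states only that compression occurs above B and that amplification stays within B. The name `output_agc` is hypothetical.

```python
def output_agc(samples, set_value_b, voice_detected, step=1.5):
    """Adjust the level of one frame of the delayed target voice signal."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak > set_value_b:
        # S72 to S74: level above set value B, so compress (here: down to B).
        gain = set_value_b / peak
    elif voice_detected and peak > 0:
        # S75, S76: amplify target voice sections, capped so the peak stays within B.
        gain = min(step, set_value_b / peak)
    else:
        # S75, S77: noise section, leave the level unchanged.
        gain = 1.0
    return [s * gain for s in samples]
```

Because amplification is applied only while a target voice section is detected, residual noise between utterances is not boosted along with the speech.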

As described above, the voice recognition support system of this embodiment separates the target voice, which enters the microphones 1L and 1R from the direction indicated by the solid line in FIG. 3, from the noise, which enters the microphones 1L and 1R from the directions indicated by the broken lines in FIG. 3. It also adjusts the target voice to a predetermined level with an increased S/N ratio, which makes the system well suited to voice recognition support.

1L, 1R: microphone; 2: A/D conversion processing unit; 3: input AGC processing unit; 4: subtractive beamforming processing unit; 5: FFT processing unit; 6: time-varying noise spectrum estimation processing unit; 7: stationary noise spectrum estimation processing unit; 8: FFT processing unit; 9: target voice signal extraction processing unit; 10: target voice section detection processing unit; 11: IFFT processing unit; 12: delay processing unit; 13: output AGC processing unit

Claims

A voice recognition support system comprising:
an input AGC processing unit that receives a received signal and outputs a received voice signal adjusted to a predetermined signal level;
a subtractive beamforming processing unit that takes in the received voice signal output from the input AGC processing unit and extracts, as a noise component, the received voice signal arriving from directions other than a specific direction;
a time-varying noise spectrum estimation processing unit that takes in the noise component extracted by the subtractive beamforming processing unit and estimates the spectrum of the noise component that changes over time;
a stationary noise spectrum estimation processing unit that takes in the noise component extracted by the subtractive beamforming processing unit and estimates the spectrum of the noise component that occurs constantly;
a target voice signal extraction processing unit that takes in the received voice signal output from the input AGC processing unit and extracts a target voice signal by removing the time-varying noise spectrum estimated by the time-varying noise spectrum estimation processing unit and the stationary noise spectrum estimated by the stationary noise spectrum estimation processing unit;
an output AGC processing unit that takes in the target voice signal extracted by the target voice signal extraction processing unit and adjusts the signal level of a target voice section; and
a target voice section detection processing unit that detects the target voice section from the start timing and the end timing of the target voice signal extracted by the target voice signal extraction processing unit, and detects any section other than the target voice section as a noise section,
wherein the stationary noise spectrum estimation processing unit operates in the noise section detected by the target voice section detection processing unit.

The voice recognition support system according to claim 1,
wherein the stationary noise spectrum estimation processing unit estimates the stationary noise spectrum by accumulating the spectra of the noise components detected in the noise section.

The voice recognition support system according to claim 1 or 2,
wherein the input AGC processing unit outputs the received voice signal with its level compressed when the target voice section is longer than a first set time, and with its level amplified when the noise section is longer than a second set time, such that the output received voice signal stays within a range not exceeding a first set value.

The voice recognition support system according to claim 1, 2, or 3,
wherein the output AGC processing unit selectively amplifies the level of the target voice signal within a range where the level of the target voice signal input to the output AGC processing unit does not exceed a second set value.

The voice recognition support system according to claim 1, 2, 3, or 4,
further comprising means for adjusting the start timing of the target voice signal.