JP5231139B2

JP5231139B2 - Sound source extraction device

Info

Publication number: JP5231139B2
Application number: JP2008218565A
Authority: JP
Inventors: 真人戸上; 洋平川口; 康成大淵
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 2008-08-27
Filing date: 2008-08-27
Publication date: 2013-07-10
Anticipated expiration: 2028-08-27
Also published as: JP2010054728A

Description

The present invention relates to a conversation extraction device, and more particularly to a sound source extraction device that extracts only the signal of a specific sound source from a mixture of various sound sources.

Sound source separation techniques that use multiple microphones to extract only a specific sound from a mixture of sounds have long been studied actively. Applications have been considered such as extracting the driver's voice from speech recorded in a vehicle cabin with superimposed driving noise (see, for example, Patent Document 1). Conventional sound source separation techniques fall broadly into two classes: blind source separation based on independent component analysis, and beamforming techniques such as methods based on the SNR maximization criterion (see, for example, Non-Patent Document 2).

Patent Document 1: Japanese Unexamined Patent Application Publication No. 2007-10897
Non-Patent Document 1: J. Chen, J. Benesty, and Y. Huang, "A minimum distortion noise reduction algorithm with multiple microphones," IEEE Trans. ASLP, vol. 16, pp. 481-493, 2008.
Non-Patent Document 2: S. Araki, H. Sawada, and S. Makino, "Blind speech separation in a meeting situation with maximum snr beamformers," Proc. ICASSP2007, vol. I, pp. 41-44, 2007.
Non-Patent Document 3: M. Togami, T. Sumiyoshi, and A. Amano, "Stepwise phase difference restoration method for sound source localization using multiple microphone pairs," Proc. ICASSP2007, vol. I, pp. 117-120, 2007.

Blind source separation has the advantage of requiring no information about the microphone arrangement or the target sound direction, but its performance is insufficient in reverberant environments. Beamforming based on the SNR maximization criterion performs poorly when the signal is wideband. It is therefore common to apply it to signals that have been converted into narrowband signals by time-frequency decomposition. Converting to narrowband signals generally requires a long frame length, however, and with a long frame the assumption that speech is stationary within a frame breaks down, so performance degrades instead. The minimum distortion beamformer (see, for example, Non-Patent Document 1) is a technique applicable to wideband time-domain signals. It suppresses noise well when the noise is stationary, such as the sound of a projector fan, but it is inherently less effective against non-stationary noise whose level changes from moment to moment, as speech does.

The sound source extraction device of the present invention has multi-channel spatial prediction, which can estimate the spatial transfer characteristics of noise using microphone elements of multiple channels, and a process that corrects the distortion of the target sound caused by the multi-channel spatial prediction. Multi-channel spatial prediction can estimate the spatial transfer characteristics of the noise regardless of whether the noise is stationary or non-stationary, so even non-stationary noise can be suppressed using the estimated characteristics. The invention also has a noise removal filter with multiple taps, which allows noise to be suppressed with reverberation taken into account. Because the reverberation of the target sound can be taken into account in the same way, the reverberant component of the target sound can be extracted without distortion.

The sound source extraction device of the present invention has a microphone array consisting of a plurality of microphone elements, an AD conversion device that converts the analog signals output from the microphone array into digital signals, a calculation device, and a storage device. The calculation device applies digital signal processing that suppresses the noise component in the digital signals converted by the AD conversion device, obtains a noise-suppressed signal, corrects the distortion of the target sound contained in the noise-suppressed signal, and reproduces the corrected signal or stores it in the storage device.

The calculation device has a multi-channel spatial prediction unit that approximates the noise signal contained in one of the microphone elements by the sum of the noise signals contained in the other elements, each filtered by a first FIR filter, and determines the coefficients of the first FIR filter so that the sum of squared approximation errors is minimized. The noise-suppressed signal can be generated by subtracting, from the signal of any one of the microphone elements, the sum of the signals of the other elements, each convolved with the first FIR filter obtained by the multi-channel spatial prediction unit.

Preferably, the device further has a multi-channel distortion correction unit that generates a noise-suppressed signal individually for the output of every microphone element of the microphone array and applies a second FIR filter to the resulting plurality of noise-suppressed signals to obtain a one-channel distortion-corrected signal. The second FIR filter of the multi-channel distortion correction unit is determined so as to minimize the sum of two terms: the squared error between the distortion-corrected signal and the output signal of a specific microphone element in the array (or a delayed version of that signal), and a constant times the sum of squares of the distortion-corrected signal obtained when the microphone input contains only noise.

Preferably, the device further has a noise signal estimation unit that estimates the noise signal, and a one-channel distortion correction unit. The one-channel distortion correction unit applies separate third FIR filters to the estimated noise signal and to the distortion-corrected signal, determines the third FIR filters so as to minimize the squared error between the sum of the two filtered signals and the output signal of a specific microphone element in the array (or a delayed version of that signal), and outputs the distortion-corrected signal filtered by its third FIR filter.

Noise intervals can be identified from a mixing degree computed, for each short interval, from the ratio of target sound power to noise power. These powers are calculated using the target sound position identified through the user's operation of specifying that position.

With the noise suppression method of the present invention, as long as the spatial transfer characteristics of the noise do not change, the noise can in principle be cancelled even when its source signal is non-stationary, such as speech. A specific voice can therefore be extracted from a mixture of several voices, which makes a highly accurate voice monitoring system feasible. The invention is applicable to wideband signals in the time domain or the subband domain, so the signal does not need to be converted to the time-frequency domain. Because the stationarity problem of speech in the time-frequency domain does not arise, a noise-suppressed signal of higher quality can be obtained than with time-frequency-domain techniques.

Specific embodiments of the present invention are described below with reference to the drawings.
FIG. 1 shows the hardware configuration of the first embodiment of the present invention. The analog sound pressure captured by the microphone array 101, which has a plurality of microphone elements, is sent to the AD converter 102 and converted from analog to digital data. The conversion is performed for each microphone element. The digital sound pressure data of each microphone element is sent to the central processing unit 103, where digital signal processing is applied. The software that performs the digital signal processing and the data it needs are stored in advance in the nonvolatile memory 105, and the work area required for processing is allocated in the volatile memory 104. The processed sound pressure data is sent to the DA converter 106, converted from digital data to analog sound pressure, and then output from the speaker 107 for playback. All software blocks of the first embodiment are executed on the central processing unit 103.

FIG. 2 shows the software block configuration of the first embodiment, and FIG. 20 shows the correspondence between the software blocks and the hardware configuration of FIG. 1. The waveform capturing unit 201 loads the digital data of each microphone element captured by the AD converter into the volatile memory 104. The captured sound pressure data is written as in equation (1).

x_m(t)    (1)

Here m is the index of the microphone element and takes values from 1 to M, where M is the number of microphone elements used for noise suppression, and t is the time index in units of the sampling interval.

The captured waveform is sent to the filter adaptation processing unit 202, which adapts the noise suppression filter. The adapted filter coefficients are stored in the filter data 204, held in the volatile memory 104 or the nonvolatile memory 105. The filtering unit 203 reads the stored filter data 204 and applies the noise suppression filter to the microphone input signals captured by the waveform capturing unit 201 to obtain the noise-suppressed signal. The noise-suppressed signal is sent to the waveform reproduction unit 205 and output from the speaker 107 for playback. Alternatively, the noise-suppressed signal may be stored in the volatile memory 104 or the nonvolatile memory 105 and transmitted to an external system through a network device or the like, or read out and played back by another system.

The sound captured by the waveform capturing unit 201 is assumed to be either noise alone, which the user does not need, or noise mixed with the target sound that the user wants to hear. The aim of the invention is to suppress the noise in such sound and extract the target sound. One of the M microphone elements is called the target microphone, and the target sound component is extracted from its input signal. Using a method described later, the filter adaptation processing unit 202 determines whether the sound data obtained by the waveform capturing unit is noise alone or contains the target sound, and uses the result to adapt the filter. Filter adaptation is done as batch processing, that is, using recorded data of a fairly long duration. The filtering unit 203, by contrast, can run every time a waveform is obtained, as long as the filter data 204 is available.

FIG. 3 shows a flowchart of the processing in the filter adaptation processing unit 202. The filter adaptation process first determines whether the sound obtained by the waveform capturing unit is noise alone or a mixture containing the target sound (mixed sound). Noise capture S301 takes in the data of the time periods judged to be noise and loads it into the volatile memory. Mixed sound capture S302 takes in the data of the time periods judged to be mixed sound and loads it into the volatile memory. The noise obtained is expressed by equation (2) and the mixed sound by equation (3).

T is the operator denoting the conjugate transpose of a vector or matrix. Noise data and mixed-sound data of durations Ln and Ls, respectively, are assumed to be available.

In the present invention, the signals obtained by the microphones may be divided into multiple subbands, for example by filter bank processing, before being processed. In that case, analysis filter bank processing is applied immediately after the signals are captured from the microphones to split them into subbands, the noise suppression processing of the invention is applied to each subband, and synthesis filter bank processing is applied to the noise-suppressed subband signals to recombine them into a single signal. When a DFT (Discrete Fourier Transform) modulated filter bank is used, the subband signals are complex-valued; the processing of the invention applies whether the input signal is complex or real.

The noise and mixed sound obtained are processed in noise multi-channel spatial prediction S303. As noise statistics, the noise covariance matrix of equation (4) and the noise correlation matrix of equation (5) are obtained. V_m(t) is defined by equation (6); it is a vector of (M-1)L elements that excludes the m-th microphone input signal. L is the filter length, and D is a delay introduced to satisfy causality.

The filter (an FIR, or finite impulse response, filter) that approximates the m-th microphone signal from the microphone elements other than the m-th, so as to minimize the squared error, is given by equation (7).
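As a sketch of what such a minimum squared-error filter looks like in general (this is the textbook least-squares solution, not the patent's equations (4) to (7) themselves), assume V_m(t) stacks the L most recent samples of every channel except m. The symbols R_m and r_m, the summation over noise-only samples, and the stacking order are assumptions made for illustration; T denotes the conjugate transpose, as in the text.

```latex
\begin{aligned}
V_m(t) &= \bigl[\, x_k(t),\; x_k(t-1),\; \dots,\; x_k(t-L+1) \,\bigr]^{T}
          \ \text{stacked over all } k \neq m, \qquad V_m(t) \in \mathbb{C}^{(M-1)L},\\[2pt]
R_m &= \sum_{t \in \text{noise}} V_m(t)\, V_m(t)^{T}, \qquad
r_m = \sum_{t \in \text{noise}} V_m(t)\, x_m(t-D)^{*},\\[2pt]
w_m &= \arg\min_{w} \sum_{t \in \text{noise}} \bigl| x_m(t-D) - w^{T} V_m(t) \bigr|^{2}
     \;=\; R_m^{-1}\, r_m .
\end{aligned}
```

Because the criterion depends only on how the noise reaches the different microphones, and not on the statistics of the noise source itself, the same w_m stays valid when the noise level fluctuates, provided the transfer paths do not change.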

Hereinafter, every filter in the present invention is an FIR filter. Noise multi-channel spatial prediction obtains this filter for each microphone. The conventional single-channel spatial prediction method (see, for example, Non-Patent Document 1) approximates the signal of one microphone element (the prediction target) by the signal of one other microphone element (the prediction source). When echoes and reverberation make the amplitude responses of the prediction target and the prediction source differ greatly, the prediction accuracy of single-channel spatial prediction deteriorates. In the multi-channel spatial prediction of the present invention, by contrast, even if the amplitude response at one microphone element has a valley, the other microphones can cover that valley, so accurate prediction is possible. Noise multi-channel spatial prediction S303 outputs the multi-channel spatial prediction filters obtained for the noise. Target sound estimation S304 obtains, for each microphone, the noise-suppressed signal y_m(t) by equation (8). X_m(t) is defined by equation (9).
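The prediction-and-subtraction idea can be illustrated with a minimal NumPy sketch. It assumes real-valued signals, fits the filter by ordinary least squares on a noise-only recording, and then forms the delayed channel minus its prediction. The function names `stack_others`, `fit_prediction_filter`, and `suppress` are hypothetical, and this is not the patent's exact computation.

```python
import numpy as np

def stack_others(x, m, L):
    """Rows V_m(t): the L newest samples of every channel except m.

    x has shape (M, T). The result has shape (T - L + 1, (M - 1) * L),
    and row i corresponds to time t = i + L - 1.
    """
    others = [c for c in range(x.shape[0]) if c != m]
    rows = []
    for t in range(L - 1, x.shape[1]):
        # Newest sample first within each channel (ordering is an assumption).
        rows.append(np.concatenate([x[c, t - L + 1:t + 1][::-1] for c in others]))
    return np.asarray(rows)

def fit_prediction_filter(noise, m, L, D):
    """Least-squares filter that predicts noise[m, t - D] from the other channels."""
    V = stack_others(noise, m, L)
    t = np.arange(L - 1, noise.shape[1])
    valid = t - D >= 0
    w, *_ = np.linalg.lstsq(V[valid], noise[m, t[valid] - D], rcond=None)
    return w

def suppress(x, m, w, L, D):
    """Delayed channel m minus its prediction from the other channels."""
    V = stack_others(x, m, L)
    t = np.arange(L - 1, x.shape[1])
    valid = t - D >= 0
    return x[m, t[valid] - D] - V[valid] @ w
```

In use, `fit_prediction_filter` would be run once on the noise-only data and `suppress` on the mixed data. Any target sound that the filter does not predict remains in the output, but in a distorted form, which is why a correction stage follows.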

In this signal the noise has been suppressed and only components originating from the target sound remain, but the target sound has been distorted by the spatial prediction filter.

Target sound estimation S304 outputs, together with the noise-suppressed signals, the covariance matrix of the noise-suppressed signals given by equation (10) and the correlation matrix between the target microphone and the noise-suppressed signals given by equation (11). Here target denotes the microphone index of the target microphone, Y(t) is defined by equation (12), and L_2 is the filter length of the subsequent distortion correction processing.

Residual noise estimation S305 uses equation (13) to compute the output signal y_{v,m}(t) obtained when a noise-only signal is passed through the noise suppression based on multi-channel spatial prediction. It outputs the result of equation (14) as the covariance matrix of the resulting residual noise component and the noise. Y_v(t) is defined by equation (15).
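The criterion stated earlier for the second FIR filter (squared error against the target microphone signal, plus a constant times the residual noise power) can be written out as a hedged sketch. The weight μ, the delay D_2, and the closed form below are illustrative assumptions derived from that verbal description, not the patent's equations (10) to (16); T again denotes the conjugate transpose.

```latex
\begin{aligned}
J(H) &= \sum_{t \in \text{mixed}} \bigl| x_{\text{target}}(t-D_2) - H^{T} Y(t) \bigr|^{2}
      \;+\; \mu \sum_{t \in \text{noise}} \bigl| H^{T} Y_v(t) \bigr|^{2},\\[2pt]
H &= \Bigl( \sum_{t \in \text{mixed}} Y(t)\, Y(t)^{T}
      + \mu \sum_{t \in \text{noise}} Y_v(t)\, Y_v(t)^{T} \Bigr)^{-1}
     \sum_{t \in \text{mixed}} Y(t)\, x_{\text{target}}(t-D_2)^{*} .
\end{aligned}
```

Under this reading, a larger μ favors suppression of residual noise, while a smaller μ favors fidelity to the target microphone signal.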

Spatial/F-characteristic distortion correction S306 corrects the distortion of the target sound contained in y_m(t) by a two-stage correction process described later and outputs the corrected signal. The F characteristic refers to the amplitude and phase characteristics at each frequency, that is, the frequency response. Applying the two stages of correction yields a signal that is corrected both spatially and in its amplitude and phase characteristics.

FIG. 4 shows the processing flow of the filtering unit 203. The multi-channel spatial prediction unit 401 applies the spatial prediction filter w_m to the input signals of the microphones other than the m-th, in which target sound and noise are mixed. The delay processing unit 402 delays the m-th microphone input signal by D samples to satisfy causality. Subtracting the output of the multi-channel spatial prediction filter from the delayed microphone input signal gives the noise-suppressed signal. The multi-channel distortion correction unit 403 applies the multi-channel distortion correction filter H, defined by equation (16), to the resulting multi-channel noise-suppressed signals.

The signal s_distorted(t) after this distortion correction is a monaural signal. The one-channel distortion correction unit 404 applies the frequency distortion correction filter g to s_distorted(t) and obtains the distortion-corrected signal of equation (17).
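Read together, the two correction stages amount to a multi-channel filter followed by a single-channel FIR filter. A sketch of that structure is given below. The filter length L_3 and the output symbol are assumed names, and the exact forms of equations (16) and (17) are not reproduced here.

```latex
\begin{aligned}
s_{\text{distorted}}(t) &= H^{T}\, Y(t),\\[2pt]
\hat{s}(t) &= \sum_{k=0}^{L_3 - 1} g(k)\; s_{\text{distorted}}(t-k) .
\end{aligned}
```

The first line collapses the M noise-suppressed channels into one signal. The second line is an ordinary convolution that reshapes the amplitude and phase response of that signal.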

FIG. 5 shows the effect of the present invention. The top waveform in FIG. 5 is the target sound signal contained in the target microphone signal, extracted on its own. The goal is to obtain a noise-suppressed waveform close to it. The second waveform is the signal after noise has been mixed in; the noise makes it differ in shape from the original target sound signal. The third waveform is the result of applying a method based on single-channel spatial prediction (see, for example, Non-Patent Document 1). The noise component is reduced and the waveform is closer to the top one, but the distortion is large and the shape differs. The fourth waveform is the result of noise suppression by the processing of the present invention, and it is very close to the target sound. The present invention thus yields a noise-suppressed signal with little distortion.

The noise intervals used in noise capture S301 of FIG. 3 may be determined by having the user specify, by dragging on a waveform display tool, the time intervals in which only noise is present. Alternatively, the system may identify the noise intervals automatically from two inputs: the signals separated by time-frequency-domain sound source separation, based either on conventional independent component analysis or on the sparsity-based time-frequency assignment method described later, and the spatial position of the target sound specified by the user.

FIG. 6 shows a specific processing flow for the latter form. Mixed sound capture 601 outputs the signals obtained when a mixture of several sound sources is picked up by the microphone elements. In the case of sound source separation based on independent component analysis, time-frequency-domain sound source separation 602 clusters the sound source direction estimates obtained for each time-frequency point by time-frequency-domain sound source direction estimation (see, for example, Non-Patent Document 3) and restores the original signal of each sound source.

In target sound designation 603, the user selects the sound to be extracted from among the restored original signals. The user may make the selection while listening to each original signal played through the speaker. Alternatively, the sound source direction may be estimated for each restored original signal (see, for example, Non-Patent Document 3) and displayed on a screen, and the user may select the direction to extract from those displayed. Target sound designation 603 thus outputs information indicating which of the restored signals output by time-frequency-domain sound source separation 602 are the target sound that the user wants to extract, and then ends. There need not be exactly one target sound; there may be several.

Per-interval processing 604 cuts the restored signals into short intervals of a few seconds and starts a loop over them. After target sound designation 603, each restored signal can be classified as target sound or noise. All signals classified as target sound are summed, and likewise all signals classified as noise are summed. The time series of the power of the summed target sound and of the summed noise look like the top and second rows of FIG. 7. Let Ps(τ) be the power of the target sound and Pn(τ) the power of the noise in each short interval, where τ is the index of the short interval.

Mixing degree processing 605 computes, for each short interval, the ratio between Ps(τ) + Pn(τ) and Ps(τ) as an estimate of the power ratio of the target sound to the noise (the mixing degree). The mixing degree forms a time series such as the one in the third row of FIG. 7. To identify the short intervals with a low mixing degree, sorting 606 arranges the ratios in ascending order. Per-interval processing 607 moves the processing on to the next short interval. Noise interval estimation 608 takes a predetermined number N of short intervals with the lowest mixing degree, outputs them as the noise intervals, and ends.
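The interval selection can be sketched in a few lines of plain Python. The orientation of the ratio, Ps / (Ps + Pn), is an assumption chosen so that a low value means mostly noise. The function names and the small constant `eps` are hypothetical additions for illustration.

```python
def section_power(signal, section_len):
    """Mean power of each consecutive, non-overlapping short section."""
    n_sections = len(signal) // section_len
    return [
        sum(v * v for v in signal[i * section_len:(i + 1) * section_len]) / section_len
        for i in range(n_sections)
    ]

def pick_noise_sections(target, noise, section_len, n_pick, eps=1e-12):
    """Return the indices of the n_pick sections with the lowest mixing degree.

    Mixing degree is taken as Ps / (Ps + Pn); this orientation is an assumption.
    eps only guards against division by zero in silent sections.
    """
    ps = section_power(target, section_len)
    pn = section_power(noise, section_len)
    mixing = [s / (s + n + eps) for s, n in zip(ps, pn)]
    order = sorted(range(len(mixing)), key=lambda i: mixing[i])  # ascending
    return sorted(order[:n_pick]), mixing
```

Here `target` and `noise` stand for the summed restored signals classified as target sound and as noise, and the returned indices mark the sections that would be fed to the filter adaptation as noise-only data.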

FIG. 8 shows an example of time-frequency-domain sound source separation in which the sources are separated using a histogram of the sound source directions computed for each time-frequency point. Per-time-frequency processing 801 first processes the multi-element microphone input signals at short time steps (the frame shift): the start of the waveform to be processed is advanced by one frame shift each time. The frame shift is set in advance to a duration on the order of several tens of milliseconds. The duration from the start to the end of the waveform processed at each step is called the frame size and is set longer than the frame shift. For each microphone element, DC component removal, Hanning windowing, and a short-time Fourier transform are applied to one frame size of data to obtain a time-frequency-domain signal. The unit of this short-time processing is called a frame, and the frame index is written τ. The signal at the f-th frequency of frame τ obtained from microphone element m is written x_m(f,τ), and X(f,τ) = [x_1(f,τ) ... x_m(f,τ) ... x_M(f,τ)]^T. Per-time-frequency processing 801 then starts a loop that processes each frequency f and frame τ.
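The framing steps above (frame shift, frame size, DC removal, Hanning window, short-time Fourier transform) can be sketched for one channel with NumPy. Removing the frame mean as the DC cut and keeping only the non-negative frequencies are assumptions, and `stft_frames` is a hypothetical name.

```python
import numpy as np

def stft_frames(x, frame_size, frame_shift):
    """Short-time Fourier transform of one microphone channel.

    Returns a complex array of shape (n_frames, frame_size // 2 + 1),
    indexed as [tau, f].
    """
    x = np.asarray(x, dtype=float)
    window = np.hanning(frame_size)
    n_frames = 1 + (len(x) - frame_size) // frame_shift
    out = np.empty((n_frames, frame_size // 2 + 1), dtype=complex)
    for tau in range(n_frames):
        start = tau * frame_shift
        frame = x[start:start + frame_size]
        frame = frame - frame.mean()      # DC cut (mean removal is an assumption)
        out[tau] = np.fft.rfft(frame * window)
    return out
```

Running this on each of the M channels and stacking the results along a new axis gives the vectors X(f,τ) used in the following steps.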

In the phase difference analysis 802, the sound source direction at frequency f and frame τ is estimated by GCC-PHAT or the SPIRE method (see, for example, Non-Patent Document 3). In the histogram generation 803, a histogram of the estimated sound source directions is built: for each frequency f and frame τ, one vote is added to the histogram bin corresponding to the direction estimated at that point. The per-time-frequency processing 804 moves on to the next frequency or the next frame. The histogram peak search 805 searches the resulting direction histogram for peaks. A bin whose value is larger than those of its neighboring bins is detected as a peak, and a predetermined number of peaks are extracted and output in descending order of vote count. The number of peaks P is at most the number of microphones. In the steering vector generation 806, the sound source direction at each frequency f and frame τ is compared with each peak found by the histogram peak search 805, and the peak with the smallest direction difference is selected. Among the time-frequency points for which the selected peak is number p, the set of input vectors X(f,τ) at frequency f is denoted Γ_p(f). A steering vector a_p(f), one per peak and frequency, is obtained by Equation (18). Its magnitude is then normalized to 1, and the normalized steering vector is written â_p(f). The matrix A(f) built from these steering vectors is given by Equation (19). In the inverse filtering 807, the filter defined by the generalized inverse of A(f) (Equation (20)) is applied to the microphone input signal at each time-frequency point. The resulting vector has the separated signal of each source at that time-frequency point as its elements.
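The voting, peak search, and nearest-peak assignment of steps 803, 805, and 806 might be sketched as follows. This is a minimal sketch under stated assumptions: directions are treated as azimuth angles in degrees on a circular histogram, the bin width of 10 degrees is arbitrary, and all function names are hypothetical. The steering vector computation of Equation (18) is not reproduced here.

```python
def direction_histogram(directions_deg, bin_width_deg=10):
    """Add one vote per time-frequency point (assumes 360 % bin_width_deg == 0)."""
    n_bins = 360 // bin_width_deg
    hist = [0] * n_bins
    for d in directions_deg:
        index = min(int(d % 360) // bin_width_deg, n_bins - 1)
        hist[index] += 1
    return hist


def peak_search(hist, max_peaks):
    """Bins larger than both neighbours, ordered by descending vote count."""
    n = len(hist)
    peaks = [i for i in range(n)
             if hist[i] > hist[(i - 1) % n] and hist[i] > hist[(i + 1) % n]]
    peaks.sort(key=lambda i: hist[i], reverse=True)
    return peaks[:max_peaks]


def nearest_peak(direction_deg, peak_dirs_deg):
    """Index p of the peak with the smallest angular difference."""
    def angular_diff(a, b):
        d = abs(a - b) % 360
        return min(d, 360 - d)
    return min(range(len(peak_dirs_deg)),
               key=lambda p: angular_diff(direction_deg, peak_dirs_deg[p]))
```

Passing the number of microphones as `max_peaks` keeps P within the bound stated in the text.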

In the time-domain waveform generation 808, all time-frequency components are gathered for each sound source, an inverse short-time Fourier transform and overlap-add processing are applied, and the resulting time-domain waveform of each sound source is output.
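The overlap-add step can be illustrated with a short Python sketch. It assumes that each frame has already been converted back to the time domain by an inverse short-time Fourier transform; the function name `overlap_add` is hypothetical.

```python
def overlap_add(frames, frame_shift):
    """Sum time-domain frames, each offset by one frame shift from the previous."""
    if not frames:
        return []
    frame_size = len(frames[0])
    out = [0.0] * (frame_shift * (len(frames) - 1) + frame_size)
    for tau, frame in enumerate(frames):
        start = tau * frame_shift
        for n, sample in enumerate(frame):
            out[start + n] += sample
    return out
```

Samples covered by more than one frame are summed, which is why the frame size is chosen longer than the frame shift.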

FIG. 9 shows a configuration for performing dereverberation in real time in addition to noise removal. The waveform capture unit 901 through the filter data 904 perform the same processing as the waveform capture unit 201 through the filter data 204 in FIG. 2, respectively. In the configuration of FIG. 2, the target microphone was one specific microphone among the M microphones, whereas in FIG. 9 the noise-suppressed waveforms of all microphones are extracted. That is, noise suppression is carried out with the target microphone varied from 1 to M, and the waveform after noise suppression is extracted for each.

The target sound section extraction unit 905 calculates a power time series for the noise-suppressed M-channel signal output by the filtering unit 903, and extracts speech sections using power-based voice activity detection (VAD). It then selects speech sections in descending order of power until either a predetermined number of sections or a predetermined total duration is reached, and outputs the selected sections as target sound sections. Selecting high-power speech sections in this way makes it possible to learn the spatial transfer characteristics with high accuracy.
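The selection rule of the target sound section extraction unit 905 might look like the following sketch. The tuple layout `(start_sec, end_sec, mean_power)` and the function name are assumptions for illustration; the VAD that produces the sections is not shown.

```python
def select_target_sections(sections, max_count=None, max_total_sec=None):
    """Pick speech sections in descending order of power.

    sections: list of (start_sec, end_sec, mean_power) tuples from a VAD.
    Selection stops once max_count sections are chosen or once the total
    selected duration has reached max_total_sec.
    """
    chosen, total = [], 0.0
    for start, end, power in sorted(sections, key=lambda s: s[2], reverse=True):
        if max_count is not None and len(chosen) >= max_count:
            break
        if max_total_sec is not None and total >= max_total_sec:
            break
        chosen.append((start, end, power))
        total += end - start
    return sorted(chosen)  # back in chronological order
```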

The target sound transfer characteristic learning unit 906 learns, from the target sound section waveforms extracted by the target sound section extraction unit 905, the various statistics used in multi-channel dereverberation based on second-order statistics. After learning, it calculates a dereverberation filter and writes it out as the dereverberation filter 907. The processing up to this point is batch processing; from here on, noise suppression and dereverberation are applied to waveforms captured in real time.

The real-time waveform capture unit 908 outputs multi-channel sound data each time the minimum amount of data needed for filtering becomes available. The output data is sent to the filtering unit 903, where noise is suppressed, and then to the dereverberation unit 909.

The dereverberation unit 909 reads the dereverberation filter 907 adapted in the batch processing and performs dereverberation. The dereverberated data is sent to the real-time waveform reproduction unit 910, DA-converted, and output from the speaker.

In general, adapting a dereverberation filter requires a long span of observation data, so it is desirable to use a filter adapted in batch. To handle the case where there are multiple target sounds, the following configuration may also be used: the target sound section extraction unit 905 estimates the sound source direction for each section obtained and clusters the sections by the direction estimates; for each cluster, target sound signals of a predetermined total duration are extracted on the basis of power; the target sound transfer characteristic learning unit 906 derives a dereverberation filter for each direction from the extracted sections; and the sound source direction is estimated ahead of the dereverberation unit 909 so that dereverberation uses the filter whose direction is closest to the estimated one.

FIG. 10 shows a configuration example of the spatial distortion correction within the spatial / frequency-characteristic distortion correction S306 of FIG. 3. The spatial distortion correction filter H is defined by Equation (21) and calculated by Equation (22).

The residual noise estimation unit 1001 calculates y_{v,m}(t) defined by Equation (13). The target sound estimation unit 1002 calculates y_m(t) defined by Equation (8). The delay processing unit 1007 applies a delay D to the input signal of the target microphone so that causality is satisfied, and outputs the delayed signal. The target sound covariance estimation unit 1005 calculates R_{cov(noiseless)}. The residual noise covariance estimation unit 1003 calculates R_{cov(noise,noiseless)}. The μ multiplication unit 1004 multiplies every element of R_{cov(noise,noiseless)} by μ. The inverse matrix operation unit 1006 calculates invR, the inverse of R_{cov(noiseless)}+μR_{cov(noise,noiseless)}. The target sound correlation matrix estimation unit 1008 calculates the correlation vector R_{cor(noiseless)} defined by Equation (11), and the matrix multiplication unit 1009 calculates the product R_{cor(noiseless)}invR. This product is output as the distortion correction filter H.
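Read as a data flow, the blocks 1003 through 1009 compose as shown below. This is only a restatement of the block descriptions in this paragraph; Equations (21) and (22) themselves are not reproduced in this text, so their exact correspondence to this form is an assumption.

```latex
\[
\mathrm{invR} = \left( R_{\mathrm{cov(noiseless)}} + \mu\, R_{\mathrm{cov(noise,noiseless)}} \right)^{-1},
\qquad
H = R_{\mathrm{cor(noiseless)}}\;\mathrm{invR}
\]
```

With μ = 0 the filter depends only on the target sound statistics; a larger μ gives more weight to the residual noise covariance in the matrix that is inverted.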

FIG. 11 shows one configuration of the frequency-characteristic distortion correction. The multi-channel distortion correction unit 1101 calculates the signal after multi-channel distortion correction defined by Equation (16). The delay processing unit 1102 delays the input signal of the target microphone by a delay D that satisfies causality and outputs the delayed signal. The noise covariance estimation unit 1103 calculates R_cov(noise) defined by Equation (24), where V(t) is defined by Equation (23).

The μ multiplication unit 1104 multiplies every element of R_cov(noise) by a predetermined coefficient μ. The target sound covariance estimation unit 1105 calculates the matrix R_cov(input) defined by Equation (26), where X(t) is defined by Equation (25).

The noise correlation estimation unit 1107 calculates the correlation matrix R_cor(noise) defined by Equation (27).

The inverse matrix operation unit 1106 calculates invR2, the inverse of R_cov(input)+μR_cov(noise). The matrix multiplication unit 1108 calculates the product R_cor(noise)invR2 and takes it as the noise estimation filter R. The noise estimation unit 1109 estimates the noise component n(t) from the multi-channel signal X(t), in which target sound and noise are mixed, as n(t)=RX(t). Here n(t) is a single-channel noise signal.
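The chain formed by units 1106, 1108, and 1109 can be written compactly as follows. This restates only what the paragraph above describes; the definitions of the individual matrices in Equations (24) through (27) are not reproduced in this text.

```latex
\[
\mathrm{invR2} = \left( R_{\mathrm{cov(input)}} + \mu\, R_{\mathrm{cov(noise)}} \right)^{-1},
\qquad
R = R_{\mathrm{cor(noise)}}\;\mathrm{invR2},
\qquad
n(t) = R\,X(t)
\]
```

Because n(t) is a single-channel signal while X(t) is a multi-channel vector, R acts as a row vector applied to X(t).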

The least-squares filter estimation unit 1110 uses the least-squares method (Equation (29)) to find the g and q that minimize the squared error between the input signal estimate x̂_target(t-D) given by Equation (28) and x_target(t-D). In the equations, "*" denotes the convolution operator. The resulting distortion correction filter g is output, and the processing ends.

FIG. 12 shows the hardware configuration of the second embodiment of the present invention. Sound captured by the microphone array 1201 is converted from analog sound pressure to digital sound pressure data by the AD converter 1203. After the converted data is processed on the computer 1204, it is transmitted via the HUB 1205 to a computer on the server. Image data captured by the camera 1202 is transmitted together with the sound data. The server receives the transmitted data via the HUB 1206, and the received data undergoes signal processing on the server-side computer 1207. The processed sound data is recorded in the large-scale storage 1211.

In response to a request from a user browsing the conference data, the server transmits the data to that user. The data is sent via the HUB 1211 on the browsing user's side to the computer 1208 owned by the browsing user. The data is processed on the computer 1208 and played back from the speaker 1209, and part of the acoustic information is displayed on the display device 1210.

FIG. 13 shows the layout of the screen displayed on the browsing user's display device 1210. The screen 1301 of the display device 1210 consists of four sub-screens. The camera image display section 1301-1 shows the video shot by the camera 1202 during the conference. The sound source position display section 1301-2 shows the sound source positions estimated from the sound captured by the microphone array during the conference. The sound source positions may be obtained by a peak search on a direction histogram built from all the audio of the conference, or, in synchronization with the camera image, by a peak search on a direction histogram generated from the audio waveform around the current video time. The screen of 1301-2 is treated as a scaled plan view of the conference room, and the planar position of each sound source is displayed on it. The color or shape of the marker may be varied for each sound source position.

The utterance timing display section 1301-3 marks utterance locations with shading that varies according to utterance volume. The utterance locations of each sound source may be marked with the color or shape used for that source in the sound source position display section 1301-2. The thumbnail image display section 1301-4 shows, for each utterance location, one camera image from the time span covered by that utterance. When there are multiple cameras, the image from the camera facing the sound source direction of the utterance may be shown. The following interactions may also be provided: when the user clicks a point in the camera image display section 1301-1 with the computer's mouse, the sound at the clicked position is played; when the user clicks a sound source position in the sound source position display section 1301-2, the playback locations of that source are shown in the utterance timing display section 1301-3; and when the user clicks an utterance location in the utterance timing display section 1301-3, the clicked location is played.

FIG. 14 shows the software configuration of the second embodiment of the present invention. The multi-channel sound information captured by the sound capture unit 1401 and the image data captured by the image capture unit 1403 are passed to the data transmission unit 1404 and sent to the server. Information 1402 on the placement of each microphone element of the microphone array and on the placement and orientation of the camera at the conference site is transmitted together with the sound information and image data. On the server, the data reception unit 1405 receives the sound information, the image data, and the microphone and camera placement and orientation data, and stores them in the per-site data 1413. The per-site data 1413 is a data area on the large-scale storage.

At the browsing site, the user I/F processing unit 1412 recognizes the user's click or drag position and converts it into the sound source position to be played. The audio waveform for that sound source position stored in the per-site data 1410 is then played. If the per-site data 1410 contains no audio waveform for that sound source position, the conference data request unit 1406 may send the server a request to transmit it. The request sent to the server is received by the data reception unit 1407, and a command to extract the audio waveform for the playback sound source position contained in the request is sent to the acoustic information generation unit 1409.

The acoustic information generation unit 1409 separates and extracts the audio waveform at the playback sound source position according to the first embodiment of the present invention, using the multi-channel audio waveform stored in the per-site data 1413 and the information on the spatial arrangement of the microphone array that recorded it. The data transmission unit 1408 transmits the extracted audio waveform to the browsing site. Camera images and the sound source direction at each time may also be sent. The image display unit 1415 shows the camera image in the camera image display section of the display device; the displayed image may be changed to match the sound source waveform being played. The audio playback unit 1411 plays the specified portion of the waveform for the sound source position selected by the user and outputs it from the speaker.

FIG. 15 shows the processing flow for user click and drag operations, involving the user I/F processing unit, the audio playback unit, and the image display unit. In "select listening direction" 1501, the direction the user wants to listen to is identified from the user's click or drag position. In "sound source exists?" 1502, it is determined whether a sound source exists in the identified direction; if not, a message to that effect is presented (1507) and the processing ends. If a sound source exists, "noise section identification" 1503 extracts noise sections by the noise section extraction processing of FIG. 6 described in the first embodiment. In "target sound extraction" 1504, the noise-suppressed target sound is extracted from the noise section information by the noise suppression method of FIG. 3 described in the first embodiment. In "select playback section" 1505, the utterance sections of the noise-suppressed target sound are shown in the utterance timing display section, and the user selects the section to listen to. In "play sound and image" 1506, the audio playback unit plays the audio of the selected utterance section while the corresponding camera image is shown, synchronized with the playback, in the camera image display section 1301-1 of the display device 1210. When playback finishes, the processing ends.

FIG. 16 shows the abnormal sound detection block of the monitoring system according to the third embodiment of the present invention. Target abnormal sounds include, for example, the operating sound of a malfunctioning machine in a factory, or the sound of glass breaking in an office or home. The hardware configuration is the same as that of the second embodiment shown in FIG. 12, and the software block configuration is the same as that shown in FIG. 14. The sound source information generation unit 1601 corresponds to the acoustic information generation unit in FIG. 14.

The abnormal sound database 1603 is assumed to store acoustic features of abnormal sounds, such as amplitude spectra and cepstra, together with state transition information describing the transition patterns of those acoustic features in hidden Markov model (HMM) form. The pattern matching unit 1602 matches the extracted sound source waveform against the abnormal sound information in the database. A short-time Fourier transform is applied to the sound source waveform, acoustic features such as the amplitude spectrum and cepstrum are extracted, and the distance between the extracted features and the transition patterns of abnormal sound features, or the HMM-described spectral patterns of abnormal sounds, in the database is calculated. From the distance calculation, the likelihood that an abnormal sound is present is computed. For spectral patterns described by an HMM, the distance can be computed quickly with, for example, the Viterbi algorithm.
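For reference, a textbook Viterbi score for a small HMM might be computed as in the sketch below. It assumes discrete observation symbols (for example, vector-quantized acoustic features), which the specification does not state; the function names and the table layout are hypothetical, and the sketch is not the pattern matching unit 1602 itself.

```python
import math


def to_log(probs):
    """Convert a row of probabilities to log-probabilities (0 maps to -inf)."""
    return [math.log(p) if p > 0 else float("-inf") for p in probs]


def viterbi_log_score(obs, log_init, log_trans, log_emit):
    """Log-likelihood of the best state path for a discrete observation sequence.

    log_init[s], log_trans[r][s], and log_emit[s][o] are log-probabilities.
    """
    if not obs:
        raise ValueError("observation sequence must not be empty")
    n_states = len(log_init)
    score = [log_init[s] + log_emit[s][obs[0]] for s in range(n_states)]
    for o in obs[1:]:
        score = [max(score[r] + log_trans[r][s] for r in range(n_states))
                 + log_emit[s][o]
                 for s in range(n_states)]
    return max(score)
```

A higher score for an abnormal sound model than for a background model would suggest that the abnormal sound is present; how the score is thresholded is left open here.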

The abnormal sound determination unit 1604 determines, for each short-time section, whether an abnormal sound is present on the basis of the computed likelihood. If an abnormal sound is found, the alert transmission unit 1605 transmits warning information. The warning takes the form of sounding a predetermined warning tone from the speaker at the browsing site and displaying on the screen the location and time span in which the abnormal sound occurred.

FIG. 17 shows the specific processing flow of abnormal sound detection. The mixed sound capture 1701 captures multi-channel sound data in which various sounds are mixed. The time-frequency-domain sound source separation 1702 generates a signal for each sound source. Because time-frequency-domain separation cannot completely separate the per-source signals, further processing is added to improve the separation accuracy. The per-source processing 1703 starts a loop over the separated sources. The per-section processing 1704 starts a loop over the short-time sections of the source signal being processed. The mixing degree processing 1705 uses the power Ps(t) of the source waveform being processed and the summed power Pn(t) of all other sources to compute the mixing degree Ps(t)/(Pn(t)+Ps(t)) for each section t. The sorting 1706 arranges the computed mixing degrees in ascending order. The per-section processing 1707 moves on to the next section. The noise section extraction 1708 takes sections in ascending order of mixing degree until their total duration reaches a predetermined length, and outputs them as noise sections. The noise removal 1709 extracts a signal containing only the target sound, with noise removed, by the processing flow of FIG. 3 of the first embodiment of the present invention. The abnormal sound detection 1710 performs pattern matching against the abnormal sound information; if an abnormal sound is detected, processing passes to the alert transmission 1711, an alert is sent to the browsing site, and processing moves on to the next source. If no abnormal sound is detected, processing moves on to the next source without further action.
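Steps 1705 through 1708 might be sketched in Python as follows. The sketch assumes that all short-time sections have the same length `section_sec`; the function names and the handling of a section with zero total power are illustrative choices, not part of the specification.

```python
def mixing_degree(ps, pn):
    """Ps(t) / (Pn(t) + Ps(t)) for one short-time section."""
    total = ps + pn
    return ps / total if total > 0 else 0.0


def extract_noise_sections(target_power, other_power, section_sec, noise_sec):
    """Indices of the sections with the smallest mixing degree.

    Sections are taken in ascending order of mixing degree until their
    total duration reaches noise_sec.
    """
    degrees = [mixing_degree(ps, pn)
               for ps, pn in zip(target_power, other_power)]
    order = sorted(range(len(degrees)), key=lambda t: degrees[t])
    chosen, total = [], 0.0
    for t in order:
        if total >= noise_sec:
            break
        chosen.append(t)
        total += section_sec
    return sorted(chosen)
```

A small mixing degree means the source being processed contributes little power in that section, so the section is dominated by the other sources and serves as a noise section for that source.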

FIG. 18 shows the processing flow of speech-rate conversion according to the present invention, used for high-speed playback of the audio for a sound source position specified by the user. This flow is executed by the audio playback unit 1411 in FIG. 14. Its purpose is to play the audio of the user-specified source slowly, at an easily intelligible rate, and to play the audio of other speakers quickly, so that only the sound of interest is easy to hear. Because other sounds are played at high speed, they can be skimmed through without spending much time.

In the target sound / noise extraction 1801, sections containing the target sound and noise-only sections are extracted according to the first embodiment of the present invention. The per-section processing 1802 divides the extracted audio into short-time sections and starts a loop over them. The SNR-based speech detection 1803 computes SNR=Ps(t)/Pn(t) from the short-time power Ps(t) of the target sound and the short-time power Pn(t) of the noise. In the speech determination 1804, if the SNR is at or above a predetermined threshold, speech is judged to be present in that section, and the playback rate of the section is set to the predetermined rate for target sound sections (1806). If the SNR is below the threshold, the section is judged to be a noise section, and "set noise-section rate" 1805 sets its playback rate to the predetermined rate for noise sections. The rate for noise sections is set in advance to be faster than the rate for target sound sections. After the rate is set, the per-section processing 1807 moves on to the next section. In "play at the set rates" 1808, speech-rate conversion is applied according to the rates set, the converted audio is played from the speaker, and the processing ends.
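The per-section rate assignment of steps 1803 through 1806 might be sketched as follows. The default rates (1.0 for target sound sections, 3.0 for noise sections), the function names, and the treatment of sections with zero noise power are assumptions for illustration; the speech-rate conversion itself is not shown.

```python
def assign_playback_rates(ps, pn, snr_threshold, target_rate=1.0, noise_rate=3.0):
    """Return one playback rate per short-time section."""
    if noise_rate <= target_rate:
        raise ValueError("noise sections must play faster than target sections")
    rates = []
    for s, n in zip(ps, pn):
        if n > 0:
            snr = s / n
        else:
            snr = float("inf") if s > 0 else 0.0
        rates.append(target_rate if snr >= snr_threshold else noise_rate)
    return rates


def playback_duration(rates, section_sec):
    """Total playback time when each section is played at its assigned rate."""
    return sum(section_sec / r for r in rates)
```

For example, three 3-second sections with rates 1.0, 3.0, and 1.0 play back in 7 seconds instead of 9.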

FIG. 19 is a flowchart of processing that extracts and plays only the information from the sound source direction selected by the user. Steps 1901 through 1904 are the same as the corresponding steps in FIG. 18. In this flow, "delete section" 1905 removes from the playback range any section not judged to be a target sound section, and "keep section" 1906 retains in the playback range any section judged to be a target sound section. The per-section processing 1907 moves on to the next section. In "play the set playback sections" 1908, the retained sections are played from the speaker, and the processing ends.

FIG. 1 is a hardware configuration diagram of the noise suppression device of the present invention.
FIG. 2 is a software block diagram of the noise suppression device of the present invention.
FIG. 3 is a processing flow diagram of the noise suppression device of the present invention.
FIG. 4 is a detailed block diagram of the filtering unit of the noise suppression device of the present invention.
FIG. 5 shows the effect of the noise suppression method of the present invention.
FIG. 6 is a processing flow diagram of the blind noise suppression device of the present invention.
FIG. 7 shows an example of mixing degree processing in the blind noise suppression device of the present invention.
FIG. 8 shows a configuration example of time-frequency-domain sound source separation in the blind noise suppression device of the present invention.
FIG. 9 is a block diagram of a signal processing device that performs noise suppression and dereverberation simultaneously.
FIG. 10 is a detailed block diagram of the multi-channel distortion correction processing of the present invention.
FIG. 11 is a detailed block diagram of the single-channel distortion correction processing of the present invention.
FIG. 12 is a hardware configuration diagram for applying the present invention to a conference support system or an audio monitoring system.
FIG. 13 shows a screen display example of the conference support system.
FIG. 14 shows the software block configuration of the conference support system.
FIG. 15 is a flowchart of the user interface and internal processing of the conference support system.
FIG. 16 is a block diagram of an abnormal sound detection device in which the present invention is applied to an audio monitoring system.
FIG. 17 is a processing flow diagram of the audio monitoring system.
FIG. 18 is a processing flow diagram in which speech-rate conversion is used in the playback processing of the present invention.
FIG. 19 is a processing flow diagram in which silence deletion is used in the playback processing of the present invention.
FIG. 20 shows the correspondence between the software blocks and the hardware of the noise suppression device of the present invention.

Explanation of symbols

101: microphone array; 102: AD converter; 103: central processing unit; 104: volatile memory; 105: non-volatile memory; 106: DA converter; 107: speaker
201: waveform capture unit; 202: filter adaptation processing unit; 203: filtering unit; 204: filter data; 205: waveform reproduction unit
401: multi-channel spatial prediction unit; 402: delay processing unit; 403: multi-channel distortion correction unit; 404: one-channel distortion correction unit
901: waveform capture unit; 902: filter adaptation processing unit; 903: filtering unit; 904: filter data; 905: target sound section extraction unit; 906: target sound transfer characteristic learning unit; 907: dereverberation filter; 908: real-time waveform capture unit; 909: dereverberation unit; 910: real-time waveform reproduction unit
1001: residual noise estimation unit; 1002: target sound estimation unit; 1003: residual noise covariance estimation unit; 1004: μ multiplication unit; 1005: target sound covariance estimation unit; 1006: inverse matrix calculation unit; 1007: delay processing unit; 1008: target sound correlation matrix estimation unit; 1009: matrix multiplication unit
1101: multi-channel distortion correction unit; 1102: delay processing unit; 1103: noise covariance estimation unit; 1104: μ multiplication unit; 1105: target sound covariance estimation unit; 1106: inverse matrix calculation unit; 1107: noise correlation estimation unit; 1108: matrix multiplication unit; 1109: noise estimation unit; 1110: least-squares filter estimation unit
1201: microphone array; 1202: camera; 1203: AD converter; 1204: computer; 1205: HUB; 1206: HUB2; 1207: computer; 1208: computer; 1209: speaker; 1210: display device
1301: screen; 1301-1: camera image display unit; 1301-2: sound source position display unit; 1301-3: utterance timing display unit; 1301-4: thumbnail image display unit
1401: sound capture unit; 1403: image capture unit; 1404: data transmission unit; 1405: data reception unit; 1406: conference data request unit; 1407: data reception unit; 1408: data transmission unit; 1409: acoustic information generation unit; 1410: per-site data; 1411: audio reproduction unit; 1412: per-site data
1601: sound source extraction unit; 1602: pattern matching unit; 1603: abnormal sound database; 1604: abnormal sound determination unit; 1605: alert transmission unit

Claims

1. A sound source extraction device comprising:
a microphone array composed of a plurality of microphone elements;
an AD converter that converts an analog signal output from the microphone array into a digital signal;
a computing device; and
a storage device,
wherein the computing device performs digital signal processing that suppresses a noise component in the digital signal converted by the AD converter to extract a noise suppression signal, then corrects the distortion of the target sound contained in the noise suppression signal, and reproduces the corrected signal or stores it in the storage device;
the computing device comprises a multi-channel spatial prediction unit that approximates the noise signal contained in one of the plurality of microphone elements by the sum of the noise signals contained in all of the other elements, each convolved with a first FIR filter, and that determines the coefficients of the first FIR filter so that the sum of the squared approximation errors is minimized; and
the multi-channel spatial prediction unit generates the noise suppression signal by subtracting, from the signal of any one of the plurality of microphone elements, the sum of the signals of the elements other than that element, each convolved with the first FIR filter.

2. The sound source extraction device according to claim 1, wherein the noise suppression signal is generated individually for the output of every microphone element of the microphone array; the device comprises a multi-channel distortion correction unit that applies a second FIR filter to the plurality of generated noise suppression signals to obtain a one-channel distortion correction signal; and the second FIR filter of the multi-channel distortion correction unit is determined so as to minimize the sum of (a) the squared error between the distortion correction signal and the output signal of a specific microphone element in the microphone array, or a delayed version of that signal, and (b) a constant multiple of the square of the distortion correction signal obtained when the input signal consists of noise only.

3. The sound source extraction device according to claim 2, further comprising a noise signal estimation unit that estimates a noise signal, and a one-channel distortion correction unit that determines third FIR filters so as to minimize the squared error between (a) the sum of the estimated noise signal and the distortion correction signal, each convolved with its own third FIR filter, and (b) the output signal of the specific microphone element in the microphone array or a delayed version of that signal, and that outputs the signal obtained by convolving the distortion correction signal with its third FIR filter.

4. The sound source extraction device according to claim 3, wherein noise sections are identified based on a mixing degree calculated from the ratio between the target sound power and the noise power in each short-time section, the powers being calculated based on target sound position information specified by a user's operation of designating the target sound position.

5. The sound source extraction device according to claim 4, wherein the identified noise sections are reproduced at a higher speech rate than the other sections.

6. The sound source extraction device according to claim 4, wherein only the sound in sections other than the identified noise sections is reproduced.
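
By way of illustration only, the multi-channel spatial prediction recited in claim 1 can be sketched as an ordinary least-squares problem. The following is a minimal sketch and not the implementation of the invention. It assumes NumPy, time-domain processing, and a noise-only section available for fitting. The function names (`delayed_columns`, `fit_spatial_predictor`, `suppress`), the tap count, and the synthetic signals are all hypothetical choices made for this example.

```python
import numpy as np

def delayed_columns(x, taps):
    """Return a matrix whose k-th column is x delayed by k samples (zero-padded)."""
    n = len(x)
    cols = np.zeros((n, taps))
    for k in range(taps):
        cols[k:, k] = x[:n - k]
    return cols

def fit_spatial_predictor(noise, target, taps):
    """Fit FIR filters that predict the noise in channel `target` from the others.

    noise: array of shape (channels, samples) taken from a noise-only section
           (an assumption of this sketch).
    Returns coefficients of shape (channels - 1, taps) that minimize the sum of
    squared prediction errors.
    """
    others = [m for m in range(noise.shape[0]) if m != target]
    regressors = np.hstack([delayed_columns(noise[m], taps) for m in others])
    coef, *_ = np.linalg.lstsq(regressors, noise[target], rcond=None)
    return coef.reshape(len(others), taps)

def suppress(x, target, coef):
    """Subtract the FIR-filtered sum of the other channels from channel `target`."""
    n = x.shape[1]
    others = [m for m in range(x.shape[0]) if m != target]
    predicted = sum(np.convolve(x[m], coef[i])[:n] for i, m in enumerate(others))
    return x[target] - predicted

# Synthetic check: the noise in channel 0 is an exact FIR mixture of channels 1 and 2.
rng = np.random.default_rng(0)
n_samples = 2000
n1 = rng.standard_normal(n_samples)
n2 = rng.standard_normal(n_samples)
h1 = [0.5, -0.2, 0.1]   # hypothetical filter from channel 1 to channel 0
h2 = [0.3, 0.4, -0.1]   # hypothetical filter from channel 2 to channel 0
n0 = np.convolve(n1, h1)[:n_samples] + np.convolve(n2, h2)[:n_samples]
noise = np.vstack([n0, n1, n2])

coef = fit_spatial_predictor(noise, target=0, taps=3)
residual = suppress(noise, target=0, coef=coef)
print("residual noise power:", np.mean(residual ** 2))
```

In this sketch the residual is the noise suppression signal for channel 0. When a target sound is also present, the same subtraction alters the target component, which is the distortion that the correction units of claims 2 and 3 are directed to.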