JP4519900B2

JP4519900B2 - Objective sound extraction device, objective sound extraction program, objective sound extraction method

Info

Publication number: JP4519900B2
Application number: JP2007325036A
Authority: JP
Inventors: 孝之稗方; 孝司森田; 陽平池田; 敏章下田
Original assignee: Kobe Steel Ltd
Current assignee: Kobe Steel Ltd
Priority date: 2007-04-26
Filing date: 2007-12-17
Publication date: 2010-08-04
Anticipated expiration: 2027-12-17
Also published as: JP2008295010A

Abstract

PROBLEM TO BE SOLVED: To extract a target sound with high extraction performance (noise removal performance) using a small apparatus, in acoustic environments where the target sound and other noises (non-target sounds) are mixed in the acoustic signals obtained via a plurality of microphones and where the mixing conditions can vary.
SOLUTION: The target sound extraction apparatus includes: sound source separation sections 10, each of which separates and generates a target sound separation signal corresponding to the target sound and a reference sound separation signal, for one combination of a main acoustic signal, obtained through a main microphone that mainly receives the target sound, with one of a plurality of sub acoustic signals; a target sound separation signal synthesis section 20, which synthesizes the plurality of target sound separation signals; and a spectrum subtraction processing section 31, which extracts an acoustic signal corresponding to the target sound from the synthesized signal by performing spectrum subtraction between the synthesized signal and the reference sound separation signals.
COPYRIGHT: (C)2009, JPO&INPIT

Description

The present invention relates to a target sound extraction device that extracts and outputs an acoustic signal corresponding to a target sound from a predetermined target sound source, based on acoustic signals obtained through microphones, and to a program and a method therefor.

In devices that take in sound emitted by a sound source such as a talker (for example, teleconference systems, video conference systems, ticket vending machines, and car navigation systems), a microphone picks up the sound (hereinafter, the target sound) emitted from a specific sound source (hereinafter, the target sound source). Depending on the environment in which the sound source is located, however, the acoustic signal obtained through the microphone also contains noise components other than the component corresponding to the target sound. When the proportion of noise components in that signal is large, the clarity of the target sound is impaired, causing problems such as degraded call quality and a lower automatic speech recognition rate.
Conventionally, as shown for example in Non-Patent Document 1, a two-input spectral subtraction process is known. It uses a main microphone (speech microphone), which mainly receives the speech uttered by a talker (an example of the target sound), and a sub-microphone (noise microphone), which mainly receives the noise around the talker and picks up almost none of the talker's speech. The process removes, from the acoustic signal obtained through the main microphone, a noise signal based on the acoustic signal obtained through the sub-microphone. Specifically, the two-input spectral subtraction process extracts the acoustic signal corresponding to the speech uttered by the talker (the target sound), that is, removes the noise components, by a subtraction between the time-series feature vectors of the input signal from the main microphone and those of the input signal from the sub-microphone.
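The two-input spectral subtraction described above can be illustrated with a minimal sketch. It assumes that the "feature vectors" are per-frame magnitude spectra, which the text does not specify; the function name `spectral_subtract`, the over-subtraction factor `alpha`, and the `floor` parameter are hypothetical and are not taken from Non-Patent Document 1.

```python
def spectral_subtract(main_mag, noise_mag, alpha=1.0, floor=0.0):
    """Subtract a scaled noise magnitude spectrum from the main one, bin by bin.

    main_mag, noise_mag: magnitude spectra of one frame (same length).
    alpha: hypothetical over-subtraction factor (an assumption, not from the source).
    floor: lower bound applied to every output bin so that it never goes negative.
    """
    if len(main_mag) != len(noise_mag):
        raise ValueError("spectra must have the same number of bins")
    return [max(m - alpha * n, floor) for m, n in zip(main_mag, noise_mag)]


# One frame with four frequency bins: speech dominates bins 0-1, noise bins 2-3.
main_frame = [5.0, 4.0, 3.0, 2.0]    # from the main (speech) microphone
noise_frame = [1.0, 0.5, 3.5, 2.0]   # from the sub (noise) microphone
cleaned = spectral_subtract(main_frame, noise_frame)
```

Bins in which the noise estimate exceeds the main spectrum are clamped to the floor, which is also why a target sound that leaks strongly into the noise microphone would be removed along with the noise.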

The sub-microphone is either a microphone placed at a position different from that of the main microphone or a microphone whose directivity points in a direction different from that of the main microphone, so that as little of the target sound as possible enters it. Consequently, when different noises arrive at the microphones from multiple directions, the noise mainly picked up by the sub-microphone can differ from the noise that mainly enters the main microphone. When that happens, the noise removal performance of the two-input spectral subtraction process deteriorates.
To address this, Patent Document 1 discloses a noise removal device that uses a plurality of sub-microphones (noise microphones). From the acoustic signals input through them, it either selects one according to the situation or forms a composite signal by weighted averaging with predetermined weights, and it then performs the two-input spectral subtraction process based on that signal and the acoustic signal input through the main microphone. This is said to enable effective noise removal even in acoustic spaces with non-stationary noise whose properties change in time and space.
Patent Document 2 discloses a technique for a camera-integrated VTR device that obtains correlation coefficients of a plurality of audio signals picked up from multiple directions within the shooting range and, based on those coefficients, emphasizes the audio signal from a person located in the direction of the center of the shooting range.
Patent Documents 3 to 5 disclose a technique that obtains an extraction signal of the target sound by removing, from the acoustic signal (hereinafter, the main acoustic signal) obtained through a microphone that mainly receives the target sound (corresponding to the main microphone), a signal produced by passing through an adaptive filter the acoustic signal obtained through a microphone that mainly receives reference sounds other than the target sound (non-target sounds; corresponding to the sub-microphone). The adaptive filter is adjusted so that the power of the extraction signal is minimized.
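As a rough illustration of the adaptive-filter approach attributed to Patent Documents 3 to 5, the sketch below uses the standard LMS update, which adjusts the filter to reduce the power of the extraction signal. LMS is an assumption chosen for illustration; the text does not say which adaptation rule those documents use. The function name, `taps`, and `mu` are hypothetical.

```python
def lms_noise_canceller(main, ref, taps=4, mu=0.05):
    """Remove the adaptively filtered reference signal from the main signal.

    main: samples of the main acoustic signal (target plus noise).
    ref: samples of the reference (noise) signal.
    Returns the extraction signal and the final filter weights.
    """
    weights = [0.0] * taps
    history = [0.0] * taps           # most recent reference samples, newest first
    extracted = []
    for d, x in zip(main, ref):
        history = [x] + history[:-1]
        estimate = sum(w * h for w, h in zip(weights, history))
        error = d - estimate         # extraction signal for this sample
        # LMS step: move the weights in the direction that lowers error power.
        weights = [w + mu * error * h for w, h in zip(weights, history)]
        extracted.append(error)
    return extracted, weights


# Noise-only test case: the main signal is a scaled copy of the reference,
# so the extraction signal should decay toward zero as the filter adapts.
ref = [1.0 if n % 2 == 0 else -1.0 for n in range(200)]
main = [0.5 * x for x in ref]
residual, weights = lms_noise_canceller(main, ref)
```

Because the filter minimizes output power regardless of what the reference contains, any target sound that leaks into the reference is also cancelled, which is the weakness the following paragraphs point out.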

When a plurality of sound sources and a plurality of microphones (acoustic input means) are present in a given acoustic space, each microphone receives an acoustic signal (hereinafter, a mixed acoustic signal) in which the individual acoustic signals from the respective sound sources (hereinafter, sound source signals) are superimposed. A sound source separation method that identifies (separates) each sound source signal based only on the plurality of mixed acoustic signals input in this way is called blind source separation (hereinafter, the BSS method).
One BSS-type sound source separation process is based on independent component analysis (hereinafter, the ICA method). The ICA-based BSS method exploits the statistical independence of the sound source signals contained in the mixed acoustic signals input through the microphones to optimize a separation matrix (inverse mixing matrix), and identifies the sound source signals (separates the sound sources) by filtering the input mixed acoustic signals with the optimized separation matrix. The separation matrix is optimized by sequential calculation (learning calculation): based on the signals (separated signals) identified by filtering with the separation matrix set at a given point in time, the separation matrix to be used thereafter is calculated.
In ICA-based BSS sound source separation, the separated signals are output through as many output terminals (also called output channels) as there are input mixed acoustic signals (that is, microphones). ICA-based BSS sound source separation is described in detail in, for example, Non-Patent Documents 2 and 3.
Sound source separation by binary masking (an example of binaural signal processing) is also known. Binary masking compares, between mixed speech signals input through a plurality of directional microphones, the level (power) of each of a plurality of frequency components (frequency bins), and thereby removes from each mixed speech signal the signal components other than the speech signal from its dominant sound source. It is a sound source separation process that can be realized with a relatively low computational load. It is described in detail in, for example, Non-Patent Documents 4 and 5.
Patent Document 1: JP-A-6-67691
Patent Document 2: JP-A-2001-8285
Patent Document 3: JP-A-6-83372
Patent Document 4: JP-A-6-90493
Patent Document 5: JP-A-6-165286
Non-Patent Document 1: Sugamura et al., "Speech recognition in automobiles using a two-input noise removal method", IEICE Technical Report, SP-81, pp. 41-48, 1989.
Non-Patent Document 2: Hiroshi Saruwatari, "Fundamentals of blind source separation using array signal processing", IEICE Technical Report, vol. EA2001-7, pp. 49-56, April 2001.
Non-Patent Document 3: Tomoya Takatani et al., "High-fidelity blind source separation using ICA based on the SIMO model", IEICE Technical Report, vol. US2002-87, EA2002-108, January 2003.
Non-Patent Document 4: R. F. Lyon, "A computational model of binaural localization and separation", in Proc. ICASSP, 1983.
Non-Patent Document 5: M. Bodden, "Modeling human sound-source localization and the cocktail-party-effect", Acta Acoustica, vol. 1, pp. 43-55, 1993.
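The binary masking process described in this section can be sketched for two channels as follows. The sketch assumes that each channel is already available as a list of per-bin spectral values for one frame; the function name `binary_mask` and the rule that ties go to the first channel are illustrative assumptions, not details from Non-Patent Documents 4 and 5.

```python
def binary_mask(spec_a, spec_b):
    """Keep each frequency bin only in the channel where its level is higher.

    spec_a, spec_b: spectra of one frame from two directional microphones
    (real magnitudes or complex values; abs() gives the level either way).
    Returns the two masked spectra. Ties are assigned to channel A (assumed).
    """
    out_a, out_b = [], []
    for a, b in zip(spec_a, spec_b):
        if abs(a) >= abs(b):
            out_a.append(a)
            out_b.append(0.0)
        else:
            out_a.append(0.0)
            out_b.append(b)
    return out_a, out_b


# Three bins: channel A dominates bin 0, channel B dominates bin 1, bin 2 is a tie.
masked_a, masked_b = binary_mask([3.0, 1.0, 2.0], [1.0, 4.0, 2.0])
```

Only one comparison per bin is needed, which is consistent with the low computational load noted above.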

However, with the technique of Non-Patent Document 1 and the techniques of Patent Documents 3 to 5, when the target sound enters the sub-microphone at a relatively high volume, high noise removal performance cannot be obtained because, for example, the component of the acoustic signal corresponding to the target sound is mistakenly removed as a noise component.
With Patent Document 1, when the composite signal obtained by weighted averaging, with predetermined weights, of the signals input through the sub-microphones (noise microphones) is used as an input to the two-input spectral subtraction process, a change in the acoustic environment creates a mismatch between the averaging weights and the degree to which the target sound enters each sub-microphone, and noise removal performance deteriorates. When instead a signal selected from among the acoustic signals input through the sub-microphones is used as the input, then in situations where different noises arrive at the microphones from multiple directions, the noise components based on the unselected acoustic signals are not removed, and noise removal performance again deteriorates.
With the technique of Patent Document 2, the audio signal from the person at the center of the shooting range is emphasized, but other audio signals also remain, so the target sound signal is not actually extracted.

If ICA-based BSS sound source separation or binary masking is performed on the main acoustic signal and the acoustic signal obtained through the sub-microphone (the sub-acoustic signal), a separated signal corresponding to the target sound can be obtained. Depending on the acoustic environment, however, that separated signal may contain a relatively high proportion of noise components other than the target sound. For example, the separation performance of ICA-based BSS degrades when the sound sources of the target sound and other noises exceed the microphones in number, or in environments where noise is reflected and reverberates.
As an acoustic input device that achieves sharp directivity, a device comprising a microphone array and a delay-and-sum filter is known, for example, but the sharper the directivity, the larger the device becomes.
The present invention has been made in view of the above circumstances. Its object is to provide a target sound extraction device, a target sound extraction program, and a target sound extraction method that can ensure high target sound extraction performance (noise removal performance) with a small device, in acoustic environments where the target sound and other noises (non-target sounds) are mixed into the acoustic signals obtained through a plurality of microphones and where the state of that mixing can change.

To achieve the above object, a target sound extraction device according to the present invention (corresponding to the first invention described later) extracts an acoustic signal corresponding to the target sound and outputs an extraction signal, based on one main acoustic signal obtained through one main microphone that mainly receives the target sound output from a predetermined target sound source, and on a plurality of sub-acoustic signals obtained through a plurality of sub-microphones, each having directivity in one of a plurality of directions different from that of the main microphone. It comprises the components (1-1) to (1-3) below.
(1-1) Sound source separation means, provided individually for each combination of two acoustic signals consisting of the main acoustic signal and one of the plurality of sub-acoustic signals, which separates and generates, based on those two acoustic signals, a target sound separation signal corresponding to the target sound and a reference sound separation signal corresponding to a reference sound other than the target sound, by sound source separation processing using the blind source separation method based on independent component analysis.
(1-2) Target sound separation signal synthesis means, which synthesizes the plurality of target sound separation signals separated and generated by the sound source separation means.
(1-3) Spectral subtraction processing means, which performs spectral subtraction between the synthesized signal obtained by the target sound separation signal synthesis means and the plurality of reference sound separation signals separated and generated by the sound source separation means, thereby extracting the acoustic signal corresponding to the target sound from the synthesized signal, and outputs an extraction signal.
In the present invention, the plurality of target sound separation signals generated by the sound source separation means are signals that mainly contain the signal component of the target sound. Likewise, the plurality of reference sound separation signals generated by the sound source separation means are signals that mainly contain the signal components of the sounds of noise sources (sounds other than the target sound, that is, reference sounds) within the pickup ranges of the respective sub-microphones, which differ in position or in direction of directivity.
However, depending on the position of the target sound source relative to the microphones (the main microphone and the sub-microphones) and on how noise occurs, a relatively large amount of noise components other than the target sound may remain in the target sound separation signals. The synthesized signal obtained by combining them is therefore also basically a signal that mainly contains the target sound component, but in some situations a relatively large amount of noise components may remain in it.
Even when the synthesized signal contains components of noise sounds (reference sounds) other than the target sound, the signal obtained by extracting the target sound component from the synthesized signal through spectral subtraction is a signal from which the signal components of the reference sound separation signals have been removed. Moreover, even when different noises (reference sounds) arrive at the main microphone from multiple directions, the extraction signal produced by the spectral subtraction processing means is a signal from which the signal components of all the reference sound separation signals corresponding to those noises have been removed.
Accordingly, by applying to the synthesized signal of the target sound separation signals the spectral subtraction that removes the signal components of each reference sound separation signal, high noise removal performance can be ensured both when a relatively strong specific noise arrives at the main microphone and when different noises arrive at the main microphone from multiple directions.
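The data flow through components (1-1) to (1-3) can be sketched as below. This is a structural illustration under several assumptions: signals are represented as per-frame magnitude spectra, synthesis is done by simple averaging (the text only says "synthesizes"), all reference spectra are summed before a single floored subtraction, and the ICA-based separator is replaced by a hypothetical pass-through stand-in. All function names and the factor `beta` are illustrative.

```python
def synthesize(target_specs):
    """(1-2) Combine the per-pair target separation spectra; averaging is assumed."""
    count = len(target_specs)
    return [sum(bins) / count for bins in zip(*target_specs)]


def subtract_references(synth_spec, ref_specs, beta=1.0):
    """(1-3) Subtract every reference separation spectrum from the synthesized one."""
    total_ref = [sum(bins) for bins in zip(*ref_specs)]
    return [max(s - beta * r, 0.0) for s, r in zip(synth_spec, total_ref)]


def extract_target(main_spec, sub_specs, separate):
    """Run one separator per (main, sub) pair, then synthesize and subtract."""
    # (1-1) Each pair yields a target separation signal and a reference one.
    pairs = [separate(main_spec, sub) for sub in sub_specs]
    targets = [target for target, _ in pairs]
    refs = [ref for _, ref in pairs]
    return subtract_references(synthesize(targets), refs)


def passthrough_separator(main_spec, sub_spec):
    """Hypothetical stand-in for ICA-based BSS: returns its inputs unchanged."""
    return list(main_spec), list(sub_spec)


# One main spectrum and two sub spectra, each capturing a different noise bin.
main_spec = [4.0, 2.0, 2.0]
sub_specs = [[0.0, 1.0, 0.0], [0.0, 0.0, 2.0]]
extracted = extract_target(main_spec, sub_specs, passthrough_separator)
```

Because every pair contributes its own reference spectrum, noise arriving from each sub-microphone direction is subtracted, which is the property the paragraphs above rely on.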

In general, to obtain high separation performance in ICA-based BSS sound source separation, it is necessary to increase the number of sequential calculations (learning calculations) used to obtain the separation matrix for the separation (filter) processing, or to increase the number of samples of the acoustic signal (digital signal) used in those calculations, and either choice increases the computational load. For example, when the sequential calculation is performed on a practical processor, it can take several times the duration of the input acoustic signal, which makes it unsuitable for real-time processing.
Spectral subtraction, by contrast, has a relatively small computational load and can run in real time even on a practical processor.
In the target sound extraction device according to the present invention, the sound source separation processing executed by the sound source separation means may therefore be either of the processes (1-1-1) and (1-1-2) below.
(1-1-1) In the sound source separation processing executed by the sound source separation means, filter processing based on a predetermined separation matrix is applied sequentially to the acoustic signals input in time series through the microphones to generate separated signals. In addition, for each section signal obtained by dividing the time-series input acoustic signal at a predetermined period, a sequential calculation that uses the entire section signal is performed to obtain the separation matrix used in subsequent filter processing, and the number of iterations of that sequential calculation is limited to a predetermined number.
(1-1-2) In the sound source separation processing executed by the sound source separation means, filter processing based on a predetermined separation matrix is applied sequentially to the acoustic signals input in time series through the microphones to generate separated signals. In addition, for each section signal obtained by dividing the time-series input acoustic signal at a predetermined period, a sequential calculation that uses only the signal in a leading part of the section's time span is performed to obtain the separation matrix used in subsequent filter processing.
In the sound source separation processing of (1-1-1) or (1-1-2), the filter processing has a small computational load, and real-time processing can be achieved with a comfortable margin even when a practical processor executes it together with the spectral subtraction.
The sequential calculation (learning calculation) in (1-1-1) or (1-1-2) is likewise a light process, because the number of iterations or the number of samples (time span) of the acoustic signal (digital signal) used is limited. Therefore, even when a practical processor executes it together with the filter processing and the spectral subtraction (the real-time processes), the calculation of the separation matrix to be used next completes in a relatively short time. As a result, the separation matrix used for filtering is promptly updated to a state adapted to changes in the acoustic environment, which improves the adaptability of target sound extraction to such changes. And even if this simplification of the sequential calculation leaves some noise in the separated signals obtained by the sound source separation processing, the combination of sound source separation and spectral subtraction ensures sufficient target sound extraction performance overall.
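The two ways of limiting the learning calculation, (1-1-1) and (1-1-2), can be sketched as follows. The sketch makes simplifying assumptions that are not in the text: a two-channel instantaneous (non-convolutive) mixture, a 2x2 separation matrix, and the natural-gradient ICA update W <- W + mu (I - E[tanh(y) y^T]) W. The names `ica_update`, `learn_segment`, `max_iters`, and `head_fraction` are hypothetical.

```python
import math


def separate(W, segment):
    """Filter processing: apply the current 2x2 separation matrix to each sample."""
    return [[W[0][0] * x[0] + W[0][1] * x[1],
             W[1][0] * x[0] + W[1][1] * x[1]] for x in segment]


def ica_update(W, samples, mu=0.1):
    """One natural-gradient learning step computed over the given samples."""
    n = len(samples)
    corr = [[0.0, 0.0], [0.0, 0.0]]      # running average of tanh(y_i) * y_j
    for y in separate(W, samples):
        for i in range(2):
            for j in range(2):
                corr[i][j] += math.tanh(y[i]) * y[j] / n
    grad = [[(1.0 if i == j else 0.0) - corr[i][j] for j in range(2)]
            for i in range(2)]
    return [[W[i][j] + mu * sum(grad[i][k] * W[k][j] for k in range(2))
             for j in range(2)] for i in range(2)]


def learn_segment(W, segment, max_iters=3, head_fraction=1.0):
    """Learn the matrix for the next segment under a limited budget.

    max_iters caps the number of learning iterations, as in (1-1-1).
    head_fraction < 1.0 uses only the leading part of the segment, as in (1-1-2).
    """
    head = segment[:max(1, int(len(segment) * head_fraction))]
    for _ in range(max_iters):
        W = ica_update(W, head)
    return W


W = [[1.0, 0.0], [0.0, 1.0]]
segment = [[0.3, -0.1], [-0.2, 0.4], [0.5, 0.2], [-0.4, -0.3]]
separated = separate(W, segment)                   # real-time filtering
W = learn_segment(W, segment, max_iters=3)         # (1-1-1) style update
W = learn_segment(W, segment, head_fraction=0.5)   # (1-1-2) style update
```

In both variants the filtering of the current segment uses the matrix learned from earlier data, so the cost of learning is bounded per segment and does not delay the output.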

It is more preferable if the target sound extraction device according to the present invention further comprises the components (1-4) and (1-5) below.
(1-4) Main/sub acoustic signal identification means, which, based on three or more input acoustic signals obtained through three or more microphones that each have a different direction of directivity, identifies one main acoustic signal and a plurality of sub-acoustic signals from among those input acoustic signals.
(1-5) Signal path switching means, which switches the transmission paths of the acoustic signals from the three or more microphones to the sound source separation means according to the identification result of the main/sub acoustic signal identification means.
For example, the main/sub acoustic signal identification means may identify the one main acoustic signal and the plurality of sub-acoustic signals based on a comparison of the signal strengths of the three or more input acoustic signals, or based on a comparison of the proportion that a predetermined frequency component occupies in each of them.
With these components, the target sound extraction device according to the present invention can also be applied to cases in which the position of the target sound source can change, so that no predetermined one of the microphones can be fixed as the main microphone.
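A minimal sketch of components (1-4) and (1-5), using the signal-strength comparison mentioned above. It assumes that "signal strength" means mean power over a short block of samples and that the strongest input is the main acoustic signal; both are assumptions for illustration, as are the function names.

```python
def identify_main_and_subs(signals):
    """(1-4) Pick the input with the highest mean power as the main signal.

    signals: one list of samples per microphone (three or more microphones).
    Returns the index of the main signal and the indices of the sub signals.
    """
    powers = [sum(s * s for s in sig) / len(sig) for sig in signals]
    main_idx = max(range(len(signals)), key=lambda i: powers[i])
    sub_idx = [i for i in range(len(signals)) if i != main_idx]
    return main_idx, sub_idx


def route_to_separators(signals):
    """(1-5) Build the (main, sub) pairs that are fed to the separation means."""
    main_idx, sub_idx = identify_main_and_subs(signals)
    return [(signals[main_idx], signals[i]) for i in sub_idx]


# Three microphones; the talker is currently loudest on microphone 1.
mic_signals = [[0.1, -0.1], [1.0, -1.0], [0.2, 0.2]]
main_index, sub_indices = identify_main_and_subs(mic_signals)
pairs = route_to_separators(mic_signals)
```

Re-running the identification on each new block lets the routing follow a target sound source that moves from one microphone's pickup range to another.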

また，本発明は，以上に示した目的音抽出装置における各手段が実行する処理をコンピュータに実行させる目的音抽出プログラムとして捉えることもできる。
即ち，本発明に係る目的音抽出プログラムは，所定の目的音源から出力される目的音を主に入力する１つの主マイクロホンを通じて得られる１つの主音響信号と，前記主マイクロホンとは異なる複数の方向それぞれに指向性を有する複数の副マイクロホンそれぞれを通じて得られる複数の副音響信号と，に基づいて，前記目的音に相当する音響信号を抽出して抽出信号を出力する処理をコンピュータに実行させる目的音抽出プログラムであり，さらに，次の（２−１）〜（２−３）に示す処理をコンピュータに実行させるプログラムである。
（２−１）前記主音響信号と前記複数の副音響信号それぞれとからなる２つの音響信号の組合せそれぞれについて個別に，当該２つの音響信号に基づいて，前記目的音に対応する目的音分離信号と前記目的音以外の参照音に対応する参照音分離信号とを独立成分分析法に基づくブラインド音源分離方式の処理により分離生成する音源分離処理。
（２−２）前記音源分離処理により分離生成された複数の前記目的音分離信号を合成する目的音分離信号合成処理。
（２−３）前記目的音分離信号合成処理により得られた合成信号と前記音源分離処理により分離生成された複数の前記参照音分離信号との間でスペクトル減算処理を行うことにより，前記目的音分離信号合成処理により得られた合成信号から前記目的音に相当する音響信号を抽出して抽出信号を出力する処理。
以上に示した目的音抽出プログラムを実行するコンピュータによっても，前述した本発明に係る目的音抽出装置と同様の作用効果が得られる。
また，本発明は，以上に示した本発明に係る目的音抽出プログラムにおける各処理をコンピュータによって実行する目的音抽出方法として捉えることもできる。 The present invention can also be understood as a target sound extraction program that causes a computer to execute the processing executed by each means in the target sound extraction apparatus described above.
That is, target sound extraction program according to the present invention, one of the main sound signal obtained through one of the primary microphone for inputting a target sound to be outputted from a predetermined target source mainly before several different from the Kinushi microphone An object of causing a computer to execute a process of extracting an acoustic signal corresponding to the target sound and outputting an extracted signal based on a plurality of sub-acoustic signals obtained through a plurality of sub-microphones each having directivity in each direction It is a sound extraction program, and is a program that causes a computer to execute the following processes (2-1) to (2-3).
(2-1) individually for each combination of the main acoustic signal and two acoustic signals consisting of said plurality of sub-acoustic signal, based on the two acoustic signals, target sound separation signals corresponding to the target sound Sound source separation processing for separating and generating a reference sound separation signal corresponding to a reference sound other than the target sound by a blind sound source separation method based on an independent component analysis method .
(2-2) A target sound separation signal synthesis process for synthesizing a plurality of target sound separation signals separated and generated by the sound source separation process.
(2-3) Spectral subtraction processing is performed between the synthesized signal obtained by the target sound separation signal synthesis processing and the plurality of reference sound separation signals separated and generated by the sound source separation processing, whereby the target sound is obtained. A process of extracting an acoustic signal corresponding to the target sound from the synthesized signal obtained by the separated signal synthesizing process and outputting the extracted signal.
The same effect as that of the above-described target sound extraction apparatus according to the present invention can be obtained by a computer that executes the target sound extraction program described above.
The present invention can also be understood as a target sound extraction method in which each process in the target sound extraction program according to the present invention described above is executed by a computer.
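The data flow of processes (2-1) to (2-3) can be sketched as follows. This is only an illustrative skeleton, not the patented implementation: the function name `extract_target_sound` and the three callables `separate`, `synthesize`, and `subtract` are hypothetical stand-ins for the sound source separation, synthesis, and spectral subtraction processes, whose internals are left open here.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

Signal = TypeVar("Signal")  # any signal representation (samples, spectra, ...)


def extract_target_sound(
    main: Signal,
    subs: Sequence[Signal],
    separate: Callable[[Signal, Signal], Tuple[Signal, Signal]],
    synthesize: Callable[[List[Signal]], Signal],
    subtract: Callable[[Signal, List[Signal]], Signal],
) -> Signal:
    """Hypothetical skeleton of processes (2-1) to (2-3)."""
    targets: List[Signal] = []
    references: List[Signal] = []
    # (2-1) One separation per (main, sub) pair, each yielding a target
    # sound separation signal and a reference sound separation signal.
    for sub in subs:
        target, reference = separate(main, sub)
        targets.append(target)
        references.append(reference)
    # (2-2) Combine all target sound separation signals into one signal.
    combined = synthesize(targets)
    # (2-3) Remove every reference sound separation signal from the result.
    return subtract(combined, references)
```

The point of the structure is that N sub microphones produce N target sound separation signals and N reference sound separation signals, all of which feed the final subtraction step.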

According to the present invention (corresponding to the first invention described later), high noise removal performance can be ensured in an acoustic environment where different noises arrive at the microphones from a plurality of directions, in an acoustic environment where the target sound is mixed into any of the sub microphones at a relatively large volume, and even when such an acoustic environment changes.
Further, according to the present invention, as will be described later, even if the directivity of the main microphone itself is moderate, the target sound extraction device according to the present invention functions as an acoustic input device having very steep directivity. Moreover, by adjusting the position or directivity direction of the sub microphones relative to the position or directivity direction of the main microphone (moving them closer or farther away), the position and direction of the sound sources whose sound is treated as noise (removed) can be adjusted. The directivity of the target sound extraction device according to the present invention can therefore be adjusted, which is highly convenient. Further, as will be described later, a device that functions as an acoustic input device having such steep or flexible directivity can be realized as a very small device.

Hereinafter, embodiments of the present invention will be described with reference to the accompanying drawings so that the present invention can be understood. The following embodiments are examples embodying the present invention and do not limit its technical scope.
FIG. 1 is a block diagram showing a schematic configuration of the target sound extraction device X1 according to the embodiment of the first invention.
FIG. 2 is a conceptual diagram showing the process of target sound extraction processing in the target sound extraction device X1.
FIG. 3 is a block diagram showing a schematic configuration of the target sound extraction device X2 according to the embodiment of the second invention.
FIG. 4 is a conceptual diagram showing the process of target sound extraction processing in the target sound extraction device X2.
FIG. 5 is a block diagram showing a schematic configuration of the target sound extraction device X3 according to the embodiment of the third invention.
FIG. 6 is a conceptual diagram showing the process of target sound extraction processing in the target sound extraction device X3.
FIG. 7 is a diagram showing a first experimental condition for evaluating the target sound extraction performance of the target sound extraction devices X1 to X3.
FIG. 8 is a diagram showing a second experimental condition for evaluating the target sound extraction performance of the target sound extraction devices X1 to X3.
FIG. 9 is a diagram showing the target sound extraction performance of the target sound extraction devices X1 to X3 and of conventional target sound extraction processing under the first experimental condition.
FIG. 10 is a diagram showing the target sound extraction performance of the target sound extraction devices X1 to X3 and of conventional target sound extraction processing under the second experimental condition.
FIG. 11 is a diagram showing a third experimental condition for evaluating the directivity of the target sound extraction device X1.
FIG. 12 is a diagram showing the directivity of the target sound extraction device X1 under the third experimental condition.
FIG. 13 is a block diagram showing a schematic configuration of an acoustic input device V2 that can be employed in the target sound extraction devices X1 to X3.
FIG. 14 is a block diagram showing a schematic configuration of a sound source separation device Z that performs BSS sound source separation processing based on the FDICA method.
FIG. 15 is a time chart showing a first example of the processing sequence, excluding the learning calculation, in the sound source separation processing of the target sound extraction devices X1 to X3.
FIG. 16 is a time chart showing a second example of the processing sequence, excluding the learning calculation, in the sound source separation processing of the target sound extraction devices X1 to X3.
FIG. 17 is a time chart showing the learning calculation sequence according to a first example in the sound source separation processing of the target sound extraction devices X1 to X3.
FIG. 18 is a time chart showing the learning calculation sequence according to a second example in the sound source separation processing of the target sound extraction devices X1 to X3.

[First invention]
First, the target sound extraction device X1 according to the embodiment of the first invention will be described with reference to the block diagram shown in FIG.
As shown in FIG. 1, the target sound extraction device X1 includes an acoustic input device V1 including a plurality of microphones, a plurality of (three in FIG. 1) sound source separation processing units 10 (10-1 to 10-3), a target sound separation signal synthesis processing unit 20, and a spectrum subtraction processing unit 31. Here, the acoustic input device V1 includes one main microphone 101 and a plurality of (three in FIG. 1) sub microphones 102 (102-1 to 102-3). The main microphone 101 and the plurality of sub microphones 102 are either arranged at mutually different positions or have directivity in mutually different directions.
The main microphone 101 is sound input means for mainly inputting sound (hereinafter referred to as target sound) emitted from a predetermined target sound source (for example, a speaker that can move within a predetermined range).
The plurality of sub microphones 102-1 to 102-3 are either arranged at a plurality of positions different from that of the main microphone 101 or have directivity in a plurality of mutually different directions, and are acoustic input means that mainly receive reference sound (noise) other than the target sound. The term sub microphone 102 refers collectively to the sub microphones 102-1 to 102-3.
The main microphone 101 and the sub microphones 102 shown in FIG. 1 are each directional microphones, and the sub microphones 102 are arranged so as to have directivity in a plurality of directions, each different from that of the main microphone 101.

When the main microphone 101 and the sub microphones 102 are each directional microphones, it is desirable to take the directivity center direction (front direction) of the main microphone 101 as the center (0°) and to set the directivity center directions (front directions) of the sub microphones 102 in a direction less than 180° away on one side (for example, the +90° direction) and in a direction less than 180° away on the other side (for example, the −90° direction).
The directivity directions of the microphones 101 and 102 may be set to different directions within the same plane, or alternatively to directions that differ three-dimensionally.

The target sound extraction device X1 extracts an acoustic signal corresponding to the target sound, based on the main acoustic signal obtained through the main microphone 101 and the sub acoustic signals obtained through the plurality of sub microphones 102, and outputs the extracted signal (hereinafter referred to as the target sound extraction signal).
In the target sound extraction device X1, the sound source separation processing units 10, the target sound separation signal synthesis processing unit 20, and the spectrum subtraction processing unit 31 are embodied by, for example, a DSP (Digital Signal Processor), which is an example of a computer, together with a ROM storing the program executed by the DSP, or by an ASIC or the like. In this case, the ROM stores in advance a program for causing the DSP to execute the processing (described later) performed by the sound source separation processing units 10, the target sound separation signal synthesis processing unit 20, and the spectrum subtraction processing unit 31.

A sound source separation processing unit 10 (10-1 to 10-3) is provided for each combination of the main acoustic signal and one of the plurality of sub acoustic signals. Based on the main acoustic signal and the sub acoustic signal of its combination, each unit executes sound source separation processing that separates and generates a target sound separation signal, which is a separated signal corresponding to the target sound (an identification signal of the target sound), and a reference sound separation signal (an identification signal of the reference sound) corresponding to a reference sound, that is, a sound other than the target sound (which may also be called noise) (an example of the sound source separation means).
An A/D converter (not shown) is provided between each of the microphones 101 and 102 and the sound source separation processing units 10, and the acoustic signals converted into digital signals by the A/D converters are transmitted to the sound source separation processing units 10. For example, when the target sound is a human voice, the signals may be digitized at a sampling frequency of about 8 kHz.
Here, the sound source separation processing units 10 (10-1 to 10-3) execute, for example, sound source separation processing by a blind sound source separation method based on the independent component analysis method shown in Non-Patent Document 2 or Non-Patent Document 3, or sound source separation processing such as the binary masking processing shown in Non-Patent Document 4 or Non-Patent Document 5.

Hereinafter, a sound source separation device Z, which is an example of a device that can be employed as the sound source separation processing unit 10, will be described with reference to the block diagram shown in FIG. 14.
The sound source separation device Z described below operates in a state where a plurality of sound sources and a plurality of microphones 101 and 102 exist in a predetermined acoustic space. When a plurality of mixed audio signals, each being a signal in which the individual audio signals from the sound sources (hereinafter referred to as sound source signals) are superimposed, are sequentially input through the respective microphones 101 and 102, the device applies BSS sound source separation processing based on the ICA method to the mixed audio signals, thereby sequentially generating a plurality of separated signals corresponding to the sound source signals (signals identifying the sound source signals).
The sound source separation device Z shown in FIG. 14 performs sound source separation processing based on the FDICA method (Frequency-Domain ICA) which is a kind of ICA-BSS method.

In the FDICA method, first, the ST-DFT processing unit 13 applies a short-time discrete Fourier transform (hereinafter referred to as ST-DFT processing) to the input mixed audio signal x(t) frame by frame, where a frame is the signal divided at a predetermined period, to perform a short-time analysis of the observed signal. Then, the separation calculation processing unit 11f applies separation calculation processing (filter processing) based on the separation matrix W(f) to the signal of each channel (the signal of each frequency component) after the ST-DFT processing, thereby performing sound source separation (identification of the sound source signals). Here, if f is the frequency bin and m is the analysis frame number, and x(f, m) denotes the mixed audio signal after ST-DFT processing, the separated signal (identification signal) y(f, m) can be expressed as the following equation (1).

y(f, m) = W(f) · x(f, m)   (1)

As can be seen from equation (1), the separation calculation processing (filter processing) is performed for each frequency bin.
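The per-bin filtering described above, in which the channel vector of each frequency bin is multiplied by that bin's separation matrix W(f), can be sketched as follows. This is a minimal sketch, assuming the spectra of one analysis frame are already available as nested lists; the function name `separate_frame` and the data layout are illustrative assumptions, not taken from the patent.

```python
from typing import List, Sequence

Matrix = Sequence[Sequence[complex]]  # one n x n separation matrix W(f)


def separate_frame(
    W: Sequence[Matrix], x: Sequence[Sequence[complex]]
) -> List[List[complex]]:
    """Apply y(f, m) = W(f) x(f, m) to one analysis frame m.

    W[f] is the separation matrix of frequency bin f (assumed layout).
    x[f] is the vector of channel spectra in bin f, one entry per microphone.
    """
    y: List[List[complex]] = []
    for W_f, x_f in zip(W, x):
        # Ordinary matrix-vector product, computed independently per bin.
        y.append([sum(w * v for w, v in zip(row, x_f)) for row in W_f])
    return y
```

Because each bin has its own matrix, the separation reduces to an independent small matrix-vector product per bin, which is why the text can treat each narrow band as an instantaneous mixing problem.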
Here, the update formula of the separation filter W (f) can be expressed as the following formula (2), for example.
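For orientation, one update rule widely used in the FDICA literature is shown below. Whether equation (2) has exactly this form is an assumption. The symbols η (step size), φ (a nonlinear function applied to the separated signal), the iteration index [i], the frame average ⟨·⟩_m, and the off-diag operator (which sets the diagonal elements to zero) are introduced here for illustration and are not defined in the surrounding text.

```latex
W^{[i+1]}(f) = W^{[i]}(f)
  - \eta \,\operatorname{off\text{-}diag}\!\left\{
      \left\langle \varphi\!\left(y^{[i]}(f,m)\right)\,
      y^{[i]}(f,m)^{\mathsf{H}} \right\rangle_{m}
    \right\} W^{[i]}(f)
```

Here y^{[i]}(f, m) = W^{[i]}(f) x(f, m) is the separated signal obtained with the separation matrix of the i-th iteration.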

According to this FDICA method, sound source separation processing is handled as an instantaneous mixing problem in each narrow band, and the separation filter (separation matrix) W (f) can be updated relatively easily and stably.
In FIG. 14, a separation signal y1 (f) corresponding to the main microphone 101 is the target sound separation signal. Further, the separation signal y2 (f) corresponding to the sub microphone 102 is the reference sound separation signal.
FIG. 14 shows an example in which the number of channels of the input mixed audio signals x1 and x2 (that is, the number of microphones) is two, but as long as (number of channels n) ≧ (number of sound sources m), the same configuration can be used with three or more channels.

Further, in the target sound extraction device X1, the target sound separation signal synthesis processing unit 20 executes a synthesis process on the plurality of target sound separation signals separated and generated by the respective sound source separation processing units 10, and outputs the resulting synthesized signal (an example of the target sound separation signal synthesizing means).
For example, the target sound separation signal synthesis processing unit 20 synthesizes the plurality of target sound separation signals by executing averaging or weighted averaging on them for each of a plurality of divided frequency components (frequency bins).
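The per-bin averaging described above can be sketched as follows. This is a minimal sketch, assuming each target sound separation signal is given as a list of per-bin spectrum values for one frame; the function name `synthesize` and the `weights` parameter are illustrative assumptions.

```python
from typing import List, Optional, Sequence


def synthesize(
    targets: Sequence[Sequence[complex]],
    weights: Optional[Sequence[float]] = None,
) -> List[complex]:
    """Combine target sound separation signals bin by bin.

    targets[i][k] is the spectrum value of signal i in frequency bin k.
    With weights=None this is a plain average, otherwise a weighted average.
    """
    if not targets:
        raise ValueError("at least one target sound separation signal is required")
    if weights is None:
        weights = [1.0] * len(targets)
    total = sum(weights)
    n_bins = len(targets[0])
    return [
        sum(w * t[k] for w, t in zip(weights, targets)) / total
        for k in range(n_bins)
    ]
```

How the weights would be chosen (for example, per sub microphone) is not specified in the text and is left to the caller here.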
Further, in the target sound extraction device X1, the spectrum subtraction processing unit 31 performs spectral subtraction between the synthesized signal obtained by the target sound separation signal synthesis processing unit 20 and the plurality of reference sound separation signals separated and generated by the respective sound source separation processing units 10, thereby extracting an acoustic signal corresponding to the target sound from the synthesized signal, and outputs the extracted signal (the target sound extraction signal) (an example of the spectrum subtraction processing means).
The spectrum subtraction processing unit 31 extracts the target sound extraction signal by removing the signal components of each of the reference sound separation signals from the synthesized signal, using well-known spectral subtraction (target sound extraction based on the spectrum difference method).
In the spectral subtraction, the spectrum subtraction processing unit 31 executes a discrete Fourier transform (DFT) on each frame of a predetermined time length for each of the synthesized signal and the reference sound separation signals, and performs a short-time analysis of the observed signal (here, the synthesized signal). Here, let f be the frequency bin, m the analysis frame number, Y(f, m) the spectrum value (signal value after the DFT) of the synthesized signal, which is the observed signal, S(f, m) the spectrum value of the target sound signal, and N(f, m) the spectrum value of the noise signal (the signal of sounds other than the target sound). The spectrum value Y(f, m) of the synthesized signal is then expressed by the following equation (3).

Y(f, m) = S(f, m) + N(f, m)   (3)

Here, assuming that there is no correlation between the target sound signal and the noise signal, and further assuming that the spectrum value N(f, m) of the noise signal can be approximated by the spectrum value of the reference sound signal, the spectrum subtraction processing unit 31 can calculate the spectrum estimate of the target sound signal (that is, the spectrum value of the target sound extraction signal) based on the following equation (4).
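One common form of spectral subtraction, shown here only as an illustration, works on power spectra: the noise power in each bin is approximated by the summed power of the reference sound separation signals, subtracted from the power of the synthesized signal, floored at zero, and recombined with the phase of the synthesized signal. Whether equation (4) has this exact form is an assumption, and the function name `spectral_subtract` and the over-subtraction factor `beta` are hypothetical.

```python
import cmath
import math
from typing import List, Sequence


def spectral_subtract(
    Y: Sequence[complex],
    refs: Sequence[Sequence[complex]],
    beta: float = 1.0,
) -> List[complex]:
    """Power-domain spectral subtraction for one analysis frame.

    Y[k] is the synthesized-signal spectrum in bin k.
    refs[i][k] is the spectrum of reference sound separation signal i in bin k.
    beta scales the subtracted noise power (assumed parameter).
    """
    out: List[complex] = []
    for k, y_k in enumerate(Y):
        # Uncorrelated-signal assumption: noise powers add across references.
        noise_power = sum(abs(r[k]) ** 2 for r in refs)
        # Floor at zero so the estimated power never becomes negative.
        power = max(abs(y_k) ** 2 - beta * noise_power, 0.0)
        # Reuse the phase of the synthesized signal for the estimate.
        out.append(cmath.rect(math.sqrt(power), cmath.phase(y_k)))
    return out
```

The flooring step is the nonlinear part of the processing, which is what can give rise to musical noise when spectral subtraction is used on its own.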

Next, the process of target sound extraction processing in the target sound extraction device X1 will be described with reference to FIG. 2. For simplicity of explanation, FIG. 2 shows an example in which there are two sub acoustic signals (that is, two sub microphones 102).
The plurality of target sound separation signals separated and generated by the sound source separation processing units 10 are signals that mainly contain the signal component of the target sound. Similarly, the plurality of reference sound separation signals (Y_B1 and Y_B2 in FIG. 2) separated and generated by the sound source separation processing units 10 are signals that mainly contain the signal components of the sound (reference sound) of noise sources within the sound collection ranges of the respective sub microphones 102, which differ in position or directivity direction (the components indicated in FIG. 2 by bar graphs other than the hatched bar graphs).
However, depending on the position of the target sound source and the state of noise generation, a relatively large amount of signal components of reference sounds other than the target sound may remain in the target sound separation signals. Therefore, the synthesized signal obtained by synthesizing them (Y_C in FIG. 2) is basically a signal that mainly contains the signal component of the target sound (the component indicated by the hatched bar graphs in FIG. 2), but depending on the situation a relatively large amount of noise signal components may remain in it.
On the other hand, even if the target sound separation signals contain components of noise (reference sound) other than the target sound, the target sound extraction signal (Y_O in FIG. 2), which is the result of the spectrum subtraction processing unit 31 extracting the signal component of the target sound from the synthesized signal, is a signal from which the signal components of the reference sound separation signals have been removed. Moreover, even in a situation where different noises (reference sounds) arrive at the main microphone 101 from a plurality of directions, the target sound extraction signal is a signal from which the signal components of all the reference sound separation signals corresponding to those noises have been removed.
Therefore, according to the target sound extraction device X1, high noise removal performance can be ensured even in a situation where a relatively strong specific noise arrives at the main microphone 101 or where different noises arrive at the main microphone 101 from a plurality of directions.
Further, spectral subtraction alone, being nonlinear processing, tends to produce musical noise peculiar to nonlinear processing in its output signal (the extracted target sound signal). In the target sound extraction device X1, however, the spectral subtraction is performed on signals that have already undergone linear filter processing by the sound source separation processing units 10, so harsh musical noise can be prevented from being included in the target sound extraction signal. In particular, when the sound sources, including the target sound and the noise, are a small number (about three or fewer) of point sound sources, the sound source separation processing contributes especially effectively to the target sound extraction, and the effect of suppressing musical noise is enhanced.

[Second invention]
Next, the target sound extraction device X2 according to the embodiment of the second invention will be described with reference to the block diagram shown in FIG. 3. In FIG. 3, among the constituent elements of the target sound extraction device X2, those that perform the same processing as the corresponding elements of the target sound extraction device X1 are given the same reference numerals as in FIG. 1.
As shown in FIG. 3, the target sound extraction device X2 includes an acoustic input device V1 including a plurality of microphones, a plurality of (three in FIG. 3) sound source separation processing units 10 (10-1 to 10-3), and a spectrum approximate signal extraction processing unit 32. Here, the acoustic input device V1 is the same as the acoustic input device V1 in the target sound extraction device X1.
The target sound extraction device X2 likewise extracts an acoustic signal corresponding to the target sound, based on the main acoustic signal obtained through the main microphone 101 and the sub acoustic signals obtained through the plurality of sub microphones 102, and outputs the extracted signal (the target sound extraction signal).
In the target sound extraction device X2, the sound source separation processing units 10 and the spectrum approximate signal extraction processing unit 32 are embodied by, for example, a DSP, which is an example of a computer, together with a ROM storing the program executed by the DSP, or by an ASIC or the like. In this case, the ROM stores in advance a program for causing the DSP to execute the processing (described later) performed by the sound source separation processing units 10 and the spectrum approximate signal extraction processing unit 32.

A sound source separation processing unit 10 (10-1 to 10-3) is provided for each combination of the main acoustic signal and one of the plurality of sub acoustic signals, and, based on the main acoustic signal and the sub acoustic signal, executes sound source separation processing that separates and generates a target sound separation signal, which is a separated signal (identification signal) corresponding to the target sound.
As in the target sound extraction device X1, an A/D converter (not shown) is provided between each of the microphones 101 and 102 and the sound source separation processing units 10.
Here, as in the case of the target sound extraction device X1, the sound source separation processing units 10 (10-1 to 10-3) execute, for example, sound source separation processing by a blind sound source separation method based on the independent component analysis method shown in Non-Patent Document 2 or Non-Patent Document 3, or sound source separation processing such as the binary masking processing shown in Non-Patent Document 4 or Non-Patent Document 5.

Further, for the plurality of target sound separation signals separated and generated by the sound source separation processing units 10, the spectrum approximate signal extraction processing unit 32 extracts, from among the signal components of each of a plurality of divided frequency bands (frequency bins), those signal components that satisfy a predetermined approximation condition between the target sound separation signals. It thereby extracts an acoustic signal corresponding to the target sound from the plurality of target sound separation signals and outputs the extracted signal (the target sound extraction signal).
For example, the spectrum approximate signal extraction processing unit 32 compares the levels (power) of the signal components of the plurality of target sound separation signals for each frequency bin. When the approximation condition is satisfied, namely that the ratio or difference of those levels is within a predetermined range, it extracts the target sound extraction signal by selecting any one of those signal components or by combining them (for example, by calculating their average value or minimum value).
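The level comparison described above can be sketched as follows. This is a minimal sketch under stated assumptions: the approximation condition is taken to be a bound on the ratio between the largest and smallest level in a bin, the combined value is the minimum, and bins that fail the condition are set to zero. The text does not say what happens to bins that fail the condition, so that choice, the function name `extract_similar_bins`, and the threshold `max_ratio` are all illustrative.

```python
from typing import List, Sequence


def extract_similar_bins(
    targets: Sequence[Sequence[float]], max_ratio: float = 2.0
) -> List[float]:
    """Keep only bins whose levels agree across target sound separation signals.

    targets[i][k] is the level (power) of signal i in frequency bin k.
    A bin passes when max level <= max_ratio * min level (assumed condition).
    """
    out: List[float] = []
    for k in range(len(targets[0])):
        levels = [t[k] for t in targets]
        lowest, highest = min(levels), max(levels)
        if highest <= max_ratio * lowest:
            # Levels are close: treat the bin as target sound, keep the minimum.
            out.append(lowest)
        else:
            # Levels disagree: treat the bin as noise (assumed handling).
            out.append(0.0)
    return out
```

The idea is that a component present at a similar level in every target sound separation signal is likely to be the target sound, while a component that is strong in only some of them is likely to be residual noise from one direction.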

Next, the process of target sound extraction processing in the target sound extraction device X2 will be described with reference to FIG. 4. For simplicity of explanation, FIG. 4 shows an example in which there are two sub acoustic signals (that is, two sub microphones 102).
The plurality of target sound separation signals (Y_A1 and Y_A2 in FIG. 4) separated and generated by the sound source separation processing units 10 are each signals that mainly contain the signal component of the target sound (the component indicated by the hatched bar graphs in FIG. 4).
However, depending on the position of the target sound source and the state of noise generation, a relatively large amount of signal components of reference sounds other than the target sound (the components indicated in FIG. 4 by bar graphs other than the hatched bar graphs) may remain in the target sound separation signals.
On the other hand, even if the target sound separation signals contain components of noise (reference sound) other than the target sound, the positions or directivity directions of the microphones 101 and 102 differ from one another. It is therefore usual that only some of the target sound separation signals contain a large noise component, or that the kinds of noise component contained in the individual target sound separation signals differ.
Accordingly, the target sound extraction signal (Y_O in FIG. 4), which is the result of the spectrum approximate signal extraction processing unit 32 extracting the signal components that approximate one another among the plurality of target sound separation signals (Y_A1 and Y_A2 in FIG. 4), is a signal from which the signal components of the various noises have been removed.
Therefore, according to the target sound extraction device X2, high noise removal performance can be ensured even in a situation where a relatively strong specific noise arrives at the main microphone 101 or where different noises arrive at the main microphone 101 from a plurality of directions.

[Third invention]
Next, the target sound extraction device X3 according to an embodiment of the third invention will be described with reference to the block diagram shown in FIG. 5. In FIG. 5, those components of the target sound extraction device X3 that perform the same processing as components of the target sound extraction device X1 are denoted by the same reference numerals as in FIG. 1.
As shown in FIG. 5, the target sound extraction device X3 includes an acoustic input device V1 including a plurality of microphones, a plurality (three in FIG. 5) of sound source separation processing units 10 (10-1 to 10-3), and a spectral subtraction processing unit 31′. The acoustic input device V1 is the same as the acoustic input device V1 of the target sound extraction device X1.
The target sound extraction device X3 likewise extracts an acoustic signal corresponding to the target sound based on the main acoustic signal obtained through the main microphone 101 and the sub acoustic signals obtained through the plurality of other sub microphones 102, and outputs the extracted signal (the target sound extraction signal).
In the target sound extraction device X3, the sound source separation processing units 10 and the spectral subtraction processing unit 31′ are implemented by, for example, a DSP (an example of a computer) together with a ROM storing the program executed by that DSP, or by an ASIC or the like. In the former case, the ROM stores in advance a program for causing the DSP to execute the processing (described later) performed by the sound source separation processing units 10 and the spectral subtraction processing unit 31′.

The sound source separation processing units 10 (10-1 to 10-3) are provided one for each combination of the main acoustic signal with one of the plurality of sub acoustic signals. Based on the main acoustic signal and the sub acoustic signal, each unit executes sound source separation processing that separates and generates a reference sound separation signal, which is a separated signal (identified signal) corresponding to noise (a reference sound) other than the target sound.
As in the target sound extraction device X1, A/D converters (not shown) are provided between the microphones 101 and 102 and the sound source separation processing units 10.
As in the case of the target sound extraction device X1, the sound source separation processing units 10 (10-1 to 10-3) execute, for example, sound source separation processing by a blind source separation method based on independent component analysis as described in Non-Patent Document 2 or Non-Patent Document 3, or sound source separation processing such as the binary masking processing described in Non-Patent Document 4 or Non-Patent Document 5.

The spectral subtraction processing unit 31′ performs the spectral subtraction processing described above between the main acoustic signal obtained through the main microphone 101 and the plurality of reference sound separation signals separated and generated by the sound source separation processing units 10, thereby extracting an acoustic signal corresponding to the target sound from the main acoustic signal, and outputs the extracted signal (the target sound extraction signal). The spectral subtraction processing unit 31′ executes the same processing as the spectral subtraction processing unit 31 of the target sound extraction device X1, except that the processing target (observed signal) is the main acoustic signal instead of the synthesized signal.
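As a rough sketch of spectral subtraction between one observed spectrum and several reference spectra, the following Python fragment subtracts the summed reference magnitudes from the magnitude of the main acoustic signal in each frequency bin. The magnitude-domain formulation, the subtraction coefficient `beta`, the floor at zero, and the reuse of the main signal's phase are assumptions chosen for illustration; the specification's own formula may differ in these details.

```python
import cmath


def spectral_subtract(main_spec, ref_specs, beta=1.0):
    """Subtract reference-sound magnitudes from the main spectrum, bin by bin.

    main_spec: list of complex values (main acoustic signal, one frame).
    ref_specs: list of spectra (reference sound separation signals).
    beta: hypothetical subtraction coefficient.
    """
    out = []
    for k, x in enumerate(main_spec):
        noise_mag = sum(abs(ref[k]) for ref in ref_specs)
        # Floor at zero so the estimated magnitude never becomes negative.
        mag = max(abs(x) - beta * noise_mag, 0.0)
        # Assumption: keep the phase of the observed (main) signal.
        out.append(cmath.rect(mag, cmath.phase(x)))
    return out


main = [3 + 4j, 1 + 0j]
refs = [[1 + 0j, 2 + 0j], [1j, 0j]]
y_o = spectral_subtract(main, refs)
```

Because every reference spectrum is subtracted, noise arriving from several directions is handled in a single pass, which matches the behaviour described for a plurality of reference sound separation signals.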

Next, the target sound extraction process in the target sound extraction device X3 will be described with reference to FIG. 6. For simplicity, FIG. 6 shows an example with two sub acoustic signals (that is, two sub microphones 102).
The plurality of reference sound separation signals (Y_B1 and Y_B2 in FIG. 6) separated and generated by the sound source separation processing units 10 are signals that mainly contain the signal components of the sounds (reference sounds) of the noise sources within the sound collection ranges of the respective sub microphones 102, whose positions or directivity directions differ from one another (the components shown by bars other than the hatched bars in FIG. 6).
Meanwhile, a relatively large amount of signal components of reference sounds other than the target sound may remain in the main acoustic signal. Even when the main acoustic signal contains such noise (reference sound) components, the target sound extraction signal (Y_O in FIG. 6), obtained by the spectral subtraction processing unit 31′ extracting the signal component of the target sound from the main acoustic signal, is a signal from which the signal components of the reference sound separation signals have been removed. Moreover, even when different noises (reference sounds) arrive at the main microphone 101 from a plurality of directions, the target sound extraction signal is a signal from which the signal components of all of the reference sound separation signals corresponding to those noises have been removed.
Therefore, the target sound extraction device X3 can ensure high noise removal performance even when a relatively strong specific noise arrives at the main microphone 101, or when different noises arrive at the main microphone 101 from a plurality of directions.
Spectral subtraction processing, being nonlinear, tends on its own to produce the musical noise characteristic of nonlinear processing in its output signal (the extracted target sound signal). In the target sound extraction device X3, however, the spectral subtraction processing is performed based on signals that have undergone linear filter processing by the sound source separation processing units 10, so harsh musical noise can be prevented from appearing in the target sound extraction signal. In particular, when the sound sources, including the target sound and the noises, are a small number (about three or fewer) of point sources, the sound source separation processing contributes especially effectively to noise extraction, and the musical noise suppression effect increases.
Note that the reference sound separation signals and target sound separation signals produced by the sound source separation processing units 10 executing sound source separation processing of the FDICA method, their synthesized signals, and the target sound extraction signal obtained by the spectral subtraction processing or the spectrum approximation signal extraction processing are all acoustic signals in the frequency domain. For this reason, although not shown in FIGS. 1, 3, and 5, the target sound extraction devices X1, X2, and X3 each further include an IDFT processing unit and an acoustic output processing unit.
The IDFT processing unit converts the frequency-domain target sound extraction signal into a time-domain signal; that is, it applies inverse discrete Fourier transform (IDFT) processing and outputs the result to a predetermined buffer memory.
The acoustic output processing unit sequentially outputs to the outside the time-domain target sound extraction signal obtained by the IDFT processing unit (for example, in real time).
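The conversion performed by the IDFT processing unit can be illustrated with a direct, unoptimized inverse discrete Fourier transform. This is a sketch for clarity only: a real device would use a fast transform, and the helper names `dft` and `idft` are hypothetical. Taking the real part assumes that the spectrum originates from a real-valued acoustic frame.

```python
import cmath


def dft(frame):
    """Forward DFT of one time-domain frame (used here only for the round trip)."""
    n = len(frame)
    return [
        sum(frame[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Inverse DFT: frequency-domain frame -> real time-domain samples."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)).real / n
        for t in range(n)
    ]


frame = [0.0, 1.0, 0.0, -1.0]
restored = idft(dft(frame))  # approximately equal to the original frame
```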

[Evaluation of target sound extraction performance]
The evaluation results for the target sound extraction performance of each of the target sound extraction devices X1 to X3 described above are explained below with reference to FIGS. 7 to 10.
FIGS. 7 and 8 show the first experimental condition and the second experimental condition used to evaluate the target sound extraction performance of the target sound extraction devices X1 to X3.
The first experimental condition is relatively close to the ideal state: the target sound source is located in front of the directional main microphone 101, and the other noise sources (reference sound sources) are located in front of the respective directional sub microphones 102.
The second experimental condition is relatively close to an actual usage environment: the target sound source is located in front of the directional main microphone 101, but the other noise sources (reference sound sources) do not necessarily correspond to the respective sub microphones 102.
FIGS. 9 and 10 show the target sound extraction performance of the target sound extraction devices X1 to X3 and of a conventional target sound extraction device under the first and second experimental conditions, expressed as the NRR (Noise Reduction Rate) of the target sound extraction signal. In FIGS. 9 and 10, the target sound extraction devices X1 to X3 are labeled device X1 to device X3, and the conventional target sound extraction device is labeled conventional device. The conventional target sound extraction device referred to here extracts the signal component corresponding to the target sound from the main acoustic signal by spectral subtraction processing based on the sub acoustic signals.
As FIGS. 9 and 10 show, every one of the target sound extraction devices X1 to X3 achieves markedly higher target sound extraction performance than the conventional device, regardless of the experimental condition.
Among the target sound extraction devices X1 to X3, the device X1 achieves the highest target sound extraction performance, followed by the device X3 and then the device X2.
Thus, the target sound extraction devices X1 to X3 can ensure higher target sound extraction performance (noise removal performance) than conventional devices under a variety of acoustic environments.
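The text does not define how NRR is computed. One commonly used definition, assumed here purely for illustration, is the output signal-to-noise ratio minus the input signal-to-noise ratio, both in dB. The function names and the power values below are hypothetical and are not taken from FIGS. 9 and 10.

```python
import math


def snr_db(target_power, noise_power):
    """Signal-to-noise ratio in decibels."""
    return 10.0 * math.log10(target_power / noise_power)


def nrr_db(in_target, in_noise, out_target, out_noise):
    """Assumed NRR definition: output SNR minus input SNR, in dB."""
    return snr_db(out_target, out_noise) - snr_db(in_target, in_noise)


# Illustrative values only: noise power reduced to 1/100 with the target unchanged.
improvement = nrr_db(1.0, 1.0, 1.0, 0.01)
```

Under this assumed definition, a larger NRR means that more noise was removed relative to the target sound.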

[Evaluation of directivity]
The evaluation results for the directivity of the target sound extraction device X1 are explained below with reference to FIGS. 11 and 12.
FIG. 11 shows the third experimental condition, used to evaluate the directivity of the target sound extraction device X1. Under this condition, the target sound source is moved in order to evaluate over what range, relative to the front direction of the main microphone 101, the target sound can be extracted.
FIG. 12 shows the directional characteristics of the target sound extraction device X1 and of the directional main microphone 101 itself under the third experimental condition, that is, the microphone sensitivity (in dB) to a sound source in every direction over the full 360 degrees.

As FIG. 12 shows, although the directivity of the main microphone 101 itself is very gentle, the target sound extraction device X1 achieves a high NRR within a very narrow range centered on the front direction of the main microphone 101, and the NRR drops sharply once the target sound source moves outside that range.
Thus, even though the directivity of the main microphone 101 itself is very gentle, the target sound extraction device X1 functions as an acoustic input device with very sharp directivity.

In the results shown in FIG. 12, the directions at approximately +45° and −45°, taking the front direction of the main microphone 101 (the center of its directivity range) as the center (0°), form the boundaries of the directivity range.
In the third experimental condition, the main microphone 101 and the sub microphones 102 each have left-right symmetric and nearly identical directional characteristics, and the directivity center directions of the two sub microphones 102 are set to +90° and −90° relative to the directivity center direction (0°) of the main microphone 101. This indicates that, in the target sound extraction devices X1 to X3, when the main microphone 101 and the sub microphones 102 each have left-right symmetric and nearly identical directional characteristics, the boundary of the directivity range lies in the direction midway between the directivity center direction of the main microphone 101 and that of each sub microphone 102.
The example shown in FIG. 12 is one in which the directivity directions of the microphones 101 and 102 are set to different directions within the same plane. If they are instead set to directions that differ three-dimensionally, the boundaries of the directivity range can be set to desired directions in three dimensions.
For example, within a given plane, the front direction of the main microphone 101 may be set to 0° and the front directions of the two sub microphones 102-1 and 102-2 to ±90°, while the front direction of another sub microphone 102-3 is set perpendicular to that plane. In this way, the directional characteristics of the target sound extraction device X1 can be set to desired characteristics in three dimensions.
Accordingly, providing the target sound extraction device X1 with an operation unit such as a switch or dial for adjusting the position or directivity direction of the sub microphones 102 relative to the position or directivity direction of the main microphone 101 (moving them closer or farther apart) makes it easy to adjust the directivity of the target sound extraction device X1, which is highly convenient.
The target sound extraction devices X2 and X3 have the same directivity performance as that described above for the target sound extraction device X1.
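The relationship between microphone directions and the boundary of the directivity range can be checked with a small calculation. The sketch below covers only the planar case, under the stated condition that the main and sub microphones have symmetric, nearly identical directional characteristics; the function name is hypothetical.

```python
def directivity_boundaries(main_deg, sub_degs):
    """Midway direction between the main microphone and each sub microphone (degrees)."""
    return [(main_deg + sub) / 2.0 for sub in sub_degs]


# Third experimental condition: main at 0 degrees, subs at +90 and -90 degrees.
boundaries = directivity_boundaries(0.0, [90.0, -90.0])
```

With the sub microphones at ±90°, the boundaries evaluate to ±45°, in line with the result reported for FIG. 12. Moving the sub microphones closer to the main direction would narrow the range accordingly.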

An acoustic input device that includes a microphone array and a delay-and-sum filter is one known example of an acoustic input device that achieves sharp directional characteristics. However, to achieve directivity as sharp as that shown in FIG. 12 with such a conventional device, the number of microphones in the array must be increased and the microphones must be arranged over several meters, making the device too large for a person to carry easily.
By contrast, the target sound extraction devices X1 to X3 can achieve sharp directivity such as that shown in FIG. 12 with a small device (about the size of a typical handheld microphone) comprising about three to five microphones arranged at intervals of a few centimeters and a very small processor, such as a DSP or ASIC, that performs the signal processing.

Next, an acoustic input device V2, which is an example of a device that can be used in place of the acoustic input device V1 in the target sound extraction devices X1 to X3, will be described with reference to the block diagram shown in FIG. 13.
In the acoustic input device V1, the main microphone 101 for obtaining the main acoustic signal and the plurality of sub microphones 102 for obtaining the sub acoustic signals are determined in advance. The acoustic input device V2, by contrast, includes a plurality of microphones and switches, according to the situation, which of them function as the main microphone 101 and which as the sub microphones 102.
As shown in FIG. 13, the acoustic input device V2 includes three or more (four in FIG. 13) microphones 100-1 to 100-4, a main/sub acoustic signal specifying unit 41, and a signal switch 42.
The three or more microphones 100-1 to 100-4 differ from one another in placement position or in directivity direction. Depending on the situation, each of the microphones 100-1 to 100-4 functions either as the main microphone 101 or as a sub microphone 102.
For example, the microphones 100-1 to 100-4 all have the same directivity and, as shown in FIG. 13, are arranged at equal intervals on a predetermined circle (center PO), each facing radially outward (so that the central angles formed by connecting the microphone positions to the center PO of the circle are equal).

The main/sub acoustic signal specifying unit 41 executes processing that, based on the three or more input acoustic signals obtained through the three or more microphones 100-1 to 100-4, specifies one main acoustic signal and a plurality of sub acoustic signals from among those input acoustic signals (an example of the main/sub acoustic signal specifying means). The main/sub acoustic signal specifying unit 41 also outputs, to the signal switch 42, a control signal corresponding to the result of specifying the main and sub acoustic signals.
For example, the main/sub acoustic signal specifying unit 41 compares the signal strengths (sound pressures) of the three or more input acoustic signals, specifies the input acoustic signal with the highest signal strength as the main acoustic signal, and specifies all or some (two or more) of the other input acoustic signals as the sub acoustic signals. One conceivable way of specifying only some of the other input acoustic signals as sub acoustic signals is to specify the acoustic signals obtained through the two microphones that are adjacent, in placement position or directivity direction, on either side of the microphone that provides the main acoustic signal.
Alternatively, the main/sub acoustic signal specifying unit 41 may compare the proportion that predetermined frequency components occupy in each of the three or more input acoustic signals, specify the signal with the largest proportion as the main acoustic signal, and specify all or some (two or more) of the other input acoustic signals as the sub acoustic signals. This is effective when, for example, the frequency characteristics of the sound emitted by the target sound source are known to some extent.
The main/sub acoustic signal specifying unit 41 is implemented by, for example, a DSP (an example of a computer) together with a ROM storing the program executed by that DSP, or by an ASIC or the like. In the former case, the ROM stores in advance a program for causing the DSP to execute the processing (described later) performed by the main/sub acoustic signal specifying unit 41.
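The strength-based selection performed by the main/sub acoustic signal specifying unit 41 might be sketched as follows. The sketch assumes that mean-square power stands in for "signal strength" and that the microphones are indexed in order around the circle, so that neighbouring indices are physically adjacent. The function name and the `adjacent_only` flag are hypothetical.

```python
def select_main_and_subs(signals, adjacent_only=False):
    """Pick the strongest input as main; the rest (or its two neighbours) as subs.

    signals: list of sample lists, one per microphone, ordered around the circle.
    Returns (main_index, list_of_sub_indices).
    """
    powers = [sum(s * s for s in sig) / len(sig) for sig in signals]
    main = max(range(len(signals)), key=lambda i: powers[i])
    n = len(signals)
    if adjacent_only:
        # Assumption: indices wrap around because the microphones sit on a circle.
        subs = [(main - 1) % n, (main + 1) % n]
    else:
        subs = [i for i in range(n) if i != main]
    return main, subs


inputs = [[0.1, -0.1], [0.5, -0.5], [0.2, -0.2], [0.05, 0.05]]
main_idx, sub_idx = select_main_and_subs(inputs)
```

The frequency-proportion variant described in the text would replace the `powers` computation with the share of energy in the predetermined band; the selection logic would otherwise stay the same.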

The signal switch 42 is a device that switches the transmission paths of the acoustic signals from the three or more microphones 100-1 to 100-4 to the sound source separation processing units 10 in accordance with the control signal (a signal corresponding to the signal specification result) output from the main/sub acoustic signal specifying unit 41 (an example of the signal path switching means).
The signal switch 42 includes signal input terminals In1 to In4 connected to the respective microphones 100-1 to 100-4, one signal output terminal Ot1 for outputting the main acoustic signal, and a plurality of (three in FIG. 13) signal output terminals Ot2 to Ot4 for outputting the sub acoustic signals. In accordance with the control signal output from the main/sub acoustic signal specifying unit 41, the signal switch 42 selects, from a plurality of predetermined switching patterns, the signal paths connecting the signal input terminals In1 to In4 to the signal output terminals Ot1 to Ot4. As a result, the acoustic signal specified as the main acoustic signal by the main/sub acoustic signal specifying unit 41 is output from the output terminal Ot1, and the acoustic signals specified as the sub acoustic signals are output from the output terminals Ot2 to Ot4.
By including the acoustic input device V2 shown in FIG. 13, the target sound extraction devices X1 to X3 can also be applied to cases in which the position of the target sound source may change, so that a predetermined one of the plurality of microphones cannot be fixed as the main microphone 101.
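The role of the signal switch 42 can be modelled as a lookup into a table of predetermined switching patterns. The table below is hypothetical: the specification does not list the actual patterns, so a simple rotation is assumed in which the selected main input is routed to Ot1 and the remaining inputs follow in circular order.

```python
# Hypothetical switching patterns: key = index of the input chosen as main,
# value = input indices routed to (Ot1, Ot2, Ot3, Ot4) in that order.
SWITCH_PATTERNS = {
    0: (0, 1, 2, 3),
    1: (1, 2, 3, 0),
    2: (2, 3, 0, 1),
    3: (3, 0, 1, 2),
}


def switch_signals(inputs, main_index):
    """Route inputs In1..In4 to outputs Ot1..Ot4 according to the control signal."""
    pattern = SWITCH_PATTERNS[main_index]
    return [inputs[i] for i in pattern]


outputs = switch_signals(["in1", "in2", "in3", "in4"], main_index=2)
```

Here `main_index` plays the part of the control signal, and the first element of the result corresponds to the main acoustic signal on Ot1.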

Next, the sequence of the sound source separation processing in the case where the sound source separation processing units 10 perform sound source separation processing based on the FDICA method will be described with reference to the time charts shown in FIGS. 15 to 18. As noted above, sound source separation processing based on the FDICA method is an example of sound source separation processing by a blind source separation method based on independent component analysis. In the following description, the processing of the target sound separation signal synthesis processing unit 20 and the spectral subtraction processing unit 31 in the target sound extraction device X1, the processing of the spectrum approximation signal extraction processing unit 32 in the target sound extraction device X2, and the processing of the spectral subtraction processing unit 31′ in the target sound extraction device X3 are collectively referred to as post-processing.
In sound source separation processing based on the FDICA method, the acoustic signals input in time series through the plurality of microphones (the main microphone 101 and the sub microphones 102 in the target sound extraction devices X1 to X3; hereinafter, input acoustic signals) are converted into frequency-domain signals, and filter processing (a matrix operation) based on a separation matrix W(f) is then executed on them sequentially to generate separated signals (the reference sound separation signals and the target sound separation signals). Here, the input acoustic signals correspond to the mixed sound signals x1(t) and x2(t) in FIG. 14, and to the main acoustic signal and the sub acoustic signals in FIGS. 1, 3, and 5.
As noted above, the filter processing is performed for each frame signal of a predetermined time length (for example, a signal obtained by dividing the mixed sound signal at a period of several tens of ms to several hundreds of ms). This filter processing has a small computational load, and even when a practical processor executes it together with the post-processing, real-time processing can be achieved with a comfortable margin.
Also as noted above, sound source separation processing based on the FDICA method additionally involves a learning calculation (sequential calculation) that uses the input acoustic signals input in time series to obtain the separation matrix W(f) used in the filter processing. This learning calculation has a large computational load and is generally not suited to real-time processing.
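The filter processing amounts to one small matrix-vector product per frequency bin, which is why its computational load is low compared with the learning calculation. The sketch below applies a given separation matrix W(f) to a two-channel frequency-domain frame. The data layout and function name are assumptions, and the matrices are illustrative values, not learned ones.

```python
def apply_separation(W, X):
    """Compute y(f) = W(f) x(f) for every frequency bin f.

    W: list of matrices, W[f][row][col], one per frequency bin.
    X: list of input vectors, X[f] = [x1(f), x2(f)] (complex values).
    Returns a list of separated-signal vectors, one per bin.
    """
    Y = []
    for Wf, xf in zip(W, X):
        Y.append([
            sum(Wf[row][col] * xf[col] for col in range(len(xf)))
            for row in range(len(Wf))
        ])
    return Y


# Two frequency bins, two channels (illustrative values only).
W = [[[1, 0], [0, 1]], [[1, -1], [-1, 1]]]
X = [[1 + 1j, 2], [3, 1]]
Y = apply_separation(W, X)
```

Learning W(f) itself is the expensive part and is deliberately left out of this sketch.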

FIG. 15 is a time chart showing a first example of the processing sequence, excluding the learning calculation, in the target sound extraction devices X1 to X3. In the following, St1, St2, ... denote identification codes of processing procedures (steps).
As shown in FIG. 15, in the target sound extraction devices X1 to X3, the sound source separation processing unit 10 applies discrete Fourier transform (DFT) processing (St1) to the input acoustic signals for each frame signal {Frame(i-1), Frame(i), Frame(i+1), ...} of a predetermined time length, and temporarily stores the resulting frequency-domain frame signal in memory. In this first example, the sound source separation processing unit 10 executes the DFT processing (St1) at a period equal to the time length of the frame signal. Consequently, two consecutive frame signals do not overlap in time.
Further, the sound source separation processing unit 10 sequentially executes filter processing (St2: matrix operation) based on the separation matrix W(f) on each frequency-domain frame signal obtained by the DFT processing, thereby generating the separation signals.
Next, the other processing units (the target sound separation signal synthesis processing unit 20 and the spectral subtraction processing unit 31, or the spectrum approximate signal extraction processing unit 32, or the spectral subtraction processing unit 31') execute the post processing (St3) based on the separation signals obtained by the filter processing (St2). As a result, a frequency-domain target sound extraction signal is obtained for each frame signal of the input acoustic signals.
Further, the IDFT processing unit (not shown) executes inverse discrete Fourier transform (IDFT) processing (St4) to convert the frequency-domain target sound extraction signal into a time-domain signal, and the sound output processing unit sequentially outputs the time-domain target sound extraction signal (output acoustic signal) to the outside (St5).
The processing of steps St1 to St4 described above has a small computational load, and even when executed by a practical processor it can be completed within the time length of one frame signal with a comparative margin. Therefore, although the output acoustic signal lags the input acoustic signal by a slight delay time td (several tens of milliseconds to less than several hundreds of milliseconds), it is output in real time in response to the input of the input acoustic signal.
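
The frame-by-frame sequence St1 to St5 of this first example can be sketched roughly as follows. This is a minimal illustration, not the device's implementation: the naive dft and idft helpers, the process_stream function, and the pass-through filt and post callables are all assumptions introduced here, standing in for the filter processing based on W(f) and for the post processing.

```python
import cmath

def dft(x):
    """Naive discrete Fourier transform of a real or complex sequence (St1)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(X):
    """Naive inverse DFT returning the real part of each sample (St4)."""
    n = len(X)
    return [(sum(X[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n))).real / n
            for t in range(n)]

def process_stream(main, sub, frame_len, filt, post):
    """Process non-overlapping frames: DFT, filter, post processing, IDFT, output."""
    out = []
    for start in range(0, len(main) - frame_len + 1, frame_len):
        X1 = dft(main[start:start + frame_len])   # St1 on the main channel
        X2 = dft(sub[start:start + frame_len])    # St1 on a sub channel
        Y = filt(X1, X2)                          # St2: filter processing
        Z = post(Y)                               # St3: post processing
        out.extend(idft(Z))                       # St4, then St5 (output)
    return out

# Pass-through stand-ins (hypothetical): no separation, keep the first spectrum.
main = [float(i % 5) for i in range(16)]
sub = [0.0] * 16
out = process_stream(main, sub, 8,
                     filt=lambda X1, X2: (X1, X2),
                     post=lambda Y: Y[0])
```

Because a frame can only be transformed once all of its samples have arrived, the output necessarily trails the input by at least one frame length plus the processing time, which corresponds to the delay time td described above.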

FIG. 16 is a time chart showing a second example of the processing sequence, excluding the learning calculation, in the target sound extraction devices X1 to X3.
In the example shown in FIG. 16 as well, the sound source separation processing unit 10 applies discrete Fourier transform (DFT) processing (St1) to the input acoustic signals for each frame signal {Frame(i-1), Frame(i), Frame(i+1), ...}, and temporarily stores the resulting frequency-domain frame signal in memory. In this second example, however, the sound source separation processing unit 10 executes the DFT processing (St1) at a period shorter than the time length of the frame signal. Consequently, two consecutive frame signals partially overlap in time.
Further, the sound source separation processing unit 10 sequentially executes filter processing (St2: matrix operation) based on the separation matrix W(f) on each frequency-domain frame signal obtained by the DFT processing, thereby generating the separation signals. The separation signals generated for two consecutive frames therefore also partially overlap in time (the time zones within the wavy-line circles in FIG. 16). For this reason, the sound source separation processing unit 10 generates the separation signal to be output by applying synthesis processing (weighted averaging or the like) to the overlapping portions of the separation signals of two consecutive frames.
Next, as in the first example (FIG. 15), the other processing units execute the post processing (St3) based on the separation signals obtained by the filter processing (St2).
Further, as in the first example (FIG. 15), the IDFT processing unit (not shown) executes inverse discrete Fourier transform (IDFT) processing (St4) to convert the frequency-domain target sound extraction signal into a time-domain signal, and the sound output processing unit sequentially outputs the time-domain target sound extraction signal (output acoustic signal) to the outside (St5).
In the processing of the second example as well, although the output acoustic signal lags the input acoustic signal by a slight delay time td (several tens of milliseconds to less than several hundreds of milliseconds), it is output in real time in response to the input of the input acoustic signal.
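
One possible form of the synthesis processing applied to the overlapping portions is a linear cross-fade, sketched below. The specification says only "weighted averaging or the like", so the choice of linear weights, the function name merge_overlapping, and the assumption that at most two frames overlap at any instant are illustrative assumptions.

```python
def merge_overlapping(frames, hop):
    """Merge equal-length frames whose start times are `hop` samples apart.

    Samples covered by two consecutive frames are combined by a weighted
    average whose weight shifts linearly from the older to the newer frame.
    Assumes hop < frame length <= 2 * hop, so at most two frames overlap.
    """
    frame_len = len(frames[0])
    overlap = frame_len - hop
    out = list(frames[0])
    for nxt in frames[1:]:
        tail_start = len(out) - overlap
        for j in range(overlap):
            w = (j + 1) / (overlap + 1)   # weight of the newer frame
            out[tail_start + j] = (1 - w) * out[tail_start + j] + w * nxt[j]
        out.extend(nxt[overlap:])         # non-overlapping remainder
    return out

# Frames of length 4 taken every 3 samples from the same signal:
# averaging identical overlapping samples must reproduce the signal.
x = [float(i) for i in range(10)]
frames = [x[s:s + 4] for s in range(0, 7, 3)]
merged = merge_overlapping(frames, 3)
```

In the device the overlapping samples of consecutive frames generally differ slightly, and the weighted average smooths the transition between them.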

On the other hand, the learning calculation in the sound source separation processing based on the FDICA method is processing that, each time a plurality of consecutive frame signals has been input, calculates a new separation matrix W(f) by sequential calculation using those frame signals. It is executed in parallel with the processing steps (St1 to St5) shown in FIG. 15. The newly calculated separation matrix W(f) is used for the filter processing executed thereafter.
Hereinafter, the set of a predetermined number (two or more) of consecutive frame signals used each time a new separation matrix W(f) is calculated in the learning calculation is referred to as a metaframe signal. The metaframe signal is a signal (corresponding to the section signal) obtained by dividing the time-series input acoustic signal at a predetermined period; what is directly used for the learning calculation is the metaframe signal converted into a frequency-domain signal (that is, subjected to discrete Fourier transform processing). Whereas the time length of the frame signal (the period of signal division) is several tens of milliseconds to several hundreds of milliseconds, the time length of the metaframe signal (the period of signal division), although it depends on the capability of the processor that executes the processing, is the time permitted as the adaptation time to changes in the acoustic environment (for example, about several seconds).
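
The relationship between frame signals and metaframe signals can be illustrated by the following sketch. The 64 ms frame length and 3200 ms metaframe length are assumed example values chosen within the ranges given above, and group_into_metaframes is a hypothetical helper.

```python
def group_into_metaframes(frames, frames_per_metaframe):
    """Split a sequence of frames into consecutive, complete metaframes.

    Trailing frames that do not yet fill a metaframe are left out, since
    a learning calculation starts only once a whole metaframe has arrived.
    """
    last_start = len(frames) - frames_per_metaframe
    return [frames[i:i + frames_per_metaframe]
            for i in range(0, last_start + 1, frames_per_metaframe)]

frame_ms = 64        # assumed frame length (tens to hundreds of milliseconds)
metaframe_ms = 3200  # assumed metaframe length (about several seconds)
frames_per_metaframe = metaframe_ms // frame_ms

frames = list(range(120))   # frame indices standing in for frame signals
metaframes = group_into_metaframes(frames, frames_per_metaframe)
```

With these values each metaframe holds 50 frames, so the separation matrix is refreshed once for every 50 executions of the filter processing.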

FIG. 17 is a time chart of a first embodiment of the learning calculation executed by the sound source separation processing unit 10 that performs sound source separation processing based on the FDICA method.
The example (first embodiment) of the learning calculation (sequential calculation) shown in FIG. 17 is a case in which, for each metaframe signal {Mframe(1), Mframe(2), Mframe(3), ...}, the whole of that metaframe signal is used to obtain the separation matrix W(f) used in the subsequent filter processing. In this case, however, the number of iterations of the sequential calculation in the learning calculation is limited to a predetermined upper limit or less (the sequential calculation is terminated when the upper limit is reached).
In the learning calculation of the first embodiment shown in FIG. 17, the separation matrix W(f) is calculated (learned) using the whole of the metaframe signal Mframe(i), which corresponds to the input acoustic signal input during the period from time Ti to Ti+1 (period: Ti+1 - Ti). The separation matrix W(f) used in the subsequent filter processing is then updated to the new separation matrix W(f) obtained by the learning calculation. Here, it is preferable to use the separation matrix W(f) calculated (learned) from a given metaframe signal Mframe(i) as the initial value (initial separation matrix) when calculating the separation matrix W(f) from the next metaframe signal Mframe(i+1) (that is, to carry over the initial matrix), because this speeds up the convergence of the sequential calculation (learning).
If the computationally heavy learning calculation were executed without any particular restriction, the learning calculation time ts for each metaframe signal could exceed the time length (Ti+1 - Ti) of the metaframe signal, making prompt adaptation to changes in the acoustic environment difficult.
Therefore, by limiting the number of iterations of the sequential calculation in the learning calculation to the upper limit so that the learning calculation time ts for each metaframe signal is always shorter than the time length (Ti+1 - Ti) of the metaframe signal, prompt adaptation to changes in the acoustic environment becomes possible.
Moreover, even if this limit on the number of iterations (a simplification of the learning calculation) leaves some noise in the separation signals obtained by the sound source separation processing, combining the sound source separation processing with the post processing (spectral subtraction processing or spectrum approximate signal extraction processing) ensures sufficient target sound extraction performance as a whole.
In the first filter processing at the start of processing by the target sound extraction devices X1 to X3 (when the device is powered on), the separation matrix may be, for example, an initial matrix prepared in advance or a separation matrix stored in memory at the end of the previous processing (when the device was powered off).
The upper limit is determined in advance by experiment or calculation according to the capability of the processor (DSP, ASIC, or the like) that executes this processing.
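
The iteration limit and the carrying over of the initial matrix can be sketched as follows. This is a schematic illustration only: toy_step is a scalar stand-in for one iteration of the FDICA learning calculation (which is not reproduced here), and the 3000 ms metaframe, the 300 ms cost per iteration, and the 80 percent safety margin are assumed values.

```python
def max_iterations(metaframe_ms, ms_per_iteration, margin_percent=80):
    """Largest iteration count whose total time stays within the given
    fraction of the metaframe length, so that ts < (Ti+1 - Ti)."""
    return (metaframe_ms * margin_percent) // (100 * ms_per_iteration)

def learn(W_init, step, max_iters, tol=1e-3):
    """Iterate `step` from W_init until the update is below tol or the
    upper limit is reached. Returns (result, iterations used)."""
    W = W_init
    for n in range(1, max_iters + 1):
        W_new = step(W)
        if abs(W_new - W) < tol:
            return W_new, n
        W = W_new
    return W, max_iters

def toy_step(W, target=1.0):
    """Stand-in update that moves halfway toward a fixed target."""
    return W + 0.5 * (target - W)

cap = max_iterations(3000, 300)      # upper limit from the time budget
W1, n1 = learn(0.0, toy_step, cap)   # first metaframe: start from scratch
W2, n2 = learn(W1, toy_step, cap)    # next metaframe: carry over W1
```

In this toy setting the first run stops at the upper limit without fully converging, while the second run, started from the carried-over result, converges in far fewer iterations, which mirrors the reasoning given above for carrying over the initial matrix.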

FIG. 18 is a time chart of a second embodiment of the learning calculation executed by the sound source separation processing unit 10 that performs sound source separation processing based on the FDICA method.
The example (second embodiment) of the learning calculation (sequential calculation) shown in FIG. 18 is a case in which, for each metaframe signal {Mframe(1), Mframe(2), Mframe(3), ...}, only the signal in a partial time zone on the head side of that metaframe signal is used to obtain the separation matrix W(f) used in the subsequent filter processing.
In the learning calculation of the second embodiment shown in FIG. 18, the separation matrix W(f) is calculated (learned) using a head-side portion of the metaframe signal Mframe(i), which corresponds to the input acoustic signal input during the period from time Ti to Ti+1 (period: Ti+1 - Ti). The separation matrix W(f) used in the subsequent filter processing is then updated to the new separation matrix W(f) obtained by the learning calculation. Here too, it is preferable to use the separation matrix W(f) calculated (learned) from a portion of a given metaframe signal Mframe(i) as the initial value (initial separation matrix) when calculating the separation matrix W(f) from a portion of the next metaframe signal Mframe(i+1) (that is, to carry over the initial matrix), because this speeds up the convergence of the sequential calculation (learning).
In this second embodiment, part of each metaframe signal is thinned out before it is used for the learning calculation, so that the learning calculation time ts for each metaframe signal is always shorter than the time length (Ti+1 - Ti) of the metaframe signal. This enables prompt adaptation to changes in the acoustic environment.
Moreover, even if this thinning of the signal used for the learning calculation (a simplification of the learning calculation) leaves some noise in the separation signals obtained by the sound source separation processing, combining the sound source separation processing with the post processing (spectral subtraction processing or spectrum approximate signal extraction processing) ensures sufficient target sound extraction performance as a whole.
The time length (number of digital signal samples) of the portion of the metaframe signal used for the learning calculation is determined in advance by experiment or calculation according to the capability of the processor (DSP, ASIC, or the like) that executes this processing.
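
How the head-side portion might be sized can be sketched as below. The sketch assumes, purely for illustration, that the learning time grows in proportion to the number of frames used; the function names, the 100 ms learning cost per frame, and the 80 percent margin are hypothetical values, whereas the specification leaves the actual length to experiment or calculation.

```python
def frames_for_learning(metaframe_ms, frames_per_metaframe,
                        learn_ms_per_frame, margin_percent=80):
    """Number of head-side frames that can be learned from while keeping
    the learning time ts below the metaframe length (Ti+1 - Ti)."""
    budget_ms = metaframe_ms * margin_percent // 100
    return min(frames_per_metaframe, budget_ms // learn_ms_per_frame)

def head_portion(metaframe, n_frames):
    """Keep only the first n_frames frames of a metaframe."""
    return metaframe[:n_frames]

metaframe = list(range(50))                   # 50 frame indices
n_used = frames_for_learning(3000, 50, 100)   # frames the budget allows
head = head_portion(metaframe, n_used)        # signal used for learning
```

Under these assumed numbers roughly half of each metaframe is used, and the remaining frames are still filtered with the current separation matrix but do not contribute to its update.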

The present invention is applicable to a target sound extraction device that extracts and outputs an acoustic signal corresponding to a target sound from acoustic signals containing a target sound component and noise components.

FIG. 1: Block diagram showing the schematic configuration of the target sound extraction device X1 according to an embodiment of the first invention.
FIG. 2: Conceptual diagram showing the course of the target sound extraction processing in the target sound extraction device X1.
FIG. 3: Block diagram showing the schematic configuration of the target sound extraction device X2 according to an embodiment of the second invention.
FIG. 4: Conceptual diagram showing the course of the target sound extraction processing in the target sound extraction device X2.
FIG. 5: Block diagram showing the schematic configuration of the target sound extraction device X3 according to an embodiment of the third invention.
FIG. 6: Conceptual diagram showing the course of the target sound extraction processing in the target sound extraction device X3.
FIG. 7: Diagram showing a first experimental condition for evaluating the target sound extraction performance of the target sound extraction devices X1 to X3.
FIG. 8: Diagram showing a second experimental condition for evaluating the target sound extraction performance of the target sound extraction devices X1 to X3.
FIG. 9: Diagram showing the target sound extraction performance of the target sound extraction devices X1 to X3 and of conventional target sound extraction processing under the first experimental condition.
FIG. 10: Diagram showing the target sound extraction performance of the target sound extraction devices X1 to X3 and of conventional target sound extraction processing under the second experimental condition.
FIG. 11: Diagram showing a third experimental condition for evaluating the directivity of the target sound extraction device X1.
FIG. 12: Diagram showing the directivity of the target sound extraction device X1 under the third experimental condition.
FIG. 13: Block diagram showing the schematic configuration of an acoustic input device V2 that can be employed in the target sound extraction devices X1 to X3.
FIG. 14: Block diagram showing the schematic configuration of a sound source separation device Z that performs BSS sound source separation processing based on the FDICA method.
FIG. 15: Time chart showing a first example of the processing sequence, excluding the learning calculation, in the sound source separation processing of the target sound extraction devices X1 to X3.
FIG. 16: Time chart showing a second example of the processing sequence, excluding the learning calculation, in the sound source separation processing of the target sound extraction devices X1 to X3.
FIG. 17: Time chart showing the sequence of the learning calculation according to the first embodiment in the sound source separation processing of the target sound extraction devices X1 to X3.
FIG. 18: Time chart showing the sequence of the learning calculation according to the second embodiment in the sound source separation processing of the target sound extraction devices X1 to X3.

Explanation of symbols

X1: Target sound extraction device according to an embodiment of the first invention
X2: Target sound extraction device according to an embodiment of the second invention
X3: Target sound extraction device according to an embodiment of the third invention
V1, V2: Acoustic input device
10 (10-1 to 10-3): Sound source separation processing unit
20: Target sound separation signal synthesis processing unit
31, 31': Spectral subtraction processing unit
32: Spectrum approximate signal extraction processing unit
41: Main/sub acoustic signal specifying unit
42: Signal switch
101: Main microphone
102: Sub microphone

Claims

A target sound extraction device that extracts an acoustic signal corresponding to a target sound output from a predetermined target sound source and outputs the extracted signal, based on one main acoustic signal obtained through one main microphone that mainly receives the target sound and on a plurality of sub acoustic signals obtained through a plurality of sub microphones each having directivity in one of a plurality of directions different from that of the main microphone, the device comprising:
sound source separation means provided individually for each combination of two acoustic signals consisting of the main acoustic signal and one of the plurality of sub acoustic signals, each of which separates and generates, based on the two acoustic signals, a target sound separation signal corresponding to the target sound and a reference sound separation signal corresponding to a reference sound other than the target sound, by sound source separation processing using a blind source separation method based on independent component analysis;
target sound separation signal synthesis means for synthesizing the plurality of target sound separation signals separated and generated by the sound source separation means; and
spectral subtraction processing means for extracting an acoustic signal corresponding to the target sound from the synthesized signal obtained by the target sound separation signal synthesis means, by performing spectral subtraction processing between that synthesized signal and the plurality of reference sound separation signals separated and generated by the sound source separation means, and for outputting the extracted signal.

The target sound extraction device according to claim 1, wherein, in the sound source separation processing executed by the sound source separation means, filter processing based on a predetermined separation matrix is sequentially executed on the acoustic signals input in time series through the microphones to generate separation signals, and, for each section signal obtained by dividing the acoustic signals input in time series at a predetermined period, sequential calculation for obtaining the separation matrix used in the subsequent filter processing is executed using the whole of that section signal, the number of iterations of the sequential calculation being limited to a predetermined number.

The target sound extraction device according to claim 1, wherein, in the sound source separation processing executed by the sound source separation means, filter processing based on a predetermined separation matrix is sequentially executed on the acoustic signals input in time series through the microphones to generate separation signals, and, for each section signal obtained by dividing the acoustic signals input in time series at a predetermined period, sequential calculation for obtaining the separation matrix used in the subsequent filter processing is executed using the signal in a partial time zone on the head side of that section signal.

The target sound extraction device according to any one of claims 1 to 3, further comprising:
main/sub acoustic signal specifying means for specifying, based on three or more input acoustic signals obtained through three or more microphones whose directivity directions differ from one another, the one main acoustic signal and the plurality of sub acoustic signals from among the three or more input acoustic signals; and
signal path switching means for switching the transmission paths of the acoustic signals from the three or more microphones to the sound source separation means according to the result of the specification by the main/sub acoustic signal specifying means.

The target sound extraction device according to claim 4, wherein the main/sub acoustic signal specifying means specifies the main acoustic signal and the plurality of sub acoustic signals based on a comparison of the signal intensities of the three or more input acoustic signals, or based on a comparison of the proportion of a predetermined frequency component in each of the three or more input acoustic signals.

A target sound extraction program for causing a computer to execute processing that extracts an acoustic signal corresponding to a target sound output from a predetermined target sound source and outputs the extracted signal, based on one main acoustic signal obtained through one main microphone that mainly receives the target sound and on a plurality of sub acoustic signals obtained through a plurality of sub microphones each having directivity in one of a plurality of directions different from that of the main microphone, the program causing the computer to execute:
sound source separation processing that, individually for each combination of two acoustic signals consisting of the main acoustic signal and one of the plurality of sub acoustic signals, separates and generates, based on the two acoustic signals, a target sound separation signal corresponding to the target sound and a reference sound separation signal corresponding to a reference sound other than the target sound, by a blind source separation method based on independent component analysis;
target sound separation signal synthesis processing that synthesizes the plurality of target sound separation signals separated and generated by the sound source separation processing; and
processing that extracts an acoustic signal corresponding to the target sound from the synthesized signal obtained by the target sound separation signal synthesis processing, by performing spectral subtraction processing between that synthesized signal and the plurality of reference sound separation signals separated and generated by the sound source separation processing, and outputs the extracted signal.

A target sound extraction method in which a computer executes processing that extracts an acoustic signal corresponding to a target sound output from a predetermined target sound source and outputs the extracted signal, based on one main acoustic signal obtained through one main microphone that mainly receives the target sound and on a plurality of sub acoustic signals obtained through a plurality of sub microphones each having directivity in one of a plurality of directions different from that of the main microphone, the method comprising executing, by the computer:
sound source separation processing that, individually for each combination of two acoustic signals consisting of the main acoustic signal and one of the plurality of sub acoustic signals, separates and generates, based on the two acoustic signals, a target sound separation signal corresponding to the target sound and a reference sound separation signal corresponding to a reference sound other than the target sound, by a blind source separation method based on independent component analysis;
target sound separation signal synthesis processing that synthesizes the plurality of target sound separation signals separated and generated by the sound source separation processing; and
processing that extracts an acoustic signal corresponding to the target sound from the synthesized signal obtained by the target sound separation signal synthesis processing, by performing spectral subtraction processing between that synthesized signal and the plurality of reference sound separation signals separated and generated by the sound source separation processing, and outputs the extracted signal.