JP2007295085A

JP2007295085A - Sound source separation apparatus, and sound source separation method

Info

Publication number: JP2007295085A
Application number: JP2006117994A
Authority: JP
Inventors: Yohei Ikeda; 陽平池田; Takayuki Hiekata; 孝之稗方; Koji Morita; 孝司森田; Hiroshi Hashimoto; 裕志橋本
Original assignee: Kobe Steel Ltd
Current assignee: Kobe Steel Ltd
Priority date: 2006-04-21
Filing date: 2006-04-21
Publication date: 2007-11-08

Abstract

PROBLEM TO BE SOLVED: To maintain high sound source separation performance even when the number of sound sources in an acoustic space increases or decreases, when performing sound source separation processing of the blind source separation (BSS) method based on independent component analysis (ICA).

SOLUTION: In an acoustic space where 0 to n sound sources can exist, n directional microphones 111 to 11n are arranged so that each points in a different direction. A power detection/signal selection section 25 detects the power Pi of each input sound signal xi and, on the basis of the power Pi, selects from the input sound signals xi one or more adopted input sound signals xj (channels) corresponding to the one or more sound sources present in the acoustic space. When a plurality of adopted input sound signals xj are selected, an ICA section 20 applies BSS sound source separation processing based on ICA to them, producing the same number of separated signals yi as there are adopted input sound signals xj.

COPYRIGHT: (C)2008, JPO&INPIT

Description

The present invention relates to a sound source separation device and a sound source separation method that, with a plurality of microphones present in a predetermined acoustic space, generate a plurality of separated signals by applying sound source separation processing of the blind source separation method based on independent component analysis to the plurality of input audio signals (signals in which the sound source signals from the individual sound sources are superimposed) input through the respective microphones.

When a plurality of sound sources and a plurality of microphones exist in a predetermined acoustic space, each microphone acquires an audio signal (hereinafter, an input audio signal) in which the audio signals from the individual sound sources (hereinafter, sound source signals) are superimposed. A sound source separation method that identifies (separates) each sound source signal based only on the plurality of input audio signals acquired in this way is called blind source separation (hereinafter, the BSS method).
One type of BSS sound source separation processing is based on independent component analysis (hereinafter, the ICA method). The BSS method based on the ICA method exploits the fact that the sound source signals contained in the plurality of input audio signals (time-series audio signals) input through the plurality of microphones are statistically independent of one another. It optimizes a predetermined separation matrix (inverse mixing matrix) on that basis and identifies the sound source signals (sound source separation) by applying filter processing (a matrix operation) with the optimized separation matrix to the input audio signals.
In this specification, the Japanese terms 演算, 計算 and 算出 (rendered here as "operation", "calculation" and "computation") are used synonymously.

At the start of a learning calculation, a separation matrix set to predetermined initial values (hereinafter, the initial matrix) is given. The initial matrix is updated by the learning calculation and then set as the separation matrix used for sound source separation (the separation filter processing). Normally, a predetermined matrix is set as the initial matrix at the start of the first learning calculation, and thereafter, each time a learning calculation is performed, the learned separation matrix is set as the initial matrix for the next learning calculation. Such BSS sound source separation processing based on the ICA method (hereinafter, ICA-BSS sound source separation processing) is described in detail in, for example, Non-Patent Document 1 and Non-Patent Document 2.
The learning calculation of the separation matrix in ICA-BSS sound source separation processing has a high computational load and cannot be performed in real time on current practical processors. Therefore, to perform ICA-BSS sound source separation processing in real time, the matrix operation using the separation matrix (the separation filter processing) is applied successively to the input audio signals as they arrive, yielding separated signals as real-time output, while the learning calculation runs in parallel. Each time the learning calculation produces a new separation matrix, the separation matrix used for real-time separation is replaced with the new one.
Patent Document 1 describes a technique that, when the sound sources are speakers, determines whether each speaker is speaking and switches the learning of the separation matrix and the separation processing on and off according to the result.
Patent Document 1: JP 2005-227512 A
Non-Patent Document 1: Hiroshi Saruwatari, "Basics of Blind Sound Source Separation Using Array Signal Processing," IEICE Technical Report, vol. EA2001-7, pp. 49-56, April 2001.
Non-Patent Document 2: Tomoya Takatani et al., "High-fidelity blind source separation using ICA based on SIMO model," IEICE Technical Report, vol. US2002-87, EA2002-108, January 2003.

In ICA-BSS sound source separation processing, the size of the separation matrix is determined by the number of input audio signals to be processed, and the same number of separated signals is generated. In conventional ICA-BSS sound source separation processing, the number of input audio signals to be processed equals the number of microphones arranged in the acoustic space.
However, in ICA-BSS sound source separation processing, sound source separation performance deteriorates when the number of input audio signals to be processed (generally, the number of microphones) is larger or smaller than the number of sound sources present in the acoustic space.
Specifically, when the number of input audio signals to be processed (the number of microphones) exceeds the number of sound sources, a single sound source signal ends up being split across several separated signals, so separation performance deteriorates. When the number of input audio signals to be processed is smaller than the number of sound sources, fewer separated signals than sound sources are generated, so separation performance again deteriorates.
Consequently, when the number of sound sources in the acoustic space is not known in advance, a sound source separation device that performs conventional ICA-BSS sound source separation processing ends up with too many or too few input audio signals (microphones) relative to the number of sound sources, and its separation performance deteriorates.
The present invention has been made in view of the above circumstances. Its object is to provide a sound source separation device and a sound source separation method that can maintain high sound source separation performance even when the number of sound sources present in the acoustic space increases or decreases, when performing sound source separation processing by the BSS method based on the ICA method.

To achieve the above object, the present invention is configured as a sound source separation device or a sound source separation method that performs sound source separation processing based on a plurality of input audio signals input through a plurality of directional microphones, each arranged to point in a different direction in a predetermined acoustic space. It has the components (means or procedures) shown in (1) to (3) below.
(1) Signal strength detection means for detecting the signal strength of each of the plurality of input audio signals input through the plurality of directional microphones, or a signal strength detection procedure in which that signal strength detection means detects the signal strength.
(2) Signal selection means for selecting, from the plurality of input audio signals and based on the detection result of the signal strength detection means (or procedure), one or more adopted input audio signals corresponding to the one or more sound sources present in the acoustic space, or a signal selection procedure that carries out the selection by the signal selection means.
(3) ICA-BSS sound source separation means for generating, when a plurality of adopted input audio signals are selected by the signal selection means, the same number of separated signals as adopted input audio signals by applying sound source separation processing of the blind source separation method based on independent component analysis to the adopted input audio signals, or an ICA-BSS sound source separation procedure in which a predetermined processor executes that separated-signal generation processing.
As the signal selection means or signal selection procedure, one possibility is to select as adopted input audio signals those input audio signals whose signal strength, as detected by the signal strength detection means (or procedure), exceeds a first set strength.
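
The threshold-based selection described above can be sketched as follows. This is only an illustration under stated assumptions: the function names, the use of mean-square power as the "signal strength", and the threshold value are hypothetical and are not taken from the patent.

```python
from typing import List, Sequence


def channel_power(frame: Sequence[float]) -> float:
    """Mean-square power of one channel's samples (one assumed definition of Pi)."""
    if not frame:
        return 0.0
    return sum(s * s for s in frame) / len(frame)


def select_channels(frames: Sequence[Sequence[float]],
                    first_threshold: float) -> List[int]:
    """Return indices of channels whose power exceeds the first set strength."""
    powers = [channel_power(f) for f in frames]
    return [i for i, p in enumerate(powers) if p > first_threshold]


# Three directional-microphone channels (hypothetical sample values).
frames = [
    [0.5, -0.5, 0.5, -0.5],    # strong: a source in this microphone's direction
    [0.01, -0.01, 0.01, 0.0],  # weak: only leakage from other directions
    [0.3, 0.3, -0.3, -0.3],    # strong: a second source
]
selected = select_channels(frames, first_threshold=0.05)

# ICA-BSS separation is applied only when two or more channels are adopted.
run_ica = len(selected) >= 2
```

With these values, channels 0 and 2 are adopted, so a 2-input separation would be run instead of a 3-input one.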

Adopting a sound source separation device or sound source separation method having the above components provides the following operation and effects.
If a sound source exists in the pointing direction (main sound collection range) of a given directional microphone, the strength (power) of the input audio signal obtained through that microphone becomes particularly high. The source also affects the strength of the input audio signals obtained through the other directional microphones to some extent, but that effect is comparatively small.
Therefore, if the signal selection means (or procedure) selects, from all the input audio signals, only those whose signal strength is at or above a certain level as the targets of sound source separation processing (the adopted input audio signals), the number of adopted input audio signals selected is neither too large nor too small relative to the number of sound sources.
Accordingly, if enough directional microphones for obtaining the input audio signals are provided to cover the varying number of sound sources, high sound source separation performance can be maintained even when the number of sound sources present in the acoustic space increases or decreases.

As another possibility, the signal selection means (or procedure) may select as adopted input audio signals at most two input audio signals, taken in descending order of the signal strength detected by the signal strength detection means (or procedure).
This reduces the computational load of sound source separation processing. A sound source separation device or method with this configuration is effective, for example, when the sound source signal of a sound source (target source) located in the pointing direction (main sound collection range) of one particular directional microphone is to be separated from the sound source signals of the other sources (noise sources), that is, when the sound source signals of the individual noise sources do not need to be separated from one another.
As a further possibility, the signal selection means (or procedure) may exclude from the adopted input audio signals any currently adopted input audio signal whose signal strength, as detected by the signal strength detection means (or procedure), has remained at or below a second set strength for a predetermined time.
This prevents the number of inputs to the ICA-BSS sound source separation means (or procedure), that is, the number of adopted input audio signals, from changing needlessly often in response to temporary changes in the volume of a sound source.
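
The two variations above (limiting the selection to the two strongest channels, and dropping a channel only after it has stayed weak for a set time) might look like the sketch below. The class and function names, the frame-count representation of the "predetermined time", and all numeric values are assumptions made for illustration.

```python
from typing import Dict, List, Sequence


def strongest(powers: Sequence[float], max_count: int = 2) -> List[int]:
    """Indices of at most max_count channels, strongest first."""
    order = sorted(range(len(powers)), key=lambda i: powers[i], reverse=True)
    return order[:max_count]


class HoldOffSelector:
    """Adopt a channel above the first threshold; drop it only after it has
    stayed at or below the second threshold for hold_frames consecutive updates."""

    def __init__(self, first_threshold: float, second_threshold: float,
                 hold_frames: int) -> None:
        self.first = first_threshold
        self.second = second_threshold
        self.hold = hold_frames
        # adopted channel -> consecutive updates spent at or below second threshold
        self.low_count: Dict[int, int] = {}

    def update(self, powers: Sequence[float]) -> List[int]:
        for i, p in enumerate(powers):
            if i in self.low_count:
                if p <= self.second:
                    self.low_count[i] += 1
                    if self.low_count[i] >= self.hold:
                        del self.low_count[i]   # weak for long enough: exclude
                else:
                    self.low_count[i] = 0       # recovered: reset the timer
            elif p > self.first:
                self.low_count[i] = 0           # newly adopted
        return sorted(self.low_count)


selector = HoldOffSelector(first_threshold=0.05, second_threshold=0.02, hold_frames=3)
history = [selector.update(p) for p in ([0.1], [0.01], [0.1], [0.01], [0.01], [0.01])]
top_two = strongest([0.1, 0.5, 0.3])
```

In this run the single channel survives a one-frame dip and is excluded only after three consecutive weak frames, which is the behaviour the hold time is meant to provide.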

When a sound source moves from the sound collection range of one directional microphone into that of an adjacent one, the strength of the input audio signal of one of the two microphones with adjacent pointing directions (sound collection ranges) changes from strong to weak, while that of the other changes from weak to strong.
Accordingly, the signal selection means (or procedure) may also operate as follows. Consider the input audio signals input through two directional microphones with adjacent pointing directions, referred to as the first input audio signal and the second input audio signal. While the second input audio signal is selected as an adopted input audio signal, if the signal strength of the second input audio signal is at or below the second set strength at the moment the signal strength of the first input audio signal exceeds the first set strength, the second input audio signal is excluded from the adopted input audio signals.
The first set strength and the second set strength described above may be set to the same value, or the second set strength may be set lower than the first set strength.
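
A minimal sketch of this hand-over rule between adjacent microphones is given below. The ring-shaped neighbour table, the function name `exclude_on_handover`, and the numbers are hypothetical; the patent text does not specify a data layout.

```python
from typing import Dict, Iterable, Sequence, Set


def exclude_on_handover(adopted: Iterable[int],
                        powers: Sequence[float],
                        neighbours: Dict[int, Sequence[int]],
                        first_threshold: float,
                        second_threshold: float) -> Set[int]:
    """Drop an adopted channel immediately when an adjacent channel has just
    exceeded the first threshold while the adopted one is at or below the second."""
    adopted = set(adopted)
    result = set(adopted)
    for second in adopted:
        for first in neighbours.get(second, ()):
            if powers[first] > first_threshold and powers[second] <= second_threshold:
                result.discard(second)
    return result


# Four microphones assumed to be arranged in a ring, so each has two neighbours.
n = 4
neighbours = {i: ((i - 1) % n, (i + 1) % n) for i in range(n)}

# A source has moved from microphone 1's range into microphone 2's range.
after_move = exclude_on_handover({1}, [0.0, 0.01, 0.2, 0.0], neighbours, 0.05, 0.02)

# The neighbour is not yet strong enough, so channel 1 stays adopted.
no_move = exclude_on_handover({1}, [0.0, 0.01, 0.03, 0.0], neighbours, 0.05, 0.02)
```

Channel 2 itself would then be adopted by the ordinary first-threshold rule; this function only handles the early exclusion of the channel the source has left.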

According to the present invention, if enough directional microphones for obtaining the input audio signals are provided to cover the varying number of sound sources, then even when the number of sound sources present in the acoustic space increases or decreases, a number of input audio signals (the adopted input audio signals) that is neither too large nor too small relative to the number of sound sources is selected, so high sound source separation performance can be maintained.

Embodiments of the present invention are described below with reference to the accompanying drawings to aid understanding of the invention. The following embodiment is one concrete example of the present invention and does not limit its technical scope.
FIG. 1 is a block diagram showing a schematic configuration of a sound source separation device X according to an embodiment of the present invention. FIG. 2 is a plan view showing an example of the arrangement of the directional microphones of the sound source separation device X. FIG. 3 is a flowchart showing the procedure of sound source separation processing in the sound source separation device X. FIG. 4 is a schematic perspective view of a mobile phone V1, one example of a device to which the sound source separation device X can be applied. FIG. 5 is a schematic perspective view of a robot V2, another such example. FIG. 6 is a block diagram showing a schematic configuration of a sound source separation unit Z1 that performs BSS sound source separation processing based on the TDICA method. FIG. 7 is a block diagram showing a schematic configuration of a sound source separation unit Z2 that performs BSS sound source separation processing based on the FDICA method.

Before describing the embodiment of the present invention, examples of ICA-BSS sound source separation units that can be used as components of the present invention are described with reference to the block diagrams shown in FIGS. 6 and 7.
FIG. 6 is a block diagram showing a schematic configuration of a conventional sound source separation unit Z1 that performs BSS sound source separation processing based on time-domain independent component analysis (hereinafter, the TDICA method), which is one type of ICA-BSS method. Details of this processing are given in Non-Patent Document 1, Non-Patent Document 2, and elsewhere.
In the sound source separation unit Z1, a separation filter processing unit 11t performs sound source separation by applying filter processing with a separation matrix W(z) to the two-channel mixed audio signals x1(t) and x2(t) (the number of channels equals the number of microphones), which are obtained by inputting the sound source signals S1(t) and S2(t) (the audio signal of each sound source) from two sound sources 1 and 2 through two microphones (audio input means) 111 and 112. The mixed audio signals x1(t) and x2(t) are signals digitized at a predetermined sampling period, but the A/D conversion means is omitted from FIGS. 6 and 7.
FIG. 6 shows an example in which sound source separation is performed on the two-channel mixed audio signals x1(t) and x2(t) obtained by inputting the sound source signals S1(t) and S2(t) (individual audio signals) from two sound sources 1 and 2 through two microphones (audio input means) 111 and 112, but the same applies when there are more than two channels. For sound source separation by the ICA-BSS method, it is sufficient that (number of channels n of the input mixed audio signals, that is, the number of microphones) ≥ (number of sound sources m). However, as noted above, to ensure high separation performance it is desirable for the number of channels subjected to sound source separation processing to match the number of sound sources.
Sound source signals from a plurality of sound sources are superimposed in each of the mixed audio signals x1(t) and x2(t) collected by the microphones 111 and 112. Hereinafter, the mixed audio signals x1(t) and x2(t) are collectively written x(t). The mixed audio signal x(t) is expressed as a temporal and spatial convolution of the sound source signals S(t), as in the following equation (1).
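
The image of equation (1) is not reproduced in this text. A convolutive mixing model consistent with the description above can be written as follows; the mixing matrix A(z), its time-domain coefficient matrices A(n), and the filter length D are notation assumed here rather than taken from the original.

```latex
x(t) = \sum_{n=0}^{D-1} A(n)\, S(t-n) = A(z)\, S(t) \tag{1}
```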

The theory of sound source separation by TDICA rests on the idea that, because the individual sources in the sound source signal S(t) are statistically independent of one another, S(t) can be estimated once x(t) is known, and the sound sources can therefore be separated.
If the separation matrix used for this sound source separation processing is W(z), the separated signal (that is, the identified signal) y(t) is given by the following equation (2).
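
The image of equation (2) is likewise missing. Given that y(t) is obtained by filtering x(t) with the separation matrix W(z), the relation presumably has the following form (a reconstruction, not a copy of the original figure).

```latex
y(t) = W(z)\, x(t) \tag{2}
```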

W(z) is obtained from the output y(t) by successive calculation (learning calculation). As many separated signals are obtained as there are channels.
For sound source synthesis processing, a matrix corresponding to the inverse operation can be formed from the information on W(z), and the inverse operation performed with it. A predetermined matrix is set as the initial value (initial matrix) of the separation matrix when the successive calculation of W(z) is performed.
Sound source separation by such an ICA-BSS method separates (identifies), for example, the sound source signal of a singing voice and the sound source signal of an instrument from multi-channel mixed audio signals in which a human singing voice and the sound of an instrument such as a guitar are mixed.
Equation (2) can be rewritten as the following equation (3).
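
Equation (3) is not reproduced in this text. Since the next paragraph refers to separation filter coefficients W(n), a time-domain (FIR) expansion of equation (2) of the following kind is assumed; D, the number of filter taps, is a symbol introduced here.

```latex
y(t) = \sum_{n=0}^{D-1} W(n)\, x(t-n) \tag{3}
```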

The separation filter (separation matrix) W(n) in equation (3) is obtained by a successive calculation that repeatedly executes the processing represented by the following equation (4) (hereinafter, the first unit processing). That is, the previous (j-th) output y(t) is applied to equation (4) to obtain the current ((j+1)-th) W(n), and the W(n) just obtained is used to apply filter processing (a matrix operation) to a predetermined time length of the mixed audio signals to obtain the current ((j+1)-th) output y(t). This first unit processing is repeated a plurality of times, so that the separation filter (separation matrix) W(n) gradually comes to fit the mixed audio signals used in the successive calculation.
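
Equation (4) is also missing from this text. A widely used form of the TDICA update in the literature cited as Non-Patent Document 1 is shown below as an assumption. Here α is a step-size parameter, φ(·) is a nonlinear function, ⟨·⟩_t denotes time averaging, and off-diag takes the off-diagonal part of a matrix. None of these symbols is confirmed by the surviving text.

```latex
W^{[j+1]}(n) = W^{[j]}(n)
  - \alpha \sum_{d=0}^{D-1}
    \left\{ \operatorname{off\text{-}diag}
      \left\langle \varphi\!\left(y^{[j]}(t)\right)\,
                   y^{[j]}(t-n+d)^{\mathsf{T}} \right\rangle_{t} \right\}
    W^{[j]}(d) \tag{4}
```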

Next, a conventional sound source separation unit Z2 that performs sound source separation processing based on the FDICA method (frequency-domain ICA), another type of ICA-BSS method, is described with reference to the block diagram shown in FIG. 7.
In the FDICA method, an ST-DFT processing unit 13 first applies a short-time discrete Fourier transform (hereinafter, ST-DFT processing) to the input mixed audio signal x(t) frame by frame, a frame being the signal divided at a predetermined period. This performs a short-time analysis of the observed signal (conversion from the time domain to the frequency domain). The signal after the discrete Fourier transform is divided into frequency bands of a predetermined width called frequency bins. A separation filter processing unit 11f then applies separation filter processing based on a separation matrix W(f) to the signal of each channel after ST-DFT processing (the signal of each frequency component), thereby performing sound source separation (identification of the sound source signals). With f denoting the frequency bin and m the analysis frame number, the separated signal (identified signal) y(f, m) can be expressed as the following equation (5).
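
Equation (5) is not reproduced in this text. From the description, the separated signal in each frequency bin is the product of the separation matrix and the ST-DFT of the mixed signal, which suggests the following form; x(f, m) for the transformed mixed signal is notation assumed here.

```latex
y(f, m) = W(f)\, x(f, m) \tag{5}
```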

The separation filter (separation matrix) W(f) in equation (5) is obtained by a successive calculation that repeatedly executes the processing represented by the following equation (6) (hereinafter, the second unit processing). That is, the previous (i-th) output y(f) is applied to equation (6) to obtain the current ((i+1)-th) W(f), and the W(f) just obtained is used to apply filter processing (a matrix operation) to a predetermined time length of the mixed audio signals (converted to the frequency domain) to obtain the current ((i+1)-th) output y(f). This second unit processing is repeated a plurality of times, so that the separation filter (separation matrix) W(f) gradually comes to fit the mixed audio signals used in the successive calculation.
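
Equation (6) is missing as well. A commonly cited FDICA update of the kind described in Non-Patent Document 1 is given below as an assumption. Here η(f) is a step-size parameter, φ(·) is a nonlinear function, ⟨·⟩_m denotes averaging over frames, and H denotes the conjugate transpose; these symbols are not confirmed by the surviving text.

```latex
W^{[i+1]}(f) = W^{[i]}(f)
  - \eta(f)
    \left[ \operatorname{off\text{-}diag}
      \left\{ \left\langle \varphi\!\left(y^{[i]}(f, m)\right)\,
                           y^{[i]}(f, m)^{\mathsf{H}} \right\rangle_{m} \right\}
    \right] W^{[i]}(f) \tag{6}
```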

With the FDICA method, sound source separation is treated as an instantaneous mixing problem in each narrow band, so the separation filter (separation matrix) W(f) can be updated comparatively simply and stably.
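
To illustrate what "instantaneous mixing in each narrow band" means, the sketch below treats one frequency bin of a 2-channel signal as a plain complex matrix product. The mixing matrix, the source values, and the helper names are invented for the example, and the separation matrix is taken to be the exact inverse of the mixing matrix rather than learned by ICA.

```python
from typing import List

Matrix = List[List[complex]]
Vector = List[complex]


def apply_matrix(W: Matrix, x: Vector) -> Vector:
    """Per-bin separation filter: y(f, m) = W(f) x(f, m)."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def invert_2x2(A: Matrix) -> Matrix:
    """Inverse of a 2x2 matrix (stands in for a learned separation matrix)."""
    (a, b), (c, d) = A
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


# One frequency bin, one frame: two source spectra and an assumed mixing matrix.
s = [1 + 1j, 2 - 1j]
A = [[1.0, 0.5], [0.5, 1.0]]

x = apply_matrix(A, s)       # what the two microphones observe in this bin
W = invert_2x2(A)            # ideal separation matrix for this bin
y = apply_matrix(W, x)       # recovered source spectra

max_error = max(abs(yi - si) for yi, si in zip(y, s))
```

In a real FDICA system W(f) would be estimated iteratively for every bin, and the recovered signals would be defined only up to scaling and ordering.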

The sound source separation device X according to the embodiment of the present invention is described below with reference to the block diagram shown in FIG. 1.
The sound source separation device X includes a plurality of directional microphones 111 to 11n (hereinafter, directional mics) arranged in an acoustic space in which one or more sound sources may exist. From the plurality of audio signals input successively through the directional mics 111 to 11n (hereinafter, input audio signals xi, where i = 1 to n), it successively generates separated signals yj in which the individual sound source signals are separated (identified), that is, identified signals corresponding to the sound source signals, and outputs them in real time to a speaker or the like. When a plurality of sound sources exist in the acoustic space, each input audio signal xi is a mixed audio signal in which the sound source signals from those sources are superimposed. FIG. 1 shows an example in which two sound sources 1 and 2 exist in the acoustic space, but the acoustic space in which the directional mics 111 to 11n are arranged may contain anywhere from 0 to n sound sources, and the number of sound sources is not determined in advance.

As shown in FIG. 1, the sound source separation device X includes n directional microphones 111 to 11n, an A/D converter 21 (denoted ADC in the figure), a D/A converter 22 (denoted DAC in the figure), an input buffer 23, an output buffer 24, an ICA unit 20, a power detection/signal selection unit 25, an external input interface 26, and the like.
Further, the ICA unit 20 includes an ST-DFT processing unit 20a, a learning calculation unit 20b, a separation filter processing unit 20c, a separation control unit 20e, and the like.
Here, the ICA unit 20 and the power detection/signal selection unit 25 may each be configured by an arithmetic processor such as a DSP (Digital Signal Processor), storage means such as a ROM storing the program executed by that processor, and other peripheral devices such as a RAM. Alternatively, a computer having one CPU and its peripheral devices may be configured to execute program modules corresponding to the processing performed by each of the above components. It is also conceivable to provide the device as a program for a sound source separation device that causes a predetermined computer (including a processor provided in the sound source separation device) to execute the processing of each component.

The ADC 21 converts the analog input audio signals input from the respective microphones 111 to 11n into digital input audio signals xi(t) (A/D conversion) by sampling each of them at a predetermined sampling period. For example, when each sound source is a human voice, the signals may be digitized at a sampling frequency of about 8 kHz.
The input buffer 23 is a data buffer that receives the digital input audio signals xi(t) obtained by successive A/D conversion in the ADC 21 and always holds the input audio signals xi for the latest predetermined time length.
The power detection/signal selection unit 25 detects the power Pi (signal strength) of each of the plurality of input audio signals xi input through the plurality of directional microphones 111 to 11n and, based on the power Pi, selects from among the plurality of input audio signals xi one or more input audio signals corresponding to the one or more sound sources existing in the acoustic space (hereinafter referred to as adopted input audio signals xj) (an example of the signal strength detection means and the signal selection means). Details will be described later.
The external input interface 26 is a signal transmission interface through which the power detection/signal selection unit 25 acquires the signal power setting values Ps1 and Ps2, described later, from an external device such as a computer.
When a plurality of adopted input audio signals xj are selected by the power detection/signal selection unit 25, the ICA unit 20 applies sound source separation processing of the blind source separation method based on independent component analysis (the ICA-BSS sound source separation processing described above) to the plurality of adopted input audio signals xj, thereby generating the same number of separated signals yj as there are adopted input audio signals xj (an example of the ICA-BSS sound source separation means).

Specifically, the ST-DFT processing unit 20a performs short-time discrete Fourier transform processing on the adopted input audio signals xj of a predetermined time length (one frame) selected by the power detection/signal selection unit 25 from among the input audio signals accumulated in the input buffer. This converts the time-domain adopted input audio signals xj of the predetermined time length (corresponding to xi(t) in FIG. 6) into frequency-domain input audio signals xj(f) of the same time length (signals divided into frequency bands of a predetermined range, called frequency bins). Since the adopted input audio signals xj are sampled at a predetermined period and digitized, specifying the time length of an adopted input audio signal xj is synonymous with specifying its number of samples.
Further, the separation filter processing unit 20c performs a matrix operation using the separation matrix W(f) on the plurality of frequency-domain adopted input audio signals xj(f) sequentially input through the ST-DFT processing unit 20a, thereby sequentially generating a plurality of frequency-domain separated signals yj(f) corresponding to the respective sound sources. With f denoting the frequency bin and m the frame number, the separated signal y(f, m) obtained by the processing of the separation filter processing unit 20c (synonymous with yj(f) above) is expressed by equation (5) described above.
Here, the number of separated signals yj(f) output is the same as the number of adopted input audio signals xj. The example in FIG. 1 shows a state in which two input audio signals x1 and x3 are selected as the adopted input audio signals xj, but the number and combination of adopted input audio signals xj can vary according to the selection result of the power detection/signal selection unit 25.
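The matrix operation of the separation filter processing unit 20c can be sketched as follows. This is only an illustrative sketch: it assumes that equation (5) has the usual FDICA form y(f, m) = W(f) x(f, m), and the function name `apply_separation` and the list-based data layout are hypothetical, not taken from the specification.

```python
def apply_separation(W, x):
    """Compute y(f, m) = W(f) x(f, m) for every frequency bin f of one frame m.

    W: list over frequency bins of k-by-k separation matrices
       (each a list of rows of complex numbers); assumed layout.
    x: list over frequency bins of length-k vectors holding the
       frequency-domain adopted input audio signals xj(f) of one frame.
    Returns a list over bins of length-k vectors of separated signals yj(f).
    """
    y = []
    for W_f, x_f in zip(W, x):
        if any(len(row) != len(x_f) for row in W_f):
            raise ValueError("W(f) needs one column per adopted channel")
        # One matrix-vector product per bin: each bin is treated as an
        # independent instantaneous mixing problem.
        y.append([sum(w * v for w, v in zip(row, x_f)) for row in W_f])
    return y
```

Because W(f) is square with one row and one column per adopted channel, the number of separated signals always equals the number of adopted input audio signals.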

Further, the IDFT processing unit 20d applies inverse discrete Fourier transform (Inverse Discrete Fourier Transform) processing to the frequency-domain separated signals yj(f) generated by the separation filter processing unit 20c. As a result, the frequency-domain separated signals yj(f) are converted into time-domain separated signals yj and stored in the output buffer 24.
The time-domain separated signals yj (digital signals) held in the output buffer 24 are then converted into analog audio signals by the D/A converter 22 and output. These analog audio signals are output as sound through, for example, a speaker (not shown).

Meanwhile, the learning calculation unit 20b performs the learning calculation of the separation matrix W(f) in the BSS sound source separation processing of the FDICA method, using the plurality of frequency-domain adopted input audio signals xj(f) of a predetermined time length. The separation matrix W(f) obtained by this learning calculation is set as the separation matrix W(f) used in the separation filter processing unit 20c. The learning calculation unit 20b performs the learning calculation using the adopted input audio signals xj held in the input buffer 23. When the separation filter processing unit 20c executes the separation processing, this learning calculation is executed in parallel with it.
Here, the calculation (learning calculation) of the separation matrix W(f) by the learning calculation unit 20b employs the sound source separation unit Z2 shown in FIG. 7 (learning calculation of the separation matrix (separation filter) based on the FDICA method). That is, the ST-DFT processing unit 20a and the learning calculation unit 20b correspond to the sound source separation unit Z2 described above.
The separation control unit 20e acquires from the power detection/signal selection unit 25 information indicating which input audio signals are the adopted input audio signals and, based on that information, controls the transmission of the input audio signals xi held in the input buffer 23 and whether or not the ICA unit 20 executes the sound source separation processing. Details will be described later.

FIG. 2 is a plan view showing an example of the arrangement of the n (n-channel) directional microphones 111 to 11n.
As shown in FIG. 2, the n directional microphones 111 to 11n (six in the example shown in FIG. 2) of the sound source separation device X are arranged with mutually different directivity directions in an acoustic space in which 0 to n sound sources can exist. As a result, the main sound collection ranges of the directional microphones 111 to 11n (the ranges indicated by broken lines in FIG. 2) hardly overlap one another.
By arranging the plurality of directional microphones 111 to 11n as shown in FIG. 2, if a sound source exists in the directivity direction (main sound collection range) of a certain directional microphone, the power of the input audio signal obtained through that directional microphone becomes particularly strong. The power of the input audio signals obtained through the other directional microphones is of course also affected to some extent, but the degree of that influence is relatively small.

Next, the procedure of the sound source separation processing in the sound source separation device X will be described with reference to the flowchart shown in FIG. 3. Hereinafter, S1, S2, ... denote the identification codes of the processing procedures (steps). The processing shown in FIG. 3 starts when a power switch (not shown) of the sound source separation device X is turned on.
[Steps S1, S2]
First, when the sound source separation device X starts processing, various initial processes are executed in each component (S1).
For example, the power detection/signal selection unit 25 acquires the signal power setting values Ps1 and Ps2 input from an external device through the external input interface 26 and stores them in the storage unit of the power detection/signal selection unit 25.
The power detection/signal selection unit 25 also sets the selection state of the adopted input audio signals to the state in which none is selected (initial state).
The learning calculation unit 20b sets a predetermined initial value in the separation matrix W(f) used for the learning calculation.
Further, the A/D conversion processing by the ADC 21, that is, the input processing of the input audio signals xi, is started (S2). From then on, the latest input audio signals xi (digital audio signals) for a predetermined time (for example, two frames) are sequentially accumulated in the input buffer 23.
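The behavior of the input buffer 23, which always holds only the latest predetermined time length of each channel, can be illustrated with a fixed-length queue. The following is a minimal sketch under assumptions: the class name `InputBuffer`, its methods, and the per-sample `push` interface are hypothetical, and the capacity of two frames simply follows the example given above.

```python
from collections import deque


class InputBuffer:
    """Keeps only the most recent n_frames frames of every channel (hypothetical)."""

    def __init__(self, n_channels, frame_len, n_frames=2):
        self.frame_len = frame_len
        # deque(maxlen=...) silently discards the oldest sample when full.
        self._channels = [deque(maxlen=frame_len * n_frames)
                          for _ in range(n_channels)]

    def push(self, samples):
        """Store one new digital sample xi(t) per channel."""
        for buf, s in zip(self._channels, samples):
            buf.append(s)

    def latest_frame(self, channel):
        """Return the newest frame_len samples, or None if not yet available."""
        buf = self._channels[channel]
        if len(buf) < self.frame_len:
            return None
        return list(buf)[-self.frame_len:]
```

A real-time implementation would more likely transfer whole blocks of samples at a time; the per-sample interface is used here only to keep the sketch short.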

[Steps S3 to S5]
Next, the power detection/signal selection unit 25 detects the signal power Pi (signal strength) of the input audio signal xi of each channel for one frame accumulated in the input buffer 23 (S3, an example of the signal strength detection procedure). If a power Pi has already been detected (calculated) for each input audio signal xi before this step S3 is executed, it is stored in the storage unit of the power detection/signal selection unit 25 as the previous power of each input audio signal xi.
For example, the power detection/signal selection unit 25 calculates (detects), as the signal power Pi, the average of the absolute values, the mean square value, or the like of α samples of the input audio signal xi accumulated in the input buffer 23 (α being, for example, the number of samples in one frame).
Further, based on the signal power Pi detected in step S3, the power detection/signal selection unit 25 executes processing (S4, S5) that selects, from among all (the plurality of) input audio signals xi, one or more adopted input audio signals xj (channels) corresponding to the one or more sound sources existing in the acoustic space in which the directional microphones 111 to 11n are arranged (an example of the signal selection procedure). Before steps S4 and S5 are executed, the channels of the adopted input audio signals already selected at that time are stored in the storage unit of the power detection/signal selection unit 25 as the channels of the previous adopted input audio signals xj.
Specifically, the power detection/signal selection unit 25 additionally selects, as an adopted input audio signal xj, any input audio signal xi whose power Pi detected in step S3 exceeds the signal power setting value Ps1 (an example of the first set strength) acquired in advance through the external input interface 26 (S4).
In addition, among the input audio signals xi already selected as adopted input audio signals xj, the power detection/signal selection unit 25 excludes from the adopted input audio signals xj any signal whose power Pi detected in step S3 has remained at or below the signal power setting value Ps2 (an example of the second set strength), acquired in advance through the external input interface 26, for a predetermined set time t0 [seconds] or longer (S5). For example, t0 may be set to about several seconds to 10 seconds.
By making continuation for the set time t0 or longer the condition for exclusion from the adopted input audio signals xj, the number of signals input to the ICA unit 20 (the number of adopted input audio signals xj) is prevented from increasing and decreasing needlessly often in response to temporary changes in the volume of a sound source.
Here, the signal power setting values Ps1 and Ps2 may be set so that Ps1 = Ps2 or so that Ps1 > Ps2.
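Steps S3 to S5 amount to a threshold test with hysteresis and a hold time. The following is a minimal sketch of that logic under stated assumptions: the names `frame_power` and `ChannelSelector` are hypothetical, the mean square value is chosen as the power measure (one of the options mentioned above), and the set time t0 is expressed as a number of consecutive frames (`t0_frames`) rather than in seconds.

```python
def frame_power(samples):
    """Mean square value of one frame, used here as the power Pi (S3)."""
    return sum(s * s for s in samples) / len(samples)


class ChannelSelector:
    """Tracks which channels are adopted input audio signals xj (hypothetical)."""

    def __init__(self, ps1, ps2, t0_frames):
        if ps1 < ps2:
            raise ValueError("expected Ps1 = Ps2 or Ps1 > Ps2")
        self.ps1, self.ps2, self.t0_frames = ps1, ps2, t0_frames
        self.adopted = set()   # initial state: nothing selected (S1)
        self._quiet = {}       # channel -> consecutive frames with Pi <= Ps2

    def update(self, powers):
        """Apply S4 and S5 to the per-channel powers of the newest frame."""
        for ch, p in enumerate(powers):
            if ch not in self.adopted:
                if p > self.ps1:                  # S4: add when Pi exceeds Ps1
                    self.adopted.add(ch)
                    self._quiet[ch] = 0
            elif p <= self.ps2:                   # S5: count the quiet time
                self._quiet[ch] += 1
                if self._quiet[ch] >= self.t0_frames:
                    self.adopted.discard(ch)      # quiet for t0 or longer
            else:
                self._quiet[ch] = 0               # signal came back: reset
        return sorted(self.adopted)
```

With a frame of L samples at sampling frequency fs, a set time of t0 seconds would correspond to roughly t0 × fs / L frames when `update` is called once per non-overlapping frame; that conversion is an assumption of this sketch, not something the specification prescribes.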

[Steps S6 to S8]
Next, the power detection/signal selection unit 25 determines whether the number of adopted input audio signals xj selected by the processing of steps S4 and S5 is one or more (that is, whether any has been selected) (S6), and whether that number is one or is two or more (S7).
When the number of adopted input audio signals xj is not one or more (that is, zero), information to that effect is transmitted from the power detection/signal selection unit 25 to the separation control unit 20e of the ICA unit 20. In this case, where no adopted input audio signal xj is selected, the separation control unit 20e does not cause the sound source separation processing (the processing of the separation filter processing unit 20c and the learning calculation unit 20b) to be executed. As a result, neither the output of separated signals yj to the output buffer 24 nor the output of separated audio signals through the DAC 22 is executed.
When the number of adopted input audio signals xj is one, information to that effect is transmitted from the power detection/signal selection unit 25 to the separation control unit 20e of the ICA unit 20. The separation control unit 20e then stops the sound source separation processing (the processing of the separation filter processing unit 20c and the learning calculation unit 20b) and outputs that single adopted input audio signal xj to the output buffer 24 as the separated signal yj as it is (without applying separation processing) (S8).
When the number of adopted input audio signals xj is zero, or when the processing of step S8 has been executed, the power detection/signal selection unit 25 returns the processing to step S3 described above.

[Steps S9 to S11]
On the other hand, when the number of adopted input audio signals xj is two or more, the power detection/signal selection unit 25 determines whether the channels of the adopted input audio signals xj selected this time are the same as the channels of the previous adopted input audio signals xj (S9).
If the current and previous channels of the adopted input audio signals xj are the same, the power detection/signal selection unit 25 shifts the processing to step S12 described later; otherwise, it shifts the processing to the next step S10.
In step S10, the power detection/signal selection unit 25 determines whether a sound source has moved (S10).
Specifically, consider the input audio signals xi input through two directional microphones whose directivity directions (sound collection ranges) are adjacent (referred to as the first microphone and the second microphone); these signals are referred to as the first input audio signal x1i and the second input audio signal x2i. When the second input audio signal x2i is selected as an adopted input audio signal xj and, this time, the power of the first input audio signal x1i has come to exceed the signal power setting value Ps1 (an example of the first set strength) (a change from the previous time) while the power of the second input audio signal x2i has fallen to or below the signal power setting value Ps2 (an example of the second set strength) (a change from the previous time), the power detection/signal selection unit 25 determines that the first input audio signal x1i is an audio signal from a sound source that has moved from the directivity direction (sound collection range) of the second microphone to the directivity direction (sound collection range) of the first microphone.
When the power detection/signal selection unit 25 determines by this processing that a sound source has moved, it excludes the second input audio signal x2i from the adopted input audio signals (S11) and shifts the processing to the next step S12.
That is, for two directional microphones whose directivity directions (sound collection ranges) are adjacent, when the power of the input audio signal of one (the second microphone) changes from strong to weak while the power of the input audio signal of the other (the first microphone) changes from weak to strong, it is determined that the sound source has moved from the sound collection range of one of the adjacent directional microphones to that of the other.
On the other hand, when the power detection/signal selection unit 25 does not determine that a sound source has moved, it shifts the processing to step S14 described later.
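The movement test of step S10 compares the previous and current powers of adjacent channels. The sketch below illustrates it under assumptions: the function name `find_moved_sources` and the `neighbors` table (which channel indices have adjacent sound collection ranges) are hypothetical and would have to be derived from the actual microphone arrangement.

```python
def find_moved_sources(prev_powers, curr_powers, adopted, neighbors, ps1, ps2):
    """Return (first, second) channel pairs where a source moved second -> first.

    prev_powers / curr_powers: per-channel power Pi of the previous and the
    current frame. adopted: channels currently selected as xj.
    neighbors: dict mapping a channel to the channels whose directivity
    directions are adjacent to it (assumed to be known from the layout).
    """
    moves = []
    for second in sorted(adopted):
        # Second microphone: adopted, and its power fell to Ps2 or below
        # this time (it was above Ps2 the previous time).
        if not (prev_powers[second] > ps2 >= curr_powers[second]):
            continue
        for first in neighbors[second]:
            # First microphone: its power rose above Ps1 this time.
            if prev_powers[first] <= ps1 < curr_powers[first]:
                moves.append((first, second))
    return moves
```

For each returned pair, step S11 would then remove the `second` channel from the adopted set immediately, without waiting for the set time t0 to elapse.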

[Steps S12 and S13]
Next, in step S12, the channel information of the adopted input audio signals xj selected at that time is transmitted from the power detection/signal selection unit 25 to the separation control unit 20e of the ICA unit 20, and the separation control unit 20e controls the other components of the ICA unit 20 so that the ICA unit 20 executes the ICA-BSS sound source separation processing with the adopted input audio signals xj as input signals (S12). As a result, the same number (a plurality) of separated signals yj as there are adopted input audio signals xj are generated and stored in the output buffer 24 (an example of the ICA-BSS sound source separation procedure).
In the processing of step S12, the learning calculation unit 20b of the ICA unit 20 takes over the separation matrix W(f) learned so far as the initial value of the separation matrix W(f) used for the new learning calculation. That is, the separation matrix W(f) is not initialized. This is because the situation leading to step S12 is one in which the sound sources in the acoustic environment have neither increased nor decreased (that is, it is not a situation in which a new sound source has appeared or a previously existing sound source has disappeared). High sound source separation performance is thereby maintained.
Further, the DAC 22 performs D/A conversion processing on the separated signals yj accumulated in the output buffer 24, and the separated signals (analog signals) are output as sound through a speaker (not shown) (S13). The processing then returns to step S3 described above.

[Steps S14 to S16]
On the other hand, in step S14 (when there are a plurality of adopted input audio signals xj and their channels have changed), the channel information of the adopted input audio signals xj selected at that time is transmitted from the power detection/signal selection unit 25 to the separation control unit 20e of the ICA unit 20, and the separation control unit 20e controls the learning calculation unit 20b so that the learning calculation unit 20b initializes the separation matrix W(f) (S14).
Further, the separation control unit 20e controls the other components of the ICA unit 20 so that the ICA unit 20 executes the ICA-BSS sound source separation processing with the adopted input audio signals xj as input signals (S15). As a result, the same number of separated signals yj as there are adopted input audio signals xj are generated and stored in the output buffer 24 (an example of the ICA-BSS sound source separation procedure).
Further, the DAC 22 performs D/A conversion processing on the separated signals yj accumulated in the output buffer 24, and the separated signals (analog signals) are output as sound through a speaker (not shown) (S16). The processing then returns to step S3 described above.
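The branching of steps S6 to S16 can be summarized as a small decision function. This is a sketch only: the function name `choose_action` and the action labels are hypothetical, `adopted` is taken to be the channel set as it stands at step S6 (before any exclusion in step S11), and `source_moved` stands for the result of the determination in step S10.

```python
def choose_action(adopted, prev_adopted, source_moved):
    """Map the selection result to the branch taken in steps S6 to S16."""
    if len(adopted) == 0:
        return "idle"               # S6: nothing selected, no output
    if len(adopted) == 1:
        return "passthrough"        # S8: output the single xj as yj, no ICA
    if set(adopted) == set(prev_adopted) or source_moved:
        # S9 (same channels) or S10/S11 (a source merely moved):
        # the number of sources is unchanged, so W(f) is carried over.
        return "separate_keep_w"    # S12, S13
    return "separate_reset_w"       # S14 to S16: initialize W(f) first


# Example: the channel set changed from {0, 2} to {0, 3} without a detected
# movement, so the separation matrix is initialized before separating.
action = choose_action([0, 3], [0, 2], source_moved=False)
```

Every branch returns to step S3 afterwards, so in a frame loop this function would be evaluated once per frame after the selection step.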

As described above, in the sound source separation device X, if a sound source exists in the directivity direction (main sound collection range) of a certain directional microphone, the power Pi of the input audio signal xi obtained through that directional microphone becomes particularly strong. The power Pi of the input audio signals xi obtained through the other directional microphones is of course also affected to some extent, but the degree of that influence is relatively small.
Because the power detection/signal selection unit 25 selects, from among all the input audio signals xi, only those whose power has reached a certain level or higher as the adopted input audio signals xj (the signals subjected to the sound source separation processing) (S4), a number of adopted input audio signals xj that is neither too many nor too few is selected for a number of sound sources that cannot be assumed in advance.
Therefore, if the directional microphones 111 to 11n for obtaining the input audio signals xi are provided in a number sufficient for the varying number of sound sources, then even when the number of sound sources existing in the acoustic space increases or decreases, a number of input audio signals (adopted input audio signals xj) that is neither too many nor too few for the number of sound sources is selected, so that high sound source separation performance can be maintained.

In the embodiment described above, the power detection/signal selection unit 25 places no particular limit on the number of signals (number of channels) selected as adopted input audio signals xj in the processing of steps S4 and S5, but it is also conceivable to impose such a limit.
For example, when the number of signals (number of channels) selected as adopted input audio signals xj in the processing of steps S4 and S5 is three or more, the power detection/signal selection unit 25 may select as the adopted input audio signals xj at most two input audio signals xi, taken in descending order of the power Pi detected in step S3.
This reduces the calculation load of the ICA unit 20. Moreover, even if components of signals with relatively weak power are mixed into the separated signals yi, no serious problem arises in practice. A sound source separation device having such a configuration is effective, for example, when it is desired to separate the sound source signal of a sound source (target sound source) existing in the directivity direction (main sound collection range) of a specific directional microphone from the sound source signals of the other sound sources (noise sound sources), that is, when there is no need to separate the sound source signals of a plurality of noise sound sources from one another.
In the sound source separation device X described above, the ICA unit 20, which executes the sound source separation processing of the blind source separation method based on independent component analysis, was shown as employing the sound source separation unit Z2 (see FIG. 7), which performs sound source separation processing based on the FDICA method, in order to reduce the calculation load. However, the configuration is not limited to this; for example, the ICA unit 20 may instead employ the sound source separation unit Z1 (see FIG. 6), which performs sound source separation processing based on the TDICA method.
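The optional limit on the number of adopted channels described above (keeping at most the two strongest signals) can be sketched as follows. The function name `cap_adopted` and the `max_channels` parameter are hypothetical; the default of two follows the example in the text.

```python
def cap_adopted(adopted, powers, max_channels=2):
    """Keep at most max_channels adopted channels, strongest power Pi first."""
    if len(adopted) <= max_channels:
        return sorted(adopted)
    # Rank the adopted channels by the power detected in step S3.
    strongest = sorted(adopted, key=lambda ch: powers[ch],
                       reverse=True)[:max_channels]
    return sorted(strongest)
```

Capping the channel count keeps W(f) at 2-by-2 per frequency bin at most, which is where the reduction in calculation load comes from.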

Next, application examples of the sound source separation device X will be described with reference to FIGS. 4 and 5.
FIG. 4 is a schematic perspective view of a mobile phone V1, which is one example of a device to which the sound source separation device X can be applied.
As shown in FIG. 4, the sound source separation device X may be mounted on the mobile phone V1 in order to separate the speaker's voice from other noise.
In this case, as shown in FIG. 4, the plurality of directional microphones 111 to 116 of the sound source separation device X (six in the example of FIG. 4) are arranged on the mobile phone V1, each with a different directivity direction. In the example of FIG. 4, the mobile phone V1 is provided with a directional microphone 111 facing the front direction, which is the direction of the speaker as a sound source relative to the mobile phone V1; a directional microphone 112 facing the opposite (rear) direction; and directional microphones 113 to 116 facing left, right, up, and down relative to the front direction.
In such a mobile phone V1, if the separated signal yi that the sound source separation device X generates for the directional microphone 111 is output as the audio signal to be transmitted to the other party's mobile phone, a mobile phone that achieves high-quality calls with little noise can be provided.

FIG. 5 is a schematic perspective view of a robot V2, which is another example of a device to which the sound source separation device X can be applied.
As shown in FIG. 5, the sound source separation device X may be mounted on a robot V2 that controls its operation by performing speech recognition on sound from surrounding sound sources. When a plurality of sound sources exist around the robot, the device may be configured to input the separated signal yj corresponding to each sound source individually to the execution unit of the speech recognition processing, so that speech recognition can be performed separately on the audio signal from each sound source.
In this case, as shown in FIG. 5, the plurality of directional microphones 111 to 114 of the sound source separation device X (four in the example of FIG. 5) are arranged on the robot V2, each with a different directivity direction. In the example of FIG. 5, the robot V2 is provided with a directional microphone 111 facing the front direction of the robot V2, a directional microphone 112 facing the opposite (rear) direction, and directional microphones 113 and 114 facing left and right relative to the front direction.
In such a robot V2, if the separated signals yi that the sound source separation device X generates for the directional microphones 111 to 114 are input individually to the execution unit of the speech recognition processing, a robot can be provided that performs highly accurate speech recognition on audio signals with little noise and highly accurate operation control based on the recognition results.

The present invention is applicable to sound source separation devices.

FIG. 1 is a block diagram showing the schematic configuration of the sound source separation device X according to an embodiment of the present invention.
FIG. 2 is a plan view showing an example of the arrangement of the directional microphones of the sound source separation device X.
FIG. 3 is a flowchart showing the procedure of the sound source separation processing in the sound source separation device X.
FIG. 4 is a schematic perspective view of the mobile phone V1, which is an example of a device to which the sound source separation device X is applied.
FIG. 5 is a schematic perspective view of the robot V2, which is an example of a device to which the sound source separation device X is applied.
FIG. 6 is a block diagram showing the schematic configuration of the sound source separation unit Z1, which performs BSS sound source separation processing based on the TDICA method.
FIG. 7 is a block diagram showing the schematic configuration of the sound source separation unit Z2, which performs BSS sound source separation processing based on the FDICA method.

Explanation of symbols

X ... Sound source separation device according to the embodiment of the present invention
V1 ... Mobile phone to which the sound source separation device according to the embodiment of the present invention is applied
V2 ... Robot to which the sound source separation device according to the embodiment of the present invention is applied
Z1 ... Sound source separation unit that performs BSS sound source separation processing based on the TDICA method
Z2 ... Sound source separation unit that performs BSS sound source separation processing based on the FDICA method
1, 2 ... Sound sources
11t, 11f ... Separation filter processing units
20 ... ICA unit
20a ... ST-DFT processing unit
20b ... Learning calculation unit
20c ... Separation filter processing unit
20d ... IDFT processing unit
20e ... Separation control unit
21 ... A/D converter
22 ... D/A converter
23 ... Input buffer
24 ... Output buffer
25 ... Power detection/signal selection unit
26 ... External input interface
111 to 11n ... Directional microphones
S1, S2, and so on ... Processing procedures (steps)

Claims

1. A sound source separation device comprising:
signal intensity detecting means for detecting, in a state where a plurality of directional microphones are arranged in mutually different directivity directions in a predetermined acoustic space, the signal intensity of each of a plurality of input audio signals input through the plurality of directional microphones;
signal selecting means for selecting, from the plurality of input audio signals and based on a detection result of the signal intensity detecting means, one or a plurality of adopted input audio signals corresponding to one or a plurality of sound sources existing in the acoustic space; and
ICA-BSS sound source separation means for generating, when a plurality of adopted input audio signals are selected by the signal selecting means, the same number of separated signals as the adopted input audio signals by applying to the plurality of adopted input audio signals a sound source separation process of a blind source separation method based on independent component analysis.

2. The sound source separation device according to claim 1, wherein the signal selecting means selects, as the adopted input audio signal, each input audio signal whose signal intensity detected by the signal intensity detecting means exceeds a first set intensity.

3. The sound source separation device according to claim 2, wherein the signal selecting means selects, as the adopted input audio signals, at most two of the input audio signals, taken in descending order of the signal intensity detected by the signal intensity detecting means.

4. The sound source separation device according to claim 2, wherein, among the input audio signals selected by the signal selecting means as the adopted input audio signals, any signal whose signal intensity detected by the signal intensity detecting means remains below a second set intensity for a predetermined time is excluded from the adopted input audio signals.

5. The sound source separation device according to any one of claims 2 to 4, wherein, for a first input audio signal and a second input audio signal input through two directional microphones whose directivity directions are adjacent to each other, the signal selecting means excludes the second input audio signal from the adopted input audio signals when the signal intensity of the second input audio signal becomes equal to or lower than a second set intensity while the signal intensity of the first input audio signal exceeds the first set intensity.

6. A sound source separation method comprising:
a signal intensity detection procedure in which, in a state where a plurality of directional microphones are arranged in mutually different directivity directions in a predetermined acoustic space, a predetermined signal intensity detection unit detects the signal intensity of each of a plurality of input audio signals input through the plurality of directional microphones;
a signal selection procedure in which a predetermined signal selection unit selects, from the plurality of input audio signals and based on a detection result of the signal intensity detection procedure, one or a plurality of adopted input audio signals corresponding to one or a plurality of sound sources existing in the acoustic space; and
an ICA-BSS sound source separation procedure in which, when a plurality of adopted input audio signals are selected by the signal selection procedure, a predetermined processor generates the same number of separated signals as the adopted input audio signals by applying to the plurality of adopted input audio signals a sound source separation process of a blind source separation method based on independent component analysis.
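The selection rules recited in claims 2 and 4 can be sketched roughly as follows. This is a minimal sketch under stated assumptions: the class name `ChannelSelector`, the frame-based counting of the "predetermined time", and all numeric values are hypothetical and do not come from the embodiment. The adjacent-microphone rule of claim 5 is not modeled.

```python
class ChannelSelector:
    """Track which input channels are currently adopted (illustrative only)."""

    def __init__(self, n_channels, first_level, second_level, hold_frames):
        self.first_level = first_level    # "first set intensity" (assumed value)
        self.second_level = second_level  # "second set intensity" (assumed value)
        self.hold_frames = hold_frames    # "predetermined time", counted in frames
        self.adopted = [False] * n_channels
        self.low_count = [0] * n_channels

    def update(self, powers):
        """Update the adopted set from one frame of per-channel powers."""
        for ch, p in enumerate(powers):
            if not self.adopted[ch]:
                # Adopt a channel once its power exceeds the first level.
                if p > self.first_level:
                    self.adopted[ch] = True
                    self.low_count[ch] = 0
            elif p < self.second_level:
                # Count consecutive frames spent below the second level.
                self.low_count[ch] += 1
                if self.low_count[ch] >= self.hold_frames:
                    self.adopted[ch] = False
            else:
                self.low_count[ch] = 0
        return [ch for ch, on in enumerate(self.adopted) if on]


selector = ChannelSelector(3, first_level=0.5, second_level=0.2, hold_frames=2)
frames = [[0.6, 0.1, 0.7], [0.6, 0.1, 0.1], [0.6, 0.1, 0.1]]
print([selector.update(f) for f in frames])  # [[0, 2], [0, 2], [0]]
```

Using a lower second level together with a hold time keeps a channel adopted through short pauses, so the set of signals passed to the separation stage does not change on every brief drop in power.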