JP6594222B2

JP6594222B2 - Sound source information estimation apparatus, sound source information estimation method, and program

Info

Publication number: JP6594222B2
Application number: JP2016028682A
Authority: JP
Inventors: 健太丹羽; 和則小林; 悠馬小泉; 智子川瀬
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2015-12-09
Filing date: 2016-02-18
Publication date: 2019-10-23
Anticipated expiration: 2036-02-18
Also published as: JP2017107141A

Description

The present invention relates to the technical field of acoustic signal processing, and more particularly to a technique for estimating sound source information using sound source features extracted from an acoustic signal.

In the field of acoustic signal processing, sound source information has conventionally been estimated from sound source features. Examples of such conventional techniques include direct-to-reverberant ratio (DRR) estimation, which estimates the DRR as an index of, for example, the degree of reverberation in a room, and sound source enhancement, which enhances a sound source at a specific position when various noises are mixed in.

Conventional direct-to-reverberant ratio (DRR) estimation determines the DRR deterministically from the spatial sensitivity distribution obtained from multiple beamforming outputs (see, for example, Patent Document 1). FIG. 1 shows the functional configuration of a conventional DRR estimation apparatus. The frequency domain transform units 11-1 to 11-M (M ≥ 2) take as input the signals picked up by M microphones 10-1 to 10-M and output the frequency-domain observation signal x_{ω,τ} = [X_{1,ω,τ}, …, X_{M,ω,τ}]^T. The beamforming unit (for direct sound enhancement) 12-1 takes the observation signal x_{ω,τ} as input and outputs an output signal Y_{BF,1,ω} in which the sound source direction is enhanced. The beamforming unit (for diffuse reverberation analysis) 12-2 takes the observation signal x_{ω,τ} as input and outputs an output signal Y_{BF,2,ω} in which directions other than the sound source direction are enhanced. The local PSD estimation unit 13 uses the two beamforming output signals Y_{BF,1,ω} and Y_{BF,2,ω} to obtain, for each frequency, the power spectral density (PSD) of the direct sound P_{D,ω} and the PSD of the reverberation ⁻P_{R,ω}. The power ratio calculation unit 14 estimates the DRR Γ from the PSDs P_{D,ω} and ⁻P_{R,ω} of the direct sound and the reverberation.

Conventional sound source enhancement uses, for example, (1) linear beamforming, or (2) a nonlinear Wiener filter generated from the differences in spatial sensitivity distribution among multiple beamforming outputs (see, for example, Patent Document 2). FIG. 2 shows the functional configuration of a conventional sound source enhancement apparatus. The sound receiving units 10-1 to 10-M observe sound using a general-purpose microphone array or a sound receiving system designed to increase mutual information. The frequency domain transform units 11-1 to 11-M take the M received signals as input and output the frequency-domain observation signal x_{ω,τ}. The beamforming unit (for direct sound enhancement) 12-1 takes the observation signal x_{ω,τ} as input and outputs an output signal Y_{ζ(1),ω,τ} = Y_{i,ω,τ} in which the sound source direction is enhanced. Similarly, the beamforming units (for noise analysis) 12-2 to 12-L (L ≥ 2) take the observation signal x_{ω,τ} as input and output L−1 beamforming output signals Y_{ζ(2),ω,τ}, …, Y_{ζ(L),ω,τ} in which directions other than the sound source direction are enhanced. The local PSD estimation unit 13 estimates the local PSD ^Φ_{S,ω,τ} from the L beamforming output signals Y_{ζ(1),ω,τ}, …, Y_{ζ(L),ω,τ}. The a priori SNR calculation unit 15 takes the local PSD ^Φ_{S,ω,τ} as input and obtains the estimate of the a priori SNR, ^ξ_{ω,τ} = ^φ_{S,ω,τ} / ^φ_{N,ω,τ}. The filtering unit 16 computes a Wiener filter from the estimated a priori SNR ^ξ_{ω,τ} and multiplies the beamforming output signal Y_{ζ(1),ω,τ} by that filter to produce the output signal Z_{i,ω,τ}.

Patent Document 1: JP 2011-53062 A
Patent Document 2: International Publication No. 2015/129760

Conventional direct-to-reverberant ratio (DRR) estimation can estimate the DRR with reasonable accuracy when conditions such as (1) the direct sound and the reverberation are uncorrelated and (2) only the sound source is present in the room and the noise level is sufficiently low are satisfied. However, the direct sound is correlated with reverberation that includes early reflections, and noise is present in the room in many cases. In such cases, the accuracy of DRR estimation degrades.

In conventional sound source enhancement, the estimation accuracy of the PSD of the target sound or the noise degrades, and the output signal may fail to be enhanced, in cases such as (1) the temporal sparsity of the source signals is low and the sources cannot be assumed to be uncorrelated, (2) the background noise level is high, or (3) the parameters used to design the Wiener filter are poorly matched.

That is, conventional methods that obtain sound source information deterministically suffer from reduced accuracy when used in an environment that does not satisfy specific conditions.

In view of the above, an object of the present invention is to provide a sound source information estimation technique that can estimate sound source information from sound source features with high accuracy, even in cases where the estimation accuracy of conventional deterministic methods degrades.

To solve the above problem, the sound source information estimation apparatus of the present invention includes: a sound source feature extraction unit that extracts a sound source feature of each frequency-domain observation signal from a plurality of frequency-domain observation signals picked up by enhancing sounds arriving from angular regions in a plurality of different directions; and a sound source information estimation unit that inputs the sound source feature of each frequency-domain observation signal into a statistical mapping model to obtain an estimate of the sound source information. The parameters of the statistical mapping model have been learned using sound source features extracted from a plurality of frequency-domain acoustic signals picked up by enhancing sounds arriving from angular regions in a plurality of different directions, together with correct values of the sound source information obtained from each frequency-domain acoustic signal.

According to the sound source information estimation technique of the present invention, sound source information is obtained from sound source features by a statistical method, so it can be estimated with high accuracy even in environments where the estimation accuracy of conventional deterministic methods degrades.

FIG. 1 is a diagram illustrating the functional configuration of a conventional direct-to-reverberant ratio (DRR) estimation apparatus.
FIG. 2 is a diagram illustrating the functional configuration of a conventional sound source enhancement apparatus.
FIG. 3 is a diagram for explaining the decomposition model of a sound source and an impulse response.
FIG. 4 is a diagram for explaining the propagation models of direct sound and reverberation in a reverberant environment.
FIG. 5 is a diagram for explaining local PSD estimation for estimating the DRR.
FIG. 6 is a diagram illustrating the functional configuration of the DRR estimation apparatus of the first embodiment.
FIG. 7 is a diagram illustrating the processing procedure of the DRR estimation method.
FIG. 8 is a diagram illustrating the functional configuration of the DNN mapping unit.
FIG. 9 is a diagram illustrating the functional configuration of the DRR estimation apparatus of the second embodiment.
FIG. 10 is a diagram illustrating the functional configuration of the DRR estimation apparatus of the third embodiment.
FIG. 11 is a diagram illustrating the functional configuration of the DRR estimation apparatus of the fourth embodiment.
FIG. 12 is a diagram for explaining a sound receiving system that increases mutual information.
FIG. 13 is a diagram illustrating the functional configuration of the sound source enhancement apparatus of the fifth embodiment.
FIG. 14 is a diagram illustrating the processing procedure of the sound source enhancement method.
FIG. 15 is a diagram illustrating the functional configuration of the sound source enhancement apparatus of the sixth embodiment.
FIG. 16 is a diagram illustrating the functional configuration of the sound source enhancement apparatus of the seventh embodiment.
FIG. 17 is a diagram illustrating the functional configuration of the sound source enhancement apparatus of the eighth embodiment.
FIG. 18 is a diagram illustrating an array structure having symmetry.

Embodiments of the present invention are described in detail below. In the drawings, components having the same function are given the same reference numerals, and redundant description is omitted.

The first to fourth embodiments apply the sound source information estimation technique of the present invention to direct-to-reverberant ratio (DRR) estimation. The fifth to eighth embodiments apply it to sound source enhancement. The ninth embodiment describes the sound source information estimation technique extracted as the generic concept covering both DRR estimation and sound source enhancement, in correspondence with each of the preceding embodiments.

The symbols "^", "⁻", and so on used in the text should properly be written directly above the character that follows them, but owing to the limitations of text notation they are written immediately before that character. In the equations, these symbols are written in their proper position, directly above the character.

＜Direct-to-reverberant ratio estimation technique＞
First, the conventional deterministic technique for estimating the direct-to-reverberant ratio (DRR) is described in detail. Then, embodiments of DRR estimation to which the sound source information estimation technique of the present invention is applied are described.

≪Sound propagation model in a reverberant environment≫
The impulse response between a sound source and a microphone in a reverberant environment consists of the direct sound, early reflections, and late reverberation. Here, to keep the model simple, the reverberation is taken to include both the early reflections and the late reverberation, so that the impulse response is assumed to consist of two components, direct sound and reverberation, as shown in FIG. 3. Letting H_ω denote the characteristic of the impulse response at frequency ω (hereinafter called the transfer characteristic), the transfer characteristic is modeled by Equation (1).

Here, H_{D,ω} is the transfer characteristic of the direct sound, and H_{R,ω} is the transfer characteristic of the reverberation.
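The two-component decomposition can be illustrated on a time-domain impulse response. The following is a minimal sketch, not the patent's procedure: it assumes the direct part is whatever lies within a short window around the strongest peak (the 2.5 ms half-width, the function names, and the energy-ratio definition of the reference ratio are all illustrative assumptions).

```python
import numpy as np

def split_impulse_response(h, fs, direct_window_ms=2.5):
    """Split h into direct and reverberant parts so that h = h_direct + h_reverb.

    Assumption: the direct sound is the portion within +/- direct_window_ms
    of the largest-magnitude sample; everything else counts as reverberation.
    """
    peak = int(np.argmax(np.abs(h)))
    half = int(round(direct_window_ms * 1e-3 * fs))
    mask = np.zeros(len(h), dtype=bool)
    mask[max(peak - half, 0): peak + half + 1] = True
    return np.where(mask, h, 0.0), np.where(mask, 0.0, h)

def drr_from_impulse_response(h, fs):
    """Energy ratio of the direct part to the reverberant part, in dB."""
    h_direct, h_reverb = split_impulse_response(h, fs)
    return 10.0 * np.log10(np.sum(h_direct ** 2) / np.sum(h_reverb ** 2))
```

Because the DFT is linear, the spectra of the two parts add up to the spectrum of the full response, which is the additive structure the transfer characteristic model relies on.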

The transfer characteristics of direct sound and reverberation differ greatly. In general, direct sound is spatially highly coherent, whereas reverberation is diffuse and has low coherence. As shown in FIG. 4(A), direct sound propagates directly to the microphone. In contrast, as shown in FIG. 4(B), reverberation can be modeled as propagating with equal power from all directions. Focusing on this difference in propagation model should make it possible to estimate the powers of the direct sound and the reverberation separately.

The following description assumes three conditions. (1) The direction of arrival of the sound source is known. It may be given manually or estimated by a conventional method such as beamforming or the MUSIC method. (2) The direct sound and the reverberation are uncorrelated. (3) To probe the sensitivity distribution in each direction, observation is performed with a microphone array consisting of multiple microphones.

≪Observation model of the array signal≫
Let X_{m,ω,τ} be the observation signal at the m-th microphone. There are M microphones in total, ω is the frequency bin index, and τ is the time frame index. Using Equation (1), X_{m,ω,τ} is modeled by Equation (2).

Here, S_{ω,τ} is the spectrum of the sound source.

The transfer characteristic in Equation (2) is further decomposed into two elements. The first is the transfer characteristic from the sound source to a reference position (for example, the center of the microphone array). The second is the transfer characteristic from the reference position to each microphone. Assuming that the source signal arrives as a plane wave, the second element can be approximated by the phase shift caused by the differences in delay between the microphones. H_{D,m,ω} and H_{R,m,ω} are therefore expressed by Equations (3) and (4).

Here, H_{Dref,ω} is the transfer characteristic of the direct sound between the sound source and the reference position, and H_{Rref,Ω,ω} is the transfer characteristic of the reverberation. τ_{Ω,m} is the delay between the reference position and the m-th microphone when sound arrives from the three-dimensional angle Ω = {θ, φ} (θ ∈ [0, 2π]: horizontal angle, φ ∈ [0, π]: zenith angle). Ω_D is the direction of arrival of the sound source.

The observation signal vector of the microphone array is modeled by Equation (5).

Here, a_{Ω,ω} = [exp(−jωτ_{Ω,1}), …, exp(−jωτ_{Ω,M})]^T is the steering vector for direction Ω. S_{D,ω,τ} and S_{R,Ω,ω,τ} are, respectively, the direct sound and the reverberation arriving from direction Ω as observed at the reference position, and are defined by Equations (6) and (7). ·^T (superscript T) denotes the transpose.
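As an illustration of the steering vector, the following minimal sketch computes a_{Ω,ω} for a plane wave. It rests on assumptions not stated in the text: microphone coordinates are given relative to the reference position, ω is treated as an angular frequency in rad/s, the speed of sound is 343 m/s, and the function names are hypothetical.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s (assumed value)

def unit_direction(theta, phi):
    """Unit vector toward direction Omega = {theta (horizontal), phi (zenith)}."""
    return np.array([np.sin(phi) * np.cos(theta),
                     np.sin(phi) * np.sin(theta),
                     np.cos(phi)])

def steering_vector(mic_positions, theta, phi, omega):
    """a = [exp(-j*omega*tau_1), ..., exp(-j*omega*tau_M)]^T for a plane wave.

    mic_positions: (M, 3) coordinates relative to the reference position [m].
    omega: angular frequency [rad/s].
    """
    # A microphone displaced toward the source receives the wavefront earlier,
    # so its delay relative to the reference position is negative.
    delays = -(mic_positions @ unit_direction(theta, phi)) / SPEED_OF_SOUND
    return np.exp(-1j * omega * delays)
```

Every element has unit magnitude, which reflects the plane-wave assumption: the array geometry changes only the phase, not the amplitude.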

≪Beamforming output≫
To analyze the intensity distribution of wavefronts arriving from different directions, two or more beamforming filters are applied to the observation signal x_{ω,τ}. The output signal of the l-th beamformer is given by Equation (8).

Here, ·^H (superscript H) denotes the conjugate transpose. w_{l,ω} is the filter coefficient vector of the l-th beamformer and is defined by Equation (9).
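A minimal sketch of the filter-and-sum operation Y = w^H x follows. Equation (9) is not reproduced in this text, so the weights here are a plain delay-and-sum design (w = a / M), which is an assumption standing in for the patent's definition; the function names are hypothetical.

```python
import numpy as np

def delay_and_sum_weights(a):
    """Assumed stand-in for Equation (9): w = a / M, so that w^H a = 1."""
    return a / a.shape[0]

def beamform(w, x):
    """Y = w^H x. x may be one frame (M,) or several frames stacked as (M, T)."""
    return np.conj(w) @ x
```

With these weights, a plane wave from the steered direction passes through with unit gain, while waves from other directions are attenuated according to how far their phase pattern departs from a.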

The power spectral density (PSD) of a beamforming output is the sum of the PSDs of the direct sound and the reverberation, each weighted by the sensitivity of the beamformer, and is expressed by the following equation.

Here, P_{D,ω} is the PSD of the direct sound observed at the reference position, and P_{R,Ω,ω} is the PSD of the reverberation. P_{D,ω} and P_{R,Ω,ω} are given by Equations (12) and (13).

Here, E[·] denotes the expectation over time. G_{l,Ω,ω} is the sensitivity of the l-th beamforming filter to direction Ω and is given by Equation (14).
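The directional sensitivity can be sketched as the squared magnitude of the beamformer response to each steering vector. Equation (14) is not reproduced in this text, so the form G = |w^H a_Ω|^2 on a discrete grid of directions is an assumption, as is the function name.

```python
import numpy as np

def sensitivity(w, steering):
    """G for each sampled direction, assumed to be |w^H a_Omega|^2.

    w: (M,) beamformer weights.
    steering: (D, M) array holding one steering vector per sampled direction.
    """
    return np.abs(steering @ np.conj(w)) ** 2
```

Evaluating this over a grid of directions gives the beam pattern of one filter; doing so for several filters gives the set of sensitivities that the local PSD estimation later inverts.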

The derivation of Equation (11) assumes that the direct sound and the reverberation are uncorrelated, that is, E[S^*_{D,ω,τ} S_{R,Ω,ω,τ}] = 0. Here, ·^* (superscript *) denotes the complex conjugate.

Because the reverberation is assumed to propagate diffusely in space (that is, the power arriving from every direction is equal), the PSD of the reverberation can be modeled as a constant over all directions Ω.

Therefore, the PSD of the beamforming output in Equation (11) becomes Equation (16).
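Under the diffuse assumption, each beamformer output PSD is a weighted sum of just two unknowns. The sketch below writes this in a discrete form, P_BF,l = G[l, Ω_D]·P_D + (Σ_Ω G[l, Ω])·⁻P_R; this is an assumed discretization, since Equation (16) is not reproduced in this text, and the function name is hypothetical.

```python
import numpy as np

def beamformer_output_psd(G, source_index, p_direct, p_reverb):
    """Output PSD of each beamformer under the diffuse-reverberation model.

    G: (L, D) sensitivities of L beamformers over D sampled directions.
    source_index: index of the source direction Omega_D on the grid.
    p_direct, p_reverb: direct and (direction-independent) reverberation PSDs.
    """
    return G[:, source_index] * p_direct + G.sum(axis=1) * p_reverb
```

The first beamformer is designed so that its sensitivity toward the source dominates, and the second so that its sensitivity elsewhere dominates, which keeps the two equations well conditioned.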

≪Local PSD estimation≫
As shown in FIG. 5, suppose that two beamformers with different directivity patterns are applied to the array observation signal. From Equation (16), the PSDs of the two beamforming outputs can be written in matrix form as in Equation (17).

Here, since the elements of P_{BF,ω} and G_ω are known and can be computed in advance, the PSDs of the direct sound and the reverberation are estimated for each frame.

Here, ^· denotes an estimated value. Since P_{D,ω} and ⁻P_{R,ω}, which make up ^P_{cmp,ω}, cannot be negative, any computed value that is negative is floored to 0. Using the estimated P_{D,ω} and ⁻P_{R,ω}, the direct-to-reverberant ratio Γ_conv can be estimated according to Equation (19).
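The estimation described above amounts to solving a 2×2 linear system per frequency, flooring negative results to 0, and forming a power ratio. The following is a minimal sketch under assumptions: the mixing matrices are invertible, and the ratio is taken as the band-summed direct power over the band-summed reverberant power in dB (Equation (19) is not reproduced in this text, so that form and the function names are illustrative).

```python
import numpy as np

def estimate_local_psd(G, p_bf):
    """Solve p_bf = G @ [P_D, P_R]^T per frequency, flooring negatives to 0.

    G: (F, 2, 2) mixing matrices, one per frequency.
    p_bf: (F, 2) PSDs of the two beamformer outputs.
    Returns an (F, 2) array with columns [P_D, P_R].
    """
    p_cmp = np.linalg.solve(G, p_bf[..., None])[..., 0]
    return np.maximum(p_cmp, 0.0)

def drr_db(p_direct, p_reverb, eps=1e-12):
    """Assumed form of the ratio: band-summed powers compared in dB."""
    return 10.0 * np.log10((p_direct.sum() + eps) / (p_reverb.sum() + eps))
```

The small constant eps only guards against division by zero after flooring; it is not part of the described method.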

In the present invention, to improve the accuracy of direct-to-reverberant ratio estimation, a statistical mapping model is introduced that takes the sound source features P_{D,ω} and ⁻P_{R,ω} as input and outputs the direct-to-reverberant ratio Γ. Deep neural networks (DNNs) have come into wide use in recent years as one such statistical mapping model, so a deep neural network is used here. Details of deep neural networks are given in Reference 1 below.
[Reference 1] Takayuki Okatani, "Deep Learning" (深層学習), 1st ed., Kodansha Scientific, 2015

The key points in applying the present invention to direct-to-reverberant ratio estimation are (1) a configuration in which the estimated local PSDs or a set of beamforming output powers are fed to a deep neural network that outputs the direct-to-reverberant ratio, and (2) setting the initial values of the network parameters of the deep neural network so as to reflect physical properties, as in the conventional method.

[First embodiment]
As shown in FIG. 6, the direct-to-reverberant ratio estimation apparatus of the first embodiment includes M microphones 10-1 to 10-M, M frequency domain transform units 11-1 to 11-M, two beamforming units 12-1 and 12-2, a local PSD estimation unit 13, and a DNN mapping unit 20. The direct-to-reverberant ratio estimation method of the first embodiment is realized by this apparatus performing the processing of each step described below.

The direct-to-reverberant ratio estimation apparatus is a special apparatus configured by loading a special program into a known or dedicated computer having, for example, a central processing unit (CPU) and a main storage device (RAM: random access memory). The apparatus executes each process under the control of the central processing unit, for example. Data input to the apparatus and data obtained in each process are stored, for example, in the main storage device, and are read out as needed for use in other processes. At least some of the processing units of the apparatus may be implemented in hardware such as an integrated circuit.

The processing procedure of the direct-to-reverberant ratio estimation method of the first embodiment is described with reference to FIG. 7.

In step S10, a microphone array consisting of M microphones 10-1 to 10-M picks up M observation signals x_m(n) (m = 1, …, M), where n is the sample index of the discrete-time signal. The observation signals x_m(n) are input to the frequency domain transform units 11-1 to 11-M, respectively.

In step S11, each frequency domain transform unit 11-m (m = 1, …, M) divides its observation signal x_m(n) into short frames (for example, about 256 samples at a sampling frequency of 16,000 Hz), applies the discrete Fourier transform to each frame, and outputs the frequency-domain observation signal X_{m,ω,τ}. Here, ω is the frequency bin index and τ is the time frame index. The output signals X_{1,ω,τ}, X_{2,ω,τ}, …, X_{M,ω,τ} of the frequency domain transform units 11-1 to 11-M are input to each of the beamforming units 12-1 and 12-2.
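The framing and transform in step S11 can be sketched as a short-time Fourier transform. The 256-sample frame length follows the example in the text; the Hann window, the 50% overlap (hop of 128 samples), and the function name are assumptions, since the text does not specify them.

```python
import numpy as np

def stft_frames(x, frame_len=256, hop=128):
    """Split x into windowed frames and apply the DFT to each frame.

    Returns X with shape (num_bins, num_frames), indexed as X[omega, tau].
    """
    num_frames = 1 + (len(x) - frame_len) // hop
    window = np.hanning(frame_len)  # assumed analysis window
    frames = np.stack([x[t * hop: t * hop + frame_len] * window
                       for t in range(num_frames)])
    return np.fft.rfft(frames, axis=1).T
```

At 16,000 Hz, a 256-sample frame spans 16 ms and yields 129 non-negative frequency bins spaced 62.5 Hz apart.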

In step S12, the beamforming units 12-1 and 12-2 each process the output signals X_{1,ω,τ}, X_{2,ω,τ}, …, X_{M,ω,τ} of the frequency domain transform units 11-1 to 11-M so as to enhance and pick up sound arriving from an angular region in a different direction, and output the result. The beamforming unit 12-1 is used to enhance the direct sound: it enhances sound arriving from a predetermined sound source direction and outputs the output signal Y_{BF,1,ω}. The beamforming unit 12-2 is used to analyze diffuse reverberation: it enhances sound arriving from directions other than the sound source direction and outputs the output signal Y_{BF,2,ω}. The output signals Y_{BF,1,ω} and Y_{BF,2,ω} of the beamforming units 12-1 and 12-2 are input to the local PSD estimation unit 13.

In step S13, the local PSD estimation unit 13 takes the output signals Y_{BF,1,ω} and Y_{BF,2,ω} of the beamforming units 12-1 and 12-2 as input and estimates the direct sound PSD P_{D,ω} and the reverberation PSD ⁻P_{R,ω} according to Equation (18) above. The estimated PSDs P_{D,ω} and ⁻P_{R,ω} are input to the DNN mapping unit 20.

In step S20, the DNN mapping unit 20 takes the direct sound and reverberation PSDs P_{D,ω} and ⁻P_{R,ω} output by the local PSD estimation unit 13 as input, obtains the estimate Γ of the direct-to-reverberant ratio using the network parameter z, and outputs the result.

The processing of the DNN mapping unit 20 is described in detail below. As shown in FIG. 8, the DNN mapping unit 20 consists of an N-layer deep neural network. Here, N may be about 4 to 5. First, the features are set in the input layer of the deep neural network.

Here, ω_Q is the number of analysis FFT bins. Frequency bands that aggregate multiple frequency bins may be used instead. Supposing the network parameter z comprises Z^(2), …, Z^(N), b^(2), …, b^(N), the output of the deep neural network is computed by N−1 successive steps as follows.
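One way the input layer could be assembled is sketched below. The equation defining the input vector is not reproduced in this text, so the layout (direct sound PSDs followed by reverberation PSDs) is an assumption, as are the mean pooling used for band aggregation and the function name.

```python
import numpy as np

def build_input_features(p_direct, p_reverb, num_bands=None):
    """Stack the two PSD spectra into one input vector q^(1).

    If num_bands is given, neighboring bins are averaged into that many
    bands before stacking (an assumed way of aggregating bins into bands).
    """
    feats = [np.asarray(p_direct, dtype=float),
             np.asarray(p_reverb, dtype=float)]
    if num_bands is not None:
        feats = [np.array([band.mean()
                           for band in np.array_split(f, num_bands)])
                 for f in feats]
    return np.concatenate(feats)
```

Pooling into bands shrinks the input layer and smooths bin-level estimation noise, at the cost of frequency resolution.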

Here, writing J_n for the number of units in the n-th layer, Z^(n) is a J_n × J_{n−1} matrix and b^(n) is a vector of length J_n.

As in Equation (28), the activation function f^(n)(·) is a sigmoid function for the hidden layers (n = 2, …, N−1) and the identity function for the output layer (n = N).

The number of units in the N-th layer is J_N = 1, and the estimated direct-to-reverberant ratio Γ is given by Equation (29).

Hereinafter, the direct-to-reverberant ratio estimated from the input q^(1) using the network parameter z is written Γ(q^(1); z).
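The layer-by-layer computation of Γ(q^(1); z) can be sketched as follows: sigmoid units in the hidden layers and an identity output layer with a single unit. The affine form u = Z q + b is the standard feed-forward step and is assumed here because the layer equations are not reproduced in this text; the list-based parameter layout and function names are illustrative.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def dnn_forward(q1, weights, biases):
    """Return the network output for input q1.

    weights, biases: lists holding Z^(2..N) and b^(2..N) in layer order.
    Hidden layers use a sigmoid; the last layer is the identity map.
    """
    q = q1
    last = len(weights) - 1
    for i, (Z, b) in enumerate(zip(weights, biases)):
        u = Z @ q + b
        q = u if i == last else sigmoid(u)
    return q
```

With N layers there are N−1 weight matrices, so the loop runs N−1 times, matching the successive computation described above.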

Deep neural networks have come to be used in many fields since the work of Hinton et al. on deep belief networks (DBNs) made it possible to pre-train multilayer neural networks appropriately and thereby set good initial values for the network parameters. A deep belief network stacks restricted Boltzmann machines (RBMs) in multiple layers (stacked RBMs) and estimates the initial values of the network parameters layer by layer. To compute the parameter updates of each restricted Boltzmann machine appropriately, the contrastive divergence (CD) method may be used, for example.
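For orientation, one contrastive divergence step with a single Gibbs sweep (CD-1) can be sketched as below. This is a generic textbook form for a Bernoulli-Bernoulli RBM, not something specified in the text: real-valued PSD inputs would more likely call for Gaussian visible units, and the learning rate and function names are assumptions.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def cd1_update(v0, W, b_vis, b_hid, rng, lr=0.01):
    """One CD-1 step on a batch v0 of shape (K, V); W has shape (V, H)."""
    h0_prob = sigmoid(v0 @ W + b_hid)                  # positive phase
    h0 = (h0_prob > rng.random(h0_prob.shape)).astype(float)
    v1_prob = sigmoid(h0 @ W.T + b_vis)                # reconstruction
    h1_prob = sigmoid(v1_prob @ W + b_hid)             # negative phase
    K = v0.shape[0]
    W = W + lr * (v0.T @ h0_prob - v1_prob.T @ h1_prob) / K
    b_vis = b_vis + lr * (v0 - v1_prob).mean(axis=0)
    b_hid = b_hid + lr * (h0_prob - h1_prob).mean(axis=0)
    return W, b_vis, b_hid
```

In a stacked setting, the hidden probabilities of one trained RBM become the visible data for the next, and the learned weights serve as initial values for the corresponding network layer.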

The optimization of the deep neural network is described below. In the first embodiment, the initial values of the network parameter z are set randomly, and z is optimized by error backpropagation so as to minimize the estimation error of the direct-to-reverberant ratio. The training data, consisting of K samples of local PSDs for learning and the corresponding correct values of the direct-to-reverberant ratio, are written as in Equation (30).

式（21）（22）の各ステップをK個のサンプルデータに対して適用するとき、以下のように行列形式で書くことができる。 When each step of Equations (21) and (22) is applied to K sample data, it can be written in matrix form as follows.

Here, b⁽ⁿ⁾1_K^T denotes the operation of arranging K copies of b⁽ⁿ⁾ side by side (one per sample), and the following holds.

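The matrix form above can be illustrated with a short NumPy sketch. It is only a sketch under assumptions: the activation f is taken to be the logistic sigmoid (the source defines f in Equations (21) and (22), which are not reproduced here), and the function name `forward`, the layer sizes, and the random values are illustrative.

```python
import numpy as np

def sigmoid(u):
    # Assumed activation f; the source's actual f may differ.
    return 1.0 / (1.0 + np.exp(-u))

def forward(Q1, Z_list, b_list):
    """Propagate K samples at once. Q1 has shape (J_1, K), one column per sample."""
    K = Q1.shape[1]
    ones_K = np.ones((1, K))              # 1_K^T as a row vector
    outputs = [Q1]
    for Z, b in zip(Z_list, b_list):
        # b 1_K^T replicates the bias vector once for each of the K samples.
        U = Z @ outputs[-1] + b.reshape(-1, 1) @ ones_K
        outputs.append(sigmoid(U))
    return outputs

rng = np.random.default_rng(0)
sizes = [6, 4, 1]                          # J_1, J_2, J_N = 1 (illustrative)
Z_list = [rng.normal(size=(sizes[i + 1], sizes[i])) for i in range(2)]
b_list = [rng.normal(size=sizes[i + 1]) for i in range(2)]
Q1 = rng.normal(size=(6, 5))               # K = 5 samples
outs = forward(Q1, Z_list, b_list)
```

Processing all K columns in one matrix product gives the same result as applying the per-sample steps K times, which is the point of the matrix notation.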

The squared error function defined by Equation (35) is used as the measure of the error between the output of the deep neural network and the direct-to-reverberant ratio given as the correct value.

Based on error backpropagation, the gradients of the network parameters are computed sequentially from the output layer (n = N) toward the input layer (n = 1). With ⁻Γ = [⁻Γ₁, …, ⁻Γ_K], the delta Δ⁽ⁿ⁾ for each sample in the n-th layer is obtained as follows.

Here, f′(·) is the derivative of the function f(·), and

denotes the element-wise product of matrices. The gradient of the error function is computed as follows.

Finally, the parameters are updated using the computed gradients.

The update amounts ΔZ⁽ⁿ⁾ and Δb⁽ⁿ⁾ may be set as follows.

Here, ΔZ⁽ⁿ⁾⁺ and Δb⁽ⁿ⁾⁺ are the previous update amounts, ε is the learning rate, μ is the momentum coefficient, which improves generalization and speeds up learning, and λ is the weight decay coefficient. Suitable settings are about 0.01 for ε, about 0.9 for μ, and about 0.0002 for λ.
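The backpropagation and update steps can be sketched as follows. This is an illustration under assumptions, not the patent's exact formulas: the activation is taken to be a sigmoid in every layer, the error is taken as E = (1/2K) Σ_k (Γ_k − ⁻Γ_k)², and the update rule uses a common momentum and weight-decay form (Δ = −ε·gradient + μ·previous update − ε·λ·parameter). The names `loss_and_grads` and `momentum_update` are hypothetical.

```python
import numpy as np

def f(u):
    return 1.0 / (1.0 + np.exp(-u))       # assumed activation

def f_prime(u):
    s = f(u)
    return s * (1.0 - s)

def loss_and_grads(Q1, target, Z, b):
    """Squared error and its gradients for K samples (columns of Q1)."""
    K = Q1.shape[1]
    Q, U = [Q1], []
    for Zn, bn in zip(Z, b):               # forward pass
        U.append(Zn @ Q[-1] + bn[:, None])
        Q.append(f(U[-1]))
    err = Q[-1] - target
    loss = 0.5 * np.sum(err ** 2) / K
    gZ, gb = [None] * len(Z), [None] * len(Z)
    delta = err * f_prime(U[-1])           # delta at the output layer
    for n in reversed(range(len(Z))):      # output layer toward input layer
        gZ[n] = delta @ Q[n].T / K
        gb[n] = delta.sum(axis=1) / K
        if n > 0:
            # Element-wise product with f' of the previous layer's pre-activation.
            delta = (Z[n].T @ delta) * f_prime(U[n - 1])
    return loss, gZ, gb

def momentum_update(param, grad, prev, eps=0.01, mu=0.9, lam=0.0002):
    """Assumed update form: gradient step + momentum + weight decay."""
    return -eps * grad + mu * prev - eps * lam * param

rng = np.random.default_rng(1)
Z = [rng.normal(size=(3, 4)), rng.normal(size=(1, 3))]
b = [rng.normal(size=3), rng.normal(size=1)]
Q1 = rng.normal(size=(4, 6))
target = rng.uniform(size=(1, 6))
loss, gZ, gb = loss_and_grads(Q1, target, Z, b)
dZ = [momentum_update(Z[n], gZ[n], np.zeros_like(Z[n])) for n in range(2)]
Z_new = [Z[n] + dZ[n] for n in range(2)]
```

The default values of `eps`, `mu`, and `lam` follow the settings suggested in the text; bias terms would be updated the same way, typically without weight decay.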

[Second Embodiment]
The first embodiment described a configuration using an N-layer deep neural network that takes the direct-sound PSD P_D,ω and the reverberation PSD ⁻P_R,ω as inputs and outputs the estimate Γ of the direct-to-reverberant ratio. The second embodiment describes a configuration using an N-layer deep neural network that takes a plurality of beamforming outputs as inputs, as in the following equation, and outputs the estimate Γ of the direct-to-reverberant ratio.

As shown in FIG. 9, the direct-to-reverberant ratio estimation apparatus of the second embodiment includes M microphones 10-1 to 10-M, M frequency domain transform units 11-1 to 11-M, and two beamforming units 12-1 and 12-2, as in the first embodiment, and further includes a DNN mapping unit 21. It does not include the local PSD estimation unit 13 of the first embodiment; instead, the outputs of the beamforming units 12-1 and 12-2 are fed to the DNN mapping unit 21. The direct-to-reverberant ratio estimation method of the second embodiment is realized by this apparatus performing the processing of each step described below.

In step S21, the DNN mapping unit 21 takes the output signals Y_BF,1,ω and Y_BF,2,ω of the beamforming units 12-1 and 12-2 as inputs, obtains the estimate Γ of the direct-to-reverberant ratio using the network parameters z, and outputs the result. The DNN mapping unit 21 has been optimized in the same manner as in the first embodiment, using training data consisting of K samples of beamforming outputs for learning paired with the correct direct-to-reverberant ratios.

[Third Embodiment]
In the first and second embodiments, the initial values of the network parameters z were set at random. The third embodiment describes a method of setting the initial values of the network parameters z so that they reflect the physical characteristics exploited by the conventional method. The conventional direct-to-reverberant ratio estimation technique consists broadly of the following three steps.

(Step 1: local PSD estimation) As in Equation (18), the local PSD estimates ^P_cmp,ω are obtained from the output powers P_BF,ω of two or more beamformers.

(Step 2: frequency summation) As included in Equation (19), the local PSD estimates ^P_cmp,ω are summed over the entire frequency band to output Σ_ωP_D,ω and Σ_ωP_R,ω.

(Step 3: log-domain ratio calculation) As in Equation (19), the estimate ^Γ of the direct-to-reverberant ratio is output from Σ_ωP_D,ω and Σ_ωP_R,ω as follows.
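The three steps can be sketched in NumPy as below. This assumes, for illustration only, that Step 1 is a per-bin linear inversion of a 2×2 beamformer sensitivity matrix and that Step 3 computes 10·log₁₀(Σ P_D) − 10·log₁₀(Σ P_R); Equations (18) and (19) themselves are not reproduced in this section, and the names `conventional_drr` and `D_inv` are hypothetical.

```python
import numpy as np

def conventional_drr(P_bf, D_inv, floor=1e-12):
    """P_bf: (Q, 2) beamformer output powers per frequency bin.
    D_inv: (Q, 2, 2) per-bin inverse of the assumed sensitivity matrix."""
    # Step 1: local PSD estimation, one small linear map per frequency bin.
    P_cmp = np.einsum('qij,qj->qi', D_inv, P_bf)
    P_cmp = np.maximum(P_cmp, floor)       # keep PSD estimates positive
    # Step 2: frequency summation over all bins.
    sum_D, sum_R = P_cmp.sum(axis=0)
    # Step 3: ratio in the log domain, in dB.
    return 10.0 * np.log10(sum_D) - 10.0 * np.log10(sum_R)

# Synthetic check: build beamformer powers from known PSDs, then invert.
rng = np.random.default_rng(0)
Q = 16
P_true = rng.uniform(0.5, 2.0, size=(Q, 2))            # columns: P_D, P_R
D = np.tile(np.array([[1.0, 0.3], [0.2, 1.0]]), (Q, 1, 1))
P_bf = np.einsum('qij,qj->qi', D, P_true)
drr_db = conventional_drr(P_bf, np.linalg.inv(D))
```

Each step is either a linear map or a fixed nonlinearity, which is what allows the steps to be mapped onto network layers as initial values.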

Because these three steps can be regarded as physically corresponding to the processing of the individual layers, they provide better initial values for the network parameters than random settings. In the third embodiment, the direct-to-reverberant ratio estimation apparatus of the second embodiment is configured to set the initial values of the network parameters z in this way. The optimization processing is the same as in the second embodiment.

As shown in FIG. 10, the direct-to-reverberant ratio estimation apparatus of the third embodiment includes M microphones 10-1 to 10-M, M frequency domain transform units 11-1 to 11-M, two beamforming units 12-1 and 12-2, and the DNN mapping unit 21, as in the second embodiment, and further includes an initial value setting unit 31. The direct-to-reverberant ratio estimation method of the third embodiment is realized by this apparatus performing the processing of each step described below.

The initial value setting unit 31 sets the initial values of the network parameters for each layer of the DNN mapping unit 21 as follows. The values of the input layer are set as in the following equation, as in the second embodiment.

The processing of the second layer corresponds to Step 1 (local PSD estimation). The number of units J₂ in the second layer is chosen so that J₂ ≥ L×Q, where L is the number of beamformers and Q is the number of frequency bins. The description below assumes L = 2. The local PSD estimation can be expressed by writing the network parameters as follows.

G₂ and B₂ are value-range adjustment coefficients. G₂ is set so that the maximum value of Z⁽²⁾q⁽¹⁾ is about 1 to 5. B₂ is set between -5 and 0 so that the value is floored near 0 when Z⁽²⁾q⁽¹⁾ is 0 or less. The output q⁽²⁾ of the second layer is then obtained by the following calculation.

The processing of the third layer corresponds to Step 2 (frequency summation). The frequency summation can be expressed by writing the network parameters as follows. The number of units J₃ in the third layer is chosen so that J₃ ≥ 2.

G₃ and B₃ are value-range adjustment coefficients. G₃ is set so that the maximum value of Z⁽³⁾q⁽²⁾ is about 1 to 5. B₃ may be set to about 0. The output q⁽³⁾ of the third layer is then obtained by computing Equations (52) and (53).
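As a rough illustration of how a frequency summation can be written as layer weights, the sketch below builds a third-layer weight matrix with exactly J₃ = 2 units. The layout is an assumption: the second-layer output q⁽²⁾ is taken to hold the Q direct-sound PSD bins followed by the Q reverberation PSD bins, and the function name `init_frequency_sum_layer` is hypothetical. The patent's own parameter equations are not reproduced here and may be arranged differently.

```python
import numpy as np

def init_frequency_sum_layer(Q, G3=1.0, B3=0.0):
    """Weights that sum Q direct bins into unit 0 and Q reverberant bins into unit 1."""
    Z3 = np.zeros((2, 2 * Q))
    Z3[0, :Q] = G3                         # unit 0: G3 * sum over direct-sound bins
    Z3[1, Q:] = G3                         # unit 1: G3 * sum over reverberation bins
    b3 = np.full(2, B3)                    # bias, about 0 per the text
    return Z3, b3

Q = 8
rng = np.random.default_rng(0)
q2 = rng.uniform(0.0, 1.0, size=2 * Q)     # assumed layout: [P_D bins, P_R bins]

# Choose G3 so that the largest pre-activation value is about 3 (within 1 to 5).
Z_unit, _ = init_frequency_sum_layer(Q)
G3 = 3.0 / np.max(Z_unit @ q2)
Z3, b3 = init_frequency_sum_layer(Q, G3=G3, B3=0.0)
u3 = Z3 @ q2 + b3                          # pre-activation of the third layer
```

If more than two units are used (J₃ > 2), the extra rows would need some other initialization, for example small random values; that choice is not specified in this section.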

The processing of the fourth layer corresponds to Step 3 (log-domain ratio calculation). The log-domain ratio calculation can be expressed by writing the network parameters as follows. The number of units in the fourth (output) layer is J₄ = 1.

Here, to correspond to Equation (44), Z_1,1⁽⁴⁾ is restricted to positive values and Z_1,2⁽⁴⁾ to negative values. For example, the values are determined as follows.

The values of 10log₁₀(Σ_ωP_D,ω) and 10log₁₀(Σ_ωP_R,ω) referred to here may be computed from a single sample or taken as the average of values computed over many samples.

Finally, the output value is calculated as follows.

In the initial value setting method described above, the number of layers should be at least the minimum number of signal processing operation units plus one, so N ≥ 4 is desirable. The description above assumed N = 4; if N is to be larger than 4, redundant layers may be inserted. Here, the minimum number of signal processing operation units means the number of signal processing operations, such as additions and multiplications, needed to carry out the equivalent signal processing (here, direct-to-reverberant ratio estimation) with the conventional deterministic method.

[Fourth Embodiment]
The third embodiment described a configuration in which the initial values of the network parameters z are set in the direct-to-reverberant ratio estimation apparatus of the second embodiment. In the fourth embodiment, the direct-to-reverberant ratio estimation apparatus of the first embodiment is configured to set the initial values of the network parameters z. The optimization processing is the same as in the first embodiment.

As shown in FIG. 11, the direct-to-reverberant ratio estimation apparatus of the fourth embodiment includes M microphones 10-1 to 10-M, M frequency domain transform units 11-1 to 11-M, two beamforming units 12-1 and 12-2, the local PSD estimation unit 13, and the DNN mapping unit 20, as in the first embodiment, and further includes an initial value setting unit 30. The direct-to-reverberant ratio estimation method of the fourth embodiment is realized by this apparatus performing the processing of each step described below.

The third embodiment explained that the processing of the second layer corresponds to the local PSD estimation. In the fourth embodiment, therefore, the initial value settings for the third and subsequent layers can be used. The values of the input layer are set as in Equation (60), as in the first embodiment.

That is, the processing from Equation (50) onward is set as the initial values of the second and third layers. The third embodiment stated that N ≥ 4 is preferable; since the same principle applies of setting the number of layers to at least the minimum number of signal processing operation units plus one, N ≥ 3 is desirable in the fourth embodiment.

By setting the initial values of the network parameters appropriately, as in the third and fourth embodiments, the parameters can be optimized while each layer retains its physical meaning. As a result, the likelihood of outputting outliers is expected to decrease even when the training data are somewhat limited. In other words, a deep neural network that is less dependent on the environment can be designed even with little training data.

<Sound source enhancement technique>
First, sound source enhancement by the conventional deterministic approach is described in detail; then, embodiments of sound source enhancement to which the sound source information estimation technique of the present invention is applied are described.

≪Modeling of observation signals≫
Suppose K sound sources exist in the sound field and are observed with M (≥ 2) microphones. This situation can be regarded as a multiple-input multiple-output (MIMO) system. Letting A_m,k,ω denote the transfer characteristic between the k-th sound source and the m-th microphone, the M observation signals x_ω,τ can be written as in Equation (61).

Here, Equation (61) consists of the following elements.

Here, S_k,ω,τ denotes the k-th sound source signal and N_m,ω,τ the non-directional background noise at the m-th microphone. The means and the expected powers of the sound sources and the background noise are assumed to satisfy the following.

Here, <·> denotes the expectation operator. Assuming further that S_k,ω,τ and N_m,ω,τ are mutually uncorrelated, the following holds.

Here, ·^* (superscript *) denotes complex conjugation. When these statistical properties hold, the covariance matrices of the sound source signals and the background noise are modeled as follows.

Here, ·^H (superscript H) denotes the conjugate transpose, I_K is the K-dimensional identity matrix, and I_M is the M-dimensional identity matrix.

The covariance matrix of the observation signals x_ω,τ (hereinafter called the spatial correlation matrix) is modeled as follows.

Here, R_A,ω is composed of the received sound power σ²_A,ω at each microphone (assuming that the channel levels have been normalized in advance) and the inter-channel correlations Γ_i,j,ω.
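The observation model and its spatial correlation matrix can be checked numerically with a small simulation at a single frequency bin. The sketch assumes unit-variance, zero-mean, mutually uncorrelated complex Gaussian sources and white noise of variance `sigma_n2`; these scalings, the random transfer matrix `A`, and all names are illustrative choices, not values from the source.

```python
import numpy as np

rng = np.random.default_rng(0)
M, K, T = 4, 2, 20000                      # microphones, sources, time frames
sigma_n2 = 0.1                             # assumed background-noise variance

def crandn(*shape):
    """Circular complex Gaussian samples with unit variance."""
    return (rng.standard_normal(shape) + 1j * rng.standard_normal(shape)) / np.sqrt(2)

A = crandn(M, K)                           # transfer characteristics A_{m,k}
S = crandn(K, T)                           # source signals, one column per frame
N = np.sqrt(sigma_n2) * crandn(M, T)       # non-directional background noise
X = A @ S + N                              # observation: x = A s + n

# Empirical spatial correlation matrix <x x^H>, averaged over frames.
R_emp = X @ X.conj().T / T
# Model under the stated assumptions: A A^H + sigma_n2 * I_M.
R_model = A @ A.conj().T + sigma_n2 * np.eye(M)
rel_err = np.linalg.norm(R_emp - R_model) / np.linalg.norm(R_model)
```

With enough frames the empirical matrix approaches the model, which is what the uncorrelatedness assumptions above imply.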

The sound receiving system design technique, beamforming, and Wiener filtering that make up the conventional sound source enhancement technique are described below in order. The sound receiving system design technique is a sound receiving technique (hardware) for analyzing the target sound sources in detail. Beamforming is a signal processing technique for processing the received observation signals. Wiener filtering is a technique for further suppressing noise in the signal after beamforming. In the conventional art, these techniques are combined and implemented as desired, as follows.
Implementation 1: sound receiving system design technique + beamforming + Wiener filtering
Implementation 2: sound receiving system design technique + beamforming
Implementation 3: (general-purpose microphones) + beamforming
Implementation 4: (general-purpose microphones) + beamforming + Wiener filtering

≪Sound receiving system design technique for increasing mutual information≫
Reference 2 describes (1) the properties of received signals that make it easier to separate and pick up sound source signals, and (2) a sound receiving system using a multi-concave reflector as one implementation.
[Reference 2] K. Niwa, T. Kako, and K. Kobayashi, “Microphone array for increasing mutual information between sound sources and observation signals,” ICASSP 2015, pp. 534-538, 2015.

In the technique described in Reference 2, the mutual information I_s;x between s_ω,τ and x_ω,τ is defined in order to measure how much information x_ω,τ conveys about s_ω,τ, the signals to be analyzed in detail.

Here, H_s denotes the entropy of the transmitted information and H_s|x the transmission loss. If A_ω is singular, or if the background noise level is high, the transmission loss H_s|x increases. To investigate which spatial correlation matrix maximizes I_s;x, the channel capacity C_ω is introduced.

By eigenvalue decomposition of R_A,ω, C_ω is expressed as follows.

Here, Λ_m,ω is the m-th eigenvalue of R_A,ω. According to Reference 2, C_ω is maximized by receiving the signals so that the eigenvalue distribution is flattened, as follows.

Receiving sound so that the eigenvalues are flattened as in Equation (82) corresponds to receiving sound so that the inter-channel correlation becomes 0.

If I_s;x increases, the observation signals should contain cues for separating the sound sources.
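The link between flat eigenvalues and low inter-channel correlation can be illustrated numerically. The sketch assumes a capacity of the common form C = Σ_m log₂(1 + Λ_m/σ_N²); the exact expression in the source is not reproduced in this section, so this form, the equicorrelated test matrix, and the names below are assumptions for illustration.

```python
import numpy as np

def channel_capacity(R, noise_var):
    """Assumed capacity form: sum over eigenvalues of log2(1 + eigenvalue / noise_var)."""
    eigvals = np.linalg.eigvalsh(R)        # R is Hermitian, so eigenvalues are real
    return float(np.sum(np.log2(1.0 + eigvals / noise_var)))

def equicorrelated(M, rho, power=1.0):
    """M-channel correlation matrix with equal power and equal correlation rho."""
    return power * ((1.0 - rho) * np.eye(M) + rho * np.ones((M, M)))

M, noise_var = 8, 0.1
c_uncorrelated = channel_capacity(equicorrelated(M, 0.0), noise_var)
c_correlated = channel_capacity(equicorrelated(M, 0.9), noise_var)
```

Both matrices have the same total received power (the same trace); only the eigenvalue spread differs, and the uncorrelated case, whose eigenvalues are all equal, yields the larger value.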

Sound receiving systems for increasing the mutual information I_s;x include (1) the diffuse sound receiving system (see Reference 3 below) and (2) the sound receiving system using a multi-concave reflector (see Reference 2 above). The diffuse sound receiving system is an array in which many reflectors are installed so as to surround many microphones; it exploits the physical phenomenon that the inter-channel correlation decreases when microphones are placed at discrete positions in a diffuse field. FIG. 12 shows a sound receiving system using a multi-concave reflector. Multiple microphones are placed quasi-optimally near the focal point of each parabolic reflector. Near the focal point, sound waves reflected by the parabolic reflector arrive from various directions and with various time differences. Placing a microphone slightly away from the focal position therefore changes the amplitude and phase of the received sound dramatically, so that the mutual information I_s;x increases if the microphone positions are set optimally. In the sound receiving system of FIG. 12, eight microphones are placed quasi-optimally in front of each of twelve parabolic reflectors so as to increase the mutual information, giving a total of M = 96 omnidirectional microphones.
[Reference 3] K. Niwa, Y. Hioka, K. Furuya, and Y. Haneda, “Diffused sensing for sharp directive beamforming,” IEEE Trans. on Audio, Speech and Language Proc., vol. 21, pp. 2346-2355, 2013.

≪Sound source enhancement method 1: beamforming≫
A sound source enhancement method based on beamforming is described. Beamforming enhances a sound source arriving from a specific direction by manipulating the phase and amplitude differences that arise between microphones and summing the results. The output signal Y_i,ω,τ is obtained by multiplying the observation signals x_ω,τ by a filter w_i,ω that enhances the sound source arriving from the i-th direction.

Here, the following holds.


Representative filter design methods include the delay-and-sum method and the minimum variance method, which are described below. First, the phase and amplitude differences between microphones that arise when a sound wave arriving from the i-th direction is received are modeled. This model is hereinafter called the steering vector h_i,ω.

When a general-purpose microphone array (omnidirectional microphones placed in free space) is used and the sound source is sufficiently far from the microphones (for example, 1 meter or more), the steering vector can be modeled as follows.

Here, c is the speed of sound (approximately 340 meters per second), p_i = [p_X,i, p_Y,i, p_Z,i]^T is the position vector of the i-th sound source, and p_m = [p_X,m, p_Y,m, p_Z,m]^T is the position vector of the m-th microphone. When a microphone array designed to increase mutual information is used, the transfer characteristics are used as the steering vector.

However, measured impulse responses include the room reverberation and tend to be long. Therefore, data cut out from a short interval following the arrival of the direct wave may be used, or data calculated by simulation may be used.

Using the above steering vector, the delay-and-sum filter is obtained by computing Equation (89).

When the filter is designed by the minimum variance method, Equation (90) is computed.

Here, R_H,ω is a spatial correlation matrix designed using the steering vectors.

Here, the following holds.

The time-domain output signal after beamforming is obtained by applying the inverse short-time Fourier transform to Y_i,ω,τ.
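The two filter designs can be sketched with textbook formulas. Note the assumptions: the steering vector uses a far-field plane-wave model with a unit direction vector, the delay-and-sum filter is w = h / (hᴴh), the minimum variance filter is w = R⁻¹h / (hᴴR⁻¹h), and the spatial correlation matrix is built from two steering vectors plus a small diagonal term. These may differ in detail from Equations (89) and (90), and all names are illustrative.

```python
import numpy as np

def steering_vector(freq_hz, mic_pos, src_dir, c=340.0):
    """Far-field model: phase differences from projecting mic positions on src_dir."""
    delays = mic_pos @ src_dir / c                     # seconds, one per microphone
    return np.exp(1j * 2.0 * np.pi * freq_hz * delays)

def delay_and_sum(h):
    return h / np.vdot(h, h).real                      # w = h / (h^H h)

def minimum_variance(h, R):
    Rinv_h = np.linalg.solve(R, h)                     # R^{-1} h without explicit inverse
    return Rinv_h / np.vdot(h, Rinv_h)                 # w = R^{-1} h / (h^H R^{-1} h)

def output_power(w, R):
    return float(np.real(np.vdot(w, R @ w)))           # w^H R w

# Four microphones on a line, 5 cm apart (illustrative geometry).
mic_pos = np.array([[0.05 * m, 0.0, 0.0] for m in range(4)])
h_target = steering_vector(1000.0, mic_pos, np.array([1.0, 0.0, 0.0]))
h_interf = steering_vector(1000.0, mic_pos, np.array([0.0, 1.0, 0.0]))

# Assumed spatial correlation matrix: target + strong interferer + diagonal loading.
R = (np.outer(h_target, h_target.conj())
     + 10.0 * np.outer(h_interf, h_interf.conj())
     + 0.01 * np.eye(4))

w_ds = delay_and_sum(h_target)
w_mv = minimum_variance(h_target, R)
# Beamformer output for one observation vector x: Y = w^H x.
Y = np.vdot(w_mv, h_target)
```

Both filters pass the target direction with unit gain; the minimum variance filter additionally minimizes the total output power wᴴRw under that constraint, so it suppresses the interferer more strongly.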

≪Sound source enhancement method 2: Wiener filtering based on local PSD estimation≫
To suppress noise with still higher accuracy, a method of multiplying the beamforming output signal Y_i,ω,τ by a Wiener filter is described. Letting G_i,ω,τ denote the Wiener filter for enhancing the i-th sound source, the output signal Z_i,ω,τ is obtained by Equation (93).

G_i,ω,τ varies from frame to frame and is computed by Equation (94).

Here, ^φ_S,ω,τ denotes the estimated PSD of the target sound contained in the signal after beamforming, and ^φ_N,ω,τ the estimated PSD of the noise. ^ξ_ω,τ = ^φ_S,ω,τ/^φ_N,ω,τ denotes the estimated signal-to-noise ratio (SNR) of the signal after beamforming (hereinafter called the a priori SNR). To design the Wiener filter, all of these must be obtained from the observation signals x_ω,τ.
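A minimal sketch of the Wiener gain follows, assuming the standard form G = φ_S / (φ_S + φ_N) = ξ / (1 + ξ); Equation (94) is not reproduced in this section, so this form and the small `floor` guard against division by zero are assumptions.

```python
import numpy as np

def wiener_gain(phi_s, phi_n, floor=1e-12):
    """Gain per time-frequency point from target and noise PSD estimates."""
    phi_s = np.asarray(phi_s, dtype=float)
    phi_n = np.maximum(np.asarray(phi_n, dtype=float), floor)
    xi = phi_s / phi_n                     # a priori SNR estimate
    return xi / (1.0 + xi)

phi_s = np.array([3.0, 1.0, 0.0])          # target PSD estimates (illustrative)
phi_n = np.array([1.0, 1.0, 1.0])          # noise PSD estimates (illustrative)
G = wiener_gain(phi_s, phi_n)
Y = np.array([1.0 + 1.0j, 2.0, 0.5j])      # beamformer outputs at the same points
Z = G * Y                                  # enhanced output
```

The gain lies between 0 and 1 and depends on the PSD estimates only through the a priori SNR, which is why a model that outputs the a priori SNR directly is sufficient to drive the filter.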

A conventional method for obtaining the PSDs of the target sound and the noise from the observation signals is the local PSD estimation method (see References 4 and 5 below).
[Reference 4] Y. Hioka, K. Furuya, K. Kobayashi, K. Niwa, and Y. Haneda, “Underdetermined sound source separation using power spectrum density estimated by combination of directivity gain,” IEEE Trans. on Audio, Speech, and Language Proc., vol. 21, pp. 1240-1250, 2013.
[Reference 5] K. Niwa, Y. Hioka, and K. Kobayashi, “Post-filter design for speech enhancement in various noisy environments,” in Proc. IWAENC 2014, pp. 36-40, 2014.

As described above, applying beamforming to the observation signals x_ω,τ yields a signal in which a sound source arriving from a specific direction or position is enhanced. To estimate the PSDs of the target sound and the noise by analyzing information about the noise as well as the target sound, L (≥ 2) beamformers are used. Suppose the l-th (l = 1, …, L) beamformer enhances the sound source at the ζ(l)-th position, and denote its output signal by Y_ζ(l),ω,τ. The set of beamforming output signals is written y_ω,τ = [Y_ζ(1),ω,τ, …, Y_ζ(L),ω,τ]^T. Here ζ(1) = i, so that the first beamforming output always enhances the target sound. When the sound source signals can be assumed mutually uncorrelated, the PSD of the l-th beamforming output signal is modeled by Equation (95).

Here, φ_Sk,ω denotes the PSD of the k-th sound source, and D_ζ(l),k,ω denotes the average spatial sensitivity of the l-th beamformer to the position of the k-th sound source. The relationship between the L values φ_Yζ(l),ω and the K values φ_Sk,ω is modeled by Equation (96).

Because the number of sound sources K is often difficult to estimate accurately in advance, one may assume K ≈ L and use beamforming signals that pick up sound while emphasizing locations from which noise is expected to arrive.

To estimate the L local PSDs, the inverse problem of Equation (96) is solved. When the signals are highly sparse in time and the sound source signals can be assumed mutually uncorrelated, the relationship of Equation (96) can be assumed to hold in every time frame. Equation (97) then gives an estimate of the PSDs of the sound source signals for each frame.

By computing ^φ_S,ω,τ and ^φ_N,ω,τ from the estimated local PSDs ^Φ_S,ω,τ, the Wiener filter can be computed frame by frame.

Here, α_N,k,ω is an adjustment coefficient, which is often determined empirically from the output values.
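The inverse problem and the subsequent SNR calculation can be sketched for a single frequency bin and frame. The sketch assumes that the beamformer output PSDs are a linear mixture φ_Y = D φ_S, that the inverse is taken with a pseudo-inverse and floored at zero, and that the noise PSD is a weighted sum of the non-target local PSDs with weights α; the exact forms of Equations (96) and (97) and of the noise combination are not shown in this section, so these are assumptions, and the function names are hypothetical.

```python
import numpy as np

def estimate_local_psd(phi_y, D):
    """Solve phi_y = D @ phi_s for phi_s and clip negative estimates to zero."""
    phi_s = np.linalg.pinv(D) @ phi_y
    return np.maximum(phi_s, 0.0)

def prior_snr(phi_s_all, alpha, floor=1e-12):
    """Target PSD (first entry) over an alpha-weighted sum of the remaining PSDs."""
    phi_target = phi_s_all[0]
    phi_noise = float(alpha @ phi_s_all[1:])
    return phi_target / max(phi_noise, floor)

# Illustrative sensitivity matrix for L = K = 3 (rows: beamformers, columns: sources).
D = np.array([[1.0, 0.2, 0.1],
              [0.3, 1.0, 0.2],
              [0.1, 0.3, 1.0]])
phi_s_true = np.array([2.0, 0.5, 1.5])
phi_y = D @ phi_s_true                     # PSDs observed at the beamformer outputs

phi_s_hat = estimate_local_psd(phi_y, D)
xi_hat = prior_snr(phi_s_hat, alpha=np.ones(2))
```

In practice D is only approximately known and the uncorrelatedness assumption holds only roughly per frame, which is one motivation for replacing this deterministic chain with a trained mapping.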

In the present invention, in order to output a signal in which the target sound source is clearly enhanced, a statistical mapping model is introduced that takes the signals after beamforming or the estimated local PSDs ^Φ_S,ω,τ as input features and outputs the a priori SNR ^ξ_ω,τ. Deep neural networks have been widely used in recent years as one such statistical mapping model, so a deep neural network is used here.

The key points in applying the present invention to sound source enhancement are (1) a configuration in which the estimated local PSDs or the output powers of a plurality of beamformers are input to a deep neural network, which outputs the a priori SNR, and (2) setting the initial values of the network parameters of the deep neural network so that they reflect physical characteristics, as in the conventional method.

［第五実施形態］
第五実施形態の音源強調装置は、図１３に示すように、M個のマイクロホン１０−１〜１０−Ｍ、M個の周波数領域変換部１１−１〜１１−Ｍ、L個のビームフォーミング部１２−１〜１２−Ｌ、局所ＰＳＤ推定部１３、ＤＮＮマッピング部２２、およびフィルタリング部１６を備える。この音源強調装置が後述する各ステップの処理を行うことにより第五実施形態の音源強調方法が実現される。 [Fifth embodiment]
As illustrated in FIG. 13, the sound source emphasizing device of the fifth embodiment includes M microphones 10-1 to 10 -M, M frequency domain conversion units 11-1 to 11 -M, and L beam forming units. 12-1 to 12-L, a local PSD estimation unit 13, a DNN mapping unit 22, and a filtering unit 16. The sound source emphasizing apparatus according to the fifth embodiment is realized by the sound source emphasizing apparatus performing processes in steps described later.

The sound source enhancement device is, for example, a special device configured by loading a special program into a known or dedicated computer having a central processing unit (CPU), a main storage device (RAM: random access memory), and so on. The device executes each process under the control of the central processing unit, for example. Data input to the device and data obtained in each process are stored, for example, in the main storage device, and are read out as needed for use in other processing. At least some of the processing units of the device may be implemented in hardware such as an integrated circuit.

The processing procedure of the sound source enhancement method of the fifth embodiment is described with reference to FIG. 14.

In step S11, each frequency domain transform unit 11-m (m = 1, …, M) splits its observation signal x_m(n) into short frames (for example, about 256 samples at a sampling frequency of 16,000 Hz), applies the discrete Fourier transform to each frame, and outputs the frequency domain observation signal X_{m,ω,τ}. Here, ω is the frequency bin index and τ is the frame index. The output signals X_{1,ω,τ}, X_{2,ω,τ}, …, X_{M,ω,τ} of the frequency domain transform units 11-1 to 11-M are input to each of the beamforming units 12-1 to 12-L.
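The framing and transform of step S11 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the Hann window, the 50% overlap (`hop=128`), and the function name `stft_frames` are assumptions, since the text only specifies a frame length of about 256 samples at 16 kHz.

```python
import numpy as np

def stft_frames(x, frame_len=256, hop=128):
    """Split x into overlapping frames, window them, and apply the DFT.

    Returns an array of shape (num_frames, frame_len // 2 + 1); entry
    [tau, omega] plays the role of X_{m,omega,tau} for one microphone.
    The Hann window and the hop size are assumptions of this sketch.
    """
    window = np.hanning(frame_len)
    num_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[t * hop : t * hop + frame_len] for t in range(num_frames)])
    return np.fft.rfft(frames * window, axis=1)

fs = 16000
n = np.arange(fs)                          # one second of samples
x_m = np.sin(2 * np.pi * 1000 * n / fs)    # 1 kHz test tone for one microphone
X_m = stft_frames(x_m)                     # rows: frames tau, columns: bins omega
```

With these settings each bin spans 16000 / 256 = 62.5 Hz, so the test tone concentrates its energy around bin 16.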

In step S12, the beamforming units 12-1 to 12-L each process the output signals X_{1,ω,τ}, X_{2,ω,τ}, …, X_{M,ω,τ} of the frequency domain transform units 11-1 to 11-M so as to enhance and pick up sound arriving from a different angular region, and output the results. Beamforming unit 12-1 is used for enhancing the direct sound: it enhances sound arriving from a predetermined sound source direction and outputs the signal Y_{ζ(1),ω,τ}. The remaining beamforming units 12-2 to 12-L are used for analysing diffuse reverberation: they enhance sound arriving from directions other than the sound source direction and output the signals Y_{ζ(2),ω,τ}, …, Y_{ζ(L),ω,τ}. The output signals Y_{ζ(1),ω,τ}, …, Y_{ζ(L),ω,τ} of the beamforming units 12-1 to 12-L are input to the local PSD estimation unit 13.
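As a rough illustration of one beamforming unit in step S12, the sketch below applies a frequency-domain delay-and-sum beamformer at a single frequency bin. The beamformer type, the plane-wave model, the sound speed of 340 m/s, the array geometry, and the names `steering_vector` and `delay_and_sum` are all assumptions; this section of the text does not state which beamformer design is used.

```python
import numpy as np

def steering_vector(mic_xy, theta, freq_hz, c=340.0):
    """Plane-wave steering vector for a source at azimuth theta (radians)."""
    direction = np.array([np.cos(theta), np.sin(theta)])
    delays = -(mic_xy @ direction) / c           # relative arrival time per microphone
    return np.exp(-2j * np.pi * freq_hz * delays)

def delay_and_sum(x, mic_xy, theta, freq_hz):
    """Beamformer output Y = w^H x with w = a(theta) / M."""
    w = steering_vector(mic_xy, theta, freq_hz) / len(mic_xy)
    return np.vdot(w, x)                         # vdot conjugates its first argument

# Hypothetical 4-microphone square array with a 4 cm circumradius
mic_xy = 0.04 * np.array([[1, 0], [0, 1], [-1, 0], [0, -1]], dtype=float)
x = steering_vector(mic_xy, np.pi / 3, 1000.0)            # unit plane wave from 60 degrees
Y_target = delay_and_sum(x, mic_xy, np.pi / 3, 1000.0)    # beam steered at the source
Y_other = delay_and_sum(x, mic_xy, -np.pi / 2, 1000.0)    # beam steered elsewhere
```

The beam steered at the source passes the wave with unit gain, while the beam steered elsewhere attenuates it, which is the contrast between unit 12-1 and units 12-2 to 12-L.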

In step S13, the local PSD estimation unit 13 receives the output signals Y_{ζ(1),ω,τ}, …, Y_{ζ(L),ω,τ} of the beamforming units 12-1 to 12-L and estimates the local PSD ^Φ_{S,ω,τ} according to Eq. (97) above. The estimated local PSD ^Φ_{S,ω,τ} is input to the DNN mapping unit 22.
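Eq. (97) is not reproduced in this part of the text, but the description of D_ω^{-1} elsewhere in the document (an inverse when L = K, a pseudo-inverse otherwise) suggests a linear model in which the beamforming output powers equal a sensitivity matrix times the local PSDs. The sketch below inverts that model. The matrix values, the flooring of negative estimates at zero, and the name `estimate_local_psd` are assumptions.

```python
import numpy as np

def estimate_local_psd(D, beam_power):
    """Solve beam_power ~= D @ local_psd for the local PSDs.

    D has shape (L, K): L beams, K angular regions. The inverse is used
    when D is square, the pseudo-inverse otherwise. Flooring negative
    estimates at zero is an assumption of this sketch.
    """
    D_inv = np.linalg.inv(D) if D.shape[0] == D.shape[1] else np.linalg.pinv(D)
    return np.maximum(D_inv @ beam_power, 0.0)

# Hypothetical sensitivity matrix for L = 3 beams and K = 2 angular regions
D = np.array([[1.0, 0.2],
              [0.3, 1.0],
              [0.5, 0.5]])
true_psd = np.array([2.0, 0.5])
beam_power = D @ true_psd                  # noiseless beamforming output powers
local_psd = estimate_local_psd(D, beam_power)
```

In the noiseless case the pseudo-inverse recovers the local PSDs exactly; with real data it gives the least-squares estimate.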

In step S22, the DNN mapping unit 22 receives the local PSD ^Φ_{S,ω,τ} output by the local PSD estimation unit 13, uses the network parameters z_{ωi} to obtain the estimate of the a priori SNR, ^ξ_{ω,τ} = ^φ_{S,ω,τ} / ^φ_{N,ω,τ}, and outputs the result.

The processing of the DNN mapping unit 22 is described in detail below. As in the case of direct-to-reverberant ratio estimation, the DNN mapping unit 22 consists of an N-layer deep neural network. An N of about 4 to 5 is sufficient. First, the features are set in the input layer of the deep neural network.

The dimension (number of nodes) of the vector q_{ωi}^{(1)} is then J_1 = K×Q. Assuming that the network parameters z_{ωi} comprise Z_{ωi}^{(2)}, …, Z_{ωi}^{(N)} and b_{ωi}^{(2)}, …, b_{ωi}^{(N)}, the output is computed by N-1 sequential calculations as follows.

As in Eq. (108), the activation function f^{(n)}(·) combines the sigmoid function (for n = 2, …, N-1) with the identity mapping (for n = N).
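The layer-by-layer computation described above can be sketched as follows, assuming the usual feed-forward form q^{(n)} = f^{(n)}(Z^{(n)} q^{(n-1)} + b^{(n)}) for n = 2, …, N. That form matches the matrix notation used later in the text, but the equations themselves are not reproduced here, so it is an assumption, as are the layer sizes and the function names.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(q1, weights, biases):
    """Run N-1 sequential layer computations starting from the input q^(1).

    weights[i] and biases[i] hold Z^(i+2) and b^(i+2). Hidden layers use
    the sigmoid; the output layer uses the identity mapping.
    """
    q = q1
    last = len(weights) - 1
    for i, (Z, b) in enumerate(zip(weights, biases)):
        a = Z @ q + b
        q = a if i == last else sigmoid(a)
    return q

rng = np.random.default_rng(0)
sizes = [6, 8, 4, 1]        # hypothetical J_1..J_N with N = 4 and J_N = 1
weights = [rng.normal(scale=0.1, size=(sizes[i + 1], sizes[i])) for i in range(3)]
biases = [np.zeros(sizes[i + 1]) for i in range(3)]
xi_hat = forward(rng.random(6), weights, biases)   # scalar a priori SNR estimate
```

Because the output layer is linear, the network can emit any real value, which is why a separate floor is often applied before the value is used as an SNR.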

The number of nodes in the N-th layer is set to J_N = 1, and the estimated a priori SNR is given by Eq. (109).

Hereinafter, the a priori SNR estimated from the input q_{ωi}^{(1)} using the network parameters z_{ωi} is written ζ(q_{ωi}^{(1)}; z_{ωi}). The network parameters are trained and designed separately for each frequency bin and each band.

The optimization of the deep neural network is described next. In the fifth embodiment, the initial values of the network parameters z_{ωi} are set randomly, and z_{ωi} is optimized by error backpropagation so as to minimize the estimation error of the a priori SNR. A large number of observation samples, including samples along the time frame direction, are prepared, and the teacher information, consisting of Θ training local PSDs and the corresponding correct values of the a priori SNR, is written as follows.

When each step of Eqs. (101) and (102) is applied to the Θ samples, it can be written in matrix form as follows.

Here, b_{ωi}^{(n)} 1_Θ^T denotes the operation of arranging Θ copies of b_{ωi}^{(n)} side by side, 1_Θ being the Θ-dimensional all-ones vector.

The squared error function defined by Eq. (115) is used as the measure of the error between the output of the deep neural network and the a priori SNR given as the correct answer.

Based on error backpropagation, the gradients of the network parameters are computed sequentially from the output layer (n = N) toward the input layer (n = 1). Letting ⁻Ξ_{ωi} = [⁻ξ_{ωi,1}, …, ⁻ξ_{ωi,Θ}], the delta Δ_{ωi}^{(n)} for each sample in the n-th layer is obtained as follows.

The update amounts ΔZ_{ωi}^{(n)} and Δb_{ωi}^{(n)} may be set as follows.

Here, ΔZ_{ωi}^{(n)+} and Δb_{ωi}^{(n)+} are the previous update amounts, ε is the learning rate, μ is the momentum coefficient, which improves generalization and speeds up learning, and λ is the weight decay coefficient. Suitable settings are about 0.01 for ε, about 0.9 for μ, and about 0.0002 for λ.
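The update equations are not reproduced in this part of the text. The sketch below uses one common form of gradient descent with momentum and weight decay, Δ = -ε (∂E/∂Z + λ Z) + μ Δ⁺, with the values of ε, μ, and λ suggested above. The exact form in the patent may differ, and the toy objective and the name `momentum_step` are purely illustrative.

```python
import numpy as np

def momentum_step(Z, grad, prev_delta, eps=0.01, mu=0.9, lam=0.0002):
    """One parameter update with momentum and weight decay.

    delta = -eps * (grad + lam * Z) + mu * prev_delta is an assumed,
    commonly used form; returns the updated parameters and the delta.
    """
    delta = -eps * (grad + lam * Z) + mu * prev_delta
    return Z + delta, delta

# Toy problem: minimise 0.5 * ||Z - target||^2, whose gradient is Z - target
target = np.array([1.0, -2.0])
Z = np.zeros(2)
delta = np.zeros(2)
for _ in range(2000):
    Z, delta = momentum_step(Z, Z - target, delta)
```

With weight decay the iterate settles slightly short of the unregularized minimum, at target / (1 + λ), which shows the small shrinkage that λ introduces.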

In step S16, the filtering unit 16 receives the estimate ^ξ_{ω,τ} of the a priori SNR output by the DNN mapping unit 22, computes the Wiener filter according to Eq. (94) above, and multiplies the beamforming output signals Y_{ζ(1),ω,τ}, …, Y_{ζ(L),ω,τ} by the Wiener filter according to Eq. (93) above to output the signal Z_{i,ω,τ}.
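Step S16 can be illustrated with the textbook Wiener gain G = ξ / (1 + ξ) applied per time-frequency bin. Eqs. (93) and (94) are not reproduced in this part of the text, so this form is an assumption, as are the clamp on negative SNR estimates and the sample values.

```python
import numpy as np

def wiener_gain(xi):
    """Wiener gain from the a priori SNR: G = xi / (1 + xi).

    Negative SNR estimates are clamped to zero first, which is an
    assumption of this sketch (the network output layer is linear).
    """
    xi = np.maximum(xi, 0.0)
    return xi / (1.0 + xi)

xi_hat = np.array([0.0, 1.0, 9.0])            # hypothetical a priori SNR estimates
Y = np.array([1 + 1j, 2 + 0j, 0 - 4j])        # hypothetical beamforming outputs
Z_out = wiener_gain(xi_hat) * Y               # enhanced output per bin
```

Bins with low estimated SNR are suppressed (gain 0 at ξ = 0), while bins dominated by the target pass almost unchanged (gain 0.9 at ξ = 9).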

[Sixth embodiment]
The fifth embodiment described a configuration in which an N-layer deep neural network that takes the local PSD ^Φ_{S,ω,τ} as input and outputs the estimate ^ξ_{ω,τ} of the a priori SNR is designed for each frequency bin and each band. The sixth embodiment describes a configuration using an N-layer deep neural network that takes a plurality of beamforming outputs as input, as in the following equation, and outputs the estimate of the a priori SNR.

As shown in FIG. 15, the sound source enhancement device of the sixth embodiment includes, as in the fifth embodiment, M microphones 10-1 to 10-M, M frequency domain transform units 11-1 to 11-M, L beamforming units 12-1 to 12-L, and a filtering unit 16, and additionally includes a DNN mapping unit 23. It does not include the local PSD estimation unit 13 of the fifth embodiment; instead, the outputs of the beamforming units 12-1 to 12-L are input to the DNN mapping unit 23. The sound source enhancement method of the sixth embodiment is realized by this device performing the processing of each step described below.

In step S23, the DNN mapping unit 23 receives the output signals Y_{ζ(1),ω,τ}, …, Y_{ζ(L),ω,τ} of the beamforming units 12-1 to 12-L, uses the network parameters z_{ωi} to obtain the estimate ^ξ_{ω,τ} of the a priori SNR, and outputs the result. The DNN mapping unit 23 is optimized in the same way as in the fifth embodiment, using teacher information that consists of Θ samples of training beamforming outputs and the corresponding correct values of the a priori SNR.

[Seventh embodiment]
In the fifth and sixth embodiments, the initial values of the network parameters z_{ωi} were set randomly. The seventh embodiment describes a method of setting the initial values of z_{ωi} by taking physical characteristics into account, as in the conventional method. The conventional a priori SNR estimation technique consists broadly of the following three steps.

(Step 1: local PSD estimation) As in Eq. (97), the local PSD estimate ^Φ_{S,ω,τ} is obtained from the output powers φ_{Yζ(i),ωi,τ} of two or more beamformers.

(Step 2: addition) As in Eqs. (98) and (99), the PSDs of the target sound and of the noise in the beamforming output, ^φ_{S,ω,τ} and ^φ_{N,ω,τ}, are output.

(Step 3: log-domain ratio calculation) As in Eq. (124), the estimate ^ξ_{ω,τ} of the a priori SNR is output from ^φ_{S,ω,τ} and ^φ_{N,ω,τ} as follows.
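Steps 2 and 3 above can be sketched as a sum followed by a ratio taken in the log domain. Eqs. (98), (99), and (124) are not reproduced here, so the details are assumptions: the weights (such as α_{N,k,ω}) are taken to be 1, region 0 is treated as the target region, and a small floor guards the logarithm.

```python
import numpy as np

def prior_snr_log_domain(phi_s, phi_n, floor=1e-12):
    """A priori SNR as exp(log(phi_s) - log(phi_n)).

    The floor keeps the logarithm finite and is an assumption of this sketch.
    """
    log_ratio = np.log(np.maximum(phi_s, floor)) - np.log(np.maximum(phi_n, floor))
    return np.exp(log_ratio)

# Hypothetical local PSDs: region 0 is the target, the others are noise
local_psd = np.array([4.0, 0.5, 1.5])
phi_s = local_psd[0]               # step 2: target PSD (unit weights assumed)
phi_n = local_psd[1:].sum()        # step 2: noise PSD as an unweighted sum
xi = prior_snr_log_domain(phi_s, phi_n)   # step 3: log-domain ratio
```

Writing the ratio as a difference of logarithms is what lets a layer with one positive and one negative weight reproduce it.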

Because these three steps can be regarded as physically corresponding to the processing of the individual layers, better initial values of the network parameters can be chosen than by random initialization. In the seventh embodiment, the sound source enhancement device of the sixth embodiment is configured to set the initial values of the network parameters z_{ωi} in this way. The optimization is the same as in the sixth embodiment.

As shown in FIG. 16, the sound source enhancement device of the seventh embodiment includes, as in the sixth embodiment, M microphones 10-1 to 10-M, M frequency domain transform units 11-1 to 11-M, L beamforming units 12-1 to 12-L, a filtering unit 16, and a DNN mapping unit 23, and additionally includes an initial value setting unit 33. The sound source enhancement method of the seventh embodiment is realized by this device performing the processing of each step described below.

The initial value setting unit 33 sets the initial values of the network parameters for each layer of the DNN mapping unit 23 as follows. As in the sixth embodiment, the values of the input layer are set as in the following equation.

The processing of the second layer corresponds to (Step 1: local PSD estimation). The number of nodes in the second layer is chosen so that J_2 ≥ L×Q, where L is the number of beamformers and Q is the number of frequency bins. The elements of D_ω^{-1} in Eq. (97) are defined as follows.

Here, ·^{-1} denotes the inverse matrix when L = K and the pseudo-inverse when L ≠ K.

The initial values of the second layer can be set by substituting the corresponding coefficients into the network parameters.

G_{ωi,2} and B_{ωi,2} are value-range adjustment coefficients. G_{ωi,2} is set so that the maximum value of Z_{ωi}^{(2)} q_{ωi}^{(1)} is about 1 to 5. B_{ωi,2} is set between -5 and 0 so that the value is floored near 0 when the output of Z_{ωi}^{(2)} q_{ωi}^{(1)} is 0 or less. The output q_{ωi}^{(2)} of the second layer is then obtained by the following calculation.

The processing of the third layer corresponds to (Step 2: addition). The addition can be expressed by writing the network parameters as follows. The number of nodes in the third layer is chosen so that J_3 ≥ 2.

G_{ωi,3} and B_{ωi,3} are value-range adjustment coefficients. G_{ωi,3} is set so that the maximum value of Z_{ωi}^{(3)} q_{ωi}^{(2)} is about 1 to 5. A value of about 0 is adequate for B_{ωi,3}. The output q_{ωi}^{(3)} of the third layer is then obtained by calculating Eqs. (133) and (134).

Here, to correspond to Eq. (124), Z_{ωi,1,1}^{(4)} is restricted to a positive value and Z_{ωi,1,2}^{(4)} to a negative value. For example, the values are determined as follows.

The values of ^φ_{S,ω,τ} and ^φ_{N,ω,τ} referred to here may be those computed from a single sample, or the averages of values computed from many samples. The adjustment coefficient g_{ωi,4} is obtained as in the following equation.

In the initialization method described above, the number of layers should preferably be at least the minimum number of signal processing operations plus one, so N ≥ 4 is desirable. The description above assumed N = 4; if N is to be larger than 4, redundant layers can be inserted. Here, the minimum number of signal processing operations means the number of operations, such as additions and multiplications, required to carry out the equivalent signal processing (here, a priori SNR estimation) with the conventional deterministic method.

[Eighth embodiment]
The seventh embodiment described a configuration in which the initial values of the network parameters z_{ωi} are set in the sound source enhancement device of the sixth embodiment. In the eighth embodiment, the sound source enhancement device of the fifth embodiment is configured to set the initial values of the network parameters z_{ωi}. The optimization is the same as in the fifth embodiment.

As shown in FIG. 17, the sound source enhancement device of the eighth embodiment includes, as in the fifth embodiment, M microphones 10-1 to 10-M, M frequency domain transform units 11-1 to 11-M, L beamforming units 12-1 to 12-L, a local PSD estimation unit 13, a DNN mapping unit 22, and a filtering unit 16, and additionally includes an initial value setting unit 32. The sound source enhancement method of the eighth embodiment is realized by this device performing the processing of each step described below.

The seventh embodiment explained that the processing of the second layer corresponds to local PSD estimation. In the eighth embodiment, where the local PSD is already the network input, it therefore suffices to use the initial value settings for the third and subsequent layers. As in the fifth embodiment, the values of the input layer are set as in the following equation.

That is, the processing from Eq. (131) onward is set as the initial values of the second and third layers. Although N ≥ 4 was recommended in the seventh embodiment, the same principle of setting the number of layers to at least the minimum number of signal processing operations plus one makes N ≥ 3 the desirable setting in the eighth embodiment.

As in the seventh and eighth embodiments, setting the initial values of the network parameters appropriately makes it possible to optimize the parameters while each layer retains its physical meaning. As a result, the likelihood of outputting outliers can be expected to decrease even when the training data is somewhat limited. In other words, a deep neural network that is less dependent on the environment can be designed even with little training data.

[Ninth embodiment]
The ninth embodiment describes a sound source information estimation technique as a superordinate concept that encompasses the direct-to-reverberant ratio estimation technique described in the first to fourth embodiments and the sound source enhancement technique described in the fifth to eighth embodiments.

The sound source information estimation device of the ninth embodiment includes, for example, a sound source feature extraction unit and a sound source information estimation unit. The sound source information estimation method of the ninth embodiment is realized by this device performing the processing of each step described below.

The sound source feature extraction unit extracts the sound source features of each frequency domain observation signal from a plurality of frequency domain observation signals picked up by enhancing sound arriving from a plurality of angular regions in different directions. In the direct-to-reverberant ratio estimation devices of the first and fourth embodiments, it corresponds to the beamforming units 12-1 to 12-2 and the local PSD estimation unit 13; in those of the second and third embodiments, it corresponds to the beamforming units 12-1 to 12-2. In the sound source enhancement devices of the fifth and eighth embodiments, it corresponds to the beamforming units 12-1 to 12-L and the local PSD estimation unit 13; in those of the sixth and seventh embodiments, it corresponds to the beamforming units 12-1 to 12-L.

The sound source information estimation unit inputs the sound source features of each frequency domain observation signal into a statistical mapping model to obtain an estimate of the sound source information. The parameters of the statistical mapping model have been trained using sound source features extracted from a plurality of frequency domain acoustic signals, picked up by enhancing sound arriving from a plurality of angular regions in different directions, together with the correct values of the sound source information obtained from those signals. In the direct-to-reverberant ratio estimation devices of the first and fourth embodiments, the sound source information estimation unit corresponds to the DNN mapping unit 20; in those of the second and third embodiments, it corresponds to the DNN mapping unit 21. In the sound source enhancement devices of the fifth and eighth embodiments, it corresponds to the DNN mapping unit 22; in those of the sixth and seventh embodiments, it corresponds to the DNN mapping unit 23.

The embodiments above described examples in which the statistical mapping model is a deep neural network. However, the statistical mapping model in the direct-to-reverberant ratio estimation technique of the first and second embodiments, the sound source enhancement technique of the fifth and sixth embodiments, and the sound source information estimation technique of the ninth embodiment is not limited to a deep neural network, and other statistical mapping models can be used. One example is the Gaussian mixture model (GMM). The direct-to-reverberant ratio estimation technique of the third and fourth embodiments and the sound source enhancement technique of the seventh and eighth embodiments, by contrast, concern setting the initial values of the network parameters of a deep neural network, so in those cases the statistical mapping model is limited to a deep neural network.

[Tenth embodiment]
The embodiments above placed no particular restriction on the hardware structure of the microphone array. This embodiment restricts the hardware structure of the microphone array so that it has symmetry, with the aim of making the trained network parameters of the deep neural network more robust and improving the estimation performance for the sound source information. The processing procedure is the same as in each of the preceding embodiments except for the restriction on the hardware configuration. The following therefore describes concrete examples of symmetric microphone array configurations and explains why such a configuration makes the network parameters of the deep neural network more robust.

FIG. 18 shows examples of array structures that have symmetry. Here, symmetry refers to point symmetry in two- or three-dimensional space. For example, when M microphones are arranged in a straight line, the array is one-dimensional and so cannot have this symmetry. For a two-dimensional structure, arranging the microphones at equal intervals on a circle (that is, at the vertices of a regular polygon) qualifies. For a three-dimensional structure, placing the microphones at the vertices of a regular polyhedron qualifies, for example. FIG. 18 shows an equilateral triangle, a square, a regular hexagon, and a regular octagon as two-dimensional examples, and a regular tetrahedron, hexahedron, octahedron, dodecahedron, and icosahedron as three-dimensional examples, but the structure is not limited to these. When the microphones themselves are directional, the orientation of the elements is constrained so that the symmetry is preserved.
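The two-dimensional layouts described above can be generated and checked with a short sketch. The check used here, that rotating the array by 2π/M about its centre maps the set of microphone positions onto itself, is one way to interpret the symmetry requirement and is an assumption, as are the 4 cm radius and the function names.

```python
import numpy as np

def circular_array(num_mics, radius=0.04):
    """Microphone positions at the vertices of a regular polygon."""
    angles = 2 * np.pi * np.arange(num_mics) / num_mics
    return radius * np.column_stack([np.cos(angles), np.sin(angles)])

def is_rotation_invariant(mic_xy, tol=1e-9):
    """True if rotating by 2*pi/M maps the position set onto itself."""
    step = 2 * np.pi / len(mic_xy)
    R = np.array([[np.cos(step), -np.sin(step)],
                  [np.sin(step),  np.cos(step)]])
    rotated = mic_xy @ R.T
    # Distance from every rotated position to every original position
    dists = np.linalg.norm(rotated[:, None, :] - mic_xy[None, :, :], axis=2)
    return bool(np.all(dists.min(axis=1) < tol))

square = circular_array(4)                                           # regular polygon
line = np.column_stack([np.linspace(-0.06, 0.06, 4), np.zeros(4)])   # linear array
```

The square array passes the check and the linear array fails it, matching the statement that a straight-line arrangement cannot have the required symmetry.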

The effect of giving the microphone array a symmetric structure is explained using speech enhancement as an example. As described in «Sound source enhancement method 1: beamforming», the basic approach to enhancing the target sound is to perform beamforming as in Eq. (84) and then apply Wiener filtering as in Eq. (93). This requires the PSDs of the target sound and the noise, or their ratio, the a priori SNR ξ_{ω,τ}, which are obtained by a computation such as Eq. (97). The embodiments have used a deep neural network, which in essence amounts to estimating this flow automatically. Giving the microphone array a symmetric structure makes the sensitivity matrix D_ω the same regardless of the direction of arrival of the target sound. The processing flow for enhancing the target sound is then determined independently of that direction. Consequently, when the flow is learned and estimated by a deep neural network, the network parameters are also determined independently of the direction of arrival, and training progresses without preparing large amounts of data for sound arriving from each specific direction. This presupposes that the direction of arrival of the target sound is known when beamforming is performed. In this way, using a symmetric microphone array makes the deep neural network more robust and further improves the estimation performance for the sound source information.

This invention is not limited to the embodiments described above, and it goes without saying that modifications can be made as appropriate without departing from the spirit of the invention. The various processes described in the embodiments need not be executed in time series in the order described; they may also be executed in parallel or individually, depending on the processing capability of the device executing them or as needed.

[Program, recording medium]
When the various processing functions of each device described in the embodiments are implemented by a computer, the processing content of the functions that each device should have is described by a program. Executing this program on a computer realizes the processing functions of each device on the computer.

The program describing this processing content can be recorded on a computer-readable recording medium. The computer-readable recording medium may be of any kind, for example a magnetic recording device, an optical disc, a magneto-optical recording medium, or a semiconductor memory.

The program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. The program may also be distributed by storing it in a storage device of a server computer and transferring it from the server computer to another computer via a network.

A computer that executes such a program first stores, for example, the program recorded on the portable recording medium or the program transferred from the server computer in its own storage device. When executing the processing, the computer reads the program stored in its own recording medium and executes processing according to the read program. As another form of execution, the computer may read the program directly from the portable recording medium and execute processing according to it, or may sequentially execute processing according to the received program each time the program is transferred to it from the server computer. The above processing may also be executed by a so-called ASP (Application Service Provider) type service, which realizes the processing functions only through execution instructions and result acquisition, without transferring the program from the server computer to the computer. The program in this embodiment includes information that is provided for processing by an electronic computer and is equivalent to a program (such as data that is not a direct command to the computer but has the property of defining the computer's processing).

In this embodiment, the apparatus is configured by executing a predetermined program on a computer, but at least part of the processing contents may instead be realized in hardware.

10-1 to 10-M  Microphones
11-1 to 11-M  Frequency domain transform units
12-1 to 12-L  Beamforming units
13  Local PSD estimation unit
14  Power ratio estimation unit
15  A priori SNR calculation unit
16  Filtering unit
20 to 23  DNN mapping units
30 to 33  Initial value setting units

Claims

A sound source information estimation device comprising:
a sound source feature extraction unit that receives a plurality of frequency domain observation signals, each picked up by emphasizing sound arriving from one of a plurality of angular regions in mutually different directions, and extracts the power spectral density of each frequency domain observation signal; and
a sound source information estimation unit that inputs the power spectral density of each frequency domain observation signal to a statistical mapping model to obtain an estimate of the direct-to-reverberant ratio,
wherein the statistical mapping model is a deep neural network of N+1 or more layers, N being the number of unit signal processing operations constituting a predetermined direct-to-reverberant ratio estimation process, whose parameters have been learned using power spectral densities extracted from a plurality of frequency domain acoustic signals, each picked up by emphasizing sound arriving from one of a plurality of angular regions in mutually different directions, and the correct value of the direct-to-reverberant ratio obtained from each frequency domain acoustic signal,
the input layer of the deep neural network is set to the power spectral density of each frequency domain observation signal, and
the initial values of the network parameters of each intermediate layer of the deep neural network are set in consideration of physical characteristics so that, through learning, the intermediate layers come to consist of a first layer corresponding to local PSD estimation processing, a second layer corresponding to frequency addition processing, and a third layer corresponding to logarithmic-domain ratio calculation processing.
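
The three processing stages that the intermediate layers are initialized to reproduce can be sketched as a deterministic chain: local PSD estimation, addition over frequency, and a ratio in the logarithmic domain. The sketch below is only an illustration under stated assumptions, not the claimed initial-value formulas: it assumes that each beamforming output PSD is a linear mixture of the direct-sound and reverberation PSDs through a known sensitivity matrix, that local PSD estimation is a per-bin pseudo-inverse, and that the ratio is expressed as 10 log10. The function name, array shapes, and flooring constant are hypothetical.

```python
import numpy as np

def estimate_drr_db(p_bf, sens, floor=1e-12):
    """Deterministic direct-to-reverberant ratio (DRR) estimate in dB.

    p_bf : (Q, L) PSDs of L beamforming outputs in Q frequency bins.
    sens : (Q, L, 2) assumed sensitivities of each beamformer to the
           direct sound (index 0) and the reverberation (index 1).
    """
    # Stage 1: local PSD estimation, a least-squares inversion per bin.
    pinv = np.linalg.pinv(sens)                    # shape (Q, 2, L)
    local = np.einsum('qkl,ql->qk', pinv, p_bf)    # shape (Q, 2)
    local = np.maximum(local, floor)               # keep PSDs positive
    # Stage 2: frequency addition of the direct and reverberant PSDs.
    totals = local.sum(axis=0)                     # shape (2,)
    # Stage 3: ratio computed as a difference in the logarithmic domain.
    return 10.0 * np.log10(totals[0]) - 10.0 * np.log10(totals[1])

# Synthetic check with a known ratio of 4 between direct and reverberant PSD.
rng = np.random.default_rng(0)
Q, L = 8, 4
sens = rng.uniform(0.1, 1.0, size=(Q, L, 2))
p_direct = np.full(Q, 4.0)
p_reverb = np.full(Q, 1.0)
p_bf = sens[:, :, 0] * p_direct[:, None] + sens[:, :, 1] * p_reverb[:, None]
drr = estimate_drr_db(p_bf, sens)
print(drr)
```

With noise-free synthetic data the result should be close to 10 log10(4), about 6.02 dB. In a network, each stage would become a layer whose weights start from such values and are then refined by learning.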

A sound source information estimation device comprising:
a sound source feature extraction unit that generates a plurality of frequency domain observation signals from an input observation signal by emphasizing sound arriving from each of a plurality of angular regions in mutually different directions, and extracts the power of each frequency domain observation signal;
a local PSD estimation unit that estimates the power spectral density of each frequency domain observation signal using the power of each frequency domain observation signal; and
a sound source information estimation unit that inputs the power spectral density of each frequency domain observation signal to a statistical mapping model to obtain an estimate of the direct-to-reverberant ratio,
wherein the statistical mapping model is a deep neural network of N+1 or more layers, N being the number of unit signal processing operations constituting a predetermined direct-to-reverberant ratio estimation process, whose parameters have been learned using power spectral densities extracted from a plurality of frequency domain acoustic signals, each picked up by emphasizing sound arriving from one of a plurality of angular regions in mutually different directions, and the correct value of the direct-to-reverberant ratio obtained from each frequency domain acoustic signal,
the input layer of the deep neural network is set to the power spectral density of each frequency domain observation signal, and
the initial values of the network parameters of each intermediate layer of the deep neural network are set in consideration of physical characteristics so that, through learning, the intermediate layers come to consist of a first layer corresponding to frequency addition processing and a second layer corresponding to logarithmic-domain ratio calculation processing.

A sound source information estimation device comprising:
a sound source feature extraction unit that receives a plurality of frequency domain observation signals, each picked up by emphasizing sound arriving from one of a plurality of angular regions in mutually different directions, and extracts the power spectral density of each frequency domain observation signal;
a sound source information estimation unit that inputs the power spectral density of each frequency domain observation signal to a statistical mapping model to obtain an estimate of the SN ratio; and
a filtering unit that calculates a gain coefficient for each frequency band from the estimate of the SN ratio and multiplies the power spectral density of the corresponding frequency band of the frequency domain observation signal by the gain coefficient,
wherein the statistical mapping model is a deep neural network of N+1 or more layers, N being the number of signal processing operations constituting a predetermined SN ratio estimation process, whose parameters have been learned using power spectral densities extracted from a plurality of frequency domain acoustic signals, each picked up by emphasizing sound arriving from one of a plurality of angular regions in mutually different directions, and the correct value of the SN ratio obtained from each frequency domain acoustic signal,
the input layer of the deep neural network is set to the power spectral density of each frequency domain observation signal, and
the initial values of the network parameters of each intermediate layer of the deep neural network are set in consideration of physical characteristics so that, through learning, the intermediate layers come to consist of a first layer corresponding to local PSD estimation processing, a second layer corresponding to addition processing, and a third layer corresponding to logarithmic-domain ratio calculation processing.
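
The operation of the filtering unit can be illustrated with a Wiener-type gain, the form the description mentions for enhancement after beamforming. The following is a minimal sketch under assumptions that the claim does not state: the SN ratio estimate is taken to be given in dB for each frequency band, and the gain is taken to be ξ/(1+ξ) with ξ the linear SN ratio. The function name and the optional gain floor are hypothetical.

```python
import numpy as np

def apply_wiener_gain(band_psd, snr_db, min_gain=0.0):
    """Scale each frequency band by a Wiener-type gain derived from the SN ratio.

    band_psd : per-band values of the observation signal (e.g. its PSD).
    snr_db   : per-band SN ratio estimates in dB (assumed convention).
    min_gain : optional lower bound on the gain (assumption, default off).
    """
    xi = 10.0 ** (np.asarray(snr_db, dtype=float) / 10.0)  # dB -> linear ratio
    gain = np.maximum(xi / (1.0 + xi), min_gain)           # Wiener gain per band
    return gain * np.asarray(band_psd, dtype=float), gain

# Three bands with equal input level and SN ratios of -10, 0 and +10 dB.
filtered, gain = apply_wiener_gain([2.0, 2.0, 2.0], [-10.0, 0.0, 10.0])
print(gain)      # gains 1/11, 1/2 and 10/11
print(filtered)  # the band values scaled by those gains
```

Bands with a low estimated SN ratio are attenuated strongly, while bands dominated by the target sound pass almost unchanged.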

A sound source information estimation device comprising:
a sound source feature extraction unit that generates a plurality of frequency domain observation signals from an input observation signal by emphasizing sound arriving from each of a plurality of angular regions in mutually different directions, and extracts the power of each frequency domain observation signal;
a local PSD estimation unit that estimates the power spectral density of each frequency domain observation signal using the power of each frequency domain observation signal;
a sound source information estimation unit that inputs the power spectral density of each frequency domain observation signal to a statistical mapping model to obtain an estimate of the SN ratio; and
a filtering unit that calculates a gain coefficient for each frequency band from the estimate of the SN ratio and multiplies the power spectral density of the corresponding frequency band of the frequency domain observation signal by the gain coefficient,
wherein the statistical mapping model is a deep neural network of N+1 or more layers, N being the number of signal processing operations constituting a predetermined SN ratio estimation process, whose parameters have been learned using power spectral densities extracted from a plurality of frequency domain acoustic signals, each picked up by emphasizing sound arriving from one of a plurality of angular regions in mutually different directions, and the correct value of the SN ratio obtained from each frequency domain acoustic signal,
the input layer of the deep neural network is set to the power spectral density of each frequency domain observation signal, and
the initial values of the network parameters of each intermediate layer of the deep neural network are set in consideration of physical characteristics so that, through learning, the intermediate layers come to consist of a first layer corresponding to addition processing and a second layer corresponding to logarithmic-domain ratio calculation processing.

The sound source information estimation device according to claim 1,
where $Q$ is the number of frequency bins, $\omega$ is the frequency bin index, $P_{BF,l,\omega}$ is the power spectral density of the $l$-th beamforming output, $G_{l,\Omega,\omega}$ is the sensitivity of the $l$-th beamforming filter to direction $\Omega$, $P_{D,\omega}$ is the power spectral density of the direct sound, $P_{R,\omega}$ is the power spectral density of the reverberation, $f^{(n)}$ is the activation function of the $n$-th layer, $J_n$ is the size of the $n$-th layer, $g_2$, $g_3$, $B_2$, $B_3$ are predetermined value-range adjustment coefficients, and $\cdot^T$ denotes transposition,
The input layer of the above deep neural network is set to the following formula,

The intermediate layer of the deep neural network is
The initial values of the first layer network parameters are set as

The output of the first layer is calculated by

The initial values of the second layer network parameters are set to

The output of the second layer is calculated by

The initial value of the network parameter of the third layer is set as follows:

Sound source information estimation device.

The sound source information estimation device according to claim 2,
where $Q$ is the number of frequency bins, $\omega$ is the frequency bin index, $P_{D,\omega}$ is the power spectral density of the direct sound, $P_{R,\omega}$ is the power spectral density of the reverberation, $f^{(n)}$ is the activation function of the $n$-th layer, $J_n$ is the size of the $n$-th layer, $g_3$ and $B_3$ are predetermined value-range adjustment coefficients, $\cdot^T$ denotes transposition, and $\hat{\cdot}$ denotes an estimated value,
The input layer of the above deep neural network is set to the following formula,

The intermediate layer of the deep neural network is
The initial values of the first layer network parameters are set as

The output of the first layer is calculated by

The initial values of the network parameters of the second layer are set as follows:

Sound source information estimation device.

The sound source information estimation device according to claim 3,
where $L$ is the number of beamformers, $K$ is the number of sound sources, $Q$ is the number of frequency bins, $\omega$ is the frequency bin index, $\tau$ is the frame index, $\phi_{Y_{\zeta(l)},\omega,\tau}$ is the output of the $l$-th beamforming, $D_{\zeta(l),k,\omega}$ is the average spatial sensitivity of the $l$-th beamforming to the position of the $k$-th sound source, and $\psi_{k,l,\omega}$ is a value defined by the following formula,

$\hat{\phi}_{S,\omega,\tau}$ is an estimate of the power spectral density of the target sound, $\hat{\phi}_{N,\omega,\tau}$ is an estimate of the power spectral density of the noise, $\alpha_{N,k,\omega}$ is a predetermined coefficient, $f^{(n)}$ is the activation function of the $n$-th layer, $J_n$ is the size of the $n$-th layer, $\cdot^T$ denotes transposition, and $g_{\omega_i,2}$, $g_{\omega_i,3}$, $B_{\omega_i,2}$, $B_{\omega_i,3}$ are predetermined value-range adjustment coefficients,
The input layer of the above deep neural network is set to the following formula,

The intermediate layer of the deep neural network is
The initial values of the first layer network parameters are set as

The output of the first layer is calculated by

The initial values of the second layer network parameters are set to

The output of the second layer is calculated by

The initial value of the network parameter of the third layer is set as follows:

Sound source information estimation device.

The sound source information estimation device according to claim 4,
where $K$ is the number of sound sources, $Q$ is the number of frequency bins, $\omega$ is the frequency bin index, $\tau$ is the frame index, $\hat{\phi}_{S,\omega,\tau}$ is an estimate of the power spectral density of the target sound, $\hat{\phi}_{N,\omega,\tau}$ is an estimate of the power spectral density of the noise, $\alpha_{N,k,\omega}$ is a predetermined coefficient, $f^{(n)}$ is the activation function of the $n$-th layer, $J_n$ is the size of the $n$-th layer, $\cdot^T$ denotes transposition, and $g_{\omega_i,3}$, $B_{\omega_i,3}$ are predetermined value-range adjustment coefficients,
The input layer of the above deep neural network is set to the following formula,

The intermediate layer of the deep neural network is
The initial values of the first layer network parameters are set as

The output of the first layer is calculated by

The initial values of the network parameters of the second layer are set as follows:

Sound source information estimation device.

The sound source information estimation device according to any one of claims 1, 3, 5, and 7,
wherein the deep neural network is configured such that the value of the input layer is input to the first layer, the output of the first layer is input to the second layer, the output of the second layer is input to the third layer, and the output is calculated based on the output of the third layer.

The sound source information estimation device according to any one of claims 2, 4, 6, and 8,
wherein the deep neural network is configured such that the value of the input layer is input to the first layer, the output of the first layer is input to the second layer, and the output is calculated based on the output of the second layer.

The sound source information estimation device according to any one of claims 1 to 10,
wherein the plurality of frequency domain acoustic signals are picked up using a microphone array in which the microphones are arranged at the vertex positions of a regular polygon or a regular polyhedron.

A sound source information estimation method comprising:
a sound source feature extraction step in which a sound source feature extraction unit receives a plurality of frequency domain observation signals, each picked up by emphasizing sound arriving from one of a plurality of angular regions in mutually different directions, and extracts the power spectral density of each frequency domain observation signal; and
a sound source information estimation step in which a sound source information estimation unit inputs the power spectral density of each frequency domain observation signal to a statistical mapping model to obtain an estimate of the direct-to-reverberant ratio,
wherein the statistical mapping model is a deep neural network of N+1 or more layers, N being the number of unit signal processing operations constituting a predetermined direct-to-reverberant ratio estimation process, whose parameters have been learned using power spectral densities extracted from a plurality of frequency domain acoustic signals, each picked up by emphasizing sound arriving from one of a plurality of angular regions in mutually different directions, and the correct value of the direct-to-reverberant ratio obtained from each frequency domain acoustic signal,
the input layer of the deep neural network is set to the power spectral density of each frequency domain observation signal, and
the initial values of the network parameters of each intermediate layer of the deep neural network are set in consideration of physical characteristics so that, through learning, the intermediate layers come to consist of a first layer corresponding to local PSD estimation processing, a second layer corresponding to frequency addition processing, and a third layer corresponding to logarithmic-domain ratio calculation processing.

A sound source information estimation method comprising:
a sound source feature extraction step in which a sound source feature extraction unit generates a plurality of frequency domain observation signals from an input observation signal by emphasizing sound arriving from each of a plurality of angular regions in mutually different directions, and extracts the power of each frequency domain observation signal;
a local PSD estimation step in which a local PSD estimation unit estimates the power spectral density of each frequency domain observation signal using the power of each frequency domain observation signal; and
a sound source information estimation step in which a sound source information estimation unit inputs the power spectral density of each frequency domain observation signal to a statistical mapping model to obtain an estimate of the direct-to-reverberant ratio,
wherein the statistical mapping model is a deep neural network of N+1 or more layers, N being the number of unit signal processing operations constituting a predetermined direct-to-reverberant ratio estimation process, whose parameters have been learned using power spectral densities extracted from a plurality of frequency domain acoustic signals, each picked up by emphasizing sound arriving from one of a plurality of angular regions in mutually different directions, and the correct value of the direct-to-reverberant ratio obtained from each frequency domain acoustic signal,
the input layer of the deep neural network is set to the power spectral density of each frequency domain observation signal, and
the initial values of the network parameters of each intermediate layer of the deep neural network are set in consideration of physical characteristics so that, through learning, the intermediate layers come to consist of a first layer corresponding to frequency addition processing and a second layer corresponding to logarithmic-domain ratio calculation processing.

A sound source information estimation method comprising:
a sound source feature extraction step in which a sound source feature extraction unit receives a plurality of frequency domain observation signals, each picked up by emphasizing sound arriving from one of a plurality of angular regions in mutually different directions, and extracts the power spectral density of each frequency domain observation signal;
a sound source information estimation step in which a sound source information estimation unit inputs the power spectral density of each frequency domain observation signal to a statistical mapping model to obtain an estimate of the SN ratio; and
a filtering step in which a filtering unit calculates a gain coefficient for each frequency band from the estimate of the SN ratio and multiplies the power spectral density of the corresponding frequency band of the frequency domain observation signal by the gain coefficient,
wherein the statistical mapping model is a deep neural network of N+1 or more layers, N being the number of signal processing operations constituting a predetermined SN ratio estimation process, whose parameters have been learned using power spectral densities extracted from a plurality of frequency domain acoustic signals, each picked up by emphasizing sound arriving from one of a plurality of angular regions in mutually different directions, and the correct value of the SN ratio obtained from each frequency domain acoustic signal,
the input layer of the deep neural network is set to the power spectral density of each frequency domain observation signal, and
the initial values of the network parameters of each intermediate layer of the deep neural network are set in consideration of physical characteristics so that, through learning, the intermediate layers come to consist of a first layer corresponding to local PSD estimation processing, a second layer corresponding to addition processing, and a third layer corresponding to logarithmic-domain ratio calculation processing.

A sound source information estimation method comprising:
a sound source feature extraction step in which a sound source feature extraction unit generates a plurality of frequency domain observation signals from an input observation signal by emphasizing sound arriving from each of a plurality of angular regions in mutually different directions, and extracts the power of each frequency domain observation signal;
a local PSD estimation step in which a local PSD estimation unit estimates the power spectral density of each frequency domain observation signal using the power of each frequency domain observation signal;
a sound source information estimation step in which a sound source information estimation unit inputs the power spectral density of each frequency domain observation signal to a statistical mapping model to obtain an estimate of the SN ratio; and
a filtering step in which a filtering unit calculates a gain coefficient for each frequency band from the estimate of the SN ratio and multiplies the power spectral density of the corresponding frequency band of the frequency domain observation signal by the gain coefficient,
wherein the statistical mapping model is a deep neural network of N+1 or more layers, N being the number of signal processing operations constituting a predetermined SN ratio estimation process, whose parameters have been learned using power spectral densities extracted from a plurality of frequency domain acoustic signals, each picked up by emphasizing sound arriving from one of a plurality of angular regions in mutually different directions, and the correct value of the SN ratio obtained from each frequency domain acoustic signal,
the input layer of the deep neural network is set to the power spectral density of each frequency domain observation signal, and
the initial values of the network parameters of each intermediate layer of the deep neural network are set in consideration of physical characteristics so that, through learning, the intermediate layers come to consist of a first layer corresponding to addition processing and a second layer corresponding to logarithmic-domain ratio calculation processing.

A program for causing a computer to function as the sound source information estimation device according to any one of claims 1 to 11.