WO2021100136A1 - Sound source signal estimation device, sound source signal estimation method, and program - Google Patents

Sound source signal estimation device, sound source signal estimation method, and program

Info

Publication number
WO2021100136A1
Authority
WO
WIPO (PCT)
Prior art keywords
sound source
signal
separation filter
source signal
sound
Application number
PCT/JP2019/045392
Other languages
French (fr)
Japanese (ja)
Inventor
江村 暁 (Satoru Emura)
Original Assignee
日本電信電話株式会社 (Nippon Telegraph and Telephone Corporation)
Application filed by Nippon Telegraph and Telephone Corporation (日本電信電話株式会社)
Priority to PCT/JP2019/045392 priority Critical patent/WO2021100136A1/en
Publication of WO2021100136A1 publication Critical patent/WO2021100136A1/en

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272 Voice signal separating
    • G10L21/028 Voice signal separating using properties of sound source

Definitions

  • The program describing this processing content can be recorded on a computer-readable recording medium.
  • The computer-readable recording medium may be, for example, a magnetic recording device, an optical disc, a magneto-optical recording medium, a semiconductor memory, or the like.
  • Specifically, a hard disk device, a flexible disk, a magnetic tape, or the like can be used as the magnetic recording device; a DVD (Digital Versatile Disc), a DVD-RAM (Random Access Memory), a CD-ROM (Compact Disc Read Only Memory), a CD-R (Recordable)/RW (ReWritable), or the like as the optical disc; an MO (Magneto-Optical disc) or the like as the magneto-optical recording medium; and an EEP-ROM (Electronically Erasable and Programmable-Read Only Memory) or the like as the semiconductor memory.
  • The distribution of this program is carried out, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. The program may also be distributed by storing it in the storage device of a server computer and transferring it from the server computer to another computer via a network.
  • A computer that executes such a program first stores, for example, the program recorded on a portable recording medium or the program transferred from the server computer in its own storage device. When executing the processing, the computer reads the program stored in its own storage device and executes the processing according to the read program. As other forms of executing the program, the computer may read the program directly from the portable recording medium and execute the processing according to the program, or may sequentially execute the processing according to the received program each time the program is transferred from the server computer to this computer. The above processing may also be executed by a so-called ASP (Application Service Provider) type service that realizes the processing functions only through execution instructions and result acquisition, without transferring the program from the server computer to this computer.
  • The program in this embodiment includes information to be used for processing by a computer that is equivalent to a program (such as data that is not a direct command to the computer but has a property of defining the processing of the computer).
  • In this embodiment, the hardware entity is configured by executing a prescribed program on a computer, but at least a part of the processing content may instead be realized in hardware.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

Provided is a sound source signal estimation technique that can improve estimation accuracy by using the sparseness of a sound source signal as an evaluation criterion. A sound source signal estimation device according to the present invention comprises: a whitening unit that generates a whitened sound collection signal vector u(f, ω) (where u(f, ω) satisfies u(f, ω) = T(ω)y(f, ω) and E[u(f, ω)u^H(f, ω)] = I for a prescribed matrix T(ω)) from a sound collection signal vector y(f, ω); a separation filter generation unit that generates a separation filter W(ω) by solving an optimization problem for a cost function F(W(ω)) of the separation filter W(ω) defined using the whitened sound collection signal vector u(f, ω); and a sound source separation unit that generates an estimated sound source signal vector ^s(f, ω) from the sound collection signal vector y(f, ω) using the separation filter W(ω).

Description

Sound source signal estimation device, sound source signal estimation method, and program
 The present invention relates to a technique for estimating a sound source signal.
 In recent years, techniques for separating the signals from a plurality of sound sources (hereinafter referred to as sound source signals) contained in a multi-channel sound collection signal, acquired by installing a plurality of microphones in a sound field, into the individual sound source signals have been actively researched and developed. As an example of such a method, blind source separation (Blind Source Separation; BSS) based on independent component analysis (Independent Component Analysis; ICA) is well known.
 An example of BSS is described below. First, consider the case where M sensors are installed in a sound field containing M sound sources. Each of the M sound sources is called the mth sound source (m=1, …, M), and the signal from the mth sound source (hereinafter referred to as the mth sound source signal) (m=1, …, M) is denoted by s_m(k) (where k represents time). Each of the M sensors is called the nth sensor (n=1, …, M), and the signal obtained by the nth sensor picking up the first sound source signal s_1(k), …, the Mth sound source signal s_M(k) (hereinafter referred to as the nth sound collection signal) (n=1, …, M) is denoted by y_n(k) (where k represents time). Now consider a model (the instantaneous mixing model) in which the nth sound collection signal y_n(k) (n=1, …, M) is described by the following equation.
y_n(k) = Σ_{m=1}^{M} h_{n,m} s_m(k)
 Here, h_{n,m} is a mixing coefficient. Note that the mixing coefficients h_{n,m} are scalars.
 In BSS based on ICA, the signal from the mth sound source is separated by multiplying the nth sound collection signal y_n(k) by a separation coefficient w_{m,n} and taking the sum, as in the following equation, to obtain the mth estimated sound source signal ^s_m(k) (m=1, …, M).
^s_m(k) = Σ_{n=1}^{M} w_{m,n} y_n(k)
 Here, the separation coefficients w_{m,n} are updated so that the individual sound source signals become statistically more independent. The natural gradient method and FastICA are known as such update methods.
 Next, consider the case where microphones, instead of sensors, are installed in the sound field; that is, M microphones are installed in a sound field containing M sound sources. Each of the M microphones is called the nth microphone (n=1, …, M), and the signal obtained by the nth microphone picking up the first sound source signal s_1(k), …, the Mth sound source signal s_M(k) (hereinafter referred to as the nth sound collection signal) (n=1, …, M) is denoted by y_n(k) (where k represents time). Now consider a model (the convolutive mixing model) in which the nth sound collection signal y_n(k) (n=1, …, M) is described by the following equation using convolution.
y_n(k) = Σ_{m=1}^{M} Σ_{p=0}^{P-1} h_{n,m}(p) s_m(k-p)
 Here, h_{n,m}(p) is the impulse response of the acoustic path from the mth sound source to the nth microphone, and P is the length of the impulse response of the acoustic path.
 In BSS, the signal from the mth sound source is separated by the following equation using FIR filters w_{m,n}(q), to obtain the mth estimated sound source signal ^s_m(k) (m=1, …, M).
^s_m(k) = Σ_{n=1}^{M} Σ_{q=0}^{Q-1} w_{m,n}(q) y_n(k-q)
 Here, Q is the filter length of the FIR filters.
 Since the length P of the impulse response of the acoustic path amounts to several thousand taps at 16 kHz sampling for a typical reverberation time of about T_60 = 400 ms, the filter length Q of the FIR filters is also several thousand. The computation of BSS under the convolutive mixing model is therefore far more difficult than that of BSS under the instantaneous mixing model.
 A frequency-domain processing approach is therefore usually applied to BSS under the convolutive mixing model. In this approach, a short-time Fourier transform (Short-Time Fourier Transform; STFT) is applied to the nth sound collection signal y_n(k) to convert it into the frequency domain. The convolutive mixing model is thereby converted into a collection of per-frequency instantaneous mixing models, as in the following equation.
Y_n(f, ω) = Σ_{m=1}^{M} H_{n,m}(ω)S_m(f, ω), i.e., y(f, ω) = H(ω)s(f, ω) with y(f, ω) = [Y_1(f, ω), …, Y_M(f, ω)]^T and s(f, ω) = [S_1(f, ω), …, S_M(f, ω)]^T
 Here, f is the frame number when the signal is divided into frames by the STFT, ω is the frequency, S_m(f, ω) is the mth sound source signal obtained by converting s_m(k) into the frequency domain, H_{n,m}(ω) is the transfer characteristic of the acoustic path from the mth sound source to the nth microphone, obtained by converting h_{n,m}(p) into the frequency domain, and Y_n(f, ω) is the nth sound collection signal obtained by converting y_n(k) into the frequency domain. Also, ・^T represents transpose.
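 As a concrete illustration of this frequency-domain conversion, the following is a minimal sketch using SciPy's STFT; the sampling rate, frame length, and function name are illustrative assumptions, not values fixed by this description.

```python
import numpy as np
from scipy.signal import stft

def to_frequency_domain(y, fs=16000, nperseg=512):
    """y: (M, K) array of M time-domain sound collection signals y_n(k).
    Returns Y with Y[n, w, f] = Y_n(f, omega) for frequency bin w and frame f."""
    _, _, Y = stft(y, fs=fs, nperseg=nperseg, axis=-1)
    return Y

# Example: two microphones, one second of signal.
y = np.random.randn(2, 16000)
Y = to_frequency_domain(y)
print(Y.shape)  # (2, 257, number of frames)
```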
 The separation filter W(ω) is then given by the following equation.
W(ω) = [w_{m,n}(ω)] (the M×M matrix of separation coefficients at frequency ω, applied as ^s(f, ω) = W(ω)y(f, ω))
 The separation filter W(ω) can be updated at each frequency by applying the natural gradient method or FastICA described above as is. This approach is therefore called frequency-domain ICA (Frequency-Domain ICA; FDICA).
 Because FDICA processes each frequency individually, two problems arise. The first, called the scaling problem, is that each sound source signal is estimated with a different gain at each frequency. The second, called the permutation problem, is that the sound sources are estimated in a different order at each frequency.
 The scaling problem has been solved by a method that recovers the sound source signal component at the position of a microphone, focusing on the transfer characteristic between the estimated sound source signal and the sound collection signal of the microphone, and the permutation problem has been solved by a method based on clustering of activity sequences obtained from the estimated sound source signals (see Non-Patent Document 1).
 However, although sound source separation by FDICA yields an estimated sound source signal in which the signal from a given sound source has been separated, the separation performance is often insufficient. This is because crosstalk components of the signals from other sound sources, that is, the signals from other sound sources and their reverberation, are mixed into the estimated sound source signal, and this effect becomes large when the reverberation time is not short. In other words, compared with an ideal signal containing only the signal from the sound source to be separated, the estimated sound source signal leaves room for being made more sparse.
 An object of the present invention is therefore to provide a sound source signal estimation technique capable of improving the estimation accuracy by using the sparseness of the sound source signal as an evaluation criterion.
 One aspect of the present invention comprises: where M is an integer of 2 or more, s_m(k) (where k represents time) is the signal from the mth sound source (hereinafter referred to as the mth sound source signal) (m=1, …, M), y_n(k) (where k represents time) is the signal obtained by the nth microphone picking up the first sound source signal s_1(k), …, the Mth sound source signal s_M(k) (hereinafter referred to as the nth sound collection signal) (n=1, …, M), Y_n(f, ω) (n=1, …, M) (where f represents the frame number and ω the frequency) is the signal in the frequency domain of the nth sound collection signal y_n(k) (hereinafter also referred to as the nth sound collection signal), and y(f, ω) = [Y_1(f, ω), …, Y_M(f, ω)]^T is the sound collection signal vector, a whitening unit that generates from the sound collection signal vector y(f, ω) a whitened sound collection signal vector u(f, ω) (where u(f, ω) satisfies u(f, ω) = T(ω)y(f, ω) and E[u(f, ω)u^H(f, ω)] = I for a prescribed matrix T(ω)); a separation filter generation unit that generates a separation filter W(ω) by solving the following optimization problem for the cost function F(W(ω)) of the separation filter W(ω) = [w_1(ω) … w_M(ω)] defined using the whitened sound collection signal vector u(f, ω)
min_{W(ω)} F(W(ω)) = ||w_1^H(ω)[u(1, ω) … u(F, ω)]||_1 + … + ||w_M^H(ω)[u(1, ω) … u(F, ω)]||_1  subject to  W^H(ω)W(ω) = I
 (where F is an integer of 1 or more); and a sound source separation unit that generates an estimated sound source signal vector ^s(f, ω) from the sound collection signal vector y(f, ω) using the separation filter W(ω).
 According to the present invention, the estimation accuracy can be improved by using the sparseness of the sound source signal as an evaluation criterion.
FIG. 1 is a block diagram showing the configuration of a sound source signal estimation device 100.
FIG. 2 is a flowchart showing the operation of the sound source signal estimation device 100.
FIG. 3 is a block diagram showing the configuration of a separation filter generation unit 130.
FIG. 4 is a flowchart showing the operation of the separation filter generation unit 130.
FIG. 5 is a diagram showing an example of the functional configuration of a computer that realizes each device in the embodiments of the present invention.
 Embodiments of the present invention are described in detail below. Components having the same function are given the same reference numeral, and duplicate description is omitted.
 Prior to the description of the embodiments, the notation used in this specification is explained.
 _ (underscore) represents a subscript. For example, x^y_z indicates that y_z is a superscript to x, and x_y_z indicates that y_z is a subscript to x.
 In addition, superscripts such as "^" and "~" in ^x and ~x should properly be written directly above the character "x", but due to the notational constraints of this specification they are written as ^x and ~x.
<Technical Background>
 The sparseness of a separated sound source signal can be evaluated by the L1 norm of a vector (see Reference Non-Patent Document 1).
(Reference Non-Patent Document 1: Traian E. Abrudan, Jan Eriksson, and Visa Koivunen, "Steepest Descent Algorithms for Optimization Under Unitary Matrix Constraint," IEEE Trans. on Signal Processing, vol. 56, Issue 3, pp. 1134-1147, March 2008.)
 However, a small L1 norm value can mean either that the signal itself is sparse or that the amplitude of the signal is small. Therefore, simply making the L1 norm of a vector related to the separated sound source signal small does not necessarily make the signal sparse.
 In the embodiments of the present invention, therefore, a separation filter is obtained that minimizes the L1 norm of the vector related to the separated sound source signal while keeping its L2 norm, which represents the signal power, constant. For this purpose, a method of generating the separation filter based on a Riemannian optimization technique is used. Here, the sound source signal separated from the whitened sound collection signal is used as the separated sound source signal.
 The sound source signal estimation procedure in the embodiments of the present invention is described below.
《Sound Source Signal Estimation Procedure》
(Step 1: STFT transform)
 The nth sound collection signal y_n(k) (n=1, …, M) is converted, using the STFT, into the nth sound collection signal Y_n(f, ω) (n=1, …, M), which is a signal in the frequency domain.
(Step 2: Whitening of the sound collection signal)
 Here, the nth sound collection signals Y_n(f, ω) (n=1, …, M) are whitened. The whitened sound collection signal vector u(f, ω) = [U_1(f, ω), …, U_M(f, ω)]^T, obtained by whitening the sound collection signal vector y(f, ω) = [Y_1(f, ω), …, Y_M(f, ω)]^T whose nth element is the nth sound collection signal Y_n(f, ω), is the vector u(f, ω) that satisfies u(f, ω) = T(ω)y(f, ω) and E[u(f, ω)u^H(f, ω)] = I for a prescribed M×M matrix T(ω). Here, I is the M×M identity matrix, and E[・] represents the expected value. Hereinafter, ω is omitted for simplicity.
 The matrix T can be obtained using the eigenvalue decomposition of the spatial correlation matrix of the sound collection signal vector y(f). Suppose the eigenvalue decomposition of the spatial correlation matrix E[y(f)y^H(f)] is given by the following equation.
E[y(f)y^H(f)] = CDC^H
 Here, D is a diagonal matrix, and ・^H represents the Hermitian transpose.
 The matrix T is then obtained by the following equation.
T = D^{-1/2}C^H
 Using this matrix T, the relationship between the sound collection signal vector y(f) and the whitened sound collection signal vector u(f) can be expressed by the following equation.
u(f) = Ty(f)
 Alternatively, this can be written as
y(f) = T^{-1}u(f)
where
T^{-1} = CD^{1/2}.
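 A minimal NumPy sketch of this whitening step follows, assuming the sample average over the available frames is used in place of the expectation E[・] and that the spatial correlation matrix is full rank; the function name is an illustrative assumption.

```python
import numpy as np

def whitening_matrix(Y):
    """Y: (M, F) matrix whose columns are the sound collection signal vectors y(f)
    at one frequency. Returns T = D^{-1/2} C^H so that u(f) = T y(f) satisfies
    E[u u^H] ~= I (sample estimate)."""
    R = Y @ Y.conj().T / Y.shape[1]      # sample spatial correlation E[y y^H]
    d, C = np.linalg.eigh(R)             # eigenvalue decomposition R = C diag(d) C^H
    return np.diag(d ** -0.5) @ C.conj().T

# Example at one frequency bin: 2 channels, 200 frames.
rng = np.random.default_rng(0)
Y = rng.standard_normal((2, 200)) + 1j * rng.standard_normal((2, 200))
T = whitening_matrix(Y)
U = T @ Y
print(np.round(U @ U.conj().T / U.shape[1], 2))  # approximately the identity matrix
```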
(Step 3: Generation of the separation filter)
 Attention is paid to a cost function that measures the sparseness of the separated sound source signal by the L1 norm, and the separation filter is generated so that its value is minimized. Here, as the vector related to the separated sound source signal, the vector v(f) of the following equation, generated from the whitened sound collection signal vector u(f) using the separation filter W, is used.
v(f) = W^Hu(f) = [w_1^Hu(f), …, w_M^Hu(f)]^T
 The generation of the separation filter W is formulated as the following constrained minimization problem (optimization problem).
min_W ||w_1^H[u(1) … u(F)]||_1 + … + ||w_M^H[u(1) … u(F)]||_1  subject to  W^HW = I
 Here, F is an integer of 1 or more representing the number of frames used for the optimization.
 In the following, for brevity, let F(W) = ||w_1^H[u(1) … u(F)]||_1 + … + ||w_M^H[u(1) … u(F)]||_1. F(W) is called the cost function.
 The above optimization problem can then be expressed as follows.
min_W F(W)  subject to  W^HW = I
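 For illustration, the cost function F(W) can be evaluated as in the following sketch; it uses the ε-smoothed L1 norm described at the end of this section so that the cost is also differentiable, and the value of EPS is an illustrative assumption.

```python
import numpy as np

EPS = 1e-8  # smoothing constant for the L1 norm (illustrative assumption)

def cost(W, U):
    """F(W) = sum_m ||w_m^H [u(1) ... u(F)]||_1 for W = [w_1 ... w_M] (M x M)
    and U = [u(1) ... u(F)] (M x F), with the epsilon-smoothed L1 norm."""
    V = W.conj().T @ U                        # V[m, f] = w_m^H u(f)
    return float(np.sum(np.sqrt(np.abs(V) ** 2 + EPS)))
```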
 Before describing the algorithm that solves this optimization problem, the theoretical background is explained first. Although W ∈ C^{M×M}, the more general case W ∈ C^{M×P} is considered here.
(Calculation of the Riemannian gradient)
 The total derivative DF_W(Z) of the cost function F in the direction Z (Z ∈ C^{M×P}) is given by the following equation.
Figure JPOXMLDOC01-appb-M000019
 where
Figure JPOXMLDOC01-appb-M000020
and G is the matrix consisting of the partial derivatives of the cost function F with respect to W.
 Using the Riemannian gradient U based on the canonical inner product, the total derivative DF_W(Z) can be expressed as follows.
Figure JPOXMLDOC01-appb-M000021
 The Riemannian gradient U then lies in the tangent space of the constraint W^HW = I, and it can be derived that
U = GW^H - WG^H
(see Reference Non-Patent Document 1).
 The Riemannian gradient U gives the direction that decreases the cost function F the most within the tangent space of the constraint W^HW = I.
(Search along a curve on the manifold based on the Riemannian gradient)
 Here, consider the following curve on the manifold corresponding to the constraint W^HW = I:
Y(τ) = (I + (τ/2)A)^{-1}(I - (τ/2)A)W
where the matrix A is a skew-Hermitian matrix (A^H = -A) and τ is a parameter along the curve.
 By varying τ, Y(τ) traces a curve on the manifold. This curve is also called the Cayley transform (see Reference Non-Patent Document 2).
(Reference Non-Patent Document 2: Z. Wen and W. Yin, "A feasible method for optimization with orthogonality constraints," Mathematical Programming, 142(1-2), pp. 397-434, 2013.)
 This curve Y(τ) has the following properties.
(1) The constraint W^HW = I is satisfied for any τ.
(2) The tangent vector Y'(0) at τ = 0 is given by Y'(0) = -AW.
 From the above properties, it can be seen that, by setting A = GW^H - WG^H, the cost function F decreases most efficiently along this curve Y(τ). The separation filter W can therefore be obtained by finding a parameter τ that reliably decreases the cost function F along the curve Y(τ) (see Reference Non-Patent Document 2).
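 A sketch of evaluating one point on this curve is shown below; solving a linear system avoids forming the matrix inverse explicitly, and the function name is an illustrative assumption.

```python
import numpy as np

def cayley_point(W, G, tau):
    """Y(tau) = (I + (tau/2) A)^{-1} (I - (tau/2) A) W with A = G W^H - W G^H.
    If W^H W = I, then Y(tau)^H Y(tau) = I for any tau, since A is skew-Hermitian."""
    A = G @ W.conj().T - W @ G.conj().T
    I = np.eye(W.shape[0], dtype=complex)
    return np.linalg.solve(I + (tau / 2) * A, (I - (tau / 2) * A) @ W)
```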
 The algorithm for obtaining the separation filter W is described below.
(1) Let the initial value of the separation filter W be W^[1]. W^[1] is called the first update result of the separation filter W.
(2) Using W^[k], the kth update result of the separation filter W, obtain the following curve on the manifold.
Y(τ) = (I + (τ/2)A)^{-1}(I - (τ/2)A)W^[k]
(3) W^[k+1], the (k+1)th update result of the separation filter W, is obtained by the following search along this curve Y(τ).
(3-1) Set the parameter τ to a non-zero value.
(3-2) Set the two parameters ρ_1, ρ_2 that specify the Armijo condition, where the parameters ρ_1, ρ_2 satisfy 0 < ρ_1 < ρ_2 < 1.
(3-3) While the following two inequalities hold, keep halving the parameter τ.
Figure JPOXMLDOC01-appb-M000025
 Otherwise, that is, when either of the above two inequalities no longer holds, set W^[k+1] = Y(τ).
 where
Figure JPOXMLDOC01-appb-M000026
(4) When tr((W^[k+1] - W^[k])^H(W^[k+1] - W^[k])) becomes smaller than a preset value ε, it is judged that convergence has been reached, and W^[k+1] is taken as the solution; that is, W^[k+1] is taken as the separation filter W. Otherwise, when tr((W^[k+1] - W^[k])^H(W^[k+1] - W^[k])) is not smaller than the preset value ε, the process returns to (2). Note that tr((W^[k+1] - W^[k])^H(W^[k+1] - W^[k])) < ε is called the convergence condition.
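 The following sketch puts steps (1) to (4) together. It is a simplified variant under stated assumptions: the Euclidean gradient G is derived from the ε-smoothed L1 cost (see the approximation note at the end of this section), the initial value W^[1] is taken as the identity matrix, and the curvilinear search simply halves τ until the cost decreases instead of testing the two Armijo-type inequalities above.

```python
import numpy as np

EPS = 1e-8  # smoothing constant for the L1 norm (illustrative assumption)

def cost(W, U):
    V = W.conj().T @ U                                   # V[m, f] = w_m^H u(f)
    return float(np.sum(np.sqrt(np.abs(V) ** 2 + EPS)))

def grad(W, U):
    """Euclidean gradient G = dF/dW* of the smoothed cost: column m equals
    sum_f u(f) conj(v_m(f)) / (2 sqrt(|v_m(f)|^2 + eps)) with v_m(f) = w_m^H u(f)."""
    V = W.conj().T @ U
    S = V.conj() / (2.0 * np.sqrt(np.abs(V) ** 2 + EPS))
    return U @ S.T

def cayley_point(W, G, tau):
    A = G @ W.conj().T - W @ G.conj().T                  # skew-Hermitian generator
    I = np.eye(W.shape[0], dtype=complex)
    return np.linalg.solve(I + (tau / 2) * A, (I - (tau / 2) * A) @ W)

def separation_filter(U, tau0=1.0, eps_conv=1e-8, max_iter=200):
    """U: (M, F) whitened frames [u(1) ... u(F)] at one frequency."""
    W = np.eye(U.shape[0], dtype=complex)                # W^[1]: identity (assumption)
    for _ in range(max_iter):
        G, tau = grad(W, U), tau0
        W_new = cayley_point(W, G, tau)
        while cost(W_new, U) >= cost(W, U) and tau > 1e-12:
            tau /= 2                                     # backtrack along the curve Y(tau)
            W_new = cayley_point(W, G, tau)
        # convergence condition of step (4): tr((W^[k+1]-W^[k])^H (W^[k+1]-W^[k])) < eps
        if np.trace((W_new - W).conj().T @ (W_new - W)).real < eps_conv:
            return W_new
        W = W_new
    return W
```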
(Step 4: Sound source separation)
 Using the separation filter W obtained in Step 3, the estimated sound source signal vector ^s(f) = [^S_1(f), …, ^S_M(f)]^T is generated from the sound collection signal vector y(f). Specifically, after obtaining the sound source signal vector s'(f) separated by s'(f) = Wy(f), the estimated sound source signal vector ^s(f) is generated as the sound source signal vector obtained by solving the scaling problem and the permutation problem for the sound source signal vector s'(f) using the method described in Non-Patent Document 1. Solving the scaling problem means adjusting the scaling of each component at each frequency, and solving the permutation problem means fixing the order of the sound source components estimated at each frequency.
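 A sketch of applying the separation filter across all frequency bins follows; the resolution of the scaling and permutation problems by the method of Non-Patent Document 1 is outside the scope of this sketch and is not implemented.

```python
import numpy as np

def separate(W_all, Y_all):
    """W_all: (n_freq, M, M) separation filters W(omega);
    Y_all: (M, n_freq, n_frames) sound collection signals Y_n(f, omega).
    Returns s'(f, omega) = W(omega) y(f, omega) stacked as (M, n_freq, n_frames)."""
    # For each frequency w: S[m, w, f] = sum_n W_all[w, m, n] * Y_all[n, w, f]
    return np.einsum('wmn,nwf->mwf', W_all, Y_all)
```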
(Step 5: Inverse STFT transform)
 Let the mth element of the estimated sound source signal vector ^s(f) be the mth estimated sound source signal ^S_m(f) (m=1, …, M); the mth estimated sound source signal ^S_m(f) is converted, using the inverse STFT, into the mth estimated sound source signal ^s_m(k) (1 ≤ m ≤ M), which is a signal in the time domain.
 Note that, when computing the derivative of the cost function F, it is necessary to take the derivative of the L1 norm of the vector v = [V_1 … V_L]. This derivative becomes infinite when the length of the vector v is 0. To avoid this, using a small constant ε, the L1 norm of the vector v may be approximated as
||v||_1 ≈ Σ_{l=1}^{L} √(|V_l|^2 + ε)
(see Non-Patent Document 1). This avoids the occurrence of infinity in the numerical computation and makes it possible to solve the optimization problem.
<First Embodiment>
 The sound source signal estimation device 100 is described below with reference to FIGS. 1 and 2. FIG. 1 is a block diagram showing the configuration of the sound source signal estimation device 100. FIG. 2 is a flowchart showing the operation of the sound source signal estimation device 100. As shown in FIG. 1, the sound source signal estimation device 100 includes a frequency domain conversion unit 110, a whitening unit 120, a separation filter generation unit 130, a sound source separation unit 140, a time domain conversion unit 150, and a recording unit 190. The recording unit 190 is a component that appropriately records information necessary for the processing of the sound source signal estimation device 100.
 The sound source signal estimation device 100 takes as input the signals picked up by M microphones installed in a sound field containing M sound sources (where M is an integer of 2 or more), and estimates and outputs the signals from the M sound sources. In the following, s_m(k) (where k represents time) denotes the signal from the mth sound source (hereinafter referred to as the mth sound source signal) (m=1, …, M), and y_n(k) (where k represents time) denotes the signal obtained by the nth microphone picking up the first sound source signal s_1(k), …, the Mth sound source signal s_M(k) (hereinafter referred to as the nth sound collection signal) (n=1, …, M).
 The operation of the sound source signal estimation device 100 is described with reference to FIG. 2.
 In S110, the frequency domain conversion unit 110 takes the nth sound collection signals y_n(k) (n=1, …, M) as input, generates from them, by a prescribed frequency domain conversion, the nth sound collection signals Y_n(f, ω) (n=1, …, M) (where f represents the frame number and ω the frequency), which are signals in the frequency domain, and outputs them. For the frequency domain conversion, for example, the STFT can be used.
 In S120, the whitening unit 120 takes the nth sound collection signals Y_n(f, ω) (n=1, …, M) generated in S110 as input, generates the whitened sound collection signal vector u(f, ω) (where u(f, ω) satisfies u(f, ω) = T(ω)y(f, ω) and E[u(f, ω)u^H(f, ω)] = I for a prescribed matrix T(ω)) from the sound collection signal vector y(f, ω) = [Y_1(f, ω), …, Y_M(f, ω)]^T, and outputs it. The whitened sound collection signal vector u(f, ω) can be generated, for example, by the method described in <Technical Background>; specifically, as follows. First, the whitening unit 120 computes the eigenvalue decomposition E[y(f, ω)y^H(f, ω)] = C(ω)D(ω)C^H(ω) (where D(ω) is a diagonal matrix) of the spatial correlation matrix of the sound collection signal vector y(f, ω). Next, the whitening unit 120 computes the matrix T(ω) by the following equation.
T(ω) = D^{-1/2}(ω)C^H(ω)
 Finally, the whitening unit 120 computes the whitened sound collection signal vector u(f, ω) by u(f, ω) = T(ω)y(f, ω).
 In S130, the separation filter generation unit 130 takes the whitened sound collection signal vector u(f, ω) generated in S120 as input, and solves the optimization problem, shown below, for the cost function F(W(ω)) of the separation filter W(ω) = [w_1(ω) … w_M(ω)] defined using the whitened sound collection signal vector u(f, ω).
min_{W(ω)} F(W(ω)) = ||w_1^H(ω)[u(1, ω) … u(F, ω)]||_1 + … + ||w_M^H(ω)[u(1, ω) … u(F, ω)]||_1  subject to  W^H(ω)W(ω) = I
 (where F is an integer of 1 or more). By solving this problem, the separation filter W(ω) is generated and output.
 The separation filter generation unit 130 is described below with reference to FIGS. 3 and 4. FIG. 3 is a block diagram showing the configuration of the separation filter generation unit 130. FIG. 4 is a flowchart showing the operation of the separation filter generation unit 130. As shown in FIG. 3, the separation filter generation unit 130 includes a separation filter initialization unit 131, a parameter setting unit 132, a separation filter update unit 133, a counter update unit 134, and a convergence condition determination unit 135.
 The operation of the separation filter generation unit 130 is described with reference to FIG. 4.
 In S131, the separation filter initialization unit 131 sets the counter k to 1 and sets W^[1], the first update result of the separation filter W(ω).
 In S132, the parameter setting unit 132 sets the parameter τ_k (where τ_k is non-zero) and the parameters ρ_{k,1}, ρ_{k,2} (where ρ_{k,1} and ρ_{k,2} are parameters that specify the Armijo condition and satisfy 0 < ρ_{k,1} < ρ_{k,2} < 1).
 In S133, the separation filter update unit 133 keeps halving the parameter τ_k while the two inequalities (a) and (b) below, defined using W^[k], the kth update result of the separation filter W(ω), hold for the parameter τ_k,
Figure JPOXMLDOC01-appb-M000030
 and, when either of the inequalities (a), (b) no longer holds, generates W^[k+1], the (k+1)th update result of the separation filter W(ω), as W^[k+1] = Y^[k](τ_k). Halving the parameter τ_k specifically means setting τ_k ← τ_k/2. Also,
Y^[k](τ) = (I + (τ/2)A^[k])^{-1}(I - (τ/2)A^[k])W^[k], A^[k] = G^[k](W^[k])^H - W^[k](G^[k])^H.
 In S134, the counter update unit 134 increments the counter k by 1; specifically, k ← k+1.
 In S135, when a prescribed convergence condition defined using W^[k], the kth update result of the separation filter W(ω), and W^[k+1], the (k+1)th update result of the separation filter W(ω), is satisfied, the convergence condition determination unit 135 generates the separation filter W(ω) as W(ω) = W^[k+1] and ends the processing. Here, the prescribed convergence condition is tr((W^[k+1] - W^[k])^H(W^[k+1] - W^[k])) < ε (where ε is a preset value). Otherwise, the process returns to S132; that is, the separation filter generation unit 130 repeats the computations of S132 to S134.
 In S140, the sound source separation unit 140 takes as input the sound collection signal vector y(f, ω) generated in S120 and the separation filter W(ω) generated in S130, generates the estimated sound source signal vector ^s(f, ω) from the sound collection signal vector y(f, ω) using the separation filter W(ω), and outputs it. The estimated sound source signal vector ^s(f, ω) can be generated, for example, by the method described in <Technical Background>; specifically, as follows. First, the sound source separation unit 140 computes the estimated sound source vector s'(f, ω) from the sound collection signal vector y(f, ω) and the separation filter W(ω) by s'(f, ω) = W(ω)y(f, ω). Next, the sound source separation unit 140 generates the estimated sound source signal vector ^s(f, ω) by solving the scaling problem and the permutation problem of the estimated sound source vector s'(f, ω) using the method described in Non-Patent Document 1.
 In S150, the time domain conversion unit 150 takes the estimated sound source signal vector ^s(f, ω) generated in S140 as input, generates, by a prescribed time domain conversion, from the mth estimated sound source signal ^S_m(f, ω) (m=1, …, M), the mth element of the estimated sound source signal vector ^s(f, ω), the mth estimated sound source signal ^s_m(k) (m=1, …, M), which is a signal in the time domain, and outputs it. For the time domain conversion, for example, the inverse STFT can be used.
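 A minimal sketch of this inverse conversion with SciPy's inverse STFT follows; the parameters are assumed to match those of the forward STFT sketch given earlier.

```python
import numpy as np
from scipy.signal import istft

def to_time_domain(S, fs=16000, nperseg=512):
    """S: (M, n_freq, n_frames) estimated spectra ^S_m(f, omega).
    Returns (M, K) time-domain estimated sound source signals ^s_m(k)."""
    _, s = istft(S, fs=fs, nperseg=nperseg)
    return s
```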
 According to the invention of the present embodiment, the estimation accuracy can be improved by using the sparseness of the sound source signal as an evaluation criterion.
<Supplementary Note>
 FIG. 5 is a diagram showing an example of the functional configuration of a computer that realizes each of the devices described above. The processing in each of the devices described above can be carried out by loading into a recording unit 2020 a program for causing the computer to function as each of the devices described above, and operating a control unit 2010, an input unit 2030, an output unit 2040, and the like.
 The device of the present invention has, for example as a single hardware entity, an input unit to which a keyboard or the like can be connected; an output unit to which a liquid crystal display or the like can be connected; a communication unit to which a communication device (for example, a communication cable) capable of communicating with the outside of the hardware entity can be connected; a CPU (Central Processing Unit, which may include cache memory, registers, and the like); RAM and ROM as memory; an external storage device such as a hard disk; and a bus connecting the input unit, output unit, communication unit, CPU, RAM, ROM, and external storage device so that data can be exchanged among them. If necessary, the hardware entity may also be provided with a device (drive) capable of reading and writing a recording medium such as a CD-ROM. A physical entity equipped with such hardware resources is, for example, a general-purpose computer.
 The external storage device of the hardware entity stores the programs required to realize the functions described above and the data required for processing of those programs (the storage is not limited to the external storage device; a program may, for example, be stored in ROM, a read-only storage device). Data obtained through the processing of these programs is stored as appropriate in the RAM, the external storage device, and so on.
 In the hardware entity, each program stored in the external storage device (or ROM, etc.) and the data necessary for its processing are read into memory as needed and interpreted, executed, and processed by the CPU as appropriate. As a result, the CPU realizes predetermined functions (the components referred to above as "...unit", "...means", and so on).
 The present invention is not limited to the embodiment described above and can be modified as appropriate without departing from its spirit. The processes described in the embodiment may be executed not only in chronological order as described, but also in parallel or individually, depending on the processing capacity of the device executing them or as otherwise required.
 As noted above, when the processing functions of the hardware entity (the device of the present invention) described in the embodiment are realized by a computer, the processing content of the functions the hardware entity should have is described by a program. By executing this program on the computer, the processing functions of the hardware entity are realized on the computer.
 The program describing this processing content can be recorded on a computer-readable recording medium. The computer-readable recording medium may be of any kind, for example a magnetic recording device, an optical disc, a magneto-optical recording medium, or a semiconductor memory. Specifically, a hard disk device, flexible disk, or magnetic tape may be used as the magnetic recording device; a DVD (Digital Versatile Disc), DVD-RAM (Random Access Memory), CD-ROM (Compact Disc Read Only Memory), or CD-R (Recordable)/RW (ReWritable) as the optical disc; an MO (Magneto-Optical disc) as the magneto-optical recording medium; and an EEP-ROM (Electronically Erasable and Programmable Read Only Memory) as the semiconductor memory.
 The program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. The program may also be distributed by storing it in the storage device of a server computer and transferring it from the server computer to other computers via a network.
 A computer that executes such a program first stores, for example, the program recorded on the portable recording medium or transferred from the server computer in its own storage device. When executing the processing, the computer reads the program stored in its own storage device and executes processing according to it. As another mode of execution, the computer may read the program directly from the portable recording medium and execute processing according to it; further, each time the program is transferred to the computer from the server computer, the computer may sequentially execute processing according to the received program. The above processing may also be executed by a so-called ASP (Application Service Provider) type service, which realizes the processing functions solely through execution instructions and result acquisition, without transferring the program from the server computer to the computer. The program in this embodiment includes information that is used for processing by an electronic computer and is equivalent to a program (such as data that is not a direct command to the computer but has the property of defining the computer's processing).
 In this embodiment, the hardware entity is configured by executing a predetermined program on a computer, but at least part of this processing content may instead be realized in hardware.
 The foregoing description of the embodiments of the present invention has been presented for purposes of illustration and description. It is not intended to be exhaustive or to limit the invention to the precise form disclosed. Modifications and variations are possible in light of the above teachings. The embodiments were chosen and described to provide the best illustration of the principles of the invention and to enable those skilled in the art to utilize the invention in various embodiments and with various modifications suited to the particular use contemplated. All such modifications and variations are within the scope of the invention as determined by the appended claims when interpreted in accordance with the breadth to which they are fairly, legally, and equitably entitled.

Claims (4)

  1.  A sound source signal estimation device, wherein M is an integer greater than or equal to 2; s_m(k) (where k represents time) is the signal from the m-th sound source (hereinafter, the m-th source signal) (m = 1, …, M); y_n(k) (where k represents time) is the signal obtained by picking up the first source signal s_1(k), …, the M-th source signal s_M(k) with the n-th microphone (hereinafter, the n-th collected signal) (n = 1, …, M); Y_n(f, ω) (n = 1, …, M) (where f represents the frame number and ω the frequency) is the signal in the frequency domain of the n-th collected signal y_n(k) (hereinafter also referred to as the n-th collected signal); and y(f, ω) = [Y_1(f, ω), …, Y_M(f, ω)]^T is the collected-signal vector, the device comprising:
     a whitening unit that generates, from the collected-signal vector y(f, ω), a whitened collected-signal vector u(f, ω) (where u(f, ω) satisfies, for a predetermined matrix T(ω), u(f, ω) = T(ω)y(f, ω) and E[u(f, ω)u^H(f, ω)] = I);
     a separation filter generation unit that generates a separation filter W(ω) by solving the optimization problem of the cost function F(W(ω)) of the separation filter W(ω) = [w_1(ω) … w_M(ω)] defined using the whitened collected-signal vector u(f, ω):
     [Equation: Figure JPOXMLDOC01-appb-M000001]
     (where F is an integer greater than or equal to 1); and
     a sound source separation unit that generates an estimated source signal vector ^s(f, ω) from the collected-signal vector y(f, ω) using the separation filter W(ω).
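     Claim 1 leaves T(ω) as any predetermined matrix satisfying E[u(f, ω)u^H(f, ω)] = I. As an illustrative sketch only (not the claimed device itself), one common choice is built from the eigendecomposition of the sample spatial covariance:

```python
import numpy as np

def whitening_matrix(Y_omega):
    """Y_omega: (F, M) array; row f is the collected-signal vector
    y(f, omega) at one frequency bin omega, for frames f = 1..F.
    Returns one valid T(omega) with E[u u^H] = I for u = T y
    (assumes the covariance is full rank)."""
    F = Y_omega.shape[0]
    # Sample spatial covariance R(omega) = E[y(f, omega) y(f, omega)^H].
    R = Y_omega.T @ Y_omega.conj() / F
    lam, E = np.linalg.eigh(R)        # R = E diag(lam) E^H, lam > 0
    T = (E / np.sqrt(lam)).conj().T   # T = diag(lam)^{-1/2} E^H
    return T                          # whiten frames via u(f) = T @ y(f)
```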
  2.  The sound source signal estimation device according to claim 1, wherein the separation filter generation unit includes:
     a separation filter initialization unit that sets a counter k to 1 and sets W[1], the first update result of the separation filter W(ω);
     a parameter setting unit that sets a parameter τ_k (where τ_k is nonzero) and parameters ρ_{k,1} and ρ_{k,2} (where ρ_{k,1} and ρ_{k,2} are parameters specifying the Armijo conditions and satisfy 0 < ρ_{k,1} < ρ_{k,2} < 1);
     a separation filter update unit that halves the parameter τ_k while the two inequalities (a) and (b), defined using W[k], the k-th update result of the separation filter W(ω), hold for τ_k:
     [Equation: Figure JPOXMLDOC01-appb-M000002]
     and, when either of the inequalities (a) and (b) ceases to hold, generates W[k+1], the (k+1)-th update result of the separation filter W(ω), as W[k+1] = Y[k](τ_k);
     a counter update unit that increments the counter k by 1; and
     a convergence condition determination unit that, when a predetermined convergence condition defined using W[k], the k-th update result of the separation filter W(ω), and W[k+1], the (k+1)-th update result of the separation filter W(ω), is satisfied, generates the separation filter W(ω) as W(ω) = W[k+1].
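     The inequalities (a) and (b) are given only as an equation image in the publication, so the control-flow sketch below models them as hypothetical placeholder predicates `holds_a` and `holds_b`, with `Y_update` standing for the update curve Y[k](τ); only the halving loop and the final assignment W[k+1] = Y[k](τ_k) follow the claim:

```python
def separation_filter_update(W_k, tau_k, holds_a, holds_b, Y_update):
    """W_k: current filter W[k]; tau_k: nonzero step size.
    holds_a, holds_b: placeholder predicates for inequalities (a), (b).
    Y_update: callable implementing the update curve Y^[k](tau)."""
    # Halve tau_k while both inequalities (a) and (b) hold
    # (the separation filter update unit of claim 2).
    while holds_a(W_k, tau_k) and holds_b(W_k, tau_k):
        tau_k = tau_k / 2.0
    # Once either inequality fails, take W[k+1] = Y^[k](tau_k).
    return Y_update(W_k, tau_k)
```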
  3.  A sound source signal estimation method, wherein M is an integer greater than or equal to 2; s_m(k) (where k represents time) is the signal from the m-th sound source (hereinafter, the m-th source signal) (m = 1, …, M); y_n(k) (where k represents time) is the signal obtained by picking up the first source signal s_1(k), …, the M-th source signal s_M(k) with the n-th microphone (hereinafter, the n-th collected signal) (n = 1, …, M); Y_n(f, ω) (n = 1, …, M) (where f represents the frame number and ω the frequency) is the signal in the frequency domain of the n-th collected signal y_n(k) (hereinafter also referred to as the n-th collected signal); and y(f, ω) = [Y_1(f, ω), …, Y_M(f, ω)]^T is the collected-signal vector, the method comprising:
     a whitening step in which a sound source signal estimation device generates, from the collected-signal vector y(f, ω), a whitened collected-signal vector u(f, ω) (where u(f, ω) satisfies, for a predetermined matrix T(ω), u(f, ω) = T(ω)y(f, ω) and E[u(f, ω)u^H(f, ω)] = I);
     a separation filter generation step in which the sound source signal estimation device generates a separation filter W(ω) by solving the optimization problem of the cost function F(W(ω)) of the separation filter W(ω) = [w_1(ω) … w_M(ω)] defined using the whitened collected-signal vector u(f, ω):
     [Equation: Figure JPOXMLDOC01-appb-M000003]
     (where F is an integer greater than or equal to 1); and
     a sound source separation step in which the sound source signal estimation device generates an estimated source signal vector ^s(f, ω) from the collected-signal vector y(f, ω) using the separation filter W(ω).
  4.  A program for causing a computer to function as the sound source signal estimation device according to claim 1 or 2.

Non-Patent Citations (1)

HYVÄRINEN, AAPO ET AL.: "Independent Component Analysis", first edition, Tokyo Denki University Press, vol. 20, 1 February 2005, pages 404-423. * Cited by examiner


Legal Events

121: the EPO has been informed by WIPO that EP was designated in this application (ref document number 19953357; country of ref document: EP; kind code of ref document: A1)
NENP: non-entry into the national phase (ref country code: DE)
122: PCT application non-entry in European phase (ref document number 19953357; country of ref document: EP; kind code of ref document: A1)
NENP: non-entry into the national phase (ref country code: JP)