WO2024038522A1 - Signal processing device, signal processing method, and program - Google Patents

Signal processing device, signal processing method, and program Download PDF

Info

Publication number
WO2024038522A1
WO2024038522A1 (PCT/JP2022/031099)
Authority
WO
WIPO (PCT)
Prior art keywords
covariance matrix
sound source
signal
target sound
estimated
Prior art date
Application number
PCT/JP2022/031099
Other languages
French (fr)
Japanese (ja)
Inventor
Rintaro Ikeshita
Tomohiro Nakatani
Original Assignee
Nippon Telegraph and Telephone Corporation
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nippon Telegraph and Telephone Corporation
Priority to PCT/JP2022/031099
Publication of WO2024038522A1

Links

Images

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272Voice signal separating
    • G10L21/028Voice signal separating using properties of sound source
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04RLOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R3/00Circuits for transducers, loudspeakers or microphones

Definitions

  • the present invention relates to a technique for estimating, with high quality, an audio signal included in a signal recorded using a microphone.
  • As a method of extracting a signal source using multiple sensors, a method using a convolutional beamformer (CBF; see Non-Patent Document 1) is known.
  • CBF: convolutional beamformer
  • MVDR: Minimum-Variance Distortionless Response
  • However, when the CBF is designed under the MVDR criterion, the spatial information of the target sound source (its spatial covariance matrix) is compressed into a steering vector, so not all of the spatial information possessed by the target sound source can be used.
  • An object of the present invention is to provide a signal processing device, a signal processing method, and a program that can use all of the spatial information of the target sound source by introducing the MaxSNR criterion in place of the MVDR criterion.
  • According to one aspect of the present invention, a signal processing device includes: a second spatial covariance matrix estimation unit that estimates the spatial covariance matrix of a non-target sound source using an estimated value of the spatio-temporal covariance matrix of the non-target sound source; a dereverberation filter estimation unit that estimates a dereverberation filter using the estimated value of the spatio-temporal covariance matrix of the non-target sound source; a beamformer estimation unit that estimates a convolutional beamformer using the observed signal or an estimated value of the spatial covariance matrix of the target sound source, the estimated value of the spatial covariance matrix of the non-target sound source, and the estimated dereverberation filter; and a sound source extraction unit that performs beamforming processing using the observed signal and the estimated convolutional beamformer to estimate a sound source signal.
  • According to the present invention, introducing the MaxSNR criterion makes it possible to use all of the spatial information of the target sound source.
  • FIG. 1 is a functional block diagram of the signal processing device according to the first embodiment.
  • FIG. 2 is a diagram illustrating an example of the processing flow of the signal processing device according to the first embodiment.
  • FIG. 3 is a functional block diagram of the signal processing device according to the second embodiment.
  • FIG. 4 is a diagram illustrating an example of the processing flow of the signal processing device according to the second embodiment.
  • FIG. 5 is a functional block diagram of the signal processing device according to the third embodiment.
  • FIG. 6 is a diagram illustrating an example of the processing flow of the signal processing device according to the third embodiment.
  • FIG. 7 is a diagram illustrating an example configuration of a computer to which the present method is applied.
  • The problem addressed in this embodiment is a sound source extraction problem: estimating, from a signal x_{f,t} observed with a microphone, either the sound source signal s_{f,t} or the spatial image s_{f,t}^{image} = a_f s_{f,t}, in which the reverberation of the sound source signal s_{f,t} has been removed.
  • Here, a_f represents the acoustic transfer function of the sound source.
  • The sound source signal is a signal based on the sound emitted by the sound source to be recorded by the microphone (the target sound source).
  • In this embodiment, the target sound source is a speaker (hereinafter also referred to as the "target speaker"), the target sound is the voice uttered by the target speaker (hereinafter also referred to as the "target voice"), and the target signal is the signal corresponding to the target voice.
  • However, the target sound source is not limited to a speaker and may be any sound source, such as a musical instrument or a playback device, and the target sound is not limited to voice and may be any sound other than voice.
  • a sound source other than the target sound source is also called a non-target sound source.
  • The steering vector used by the MVDR CBF corresponds to the principal component of the spatial covariance matrix V_S, so the MVDR CBF cannot use all of the spatial information that V_S carries.
  • In this embodiment, the MaxSNR criterion is introduced as a new criterion for designing the CBF.
  • The MaxSNR CBF ^w of this embodiment has the property that it can be decomposed into the product of the dereverberation filter ^G and the MaxSNR beamformer w for instantaneous mixing, as shown in equation (3).
  • The subscript opt denotes the optimal solution, and C is the set of all complex numbers.
  • In other words, the MaxSNR CBF has the feature that the dereverberation filter ^G and the MaxSNR beamformer w can be optimized jointly.
  • To explain that equation (2) can be decomposed as in equation (3), ^w and ^R_N are written as follows.
  • S_{++} is the set of all positive definite matrices.
  • Equation (7) can be solved as the principal eigenvector of a generalized eigenvalue problem:
  • V_S w_opt = λ_max V_N w_opt
  • where λ_max is the maximum eigenvalue.
  • ^G in equation (8) is a multi-channel linear prediction (MCLP)-based dereverberation filter of the kind used in dereverberation.
  • V_N in equation (9) is the Schur complement of ^R_N and can be regarded as the spatial covariance matrix of the non-target sound source with the reverberation removed.
  • FIG. 1 shows a functional block diagram of a signal processing device according to the first embodiment, and FIG. 2 shows its processing flow.
  • The signal processing device 100 includes a first spatial covariance matrix estimation unit 110, a spatio-temporal covariance matrix estimation unit 120, a second spatial covariance matrix estimation unit 140, a dereverberation filter estimation unit 130, a beamformer estimation unit 150, a sound source extraction unit 160, and a spatial image estimation unit 170.
  • the observed signal is, for example, an acoustic signal observed with a microphone array consisting of a plurality of microphones.
  • the output signal of the microphone may be input as it is, an output signal stored in some storage device may be read and input, or a signal obtained by performing some processing on the output signal of the microphone may be input.
  • The observed signal x_{f,t} and the sound source signal s_{f,t} are signals in the frequency domain.
  • Alternatively, a time-domain observed signal may be input and converted into the frequency-domain observed signal x_{f,t} by a frequency-domain transform unit (not shown), and the estimated value of the sound source signal s_{f,t} may be converted into a time-domain sound source signal by a time-domain transform unit (not shown) and output.
  • Frequency domain transformation and time domain transformation may be performed by any method; for example, Fourier transformation, inverse Fourier transformation, etc. can be used.
  • The signal processing device 100 is, for example, a special device configured by loading a special program into a known or dedicated computer having a central processing unit (CPU), a main storage device (RAM), and the like.
  • the signal processing device 100 executes each process under the control of, for example, a central processing unit.
  • The data input to the signal processing device 100 and the data obtained in each process are stored, for example, in the main storage device, and the data stored in the main storage device are read out to the central processing unit as necessary and used for other processing.
  • Each processing unit of the signal processing device 100 may be configured at least in part by hardware such as an integrated circuit.
  • Each storage unit included in the signal processing device 100 can be configured by, for example, a main storage device such as a RAM (Random Access Memory), or by middleware such as a relational database or a key-value store.
  • However, each storage unit does not necessarily have to be provided inside the signal processing device 100; it may be configured by an auxiliary storage device such as a hard disk, an optical disk, or a semiconductor memory element such as a flash memory, and provided outside the signal processing device 100.
  • The first spatial covariance matrix estimation unit 110 estimates the spatial covariance matrix of the target sound source (S110) and outputs the estimated value V_S ∈ S^M_+.
  • Various methods can be used to estimate the spatial covariance matrix of the target sound source.
  • For example, the first spatial covariance matrix estimation unit 110 receives the observed signal x_{f,t} as input, estimates from it the section containing the sound emitted by the target sound source (hereinafter also referred to as the target signal), and estimates the spatial covariance matrix of the target sound source using the estimated target signal.
  • If the direction of the target sound source is known, the spatial covariance matrix of the target sound source may instead be approximated in advance through experiments or simulations, and the approximate value used as the estimated value V_S ∈ S^M_+.
  • The spatio-temporal covariance matrix estimation unit 120 estimates the spatio-temporal covariance matrix of the non-target sound source (S120) and outputs the estimated value ^R_N ∈ S^{M+ML}_+.
  • Various methods can be used to estimate the spatiotemporal covariance matrix of the non-target sound source.
  • For example, the spatio-temporal covariance matrix estimation unit 120 receives the observed signal x_{f,t} as input, estimates from it a section that does not contain the sound emitted by the target sound source (hereinafter also referred to as the non-target signal), and estimates the spatio-temporal covariance matrix of the non-target sound source using the estimated non-target signal.
  • The dereverberation filter estimation unit 130 receives the estimated value ^R_N of the spatio-temporal covariance matrix as input, estimates the dereverberation filter from the block matrices -P_N and -R_N contained in ^R_N (S130), and outputs the estimated dereverberation filter ^G.
  • the dereverberation filter is estimated by equation (8).
  • R_N is the block matrix consisting of the elements in rows 1 to M and columns 1 to M of the estimated value ^R_N.
  • -P_N is the block matrix consisting of the elements in rows (M+1) to (M+ML) and columns 1 to M of the estimated value ^R_N.
  • (-P_N)^H is the block matrix consisting of the elements in rows 1 to M and columns (M+1) to (M+ML) of the estimated value ^R_N.
  • -R_N is the block matrix consisting of the elements in rows (M+1) to (M+ML) and columns (M+1) to (M+ML) of the estimated value ^R_N.
  • The second spatial covariance matrix estimation unit 140 receives the estimated value ^R_N of the spatio-temporal covariance matrix as input, estimates the spatial covariance matrix of the non-target sound source from the block matrices R_N, -P_N, and -R_N contained in ^R_N (S140), and outputs the estimated value V_N.
  • For example, the spatial covariance matrix of the non-target sound source is estimated by equation (9).
  • Alternatively, the second spatial covariance matrix estimation unit 140 may receive the estimated value ^R_N of the spatio-temporal covariance matrix and the dereverberation filter ^G estimated by the dereverberation filter estimation unit 130 as inputs, and estimate the spatial covariance matrix of the non-target sound source from ^R_N and ^G using equation (9).
  • The beamformer estimation unit 150 receives as inputs the estimated value V_S of the spatial covariance matrix of the target sound source, the estimated value V_N of the spatial covariance matrix of the non-target sound source, and the estimated dereverberation filter ^G.
  • The beamformer estimation unit 150 obtains the MaxSNR beamformer w_opt for instantaneous mixing from the estimated value V_S of the spatial covariance matrix of the target sound source and the estimated value V_N of the spatial covariance matrix of the non-target sound source, using equation (7). Equation (7) can be solved as the principal eigenvector of a generalized eigenvalue problem.
  • V_S w_opt = λ_max V_N w_opt
  • where λ_max is the maximum eigenvalue.
  • The beamformer estimation unit 150 estimates the convolutional beamformer from the MaxSNR beamformer w_opt for instantaneous mixing and the estimated dereverberation filter ^G using equation (3) (S150), and outputs the estimated convolutional beamformer ^w.
  • <Sound source extraction unit 160> The sound source extraction unit 160 receives the observed signal x_{f,t} and the estimated convolutional beamformer ^w as inputs, performs beamforming processing according to y_{f,t} = ^w_f^H ^x_{f,t}, with ^w = [^w_1 | … | ^w_F] and ^x_{f,t} = [x_{f,t}^T | x_{f,t-D-1}^T | … | x_{f,t-D-L}^T]^T, estimates the sound source signal (S160), and outputs the estimated value y_{f,t}.
  • A^H denotes the Hermitian transpose of A, and A^T denotes the transpose of A.
  • <Spatial image estimation unit 170> Although the scale of the convolutional beamformer ^w_f is indeterminate for each frequency bin f, it can be restored by estimating a vector u_f that approximates the spatial image s_{f,t}^{image} using equation (11).
  • The spatial image estimation unit 170 receives the estimated value V_N of the spatial covariance matrix of the non-target sound source, the estimated value y_{f,t}, and the MaxSNR beamformer w_opt for instantaneous mixing as inputs, obtains the vector u_f from V_N and w_opt using equation (11), approximates the spatial image s_{f,t}^{image} from the estimated value y_{f,t} and the vector u_f, and outputs the approximate value u_f y_{f,t}.
  • The MVDR CBF (the method of estimating the CBF under the MVDR criterion) requires the steering vector of the target sound source to be estimated separately in advance; this raises the problems that the sound source extraction performance of the MVDR CBF depends strongly on the estimation performance of the steering vector and that the method is inconvenient to use. This embodiment solves these problems.
  • As shown in equations (1) and (2), estimating the MaxSNR CBF requires that the estimated value V_S of the spatial covariance matrix of the target sound source and the estimated value ^R_N of the spatio-temporal covariance matrix of the non-target sound source be obtained in advance. This embodiment describes a Blind MaxSNR CBF that eliminates the need to obtain these two estimated values in advance. Here, "Blind" means that no prior knowledge is required.
  • The Blind MaxSNR CBF of this embodiment estimates the MaxSNR CBF by repeatedly performing computations similar to those of the MaxSNR CBF given by equation (2) or equation (7).
  • The Blind MaxSNR CBF of this embodiment is defined as the local optimal solution of equations (20a) and (20b), using an arbitrary super-Gaussian function φ: R_{≥0} → R and the Schur complement V_X of the matrix ^R_X.
  • y_{f,t} = (^w_f)^H ^x_{f,t} is an estimate of the sound source signal, and y_t = [y_{1,t} | … | y_{F,t}]^T ∈ C^F.
  • For a vector A, ||A||_2 = √(A^H A) is the Euclidean norm, and C on the right-hand side of equation (20b) is a constant determined adaptively and heuristically at each iteration of the algorithm that maximizes or minimizes the function.
  • More specifically, the MaxSNR CBF is optimized without prior knowledge through iterative optimization that alternates between a step that obtains the spatio-temporal covariance matrix ^R_{Z,f}, interpreted as the estimated value ^R_{N,f} of the spatio-temporal covariance matrix of the non-target sound source, based on equations (21) and (22), and a step that estimates the MaxSNR CBF ^w based on equations (23) to (26).
  • y_t^k = [… | y_{f,t}^k | …]^T,  y_{f,t}^k = (^w_f^k)^H ^x_{f,t}  (22)
  • k is an index indicating the number of repetitions.
  • FIG. 3 shows a functional block diagram of the signal processing device according to the second embodiment,
  • FIG. 4 shows its processing flow.
  • The signal processing device 200 includes an initialization unit 201, a first spatial covariance matrix estimation unit 210, a spatio-temporal covariance matrix estimation unit 220, a second spatial covariance matrix estimation unit 240, a dereverberation filter estimation unit 230, a beamformer estimation unit 250, a sound source extraction unit 160, and a determination unit 280.
  • The signal processing device 200 receives the observed signal x_{f,t} observed with the microphones and the index m of the reference microphone as inputs, estimates the sound source signal s_{f,t}, and outputs the estimate.
  • f indicates a frequency
  • t indicates a frame number
  • the observation signal x f,t and the sound source signal s f,t are signals in the frequency domain.
  • Alternatively, a time-domain observed signal may be input and converted into the frequency-domain observed signal x_{f,t} by a frequency-domain transform unit (not shown), and the sound source signal s_{f,t} may be converted into a time-domain sound source signal by a time-domain transform unit (not shown) and output.
  • Frequency domain transformation and time domain transformation may be performed by any method; for example, Fourier transformation, inverse Fourier transformation, etc. can be used.
  • e_m is a unit vector corresponding to the reference microphone.
  • <First spatial covariance matrix estimation unit 210> The first spatial covariance matrix estimation unit 210 receives the observed signal x_{f,t} as input, estimates the spatial covariance matrix of the observed signal x_{f,t} using equations (28) to (30) (S210), and outputs the estimated value V_X.
  • R_{X,f} is the block matrix consisting of the elements in rows 1 to M and columns 1 to M of the estimated value ^R_{X,f}; -P_{X,f} is the block matrix consisting of the elements in rows (M+1) to (M+ML) and columns 1 to M of ^R_{X,f}; (-P_{X,f})^H is the block matrix consisting of the elements in rows 1 to M and columns (M+1) to (M+ML) of ^R_{X,f}; and -R_{X,f} is the block matrix consisting of the elements in rows (M+1) to (M+ML) and columns (M+1) to (M+ML) of ^R_{X,f}.
  • V_X = [V_{X,1}, …, V_{X,f}, …, V_{X,F}]
  • <Spatio-temporal covariance matrix estimation unit 220>
  • The spatio-temporal covariance matrix estimation unit 220 receives the convolutional beamformer ^w^k estimated in the previous iteration (or its initial value ^w^0) and the observed signal x_{f,t} as inputs, computes the spatio-temporal covariance matrix ^R_Z = [^R_{Z,1}, …, ^R_{Z,f}, …, ^R_{Z,F}], interpreted as the estimate ^R_{N,f} of the spatio-temporal covariance matrix of the non-target sound source, based on equations (21) and (22) (S220), and outputs it.
  • The dereverberation filter estimation unit 230 receives the spatio-temporal covariance matrix ^R_{Z,f} as input, estimates the dereverberation filter from the block matrices -P_{Z,f} and -R_{Z,f} contained in ^R_{Z,f} (S230), and outputs the estimated dereverberation filter ^G.
  • For example, the dereverberation filter is estimated by equation (25).
  • R_{Z,f} is the block matrix consisting of the elements in rows 1 to M and columns 1 to M of the spatio-temporal covariance matrix ^R_{Z,f}.
  • -P_{Z,f} is the block matrix consisting of the elements in rows (M+1) to (M+ML) and columns 1 to M of ^R_{Z,f}.
  • (-P_{Z,f})^H is the block matrix consisting of the elements in rows 1 to M and columns (M+1) to (M+ML) of ^R_{Z,f}.
  • -R_{Z,f} is the block matrix consisting of the elements in rows (M+1) to (M+ML) and columns (M+1) to (M+ML) of ^R_{Z,f}.
  • The second spatial covariance matrix estimation unit 240 receives the spatio-temporal covariance matrix ^R_{Z,f} as input, estimates the spatial covariance matrix of the non-target sound source from the block matrices R_{Z,f}, -P_{Z,f}, and -R_{Z,f} contained in ^R_{Z,f} (S240), and outputs the estimated value V_{Z,f} ∈ S^{M+ML}_+.
  • For example, the spatial covariance matrix of the non-target sound source is estimated by equation (31).
  • Alternatively, the second spatial covariance matrix estimation unit 240 may receive the spatio-temporal covariance matrix ^R_{Z,f} and the dereverberation filter ^G estimated by the dereverberation filter estimation unit 230 as inputs, and estimate the spatial covariance matrix of the non-target sound source from ^R_{Z,f} and ^G using equation (31).
  • The beamformer estimation unit 250 obtains the MaxSNR beamformer w_f^{k+1} for instantaneous mixing using equation (24), from the estimated value V_X of the spatial covariance matrix of the observed signal and the estimated value V_{Z,f} of the spatial covariance matrix of the non-target sound source.
  • The beamformer estimation unit 250 then estimates the convolutional beamformer ^w^{k+1} from the MaxSNR beamformer w_f^{k+1} for instantaneous mixing and the estimated dereverberation filter ^G using equation (23) (S250).
  • The sound source extraction unit 160 receives the observed signal x_{f,t} and the estimated convolutional beamformer ^w^{k+1} as inputs, performs beamforming processing y_{f,t} = (^w_f^{k+1})^H ^x_{f,t}, estimates the sound source signal (S160), and outputs the estimated value y_{f,t}.
  • The determination unit 280 determines whether a convergence condition is satisfied (S280). If the convergence condition is satisfied (YES in S280), the estimated value y_{f,t} at that time is output as the output of the signal processing device and the processing ends. If the convergence condition is not satisfied (NO in S280), the determination unit 280 sends a control signal to each unit so that S220 to S160 are repeated.
  • In this case, the estimated value y_{f,t} output from the sound source extraction unit 160 can be reused in the spatio-temporal covariance matrix estimation unit 220, and the computation of equation (22) can be omitted.
  • As the convergence condition, conditions such as whether the estimation has been repeated a predetermined number of times (for example, several times), or whether the difference between the convolutional beamformers ^w^{k+1} before and after an update is less than a predetermined threshold, can be used.
  • The Blind MaxSNR CBF of this embodiment is a very fast method that can estimate the MaxSNR CBF with high accuracy in at most several iterations.
  • In this embodiment, the estimated value y_{f,t} of the sound source signal s_{f,t} is output; however, the spatial image estimation unit 170 may be provided, and an approximate value u_f y_{f,t} of the spatial image s_{f,t}^{image} may be obtained from the estimated value y_{f,t} at the time the convergence condition is satisfied and output.
  • In the third embodiment, the MaxSNR CBF can be estimated with higher accuracy than with the Blind MaxSNR CBF of the second embodiment.
  • FIG. 5 shows a functional block diagram of a signal processing device according to the third embodiment, and FIG. 6 shows its processing flow.
  • The signal processing device 300 includes an initialization unit 201, a first spatial covariance matrix estimation unit 110, a spatio-temporal covariance matrix estimation unit 220, a second spatial covariance matrix estimation unit 240, a dereverberation filter estimation unit 230, a beamformer estimation unit 250, a sound source extraction unit 160, and a determination unit 280.
  • This embodiment differs from the second embodiment in that it includes the first spatial covariance matrix estimation unit 110 in place of the first spatial covariance matrix estimation unit 210.
  • the first spatial covariance matrix estimation unit 110 is as described in the first embodiment.
  • The beamformer estimation unit 250 also differs in that it uses the estimated value V_S of the spatial covariance matrix of the target sound source instead of the estimated value V_X of the spatial covariance matrix of the observed signal x_{f,t}.
  • Other processing is similar to the second embodiment.
  • a program that describes this processing content can be recorded on a computer-readable recording medium.
  • the computer-readable recording medium may be of any type, such as a magnetic recording device, an optical disk, a magneto-optical recording medium, or a semiconductor memory.
  • This program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. The program may also be distributed by storing it in the storage device of a server computer and transferring it from the server computer to another computer via a network.
  • A computer that executes such a program, for example, first stores the program recorded on the portable recording medium or transferred from the server computer in its own storage device. When executing processing, the computer reads the program stored in its own recording medium and executes processing according to the read program. As another form of execution, the computer may read the program directly from the portable recording medium and execute processing according to it, or it may execute processing according to the received program each time the program is transferred to it from the server computer.
  • The above-described processing may also be executed by a so-called ASP (Application Service Provider) type service, in which the program is not transferred from the server computer to this computer and the processing functions are realized only through execution instructions and acquisition of results.
  • In the above embodiments, the present device is configured by executing a predetermined program on a computer, but at least part of the processing may be implemented in hardware.

Landscapes

  • Engineering & Computer Science (AREA)
  • Signal Processing (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Measurement Of Velocity Or Position Using Acoustic Or Ultrasonic Waves (AREA)

Abstract

The present invention provides a signal processing device and the like in which entire spatial information relating to a target sound source can be used through introduction of a MaxSNR criterion. The signal processing device comprises: a second spatial covariance matrix estimation unit that uses an estimated value of a spatial/temporal covariance matrix of a non-target sound source to estimate a spatial covariance matrix of the non-target sound source; a reverberation removal filter estimation unit that uses the estimated value of the spatial/temporal covariance matrix of the non-target sound source to estimate a reverberation removal filter; a beam former estimation unit that uses an observed signal or an estimated value of a spatial covariance matrix of a target sound source, the estimated value of the spatial covariance matrix of the non-target sound source, and the estimated reverberation removal filter to estimate a convolutional beam former; and a sound source extraction unit that uses the observed signal and the estimated convolutional beam former to perform beam forming processing, thereby estimating a sound source signal.

Description

Signal processing device, signal processing method, and program
The present invention relates to a technique for estimating, with high quality, an audio signal contained in a signal recorded using a microphone.
When an audio signal is recorded using a microphone in a noisy, reverberant environment, unnecessary components such as noise, reverberation, and interfering sounds are mixed into the microphone signal in addition to the desired audio component, so the quality of the audio signal contained in the recorded signal is low. Signal source extraction techniques have therefore been actively studied in order to estimate the audio signal contained in a recorded signal with high quality. As a method of extracting a signal source using multiple sensors, a method using a convolutional beamformer (CBF; see Non-Patent Document 1) is known. To date, the Minimum-Variance Distortionless Response (MVDR) criterion has been used to optimize the CBF (see Non-Patent Document 1).
However, when the CBF is designed under the MVDR criterion, the spatial information of the target sound source to be extracted (its spatial covariance matrix) is compressed into a steering vector, so not all of the spatial information possessed by the target sound source can be used.
An object of the present invention is to provide a signal processing device, a signal processing method, and a program that can use all of the spatial information of the target sound source by introducing the MaxSNR criterion in place of the MVDR criterion.
In order to solve the above problems, according to one aspect of the present invention, a signal processing device includes: a second spatial covariance matrix estimation unit that estimates the spatial covariance matrix of a non-target sound source using an estimated value of the spatio-temporal covariance matrix of the non-target sound source; a dereverberation filter estimation unit that estimates a dereverberation filter using the estimated value of the spatio-temporal covariance matrix of the non-target sound source; a beamformer estimation unit that estimates a convolutional beamformer using the observed signal or an estimated value of the spatial covariance matrix of the target sound source, the estimated value of the spatial covariance matrix of the non-target sound source, and the estimated dereverberation filter; and a sound source extraction unit that performs beamforming processing using the observed signal and the estimated convolutional beamformer to estimate a sound source signal.
According to the present invention, introducing the MaxSNR criterion makes it possible to use all of the spatial information of the target sound source.
FIG. 1 is a functional block diagram of the signal processing device according to the first embodiment. FIG. 2 is a diagram illustrating an example of the processing flow of the signal processing device according to the first embodiment. FIG. 3 is a functional block diagram of the signal processing device according to the second embodiment. FIG. 4 is a diagram illustrating an example of the processing flow of the signal processing device according to the second embodiment. FIG. 5 is a functional block diagram of the signal processing device according to the third embodiment. FIG. 6 is a diagram illustrating an example of the processing flow of the signal processing device according to the third embodiment. FIG. 7 is a diagram illustrating an example configuration of a computer to which the present method is applied.
Embodiments of the present invention are described below. In the drawings used in the following description, components having the same function and steps performing the same processing are given the same reference numerals, and redundant description is omitted. In the following description, symbols such as "^" and "-" used in the text should properly be written directly above the character that follows them, but due to the limitations of text notation they are written immediately before that character. In the formulas, these symbols are written in their proper positions. Unless otherwise specified, processing performed on each element of a vector or matrix applies to all elements of that vector or matrix.
<Sound source extraction problem>
The problem addressed in this embodiment is a sound source extraction problem: estimating, from a signal x_{f,t} observed with a microphone, either the sound source signal s_{f,t} or the spatial image s_{f,t}^{image} = a_f s_{f,t}, in which the reverberation of the sound source signal s_{f,t} has been removed. Here, a_f represents the acoustic transfer function of the sound source. The sound source signal is a signal based on the sound emitted by the sound source to be recorded by the microphone (the target sound source). In this embodiment, the target sound source is a speaker (hereinafter also referred to as the "target speaker"), the target sound is the voice uttered by the target speaker (hereinafter also referred to as the "target voice"), and the target signal is the signal corresponding to the target voice. However, the target sound source is not limited to a speaker and may be any sound source, such as a musical instrument or a playback device, and the target sound is not limited to voice and may be any sound other than voice. A sound source other than the target sound source is also called a non-target sound source.
<Points of the first embodiment>
The steering vector used by the MVDR CBF corresponds to the principal component of the spatial covariance matrix V_S, so the MVDR CBF cannot use all of the spatial information that V_S carries. In this embodiment, the MaxSNR criterion is introduced as a new criterion for designing the CBF. When the CBF is designed using the MaxSNR criterion, there is the advantage that the spatial information of the target sound source (the spatial covariance matrix V_S) can be fully utilized.
First, the CBF based on the MaxSNR criterion is explained. Let M be an integer of 2 or more representing the number of microphones, let L+1 be the number of taps of the CBF, let S_+ be the set of all non-negative definite matrices, let A_B denote a square matrix with B rows and B columns, let A_{B×C} denote a matrix with B rows and C columns, let ^R_N ∈ S^{M+ML}_+ be the spatio-temporal covariance matrix of the non-target sound source, let V_S ∈ S^M_+ be the spatial covariance matrix of the target sound source, let O_{A×B} be the zero matrix with A rows and B columns, and let

[Equation (1)]

Then the MaxSNR CBF ^w is defined as follows.

[Equation (2)]

Note that when L=0, the MaxSNR CBF reduces to the MaxSNR beamformer.
The MaxSNR CBF ^w of this embodiment also has the property that it can be decomposed into the product of the dereverberation filter ^G and the MaxSNR beamformer w for instantaneous mixing, as in the following equation.

[Equation (3)]

Here, the subscript opt denotes the optimal solution, and C is the set of all complex numbers. In other words, the MaxSNR CBF has the feature that the dereverberation filter ^G and the MaxSNR beamformer w can be optimized jointly.
To explain that equation (2) can be decomposed as in equation (3), ^w and ^R_N are written as follows.

[Equation (4)]

[Equation (5)]

[Equation (6)]

Here, S_{++} is the set of all positive definite matrices.
The optimal solution ^w_opt of the MaxSNR CBF ^w can then be obtained as

[Equation (7)]

[Equation (8)]

where

[Equation (9)]

[Equation (10)]

Here, I_M is the identity matrix with M rows and M columns, and A^H denotes the Hermitian transpose of A.
Equation (7) can be solved as the principal eigenvector of a generalized eigenvalue problem:

V_S w_opt = λ_max V_N w_opt

where λ_max is the maximum eigenvalue.
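As a concrete illustration of this step, the following minimal Python sketch computes w_opt as the principal generalized eigenvector; the function name and the toy inputs are assumptions made for the illustration, not part of the patent text.

```python
import numpy as np
from scipy.linalg import eigh

def maxsnr_beamformer(V_S, V_N):
    """Solve V_S w = lambda V_N w and return the eigenvector belonging to
    the largest generalized eigenvalue, i.e. the MaxSNR beamformer w_opt."""
    # scipy.linalg.eigh solves the Hermitian generalized eigenproblem
    # A v = lambda B v with eigenvalues in ascending order, so the last
    # column of the eigenvector matrix corresponds to lambda_max.
    _, eigvecs = eigh(V_S, V_N)
    return eigvecs[:, -1]

# Toy usage with random Hermitian positive definite matrices (M = 4 mics).
rng = np.random.default_rng(0)
M = 4
A = rng.standard_normal((M, M)) + 1j * rng.standard_normal((M, M))
B = rng.standard_normal((M, M)) + 1j * rng.standard_normal((M, M))
V_S = A @ A.conj().T + 1e-3 * np.eye(M)   # stand-in for the target covariance
V_N = B @ B.conj().T + 1e-3 * np.eye(M)   # stand-in for the non-target covariance
w_opt = maxsnr_beamformer(V_S, V_N)
```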
^G in equation (8) is a multi-channel linear prediction (MCLP)-based dereverberation filter of the kind used in dereverberation. V_N in equation (9) is the Schur complement of ^R_N and can be regarded as the spatial covariance matrix of the non-target sound source with the reverberation removed.
<First embodiment>
FIG. 1 shows a functional block diagram of the signal processing device according to the first embodiment, and FIG. 2 shows its processing flow.
The signal processing device 100 includes a first spatial covariance matrix estimation unit 110, a spatio-temporal covariance matrix estimation unit 120, a second spatial covariance matrix estimation unit 140, a dereverberation filter estimation unit 130, a beamformer estimation unit 150, a sound source extraction unit 160, and a spatial image estimation unit 170.
The signal processing device 100 receives the observed signal x_{f,t} observed with microphones as input, estimates the sound source signal s_{f,t} or the spatial image s_{f,t}^{image} = a_f s_{f,t} of the sound source signal (with the reverberation removed), and outputs the estimate. The observed signal is, for example, an acoustic signal observed with a microphone array consisting of a plurality of microphones. The output signals of the microphones may be input as they are, output signals stored in some storage device may be read out and input, or signals obtained by applying some processing to the microphone output signals may be input. Here, f (f=1,…,F) denotes the frequency, t (t=1,…,T) denotes the frame number, and the observed signal x_{f,t} and the sound source signal s_{f,t} are signals in the frequency domain. Alternatively, a time-domain observed signal may be input and converted into the frequency-domain observed signal x_{f,t} by a frequency-domain transform unit (not shown), and the estimated value of the sound source signal s_{f,t} may be converted into a time-domain sound source signal by a time-domain transform unit (not shown) and output. The frequency-domain and time-domain transforms may be performed by any method; for example, the Fourier transform and the inverse Fourier transform can be used.
The signal processing device 100 is, for example, a special device configured by loading a special program into a known or dedicated computer having a central processing unit (CPU: Central Processing Unit), a main storage device (RAM: Random Access Memory), and the like. The signal processing device 100 executes each process under the control of, for example, the central processing unit. The data input to the signal processing device 100 and the data obtained in each process are stored, for example, in the main storage device, and the data stored in the main storage device are read out to the central processing unit as necessary and used for other processing. Each processing unit of the signal processing device 100 may be configured at least in part by hardware such as an integrated circuit. Each storage unit included in the signal processing device 100 can be configured by, for example, a main storage device such as a RAM (Random Access Memory), or by middleware such as a relational database or a key-value store. However, each storage unit does not necessarily have to be provided inside the signal processing device 100; it may be configured by an auxiliary storage device such as a hard disk, an optical disk, or a semiconductor memory element such as a flash memory, and provided outside the signal processing device 100.
Each unit is described below.
<First spatial covariance matrix estimation unit 110>
The first spatial covariance matrix estimation unit 110 estimates the spatial covariance matrix of the target sound source (S110) and outputs the estimated value V_S ∈ S^M_+. Various methods can be used to estimate the spatial covariance matrix of the target sound source. For example, the first spatial covariance matrix estimation unit 110 receives the observed signal x_{f,t} as input, estimates from it the section containing the sound emitted by the target sound source (hereinafter also referred to as the target signal), and estimates the spatial covariance matrix of the target sound source using the estimated target signal. If the direction of the target sound source is known, the spatial covariance matrix of the target sound source may instead be approximated in advance through experiments or simulations, and the approximate value used as the estimated value V_S ∈ S^M_+.

<Spatio-temporal covariance matrix estimation unit 120>
The spatio-temporal covariance matrix estimation unit 120 estimates the spatio-temporal covariance matrix of the non-target sound source (S120) and outputs the estimated value ^R_N ∈ S^{M+ML}_+. Various methods can be used to estimate the spatio-temporal covariance matrix of the non-target sound source. For example, the spatio-temporal covariance matrix estimation unit 120 receives the observed signal x_{f,t} as input, estimates from it a section that does not contain the sound emitted by the target sound source (hereinafter also referred to as the non-target signal), and estimates the spatio-temporal covariance matrix of the non-target sound source using the estimated non-target signal.
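As a concrete illustration of one possible realization of these two estimation units, the sketch below forms sample covariances over frames labeled target or non-target; the labeling itself (for example by voice activity detection) and the helper names are assumptions, not the patent's prescribed method.

```python
import numpy as np

def spatial_covariance(X, target_frames):
    """Sample estimate of V_S for one frequency bin: average of x x^H
    over the frames judged to contain the target sound.
    X: (M, T) observed signal of one bin; target_frames: index array."""
    Xs = X[:, target_frames]
    return (Xs @ Xs.conj().T) / Xs.shape[1]

def spatiotemporal_covariance(X_bar, nontarget_frames):
    """Sample estimate of ^R_N: the same average taken over the stacked
    vectors ^x_{f,t} of shape (M + M*L,), restricted to frames judged
    not to contain the target sound."""
    Xn = X_bar[:, nontarget_frames]
    return (Xn @ Xn.conj().T) / Xn.shape[1]
```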
<Dereverberation filter estimation unit 130>
The dereverberation filter estimation unit 130 receives the estimated value ^R_N of the spatio-temporal covariance matrix as input, estimates the dereverberation filter from the block matrices -P_N and -R_N contained in ^R_N (S130), and outputs the estimated dereverberation filter ^G. For example, the dereverberation filter is estimated by equation (8).

[Equation (8)]

Here,

[Equation images: block partition of ^R_N]

That is, R_N is the block matrix consisting of the elements in rows 1 to M and columns 1 to M of the estimated value ^R_N; -P_N is the block matrix consisting of the elements in rows (M+1) to (M+ML) and columns 1 to M of ^R_N; (-P_N)^H is the block matrix consisting of the elements in rows 1 to M and columns (M+1) to (M+ML) of ^R_N; and -R_N is the block matrix consisting of the elements in rows (M+1) to (M+ML) and columns (M+1) to (M+ML) of ^R_N.
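Equation (8) is given only as an image in the original, so the sketch below should be read as an assumption: it uses the standard multi-channel linear prediction least-squares form, which is consistent with the statement that the filter is estimated from the blocks -P_N and -R_N.

```python
import numpy as np

def mclp_dereverberation_filter(R_N_hat, M):
    """Hypothetical realization of S130.  Extracts the blocks -P_N and
    -R_N from ^R_N and solves the assumed MCLP normal equations
    (-R_N) G = (-P_N); the exact form of equation (8) is an image in
    the original, and this least-squares form is an assumption."""
    P_bar = R_N_hat[M:, :M]    # rows (M+1)..(M+ML), cols 1..M
    R_bar = R_N_hat[M:, M:]    # rows/cols (M+1)..(M+ML)
    return np.linalg.solve(R_bar, P_bar)   # (M*L, M) filter coefficients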
<Second spatial covariance matrix estimation unit 140>
The second spatial covariance matrix estimation unit 140 receives the estimated value ^R_N of the spatio-temporal covariance matrix as input, estimates the spatial covariance matrix of the non-target sound source from the block matrices R_N, -P_N, and -R_N contained in ^R_N (S140), and outputs the estimated value V_N ∈ S^{M+ML}_+. For example, the spatial covariance matrix of the non-target sound source is estimated by equation (9).

[Equation (9)]

Alternatively, the second spatial covariance matrix estimation unit 140 may receive the estimated value ^R_N of the spatio-temporal covariance matrix and the dereverberation filter ^G estimated by the dereverberation filter estimation unit 130 as inputs, and estimate the spatial covariance matrix of the non-target sound source from ^R_N and ^G using equation (9).
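The Schur complement named in equation (9) is fully determined by the block partition described above, so this step can be sketched directly; only the zero-padding back to the stated dimension S^{M+ML}_+ is an assumption of the illustration.

```python
import numpy as np

def nontarget_spatial_covariance(R_N_hat, M, L):
    """Schur complement of ^R_N with respect to its lower-right block:
    V_N = R_N - (-P_N)^H (-R_N)^{-1} (-P_N)."""
    R = R_N_hat[:M, :M]
    P_bar = R_N_hat[M:, :M]
    R_bar = R_N_hat[M:, M:]
    V = R - P_bar.conj().T @ np.linalg.solve(R_bar, P_bar)  # (M, M)
    # The text states V_N in S^{M+ML}_+; zero-padding the M x M
    # complement back to full size is an assumption made here.
    V_full = np.zeros((M + M * L, M + M * L), dtype=V.dtype)
    V_full[:M, :M] = V
    return V_full
```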
<Beamformer estimation unit 150>
The beamformer estimation unit 150 receives the estimated value V_S of the spatial covariance matrix of the target sound source, the estimated value V_N of the spatial covariance matrix of the non-target sound source, and the estimated dereverberation filter ^G as inputs. The beamformer estimation unit 150 obtains the MaxSNR beamformer w_opt for instantaneous mixing from V_S and V_N using equation (7).

[Equation (7)]

Equation (7) can be solved as the principal eigenvector of a generalized eigenvalue problem:

V_S w_opt = λ_max V_N w_opt

where λ_max is the maximum eigenvalue.
The beamformer estimation unit 150 estimates the convolutional beamformer from the MaxSNR beamformer w_opt for instantaneous mixing and the estimated dereverberation filter ^G using equation (3) (S150), and outputs the estimated convolutional beamformer ^w.

[Equation (3)]

<Sound source extraction unit 160>
The sound source extraction unit 160 receives the observed signal x_{f,t} and the estimated convolutional beamformer ^w as inputs, performs beamforming processing according to the following equations, estimates the sound source signal (S160), and outputs the estimated value y_{f,t}.

y_{f,t} = ^w_f^H ^x_{f,t} ∈ C
^w_f ∈ C^{M+ML}
^w = [^w_1 | … | ^w_F]
^x_{f,t} = [x_{f,t}^T | x_{f,t-D-1}^T | … | x_{f,t-D-L}^T]^T ∈ C^{M+ML}

Here, A^H denotes the Hermitian transpose of A, A^T denotes the transpose of A, Y = (y_t)_{t=1}^T is the estimated value of the sound source signal S, and D is the prediction delay.
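The beamforming step itself only requires assembling the stacked vector ^x_{f,t} from the current frame and the L frames delayed by D+1, …, D+L, and taking an inner product. A minimal sketch follows; zero-padding at the signal edges is an assumption of the illustration.

```python
import numpy as np

def stack_delayed(X, t, D, L):
    """Build ^x_{f,t} = [x_{f,t}^T | x_{f,t-D-1}^T | ... | x_{f,t-D-L}^T]^T
    for one frequency bin.  X: (M, T); out-of-range frames are zeros."""
    M, T = X.shape
    idx = [t] + [t - D - tau for tau in range(1, L + 1)]
    cols = [X[:, i] if 0 <= i < T else np.zeros(M, dtype=X.dtype) for i in idx]
    return np.concatenate(cols)          # shape (M + M*L,)

def apply_cbf(w_hat_f, X, D, L):
    """y_{f,t} = ^w_f^H ^x_{f,t} for every frame t of one frequency bin."""
    M, T = X.shape
    return np.array([np.vdot(w_hat_f, stack_delayed(X, t, D, L))
                     for t in range(T)])  # np.vdot conjugates its first argument
```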
<Spatial image estimation unit 170>
Although the scale of the convolutional beamformer ^w_f is indeterminate for each frequency bin f, it can be restored by estimating a vector u_f that approximates the spatial image s_{f,t}^{image} using the following equation.

s_{f,t}^{image} = a_f s_{f,t} ≈ u_f y_{f,t} = (u_f w_f^H)(^G_f^H ^x_{f,t}) ∈ C^M

^G = [^G_1 | … | ^G_F], where the vector u_f is required to satisfy the following conditions:

(i) w_f^H u_f = 1 (distortionless constraint)
(ii) u_f ∝ V_{N,f} w_f (because ideally a_f ∝ V_{N,f} w_f holds)

With V_N = [V_{N,1} | … | V_{N,F}], these two constraints determine the vector u_f uniquely, as in the following equation.

[Equation (11)]

The spatial image estimation unit 170 receives the estimated value V_N of the spatial covariance matrix of the non-target sound source, the estimated value y_{f,t}, and the MaxSNR beamformer w_opt for instantaneous mixing as inputs, obtains the vector u_f from V_N and w_opt using equation (11), approximates the spatial image s_{f,t}^{image} from the estimated value y_{f,t} and the vector u_f as in the following equation, and outputs the approximate value u_f y_{f,t}.

s_{f,t}^{image} ≈ u_f y_{f,t}
<Effect>
With the above configuration, introducing the MaxSNR criterion makes it possible to use all of the spatial information of the target sound source.
<Points of the second embodiment>
The MVDR CBF (the method of estimating the CBF under the MVDR criterion) requires the steering vector of the target sound source to be estimated separately in advance; this raises the problems that the sound source extraction performance of the MVDR CBF depends strongly on the estimation performance of the steering vector and that the method is inconvenient to use. This embodiment solves these problems.
To estimate the MaxSNR CBF, as shown in equations (1) and (2), the estimated value V_S of the spatial covariance matrix of the target sound source and the estimated value ^R_N of the spatio-temporal covariance matrix of the non-target sound source must be obtained in advance.

[Equation (1)]

[Equation (2)]

This embodiment describes a Blind MaxSNR CBF that eliminates the need to obtain these two estimated values in advance. Here, "Blind" means that no prior knowledge is required.
The Blind MaxSNR CBF of this embodiment estimates the MaxSNR CBF by repeatedly performing computations similar to those of the MaxSNR CBF given by equation (2) or equation (7).
The Blind MaxSNR CBF of this embodiment is defined as the following local optimal solution (equations (20a) and (20b)), using an arbitrary super-Gaussian function φ: R_{≥0} → R and the Schur complement V_X of the matrix ^R_X below.

[Equation (20a)]

[Equation (20b)]

[Definition of the matrix ^R_X]

Here, θ = (^w_f)_{f=1}^F is the variable, y_{f,t} = (^w_f)^H ^x_{f,t} is an estimate of the sound source signal, y_t = [y_{1,t} | … | y_{F,t}]^T ∈ C^F, ||A||_2 = √(A^H A) is the Euclidean norm of a vector A, and C on the right-hand side of equation (20b) is a constant determined adaptively and heuristically at each iteration of the algorithm that maximizes or minimizes the function.
More specifically, the MaxSNR CBF is optimized without prior knowledge by an iterative optimization that alternates between (a) obtaining the spatio-temporal covariance matrix ^R_Z,f, interpreted as the estimate ^R_N,f of the spatio-temporal covariance matrix of the non-target sound source, based on the following equations (21) and (22), and (b) estimating the MaxSNR CBF ^w based on the following equations (23) to (26).
[Equation (21)]
y_t^k = [… | y_f,t^k | …]^T,  y_f,t^k = (^w_f^k)^H ^x_f,t    (22)
[Equation (23)]
[Equation (24)]
[Equation (25)]
[Equation (26)]
Here, k is the index indicating the iteration number.
Furthermore, in each iteration of the above iterative optimization, the scale of the MaxSNR CBF ^w_f is aligned for each frequency f = 1, …, F based on the following equation (27).
w_f ← (u_f,m)* w_f = (e_m^T u_f)* w_f    (27)
Here m (1 ≤ m ≤ M) is the index of the reference microphone, * denotes the complex conjugate, u_f is given by equation (11) (with V_Z,f substituted for V_N,f), and u_f,m = e_m^T u_f ∈ C is the m-th element of u_f.
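For illustration, the scale alignment of equation (27) can be sketched as follows, with V_Z,f substituted for V_N,f as stated above; the variable names and array shapes are assumptions of this sketch.

```python
import numpy as np

def align_scale(w_f, V_Z_f, m):
    """Rescale the beamformer for one frequency bin (eq. (27)).

    w_f:   (D,)   current beamformer (D = M + ML in the stacked notation)
    V_Z_f: (D, D) spatial covariance matrix of the non-target sound source
    m:     index of the reference microphone (0-based here)
    """
    Vw = V_Z_f @ w_f
    u_f = Vw / (w_f.conj() @ Vw)   # eq. (11) with V_{Z,f} in place of V_{N,f}
    return np.conj(u_f[m]) * w_f   # w_f <- (u_{f,m})^* w_f
```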
<Second embodiment>
The explanation will focus on parts that are different from the first embodiment.
FIG. 3 shows a functional block diagram of the signal processing device according to the second embodiment, and FIG. 4 shows its processing flow.
The signal processing device 200 includes an initialization unit 201, a first spatial covariance matrix estimation unit 210, a spatio-temporal covariance matrix estimation unit 220, a second spatial covariance matrix estimation unit 240, a dereverberation filter estimation unit 230, a beamformer estimation unit 250, a sound source extraction unit 160, and a determination unit 280.
The signal processing device 200 takes as input the observation signal x_f,t observed with the microphones and the index m of the reference microphone, estimates the sound source signal s_f,t, and outputs it. Here, f denotes the frequency, t denotes the frame number, and the observation signal x_f,t and the sound source signal s_f,t are frequency-domain signals. Alternatively, a time-domain observation signal may be input and converted into the frequency-domain observation signal x_f,t by a frequency-domain transform unit (not shown), and the sound source signal s_f,t may be converted into a time-domain sound source signal by a time-domain transform unit (not shown) and output. The frequency-domain and time-domain transforms may be performed by any method; for example, the Fourier transform and the inverse Fourier transform can be used.
<Initialization unit 201>
The initialization unit 201 takes the index m of the reference microphone as input, sets the initial value ^w^0 = [^w_1^0, …, ^w_F^0] of the convolutional beamformer ^w to be estimated by the following formula (S201), and outputs it.
[Equation: initial value of ^w_f^0]
Here, e_m is the unit vector corresponding to the reference microphone.
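The content of the initialization formula is not reproduced above; a natural reading, shown below as an assumption of this sketch, is that ^w_f^0 simply passes the reference microphone through (the unit vector e_m padded with ML zeros).

```python
import numpy as np

def init_beamformer(F, M, L, m):
    """Assumed initializer ^w_f^0 = [e_m^T | 0 ... 0]^T in C^{M+ML} for every f."""
    w0 = np.zeros((F, M + M * L), dtype=complex)
    w0[:, m] = 1.0   # e_m: pick up the reference microphone only
    return w0
```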
<First spatial covariance matrix estimation unit 210>
The first spatial covariance matrix estimation unit 210 takes the observation signal x_f,t as input, estimates the spatial covariance matrix of the observation signal x_f,t using equations (28) to (30) (S210), and outputs the estimate V_X.
[Equation (28)]
^x_f,t = [x_f,t^T | x_f,t−D−1^T | … | x_f,t−D−L^T]^T ∈ C^{M+ML}
^R_X,f = [ R_X,f, (P̄_X,f)^H ; P̄_X,f, R̄_X,f ]    (29)
R_X,f is the block of the estimate ^R_X,f consisting of rows 1 to M and columns 1 to M; P̄_X,f is the block consisting of rows M+1 to M+ML and columns 1 to M; (P̄_X,f)^H is the block consisting of rows 1 to M and columns M+1 to M+ML; and R̄_X,f is the block consisting of rows M+1 to M+ML and columns M+1 to M+ML.
V_X,f = R_X,f − (P̄_X,f)^H (R̄_X,f)^{-1} P̄_X,f    (30)
V_X = [V_X,1, …, V_X,f, …, V_X,F]
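The following sketch mirrors S210 under the assumption that equation (28) is the sample covariance of the stacked observation; the stacking of delayed frames and the Schur complement follow equations (29) and (30). Array layout and function name are assumptions of this sketch.

```python
import numpy as np

def estimate_V_X(x, D, L):
    """x: (F, T, M) observed STFT frames. Returns V_X as (F, M, M).

    Assumes eq. (28) is the sample covariance of the stacked vector
    ^x_{f,t} = [x_{f,t}; x_{f,t-D-1}; ...; x_{f,t-D-L}].
    """
    F, T, M = x.shape
    # build the stacked (delayed) observation ^x_{f,t} in C^{M+ML}
    delays = [0] + [D + l for l in range(1, L + 1)]
    x_stack = np.concatenate(
        [np.roll(x, d, axis=1) for d in delays], axis=2)  # (F, T, M+ML); edge frames approximated
    R = np.einsum("fti,ftj->fij", x_stack, x_stack.conj()) / T   # assumed eq. (28)
    R11 = R[:, :M, :M]   # R_{X,f}
    P   = R[:, M:, :M]   # \bar{P}_{X,f}
    R22 = R[:, M:, M:]   # \bar{R}_{X,f}
    # eq. (30): Schur complement V_{X,f} = R_{X,f} - \bar{P}^H \bar{R}^{-1} \bar{P}
    return R11 - np.einsum("fij,fjk->fik",
                           P.conj().transpose(0, 2, 1),
                           np.linalg.solve(R22, P))
```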
<Spatio-temporal covariance matrix estimation unit 220>
The spatio-temporal covariance matrix estimation unit 220 takes as input the convolutional beamformer ^w^k estimated in the previous iteration (or its initial value ^w^0) and the observation signal x_f,t, obtains the spatio-temporal covariance matrix ^R_Z = [^R_Z,1, …, ^R_Z,f, …, ^R_Z,F], interpreted as the estimate ^R_N,f of the spatio-temporal covariance matrix of the non-target sound source, based on the following equations (21) and (22) (S220), and outputs it.
[Equation (21)]
y_t^k = [… | y_f,t^k | …]^T,  y_f,t^k = (^w_f^k)^H ^x_f,t    (22)
Note that when the spatio-temporal covariance matrix ^R_Z,f is computed for the first time, that is, before the convolutional beamformer ^w has been estimated by the beamformer estimation unit 250 described later, the output of the initialization unit 201 is used as the initial value ^w^0 = [^w_1^0, …, ^w_F^0] of the convolutional beamformer.
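Equation (21) is not reproduced above; a sketch consistent with the text is a reweighted covariance of the stacked observation, where each frame is weighted by a function of ||y_t||_2 derived from the super-Gaussian φ. The specific weight 1/||y_t||_2 below (corresponding to the choice φ(r) = r) and the array layout are assumptions of this sketch.

```python
import numpy as np

def estimate_R_Z(x_stack, w, eps=1e-8):
    """Sketch of S220. x_stack: (F, T, D) stacked observations, w: (F, D) current CBF.

    Eq. (22) is as in the text; the frame weighting implementing eq. (21) is an
    assumption here (weight 1/||y_t||_2, i.e. the super-Gaussian choice phi(r) = r).
    """
    y = np.einsum("fd,ftd->ft", w.conj(), x_stack)   # eq. (22): y_{f,t} = (^w_f)^H ^x_{f,t}
    norms = np.linalg.norm(y, axis=0)                # ||y_t||_2 over frequencies
    weights = 1.0 / np.maximum(norms, eps)           # assumed reweighting
    T = x_stack.shape[1]
    return np.einsum("t,fti,ftj->fij", weights, x_stack, x_stack.conj()) / T
```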
<Dereverberation filter estimation unit 230>
The dereverberation filter estimation unit 230 takes the spatio-temporal covariance matrix ^R_Z,f as input, estimates the dereverberation filter from the blocks P̄_Z,f and R̄_Z,f contained in the estimate ^R_Z,f (S230), and outputs the estimated dereverberation filter ^G. For example, the dereverberation filter is estimated by equation (25).
[Equation (25)]
Here,
^R_Z,f = [ R_Z,f, (P̄_Z,f)^H ; P̄_Z,f, R̄_Z,f ]
That is, R_Z,f is the block of the spatio-temporal covariance matrix ^R_Z,f consisting of rows 1 to M and columns 1 to M; P̄_Z,f is the block consisting of rows M+1 to M+ML and columns 1 to M; (P̄_Z,f)^H is the block consisting of rows 1 to M and columns M+1 to M+ML; and R̄_Z,f is the block consisting of rows M+1 to M+ML and columns M+1 to M+ML.
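Equation (25) is not reproduced above; a WPE-style estimate consistent with the block structure, shown below purely as an assumption of this sketch, is ^G_f = (R̄_Z,f)^{-1} P̄_Z,f.

```python
import numpy as np

def estimate_dereverb_filter(R_Z, M):
    """Assumed form of eq. (25): ^G_f = \\bar{R}_{Z,f}^{-1} \\bar{P}_{Z,f}.

    R_Z: (F, M+ML, M+ML) spatio-temporal covariance matrices.
    Returns ^G as (F, ML, M).
    """
    P_bar = R_Z[:, M:, :M]   # \bar{P}_{Z,f}
    R_bar = R_Z[:, M:, M:]   # \bar{R}_{Z,f}
    return np.linalg.solve(R_bar, P_bar)
```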
<Second spatial covariance matrix estimation unit 240>
The second spatial covariance matrix estimation unit 240 takes the spatio-temporal covariance matrix ^R_Z,f as input, estimates the spatial covariance matrix of the non-target sound source from the blocks R_Z,f, P̄_Z,f, and R̄_Z,f contained in ^R_Z,f (S240), and outputs the estimate V_Z,f ∈ S_+^{M+ML}. For example, the spatial covariance matrix of the non-target sound source is estimated by equation (31).
[Equation (31)]
 なお、第二空間共分散行列推定部240は、空間時間共分散行列^RZ,fと残響除去フィルタ推定部230で推定した残響除去フィルタ^Gとを入力とし、空間時間共分散行列^RZ,fと残響除去フィルタ^Gとから式(31)により非目的音源の空間共分散行列を推定してもよい。 Note that the second spatial covariance matrix estimation unit 240 inputs the space-time covariance matrix ^R Z,f and the dereverberation filter ^G estimated by the dereverberation filter estimation unit 230, and calculates the space-time covariance matrix ^R The spatial covariance matrix of the non-target sound source may be estimated from Z,f and the dereverberation filter ^G using equation (31).
<Beamformer estimation unit 250>
The beamformer estimation unit 250 takes as input the estimate V_X = [V_X,1, …, V_X,F] of the spatial covariance matrix of the observation signal x_f,t, the estimate V_Z = [V_Z,1, …, V_Z,F] of the spatial covariance matrix of the non-target sound source, and the estimated dereverberation filter ^G = [^G_1, …, ^G_F]. The beamformer estimation unit 250 obtains w_f^{k+1} from V_X and V_Z by equation (24).
[Equation (24)]
The beamformer estimation unit 250 estimates the convolutional beamformer from the MaxSNR beamformer w_f^{k+1} for instantaneous mixing and the estimated dereverberation filter ^G using equation (23) (S250).
[Equation (23)]
The beamformer estimation unit 250 obtains the vector u_f from the estimate V_Z = [V_Z,1, …, V_Z,F] of the spatial covariance matrix of the non-target sound source and the convolutional beamformer ^w^{k+1} by the following equation.
u_f = V_Z,f ^w_f^{k+1} / ((^w_f^{k+1})^H V_Z,f ^w_f^{k+1})
Furthermore, the beamformer estimation unit 250 uses the m-th element u_f,m of the vector u_f to align the scale of the MaxSNR CBF ^w_f^{k+1} for each frequency f = 1, …, F based on the following equation (29), and outputs the scale-aligned convolutional beamformer ^w^{k+1}.
^w_f^{k+1} ← (u_f,m)* ^w_f^{k+1} = (e_m^T u_f)* ^w_f^{k+1}    (29)
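Equations (23) and (24) are not reproduced above, but a MaxSNR (generalized eigenvalue) beamformer can be sketched as the principal generalized eigenvector of the pencil formed by the two spatial covariance matrices; the composition with ^G and the exact forms of both equations are assumptions of this sketch.

```python
import numpy as np
from scipy.linalg import eigh

def estimate_cbf(V_X_f, V_Z_f, G_f):
    """Sketch of S250 for one frequency bin (forms of eqs. (23)-(24) are assumptions).

    V_X_f: (M, M) covariance of the observed signal (Schur complement)
    V_Z_f: (M+ML, M+ML) covariance of the non-target sound source
    G_f:   (ML, M) dereverberation filter
    """
    M = V_X_f.shape[0]
    # assumed eq. (24): MaxSNR (GEV) beamformer for instantaneous mixing, i.e.
    # the principal generalized eigenvector of (V_X_f, V_Z_f[:M, :M])
    _, vecs = eigh(V_X_f, V_Z_f[:M, :M])
    w_f = vecs[:, -1]   # eigenvector of the largest generalized eigenvalue
    # assumed eq. (23): compose dereverberation and beamforming into one CBF,
    # ^w_f = [w_f; -G_f w_f], so that ^w_f^H ^x = w_f^H (x_t - G_f^H xbar_t)
    return np.concatenate([w_f, -G_f @ w_f])
```

The per-frequency scale fix of equation (29) can then be applied as in the align_scale sketch shown earlier.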
<Sound source extraction unit 160>
The sound source extraction unit 160 takes as input the observation signal x_f,t and the estimated convolutional beamformer ^w^{k+1}, performs beamforming by the following equation to estimate the sound source signal (S160), and outputs the estimate y_f,t.
y_f,t = (^w_f^{k+1})^H ^x_f,t ∈ C
^w_f^{k+1} ∈ C^{M+ML}
^w^{k+1} = [^w_1^{k+1} | … | ^w_F^{k+1}]
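In array form, the beamforming step S160 is a single inner product per time-frequency bin; the array layout below is an assumption of this sketch.

```python
import numpy as np

def extract_source(w_cbf, x_stack):
    """y_{f,t} = (^w_f)^H ^x_{f,t}; w_cbf: (F, M+ML), x_stack: (F, T, M+ML)."""
    return np.einsum("fd,ftd->ft", w_cbf.conj(), x_stack)
```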
<Determination unit 280>
The determination unit 280 determines whether a convergence condition is satisfied (S280). If the convergence condition is satisfied (YES in S280), the estimate y_f,t at that point is output as the output of the signal processing device, and the processing ends. If the convergence condition is not satisfied (NO in S280), the determination unit 280 sends control signals to the respective units so as to repeat S220 to S160, thereby controlling their processing. Note that the estimate y_f,t output by the sound source extraction unit 160 can be used in the spatio-temporal covariance matrix estimation unit 220, in which case the computation of equation (22) can be omitted. As the convergence condition, conditions such as whether the learning has been repeated a fixed number of times (for example, several times), or whether the difference between the convolutional beamformer ^w^{k+1} before and after the update is below a predetermined threshold, can be used.
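The convergence test itself can be as simple as the following sketch, which combines the two conditions mentioned above; the iteration budget and the change threshold are illustrative values, not part of the embodiment.

```python
import numpy as np

def converged(w_new, w_old, it, max_iters=5, tol=1e-4):
    """Stop after a fixed number of iterations or when the CBF barely changes."""
    rel_change = np.linalg.norm(w_new - w_old) / max(np.linalg.norm(w_old), 1e-12)
    return it + 1 >= max_iters or rel_change <= tol
```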
<Effect>
With such a configuration, effects similar to those of the first embodiment can be obtained. Furthermore, the Blind MaxSNR CBF of this embodiment is an extremely fast method that can estimate the MaxSNR CBF with high accuracy in at most a few iterations.
In this embodiment, the estimate y_f,t of the sound source signal s_f,t is output; however, a spatial image estimation unit 170 may be provided so that, using the estimate y_f,t at the time the convergence condition is satisfied, the approximation u_f y_f,t of the spatial image s_f,t^image is obtained and output.
<Points of the third embodiment>
In this embodiment, as a by-product of the Blind MaxSNR CBF of the second embodiment, we realize "Iteratively Reweighted MaxSNR CBF (IR-MaxSNR CBF)", a method that estimates the MaxSNR CBF with high accuracy in the situation where the spatial covariance matrix V_S of the target sound source is known (that is, estimated in advance) while the spatio-temporal covariance matrix ^R_N of the unwanted sound is unknown (that is, not estimated in advance).
When the spatial covariance matrix V_S of the target sound source can be estimated with high accuracy, using that information allows the MaxSNR CBF to be estimated more accurately than with the Blind MaxSNR CBF of the second embodiment.
<Third embodiment>
The explanation will focus on parts that are different from the second embodiment.
FIG. 5 shows a functional block diagram of the signal processing device according to the third embodiment, and FIG. 6 shows its processing flow.
The signal processing device 300 includes an initialization unit 201, a first spatial covariance matrix estimation unit 110, a spatio-temporal covariance matrix estimation unit 220, a second spatial covariance matrix estimation unit 240, a dereverberation filter estimation unit 230, a beamformer estimation unit 250, a sound source extraction unit 160, and a determination unit 280.
This embodiment differs from the second embodiment in that it includes the first spatial covariance matrix estimation unit 110 in place of the first spatial covariance matrix estimation unit 210. The first spatial covariance matrix estimation unit 110 is as described in the first embodiment. The beamformer estimation unit 250 also differs from the second embodiment in that it uses the estimate V_S of the spatial covariance matrix of the target sound source in place of the estimate V_X of the spatial covariance matrix of the observation signal x_f,t. The other processing is the same as in the second embodiment.
<Other variations>
The present invention is not limited to the above-described embodiments and modifications. For example, the various processes described above may not only be executed in chronological order as described, but may also be executed in parallel or individually depending on the processing capacity of the device executing the process or as needed. Other changes may be made as appropriate without departing from the spirit of the present invention.
<Program and recording medium>
The various processes described above can be carried out by loading a program for executing the steps of the above methods into the recording unit 2020 of the computer 2000 shown in FIG. 7, and causing the control unit 2010, the input unit 2030, the output unit 2040, the display unit 2050, and the like to operate.
The program describing these processing contents can be recorded on a computer-readable recording medium. The computer-readable recording medium may be of any type, such as a magnetic recording device, an optical disk, a magneto-optical recording medium, or a semiconductor memory.
The program is distributed, for example, by selling, transferring, or lending a portable recording medium, such as a DVD or CD-ROM, on which the program is recorded. Alternatively, the program may be distributed by storing it in the storage device of a server computer and transferring it from the server computer to another computer via a network.
A computer that executes such a program first stores, for example, the program recorded on the portable recording medium or the program transferred from the server computer in its own storage device. When executing the processing, the computer reads the program stored in its own recording medium and executes the processing according to the read program. As another form of executing the program, the computer may read the program directly from the portable recording medium and execute the processing according to the program, or may sequentially execute the processing according to the received program each time the program is transferred from the server computer to the computer. The above-described processing may also be executed by a so-called ASP (Application Service Provider) type service, which realizes the processing functions only through execution instructions and result acquisition, without transferring the program from the server computer to the computer. Note that the program in this embodiment includes information that is provided for processing by an electronic computer and that is equivalent to a program (such as data that is not a direct command to the computer but has the property of defining the processing of the computer).
Furthermore, although the present apparatus is configured in this embodiment by executing a predetermined program on a computer, at least part of these processing contents may be implemented in hardware.

Claims (6)

1. A signal processing device comprising:
    a second spatial covariance matrix estimation unit that estimates a spatial covariance matrix of a non-target sound source using an estimated value of a spatio-temporal covariance matrix of the non-target sound source;
    a dereverberation filter estimation unit that estimates a dereverberation filter using the estimated value of the spatio-temporal covariance matrix of the non-target sound source;
    a beamformer estimation unit that estimates a convolutional beamformer using an estimated value of a spatial covariance matrix of an observation signal or of a target sound source, the estimated value of the spatial covariance matrix of the non-target sound source, and the estimated dereverberation filter; and
    a sound source extraction unit that performs beamforming processing using the observation signal and the estimated convolutional beamformer to estimate a sound source signal.
2. The signal processing device according to claim 1, further comprising:
    a first spatial covariance matrix estimation unit that estimates, from the observation signal, a section containing sound emitted by the target sound source (hereinafter also referred to as a target signal) and estimates the spatial covariance matrix of the target sound source using the estimated target signal; and
    a spatio-temporal covariance matrix estimation unit that estimates, from the observation signal, a section not containing sound emitted by the target sound source (hereinafter also referred to as a non-target signal) and estimates the spatio-temporal covariance matrix of the non-target sound source using the estimated non-target signal,
    wherein the beamformer estimation unit estimates the convolutional beamformer using the estimated value of the spatial covariance matrix of the target sound source, the estimated value of the spatial covariance matrix of the non-target sound source, and the estimated dereverberation filter.
3. The signal processing device according to claim 1, further comprising:
    a first spatial covariance matrix estimation unit that estimates a spatial covariance matrix of the observation signal using the observation signal; and
    a spatio-temporal covariance matrix estimation unit that estimates the spatio-temporal covariance matrix of the non-target sound source using the observation signal and the estimated convolutional beamformer,
    wherein the beamformer estimation unit estimates the convolutional beamformer using the estimated value of the spatial covariance matrix of the observation signal, the estimated value of the spatial covariance matrix of the non-target sound source, and the estimated dereverberation filter, and
    the processing in the spatio-temporal covariance matrix estimation unit, the second spatial covariance matrix estimation unit, the dereverberation filter estimation unit, the beamformer estimation unit, and the sound source extraction unit is repeated until a convergence condition is satisfied.
4. The signal processing device according to claim 1, further comprising:
    a first spatial covariance matrix estimation unit that estimates, from the observation signal, a section containing sound emitted by the target sound source (hereinafter also referred to as a target signal) and estimates the spatial covariance matrix of the target sound source using the estimated target signal; and
    a spatio-temporal covariance matrix estimation unit that estimates the spatio-temporal covariance matrix of the non-target sound source using the observation signal and the estimated convolutional beamformer,
    wherein the beamformer estimation unit estimates the convolutional beamformer using the estimated value of the spatial covariance matrix of the target sound source, the estimated value of the spatial covariance matrix of the non-target sound source, and the estimated dereverberation filter, and
    the processing in the spatio-temporal covariance matrix estimation unit, the second spatial covariance matrix estimation unit, the dereverberation filter estimation unit, the beamformer estimation unit, and the sound source extraction unit is repeated until a convergence condition is satisfied.
5. A signal processing method comprising:
    a second spatial covariance matrix estimation step of estimating a spatial covariance matrix of a non-target sound source using an estimated value of a spatio-temporal covariance matrix of the non-target sound source;
    a dereverberation filter estimation step of estimating a dereverberation filter using the estimated value of the spatio-temporal covariance matrix of the non-target sound source;
    a beamformer estimation step of estimating a convolutional beamformer using an estimated value of a spatial covariance matrix of an observation signal or of a target sound source, the estimated value of the spatial covariance matrix of the non-target sound source, and the estimated dereverberation filter; and
    a sound source extraction step of performing beamforming processing using the observation signal and the estimated convolutional beamformer to estimate a sound source signal.
6. A program for causing a computer to function as the signal processing device according to any one of claims 1 to 4.
PCT/JP2022/031099 2022-08-17 2022-08-17 Signal processing device, signal processing method, and program WO2024038522A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/JP2022/031099 WO2024038522A1 (en) 2022-08-17 2022-08-17 Signal processing device, signal processing method, and program


Publications (1)

Publication Number Publication Date
WO2024038522A1 true WO2024038522A1 (en) 2024-02-22

Family

ID=89941461


Country Status (1)

Country Link
WO (1) WO2024038522A1 (en)


Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020121545A1 (en) * 2018-12-14 2020-06-18 日本電信電話株式会社 Signal processing device, signal processing method, and program

Similar Documents

Publication Publication Date Title
JP4422692B2 (en) Transmission path estimation method, dereverberation method, sound source separation method, apparatus, program, and recording medium
Heymann et al. A generic neural acoustic beamforming architecture for robust multi-channel speech processing
US11894010B2 (en) Signal processing apparatus, signal processing method, and program
JPWO2007100137A1 (en) Reverberation removal apparatus, dereverberation removal method, dereverberation removal program, and recording medium
KR102410850B1 (en) Method and apparatus for extracting reverberant environment embedding using dereverberation autoencoder
JP6815956B2 (en) Filter coefficient calculator, its method, and program
WO2024038522A1 (en) Signal processing device, signal processing method, and program
JP7428251B2 (en) Target sound signal generation device, target sound signal generation method, program
US11322169B2 (en) Target sound enhancement device, noise estimation parameter learning device, target sound enhancement method, noise estimation parameter learning method, and program
JP2018031910A (en) Sound source emphasis learning device, sound source emphasis device, sound source emphasis learning method, program, and signal processing learning device
JP7444243B2 (en) Signal processing device, signal processing method, and program
US11676619B2 (en) Noise spatial covariance matrix estimation apparatus, noise spatial covariance matrix estimation method, and program
JP7156064B2 (en) Latent variable optimization device, filter coefficient optimization device, latent variable optimization method, filter coefficient optimization method, program
US11297418B2 (en) Acoustic signal separation apparatus, learning apparatus, method, and program thereof
WO2021171406A1 (en) Signal processing device, signal processing method, and program
WO2022180741A1 (en) Acoustic signal enhancement device, method, and program
WO2023276170A1 (en) Acoustic signal enhancement device, acoustic signal enhancement method, and program
JP2020030373A (en) Sound source enhancement device, sound source enhancement learning device, sound source enhancement method, program
WO2021100215A1 (en) Sound source signal estimation device, sound source signal estimation method, and program
WO2021144934A1 (en) Voice enhancement device, learning device, methods therefor, and program
JP7375904B2 (en) Filter coefficient optimization device, latent variable optimization device, filter coefficient optimization method, latent variable optimization method, program
JP7375905B2 (en) Filter coefficient optimization device, filter coefficient optimization method, program
Moir et al. Decorrelation of multiple non‐stationary sources using a multivariable crosstalk‐resistant adaptive noise canceller
JP2018191255A (en) Sound collecting device, method thereof, and program
JP6989031B2 (en) Transfer function estimator, method and program

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22955696

Country of ref document: EP

Kind code of ref document: A1