WO2021205494A1 - Signal processing device, signal processing method, and program - Google Patents

Signal processing device, signal processing method, and program

Info

Publication number
WO2021205494A1
WO2021205494A1 (PCT/JP2020/015456)
Authority
WO
WIPO (PCT)
Prior art keywords
beamformer
estimation unit
signal
convolution
reverberation suppression
Prior art date
Application number
PCT/JP2020/015456
Other languages
French (fr)
Japanese (ja)
Inventor
Tomohiro Nakatani (中谷 智広)
Keisuke Kinoshita (木下 慶介)
Rintaro Ikeshita (池下 林太郎)
Marc Delcroix (マーク デルクロア)
Original Assignee
Nippon Telegraph and Telephone Corporation (日本電信電話株式会社)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nippon Telegraph and Telephone Corporation
Priority to JP2022513704A priority Critical patent/JP7444243B2/en
Priority to PCT/JP2020/015456 priority patent/WO2021205494A1/en
Publication of WO2021205494A1 publication Critical patent/WO2021205494A1/en

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0272 Voice signal separating

Definitions

  • The present invention relates to a technique for extracting a target sound from an acoustic signal by suppressing sounds other than the target sound, other noise, and reverberation.
  • Non-Patent Document 1 discloses a method of suppressing noise and reverberation from an acoustic signal in the frequency domain.
  • In this method, the acoustic signal is first received, a reverberation suppression filter that suppresses the reverberation of the target sound is estimated based on a weighted prediction-error power minimization criterion, and the filter is applied to the acoustic signal to remove the reverberation.
  • After that, a steering vector representing the direction of the target sound (or its estimate) is received, and a minimum-power distortionless response (MPDR) beamformer that minimizes the power of the acoustic signal, under the constraint that the sound arriving at the microphones from the source position of the target sound is not distorted, is estimated and applied to the dereverberated acoustic signal to further suppress noise.
  • Non-Patent Document 2 discloses a method that, under the assumption that the acoustic signal contains no diffuse noise and consists only of a plurality of point sound sources, simultaneously achieves optimal reverberation suppression and sound source separation by alternately updating the coefficients of a reverberation suppression block and a sound source separation block so as to minimize a given objective function.
  • However, in the method of Non-Patent Document 1, the reverberation suppression filter and the minimum-power distortionless response beamformer are optimized independently, so optimal processing cannot be achieved as a whole. In addition, the method of Non-Patent Document 2 cannot suppress diffuse noise from the acoustic signal.
  • The present invention has been made in view of these points, and its object is to perform reverberation suppression, diffuse noise suppression, and target sound source separation that are optimized as a whole.
  • In the present invention, frequency-divided time-series acoustic signals and auxiliary information representing information on the target sound are received, a convolutional beamformer that performs reverberation suppression, diffuse noise suppression, and target sound source separation is estimated based on the optimization criterion that the signal obtained by applying the convolutional beamformer to the acoustic signal follows a probabilistic model, and the estimated convolutional beamformer is applied to the acoustic signal to obtain and output a processed signal.
  • By using the auxiliary information representing the information of the target sound, the convolutional beamformer that performs reverberation suppression, diffuse noise suppression, and target sound source separation can be optimized as a whole. Therefore, reverberation suppression, diffuse noise suppression, and target sound source separation that are optimized as a whole can be performed.
  • FIG. 1 is a block diagram illustrating a functional configuration of the signal processing device of the embodiment.
  • FIG. 2 is a block diagram illustrating the functional configuration of the signal processing device of the second embodiment and its modified example.
  • FIG. 3 is a flow chart for explaining a signal processing method of the second embodiment and a modified example thereof.
  • FIG. 4 is a flow chart for explaining a signal processing method of a modified example of the second embodiment.
  • FIG. 5 is a flow chart for explaining a signal processing method of the second embodiment and a modified example thereof.
  • FIG. 6 is a block diagram for exemplifying the estimation process of RTF (Relative Transfer Function).
  • FIG. 7 is a flow chart for explaining a signal processing method of a modified example of the second embodiment.
  • FIG. 8 is a flow chart for explaining a signal processing method of a modified example of the second embodiment.
  • FIG. 9 is a block diagram illustrating the functional configuration of the signal processing device of the third embodiment and its modified example.
  • FIG. 10 is a flow chart for explaining a signal processing method of the third embodiment and a modified example thereof.
  • FIG. 11 is a flow chart for explaining a signal processing method of the third embodiment and a modified example thereof.
  • FIG. 12 is a flow chart for explaining a signal processing method of a modified example of the third embodiment.
  • FIG. 13 is a flow chart for explaining a signal processing method of a modified example of the third embodiment.
  • FIG. 14 is a block diagram illustrating a hardware configuration of the signal processing device of the embodiment.
  • FIG. 15 is a graph illustrating the word error rate when the processed signals obtained in the fourth embodiment and the modified examples 1 and 2 of the second embodiment are voice-recognized.
  • In the first embodiment, the signal processing device receives the frequency-divided time-series acoustic signal and the auxiliary information representing the information of the target sound, and performs reverberation suppression, diffuse noise suppression, and target sound source separation.
  • The convolutional beamformer is estimated based on the optimization criterion that the signal obtained by applying the beamformer to the acoustic signal follows the probabilistic model, and the estimated convolutional beamformer is applied to the acoustic signal to obtain and output the processed signal.
  • The signal processing device 1 of the present embodiment has a convolutional beamformer estimation unit 11, a convolutional beamformer application unit 12, and a control unit 13, and executes each process under the control of the control unit 13.
  • It is assumed that source signals emitted from I sound sources are observed by M microphones in an environment where reverberation and diffuse noise are present. Here, I and M are integers of 1 or more, satisfying the relationship M ≥ I.
  • The observed mixture is frequency-divided by a well-known method such as the short-time Fourier transform, yielding the acoustic signal x_{t,f} at each time-frequency point (the acoustic signal x_{t,f} in the time-frequency domain).
  • Such acoustic signals x_{t,f} are modeled as follows:
  x_{t,f} = Σ_{i=1}^{I} x_{t,f}^{(i)} + n_{t,f}   …(1)
  x_{t,f}^{(i)} = d_{t,f}^{(i)} + r_{t,f}^{(i)}   …(2)
  • t ∈ {1, ..., N} is a time index corresponding to a time interval (frame), and f ∈ {1, ..., F} is a frequency index corresponding to a frequency band (frequency bin).
  • N and F are positive integers; for example, N is an integer of 2 or more.
  • Hereinafter, the time interval corresponding to time index t is called "time interval t", and the frequency band corresponding to frequency index f is called "frequency band f".
  • i is the index of the sound source of each target sound, i ∈ {1, ..., I}, and the source signal emitted from sound source i is called "source signal i".
  • The acoustic signal x_{t,f} = [x_{1,t,f}, ..., x_{M,t,f}]^T ∈ C^{M×1} is an M-dimensional column vector whose elements are the signals x_{1,t,f}, ..., x_{M,t,f} in frequency band f obtained by frequency-dividing, in each time interval t, all the signals observed by the M microphones; C denotes the set of complex numbers and (·)^T denotes the non-conjugate transpose.
  • The microphone image signal x_{t,f}^{(i)} = [x_{1,t,f}^{(i)}, ..., x_{M,t,f}^{(i)}]^T ∈ C^{M×1} is an M-dimensional column vector whose elements are the components of x_{1,t,f}, ..., x_{M,t,f} consisting of the direct sound, early reflections, and late reverberation of source signal i; it contains no diffuse-noise component.
  • The diffuse noise n_{t,f} = [n_{1,t,f}, ..., n_{M,t,f}]^T ∈ C^{M×1} is an M-dimensional column vector whose elements are the components of x_{1,t,f}, ..., x_{M,t,f} corresponding to the diffuse noise.
  • The microphone image signal x_{t,f}^{(i)} of equation (1) is further divided into two components as in equation (2).
  • The target sound d_{t,f}^{(i)} represents the components of the microphone image signal x_{t,f}^{(i)} corresponding to the direct sound and the early reflections, and the late reverberation r_{t,f}^{(i)} represents the component of x_{t,f}^{(i)} corresponding to the late reverberation.
  • For symbols written in the form "χ_α^β", such as x_{t,f}^{(i)}, d_{t,f}^{(i)}, and r_{t,f}^{(i)}, the superscript β should properly be written directly above α (see equation (2)), but owing to notational restrictions it may be written at the upper right of α.
  • In the reverberation suppression, diffuse noise suppression, and target sound source separation of the present embodiment, the diffuse noise n_{t,f} and the late reverberation r_{t,f}^{(i)} corresponding to each sound source i are suppressed from the acoustic signal x_{t,f} of equation (1), and the target sound d_{t,f}^{(i)} corresponding to each sound source i is separated and extracted.
  • Frequency-divided time-series acoustic signals x_{t,f} for all t ∈ {1, ..., N} and f ∈ {1, ..., F} are input to the signal processing device 1.
  • The acoustic signal x_{t,f} exemplified in the present embodiment is obtained by frequency-dividing the signal obtained by observing the mixture of the diffuse noise and the microphone image signals based on the source signals emitted from the sound sources. Further, auxiliary information s representing information on the target sound is input to the signal processing device 1.
  • The auxiliary information s is information for specifying or estimating the RTF (Relative Transfer Function) ṽ_f^{(i)} (see, for example, Reference 1).
  • Reference 1: I. Cohen, "Relative transfer function identification using speech signals," IEEE Trans. on Speech and Audio Processing, vol. 12, no. 5, pp. 451-459, 2004.
  • Equation (3) shows an example of the RTF ṽ_f^{(i)}:
  ṽ_f^{(i)} = v_f^{(i)} / v_{1,f}^{(i)}   …(3)
  • That is, ṽ_f^{(i)} of equation (3) is obtained by normalizing each element of v_f^{(i)} with the element v_{1,f}^{(i)} taken as the reference.
  • For symbols written in the form "χ̃", such as ṽ, the tilde should properly be written directly above χ (see equation (3)), but owing to notational restrictions it may be written at the upper right of χ.
  • Examples of the information for specifying or estimating the RTF ṽ_f^{(i)} are a reference sound of the target sound, the time-frequency mask γ_{t,f}^{(i)} of the sound source i of the target sound, the steering vector v_f^{(i)}, and the RTF ṽ_f^{(i)} itself.
  • Each time-frequency mask γ_{t,f}^{(i)} represents a value corresponding to the existence probability, or presence/absence, of the source signal i in time interval t and frequency band f.
  • A method for estimating the time-frequency mask γ_{t,f}^{(i)} is described in, for example, Reference 2.
  • Reference 2: F. Bahmaninezhad, J. Wu, R. Gu, S.-X. Zhang, Y. Xu, M. Yu, and D.
  • A method of estimating the time-frequency mask from a reference sound of the target sound is described in, for example, Reference 3.
  • Reference 3: K. Zmolikova, M. Delcroix, K. Kinoshita, T. Ochiai, T. Nakatani, L. Burget, and J. Cernocky, "SpeakerBeam: Speaker aware neural network for target speaker extraction in speech mixtures," IEEE Journal of Selected Topics in Signal Processing, vol. 13, no. 4, pp. 800-814, 2019.
  • The auxiliary information s may further include information specifying the power of the target sound.
  • A method for estimating the power of the target sound is described in, for example, Reference 3B.
  • Reference 3B: Yong Xu, Jun Du, Li-Rong Dai, and Chin-Hui Lee, "A Regression Approach to Speech Enhancement Based on Deep Neural Networks," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 23, no. 1, pp. 7-19, 2015.
  • The acoustic signal x_{t,f} is input to the convolutional beamformer estimation unit 11 and the convolutional beamformer application unit 12, and the auxiliary information s is input to the convolutional beamformer estimation unit 11.
  • The convolutional beamformer estimation unit 11 receives the acoustic signal x_{t,f} and the auxiliary information s, and estimates the convolutional beamformer based on the optimization criterion that the signal y_{t,f}, obtained by applying to x_{t,f} a convolutional beamformer that performs reverberation suppression, diffuse noise suppression, and target sound source separation, follows the probabilistic model.
  • The convolutional beamformer is expressed, for example, as
  y_{t,f} = Σ_{τ ∈ {0, Δ, Δ+1, ..., L-1}} W_τ^H x_{t-τ,f}   …(4)
  where W_τ ∈ C^{M×I} for τ ∈ {0, Δ, Δ+1, ..., L-1} is an M × I matrix whose elements are beamformer coefficients, and (·)^H represents the conjugate transpose of (·).
  • Δ is a positive integer representing the number of time intervals (frames) corresponding to the length of the early reflections, and L is a positive integer representing the filter length. At least Δ ≥ 1; an example of Δ is a positive integer representing a time interval corresponding to 30 to 50 ms.
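  • As an illustration of the tap structure of equation (4), the following is a minimal numpy sketch that applies a convolutional beamformer with taps {0, Δ, Δ+1, ..., L-1} to one frequency bin; all function and variable names are illustrative assumptions, not taken from the patent.

```python
import numpy as np

def apply_convolutional_beamformer(X, W):
    """Apply a convolutional beamformer (cf. equation (4)) at one bin.

    X: (N, M) complex array, observed signal x_{t,f} for t = 0..N-1.
    W: dict mapping each tap tau in {0, delta, delta+1, ..., L-1} to an
       (M, I) matrix W_tau of beamformer coefficients; the gap over taps
       {1, ..., delta-1} preserves the early reflections of the targets.
    Returns Y: (N, I) complex array of output signals y_{t,f}.
    """
    N, M = X.shape
    I = next(iter(W.values())).shape[1]
    Y = np.zeros((N, I), dtype=complex)
    for t in range(N):
        for tau, W_tau in W.items():
            if t - tau >= 0:
                Y[t] += W_tau.conj().T @ X[t - tau]  # W_tau^H x_{t-tau,f}
    return Y

# Example with M = 4 microphones, I = 2 targets, delta = 2, L = 5
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 4)) + 1j * rng.standard_normal((100, 4))
W = {tau: rng.standard_normal((4, 2)) + 0j for tau in [0, 2, 3, 4]}
Y = apply_convolutional_beamformer(X, W)  # shape (100, 2)
```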
  • Equation (4) can be transformed as follows.
  • Equation (7) can be transformed into the following equations (12) and (13).
  • Equation (9) can be transformed into the following equation (9′).
  • I_M ∈ R^{M×M} represents the M × M identity matrix, R denotes the set of all real numbers, and ⊗ represents the Kronecker product.
  • Equation (10) can be transformed into the following equation (10′).
  • The convolutional beamformer estimation unit 11 estimates the convolutional beamformer based on the optimization criterion that y_{t,f} follows the probabilistic model.
  • As the probabilistic model, a model satisfying the following (a) and (b) can be exemplified.
  • E{·} represents the expectation function.
  • λ_{t,f}^{(i)} will be referred to as the power of y_{t,f}^{(i)}.
  • For each i ∈ {1, ..., I}, the convolutional beamformer is constrained not to distort the sound arriving at the microphones from sound source i. This constraint can be described, for example, by the following equation (15) or (16).
  • The power of the target sound specified by the auxiliary information s may be used as λ_{t,f}^{(i)}.
  • Otherwise, λ_{t,f}^{(i)} is obtained from the y_{t,f}^{(i)} computed by the convolutional beamformer application unit 12, as shown below:
  λ_{t,f}^{(i)} ← |y_{t,f}^{(i)}|²
  • Since the y_{t,f}^{(i)} obtained by the convolutional beamformer application unit 12 depend on the convolutional beamformer estimated by the convolutional beamformer estimation unit 11, in this case the processes of the convolutional beamformer estimation unit 11 and the convolutional beamformer application unit 12 must be repeated alternately until a predetermined convergence condition is satisfied.
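  • The alternation of steps S11 and S12 can be sketched as follows. This is a minimal Python sketch, assuming array-valued coefficients and a mask-based power initialization; `estimate_cbf` and `apply_cbf` are hypothetical stand-ins for the estimation and application steps, whose exact update formulas are given by the displayed equations of the embodiment.

```python
import numpy as np

def optimize_convolutional_beamformer(X, masks, estimate_cbf, apply_cbf,
                                      max_iter=10, tol=1e-4):
    """Alternate convolutional beamformer estimation (step S11) and
    application (step S12) until the coefficients stop changing.

    X:            (N, M, F) acoustic signal x_{t,f}.
    masks:        (N, I, F) time-frequency masks, used here only to set
                  the initial powers (an assumed initialization).
    estimate_cbf: callable(X, power) -> coefficient array (step S11).
    apply_cbf:    callable(X, cbf) -> (N, I, F) signals y_{t,f} (step S12).
    """
    # initial powers lambda_{t,f}^{(i)}: mask-weighted mean mic power
    power = masks * (np.abs(X) ** 2).mean(axis=1, keepdims=True)
    cbf_prev = None
    for _ in range(max_iter):
        cbf = estimate_cbf(X, power)   # step S11
        Y = apply_cbf(X, cbf)          # step S12
        power = np.abs(Y) ** 2         # lambda <- |y|^2
        if cbf_prev is not None and np.max(np.abs(cbf - cbf_prev)) < tol:
            break                      # coefficient change below threshold
        cbf_prev = cbf
    return Y, cbf
```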
  • <<Processing of the convolutional beamformer application unit 12 (step S12)>> The acoustic signal x_{t,f} and the information specifying the convolutional beamformer output from the convolutional beamformer estimation unit 11 are input to the convolutional beamformer application unit 12.
  • The convolutional beamformer application unit 12 applies the convolutional beamformer specified by this information (equations (4), (7), (9)(10), (9′)(10′), or (12)(13)) to the acoustic signal x_{t,f} to obtain and output the processed signal y_{t,f}.
  • When the auxiliary information s includes information specifying the power of the target sound, the signal processing device 1 outputs the processed signal y_{t,f} obtained in step S12. In this case, the iterative processing of steps S11 and S12 is unnecessary.
  • When the auxiliary information s does not include information specifying the power of the target sound, the process of step S11 and the process of step S12 are repeated alternately until a predetermined convergence condition is satisfied. Examples of the convergence condition are that the number of iterations reaches a predetermined number, or that the change in the convolutional beamformer coefficients between successive iterations is equal to or less than a predetermined amount.
  • When the convergence condition is satisfied, the signal processing device 1 outputs the processed signal y_{t,f} obtained in step S12.
  • The processed signal y_{t,f} is the result of applying reverberation suppression, diffuse noise suppression, and target sound source separation to the acoustic signal x_{t,f}.
  • The output processed signal y_{t,f} may be input to other arithmetic processing, or may be converted into a time-domain acoustic signal by a well-known method such as the inverse Fourier transform.
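  • For instance, if the frequency division was performed with a short-time Fourier transform, the processed signal can be returned to the time domain with the inverse STFT. A minimal scipy sketch, assuming illustrative parameter values that must match the analysis STFT:

```python
import numpy as np
from scipy.signal import istft

def to_time_domain(Y, fs=16000, nperseg=512, noverlap=384):
    """Convert one separated source y_{t,f} back to a waveform.
    Y: (F, N) complex array with F = nperseg // 2 + 1 frequency bins;
    the parameters must match those used for the frequency division."""
    _, y = istft(Y, fs=fs, nperseg=nperseg, noverlap=noverlap)
    return y
```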
  • In the present embodiment, the convolutional beamformer that performs reverberation suppression, diffuse noise suppression, and target sound source separation is estimated based on the optimization criterion that the signal obtained by applying it to the acoustic signal follows the probabilistic model, using the auxiliary information s that represents the information of the target sound. As a result, the convolutional beamformer can be optimized as a whole, and more effective speech enhancement can be realized.
  • In the second embodiment, the convolutional beamformer is divided into a reverberation suppression filter that suppresses reverberation and a beamformer that suppresses diffuse noise and separates the target sound sources.
  • That is, the convolutional beamformer of the present embodiment includes a reverberation suppression filter that suppresses reverberation and a beamformer that suppresses diffuse noise and separates the target sound sources.
  • However, the optimization of the reverberation suppression filter and that of the beamformer are not independent of each other; the convolutional beamformer is optimized as a whole. Examples of the reverberation suppression filter are equations (9) and (9′); in the present embodiment, equation (9′) is used as the reverberation suppression filter and equation (10′) as the beamformer, as an example.
  • In the present embodiment, power-weighted spatio-temporal covariance matrices of each target sound are used to optimize the reverberation suppression filter. Since the power-weighted spatio-temporal covariance matrices of the target sound are small in size, the reverberation suppression filter can be optimized with a small amount of calculation.
  • The signal processing device 2 of the present embodiment has a spatio-temporal covariance estimation unit 211, a reverberation suppression filter estimation unit 212, a beamformer estimation unit 213, a reverberation suppression filter application unit 221, a beamformer application unit 222, and a control unit 13, and executes each process under the control of the control unit 13.
  • The spatio-temporal covariance estimation unit 211, the reverberation suppression filter estimation unit 212, and the beamformer estimation unit 213 constitute the convolutional beamformer estimation unit.
  • The reverberation suppression filter application unit 221 and the beamformer application unit 222 constitute the convolutional beamformer application unit.
  • The auxiliary information s is input to the beamformer estimation unit 213, and the acoustic signal x_{t,f} is input to the spatio-temporal covariance estimation unit 211 and the reverberation suppression filter application unit 221.
  • The auxiliary information s of the present embodiment is information for specifying or estimating the RTF ṽ_f^{(i)}, and does not include information specifying the power of the target sound.
  • The spatio-temporal covariance estimation unit 211 calculates and outputs the power-weighted spatio-temporal covariance matrices R̄_{x,f}^{(i)} and P_{x,f}^{(i)} of the target sound.
  • The reverberation suppression filter estimation unit 212 receives the power-weighted spatio-temporal covariance matrices R̄_{x,f}^{(i)} and P_{x,f}^{(i)} of the target sound and the information q_f^{(i)} representing the beamformer, and estimates the reverberation suppression filter based on the optimization criterion described above. The reverberation suppression filter application unit 221 applies the reverberation suppression filter estimated by the reverberation suppression filter estimation unit 212 to the acoustic signal x_{t,f} to obtain and output the reverberation-suppressed signal z_{t,f} (equation (9′)).
  • The beamformer estimation unit 213 receives the reverberation-suppressed signal z_{t,f} and the auxiliary information s, and estimates the beamformer based on the optimization criterion described above.
  • The beamformer application unit 222 applies the beamformer estimated by the beamformer estimation unit 213 to the reverberation-suppressed signal z_{t,f} to obtain and output the processed signal y_{t,f}^{(i)}.
  • The processing of the units constituting the convolutional beamformer estimation unit and the processing of the reverberation suppression filter application unit 221 and the beamformer application unit 222 constituting the convolutional beamformer application unit are repeated alternately until a predetermined convergence condition is satisfied.
  • When the convergence condition is satisfied, the signal processing device 2 outputs the processed signal y_{t,f} obtained by the beamformer application unit 222.
  • Auxiliary information s is input to the beamformer estimation unit 213 (step S213a). In this embodiment, the time-frequency masks γ_{t,f}^{(i)} are input as the auxiliary information s, but this does not limit the present invention.
  • The acoustic signal x_{t,f} is input to the spatio-temporal covariance estimation unit 211 and the reverberation suppression filter application unit 221 (step S221a).
  • The spatio-temporal covariance estimation unit 211 initializes γ_{t,f}^{(i)} for all i ∈ {1, ..., I}, t ∈ {1, ..., N}, f ∈ {1, ..., F}.
  • The spatio-temporal covariance estimation unit 211 also initializes the powers λ_{t,f}^{(i)} of the target sound as follows, where the notation appearing in the displayed formula represents ·^H·. Further, α ← β means that β is substituted into α; in other words, α ← β means that α is updated to β (step S211a).
  • The beamformer estimation unit 213 initializes q_f^{(i)} for all i ∈ {1, ..., I}, f ∈ {1, ..., F}. For example, the beamformer estimation unit 213 sets q_f^{(i)} to the i-th column of I_M (step S213b).
  • The reverberation suppression filter application unit 221 initializes z_{t,f} for all t ∈ {1, ..., N} and f ∈ {1, ..., F}. For example, the reverberation suppression filter application unit 221 sets z_{t,f} ← x_{t,f} (step S221b).
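  • The initializations of steps S211a, S213b, and S221b can be sketched together as follows. The displayed formula initializing λ_{t,f}^{(i)} is not reproduced in this text, so the mask-weighted microphone power used below is an assumed, plausible choice; the names are illustrative.

```python
import numpy as np

def initialize(X, masks):
    """Sketch of the initializations of steps S211a, S213b, and S221b.

    X: (N, M, F) acoustic signal; masks: (N, I, F) masks gamma_{t,f}^{(i)}.
    The power initialization (mask times average microphone power) is an
    assumed form; the patent's displayed formula is not reproduced here.
    """
    N, M, F = X.shape
    I = masks.shape[1]
    # lambda_{t,f}^{(i)} <- gamma_{t,f}^{(i)} * x^H x / M  (assumed)
    power = masks * (np.abs(X) ** 2).sum(axis=1, keepdims=True) / M
    # q_f^{(i)} <- i-th column of the identity matrix I_M (step S213b)
    Q = np.broadcast_to(np.eye(M, dtype=complex)[:, :I], (F, M, I)).copy()
    # z_{t,f} <- x_{t,f} (step S221b)
    Z = X.copy()
    return power, Q, Z
```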
  • The spatio-temporal covariance estimation unit 211 calculates and outputs the power-weighted spatio-temporal covariance matrix R̄_{x,f}^{(i)} of the target sound for all i ∈ {1, ..., I}, f ∈ {1, ..., F} (step S211b).
  • Likewise, the spatio-temporal covariance estimation unit 211 calculates and outputs the power-weighted spatio-temporal covariance matrix P_{x,f}^{(i)} of the target sound for all i ∈ {1, ..., I}, f ∈ {1, ..., F}.
  • If the processed signals y_{t,f} have never been obtained, the λ_{t,f}^{(i)} obtained in step S211a are used for this calculation.
  • If the processed signals y_{t,f} have already been obtained, the λ_{t,f}^{(i)} obtained in step S211d are used for this calculation (step S211c).
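  • The displayed formulas for R̄_{x,f}^{(i)} and P_{x,f}^{(i)} are not reproduced in this text. The following sketch assumes WPE-style power-weighted spatio-temporal covariance matrices, one choice consistent with the weighted power-minimization criterion; the stacked-frame layout and the names are assumptions.

```python
import numpy as np

def stacked_past(X, delta, L):
    """Stack the delayed frames x_{t-delta}, ..., x_{t-(L-1)} of one
    frequency bin into one vector per frame.
    X: (N, M). Returns Xbar: (N, M * (L - delta))."""
    N, M = X.shape
    K = L - delta
    Xbar = np.zeros((N, M * K), dtype=complex)
    for k, tau in enumerate(range(delta, L)):
        Xbar[tau:, k * M:(k + 1) * M] = X[:N - tau]
    return Xbar

def power_weighted_covariances(X, Xbar, power, eps=1e-8):
    """Assumed WPE-style forms for one target sound at one frequency bin:
      Rbar = sum_t xbar_t xbar_t^H / lambda_t   (cf. step S211b)
      P    = sum_t xbar_t x_t^H    / lambda_t   (cf. step S211c)
    X: (N, M), Xbar: (N, MK), power: (N,) powers lambda_{t,f}^{(i)}."""
    w = 1.0 / np.maximum(power, eps)             # avoid division by zero
    Rbar = (Xbar * w[:, None]).T @ Xbar.conj()   # (MK, MK)
    P = (Xbar * w[:, None]).T @ X.conj()         # (MK, M)
    return Rbar, P
```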
  • The power-weighted spatio-temporal covariance matrices R̄_{x,f}^{(i)} and P_{x,f}^{(i)} of the target sound and the information q_f^{(i)} representing the beamformer obtained by the beamformer estimation unit 213 are input to the reverberation suppression filter estimation unit 212.
  • The reverberation suppression filter estimation unit 212 receives these and estimates the reverberation suppression filter (equation (9′)) based on the above-mentioned optimization criterion.
  • For example, the reverberation suppression filter estimation unit 212 first calculates an intermediate matrix (step S212a), and then a further intermediate matrix, where (·)* represents the complex conjugate of (·) (step S212b).
  • Further, the reverberation suppression filter estimation unit 212 calculates and outputs ĝ_f, the information specifying the reverberation suppression filter, where (·)^+ is the Moore-Penrose pseudo-inverse of (·) (step S212c).
  • The ĝ_f obtained by the reverberation suppression filter estimation unit 212 is input to the reverberation suppression filter application unit 221.
  • The reverberation suppression filter application unit 221 applies the reverberation suppression filter estimated by the reverberation suppression filter estimation unit 212 to the acoustic signal x_{t,f} as follows to obtain and output the reverberation-suppressed signal z_{t,f}.
  • The reverberation-suppressed signal z_{t,f} is sent to the beamformer estimation unit 213 and the beamformer application unit 222 (step S221c).
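  • Continuing the sketch above under the same assumptions, steps S212c and S221c can be illustrated as follows: the filter is obtained from the covariance matrices with the Moore-Penrose pseudo-inverse, and the predicted late reverberation is subtracted from the observation. This is a sketch of one plausible realization, not the patent's exact displayed formulas.

```python
import numpy as np

def dereverberation_filter(Rbar, P):
    """Information specifying the reverberation suppression filter,
    using the Moore-Penrose pseudo-inverse as in step S212c.
    Rbar: (MK, MK), P: (MK, M) from the covariance sketch above."""
    return np.linalg.pinv(Rbar) @ P              # G: (MK, M)

def apply_dereverberation(X, Xbar, G):
    """Assumed form of step S221c: subtract the predicted late
    reverberation, z_{t,f} = x_{t,f} - G^H xbar_{t,f}.
    X: (N, M) observation, Xbar: (N, MK) stacked past frames."""
    return X - Xbar @ G.conj()                   # Z: (N, M)
```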
  • The reverberation-suppressed signals z_{t,f} and the calculated γ_{t,f}^{(i)} are input to the beamformer estimation unit 213.
  • The beamformer estimation unit 213 receives these and estimates the beamformer based on the above-mentioned optimization criterion.
  • First, the beamformer estimation unit 213 obtains the RTF ṽ_f^{(i)} based on z_{t,f} and γ_{t,f}^{(i)}.
  • For example, the steering vector estimation unit 2131 of the beamformer estimation unit 213 estimates and outputs the steering vector v_f^{(i)} based on z_{t,f} and γ_{t,f}^{(i)}.
  • For example, the steering vector v_f^{(i)} is estimated as follows.
  • Further, the RTF estimation unit 2132 of the beamformer estimation unit 213 obtains ṽ_f^{(i)} from v_f^{(i)}.
  • For example, the RTF estimation unit 2132 obtains ṽ_f^{(i)} according to equation (3) (step S213c). Further, the beamformer estimation unit 213 performs the calculation shown below for all i ∈ {1, ..., I} and f ∈ {1, ..., F}. If the processed signals y_{t,f} have never been obtained, the λ_{t,f}^{(i)} obtained in step S211a are used for this calculation; if the processed signals y_{t,f} have already been obtained, the λ_{t,f}^{(i)} obtained in step S211d are used (step S213d). Further, the beamformer estimation unit 213 calculates and outputs the information q_f^{(i)} specifying the beamformer for all i ∈ {1, ..., I} and f ∈ {1, ..., F} (step S213e).
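  • A minimal sketch of one plausible realization of steps S213c to S213e is given below: the steering vector as the principal eigenvector of the mask-weighted spatial covariance of z_{t,f}, the RTF by the normalization of equation (3), and an MPDR-style distortionless weight. Since the displayed expressions are omitted in this text, these concrete formulas are standard choices assumed for illustration.

```python
import numpy as np

def estimate_beamformer(Z, gamma, power, eps=1e-8):
    """Sketch of steps S213c-S213e at one frequency bin.

    Z: (N, M) dereverberated signal z_{t,f}; gamma: (N,) mask of source
    i; power: (N,) powers lambda_{t,f}^{(i)}.  All formulas below are
    standard assumed choices, not reproduced from the patent.
    """
    # steering vector v_f^{(i)}: principal eigenvector of the
    # mask-weighted spatial covariance of z (step S213c, assumed)
    Phi = (Z * gamma[:, None]).T @ Z.conj() / max(gamma.sum(), eps)
    _, vecs = np.linalg.eigh(Phi)
    v = vecs[:, -1]
    v_tilde = v / v[0]            # RTF of equation (3); assumes v[0] != 0
    # power-weighted spatial covariance of z (step S213d, assumed)
    w = 1.0 / np.maximum(power, eps)
    R = (Z * w[:, None]).T @ Z.conj() / len(Z)
    # MPDR-style distortionless weight q_f^{(i)} (step S213e, assumed)
    q = np.linalg.solve(R, v_tilde)
    q = q / (v_tilde.conj() @ q)  # q^H v_tilde = 1 (no distortion)
    return q, v_tilde
```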
  • The reverberation-suppressed signals z_{t,f} and the information q_f^{(i)} specifying the beamformer are input to the beamformer application unit 222, which applies the beamformer to z_{t,f} to obtain the processed signals y_{t,f}^{(i)} (step S222a).
  • Also in the present embodiment, the convolutional beamformer that performs reverberation suppression, diffuse noise suppression, and target sound source separation is estimated based on the optimization criterion that the signal obtained by applying it to the acoustic signal follows the probabilistic model, using the auxiliary information s representing the information of the target sound. As a result, the convolutional beamformer can be optimized as a whole, and more effective speech enhancement can be realized. Further, in the present embodiment, the convolutional beamformer is divided into a reverberation suppression filter and a beamformer, and the beamformer is estimated using the reverberation-suppressed signal obtained in the course of the estimation, so that more effective speech enhancement can be realized.
  • Most of the operations required for estimating the reverberation suppression filter are the computations of the power-weighted spatio-temporal covariance matrices R̄_{x,f}^{(i)} and P_{x,f}^{(i)} of the target sound.
  • The sizes of the power-weighted spatio-temporal covariance matrices R̄_{x,f}^{(i)} and P_{x,f}^{(i)} of the target sound obtained in steps S211b and S211c are smaller than the sizes of the matrices obtained in steps S212a and S212b. Therefore, in the present embodiment, the amount of calculation required for estimating the reverberation suppression filter can be significantly reduced, and speech enhancement can be realized at a low computational cost.
  • In the second embodiment, the process in which the reverberation suppression filter estimation unit 212 fixes the beamformer and estimates the reverberation suppression filter (equation (9′)), and the process in which the beamformer estimation unit 213 fixes the reverberation suppression filter and estimates the beamformer (equation (10′)), are repeated.
  • Here, the I-dimensional processed signals y_{t,f} are more compressed than the M-dimensional acoustic signals x_{t,f}, and information is lost. Owing to this loss of information, the reverberation suppression filter or the beamformer may become a quasi-optimal solution instead of the optimal solution.
  • In this modification, therefore, in addition to the power-weighted covariance matrices of the target sounds y_{t,f}^{(1)}, ..., y_{t,f}^{(I)} corresponding to i = 1, ..., I, a power-weighted covariance matrix of the non-target sound is also used.
  • The signal processing device 2′ of this modification has a spatio-temporal covariance estimation unit 211′, a reverberation suppression filter estimation unit 212′, a beamformer estimation unit 213′, a reverberation suppression filter application unit 221, a beamformer application unit 222′, and a control unit 13, and executes each process under the control of the control unit 13.
  • The spatio-temporal covariance estimation unit 211′, the reverberation suppression filter estimation unit 212′, and the beamformer estimation unit 213′ constitute the convolutional beamformer estimation unit.
  • The reverberation suppression filter application unit 221 and the beamformer application unit 222′ constitute the convolutional beamformer application unit.
  • Instead of the spatio-temporal covariance estimation unit 211, the spatio-temporal covariance estimation unit 211′ generates, in addition to the power-weighted spatio-temporal covariance matrices R̄_{x,f}^{(i)} and P_{x,f}^{(i)} of the target sound, power-weighted spatio-temporal covariance matrices of the non-target sound.
  • Instead of the reverberation suppression filter estimation unit 212, the reverberation suppression filter estimation unit 212′ receives the power-weighted spatio-temporal covariance matrices R̄_{x,f}^{(i)} and P_{x,f}^{(i)} of the target sound for estimating the target sound,
  • as well as the power-weighted spatio-temporal covariance matrices of the non-target sound for estimating the non-target sound, and estimates the reverberation suppression filter based on the above-mentioned optimization criterion.
  • Instead of the beamformer estimation unit 213, the beamformer estimation unit 213′ generates, in addition to the information q_f^{(i)} representing the beamformers corresponding to 1 ≤ i ≤ I for estimating the target sounds, information q_f^{(i)} representing the beamformers corresponding to I < i ≤ M for estimating the non-target sound. Instead of the beamformer application unit 222, the beamformer application unit 222′ generates, in addition to the estimates y_{t,f}^{(1)}, ..., y_{t,f}^{(I)} of the target sounds, an estimate y_{t,f}^⊥ of the non-target sound. The rest is the same as in the second embodiment.
  • Instead of the signal processing device 2, the signal processing device 2′ executes the processes of steps S213a, S221a, S211a, S213b, S221b, S211b, S211c, S212a, and S212b shown in the figure.
  • However, the processing of the spatio-temporal covariance estimation unit 211 is executed by the spatio-temporal covariance estimation unit 211′ instead.
  • In addition to the powers λ_{t,f}^{(1)}, ..., λ_{t,f}^{(I)} of the target sounds, the spatio-temporal covariance estimation unit 211′ also initializes the power λ_{t,f}^⊥ of the non-target sound in the same way as the powers of the target sounds.
  • The processing of the beamformer estimation unit 213 is executed by the beamformer estimation unit 213′ instead.
  • If the spatio-temporal covariance estimation unit 211′ has not yet obtained y_{t,f}^{(i)}, it uses the λ_{t,f}^{(1)}, ..., λ_{t,f}^{(I)} and λ_{t,f}^⊥ obtained at initialization. On the other hand, if y_{t,f}^{(i)} has been obtained, y_{t,f}^{(1)}, ..., y_{t,f}^{(I)} and y_{t,f}^⊥ are input to the spatio-temporal covariance estimation unit 211′,
  • so that λ_{t,f}^{(1)}, ..., λ_{t,f}^{(I)} and λ_{t,f}^⊥ can be obtained by step S211d.
  • The spatio-temporal covariance estimation unit 211′ uses x_{t,f} and λ_{t,f}^⊥ to calculate and output the power-weighted spatio-temporal covariance matrix R̄_{x,f}^⊥ of the non-target sound (step S211b′), and further uses x_{t,f} and λ_{t,f}^⊥ to calculate and output the power-weighted spatio-temporal covariance matrix P_{x,f}^⊥ of the non-target sound (step S211c′).
  • The reverberation suppression filter estimation unit 212′ receives R̄_{x,f}^⊥ and q_f^{(i)} and performs the calculation of step S212a′; furthermore, it receives P_{x,f}^⊥ and q_f^{(i)} and performs the calculation of step S212b′.
  • As in the second embodiment, the signal processing device 2′ executes the processes of steps S212c, S221c, S213c, S213d, S213e, S222a, S13, S211d, and S222b shown in the figure.
  • However, the processing of the reverberation suppression filter estimation unit 212 described in the second embodiment is executed by the reverberation suppression filter estimation unit 212′ instead.
  • The processing of the beamformer estimation unit 213 is executed by the beamformer estimation unit 213′ instead.
  • The processing of the beamformer application unit 222 is executed by the beamformer application unit 222′ instead.
  • The beamformer estimation unit 213′ estimates q_f^{(i)} for i ∈ {1, ..., I}, f ∈ {1, ..., F}, and in addition also generates q_f^{(i)} for i ∈ {I+1, ..., M}, f ∈ {1, ..., F}.
  • For example, with respect to the q_f^{(i)} for i ∈ {1, ..., I}, the q_f^{(i)} for i ∈ {I+1, ..., M} are generated as vectors spanning the complementary space.
  • For example, an orthonormal basis of the complementary space may be adopted, or another set of spanning vectors may be used.
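  • For instance, an orthonormal basis of the orthogonal complement can be obtained from the SVD, as in the following sketch; it assumes the q_f^{(i)} for i ∈ {1, ..., I} are linearly independent, and the names are illustrative.

```python
import numpy as np

def complete_with_complement(Q):
    """Q = [q_f^{(1)}, ..., q_f^{(I)}] as an (M, I) matrix with linearly
    independent columns.  Appends M - I orthonormal vectors spanning the
    orthogonal complement of the column space, obtained from the SVD."""
    M, I = Q.shape
    U, _, _ = np.linalg.svd(Q, full_matrices=True)
    Q_perp = U[:, I:]                            # spans the complement
    return np.concatenate([Q, Q_perp], axis=1)   # (M, M)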
  • The signal processing device 2″ of this modification has a spatio-temporal covariance estimation unit 211, a reverberation suppression filter estimation unit 212″, a beamformer estimation unit 213, a reverberation suppression filter application unit 221″, a beamformer application unit 222, and a control unit 13, and executes each process under the control of the control unit 13.
  • The spatio-temporal covariance estimation unit 211, the reverberation suppression filter estimation unit 212″, and the beamformer estimation unit 213 constitute the convolutional beamformer estimation unit. The reverberation suppression filter application unit 221″ and the beamformer application unit 222 constitute the convolutional beamformer application unit.
  • Instead of the reverberation suppression filter estimation unit 212, the reverberation suppression filter estimation unit 212″ receives R̄_{x,f}^{(i)} and P_{x,f}^{(i)}, and calculates and outputs the information Ḡ_f^{(i)} corresponding to the dereverberation filter (step S212c″).
  • The reverberation suppression filter application unit 221″ receives the information Ḡ_f^{(i)} corresponding to the dereverberation filter and the acoustic signal x_{t,f}, applies the dereverberation filter to the acoustic signal x_{t,f} as follows,
  • and obtains and outputs the reverberation-suppressed signal z_{t,f} (step S221c″).
  • As in the second embodiment, the signal processing device 2″ executes the processes of steps S213c, S213d, S213e, S222a, S13, S211d, and S222b shown in the figure.
  • In the third embodiment, the auxiliary information includes information specifying the power of the target sound. As a result, the iterative processing can be omitted.
  • The signal processing device 3 of the present embodiment has a spatio-temporal covariance estimation unit 311, a reverberation suppression filter estimation unit 212, a beamformer estimation unit 313, a reverberation suppression filter application unit 221, a beamformer application unit 222, and a control unit 13, and executes each process under the control of the control unit 13.
  • The spatio-temporal covariance estimation unit 311, the reverberation suppression filter estimation unit 212, and the beamformer estimation unit 313 constitute the convolutional beamformer estimation unit.
  • The reverberation suppression filter application unit 221 and the beamformer application unit 222 constitute the convolutional beamformer application unit.
  • The auxiliary information s = {s₁, s₂} is input to the signal processing device 3.
  • Instead of the signal processing device 2, the signal processing device 3 executes the processes of steps S221a, S211a, S213b, S221b, S211b, S211c, S212a, S212b, S212c, S221c, S213c, S213d, S213e, S222a, and S222b.
  • However, the processing of the spatio-temporal covariance estimation unit 211 and the processing of the beamformer estimation unit 213 described in the second embodiment are executed by the spatio-temporal covariance estimation unit 311 and the beamformer estimation unit 313, respectively.
  • Thereby, the processed signals y_{t,f}^{(i)}, obtained by suppressing the reverberation and the diffuse noise and separating the target sound sources from the acoustic signals x_{t,f}, are obtained without iterative processing.
  • [Modification 1 of the third embodiment] As in modification 1 of the second embodiment, also in the third embodiment the power-weighted spatio-temporal covariance matrices of the non-target sound may be calculated in addition to the power-weighted spatio-temporal covariance matrices R̄_{x,f}^{(i)} and P_{x,f}^{(i)} of the target sound, and used to estimate the reverberation suppression filter.
  • The signal processing device 3′ of this modification has a spatio-temporal covariance estimation unit 311′, a reverberation suppression filter estimation unit 212′, a beamformer estimation unit 313, a reverberation suppression filter application unit 221, a beamformer application unit 222, and a control unit 13, and executes each process under the control of the control unit 13.
  • The spatio-temporal covariance estimation unit 311′, the reverberation suppression filter estimation unit 212′, and the beamformer estimation unit 313 constitute the convolutional beamformer estimation unit.
  • The reverberation suppression filter application unit 221 and the beamformer application unit 222 constitute the convolutional beamformer application unit.
  • As described in the third embodiment, the signal processing device 3′ executes the processes of steps S313a, S221a, S211a, S213b, S221b, S211b, S211c, S212a, and S212b shown in FIG. 10.
  • Instead of the signal processing device 2′, the signal processing device 3′ also executes the processes of steps S211b′, S211c′, S212a′, and S212b′ shown in the figure.
  • Here, the spatio-temporal covariance estimation unit 311′ executes the processing of the spatio-temporal covariance estimation unit 211′ described in modification 1 of the second embodiment.
  • However, the powers λ_{t,f}^{(i)} included in the auxiliary information s₂ are used for the calculation of steps S211b′ and S211c′.
  • Further, as described in the third embodiment, the signal processing device 3′ executes the processes of steps S212c, S221c, S213c, S213d, S213e, S222a, and S222b shown in FIG. 11.
  • The signal processing device 3″ of this modification has a spatio-temporal covariance estimation unit 311″, a reverberation suppression filter estimation unit 212″, a beamformer estimation unit 313, a reverberation suppression filter application unit 221″, a beamformer application unit 222, and a control unit 13, and executes each process under the control of the control unit 13.
  • The spatio-temporal covariance estimation unit 311″, the reverberation suppression filter estimation unit 212″, and the beamformer estimation unit 313 constitute the convolutional beamformer estimation unit.
  • The reverberation suppression filter application unit 221″ and the beamformer application unit 222 constitute the convolutional beamformer application unit.
  • As described in the third embodiment, the signal processing device 3″ executes steps S313a, S221a, S211b, and S211c shown in FIGS. 12 and 13.
  • Further, as in modification 2 of the second embodiment, the reverberation suppression filter estimation unit 212″ and the reverberation suppression filter application unit 221″ of the signal processing device 3″ execute steps S212c″ and S221c″.
  • In addition, as described in the third embodiment, the signal processing device 3″ executes steps S213c, S213d, S213e, S222a, and S222b.
  • However, the powers λ_{t,f}^{(i)} included in the auxiliary information s₂ are used for the calculation of steps S211b, S211c, and S213d.
  • As described above, the sizes of the power-weighted spatio-temporal covariance matrices R̄_{x,f}^{(i)} and P_{x,f}^{(i)} of the target sound obtained in steps S211b and S211c are smaller than the sizes of the matrices obtained in steps S212a and S212b, so speech enhancement can be realized at a low computational cost in each of the above embodiments and modifications. Although this effect is then not obtained, alternative calculations may instead be executed in steps S212a and S212b, provided that the accompanying relation is satisfied.
  • FIG. 15 shows the word error rates when the processed signals obtained in the fourth embodiment and in modifications 1 and 2 of the second embodiment are recognized by speech recognition, for the two configurations Config-1 and Config-2.
  • The horizontal axis of FIG. 15 represents the number of iterations (#iterations), and the vertical axis represents the word error rate (WER, %).
  • The signal processing devices 1, 2, 2′, 2″, 3, 3′, and 3″ in each embodiment are devices configured by a general-purpose or dedicated computer that includes a processor (hardware processor) such as a CPU (central processing unit) and memories such as a RAM (random-access memory) and a ROM (read-only memory) executing a predetermined program.
  • This computer may have one processor and memory, or may have a plurality of processors and memory.
  • This program may be installed in a computer or may be recorded in a ROM or the like in advance.
  • A part or all of each processing unit may be configured by using an electronic circuit that realizes the processing functions by itself, instead of an electronic circuit (circuitry) such as a CPU that realizes the functional configuration by reading a program.
  • An electronic circuit constituting one device may include a plurality of CPUs.
  • FIG. 14 is a block diagram illustrating the hardware configuration of the signal processing devices 1, 2, 2′, 2″, 3, 3′, and 3″ in each embodiment.
  • As illustrated in FIG. 14, the signal processing devices 1, 2, 2′, 2″, 3, 3′, and 3″ in this example have a CPU (Central Processing Unit) 10a, an input unit 10b, an output unit 10c, a RAM (Random Access Memory) 10d, a ROM (Read Only Memory) 10e, an auxiliary storage device 10f, and a bus 10g.
  • The CPU 10a of this example has a control unit 10aa, a calculation unit 10ab, and registers 10ac, and executes various arithmetic processes according to various programs read into the registers 10ac.
  • The input unit 10b is an input terminal, a keyboard, a mouse, a touch panel, or the like through which data is input.
  • The output unit 10c is an output terminal from which data is output, a display, a LAN card controlled by the CPU 10a that has read a predetermined program, or the like.
  • The RAM 10d is an SRAM (Static Random Access Memory), a DRAM (Dynamic Random Access Memory), or the like, and has a program area 10da in which a predetermined program is stored and a data area 10db in which various data are stored.
  • The auxiliary storage device 10f is, for example, a hard disk, an MO (Magneto-Optical disc), or a semiconductor memory, and has a program area 10fa for storing a predetermined program and a data area 10fb for storing various data.
  • The bus 10g connects the CPU 10a, the input unit 10b, the output unit 10c, the RAM 10d, the ROM 10e, and the auxiliary storage device 10f so that information can be exchanged.
  • The CPU 10a writes the program stored in the program area 10fa of the auxiliary storage device 10f to the program area 10da of the RAM 10d according to the read OS (Operating System) program.
  • Similarly, the CPU 10a writes the various data stored in the data area 10fb of the auxiliary storage device 10f to the data area 10db of the RAM 10d. The addresses on the RAM 10d at which the program and data are written are stored in the registers 10ac of the CPU 10a.
  • The control unit 10aa of the CPU 10a sequentially reads these addresses stored in the registers 10ac, reads the program or data from the area on the RAM 10d indicated by each read address, causes the calculation unit 10ab to sequentially execute the operations indicated by the program, and stores the calculation results in the registers 10ac.
  • The above program can be recorded on a computer-readable recording medium.
  • A computer-readable recording medium is, for example, a non-transitory recording medium. Examples of such a recording medium are a magnetic recording device, an optical disc, a magneto-optical recording medium, and a semiconductor memory.
  • The distribution of this program is carried out, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded.
  • The program may also be stored in the storage device of a server computer and distributed by transferring it from the server computer to another computer via a network.
  • A computer that executes such a program first stores, for example, the program recorded on the portable recording medium or transferred from the server computer temporarily in its own storage device. When executing the processing, the computer reads the program stored in its own storage device and executes processing according to the read program.
  • Alternatively, the computer may read the program directly from the portable recording medium and execute processing according to the program; furthermore, the computer may sequentially execute processing according to the received program each time the program is transferred to it from the server computer.
  • The above processing may also be executed by a so-called ASP (Application Service Provider) type service, which realizes the processing functions only by execution instructions and result acquisition, without transferring the program from the server computer to the computer.
  • The program in this embodiment includes information that is used for processing by a computer and is equivalent to a program (data that is not a direct command to the computer but has the property of defining the processing of the computer, and the like).
  • In the above embodiments, the present device is configured by executing a predetermined program on a computer, but at least a part of the processing contents may be realized by hardware.
  • In the above embodiments, the acoustic signal x_{t,f} was obtained by frequency-dividing the signal obtained by observing the mixture of the diffuse noise and the microphone image signals based on the source signals emitted from the sound sources.
  • However, the acoustic signal x_{t,f} may instead be obtained by performing some signal processing (filtering or the like) on the signal obtained by frequency-dividing the observed mixture signal.
  • Alternatively, the acoustic signal x_{t,f} may be obtained by frequency-dividing a signal obtained by performing some signal processing on the observed mixture signal.
  • Further, the acoustic signal x_{t,f} may be obtained by performing some signal processing on the observed mixture signal, frequency-dividing the result, and then performing further signal processing.
  • The various processes described above are not only executed in time series according to the description, but may also be executed in parallel or individually according to the processing capability of the device that executes the processes, or as needed.
  • In the second embodiment, the steering vector estimation unit 2131 estimated the steering vector v_f^{(i)} based on z_{t,f} and γ_{t,f}^{(i)}.
  • However, the auxiliary information s may include the steering vector v_f^{(i)} itself. In this case, the steering vector estimation unit 2131 can be omitted, and the RTF estimation unit 2132 of the beamformer estimation unit 213 may obtain ṽ_f^{(i)} from the v_f^{(i)} included in the auxiliary information s.
  • Even when the auxiliary information s includes neither the time-frequency mask γ_{t,f}^{(i)} nor the steering vector v_f^{(i)},
  • if the auxiliary information s includes a reference sound of the target sound,
  • the steering vector v_f^{(i)} can be estimated from the reverberation-suppressed signals z_{t,f} and the auxiliary information s. That is, as illustrated in FIG. 6, first, the time-frequency mask estimation unit 2130 of the beamformer estimation unit 213 receives the reverberation-suppressed signals z_{t,f} and the auxiliary information s (the reference sound of the target sound), and estimates the time-frequency mask by, for example, the method described in Reference 3.
  • In steps S212a′ and S212b′, the power-weighted spatio-temporal covariance matrices R̄_{x,f}^{(i)} and P_{x,f}^{(i)} of the target sound may be used instead of the power-weighted spatio-temporal covariance matrices R̄_{x,f}^⊥ and P_{x,f}^⊥ of the non-target sound.
  • In this case, steps S211b′ and S211c′ can be omitted.
  • 11 Convolutional beamformer estimation unit; 12 Convolutional beamformer application unit; 211, 211′, 311, 311′ Spatio-temporal covariance estimation unit; 212, 212′, 212″ Reverberation suppression filter estimation unit; 213, 213′, 313 Beamformer estimation unit; 221, 221″ Reverberation suppression filter application unit; 222, 222′ Beamformer application unit


Abstract

The present invention involves: receiving frequency-divided time-sequential acoustic signals and auxiliary information representing target sound information; estimating a convolutional beamformer on the basis of the optimization criterion that, when a convolutional beamformer performing dereverberation, diffuse noise suppression, and target sound source separation is applied to an acoustic signal, the obtained signal is defined according to a probabilistic model; applying the estimated convolutional beamformer to the acoustic signals to obtain a processed signal; and outputting the processed signal.

Description

Signal processing device, signal processing method, and program
The present invention relates to a technique for extracting a target sound from an acoustic signal by suppressing sounds other than the target sound, other noise, and reverberation.
Non-Patent Document 1 discloses a method of suppressing noise and reverberation from an acoustic signal in the frequency domain. In this method, the acoustic signal is first received, a reverberation suppression filter that suppresses the reverberation of the target sound is estimated based on a weighted prediction-error power minimization criterion, and the filter is applied to the acoustic signal to remove the reverberation. After that, a steering vector representing the direction of the target sound (or its estimate) is received, and a minimum-power distortionless response (MPDR) beamformer that minimizes the power of the acoustic signal, under the constraint that the sound arriving at the microphones from the source position of the target sound is not distorted, is estimated and applied to the dereverberated acoustic signal to further suppress noise.
Non-Patent Document 2 discloses a method that, under the assumption that the acoustic signal contains no diffuse noise and consists only of a plurality of point sound sources, simultaneously achieves optimal reverberation suppression and sound source separation by alternately updating the coefficients of a reverberation suppression block and a sound source separation block so as to minimize a given objective function.
However, in the method of Non-Patent Document 1, the reverberation suppression filter and the minimum-power distortionless response beamformer are optimized independently, so optimal processing cannot be achieved as a whole. In addition, the method of Non-Patent Document 2 cannot suppress diffuse noise from the acoustic signal.
The present invention has been made in view of these points, and its object is to perform reverberation suppression, diffuse noise suppression, and target sound source separation that are optimized as a whole.
Frequency-divided time-series acoustic signals and auxiliary information representing information on the target sound are received; a convolutional beamformer that performs reverberation suppression, diffuse noise suppression, and target sound source separation is estimated based on the optimization criterion that the signal obtained by applying the convolutional beamformer to the acoustic signal follows a probabilistic model; and the estimated convolutional beamformer is applied to the acoustic signal to obtain and output a processed signal.
In the present invention, by using the auxiliary information representing the information of the target sound, the convolutional beamformer that performs reverberation suppression, diffuse noise suppression, and target sound source separation can be optimized as a whole. Therefore, reverberation suppression, diffuse noise suppression, and target sound source separation that are optimized as a whole can be performed.
FIG. 1 is a block diagram illustrating the functional configuration of the signal processing device of the embodiment.
FIG. 2 is a block diagram illustrating the functional configuration of the signal processing device of the second embodiment and its modified examples.
FIG. 3 is a flow chart for explaining the signal processing method of the second embodiment and its modified examples.
FIG. 4 is a flow chart for explaining the signal processing method of a modified example of the second embodiment.
FIG. 5 is a flow chart for explaining the signal processing method of the second embodiment and its modified examples.
FIG. 6 is a block diagram illustrating the estimation process of an RTF (Relative Transfer Function).
FIG. 7 is a flow chart for explaining the signal processing method of a modified example of the second embodiment.
FIG. 8 is a flow chart for explaining the signal processing method of a modified example of the second embodiment.
FIG. 9 is a block diagram illustrating the functional configuration of the signal processing device of the third embodiment and its modified examples.
FIG. 10 is a flow chart for explaining the signal processing method of the third embodiment and its modified examples.
FIG. 11 is a flow chart for explaining the signal processing method of the third embodiment and its modified examples.
FIG. 12 is a flow chart for explaining the signal processing method of a modified example of the third embodiment.
FIG. 13 is a flow chart for explaining the signal processing method of a modified example of the third embodiment.
FIG. 14 is a block diagram illustrating the hardware configuration of the signal processing device of the embodiment.
FIG. 15 is a graph illustrating the word error rate when the processed signals obtained in the fourth embodiment and modified examples 1 and 2 of the second embodiment are recognized by speech recognition.
Hereinafter, embodiments of the present invention will be described with reference to the drawings.
[First Embodiment]
First, the first embodiment of the present invention will be described. In the first embodiment, a signal processing device receives a frequency-divided time-series acoustic signal and auxiliary information representing the target sound, estimates a convolutional beamformer that performs reverberation suppression, diffuse noise suppression, and target sound source separation based on an optimization criterion under which the signal obtained by applying the beamformer to the acoustic signal follows a probabilistic model, and applies the estimated convolutional beamformer to the acoustic signal to obtain and output a processed signal.
<Functional configuration>
As illustrated in FIG. 1, the signal processing device 1 of this embodiment has a convolutional beamformer estimation unit 11, a convolutional beamformer application unit 12, and a control unit 13, and executes each process under the control of the control unit 13.
<Processing>
Assume a situation in which source signals emitted from I sound sources are observed by M microphones in an environment containing reverberation and diffuse noise, where I and M are integers of 1 or more satisfying M ≥ I. The signal obtained by observing, with the microphones, the mixture of the signals based on the source signals emitted from the I sound sources (direct sound, early reflections, late reverberation) and diffuse noise (additive diffuse noise) is frequency-divided by a well-known method such as the short-time Fourier transform, yielding the acoustic signal x_{t,f} at each time-frequency point (the acoustic signal x_{t,f} in the time-frequency domain). The acoustic signal x_{t,f} is modeled as follows:
[Math. 1]
  x_{t,f} = Σ_{i=1}^{I} x_{t,f}^{(i)} + n_{t,f}    (1)
[Math. 2]
  x_{t,f}^{(i)} = d_{t,f}^{(i)} + r_{t,f}^{(i)}    (2)
Here, t ∈ {1,…,N} is the time index corresponding to a time interval (frame), and f ∈ {1,…,F} is the frequency index corresponding to a frequency band (frequency bin). N and F are positive integers; for example, N is an integer of 2 or more. Hereinafter, the time interval corresponding to time index t is called "time interval t", and the frequency band corresponding to frequency index f is called "frequency band f". i is the index of the sound source of each target sound, i ∈ {1,…,I}, and the source signal emitted from sound source i is called "source signal i". The acoustic signal x_{t,f} = [x_{1,t,f},…,x_{M,t,f}]^T ∈ C^{M×1} is the M-dimensional column vector whose elements are the signals x_{1,t,f},…,x_{M,t,f} in frequency band f obtained by frequency-dividing, in each time interval t, all the signals observed by the M microphones. C denotes the set of complex numbers, and (·)^T denotes the non-conjugate transpose. The microphone image signal x_{t,f}^{(i)} = [x_{1,t,f}^{(i)},…,x_{M,t,f}^{(i)}]^T ∈ C^{M×1} is the M-dimensional column vector whose elements are the components x_{1,t,f}^{(i)},…,x_{M,t,f}^{(i)} of x_{1,t,f},…,x_{M,t,f} consisting of the direct sound, early reflections, and late reverberation corresponding to source signal i; these components contain no diffuse-noise component. The diffuse noise n_{t,f} = [n_{1,t,f},…,n_{M,t,f}]^T ∈ C^{M×1} is the M-dimensional column vector whose elements are the diffuse-noise components of x_{1,t,f},…,x_{M,t,f}. The microphone image signal x_{t,f}^{(i)} of equation (1) is further divided into two elements as in equation (2): the target sound d_{t,f}^{(i)} represents the components of x_{t,f}^{(i)} corresponding to the direct sound and early reflections, and the late reverberation r_{t,f}^{(i)} represents the component of x_{t,f}^{(i)} corresponding to the late reverberation. Note that the superscript β of symbols written in the form "χ_α^β", such as x_{t,f}^{(i)}, d_{t,f}^{(i)}, and r_{t,f}^{(i)}, should properly be written directly above α (see equation (2)), but may be written at the upper right of α due to notational constraints. In the reverberation suppression, diffuse noise suppression, and target sound source separation of this embodiment, the diffuse noise n_{t,f} and the late reverberation r_{t,f}^{(i)} corresponding to each sound source i are suppressed from the acoustic signal x_{t,f} of equation (1), and the target sound d_{t,f}^{(i)} corresponding to each sound source i is separated and extracted.
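For illustration, the frequency-division step can be sketched as follows. This is a minimal NumPy/SciPy example, not part of the patent text; the sampling rate and window length are placeholder choices rather than values prescribed by the embodiment.

```python
import numpy as np
from scipy.signal import stft

def frequency_divide(waveforms, fs=16000, nperseg=512):
    """waveforms: (M, num_samples) array of microphone signals.
    Returns x with shape (F, N, M): x[f, t] is the M-dimensional vector x_{t,f}."""
    # scipy.signal.stft operates on the last axis and returns an (M, F, N) array
    _, _, X = stft(waveforms, fs=fs, nperseg=nperseg)
    return np.transpose(X, (1, 2, 0))  # reorder to (F, N, M)

# Example with M = 2 microphones and white-noise stand-in signals
x_time = np.random.randn(2, 16000)
x = frequency_divide(x_time)
print(x.shape)  # (F, N, 2)
```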
The processing of this embodiment will be described with reference to FIG. 1.
The frequency-divided time-series acoustic signals x_{t,f} are input to the signal processing device 1 for all t ∈ {1,…,N} and f ∈ {1,…,F}. As described above, the acoustic signal x_{t,f} exemplified in this embodiment is obtained by frequency-dividing the signal observed as the mixture of diffuse noise and the microphone image signals based on the source signals emitted from the sound sources. In addition, auxiliary information s representing the target sound is input to the signal processing device 1.
An example of the auxiliary information s is information for specifying or estimating the RTF (Relative Transfer Function) v~_f^{(i)} (see, e.g., Reference 1).
Reference 1: I. Cohen, "Relative transfer function identification using speech signals," IEEE Trans. on Speech and Audio Processing, vol. 12, no. 5, pp. 451-459, 2004.
The RTF v~_f^{(i)} is obtained by normalizing each element of the M-dimensional steering vector v_f^{(i)} = [v_{1,f}^{(i)},…,v_{M,f}^{(i)}]^T, which corresponds to the space from the sound source i of the target sound to the M microphones, with respect to one of its elements. Equation (3) shows an example of the RTF v~_f^{(i)}, in which each element of v_f^{(i)} is normalized with respect to the element v_{1,f}^{(i)}; however, this does not limit the present invention.
[Math. 3]
  v~_f^{(i)} = v_f^{(i)} / v_{1,f}^{(i)}    (3)
Note that the superscript α of symbols written in the form "χ^α", such as v~, should properly be written directly above χ (see equation (3)), but may be written at the upper right of χ due to notational constraints. Examples of the information for specifying or estimating the RTF v~_f^{(i)} are a reference sound of the target sound, the time-frequency mask γ_{t,f}^{(i)} of the sound source i of the target sound, the steering vector v_f^{(i)}, and the RTF v~_f^{(i)} itself.
Each time-frequency mask γ_{t,f}^{(i)} represents a value corresponding to the existence probability, or the presence or absence, of source signal i in time interval t and frequency band f. For example, the existence probability of source signal i in time interval t and frequency band f, or a function value thereof, may be used as the time-frequency mask γ_{t,f}^{(i)}; alternatively, γ_{t,f}^{(i)} = 1 may be set when source signal i exists in time interval t and frequency band f, and γ_{t,f}^{(i)} = 0 otherwise. A method of estimating the time-frequency mask γ_{t,f}^{(i)} is described, for example, in Reference 2.
Reference 2: F. Bahmaninezhad, J. Wu, R. Gu, S.-X. Zhang, Y. Xu, M. Yu, and D. Yu, "A comprehensive study of speech separation: spectrogram vs waveform separation," in Interspeech, 2019.
A method of estimating the RTF from a time-frequency mask is described, for example, in Non-Patent Document 1, and a method of estimating a time-frequency mask from a reference sound of the target sound is described, for example, in Reference 3.
Reference 3: K. Zmolikova, M. Delcroix, K. Kinoshita, T. Ochiai, T. Nakatani, L. Burget, and J. Cernocky, "SpeakerBeam: Speaker aware neural network for target speaker extraction in speech mixtures," IEEE Journal of Selected Topics in Signal Processing, vol. 13, no. 4, pp. 800-814, 2019.
The auxiliary information s may further include information specifying the power of the target sound. A method of estimating the power of the target sound is described, for example, in Reference 3B.
Reference 3B: Yong Xu, Jun Du, Li-Rong Dai, and Chin-Hui Lee, "A Regression Approach to Speech Enhancement Based on Deep Neural Networks," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 23, no. 1, pp. 7-19, 2015.
The acoustic signal x_{t,f} is input to the convolutional beamformer estimation unit 11 and the convolutional beamformer application unit 12, and the auxiliary information s is input to the convolutional beamformer estimation unit 11.
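For illustration, two of the auxiliary-information forms mentioned above can be sketched as follows in NumPy: normalizing a steering vector into an RTF as in equation (3), and one simple way of forming a binary time-frequency mask from per-source power estimates. The dominance rule used for the mask is an illustrative assumption, not the method of References 2 and 3.

```python
import numpy as np

def rtf_from_steering(v):
    """RTF per Eq. (3): normalize the steering vector v (shape (M,)) by its first element."""
    return v / v[0]

def binary_mask(power_target, power_interference):
    """gamma_{t,f}^{(i)} = 1 where source i dominates, 0 otherwise (one simple choice)."""
    return (power_target > power_interference).astype(float)
```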
<<Processing of the convolutional beamformer estimation unit 11 (step S11)>>
The convolutional beamformer estimation unit 11 receives the acoustic signal x_{t,f} and the auxiliary information s, and estimates a convolutional beamformer that performs reverberation suppression, diffuse noise suppression, and target sound source separation, based on the optimization criterion that the signal y_{t,f} obtained by applying the convolutional beamformer to the acoustic signal x_{t,f} follows a probabilistic model. Here, y_{t,f} = [y_{t,f}^{(1)},…,y_{t,f}^{(I)}]^T ∈ C^{I×1} is an I-dimensional column vector, and y_{t,f}^{(i)} for i ∈ {1,…,I} is an estimate of the target sound d_{t,f}^{(i)}. The convolutional beamformer is expressed, for example, as
[Math. 4]
  y_{t,f} = Σ_{τ∈{0,Δ,Δ+1,…,L-1}} W_{τ,f}^H x_{t-τ,f}    (4)
where W_{τ,f} ∈ C^{M×I} for τ ∈ {0,Δ,Δ+1,…,L-1} is an M×I matrix whose elements are beamformer coefficients, and (·)^H denotes the conjugate transpose. Δ is a positive integer representing the number of time intervals (frames) corresponding to the length of the early reflections; at least Δ ≥ 1, and an example of Δ is a positive integer representing a time interval of 30 to 50 ms. Through Δ, a convolutional beamformer that suppresses the late reverberation rather than the early reflections is realized.
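A direct implementation of a convolutional beamformer with taps τ ∈ {0, Δ, Δ+1, …, L−1} might look as follows. This is a minimal NumPy sketch for a single frequency bin, written from the text above (the formula image [Math. 4] is not reproduced in this text), and the loop structure is an illustrative choice rather than an efficient implementation.

```python
import numpy as np

def apply_convolutional_beamformer(x_f, W_f):
    """x_f: (N, M) complex array for one frequency bin f.
    W_f: dict mapping tap index tau in {0, delta, ..., L-1} to an (M, I) matrix.
    Returns y_f: (N, I), with y_{t,f} = sum_tau W_{tau,f}^H x_{t-tau,f}."""
    N, _ = x_f.shape
    I = next(iter(W_f.values())).shape[1]
    y = np.zeros((N, I), dtype=complex)
    for tau, W in W_f.items():
        for t in range(N):
            if t - tau >= 0:  # frames before the start of the signal are treated as zero
                y[t] += W.conj().T @ x_f[t - tau]
    return y
```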
Here, the following is assumed to hold:
[Math. 5]
where the vector
[Math. 6]
  w̄_f^{(i)} = [w_{0,f}^{(i)T}, w_{Δ,f}^{(i)T}, w_{Δ+1,f}^{(i)T}, …, w_{L-1,f}^{(i)T}]^T
is used to extract the estimate y_{t,f}^{(i)} of the target sound d_{t,f}^{(i)}, with w_{τ,f}^{(i)} denoting the i-th column of W_{τ,f}. Using equations (5) and (6), equation (4) can also be rewritten as
[Math. 7]
  y_{t,f}^{(i)} = w̄_f^{(i)H} x̄_{t,f},  where x̄_{t,f} = [x_{t,f}^T, x_{t-Δ,f}^T, x_{t-Δ-1,f}^T, …, x_{t-L+1,f}^T]^T    (7)
Further, for Q_f ∈ C^{M×I}, Ḡ_f ∈ C^{M(L-Δ)×M}, and
[Math. 8]
assume that W_{0,f} = Q_f and that the following equation (8) is satisfied:
[Math. 9]
  [W_{Δ,f}^T, W_{Δ+1,f}^T, …, W_{L-1,f}^T]^T = -Ḡ_f Q_f    (8)
When M ≥ I and rank{Q_f} = I, the existence of a Ḡ_f satisfying equation (8) is guaranteed. Using these, equation (4) can also be rewritten as the following equations (9) and (10):
[Math. 10]
  z_{t,f} = x_{t,f} - Ḡ_f^H x̄_{t-Δ,f}    (9)
[Math. 11]
  y_{t,f} = Q_f^H z_{t,f}    (10)
where the following is satisfied:
[Math. 12]
  x̄_{t-Δ,f} = [x_{t-Δ,f}^T, x_{t-Δ-1,f}^T, …, x_{t-L+1,f}^T]^T
Similarly, for q_f^{(i)} ∈ C^{M×1} and
[Math. 13]
  Ḡ_f^{(i)} ∈ C^{M(L-Δ)×M},
[Math. 14]
  x̄_{t-Δ,f} = [x_{t-Δ,f}^T, …, x_{t-L+1,f}^T]^T,
assume that, for each i ∈ {1,…,I}, w_{0,f}^{(i)} = q_f^{(i)} and the following are satisfied:
[Math. 15]
  [w_{Δ,f}^{(i)T}, w_{Δ+1,f}^{(i)T}, …, w_{L-1,f}^{(i)T}]^T = -Ḡ_f^{(i)} q_f^{(i)}    (11)
Then equation (7) can also be rewritten as the following equations (12) and (13):
[Math. 16]
  z_{t,f}^{(i)} = x_{t,f} - Ḡ_f^{(i)H} x̄_{t-Δ,f}    (12)
[Math. 17]
  y_{t,f}^{(i)} = q_f^{(i)H} z_{t,f}^{(i)}    (13)
Equation (9) can also be transformed into the following equation (9'), in which the dereverberation filter Ḡ_f is expressed by a single stacked vector:
[Math. 18]    (9')
where
[Math. 19]
[Math. 20]
hold, I_M ∈ R^{M×M} denotes the M×M identity matrix, R denotes the set of real numbers, the operator shown in [Math. 21] denotes the Kronecker product, and, for m ∈ {1,…,M}, the vector ḡ_{m,f} shown in [Math. 22] is the m-th column vector of Ḡ_f. Equation (10) can likewise be transformed into the following equation (10'):
[Math. 23]
  y_{t,f}^{(i)} = q_f^{(i)H} z_{t,f}    (10')
The convolutional beamformer estimation unit 11 estimates the convolutional beamformer based on the optimization criterion that y_{t,f} follows a probabilistic model. As this probabilistic model, a model satisfying the following (a) and (b) can be given as an example.
(a) For each i ∈ {1,…,I}, y_{t,f}^{(i)} follows a zero-mean complex Gaussian distribution with time-varying variance
[Math. 24]
  λ_{t,f}^{(i)} = E{|y_{t,f}^{(i)}|^2}    (14)
where E{·} denotes the expectation. Hereinafter, λ_{t,f}^{(i)} is called the power of y_{t,f}^{(i)}.
(b) For each i ∈ {1,…,I}, the convolutional beamformer does not distort the sound arriving at the microphones from sound source i. This constraint can be written, for example, as equation (15) or (16) below:
[Math. 25]    (15)
[Math. 26]    (16)
The optimization criterion for the coefficients θ_f^{(i)}, shown in
[Math. 27]
of the convolutional beamformer of equation (7) corresponding to sound source i, determined according to this probabilistic model, is to minimize the cost function L_{i,f}(θ_f^{(i)}) of the following equation (18) under the constraint of equation (15) or (16):
[Math. 28]
  L_{i,f}(θ_f^{(i)}) = Σ_{t=1}^{N} ( |y_{t,f}^{(i)}|^2 / λ_{t,f}^{(i)} + log λ_{t,f}^{(i)} )    (18)
The optimization criterion for the coefficients Θ_f of the convolutional beamformers for all sound sources i = 1,…,I, shown in
[Math. 29]
is to minimize the cost function L_f(Θ_f) of the following equation (20) under the constraint that equation (15) or (16) is satisfied for all sound sources i = 1,…,I:
[Math. 30]
  L_f(Θ_f) = Σ_{i=1}^{I} L_{i,f}(θ_f^{(i)})    (20)
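Assuming the standard negative log-likelihood of the zero-mean complex Gaussian model (a) (the formula images [Math. 28] and [Math. 30] are not reproduced in this text, so the exact expressions are reconstructions), the cost functions can be sketched as follows.

```python
import numpy as np

def source_cost(y_if, lam_if, eps=1e-12):
    """L_{i,f}: y_if, lam_if are length-N arrays of y_{t,f}^{(i)} and lambda_{t,f}^{(i)}.
    Negative log-likelihood of a zero-mean complex Gaussian with time-varying variance."""
    lam = np.maximum(lam_if, eps)
    return float(np.sum(np.abs(y_if) ** 2 / lam + np.log(lam)))

def total_cost(y_f, lam_f):
    """L_f(Theta_f) = sum over sources i of L_{i,f}; y_f and lam_f have shape (N, I)."""
    return sum(source_cost(y_f[:, i], lam_f[:, i]) for i in range(y_f.shape[1]))
```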
That is, the convolutional beamformer estimation unit 11 of this embodiment estimates the convolutional beamformer that minimizes the cost function L_f(Θ_f) of equation (20), and outputs information specifying the estimated convolutional beamformer (equations (4), (7), (9)(10), (9')(10'), (12)(13)). As can be seen from equations (14) and (18), this estimation requires information specifying the power λ_{t,f}^{(i)} = E{|y_{t,f}^{(i)}|^2} of y_{t,f}^{(i)} for each i ∈ {1,…,I}. When the auxiliary information s includes information specifying the power of the target sound, the power of the target sound specified by the auxiliary information s may be used as λ_{t,f}^{(i)}. When the auxiliary information s does not include such information, λ_{t,f}^{(i)} = |y_{t,f}^{(i)}|^2 is obtained from the y_{t,f}^{(i)} produced by the convolutional beamformer application unit 12, as described below. However, since the y_{t,f}^{(i)} produced by the convolutional beamformer application unit 12 depends on the convolutional beamformer estimated by the convolutional beamformer estimation unit 11, the processes of the convolutional beamformer estimation unit 11 and the convolutional beamformer application unit 12 must be repeated alternately until a predetermined convergence condition is satisfied.
<<Processing of the convolutional beamformer application unit 12 (step S12)>>
The acoustic signal x_{t,f} and the information specifying the convolutional beamformer output from the convolutional beamformer estimation unit 11 are input to the convolutional beamformer application unit 12. The convolutional beamformer application unit 12 applies the convolutional beamformer specified by this information (equations (4), (7), (9)(10), (9')(10'), (12)(13)) to the acoustic signal x_{t,f} = [x_{1,t,f},…,x_{M,t,f}]^T to obtain and output the processed signal y_{t,f} = [y_{t,f}^{(1)},…,y_{t,f}^{(I)}]^T.
As described above, when the auxiliary information s includes information specifying the power of the target sound, the signal processing device 1 outputs the processed signal y_{t,f} obtained in step S12; in this case, the iteration of steps S11 and S12 is unnecessary. On the other hand, when the auxiliary information s does not include such information, the process of step S11 and the process of step S12 are repeated alternately until a predetermined convergence condition is satisfied. Examples of the convergence condition are the condition that the number of iterations has reached a predetermined number and the condition that the change in the coefficients of the convolutional beamformer before and after an iteration is at most a predetermined amount. The signal processing device 1 outputs the processed signal y_{t,f} obtained in step S12 when the convergence condition is satisfied. In either case, the processed signal y_{t,f} is the result of applying reverberation suppression, diffuse noise suppression, and target sound source separation to the acoustic signal x_{t,f}. The output processed signal y_{t,f} may be used as the input of another computation, or may be converted into a time-domain acoustic signal by a well-known method such as the inverse Fourier transform.
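The alternation between step S11 and step S12 when the auxiliary information s does not specify the target-sound power can be sketched as follows; estimate_cbf and apply_cbf are hypothetical stand-ins for the computations of units 11 and 12, and the tolerance-based stopping rule is one of the example convergence conditions above.

```python
import numpy as np

def run_until_convergence(x, s, estimate_cbf, apply_cbf, max_iters=10, tol=1e-4):
    """x: observed time-frequency array; s: auxiliary information.
    estimate_cbf(x, s, lam) -> filter parameters; apply_cbf(params, x) -> y.
    lambda_{t,f}^{(i)} is re-estimated as |y|^2 after each application."""
    lam = None  # unit 11 falls back to an internal initialization on the first pass
    y = None
    for _ in range(max_iters):
        params = estimate_cbf(x, s, lam)     # step S11
        y_new = apply_cbf(params, x)         # step S12
        if y is not None and np.max(np.abs(y_new - y)) < tol:
            return y_new                     # convergence condition satisfied
        y = y_new
        lam = np.abs(y) ** 2                 # power update for the next iteration
    return y
```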
<Features of this embodiment>
In this embodiment, using the auxiliary information s representing the target sound, the convolutional beamformer that performs reverberation suppression, diffuse noise suppression, and target sound source separation is estimated based on the optimization criterion that the signal obtained by applying it to the acoustic signal follows a probabilistic model. The convolutional beamformer can thereby be optimized as a whole, and more effective speech enhancement can be realized.
[Second Embodiment]
Next, a second embodiment of the present invention will be described. In this embodiment, the convolutional beamformer is handled as split into a reverberation suppression filter that performs reverberation suppression and a beamformer that performs diffuse noise suppression and target sound source separation. In other words, the convolutional beamformer of this embodiment includes a reverberation suppression filter that performs reverberation suppression and a beamformer that performs diffuse noise suppression and target sound source separation. The optimization of the reverberation suppression filter and that of the beamformer are not independent of each other, however; the convolutional beamformer is optimized as a whole. Examples of the reverberation suppression filter are equations (9), (9'), and (12), and examples of the beamformer are equations (10), (10'), and (13). In this embodiment, as an example, equation (9') is used as the reverberation suppression filter and equation (10') is used as the beamformer. Power-weighted spatio-temporal covariance matrices of the target sounds are used to optimize the reverberation suppression filter. Since the power-weighted spatio-temporal covariance matrices of the target sounds are small in size, the reverberation suppression filter can be optimized with a small amount of computation. The following description focuses on the differences from the matters described so far, and simplifies the description of matters already described by citing the same reference numerals.
<Functional configuration>
As illustrated in FIG. 2, the signal processing device 2 of this embodiment has a spatio-temporal covariance estimation unit 211, a reverberation suppression filter estimation unit 212, a beamformer estimation unit 213, a reverberation suppression filter application unit 221, a beamformer application unit 222, and a control unit 13, and executes each process under the control of the control unit 13. Here, the spatio-temporal covariance estimation unit 211, the reverberation suppression filter estimation unit 212, and the beamformer estimation unit 213 constitute a convolutional beamformer estimation unit, and the reverberation suppression filter application unit 221 and the beamformer application unit 222 constitute a convolutional beamformer application unit.
<Processing>
The processing of this embodiment will be described with reference to FIG. 2.
The auxiliary information s is input to the beamformer estimation unit 213, and the acoustic signal x_{t,f} is input to the spatio-temporal covariance estimation unit 211 and the reverberation suppression filter application unit 221. The auxiliary information s of this embodiment is information for specifying or estimating the RTF v~_f^{(i)}, and does not include information specifying the power of the target sound. The spatio-temporal covariance estimation unit 211 obtains and outputs the power-weighted spatio-temporal covariance matrices of the target sounds,
[Math. 31]
  R̄_{x,f}^{(i)}
and
[Math. 32]
  P_{x,f}^{(i)}.
The reverberation suppression filter estimation unit 212 receives the power-weighted spatio-temporal covariance matrices R̄_{x,f}^{(i)} and P_{x,f}^{(i)} of the target sounds and the information q_f^{(i)} representing the beamformer, and estimates the reverberation suppression filter based on the optimization criterion described above. The reverberation suppression filter application unit 221 applies the reverberation suppression filter estimated by the reverberation suppression filter estimation unit 212 to the acoustic signal x_{t,f} to obtain and output the reverberation-suppressed signal z_{t,f} (equation (9')). The beamformer estimation unit 213 receives the reverberation-suppressed signal z_{t,f} and the auxiliary information s, and estimates the beamformer based on the optimization criterion described above. The beamformer application unit 222 applies the beamformer estimated by the beamformer estimation unit 213 to the reverberation-suppressed signal z_{t,f} to obtain and output the processed signals y_{t,f}^{(i)}. In this embodiment, the processing of the spatio-temporal covariance estimation unit 211, the reverberation suppression filter estimation unit 212, and the beamformer estimation unit 213 included in the convolutional beamformer estimation unit and the processing of the reverberation suppression filter application unit 221 and the beamformer application unit 222 included in the convolutional beamformer application unit are repeated alternately until a predetermined convergence condition is satisfied. The signal processing device 2 outputs the processed signal y_{t,f} obtained by the beamformer application unit 222 when the convergence condition is satisfied.
The processing of this embodiment will now be described in detail with reference to FIGS. 3 to 6.
The auxiliary information s is input to the beamformer estimation unit 213 (step S213a). In this embodiment, the time-frequency masks γ_{t,f}^{(i)} are input as the auxiliary information s; however, this does not limit the present invention. The acoustic signal x_{t,f} is input to the spatio-temporal covariance estimation unit 211 and the reverberation suppression filter application unit 221 (step S221a).
The spatio-temporal covariance estimation unit 211 initializes λ_{t,f}^{(i)} for all i ∈ {1,…,I}, t ∈ {1,…,N}, f ∈ {1,…,F}. For example, the spatio-temporal covariance estimation unit 211 initializes the powers λ_{t,f}^{(i)} of the target sounds as follows:
[Math. 33]
where the notation shown in
[Math. 34]
represents α^H β α. Further, α ← β denotes substituting β into α; in other words, α ← β denotes setting α to β (step S211a).
The beamformer estimation unit 213 initializes q_f^{(i)} for all i ∈ {1,…,I}, f ∈ {1,…,F}. For example, the beamformer estimation unit 213 sets q_f^{(i)} to the i-th column of I_M (step S213b).
The reverberation suppression filter application unit 221 initializes z_{t,f} for all t ∈ {1,…,N}, f ∈ {1,…,F}. For example, the reverberation suppression filter application unit 221 sets z_{t,f} ← x_{t,f} (step S221b).
If the processed signal y_{t,f} has not yet been obtained, the processed signal y_{t,f} is not input to the spatio-temporal covariance estimation unit 211. If the processed signal y_{t,f} has been obtained, it is additionally input to the spatio-temporal covariance estimation unit 211. The spatio-temporal covariance estimation unit 211 computes and outputs, for all i ∈ {1,…,I}, f ∈ {1,…,F}, the power-weighted spatio-temporal covariance matrix of the target sound
[Math. 35]
  R̄_{x,f}^{(i)}.
If the processed signal y_{t,f} has never been obtained, the λ_{t,f}^{(i)} obtained in step S211a is used for this computation; if the processed signal y_{t,f} has already been obtained, the λ_{t,f}^{(i)} obtained in step S211d is used (step S211b). The spatio-temporal covariance estimation unit 211 further computes and outputs, for all i ∈ {1,…,I}, f ∈ {1,…,F}, the power-weighted spatio-temporal covariance matrix of the target sound
[Math. 36]
  P_{x,f}^{(i)}.
Again, if the processed signal y_{t,f} has never been obtained, the λ_{t,f}^{(i)} obtained in step S211a is used for this computation; if the processed signal y_{t,f} has already been obtained, the λ_{t,f}^{(i)} obtained in step S211d is used (step S211c).
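The formula images [Math. 35] and [Math. 36] are not reproduced in this text; assuming the WPE-style power-weighted correlation matrices commonly used for this purpose, steps S211b and S211c can be sketched as follows for one source i and one frequency bin.

```python
import numpy as np

def weighted_spatiotemporal_covariances(x_f, lam_if, delta, L, eps=1e-12):
    """x_f: (N, M) observations; lam_if: (N,) powers lambda_{t,f}^{(i)}.
    Assumed forms: Rbar = sum_t xbar_t xbar_t^H / lam_t and P = sum_t xbar_t x_t^H / lam_t,
    with xbar_t = [x_{t-delta}; ...; x_{t-L+1}] (zero-padded at the start)."""
    N, M = x_f.shape
    D = M * (L - delta)
    Rbar = np.zeros((D, D), dtype=complex)
    P = np.zeros((D, M), dtype=complex)
    for t in range(N):
        xbar = np.concatenate([x_f[t - tau] if t - tau >= 0 else np.zeros(M, dtype=complex)
                               for tau in range(delta, L)])
        w = 1.0 / max(float(lam_if[t]), eps)
        Rbar += w * np.outer(xbar, xbar.conj())
        P += w * np.outer(xbar, x_f[t].conj())
    return Rbar, P
```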
The acoustic signal x_{t,f}, the power-weighted spatio-temporal covariance matrices R̄_{x,f}^{(i)} and P_{x,f}^{(i)} of the target sounds obtained by the spatio-temporal covariance estimation unit 211, and the information q_f^{(i)} representing the beamformer obtained by the beamformer estimation unit 213 are input to the reverberation suppression filter estimation unit 212. The reverberation suppression filter estimation unit 212 receives these and estimates the reverberation suppression filter (equation (9')) based on the optimization criterion described above. First, the reverberation suppression filter estimation unit 212 computes the matrix Ψ_f shown in
[Math. 37]
(step S212a). Next, the reverberation suppression filter estimation unit 212 computes the matrix φ_f shown in
[Math. 38]
where (·)* denotes the complex conjugate of (·) (step S212b). Further, the reverberation suppression filter estimation unit 212 computes and outputs the information specifying the reverberation suppression filter,
[Math. 39]
  ḡ_f = Ψ_f^+ φ_f,
where (·)^+ denotes the Moore-Penrose pseudo-inverse of (·) (step S212c).
The ḡ_f obtained by the reverberation suppression filter estimation unit 212 is input to the reverberation suppression filter application unit 221. The reverberation suppression filter application unit 221 applies the reverberation suppression filter estimated by the reverberation suppression filter estimation unit 212 to the acoustic signal x_{t,f} as in equation (9') to obtain and output the reverberation-suppressed signal z_{t,f}:
[Math. 40]
The reverberation-suppressed signal z_{t,f} is sent to the beamformer estimation unit 213 and the beamformer application unit 222 (step S221c).
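For illustration, a simplified instantiation of steps S212c and S221c follows, in which the dereverberation filter is solved directly from the matrices of the previous sketch by a Moore-Penrose pseudo-inverse and then applied as in equation (9'); the omission of the beamformer-dependent weighting of [Math. 37] and [Math. 38] is a simplifying assumption.

```python
import numpy as np

def estimate_and_apply_dereverb(x_f, Rbar, P, delta, L):
    """Solve the filter with a pseudo-inverse (as in step S212c), then compute
    z_{t,f} = x_{t,f} - G^H xbar_{t-delta,f} (as in step S221c) for one bin."""
    G = np.linalg.pinv(Rbar) @ P            # (D, M) filter, D = M*(L-delta)
    N, M = x_f.shape
    z = np.empty_like(np.asarray(x_f, dtype=complex))
    for t in range(N):
        xbar = np.concatenate([x_f[t - tau] if t - tau >= 0 else np.zeros(M, dtype=complex)
                               for tau in range(delta, L)])
        z[t] = x_f[t] - G.conj().T @ xbar
    return z
```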
If the processed signal y_{t,f} has never been obtained, the reverberation-suppressed signal z_{t,f}, the auxiliary information s = γ_{t,f}^{(i)}, and the λ_{t,f}^{(i)} obtained in step S211a are input to the beamformer estimation unit 213. If the processed signal y_{t,f} has already been obtained, the reverberation-suppressed signal z_{t,f}, the auxiliary information s = γ_{t,f}^{(i)}, and the processed signal y_{t,f} are input to the beamformer estimation unit 213. The beamformer estimation unit 213 receives these and estimates the beamformer based on the optimization criterion described above.
The beamformer estimation unit 213 obtains the RTF v~_f^{(i)} based on z_{t,f} and γ_{t,f}^{(i)}. As illustrated in FIG. 6, the steering vector estimation unit 2131 of the beamformer estimation unit 213 first estimates and outputs the steering vector v_f^{(i)} based on z_{t,f} and γ_{t,f}^{(i)}. For example, the steering vector v_f^{(i)} is estimated as follows:
[Math. 41]
[Math. 42]
[Math. 43]
The RTF estimation unit 2132 of the beamformer estimation unit 213 then obtains v~_f^{(i)} from v_f^{(i)}; for example, the RTF estimation unit 2132 obtains v~_f^{(i)} according to equation (3) (step S213c). A sketch of one common instantiation follows.
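The formula images [Math. 41] to [Math. 43] are not reproduced in this text; a common mask-based instantiation of the steering vector estimation unit 2131 and the RTF estimation unit 2132, given here as an illustrative assumption, takes the principal eigenvector of the mask-weighted spatial covariance of z_{t,f} and then normalizes it as in equation (3).

```python
import numpy as np

def estimate_rtf(z_f, gamma_if, eps=1e-12):
    """z_f: (N, M) dereverberated signals for one bin; gamma_if: (N,) mask of source i.
    Steering vector: principal eigenvector of the mask-weighted spatial covariance;
    RTF: that eigenvector normalized by its first element, as in Eq. (3)."""
    w = gamma_if / max(float(gamma_if.sum()), eps)
    R = z_f.T @ (w[:, None] * z_f.conj())    # sum_t w_t z_t z_t^H, shape (M, M)
    _, vecs = np.linalg.eigh(R)              # Hermitian eigendecomposition
    v = vecs[:, -1]                          # eigenvector of the largest eigenvalue
    return v / v[0]
```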
The beamformer estimation unit 213 also computes, for all i ∈ {1,…,I}, f ∈ {1,…,F}, the quantity shown in
[Math. 44]
If the processed signal y_{t,f} has never been obtained, the λ_{t,f}^{(i)} obtained in step S211a is used for this computation; if the processed signal y_{t,f} has already been obtained, the λ_{t,f}^{(i)} obtained in step S211d is used (step S213d).
Further, the beamformer estimation unit 213 computes and outputs, for all i ∈ {1,…,I}, f ∈ {1,…,F}, the information q_f^{(i)} specifying the beamformer, shown in
[Math. 45]
(step S213e).
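The formula images [Math. 44] and [Math. 45] are likewise not reproduced; one instantiation consistent with the distortionless constraint of the probabilistic model, given here as an illustrative assumption, is a power-weighted MPDR beamformer computed from the weighted spatial covariance of z_{t,f} (step S213d) and the RTF (step S213e).

```python
import numpy as np

def wmpdr_beamformer(z_f, rtf, lam_if, eps=1e-12):
    """q_f^{(i)} = R^{-1} v / (v^H R^{-1} v), with R = sum_t z_t z_t^H / lambda_t.
    The result satisfies the distortionless constraint q^H v = 1."""
    w = 1.0 / np.maximum(lam_if, eps)
    R = z_f.T @ (w[:, None] * z_f.conj())    # power-weighted spatial covariance
    Rinv_v = np.linalg.solve(R, rtf)
    return Rinv_v / (rtf.conj() @ Rinv_v)
```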
The reverberation-suppressed signal z_{t,f} and the information q_f^{(i)} specifying the beamformer are input to the beamformer application unit 222. The beamformer application unit 222 applies the beamformer to the reverberation-suppressed signal z_{t,f} as in equation (10') to obtain and output the processed signals y_{t,f}^{(i)}:
[Math. 46]
  y_{t,f}^{(i)} = q_f^{(i)H} z_{t,f}
This processing is performed for all i ∈ {1,…,I} and f ∈ {1,…,F}, and the beamformer application unit 222 obtains y_{t,f} = [y_{t,f}^{(1)},…,y_{t,f}^{(I)}]^T (step S222a).
The control unit 13 determines whether the convergence condition described above is satisfied (step S13). If the convergence condition is not satisfied, the spatio-temporal covariance estimation unit 211 and the beamformer estimation unit 213 compute, using the input y_{t,f}^{(i)},
[Math. 47]
  λ_{t,f}^{(i)} ← |y_{t,f}^{(i)}|^2
(step S211d), and the processing returns to step S211b. The processing of the spatio-temporal covariance estimation unit 211, the reverberation suppression filter estimation unit 212, the reverberation suppression filter application unit 221, the beamformer estimation unit 213, and the beamformer application unit 222 is thereby repeated, and each value is updated by this repetition. If the convergence condition is satisfied, the beamformer application unit 222 outputs y_{t,f} = [y_{t,f}^{(1)},…,y_{t,f}^{(I)}]^T (step S222b).
<Features of this embodiment>
In this embodiment, using the auxiliary information s representing the target sound, the convolutional beamformer that performs reverberation suppression, diffuse noise suppression, and target sound source separation is estimated based on the optimization criterion that the signal obtained by applying it to the acoustic signal follows a probabilistic model. The convolutional beamformer can thereby be optimized as a whole, and more effective speech enhancement can be realized. Furthermore, in this embodiment the convolutional beamformer is split into a reverberation suppression filter and a beamformer, and the beamformer is estimated using the reverberation-suppressed signal obtained at an intermediate stage of the estimation, which realizes more effective speech enhancement. In addition, most of the computation required to estimate the reverberation suppression filter is the computation of the power-weighted spatio-temporal covariance matrices R̄_{x,f}^{(i)} and P_{x,f}^{(i)} of the target sounds. The sizes of the matrices R̄_{x,f}^{(i)} and P_{x,f}^{(i)} obtained in steps S211b and S211c are smaller than the sizes of the matrices Ψ_f and φ_f obtained in steps S212a and S212b. Therefore, this embodiment can greatly reduce the amount of computation required to estimate the reverberation suppression filter, and can realize speech enhancement at a low computation cost.
[Modification 1 of the second embodiment]
In the second embodiment, the reverberation suppression filter estimation unit 212 estimates the reverberation suppression filter (equation (9')) with the beamformer fixed, the beamformer estimation unit 213 estimates the beamformer (equation (10')) with the reverberation suppression filter fixed, and this processing is repeated. In this processing, the beamformer is applied to the reverberation-suppressed signal to obtain the I-dimensional processed signal y_{t,f} = [y_{t,f}^{(1)},…,y_{t,f}^{(I)}]^T, and the I-dimensional processed signal y_{t,f} is used to estimate the next reverberation suppression filter. However, since I ≤ M, the I-dimensional processed signal y_{t,f} is more compressed than the M-dimensional acoustic signal x_{t,f}, and information is lost. Due to this loss of information, the reverberation suppression filter and the beamformer may become quasi-optimal solutions rather than optimal solutions. To solve this problem, in this modification, in addition to the power-weighted spatio-temporal covariance matrices R̄_{x,f}^{(i)} and P_{x,f}^{(i)} of the target sounds corresponding to y_{t,f}^{(1)},…,y_{t,f}^{(I)} for i = 1,…,I, power-weighted spatio-temporal covariance matrices of the non-target sounds corresponding to i = I+1,…,M are also computed and used to estimate the reverberation suppression filter.
<Functional configuration>
As illustrated in FIG. 2, the signal processing device 2' of this modification has a spatio-temporal covariance estimation unit 211', a reverberation suppression filter estimation unit 212', a beamformer estimation unit 213, a reverberation suppression filter application unit 221, a beamformer application unit 222, and a control unit 13, and executes each process under the control of the control unit 13. Here, the spatio-temporal covariance estimation unit 211', the reverberation suppression filter estimation unit 212', and the beamformer estimation unit 213 constitute a convolutional beamformer estimation unit, and the reverberation suppression filter application unit 221 and the beamformer application unit 222 constitute a convolutional beamformer application unit.
<Processing>
The processing of this modification will be described with reference to FIG. 2.
In this modification, the spatio-temporal covariance estimation unit 211' replaces the spatio-temporal covariance estimation unit 211 and generates, in addition to the power-weighted spatio-temporal covariance matrices R̄_{x,f}^{(i)} and P_{x,f}^{(i)} of the target sounds, power-weighted spatio-temporal covariance matrices of the non-target sounds. Further, the reverberation suppression filter estimation unit 212' replaces the reverberation suppression filter estimation unit 212 and receives, in addition to the power-weighted spatio-temporal covariance matrices R̄_{x,f}^{(i)} and P_{x,f}^{(i)} of the target sounds and the information q_f^{(i)} representing the beamformers corresponding to 1 ≤ i ≤ I for estimating the target sounds, the power-weighted spatio-temporal covariance matrices of the non-target sounds and the information q_f^{(i)} representing the beamformers corresponding to I < i ≤ M for estimating the non-target sounds, and estimates the reverberation suppression filter based on the optimization criterion described above. The beamformer estimation unit 213' replaces the beamformer estimation unit 213 and generates, in addition to the information q_f^{(i)} representing the beamformers corresponding to 1 ≤ i ≤ I for estimating the target sounds, the information q_f^{(i)} representing the beamformers corresponding to I < i ≤ M for estimating the non-target sounds. The beamformer application unit 222' replaces the beamformer application unit 222 and generates, in addition to the target-sound estimates y_{t,f}^{(1)},…,y_{t,f}^{(I)}, the non-target-sound estimate y_{t,f}^⊥. The rest is the same as in the second embodiment.
The processing of this modification will now be described in detail with reference to FIGS. 3 to 5.
First, the signal processing device 2', in place of the signal processing device 2, executes the processes of steps S213a, S221a, S211a, S213b, S221b, S211b, S211c, S212a, and S212b shown in FIG. 3. However, the processing of the spatio-temporal covariance estimation unit 211 is executed by the spatio-temporal covariance estimation unit 211' in its place. In step S211a, the spatio-temporal covariance estimation unit 211' initializes, in addition to the powers λ_{t,f}^{(1)},…,λ_{t,f}^{(I)} of the target sounds, the power λ_{t,f}^⊥ of the non-target sound in the same manner as the powers of the target sounds. The processing of the beamformer estimation unit 213 is executed by the beamformer estimation unit 213' in its place. In step S213b, the beamformer estimation unit 213' initializes q_f^{(i)} for all i ∈ {1,…,M}, f ∈ {1,…,F}; for example, q_f^{(i)} is set to the i-th column of I_M.
If y_{t,f}^{(i)} has never been obtained, the spatio-temporal covariance estimation unit 211' uses the λ_{t,f}^{(1)},…,λ_{t,f}^{(I)} and λ_{t,f}^⊥ obtained in step S211a. If y_{t,f}^{(i)} has been obtained, y_{t,f}^{(1)},…,y_{t,f}^{(I)} and y_{t,f}^⊥ are input to the spatio-temporal covariance estimation unit 211', so that λ_{t,f}^{(1)},…,λ_{t,f}^{(I)} and λ_{t,f}^⊥ can be obtained in step S211d.
Next, the spatio-temporal covariance estimation unit 211' computes and outputs, using x_{t,f} and λ_{t,f}^⊥, the power-weighted spatio-temporal covariance matrix of the non-target sound
[Math. 48]
  R̄_{x,f}^⊥
(step S211b'). Further, the spatio-temporal covariance estimation unit 211' computes and outputs, using x_{t,f} and λ_{t,f}^⊥, the power-weighted spatio-temporal covariance matrix of the non-target sound
[Math. 49]
  P_{x,f}^⊥
(step S211c').
The reverberation suppression filter estimation unit 212' receives R̄_{x,f}^⊥ and q_f^{(i)} and computes the quantity shown in
[Math. 50]
(step S212a'). Further, the reverberation suppression filter estimation unit 212' receives P_{x,f}^⊥ and q_f^{(i)} and computes the quantity shown in
[Math. 51]
(step S212b').
After that, the signal processing device 2', in place of the signal processing device 2, executes the processes of steps S212c, S221c, S213c, S213d, S213e, S222a, S13, S211d, and S222b shown in FIG. 5. However, the processing of the reverberation suppression filter estimation unit 212 described in the second embodiment is executed by the reverberation suppression filter estimation unit 212' in its place, the processing of the beamformer estimation unit 213 is executed by the beamformer estimation unit 213' in its place, and the processing of the beamformer application unit 222 is executed by the beamformer application unit 222' in its place.
In step S213e, in addition to estimating q_f^{(i)} for i ∈ {1,…,I}, f ∈ {1,…,F}, the beamformer estimation unit 213' also generates q_f^{(i)} for i ∈ {I+1,…,M}, f ∈ {1,…,F}. For example, for each f, the q_f^{(i)} for i ∈ {I+1,…,M} are generated as vectors spanning the complementary space of the linear space spanned by the q_f^{(i)} for i ∈ {1,…,I}. As the vectors spanning the complementary space, for example, an orthonormal basis of that complementary space may be adopted, though other choices are possible; a sketch of one such construction follows.
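A minimal NumPy sketch of this construction: given the target beamformers q_f^{(1)},…,q_f^{(I)} as the columns of a matrix, an orthonormal basis of the orthogonal complement can be taken from the singular value decomposition.

```python
import numpy as np

def complement_beamformers(Q_f):
    """Q_f: (M, I) matrix whose columns are the target beamformers q_f^{(i)}.
    Returns an (M, M-I) matrix whose columns form an orthonormal basis of the
    orthogonal complement of span{q_f^{(1)}, ..., q_f^{(I)}}."""
    M, I = Q_f.shape
    U, _, _ = np.linalg.svd(Q_f, full_matrices=True)
    return U[:, I:]  # columns orthogonal to the range of Q_f
```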
<Features of this modification>
In this modification, not only the power-weighted spatio-temporal covariance matrices of the target sounds but also the power-weighted spatio-temporal covariance matrices of the non-target sounds are computed and used to estimate the reverberation suppression filter, so the estimation accuracy of the reverberation suppression filter improves.
[Modification 2 of the second embodiment]
In modification 2 of the second embodiment, only the processed signal y_{t,f}^{(i)} corresponding to one of the source signals i is obtained and output. The reverberation suppression filter of this modification is that of equation (12), and the beamformer is that of equation (10').
<Functional configuration>
As illustrated in FIG. 2, the signal processing device 2" of this modification has a spatio-temporal covariance estimation unit 211, a reverberation suppression filter estimation unit 212", a beamformer estimation unit 213, a reverberation suppression filter application unit 221, a beamformer application unit 222, and a control unit 13, and executes each process under the control of the control unit 13. Here, the spatio-temporal covariance estimation unit 211, the reverberation suppression filter estimation unit 212", and the beamformer estimation unit 213 constitute a convolutional beamformer estimation unit, and the reverberation suppression filter application unit 221 and the beamformer application unit 222 constitute a convolutional beamformer application unit.
 <Processing>
 The processing of this modification will be described in detail with reference to FIGS. 7 and 8.
 First, in place of the signal processing device 2, the signal processing device 2" executes the processes of steps S213a, S221a, S211a, S211b, and S211c shown in FIG. 7.
 Next, in place of the reverberation suppression filter estimation unit 212, the reverberation suppression filter estimation unit 212" receives R̄_{x,f}^{(i)} and P_{x,f}^{(i)} and computes and outputs the information Ḡ_f^{(i)} corresponding to the dereverberation filter (step S212c"):
 [Math. 52]
 The reverberation suppression filter application unit 221 receives the information Ḡ_f^{(i)} corresponding to the dereverberation filter and the acoustic signal x_{t,f}, applies the reverberation suppression filter to the acoustic signal x_{t,f} as follows, and obtains and outputs the reverberation suppression signal z_{t,f} (step S221c"):
 [Math. 53]
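 The concrete form of [Math. 53] is not recoverable from this text. Purely as a hedged sketch, assuming the common delayed linear-prediction form z_{t,f} = x_{t,f} − Ḡ_f^H x̄_{t−Δ,f}, where x̄_{t−Δ,f} stacks L past frames starting Δ frames back (the function name, shapes, and this filter form are assumptions, not the patent's definition):

```python
import numpy as np

def apply_dereverb_filter(X, G, delta):
    """Apply a delayed linear-prediction dereverberation filter
    per frequency bin (assumed form; see lead-in).

    X: (T, F, M) STFT observations x_{t,f} for M microphones.
    G: (F, L*M, M) filter taps (L past frames per bin).
    Returns Z: (T, F, M) reverberation-suppressed signals z_{t,f}.
    """
    T, F, M = X.shape
    L = G.shape[1] // M
    Z = X.copy()
    for t in range(T):
        for f in range(F):
            # Stack x_{t-delta,f}, ..., x_{t-delta-L+1,f} (zeros before t=0).
            past = [X[t - delta - l, f] if t - delta - l >= 0
                    else np.zeros(M, dtype=X.dtype) for l in range(L)]
            xbar = np.concatenate(past)               # shape (L*M,)
            Z[t, f] = X[t, f] - G[f].conj().T @ xbar  # z = x - G^H x̄
    return Z
```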
 Thereafter, in place of the signal processing device 2, the signal processing device 2" executes the processes of steps S213c, S213d, S213e, S222a, S13, S211d, and S222b shown in FIG. 8.
 [Third Embodiment]
 Next, a third embodiment of the present invention will be described. In this embodiment, the auxiliary information includes information specifying the power of the target sound, which makes it possible to omit the iterative processing.
 <Functional configuration>
 As illustrated in FIG. 9, the signal processing device 3 of this embodiment has a spatio-temporal covariance estimation unit 311, a reverberation suppression filter estimation unit 212, a beamformer estimation unit 313, a reverberation suppression filter application unit 221, a beamformer application unit 222, and a control unit 13, and executes each process under the control of the control unit 13. Here, the spatio-temporal covariance estimation unit 311, the reverberation suppression filter estimation unit 212, and the beamformer estimation unit 313 constitute a convolutional beamformer estimation unit. The reverberation suppression filter application unit 221 and the beamformer application unit 222 constitute a convolutional beamformer application unit.
 <Processing>
 The processing of this embodiment will be described in detail with reference to FIGS. 10, 11, and 4.
 First, the auxiliary information s = {s_1, s_2} is input to the signal processing device 3. The auxiliary information s of this embodiment includes a time-frequency mask s_1 = γ_{t,f}^{(i)} and information s_2 = λ_{t,f}^{(i)} specifying the power of the target sound. The time-frequency mask s_1 = γ_{t,f}^{(i)} is input to the beamformer estimation unit 313, and the information s_2 = λ_{t,f}^{(i)} specifying the power of the target sound is input to the spatio-temporal covariance estimation unit 311 and the beamformer estimation unit 313 (step S313a).
 As illustrated in FIGS. 10, 11, and 4, in place of the signal processing device 2, the signal processing device 3 executes the processes of steps S221a, S211a, S213b, S221b, S211b, S211c, S212a, S212b, S212c, S221c, S213c, S213d, S213e, S222a, and S222b. However, the processes of the spatio-temporal covariance estimation unit 211 and the beamformer estimation unit 213 described in the second embodiment are executed by the spatio-temporal covariance estimation unit 311 and the beamformer estimation unit 313, respectively. Further, the auxiliary information s_2 = λ_{t,f}^{(i)} is used in the calculations of steps S211b, S211c, and S213d, as sketched below. As a result, the processed signal y_{t,f}^{(i)}, in which reverberation suppression, diffuse noise suppression, and target sound source separation have been applied to the acoustic signal x_{t,f}, is obtained without iterative processing.
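 As a hedged illustration of how the given power λ_{t,f}^{(i)} can enter the covariance calculations of steps S211b and S211c, the following numpy sketch computes a generic power-weighted spatio-temporal covariance Σ_t x̄_{t,f} x̄_{t,f}^H / λ_{t,f}^{(i)} for one frequency bin (the exact definitions of R̄_{x,f}^{(i)} and P_{x,f}^{(i)} are given by the patent's equations, which are not reproduced in this excerpt; names and shapes here are assumptions):

```python
import numpy as np

def power_weighted_covariance(Xbar, lam, eps=1e-8):
    """Power-weighted spatio-temporal covariance for one bin f.

    Xbar: (T, D) stacked spatio-temporal vectors x̄_{t,f}.
    lam:  (T,) target-sound powers λ_{t,f}^{(i)}.
    Returns R: (D, D) = sum_t x̄_{t,f} x̄_{t,f}^H / λ_{t,f}^{(i)}.
    """
    w = 1.0 / np.maximum(lam, eps)       # per-frame power weights
    return (Xbar.T * w) @ Xbar.conj()    # (D, D), Hermitian
```

 Because λ_{t,f}^{(i)} is supplied as auxiliary information rather than re-estimated from intermediate results, such weights can be computed once, which is what allows the iteration to be dropped.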
 <Features of this embodiment>
 In this embodiment, by giving the power of the target sound to the signal processing device 3 as auxiliary information, highly accurate speech enhancement can be performed without iterative processing.
 [Modification 1 of the third embodiment]
 As in modification 1 of the second embodiment, in the third embodiment a power-weighted spatio-temporal covariance matrix of the non-target sound may also be computed, in addition to the target-sound power-weighted spatio-temporal covariance matrices R̄_{x,f}^{(i)} and P_{x,f}^{(i)}, and used for estimating the reverberation suppression filter.
 <Functional configuration>
 As illustrated in FIG. 9, the signal processing device 3' of this modification has a spatio-temporal covariance estimation unit 311', a reverberation suppression filter estimation unit 212', a beamformer estimation unit 313, a reverberation suppression filter application unit 221, a beamformer application unit 222, and a control unit 13, and executes each process under the control of the control unit 13. Here, the spatio-temporal covariance estimation unit 311', the reverberation suppression filter estimation unit 212', and the beamformer estimation unit 313 constitute a convolutional beamformer estimation unit. The reverberation suppression filter application unit 221 and the beamformer application unit 222 constitute a convolutional beamformer application unit.
 <Processing>
 The processing of this modification will be described in detail with reference to FIGS. 10, 11, and 4.
 In place of the signal processing device 3, the signal processing device 3' executes the processes of steps S313a, S221a, S211a, S213b, S221b, S211b, S211c, S212a, and S212b shown in FIG. 10 as described in the third embodiment.
 Next, in place of the signal processing device 2', the signal processing device 3' executes the processes of steps S211b', S211c', S212a', and S212b' shown in FIG. 4. However, the process of the spatio-temporal covariance estimation unit 211' described in modification 1 of the second embodiment is executed by the spatio-temporal covariance estimation unit 311'. Further, the auxiliary information s_2 = λ_{t,f}^{(i)} is used in the calculations of steps S211b' and S211c'.
 In place of the signal processing device 3, the signal processing device 3' executes the processes of steps S212c, S221c, S213c, S213d, S213e, S222a, and S222b shown in FIG. 11 as described in the third embodiment.
 [Modification 2 of the third embodiment]
 As in modification 2 of the second embodiment, also in the third embodiment only the processed signal y_{t,f}^{(i)} corresponding to a single source signal i may be obtained.
 <Functional configuration>
 As illustrated in FIG. 9, the signal processing device 3" of this modification has a spatio-temporal covariance estimation unit 311, a reverberation suppression filter estimation unit 212", a beamformer estimation unit 313, a reverberation suppression filter application unit 221", a beamformer application unit 222, and a control unit 13, and executes each process under the control of the control unit 13. Here, the spatio-temporal covariance estimation unit 311, the reverberation suppression filter estimation unit 212", and the beamformer estimation unit 313 constitute a convolutional beamformer estimation unit. The reverberation suppression filter application unit 221" and the beamformer application unit 222 constitute a convolutional beamformer application unit.
 <Processing>
 The processing of this modification will be described in detail with reference to FIGS. 12 and 13.
 In place of the signal processing device 3, the signal processing device 3" executes steps S313a, S221a, S211b, and S211c shown in FIGS. 12 and 13 as described in the third embodiment. Further, in place of the reverberation suppression filter estimation unit 212" of the signal processing device 2", the reverberation suppression filter estimation unit 212" of the signal processing device 3" executes steps S212c" and S221c" described in modification 2 of the second embodiment. Thereafter, in place of the signal processing device 2, the signal processing device 3" executes steps S213c, S213d, S213e, S222a, and S222b described in the second embodiment. However, the auxiliary information s_2 = λ_{t,f}^{(i)} is used in the calculations of steps S211b, S211c, and S213d.
 [Fourth Embodiment]
 As described in modification 1 of the second embodiment, the sizes of the power-weighted spatio-temporal covariance matrices R̄_{x,f}^{(i)} and P_{x,f}^{(i)} of the target sound obtained in steps S211b and S211c are smaller than the sizes of the matrices Ψ_f and φ_f obtained in steps S212a and S212b, so each of the above embodiments and modifications realizes speech enhancement at a low computational cost. Although this benefit is then lost, in steps S212a and S212b,
 [Math. 54]
 [Math. 55]
 may be replaced by
 [Math. 56]
 [Math. 57]
 where the following is satisfied:
 [Math. 58]
 <Comparative experiment>
 Comparison results between the fourth embodiment and modifications 1 and 2 of the second embodiment are illustrated below. Experiments were conducted with the following two configurations, Config-1 and Config-2. A short-time Fourier transform was used for the frequency division. A Hann window was used as the window function, and the frame length and shift width were set to 32 ms and 8 ms, respectively. Further, Δ = 4 was used.
 [Table 59]
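 For concreteness, the stated analysis settings translate into the following hedged scipy sketch (the sampling rate, channel count, and input signal are placeholders assumed for illustration; they are not stated in this excerpt):

```python
import numpy as np
from scipy.signal import stft

fs = 16000                      # assumed sampling rate
nperseg = int(0.032 * fs)       # 32 ms frame length -> 512 samples at 16 kHz
hop = int(0.008 * fs)           # 8 ms shift -> 128 samples
noverlap = nperseg - hop

x = np.random.randn(8, 4 * fs)  # placeholder: 8-channel, 4-second mixture
f, t, X = stft(x, fs=fs, window="hann", nperseg=nperseg, noverlap=noverlap)
# X has shape (channels, F, T): the frequency-divided acoustic signal x_{t,f};
# Δ = 4 above is then a delay of 4 such frames.
```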
 FIG. 15 shows the word error rates obtained when the processed signals produced by the fourth embodiment and by modifications 1 and 2 of the second embodiment were passed to speech recognition, for the two configurations Config-1 and Config-2. The horizontal axis of FIG. 15 represents the number of iterations (#iterations), and the vertical axis represents the word error rate (WER (%)). As shown in FIG. 15, the methods of modifications 1 and 2 of the second embodiment improve the speech recognition performance for speech signals recorded in a noisy, reverberant, multi-sound-source environment compared with the method of the fourth embodiment.
 The computation times required to process a 9.44 s mixture signal by the methods of the fourth embodiment and of modifications 1 and 2 of the second embodiment, for the two configurations Config-1 and Config-2, are illustrated below.
 [Table 60]
 It can be seen that the methods of modifications 1 and 2 of the second embodiment can perform reverberation suppression, diffuse noise suppression, and target sound source separation with a smaller amount of computation than the method of the fourth embodiment.
 [Hardware configuration]
 The signal processing devices 1, 2, 2', 2", 3, 3', and 3" in each embodiment are devices configured by, for example, a general-purpose or dedicated computer that includes a processor (hardware processor) such as a CPU (central processing unit) and memories such as a RAM (random-access memory) and a ROM (read-only memory) executing a predetermined program. This computer may have a single processor and memory, or multiple processors and memories. The program may be installed in the computer, or may be recorded in a ROM or the like in advance. Further, some or all of the processing units may be configured using electronic circuitry that realizes the processing functions by itself, rather than circuitry such as a CPU that realizes a functional configuration by reading a program. The electronic circuitry constituting one device may also include multiple CPUs.
 FIG. 6 is a block diagram illustrating the hardware configuration of the signal processing devices 1, 2, 2', 2", 3, 3', and 3" in each embodiment. As illustrated in FIG. 6, the signal processing devices 1, 2, 2', 2", 3, 3', and 3" of this example have a CPU (Central Processing Unit) 10a, an input unit 10b, an output unit 10c, a RAM (Random Access Memory) 10d, a ROM (Read Only Memory) 10e, an auxiliary storage device 10f, and a bus 10g. The CPU 10a of this example has a control unit 10aa, an arithmetic unit 10ab, and a register 10ac, and executes various arithmetic processes according to programs read into the register 10ac. The input unit 10b is an input terminal, keyboard, mouse, touch panel, or the like through which data is input. The output unit 10c is an output terminal from which data is output, a display, a LAN card controlled by the CPU 10a that has read a predetermined program, or the like. The RAM 10d is an SRAM (Static Random Access Memory), a DRAM (Dynamic Random Access Memory), or the like, and has a program area 10da in which a predetermined program is stored and a data area 10db in which various data are stored. The auxiliary storage device 10f is, for example, a hard disk, an MO (Magneto-Optical disc), a semiconductor memory, or the like, and has a program area 10fa in which a predetermined program is stored and a data area 10fb in which various data are stored. The bus 10g connects the CPU 10a, the input unit 10b, the output unit 10c, the RAM 10d, the ROM 10e, and the auxiliary storage device 10f so that information can be exchanged among them. The CPU 10a writes the program stored in the program area 10fa of the auxiliary storage device 10f to the program area 10da of the RAM 10d according to a loaded OS (Operating System) program. Similarly, the CPU 10a writes the various data stored in the data area 10fb of the auxiliary storage device 10f to the data area 10db of the RAM 10d. The addresses on the RAM 10d at which the program and data are written are stored in the register 10ac of the CPU 10a. The control unit 10aa of the CPU 10a sequentially reads these addresses from the register 10ac, reads the program and data from the areas on the RAM 10d indicated by the read addresses, causes the arithmetic unit 10ab to sequentially execute the operations indicated by the program, and stores the results in the register 10ac. With such a configuration, the functional configurations of the signal processing devices 1, 2, 2', 2", 3, 3', and 3" are realized.
 The above program can be recorded on a computer-readable recording medium. An example of a computer-readable recording medium is a non-transitory recording medium, such as a magnetic recording device, an optical disc, a magneto-optical recording medium, or a semiconductor memory.
 The program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. Alternatively, the program may be stored in the storage device of a server computer and distributed by transferring it from the server computer to other computers over a network. As described above, a computer that executes such a program first, for example, temporarily stores the program recorded on the portable recording medium or transferred from the server computer in its own storage device. When executing a process, the computer then reads the program stored in its own storage device and executes the process according to the read program. As another execution form, the computer may read the program directly from the portable recording medium and execute processes according to it, or it may execute a process according to the received program each time the program is transferred to it from the server computer. The above processes may also be executed by a so-called ASP (Application Service Provider) type service, in which the processing functions are realized only through execution instructions and result acquisition, without transferring the program from the server computer to the computer. The program in this embodiment includes information that is used for processing by an electronic computer and is equivalent to a program (data that is not a direct command to the computer but has the property of defining the processing of the computer, etc.).
 In each embodiment, the device is configured by executing a predetermined program on a computer, but at least part of these processing contents may be realized by hardware.
 The present invention is not limited to the above embodiments. For example, in each of the above embodiments and modifications, the acoustic signal x_{t,f} is obtained by frequency-dividing a signal obtained by observing a mixture of diffuse noise and source signals emitted from sound sources. However, this does not limit the present invention. For example, the acoustic signal x_{t,f} may be obtained by applying some signal processing (such as filtering) to a signal obtained by frequency-dividing the observed mixture signal; by frequency-dividing a signal obtained by applying some signal processing to the observed mixture signal; or by applying further signal processing to a signal obtained by frequency-dividing a signal obtained by applying some signal processing to the observed mixture signal. Further, the various processes described above are not only executed in time series as described, but may also be executed in parallel or individually according to the processing capacity of the device that executes them or as needed.
 In the second embodiment, the reverberation suppression signal z_{t,f} and the auxiliary information s = γ_{t,f}^{(i)} (a time-frequency mask) are input to the beamformer estimation unit 213, and the steering vector estimation unit 2131 of the beamformer estimation unit 213 estimates the steering vector v_f^{(i)} based on z_{t,f} and γ_{t,f}^{(i)}. However, the auxiliary information s may include the steering vector v_f^{(i)} itself. In this case, the steering vector estimation unit 2131 can be omitted, and the RTF estimation unit 2132 of the beamformer estimation unit 213 may obtain ṽ_f^{(i)} from the v_f^{(i)} included in the auxiliary information s. Even if the auxiliary information s includes neither the time-frequency mask γ_{t,f} nor the steering vector v_f^{(i)}, the steering vector v_f^{(i)} can still be estimated from the reverberation suppression signal z_{t,f} and the auxiliary information s, provided that s includes a reference sound of the target sound. That is, as illustrated in FIG. 6, the time-frequency mask estimation unit 2130 of the beamformer estimation unit 213 may first receive the reverberation suppression signal z_{t,f} and the auxiliary information s (the reference sound of the target sound), estimate the time-frequency mask γ_{t,f}^{(i)} by the method described in Reference 3, and input it to the steering vector estimation unit 2131. Alternatively, the RTF ṽ_f^{(i)} itself may be input to the beamformer estimation unit 213 as the auxiliary information s = ṽ_f^{(i)}. In short, as long as information for calculating the RTF ṽ_f^{(i)} is input to the beamformer estimation unit 213 as the auxiliary information s, the beamformer estimation unit 213 can estimate the beamformer.
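 As a hedged sketch of the mask-based estimation performed by the steering vector estimation unit 2131, the following shows one common choice from the literature for a single frequency bin (the patent does not fix the formula in this passage; the function name and the use of the principal eigenvector are assumptions):

```python
import numpy as np

def estimate_steering_vector(Z, gamma):
    """Mask-based steering vector estimate for one frequency bin f.

    Z:     (T, M) reverberation-suppressed observations z_{t,f}.
    gamma: (T,) time-frequency mask values γ_{t,f}^{(i)} in [0, 1].
    Returns v: (M,) steering vector estimate v_f^{(i)}.
    """
    # Mask-weighted spatial covariance of the target sound.
    R = (Z.T * gamma) @ Z.conj() / np.maximum(gamma.sum(), 1e-8)
    # Principal eigenvector (eigh returns eigenvalues in ascending order).
    _, eigvecs = np.linalg.eigh(R)
    return eigvecs[:, -1]
```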
 Further, in steps S212a' and S212b', the power-weighted spatio-temporal covariance matrices R̄_{x,f}^{(i)} and P_{x,f}^{(i)} of the target sound may be used in place of the power-weighted spatio-temporal covariance matrices R̄_{x,f}^⊥ and P_{x,f}^⊥ of the non-target sound. In this case, steps S211b' and S211c' can be omitted.
 Needless to say, other modifications can be made as appropriate without departing from the spirit of the present invention.
1, 2, 2', 2", 3, 3', 3"  Signal processing device
11  Convolutional beamformer estimation unit
12  Convolutional beamformer application unit
211, 211', 311, 311'  Spatio-temporal covariance estimation unit
212, 212', 212"  Reverberation suppression filter estimation unit
213, 213', 313  Beamformer estimation unit
221, 221"  Reverberation suppression filter application unit
222, 222'  Beamformer application unit

Claims (7)

  1. A signal processing device comprising:
    a convolutional beamformer estimation unit that receives a frequency-divided time-series acoustic signal and auxiliary information representing information on a target sound, and estimates a convolutional beamformer that performs reverberation suppression, diffuse noise suppression, and target sound source separation, based on an optimization criterion that a signal obtained by applying the convolutional beamformer to the acoustic signal follows a probabilistic model; and
    a convolutional beamformer application unit that applies the convolutional beamformer estimated by the convolutional beamformer estimation unit to the acoustic signal to obtain and output a processed signal.
  2. The signal processing device according to claim 1, wherein
    the convolutional beamformer includes a reverberation suppression filter that performs the reverberation suppression, and a beamformer that performs the diffuse noise suppression and the target sound source separation,
    the convolutional beamformer estimation unit includes:
    a spatio-temporal covariance matrix estimation unit that obtains a power-weighted spatio-temporal covariance matrix of the target sound; and
    a reverberation suppression filter estimation unit that receives the acoustic signal, the power-weighted spatio-temporal covariance matrix of the target sound, and information representing the beamformer, and estimates the reverberation suppression filter based on the optimization criterion,
    the convolutional beamformer application unit includes a reverberation suppression filter application unit that applies the reverberation suppression filter estimated by the reverberation suppression filter estimation unit to the acoustic signal to obtain a reverberation suppression signal,
    the convolutional beamformer estimation unit further includes a beamformer estimation unit that receives the reverberation suppression signal and the auxiliary information, and estimates the beamformer based on the optimization criterion, and
    the convolutional beamformer application unit further includes a beamformer application unit that applies the beamformer estimated by the beamformer estimation unit to the reverberation suppression signal to obtain and output the processed signal.
  3. The signal processing device according to claim 2, wherein
    the spatio-temporal covariance matrix estimation unit further obtains a power-weighted spatio-temporal covariance matrix of a non-target sound, and
    the reverberation suppression filter estimation unit receives the acoustic signal, the power-weighted spatio-temporal covariance matrix of the target sound, the power-weighted spatio-temporal covariance matrix of the non-target sound, and information specifying the beamformer, and estimates the reverberation suppression filter based on the optimization criterion.
  4. The signal processing device according to claim 2 or 3, wherein
    the process of the spatio-temporal covariance matrix estimation unit, the process of the reverberation suppression filter estimation unit, the process of the reverberation suppression filter application unit, the process of the beamformer estimation unit, and the process of the beamformer application unit are repeated.
  5. The signal processing device according to any one of claims 1 to 3, wherein
    the auxiliary information includes information specifying the power of the target sound.
  6. A signal processing method comprising:
    a convolutional beamformer estimation step of receiving a frequency-divided time-series acoustic signal and auxiliary information representing information on a target sound, and estimating a convolutional beamformer that performs reverberation suppression, diffuse noise suppression, and target sound source separation, based on an optimization criterion that a signal obtained by applying the convolutional beamformer to the acoustic signal follows a probabilistic model; and
    a convolutional beamformer application step of applying the convolutional beamformer estimated in the convolutional beamformer estimation step to the acoustic signal to obtain and output a processed signal.
  7. A program for causing a computer to function as the signal processing device according to any one of claims 1 to 5.
PCT/JP2020/015456 2020-04-06 2020-04-06 Signal processing device, signal processing method, and program WO2021205494A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
JP2022513704A JP7444243B2 (en) 2020-04-06 2020-04-06 Signal processing device, signal processing method, and program
PCT/JP2020/015456 WO2021205494A1 (en) 2020-04-06 2020-04-06 Signal processing device, signal processing method, and program

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2020/015456 WO2021205494A1 (en) 2020-04-06 2020-04-06 Signal processing device, signal processing method, and program

Publications (1)

Publication Number Publication Date
WO2021205494A1 true WO2021205494A1 (en) 2021-10-14

Family

ID=78023169

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2020/015456 WO2021205494A1 (en) 2020-04-06 2020-04-06 Signal processing device, signal processing method, and program

Country Status (2)

Country Link
JP (1) JP7444243B2 (en)
WO (1) WO2021205494A1 (en)


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2009005261A (en) * 2007-06-25 2009-01-08 Nippon Telegr & Teleph Corp <Ntt> Sound pickup apparatus, sound pickup method, sound pickup program using its method, and storage medium
JP2017505461A (en) * 2014-04-30 2017-02-16 ホアウェイ・テクノロジーズ・カンパニー・リミテッド Apparatus, method, and computer program for signal processing for removing reverberation of some input audio signals
JP2016136229A (en) * 2015-01-14 2016-07-28 本田技研工業株式会社 Voice processing device, voice processing method, and voice processing system
JP2017107141A (en) * 2015-12-09 2017-06-15 日本電信電話株式会社 Sound source information estimation device, sound source information estimation method and program
JP2019508730A (en) * 2016-03-23 2019-03-28 グーグル エルエルシー Adaptive audio enhancement for multi-channel speech recognition
WO2018110008A1 (en) * 2016-12-16 2018-06-21 日本電信電話株式会社 Target sound emphasis device, noise estimation parameter learning device, method for emphasizing target sound, method for learning noise estimation parameter, and program

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114999508A (en) * 2022-07-29 2022-09-02 之江实验室 Universal speech enhancement method and device by using multi-source auxiliary information
CN114999508B (en) * 2022-07-29 2022-11-08 之江实验室 Universal voice enhancement method and device by utilizing multi-source auxiliary information
US12094484B2 (en) 2022-07-29 2024-09-17 Zhejiang Lab General speech enhancement method and apparatus using multi-source auxiliary information

Also Published As

Publication number Publication date
JPWO2021205494A1 (en) 2021-10-14
JP7444243B2 (en) 2024-03-06

Similar Documents

Publication Publication Date Title
Kitamura et al. Determined blind source separation with independent low-rank matrix analysis
US8848933B2 (en) Signal enhancement device, method thereof, program, and recording medium
JP7115562B2 (en) SIGNAL PROCESSING DEVICE, SIGNAL PROCESSING METHOD, AND PROGRAM
KR100878992B1 (en) Geometric source separation signal processing technique
WO2007100137A1 (en) Reverberation removal device, reverberation removal method, reverberation removal program, and recording medium
JP2007526511A (en) Method and apparatus for blind separation of multipath multichannel mixed signals in the frequency domain
JP6815956B2 (en) Filter coefficient calculator, its method, and program
WO2021205494A1 (en) Signal processing device, signal processing method, and program
Song et al. An integrated multi-channel approach for joint noise reduction and dereverberation
Barner et al. Polynomial weighted median filtering
WO2021171406A1 (en) Signal processing device, signal processing method, and program
JP7156064B2 (en) Latent variable optimization device, filter coefficient optimization device, latent variable optimization method, filter coefficient optimization method, program
WO2022162878A1 (en) Signal processing device, signal processing method, signal processing program, learning device, learning method, and learning program
JP7552742B2 (en) SOUND SOURCE SEPARATION DEVICE, SOUND SOURCE SEPARATION METHOD, AND PROGRAM
WO2020184210A1 (en) Noise-spatial-covariance-matrix estimation device, noise-spatial-covariance-matrix estimation method, and program
CN110677782B (en) Signal adaptive noise filter
CN109074811B (en) Audio source separation
CN108353241A (en) Rendering system
WO2024038522A1 (en) Signal processing device, signal processing method, and program
JP2020141160A (en) Sound information processing device and programs
WO2022180741A1 (en) Acoustic signal enhancement device, method, and program
JP7173355B2 (en) PSD optimization device, PSD optimization method, program
JP2024152109A (en) Signal enhancement device, method and program
Moir et al. Decorrelation of multiple non‐stationary sources using a multivariable crosstalk‐resistant adaptive noise canceller
JP7173356B2 (en) PSD optimization device, PSD optimization method, program

Legal Events

- 121 — Ep: the epo has been informed by wipo that ep was designated in this application. Ref document number: 20930384; Country of ref document: EP; Kind code of ref document: A1
- ENP — Entry into the national phase. Ref document number: 2022513704; Country of ref document: JP; Kind code of ref document: A
- NENP — Non-entry into the national phase. Ref country code: DE
- 122 — Ep: pct application non-entry in european phase. Ref document number: 20930384; Country of ref document: EP; Kind code of ref document: A1