US11894010B2 - Signal processing apparatus, signal processing method, and program - Google Patents
- Publication number: US11894010B2
- Authority
- US
- United States
- Prior art keywords
- signals
- beamformer
- frequency
- convolutional
- time interval
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active, expires
Classifications
- G—PHYSICS; G10—MUSICAL INSTRUMENTS; ACOUSTICS; G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/0208—Noise filtering (under G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation)
- G10L21/0232—Processing in the frequency domain (under G10L21/0216—Noise filtering characterised by the method used for estimating noise)
- G10L2021/02082—Noise filtering, the noise being echo or reverberation of the speech
- G10L2021/02166—Microphone arrays; Beamforming (under G10L2021/02161—Number of inputs available containing the signal or the noise to be suppressed)
- H—ELECTRICITY; H04R—LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
- H04R3/00—Circuits for transducers, loudspeakers or microphones
Description
- the present invention relates to a signal processing technique for an acoustic signal.
- NPL 1 and NPL 2 disclose a method of suppressing noise and reverberation from an observation signal in the frequency domain.
- reverberation and noise are suppressed by receiving an observation signal in the frequency domain and a steering vector representing the direction of a sound source or an estimated vector thereof, estimating an instantaneous beamformer for minimizing the power of the frequency-domain observation signal under a constraint condition that sound reaching a microphone from the sound source is not distorted, and applying the instantaneous beamformer to the frequency-domain observation signal (conventional method 1).
- PTL 1 and NPL 3 disclose a method of suppressing reverberation from an observation signal in the frequency domain.
- reverberation in an observation signal in the frequency domain is suppressed by receiving an observation signal in the frequency domain and the power of a target sound at each time, or an estimated value thereof, estimating a reverberation suppression filter for suppressing reverberation in the target sound on the basis of a weighted power minimization reference of a prediction error, and applying the reverberation suppression filter to the frequency-domain observation signal (conventional method 2).
- NPL 4 discloses a method of suppressing noise and reverberation by cascade-connecting conventional method 2 and conventional method 1.
- In this method, at a prior stage, an observation signal in the frequency domain and the power of a target sound at each time are received and reverberation is suppressed using conventional method 2; then, at a later stage, a steering vector is received and reverberation and noise are further suppressed using conventional method 1 (conventional method 3).
- Conventional method 1 is a method originally developed for the purpose of suppressing noise and may not always be capable of sufficiently suppressing reverberation. With conventional method 2, noise cannot be suppressed.
- Conventional method 3 can suppress more noise and reverberation than when conventional method 1 or conventional method 2 is used alone. With conventional method 3, however, conventional method 2 serving as the prior stage and conventional method 1 serving as the later stage are viewed as independent systems and optimization is performed in the respective systems. Therefore, when conventional method 2 is applied at the prior stage, it may not always be possible to sufficiently suppress reverberation due to the effects of noise. Further, when conventional method 1 is applied at the later stage, it may not always be possible to sufficiently suppress noise and reverberation due to the effects of residual reverberation.
- the present invention has been designed in consideration of these points, and an object thereof is to provide a technique with which noise and reverberation can be sufficiently suppressed.
- In the present invention, a convolutional beamformer that calculates, at each time, a weighted sum of a current signal and a past signal sequence having a predetermined delay and a length of 0 or more is acquired such that estimation signals of target signals increase a probability expressing a speech-likeness of the estimation signals based on a predetermined probability model, where the estimation signals are acquired by applying the convolutional beamformer to frequency-divided observation signals corresponding respectively to a plurality of frequency bands of observation signals acquired by picking up acoustic signals emitted from a sound source. The estimation signals are then acquired by applying the acquired convolutional beamformer to the frequency-divided observation signals.
- Because the convolutional beamformer is acquired such that the estimation signals increase the probability expressing the speech-likeness of the estimation signals based on the probability model, noise suppression and reverberation suppression can be optimized as a single system, with the result that noise and reverberation can be sufficiently suppressed.
- FIG. 1 A is a block diagram illustrating an example of a functional configuration of a signal processing device according to a first embodiment
- FIG. 1 B is a flowchart illustrating an example of a signal processing method according to the first embodiment.
- FIG. 2 A is a block diagram illustrating an example of a functional configuration of a signal processing device according to a second embodiment
- FIG. 2 B is a flowchart illustrating an example of a signal processing method according to the second embodiment.
- FIG. 3 is a block diagram illustrating an example of a functional configuration of a signal processing device according to a third embodiment.
- FIG. 4 is a block diagram illustrating an example of a functional configuration of a parameter estimation unit illustrated in FIG. 3 .
- FIG. 5 is a flowchart illustrating an example of a parameter estimation method according to the third embodiment.
- FIG. 6 is a block diagram illustrating an example of a functional configuration of a signal processing device according to fourth to seventh embodiments.
- FIG. 7 is a block diagram illustrating an example of a functional configuration of a parameter estimation unit illustrated in FIG. 6 .
- FIG. 8 is a block diagram illustrating an example of a functional configuration of a steering vector estimation unit illustrated in FIG. 7 .
- FIG. 9 is a block diagram illustrating an example of a functional configuration of a signal processing device according to an eighth embodiment.
- FIG. 10 is a block diagram illustrating an example of a functional configuration of a signal processing device according to a ninth embodiment.
- FIGS. 11 A to 11 C are block diagrams illustrating examples of use of the signal processing devices according to the embodiments.
- FIG. 12 is a table illustrating examples of test results of the first embodiment.
- FIG. 13 is a table illustrating examples of test results of the first embodiment.
- FIG. 14 is a table illustrating examples of test results of the fourth embodiment.
- FIGS. 15 A to 15 C are tables illustrating examples of test results of the seventh embodiment.
- A "target signal" denotes a signal corresponding to a direct sound and an initial reflected sound, within a signal (for example, a frequency-divided observation signal) corresponding to a sound emitted from a target sound source and picked up by a microphone.
- the initial reflected sound denotes a reverberation component derived from the sound emitted from the target sound source that reaches the microphone at a delay of no more than several tens of milliseconds following the direct sound.
- the initial reflected sound typically acts to improve the clarity of the sound, and in this embodiment, a signal corresponding to the initial reflected sound is also included in the target signal.
- the signal corresponding to the sound picked up by the microphone also includes, in addition to the target signal described above, late reverberation (a component acquired by excluding the initial reflected sound from the reverberation) derived from the sound emitted from the target sound source, and noise derived from a source other than the target sound source.
- the target signal is estimated by suppressing late reverberation and noise from a frequency-divided observation signal corresponding to a sound recorded by the microphone, for example.
- reverberation is assumed to refer to “late reverberation”.
- Method 1 serving as a prerequisite of the method according to the embodiments will now be described.
- In method 1, noise and reverberation are suppressed from M-dimensional observation signals in the frequency domain (frequency-divided observation signals)
- x_{f,t} = [x_{f,t}^{(1)}, x_{f,t}^{(2)}, …, x_{f,t}^{(M)}]^T (1)
- the frequency-divided observation signals x f, t are acquired by transforming M observation signals, which are acquired by picking up acoustic signals emitted from one or a plurality of sound sources in M microphones, to the frequency domain.
- the observation signals are acquired by picking up acoustic signals emitted from the sound sources in an environment where noise and reverberation exist.
- x f, t (m) is acquired by transforming an observation signal that is acquired by being picked up by the microphone having the microphone number m to the frequency domain.
- x f, t (m) corresponds to the frequency band having the frequency band number f and the time frame having the time frame number t.
- the frequency-divided observation signals x f, t are time series signals.
- In method 1, an instantaneous beamformer w_{f,0} for minimizing the cost function C_1(w_{f,0}) = Σ_t |w_{f,0}^H x_{f,t}|² is determined for each frequency band under the constraint condition in which "the target signals are not distorted as a result of applying the instantaneous beamformer (for example, a minimum power distortionless response beamformer) w_{f,0}, which calculates the weighted sum of the signals at the current time, to the frequency-divided observation signals x_{f,t} at each time".
- The constraint condition is a condition in which, for example, w_{f,0}^H v_{f,0} is a constant (1, for example), where
- v_{f,0} = [v_{f,0}^{(1)}, v_{f,0}^{(2)}, …, v_{f,0}^{(M)}]^T (4)
- is a steering vector having, as elements, the transfer functions v_{f,0}^{(m)} relating to the direct sound and the initial reflected sound from the sound source to each microphone (the pickup position of the acoustic signal), or an estimated vector (an estimated steering vector) thereof.
- v_{f,0} is expressed by an M-dimensional vector (the dimension being the number of microphones) having, as elements, the transfer functions v_{f,0}^{(m)}, which correspond to the direct sound and initial reflected sound parts of an impulse response from the sound source position to each microphone (i.e., the reflections that arrive at a delay of no more than several tens of milliseconds (for example, within 30 milliseconds) following the direct sound).
- A normalized vector, acquired by normalizing the transfer function of each element so that the gain of the microphone having one of the microphone numbers m_0 ∈ {1, …, M} becomes a constant g (g ≠ 0), may also be used as v_{f,0} (equation (5)).
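- As a concrete illustration of method 1 (not from the patent text), the following is a minimal numpy sketch of a minimum power distortionless response beamformer under the constraint w_{f,0}^H v_{f,0} = 1; the function name and the batch covariance estimate are assumptions made for illustration.

```python
import numpy as np

def mpdr_beamformer(X, v):
    """Instantaneous (minimum power distortionless response) beamformer.

    X : (M, T) complex array of frequency-divided observation signals x_{f,t}
        for one frequency band f.
    v : (M,) complex steering vector (or estimated steering vector) v_{f,0}.
    Returns w_{f,0} minimizing the output power subject to w^H v = 1.
    """
    M, T = X.shape
    R = (X @ X.conj().T) / T          # spatial covariance of the observations
    Rinv_v = np.linalg.solve(R, v)    # R^{-1} v
    return Rinv_v / (v.conj() @ Rinv_v)  # closed form: R^{-1} v / (v^H R^{-1} v)

# Applying the beamformer at each time: y_{f,t} = w^H x_{f,t}.
# y = mpdr_beamformer(X, v).conj() @ X
```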
- Method 2 serving as a prerequisite of the method according to the embodiments will now be described.
- In method 2, reverberation is suppressed from the frequency-divided observation signal x_{f,t} .
- The reverberation suppression filter F_{f,τ} is an M×M-dimensional matrix filter for suppressing reverberation in the frequency-divided observation signal x_{f,t} .
- d is a positive integer expressing a prediction delay.
- L is a positive integer expressing the filter length.
- σ_{f,t}² is the power of the target signal.
- An estimation signal z_{f,t} of the target signal, in which reverberation has been suppressed from the frequency-divided observation signal x_{f,t}, is acquired.
- The estimation signal z_{f,t} of the target signal is an M-dimensional column vector, acquired as shown below:
- z_{f,t} = x_{f,t} − Σ_{τ=d}^{d+L−1} F_{f,τ}^H x_{f,t−τ} (8)
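- As an illustration of method 2, the sketch below estimates the reverberation suppression filter by weighted power minimization of the prediction error in batch form and applies it as in equation (8); the normal-equation formulation, the filter-application convention, and all names are assumptions made for illustration, not taken from the patent.

```python
import numpy as np

def wpe_dereverb(X, d, L, sigma2):
    """One pass of method-2-style reverberation suppression for one band.

    X      : (M, T) frequency-divided observation signals x_{f,t}.
    d      : prediction delay (positive integer).
    L      : filter length (positive integer).
    sigma2 : (T,) power of the target signal (or its estimate) at each time.
    Returns Z, the (M, T) estimation signals z_{f,t}.
    """
    M, T = X.shape
    # Stack the delayed past frames x_{f,t-d}, ..., x_{f,t-d-L+1} (zero-padded).
    Xbar = np.zeros((M * L, T), dtype=complex)
    for i in range(L):
        tau = d + i
        Xbar[i * M:(i + 1) * M, tau:] = X[:, :T - tau]
    # Weighted normal equations of the prediction-error power minimization.
    W = Xbar / sigma2                 # weight each frame by 1/sigma^2
    A = W @ Xbar.conj().T             # (ML, ML) weighted covariance
    b = W @ X.conj().T                # (ML, M) weighted correlation
    F = np.linalg.solve(A, b)         # reverberation suppression filter taps
    # z_{f,t} = x_{f,t} - F^H xbar_{f,t}, as in equation (8).
    return X - F.conj().T @ Xbar
```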
- An estimation signal y_{f,t} of the target signal, acquired by suppressing noise and reverberation from the frequency-divided observation signal x_{f,t} using a method integrating methods 1 and 2, can be modeled as follows:
- y_{f,t} = w̄_f^H x̄_{f,t} (9)
- w̄_f is a convolutional beamformer that calculates, at each time point, a weighted sum of the current signal and a past signal sequence having a predetermined delay. Note that the "‾" of "w̄_f" should be written directly above the "w", but due to notation limitations it may also be written to the upper right of "w". The convolutional beamformer w̄_f is expressed as shown below, for example,
- w̄_f = [w̄_f^{(1)T}, w̄_f^{(2)T}, …, w̄_f^{(M)T}]^T (10)
- where the following is satisfied.
- w̄_f^{(m)} = [w_{f,0}^{(m)}, w_{f,d}^{(m)}, w_{f,d+1}^{(m)}, …, w_{f,d+L−1}^{(m)}]^T (10A)
- Further, x̄_{f,t} is expressed as follows.
- x̄_{f,t} = [x̄_{f,t}^{(1)T}, x̄_{f,t}^{(2)T}, …, x̄_{f,t}^{(M)T}]^T (11)
- x̄_{f,t}^{(m)} = [x_{f,t}^{(m)}, x_{f,t−d}^{(m)}, x_{f,t−d−1}^{(m)}, …, x_{f,t−d−L+1}^{(m)}]^T (11A)
- The convolutional beamformer w̄_f of equation (9A) is a beamformer that calculates, at each time point, the weighted sum of the current signal and a signal sequence having a predetermined delay and a length of 0; in this case, therefore, the convolutional beamformer calculates a weighted value of the current signal alone at each time point.
- The signal processing device of the present invention can acquire the estimation signal of the target signal by determining a convolutional beamformer on the basis of a probability expressing a speech-likeness and applying the convolutional beamformer to the frequency-divided observation signals.
- The convolutional beamformer w̄_f which maximizes the probability expressing the speech-likeness of y_{f,t} is determined.
- A complex normal distribution having an average of 0 and a variance matching the power σ_{f,t}² of the target signal can be cited as an example of a speech probability density function.
- The "target signal" is a signal corresponding to the direct sound and the initial reflected sound, within a signal corresponding to a sound emitted from a target sound source and picked up by a microphone. Further, the signal processing device determines the convolutional beamformer w̄_f under the constraint condition in which "the target signals are not distorted as a result of applying the convolutional beamformer w̄_f to the frequency-divided observation signals x_{f,t}", for example.
- This constraint condition is a condition in which, for example, w_{f,0}^H v_{f,0} is a constant (1, for example).
- R_f is a weighted space-time covariance matrix determined as shown below:
- R_f = Σ_t x̄_{f,t} x̄_{f,t}^H / σ_{f,t}² (14)
- The signal processing device may determine the w̄_f which minimizes the cost function C_3(w̄_f) = Σ_t |w̄_f^H x̄_{f,t}|² / σ_{f,t}² of equation (13) under the constraint condition described above (in which, for example, w_{f,0}^H v_{f,0} is a constant). The solution is given by equation (15):
- w̄_f = R_f^{-1} v̄_f / (v̄_f^H R_f^{-1} v̄_f) (15)
- v̄_f = [v̄_f^{(1)T}, v̄_f^{(2)T}, …, v̄_f^{(M)T}]^T is a vector acquired by disposing the elements v_{f,0}^{(m)} of the steering vector v_{f,0} as follows:
- each v̄_f^{(m)} is an (L+1)-dimensional column vector having v_{f,0}^{(m)} and L zeros as elements.
- The signal processing device acquires the estimation signal y_{f,t} of the target signal by applying the determined convolutional beamformer w̄_f to the frequency-divided observation signal x_{f,t} as follows:
- y_{f,t} = w̄_f^H x̄_{f,t} (16)
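- Putting equations (10) to (16) together, the sketch below builds the stacked signal x̄_{f,t}, the weighted space-time covariance matrix R_f of equation (14), and the convolutional beamformer of equation (15). It is a minimal numpy sketch: a per-delay stacking order is used instead of the per-microphone order of equations (10)-(11), which is equivalent up to a permutation, and all names are illustrative.

```python
import numpy as np

def convolutional_beamformer(X, v, d, L, sigma2):
    """Closed-form convolutional beamformer of equations (10)-(16), one band.

    X      : (M, T) frequency-divided observation signals x_{f,t}.
    v      : (M,) steering vector (or estimated steering vector) v_{f,0}.
    d, L   : prediction delay and filter length.
    sigma2 : (T,) power (or estimated power) of the target signal.
    Returns y, the (T,) estimation signals y_{f,t}.
    """
    M, T = X.shape
    # x̄_{f,t}: current frame stacked with the delayed frames (equation (11)),
    # laid out per delay rather than per microphone.
    Xbar = np.zeros(((L + 1) * M, T), dtype=complex)
    Xbar[:M] = X
    for i in range(L):
        tau = d + i
        Xbar[(i + 1) * M:(i + 2) * M, tau:] = X[:, :T - tau]
    # Weighted space-time covariance matrix R_f (equation (14)).
    R = (Xbar / sigma2) @ Xbar.conj().T
    # v̄_f: the steering vector followed by zeros at all delayed taps.
    vbar = np.zeros((L + 1) * M, dtype=complex)
    vbar[:M] = v
    Rinv_v = np.linalg.solve(R, vbar)
    # w̄_f = R_f^{-1} v̄_f / (v̄_f^H R_f^{-1} v̄_f)  (equation (15)).
    wbar = Rinv_v / (vbar.conj() @ Rinv_v)
    # y_{f,t} = w̄_f^H x̄_{f,t}  (equation (16)).
    return wbar.conj() @ Xbar
```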
- a signal processing device 1 includes an estimation unit 11 and a suppression unit 12 .
- the frequency-divided observation signal x f, t is input into the estimation unit 11 (equation (1)).
- The estimation unit 11 acquires and outputs the convolutional beamformer w̄_f for calculating the weighted sum of the current signal and a past signal sequence having a predetermined delay at each time, such that the estimation signals increase the probability expressing the speech-likeness of the estimation signals based on the predetermined probability model, where the estimation signals are acquired by applying the convolutional beamformer w̄_f to the frequency-divided observation signals x_{f,t} in the respective frequency bands.
- The frequency-divided observation signal x_{f,t} and the convolutional beamformer w̄_f acquired in step S 11 are input into the suppression unit 12 .
- The suppression unit 12 acquires and outputs the estimation signal y_{f,t} of the target signal by applying the convolutional beamformer w̄_f to the frequency-divided observation signal x_{f,t} in each frequency band.
- For example, the suppression unit 12 acquires and outputs the estimation signal y_{f,t} by applying w̄_f to x̄_{f,t} as shown in equation (16).
- In this embodiment, the convolutional beamformer w̄_f for calculating the weighted sum of the current signal and a past signal sequence having a predetermined delay at each time is determined such that the estimation signals increase the probability expressing the speech-likeness of the estimation signals based on the predetermined probability model, where the estimation signals are acquired by applying the convolutional beamformer w̄_f to the frequency-divided observation signals x_{f,t} .
- This corresponds to optimizing noise suppression and reverberation suppression as a single system. In this embodiment, therefore, noise and reverberation can be suppressed more adequately than with the conventional methods.
- a signal processing device 2 includes an estimation unit 21 and the suppression unit 12 .
- the estimation unit 21 includes a matrix estimation unit 211 and a convolutional beamformer estimation unit 212 .
- The estimation unit 21 of this embodiment acquires and outputs the convolutional beamformer w̄_f which minimizes a sum of values (the cost function C_3(w̄_f) of equation (13), for example) acquired by weighting the power of the estimation signals at each time belonging to a predetermined time interval by the reciprocal of the power σ_{f,t}² of the target signals, or the reciprocal of the estimated power σ_{f,t}² of the target signals, under the constraint condition in which "the target signals are not distorted as a result of applying the convolutional beamformer w̄_f to the frequency-divided observation signals x_{f,t}".
- The convolutional beamformer w̄_f is equivalent to a beamformer acquired by integrating a reverberation suppression filter F_{f,t} for suppressing reverberation from the frequency-divided observation signal x_{f,t} and the instantaneous beamformer w_{f,0} for suppressing noise from a signal acquired by applying the reverberation suppression filter F_{f,t} to the frequency-divided observation signal x_{f,t} .
- The constraint condition is a condition in which, for example, "a value acquired by applying the instantaneous beamformer to a steering vector having, as elements, the transfer functions relating to the direct sound and the initial reflected sound from the sound source to the pickup position of the acoustic signals, or to an estimated steering vector, which is an estimated vector of the steering vector, is a constant (w_{f,0}^H v_{f,0} is a constant)".
- The frequency-divided observation signals x_{f,t} and the power or estimated power σ_{f,t}² of the target signals are input into the matrix estimation unit 211 .
- The matrix estimation unit 211 acquires and outputs a weighted space-time covariance matrix R_f for each frequency band on the basis of the frequency-divided observation signals x_{f,t} and the power or estimated power σ_{f,t}² of the target signal.
- For example, the matrix estimation unit 211 acquires and outputs the weighted space-time covariance matrix R_f in accordance with equation (14).
- The steering vector or estimated steering vector v_{f,0} (equation (4) or (5)) and the weighted space-time covariance matrix R_f acquired in step S 211 are input into the convolutional beamformer estimation unit 212 .
- The convolutional beamformer estimation unit 212 acquires and outputs the convolutional beamformer w̄_f on the basis of the weighted space-time covariance matrix R_f and the steering vector or estimated steering vector v_{f,0} .
- For example, the convolutional beamformer estimation unit 212 acquires and outputs the convolutional beamformer w̄_f in accordance with equation (15).
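- In code, the split between the matrix estimation unit 211 (equation (14)) and the convolutional beamformer estimation unit 212 (equation (15)) might look as follows, taking the stacked signals x̄_{f,t} of equation (11) as given; a sketch under those assumptions, not the patent's implementation.

```python
import numpy as np

def matrix_estimation_unit(Xbar, sigma2):
    """Step S 211: weighted space-time covariance R_f of equation (14).

    Xbar   : ((L+1)M, T) stacked frequency-divided observation signals x̄_{f,t}.
    sigma2 : (T,) power or estimated power of the target signals.
    """
    return (Xbar / sigma2) @ Xbar.conj().T

def convolutional_beamformer_estimation_unit(R, vbar):
    """Step S 212: w̄_f = R_f^{-1} v̄_f / (v̄_f^H R_f^{-1} v̄_f) (equation (15)).

    vbar : ((L+1)M,) steering vector padded with zeros at the delayed taps.
    """
    Rinv_v = np.linalg.solve(R, vbar)
    return Rinv_v / (vbar.conj() @ Rinv_v)
```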
- This step is identical to the first embodiment, and therefore description thereof has been omitted.
- In this embodiment, the weighted space-time covariance matrix R_f is acquired, and on the basis of the weighted space-time covariance matrix R_f and the steering vector or estimated steering vector v_{f,0}, the convolutional beamformer w̄_f is acquired.
- This corresponds to optimizing noise suppression and reverberation suppression as a single system. In this embodiment, therefore, noise and reverberation can be suppressed more adequately than with the conventional methods.
- a signal processing device 3 includes the estimation unit 21 , the suppression unit 12 , and a parameter estimation unit 33 .
- the estimation unit 21 includes the matrix estimation unit 211 and the convolutional beamformer estimation unit 212 .
- the parameter estimation unit 33 includes an initial setting unit 330 , a power estimation unit 331 , a reverberation suppression filter estimation unit 332 , a reverberation suppression filter application unit 333 , a steering vector estimation unit 334 , an instantaneous beamformer estimation unit 335 , an instantaneous beamformer application unit 336 , and a control unit 337 .
- the frequency-divided observation signal x f, t is input into the initial setting unit 330 .
- Using the frequency-divided observation signal x_{f,t}, the initial setting unit 330 generates and outputs a provisional power σ_{f,t}², which is a provisional value of the estimated power of the target signal.
- For example, the initial setting unit 330 generates and outputs the provisional power σ_{f,t}² as follows:
- σ_{f,t}² = x_{f,t}^H x_{f,t} / M (17)
- The frequency-divided observation signals x_{f,t} and the newest provisional powers σ_{f,t}² are input into the reverberation suppression filter estimation unit 332 .
- the frequency-divided observation signal x f, t and the newest reverberation suppression filter F f, t acquired in step S 332 are input into the reverberation suppression filter application unit 333 .
- the reverberation suppression filter application unit 333 acquires and outputs an estimation signal y′ f, t by applying the reverberation suppression filter F f, t to the frequency-divided observation signal x f, t in each frequency band.
- the reverberation suppression filter application unit 333 sets z f, t , acquired in accordance with equation (8), as y′ f, t and outputs y′ f, t .
- the newest estimation signal y′ f, t acquired in step S 333 is input into the steering vector estimation unit 334 .
- The steering vector estimation unit 334 acquires and outputs a provisional steering vector v_{f,0}, which is a provisional vector of the estimated steering vector, in each frequency band.
- For example, the steering vector estimation unit 334 acquires and outputs the provisional steering vector v_{f,0} for the estimation signal y′_{f,t} in accordance with a steering vector estimation method described in NPL 1 and NPL 2.
- For example, the steering vector estimation unit 334 outputs, as the provisional steering vector, a steering vector estimated from y′_{f,t} in accordance with NPL 2.
- As described above, a normalized vector acquired by normalizing the transfer function of each element so that the gain of the microphone having any one of the microphone numbers m_0 ∈ {1, …, M} becomes a constant g may be used as v_{f,0} (equation (5)).
- The newest estimation signal y′_{f,t} acquired in step S 333 and the newest provisional steering vector v_{f,0} acquired in step S 334 are input into the instantaneous beamformer estimation unit 335 .
- the newest estimation signal y′ f, t acquired in step S 333 and the newest instantaneous beamformer w f, 0 acquired in step S 335 are input into the instantaneous beamformer application unit 336 .
- the instantaneous beamformer application unit 336 acquires and outputs an estimation signal y′′ f, t by applying the instantaneous beamformer w f, 0 to the estimation signal y′ f, t in each frequency band.
- the instantaneous beamformer application unit 336 acquires and outputs the estimation signal y′′ f, t as follows.
- y″_{f,t} = w_{f,0}^H y′_{f,t} (19)
- the newest estimation signal y′′ f, t acquired in step S 336 is input into the power estimation unit 331 .
- The power estimation unit 331 outputs the power of the estimation signal y″_{f,t} as the provisional power σ_{f,t}² in each frequency band.
- For example, the power estimation unit 331 generates and outputs the provisional power σ_{f,t}² as follows:
- σ_{f,t}² = y″_{f,t}^H y″_{f,t} (20)
- the control unit 337 determines whether or not a termination condition is satisfied.
- For example, the termination condition may be satisfied when the number of repetitions of the processing of steps S 331 to S 336 exceeds a predetermined value, when the variation in σ_{f,t}² or v_{f,0} falls to or below a predetermined value after the processing of steps S 331 to S 336 is performed once, and so on.
- the processing returns to step S 332 .
- the processing advances to step S 337 b.
- In step S 337 b, the power estimation unit 331 outputs the σ_{f,t}² acquired most recently in step S 331 as the estimated power of the target signal, and the steering vector estimation unit 334 outputs the v_{f,0} acquired most recently in step S 334 as the estimated steering vector.
- The estimated power σ_{f,t}² is input into the matrix estimation unit 211 ,
- and the estimated steering vector v_{f,0} is input into the convolutional beamformer estimation unit 212 .
- In this embodiment, the steering vector is estimated on the basis of the frequency-divided observation signal x_{f,t}, and by repeating the processing described above, the estimation precision improves; that is, the precision of the estimated steering vector can be improved. A sketch of this iteration follows.
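- The alternation of steps S 330 to S 336 can be summarized as below; wpe_dereverb and mpdr_beamformer are the sketches given earlier, estimate_steering_vector is a hypothetical stand-in for the NPL 1/NPL 2 methods of step S 334, and a fixed iteration count replaces the termination condition of step S 337 a.

```python
import numpy as np

def estimate_steering_vector(Y, ref=0):
    """Hypothetical stand-in for step S 334: principal eigenvector of the
    spatial covariance, normalized at a reference microphone."""
    R = (Y @ Y.conj().T) / Y.shape[1]
    vals, vecs = np.linalg.eigh(R)
    v = vecs[:, -1]                   # eigenvector of the largest eigenvalue
    return v / v[ref]

def parameter_estimation_unit_33(X, d, L, n_iter=5):
    """Iterative estimation of the power and steering vector (one band).

    X : (M, T) frequency-divided observation signals x_{f,t}.
    Returns the estimated power sigma2 (T,) and estimated steering vector v (M,).
    """
    M, T = X.shape
    # Step S 330: provisional power per equation (17).
    sigma2 = np.einsum('mt,mt->t', X.conj(), X).real / M
    for _ in range(n_iter):  # simplified termination condition (step S 337 a)
        # Steps S 332-S 333: reverberation suppression filter, y'_{f,t}.
        Y1 = wpe_dereverb(X, d, L, sigma2)
        # Step S 334: provisional steering vector from the dereverberated signal.
        v = estimate_steering_vector(Y1)
        # Steps S 335-S 336: instantaneous beamformer, y''_{f,t}.
        w0 = mpdr_beamformer(Y1, v)
        y2 = w0.conj() @ Y1
        # Step S 331: provisional power per equation (20).
        sigma2 = np.abs(y2) ** 2
    # Step S 337 b: output the newest power and steering vector.
    return sigma2, v
```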
- a signal processing device 4 includes the estimation unit 21 , the suppression unit 12 , and a parameter estimation unit 43 .
- the estimation unit 21 includes the matrix estimation unit 211 and the convolutional beamformer estimation unit 212 .
- the parameter estimation unit 43 includes a reverberation suppression unit 431 and a steering vector estimation unit 432 .
- the fourth embodiment differs from the first to third embodiments in that before generating the estimated steering vector, the reverberation component of the frequency-divided observation signal x f, t is suppressed.
- the frequency-divided observation signal x f, t is input into the reverberation suppression unit 431 of the parameter estimation unit 43 ( FIG. 7 ).
- the reverberation suppression unit 431 acquires and outputs a frequency-divided reverberation-suppressed signal u f, t in which the reverberation component of the frequency-divided observation signal x f, t has been suppressed (preferably, in which the reverberation component of the frequency-divided observation signal x f, t has been removed).
- the reverberation suppression unit 431 acquires and outputs the frequency-divided reverberation-suppressed signal u f, t in which the reverberation component of the frequency-divided observation signal x f, t has been suppressed using a method described in reference document 1.
- the frequency-divided reverberation-suppressed signal u f, t acquired by the reverberation suppression unit 431 is input into the steering vector estimation unit 432 .
- Using the frequency-divided reverberation-suppressed signal u_{f,t} as input, the steering vector estimation unit 432 generates and outputs an estimated steering vector serving as an estimated vector of the steering vector.
- a steering vector estimation processing method of acquiring an estimated steering vector using a frequency-divided time series signal as input is well-known.
- The steering vector estimation unit 432 acquires and outputs the estimated steering vector v_{f,0} by using the frequency-divided reverberation-suppressed signal u_{f,t} as the input of a desired type of steering vector estimation processing.
- There are no limitations on the steering vector estimation processing method; for example, the methods described above in NPL 1 and NPL 2, the methods described in reference documents 2 and 3, and so on may be used.
- The estimated steering vector v_{f,0} acquired by the steering vector estimation unit 432 is input into the convolutional beamformer estimation unit 212 .
- The convolutional beamformer estimation unit 212 performs the processing of step S 212 , described in the second embodiment, using the estimated steering vector v_{f,0} and the weighted space-time covariance matrix R_f acquired in step S 211 . All other processing is as described in the first and second embodiments.
- the estimated steering vector of each time frame number t can be calculated from frequency-divided observation signals x f, t input successively online, for example.
- a signal processing device 5 includes the estimation unit 21 , the suppression unit 12 , and a parameter estimation unit 53 .
- the estimation unit 21 includes the matrix estimation unit 211 and the convolutional beamformer estimation unit 212 .
- the parameter estimation unit 53 includes a steering vector estimation unit 532 .
- the steering vector estimation unit 532 includes an observation signal covariance matrix updating unit 532 a , a main component vector updating unit 532 b , a steering vector updating unit 532 c (the steering vector estimation unit), an inverse noise covariance matrix updating unit 532 d , and a noise covariance matrix updating unit 532 e .
- the fifth embodiment differs from the first to third embodiments only in that the estimated steering vector is generated by successive processing.
- The frequency-divided observation signal x_{f,t}, which is a frequency-divided time series signal, is input into the steering vector estimation unit 532 ( FIGS. 7 and 8 ).
- <<Processing of Observation Signal Covariance Matrix Updating Unit 532 a (Step S 532 a )>>
- The observation signal covariance matrix updating unit 532 a ( FIG. 8 ) acquires and outputs a spatial covariance matrix Σ_{x,f,t} of the frequency-divided observation signal x_{f,t} (a spatial covariance matrix of a frequency-divided observation signal belonging to a first time interval), which is based on the frequency-divided observation signal x_{f,t} (the frequency-divided observation signal belonging to the first time interval) and a spatial covariance matrix Σ_{x,f,t−1} of a frequency-divided observation signal x_{f,t−1} (a spatial covariance matrix of a frequency-divided observation signal belonging to a second time interval that is further in the past than the first time interval).
- For example, the observation signal covariance matrix updating unit 532 a acquires and outputs a linear sum of a covariance matrix x_{f,t} x_{f,t}^H of the frequency-divided observation signal x_{f,t} (the frequency-divided observation signal belonging to the first time interval) and the spatial covariance matrix Σ_{x,f,t−1} (the spatial covariance matrix of the frequency-divided observation signal belonging to the second time interval that is further in the past than the first time interval) as the spatial covariance matrix Σ_{x,f,t} of the frequency-divided observation signal x_{f,t} (the spatial covariance matrix of the frequency-divided observation signal belonging to the first time interval).
- For example, the observation signal covariance matrix updating unit 532 a acquires and outputs the spatial covariance matrix Σ_{x,f,t} in accordance with equation (21), shown below.
- Σ_{x,f,t} = α Σ_{x,f,t−1} + x_{f,t} x_{f,t}^H (21)
- α is an oblivion coefficient, and is a real number belonging to a range of 0 < α < 1, for example.
- An initial matrix Σ_{x,f,0} of the spatial covariance matrix Σ_{x,f,t−1} may be set as desired.
- For example, an M×M-dimensional unit matrix may be set as the initial matrix Σ_{x,f,0} of the spatial covariance matrix Σ_{x,f,t−1} .
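- Equation (21) is a direct rank-1 recursion; a minimal numpy sketch (names are illustrative) follows.

```python
import numpy as np

def update_observation_covariance(Sigma_x_prev, x, alpha=0.99):
    """Step S 532 a: spatial covariance update of equation (21).

    Sigma_x_prev : (M, M) covariance of the second (past) time interval.
    x            : (M,) frequency-divided observation signal of the first interval.
    alpha        : oblivion (forgetting) coefficient, 0 < alpha < 1.
    """
    return alpha * Sigma_x_prev + np.outer(x, x.conj())
```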
- <<Processing of Inverse Noise Covariance Matrix Updating Unit 532 d (Step S 532 d )>>
- The frequency-divided observation signal x_{f,t} and mask information γ_{f,t}^{(n)} are input into the inverse noise covariance matrix updating unit 532 d .
- The mask information γ_{f,t}^{(n)} is information expressing the ratio of the noise component included in the frequency-divided observation signal x_{f,t} at the time-frequency point corresponding to the time frame number t and the frequency band number f.
- For example, the mask information γ_{f,t}^{(n)} expresses the occupancy probability of the noise component included in the frequency-divided observation signal x_{f,t} at the time-frequency point corresponding to the time frame number t and the frequency band number f.
- Methods of estimating the mask information γ_{f,t}^{(n)} include, for example, an estimation method using a complex Gaussian mixture model (CGMM) (reference document 4, for example), an estimation method using a neural network (reference document 5, for example), an estimation method integrating these methods (reference documents 6 and 7, for example), and so on.
- The mask information γ_{f,t}^{(n)} may be estimated in advance and stored in a storage device, not illustrated in the figures, or may be estimated successively. Note that the superscript "(n)" of "γ_{f,t}^{(n)}" should be written directly above the subscript "f, t", but due to notation limitations has been written to the upper right of "f, t".
- The inverse noise covariance matrix updating unit 532 d acquires and outputs an inverse noise covariance matrix Σ_{n,f,t}^{-1} (an inverse noise covariance matrix of the frequency-divided observation signal belonging to the first time interval) on the basis of the frequency-divided observation signal x_{f,t} (the frequency-divided observation signal belonging to the first time interval), the mask information γ_{f,t}^{(n)} (mask information belonging to the first time interval), and an inverse noise covariance matrix Σ_{n,f,t−1}^{-1} (an inverse noise covariance matrix of the frequency-divided observation signal belonging to the second time interval that is further in the past than the first time interval).
- For example, the inverse noise covariance matrix updating unit 532 d acquires and outputs the inverse noise covariance matrix Σ_{n,f,t}^{-1} in accordance with equation (22), shown below, using the Woodbury formula.
- Σ_{n,f,t}^{-1} = (1/α) ( Σ_{n,f,t−1}^{-1} − γ_{f,t}^{(n)} Σ_{n,f,t−1}^{-1} x_{f,t} x_{f,t}^H Σ_{n,f,t−1}^{-1} / (α + γ_{f,t}^{(n)} x_{f,t}^H Σ_{n,f,t−1}^{-1} x_{f,t}) ) (22)
- α is an oblivion coefficient, and is a real number belonging to a range of 0 < α < 1, for example.
- An initial matrix Σ_{n,f,0}^{-1} of the inverse noise covariance matrix Σ_{n,f,t−1}^{-1} may be set as desired.
- For example, an M×M-dimensional unit matrix may be set as the initial matrix Σ_{n,f,0}^{-1} of the inverse noise covariance matrix Σ_{n,f,t−1}^{-1} .
- Note that the superscript "−1" of "Σ_{n,f,t}^{-1}" should be written directly above the subscript "n, f, t", but due to notation limitations has been written to the upper left of "n, f, t".
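- The Woodbury update of equation (22) can be computed without any matrix inversion; the sketch below transcribes it directly (names are illustrative, and P denotes Σ_{n,f,t−1}^{-1}).

```python
import numpy as np

def update_inverse_noise_covariance(P_prev, x, gamma_n, alpha=0.99):
    """Step S 532 d: rank-1 Woodbury update of equation (22).

    P_prev  : (M, M) inverse noise covariance of the past time interval.
    x       : (M,) frequency-divided observation signal.
    gamma_n : scalar mask information (noise occupancy at this time-frequency point).
    """
    Px = P_prev @ x                                   # Σ^{-1}_{n,f,t-1} x_{f,t}
    denom = alpha + gamma_n * np.real(x.conj() @ Px)  # x^H Σ^{-1} x is real
    # (P x)(P x)^H equals Σ^{-1} x x^H Σ^{-1} for Hermitian Σ^{-1}.
    return (P_prev - (gamma_n / denom) * np.outer(Px, Px.conj())) / alpha
```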
- The spatial covariance matrix Σ_{x,f,t} acquired by the observation signal covariance matrix updating unit 532 a and the inverse noise covariance matrix Σ_{n,f,t}^{-1} acquired by the inverse noise covariance matrix updating unit 532 d are input into the main component vector updating unit 532 b .
- The main component vector updating unit 532 b acquires and outputs a main component vector ṽ_{f,t} (a main component vector of the first time interval) relating to Σ_{n,f,t}^{-1} Σ_{x,f,t} (the product of an inverse matrix of the noise covariance matrix of the frequency-divided observation signal and the spatial covariance matrix of the frequency-divided observation signal belonging to the first time interval) by using a power method, on the basis of the inverse noise covariance matrix Σ_{n,f,t}^{-1} (the inverse matrix of the noise covariance matrix of the frequency-divided observation signal), the spatial covariance matrix Σ_{x,f,t} (the spatial covariance matrix of the frequency-divided observation signal belonging to the first time interval), and a main component vector ṽ_{f,t−1} (a main component vector of the second time interval).
- For example, the main component vector updating unit 532 b acquires and outputs a main component vector ṽ_{f,t} based on Σ_{n,f,t}^{-1} Σ_{x,f,t} ṽ_{f,t−1} .
- For example, the main component vector updating unit 532 b acquires and outputs the main component vector ṽ_{f,t} in accordance with equations (23) and (24), shown below. Note that the "~" of "ṽ_{f,t}" should be written directly above the "v", but due to notation limitations may also be written to the upper right of "v".
- ṽ′_{f,t} = Σ_{n,f,t}^{-1} Σ_{x,f,t} ṽ_{f,t−1} (23)
- ṽ_{f,t} = ṽ′_{f,t} / ṽ′_{f,t}^{ref} (24)
- ṽ′_{f,t}^{ref} expresses the element corresponding to a predetermined microphone (a reference microphone ref) serving as a reference, among the M elements of the vector ṽ′_{f,t} acquired from equation (23).
- <<Processing of Noise Covariance Matrix Updating Unit 532 e (Step S 532 e )>>
- Using the frequency-divided observation signal x_{f,t} (the frequency-divided observation signal belonging to the first time interval) and the mask information γ_{f,t}^{(n)} (the mask information of the first time interval) as input, the noise covariance matrix updating unit 532 e acquires and outputs a noise covariance matrix Σ_{n,f,t} of the frequency-divided observation signal x_{f,t} (a noise covariance matrix of the frequency-divided observation signal belonging to the first time interval), which is based on the frequency-divided observation signal x_{f,t}, the mask information γ_{f,t}^{(n)}, and a noise covariance matrix Σ_{n,f,t−1} (a noise covariance matrix of the frequency-divided observation signal belonging to the second time interval that is further in the past than the first time interval).
- For example, the noise covariance matrix updating unit 532 e acquires and outputs the linear sum of a product γ_{f,t}^{(n)} x_{f,t} x_{f,t}^H of the covariance matrix x_{f,t} x_{f,t}^H of the frequency-divided observation signal x_{f,t} and the mask information γ_{f,t}^{(n)}, and the noise covariance matrix Σ_{n,f,t−1} (the noise covariance matrix of the frequency-divided observation signal belonging to the second time interval that is further in the past than the first time interval), as the noise covariance matrix Σ_{n,f,t} of the frequency-divided observation signal x_{f,t} .
- For example, the noise covariance matrix updating unit 532 e acquires and outputs the noise covariance matrix Σ_{n,f,t} in accordance with equation (25), shown below.
- Σ_{n,f,t} = α Σ_{n,f,t−1} + γ_{f,t}^{(n)} x_{f,t} x_{f,t}^H (25)
- α is an oblivion coefficient, and is a real number belonging to a range of 0 < α < 1, for example.
- <<Processing of Steering Vector Updating Unit 532 c (Step S 532 c )>>
- Using the main component vector ṽ_{f,t} (the main component vector of the first time interval) acquired by the main component vector updating unit 532 b and the noise covariance matrix Σ_{n,f,t} (the noise covariance matrix of the frequency-divided observation signal) acquired by the noise covariance matrix updating unit 532 e as input, the steering vector updating unit 532 c acquires and outputs an estimated steering vector v_{f,t} (an estimated steering vector of the first time interval) on the basis thereof.
- For example, the steering vector updating unit 532 c acquires and outputs an estimated steering vector v_{f,t} based on Σ_{n,f,t} ṽ_{f,t} .
- For example, the steering vector updating unit 532 c acquires and outputs the estimated steering vector v_{f,t} in accordance with equations (26) and (27), shown below.
- v′_{f,t} = Σ_{n,f,t} ṽ_{f,t} (26)
- v_{f,t} = v′_{f,t} / v′_{f,t}^{ref} (27)
- v′_{f,t}^{ref} expresses the element corresponding to the reference microphone ref, among the M elements of the vector v′_{f,t} acquired from equation (26).
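- Steps S 532 b and S 532 c together perform one power-method iteration followed by two reference-microphone normalizations; a minimal numpy sketch of equations (23) to (27) (names are illustrative) follows.

```python
import numpy as np

def update_steering_vector(P_n, Sigma_x, v_tilde_prev, Sigma_n, ref=0):
    """Steps S 532 b and S 532 c: equations (23)-(27) for one band.

    P_n          : (M, M) inverse noise covariance Σ^{-1}_{n,f,t}.
    Sigma_x      : (M, M) spatial covariance Σ_{x,f,t}.
    v_tilde_prev : (M,) main component vector of the past interval.
    Sigma_n      : (M, M) noise covariance Σ_{n,f,t}.
    ref          : index of the reference microphone.
    Returns (v_tilde, v): main component vector and estimated steering vector.
    """
    # One power-method step (equation (23)), then normalization (equation (24)).
    v_prime = P_n @ (Sigma_x @ v_tilde_prev)
    v_tilde = v_prime / v_prime[ref]
    # Map the main component through the noise covariance (equations (26)-(27)).
    u = Sigma_n @ v_tilde
    return v_tilde, u / u[ref]
```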
- The estimated steering vector v_{f,t} acquired by the steering vector estimation unit 532 is input into the convolutional beamformer estimation unit 212 .
- The convolutional beamformer estimation unit 212 treats the estimated steering vector v_{f,t} as v_{f,0}, and performs the processing of step S 212 , described in the second embodiment, using the estimated steering vector v_{f,t} and the weighted space-time covariance matrix R_f acquired in step S 211 . All other processing is as described in the first and second embodiments.
- As the σ_{f,t}² input into the matrix estimation unit 211 , either the provisional power generated as illustrated in equation (17) or the estimated power σ_{f,t}² generated as described in the third embodiment, for example, may be used.
- The inverse noise covariance matrix updating unit 532 d adaptively updates the inverse noise covariance matrix Σ_{n,f,t}^{-1} at each time point corresponding to the time frame number t by using the frequency-divided observation signal x_{f,t} and the mask information γ_{f,t}^{(n)}.
- However, the inverse noise covariance matrix updating unit 532 d may acquire and output the inverse noise covariance matrix Σ_{n,f,t}^{-1} by using a frequency-divided observation signal x_{f,t} of a time interval in which the noise component either exists alone or is dominant, without using the mask information γ_{f,t}^{(n)}.
- For example, the inverse noise covariance matrix updating unit 532 d may output, as the inverse noise covariance matrix Σ_{n,f,t}^{-1}, an inverse matrix of the temporal average of x_{f,t} x_{f,t}^H with respect to a frequency-divided observation signal x_{f,t} of a time interval in which the noise component either exists alone or is dominant.
- In this case, the inverse noise covariance matrix Σ_{n,f,t}^{-1} acquired in this manner is used continuously in the frames having the respective time frame numbers t.
- Similarly, the noise covariance matrix updating unit 532 e may acquire and output the noise covariance matrix Σ_{n,f,t} of the frequency-divided observation signal x_{f,t} by using a frequency-divided observation signal x_{f,t} of a time interval in which the noise component either exists alone or is dominant, without using the mask information γ_{f,t}^{(n)}.
- For example, the noise covariance matrix updating unit 532 e may output, as the noise covariance matrix Σ_{n,f,t}, the temporal average of x_{f,t} x_{f,t}^H with respect to a frequency-divided observation signal x_{f,t} of a time interval in which the noise component either exists alone or is dominant.
- In this case, the noise covariance matrix Σ_{n,f,t} acquired in this manner is used continuously in the frames having the respective time frame numbers t.
- An example was described above in which the first time interval is the frame having the time frame number t and the second time interval is the frame having the time frame number t−1, but the present invention is not limited thereto.
- a frame having a time frame number other than the time frame number t may be set as the first time interval, and a time frame that is further in the past than the first time interval and has a time frame number other than the time frame number t ⁇ 1 may be set as the second time interval.
- In the fifth embodiment, the steering vector estimation unit 532 acquires and outputs the estimated steering vector v_{f,t} by successive processing using the frequency-divided observation signal x_{f,t} as input. As noted in the fourth embodiment, however, by estimating the steering vector after suppressing reverberation from the frequency-divided observation signal x_{f,t}, the estimation precision is improved.
- In the sixth embodiment, therefore, a case will be described in which the steering vector estimation unit acquires and outputs the estimated steering vector v_{f,t} by successive processing, as described in the fifth embodiment, after reverberation has been suppressed from the frequency-divided observation signal x_{f,t} .
- a signal processing device 6 includes the estimation unit 21 , the suppression unit 12 , and a parameter estimation unit 63 .
- the parameter estimation unit 63 includes the reverberation suppression unit 431 and a steering vector estimation unit 632 .
- the sixth embodiment differs from the fifth embodiment in that before generating the estimated steering vector, the reverberation component of the frequency-divided observation signal x f, t is suppressed.
- the reverberation suppression unit 431 acquires and outputs the frequency-divided reverberation-suppressed signal u f, t in which the reverberation component of the frequency-divided observation signal x f, t has been suppressed (preferably, in which the reverberation component of the frequency-divided observation signal x f, t has been removed).
- the frequency-divided reverberation-suppressed signal u f, t is input into the steering vector estimation unit 632 .
- The processing of the steering vector estimation unit 632 is identical to the processing of the steering vector estimation unit 532 of the fifth embodiment, except that the frequency-divided reverberation-suppressed signal u_{f,t} is input in place of the frequency-divided observation signal x_{f,t} .
- That is, the frequency-divided observation signal x_{f,t} used in the processing of the steering vector estimation unit 532 is replaced by the frequency-divided reverberation-suppressed signal u_{f,t} .
- All other processing is identical to the fifth embodiment and the modified example thereof. More specifically, the frequency-divided reverberation-suppressed signal u f, t , which is a frequency-divided time series signal, is input into the steering vector estimation unit 632 .
- In other words, the observation signal covariance matrix updating unit 532 a acquires and outputs the spatial covariance matrix Σ_{x,f,t} of the frequency-divided reverberation-suppressed signal u_{f,t} belonging to the first time interval, which is based on the frequency-divided reverberation-suppressed signal u_{f,t} belonging to the first time interval and the spatial covariance matrix Σ_{x,f,t−1} of a frequency-divided reverberation-suppressed signal u_{f,t−1} belonging to the second time interval that is further in the past than the first time interval.
- The main component vector updating unit 532 b acquires and outputs the main component vector ṽ_{f,t} of the first time interval with respect to the product Σ_{n,f,t}^{-1} Σ_{x,f,t} of the inverse matrix Σ_{n,f,t}^{-1} of the noise covariance matrix of the frequency-divided reverberation-suppressed signal and the spatial covariance matrix Σ_{x,f,t} of the frequency-divided reverberation-suppressed signal belonging to the first time interval, on the basis of the inverse matrix Σ_{n,f,t}^{-1} of the noise covariance matrix of the frequency-divided reverberation-suppressed signal u_{f,t}, the spatial covariance matrix Σ_{x,f,t} of the frequency-divided reverberation-suppressed signal belonging to the first time interval, and the main component vector ṽ_{f,t−1} of the second time interval.
- The steering vector updating unit 532 c acquires and outputs the estimated steering vector v_{f,t} of the first time interval on the basis of the noise covariance matrix of the frequency-divided reverberation-suppressed signal u_{f,t} and the main component vector ṽ_{f,t} of the first time interval.
- a method of estimating the convolutional beamformer by successive processing will be described.
- the convolutional beamformer of each time frame number t can be estimated and the estimation signal of the target signal y f, t can be acquired from frequency-divided observation signals x f, t input successively online, for example.
- a signal processing device 7 includes an estimation unit 71 , a suppression unit 72 , and the parameter estimation unit 53 .
- the frequency-divided observation signal x f, t is input into the parameter estimation unit 53 ( FIGS. 6 and 7 ).
- The steering vector estimation unit 532 ( FIG. 8 ) of the parameter estimation unit 53 acquires and outputs the estimated steering vector v_{f,t} by successive processing using the frequency-divided observation signal x_{f,t} as input (step S 532 ).
- The estimated steering vector v_{f,t} is represented by the following M-dimensional vector:
- v_{f,t} = [v_{f,t}^{(1)}, v_{f,t}^{(2)}, …, v_{f,t}^{(M)}]^T
- v_{f,t}^{(m)} represents the element corresponding to the microphone having the microphone number m, among the M elements of the estimated steering vector v_{f,t} .
- The estimated steering vector v_{f,t} acquired by the steering vector estimation unit 532 is input into the convolutional beamformer estimation unit 712 .
- Further, the frequency-divided observation signal x_{f,t} and the power or estimated power σ_{f,t}² of the target signal are input into the matrix estimation unit 711 ( FIG. 6 ).
- As the σ_{f,t}² input into the matrix estimation unit 711 , either the provisional power generated as illustrated in equation (17) or the estimated power σ_{f,t}² generated as described in the third embodiment, for example, may be used.
- On the basis of the stacked frequency-divided observation signal x̄_{f,t} (belonging to the first time interval), the power or estimated power σ_{f,t}² of the target signal (the power or estimated power of the frequency-divided observation signal belonging to the first time interval), and an inverse matrix R̄_{f,t−1}^{-1} of the space-time covariance matrix of the second time interval,
- the matrix estimation unit 711 estimates and outputs an inverse matrix
- R̄_{f,t}^{-1} of the space-time covariance matrix (an inverse matrix of the space-time covariance matrix of the first time interval).
- An example of the update of the space-time covariance matrix is as follows:
- k_{f,t} = R̄_{f,t−1}^{-1} x̄_{f,t} / (α σ_{f,t}² + x̄_{f,t}^H R̄_{f,t−1}^{-1} x̄_{f,t}) (28)
- R̄_{f,t}^{-1} = (1/α) ( R̄_{f,t−1}^{-1} − k_{f,t} x̄_{f,t}^H R̄_{f,t−1}^{-1} ) (29)
- k_{f,t} in equation (28) is an (L+1)M-dimensional vector,
- and the inverse matrix of equation (29) is an (L+1)M×(L+1)M matrix.
- α is an oblivion coefficient, and is a real number belonging to a range of 0 < α < 1, for example.
- An initial matrix R̄_{f,0}^{-1} of the inverse matrix of the space-time covariance matrix may be set as desired; an example of the initial matrix is an (L+1)M×(L+1)M unit matrix.
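- Equations (28) and (29) form a recursive least-squares style update; the sketch below transcribes them directly (equation (28) as reconstructed above, which is the Sherman-Morrison gain consistent with equation (29); names are illustrative).

```python
import numpy as np

def update_inverse_spacetime_covariance(P_prev, xbar, sigma2, alpha=0.99):
    """Matrix estimation unit 711: equations (28) and (29), one time frame.

    P_prev : ((L+1)M, (L+1)M) inverse space-time covariance of the past interval.
    xbar   : ((L+1)M,) stacked observation vector x̄_{f,t}.
    sigma2 : scalar power (or estimated power) of the target signal.
    Returns (k, P): the gain vector k_{f,t} and the updated inverse matrix.
    """
    Px = P_prev @ xbar
    # Gain vector of equation (28).
    k = Px / (alpha * sigma2 + np.real(xbar.conj() @ Px))
    # Recursive inverse update of equation (29); for Hermitian P_prev,
    # x̄^H P_prev equals (P_prev x̄)^H.
    P = (P_prev - np.outer(k, Px.conj())) / alpha
    return k, P
```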
- On the basis of the inverse matrix R̄_{f,t}^{-1} of the space-time covariance matrix and the estimated steering vector v_{f,t}, the convolutional beamformer estimation unit 712 acquires and outputs the convolutional beamformer w̄_{f,t} (the convolutional beamformer of the first time interval).
- For example, the convolutional beamformer estimation unit 712 acquires and outputs the convolutional beamformer w̄_{f,t} in accordance with equation (30), shown below:
- w̄_{f,t} = R̄_{f,t}^{-1} v̄_{f,t} / (v̄_{f,t}^H R̄_{f,t}^{-1} v̄_{f,t}) (30)
- where v̄_{f,t} = [v̄_{f,t}^{(1)T}, v̄_{f,t}^{(2)T}, …, v̄_{f,t}^{(M)T}]^T and each v̄_{f,t}^{(m)} = [g_f v_{f,t}^{(m)}, 0, …, 0]^T is an (L+1)-dimensional vector.
- g_f is a scalar constant other than 0.
- The frequency-divided observation signal x_{f,t} and the convolutional beamformer w̄_{f,t} acquired by the convolutional beamformer estimation unit 712 are input into the suppression unit 72 .
- The suppression unit 72 acquires and outputs the estimation signal y_{f,t} of the target signal by applying the convolutional beamformer w̄_{f,t} to the frequency-divided observation signal x_{f,t} in each time frame number t and frequency band number f.
- For example, the suppression unit 72 acquires and outputs the estimation signal y_{f,t} of the target signal in accordance with equation (31), shown below.
- y_{f,t} = w̄_{f,t}^H x̄_{f,t} (31)
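- One online step of units 712 and 72 can then be sketched as below; equation (30) is used as reconstructed above (by analogy with equation (15)), the per-delay layout of v̄_{f,t} replaces the per-microphone layout (equivalent up to a permutation), and names are illustrative.

```python
import numpy as np

def online_convolutional_beamformer_step(P, v, xbar, g=1.0, L=8):
    """Units 712 and 72: equations (30) and (31) for one time frame.

    P    : ((L+1)M, (L+1)M) inverse space-time covariance R̄^{-1}_{f,t}.
    v    : (M,) estimated steering vector v_{f,t}.
    xbar : ((L+1)M,) stacked observation vector x̄_{f,t}.
    g    : nonzero scalar constant g_f.
    Returns (wbar, y): the convolutional beamformer w̄_{f,t} and output y_{f,t}.
    """
    M = v.shape[0]
    # v̄_{f,t}: g_f v_{f,t} at the current taps, zeros at the delayed taps.
    vbar = np.zeros((L + 1) * M, dtype=complex)
    vbar[:M] = g * v
    Pv = P @ vbar
    # Equation (30): w̄ = R̄^{-1} v̄ / (v̄^H R̄^{-1} v̄).
    wbar = Pv / (vbar.conj() @ Pv)
    # Equation (31): y_{f,t} = w̄^H x̄_{f,t}.
    return wbar, wbar.conj() @ xbar
```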
- the parameter estimation unit 53 of the signal processing device 7 according to the seventh embodiment may be replaced by the parameter estimation unit 63 .
- That is, the parameter estimation unit 63 , rather than the parameter estimation unit 53 , may acquire and output the estimated steering vector v_{f,t} by successive processing, as described in the sixth embodiment, using the frequency-divided observation signal x_{f,t} as input.
- Also in the seventh embodiment, an example was described in which the first time interval is the frame having the time frame number t and the second time interval is the frame having the time frame number t−1, but the present invention is not limited thereto.
- a frame having a time frame number other than the time frame number t may be set as the first time interval, and a time frame that is further in the past than the first time interval and has a time frame number other than the time frame number t ⁇ 1 may be set as the second time interval.
- Equation (32) shows an example of the block matrix B f .
- ⁇ ⁇ f, 0 is an M ⁇ 1-dimensional column vector constituted by elements of the steering vector ⁇ f, 0 or the estimated steering vector ⁇ f, 0 that correspond to microphones other than the reference microphone ref
- ⁇ f, 0 ref is the element of ⁇ f, 0 that corresponds to the reference microphone ref
- I M-1 is an (M ⁇ 1) ⁇ (M ⁇ 1)-dimensional unit matrix.
- g f is set as a scalar constant other than 0, a f, 0 is set as an M-dimensional modified instantaneous beamformer, and the instantaneous beamformer w f, 0 is expressed as the sum of a constant multiple g f ⁇ f, 0 of the steering vector ⁇ f, 0 or a constant multiple g f ⁇ f, 0 of the estimated steering vector ⁇ f, 0 and a product B f a f, 0 of the block matrix B f corresponding to the orthogonal complement of the steering vector ⁇ f, 0 or the estimated steering vector ⁇ f, 0 and the modified instantaneous beamformer a f, 0 .
- Equation (33) the constraint condition that “w f, 0 H ⁇ f, 0 is a constant” is satisfied in relation to any modified instantaneous beamformer a f, 0 . It is therefore evident that the instantaneous beamformer w f, 0 may be defined as illustrated in equation (33).
- the convolutional beamformer is estimated using the optimal solution of the convolutional beamformer acquired when the instantaneous beamformer w f, 0 is defined as illustrated in equation (33). This will be described in detail below.
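For concreteness, the following is a minimal sketch of one standard construction of a block matrix B f satisfying B f H νf, 0 = 0 (the orthogonal-complement property described above). Equation (32) itself is not reproduced here, so this particular form (identity rows on the non-reference microphones) is an assumption, and the function name is illustrative.

```python
import numpy as np

def blocking_matrix(v, ref=0):
    """Construct a block matrix B with B^H v = 0 for steering vector v.

    v   : (M,) steering vector (or estimated steering vector) of one band
    ref : index of the reference microphone
    """
    M = len(v)
    others = [m for m in range(M) if m != ref]
    v_tilde = v[others]                        # elements of non-reference mics
    B = np.zeros((M, M - 1), dtype=complex)
    B[ref, :] = -(v_tilde / v[ref]).conj()     # makes each column orthogonal to v
    B[others, :] = np.eye(M - 1)
    return B

# quick check: the columns of B lie in the orthogonal complement of v
v = np.random.randn(4) + 1j * np.random.randn(4)
B = blocking_matrix(v)
assert np.allclose(B.conj().T @ v, 0)
```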
- a signal processing device 8 includes an estimation unit 81 , a suppression unit 82 , and a parameter estimation unit 83 .
- the estimation unit 81 includes a matrix estimation unit 811 , a convolutional beamformer estimation unit 812 , an initial beamformer application unit 813 , and a block unit 814 .
- the parameter estimation unit 83 ( FIG. 9 ), using the frequency-divided observation signal x f, t as input, acquires the estimated steering vector by an identical method to any of the parameter estimation units 33 , 43 , 53 , 63 described above, and outputs the acquired estimated steering vector as νf, 0 .
- the output estimated steering vector νf, 0 is transmitted to the initial beamformer application unit 813 and the block unit 814 .
- the estimated steering vector νf, 0 and the frequency-divided observation signal x f, t are input into the initial beamformer application unit 813 .
- the initial beamformer application unit 813 acquires and outputs an initial beamformer output z f, t (an initial beamformer output of the first time interval) based on the estimated steering vector νf, 0 and the frequency-divided observation signal x f, t (the frequency-divided observation signal belonging to the first time interval).
- For example, the initial beamformer application unit 813 acquires and outputs an initial beamformer output z f, t based on a constant multiple of the estimated steering vector νf, 0 and the frequency-divided observation signal x f, t .
- the initial beamformer application unit 813 acquires and outputs the initial beamformer output z f, t in accordance with equation (34) shown below, for example.
- z f,t = (g f νf,0)H x f,t (34)
- the output initial beamformer output z f, t is transmitted to the convolutional beamformer estimation unit 812 and the suppression unit 82 .
- the estimated steering vector νf, 0 and the frequency-divided observation signal x f, t are input into the block unit 814 .
- Note that when L = 0, the right side of equation (35) becomes a vector in which the number of elements is 0 (an empty vector), whereby equation (36) is as shown below in equation (36A).
- x= f,t = B f H x f,t (36A)
- the vector x= f, t acquired by the block unit 814 and the power or estimated power σ f, t 2 of the target signal are input into the matrix estimation unit 811 .
- w= f = −R= f −1 Σt (x= f, t z f, t H/σ f, t 2) (38)
- w= f = [a f, 0 T, w f (1)T, w f (2)T, …, w f (M)T]T (38A)
- w f (m) = [w f, d (m), w f, d+1 (m), …, w f, d+L−1 (m)]T (38B)
- Note that when L = 0, equation (38B) becomes a vector in which the number of elements is 0 (an empty vector), whereby equation (38A) is as shown below.
- w= f = a f,0
- This processing is equivalent to processing for acquiring and outputting the estimation signal of the target signal y f, t by applying the convolutional beamformer w− f to the frequency-divided observation signal x f, t .
- the suppression unit 82 acquires and outputs the estimation signal of the target signal y f, t in accordance with equation (39) shown below.
- y f,t = z f,t + w= f H x= f,t (39)
- a known steering vector νf, 0 acquired on the basis of actual measurement or the like may be input into the initial beamformer application unit 813 and the block unit 814 instead of the estimated steering vector νf, 0 acquired by the parameter estimation unit 83 .
- In this case, the initial beamformer application unit 813 and the block unit 814 perform steps S813 and S814, described above, using the steering vector νf, 0 instead of the estimated steering vector νf, 0 .
- a method for executing convolutional beamformer estimation based on the eighth embodiment by successive processing will be described.
- a signal processing device 9 includes an estimation unit 91 , a suppression unit 92 , and a parameter estimation unit 93 .
- the estimation unit 91 includes an adaptive gain estimation unit 911 , a convolutional beamformer estimation unit 912 , a matrix estimation unit 915 , the initial beamformer application unit 813 , and the block unit 814 .
- the parameter estimation unit 93 ( FIG. 10 ), using the frequency-divided observation signal x f, t as input, acquires and outputs the estimated steering vector νf, t by an identical method to either of the parameter estimation units 53 , 63 described above.
- the output estimated steering vector νf, t is transmitted to the initial beamformer application unit 813 and the block unit 814 .
- the estimated steering vector νf, t (the estimated steering vector of the first time interval) and the frequency-divided observation signal x f, t (the frequency-divided observation signal belonging to the first time interval) are input into the initial beamformer application unit 813 , and the initial beamformer application unit 813 acquires and outputs the initial beamformer output z f, t (the initial beamformer output of the first time interval) as described in the eighth embodiment using νf, t instead of νf, 0 .
- the output initial beamformer output z f, t is transmitted to the suppression unit 92 .
- the suppression unit 92 acquires and outputs the estimation signal of the target signal y f, t in accordance with equation (40) below.
- y f,t = z f,t + w= f,t-1 H x= f,t (40)
- As the power or estimated power σ f, t 2 input into the matrix estimation unit 711 , either the provisional power generated as illustrated in equation (17) or the estimated power σ f, t 2 generated as described in the third embodiment, for example, may be used.
- the “ ⁇ ” of “R ⁇ 1 f, t-1 ” should be written directly above the “R”, but due to notation limitations may also be written to the upper right of “R”.
- the adaptive gain estimation unit 911 acquires and outputs an adaptive gain k f, t (the adaptive gain of the first time interval) that is based on the inverse matrix R ⁇ 1 f, t-1 of the weighted modified space-time covariance matrix (the inverse matrix of the weighted modified space-time covariance matrix of the second time interval), the estimated steering vector ⁇ f, t (the estimated steering vector of the first time interval), the frequency-divided observation signal x f, t , and the power or estimated power ⁇ f, t 2 of the target signal.
- the adaptive gain estimation unit 911 acquires and outputs the adaptive gain k f, t as an (LM+M ⁇ 1)-dimensional vector in accordance with equation (41) shown below.
- an initial matrix of the inverse matrix R ⁇ 1 f, t-1 of the weighted modified space-time covariance matrix may be any (LM+M ⁇ 1) ⁇ (LM+M ⁇ 1)-dimensional matrix.
- An example of the initial matrix of the inverse matrix R ⁇ 1 f, t-1 of the weighted modified space-time covariance matrix is an (LM+M ⁇ 1)-dimensional unit matrix.
- the matrix estimation unit 915 acquires and outputs an inverse matrix R ⁇ 1 f, t of the weighted modified space-time covariance matrix (the inverse matrix of the weighted modified space-time covariance matrix of the first time interval) that is based on the adaptive gain k f, t (the adaptive gain of the first time interval), the estimated steering vector ⁇ f, t (the estimated steering vector of the first time interval), the frequency-divided observation signal x f, t , and the inverse matrix R ⁇ 1 f, t-1 , of the weighted modified space-time covariance matrix (the inverse matrix of the weighted modified space-time covariance matrix of the second time interval).
- the matrix estimation unit 915 acquires and outputs the inverse matrix R ⁇ 1 f,
- R ⁇ f , t - 1 1 a ⁇ ( R ⁇ f , t - 1 - 1 - k f , t ⁇ x _ _ f , t H ⁇ R ⁇ f , t - 1 - 1 ) ( 42 )
- the output inverse matrix R ⁇ 1 f, t of the weighted modified space-time covariance matrix is transmitted to the adaptive gain estimation unit 911 .
- Step S 912 ⁇ Processing of Convolutional Beamformer Estimation Unit 912 (Step S 912 )>
- the estimation signal of the target signal y f, t output from the suppression unit 92 and the adaptive gain k f, t output from the adaptive gain estimation unit 911 are input into the convolutional beamformer estimation unit 912 .
- w= f,t = w= f,t-1 − k f,t y f,t H (43)
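Putting steps S813, S814, S92, S911, S915, and S912 together, the following is a minimal per-frame sketch of this successive processing. The precise forms of equations (41) to (43) are assumed here (a standard RLS recursion, with the gain of equation (41) analogous to equation (28)), and all names and shapes are illustrative.

```python
import numpy as np

def adaptive_step(w, R_inv, x, X_past, v, B, g, sigma2, alpha):
    """One frame of the successive convolutional beamforming, assuming
    standard RLS forms for equations (41)-(43).

    w      : (L*M + M - 1,) convolutional beamformer of the previous frame
    R_inv  : inverse weighted modified space-time covariance, previous frame
    x      : (M,) current frequency-divided observation x_{f,t}
    X_past : (L, M) past frames x_{f,t-d}, ..., x_{f,t-d-L+1}
    v      : (M,) estimated steering vector; B : block matrix with B^H v = 0
    g      : nonzero scalar constant; sigma2 : target-signal power estimate
    alpha  : oblivion coefficient, 0 < alpha < 1
    """
    z = (g * v).conj() @ x                           # initial beamformer output (34)
    x_mod = np.concatenate([B.conj().T @ x,          # blocked current frame
                            X_past.reshape(-1)])     # stacked past frames
    y = z + w.conj() @ x_mod                         # estimation signal (40)
    Rx = R_inv @ x_mod
    k = Rx / (alpha * sigma2 + x_mod.conj() @ Rx)    # adaptive gain, cf. (41)
    R_inv = (R_inv - np.outer(k, x_mod.conj() @ R_inv)) / alpha   # (42)
    w = w - k * np.conj(y)                           # beamformer update (43)
    return y, w, R_inv
```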
- In the above, an example in which the first time interval is the frame having the time frame number t and the second time interval is the frame having the time frame number t−1 was used, but the present invention is not limited thereto.
- a frame having a time frame number other than the time frame number t may be set as the first time interval, and a time frame that is further in the past than the first time interval and has a time frame number other than the time frame number t−1 may be set as the second time interval.
- a known steering vector νf, t may be input into the initial beamformer application unit 813 and the block unit 814 instead of the estimated steering vector νf, t acquired by the parameter estimation unit 93 .
- In this case, the initial beamformer application unit 813 and the block unit 814 perform steps S813 and S814, described above, using the steering vector νf, t instead of the estimated steering vector νf, t .
- the frequency-divided observation signals x f, t input into the signal processing devices 1 to 9 described above may be any signals that correspond respectively to a plurality of frequency bands of an observation signal acquired by picking up an acoustic signal emitted from a sound source.
- a time-domain observation signal x(i) = [x(i) (1) , x(i) (2) , …, x(i) (M) ]T (where i is an index expressing a discrete time) acquired by picking up an acoustic signal emitted from a sound source in M microphones may be input into a dividing unit 1051 , and the dividing unit 1051 may transform the observation signal x(i) into frequency-divided observation signals x f, t in the frequency domain and input the frequency-divided observation signals x f, t into the signal processing devices 1 to 9 .
- As the transformation method from the time domain to the frequency domain, the discrete Fourier transform or the like, for example, may be used.
- frequency-divided observation signals x f, t acquired by another processing unit may be input into the signal processing devices 1 to 9 .
- the time-domain observation signal x(i) described above may be transformed into frequency-domain signals in each time frame, the frequency-domain signals may be processed by another processing unit, and the frequency-divided observation signals x f, t acquired as a result may be input into the signal processing devices 1 to 9 .
- the estimation signals of the target signal y f, t output from the signal processing devices 1 to 9 may either be used in other processing (speech recognition processing or the like) without being transformed into a time-domain signal y(i), or be transformed into a time-domain signal y(i).
- In the former case, the estimation signals of the target signal y f, t output from the signal processing devices 1 to 9 are output as is and used in other processing.
- In the latter case, the estimation signals of the target signal y f, t output from the signal processing devices 1 to 9 may be input into an integration unit 1052 , and the integration unit 1052 may acquire and output the time-domain signal y(i) by integrating the estimation signals of the target signal y f, t .
- As the method for acquiring the time-domain signal y(i) from the estimation signals of the target signal y f, t , the inverse Fourier transform or the like, for example, may be used.
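As one concrete realization of the dividing unit 1051 and the integration unit 1052, the following sketch uses the short-time Fourier transform from SciPy; the sampling rate, window length, and overlap are illustrative assumptions, not values taken from this document.

```python
import numpy as np
from scipy.signal import stft, istft

fs = 16000                                   # illustrative sampling rate
M = 2                                        # number of microphones
x = np.random.randn(M, 4 * fs)               # stand-in observation signal x(i)

# dividing unit 1051: time domain -> frequency-divided signals x_{f,t}
f, t, X = stft(x, fs=fs, nperseg=512, noverlap=384)  # X has shape (M, F, T)
X = X.transpose(1, 2, 0)                             # -> (F, T, M), i.e. x_{f,t}

# ... the signal processing devices 1 to 9 would operate on X here; as a
# placeholder, the first microphone channel is taken as the estimate y_{f,t}
Y = X[..., 0]

# integration unit 1052: frequency-divided estimates -> time-domain y(i)
_, y = istft(Y, fs=fs, nperseg=512, noverlap=384)
```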
- noise/reverberation suppression results acquired by the first embodiment and conventional methods 1 to 3 will be illustrated.
- FIG. 12 shows evaluation results acquired in relation to the speech quality of the observation signal and the signals subjected to noise/reverberation suppression in accordance with the present invention and conventional methods 1 to 3.
- Sim denotes the Sim Data
- Real denotes the Real Data.
- CD denotes cepstrum distortion
- SRMR denotes the signal-to-reverberation modulation ratio
- LLR denotes the log-likelihood ratio
- FWSSNR denotes the frequency-weighted segmental signal-to-noise ratio.
- CD and LLR indicate better speech quality as the values thereof decrease, while SRMR and FWSSNR indicate better speech quality as the values thereof increase.
- the underlined values are optimal values. As illustrated in FIG. 12 , it is evident that according to the present invention, noise and reverberation can be suppressed more adequately than with conventional methods 1 to 3.
- FIG. 13 shows a word error rate in the speech recognition results acquired in relation to the observation signal and the signals subjected to noise/reverberation suppression in accordance with the present invention and conventional methods 1 to 3.
- the word error rate indicates better speech recognition precision as the value thereof decreases.
- the underlined values are optimal values. “R1N” denotes a case in which the speaker is positioned close to the microphones in room 1, while “R1F” denotes a case in which the speaker is positioned far away from the microphones in room 1.
- R2N and R3N respectively denote cases in which the speaker is positioned close to the microphones in rooms 2 and 3
- R2F and R3F respectively denote cases in which the speaker is positioned far away from the microphones in rooms 2 and 3.
- Ave denotes an average value. As illustrated in FIG. 13 , it is evident that according to the present invention, noise and reverberation can be suppressed more adequately than with conventional methods 1 to 3.
- FIG. 14 shows noise/reverberation suppression results acquired in a case where the steering vector was estimated without suppressing the reverberation of the frequency-divided observation signal x f, t (without reverberation suppression) and a case where the steering vector was estimated after suppressing the reverberation of the frequency-divided observation signal x f, t (with reverberation suppression), as described in the fourth embodiment.
- WER expresses the word error rate when speech recognition was performed using the target signal acquired by implementing noise/reverberation suppression. As the value of WER decreases, a better performance is achieved. As illustrated in FIG. 14 , it is evident that the speech quality of the target signal is better with reverberation suppression than without reverberation suppression.
- FIGS. 15 A, 15 B, and 15 C show noise/reverberation suppression results acquired in a case where convolutional beamformer estimation was executed by successive processing, as described in the seventh and ninth embodiments.
- L corresponds to 64 [msec].
- Adaptive NCM indicates results acquired when the estimated steering vector νf, t generated by the method of the fifth embodiment was used.
- PreFixed NCM indicates results acquired when the estimated steering vector νf, t generated by the method of modified example 1 of the fifth embodiment was used.
- observation signal indicates results acquired when no noise/reverberation suppression was implemented. Thus, it is evident that the speech quality of the target signal is improved by the noise/reverberation suppression of the seventh and ninth embodiments.
- d is set at the same value in all of the frequency bands, but d may be set for each frequency band.
- a positive integer d f may be used instead of d.
- L is set at the same value in all of the frequency bands, but L may be set for each frequency band. In other words, a positive integer L f may be used instead of L.
- a time frame corresponding to 1≤t≤tc may be set as the processing unit, or a time frame corresponding to tc−Δ≤t≤tc may be set as the processing unit in relation to a positive integer constant Δ.
- The various types of processing described above do not have to be executed in time series in the order described, and may be executed in parallel or individually either in accordance with the processing power of the device that executes the processing or in accordance with necessity. Furthermore, the processing may be modified appropriately within a scope that does not depart from the spirit of the present invention.
- the devices described above are configured by, for example, having a general-purpose or dedicated computer including a processor (a hardware processor) such as a CPU (central processing unit) and a memory such as a RAM (random-access memory)/ROM (read-only memory) execute a predetermined program.
- the computer may include one processor and one memory, or pluralities of processors and memories.
- the program may be either installed in the computer or recorded in the ROM or the like in advance.
- Instead of electronic circuitry, such as a CPU, that realizes a functional configuration by reading a program, some or all of the processing units may be configured using electronic circuitry that realizes processing functions without the use of a program.
- Electronic circuitry constituting a single device may include a plurality of CPUs.
- the processing content of the functions to be included in the devices is described by the program.
- the computer realizes the processing functions described above by executing the program.
- the program describing the processing content may be recorded in advance on a computer-readable recording medium.
- An example of a computer-readable recording medium is a non-transitory recording medium. Examples of this type of recording medium include a magnetic recording device, an optical disk, a magneto-optical recording medium, a semiconductor memory, and so on.
- the program is distributed by, for example, selling, transferring, renting, etc. a portable recording medium such as a DVD or a CD-ROM on which the program is recorded. Further, the program may be stored in a storage device of a server computer and distributed by being transferred from the server computer to another computer over a network.
- the computer that executes the program first stores the program recorded on the portable recording medium or transferred from the server computer temporarily in a storage device included therein. During execution of the processing, the computer reads the program stored in the storage device included therein and executes processing corresponding to the read program. As a different form of execution of the program, the computer may read the program directly from the portable recording medium and execute processing corresponding to the program. Alternatively, every time the program is transferred to the computer from the server computer, the computer may execute processing corresponding to the received program. Instead of transferring the program from the server computer to the computer, the processing described above may be executed by a so-called ASP (Application Service Provider) type service, in which processing functions are realized only by issuing commands to execute the processing and acquiring results.
- At least some of the processing functions may be realized by hardware.
- the present invention can be used in various applications in which it is necessary to suppress noise and reverberation from an acoustic signal.
- the present invention can be used in speech recognition, call systems, conference call systems, and so on.
Description
- [PTL 1] Japanese Patent No. 5227393
- [NPL 1] T. Higuchi, N. Ito, T. Yoshioka, and T. Nakatani, "Robust MVDR beamforming using time-frequency masks for online/offline ASR in noise," Proc. ICASSP 2016, 2016.
- [NPL 2] J. Heymann, L. Drude, and R. Haeb-Umbach, "Neural network based spectral mask estimation for acoustic beamforming," Proc. ICASSP 2016, 2016.
- [NPL 3] T. Nakatani, T. Yoshioka, K. Kinoshita, M. Miyoshi, and B. H. Juang, "Speech dereverberation based on variance-normalized delayed linear prediction," IEEE Trans. ASLP, 18(7), 1717-1731, 2010.
- [NPL 4] T. Yoshioka, N. Ito, M. Delcroix, A. Ogawa, K. Kinoshita, M. Fujimoto, C. Yu, W. J. Fabian, M. Espi, T. Higuchi, S. Araki, and T. Nakatani, "The NTT CHiME-3 system: Advances in speech enhancement and recognition for mobile multi-microphone devices," Proc. IEEE ASRU 2015, 436-443, 2015.
- M: M is a positive integer expressing a number of microphones. For example, M≥2.
- m: m is a positive integer expressing the microphone number, and satisfies 1≤m≤M. The microphone number is represented by upper right superscript in round parentheses. In other words, a value or a vector based on a signal picked up by a microphone having the microphone number m is represented by a symbol having the upper right superscript “(m)” (for example, xf, t (m)).
- N: N is a positive integer expressing the total number of time frames of signals. For example, N≥2.
- t, τ: t and τ are positive integers expressing the time frame number, and t satisfies 1≤t≤N. The time frame number is represented by lower right subscript. In other words, a value or a vector corresponding to a time frame having the time frame number t is represented by a symbol having the lower right subscript "t" (for example, xf, t (m)). Similarly, a value or a vector corresponding to a time frame having the time frame number τ is represented by a symbol having the lower right subscript "τ".
- P: P is a positive integer expressing a total number of frequency bands (discrete frequencies). For example, P≥2.
- f: f is a positive integer expressing the frequency band number, and satisfies 1≤f≤P. The frequency band number is represented by lower right subscript. In other words, a value or a vector corresponding to a frequency band having the frequency band number f is represented by a symbol having the lower right subscript “f” (for example, xf, t (m)).
- T: T expresses a non-conjugated transpose of a matrix or a vector. α0 T represents a matrix or a vector acquired by non-conjugated transposition of α0.
- H: H expresses a conjugated transpose of a matrix or a vector. α0 H represents a matrix or a vector acquired by conjugated transposition of α0.
- |α0|: |α0| expresses the absolute value of α0.
- ∥α0∥: ∥α0∥ expresses the norm of α0.
- |α0|γ: |α0|γ expresses a weighted absolute value γ|α0| of α0.
- ∥α0∥γ: ∥α0∥γ expresses a weighted norm γ∥α0∥ of α0.
The frequency-divided observation signals xf, t are acquired by transforming M observation signals, which are acquired by picking up acoustic signals emitted from one or a plurality of sound sources in M microphones, to the frequency domain. The observation signals are acquired by picking up acoustic signals emitted from the sound sources in an environment where noise and reverberation exist. xf, t (m) is acquired by transforming an observation signal that is acquired by being picked up by the microphone having the microphone number m to the frequency domain. xf, t (m) corresponds to the frequency band having the frequency band number f and the time frame having the time frame number t. In other words, the frequency-divided observation signals xf, t are time series signals.
νf, 0 is a steering vector having, as an element, a transfer function νf, 0 (m) relating to the direct sound and the initial reflected sound from the sound source to each microphone (the sound pickup position of the acoustic signal), or an estimated vector (an estimated steering vector) thereof. In other words, νf, 0 is expressed by an M-dimensional (the dimension of the number of microphones) vector having, as an element, the transfer function νf, 0 (m), which corresponds to the direct sound and initial reflected sound parts of an impulse response from the sound source position to each microphone (i.e. the reverberation that arrives at a delay of no more than several tens of milliseconds (for example, within 30 milliseconds) following the direct sound). When it is difficult to estimate the gain of the steering vector, a normalized vector acquired by normalizing the transfer function of each element so that the gain of a microphone having one of the microphone numbers m0∈{1, . . . , M} becomes a constant g (g≠0) may be used as νf, 0. In other words, as illustrated below, a normalized vector may be used as νf, 0.
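As a small illustration of this normalization, the following sketch fixes the element of one microphone m0 to the constant g; the helper name and the toy vector are illustrative assumptions.

```python
import numpy as np

def normalize_steering(v, m0=0, g=1.0):
    """Normalize a steering vector so that the element of microphone m0
    becomes the constant g (g != 0), as described above."""
    return g * v / v[m0]

v = np.array([0.8 + 0.1j, 0.5 - 0.2j, 0.3 + 0.4j])   # toy M = 3 steering vector
print(normalize_steering(v))                          # first element becomes 1
```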
y f,t =w f,0 H x f,t (6)
Here, the reverberation suppression filter Ff, τ is an M×M-dimensional matrix filter for suppressing reverberation from the frequency-divided observation signal xf, t. d is a positive integer expressing a prediction delay. L is a positive integer expressing the filter length. σf, t 2 is the power of the target signal, which is expressed as follows.
∥x∥γ relating to the frequency-divided observation signal x is the weighted norm ∥x∥γ=γ(xHx)1/2 of the frequency-divided observation signal x.
Here, the estimation signal of the target signal zf, t is an M-dimensional column vector, as shown below.
Here, with respect to τ≠0, wf, τ=−Ff, τwf, 0, and wf, τ corresponds to a filter for performing noise suppression and reverberation suppression simultaneously. w− f is a convolutional beamformer that calculates a weighted sum of a current signal and a past signal sequence having a predetermined delay at each time. Note that the “−” of “w− f” should be written directly above the “w”, as shown below, but due to notation limitations may also be written to the upper right of “w”.
The convolutional beamformer w− f calculates the weighted sum of the current signal and a past signal sequence having a predetermined delay at each time point. The convolutional beamformer w− f is expressed as shown below, for example,
where the following is satisfied.
Further, x− f, t is expressed as follows.
Note that the convolutional beamformer w− f of equation (9A) is a beamformer that calculates, at each time point, the weighted sum of the current signal and a signal sequence having a predetermined delay and a length of 0, and therefore the convolutional beamformer calculates the weighted value of the current signal at each time point. Further, as will be described below, even when L=0, the signal processing device of the present invention can acquire the estimation signal of the target signal by determining a convolutional beamformer on the basis of a probability expressing a speech-likeness and applying the convolutional beamformer to the frequency-divided observation signals.
Here, “const.” expresses a constant.
The signal processing device may determine w− f which minimizes the cost function C3 (w− f) of equation (13) under the constraint condition described above (in which, for example, wf, 0 Hνf, 0 is a constant), for example.
Here, ν− f is a vector acquired by disposing the element νf, 0 (m) of the steering vector νf, 0 as follows.
Here, ν− f (m) is an L+1-dimensional column vector having νf, 0 (m), and L zeros as elements.
y f,t =
y″ f,t =w f,0 H y′ f,t (19)
σf,t 2 =|y″ f,t|2 =y″ f,t H y″ f,t (20)
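A minimal sketch of this power estimation follows, assuming the intermediate signal y′ of equation (18) is available as an M-dimensional vector; the function name is illustrative.

```python
import numpy as np

def estimated_power(w0, y_prime):
    """sigma_{f,t}^2 via equations (19)-(20): y'' = w_{f,0}^H y' and
    sigma^2 = |y''|^2."""
    y2 = w0.conj() @ y_prime           # instantaneous beamformer output (19)
    return float(np.abs(y2) ** 2)      # power estimate (20)
```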
- Reference document 1: Takuya Yoshioka and Tomohiro Nakatani, “Generalization of Multi-Channel Linear Prediction Methods for Blind MIMO Impulse Response Shortening,” IEEE Transactions on Audio, Speech, and Language Processing (Volume: 20, Issue: 10, December 2012)
- Reference document 2: N. Ito, S. Araki, M. Delcroix, and T. Nakatani, “Probabilistic spatial dictionary based online adaptive beamforming for meeting recognition in noise and reverberant environments,” Proc IEEE ICASSP, pp. 681-685, 2017.
- Reference document 3: S. Markovich-Golan and S. Gannot, “Performance analysis of the covariance subtraction method for relative transfer function estimation and comparison to the covariance whitening method,” Proc IEEE ICASSP, pp. 544-548, 2015.
ψx,f,t=βψx,f,t-1 +x f,t x f,t H (21)
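A one-line realization of this recursive covariance update might look as follows; the function name is illustrative.

```python
import numpy as np

def update_spatial_covariance(psi, x, beta):
    """Equation (21): psi_{x,f,t} = beta * psi_{x,f,t-1} + x_{f,t} x_{f,t}^H."""
    return beta * psi + np.outer(x, x.conj())
```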
- Reference document 4: T. Higuchi, N. Ito, T. Yoshioka, and T. Nakatani, "Robust MVDR beamforming using time-frequency masks for online/offline ASR in noise," Proc. IEEE ICASSP-2016, pp. 5210-5214, 2016.
- Reference document 5: J. Heymann, L. Drude, and R. Haeb-Umbach, "Neural network based spectral mask estimation for acoustic beamforming," Proc. IEEE ICASSP-2016, pp. 196-200, 2016.
- Reference document 6: T. Nakatani, N. Ito, T. Higuchi, S. Araki, and K. Kinoshita, "Integrating DNN-based and spatial clustering-based mask estimation for robust MVDR beamforming," Proc. IEEE ICASSP-2017, pp. 286-290, 2017.
- Reference document 7: Y. Matsui, T. Nakatani, M. Delcroix, K. Kinoshita, S. Araki, and S. Makino, "Online integration of DNN-based and spatial clustering-based mask estimation for robust MVDR beamforming," Proc. IWAENC, pp. 71-75, 2018.
Here, α is an oblivion coefficient, and is a real number belonging to a range of 0<α<1, for example. An initial matrix ψ−1 n, f, 0 of the inverse noise covariance matrix ψ−1 n, f, t-1 may be set as desired. For example, an M×M-dimensional unit matrix may be set as the initial matrix ψ−1 n, f, 0 of the inverse noise covariance matrix ψ−1 n, f, t-1. Note that the upper right superscript “−1” of “ψ−1 n, f, t” should be written directly above the lower right subscript “n, f, t”, but due to notation limitations has been written to the upper left of “n, f, t”.
ψn,f,t=αψn,f,t-1+γf,t (n) x f,t x f,t H (25)
Here, α is an oblivion coefficient, and is a real number belonging to a range of 0<α<1, for example.
v′ f,t=ψn,f,t {tilde over (v)} f,t (26)
Here, vf, t ref expresses an element corresponding to the reference microphone ref, among the M elements of a vector v′f, t acquired from equation (26). In other words, in the example of equations (26) and (27), the steering vector is estimated by normalizing v′f, t by the element vf, t ref corresponding to the reference microphone.
νf,t=[νf,t (1),νf,t (2), . . . ,νf,t (M)]T
Here, νf, t (m) represents an element corresponding to the microphone having the microphone number m, among the M elements of the estimated steering vector νf, t. The estimated steering vector νf, t acquired by the steering vector estimation unit is then output.
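The following is a minimal sketch of equations (25) to (27): a mask-weighted recursive noise covariance update followed by a steering-vector estimate normalized at the reference microphone. The normalization in equation (27) is assumed to divide by the reference element of v′, and all names are illustrative.

```python
import numpy as np

def update_steering_estimate(psi_n, x, gamma_n, v_tilde, alpha, ref=0):
    """Equations (25)-(27), under the assumptions stated above.

    psi_n   : (M, M) noise spatial covariance of the previous frame
    x       : (M,) current frequency-divided observation x_{f,t}
    gamma_n : noise mask value for this time-frequency point
    v_tilde : (M,) provisional steering vector
    alpha   : oblivion coefficient, 0 < alpha < 1
    """
    psi_n = alpha * psi_n + gamma_n * np.outer(x, x.conj())   # (25)
    v_prime = psi_n @ v_tilde                                  # (26)
    v = v_prime / v_prime[ref]                                 # (27)
    return psi_n, v
```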
The matrix estimation unit 711 acquires, on the basis of the inverse matrix R−−1 f, t-1 of a space-time covariance matrix (an inverse matrix of the space-time covariance matrix of the second time interval that is further in the past than the first time interval), the frequency-divided observation signal x f, t, and the power or estimated power σ f, t 2 of the target signal, an inverse matrix R−−1 f, t of a space-time covariance matrix (an inverse matrix of the space-time covariance matrix of the first time interval). An example of the space-time covariance matrix is as follows. In this case, the matrix estimation unit 711 acquires the inverse matrix R−−1 f, t of the space-time covariance matrix in accordance with equations (28) and (29) shown below, for example.
Here, kf, t in equation (28) is an (L+1)M-dimensional vector, and the inverse matrix of equation (29) is an (L+1)M×(L+1)M matrix. α is an oblivion coefficient, and is a real number belonging to a range of 0<α<1, for example. Further, an initial matrix of the inverse matrix R−−1 f, t-1 of the space-time covariance matrix may be set as desired, and an example of the initial matrix is an (L+1)M-dimensional unit matrix.
where v− f, t = [v− f, t (1), v− f, t (2), …, v− f, t (M)] and v− f, t (m) = [gf νf, t (m), 0, …, 0] is an L+1-dimensional vector. gf is a scalar constant other than 0.
y f,t = w− f,t H x− f,t (31)
Here, ν˜ f, 0 is an M−1-dimensional column vector constituted by elements of the steering vector νf, 0 or the estimated steering vector νf, 0 that correspond to microphones other than the reference microphone ref, νf, 0 ref is the element of νf, 0 that corresponds to the reference microphone ref, and IM-1 is an (M−1)×(M−1)-dimensional unit matrix.
w f,0 =g fνf,0 +B f a f,0 (33)
Accordingly, Bf Hνf, 0=0, and therefore the constraint condition that “wf, 0 Hνf, 0 is a constant” is expressed as follows.
w f,0 Hνf,0=(g fνf,0 +B f a f,0)Hνf,0 =g f H∥νf,0∥2=constant
Hence, even under the definition given in equation (33), the constraint condition that “wf, 0 Hνf, 0 is a constant” is satisfied in relation to any modified instantaneous beamformer af, 0. It is therefore evident that the instantaneous beamformer wf, 0 may be defined as illustrated in equation (33). In this embodiment, the convolutional beamformer is estimated using the optimal solution of the convolutional beamformer acquired when the instantaneous beamformer wf, 0 is defined as illustrated in equation (33). This will be described in detail below.
z f,t=(g fνf,0)H x f,t (34)
The output initial beamformer output zf, t is transmitted to the convolutional beamformer estimation unit 812 and the suppression unit 82.
Note that the "=" of "x= f, t" should be written directly above the "x", as shown in equation (36), but due to notation limitations may also be written to the upper right of "x". The output vector x= f, t is transmitted to the matrix estimation unit 811.
The output weighted modified space-time covariance matrix R= f is transmitted to the convolutional beamformer estimation unit 812.
The output convolutional beamformer w= f is transmitted to the suppression unit 82.
y f,t = z f,t + w= f H x= f,t (39)
y f,t = z f,t + w= f,t-1 H x= f,t (40)
Here, the initial vector w= f, 0 of the convolutional beamformer w= f, t-1 may be any (LM+M−1)-dimensional vector. An example of the initial vector w= f, 0 is an (LM+M−1)-dimensional vector in which all elements are 0.
Here, α is an oblivion coefficient, and is a real number belonging to a range of 0<α<1, for example. Further, an initial matrix of the inverse matrix R˜−1 f, t-1 of the weighted modified space-time covariance matrix may be any (LM+M−1)×(LM+M−1)-dimensional matrix. An example of the initial matrix of the inverse matrix R˜−1 f, t-1 of the weighted modified space-time covariance matrix is an (LM+M−1)-dimensional unit matrix.
Note that R˜ f, t itself is not calculated. The output adaptive gain kf, t is transmitted to the matrix estimation unit 915 and the convolutional beamformer estimation unit 912.
The output inverse matrix R˜−1 f, t of the weighted modified space-time covariance matrix is transmitted to the adaptive gain estimation unit 911.
The output convolutional beamformer w= f, t is transmitted to the suppression unit 92.
- 1-9 Signal processing device
- 11, 21, 71, 81, 91 Estimation unit
- 12, 22 Suppression unit
Claims (17)
Applications Claiming Priority (6)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| JP2018-234075 | 2018-12-14 | ||
| JP2018234075 | 2018-12-14 | ||
| PCT/JP2019/016587 WO2020121545A1 (en) | 2018-12-14 | 2019-04-18 | Signal processing device, signal processing method, and program |
| WOPCT/JP2019/016587 | 2019-04-18 | ||
| JPPCT/JP2019/016587 | 2019-04-18 | ||
| PCT/JP2019/029921 WO2020121590A1 (en) | 2018-12-14 | 2019-07-31 | Signal processing device, signal processing method, and program |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| US20220068288A1 US20220068288A1 (en) | 2022-03-03 |
| US11894010B2 true US11894010B2 (en) | 2024-02-06 |
Family
ID=71076328
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| US17/312,912 Active 2040-05-14 US11894010B2 (en) | 2018-12-14 | 2019-07-31 | Signal processing apparatus, signal processing method, and program |
Country Status (3)
| Country | Link |
|---|---|
| US (1) | US11894010B2 (en) |
| JP (1) | JP7115562B2 (en) |
| WO (2) | WO2020121545A1 (en) |
Families Citing this family (12)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN111933170B (en) * | 2020-07-20 | 2024-03-29 | 歌尔科技有限公司 | Voice signal processing method, device, equipment and storage medium |
| JP7430127B2 (en) * | 2020-09-02 | 2024-02-09 | 三菱重工業株式会社 | Prediction device, prediction method, and program |
| US12348945B2 (en) * | 2020-10-15 | 2025-07-01 | Nippon Telegraph And Telephone Corporation | Acoustic signal enhancement apparatus, method and program |
| JP7639382B2 (en) * | 2021-02-12 | 2025-03-05 | 日本電信電話株式会社 | Audio signal enhancement device, method and program |
| CN112802490B (en) * | 2021-03-11 | 2023-08-18 | 北京声加科技有限公司 | Beam forming method and device based on microphone array |
| US11798533B2 (en) * | 2021-04-02 | 2023-10-24 | Google Llc | Context aware beamforming of audio data |
| WO2023276068A1 (en) * | 2021-06-30 | 2023-01-05 | 日本電信電話株式会社 | Acoustic signal enhancement device, acoustic signal enhancement method, and program |
| CN113707136B (en) * | 2021-10-28 | 2021-12-31 | 南京南大电子智慧型服务机器人研究院有限公司 | Audio and video mixed voice front-end processing method for voice interaction of service robot |
| CN115086836B (en) * | 2022-06-14 | 2023-04-18 | 西北工业大学 | Beam forming method, system and beam former |
| CN117292700A (en) * | 2022-06-20 | 2023-12-26 | 青岛海尔科技有限公司 | Voice enhancement method and device for distributed wakeup and storage medium |
| WO2024038522A1 (en) * | 2022-08-17 | 2024-02-22 | 日本電信電話株式会社 | Signal processing device, signal processing method, and program |
| CN118197341B (en) * | 2024-04-15 | 2024-11-26 | 武汉理工大学 | A beamforming method and device based on room environment adaptive calibration |
Citations (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20090110207A1 (en) * | 2006-05-01 | 2009-04-30 | Nippon Telegraph And Telephone Company | Method and Apparatus for Speech Dereverberation Based On Probabilistic Models Of Source And Room Acoustics |
| US20110002473A1 (en) | 2008-03-03 | 2011-01-06 | Nippon Telegraph And Telephone Corporation | Dereverberation apparatus, dereverberation method, dereverberation program, and recording medium |
Family Cites Families (8)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US3685380A (en) * | 1971-02-19 | 1972-08-22 | Amada Ltd Us | Multi-track turret and overload protection |
| JP3484112B2 (en) * | 1999-09-27 | 2004-01-06 | 株式会社東芝 | Noise component suppression processing apparatus and noise component suppression processing method |
| JP2007093630A (en) * | 2005-09-05 | 2007-04-12 | Advanced Telecommunication Research Institute International | Speech enhancement device |
| JP5139111B2 (en) * | 2007-03-02 | 2013-02-06 | 本田技研工業株式会社 | Method and apparatus for extracting sound from moving sound source |
| JP5075042B2 (en) * | 2008-07-23 | 2012-11-14 | 日本電信電話株式会社 | Echo canceling apparatus, echo canceling method, program thereof, and recording medium |
| EP2222091B1 (en) * | 2009-02-23 | 2013-04-24 | Nuance Communications, Inc. | Method for determining a set of filter coefficients for an acoustic echo compensation means |
| US8666090B1 (en) * | 2013-02-26 | 2014-03-04 | Full Code Audio LLC | Microphone modeling system and method |
| US10090000B1 (en) * | 2017-11-01 | 2018-10-02 | GM Global Technology Operations LLC | Efficient echo cancellation using transfer function estimation |
-
2019
- 2019-04-18 WO PCT/JP2019/016587 patent/WO2020121545A1/en not_active Ceased
- 2019-07-31 WO PCT/JP2019/029921 patent/WO2020121590A1/en not_active Ceased
- 2019-07-31 JP JP2020559702A patent/JP7115562B2/en active Active
- 2019-07-31 US US17/312,912 patent/US11894010B2/en active Active
Patent Citations (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20090110207A1 (en) * | 2006-05-01 | 2009-04-30 | Nippon Telegraph And Telephone Company | Method and Apparatus for Speech Dereverberation Based On Probabilistic Models Of Source And Room Acoustics |
| US20110002473A1 (en) | 2008-03-03 | 2011-01-06 | Nippon Telegraph And Telephone Corporation | Dereverberation apparatus, dereverberation method, dereverberation program, and recording medium |
| JP5227393B2 (en) | 2008-03-03 | 2013-07-03 | 日本電信電話株式会社 | Reverberation apparatus, dereverberation method, dereverberation program, and recording medium |
Non-Patent Citations (9)
| Title |
|---|
| Farrier et al, "Fast beamforming techniques for circular arrays", J. Acoustic Soc. Am., vol. 58, No. 4, pp. 920-922, October (Year: 1975). * |
| Heymann et al. (2016) "Neural network based spectral mask estimation for acoustic beamforming," Proc. ICASSP 2016. |
| Higuchi et al. (2016) "Robust MVDR beamforming using time-frequency masks for online/offline ASR in noise", Proc. ICASSP 2016. |
| Liu, et al., "Neural Network Based Time-Frequency Masking and Steering Vector Estimation for Two-Channel Mvdr Beamforming", hereinafter Liu, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) ), pp. 6717-6721, Apr. 15-20 (Year: 2018). * |
| Nakashika et al."Dysarthric Speech Recognition Using a Convolutive Bottleneck Network", ICSP2014 Proceedings, pp. 505-509 (Year: 2014). * |
| Nakatani et al. (2010) "Speech dereverberation based on variance-normalized delayed linear prediction," IEEE Trans. ASLP, 18 (7), 1717-1731. |
| Nakatani et al. (2018) "A unified convolutional beamformer for simultaneous denoising and dereverberation" published at https://arxiv.org/abs/1812.08400, on Dec. 20, 2018. |
| Yoshioka et al. (2015) "The NTT CHIME-3 system: Advances in speech enhancement and recognition for mobile multi-microphone devices," Proc. IEEE ASRU 2015, 436-443. |
| Zhang et al "Microphone Subset Selection for MVDR Beamformer Based Noise Reduction", IEEE/ACM Trans. on Acoustics, Speech and Language Processing, vol. ** , No .** , pp. 1-13, May 16 (Year: 2017). * |
Cited By (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US12165668B2 (en) * | 2022-02-18 | 2024-12-10 | Microsoft Technology Licensing, Llc | Method for neural beamforming, channel shortening and noise reduction |
Also Published As
| Publication number | Publication date |
|---|---|
| WO2020121545A1 (en) | 2020-06-18 |
| WO2020121590A1 (en) | 2020-06-18 |
| JP7115562B2 (en) | 2022-08-09 |
| JPWO2020121590A1 (en) | 2021-10-14 |
| US20220068288A1 (en) | 2022-03-03 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| US11894010B2 (en) | Signal processing apparatus, signal processing method, and program | |
| CN110100457B (en) | On-Line Dereverberation Algorithm Based on Weighted Prediction Errors in Noise Time-varying Environment | |
| US8848933B2 (en) | Signal enhancement device, method thereof, program, and recording medium | |
| US10123113B2 (en) | Selective audio source enhancement | |
| JP6169849B2 (en) | Sound processor | |
| CN112447191A (en) | Signal processing device and signal processing method | |
| US8693287B2 (en) | Sound direction estimation apparatus and sound direction estimation method | |
| JP6169910B2 (en) | Audio processing device | |
| US10818302B2 (en) | Audio source separation | |
| JP6106611B2 (en) | Model estimation device, noise suppression device, speech enhancement device, method and program thereof | |
| CN110998723B (en) | Signal processing device using neural network, signal processing method, and recording medium | |
| CN106031196B (en) | Signal processing device, method and program | |
| KR102410850B1 (en) | Method and apparatus for extracting reverberant environment embedding using dereverberation autoencoder | |
| US9875748B2 (en) | Audio signal noise attenuation | |
| Nesta et al. | A flexible spatial blind source extraction framework for robust speech recognition in noisy environments | |
| US11676619B2 (en) | Noise spatial covariance matrix estimation apparatus, noise spatial covariance matrix estimation method, and program | |
| Liu et al. | A Hybrid Reverberation Model and Its Application to Joint Speech Dereverberation and Separation | |
| Cauchi et al. | Spectrally and spatially informed noise suppression using beamforming and convolutive NMF | |
| CN119694333B (en) | Directional pickup method, system, equipment and storage medium | |
| US20240312446A1 (en) | Acoustic signal enhancement device, acoustic signal enhancement method, and program | |
| Giri et al. | A novel target speaker dependent postfiltering approach for multichannel speech enhancement | |
| Kim et al. | Online speech dereverberation using RLS-WPE based on a full spatial correlation matrix integrated in a speech enhancement system | |
| Pu | Speech Dereverberation Based on Multi-Channel Linear Prediction | |
| Kang et al. | Reverberation and noise robust feature enhancement using multiple inputs | |
| Kouhi-Jelehkaran et al. | Phone-based filter parameter optimization of filter and sum robust speech recognition using likelihood maximization |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| AS | Assignment |
Owner name: NIPPON TELEGRAPH AND TELEPHONE CORPORATION, JAPAN Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:NAKATANI, TOMOHIRO;KINOSHITA, KEISUKE;SIGNING DATES FROM 20201214 TO 20201215;REEL/FRAME:056506/0251 |
|
| FEPP | Fee payment procedure |
Free format text: ENTITY STATUS SET TO UNDISCOUNTED (ORIGINAL EVENT CODE: BIG.); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY |
|
| STPP | Information on status: patent application and granting procedure in general |
Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION |
|
| STPP | Information on status: patent application and granting procedure in general |
Free format text: NON FINAL ACTION MAILED |
|
| STPP | Information on status: patent application and granting procedure in general |
Free format text: RESPONSE TO NON-FINAL OFFICE ACTION ENTERED AND FORWARDED TO EXAMINER |
|
| STPP | Information on status: patent application and granting procedure in general |
Free format text: NOTICE OF ALLOWANCE MAILED -- APPLICATION RECEIVED IN OFFICE OF PUBLICATIONS |
|
| STPP | Information on status: patent application and granting procedure in general |
Free format text: PUBLICATIONS -- ISSUE FEE PAYMENT RECEIVED |
|
| STPP | Information on status: patent application and granting procedure in general |
Free format text: PUBLICATIONS -- ISSUE FEE PAYMENT VERIFIED |
|
| STCF | Information on status: patent grant |
Free format text: PATENTED CASE |