US11894010B2 - Signal processing apparatus, signal processing method, and program


Info

Publication number
US11894010B2
Authority
US
United States
Prior art keywords
signals
beamformer
frequency
convolutional
time interval
Prior art date
Legal status
Active, expires
Application number
US17/312,912
Other versions
US20220068288A1 (en
Inventor
Tomohiro Nakatani
Keisuke Kinoshita
Current Assignee
Nippon Telegraph and Telephone Corp
Original Assignee
Nippon Telegraph and Telephone Corp
Priority date
Filing date
Publication date
Application filed by Nippon Telegraph and Telephone Corp filed Critical Nippon Telegraph and Telephone Corp
Assigned to NIPPON TELEGRAPH AND TELEPHONE CORPORATION reassignment NIPPON TELEGRAPH AND TELEPHONE CORPORATION ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: NAKATANI, TOMOHIRO, KINOSHITA, KEISUKE
Publication of US20220068288A1 publication Critical patent/US20220068288A1/en
Application granted granted Critical
Publication of US11894010B2 publication Critical patent/US11894010B2/en


Classifications

    • G10L21/0208: Speech enhancement; noise filtering
    • G10L21/0232: Noise filtering characterised by the method used for estimating noise; processing in the frequency domain
    • H04R3/00: Circuits for transducers, loudspeakers or microphones
    • G10L2021/02082: Noise filtering where the noise is echo or reverberation of the speech
    • G10L2021/02161: Number of inputs available containing the signal or the noise to be suppressed
    • G10L2021/02166: Microphone arrays; beamforming

Definitions

  • the present invention relates to a signal processing technique for an acoustic signal.
  • NPL 1 and NPL 2 disclose a method of suppressing noise and reverberation from an observation signal in the frequency domain.
  • reverberation and noise are suppressed by receiving an observation signal in the frequency domain and a steering vector representing the direction of a sound source or an estimated vector thereof, estimating an instantaneous beamformer for minimizing the power of the frequency-domain observation signal under a constraint condition that sound reaching a microphone from the sound source is not distorted, and applying the instantaneous beamformer to the frequency-domain observation signal (conventional method 1).
  • PTL 1 and NPL 3 disclose a method of suppressing reverberation from an observation signal in the frequency domain.
  • reverberation in an observation signal in the frequency domain is suppressed by receiving an observation signal in the frequency domain and the power of a target sound at each time, or an estimated value thereof, estimating a reverberation suppression filter for suppressing reverberation in the target sound on the basis of a weighted power minimization reference of a prediction error, and applying the reverberation suppression filter to the frequency-domain observation signal (conventional method 2).
  • NPL 4 discloses a method of suppressing noise and reverberation by cascade-connecting conventional method 2 and conventional method 1.
  • in this method, at a prior stage, an observation signal in the frequency domain and the power of a target sound at each time are received and reverberation is suppressed using conventional method 2; then, at a later stage, a steering vector is received and reverberation and noise are further suppressed using conventional method 1 (conventional method 3).
  • Conventional method 1 is a method originally developed for the purpose of suppressing noise and may not always be capable of sufficiently suppressing reverberation. With conventional method 2, noise cannot be suppressed.
  • Conventional method 3 can suppress more noise and reverberation than when conventional method 1 or conventional method 2 is used alone. With conventional method 3, however, conventional method 2 serving as the prior stage and conventional method 1 serving as the later stage are viewed as independent systems and optimization is performed in the respective systems. Therefore, when conventional method 2 is applied at the prior stage, it may not always be possible to sufficiently suppress reverberation due to the effects of noise. Further, when conventional method 1 is applied at the later stage, it may not always be possible to sufficiently suppress noise and reverberation due to the effects of residual reverberation.
  • the present invention has been designed in consideration of these points, and an object thereof is to provide a technique with which noise and reverberation can be sufficiently suppressed.
  • a convolutional beamformer is acquired that calculates, at each time, a weighted sum of a current signal and a past signal sequence having a predetermined delay and a length of 0 or more, such that estimation signals of target signals increase a probability expressing a speech-likeness of the estimation signals based on a predetermined probability model, where the estimation signals are acquired by applying the convolutional beamformer to frequency-divided observation signals corresponding respectively to a plurality of frequency bands of observation signals acquired by picking up acoustic signals emitted from a sound source. The estimation signals are then acquired by applying the acquired convolutional beamformer to the frequency-divided observation signals.
  • because the convolutional beamformer is acquired such that the estimation signals increase the probability expressing the speech-likeness of the estimation signals based on the probability model, noise suppression and reverberation suppression can be optimized as a single system, with the result that noise and reverberation can be sufficiently suppressed.
  • FIG. 1 A is a block diagram illustrating an example of a functional configuration of a signal processing device according to a first embodiment
  • FIG. 1 B is a flowchart illustrating an example of a signal processing method according to the first embodiment.
  • FIG. 2 A is a block diagram illustrating an example of a functional configuration of a signal processing device according to a second embodiment
  • FIG. 2 B is a flowchart illustrating an example of a signal processing method according to the second embodiment.
  • FIG. 3 is a block diagram illustrating an example of a functional configuration of a signal processing device according to a third embodiment.
  • FIG. 4 is a block diagram illustrating an example of a functional configuration of a parameter estimation unit illustrated in FIG. 3 .
  • FIG. 5 is a flowchart illustrating an example of a parameter estimation method according to the third embodiment.
  • FIG. 6 is a block diagram illustrating an example of a functional configuration of a signal processing device according to fourth to seventh embodiments.
  • FIG. 7 is a block diagram illustrating an example of a functional configuration of a parameter estimation unit illustrated in FIG. 6 .
  • FIG. 8 is a block diagram illustrating an example of a functional configuration of a steering vector estimation unit illustrated in FIG. 7 .
  • FIG. 9 is a block diagram illustrating an example of a functional configuration of a signal processing device according to an eighth embodiment.
  • FIG. 10 is a block diagram illustrating an example of a functional configuration of a signal processing device according to a ninth embodiment.
  • FIGS. 11 A to 11 C are block diagrams illustrating examples of use of the signal processing devices according to the embodiments.
  • FIG. 12 is a table illustrating examples of test results of the first embodiment.
  • FIG. 13 is a table illustrating examples of test results of the first embodiment.
  • FIG. 14 is a table illustrating examples of test results of the fourth embodiment.
  • FIGS. 15 A to 15 C are tables illustrating examples of test results of the seventh embodiment.
  • a “target signal” denotes a signal corresponding to a direct sound and an initial reflected sound, within a signal (for example, a frequency-divided observation signal) corresponding to a sound emitted from a target sound source and picked up by a microphone.
  • the initial reflected sound denotes a reverberation component derived from the sound emitted from the target sound source that reaches the microphone at a delay of no more than several tens of milliseconds following the direct sound.
  • the initial reflected sound typically acts to improve the clarity of the sound, and in this embodiment, a signal corresponding to the initial reflected sound is also included in the target signal.
  • the signal corresponding to the sound picked up by the microphone also includes, in addition to the target signal described above, late reverberation (a component acquired by excluding the initial reflected sound from the reverberation) derived from the sound emitted from the target sound source, and noise derived from a source other than the target sound source.
  • the target signal is estimated by suppressing late reverberation and noise from a frequency-divided observation signal corresponding to a sound recorded by the microphone, for example.
  • hereinafter, “reverberation” is assumed to refer to “late reverberation”.
  • Method 1 serving as a prerequisite of the method according to the embodiments will now be described.
  • in method 1, noise and reverberation are suppressed from M-dimensional observation signals in the frequency domain (frequency-divided observation signals) $x_{f,t} = [x_{f,t}^{(1)}, x_{f,t}^{(2)}, \ldots, x_{f,t}^{(M)}]^{T}$ (1), where f is a frequency band number, t is a time frame number, and m ∈ {1, . . . , M} is a microphone number.
  • the frequency-divided observation signals x f, t are acquired by transforming M observation signals, which are acquired by picking up acoustic signals emitted from one or a plurality of sound sources in M microphones, to the frequency domain.
  • the observation signals are acquired by picking up acoustic signals emitted from the sound sources in an environment where noise and reverberation exist.
  • x_{f,t}^{(m)} is acquired by transforming the observation signal picked up by the microphone having the microphone number m to the frequency domain.
  • x f, t (m) corresponds to the frequency band having the frequency band number f and the time frame having the time frame number t.
  • the frequency-divided observation signals x f, t are time series signals.
  • an instantaneous beamformer w_{f,0} (for example, a minimum power distortionless response beamformer) for calculating the weighted sum of the signals at the current time is determined for each frequency band so as to minimize the cost function C_1(w_{f,0}) below, under the constraint condition in which “the target signals are not distorted as a result of applying the instantaneous beamformer w_{f,0} to the frequency-divided observation signals x_{f,t} at each time”:
    $C_{1}(w_{f,0}) = \sum_{t} |w_{f,0}^{H}\, x_{f,t}|^{2}$
  • the constraint condition is a condition in which, for example, w_{f,0}^H v_{f,0} is a constant (1, for example).
  • $v_{f,0} = [v_{f,0}^{(1)}, v_{f,0}^{(2)}, \ldots, v_{f,0}^{(M)}]^{T}$ (4) is a steering vector having, as elements, the transfer functions $v_{f,0}^{(m)}$ relating to the direct sound and the initial reflected sound from the sound source to each microphone (the pickup position of the acoustic signal), or an estimated vector (an estimated steering vector) thereof.
  • v_{f,0} is an M-dimensional vector (M being the number of microphones) whose elements are the transfer functions $v_{f,0}^{(m)}$ corresponding to the direct sound and initial reflected sound parts of the impulse response from the sound source position to each microphone (i.e., the reverberation that arrives at a delay of no more than several tens of milliseconds (for example, within 30 milliseconds) following the direct sound).
  • a normalized vector acquired by normalizing the transfer function of each element so that the gain of a microphone having one of the microphone numbers m_0 ∈ {1, . . . , M} becomes a constant g (g ≠ 0) may also be used as v_{f,0} (equation (5)).
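For concreteness, conventional method 1 admits a simple closed-form computation per frequency band. Below is a minimal numpy sketch, not the patent's implementation; the array shapes, function name, and the diagonal-loading term eps are assumptions:

```python
import numpy as np

def mpdr_beamformer(X, v, eps=1e-6):
    """Conventional method 1: minimum power distortionless response (MPDR)
    beamformer for one frequency band. Shapes and `eps` are assumptions.

    X : (M, T) complex frequency-divided observation signals x_{f,t}
    v : (M,)  steering vector v_{f,0} (or its estimate)
    Returns w : (M,) weights minimizing output power with w^H v = 1.
    """
    M, T = X.shape
    # Spatial covariance of the observations, lightly regularized.
    R = X @ X.conj().T / T + eps * np.eye(M)
    Rinv_v = np.linalg.solve(R, v)
    # Closed-form MPDR solution under the distortionless constraint.
    w = Rinv_v / (v.conj() @ Rinv_v)
    return w

# Application at each time: y_{f,t} = w^H x_{f,t}, i.e. Y = w.conj() @ X.
```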
  • Method 2 serving as a prerequisite of the method according to the embodiments will now be described.
  • reverberation is suppressed from the frequency-divided observation signal x f, t .
  • the reverberation suppression filter F_{f,τ} is an M × M-dimensional matrix filter for suppressing reverberation from the frequency-divided observation signal x_{f,t} .
  • d is a positive integer expressing a prediction delay.
  • L is a positive integer expressing the filter length.
  • ⁇ f, t 2 is the power of the target signal, which is expressed as follows.
  • an estimation signal of a target signal z_{f,t} in which reverberation has been suppressed from the frequency-divided observation signal x_{f,t} is acquired by applying the reverberation suppression filter as follows:
    $z_{f,t} = x_{f,t} - \sum_{\tau=d}^{d+L-1} F_{f,\tau}^{H}\, x_{f,t-\tau}$ (8)
  • the estimation signal of the target signal z_{f,t} is an M-dimensional column vector $z_{f,t} = [z_{f,t}^{(1)}, z_{f,t}^{(2)}, \ldots, z_{f,t}^{(M)}]^{T}$.
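Conventional method 2 can likewise be sketched compactly. The following numpy illustration alternates between the target-power estimate and the weighted least-squares prediction filter of equation (8); the function name, iteration count, and regularizer eps are assumptions, not the patent's implementation:

```python
import numpy as np

def wpe_dereverberate(X, d=3, L=10, iters=3, eps=1e-8):
    """Weighted prediction-error style reverberation suppression
    (conventional method 2) for one frequency band; minimal sketch.

    X : (M, T) complex frequency-divided observation signals x_{f,t}
    d : prediction delay; L : number of filter taps per channel
    Returns Z : (M, T) estimation signal z_{f,t} of eq. (8).
    """
    M, T = X.shape
    # Stacked, delayed past observations (taps d .. d+L-1), shape (ML, T).
    Xbar = np.zeros((M * L, T), dtype=X.dtype)
    for tau in range(L):
        Xbar[tau * M:(tau + 1) * M, d + tau:] = X[:, :T - d - tau]
    Z = X.copy()
    for _ in range(iters):
        # Power of the current target estimate (weights of the criterion).
        sigma2 = np.maximum(np.mean(np.abs(Z) ** 2, axis=0), eps)
        # Weighted normal equations for the prediction filter F (ML x M).
        G = (Xbar / sigma2) @ Xbar.conj().T
        P = (Xbar / sigma2) @ X.conj().T
        F = np.linalg.solve(G + eps * np.eye(M * L), P)
        # Eq. (8): subtract the predicted late reverberation.
        Z = X - F.conj().T @ Xbar
    return Z
```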
  • an estimation signal of a target signal y_{f,t} acquired by suppressing noise and reverberation from the frequency-divided observation signal x_{f,t} by using a method integrating methods 1 and 2 can be modeled as $y_{f,t} = \bar{w}_{f}^{H}\,\bar{x}_{f,t}$ .
  • w̄_f is a convolutional beamformer that calculates, at each time point, a weighted sum of the current signal and a past signal sequence having a predetermined delay. The convolutional beamformer w̄_f is expressed as shown below, for example:
    $\bar{w}_{f} = [\bar{w}_{f}^{(1)T}, \bar{w}_{f}^{(2)T}, \ldots, \bar{w}_{f}^{(M)T}]^{T}$ (10)
    where the following is satisfied:
    $\bar{w}_{f}^{(m)} = [w_{f,0}^{(m)}, w_{f,d}^{(m)}, w_{f,d+1}^{(m)}, \ldots, w_{f,d+L-1}^{(m)}]^{T}$ (10A)
  • further, x̄_{f,t} is expressed as follows:
    $\bar{x}_{f,t} = [\bar{x}_{f,t}^{(1)T}, \bar{x}_{f,t}^{(2)T}, \ldots, \bar{x}_{f,t}^{(M)T}]^{T}$ (11)
    $\bar{x}_{f,t}^{(m)} = [x_{f,t}^{(m)}, x_{f,t-d}^{(m)}, x_{f,t-d-1}^{(m)}, \ldots, x_{f,t-d-L+1}^{(m)}]^{T}$ (11A)
  • when L = 0, the convolutional beamformer w̄_f of equation (9A) calculates, at each time point, the weighted sum of the current signal and a past signal sequence of length 0; in other words, it calculates only the weighted value of the current signal at each time point.
  • the signal processing device of the present invention can acquire the estimation signal of the target signal by determining a convolutional beamformer on the basis of a probability expressing a speech-likeness and applying the convolutional beamformer to the frequency-divided observation signals.
  • for example, the convolutional beamformer w̄_f which maximizes the probability expressing the speech-likeness of y_{f,t} is determined.
  • a complex normal distribution having an average of 0 and a variance matching the power σ_{f,t}² of the target signal can be cited as an example of a speech probability density function.
  • the “target signal” is a signal corresponding to the direct sound and the initial reflected sound, within a signal corresponding to a sound emitted from a target sound source and picked up by a microphone. Further, the signal processing device determines the convolutional beamformer w̄_f under the constraint condition in which “the target signals are not distorted as a result of applying the convolutional beamformer w̄_f to the frequency-divided observation signals x_{f,t}”, for example.
  • this constraint condition is a condition in which, for example, w_{f,0}^H v_{f,0} is a constant (1, for example).
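The connection between this probability model and the cost function of equation (13) below can be made explicit. Under the zero-mean complex normal model with time-varying variance σ_{f,t}², the negative log-likelihood of the beamformer outputs is (a standard derivation, not quoted from the patent):

$-\log \prod_{t} \mathcal{N}_{\mathbb{C}}\!\left(y_{f,t};\, 0,\, \sigma_{f,t}^{2}\right) = \sum_{t}\left(\frac{|\bar{w}_{f}^{H}\,\bar{x}_{f,t}|^{2}}{\sigma_{f,t}^{2}} + \log \pi\sigma_{f,t}^{2}\right)$

Since the logarithmic terms do not depend on w̄_f, maximizing the speech-likeness probability over w̄_f is exactly minimizing the cost function C_3(w̄_f) of equation (13), and equation (15) gives the resulting minimizer under the distortionless constraint.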
  • R_f is a weighted space-time covariance matrix determined as shown below:
    $R_{f} = \sum_{t} \frac{\bar{x}_{f,t}\,\bar{x}_{f,t}^{H}}{\sigma_{f,t}^{2}}$ (14)
  • for example, the signal processing device may determine the w̄_f which minimizes the cost function $C_{3}(\bar{w}_{f}) = \sum_{t} \frac{|\bar{w}_{f}^{H}\,\bar{x}_{f,t}|^{2}}{\sigma_{f,t}^{2}}$ of equation (13) under the constraint condition described above (in which, for example, w_{f,0}^H v_{f,0} is a constant). The minimizer is given in closed form by
    $\bar{w}_{f} = \frac{R_{f}^{-1}\,\bar{v}_{f}}{\bar{v}_{f}^{H}\,R_{f}^{-1}\,\bar{v}_{f}}$ (15)
  • v̄_f is a vector acquired by disposing the elements v_{f,0}^{(m)} of the steering vector v_{f,0} as follows:
    $\bar{v}_{f} = [\bar{v}_{f}^{(1)T}, \bar{v}_{f}^{(2)T}, \ldots, \bar{v}_{f}^{(M)T}]^{T}$
  • each $\bar{v}_{f}^{(m)} = [v_{f,0}^{(m)}, 0, \ldots, 0]^{T}$ is an (L+1)-dimensional column vector having v_{f,0}^{(m)} and L zeros as its elements.
  • the signal processing device acquires the estimation signal of the target signal y_{f,t} by applying the determined convolutional beamformer w̄_f to the frequency-divided observation signals as follows:
    $y_{f,t} = \bar{w}_{f}^{H}\,\bar{x}_{f,t}$ (16)
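Equations (10) to (16) together define an offline computation that the embodiments below implement. A minimal numpy sketch for one frequency band follows; the stacking helper mirrors equations (10A) and (11A), and eps and the shapes are assumptions added for illustration:

```python
import numpy as np

def stack_observations(X, d, L):
    """Build xbar_{f,t} of eqs. (11)/(11A): the current frame stacked with
    L past frames delayed by d, grouped per microphone; shape ((L+1)M, T)."""
    M, T = X.shape
    taps = [X]                              # tap 0: x_{f,t}
    for tau in range(L):
        shifted = np.zeros_like(X)
        shifted[:, d + tau:] = X[:, :T - d - tau]
        taps.append(shifted)                # x_{f,t-d}, ..., x_{f,t-d-L+1}
    return np.concatenate(
        [np.stack([tap[m] for tap in taps]) for m in range(M)], axis=0)

def wpd_convolutional_beamformer(X, v, sigma2, d=3, L=10, eps=1e-8):
    """Convolutional beamformer of eqs. (13)-(16) for one frequency band.

    X : (M, T) observations x_{f,t};  v : (M,) steering vector v_{f,0}
    sigma2 : (T,) power sigma^2_{f,t} of the target signal
    Returns y : (T,) estimation signal y_{f,t} of the target signal.
    """
    M, T = X.shape
    Xbar = stack_observations(X, d, L)
    # Weighted space-time covariance R_f of eq. (14); the 1/T scaling
    # cancels in the ratio of eq. (15).
    R = (Xbar / np.maximum(sigma2, eps)) @ Xbar.conj().T / T
    R += eps * np.eye((L + 1) * M)
    # vbar_f: each element v^{(m)} followed by L zeros (see after eq. (15)).
    vbar = np.zeros((L + 1) * M, dtype=complex)
    vbar[::L + 1] = v
    # Closed-form solution of eq. (15).
    Rinv_vbar = np.linalg.solve(R, vbar)
    wbar = Rinv_vbar / (vbar.conj() @ Rinv_vbar)
    # Eq. (16): apply the convolutional beamformer.
    return wbar.conj() @ Xbar
```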
  • a signal processing device 1 includes an estimation unit 11 and a suppression unit 12 .
  • the frequency-divided observation signal x f, t is input into the estimation unit 11 (equation (1)).
  • the estimation unit 11 acquires and outputs the convolutional beamformer w̄_f for calculating the weighted sum of the current signal and a past signal sequence having a predetermined delay at each time, such that the estimation signals increase the probability expressing the speech-likeness of the estimation signals based on the predetermined probability model, where the estimation signals are acquired by applying the convolutional beamformer w̄_f to the frequency-divided observation signals x_{f,t} in the respective frequency bands.
  • the frequency-divided observation signal x_{f,t} and the convolutional beamformer w̄_f acquired in step S 11 are input into the suppression unit 12 .
  • the suppression unit 12 acquires and outputs the estimation signal of the target signal y_{f,t} by applying the convolutional beamformer w̄_f to the frequency-divided observation signal x_{f,t} in each frequency band.
  • for example, the suppression unit 12 acquires and outputs the estimation signal of the target signal y_{f,t} by applying w̄_f to x̄_{f,t} as shown in equation (16).
  • in this embodiment, the convolutional beamformer w̄_f for calculating the weighted sum of the current signal and a past signal sequence having a predetermined delay at each time is determined such that the estimation signals increase the probability expressing the speech-likeness of the estimation signals based on the predetermined probability model, where the estimation signals are acquired by applying the convolutional beamformer w̄_f to the frequency-divided observation signals x_{f,t} .
  • This corresponds to optimizing noise suppression and reverberation suppression as a single system. In this embodiment, therefore, noise and reverberation can be suppressed more adequately than with the conventional methods.
  • a signal processing device 2 includes an estimation unit 21 and the suppression unit 12 .
  • the estimation unit 21 includes a matrix estimation unit 211 and a convolutional beamformer estimation unit 212 .
  • the estimation unit 21 of this embodiment acquires and outputs the convolutional beamformer w̄_f which minimizes a sum of values (the cost function C_3(w̄_f) of equation (13), for example) acquired by weighting the power of the estimation signals at each time belonging to a predetermined time interval by the reciprocal of the power σ_{f,t}² of the target signals, or the reciprocal of the estimated power σ_{f,t}² of the target signals, under the constraint condition in which “the target signals are not distorted as a result of applying the convolutional beamformer w̄_f to the frequency-divided observation signals x_{f,t}”.
  • the convolutional beamformer w̄_f is equivalent to a beamformer acquired by integrating a reverberation suppression filter F_{f,τ} for suppressing reverberation from the frequency-divided observation signal x_{f,t} and the instantaneous beamformer w_{f,0} for suppressing noise from a signal acquired by applying the reverberation suppression filter to the frequency-divided observation signal x_{f,t} .
  • the constraint condition is a condition in which, for example, “a value acquired by applying the instantaneous beamformer to a steering vector having, as elements, transfer functions relating to the direct sound and the initial reflected sound from the sound source to the pickup position of the acoustic signals, or to an estimated steering vector, which is an estimated vector of the steering vector, is a constant (w_{f,0}^H v_{f,0} is a constant)”.
  • the frequency-divided observation signals x f, t and the power or estimated power ⁇ f, t 2 of the target signals are input into the matrix estimation unit 211 .
  • the matrix estimation unit 211 acquires and outputs a weighted space-time covariance matrix R f for each frequency band on the basis of the frequency-divided observation signals x f, t and the power or estimated power ⁇ f, t 2 of the target signal.
  • the matrix estimation unit 211 acquires and outputs the weighted space-time covariance matrix R f in accordance with equation (14).
  • the steering vector or estimated steering vector ⁇ f, 0 (equation (4) or (5)) and the weighted space-time covariance matrix R f acquired in step S 211 are input into the convolutional beamformer estimation unit 212 .
  • the convolutional beamformer estimation unit 212 acquires and outputs the convolutional beamformer w̄_f on the basis of the weighted space-time covariance matrix R_f and the steering vector or estimated steering vector v_{f,0} .
  • for example, the convolutional beamformer estimation unit 212 acquires and outputs the convolutional beamformer w̄_f in accordance with equation (15).
  • This step is identical to the first embodiment, and therefore description thereof has been omitted.
  • in this embodiment, the weighted space-time covariance matrix R_f is acquired, and on the basis of the weighted space-time covariance matrix R_f and the steering vector or estimated steering vector v_{f,0} , the convolutional beamformer w̄_f is acquired.
  • This corresponds to optimizing noise suppression and reverberation suppression as a single system. In this embodiment, therefore, noise and reverberation can be suppressed more adequately than with the conventional methods.
  • a signal processing device 3 includes the estimation unit 21 , the suppression unit 12 , and a parameter estimation unit 33 .
  • the estimation unit 21 includes the matrix estimation unit 211 and the convolutional beamformer estimation unit 212 .
  • the parameter estimation unit 33 includes an initial setting unit 330 , a power estimation unit 331 , a reverberation suppression filter estimation unit 332 , a reverberation suppression filter application unit 333 , a steering vector estimation unit 334 , an instantaneous beamformer estimation unit 335 , an instantaneous beamformer application unit 336 , and a control unit 337 .
  • the frequency-divided observation signal x f, t is input into the initial setting unit 330 .
  • using the frequency-divided observation signal x_{f,t} , the initial setting unit 330 generates and outputs a provisional power σ_{f,t}² , which is a provisional value of the estimated power σ_{f,t}² of the target signal.
  • for example, the initial setting unit 330 generates and outputs the provisional power σ_{f,t}² as follows:
    $\sigma_{f,t}^{2} = \frac{x_{f,t}^{H}\, x_{f,t}}{M}$ (17)
  • the frequency-divided observation signals x_{f,t} and the newest provisional powers σ_{f,t}² are input into the reverberation suppression filter estimation unit 332 , which estimates and outputs the newest reverberation suppression filter on the basis of the weighted power minimization criterion of method 2 (step S 332 ).
  • the frequency-divided observation signal x f, t and the newest reverberation suppression filter F f, t acquired in step S 332 are input into the reverberation suppression filter application unit 333 .
  • the reverberation suppression filter application unit 333 acquires and outputs an estimation signal y′ f, t by applying the reverberation suppression filter F f, t to the frequency-divided observation signal x f, t in each frequency band.
  • the reverberation suppression filter application unit 333 sets z f, t , acquired in accordance with equation (8), as y′ f, t and outputs y′ f, t .
  • the newest estimation signal y′ f, t acquired in step S 333 is input into the steering vector estimation unit 334 .
  • the steering vector estimation unit 334 acquires and outputs a provisional steering vector ⁇ f, 0 , which is a provisional vector of the estimated steering vector, in each frequency band.
  • the steering vector estimation unit 334 acquires and outputs the provisional steering vector ⁇ f, 0 for the estimation signal y′ f, t in accordance with a steering vector estimation method described in NPL 1 and NPL 2.
  • for example, the steering vector estimation unit 334 treats the estimation signal y′_{f,t} as y_{f,t} and outputs a steering vector estimated therefrom in accordance with NPL 2.
  • a normalized vector acquired by normalizing the transfer function of each element so that the gain of a microphone having any one of the microphone numbers m_0 ∈ {1, . . . , M} becomes a constant g may be used as v_{f,0} (equation (5)).
  • the newest estimation signal y′ f, t acquired in step S 333 and the newest provisional steering vector ⁇ f, 0 acquired in step S 334 are input into the instantaneous beamformer estimation unit 335 .
  • the newest estimation signal y′ f, t acquired in step S 333 and the newest instantaneous beamformer w f, 0 acquired in step S 335 are input into the instantaneous beamformer application unit 336 .
  • the instantaneous beamformer application unit 336 acquires and outputs an estimation signal y′′ f, t by applying the instantaneous beamformer w f, 0 to the estimation signal y′ f, t in each frequency band.
  • for example, the instantaneous beamformer application unit 336 acquires and outputs the estimation signal y′′_{f,t} as follows:
    $y''_{f,t} = w_{f,0}^{H}\, y'_{f,t}$ (19)
  • the newest estimation signal y′′ f, t acquired in step S 336 is input into the power estimation unit 331 .
  • the power estimation unit 331 outputs the power of the estimation signal y′′ f, t as the provisional power ⁇ f, t 2 in each frequency band.
  • for example, the power estimation unit 331 generates and outputs the provisional power σ_{f,t}² as follows:
    $\sigma_{f,t}^{2} = |y''_{f,t}|^{2}$ (20)
  • the control unit 337 determines whether or not a termination condition is satisfied.
  • the termination condition may be, for example, that the number of repetitions of the processing of steps S 331 to S 336 exceeds a predetermined value, or that the variation in σ_{f,t}² or v_{f,0} after one pass of the processing of steps S 331 to S 336 falls to or below a predetermined value.
  • the processing returns to step S 332 .
  • the processing advances to step S 337 b.
  • in step S 337 b , the power estimation unit 331 outputs the σ_{f,t}² acquired most recently in step S 331 as the estimated power of the target signal, and the steering vector estimation unit 334 outputs the v_{f,0} acquired most recently in step S 334 as the estimated steering vector.
  • the estimated power ⁇ f, t 2 is input into the matrix estimation unit 211
  • the estimated steering vector ⁇ f, 0 is input into the convolutional beamformer estimation unit 212 .
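Steps S 330 to S 337 b form an alternating estimation loop. The skeleton below is one plausible numpy realization under stated assumptions: the steering-vector step uses a principal-eigenvector placeholder rather than the methods of NPL 1 and NPL 2 that the patent cites, and the reverberation step performs a single WPE-style update per iteration:

```python
import numpy as np

def estimate_parameters(X, d=3, L=10, n_iter=5, eps=1e-8):
    """Skeleton of the third embodiment's loop (steps S330-S337b) for one
    frequency band. Steering-vector step is a placeholder (principal
    eigenvector), not the NPL 1/NPL 2 estimators the patent cites.

    X : (M, T) frequency-divided observation signals x_{f,t}
    Returns (sigma2, v): estimated power sigma^2_{f,t} and steering vector.
    """
    M, T = X.shape
    # S330: provisional power per eq. (17).
    sigma2 = np.einsum("mt,mt->t", X.conj(), X).real / M
    # Stacked delayed observations for the prediction filter.
    Xbar = np.zeros((M * L, T), dtype=X.dtype)
    for tau in range(L):
        Xbar[tau * M:(tau + 1) * M, d + tau:] = X[:, :T - d - tau]
    for _ in range(n_iter):                      # S337a: repeat until done
        w2 = np.maximum(sigma2, eps)
        # S332-S333: estimate and apply a reverberation suppression filter
        # (one WPE-style weighted least-squares update).
        F = np.linalg.solve((Xbar / w2) @ Xbar.conj().T + eps * np.eye(M * L),
                            (Xbar / w2) @ X.conj().T)
        Yp = X - F.conj().T @ Xbar               # y'_{f,t}
        # S334: provisional steering vector (placeholder estimator).
        cov = Yp @ Yp.conj().T / T
        v = np.linalg.eigh(cov)[1][:, -1]
        # S335: instantaneous MPDR-style beamformer on y'_{f,t}.
        Rinv_v = np.linalg.solve(cov + eps * np.eye(M), v)
        w0 = Rinv_v / (v.conj() @ Rinv_v)
        # S336: apply beamformer, eq. (19); S331: power update, eq. (20).
        Ypp = w0.conj() @ Yp                     # y''_{f,t}
        sigma2 = np.abs(Ypp) ** 2
    return sigma2, v                             # S337b: final estimates
```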
  • in this embodiment, the steering vector is estimated on the basis of the frequency-divided observation signal x_{f,t} while the reverberation suppression, the instantaneous beamformer, and the power estimate are refined alternately; because the steering vector is estimated from a signal in which reverberation and noise have been suppressed, the estimation precision improves, and the precision of the estimated steering vector can be improved.
  • a signal processing device 4 includes the estimation unit 21 , the suppression unit 12 , and a parameter estimation unit 43 .
  • the estimation unit 21 includes the matrix estimation unit 211 and the convolutional beamformer estimation unit 212 .
  • the parameter estimation unit 43 includes a reverberation suppression unit 431 and a steering vector estimation unit 432 .
  • the fourth embodiment differs from the first to third embodiments in that before generating the estimated steering vector, the reverberation component of the frequency-divided observation signal x f, t is suppressed.
  • the frequency-divided observation signal x f, t is input into the reverberation suppression unit 431 of the parameter estimation unit 43 ( FIG. 7 ).
  • the reverberation suppression unit 431 acquires and outputs a frequency-divided reverberation-suppressed signal u f, t in which the reverberation component of the frequency-divided observation signal x f, t has been suppressed (preferably, in which the reverberation component of the frequency-divided observation signal x f, t has been removed).
  • the reverberation suppression unit 431 acquires and outputs the frequency-divided reverberation-suppressed signal u f, t in which the reverberation component of the frequency-divided observation signal x f, t has been suppressed using a method described in reference document 1.
  • the frequency-divided reverberation-suppressed signal u f, t acquired by the reverberation suppression unit 431 is input into the steering vector estimation unit 432 .
  • using the frequency-divided reverberation-suppressed signal u_{f,t} as input, the steering vector estimation unit 432 generates and outputs an estimated steering vector serving as an estimated vector of the steering vector.
  • a steering vector estimation processing method of acquiring an estimated steering vector using a frequency-divided time series signal as input is well-known.
  • the steering vector estimation unit 432 acquires and outputs the estimated steering vector ⁇ f, 0 by using the frequency-divided reverberation-suppressed signal u f, t as the input of a desired type of steering vector estimation processing.
  • there are no limitations on the steering vector estimation processing method; for example, the methods described above in NPL 1 and NPL 2, the methods described in reference documents 2 and 3, and so on may be used.
  • the estimated steering vector ⁇ f, 0 acquired by the steering vector estimation unit 432 is input into the convolutional beamformer estimation unit 212 .
  • the convolutional beamformer estimation unit 212 performs the processing of step S 212 , described in the second embodiment, using the estimated steering vector ⁇ f, 0 and the weighted space-time covariance matrix R f acquired in step S 211 . All other processing is as described in the first and second embodiments.
  • the estimated steering vector of each time frame number t can be calculated from frequency-divided observation signals x f, t input successively online, for example.
  • a signal processing device 5 includes the estimation unit 21 , the suppression unit 12 , and a parameter estimation unit 53 .
  • the estimation unit 21 includes the matrix estimation unit 211 and the convolutional beamformer estimation unit 212 .
  • the parameter estimation unit 53 includes a steering vector estimation unit 532 .
  • the steering vector estimation unit 532 includes an observation signal covariance matrix updating unit 532 a , a main component vector updating unit 532 b , a steering vector updating unit 532 c (the steering vector estimation unit), an inverse noise covariance matrix updating unit 532 d , and a noise covariance matrix updating unit 532 e .
  • the fifth embodiment differs from the first to third embodiments only in that the estimated steering vector is generated by successive processing.
  • the frequency-divided observation signal x f, t which is a frequency-divided time series signal, is input into the steering vector estimation unit 532 ( FIGS. 7 and 8 ).
  • <<Processing of Observation Signal Covariance Matrix Updating Unit 532 a (Step S 532 a )>>
  • the observation signal covariance matrix updating unit 532 a ( FIG. 8 ) acquires and outputs a spatial covariance matrix Φ_{x,f,t} of the frequency-divided observation signal x_{f,t} (a spatial covariance matrix of a frequency-divided observation signal belonging to a first time interval), which is based on the frequency-divided observation signal x_{f,t} (the frequency-divided observation signal belonging to the first time interval) and a spatial covariance matrix Φ_{x,f,t-1} of a frequency-divided observation signal x_{f,t-1} (a spatial covariance matrix of a frequency-divided observation signal belonging to a second time interval that is further in the past than the first time interval).
  • for example, the observation signal covariance matrix updating unit 532 a acquires and outputs a linear sum of a covariance matrix x_{f,t} x_{f,t}^H of the frequency-divided observation signal x_{f,t} and the spatial covariance matrix Φ_{x,f,t-1} as the spatial covariance matrix Φ_{x,f,t} of the frequency-divided observation signal x_{f,t} .
  • for example, the observation signal covariance matrix updating unit 532 a acquires and outputs the spatial covariance matrix Φ_{x,f,t} in accordance with equation (21) shown below:
    $\Phi_{x,f,t} = \alpha\,\Phi_{x,f,t-1} + x_{f,t}\, x_{f,t}^{H}$ (21)
  • here, α is an oblivion (forgetting) coefficient, and is a real number belonging to a range of 0 < α < 1, for example.
  • an initial matrix Φ_{x,f,0} of the spatial covariance matrix Φ_{x,f,t-1} may be set as desired.
  • for example, an M × M-dimensional unit matrix may be set as the initial matrix Φ_{x,f,0} of the spatial covariance matrix Φ_{x,f,t-1} .
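As a sketch, one update step of equation (21), with α written for the oblivion coefficient, is a one-liner in numpy; the function name and shapes are illustrative assumptions:

```python
import numpy as np

def update_obs_covariance(Phi_prev, x, alpha=0.95):
    """Eq. (21): exponentially weighted spatial covariance update of the
    observation signal, with oblivion coefficient alpha (0 < alpha < 1).

    Phi_prev : (M, M) spatial covariance of the second (past) time interval
    x        : (M,)   frequency-divided observation x_{f,t}
    """
    return alpha * Phi_prev + np.outer(x, x.conj())
```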
  • <<Processing of Inverse Noise Covariance Matrix Updating Unit 532 d (Step S 532 d )>>
  • the frequency-divided observation signal x f, t and mask information ⁇ f, t (n) are input into the inverse noise covariance matrix updating unit 532 d .
  • the mask information ⁇ f, t (n) is information expressing the ratio of the noise component included in the frequency-divided observation signal x f, t at a time-frequency point corresponding to the time frame number t and the frequency band number f.
  • the mask information ⁇ f, t (n) expresses the occupancy probability of the noise component included in the frequency-divided observation signal x f, t at a time-frequency point corresponding to the time frame number t and the frequency band number f.
  • Methods of estimating the mask information ⁇ f, t (n) include, for example, an estimation method using a complex Gaussian mixture model (CGMM) (reference document 4, for example), an estimation method using a neural network (reference document 5, for example), an estimation method integrating these methods (reference document 6 and reference document 7, for example), and so on.
  • the mask information ⁇ f, t (n) may be estimated in advance and stored in a storage device, not illustrated in the figures, or may be estimated successively. Note that the upper right superscript “(n)” of “ ⁇ f, t (n) ” should be written directly above the lower right subscript “f, t”, but due to notation limitations has been written to the upper right of “f, t”.
  • the inverse noise covariance matrix updating unit 532 d acquires and outputs an inverse noise covariance matrix Φ_{n,f,t}^{-1} (an inverse noise covariance matrix of the frequency-divided observation signal belonging to the first time interval) on the basis of the frequency-divided observation signal x_{f,t} (the frequency-divided observation signal belonging to the first time interval), the mask information γ_{f,t}^{(n)} (mask information belonging to the first time interval), and an inverse noise covariance matrix Φ_{n,f,t-1}^{-1} (an inverse noise covariance matrix of the frequency-divided observation signal belonging to the second time interval that is further in the past than the first time interval).
  • for example, the inverse noise covariance matrix updating unit 532 d acquires and outputs the inverse noise covariance matrix Φ_{n,f,t}^{-1} in accordance with equation (22), shown below, using the Woodbury formula:
    $\Phi_{n,f,t}^{-1} = \frac{1}{\alpha}\left(\Phi_{n,f,t-1}^{-1} - \frac{\gamma_{f,t}^{(n)}\,\Phi_{n,f,t-1}^{-1}\, x_{f,t}\, x_{f,t}^{H}\,\Phi_{n,f,t-1}^{-1}}{\alpha + \gamma_{f,t}^{(n)}\, x_{f,t}^{H}\,\Phi_{n,f,t-1}^{-1}\, x_{f,t}}\right)$ (22)
  • here, α is an oblivion coefficient, and is a real number belonging to a range of 0 < α < 1, for example.
  • an initial matrix Φ_{n,f,0}^{-1} of the inverse noise covariance matrix Φ_{n,f,t-1}^{-1} may be set as desired.
  • for example, an M × M-dimensional unit matrix may be set as the initial matrix Φ_{n,f,0}^{-1} of the inverse noise covariance matrix Φ_{n,f,t-1}^{-1} .
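A sketch of the Woodbury update of equation (22), under the same notational assumptions (α for the oblivion coefficient; shapes illustrative):

```python
import numpy as np

def update_inv_noise_covariance(Phi_inv_prev, x, gamma, alpha=0.95):
    """Eq. (22): rank-1 Woodbury update of the inverse noise covariance,
    weighted by the noise mask gamma = gamma_{f,t}^{(n)} in [0, 1].

    Phi_inv_prev : (M, M) inverse noise covariance at the past time interval
    x            : (M,)   frequency-divided observation x_{f,t}
    """
    Pix = Phi_inv_prev @ x                           # Phi^{-1} x
    denom = alpha + gamma * (x.conj() @ Pix).real    # alpha + gamma x^H Phi^{-1} x
    return (Phi_inv_prev - gamma * np.outer(Pix, Pix.conj()) / denom) / alpha
```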
  • the spatial covariance matrix Φ_{x,f,t} acquired by the observation signal covariance matrix updating unit 532 a and the inverse noise covariance matrix Φ_{n,f,t}^{-1} acquired by the inverse noise covariance matrix updating unit 532 d are input into the main component vector updating unit 532 b .
  • the main component vector updating unit 532 b acquires and outputs a main component vector ṽ_{f,t} (a main component vector of the first time interval) relating to Φ_{n,f,t}^{-1} Φ_{x,f,t} (the product of an inverse matrix of the noise covariance matrix of the frequency-divided observation signal and the spatial covariance matrix of the frequency-divided observation signal belonging to the first time interval) by using a power method, on the basis of the inverse noise covariance matrix Φ_{n,f,t}^{-1} , the spatial covariance matrix Φ_{x,f,t} , and a main component vector ṽ_{f,t-1} (a main component vector of the second time interval).
  • for example, the main component vector updating unit 532 b acquires and outputs a main component vector ṽ_{f,t} based on Φ_{n,f,t}^{-1} Φ_{x,f,t} ṽ_{f,t-1} .
  • for example, the main component vector updating unit 532 b acquires and outputs the main component vector ṽ_{f,t} in accordance with equations (23) and (24) shown below:
    $\tilde{v}'_{f,t} = \Phi_{n,f,t}^{-1}\,\Phi_{x,f,t}\,\tilde{v}_{f,t-1}$ (23)
    $\tilde{v}_{f,t} = \frac{\tilde{v}'_{f,t}}{\tilde{v}'^{\,\mathrm{ref}}_{f,t}}$ (24)
  • ṽ′^{ref}_{f,t} expresses the element corresponding to a predetermined microphone (a reference microphone ref) serving as a reference, among the M elements of the vector ṽ′_{f,t} acquired from equation (23).
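One power-method step of equations (23) and (24) can be sketched as follows; the reference-microphone index ref is an assumed parameter:

```python
import numpy as np

def update_principal_vector(Phi_n_inv, Phi_x, v_prev, ref=0):
    """Eqs. (23)-(24): one power-method step tracking the principal
    component of Phi_n^{-1} Phi_x, normalized at the reference microphone.
    """
    v = Phi_n_inv @ (Phi_x @ v_prev)   # eq. (23)
    return v / v[ref]                  # eq. (24)
```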
  • <<Processing of Noise Covariance Matrix Updating Unit 532 e (Step S 532 e )>>
  • using the frequency-divided observation signal x_{f,t} (the frequency-divided observation signal belonging to the first time interval) and the mask information γ_{f,t}^{(n)} (the mask information of the first time interval) as input, the noise covariance matrix updating unit 532 e acquires and outputs a noise covariance matrix Φ_{n,f,t} of the frequency-divided observation signal x_{f,t} (a noise covariance matrix of the frequency-divided observation signal belonging to the first time interval), which is based on the frequency-divided observation signal x_{f,t} , the mask information γ_{f,t}^{(n)} , and a noise covariance matrix Φ_{n,f,t-1} (a noise covariance matrix of the frequency-divided observation signal belonging to the second time interval that is further in the past than the first time interval).
  • for example, the noise covariance matrix updating unit 532 e acquires and outputs the linear sum of a product γ_{f,t}^{(n)} x_{f,t} x_{f,t}^H of the covariance matrix x_{f,t} x_{f,t}^H of the frequency-divided observation signal x_{f,t} and the mask information γ_{f,t}^{(n)} , and the noise covariance matrix Φ_{n,f,t-1} , as the noise covariance matrix Φ_{n,f,t} of the frequency-divided observation signal x_{f,t} .
  • for example, the noise covariance matrix updating unit 532 e acquires and outputs the noise covariance matrix Φ_{n,f,t} in accordance with equation (25) shown below:
    $\Phi_{n,f,t} = \alpha\,\Phi_{n,f,t-1} + \gamma_{f,t}^{(n)}\, x_{f,t}\, x_{f,t}^{H}$ (25)
  • here, α is an oblivion coefficient, and is a real number belonging to a range of 0 < α < 1, for example.
  • <<Processing of Steering Vector Updating Unit 532 c (Step S 532 c )>>
  • using the main component vector ṽ_{f,t} (the main component vector of the first time interval) acquired by the main component vector updating unit 532 b and the noise covariance matrix Φ_{n,f,t} acquired by the noise covariance matrix updating unit 532 e as input, the steering vector updating unit 532 c acquires and outputs an estimated steering vector v_{f,t} (an estimated steering vector of the first time interval) on the basis thereof.
  • for example, the steering vector updating unit 532 c acquires and outputs an estimated steering vector v_{f,t} based on Φ_{n,f,t} ṽ_{f,t} .
  • for example, the steering vector updating unit 532 c acquires and outputs the estimated steering vector v_{f,t} in accordance with equations (26) and (27) shown below:
    $v'_{f,t} = \Phi_{n,f,t}\,\tilde{v}_{f,t}$ (26)
    $v_{f,t} = \frac{v'_{f,t}}{v'^{\,\mathrm{ref}}_{f,t}}$ (27)
  • v′^{ref}_{f,t} expresses the element corresponding to the reference microphone ref, among the M elements of the vector v′_{f,t} acquired from equation (26).
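Steps S 532 e and S 532 c can be sketched together: the noise covariance update of equation (25) followed by the steering vector estimate of equations (26) and (27). Names are illustrative:

```python
import numpy as np

def update_steering_vector(Phi_n_prev, x, gamma, v_tilde, alpha=0.95, ref=0):
    """Steps S532e and S532c: noise covariance update of eq. (25), then the
    steering vector estimate of eqs. (26)-(27).

    Phi_n_prev : (M, M) noise covariance at the past time interval
    x : (M,) observation x_{f,t};  gamma : scalar noise mask
    v_tilde : (M,) main component vector from eqs. (23)-(24)
    Returns (Phi_n, v) with v the estimated steering vector v_{f,t}.
    """
    Phi_n = alpha * Phi_n_prev + gamma * np.outer(x, x.conj())  # eq. (25)
    v = Phi_n @ v_tilde                                         # eq. (26)
    return Phi_n, v / v[ref]                                    # eq. (27)
```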
  • the estimated steering vector ⁇ f, t acquired by the steering vector estimation unit 532 is input into the convolutional beamformer estimation unit 212 .
  • the convolutional beamformer estimation unit 212 treats the estimated steering vector ⁇ f, t as ⁇ f, 0 , and performs the processing of step S 212 , described in the second embodiment, using the estimated steering vector ⁇ f, t and the weighted space-time covariance matrix R f acquired in step S 211 . All other processing is as described in the first and second embodiments.
  • ⁇ f, t 2 input into the matrix estimation unit 211 either the provisional power generated as illustrated in equation (17) or the estimated power ⁇ f, t 2 generated as described in the third embodiment, for example, may be used.
  • in the example described above, the inverse noise covariance matrix updating unit 532 d adaptively updates the inverse noise covariance matrix Φ_{n,f,t}^{-1} at each time point corresponding to the time frame number t by using the frequency-divided observation signal x_{f,t} and the mask information γ_{f,t}^{(n)} .
  • alternatively, the inverse noise covariance matrix updating unit 532 d may acquire and output the inverse noise covariance matrix Φ_{n,f,t}^{-1} by using a frequency-divided observation signal x_{f,t} of a time interval in which the noise component either exists alone or is dominant, without using the mask information γ_{f,t}^{(n)} .
  • for example, the inverse noise covariance matrix updating unit 532 d may output, as the inverse noise covariance matrix Φ_{n,f,t}^{-1} , an inverse matrix of the temporal average of x_{f,t} x_{f,t}^H with respect to a frequency-divided observation signal x_{f,t} of a time interval in which the noise component either exists alone or is dominant.
  • the inverse noise covariance matrix Φ_{n,f,t}^{-1} acquired in this manner is used continuously in the frames having the respective time frame numbers t.
  • similarly, the noise covariance matrix updating unit 532 e may acquire and output the noise covariance matrix Φ_{n,f,t} of the frequency-divided observation signal x_{f,t} using a frequency-divided observation signal x_{f,t} of a time interval in which the noise component either exists alone or is dominant, without using the mask information γ_{f,t}^{(n)} .
  • for example, the noise covariance matrix updating unit 532 e may output, as the noise covariance matrix Φ_{n,f,t} , the temporal average of x_{f,t} x_{f,t}^H with respect to a frequency-divided observation signal x_{f,t} of a time interval in which the noise component either exists alone or is dominant.
  • the noise covariance matrix Φ_{n,f,t} acquired in this manner is used continuously in the frames having the respective time frame numbers t.
  • in the above description, a case in which the first time interval is the frame having the time frame number t and the second time interval is the frame having the time frame number t-1 was used as an example, but the present invention is not limited thereto.
  • a frame having a time frame number other than the time frame number t may be set as the first time interval, and a time frame that is further in the past than the first time interval and has a time frame number other than the time frame number t-1 may be set as the second time interval.
  • the steering vector estimation unit 532 acquires and outputs the estimated steering vector ⁇ f, t by successive processing using the frequency-divided observation signal x f, t as input. As noted in the fourth embodiment, however, by estimating the steering vector after suppressing reverberation from the frequency-divided observation signal x f, t , the estimation precision is improved.
  • in the sixth embodiment, a configuration will be described in which the steering vector estimation unit acquires and outputs the estimated steering vector v_{f,t} by successive processing, as described in the fifth embodiment, after reverberation has been suppressed from the frequency-divided observation signal x_{f,t} .
  • a signal processing device 6 includes the estimation unit 21 , the suppression unit 12 , and a parameter estimation unit 63 .
  • the parameter estimation unit 63 includes the reverberation suppression unit 431 and a steering vector estimation unit 632 .
  • the sixth embodiment differs from the fifth embodiment in that before generating the estimated steering vector, the reverberation component of the frequency-divided observation signal x f, t is suppressed.
  • the reverberation suppression unit 431 acquires and outputs the frequency-divided reverberation-suppressed signal u f, t in which the reverberation component of the frequency-divided observation signal x f, t has been suppressed (preferably, in which the reverberation component of the frequency-divided observation signal x f, t has been removed).
  • the frequency-divided reverberation-suppressed signal u f, t is input into the steering vector estimation unit 632 .
  • the processing of the steering vector estimation unit 632 is identical to the processing of the steering vector estimation unit 532 of the fifth embodiment, except that the frequency-divided reverberation-suppressed signal u_{f,t} is input into the steering vector estimation unit 632 in place of the frequency-divided observation signal x_{f,t} and is used throughout in its place.
  • All other processing is identical to the fifth embodiment and the modified example thereof. More specifically, the frequency-divided reverberation-suppressed signal u f, t , which is a frequency-divided time series signal, is input into the steering vector estimation unit 632 .
  • the observation signal covariance matrix updating unit 532 a acquires and outputs the spatial covariance matrix Φ_{x,f,t} of the frequency-divided reverberation-suppressed signal u_{f,t} belonging to the first time interval, which is based on the frequency-divided reverberation-suppressed signal u_{f,t} belonging to the first time interval and the spatial covariance matrix Φ_{x,f,t-1} of a frequency-divided reverberation-suppressed signal u_{f,t-1} belonging to the second time interval that is further in the past than the first time interval.
  • the main component vector updating unit 532 b acquires and outputs the main component vector ṽ_{f,t} of the first time interval with respect to the product Φ_{n,f,t}^{-1} Φ_{x,f,t} of the inverse matrix Φ_{n,f,t}^{-1} of the noise covariance matrix of the frequency-divided reverberation-suppressed signal and the spatial covariance matrix Φ_{x,f,t} of the frequency-divided reverberation-suppressed signal belonging to the first time interval, on the basis of the inverse matrix Φ_{n,f,t}^{-1} of the noise covariance matrix of the frequency-divided reverberation-suppressed signal u_{f,t} , the spatial covariance matrix Φ_{x,f,t} of the frequency-divided reverberation-suppressed signal belonging to the first time interval, and the main component vector ṽ_{f,t-1} of the second time interval.
  • the steering vector updating unit 532 c acquires and outputs the estimated steering vector v_{f,t} of the first time interval on the basis of the noise covariance matrix of the frequency-divided reverberation-suppressed signal u_{f,t} and the main component vector ṽ_{f,t} of the first time interval.
  • a method of estimating the convolutional beamformer by successive processing will be described.
  • the convolutional beamformer of each time frame number t can be estimated and the estimation signal of the target signal y f, t can be acquired from frequency-divided observation signals x f, t input successively online, for example.
  • a signal processing device 7 includes an estimation unit 71 , a suppression unit 72 , and the parameter estimation unit 53 .
  • the frequency-divided observation signal x f, t is input into the parameter estimation unit 53 ( FIGS. 6 and 7 ).
  • the steering vector estimation unit 532 ( FIG. 8 ) of the parameter estimation unit 53 acquires and outputs the estimated steering vector ⁇ f, t by successive processing using the frequency-divided observation signal x f, t as input (step S 532 ).
  • the estimated steering vector v_{f,t} is represented by the following M-dimensional vector:
    $v_{f,t} = [v_{f,t}^{(1)}, v_{f,t}^{(2)}, \ldots, v_{f,t}^{(M)}]^{T}$
  • v_{f,t}^{(m)} represents the element corresponding to the microphone having the microphone number m, among the M elements of the estimated steering vector v_{f,t} .
  • the estimated steering vector ⁇ f, t acquired by the steering vector estimation unit 532 is input into the convolutional beamformer estimation unit 712 .
  • the frequency-divided observation signal x f, t and the power or estimated power ⁇ f, t 2 of the target signal are input into the matrix estimation unit 711 ( FIG. 6 ).
  • ⁇ f, t 2 input into the matrix estimation unit 711 either the provisional power generated as illustrated in equation (17) or the estimated power ⁇ f, t 2 generated as described in the third embodiment, for example, may be used.
  • using the stacked frequency-divided observation signal x̄_{f,t} (see equation (11)), the power or estimated power σ_{f,t}² of the target signal (the power or estimated power of the frequency-divided observation signal belonging to the first time interval), and the inverse matrix R_{f,t-1}^{-1} of the space-time covariance matrix of the second time interval as input, the matrix estimation unit 711 estimates and outputs the inverse matrix R_{f,t}^{-1} of the space-time covariance matrix (the inverse matrix of the space-time covariance matrix of the first time interval).
  • an example of the space-time covariance matrix is $R_{f,t} = \sum_{\tau \le t} \alpha^{t-\tau}\, \frac{\bar{x}_{f,\tau}\,\bar{x}_{f,\tau}^{H}}{\sigma_{f,\tau}^{2}}$ , the exponentially weighted counterpart of equation (14).
  • for example, the matrix estimation unit 711 estimates and outputs the inverse matrix R_{f,t}^{-1} in accordance with equations (28) and (29) shown below:
    $k_{f,t} = \frac{R_{f,t-1}^{-1}\,\bar{x}_{f,t}}{\alpha\,\sigma_{f,t}^{2} + \bar{x}_{f,t}^{H}\, R_{f,t-1}^{-1}\,\bar{x}_{f,t}}$ (28)
    $R_{f,t}^{-1} = \frac{1}{\alpha}\left(R_{f,t-1}^{-1} - k_{f,t}\,\bar{x}_{f,t}^{H}\, R_{f,t-1}^{-1}\right)$ (29)
  • k_{f,t} in equation (28) is an (L+1)M-dimensional vector, and the inverse matrix of equation (29) is an (L+1)M × (L+1)M matrix.
  • here, α is an oblivion coefficient, and is a real number belonging to a range of 0 < α < 1, for example.
  • an initial matrix R_{f,0}^{-1} of the inverse matrix of the space-time covariance matrix may be set as desired; an example of the initial matrix is an (L+1)M × (L+1)M unit matrix.
  • using the inverse matrix R_{f,t}^{-1} and the estimated steering vector v_{f,t} as input, the convolutional beamformer estimation unit 712 acquires and outputs the convolutional beamformer w̄_{f,t} (the convolutional beamformer of the first time interval) on the basis thereof.
  • for example, the convolutional beamformer estimation unit 712 acquires and outputs the convolutional beamformer w̄_{f,t} in accordance with equation (30), shown below:
    $\bar{w}_{f,t} = \frac{R_{f,t}^{-1}\,\bar{v}_{f,t}}{\bar{v}_{f,t}^{H}\, R_{f,t}^{-1}\,\bar{v}_{f,t}}$ (30)
  • here $\bar{v}_{f,t} = [\bar{v}_{f,t}^{(1)T}, \bar{v}_{f,t}^{(2)T}, \ldots, \bar{v}_{f,t}^{(M)T}]^{T}$ , and each $\bar{v}_{f,t}^{(m)} = [g_f\, v_{f,t}^{(m)}, 0, \ldots, 0]^{T}$ is an (L+1)-dimensional vector.
  • g_f is a scalar constant other than 0.
  • the frequency-divided observation signal x_{f,t} and the convolutional beamformer w̄_{f,t} acquired by the convolutional beamformer estimation unit 712 are input into the suppression unit 72 .
  • the suppression unit 72 acquires and outputs the estimation signal of the target signal y_{f,t} by applying the convolutional beamformer w̄_{f,t} to the frequency-divided observation signal x_{f,t} at each time frame number t and frequency band number f.
  • for example, the suppression unit 72 acquires and outputs the estimation signal of the target signal y_{f,t} in accordance with equation (31) shown below:
    $y_{f,t} = \bar{w}_{f,t}^{H}\,\bar{x}_{f,t}$ (31)
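One frame of the seventh embodiment's online processing, combining equations (28) to (31), can be sketched as follows. Equation (28) is used here in the standard recursive least squares gain form consistent with equation (29); names and shapes are assumptions:

```python
import numpy as np

def online_wpd_step(R_inv_prev, xbar, vbar, sigma2, alpha=0.99):
    """One frame of the seventh embodiment: update the inverse weighted
    space-time covariance (eqs. (28)-(29)), form the convolutional
    beamformer (eq. (30)), and apply it (eq. (31)).

    R_inv_prev : ((L+1)M, (L+1)M) inverse space-time covariance at t-1
    xbar  : ((L+1)M,) stacked observation xbar_{f,t} of eq. (11)
    vbar  : ((L+1)M,) zero-padded steering vector of eq. (30)
    sigma2 : scalar power sigma^2_{f,t} of the target signal
    Returns (R_inv, y): updated inverse matrix and target estimate y_{f,t}.
    """
    Rx = R_inv_prev @ xbar
    k = Rx / (alpha * sigma2 + (xbar.conj() @ Rx).real)            # eq. (28)
    R_inv = (R_inv_prev - np.outer(k, xbar.conj() @ R_inv_prev)) / alpha  # eq. (29)
    Rv = R_inv @ vbar
    wbar = Rv / (vbar.conj() @ Rv)                                 # eq. (30)
    return R_inv, wbar.conj() @ xbar                               # eq. (31)
```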
  • the parameter estimation unit 53 of the signal processing device 7 according to the seventh embodiment may be replaced by the parameter estimation unit 63.
  • in this case, the parameter estimation unit 63, rather than the parameter estimation unit 53, may acquire and output the estimated steering vector ν̂_{f,t} by successive processing, as described in the sixth embodiment, using the frequency-divided observation signal x_{f,t} as input.
  • in the above description, an example was used in which the first time interval is the frame having the time frame number t and the second time interval is the frame having the time frame number t−1; however, the present invention is not limited thereto.
  • a frame having a time frame number other than the time frame number t may be set as the first time interval, and a time frame that is further in the past than the first time interval and has a time frame number other than the time frame number t−1 may be set as the second time interval.
  • Equation (32) shows an example of the block matrix B f .
  • ν̃_{f,0} is an (M−1)-dimensional column vector constituted by the elements of the steering vector ν_{f,0} or the estimated steering vector ν̂_{f,0} that correspond to microphones other than the reference microphone ref,
  • ν_{f,0}^{ref} is the element of ν_{f,0} that corresponds to the reference microphone ref, and
  • I_{M−1} is an (M−1)×(M−1)-dimensional unit matrix.
  • g_f is set as a scalar constant other than 0, a_{f,0} is set as an (M−1)-dimensional modified instantaneous beamformer, and the instantaneous beamformer w_{f,0} is expressed as the sum of a constant multiple g_f ν_{f,0} of the steering vector ν_{f,0} (or a constant multiple g_f ν̂_{f,0} of the estimated steering vector ν̂_{f,0}) and the product B_f a_{f,0} of the block matrix B_f, which corresponds to the orthogonal complement of the steering vector ν_{f,0} or the estimated steering vector ν̂_{f,0}, and the modified instantaneous beamformer a_{f,0}.
  • when the instantaneous beamformer is expressed as in equation (33), the constraint condition that "w_{f,0}^H ν_{f,0} is a constant" is satisfied for any modified instantaneous beamformer a_{f,0}. It is therefore evident that the instantaneous beamformer w_{f,0} may be defined as illustrated in equation (33).
  • the convolutional beamformer is estimated using the optimal solution of the convolutional beamformer acquired when the instantaneous beamformer w f, 0 is defined as illustrated in equation (33). This will be described in detail below.
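The role of the block matrix can be checked numerically. The sketch below uses one standard construction consistent with the description of ν̃_{f,0}, ν_{f,0}^{ref}, and I_{M−1} (the exact layout of equation (32) is an assumption here) and confirms that B_f^H ν_{f,0} = 0, so that w_{f,0}^H ν_{f,0} remains constant regardless of the modified instantaneous beamformer a_{f,0}.

    import numpy as np

    M, ref = 4, 0                        # number of microphones, reference index
    rng = np.random.default_rng(0)
    nu = rng.standard_normal(M) + 1j * rng.standard_normal(M)   # steering vector

    # Assumed construction: the reference-microphone row cancels the remaining
    # elements, and the other rows form an (M-1) x (M-1) unit matrix.
    nu_tilde = np.delete(nu, ref)                  # elements other than ref
    B = np.zeros((M, M - 1), dtype=complex)
    B[ref, :] = -(nu_tilde / nu[ref]).conj()       # reference-microphone row
    B[np.arange(M) != ref, :] = np.eye(M - 1)      # unit-matrix rows

    # B spans the orthogonal complement of nu: B^H nu = 0.
    assert np.allclose(B.conj().T @ nu, 0)

    # Hence w = g nu + B a satisfies w^H nu = conj(g) ||nu||^2 for any a.
    g = 0.7 + 0.2j
    a = rng.standard_normal(M - 1) + 1j * rng.standard_normal(M - 1)
    w = g * nu + B @ a
    assert np.isclose(w.conj() @ nu, np.conj(g) * (nu.conj() @ nu))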
  • a signal processing device 8 includes an estimation unit 81 , a suppression unit 82 , and a parameter estimation unit 83 .
  • the estimation unit 81 includes a matrix estimation unit 811 , a convolutional beamformer estimation unit 812 , an initial beamformer application unit 813 , and a block unit 814 .
  • the parameter estimation unit 83 ( FIG. 9 ), using the frequency-divided observation signal x f, t as input, acquires the estimated steering vector by an identical method to any of the parameter estimation units 33 , 43 , 53 , 63 described above, and outputs the acquired estimated steering vector as ⁇ f, 0 .
  • the output estimated steering vector ⁇ f, 0 is transmitted to the initial beamformer application unit 813 and the block unit 814 .
  • the estimated steering vector ⁇ f, 0 and the frequency-divided observation signal x f, t are input into the initial beamformer application unit 813 .
  • the initial beamformer application unit 813 acquires and outputs an initial beamformer output z f, t (an initial beamformer output of the first time interval) based on the estimated steering vector ⁇ f, 0 and the frequency-divided observation signal x f, t (the frequency-divided observation signal belonging to the first time interval).
  • for example, the initial beamformer application unit 813 acquires and outputs an initial beamformer output z_{f,t} based on a constant multiple of the estimated steering vector ν̂_{f,0} and the frequency-divided observation signal x_{f,t}.
  • the initial beamformer application unit 813 acquires and outputs the initial beamformer output z f, t in accordance with equation (34) shown below, for example.
$z_{f,t} = (g_f\,\hat{\nu}_{f,0})^{H}\,x_{f,t}$  (34)
  • the output initial beamformer output z f, t is transmitted to the convolutional beamformer estimation unit 812 and the suppression unit 82 .
  • the estimated steering vector ⁇ f, 0 and the frequency-divided observation signal x f, t are input into the block unit 814 .
  • when L = 0, the right side of equation (35) becomes a vector in which the number of elements is 0 (an empty vector), whereby equation (36) reduces to equation (36A), shown below.
$\overline{\overline{x}}_{f,t} = B_f^{H}\,x_{f,t}$  (36A)
  • the vector $\overline{\overline{x}}_{f,t}$ acquired by the block unit 814 and the power or estimated power σ²_{f,t} of the target signal are input into the matrix estimation unit 811.
$\overline{\overline{w}}_f = \overline{\overline{R}}_f^{-1}\,\overline{\overline{x}}_{f,t}\,z_{f,t}^{H}$  (38)
$\overline{\overline{w}}_f = [a_{f,0}^{T}, \overline{\overline{w}}_f^{(1)T}, \overline{\overline{w}}_f^{(2)T}, \ldots, \overline{\overline{w}}_f^{(M)T}]^{T}$  (38A)
$\overline{\overline{w}}_f^{(m)} = [w_{f,d}^{(m)}, w_{f,d+1}^{(m)}, \ldots, w_{f,d+L-1}^{(m)}]^{T}$  (38B)
  • when L = 0, the right side of equation (38B) becomes a vector in which the number of elements is 0 (an empty vector), whereby equation (38A) reduces to the following.
$\overline{\overline{w}}_f = a_{f,0}$
  • this processing is equivalent to processing for acquiring and outputting the estimation signal of the target signal y_{f,t} by applying the convolutional beamformer $\bar{w}_f$ to the frequency-divided observation signal $\bar{x}_{f,t}$.
  • the suppression unit 82 acquires and outputs the estimation signal of the target signal y_{f,t} in accordance with equation (39), shown below.
$y_{f,t} = z_{f,t} + \overline{\overline{w}}_f^{H}\,\overline{\overline{x}}_{f,t}$  (39)
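For the L = 0 case, equation (39) states that y_{f,t} = (g_f ν_{f,0})^H x_{f,t} + a_{f,0}^H (B_f^H x_{f,t}), which by equation (33) equals w_{f,0}^H x_{f,t}. A small check of this identity, under the same assumed block-matrix construction as in the sketch above:

    import numpy as np

    rng = np.random.default_rng(1)
    M, ref = 4, 0
    nu = rng.standard_normal(M) + 1j * rng.standard_normal(M)
    B = np.zeros((M, M - 1), dtype=complex)
    B[ref, :] = -(np.delete(nu, ref) / nu[ref]).conj()
    B[np.arange(M) != ref, :] = np.eye(M - 1)

    g = 1.0                                   # scalar constant other than 0
    a = rng.standard_normal(M - 1) + 1j * rng.standard_normal(M - 1)
    x = rng.standard_normal(M) + 1j * rng.standard_normal(M)

    z = (g * nu).conj() @ x                   # equation (34)
    y = z + a.conj() @ (B.conj().T @ x)       # equation (39) with L = 0
    w = g * nu + B @ a                        # equation (33)
    assert np.isclose(y, w.conj() @ x)        # the same estimation signal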
  • a known steering vector ⁇ f, 0 acquired on the basis of actual measurement or the like may be input into the initial beamformer application unit 813 and the block unit 814 instead of the estimated steering vector ⁇ f, 0 acquired by the parameter estimation unit 83 .
  • the initial beamformer application unit 813 and the block unit 814 perform steps S 813 and S 814 , described above, using the steering vector ⁇ f, 0 instead of the estimated steering vector ⁇ f, 0 .
  • a method for executing convolutional beamformer estimation based on the eighth embodiment by successive processing will be described.
  • a signal processing device 9 includes an estimation unit 91 , a suppression unit 92 , and a parameter estimation unit 93 .
  • the estimation unit 91 includes an adaptive gain estimation unit 911 , a convolutional beamformer estimation unit 912 , a matrix estimation unit 915 , the initial beamformer application unit 813 , and the block unit 814 .
  • the parameter estimation unit 93 ( FIG. 10 ), using the frequency-divided observation signal x f, t as input, acquires and outputs the estimated steering vector ⁇ f, t by an identical method to either of the parameter estimation units 53 , 63 described above.
  • the output estimated steering vector ⁇ f, t is transmitted to the initial beamformer application unit 813 and the block unit 814 .
  • the estimated steering vector ⁇ f, t (the estimated steering vector of the first time interval) and the frequency-divided observation signal x f, t (the frequency-divided observation signal belonging to the first time interval) are input into the initial beamformer application unit 813 , and the initial beamformer application unit 813 acquires and outputs the initial beamformer output z f, t (the initial beamformer output of the first time interval) as described in the eighth embodiment using ⁇ f, t instead of ⁇ f, 0 .
  • the output initial beamformer output z f, t is transmitted to the suppression unit 92 .
  • the suppression unit 92 acquires and outputs the estimation signal of the target signal y f, t in accordance with equation (40) below.
$y_{f,t} = z_{f,t} + \overline{\overline{w}}_{f,t-1}^{H}\,\overline{\overline{x}}_{f,t}$  (40)
  • as the power or estimated power σ²_{f,t} input into the adaptive gain estimation unit 911, either the provisional power generated as illustrated in equation (17) or the estimated power σ²_{f,t} generated as described in the third embodiment, for example, may be used.
  • the “ ⁇ ” of “R ⁇ 1 f, t-1 ” should be written directly above the “R”, but due to notation limitations may also be written to the upper right of “R”.
  • the adaptive gain estimation unit 911 acquires and outputs an adaptive gain k f, t (the adaptive gain of the first time interval) that is based on the inverse matrix R ⁇ 1 f, t-1 of the weighted modified space-time covariance matrix (the inverse matrix of the weighted modified space-time covariance matrix of the second time interval), the estimated steering vector ⁇ f, t (the estimated steering vector of the first time interval), the frequency-divided observation signal x f, t , and the power or estimated power ⁇ f, t 2 of the target signal.
  • the adaptive gain estimation unit 911 acquires and outputs the adaptive gain k_{f,t} as an (LM+M−1)-dimensional vector in accordance with equation (41), shown below.
$k_{f,t} = \dfrac{\hat{R}_{f,t-1}^{-1}\,\overline{\overline{x}}_{f,t}}{\alpha\,\sigma_{f,t}^{2} + \overline{\overline{x}}_{f,t}^{H}\,\hat{R}_{f,t-1}^{-1}\,\overline{\overline{x}}_{f,t}}$  (41)
  • an initial matrix of the inverse matrix R ⁇ 1 f, t-1 of the weighted modified space-time covariance matrix may be any (LM+M ⁇ 1) ⁇ (LM+M ⁇ 1)-dimensional matrix.
  • An example of the initial matrix of the inverse matrix R ⁇ 1 f, t-1 of the weighted modified space-time covariance matrix is an (LM+M ⁇ 1)-dimensional unit matrix.
  • the matrix estimation unit 915 acquires and outputs an inverse matrix R̂⁻¹_{f,t} of the weighted modified space-time covariance matrix (the inverse matrix of the weighted modified space-time covariance matrix of the first time interval) that is based on the adaptive gain k_{f,t} (the adaptive gain of the first time interval), the estimated steering vector ν̂_{f,t} (the estimated steering vector of the first time interval), the frequency-divided observation signal x_{f,t}, and the inverse matrix R̂⁻¹_{f,t−1} of the weighted modified space-time covariance matrix (the inverse matrix of the weighted modified space-time covariance matrix of the second time interval).
  • for example, the matrix estimation unit 915 acquires and outputs the inverse matrix $\hat{R}_{f,t}^{-1}$ in accordance with equation (42), shown below.
$\hat{R}_{f,t}^{-1} = \dfrac{1}{\alpha}\left(\hat{R}_{f,t-1}^{-1} - k_{f,t}\,\overline{\overline{x}}_{f,t}^{H}\,\hat{R}_{f,t-1}^{-1}\right)$  (42)
  • the output inverse matrix R ⁇ 1 f, t of the weighted modified space-time covariance matrix is transmitted to the adaptive gain estimation unit 911 .
<Processing of Convolutional Beamformer Estimation Unit 912 (Step S912)>
  • the estimation signal of the target signal y f, t output from the suppression unit 92 and the adaptive gain k f, t output from the adaptive gain estimation unit 911 are input into the convolutional beamformer estimation unit 912 .
$\overline{\overline{w}}_{f,t} = \overline{\overline{w}}_{f,t-1} - k_{f,t}\,y_{f,t}^{H}$  (43)
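One full frame of this successive processing (steps S813, S814, S92, S911, S915, and S912) can be sketched as follows for a single frequency band. The block matrix uses the assumed construction from the earlier sketch, and the gain of equation (41) is taken in the recursive-least-squares form implied by equation (42); both are assumptions, and random data stand in for real frequency-divided observation signals.

    import numpy as np

    def ninth_embodiment_step(x_hist, nu, R_inv, w, sigma2,
                              g=1.0, alpha=0.99, ref=0, d=2, L=3):
        """One frame for a single frequency band. x_hist holds past M-dim
        frames (newest last, at least d+L of them); R_inv is the
        (LM+M-1) x (LM+M-1) inverse weighted modified space-time covariance
        matrix; w is the (LM+M-1)-dim beamformer [a_{f,t}; w-bar parts]."""
        M = nu.shape[0]
        x = x_hist[-1]
        z = (g * nu).conj() @ x                       # step S813, equation (34)
        # Step S814: assumed block matrix, then the stacked vector of B^H x
        # and the L delayed past frames.
        B = np.zeros((M, M - 1), dtype=complex)
        B[ref, :] = -(np.delete(nu, ref) / nu[ref]).conj()
        B[np.arange(M) != ref, :] = np.eye(M - 1)
        past = [x_hist[-1 - tau] for tau in range(d, d + L)]
        xx = np.concatenate([B.conj().T @ x] + past)  # (LM+M-1)-dimensional
        y = z + w.conj() @ xx                         # step S92, equation (40)
        num = R_inv @ xx                              # step S911; the RLS form
        k = num / (alpha * sigma2 + xx.conj() @ num)  # is assumed for eq. (41)
        R_inv = (R_inv - np.outer(k, xx.conj() @ R_inv)) / alpha   # eq. (42)
        w = w - k * np.conj(y)                        # step S912, equation (43)
        return y, w, R_inv

    # Usage with random data (shapes only; real inputs come from the STFT).
    rng = np.random.default_rng(2)
    M, d, L = 4, 2, 3
    dim = L * M + M - 1
    R_inv = np.eye(dim, dtype=complex)
    w = np.zeros(dim, dtype=complex)
    nu = rng.standard_normal(M) + 1j * rng.standard_normal(M)
    x_hist = [rng.standard_normal(M) + 1j * rng.standard_normal(M)
              for _ in range(d + L)]
    y, w, R_inv = ninth_embodiment_step(x_hist, nu, R_inv, w, sigma2=1.0,
                                        d=d, L=L)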
  • in the above description, a case in which the first time interval is the frame having the time frame number t and the second time interval is the frame having the time frame number t−1 was used as an example, but the present invention is not limited thereto.
  • a frame having a time frame number other than the time frame number t may be set as the first time interval, and a time frame that is further in the past than the first time interval and has a time frame number other than the time frame number t ⁇ 1 may be set as the second time interval.
  • a known steering vector ⁇ f, t may be input into the initial beamformer application unit 813 and the block unit 814 instead of the estimated steering vector ⁇ f, t acquired by the parameter estimation unit 93 .
  • the initial beamformer application unit 813 and the block unit 814 perform steps S 813 and S 814 , described above, using the steering vector ⁇ f, t instead of the estimated steering vector ⁇ f, t .
  • the frequency-divided observation signals x f, t input into the signal processing devices 1 to 9 described above may be any signals that correspond respectively to a plurality of frequency bands of an observation signal acquired by picking up an acoustic signal emitted from a sound source.
  • a time-domain observation signal $x(i) = [x(i)^{(1)}, x(i)^{(2)}, \ldots, x(i)^{(M)}]^{T}$ (where i is an index expressing a discrete time) acquired by picking up an acoustic signal emitted from a sound source in M microphones may be input into a dividing unit 1051, and the dividing unit 1051 may transform the observation signal x(i) into frequency-divided observation signals x_{f,t} in the frequency domain and input the frequency-divided observation signals x_{f,t} into the signal processing devices 1 to 9.
  • there are no limitations on the transformation method from the time domain to the frequency domain, and the discrete Fourier transform or the like, for example, may be used.
  • frequency-divided observation signals x f, t acquired by another processing unit may be input into the signal processing devices 1 to 9 .
  • the time-domain observation signal x(i) described above may be transformed into frequency-domain signals in each time frame, the frequency-domain signals may be processed by another processing unit, and the frequency-divided observation signals x f, t acquired as a result may be input into the signal processing devices 1 to 9 .
  • the estimation signals of the target signals y_{f,t} output from the signal processing devices 1 to 9 may either be used in other processing (speech recognition processing or the like) without being transformed into time-domain signals y(i), or be transformed into a time-domain signal y(i).
  • in the former case, the estimation signals of the target signals y_{f,t} may be output as is and used in other processing.
  • in the latter case, the estimation signals of the target signals y_{f,t} may be input into an integration unit 1052, and the integration unit 1052 may acquire and output a time-domain signal y(i) by integrating the estimation signals of the target signals y_{f,t}.
  • there are no limitations on the method for acquiring the time-domain signal y(i) from the estimation signals of the target signals y_{f,t}, and the inverse Fourier transform or the like, for example, may be used.
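A minimal sketch of the dividing unit 1051 and the integration unit 1052, using scipy's short-time Fourier transform as one possible transformation method (the sampling rate, window, and frame length are arbitrary choices here):

    import numpy as np
    from scipy.signal import stft, istft

    fs, nperseg = 16000, 512
    M, n = 4, fs * 2
    x = np.random.randn(M, n)                   # M-channel time-domain signal x(i)

    # Dividing unit 1051: x(i) -> frequency-divided observation signals x_{f,t}.
    _, _, X = stft(x, fs=fs, nperseg=nperseg)   # shape (M, F, T)

    # The signal processing devices 1 to 9 would operate on X here; as a
    # placeholder, take the estimation signals to be those of channel 0.
    Y = X[0]                                    # y_{f,t}, shape (F, T)

    # Integration unit 1052: y_{f,t} -> time-domain signal y(i).
    _, y = istft(Y, fs=fs, nperseg=nperseg)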
  • noise/reverberation suppression results acquired by the first embodiment and conventional methods 1 to 3 will be illustrated.
  • FIG. 12 shows evaluation results acquired in relation to the speech quality of the observation signal and the signals subjected to noise/reverberation suppression in accordance with the present invention and conventional methods 1 to 3.
  • Sim denotes the Sim Data, and Real denotes the Real Data.
  • CD denotes cepstrum distortion, SRMR denotes the signal-to-reverberation modulation ratio, LLR denotes the log-likelihood ratio, and FWSSNR denotes the frequency-weighted segmental signal-to-noise ratio.
  • CD and LLR indicate better speech quality as the values thereof decrease, while SRMR and FWSSNR indicate better speech quality as the values thereof increase.
  • the underlined values are optimal values. As illustrated in FIG. 12 , it is evident that according to the present invention, noise and reverberation can be suppressed more adequately than with conventional methods 1 to 3.
  • FIG. 13 shows a word error rate in the speech recognition results acquired in relation to the observation signal and the signals subjected to noise/reverberation suppression in accordance with the present invention and conventional methods 1 to 3.
  • the word error rate indicates better speech recognition precision as the value thereof decreases.
  • the underlined values are optimal values. “R1N” denotes a case in which the speaker is positioned close to the microphones in room 1, while “R1F” denotes a case in which the speaker is positioned far away from the microphones in room 1.
  • R2N and R3N respectively denote cases in which the speaker is positioned close to the microphones in rooms 2 and 3
  • R2F and R3F respectively denote cases in which the speaker is positioned far away from the microphones in rooms 2 and 3.
  • Ave denotes an average value. As illustrated in FIG. 13, it is evident that according to the present invention, noise and reverberation can be suppressed more adequately than with conventional methods 1 to 3.
  • FIG. 14 shows noise/reverberation suppression results acquired in a case where the steering vector was estimated without suppressing the reverberation of the frequency-divided observation signal x f, t (without reverberation suppression) and a case where the steering vector was estimated after suppressing the reverberation of the frequency-divided observation signal x f, t (with reverberation suppression), as described in the fourth embodiment.
  • WER expresses the word error rate when speech recognition was performed using the target signal acquired by implementing noise/reverberation suppression. As the value of WER decreases, a better performance is achieved. As illustrated in FIG. 14, it is evident that the speech quality of the target signal is better with reverberation suppression than without reverberation suppression.
  • FIGS. 15 A, 15 B, and 15 C show noise/reverberation suppression results acquired in a case where convolutional beamformer estimation was executed by successive processing, as described in the seventh and ninth embodiments.
  • L = 64 [msec]
  • Adaptive NCM indicates results acquired when the estimated steering vector ⁇ f, t generated by the method of the fifth embodiment was used.
  • PreFixed NCM indicates results acquired when the estimated steering vector ⁇ f, t generated by the method of modified example 1 of the fifth embodiment was used.
  • "Observation signal" indicates results acquired when no noise/reverberation suppression was implemented. Thus, it is evident that the speech quality of the target signal is improved by the noise/reverberation suppression of the seventh and ninth embodiments.
  • in the embodiments described above, d is set at the same value in all of the frequency bands, but d may be set for each frequency band. In other words, a positive integer d_f may be used instead of d.
  • L is set at the same value in all of the frequency bands, but L may be set for each frequency band. In other words, a positive integer L f may be used instead of L.
  • a time frame corresponding to 1 ≤ t ≤ t_c may be set as the processing unit, or a time frame corresponding to t_c − Δ ≤ t ≤ t_c may be set as the processing unit in relation to a positive integer constant Δ.
  • the various types of processing described above do not have to be executed in time series in the order described, and may be executed in parallel or individually, either in accordance with the processing power of the device that executes the processing or in accordance with necessity. Furthermore, the processing may be modified appropriately within a scope that does not depart from the spirit of the present invention.
  • the devices described above are configured by, for example, having a general-purpose or dedicated computer including a processor (a hardware processor) such as a CPU (central processing unit) and a memory such as a RAM (random-access memory)/ROM (read-only memory) execute a predetermined program.
  • the computer may include one processor and one memory, or pluralities of processors and memories.
  • the program may be either installed in the computer or recorded in the ROM or the like in advance.
  • instead of electronic circuitry, such as a CPU, that realizes a functional configuration by reading a program, some or all of the processing units may be configured using electronic circuitry that realizes processing functions without the use of a program.
  • Electronic circuitry constituting a single device may include a plurality of CPUs.
  • the processing content of the functions to be included in the devices is described by the program.
  • the computer realizes the processing functions described above by executing the program.
  • the program describing the processing content may be recorded in advance on a computer-readable recording medium.
  • An example of a computer-readable recording medium is a non-transitory recording medium. Examples of this type of recording medium include a magnetic recording device, an optical disk, a magneto-optical recording medium, a semiconductor memory, and so on.
  • the program is distributed by, for example, selling, transferring, renting, etc. a portable recording medium such as a DVD or a CD-ROM on which the program is recorded. Further, the program may be stored in a storage device of a server computer and distributed by being transferred from the server computer to another computer over a network.
  • the computer that executes the program first stores the program recorded on the portable recording medium or transferred from the server computer temporarily in a storage device included therein. During execution of the processing, the computer reads the program stored in the storage device included therein and executes processing corresponding to the read program. As a different form of execution of the program, the computer may read the program directly from the portable recording medium and execute processing corresponding to the program. Alternatively, every time the program is transferred to the computer from the server computer, the computer may execute processing corresponding to the received program. Instead of transferring the program from the server computer to the computer, the processing described above may be executed by a so-called ASP (Application Service Provider) type service, in which processing functions are realized only by issuing commands to execute the processing and acquiring results.
  • At least some of the processing functions may be realized by hardware.
  • the present invention can be used in various applications in which it is necessary to suppress noise and reverberation from an acoustic signal.
  • the present invention can be used in speech recognition, call systems, conference call systems, and so on.


Abstract

To sufficiently suppress noise and reverberation, a convolutional beamformer for calculating, at each time point, a weighted sum of a current signal and a past signal sequence having a predetermined delay and a length of 0 or more, such that it increases a probability expressing a speech-likeness of estimation signals based on a predetermined probability model, is acquired, where the estimation signals are acquired by applying the convolutional beamformer to frequency-divided observation signals corresponding respectively to a plurality of frequency bands of observation signals acquired by picking up acoustic signals emitted from a sound source, whereupon target signals are acquired by applying the acquired convolutional beamformer to the frequency-divided observation signals.

Description

CROSS-REFERENCE TO RELATED APPLICATIONS
This application is a U.S. 371 Application of International Patent Application No. PCT/JP2019/029921, filed on 31 Jul. 2019, which application claims priority to and the benefit of JP Application No. 2018-234075, filed on 14 Dec. 2018, and International Patent Application No. PCT/JP2019/016587, filed on 18 Apr. 2019, the disclosures of which are hereby incorporated herein by reference in their entireties.
TECHNICAL FIELD
The present invention relates to a signal processing technique for an acoustic signal.
BACKGROUND ART
NPL 1 and NPL 2 disclose a method of suppressing noise and reverberation from an observation signal in the frequency domain. In this method, reverberation and noise are suppressed by receiving an observation signal in the frequency domain and a steering vector representing the direction of a sound source or an estimated vector thereof, estimating an instantaneous beamformer for minimizing the power of the frequency-domain observation signal under a constraint condition that sound reaching a microphone from the sound source is not distorted, and applying the instantaneous beamformer to the frequency-domain observation signal (conventional method 1).
PTL 1 and NPL 3 disclose a method of suppressing reverberation from an observation signal in the frequency domain. In this method, reverberation in an observation signal in the frequency domain is suppressed by receiving an observation signal in the frequency domain and the power of a target sound at each time, or an estimated value thereof, estimating a reverberation suppression filter for suppressing reverberation in the target sound on the basis of a weighted power minimization reference of a prediction error, and applying the reverberation suppression filter to the frequency-domain observation signal (conventional method 2).
NPL 4 discloses a method of suppressing noise and reverberation by cascade-connecting conventional method 2 and conventional method 1. In this method, at a prior stage, an observation signal in the frequency domain and the power of a target sound at each time are received and reverberation is suppressed using conventional method 2, and then, at a later stage, a steering vector is received and reverberation and noise are further suppressed using conventional method 1 (conventional method 3).
CITATION LIST Patent Literature
  • [PTL 1] Japanese Patent No. 5227393
Non Patent Literature
  • [NPL 1] T Higuchi, N Ito, T Yoshioka, T Nakatani, “Robust MVDR beamforming using time-frequency masks for online/offline ASR in noise”, Proc. ICASSP 2016, 2016.
  • [NPL 2] J Heymann, L Drude, R Haeb-Umbach, “Neural network based spectral mask estimation for acoustic beamforming,” Proc. ICASSP 2016, 2016
  • [NPL 3] T Nakatani, T Yoshioka, K Kinoshita, M Miyoshi, B H Juang, “Speech dereverberation based on variance-normalized delayed linear prediction,” IEEE Trans. ASLP, 18 (7), 1717-1731, 2010
  • [NPL 4] Takuya Yoshioka, Nobutaka Ito, Marc Delcroix, Atsunori Ogawa, Keisuke Kinoshita, Masakiyo Fujimoto, Chengzhu Yu, Wojciech J Fabian, Miquel Espi, Takuya Higuchi, Shoko Araki, Tomohiro Nakatani, “The NTT CHiME-3 system: Advances in speech enhancement and recognition for mobile multi-microphone devices,” Proc. IEEE ASRU 2015, 436-443, 2015.
SUMMARY OF THE INVENTION Technical Problem
In the conventional methods, it may be impossible to sufficiently suppress reverberation and noise. Conventional method 1 is a method originally developed for the purpose of suppressing noise and may not always be capable of sufficiently suppressing reverberation. With conventional method 2, noise cannot be suppressed. Conventional method 3 can suppress more noise and reverberation than when conventional method 1 or conventional method 2 is used alone. With conventional method 3, however, conventional method 2 serving as the prior stage and conventional method 1 serving as the later stage are viewed as independent systems and optimization is performed in the respective systems. Therefore, when conventional method 2 is applied at the prior stage, it may not always be possible to sufficiently suppress reverberation due to the effects of noise. Further, when conventional method 1 is applied at the later stage, it may not always be possible to sufficiently suppress noise and reverberation due to the effects of residual reverberation.
The present invention has been designed in consideration of these points, and an object thereof is to provide a technique with which noise and reverberation can be sufficiently suppressed.
Means for Solving the Problem
In the present invention, a convolutional beamformer for calculating, at each time, a weighted sum of a current signal and a past signal sequence having a predetermined delay and a length of 0 or more such that estimation signals of target signals increase a probability expressing a speech-likeness of the estimation signals based on a predetermined probability model is acquired where the estimation signals are acquired by applying the convolutional beamformer to frequency-divided observation signals corresponding respectively to a plurality of frequency bands of observation signals acquired by picking up acoustic signals emitted from a sound source, whereupon the estimation signals are acquired by applying the acquired convolutional beamformer to the frequency-divided observation signals.
Effects of the Invention
In the present invention, the convolutional beamformer is acquired such that the estimation signals increase the probability expressing the speech-likeness of the estimation signals based on the probability model, and therefore noise suppression and reverberation suppression can be optimized as a single system, with the result that noise and reverberation can be sufficiently suppressed.
BRIEF DESCRIPTION OF DRAWINGS
FIG. 1A is a block diagram illustrating an example of a functional configuration of a signal processing device according to a first embodiment, and FIG. 1B is a flowchart illustrating an example of a signal processing method according to the first embodiment.
FIG. 2A is a block diagram illustrating an example of a functional configuration of a signal processing device according to a second embodiment, and FIG. 2B is a flowchart illustrating an example of a signal processing method according to the second embodiment.
FIG. 3 is a block diagram illustrating an example of a functional configuration of a signal processing device according to a third embodiment.
FIG. 4 is a block diagram illustrating an example of a functional configuration of a parameter estimation unit illustrated in FIG. 3 .
FIG. 5 is a flowchart illustrating an example of a parameter estimation method according to the third embodiment.
FIG. 6 is a block diagram illustrating an example of a functional configuration of a signal processing device according to fourth to seventh embodiments.
FIG. 7 is a block diagram illustrating an example of a functional configuration of a parameter estimation unit illustrated in FIG. 6 .
FIG. 8 is a block diagram illustrating an example of a functional configuration of a steering vector estimation unit illustrated in FIG. 7 .
FIG. 9 is a block diagram illustrating an example of a functional configuration of a signal processing device according to an eighth embodiment.
FIG. 10 is a block diagram illustrating an example of a functional configuration of a signal processing device according to a ninth embodiment.
FIGS. 11A to 11C are block diagrams illustrating examples of use of the signal processing devices according to the embodiments.
FIG. 12 is a table illustrating examples of test results of the first embodiment.
FIG. 13 is a table illustrating examples of test results of the first embodiment.
FIG. 14 is a table illustrating examples of test results of the fourth embodiment.
FIGS. 15A to 15C are tables illustrating examples of test results of the seventh embodiment.
DESCRIPTION OF EMBODIMENTS
Embodiments of the present invention will be described below.
[Definitions of Symbols]
First, symbols used in the embodiments will be defined.
    • M: M is a positive integer expressing a number of microphones. For example, M≥2.
    • m: m is a positive integer expressing the microphone number, and satisfies 1≤m≤M. The microphone number is represented by upper right superscript in round parentheses. In other words, a value or a vector based on a signal picked up by a microphone having the microphone number m is represented by a symbol having the upper right superscript “(m)” (for example, xf, t (m)).
    • N: N is a positive integer expressing the total number of time frames of signals. For example, N≥2.
    • t, τ: t and τ are positive integers expressing the time frame number, and t satisfies 1≤t≤N. The time frame number is represented by lower right subscript. In other words, a value or a vector corresponding to a time frame having the time frame number t is represented by a symbol having the lower right subscript "t" (for example, x_{f,t}^{(m)}). Similarly, a value or a vector corresponding to a time frame having the time frame number τ is represented by a symbol having the lower right subscript "τ".
    • P: P is a positive integer expressing a total number of frequency bands (discrete frequencies). For example, P≥2.
    • f: f is a positive integer expressing the frequency band number, and satisfies 1≤f≤P. The frequency band number is represented by lower right subscript. In other words, a value or a vector corresponding to a frequency band having the frequency band number f is represented by a symbol having the lower right subscript “f” (for example, xf, t (m)).
    • T: T expresses a non-conjugated transpose of a matrix or a vector. α0 T represents a matrix or a vector acquired by non-conjugated transposition of α0.
    • H: H expresses a conjugated transpose of a matrix or a vector. α0 H represents a matrix or a vector acquired by conjugated transposition of α0.
    • |α0|: |α0| expresses the absolute value of α0.
    • ∥α0∥: ∥α0∥ expresses the norm of α0.
    • |α0|γ: |α0|γ expresses a weighted absolute value γ|α0| of α0.
    • ∥α0γ: ∥α0γ expresses a weighted norm γ∥α0∥ of α0.
In this specification, a "target signal" denotes a signal corresponding to a direct sound and an initial reflected sound, within a signal (for example, a frequency-divided observation signal) corresponding to a sound emitted from a target sound source and picked up by a microphone. The initial reflected sound denotes a reverberation component derived from the sound emitted from the target sound source that reaches the microphone at a delay of no more than several tens of milliseconds following the direct sound. The initial reflected sound typically acts to improve the clarity of the sound, and in this embodiment, a signal corresponding to the initial reflected sound is also included in the target signal. Here, the signal corresponding to the sound picked up by the microphone also includes, in addition to the target signal described above, late reverberation (a component acquired by excluding the initial reflected sound from the reverberation) derived from the sound emitted from the target sound source, and noise derived from a source other than the target sound source. In a signal processing method, the target signal is estimated by suppressing late reverberation and noise from a frequency-divided observation signal corresponding to a sound recorded by the microphone, for example. In this specification, unless specified otherwise, "reverberation" is assumed to refer to "late reverberation".
[Principles]
Next, principles will be described.
<Prerequisite Method 1>
Method 1 serving as a prerequisite of the method according to the embodiments will now be described. In method 1, noise and reverberation are suppressed from an M-dimensional observation signal (frequency-divided observation signals) in the frequency domain
$x_{f,t} = [x_{f,t}^{(1)}, x_{f,t}^{(2)}, \ldots, x_{f,t}^{(M)}]^{T}$  (1)
The frequency-divided observation signals xf, t are acquired by transforming M observation signals, which are acquired by picking up acoustic signals emitted from one or a plurality of sound sources in M microphones, to the frequency domain. The observation signals are acquired by picking up acoustic signals emitted from the sound sources in an environment where noise and reverberation exist. xf, t (m) is acquired by transforming an observation signal that is acquired by being picked up by the microphone having the microphone number m to the frequency domain. xf, t (m) corresponds to the frequency band having the frequency band number f and the time frame having the time frame number t. In other words, the frequency-divided observation signals xf, t are time series signals.
In method 1, an instantaneous beamformer wf, 0 for minimizing a cost function C1 (wf, 0) below is determined for each frequency band under the constraint condition in which “the target signals are not distorted as a result of applying an instantaneous beamformer (for example, a minimum power distortionless response beamformer) wf, 0 for calculating the weighted sum of the signals at the current time to the frequency-divided observation signals xf, t at each time”.
$C_1(w_{f,0}) = \sum_{t=1}^{N} \left|w_{f,0}^{H}\,x_{f,t}\right|^{2}$  (2)
$w_{f,0} = [w_{f,0}^{(1)}, w_{f,0}^{(2)}, \ldots, w_{f,0}^{(M)}]^{T}$  (3)
Note that the lower right subscript “0” of wf, 0 does not represent the time frame number, wf, 0 being independent of the time frame. The constraint condition is a condition in which, for example, wf, 0 Hνf, 0 is a constant (1, for example). Here,
$\nu_{f,0} = [\nu_{f,0}^{(1)}, \nu_{f,0}^{(2)}, \ldots, \nu_{f,0}^{(M)}]^{T}$  (4)
is a steering vector having, as an element, a transfer function νf, 0 (m) relating to the direct sound and the initial reflected sound from the sound source to each microphone (the sound pickup position of the acoustic signal), or an estimated vector (an estimated steering vector) thereof. In other words, νf, 0 is expressed by an M-dimensional (the dimension of the number of microphones) vector having, as an element, the transfer function νf, 0 (m), which corresponds to the direct sound and initial reflected sound parts of an impulse response from the sound source position to each microphone (i.e. the reverberation that arrives at a delay of no more than several tens of milliseconds (for example, within 30 milliseconds) following the direct sound). When it is difficult to estimate the gain of the steering vector, a normalized vector acquired by normalizing the transfer function of each element so that the gain of a microphone having one of the microphone numbers m0∈{1, . . . , M} becomes a constant g (g≠0) may be used as νf, 0. In other words, as illustrated below, a normalized vector may be used as νf, 0.
$\nu_{f,0} = \dfrac{g\,\nu_{f,0}}{\nu_{f,0}^{(m_0)}}$  (5)
By applying the instantaneous beamformer wf, 0 acquired as described above to the frequency-divided observation signal xf, t of each frequency band in the manner illustrated below, an estimation signal of a target signal yf, t in which noise and reverberation have been suppressed from the frequency-divided observation signal xf, t is acquired.
$y_{f,t} = w_{f,0}^{H}\,x_{f,t}$  (6)
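Under the constraint condition w_{f,0}^H ν_{f,0} = 1, the minimizer of equation (2) takes the familiar closed form w_{f,0} = R_f^{-1} ν_{f,0} / (ν_{f,0}^H R_f^{-1} ν_{f,0}), where R_f = Σ_t x_{f,t} x_{f,t}^H is the spatial covariance of the observation. The following numpy sketch illustrates method 1 for one frequency band, with random data standing in for the frequency-divided observation signals and the steering vector:

    import numpy as np

    rng = np.random.default_rng(3)
    M, N = 4, 200
    x = rng.standard_normal((M, N)) + 1j * rng.standard_normal((M, N))  # x_{f,t}
    nu = rng.standard_normal(M) + 1j * rng.standard_normal(M)           # nu_{f,0}

    # Spatial covariance and the minimizer of equation (2) under w^H nu = 1.
    R = (x @ x.conj().T) / N
    w = np.linalg.solve(R, nu)
    w = w / (nu.conj() @ w)
    assert np.isclose(w.conj() @ nu, 1.0)

    # Equation (6): apply the instantaneous beamformer at every time frame.
    y = w.conj() @ x                 # y_{f,t}, shape (N,)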
<Prerequisite Method 2>
Method 2 serving as a prerequisite of the method according to the embodiments will now be described. In method 2, reverberation is suppressed from the frequency-divided observation signal xf, t. In method 2, a reverberation suppression filter Ff, τ for minimizing a cost function C2 (Ff) below is determined for τ=d, d+1, . . . , d+L−1 in each frequency band.
$C_2(F_f) = \sum_{t=1}^{N} \left\| x_{f,t} - \sum_{\tau=d}^{d+L-1} F_{f,\tau}^{H}\,x_{f,t-\tau} \right\|_{\sigma_{f,t}^{-2}}$  (7)
Here, the reverberation suppression filter F_{f,τ} is an M×M-dimensional matrix filter for suppressing reverberation from the frequency-divided observation signal x_{f,t}. d is a positive integer expressing a prediction delay. L is a positive integer expressing the filter length. σ²_{f,t} is the power of the target signal, and
$\sigma_{f,t}^{-2} = \dfrac{1}{\sigma_{f,t}^{2}}$
The weighted norm $\|x\|_{\gamma}$ relating to the frequency-divided observation signal x is $\|x\|_{\gamma} = \gamma\,(x^{H}x)$.
By applying the reverberation suppression filters F_{f,τ} acquired as described above to the frequency-divided observation signal x_{f,t} of each frequency band in the manner illustrated below, an estimation signal of a target signal z_{f,t} in which reverberation has been suppressed from the frequency-divided observation signal x_{f,t} is acquired.
$z_{f,t} = x_{f,t} - \sum_{\tau=d}^{d+L-1} F_{f,\tau}^{H}\,x_{f,t-\tau}$  (8)
Here, the estimation signal of the target signal zf, t is an M-dimensional column vector, as shown below.
$z_{f,t} = [z_{f,t}^{(1)}, z_{f,t}^{(2)}, \ldots, z_{f,t}^{(M)}]^{T}$
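For fixed powers σ²_{f,t}, minimizing equation (7) is a weighted least-squares problem in the stacked filter coefficients, and equation (8) then subtracts the predicted late reverberation. The sketch below is a batch rendering of method 2 for one frequency band, in which G stacks the F_{f,τ}, the weights are the provisional powers of equation (17), and random data stand in for x_{f,t}:

    import numpy as np

    rng = np.random.default_rng(4)
    M, N, d, L = 2, 300, 2, 4
    x = rng.standard_normal((M, N)) + 1j * rng.standard_normal((M, N))
    sigma2 = np.maximum(np.mean(np.abs(x) ** 2, axis=0), 1e-6)   # equation (17)

    # Stack the delayed frames x_{t-d}, ..., x_{t-d-L+1} into LM-dim vectors.
    T0 = d + L - 1                       # first frame with a full history
    X_bar = np.stack([np.concatenate([x[:, t - tau] for tau in range(d, d + L)])
                      for t in range(T0, N)], axis=1)            # (LM, N-T0)
    X_cur = x[:, T0:]
    wgt = 1.0 / sigma2[T0:]

    # Weighted least squares: G minimizes sum_t ||x_t - G^H xbar_t||^2 / sigma2_t.
    R = (X_bar * wgt) @ X_bar.conj().T
    P = (X_bar * wgt) @ X_cur.conj().T
    G = np.linalg.solve(R, P)                                    # (LM, M)

    # Equation (8): subtract the predicted late reverberation.
    z = X_cur - G.conj().T @ X_bar                               # z_{f,t}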
<Method of Embodiments>
The method of the embodiments will now be described. An estimation signal of a target signal yf, t acquired by suppressing noise and reverberation from the frequency-divided observation signal xf, t by using a method integrating methods 1 and 2 can be modeled as follows.
$y_{f,t} = w_{f,0}^{H}\left(x_{f,t} - \sum_{\tau=d}^{d+L-1} F_{f,\tau}^{H}\,x_{f,t-\tau}\right) = w_{f,0}^{H}\,x_{f,t} + \sum_{\tau=d}^{d+L-1} \bar{w}_{f,\tau}^{H}\,x_{f,t-\tau} = \bar{w}_{f}^{H}\,\bar{x}_{f,t}$  (9)
Here, with respect to τ≠0, $\bar{w}_{f,\tau} = -F_{f,\tau}\,w_{f,0}$, and $\bar{w}_{f,\tau}$ corresponds to a filter for performing noise suppression and reverberation suppression simultaneously. $\bar{w}_f$ is a convolutional beamformer that calculates, at each time point, a weighted sum of the current signal and a past signal sequence having a predetermined delay. Note that the "−" of "$\bar{w}_f$" should be written directly above the "w", but due to notation limitations it may also be written to the upper right of "w". The convolutional beamformer $\bar{w}_f$ is expressed as shown below, for example,
$\bar{w}_f = [\bar{w}_f^{(1)T}, \bar{w}_f^{(2)T}, \ldots, \bar{w}_f^{(M)T}]^{T}$  (10)
where the following is satisfied.
$\bar{w}_f^{(m)} = [w_{f,0}^{(m)}, w_{f,d}^{(m)}, w_{f,d+1}^{(m)}, \ldots, w_{f,d+L-1}^{(m)}]^{T}$  (10A)
Further, $\bar{x}_{f,t}$ is expressed as follows.
$\bar{x}_{f,t} = [\bar{x}_{f,t}^{(1)T}, \bar{x}_{f,t}^{(2)T}, \ldots, \bar{x}_{f,t}^{(M)T}]^{T}$  (11)
$\bar{x}_{f,t}^{(m)} = [x_{f,t}^{(m)}, x_{f,t-d}^{(m)}, x_{f,t-d-1}^{(m)}, \ldots, x_{f,t-d-L+1}^{(m)}]^{T}$  (11A)
Note that throughout this specification, cases in which L=0 in equations (9) to (11A) are also assumed to be included in the convolutional beamformer of the present invention. In other words, even cases in which the length of the past signal sequence used by the convolutional beamformer to calculate the weighted sum is 0 are treated as examples of realization of the convolutional beamformer. At this time, the term Σ in equation (9) becomes 0, and therefore equation (9) becomes equation (9A), shown below. Further, the respective right sides of equations (10A) and (11A) become vectors constituted respectively by only one first element (i.e., scalars), and therefore become equations (10AA) and (11AA), respectively.
$y_{f,t} = w_{f,0}^{H}\,x_{f,t} = \bar{w}_{f}^{H}\,\bar{x}_{f,t}$  (9A)
$\bar{w}_{f}^{(m)} = w_{f,0}^{(m)}$  (10AA)
$\bar{x}_{f,t}^{(m)} = x_{f,t}^{(m)}$  (11AA)
Note that the convolutional beamformer w f of equation (9A) is a beamformer that calculates, at each time point, the weighted sum of the current signal and a signal sequence having a predetermined delay and a length of 0, and therefore the convolutional beamformer calculates the weighted value of the current signal at each time point. Further, as will be described below, even when L=0, the signal processing device of the present invention can acquire the estimation signal of the target signal by determining a convolutional beamformer on the basis of a probability expressing a speech-likeness and applying the convolutional beamformer to the frequency-divided observation signals.
Here, assuming that yf, t in equation (9) preferably conforms to a speech probability density function p ({yf, t}t=1:N; w f) (a probability model), the signal processing device determines the convolutional beamformer w f such that it increases the probability p ({yf, t}t=1:N; w f) (in other words, a probability expressing the speech-likeness of yf, t) of yf, t based on the speech probability density function. Preferably, the convolutional beamformer w f which maximizes the probability expressing the speech-likeness of yf, t is determined. For example, the signal processing device determines the convolutional beamformer w f such that it increases log p ({yf, t}t=1:N; w f), and preferably determines the convolutional beamformer w f which maximizes log p ({yf, t}t=1:N; w f).
A complex normal distribution having an average of 0 and a variance matching the power σf, t 2 of the target signal can be cited as an example of a speech probability density function. The “target signal” is a signal corresponding to the direct sound and the initial reflected sound, within a signal corresponding to a sound emitted from a target sound source and picked up by a microphone. Further, the signal processing device determines the convolutional beamformer w f under the constraint condition in which “the target signals are not distorted as a result of applying the convolutional beamformer w f to the frequency-divided observation signals xf, t”, for example. This constraint condition is a condition in which, for example, wf, 0 Hνf, 0 is a constant (1, for example). On the basis of this constraint condition, for example, the signal processing device determines w f which maximizes log p ({yf, t}t=1:N; w f), which is determined as shown below, for each frequency band.
$\log p\left(\{y_{f,t}\}_{t=1:N}; \bar{w}_f\right) = -\sum_{t=1}^{N} \dfrac{\left|\bar{w}_f^{H}\,\bar{x}_{f,t}\right|^{2}}{\sigma_{f,t}^{2}} + \mathrm{const.}$  (12)
Here, “const.” expresses a constant.
The following function, which is acquired by subtracting the constant term (const.) from log p ({yf, t}t=1:N; w f) in equation (12) and reversing the plus/minus sign, is set as a cost function C3 (w f).
$C_3(\bar{w}_f) = \sum_{t=1}^{N} \dfrac{\left|\bar{w}_f^{H}\,\bar{x}_{f,t}\right|^{2}}{\sigma_{f,t}^{2}} = \bar{w}_f^{H}\,R_f\,\bar{w}_f$  (13)
Here, R_f is a weighted space-time covariance matrix determined as shown below.
$R_f = \sum_{t=1}^{N} \dfrac{\bar{x}_{f,t}\,\bar{x}_{f,t}^{H}}{\sigma_{f,t}^{2}}$  (14)
The signal processing device may determine w f which minimizes the cost function C3 (w f) of equation (13) under the constraint condition described above (in which, for example, wf, 0 Hνf, 0 is a constant), for example.
The analytical solution of w f for minimizing the cost function C3 (w f) under the constraint condition described above (in which, for example, wf, 0 Hνf, 0=1) is as shown below.
$\bar{w}_f = \dfrac{R_f^{-1}\,\bar{\nu}_f}{\bar{\nu}_f^{H}\,R_f^{-1}\,\bar{\nu}_f}$  (15)
Here, $\bar{\nu}_f$ is a vector acquired by disposing the elements $\nu_{f,0}^{(m)}$ of the steering vector $\nu_{f,0}$ as follows.
$\bar{\nu}_f = [\bar{\nu}_f^{(1)T}, \bar{\nu}_f^{(2)T}, \ldots, \bar{\nu}_f^{(M)T}]^{T}, \qquad \bar{\nu}_f^{(m)} = [\nu_{f,0}^{(m)}, 0, \ldots, 0]^{T}$
Here, $\bar{\nu}_f^{(m)}$ is an (L+1)-dimensional column vector having $\nu_{f,0}^{(m)}$ and L zeros as elements.
The signal processing device acquires the estimation signal of the target signal yf, t by applying the determined convolutional beamformer w f to the frequency-divided observation signal xf, t as follows.
$y_{f,t} = \bar{w}_{f}^{H}\,\bar{x}_{f,t}$  (16)
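Putting equations (11), (14), (15), and (16) together, the method of the embodiments can be sketched in batch form for a single frequency band. Random data and a random steering vector stand in for real inputs, and the powers σ²_{f,t} are initialized as in equation (17):

    import numpy as np

    rng = np.random.default_rng(5)
    M, N, d, L = 3, 400, 2, 4
    x = rng.standard_normal((M, N)) + 1j * rng.standard_normal((M, N))   # x_{f,t}
    nu = rng.standard_normal(M) + 1j * rng.standard_normal(M)            # nu_{f,0}
    sigma2 = np.maximum(np.mean(np.abs(x) ** 2, axis=0), 1e-6)           # eq. (17)

    def stack(t):
        """Equations (11), (11A): current frame plus L delayed frames per mic."""
        return np.concatenate(
            [np.concatenate(([x[m, t]], x[m, t - d - np.arange(L)]))
             for m in range(M)])

    T0 = d + L - 1
    X_bar = np.stack([stack(t) for t in range(T0, N)], axis=1)   # ((L+1)M, .)

    # Equation (14): weighted space-time covariance matrix.
    R = (X_bar / sigma2[T0:]) @ X_bar.conj().T

    # nu_bar: each element nu^{(m)} followed by L zeros (see the text above).
    nu_bar = np.concatenate([np.concatenate(([nu[m]], np.zeros(L)))
                             for m in range(M)])

    # Equation (15): the analytical solution under w_{f,0}^H nu_{f,0} = 1.
    w = np.linalg.solve(R, nu_bar)
    w = w / (nu_bar.conj() @ w)

    # Equation (16): estimation signals of the target signals.
    y = w.conj() @ X_bar                                          # y_{f,t}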
First Embodiment
Next, a first embodiment will be described.
As illustrated in FIG. 1A, a signal processing device 1 according to this embodiment includes an estimation unit 11 and a suppression unit 12.
<Step S11>
As illustrated in FIG. 1B, the frequency-divided observation signal xf, t is input into the estimation unit 11 (equation (1)). The estimation unit 11 acquires and outputs the convolutional beamformer w f for calculating the weighted sum of the current signal and a past signal sequence having a predetermined delay at each time such that the estimation signals increase the probability expressing the speech-likeness of the estimation signals based on the predetermined probability model where the estimation signals are acquired by applying the convolutional beamformer w f to the frequency-divided observation signals xf, t in respective frequency bands. For example, the estimation unit 11 determines the convolutional beamformer w f such that it increases the probability expressing speech-likeness of yf, t based on the probability density function p ({yf, t}t=1:N; w f) (such that log p ({yf, t}t=1:N; w f) is increased, for example). The estimation unit 11 preferably determines the convolutional beamformer w f which maximizes the probability (maximizes log p ({yf, t}t=1:N; w f), for example).
<Step S12>
The frequency-divided observation signal xf, t and the convolutional beamformer w f acquired in step S11 are input into the suppression unit 12. The suppression unit 12 acquires and outputs the estimation signal of the target signal yf, t by applying the convolutional beamformer w f to the frequency-divided observation signal xf, t in each frequency band. For example, the suppression unit 12 acquires and outputs the estimation signal of the target signal yf, t by applying w f to x f, t as shown in equation (16).
<Features of this Embodiment>
In this embodiment, the convolutional beamformer w̄_f for calculating the weighted sum of the current signal and a past signal sequence having a predetermined delay at each time is determined such that the estimation signals increase the probability expressing the speech-likeness of the estimation signals based on the predetermined probability model, where the estimation signals are acquired by applying the convolutional beamformer w̄_f to the frequency-divided observation signals x_{f,t}. This corresponds to optimizing noise suppression and reverberation suppression as a single system. In this embodiment, therefore, noise and reverberation can be suppressed more adequately than with the conventional methods.
Second Embodiment
Next, a second embodiment will be described. Hereafter, processing units and steps described heretofore will be cited using identical reference numerals, and description thereof will be simplified.
As illustrated in FIG. 2A, a signal processing device 2 according to this embodiment includes an estimation unit 21 and the suppression unit 12. The estimation unit 21 includes a matrix estimation unit 211 and a convolutional beamformer estimation unit 212.
The estimation unit 21 of this embodiment acquires and outputs the convolutional beamformer w̄_f which minimizes a sum of values (the cost function C3(w̄_f) of equation (13), for example) acquired by weighting the power of the estimation signals at each time belonging to a predetermined time interval by the reciprocal of the power σ²_{f,t} of the target signals or the reciprocal of the estimated power σ²_{f,t} of the target signals, under the constraint condition in which "the target signals are not distorted as a result of applying the convolutional beamformer w̄_f to the frequency-divided observation signals x_{f,t}". As illustrated in equation (9), the convolutional beamformer w̄_f is equivalent to a beamformer acquired by integrating a reverberation suppression filter F_{f,τ} for suppressing reverberation from the frequency-divided observation signal x_{f,t} and the instantaneous beamformer w_{f,0} for suppressing noise from a signal acquired by applying the reverberation suppression filter F_{f,τ} to the frequency-divided observation signal x_{f,t}. Further, the constraint condition is a condition in which, for example, "a value acquired by applying an instantaneous beamformer to a steering vector having, as elements, transfer functions relating to the direct sound and the initial reflected sound from the sound source to the pickup position of the acoustic signals, or an estimated steering vector, which is an estimated vector of the steering vector, is a constant (w_{f,0}^H ν_{f,0} is a constant)". The processing will be described in detail below.
<Step S211>
As illustrated in FIG. 2B, the frequency-divided observation signals xf, t and the power or estimated power σf, t 2 of the target signals are input into the matrix estimation unit 211. The matrix estimation unit 211 acquires and outputs a weighted space-time covariance matrix Rf for each frequency band on the basis of the frequency-divided observation signals xf, t and the power or estimated power σf, t 2 of the target signal. For example, the matrix estimation unit 211 acquires and outputs the weighted space-time covariance matrix Rf in accordance with equation (14).
<Step S212>
The steering vector or estimated steering vector νf, 0 (equation (4) or (5)) and the weighted space-time covariance matrix Rf acquired in step S211 are input into the convolutional beamformer estimation unit 212. The convolutional beamformer estimation unit 212 acquires and outputs the convolutional beamformer w f on the basis of the weighted space-time covariance matrix Rf and the steering vector or estimated steering vector νf, 0. For example, the convolutional beamformer estimation unit 212 acquires and outputs the convolutional beamformer w f in accordance with equation (15).
<Step S12>
This step is identical to the first embodiment, and therefore description thereof has been omitted.
<Features of this Embodiment>
In this embodiment, the weighted space-time covariance matrix Rf is acquired, and on the basis of the weighted space-time covariance matrix Rf and the steering vector or estimated steering vector νf, 0, the convolutional beamformer w f is acquired. This corresponds to optimizing noise suppression and reverberation suppression as a single system. In this embodiment, therefore, noise and reverberation can be suppressed more adequately than with the conventional methods.
Third Embodiment
Next, a third embodiment will be described. In this embodiment, an example of a method of generating σf, t 2 and νf, 0 will be described.
As illustrated in FIG. 3 , a signal processing device 3 according to this embodiment includes the estimation unit 21, the suppression unit 12, and a parameter estimation unit 33. The estimation unit 21 includes the matrix estimation unit 211 and the convolutional beamformer estimation unit 212. Further, as illustrated in FIG. 4 , the parameter estimation unit 33 includes an initial setting unit 330, a power estimation unit 331, a reverberation suppression filter estimation unit 332, a reverberation suppression filter application unit 333, a steering vector estimation unit 334, an instantaneous beamformer estimation unit 335, an instantaneous beamformer application unit 336, and a control unit 337.
Hereafter, only the processing executed by the parameter estimation unit 33, which differs from the second embodiment, will be described. The processing performed by the other processing units is as described in the first and second embodiments.
<Step S330>
The frequency-divided observation signal x_{f,t} is input into the initial setting unit 330. Using the frequency-divided observation signal x_{f,t}, the initial setting unit 330 generates and outputs a provisional power σ²_{f,t}, which is a provisional value of the estimated power of the target signal. For example, the initial setting unit 330 generates and outputs the provisional power σ²_{f,t} as follows.
$\sigma_{f,t}^{2} = \dfrac{x_{f,t}^{H}\,x_{f,t}}{M}$  (17)
<Step S332>
The frequency-divided observation signals x_{f,t} and the newest provisional powers σ²_{f,t} are input into the reverberation suppression filter estimation unit 332. The reverberation suppression filter estimation unit 332 determines and outputs a reverberation suppression filter F_{f,τ} for minimizing the cost function C2(F_f) of equation (7) with respect to τ = d, d+1, . . . , d+L−1 in each frequency band.
<Step S333>
The frequency-divided observation signal x_{f,t} and the newest reverberation suppression filter F_{f,τ} acquired in step S332 are input into the reverberation suppression filter application unit 333. The reverberation suppression filter application unit 333 acquires and outputs an estimation signal y′_{f,t} by applying the reverberation suppression filter F_{f,τ} to the frequency-divided observation signal x_{f,t} in each frequency band. For example, the reverberation suppression filter application unit 333 sets z_{f,t}, acquired in accordance with equation (8), as y′_{f,t} and outputs y′_{f,t}.
<Step S334>
The newest estimation signal y′f, t acquired in step S333 is input into the steering vector estimation unit 334. Using the estimation signal y′f, t, the steering vector estimation unit 334 acquires and outputs a provisional steering vector νf, 0, which is a provisional vector of the estimated steering vector, in each frequency band. For example, the steering vector estimation unit 334 acquires and outputs the provisional steering vector νf, 0 for the estimation signal y′f, t in accordance with a steering vector estimation method described in NPL 1 and NPL 2. For example, as the provisional steering vector νf, 0, the steering vector estimation unit 334 outputs a steering vector estimated using y′f, t as yf, t according to NPL 2. Further, as noted above, a normalized vector acquired by normalizing the transfer function of each element so that the gain of a microphone having any one of the microphone numbers m0∈(1, . . . , M) becomes a constant g may be used as νf, 0 (equation (5)).
<Step S335>
The newest estimation signal y′f, t acquired in step S333 and the newest provisional steering vector νf, 0 acquired in step S334 are input into the instantaneous beamformer estimation unit 335. The instantaneous beamformer estimation unit 335 acquires and outputs an instantaneous beamformer wf, 0 for minimizing C1 (wf, 0) shown below in equation (18), which is acquired by setting xf, t=y′f, t in equation (2), in each frequency band on the basis of the constraint condition that “wf, 0 Hνf, 0 is a constant”.
$C_1(w_{f,0}) = \sum_{t=1}^{N} \left|w_{f,0}^{H}\,y'_{f,t}\right|^{2}$  (18)
<Step S336>
The newest estimation signal y′f, t acquired in step S333 and the newest instantaneous beamformer wf, 0 acquired in step S335 are input into the instantaneous beamformer application unit 336. The instantaneous beamformer application unit 336 acquires and outputs an estimation signal y″f, t by applying the instantaneous beamformer wf, 0 to the estimation signal y′f, t in each frequency band. For example, the instantaneous beamformer application unit 336 acquires and outputs the estimation signal y″f, t as follows.
y''_{f,t} = w_{f,0}^H\, y'_{f,t} \quad (19)
<Step S331>
The newest estimation signal y″f, t acquired in step S336 is input into the power estimation unit 331. The power estimation unit 331 outputs the power of the estimation signal y″f, t as the provisional power σf, t 2 in each frequency band. For example, the power estimation unit 331 generates and outputs the provisional power σf, t 2 as follows.
\sigma_{f,t}^2 = |y''_{f,t}|^2 = y''^{H}_{f,t}\, y''_{f,t} \quad (20)
<Step S337 a>
The control unit 337 determines whether or not a termination condition is satisfied. There are no limitations on the termination condition; for example, the termination condition may be satisfied when the number of repetitions of the processing of steps S331 to S336 exceeds a predetermined value, when the variation in σf, t 2 or νf, 0 over one pass through steps S331 to S336 falls to or below a predetermined value, and so on. When the termination condition is not satisfied, the processing returns to step S332. When the termination condition is satisfied, on the other hand, the processing advances to step S337 b.
<Step S337 b>
In step S337 b, the power estimation unit 331 outputs σf, t 2 acquired most recently in step S331 as the estimated power of the target signal, and the steering vector estimation unit 334 outputs νf, 0 acquired most recently in step S334 as the estimated steering vector. As illustrated in FIG. 3 , the estimated power σf, t 2 is input into the matrix estimation unit 211, and the estimated steering vector νf, 0 is input into the convolutional beamformer estimation unit 212.
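Pulling steps S331 to S336 together, a sketch of the loop for one frequency band follows. It reuses the helpers from the sketch above; the eigenvector-based steering vector estimate merely stands in for the NPL 1 and NPL 2 methods of step S334, and the closed-form minimizer of step S335 is an assumption of this sketch.

```python
import numpy as np

def apply_dereverb_filter(X, F, d=4, L=10):
    """Step S333: subtract the predicted reverberation tail (assumed form of eq. (8))."""
    return X - F.conj().T @ stack_delayed(X, d, L)

def estimate_power_and_steering(X, n_iter=5, d=4, L=10, ref=0):
    """Sketch of the loop over steps S331-S336 for one frequency band.
    X is (M, T); returns the estimated power per frame and steering vector."""
    M, T = X.shape
    sigma2 = np.einsum('mt,mt->t', X.conj(), X).real / M     # step S330, eq. (17)
    for _ in range(n_iter):                                  # until step S337a is met
        F = estimate_dereverb_filter(X, sigma2, d, L)        # step S332
        Y1 = apply_dereverb_filter(X, F, d, L)               # step S333
        R = (Y1 @ Y1.conj().T) / T                           # covariance of y'
        _, evecs = np.linalg.eigh(R)
        v0 = evecs[:, -1] / evecs[ref, -1]                   # step S334 (stand-in)
        Rv = np.linalg.solve(R, v0)
        w0 = Rv / (v0.conj() @ Rv)                           # step S335, minimizes eq. (18)
        y2 = w0.conj() @ Y1                                  # step S336, eq. (19)
        sigma2 = np.maximum(np.abs(y2) ** 2, 1e-8)           # step S331, eq. (20)
    return sigma2, v0                                        # step S337b
```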
Fourth Embodiment
As described above, the steering vector is estimated on the basis of the frequency-divided observation signal xf, t. Here, when the steering vector is estimated after suppressing (preferably, removing) reverberation from the frequency-divided observation signal xf, t, the estimation precision improves. In other words, by acquiring a frequency-divided reverberation-suppressed signal in which the reverberation component of the frequency-divided observation signal xf, t has been suppressed, and acquiring the estimated steering vector from the frequency-divided reverberation-suppressed signal, the precision of the estimated steering vector can be improved.
As illustrated in FIG. 6 , a signal processing device 4 according to this embodiment includes the estimation unit 21, the suppression unit 12, and a parameter estimation unit 43. The estimation unit 21 includes the matrix estimation unit 211 and the convolutional beamformer estimation unit 212. As illustrated in FIG. 7 , the parameter estimation unit 43 includes a reverberation suppression unit 431 and a steering vector estimation unit 432.
The fourth embodiment differs from the first to third embodiments in that before generating the estimated steering vector, the reverberation component of the frequency-divided observation signal xf, t is suppressed. Hereafter, only a method for generating the estimated steering vector will be described.
<Processing of Reverberation Suppression Unit 431 (Step S431)>
The frequency-divided observation signal xf, t is input into the reverberation suppression unit 431 of the parameter estimation unit 43 (FIG. 7 ). The reverberation suppression unit 431 acquires and outputs a frequency-divided reverberation-suppressed signal uf, t in which the reverberation component of the frequency-divided observation signal xf, t has been suppressed (preferably, in which the reverberation component of the frequency-divided observation signal xf, t has been removed). There are no limitations on the method for suppressing (removing) the reverberation component from the frequency-divided observation signal xf, t, and a well-known reverberation suppression (removal) method may be used. For example, the reverberation suppression unit 431 acquires and outputs the frequency-divided reverberation-suppressed signal uf, t in which the reverberation component of the frequency-divided observation signal xf, t has been suppressed using a method described in reference document 1.
  • Reference document 1: Takuya Yoshioka and Tomohiro Nakatani, “Generalization of Multi-Channel Linear Prediction Methods for Blind MIMO Impulse Response Shortening,” IEEE Transactions on Audio, Speech, and Language Processing (Volume: 20, Issue: 10, December 2012)
<Processing of Steering Vector Estimation Unit 432 (Step S432)>
The frequency-divided reverberation-suppressed signal uf, t acquired by the reverberation suppression unit 431 is input into the steering vector estimation unit 432. Using the frequency-divided reverberation-suppressed signal uf, t as input, the steering vector estimation unit 432 generates and outputs an estimated steering vector serving as an estimated vector of the steering vector. A steering vector estimation processing method of acquiring an estimated steering vector using a frequency-divided time series signal as input is well-known. The steering vector estimation unit 432 acquires and outputs the estimated steering vector νf, 0 by using the frequency-divided reverberation-suppressed signal uf, t as the input of a desired type of steering vector estimation processing. There are no limitations on the steering vector estimation processing method, and for example, the method described above in NPL 1 and NPL 2, methods described in reference documents 2 and 3, and so on may be used.
  • Reference document 2: N. Ito, S. Araki, M. Delcroix, and T. Nakatani, “Probabilistic spatial dictionary based online adaptive beamforming for meeting recognition in noise and reverberant environments,” Proc IEEE ICASSP, pp. 681-685, 2017.
  • Reference document 3: S. Markovich-Golan and S. Gannot, “Performance analysis of the covariance subtraction method for relative transfer function estimation and comparison to the covariance whitening method,” Proc IEEE ICASSP, pp. 544-548, 2015.
The estimated steering vector νf, 0 acquired by the steering vector estimation unit 432 is input into the convolutional beamformer estimation unit 212. The convolutional beamformer estimation unit 212 performs the processing of step S212, described in the second embodiment, using the estimated steering vector νf, 0 and the weighted space-time covariance matrix Rf acquired in step S211. All other processing is as described in the first and second embodiments.
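As an illustration of the pipeline of steps S431 and S432, the following sketch estimates a steering vector from the dereverberated signal with a covariance-subtraction style estimate in the spirit of reference documents 2 and 3; the mask input, the eigenvector choice, and all names are illustrative assumptions rather than the prescribed method.

```python
import numpy as np

def steering_vector_after_dereverb(U, noise_mask, ref=0):
    """Sketch of step S432 on the (M, T) reverberation-suppressed signal U
    produced by step S431. noise_mask is a (T,) noise occupancy per frame."""
    T = U.shape[1]
    Rx = (U @ U.conj().T) / T                          # observation covariance
    wn = noise_mask / np.maximum(noise_mask.sum(), 1e-8)
    Rn = (U * wn) @ U.conj().T                         # noise covariance
    # Principal eigenvector of the noise-subtracted covariance
    _, evecs = np.linalg.eigh(Rx - Rn)
    v = evecs[:, -1]
    return v / v[ref]                                  # normalize at the reference mic
```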
Fifth Embodiment
In a fifth embodiment, a method of executing steering vector estimation by successive processing will be described. In so doing, the estimated steering vector of each time frame number t can be calculated from frequency-divided observation signals xf, t input successively online, for example.
As illustrated in FIG. 6 , a signal processing device 5 according to this embodiment includes the estimation unit 21, the suppression unit 12, and a parameter estimation unit 53. The estimation unit 21 includes the matrix estimation unit 211 and the convolutional beamformer estimation unit 212. As illustrated in FIG. 7 , the parameter estimation unit 53 includes a steering vector estimation unit 532. As illustrated in FIG. 8 , the steering vector estimation unit 532 includes an observation signal covariance matrix updating unit 532 a, a main component vector updating unit 532 b, a steering vector updating unit 532 c (the steering vector estimation unit), an inverse noise covariance matrix updating unit 532 d, and a noise covariance matrix updating unit 532 e. The fifth embodiment differs from the first to third embodiments only in that the estimated steering vector is generated by successive processing. Hereafter, only a method of generating the estimated steering vector will be described. The following processing is executed on each time frame number t in ascending order from t=1.
<Processing of Steering Vector Estimation Unit 532 (Step S532)>
The frequency-divided observation signal xf, t, which is a frequency-divided time series signal, is input into the steering vector estimation unit 532 (FIGS. 7 and 8 ).
<<Processing of Observation Signal Covariance Matrix Updating Unit 532 a (Step S532 a)>>
Using the frequency-divided observation signal xf, t as input, the observation signal covariance matrix updating unit 532 a (FIG. 8 ) acquires and outputs a spatial covariance matrix ψx, f, t of the frequency-divided observation signal xf, t (a spatial covariance matrix of a frequency-divided observation signal belonging to a first time interval), which is based on the frequency-divided observation signal xf, t (the frequency-divided observation signal belonging to the first time interval) and a spatial covariance matrix ψx, f, t-1 of a frequency-divided observation signal xf, t-1 (a spatial covariance matrix of a frequency-divided observation signal belonging to a second time interval that is further in the past than the first time interval). For example, the observation signal covariance matrix updating unit 532 a acquires and outputs a linear sum of a covariance matrix xf, txf, t H of the frequency-divided observation signal xf, t (the frequency-divided observation signal belonging to the first time interval) and the spatial covariance matrix ψx, f, t-1 (the spatial covariance matrix of the frequency-divided observation signal belonging to the second time interval that is further in the past than the first time interval) as the spatial covariance matrix ψx, f, t of the frequency-divided observation signal xf, t (the spatial covariance matrix of the frequency-divided observation signal belonging to the first time interval). The observation signal covariance matrix updating unit 532 a acquires and outputs the spatial covariance matrix ψx, f, t in accordance with equation (21) shown below, for example.
\psi_{x,f,t} = \beta\, \psi_{x,f,t-1} + x_{f,t}\, x_{f,t}^H \quad (21)
Here, β is a forgetting factor, and is a real number belonging to a range of 0<β<1, for example. An initial matrix ψx, f, 0 of the spatial covariance matrix ψx, f, t-1 may be set as desired. For example, an M×M-dimensional unit matrix may be set as the initial matrix ψx, f, 0 of the spatial covariance matrix ψx, f, t-1.
<<Processing of Inverse Noise Covariance Matrix Updating Unit 532 d (Step S532 d)>>
The frequency-divided observation signal xf, t and mask information γf, t (n) are input into the inverse noise covariance matrix updating unit 532 d. The mask information γf, t (n) is information expressing the ratio of the noise component included in the frequency-divided observation signal xf, t at a time-frequency point corresponding to the time frame number t and the frequency band number f. In other words, the mask information γf, t (n) expresses the occupancy probability of the noise component included in the frequency-divided observation signal xf, t at a time-frequency point corresponding to the time frame number t and the frequency band number f. There are no limitations on the method of estimating the mask information γf, t (n). Methods of estimating the mask information γf, t (n) are well-known, and include, for example, an estimation method using a complex Gaussian mixture model (CGMM) (reference document 4, for example), an estimation method using a neural network (reference document 5, for example), an estimation method integrating these methods (reference document 6 and reference document 7, for example), and so on.
  • Reference document 4: T. Higuchi, N. Ito, T. Yoshioka, and T. Nakatani, “Robust MVDR beamforming using time-frequency masks for online/offline ASR in noise,” Proc IEEE ICASSP-2016, pp. 5210-5214, 2016.
  • Reference document 5: J. Heymann, L. Drude, and R. Haeb-Umbach, “Neural network based spectral mask estimation for acoustic beamforming,” Proc IEEE ICASSP-2016, pp. 196-200, 2016.
  • Reference document 6: T. Nakatani, N. Ito, T. Higuchi, S. Araki, and K. Kinoshita, “Integrating DNN-based and spatial clustering-based mask estimation for robust MVDR beamforming,” Proc IEEE ICASSP-2017, pp. 286-290, 2017.
  • Reference document 7: Y. Matsui, T. Nakatani, M. Delcroix, K. Kinoshita, S. Araki, and S. Makino, “Online integration of DNN-based and spatial clustering-based mask estimation for robust MVDR beamforming,” Proc. IWAENC, pp. 71-75, 2018.
The mask information γf, t (n) may be estimated in advance and stored in a storage device, not illustrated in the figures, or may be estimated successively. Note that the upper right superscript “(n)” of “γf, t (n)” should be written directly above the lower right subscript “f, t”, but due to notation limitations has been written to the upper right of “f, t”.
The inverse noise covariance matrix updating unit 532 d acquires and outputs an inverse noise covariance matrix ψ−1 n, f, t (an inverse noise covariance matrix of the frequency-divided observation signal belonging to the first time interval) on the basis of the frequency-divided observation signal xf, t (the frequency-divided observation signal belonging to the first time interval), the mask information γf, t (n) (mask information belonging to the first time interval), and an inverse noise covariance matrix ψ−1 n, f, t-1 (an inverse noise covariance matrix of the frequency-divided observation signal belonging to the second time interval that is further in the past than the first time interval). For example, the inverse noise covariance matrix updating unit 532 d acquires and outputs the inverse noise covariance matrix ψ−1 n, f, t in accordance with equation (22), shown below, using the Woodbury formula.
\psi_{n,f,t}^{-1} = \frac{1}{\alpha} \left( \psi_{n,f,t-1}^{-1} - \frac{\gamma_{f,t}^{(n)}\, \psi_{n,f,t-1}^{-1}\, x_{f,t}\, x_{f,t}^H\, \psi_{n,f,t-1}^{-1}}{\alpha + \gamma_{f,t}^{(n)}\, x_{f,t}^H\, \psi_{n,f,t-1}^{-1}\, x_{f,t}} \right) \quad (22)
Here, α is a forgetting factor, and is a real number belonging to a range of 0<α<1, for example. An initial matrix ψ−1 n, f, 0 of the inverse noise covariance matrix ψ−1 n, f, t-1 may be set as desired. For example, an M×M-dimensional unit matrix may be set as the initial matrix ψ−1 n, f, 0 of the inverse noise covariance matrix ψ−1 n, f, t-1. Note that the upper right superscript “−1” of “ψ−1 n, f, t” should be written directly above the lower right subscript “n, f, t”, but due to notation limitations has been written to the upper left of “n, f, t”.
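A minimal numpy sketch of the rank-one update of equation (22) follows; the function name and argument layout are assumptions of this sketch.

```python
import numpy as np

def update_inv_noise_cov(Pn_inv, x, gamma_n, alpha=0.99):
    """Equation (22): recursive inverse noise covariance update (step S532d)
    via the Woodbury formula. alpha is the forgetting factor and gamma_n
    the noise mask value at this time-frequency point."""
    Px = Pn_inv @ x                                    # (M,) helper product
    denom = alpha + gamma_n * (x.conj() @ Px)
    return (Pn_inv - gamma_n * np.outer(Px, Px.conj()) / denom) / alpha
```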
<<Processing of Main Component Vector Updating Unit 532 b (Step S532 b)>>
The spatial covariance matrix ψx, f, t acquired by the observation signal covariance matrix updating unit 532 a and the inverse noise covariance matrix ψ−1 n, f, t acquired by the inverse noise covariance matrix updating unit 532 d are input into the main component vector updating unit 532 b. The main component vector updating unit 532 b acquires and outputs a main component vector v˜ f, t (a main component vector of the first time interval) relating to ψ−1 n, f, tψx, f, t (the product of an inverse matrix of the noise covariance matrix of the frequency-divided observation signal and the spatial covariance matrix of the frequency-divided observation signal belonging to the first time interval) by using a power method on the basis of the inverse noise covariance matrix ψ−1 n, f, t, the spatial covariance matrix ψx, f, t, and a main component vector v˜ f, t-1 (a main component vector of the second time interval). For example, the main component vector updating unit 532 b acquires and outputs a main component vector v˜ f, t based on ψ−1 n, f, tψx, f, tv˜ f, t-1, in accordance with equations (23) and (24) shown below, for example. Note that the upper right superscript “˜” of “v˜ f, t” should be written directly above “v”, but due to notation limitations has been written to the upper right of “v”.
\tilde{v}'_{f,t} = \psi_{n,f,t}^{-1}\, \psi_{x,f,t}\, \tilde{v}_{f,t-1} \quad (23)
\tilde{v}_{f,t} = \frac{\tilde{v}'_{f,t}}{\tilde{v}'^{\,\mathrm{ref}}_{f,t}} \quad (24)
Here, v˜′ f, t ref expresses the element corresponding to a predetermined microphone (a reference microphone ref) serving as a reference, among the M elements of the vector v˜′ f, t acquired from equation (23). In other words, in the example of equations (23) and (24), the main component vector updating unit 532 b sets a vector acquired by normalizing the respective elements of v˜′ f, t=ψ−1 n, f, tψx, f, tv˜ f, t-1 by v˜′ f, t ref as the main component vector v˜ f, t. Note that the upper right superscript “˜” of “v˜′ f, t” should be written directly above “v”, but due to notation limitations has been written to the upper right of “v”.
<<Processing of Noise Covariance Matrix Updating Unit 532 e (Step S532 e)>>
The noise covariance matrix updating unit 532 e, using the frequency-divided observation signal xf, t (the frequency-divided observation signal belonging to the first time interval) and the mask information γf, t (n) (the mask information of the first time interval) as input, acquires and outputs a noise covariance matrix ψn, f, t of the frequency-divided observation signal xf, t (a noise covariance matrix of the frequency-divided observation signal belonging to the first time interval), which is based on the frequency-divided observation signal xf, t, the mask information γf, t (n), and a noise covariance matrix ψn, f, t-1 (a noise covariance matrix of the frequency-divided observation signal belonging to the second time interval that is further in the past than the first time interval). For example, the noise covariance matrix updating unit 532 e acquires and outputs the linear sum of the product γf, t (n)xf, txf, t H of the covariance matrix xf, txf, t H of the frequency-divided observation signal xf, t and the mask information γf, t (n), and the noise covariance matrix ψn, f, t-1 as the noise covariance matrix ψn, f, t of the frequency-divided observation signal xf, t. For example, the noise covariance matrix updating unit 532 e acquires and outputs the noise covariance matrix ψn, f, t in accordance with equation (25) shown below.
\psi_{n,f,t} = \alpha\, \psi_{n,f,t-1} + \gamma_{f,t}^{(n)}\, x_{f,t}\, x_{f,t}^H \quad (25)
Here, α is a forgetting factor, and is a real number belonging to a range of 0<α<1, for example.
<<Processing of Steering Vector Updating Unit 532 c (Step S532 c)>>
The steering vector updating unit 532 c, using the main component vector v˜ f, t (the main component vector of the first time interval) acquired by the main component vector updating unit 532 b and the noise covariance matrix ψn, f, t (the noise covariance matrix of the frequency-divided observation signal) acquired by the noise covariance matrix updating unit 532 e as input, acquires and outputs an estimated steering vector νf, t (an estimated steering vector of the first time interval) on the basis thereof. For example, the steering vector updating unit 532 c acquires and outputs an estimated steering vector νf, t based on ψn, f, tv˜ f, t. The steering vector updating unit 532 c acquires and outputs the estimated steering vector νf, t in accordance with equations (26) and (27) shown below, for example.
v′ f,tn,f,t {tilde over (v)} f,t  (26)
v f , t = v f , t v f , t r e f ( 27 )
Here, v′f, t ref expresses the element corresponding to the reference microphone ref, among the M elements of the vector v′f, t acquired from equation (26). In other words, in the example of equations (26) and (27), the steering vector updating unit 532 c sets a vector acquired by normalizing the respective elements of v′f, t=ψn, f, tv˜ f, t by v′f, t ref as the estimated steering vector νf, t.
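Pulling steps S532 b, S532 c, and S532 e together, the following per-frame sketch assumes that ψx, f, t, ψ−1 n, f, t, and ψn, f, t have already been updated as described above; the names and argument layout are illustrative.

```python
import numpy as np

def update_steering_vector(Px, Pn_inv, Pn, v_prev, ref=0):
    """One successive steering vector update for a single frequency band.
    Px : spatial covariance (eq. 21), Pn_inv : inverse noise covariance
    (eq. 22), Pn : noise covariance (eq. 25), v_prev : main component
    vector of the previous frame."""
    v = Pn_inv @ (Px @ v_prev)        # eq. (23): one power-method iteration
    v = v / v[ref]                    # eq. (24): normalize at the reference mic
    sv = Pn @ v                       # eq. (26)
    return sv / sv[ref], v            # eq. (27): estimated steering vector, plus
                                      # the main component vector to carry over
```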
The estimated steering vector νf, t acquired by the steering vector estimation unit 532 is input into the convolutional beamformer estimation unit 212. The convolutional beamformer estimation unit 212 treats the estimated steering vector νf, t as νf, 0, and performs the processing of step S212, described in the second embodiment, using the estimated steering vector νf, t and the weighted space-time covariance matrix Rf acquired in step S211. All other processing is as described in the first and second embodiments. Further, as σf, t 2 input into the matrix estimation unit 211, either the provisional power generated as illustrated in equation (17) or the estimated power σf, t 2 generated as described in the third embodiment, for example, may be used.
Modified Example 1 of Fifth Embodiment
In step S532 d of the fifth embodiment, the inverse noise covariance matrix updating unit 532 d adaptively updates the inverse noise covariance matrix ψ−1 n, f, t at each time point corresponding to the time frame number t by using the frequency-divided observation signal xf, t and the mask information γf, t (n). However, the inverse noise covariance matrix updating unit 532 d may acquire and output the inverse noise covariance matrix ψ−1 n, f, t by using a frequency-divided observation signal xf, t of a time interval in which the noise component either exists alone or is dominant, without using the mask information γf, t (n). For example, the inverse noise covariance matrix updating unit 532 d may output, as the inverse noise covariance matrix ψ−1 n, f, t, an inverse matrix of the temporal average of xf, txf, t H with respect to a frequency-divided observation signal xf, t of a time interval in which the noise component either exists alone or is dominant. The inverse noise covariance matrix ψ−1 n, f, t acquired in this manner is used continuously in the frames having the respective time frame numbers t.
In step S532 e of the fifth embodiment, the noise covariance matrix updating unit 532 e may acquire and output the noise covariance matrix ψn, f, t of the frequency-divided observation signal xf, t using a frequency-divided observation signal xf, t of a time interval in which the noise component either exists alone or is dominant, without using the mask information γf, t (n). For example, the noise covariance matrix updating unit 532 e may output, as the noise covariance matrix ψn, f, t, the temporal average of xf, txf, t H with respect to a frequency-divided observation signal xf, t of a time interval in which the noise component either exists alone or is dominant. The noise covariance matrix ψn, f, t acquired in this manner is used continuously in the frames having the respective time frame numbers t.
Modified Example 2 of Fifth Embodiment
In the fifth embodiment and the modified example thereof, a case in which the first time interval is the frame having the time frame number t and the second time interval is the frame having the time frame number t−1 was used as an example, but the present invention is not limited thereto. A frame having a time frame number other than the time frame number t may be set as the first time interval, and a time frame that is further in the past than the first time interval and has a time frame number other than the time frame number t−1 may be set as the second time interval.
Sixth Embodiment
In the fifth embodiment, the steering vector estimation unit 532 acquires and outputs the estimated steering vector νf, t by successive processing using the frequency-divided observation signal xf, t as input. As noted in the fourth embodiment, however, by estimating the steering vector after suppressing reverberation from the frequency-divided observation signal xf, t, the estimation precision is improved. In the sixth embodiment, an example in which the steering vector estimation unit acquires and outputs the estimated steering vector νf, t by successive processing, as described in the fifth embodiment, after reverberation has been suppressed from the frequency-divided observation signal xf, t will be described.
As illustrated in FIG. 6 , a signal processing device 6 according to this embodiment includes the estimation unit 21, the suppression unit 12, and a parameter estimation unit 63. As illustrated in FIG. 7 , the parameter estimation unit 63 includes the reverberation suppression unit 431 and a steering vector estimation unit 632. The sixth embodiment differs from the fifth embodiment in that before generating the estimated steering vector, the reverberation component of the frequency-divided observation signal xf, t is suppressed. Hereafter, only a method of generating the estimated steering vector will be described.
<Processing of Reverberation Suppression Unit 431 (Step S431)>
As described in the fourth embodiment, the reverberation suppression unit 431 (FIG. 7 ) acquires and outputs the frequency-divided reverberation-suppressed signal uf, t in which the reverberation component of the frequency-divided observation signal xf, t has been suppressed (preferably, in which the reverberation component of the frequency-divided observation signal xf, t has been removed).
<Processing of Steering Vector Estimation Unit 632 (Step S632)>
The frequency-divided reverberation-suppressed signal uf, t is input into the steering vector estimation unit 632. The processing of the steering vector estimation unit 632 is identical to the processing of the steering vector estimation unit 532 of the fifth embodiment except that the frequency-divided reverberation-suppressed signal uf, t, rather than the frequency-divided observation signal xf, t, is input into the steering vector estimation unit 632, and the steering vector estimation unit 632 uses the frequency-divided reverberation-suppressed signal uf, t instead of the frequency-divided observation signal xf, t. In other words, in the processing performed by the steering vector estimation unit 632, the frequency-divided observation signal xf, t used in the processing of the steering vector estimation unit 532 is replaced by the frequency-divided reverberation-suppressed signal uf, t. All other processing is identical to the fifth embodiment and the modified examples thereof. More specifically, the frequency-divided reverberation-suppressed signal uf, t, which is a frequency-divided time series signal, is input into the steering vector estimation unit 632. The observation signal covariance matrix updating unit 532 a acquires and outputs the spatial covariance matrix ψx, f, t of the frequency-divided reverberation-suppressed signal uf, t belonging to the first time interval, which is based on the frequency-divided reverberation-suppressed signal uf, t belonging to the first time interval and the spatial covariance matrix ψx, f, t-1 of a frequency-divided reverberation-suppressed signal uf, t-1 belonging to the second time interval that is further in the past than the first time interval. The main component vector updating unit 532 b acquires and outputs the main component vector v˜ f, t of the first time interval with respect to the product ψ−1 n, f, tψx, f, t of the inverse matrix ψ−1 n, f, t of the noise covariance matrix of the frequency-divided reverberation-suppressed signal and the spatial covariance matrix ψx, f, t of the frequency-divided reverberation-suppressed signal belonging to the first time interval, on the basis of the inverse matrix ψ−1 n, f, t of the noise covariance matrix of the frequency-divided reverberation-suppressed signal uf, t, the spatial covariance matrix ψx, f, t of the frequency-divided reverberation-suppressed signal belonging to the first time interval, and the main component vector v˜ f, t-1 of the second time interval. The steering vector updating unit 532 c acquires and outputs the estimated steering vector νf, t of the first time interval on the basis of the noise covariance matrix of the frequency-divided reverberation-suppressed signal uf, t and the main component vector v˜ f, t of the first time interval.
Seventh Embodiment
In a seventh embodiment, a method of estimating the convolutional beamformer by successive processing will be described. In so doing, the convolutional beamformer of each time frame number t can be estimated and the estimation signal of the target signal yf, t can be acquired from frequency-divided observation signals xf, t input successively online, for example.
As illustrated in FIG. 6 , a signal processing device 7 according to this embodiment includes an estimation unit 71, a suppression unit 72, and the parameter estimation unit 53. The estimation unit 71 includes a matrix estimation unit 711 and a convolutional beamformer estimation unit 712. The following processing is executed on each time frame number t in ascending order from t=1.
<Processing of Parameter Estimation Unit 53 (Step S53)>
The frequency-divided observation signal xf, t is input into the parameter estimation unit 53 (FIGS. 6 and 7 ). As described in the fifth embodiment, the steering vector estimation unit 532 (FIG. 8 ) of the parameter estimation unit 53 acquires and outputs the estimated steering vector νf, t by successive processing using the frequency-divided observation signal xf, t as input (step S532). The estimated steering vector νf, t is represented by the following M-dimensional vector.
\nu_{f,t} = [\nu_{f,t}^{(1)}, \nu_{f,t}^{(2)}, \ldots, \nu_{f,t}^{(M)}]^T
Here, νf, t (m) represents an element corresponding to the microphone having the microphone number m, among the M elements of the estimated steering vector νf, t. The estimated steering vector νf, t acquired by the steering vector estimation unit 532 is input into the convolutional beamformer estimation unit 712.
<Processing of Matrix Estimation Unit 711 (Step S711)>
The frequency-divided observation signal xf, t and the power or estimated power σf, t 2 of the target signal are input into the matrix estimation unit 711 (FIG. 6 ). As σf, t 2 input into the matrix estimation unit 711, either the provisional power generated as illustrated in equation (17) or the estimated power σf, t 2 generated as described in the third embodiment, for example, may be used. On the basis of the frequency-divided observation signal xf, t (the frequency-divided observation signal belonging to the first time interval), the power or estimated power σf, t 2 of the target signal (the power or estimated power of the frequency-divided observation signal belonging to the first time interval), and an inverse matrix R−1 f, t-1 of a space-time covariance matrix (an inverse matrix of the space-time covariance matrix of the second time interval that is further in the past than the first time interval), the matrix estimation unit 711 estimates and outputs an inverse matrix R−1 f, t of a space-time covariance matrix (an inverse matrix of the space-time covariance matrix of the first time interval). An example of the space-time covariance matrix is as follows.
R_{f,t} = \sum_{\tau=0}^{t} \alpha^{t-\tau}\, \frac{\bar{x}_{f,\tau}\, \bar{x}_{f,\tau}^H}{\sigma_{f,\tau}^2}
In this case, the matrix estimation unit 711 generates and outputs the inverse matrix R−1 f, t of the space-time covariance matrix in accordance with equations (28) and (29) shown below, for example.
k_{f,t} = \frac{R_{f,t-1}^{-1}\, \bar{x}_{f,t}}{\alpha\, \sigma_{f,t}^2 + \bar{x}_{f,t}^H\, R_{f,t-1}^{-1}\, \bar{x}_{f,t}} \quad (28)
R_{f,t}^{-1} = \frac{1}{\alpha} \left( R_{f,t-1}^{-1} - k_{f,t}\, \bar{x}_{f,t}^H\, R_{f,t-1}^{-1} \right) \quad (29)
Here, kf, t in equation (28) is an (L+1)M-dimensional vector, and the inverse matrix of equation (29) is an (L+1)M×(L+1)M matrix. α is a forgetting factor, and is a real number belonging to a range of 0<α<1, for example. Further, an initial matrix of the inverse matrix R−1 f, t-1 of the space-time covariance matrix may be set as desired, and an example of the initial matrix is the (L+1)M-dimensional unit matrix shown below.
R_{f,0}^{-1} = I_{(L+1)M}
<Processing of Convolutional Beamformer Estimation Unit 712 (Step S712)>
The inverse matrix R−1 f, t of the space-time covariance matrix (the inverse matrix of the space-time covariance matrix of the first time interval) acquired by the matrix estimation unit 711 and the estimated steering vector νf, t acquired by the parameter estimation unit 53 are input into the convolutional beamformer estimation unit 712. The convolutional beamformer estimation unit 712 acquires and outputs the convolutional beamformer w f, t (the convolutional beamformer of the first time interval) on the basis thereof. For example, the convolutional beamformer estimation unit 712 acquires and outputs the convolutional beamformer w f, t in accordance with equation (30), shown below.
\bar{w}_{f,t} = \frac{R_{f,t}^{-1}\, \bar{\nu}_{f,t}}{\bar{\nu}_{f,t}^H\, R_{f,t}^{-1}\, \bar{\nu}_{f,t}} \quad (30)
where \bar{\nu}_{f,t} = [\bar{\nu}_{f,t}^{(1)}, \bar{\nu}_{f,t}^{(2)}, \ldots, \bar{\nu}_{f,t}^{(M)}] and \bar{\nu}_{f,t}^{(m)} = [g_f\, \nu_{f,t}^{(m)}, 0, \ldots, 0]^T is an (L+1)-dimensional vector. gf is a scalar constant other than 0.
<Processing of Suppression Unit 72 (Step S72)>
The frequency-divided observation signal xf, t and the convolutional beamformer w f, t acquired by the beamformer estimation unit 712 are input into the suppression unit 72. The suppression unit 72 acquires and outputs the estimation signal of the target signal yf, t by applying the convolutional beamformer w f, t to the frequency-divided observation signal xf, t in each time frame number t and frequency band number f. For example, the suppression unit 72 acquires and outputs the estimation signal of the target signal yf, t in accordance with equation (31) shown below.
y_{f,t} = \bar{w}_{f,t}^H\, \bar{x}_{f,t} \quad (31)
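Combining steps S711, S712, and S72, one per-frame update for a single frequency band can be sketched as follows; the argument layout, in particular the construction of the stacked vector and the padded steering vector outside the function, is an assumption of this sketch.

```python
import numpy as np

def online_wpd_step(Rinv, x_bar, v_bar, sigma2, alpha=0.9999):
    """One per-frame update of the convolutional beamformer (steps S711,
    S712, S72; equations (28)-(31)) for a single frequency band.

    Rinv   : ((L+1)M, (L+1)M) inverse weighted space-time covariance R^-1_{f,t-1}
    x_bar  : ((L+1)M,) current frame stacked with its delayed frames
    v_bar  : ((L+1)M,) padded steering vector [g_f * v^(m), 0, ..., 0] per mic
    sigma2 : scalar target power estimate for this frame
    """
    Px = Rinv @ x_bar
    k = Px / (alpha * sigma2 + x_bar.conj() @ Px)              # eq. (28)
    Rinv = (Rinv - np.outer(k, x_bar.conj() @ Rinv)) / alpha   # eq. (29)
    Rv = Rinv @ v_bar
    w = Rv / (v_bar.conj() @ Rv)                               # eq. (30)
    y = w.conj() @ x_bar                                       # eq. (31)
    return y, w, Rinv
```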
Modified Example 1 of Seventh Embodiment
The parameter estimation unit 53 of the signal processing device 7 according to the seventh embodiment may be replaced by the parameter estimation unit 63. In other words, in the seventh embodiment, the parameter estimation unit 63, rather than the parameter estimation unit 53, may acquire and output the estimated steering vector νf, t by successive processing, as described in the sixth embodiment, using the frequency-divided observation signal xf, t as input.
Modified Example 2 of Seventh Embodiment
In the seventh embodiment and the modified example thereof, a case in which the first time interval is the frame having the time frame number t and the second time interval is the frame having the time frame number t−1 was used as an example, but the present invention is not limited thereto. A frame having a time frame number other than the time frame number t may be set as the first time interval, and a time frame that is further in the past than the first time interval and has a time frame number other than the time frame number t−1 may be set as the second time interval.
Eighth Embodiment
In the second embodiment, an example was described in which the analytical solution of w f minimizing the cost function C3 (w f) under the constraint condition that wf, 0 Hνf, 0 is a constant is given by equation (15), and the convolutional beamformer w f is acquired in accordance with equation (15). In an eighth embodiment, an example in which the convolutional beamformer is acquired using a different optimal solution will be described.
When an (M−1)×M block matrix corresponding to the orthogonal complement of the estimated steering vector νf, 0 is set as Bf, Bf Hνf, 0=0 is satisfied. An infinite number of block matrices Bf of this type exist. Equation (32) below shows an example of the block matrix Bf.
B_f^H = \left[ -\frac{\tilde{\nu}_{f,0}}{\nu_{f,0}^{\mathrm{ref}}},\ I_{M-1} \right] \quad (32)
Here, ν˜ f, 0 is an M−1-dimensional column vector constituted by elements of the steering vector νf, 0 or the estimated steering vector νf, 0 that correspond to microphones other than the reference microphone ref, νf, 0 ref is the element of νf, 0 that corresponds to the reference microphone ref, and IM-1 is an (M−1)×(M−1)-dimensional unit matrix.
gf is set as a scalar constant other than 0, af, 0 is set as an M-dimensional modified instantaneous beamformer, and the instantaneous beamformer wf, 0 is expressed as the sum of a constant multiple gfνf, 0 of the steering vector νf, 0 or a constant multiple gfνf, 0 of the estimated steering vector νf, 0 and a product Bfaf, 0 of the block matrix Bf corresponding to the orthogonal complement of the steering vector νf, 0 or the estimated steering vector νf, 0 and the modified instantaneous beamformer af, 0. In other words, the instantaneous beamformer wf, 0 is expressed as
w_{f,0} = g_f\, \nu_{f,0} + B_f\, a_{f,0} \quad (33)
Accordingly, Bf Hνf, 0=0, and therefore the constraint condition that “wf, 0 Hνf, 0 is a constant” is expressed as follows.
w_{f,0}^H\, \nu_{f,0} = (g_f\, \nu_{f,0} + B_f\, a_{f,0})^H\, \nu_{f,0} = g_f^H\, \| \nu_{f,0} \|^2 = \text{constant}
Hence, even under the definition given in equation (33), the constraint condition that “wf, 0 Hνf, 0 is a constant” is satisfied in relation to any modified instantaneous beamformer af, 0. It is therefore evident that the instantaneous beamformer wf, 0 may be defined as illustrated in equation (33). In this embodiment, the convolutional beamformer is estimated using the optimal solution of the convolutional beamformer acquired when the instantaneous beamformer wf, 0 is defined as illustrated in equation (33). This will be described in detail below.
As illustrated in FIG. 9 , a signal processing device 8 according to this embodiment includes an estimation unit 81, a suppression unit 82, and a parameter estimation unit 83. The estimation unit 81 includes a matrix estimation unit 811, a convolutional beamformer estimation unit 812, an initial beamformer application unit 813, and a block unit 814.
<Processing of Parameter Estimation Unit 83 (Step S83)>
The parameter estimation unit 83 (FIG. 9 ), using the frequency-divided observation signal xf, t as input, acquires the estimated steering vector by an identical method to any of the parameter estimation units 33, 43, 53, 63 described above, and outputs the acquired estimated steering vector as νf, 0. The output estimated steering vector νf, 0 is transmitted to the initial beamformer application unit 813 and the block unit 814.
<Processing of Initial Beamformer Application Unit 813 (Step S813)>
The estimated steering vector νf, 0 and the frequency-divided observation signal xf, t are input into the initial beamformer application unit 813. The initial beamformer application unit 813 acquires and outputs an initial beamformer output zf, t (an initial beamformer output of the first time interval) based on the estimated steering vector νf, 0 and the frequency-divided observation signal xf, t (the frequency-divided observation signal belonging to the first time interval). For example, the initial beamformer application unit 813 acquires and outputs an initial beamformer output zf, t based on the constant multiple of the estimated steering vector νf, 0 and the frequency-divided observation signal xf, t. The initial beamformer application unit 813 acquires and outputs the initial beamformer output zf, t in accordance with equation (34) shown below, for example.
z_{f,t} = (g_f\, \nu_{f,0})^H\, x_{f,t} \quad (34)
The output initial beamformer output zf, t is transmitted to the convolutional beamformer estimation unit 812 and the suppression unit 82.
<Processing of Block Unit 814 (Step S814)>
The estimated steering vector νf, 0 and the frequency-divided observation signal xf, t are input into the block unit 814. The block unit 814 acquires and outputs a vector x= f, t based on the frequency-divided observation signal xf, t and the block matrix Bf corresponding to the orthogonal complement of the estimated steering vector νf, 0. As noted above, Bf Hνf, 0=0 is satisfied. Equation (32) shows an example of the block matrix Bf, but the present invention is not limited to this example, and any block matrix Bf in which Bf Hνf, 0=0 is satisfied may be used. The block unit 814 acquires and outputs the vector x= f, t in accordance with equations (35) and (36) shown below, for example.
\bar{\bar{x}}_{f,t-d}^{(m)} = [x_{f,t-d}^{(m)}, x_{f,t-d-1}^{(m)}, \ldots, x_{f,t-d-L+1}^{(m)}]^T \quad (35)
\bar{\bar{x}}_{f,t} = [(B_f^H\, x_{f,t})^T, \bar{\bar{x}}_{f,t-d}^{(1)T}, \bar{\bar{x}}_{f,t-d}^{(2)T}, \ldots, \bar{\bar{x}}_{f,t-d}^{(M)T}]^T \quad (36)
Note that the upper right superscript “=” of “x= f, t” should be written directly above “x”, as shown in equation (36), but due to notation limitations may also be written to the upper right of “x”. The output vector x= f, t is transmitted to the matrix estimation unit 811, the convolutional beamformer estimation unit 812, and the suppression unit 82. Further, when L=0, the right side of equation (35) becomes a vector in which the number of elements is 0 (an empty vector), whereby equation (36) is as shown below in equation (36A).
\bar{\bar{x}}_{f,t} = B_f^H\, x_{f,t} \quad (36A)
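As a concrete illustration of equations (32), (35), and (36), the following sketch builds one valid block matrix and the stacked vector; the reference-microphone handling and all names are assumptions, and these helpers are reused by the sketches below.

```python
import numpy as np

def block_matrix_H(v, ref=0):
    """Equation (32): one valid B_f^H with B_f^H v = 0; each row pairs one
    non-reference microphone with the reference microphone ref."""
    M = v.shape[0]
    others = [m for m in range(M) if m != ref]
    Bh = np.zeros((M - 1, M), dtype=complex)
    Bh[:, ref] = -v[others] / v[ref]
    Bh[np.arange(M - 1), others] = 1.0
    return Bh

def blocked_vector(Bh, X, t, d=4, L=10):
    """Equations (35)-(36): concatenate the blocked current frame with the
    delayed frames x_{t-d}, ..., x_{t-d-L+1} of each microphone.
    X is (M, T); requires t >= d + L - 1."""
    parts = [Bh @ X[:, t]]
    for m in range(X.shape[0]):
        parts.append(X[m, t - d - L + 1:t - d + 1][::-1])
    return np.concatenate(parts)    # length (M-1) + L*M
```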
<Processing of Matrix Estimation Unit 811 (Step S811)>
The vector x= f, t acquired by the block unit 814 and the power or estimated power σf, t 2 of the target signal are input into the matrix estimation unit 811. Either the provisional power generated as illustrated in equation (17) or the estimated power σf, t 2 generated as described in the third embodiment, for example, may be used as σf, t 2. Using the vector x= f, t and the power or estimated power σf, t 2 of the target signal, the matrix estimation unit 811 acquires and outputs a weighted modified space-time covariance matrix R= f, which is based on the estimated steering vector νf, 0, the frequency-divided observation signal xf, t, and the power or estimated power σf, t 2 of the target signal and increases the probability expressing the speech-likeness of the estimation signal when the instantaneous beamformer wf, 0 is expressed as illustrated in equation (33). For example, the matrix estimation unit 811 acquires and outputs a weighted modified space-time covariance matrix R= f based on the vector x= f, t and the power or estimated power σf, t 2 of the target signal. The matrix estimation unit 811 acquires and outputs the weighted modified space-time covariance matrix R= f in accordance with equation (37) below, for example.
\bar{\bar{R}}_f = \sum_{t=1}^{N} \frac{\bar{\bar{x}}_{f,t}\, \bar{\bar{x}}_{f,t}^H}{\sigma_{f,t}^2} \quad (37)
The output weighted modified space-time covariance matrix R= f is transmitted to the convolutional beamformer estimation unit 812.
<Processing of Convolutional Beamformer Estimation Unit 812 (Step S812)>
The initial beamformer output zf, t acquired by the initial beamformer application unit 813, the vector x= f, t acquired by the block unit 814, and the weighted modified space-time covariance matrix R= f acquired by the matrix estimation unit 811 are input into the convolutional beamformer estimation unit 812. Using these, the convolutional beamformer estimation unit 812 acquires and outputs a convolutional beamformer w= f that is based on the estimated steering vector νf, 0, the weighted modified space-time covariance matrix R= f, and the frequency-divided observation signal xf, t. For example, the convolutional beamformer estimation unit 812 acquires and outputs the convolutional beamformer w= f in accordance with equation (38) shown below.
\bar{\bar{w}}_f = -\bar{\bar{R}}_f^{-1} \sum_{t=1}^{N} \frac{\bar{\bar{x}}_{f,t}\, z_{f,t}^H}{\sigma_{f,t}^2} \quad (38)
\bar{\bar{w}}_f = [a_{f,0}^T, w_f^{(1)T}, w_f^{(2)T}, \ldots, w_f^{(M)T}]^T \quad (38A)
w_f^{(m)} = [w_{f,d}^{(m)}, w_{f,d+1}^{(m)}, \ldots, w_{f,d+L-1}^{(m)}]^T \quad (38B)
The output convolutional beamformer w= f is transmitted to the suppression unit 82.
Note that when L=0, the right side of equation (38B) becomes a vector in which the number of elements is 0 (an empty vector), whereby equation (38A) is as shown below.
\bar{\bar{w}}_f = a_{f,0}
<Processing of Suppression Unit 82 (Step S82)>
The vector x= f, t output from the block unit 814, the initial beamformer output zf, t output from the initial beamformer application unit 813, and the convolutional beamformer w= f output from the convolutional beamformer estimation unit 812 are input into the suppression unit 82. The suppression unit 82 acquires and outputs the estimation signal of the target signal yf, t by applying the initial beamformer output zf, t and the convolutional beamformer w= f to the vector x= f, t. This processing is equivalent to processing for acquiring and outputting the estimation signal of the target signal yf, t by applying the convolutional beamformer w f to the frequency-divided observation signal xf, t. For example, the suppression unit 82 acquires and outputs the estimation signal of the target signal yf, t in accordance with equation (39) shown below.
y_{f,t} = z_{f,t} + \bar{\bar{w}}_f^H\, \bar{\bar{x}}_{f,t} \quad (39)
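Putting steps S813 to S82 together, an offline sketch for one frequency band follows, reusing block_matrix_H and blocked_vector from the sketch above. It assumes that equation (38) is the weighted least-squares solution implied by equations (37) and (39), and that gf is real; all names are illustrative.

```python
import numpy as np

def wpd_factorized_offline(X, v, sigma2, g=1.0, ref=0, d=4, L=10):
    """Offline sketch of steps S813-S82 for one frequency band.
    X : (M, T) observations, v : (M,) steering vector, sigma2 : (T,) power."""
    M, T = X.shape
    Bh = block_matrix_H(v, ref)                        # eq. (32)
    z = g * (v.conj() @ X)                             # eq. (34), shape (T,)
    t0 = d + L - 1                                     # first frame with full history
    Xb = np.stack([blocked_vector(Bh, X, t, d, L)
                   for t in range(t0, T)], axis=1)     # eqs. (35)-(36)
    w_t = 1.0 / np.maximum(sigma2[t0:], 1e-8)
    R = (Xb * w_t) @ Xb.conj().T                       # eq. (37)
    p = (Xb * w_t) @ z[t0:].conj()
    w = -np.linalg.solve(R + 1e-6 * np.eye(R.shape[0]), p)   # eq. (38)
    y = z[t0:] + w.conj() @ Xb                         # eq. (39)
    return y, w
```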
Modified Example 1 of Eighth Embodiment
A known steering vector νf, 0 acquired on the basis of actual measurement or the like may be input into the initial beamformer application unit 813 and the block unit 814 instead of the estimated steering vector νf, 0 acquired by the parameter estimation unit 83. In this case, the initial beamformer application unit 813 and the block unit 814 perform steps S813 and S814, described above, using the steering vector νf, 0 instead of the estimated steering vector νf, 0.
Ninth Embodiment
In a ninth embodiment, a method for executing convolutional beamformer estimation based on the eighth embodiment by successive processing will be described. The following processing is executed on each time frame number t in ascending order from t=1.
As illustrated in FIG. 10 , a signal processing device 9 according to this embodiment includes an estimation unit 91, a suppression unit 92, and a parameter estimation unit 93. The estimation unit 91 includes an adaptive gain estimation unit 911, a convolutional beamformer estimation unit 912, a matrix estimation unit 915, the initial beamformer application unit 813, and the block unit 814.
<Processing of Parameter Estimation Unit 93 (Step S93)>
The parameter estimation unit 93 (FIG. 10 ), using the frequency-divided observation signal xf, t as input, acquires and outputs the estimated steering vector νf, t by an identical method to either of the parameter estimation units 53, 63 described above. The output estimated steering vector νf, t is transmitted to the initial beamformer application unit 813 and the block unit 814.
<Processing of Initial Beamformer Application Unit 813 (Step S813)>
The estimated steering vector νf, t (the estimated steering vector of the first time interval) and the frequency-divided observation signal xf, t (the frequency-divided observation signal belonging to the first time interval) are input into the initial beamformer application unit 813, and the initial beamformer application unit 813 acquires and outputs the initial beamformer output zf, t (the initial beamformer output of the first time interval) as described in the eighth embodiment using νf, t instead of νf, 0. The output initial beamformer output zf, t is transmitted to the suppression unit 92.
<Processing of Block Unit 814 (Step S814)>
The estimated steering vector νf, t and the frequency-divided observation signal xf, t are input into the block unit 814, and the block unit 814 acquires and outputs the vector x= f, t as described in the eighth embodiment by using νf, t instead of νf, 0. The output vector x= f, t is transmitted to the adaptive gain estimation unit 911, the matrix estimation unit 915, and the suppression unit 92.
<Processing of Suppression Unit 92 (Step S92)>
The initial beamformer output zf, t output from the initial beamformer application unit 813 and the vector x= f, t output from the block unit 814 are input into the suppression unit 92. Using these, the suppression unit 92 acquires and outputs the estimation signal of the target signal yf, t, which is based on the initial beamformer output zf, t (the initial beamformer output of the first time interval), the estimated steering vector vf, t (the estimated steering vector of the first time interval), the frequency-divided observation signal xf, t, and a convolutional beamformer w= f, t-1 (the convolutional beamformer of the second time interval that is further in the past than the first time interval). For example, the suppression unit 92 acquires and outputs the estimation signal of the target signal yf, t in accordance with equation (40) below.
y_{f,t} = z_{f,t} + \bar{\bar{w}}_{f,t-1}^H\, \bar{\bar{x}}_{f,t} \quad (40)
Here, the initial vector w= f, 0 of the convolutional beamformer w= f, t-1 may be any (LM+M−1)-dimensional vector. An example of the initial vector w= f, 0 is an (LM+M−1)-dimensional vector in which all elements are 0.
<Processing of Adaptive Gain Estimation Unit 911 (Step S911)>
The vector x= f, t output from the block unit 814, an inverse matrix R˜−1 f, t-1 of the weighted modified space-time covariance matrix output from the matrix estimation unit 915, and the power or estimated power σf, t 2 of the target signal are input into the adaptive gain estimation unit 911. As σf, t 2 input into the adaptive gain estimation unit 911, either the provisional power generated as illustrated in equation (17) or the estimated power σf, t 2 generated as described in the third embodiment, for example, may be used. Note that the “˜” of “R˜−1 f, t-1” should be written directly above the “R”, but due to notation limitations may also be written to the upper right of “R”. Using these, the adaptive gain estimation unit 911 acquires and outputs an adaptive gain kf, t (the adaptive gain of the first time interval) that is based on the inverse matrix R˜−1 f, t-1 of the weighted modified space-time covariance matrix (the inverse matrix of the weighted modified space-time covariance matrix of the second time interval), the estimated steering vector νf, t (the estimated steering vector of the first time interval), the frequency-divided observation signal xf, t, and the power or estimated power σf, t 2 of the target signal. For example, the adaptive gain estimation unit 911 acquires and outputs the adaptive gain kf, t as an (LM+M−1)-dimensional vector in accordance with equation (41) shown below.
k_{f,t} = \frac{\tilde{R}_{f,t-1}^{-1}\, \bar{\bar{x}}_{f,t}}{\alpha\, \sigma_{f,t}^2 + \bar{\bar{x}}_{f,t}^H\, \tilde{R}_{f,t-1}^{-1}\, \bar{\bar{x}}_{f,t}} \quad (41)
Here, α is a forgetting factor, and is a real number belonging to a range of 0<α<1, for example. Further, an initial matrix of the inverse matrix R˜−1 f, t-1 of the weighted modified space-time covariance matrix may be any (LM+M−1)×(LM+M−1)-dimensional matrix. An example of the initial matrix of the inverse matrix R˜−1 f, t-1 of the weighted modified space-time covariance matrix is an (LM+M−1)-dimensional unit matrix. Here,
\bar{\bar{x}}_{f,t} = [(B_{f,t}^H\, x_{f,t})^T, \bar{\bar{x}}_{f,t-d}^{(1)T}, \bar{\bar{x}}_{f,t-d}^{(2)T}, \ldots, \bar{\bar{x}}_{f,t-d}^{(M)T}]^T
\bar{\bar{x}}_{f,t-d}^{(m)} = [x_{f,t-d}^{(m)}, x_{f,t-d-1}^{(m)}, \ldots, x_{f,t-d-L+1}^{(m)}]^T
and
\tilde{R}_{f,t} = \sum_{\tau=0}^{t} \alpha^{t-\tau}\, \frac{\bar{\bar{x}}_{f,\tau}\, \bar{\bar{x}}_{f,\tau}^H}{\sigma_{f,\tau}^2}
Note that R˜ f, t itself is not calculated. The output adaptive gain kf, t is transmitted to the matrix estimation unit 915 and the convolutional beamformer estimation unit 912.
<Processing of Matrix Estimation Unit 915 (Step S915)>
The vector x= f, t output from the block unit 814 and the adaptive gain kf, t output from the adaptive gain estimation unit 911 are input into the matrix estimation unit 915. Using these, the matrix estimation unit 915 acquires and outputs an inverse matrix R˜−1 f, t of the weighted modified space-time covariance matrix (the inverse matrix of the weighted modified space-time covariance matrix of the first time interval) that is based on the adaptive gain kf, t (the adaptive gain of the first time interval), the estimated steering vector νf, t (the estimated steering vector of the first time interval), the frequency-divided observation signal xf, t, and the inverse matrix R˜−1 f, t-1 of the weighted modified space-time covariance matrix (the inverse matrix of the weighted modified space-time covariance matrix of the second time interval). For example, the matrix estimation unit 915 acquires and outputs the inverse matrix R˜−1 f, t of the weighted modified space-time covariance matrix in accordance with equation (42) below.
\tilde{R}_{f,t}^{-1} = \frac{1}{\alpha} \left( \tilde{R}_{f,t-1}^{-1} - k_{f,t}\, \bar{\bar{x}}_{f,t}^H\, \tilde{R}_{f,t-1}^{-1} \right) \quad (42)
The output inverse matrix R˜−1 f, t of the weighted modified space-time covariance matrix is transmitted to the adaptive gain estimation unit 911.
<Processing of Convolutional Beamformer Estimation Unit 912 (Step S912)>
The estimation signal of the target signal yf, t output from the suppression unit 92 and the adaptive gain kf, t output from the adaptive gain estimation unit 911 are input into the convolutional beamformer estimation unit 912. Using these, the convolutional beamformer estimation unit 912 acquires and outputs the convolutional beamformer w= f, t (the convolutional beamformer of the first time interval), which is based on the adaptive gain kf, t (the adaptive gain of the first time interval), the estimation signal of the target signal yf, t (the target signal of the first time interval), and the convolutional beamformer w= f, t-1 (the convolutional beamformer of the second time interval). For example, the convolutional beamformer estimation unit 912 acquires and outputs the convolutional beamformer w= f, t in accordance with equation (43) shown below.
\bar{\bar{w}}_{f,t} = \bar{\bar{w}}_{f,t-1} - k_{f,t}\, y_{f,t}^H \quad (43)
The output convolutional beamformer w= f, t is transmitted to the suppression unit 92.
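Combining steps S813, S814, S92, S911, S915, and S912, one per-frame recursion for a single frequency band can be sketched as follows, again reusing block_matrix_H and blocked_vector from the eighth embodiment's sketch; the history-buffer layout and all names are assumptions.

```python
import numpy as np

def wpd_online_step(Rinv, w, Bh, v, X_hist, sigma2, g=1.0, alpha=0.9999, d=4, L=10):
    """One per-frame recursion of the ninth embodiment (equations (40)-(43)).

    Rinv   : (LM+M-1, LM+M-1) inverse weighted modified covariance R~^-1_{f,t-1}
    w      : (LM+M-1,) convolutional beamformer of the previous frame
    X_hist : (M, >= d+L) most recent frames, the newest in the last column
    sigma2 : scalar target power estimate for the current frame
    """
    t = X_hist.shape[1] - 1
    xb = blocked_vector(Bh, X_hist, t, d, L)                # eqs. (32), (35)-(36)
    z = g * (v.conj() @ X_hist[:, -1])                      # step S813, eq. (34)
    y = z + w.conj() @ xb                                   # step S92, eq. (40)
    Px = Rinv @ xb
    k = Px / (alpha * sigma2 + xb.conj() @ Px)              # step S911, eq. (41)
    Rinv = (Rinv - np.outer(k, xb.conj() @ Rinv)) / alpha   # step S915, eq. (42)
    w = w - k * np.conj(y)                                  # step S912, eq. (43)
    return y, w, Rinv
```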
Modified Example 1 of Ninth Embodiment
In the ninth embodiment and the modified example thereof, a case in which the first time interval is the frame having the time frame number t and the second time interval is the frame having the time frame number t−1 was used as an example, but the present invention is not limited thereto. A frame having a time frame number other than the time frame number t may be set as the first time interval, and a time frame that is further in the past than the first time interval and has a time frame number other than the time frame number t−1 may be set as the second time interval.
Modified Example 2 of Ninth Embodiment
A known steering vector νf, t may be input into the initial beamformer application unit 813 and the block unit 814 instead of the estimated steering vector νf, t acquired by the parameter estimation unit 93. In this case, the initial beamformer application unit 813 and the block unit 814 perform steps S813 and S814, described above, using the steering vector νf, t instead of the estimated steering vector νf, t.
Tenth Embodiment
The frequency-divided observation signals xf, t input into the signal processing devices 1 to 9 described above may be any signals that correspond respectively to a plurality of frequency bands of an observation signal acquired by picking up an acoustic signal emitted from a sound source. For example, as illustrated in FIGS. 11A and 11C, a time-domain observation signal x(i)=[x(i)(1), x(i)(2), . . . , x(i)(M)]T (where i is an index expressing a discrete time) acquired by picking up an acoustic signal emitted from a sound source in M microphones may be input into a dividing unit 1051, and the dividing unit 1051 may transform the observation signal x(i) into frequency-divided observation signals xf, t in the frequency domain and input the frequency-divided observation signals xf, t into the signal processing devices 1 to 9. There are no limitations on the transformation method from the time domain to the frequency domain, and the discrete Fourier transform or the like, for example, may be used. Alternatively, as illustrated in FIG. 11B, frequency-divided observation signals xf, t acquired by another processing unit, not illustrated in the figures, may be input into the signal processing devices 1 to 9. For example, the time-domain observation signal x(i) described above may be transformed into frequency-domain signals in each time frame, the frequency-domain signals may be processed by another processing unit, and the frequency-divided observation signals xf, t acquired as a result may be input into the signal processing devices 1 to 9.
The estimation signal of the target signals yf, t output from the signal processing devices 1 to 9 may either be used in other processing (speech recognition processing or the like) without being transformed into time-domain signals y(i) or be transformed into a time-domain signal y(i). For example, as illustrated in FIG. 11C, the estimation signal of the target signals yf, t output from the signal processing devices 1 to 9 may be output as is and used in other processing. Alternatively, as illustrated in FIGS. 11A and 11B, the estimation signal of the target signals yf, t output from the signal processing devices 1 to 9 may be input into an integration unit 1052, and the integration unit 1052 may acquire and output a time-domain signal y(i) by integrating the estimation signal of the target signals yf, t. There are no limitations on the method for acquiring the time-domain signal y(i) from the estimation signal of the target signals yf, t, and the inverse Fourier transform or the like, for example, may be used.
Test results relating to the methods of the respective embodiments will be illustrated below.
Test Results 1 (First Embodiment)
Next, noise/reverberation suppression results acquired by the first embodiment and conventional methods 1 to 3 will be illustrated.
In this test, a data set of the “REVERB Challenge” was used as the observation signal. Acoustic data (Real Data) acquired by picking up English-language speech read aloud in a room with stationary noise and reverberation using microphones disposed in positions away (0.5 to 2.5 m) from the speaker, and acoustic data (Sim Data) acquired by simulating this environment are recorded in the data set. The number of microphones M=8. The frequency-divided observation signals were determined by the short-time Fourier transform. The frame length was set at 32 milliseconds, the frame shift was set at 4, and the prediction delay was set at d=4. Using these data, the speech quality and speech recognition precision of signals subjected to noise/reverberation suppression in accordance with the present invention and conventional methods 1 to 3 were evaluated.
FIG. 12 shows evaluation results acquired in relation to the speech quality of the observation signal and the signals subjected to noise/reverberation suppression in accordance with the present invention and conventional methods 1 to 3. "Sim" denotes the Sim Data, and "Real" denotes the Real Data. "CD" denotes cepstrum distortion, "SRMR" denotes the signal-to-reverberation modulation ratio, "LLR" denotes the log-likelihood ratio, and "FWSSNR" denotes the frequency-weighted segmental signal-to-noise ratio. CD and LLR indicate better speech quality as the values thereof decrease, while SRMR and FWSSNR indicate better speech quality as the values thereof increase. The underlined values are optimal values. As illustrated in FIG. 12, it is evident that according to the present invention, noise and reverberation can be suppressed more adequately than with conventional methods 1 to 3.
FIG. 13 shows a word error rate in the speech recognition results acquired in relation to the observation signal and the signals subjected to noise/reverberation suppression in accordance with the present invention and conventional methods 1 to 3. The word error rate indicates better speech recognition precision as the value thereof decreases. The underlined values are optimal values. "R1N" denotes a case in which the speaker is positioned close to the microphones in room 1, while "R1F" denotes a case in which the speaker is positioned far away from the microphones in room 1. Similarly, "R2N" and "R3N" respectively denote cases in which the speaker is positioned close to the microphones in rooms 2 and 3, while "R2F" and "R3F" respectively denote cases in which the speaker is positioned far away from the microphones in rooms 2 and 3. "Ave" denotes an average value. As illustrated in FIG. 13, it is evident that according to the present invention, higher speech recognition precision is achieved than with conventional methods 1 to 3.
Test Results 2 (Fourth Embodiment)
FIG. 14 shows noise/reverberation suppression results acquired in a case where the steering vector was estimated without suppressing the reverberation of the frequency-divided observation signals x_{f,t} (without reverberation suppression) and a case where the steering vector was estimated after suppressing the reverberation of the frequency-divided observation signals x_{f,t} (with reverberation suppression), as described in the fourth embodiment. Note that "WER" expresses the word error rate when speech recognition was performed using the target signal acquired by implementing noise/reverberation suppression. As the value of WER decreases, a better performance is achieved. As illustrated in FIG. 14, it is evident that the quality of the target signal is better with reverberation suppression than without reverberation suppression.
Test Results 3 (Seventh and Ninth Embodiments)
FIGS. 15A, 15B, and 15C show noise/reverberation suppression results acquired in a case where convolutional beamformer estimation was executed by successive processing, as described in the seventh and ninth embodiments. In FIGS. 15A, 15B, and 15C, L=64 [msec], α=0.9999, and β=0.66. "Adaptive NCM" indicates results acquired when the estimated steering vector ν_{f,t} generated by the method of the fifth embodiment was used, while "PreFixed NCM" indicates results acquired when the estimated steering vector ν_{f,t} generated by the method of modified example 1 of the fifth embodiment was used. "Observation signal" indicates results acquired when no noise/reverberation suppression was implemented. From these results, it is evident that the speech quality of the target signal is improved by the noise/reverberation suppression of the seventh and ninth embodiments.
Other Modified Examples and so on
Note that the present invention is not limited to the embodiments described above. For example, in the above embodiments, d is set at the same value in all of the frequency bands, but d may be set for each frequency band. In other words, a positive integer df may be used instead of d. Similarly, in the above embodiments, L is set at the same value in all of the frequency bands, but L may be set for each frequency band. In other words, a positive integer Lf may be used instead of L.
In the first to third embodiments, examples were described in which batch processing is performed by determining the cost functions and so on (equations (2), (7), (12), (13), (14), and (18)) using a time frame corresponding to 1≤t≤N as the processing unit, but the present invention is not limited thereto. For example, rather than using a time frame corresponding to 1≤t≤N as the processing unit, the processing may be executed using a partial time frame thereof as the processing unit. Alternatively, the time frame used as the processing unit may be updated in real time, and the processing may be executed by determining the cost functions and so on in a processing unit at each time point. For example, when the number of the current time frame is expressed as tc, a time frame corresponding to 1≤t≤tc may be set as the processing unit, or a time frame corresponding to tc−η≤t≤tc may be set as the processing unit in relation to a positive integer constant η, as in the sketch below.
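As a small illustration of these alternative processing units, the following hedged Python sketch selects the frame indices over which the cost function would be formed at the current frame tc; the helper name and default behavior are assumptions for the example.

```python
# A minimal sketch of the alternative processing units described above.
# tc is the current frame number (1-based, as in the text) and eta is the
# positive integer constant from the text; both names are illustrative.
def processing_unit_frames(tc, eta=None):
    """Return the time frames 1 <= t <= tc (batch up to the current frame)
    or tc - eta <= t <= tc (sliding window) used to form the cost function."""
    if eta is None:
        return range(1, tc + 1)
    return range(max(1, tc - eta), tc + 1)
```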
The various types of processing described above do not have to be executed in time series, as described above, and may be executed in parallel or individually either in accordance with the processing power of the device that executes the processing or in accordance with necessity. Furthermore, the processing may be modified appropriately within a scope that does not depart from the spirit of the present invention.
The devices described above are configured by, for example, having a general-purpose or dedicated computer including a processor (a hardware processor) such as a CPU (central processing unit) and a memory such as a RAM (random-access memory)/ROM (read-only memory) execute a predetermined program. The computer may include one processor and one memory, or pluralities of processors and memories. The program may be either installed in the computer or recorded in the ROM or the like in advance. Further, instead of electronic circuitry, such as a CPU, that realizes a functional configuration by reading a program, some or all of the processing units may be configured using electronic circuitry that realizes processing functions without the use of a program. Electronic circuitry constituting a single device may include a plurality of CPUs.
When the configurations described above are realized by a computer, the processing content of the functions to be included in the devices is described by the program. The computer realizes the processing functions described above by executing the program. The program describing the processing content may be recorded in advance on a computer-readable recording medium. An example of a computer-readable recording medium is a non-transitory recording medium. Examples of this type of recording medium include a magnetic recording device, an optical disk, a magneto-optical recording medium, a semiconductor memory, and so on.
The program is distributed by, for example, selling, transferring, renting, etc. a portable recording medium such as a DVD or a CD-ROM on which the program is recorded. Further, the program may be stored in a storage device of a server computer and distributed by being transferred from the server computer to another computer over a network.
For example, the computer that executes the program first stores the program recorded on the portable recording medium or transferred from the server computer temporarily in a storage device included therein. During execution of the processing, the computer reads the program stored in the storage device included therein and executes processing corresponding to the read program. As a different form of execution of the program, the computer may read the program directly from the portable recording medium and execute processing corresponding to the program. Alternatively, every time the program is transferred to the computer from the server computer, the computer may execute processing corresponding to the received program. Instead of transferring the program from the server computer to the computer, the processing described above may be executed by a so-called ASP (Application Service Provider) type service, in which processing functions are realized only by issuing commands to execute the processing and acquiring results.
Instead of realizing the processing functions of the present device by executing a predetermined program on a computer, at least some of the processing functions may be realized by hardware.
INDUSTRIAL APPLICABILITY
The present invention can be used in various applications in which it is necessary to suppress noise and reverberation from an acoustic signal. For example, the present invention can be used in speech recognition, call systems, conference call systems, and so on.
REFERENCE SIGNS LIST
    • 1-9 Signal processing device
    • 11, 21, 71, 81, 91 Estimation unit
    • 12, 22 Suppression unit

Claims (17)

The invention claimed is:
1. A signal processing device comprising processing circuitry, the processing circuitry comprising:
an estimation unit that estimates a convolutional beamformer, wherein the convolutional beamformer is used for calculating, at each time point, a weighted sum of a current signal and a past signal of a sequence of past signals with a predetermined delay and a time duration of the sequence of past signals of zero length or more, and the estimating the convolutional beamformer further comprises:
receiving frequency-divided observation signals obtained from acoustic signals emitted from a target sound source; and
determining, at each time point of the sequence of time points, weights of the weighted sum as the convolutional beamformer, wherein the weighted sum causes estimation signals of target signals to increase a probability of speech-likeness of the estimation signals based on a predetermined probability model; and
the probability expressing the speech-likeness is according to a signal distribution of speech in the estimation signals of the target signals, and an average of the estimation signals is 0 and a variance of the estimation signals varies over time.
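For illustration only (and not as a limitation of the claim), the following minimal Python sketch applies an already-estimated convolutional beamformer of the form recited above in a single frequency band: the current frame is weighted, the L past frames reached after the predetermined delay d are weighted, and the results are summed; all array shapes and names are assumptions. With L = 0 the past term vanishes and the beamformer reduces to an instantaneous one.

```python
# A minimal sketch (not the claimed implementation itself) of applying an
# already-estimated convolutional beamformer in one frequency band f.
import numpy as np

def apply_convolutional_beamformer(X, w0, W_past, d):
    """X: (T, M) frequency-divided observation signals x_{f,t} for one band.
    w0: (M,) weight for the current frame.
    W_past: (L, M) weights for frames t-d, ..., t-(d+L-1); L may be zero.
    Returns y: (T,) estimation signals of the target signals."""
    T, M = X.shape
    L = W_past.shape[0]
    y = X @ w0.conj()  # instantaneous part: w0^H x_{f,t}
    for k in range(L):
        tau = d + k
        # convolutional part: w_tau^H x_{f,t-tau}, zero-padded at the start
        y[tau:] += X[:T - tau] @ W_past[k].conj()
    return y
```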
2. The signal processing device according to claim 1, wherein
the estimation unit acquires the convolutional beamformer which maximizes the probability expressing the speech-likeness of the estimation signals based on the probability model.
3. The signal processing device according to claim 1, wherein
the observation signals are signals acquired by picking up the acoustic signals emitted from the sound source in an environment in which noise and reverberation exist.
4. The signal processing device according to claim 1, wherein
the convolutional beamformer is a beamformer for calculating a weighted value of a current signal at each time point.
5. A non-transitory computer-readable recording medium storing a program for causing a computer to function as the signal processing device according to claim 1.
6. A signal processing device comprising processing circuitry, the processing circuitry comprising:
an estimation unit that estimates a convolutional beamformer, wherein the convolutional beamformer is used for calculating, at each time point, a weighted sum of a current signal and a past signal of a sequence of past signals with a predetermined delay and a time duration of the sequence of past signals of zero length or more, and the estimating the convolutional beamformer further comprises:
receiving frequency-divided observation signals obtained from acoustic signals emitted from a target sound source; and
determining, at each time point of the sequence of time points, weights of the weighted sum as the convolutional beamformer, wherein the weighted sum causes estimation signals of target signals to increase a probability of speech-likeness of the estimation signals based on a predetermined probability model; and
a suppression unit that suppresses noise and reverberation associated with the frequency-divided observation signals to generate the estimation signals of the target signals by using the convolutional beamformer upon the frequency-divided observation signals, wherein
the estimation unit acquires the convolutional beamformer which minimizes a sum of values acquired by weighting power of the estimation signals at respective time points belonging to a predetermined time interval by reciprocals of the power of the target signals or reciprocals of an estimated power of the target signals, under a constraint condition in which the target signals are not distorted as a result of applying the convolutional beamformer to the frequency-divided observation signals where the target signals are signals that correspond to a direct sound and an initial reflected sound within signals corresponding to a sound emitted from the target sound source and picked up by a microphone.
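One closed-form construction consistent with this minimization, sketched here under assumed notation, stacks the current frame with the L delayed past frames, forms a space-time covariance matrix weighted by the reciprocals of the (estimated) target power, and solves for the weights under the distortionless constraint; this follows the weighted power minimization style of solution, and every name and the variance floor below are illustrative.

```python
# A hedged sketch of one closed-form solution consistent with claim 6:
# minimize sum_t |w^H x_bar_t|^2 / sigma_t^2 subject to w^H v_bar = 1,
# where x_bar_t stacks the current frame with L delayed past frames and
# v_bar pads the steering vector v with zeros.
import numpy as np

def weighted_power_min_beamformer(X, v, sigma2, d, L):
    """X: (T, M) one-band observations, v: (M,) steering vector,
    sigma2: (T,) (estimated) target power, d: delay, L: past duration."""
    T, M = X.shape
    dim = M * (L + 1)
    R = np.zeros((dim, dim), dtype=complex)
    for t in range(d + L, T):
        past = [X[t - d - k] for k in range(L)]   # frames t-d, ..., t-(d+L-1)
        x_bar = np.concatenate([X[t]] + past)      # stacked space-time vector
        R += np.outer(x_bar, x_bar.conj()) / max(sigma2[t], 1e-10)
    v_bar = np.concatenate([v, np.zeros(M * L)])   # constraint on current frame only
    Rinv_v = np.linalg.solve(R, v_bar)
    return Rinv_v / (v_bar.conj() @ Rinv_v)        # w = R^{-1} v_bar / (v_bar^H R^{-1} v_bar)
```

With L = 0 this sketch reduces to an instantaneous minimum-power distortionless beamformer.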
7. The signal processing device according to claim 6, wherein
the convolutional beamformer is equivalent to a beamformer acquired by integrating a reverberation suppression filter for suppressing reverberation from the frequency-divided observation signals and an instantaneous beamformer for suppressing noise from signals acquired by applying the reverberation suppression filter to the frequency-divided observation signals,
the instantaneous beamformer calculates a weighted sum of signals of a current time point at each time point, and
the constraint condition is a condition in which a value acquired by applying the instantaneous beamformer to a steering vector having, as an element, transfer functions relating to the direct sound and the initial reflected sound from the sound source to a pickup position of the acoustic signals, or to an estimated steering vector that is an estimated vector of the steering vector, is a constant.
8. The signal processing device according to claim 7, wherein
the estimation unit includes:
a matrix estimation unit that acquires a weighted space-time covariance matrix on the basis of the frequency-divided observation signals and the power or estimated power of the target signals; and
a convolutional beamformer estimation unit that acquires the convolutional beamformer on the basis of the weighted space-time covariance matrix and the steering vector or estimated steering vector.
9. The signal processing device according to claim 7, further comprising processing circuitry configured to implement:
a reverberation suppression unit that acquires frequency-divided reverberation-suppressed signals in which a reverberation component has been suppressed from the frequency-divided observation signals; and
a steering vector estimation unit that acquires and outputs the estimated steering vector from the frequency-divided reverberation-suppressed signals.
10. The signal processing device according to claim 9, wherein
the frequency-divided reverberation-suppressed signals are time series signals, the signal processing device further comprises processing circuitry configured to implement:
an observation signal covariance matrix updating unit that acquires a spatial covariance matrix of the frequency-divided reverberation-suppressed signals belonging to a first time interval, the spatial covariance matrix being based on the frequency-divided reverberation-suppressed signals belonging to the first time interval and a spatial covariance matrix of the frequency-divided reverberation-suppressed signals belonging to a second time interval that is further in the past than the first time interval; and
a main component vector updating unit that acquires, on the basis of an inverse matrix of a noise covariance matrix of the frequency-divided reverberation-suppressed signals, a spatial covariance matrix of the frequency-divided reverberation-suppressed signals belonging to the first time interval, and a main component vector of the second time interval, a main component vector of the first time interval relative to a product of the inverse matrix of the noise covariance matrix of the frequency-divided reverberation-suppressed signals and the spatial covariance matrix of the frequency-divided reverberation-suppressed signals belonging to the first time interval, wherein
the steering vector estimation unit acquires and outputs the estimated steering vector of the first time interval on the basis of the noise covariance matrix of the frequency-divided reverberation-suppressed signal and the main component vector of the first time interval.
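A hedged reading of these recursive updates, with an assumed forgetting factor and normalization, is sketched below: the spatial covariance is updated frame by frame, one power-iteration step tracks the main component vector of the product of the inverse noise covariance matrix and the updated covariance, and the steering vector estimate is recovered through the noise covariance matrix.

```python
# A hedged sketch of the recursive updates in claim 10; alpha and the
# normalization choices are assumptions, not taken from the embodiments.
import numpy as np

def update_steering_vector(z_t, R_prev, u_prev, Rn_inv, Rn, alpha=0.999):
    """z_t: (M,) current frequency-divided reverberation-suppressed signal.
    R_prev: (M, M) spatial covariance of the second (past) time interval,
    u_prev: (M,) its main component vector. Returns (R_t, u_t, v_t)."""
    R_t = alpha * R_prev + np.outer(z_t, z_t.conj())  # covariance update
    u_t = Rn_inv @ (R_t @ u_prev)                     # one power-iteration step
    u_t = u_t / np.linalg.norm(u_t)                   # keep the iteration stable
    v_raw = Rn @ u_t                                  # map back through Rn
    v_t = v_raw / v_raw[0]                            # reference-mic normalization (assumed nonzero)
    return R_t, u_t, v_t
```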
11. The signal processing device according to claim 7, wherein
the frequency-divided observation signals are time series signals,
the signal processing device further comprises processing circuitry configured to implement:
an observation signal covariance matrix updating unit that acquires a spatial covariance matrix of the frequency-divided observation signals belonging to a first time interval, the spatial covariance matrix being based on the frequency-divided observation signals belonging to the first time interval and a spatial covariance matrix of the frequency-divided observation signals belonging to a second time interval that is further in the past than the first time interval;
a main component vector updating unit that acquires, on the basis of an inverse matrix of a noise covariance matrix of the frequency-divided observation signals, a spatial covariance matrix of the frequency-divided observation signals belonging to the first time interval, and a main component vector of the second time interval, a main component vector of the first time interval relative to a product of the inverse matrix of the noise covariance matrix of the frequency-divided observation signals and the spatial covariance matrix of the frequency-divided observation signals belonging to the first time interval; and
a steering vector estimation unit that acquires and outputs the estimated steering vector of the first time interval on the basis of the main component vector of the first time interval and the noise covariance matrix of the frequency-divided observation signals.
12. The signal processing device according to claim 10 or 11, wherein
the estimation unit includes:
a matrix estimation unit that estimates an inverse matrix of a space-time covariance matrix of the first time interval on the basis of the frequency-divided observation signals, the power or estimated power of the target signals, and an inverse matrix of a space-time covariance matrix of the second time interval that is further in the past than the first time interval; and
a convolutional beamformer estimation unit that acquires the convolutional beamformer of the first time interval on the basis of the inverse matrix of the space-time covariance matrix of the first time interval and the estimated steering vector.
13. The signal processing device according to claim 10 or 11, wherein
the instantaneous beamformer is equivalent to a sum of a constant multiple of the estimated steering vector and a product of a block matrix corresponding to an orthogonal complement of the estimated steering vector and a modified instantaneous beamformer, and
the estimation unit includes:
an initial beamformer application unit that acquires an initial beamformer output of the first time interval that is based on the estimated steering vector of the first time interval and the frequency-divided observation signals belonging to the first time interval;
the suppression unit that acquires the estimation signals of the target signals of the first time interval that is based on the initial beamformer output of the first time interval, the estimated steering vector of the first time interval and the frequency-divided observation signals, and the convolutional beamformer of the second time interval that is further in the past than the first time interval;
an adaptive gain estimation unit that acquires an adaptive gain of the first time interval that is based on an inverse matrix of the weighted modified space-time covariance matrix of the second time interval, and the estimated steering vector of the first time interval, the frequency-divided observation signals and the power or estimated power of the target signals;
a matrix estimation unit that acquires an inverse matrix of the weighted modified space-time covariance matrix of the first time interval that is based on the adaptive gain of the first time interval, the estimated steering vector of the first time interval and the frequency-divided observation signals, and the inverse matrix of the weighted modified space-time covariance matrix of the second time interval; and
the convolutional beamformer estimation unit that acquires the convolutional beamformer of the first time interval that is based on the adaptive gain of the first time interval, the estimation signals of the first time interval, and the convolutional beamformer of the second time interval.
14. The signal processing device according to claim 7, wherein
the estimation unit includes:
a matrix estimation unit that acquires a weighted modified space-time covariance matrix that is based on the steering vector or the estimated steering vector, the frequency-divided observation signals, and the power or estimated power of the target signals, where the weighted modified space-time covariance matrix is characterized in that when the instantaneous beamformer is represented by a sum of a constant multiple of the steering vector or a constant multiple of the estimated steering vector and a product of a block matrix corresponding to an orthogonal complement of the steering vector or the estimated steering vector and a modified instantaneous beamformer, the weighted modified space-time covariance matrix has signals acquired as a result of multiplying the block matrix by the frequency-divided observation signals of the first time interval as elements; and
a convolutional beamformer estimation unit that acquires the convolutional beamformer based on the steering vector or the estimated steering vector, the weighted modified space-time covariance matrix, and the frequency-divided observation signals.
15. A non-transitory computer-readable recording medium storing a program for causing a computer to function as the signal processing device according to claim 6.
16. A signal processing method comprising:
an estimation step of estimating a convolutional beamformer, wherein the convolutional beamformer is used for calculating, at each time point, a weighted sum of a current signal and a past signal of a sequence of past signals with a predetermined delay and a time duration of the sequence of past signals of zero length or more, and the estimating the convolutional beamformer further comprises:
receiving frequency-divided observation signals obtained from acoustic signals emitted from a target sound source;
calculating, at each time point of the sequence of time points, weights of the weighted sum as the convolutional beamformer, wherein the weighted sum causes the estimation signals of the target signals to increase a probability of speech-likeness of the estimation signals based on a predetermined probability model; and
a suppression step of suppressing noise and reverberation associated with the frequency-divided observation signals to generate the estimation signals of the target signals by using the convolutional beamformer upon the frequency-divided observation signals, wherein
the probability expressing the speech-likeness is according to a signal distribution of speech in the estimation signals of the target signals, and an average of the estimation signals is 0 and a variance of the estimation signals varies over time.
17. A signal processing method comprising:
an estimation step of estimating a convolutional beamformer, wherein the convolutional beamformer is used for calculating, at each time point, a weighted sum of a current signal and a past signal of a sequence of past signals with a predetermined delay and a time duration of the sequence of past signals of zero length or more, and the estimating the convolutional beamformer further comprises:
receiving frequency-divided observation signals obtained from acoustic signals emitted from a target sound source; and
determining, at each time point of the sequence of time points, weights of the weighted sum as the convolutional beamformer, wherein the weighted sum causes the estimation signals of the target signals to increase a probability of speech-likeness of the estimation signals based on a predetermined probability model; and
a suppression step of suppressing noise and reverberation associated with the frequency-divided observation signals to generate the estimation signals of the target signals by using the convolutional beamformer upon the frequency-divided observation signals, wherein
the estimation step acquires the convolutional beamformer which minimizes a sum of values acquired by weighting power of the estimation signals at respective time points belonging to a predetermined time interval by reciprocals of the power of the target signals or reciprocals of an estimated power of the target signals, under a constraint condition in which the target signals are not distorted as a result of applying the convolutional beamformer to the frequency-divided observation signals where the target signals are signals that correspond to a direct sound and an initial reflected sound within signals corresponding to a sound emitted from the target sound source and picked up by a microphone.
US17/312,912 2018-12-14 2019-07-31 Signal processing apparatus, signal processing method, and program Active 2040-05-14 US11894010B2 (en)

Applications Claiming Priority (6)

Application Number Priority Date Filing Date Title
JP2018-234075 2018-12-14
JP2018234075 2018-12-14
PCT/JP2019/016587 WO2020121545A1 (en) 2018-12-14 2019-04-18 Signal processing device, signal processing method, and program
WOPCT/JP2019/016587 2019-04-18
JPPCT/JP2019/016587 2019-04-18
PCT/JP2019/029921 WO2020121590A1 (en) 2018-12-14 2019-07-31 Signal processing device, signal processing method, and program

Publications (2)

Publication Number Publication Date
US20220068288A1 US20220068288A1 (en) 2022-03-03
US11894010B2 true US11894010B2 (en) 2024-02-06

Family

ID=71076328

Family Applications (1)

Application Number Title Priority Date Filing Date
US17/312,912 Active 2040-05-14 US11894010B2 (en) 2018-12-14 2019-07-31 Signal processing apparatus, signal processing method, and program

Country Status (3)

Country Link
US (1) US11894010B2 (en)
JP (1) JP7115562B2 (en)
WO (2) WO2020121545A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US12165668B2 (en) * 2022-02-18 2024-12-10 Microsoft Technology Licensing, Llc Method for neural beamforming, channel shortening and noise reduction

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111933170B (en) * 2020-07-20 2024-03-29 歌尔科技有限公司 Voice signal processing method, device, equipment and storage medium
JP7430127B2 (en) * 2020-09-02 2024-02-09 三菱重工業株式会社 Prediction device, prediction method, and program
US12348945B2 (en) * 2020-10-15 2025-07-01 Nippon Telegraph And Telephone Corporation Acoustic signal enhancement apparatus, method and program
JP7639382B2 (en) * 2021-02-12 2025-03-05 日本電信電話株式会社 Audio signal enhancement device, method and program
CN112802490B (en) * 2021-03-11 2023-08-18 北京声加科技有限公司 Beam forming method and device based on microphone array
US11798533B2 (en) * 2021-04-02 2023-10-24 Google Llc Context aware beamforming of audio data
WO2023276068A1 (en) * 2021-06-30 2023-01-05 日本電信電話株式会社 Acoustic signal enhancement device, acoustic signal enhancement method, and program
CN113707136B (en) * 2021-10-28 2021-12-31 南京南大电子智慧型服务机器人研究院有限公司 Audio and video mixed voice front-end processing method for voice interaction of service robot
CN115086836B (en) * 2022-06-14 2023-04-18 西北工业大学 Beam forming method, system and beam former
CN117292700A (en) * 2022-06-20 2023-12-26 青岛海尔科技有限公司 Voice enhancement method and device for distributed wakeup and storage medium
WO2024038522A1 (en) * 2022-08-17 2024-02-22 日本電信電話株式会社 Signal processing device, signal processing method, and program
CN118197341B (en) * 2024-04-15 2024-11-26 武汉理工大学 A beamforming method and device based on room environment adaptive calibration

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090110207A1 (en) * 2006-05-01 2009-04-30 Nippon Telegraph And Telephone Company Method and Apparatus for Speech Dereverberation Based On Probabilistic Models Of Source And Room Acoustics
US20110002473A1 (en) 2008-03-03 2011-01-06 Nippon Telegraph And Telephone Corporation Dereverberation apparatus, dereverberation method, dereverberation program, and recording medium

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US3685380A (en) * 1971-02-19 1972-08-22 Amada Ltd Us Multi-track turret and overload protection
JP3484112B2 (en) * 1999-09-27 2004-01-06 株式会社東芝 Noise component suppression processing apparatus and noise component suppression processing method
JP2007093630A (en) * 2005-09-05 2007-04-12 Advanced Telecommunication Research Institute International Speech enhancement device
JP5139111B2 (en) * 2007-03-02 2013-02-06 本田技研工業株式会社 Method and apparatus for extracting sound from moving sound source
JP5075042B2 (en) * 2008-07-23 2012-11-14 日本電信電話株式会社 Echo canceling apparatus, echo canceling method, program thereof, and recording medium
EP2222091B1 (en) * 2009-02-23 2013-04-24 Nuance Communications, Inc. Method for determining a set of filter coefficients for an acoustic echo compensation means
US8666090B1 (en) * 2013-02-26 2014-03-04 Full Code Audio LLC Microphone modeling system and method
US10090000B1 (en) * 2017-11-01 2018-10-02 GM Global Technology Operations LLC Efficient echo cancellation using transfer function estimation

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090110207A1 (en) * 2006-05-01 2009-04-30 Nippon Telegraph And Telephone Company Method and Apparatus for Speech Dereverberation Based On Probabilistic Models Of Source And Room Acoustics
US20110002473A1 (en) 2008-03-03 2011-01-06 Nippon Telegraph And Telephone Corporation Dereverberation apparatus, dereverberation method, dereverberation program, and recording medium
JP5227393B2 (en) 2008-03-03 2013-07-03 日本電信電話株式会社 Reverberation apparatus, dereverberation method, dereverberation program, and recording medium

Non-Patent Citations (9)

* Cited by examiner, † Cited by third party
Title
Farrier et al, "Fast beamforming techniques for circular arrays", J. Acoustic Soc. Am., vol. 58, No. 4, pp. 920-922, October (Year: 1975). *
Heymann et al. (2016) "Neural network based spectral mask estimation for acoustic beamforming," Proc. ICASSP 2016.
Higuchi et al. (2016) "Robust MVDR beamforming using time-frequency masks for online/offline ASR in noise", Proc. ICASSP 2016.
Liu, et al., "Neural Network Based Time-Frequency Masking and Steering Vector Estimation for Two-Channel Mvdr Beamforming", hereinafter Liu, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) ), pp. 6717-6721, Apr. 15-20 (Year: 2018). *
Nakashika et al."Dysarthric Speech Recognition Using a Convolutive Bottleneck Network", ICSP2014 Proceedings, pp. 505-509 (Year: 2014). *
Nakatani et al. (2010) "Speech dereverberation based on variance-normalized delayed linear prediction," IEEE Trans. ASLP, 18 (7), 1717-1731.
Nakatani et al. (2018) "A unified convolutional beamformer for simultaneous denoising and dereverberation" published at https://arxiv.org/abs/1812.08400, on Dec. 20, 2018.
Yoshioka et al. (2015) "The NTT CHIME-3 system: Advances in speech enhancement and recognition for mobile multi-microphone devices," Proc. IEEE ASRU 2015, 436-443.
Zhang et al "Microphone Subset Selection for MVDR Beamformer Based Noise Reduction", IEEE/ACM Trans. on Acoustics, Speech and Language Processing, vol. ** , No .** , pp. 1-13, May 16 (Year: 2017). *


Also Published As

Publication number Publication date
WO2020121545A1 (en) 2020-06-18
WO2020121590A1 (en) 2020-06-18
JP7115562B2 (en) 2022-08-09
JPWO2020121590A1 (en) 2021-10-14
US20220068288A1 (en) 2022-03-03

Similar Documents

Publication Publication Date Title
US11894010B2 (en) Signal processing apparatus, signal processing method, and program
CN110100457B (en) On-Line Dereverberation Algorithm Based on Weighted Prediction Errors in Noise Time-varying Environment
US8848933B2 (en) Signal enhancement device, method thereof, program, and recording medium
US10123113B2 (en) Selective audio source enhancement
JP6169849B2 (en) Sound processor
CN112447191A (en) Signal processing device and signal processing method
US8693287B2 (en) Sound direction estimation apparatus and sound direction estimation method
JP6169910B2 (en) Audio processing device
US10818302B2 (en) Audio source separation
JP6106611B2 (en) Model estimation device, noise suppression device, speech enhancement device, method and program thereof
CN110998723B (en) Signal processing device using neural network, signal processing method, and recording medium
CN106031196B (en) Signal processing device, method and program
KR102410850B1 (en) Method and apparatus for extracting reverberant environment embedding using dereverberation autoencoder
US9875748B2 (en) Audio signal noise attenuation
Nesta et al. A flexible spatial blind source extraction framework for robust speech recognition in noisy environments
US11676619B2 (en) Noise spatial covariance matrix estimation apparatus, noise spatial covariance matrix estimation method, and program
Liu et al. A Hybrid Reverberation Model and Its Application to Joint Speech Dereverberation and Separation
Cauchi et al. Spectrally and spatially informed noise suppression using beamforming and convolutive NMF
CN119694333B (en) Directional pickup method, system, equipment and storage medium
US20240312446A1 (en) Acoustic signal enhancement device, acoustic signal enhancement method, and program
Giri et al. A novel target speaker dependent postfiltering approach for multichannel speech enhancement
Kim et al. Online speech dereverberation using RLS-WPE based on a full spatial correlation matrix integrated in a speech enhancement system
Pu Speech Dereverberation Based on Multi-Channel Linear Prediction
Kang et al. Reverberation and noise robust feature enhancement using multiple inputs
Kouhi-Jelehkaran et al. Phone-based filter parameter optimization of filter and sum robust speech recognition using likelihood maximization

Legal Events

Date Code Title Description
AS Assignment

Owner name: NIPPON TELEGRAPH AND TELEPHONE CORPORATION, JAPAN

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:NAKATANI, TOMOHIRO;KINOSHITA, KEISUKE;SIGNING DATES FROM 20201214 TO 20201215;REEL/FRAME:056506/0251

FEPP Fee payment procedure

Free format text: ENTITY STATUS SET TO UNDISCOUNTED (ORIGINAL EVENT CODE: BIG.); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY

STPP Information on status: patent application and granting procedure in general

Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION

STPP Information on status: patent application and granting procedure in general

Free format text: NON FINAL ACTION MAILED

STPP Information on status: patent application and granting procedure in general

Free format text: RESPONSE TO NON-FINAL OFFICE ACTION ENTERED AND FORWARDED TO EXAMINER

STPP Information on status: patent application and granting procedure in general

Free format text: NOTICE OF ALLOWANCE MAILED -- APPLICATION RECEIVED IN OFFICE OF PUBLICATIONS

STPP Information on status: patent application and granting procedure in general

Free format text: PUBLICATIONS -- ISSUE FEE PAYMENT RECEIVED

STPP Information on status: patent application and granting procedure in general

Free format text: PUBLICATIONS -- ISSUE FEE PAYMENT VERIFIED

STCF Information on status: patent grant

Free format text: PATENTED CASE