WO2021205494A1 - Signal processing device, signal processing method, and program - Google Patents

Signal processing device, signal processing method, and program

Info

Publication number
WO2021205494A1
WO2021205494A1 (PCT/JP2020/015456)
Authority
WO
WIPO (PCT)
Prior art keywords
beamformer
estimation unit
signal
convolution
reverberation suppression
Prior art date
Application number
PCT/JP2020/015456
Other languages
French (fr)
Japanese (ja)
Inventor
Tomohiro Nakatani (中谷 智広)
Keisuke Kinoshita (木下 慶介)
Rintaro Ikeshita (池下 林太郎)
Marc Delcroix (マーク デルクロア)
Original Assignee
Nippon Telegraph and Telephone Corporation (日本電信電話株式会社)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nippon Telegraph and Telephone Corporation
Priority to JP2022513704A priority Critical patent/JP7444243B2/en
Priority to PCT/JP2020/015456 priority patent/WO2021205494A1/en
Publication of WO2021205494A1 publication Critical patent/WO2021205494A1/en

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0272 Voice signal separating

Definitions

  • The present invention relates to a technique for extracting a target sound from an acoustic signal by suppressing sounds other than the target sound, other noise, and reverberation.
  • Non-Patent Document 1 discloses a method of suppressing noise and reverberation from an acoustic signal in the frequency domain.
  • In this method, the acoustic signal is first received, a reverberation suppression filter that suppresses the reverberation of the target sound is estimated based on a weighted prediction-error power minimization criterion, and the filter is applied to the acoustic signal to remove the reverberation.
  • After that, a steering vector representing the direction of the target sound (or its estimate) is received, and a minimum-power distortionless response (MPDR) beamformer that minimizes the power of the acoustic signal, under the constraint that the sound arriving at the microphones from the source position of the target sound is not distorted, is estimated and applied to the dereverberated acoustic signal to further suppress noise.
  • Non-Patent Document 2 discloses a method that, under the assumption that the acoustic signal contains no diffuse noise and consists only of a plurality of point sound sources, simultaneously achieves optimal reverberation suppression and sound source separation by alternately updating the coefficients of a reverberation suppression block and a sound source separation block so as to minimize a given objective function.
  • However, in the method of Non-Patent Document 1, the reverberation suppression filter and the minimum-power distortionless response beamformer are optimized independently, so optimal processing cannot be achieved as a whole. In addition, the method of Non-Patent Document 2 cannot suppress diffuse noise from the acoustic signal.
  • The present invention has been made in view of these points, and its object is to perform reverberation suppression, diffuse noise suppression, and target sound source separation that are optimized as a whole.
  • In the present invention, frequency-divided time-series acoustic signals and auxiliary information representing information on the target sound are received, a convolutional beamformer that performs reverberation suppression, diffuse noise suppression, and target sound source separation is estimated based on the optimization criterion that the signal obtained by applying the convolutional beamformer to the acoustic signal follows a probabilistic model, and the estimated convolutional beamformer is applied to the acoustic signal to obtain and output a processed signal.
  • By using the auxiliary information representing the information of the target sound, the convolutional beamformer that performs reverberation suppression, diffuse noise suppression, and target sound source separation can be optimized as a whole. Therefore, reverberation suppression, diffuse noise suppression, and target sound source separation that are optimized as a whole can be performed.
  • FIG. 1 is a block diagram illustrating a functional configuration of the signal processing device of the embodiment.
  • FIG. 2 is a block diagram illustrating the functional configuration of the signal processing device of the second embodiment and its modified example.
  • FIG. 3 is a flow chart for explaining a signal processing method of the second embodiment and a modified example thereof.
  • FIG. 4 is a flow chart for explaining a signal processing method of a modified example of the second embodiment.
  • FIG. 5 is a flow chart for explaining a signal processing method of the second embodiment and a modified example thereof.
  • FIG. 6 is a block diagram for exemplifying the estimation process of RTF (Relative Transfer Function).
  • FIG. 7 is a flow chart for explaining a signal processing method of a modified example of the second embodiment.
  • FIG. 8 is a flow chart for explaining a signal processing method of a modified example of the second embodiment.
  • FIG. 9 is a block diagram illustrating the functional configuration of the signal processing device of the third embodiment and its modified example.
  • FIG. 10 is a flow chart for explaining a signal processing method of the third embodiment and a modified example thereof.
  • FIG. 11 is a flow chart for explaining a signal processing method of the third embodiment and a modified example thereof.
  • FIG. 12 is a flow chart for explaining a signal processing method of a modified example of the third embodiment.
  • FIG. 13 is a flow chart for explaining a signal processing method of a modified example of the third embodiment.
  • FIG. 14 is a block diagram illustrating a hardware configuration of the signal processing device of the embodiment.
  • FIG. 15 is a graph illustrating the word error rate when the processed signals obtained in the fourth embodiment and the modified examples 1 and 2 of the second embodiment are voice-recognized.
  • In the first embodiment, the signal processing device receives the frequency-divided time-series acoustic signal and the auxiliary information representing the information of the target sound, and performs reverberation suppression, diffuse noise suppression, and target sound source separation.
  • The convolutional beamformer is estimated based on the optimization criterion that the signal obtained by applying the beamformer to the acoustic signal follows the probabilistic model, and the estimated convolutional beamformer is applied to the acoustic signal to obtain and output the processed signal.
  • The signal processing device 1 of the present embodiment has a convolutional beamformer estimation unit 11, a convolutional beamformer application unit 12, and a control unit 13, and executes each process under the control of the control unit 13.
  • It is assumed that source signals emitted from I sound sources are observed by M microphones in an environment where reverberation and diffuse noise are present. Here, I and M are integers of 1 or more, satisfying the relationship M ≥ I.
  • The observed mixture is frequency-divided by a well-known method such as the short-time Fourier transform, yielding the acoustic signal x_{t,f} at each time-frequency point (the acoustic signal x_{t,f} in the time-frequency domain).
  • Such acoustic signals x_{t,f} are modeled as follows:
  x_{t,f} = Σ_{i=1}^{I} x_{t,f}^{(i)} + n_{t,f}   …(1)
  x_{t,f}^{(i)} = d_{t,f}^{(i)} + r_{t,f}^{(i)}   …(2)
  • t ∈ {1, ..., N} is a time index corresponding to a time interval (frame), and f ∈ {1, ..., F} is a frequency index corresponding to a frequency band (frequency bin).
  • N and F are positive integers; for example, N is an integer of 2 or more.
  • Hereinafter, the time interval corresponding to time index t is called "time interval t", and the frequency band corresponding to frequency index f is called "frequency band f".
  • i is the index of the sound source of each target sound, i ∈ {1, ..., I}, and the source signal emitted from sound source i is called "source signal i".
  • The acoustic signal x_{t,f} = [x_{1,t,f}, ..., x_{M,t,f}]^T ∈ C^{M×1} is an M-dimensional column vector whose elements are the signals x_{1,t,f}, ..., x_{M,t,f} in frequency band f obtained by frequency-dividing, in each time interval t, all the signals observed by the M microphones; C denotes the set of complex numbers and (·)^T denotes the non-conjugate transpose.
  • The microphone image signal x_{t,f}^{(i)} = [x_{1,t,f}^{(i)}, ..., x_{M,t,f}^{(i)}]^T ∈ C^{M×1} is an M-dimensional column vector whose elements are the components of x_{1,t,f}, ..., x_{M,t,f} consisting of the direct sound, early reflections, and late reverberation of source signal i; it contains no diffuse-noise component.
  • The diffuse noise n_{t,f} = [n_{1,t,f}, ..., n_{M,t,f}]^T ∈ C^{M×1} is an M-dimensional column vector whose elements are the components of x_{1,t,f}, ..., x_{M,t,f} corresponding to the diffuse noise.
  • The microphone image signal x_{t,f}^{(i)} of equation (1) is further divided into two components as in equation (2).
  • The target sound d_{t,f}^{(i)} represents the components of the microphone image signal x_{t,f}^{(i)} corresponding to the direct sound and the early reflections, and the late reverberation r_{t,f}^{(i)} represents the component of x_{t,f}^{(i)} corresponding to the late reverberation.
  • For symbols written in the form "χ_α^β", such as x_{t,f}^{(i)}, d_{t,f}^{(i)}, and r_{t,f}^{(i)}, the superscript β should properly be written directly above α (see equation (2)), but owing to notational restrictions it may be written at the upper right of α.
  • In the reverberation suppression, diffuse noise suppression, and target sound source separation of the present embodiment, the diffuse noise n_{t,f} and the late reverberation r_{t,f}^{(i)} corresponding to each sound source i are suppressed from the acoustic signal x_{t,f} of equation (1), and the target sound d_{t,f}^{(i)} corresponding to each sound source i is separated and extracted.
  • Frequency-divided time-series acoustic signals x_{t,f} for all t ∈ {1, ..., N} and f ∈ {1, ..., F} are input to the signal processing device 1.
  • The acoustic signal x_{t,f} exemplified in the present embodiment is obtained by frequency-dividing the signal obtained by observing the mixture of the diffuse noise and the microphone image signals based on the source signals emitted from the sound sources. Further, auxiliary information s representing information on the target sound is input to the signal processing device 1.
  • The auxiliary information s is information for specifying or estimating the RTF (Relative Transfer Function) ṽ_f^{(i)} (see, for example, Reference 1).
  • Reference 1: I. Cohen, "Relative transfer function identification using speech signals," IEEE Trans. on Speech and Audio Processing, vol. 12, no. 5, pp. 451-459, 2004.
  • Equation (3) shows an example of the RTF ṽ_f^{(i)}:
  ṽ_f^{(i)} = v_f^{(i)} / v_{1,f}^{(i)}   …(3)
  • That is, ṽ_f^{(i)} of equation (3) is obtained by normalizing each element of v_f^{(i)} with the element v_{1,f}^{(i)} taken as the reference.
  • For symbols written in the form "χ̃", such as ṽ, the tilde should properly be written directly above χ (see equation (3)), but owing to notational restrictions it may be written at the upper right of χ.
  • Examples of the information for specifying or estimating the RTF ṽ_f^{(i)} are a reference sound of the target sound, the time-frequency mask γ_{t,f}^{(i)} of the sound source i of the target sound, the steering vector v_f^{(i)}, and the RTF ṽ_f^{(i)} itself.
  • Each time-frequency mask γ_{t,f}^{(i)} represents a value corresponding to the existence probability, or presence/absence, of the source signal i in time interval t and frequency band f.
  • A method for estimating the time-frequency mask γ_{t,f}^{(i)} is described in, for example, Reference 2.
  • Reference 2: F. Bahmaninezhad, J. Wu, R. Gu, S.-X. Zhang, Y. Xu, M. Yu, and D.
  • A method of estimating the time-frequency mask from a reference sound of the target sound is described in, for example, Reference 3.
  • Reference 3: K. Zmolikova, M. Delcroix, K. Kinoshita, T. Ochiai, T. Nakatani, L. Burget, and J. Cernocky, "SpeakerBeam: Speaker aware neural network for target speaker extraction in speech mixtures," IEEE Journal of Selected Topics in Signal Processing, vol. 13, no. 4, pp. 800-814, 2019.
  • The auxiliary information s may further include information specifying the power of the target sound.
  • A method for estimating the power of the target sound is described in, for example, Reference 3B.
  • Reference 3B: Yong Xu, Jun Du, Li-Rong Dai, and Chin-Hui Lee, "A Regression Approach to Speech Enhancement Based on Deep Neural Networks," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 23, no. 1, pp. 7-19, 2015.
  • The acoustic signal x_{t,f} is input to the convolutional beamformer estimation unit 11 and the convolutional beamformer application unit 12, and the auxiliary information s is input to the convolutional beamformer estimation unit 11.
  • The convolutional beamformer estimation unit 11 receives the acoustic signal x_{t,f} and the auxiliary information s, and estimates the convolutional beamformer based on the optimization criterion that the signal y_{t,f}, obtained by applying to x_{t,f} a convolutional beamformer that performs reverberation suppression, diffuse noise suppression, and target sound source separation, follows the probabilistic model.
  • The convolutional beamformer is expressed, for example, as
  y_{t,f} = Σ_{τ ∈ {0, Δ, Δ+1, ..., L-1}} W_τ^H x_{t-τ,f}   …(4)
  where W_τ ∈ C^{M×I} for τ ∈ {0, Δ, Δ+1, ..., L-1} is an M × I matrix whose elements are beamformer coefficients, and (·)^H represents the conjugate transpose of (·).
  • Δ is a positive integer representing the number of time intervals (frames) corresponding to the length of the early reflections, and L is a positive integer representing the filter length. At least Δ ≥ 1; an example of Δ is a positive integer representing a time interval corresponding to 30 to 50 ms.
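  • As an illustration of the tap structure of equation (4), the following is a minimal numpy sketch that applies a convolutional beamformer with taps {0, Δ, Δ+1, ..., L-1} to one frequency bin; all function and variable names are illustrative assumptions, not taken from the patent.

```python
import numpy as np

def apply_convolutional_beamformer(X, W):
    """Apply a convolutional beamformer (cf. equation (4)) at one bin.

    X: (N, M) complex array, observed signal x_{t,f} for t = 0..N-1.
    W: dict mapping each tap tau in {0, delta, delta+1, ..., L-1} to an
       (M, I) matrix W_tau of beamformer coefficients; the gap over taps
       {1, ..., delta-1} preserves the early reflections of the targets.
    Returns Y: (N, I) complex array of output signals y_{t,f}.
    """
    N, M = X.shape
    I = next(iter(W.values())).shape[1]
    Y = np.zeros((N, I), dtype=complex)
    for t in range(N):
        for tau, W_tau in W.items():
            if t - tau >= 0:
                Y[t] += W_tau.conj().T @ X[t - tau]  # W_tau^H x_{t-tau,f}
    return Y

# Example with M = 4 microphones, I = 2 targets, delta = 2, L = 5
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 4)) + 1j * rng.standard_normal((100, 4))
W = {tau: rng.standard_normal((4, 2)) + 0j for tau in [0, 2, 3, 4]}
Y = apply_convolutional_beamformer(X, W)  # shape (100, 2)
```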
  • Equation (4) can be transformed as follows.
  • Equation (7) can be transformed into the following equations (12) and (13).
  • Equation (9) can be transformed into the following equation (9′).
  • I_M ∈ R^{M×M} represents the M × M identity matrix, R denotes the set of all real numbers, and ⊗ represents the Kronecker product.
  • Equation (10) can be transformed into the following equation (10′).
  • The convolutional beamformer estimation unit 11 estimates the convolutional beamformer based on the optimization criterion that y_{t,f} follows the probabilistic model.
  • As the probabilistic model, a model satisfying the following (a) and (b) can be exemplified.
  • E{·} represents the expectation function.
  • λ_{t,f}^{(i)} will be referred to as the power of y_{t,f}^{(i)}.
  • For each i ∈ {1, ..., I}, the convolutional beamformer is constrained not to distort the sound arriving at the microphones from sound source i. This constraint can be described, for example, by the following equation (15) or (16).
  • The power of the target sound specified by the auxiliary information s may be used as λ_{t,f}^{(i)}.
  • Otherwise, λ_{t,f}^{(i)} is obtained from the y_{t,f}^{(i)} computed by the convolutional beamformer application unit 12, as shown below:
  λ_{t,f}^{(i)} ← |y_{t,f}^{(i)}|²
  • Since the y_{t,f}^{(i)} obtained by the convolutional beamformer application unit 12 depend on the convolutional beamformer estimated by the convolutional beamformer estimation unit 11, in this case the processes of the convolutional beamformer estimation unit 11 and the convolutional beamformer application unit 12 must be repeated alternately until a predetermined convergence condition is satisfied.
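  • The alternation of steps S11 and S12 can be sketched as follows. This is a minimal Python sketch, assuming array-valued coefficients and a mask-based power initialization; `estimate_cbf` and `apply_cbf` are hypothetical stand-ins for the estimation and application steps, whose exact update formulas are given by the displayed equations of the embodiment.

```python
import numpy as np

def optimize_convolutional_beamformer(X, masks, estimate_cbf, apply_cbf,
                                      max_iter=10, tol=1e-4):
    """Alternate convolutional beamformer estimation (step S11) and
    application (step S12) until the coefficients stop changing.

    X:            (N, M, F) acoustic signal x_{t,f}.
    masks:        (N, I, F) time-frequency masks, used here only to set
                  the initial powers (an assumed initialization).
    estimate_cbf: callable(X, power) -> coefficient array (step S11).
    apply_cbf:    callable(X, cbf) -> (N, I, F) signals y_{t,f} (step S12).
    """
    # initial powers lambda_{t,f}^{(i)}: mask-weighted mean mic power
    power = masks * (np.abs(X) ** 2).mean(axis=1, keepdims=True)
    cbf_prev = None
    for _ in range(max_iter):
        cbf = estimate_cbf(X, power)   # step S11
        Y = apply_cbf(X, cbf)          # step S12
        power = np.abs(Y) ** 2         # lambda <- |y|^2
        if cbf_prev is not None and np.max(np.abs(cbf - cbf_prev)) < tol:
            break                      # coefficient change below threshold
        cbf_prev = cbf
    return Y, cbf
```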
  • <<Processing of the convolutional beamformer application unit 12 (step S12)>> The acoustic signal x_{t,f} and the information specifying the convolutional beamformer output from the convolutional beamformer estimation unit 11 are input to the convolutional beamformer application unit 12.
  • The convolutional beamformer application unit 12 applies the convolutional beamformer specified by this information (equations (4), (7), (9)(10), (9′)(10′), or (12)(13)) to the acoustic signal x_{t,f} to obtain and output the processed signal y_{t,f}.
  • When the auxiliary information s includes information specifying the power of the target sound, the signal processing device 1 outputs the processed signal y_{t,f} obtained in step S12. In this case, the iterative processing of steps S11 and S12 is unnecessary.
  • When the auxiliary information s does not include information specifying the power of the target sound, the process of step S11 and the process of step S12 are repeated alternately until a predetermined convergence condition is satisfied. Examples of the convergence condition are that the number of iterations reaches a predetermined number, or that the change in the convolutional beamformer coefficients between successive iterations is equal to or less than a predetermined amount.
  • When the convergence condition is satisfied, the signal processing device 1 outputs the processed signal y_{t,f} obtained in step S12.
  • The processed signal y_{t,f} is the result of applying reverberation suppression, diffuse noise suppression, and target sound source separation to the acoustic signal x_{t,f}.
  • The output processed signal y_{t,f} may be input to other arithmetic processing, or may be converted into a time-domain acoustic signal by a well-known method such as the inverse Fourier transform.
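  • For instance, if the frequency division was performed with a short-time Fourier transform, the processed signal can be returned to the time domain with the inverse STFT. A minimal scipy sketch, assuming illustrative parameter values that must match the analysis STFT:

```python
import numpy as np
from scipy.signal import istft

def to_time_domain(Y, fs=16000, nperseg=512, noverlap=384):
    """Convert one separated source y_{t,f} back to a waveform.
    Y: (F, N) complex array with F = nperseg // 2 + 1 frequency bins;
    the parameters must match those used for the frequency division."""
    _, y = istft(Y, fs=fs, nperseg=nperseg, noverlap=noverlap)
    return y
```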
  • In the present embodiment, the convolutional beamformer that performs reverberation suppression, diffuse noise suppression, and target sound source separation is estimated based on the optimization criterion that the signal obtained by applying it to the acoustic signal follows the probabilistic model, using the auxiliary information s that represents the information of the target sound. As a result, the convolutional beamformer can be optimized as a whole, and more effective speech enhancement can be realized.
  • In the second embodiment, the convolutional beamformer is divided into a reverberation suppression filter that suppresses reverberation and a beamformer that suppresses diffuse noise and separates the target sound sources.
  • That is, the convolutional beamformer of the present embodiment includes a reverberation suppression filter that suppresses reverberation and a beamformer that suppresses diffuse noise and separates the target sound sources.
  • However, the optimization of the reverberation suppression filter and that of the beamformer are not independent of each other; the convolutional beamformer is optimized as a whole. Examples of the reverberation suppression filter are equations (9) and (9′); in the present embodiment, equation (9′) is used as the reverberation suppression filter and equation (10′) as the beamformer, as an example.
  • In the present embodiment, power-weighted spatio-temporal covariance matrices of each target sound are used to optimize the reverberation suppression filter. Since the power-weighted spatio-temporal covariance matrices of the target sound are small in size, the reverberation suppression filter can be optimized with a small amount of calculation.
  • The signal processing device 2 of the present embodiment has a spatio-temporal covariance estimation unit 211, a reverberation suppression filter estimation unit 212, a beamformer estimation unit 213, a reverberation suppression filter application unit 221, a beamformer application unit 222, and a control unit 13, and executes each process under the control of the control unit 13.
  • The spatio-temporal covariance estimation unit 211, the reverberation suppression filter estimation unit 212, and the beamformer estimation unit 213 constitute the convolutional beamformer estimation unit.
  • The reverberation suppression filter application unit 221 and the beamformer application unit 222 constitute the convolutional beamformer application unit.
  • The auxiliary information s is input to the beamformer estimation unit 213, and the acoustic signal x_{t,f} is input to the spatio-temporal covariance estimation unit 211 and the reverberation suppression filter application unit 221.
  • The auxiliary information s of the present embodiment is information for specifying or estimating the RTF ṽ_f^{(i)}, and does not include information specifying the power of the target sound.
  • The spatio-temporal covariance estimation unit 211 calculates and outputs the power-weighted spatio-temporal covariance matrices R̄_{x,f}^{(i)} and P_{x,f}^{(i)} of the target sound.
  • The reverberation suppression filter estimation unit 212 receives the power-weighted spatio-temporal covariance matrices R̄_{x,f}^{(i)} and P_{x,f}^{(i)} of the target sound and the information q_f^{(i)} representing the beamformer, and estimates the reverberation suppression filter based on the optimization criterion described above. The reverberation suppression filter application unit 221 applies the reverberation suppression filter estimated by the reverberation suppression filter estimation unit 212 to the acoustic signal x_{t,f} to obtain and output the reverberation-suppressed signal z_{t,f} (equation (9′)).
  • The beamformer estimation unit 213 receives the reverberation-suppressed signal z_{t,f} and the auxiliary information s, and estimates the beamformer based on the optimization criterion described above.
  • The beamformer application unit 222 applies the beamformer estimated by the beamformer estimation unit 213 to the reverberation-suppressed signal z_{t,f} to obtain and output the processed signal y_{t,f}^{(i)}.
  • The processing of the units constituting the convolutional beamformer estimation unit and the processing of the reverberation suppression filter application unit 221 and the beamformer application unit 222 constituting the convolutional beamformer application unit are repeated alternately until a predetermined convergence condition is satisfied.
  • When the convergence condition is satisfied, the signal processing device 2 outputs the processed signal y_{t,f} obtained by the beamformer application unit 222.
  • Auxiliary information s is input to the beamformer estimation unit 213 (step S213a). In this embodiment, the time-frequency masks γ_{t,f}^{(i)} are input as the auxiliary information s, but this does not limit the present invention.
  • The acoustic signal x_{t,f} is input to the spatio-temporal covariance estimation unit 211 and the reverberation suppression filter application unit 221 (step S221a).
  • The spatio-temporal covariance estimation unit 211 initializes γ_{t,f}^{(i)} for all i ∈ {1, ..., I}, t ∈ {1, ..., N}, f ∈ {1, ..., F}.
  • The spatio-temporal covariance estimation unit 211 also initializes the powers λ_{t,f}^{(i)} of the target sound as follows, where the notation appearing in the displayed formula represents ·^H·. Further, α ← β means that β is substituted into α; in other words, α ← β means that α is updated to β (step S211a).
  • The beamformer estimation unit 213 initializes q_f^{(i)} for all i ∈ {1, ..., I}, f ∈ {1, ..., F}. For example, the beamformer estimation unit 213 sets q_f^{(i)} to the i-th column of I_M (step S213b).
  • The reverberation suppression filter application unit 221 initializes z_{t,f} for all t ∈ {1, ..., N} and f ∈ {1, ..., F}. For example, the reverberation suppression filter application unit 221 sets z_{t,f} ← x_{t,f} (step S221b).
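  • The initializations of steps S211a, S213b, and S221b can be sketched together as follows. The displayed formula initializing λ_{t,f}^{(i)} is not reproduced in this text, so the mask-weighted microphone power used below is an assumed, plausible choice; the names are illustrative.

```python
import numpy as np

def initialize(X, masks):
    """Sketch of the initializations of steps S211a, S213b, and S221b.

    X: (N, M, F) acoustic signal; masks: (N, I, F) masks gamma_{t,f}^{(i)}.
    The power initialization (mask times average microphone power) is an
    assumed form; the patent's displayed formula is not reproduced here.
    """
    N, M, F = X.shape
    I = masks.shape[1]
    # lambda_{t,f}^{(i)} <- gamma_{t,f}^{(i)} * x^H x / M  (assumed)
    power = masks * (np.abs(X) ** 2).sum(axis=1, keepdims=True) / M
    # q_f^{(i)} <- i-th column of the identity matrix I_M (step S213b)
    Q = np.broadcast_to(np.eye(M, dtype=complex)[:, :I], (F, M, I)).copy()
    # z_{t,f} <- x_{t,f} (step S221b)
    Z = X.copy()
    return power, Q, Z
```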
  • The spatio-temporal covariance estimation unit 211 calculates and outputs the power-weighted spatio-temporal covariance matrix R̄_{x,f}^{(i)} of the target sound for all i ∈ {1, ..., I}, f ∈ {1, ..., F} (step S211b).
  • Likewise, the spatio-temporal covariance estimation unit 211 calculates and outputs the power-weighted spatio-temporal covariance matrix P_{x,f}^{(i)} of the target sound for all i ∈ {1, ..., I}, f ∈ {1, ..., F}.
  • If the processed signals y_{t,f} have never been obtained, the λ_{t,f}^{(i)} obtained in step S211a are used for this calculation.
  • If the processed signals y_{t,f} have already been obtained, the λ_{t,f}^{(i)} obtained in step S211d are used for this calculation (step S211c).
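  • The displayed formulas for R̄_{x,f}^{(i)} and P_{x,f}^{(i)} are not reproduced in this text. The following sketch assumes WPE-style power-weighted spatio-temporal covariance matrices, one choice consistent with the weighted power-minimization criterion; the stacked-frame layout and the names are assumptions.

```python
import numpy as np

def stacked_past(X, delta, L):
    """Stack the delayed frames x_{t-delta}, ..., x_{t-(L-1)} of one
    frequency bin into one vector per frame.
    X: (N, M). Returns Xbar: (N, M * (L - delta))."""
    N, M = X.shape
    K = L - delta
    Xbar = np.zeros((N, M * K), dtype=complex)
    for k, tau in enumerate(range(delta, L)):
        Xbar[tau:, k * M:(k + 1) * M] = X[:N - tau]
    return Xbar

def power_weighted_covariances(X, Xbar, power, eps=1e-8):
    """Assumed WPE-style forms for one target sound at one frequency bin:
      Rbar = sum_t xbar_t xbar_t^H / lambda_t   (cf. step S211b)
      P    = sum_t xbar_t x_t^H    / lambda_t   (cf. step S211c)
    X: (N, M), Xbar: (N, MK), power: (N,) powers lambda_{t,f}^{(i)}."""
    w = 1.0 / np.maximum(power, eps)             # avoid division by zero
    Rbar = (Xbar * w[:, None]).T @ Xbar.conj()   # (MK, MK)
    P = (Xbar * w[:, None]).T @ X.conj()         # (MK, M)
    return Rbar, P
```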
  • The power-weighted spatio-temporal covariance matrices R̄_{x,f}^{(i)} and P_{x,f}^{(i)} of the target sound and the information q_f^{(i)} representing the beamformer obtained by the beamformer estimation unit 213 are input to the reverberation suppression filter estimation unit 212.
  • The reverberation suppression filter estimation unit 212 receives these and estimates the reverberation suppression filter (equation (9′)) based on the above-mentioned optimization criterion.
  • For example, the reverberation suppression filter estimation unit 212 first calculates an intermediate matrix (step S212a), and then a further intermediate matrix, where (·)* represents the complex conjugate of (·) (step S212b).
  • Further, the reverberation suppression filter estimation unit 212 calculates and outputs ĝ_f, the information specifying the reverberation suppression filter, where (·)^+ is the Moore-Penrose pseudo-inverse of (·) (step S212c).
  • The ĝ_f obtained by the reverberation suppression filter estimation unit 212 is input to the reverberation suppression filter application unit 221.
  • The reverberation suppression filter application unit 221 applies the reverberation suppression filter estimated by the reverberation suppression filter estimation unit 212 to the acoustic signal x_{t,f} as follows to obtain and output the reverberation-suppressed signal z_{t,f}.
  • The reverberation-suppressed signal z_{t,f} is sent to the beamformer estimation unit 213 and the beamformer application unit 222 (step S221c).
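  • Continuing the sketch above under the same assumptions, steps S212c and S221c can be illustrated as follows: the filter is obtained from the covariance matrices with the Moore-Penrose pseudo-inverse, and the predicted late reverberation is subtracted from the observation. This is a sketch of one plausible realization, not the patent's exact displayed formulas.

```python
import numpy as np

def dereverberation_filter(Rbar, P):
    """Information specifying the reverberation suppression filter,
    using the Moore-Penrose pseudo-inverse as in step S212c.
    Rbar: (MK, MK), P: (MK, M) from the covariance sketch above."""
    return np.linalg.pinv(Rbar) @ P              # G: (MK, M)

def apply_dereverberation(X, Xbar, G):
    """Assumed form of step S221c: subtract the predicted late
    reverberation, z_{t,f} = x_{t,f} - G^H xbar_{t,f}.
    X: (N, M) observation, Xbar: (N, MK) stacked past frames."""
    return X - Xbar @ G.conj()                   # Z: (N, M)
```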
  • The reverberation-suppressed signals z_{t,f} and the calculated γ_{t,f}^{(i)} are input to the beamformer estimation unit 213.
  • The beamformer estimation unit 213 receives these and estimates the beamformer based on the above-mentioned optimization criterion.
  • First, the beamformer estimation unit 213 obtains the RTF ṽ_f^{(i)} based on z_{t,f} and γ_{t,f}^{(i)}.
  • For example, the steering vector estimation unit 2131 of the beamformer estimation unit 213 estimates and outputs the steering vector v_f^{(i)} based on z_{t,f} and γ_{t,f}^{(i)}.
  • For example, the steering vector v_f^{(i)} is estimated as follows.
  • Further, the RTF estimation unit 2132 of the beamformer estimation unit 213 obtains ṽ_f^{(i)} from v_f^{(i)}.
  • For example, the RTF estimation unit 2132 obtains ṽ_f^{(i)} according to equation (3) (step S213c). Further, the beamformer estimation unit 213 performs the calculation shown below for all i ∈ {1, ..., I} and f ∈ {1, ..., F}. If the processed signals y_{t,f} have never been obtained, the λ_{t,f}^{(i)} obtained in step S211a are used for this calculation; if the processed signals y_{t,f} have already been obtained, the λ_{t,f}^{(i)} obtained in step S211d are used (step S213d). Further, the beamformer estimation unit 213 calculates and outputs the information q_f^{(i)} specifying the beamformer for all i ∈ {1, ..., I} and f ∈ {1, ..., F} (step S213e).
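  • A minimal sketch of one plausible realization of steps S213c to S213e is given below: the steering vector as the principal eigenvector of the mask-weighted spatial covariance of z_{t,f}, the RTF by the normalization of equation (3), and an MPDR-style distortionless weight. Since the displayed expressions are omitted in this text, these concrete formulas are standard choices assumed for illustration.

```python
import numpy as np

def estimate_beamformer(Z, gamma, power, eps=1e-8):
    """Sketch of steps S213c-S213e at one frequency bin.

    Z: (N, M) dereverberated signal z_{t,f}; gamma: (N,) mask of source
    i; power: (N,) powers lambda_{t,f}^{(i)}.  All formulas below are
    standard assumed choices, not reproduced from the patent.
    """
    # steering vector v_f^{(i)}: principal eigenvector of the
    # mask-weighted spatial covariance of z (step S213c, assumed)
    Phi = (Z * gamma[:, None]).T @ Z.conj() / max(gamma.sum(), eps)
    _, vecs = np.linalg.eigh(Phi)
    v = vecs[:, -1]
    v_tilde = v / v[0]            # RTF of equation (3); assumes v[0] != 0
    # power-weighted spatial covariance of z (step S213d, assumed)
    w = 1.0 / np.maximum(power, eps)
    R = (Z * w[:, None]).T @ Z.conj() / len(Z)
    # MPDR-style distortionless weight q_f^{(i)} (step S213e, assumed)
    q = np.linalg.solve(R, v_tilde)
    q = q / (v_tilde.conj() @ q)  # q^H v_tilde = 1 (no distortion)
    return q, v_tilde
```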
  • The reverberation-suppressed signals z_{t,f} and the information q_f^{(i)} specifying the beamformer are input to the beamformer application unit 222, which applies the beamformer to z_{t,f} to obtain the processed signals y_{t,f}^{(i)} (step S222a).
  • Also in the present embodiment, the convolutional beamformer that performs reverberation suppression, diffuse noise suppression, and target sound source separation is estimated based on the optimization criterion that the signal obtained by applying it to the acoustic signal follows the probabilistic model, using the auxiliary information s representing the information of the target sound. As a result, the convolutional beamformer can be optimized as a whole, and more effective speech enhancement can be realized. Further, in the present embodiment, the convolutional beamformer is divided into a reverberation suppression filter and a beamformer, and the beamformer is estimated using the reverberation-suppressed signal obtained in the course of the estimation, so that more effective speech enhancement can be realized.
  • Most of the operations required for estimating the reverberation suppression filter are the computations of the power-weighted spatio-temporal covariance matrices R̄_{x,f}^{(i)} and P_{x,f}^{(i)} of the target sound.
  • The sizes of the power-weighted spatio-temporal covariance matrices R̄_{x,f}^{(i)} and P_{x,f}^{(i)} of the target sound obtained in steps S211b and S211c are smaller than the sizes of the matrices obtained in steps S212a and S212b. Therefore, in the present embodiment, the amount of calculation required for estimating the reverberation suppression filter can be significantly reduced, and speech enhancement can be realized at a low computational cost.
  • In the second embodiment, the process in which the reverberation suppression filter estimation unit 212 fixes the beamformer and estimates the reverberation suppression filter (equation (9′)), and the process in which the beamformer estimation unit 213 fixes the reverberation suppression filter and estimates the beamformer (equation (10′)), are repeated.
  • Here, the I-dimensional processed signals y_{t,f} are more compressed than the M-dimensional acoustic signals x_{t,f}, and information is lost. Owing to this loss of information, the reverberation suppression filter or the beamformer may become a quasi-optimal solution instead of the optimal solution.
  • In this modification, therefore, in addition to the power-weighted covariance matrices of the target sounds y_{t,f}^{(1)}, ..., y_{t,f}^{(I)} corresponding to i = 1, ..., I, a power-weighted covariance matrix of the non-target sound is also used.
  • The signal processing device 2′ of this modification has a spatio-temporal covariance estimation unit 211′, a reverberation suppression filter estimation unit 212′, a beamformer estimation unit 213′, a reverberation suppression filter application unit 221, a beamformer application unit 222′, and a control unit 13, and executes each process under the control of the control unit 13.
  • The spatio-temporal covariance estimation unit 211′, the reverberation suppression filter estimation unit 212′, and the beamformer estimation unit 213′ constitute the convolutional beamformer estimation unit.
  • The reverberation suppression filter application unit 221 and the beamformer application unit 222′ constitute the convolutional beamformer application unit.
  • Instead of the spatio-temporal covariance estimation unit 211, the spatio-temporal covariance estimation unit 211′ generates, in addition to the power-weighted spatio-temporal covariance matrices R̄_{x,f}^{(i)} and P_{x,f}^{(i)} of the target sound, power-weighted spatio-temporal covariance matrices of the non-target sound.
  • Instead of the reverberation suppression filter estimation unit 212, the reverberation suppression filter estimation unit 212′ receives the power-weighted spatio-temporal covariance matrices R̄_{x,f}^{(i)} and P_{x,f}^{(i)} of the target sound for estimating the target sound,
  • as well as the power-weighted spatio-temporal covariance matrices of the non-target sound for estimating the non-target sound, and estimates the reverberation suppression filter based on the above-mentioned optimization criterion.
  • Instead of the beamformer estimation unit 213, the beamformer estimation unit 213′ generates, in addition to the information q_f^{(i)} representing the beamformers corresponding to 1 ≤ i ≤ I for estimating the target sounds, information q_f^{(i)} representing the beamformers corresponding to I < i ≤ M for estimating the non-target sound. Instead of the beamformer application unit 222, the beamformer application unit 222′ generates, in addition to the estimates y_{t,f}^{(1)}, ..., y_{t,f}^{(I)} of the target sounds, an estimate y_{t,f}^⊥ of the non-target sound. The rest is the same as in the second embodiment.
  • Instead of the signal processing device 2, the signal processing device 2′ executes the processes of steps S213a, S221a, S211a, S213b, S221b, S211b, S211c, S212a, and S212b shown in the figure.
  • However, the processing of the spatio-temporal covariance estimation unit 211 is executed by the spatio-temporal covariance estimation unit 211′ instead.
  • In addition to the powers λ_{t,f}^{(1)}, ..., λ_{t,f}^{(I)} of the target sounds, the spatio-temporal covariance estimation unit 211′ also initializes the power λ_{t,f}^⊥ of the non-target sound in the same way as the powers of the target sounds.
  • The processing of the beamformer estimation unit 213 is executed by the beamformer estimation unit 213′ instead.
  • If the spatio-temporal covariance estimation unit 211′ has not yet obtained y_{t,f}^{(i)}, it uses the λ_{t,f}^{(1)}, ..., λ_{t,f}^{(I)} and λ_{t,f}^⊥ obtained at initialization. On the other hand, if y_{t,f}^{(i)} has been obtained, y_{t,f}^{(1)}, ..., y_{t,f}^{(I)} and y_{t,f}^⊥ are input to the spatio-temporal covariance estimation unit 211′,
  • so that λ_{t,f}^{(1)}, ..., λ_{t,f}^{(I)} and λ_{t,f}^⊥ can be obtained by step S211d.
  • The spatio-temporal covariance estimation unit 211′ uses x_{t,f} and λ_{t,f}^⊥ to calculate and output the power-weighted spatio-temporal covariance matrix R̄_{x,f}^⊥ of the non-target sound (step S211b′), and further uses x_{t,f} and λ_{t,f}^⊥ to calculate and output the power-weighted spatio-temporal covariance matrix P_{x,f}^⊥ of the non-target sound (step S211c′).
  • The reverberation suppression filter estimation unit 212′ receives R̄_{x,f}^⊥ and q_f^{(i)} and performs the calculation of step S212a′; furthermore, it receives P_{x,f}^⊥ and q_f^{(i)} and performs the calculation of step S212b′.
  • As in the second embodiment, the signal processing device 2′ executes the processes of steps S212c, S221c, S213c, S213d, S213e, S222a, S13, S211d, and S222b shown in the figure.
  • However, the processing of the reverberation suppression filter estimation unit 212 described in the second embodiment is executed by the reverberation suppression filter estimation unit 212′ instead.
  • The processing of the beamformer estimation unit 213 is executed by the beamformer estimation unit 213′ instead.
  • The processing of the beamformer application unit 222 is executed by the beamformer application unit 222′ instead.
  • The beamformer estimation unit 213′ estimates q_f^{(i)} for i ∈ {1, ..., I}, f ∈ {1, ..., F}, and in addition also generates q_f^{(i)} for i ∈ {I+1, ..., M}, f ∈ {1, ..., F}.
  • For example, with respect to the q_f^{(i)} for i ∈ {1, ..., I}, the q_f^{(i)} for i ∈ {I+1, ..., M} are generated as vectors spanning the complementary space.
  • For example, an orthonormal basis of the complementary space may be adopted, or another set of spanning vectors may be used.
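  • For instance, an orthonormal basis of the orthogonal complement can be obtained from the SVD, as in the following sketch; it assumes the q_f^{(i)} for i ∈ {1, ..., I} are linearly independent, and the names are illustrative.

```python
import numpy as np

def complete_with_complement(Q):
    """Q = [q_f^{(1)}, ..., q_f^{(I)}] as an (M, I) matrix with linearly
    independent columns.  Appends M - I orthonormal vectors spanning the
    orthogonal complement of the column space, obtained from the SVD."""
    M, I = Q.shape
    U, _, _ = np.linalg.svd(Q, full_matrices=True)
    Q_perp = U[:, I:]                            # spans the complement
    return np.concatenate([Q, Q_perp], axis=1)   # (M, M)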
  • The signal processing device 2″ of this modification has a spatio-temporal covariance estimation unit 211, a reverberation suppression filter estimation unit 212″, a beamformer estimation unit 213, a reverberation suppression filter application unit 221″, a beamformer application unit 222, and a control unit 13, and executes each process under the control of the control unit 13.
  • The spatio-temporal covariance estimation unit 211, the reverberation suppression filter estimation unit 212″, and the beamformer estimation unit 213 constitute the convolutional beamformer estimation unit. The reverberation suppression filter application unit 221″ and the beamformer application unit 222 constitute the convolutional beamformer application unit.
  • Instead of the reverberation suppression filter estimation unit 212, the reverberation suppression filter estimation unit 212″ receives R̄_{x,f}^{(i)} and P_{x,f}^{(i)}, and calculates and outputs the information Ḡ_f^{(i)} corresponding to the dereverberation filter (step S212c″).
  • The reverberation suppression filter application unit 221″ receives the information Ḡ_f^{(i)} corresponding to the dereverberation filter and the acoustic signal x_{t,f}, applies the dereverberation filter to the acoustic signal x_{t,f} as follows,
  • and obtains and outputs the reverberation-suppressed signal z_{t,f} (step S221c″).
  • As in the second embodiment, the signal processing device 2″ executes the processes of steps S213c, S213d, S213e, S222a, S13, S211d, and S222b shown in the figure.
  • In the third embodiment, the auxiliary information includes information specifying the power of the target sound. As a result, the iterative processing can be omitted.
  • The signal processing device 3 of the present embodiment has a spatio-temporal covariance estimation unit 311, a reverberation suppression filter estimation unit 212, a beamformer estimation unit 313, a reverberation suppression filter application unit 221, a beamformer application unit 222, and a control unit 13, and executes each process under the control of the control unit 13.
  • The spatio-temporal covariance estimation unit 311, the reverberation suppression filter estimation unit 212, and the beamformer estimation unit 313 constitute the convolutional beamformer estimation unit.
  • The reverberation suppression filter application unit 221 and the beamformer application unit 222 constitute the convolutional beamformer application unit.
  • The auxiliary information s = {s₁, s₂} is input to the signal processing device 3.
  • Instead of the signal processing device 2, the signal processing device 3 executes the processes of steps S221a, S211a, S213b, S221b, S211b, S211c, S212a, S212b, S212c, S221c, S213c, S213d, S213e, S222a, and S222b.
  • However, the processing of the spatio-temporal covariance estimation unit 211 and the processing of the beamformer estimation unit 213 described in the second embodiment are executed by the spatio-temporal covariance estimation unit 311 and the beamformer estimation unit 313, respectively.
  • Thereby, the processed signals y_{t,f}^{(i)}, obtained by suppressing the reverberation and the diffuse noise and separating the target sound sources from the acoustic signals x_{t,f}, are obtained without iterative processing.
  • [Modification 1 of the third embodiment] As in modification 1 of the second embodiment, also in the third embodiment the power-weighted spatio-temporal covariance matrices of the non-target sound may be calculated in addition to the power-weighted spatio-temporal covariance matrices R̄_{x,f}^{(i)} and P_{x,f}^{(i)} of the target sound, and used to estimate the reverberation suppression filter.
  • The signal processing device 3′ of this modification has a spatio-temporal covariance estimation unit 311′, a reverberation suppression filter estimation unit 212′, a beamformer estimation unit 313, a reverberation suppression filter application unit 221, a beamformer application unit 222, and a control unit 13, and executes each process under the control of the control unit 13.
  • The spatio-temporal covariance estimation unit 311′, the reverberation suppression filter estimation unit 212′, and the beamformer estimation unit 313 constitute the convolutional beamformer estimation unit.
  • The reverberation suppression filter application unit 221 and the beamformer application unit 222 constitute the convolutional beamformer application unit.
  • As described in the third embodiment, the signal processing device 3′ executes the processes of steps S313a, S221a, S211a, S213b, S221b, S211b, S211c, S212a, and S212b shown in FIG. 10.
  • Instead of the signal processing device 2′, the signal processing device 3′ also executes the processes of steps S211b′, S211c′, S212a′, and S212b′ shown in the figure.
  • Here, the spatio-temporal covariance estimation unit 311′ executes the processing of the spatio-temporal covariance estimation unit 211′ described in modification 1 of the second embodiment.
  • However, the powers λ_{t,f}^{(i)} included in the auxiliary information s₂ are used for the calculation of steps S211b′ and S211c′.
  • Further, as described in the third embodiment, the signal processing device 3′ executes the processes of steps S212c, S221c, S213c, S213d, S213e, S222a, and S222b shown in FIG. 11.
  • The signal processing device 3″ of this modification has a spatio-temporal covariance estimation unit 311″, a reverberation suppression filter estimation unit 212″, a beamformer estimation unit 313, a reverberation suppression filter application unit 221″, a beamformer application unit 222, and a control unit 13, and executes each process under the control of the control unit 13.
  • The spatio-temporal covariance estimation unit 311″, the reverberation suppression filter estimation unit 212″, and the beamformer estimation unit 313 constitute the convolutional beamformer estimation unit.
  • The reverberation suppression filter application unit 221″ and the beamformer application unit 222 constitute the convolutional beamformer application unit.
  • As described in the third embodiment, the signal processing device 3″ executes steps S313a, S221a, S211b, and S211c shown in FIGS. 12 and 13.
  • Further, as in modification 2 of the second embodiment, the reverberation suppression filter estimation unit 212″ and the reverberation suppression filter application unit 221″ of the signal processing device 3″ execute steps S212c″ and S221c″.
  • In addition, as described in the third embodiment, the signal processing device 3″ executes steps S213c, S213d, S213e, S222a, and S222b.
  • However, the powers λ_{t,f}^{(i)} included in the auxiliary information s₂ are used for the calculation of steps S211b, S211c, and S213d.
  • As described above, the sizes of the power-weighted spatio-temporal covariance matrices R̄_{x,f}^{(i)} and P_{x,f}^{(i)} of the target sound obtained in steps S211b and S211c are smaller than the sizes of the matrices obtained in steps S212a and S212b, so speech enhancement can be realized at a low computational cost in each of the above embodiments and modifications. Although this effect is then not obtained, alternative calculations may instead be executed in steps S212a and S212b, provided that the accompanying relation is satisfied.
  • FIG. 15 shows the word error rates when the processed signals obtained in the fourth embodiment and in modifications 1 and 2 of the second embodiment are recognized by speech recognition, for the two configurations Config-1 and Config-2.
  • The horizontal axis of FIG. 15 represents the number of iterations (#iterations), and the vertical axis represents the word error rate (WER, %).
  • The signal processing devices 1, 2, 2′, 2″, 3, 3′, and 3″ in each embodiment are devices configured by a general-purpose or dedicated computer that includes a processor (hardware processor) such as a CPU (central processing unit) and memories such as a RAM (random-access memory) and a ROM (read-only memory) executing a predetermined program.
  • This computer may have one processor and memory, or may have a plurality of processors and memory.
  • This program may be installed in a computer or may be recorded in a ROM or the like in advance.
  • A part or all of each processing unit may be configured by using an electronic circuit that realizes the processing functions by itself, instead of an electronic circuit (circuitry) such as a CPU that realizes the functional configuration by reading a program.
  • An electronic circuit constituting one device may include a plurality of CPUs.
  • FIG. 14 is a block diagram illustrating the hardware configuration of the signal processing devices 1, 2, 2′, 2″, 3, 3′, and 3″ in each embodiment.
  • As illustrated in FIG. 14, the signal processing devices 1, 2, 2′, 2″, 3, 3′, and 3″ in this example have a CPU (Central Processing Unit) 10a, an input unit 10b, an output unit 10c, a RAM (Random Access Memory) 10d, a ROM (Read Only Memory) 10e, an auxiliary storage device 10f, and a bus 10g.
  • The CPU 10a of this example has a control unit 10aa, a calculation unit 10ab, and registers 10ac, and executes various arithmetic processes according to various programs read into the registers 10ac.
  • The input unit 10b is an input terminal, a keyboard, a mouse, a touch panel, or the like through which data is input.
  • The output unit 10c is an output terminal from which data is output, a display, a LAN card controlled by the CPU 10a that has read a predetermined program, or the like.
  • The RAM 10d is an SRAM (Static Random Access Memory), a DRAM (Dynamic Random Access Memory), or the like, and has a program area 10da in which a predetermined program is stored and a data area 10db in which various data are stored.
  • The auxiliary storage device 10f is, for example, a hard disk, an MO (Magneto-Optical disc), or a semiconductor memory, and has a program area 10fa for storing a predetermined program and a data area 10fb for storing various data.
  • The bus 10g connects the CPU 10a, the input unit 10b, the output unit 10c, the RAM 10d, the ROM 10e, and the auxiliary storage device 10f so that information can be exchanged.
  • The CPU 10a writes the program stored in the program area 10fa of the auxiliary storage device 10f to the program area 10da of the RAM 10d according to the read OS (Operating System) program.
  • Similarly, the CPU 10a writes the various data stored in the data area 10fb of the auxiliary storage device 10f to the data area 10db of the RAM 10d. The addresses on the RAM 10d at which the program and data are written are stored in the registers 10ac of the CPU 10a.
  • The control unit 10aa of the CPU 10a sequentially reads these addresses stored in the registers 10ac, reads the program or data from the area on the RAM 10d indicated by each read address, causes the calculation unit 10ab to sequentially execute the operations indicated by the program, and stores the calculation results in the registers 10ac.
  • The above program can be recorded on a computer-readable recording medium.
  • A computer-readable recording medium is, for example, a non-transitory recording medium. Examples of such a recording medium are a magnetic recording device, an optical disc, a magneto-optical recording medium, and a semiconductor memory.
  • The distribution of this program is carried out, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded.
  • The program may also be stored in the storage device of a server computer and distributed by transferring it from the server computer to another computer via a network.
  • A computer that executes such a program first stores, for example, the program recorded on the portable recording medium or transferred from the server computer temporarily in its own storage device. When executing the processing, the computer reads the program stored in its own storage device and executes processing according to the read program.
  • Alternatively, the computer may read the program directly from the portable recording medium and execute processing according to the program; furthermore, the computer may sequentially execute processing according to the received program each time the program is transferred to it from the server computer.
  • The above processing may also be executed by a so-called ASP (Application Service Provider) type service, which realizes the processing functions only by execution instructions and result acquisition, without transferring the program from the server computer to the computer.
  • The program in this embodiment includes information that is used for processing by a computer and is equivalent to a program (data that is not a direct command to the computer but has the property of defining the processing of the computer, and the like).
  • In the above embodiments, the present device is configured by executing a predetermined program on a computer, but at least a part of the processing contents may be realized by hardware.
  • In the above embodiments, the acoustic signal x_{t,f} was obtained by frequency-dividing the signal obtained by observing the mixture of the diffuse noise and the microphone image signals based on the source signals emitted from the sound sources.
  • However, the acoustic signal x_{t,f} may instead be obtained by performing some signal processing (filtering or the like) on the signal obtained by frequency-dividing the observed mixture signal.
  • Alternatively, the acoustic signal x_{t,f} may be obtained by frequency-dividing a signal obtained by performing some signal processing on the observed mixture signal.
  • Further, the acoustic signal x_{t,f} may be obtained by performing some signal processing on the observed mixture signal, frequency-dividing the result, and then performing further signal processing.
  • The various processes described above are not only executed in time series according to the description, but may also be executed in parallel or individually according to the processing capability of the device that executes the processes, or as needed.
  • In the second embodiment, the steering vector estimation unit 2131 estimated the steering vector v_f^{(i)} based on z_{t,f} and γ_{t,f}^{(i)}.
  • However, the auxiliary information s may include the steering vector v_f^{(i)} itself. In this case, the steering vector estimation unit 2131 can be omitted, and the RTF estimation unit 2132 of the beamformer estimation unit 213 may obtain ṽ_f^{(i)} from the v_f^{(i)} included in the auxiliary information s.
  • Even when the auxiliary information s includes neither the time-frequency mask γ_{t,f}^{(i)} nor the steering vector v_f^{(i)},
  • if the auxiliary information s includes a reference sound of the target sound,
  • the steering vector v_f^{(i)} can be estimated from the reverberation-suppressed signals z_{t,f} and the auxiliary information s. That is, as illustrated in FIG. 6, first, the time-frequency mask estimation unit 2130 of the beamformer estimation unit 213 receives the reverberation-suppressed signals z_{t,f} and the auxiliary information s (the reference sound of the target sound), and estimates the time-frequency mask by, for example, the method described in Reference 3.
  • In steps S212a′ and S212b′, the power-weighted spatio-temporal covariance matrices R̄_{x,f}^{(i)} and P_{x,f}^{(i)} of the target sound may be used instead of the power-weighted spatio-temporal covariance matrices R̄_{x,f}^⊥ and P_{x,f}^⊥ of the non-target sound.
  • In this case, steps S211b′ and S211c′ can be omitted.
  • 11 Convolutional beamformer estimation unit; 12 Convolutional beamformer application unit; 211, 211′, 311, 311′ Spatio-temporal covariance estimation unit; 212, 212′, 212″ Reverberation suppression filter estimation unit; 213, 213′, 313 Beamformer estimation unit; 221, 221″ Reverberation suppression filter application unit; 222, 222′ Beamformer application unit


Abstract

The present invention involves: receiving frequency-divided time-sequential acoustic signals and auxiliary information representing target sound information; estimating a convolutional beamformer on the basis of the optimization criterion that, when a convolutional beamformer performing dereverberation, diffuse noise suppression, and target sound source separation is applied to an acoustic signal, the obtained signal is defined according to a probabilistic model; applying the estimated convolutional beamformer to the acoustic signals to obtain a processed signal; and outputting the processed signal.

Description

Signal processing device, signal processing method, and program
The present invention relates to a technique for extracting a target sound from an acoustic signal by suppressing sounds other than the target sound, other noise, and reverberation.
Non-Patent Document 1 discloses a method of suppressing noise and reverberation from an acoustic signal in the frequency domain. In this method, the acoustic signal is first received, a reverberation suppression filter that suppresses the reverberation of the target sound is estimated based on a weighted prediction-error power minimization criterion, and the filter is applied to the acoustic signal to remove the reverberation. After that, a steering vector representing the direction of the target sound (or its estimate) is received, and a minimum-power distortionless response (MPDR) beamformer that minimizes the power of the acoustic signal, under the constraint that the sound arriving at the microphones from the source position of the target sound is not distorted, is estimated and applied to the dereverberated acoustic signal to further suppress noise.
Non-Patent Document 2 discloses a method that, under the assumption that the acoustic signal contains no diffuse noise and consists only of a plurality of point sound sources, simultaneously achieves optimal reverberation suppression and sound source separation by alternately updating the coefficients of a reverberation suppression block and a sound source separation block so as to minimize a given objective function.
However, in the method of Non-Patent Document 1, the reverberation suppression filter and the minimum-power distortionless response beamformer are optimized independently, so optimal processing cannot be achieved as a whole. In addition, the method of Non-Patent Document 2 cannot suppress diffuse noise from the acoustic signal.
The present invention has been made in view of these points, and its object is to perform reverberation suppression, diffuse noise suppression, and target sound source separation that are optimized as a whole.
Frequency-divided time-series acoustic signals and auxiliary information representing information on the target sound are received; a convolutional beamformer that performs reverberation suppression, diffuse noise suppression, and target sound source separation is estimated based on the optimization criterion that the signal obtained by applying the convolutional beamformer to the acoustic signal follows a probabilistic model; and the estimated convolutional beamformer is applied to the acoustic signal to obtain and output a processed signal.
In the present invention, by using the auxiliary information representing the information of the target sound, the convolutional beamformer that performs reverberation suppression, diffuse noise suppression, and target sound source separation can be optimized as a whole. Therefore, reverberation suppression, diffuse noise suppression, and target sound source separation that are optimized as a whole can be performed.
FIG. 1 is a block diagram illustrating the functional configuration of the signal processing device of the embodiment.
FIG. 2 is a block diagram illustrating the functional configuration of the signal processing device of the second embodiment and its modified examples.
FIG. 3 is a flow chart for explaining the signal processing method of the second embodiment and its modified examples.
FIG. 4 is a flow chart for explaining the signal processing method of a modified example of the second embodiment.
FIG. 5 is a flow chart for explaining the signal processing method of the second embodiment and its modified examples.
FIG. 6 is a block diagram illustrating the estimation process of an RTF (Relative Transfer Function).
FIG. 7 is a flow chart for explaining the signal processing method of a modified example of the second embodiment.
FIG. 8 is a flow chart for explaining the signal processing method of a modified example of the second embodiment.
FIG. 9 is a block diagram illustrating the functional configuration of the signal processing device of the third embodiment and its modified examples.
FIG. 10 is a flow chart for explaining the signal processing method of the third embodiment and its modified examples.
FIG. 11 is a flow chart for explaining the signal processing method of the third embodiment and its modified examples.
FIG. 12 is a flow chart for explaining the signal processing method of a modified example of the third embodiment.
FIG. 13 is a flow chart for explaining the signal processing method of a modified example of the third embodiment.
FIG. 14 is a block diagram illustrating the hardware configuration of the signal processing device of the embodiment.
FIG. 15 is a graph illustrating the word error rate when the processed signals obtained in the fourth embodiment and modified examples 1 and 2 of the second embodiment are recognized by speech recognition.
Hereinafter, embodiments of the present invention will be described with reference to the drawings.
[First Embodiment]
First, the first embodiment of the present invention will be described. In the first embodiment, a signal processing device receives a frequency-divided time-series acoustic signal and auxiliary information representing the target sound, estimates a convolutional beamformer that performs reverberation suppression, diffuse noise suppression, and target sound source separation based on an optimization criterion under which the signal obtained by applying the beamformer to the acoustic signal follows a probabilistic model, and applies the estimated convolutional beamformer to the acoustic signal to obtain and output a processed signal.
<Functional configuration>
As illustrated in FIG. 1, the signal processing device 1 of this embodiment has a convolutional beamformer estimation unit 11, a convolutional beamformer application unit 12, and a control unit 13, and executes each process under the control of the control unit 13.
<Processing>
Assume a situation in which source signals emitted from I sound sources are observed by M microphones in an environment containing reverberation and diffuse noise, where I and M are integers of 1 or more satisfying M ≥ I. The signal obtained by observing, with the microphones, the mixture of the signals based on the source signals emitted from the I sound sources (direct sound, early reflections, late reverberation) and diffuse noise (additive diffuse noise) is frequency-divided by a well-known method such as the short-time Fourier transform, yielding the acoustic signal x_{t,f} at each time-frequency point (the acoustic signal x_{t,f} in the time-frequency domain). The acoustic signal x_{t,f} is modeled as follows:
[Math. 1]
  x_{t,f} = Σ_{i=1}^{I} x_{t,f}^{(i)} + n_{t,f}    (1)
[Math. 2]
  x_{t,f}^{(i)} = d_{t,f}^{(i)} + r_{t,f}^{(i)}    (2)
Here, t ∈ {1,…,N} is the time index corresponding to a time interval (frame), and f ∈ {1,…,F} is the frequency index corresponding to a frequency band (frequency bin). N and F are positive integers; for example, N is an integer of 2 or more. Hereinafter, the time interval corresponding to time index t is called "time interval t", and the frequency band corresponding to frequency index f is called "frequency band f". i is the index of the sound source of each target sound, i ∈ {1,…,I}, and the source signal emitted from sound source i is called "source signal i". The acoustic signal x_{t,f} = [x_{1,t,f},…,x_{M,t,f}]^T ∈ C^{M×1} is the M-dimensional column vector whose elements are the signals x_{1,t,f},…,x_{M,t,f} in frequency band f obtained by frequency-dividing, in each time interval t, all the signals observed by the M microphones. C denotes the set of complex numbers, and (·)^T denotes the non-conjugate transpose. The microphone image signal x_{t,f}^{(i)} = [x_{1,t,f}^{(i)},…,x_{M,t,f}^{(i)}]^T ∈ C^{M×1} is the M-dimensional column vector whose elements are the components x_{1,t,f}^{(i)},…,x_{M,t,f}^{(i)} of x_{1,t,f},…,x_{M,t,f} consisting of the direct sound, early reflections, and late reverberation corresponding to source signal i; these components contain no diffuse-noise component. The diffuse noise n_{t,f} = [n_{1,t,f},…,n_{M,t,f}]^T ∈ C^{M×1} is the M-dimensional column vector whose elements are the diffuse-noise components of x_{1,t,f},…,x_{M,t,f}. The microphone image signal x_{t,f}^{(i)} of equation (1) is further divided into two elements as in equation (2): the target sound d_{t,f}^{(i)} represents the components of x_{t,f}^{(i)} corresponding to the direct sound and early reflections, and the late reverberation r_{t,f}^{(i)} represents the component of x_{t,f}^{(i)} corresponding to the late reverberation. Note that the superscript β of symbols written in the form "χ_α^β", such as x_{t,f}^{(i)}, d_{t,f}^{(i)}, and r_{t,f}^{(i)}, should properly be written directly above α (see equation (2)), but may be written at the upper right of α due to notational constraints. In the reverberation suppression, diffuse noise suppression, and target sound source separation of this embodiment, the diffuse noise n_{t,f} and the late reverberation r_{t,f}^{(i)} corresponding to each sound source i are suppressed from the acoustic signal x_{t,f} of equation (1), and the target sound d_{t,f}^{(i)} corresponding to each sound source i is separated and extracted.
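For illustration, the frequency-division step can be sketched as follows. This is a minimal NumPy/SciPy example, not part of the patent text; the sampling rate and window length are placeholder choices rather than values prescribed by the embodiment.

```python
import numpy as np
from scipy.signal import stft

def frequency_divide(waveforms, fs=16000, nperseg=512):
    """waveforms: (M, num_samples) array of microphone signals.
    Returns x with shape (F, N, M): x[f, t] is the M-dimensional vector x_{t,f}."""
    # scipy.signal.stft operates on the last axis and returns an (M, F, N) array
    _, _, X = stft(waveforms, fs=fs, nperseg=nperseg)
    return np.transpose(X, (1, 2, 0))  # reorder to (F, N, M)

# Example with M = 2 microphones and white-noise stand-in signals
x_time = np.random.randn(2, 16000)
x = frequency_divide(x_time)
print(x.shape)  # (F, N, 2)
```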
The processing of this embodiment will be described with reference to FIG. 1.
The frequency-divided time-series acoustic signals x_{t,f} are input to the signal processing device 1 for all t ∈ {1,…,N} and f ∈ {1,…,F}. As described above, the acoustic signal x_{t,f} exemplified in this embodiment is obtained by frequency-dividing the signal observed as the mixture of diffuse noise and the microphone image signals based on the source signals emitted from the sound sources. In addition, auxiliary information s representing the target sound is input to the signal processing device 1.
An example of the auxiliary information s is information for specifying or estimating the RTF (Relative Transfer Function) v~_f^{(i)} (see, e.g., Reference 1).
Reference 1: I. Cohen, "Relative transfer function identification using speech signals," IEEE Trans. on Speech and Audio Processing, vol. 12, no. 5, pp. 451-459, 2004.
The RTF v~_f^{(i)} is obtained by normalizing each element of the M-dimensional steering vector v_f^{(i)} = [v_{1,f}^{(i)},…,v_{M,f}^{(i)}]^T, which corresponds to the space from the sound source i of the target sound to the M microphones, with respect to one of its elements. Equation (3) shows an example of the RTF v~_f^{(i)}, in which each element of v_f^{(i)} is normalized with respect to the element v_{1,f}^{(i)}; however, this does not limit the present invention.
[Math. 3]
  v~_f^{(i)} = v_f^{(i)} / v_{1,f}^{(i)}    (3)
Note that the superscript α of symbols written in the form "χ^α", such as v~, should properly be written directly above χ (see equation (3)), but may be written at the upper right of χ due to notational constraints. Examples of the information for specifying or estimating the RTF v~_f^{(i)} are a reference sound of the target sound, the time-frequency mask γ_{t,f}^{(i)} of the sound source i of the target sound, the steering vector v_f^{(i)}, and the RTF v~_f^{(i)} itself.
Each time-frequency mask γ_{t,f}^{(i)} represents a value corresponding to the existence probability, or the presence or absence, of source signal i in time interval t and frequency band f. For example, the existence probability of source signal i in time interval t and frequency band f, or a function value thereof, may be used as the time-frequency mask γ_{t,f}^{(i)}; alternatively, γ_{t,f}^{(i)} = 1 may be set when source signal i exists in time interval t and frequency band f, and γ_{t,f}^{(i)} = 0 otherwise. A method of estimating the time-frequency mask γ_{t,f}^{(i)} is described, for example, in Reference 2.
Reference 2: F. Bahmaninezhad, J. Wu, R. Gu, S.-X. Zhang, Y. Xu, M. Yu, and D. Yu, "A comprehensive study of speech separation: spectrogram vs waveform separation," in Interspeech, 2019.
A method of estimating the RTF from a time-frequency mask is described, for example, in Non-Patent Document 1, and a method of estimating a time-frequency mask from a reference sound of the target sound is described, for example, in Reference 3.
Reference 3: K. Zmolikova, M. Delcroix, K. Kinoshita, T. Ochiai, T. Nakatani, L. Burget, and J. Cernocky, "SpeakerBeam: Speaker aware neural network for target speaker extraction in speech mixtures," IEEE Journal of Selected Topics in Signal Processing, vol. 13, no. 4, pp. 800-814, 2019.
The auxiliary information s may further include information specifying the power of the target sound. A method of estimating the power of the target sound is described, for example, in Reference 3B.
Reference 3B: Yong Xu, Jun Du, Li-Rong Dai, and Chin-Hui Lee, "A Regression Approach to Speech Enhancement Based on Deep Neural Networks," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 23, no. 1, pp. 7-19, 2015.
The acoustic signal x_{t,f} is input to the convolutional beamformer estimation unit 11 and the convolutional beamformer application unit 12, and the auxiliary information s is input to the convolutional beamformer estimation unit 11.
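For illustration, two of the auxiliary-information forms mentioned above can be sketched as follows in NumPy: normalizing a steering vector into an RTF as in equation (3), and one simple way of forming a binary time-frequency mask from per-source power estimates. The dominance rule used for the mask is an illustrative assumption, not the method of References 2 and 3.

```python
import numpy as np

def rtf_from_steering(v):
    """RTF per Eq. (3): normalize the steering vector v (shape (M,)) by its first element."""
    return v / v[0]

def binary_mask(power_target, power_interference):
    """gamma_{t,f}^{(i)} = 1 where source i dominates, 0 otherwise (one simple choice)."""
    return (power_target > power_interference).astype(float)
```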
<<Processing of the convolutional beamformer estimation unit 11 (step S11)>>
The convolutional beamformer estimation unit 11 receives the acoustic signal x_{t,f} and the auxiliary information s, and estimates a convolutional beamformer that performs reverberation suppression, diffuse noise suppression, and target sound source separation, based on the optimization criterion that the signal y_{t,f} obtained by applying the convolutional beamformer to the acoustic signal x_{t,f} follows a probabilistic model. Here, y_{t,f} = [y_{t,f}^{(1)},…,y_{t,f}^{(I)}]^T ∈ C^{I×1} is an I-dimensional column vector, and y_{t,f}^{(i)} for i ∈ {1,…,I} is an estimate of the target sound d_{t,f}^{(i)}. The convolutional beamformer is expressed, for example, as
[Math. 4]
  y_{t,f} = Σ_{τ∈{0,Δ,Δ+1,…,L-1}} W_{τ,f}^H x_{t-τ,f}    (4)
where W_{τ,f} ∈ C^{M×I} for τ ∈ {0,Δ,Δ+1,…,L-1} is an M×I matrix whose elements are beamformer coefficients, and (·)^H denotes the conjugate transpose. Δ is a positive integer representing the number of time intervals (frames) corresponding to the length of the early reflections; at least Δ ≥ 1, and an example of Δ is a positive integer representing a time interval of 30 to 50 ms. Through Δ, a convolutional beamformer that suppresses the late reverberation rather than the early reflections is realized.
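A direct implementation of a convolutional beamformer with taps τ ∈ {0, Δ, Δ+1, …, L−1} might look as follows. This is a minimal NumPy sketch for a single frequency bin, written from the text above (the formula image [Math. 4] is not reproduced in this text), and the loop structure is an illustrative choice rather than an efficient implementation.

```python
import numpy as np

def apply_convolutional_beamformer(x_f, W_f):
    """x_f: (N, M) complex array for one frequency bin f.
    W_f: dict mapping tap index tau in {0, delta, ..., L-1} to an (M, I) matrix.
    Returns y_f: (N, I), with y_{t,f} = sum_tau W_{tau,f}^H x_{t-tau,f}."""
    N, _ = x_f.shape
    I = next(iter(W_f.values())).shape[1]
    y = np.zeros((N, I), dtype=complex)
    for tau, W in W_f.items():
        for t in range(N):
            if t - tau >= 0:  # frames before the start of the signal are treated as zero
                y[t] += W.conj().T @ x_f[t - tau]
    return y
```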
Here, the following is assumed to hold:
[Math. 5]
where the vector
[Math. 6]
  w̄_f^{(i)} = [w_{0,f}^{(i)T}, w_{Δ,f}^{(i)T}, w_{Δ+1,f}^{(i)T}, …, w_{L-1,f}^{(i)T}]^T
is used to extract the estimate y_{t,f}^{(i)} of the target sound d_{t,f}^{(i)}, with w_{τ,f}^{(i)} denoting the i-th column of W_{τ,f}. Using equations (5) and (6), equation (4) can also be rewritten as
[Math. 7]
  y_{t,f}^{(i)} = w̄_f^{(i)H} x̄_{t,f},  where x̄_{t,f} = [x_{t,f}^T, x_{t-Δ,f}^T, x_{t-Δ-1,f}^T, …, x_{t-L+1,f}^T]^T    (7)
Further, for Q_f ∈ C^{M×I}, Ḡ_f ∈ C^{M(L-Δ)×M}, and
[Math. 8]
assume that W_{0,f} = Q_f and that the following equation (8) is satisfied:
[Math. 9]
  [W_{Δ,f}^T, W_{Δ+1,f}^T, …, W_{L-1,f}^T]^T = -Ḡ_f Q_f    (8)
When M ≥ I and rank{Q_f} = I, the existence of a Ḡ_f satisfying equation (8) is guaranteed. Using these, equation (4) can also be rewritten as the following equations (9) and (10):
[Math. 10]
  z_{t,f} = x_{t,f} - Ḡ_f^H x̄_{t-Δ,f}    (9)
[Math. 11]
  y_{t,f} = Q_f^H z_{t,f}    (10)
where the following is satisfied:
[Math. 12]
  x̄_{t-Δ,f} = [x_{t-Δ,f}^T, x_{t-Δ-1,f}^T, …, x_{t-L+1,f}^T]^T
Similarly, for q_f^{(i)} ∈ C^{M×1} and
[Math. 13]
  Ḡ_f^{(i)} ∈ C^{M(L-Δ)×M},
[Math. 14]
  x̄_{t-Δ,f} = [x_{t-Δ,f}^T, …, x_{t-L+1,f}^T]^T,
assume that, for each i ∈ {1,…,I}, w_{0,f}^{(i)} = q_f^{(i)} and the following are satisfied:
[Math. 15]
  [w_{Δ,f}^{(i)T}, w_{Δ+1,f}^{(i)T}, …, w_{L-1,f}^{(i)T}]^T = -Ḡ_f^{(i)} q_f^{(i)}    (11)
Then equation (7) can also be rewritten as the following equations (12) and (13):
[Math. 16]
  z_{t,f}^{(i)} = x_{t,f} - Ḡ_f^{(i)H} x̄_{t-Δ,f}    (12)
[Math. 17]
  y_{t,f}^{(i)} = q_f^{(i)H} z_{t,f}^{(i)}    (13)
Equation (9) can also be transformed into the following equation (9'), in which the dereverberation filter Ḡ_f is expressed by a single stacked vector:
[Math. 18]    (9')
where
[Math. 19]
[Math. 20]
hold, I_M ∈ R^{M×M} denotes the M×M identity matrix, R denotes the set of real numbers, the operator shown in [Math. 21] denotes the Kronecker product, and, for m ∈ {1,…,M}, the vector ḡ_{m,f} shown in [Math. 22] is the m-th column vector of Ḡ_f. Equation (10) can likewise be transformed into the following equation (10'):
[Math. 23]
  y_{t,f}^{(i)} = q_f^{(i)H} z_{t,f}    (10')
The convolutional beamformer estimation unit 11 estimates the convolutional beamformer based on the optimization criterion that y_{t,f} follows a probabilistic model. As this probabilistic model, a model satisfying the following (a) and (b) can be given as an example.
(a) For each i ∈ {1,…,I}, y_{t,f}^{(i)} follows a zero-mean complex Gaussian distribution with time-varying variance
[Math. 24]
  λ_{t,f}^{(i)} = E{|y_{t,f}^{(i)}|^2}    (14)
where E{·} denotes the expectation. Hereinafter, λ_{t,f}^{(i)} is called the power of y_{t,f}^{(i)}.
(b) For each i ∈ {1,…,I}, the convolutional beamformer does not distort the sound arriving at the microphones from sound source i. This constraint can be written, for example, as equation (15) or (16) below:
[Math. 25]    (15)
[Math. 26]    (16)
The optimization criterion for the coefficients θ_f^{(i)}, shown in
[Math. 27]
of the convolutional beamformer of equation (7) corresponding to sound source i, determined according to this probabilistic model, is to minimize the cost function L_{i,f}(θ_f^{(i)}) of the following equation (18) under the constraint of equation (15) or (16):
[Math. 28]
  L_{i,f}(θ_f^{(i)}) = Σ_{t=1}^{N} ( |y_{t,f}^{(i)}|^2 / λ_{t,f}^{(i)} + log λ_{t,f}^{(i)} )    (18)
The optimization criterion for the coefficients Θ_f of the convolutional beamformers for all sound sources i = 1,…,I, shown in
[Math. 29]
is to minimize the cost function L_f(Θ_f) of the following equation (20) under the constraint that equation (15) or (16) is satisfied for all sound sources i = 1,…,I:
[Math. 30]
  L_f(Θ_f) = Σ_{i=1}^{I} L_{i,f}(θ_f^{(i)})    (20)
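Assuming the standard negative log-likelihood of the zero-mean complex Gaussian model (a) (the formula images [Math. 28] and [Math. 30] are not reproduced in this text, so the exact expressions are reconstructions), the cost functions can be sketched as follows.

```python
import numpy as np

def source_cost(y_if, lam_if, eps=1e-12):
    """L_{i,f}: y_if, lam_if are length-N arrays of y_{t,f}^{(i)} and lambda_{t,f}^{(i)}.
    Negative log-likelihood of a zero-mean complex Gaussian with time-varying variance."""
    lam = np.maximum(lam_if, eps)
    return float(np.sum(np.abs(y_if) ** 2 / lam + np.log(lam)))

def total_cost(y_f, lam_f):
    """L_f(Theta_f) = sum over sources i of L_{i,f}; y_f and lam_f have shape (N, I)."""
    return sum(source_cost(y_f[:, i], lam_f[:, i]) for i in range(y_f.shape[1]))
```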
That is, the convolutional beamformer estimation unit 11 of this embodiment estimates the convolutional beamformer that minimizes the cost function L_f(Θ_f) of equation (20), and outputs information specifying the estimated convolutional beamformer (equations (4), (7), (9)(10), (9')(10'), (12)(13)). As can be seen from equations (14) and (18), this estimation requires information specifying the power λ_{t,f}^{(i)} = E{|y_{t,f}^{(i)}|^2} of y_{t,f}^{(i)} for each i ∈ {1,…,I}. When the auxiliary information s includes information specifying the power of the target sound, the power of the target sound specified by the auxiliary information s may be used as λ_{t,f}^{(i)}. When the auxiliary information s does not include such information, λ_{t,f}^{(i)} = |y_{t,f}^{(i)}|^2 is obtained from the y_{t,f}^{(i)} produced by the convolutional beamformer application unit 12, as described below. However, since the y_{t,f}^{(i)} produced by the convolutional beamformer application unit 12 depends on the convolutional beamformer estimated by the convolutional beamformer estimation unit 11, the processes of the convolutional beamformer estimation unit 11 and the convolutional beamformer application unit 12 must be repeated alternately until a predetermined convergence condition is satisfied.
<<Processing of the convolutional beamformer application unit 12 (step S12)>>
The acoustic signal x_{t,f} and the information specifying the convolutional beamformer output from the convolutional beamformer estimation unit 11 are input to the convolutional beamformer application unit 12. The convolutional beamformer application unit 12 applies the convolutional beamformer specified by this information (equations (4), (7), (9)(10), (9')(10'), (12)(13)) to the acoustic signal x_{t,f} = [x_{1,t,f},…,x_{M,t,f}]^T to obtain and output the processed signal y_{t,f} = [y_{t,f}^{(1)},…,y_{t,f}^{(I)}]^T.
As described above, when the auxiliary information s includes information specifying the power of the target sound, the signal processing device 1 outputs the processed signal y_{t,f} obtained in step S12; in this case, the iteration of steps S11 and S12 is unnecessary. On the other hand, when the auxiliary information s does not include such information, the process of step S11 and the process of step S12 are repeated alternately until a predetermined convergence condition is satisfied. Examples of the convergence condition are the condition that the number of iterations has reached a predetermined number and the condition that the change in the coefficients of the convolutional beamformer before and after an iteration is at most a predetermined amount. The signal processing device 1 outputs the processed signal y_{t,f} obtained in step S12 when the convergence condition is satisfied. In either case, the processed signal y_{t,f} is the result of applying reverberation suppression, diffuse noise suppression, and target sound source separation to the acoustic signal x_{t,f}. The output processed signal y_{t,f} may be used as the input of another computation, or may be converted into a time-domain acoustic signal by a well-known method such as the inverse Fourier transform.
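The alternation between step S11 and step S12 when the auxiliary information s does not specify the target-sound power can be sketched as follows; estimate_cbf and apply_cbf are hypothetical stand-ins for the computations of units 11 and 12, and the tolerance-based stopping rule is one of the example convergence conditions above.

```python
import numpy as np

def run_until_convergence(x, s, estimate_cbf, apply_cbf, max_iters=10, tol=1e-4):
    """x: observed time-frequency array; s: auxiliary information.
    estimate_cbf(x, s, lam) -> filter parameters; apply_cbf(params, x) -> y.
    lambda_{t,f}^{(i)} is re-estimated as |y|^2 after each application."""
    lam = None  # unit 11 falls back to an internal initialization on the first pass
    y = None
    for _ in range(max_iters):
        params = estimate_cbf(x, s, lam)     # step S11
        y_new = apply_cbf(params, x)         # step S12
        if y is not None and np.max(np.abs(y_new - y)) < tol:
            return y_new                     # convergence condition satisfied
        y = y_new
        lam = np.abs(y) ** 2                 # power update for the next iteration
    return y
```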
<Features of this embodiment>
In this embodiment, using the auxiliary information s representing the target sound, the convolutional beamformer that performs reverberation suppression, diffuse noise suppression, and target sound source separation is estimated based on the optimization criterion that the signal obtained by applying it to the acoustic signal follows a probabilistic model. The convolutional beamformer can thereby be optimized as a whole, and more effective speech enhancement can be realized.
[Second Embodiment]
Next, a second embodiment of the present invention will be described. In this embodiment, the convolutional beamformer is handled as split into a reverberation suppression filter that performs reverberation suppression and a beamformer that performs diffuse noise suppression and target sound source separation. In other words, the convolutional beamformer of this embodiment includes a reverberation suppression filter that performs reverberation suppression and a beamformer that performs diffuse noise suppression and target sound source separation. The optimization of the reverberation suppression filter and that of the beamformer are not independent of each other, however; the convolutional beamformer is optimized as a whole. Examples of the reverberation suppression filter are equations (9), (9'), and (12), and examples of the beamformer are equations (10), (10'), and (13). In this embodiment, as an example, equation (9') is used as the reverberation suppression filter and equation (10') is used as the beamformer. Power-weighted spatio-temporal covariance matrices of the target sounds are used to optimize the reverberation suppression filter. Since the power-weighted spatio-temporal covariance matrices of the target sounds are small in size, the reverberation suppression filter can be optimized with a small amount of computation. The following description focuses on the differences from the matters described so far, and simplifies the description of matters already described by citing the same reference numerals.
<Functional configuration>
As illustrated in FIG. 2, the signal processing device 2 of this embodiment has a spatio-temporal covariance estimation unit 211, a reverberation suppression filter estimation unit 212, a beamformer estimation unit 213, a reverberation suppression filter application unit 221, a beamformer application unit 222, and a control unit 13, and executes each process under the control of the control unit 13. Here, the spatio-temporal covariance estimation unit 211, the reverberation suppression filter estimation unit 212, and the beamformer estimation unit 213 constitute a convolutional beamformer estimation unit, and the reverberation suppression filter application unit 221 and the beamformer application unit 222 constitute a convolutional beamformer application unit.
<Processing>
The processing of this embodiment will be described with reference to FIG. 2.
The auxiliary information s is input to the beamformer estimation unit 213, and the acoustic signal x_{t,f} is input to the spatio-temporal covariance estimation unit 211 and the reverberation suppression filter application unit 221. The auxiliary information s of this embodiment is information for specifying or estimating the RTF v~_f^{(i)}, and does not include information specifying the power of the target sound. The spatio-temporal covariance estimation unit 211 obtains and outputs the power-weighted spatio-temporal covariance matrices of the target sounds,
[Math. 31]
  R̄_{x,f}^{(i)}
and
[Math. 32]
  P_{x,f}^{(i)}.
The reverberation suppression filter estimation unit 212 receives the power-weighted spatio-temporal covariance matrices R̄_{x,f}^{(i)} and P_{x,f}^{(i)} of the target sounds and the information q_f^{(i)} representing the beamformer, and estimates the reverberation suppression filter based on the optimization criterion described above. The reverberation suppression filter application unit 221 applies the reverberation suppression filter estimated by the reverberation suppression filter estimation unit 212 to the acoustic signal x_{t,f} to obtain and output the reverberation-suppressed signal z_{t,f} (equation (9')). The beamformer estimation unit 213 receives the reverberation-suppressed signal z_{t,f} and the auxiliary information s, and estimates the beamformer based on the optimization criterion described above. The beamformer application unit 222 applies the beamformer estimated by the beamformer estimation unit 213 to the reverberation-suppressed signal z_{t,f} to obtain and output the processed signals y_{t,f}^{(i)}. In this embodiment, the processing of the spatio-temporal covariance estimation unit 211, the reverberation suppression filter estimation unit 212, and the beamformer estimation unit 213 included in the convolutional beamformer estimation unit and the processing of the reverberation suppression filter application unit 221 and the beamformer application unit 222 included in the convolutional beamformer application unit are repeated alternately until a predetermined convergence condition is satisfied. The signal processing device 2 outputs the processed signal y_{t,f} obtained by the beamformer application unit 222 when the convergence condition is satisfied.
The processing of this embodiment will now be described in detail with reference to FIGS. 3 to 6.
The auxiliary information s is input to the beamformer estimation unit 213 (step S213a). In this embodiment, the time-frequency masks γ_{t,f}^{(i)} are input as the auxiliary information s; however, this does not limit the present invention. The acoustic signal x_{t,f} is input to the spatio-temporal covariance estimation unit 211 and the reverberation suppression filter application unit 221 (step S221a).
The spatio-temporal covariance estimation unit 211 initializes λ_{t,f}^{(i)} for all i ∈ {1,…,I}, t ∈ {1,…,N}, f ∈ {1,…,F}. For example, the spatio-temporal covariance estimation unit 211 initializes the powers λ_{t,f}^{(i)} of the target sounds as follows:
[Math. 33]
where the notation shown in
[Math. 34]
represents α^H β α. Further, α ← β denotes substituting β into α; in other words, α ← β denotes setting α to β (step S211a).
The beamformer estimation unit 213 initializes q_f^{(i)} for all i ∈ {1,…,I}, f ∈ {1,…,F}. For example, the beamformer estimation unit 213 sets q_f^{(i)} to the i-th column of I_M (step S213b).
The reverberation suppression filter application unit 221 initializes z_{t,f} for all t ∈ {1,…,N}, f ∈ {1,…,F}. For example, the reverberation suppression filter application unit 221 sets z_{t,f} ← x_{t,f} (step S221b).
If the processed signal y_{t,f} has not yet been obtained, the processed signal y_{t,f} is not input to the spatio-temporal covariance estimation unit 211. If the processed signal y_{t,f} has been obtained, it is additionally input to the spatio-temporal covariance estimation unit 211. The spatio-temporal covariance estimation unit 211 computes and outputs, for all i ∈ {1,…,I}, f ∈ {1,…,F}, the power-weighted spatio-temporal covariance matrix of the target sound
[Math. 35]
  R̄_{x,f}^{(i)}.
If the processed signal y_{t,f} has never been obtained, the λ_{t,f}^{(i)} obtained in step S211a is used for this computation; if the processed signal y_{t,f} has already been obtained, the λ_{t,f}^{(i)} obtained in step S211d is used (step S211b). The spatio-temporal covariance estimation unit 211 further computes and outputs, for all i ∈ {1,…,I}, f ∈ {1,…,F}, the power-weighted spatio-temporal covariance matrix of the target sound
[Math. 36]
  P_{x,f}^{(i)}.
Again, if the processed signal y_{t,f} has never been obtained, the λ_{t,f}^{(i)} obtained in step S211a is used for this computation; if the processed signal y_{t,f} has already been obtained, the λ_{t,f}^{(i)} obtained in step S211d is used (step S211c).
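The formula images [Math. 35] and [Math. 36] are not reproduced in this text; assuming the WPE-style power-weighted correlation matrices commonly used for this purpose, steps S211b and S211c can be sketched as follows for one source i and one frequency bin.

```python
import numpy as np

def weighted_spatiotemporal_covariances(x_f, lam_if, delta, L, eps=1e-12):
    """x_f: (N, M) observations; lam_if: (N,) powers lambda_{t,f}^{(i)}.
    Assumed forms: Rbar = sum_t xbar_t xbar_t^H / lam_t and P = sum_t xbar_t x_t^H / lam_t,
    with xbar_t = [x_{t-delta}; ...; x_{t-L+1}] (zero-padded at the start)."""
    N, M = x_f.shape
    D = M * (L - delta)
    Rbar = np.zeros((D, D), dtype=complex)
    P = np.zeros((D, M), dtype=complex)
    for t in range(N):
        xbar = np.concatenate([x_f[t - tau] if t - tau >= 0 else np.zeros(M, dtype=complex)
                               for tau in range(delta, L)])
        w = 1.0 / max(float(lam_if[t]), eps)
        Rbar += w * np.outer(xbar, xbar.conj())
        P += w * np.outer(xbar, x_f[t].conj())
    return Rbar, P
```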
The acoustic signal x_{t,f}, the power-weighted spatio-temporal covariance matrices R̄_{x,f}^{(i)} and P_{x,f}^{(i)} of the target sounds obtained by the spatio-temporal covariance estimation unit 211, and the information q_f^{(i)} representing the beamformer obtained by the beamformer estimation unit 213 are input to the reverberation suppression filter estimation unit 212. The reverberation suppression filter estimation unit 212 receives these and estimates the reverberation suppression filter (equation (9')) based on the optimization criterion described above. First, the reverberation suppression filter estimation unit 212 computes the matrix Ψ_f shown in
[Math. 37]
(step S212a). Next, the reverberation suppression filter estimation unit 212 computes the matrix φ_f shown in
[Math. 38]
where (·)* denotes the complex conjugate of (·) (step S212b). Further, the reverberation suppression filter estimation unit 212 computes and outputs the information specifying the reverberation suppression filter,
[Math. 39]
  ḡ_f = Ψ_f^+ φ_f,
where (·)^+ denotes the Moore-Penrose pseudo-inverse of (·) (step S212c).
The ḡ_f obtained by the reverberation suppression filter estimation unit 212 is input to the reverberation suppression filter application unit 221. The reverberation suppression filter application unit 221 applies the reverberation suppression filter estimated by the reverberation suppression filter estimation unit 212 to the acoustic signal x_{t,f} as in equation (9') to obtain and output the reverberation-suppressed signal z_{t,f}:
[Math. 40]
The reverberation-suppressed signal z_{t,f} is sent to the beamformer estimation unit 213 and the beamformer application unit 222 (step S221c).
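For illustration, a simplified instantiation of steps S212c and S221c follows, in which the dereverberation filter is solved directly from the matrices of the previous sketch by a Moore-Penrose pseudo-inverse and then applied as in equation (9'); the omission of the beamformer-dependent weighting of [Math. 37] and [Math. 38] is a simplifying assumption.

```python
import numpy as np

def estimate_and_apply_dereverb(x_f, Rbar, P, delta, L):
    """Solve the filter with a pseudo-inverse (as in step S212c), then compute
    z_{t,f} = x_{t,f} - G^H xbar_{t-delta,f} (as in step S221c) for one bin."""
    G = np.linalg.pinv(Rbar) @ P            # (D, M) filter, D = M*(L-delta)
    N, M = x_f.shape
    z = np.empty_like(np.asarray(x_f, dtype=complex))
    for t in range(N):
        xbar = np.concatenate([x_f[t - tau] if t - tau >= 0 else np.zeros(M, dtype=complex)
                               for tau in range(delta, L)])
        z[t] = x_f[t] - G.conj().T @ xbar
    return z
```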
If the processed signal y_{t,f} has never been obtained, the reverberation-suppressed signal z_{t,f}, the auxiliary information s = γ_{t,f}^{(i)}, and the λ_{t,f}^{(i)} obtained in step S211a are input to the beamformer estimation unit 213. If the processed signal y_{t,f} has already been obtained, the reverberation-suppressed signal z_{t,f}, the auxiliary information s = γ_{t,f}^{(i)}, and the processed signal y_{t,f} are input to the beamformer estimation unit 213. The beamformer estimation unit 213 receives these and estimates the beamformer based on the optimization criterion described above.
The beamformer estimation unit 213 obtains the RTF v~_f^{(i)} based on z_{t,f} and γ_{t,f}^{(i)}. As illustrated in FIG. 6, the steering vector estimation unit 2131 of the beamformer estimation unit 213 first estimates and outputs the steering vector v_f^{(i)} based on z_{t,f} and γ_{t,f}^{(i)}. For example, the steering vector v_f^{(i)} is estimated as follows:
[Math. 41]
[Math. 42]
[Math. 43]
The RTF estimation unit 2132 of the beamformer estimation unit 213 then obtains v~_f^{(i)} from v_f^{(i)}; for example, the RTF estimation unit 2132 obtains v~_f^{(i)} according to equation (3) (step S213c). A sketch of one common instantiation follows.
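The formula images [Math. 41] to [Math. 43] are not reproduced in this text; a common mask-based instantiation of the steering vector estimation unit 2131 and the RTF estimation unit 2132, given here as an illustrative assumption, takes the principal eigenvector of the mask-weighted spatial covariance of z_{t,f} and then normalizes it as in equation (3).

```python
import numpy as np

def estimate_rtf(z_f, gamma_if, eps=1e-12):
    """z_f: (N, M) dereverberated signals for one bin; gamma_if: (N,) mask of source i.
    Steering vector: principal eigenvector of the mask-weighted spatial covariance;
    RTF: that eigenvector normalized by its first element, as in Eq. (3)."""
    w = gamma_if / max(float(gamma_if.sum()), eps)
    R = z_f.T @ (w[:, None] * z_f.conj())    # sum_t w_t z_t z_t^H, shape (M, M)
    _, vecs = np.linalg.eigh(R)              # Hermitian eigendecomposition
    v = vecs[:, -1]                          # eigenvector of the largest eigenvalue
    return v / v[0]
```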
The beamformer estimation unit 213 also computes, for all i ∈ {1,…,I}, f ∈ {1,…,F}, the quantity shown in
[Math. 44]
If the processed signal y_{t,f} has never been obtained, the λ_{t,f}^{(i)} obtained in step S211a is used for this computation; if the processed signal y_{t,f} has already been obtained, the λ_{t,f}^{(i)} obtained in step S211d is used (step S213d).
Further, the beamformer estimation unit 213 computes and outputs, for all i ∈ {1,…,I}, f ∈ {1,…,F}, the information q_f^{(i)} specifying the beamformer, shown in
[Math. 45]
(step S213e).
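The formula images [Math. 44] and [Math. 45] are likewise not reproduced; one instantiation consistent with the distortionless constraint of the probabilistic model, given here as an illustrative assumption, is a power-weighted MPDR beamformer computed from the weighted spatial covariance of z_{t,f} (step S213d) and the RTF (step S213e).

```python
import numpy as np

def wmpdr_beamformer(z_f, rtf, lam_if, eps=1e-12):
    """q_f^{(i)} = R^{-1} v / (v^H R^{-1} v), with R = sum_t z_t z_t^H / lambda_t.
    The result satisfies the distortionless constraint q^H v = 1."""
    w = 1.0 / np.maximum(lam_if, eps)
    R = z_f.T @ (w[:, None] * z_f.conj())    # power-weighted spatial covariance
    Rinv_v = np.linalg.solve(R, rtf)
    return Rinv_v / (rtf.conj() @ Rinv_v)
```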
The reverberation-suppressed signal z_{t,f} and the information q_f^{(i)} specifying the beamformer are input to the beamformer application unit 222. The beamformer application unit 222 applies the beamformer to the reverberation-suppressed signal z_{t,f} as in equation (10') to obtain and output the processed signals y_{t,f}^{(i)}:
[Math. 46]
  y_{t,f}^{(i)} = q_f^{(i)H} z_{t,f}
This processing is performed for all i ∈ {1,…,I} and f ∈ {1,…,F}, and the beamformer application unit 222 obtains y_{t,f} = [y_{t,f}^{(1)},…,y_{t,f}^{(I)}]^T (step S222a).
The control unit 13 determines whether the convergence condition described above is satisfied (step S13). If the convergence condition is not satisfied, the spatio-temporal covariance estimation unit 211 and the beamformer estimation unit 213 compute, using the input y_{t,f}^{(i)},
[Math. 47]
  λ_{t,f}^{(i)} ← |y_{t,f}^{(i)}|^2
(step S211d), and the processing returns to step S211b. The processing of the spatio-temporal covariance estimation unit 211, the reverberation suppression filter estimation unit 212, the reverberation suppression filter application unit 221, the beamformer estimation unit 213, and the beamformer application unit 222 is thereby repeated, and each value is updated by this repetition. If the convergence condition is satisfied, the beamformer application unit 222 outputs y_{t,f} = [y_{t,f}^{(1)},…,y_{t,f}^{(I)}]^T (step S222b).
<Features of this embodiment>
In this embodiment, using the auxiliary information s representing the target sound, the convolutional beamformer that performs reverberation suppression, diffuse noise suppression, and target sound source separation is estimated based on the optimization criterion that the signal obtained by applying it to the acoustic signal follows a probabilistic model. The convolutional beamformer can thereby be optimized as a whole, and more effective speech enhancement can be realized. Furthermore, in this embodiment the convolutional beamformer is split into a reverberation suppression filter and a beamformer, and the beamformer is estimated using the reverberation-suppressed signal obtained at an intermediate stage of the estimation, which realizes more effective speech enhancement. In addition, most of the computation required to estimate the reverberation suppression filter is the computation of the power-weighted spatio-temporal covariance matrices R̄_{x,f}^{(i)} and P_{x,f}^{(i)} of the target sounds. The sizes of the matrices R̄_{x,f}^{(i)} and P_{x,f}^{(i)} obtained in steps S211b and S211c are smaller than the sizes of the matrices Ψ_f and φ_f obtained in steps S212a and S212b. Therefore, this embodiment can greatly reduce the amount of computation required to estimate the reverberation suppression filter, and can realize speech enhancement at a low computation cost.
[Modification 1 of the second embodiment]
In the second embodiment, the reverberation suppression filter estimation unit 212 estimates the reverberation suppression filter (equation (9')) with the beamformer fixed, the beamformer estimation unit 213 estimates the beamformer (equation (10')) with the reverberation suppression filter fixed, and this processing is repeated. In this processing, the beamformer is applied to the reverberation-suppressed signal to obtain the I-dimensional processed signal y_{t,f} = [y_{t,f}^{(1)},…,y_{t,f}^{(I)}]^T, and the I-dimensional processed signal y_{t,f} is used to estimate the next reverberation suppression filter. However, since I ≤ M, the I-dimensional processed signal y_{t,f} is more compressed than the M-dimensional acoustic signal x_{t,f}, and information is lost. Due to this loss of information, the reverberation suppression filter and the beamformer may become quasi-optimal solutions rather than optimal solutions. To solve this problem, in this modification, in addition to the power-weighted spatio-temporal covariance matrices R̄_{x,f}^{(i)} and P_{x,f}^{(i)} of the target sounds corresponding to y_{t,f}^{(1)},…,y_{t,f}^{(I)} for i = 1,…,I, power-weighted spatio-temporal covariance matrices of the non-target sounds corresponding to i = I+1,…,M are also computed and used to estimate the reverberation suppression filter.
<Functional configuration>
As illustrated in FIG. 2, the signal processing device 2' of this modification has a spatio-temporal covariance estimation unit 211', a reverberation suppression filter estimation unit 212', a beamformer estimation unit 213, a reverberation suppression filter application unit 221, a beamformer application unit 222, and a control unit 13, and executes each process under the control of the control unit 13. Here, the spatio-temporal covariance estimation unit 211', the reverberation suppression filter estimation unit 212', and the beamformer estimation unit 213 constitute a convolutional beamformer estimation unit, and the reverberation suppression filter application unit 221 and the beamformer application unit 222 constitute a convolutional beamformer application unit.
<Processing>
The processing of this modification will be described with reference to FIG. 2.
In this modification, the spatio-temporal covariance estimation unit 211' replaces the spatio-temporal covariance estimation unit 211 and generates, in addition to the power-weighted spatio-temporal covariance matrices R̄_{x,f}^{(i)} and P_{x,f}^{(i)} of the target sounds, power-weighted spatio-temporal covariance matrices of the non-target sounds. Further, the reverberation suppression filter estimation unit 212' replaces the reverberation suppression filter estimation unit 212 and receives, in addition to the power-weighted spatio-temporal covariance matrices R̄_{x,f}^{(i)} and P_{x,f}^{(i)} of the target sounds and the information q_f^{(i)} representing the beamformers corresponding to 1 ≤ i ≤ I for estimating the target sounds, the power-weighted spatio-temporal covariance matrices of the non-target sounds and the information q_f^{(i)} representing the beamformers corresponding to I < i ≤ M for estimating the non-target sounds, and estimates the reverberation suppression filter based on the optimization criterion described above. The beamformer estimation unit 213' replaces the beamformer estimation unit 213 and generates, in addition to the information q_f^{(i)} representing the beamformers corresponding to 1 ≤ i ≤ I for estimating the target sounds, the information q_f^{(i)} representing the beamformers corresponding to I < i ≤ M for estimating the non-target sounds. The beamformer application unit 222' replaces the beamformer application unit 222 and generates, in addition to the target-sound estimates y_{t,f}^{(1)},…,y_{t,f}^{(I)}, the non-target-sound estimate y_{t,f}^⊥. The rest is the same as in the second embodiment.
The processing of this modification will now be described in detail with reference to FIGS. 3 to 5.
First, the signal processing device 2', in place of the signal processing device 2, executes the processes of steps S213a, S221a, S211a, S213b, S221b, S211b, S211c, S212a, and S212b shown in FIG. 3. However, the processing of the spatio-temporal covariance estimation unit 211 is executed by the spatio-temporal covariance estimation unit 211' in its place. In step S211a, the spatio-temporal covariance estimation unit 211' initializes, in addition to the powers λ_{t,f}^{(1)},…,λ_{t,f}^{(I)} of the target sounds, the power λ_{t,f}^⊥ of the non-target sound in the same manner as the powers of the target sounds. The processing of the beamformer estimation unit 213 is executed by the beamformer estimation unit 213' in its place. In step S213b, the beamformer estimation unit 213' initializes q_f^{(i)} for all i ∈ {1,…,M}, f ∈ {1,…,F}; for example, q_f^{(i)} is set to the i-th column of I_M.
If y_{t,f}^{(i)} has never been obtained, the spatio-temporal covariance estimation unit 211' uses the λ_{t,f}^{(1)},…,λ_{t,f}^{(I)} and λ_{t,f}^⊥ obtained in step S211a. If y_{t,f}^{(i)} has been obtained, y_{t,f}^{(1)},…,y_{t,f}^{(I)} and y_{t,f}^⊥ are input to the spatio-temporal covariance estimation unit 211', so that λ_{t,f}^{(1)},…,λ_{t,f}^{(I)} and λ_{t,f}^⊥ can be obtained in step S211d.
Next, the spatio-temporal covariance estimation unit 211' computes and outputs, using x_{t,f} and λ_{t,f}^⊥, the power-weighted spatio-temporal covariance matrix of the non-target sound
[Math. 48]
  R̄_{x,f}^⊥
(step S211b'). Further, the spatio-temporal covariance estimation unit 211' computes and outputs, using x_{t,f} and λ_{t,f}^⊥, the power-weighted spatio-temporal covariance matrix of the non-target sound
[Math. 49]
  P_{x,f}^⊥
(step S211c').
The reverberation suppression filter estimation unit 212' receives R̄_{x,f}^⊥ and q_f^{(i)} and computes the quantity shown in
[Math. 50]
(step S212a'). Further, the reverberation suppression filter estimation unit 212' receives P_{x,f}^⊥ and q_f^{(i)} and computes the quantity shown in
[Math. 51]
(step S212b').
After that, the signal processing device 2', in place of the signal processing device 2, executes the processes of steps S212c, S221c, S213c, S213d, S213e, S222a, S13, S211d, and S222b shown in FIG. 5. However, the processing of the reverberation suppression filter estimation unit 212 described in the second embodiment is executed by the reverberation suppression filter estimation unit 212' in its place, the processing of the beamformer estimation unit 213 is executed by the beamformer estimation unit 213' in its place, and the processing of the beamformer application unit 222 is executed by the beamformer application unit 222' in its place.
In step S213e, in addition to estimating q_f^{(i)} for i ∈ {1,…,I}, f ∈ {1,…,F}, the beamformer estimation unit 213' also generates q_f^{(i)} for i ∈ {I+1,…,M}, f ∈ {1,…,F}. For example, for each f, the q_f^{(i)} for i ∈ {I+1,…,M} are generated as vectors spanning the complementary space of the linear space spanned by the q_f^{(i)} for i ∈ {1,…,I}. As the vectors spanning the complementary space, for example, an orthonormal basis of that complementary space may be adopted, though other choices are possible; a sketch of one such construction follows.
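A minimal NumPy sketch of this construction: given the target beamformers q_f^{(1)},…,q_f^{(I)} as the columns of a matrix, an orthonormal basis of the orthogonal complement can be taken from the singular value decomposition.

```python
import numpy as np

def complement_beamformers(Q_f):
    """Q_f: (M, I) matrix whose columns are the target beamformers q_f^{(i)}.
    Returns an (M, M-I) matrix whose columns form an orthonormal basis of the
    orthogonal complement of span{q_f^{(1)}, ..., q_f^{(I)}}."""
    M, I = Q_f.shape
    U, _, _ = np.linalg.svd(Q_f, full_matrices=True)
    return U[:, I:]  # columns orthogonal to the range of Q_f
```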
<Features of this modification>
In this modification, not only the power-weighted spatio-temporal covariance matrices of the target sounds but also the power-weighted spatio-temporal covariance matrices of the non-target sounds are computed and used to estimate the reverberation suppression filter, so the estimation accuracy of the reverberation suppression filter improves.
[Modification 2 of the second embodiment]
In modification 2 of the second embodiment, only the processed signal y_{t,f}^{(i)} corresponding to one of the source signals i is obtained and output. The reverberation suppression filter of this modification is that of equation (12), and the beamformer is that of equation (10').
<Functional configuration>
As illustrated in FIG. 2, the signal processing device 2" of this modification has a spatio-temporal covariance estimation unit 211, a reverberation suppression filter estimation unit 212", a beamformer estimation unit 213, a reverberation suppression filter application unit 221, a beamformer application unit 222, and a control unit 13, and executes each process under the control of the control unit 13. Here, the spatio-temporal covariance estimation unit 211, the reverberation suppression filter estimation unit 212", and the beamformer estimation unit 213 constitute a convolutional beamformer estimation unit, and the reverberation suppression filter application unit 221 and the beamformer application unit 222 constitute a convolutional beamformer application unit.
 <Processing>
 The processing of this modification will be described in detail with reference to FIGS. 7 and 8.
 First, in place of the signal processing device 2, the signal processing device 2" executes the processes of steps S213a, S221a, S211a, S211b, and S211c shown in FIG. 7.
 Next, in place of the reverberation suppression filter estimation unit 212, the reverberation suppression filter estimation unit 212" receives R̄_{x,f}^{(i)} and P_{x,f}^{(i)} and computes and outputs the information Ḡ_f^{(i)} corresponding to the dereverberation filter (step S212c"):
 [Math. 52]
 The reverberation suppression filter application unit 221 receives the information Ḡ_f^{(i)} corresponding to the dereverberation filter and the acoustic signal x_{t,f}, applies the reverberation suppression filter to the acoustic signal x_{t,f} as follows, and obtains and outputs the reverberation suppression signal z_{t,f} (step S221c"):
 [Math. 53]
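 The concrete form of [Math. 53] is not recoverable from this text. Purely as a hedged sketch, assuming the common delayed linear-prediction form z_{t,f} = x_{t,f} − Ḡ_f^H x̄_{t−Δ,f}, where x̄_{t−Δ,f} stacks L past frames starting Δ frames back (the function name, shapes, and this filter form are assumptions, not the patent's definition):

```python
import numpy as np

def apply_dereverb_filter(X, G, delta):
    """Apply a delayed linear-prediction dereverberation filter
    per frequency bin (assumed form; see lead-in).

    X: (T, F, M) STFT observations x_{t,f} for M microphones.
    G: (F, L*M, M) filter taps (L past frames per bin).
    Returns Z: (T, F, M) reverberation-suppressed signals z_{t,f}.
    """
    T, F, M = X.shape
    L = G.shape[1] // M
    Z = X.copy()
    for t in range(T):
        for f in range(F):
            # Stack x_{t-delta,f}, ..., x_{t-delta-L+1,f} (zeros before t=0).
            past = [X[t - delta - l, f] if t - delta - l >= 0
                    else np.zeros(M, dtype=X.dtype) for l in range(L)]
            xbar = np.concatenate(past)               # shape (L*M,)
            Z[t, f] = X[t, f] - G[f].conj().T @ xbar  # z = x - G^H x̄
    return Z
```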
 Thereafter, in place of the signal processing device 2, the signal processing device 2" executes the processes of steps S213c, S213d, S213e, S222a, S13, S211d, and S222b shown in FIG. 8.
 [Third Embodiment]
 Next, a third embodiment of the present invention will be described. In this embodiment, the auxiliary information includes information specifying the power of the target sound, which makes it possible to omit the iterative processing.
 <Functional configuration>
 As illustrated in FIG. 9, the signal processing device 3 of this embodiment has a spatio-temporal covariance estimation unit 311, a reverberation suppression filter estimation unit 212, a beamformer estimation unit 313, a reverberation suppression filter application unit 221, a beamformer application unit 222, and a control unit 13, and executes each process under the control of the control unit 13. Here, the spatio-temporal covariance estimation unit 311, the reverberation suppression filter estimation unit 212, and the beamformer estimation unit 313 constitute a convolutional beamformer estimation unit. The reverberation suppression filter application unit 221 and the beamformer application unit 222 constitute a convolutional beamformer application unit.
 <Processing>
 The processing of this embodiment will be described in detail with reference to FIGS. 10, 11, and 4.
 First, the auxiliary information s = {s_1, s_2} is input to the signal processing device 3. The auxiliary information s of this embodiment includes a time-frequency mask s_1 = γ_{t,f}^{(i)} and information s_2 = λ_{t,f}^{(i)} specifying the power of the target sound. The time-frequency mask s_1 = γ_{t,f}^{(i)} is input to the beamformer estimation unit 313, and the information s_2 = λ_{t,f}^{(i)} specifying the power of the target sound is input to the spatio-temporal covariance estimation unit 311 and the beamformer estimation unit 313 (step S313a).
 As illustrated in FIGS. 10, 11, and 4, in place of the signal processing device 2, the signal processing device 3 executes the processes of steps S221a, S211a, S213b, S221b, S211b, S211c, S212a, S212b, S212c, S221c, S213c, S213d, S213e, S222a, and S222b. However, the processes of the spatio-temporal covariance estimation unit 211 and the beamformer estimation unit 213 described in the second embodiment are executed by the spatio-temporal covariance estimation unit 311 and the beamformer estimation unit 313, respectively. Further, the auxiliary information s_2 = λ_{t,f}^{(i)} is used in the calculations of steps S211b, S211c, and S213d, as sketched below. As a result, the processed signal y_{t,f}^{(i)}, in which reverberation suppression, diffuse noise suppression, and target sound source separation have been applied to the acoustic signal x_{t,f}, is obtained without iterative processing.
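 As a hedged illustration of how the given power λ_{t,f}^{(i)} can enter the covariance calculations of steps S211b and S211c, the following numpy sketch computes a generic power-weighted spatio-temporal covariance Σ_t x̄_{t,f} x̄_{t,f}^H / λ_{t,f}^{(i)} for one frequency bin (the exact definitions of R̄_{x,f}^{(i)} and P_{x,f}^{(i)} are given by the patent's equations, which are not reproduced in this excerpt; names and shapes here are assumptions):

```python
import numpy as np

def power_weighted_covariance(Xbar, lam, eps=1e-8):
    """Power-weighted spatio-temporal covariance for one bin f.

    Xbar: (T, D) stacked spatio-temporal vectors x̄_{t,f}.
    lam:  (T,) target-sound powers λ_{t,f}^{(i)}.
    Returns R: (D, D) = sum_t x̄_{t,f} x̄_{t,f}^H / λ_{t,f}^{(i)}.
    """
    w = 1.0 / np.maximum(lam, eps)       # per-frame power weights
    return (Xbar.T * w) @ Xbar.conj()    # (D, D), Hermitian
```

 Because λ_{t,f}^{(i)} is supplied as auxiliary information rather than re-estimated from intermediate results, such weights can be computed once, which is what allows the iteration to be dropped.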
 <Features of this embodiment>
 In this embodiment, by giving the power of the target sound to the signal processing device 3 as auxiliary information, highly accurate speech enhancement can be performed without iterative processing.
 [Modification 1 of the third embodiment]
 As in modification 1 of the second embodiment, in the third embodiment a power-weighted spatio-temporal covariance matrix of the non-target sound may also be computed, in addition to the target-sound power-weighted spatio-temporal covariance matrices R̄_{x,f}^{(i)} and P_{x,f}^{(i)}, and used for estimating the reverberation suppression filter.
 <Functional configuration>
 As illustrated in FIG. 9, the signal processing device 3' of this modification has a spatio-temporal covariance estimation unit 311', a reverberation suppression filter estimation unit 212', a beamformer estimation unit 313, a reverberation suppression filter application unit 221, a beamformer application unit 222, and a control unit 13, and executes each process under the control of the control unit 13. Here, the spatio-temporal covariance estimation unit 311', the reverberation suppression filter estimation unit 212', and the beamformer estimation unit 313 constitute a convolutional beamformer estimation unit. The reverberation suppression filter application unit 221 and the beamformer application unit 222 constitute a convolutional beamformer application unit.
 <Processing>
 The processing of this modification will be described in detail with reference to FIGS. 10, 11, and 4.
 In place of the signal processing device 3, the signal processing device 3' executes the processes of steps S313a, S221a, S211a, S213b, S221b, S211b, S211c, S212a, and S212b shown in FIG. 10 as described in the third embodiment.
 Next, in place of the signal processing device 2', the signal processing device 3' executes the processes of steps S211b', S211c', S212a', and S212b' shown in FIG. 4. However, the process of the spatio-temporal covariance estimation unit 211' described in modification 1 of the second embodiment is executed by the spatio-temporal covariance estimation unit 311'. Further, the auxiliary information s_2 = λ_{t,f}^{(i)} is used in the calculations of steps S211b' and S211c'.
 In place of the signal processing device 3, the signal processing device 3' executes the processes of steps S212c, S221c, S213c, S213d, S213e, S222a, and S222b shown in FIG. 11 as described in the third embodiment.
 [Modification 2 of the third embodiment]
 As in modification 2 of the second embodiment, also in the third embodiment only the processed signal y_{t,f}^{(i)} corresponding to a single source signal i may be obtained.
 <Functional configuration>
 As illustrated in FIG. 9, the signal processing device 3" of this modification has a spatio-temporal covariance estimation unit 311, a reverberation suppression filter estimation unit 212", a beamformer estimation unit 313, a reverberation suppression filter application unit 221", a beamformer application unit 222, and a control unit 13, and executes each process under the control of the control unit 13. Here, the spatio-temporal covariance estimation unit 311, the reverberation suppression filter estimation unit 212", and the beamformer estimation unit 313 constitute a convolutional beamformer estimation unit. The reverberation suppression filter application unit 221" and the beamformer application unit 222 constitute a convolutional beamformer application unit.
 <Processing>
 The processing of this modification will be described in detail with reference to FIGS. 12 and 13.
 In place of the signal processing device 3, the signal processing device 3" executes steps S313a, S221a, S211b, and S211c shown in FIGS. 12 and 13 as described in the third embodiment. Further, in place of the reverberation suppression filter estimation unit 212" of the signal processing device 2", the reverberation suppression filter estimation unit 212" of the signal processing device 3" executes steps S212c" and S221c" described in modification 2 of the second embodiment. Thereafter, in place of the signal processing device 2, the signal processing device 3" executes steps S213c, S213d, S213e, S222a, and S222b described in the second embodiment. However, the auxiliary information s_2 = λ_{t,f}^{(i)} is used in the calculations of steps S211b, S211c, and S213d.
 [Fourth Embodiment]
 As described in modification 1 of the second embodiment, the sizes of the power-weighted spatio-temporal covariance matrices R̄_{x,f}^{(i)} and P_{x,f}^{(i)} of the target sound obtained in steps S211b and S211c are smaller than the sizes of the matrices Ψ_f and φ_f obtained in steps S212a and S212b, so each of the above embodiments and modifications realizes speech enhancement at a low computational cost. Although this benefit is then lost, in steps S212a and S212b,
 [Math. 54]
 [Math. 55]
 may be replaced by
 [Math. 56]
 [Math. 57]
 where the following is satisfied:
 [Math. 58]
 <Comparative experiment>
 Comparison results between the fourth embodiment and modifications 1 and 2 of the second embodiment are illustrated below. Experiments were conducted with the following two configurations, Config-1 and Config-2. A short-time Fourier transform was used for the frequency division. A Hann window was used as the window function, and the frame length and shift width were set to 32 ms and 8 ms, respectively. Further, Δ = 4 was used.
 [Table 59]
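 For concreteness, the stated analysis settings translate into the following hedged scipy sketch (the sampling rate, channel count, and input signal are placeholders assumed for illustration; they are not stated in this excerpt):

```python
import numpy as np
from scipy.signal import stft

fs = 16000                      # assumed sampling rate
nperseg = int(0.032 * fs)       # 32 ms frame length -> 512 samples at 16 kHz
hop = int(0.008 * fs)           # 8 ms shift -> 128 samples
noverlap = nperseg - hop

x = np.random.randn(8, 4 * fs)  # placeholder: 8-channel, 4-second mixture
f, t, X = stft(x, fs=fs, window="hann", nperseg=nperseg, noverlap=noverlap)
# X has shape (channels, F, T): the frequency-divided acoustic signal x_{t,f};
# Δ = 4 above is then a delay of 4 such frames.
```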
 FIG. 15 shows the word error rates obtained when the processed signals produced by the fourth embodiment and by modifications 1 and 2 of the second embodiment were passed to speech recognition, for the two configurations Config-1 and Config-2. The horizontal axis of FIG. 15 represents the number of iterations (#iterations), and the vertical axis represents the word error rate (WER (%)). As shown in FIG. 15, the methods of modifications 1 and 2 of the second embodiment improve the speech recognition performance for speech signals recorded in a noisy, reverberant, multi-sound-source environment compared with the method of the fourth embodiment.
 The computation times required to process a 9.44 s mixture signal by the methods of the fourth embodiment and of modifications 1 and 2 of the second embodiment, for the two configurations Config-1 and Config-2, are illustrated below.
 [Table 60]
 It can be seen that the methods of modifications 1 and 2 of the second embodiment can perform reverberation suppression, diffuse noise suppression, and target sound source separation with a smaller amount of computation than the method of the fourth embodiment.
 [Hardware configuration]
 The signal processing devices 1, 2, 2', 2", 3, 3', and 3" in each embodiment are devices configured by, for example, a general-purpose or dedicated computer that includes a processor (hardware processor) such as a CPU (central processing unit) and memories such as a RAM (random-access memory) and a ROM (read-only memory) executing a predetermined program. This computer may have a single processor and memory, or multiple processors and memories. The program may be installed in the computer, or may be recorded in a ROM or the like in advance. Further, some or all of the processing units may be configured using electronic circuitry that realizes the processing functions by itself, rather than circuitry such as a CPU that realizes a functional configuration by reading a program. The electronic circuitry constituting one device may also include multiple CPUs.
 FIG. 6 is a block diagram illustrating the hardware configuration of the signal processing devices 1, 2, 2', 2", 3, 3', and 3" in each embodiment. As illustrated in FIG. 6, the signal processing devices 1, 2, 2', 2", 3, 3', and 3" of this example have a CPU (Central Processing Unit) 10a, an input unit 10b, an output unit 10c, a RAM (Random Access Memory) 10d, a ROM (Read Only Memory) 10e, an auxiliary storage device 10f, and a bus 10g. The CPU 10a of this example has a control unit 10aa, an arithmetic unit 10ab, and a register 10ac, and executes various arithmetic processes according to programs read into the register 10ac. The input unit 10b is an input terminal, keyboard, mouse, touch panel, or the like through which data is input. The output unit 10c is an output terminal from which data is output, a display, a LAN card controlled by the CPU 10a that has read a predetermined program, or the like. The RAM 10d is an SRAM (Static Random Access Memory), a DRAM (Dynamic Random Access Memory), or the like, and has a program area 10da in which a predetermined program is stored and a data area 10db in which various data are stored. The auxiliary storage device 10f is, for example, a hard disk, an MO (Magneto-Optical disc), a semiconductor memory, or the like, and has a program area 10fa in which a predetermined program is stored and a data area 10fb in which various data are stored. The bus 10g connects the CPU 10a, the input unit 10b, the output unit 10c, the RAM 10d, the ROM 10e, and the auxiliary storage device 10f so that information can be exchanged among them. The CPU 10a writes the program stored in the program area 10fa of the auxiliary storage device 10f to the program area 10da of the RAM 10d according to a loaded OS (Operating System) program. Similarly, the CPU 10a writes the various data stored in the data area 10fb of the auxiliary storage device 10f to the data area 10db of the RAM 10d. The addresses on the RAM 10d at which the program and data are written are stored in the register 10ac of the CPU 10a. The control unit 10aa of the CPU 10a sequentially reads these addresses from the register 10ac, reads the program and data from the areas on the RAM 10d indicated by the read addresses, causes the arithmetic unit 10ab to sequentially execute the operations indicated by the program, and stores the results in the register 10ac. With such a configuration, the functional configurations of the signal processing devices 1, 2, 2', 2", 3, 3', and 3" are realized.
 The above program can be recorded on a computer-readable recording medium. An example of a computer-readable recording medium is a non-transitory recording medium, such as a magnetic recording device, an optical disc, a magneto-optical recording medium, or a semiconductor memory.
 The program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. Alternatively, the program may be stored in the storage device of a server computer and distributed by transferring it from the server computer to other computers over a network. As described above, a computer that executes such a program first, for example, temporarily stores the program recorded on the portable recording medium or transferred from the server computer in its own storage device. When executing a process, the computer then reads the program stored in its own storage device and executes the process according to the read program. As another execution form, the computer may read the program directly from the portable recording medium and execute processes according to it, or it may execute a process according to the received program each time the program is transferred to it from the server computer. The above processes may also be executed by a so-called ASP (Application Service Provider) type service, in which the processing functions are realized only through execution instructions and result acquisition, without transferring the program from the server computer to the computer. The program in this embodiment includes information that is used for processing by an electronic computer and is equivalent to a program (data that is not a direct command to the computer but has the property of defining the processing of the computer, etc.).
 In each embodiment, the device is configured by executing a predetermined program on a computer, but at least part of these processing contents may be realized by hardware.
 The present invention is not limited to the above embodiments. For example, in each of the above embodiments and modifications, the acoustic signal x_{t,f} is obtained by frequency-dividing a signal obtained by observing a mixture of diffuse noise and source signals emitted from sound sources. However, this does not limit the present invention. For example, the acoustic signal x_{t,f} may be obtained by applying some signal processing (such as filtering) to a signal obtained by frequency-dividing the observed mixture signal; by frequency-dividing a signal obtained by applying some signal processing to the observed mixture signal; or by applying further signal processing to a signal obtained by frequency-dividing a signal obtained by applying some signal processing to the observed mixture signal. Further, the various processes described above are not only executed in time series as described, but may also be executed in parallel or individually according to the processing capacity of the device that executes them or as needed.
 In the second embodiment, the reverberation suppression signal z_{t,f} and the auxiliary information s = γ_{t,f}^{(i)} (a time-frequency mask) are input to the beamformer estimation unit 213, and the steering vector estimation unit 2131 of the beamformer estimation unit 213 estimates the steering vector v_f^{(i)} based on z_{t,f} and γ_{t,f}^{(i)}. However, the auxiliary information s may include the steering vector v_f^{(i)} itself. In this case, the steering vector estimation unit 2131 can be omitted, and the RTF estimation unit 2132 of the beamformer estimation unit 213 may obtain ṽ_f^{(i)} from the v_f^{(i)} included in the auxiliary information s. Even if the auxiliary information s includes neither the time-frequency mask γ_{t,f} nor the steering vector v_f^{(i)}, the steering vector v_f^{(i)} can still be estimated from the reverberation suppression signal z_{t,f} and the auxiliary information s, provided that s includes a reference sound of the target sound. That is, as illustrated in FIG. 6, the time-frequency mask estimation unit 2130 of the beamformer estimation unit 213 may first receive the reverberation suppression signal z_{t,f} and the auxiliary information s (the reference sound of the target sound), estimate the time-frequency mask γ_{t,f}^{(i)} by the method described in Reference 3, and input it to the steering vector estimation unit 2131. Alternatively, the RTF ṽ_f^{(i)} itself may be input to the beamformer estimation unit 213 as the auxiliary information s = ṽ_f^{(i)}. In short, as long as information for calculating the RTF ṽ_f^{(i)} is input to the beamformer estimation unit 213 as the auxiliary information s, the beamformer estimation unit 213 can estimate the beamformer.
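 As a hedged sketch of the mask-based estimation performed by the steering vector estimation unit 2131, the following shows one common choice from the literature for a single frequency bin (the patent does not fix the formula in this passage; the function name and the use of the principal eigenvector are assumptions):

```python
import numpy as np

def estimate_steering_vector(Z, gamma):
    """Mask-based steering vector estimate for one frequency bin f.

    Z:     (T, M) reverberation-suppressed observations z_{t,f}.
    gamma: (T,) time-frequency mask values γ_{t,f}^{(i)} in [0, 1].
    Returns v: (M,) steering vector estimate v_f^{(i)}.
    """
    # Mask-weighted spatial covariance of the target sound.
    R = (Z.T * gamma) @ Z.conj() / np.maximum(gamma.sum(), 1e-8)
    # Principal eigenvector (eigh returns eigenvalues in ascending order).
    _, eigvecs = np.linalg.eigh(R)
    return eigvecs[:, -1]
```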
 Further, in steps S212a' and S212b', the power-weighted spatio-temporal covariance matrices R̄_{x,f}^{(i)} and P_{x,f}^{(i)} of the target sound may be used in place of the power-weighted spatio-temporal covariance matrices R̄_{x,f}^⊥ and P_{x,f}^⊥ of the non-target sound. In this case, steps S211b' and S211c' can be omitted.
 Needless to say, other modifications can be made as appropriate without departing from the spirit of the present invention.
1, 2, 2', 2", 3, 3', 3"  Signal processing device
11  Convolutional beamformer estimation unit
12  Convolutional beamformer application unit
211, 211', 311, 311'  Spatio-temporal covariance estimation unit
212, 212', 212"  Reverberation suppression filter estimation unit
213, 213', 313  Beamformer estimation unit
221, 221"  Reverberation suppression filter application unit
222, 222'  Beamformer application unit

Claims (7)

  1. A signal processing device comprising:
    a convolutional beamformer estimation unit that receives a frequency-divided time-series acoustic signal and auxiliary information representing information on a target sound, and estimates a convolutional beamformer that performs reverberation suppression, diffuse noise suppression, and target sound source separation, based on an optimization criterion that a signal obtained by applying the convolutional beamformer to the acoustic signal follows a probabilistic model; and
    a convolutional beamformer application unit that applies the convolutional beamformer estimated by the convolutional beamformer estimation unit to the acoustic signal to obtain and output a processed signal.
  2. The signal processing device according to claim 1, wherein
    the convolutional beamformer includes a reverberation suppression filter that performs the reverberation suppression, and a beamformer that performs the diffuse noise suppression and the target sound source separation,
    the convolutional beamformer estimation unit includes:
    a spatio-temporal covariance matrix estimation unit that obtains a power-weighted spatio-temporal covariance matrix of the target sound; and
    a reverberation suppression filter estimation unit that receives the acoustic signal, the power-weighted spatio-temporal covariance matrix of the target sound, and information representing the beamformer, and estimates the reverberation suppression filter based on the optimization criterion,
    the convolutional beamformer application unit includes a reverberation suppression filter application unit that applies the reverberation suppression filter estimated by the reverberation suppression filter estimation unit to the acoustic signal to obtain a reverberation suppression signal,
    the convolutional beamformer estimation unit further includes a beamformer estimation unit that receives the reverberation suppression signal and the auxiliary information, and estimates the beamformer based on the optimization criterion, and
    the convolutional beamformer application unit further includes a beamformer application unit that applies the beamformer estimated by the beamformer estimation unit to the reverberation suppression signal to obtain and output the processed signal.
  3. The signal processing device according to claim 2, wherein
    the spatio-temporal covariance matrix estimation unit further obtains a power-weighted spatio-temporal covariance matrix of a non-target sound, and
    the reverberation suppression filter estimation unit receives the acoustic signal, the power-weighted spatio-temporal covariance matrix of the target sound, the power-weighted spatio-temporal covariance matrix of the non-target sound, and information specifying the beamformer, and estimates the reverberation suppression filter based on the optimization criterion.
  4. The signal processing device according to claim 2 or 3, wherein
    the process of the spatio-temporal covariance matrix estimation unit, the process of the reverberation suppression filter estimation unit, the process of the reverberation suppression filter application unit, the process of the beamformer estimation unit, and the process of the beamformer application unit are repeated.
  5. The signal processing device according to any one of claims 1 to 3, wherein
    the auxiliary information includes information specifying the power of the target sound.
  6. A signal processing method comprising:
    a convolutional beamformer estimation step of receiving a frequency-divided time-series acoustic signal and auxiliary information representing information on a target sound, and estimating a convolutional beamformer that performs reverberation suppression, diffuse noise suppression, and target sound source separation, based on an optimization criterion that a signal obtained by applying the convolutional beamformer to the acoustic signal follows a probabilistic model; and
    a convolutional beamformer application step of applying the convolutional beamformer estimated in the convolutional beamformer estimation step to the acoustic signal to obtain and output a processed signal.
  7. A program for causing a computer to function as the signal processing device according to any one of claims 1 to 5.
PCT/JP2020/015456 2020-04-06 2020-04-06 Signal processing device, signal processing method, and program WO2021205494A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
JP2022513704A JP7444243B2 (en) 2020-04-06 2020-04-06 Signal processing device, signal processing method, and program
PCT/JP2020/015456 WO2021205494A1 (en) 2020-04-06 2020-04-06 Signal processing device, signal processing method, and program

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2020/015456 WO2021205494A1 (en) 2020-04-06 2020-04-06 Signal processing device, signal processing method, and program

Publications (1)

Publication Number Publication Date
WO2021205494A1 true WO2021205494A1 (en) 2021-10-14

Family

ID=78023169

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2020/015456 WO2021205494A1 (en) 2020-04-06 2020-04-06 Signal processing device, signal processing method, and program

Country Status (2)

Country Link
JP (1) JP7444243B2 (en)
WO (1) WO2021205494A1 (en)


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2009005261A (en) * 2007-06-25 2009-01-08 Nippon Telegr & Teleph Corp <Ntt> Sound pickup apparatus, sound pickup method, sound pickup program using its method, and storage medium
JP2017505461A (en) * 2014-04-30 2017-02-16 ホアウェイ・テクノロジーズ・カンパニー・リミテッド Apparatus, method, and computer program for signal processing for removing reverberation of some input audio signals
JP2016136229A (en) * 2015-01-14 2016-07-28 本田技研工業株式会社 Voice processing device, voice processing method, and voice processing system
JP2017107141A (en) * 2015-12-09 2017-06-15 日本電信電話株式会社 Sound source information estimation device, sound source information estimation method and program
JP2019508730A (en) * 2016-03-23 2019-03-28 グーグル エルエルシー Adaptive audio enhancement for multi-channel speech recognition
WO2018110008A1 (en) * 2016-12-16 2018-06-21 日本電信電話株式会社 Target sound emphasis device, noise estimation parameter learning device, method for emphasizing target sound, method for learning noise estimation parameter, and program

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114999508A (en) * 2022-07-29 2022-09-02 之江实验室 Universal speech enhancement method and device by using multi-source auxiliary information
CN114999508B (en) * 2022-07-29 2022-11-08 之江实验室 Universal voice enhancement method and device by utilizing multi-source auxiliary information
US12094484B2 (en) 2022-07-29 2024-09-17 Zhejiang Lab General speech enhancement method and apparatus using multi-source auxiliary information

Also Published As

Publication number Publication date
JPWO2021205494A1 (en) 2021-10-14
JP7444243B2 (en) 2024-03-06

Similar Documents

Publication Publication Date Title
Kitamura et al. Determined blind source separation with independent low-rank matrix analysis
US8848933B2 (en) Signal enhancement device, method thereof, program, and recording medium
JP7115562B2 (en) SIGNAL PROCESSING DEVICE, SIGNAL PROCESSING METHOD, AND PROGRAM
KR100878992B1 (en) Geometric source separation signal processing technique
WO2007100137A1 (en) Reverberation removal device, reverberation removal method, reverberation removal program, and recording medium
JP2007526511A (en) Method and apparatus for blind separation of multipath multichannel mixed signals in the frequency domain
JP6815956B2 (en) Filter coefficient calculator, its method, and program
WO2021205494A1 (en) Signal processing device, signal processing method, and program
Song et al. An integrated multi-channel approach for joint noise reduction and dereverberation
Barner et al. Polynomial weighted median filtering
WO2021171406A1 (en) Signal processing device, signal processing method, and program
JP7156064B2 (en) Latent variable optimization device, filter coefficient optimization device, latent variable optimization method, filter coefficient optimization method, program
WO2022162878A1 (en) Signal processing device, signal processing method, signal processing program, learning device, learning method, and learning program
JP7552742B2 (en) SOUND SOURCE SEPARATION DEVICE, SOUND SOURCE SEPARATION METHOD, AND PROGRAM
WO2020184210A1 (en) Noise-spatial-covariance-matrix estimation device, noise-spatial-covariance-matrix estimation method, and program
CN110677782B (en) Signal adaptive noise filter
CN109074811B (en) Audio source separation
CN108353241A (en) Rendering system
WO2024038522A1 (en) Signal processing device, signal processing method, and program
JP2020141160A (en) Sound information processing device and programs
WO2022180741A1 (en) Acoustic signal enhancement device, method, and program
JP7173355B2 (en) PSD optimization device, PSD optimization method, program
JP2024152109A (en) Signal enhancement device, method and program
Moir et al. Decorrelation of multiple non‐stationary sources using a multivariable crosstalk‐resistant adaptive noise canceller
JP7173356B2 (en) PSD optimization device, PSD optimization method, program

Legal Events

- 121 — Ep: the epo has been informed by wipo that ep was designated in this application. Ref document number: 20930384; Country of ref document: EP; Kind code of ref document: A1
- ENP — Entry into the national phase. Ref document number: 2022513704; Country of ref document: JP; Kind code of ref document: A
- NENP — Non-entry into the national phase. Ref country code: DE
- 122 — Ep: pct application non-entry in european phase. Ref document number: 20930384; Country of ref document: EP; Kind code of ref document: A1