WO2024038522A1 - Signal processing device, signal processing method, and program - Google Patents
- Publication number: WO2024038522A1 (application PCT/JP2022/031099)
- Authority: WIPO (PCT)
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0272—Voice signal separating
- G10L21/028—Voice signal separating using properties of sound source
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04R—LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
- H04R3/00—Circuits for transducers, loudspeakers or microphones
Definitions
- the present invention relates to a technique for estimating, with high quality, an audio signal included in a signal recorded using a microphone.
- As a method of extracting a signal source using a plurality of sensors, a method using a convolutional beamformer (CBF, see Non-Patent Document 1) is known.
- CBF: convolutional beamformer
- MVDR: Minimum-Variance Distortionless Response
- However, when the CBF is designed based on the MVDR criterion, the spatial information (spatial covariance matrix) of the target sound source to be extracted is compressed into a steering vector, so the problem is that the CBF cannot use all the spatial information possessed by the target sound source.
- An object of the present invention is to provide a signal processing device, a signal processing method, and a program that can use all the spatial information of a target sound source by introducing the MaxSNR criterion instead of the MVDR criterion.
- According to one aspect of the present invention, a signal processing device includes: a second spatial covariance matrix estimation unit that estimates a spatial covariance matrix of a non-target sound source using an estimated value of a spatio-temporal covariance matrix of the non-target sound source; a dereverberation filter estimation unit that estimates a dereverberation filter using the estimated value of the spatio-temporal covariance matrix of the non-target sound source; a beamformer estimation unit that estimates a convolutional beamformer using an observed signal or an estimated value of a spatial covariance matrix of a target sound source, the estimated value of the spatial covariance matrix of the non-target sound source, and the estimated dereverberation filter; and a sound source extraction unit that performs beamforming processing using the observed signal and the estimated convolutional beamformer to estimate a sound source signal.
- According to the present invention, by introducing the MaxSNR criterion, it is possible to use all the spatial information of the target sound source.
- FIG. 1 is a functional block diagram of a signal processing device according to a first embodiment.
- FIG. 2 is a diagram illustrating an example of a processing flow of the signal processing device according to the first embodiment.
- FIG. 3 is a functional block diagram of a signal processing device according to a second embodiment.
- FIG. 4 is a diagram illustrating an example of a processing flow of the signal processing device according to the second embodiment.
- FIG. 5 is a functional block diagram of a signal processing device according to a third embodiment.
- FIG. 6 is a diagram illustrating an example of a processing flow of the signal processing device according to the third embodiment.
- FIG. 7 is a diagram illustrating an example of the configuration of a computer to which this method is applied.
- The problem targeted in this embodiment is a sound source extraction problem, in which a sound source signal s_f,t, or a spatial image s_f,t^image of the sound source signal from which reverberation has been removed, is estimated from a signal x_f,t observed with a microphone.
- a f represents the acoustic transfer function of the sound source.
- the sound source signal is a signal based on the sound emitted by the sound source (target sound source) to be recorded by the microphone.
- In the following description, the target sound source is a speaker (hereinafter also referred to as the "target speaker"), the target sound is the voice uttered by the target speaker (hereinafter also referred to as the "target voice"), and the target signal is a signal corresponding to the target voice.
- However, the target sound source is not limited to a speaker and may be any sound source, such as a musical instrument or a playback device, and the target sound is not limited to voice and may be any sound other than voice.
- a sound source other than the target sound source is also called a non-target sound source.
- the steering vector used by MVDR CBF corresponds to the principal component of the spatial covariance matrix V S , and the MVDR CBF cannot use all the spatial information that the spatial covariance matrix V S has.
- a MaxSNR criterion is introduced as a new criterion for designing CBF.
- The MaxSNR CBF ^w of this embodiment is characterized in that it can be decomposed into the product of the dereverberation filter ^G and the MaxSNR beamformer w for instantaneous mixing, as shown in the following equation.
- the subscript opt means the optimal solution
- C is the entire set of complex numbers.
- The MaxSNR CBF has the feature that the dereverberation filter ^G and the MaxSNR beamformer w can be optimized in an integrated manner.
- When equation (2) is decomposed as equation (3), ^w and ^R_N are written as follows.
- S ++ is a set consisting of all positive definite matrices.
- Equation (7) can be solved as a generalized eigenvalue problem, taking the eigenvector associated with the maximum eigenvalue:
- V_S w_opt = λ_max V_N w_opt
- λ_max is the maximum eigenvalue.
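The generalized eigenvalue step above can be sketched numerically. The following is a minimal illustration, assuming random positive-definite stand-ins for V_S and V_N (the matrix size M and all values are invented; with real data the matrices would come from the covariance estimation steps described in this document):

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
M = 4  # number of microphones (illustrative)

def random_pd(m):
    """Random Hermitian positive-definite matrix as a stand-in for a covariance."""
    a = rng.standard_normal((m, m)) + 1j * rng.standard_normal((m, m))
    return a @ a.conj().T + m * np.eye(m)

V_S = random_pd(M)  # spatial covariance of the target sound source
V_N = random_pd(M)  # spatial covariance of the non-target sound sources

# Solve V_S w = lam * V_N w.  eigh returns eigenvalues in ascending order,
# so the last eigenvector maximizes the generalized Rayleigh quotient
# (w^H V_S w) / (w^H V_N w), i.e. the output SNR.
lam, vecs = eigh(V_S, V_N)
w_opt = vecs[:, -1]
lam_max = lam[-1]

# The defining relation V_S w_opt = lam_max * V_N w_opt holds.
assert np.allclose(V_S @ w_opt, lam_max * (V_N @ w_opt))
```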
- ^G in Equation (8) is a multi-channel linear prediction (MCLP)-based dereverberation filter used for dereverberation.
- V_N in Equation (9) is the Schur complement of ^R_N, and can be regarded as a spatial covariance matrix of a non-target sound source from which reverberation has been removed.
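Reading V_N as the Schur complement of the lower-right block of ^R_N, the computation can be sketched as follows (a minimal illustration; the block sizes and the random stand-in for ^R_N are assumptions, not values from this document):

```python
import numpy as np

rng = np.random.default_rng(1)
M, L = 3, 2    # microphones and filter taps (illustrative)
D = M + M * L

a = rng.standard_normal((D, D)) + 1j * rng.standard_normal((D, D))
R_tilde_N = a @ a.conj().T + D * np.eye(D)  # stand-in for ^R_N (positive definite)

# Block partition of ^R_N: top-left M x M, bottom-left ML x M, bottom-right ML x ML
R_N = R_tilde_N[:M, :M]
bar_P_N = R_tilde_N[M:, :M]
bar_R_N = R_tilde_N[M:, M:]

# Schur complement of the bar_R_N block, read here as the V_N of Eq. (9):
V_N = R_N - bar_P_N.conj().T @ np.linalg.solve(bar_R_N, bar_P_N)

# The Schur complement of a positive-definite matrix is positive definite.
assert np.all(np.linalg.eigvalsh(V_N) > 0)
```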
- FIG. 1 shows a functional block diagram of a signal processing device according to the first embodiment, and FIG. 2 shows its processing flow.
- The signal processing device 100 includes a first spatial covariance matrix estimation section 110, a spatiotemporal covariance matrix estimation section 120, a dereverberation filter estimation section 130, a second spatial covariance matrix estimation section 140, a beamformer estimation section 150, a sound source extraction section 160, and a spatial image estimation section 170.
- the observed signal is, for example, an acoustic signal observed with a microphone array consisting of a plurality of microphones.
- the output signal of the microphone may be input as it is, an output signal stored in some storage device may be read and input, or a signal obtained by performing some processing on the output signal of the microphone may be input.
- The observation signal x_f,t and the sound source signal s_f,t are signals in the frequency domain.
- An observed signal in the time domain may be input and converted into the observation signal x_f,t in the frequency domain by a frequency domain transformer (not shown), and the estimated value of the sound source signal s_f,t may be converted into a time-domain sound source signal by a time domain transformer (not shown) and output.
- Frequency domain transformation and time domain transformation may be performed by any method; for example, Fourier transformation, inverse Fourier transformation, etc. can be used.
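For example, a short-time Fourier transform pair can play the role of the frequency domain and time domain transformers (an illustrative sketch using SciPy; the window length and the test signal are arbitrary choices, not taken from this document):

```python
import numpy as np
from scipy.signal import stft, istft

fs = 16000
n = np.arange(fs)
sig = np.sin(2 * np.pi * 440 * n / fs)  # 1-second test tone (illustrative)

# Frequency-domain observation x_{f,t}: one complex value per (frequency f, frame t)
freqs, frames, X = stft(sig, fs=fs, nperseg=512)

# Inverse transform back to the time domain; with the default Hann window and
# 50% overlap the COLA condition holds, so reconstruction is essentially exact.
_, rec = istft(X, fs=fs, nperseg=512)
assert np.allclose(rec[: sig.size], sig, atol=1e-8)
```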
- The signal processing device 100 is, for example, a special device configured by loading a special program into a known or dedicated computer having a central processing unit (CPU), a main memory (RAM), and the like.
- the signal processing device 100 executes each process under the control of, for example, a central processing unit.
- The data input to the signal processing device 100 and the data obtained through each process are stored, for example, in the main memory, and the data stored in the main memory are read out by the central processing unit as necessary and used for other processing.
- Each processing unit of the signal processing device 100 may be configured at least in part by hardware such as an integrated circuit.
- Each storage unit included in the signal processing device 100 can be configured by, for example, a main storage device such as a RAM (Random Access Memory), or middleware such as a relational database or a key-value store.
- However, each storage unit does not necessarily need to be provided inside the signal processing device 100; it may be configured with an auxiliary storage device such as a hard disk, an optical disk, or a semiconductor memory element such as a flash memory, and may be provided outside the signal processing device 100.
- The first spatial covariance matrix estimation unit 110 estimates the spatial covariance matrix of the target sound source (S110), and outputs the estimated value V_S ∈ S_+^M.
- Various methods can be used to estimate the spatial covariance matrix of the target sound source.
- For example, the first spatial covariance matrix estimation unit 110 receives the observation signal x_f,t as input, estimates from the observation signal x_f,t the signal of an interval that includes the sound emitted by the target sound source (hereinafter also referred to as the target signal), and estimates the spatial covariance matrix of the target sound source using the estimated target signal.
- Alternatively, the spatial covariance matrix of the target sound source may be approximated in advance through experiments or simulations, and the approximate value may be used as the estimated value V_S ∈ S_+^M.
- The spatiotemporal covariance matrix estimation unit 120 estimates the spatiotemporal covariance matrix of the non-target sound source (S120), and outputs the estimated value ^R_N ∈ S_+^(M+ML).
- Various methods can be used to estimate the spatiotemporal covariance matrix of the non-target sound source.
- For example, the spatiotemporal covariance matrix estimation unit 120 receives the observed signal x_f,t as input, estimates from the observed signal x_f,t the signal of an interval that does not include the sound emitted by the target sound source (hereinafter also referred to as the non-target signal), and estimates the spatiotemporal covariance matrix of the non-target sound source using the estimated non-target signal.
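One common way to realize this estimate, assumed here since the text does not fix the formula, is the sample average of stacked observation vectors over frames judged to contain no target sound:

```python
import numpy as np

rng = np.random.default_rng(4)
M, L, T = 3, 2, 50   # microphones, filter taps, frames (illustrative)
delay = 2            # prediction delay (an assumption; not fixed by this text)
D = M + M * L

x = rng.standard_normal((T, M)) + 1j * rng.standard_normal((T, M))
# Frames judged not to contain the target sound (assumed given by a detector):
nontarget_frames = range(delay + L, T)

def stacked(x, t, delay, L):
    """Stack the current frame with L delayed past frames into one vector."""
    past = [x[t - delay - l] for l in range(L)]
    return np.concatenate([x[t]] + past)

# ^R_N as the sample average of the outer products over non-target frames
R_tilde_N = np.zeros((D, D), dtype=complex)
for t in nontarget_frames:
    v = stacked(x, t, delay, L)
    R_tilde_N += np.outer(v, v.conj())
R_tilde_N /= len(nontarget_frames)

assert np.allclose(R_tilde_N, R_tilde_N.conj().T)  # Hermitian by construction
```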
- The dereverberation filter estimation unit 130 receives the estimated value ^R_N of the spatiotemporal covariance matrix as input, estimates the dereverberation filter from the block matrices -P_N and -R_N included in the estimated value ^R_N (S130), and outputs the estimated dereverberation filter ^G.
- The dereverberation filter is estimated by equation (8).
- R_N is the block matrix consisting of the elements from row 1, column 1 to row M, column M of the estimated value ^R_N.
- -P_N is the block matrix consisting of the elements from row (M+1), column 1 to row (M+ML), column M of the estimated value ^R_N.
- (-P_N)^H is the block matrix consisting of the elements from row 1, column (M+1) to row M, column (M+ML) of the estimated value ^R_N.
- -R_N is the block matrix consisting of the elements from row (M+1), column (M+1) to row (M+ML), column (M+ML) of the estimated value ^R_N.
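Given that block partition, the filter estimation can be sketched as a linear solve in the lower blocks. The specific formula ^G = (-R_N)^{-1}(-P_N) below is an assumption in the spirit of multi-channel linear prediction, since equation (8) itself is not reproduced in this text:

```python
import numpy as np

rng = np.random.default_rng(2)
M, L = 3, 2
D = M + M * L
a = rng.standard_normal((D, D)) + 1j * rng.standard_normal((D, D))
R_tilde_N = a @ a.conj().T + D * np.eye(D)  # stand-in for ^R_N

bar_P_N = R_tilde_N[M:, :M]  # rows (M+1)..(M+ML), columns 1..M
bar_R_N = R_tilde_N[M:, M:]  # rows/columns (M+1)..(M+ML)

# MCLP-style prediction filter (assumed form): ^G = (bar_R_N)^{-1} bar_P_N
G = np.linalg.solve(bar_R_N, bar_P_N)
assert G.shape == (M * L, M)
```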
- The second spatial covariance matrix estimation unit 140 receives the estimated value ^R_N of the spatiotemporal covariance matrix as input, estimates the spatial covariance matrix of the non-target sound source from the block matrices R_N, -P_N, and -R_N included in the estimated value ^R_N (S140), and outputs the estimated value V_N.
- The spatial covariance matrix of the non-target sound source is estimated by equation (9).
- Alternatively, the second spatial covariance matrix estimation unit 140 may receive the estimated value ^R_N of the spatiotemporal covariance matrix and the dereverberation filter ^G estimated by the dereverberation filter estimation unit 130 as input, and estimate the spatial covariance matrix of the non-target sound source from the estimated value ^R_N and the dereverberation filter ^G using equation (9).
- The beamformer estimation unit 150 receives as input the estimated value V_S of the spatial covariance matrix of the target sound source, the estimated value V_N of the spatial covariance matrix of the non-target sound source, and the estimated dereverberation filter ^G.
- The beamformer estimation unit 150 obtains the MaxSNR beamformer w_opt for instantaneous mixing from the estimated value V_S of the spatial covariance matrix of the target sound source and the estimated value V_N of the spatial covariance matrix of the non-target sound source using equation (7). Note that equation (7) can be solved as a generalized eigenvalue problem, taking the eigenvector associated with the maximum eigenvalue:
- V_S w_opt = λ_max V_N w_opt
- λ_max is the maximum eigenvalue.
- The beamformer estimation unit 150 estimates a convolutional beamformer from the MaxSNR beamformer w_opt for instantaneous mixing and the estimated dereverberation filter ^G using equation (3) (S150), and outputs the estimated convolutional beamformer ^w.
- <Sound source extraction unit 160> The sound source extraction unit 160 receives the observation signal x_f,t and the estimated convolutional beamformer ^w as input, performs beamforming processing using the following equation, estimates the sound source signal (S160), and outputs the estimated value y_f,t:
- y_f,t = (^w_f)^H ^x_f,t, where ^x_f,t = [x_f,t^T, ...]^T is the vector obtained by stacking the current observation x_f,t and its delayed frames.
- A^H denotes the Hermitian transpose of A.
- A^T denotes the transpose of A.
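The beamforming operation y_f,t = (^w_f)^H ^x_f,t can be sketched as a Hermitian inner product with a stacked observation vector (the sizes, the prediction delay, and the stacking order are illustrative assumptions, not values taken from this document):

```python
import numpy as np

rng = np.random.default_rng(3)
M, L, T = 3, 2, 10  # microphones, filter taps, frames (illustrative)
delay = 2           # prediction delay (an assumption; not fixed by this text)

x = rng.standard_normal((T, M)) + 1j * rng.standard_normal((T, M))
w_hat = rng.standard_normal(M + M * L) + 1j * rng.standard_normal(M + M * L)

def stacked(x, t, delay, L):
    """Build the stacked observation: current frame followed by L delayed frames."""
    past = [x[t - delay - l] for l in range(L)]
    return np.concatenate([x[t]] + past)

t = T - 1
# np.vdot conjugates its first argument, so this is exactly the
# Hermitian inner product y_{f,t} = (^w_f)^H ^x_{f,t}.
y = np.vdot(w_hat, stacked(x, t, delay, L))
assert np.ndim(y) == 0 and np.iscomplexobj(y)
```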
- <Spatial image estimation unit 170> Although the scale of the convolutional beamformer ^w_f for each frequency bin f is indeterminate, it can be restored by estimating a vector u_f that approximates the spatial image s_f,t^image using the following equation.
- The spatial image estimation unit 170 receives the estimated value V_N of the spatial covariance matrix of the non-target sound source, the estimated value y_f,t, and the MaxSNR beamformer w_opt for instantaneous mixing as input, obtains the vector u_f from the estimated value V_N and the MaxSNR beamformer w_opt, and outputs the approximate value u_f y_f,t of the spatial image s_f,t^image.
- MVDR CBF (a method of estimating the CBF based on the MVDR criterion) requires the steering vector of the target sound source to be estimated separately in advance, and has the problem that its sound source extraction performance depends strongly on the estimation accuracy of the steering vector. This embodiment solves this problem.
- In equations (1) and (2), the estimated value V_S of the spatial covariance matrix of the target sound source and the estimated value ^R_N of the spatio-temporal covariance matrix of the non-target sound source must be obtained in advance. In this embodiment, a Blind MaxSNR CBF that eliminates the need to obtain these two estimated values in advance will be described. Note that "Blind" here means that no prior knowledge is required.
- The Blind MaxSNR CBF of this embodiment is a method of estimating the MaxSNR CBF by repeatedly performing calculations similar to those of the MaxSNR CBF given by equation (2) or equation (7).
- The Blind MaxSNR CBF of this embodiment is defined as the following local optimal solution using an arbitrary super-Gaussian function φ: R_≥0 → R and the Schur complement V_X of the following matrix ^R (equations (20a), (20b)).
- y_f,t = (^w_f)^H ^x_f,t is an estimate of the source signal.
- y_t = [y_1,t, ..., y_F,t]
- ||A||_2 = √(A^H A) is the Euclidean norm of the vector A, and C on the right side of equation (20b) is a constant that is determined adaptively and heuristically at each iteration of the algorithm that maximizes or minimizes the objective function.
- The spatiotemporal covariance matrix ^R_Z,f, which is interpreted as the estimated value ^R_N,f of the spatiotemporal covariance matrix of the non-target sound source, is calculated based on the following equations (21) and (22).
- The MaxSNR CBF is optimized without prior knowledge through iterative optimization that alternately repeats the process of performing this calculation and the process of estimating the MaxSNR CBF ^w based on the following equations (23) to (26).
- y_t^k = [...]
- y_f,t^k = (^w_f^k)^H ^x_f,t (22)
- k is an index indicating the number of iterations.
- FIG. 3 shows a functional block diagram of the signal processing device according to the second embodiment, and FIG. 4 shows its processing flow.
- the signal processing device 200 includes an initialization section 201 , a first spatial covariance matrix estimation section 210 , a spatiotemporal covariance matrix estimation section 220 , a second spatial covariance matrix estimation section 240 , and a dereverberation filter estimation section 230 , a beam former estimation section 250 , a sound source extraction section 160 , and a determination section 280 .
- the signal processing device 200 inputs the observation signal x f,t observed by the microphone and the index m of the reference microphone, estimates the sound source signal s f,t, and outputs the estimated sound source signal s f,t .
- f indicates a frequency
- t indicates a frame number
- the observation signal x f,t and the sound source signal s f,t are signals in the frequency domain.
- An observed signal in the time domain may be input and converted into the observation signal x_f,t in the frequency domain by a frequency domain transformer (not shown), and the sound source signal s_f,t may be converted into a time-domain sound source signal by a time domain transformer (not shown) and output.
- Frequency domain transformation and time domain transformation may be performed by any method; for example, Fourier transformation, inverse Fourier transformation, etc. can be used.
- e_m is a unit vector corresponding to the reference microphone.
- <First spatial covariance matrix estimation unit 210> The first spatial covariance matrix estimation unit 210 receives the observation signal x_f,t as input, estimates the spatial covariance matrix of the observation signal x_f,t using equations (28) to (30) (S210), and outputs the estimated value V_X.
- R_X,f is the block matrix consisting of the elements from row 1, column 1 to row M, column M of the estimated value ^R_X,f; -P_X,f is the block matrix consisting of the elements from row (M+1), column 1 to row (M+ML), column M of the estimated value ^R_X,f; (-P_X,f)^H is the block matrix consisting of the elements from row 1, column (M+1) to row M, column (M+ML) of the estimated value ^R_X,f; and -R_X,f is the block matrix consisting of the elements from row (M+1), column (M+1) to row (M+ML), column (M+ML) of the estimated value ^R_X,f.
- V_X = [V_X,1, ..., V_X,f, ..., V_X,F]
- <Spatiotemporal covariance matrix estimation unit 220>
- The spatiotemporal covariance matrix estimation unit 220 receives the convolutional beamformer ^w^k estimated in the previous iteration (or its initial value ^w^0) and the observation signal x_f,t as input, calculates the spatiotemporal covariance matrix ^R_Z = [^R_Z,1, ..., ^R_Z,f, ..., ^R_Z,F], which is interpreted as the estimated value ^R_N,f of the spatiotemporal covariance matrix of the non-target sound source, based on the following equations (21) and (22) (S220), and outputs it.
- The dereverberation filter estimation unit 230 receives the spatiotemporal covariance matrix ^R_Z,f as input, estimates the dereverberation filter from the block matrices -P_Z,f and -R_Z,f included in the estimated value ^R_Z,f (S230), and outputs the estimated dereverberation filter ^G.
- The dereverberation filter is estimated by equation (25).
- R_Z,f is the block matrix consisting of the elements from row 1, column 1 to row M, column M of the spatiotemporal covariance matrix ^R_Z,f.
- -P_Z,f is the block matrix consisting of the elements from row (M+1), column 1 to row (M+ML), column M of ^R_Z,f.
- (-P_Z,f)^H is the block matrix consisting of the elements from row 1, column (M+1) to row M, column (M+ML) of ^R_Z,f.
- -R_Z,f is the block matrix consisting of the elements from row (M+1), column (M+1) to row (M+ML), column (M+ML) of ^R_Z,f.
- The second spatial covariance matrix estimation unit 240 receives the spatiotemporal covariance matrix ^R_Z,f as input, estimates the spatial covariance matrix of the non-target sound source from the block matrices R_Z,f, -P_Z,f, and -R_Z,f included in ^R_Z,f (S240), and outputs the estimated value V_Z,f.
- The spatial covariance matrix of the non-target sound source is estimated by equation (31).
- Alternatively, the second spatial covariance matrix estimation unit 240 may receive the spatiotemporal covariance matrix ^R_Z,f and the dereverberation filter ^G estimated by the dereverberation filter estimation unit 230 as input, and estimate the spatial covariance matrix of the non-target sound source from ^R_Z,f and the dereverberation filter ^G using equation (31).
- The beamformer estimation unit 250 obtains the MaxSNR beamformer w_f^(k+1) for instantaneous mixing using equation (24) from the estimated value V_X of the spatial covariance matrix of the observation signal and the estimated value V_Z,f of the spatial covariance matrix of the non-target sound source.
- The beamformer estimation unit 250 estimates a convolutional beamformer from the MaxSNR beamformer w_f^(k+1) for instantaneous mixing and the estimated dereverberation filter ^G using equation (23) (S250), and outputs the estimated convolutional beamformer ^w^(k+1).
- The sound source extraction unit 160 receives the observation signal x_f,t and the estimated convolutional beamformer ^w^(k+1) as input, performs beamforming processing using the following equation, estimates the sound source signal (S160), and outputs the estimated value y_f,t.
- The determination unit 280 determines whether or not the convergence condition is satisfied (S280). If the convergence condition is satisfied (YES in S280), the estimated value y_f,t at that time is output as the output of the signal processing device, and processing ends. If the convergence condition is not satisfied (NO in S280), the determination unit 280 sends a control signal to each unit so that S220 to S160 are repeated, and controls the processing of each unit.
- Note that the estimated value y_f,t output from the sound source extraction section 160 can be used in the spatiotemporal covariance matrix estimation section 220, in which case the calculation of equation (22) can be omitted.
- As the convergence condition, conditions such as whether the estimation has been repeated a predetermined number of times (for example, several times) or whether the difference between the convolutional beamformer ^w^(k+1) before and after an update is less than a predetermined threshold can be used.
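The convergence test described above can be sketched as follows (the iteration cap and the threshold are illustrative values, not values fixed by this text):

```python
import numpy as np

def converged(w_prev, w_curr, k, max_iters=5, tol=1e-6):
    """Stop after a fixed number of iterations, or when the change in the
    convolutional beamformer falls below a threshold."""
    if k >= max_iters:
        return True
    return bool(np.linalg.norm(w_curr - w_prev) < tol)

w0 = np.zeros(4)
assert converged(w0, np.full(4, 1e-9), k=1)   # change below the threshold
assert converged(w0, np.ones(4), k=5)         # iteration cap reached
assert not converged(w0, np.ones(4), k=1)     # keep iterating otherwise
```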
- The Blind MaxSNR CBF of this embodiment is a very fast method that can estimate the MaxSNR CBF with high accuracy in at most several iterations.
- In the above, the estimated value y_f,t of the sound source signal s_f,t is output; however, the spatial image estimation unit 170 may be provided, and using the estimated value y_f,t at the time when the convergence condition is satisfied, an approximate value u_f y_f,t of the spatial image s_f,t^image may be determined and output.
- In this embodiment, the MaxSNR CBF can be estimated with higher accuracy than with the Blind MaxSNR CBF of the second embodiment.
- FIG. 5 shows a functional block diagram of a signal processing device according to the third embodiment, and FIG. 6 shows its processing flow.
- the signal processing device 300 includes an initialization section 201 , a first spatial covariance matrix estimation section 110 , a spatiotemporal covariance matrix estimation section 220 , a second spatial covariance matrix estimation section 240 , and a dereverberation filter estimation section 230 , a beam former estimation section 250 , a sound source extraction section 160 , and a determination section 280 .
- This embodiment differs from the second embodiment in that it includes a first spatial covariance matrix estimator 110 instead of the first spatial covariance matrix estimator 210.
- the first spatial covariance matrix estimation unit 110 is as described in the first embodiment.
- This embodiment also differs in that the beamformer estimation unit 250 uses the estimated value V_S of the spatial covariance matrix of the target sound source instead of the estimated value V_X of the spatial covariance matrix of the observation signal x_f,t.
- Other processing is similar to the second embodiment.
- a program that describes this processing content can be recorded on a computer-readable recording medium.
- the computer-readable recording medium may be of any type, such as a magnetic recording device, an optical disk, a magneto-optical recording medium, or a semiconductor memory.
- This program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. Furthermore, this program may be distributed by storing it in a storage device of a server computer and transferring it from the server computer to another computer via a network.
- A computer that executes such a program, for example, first stores the program recorded on the portable recording medium or transferred from the server computer in its own storage device. When executing processing, the computer reads the program stored in its own recording medium and executes processing according to the read program. As another form of executing the program, the computer may read the program directly from the portable recording medium and execute processing according to the program, or may execute processing according to the received program each time the program is transferred to the computer from the server computer.
- The above-mentioned processing may also be executed by a so-called ASP (Application Service Provider) type service, which does not transfer the program from the server computer to the computer, but realizes the processing functions only through execution instructions and acquisition of results.
- the present apparatus is configured by executing a predetermined program on a computer, but at least a part of these processing contents may be implemented in hardware.
Abstract
The present invention provides a signal processing device and the like in which entire spatial information relating to a target sound source can be used through introduction of a MaxSNR criterion. The signal processing device comprises: a second spatial covariance matrix estimation unit that uses an estimated value of a spatial/temporal covariance matrix of a non-target sound source to estimate a spatial covariance matrix of the non-target sound source; a reverberation removal filter estimation unit that uses the estimated value of the spatial/temporal covariance matrix of the non-target sound source to estimate a reverberation removal filter; a beam former estimation unit that uses an observed signal or an estimated value of a spatial covariance matrix of a target sound source, the estimated value of the spatial covariance matrix of the non-target sound source, and the estimated reverberation removal filter to estimate a convolutional beam former; and a sound source extraction unit that uses the observed signal and the estimated convolutional beam former to perform beam forming processing, thereby estimating a sound source signal.
Description
The present invention relates to a technique for estimating, with high quality, an audio signal included in a signal recorded using a microphone.
When recording audio signals using a microphone in a noisy and reverberant environment, unnecessary components such as noise, reverberation, and interfering sounds are mixed into the microphone signal in addition to the audio component to be recorded, so the quality of the audio signal contained in the recorded signal is low. Therefore, signal source extraction techniques have been actively researched in order to estimate, with high quality, the audio signal included in the recorded signal. As a method of extracting a signal source using a plurality of sensors, a method using a convolutional beamformer (CBF, see Non-Patent Document 1) is known. As a criterion for optimizing the CBF, a criterion called Minimum-Variance Distortionless Response (MVDR) has been used (see Non-Patent Document 1).
However, when designing a CBF based on the MVDR criterion, the spatial information (spatial covariance matrix) of the target sound source to be extracted is compressed into a steering vector, so the problem is that the CBF cannot use all the spatial information possessed by the target sound source.
An object of the present invention is to provide a signal processing device, a signal processing method, and a program that can use all of the spatial information of a target sound source by introducing the MaxSNR criterion in place of the MVDR criterion.
To solve the above problems, according to one aspect of the present invention, a signal processing device includes: a second spatial covariance matrix estimation unit that estimates a spatial covariance matrix of a non-target sound source using an estimated value of a spatio-temporal covariance matrix of the non-target sound source; a dereverberation filter estimation unit that estimates a dereverberation filter using the estimated value of the spatio-temporal covariance matrix of the non-target sound source; a beamformer estimation unit that estimates a convolutional beamformer using an observed signal or an estimated value of a spatial covariance matrix of a target sound source, the estimated value of the spatial covariance matrix of the non-target sound source, and the estimated dereverberation filter; and a sound source extraction unit that estimates a sound source signal by performing beamforming processing using the observed signal and the estimated convolutional beamformer.
According to the present invention, introducing the MaxSNR criterion has the effect of making it possible to use all of the spatial information of the target sound source.
Embodiments of the present invention are described below. In the drawings used in the following description, components having the same function and steps performing the same processing are denoted by the same reference numerals, and redundant description is omitted. In the following description, symbols such as "^" and "-" used in the text should originally be written directly above the character that follows them, but due to the limitations of text notation they are written immediately before that character; in the equations, these symbols are written in their original positions. Furthermore, unless otherwise specified, processing defined for each element of a vector or matrix is applied to all elements of that vector or matrix.
<Sound source extraction problem>
The problem addressed in this embodiment is the sound source extraction problem: estimating, from a signal xf,t observed with microphones, either the sound source signal sf,t or the spatial image sf,t image = af sf,t from which the reverberation of the sound source signal sf,t has been removed, where af denotes the acoustic transfer function of the sound source. The sound source signal is a signal based on the sound emitted by the sound source to be recorded by the microphones (the target sound source). In this embodiment, the target sound source is a speaker (hereinafter also referred to as the "target speaker"), the target sound is the speech uttered by the target speaker (hereinafter also referred to as the "target speech"), and the target signal is the signal corresponding to the target speech. However, the embodiment is not limited to this: the target sound source may be some other sound source, such as a musical instrument or a playback device, rather than a speaker, and the target sound may be a sound other than speech. A sound source other than the target sound source is also called a non-target sound source.
<Points of the first embodiment>
The steering vector used by the MVDR CBF corresponds to the principal component of the spatial covariance matrix VS, so the MVDR CBF cannot use all of the spatial information contained in VS. In this embodiment, the MaxSNR criterion is introduced as a new criterion for designing the CBF. Designing the CBF under the MaxSNR criterion has the advantage that the spatial information of the target sound source (the spatial covariance matrix VS) can be fully utilized.
First, the MaxSNR-based CBF is described. Let M be an integer of 2 or more representing the number of microphones, let L+1 be the number of CBF taps, let S+ be the set of all positive semidefinite matrices, let AB denote a square matrix with B rows and B columns and AB×C a matrix with B rows and C columns, let ^RN∈SM+ML + be the spatio-temporal covariance matrix of the non-target sound source, let VS∈SM + be the spatial covariance matrix of the target sound source, and let OA×B be the zero matrix with A rows and B columns. The MaxSNR CBF ^w is then defined as follows. Note that when L=0, the MaxSNR CBF reduces to the MaxSNR beamformer.
The MaxSNR CBF ^w of this embodiment is characterized in that it can be decomposed into the product of a dereverberation filter ^G and a MaxSNR beamformer w for the instantaneous mixture, as in the following equation, where the subscript opt denotes the optimal solution and C is the set of all complex numbers. In other words, the MaxSNR CBF can jointly optimize the dereverberation filter ^G and the MaxSNR beamformer w.
To explain why equation (2) can be decomposed as in equation (3), ^w and ^RN are written as follows, where S++ is the set of all positive definite matrices.
Here, the optimal solution ^wopt of the MaxSNR CBF ^w can be obtained as follows, where IM is the identity matrix with M rows and M columns and AH denotes the Hermitian transpose of A.
Note that equation (7) can be solved as the eigenvector associated with the largest eigenvalue λmax of the generalized eigenvalue problem

VS wopt = λmax VN wopt.
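The generalized eigenvalue problem above can be solved directly with a standard linear-algebra routine. The following sketch is purely illustrative (the matrices here are random stand-ins, not covariances estimated by the method of this document): it solves VS w = λ VN w and takes the eigenvector of the largest eigenvalue as the MaxSNR beamformer.

```python
import numpy as np
from scipy.linalg import eigh

# Hypothetical example: solve V_S w = lambda V_N w for the eigenvector
# associated with the largest generalized eigenvalue (the MaxSNR beamformer).
M = 4
rng = np.random.default_rng(0)

def random_pd(m):
    # Random Hermitian positive definite matrix, used only as a stand-in.
    A = rng.standard_normal((m, m)) + 1j * rng.standard_normal((m, m))
    return A @ A.conj().T + m * np.eye(m)

V_S = random_pd(M)   # stand-in for the target spatial covariance
V_N = random_pd(M)   # stand-in for the non-target spatial covariance

# scipy.linalg.eigh(a, b) solves the generalized Hermitian eigenproblem
# a v = lambda b v; eigenvalues are returned in ascending order.
eigvals, eigvecs = eigh(V_S, V_N)
w_opt = eigvecs[:, -1]          # eigenvector of the largest eigenvalue
lam_max = eigvals[-1]

# Verify the generalized eigenvector equation V_S w = lambda_max V_N w.
assert np.allclose(V_S @ w_opt, lam_max * V_N @ w_opt)
```

With L=0 this eigenvector is the MaxSNR beamformer itself; with L>0 it is the instantaneous-mixture part wopt that is later combined with the dereverberation filter ^G.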
The matrix ^G in equation (8) is a multi-channel linear prediction (MCLP)-based dereverberation filter of the kind used in dereverberation. The matrix VN in equation (9) is the Schur complement of ^RN and can be regarded as the spatial covariance matrix of the non-target sound source after the reverberation has been removed.
<First embodiment>
FIG. 1 shows a functional block diagram of the signal processing device according to the first embodiment, and FIG. 2 shows its processing flow.
The signal processing device 100 includes a first spatial covariance matrix estimation unit 110, a spatio-temporal covariance matrix estimation unit 120, a second spatial covariance matrix estimation unit 140, a dereverberation filter estimation unit 130, a beamformer estimation unit 150, a sound source extraction unit 160, and a spatial image estimation unit 170.
The signal processing device 100 receives an observed signal xf,t captured by microphones, and estimates and outputs the sound source signal sf,t or the spatial image sf,t image = af sf,t of the sound source signal from which reverberation has been removed. The observed signal is, for example, an acoustic signal observed with a microphone array consisting of a plurality of microphones. The microphone output signal may be used directly as the input, an output signal stored in some storage device may be read out and used as the input, or a signal obtained by applying some processing to the microphone output signal may be used as the input. Here f (f=1,…,F) denotes the frequency bin, t (t=1,…,T) denotes the frame index, and the observed signal xf,t and the sound source signal sf,t are frequency-domain signals. Alternatively, a time-domain observed signal may be input and converted into the frequency-domain observed signal xf,t by a frequency-domain transform unit (not shown), and the estimated value of the sound source signal sf,t may be converted into a time-domain sound source signal by a time-domain transform unit (not shown) before being output. Any method may be used for the frequency-domain and time-domain transforms; for example, the Fourier transform and the inverse Fourier transform can be used.
The signal processing device 100 is, for example, a special device configured by loading a special program into a known or dedicated computer having a central processing unit (CPU) and a main storage device (RAM: Random Access Memory). The signal processing device 100 executes each process under the control of, for example, the central processing unit. Data input to the signal processing device 100 and data obtained in each process are stored, for example, in the main storage device, and the data stored in the main storage device are read out to the central processing unit and used for other processing as necessary. Each processing unit of the signal processing device 100 may be at least partially configured by hardware such as an integrated circuit. Each storage unit included in the signal processing device 100 can be configured by, for example, a main storage device such as a RAM, or by middleware such as a relational database or a key-value store. However, each storage unit need not necessarily be provided inside the signal processing device 100; it may be configured as an auxiliary storage device composed of a hard disk, an optical disk, or a semiconductor memory element such as a flash memory, and provided outside the signal processing device 100.
Each unit is described below.
<First spatial covariance matrix estimation unit 110>
The first spatial covariance matrix estimation unit 110 estimates the spatial covariance matrix of the target sound source (S110) and outputs the estimated value VS∈SM +. Various methods can be used to estimate the spatial covariance matrix of the target sound source. For example, the first spatial covariance matrix estimation unit 110 receives the observed signal xf,t, estimates from it an interval containing the sound emitted by the target sound source (hereinafter also referred to as the target signal), and estimates the spatial covariance matrix of the target sound source using the estimated target signal. Alternatively, if the direction of the target sound source is known, the spatial covariance matrix of the target sound source may be approximated in advance by experiments or simulations, and the approximate value may be used as the estimated value VS∈SM +.

<Spatio-temporal covariance matrix estimation unit 120>
The spatio-temporal covariance matrix estimation unit 120 estimates the spatio-temporal covariance matrix of the non-target sound source (S120) and outputs the estimated value ^RN∈SM+ML +. Various methods can be used to estimate the spatio-temporal covariance matrix of the non-target sound source. For example, the spatio-temporal covariance matrix estimation unit 120 receives the observed signal xf,t, estimates from it an interval not containing the sound emitted by the target sound source (hereinafter also referred to as the non-target signal), and estimates the spatio-temporal covariance matrix of the non-target sound source using the estimated non-target signal.

<Dereverberation filter estimation unit 130>
The dereverberation filter estimation unit 130 receives the estimated value ^RN of the spatio-temporal covariance matrix, estimates a dereverberation filter from the block matrices -PN and -RN contained in ^RN (S130), and outputs the estimated dereverberation filter ^G. For example, the dereverberation filter is estimated by equation (8). Here, RN is the block matrix consisting of the elements in rows 1 to M and columns 1 to M of ^RN; -PN is the block matrix consisting of the elements in rows (M+1) to (M+ML) and columns 1 to M of ^RN; (-PN)H is the block matrix consisting of the elements in rows 1 to M and columns (M+1) to (M+ML) of ^RN; and -RN is the block matrix consisting of the elements in rows (M+1) to (M+ML) and columns (M+1) to (M+ML) of ^RN.
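The text leaves the choice of covariance estimator open ("various methods can be used"). One common sketch, shown below under that caveat, averages outer products of STFT frames over the detected target and non-target intervals; the frame intervals, dimensions, and data here are all illustrative assumptions, not values prescribed by this document.

```python
import numpy as np

# Illustrative sketch (not the prescribed estimator): estimate V_S and ^R_N
# for one frequency bin by averaging outer products over detected intervals.
rng = np.random.default_rng(1)
M, L, D, T = 4, 2, 1, 200           # mics, extra taps, prediction delay, frames
X = rng.standard_normal((T, M)) + 1j * rng.standard_normal((T, M))  # STFT frames x_{f,t}

target_frames = range(50, 150)      # assumed: frames containing the target sound
noise_frames = range(0, 50)         # assumed: frames without the target sound

# Spatial covariance of the target source: mean of x x^H over target frames.
V_S = np.mean([np.outer(X[t], X[t].conj()) for t in target_frames], axis=0)

def stacked(t):
    # ^x_{f,t} = [x_{f,t}^T | x_{f,t-D-1}^T | ... | x_{f,t-D-L}^T]^T  (M + ML dim)
    return np.concatenate([X[t]] + [X[t - D - l] for l in range(1, L + 1)])

# Spatio-temporal covariance of the non-target sources: mean of ^x ^x^H
# over noise frames that have enough past context.
R_N = np.mean([np.outer(stacked(t), stacked(t).conj())
               for t in noise_frames if t - D - L >= 0], axis=0)

assert V_S.shape == (M, M) and R_N.shape == (M + M * L, M + M * L)
```

In practice the interval detection (which frames are "target" and "non-target") would itself come from a voice activity detector or similar front end, which the document does not specify.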
<Second spatial covariance matrix estimation unit 140>
The second spatial covariance matrix estimation unit 140 receives the estimated value ^RN of the spatio-temporal covariance matrix, estimates the spatial covariance matrix of the non-target sound source from the block matrices RN, -PN, and -RN contained in ^RN (S140), and outputs the estimated value VN∈SM +. For example, the spatial covariance matrix of the non-target sound source is estimated by equation (9).
Alternatively, the second spatial covariance matrix estimation unit 140 may receive the estimated value ^RN of the spatio-temporal covariance matrix and the dereverberation filter ^G estimated by the dereverberation filter estimation unit 130, and estimate the spatial covariance matrix of the non-target sound source from ^RN and ^G using equation (9).
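Equations (8) and (9) are not reproduced in this text. The following sketch therefore assumes the forms implied by the block description of ^RN and the statement that VN is the Schur complement of ^RN: ^G = -RN^{-1} -PN (the standard MCLP normal-equation form) and VN = RN − (-PN)H ^G. These assumed forms, and all variable names and data below, are illustrative, not the document's literal equations.

```python
import numpy as np

# Sketch under stated assumptions: ^G = bar_R_N^{-1} bar_P_N (assumed form of
# equation (8)) and V_N = R_N - bar_P_N^H ^G, the Schur complement of ^R_N
# (assumed form of equation (9)). ^R_N here is a random positive definite
# stand-in with the block layout described in the text.
M, L = 4, 2
ML = M * L
rng = np.random.default_rng(2)
A = rng.standard_normal((M + ML, M + ML)) + 1j * rng.standard_normal((M + ML, M + ML))
R_hat = A @ A.conj().T + (M + ML) * np.eye(M + ML)   # stand-in for ^R_N

# Block matrices as described in the text.
R_N = R_hat[:M, :M]            # rows/cols 1..M
P_bar = R_hat[M:, :M]          # rows (M+1)..(M+ML), cols 1..M
R_bar = R_hat[M:, M:]          # rows/cols (M+1)..(M+ML)

# MCLP-based dereverberation filter (assumed form of equation (8)).
G_hat = np.linalg.solve(R_bar, P_bar)

# Schur complement of ^R_N: the dereverberated non-target spatial covariance
# (assumed form of equation (9)); this also shows why V_N can be computed
# either from the blocks alone or from ^R_N together with ^G.
V_N = R_N - P_bar.conj().T @ G_hat

# The Schur complement of a positive definite matrix is positive definite.
assert np.all(np.linalg.eigvalsh((V_N + V_N.conj().T) / 2) > 0)
```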
<Beamformer estimation unit 150>
The beamformer estimation unit 150 receives the estimated value VS of the spatial covariance matrix of the target sound source, the estimated value VN of the spatial covariance matrix of the non-target sound source, and the estimated dereverberation filter ^G. From VS and VN, the beamformer estimation unit 150 obtains the MaxSNR beamformer wopt for the instantaneous mixture by equation (7). Note that equation (7) can be solved as the eigenvector associated with the largest eigenvalue λmax of the generalized eigenvalue problem

VS wopt = λmax VN wopt.
The beamformer estimation unit 150 then estimates the convolutional beamformer from the MaxSNR beamformer wopt for the instantaneous mixture and the estimated dereverberation filter ^G using equation (3) (S150), and outputs the estimated convolutional beamformer ^w.

<Sound source extraction unit 160>
The sound source extraction unit 160 receives the observed signal xf,t and the estimated convolutional beamformer ^w, performs beamforming processing according to the following equations, estimates the sound source signal (S160), and outputs the estimated value yf,t.
yf,t = ^wfH ^xf,t ∈ C
^wf ∈ CM+ML
^w = [^w1 | … | ^wF]
^xf,t = [xf,tT | xf,t-D-1T | … | xf,t-D-LT]T ∈ CM+ML

Here AH denotes the Hermitian transpose of A, AT denotes the transpose of A, Y=(yt)t=1 T is the estimate of the source signal S, and D is the prediction delay.
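The beamforming step above is a single inner product per frame between the convolutional beamformer and a stacked vector of the current and delayed observation frames. A minimal sketch, with illustrative random data in place of a real STFT and a real estimated beamformer:

```python
import numpy as np

# Illustrative sketch of y_{f,t} = ^w_f^H ^x_{f,t} for one frequency bin f,
# where ^x_{f,t} stacks the current frame and L delayed frames.
rng = np.random.default_rng(3)
M, L, D, T = 4, 2, 1, 50
X = rng.standard_normal((T, M)) + 1j * rng.standard_normal((T, M))   # frames x_{f,t}
w_hat = rng.standard_normal(M + M * L) + 1j * rng.standard_normal(M + M * L)  # stand-in ^w_f

def x_stacked(t):
    # ^x_{f,t} = [x_{f,t}^T | x_{f,t-D-1}^T | ... | x_{f,t-D-L}^T]^T
    return np.concatenate([X[t]] + [X[t - D - l] for l in range(1, L + 1)])

# Source signal estimate for every frame with enough past context.
y = np.array([w_hat.conj() @ x_stacked(t) for t in range(D + L, T)])
assert y.shape == (T - D - L,)
```

The prediction delay D skips the frames immediately preceding the current one, so the filter cancels late reverberation without cancelling the direct sound.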
<Spatial image estimation unit 170>
The scale of the convolutional beamformer ^wf in each frequency bin f is indeterminate, but it can be restored by estimating a vector uf that approximates the spatial image sf,t image as follows:

sf,t image = af sf,t ≈ uf yf,t = (uf wfH)(^GfH ^xf,t) ∈ CM,

where ^G = [^G1 | … | ^GF]. The vector uf is required to satisfy the following conditions.
(i) wfH uf = 1 (distortionless constraint)
(ii) uf ∝ VN,f wf (because ideally af ∝ VN,f wf holds)

Here VN = [VN,1 | … | VN,F]. Under these two constraints, the vector uf is uniquely determined as in equation (11). The spatial image estimation unit 170 receives the estimated value VN of the spatial covariance matrix of the non-target sound source, the estimated value yf,t, and the MaxSNR beamformer wopt for the instantaneous mixture; obtains the vector uf from VN and wopt by equation (11); approximates the spatial image sf,t image from the estimated value yf,t and the vector uf by the following equation; and outputs the approximate value uf yf,t.
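Constraints (i) and (ii) together pin down uf = VN,f wf / (wfH VN,f wf): condition (ii) fixes the direction and condition (i) fixes the scale. The sketch below assumes this closed form corresponds to equation (11), which is not reproduced in the text, and uses random stand-in matrices for illustration.

```python
import numpy as np

# Sketch of the scale restoration (assumed to correspond to equation (11)):
# (i) w_f^H u_f = 1 and (ii) u_f ∝ V_{N,f} w_f uniquely give
# u_f = V_{N,f} w_f / (w_f^H V_{N,f} w_f).
rng = np.random.default_rng(4)
M = 4
A = rng.standard_normal((M, M)) + 1j * rng.standard_normal((M, M))
V_Nf = A @ A.conj().T + M * np.eye(M)   # stand-in for V_{N,f} (positive definite)
w_f = rng.standard_normal(M) + 1j * rng.standard_normal(M)  # stand-in beamformer

u_f = V_Nf @ w_f / (w_f.conj() @ V_Nf @ w_f)

# Distortionless constraint (i) holds by construction.
assert np.isclose(w_f.conj() @ u_f, 1.0)

# Spatial image approximation s^image_{f,t} ≈ u_f y_{f,t} for one frame.
y_ft = 0.3 + 0.1j          # example source signal estimate
s_image = u_f * y_ft       # vector in C^M
```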
sf,t image ≈ uf yf,t

<Effect>
With the above configuration, introducing the MaxSNR criterion makes it possible to use all of the spatial information of the target sound source.
<Points of the second embodiment>
MVDR CBF (the method of estimating a CBF under the MVDR criterion) requires the steering vector of the target sound source to be estimated separately in advance. This raises the problems that the sound source extraction performance of the MVDR CBF depends strongly on the estimation performance of the steering vector and that the method is inconvenient to use. This embodiment resolves these problems.
To estimate the MaxSNR CBF, as shown in equations (1) and (2), the estimated value VS of the spatial covariance matrix of the target sound source and the estimated value ^RN of the spatio-temporal covariance matrix of the non-target sound source must be obtained in advance. This embodiment describes Blind MaxSNR CBF, which makes it unnecessary to obtain these two estimated values in advance. Here, "Blind" means that no prior knowledge is required.
The Blind MaxSNR CBF of this embodiment is a method of estimating the MaxSNR CBF by repeatedly performing a computation similar to the MaxSNR CBF given by equation (2) or equation (7).
The Blind MaxSNR CBF of this embodiment is defined as the following local optimal solution (equations (20a), (20b)) using an arbitrary super-Gaussian function φ: R≥0 → R and the Schur complement VX of the following matrix ^RX. Here θ = (^wf)f=1 F is the variable, yf,t = (^wf)H ^xf,t is the estimated value of the sound source signal, yt = [y1,t | … | yF,t]T ∈ CF, ||A||2 = √(AHA) is the Euclidean norm of a vector A, and C on the right-hand side of equation (20b) is a constant determined adaptively and heuristically at each iteration of the algorithm that maximizes or minimizes the function.
More specifically, the MaxSNR CBF is optimized without prior knowledge by an iterative optimization that alternates between a process of obtaining the spatio-temporal covariance matrix ^RZ,f, which is interpreted as the estimated value ^RN,f of the spatio-temporal covariance matrix of the non-target sound source, based on Equations (21) and (22) below, and a process of estimating the MaxSNR CBF ^w based on Equations (23)-(26) below.
yt k=[…|yf,t k|…]T, yf,t k=(^wf k)H^xf,t (22)
Here, k is an index indicating the iteration number.
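The alternating update described above can be illustrated schematically as follows. This is an illustrative sketch, not the disclosed procedure itself: Equation (21) is not reproduced in this text, so a common super-Gaussian-derived re-weighting of the covariance by 1/||yt|| is assumed, and the array shapes are implementation conventions.

```python
import numpy as np

def blind_maxsnr_iteration(x_hat, w, eps=1e-6):
    """One half-step of the alternating optimization: compute the current
    outputs y_f,t = (^w_f)^H ^x_f,t (Eq. (22)), then re-estimate a weighted
    space-time covariance ^R_Z,f per frequency (stand-in for Eq. (21)).

    x_hat : (F, T, M+ML) stacked observations ^x_f,t
    w     : (F, M+ML)    current convolutional beamformer estimates ^w_f
    """
    F, T, D = x_hat.shape
    # y[f, t] = sum_d conj(w[f, d]) * x_hat[f, t, d]
    y = np.einsum('fd,ftd->ft', w.conj(), x_hat)
    # per-frame weight from the norm of y_t across frequencies; the exact
    # weighting depends on the super-Gaussian contrast φ, which the text
    # leaves general -- 1/||y_t|| is one common choice (an assumption here)
    weight = 1.0 / np.maximum(np.linalg.norm(y, axis=0), eps)
    # weighted space-time covariance per frequency bin
    R_Z = np.einsum('t,ftd,fte->fde', weight, x_hat, x_hat.conj()) / T
    return y, R_Z
```

The second half-step (Equations (23)-(26)) would then re-estimate ^w from this ^R_Z, and the two steps repeat until convergence.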
In addition, at each iteration of the above iterative optimization, the scale of the MaxSNR CBF ^wf is aligned for each frequency f=1,…,F based on Equation (27) below.
wf←(uf,m)*wf=(em Tuf)*wf (27)
Here, m (1≦m≦M) is the index of the reference microphone, * denotes the complex conjugate, uf is given by Equation (11) (with VZ,f used in place of VN,f), and uf,m=em Tuf ∈ C is the m-th element of uf.
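The scale alignment of Equation (27) can be sketched as follows, assuming the vectors uf have already been obtained via Equation (11) (with VZ,f in place of VN,f). The 0-based reference-microphone index is an implementation convention.

```python
import numpy as np

def align_scale(w, u, m):
    """Eq. (27) applied per frequency: w_f <- (u_f,m)^* w_f, where u_f,m is
    the m-th element of u_f and m is the reference-microphone index (0-based).

    w : (F, M) beamformer vectors, u : (F, M) vectors from Eq. (11)
    """
    return np.conj(u[:, m])[:, None] * w
```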
<Second embodiment>
The explanation will focus on the parts that differ from the first embodiment.
FIG. 3 shows a functional block diagram of the signal processing device according to the second embodiment, and FIG. 4 shows its processing flow.
The signal processing device 200 includes an initialization unit 201, a first spatial covariance matrix estimation unit 210, a spatio-temporal covariance matrix estimation unit 220, a second spatial covariance matrix estimation unit 240, a dereverberation filter estimation unit 230, a beamformer estimation unit 250, a sound source extraction unit 160, and a determination unit 280.
The signal processing device 200 receives the observed signal xf,t observed by the microphones and the index m of the reference microphone as inputs, estimates the sound source signal sf,t, and outputs it. Here, f denotes the frequency, t denotes the frame number, and the observed signal xf,t and the sound source signal sf,t are frequency-domain signals. Alternatively, a time-domain observed signal may be input and converted into the frequency-domain observed signal xf,t by a frequency-domain transform unit (not shown), and the sound source signal sf,t may be converted into a time-domain sound source signal by a time-domain transform unit (not shown) before being output. Any method may be used for the frequency-domain and time-domain transforms; for example, the Fourier transform and the inverse Fourier transform can be used.
<Initialization unit 201>
The initialization unit 201 receives the index m of the reference microphone as input, sets the initial value ^w0=[^w1 0,…,^wF 0] of the convolutional beamformer ^w to be estimated by the following equation (S201), and outputs it.
Here, em is the unit vector corresponding to the reference microphone.
<First spatial covariance matrix estimation unit 210>
The first spatial covariance matrix estimation unit 210 receives the observed signal xf,t as input, estimates the spatial covariance matrix of the observed signal xf,t using Equations (28)-(30) (S210), and outputs the estimated value VX.
^xf,t=[xf,t T|xf,t-D-1 T|…|xf,t-D-L T]T ∈ CM+ML
RX,f is the block matrix consisting of the elements in rows 1 to M and columns 1 to M of the estimated value ^RX,f; -PX,f is the block matrix consisting of the elements in rows (M+1) to (M+ML) and columns 1 to M of ^RX,f; (-PX,f)H is the block matrix consisting of the elements in rows 1 to M and columns (M+1) to (M+ML) of ^RX,f; and -RX,f is the block matrix consisting of the elements in rows (M+1) to (M+ML) and columns (M+1) to (M+ML) of ^RX,f.
VX=[VX,1,...,VX,f,...,VX,F]
<Spatio-temporal covariance matrix estimation unit 220>
The spatio-temporal covariance matrix estimation unit 220 receives the convolutional beamformer ^wk estimated in the previous iteration (or its initial value ^w0) and the observed signal xf,t as inputs, obtains the spatio-temporal covariance matrix ^RZ=[^RZ,1,...,^RZ,f,...,^RZ,F], which is interpreted as the estimated value ^RN,f of the spatio-temporal covariance matrix of the non-target sound source, based on Equations (21) and (22) below (S220), and outputs it.
yt k=[…|yf,t k|…]T, yf,t k=(^wf k)H^xf,t (22)
Note that when the spatio-temporal covariance matrix ^RZ,f is obtained for the first time, in other words, before the beamformer estimation unit 250 described later has estimated the convolutional beamformer ^w, the output value of the initialization unit 201 is used as the initial value ^w0=[^w1 0,…,^wF 0] of the convolutional beamformer.
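The stacked vector ^xf,t used throughout this section can be built directly from its definition above. The zero padding for frames whose delayed indices fall before the start of the signal is an assumption, since the text does not specify boundary handling.

```python
import numpy as np

def stack_delayed(x, D, L):
    """Build ^x_f,t = [x_f,t ; x_f,t-D-1 ; ... ; x_f,t-D-L] in C^{M+ML}
    for every frame t at one frequency bin.

    x : (T, M) multichannel observation at one frequency
    D : prediction delay, L : number of delayed blocks
    """
    T, M = x.shape
    # zero-pad the past so that negative time indices read as zeros
    xp = np.vstack([np.zeros((D + L, M), dtype=x.dtype), x])
    # part l (l = 1..L) holds x_{t-D-l}; padded row (L - l + t) maps to t-D-l
    parts = [x] + [xp[L - l : L - l + T] for l in range(1, L + 1)]
    return np.concatenate(parts, axis=1)  # (T, M + M*L)
```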
<Dereverberation filter estimation unit 230>
The dereverberation filter estimation unit 230 receives the spatio-temporal covariance matrix ^RZ,f as input, estimates a dereverberation filter from -PZ,f and -RZ,f contained in the estimate ^RZ,f (S230), and outputs the estimated dereverberation filter ^G. For example, the dereverberation filter is estimated by Equation (25).
Here, RZ,f is the block matrix consisting of the elements in rows 1 to M and columns 1 to M of the spatio-temporal covariance matrix ^RZ,f; -PZ,f is the block matrix consisting of the elements in rows (M+1) to (M+ML) and columns 1 to M of ^RZ,f; (-PZ,f)H is the block matrix consisting of the elements in rows 1 to M and columns (M+1) to (M+ML) of ^RZ,f; and -RZ,f is the block matrix consisting of the elements in rows (M+1) to (M+ML) and columns (M+1) to (M+ML) of ^RZ,f.
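Given the block partition above, the filter estimate can be sketched as follows. Equation (25) itself is not reproduced in this text, so the standard WPE-style solution Gf = (-RZ,f)^-1 (-PZ,f), consistent with that partition, is assumed; the diagonal regularization is an implementation detail.

```python
import numpy as np

def estimate_dereverb_filter(R_Z, M, reg=1e-6):
    """Sketch of the dereverberation-filter estimate per frequency bin.

    R_Z : (F, M+ML, M+ML) space-time covariance matrices ^R_Z,f
    M   : number of microphones; returns (F, ML, M) filters
    """
    G = []
    for R in R_Z:
        P = R[M:, :M]                                     # block -P_Z,f
        Rbar = R[M:, M:] + reg * np.eye(R.shape[0] - M)   # block -R_Z,f, regularized
        G.append(np.linalg.solve(Rbar, P))                # solve Rbar G_f = P
    return np.stack(G)
```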
<Second spatial covariance matrix estimation unit 240>
The second spatial covariance matrix estimation unit 240 receives the spatio-temporal covariance matrix ^RZ,f as input, estimates the spatial covariance matrix of the non-target sound source from RZ,f, -PZ,f, and -RZ,f contained in ^RZ,f (S240), and outputs the estimated value VZ,f ∈ SM+ML +. For example, the spatial covariance matrix of the non-target sound source is estimated by Equation (31).
Note that the second spatial covariance matrix estimation unit 240 may instead receive the spatio-temporal covariance matrix ^RZ,f and the dereverberation filter ^G estimated by the dereverberation filter estimation unit 230 as inputs, and estimate the spatial covariance matrix of the non-target sound source from ^RZ,f and ^G using Equation (31).
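A sketch of this estimate follows, assuming Equation (31) is the Schur complement of ^RZ,f with respect to its lower-right block, i.e. the same Schur-complement construction named earlier for VX; the true Equation (31) is not reproduced in this text.

```python
import numpy as np

def estimate_nontarget_spatial_cov(R_Zf, M):
    """Schur complement of the space-time covariance ^R_Z,f:
        V_Z,f = R_Z,f - (-P_Z,f)^H (-R_Z,f)^{-1} (-P_Z,f)
    using the block partition described in the text.
    """
    R = R_Zf[:M, :M]       # block  R_Z,f
    P = R_Zf[M:, :M]       # block -P_Z,f
    Rbar = R_Zf[M:, M:]    # block -R_Z,f
    return R - np.conj(P).T @ np.linalg.solve(Rbar, P)
```

The sketch returns an M×M matrix; the text's VZ,f ∈ SM+ML + suggests the estimate is embedded in a larger (M+ML)×(M+ML) matrix, a detail omitted here.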
<Beamformer estimation unit 250>
The beamformer estimation unit 250 receives as inputs the estimated value VX=[VX,1,…,VX,F] of the spatial covariance matrix of the observed signal xf,t, the estimated value VZ=[VZ,1,…,VZ,F] of the spatial covariance matrix of the non-target sound source, and the estimated dereverberation filter ^G=[^G1,…,^GF]. The beamformer estimation unit 250 obtains wf k+1 from the estimated value VX of the spatial covariance matrix of the observed signal xf,t and the estimated value VZ of the spatial covariance matrix of the non-target sound source using Equation (24).
The beamformer estimation unit 250 then estimates a convolutional beamformer from the MaxSNR beamformer wf k+1 for instantaneous mixing and the estimated dereverberation filter ^G using Equation (23) (S250).
The beamformer estimation unit 250 further obtains a vector uf from the estimated value VZ=[VZ,1,…,VZ,F] of the spatial covariance matrix of the non-target sound source and the convolutional beamformer ^wk+1 by the following equation.
Finally, based on Equation (29) below, the beamformer estimation unit 250 aligns the scale of the MaxSNR CBF ^wf k+1 for each frequency f=1,…,F using the m-th element uf,m of the vector uf, and outputs the scale-aligned convolutional beamformer ^wk+1.
^wk+1←(uf,m)*^wk+1=(em Tuf)*^wk+1 (29)
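The per-frequency weight update of Equation (24) can be sketched as follows. The equation itself is not reproduced in this text, so the classical MaxSNR solution — the principal generalized eigenvector of the pencil (VX,f, VZ,f) — is assumed, and the matrix shapes are illustrative.

```python
import numpy as np

def maxsnr_weights(V_X, V_Z):
    """Sketch of the MaxSNR beamformer update per frequency bin:
    w_f maximizes (w^H V_X,f w) / (w^H V_Z,f w), i.e. the principal
    generalized eigenvector of (V_X,f, V_Z,f).

    V_X, V_Z : (F, M, M) spatial covariance matrices
    """
    W = []
    for Vx, Vz in zip(V_X, V_Z):
        vals, vecs = np.linalg.eig(np.linalg.solve(Vz, Vx))  # V_Z^{-1} V_X
        W.append(vecs[:, np.argmax(vals.real)])              # principal eigenvector
    return np.stack(W)
```

The scale ambiguity of each eigenvector is what the alignment of Equation (29) subsequently resolves.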
<Sound source extraction unit 160>
The sound source extraction unit 160 receives the observed signal xf,t and the estimated convolutional beamformer ^wk+1 as inputs, performs beamforming processing by the following equation to estimate the sound source signal (S160), and outputs the estimated value yf,t.
yf,t=(^wf k+1)H^xf,t ∈ C
^wk+1 ∈CM+ML
^wk+1=[^w1 k+1| … | ^wF k+1]
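The beamforming step above is a direct inner product per frequency and frame; the array shapes are implementation conventions.

```python
import numpy as np

def apply_cbf(x_hat, w):
    """y_f,t = (^w_f^{k+1})^H ^x_f,t for every frequency and frame.

    x_hat : (F, T, M+ML) stacked observations, w : (F, M+ML) beamformers
    """
    return np.einsum('fd,ftd->ft', np.conj(w), x_hat)
```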
<Determination unit 280>
The determination unit 280 determines whether a convergence condition is satisfied (S280). If it is satisfied (YES in S280), the estimated value yf,t at that point is output as the output of the signal processing device and the processing ends. If it is not satisfied (NO in S280), the determination unit 280 sends control signals to the respective units so that S220 to S160 are repeated. Note that the estimated value yf,t output by the sound source extraction unit 160 can be reused in the spatio-temporal covariance matrix estimation unit 220, so that the computation of Equation (22) can be omitted. As the convergence condition, conditions such as whether the processing has been repeated a fixed number of times (for example, several times), or whether the difference between the convolutional beamformer ^wk+1 before and after an update is at most a predetermined threshold, can be used.
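The convergence test of S280 can be sketched as follows; both criteria are named in the text, but the iteration count and threshold values here are illustrative.

```python
import numpy as np

def converged(w_new, w_old, k, max_iter=5, tol=1e-4):
    """Stop after a fixed number of iterations, or when the relative change
    of the convolutional beamformer falls below a threshold."""
    if k >= max_iter:
        return True
    num = np.linalg.norm(w_new - w_old)
    den = max(np.linalg.norm(w_old), 1e-12)
    return bool(num / den <= tol)
```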
<Effect>
With this configuration, the same effect as in the first embodiment can be obtained. Furthermore, the Blind MaxSNR CBF of this embodiment is an extremely fast method that can estimate the MaxSNR CBF with high accuracy in at most a few iterations.
In this embodiment, the estimated value yf,t of the sound source signal sf,t is output; however, a spatial image estimation unit 170 may be provided so that an approximation ufyf,t of the spatial image sf,t image is obtained from the estimated value yf,t at the time the convergence condition is satisfied, and output.
<Points of the third embodiment>
In this embodiment, as a by-product of the Blind MaxSNR CBF of the second embodiment, we realize "Iteratively Reweighted MaxSNR CBF (IR-MaxSNR CBF)", a method for estimating the MaxSNR CBF with high accuracy in the situation where the spatial covariance matrix VS of the target sound source is known (i.e., estimated in advance) while the spatio-temporal covariance matrix ^RN of the unwanted sound is unknown (i.e., not estimated in advance).
When the spatial covariance matrix VS of the target sound source can be estimated with high accuracy, using that information allows the MaxSNR CBF to be estimated more accurately than with the Blind MaxSNR CBF of the second embodiment.
<Third embodiment>
The explanation will focus on the parts that differ from the second embodiment.
FIG. 5 shows a functional block diagram of the signal processing device according to the third embodiment, and FIG. 6 shows its processing flow.
The signal processing device 300 includes an initialization unit 201, a first spatial covariance matrix estimation unit 110, a spatio-temporal covariance matrix estimation unit 220, a second spatial covariance matrix estimation unit 240, a dereverberation filter estimation unit 230, a beamformer estimation unit 250, a sound source extraction unit 160, and a determination unit 280.
This embodiment differs from the second embodiment in that it includes the first spatial covariance matrix estimation unit 110 in place of the first spatial covariance matrix estimation unit 210. The first spatial covariance matrix estimation unit 110 is as described in the first embodiment. The beamformer estimation unit 250 also differs from the second embodiment in that it uses the estimated value VS of the spatial covariance matrix of the target sound source in place of the estimated value VX of the spatial covariance matrix of the observed signal xf,t. The other processing is the same as in the second embodiment.
<Other variations>
The present invention is not limited to the embodiments and modifications described above. For example, the various processes described above may be executed not only in chronological order as described, but also in parallel or individually, depending on the processing capability of the device executing the processes or as needed. Other changes may be made as appropriate without departing from the spirit of the present invention.
<Program and recording medium>
The various processes described above can be implemented by loading a program that executes the steps of the above method into the recording unit 2020 of the computer 2000 shown in FIG. 7 and causing the control unit 2010, the input unit 2030, the output unit 2040, the display unit 2050, and so on to operate.
The program describing this processing content can be recorded on a computer-readable recording medium. The computer-readable recording medium may be of any type, for example a magnetic recording device, an optical disk, a magneto-optical recording medium, or a semiconductor memory.
This program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. Alternatively, the program may be distributed by storing it in the storage device of a server computer and transferring it from the server computer to another computer via a network.
A computer that executes such a program, for example, first stores the program recorded on the portable recording medium, or the program transferred from the server computer, in its own storage device. When executing the processing, the computer reads the program stored in its own recording medium and executes processing according to the read program. As another execution form, the computer may read the program directly from the portable recording medium and execute processing according to it, or may execute processing according to the received program each time the program is transferred to it from the server computer. The above processing may also be executed by a so-called ASP (Application Service Provider) type service, in which the processing functions are realized only through execution instructions and result acquisition, without transferring the program from the server computer to the computer. Note that the program in this embodiment includes information that is provided for processing by an electronic computer and that is equivalent to a program (such as data that is not a direct command to the computer but has the property of defining the computer's processing).
In this embodiment, the present device is configured by executing a predetermined program on a computer, but at least part of the processing content may be implemented in hardware.
Claims (6)
- A signal processing device comprising:
a second spatial covariance matrix estimation unit that estimates a spatial covariance matrix of a non-target sound source using an estimated value of a spatio-temporal covariance matrix of the non-target sound source;
a dereverberation filter estimation unit that estimates a dereverberation filter using the estimated value of the spatio-temporal covariance matrix of the non-target sound source;
a beamformer estimation unit that estimates a convolutional beamformer using an estimated value of a spatial covariance matrix of an observed signal or of a target sound source, the estimated value of the spatial covariance matrix of the non-target sound source, and the estimated dereverberation filter; and
a sound source extraction unit that performs beamforming processing using the observed signal and the estimated convolutional beamformer to estimate a sound source signal.
- The signal processing device according to claim 1, further comprising:
a first spatial covariance matrix estimation unit that estimates, from the observed signal, a section including sound emitted by the target sound source (hereinafter also referred to as a target signal) and estimates the spatial covariance matrix of the target sound source using the estimated target signal; and
a spatio-temporal covariance matrix estimation unit that estimates, from the observed signal, a section not including sound emitted by the target sound source (hereinafter also referred to as a non-target signal) and estimates the spatio-temporal covariance matrix of the non-target sound source using the estimated non-target signal,
wherein the beamformer estimation unit estimates the convolutional beamformer using the estimated value of the spatial covariance matrix of the target sound source, the estimated value of the spatial covariance matrix of the non-target sound source, and the estimated dereverberation filter.
- The signal processing device according to claim 1, further comprising:
a first spatial covariance matrix estimation unit that estimates the spatial covariance matrix of the observed signal using the observed signal; and
a spatio-temporal covariance matrix estimation unit that estimates the spatio-temporal covariance matrix of the non-target sound source using the observed signal and the estimated convolutional beamformer,
wherein the beamformer estimation unit estimates the convolutional beamformer using the estimated value of the spatial covariance matrix of the observed signal, the estimated value of the spatial covariance matrix of the non-target sound source, and the estimated dereverberation filter, and
the processing in the spatio-temporal covariance matrix estimation unit, the second spatial covariance matrix estimation unit, the dereverberation filter estimation unit, the beamformer estimation unit, and the sound source extraction unit is repeated until a convergence condition is satisfied.
- The signal processing device according to claim 1, further comprising:
a first spatial covariance matrix estimation unit that estimates, from the observed signal, a section including sound emitted by the target sound source (hereinafter also referred to as a target signal) and estimates the spatial covariance matrix of the target sound source using the estimated target signal; and
a spatio-temporal covariance matrix estimation unit that estimates the spatio-temporal covariance matrix of the non-target sound source using the observed signal and the estimated convolutional beamformer,
wherein the beamformer estimation unit estimates the convolutional beamformer using the estimated value of the spatial covariance matrix of the target sound source, the estimated value of the spatial covariance matrix of the non-target sound source, and the estimated dereverberation filter, and
the processing in the spatio-temporal covariance matrix estimation unit, the second spatial covariance matrix estimation unit, the dereverberation filter estimation unit, the beamformer estimation unit, and the sound source extraction unit is repeated until a convergence condition is satisfied.
- a second spatial covariance matrix estimation step of estimating a spatial covariance matrix of a non-target sound source using an estimated value of a spatio-temporal covariance matrix of the non-target sound source;
a dereverberation filter estimation step of estimating a dereverberation filter using the estimated value of the spatio-temporal covariance matrix of the non-target sound source;
a beamformer estimation step of estimating a convolutional beamformer using an estimated value of a spatial covariance matrix of an observed signal or of a target sound source, the estimated value of the spatial covariance matrix of the non-target sound source, and the estimated dereverberation filter; and
前記観測信号と推定した前記畳み込みビームフォーマとを用いて、ビームフォーミング処理を行い、音源信号を推定する音源抽出ステップとを含む、
信号処理方法。 a second spatial covariance matrix estimation step of estimating a spatial covariance matrix of the non-target sound source using the estimated value of the spatio-temporal covariance matrix of the non-target sound source;
a dereverberation filter estimation step of estimating a dereverberation filter using the estimated value of the spatiotemporal covariance matrix of the non-target sound source;
a beamformer estimation step of estimating a convolutional beamformer using the estimated value of the spatial covariance matrix of the observed signal or the target sound source, the estimated value of the spatial covariance matrix of the non-target sound source, and the estimated dereverberation filter; and,
a sound source extraction step of performing beamforming processing using the observed signal and the estimated convolutional beamformer to estimate a sound source signal;
Signal processing method. - 請求項1から請求項4の何れかに記載の信号処理装置としてコンピュータを機能させるためのプログラム。 A program for causing a computer to function as the signal processing device according to any one of claims 1 to 4.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title
---|---|---|---
PCT/JP2022/031099 WO2024038522A1 (en) | 2022-08-17 | 2022-08-17 | Signal processing device, signal processing method, and program
Publications (1)
Publication Number | Publication Date |
---|---|
WO2024038522A1 true WO2024038522A1 (en) | 2024-02-22 |
Family
ID=89941461
Country Status (1)
Country | Link
---|---
WO (1) | WO2024038522A1 (en)
Citations (1)

Publication number | Priority date | Publication date | Assignee | Title
---|---|---|---|---
WO2020121545A1 * | 2018-12-14 | 2020-06-18 | Nippon Telegraph and Telephone Corporation | Signal processing device, signal processing method, and program
Similar Documents

Publication | Title
---|---
JP4422692B2 (en) | Transmission path estimation method, dereverberation method, sound source separation method, apparatus, program, and recording medium
Heymann et al. | A generic neural acoustic beamforming architecture for robust multi-channel speech processing
US11894010B2 | Signal processing apparatus, signal processing method, and program
JPWO2007100137A1 | Reverberation removal apparatus, dereverberation removal method, dereverberation removal program, and recording medium
KR102410850B1 | Method and apparatus for extracting reverberant environment embedding using dereverberation autoencoder
JP6815956B2 | Filter coefficient calculator, its method, and program
WO2024038522A1 (en) | Signal processing device, signal processing method, and program
JP7428251B2 | Target sound signal generation device, target sound signal generation method, program
US11322169B2 | Target sound enhancement device, noise estimation parameter learning device, target sound enhancement method, noise estimation parameter learning method, and program
JP2018031910A | Sound source emphasis learning device, sound source emphasis device, sound source emphasis learning method, program, and signal processing learning device
JP7444243B2 | Signal processing device, signal processing method, and program
US11676619B2 | Noise spatial covariance matrix estimation apparatus, noise spatial covariance matrix estimation method, and program
JP7156064B2 | Latent variable optimization device, filter coefficient optimization device, latent variable optimization method, filter coefficient optimization method, program
US11297418B2 | Acoustic signal separation apparatus, learning apparatus, method, and program thereof
WO2021171406A1 (en) | Signal processing device, signal processing method, and program
WO2022180741A1 (en) | Acoustic signal enhancement device, method, and program
WO2023276170A1 (en) | Acoustic signal enhancement device, acoustic signal enhancement method, and program
JP2020030373A | Sound source enhancement device, sound source enhancement learning device, sound source enhancement method, program
WO2021100215A1 (en) | Sound source signal estimation device, sound source signal estimation method, and program
WO2021144934A1 (en) | Voice enhancement device, learning device, methods therefor, and program
JP7375904B2 | Filter coefficient optimization device, latent variable optimization device, filter coefficient optimization method, latent variable optimization method, program
JP7375905B2 | Filter coefficient optimization device, filter coefficient optimization method, program
Moir et al. | Decorrelation of multiple non-stationary sources using a multivariable crosstalk-resistant adaptive noise canceller
JP2018191255A | Sound collecting device, method thereof, and program
JP6989031B2 | Transfer function estimator, method and program
Legal Events
Date | Code | Title | Description
---|---|---|---
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 22955696; Country of ref document: EP; Kind code of ref document: A1