WO2022162878A1 - Signal processing device, signal processing method, signal processing program, learning device, learning method, and learning program - Google Patents
Signal processing device, signal processing method, signal processing program, learning device, learning method, and learning program
- Publication number: WO2022162878A1 (application PCT/JP2021/003278)
- Authority
- WO
- WIPO (PCT)
Classifications
- H04R3/005: Circuits for transducers, loudspeakers or microphones, for combining the signals of two or more microphones
- H04R1/326: Arrangements for obtaining desired frequency or directional characteristics, for obtaining desired directional characteristic only, for microphones
- H04R1/40: Arrangements for obtaining desired directional characteristic only by combining a number of identical transducers
- H04R3/00: Circuits for transducers, loudspeakers or microphones
- H04R1/406: Arrangements for obtaining desired directional characteristic only by combining a number of identical transducers, microphones
Definitions
- The present invention relates to a signal processing device, a signal processing method, a signal processing program, a learning device, a learning method, and a learning program.
- Array signal processing technology using microphone arrays is widely used in applications such as speech enhancement, sound source separation, and sound source direction estimation.
- The performance of array signal processing basically depends on the number of microphones, but in actual operation many devices have constraints that make it difficult to increase the number of microphones. It is therefore desirable to improve the performance of microphone array technology when the number of microphones is small.
- Hiroki Katahira et al., "Nonlinear speech enhancement by virtual increase of channels and maximum SNR beamformer", [online], [searched on January 25, 2021], Internet <URL: https://asp-eurasipjournals.springeropen.com/track/pdf/10.1186/s13634-015-0301-3.pdf>
- In this prior approach, the virtual microphone signal was estimated based on a physical model. However, this physical model does not always hold in practice, and the problem was that it is difficult to estimate the virtual microphone signal (especially its phase).
- The present invention has been made in view of the above, and an object of the present invention is to provide a signal processing device, a signal processing method, a signal processing program, a learning device, a learning method, and a learning program that make it possible to accurately estimate the observed signal of a virtual microphone.
- To solve the above problem, the present invention is a signal processing device for processing acoustic signals, characterized by including an estimating unit that uses a deep learning model having a neural network to estimate, from the input observed signal of a real microphone, the observed signal of a virtually arranged virtual microphone.
- The learning device of the present invention includes an input unit that receives, as learning data, the observed signal of a real microphone and the observed signal actually observed at the position of a virtually arranged virtual microphone that is the target of estimation; an estimating unit that estimates the observed signal of the virtual microphone from the input observed signal of the real microphone using a deep learning model having a neural network; and an updating unit that updates the parameters of the neural network so that the estimated observed signal of the virtual microphone approaches the observed signal actually observed at the position of the virtual microphone.
- FIG. 1 is a diagram schematically showing an example of an estimation device according to Embodiment 1.
- FIG. 2 is a flowchart illustrating a processing procedure of estimation processing according to the first embodiment.
- FIG. 3 is a diagram schematically showing an example of a learning device according to Embodiment 2.
- FIG. 4 is a flowchart of a learning process procedure according to the second embodiment.
- FIG. 5 is a diagram schematically showing an example of a signal processing device according to Embodiment 3.
- FIG. 6 is a diagram showing the microphone array arrangement of the CHiME-4 corpus.
- FIG. 7 is a diagram illustrating an example of a computer that realizes an estimation device, a learning device, and a signal processing device by executing programs.
- FIG. 1 is a diagram schematically showing an example of an estimation device according to Embodiment 1.
- The estimating device 10 (estimating unit) is realized, for example, by reading a predetermined program into a computer including a ROM (Read Only Memory), a RAM (Random Access Memory), a CPU (Central Processing Unit), and the like, and having the CPU execute the predetermined program.
- the estimating device 10 also has a communication interface for transmitting and receiving various information to and from another device connected via a wired connection or a network.
- As shown in FIG. 1, the estimation device 10 according to Embodiment 1 has an NN 11.
- FIG. 1 shows an example in which two channels corresponding to actually observed real microphones are received and one channel corresponding to a virtual microphone is generated.
- the NN 11 estimates the observed signal (amplitude and phase components) of the virtually arranged virtual microphone from the input observed signal observed by the real microphone.
- Real microphones are actually installed microphones (microphones 1 and 3 in FIG. 1).
- the observed signal r of the real microphone is the mixed acoustic signal (in FIG. 1, 1 and 3 circled by a solid line) observed by the real microphone.
- the virtual microphone is a microphone (microphone 2 in FIG. 1) that is virtually placed at a position different from the position of the real microphone.
- The NN 11 estimates and outputs the observed signal v̂ of the virtual microphone (2 indicated by a dashed circle in FIG. 1).
- NN11 is, for example, a time-domain deep learning model with high phase estimation performance.
- NN11 is a NN that operates directly in the time domain without being based on physical assumptions and can accurately estimate time domain signals.
- the estimating apparatus 10 uses the NN 11 to estimate a time-domain signal, which is the observed signal of the virtual microphone, from the input time-domain signal, which is the observed signal of the real microphone.
- This framework is called the NN-VME (Neural Network-based Virtual Microphone Estimator).
- the NN 11 does not necessarily have to be a time domain model, and may be realized by a frequency domain model.
- NN 11 has encoder 111 , convolution block 112 and decoder 113 .
- the encoder 111 is a neural network that maps an acoustic signal to a predetermined feature space, that is, converts the acoustic signal into a feature amount vector.
- the convolution block 112 is a set of layers for performing one-dimensional convolution and the like.
- the decoder 113 is a neural network that maps feature quantities in a predetermined feature space to the space of acoustic signals, that is, converts feature quantity vectors into acoustic signals.
- The NN 11 outputs the observed signal converted by the decoder 113 as the estimated signal v̂ of the virtual microphone.
- the configuration of the convolution block, encoder and decoder is described in reference 1 (Y. Luo and N. Mesgarani, "Conv-TasNet: Surpassing ideal time-frequency magnitude masking for speech separation", IEEE/ACM Trans. ASLP, vol. 27, No. 8, pp. 1256-1266, 2019.).
- the acoustic signal in the time domain may be obtained by the method described in Reference 1.
- each feature amount in the following description shall be represented by a vector.
- The estimation is expressed as Equation (1): [v̂_1, ..., v̂_{C_v}] = NN-VME(r_1, ..., r_{C_r}), where r_c is the time-domain waveform of length T of the c-th real microphone, v̂_{c'} is the estimated signal of the c'-th virtual microphone, C_r is the number of observed channels (i.e., real microphones), C_v is the number of estimated channels (i.e., virtual microphones), and NN-VME(·) denotes the neural network.
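The encoder / convolution block / decoder flow described above can be sketched in numpy. This is only an illustration of the tensor shapes and data flow with random, untrained weights and hypothetical layer sizes; it is not the actual Conv-TasNet-style model of Reference 1:

```python
import numpy as np

rng = np.random.default_rng(0)

def nn_vme(r, n_filters=64, kernel=16, stride=8):
    """Toy NN-VME sketch: encoder -> 1-D conv block -> decoder.

    r: (C_r, T) real-microphone waveforms.
    Returns a (1, T) estimated virtual-microphone waveform (C_v = 1 here).
    Weights are random and untrained; this only shows the data flow.
    """
    C_r, T = r.shape
    # Encoder: strided framing plus a linear map to a feature vector per frame.
    n_frames = (T - kernel) // stride + 1
    enc_w = rng.standard_normal((n_filters, C_r * kernel)) * 0.1
    frames = np.stack([r[:, i * stride:i * stride + kernel].ravel()
                       for i in range(n_frames)])        # (n_frames, C_r*kernel)
    feats = np.maximum(frames @ enc_w.T, 0.0)            # ReLU features

    # Convolution block: temporal convolution with a residual connection.
    conv_w = rng.standard_normal((n_filters, 3)) * 0.1
    padded = np.pad(feats, ((1, 1), (0, 0)))
    mixed = sum(padded[k:k + n_frames] * conv_w[:, k] for k in range(3))
    feats = feats + mixed

    # Decoder: overlap-add of per-frame basis signals back to the time domain.
    dec_w = rng.standard_normal((n_filters, kernel)) * 0.1
    v = np.zeros(T)
    for i in range(n_frames):
        v[i * stride:i * stride + kernel] += feats[i] @ dec_w
    return v[np.newaxis, :]                              # (1, T)

r = rng.standard_normal((2, 160))   # C_r = 2 real microphones, T = 160 samples
v_hat = nn_vme(r)
print(v_hat.shape)                   # (1, 160)
```

The key point is that both the input r and the output v̂ are raw time-domain waveforms, so phase information is carried implicitly rather than estimated from a physical model.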
- FIG. 2 is a flowchart illustrating a processing procedure of estimation processing according to the first embodiment.
- the estimating apparatus 10 converts the input observation signal r of the real microphone in the time domain into a feature amount (step S1).
- the convolution block 112 performs one-dimensional convolution (step S2).
- the decoder 113 converts the feature quantity into an observed signal at the position of the virtual microphone (step S3).
- The NN 11 outputs the observed signal converted by the decoder 113 as the estimated signal v̂ of the virtual microphone (step S4).
- The estimation apparatus 10 uses a time-domain deep learning model with high phase estimation performance to directly estimate both the amplitude and the phase of the virtual microphone signal from the input observed signal observed by the real microphone.
- This data-driven framework allows direct estimation of the virtual microphone signal (amplitude and phase components) without making explicit assumptions about the signal (e.g., a physical model).
- According to the first embodiment, it is possible to virtually increase the number of observation microphones and thereby improve the performance of microphone array technology even when the number of physical microphones is small.
- Next, Embodiment 2 will be described.
- a learning device for learning the NN 11 in the estimation device 10 will be described.
- The learning device 20 adopts supervised learning, and uses as learning data not only the observed signals of the real microphones actually placed during operation, but also the observed signal of a real microphone placed at the position of the virtual microphone.
- FIG. 3 is a diagram schematically showing an example of a learning device according to Embodiment 2.
- the same reference numerals are assigned to the same configurations as in the first embodiment, and the description thereof is omitted.
- The learning device 20 will be described using an example in which it receives two channels corresponding to real microphones and trains the NN 11 to generate one channel corresponding to a virtual microphone.
- the learning device 20 shown in FIG. 3 is realized, for example, by reading a predetermined program into a computer or the like including ROM, RAM, CPU, etc., and executing the predetermined program by the CPU.
- the learning device 20 also has a communication interface for transmitting and receiving various information to and from another device connected via a wired connection or a network.
- the learning device 20 has an NN 11 , an input section 21 and a parameter updating section 22 .
- The input unit 21 receives, as learning data, the observed signals of the real microphones installed during operation (microphones 1 and 3; 1 and 3 circled by a solid line in FIG. 3) and the observed signal actually observed at the position of the virtually arranged virtual microphone that is the target of estimation (microphone 2; 2 circled by a solid line in FIG. 3). The input unit 21 inputs the time-domain observed signal r of the real microphones to the NN 11, and inputs the observed signal t actually observed at the position of the virtual microphone to the parameter updating unit 22.
- From the input observed signal r of the real microphones, the NN 11 (estimating unit) estimates the observed signal v̂ of the virtual microphone (2 indicated by a dashed circle in FIG. 3).
- The parameter updating unit 22 updates the parameters of the NN 11 so that the observed signal v̂ of the virtual microphone estimated by the NN 11 approaches the observed signal t actually observed at the position of the virtual microphone.
- The learning device 20 employs supervised learning so that the NN 11, which is the NN-VME module, can estimate the virtual microphone signal. For this reason, during learning, the observed signal of a real microphone placed at the position of the virtual microphone is used as the learning target, together with the observed signals of the other real microphones.
- FIG. 3 shows the case where a subset of the microphones (e.g., channels 1 and 3) is assigned as the network input r and another subset (e.g., channel 2) is used as the network target t.
- the NN 11 is trained based on the time domain loss between the estimated signal and the real signal at the position of the virtual microphone.
- The parameter updating unit 22 employs, as the loss, for example a scale-dependent signal-to-noise ratio (SNR), as in Equation (2): L = −10 log₁₀(‖t‖² / ‖t − v̂‖²).
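The scale-dependent SNR loss can be written as a short numpy function. Since the patent's Equation (2) is not reproduced in this text, the sketch below follows the standard definition of negative SNR between the target t and the estimate v̂:

```python
import numpy as np

def snr_loss(t, v_hat, eps=1e-8):
    """Negative scale-dependent SNR: -10 log10(||t||^2 / ||t - v_hat||^2)."""
    err = t - v_hat
    return -10.0 * np.log10((np.sum(t**2) + eps) / (np.sum(err**2) + eps))

# A better estimate of the target yields a lower (more negative) loss.
t = np.sin(np.linspace(0, 8 * np.pi, 1000))
good = t + 0.01 * np.random.default_rng(0).standard_normal(1000)
bad = t + 0.5 * np.random.default_rng(1).standard_normal(1000)
print(snr_loss(t, good) < snr_loss(t, bad))  # True
```

Unlike the scale-invariant SNR common in source separation, this scale-dependent form also penalizes amplitude mismatch, which matters when the estimate must substitute for a real microphone signal.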
- FIG. 4 is a flowchart of a learning process procedure according to the second embodiment.
- First, the input unit 21 receives the observed signals of the real microphones installed during operation and the observed signal actually observed at the position of the virtually arranged virtual microphone that is the target of estimation (step S11).
- the input unit 21 inputs the observation signal r in the time domain of the actual microphone installed during operation to the NN 11 (step S12).
- The NN 11 performs the same processing as steps S1 to S4 shown in FIG. 2 to estimate the observed signal v̂ of the virtually arranged virtual microphone from the input observed signal r observed by the real microphones (steps S13 to S16).
- The parameter updating unit 22 updates the parameters of the NN 11 so that the observed signal v̂ of the virtual microphone estimated by the NN 11 approaches the observed signal t actually observed at the position of the virtual microphone (step S17).
- a parameter updating unit 22 updates the parameters of the NN 11 so that the loss calculated by Equation (2) is optimized.
- the parameter updating unit 22 determines whether or not the end condition is reached (step S18). If the end condition is met (step S18: Yes), the learning device 20 ends the process, and if the end condition is not met (step S18: No), the process returns to step S12.
- The termination conditions are, for example, that the parameters of the NN 11 have been updated a predetermined number of times, that the loss value used for updating the parameters has become equal to or less than a predetermined threshold, or that the parameter update amount (e.g., the differential value of the loss function) has become equal to or less than a predetermined threshold.
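The update loop of steps S12 to S18 can be sketched with a toy learner. Here a single linear layer stands in for the NN 11 (an illustrative assumption, not the patent's model): it is trained by gradient descent to predict a "virtual" middle channel from two outer channels, with the same kinds of termination conditions (iteration cap and small-update threshold):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "3-microphone" data: channel 2 is a fixed mix of channels 1 and 3.
src = rng.standard_normal(2000)
r = np.stack([src, 0.6 * src + 0.4 * np.roll(src, 1), np.roll(src, 1)])
x, t = r[[0, 2]].T, r[1]           # input: real mics 1 and 3; target: mic 2

w = np.zeros(2)                     # parameters of the toy estimator
lr, max_iters, tol = 0.1, 500, 1e-6
for step in range(max_iters):
    v_hat = x @ w                   # steps S13-S16: estimate virtual channel
    grad = -2.0 * x.T @ (t - v_hat) / len(t)   # gradient of the MSE loss
    w -= lr * grad                  # step S17: parameter update
    if np.linalg.norm(lr * grad) < tol:        # step S18: termination check
        break

print(np.allclose(w, [0.6, 0.4], atol=1e-3))   # True: mixing weights recovered
```

In the real system the model is nonlinear and the loss is the SNR of Equation (2), but the structure of the loop, estimate, compare against the real recording at the virtual position, update, check termination, is the same.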
- Unlike the training of speech enhancement methods, the learning device 20 does not require pairs of a noisy signal and a clean signal as training data.
- The learning device 20 needs only multi-channel noisy observed signals (mixed acoustic signals) as learning data. In other words, the learning device 20 can use actual recordings made with a large number of microphones as learning data, instead of simulated recordings.
- For the learning device 20, the preparation of learning data is therefore easy and inexpensive.
- the learning device 20 can build a powerful NN 11, which enables detailed modeling of real recordings.
- The estimation device 10 can generate virtual microphone signals and can therefore be combined with various array processing methods. In the third embodiment, a configuration in which the estimation device 10 is combined with a frequency-domain beamformer will be described as an example.
- FIG. 5 is a diagram schematically showing an example of a signal processing device according to Embodiment 3.
- the signal processing apparatus 100 shown in FIG. 5 is realized by, for example, reading a predetermined program into a computer or the like including ROM, RAM, CPU, etc., and executing the predetermined program by the CPU.
- the signal processing device 100 also has a communication interface for transmitting and receiving various information to and from another device connected via a wired connection or a network or the like.
- the signal processing device 100 has an estimation device 10, a microphone signal processing section 30, and an application section 40 (signal processing section).
- the microphone signal processing unit 30 generates a voice-enhanced signal from which noise components have been removed, based on the observed signal of the real microphone and the observed signal of the virtual microphone estimated by the estimation device 10 .
- the microphone signal processing unit 30 may include sound source separation processing, sound source localization processing, and the like.
- the application unit 40 performs another task-dependent process using the speech enhancement signal.
- the application unit 40 performs, for example, speech recognition processing.
- The processing order of the signal processing apparatus 100 is an example; in some cases, speech recognition processing is performed after sound source separation processing, or speech enhancement processing and sound source separation processing are performed after sound source localization processing.
- The microphone signal processing unit 30 then obtains the enhanced speech signal by applying a frequency-domain beamformer to the extended microphone signals in a frequency-domain representation (i.e., the Short-Time Fourier Transform (STFT)). Finally, the inverse STFT is used to recover the enhanced time-domain waveform.
- The beamformer output for each time-frequency bin is given by Equation (3): x̂_{t,f} = w_f^H Y_{t,f}, where Y_{t,f} ∈ ℂ^C is the vector containing the C-channel STFT coefficients of the extended microphone signals at time-frequency bin (t, f), w_f ∈ ℂ^C is the vector containing the beamforming filter coefficients, and (·)^H denotes the conjugate transpose.
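The per-bin filtering operation just described can be sketched in numpy; the channel, frame, and frequency-bin counts below are arbitrary illustrative values:

```python
import numpy as np

rng = np.random.default_rng(0)
C, T, F = 3, 50, 8                           # channels, frames, frequency bins

# Complex STFT coefficients of the extended (real + virtual) microphone signals.
Y = rng.standard_normal((T, F, C)) + 1j * rng.standard_normal((T, F, C))
# One complex filter vector per frequency bin (time-invariant).
w = rng.standard_normal((F, C)) + 1j * rng.standard_normal((F, C))

# x_hat[t, f] = w_f^H Y_{t,f}: conjugate the filter, then sum over channels.
x_hat = np.einsum('fc,tfc->tf', w.conj(), Y)
print(x_hat.shape)                            # (50, 8)
```

The filter w_f is applied identically to every frame, which is why it is called time-invariant; only the masks and covariance statistics used to derive it depend on time.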
- The microphone signal processing unit 30 calculates the time-invariant filter coefficients w_f using, for example, Minimum Variance Distortionless Response (MVDR) beamforming (Reference 2: Mehrez Souden, Jacob Benesty, and Sofiene Affes, "On optimal frequency-domain multichannel linear filtering for noise reduction", IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 2, pp. 260-276, 2010.).
- Φ_f^S ∈ ℂ^{C×C} and Φ_f^N ∈ ℂ^{C×C} are the spatial covariance (SC) matrices of the speech and noise signals, respectively.
- u ∈ ℝ^C is a one-hot vector representing the reference microphone.
- the SC matrix is estimated as shown in Equation (4) (Reference 3: Jahn Heymann, Lukas Drude, and Reinhold Haeb-Umbach, "Neural network based spectral mask estimation for acoustic beamforming", in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016, pp. 196-200.).
- m_{t,f}^S ∈ [0,1] and m_{t,f}^N ∈ [0,1] are the time-frequency masks for speech and noise, respectively.
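The mask-based SC matrix estimate and the MVDR filter can be sketched together for a single frequency bin. Since the patent's equations are not reproduced in this text, the MVDR form w_f = Φ_N^{-1} Φ_S u / tr(Φ_N^{-1} Φ_S) is assumed here from Reference 2, and the SC estimate follows the standard mask-weighted average of Reference 3:

```python
import numpy as np

rng = np.random.default_rng(0)
C, T = 3, 200                                  # channels, frames (one freq bin)

# Complex STFT coefficients for this bin, plus speech/noise masks.
Y = rng.standard_normal((T, C)) + 1j * rng.standard_normal((T, C))
m_S = rng.uniform(0, 1, T)
m_N = 1.0 - m_S

def sc_matrix(Y, mask):
    """Mask-weighted spatial covariance: sum_t m_t Y_t Y_t^H / sum_t m_t."""
    return np.einsum('t,tc,td->cd', mask, Y, Y.conj()) / mask.sum()

phi_S, phi_N = sc_matrix(Y, m_S), sc_matrix(Y, m_N)

u = np.eye(C)[0]                               # one-hot reference-mic vector
num = np.linalg.solve(phi_N, phi_S)            # Phi_N^{-1} Phi_S
w = (num @ u) / np.trace(num)                  # assumed MVDR filter (Ref. 2)

x_hat = Y @ w.conj()                           # w^H Y_t for every frame t
print(w.shape, x_hat.shape)
```

The same computation is repeated independently for every frequency bin f to obtain the full set of filters w_f.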
- The microphone signal processing unit 30 adds, to the spatial covariance matrices of the speech and noise signals, a loading term that reduces the weight given to the virtual microphone channels.
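One plausible form of this loading term adds a small diagonal penalty to the virtual-microphone entries of the covariance matrix, scaled by the hyperparameter δ (set to 0.05 later in the text). The exact expression of Equation (5) is not reproduced here, so the sketch below is an assumption:

```python
import numpy as np

def load_virtual_channels(phi, virtual_idx, delta=0.05):
    """Add diagonal loading on virtual-mic channels to down-weight them.

    phi: (C, C) spatial covariance matrix; virtual_idx: virtual-mic indices.
    Enlarging a diagonal entry of the noise covariance lowers the MVDR weight
    assigned to that channel. The precise form of Eq. (5) is assumed here.
    """
    phi = phi.copy()
    scale = delta * np.trace(phi).real / phi.shape[0]
    for c in virtual_idx:
        phi[c, c] += scale
    return phi

phi_N = np.array([[2.0, 0.1, 0.0],
                  [0.1, 1.5, 0.2],
                  [0.0, 0.2, 1.0]])
phi_loaded = load_virtual_channels(phi_N, virtual_idx=[2])
print(phi_loaded[2, 2] > phi_N[2, 2])   # True: virtual channel penalized
```

Intuitively, the estimated virtual signal is less reliable than a real recording, so the beamformer should trust it a little less; the loading term encodes exactly that.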
- The virtual microphone signal estimated by the estimation device 10 with the NN-VME module thus also improves speech enhancement and downstream signal processing performance.
- Two evaluation experiments were conducted: Evaluation Experiment 1 on the virtual microphone estimation performance of the NN-VME, and Evaluation Experiment 2 on the enhancement performance of a beamformer using the estimated virtual microphone. The experiments report results for estimating one virtual microphone, but the method can of course be extended to estimate a plurality of virtual microphones.
- Fig. 6 is a diagram showing the microphone array arrangement of the CHiME-4 corpus. All microphones except microphone 2 in FIG. 6 face the front.
- The NN-VME was trained and evaluated using the CHiME-4 corpus (Reference 4: Jon Barker, Ricard Marxer, Emmanuel Vincent, and Shinji Watanabe, "The third 'CHiME' speech separation and recognition challenge: Dataset, task and baselines", in IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2015, pp. 504-511.).
- the CHiME-4 corpus contains sounds recorded using a tablet device equipped with a 6-channel rectangular microphone array, as shown in FIG. This corpus contains not only simulated data, but also real-world recordings in noisy public environments.
- the training set consists of 3 hours of real speech data from 4 speakers and 15 hours of simulated speech data from 83 speakers.
- The evaluation set includes 1320 utterances each of real and simulated noisy speech data uttered by four speakers. Of these, an evaluation set of 1149 utterances was used, excluding utterances associated with microphone failure.
- As evaluation metrics, the signal-to-distortion ratio (SDR) of BSSEval (Reference 5: Emmanuel Vincent, Remi Gribonval, and Cedric Fevotte, "Performance measurement in blind audio source separation", IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, no. 4, pp. 1462-1469, 2006.) and the word error rate (WER) were used.
- The clean reverberant signal of the fourth channel was used as the reference signal. This evaluation is performed only on simulated data, since access to the clean signal is required.
- For speech recognition, Kaldi's CHiME-4 recipe (Reference 6: Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Lukas Burget, Ondrej Glembek, Nagendra Goel, Mirko Hannemann, Petr Motlicek, Yanmin Qian, Petr Schwarz, Jan Silovsky, Georg Stemmer, and Karel Vesely, "The Kaldi speech recognition toolkit", in IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2011.; Reference 7: [online], [searched January 25, 2021], Internet <URL: https://github.com/kaldi-asr/kaldi/tree/master/egs/chime4/s5_6ch>) was used.
- The NN-VME was trained using the Adam algorithm with gradient clipping (Reference 11: Diederik P. Kingma and Jimmy Ba, "Adam: A method for stochastic optimization", in International Conference on Learning Representations (ICLR), 2015.). The initial learning rate was set to 0.0001, and training was terminated after 200 epochs.
- For mask estimation, the neural network from the GitHub repository used in Kaldi's CHiME-4 recipe (Reference 12: [online], [searched January 25, 2021], Internet <URL: https://github.com/fgnt/nn-gev>) was used (see Reference 3).
- The loading hyperparameter δ in Eq. (5) was set to 0.05.
- Table 1 shows the SDR [dB] of the virtual microphone estimation using the noisy observed signal as the reference signal.
- RM represents a real microphone and VM represents a virtual microphone estimated by NN-VME (NN11).
- the reference signal for calculating the SDR is not a clean signal but an observed signal containing noise in the channel corresponding to the virtual microphone. Therefore, virtual microphone estimation performance can also be evaluated for real recordings.
- eval ch in the first column indicates the channel index of the virtual microphone signal or real microphone signal used as the estimated signal in calculating the SDR.
- “ref ch” in the second column indicates the channel index of the real microphone signal used as the reference signal.
- the notation “5(4,6)” indicates that the virtual microphone signal in channel 5 was estimated using the real microphone signals in channels 4 and 6.
- The score is compared with the SDR obtained with the closest (i.e., highest-SDR) real microphone.
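The SDR scores in these tables can be approximated by the simple energy-ratio definition below. BSSEval's full SDR additionally allows a short distortion filter on the reference, so this is a simplified sketch, not the exact metric of Reference 5:

```python
import numpy as np

def sdr(reference, estimate):
    """Simplified SDR in dB: 10 log10(||ref||^2 / ||ref - est||^2)."""
    err = reference - estimate
    return 10.0 * np.log10(np.sum(reference**2) / np.sum(err**2))

rng = np.random.default_rng(0)
ref = rng.standard_normal(16000)                   # e.g. a noisy observation
close = ref + 0.25 * rng.standard_normal(16000)    # good virtual-mic estimate
far = rng.standard_normal(16000)                   # unrelated signal
print(sdr(ref, close) > sdr(ref, far))             # True
```

Note that, as the text explains, the reference here is itself a noisy recording at the virtual position, which is what allows this evaluation on real (non-simulated) data.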
- Table 1 shows that the signal estimated by the NN-VME module (e.g., "5(4,6)") has a higher SDR score than the observed signal recorded by a nearby microphone (e.g., "4").
- Table 1 shows the results of interpolation (i.e., a virtual microphone positioned between real microphones, e.g., "5(4,6)") and of extrapolation in the lateral direction (e.g., "6(4,5)").
- the NN-VME (NN11) can predict a virtual microphone signal with an SDR of approximately 12 dB or more and little distortion in the time waveform.
- Table 2 shows the SDR [dB] of the beamformer using the clean signal as the reference signal. Note that the higher the SDR, the better, and the lower the WER [%], the better the performance.
- VM BF in Table 2 indicates a beamformer with an estimated virtual microphone (NN11 output), and RM BF indicates a beamformer with only real microphones.
- The "real" and "virtual" columns under "used channels" indicate the channel indices of the real and virtual microphones used to form the beamformer, respectively. For example, the "VM BF" in row (4) is formed using two real microphone signals (i.e., channels 4 and 6) and one virtual microphone signal (i.e., channel 5).
- Table 2 shows that the proposed VM BF (e.g., row (4)) has a higher SDR score than the RM BF formed from the same real microphone signals (e.g., row (2)).
- Table 2 shows the results of VM BF using virtual microphone loading.
- the WER score of VM BF without loading is 15.1% under the same conditions as row (4) and 13.4% under the same conditions as row (7). This indicates that virtual microphone loading is effective in improving the ASR performance of VM BF.
- The virtual microphone signal estimated by the NN-VME (NN 11) thus improves speech enhancement and downstream signal processing performance.
- Each component of the estimating device 10, the learning device 20, and the signal processing device 100 is functionally conceptual and does not necessarily need to be physically configured as illustrated. That is, the specific form of distribution and integration of the functions of these devices is not limited to the illustrated one; all or part of them can be functionally or physically distributed or integrated in arbitrary units.
- Each process performed in the estimation device 10, the learning device 20, and the signal processing device 100 may be implemented entirely or partially by a CPU, a GPU (Graphics Processing Unit), and a program that is analyzed and executed by the CPU and GPU, or may be implemented as hardware based on wired logic.
- FIG. 7 is a diagram showing an example of a computer that implements the estimation device 10, the learning device 20, and the signal processing device 100 by executing programs.
- the computer 1000 has a memory 1010 and a CPU 1020, for example.
- Computer 1000 also has hard disk drive interface 1030 , disk drive interface 1040 , serial port interface 1050 , video adapter 1060 and network interface 1070 . These units are connected by a bus 1080 .
- The memory 1010 includes a ROM 1011 and a RAM 1012.
- The ROM 1011 stores a boot program such as a BIOS (Basic Input/Output System).
- The hard disk drive interface 1030 is connected to a hard disk drive 1090.
- The disk drive interface 1040 is connected to a disk drive 1100.
- A removable storage medium such as a magnetic disk or an optical disk is inserted into the disk drive 1100.
- The serial port interface 1050 is connected to, for example, a mouse 1110 and a keyboard 1120.
- The video adapter 1060 is connected to, for example, a display 1130.
- The hard disk drive 1090 stores, for example, an OS (Operating System) 1091, application programs 1092, program modules 1093, and program data 1094. That is, a program that defines each process of the estimation device 10, the learning device 20, and the signal processing device 100 is implemented as a program module 1093 in which code executable by the computer 1000 is described. The program module 1093 is stored, for example, on the hard disk drive 1090.
- For example, the hard disk drive 1090 stores a program module 1093 for executing processing similar to the functional configurations of the estimation device 10, the learning device 20, and the signal processing device 100.
- The hard disk drive 1090 may be replaced by an SSD (Solid State Drive).
- The setting data used in the processing of the above-described embodiments is stored as program data 1094 in, for example, the memory 1010 or the hard disk drive 1090. The CPU 1020 then reads the program module 1093 and the program data 1094 stored in the memory 1010 or the hard disk drive 1090 into the RAM 1012 as necessary and executes them.
- The program module 1093 and the program data 1094 are not limited to being stored in the hard disk drive 1090; they may be stored, for example, in a removable storage medium and read by the CPU 1020 via the disk drive 1100 or the like. Alternatively, the program module 1093 and the program data 1094 may be stored in another computer connected via a network (LAN (Local Area Network), WAN (Wide Area Network), etc.) and then read by the CPU 1020 through the network interface 1070.
- 10 estimation device; 11 neural network (NN); 111 encoder; 112 convolution block; 113 decoder; 20 learning device; 21 input unit; 22 parameter updating unit; 30 microphone signal processing unit; 40 application unit; 100 signal processing device
Abstract
Description
In Embodiment 1, an estimation device that estimates the signal of a virtually placed virtual microphone for array signal processing using a microphone array will be described.
Next, the case where the NN 11 simultaneously estimates one or more virtual microphones will be described. Here, r_c is the time-domain waveform of length T of the c-th real microphone, and ^v_c' denotes the estimated signal of the c'-th virtual microphone. Given the real microphone signals r = {r_{c=1}, ..., r_{c=C_r}} as input, the NN 11, which is the NN-VME module, estimates the virtual microphone signals ^v = {^v_{c'=1}, ..., ^v_{c'=C_v}} as in Eq. (1).
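Concretely, Eq. (1) is a direct waveform-to-waveform mapping: C_r real-microphone waveforms of length T go in, and C_v virtual-microphone waveforms of the same length come out. The sketch below only fixes this input/output contract; the random linear mixing is a stand-in for the learned time-domain network (Conv-TasNet-based in the experiments), and the name `nn_vme_stub` and its weights are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

def nn_vme_stub(r, num_virtual):
    """Shape-level stand-in for the NN-VME module (NN 11): maps C_r real-mic
    waveforms of length T directly to C_v virtual-mic waveforms of length T.
    The random mixing below is only a placeholder for the learned network."""
    num_real, T = r.shape
    W = rng.standard_normal((num_virtual, num_real))  # placeholder for learned weights
    # Operating on raw waveforms means amplitude and phase are estimated jointly.
    return W @ r  # shape (C_v, T)

r = rng.standard_normal((5, 16000))   # e.g. C_r = 5 real mics, 1 s at 16 kHz
v_hat = nn_vme_stub(r, num_virtual=1)
```

The real module replaces the placeholder mixing with an encoder, stacked 1-D convolution blocks, and a decoder, but the interface — time-domain in, time-domain out — is the same.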
FIG. 2 is a flowchart showing the procedure of the estimation process according to Embodiment 1. In the estimation device 10, when the observed signals r of the real microphones are input, the input time-domain observed signals r are converted into features (step S1). The convolution block 112 then performs one-dimensional convolution (step S2).

In this way, the estimation device 10 directly estimates the observed signal of a virtual microphone from the input observed signals of the real microphones, using a time-domain deep learning model with high phase estimation performance. In Embodiment 1, this data-driven framework makes it possible to estimate the virtual microphone signal (both its amplitude and phase components) directly, without placing explicit assumptions (for example, a physical model) on the signal. By using a time-domain deep learning model with high phase estimation performance, the estimation device 10 thus achieves estimation of both the amplitude and the phase of the virtual microphone signal.

Next, Embodiment 2 will be described. Embodiment 2 describes a learning device that trains the NN 11 of the estimation device 10. To make the NN 11, which is the NN-VME module, estimate virtual microphone signals, the learning device 20 adopts supervised learning and uses, as learning data, the observed signals of real microphones placed at the virtual microphone positions, in addition to the observed signals of the real microphones actually deployed at run time.
Next, the learning process will be described. The learning device 20 adopts supervised learning to make the NN 11, which is the NN-VME module, estimate virtual microphone signals. For this reason, during learning, the observed signals of real microphones at the virtual microphone positions are used as learning targets together with the observed signals of the real microphones.

Next, the learning process according to Embodiment 2 will be described. FIG. 4 is a flowchart showing the procedure of the learning process according to Embodiment 2.

In this way, unlike the training of speech enhancement methods, the learning device 20 according to Embodiment 2 requires only the observed signals of multiple real microphones as learning data, without requiring paired noisy and clean signals. In other words, since the learning device 20 needs only multichannel noisy observed signals (mixture signals) as learning data, there is no restriction on the shape of the device, and mixture signals of a large number of channels can be used as learning data. That is, the learning device 20 can use real recordings made with many microphones as learning data as they are, instead of simulated recordings.
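The update step only requires that the estimated virtual-microphone signal approach the signal actually observed at the virtual position. One plausible time-domain objective for this is a negative signal-to-noise ratio between the two waveforms; the specific SNR form below is an assumption on our part, since this excerpt does not name the loss function:

```python
import numpy as np

def snr_loss(v_hat, v_obs, eps=1e-8):
    """Negative SNR between the estimated virtual-mic waveform v_hat and
    the waveform v_obs actually observed at the virtual-mic position.
    Minimizing this drives the estimate toward the observation; the exact
    loss used in the document's experiments may differ."""
    err = v_obs - v_hat
    return -10.0 * np.log10(np.sum(v_obs ** 2) / (np.sum(err ** 2) + eps) + eps)
```

Because the target is itself a noisy real recording, this objective needs no clean reference, which is exactly the property the paragraph above emphasizes.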
Since the estimation device 10 makes it possible to generate virtual microphone signals, these signals can be used for various kinds of array processing. In Embodiment 3, a configuration in which the estimation device 10 is combined with a frequency-domain beamformer is therefore described as an example.

FIG. 5 is a diagram schematically showing an example of the signal processing device according to Embodiment 3. The signal processing device 100 shown in FIG. 5 is realized, for example, by loading a predetermined program into a computer including a ROM, a RAM, a CPU, and the like, and having the CPU execute the program. The signal processing device 100 also has a communication interface for transmitting and receiving various kinds of information to and from other devices connected by wire or via a network or the like. The signal processing device 100 has the estimation device 10, the microphone signal processing unit 30, and the application unit 40 (signal processing unit).

[Basic procedure]

First, the estimation device 10 estimates the virtual microphone signals ^v ∈ R^{T×C_v} from the real microphone signals r ∈ R^{T×C_r} as described in Eq. (1), and the extended microphone signal y = [r, ^v] ∈ R^{T×C} (C = C_r + C_v) is obtained. Next, the microphone signal processing unit 30 obtains an enhanced speech signal by applying a frequency-domain beamformer to the extended microphone signal in its frequency-domain representation, that is, in the short-time Fourier transform (STFT) domain. Finally, the enhanced time-domain waveform is reconstructed using the inverse STFT.
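The basic procedure above can be sketched as follows: stack the real and estimated virtual waveforms into the extended signal y = [r, ^v], then move to the STFT domain for beamforming. The minimal STFT below only shows the data flow; the window, frame, and hop sizes are illustrative, and this is not the implementation used in the experiments:

```python
import numpy as np

def stft(x, frame=256, hop=128):
    """Minimal STFT: (C, T) waveforms -> (C, num_frames, num_bins) complex spectra."""
    C, T = x.shape
    n = 1 + (T - frame) // hop
    win = np.hanning(frame)
    frames = np.stack([x[:, i * hop:i * hop + frame] * win for i in range(n)], axis=1)
    return np.fft.rfft(frames, axis=-1)

rng = np.random.default_rng(0)
r = rng.standard_normal((5, 4096))      # C_r = 5 real channels (illustrative)
v_hat = rng.standard_normal((1, 4096))  # C_v = 1 estimated virtual channel
y = np.concatenate([r, v_hat], axis=0)  # y = [r, v_hat], C = C_r + C_v = 6

Y = stft(y)  # frequency-domain representation handed to the beamformer
```

After beamforming per frequency bin, the inverse STFT (overlap-add with the matching window) restores the enhanced time-domain waveform.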
The microphone signal processing unit 30 uses, for example, minimum variance distortionless response (MVDR) beamforming (Reference 2: Mehrez Souden, Jacob Benesty, and Sofiene Affes, "On optimal frequency-domain multichannel linear filtering for noise reduction", IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 2, pp. 260-276, 2009.) and computes the time-invariant filter coefficients w_f as in Eq. (3).
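Eq. (3) is not reproduced in this excerpt, but the steering-vector-free MVDR of Reference 2 computes, per frequency bin, w = (Φ_n^{-1} Φ_s / tr(Φ_n^{-1} Φ_s)) u_ref from the speech and noise spatial covariance matrices, where u_ref selects a reference microphone. A sketch under that formulation (function and variable names are ours, and whether Eq. (3) uses exactly this form is an assumption):

```python
import numpy as np

def mvdr_souden(phi_s, phi_n, ref=0):
    """Time-invariant MVDR filter for one frequency bin in the
    steering-vector-free form of Souden et al. (Reference 2):
        w = (phi_n^{-1} phi_s / trace(phi_n^{-1} phi_s)) u_ref
    phi_s, phi_n : (C, C) speech / noise spatial covariance matrices.
    ref          : index of the reference microphone channel.
    """
    num = np.linalg.solve(phi_n, phi_s)  # phi_n^{-1} phi_s without an explicit inverse
    return num[:, ref] / np.trace(num)   # reference column, trace-normalized
```

For a rank-one speech covariance Φ_s = σ_s h h^H, this filter is distortionless with respect to the reference channel, i.e. the beamformer output reproduces the speech image at the reference microphone.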
In the experiments described later, it became clear that, although the use of virtual microphones in beamforming is effective for raising the signal-to-distortion ratio (SDR), it does not necessarily improve automatic speech recognition (ASR) performance. This is because virtual microphone estimation introduces processing artifacts.

The virtual microphone signals estimated by the estimation device 10 having the NN-VME module can also be expected to improve the performance of speech enhancement and signal processing extended by NN-VME.

To evaluate NN-VME, the following two evaluations were carried out: evaluation experiment 1 on the virtual microphone estimation performance of NN-VME, and evaluation experiment 2 on the enhancement performance of a beamformer using the estimated virtual microphone. Although the experiments report results for estimating a single virtual microphone, the method can naturally be extended to estimate multiple virtual microphones.

NN-VME was evaluated on the CHiME-4 corpus (Reference 4: Jon Barker, Ricard Marxer, Emmanuel Vincent, and Shinji Watanabe, "The third CHiME speech separation and recognition challenge: Dataset, task and baselines", in IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2015, pp. 504-511.). As shown in FIG. 6, the CHiME-4 corpus contains speech recorded with a tablet device equipped with a six-channel rectangular microphone array. The corpus includes not only simulated data but also real recordings made in noisy public environments.

A Conv-TasNet-based network architecture was adopted for the network configuration of NN-VME. Following the description in Reference 1, the hyperparameters were set to N=256, L=20, B=256, H=512, P=3, X=8, and R=4.

[Evaluation of virtual microphone estimation performance]

Table 1 shows the SDR [dB] of virtual microphone estimation, using the noisy observed signal as the reference signal.

Table 2 shows the SDR [dB] of the beamformer, using the clean signal as the reference signal. Higher SDR values are better, while lower WER [%] values indicate better performance.
Claims (8)
- A signal processing device for processing acoustic signals, the signal processing device comprising:
an estimation unit that estimates, using a deep learning model having a neural network, an observed signal of a virtually placed virtual microphone from input observed signals of real microphones.
- The signal processing device according to claim 1, wherein the estimation unit estimates, using the deep learning model, a time-domain signal that is the observed signal of the virtual microphone from an input time-domain signal that is the observed signal of the real microphones.
- The signal processing device according to claim 1 or 2, further comprising:
a microphone signal processing unit that generates, based on the observed signals of the real microphones and the observed signal of the virtual microphone estimated by the estimation unit, a speech-enhanced signal from which a noise signal has been removed; and
an application unit that performs signal processing using the speech-enhanced signal,
wherein the microphone signal processing unit adds, to spatial covariance matrices of a speech signal and a noise signal, a loading term that reduces the weight of the channel of the virtual microphone.
- A signal processing method executed by a signal processing device, the method comprising:
a step of estimating, using a deep learning model having a neural network, an observed signal of a virtually placed virtual microphone from input observed signals of real microphones.
- A signal processing program for causing a computer to function as the signal processing device according to any one of claims 1 to 3.
- A learning device comprising:
an input unit that receives, as learning data, observed signals of real microphones and an observed signal actually observed at the position of a virtually placed virtual microphone that is an estimation target;
an estimation unit that estimates, using a deep learning model having a neural network, the observed signal of the virtual microphone from the input observed signals of the real microphones; and
an updating unit that updates parameters of the neural network so that the observed signal of the virtual microphone estimated by the estimation unit approaches the observed signal actually observed at the position of the virtual microphone.
- A learning method executed by a learning device, the method comprising:
an input step of receiving, as learning data, observed signals of real microphones and an observed signal actually observed at the position of a virtually placed virtual microphone that is an estimation target;
an estimation step of estimating, using a deep learning model having a neural network, the observed signal of the virtual microphone from the input observed signals of the real microphones; and
an updating step of updating parameters of the neural network so that the observed signal of the virtual microphone estimated in the estimation step approaches the observed signal actually observed at the position of the virtual microphone.
- A learning program for causing a computer to function as the learning device according to claim 6.
Priority Applications (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US18/273,272 US20240129666A1 (en) | 2021-01-29 | 2021-01-29 | Signal processing device, signal processing method, signal processing program, training device, training method, and training program |
JP2022577952A JPWO2022162878A1 (ja) | 2021-01-29 | 2021-01-29 | |
PCT/JP2021/003278 WO2022162878A1 (ja) | 2021-01-29 | 2021-01-29 | Signal processing device, signal processing method, signal processing program, learning device, learning method, and learning program |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/JP2021/003278 WO2022162878A1 (ja) | 2021-01-29 | 2021-01-29 | Signal processing device, signal processing method, signal processing program, learning device, learning method, and learning program |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2022162878A1 true WO2022162878A1 (ja) | 2022-08-04 |
Family
ID=82652806
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/JP2021/003278 WO2022162878A1 (ja) | 2021-01-29 | 2021-01-29 | Signal processing device, signal processing method, signal processing program, learning device, learning method, and learning program |
Country Status (3)
Country | Link |
---|---|
US (1) | US20240129666A1 (ja) |
JP (1) | JPWO2022162878A1 (ja) |
WO (1) | WO2022162878A1 (ja) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2014502109A (ja) * | 2010-12-03 | 2014-01-23 | Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. | Sound acquisition via the extraction of geometrical information from direction of arrival estimates |
JP2015502716A (ja) * | 2011-12-02 | 2015-01-22 | Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. | Apparatus and method for microphone positioning based on a spatial power density |
Non-Patent Citations (2)
Title |
---|
TSUBASA OCHIAI; MARC DELCROIX; TOMOHIRO NAKATANI; RINTARO IKESHITA; KEISUKE KINOSHITA; SHOKO ARAKI: "Neural Network-based Virtual Microphone Estimator", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 12 January 2021 (2021-01-12), 201 Olin Library Cornell University Ithaca, NY 14853 , XP081857370 * |
YAMAOKA KOUEI; LI LI; ONO NOBUTAKA; MAKINO SHOJI; YAMADA TAKESHI: "CNN-based virtual microphone signal estimation for MPDR beamforming in underdetermined situations", 2019 27TH EUROPEAN SIGNAL PROCESSING CONFERENCE (EUSIPCO), EURASIP, 2 September 2019 (2019-09-02), pages 1 - 5, XP033660465, DOI: 10.23919/EUSIPCO.2019.8903040 * |
Also Published As
Publication number | Publication date |
---|---|
US20240129666A1 (en) | 2024-04-18 |
JPWO2022162878A1 (ja) | 2022-08-04 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Giri et al. | Attention wave-u-net for speech enhancement | |
CN110914899B (zh) | Mask calculation device, cluster weight learning device, mask calculation neural network learning device, mask calculation method, cluster weight learning method, and mask calculation neural network learning method | |
Delcroix et al. | Strategies for distant speech recognition in reverberant environments | |
JP5124014B2 (ja) | Signal enhancement device, method, program, and recording medium | |
Zhang et al. | Deep neural network-based bottleneck feature and denoising autoencoder-based dereverberation for distant-talking speaker identification | |
Kinoshita et al. | Suppression of late reverberation effect on speech signal using long-term multiple-step linear prediction | |
Yoshioka et al. | Integrated speech enhancement method using noise suppression and dereverberation | |
Drude et al. | Integrating Neural Network Based Beamforming and Weighted Prediction Error Dereverberation. | |
Xiao et al. | The NTU-ADSC systems for reverberation challenge 2014 | |
Heymann et al. | Frame-online DNN-WPE dereverberation | |
CN110998723B (zh) | Signal processing device and signal processing method using neural network, and recording medium | |
Ravanelli et al. | Contaminated speech training methods for robust DNN-HMM distant speech recognition | |
Ozerov et al. | Uncertainty-based learning of acoustic models from noisy data | |
Nakatani et al. | Maximum likelihood convolutional beamformer for simultaneous denoising and dereverberation | |
JP6348427B2 (ja) | Noise removal device and noise removal program | |
JP6106611B2 (ja) | Model estimation device, noise suppression device, speech enhancement device, and methods and programs therefor | |
Abdulbaqi et al. | Residual recurrent neural network for speech enhancement | |
Sainath et al. | Raw multichannel processing using deep neural networks | |
Song et al. | An integrated multi-channel approach for joint noise reduction and dereverberation | |
JP4348393B2 (ja) | Signal distortion removal device, method, program, and recording medium recording the program | |
Ochiai et al. | Neural network-based virtual microphone estimator | |
WO2022162878A1 (ja) | Signal processing device, signal processing method, signal processing program, learning device, learning method, and learning program | |
CN110739004B (zh) | Distributed speech noise elimination system for WASN | |
Abdulbaqi et al. | RHR-Net: A residual hourglass recurrent neural network for speech enhancement | |
WO2021205494A1 (ja) | Signal processing device, signal processing method, and program |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 21922894 Country of ref document: EP Kind code of ref document: A1 |
|
ENP | Entry into the national phase |
Ref document number: 2022577952 Country of ref document: JP Kind code of ref document: A |
|
WWE | Wipo information: entry into national phase |
Ref document number: 18273272 Country of ref document: US |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 21922894 Country of ref document: EP Kind code of ref document: A1 |