CN113257270A - Multi-channel voice enhancement method based on reference microphone optimization - Google Patents

Publication number
CN113257270A
Authority
CN
China
Prior art keywords
microphone
rank
reference microphone
voice
signal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110505085.9A
Other languages
Chinese (zh)
Other versions
CN113257270B (en)
Inventor
张结
陈星宇
戴礼荣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Science and Technology of China USTC
Original Assignee
University of Science and Technology of China USTC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Science and Technology of China USTC
Priority to CN202110505085.9A
Publication of CN113257270A
Application granted
Publication of CN113257270B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L2021/02161 Number of inputs available containing the signal or the noise to be suppressed
    • G10L2021/02166 Microphone arrays; Beamforming

Abstract

The invention discloses a multi-channel speech enhancement method based on reference microphone optimization, comprising the following steps: step 1, establishing a low-rank approximate multi-channel Wiener filter; step 2, establishing a mathematical model of the output signal-to-noise ratio; step 3, selecting a reference microphone: based on the output signal-to-noise ratio model established in step 2, taking two microphones, calculating the difference between their output signal-to-noise ratios, and selecting the microphone with the largest input signal-to-noise ratio as the reference microphone; and step 4, beamforming to obtain an enhanced speech signal: substituting the rank selected in step 2 and the reference microphone selected in step 3 into the low-rank approximate multi-channel Wiener filter established in step 1, and, in the short-time frequency domain, taking the inner product (weighted-sum beamforming) of the multi-microphone speech signal to be enhanced and the low-rank approximate multi-channel Wiener filter; the result is the single-channel enhanced speech signal. The method effectively reduces the time complexity of reference microphone selection and improves multi-microphone speech enhancement and speech recognition performance.

Description

Multi-channel voice enhancement method based on reference microphone optimization
Technical Field
The invention relates to the field of voice signal processing, in particular to a multi-channel voice enhancement and voice recognition method based on a microphone array.
Background
Speech enhancement aims to extract a clean source signal from a noisy, reverberant sound signal captured by acoustic sensors; its performance measures mainly include the output signal-to-noise ratio (SNR) and perceptual speech intelligibility. Microphone-array-based multi-channel speech enhancement has important applications in many practical systems, for example: high-quality voice communication in cocktail-party scenarios, human-machine voice interaction in smart homes, auditory perception and interaction for intelligent robots, and hearing-assistive devices for the hearing impaired. Currently, beamforming based on multichannel Wiener filtering (MWF) is one of the mainstream methods. The MWF designs a beamformer by minimizing, in the short-time frequency domain, the mean-square error (MSE) between the output signal and a reference target source signal, where the output signal is a weighted sum of the input signals. Mathematically, the multichannel Wiener filter depends on the signal covariance matrix and a reference microphone vector. Although this method minimizes the mean-square error between the output signal and the reference source signal, it does not control the distortion of the source signal, so the enhanced signal is spectrally distorted, which affects speech intelligibility and listening comfort.
Doclo et al. proposed a speech enhancement algorithm based on speech-distortion-weighted multichannel Wiener filtering (SDW-MWF). This method takes the mean-square error of the conventional multichannel Wiener filter together with a weighted output noise variance as the optimization objective, so the resulting beamformer depends not only on the signal covariance matrix and the reference microphone but also on the weight of the noise variance. Choosing the weight coefficient is therefore especially important when designing an SDW-MWF, and it should be adjusted according to the system performance requirements. When the weight is 0, the filter reduces to the conventional multichannel Wiener filter; when the weight is increased, the spectral distortion of the output speech signal becomes smaller, but the quality of the output signal deteriorates.
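The weighted objective described above can be sketched for a single frequency bin. The following is a minimal NumPy illustration, not the formulation of Doclo et al. or of the patent: it assumes the objective is the mean-square error plus `mu` times the output noise power, for which the minimizer is w = (Rxx + (1 + mu) Rnn)^(-1) Rxx e_ref. The function name `sdw_mwf`, the helper `noise_out`, and the toy covariance matrices are hypothetical.

```python
import numpy as np

def sdw_mwf(Rxx, Rnn, ref, mu=0.0):
    """Filter minimizing MSE + mu * (output noise power); assumed form, see lead-in."""
    e_ref = np.zeros(Rxx.shape[0])
    e_ref[ref] = 1.0                       # selection vector of the reference microphone
    return np.linalg.solve(Rxx + (1.0 + mu) * Rnn, Rxx @ e_ref)

def noise_out(w, Rnn):
    """Residual noise power w^H Rnn w at the filter output."""
    return np.real(w.conj() @ Rnn @ w)

# Toy single-bin covariances for M = 4 microphones (synthetic, for illustration only).
rng = np.random.default_rng(0)
M = 4
h = rng.standard_normal(M) + 1j * rng.standard_normal(M)
Rxx = 2.0 * np.outer(h, h.conj())          # rank-1 speech covariance
A = rng.standard_normal((M, M)) + 1j * rng.standard_normal((M, M))
Rnn = A @ A.conj().T + np.eye(M)           # positive-definite noise covariance

w0 = sdw_mwf(Rxx, Rnn, ref=0, mu=0.0)      # weight 0: the conventional MWF
w5 = sdw_mwf(Rxx, Rnn, ref=0, mu=5.0)      # larger weight on the noise term
```

With `mu=0.0` the sketch coincides with the conventional MWF, matching the text. Under this assumed form, raising `mu` shrinks the residual noise power; other parametrizations of the weight (for example, placing it on the speech-distortion term) reverse that direction.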
However, whichever multichannel Wiener filtering method is adopted, the reference microphone is a key parameter controlling the beamforming-based speech enhancement output. In conventional multi-microphone beamforming, the reference microphone is usually designated arbitrarily or simply chosen as the microphone closest to the source (although in practice the source-to-microphone distances are unknown). The beamformer output is the clean source component at the reference microphone; in practice, the microphones are at different distances from the source and from the interferers, and their thermal self-noise powers also differ, all of which shape the signal-to-noise ratio distribution across the microphone signals. The choice of reference microphone can therefore affect the quality of the enhanced signal. To address this, it is necessary to evaluate quantitatively the quality of the beamformer's enhanced signal (e.g., its signal-to-noise ratio) and then optimize the reference microphone selection to improve speech enhancement performance.
The existing document T. C. Lawin-Ore and S. Doclo, Reference microphone selection for MWF-based noise reduction using distributed microphone arrays, in Proceedings of the 10th ITG Symposium on Speech Communication, 2012, proposes, for multichannel Wiener filtering speech enhancement, a reference microphone selection method based on maximizing the output signal-to-noise ratio. It uses enumeration: each microphone is tried in turn as the reference microphone, a Wiener filter is designed for each case, and the microphone yielding the largest output signal-to-noise ratio is selected as the reference. This clearly requires traversing all cases and is very time-consuming. The existing document J. Zhang, H. Chen, and R. C. Hendriks, A study on reference microphone selection for multi-microphone speech enhancement, IEEE/ACM Trans. Audio, Speech, and Language Process., 29: 671-.
Disclosure of Invention
In view of the problems in the prior art, the object of the invention is to provide a multi-channel speech enhancement method based on reference microphone optimization that addresses the drawbacks of existing reference microphone selection methods for speech enhancement: they are time-consuming, have high time complexity, and limit the real-time performance of speech information processing systems.
The object of the invention is achieved by the following technical solution:
An embodiment of the invention provides a multi-channel speech enhancement method based on reference microphone optimization, comprising the following steps:
Step 1, establishing a multi-channel Wiener filter: randomly selecting one microphone from a microphone array of M microphones as the reference microphone; taking as the objective function the mean-square error between the filter output signal and the clean speech signal at the selected reference microphone, plus the weighted output noise power; and minimizing the objective function to obtain the original Wiener filter; performing a generalized eigenvalue decomposition of the speech covariance matrix and the noise covariance matrix contained in the covariance matrix of the multi-microphone noisy speech signal of the array to obtain M eigenvalues; forming a low-rank approximation of the speech covariance matrix from the first k of the M eigenvalues and their eigenvectors, where 0 < k ≤ M, to obtain a low-rank approximate speech covariance matrix of rank k; and substituting the low-rank approximate speech covariance matrix into the original Wiener filter to obtain a low-rank approximate multi-channel Wiener filter based on the speech covariance matrix, the generalized eigenvalues, the eigenvectors, and the reference microphone;
Step 2, establishing a mathematical model of the output signal-to-noise ratio: taking the ratio of the output signal power to the output noise power of the low-rank approximate multi-channel Wiener filter of step 1 as the output signal-to-noise ratio model, which is a function of the reference microphone and the rank;
Step 3, selecting a reference microphone: based on the output signal-to-noise ratio model established in step 2, taking two microphones, calculating the difference between their output signal-to-noise ratios, and, according to that difference, selecting the microphone with the largest input signal-to-noise ratio as the reference microphone;
Step 4, beamforming to obtain an enhanced speech signal: substituting the rank determined in step 1 and the reference microphone selected in step 3 into the low-rank approximate multi-channel Wiener filter established in step 1, and, in the short-time frequency domain, taking the weighted-sum inner product of the multi-microphone speech signal to be enhanced and the filter, i.e. beamforming, to obtain the single-channel enhanced speech signal.
According to the technical solution provided by the invention, the multi-channel speech enhancement method based on reference microphone optimization has the following beneficial effects:
By analyzing and mathematically modelling how the reference microphone affects the output signal quality of the multichannel Wiener filter, and using the relationship so established between the output and input signal-to-noise ratios, the method selects the reference microphone suboptimally under the criterion of maximizing the input signal-to-noise ratio. This reduces the time complexity of reference microphone selection, effectively improves the output signal quality and speech intelligibility of multi-channel speech enhancement, and improves the performance of multi-microphone noisy speech recognition systems.
Drawings
To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings used in describing the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention; those skilled in the art can derive other drawings from them without creative effort.
FIG. 1 is a flowchart of a multi-channel speech enhancement method based on reference microphone optimization according to an embodiment of the present invention;
FIG. 2 is a block diagram of a multi-channel speech enhancement system based on reference microphone optimization according to an embodiment of the present invention;
FIG. 3 shows the multichannel Wiener filtering speech enhancement performance based on reference microphone selection and low-rank approximation (maxiSNR denotes the method proposed by the present invention): FIG. 3(a) signal-to-noise ratio gain, FIG. 3(b) speech intelligibility gain;
FIG. 4 shows a noisy speech recognition system based on a uniform linear microphone array according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are clearly and completely described below with reference to the specific contents of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present invention without making any creative effort, shall fall within the protection scope of the present invention. Details which are not described in detail in the embodiments of the invention belong to the prior art which is known to the person skilled in the art.
As shown in FIG. 1, an embodiment of the present invention provides a multi-channel speech enhancement method based on reference microphone optimization. The method is intended for a speech recognition system with an array of multiple microphones. It first builds a mathematical model of the speech enhancement performance of multichannel Wiener filtering, and then selects the reference microphone by maximizing the input signal-to-noise ratio. The method comprises establishing a Wiener filter, establishing a mathematical model of the output signal-to-noise ratio, analyzing the performance of the Wiener filter, and selecting a reference microphone, where the Wiener filter is established following the design criterion of variable span filters or of the speech-distortion-weighted Wiener filter (SDW-MWF). The specific scheme is as follows:
Step 1, establishing a Wiener filter:
First, the original Wiener filter (a general form of the multichannel Wiener filter) is established: a reference microphone is selected from a microphone array of M microphones (i.e., the microphone array of the speech recognition system); the mean-square error between the filter output signal and the clean speech signal at the reference microphone, plus the weighted output noise power, is taken as the objective function; and the objective function is minimized to obtain the original Wiener filter;
Second, a generalized eigenvalue decomposition is performed on the speech covariance matrix (the covariance matrix of the speech components of the microphone signals) and the noise covariance matrix (the covariance matrix of the noise components of the microphone signals) contained in the covariance matrix of the multi-microphone noisy speech signal, giving M eigenvalues. The generalized eigenvalues and their eigenvectors are used to form a low-rank approximation of the speech covariance matrix: the first k eigenvalues and the corresponding eigenvectors are selected, giving a low-rank approximate speech covariance matrix of rank k, where 0 < k ≤ M and M is the number of microphones (the M microphone signals of the array yield M linearly independent eigenvectors). The low-rank approximate speech covariance matrix is substituted into the original Wiener filter to obtain a low-rank approximate multi-channel Wiener filter based on the rank of the speech covariance matrix, the generalized eigenvalues, the eigenvectors, and the reference microphone;
With this processing, the influence of the reference microphone and of the rank of the speech covariance matrix on the output signal quality of the original Wiener filter can be analyzed conveniently. Research shows that, for different choices of rank, the original Wiener filter turns into different beamformers, such as the common MWF, maximum signal-to-noise ratio (maxSNR), and minimum variance distortionless response (MVDR) beamformers;
Step 2, establishing a mathematical model of the output signal-to-noise ratio: the ratio of the output signal power to the output noise power of the low-rank approximate multi-channel Wiener filter obtained in step 1 is taken as the output signal-to-noise ratio model. The model is a function of the reference microphone and the rank, and from it the dependence of the enhanced signal quality on the reference microphone and the rank can be obtained. It can be proved theoretically that the larger the rank, the smaller the output signal-to-noise ratio; that is, the rank-1 beamformer maximizes the output signal-to-noise ratio. In practice, however, the speech covariance matrix estimated from limited observation data more commonly has a rank other than 1; the rank k of the speech covariance matrix can therefore be chosen freely as any integer between 1 and M (M being the total number of microphones);
In step 2, the output signal power of the low-rank approximate multi-channel Wiener filter is, in the frequency domain, the product of the conjugate transpose of the filter vector, the input speech covariance matrix, and the filter vector;
The output noise power of the low-rank approximate multi-channel Wiener filter is, in the frequency domain, the product of the conjugate transpose of the filter vector, the input noise covariance matrix, and the filter vector.
Step 3, selecting a reference microphone: based on the output signal-to-noise ratio model of step 2, two microphones are taken as candidate references and the difference between their output signal-to-noise ratios (the gain of one over the other) is calculated; according to this difference, the microphone with the largest input signal-to-noise ratio is selected as the reference microphone. Because this gain is shown to be positively correlated with the difference between the input signal-to-noise ratios of the two candidate reference microphones, the output signal-to-noise ratio can be optimized by selecting the microphone with the largest input signal-to-noise ratio as the reference microphone, which avoids complex optimization operations;
Step 4, beamforming to obtain an enhanced speech signal: the rank determined in step 1 and the reference microphone obtained in step 3 are substituted into the low-rank approximate multi-channel Wiener filter of step 1, and, in the short-time frequency domain, the weighted-sum inner product of the multi-microphone speech signal to be enhanced (i.e., the multi-microphone noisy speech signal) and the filter is taken, i.e. beamforming, to obtain the single-channel enhanced speech signal.
Further, the method also comprises step 5, speech recognition: the single-channel enhanced speech signal obtained in step 4 is input to a speech recognizer based on a subspace Gaussian mixture model and deep neural network (SGMM-DNN) for recognition, and is transcribed and analyzed to obtain the content of the target source.
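Steps 1 to 4 can be sketched for a single frequency bin as follows. This is a minimal illustration under stated assumptions, not the patent's exact filter expression (which is not reproduced in this passage): the objective is taken as the mean-square error plus `mu` times the output noise power, the generalized eigenvalue decomposition is computed with `scipy.linalg.eigh`, and the closed form w = U_k diag(λ_i / (λ_i + 1 + mu)) Q_k^H e_ref with Q = Rnn U follows from that assumed objective. All function names and the synthetic covariance matrices are hypothetical.

```python
import numpy as np
from scipy.linalg import eigh

def low_rank_mwf(Rxx, Rnn, ref, k, mu=0.0):
    """Rank-k approximate MWF for one frequency bin (assumed closed form)."""
    lam, U = eigh(Rxx, Rnn)               # Rxx u = lam * Rnn u, with U^H Rnn U = I
    order = np.argsort(lam)[::-1]         # largest generalized eigenvalues first
    lam, U = lam[order][:k], U[:, order][:, :k]
    Q = Rnn @ U                           # Rxx is approximated by Q diag(lam) Q^H
    gains = lam / (lam + 1.0 + mu)
    return U @ (gains * Q[ref].conj())    # Q[ref].conj() equals Q_k^H e_ref

def output_snr(w, Rxx, Rnn):
    """Step 2: (w^H Rxx w) / (w^H Rnn w)."""
    return np.real(w.conj() @ Rxx @ w) / np.real(w.conj() @ Rnn @ w)

def select_reference(Rxx, Rnn):
    """Step 3: index of the microphone with the largest input SNR."""
    return int(np.argmax(np.real(np.diag(Rxx)) / np.real(np.diag(Rnn))))

# Toy single-bin example: M = 4 microphones, rank-2 speech covariance (synthetic).
rng = np.random.default_rng(0)
M = 4
H = rng.standard_normal((M, 2)) + 1j * rng.standard_normal((M, 2))
Rxx = H @ H.conj().T
A = rng.standard_normal((M, M)) + 1j * rng.standard_normal((M, M))
Rnn = A @ A.conj().T / M + np.eye(M)

ref = select_reference(Rxx, Rnn)
w = low_rank_mwf(Rxx, Rnn, ref, k=2)
y = rng.standard_normal(M) + 1j * rng.standard_normal(M)  # placeholder noisy STFT vector
enhanced = w.conj() @ y                   # Step 4: weighted-sum beamforming, w^H y
```

In a full system the same computation would be repeated for every frequency bin, with covariance matrices estimated from data rather than synthesized. In this sketch the rank-1 filter is a scaled copy of the principal generalized eigenvector whatever the reference, which is consistent with the statement that the rank-1 output signal-to-noise ratio does not depend on the reference microphone.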
The structure of a system implementing the above method is shown in FIG. 4. The input of the voice activity detector (VAD) is the multi-microphone noisy speech signal, and its output is the noise-only frames and the noise-plus-speech frames; from these, the noise covariance matrix and the mixed covariance matrix are estimated, respectively, using a moving-average technique. The generalized eigenvalue decomposition module takes the noise covariance matrix and the mixed covariance matrix as input and outputs the generalized eigenvalues and the corresponding eigenvectors. The low-rank approximation module uses the generalized eigenvalues and eigenvectors to approximate the speech covariance matrix for the different ranks under consideration and outputs the result. The input signal-to-noise ratio estimation module estimates the distribution of input signal-to-noise ratios across the microphones from the original multi-microphone signals and the voice activity detection result. The reference microphone selection module takes the input signal-to-noise ratio distribution as input and outputs the index of the reference microphone. The SDW-MWF Wiener filter design module takes the low-rank approximate speech covariance matrix and the reference microphone as input and outputs the frequency-domain filter vector. The beamforming module applies the Wiener filter to the original multi-microphone signals, outputs the enhanced single-channel speech signal, and also allows the output signal-to-noise ratio and speech intelligibility to be observed. Finally, the enhanced single-channel speech signal is input to the SGMM-DNN speech recognition model to obtain the recognized speech content of the target speaker, the word error rate, and so on.
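The covariance estimation stage of this structure can be sketched as below. The sketch assumes that the moving average is a first-order recursive (exponential) average with forgetting factor `alpha`, and that VAD decisions are already available as a boolean array; neither detail is specified in the text. The function name `estimate_covariances` and the random placeholder data are hypothetical.

```python
import numpy as np

def estimate_covariances(Y, is_speech, alpha=0.95):
    """Recursive covariance estimates for one frequency bin.

    Y: (T, M) complex STFT coefficients; is_speech: (T,) boolean VAD flags.
    Returns (Ryy, Rnn): mixed covariance from noise-plus-speech frames and
    noise covariance from noise-only frames.
    """
    M = Y.shape[1]
    Ryy = np.zeros((M, M), dtype=complex)
    Rnn = np.zeros((M, M), dtype=complex)
    for y, speech in zip(Y, is_speech):
        outer = np.outer(y, y.conj())      # instantaneous estimate y y^H
        if speech:
            Ryy = alpha * Ryy + (1.0 - alpha) * outer
        else:
            Rnn = alpha * Rnn + (1.0 - alpha) * outer
    return Ryy, Rnn

# Placeholder data: T = 200 frames, M = 4 microphones, random VAD flags.
rng = np.random.default_rng(0)
Y = rng.standard_normal((200, 4)) + 1j * rng.standard_normal((200, 4))
is_speech = rng.random(200) > 0.5
Ryy, Rnn = estimate_covariances(Y, is_speech)
Rxx = Ryy - Rnn                            # speech covariance estimate (may not be PSD)
```

The zero initialization biases the first few updates, and the difference `Ryy - Rnn` is not guaranteed to be positive semidefinite with finite data, which is one reason an estimated speech covariance matrix typically has rank greater than 1.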
By analyzing the mathematical relationship between the output signal-to-noise ratio and the reference microphone and the rank of the speech covariance matrix, the invention provides, from the standpoint of multichannel Wiener filter design theory, a reference microphone selection method based on maximizing the input signal-to-noise ratio. This gives a clearer understanding of the relationship between signal quality before and after speech enhancement in a multi-microphone speech information processing system and improves the robustness of speech enhancement and noisy speech recognition. It addresses two problems of existing methods: arbitrarily designating the reference microphone cannot guarantee a reference microphone with good signal quality, which degrades the output signal-to-noise ratio; and the enumeration method and the semidefinite programming optimization method have high time complexity, making them time-consuming with poor real-time performance.
The effectiveness of the method of the invention was verified by the following experiments:
(I) Experimental setup:
The method is verified with speech enhancement and noisy speech recognition based on a linear microphone array in a conference-room scenario. The experimental configuration is shown in FIG. 2: the room size is 4 m × 3 m, and the reverberation time is 200 ms. The uniform linear array comprises 8 omnidirectional microphones (M = 8); the center of the array is at (2, 0.5) m, the microphones are numbered 1 to 8 from left to right, and the spacing between adjacent microphones is 2 cm. The target speaker is in the direction θ = 45°, the two interferers are at 0° and 180°, and all sources lie on a semicircle of radius 1 m around the microphone array. The target speech is from the test set of the TIMIT English database (J. Garofolo, L. Lamel, W. Fisher, J. Fiscus, and D. Pallett, DARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM. NIST speech disc 1-1.1, NASA STI/Recon Technical Report N, 93:27403, 1993), containing 192 sentences in total from 24 different speakers. The speech recognition module uses the SGMM-DNN model, trained on the TIMIT training set, which contains 3696 sentences in total; the interfering source signals come from the NoiseX-92 database. The sampling frequency of all source signals is fixed at 16 kHz, and the short-time Fourier transform (STFT) uses a square-root Hann window of 32 ms with a frame shift of 16 ms.
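The STFT configuration stated above translates to a 512-sample window and a 256-sample hop at 16 kHz. A minimal sketch using `scipy.signal.stft`, with random placeholder data standing in for the 8 microphone signals (the variable names are illustrative and the original experiments may have used a different implementation):

```python
import numpy as np
from scipy.signal import stft, get_window

fs = 16000                      # sampling frequency in Hz
nperseg = fs * 32 // 1000       # 32 ms window -> 512 samples
hop = fs * 16 // 1000           # 16 ms frame shift -> 256 samples

# Square-root Hann analysis window.
window = np.sqrt(get_window("hann", nperseg))

# Placeholder input: 8 microphone channels, 1 second of noise.
x = np.random.default_rng(0).standard_normal((8, fs))

# Y has shape (microphones, frequency bins, frames).
f, t, Y = stft(x, fs=fs, window=window, nperseg=nperseg, noverlap=nperseg - hop)
```

Each `Y[:, i, j]` is then the vector of the 8 microphone coefficients at frequency bin `i` and frame `j`, which is the quantity the beamformer operates on.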
Given the room configuration, the room impulse responses (RIRs) from the sources to the microphone array are generated using the image method (J. B. Allen and D. A. Berkley, Image method for efficiently simulating small-room acoustics, The Journal of the Acoustical Society of America, 65(4):943-950, 1979). The mixed signal captured by each microphone is the superposition of a target source component, interferer components, and an uncorrelated noise component. The target source component is the source signal convolved with the source's room impulse response; the two interferer components are the interferer signals convolved with their corresponding room impulse responses; and the uncorrelated noise component models microphone self-noise as white Gaussian noise. The interferer power is controlled by the signal-to-interference ratio (SIR) and the uncorrelated noise power by the signal-to-noise ratio (SNR); when SIR = 0 dB and SNR = 40 dB, the overall input signal-to-noise ratio is slightly less than 0 dB.
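The power scaling by SIR and SNR can be sketched for one microphone as follows. The sketch assumes that the SIR is defined against the sum of the two interferer components (an interpretation chosen because it is consistent with an overall input signal-to-noise ratio close to 0 dB; the text does not state it), and it uses random placeholders where the RIR-convolved components would go. The helper `scale_to_ratio` is hypothetical.

```python
import numpy as np

def scale_to_ratio(target, other, ratio_db):
    """Scale `other` so that 10*log10(P_target / P_other) equals ratio_db."""
    p_target = np.mean(target ** 2)
    p_other = np.mean(other ** 2)
    gain = np.sqrt(p_target / (p_other * 10.0 ** (ratio_db / 10.0)))
    return gain * other

rng = np.random.default_rng(0)
n = 16000
# Placeholders for components at one microphone (already convolved with RIRs).
speech = rng.standard_normal(n)
interferer_1 = rng.standard_normal(n)
interferer_2 = rng.standard_normal(n)

# Assumption: SIR is set against the summed interference.
interf = scale_to_ratio(speech, interferer_1 + interferer_2, ratio_db=0.0)
# Microphone self-noise modelled as white Gaussian noise at SNR = 40 dB.
noise = scale_to_ratio(speech, rng.standard_normal(n), ratio_db=40.0)

mixture = speech + interf + noise
```

With these settings the interference carries the same power as the speech and the self-noise is 40 dB below it, so the interference dominates the overall input signal-to-noise ratio.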
(II) Experimental results:
First, the speech enhancement results are verified. The quality and intelligibility of the output speech signal are measured by the signal-to-noise ratio gain and by short-time objective intelligibility (STOI; C. H. Taal, R. C. Hendriks, R. Heusdens, and J. Jensen, An algorithm for intelligibility prediction of time-frequency weighted noisy speech, IEEE Trans. Audio, Speech, and Language Process., 19(7):2125-2136, 2011), respectively. The STOI parameter measures the mutual information between the enhanced signal and the clean reference signal and ranges from 0 to 1; the higher the value, the closer the enhanced signal is to the clean source signal, and it is widely used to evaluate speech intelligibility. It can be seen that the output signal-to-noise ratio and speech intelligibility decrease as the rank increases, and that the reference microphone has a significant impact on the performance of microphone-array-based speech enhancement algorithms. The reference microphone selection method based on the maximum input signal-to-noise ratio, namely maxiSNR, selects the first microphone as the reference microphone because the first microphone is closest to the target speaker; although this does not give the optimal output signal-to-noise ratio, it gives the best speech intelligibility. Note that for rank-1 beamforming methods (such as MVDR), the output signal-to-noise ratio does not depend on the reference microphone, whereas speech intelligibility always does.
Table 1 below gives the effect of the rank of the low-rank approximation of the speech covariance matrix and of the reference microphone on the speech recognition word error rate (WER) when the signal-to-interference ratio (SIR) of the two interferers is 0 dB. Note that for the clean source signal, the word error rate of speech recognition based on the SGMM-DNN model is 18.0%, and for the mixed signal at the first microphone it is 78.0%; all statistics are averages of the word error rate over all test sentences. The word error rate depends on both the rank and the reference microphone. As the rank increases, speech recognition accuracy decreases, which is consistent with the trends of output signal-to-noise ratio and speech intelligibility in speech enhancement. When the rank is 1, the reference microphone has little influence on the word error rate; when the rank is not 1, choosing a suitable reference microphone is important.
Table 1. Word error rate statistics for speech recognition with different ranks and reference microphones (SIR = 0 dB)
[Table 1 is provided as images in the original publication: Figure BDA0003058080780000071, Figure BDA0003058080780000081]
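The word error rate reported in these tables is conventionally the word-level edit distance between the reference transcript and the recognizer output, divided by the number of reference words. A minimal sketch of that standard definition follows; the scoring tool actually used in the experiments is not named in the text, and the function name `word_error_rate` is hypothetical. The reference transcript is assumed to be non-empty.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))           # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        cur = [i]                              # deleting all i reference words so far
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,        # deletion
                           cur[j - 1] + 1,     # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1] / len(ref)

# Per-sentence scores averaged over a toy test set, mirroring the averaging in the text.
pairs = [("the cat sat", "the dog sat"), ("a b c d", "a c d")]
mean_wer = sum(word_error_rate(r, h) for r, h in pairs) / len(pairs)
```

Averaging per-sentence rates, as the text describes, weights short and long sentences equally; pooling all errors over all reference words is a common alternative and generally gives a different number.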
Table 2 below gives the low rank approximation using different speech covariance matrices and the effect of the reference microphone on the speech recognition word error rate when the signal-to-interference ratio SIR of the two interferers is 20dB, when the word error rate of the mixed signal on the first microphone is 38.7%. The conclusion similar to table 1 can be obtained through analysis, and by comparing table 1 with table 2, it can be obviously found that noise can seriously reduce the speech recognition precision, and the speech recognition performance can be improved by using the speech enhancement front-end module based on multi-channel wiener filtering. Since the word error rate and speech intelligibility depend on both the reference microphone and the rank, while the output signal-to-noise ratio is related to the reference microphone only when the rank is not 1, the speech recognition accuracy is more dependent on the intelligibility of the input speech than the signal-to-noise ratio, see fig. 3(a) and 3(b) in fig. 3. The lowest word error rate can be obtained by adopting the maximum input signal-to-noise ratio reference microphone selection method (namely k is 1).
Table 2: word error rate statistics of speech recognition for different ranks and reference microphones (SIR = 20 dB)
Figure BDA0003058080780000082
In summary, the multi-channel speech enhancement method based on reference microphone optimization of the present invention has at least the following advantages: first, in the design of the multi-channel wiener filter, the dependence of the output signal quality on the rank of the speech covariance matrix and on the reference microphone is modeled, and this dependence model is used to select the actual beamformer; second, selecting the reference microphone by the maximum input signal-to-noise ratio criterion has lower time complexity than selecting it by the maximum output signal-to-noise ratio criterion.
The experimental results show that the method is suboptimal in terms of output signal-to-noise ratio but maximizes the intelligibility of the output speech. When speech enhancement based on multi-channel wiener filtering is applied to a noisy speech recognition scene, the reference microphone selection method significantly reduces the word error rate and improves the robustness of a multi-microphone noisy speech recognition system.
The embodiments of the present invention are described in further detail below.
The embodiment provides a multi-channel speech enhancement method based on reference microphone optimization, which comprises the following steps:
(I) the signal model is as follows:
in this embodiment, taking a noisy speech recognition system including M microphones as an example, the frame index and the frequency index in the short-time frequency domain are denoted by t and f respectively, and the noisy speech signal collected by the m-th microphone is expressed as:
$Y_m(t,f) = h_m(f)\,X_k(t,f) + N_m(t,f), \quad m = 1, \ldots, M$ (1);
in the above formula (1), $X_k(t,f)$, $N_m(t,f)$ and $h_m(f)$ respectively represent the clean sound source component at the reference microphone k, the noise component at the microphone m (including interfering sound sources, background noise, reverberation, self-noise of the microphone, etc.), and the relative acoustic transfer function (RTF) from the target sound source to the microphone m; in the above signal model, the microphone k is taken as the reference microphone, and its choice is further optimized in the reference microphone selection step below; for each time-frequency point, the STFT coefficients of the M microphones are stacked into a column vector y, namely $y = [Y_1(t,f), Y_2(t,f), \ldots, Y_M(t,f)]^T$; similarly defining the relative acoustic transfer function and the noise component as vectors h and n, the signal model can be written in the following vector form:
$y = h X_k + n$ (2);
in the above formula (2), the time-frequency index (t, f) is omitted for brevity; assuming that the target sound source and the noise component are uncorrelated, and writing $x = h X_k$ for the speech component, the covariance matrix of the multi-microphone noisy speech signal can be written as the sum of the speech covariance matrix and the noise covariance matrix, i.e.:
$\Phi_{yy} = \varepsilon[yy^H] = \varepsilon[xx^H] + \varepsilon[nn^H] = \Phi_{xx} + \Phi_{nn}$ (3);
In the above formula (3),
Figure BDA0003058080780000091
represents the speech covariance matrix;
Figure BDA0003058080780000092
represents the power spectral density of the target sound source component at the reference microphone k; $\Phi_{nn}$ represents the noise covariance matrix; $\varepsilon$ represents the averaging (expectation) operation; theoretically, when only a single target sound source is present, the rank of $\Phi_{xx}$ is 1; in practice, however, since the speech covariance matrix is estimated from observed data of finite length, the estimate of $\Phi_{xx}$ contains errors and its rank is not 1; the multi-microphone noisy speech signal can be divided into noise-only frames and speech-plus-noise frames by a sound event detection method, and $\Phi_{nn}$ and $\Phi_{yy}$ are estimated in these two kinds of intervals respectively by a moving-average technique, such as:
Figure BDA0003058080780000093
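The estimation step above can be sketched in NumPy as follows. This is only an illustration under stated assumptions: the boolean frame labels `speech_active` stand in for the output of the sound event detection method, a plain batch average over each frame class replaces the moving average of the source, and the function name `estimate_covariances` and the simulated data are hypothetical.

```python
import numpy as np

def estimate_covariances(Y, speech_active):
    """Estimate Phi_yy, Phi_nn and Phi_xx for one frequency bin.

    Y: complex array of shape (T, M), T frames of M microphone STFT values.
    speech_active: boolean array of shape (T,), True for speech-plus-noise
        frames (assumed to come from some detector, not specified here).
    """
    Y = np.asarray(Y)
    speech_active = np.asarray(speech_active, dtype=bool)
    noisy = Y[speech_active]      # speech-plus-noise frames
    noise = Y[~speech_active]     # noise-only frames
    # Sample covariance: (1/T) * sum_t y_t y_t^H
    Phi_yy = noisy.T @ noisy.conj() / len(noisy)
    Phi_nn = noise.T @ noise.conj() / len(noise)
    # Equation (3): Phi_xx = Phi_yy - Phi_nn
    Phi_xx = Phi_yy - Phi_nn
    return Phi_yy, Phi_nn, Phi_xx

# Simulated single-source data following y = h X_k + n (illustrative values).
rng = np.random.default_rng(0)
M, T = 4, 2000
h = rng.standard_normal(M) + 1j * rng.standard_normal(M)
h = h / h[0]                      # RTF relative to the first microphone
X = rng.standard_normal(T) + 1j * rng.standard_normal(T)
N = 0.1 * (rng.standard_normal((T, M)) + 1j * rng.standard_normal((T, M)))
active = np.arange(T) < T // 2    # first half of the frames contains speech
Y = N + np.where(active[:, None], X[:, None] * h[None, :], 0)

Phi_yy, Phi_nn, Phi_xx = estimate_covariances(Y, active)
```

Because the estimate uses finitely many frames, `Phi_xx` here is only approximately rank 1, which is the estimation error the text refers to.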
(II) establishing a low-rank approximation-based multi-channel wiener filter:
in this embodiment, taking the general speech-distortion-weighted multi-channel wiener filter (SDW-MWF) as an example, the process of building the low-rank-approximation multi-channel wiener filter is described; the design criterion is to minimize the mean square error of the target sound source plus the weighted residual noise power, that is:
$\min_w \; \varepsilon[|w^H x - X_k|^2] + \mu\,\varepsilon[|w^H n|^2]$ (4);
In the above formula (4), $w = [w_1, w_2, \ldots, w_M]^T$ represents the filter vector; $\mu \ge 0$ is a trade-off factor between the speech enhancement performance and the degree of speech distortion; the expression of the multi-channel wiener filter is derived as follows:
$w = (\Phi_{xx} + \mu\Phi_{nn})^{-1}\,\Phi_{xx}\,e_k$ (5);
in the above formula (5), $e_k$ is a column vector that depends on the reference microphone: its k-th element is 1 and all other elements are 0. When $\mu = 1$, the criterion (4) reduces to the mean square error between the filter output $w^H y$ and $X_k$, so the multi-channel wiener filter is equivalent to the classical wiener filter.
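The step from criterion (4) to filter (5) can be filled in as follows. This is a sketch under the assumption, implied by the signal model, that the relative transfer function is normalized at the reference microphone ($h_k = 1$), so that $X_k = e_k^H x$; the symbol $J(w)$ is introduced here only as a name for the cost in (4).

```latex
\begin{aligned}
J(w) &= \varepsilon\big[\,|w^H x - e_k^H x|^2\,\big] + \mu\,\varepsilon\big[\,|w^H n|^2\,\big] \\
     &= (w - e_k)^H \Phi_{xx} (w - e_k) + \mu\, w^H \Phi_{nn}\, w, \\
\frac{\partial J}{\partial w^{*}} &= \Phi_{xx}(w - e_k) + \mu\,\Phi_{nn}\, w = 0, \\
(\Phi_{xx} + \mu\Phi_{nn})\, w &= \Phi_{xx}\, e_k
\quad\Longrightarrow\quad
w = (\Phi_{xx} + \mu\Phi_{nn})^{-1}\,\Phi_{xx}\, e_k .
\end{aligned}
```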
Consider the generalized eigenvalue decomposition of the matrix pair $(\Phi_{xx}, \Phi_{nn})$; the resulting eigenvalues are sorted in descending order as $\lambda_1 \ge \lambda_2 \ge \ldots \ge \lambda_M$, the corresponding eigenvectors are stored in the matrix $U = [u_1, u_2, \ldots, u_M]$, and $\Lambda$ is defined as the diagonal matrix whose diagonal elements are the generalized eigenvalues; based on the generalized eigenvalue decomposition, $\Phi_{xx}$ and $\Phi_{nn}$ can be jointly diagonalized as:
$U^H \Phi_{xx} U = \Lambda, \quad U^H \Phi_{nn} U = I$;
wherein I is the identity matrix; since $\Phi_{yy} = \Phi_{xx} + \Phi_{nn}$, $\Phi_{yy}$ can be diagonalized as: $U^H \Phi_{yy} U = \Lambda + I$.
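The joint diagonalization can be checked numerically with a short NumPy sketch. It computes the generalized eigenvectors by whitening with a Cholesky factor of the noise covariance matrix, which is one standard route and not necessarily the one used by the inventors; the function name `joint_diagonalize` and the random test matrices are illustrative assumptions.

```python
import numpy as np

def joint_diagonalize(Phi_xx, Phi_nn):
    """Return (lam, U) with U^H Phi_xx U = diag(lam) and U^H Phi_nn U = I.

    Eigenvalues are returned in descending order.
    """
    L = np.linalg.cholesky(Phi_nn)             # Phi_nn = L L^H
    L_inv = np.linalg.inv(L)
    C = L_inv @ Phi_xx @ L_inv.conj().T        # whitened speech covariance
    C = (C + C.conj().T) / 2                   # remove rounding asymmetry
    lam, V = np.linalg.eigh(C)                 # eigh returns ascending order
    order = np.argsort(lam)[::-1]              # reorder to descending
    lam, V = lam[order], V[:, order]
    U = L_inv.conj().T @ V                     # generalized eigenvectors
    return lam, U

# Random Hermitian positive-definite matrices (illustrative only).
rng = np.random.default_rng(0)
M = 4
A = rng.standard_normal((M, M)) + 1j * rng.standard_normal((M, M))
B = rng.standard_normal((M, M)) + 1j * rng.standard_normal((M, M))
Phi_xx = A @ A.conj().T
Phi_nn = B @ B.conj().T + M * np.eye(M)
Phi_yy = Phi_xx + Phi_nn

lam, U = joint_diagonalize(Phi_xx, Phi_nn)
lam_y, U_y = joint_diagonalize(Phi_yy, Phi_nn)
```

Decomposing the pair built from the noisy covariance matrix instead of the speech covariance matrix shifts every eigenvalue by one, so `lam_y` equals `lam + 1` up to rounding.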
Therefore, in practice the joint diagonalization can be realized by the generalized eigenvalue decomposition of $\Phi_{yy}$ and $\Phi_{nn}$; from the diagonalization of $\Phi_{xx}$ one obtains:
Figure BDA0003058080780000101
in the above formula (6), $Q = [q_1, q_2, \ldots, q_M] = U^{-H}$; it can be seen that $\Phi_{nn} = QQ^H$, and
Figure BDA0003058080780000102
thus, $Q = [q_1, q_2, \ldots, q_M]$ contains the left eigenvectors of the following matrix:
Figure BDA0003058080780000103
for a single-source scene, studies have shown that the normalized principal eigenvector corresponds to the relative acoustic transfer function of the source (see J. Zhang, R. Heusdens, and R. C. Hendriks, Relative acoustic transfer function estimation in wireless acoustic sensor networks, IEEE/ACM Trans. Audio, Speech, Language Process., 27(10): 1507-);
based on the generalized eigenvalue decomposition, $\Phi_{xx}$ can be approximated using the first r eigenvalues and the corresponding eigenvectors:
Figure BDA0003058080780000104
substituting the rank-r approximation of $\Phi_{xx}$ into the original SDW-MWF gives the wiener filter based on low-rank approximation, namely:
Figure BDA0003058080780000111
by selecting different ranks, the original SDW-MWF can be converted into beamformers such as MVDR and maxSNR (see J. R. Jensen, J. Benesty, and M. G. Christensen, Noise reduction with optimal variable span linear filters, IEEE/ACM Trans. Audio, Speech, Language Process., 24(4): 631-644, 2016);
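A minimal NumPy sketch of the low-rank filter follows. Since the rendered formulas are not available in this text, the closed form used in `low_rank_mwf` is derived here from formula (5) together with $Q = U^{-H}$, and it is cross-checked against direct substitution of the rank-r matrix into formula (5). The function names, the zero-based microphone index `k` and the random test matrices are assumptions made for illustration.

```python
import numpy as np

def gevd(Phi_xx, Phi_nn):
    """Generalized eigendecomposition with descending eigenvalues.

    Returns (lam, U, Q) with U^H Phi_xx U = diag(lam), U^H Phi_nn U = I
    and Q = U^{-H}.
    """
    L = np.linalg.cholesky(Phi_nn)             # Phi_nn = L L^H
    L_inv = np.linalg.inv(L)
    C = L_inv @ Phi_xx @ L_inv.conj().T
    lam, V = np.linalg.eigh((C + C.conj().T) / 2)
    order = np.argsort(lam)[::-1]
    lam, V = lam[order], V[:, order]
    return lam, L_inv.conj().T @ V, L @ V

def low_rank_mwf(Phi_xx, Phi_nn, r, k, mu=1.0):
    """Rank-r SDW-MWF for reference microphone k (zero-based index).

    Uses w = sum_{i<r} lam_i / (lam_i + mu) * u_i * conj(Q[k, i]).
    """
    lam, U, Q = gevd(Phi_xx, Phi_nn)
    gain = lam[:r] / (lam[:r] + mu)            # per-component gain
    return U[:, :r] @ (gain * Q[k, :r].conj())

# Random Hermitian positive-definite matrices (illustrative only).
rng = np.random.default_rng(1)
M = 4
A = rng.standard_normal((M, M)) + 1j * rng.standard_normal((M, M))
B = rng.standard_normal((M, M)) + 1j * rng.standard_normal((M, M))
Phi_xx = A @ A.conj().T
Phi_nn = B @ B.conj().T + M * np.eye(M)

r, k, mu = 2, 0, 1.0
w = low_rank_mwf(Phi_xx, Phi_nn, r, k, mu)

# Cross-check: build the rank-r matrix and substitute it into formula (5).
lam, U, Q = gevd(Phi_xx, Phi_nn)
Phi_xx_r = (Q[:, :r] * lam[:r]) @ Q[:, :r].conj().T
w_direct = np.linalg.solve(Phi_xx_r + mu * Phi_nn, Phi_xx_r[:, k])

# With r = M the approximation is exact and formula (5) is recovered.
w_full = low_rank_mwf(Phi_xx, Phi_nn, M, k, mu)
w_eq5 = np.linalg.solve(Phi_xx + mu * Phi_nn, Phi_xx[:, k])
```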
(III) evaluating the performance of the multi-channel wiener filter based on low rank approximation:
after the multi-microphone noisy speech signal passes through the obtained low-rank approximate multi-channel wiener filter, the signal-to-noise ratio of the output speech of the reference microphone k is calculated by the following formula:
Figure BDA0003058080780000112
in the above equation (9), the matrices A and B are calculated as follows:
Figure BDA0003058080780000113
Figure BDA0003058080780000114
thus, the output signal-to-noise ratio is:
Figure BDA0003058080780000115
therefore, the output signal-to-noise ratio depends on the reference microphone k, and the quality of the output signal can be improved by optimizing the reference microphone; it can also be proved that the output signal-to-noise ratio decreases as the rank r increases, which indicates that the output performance of the multi-channel wiener filter degrades when the estimate of the speech covariance matrix is less accurate; prior-art schemes select the reference microphone by directly maximizing the output signal-to-noise ratio, which incurs high time complexity and therefore poor real-time performance.
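The dependence on the reference microphone can be illustrated with a small NumPy experiment. The output signal-to-noise ratio is taken as the ratio of output speech power to output noise power, as in claim 2; the sketch uses the full-rank filter of formula (5) and evaluates it on the same covariance matrix it was designed with, both of which are simplifying assumptions. The function names and the random matrices are hypothetical.

```python
import numpy as np

def sdw_mwf(Phi_xx, Phi_nn, k, mu=1.0):
    """Formula (5) for reference microphone k (zero-based index)."""
    return np.linalg.solve(Phi_xx + mu * Phi_nn, Phi_xx[:, k])

def output_snr(w, Phi_xx, Phi_nn):
    """Output speech power w^H Phi_xx w over output noise power w^H Phi_nn w."""
    speech_power = np.real(np.vdot(w, Phi_xx @ w))
    noise_power = np.real(np.vdot(w, Phi_nn @ w))
    return speech_power / noise_power

rng = np.random.default_rng(2)
M = 4
B = rng.standard_normal((M, M)) + 1j * rng.standard_normal((M, M))
Phi_nn = B @ B.conj().T + M * np.eye(M)

# Ideal rank-1 speech covariance matrix h h^H (unit source power).
h = rng.standard_normal(M) + 1j * rng.standard_normal(M)
Phi_rank1 = np.outer(h, h.conj())

# Full-rank matrix, mimicking an estimate whose errors raise the rank.
A = rng.standard_normal((M, M)) + 1j * rng.standard_normal((M, M))
Phi_full = Phi_rank1 + 0.5 * A @ A.conj().T

snr_rank1 = [output_snr(sdw_mwf(Phi_rank1, Phi_nn, k), Phi_rank1, Phi_nn)
             for k in range(M)]
snr_full = [output_snr(sdw_mwf(Phi_full, Phi_nn, k), Phi_full, Phi_nn)
            for k in range(M)]
```

In the rank-1 case every reference microphone gives the same output signal-to-noise ratio, while in the full-rank case the values in `snr_full` differ from microphone to microphone.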
(IV) selecting a reference microphone based on maximizing the input signal-to-noise ratio:
to understand the relationship between the output signal-to-noise ratio and the input signal-to-noise ratio more clearly, in this step the rank r is fixed (without loss of generality, r = 2) and the influence of selecting different reference microphones on the output signal-to-noise ratio is analyzed; a two-source scene is considered, containing one target speaker and one interfering sound source, where the relative acoustic transfer function of the target sound source is the normalized principal eigenvector $q_1$; the rank-2 low-rank approximation of the speech covariance matrix eliminates the M-2 noise subspaces, so that the approximated speech covariance matrix contains the target sound source and a single interfering sound source, whose estimated disturbance component is spanned by the second eigenvector $q_2$; considering the cases where microphones 1 and 2 are respectively used as the reference microphone, the input signal-to-noise ratios of the two microphones are:
Figure BDA0003058080780000121
in the above formula (10),
Figure BDA0003058080780000122
and
Figure BDA0003058080780000123
represent the power spectral densities of the interfering source and of the noise at microphone k ∈ {1, 2}, respectively; in addition, define:
Figure BDA0003058080780000124
the output signal-to-noise ratio of the two microphones can be simplified as follows:
Figure BDA0003058080780000125
thus, the output signal-to-noise ratios of the two microphones are compared as follows:
Figure BDA0003058080780000131
wherein
Figure BDA0003058080780000132
is a positive number; thus, the difference between the output signal-to-noise ratios of the two microphones is proportional to the difference between their input signal-to-noise ratios; this shows that selecting the microphone with the larger input signal-to-noise ratio as the reference microphone yields a larger output signal-to-noise ratio. In summary, the reference microphone that maximizes the input signal-to-noise ratio can be obtained by finding the largest element, i.e.
Figure BDA0003058080780000133
The time complexity of the method is logarithmic;
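The selection rule can be sketched in a few lines of NumPy. It is assumed here that the input signal-to-noise ratio at microphone k is the ratio of the k-th diagonal entries of the speech and noise covariance matrices; the function name `select_reference_microphone`, the zero-based index and the toy matrices are illustrative.

```python
import numpy as np

def select_reference_microphone(Phi_xx, Phi_nn):
    """Pick the microphone with the largest input signal-to-noise ratio.

    The input SNR at microphone k is taken as [Phi_xx]_kk / [Phi_nn]_kk.
    Returns the zero-based index and the vector of input SNRs.
    """
    input_snr = np.real(np.diag(Phi_xx)) / np.real(np.diag(Phi_nn))
    return int(np.argmax(input_snr)), input_snr

# Toy example: rank-1 speech covariance and diagonal noise covariance.
h = np.array([1.0, 0.5, 2.0], dtype=complex)
Phi_xx = np.outer(h, h.conj())          # diagonal entries 1, 0.25, 4
Phi_nn = np.diag([1.0, 1.0, 2.0]).astype(complex)

k_ref, input_snr = select_reference_microphone(Phi_xx, Phi_nn)
```

Only the diagonal entries of the two covariance matrices are needed, so no filter has to be designed or evaluated for each candidate microphone.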
(V) speech enhancement signal based on wiener filtering beamforming:
after the reference microphone is selected according to step (IV), the mathematical expression of the multi-channel wiener filter based on low-rank approximation is determined as follows:
Figure BDA0003058080780000134
the specific filter coefficient vector can be obtained from the data; the inner product of the low-rank-approximation multi-channel wiener filter vector and the noisy speech signal vector y is then computed, i.e. beamforming is performed frequency bin by frequency bin, to obtain the estimate of the target sound source:
Figure BDA0003058080780000135
an inverse short-time Fourier transform is then applied to the estimate to recover the time-domain speech signal of the target speaker, i.e. the enhanced speech signal, which is input into the SGMM-DNN model for speech recognition to obtain the speech recognition result.
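The bin-by-bin beamforming step can be sketched as a single inner product per time-frequency point. The array layout (frames, frequency bins, microphones) and the function name `apply_beamformer` are assumptions, and the inverse short-time Fourier transform and the recognizer are not shown.

```python
import numpy as np

def apply_beamformer(W, Y):
    """Compute X_hat(t, f) = w(f)^H y(t, f) for every time-frequency point.

    W: complex array (F, M), one filter vector per frequency bin.
    Y: complex array (T, F, M), multi-microphone STFT of the noisy speech.
    Returns a complex array (T, F), the single-channel estimate.
    """
    return np.einsum("fm,tfm->tf", W.conj(), Y)

# Sanity check with a trivial filter that just picks microphone 1.
rng = np.random.default_rng(3)
T, F, M = 5, 3, 4
Y = rng.standard_normal((T, F, M)) + 1j * rng.standard_normal((T, F, M))
W = np.zeros((F, M), dtype=complex)
W[:, 1] = 1.0
X_hat = apply_beamformer(W, Y)
```

With the selection vector used here the output equals the STFT of the chosen microphone, which is a quick way to confirm the conjugation and index order before plugging in the actual filter vectors.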
The multi-channel wiener filtering method with reference microphone optimization based on the input signal-to-noise ratio establishes a rigorous mathematical model relating the output signal-to-noise ratio to the reference microphone, selects the reference microphone by maximizing the input signal-to-noise ratio based on an analysis of the relation between the output and input signal-to-noise ratios, effectively reduces the time complexity of reference microphone selection, and improves the speech enhancement and speech recognition performance of multi-microphone systems.
The above description is only for the preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (3)

1. A multi-channel speech enhancement method based on reference microphone optimization, comprising:
step 1, establishing a multi-channel wiener filter: randomly selecting one microphone from a microphone array consisting of M microphones as a reference microphone, taking the mean square error between a filter output signal and a reference pure voice signal of the selected reference microphone and the weighted output noise power as an objective function, and minimizing the objective function to obtain an original wiener filter; generalized eigenvalue decomposition is carried out on a voice covariance matrix and a noise covariance matrix contained in a multi-microphone noisy voice signal covariance matrix of a microphone array to obtain M eigenvalues, low-rank approximation is carried out on the voice covariance matrix by selecting the first k eigenvalues and corresponding eigenvectors in the M eigenvalues, k is more than 0 and less than or equal to M to obtain a low-rank approximated voice covariance matrix with k as a rank, and the voice covariance matrix based on the low-rank approximation is substituted into the original wiener filter to obtain a low-rank approximated multi-channel wiener filter based on the voice covariance matrix, the generalized eigenvalues, the eigenvectors and a reference microphone;
step 2, establishing an output signal-to-noise ratio mathematical model: taking the ratio of the output signal power and the output noise power of the low-rank approximate multi-channel wiener filter in the step 1 as an output signal-to-noise ratio mathematical model, wherein the output signal-to-noise ratio mathematical model is a function of a reference microphone and a rank;
step 3, selecting a reference microphone: selecting two microphones based on the output signal-to-noise ratio mathematical model established in step 2, calculating the difference between the output signal-to-noise ratios of the two microphones, and selecting the microphone with the largest input signal-to-noise ratio as the reference microphone according to the output signal-to-noise ratio difference;
step 4, beamforming to obtain an enhanced speech signal: substituting the rank determined in step 1 and the reference microphone selected in step 3 into the low-rank approximate multi-channel wiener filter established in step 1, performing a weighted-summation inner product operation between the multi-microphone speech signal to be enhanced and the low-rank approximate multi-channel wiener filter in the short-time frequency domain, and thereby beamforming to obtain a single-channel enhanced speech signal.
2. The multi-channel speech enhancement method based on reference microphone optimization of claim 1, wherein in step 2, the output signal power of the low-rank approximate multi-channel wiener filter is obtained by multiplying the conjugate transpose of the low-rank approximate multi-channel wiener filter in the frequency domain by the input speech covariance matrix and then by the filter vector;
the output noise power of the low-rank approximate multi-channel wiener filter is the product of the conjugate transpose of the filter vector, the input noise covariance matrix and the filter vector in the frequency domain.
3. The reference microphone optimization based multi-channel speech enhancement method according to claim 1 or 2, characterized in that the method further comprises: step 5, speech recognition: inputting the enhanced speech signal obtained in step 4 into a speech recognizer based on a subspace Gaussian mixture model-deep neural network for speech recognition, and transcribing and analyzing the content of the target sound source.
CN202110505085.9A 2021-05-10 2021-05-10 Multi-channel voice enhancement method based on reference microphone optimization Active CN113257270B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110505085.9A CN113257270B (en) 2021-05-10 2021-05-10 Multi-channel voice enhancement method based on reference microphone optimization

Publications (2)

Publication Number Publication Date
CN113257270A true CN113257270A (en) 2021-08-13
CN113257270B CN113257270B (en) 2022-07-15

Family

ID=77222524

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110505085.9A Active CN113257270B (en) 2021-05-10 2021-05-10 Multi-channel voice enhancement method based on reference microphone optimization

Country Status (1)

Country Link
CN (1) CN113257270B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100002886A1 (en) * 2006-05-10 2010-01-07 Phonak Ag Hearing system and method implementing binaural noise reduction preserving interaural transfer functions
CN102938254A (en) * 2012-10-24 2013-02-20 中国科学技术大学 Voice signal enhancement system and method
CN102969000A (en) * 2012-12-04 2013-03-13 中国科学院自动化研究所 Multi-channel speech enhancement method
US20140056435A1 (en) * 2012-08-24 2014-02-27 Retune DSP ApS Noise estimation for use with noise reduction and echo cancellation in personal communication
WO2015086377A1 (en) * 2013-12-11 2015-06-18 Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. Extraction of reverberant sound using microphone arrays
CN105206281A (en) * 2015-09-14 2015-12-30 胡旻波 Voice enhancement device based on distributed microphone array network
US20210076124A1 (en) * 2019-09-11 2021-03-11 Oticon A/S Hearing device comprising a noise reduction system

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
JIE ZHANG ET AL: "A Study on Reference Microphone Selection for Multi-Microphone Speech Enhancement", 《IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING ( VOLUME: 29)》 *
SEBASTIAN STENZEL ET AL: "A multichannel Wiener filter with partial equalization for distributed microphones", 《2013 IEEE WORKSHOP ON APPLICATIONS OF SIGNAL PROCESSING TO AUDIO AND ACOUSTICS》 *
XIAO Qiang et al.: "Microphone array speech enhancement based on MGSC and improved Wiener filtering", 《Technical Acoustics (声学技术)》 *

Also Published As

Publication number Publication date
CN113257270B (en) 2022-07-15

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant