CN113257270A - Multi-channel voice enhancement method based on reference microphone optimization

Info

Publication number: CN113257270A
Application number: CN202110505085.9A
Authority: CN
Inventors: 张结; 陈星宇; 戴礼荣
Original assignee: University of Science and Technology of China USTC
Current assignee: University of Science and Technology of China USTC
Priority date: 2021-05-10
Filing date: 2021-05-10
Publication date: 2021-08-13
Anticipated expiration: 2041-05-10
Also published as: CN113257270B

Abstract

The invention discloses a multi-channel speech enhancement method based on reference microphone optimization, comprising the following steps: step 1, establishing a low-rank approximate multi-channel Wiener filter; step 2, establishing a mathematical model of the output signal-to-noise ratio; step 3, selecting a reference microphone: based on the output signal-to-noise ratio model established in step 2, selecting two microphones, calculating the difference between their output signal-to-noise ratios, and selecting the microphone with the largest input signal-to-noise ratio as the reference microphone; and step 4, beamforming to obtain an enhanced speech signal: substituting the rank selected in step 2 and the reference microphone selected in step 3 into the low-rank approximate multi-channel Wiener filter established in step 1, and computing, in the short-time frequency domain, the inner product (a weighted-sum beamforming operation) of the multi-microphone speech signal to be enhanced and the low-rank approximate multi-channel Wiener filter; the result is the single-channel enhanced speech signal. The method effectively reduces the time complexity of reference microphone selection and improves multi-microphone speech enhancement and speech recognition performance.

Description

Multi-channel voice enhancement method based on reference microphone optimization

Technical Field

The invention relates to the field of voice signal processing, in particular to a multi-channel voice enhancement and voice recognition method based on a microphone array.

Background

Speech enhancement aims to extract a clean source signal from a noisy, reverberant sound signal captured by acoustic sensors; its performance is mainly measured by the output signal-to-noise ratio (SNR), speech intelligibility, and the like. Microphone-array-based multi-channel speech enhancement has important applications in many practical systems, for example: high-quality voice communication in cocktail-party scenarios, human-machine voice interaction in smart homes, auditory perception and interaction for intelligent robots, and hearing-assistance devices for the hearing impaired. Currently, beamforming based on multi-channel Wiener filtering (MWF) is one of the mainstream methods. The MWF designs a beamformer in the short-time frequency domain by minimizing the mean square error (MSE) between the output signal and the reference target source signal, where the output signal is a weighted sum of the input signals. Mathematically, the MWF depends on the signal covariance matrices and a reference microphone vector. Although the output signal obtained in this way has the minimum mean square error with respect to the reference source signal, the distortion of the source signal is not controlled, so the enhanced signal is spectrally distorted, which affects speech intelligibility and listening comfort.
Doclo et al. proposed a speech enhancement algorithm based on speech-distortion-weighted multi-channel Wiener filtering (SDW-MWF), which takes the mean square error used to design the traditional MWF together with a weighted output noise variance as the optimization objective. The resulting beamformer therefore depends not only on the signal covariance matrices and the reference microphone but also on the weight of the noise variance, so choosing this weight is particularly important when designing the SDW-MWF, and it should be adjusted according to the system performance requirements. When the weight is 0, the filter degenerates into the traditional MWF; when the weight is increased, the spectral distortion of the output speech signal is smaller, but the quality of the output signal deteriorates.

However, whichever multi-channel Wiener filtering method is adopted, the reference microphone is a key parameter that controls the beamforming-based speech enhancement output. In traditional multi-microphone beamforming, the reference microphone is usually designated arbitrarily, or simply chosen as the microphone closest to the sound source (in reality, the distance from the sound source to each microphone is unknown). The output of the beamformer is an estimate of the clean source signal component at the reference microphone; in practice, the microphones lie at different distances from the sound source and from the interference sources, and their self (thermal) noise powers also differ, all of which determines how the signal-to-noise ratio is distributed across the microphone signals. The choice of reference microphone therefore affects the quality of the enhanced signal. To overcome this effect, it is necessary to evaluate quantitatively the quality (e.g., the signal-to-noise ratio) of the enhanced signal produced by the beamformer, and then to optimize the selection of the reference microphone so as to improve speech enhancement performance.

An existing document, T. C. Lawin-Ore and S. Doclo, Reference microphone selection for MWF-based noise reduction using distributed microphone arrays, in Proceedings of the 10th ITG Symposium on Speech Communication, 2012, proposes a reference microphone selection method for multi-channel Wiener filtering speech enhancement based on maximizing the output signal-to-noise ratio. The method enumerates the microphones, designing a Wiener filter with each microphone in turn as the reference microphone, compares the output signal-to-noise ratios of the different cases, and selects the microphone that yields the maximum output signal-to-noise ratio as the reference microphone; it obviously has to traverse all cases and is very time-consuming. The existing document J. Zhang, H. Chen, and R. C. Hendriks, A study on reference microphone selection for multi-microphone speech enhancement, IEEE/ACM Trans. Audio, Speech, and Language Process., 29: 671-.

Disclosure of Invention

In view of the problems in the prior art, the invention aims to provide a multi-channel speech enhancement method based on reference microphone optimization, which addresses the problems of existing reference microphone selection methods for speech enhancement: they are time-consuming, have high time complexity, and restrict the real-time performance of speech information processing systems.

The purpose of the invention is realized by the following technical scheme:

An embodiment of the invention provides a multi-channel speech enhancement method based on reference microphone optimization, which comprises the following steps:

step 1, establishing a multi-channel Wiener filter: arbitrarily selecting one microphone from a microphone array of M microphones as the reference microphone; taking the mean square error between the filter output signal and the clean reference speech signal at the selected reference microphone, plus the weighted output noise power, as the objective function; minimizing the objective function to obtain an original Wiener filter; performing generalized eigenvalue decomposition on the speech covariance matrix and the noise covariance matrix contained in the covariance matrix of the multi-microphone noisy speech signal of the microphone array to obtain M eigenvalues; performing a low-rank approximation of the speech covariance matrix using the first k of the M eigenvalues and the corresponding eigenvectors, where 0 < k ≤ M, to obtain a low-rank approximated speech covariance matrix of rank k; and substituting the low-rank approximated speech covariance matrix into the original Wiener filter to obtain a low-rank approximate multi-channel Wiener filter that depends on the speech covariance matrix, the generalized eigenvalues, the eigenvectors, and the reference microphone;

step 2, establishing an output signal-to-noise ratio mathematical model: taking the ratio of the output signal power to the output noise power of the low-rank approximate multi-channel Wiener filter of step 1 as the output signal-to-noise ratio mathematical model, which is a function of the reference microphone and the rank;

step 3, selecting a reference microphone: based on the output signal-to-noise ratio mathematical model established in step 2, selecting two microphones, calculating the difference between the output signal-to-noise ratios of the two microphones, and, according to that difference, selecting the microphone with the largest input signal-to-noise ratio as the reference microphone;

step 4, beamforming to obtain an enhanced speech signal: substituting the rank determined in step 1 and the reference microphone selected in step 3 into the low-rank approximate multi-channel Wiener filter established in step 1, and computing, in the short-time frequency domain, the weighted-sum inner product of the multi-microphone speech signal to be enhanced and the low-rank approximate multi-channel Wiener filter, thereby beamforming to obtain a single-channel enhanced speech signal.

According to the technical scheme provided by the invention, the multichannel voice enhancement method based on the optimization of the reference microphone has the beneficial effects that:

By analytically establishing a mathematical model of how the reference microphone affects the output signal quality of the multi-channel Wiener filter, and by selecting the reference microphone sub-optimally according to the maximum-input-signal-to-noise-ratio criterion, on the basis of the relationship between the output and input signal-to-noise ratios determined by that analysis, the method reduces the time complexity of reference microphone selection, effectively improves the output signal quality and speech intelligibility of multi-channel speech enhancement, and improves the performance of multi-microphone noisy speech recognition systems.

Drawings

In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.

FIG. 1 is a flowchart of a multi-channel speech enhancement method based on reference microphone optimization according to an embodiment of the present invention;

FIG. 2 is a block diagram of a multi-channel speech enhancement system based on reference microphone optimization according to an embodiment of the present invention;

FIG. 3 shows the multi-channel Wiener filtering speech enhancement performance based on reference microphone selection and low-rank approximation (maxiSNR denotes the method proposed by the present invention): FIG. 3(a), signal-to-noise ratio gain; FIG. 3(b), speech intelligibility gain;

FIG. 4 shows a noisy speech recognition system based on a uniform linear microphone array according to an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention are clearly and completely described below with reference to the specific contents of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present invention without making any creative effort, shall fall within the protection scope of the present invention. Details which are not described in detail in the embodiments of the invention belong to the prior art which is known to the person skilled in the art.

As shown in FIG. 1, an embodiment of the present invention provides a multi-channel speech enhancement method based on reference microphone optimization. The method is applied to a speech recognition system whose microphones form an array; it first models mathematically the speech enhancement performance of multi-channel Wiener filtering and then selects the reference microphone by maximizing the input signal-to-noise ratio. The method includes establishing a Wiener filter, establishing a mathematical model of the output signal-to-noise ratio, analyzing the performance of the Wiener filter, and selecting a reference microphone, where the Wiener filter is established following the design criterion of variable span filters or of the speech distortion weighted Wiener filter (SDW-MWF). The specific scheme is as follows:

step 1, establishing a wiener filter:

first, an original Wiener filter (a general form of the multi-channel Wiener filter) is built: a reference microphone is selected from a microphone array of M microphones (namely, the array formed by the microphones of the speech recognition system); the mean square error between the filter output signal and the clean reference speech signal at the reference microphone, plus the weighted output noise power, is taken as the objective function; and the objective function is minimized to obtain the original Wiener filter;

secondly, generalized eigenvalue decomposition is performed on the speech covariance matrix (the covariance matrix of the speech components of the microphone signals) and the noise covariance matrix (the covariance matrix of the noise components of the microphone signals), both contained in the covariance matrix of the multi-microphone noisy speech signal, to obtain M eigenvalues; the generalized eigenvalues and corresponding eigenvectors are then used for a low-rank approximation of the speech covariance matrix, that is, the first k eigenvalues and their eigenvectors are selected to obtain a low-rank approximated speech covariance matrix of rank k, where 0 < k ≤ M and M is the number of microphones (the M microphone signals of the array yield M eigenvalues with linearly independent eigenvectors); the low-rank approximated speech covariance matrix is substituted into the original Wiener filter to obtain a low-rank approximate multi-channel Wiener filter that depends on the rank of the speech covariance matrix, the generalized eigenvalues, the eigenvectors, and the reference microphone;

this processing makes it convenient to analyze how the reference microphone and the rank of the speech covariance matrix affect the output signal quality of the original Wiener filter; research shows that, for different choices of rank, the original Wiener filter turns into different beamformers, such as the common MWF, maximum signal-to-noise ratio (maxSNR), and minimum variance distortionless response (MVDR) beamformers;

step 2, establishing an output signal-to-noise ratio mathematical model: the ratio of the output signal power to the output noise power of the low-rank approximate multi-channel Wiener filter obtained in step 1 is taken as the mathematical model of the output signal-to-noise ratio; this model is a function of the reference microphone and the rank, and from it the dependence of the enhanced signal quality on the reference microphone and the rank can be obtained. It can be proved theoretically that the larger the rank, the smaller the output signal-to-noise ratio, that is, the rank-1 beamformer maximizes the output signal-to-noise ratio; in reality, however, the speech covariance matrix estimated from limited observation data more commonly has a rank other than 1. The rank k of the selected speech covariance matrix can therefore be chosen as any integer between 1 and M (M being the total number of microphones);

in step 2, the output signal power of the low-rank approximate multi-channel Wiener filter is the product, in the frequency domain, of the conjugate transpose of the filter vector, the input speech covariance matrix, and the filter vector;

the output noise power of the low-rank approximate multi-channel Wiener filter is the product, in the frequency domain, of the conjugate transpose of the filter vector, the input noise covariance matrix, and the filter vector.

Step 3, selecting a reference microphone: based on the output signal-to-noise ratio mathematical model of step 2, two microphones are selected and the difference between their output signal-to-noise ratios is calculated; according to this difference, the microphone with the largest input signal-to-noise ratio is selected as the reference microphone. Because the analysis confirms that the output signal-to-noise ratio gain is positively correlated with the difference between the input signal-to-noise ratios of the two candidate reference microphones, selecting the microphone with the largest input signal-to-noise ratio as the reference microphone optimizes the output signal-to-noise ratio while avoiding complex optimization operations;

step 4, beamforming to obtain an enhanced speech signal: the rank determined in step 1 and the reference microphone obtained in step 3 are substituted into the low-rank approximate multi-channel Wiener filter of step 1, and, in the short-time frequency domain, the weighted-sum inner product of the multi-microphone speech signal to be enhanced (namely, the multi-microphone noisy speech signal) and the low-rank approximate multi-channel Wiener filter is computed, thereby beamforming to obtain a single-channel enhanced speech signal.
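Step 4 amounts to one inner product per time-frequency point. The following minimal sketch illustrates it under stated assumptions: NumPy is assumed, the function name `beamform` and the array layouts (filters stored as `(F, M)`, STFT coefficients as `(T, F, M)`) are hypothetical choices made for the illustration, and random data stand in for real microphone signals.

```python
import numpy as np

def beamform(w, Y):
    """Apply per-frequency filters w of shape (F, M) to a multi-microphone STFT Y of shape (T, F, M).

    Returns the single-channel STFT X_hat[t, f] = w[f]^H y[t, f], shape (T, F).
    """
    return np.einsum('fm,tfm->tf', w.conj(), Y)

# Toy data: T frames, F frequency bins, M microphones (hypothetical sizes).
rng = np.random.default_rng(0)
T, F, M = 5, 3, 4
Y = rng.standard_normal((T, F, M)) + 1j * rng.standard_normal((T, F, M))
w = rng.standard_normal((F, M)) + 1j * rng.standard_normal((F, M))
X_hat = beamform(w, Y)

# A filter equal to the selection vector of microphone 3 (0-based index 2)
# simply returns that microphone's STFT.
selector = np.zeros((F, M))
selector[:, 2] = 1.0
passthrough = beamform(selector, Y)
```

In a complete system the enhanced STFT would typically be converted back to a time-domain waveform by an inverse STFT before recognition.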

Further, the method also comprises step 5, speech recognition: the single-channel enhanced speech signal obtained in step 4 is input into a speech recognizer based on a subspace Gaussian mixture model and deep neural network (SGMM-DNN) for speech recognition, and the content of the target sound source is obtained by transcription and analysis.

The structure of a system implementing the above method is shown in FIG. 4. The input of the voice activity detector (VAD) is the multi-microphone noisy speech signal, and its output is the set of noise frames and the set of noise-plus-speech frames; from these, the noise covariance matrix and the mixed covariance matrix are estimated from the noise frames and the noise-plus-speech frames, respectively, using a moving average technique. The generalized eigenvalue decomposition module takes the noise covariance matrix and the mixed covariance matrix as input and outputs the generalized eigenvalues and the corresponding eigenvectors. The low-rank approximation module uses the generalized eigenvalues and eigenvectors to approximate the speech covariance matrix for the different ranks under consideration and outputs the result. The input signal-to-noise ratio estimation module estimates the distribution of the input signal-to-noise ratio over the microphones from the original multi-microphone signals and the voice activity detection result. The reference microphone selection module takes the input signal-to-noise ratio distribution as input and outputs the index of the reference microphone. The SDW-MWF Wiener filter design module takes the low-rank approximated speech covariance matrix and the reference microphone as input and outputs the frequency-domain filter vector. The beamforming module applies the Wiener filter to the original multi-microphone signals, outputs the enhanced single-channel speech signal, and can also report the output signal-to-noise ratio and speech intelligibility. Finally, the enhanced single-channel speech signal is input into the SGMM-DNN speech recognition model to obtain the recognized speech content of the target speaker, the word error rate, and so on.
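The generalized eigenvalue decomposition, low-rank approximation, and filter design modules can be sketched numerically as follows. This is a minimal illustration and not the implementation of the embodiment: NumPy is assumed; the function names `gevd` and `low_rank_filter` and the parameter `mu` are hypothetical; the decomposition is applied to the speech and noise covariance matrices (the speech covariance matrix being the mixed covariance matrix minus the noise covariance matrix); the eigenvectors are assumed to be normalized so that U^H Φ_nn U = I; and the microphone index `ref` is 0-based.

```python
import numpy as np

def gevd(phi_xx, phi_nn):
    """Solve phi_xx u = lam * phi_nn u for Hermitian phi_xx and positive-definite phi_nn.

    Returns the eigenvalues in descending order and eigenvectors U normalized
    so that U^H phi_nn U = I and U^H phi_xx U = diag(lam).
    """
    chol = np.linalg.cholesky(phi_nn)            # phi_nn = chol @ chol^H
    chol_inv = np.linalg.inv(chol)
    whitened = chol_inv @ phi_xx @ chol_inv.conj().T
    lam, vecs = np.linalg.eigh(whitened)         # eigenvalues in ascending order
    lam, vecs = lam[::-1], vecs[:, ::-1]         # reorder to descending
    return lam, chol_inv.conj().T @ vecs

def low_rank_filter(phi_xx, phi_nn, k, ref, mu=1.0):
    """Rank-k approximation of phi_xx and the resulting Wiener filter (assumed form)."""
    lam, U = gevd(phi_xx, phi_nn)
    Q = phi_nn @ U                               # Q = U^{-H}, hence phi_xx = Q diag(lam) Q^H
    phi_xx_k = (Q[:, :k] * lam[:k]) @ Q[:, :k].conj().T
    e_ref = np.zeros(phi_xx.shape[0])
    e_ref[ref] = 1.0
    w = np.linalg.solve(phi_xx_k + mu * phi_nn, phi_xx_k @ e_ref)
    return w, phi_xx_k

# Toy example with M = 4 microphones and random covariance matrices.
rng = np.random.default_rng(0)
M = 4
A = rng.standard_normal((M, M)) + 1j * rng.standard_normal((M, M))
B = rng.standard_normal((M, M)) + 1j * rng.standard_normal((M, M))
phi_xx = A @ A.conj().T
phi_nn = B @ B.conj().T + M * np.eye(M)
w2, phi_xx_2 = low_rank_filter(phi_xx, phi_nn, k=2, ref=0)
wM, phi_xx_M = low_rank_filter(phi_xx, phi_nn, k=M, ref=0)
```

With k = M the approximation reproduces the full speech covariance matrix, while k = 2 keeps only the two dominant generalized eigen-components.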

By analyzing the mathematical relationship between the output signal-to-noise ratio, on the one hand, and the reference microphone and the rank of the speech covariance matrix, on the other, the invention proposes, from the standpoint of multi-channel Wiener filter design theory, a reference microphone selection method based on maximizing the input signal-to-noise ratio. This gives a clearer understanding of the relationship between signal quality before and after speech enhancement in a multi-microphone speech information processing system and improves the robustness of speech enhancement and noisy speech recognition. The method solves two problems of existing methods: arbitrarily specifying the reference microphone cannot guarantee a reference microphone with good signal quality, which degrades the output signal-to-noise ratio; and the enumeration method and the semidefinite programming optimization method have high time complexity, which makes them time-consuming and poorly suited to real-time operation.
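Selecting the reference microphone by maximum input signal-to-noise ratio requires only the per-microphone speech and noise powers, with no filter design per candidate. The following minimal sketch assumes NumPy, a hypothetical function name `select_reference_microphone`, covariance matrices stacked over frequency as arrays of shape `(F, M, M)`, and a broadband input signal-to-noise ratio obtained by summing powers over frequency; the summation over frequency is an assumption of this illustration, not a detail stated in the text.

```python
import numpy as np

def select_reference_microphone(phi_xx, phi_nn):
    """Return the 0-based index of the microphone with the largest input SNR.

    phi_xx, phi_nn: speech and noise covariance matrices, shape (F, M, M).
    Also returns the per-microphone input SNR in dB.
    """
    speech_power = np.real(np.einsum('fmm->m', phi_xx))   # sum of diagonals over frequency
    noise_power = np.real(np.einsum('fmm->m', phi_nn))
    input_snr = speech_power / noise_power
    return int(np.argmax(input_snr)), 10.0 * np.log10(input_snr)

# Toy example: 3 microphones, 2 frequency bins, diagonal covariance matrices.
phi_xx = np.stack([np.diag([1.0, 4.0, 2.0])] * 2).astype(complex)
phi_nn = np.stack([np.diag([1.0, 1.0, 4.0])] * 2).astype(complex)
ref, snr_db = select_reference_microphone(phi_xx, phi_nn)
```

In this toy case the second microphone (index 1) has the largest speech-to-noise power ratio and is therefore selected.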

The effectiveness of the method of the invention was verified by the following experiments, including:

(I) experimental setup:

The method is verified on speech enhancement and noisy speech recognition with a linear microphone array in a conference room scenario. The experimental configuration is shown in FIG. 2: the room size is 4 m × 3 m, and the reverberation time is 200 milliseconds; the uniform linear array comprises 8 omnidirectional microphones, that is, M = 8; the center of the microphone array is located at (2, 0.5) m; the microphones are numbered 1 to 8 from left to right, and the spacing between adjacent microphones is 2 cm; the target speaker is located in the direction θ = 45°, the two interference sources are located at 0° and 180°, respectively, and all sound sources lie on a semicircle of radius 1 meter centered on the microphone array; the speech of the target speaker comes from the test set of the TIMIT English database (J. Garofolo, L. Lamel, W. Fisher, J. Fiscus, and D. Pallett, DARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM. NIST speech disc 1-1.1, NASA STI/Recon Technical Report N, 93:27403, 1993), which contains 192 sentences in total from 24 different speakers; the speech recognition module uses the SGMM-DNN model, trained on the TIMIT training set, which contains 3696 sentences in total; and the interfering source signals come from the NoiseX-92 database. The sampling frequency of all source signals is fixed at 16 kHz, and the short-time Fourier transform (STFT) uses a 32 ms square-root Hann window with a frame shift of 16 ms.
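The STFT settings translate into sample counts as sketched below. The sketch assumes NumPy, a periodic Hann window, and the same square-root window for analysis and synthesis; the text states only the window type, its length, and the frame shift, so the latter two points are assumptions of this illustration.

```python
import numpy as np

fs = 16000                          # sampling frequency in Hz
frame_len = fs * 32 // 1000         # 32 ms window  -> 512 samples
hop = fs * 16 // 1000               # 16 ms shift   -> 256 samples (50 % overlap)
num_bins = frame_len // 2 + 1       # one-sided frequency bins for a real signal

n = np.arange(frame_len)
hann = 0.5 - 0.5 * np.cos(2.0 * np.pi * n / frame_len)   # periodic Hann window
window = np.sqrt(hann)                                    # square-root Hann window

# Analysis window times synthesis window equals the Hann window; shifted by
# half a frame and added, the two copies sum to one (perfect reconstruction).
overlap_sum = window[:hop] ** 2 + window[hop:] ** 2
```

The constant overlap sum is the reason a square-root Hann window is a common choice when the signal is filtered in the STFT domain and then resynthesized.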

Given the room configuration, the room impulse responses (RIRs) from the sound sources to the microphone array are generated using the image method (J. B. Allen and D. A. Berkley, Image method for efficiently simulating small-room acoustics, The Journal of the Acoustical Society of America, 65(4):943-950, 1979). The mixed signal captured by each microphone is the superposition of a target source component, interference source components, and an uncorrelated noise component: the target source component is the source signal convolved with the room impulse response of the source; the two interference source components are the interference signals convolved with their corresponding room impulse responses; and the uncorrelated noise component simulates microphone self-noise as white Gaussian noise. The power of the interference sources is controlled by the signal-to-interference ratio (SIR), and the power of the uncorrelated noise is controlled by the signal-to-noise ratio (SNR); when SIR = 0 dB and SNR = 40 dB, the overall input signal-to-noise ratio is slightly less than 0 dB.
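The statement that the overall input signal-to-noise ratio is slightly below 0 dB can be checked with a short calculation. The sketch below assumes that the SIR refers to the total power of the interference sources and that the overall input signal-to-noise ratio counts interference and sensor noise together as noise; the function name `overall_input_snr_db` is hypothetical.

```python
import math

def overall_input_snr_db(sir_db, snr_db):
    """Overall input SNR when interference and sensor noise are both counted as noise.

    Powers are expressed relative to a unit-power target source component.
    """
    interference_power = 10.0 ** (-sir_db / 10.0)
    sensor_noise_power = 10.0 ** (-snr_db / 10.0)
    return -10.0 * math.log10(interference_power + sensor_noise_power)

overall_0db = overall_input_snr_db(sir_db=0.0, snr_db=40.0)    # about -0.0004 dB
overall_20db = overall_input_snr_db(sir_db=20.0, snr_db=40.0)  # about 19.96 dB
```

At SIR = 0 dB the total noise power is 1 + 10^-4 times the source power, which is why the overall value falls just below 0 dB.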

(II) experimental results:

First, the speech enhancement results are verified. The quality and intelligibility of the output speech signal are measured by the signal-to-noise ratio gain and by short-time objective intelligibility (STOI; C. H. Taal, R. C. Hendriks, R. Heusdens, and J. Jensen, An algorithm for intelligibility prediction of time-frequency weighted noisy speech, IEEE Trans. Audio, Speech, and Language Process., 19(7):2125-2136, 2011), respectively. The STOI parameter measures mutual information between the enhanced signal and the clean reference signal; it ranges from 0 to 1, a higher value means the enhanced signal is closer to the clean source signal, and it is widely used to evaluate speech intelligibility. It can be seen (FIG. 3) that the output signal-to-noise ratio and speech intelligibility decrease as the rank increases, and that the reference microphone has a significant impact on the performance of microphone-array-based speech enhancement. The reference microphone selection method based on the maximum input signal-to-noise ratio, namely maxiSNR, selects the first microphone as the reference microphone because the first microphone is closest to the target speaker; although this does not give the optimal output signal-to-noise ratio, it gives the best speech intelligibility. Note that for rank-1 beamforming methods (such as MVDR), the output signal-to-noise ratio does not depend on the reference microphone, but speech intelligibility is always related to the reference microphone.
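The two trends for the output signal-to-noise ratio (it does not increase with the rank, and at rank 1 it does not depend on the reference microphone) can be reproduced on synthetic covariance matrices. The following minimal sketch assumes NumPy; the helper `rank_k_filter` is hypothetical and uses a closed form of the rank-k filter obtained from the generalized eigenvalue decomposition, with eigenvectors normalized so that U^H Φ_nn U = I and a weight `mu` on the noise term; random matrices stand in for measured data.

```python
import numpy as np

def output_snr(w, phi_xx, phi_nn):
    """Ratio of output speech power w^H phi_xx w to output noise power w^H phi_nn w."""
    signal_power = np.real(w.conj() @ phi_xx @ w)
    noise_power = np.real(w.conj() @ phi_nn @ w)
    return signal_power / noise_power

def rank_k_filter(phi_xx, phi_nn, k, ref, mu=1.0):
    """Hypothetical helper: Wiener filter built on a rank-k GEVD approximation of phi_xx."""
    chol_inv = np.linalg.inv(np.linalg.cholesky(phi_nn))
    lam, vecs = np.linalg.eigh(chol_inv @ phi_xx @ chol_inv.conj().T)
    lam, vecs = lam[::-1], vecs[:, ::-1]          # descending generalized eigenvalues
    U = chol_inv.conj().T @ vecs                  # U^H phi_nn U = I
    Q = phi_nn @ U                                # phi_xx = Q diag(lam) Q^H
    gains = lam[:k] / (lam[:k] + mu)
    return U[:, :k] @ (gains * Q[ref, :k].conj())

rng = np.random.default_rng(1)
M = 4
A = rng.standard_normal((M, M)) + 1j * rng.standard_normal((M, M))
B = rng.standard_normal((M, M)) + 1j * rng.standard_normal((M, M))
phi_xx = A @ A.conj().T
phi_nn = B @ B.conj().T + M * np.eye(M)

# Output SNR for ranks 1..M with the first and the last microphone as reference.
snr_by_rank = {
    ref: [output_snr(rank_k_filter(phi_xx, phi_nn, k, ref), phi_xx, phi_nn)
          for k in range(1, M + 1)]
    for ref in (0, M - 1)
}
```

For each reference microphone the list in `snr_by_rank` is non-increasing in the rank, and the rank-1 entries coincide for both references because the rank-1 output signal-to-noise ratio equals the largest generalized eigenvalue.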

Table 1 below gives the effect of the low-rank approximation of the speech covariance matrix (for different ranks) and of the reference microphone on the speech recognition word error rate (WER) when the signal-to-interference ratio (SIR) of the two interferers is 0 dB. Note that for the clean source signal the word error rate of SGMM-DNN-based speech recognition is 18.0%, the word error rate of the mixed signal at the first microphone is 78.0%, and all statistics are averages of the word error rates over all test sentences. It can be seen that the word error rate depends on both the rank and the reference microphone. As the rank increases, the speech recognition accuracy decreases, which is consistent with the trend of the output signal-to-noise ratio and speech intelligibility in speech enhancement. When the rank is 1, the reference microphone has little influence on the word error rate, but when the rank is not 1 it is important to select a proper reference microphone.

Table 1. Word error rate statistics for speech recognition with different ranks and reference microphones (SIR = 0 dB)

Table 2 below gives the effect of the low-rank approximation of the speech covariance matrix (for different ranks) and of the reference microphone on the speech recognition word error rate when the signal-to-interference ratio (SIR) of the two interferers is 20 dB, in which case the word error rate of the mixed signal at the first microphone is 38.7%. Analysis leads to conclusions similar to those of Table 1. Comparing Table 1 with Table 2 shows clearly that noise severely reduces speech recognition accuracy and that a speech enhancement front end based on multi-channel Wiener filtering improves speech recognition performance. Since the word error rate and speech intelligibility depend on both the reference microphone and the rank, whereas the output signal-to-noise ratio depends on the reference microphone only when the rank is not 1, speech recognition accuracy depends more on the intelligibility of the input speech than on its signal-to-noise ratio; see FIG. 3(a) and FIG. 3(b). The lowest word error rate is obtained with the maximum-input-signal-to-noise-ratio reference microphone selection method (namely, k = 1).

Table 2. Word error rate statistics for speech recognition with different ranks and reference microphones (SIR = 20 dB)
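The word error rates in Tables 1 and 2 are averages over test sentences. As an illustration of how such a statistic is commonly computed, the following sketch uses the standard word-level edit distance (substitutions, deletions, and insertions divided by the number of reference words). The function names and the example sentences are hypothetical, and the scoring tool actually used in the experiments is not stated in the text.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref_words, hyp_words = reference.split(), hypothesis.split()
    prev = list(range(len(hyp_words) + 1))      # distances for an empty reference
    for i, ref_word in enumerate(ref_words, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, substitution))
        prev = cur
    return prev[-1] / len(ref_words)

def mean_wer(pairs):
    """Average of the per-sentence word error rates."""
    return sum(word_error_rate(ref, hyp) for ref, hyp in pairs) / len(pairs)

pairs = [
    ("the cat sat on the mat", "the cat sit on mat"),          # 1 substitution + 1 deletion
    ("she had your dark suit", "she had your dark suit"),      # no errors
]
average = mean_wer(pairs)
```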

In summary, the multi-channel speech enhancement method based on reference microphone optimization of the present invention has at least the following advantages: first, in the design of the multi-channel Wiener filter, a model of the dependence of the output signal quality on the rank of the speech covariance matrix and on the reference microphone is analyzed and used to choose the actual beamformer; second, optimizing the reference microphone by the maximum-input-signal-to-noise-ratio criterion has lower time complexity than reference microphone selection based on the maximum-output-signal-to-noise-ratio criterion.

The experimental results show that the method is suboptimal in terms of output signal-to-noise ratio but maximizes the intelligibility of the output speech. When speech enhancement based on multi-channel Wiener filtering is applied to noisy speech recognition, the proposed reference microphone selection method significantly reduces the word error rate and improves the robustness of the multi-microphone noisy speech recognition system.

The embodiments of the present invention are described in further detail below.

The embodiment provides a multi-channel speech enhancement method based on reference microphone optimization, which comprises the following steps:

(I) signal model:

In this embodiment, taking a noisy speech recognition system with M microphones as an example, let t and f denote the frame index and the frequency index in the short-time frequency domain, respectively. The noisy speech signal captured by the m-th microphone is expressed as:

$$Y_m(t,f) = h_m(f)\,X_k(t,f) + N_m(t,f), \quad m = 1, \ldots, M \tag{1}$$

In formula (1), $X_k(t,f)$, $N_m(t,f)$ and $h_m(f)$ denote, respectively, the clean source component at the reference microphone k, the noise component at microphone m (including interfering sources, background noise, reverberation, microphone self-noise, etc.), and the relative acoustic transfer function (RTF) of the target source to microphone m. In this signal model, microphone k is taken as the reference microphone; it is further optimized in the reference microphone selection step below. For each time-frequency point, the STFT coefficients of the M microphones are stacked into a column vector $y = [Y_1(t,f), Y_2(t,f), \ldots, Y_M(t,f)]^T$. Defining the vectors $h$ and $n$ of relative acoustic transfer functions and noise components in the same way, the signal model can be written in vector form:

$$y = hX_k + n \tag{2}$$

in the above formula (2), the time-frequency index (t, f) is omitted for brevity; assuming that the target sound source and the noise component are uncorrelated, the covariance matrix of the multi-microphone noisy speech signal can be written as the sum of the speech covariance matrix and the noise covariance matrix, i.e.:

Φ_yy = ε[yy^H] = ε[xx^H] + ε[nn^H] = Φ_xx + Φ_nn (3);

In the above formula (3), Φ_xx = σ²_{X_k} h h^H represents the speech covariance matrix, where x = hX_k is the speech component vector; σ²_{X_k} = ε[|X_k|²] represents the power spectral density of the target sound source component at the reference microphone k; Φ_nn represents the noise covariance matrix; ε represents the averaging (expectation) operation; theoretically, when only a single target sound source is present, the rank of Φ_xx is 1; in practice, however, since the estimation of the speech covariance matrix depends on a finite length of observed data, the estimate of Φ_xx contains errors and its rank is not 1; the multi-microphone noisy speech signal can be divided into noise-only frames and speech-plus-noise frames by using a sound event detection method, and Φ_nn and Φ_yy are estimated in these two kinds of intervals, respectively, by using a moving average technique.
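As an illustrative sketch only, the covariance estimation described above might look as follows for a single frequency bin. The patent text does not reproduce its estimation formula here, so this sketch assumes a first-order recursive (exponentially weighted) moving average; the function name `estimate_covariances`, the smoothing constant `alpha`, and the frame mask `speech_active` (the output of some external detector) are all hypothetical.

```python
import numpy as np

def estimate_covariances(Y, speech_active, alpha=0.95):
    """Recursive-average estimates of Phi_yy and Phi_nn for one frequency bin.

    Y             : (T, M) complex STFT coefficients, one row per frame.
    speech_active : (T,) boolean mask, True for speech-plus-noise frames
                    (assumed to come from an external detector).
    alpha         : assumed smoothing constant of the moving average.
    """
    M = Y.shape[1]
    Phi_yy = np.zeros((M, M), dtype=complex)
    Phi_nn = np.zeros((M, M), dtype=complex)
    for y, active in zip(Y, speech_active):
        outer = np.outer(y, y.conj())          # instantaneous estimate y y^H
        if active:
            Phi_yy = alpha * Phi_yy + (1 - alpha) * outer
        else:
            Phi_nn = alpha * Phi_nn + (1 - alpha) * outer
    # Formula (3): Phi_yy = Phi_xx + Phi_nn, hence the speech covariance estimate.
    Phi_xx = Phi_yy - Phi_nn
    return Phi_yy, Phi_nn, Phi_xx
```

Because both estimates are built from a finite number of frames, the resulting `Phi_xx` generally has rank greater than 1, which is the estimation error that motivates the low-rank approximation in the following step.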

(II) establishing a low-rank approximation-based multi-channel wiener filter:

in this embodiment, the process of establishing the low-rank approximate multi-channel Wiener filter is described by taking the general speech distortion weighted multi-channel Wiener filter (SDW-MWF) as an example, whose design criterion is to minimize the mean square error of the target sound source plus the weighted residual noise power, that is:

min_w ε[|w^H x - X_k|²] + μ ε[|w^H n|²] (4);

In the above formula (4), w = [w_1, w_2, ..., w_M]^T represents the filter vector; μ ≥ 0 is a trade-off factor between the speech enhancement performance and the degree of speech distortion; through derivation, the expression of the multi-channel Wiener filter is obtained as follows:

w＝(Φ_xx+μΦ_nn)^-1Φ_xxe_k (5)；

in the above formula (5), e_k is the column vector that depends on the reference microphone, whose k-th element is 1 and whose other elements are 0. Obviously, when μ = 1, Φ_xx + μΦ_nn = Φ_yy and the filter reduces to w = Φ_yy^-1 Φ_xx e_k, i.e. the multi-channel Wiener filter is equivalent to the classical Wiener filter.
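Formula (5) can be evaluated directly once Φ_xx and Φ_nn are available. The following is a minimal sketch under the assumption that both matrices are given as NumPy arrays; the function name `sdw_mwf` is illustrative, and the microphone index `k` is 0-based in the code whereas the text counts from 1.

```python
import numpy as np

def sdw_mwf(Phi_xx, Phi_nn, k, mu=1.0):
    """Evaluate w = (Phi_xx + mu * Phi_nn)^-1 Phi_xx e_k, formula (5).

    k is the 0-based index of the reference microphone.
    """
    M = Phi_xx.shape[0]
    e_k = np.zeros(M)
    e_k[k] = 1.0                                # selection vector of the reference microphone
    # Solve the linear system rather than forming the matrix inverse explicitly.
    return np.linalg.solve(Phi_xx + mu * Phi_nn, Phi_xx @ e_k)
```

For a rank-1 speech covariance Φ_xx = σ² h h^H and white noise Φ_nn = I, the result is a scaled copy of h, which is the matched-filter direction expected in that special case.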

Consider the generalized eigenvalue decomposition of the matrix pair (Φ_xx, Φ_nn), with the resulting eigenvalues arranged in descending order as λ_1 ≥ λ_2 ≥ ... ≥ λ_M and the corresponding eigenvectors stored in the matrix U = [u_1, u_2, ..., u_M]; define a diagonal matrix Λ whose diagonal elements are the generalized eigenvalues; based on the generalized eigenvalue decomposition, Φ_xx and Φ_nn can be jointly diagonalized as:

U^H Φ_xx U = Λ, U^H Φ_nn U = I;

where I is the identity matrix; since Φ_yy = Φ_xx + Φ_nn, Φ_yy can be diagonalized as: U^H Φ_yy U = Λ + I.
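The joint diagonalization can be checked numerically. The sketch below is one possible way to compute the generalized eigenvalue decomposition using NumPy only, by whitening with a Cholesky factor of Φ_nn; this particular route and the function name `gevd` are assumptions made for illustration, not a procedure stated in the patent.

```python
import numpy as np

def gevd(Phi_xx, Phi_nn):
    """Generalized eigenvalue decomposition of the pair (Phi_xx, Phi_nn).

    Returns eigenvalues in descending order and a matrix U such that
    U^H Phi_xx U = diag(lam) and U^H Phi_nn U = I.
    Phi_nn must be Hermitian positive definite.
    """
    L = np.linalg.cholesky(Phi_nn)             # Phi_nn = L L^H
    L_inv = np.linalg.inv(L)
    C = L_inv @ Phi_xx @ L_inv.conj().T        # noise-whitened speech covariance
    C = (C + C.conj().T) / 2                   # enforce Hermitian symmetry numerically
    lam, V = np.linalg.eigh(C)                 # eigenvalues returned in ascending order
    order = np.argsort(lam)[::-1]              # reorder to descending
    lam, V = lam[order], V[:, order]
    U = L_inv.conj().T @ V                     # U = L^-H V
    return lam, U
```

Since U^H Φ_nn U = I, the matrix Q = U^-H used later in the derivation can also be obtained without an inverse as Q = Φ_nn U.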

Therefore, in practice the joint diagonalization can be realized through the generalized eigenvalue decomposition of Φ_yy and Φ_nn; from the diagonalization of Φ_xx, one obtains:

Φ_xx = Q Λ Q^H (6);

in the above formula (6), Q = [q_1, q_2, ..., q_M] = U^-H; it can be seen that Φ_nn = QQ^H, and

Φ_nn^-1 Φ_xx = Q^-H Λ Q^H;

thus, Q = [q_1, q_2, ..., q_M] contains the left eigenvectors of the matrix Φ_nn^-1 Φ_xx; for a single-source scenario, studies have shown that the normalized principal eigenvector corresponds to the relative acoustic transfer function of the source (i.e., the relative acoustic transfer function disclosed in J. Zhang, R. Heusdens, and R. C. Hendriks, "Relative acoustic transfer function estimation in wireless acoustic sensor networks," IEEE/ACM Trans. Audio, Speech, Language Process., 27(10): 1507-);
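The relationship between the principal eigenvector and the relative acoustic transfer function can be illustrated with a short sketch. It assumes that "normalized" means division by the entry at the reference microphone, so that the RTF equals 1 there; the function name `estimate_rtf` and the 0-based index `k` are illustrative.

```python
import numpy as np

def estimate_rtf(Phi_xx, Phi_nn, k):
    """Estimate the RTF as the principal vector q_1 normalized at microphone k (0-based)."""
    L = np.linalg.cholesky(Phi_nn)             # Phi_nn = L L^H
    L_inv = np.linalg.inv(L)
    C = L_inv @ Phi_xx @ L_inv.conj().T        # noise-whitened speech covariance
    lam, V = np.linalg.eigh((C + C.conj().T) / 2)
    u1 = L_inv.conj().T @ V[:, -1]             # eigh sorts ascending: last column is principal
    q1 = Phi_nn @ u1                           # first column of Q = U^-H = Phi_nn U
    return q1 / q1[k]                          # normalize so that the k-th entry is 1
```

With an exactly rank-1 Φ_xx = σ² h h^H, the vector q_1 is proportional to h, so the normalized result recovers h whenever h is itself normalized at microphone k.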

based on the generalized eigenvalue decomposition, Φ_xx can be approximated by using the first r eigenvalues and the corresponding eigenvectors:

Φ_xx^(r) = Σ_{i=1}^{r} λ_i q_i q_i^H (7);

substituting the rank-r matrix Φ_xx^(r) into the original SDW-MWF filter yields the Wiener filter based on the low-rank approximation, namely:

w_r = (Φ_xx^(r) + μΦ_nn)^-1 Φ_xx^(r) e_k = Σ_{i=1}^{r} [λ_i / (λ_i + μ)] u_i q_i^H e_k (8);

by selecting different ranks, the original SDW-MWF filter can be converted into beamformers such as MVDR and maxSNR (namely, the beamformers disclosed in J. R. Jensen, J. Benesty, and M. G. Christensen, "Noise reduction with optimal variable span linear filters," IEEE/ACM Trans. Audio, Speech, Language Process., 24(4): 631-644, 2016);
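A low-rank approximate filter of this kind can be sketched as follows. The closed form used here, w_r = Σ_{i≤r} [λ_i / (λ_i + μ)] u_i q_i^H e_k, follows from substituting the rank-r speech covariance into formula (5) together with U^H Φ_nn U = I; the function names `gevd` and `low_rank_mwf` and the 0-based index `k` are illustrative assumptions.

```python
import numpy as np

def gevd(Phi_xx, Phi_nn):
    """Generalized eigenvalues (descending) and U with U^H Phi_nn U = I."""
    L_inv = np.linalg.inv(np.linalg.cholesky(Phi_nn))
    C = L_inv @ Phi_xx @ L_inv.conj().T
    lam, V = np.linalg.eigh((C + C.conj().T) / 2)
    order = np.argsort(lam)[::-1]
    return lam[order], L_inv.conj().T @ V[:, order]

def low_rank_mwf(Phi_xx, Phi_nn, k, r, mu=1.0):
    """Rank-r approximate SDW-MWF for reference microphone k (0-based)."""
    lam, U = gevd(Phi_xx, Phi_nn)
    Q = Phi_nn @ U                              # Q = U^-H because U^H Phi_nn U = I
    gains = lam[:r] / (lam[:r] + mu)            # per-component Wiener-type gains
    # Q_r^H e_k is the conjugate of row k of Q restricted to the first r columns.
    return U[:, :r] @ (gains * Q[k, :r].conj())
```

With r = M the function reproduces the full-rank filter of formula (5). With r = 1 the filter is a scaled copy of u_1 for every choice of k, so only its scaling depends on the reference microphone.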

(III) evaluating the performance of the multi-channel wiener filter based on low rank approximation:

after the multi-microphone noisy speech signal passes through the obtained low-rank approximate multi-channel wiener filter, the signal-to-noise ratio of the output speech of the reference microphone k is calculated by the following formula:

in the above formula (9), the matrices A and B are calculated as follows:

thus, the output signal-to-noise ratio is:

therefore, it can be determined that the output signal-to-noise ratio depends on the reference microphone k, so the quality of the output signal can be improved by optimizing the reference microphone; it can also be shown that the output signal-to-noise ratio decreases as the rank r increases, which indicates that the output performance of the multi-channel Wiener filter is affected more strongly the less accurate the estimate of the speech covariance matrix is; prior-art schemes select the reference microphone by directly maximizing the output signal-to-noise ratio, which incurs excessive time complexity and results in long run times and poor real-time performance.
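The cost of selecting the reference microphone by directly maximizing the output signal-to-noise ratio can be illustrated with a brute-force sketch. It assumes the output SNR of a filter w is the ratio w^H Φ_xx w / w^H Φ_nn w and, for brevity, uses the full-rank filter of formula (5) rather than the low-rank one; the function names are illustrative.

```python
import numpy as np

def output_snr(w, Phi_xx, Phi_nn):
    """Ratio of filtered speech power to filtered noise power for a filter w."""
    speech_power = np.real(w.conj() @ Phi_xx @ w)
    noise_power = np.real(w.conj() @ Phi_nn @ w)
    return speech_power / noise_power

def output_snr_per_reference(Phi_xx, Phi_nn, mu=1.0):
    """Output SNR obtained with each microphone in turn as the reference."""
    # Column k of W is the filter (Phi_xx + mu * Phi_nn)^-1 Phi_xx e_k.
    W = np.linalg.solve(Phi_xx + mu * Phi_nn, Phi_xx)
    return np.array([output_snr(W[:, k], Phi_xx, Phi_nn)
                     for k in range(W.shape[1])])
```

An exhaustive search of this kind has to build and evaluate one filter per candidate microphone in every frequency bin, which is the overhead that the input-SNR criterion of the next step avoids.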

(IV) selecting a reference microphone based on maximizing the input signal-to-noise ratio:

in order to understand the relationship between the output signal-to-noise ratio and the input signal-to-noise ratio more clearly, the rank r is fixed in this step (without loss of generality, r = 2) and the influence of selecting different reference microphones on the output signal-to-noise ratio is analyzed; a two-source scenario is considered, comprising a target speaker and an interfering sound source, in which the relative acoustic transfer function of the target sound source is the normalized principal eigenvector q_1; the low-rank approximation of the speech covariance matrix with rank 2 eliminates the M-2 noise subspaces, so that the approximated speech covariance matrix contains the target sound source and a single interfering sound source, and its estimated interference component is spanned by the second eigenvector q_2; considering the cases in which microphones 1 and 2 are respectively used as the reference microphone, the input signal-to-noise ratios of the two microphones are:

in the above formula (10),

and

represent the power spectral densities of the interfering source and of the noise at the position of microphone k ∈ {1, 2}, respectively; in addition, define:

the output signal-to-noise ratio of the two microphones can be simplified as follows:

thus, the output signal-to-noise ratios of the two microphones are compared as follows:

where

is a positive number; thus, the difference between the output signal-to-noise ratios of the two microphones is proportional to the difference between their input signal-to-noise ratios; this shows that selecting the microphone with the larger input signal-to-noise ratio as the reference microphone yields a larger output signal-to-noise ratio. In summary, the reference microphone selection result based on maximizing the input signal-to-noise ratio can be obtained by finding the largest element, i.e.:

The time complexity of this method is logarithmic;
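Selecting the reference microphone by maximum input signal-to-noise ratio reduces to a search for the largest element. The sketch below assumes the input SNR at microphone k is the ratio of the k-th diagonal entries of Φ_xx and Φ_nn, with interference and noise both contained in Φ_nn; the function name `select_reference_by_input_snr` is illustrative and the returned index is 0-based.

```python
import numpy as np

def select_reference_by_input_snr(Phi_xx, Phi_nn):
    """Return the 0-based index of the microphone with the largest input SNR.

    The input SNR at microphone k is taken as [Phi_xx]_kk / [Phi_nn]_kk,
    i.e. speech power over noise power at that microphone.
    """
    input_snr = np.real(np.diag(Phi_xx)) / np.real(np.diag(Phi_nn))
    return int(np.argmax(input_snr)), input_snr
```

Unlike the criterion based on the output signal-to-noise ratio, no candidate filter has to be constructed; only the diagonals of the two covariance estimates are read.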

(V) speech enhancement signal based on wiener filtering beamforming:

after the reference microphone has been optimally selected according to step (IV), the mathematical expression of the multi-channel Wiener filter based on the low-rank approximation is determined as follows:

the specific filter coefficient vector can then be computed from the data, and an inner product is taken between the low-rank approximate multi-channel Wiener filter vector and the noisy speech signal vector y, i.e. beamforming is performed frequency bin by frequency bin, to obtain the estimate of the target sound source:

an inverse short-time Fourier transform is then applied to the estimate to recover the time-domain speech signal of the target speaker, i.e. the speech enhancement signal, which is input into the SGMM-DNN model for speech recognition to obtain the speech recognition result.
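The frequency-bin-wise inner product can be written compactly. This sketch assumes the filters are stored as an array `W` of shape (F, M), one filter per frequency bin, and the multi-microphone STFT as `Y` of shape (T, F, M); the array layout and the function name `beamform` are illustrative, and the inverse short-time Fourier transform is left to whichever STFT implementation produced `Y`.

```python
import numpy as np

def beamform(W, Y):
    """Apply w(f)^H y(t, f) in every time-frequency bin.

    W : (F, M) complex filter coefficients, one filter per frequency bin.
    Y : (T, F, M) complex multi-microphone STFT.
    Returns the (T, F) single-channel STFT of the enhanced signal.
    """
    # Conjugate the filter and sum over the microphone axis m.
    return np.einsum('fm,tfm->tf', W.conj(), Y)
```

As a sanity check, setting every filter to the selection vector e_k simply returns the STFT of the k-th microphone.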

The multi-channel Wiener filtering method based on input signal-to-noise ratio reference microphone optimization establishes a rigorous mathematical model relating the signal-to-noise ratio of the enhanced signal to the reference microphone; by analyzing the relationship between the output signal-to-noise ratio and the input signal-to-noise ratio, it selects the reference microphone based on maximizing the input signal-to-noise ratio, which effectively reduces the time complexity of reference microphone selection and improves multi-microphone speech enhancement and speech recognition performance.

The above description is only for the preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims

1. A multi-channel speech enhancement method based on reference microphone optimization, comprising:

2. The multi-channel speech enhancement method based on reference microphone optimization of claim 1, wherein in step 2, the output signal power of the low-rank approximate multi-channel wiener filter is obtained by multiplying the conjugate transpose of the low-rank approximate multi-channel wiener filter in the frequency domain by the input speech covariance matrix and then by the filter vector;

3. The multi-channel speech enhancement method based on reference microphone optimization according to claim 1 or 2, characterized in that the method further comprises: step 5, speech recognition: inputting the enhanced speech signal obtained in step 4 into a speech recognizer based on a subspace Gaussian mixture model-deep neural network for speech recognition, and transcribing and analyzing the content of the target sound source.