CN110390945B

CN110390945B - Dual-sensor voice enhancement method and implementation device

Info

Publication number: CN110390945B
Application number: CN201910678398.7A
Authority: CN
Inventors: 张军; 李�学; 宁更新; 冯义志; 余华; 季飞
Original assignee: South China University of Technology SCUT
Current assignee: South China University of Technology SCUT
Priority date: 2019-07-25
Filing date: 2019-07-25
Publication date: 2021-09-21
Anticipated expiration: 2039-07-25
Also published as: WO2021012403A1; CN110390945A

Abstract

The invention discloses a dual-sensor speech enhancement method based on dual-channel Wiener filtering, and an implementation device. Compared with the prior art, the method and the device fuse the information contained in the air conduction speech and the non-air conduction speech more fully, introduce prior knowledge of the speech signals through a statistical model, and can effectively improve the enhancement effect of a speech enhancement system in a noisy environment. The invention can be widely applied to occasions such as video calls, vehicle-mounted telephones, multimedia classrooms, and military communication.

Description

Dual-sensor voice enhancement method and implementation device

Technical Field

The invention relates to the technical field of voice signal processing, in particular to a dual-sensor voice enhancement method based on dual-channel wiener filtering and an implementation device.

Background

In actual voice communication, a voice signal is often interfered by external environmental noise, and the quality of received voice is affected. The speech enhancement technology is an important branch of speech signal processing, aims to extract pure original speech from noisy speech as far as possible, and is widely applied to the fields of speech communication, speech compression coding, speech recognition and the like in a noisy environment.

Because the human ear perceives sound through air vibration, most existing speech enhancement algorithms target air-conducted speech, that is, speech collected by an air conduction sensor such as a microphone. Their enhancement effect is strongly affected by acoustic noise in the environment, and performance is usually poor in noisy environments. To reduce the impact of ambient noise on speech quality, non-air-conduction sensors such as throat microphones and bone conduction microphones are often used for speech acquisition in noisy environments. Unlike an air conduction sensor, a non-air-conduction speech sensor uses the vibration of the speaker's vocal cords, jaw bone, and other parts to deform a reed or carbon film in the sensor, which changes its resistance and hence the voltage across it, converting the vibration signal into an electrical signal, namely the speech signal. Because sound waves travelling through the air cannot deform the reed or carbon film, the non-air-conduction sensor is unaffected by air-conducted sound and is highly resistant to acoustic noise. However, because it collects speech transmitted through the vibration of the jaw bone, muscle, skin, and other parts, the high-frequency part of the speech is severely attenuated, so the speech sounds muffled and indistinct and its intelligibility is poor.

In view of the shortcomings of air conduction and non-air conduction sensors when each is used alone, some speech enhancement methods that combine the advantages of both have been developed in recent years. These methods exploit the complementarity of air conduction speech and non-air conduction speech and employ multi-sensor fusion to achieve speech enhancement, often achieving better results than single-sensor speech enhancement systems. Existing dual-sensor speech enhancement mainly takes two forms. In the first, air conduction speech is recovered from the non-air conduction speech and then fused with the noisy air conduction speech. In the second, air conduction speech is recovered from the non-air conduction speech, the noisy air conduction speech is enhanced using the signals of both sensors, and the two results are then fused. These techniques suffer from the following disadvantages: (1) When air conduction speech is recovered from non-air conduction speech, additional noise may be introduced in the high-frequency band or in silent segments, affecting the enhancement effect. (2) When air conduction speech is recovered from non-air conduction speech, the information in the current air conduction speech is not utilized. (3) When the air conduction speech recovered from non-air conduction speech is fused with the air conduction speech, the correlation between the two and the prior knowledge about them cannot be fully utilized. (4) The non-air conduction speech and the air conduction speech are generally assumed to be independent of each other during fusion, but this assumption does not hold in practice.

Chinese patent 201610025390.7 discloses a dual-sensor speech enhancement method and apparatus based on statistical models. That invention first combines non-air conduction speech and air conduction speech to construct a joint statistical model for classification and endpoint detection, calculates the current optimal air conduction speech filter from the joint statistical model, and filters the air conduction speech to enhance it. It then converts the non-air conduction speech into air conduction speech using a mapping model from non-air conduction speech to air conduction speech, and performs a weighted fusion with the filtered, enhanced air conduction speech. This partially solves the problem that the correlation and prior knowledge of the air conduction speech and the air conduction speech recovered from the non-air conduction sensor cannot be fully utilized during fusion. However, the second fusion step still uses air conduction speech recovered from non-air conduction speech, so the method still suffers from noise in the high-frequency band and in silent segments, and still fails to use the information in the air conduction speech when recovering air conduction speech from non-air conduction speech.

Disclosure of Invention

The invention aims to overcome the above defects in the prior art and provides a dual-sensor speech enhancement method based on dual-channel Wiener filtering, and an implementation device. Compared with the prior art, the method and the device fuse the information contained in the air conduction speech and the non-air conduction speech more fully, introduce prior knowledge of the speech signals through a statistical model, and can effectively improve the enhancement effect of a speech enhancement system in a noisy environment. The invention can be widely applied to occasions such as video calls, vehicle-mounted telephones, multimedia classrooms, and military communication.

The first purpose of the invention can be achieved by adopting the following technical scheme:

a dual-sensor speech enhancement method based on dual-channel Wiener filtering comprises the following steps:

S1, synchronously collecting clean air conduction training speech and non-air conduction training speech, establishing a dual-channel speech joint classification model of air conduction speech frames and non-air conduction speech frames, and calculating, for each class in the dual-channel speech joint classification model, the air conduction speech power spectrum mean Φ_ss(ω, l), the non-air conduction speech power spectrum mean Φ_bb(ω, l), and the cross-spectrum mean Φ_bs(ω, l) between air conduction speech and non-air conduction speech, where ω is the frequency and l is the class index;

S2, synchronously collecting air conduction test speech and non-air conduction test speech, establishing a statistical model of the air conduction noise from the pure-noise segments of the air conduction test speech, and calculating the power spectrum mean Φ_vv(ω) of the air conduction noise;

S3, classifying the synchronously input air conduction test speech frames and non-air conduction test speech frames by using the statistical model of the air conduction noise and the dual-channel speech joint classification model of step S1;

S4, constructing a dual-channel Wiener filter according to the classification result of step S3 and the power spectrum mean Φ_vv(ω), and filtering the air conduction test speech frames and the non-air conduction test speech frames to obtain the enhanced air conduction speech.

Further, the step S1 is as follows:

S1.1, framing and preprocessing the synchronously collected clean air conduction training speech and non-air conduction training speech, and extracting the feature parameters of each speech frame, namely Mel-frequency cepstral coefficients;

S1.2, training the dual-channel speech joint classification model by using the clean air conduction speech and non-air conduction speech features obtained in step S1.1;

S1.3, classifying all air conduction training speech frames and non-air conduction training speech frames by using the trained dual-channel speech joint classification model, and then calculating, over the frames contained in each class, the air conduction speech power spectrum mean Φ_ss(ω, l), the non-air conduction speech power spectrum mean Φ_bb(ω, l), and the cross-spectrum mean Φ_bs(ω, l) between air conduction speech and non-air conduction speech.

Further, in step S1.2, the dual-channel speech joint classification model adopts a multiple-data-stream Gaussian Mixture Model (GMM), in which N(o, μ, σ) is a Gaussian function; o^x(k) and o^b(k) are the feature vectors extracted from the k-th frame of air conduction test speech and non-air conduction test speech, respectively; μ_l^x and μ_l^b are the means of the l-th Gaussian component of the air conduction speech data stream and the non-air conduction speech data stream in the multiple-data-stream GMM; σ_l^x and σ_l^b are the variances of the l-th Gaussian component of the air conduction speech data stream and the non-air conduction speech data stream; c_l is the weight of the l-th Gaussian component in the multiple-data-stream GMM; w_x and w_b are the weights of the air conduction speech data stream and the non-air conduction speech data stream, respectively; and L is the number of Gaussian components.
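For orientation, a multiple-data-stream GMM over the two feature streams is commonly written in the form below. This is a sketch consistent with the symbols defined in this passage, with the names μ_l^x, μ_l^b, σ_l^x and σ_l^b assumed for the per-stream means and variances; it should not be read as the exact expression of the filing.

```latex
p\bigl(o^x(k),\, o^b(k)\bigr) \;=\; \sum_{l=1}^{L} c_l\,
  \Bigl[ N\bigl(o^x(k),\, \mu_l^x,\, \sigma_l^x\bigr) \Bigr]^{w_x}
  \Bigl[ N\bigl(o^b(k),\, \mu_l^b,\, \sigma_l^b\bigr) \Bigr]^{w_b}
```

In this form the stream weights w_x and w_b act as exponents, so a larger weight lets that stream dominate the per-component likelihood.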

Further, in step S1.3, each gaussian component in the dual-channel speech joint classification model represents a classification, and for each pair of synchronous air conduction training speech frame and non-air conduction speech frame, the score of each classification is calculated by using the following formula

The current air conduction training speech frame and non-air conduction training speech frame belong to the class with the highest score. After the classes of all air conduction training speech frames and non-air conduction training speech frames have been determined, the air conduction speech power spectrum mean Φ_ss(ω, l), the non-air conduction speech power spectrum mean Φ_bb(ω, l), and the cross-spectrum mean Φ_bs(ω, l) between air conduction speech and non-air conduction speech are calculated over the frames contained in the same class.
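The scoring and hard assignment described above can be sketched as follows. The sketch assumes diagonal-covariance Gaussians and scores that are normalized to sum to one across classes (a posterior-style score); the normalization, the function names, and the dictionary layout of the model are illustrative assumptions rather than the formula of the filing.

```python
import math

def log_gauss_diag(o, mean, var):
    """Log density of a diagonal-covariance Gaussian at vector o."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (x - m) ** 2 / v)
        for x, m, v in zip(o, mean, var)
    )

def class_scores(o_x, o_b, model, w_x=1.0, w_b=1.0):
    """Return normalized scores, one per class, for a synchronous frame pair.

    model is a list of dicts with keys c, mu_x, var_x, mu_b, var_b
    (hypothetical layout: one dict per Gaussian component, i.e. per class).
    """
    logs = [
        math.log(g["c"])
        + w_x * log_gauss_diag(o_x, g["mu_x"], g["var_x"])
        + w_b * log_gauss_diag(o_b, g["mu_b"], g["var_b"])
        for g in model
    ]
    top = max(logs)  # subtract the maximum for numerical stability
    weights = [math.exp(v - top) for v in logs]
    total = sum(weights)
    return [w / total for w in weights]

def best_class(scores):
    """Index of the class with the highest score (hard assignment)."""
    return max(range(len(scores)), key=scores.__getitem__)
```

Working in the log domain avoids underflow when the feature vectors have many dimensions.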

Further, the statistical model of the air conduction noise is the power spectrum mean Φ_vv(ω) of the air conduction noise, calculated using the following method:

S2.1, synchronously acquiring air conduction test speech and non-air conduction test speech and dividing them into frames;

S2.2, from the short-time autocorrelation function R_b(m) and the short-time energy E_b of each non-air conduction test speech frame, calculating the short-time average threshold-crossing rate C_b of that frame:

where sgn[·] is the sign operation, T is the initial threshold value, M is the frame length, and the remaining coefficient in the formula is an adjustment factor; when C_b is larger than a preset threshold, the frame is judged to be speech, otherwise it is judged to be noise, and the endpoint positions of the non-air conduction test speech signal are obtained from the per-frame decisions;

S2.3, taking the time instants corresponding to the endpoints of the non-air conduction test speech signal detected in step S2.2 as the endpoints of the air conduction test speech, and extracting the pure-noise segments of the air conduction test speech;

S2.4, calculating the power spectrum mean Φ_vv(ω) of the pure-noise segments of the air conduction test speech.
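Steps S2.3 and S2.4 can be sketched as follows: given a per-frame speech/noise decision taken on the non-air conduction channel, the power spectra of the air conduction frames flagged as noise are averaged. The decision rule itself is outside this sketch, and the function names and the direct DFT are illustrative assumptions chosen to keep the example dependency-free.

```python
import cmath

def power_spectrum(frame):
    """Power spectrum |X(w)|^2 of one frame via a direct DFT (bins 0..M/2)."""
    m = len(frame)
    spec = []
    for k in range(m // 2 + 1):
        acc = sum(
            x * cmath.exp(-2j * cmath.pi * k * n / m)
            for n, x in enumerate(frame)
        )
        spec.append(abs(acc) ** 2)
    return spec

def noise_psd_mean(air_frames, is_speech):
    """Average the air conduction power spectra over frames flagged as noise.

    is_speech[k] is the speech/noise decision for the k-th frame, taken on
    the synchronous non-air conduction channel.
    """
    noise = [power_spectrum(f) for f, s in zip(air_frames, is_speech) if not s]
    if not noise:
        raise ValueError("no noise-only frames available")
    n_bins = len(noise[0])
    return [sum(p[i] for p in noise) / len(noise) for i in range(n_bins)]
```

A production implementation would normally use an FFT; the direct DFT here costs O(M^2) per frame.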

Further, in step S3, a Vector Taylor Series (VTS) model compensation technique is first adopted: the statistical model of the air conduction noise is used to correct the parameters of the air conduction speech data stream in the dual-channel speech joint classification model, and the input air conduction test speech frame and non-air conduction test speech frame are then classified. The mean of each Gaussian component of the air conduction speech data stream in the dual-channel speech joint classification model is corrected by the following formula:

where the two log-Mel mean vectors in the formula are obtained by passing the power spectra of the clean air conduction training speech belonging to the l-th class and of the noise, respectively, through a 24-dimensional Mel filter bank, taking the logarithm, and averaging; C is the DCT (discrete cosine transform) matrix. The other parameters of the dual-channel speech joint classification model are kept unchanged. The synchronously input air conduction test speech frame and non-air conduction test speech frame are classified with the corrected dual-channel speech joint classification model to obtain the classification score q(k, l) of each class for the current air conduction test speech frame and non-air conduction test speech frame.
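A widely used zeroth-order form of this kind of compensation adds speech and noise in the linear Mel-power domain and maps the result back to cepstra with the DCT. The sketch below follows that common form as an assumption; the unnormalized DCT-II, the choice of cepstral indices 0 to 11, and the function names are illustrative and may differ from the formula of the filing.

```python
import math

def dct_matrix(n_ceps, n_mel):
    """Unnormalized DCT-II matrix mapping n_mel log filter-bank values to n_ceps cepstra."""
    return [
        [math.cos(math.pi * i * (j + 0.5) / n_mel) for j in range(n_mel)]
        for i in range(n_ceps)
    ]

def compensate_mean(log_mel_speech, log_mel_noise, n_ceps=12):
    """Noisy-speech cepstral mean from clean-speech and noise log-Mel means."""
    n_mel = len(log_mel_speech)
    # Speech and noise powers add in the linear Mel domain.
    noisy = [
        math.log(math.exp(s) + math.exp(v))
        for s, v in zip(log_mel_speech, log_mel_noise)
    ]
    c = dct_matrix(n_ceps, n_mel)
    return [sum(c[i][j] * noisy[j] for j in range(n_mel)) for i in range(n_ceps)]
```

When the noise is negligible the corrected mean reduces to the DCT of the clean log-Mel mean, so the clean model is recovered.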

Further, in step S4, for the air conduction test speech and the non-air conduction test speech acquired synchronously at the kth frame, the enhanced air conduction speech spectrum is calculated by using the following formula:

wherein Y(ω, k), X(ω, k) and B(ω, k) are the spectra of the enhanced air conduction speech, the air conduction test speech, and the non-air conduction test speech of the k-th frame, respectively,

and H_a(ω, k) and H_na(ω, k) are the frequency responses of the Wiener filters applied to the k-th frame of air conduction test speech and non-air conduction test speech, calculated respectively by the following equations:

where q(k, l) is the classification score of the k-th frame of air conduction test speech and non-air conduction test speech for the l-th class of the dual-channel speech joint classification model, and H_a(ω, k, l) is the frequency response of the Wiener filter for the k-th frame of air conduction test speech corresponding to the l-th class of the dual-channel speech joint classification model, calculated as follows:

H_na(ω, k, l) is the frequency response of the Wiener filter for the k-th frame of non-air conduction test speech corresponding to the l-th class of the dual-channel speech joint classification model, calculated as follows:

further, the

And

calculated using the formula:
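One way to see how per-class filters such as H_a(ω, k, l) and H_na(ω, k, l) can be built from the class statistics and Φ_vv(ω) is the classical two-input Wiener solution. The sketch below assumes that the noisy air conduction spectrum is X = S + V, that the noise V is uncorrelated with both the clean speech S and the non-air conduction spectrum B, that the non-air conduction channel is unaffected by ambient noise, and that Φ_bs denotes E[B S*]. These assumptions and the resulting expressions are illustrative and may differ from the formulas of the filing.

```latex
\begin{aligned}
&\text{Minimize } E\,\bigl|S - H_a X - H_{na} B\bigr|^2, \qquad X = S + V .\\[2pt]
&\text{Orthogonality conditions:}\\
&\qquad \Phi_{ss} = H_a\,(\Phi_{ss} + \Phi_{vv}) + H_{na}\,\Phi_{bs}, \qquad
        \Phi_{bs}^{*} = H_a\,\Phi_{bs}^{*} + H_{na}\,\Phi_{bb} .\\[2pt]
&\Delta(\omega, l) = \bigl(\Phi_{ss}(\omega, l) + \Phi_{vv}(\omega)\bigr)\,\Phi_{bb}(\omega, l)
        - \bigl|\Phi_{bs}(\omega, l)\bigr|^{2},\\[2pt]
&H_a(\omega, k, l) = \frac{\Phi_{ss}(\omega, l)\,\Phi_{bb}(\omega, l) - \bigl|\Phi_{bs}(\omega, l)\bigr|^{2}}{\Delta(\omega, l)},
\qquad
 H_{na}(\omega, k, l) = \frac{\Phi_{vv}(\omega)\,\Phi_{bs}^{*}(\omega, l)}{\Delta(\omega, l)} .
\end{aligned}
```

When Φ_vv(ω) = 0 this gives H_a = 1 and H_na = 0, so the air conduction channel passes through unchanged; as the noise grows, weight shifts toward the non-air conduction channel.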

the other purpose of the invention is realized by the following technical scheme:

an implementation device of a dual-sensor speech enhancement method based on dual-channel wiener filtering comprises an air conduction speech sensor, a non-air conduction speech sensor, a noise model estimation module, a dual-channel speech joint classification model, a model compensation module, a frame classification module, a filter coefficient generation module and a dual-channel filter, wherein,

the air conduction voice sensor and the non-air conduction voice sensor are respectively connected with the noise model estimation module, the frame classification module and the dual-channel filter; the dual-channel speech joint classification model, the model compensation module, the frame classification module, the filter coefficient generation module and the dual-channel filter are sequentially connected, the noise model estimation module is connected with the model compensation module and the filter coefficient generation module, and the dual-channel speech joint classification model is connected with the filter coefficient generation module;

the air conduction speech sensor and the non-air conduction speech sensor are used to collect air conduction speech signals and non-air conduction speech signals, respectively; the noise model estimation module estimates the model and the power spectrum of the current air conduction noise; the dual-channel speech joint classification model is established over air conduction speech frames and non-air conduction speech frames from synchronously collected clean air conduction training speech and non-air conduction training speech, and for each class in the model the air conduction speech power spectrum mean is Φ_ss(ω, l), the non-air conduction speech power spectrum mean is Φ_bb(ω, l), and the cross-spectrum mean between air conduction speech and non-air conduction speech is Φ_bs(ω, l); the model compensation module uses the statistical model of the air conduction noise to correct the parameters of the dual-channel speech joint classification model; the frame classification module classifies the currently synchronously input air conduction test speech frame and non-air conduction test speech frame; the filter coefficient generation module constructs the dual-channel Wiener filter according to the classification result and the power spectrum of the air conduction noise; and the dual-channel filter filters the air conduction test speech frame and the non-air conduction test speech frame to obtain the enhanced air conduction speech.

Further, the air conduction voice sensor is a microphone, and the non-air conduction voice sensor is a throat microphone.

Compared with the prior art, the invention has the following advantages and effects:

(1) compared with the voice enhancement technology only based on the air conduction test voice or the non-air conduction test voice, the method and the device have the advantages that the information of the air conduction test voice and the non-air conduction test voice is simultaneously utilized during enhancement, and a better enhancement effect can be achieved.

(2) The invention adopts the dual-channel speech joint classification model to fuse the information of the air conduction test speech and the non-air conduction test speech, can make the frame classification more accurate, and fully utilizes the correlation and the prior knowledge of the two.

(3) Compared with Chinese patent 201610025390.7, the invention recovers the air conduction speech with a dual-channel Wiener filter, which is simpler to compute, avoids both the high-frequency and silent-segment noise and the failure to use air conduction speech information that arise when air conduction speech is recovered from non-air conduction speech, and performs better.

(4) The invention adopts the two-channel wiener filter to recover the air conduction voice, and avoids the assumption that the non-air conduction voice and the air conduction voice are mutually independent.

Drawings

FIG. 1 is a block diagram of an apparatus for implementing a dual-channel wiener filtering-based dual-sensor speech enhancement method disclosed in the embodiments of the present invention;

FIG. 2 is a flowchart of a dual-channel wiener filtering-based dual-sensor speech enhancement method disclosed in the embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Example one

The embodiment discloses a structural block diagram of an implementation device of a dual-sensor voice enhancement method based on dual-channel wiener filtering, as shown in fig. 1, the device comprises an air conduction voice sensor, a non-air conduction voice sensor, a noise model estimation module, a dual-channel voice joint classification model, a model compensation module, a frame classification module, a filter coefficient generation module and a dual-channel filter, the air conduction voice sensor and the non-air conduction voice sensor are respectively connected with the noise model estimation module, the frame classification module and the dual-channel filter, the dual-channel voice combined classification model, the model compensation module, the frame classification module, the filter coefficient generation module and the dual-channel filter are sequentially connected, the noise model estimation module is connected with the model compensation module and the filter coefficient generation module, and the dual-channel voice combined classification model is connected with the filter coefficient generation module.

In this embodiment, the air conduction speech sensor is a microphone and the non-air conduction speech sensor is a throat microphone; they are used to collect air conduction speech signals and non-air conduction speech signals, respectively. The noise model estimation module estimates the model and the power spectrum of the current air conduction noise. The dual-channel speech joint classification model is established over air conduction speech frames and non-air conduction speech frames from synchronously collected clean air conduction training speech and non-air conduction training speech; for each class in the model, the air conduction speech power spectrum mean is Φ_ss(ω, l), the non-air conduction speech power spectrum mean is Φ_bb(ω, l), and the cross-spectrum mean between air conduction speech and non-air conduction speech is Φ_bs(ω, l). The model compensation module corrects the parameters of the dual-channel speech joint classification model by using the statistical model of the air conduction noise. The frame classification module classifies the currently synchronously input air conduction test speech frame and non-air conduction test speech frame. The filter coefficient generation module constructs a dual-channel Wiener filter according to the classification result and the power spectrum of the air conduction noise. The dual-channel filter filters the air conduction test speech frame and the non-air conduction test speech frame to obtain the enhanced air conduction speech.

Example two

This embodiment discloses a dual-sensor speech enhancement method based on dual-channel Wiener filtering. Using the implementation device disclosed in Example one, the enhanced air conduction speech is calculated from the input air conduction test speech and non-air conduction test speech by the following steps, and the flow is shown in fig. 2:

Step S1, synchronously collecting clean air conduction training speech and non-air conduction training speech, establishing a dual-channel speech joint classification model of air conduction speech frames and non-air conduction speech frames, and calculating, for each class in the dual-channel speech joint classification model, the air conduction speech power spectrum mean Φ_ss(ω, l), the non-air conduction speech power spectrum mean Φ_bb(ω, l), and the cross-spectrum mean Φ_bs(ω, l) between air conduction speech and non-air conduction speech, where ω is the frequency and l is the class index.

The following steps are adopted in the embodiment to complete the process:

s1.1, framing and preprocessing clean air conduction training voice and non-air conduction training voice which are synchronously collected, and extracting characteristic parameters of each frame of voice.

In this embodiment, the synchronously acquired clean air conduction training speech and non-air conduction training speech are divided into frames with a frame length of 30 ms and a frame shift of 10 ms. Each frame of clean air conduction training speech and non-air conduction training speech is windowed with a Hamming window and pre-emphasized, and its power spectrum is then computed. The power spectra of the air conduction training speech and the non-air conduction training speech are each passed through a 24-dimensional Mel filter bank, the logarithm of the filter bank outputs is taken, and a DCT is applied to obtain two sets of 12-dimensional Mel-frequency cepstral coefficients, which are used as the training features of the dual-channel speech joint classification model.
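The framing stage with a 30 ms frame length and 10 ms frame shift can be sketched as follows. The sampling rate of 8000 Hz used in the usage example, the pre-emphasis coefficient 0.97, and applying pre-emphasis before windowing are assumptions made for illustration; the text does not specify them.

```python
import math

def preemphasize(x, coeff=0.97):
    """First-order pre-emphasis y[n] = x[n] - coeff * x[n-1] (coeff is assumed)."""
    return [x[0]] + [x[n] - coeff * x[n - 1] for n in range(1, len(x))]

def hamming(m):
    """Hamming window of length m."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (m - 1)) for n in range(m)]

def frame_signal(x, fs, frame_ms=30, shift_ms=10):
    """Split x into overlapping Hamming-windowed frames."""
    m = fs * frame_ms // 1000      # samples per frame
    hop = fs * shift_ms // 1000    # samples per frame shift
    win = hamming(m)
    return [
        [x[s + n] * win[n] for n in range(m)]
        for s in range(0, len(x) - m + 1, hop)
    ]
```

For example, at an assumed 8000 Hz sampling rate, `frame_signal(preemphasize(x), 8000)` yields frames of 240 samples advancing by 80 samples; the same call would be applied to both channels so that frame k of each channel covers the same time span.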

S1.2, training the dual-channel speech joint classification model by using the clean air conduction speech and non-air conduction speech features obtained in step S1.1. In this embodiment, the dual-channel speech joint classification model uses the multiple-data-stream GMM defined above. The parameters of the dual-channel speech joint classification model, namely c_l, w_x, w_b, and the means and variances of the Gaussian components of the two data streams, are estimated with the Expectation-Maximization (EM) algorithm.

S1.3, classifying all air conduction training speech frames and non-air conduction training speech frames by using the trained dual-channel speech joint classification model, and then calculating, over the frames contained in each class, the air conduction speech power spectrum mean Φ_ss(ω, l), the non-air conduction speech power spectrum mean Φ_bb(ω, l), and the cross-spectrum mean Φ_bs(ω, l) between air conduction speech and non-air conduction speech.

In this embodiment, each gaussian component in the dual-channel speech joint classification model represents a classification, and for each pair of synchronous air conduction training speech frame and non-air conduction speech frame, the score of each classification is calculated by using the following formula

The current air conduction training speech frame and non-air conduction training speech frame belong to the class with the highest score. After the classes of all air conduction training speech frames and non-air conduction training speech frames have been determined, the air conduction speech power spectrum mean Φ_ss(ω, l), the non-air conduction speech power spectrum mean Φ_bb(ω, l), and the cross-spectrum mean Φ_bs(ω, l) between air conduction speech and non-air conduction speech are calculated over the frames contained in the same class.
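Accumulating the per-class statistics can be sketched as follows, given the complex spectra of each synchronous frame pair and the class index assigned to it. Defining the cross-spectrum as B multiplied by the conjugate of S is a convention assumed here, and the function name and argument layout are illustrative.

```python
def class_spectral_means(S, B, labels, n_classes):
    """Per-class means of |S|^2, |B|^2 and the cross-spectrum B * conj(S).

    S[k][w], B[k][w]: complex spectra of frame k at frequency bin w.
    labels[k]: class index assigned to frame pair k.
    Returns three lists indexed [class][bin]; empty classes stay at zero.
    """
    n_bins = len(S[0])
    phi_ss = [[0.0] * n_bins for _ in range(n_classes)]
    phi_bb = [[0.0] * n_bins for _ in range(n_classes)]
    phi_bs = [[0j] * n_bins for _ in range(n_classes)]
    counts = [0] * n_classes
    for s_k, b_k, l in zip(S, B, labels):
        counts[l] += 1
        for w in range(n_bins):
            phi_ss[l][w] += abs(s_k[w]) ** 2
            phi_bb[l][w] += abs(b_k[w]) ** 2
            phi_bs[l][w] += b_k[w] * s_k[w].conjugate()
    for l, n in enumerate(counts):
        if n:  # divide the sums by the number of frames in the class
            phi_ss[l] = [v / n for v in phi_ss[l]]
            phi_bb[l] = [v / n for v in phi_bb[l]]
            phi_bs[l] = [v / n for v in phi_bs[l]]
    return phi_ss, phi_bb, phi_bs
```

Keeping the cross-spectrum complex preserves the phase relation between the two channels, which a two-channel filter needs in order to combine them coherently.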

Step S2, synchronously collecting air conduction test speech and non-air conduction test speech, establishing a statistical model of the air conduction noise from the pure-noise segments of the air conduction test speech, and calculating the power spectrum mean Φ_vv(ω) of the air conduction noise.

In this embodiment, the statistical model of the air conduction noise is the power spectrum mean Φ_vv(ω) of the air conduction noise, calculated using the following method:

where sgn[·] is the sign operation, T is the initial threshold value, M is the frame length, and the remaining coefficient in the formula is an adjustment factor. When C_b is larger than a preset threshold, the frame is judged to be speech; otherwise it is judged to be noise. The endpoint positions of the non-air conduction test speech signal are obtained from the per-frame decisions.

The statistical model of the air conduction noise may also be a Gaussian function, a GMM, or an HMM.

Step S3, classifying the synchronously input air conduction test speech frames and non-air conduction test speech frames by using the statistical model of the air conduction noise and the dual-channel speech joint classification model of step S1.

In this embodiment, a VTS model compensation technique is first adopted: the statistical model of the air conduction noise is used to correct the parameters of the air conduction speech data stream in the dual-channel speech joint classification model, and the input air conduction test speech frame and non-air conduction test speech frame are then classified. Specifically, the mean of each Gaussian component of the air conduction speech data stream in the dual-channel speech joint classification model is corrected by the following formula:

where the two log-Mel mean vectors in the formula are obtained by passing the power spectra of the clean air conduction training speech belonging to the l-th class and of the noise, respectively, through a 24-dimensional Mel filter bank, taking the logarithm, and averaging; C is the Discrete Cosine Transform (DCT) matrix. The other parameters of the dual-channel speech joint classification model remain unchanged. The synchronously input air conduction test speech frame and non-air conduction test speech frame are classified with the corrected dual-channel speech joint classification model to obtain the classification score q(k, l) of each class for the current air conduction test speech frame and non-air conduction test speech frame.

Step S4, constructing a dual-channel Wiener filter according to the classification result of step S3 and Φ_vv(ω), and filtering the air conduction test speech frame and the non-air conduction test speech frame to obtain the enhanced air conduction speech.

In this embodiment, for the air conduction test voice and the non-air conduction test voice acquired synchronously at the kth frame, the enhanced air conduction voice spectrum is calculated by using the following formula:

wherein Y(ω, k), X(ω, k) and B(ω, k) are the spectra of the enhanced air conduction speech, the air conduction test speech, and the non-air conduction test speech of the k-th frame, respectively,

and q(k, l) in the formula is the classification score of the k-th frame of air conduction test speech and non-air conduction test speech for the l-th class of the dual-channel speech joint classification model. H_a(ω, k, l) is the frequency response of the Wiener filter for the k-th frame of air conduction test speech corresponding to the l-th class of the dual-channel speech joint classification model, calculated as follows:

in another embodiment, the above

And

calculated using the formula:
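The filtering of step S4 can be sketched end to end as follows: per-class filters are formed from the class statistics and the noise power spectrum, mixed with the classification scores q(k, l), and applied to both channels. The per-class filter expressions are the classical two-input Wiener solution under the assumptions that the air conduction noise is uncorrelated with the speech, that the non-air conduction channel is noise-free, and that the cross-spectrum is defined as B times the conjugate of S. They are illustrative and not necessarily the formulas of the filing; all function and argument names are hypothetical.

```python
def wiener_pair(phi_ss, phi_bb, phi_bs, phi_vv, eps=1e-12):
    """Filters (H_a, H_na) for one class and one bin under the stated assumptions."""
    cross = abs(phi_bs) ** 2
    det = max((phi_ss + phi_vv) * phi_bb - cross, eps)  # guard against division by zero
    h_a = (phi_ss * phi_bb - cross) / det
    h_na = phi_vv * phi_bs.conjugate() / det
    return h_a, h_na

def enhance_frame(X, B, q, phi_ss, phi_bb, phi_bs, phi_vv):
    """Enhanced spectrum Y for one frame.

    X, B: complex spectra of the air and non-air conduction test frames.
    q: classification scores of this frame, one per class.
    phi_ss, phi_bb, phi_bs: class statistics indexed [class][bin].
    phi_vv: noise power spectrum mean indexed [bin].
    """
    Y = []
    for w in range(len(X)):
        h_a = h_na = 0.0
        for l, q_l in enumerate(q):
            a, n = wiener_pair(phi_ss[l][w], phi_bb[l][w], phi_bs[l][w], phi_vv[w])
            h_a += q_l * a      # score-weighted mix of per-class filters
            h_na += q_l * n
        Y.append(h_a * X[w] + h_na * B[w])
    return Y
```

With zero noise power the sketch returns the air conduction spectrum unchanged, and when the two channels are fully coherent it reconstructs the speech from the non-air conduction channel alone.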

the above embodiments are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and all such changes, modifications, substitutions, combinations, and simplifications are intended to be included in the scope of the present invention.

Claims

1. A dual-sensor speech enhancement method based on dual-channel Wiener filtering, characterized by comprising the following steps:

S1, synchronously collecting clean air conduction training speech and non-air conduction training speech, establishing a dual-channel speech joint classification model of air conduction speech frames and non-air conduction speech frames, and calculating, for each class in the dual-channel speech joint classification model, the air conduction speech power spectrum mean Φ_ss(ω, l), the non-air conduction speech power spectrum mean Φ_bb(ω, l), and the cross-spectrum mean Φ_bs(ω, l) between the air conduction speech and the non-air conduction speech, where ω is the frequency and l is the class index;

2. The dual-sensor speech enhancement method of claim 1, wherein the step S1 is performed as follows:

S1.1, framing and preprocessing the synchronously collected clean air conduction training speech and non-air conduction training speech, and extracting the feature parameters of each speech frame, wherein the feature parameters are Mel-frequency cepstral coefficients (MFCCs);

S1.3, classifying all air conduction training speech frames and non-air conduction speech frames using the trained dual-channel speech joint classification model, and then calculating, over the air conduction training speech frames and non-air conduction speech frames contained in each class, the air conduction speech power spectrum mean Φ_ss(ω, l), the non-air conduction speech power spectrum mean Φ_bb(ω, l), and the cross-spectrum mean Φ_bs(ω, l) between the air conduction speech and the non-air conduction speech.

3. The dual-sensor speech enhancement method of claim 2, wherein in step S1.2, the dual-channel speech joint classification model uses a multi-stream GMM, where GMM denotes a Gaussian mixture model

and

and

4. The dual-sensor speech enhancement method of claim 3 wherein in step S1.3, each Gaussian component in the dual-channel speech joint classification model represents a class, and for each pair of synchronous air conduction training speech frames and non-air conduction speech frames, the score for each class is calculated using the following equation

wherein the current air conduction training speech frame and non-air conduction speech frame belong to the class with the highest score; after the classes of all air conduction training speech frames and non-air conduction speech frames have been determined, the air conduction speech power spectrum mean Φ_ss(ω, l), the non-air conduction speech power spectrum mean Φ_bb(ω, l), and the cross-spectrum mean Φ_bs(ω, l) between the air conduction speech and the non-air conduction speech are calculated over the air conduction training speech frames and non-air conduction speech frames contained in the same class.

5. The dual-sensor speech enhancement method of claim 1, wherein the statistical model of the air conduction noise is the power spectrum mean Φ_vv(ω) of the air conduction noise, calculated using the following method:

where sgn[·] denotes the sign operation,

6. The dual-sensor speech enhancement method of claim 1, wherein in step S3, a vector Taylor series model compensation technique is first applied, in which the statistical model of the air conduction noise is used to correct the parameters of the air conduction speech data stream in the dual-channel speech joint classification model, and the input air conduction test speech frame and non-air conduction test speech frame are then classified, wherein the mean of each Gaussian component of the air conduction speech data stream in the dual-channel speech joint classification model is corrected by the following formula:

wherein

and

are, respectively, the means obtained by passing the power spectra of the clean air conduction training speech and of the noise belonging to the l-th class through a 24-dimensional Mel filter bank and taking the logarithm; C is the discrete cosine transform matrix; other parameters in the dual-channel speech joint classification model are kept unchanged; and the corrected dual-channel speech joint classification model is used to classify the synchronously input air conduction test speech frame and non-air conduction test speech frame, to obtain the classification score q(k, l) of the current air conduction test speech frame and non-air conduction test speech frame for each class.

7. The dual-sensor speech enhancement method of claim 2, wherein in step S4, for the k-th frame of synchronously acquired air conduction test speech and non-air conduction test speech, the spectrum of the enhanced air conduction speech is calculated by using the following formula:

8. The dual-sensor speech enhancement method of claim 7, wherein

and

are calculated using the following formula:

9. An implementation device of a dual-sensor speech enhancement method based on dual-channel Wiener filtering, characterized by comprising an air conduction speech sensor, a non-air conduction speech sensor, a noise model estimation module, a dual-channel speech joint classification model, a model compensation module, a frame classification module, a filter coefficient generation module and a dual-channel filter, wherein,

the air conduction speech sensor and the non-air conduction speech sensor are used to collect air conduction speech signals and non-air conduction speech signals, respectively; the noise model estimation module is used to estimate the statistical model and the power spectrum of the current air conduction noise; the dual-channel speech joint classification model is a classification model of air conduction speech frames and non-air conduction speech frames established from synchronously collected clean air conduction training speech and non-air conduction training speech, in which, for each class, the air conduction speech power spectrum mean is Φ_ss(ω, l), the non-air conduction speech power spectrum mean is Φ_bb(ω, l), and the cross-spectrum mean between the air conduction speech and the non-air conduction speech is Φ_bs(ω, l); the model compensation module uses the statistical model of the air conduction noise to correct the parameters of the dual-channel speech joint classification model; the frame classification module classifies the currently and synchronously input air conduction test speech frame and non-air conduction test speech frame; the filter coefficient generation module constructs a dual-channel Wiener filter according to the classification result and the power spectrum of the air conduction noise; and the dual-channel filter filters the air conduction test speech frame and the non-air conduction test speech frame to obtain the enhanced air conduction speech.

10. The apparatus for implementing a dual-sensor speech enhancement method according to claim 9, wherein said air conduction speech sensor is a microphone and said non-air conduction speech sensor is a throat microphone.