WO2021012403A1

WO2021012403A1 - Dual sensor speech enhancement method and implementation device

Info

Publication number: WO2021012403A1
Application number: PCT/CN2019/110290
Authority: WO
Inventors: 张军; 李�学; 宁更新; 冯义志; 余华; 季飞
Original assignee: 华南理工大学
Priority date: 2019-07-25
Filing date: 2019-10-10
Publication date: 2021-01-28
Also published as: CN110390945A; CN110390945B

Abstract

A dual sensor speech enhancement method and an implementation device based on two-channel Wiener filtering. The method comprises: firstly, building a dual channel speech association classification model for performing frame classification on dual channel input signals of an air conduction sensor and a non-air conduction sensor by using the complementarity between air conduction speech and non-air conduction speech; secondly, classifying speech frames collected by dual channels by using the model; and finally, constructing a dual channel Wiener filter according to the classification result, and performing filtering enhancement on the speech signals collected by the dual channels. Information comprised in the air conduction speech and the non-air conduction speech is more fully integrated, and the prior knowledge of the speech signals is introduced by means of a statistic model, so that the enhancement effect of a speech enhancement system in a noise environment can be effectively improved. The dual sensor speech enhancement method can be widely applied to a plurality of occasions such as video call, vehicle call, multimedia classrooms and military communication.

Description

Double-sensor voice enhancement method and implementation device

Technical field

The invention relates to the technical field of speech signal processing, in particular to a dual-sensor speech enhancement method and implementation device based on dual-channel Wiener filtering.

Background technique

In actual voice communication, the voice signal is often interfered by external environmental noise, which affects the quality of the received voice. Speech enhancement technology is an important branch of speech signal processing. The purpose is to extract as much pure original speech as possible from noisy speech, and it is widely used in speech communication, speech compression coding and speech recognition in noisy environments.

Since human ears perceive sound through the vibration of the air, most of the current speech enhancement algorithms are aimed at air conduction (air conduction for short) speech, that is, the use of air conduction sensors (such as microphones) to collect speech, and the enhancement effect is affected. The influence of various acoustic noises in the environment is great, and the performance is usually poor in a noisy environment. In order to reduce the impact of environmental noise on voice quality, non-air conduction (abbreviated as non-air conduction) sensors such as throat microphones and bone conduction microphones are often used for voice collection in noisy environments. Different from air conduction sensors, non-air conduction voice sensors use the vibration of the speaker’s vocal cords, jawbone and other parts to drive the reed or carbon film in the sensor to change, change its resistance value, and change the voltage at both ends. In this way, the vibration signal is converted into an electrical signal, that is, a voice signal. Since the sound waves conducted in the air cannot deform the reed or carbon film of the non-air-conducting sensor, the non-air-conducting sensor is not affected by the air-conducting sound and has a strong ability to resist acoustic noise. However, because the non-air conduction sensor collects the voice transmitted through the vibration of the jawbone, muscle, skin and other parts, the high frequency part is seriously lost, which is manifested as the sound is muffled and ambiguous, and the voice intelligibility is poor.

In view of the certain shortcomings of air conduction and non-air conduction sensors when applied separately, some speech enhancement methods combining the advantages of both have appeared in recent years. These methods take advantage of the complementarity of air-guided speech and non-air-guided speech, and use multi-sensor fusion technology to achieve speech enhancement, which usually achieves better results than single-sensor speech enhancement systems. There are mainly two ways to enhance the existing dual-sensor speech. One is to restore air-conducted speech from non-air-conducted speech, and then merge with noisy air-conducted speech; the other is to restore air conduction from non-air-conducted speech Voice, and use the air conduction sensor and non-air conduction sensor signals to enhance the noisy air conduction speech, and then the two are fused. These technologies have the following shortcomings: (1) When using non-air-conducted speech to restore air-conducted speech, additional noise will be introduced in the high frequency or silence, which will affect the enhancement effect. (2) When using non-air conduction speech to restore air conduction speech, the information of current air conduction speech cannot be used. (3) When the air-guided speech and air-guided speech recovered by non-air-guided speech are fused, the correlation and prior knowledge of the two cannot be fully utilized. (4) It is usually assumed that non-air-guided speech and air-guided speech are independent of each other during fusion, but this assumption does not hold true in practice.

Chinese invention patent 201610025390.7 discloses a dual-sensor speech enhancement method and device based on a statistical model. The invention first combines non-air-conducted speech and air-conducted speech to construct a joint statistical model for classification and endpoint detection. Calculate the current best air conduction speech filter, filter and enhance the air conduction speech, and then use the non-air conduction speech to air conduction speech mapping model to convert the non-air conduction speech into the air conduction speech, and combine it with the filtered and enhanced speech The weighted fusion of air conduction speech partially solves the problem of the lack of full use of the correlation and prior knowledge of the air conduction speech recovered by the non-air conduction sensor and the air conduction speech fusion, but it is still used in the second step of fusion The air-conducted speech recovered to the non-air-conducted speech, so there are also deficiencies such as high-frequency and silent noise, and the use of non-air-conducted speech to recover the air-conducted speech.

Summary of the invention

The purpose of the present invention is to solve the above-mentioned defects in the prior art and provide a dual-sensor speech enhancement method and implementation device based on dual-channel Wiener filtering. The method first utilizes the complementarity between air-conducted speech and non-air-conducted speech It establishes a dual-channel voice joint classification model for frame classification of dual-channel input signals of air conduction sensors and non-air conduction sensors, and uses this model to classify the voice frames collected by dual channels, and finally constructs a dual-channel dimension based on the classification results Nano filter to filter and enhance the voice signal collected by dual channels. Compared with the prior art, the present invention more fully integrates the information contained in air-guided speech and non-air-guided speech, and introduces prior knowledge of the speech signal through a statistical model, which can effectively improve the performance of the speech enhancement system in a noisy environment. Enhancement. The invention can be widely used in various occasions such as video calls, car phones, multimedia classrooms, and military communications.

The first objective of the present invention can be achieved by adopting the following technical solutions:

A dual-sensor voice enhancement method based on dual-channel Wiener filtering. The dual-sensor voice enhancement method includes the following steps:

S1. Collect clean air conduction training speech and non-air conduction training speech simultaneously, establish a two-channel speech joint classification model of air conduction speech frame and non-air conduction speech frame, and calculate corresponding to each of the above two-channel speech joint classification models The average power spectrum of classified air-guided speech Φ _ss (ω,l), the average power spectrum of non-air-guided speech Φ _bb (ω,l), the average cross-spectrum between air-guided speech and non-air-guided speech Φ _bs (ω, l), where ω is the frequency and l is the serial number of the classification;

S2. Collect air conduction test speech and non-air conduction test speech simultaneously, use the pure noise section of air conduction test speech to establish a statistical model of air conduction noise, and calculate the mean value of the power spectrum of air conduction noise Φ _vv (ω);

S3. Use the statistical model of air conduction noise and the dual-channel speech joint classification model in step S1 to classify the synchronized input air conduction test speech frames and non-air conduction test speech frames;

S4. Construct a two-channel Wiener filter according to the classification result of step S3 and the power spectrum mean value Φ _vv (ω), and filter the air conduction test speech frame and the non-air conduction test speech frame to obtain an enhanced air conduction speech.

Further, the process of step S1 is as follows:

S1.1. Perform framing and preprocessing of the synchronously collected clean air conduction training speech and non-air conduction training speech, and extract the characteristic parameter of each frame of speech—inverted mel spectrum coefficient;

S1.2. Use the air-guided speech and non-air-guided speech features obtained in step S1.1 to train a dual-channel speech joint classification model;

S1.3. Use the trained dual-channel speech joint classification model to classify all air conduction training speech frames and non-air conduction speech frames, and then calculate the air conduction training speech frames and non-air conduction speech frames contained in each classification. Air-guided speech power spectrum mean Φ _ss (ω,l), non-air-guided speech power spectrum mean Φ _bb (ω,l), cross-spectrum mean between air-guided speech and non-air-guided speech Φ _bs (ω,l) .

Further, in the step S1.2, the dual-channel voice joint classification model adopts a multi-data stream Gaussian Mixture Model (GMM), namely

Where N (o, μ, σ) is a Gaussian function, o ^x (k) and o ^b (k) for the k-th frame and the non-voice test air conduction air conduction test speech feature vectors extracted,

with

Is the mean value of the l Gaussian component of the air-conducted voice data stream and the non-air-conducted voice data stream in the multi-data stream GMM,

with

Is the variance of the l-th Gaussian component of the air-conducted voice data stream and the non-air-conducted voice data stream in the multi-stream GMM, c _l is the weight of the l-th Gaussian component in the multi-stream GMM, w _x and w _b are respectively The weight of the air-conducted voice data stream and the non-air-conducted voice data stream in the data stream GMM, L is the number of Gaussian components.

Further, in the step S1.3, each Gaussian component in the dual-channel speech joint classification model represents a category, and for each pair of synchronized air conduction training speech frames and non-air conduction speech frames, the following formula is used to calculate Its score for each category

The current air conduction training speech frames and non-air conduction speech frames belong to the category with the highest score; calculate the classification to which all air conduction training speech frames and non-air conduction speech frames belong, and then calculate the air conduction training speech frames and sums contained in the same category The mean value of the air-guided speech power spectrum of the non-air-guided speech frame Φ _ss (ω,l), the mean value of the power spectrum of the non-air-guided speech Φ _bb (ω,l), the cross-spectrum mean value between the air-guided speech and the non-air-guided speech Φ _bs (ω,l).

Further, the statistical model of air conduction noise is the mean value of the power spectrum of air conduction noise Φ _vv (ω), which is calculated by the following method:

S2.1. Collect air conduction test speech and non-air conduction test speech simultaneously and divide them into frames;

S2.2. According to the short-term autocorrelation function R _b (m) and short-term energy E _b of the non-air conduction detection speech frame, calculate the short-term average threshold crossing rate C _{b of} each non-air conduction detection speech frame:

Where sgn[·] is a symbolic operation,

Is the adjustment factor, T is the threshold initial value, and M is the frame length. When C _{b is} greater than the preset threshold, the frame is judged to be a speech signal, otherwise it is noise. According to the judgment result of each frame, the non-air conduction detection speech is obtained The end position of the signal;

S2.3. Use the time corresponding to the endpoint of the non-air conduction test voice signal detected in step S2.2 as the endpoint of the air conduction test voice, and extract the pure noise segment in the air conduction test voice;

S2.4. Calculate the mean value Φ _vv (ω) of the power spectrum of the pure noise signal in the air conduction test speech.

Further, in the step S3, the vector Taylor series model (Vector Taylor Seties, VTS) compensation technology is first used, and the air conduction noise statistical model is used to correct the parameters of the air conduction speech data stream in the dual-channel speech joint classification model. , And then classify the input air conduction test speech frames and non-air conduction test speech frames. The following formula is used to modify the average value of each Gaussian component of the air conduction speech data stream in the dual-channel speech joint classification model:

among them

with

The power spectra of clean air conduction training speech and noise belonging to the l-th class respectively pass through the 24-dimensional mel filter bank and take the logarithm of the mean value. C is the DCT transformation matrix. Others in the dual-channel speech joint classification model The parameters remain unchanged, and the revised dual-channel speech joint classification model is used to classify the synchronized input air conduction test speech frames and non-air conduction test speech frames, and the current air conduction test speech frames and non-air conduction test speech frames correspond to The classification score of each category is q(k,l).

Further, in the step S4, for the air conduction test speech and non-air conduction test speech synchronously collected in the kth frame, the following formula is used to calculate the enhanced air conduction speech spectrum:

Among them, Y(ω,k), X(ω,k), B(ω,k) are the frequency spectrums of the enhanced air conduction speech, air conduction test speech and non-air conduction test speech at the k-th frame, respectively.

In order to correspond to the frequency response of the Wiener filter filter of the k-th air conduction test speech and non-air conduction test speech, the following formulas are used to calculate

Where q(k,l) is the classification score of the k-th frame air conduction test speech and non-air conduction test speech corresponding to the first category of the dual-channel speech joint classification model, and H _a (ω,k,l) is the k-th frame The air conduction test speech corresponds to the Wiener filter frequency response of the first category of the dual-channel speech joint classification model. The calculation method is:

H _na (ω,k,l) is the Wiener filter frequency response of the k-th frame of non-air conduction test speech corresponding to the first category of the dual-channel speech joint classification model. The calculation method is:

Further, said

with

Use the following formula:

l=arg max q(k,l).

Another objective of the present invention is achieved through the following technical solutions:

A device for implementing a dual-sensor speech enhancement method based on dual-channel Wiener filtering. The device includes an air-conducted speech sensor, a non-air-conducted speech sensor, a noise model estimation module, a dual-channel speech joint classification model, and a model compensation module , Frame classification module, filter coefficient generation module and dual-channel filter, among which,

The air-conducted speech sensor and the non-air-conducted speech sensor are respectively connected to the noise model estimation module, the frame classification module, and the dual-channel filter; the dual-channel speech joint classification model, the model compensation module, and the frame classification module , The filter coefficient generation module and the dual-channel filter are connected in sequence, the noise model estimation module is connected with the model compensation module and the filter coefficient generation module, and the dual-channel speech joint classification model is connected with the filter coefficient generation module ；

The air conduction speech sensor and the non-air conduction speech sensor are used to collect air conduction and non-air conduction speech signals, respectively, and the noise model estimation module is used to estimate the current air conduction noise model and power spectrum. The channel voice joint classification model uses synchronously collected clean air conduction training speech and non-air conduction training speech to establish air conduction speech frames and non-air conduction speech frames, and the air conduction speech power of each category in the dual-channel speech joint classification model The mean value of the spectrum is Φ _ss (ω,l), the mean value of the power spectrum of non-air-guided speech is Φ _bb (ω,l), the mean value of the cross-spectrum between air-guided speech and non-air-guided speech is Φ _bs (ω,l), The model compensation module uses the statistical model of air conduction noise to correct the parameters of the dual-channel speech joint classification model, and the frame classification module classifies the currently synchronized input air conduction test speech and non-air conduction test speech frames, The filter coefficient generation module constructs a dual-channel Wiener filter based on the classification result and the power spectrum of air conduction noise. The dual-channel filter filters air conduction test speech frames and non-air conduction test speech frames to be enhanced After the air conduction voice.

Further, the air-conducted speech sensor is a microphone, and the non-air-conducted speech sensor is a throat microphone.

Compared with the prior art, the present invention has the following advantages and effects:

(1) Compared with the speech enhancement technology based only on air conduction test speech or non-air conduction test speech, the present invention uses both air conduction test speech and non-air conduction test speech information when enhancing, and can achieve better enhancement effect.

(2) The present invention adopts a dual-channel speech joint classification model to fuse information of air conduction test speech and non-air conduction test speech, which can make frame classification more accurate and make full use of the correlation and prior knowledge of the two.

(3) The present invention uses a dual-channel Wiener filter to recover air-conducted speech. Compared with the Chinese invention patent 201610025390.7, the calculation is simpler, and it can avoid high-frequency or silent noise when air-conducted speech is recovered from non-air-conducted speech. , Failed to take advantage of the insufficiency of air conduction voice information, with better performance.

(4) The present invention uses a dual-channel Wiener filter to recover air-conducted speech, avoiding the assumption that non-air-conducted speech and air-conducted speech are mutually independent.

Description of the drawings

Figure 1 is a structural block diagram of a device for implementing a dual-sensor voice enhancement method based on dual-channel Wiener filtering disclosed in an embodiment of the present invention;

Fig. 2 is a flowchart of a dual-sensor speech enhancement method based on dual-channel Wiener filtering disclosed in an embodiment of the present invention.

Detailed ways

In order to make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be described clearly and completely in conjunction with the accompanying drawings in the embodiments of the present invention. Obviously, the described embodiments It is a part of the embodiments of the present invention, not all the embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative work shall fall within the protection scope of the present invention.

Example one

This embodiment discloses a structural block diagram of a device for implementing a dual-sensor speech enhancement method based on dual-channel Wiener filtering. As shown in FIG. 1, the air-conducted speech sensor, non-air-conducted speech sensor, noise model estimation module, and dual Channel speech joint classification model, model compensation module, frame classification module, filter coefficient generation module, and dual-channel filter are jointly constituted. Among them, air-conducted speech sensor and non-air-conducted speech sensor are respectively combined with noise model estimation module, frame classification module, Dual-channel filter connection, dual-channel speech joint classification model, model compensation module, frame classification module, filter coefficient generation module, dual-channel filter are connected in sequence, noise model estimation module is connected with model compensation module, filter coefficient generation module , The dual-channel speech joint classification model is connected to the filter coefficient generation module.

In this embodiment, the air-conducted speech sensor is a microphone, and the non-air-conducted speech sensor is a larynx microphone, both of which are used to collect air-conducted and non-air-conducted speech signals; the noise model estimation module is used to estimate the current air conduction noise Model and power spectrum. The dual-channel voice joint classification model uses the synchronously collected clean air conduction training speech and non-air conduction training speech to establish air conduction speech frames and non-air conduction speech frames. The air conduction speech power spectrum of each category in the above two-channel speech joint classification model The mean value Φ _ss (ω,l), the mean value of the power spectrum of non-air-guided speech Φ _bb (ω,l), the mean value of cross-spectrum between air-guided speech and non-air-guided speech Φ _bs (ω,l). The model compensation module uses the statistical model of air conduction noise to correct the parameters of the dual-channel speech joint classification model. The frame classification module classifies the air conduction test speech and non-air conduction test speech frames input simultaneously. The filter coefficient generation module constructs a dual-channel Wiener filter based on the classification result and the power spectrum of air conduction noise. The dual-channel filter filters air conduction test speech frames and non-air conduction test speech frames to obtain enhanced air conduction speech.

Example two

This embodiment discloses a dual-sensor speech enhancement method based on dual-channel Wiener filtering. According to the implementation device disclosed in the above embodiment, the following steps are used to calculate the enhanced air conduction test speech and non-air conduction test speech input. Guide voice, its process is shown in Figure 2:

Step S1. Synchronously collect clean air conduction training speech and non-air conduction training speech, establish a two-channel speech joint classification model of air conduction speech frame and non-air conduction speech frame, and calculate each corresponding to each of the above-mentioned two-channel speech joint classification model The average power spectrum of air-guided speech Φ _ss (ω,l), the average power spectrum of non-air-guided speech Φ _bb (ω,l), the average cross-spectrum between air-guided speech and non-air-guided speech Φ _bs (ω ,l), where ω is the frequency and l is the serial number of the classification.

In this embodiment, the following steps are used to complete:

S1.1. Perform framing and preprocessing on the synchronously collected clean air conduction training speech and non-air conduction training speech, and extract the characteristic parameters of each frame of speech.

In this embodiment, the synchronously collected clean air conduction training speech and non-air conduction training speech are divided into frames with a frame length of 30 ms and a frame shift of 10 ms. Each frame of clean air conduction training speech and non-air conduction training speech uses Hamming. After adding windows and pre-emphasis, find the power spectrum. The power spectra of the above-mentioned air conduction training speech and non-air conduction training speech are respectively passed through a 24-dimensional mel filter bank, and the output of the filter bank is taken logarithmically and then subjected to DCT transformation to obtain two sets of 12-dimensional mel frequency inverted The spectral coefficient is used as the training feature of the dual-channel speech joint classification model.

S1.2. Use the air-guided speech and non-air-guided speech features obtained in step S1.1 to train a dual-channel speech joint classification model. In this embodiment, the dual-channel voice joint classification model adopts multi-data stream GMM, namely

with

The parameters c _l , w _x , w _b , in the dual-channel speech joint classification model

with

Estimated using Expectation Maximization algorithm.

In this embodiment, each Gaussian component in the dual-channel speech joint classification model represents a category. For each pair of synchronized air conduction training speech frames and non-air conduction speech frames, the following formula is used to calculate the score for each category

The current air conduction training speech frame and non-air conduction speech frame belong to the category with the highest score. Calculate the category to which all air conduction training speech frames and non-air conduction speech frames belong, and then calculate the average air conduction speech power spectrum of the air conduction training speech frames and non-air conduction speech frames contained in the same category Φ _ss (ω,l) , The average power spectrum of non-air-guided speech Φ _bb (ω,l), the average cross-spectrum between air-guided speech and non-air-guided speech Φ _bs (ω,l).

Step S2: Collect air conduction test speech and non-air conduction test speech simultaneously, use the pure noise section of the air conduction test speech to establish a statistical model of air conduction noise, and calculate the power spectrum mean value Φ _vv (ω) of air conduction noise.

In this embodiment, the statistical model of air conduction noise is the mean value of the power spectrum of air conduction noise Φ _vv (ω), which is calculated by the following method:

Where sgn[·] is a symbolic operation,

Is the adjustment factor, T is the initial threshold value, and M is the frame length. When C _{b is} greater than the preset threshold, determine that the frame is a voice signal, otherwise it is noise, and obtain the endpoint position of the non-air conduction detection voice signal according to the decision result of each frame;

Among them, the statistical model of air conduction noise is Gaussian function, GMM model or HMM model.

Step S3: Use the statistical model of air conduction noise and the two-channel speech joint classification model in step S1 to classify the air conduction test speech frames and non-air conduction test speech frames input simultaneously.

In this embodiment, the VTS model compensation technology is first adopted, and the air conduction noise statistical model is used to correct the parameters of the air conduction speech data stream in the dual-channel speech joint classification model, and then the input air conduction test speech frames and non-air conduction test speech frames are corrected. Guide test voice frames for classification. The specific method is to use the following formula to modify the mean value of each Gaussian component of the air conduction speech data stream in the dual-channel speech joint classification model:

among them

with

The power spectra of the clean air conduction training speech and noise belonging to the l-th class respectively pass through the 24-dimensional mel filter bank and take the logarithm of the mean, and C is the discrete cosine transform matrix (Discrete Cosine Transform, DCT). The other parameters in the dual-channel speech joint classification model remain unchanged. The revised dual-channel speech joint classification model is used to classify the synchronized input air conduction test speech frames and non-air conduction test speech frames, and the current air conduction test speech frames and non-air conduction test speech frames corresponding to each classification are obtained Score q(k,l).

Step S4: Construct a two-channel Wiener filter according to the classification result of step S3 and Φ _vv (ω), and filter the air conduction test speech frame and the non-air conduction test speech frame to obtain an enhanced air conduction speech.

In this embodiment, for the air conduction test speech and non-air conduction test speech synchronously collected at the kth frame, the following formula is used to calculate the enhanced air conduction speech spectrum:

In the formula, q(k,l) is the classification score of the k-th frame air conduction test speech and non-air conduction test speech corresponding to the first category of the dual-channel speech joint classification model. H _a (ω,k,l) is the Wiener filter frequency response of the k-th frame air conduction test speech corresponding to the first category of the dual-channel speech joint classification model. The calculation method is:

In another embodiment, the above

with

Use the following formula:

l=argmaxq(k,l).

The above-mentioned embodiments are preferred embodiments of the present invention, but the embodiments of the present invention are not limited by the above-mentioned embodiments, and any other changes, modifications, substitutions, combinations, etc. made without departing from the spirit and principle of the present invention Simplified, all should be equivalent replacement methods, and they are all included in the protection scope of the present invention.

Claims

A dual-sensor voice enhancement method based on dual-channel Wiener filtering is characterized in that the dual-sensor voice enhancement method includes the following steps:

S1. Collect clean air conduction training speech and non-air conduction training speech simultaneously, establish a two-channel speech joint classification model of air conduction speech frame and non-air conduction speech frame, and calculate corresponding to each of the above two-channel speech joint classification models The average power spectrum of classified air-guided speech Φ ss (ω,l), the average power spectrum of non-air-guided speech Φ bb (ω,l), the average cross-spectrum between air-guided speech and non-air-guided speech Φ bs (ω, l), where ω is frequency and l is the serial number of the classification;

S2. Collect air conduction test speech and non-air conduction test speech simultaneously, use the pure noise section of air conduction test speech to establish a statistical model of air conduction noise, and calculate the mean value of the power spectrum of air conduction noise Φ vv (ω);

S3. Use the statistical model of air conduction noise and the dual-channel speech joint classification model in step S1 to classify the synchronized input air conduction test speech frames and non-air conduction test speech frames;

S4. Construct a two-channel Wiener filter according to the classification result of step S3 and the power spectrum mean value Φ vv (ω), and filter the air conduction test speech frame and the non-air conduction test speech frame to obtain an enhanced air conduction speech.
The dual-sensor speech enhancement method according to claim 1, wherein the process of step S1 is as follows:

S1.1. Framing and preprocessing the synchronously collected clean air conduction training speech and non-air conduction training speech, and extracting characteristic parameters of each frame of speech, where the characteristic parameters are inverted mel spectrum coefficients;

S1.2. Use the air-guided speech and non-air-guided speech features obtained in step S1.1 to train a dual-channel speech joint classification model;

S1.3. Use the trained dual-channel speech joint classification model to classify all air conduction training speech frames and non-air conduction speech frames, and then calculate the air conduction training speech frames and non-air conduction speech frames contained in each classification. Air-guided speech power spectrum mean Φ ss (ω,l), non-air-guided speech power spectrum mean Φ bb (ω,l), cross-spectrum mean between air-guided speech and non-air-guided speech Φ bs (ω,l) .
The dual-sensor speech enhancement method according to claim 2, characterized in that, in the step S1.2, the dual-channel speech joint classification model adopts a multi-data stream GMM, where GMM is a Gaussian mixture model, namely

Where N (o, μ, σ) is a Gaussian function, o x (k) and o b (k) for the k-th frame and the non-voice test air conduction air conduction test speech feature vectors extracted,
with
Is the mean value of the l Gaussian component of the air-conducted voice data stream and the non-air-conducted voice data stream in the multi-data stream GMM,
with
Is the variance of the l-th Gaussian component of the air-conducted voice data stream and the non-air-conducted voice data stream in the multi-stream GMM, c l is the weight of the l-th Gaussian component in the multi-stream GMM, w x and w b are respectively The weight of the air-conducted voice data stream and the non-air-conducted voice data stream in the data stream GMM, L is the number of Gaussian components.
The dual-sensor speech enhancement method according to claim 3, wherein in the step S1.3, each Gaussian component in the dual-channel speech joint classification model represents a classification, and for each pair of synchronized air conduction Training speech frames and non-air-conducted speech frames, use the following formula to calculate their scores for each category

The current air conduction training speech frames and non-air conduction speech frames belong to the category with the highest score; calculate the classification to which all air conduction training speech frames and non-air conduction speech frames belong, and then calculate the air conduction training speech frames and sums contained in the same category The mean value of the air-guided speech power spectrum of the non-air-guided speech frame Φ ss (ω,l), the mean value of the power spectrum of the non-air-guided speech Φ bb (ω,l), the cross-spectrum mean value between the air-guided speech and the non-air-guided speech Φ bs (ω,l).
The dual-sensor speech enhancement method according to claim 1, wherein the statistical model of air conduction noise is the mean value of the power spectrum of air conduction noise Φ vv (ω), which is calculated by the following method:

S2.1. Collect air conduction test speech and non-air conduction test speech simultaneously and divide them into frames;

S2.2. According to the short-term autocorrelation function R b (m) and short-term energy E b of the non-air conduction detection speech frame, calculate the short-term average threshold crossing rate C b of each non-air conduction detection speech frame:

Where sgn[·] is a symbolic operation,
Is the adjustment factor, T is the threshold initial value, and M is the frame length. When C b is greater than the preset threshold, the frame is judged to be a speech signal, otherwise it is noise. According to the judgment result of each frame, the non-air conduction detection speech is obtained The end position of the signal;

S2.3. Use the time corresponding to the endpoint of the non-air conduction test voice signal detected in step S2.2 as the endpoint of the air conduction test voice, and extract the pure noise segment in the air conduction test voice;

S2.4. Calculate the mean value Φ vv (ω) of the power spectrum of the pure noise signal in the air conduction test speech.
The dual-sensor speech enhancement method according to claim 1, characterized in that, in said step S3, a vector Taylor series model compensation technique is first adopted, and a statistical model of air conduction noise is used to analyze the air conduction noise in the dual-channel speech joint classification model. The parameters of the speech data stream are corrected, and then the input air conduction test speech frames and non-air conduction test speech frames are classified. The following formula is used to modify each Gaussian component of the air conduction speech data stream in the dual-channel speech joint classification model Mean of:

among them
with
The power spectra of the clean air conduction training speech and noise belonging to the l-th class respectively pass through the 24-dimensional mel filter bank and take the logarithm of the mean value. C is the discrete cosine transform matrix, the two-channel speech joint classification model Other parameters remain unchanged, and the revised dual-channel speech joint classification model is used to classify the synchronized input air conduction test speech frame and non-air conduction test speech frame to obtain the current air conduction test speech frame and the corresponding non-air conduction test speech frame The category score q(k,l) for each category.
The dual-sensor speech enhancement method according to claim 2, characterized in that, in the step S4, for the air conduction test speech and non-air conduction test speech synchronously collected at the kth frame, the following formula is used to calculate the enhanced air conduction test speech Guide voice spectrum:

Among them, Y(ω,k), X(ω,k), B(ω,k) are the frequency spectrums of the enhanced air conduction speech, air conduction test speech and non-air conduction test speech at the k-th frame, respectively.
In order to correspond to the frequency response of the Wiener filter filter of the k-th air conduction test speech and non-air conduction test speech, the following formulas are used to calculate

Where q(k,l) is the classification score of the k-th frame air conduction test speech and non-air conduction test speech corresponding to the first category of the dual-channel speech joint classification model, and H a (ω,k,l) is the k-th frame The air conduction test speech corresponds to the Wiener filter frequency response of the first category of the dual-channel speech joint classification model. The calculation method is:

H na (ω,k,l) is the Wiener filter frequency response of the k-th frame of non-air conduction test speech corresponding to the first category of the dual-channel speech joint classification model. The calculation method is:
The dual-sensor speech enhancement method according to claim 7, wherein the
with
Use the following formula:
A device for implementing a dual-sensor speech enhancement method based on dual-channel Wiener filtering, characterized in that the device includes an air-conducted speech sensor, a non-air-conducted speech sensor, a noise model estimation module, and a dual-channel speech joint classification model , Model compensation module, frame classification module, filter coefficient generation module and dual-channel filter, among which,

The air-conducted speech sensor and the non-air-conducted speech sensor are respectively connected to the noise model estimation module, the frame classification module, and the dual-channel filter; the dual-channel speech joint classification model, the model compensation module, and the frame classification module , The filter coefficient generation module and the dual-channel filter are connected in sequence, the noise model estimation module is connected with the model compensation module and the filter coefficient generation module, and the dual-channel speech joint classification model is connected with the filter coefficient generation module ；

The air conduction speech sensor and the non-air conduction speech sensor are used to collect air conduction and non-air conduction speech signals, respectively, and the noise model estimation module is used to estimate the current air conduction noise model and power spectrum. The channel voice joint classification model uses synchronously collected clean air conduction training speech and non-air conduction training speech to establish air conduction speech frames and non-air conduction speech frames, and the air conduction speech power of each category in the dual-channel speech joint classification model The mean value of the spectrum is Φ ss (ω,l), the mean value of the power spectrum of non-air-guided speech is Φ bb (ω,l), the mean value of the cross-spectrum between air-guided speech and non-air-guided speech is Φ bs (ω,l), The model compensation module uses the statistical model of air conduction noise to correct the parameters of the dual-channel speech joint classification model, and the frame classification module classifies the currently synchronized input air conduction test speech and non-air conduction test speech frames, The filter coefficient generation module constructs a dual-channel Wiener filter based on the classification result and the power spectrum of air conduction noise. The dual-channel filter filters air conduction test speech frames and non-air conduction test speech frames to be enhanced After the air conduction voice.
The device for implementing the dual-sensor speech enhancement method according to claim 9, wherein the air-conducted speech sensor is a microphone, and the non-air-conducted speech sensor is a throat microphone.