CN114220450A - Method for suppressing strong noise in an air-based command-and-control environment - Google Patents
Method for suppressing strong noise in an air-based command-and-control environment
- Publication number
- CN114220450A (application CN202111370832.9A)
- Authority
- CN
- China
- Prior art keywords
- noise
- voice signal
- deep learning
- voice
- noise reduction
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G10L21/0216: Noise filtering characterised by the method used for estimating noise (speech enhancement, G10L21/02)
- G10L15/063: Training of speech recognition systems (creation of reference templates, G10L15/06)
- G10L21/0224: Noise filtering with processing in the time domain
- G10L21/0232: Noise filtering with processing in the frequency domain
- G10L2021/02161: Number of inputs available containing the signal or the noise to be suppressed
- G10L2021/02166: Microphone arrays; beamforming
Abstract
The application belongs to the technical field of aviation display and control, and relates to a method for suppressing strong noise in an air-based command-and-control environment. The method comprises the following steps: step one, optimizing a microphone array for air-based command and control, acquiring a first voice signal and engine mechanical noise through the microphone array, and processing the first voice signal with a super-directive beamformer to obtain a noise-reduced second voice signal steered toward the pilot's mouth; step two, removing the engine mechanical noise from the second voice signal with an adaptive filter to obtain a third voice signal; step three, training a deep learning noise reduction model, applying it to the third voice signal to suppress cabin noise and obtain a fourth voice signal, and sending the fourth voice signal to a recognition engine. The method improves the quality of the captured speech, thereby raising the operator's speech recognition accuracy and advancing the engineering application of voice interaction technology.
Description
Technical Field
The application belongs to the technical field of aviation display and control, and relates to a method for suppressing strong noise in an air-based command-and-control environment.
Background
In the deployment of air-based command-and-control systems and existing manned cockpits, an operator commands and controls an unmanned aerial vehicle through speech recognition under highly dynamic conditions. The strong noise in existing cockpits means that audio captured by conventional helmet pickup equipment cannot meet the requirements of speech recognition, and typical aerodynamic noise in the cockpit cannot be eliminated by simple physical noise reduction or by conventional back-end active noise reduction methods.
Accordingly, a technical solution is desired to overcome or at least alleviate at least one of the above-mentioned drawbacks of the prior art.
Disclosure of Invention
The application aims to provide a method for suppressing strong noise in an air-based command-and-control environment, so as to solve at least one problem in the prior art.
The technical scheme of the application is as follows:
A method for suppressing strong noise in an air-based command-and-control environment comprises the following steps:
step one, optimizing a microphone array for air-based command and control, acquiring a first voice signal and engine mechanical noise through the microphone array, and processing the first voice signal with a super-directive beamformer to obtain a noise-reduced second voice signal steered toward the pilot's mouth;
step two, removing the engine mechanical noise from the second voice signal with an adaptive filter to obtain a third voice signal;
step three, training a deep learning noise reduction model, applying it to the third voice signal to suppress cabin noise and obtain a fourth voice signal, and sending the fourth voice signal to a recognition engine.
In at least one embodiment of the present application, in step one, optimizing the microphone array for air-based command and control comprises: arranging four microphones in a uniform linear layout at the earphone pickup position of the pilot's helmet, and mounting one microphone on the side of the helmet.
In at least one embodiment of the present application, in step one, the super-directive beamformer is obtained as follows:
Assume a uniform linear array of M omnidirectional microphones with element spacing δ. In the presence of spherically isotropic (diffuse) noise, a super-directive beamformer is designed whose array gain in the end-fire direction approaches M². The element delay differences are
δ_ij = (i - j)δ
and the diffuse-noise pseudo-coherence matrix is [Γ(ω)]_ij = sinc(ωδ_ij / c), where c is the speed of sound, d_L(ω, θ) is the array steering vector, and θ is the desired direction.
The filter that makes the array gain ζ_{L,dn}[h(ω)] maximal is found by solving
max_h |h^H(ω) d_L(ω, θ)|² / (h^H(ω) Γ(ω) h(ω)),
whose solution is h(ω) = α Γ⁻¹(ω) d_L(ω, θ) for an arbitrary complex scalar α. Imposing the distortionless constraint h^H(ω) d_L(ω, θ) = 1 yields the maximum signal-to-noise ratio filter, i.e. the super-directive beamformer:
h_SD(ω) = Γ⁻¹(ω) d_L(ω, θ) / (d_L^H(ω, θ) Γ⁻¹(ω) d_L(ω, θ)).
in at least one embodiment of the present application, the method further comprises processing the super-directional beamformer to obtain a robust super-directional beamformer:
acquiring a white noise gain:
maximizing the directivity factor under the constraint of white noise gain is equivalent to minimizing the following equation:
wherein epsilon is a Lagrange multiplier, and a robust super-directional beam former is obtained through constraint of a distortion-free criterion:
in at least one embodiment of the present application, in the second step, the removing the engine mechanical noise in the second speech signal by using an adaptive filter to obtain a third speech signal includes:
obtaining an adaptive filter;
removing the engine mechanical noise in the second voice signal through an adaptive filter to obtain a third voice signal:
e(n)=d(n)-y(n)
wherein, x (N) is engine mechanical noise, d (N) is a second voice signal, h (N) is an adaptive filter, e (N) is a third voice signal, and N is a filter length.
In at least one embodiment of the present application, in step three, training the deep learning noise reduction model comprises:
training the deep learning noise reduction model with the mixed speech y output by the signal processing module and the corresponding clean speech s:
extracting the acoustic feature F of the mixed speech y:
F = log(mel(STFT(y)))
where STFT is the short-time Fourier transform and mel is the Mel filterbank (Mel spectral feature);
normalizing the acoustic feature F:
F = (F - mean) / var
where mean is the mean of the training set features and var is the standard deviation of the training set features;
feeding the normalized acoustic feature F into the deep learning noise reduction model as input, and guiding the update of the model parameters with the MSE error e between the model output s_ and the actual clean speech s:
e = ||s_ - ISTFT(STFT(y) * model(F))||²
where ISTFT is the inverse short-time Fourier transform, model denotes the deep learning noise reduction model, and * denotes element-wise (dot) multiplication;
when the value of the MSE error e stabilizes, the deep learning noise reduction model has converged and is saved.
In at least one embodiment of the present application, the method further comprises compressing the deep learning noise reduction model, specifically:
applying 16-bit fixed-point quantization to all parameters of the deep learning noise reduction model to accelerate computation;
applying SVD (singular value decomposition) to the weight matrices to reduce the amount of computation.
The invention has at least the following beneficial technical effects:
the method for suppressing strong noise in an air-based command-and-control environment improves the quality of the captured speech, thereby raising the operator's speech recognition accuracy and advancing the engineering application of voice interaction technology.
Drawings
FIG. 1 is a flow chart of a method for suppressing strong noise in an air-based command-and-control environment according to an embodiment of the present application;
FIG. 2 is a super-directive beam pattern according to an embodiment of the present application;
FIG. 3 is a flow diagram of an adaptive filter process according to an embodiment of the present application;
FIG. 4 is an adaptive filter according to an embodiment of the present application;
FIG. 5 is a frequency domain block adaptive filter according to an embodiment of the present application;
FIG. 6 is a flow chart of deep learning noise reduction model training according to an embodiment of the present application.
Detailed Description
In order to make the implementation objects, technical solutions and advantages of the present application clearer, the technical solutions in the embodiments of the present application will be described in more detail below with reference to the drawings in the embodiments of the present application. In the drawings, the same or similar reference numerals denote the same or similar elements or elements having the same or similar functions throughout. The described embodiments are a subset of the embodiments in the present application and not all embodiments in the present application. The embodiments described below with reference to the drawings are exemplary and intended to be used for explaining the present application and should not be construed as limiting the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application. Embodiments of the present application will be described in detail below with reference to the accompanying drawings.
In the description of the present application, it is to be understood that the terms "center", "longitudinal", "lateral", "front", "back", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", and the like indicate orientations or positional relationships based on those shown in the drawings, and are used merely for convenience in describing the present application and for simplifying the description, and do not indicate or imply that the referenced device or element must have a particular orientation, be constructed in a particular orientation, and be operated, and therefore should not be construed as limiting the scope of the present application.
The present application is described in further detail below with reference to fig. 1 to 6.
The application provides a method for suppressing strong noise in an air-based command-and-control environment which, as shown in FIG. 1, comprises the following steps:
step one, optimizing a microphone array for air-based command and control, acquiring a first voice signal and engine mechanical noise through the microphone array, and processing the first voice signal with a super-directive beamformer to obtain a noise-reduced second voice signal steered toward the pilot's mouth;
step two, removing the engine mechanical noise from the second voice signal with an adaptive filter to obtain a third voice signal;
step three, training a deep learning noise reduction model, applying it to the third voice signal to suppress cabin noise and obtain a fourth voice signal, and sending the fourth voice signal to a recognition engine.
In the above method, in step one, optimizing the microphone array for air-based command and control comprises: arranging four microphones in a uniform linear layout at the earphone pickup position of the pilot's helmet to collect the pilot's voice control instructions, and mounting one microphone on the side of the helmet to collect the mechanical noise of the aircraft engine. The voice signals collected by the microphone array are processed by the super-directive beamformer, which filters out the wind noise generated by airflow over the front of the aircraft and yields a noise-reduced signal steered toward the pilot's mouth.
Super-directive beamforming is a type of fixed beamforming; compared with conventional beamforming it has a narrower main lobe and stronger noise suppression. In one embodiment of the present application, in step one, the super-directive beamformer is obtained as follows:
Assume a uniform linear array of M omnidirectional microphones with element spacing δ. In the presence of spherically isotropic (diffuse) noise, a super-directive beamformer is designed whose array gain in the end-fire direction approaches M². The element delay differences are
δ_ij = (i - j)δ
and the diffuse-noise pseudo-coherence matrix is [Γ(ω)]_ij = sinc(ωδ_ij / c), where c is the speed of sound, d_L(ω, θ) is the array steering vector, and θ is the desired direction.
The filter that makes the array gain ζ_{L,dn}[h(ω)] maximal is found by solving
max_h |h^H(ω) d_L(ω, θ)|² / (h^H(ω) Γ(ω) h(ω)),
whose solution is h(ω) = α Γ⁻¹(ω) d_L(ω, θ) for an arbitrary complex scalar α. Imposing the distortionless constraint h^H(ω) d_L(ω, θ) = 1 yields the maximum signal-to-noise ratio filter, i.e. the super-directive beamformer:
h_SD(ω) = Γ⁻¹(ω) d_L(ω, θ) / (d_L^H(ω, θ) Γ⁻¹(ω) d_L(ω, θ)).
the maximum snr filter is the super-directional beamformer (main lobe direction θ). But the super-directional beamformer has a white noise gain at low frequencies much less than 1 and therefore in practical applications will cause an amplification of the white noise, especially at low frequencies. Therefore, in this embodiment, it is preferable that the processing of the super-directional beamformer results in a robust super-directional beamformer:
acquiring a white noise gain:
maximizing the directivity factor under the constraint of white noise gain is equivalent to minimizing the following equation:
wherein epsilon is a Lagrange multiplier, and a robust super-directional beam former is obtained through constraint of a distortion-free criterion:
on the basis of the robust super-directional beam former, low-frequency sidelobe constraint is continuously added, so that the beam direction is narrower at low frequency, and the suppression on noise is enhanced. Through the beam forming processing of the stage, the integral signal-to-noise ratio is improved by about 10-15 dB.
Fig. 2 shows the super-directional beam pattern designed in this embodiment, and it can be seen that the super-directional beam former has limited noise suppression capability at low frequencies. In this regard, noise will be further suppressed by incorporating adaptive noise cancellation methods.
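As a concrete illustration of the formulas above, the following NumPy sketch computes (robust) super-directive weights for a uniform linear array in diffuse noise. This is a minimal sketch under stated assumptions, not the patent's implementation: the array size M = 4, the 2 cm spacing, the 1 kHz frequency, and the regularization value ε are illustrative choices.

```python
import numpy as np

def steering_vector(M, delta, omega, theta, c=343.0):
    # ULA steering vector d_L(omega, theta); theta = 0 is the end-fire direction
    i = np.arange(M)
    return np.exp(-1j * omega * i * delta * np.cos(theta) / c)

def superdirective_weights(M, delta, omega, theta=0.0, eps=0.0, c=343.0):
    # Diffuse-noise pseudo-coherence matrix: Gamma_ij = sinc(omega * delta_ij / (pi * c))
    # (np.sinc(x) = sin(pi x) / (pi x), hence the division by pi)
    ij = np.subtract.outer(np.arange(M), np.arange(M))
    Gamma = np.sinc(omega * ij * delta / (np.pi * c))
    d = steering_vector(M, delta, omega, theta, c)
    A = Gamma + eps * np.eye(M)          # eps > 0 gives the robust (regularized) version
    num = np.linalg.solve(A, d)          # (Gamma + eps I)^-1 d
    h = num / (d.conj() @ num)           # distortionless constraint: h^H d = 1
    return h, Gamma, d

M, delta, omega = 4, 0.02, 2 * np.pi * 1000.0   # illustrative assumptions
h, Gamma, d = superdirective_weights(M, delta, omega, eps=1e-3)
df = 1.0 / np.real(h.conj() @ Gamma @ h)   # directivity factor (diffuse-noise array gain)
wng = 1.0 / np.real(h.conj() @ h)          # white noise gain, at most M for a distortionless filter
```

Sweeping ε trades directivity against white noise gain, which is exactly the robustness trade-off the description discusses for low frequencies.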
In the aircraft cockpit environment, wind noise occupies a relatively wide spectrum without obvious spectral structure, whereas engine mechanical noise usually has a clear spectral structure, which makes adaptive noise cancellation of engine noise in the cockpit feasible. In this method, a microphone mounted on the side of the pilot's helmet collects the reference noise; the reference contains both wind noise and engine mechanical noise, but it is the spectral structure of the engine noise that the adaptive filter learns and tracks. As shown in FIG. 3, the microphone array at the helmet earphone collects a noisy signal in which the voice command is mixed with the engine noise picked up by the reference microphone through an acoustic path H; as long as changes in H can be accurately estimated and tracked, the engine noise in the array signal can be effectively cancelled.
In this embodiment, in step two, removing the engine mechanical noise from the second voice signal with the adaptive filter to obtain the third voice signal comprises:
obtaining an adaptive filter;
removing the engine mechanical noise from the second voice signal through the adaptive filter to obtain the third voice signal:
y(n) = Σ_{k=0}^{N-1} h_k(n) x(n - k)
e(n) = d(n) - y(n)
where x(n) is the engine mechanical noise, d(n) is the second voice signal, h(n) is the adaptive filter, e(n) is the third voice signal, and N is the filter length. A block diagram of the adaptive cancellation process is shown in FIG. 4, where s(n) denotes the near-end speaker's speech.
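The cancellation equations above can be sketched with a normalized LMS (NLMS) adaptation of h(n). This is a toy example, not the patent's filter: the synthetic noise path, the sinusoidal stand-in for the pilot's speech, the filter length, and the step size are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N, L = 20000, 16                      # samples, adaptive filter length
x = rng.standard_normal(N)            # reference engine noise (side microphone)
H = rng.standard_normal(8) * 0.5      # unknown acoustic path into the headset mic (assumed)
noise = np.convolve(x, H)[:N]         # engine noise as received by the array
s = 0.1 * np.sin(2 * np.pi * 0.01 * np.arange(N))   # stand-in for the pilot's speech
d = s + noise                         # second voice signal: speech + engine noise

h = np.zeros(L)                       # adaptive filter h(n)
e = np.zeros(N)
mu = 0.5                              # NLMS step size
for n in range(L, N):
    xn = x[n - L + 1:n + 1][::-1]     # most recent L reference samples, newest first
    y = h @ xn                        # y(n) = sum_k h_k(n) x(n - k)
    e[n] = d[n] - y                   # third voice signal e(n) = d(n) - y(n)
    h += mu * e[n] * xn / (xn @ xn + 1e-8)   # NLMS coefficient update

# after convergence the residual e should be much closer to the clean speech than d was
err_before = np.mean((d[-2000:] - s[-2000:]) ** 2)
err_after = np.mean((e[-2000:] - s[-2000:]) ** 2)
```

Because the speech s is uncorrelated with the reference x, adaptation converges toward the noise path H and leaves the speech in the residual, which is the principle FIG. 4 depicts.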
In a preferred embodiment, the complex real-world cockpit environment requires fast tracking and convergence of the filter. Time-domain processing usually introduces unacceptable delay, whereas frequency-domain processing offers lower computational complexity and faster convergence, and allows each frequency band to be controlled precisely during convergence for overall optimal filtering. This embodiment therefore adopts the partitioned-block frequency-domain adaptive filter (PBFDAF) structure for adaptive noise cancellation, shown in FIG. 5.
The coefficient vector of the time-domain adaptive filter can be written as
w(n) = [w_0(n) … w_{M-1}(n)]^T
and the time-domain error vector as
e(n) = [e(n) … e(n + M - 1)]^T.
The input signal matrix in the frequency domain is
X(k) = diag{X_0(k) … X_{2M-1}(k)} = diag{F [x(kM - M) … x(kM + M - 1)]^T},
and the frequency-domain filter coefficients and error signal vector are
W(k) = [W_0(k) … W_{2M-1}(k)]^T = F [w^T(kM) 0 … 0]^T
E(k) = [E_0(k) … E_{2M-1}(k)]^T = F [0 … 0 e^T(kM)]^T
where F and F^{-1} are the 2M × 2M DFT and IDFT matrices, respectively. The frequency-domain adaptive iteration of the FDAF can thus be written as
W(k + 1) = W(k) + 2 μ(k) X^H(k) E(k)
where μ(k) = diag{μ_0(k) … μ_{2M-1}(k)} is the normalized step-size matrix and Λ(k) = diag{P_0(k) … P_{2M-1}(k)} is the input signal power matrix; their entries are, respectively,
μ_i(k) = α / P_i(k)
P_i(k) = λ P_i(k - 1) + (1 - λ)|X_i(k)|²
where α is a fixed step size and λ is a forgetting factor.
after the self-adaptive filter processing, the overall signal-to-noise ratio can be improved by about 15-20 dB.
The third voice signal is then further processed by the deep learning noise reduction model to suppress cabin noise. Single-channel noise reduction algorithms fall into two broad categories: conventional methods and deep learning based methods. Conventional methods can be classified as parametric, nonparametric, or statistical-model-based; the typical deep learning approach is based on deep neural networks (DNN). Conventional methods are simple, easy to implement, and computationally cheap, and handle stationary noise well at high signal-to-noise ratios, but their performance degrades sharply at low signal-to-noise ratios and they handle non-stationary noise poorly. DNN-based noise reduction compensates for these shortcomings, but it is a data-driven supervised learning method: a large amount of data is needed to train the model to the expected performance, data preparation and training take time, and the computational load is higher than that of conventional methods. Model compression is therefore used to reduce the computational load while preserving performance.
In a preferred embodiment of the present application, in step three, training the deep learning noise reduction model comprises:
training the deep learning noise reduction model with the mixed speech y output by the signal processing module and the corresponding clean speech s (the module's input is the speech output by the signal processing module, and its output is sent to the next module for subsequent processing):
extracting the acoustic feature F of the mixed speech y:
F = log(mel(STFT(y)))
where STFT is the short-time Fourier transform and mel is the Mel filterbank (Mel spectral feature);
normalizing the acoustic feature F:
F = (F - mean) / var
where mean is the mean of the training set features and var is the standard deviation of the training set features;
feeding the normalized acoustic feature F into the deep learning noise reduction model as input, and guiding the update of the model parameters with the MSE error e between the model output s_ and the actual clean speech s:
e = ||s_ - ISTFT(STFT(y) * model(F))||²
where ISTFT is the inverse short-time Fourier transform, model denotes the deep learning noise reduction model, and * denotes element-wise (dot) multiplication;
when the value of the MSE error e stabilizes, the deep learning noise reduction model has converged and is saved.
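The feature pipeline F = log(mel(STFT(y))) and its normalization can be sketched in NumPy as follows. This is a minimal textbook construction, not the patent's front end: the window, hop, filterbank parameters, and the use of per-utterance statistics (the patent normalizes with training-set statistics) are illustrative assumptions.

```python
import numpy as np

def stft(y, n_fft=512, hop=128):
    # Hann-windowed short-time Fourier transform, shape (frames, n_fft//2 + 1)
    win = np.hanning(n_fft)
    frames = [y[i:i + n_fft] * win for i in range(0, len(y) - n_fft, hop)]
    return np.fft.rfft(np.array(frames), axis=1)

def mel_filterbank(n_mels=40, n_fft=512, sr=16000):
    # Triangular filters spaced uniformly on the mel scale (minimal construction)
    mel = lambda f: 2595 * np.log10(1 + f / 700)
    imel = lambda m: 700 * (10 ** (m / 2595) - 1)
    pts = imel(np.linspace(mel(0), mel(sr / 2), n_mels + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        for k in range(l, c):
            fb[m - 1, k] = (k - l) / max(c - l, 1)   # rising slope
        for k in range(c, r):
            fb[m - 1, k] = (r - k) / max(r - c, 1)   # falling slope
    return fb

y = np.random.default_rng(2).standard_normal(16000)   # 1 s stand-in for mixed speech
S = np.abs(stft(y))                                   # magnitude spectrogram
F = np.log(mel_filterbank() @ S.T + 1e-8)             # F = log(mel(STFT(y)))
F = (F - F.mean()) / (F.std() + 1e-8)                 # F = (F - mean) / var normalization
```

The normalized F is what would be fed to the noise reduction model as input; the model's output mask is then applied back to STFT(y) before the ISTFT, as in the error expression above.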
This embodiment further includes model testing. The mixed speech is processed by the signal processing module to obtain y; features extracted from y are fed to the model to obtain the corresponding output mask; the mask is multiplied element-wise with the spectrum of y, and the speech is reconstructed to obtain the noise-reduced signal, which is output to the next module. The model framework is shown in FIG. 6.
Because the original model has a large number of parameters and the computing resources of mobile equipment are limited, the model must be compressed to reduce the computational load. In this embodiment, the deep learning noise reduction model is compressed as follows: 16-bit fixed-point quantization is applied to all model parameters to accelerate computation, and SVD is applied to the weight matrices to reduce the amount of computation. The computational load is greatly reduced while the loss in model performance is very small. After the preceding signal processing stages, model processing can additionally improve the signal-to-noise ratio by about 10 dB.
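The two compression steps above (16-bit fixed-point quantization of parameters and SVD factorization of weight matrices) can be sketched as follows. The layer size and retained rank are illustrative assumptions; in practice the rank is tuned per layer against the accuracy loss.

```python
import numpy as np

rng = np.random.default_rng(3)
W = rng.standard_normal((512, 512))          # stand-in dense layer weight matrix

# SVD low-rank factorization: W ≈ A @ B with two thin matrices
U, s, Vt = np.linalg.svd(W, full_matrices=False)
r = 64                                        # retained rank (assumed)
A = U[:, :r] * s[:r]                          # shape (512, r)
B = Vt[:r]                                    # shape (r, 512)

x = rng.standard_normal(512)
y_full = W @ x                                # one dense matmul: 512*512 multiplies
y_lr = A @ (B @ x)                            # two thin matmuls: 2*512*64 multiplies

params_full = W.size                          # 262144 parameters
params_lr = A.size + B.size                   # 65536 parameters, a 4x reduction here

# 16-bit fixed-point sketch: round parameters to a uniform grid spanning the range
scale = np.abs(A).max() / (2 ** 15 - 1)
A_q = np.round(A / scale).astype(np.int16)    # stored as int16
A_deq = A_q.astype(np.float32) * scale        # dequantized for computation
```

Note that a random matrix has no low-rank structure, so y_lr is a poor approximation here; trained weight matrices are typically much closer to low rank, which is what makes the technique effective.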
The method for suppressing strong noise in an air-based command-and-control environment can improve noise suppression by about 30-45 dB, with an optimization rate of up to 100%.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present application should be covered within the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
Claims (7)
1. A method for suppressing strong noise in an air-based command-and-control environment, characterized by comprising the following steps:
step one, optimizing a microphone array for air-based command and control, acquiring a first voice signal and engine mechanical noise through the microphone array, and processing the first voice signal with a super-directive beamformer to obtain a noise-reduced second voice signal steered toward the pilot's mouth;
step two, removing the engine mechanical noise from the second voice signal with an adaptive filter to obtain a third voice signal;
step three, training a deep learning noise reduction model, applying it to the third voice signal to suppress cabin noise and obtain a fourth voice signal, and sending the fourth voice signal to a recognition engine.
2. The method according to claim 1, wherein in step one, optimizing the microphone array for air-based command and control comprises: arranging four microphones in a uniform linear layout at the earphone pickup position of the pilot's helmet, and mounting one microphone on the side of the helmet.
3. The method according to claim 1, wherein in step one the super-directive beamformer is obtained as follows:
assume a uniform linear array of M omnidirectional microphones with element spacing δ; in the presence of isotropic noise, a super-directive beamformer is designed whose array gain in the end-fire direction approaches M², with element delay differences
δ_ij = (i - j)δ
and diffuse-noise pseudo-coherence matrix [Γ(ω)]_ij = sinc(ωδ_ij / c), where c is the speed of sound, d_L(ω, θ) is the array steering vector, and θ is the desired direction;
the filter that makes the array gain ζ_{L,dn}[h(ω)] maximal is found by solving
max_h |h^H(ω) d_L(ω, θ)|² / (h^H(ω) Γ(ω) h(ω)),
whose solution is h(ω) = α Γ⁻¹(ω) d_L(ω, θ) for an arbitrary complex scalar α; applying the distortionless constraint yields the maximum signal-to-noise ratio filter, i.e. the super-directive beamformer:
h_SD(ω) = Γ⁻¹(ω) d_L(ω, θ) / (d_L^H(ω, θ) Γ⁻¹(ω) d_L(ω, θ)).
4. The method according to claim 3, further comprising processing the super-directive beamformer to obtain a robust super-directive beamformer:
acquiring the white noise gain
W[h(ω)] = |h^H(ω) d_L(ω, θ)|² / (h^H(ω) h(ω));
maximizing the directivity factor under a white noise gain constraint, which is equivalent to minimizing
h^H(ω) [Γ(ω) + ε I] h(ω),
where ε is a Lagrange multiplier; and applying the distortionless constraint to obtain the robust super-directive beamformer:
h_RSD(ω) = [Γ(ω) + ε I]⁻¹ d_L(ω, θ) / (d_L^H(ω, θ) [Γ(ω) + ε I]⁻¹ d_L(ω, θ)).
5. The method according to claim 4, wherein in step two, removing the engine mechanical noise from the second voice signal with an adaptive filter to obtain a third voice signal comprises:
obtaining an adaptive filter;
removing the engine mechanical noise from the second voice signal through the adaptive filter to obtain the third voice signal:
y(n) = Σ_{k=0}^{N-1} h_k(n) x(n - k)
e(n) = d(n) - y(n)
where x(n) is the engine mechanical noise, d(n) is the second voice signal, h(n) is the adaptive filter, e(n) is the third voice signal, and N is the filter length.
6. The method for suppressing strong noise in a space-based finger-controlled environment according to claim 1, wherein in step three, training the deep learning noise reduction model comprises:
training the deep learning noise reduction model with the mixed speech y output by the signal processing module and the corresponding clean speech s:
extracting an acoustic feature F of the mixed voice y:
F=log(mel(STFT(y)))
wherein STFT is the short-time Fourier transform, and mel denotes Mel filterbank (spectral) feature extraction;
the acoustic feature F is normalized:
F=(F-mean)/var
wherein mean is the mean of the training set features, and var is the standard deviation of the training set features;
sending the normalized acoustic features F into the deep learning noise reduction model as input, and guiding the update of the model parameters with the MSE error e between the model output s_ and the actual clean speech s:

s_ = ISTFT(STFT(y) * model(F))

e = ||s - s_||²

wherein ISTFT is the inverse short-time Fourier transform, model(F) is the time-frequency mask output by the deep learning noise reduction model, and * denotes point-wise (element-wise) multiplication;
training continues until the value of the MSE error e stabilizes, at which point the deep learning noise reduction model has converged, and the converged model is saved.
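The feature pipeline and masking objective of claim 6 can be sketched end to end. This is a simplified illustration: the STFT uses non-overlapping rectangular frames (so it inverts exactly), the Mel filterbank is omitted (log-magnitude only), and an oracle magnitude-ratio mask stands in for model(F) since no trained network is available here:

```python
import numpy as np

def stft(x, n=256):
    # simplified invertible STFT: non-overlapping rectangular frames
    return np.fft.rfft(x[:len(x) // n * n].reshape(-1, n), axis=1)

def istft(X, n=256):
    return np.fft.irfft(X, n=n, axis=1).ravel()

rng = np.random.default_rng(1)
s = rng.standard_normal(4096)                  # stand-in clean speech s
y = s + 0.3 * rng.standard_normal(4096)        # mixed speech y from the signal chain
Y = stft(y)

# features as in the claim, F = log(mel(STFT(y))), then normalized
F = np.log(np.abs(Y) + 1e-8)
F = (F - F.mean()) / (F.std() + 1e-8)

# oracle mask standing in for model(F): clipped magnitude ratio per T-F bin
mask = np.clip(np.abs(stft(s)) / (np.abs(Y) + 1e-8), 0.0, 1.0)
s_hat = istft(Y * mask)                        # s_ = ISTFT(STFT(y) * model(F))
e = np.mean((s - s_hat) ** 2)                  # MSE training objective
e_mixed = np.mean((s - y) ** 2)                # error of the unprocessed mixture
```

Even this oracle mask only attenuates magnitudes, so the masked output's residual error comes mostly from the noisy phase that is kept from Y.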
7. The method for suppressing strong noise in a space-based finger-controlled environment according to claim 6, further comprising performing model compression on the deep learning noise reduction model, specifically:

quantizing all parameters of the deep learning noise reduction model to 16-bit fixed point to accelerate computation;

applying singular value decomposition (SVD) to the weight matrices to reduce the amount of computation.
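Both compression steps of claim 7 can be sketched on a toy weight matrix; the Q3.12 fixed-point format and the rank chosen below are assumptions, since the claim only specifies 16-bit fixed point and SVD:

```python
import numpy as np

def quantize_16bit(w, frac_bits=12):
    """16-bit fixed-point quantization (Q3.12 here, an assumption)."""
    scale = float(1 << frac_bits)
    q = np.clip(np.round(w * scale), -32768, 32767).astype(np.int16)
    return q.astype(np.float32) / scale

def svd_factorize(w, rank):
    """Replace an m x n weight matrix by two factors totalling
    (m + n) * rank parameters, cutting multiply-adds accordingly."""
    u, sv, vt = np.linalg.svd(w, full_matrices=False)
    return u[:, :rank] * sv[:rank], vt[:rank, :]

rng = np.random.default_rng(2)
w = rng.standard_normal((256, 256)).astype(np.float32)  # toy weight matrix
a, b = svd_factorize(w, rank=64)
params_saved = w.size - (a.size + b.size)               # 65536 - 32768
wq = quantize_16bit(w)
max_quant_err = float(np.max(np.abs(wq - w)))           # bounded by 0.5 / 2^12
```

A rank-64 factorization halves the parameter count of this layer; in practice the rank would be chosen by inspecting the decay of the singular values.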
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111370832.9A CN114220450A (en) | 2021-11-18 | 2021-11-18 | Method for restraining strong noise of space-based finger-controlled environment |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114220450A true CN114220450A (en) | 2022-03-22 |
Family
ID=80697615
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111370832.9A Pending CN114220450A (en) | 2021-11-18 | 2021-11-18 | Method for restraining strong noise of space-based finger-controlled environment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114220450A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115273850A (en) * | 2022-09-28 | 2022-11-01 | 科大讯飞股份有限公司 | Autonomous mobile equipment voice control method and system |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US9002027B2 (en) | Space-time noise reduction system for use in a vehicle and method of forming same | |
US10123113B2 (en) | Selective audio source enhancement | |
US8583428B2 (en) | Sound source separation using spatial filtering and regularization phases | |
CN107993670B (en) | Microphone array speech enhancement method based on statistical model | |
US9197975B2 (en) | System for detecting and reducing noise via a microphone array | |
KR101339592B1 (en) | Sound source separator device, sound source separator method, and computer readable recording medium having recorded program | |
CN111261138B (en) | Noise reduction system determination method and device, and noise processing method and device | |
CN110517701B (en) | Microphone array speech enhancement method and implementation device | |
KR20050115857A (en) | System and method for speech processing using independent component analysis under stability constraints | |
US9078057B2 (en) | Adaptive microphone beamforming | |
Li et al. | Geometrically constrained independent vector analysis for directional speech enhancement | |
CN112349292B (en) | Signal separation method and device, computer readable storage medium and electronic equipment | |
US20180308503A1 (en) | Real-time single-channel speech enhancement in noisy and time-varying environments | |
CN114220450A (en) | Method for restraining strong noise of space-based finger-controlled environment | |
Li et al. | Online Directional Speech Enhancement Using Geometrically Constrained Independent Vector Analysis. | |
WO2023108864A1 (en) | Regional pickup method and system for miniature microphone array device | |
Priyanka et al. | Adaptive Beamforming Using Zelinski-TSNR Multichannel Postfilter for Speech Enhancement | |
CN113658605B (en) | Speech enhancement method based on deep learning assisted RLS filtering processing | |
US11721353B2 (en) | Spatial audio wind noise detection | |
CN113223552B (en) | Speech enhancement method, device, apparatus, storage medium, and program | |
CN113838472A (en) | Voice noise reduction method and device | |
Song et al. | Drone ego-noise cancellation for improved speech capture using deep convolutional autoencoder assisted multistage beamforming | |
US11282531B2 (en) | Two-dimensional smoothing of post-filter masks | |
Li et al. | An overview of speech dereverberation | |
CN116206603A (en) | Voice control method and system for transfer robot |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||