CN113658605A - Speech enhancement method based on deep learning assisted RLS filtering processing - Google Patents
- Publication number: CN113658605A (application CN202111207569.1A)
- Authority
- CN
- China
- Prior art keywords
- signal
- noise
- output signal
- mask
- gru
- Prior art date
- Legal status: Granted
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
- G10L21/0216—Noise filtering characterised by the method used for estimating noise
- G10L2021/02161—Number of inputs available containing the signal or the noise to be suppressed
- G10L2021/02166—Microphone arrays; Beamforming
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Quality & Reliability (AREA)
- Signal Processing (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Circuit For Audible Band Transducer (AREA)
Abstract
A speech enhancement method based on deep learning assisted RLS filtering processing comprises the following steps. S1: process the microphone-array speech signal with a generalized sidelobe cancellation beamforming method to obtain a fixed beamforming output signal and a noise reference signal. S2: randomly select one microphone channel in the array, extract its feature signal, and feed it into a GRU-Mask network to compute a mask value for the original microphone signal. S3: compare the mask value output by the network with a noise threshold; when noise dominates, compute a noise canceller and use it to cancel the residual noise. Because the RLS algorithm filters only the signals in which the noise component dominates, the invention effectively reduces the computational load of microphone beam-signal processing and reduces the noise residue in the output signal without increasing distortion, thereby enhancing the speech signal and improving the speech recognition rate and the human-computer interaction experience.
Description
Technical Field
The invention belongs to the technical field of artificial intelligence, relates to speech recognition, and particularly relates to a speech enhancement method based on deep learning assisted RLS filtering processing.
Background
With the wide application of voice interaction technology, the traditional single-microphone speech enhancement method can no longer meet the speech quality requirements of such systems. In far-field or noisy environments, for example, a single microphone captures limited information and offers limited noise reduction. A microphone array, by contrast, can exploit the direction of arrival of the speech signal to capture the speech within the beam and suppress signals from other directions, achieving a better noise reduction effect.
The generalized sidelobe canceller (GSC) is widely used as one of the classical beamforming algorithms. However, when the GSC blocks the speech signal to compute the noise reference, multipath propagation in the microphone-array signals causes speech leakage, and directivity errors of the speech signal increase the residual noise component in the captured speech beam. If the residual noise in the fixed beamforming signal is cancelled while this speech leakage is ignored, the speech distortion becomes more severe; if a large noise residue in the fixed beamforming signal is ignored instead, the signal quality is severely degraded.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention discloses a speech enhancement method based on deep learning assisted RLS filtering processing, which effectively reduces the computational load of microphone beam-signal processing, reduces the noise residue in the output signal without increasing distortion, and enhances the speech signal, thereby improving the speech recognition rate.
The speech enhancement method based on deep learning assisted RLS filtering processing disclosed by the invention comprises the following steps:
S1. Process the microphone-array speech signal y(l,k) with a generalized sidelobe cancellation beamforming method to obtain the fixed beamforming output signal y_s(l,k) and the noise reference signal u(l,k), where l and k denote the time and frequency indices, respectively.
S2. Randomly select one microphone channel in the array, extract its feature signal, and feed it into a GRU-Mask network to compute the mask value Mask(l,k) of the original microphone signal.
S3. Compare the mask value Mask(l,k) output by the GRU-Mask network with the noise threshold thred: when Mask(l,k) < thred, generate a noise canceller w_0(l,k), filter the fixed beamforming output signal y_s(l,k), and take the filtered signal as the final output signal; otherwise, take the fixed beamforming output signal y_s(l,k) as the final output signal.
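As a rough illustration of steps S1-S3, the following sketch processes a single time-frequency bin. It is a simplified assumption, not the patent's implementation: the fixed beamformer is reduced to an average over pre-aligned channels, the blocking matrix to adjacent-channel differences, and the GRU-Mask network and RLS canceller are passed in as opaque callables (`gru_mask`, `rls_cancel` are illustrative names):

```python
def enhance_bin(y, gru_mask, rls_cancel, thred=0.5):
    """Enhance one (l, k) bin of an M-channel signal y (list of complex values).

    gru_mask and rls_cancel stand in for the GRU-Mask network and the
    RLS noise canceller; thred is the noise threshold of step S3.
    """
    ys = sum(y) / len(y)                                # S1: fixed beamforming output y_s(l,k)
    u = [y[m] - y[m + 1] for m in range(len(y) - 1)]    # S1: noise references u(l,k)
    mask = gru_mask(y[0])                               # S2: mask value from one channel
    if mask < thred:                                    # S3: noise-dominant bin
        return rls_cancel(ys, u)                        #     -> filter with the noise canceller
    return ys                                           # S3: speech-dominant bin -> pass through
```

When the mask indicates speech dominance, the canceller is never invoked, which is where the computational saving of the method comes from.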
Preferably, the noise canceller w_0(l,k) is computed from the fixed beamforming output signal y_s(l,k) and the noise reference signal u(l,k) with the RLS algorithm, as follows:
K(l,k) = Φ⁻¹(l−1,k) u(l,k) / (λ + u^H(l,k) Φ⁻¹(l−1,k) u(l,k)),
Φ⁻¹(l,k) = λ⁻¹ [Φ⁻¹(l−1,k) − K(l,k) u^H(l,k) Φ⁻¹(l−1,k)],
w_0(l,k) = w_0(l−1,k) + K(l,k) [y_s(l,k) − w_0^H(l−1,k) u(l,k)]^*,
where the superscripts −1, H and * denote the inverse, conjugate transpose and conjugate operations, respectively; the recursion is initialized as Φ⁻¹(0,k) = I and w_0(0,k) = K(0,k) = O_(M−1)×1, where O denotes a zero matrix and M is the number of microphones; K(l,k) is an intermediate variable (the RLS gain vector) and λ is the forgetting factor.
The final filtered output signal s(l,k) is:
s(l,k) = y_s(l,k) − w_0^H(l,k) u(l,k).
Preferably, the GRU-Mask network consists of a preprocessing layer, a speech spectrum estimator, a noise spectrum estimator and a gain simulator.
The preprocessing layer consists of 1 fully-connected layer; the speech spectrum estimator consists of 2 GRU layers; the noise spectrum estimator consists of 1 GRU layer; the gain simulator consists of 1 GRU layer and 1 fully-connected layer.
The invention uses a deep-learning method to determine the dominant component of the signal and applies RLS filtering only to signals in which the noise component dominates, thereby effectively reducing the computational load of microphone beam-signal processing, reducing the noise residue in the output signal without increasing distortion, and achieving the purposes of enhancing the speech signal and improving the speech recognition rate and the human-computer interaction experience.
Drawings
FIG. 1 is a flow chart of the speech enhancement method according to an embodiment of the present invention;
FIG. 2 compares the spectrograms produced by the conventional GSC method and by the present invention in one embodiment; in FIG. 2, the abscissa is time and the ordinate is frequency; part (A1) is the spectrogram processed by the conventional GSC method, and part (A2) is the spectrogram processed by the present method;
FIG. 3 compares the waveforms produced by the conventional GSC method and by the present invention in one embodiment; in FIG. 3, the abscissa is time and the ordinate is amplitude; part (A3) is the waveform processed by the conventional GSC method, and part (A4) is the waveform processed by the present method.
Detailed Description
The following provides a more detailed description of the present invention.
The speech enhancement method based on deep learning assisted RLS filtering processing disclosed by the invention comprises the following steps:
S1. Process the microphone-array speech signal y(l,k) with a generalized sidelobe cancellation beamforming method to obtain the fixed beamforming output signal y_s(l,k) and the noise reference signal u(l,k), where l and k denote the time and frequency indices, respectively.
S2. Randomly select one microphone channel in the array, extract its feature signal, and feed it into a GRU-Mask network to compute the mask value Mask(l,k) of the original microphone signal.
S3. Compare the mask value Mask(l,k) output by the GRU-Mask network with the noise threshold thred: when Mask(l,k) < thred, generate a noise canceller w_0(l,k), filter the fixed beamforming output signal y_s(l,k), and take the filtered signal as the final output signal; otherwise, take the fixed beamforming output signal y_s(l,k) as the final output signal.
As shown in FIG. 1, the specific flow in this embodiment is as follows.
S1. Process the speech signal y(l,k) with the generalized sidelobe cancellation (GSC) beamforming method to obtain the fixed beamforming output signal y_s(l,k) and the noise reference signal u(l,k), where l and k denote the time and frequency indices, respectively.
The GSC beamforming method comprises three parts: a fixed beamformer w_bf(k), a blocking matrix B(k), and a noise canceller w_0(l,k). The microphone speech signal y(l,k) is processed by the fixed beamformer w_bf(k) to obtain the fixed beamforming output signal y_s(l,k), as follows:
y_s(l,k) = w_bf(k) * y(l,k).
The microphone speech signal y(l,k) is passed through the blocking matrix B(k), which removes the useful speech component, to obtain the noise reference signal u(l,k), as follows:
u(l,k) = B(k) * y(l,k).
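The two equations above can be made concrete with a minimal sketch. The choices below are illustrative assumptions, not taken from the patent: the channels are presumed already time-aligned toward the target speaker, so the fixed beamformer w_bf(k) reduces to an average and the blocking matrix B(k) to adjacent-channel differences:

```python
def fixed_beamformer(y):
    """y_s(l,k) = w_bf(k) * y(l,k): here, the average of M aligned channels."""
    return sum(y) / len(y)

def blocking_matrix(y):
    """u(l,k) = B(k) * y(l,k): M-1 adjacent-channel differences that
    cancel the aligned target speech, leaving only noise references."""
    return [y[m] - y[m + 1] for m in range(len(y) - 1)]
```

For identical (pure target) channels the blocking matrix returns zeros, i.e. the speech is completely removed from the noise reference; in practice, multipath propagation breaks this alignment and produces the speech leakage discussed in the background section.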
s2 randomly extracts the characteristic signal of any microphone signal in the microphone array and sends the characteristic signal into the GRU-Mask network to calculate the Mask (l, k) of the original microphone signal. The input of the GRU-Mask network is a characteristic signal of a microphone signal, the output signal is a Mask (l, k), and the Mask (l, k) represents that a leading component in the microphone signal is a voice signal or a noise signal, so that a criterion is provided for whether to perform further filtering processing subsequently.
The GRU-Mask network is composed of a preprocessing layer, a voice spectrum estimator, a noise spectrum estimator and a gain simulator. Wherein the pretreatment layer consists of 1 full-connection layer; the voice spectrum estimator consists of 2 GRU layers; the noise spectrum estimator is composed of 1 GRU layer; the gain simulator is composed of 1 GRU (gated-round unit) layer and 1 fully-connected layer.
The input signal dimension is M, namely the pretreatment layer dimension, the 2 GRU layer dimensions of the voice spectrum estimator are G1 and G2 respectively, the GRU layer dimension of the noise spectrum estimator is G3, and the GRU layer and the full connection layer dimension of the gain simulator are G4 and N respectively.
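For reference, a single step of a gated recurrent unit, the building block of the estimators above, can be sketched with scalar weights. This is the generic GRU recurrence (Cho et al. formulation), not the patent's trained network; the weight names are illustrative:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gru_step(x, h, Wz, Uz, Wr, Ur, Wh, Uh):
    """One scalar GRU update; returns the new hidden state."""
    z = sigmoid(Wz * x + Uz * h)           # update gate
    r = sigmoid(Wr * x + Ur * h)           # reset gate
    hh = math.tanh(Wh * x + Uh * (r * h))  # candidate state
    return (1.0 - z) * h + z * hh          # interpolate old state and candidate
```

The gating lets each layer decide per frame how much of its past state to keep, which is what makes the GRU suitable for tracking slowly varying speech and noise spectra.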
S3. Compare the mask value Mask(l,k) output by the GRU-Mask network with the noise threshold thred: when Mask(l,k) < thred, generate a noise canceller w_0(l,k), filter the fixed beamforming output signal y_s(l,k), and take the filtered signal as the final output signal; otherwise, take the fixed beamforming output signal y_s(l,k) as the final output signal.
The conventional GSC method performs noise cancellation directly after beam processing: an ideal noise canceller w_0(l,k) is designed from the fixed beamforming output signal y_s(l,k) and the noise reference signal u(l,k) to estimate the residual noise in y_s(l,k) and cancel it.
However, because multipath effects cause speech leakage, the noise canceller w_0(l,k) may also remove part of the speech while cancelling the residual noise in y_s(l,k), thereby distorting the signal. The invention therefore uses the GRU-Mask network to compute the mask value Mask(l,k) and compares it with the preset noise threshold thred as the criterion for whether to apply the noise canceller w_0(l,k).
When Mask(l,k) ≥ thred, the signal is dominated by the speech component, the residual noise in the fixed beamforming output signal y_s(l,k) can be ignored, and the noise canceller w_0(l,k) need not be computed. When Mask(l,k) < thred, the signal is dominated by the noise component; the directivity error of the fixed beamforming increases, leaving more residual noise in y_s(l,k), so the noise canceller w_0(l,k) must be computed to cancel the residual noise in y_s(l,k).
The noise canceller w_0(l,k) is computed from the fixed beamforming output signal y_s(l,k) and the noise reference signal u(l,k) with the RLS (recursive least squares) algorithm, as follows:
K(l,k) = Φ⁻¹(l−1,k) u(l,k) / (λ + u^H(l,k) Φ⁻¹(l−1,k) u(l,k)),
Φ⁻¹(l,k) = λ⁻¹ [Φ⁻¹(l−1,k) − K(l,k) u^H(l,k) Φ⁻¹(l−1,k)],
w_0(l,k) = w_0(l−1,k) + K(l,k) [y_s(l,k) − w_0^H(l−1,k) u(l,k)]^*,
where the superscripts −1, H and * denote the inverse, conjugate transpose and conjugate operations, respectively; the recursion is initialized as Φ⁻¹(0,k) = I and w_0(0,k) = K(0,k) = O_(M−1)×1, where O denotes a zero matrix and M is the number of microphones; K(l,k) is an intermediate variable (the RLS gain vector) and λ is the forgetting factor, a positive number less than or equal to 1.
The method computes the noise canceller w_0(l,k) and applies filtering only to the portions of the fixed beamforming output signal y_s(l,k) in which the noise component dominates; the final filtered output signal s(l,k) is:
s(l,k) = y_s(l,k) − w_0^H(l,k) u(l,k).
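For the single-noise-reference case (M = 2), Φ⁻¹(l,k) and w_0(l,k) reduce to scalars and the RLS recursion collapses to a few lines. The sketch below is a real-valued illustration under assumed parameters (λ = 1 and a large initial P in place of Φ⁻¹(0,k) = I), not the patent's complex multichannel implementation:

```python
def rls_canceller(ys_frames, u_frames, lam=1.0, p0=1e3):
    """Scalar RLS: adapt w so that w * u tracks the residual noise in y_s.

    Returns (filtered outputs s = y_s - w * u per frame, final weight w).
    lam is the forgetting factor (a positive number <= 1).
    """
    p, w = p0, 0.0                       # inverse correlation and canceller weight
    out = []
    for ys, u in zip(ys_frames, u_frames):
        k = p * u / (lam + u * p * u)    # RLS gain K(l,k)
        e = ys - w * u                   # a-priori error = output s(l,k)
        w = w + k * e                    # weight update
        p = (p - k * u * p) / lam        # inverse-correlation update
        out.append(e)
    return out, w
```

If the residual noise in y_s is, say, exactly 0.5 times the reference u, the weight converges to 0.5 and the filtered output decays toward zero; note that the output uses the a-priori error, i.e. the weight from the previous frame.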
For the portions of the fixed beamforming output signal y_s(l,k) in which the speech component dominates, no filtering is applied, and the final output signal s(l,k) is:
s(l,k) = y_s(l,k).
Consequently, the noise canceller w_0(l,k) need not be computed, which reduces the computational load and favors real-time processing of the signal. In this way, the noise-dominated portions of the fixed beamforming output signal y_s(l,k) are filtered to reduce noise and improve the signal-to-noise ratio, while the speech-dominated portions of y_s(l,k) are left unfiltered, which effectively reduces the distortion of the final output speech signal s(l,k) caused by speech leakage due to multipath effects. The method thus reduces the computational load and the noise residue in the output signal without increasing distortion.
FIG. 1 shows the flow of the method of the present invention. FIG. 2 compares the spectrograms processed by the conventional GSC method and by the present method: part (A1) is the spectrogram processed by the conventional GSC method and part (A2) the spectrogram processed by the present method; as can be seen from FIG. 2, the speech signal in (A2) retains more of its spectral content, indicating less distortion. FIG. 3 compares the corresponding waveforms: part (A3) is the waveform processed by the conventional GSC method and part (A4) the waveform processed by the present method; as can be seen from FIG. 3, the speech amplitude in (A4) is larger, indicating a better processing effect that is more favorable for subsequent speech recognition.
The foregoing describes preferred embodiments of the present invention. Provided they are not obviously contradictory, the preferred embodiments may be combined in any manner. The specific parameters in the embodiments and examples serve only to clearly illustrate the inventor's verification process and are not intended to limit the scope of the invention, which is defined by the claims; equivalent structural changes made in accordance with the description and drawings of the present invention are likewise included within the scope of the present invention.
Claims (3)
1. A speech enhancement method based on deep learning assisted RLS filtering processing, characterized by comprising the following steps:
S1. processing a microphone-array speech signal y(l,k) with a generalized sidelobe cancellation beamforming method to obtain a fixed beamforming output signal y_s(l,k) and a noise reference signal u(l,k), where l and k denote the time and frequency indices, respectively;
S2. randomly selecting one microphone channel in the microphone array, extracting its feature signal, and feeding it into a GRU-Mask network to compute a mask value Mask(l,k) of the original microphone signal;
S3. comparing the mask value Mask(l,k) output by the GRU-Mask network with a noise threshold thred: when Mask(l,k) < thred, generating a noise canceller w_0(l,k), filtering the fixed beamforming output signal y_s(l,k), and taking the filtered signal as the final output signal; otherwise, taking the fixed beamforming output signal y_s(l,k) as the final output signal.
2. The speech enhancement method based on deep learning assisted RLS filtering processing of claim 1, wherein the noise canceller w_0(l,k) is computed from the fixed beamforming output signal y_s(l,k) and the noise reference signal u(l,k) with the RLS algorithm, as follows:
K(l,k) = Φ⁻¹(l−1,k) u(l,k) / (λ + u^H(l,k) Φ⁻¹(l−1,k) u(l,k)),
Φ⁻¹(l,k) = λ⁻¹ [Φ⁻¹(l−1,k) − K(l,k) u^H(l,k) Φ⁻¹(l−1,k)],
w_0(l,k) = w_0(l−1,k) + K(l,k) [y_s(l,k) − w_0^H(l−1,k) u(l,k)]^*,
where the superscripts −1, H and * denote the inverse, conjugate transpose and conjugate operations, respectively; the recursion is initialized as Φ⁻¹(0,k) = I and w_0(0,k) = K(0,k) = O_(M−1)×1, where O denotes a zero matrix and M is the number of microphones; K(l,k) is an intermediate variable (the RLS gain vector) and λ is the forgetting factor;
the final filtered output signal s(l,k) is:
s(l,k) = y_s(l,k) − w_0^H(l,k) u(l,k).
3. The speech enhancement method based on deep learning assisted RLS filtering processing of claim 1, wherein the GRU-Mask network consists of a preprocessing layer, a speech spectrum estimator, a noise spectrum estimator and a gain simulator;
the preprocessing layer consists of 1 fully-connected layer; the speech spectrum estimator consists of 2 GRU layers; the noise spectrum estimator consists of 1 GRU layer; and the gain simulator consists of 1 GRU layer and 1 fully-connected layer.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111207569.1A CN113658605B (en) | 2021-10-18 | 2021-10-18 | Speech enhancement method based on deep learning assisted RLS filtering processing |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113658605A true CN113658605A (en) | 2021-11-16 |
CN113658605B CN113658605B (en) | 2021-12-17 |
Family
ID=78494575
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111207569.1A Active CN113658605B (en) | 2021-10-18 | 2021-10-18 | Speech enhancement method based on deep learning assisted RLS filtering processing |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113658605B (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101778322A (en) * | 2009-12-07 | 2010-07-14 | 中国科学院自动化研究所 | Microphone array postfiltering sound enhancement method based on multi-models and hearing characteristic |
US20180122403A1 (en) * | 2016-02-16 | 2018-05-03 | Red Pill VR, Inc. | Real-time audio source separation using deep neural networks |
US20190043491A1 (en) * | 2018-05-18 | 2019-02-07 | Intel Corporation | Neural network based time-frequency mask estimation and beamforming for speech pre-processing |
CN109326297A (en) * | 2017-07-31 | 2019-02-12 | 哈曼贝克自动系统股份有限公司 | Self-adaptive post-filtering |
CN111816200A (en) * | 2020-07-01 | 2020-10-23 | 电子科技大学 | Multi-channel speech enhancement method based on time-frequency domain binary mask |
CN113096682A (en) * | 2021-03-20 | 2021-07-09 | 杭州知存智能科技有限公司 | Real-time voice noise reduction method and device based on mask time domain decoder |
Non-Patent Citations (3)
Title |
---|
MOJTABA HASANNEZHAD et al.: "An Integrated CNN-GRU Framework for Complex", Proceedings, APSIPA Annual Summit and Conference 2020 *
CAO Lijing: "A survey of speech enhancement techniques", Journal of Hebei Academy of Sciences *
GUO Xiaobo et al.: "Time-frequency mask estimation beamforming based on neural networks and spatial clustering", Journal of Information Engineering University *
Also Published As
Publication number | Publication date |
---|---|
CN113658605B (en) | 2021-12-17 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108831495B (en) | Speech enhancement method applied to speech recognition in noise environment | |
KR102469516B1 (en) | Method and apparatus for obtaining target voice based on microphone array | |
Meyer et al. | Multi-channel speech enhancement in a car environment using Wiener filtering and spectral subtraction | |
KR101339592B1 (en) | Sound source separator device, sound source separator method, and computer readable recording medium having recorded program | |
CN108922554B (en) | LCMV frequency invariant beam forming speech enhancement algorithm based on logarithmic spectrum estimation | |
Ortega-García et al. | Overview of speech enhancement techniques for automatic speaker recognition | |
US8000482B2 (en) | Microphone array processing system for noisy multipath environments | |
EP3866165B1 (en) | Method for enhancing telephone speech signals based on convolutional neural networks | |
US11373667B2 (en) | Real-time single-channel speech enhancement in noisy and time-varying environments | |
CN108447496B (en) | Speech enhancement method and device based on microphone array | |
CN112530451A (en) | Speech enhancement method based on denoising autoencoder | |
CN110827847A (en) | Microphone array voice denoising and enhancing method with low signal-to-noise ratio and remarkable growth | |
CN112331226A (en) | Voice enhancement system and method for active noise reduction system | |
JP2011203414A (en) | Noise and reverberation suppressing device and method therefor | |
Hashemgeloogerdi et al. | Joint beamforming and reverberation cancellation using a constrained Kalman filter with multichannel linear prediction | |
CN113658605B (en) | Speech enhancement method based on deep learning assisted RLS filtering processing | |
Nagata et al. | Speech enhancement based on auto gain control | |
CN113362846B (en) | Voice enhancement method based on generalized sidelobe cancellation structure | |
Aichner et al. | Post-processing for convolutive blind source separation | |
CN111933169B (en) | Voice noise reduction method for secondarily utilizing voice existence probability | |
CN114242104A (en) | Method, device and equipment for voice noise reduction and storage medium | |
CN114724574A (en) | Double-microphone noise reduction method with adjustable expected sound source direction | |
CN111210836A (en) | Dynamic adjustment method for microphone array beam forming | |
Nakatani et al. | Reduction of Highly Nonstationary Ambient Noise by Integrating Spectral and Locational Characteristics of Speech and Noise for Robust ASR. | |
Kothapally et al. | Monaural Speech Dereverberation using Deformable Convolutional Networks |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||