CN113658605A - Speech enhancement method based on deep learning assisted RLS filtering processing - Google Patents
- Publication number: CN113658605A (application CN202111207569.1A)
- Authority
- CN
- China
- Prior art keywords
- signal
- noise
- output signal
- mask
- gru
- Prior art date
- Legal status: Granted
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
- G10L21/0216—Noise filtering characterised by the method used for estimating noise
- G10L2021/02161—Number of inputs available containing the signal or the noise to be suppressed
- G10L2021/02166—Microphone arrays; Beamforming
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Quality & Reliability (AREA)
- Signal Processing (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Circuit For Audible Band Transducer (AREA)
Abstract
A speech enhancement method based on deep learning assisted RLS filtering processing comprises the following steps. S1: process the microphone-array speech signal with a generalized sidelobe cancellation beamforming method to obtain a fixed beamforming output signal and a noise reference signal. S2: randomly select one microphone channel in the array, extract its feature signal, and feed it into a GRU-Mask network to compute a mask value for the original microphone signal. S3: compare the mask value output by the network with a noise threshold; when noise dominates, compute a noise canceller and use it to cancel the residual noise. Because the RLS algorithm filters only the signals in which the noise component dominates, the invention effectively reduces the computational load of microphone beam-signal processing and reduces the noise residue in the output signal without increasing distortion, thereby enhancing the speech signal and improving the speech recognition rate and the human-computer interaction experience.
Description
Technical Field
The invention belongs to the technical field of artificial intelligence, relates to speech recognition, and particularly relates to a speech enhancement method based on deep learning assisted RLS filtering processing.
Background
With the wide application of voice interaction technology, the traditional single-microphone speech enhancement method can no longer meet the speech quality requirements of such systems. In far-field or noisy environments, for example, a single microphone captures limited information and offers limited noise reduction. A microphone array, by contrast, can exploit the direction of arrival of the speech signal to capture the speech within the beam and suppress signals from other directions, achieving a better noise reduction effect.
The generalized sidelobe canceller (GSC) is widely used as one of the classical beamforming algorithms. However, when the GSC blocks the speech signal to compute the noise reference, multipath propagation in the microphone-array signals causes speech leakage, and directivity errors of the speech signal increase the residual noise component in the captured speech beam. If the residual noise in the fixed beamforming signal is cancelled while this speech leakage is ignored, the speech distortion becomes more severe; if a large noise residue in the fixed beamforming signal is ignored instead, the signal quality is severely degraded.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention discloses a speech enhancement method based on deep learning assisted RLS filtering processing, which effectively reduces the computational load of microphone beam-signal processing, reduces the noise residue in the output signal without increasing distortion, and enhances the speech signal, thereby improving the speech recognition rate.
The speech enhancement method based on deep learning assisted RLS filtering processing disclosed by the invention comprises the following steps:
S1. Process the microphone-array speech signal y(l,k) with a generalized sidelobe cancellation beamforming method to obtain the fixed beamforming output signal y_s(l,k) and the noise reference signal u(l,k), where l and k denote the time and frequency indices, respectively.
S2. Randomly select one microphone channel in the array, extract its feature signal, and feed it into a GRU-Mask network to compute the mask value Mask(l,k) of the original microphone signal.
S3. Compare the mask value Mask(l,k) output by the GRU-Mask network with the noise threshold thred: when Mask(l,k) < thred, generate a noise canceller w_0(l,k), filter the fixed beamforming output signal y_s(l,k), and take the filtered signal as the final output signal; otherwise, take the fixed beamforming output signal y_s(l,k) as the final output signal.
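As a rough illustration of steps S1-S3, the following sketch processes a single time-frequency bin. It is a simplified assumption, not the patent's implementation: the fixed beamformer is reduced to an average over pre-aligned channels, the blocking matrix to adjacent-channel differences, and the GRU-Mask network and RLS canceller are passed in as opaque callables (`gru_mask`, `rls_cancel` are illustrative names):

```python
def enhance_bin(y, gru_mask, rls_cancel, thred=0.5):
    """Enhance one (l, k) bin of an M-channel signal y (list of complex values).

    gru_mask and rls_cancel stand in for the GRU-Mask network and the
    RLS noise canceller; thred is the noise threshold of step S3.
    """
    ys = sum(y) / len(y)                                # S1: fixed beamforming output y_s(l,k)
    u = [y[m] - y[m + 1] for m in range(len(y) - 1)]    # S1: noise references u(l,k)
    mask = gru_mask(y[0])                               # S2: mask value from one channel
    if mask < thred:                                    # S3: noise-dominant bin
        return rls_cancel(ys, u)                        #     -> filter with the noise canceller
    return ys                                           # S3: speech-dominant bin -> pass through
```

When the mask indicates speech dominance, the canceller is never invoked, which is where the computational saving of the method comes from.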
Preferably, the noise canceller w_0(l,k) is computed from the fixed beamforming output signal y_s(l,k) and the noise reference signal u(l,k) with the RLS algorithm, as follows:
K(l,k) = Φ⁻¹(l−1,k) u(l,k) / (λ + u^H(l,k) Φ⁻¹(l−1,k) u(l,k)),
Φ⁻¹(l,k) = λ⁻¹ [Φ⁻¹(l−1,k) − K(l,k) u^H(l,k) Φ⁻¹(l−1,k)],
w_0(l,k) = w_0(l−1,k) + K(l,k) [y_s(l,k) − w_0^H(l−1,k) u(l,k)]^*,
where the superscripts −1, H and * denote the inverse, conjugate transpose and conjugate operations, respectively; the recursion is initialized as Φ⁻¹(0,k) = I and w_0(0,k) = K(0,k) = O_(M−1)×1, where O denotes a zero matrix and M is the number of microphones; K(l,k) is an intermediate variable (the RLS gain vector) and λ is the forgetting factor.
The final filtered output signal s(l,k) is:
s(l,k) = y_s(l,k) − w_0^H(l,k) u(l,k).
Preferably, the GRU-Mask network consists of a preprocessing layer, a speech spectrum estimator, a noise spectrum estimator and a gain simulator.
The preprocessing layer consists of 1 fully-connected layer; the speech spectrum estimator consists of 2 GRU layers; the noise spectrum estimator consists of 1 GRU layer; the gain simulator consists of 1 GRU layer and 1 fully-connected layer.
The invention uses a deep-learning method to determine the dominant component of the signal and applies RLS filtering only to signals in which the noise component dominates, thereby effectively reducing the computational load of microphone beam-signal processing, reducing the noise residue in the output signal without increasing distortion, and achieving the purposes of enhancing the speech signal and improving the speech recognition rate and the human-computer interaction experience.
Drawings
FIG. 1 is a flow chart of the speech enhancement method according to an embodiment of the present invention;
FIG. 2 compares the spectrograms produced by the conventional GSC method and by the present invention in one embodiment; in FIG. 2, the abscissa is time and the ordinate is frequency; part (A1) is the spectrogram processed by the conventional GSC method, and part (A2) is the spectrogram processed by the present method;
FIG. 3 compares the waveforms produced by the conventional GSC method and by the present invention in one embodiment; in FIG. 3, the abscissa is time and the ordinate is amplitude; part (A3) is the waveform processed by the conventional GSC method, and part (A4) is the waveform processed by the present method.
Detailed Description
The following provides a more detailed description of the present invention.
The speech enhancement method based on deep learning assisted RLS filtering processing disclosed by the invention comprises the following steps:
S1. Process the microphone-array speech signal y(l,k) with a generalized sidelobe cancellation beamforming method to obtain the fixed beamforming output signal y_s(l,k) and the noise reference signal u(l,k), where l and k denote the time and frequency indices, respectively.
S2. Randomly select one microphone channel in the array, extract its feature signal, and feed it into a GRU-Mask network to compute the mask value Mask(l,k) of the original microphone signal.
S3. Compare the mask value Mask(l,k) output by the GRU-Mask network with the noise threshold thred: when Mask(l,k) < thred, generate a noise canceller w_0(l,k), filter the fixed beamforming output signal y_s(l,k), and take the filtered signal as the final output signal; otherwise, take the fixed beamforming output signal y_s(l,k) as the final output signal.
As shown in FIG. 1, the specific flow in this embodiment is as follows.
S1. Process the speech signal y(l,k) with the generalized sidelobe cancellation (GSC) beamforming method to obtain the fixed beamforming output signal y_s(l,k) and the noise reference signal u(l,k), where l and k denote the time and frequency indices, respectively.
The GSC beamforming method comprises three parts: a fixed beamformer w_bf(k), a blocking matrix B(k), and a noise canceller w_0(l,k). The microphone speech signal y(l,k) is processed by the fixed beamformer w_bf(k) to obtain the fixed beamforming output signal y_s(l,k), as follows:
y_s(l,k) = w_bf(k) * y(l,k).
The microphone speech signal y(l,k) is passed through the blocking matrix B(k), which removes the useful speech component, to obtain the noise reference signal u(l,k), as follows:
u(l,k) = B(k) * y(l,k).
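The two equations above can be made concrete with a minimal sketch. The choices below are illustrative assumptions, not taken from the patent: the channels are presumed already time-aligned toward the target speaker, so the fixed beamformer w_bf(k) reduces to an average and the blocking matrix B(k) to adjacent-channel differences:

```python
def fixed_beamformer(y):
    """y_s(l,k) = w_bf(k) * y(l,k): here, the average of M aligned channels."""
    return sum(y) / len(y)

def blocking_matrix(y):
    """u(l,k) = B(k) * y(l,k): M-1 adjacent-channel differences that
    cancel the aligned target speech, leaving only noise references."""
    return [y[m] - y[m + 1] for m in range(len(y) - 1)]
```

For identical (pure target) channels the blocking matrix returns zeros, i.e. the speech is completely removed from the noise reference; in practice, multipath propagation breaks this alignment and produces the speech leakage discussed in the background section.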
s2 randomly extracts the characteristic signal of any microphone signal in the microphone array and sends the characteristic signal into the GRU-Mask network to calculate the Mask (l, k) of the original microphone signal. The input of the GRU-Mask network is a characteristic signal of a microphone signal, the output signal is a Mask (l, k), and the Mask (l, k) represents that a leading component in the microphone signal is a voice signal or a noise signal, so that a criterion is provided for whether to perform further filtering processing subsequently.
The GRU-Mask network is composed of a preprocessing layer, a voice spectrum estimator, a noise spectrum estimator and a gain simulator. Wherein the pretreatment layer consists of 1 full-connection layer; the voice spectrum estimator consists of 2 GRU layers; the noise spectrum estimator is composed of 1 GRU layer; the gain simulator is composed of 1 GRU (gated-round unit) layer and 1 fully-connected layer.
The input signal dimension is M, namely the pretreatment layer dimension, the 2 GRU layer dimensions of the voice spectrum estimator are G1 and G2 respectively, the GRU layer dimension of the noise spectrum estimator is G3, and the GRU layer and the full connection layer dimension of the gain simulator are G4 and N respectively.
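For reference, a single step of a gated recurrent unit, the building block of the estimators above, can be sketched with scalar weights. This is the generic GRU recurrence (Cho et al. formulation), not the patent's trained network; the weight names are illustrative:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gru_step(x, h, Wz, Uz, Wr, Ur, Wh, Uh):
    """One scalar GRU update; returns the new hidden state."""
    z = sigmoid(Wz * x + Uz * h)           # update gate
    r = sigmoid(Wr * x + Ur * h)           # reset gate
    hh = math.tanh(Wh * x + Uh * (r * h))  # candidate state
    return (1.0 - z) * h + z * hh          # interpolate old state and candidate
```

The gating lets each layer decide per frame how much of its past state to keep, which is what makes the GRU suitable for tracking slowly varying speech and noise spectra.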
S3. Compare the mask value Mask(l,k) output by the GRU-Mask network with the noise threshold thred: when Mask(l,k) < thred, generate a noise canceller w_0(l,k), filter the fixed beamforming output signal y_s(l,k), and take the filtered signal as the final output signal; otherwise, take the fixed beamforming output signal y_s(l,k) as the final output signal.
The conventional GSC method performs noise cancellation directly after beam processing: an ideal noise canceller w_0(l,k) is designed from the fixed beamforming output signal y_s(l,k) and the noise reference signal u(l,k) to estimate the residual noise in y_s(l,k) and cancel it.
However, because multipath effects cause speech leakage, the noise canceller w_0(l,k) may also remove part of the speech while cancelling the residual noise in y_s(l,k), thereby distorting the signal. The invention therefore uses the GRU-Mask network to compute the mask value Mask(l,k) and compares it with the preset noise threshold thred as the criterion for whether to apply the noise canceller w_0(l,k).
When Mask(l,k) ≥ thred, the signal is dominated by the speech component, the residual noise in the fixed beamforming output signal y_s(l,k) can be ignored, and the noise canceller w_0(l,k) need not be computed. When Mask(l,k) < thred, the signal is dominated by the noise component; the directivity error of the fixed beamforming increases, leaving more residual noise in y_s(l,k), so the noise canceller w_0(l,k) must be computed to cancel the residual noise in y_s(l,k).
The noise canceller w_0(l,k) is computed from the fixed beamforming output signal y_s(l,k) and the noise reference signal u(l,k) with the RLS (recursive least squares) algorithm, as follows:
K(l,k) = Φ⁻¹(l−1,k) u(l,k) / (λ + u^H(l,k) Φ⁻¹(l−1,k) u(l,k)),
Φ⁻¹(l,k) = λ⁻¹ [Φ⁻¹(l−1,k) − K(l,k) u^H(l,k) Φ⁻¹(l−1,k)],
w_0(l,k) = w_0(l−1,k) + K(l,k) [y_s(l,k) − w_0^H(l−1,k) u(l,k)]^*,
where the superscripts −1, H and * denote the inverse, conjugate transpose and conjugate operations, respectively; the recursion is initialized as Φ⁻¹(0,k) = I and w_0(0,k) = K(0,k) = O_(M−1)×1, where O denotes a zero matrix and M is the number of microphones; K(l,k) is an intermediate variable (the RLS gain vector) and λ is the forgetting factor, a positive number less than or equal to 1.
The method computes the noise canceller w_0(l,k) and applies filtering only to the portions of the fixed beamforming output signal y_s(l,k) in which the noise component dominates; the final filtered output signal s(l,k) is:
s(l,k) = y_s(l,k) − w_0^H(l,k) u(l,k).
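For the single-noise-reference case (M = 2), Φ⁻¹(l,k) and w_0(l,k) reduce to scalars and the RLS recursion collapses to a few lines. The sketch below is a real-valued illustration under assumed parameters (λ = 1 and a large initial P in place of Φ⁻¹(0,k) = I), not the patent's complex multichannel implementation:

```python
def rls_canceller(ys_frames, u_frames, lam=1.0, p0=1e3):
    """Scalar RLS: adapt w so that w * u tracks the residual noise in y_s.

    Returns (filtered outputs s = y_s - w * u per frame, final weight w).
    lam is the forgetting factor (a positive number <= 1).
    """
    p, w = p0, 0.0                       # inverse correlation and canceller weight
    out = []
    for ys, u in zip(ys_frames, u_frames):
        k = p * u / (lam + u * p * u)    # RLS gain K(l,k)
        e = ys - w * u                   # a-priori error = output s(l,k)
        w = w + k * e                    # weight update
        p = (p - k * u * p) / lam        # inverse-correlation update
        out.append(e)
    return out, w
```

If the residual noise in y_s is, say, exactly 0.5 times the reference u, the weight converges to 0.5 and the filtered output decays toward zero; note that the output uses the a-priori error, i.e. the weight from the previous frame.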
For the portions of the fixed beamforming output signal y_s(l,k) in which the speech component dominates, no filtering is applied, and the final output signal s(l,k) is:
s(l,k) = y_s(l,k).
Consequently, the noise canceller w_0(l,k) need not be computed, which reduces the computational load and favors real-time processing of the signal. In this way, the noise-dominated portions of the fixed beamforming output signal y_s(l,k) are filtered to reduce noise and improve the signal-to-noise ratio, while the speech-dominated portions of y_s(l,k) are left unfiltered, which effectively reduces the distortion of the final output speech signal s(l,k) caused by speech leakage due to multipath effects. The method thus reduces the computational load and the noise residue in the output signal without increasing distortion.
FIG. 1 shows the flow of the method of the present invention. FIG. 2 compares the spectrograms processed by the conventional GSC method and by the present method: part (A1) is the spectrogram processed by the conventional GSC method and part (A2) the spectrogram processed by the present method; as can be seen from FIG. 2, the speech signal in (A2) retains more of its spectral content, indicating less distortion. FIG. 3 compares the corresponding waveforms: part (A3) is the waveform processed by the conventional GSC method and part (A4) the waveform processed by the present method; as can be seen from FIG. 3, the speech amplitude in (A4) is larger, indicating a better processing effect that is more favorable for subsequent speech recognition.
The foregoing describes preferred embodiments of the present invention. Provided they are not obviously contradictory, the preferred embodiments may be combined in any manner. The specific parameters in the embodiments and examples serve only to clearly illustrate the inventor's verification process and are not intended to limit the scope of the invention, which is defined by the claims; equivalent structural changes made in accordance with the description and drawings of the present invention are likewise included within the scope of the present invention.
Claims (3)
1. A speech enhancement method based on deep learning assisted RLS filtering processing, characterized by comprising the following steps:
S1. processing a microphone-array speech signal y(l,k) with a generalized sidelobe cancellation beamforming method to obtain a fixed beamforming output signal y_s(l,k) and a noise reference signal u(l,k), where l and k denote the time and frequency indices, respectively;
S2. randomly selecting one microphone channel in the microphone array, extracting its feature signal, and feeding it into a GRU-Mask network to compute a mask value Mask(l,k) of the original microphone signal;
S3. comparing the mask value Mask(l,k) output by the GRU-Mask network with a noise threshold thred: when Mask(l,k) < thred, generating a noise canceller w_0(l,k), filtering the fixed beamforming output signal y_s(l,k), and taking the filtered signal as the final output signal; otherwise, taking the fixed beamforming output signal y_s(l,k) as the final output signal.
2. The speech enhancement method based on deep learning assisted RLS filtering processing of claim 1, wherein the noise canceller w_0(l,k) is computed from the fixed beamforming output signal y_s(l,k) and the noise reference signal u(l,k) with the RLS algorithm, as follows:
K(l,k) = Φ⁻¹(l−1,k) u(l,k) / (λ + u^H(l,k) Φ⁻¹(l−1,k) u(l,k)),
Φ⁻¹(l,k) = λ⁻¹ [Φ⁻¹(l−1,k) − K(l,k) u^H(l,k) Φ⁻¹(l−1,k)],
w_0(l,k) = w_0(l−1,k) + K(l,k) [y_s(l,k) − w_0^H(l−1,k) u(l,k)]^*,
where the superscripts −1, H and * denote the inverse, conjugate transpose and conjugate operations, respectively; the recursion is initialized as Φ⁻¹(0,k) = I and w_0(0,k) = K(0,k) = O_(M−1)×1, where O denotes a zero matrix and M is the number of microphones; K(l,k) is an intermediate variable (the RLS gain vector) and λ is the forgetting factor;
the final filtered output signal s(l,k) is:
s(l,k) = y_s(l,k) − w_0^H(l,k) u(l,k).
3. The speech enhancement method based on deep learning assisted RLS filtering processing of claim 1, wherein the GRU-Mask network consists of a preprocessing layer, a speech spectrum estimator, a noise spectrum estimator and a gain simulator;
the preprocessing layer consists of 1 fully-connected layer; the speech spectrum estimator consists of 2 GRU layers; the noise spectrum estimator consists of 1 GRU layer; and the gain simulator consists of 1 GRU layer and 1 fully-connected layer.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111207569.1A CN113658605B (en) | 2021-10-18 | 2021-10-18 | Speech enhancement method based on deep learning assisted RLS filtering processing |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113658605A true CN113658605A (en) | 2021-11-16 |
CN113658605B CN113658605B (en) | 2021-12-17 |
Family
ID=78494575
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111207569.1A Active CN113658605B (en) | 2021-10-18 | 2021-10-18 | Speech enhancement method based on deep learning assisted RLS filtering processing |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113658605B (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101778322A (en) * | 2009-12-07 | 2010-07-14 | 中国科学院自动化研究所 | Microphone array postfiltering sound enhancement method based on multi-models and hearing characteristic |
US20180122403A1 (en) * | 2016-02-16 | 2018-05-03 | Red Pill VR, Inc. | Real-time audio source separation using deep neural networks |
US20190043491A1 (en) * | 2018-05-18 | 2019-02-07 | Intel Corporation | Neural network based time-frequency mask estimation and beamforming for speech pre-processing |
CN109326297A (en) * | 2017-07-31 | 2019-02-12 | 哈曼贝克自动系统股份有限公司 | Self-adaptive post-filtering |
CN111816200A (en) * | 2020-07-01 | 2020-10-23 | 电子科技大学 | Multi-channel speech enhancement method based on time-frequency domain binary mask |
CN113096682A (en) * | 2021-03-20 | 2021-07-09 | 杭州知存智能科技有限公司 | Real-time voice noise reduction method and device based on mask time domain decoder |
Non-Patent Citations (3)
Title |
---|
MOJTABA HASANNEZHAD et al.: "An Integrated CNN-GRU Framework for Complex", Proceedings, APSIPA Annual Summit and Conference 2020 *
CAO Lijing: "A survey of speech enhancement techniques", Journal of Hebei Academy of Sciences *
GUO Xiaobo et al.: "Time-frequency mask estimation beamforming based on neural networks and spatial clustering", Journal of Information Engineering University *
Also Published As
Publication number | Publication date |
---|---|
CN113658605B (en) | 2021-12-17 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108831495B (en) | Speech enhancement method applied to speech recognition in noise environment | |
KR102469516B1 (en) | Method and apparatus for obtaining target voice based on microphone array | |
Meyer et al. | Multi-channel speech enhancement in a car environment using Wiener filtering and spectral subtraction | |
KR101339592B1 (en) | Sound source separator device, sound source separator method, and computer readable recording medium having recorded program | |
CN108922554B (en) | LCMV frequency invariant beam forming speech enhancement algorithm based on logarithmic spectrum estimation | |
Ortega-García et al. | Overview of speech enhancement techniques for automatic speaker recognition | |
US8000482B2 (en) | Microphone array processing system for noisy multipath environments | |
EP3866165B1 (en) | Method for enhancing telephone speech signals based on convolutional neural networks | |
US11373667B2 (en) | Real-time single-channel speech enhancement in noisy and time-varying environments | |
CN108447496B (en) | Speech enhancement method and device based on microphone array | |
CN112530451A (en) | Speech enhancement method based on denoising autoencoder | |
CN110827847A (en) | Microphone array voice denoising and enhancing method with low signal-to-noise ratio and remarkable growth | |
CN112331226A (en) | Voice enhancement system and method for active noise reduction system | |
JP2011203414A (en) | Noise and reverberation suppressing device and method therefor | |
Hashemgeloogerdi et al. | Joint beamforming and reverberation cancellation using a constrained Kalman filter with multichannel linear prediction | |
CN113658605B (en) | Speech enhancement method based on deep learning assisted RLS filtering processing | |
Nagata et al. | Speech enhancement based on auto gain control | |
CN113362846B (en) | Voice enhancement method based on generalized sidelobe cancellation structure | |
Aichner et al. | Post-processing for convolutive blind source separation | |
CN111933169B (en) | Voice noise reduction method for secondarily utilizing voice existence probability | |
CN114242104A (en) | Method, device and equipment for voice noise reduction and storage medium | |
CN114724574A (en) | Double-microphone noise reduction method with adjustable expected sound source direction | |
CN111210836A (en) | Dynamic adjustment method for microphone array beam forming | |
Nakatani et al. | Reduction of Highly Nonstationary Ambient Noise by Integrating Spectral and Locational Characteristics of Speech and Noise for Robust ASR. | |
Kothapally et al. | Monaural Speech Dereverberation using Deformable Convolutional Networks |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||