CN114999519A - Voice real-time noise reduction method and system based on double transformation

Info

Publication number: CN114999519A
Application number: CN202210838874.9A
Authority: CN
Inventors: 唐镇坤; 潘伟; 吴庆耀; 钟佳; 王琅
Original assignee: China Post Consumer Finance Co ltd
Current assignee: China Post Consumer Finance Co ltd
Priority date: 2022-07-18
Filing date: 2022-07-18
Publication date: 2022-09-02

Abstract

The invention relates to a real-time voice noise reduction method and system based on double transformation. The method comprises the following steps: framing the voice signal and performing a short-time Fourier transform to obtain a time-frequency signal; masking the time-frequency signal to enhance it; performing an inverse Fourier transform to obtain a time-domain signal; masking the time-domain signal to enhance it, and then performing a one-dimensional convolution operation; and reconstructing a waveform signal through overlap-add. Through two cascaded transformations, the voice signal is first transformed by a short-time Fourier transform into a time-frequency domain signal and masked to obtain a clean amplitude spectrum; the signal is then transformed a second time, into the time domain, and masked again to obtain the final clean voice signal.

Description

Voice real-time noise reduction method and system based on double transformation

Technical Field

The invention relates to the technical field of software development, in particular to a voice real-time noise reduction method and system based on double transformation.

Background

With the continuous development of internet technology, people can broadcast live, hold conferences, or converse at any time and place through a mobile phone. In the process, voice signals are often disturbed by noise from the surrounding environment, which lowers the quality of the voice signal, makes the audio less intelligible, and affects daily communication. To improve the quality of a voice signal, single-channel speech enhancement is generally used for noise reduction, but existing noise reduction techniques cannot handle non-stationary noise. In addition, single-channel noise reduction methods usually process only the amplitude spectrum of the signal and retain the original noisy phase, so the quality of the resulting noise-reduced signal is poor.

Disclosure of Invention

Therefore, it is necessary to provide a real-time speech noise reduction method and system based on double transformation with a better noise reduction effect.

An embodiment of the invention provides a real-time speech noise reduction method based on double transformation, comprising the following steps:

S1: framing the voice signal and performing a short-time Fourier transform to obtain a time-frequency signal;

S2: masking the time-frequency signal to enhance and purify the time-frequency signal;

S3: performing an inverse Fourier transform on the enhanced time-frequency signal to obtain a time-domain signal;

S4: masking the time-domain signal to enhance and purify the time-domain signal;

S5: performing a one-dimensional convolution operation on the enhanced time-domain signal;

S6: reconstructing the waveform signal by overlap-add.

Preferably, in step S1, when the speech signal is framed, a frame length of 25-35 ms and a frame shift of 5-10 ms are used.

Preferably, in step S1, when the speech signal is framed, a frame length of 32 ms and a frame shift of 8 ms are used.
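The framing in step S1 can be sketched as follows. This is a minimal illustration rather than the patented implementation: the 16 kHz sample rate, the function name `frame_signal`, and the choice to drop trailing samples that do not fill a frame are all assumptions, since the text specifies only the 32 ms frame length and 8 ms frame shift.

```python
import numpy as np

def frame_signal(x, sample_rate=16000, frame_ms=32, shift_ms=8):
    """Split a 1-D signal into overlapping frames.

    sample_rate is an assumption (the text gives only durations).
    Trailing samples that do not fill a whole frame are dropped.
    """
    frame_len = sample_rate * frame_ms // 1000   # 512 samples at 16 kHz
    shift = sample_rate * shift_ms // 1000       # 128 samples at 16 kHz
    if len(x) < frame_len:
        return np.empty((0, frame_len), dtype=x.dtype)
    n_frames = 1 + (len(x) - frame_len) // shift
    # Row i holds samples [i * shift, i * shift + frame_len).
    idx = np.arange(frame_len)[None, :] + shift * np.arange(n_frames)[:, None]
    return x[idx]

# One second of a ramp signal, so frame contents are easy to inspect.
frames = frame_signal(np.arange(16000, dtype=np.float32))
print(frames.shape)  # (122, 512)
```

With these values, consecutive frames overlap by 75 percent, so every sample in the steady state is covered by four frames.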

Preferably, the short-time Fourier transform employs the following equation:

$$\hat{X} = M \odot Y \odot e^{j\varphi_y}$$

where $\hat{X}$ is the predicted clean time-frequency signal, Y represents the amplitude component of the mixed speech signal y after the short-time Fourier transform, M is a mask applied to Y with values between 0 and 1, and $\varphi_y$ represents the phase after the short-time Fourier transform; the clean audio is predicted by preserving the phase of the mixed speech.

Preferably, in step S2, the masking performed on the time-frequency signal by the first partial encoder includes the following steps: the mixed amplitude spectrum Y is passed through a GRU network, a fully connected layer and a Sigmoid layer to obtain a mask M, and the mask M is multiplied by Y to obtain an estimated amplitude spectrum $\hat{Y}$; the expression is as follows:

$$\hat{Y} = M \odot Y$$

Preferably, in step S3, the estimated amplitude spectrum $\hat{Y}$ and the original phase $\varphi_y$ are subjected to an inverse Fourier transform to obtain a time-domain signal, which is not yet synthesized into a waveform signal.
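Steps S2 and S3 for a single frame can be sketched as below. The sketch assumes the mask is already available (in the method it comes from the trained first partial encoder); the function name `mask_magnitude` and the use of NumPy's real FFT are illustrative choices, not details from the text.

```python
import numpy as np

def mask_magnitude(frame, mask):
    """Mask the magnitude of one frame and resynthesize with the noisy phase.

    frame: real-valued time-domain frame of length N.
    mask:  values in [0, 1], one per rfft bin (N // 2 + 1 bins).
    Returns a time-domain frame of length N (not yet overlap-added).
    """
    spec = np.fft.rfft(frame)
    magnitude = np.abs(spec)   # amplitude spectrum Y
    phase = np.angle(spec)     # phase of the mixed speech, kept unchanged
    estimated = mask * magnitude
    return np.fft.irfft(estimated * np.exp(1j * phase), n=len(frame))

rng = np.random.default_rng(0)
noisy_frame = rng.normal(size=512)
# An all-ones mask leaves the frame unchanged; an all-zeros mask silences it.
unchanged = mask_magnitude(noisy_frame, np.ones(257))
silenced = mask_magnitude(noisy_frame, np.zeros(257))
```

Because the phase is copied from the noisy input, this stage on its own cannot correct phase errors; that is the motivation the text gives for the second, time-domain stage.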

Preferably, after step S3 and before step S4, the following steps are also required: after channel normalization, the time-domain signal passes through a GRU network, a fully connected layer and a Sigmoid layer to obtain a time-domain mask M, and the mask M is multiplied by the framed time-domain signal $x$ to obtain an estimated time-domain signal $\hat{x}$; the expression is as follows:

$$\hat{x} = M \odot x$$
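The normalization and mask-estimation chain described above can be sketched in plain NumPy. This is an untrained illustration under several assumptions: a single GRU layer is used for brevity (Example 1 describes two), the weights are random, the hidden size of 128 is arbitrary, and the names `channel_norm` and `GRUMask` are hypothetical. It shows only the data flow: normalize, update the recurrent state, project, squash to (0, 1), multiply.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def channel_norm(x, eps=1e-7):
    """Normalize one frame across its channels to zero mean, unit variance."""
    return (x - x.mean()) / np.sqrt(x.var() + eps)

class GRUMask:
    """One GRU layer + fully connected layer + Sigmoid (random, untrained weights)."""

    def __init__(self, size, hidden, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(0.0, 0.1, (3, hidden, size))    # input weights (z, r, candidate)
        self.U = rng.normal(0.0, 0.1, (3, hidden, hidden))  # recurrent weights
        self.b = np.zeros((3, hidden))
        self.W_out = rng.normal(0.0, 0.1, (size, hidden))   # fully connected layer
        self.b_out = np.zeros(size)
        self.h = np.zeros(hidden)                            # state carried across frames

    def step(self, x):
        z = sigmoid(self.W[0] @ x + self.U[0] @ self.h + self.b[0])   # update gate
        r = sigmoid(self.W[1] @ x + self.U[1] @ self.h + self.b[1])   # reset gate
        cand = np.tanh(self.W[2] @ x + self.U[2] @ (r * self.h) + self.b[2])
        self.h = (1.0 - z) * self.h + z * cand
        return sigmoid(self.W_out @ self.h + self.b_out)              # mask in (0, 1)

frame = np.random.default_rng(1).normal(size=512)
masker = GRUMask(size=512, hidden=128)
mask = masker.step(channel_norm(frame))
enhanced = mask * frame   # mask applied to the un-normalized framed signal
```

Keeping the hidden state `h` inside the object is what lets the mask for the current frame depend on earlier frames without reprocessing them.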

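Preferably, in step S5, a one-dimensional convolution converts the number of channels into the length of one frame, and the waveform is then reconstructed using the overlap-add technique; the expression is as follows:

$$\hat{s} = \operatorname{OverlapAdd}\big(\operatorname{Conv1D}(\hat{x})\big)$$

where $\hat{x}$ is the enhanced time-domain signal from step S4 and $\hat{s}$ is the reconstructed waveform.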
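A minimal sketch of the channel-to-frame projection and overlap-add follows. The text does not give the convolution's kernel size; the sketch assumes a kernel size of 1, in which case the convolution reduces to the same linear map applied to every frame. The names `decode_frames` and `overlap_add` and the uniform weights are illustrative only.

```python
import numpy as np

def decode_frames(features, weights):
    """Map (n_frames, channels) features to (n_frames, frame_len) frames.

    Assumes a kernel-size-1 one-dimensional convolution, which is
    equivalent to one linear map shared by all frames.
    """
    return features @ weights.T

def overlap_add(frames, shift):
    """Sum frames into one waveform, each frame offset by `shift` samples."""
    n_frames, frame_len = frames.shape
    out = np.zeros(shift * (n_frames - 1) + frame_len)
    for i, frame in enumerate(frames):
        out[i * shift : i * shift + frame_len] += frame
    return out

# Toy input: 4 frames of 256 channels, projected to 512-sample frames.
features = np.ones((4, 256))
weights = np.full((512, 256), 1.0 / 256.0)   # makes every decoded sample equal 1
frames = decode_frames(features, weights)
waveform = overlap_add(frames, shift=128)
```

In this toy case the middle of the waveform reaches 4 because four frames overlap there; in a trained system the learned convolution is expected to account for that gain.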

The invention also provides a voice real-time noise reduction system, which comprises:

a framing module, used for framing the voice signal;

a short-time Fourier module, used for obtaining a time-frequency signal;

a first partial encoder, used for masking the time-frequency signal to enhance and purify the time-frequency signal;

an inverse Fourier transform module, used for performing an inverse Fourier transform on the enhanced time-frequency signal to obtain a time-domain signal;

a second partial encoder, used for masking the time-domain signal to enhance and purify the time-domain signal;

a one-dimensional convolution module, used for performing a one-dimensional convolution operation on the enhanced time-domain signal;

and an overlap-add module, used for overlap-adding the signals to reconstruct the waveform signal.

Preferably, the first partial encoder comprises at least two gated recurrent units forming a two-layer GRU network, followed by a fully connected layer and a Sigmoid layer;

the second partial encoder likewise comprises at least two gated recurrent units forming a two-layer GRU network, followed by a fully connected layer and a Sigmoid layer.

According to the method, through two cascaded transformations, firstly, short-time Fourier transformation is carried out on a voice signal to obtain a time-frequency domain signal, masking processing is carried out to obtain a clean amplitude spectrum signal, the signal is transformed to a time domain signal for the second time, and then masking processing is carried out to obtain a final clean voice signal.

Drawings

The foregoing and other objects, features and advantages of the invention will be apparent from the following more particular description of preferred embodiments of the invention, as illustrated in the accompanying drawings. Like reference numerals refer to like parts throughout the drawings, and the drawings are not intended to be drawn to scale in actual dimensions, emphasis instead being placed upon illustrating the principles of the invention.

FIG. 1 is a flow chart of a method for real-time noise reduction of speech based on dual transformation according to the present invention.

Detailed Description

The technical solutions of the present invention are described in further detail below with reference to the drawings and specific embodiments, so that those skilled in the art can better understand and implement the present invention; the embodiments do not limit the present invention.

As shown in FIG. 1, an embodiment of the present invention provides a real-time speech noise reduction method based on double transformation, comprising the following steps:

S1: framing the voice signal and obtaining a time-frequency signal through a short-time Fourier transform;

S2: masking the time-frequency signal to enhance and purify the time-frequency signal;

S3: performing an inverse Fourier transform on the enhanced time-frequency signal to obtain a time-domain signal;

S4: masking the time-domain signal to enhance and purify the time-domain signal;

S5: performing a one-dimensional convolution operation on the enhanced time-domain signal;

S6: reconstructing the waveform signal by overlap-add.

Through the two transformations, the time-frequency domain signal is processed first and the time-domain signal second, so that the noisy signal is processed progressively. In both transformation stages the model processes the signal frame by frame, so the audio can be streamed in real time without degrading model performance. Because the amplitude spectrum is processed first and the result is then processed in the time domain, the phase is in effect processed as well, and the resulting voice signal is cleaner and clearer.
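Frame-by-frame streaming of this kind is commonly implemented with a pair of sliding buffers, as sketched below. The buffer scheme, the class name `StreamingEnhancer`, and the 512/128-sample sizes (32 ms and 8 ms at an assumed 16 kHz) are assumptions for illustration; the text states only that processing is frame by frame. The `process` callable stands in for the whole two-stage model.

```python
import numpy as np

class StreamingEnhancer:
    """Consume one hop of audio at a time and emit one hop of output."""

    def __init__(self, process, frame_len=512, shift=128):
        self.process = process            # stand-in for the two-stage model
        self.frame_len, self.shift = frame_len, shift
        self.in_buf = np.zeros(frame_len)
        self.out_buf = np.zeros(frame_len)

    def push(self, hop):
        # Slide the newest hop into the input frame.
        self.in_buf = np.concatenate([self.in_buf[self.shift:], hop])
        enhanced = self.process(self.in_buf)
        # Slide the output buffer and overlap-add the new frame.
        self.out_buf = np.concatenate(
            [self.out_buf[self.shift:], np.zeros(self.shift)]
        )
        self.out_buf += enhanced
        # The oldest hop has now received all of its overlapping frames.
        return self.out_buf[: self.shift].copy()

# Pass-through check: dividing by 4 cancels the four-fold overlap.
rng = np.random.default_rng(0)
hops = rng.normal(size=(10, 128))
enhancer = StreamingEnhancer(process=lambda frame: frame / 4.0)
outputs = [enhancer.push(h) for h in hops]
```

With these sizes, the output of each `push` corresponds to the hop received three calls earlier, so the buffering adds a fixed delay of three hops on top of the model's compute time.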

In a preferred embodiment, in step S1, when the speech signal is framed, a frame length of 25-35 ms and a frame shift of 5-10 ms are used.

In a preferred embodiment, in step S1, when the speech signal is framed, a frame length of 32 ms and a frame shift of 8 ms are used.

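In a preferred embodiment, the short-time Fourier transform employs the following equation:

$$\hat{X} = M \odot Y \odot e^{j\varphi_y}$$

where $\hat{X}$ is the predicted clean time-frequency signal, Y represents the amplitude component of the mixed speech signal y after the short-time Fourier transform, M is a mask applied to Y with values between 0 and 1, and $\varphi_y$ represents the phase after the short-time Fourier transform; the clean audio is predicted by preserving the phase of the mixed speech.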

In a preferred embodiment, in step S2, the masking performed on the time-frequency signal by the first partial encoder includes the following steps: the mixed amplitude spectrum Y is passed through a GRU network, a fully connected layer and a Sigmoid layer to obtain a mask M, and the mask M is multiplied by Y to obtain an estimated amplitude spectrum $\hat{Y}$; the expression is as follows:

$$\hat{Y} = M \odot Y$$

In a preferred embodiment, in step S3, the estimated amplitude spectrum $\hat{Y}$ and the original phase $\varphi_y$ are subjected to an inverse Fourier transform to obtain a time-domain signal, which is not yet combined into a waveform signal.

In a preferred embodiment, after step S3 and before step S4, the following steps are also required: after channel normalization, the time-domain signal passes through a GRU network, a fully connected layer and a Sigmoid layer to obtain a time-domain mask M, and the mask M is multiplied by the framed time-domain signal $x$ to obtain an estimated time-domain signal $\hat{x}$; the expression is as follows:

$$\hat{x} = M \odot x$$

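In a preferred embodiment, in step S5, a one-dimensional convolution converts the number of channels into the length of one frame, and the waveform is then reconstructed using the overlap-add technique; the expression is as follows:

$$\hat{s} = \operatorname{OverlapAdd}\big(\operatorname{Conv1D}(\hat{x})\big)$$

where $\hat{x}$ is the enhanced time-domain signal from step S4 and $\hat{s}$ is the reconstructed waveform.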

In the system, the framing module is used for framing the voice signal, and the short-time Fourier module is used for obtaining a time-frequency signal.

The first partial encoder comprises at least two gated recurrent units forming a two-layer GRU network, followed by a fully connected layer and a Sigmoid layer.

Example 1:

As shown in FIG. 1, in order to further improve the quality of the noise-reduced speech while keeping computational complexity low, the present invention provides a dual-transform noise reduction technique. It obtains a clean amplitude spectrum in the time-frequency domain in real time and, after a second transformation and noise reduction, a clean time-domain signal; in this way the method also models the phase, yielding a higher-quality speech signal.

A real-time noise reduction method based on double transformation comprises the following steps:

framing a voice signal with a frame length of 32 ms and a frame shift of 8 ms, performing a short-time Fourier transform to obtain a time-frequency signal, masking the time-frequency signal using a first partial encoder, and performing an inverse Fourier transform to obtain a time-domain signal;

passing the time-domain signal through a second partial encoder to obtain a mask, and masking the time-domain signal;

performing a one-dimensional convolution operation on the enhanced time-domain signal, and then performing overlap-add to reconstruct the waveform signal.

As a specific implementation scheme, the masking performed by the first partial encoder includes the following steps:

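S11: taking 32 ms as the frame length of the audio signal, framing is performed with a frame shift of 8 ms, followed by a short-time Fourier transform:

$$\hat{X} = M \odot Y \odot e^{j\varphi_y}$$

where $\hat{X}$ is the predicted clean time-frequency signal, Y represents the amplitude component of the mixed speech signal y after the short-time Fourier transform, M is a mask applied to Y with values between 0 and 1, and $\varphi_y$ represents the phase after the short-time Fourier transform; clean audio is predicted by retaining the phase of the mixed speech;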

S12: the mixed amplitude spectrum Y is passed through a two-layer GRU network, a fully connected layer and a Sigmoid layer to obtain a mask M, and the mask M is multiplied by Y to obtain the estimated amplitude spectrum $\hat{Y}$:

$$\hat{Y} = M \odot Y$$

S13: the estimated amplitude spectrum $\hat{Y}$ and the original phase $\varphi_y$ are subjected to an inverse Fourier transform to obtain a time-domain signal, but the time-domain signal is not yet combined into a waveform signal.

S21: the second transformation stage processes the time-domain signal; first, the framed time-domain signal output by the first stage (S1) is converted into a signal with 256 channels through a one-dimensional convolution;

S22: to facilitate real-time processing and the convergence of deep learning training, channel normalization is performed first; the time-domain signal then passes through a two-layer GRU network with the same structure as that in S1, a fully connected layer and a Sigmoid layer to obtain a time-domain mask M, and the mask M is multiplied by the framed time-domain signal to obtain an estimated time-domain signal;

S31: the third stage (S3) first converts the number of channels into the length of one frame using a one-dimensional convolution, and then reconstructs the waveform using the overlap-add technique.
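The data flow of Example 1, from S11 through S31, can be strung together as in the sketch below. It is a shape-level illustration only: the weights are random, `placeholder_mask` replaces each trained two-layer GRU, fully connected and Sigmoid chain, both one-dimensional convolutions are assumed to have kernel size 1 (plain matrix products per frame), the 16 kHz sample rate is assumed, and the time-domain mask is applied to the 256-channel representation, which is one reading of S22 consistent with S31.

```python
import numpy as np

FRAME, SHIFT, CHANNELS = 512, 128, 256   # 32 ms / 8 ms at an assumed 16 kHz

rng = np.random.default_rng(0)
W_enc = rng.normal(0.0, 0.05, (CHANNELS, FRAME))   # stands in for the S21 convolution
W_dec = rng.normal(0.0, 0.05, (FRAME, CHANNELS))   # stands in for the S31 convolution

def placeholder_mask(features):
    """Untrained stand-in for GRU + fully connected + Sigmoid; values in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-features))

def enhance(x):
    n_frames = 1 + (len(x) - FRAME) // SHIFT
    out = np.zeros(SHIFT * (n_frames - 1) + FRAME)
    for i in range(n_frames):
        frame = x[i * SHIFT : i * SHIFT + FRAME]
        # Stage 1 (S11-S13): mask the magnitude, keep the noisy phase.
        spec = np.fft.rfft(frame)
        mag, phase = np.abs(spec), np.angle(spec)
        mag_hat = placeholder_mask(np.log1p(mag)) * mag
        frame_hat = np.fft.irfft(mag_hat * np.exp(1j * phase), n=FRAME)
        # Stage 2 (S21-S22): encode to 256 channels, normalize, mask.
        feats = W_enc @ frame_hat
        normed = (feats - feats.mean()) / np.sqrt(feats.var() + 1e-7)
        feats_hat = placeholder_mask(normed) * feats
        # Stage 3 (S31): back to frame length, then overlap-add.
        out[i * SHIFT : i * SHIFT + FRAME] += W_dec @ feats_hat
    return out

noisy = np.random.default_rng(1).normal(size=16000)
enhanced = enhance(noisy)
```

Since nothing here is trained, the output is not denoised audio; the sketch is useful mainly for checking tensor shapes and the order of operations.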

The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims

1. A speech real-time noise reduction method based on double transformation is characterized by comprising the following steps:

S6: reconstructing the waveform signal by overlap-add.

2. The method for real-time noise reduction of speech based on double transformation as claimed in claim 1, wherein in step S1, when the speech signal is framed, a frame length of 25-35 ms and a frame shift of 5-10 ms are used.

3. The method for real-time noise reduction of speech based on double transformation as claimed in claim 2, wherein in step S1, when the speech signal is framed, a frame length of 32 ms and a frame shift of 8 ms are used.

4. The dual transform-based voice real-time noise reduction method of claim 1, wherein the short-time Fourier transform employs the following formula:

$$\hat{X} = M \odot Y \odot e^{j\varphi_y}$$

where $\hat{X}$ is the predicted clean time-frequency signal, Y represents the amplitude component of the mixed speech signal y after the short-time Fourier transform, M is a mask applied to Y with values between 0 and 1, and $\varphi_y$ represents the phase after the short-time Fourier transform.

5. The dual transform-based speech real-time noise reduction method of claim 1, wherein in step S2, the masking performed on the time-frequency signal by the first partial encoder includes the following steps: the mixed amplitude spectrum Y is passed through a GRU network, a fully connected layer and a Sigmoid layer to obtain a mask M, and the mask M is multiplied by Y to obtain an estimated amplitude spectrum $\hat{Y}$; the expression is as follows:

$$\hat{Y} = M \odot Y$$

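6. The method of claim 5, wherein in step S3, the estimated amplitude spectrum $\hat{Y}$ and the original phase $\varphi_y$ are subjected to an inverse Fourier transform to obtain a time-domain signal, which is not synthesized into a waveform signal.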

7. The method for real-time noise reduction of speech based on double transformation according to claim 6, wherein after step S3 and before step S4, the following steps are further performed: after channel normalization, the time-domain signal passes through a GRU network, a fully connected layer and a Sigmoid layer to obtain a time-domain mask M, and the mask M is multiplied by the framed time-domain signal $x$ to obtain an estimated time-domain signal $\hat{x}$; the expression is as follows:

$$\hat{x} = M \odot x$$

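8. The method for real-time noise reduction of speech based on double transformation of claim 1, wherein in step S5, a one-dimensional convolution converts the number of channels into the length of one frame, and the waveform is then reconstructed using the overlap-add technique; the expression is as follows:

$$\hat{s} = \operatorname{OverlapAdd}\big(\operatorname{Conv1D}(\hat{x})\big)$$

where $\hat{x}$ is the enhanced time-domain signal from step S4 and $\hat{s}$ is the reconstructed waveform.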

9. A real-time voice noise reduction system, characterized by comprising:

a framing module, used for framing the voice signal;

a short-time Fourier module, used for obtaining a time-frequency signal;

10. The speech real-time noise reduction system of claim 9, wherein the first partial encoder comprises at least two gated recurrent units forming a two-layer GRU network, followed by a fully connected layer and a Sigmoid layer;