CN113096682B - Real-time voice noise reduction method and device based on mask time domain decoder

Info

Publication number: CN113096682B
Application number: CN202110299114.0A
Authority: CN
Inventors: 李平平
Original assignee: Hangzhou Zhicun Intelligent Technology Co ltd
Current assignee: Hangzhou Zhicun Intelligent Technology Co ltd
Priority date: 2021-03-20
Filing date: 2021-03-20
Publication date: 2023-08-29
Anticipated expiration: 2041-03-20
Also published as: CN113096682A

Abstract

The application provides a real-time speech noise reduction method and device based on a mask time domain decoder. The method comprises: extracting features of the noisy speech through a short-time Fourier transform (STFT); inputting the extracted features into a pre-trained neural network to obtain a mask; and inputting the mask and the noisy speech into a time domain decoder for decoding to obtain enhanced speech. The time domain decoder applies the mask, a set of weighting functions, to the noisy speech, which implements real-time neural network noise reduction based on time-domain post-processing. The model size is significantly smaller and the minimum latency shorter, making the method a suitable solution for real-time noise reduction on edge devices.

Description

Real-time voice noise reduction method and device based on mask time domain decoder

Technical Field

The application relates to the technical field of voice processing, in particular to a real-time voice noise reduction method and device based on a mask time domain decoder.

Background

Speech enhancement refers to techniques that extract the useful speech signal from a noisy background, suppressing and reducing noise interference after the speech signal has been disturbed or even submerged by noise. Put simply, it recovers speech that is as clean as possible from noisy speech. Speech enhancement has a wide range of applications, and speech systems in special environments generally adopt enhancement measures to varying degrees. For example, communication voice processing in a helicopter cabin and call systems in a ship cabin both require speech enhancement. Classical speech enhancement methods include spectral subtraction, Wiener filtering, statistical model-based methods, minima controlled recursive averaging (MCRA), histogram methods, and the like.

Classical speech enhancement methods usually rely on prior assumptions; spectral subtraction, for example, assumes that the noise is additive. Such assumptions are often hard to satisfy in real situations, so the practical effect falls short of expectations. Moreover, classical methods achieve a certain effect on stationary noise but perform unsatisfactorily in complex scenarios with non-stationary noise and low signal-to-noise ratio.

In recent years, deep learning has greatly improved the performance of time-frequency masking methods by improving the accuracy of mask estimation. In these methods, the waveform of each sound source is calculated using the inverse short-time Fourier transform (iSTFT) of the estimated spectrogram of that source together with the original or a modified phase of the mixture. This approach has two problems. First, accurately reconstructing the phase of a clean source through the STFT/iSTFT is a nontrivial problem, and erroneous phase estimation places an upper limit on the accuracy of the reconstructed audio. The problem is evident from the imperfect reconstruction accuracy obtained even when an ideal clean magnitude spectrum is applied to the mixture. Although a phase reconstruction method may be applied to alleviate the problem, the resulting performance is still poor. Second, decomposing clean signals from the mixed signal requires a longer time window for the STFT/iSTFT, which increases the minimum delay of the system and limits its applicability in real-time, low-delay applications such as telecommunications and hearable devices.

Disclosure of Invention

In view of the problems in the prior art, the present application provides a method and apparatus for real-time speech noise reduction based on a mask time domain decoder, an electronic device, and a computer readable storage medium, which can at least partially solve the problems in the prior art.

In order to achieve the above purpose, the present application adopts the following technical scheme:

In a first aspect, a method for real-time speech noise reduction based on a mask time domain decoder is provided, including:

extracting features of the noisy speech through STFT;

inputting the extracted features into a pre-trained neural network to obtain a mask;

and inputting the mask and the noisy speech into a time domain decoder for decoding to obtain the enhanced speech.

Further, inputting the mask and the noisy speech into the time domain decoder for decoding to obtain the enhanced speech includes:

inputting the mask and the noisy speech into a time-domain decoder;

and filtering the noisy speech on different subbands with the mask by using the time-domain decoder to obtain enhanced speech.

Further, the mask is a multi-dimensional mask representing each subband gain.

Further, extracting the features of the noisy speech through STFT includes:

performing pre-emphasis, framing, windowing and Fourier transformation on the noisy speech to obtain the features of the noisy speech.

Further, extracting the features of the noisy speech through STFT further includes:

dividing the frequency domain of the noisy speech into a plurality of subbands.

Further, the neural network has a structure of [ GRU (48), GRU (96), GRU (128), FC (512), FC (40) ].

Further, the time domain decoder is an IIR band-pass filter or an FIR filter.

In a second aspect, there is provided a real-time speech noise reduction apparatus based on a mask time domain decoder, comprising:

a feature extraction module, which extracts features of the noisy speech through STFT;

an inference module, which inputs the extracted features into a pre-trained neural network to obtain a mask;

and a time domain decoding module, which inputs the mask and the noisy speech into a time domain decoder for decoding to obtain the enhanced speech.

In a third aspect, an electronic device is provided, including a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the above-described real-time speech noise reduction method based on a mask time domain decoder when the program is executed.

In a fourth aspect, a computer readable storage medium is provided, on which a computer program is stored which, when executed by a processor, implements the steps of the above-described real-time speech noise reduction method based on a mask time domain decoder.

The application provides a real-time speech noise reduction method and device based on a mask time domain decoder. The method comprises: extracting features of the noisy speech through STFT; inputting the extracted features into a pre-trained neural network to obtain a mask; and inputting the mask and the noisy speech into a time domain decoder for decoding to obtain enhanced speech. The time domain decoder applies the mask, a set of weighting functions, to the noisy speech, which implements real-time neural network noise reduction based on time-domain post-processing. The model size is significantly smaller and the minimum latency shorter, making the method a suitable solution for real-time noise reduction on edge devices.

The foregoing and other objects, features and advantages of the application will be apparent from the following more particular description of preferred embodiments, as illustrated in the accompanying drawings.

Drawings

To illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings required for describing them are briefly introduced below. The drawings described below show only some embodiments of the present application; a person skilled in the art may obtain other drawings from them without inventive effort. In the drawings:

FIG. 1 is a schematic diagram of an architecture between a server S1 and a client device B1 according to an embodiment of the present application;

FIG. 2 is a schematic diagram of an architecture among a server S1, a client device B1 and a database server S2 according to an embodiment of the present application;

FIG. 3 illustrates a flow of a real-time speech noise reduction technique based on a masked time domain decoder in an embodiment of the present application;

FIG. 4 is a flowchart of a method for real-time speech noise reduction based on a mask time domain decoder according to an embodiment of the present application;

FIG. 5 shows specific steps of step S300 in an embodiment of the present application;

FIG. 6 shows specific steps of step S100 in an embodiment of the application;

FIG. 7 is a block diagram of a real-time speech noise reduction device based on a mask time domain decoder in an embodiment of the present application;

FIG. 8 is a block diagram of an electronic device according to an embodiment of the present application.

Detailed Description

To enable those skilled in the art to better understand the present application, the technical solutions in the embodiments of the present application are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by those skilled in the art based on the embodiments of the present application without inventive effort shall fall within the scope of protection of the present application.

It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

It is noted that the terms "comprises" and "comprising," and any variations thereof, in the description and claims of the present application and in the foregoing figures, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed or inherent to such process, method, article, or apparatus.

It should be noted that, without conflict, the embodiments of the present application and features of the embodiments may be combined with each other. The application will be described in detail below with reference to the drawings in connection with embodiments.

The real-time speech noise reduction technique based on the mask time domain decoder provided by the embodiments of the present application can be implemented on electronic devices, including but not limited to smart phones, tablet devices, network set-top boxes, portable computers, desktop computers, personal digital assistants (PDAs), vehicle-mounted devices, smart wearable devices, electric toys, smart home devices and the like, and can also be implemented on servers. The smart wearable devices may include smart glasses, smart watches, smart bracelets, etc.

When the real-time speech noise reduction technique based on the mask time domain decoder provided by the embodiments of the present application is implemented on a server, referring to FIG. 1, the server S1 may be communicatively connected to at least one client device B1. The client device B1 may send noisy speech to the server S1, and the server S1 may receive the noisy speech online. The server S1 may preprocess the acquired noisy speech online or offline and extract features of the noisy speech through STFT; input the extracted features into a pre-trained neural network to obtain a mask; and input the mask and the noisy speech into a time domain decoder for decoding to obtain the enhanced speech. The server S1 may then send the enhanced speech online to the client device B1, or use the enhanced speech for subsequent processing such as speech recognition and semantic recognition.

The client device B1 includes, but is not limited to, a smart phone, a tablet device, a network set-top box, a portable computer, a desktop computer, a personal digital assistant (PDA), a vehicle-mounted device, a smart wearable device, an electric toy, a smart home device, and the like. The smart wearable device may include smart glasses, a smart watch, a smart bracelet, etc.

In addition, referring to FIG. 2, the server S1 may be further communicatively connected to at least one database server S2, which is configured to store the pre-trained neural network for the server S1 to call, or to store historical voice data. The database server S2 sends the historical voice data to the server S1 online; the server S1 receives the historical voice data online, obtains a training sample set for the neural network from a plurality of pieces of historical voice data, and trains the neural network with the training sample set.

Based on the above, the database server S2 may also be used to store historical voice data for testing. The database server S2 sends the historical voice data for testing to the server S1 online, and the server S1 receives it online. The server S1 then obtains a test sample from at least one piece of historical voice data for testing, applies the test sample to the model, and takes the output of the model as the test result. Based on the test result and the known evaluation result of the at least one piece of historical voice data for testing, the server S1 judges whether the current model meets the preset requirement. If so, the current model is used as the target model for mask extraction; if not, the current model is optimized and/or retrained with an updated training sample set.

Any suitable network protocol may be used for communication between the server and the client device, including protocols not yet developed on the filing date of the present application. The network protocols may include, for example, TCP/IP, UDP/IP, HTTP, HTTPS, etc. The network protocols may also include, for example, RPC (Remote Procedure Call) and REST (Representational State Transfer) used on top of the above protocols.

Referring to fig. 3 and 4, a method for real-time speech noise reduction based on a mask time domain decoder according to the present application may include:

Step S100: extracting features of the noisy speech through STFT;

Specifically, the noisy speech first undergoes certain preprocessing, such as analysis and verification, and STFT feature extraction is then performed.

Step S200: inputting the extracted features into a pre-trained neural network to obtain a mask;

Specifically, the network structure is [ GRU (48), GRU (96), GRU (128), FC (512), FC (40) ], where GRU denotes a gated recurrent unit layer and FC denotes a fully connected layer. The learnable parameters are updated by back-propagation. The activation function of the last fully connected layer is a sigmoid, which maps the hidden-layer information to real-valued outputs in the range (0, 1).

Step S300: inputting the mask and the noisy speech into a time domain decoder for decoding to obtain the enhanced speech.

With this technical solution, the iSTFT used in neural-network mask-based noise reduction is replaced by a time domain decoder, which reduces the system delay and removes the upper limit on signal decoupling imposed by the iSTFT.
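The network of step S200 can be illustrated with a minimal forward-pass sketch in plain Python. Only the layer sizes and the final sigmoid come from the description; the 40-dimensional input, the random weights, the omitted biases, the ReLU after FC (512), and the names `GRULayer` and `MaskNet` are assumptions made for illustration, and a real model would load trained weights instead.

```python
import math
import random


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def matvec(w, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def rand_matrix(rows, cols, rng, scale=0.1):
    # Placeholder weights; a trained model would supply these.
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


class GRULayer:
    """One gated recurrent unit layer that keeps its hidden state between frames."""

    def __init__(self, n_in, n_hidden, rng):
        # One input and one recurrent matrix per gate: update (z), reset (r), candidate (n).
        self.w = {g: rand_matrix(n_hidden, n_in, rng) for g in "zrn"}
        self.u = {g: rand_matrix(n_hidden, n_hidden, rng) for g in "zrn"}
        self.h = [0.0] * n_hidden

    def step(self, x):
        z = [sigmoid(a + b) for a, b in zip(matvec(self.w["z"], x), matvec(self.u["z"], self.h))]
        r = [sigmoid(a + b) for a, b in zip(matvec(self.w["r"], x), matvec(self.u["r"], self.h))]
        gated = [ri * hi for ri, hi in zip(r, self.h)]
        n = [math.tanh(a + b) for a, b in zip(matvec(self.w["n"], x), matvec(self.u["n"], gated))]
        self.h = [(1.0 - zi) * ni + zi * hi for zi, ni, hi in zip(z, n, self.h)]
        return self.h


class MaskNet:
    """[GRU(48), GRU(96), GRU(128), FC(512), FC(40)] with a sigmoid output."""

    def __init__(self, n_features=40, seed=0):
        rng = random.Random(seed)
        self.grus = [GRULayer(n_features, 48, rng), GRULayer(48, 96, rng), GRULayer(96, 128, rng)]
        self.fc1 = rand_matrix(512, 128, rng)
        self.fc2 = rand_matrix(40, 512, rng)

    def step(self, features):
        h = features
        for gru in self.grus:
            h = gru.step(h)
        h = [max(0.0, v) for v in matvec(self.fc1, h)]  # ReLU here is an assumption
        return [sigmoid(v) for v in matvec(self.fc2, h)]  # one gain in (0, 1) per subband


net = MaskNet()
mask = net.step([0.5] * 40)  # one frame of 40 subband features
print(len(mask), min(mask), max(mask))
```

Because each `GRULayer` carries its hidden state from one call to the next, the mask for a frame depends on earlier frames without any look-ahead, which is what makes frame-by-frame operation possible.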

In an alternative embodiment, referring to fig. 5, the step S300 may include the following:

Step S310: inputting the mask and the noisy speech into a time-domain decoder;

Step S320: filtering the noisy speech on different subbands with the mask by using the time-domain decoder to obtain enhanced speech.

Wherein the mask is a multi-dimensional mask representing the gain of each subband.

In an alternative embodiment, the time domain decoder is an IIR band-pass filter or FIR filter or the like.
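For the FIR option, one possible realization is sketched below. The description does not say how the FIR coefficients are obtained, so the frequency-sampling design used here, the tap count of 161, and the function names `fir_from_mask` and `fir_filter` are assumptions. The sketch turns the per-subband gains into a linear-phase FIR filter and applies it to the noisy samples by direct convolution.

```python
import math


def fir_from_mask(mask, num_taps=161):
    """Frequency-sampling design of a linear-phase FIR filter.

    The amplitude response at frequency k / num_taps (cycles per sample)
    is set to the gain of the subband that contains that frequency.
    """
    if num_taps % 2 == 0:
        raise ValueError("num_taps must be odd")
    mid = (num_taps - 1) // 2
    # Map each sampled frequency (0 .. just below Nyquist) to its subband gain.
    amp = [mask[int(2 * k / num_taps * len(mask))] for k in range(mid + 1)]
    taps = []
    for n in range(num_taps):
        acc = amp[0]
        for k in range(1, mid + 1):
            acc += 2.0 * amp[k] * math.cos(2.0 * math.pi * k * (n - mid) / num_taps)
        taps.append(acc / num_taps)
    return taps


def fir_filter(x, taps):
    """Direct-form convolution with zero initial state; output has len(x) samples."""
    out = []
    for n in range(len(x)):
        acc = 0.0
        for k in range(min(n + 1, len(taps))):
            acc += taps[k] * x[n - k]
        out.append(acc)
    return out


# Usage: keep only subband 10 of a 40-band mask and filter a short test tone.
mask = [1.0 if b == 10 else 0.0 for b in range(40)]
tone = [math.sin(2.0 * math.pi * 21 * n / 161) for n in range(400)]
enhanced = fir_filter(tone, fir_from_mask(mask))
```

A linear-phase filter of this kind delays the signal by (num_taps - 1) / 2 samples, so the tap count trades frequency resolution against latency; in a streaming system the taps would be recomputed whenever a new mask arrives.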

In an alternative embodiment, referring to fig. 6, the step S100 may include the following:

Step S110: dividing the frequency domain of the noisy speech into a plurality of subbands.

Step S120: performing pre-emphasis, framing, windowing and Fourier transformation on the noisy speech to obtain the features of the noisy speech.

Specifically, pre-emphasis is performed on the noisy speech to enhance high-frequency information, framing and windowing are then applied, and a Fourier transform is finally performed to obtain the amplitude spectrum and the phase spectrum of the noisy speech.
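The chain of pre-emphasis, framing, windowing and Fourier transformation can be sketched as follows. The pre-emphasis coefficient 0.97, the Hann window, the frame length of 64 samples and the hop of 32 samples are illustrative assumptions, since the description does not fix them, and the naive DFT stands in for an FFT to keep the example dependency-free.

```python
import cmath
import math


def pre_emphasis(x, coeff=0.97):
    """y[n] = x[n] - coeff * x[n-1], which boosts high frequencies."""
    if not x:
        return []
    return [x[0]] + [x[n] - coeff * x[n - 1] for n in range(1, len(x))]


def frame_signal(x, frame_len, hop):
    """Split x into overlapping frames, dropping an incomplete tail."""
    if len(x) < frame_len:
        return []
    count = 1 + (len(x) - frame_len) // hop
    return [x[i * hop:i * hop + frame_len] for i in range(count)]


def hann(n):
    """Periodic Hann window of length n."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / n) for i in range(n)]


def dft_half(frame):
    """Naive DFT returning bins 0 .. N/2 (use an FFT in practice)."""
    n = len(frame)
    return [
        sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n // 2 + 1)
    ]


def stft_features(x, frame_len=64, hop=32, coeff=0.97):
    """Return per-frame magnitude and phase spectra of the pre-emphasised signal."""
    window = hann(frame_len)
    magnitudes, phases = [], []
    for frame in frame_signal(pre_emphasis(x, coeff), frame_len, hop):
        spectrum = dft_half([s * w for s, w in zip(frame, window)])
        magnitudes.append([abs(c) for c in spectrum])
        phases.append([cmath.phase(c) for c in spectrum])
    return magnitudes, phases


# Usage: a tone that falls exactly on DFT bin 8 of a 64-sample frame.
signal = [math.sin(2.0 * math.pi * 8 * n / 64) for n in range(256)]
mags, phs = stft_features(signal)
```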

It is worth noting that features are extracted for each of a plurality of subbands of noisy speech.
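A possible way to obtain one feature per subband from a magnitude spectrum is sketched below. The description does not specify the band layout or the feature type, so the uniform band edges, the log-energy feature and the names `band_edges` and `subband_log_energy` are assumptions.

```python
import math


def band_edges(num_bins, num_bands):
    """Split bin indices 0 .. num_bins-1 into contiguous, nearly equal groups."""
    return [(b * num_bins) // num_bands for b in range(num_bands + 1)]


def subband_log_energy(magnitudes, num_bands=40, floor=1e-10):
    """Return one log-energy feature per subband for a single frame."""
    edges = band_edges(len(magnitudes), num_bands)
    features = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        energy = sum(m * m for m in magnitudes[lo:hi])
        features.append(math.log10(energy + floor))  # floor avoids log of zero
    return features


# Usage: a 257-bin spectrum with all of its energy in bin 100.
spectrum = [0.0] * 257
spectrum[100] = 1.0
features = subband_log_energy(spectrum)  # 40 values, one per subband
```

Perceptually spaced bands (for example mel or Bark) would be a common alternative to uniform edges; the choice affects only `band_edges`.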

For a better understanding of the present application by those skilled in the art, an implementation is illustrated below by way of example:

First, the frequency domain of the noisy signal is divided into 40 subbands, and 40-dimensional subband features of the noisy speech are extracted through STFT. The features are fed into a neural network with the structure [ GRU (48), GRU (96), GRU (128), FC (512), FC (40) ], whose learnable parameters are updated by back-propagation. The activation function of the last fully connected layer is a sigmoid, which maps the hidden-layer information to real values and yields a 40-dimensional mask representing the gain of each subband. The mask and the noisy speech are then sent to a first-order band-pass IIR time domain decoder, which filters the noisy speech on the different subbands using the mask to finally obtain the enhanced speech.
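The decoding stage of this example can be sketched as a bank of band-pass IIR filters whose outputs are weighted by the mask and summed. The description gives no filter coefficients, so several choices here are assumptions: each band uses a standard second-order band-pass section (one common reading of a band-pass derived from a first-order prototype), the 40 bands are uniform over 0 to 8 kHz at a 16 kHz sampling rate, and the names `BandPass` and `IIRDecoder` are hypothetical.

```python
import math


class BandPass:
    """Second-order band-pass section with unit gain at its centre frequency."""

    def __init__(self, center_hz, q, sample_rate):
        w0 = 2.0 * math.pi * center_hz / sample_rate
        alpha = math.sin(w0) / (2.0 * q)
        a0 = 1.0 + alpha
        self.b0, self.b2 = alpha / a0, -alpha / a0  # b1 is zero for this design
        self.a1, self.a2 = -2.0 * math.cos(w0) / a0, (1.0 - alpha) / a0
        self.x1 = self.x2 = self.y1 = self.y2 = 0.0

    def step(self, x):
        y = self.b0 * x + self.b2 * self.x2 - self.a1 * self.y1 - self.a2 * self.y2
        self.x2, self.x1 = self.x1, x
        self.y2, self.y1 = self.y1, y
        return y


class IIRDecoder:
    """Filter bank that applies one mask gain per subband in the time domain."""

    def __init__(self, sample_rate=16000, num_bands=40):
        width = (sample_rate / 2.0) / num_bands
        self.filters = []
        for b in range(num_bands):
            center = (b + 0.5) * width
            # Q chosen so that each filter's bandwidth is roughly one subband.
            self.filters.append(BandPass(center, center / width, sample_rate))

    def decode(self, noisy, mask):
        """Process a block of samples sample by sample with the current mask."""
        out = []
        for x in noisy:
            out.append(sum(g * f.step(x) for g, f in zip(mask, self.filters)))
        return out


# Usage: pass a 2100 Hz tone (the centre of subband 10) and mute every other band.
decoder = IIRDecoder()
tone = [math.sin(2.0 * math.pi * 2100 * n / 16000) for n in range(800)]
mask = [1.0 if b == 10 else 0.0 for b in range(40)]
enhanced = decoder.decode(tone, mask)
```

Each output sample depends only on the current and past input samples, so no overlap-add buffer is needed. Note that simply summing all bands with unit gains does not reconstruct the input exactly with this simple design; a production filter bank would need to address that.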

With this technical solution, the amount of computation and the system delay are reduced, the upper limit on signal decoupling imposed by the iSTFT is removed, the power consumption of edge computing devices is lowered, and the heavy computation and reconstruction bottleneck caused by the iSTFT module are avoided.
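To give a rough sense of the model size implied by the stated layer sizes, the parameter count can be estimated as below. The description reports no parameter counts; the 40-dimensional input and the convention of one bias vector per GRU gate and per fully connected layer are assumptions, so the result is an estimate only.

```python
def gru_params(n_in, n_hidden):
    """Three gates, each with input weights, recurrent weights and one bias vector."""
    return 3 * (n_in * n_hidden + n_hidden * n_hidden + n_hidden)


def fc_params(n_in, n_out):
    """Weight matrix plus bias vector."""
    return n_in * n_out + n_out


def count_parameters(n_features=40):
    """Total parameters of [GRU(48), GRU(96), GRU(128), FC(512), FC(40)]."""
    layers = [("gru", 48), ("gru", 96), ("gru", 128), ("fc", 512), ("fc", 40)]
    total, n_in = 0, n_features
    for kind, n_out in layers:
        total += gru_params(n_in, n_out) if kind == "gru" else fc_params(n_in, n_out)
        n_in = n_out
    return total


total = count_parameters()
print(total, "parameters,", total * 4, "bytes as 32-bit floats")
```

Under these assumptions the total is 227,544 parameters, or about 0.9 MB when stored as 32-bit floats.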

Based on the same inventive concept, an embodiment of the present application also provides a real-time speech noise reduction device based on a mask time domain decoder, which can be used to implement the method described in the above embodiments, as described in the following embodiment. Because the device solves the problem on a principle similar to that of the method, the implementation of the device may refer to the implementation of the method, and repeated details are omitted. As used below, the term "unit" or "module" may be a combination of software and/or hardware that implements the intended function. While the device described in the following embodiments is preferably implemented in software, implementation in hardware, or in a combination of software and hardware, is also possible and contemplated.

FIG. 7 is a block diagram of a real-time speech noise reduction device based on a mask time domain decoder according to an embodiment of the present application. As shown in FIG. 7, the real-time speech noise reduction device based on the mask time domain decoder specifically includes: a feature extraction module 10, an inference module 20 and a time domain decoding module 30.

The feature extraction module 10 extracts features of the noisy speech through STFT;

the inference module 20 inputs the extracted features into a pre-trained neural network to obtain a mask;

the time domain decoding module 30 inputs the mask and the noisy speech into a time domain decoder for decoding to obtain the enhanced speech.
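The way the three modules hand data to one another can be sketched as a small composition. The class name `NoiseReducer`, the callable signatures and the trivial stand-in functions are hypothetical; they only show the data flow from module 10 through module 20 to module 30, not the actual processing.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class NoiseReducer:
    extract_features: Callable[[Sequence[float]], List[float]]      # module 10
    infer_mask: Callable[[List[float]], List[float]]                # module 20
    decode: Callable[[Sequence[float], List[float]], List[float]]   # module 30

    def process(self, noisy_frame: Sequence[float]) -> List[float]:
        features = self.extract_features(noisy_frame)
        mask = self.infer_mask(features)
        # The decoder receives both the mask and the original noisy samples.
        return self.decode(noisy_frame, mask)


# Placeholder stand-ins, used only to exercise the wiring.
reducer = NoiseReducer(
    extract_features=lambda frame: [sum(s * s for s in frame)],
    infer_mask=lambda features: [0.5],
    decode=lambda frame, mask: [mask[0] * s for s in frame],
)
enhanced = reducer.process([1.0, -1.0])
```

Keeping the decoder separate from mask inference makes it straightforward to swap an IIR filter bank for an FIR filter without touching the other two modules.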

The apparatus, module or unit set forth in the above embodiments may be implemented in particular by a computer chip or entity, or by a product having a certain function. A typical implementation device is an electronic device, which may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.

In a typical example the electronic device comprises in particular a memory, a processor and a computer program stored on the memory and executable on the processor, said processor implementing the steps of the above described real-time speech noise reduction method based on a mask time domain decoder when said program is executed.

Referring now to fig. 8, a schematic diagram of an electronic device 600 suitable for use in implementing embodiments of the present application is shown.

As shown in FIG. 8, the electronic device 600 includes a central processing unit (CPU) 601, which can perform various appropriate operations and processes according to a program stored in a read-only memory (ROM) 602 or a program loaded from a storage section 608 into a random access memory (RAM) 603. The RAM 603 also stores various programs and data required for the operation of the system 600. The CPU 601, ROM 602, and RAM 603 are connected to each other through a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.

The following components are connected to the I/O interface 605: an input portion 606 including a keyboard, a mouse, etc.; an output portion 607 including a cathode ray tube (CRT) or liquid crystal display (LCD), a speaker, and the like; a storage section 608 including a hard disk and the like; and a communication section 609 including a network interface card such as a LAN card or a modem. The communication section 609 performs communication processing via a network such as the Internet. A drive 610 is also connected to the I/O interface 605 as needed. A removable medium 611, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 610 as needed, so that a computer program read from it can be installed into the storage section 608 as needed.

In particular, according to embodiments of the present application, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, embodiments of the present application include a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the above-described real-time speech noise reduction method based on a masked time domain decoder.

In such an embodiment, the computer program may be downloaded and installed from a network through the communication portion 609, and/or installed from the removable medium 611.

Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transmission media), such as modulated data signals and carrier waves.

For convenience of description, the above devices are described as being functionally divided into various units, respectively. Of course, the functions of each element may be implemented in the same piece or pieces of software and/or hardware when implementing the present application.

The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in the process, method, article or apparatus that comprises the element.

The application may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The application may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.

In this specification, the embodiments are described in a progressive manner; identical and similar parts of the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, because the system embodiments are substantially similar to the method embodiments, they are described relatively briefly; for relevant details, refer to the corresponding description of the method embodiments.

The foregoing is merely exemplary of the present application and is not intended to limit the present application. Various modifications and variations of the present application will be apparent to those skilled in the art. Any modification, equivalent replacement, improvement, etc. which come within the spirit and principles of the application are to be included in the scope of the claims of the present application.

Claims

1. A method for real-time speech noise reduction based on a masked time domain decoder, comprising:

extracting features of the noisy speech through STFT;

inputting the extracted features into a pre-trained neural network to obtain a mask;

inputting the mask and the noisy speech into a time domain decoder for decoding to obtain enhanced speech;

wherein inputting the mask and the noisy speech into the time domain decoder for decoding to obtain enhanced speech includes:

inputting the mask and the noisy speech into a time-domain decoder;

filtering the noisy speech with the mask on different subbands using the time-domain decoder to obtain enhanced speech;

the time domain decoder is an IIR band-pass filter or an FIR filter; the neural network structure is [ GRU (48), GRU (96), GRU (128), FC (512), FC (40) ], where GRU is a gated recurrent unit layer and FC is a fully connected layer; the learnable parameters are updated by back-propagation; the activation function of the last fully connected layer is a sigmoid, which maps the hidden-layer information to real values;

the frequency domain of the noisy speech is divided into a plurality of sub-bands, features are extracted from the plurality of sub-bands of the noisy speech, and the mask is a multi-dimensional mask representing the gain of each sub-band.

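2. The method of real-time speech noise reduction based on a masked time domain decoder according to claim 1, wherein said extracting features of the noisy speech through STFT comprises:

performing pre-emphasis, framing, windowing and Fourier transformation on the noisy speech to obtain the features of the noisy speech.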

3. A real-time speech noise reduction apparatus based on a masked time domain decoder, comprising:

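a feature extraction module, which extracts features of the noisy speech through STFT;

an inference module, which inputs the extracted features into a pre-trained neural network to obtain a mask;

a time domain decoding module, which inputs the mask and the noisy speech into a time domain decoder for decoding to obtain enhanced speech;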

wherein the time domain decoding module comprises:

inputting the mask and the noisy speech into a time-domain decoder;

the time domain decoder is an IIR band-pass filter or an FIR filter; the neural network structure is [ GRU (48), GRU (96), GRU (128), FC (512), FC (40) ], where GRU is a gated recurrent unit layer and FC is a fully connected layer; the learnable parameters are updated by back-propagation; the activation function of the last fully connected layer is a sigmoid, which maps the hidden-layer information to real values; the frequency domain of the noisy speech is divided into a plurality of sub-bands, features are extracted from the plurality of sub-bands of the noisy speech, and the mask is a multi-dimensional mask representing the gain of each sub-band.

4. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the steps of the mask time domain decoder based real-time speech noise reduction method according to any of claims 1 to 2 when the program is executed by the processor.

5. A computer readable storage medium having stored thereon a computer program, which when executed by a processor implements the steps of the mask time domain decoder based real time speech noise reduction method according to any of claims 1 to 2.