CN113096682B - Real-time voice noise reduction method and device based on mask time domain decoder - Google Patents
- Publication number
- CN113096682B CN202110299114.0A
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
- G10L21/0216—Noise filtering characterised by the method used for estimating noise
- G10L21/0224—Processing in the time domain
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
Abstract
The application provides a real-time voice noise reduction method and device based on a mask time domain decoder. The method comprises: extracting features of noisy speech through a short-time Fourier transform (STFT); inputting the extracted features into a pre-trained neural network to obtain a mask; and inputting the mask and the noisy speech into a time domain decoder for decoding to obtain enhanced speech. The time domain decoder processes the noisy speech by applying a set of weighting functions (the mask) to it, thereby implementing real-time neural network noise reduction based on time domain post-processing, with a significantly smaller model size and a shorter minimum latency, which makes the method a suitable solution for real-time noise reduction on edge devices.
Description
Technical Field
The application relates to the technical field of voice processing, in particular to a real-time voice noise reduction method and device based on a mask time domain decoder.
Background
Speech enhancement refers to techniques for extracting a useful speech signal from a noisy background, suppressing and reducing noise interference after the speech signal has been disturbed or even submerged by various kinds of noise. Put simply, it extracts the original speech, as clean as possible, from noisy speech. Speech enhancement has a wide range of applications, and speech systems in special environments generally adopt enhancement measures to varying degrees. For example, communication voice processing in a helicopter cabin and call systems in a ship cabin both require speech enhancement. Classical speech enhancement methods include spectral subtraction, Wiener filtering, statistical-model-based methods, minima controlled recursive averaging (MCRA), histogram methods, and the like.
Classical speech enhancement methods usually rely on prior assumptions; spectral subtraction, for example, assumes that the noise is additive. Such assumptions are often difficult to satisfy in real situations, so the practical effect falls short of expectations. Moreover, classical methods can achieve a certain effect on stationary noise but are unsatisfactory in complex scenarios with non-stationary noise and a low signal-to-noise ratio.
In recent years, deep learning has greatly improved the performance of time-frequency masking methods by improving the accuracy of mask estimation. In these methods, the waveform of each sound source is calculated by applying the inverse short-time Fourier transform (ISTFT) to the estimated spectrogram of that source together with the original or a modified phase of the mixture. This approach has two drawbacks. First, accurately reconstructing the phase of a clean source through STFT/ISTFT is a nontrivial problem, and erroneous phase estimation imposes an upper limit on the accuracy of the reconstructed audio. This limit is evident in the imperfect source reconstruction obtained even when an ideal clean magnitude spectrum is applied to the mixture. Although phase reconstruction methods may be applied to alleviate the problem, their performance is still poor. Second, separating the clean signal from the mixed signal requires a longer time window to calculate the ISTFT, which increases the minimum delay of the system and limits its applicability in real-time, low-delay applications such as telecommunications and hearable devices.
Disclosure of Invention
In view of the problems in the prior art, the present application provides a method and apparatus for real-time speech noise reduction based on a mask time domain decoder, an electronic device, and a computer readable storage medium, which can at least partially solve the problems in the prior art.
In order to achieve the above purpose, the present application adopts the following technical scheme:
In a first aspect, a real-time speech noise reduction method based on a mask time domain decoder is provided, including:
extracting features of noisy speech through a short-time Fourier transform (STFT);
inputting the extracted features into a pre-trained neural network to obtain a mask;
and inputting the mask and the noisy speech into a time domain decoder for decoding to obtain enhanced speech.
Further, inputting the mask and the noisy speech into the time domain decoder for decoding to obtain the enhanced speech includes:
inputting the mask and the noisy speech into the time domain decoder;
and filtering the noisy speech on different subbands with the mask by using the time domain decoder to obtain the enhanced speech.
Further, the mask is a multi-dimensional mask representing the gain of each subband.
Further, extracting features of the noisy speech through STFT includes:
performing pre-emphasis, framing, windowing and Fourier transformation on the noisy speech to obtain the features of the noisy speech.
Further, extracting features of the noisy speech through STFT further includes:
dividing the frequency domain of the noisy speech into a plurality of subbands.
Further, the neural network has a structure of [ GRU (48), GRU (96), GRU (128), FC (512), FC (40) ].
Further, the time domain decoder is an IIR band-pass filter or an FIR filter.
In a second aspect, there is provided a real-time speech noise reduction apparatus based on a mask time domain decoder, comprising:
a feature extraction module, configured to extract features of noisy speech through STFT;
an inference module, configured to input the extracted features into a pre-trained neural network to obtain a mask;
and a time domain decoding module, configured to input the mask and the noisy speech into a time domain decoder for decoding to obtain enhanced speech.
In a third aspect, an electronic device is provided, including a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the above-described real-time speech noise reduction method based on a mask time domain decoder when the program is executed.
In a fourth aspect, a computer readable storage medium is provided, on which a computer program is stored which, when executed by a processor, implements the steps of the above-described real-time speech noise reduction method based on a mask time domain decoder.
The application provides a real-time voice noise reduction method and device based on a mask time domain decoder. The method comprises: extracting features of noisy speech through STFT; inputting the extracted features into a pre-trained neural network to obtain a mask; and inputting the mask and the noisy speech into a time domain decoder for decoding to obtain enhanced speech. The time domain decoder processes the noisy speech by applying a set of weighting functions (the mask) to it, thereby implementing real-time neural network noise reduction based on time domain post-processing, with a significantly smaller model size and a shorter minimum latency, which makes the method a suitable solution for real-time noise reduction on edge devices.
The foregoing and other objects, features and advantages of the application will be apparent from the following more particular description of preferred embodiments, as illustrated in the accompanying drawings.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, and it is obvious that the drawings in the following description are some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art. In the drawings:
FIG. 1 is a schematic diagram of an architecture between a server S1 and a client device B1 according to an embodiment of the present application;
FIG. 2 is a schematic diagram of an architecture among a server S1, a client device B1 and a database server S2 according to an embodiment of the present application;
FIG. 3 illustrates a flow of a real-time speech noise reduction technique based on a mask time domain decoder in an embodiment of the present application;
FIG. 4 is a flowchart of a method for real-time speech noise reduction based on a mask time domain decoder according to an embodiment of the present application;
FIG. 5 shows specific steps of step S300 in an embodiment of the present application;
FIG. 6 shows specific steps of step S100 in an embodiment of the present application;
FIG. 7 is a block diagram of a real-time speech noise reduction device based on a mask time domain decoder in an embodiment of the present application;
FIG. 8 is a block diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In order that those skilled in the art may better understand the present application, the technical solutions in the embodiments of the present application are described clearly and completely below with reference to the accompanying drawings. It is apparent that the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by those skilled in the art based on the embodiments of the present application without inventive effort shall fall within the scope of the present application.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
It is noted that the terms "comprises" and "comprising," and any variations thereof, in the description and claims of the present application and in the foregoing figures, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed or inherent to such process, method, article, or apparatus.
It should be noted that, without conflict, the embodiments of the present application and features of the embodiments may be combined with each other. The application will be described in detail below with reference to the drawings in connection with embodiments.
The real-time voice noise reduction technology based on the mask time domain decoder provided by the embodiments of the present application can be implemented on electronic devices, including but not limited to smart phones, tablet devices, network set-top boxes, portable computers, desktop computers, personal digital assistants (PDAs), vehicle-mounted devices, smart wearable devices, electric toys, smart home devices and the like, and can also be implemented on servers. The smart wearable devices may include smart glasses, smart watches, smart bracelets, and the like.
When the real-time speech noise reduction technique based on the mask time domain decoder provided by the embodiments of the present application is implemented on a server, referring to FIG. 1, the server S1 may be communicatively connected to at least one client device B1, the client device B1 may send noisy speech to the server S1, and the server S1 may receive the noisy speech online. The server S1 may preprocess the acquired noisy speech online or offline and extract features of the noisy speech through STFT; input the extracted features into a pre-trained neural network to obtain a mask; and input the mask and the noisy speech into a time domain decoder for decoding to obtain the enhanced speech. The server S1 may then send the enhanced speech online to the client device B1, or use the enhanced speech for subsequent processing such as speech recognition and semantic recognition.
The client device B1 includes, but is not limited to, a smart phone, a tablet device, a network set-top box, a portable computer, a desktop computer, a personal digital assistant (PDA), a vehicle-mounted device, a smart wearable device, an electric toy, a smart home device, and the like. The smart wearable devices may include smart glasses, smart watches, smart bracelets, and the like.
In addition, referring to FIG. 2, the server S1 may be further communicatively connected to at least one database server S2, where the database server S2 is configured to store the pre-trained neural network for the server S1 to call, or to store historical voice data. The database server S2 sends the historical voice data to the server S1 online; the server S1 may receive the historical voice data online, obtain a training sample set for the neural network from a plurality of pieces of historical voice data, and use the training sample set to train the neural network.
On this basis, the database server S2 may also be used to store historical voice data for testing. The database server S2 sends the historical voice data for testing to the server S1 online; the server S1 may receive it online, obtain a test sample from at least one piece of historical voice data for testing, and test the model with the test sample, taking the output of the model as the test result. Whether the current model meets a preset requirement is then judged based on the test result and the known evaluation result of the at least one piece of historical voice data for testing. If it does, the current model is used as the target model for mask extraction; if it does not, the current model is optimized and/or retrained with an updated training sample set.
Any suitable network protocol may be used for communication between the server and the client device, including those not yet developed as of the filing date of the present application. The network protocols may include, for example, TCP/IP, UDP/IP, HTTP, and HTTPS. The network protocols may also include, for example, RPC (Remote Procedure Call) and REST (Representational State Transfer) used on top of the above protocols.
Referring to fig. 3 and 4, a method for real-time speech noise reduction based on a mask time domain decoder according to the present application may include:
Step S100: extracting features of the noisy speech through STFT;
Specifically, certain preprocessing, such as analysis and verification, is performed on the noisy speech, and STFT feature extraction is then carried out.
Step S200: inputting the extracted features into a pre-trained neural network to obtain a mask;
Specifically, the network structure is [GRU(48), GRU(96), GRU(128), FC(512), FC(40)], where GRU is a gated recurrent unit layer and FC is a fully connected layer. The learnable parameters are updated by back propagation. The activation function of the last fully connected layer is a sigmoid, which maps the hidden-layer information to real values in the range (0, 1).
Step S300: inputting the mask and the noisy speech into a time domain decoder for decoding to obtain the enhanced speech.
By adopting this technical scheme, the ISTFT in neural-network mask-based noise reduction is replaced by a time domain decoder, which reduces system delay and breaks through the upper limit on signal separation imposed by the ISTFT.
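As an illustration of step S200, the following is a minimal sketch of a frame-by-frame forward pass through a network with the layer sizes [GRU(48), GRU(96), GRU(128), FC(512), FC(40)], written in plain NumPy. It is not the trained model of the application: the weights are random, the class names `GRULayer`, `Dense` and `MaskNet` are hypothetical, the 40-dimensional input size follows the 40-subband example of this description, and the ReLU activation on FC(512) is an assumption, since the text specifies only the sigmoid of the last layer.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class GRULayer:
    """One GRU layer that keeps its hidden state between frames (streaming use)."""
    def __init__(self, n_in, n_hidden, rng):
        s = 1.0 / np.sqrt(n_hidden)
        # Index 0: update gate, 1: reset gate, 2: candidate state.
        self.W = rng.uniform(-s, s, (3, n_hidden, n_in))      # input weights
        self.U = rng.uniform(-s, s, (3, n_hidden, n_hidden))  # recurrent weights
        self.b = np.zeros((3, n_hidden))
        self.h = np.zeros(n_hidden)

    def step(self, x):
        z = sigmoid(self.W[0] @ x + self.U[0] @ self.h + self.b[0])
        r = sigmoid(self.W[1] @ x + self.U[1] @ self.h + self.b[1])
        c = np.tanh(self.W[2] @ x + self.U[2] @ (r * self.h) + self.b[2])
        # One common convention; some libraries swap the roles of z and (1 - z).
        self.h = (1.0 - z) * self.h + z * c
        return self.h

class Dense:
    """Fully connected layer followed by an activation function."""
    def __init__(self, n_in, n_out, rng, activation):
        s = 1.0 / np.sqrt(n_in)
        self.W = rng.uniform(-s, s, (n_out, n_in))
        self.b = np.zeros(n_out)
        self.activation = activation

    def __call__(self, x):
        return self.activation(self.W @ x + self.b)

class MaskNet:
    """[GRU(48), GRU(96), GRU(128), FC(512), FC(40)] with random (untrained) weights."""
    def __init__(self, n_features=40, seed=0):
        rng = np.random.default_rng(seed)
        sizes = [n_features, 48, 96, 128]
        self.grus = [GRULayer(a, b, rng) for a, b in zip(sizes[:-1], sizes[1:])]
        self.fc1 = Dense(128, 512, rng, lambda v: np.maximum(v, 0.0))  # ReLU is an assumption
        self.fc2 = Dense(512, 40, rng, sigmoid)  # sigmoid on the last layer, as in the text

    def step(self, features):
        """Map one frame of subband features to one vector of 40 subband gains."""
        h = features
        for gru in self.grus:
            h = gru.step(h)
        return self.fc2(self.fc1(h))
```

Calling `MaskNet().step(frame_features)` once per frame returns a vector of 40 values between 0 and 1, which is the form of mask consumed by step S300. Because every layer is causal, the network itself adds no look-ahead delay in this sketch.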
In an alternative embodiment, referring to fig. 5, the step S300 may include the following:
Step S310: inputting the mask and the noisy speech into the time domain decoder;
Step S320: filtering the noisy speech on different subbands with the mask by using the time domain decoder to obtain the enhanced speech.
The mask is a multi-dimensional mask representing the gain of each subband.
In an alternative embodiment, the time domain decoder is an IIR band-pass filter, an FIR filter, or the like.
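Steps S310 and S320 can be sketched as a bank of band-pass IIR filters whose outputs are weighted by the per-subband gains and summed. The sketch below is one possible reading, not the decoder of the application: the function names `bandpass_coeffs` and `time_domain_decode`, the biquad coefficient formulas (a widely used band-pass design with unity gain at the center frequency), the quality factor `q`, the 16 kHz sampling rate and the rule that each frame's gains are held for `hop` samples are all assumptions made for illustration.

```python
import numpy as np

def bandpass_coeffs(centers_hz, q, fs):
    """Return normalized biquad band-pass coefficients (b, a), one row per subband.

    Center frequencies must lie strictly between 0 and fs / 2.
    """
    w0 = 2.0 * np.pi * np.asarray(centers_hz, dtype=float) / fs
    alpha = np.sin(w0) / (2.0 * q)
    a0 = 1.0 + alpha
    b = np.stack([alpha, np.zeros_like(alpha), -alpha], axis=1) / a0[:, None]
    a = np.stack([np.ones_like(alpha), -2.0 * np.cos(w0) / a0, (1.0 - alpha) / a0], axis=1)
    return b, a

def time_domain_decode(noisy, masks, centers_hz, fs=16000, hop=160, q=4.0):
    """Filter `noisy` per subband and weight each band by its mask gain.

    masks has shape (n_frames, n_bands); the gains of frame t are applied to
    samples t * hop ... (t + 1) * hop - 1 (an assumed alignment).
    """
    b, a = bandpass_coeffs(centers_hz, q, fs)
    n_bands = b.shape[0]
    x1 = x2 = 0.0              # input history, shared by all bands
    y1 = np.zeros(n_bands)     # output history, one value per band
    y2 = np.zeros(n_bands)
    out = np.zeros(len(noisy))
    for n, x0 in enumerate(noisy):
        # Direct-form difference equation evaluated for all bands at once.
        y0 = b[:, 0] * x0 + b[:, 1] * x1 + b[:, 2] * x2 - a[:, 1] * y1 - a[:, 2] * y2
        gain = masks[min(n // hop, len(masks) - 1)]
        out[n] = np.dot(gain, y0)  # weighted sum of the subband signals
        x2, x1 = x1, x0
        y2, y1 = y1, y0
    return out
```

Each output sample depends only on the current and past input samples, so no synthesis window has to be buffered, which is the latency argument made for replacing the ISTFT. Note that with overlapping band-pass filters an all-ones mask does not reproduce the input exactly in this sketch; a practical design would have to choose the band layout and filter shapes with that in mind.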
In an alternative embodiment, referring to fig. 6, the step S100 may include the following:
Step S110: dividing the frequency domain of the noisy speech into a plurality of subbands.
Step S120: performing pre-emphasis, framing, windowing and Fourier transformation on the noisy speech to obtain the features of the noisy speech.
Specifically, pre-emphasis is performed on the noisy speech to enhance high-frequency information, framing and windowing are then performed, and a Fourier transform is finally applied to obtain the amplitude spectrum and phase spectrum of the noisy speech.
It is worth noting that features are extracted for each of the plurality of subbands of the noisy speech.
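The processing chain of steps S110 and S120 (pre-emphasis, framing, windowing, Fourier transform, per-subband features) can be sketched as follows. The application does not state the pre-emphasis coefficient, frame length, hop size, window type, subband layout or the exact feature computed per subband, so the values below (0.97, 320 samples, 160 samples, a Hann window, equal-width bands and log energy) and the function names are assumptions chosen only to make the example concrete.

```python
import numpy as np

def preemphasis(x, coeff=0.97):
    """First-order pre-emphasis y[n] = x[n] - coeff * x[n-1], boosting high frequencies."""
    return np.append(x[0], x[1:] - coeff * x[:-1])

def frame_signal(x, frame_len, hop):
    """Split x into overlapping frames; a trailing partial frame is dropped."""
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]

def subband_edges(n_bins, n_bands):
    """Assumed layout: contiguous, (nearly) equal-width groups of FFT bins."""
    return np.linspace(0, n_bins, n_bands + 1).astype(int)

def stft_subband_features(x, frame_len=320, hop=160, n_bands=40):
    """Return (features, magnitude, phase); features has shape (n_frames, n_bands)."""
    frames = frame_signal(preemphasis(x), frame_len, hop)
    frames = frames * np.hanning(frame_len)          # windowing
    spec = np.fft.rfft(frames, axis=1)               # Fourier transform per frame
    mag, phase = np.abs(spec), np.angle(spec)        # amplitude and phase spectra
    edges = subband_edges(mag.shape[1], n_bands)
    energy = np.stack(
        [(mag[:, lo:hi] ** 2).sum(axis=1) for lo, hi in zip(edges[:-1], edges[1:])],
        axis=1,
    )
    return np.log(energy + 1e-8), mag, phase         # log subband energy (assumed feature)
```

With one second of 16 kHz audio, `stft_subband_features(x)` yields one 40-dimensional feature vector per 10 ms hop, which matches the input dimension expected by the mask-estimating network in the example of this description.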
To help those skilled in the art better understand the present application, its implementation is illustrated below by way of example:
First, the frequency domain of the noisy signal is divided into 40 subbands, and 40-dimensional subband features of the noisy speech are extracted through STFT. The features are fed into a neural network whose structure is [GRU(48), GRU(96), GRU(128), FC(512), FC(40)]; the learnable parameters are updated by back propagation, and the last fully connected layer uses a sigmoid activation function to map the hidden-layer information to real values, yielding a 40-dimensional mask that represents the gain of each subband. The mask and the noisy speech are then sent to a first-order band-pass IIR time domain decoder for decoding, in which the noisy speech is filtered on the different subbands with the mask, finally yielding the enhanced speech.
By adopting this technical scheme, the amount of calculation and the system delay are reduced, the upper limit on signal separation imposed by the ISTFT is broken through, the power consumption of edge computing devices is reduced, and the high computational cost and reconstruction bottleneck caused by the ISTFT module are avoided.
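To give a rough sense of the model size implied by the example structure, the parameter count can be worked out as below. This is a back-of-the-envelope sketch under stated assumptions, not a figure from the application: the layers are assumed to be stacked sequentially with a 40-dimensional input, each GRU gate is assumed to have one bias vector (some libraries use two), and the helper names are hypothetical.

```python
def gru_params(n_in, n_hidden):
    """Three gates (update, reset, candidate), each with input weights,
    recurrent weights and a single bias vector (assumed convention)."""
    return 3 * (n_in * n_hidden + n_hidden * n_hidden + n_hidden)

def dense_params(n_in, n_out):
    """Weight matrix plus bias of a fully connected layer."""
    return n_in * n_out + n_out

def mask_net_params(n_features=40):
    """Total parameters of [GRU(48), GRU(96), GRU(128), FC(512), FC(40)]."""
    total = 0
    n_in = n_features
    for n_hidden in (48, 96, 128):
        total += gru_params(n_in, n_hidden)
        n_in = n_hidden
    total += dense_params(128, 512) + dense_params(512, 40)
    return total

def size_in_bytes(n_params, bytes_per_param=4):
    """Storage needed for the weights, assuming 32-bit floats by default."""
    return n_params * bytes_per_param
```

Under these assumptions the network has 227,544 parameters, which is roughly 0.9 MB when stored as 32-bit floats; quantized storage would be smaller still.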
Based on the same inventive concept, the embodiment of the present application also provides a real-time voice noise reduction device based on a mask time domain decoder, which can be used to implement the method described in the above embodiment, as described in the following embodiment. Because the principle of solving the problem of the real-time speech noise reduction device based on the mask time domain decoder is similar to that of the method, the implementation of the real-time speech noise reduction device based on the mask time domain decoder can be referred to the implementation of the method, and the repetition is omitted. As used below, the term "unit" or "module" may be a combination of software and/or hardware that implements the intended function. While the means described in the following embodiments are preferably implemented in software, implementation in hardware, or a combination of software and hardware, is also possible and contemplated.
FIG. 7 is a block diagram of a real-time speech noise reduction apparatus based on a mask time domain decoder according to an embodiment of the present application. As shown in FIG. 7, the real-time voice noise reduction device based on the mask time domain decoder specifically includes: a feature extraction module 10, an inference module 20 and a time domain decoding module 30.
The feature extraction module 10 extracts features of the noisy speech through STFT;
the inference module 20 inputs the extracted features into a pre-trained neural network to obtain a mask;
the time domain decoding module 30 inputs the mask and the noisy speech into a time domain decoder for decoding to obtain the enhanced speech.
By adopting this technical scheme, the ISTFT in neural-network mask-based noise reduction is replaced by a time domain decoder, which reduces system delay and breaks through the upper limit on signal separation imposed by the ISTFT.
The apparatus, module or unit set forth in the above embodiments may be implemented in particular by a computer chip or entity, or by a product having a certain function. A typical implementation device is an electronic device, which may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
In a typical example the electronic device comprises in particular a memory, a processor and a computer program stored on the memory and executable on the processor, said processor implementing the steps of the above described real-time speech noise reduction method based on a mask time domain decoder when said program is executed.
Referring now to FIG. 8, a schematic diagram of an electronic device 600 suitable for use in implementing embodiments of the present application is shown.
As shown in FIG. 8, the electronic device 600 includes a central processing unit (CPU) 601, which can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 602 or a program loaded from a storage section 608 into a random access memory (RAM) 603. The RAM 603 also stores various programs and data required for the operation of the electronic device 600. The CPU 601, the ROM 602, and the RAM 603 are connected to each other through a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.
The following components are connected to the I/O interface 605: an input section 606 including a keyboard, a mouse, and the like; an output section 607 including a cathode ray tube (CRT) or liquid crystal display (LCD), a speaker, and the like; a storage section 608 including a hard disk and the like; and a communication section 609 including a network interface card such as a LAN card, a modem, or the like. The communication section 609 performs communication processing via a network such as the Internet. A drive 610 is also connected to the I/O interface 605 as needed. A removable medium 611, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 610 as needed, so that a computer program read therefrom is installed into the storage section 608 as needed.
In particular, according to embodiments of the present application, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, embodiments of the present application include a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the above-described real-time speech noise reduction method based on a mask time domain decoder.
In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 609, and/or installed from the removable medium 611.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of storage media for a computer include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. Computer-readable media, as defined herein, do not include transitory computer-readable media (transmission media), such as modulated data signals and carrier waves.
For convenience of description, the above devices are described as being functionally divided into various units, respectively. Of course, the functions of each element may be implemented in the same piece or pieces of software and/or hardware when implementing the present application.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The application may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The application may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
In this specification, the embodiments are described in a progressive manner. Identical and similar parts of the embodiments may be cross-referenced, and each embodiment focuses on its differences from the other embodiments. In particular, the system embodiments are described relatively briefly because they are substantially similar to the method embodiments; for relevant details, refer to the corresponding parts of the description of the method embodiments.
The foregoing is merely exemplary of the present application and is not intended to limit the present application. Various modifications and variations of the present application will be apparent to those skilled in the art. Any modification, equivalent replacement, improvement, etc. which come within the spirit and principles of the application are to be included in the scope of the claims of the present application.
Claims (5)
1. A real-time speech noise reduction method based on a mask time domain decoder, comprising:
extracting features of noisy speech through a short-time Fourier transform (STFT);
inputting the extracted features into a pre-trained neural network to obtain a mask;
inputting the mask and the noisy speech into a time domain decoder for decoding to obtain enhanced speech;
wherein inputting the mask and the noisy speech into the time domain decoder for decoding to obtain the enhanced speech comprises:
inputting the mask and the noisy speech into the time domain decoder;
filtering the noisy speech with the mask on different sub-bands using the time domain decoder to obtain the enhanced speech;
the time domain decoder is an IIR band-pass filter or an FIR filter; the neural network structure is [GRU(48), GRU(96), GRU(128), FC(512), FC(40)], where the GRU is a gated recurrent neural network and the FC is a fully connected layer; the learnable parameters are updated by back propagation; the activation function of the last fully connected layer is sigmoid, which maps hidden-layer space information to a real space;
the frequency domain of the noisy speech is divided into a plurality of sub-bands, the features are extracted from the plurality of sub-bands of the noisy speech, and the mask is a multi-dimensional mask representing the gain of each sub-band.
2. The real-time speech noise reduction method based on a mask time domain decoder according to claim 1, wherein extracting the features of the noisy speech through the STFT comprises:
performing pre-emphasis, framing, windowing and Fourier transformation on the noisy speech to obtain the features of the noisy speech.
3. A real-time speech noise reduction apparatus based on a mask time domain decoder, comprising:
a feature extraction module, configured to extract features of noisy speech through a short-time Fourier transform (STFT);
an inference module, configured to input the extracted features into a pre-trained neural network to obtain a mask;
a time domain decoding module, configured to input the mask and the noisy speech into a time domain decoder for decoding to obtain enhanced speech;
wherein the time domain decoding module is configured to:
input the mask and the noisy speech into the time domain decoder;
filter the noisy speech with the mask on different sub-bands using the time domain decoder to obtain the enhanced speech;
the time domain decoder is an IIR band-pass filter or an FIR filter; the neural network structure is [GRU(48), GRU(96), GRU(128), FC(512), FC(40)], where the GRU is a gated recurrent neural network and the FC is a fully connected layer; the learnable parameters are updated by back propagation; the activation function of the last fully connected layer is sigmoid, which maps hidden-layer space information to a real space; the frequency domain of the noisy speech is divided into a plurality of sub-bands, the features are extracted from the plurality of sub-bands of the noisy speech, and the mask is a multi-dimensional mask representing the gain of each sub-band.
4. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor, when executing the program, implements the steps of the real-time speech noise reduction method based on a mask time domain decoder according to any one of claims 1 to 2.
5. A computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the real-time speech noise reduction method based on a mask time domain decoder according to any one of claims 1 to 2.
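For illustration only, the sub-band filtering step recited in claim 1 might be sketched as follows. This is a minimal sketch under stated assumptions, not the patented implementation: the claims do not specify the filter design, so the uniform sub-band layout, the Hamming-windowed-sinc FIR design, the 129-tap length, and the function names `bandpass_fir`, `uniform_bands` and `decode` are all hypothetical choices. The mask is taken as given (one gain per sub-band, as a network with a sigmoid output would produce), and the demo uses 8 bands rather than the 40 outputs of FC(40) so that each band is wide enough for a short filter. Only the Python standard library is used.

```python
import math


def bandpass_fir(low_hz, high_hz, sample_rate, num_taps):
    """Windowed-sinc band-pass FIR taps (Hamming window); num_taps must be odd."""
    mid = (num_taps - 1) // 2
    taps = []
    for n in range(num_taps):
        k = n - mid
        if k == 0:
            ideal = 2.0 * (high_hz - low_hz) / sample_rate
        else:
            ideal = (math.sin(2.0 * math.pi * high_hz * k / sample_rate)
                     - math.sin(2.0 * math.pi * low_hz * k / sample_rate)) / (math.pi * k)
        window = 0.54 - 0.46 * math.cos(2.0 * math.pi * n / (num_taps - 1))
        taps.append(ideal * window)
    return taps


def uniform_bands(num_bands, sample_rate):
    """Assumed layout: equal-width sub-bands covering 0 Hz up to the Nyquist frequency."""
    nyquist = sample_rate / 2.0
    return [(b * nyquist / num_bands, (b + 1) * nyquist / num_bands)
            for b in range(num_bands)]


def decode(noisy, mask, sample_rate, num_taps=129):
    """Apply one gain per sub-band in the time domain; returns a list as long as `noisy`."""
    bands = uniform_bands(len(mask), sample_rate)
    # Filtering is linear, so the gain-weighted band filters can be summed
    # into a single combined FIR before touching the signal.
    combined = [0.0] * num_taps
    for gain, (low, high) in zip(mask, bands):
        for n, tap in enumerate(bandpass_fir(low, high, sample_rate, num_taps)):
            combined[n] += gain * tap
    # Causal convolution: the output lags the input by (num_taps - 1) // 2 samples.
    enhanced = []
    for t in range(len(noisy)):
        acc = 0.0
        for n in range(min(num_taps, t + 1)):
            acc += combined[n] * noisy[t - n]
        enhanced.append(acc)
    return enhanced


if __name__ == "__main__":
    fs = 16000
    # A 500 Hz tone standing in for speech plus a 3500 Hz tone standing in for noise.
    noisy = [math.sin(2 * math.pi * 500 * t / fs) + math.sin(2 * math.pi * 3500 * t / fs)
             for t in range(1600)]
    mask = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]  # keep only the 0-1000 Hz band
    enhanced = decode(noisy, mask, fs)
    print(len(enhanced))
```

Because the band filters tile the spectrum, an all-ones mask in this sketch reduces to a pure delay of `(num_taps - 1) // 2` samples, which is a convenient sanity check. In a real-time setting the mask would change from frame to frame, so the combined filter would be recomputed (or interpolated) per frame; that bookkeeping is omitted here.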
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110299114.0A CN113096682B (en) | 2021-03-20 | 2021-03-20 | Real-time voice noise reduction method and device based on mask time domain decoder |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113096682A CN113096682A (en) | 2021-07-09 |
CN113096682B CN113096682B (en) | 2023-08-29 |
Family
ID=76668750
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110299114.0A Active CN113096682B (en) | 2021-03-20 | 2021-03-20 | Real-time voice noise reduction method and device based on mask time domain decoder |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113096682B (en) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113096648A (en) * | 2021-03-20 | 2021-07-09 | 杭州知存智能科技有限公司 | Real-time decoding method and device for speech recognition |
CN113705411A (en) * | 2021-08-20 | 2021-11-26 | 珠海格力电器股份有限公司 | Method and device for reducing noise of waveform signal, electronic equipment and storage medium |
CN113658605B (en) * | 2021-10-18 | 2021-12-17 | 成都启英泰伦科技有限公司 | Speech enhancement method based on deep learning assisted RLS filtering processing |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105792074A (en) * | 2016-02-26 | 2016-07-20 | 西北工业大学 | Voice signal processing method and device |
CN106340292A (en) * | 2016-09-08 | 2017-01-18 | 河海大学 | Voice enhancement method based on continuous noise estimation |
US9640194B1 (en) * | 2012-10-04 | 2017-05-02 | Knowles Electronics, Llc | Noise suppression for speech processing based on machine-learning mask estimation |
CN110136737A (en) * | 2019-06-18 | 2019-08-16 | 北京拙河科技有限公司 | A kind of voice de-noising method and device |
CN110415687A (en) * | 2019-05-21 | 2019-11-05 | 腾讯科技(深圳)有限公司 | Method of speech processing, device, medium, electronic equipment |
WO2020125376A1 (en) * | 2018-12-18 | 2020-06-25 | 腾讯科技(深圳)有限公司 | Voice denoising method and apparatus, computing device and computer readable storage medium |
CN111508518A (en) * | 2020-05-18 | 2020-08-07 | 中国科学技术大学 | Single-channel speech enhancement method based on joint dictionary learning and sparse representation |
CN111785288A (en) * | 2020-06-30 | 2020-10-16 | 北京嘀嘀无限科技发展有限公司 | Voice enhancement method, device, equipment and storage medium |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2020041497A1 (en) * | 2018-08-21 | 2020-02-27 | 2Hz, Inc. | Speech enhancement and noise suppression systems and methods |
Also Published As
Publication number | Publication date |
---|---|
CN113096682A (en) | 2021-07-09 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||