WO2022134351A1

WO2022134351A1 - Noise reduction method and system for monophonic speech, and device and readable storage medium

Info

Publication number: WO2022134351A1
Application number: PCT/CN2021/083652
Authority: WO
Inventors: 王健宗; 程宁; 张之勇
Original assignee: 平安科技（深圳）有限公司
Priority date: 2020-12-22
Filing date: 2021-03-29
Publication date: 2022-06-30
Also published as: CN112614504A

Abstract

A noise reduction method and system for monophonic speech, and a device and a readable storage medium. The method comprises: acquiring monophonic speech to be subjected to noise reduction (S1); building an initial noise reduction model based on an LSTM neural network (S2); acquiring a preset number of enhanced training samples, and using the preset number of enhanced training samples to train the initial noise reduction model, so as to obtain a noise reduction model (S3); and reducing noise of said monophonic speech by means of the noise reduction model to obtain human voice audio (S4). By means of training a model on the basis of an LSTM neural network, learning a noise rule of a whole time sequence is facilitated, thus achieving a better noise reduction effect. Moreover, on the basis of the complexity of a noise influence factor, an initial noise reduction model is trained again by means of enhanced training samples, such that the noise reduction effect of the noise reduction model is further improved.

Description

Monophonic speech noise reduction method, system, device and readable storage medium

This application claims priority to the Chinese patent application No. 202011534575.3, entitled "Monophonic speech noise reduction method, system, device and readable storage medium", filed with the China Patent Office on December 22, 2020, the entire contents of which are incorporated herein by reference.

Technical field

The present application belongs to the technical field of speech noise reduction, and in particular relates to a method, system, device and readable storage medium for monophonic speech noise reduction.

Background art

A few years ago, calls were often marred by various noises, which greatly affected call quality. With the popularity of smartphones, noise during calls has been noticeably reduced. This is because most current smartphones use Qualcomm chips, and most of these chips are equipped with Qualcomm's proprietary CVC technology. CVC is a call noise reduction technology in which two microphones built into the mobile phone capture the sound. The main microphone is near the speaker's mouth and receives the speaker's voice at a higher level; the secondary microphone is farther from the speaker's mouth and receives the speaker's voice at a lower level, while both microphones receive almost the same amount of ambient noise. By combining the sound signals collected by the main and secondary microphones, an algorithm can distinguish which sounds are the desired human voice, thereby achieving noise reduction during calls.

However, the inventors realized that this technology still has the following drawbacks. First, it cannot handle monophonic audio: it requires the mobile phone to have dual microphones and has no effect on single-microphone mobile phones. Moreover, it places requirements on the speaker's posture during a call, requiring the speaker's mouth to be very close to the main microphone. It therefore cannot be used when the speaker is far from the microphone or uses a headset with a single microphone.

Technical problem

One of the purposes of the embodiments of the present application is to provide a method, system, device, and readable storage medium for monophonic speech noise reduction, aiming to solve the technical problem that the existing noise reduction technology for calls cannot handle monophonic audio.

Technical solutions

In order to solve the above-mentioned technical problems, the technical solutions adopted in the embodiments of the present application are:

A first aspect of the embodiments of the present application provides a method for noise reduction of monophonic speech, including:

Obtaining the monophonic speech to be denoised; constructing an initial noise reduction model based on an LSTM neural network; obtaining a preset number of enhanced training samples and using them to train the initial noise reduction model to obtain a noise reduction model; and denoising the monophonic speech to be denoised by means of the noise reduction model to obtain human voice audio.

A second aspect of the embodiments of the present application provides a monophonic speech noise reduction system, including:

an acquisition module, used to acquire the monophonic speech to be denoised;

a model building module, used to build an initial noise reduction model based on an LSTM neural network;

an enhanced training module, used to obtain a preset number of enhanced training samples and to train the initial noise reduction model with the preset number of enhanced training samples to obtain a noise reduction model; and

a noise reduction module, used to denoise the monophonic speech to be denoised by means of the noise reduction model to obtain human voice audio.

A third aspect of the embodiments of the present application provides a terminal device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor, when executing the computer program, implements the steps of the monophonic speech noise reduction method of the first aspect.

A fourth aspect of the embodiments of the present application provides a computer-readable storage medium, where the computer-readable storage medium stores a computer program, and the computer program, when executed by a processor, implements the steps of the monophonic speech noise reduction method of the first aspect.

A fifth aspect of the embodiments of the present application further provides a computer program product which, when run on a terminal device, enables the terminal device to implement the steps of the monophonic speech noise reduction method of the first aspect.

Beneficial effects

Compared with the prior art, the embodiments of the present application include the following advantages:

In the embodiments of the present application, after the monophonic speech to be denoised is obtained, an initial noise reduction model is constructed, a preset number of enhanced training samples are obtained and used to train the initial noise reduction model to obtain a noise reduction model, and the noise reduction model is then used directly to denoise the monophonic speech and obtain human voice audio. The noise reduction process is not restricted to two channels and can denoise any monophonic speech. The noise reduction model is obtained by training the initial noise reduction model with the enhanced training samples, and the initial noise reduction model is constructed by training an LSTM neural network with mixed audio. Training the model on the basis of the LSTM neural network makes it easier to learn the noise pattern of the entire time series and thus achieves a better noise reduction effect. In addition, because the factors that influence noise are complex, it is difficult for the training data to cover all possible noise environments, so the initial noise reduction model is not robust to some noises. Through enhanced training, the noise types that were not learned are learned again, which further improves the noise reduction effect of the noise reduction model.

Description of drawings

In order to illustrate the technical solutions in the embodiments of the present application more clearly, the following briefly introduces the accompanying drawings used in the description of the embodiments or exemplary technologies. Obviously, the drawings in the following description show only some embodiments of the present application; those of ordinary skill in the art can obtain other drawings from these drawings without any creative effort.

FIG. 1 is a flowchart of a method for noise reduction of monophonic speech according to an embodiment of the present application;

FIG. 2 is a block diagram of an initial noise reduction model training process according to an embodiment of the present application;

FIG. 3 is a block diagram of a noise reduction model training process according to an embodiment of the present application;

FIG. 4 is a structural block diagram of a monophonic speech noise reduction system according to an embodiment of the present application;

FIG. 5 is a structural block diagram of a terminal device according to an embodiment of the present application.

Embodiments of the present invention

In order to enable those skilled in the art to better understand the solutions of the present application, the technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some of the embodiments of the present application, not all of them. Based on the embodiments in the present application, all other embodiments obtained by those of ordinary skill in the art without creative work shall fall within the scope of protection of the present application.

It should be noted that the terms "first", "second", etc. in the description and claims of the present application and in the above drawings are used to distinguish similar objects, and are not necessarily used to describe a specific order or sequence. It is to be understood that data so used may be interchanged under appropriate circumstances, so that the embodiments of the application described herein can be practiced in sequences other than those illustrated or described herein. Furthermore, the terms "comprising" and "having" and any variations thereof are intended to cover non-exclusive inclusion. For example, a process, method, system, product or device comprising a series of steps or units is not necessarily limited to the steps or units expressly listed, but may include other steps or units that are not expressly listed or that are inherent to the process, method, product or device.

The application will be described in further detail below in conjunction with the accompanying drawings:

Referring to FIG. 1, in an embodiment of the present application, a method for noise reduction of monophonic speech is provided. The method achieves effective noise reduction of monophonic speech and is well suited to calls made with single-microphone mobile phones and single-microphone headsets. The monophonic speech noise reduction method includes the following steps:

S1: Obtain the monophonic voice to be denoised.

Specifically, when a single-microphone mobile phone or a single-microphone headset is used for a call, the monophonic speech is captured in real time through the microphone; the monophonic speech here may also be monophonic speech downloaded from the Internet. When noise reduction is applied to call speech, it may be performed before the monophonic speech is sent, when the monophonic speech is delivered to the corresponding call device, or at both points.

S2: Build an initial denoising model based on LSTM neural network.

Specifically, in order to obtain a noise reduction model for monophonic speech noise reduction, an initial noise reduction model must first be obtained. A call takes place as a time series, and in many cases the noise category is roughly the same throughout a given speech recording (for example, wind noise throughout). In this embodiment, therefore, an LSTM (long short-term memory) network is used to build the initial model, which makes it easier to learn the noise pattern of the entire time series, and the LSTM neural network model is then trained with mixed audio to obtain the initial noise reduction model. Specifically, referring to FIG. 2, this includes the following steps:

S201: Collect a large amount of pure human voice audio and pure noise audio, and randomly combine the human voice audio and noise audio to obtain several mixed audios. To meet the noise reduction requirements, each mixed audio includes human voice audio and at least one noise audio, and may also include multiple human voice audios.
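The random combination in S201 can be illustrated with a minimal sketch. The source only states that voice and noise clips are combined at random; mixing at a randomly drawn signal-to-noise ratio, the 0 to 20 dB range, and the names `mix_at_snr` and `make_mixtures` are assumptions made here for illustration.

```python
import numpy as np

def mix_at_snr(voice, noise, snr_db):
    """Add noise to voice, scaling the noise so the mix has the requested SNR in dB."""
    # Tile or trim the noise so it covers the whole voice clip.
    reps = int(np.ceil(len(voice) / len(noise)))
    noise = np.tile(noise, reps)[: len(voice)]
    voice_power = np.mean(voice ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    # Gain that brings the noise power to voice_power / 10^(snr_db / 10).
    gain = np.sqrt(voice_power / (noise_power * 10 ** (snr_db / 10)))
    return voice + gain * noise

def make_mixtures(voices, noises, count, rng):
    """Randomly pair voice and noise clips; keep the clean voice as the training target."""
    pairs = []
    for _ in range(count):
        voice = voices[rng.integers(len(voices))]
        noise = noises[rng.integers(len(noises))]
        snr_db = rng.uniform(0.0, 20.0)  # assumed SNR range, not taken from the source
        pairs.append((mix_at_snr(voice, noise, snr_db), voice))
    return pairs

rng = np.random.default_rng(0)
t = np.arange(16000) / 16000.0
voices = [np.sin(2 * np.pi * 220.0 * t)]  # stand-in for a clean voice clip
noises = [rng.standard_normal(8000)]      # stand-in for a noise clip
mixtures = make_mixtures(voices, noises, count=4, rng=rng)
```

Keeping the clean voice alongside each mixture gives the "actual vocal audio" that the training error is later measured against.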

S202: Perform framing, windowing and Fourier transform on the mixed audio to obtain several mixed audio frame spectra. Specifically, the mixed audio is first divided into frames; in this embodiment, a frame length of 25 ms and a frame shift of 10 ms are used. After framing, each frame of the mixed audio is windowed. The window function used in windowing generally has low-pass characteristics, and its purpose is to reduce leakage in the frequency domain. Window functions commonly used in speech signal analysis are the rectangular window, the Hamming window and the Hanning window, and different window functions can be chosen for different situations. Then, through Fourier transform, the framed and windowed mixed audio is converted from the time domain to the frequency domain, mapping the time-domain features into spectral features of a certain dimension, such as several basic sine waves, to obtain several mixed audio frame spectra.
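As a sketch of S202, the following uses NumPy to cut a signal into 25 ms frames with a 10 ms shift, apply a Hamming window, and take the real FFT of each frame. The 16 kHz sample rate, the choice of the Hamming window among the three named above, and the function name `frame_spectra` are assumptions.

```python
import numpy as np

def frame_spectra(signal, sample_rate, frame_ms=25.0, shift_ms=10.0):
    """Split a signal into overlapping frames, window each frame, and FFT it."""
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    shift = int(round(sample_rate * shift_ms / 1000.0))
    if len(signal) < frame_len:
        # Zero-pad very short inputs so at least one frame exists.
        signal = np.pad(signal, (0, frame_len - len(signal)))
    n_frames = 1 + (len(signal) - frame_len) // shift
    # Row i holds the sample indices of frame i.
    idx = np.arange(frame_len)[None, :] + shift * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)
    # One row of complex spectral bins per frame.
    return np.fft.rfft(frames, axis=1)

sample_rate = 16000  # assumed sample rate
t = np.arange(sample_rate) / sample_rate
spectra = frame_spectra(np.sin(2 * np.pi * 440.0 * t), sample_rate)
```

At 16 kHz a 25 ms frame holds 400 samples, so each frame yields 201 frequency bins spaced 40 Hz apart.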

S203: Divide the mixed audio frame spectra into a training set and a test set, establish an LSTM neural network model for binary classification, train the LSTM neural network model with the training set, and test the trained LSTM neural network model with the test set; once the test is passed, the initial noise reduction model is obtained.

Specifically, the mixed audio frame spectra are randomly divided into a training set and a test set. The ratio of mixed audio frame spectra in the training set to those in the test set can be customized. During iterative training, mixed audio frame spectra may also be exchanged between the training set and the test set.
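A minimal sketch of the random division, using only the standard library. The 80/20 ratio, the fixed seed, and the name `split_dataset` are illustrative assumptions; the source leaves the ratio to the implementer.

```python
import random

def split_dataset(items, train_ratio=0.8, seed=0):
    """Shuffle the items and cut them into a training set and a test set."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]

# Integers stand in for mixed audio frame spectra here.
train_set, test_set = split_dataset(range(10), train_ratio=0.8)
```

Calling the function again with a different seed is one simple way to exchange items between the two sets across training rounds.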

Then an LSTM neural network model for binary classification is established. An existing LSTM neural network model generally includes an input layer, several hidden layers and an output layer. The mixed audio frame spectrum is fed to the input layer, several hidden layers are configured, and the separated vocal spectrum and noise spectrum serve as the binary classification result of the output layer. The constructed LSTM neural network model is then trained with the training set, and the trained model is tested with the test set. When the pass rate of the test result meets the preset pass rate threshold, the test is passed and the initial noise reduction model is obtained. Here, the pass rate of the test result is the ratio of the number of mixed audio frame spectra in the test set from which clear vocal audio is obtained after passing through the trained LSTM neural network model to the total number of mixed audio frame spectra in the test set.
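The pass-rate check can be sketched as follows. The source does not say how "clear vocal audio" is judged, so this sketch assumes a hypothetical per-frame quality score where higher is better; `quality_threshold`, `pass_rate_threshold` and both function names are illustrative.

```python
def pass_rate(scores, quality_threshold):
    """Fraction of test frames whose denoised output counts as clear."""
    passed = sum(1 for score in scores if score >= quality_threshold)
    return passed / len(scores)

def model_passes_test(scores, quality_threshold, pass_rate_threshold):
    """True when the pass rate meets the preset pass rate threshold."""
    return pass_rate(scores, quality_threshold) >= pass_rate_threshold

# Hypothetical quality scores for four test frames.
example_scores = [0.9, 0.8, 0.4, 0.95]
example_rate = pass_rate(example_scores, quality_threshold=0.7)
```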

Among them, the specific method of training the LSTM neural network model through the training set is:

Input the mixed audio frame spectra in the training set into the constructed LSTM neural network model, output the predicted vocal spectrum and noise spectrum, and obtain the predicted vocal audio and noise audio through inverse Fourier transform. Then, according to the error between the predicted vocal audio and the actual vocal audio, the error is back-propagated and the parameters of the LSTM neural network model, including the weights of the forget gate, the input gate and the output gate, are iteratively updated by the stochastic gradient descent algorithm until the number of training iterations reaches a preset value or the error between the predicted vocal audio and the actual vocal audio no longer decreases.
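To make the gate weights mentioned above concrete, the following sketch shows one forward step of a standard LSTM cell in NumPy, plus the plain gradient descent update rule. It is not the applicant's network: the weight layout, the sizes `D` and `H`, and the names `lstm_step` and `sgd_update` are assumptions, and back-propagation itself is omitted.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM time step. W has shape (4*H, D+H); rows are ordered f, i, o, g."""
    H = h_prev.shape[0]
    z = W @ np.concatenate([x, h_prev]) + b
    f = sigmoid(z[0:H])            # forget gate: how much old cell state to keep
    i = sigmoid(z[H:2 * H])        # input gate: how much new candidate to admit
    o = sigmoid(z[2 * H:3 * H])    # output gate: how much cell state to expose
    g = np.tanh(z[3 * H:4 * H])    # candidate cell state
    c = f * c_prev + i * g
    h = o * np.tanh(c)
    return h, c

def sgd_update(param, grad, learning_rate):
    """Plain gradient descent step applied to one parameter array."""
    return param - learning_rate * grad

rng = np.random.default_rng(0)
D, H = 201, 8  # 201 spectral bins per frame, 8 hidden units (both assumed)
W = rng.standard_normal((4 * H, D + H)) * 0.01
b = np.zeros(4 * H)
h, c = np.zeros(H), np.zeros(H)
for frame in rng.standard_normal((5, D)):  # five stand-in frame feature vectors
    h, c = lstm_step(frame, h, c, W, b)
```

The cell state `c` carried from frame to frame is what lets the network accumulate evidence about noise that persists across the whole recording.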

Preferably, when the LSTM neural network model is trained with the training set, an annealing algorithm is used to update the learning rate of the LSTM neural network model. The learning rate is a hyperparameter that determines how the gradient of the loss function is used to adjust the model weights in gradient descent. If the learning rate is too large, the update may overshoot the global optimum; if it is too small, the loss function changes very slowly, which greatly increases the cost of convergence and makes it easy to become trapped in a local minimum or saddle point. Letting the learning rate change over time through the annealing algorithm addresses this problem well.
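The source does not specify which annealing schedule is used. As one common possibility, the sketch below shows cosine annealing, where the learning rate starts high and decays smoothly to a floor; the function name and the values of `lr_max` and `lr_min` are assumptions.

```python
import math

def cosine_annealed_lr(step, total_steps, lr_max=0.01, lr_min=0.0001):
    """Learning rate that decays from lr_max to lr_min along a half cosine."""
    progress = min(step, total_steps) / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))

# Learning rate at every tenth step of a 100-step run.
schedule = [cosine_annealed_lr(step, 100) for step in range(0, 101, 10)]
```

Large early steps move quickly across flat regions of the loss, while small late steps avoid overshooting once the model is near a minimum.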

Then, the trained LSTM neural network model is tested with the test set: the mixed audio frame spectra in the test set are input into the trained LSTM neural network model. When the noise reduction pass rate of the mixed audio frame spectra in the test set is within the preset range, the trained LSTM neural network model is considered to have passed the test, and the initial noise reduction model is obtained.

S3: Obtain a preset number of enhanced training samples, train an initial noise reduction model by using the preset number of enhanced training samples, and obtain a noise reduction model.

Specifically, the initial noise reduction model can meet the noise reduction requirements for monophonic speech in some scenarios. However, the inventors found in actual work that, because of the complexity of factors such as noise type, hardware and microphone recording distance, it is difficult for the training data to cover all possible noise environments, so the initial noise reduction model is not robust to some noises. When the initial noise reduction model performs noise reduction on such noises, the intelligibility and clarity of the denoised vocal audio are unsatisfactory. In this case, the initial noise reduction model needs to learn the noise patterns from the specific noise data, that is, the initial noise reduction model undergoes enhanced training. The purpose is to learn the noise types that were not learned before, so that the patterns of this type of noise are recognized and noise reduction is achieved. On this basis, in this embodiment, monophonic speech for which noise reduction by the initial noise reduction model is unqualified is used as the enhanced training samples, and the final noise reduction model is obtained by training the initial noise reduction model with the enhanced training samples.

Specifically, this embodiment provides a method for acquiring enhanced training samples. On the premise that noise is continuous, when monophonic speech is denoised by the initial noise reduction model and similar waveforms occur in the resulting vocal audio more than a preset number of times, with the energy of the similar waveforms greater than a preset energy threshold, the noise reduction of that monophonic speech is considered unqualified, and such unqualified monophonic speech is used as enhanced training samples. Different users generate different enhanced training samples.
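The source does not define how waveform similarity is measured. The sketch below is one possible reading: the denoised output is cut into fixed-length segments, segments above the energy threshold are compared by cosine similarity, and the output is flagged when any segment resembles more than a preset number of others. The function name, the segment length and all thresholds are assumptions.

```python
import numpy as np

def denoising_failed(audio, segment_len, similarity=0.9, max_repeats=3,
                     energy_threshold=1e-3):
    """Flag denoised audio that still contains a loud, repeating waveform."""
    n = len(audio) // segment_len
    segments = audio[: n * segment_len].reshape(n, segment_len)
    energy = np.mean(segments ** 2, axis=1)
    loud = segments[energy > energy_threshold]
    if len(loud) == 0:
        return False
    # Cosine similarity between every pair of loud segments.
    unit = loud / np.linalg.norm(loud, axis=1, keepdims=True)
    cosine = unit @ unit.T
    # For each segment, count the other segments that resemble it.
    repeats = np.sum(cosine > similarity, axis=1) - 1
    return bool(np.max(repeats) > max_repeats)

t = np.arange(1600) / 16000.0
hum = 0.1 * np.sin(2 * np.pi * 100.0 * t)  # stand-in for residual periodic noise
flag_hum = denoising_failed(hum, segment_len=160)
flag_silence = denoising_failed(np.zeros(1600), segment_len=160)
```

A fixed segment grid only catches repetitions that line up with the grid; a sliding comparison would be more thorough at a higher cost.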

Specifically, referring to Figure 3, the specific method for training the initial noise reduction model by using the enhanced training samples is as follows:

Obtain several monophonic speech samples for which noise reduction by the initial noise reduction model is unqualified, as the preset number of enhanced training samples; obtain several test samples formed by combining vocal audio and noise audio; and train the initial noise reduction model with the preset number of enhanced training samples by unsupervised learning, until the noise reduction effect of the trained initial noise reduction model on the test samples is within a preset error of the noise reduction effect of the initial noise reduction model on the test samples, and the noise reduction effect of the trained initial noise reduction model on the enhanced training samples exceeds that of the initial noise reduction model on the enhanced training samples by more than a preset threshold, whereupon the noise reduction model is obtained.

If the initial noise reduction model after enhanced training shows a large loss in noise reduction effect on the test samples, or shows no obvious improvement in noise reduction on the enhanced training samples, the initial noise reduction model is not updated, the enhanced training is considered to have failed, and enhanced training samples are obtained again for training.
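The two acceptance conditions can be written as a small check. The source does not name a metric for "noise reduction effect", so this sketch assumes a scalar score where higher is better; the function and parameter names are illustrative.

```python
def accept_enhanced_model(base_test, new_test, base_enhanced, new_enhanced,
                          max_test_drift, min_gain):
    """Decide whether the model after enhanced training should replace the initial one."""
    # Condition 1: performance on the ordinary test samples stays within the preset error.
    test_preserved = abs(new_test - base_test) <= max_test_drift
    # Condition 2: performance on the enhanced samples improves by more than the threshold.
    enhanced_improved = (new_enhanced - base_enhanced) > min_gain
    return test_preserved and enhanced_improved

# Hypothetical scores: small drift on the test set, clear gain on the enhanced samples.
accepted = accept_enhanced_model(0.90, 0.89, 0.40, 0.70,
                                 max_test_drift=0.02, min_gain=0.10)
```

When the check returns False, the initial model is kept unchanged and new enhanced training samples are collected, as described above.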

When the initial noise reduction model is trained with the preset number of enhanced training samples, the parameters of several hidden layers in the initial noise reduction model are fixed. Because this training is an unsupervised learning process, some hidden layers of the initial noise reduction model need to be fixed, with no parameter adjustment during enhanced training, to prevent the model from overfitting. As for which hidden layers to fix, all hidden layers except the classification layer are generally selected.
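Fixing the hidden layers amounts to skipping their parameters during the update. The sketch below shows this with a dictionary of NumPy arrays; the layer names, the learning rate and the function name `enhanced_training_step` are hypothetical.

```python
import numpy as np

def enhanced_training_step(params, grads, trainable, learning_rate=0.01):
    """Apply one gradient step, leaving every layer not listed in `trainable` untouched."""
    updated = {}
    for name, value in params.items():
        if name in trainable:
            updated[name] = value - learning_rate * grads[name]
        else:
            updated[name] = value.copy()  # frozen layer: parameters are carried over
    return updated

# Hypothetical layer names: two LSTM hidden layers and one classification layer.
params = {"lstm_1": np.ones((2, 2)), "lstm_2": np.ones((2, 2)),
          "classifier": np.ones((2, 2))}
grads = {name: np.full((2, 2), 0.5) for name in params}
new_params = enhanced_training_step(params, grads, trainable={"classifier"})
```

Only the classification layer moves, so the features learned from the large mixed-audio training set are preserved while the output adapts to the new noise type.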

Preferably, when the number of enhanced training samples exceeds a certain number, several random nonlinear layers are added between the hidden layers and the classification layer of the initial noise reduction model. The purpose of adding the random nonlinear layers is to enable the model to model such noise patterns at a higher level and thereby obtain better training results.

S4: The monophonic speech to be denoised is denoised by the denoising model to obtain human voice audio.

Specifically, in this embodiment, Fourier transform is performed on the monophonic speech to be denoised to obtain the call speech spectrum, which serves as the extracted feature; the call speech spectrum is input into the preset noise reduction model to obtain the vocal spectrum and the noise spectrum; and inverse Fourier transform is performed on the vocal spectrum to obtain the vocal audio.
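The inference path of S4 can be sketched end to end. The trained model is replaced by a placeholder callable that returns a vocal spectrum and a noise spectrum; the frame sizes (25 ms and 10 ms at an assumed 16 kHz), the Hamming window, the overlap-add reconstruction and the name `denoise` are assumptions for illustration.

```python
import numpy as np

def denoise(signal, model, frame_len=400, shift=160):
    """Transform to spectra, apply the model, and rebuild audio by overlap-add.

    The signal must hold at least frame_len samples.
    """
    window = np.hamming(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // shift
    idx = np.arange(frame_len)[None, :] + shift * np.arange(n_frames)[:, None]
    spectra = np.fft.rfft(signal[idx] * window, axis=1)
    # The model separates each frame spectrum into a vocal part and a noise part.
    voice_spectra, _noise_spectra = model(spectra)
    frames = np.fft.irfft(voice_spectra, n=frame_len, axis=1)
    out = np.zeros(len(signal))
    weight = np.zeros(len(signal))
    for k in range(n_frames):
        start = k * shift
        out[start:start + frame_len] += frames[k]
        weight[start:start + frame_len] += window
    # Undo the window weighting wherever at least one frame contributed.
    covered = weight > 1e-8
    out[covered] /= weight[covered]
    return out

def identity_model(spectra):
    """Placeholder for the trained model: treats everything as voice."""
    return spectra, np.zeros_like(spectra)

t = np.arange(16000) / 16000.0
speech = np.sin(2 * np.pi * 300.0 * t)  # stand-in for call speech
restored = denoise(speech, identity_model)
```

With the placeholder model the output reproduces the input over the framed region, which is a quick way to confirm that the transform and its inverse are consistent before a real model is plugged in.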

To sum up, in the monophonic speech noise reduction method of the present application, after the monophonic speech to be denoised is obtained, an initial noise reduction model is constructed, a preset number of enhanced training samples are obtained and used to train the initial noise reduction model to obtain the noise reduction model, and the noise reduction model is used directly to denoise the monophonic speech and obtain human voice audio. The noise reduction model is obtained by training the initial noise reduction model with the enhanced training samples, and the initial noise reduction model is constructed by training an LSTM neural network model with mixed audio. Because the noise category is roughly the same over the entire time series of a given speech recording, training the model on the basis of the LSTM neural network makes it easier to learn the noise pattern of the entire time series and thus achieves a better noise reduction effect. In addition, because the factors that influence noise are complex, it is difficult for the training data to cover all possible noise environments, so the initial noise reduction model is not robust to some noises. Through enhanced training, the noise types that were not learned are learned again, which further improves the noise reduction effect of the noise reduction model.

The greatest advantage of the monophonic speech noise reduction method lies in being monophonic, which gives it a wide range of application and good compatibility. Applied to mobile phone calls, the method does not require the mobile phone to have dual microphones and is suitable for any mobile phone with a microphone. Mobile phone manufacturers therefore do not need to equip mobile phones with dual microphones, which reduces the cost and weight of the phone and frees the space of one microphone so that the phone can be thinner and lighter. Second, the method places no requirement on the distance between the speaker and the microphone, and can be applied both when the speaker is far from the microphone and when the speaker uses a single-microphone headset. Finally, the method can also perform noise reduction on monophonic speech obtained in any way, such as speech downloaded from the Internet.

The following are device embodiments of the present application, which may be used to execute the method embodiments of the present application. For details not disclosed in the device embodiments, please refer to the method embodiments of the present application.

Referring to FIG. 4, in another embodiment of the present application, a monophonic speech noise reduction system is provided, which can be used to implement the above-mentioned monophonic speech noise reduction method. The monophonic speech noise reduction system includes an acquisition module, a model building module, an enhanced training module and a noise reduction module.

The acquisition module is used to acquire the monophonic speech to be denoised; the model building module is used to construct an initial noise reduction model based on the LSTM neural network; the enhanced training module is used to obtain a preset number of enhanced training samples and to train the initial noise reduction model with the preset number of enhanced training samples to obtain the noise reduction model; and the noise reduction module is used to denoise the monophonic speech to be denoised by means of the noise reduction model to obtain human voice audio.

Preferably, the model building module of the monophonic speech noise reduction system includes a mixed audio acquisition module, a mixed audio processing module and a training module.

The mixed audio acquisition module is used to acquire several human voice audios and several noise audios and combine them randomly to obtain several mixed audios, each of which includes human voice audio and at least one noise audio; the mixed audio processing module is used to perform framing, windowing and Fourier transform on the mixed audio to obtain several mixed audio frame spectra; and the training module is used to divide the mixed audio frame spectra into a training set and a test set, establish an LSTM neural network model for binary classification, train the LSTM neural network model with the training set, and test the trained LSTM neural network model with the test set, the initial noise reduction model being obtained once the test is passed.

Preferably, the training module of the monophonic speech noise reduction system includes a prediction module and a parameter update module.

The prediction module is used to input the mixed audio frame spectra in the training set into the LSTM neural network model to obtain the vocal spectrum and noise spectrum, and to perform inverse Fourier transform to obtain the predicted vocal audio and noise audio; the parameter update module is used to iteratively update each parameter in the LSTM neural network model according to the error between the predicted vocal audio and the actual vocal audio, until the number of training iterations reaches the preset value or the error between the predicted vocal audio and the actual vocal audio no longer decreases.

Preferably, the enhanced training module of the monophonic speech noise reduction system includes an enhanced training sample acquisition module, a test sample acquisition module and an enhanced training management module.

The enhanced training sample acquisition module is used to acquire several monophonic speech samples for which noise reduction by the initial noise reduction model is unqualified, as the preset number of enhanced training samples; the test sample acquisition module is used to acquire several test samples formed by combining vocal audio and noise audio; and the enhanced training management module is used to train the initial noise reduction model with the preset number of enhanced training samples by unsupervised learning, until the noise reduction effect of the trained initial noise reduction model on the test samples is within a preset error of the noise reduction effect of the initial noise reduction model on the test samples, and the noise reduction effect of the trained initial noise reduction model on the enhanced training samples exceeds that of the initial noise reduction model on the enhanced training samples by more than a preset threshold.

Preferably, the enhanced training module of the monophonic speech noise reduction system further includes a parameter fixing module, which is used to fix the parameters of several hidden layers in the initial noise reduction model when the initial noise reduction model is trained with the preset number of enhanced training samples.

Preferably, the enhanced training module of the monophonic speech noise reduction system further includes a random nonlinear layer adding module, which is used to add several random nonlinear layers between the hidden layers and the classification layer of the initial noise reduction model before the initial noise reduction model is trained with the preset number of enhanced training samples.

Preferably, the noise reduction module of the monophonic speech noise reduction system is used to perform Fourier transform on the monophonic speech to be denoised to obtain the call speech spectrum; input the call speech spectrum into the noise reduction model to obtain the vocal spectrum and noise spectrum; and perform inverse Fourier transform on the vocal spectrum to obtain the vocal audio.

Referring to FIG. 5, in another embodiment of the present application, a terminal device is provided. At the hardware level, the terminal device includes a processor and a memory, and optionally an internal bus and a network interface. The memory may include internal memory, such as high-speed random access memory, and may also include non-volatile memory, such as at least one disk storage. Of course, the terminal device may also include hardware required for other services. The processor, network interface and memory are connected to each other through the internal bus, which may be an industry standard architecture bus, a peripheral component interconnect bus, an extended industry standard architecture bus, and the like. The bus can be divided into an address bus, a data bus, a control bus and so on. The memory is used to store programs. Specifically, a program may include program code, and the program code includes computer operation instructions. The memory may include internal memory and non-volatile memory, and provides instructions and data to the processor.

The processor reads the corresponding computer program from the non-volatile memory into the internal memory and then runs it, forming the above-mentioned terminal device at the logical level. The processor may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. It is the computing core and control core of the terminal, is suitable for implementing one or more instructions, and is specifically suitable for loading and executing one or more instructions so as to realize the corresponding method flow or function. Specifically:

A terminal device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of the monophonic speech noise reduction method of the above-mentioned embodiments.

In one embodiment, when the processor executes the computer program, it further implements:

Obtain a number of vocal audios and a number of noise audios and combine them randomly to obtain a number of mixed audios, each of which includes one vocal audio and at least one noise audio; perform frame-by-frame windowing and Fourier transform on the mixed audios to obtain several mixed audio frame spectra; divide the mixed audio frame spectra into a training set and a test set, establish an LSTM neural network model for binary classification, train the LSTM neural network model with the training set, and test the trained LSTM neural network model with the test set; when the pass rate of the test result meets the preset pass rate threshold, the test is passed and the initial noise reduction model is obtained.
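
The data preparation described above can be illustrated with a minimal NumPy sketch. It is not the applicant's implementation: the function names `mix_randomly` and `frame_spectra`, the Hann window, the frame length of 256 samples, the hop of 128 samples, the synthetic random signals, and the 80/20 split are all assumptions chosen for illustration.

```python
import numpy as np

def mix_randomly(vocals, noises, rng, max_noises=2):
    """Add 1..max_noises randomly chosen noise clips to each vocal clip."""
    mixes = []
    for v in vocals:
        k = rng.integers(1, max_noises + 1)           # at least one noise clip
        picks = rng.choice(len(noises), size=k, replace=False)
        mix = v.copy()
        for i in picks:
            mix += noises[i][: len(v)]
        mixes.append(mix)
    return mixes

def frame_spectra(signal, frame_len=256, hop=128):
    """Split into overlapping frames, window each frame, take the FFT per frame."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hanning(frame_len)
    frames = np.stack([signal[i * hop : i * hop + frame_len] for i in range(n_frames)])
    return np.fft.rfft(frames * window, axis=1)       # (n_frames, frame_len // 2 + 1)

rng = np.random.default_rng(0)
vocals = [rng.standard_normal(1024) for _ in range(10)]        # stand-in vocal clips
noises = [0.3 * rng.standard_normal(1024) for _ in range(4)]   # stand-in noise clips

mixes = mix_randomly(vocals, noises, rng)
spectra = [frame_spectra(m) for m in mixes]

# Random split of the mixed audio frame spectra into a training set and a test set
order = rng.permutation(len(spectra))
split = int(0.8 * len(spectra))
train_set = [spectra[i] for i in order[:split]]
test_set = [spectra[i] for i in order[split:]]
```

In practice the vocal and noise clips would be real recordings of differing lengths, so the mixing step would also need to trim or loop the noise to match each vocal clip.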

Input the mixed audio frame spectra in the training set into the LSTM neural network model to obtain the human voice spectrum and the noise spectrum, and perform inverse Fourier transform on them to obtain the predicted human voice audio and noise audio; according to the error between the predicted human voice audio and the actual human voice audio, iteratively update each parameter in the LSTM neural network model until the number of training iterations reaches the preset value or the error between the predicted human voice audio and the actual human voice audio no longer decreases.
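
The stopping rule of this training step (a preset number of iterations, or an error that no longer decreases) can be sketched as follows. The helper `train`, the `patience` and `min_delta` parameters, and the toy per-bin gain model standing in for the LSTM are assumptions made only to keep the example self-contained; they do not appear in the application.

```python
import numpy as np

def train(step, max_iters=1000, patience=5, min_delta=1e-6):
    """Call step() repeatedly; stop at max_iters or once the error stops decreasing."""
    best = float("inf")
    stale = 0
    history = []
    for _ in range(max_iters):
        err = step()
        history.append(err)
        if best - err > min_delta:      # error still decreasing
            best, stale = err, 0
        else:                           # no meaningful decrease this iteration
            stale += 1
            if stale >= patience:
                break
    return history

# Toy stand-in for the LSTM: learn a per-bin gain so that gain * mixed ≈ vocal
rng = np.random.default_rng(0)
vocal = rng.standard_normal((50, 16))
mixed = vocal + 0.5 * rng.standard_normal((50, 16))
gain = np.zeros(16)

def step(lr=0.1):
    """One gradient-descent update; returns the mean squared error before the update."""
    global gain
    residual = gain * mixed - vocal
    gain -= lr * 2 * np.mean(residual * mixed, axis=0)
    return float(np.mean(residual ** 2))

history = train(step)
```

With an actual LSTM, `step` would run one optimisation pass over the training set and return the error between the predicted and the actual human voice audio.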

Obtain a number of monophonic speech samples for which noise reduction by the initial noise reduction model is unqualified, as the preset number of enhanced training samples; obtain a number of test samples formed by combining vocal audio and noise audio; train the initial noise reduction model with the preset number of enhanced training samples until (a) the noise reduction effect of the trained initial noise reduction model on the test samples and the noise reduction effect of the initial noise reduction model on the test samples are within a preset error of each other, and (b) the noise reduction effect of the trained initial noise reduction model on the enhanced training samples exceeds the noise reduction effect of the initial noise reduction model on the enhanced training samples by more than a preset threshold.
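
The two termination conditions can be written as a single acceptance check. This is a sketch under assumptions: the application does not state how the "noise reduction effect" is measured, so the scores below are hypothetical scalar quality scores (for example an SNR improvement in dB), and the function name and default thresholds are illustrative only.

```python
def accept_finetuned(test_before, test_after, enh_before, enh_after,
                     max_test_drift=0.5, min_enh_gain=1.0):
    """Return True when enhanced training may stop.

    test_*: score of the model on the ordinary test samples (before/after training)
    enh_*:  score of the model on the enhanced training samples (before/after training)
    """
    # (a) performance on the ordinary test samples stays within the preset error
    stable = abs(test_after - test_before) <= max_test_drift
    # (b) performance on the enhanced samples improves by more than the preset threshold
    improved = (enh_after - enh_before) > min_enh_gain
    return stable and improved

ok = accept_finetuned(test_before=10.0, test_after=10.2, enh_before=2.0, enh_after=5.0)
```

Condition (a) guards against the model forgetting what it learned on ordinary mixtures, while condition (b) confirms that the previously unqualified samples are now handled better.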

When training the initial noise reduction model with a preset number of enhanced training samples, the parameters of several hidden layers in the initial noise reduction model are fixed.
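
Fixing the parameters of selected hidden layers amounts to skipping them in the parameter update. A minimal sketch, assuming a plain gradient-descent update and hypothetical parameter names (`lstm_1`, `classifier`):

```python
import numpy as np

def sgd_update(params, grads, frozen, lr=0.01):
    """Apply one gradient step, leaving every parameter named in `frozen` untouched."""
    return {
        name: (value if name in frozen else value - lr * grads[name])
        for name, value in params.items()
    }

params = {"lstm_1": np.ones((4, 4)), "classifier": np.ones((4, 2))}
grads = {"lstm_1": np.ones((4, 4)), "classifier": np.ones((4, 2))}

# Freeze the hidden layer; only the classification layer is updated
updated = sgd_update(params, grads, frozen={"lstm_1"})
```

Deep learning frameworks typically offer an equivalent switch per parameter, so in a real model the same effect is obtained without a custom update function.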

Before training the initial noise reduction model with a preset number of enhanced training samples, several random nonlinear layers are added between the hidden layer and the classification layer of the initial noise reduction model.
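
One possible reading of this step is that newly added layers are randomly initialised and placed between the last hidden layer and the classification layer. The sketch below follows that reading; the tanh activation, the layer width, the sigmoid classification output, and all function names are assumptions rather than details taken from the application.

```python
import numpy as np

def add_random_layers(hidden_dim, n_layers, rng):
    """Create n_layers randomly initialised (weight, bias) pairs of width hidden_dim."""
    return [
        (rng.standard_normal((hidden_dim, hidden_dim)) / np.sqrt(hidden_dim),
         np.zeros(hidden_dim))
        for _ in range(n_layers)
    ]

def head_forward(hidden, extra_layers, w_out, b_out):
    """Pass hidden-layer output through the new nonlinear layers, then classify."""
    x = hidden
    for w, b in extra_layers:
        x = np.tanh(x @ w + b)                              # new nonlinear layer
    return 1.0 / (1.0 + np.exp(-(x @ w_out + b_out)))       # per-bin voice probability

rng = np.random.default_rng(0)
hidden = rng.standard_normal((7, 32))                 # stand-in for LSTM hidden output
w_out = rng.standard_normal((32, 129)) / np.sqrt(32)  # stand-in classification layer
b_out = np.zeros(129)

extra = add_random_layers(hidden_dim=32, n_layers=2, rng=rng)
probs = head_forward(hidden, extra, w_out, b_out)
```

Only the new layers and, if desired, the classification layer would then be trained on the enhanced training samples.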

Perform Fourier transform on the monophonic speech to be denoised to obtain the call voice spectrum; input the call voice spectrum into the noise reduction model to obtain the human voice spectrum and the noise spectrum; perform inverse Fourier transform on the human voice spectrum to obtain the human voice audio.
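
The inference path (Fourier transform, model, inverse Fourier transform) can be sketched with a short-time Fourier transform and weighted overlap-add reconstruction. Several points are assumptions: the model is represented as a function returning a per-bin mask in [0, 1] that is multiplied with the mixed spectrum, the window is a periodic Hann window with 50% overlap, and `placeholder_model` is a hypothetical stand-in that passes everything through.

```python
import numpy as np

def stft(x, frame_len=256, hop=128):
    """Frame, window, and Fourier-transform the signal."""
    window = np.hanning(frame_len + 1)[:-1]           # periodic Hann window
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

def istft(spec, frame_len=256, hop=128):
    """Inverse-transform each frame and rebuild the signal by weighted overlap-add."""
    window = np.hanning(frame_len + 1)[:-1]
    frames = np.fft.irfft(spec, n=frame_len, axis=1)
    out = np.zeros(hop * (len(spec) - 1) + frame_len)
    norm = np.zeros_like(out)
    for i, frame in enumerate(frames):
        out[i * hop : i * hop + frame_len] += frame * window
        norm[i * hop : i * hop + frame_len] += window ** 2
    return out / np.maximum(norm, 1e-8)

def denoise(x, model):
    """Estimate the human voice audio from a noisy monophonic signal."""
    spec = stft(x)
    mask = model(np.abs(spec))      # 1 = keep as voice, 0 = discard as noise
    return istft(mask * spec)       # the noise spectrum would be (1 - mask) * spec

def placeholder_model(magnitude):
    """Hypothetical stand-in for the trained model: keeps every bin."""
    return np.ones_like(magnitude)

rng = np.random.default_rng(0)
noisy = np.sin(2 * np.pi * 0.01 * np.arange(1024)) + 0.1 * rng.standard_normal(1024)
voice = denoise(noisy, placeholder_model)
```

With the pass-through mask the output reproduces the input away from the signal edges, which is a convenient sanity check of the transform pair before a trained model is plugged in.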

In yet another embodiment, the present application further provides a computer-readable storage medium (memory), which is a memory device in a terminal device for storing programs and data. It can be understood that the computer-readable storage medium here may include both a built-in storage medium in the terminal device and an extended storage medium supported by the terminal device. The computer-readable storage medium may be non-volatile or volatile. The computer-readable storage medium provides storage space in which the operating system of the terminal is stored. In addition, one or more instructions suitable for being loaded and executed by the processor are also stored in the storage space, and these instructions may be one or more computer programs (including program code). It should be noted that the computer-readable storage medium here can be a high-speed RAM, or a non-volatile memory, such as at least one disk storage. All or part of the processes in the methods of the above-mentioned embodiments can be implemented by instructing the relevant hardware through computer-readable instructions, and the computer-readable instructions can be stored in a non-volatile computer-readable storage medium; when executed, the computer-readable instructions may include the processes of the above-described method embodiments.

One or more instructions stored in the computer-readable storage medium can be loaded and executed by the processor, so as to implement the corresponding steps of the monophonic speech noise reduction method in the above-mentioned embodiments. Specifically, in one embodiment, the computer program, when executed by the processor, further implements the steps described above for the terminal device.

As will be appreciated by those skilled in the art, the embodiments of the present application may be provided as a method, a system, or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.

The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the present application. It will be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture comprising an instruction apparatus, and the instruction apparatus implements the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions can also be loaded onto a computer or other programmable data processing device, so that a series of operational steps are performed on the computer or other programmable device to produce a computer-implemented process, such that the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present application rather than to limit them. Although the present application has been described in detail with reference to the above embodiments, those of ordinary skill in the art should understand that any modification or equivalent replacement that does not depart from the spirit and scope of the present application shall be included within the protection scope of the claims of the present application.

Claims

A monophonic speech noise reduction method, comprising the following steps:

Obtain the monophonic voice to be denoised;

Build an initial noise reduction model based on LSTM neural network;

Obtain a preset number of enhanced training samples, use the preset number of enhanced training samples to train an initial noise reduction model, and obtain a noise reduction model;

The monophonic speech to be denoised is denoised by the denoising model to obtain human voice audio.
The method for denoising monophonic speech according to claim 1, wherein said constructing an initial denoising model based on an LSTM neural network comprises:

Obtaining a plurality of vocal audios and a number of noise audios and combining them randomly to obtain a number of mixed audios, each of which includes a vocal audio and at least one noise audio;

Perform frame-by-frame windowing and Fourier transform on the mixed audio to obtain a number of mixed audio frame spectra;

Divide the spectrum of several mixed audio frames into a training set and a test set, establish an LSTM neural network model for binary classification, train the LSTM neural network model through the training set, and test the trained LSTM neural network model through the test set. When the pass rate meets the preset pass rate threshold, the test is passed, and the initial noise reduction model is obtained.
The method for denoising monophonic speech according to claim 2, wherein the training of the LSTM neural network model through the training set comprises:

Input the mixed audio frame spectrum in the training set into the LSTM neural network model to obtain the human voice spectrum and noise spectrum and perform inverse Fourier transform to obtain the predicted voice audio and noise audio;

According to the error between the predicted vocal audio and the actual vocal audio, each parameter in the LSTM neural network model is iteratively updated, until the number of training iterations reaches the preset value or the error between the predicted vocal audio and the actual vocal audio no longer decreases.
The monophonic speech noise reduction method according to claim 1, wherein the obtaining a preset number of enhanced training samples, and using the preset number of enhanced training samples to train the initial noise reduction model comprises:

Obtain a number of monophonic voices that are unqualified for noise reduction by the initial noise reduction model as a preset number of enhanced training samples;

Obtain several test samples formed by combining vocal audio and several noise audio;

An unsupervised learning method is used to train the initial noise reduction model with the preset number of enhanced training samples, until the noise reduction effect of the trained initial noise reduction model on the test samples and the noise reduction effect of the initial noise reduction model on the test samples are within a preset error of each other, and the noise reduction effect of the trained initial noise reduction model on the enhanced training samples exceeds the noise reduction effect of the initial noise reduction model on the enhanced training samples by more than a preset threshold.
The method for noise reduction of monophonic speech according to claim 4, wherein when the initial noise reduction model is trained by using a preset number of enhanced training samples, the parameters of several hidden layers in the initial noise reduction model are fixed.
The monophonic speech noise reduction method according to claim 4, wherein before the initial noise reduction model is trained by using a preset number of enhanced training samples, several random nonlinear layers are added between the hidden layer and the classification layer of the initial noise reduction model.
The monophonic speech noise reduction method according to claim 1, wherein the noise reduction of the monophonic speech to be denoised by the noise reduction model to obtain the human voice audio comprises:

Fourier transform is performed on the monophonic voice to be denoised to obtain the voice spectrum of the call;

Input the voice spectrum of the call into the noise reduction model to obtain the human voice spectrum and the noise spectrum;

Perform an inverse Fourier transform on the vocal spectrum to obtain vocal audio.
A monophonic speech noise reduction system, comprising:

The acquisition module is used to acquire the monophonic voice to be denoised;

The model building module is used to build the initial noise reduction model based on the LSTM neural network;

The enhanced training module is used to obtain a preset number of enhanced training samples, and to use the preset number of enhanced training samples to train the initial noise reduction model to obtain a noise reduction model; and

The noise reduction module is used to denoise the monophonic speech to be denoised through the denoising model to obtain human voice audio.
A terminal device, wherein the terminal device includes a memory, a processor, and a computer program stored on the memory and executable on the processor, and the processor, when executing the computer program, implements:

Obtain the monophonic voice to be denoised;

Build an initial noise reduction model based on LSTM neural network;

Obtain a preset number of enhanced training samples, use the preset number of enhanced training samples to train an initial noise reduction model, and obtain a noise reduction model;

The monophonic speech to be denoised is denoised by the denoising model to obtain human voice audio.
The terminal device according to claim 9, wherein, when the processor executes the computer program, it further implements:

Obtaining a plurality of vocal audios and a number of noise audios and combining them randomly to obtain a number of mixed audios, each of which includes a vocal audio and at least one noise audio;

Perform frame-by-frame windowing and Fourier transform on the mixed audio to obtain a number of mixed audio frame spectra;

Divide the spectrum of several mixed audio frames into training set and test set, establish an LSTM neural network model for binary classification, train the LSTM neural network model through the training set, and test the trained LSTM neural network model through the test set. When the pass rate meets the preset pass rate threshold, the test is passed, and the initial noise reduction model is obtained.
The terminal device according to claim 10, wherein, when the processor executes the computer program, it further implements:

Input the mixed audio frame spectrum in the training set into the LSTM neural network model to obtain the human voice spectrum and noise spectrum and perform inverse Fourier transform to obtain the predicted voice audio and noise audio;

According to the error between the predicted vocal audio and the actual vocal audio, each parameter in the LSTM neural network model is iteratively updated, until the number of training iterations reaches the preset value or the error between the predicted vocal audio and the actual vocal audio no longer decreases.
The terminal device according to claim 9, wherein, when the processor executes the computer program, it further implements:

Obtain a number of monophonic voices that are unqualified for noise reduction by the initial noise reduction model as a preset number of enhanced training samples;

Obtain several test samples formed by combining vocal audio and several noise audio;

An unsupervised learning method is used to train the initial noise reduction model with the preset number of enhanced training samples, until the noise reduction effect of the trained initial noise reduction model on the test samples and the noise reduction effect of the initial noise reduction model on the test samples are within a preset error of each other, and the noise reduction effect of the trained initial noise reduction model on the enhanced training samples exceeds the noise reduction effect of the initial noise reduction model on the enhanced training samples by more than a preset threshold.
The terminal device according to claim 12, wherein when the processor executes the computer program, it further implements:

When the initial noise reduction model is trained by using a preset number of enhanced training samples, the parameters of several hidden layers in the initial noise reduction model are fixed.
The terminal device according to claim 12, wherein when the processor executes the computer program, it further implements:

Before the initial noise reduction model is trained by the preset number of enhanced training samples, several random nonlinear layers are added between the hidden layer and the classification layer of the initial noise reduction model.
A computer-readable storage medium storing a computer program, wherein the computer program is executed by a processor to realize:

Obtain the monophonic voice to be denoised;

Build an initial noise reduction model based on LSTM neural network;

Obtain a preset number of enhanced training samples, use the preset number of enhanced training samples to train an initial noise reduction model, and obtain a noise reduction model;

The monophonic speech to be denoised is denoised by the denoising model to obtain human voice audio.
The computer-readable storage medium of claim 15, wherein the computer program, when executed by the processor, further implements:

Obtaining a plurality of vocal audios and a number of noise audios and combining them randomly to obtain a number of mixed audios, each of which includes a vocal audio and at least one noise audio;

Perform frame-by-frame windowing and Fourier transform on the mixed audio to obtain a number of mixed audio frame spectra;

Divide the spectrum of several mixed audio frames into training set and test set, establish an LSTM neural network model for binary classification, train the LSTM neural network model through the training set, and test the trained LSTM neural network model through the test set. When the pass rate meets the preset pass rate threshold, the test is passed, and the initial noise reduction model is obtained.
The computer-readable storage medium of claim 16, wherein the computer program, when executed by the processor, further implements:

Input the mixed audio frame spectrum in the training set into the LSTM neural network model to obtain the human voice spectrum and noise spectrum and perform inverse Fourier transform to obtain the predicted voice audio and noise audio;

According to the error between the predicted vocal audio and the actual vocal audio, each parameter in the LSTM neural network model is iteratively updated, until the number of training iterations reaches the preset value or the error between the predicted vocal audio and the actual vocal audio no longer decreases.
The computer-readable storage medium of claim 15, wherein the computer program, when executed by the processor, further implements:

Obtain a number of monophonic voices that are unqualified for noise reduction by the initial noise reduction model as a preset number of enhanced training samples;

Obtain several test samples formed by combining vocal audio and several noise audio;

An unsupervised learning method is used to train the initial noise reduction model with the preset number of enhanced training samples, until the noise reduction effect of the trained initial noise reduction model on the test samples and the noise reduction effect of the initial noise reduction model on the test samples are within a preset error of each other, and the noise reduction effect of the trained initial noise reduction model on the enhanced training samples exceeds the noise reduction effect of the initial noise reduction model on the enhanced training samples by more than a preset threshold.
The computer-readable storage medium of claim 18, wherein the computer program, when executed by the processor, further implements:

When the initial noise reduction model is trained by using a preset number of enhanced training samples, the parameters of several hidden layers in the initial noise reduction model are fixed.
The computer-readable storage medium of claim 18, wherein the computer program, when executed by the processor, further implements:

Before the initial noise reduction model is trained by the preset number of enhanced training samples, several random nonlinear layers are added between the hidden layer and the classification layer of the initial noise reduction model.