WO2022134351A1 - Noise reduction method and system for monophonic speech, and device and readable storage medium - Google Patents

Info

Publication number
WO2022134351A1
Authority
WO
WIPO (PCT)
Prior art keywords
noise reduction
model
audio
initial
reduction model
Prior art date
Application number
PCT/CN2021/083652
Other languages
French (fr)
Chinese (zh)
Inventor
王健宗
程宁
张之勇
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2022134351A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L2021/02161 Number of inputs available containing the signal or the noise to be suppressed
    • G10L2021/02163 Only one microphone

Definitions

  • the present application belongs to the technical field of speech noise reduction, and in particular relates to a method, system, device and readable storage medium for monophonic speech noise reduction.
  • CVC technology is a call noise reduction technology.
  • the built-in dual microphones in the mobile phone acquire the sound.
  • The main microphone is near the speaker's mouth and receives the speaker's voice at a higher level; the secondary microphone is farther from the speaker's mouth and receives the speaker's voice at a lower level, while the two microphones receive almost the same amount of ambient noise.
  • However, this technology cannot handle monophonic audio: it requires the mobile phone to have dual microphones and has no effect on single-microphone mobile phones. Moreover, it places requirements on the speaker's posture during a call, requiring the speaker's voice source to be very close to the main microphone. It cannot be used when the speaker is far from the microphone or uses a headset with a single microphone.
  • One of the purposes of the embodiments of the present application is to provide a method, system, device, and readable storage medium for monophonic speech noise reduction, aiming to solve the technical problem that the existing noise reduction technology for calls cannot handle monophonic audio.
  • a first aspect of the embodiments of the present application provides a method for noise reduction of monophonic speech, including:
  • Obtain the monophonic speech to be denoised; construct an initial noise reduction model based on the LSTM neural network; obtain a preset number of enhanced training samples, and use the preset number of enhanced training samples to train the initial noise reduction model to obtain a noise reduction model;
  • the noise reduction model denoises the monophonic speech to be denoised to obtain human voice audio.
  • a second aspect of the embodiments of the present application provides a monophonic speech noise reduction system, including:
  • the acquisition module is used to acquire the monophonic voice to be denoised
  • the model building module is used to build the initial noise reduction model based on the LSTM neural network
  • an augmented training module for obtaining a preset number of augmented training samples, and using the preset number of augmented training samples to train an initial noise reduction model to obtain a noise reduction model
  • the noise reduction module is used to denoise the monophonic speech to be denoised through the denoising model to obtain human voice audio.
  • a third aspect of the embodiments of the present application provides a terminal device, including: a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor, when executing the computer program, implements:
  • Obtain the monophonic speech to be denoised; construct an initial noise reduction model based on the LSTM neural network; obtain a preset number of enhanced training samples, and use the preset number of enhanced training samples to train the initial noise reduction model to obtain a noise reduction model;
  • the noise reduction model denoises the monophonic speech to be denoised to obtain human voice audio.
  • a fourth aspect of the embodiments of the present application provides a computer-readable storage medium, where the computer-readable storage medium stores a computer program, and the computer program is executed by a processor to implement:
  • Obtain the monophonic speech to be denoised; construct an initial noise reduction model based on the LSTM neural network; obtain a preset number of enhanced training samples, and use the preset number of enhanced training samples to train the initial noise reduction model to obtain a noise reduction model;
  • the noise reduction model denoises the monophonic speech to be denoised to obtain human voice audio.
  • a fifth aspect of the embodiments of the present application further provides a computer program product, which, when the computer program product runs on a terminal device, enables the terminal device to implement:
  • Obtain the monophonic speech to be denoised; construct an initial noise reduction model based on the LSTM neural network; obtain a preset number of enhanced training samples, and use the preset number of enhanced training samples to train the initial noise reduction model to obtain a noise reduction model;
  • the noise reduction model denoises the monophonic speech to be denoised to obtain human voice audio.
  • the embodiments of the present application include the following advantages:
  • An initial noise reduction model is constructed, a preset number of enhanced training samples are obtained and used to train the initial noise reduction model to obtain a noise reduction model, and the noise reduction model is then used directly to reduce the noise of the monophonic speech to be denoised to obtain vocal audio.
  • The noise reduction process is not restricted to two-channel input and can reduce the noise of any monophonic speech.
  • the noise reduction model is obtained by using the enhanced training samples to train the initial noise reduction model, and the initial noise reduction model is constructed by training the LSTM neural network with mixed audio.
  • Training the model on the basis of the neural network makes it easy to learn the noise pattern of the entire time series and thereby achieve a better noise reduction effect.
  • Noise types that were not previously learned are re-learned, which further improves the noise reduction effect of the noise reduction model.
  • FIG. 1 is a flowchart of a method for noise reduction of monophonic speech according to an embodiment of the present application
  • FIG. 2 is a block diagram of an initial noise reduction model training process according to an embodiment of the application.
  • FIG. 3 is a block diagram of a noise reduction model training process according to an embodiment of the present application.
  • FIG. 4 is a structural block diagram of a monophonic speech noise reduction system according to an embodiment of the application.
  • FIG. 5 is a structural block diagram of a terminal device according to an embodiment of the present application.
  • the monophonic speech noise reduction method includes the following steps:
  • The monophonic voice is obtained in real time through the microphone; it may also be a monophonic voice downloaded from the Internet.
  • noise reduction may be performed before the monophonic voice is sent, or when the monophonic voice is sent to the corresponding call device, or at the same time.
  • An initial noise reduction model needs to be obtained. Since a call unfolds as a time series and, in many cases, the noise category within the same speech is roughly the same over the entire time series (such as wind sound throughout), this embodiment uses the LSTM (long short-term memory) network to build the initial model, which makes it convenient to learn the noise pattern of the entire time series. The LSTM neural network model is then trained with mixed audio to obtain the initial noise reduction model. Specifically, see FIG. 2, which includes the following steps:
  • Each mixed audio includes human voice audio and at least one noise audio, and may also include multiple human voice audios.
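The random combination of voice and noise recordings can be sketched as follows. This is only an illustration under stated assumptions: the function names `mix` and `build_mixed_set`, the additive mixing rule, and the `noise_gain` parameter are hypothetical and are not specified in the application. Audio is represented as plain lists of float samples.

```python
import random


def mix(voice, noise, noise_gain=0.5):
    """Add a noise clip to a voice clip, tiling the noise to the voice length.

    Additive mixing with a fixed gain is an assumption made for illustration.
    """
    tiled = [noise[i % len(noise)] for i in range(len(voice))]
    return [v + noise_gain * x for v, x in zip(voice, tiled)]


def build_mixed_set(voices, noises, count, seed=0):
    """Randomly pair each chosen voice clip with at least one noise clip."""
    rng = random.Random(seed)
    mixed = []
    for _ in range(count):
        voice = rng.choice(voices)
        # Pick between one and all of the available noise clips.
        chosen = rng.sample(noises, rng.randint(1, len(noises)))
        audio = list(voice)
        for noise in chosen:
            audio = mix(audio, noise)
        # Keep the clean voice as the training target for the mixture.
        mixed.append({"mixed": audio, "voice": list(voice)})
    return mixed
```

Keeping the clean voice next to each mixture gives the later training step a reference to compare the predicted vocal audio against.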
  • S202 Perform frame-by-frame windowing processing and Fourier transform on the mixed audio to obtain spectrums of several mixed audio frames.
  • the mixed audio is divided into frames and windowed.
  • The mixed audio is divided into frames with a frame length of 25 ms and a frame shift of 10 ms.
  • Each frame of the mixed audio then needs to be windowed.
  • the window function of the windowing process generally has low-pass characteristics.
  • the purpose of the windowing function is to reduce the leakage in the frequency domain.
  • The window functions commonly used in speech signal analysis are the rectangular window, the Hamming window and the Hanning window; different window functions can be chosen for different situations.
  • The framed and windowed mixed audio is converted from the time domain to the frequency domain: the time-domain features are mapped into spectral features of a certain dimension, such as several basic sine waves, yielding several mixed audio frame spectra.
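A minimal sketch of step S202, assuming a Hamming window and a direct discrete Fourier transform written with the standard library only. The function names are hypothetical; in practice a library FFT such as `numpy.fft.rfft` would normally replace the slow `dft` helper.

```python
import cmath
import math


def frame_signal(samples, sample_rate, frame_ms=25, shift_ms=10):
    """Split samples into overlapping frames (25 ms length, 10 ms shift)."""
    frame_len = int(sample_rate * frame_ms / 1000)
    shift = int(sample_rate * shift_ms / 1000)
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += shift
    return frames


def hamming(n):
    """Hamming window coefficients for a frame of n samples."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def dft(frame):
    """Direct DFT of a real frame, returning bins 0 .. n/2."""
    n = len(frame)
    return [
        sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n // 2 + 1)
    ]


def frame_spectra(samples, sample_rate):
    """Frame, window and transform the audio into one spectrum per frame."""
    frames = frame_signal(samples, sample_rate)
    if not frames:
        return []
    window = hamming(len(frames[0]))
    return [dft([s * w for s, w in zip(frame, window)]) for frame in frames]
```

At a sample rate of 8000 Hz, each frame holds 200 samples and consecutive frames start 80 samples apart, so the bin spacing is 40 Hz.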
  • S203 Divide the mixed audio frame spectra into a training set and a test set, establish an LSTM neural network model for binary classification, train the LSTM neural network model through the training set, and test the trained LSTM neural network model through the test set; once the test is passed, the initial noise reduction model is obtained.
  • the spectrum of several mixed audio frames is randomly divided into a training set and a test set.
  • The ratio of mixed audio frame spectra in the training set to those in the test set can be customized, and mixed audio frame spectra can also be exchanged between the training set and the test set.
  • the existing LSTM neural network model generally includes an input layer, a number of hidden layers and an output layer.
  • The mixed audio frame spectrum is used as the input of the input layer, a number of hidden layers are set, and the separated human voice spectrum and noise spectrum are used as the binary classification result of the output layer.
  • the constructed LSTM neural network model is trained through the training set, and the trained LSTM neural network model is tested through the test set.
  • When the pass rate of the test result meets the preset pass rate threshold, the test is passed and the initial noise reduction model is obtained.
  • The pass rate of the test result is the proportion of mixed audio frame spectra in the test set from which clear vocal audio is obtained after passing through the trained LSTM neural network model, relative to all mixed audio frame spectra in the test set.
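The random split and the pass-rate check of step S203 can be sketched as below. The 80/20 split, the 0.9 threshold, and the `is_clear` callback (which stands in for whatever judgement decides that a denoised result is "clear") are assumptions made for illustration; the application leaves them open.

```python
import random


def split_spectra(items, train_ratio=0.8, seed=0):
    """Randomly divide the frame spectra into a training set and a test set."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


def pass_rate(test_items, is_clear):
    """Fraction of test items for which the denoised result is judged clear."""
    if not test_items:
        return 0.0
    return sum(1 for item in test_items if is_clear(item)) / len(test_items)


def is_model_qualified(test_items, is_clear, threshold=0.9):
    """The model passes when the pass rate meets the preset threshold."""
    return pass_rate(test_items, is_clear) >= threshold
```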
  • the specific method of training the LSTM neural network model through the training set is:
  • an annealing algorithm is used to update the learning rate of the LSTM neural network model.
  • The learning rate is a hyperparameter that controls how the gradient of the loss function is used to adjust the weights of the model in gradient descent. If the learning rate is too large, an update may overshoot the global optimum. If the learning rate is too small, the loss function changes very slowly, which greatly increases the cost of convergence and makes it easy to become trapped in a local minimum or saddle point. Letting the learning rate change over time through the annealing algorithm addresses this problem.
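The application does not say which annealing schedule is used. As one common possibility, the sketch below uses cosine annealing, in which the learning rate starts at `lr_max` and decays smoothly to `lr_min`. The function name and the default values are assumptions.

```python
import math


def annealed_lr(step, total_steps, lr_max=0.01, lr_min=0.0001):
    """Cosine-annealed learning rate for the given training step.

    The rate equals lr_max at step 0 and lr_min at total_steps.
    """
    step = min(max(step, 0), total_steps)
    cosine = math.cos(math.pi * step / total_steps)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + cosine)
```

Large early steps let the model move quickly across the loss surface, while the small late steps reduce the risk of overshooting the optimum.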
  • The trained LSTM neural network model is tested through the test set by inputting the mixed audio frame spectra in the test set into the trained LSTM neural network model.
  • When the noise reduction pass rate of the mixed audio frame spectra in the test set is within the preset range, the trained LSTM neural network model is considered to have passed the test, and the initial noise reduction model is obtained.
  • S3 Obtain a preset number of enhanced training samples, train an initial noise reduction model by using the preset number of enhanced training samples, and obtain a noise reduction model.
  • With the initial noise reduction model, the noise reduction requirements for monophonic speech in some scenarios can be met. However, the inventor found in actual work that, because of the complexity of factors such as noise type, hardware, and microphone recording distance, it is difficult for the training data to cover all possible noise environments, so the initial noise reduction model is not robust to some noises.
  • When the initial noise reduction model performs noise reduction on such noise, the intelligibility and clarity of the denoised vocal audio may be unsatisfactory.
  • The initial noise reduction model therefore needs to learn noise patterns from the specific noise data, that is, to undergo enhanced training. The purpose is to re-learn the noise types that were not learned, so that the patterns of this type of noise are recognized better and the noise is reduced.
  • the monophonic speech unqualified for noise reduction by using the initial noise reduction model is used as the enhanced training sample, and the final noise reduction model is obtained by training the initial noise reduction model by using the enhanced training sample.
  • A method for acquiring enhanced training samples is provided. It relies on the premise that noise is continuous: when a monophonic voice is denoised by the initial noise reduction model and the obtained vocal audio contains similar waveforms that occur more than a preset number of times and whose energy is greater than a preset energy threshold, the noise reduction of the current monophonic voice is considered unqualified, and these unqualified monophonic voices are used as enhanced training samples. Different users therefore generate different enhanced training samples.
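One way to read this criterion is sketched below. The application does not define how "similar waveforms" are measured, so the fixed-length segmentation, the cosine-similarity test, and every parameter name here are assumptions chosen for illustration only.

```python
import math


def energy(segment):
    """Mean squared amplitude of a segment."""
    return sum(x * x for x in segment) / len(segment)


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length segments."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def is_unqualified(audio, seg_len, max_repeats, energy_threshold,
                   sim_threshold=0.95):
    """Flag denoised audio that still contains a loud, repeating waveform."""
    segments = [audio[i:i + seg_len]
                for i in range(0, len(audio) - seg_len + 1, seg_len)]
    # Only segments above the energy threshold can count as residual noise.
    loud = [s for s in segments if energy(s) > energy_threshold]
    for i, ref in enumerate(loud):
        repeats = sum(
            1 for j, other in enumerate(loud)
            if j != i and cosine_similarity(ref, other) > sim_threshold
        )
        if repeats > max_repeats:
            return True
    return False
```

Audio flagged by such a check would be collected as an enhanced training sample.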
  • the specific method for training the initial noise reduction model by using the enhanced training samples is as follows:
  • Training continues until the noise reduction effect of the trained model on the enhanced training samples exceeds the noise reduction effect of the initial noise reduction model on the enhanced training samples by more than a preset threshold, at which point the noise reduction model is obtained.
  • If the initial noise reduction model after enhanced training causes a large loss in the noise reduction effect on the test samples, or brings no obvious improvement in the noise reduction of the enhanced training samples, the initial noise reduction model is not updated and the enhanced training is considered to have failed. Enhanced training samples are then re-obtained for training.
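The accept-or-reject decision for an enhanced-training round can be sketched as a small predicate. The scores are assumed to be "higher is better" noise reduction measures; the function name, the parameter names, and the choice of metric are hypothetical, since the application does not fix them.

```python
def accept_enhanced_model(base_test, new_test, base_enh, new_enh,
                          max_test_drop, min_enh_gain):
    """Decide whether the model after enhanced training replaces the initial one.

    base_* are scores of the initial model, new_* are scores after enhanced
    training, on the test samples and on the enhanced training samples.
    """
    # The effect on ordinary test samples must not degrade by too much.
    test_ok = (base_test - new_test) <= max_test_drop
    # The effect on the enhanced samples must improve by more than a threshold.
    enh_ok = (new_enh - base_enh) > min_enh_gain
    return test_ok and enh_ok
```

When the function returns `False`, the initial model is kept and new enhanced training samples are collected.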
  • The parameters of several hidden layers in the initial noise reduction model are fixed. Since the training is an unsupervised learning process, the parameters of some hidden layers in the initial noise reduction model need to be fixed: they are not adjusted during enhanced training, which prevents the model from overfitting. As for which hidden layers to fix, all hidden layers except the classification layer are generally selected.
  • When the number of enhanced training samples exceeds a certain number, several random nonlinear layers are added between the hidden layer and the classification layer of the initial noise reduction model. The purpose of adding these random nonlinear layers is to let the model capture such noise patterns at a higher level and so obtain better training results.
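The layer handling described above can be illustrated with a framework-free sketch in which a model is just a list of layer descriptions. The dictionary layout, the `trainable` flag, the `tanh` activation, and the Gaussian initialisation are all assumptions; a real implementation would use the equivalent mechanism of its deep learning framework.

```python
import random


def prepare_for_enhanced_training(layers, num_samples, sample_threshold,
                                  extra_layers=2, width=4, seed=0):
    """Freeze the hidden layers and optionally insert random nonlinear layers.

    `layers` lists the hidden layers followed by the classification layer.
    """
    rng = random.Random(seed)
    *hidden, classifier = layers
    # Hidden layers keep their parameters fixed during enhanced training.
    prepared = [dict(layer, trainable=False) for layer in hidden]
    if num_samples > sample_threshold:
        for i in range(extra_layers):
            weights = [[rng.gauss(0.0, 0.1) for _ in range(width)]
                       for _ in range(width)]
            prepared.append({
                "name": f"random_nonlinear_{i}",
                "weights": weights,
                "activation": "tanh",
                "trainable": True,
            })
    # Only the classification layer (and any new layers) remain trainable.
    prepared.append(dict(classifier, trainable=True))
    return prepared
```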
  • S4 The monophonic speech to be denoised is denoised by the denoising model to obtain human voice audio.
  • A Fourier transform is performed on the monophonic speech to be denoised to obtain the spectrum of the call voice, which serves as the extracted feature; the spectrum of the call voice is input into the preset noise reduction model to obtain the human voice spectrum and the noise spectrum; an inverse Fourier transform is performed on the human voice spectrum to obtain the vocal audio.
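The transform, separate, and inverse-transform flow of step S4 can be sketched for a single frame as follows. The trained model is replaced by a hypothetical `predict_voice_mask` callback that labels each frequency bin as voice (1) or noise (0), which is one possible reading of the binary classification output. Windowing and the overlap-add step that joins frames back together are omitted.

```python
import cmath
import math


def dft(frame):
    """Full discrete Fourier transform of one frame."""
    n = len(frame)
    return [sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                for t in range(n)) for k in range(n)]


def idft(spectrum):
    """Inverse transform, returning the real part of each sample."""
    n = len(spectrum)
    return [(sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
                 for k in range(n)) / n).real for t in range(n)]


def denoise_frame(frame, predict_voice_mask):
    """Split one frame into vocal audio and noise audio using a bin mask."""
    spectrum = dft(frame)
    mask = predict_voice_mask(spectrum)  # 1 marks a voice bin, 0 a noise bin
    voice_spectrum = [s * m for s, m in zip(spectrum, mask)]
    noise_spectrum = [s * (1 - m) for s, m in zip(spectrum, mask)]
    return idft(voice_spectrum), idft(noise_spectrum)
```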
  • An initial noise reduction model is constructed, a preset number of enhanced training samples are obtained and used to train the initial noise reduction model to obtain the noise reduction model, and the noise reduction model is then used directly to reduce the noise of the monophonic speech to be denoised to obtain human voice audio.
  • the noise reduction model is obtained by training the initial noise reduction model with the enhanced training samples.
  • The initial noise reduction model is constructed by training the LSTM neural network model with mixed audio. Because the model is trained on the basis of the LSTM neural network model, it readily learns the noise pattern of the entire time sequence and thereby achieves a better noise reduction effect.
  • Noise types that were not previously learned are re-learned, which further improves the noise reduction effect of the noise reduction model.
  • The biggest advantage of the monophonic speech noise reduction method is that it is "monophonic", which gives it a wide application range and good compatibility. Applied to mobile phone calls, this method does not require the mobile phone to have dual microphones and is suitable for all mobile phones with a microphone. Mobile phone manufacturers therefore do not need to equip phones with dual microphones, which reduces the cost and weight of the phone and saves the space of one microphone, making the phone thinner and lighter. Secondly, this method places no requirement on the distance between the speaker and the microphone: it can be applied where the speaker is far from the microphone and where the speaker talks through a single-microphone headset. Finally, the method can also perform noise reduction on monophonic speech obtained in any way, such as speech downloaded from the Internet.
  • a monophonic voice noise reduction system is provided, and the monophonic voice noise reduction system can be used to implement the above-mentioned monophonic voice noise reduction method.
  • the speech noise reduction system includes an acquisition module, a model building module, an enhanced training module and a noise reduction module.
  • the acquisition module is used to acquire the monophonic speech to be denoised;
  • the model building module is used to construct an initial noise reduction model based on the LSTM neural network;
  • the enhanced training module is used to acquire a preset number of enhanced training samples and to train the initial noise reduction model with the preset number of enhanced training samples to obtain the noise reduction model;
  • the noise reduction module is used to reduce the noise of the monophonic speech to be denoised through the noise reduction model to obtain the human voice audio.
  • the model building module of the monophonic speech noise reduction system includes a mixed audio acquisition module, a mixed audio processing module and a training module.
  • the mixed audio acquisition module is used to acquire several human voice audios and several noise audios and combine them randomly to obtain several mixed audios, each mixed audio including human voice audio and at least one noise audio;
  • the mixed audio processing module is used to perform frame-by-frame windowing and Fourier transform on the mixed audio to obtain several mixed audio frame spectra;
  • the training module is used to divide the mixed audio frame spectra into a training set and a test set, establish an LSTM neural network model for binary classification, train the LSTM neural network model through the training set, and test the trained LSTM neural network model through the test set; the initial noise reduction model is obtained after the test is passed.
  • the training module of the monophonic speech noise reduction system includes a prediction module and a parameter update module.
  • the prediction module is used to input the mixed audio frame spectrum in the training set into the LSTM neural network model to obtain the human voice spectrum and noise spectrum and perform inverse Fourier transform to obtain the predicted human voice audio and noise audio;
  • the parameter update module is used to iteratively update each parameter in the LSTM neural network model according to the error between the predicted vocal audio and the actual vocal audio, until the number of training iterations reaches the preset value or the error between the predicted vocal audio and the actual vocal audio no longer decreases.
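The two stopping conditions of the parameter update module can be sketched as a generic loop. Here `step` is a hypothetical callback that performs one parameter update and returns the current error between the predicted and actual vocal audio; `patience` and `min_delta` are assumed knobs for deciding that the error "no longer decreases".

```python
def train_until_converged(step, max_iterations, patience=1, min_delta=0.0):
    """Run update steps until the iteration cap or until the error stalls.

    Returns the list of errors observed, one per completed iteration.
    """
    best = float("inf")
    stale = 0
    history = []
    for iteration in range(max_iterations):
        error = step(iteration)
        history.append(error)
        if best - error > min_delta:
            # The error decreased, so keep training.
            best = error
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                break
    return history
```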
  • the enhanced training module of the monophonic speech noise reduction system includes an enhanced training sample acquisition module, a test sample acquisition module and an enhanced training management module.
  • the enhanced training sample acquisition module is used to acquire a number of monophonic voices that are unqualified for noise reduction by the initial noise reduction model, as a preset number of enhanced training samples;
  • the test sample acquisition module is used to acquire test samples formed by combining a number of vocal audios and a number of noise audios;
  • the enhanced training management module is used to train the initial noise reduction model with the preset number of enhanced training samples by unsupervised learning, until the noise reduction effect of the trained model on the test samples is within a preset error of the noise reduction effect of the initial noise reduction model on the test samples, and the noise reduction effect of the trained model on the enhanced training samples exceeds that of the initial noise reduction model on the enhanced training samples by more than a preset threshold.
  • the enhanced training module of the monophonic speech noise reduction system further includes a parameter fixing module, which is used to fix the parameters of several hidden layers in the initial noise reduction model when the initial noise reduction model is trained with the preset number of enhanced training samples.
  • the enhanced training module of the monophonic speech noise reduction system further includes a random nonlinear layer adding module, which is used to add several random nonlinear layers between the hidden layer and the classification layer of the initial noise reduction model before the initial noise reduction model is trained with the preset number of enhanced training samples.
  • the noise reduction module of the monophonic voice noise reduction system is used to perform Fourier transform on the monophonic voice to be denoised to obtain the voice spectrum of the call; input the spectrum of the voice of the call into the noise reduction model to obtain the human voice spectrum and noise spectrum; perform the inverse Fourier transform of the human voice spectrum to obtain the human voice audio.
  • A terminal device is provided. At the hardware level, the terminal device includes a processor and a memory, and optionally an internal bus and a network interface.
  • the memory may include internal memory, such as high-speed random access memory, or may also include non-volatile memory, such as at least one disk storage.
  • the terminal equipment may also include hardware required for other services.
  • The processor, network interface, and memory are connected to each other through an internal bus, which may be an industry standard architecture bus, a peripheral component interconnect standard bus, an extended industry standard architecture bus, and the like.
  • the bus can be divided into address bus, data bus, control bus and so on.
  • Memory is used to store programs. Specifically, the program may include program code, and the program code includes computer operation instructions.
  • The memory may include internal memory and non-volatile memory, and provides instructions and data to the processor.
  • the processor reads the corresponding computer program from the non-volatile memory into the memory and then runs it, forming the above-mentioned terminal device on the logical level.
  • The processor may be a central processing unit (Central Processing Unit, CPU), or another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. It is the computing core and control core of the terminal, is suitable for implementing one or more instructions, and is specifically suitable for loading and executing one or more instructions so as to realize the corresponding method process or corresponding function. Specifically:
  • A terminal device comprises a memory, a processor, and a computer program stored in the memory and executable on the processor; when executing the computer program, the processor implements:
  • Obtain the monophonic speech to be denoised; construct an initial noise reduction model based on the LSTM neural network; obtain a preset number of enhanced training samples, and use the preset number of enhanced training samples to train the initial noise reduction model to obtain a noise reduction model;
  • the noise reduction model denoises the monophonic speech to be denoised to obtain human voice audio.
  • When the processor executes the computer program, it further implements:
  • the noise reduction effect of the noise reduction model on the enhanced training samples is greater than the preset threshold of the noise reduction effect of the initial noise reduction model on the enhanced training samples.
  • When the processor executes the computer program, it further implements:
  • the parameters of several hidden layers in the initial noise reduction model are fixed.
  • When the processor executes the computer program, it further implements:
  • the present application further provides a computer-readable storage medium (Memory), where the computer-readable storage medium is a memory device in a terminal device for storing programs and data.
  • the computer-readable storage medium here may include both a built-in storage medium in the terminal device, and certainly also an extended storage medium supported by the terminal device.
  • the computer-readable storage medium may be non-volatile or volatile.
  • the computer-readable storage medium provides storage space in which the operating system of the terminal is stored.
  • one or more instructions suitable for being loaded and executed by the processor are also stored in the storage space, and these instructions may be one or more computer programs (including program codes).
  • The computer-readable storage medium here can be a high-speed RAM memory or a non-volatile memory, such as at least one disk storage. All or part of the processes in the methods of the above-mentioned embodiments can be implemented by instructing the relevant hardware through computer-readable instructions. The computer-readable instructions can be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the above-described method embodiments.
  • One or more instructions stored in the computer-readable storage medium can be loaded and executed by the processor, so as to implement the corresponding steps of the method for noise reduction of monophonic speech in the above-mentioned embodiments. Specifically:
  • Obtain the monophonic speech to be denoised; construct an initial noise reduction model based on the LSTM neural network; obtain a preset number of enhanced training samples, and use the preset number of enhanced training samples to train the initial noise reduction model to obtain a noise reduction model;
  • the noise reduction model denoises the monophonic speech to be denoised to obtain human voice audio.
  • When executed by the processor, the computer program further implements:
  • the noise reduction effect of the noise reduction model on the enhanced training samples is greater than the preset threshold of the noise reduction effect of the initial noise reduction model on the enhanced training samples.
  • the computer program when executed by the processor, further implements:
  • the parameters of several hidden layers in the initial noise reduction model are fixed.
  • the embodiments of the present application may be provided as a method, a system, or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
  • These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture comprising instruction means that implement the functions specified in one or more of the flowcharts and/or one or more blocks of the block diagrams.
  • These computer program instructions can also be loaded onto a computer or other programmable data processing device so that a series of operational steps is performed on the computer or other programmable device to produce a computer-implemented process, such that the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more of the flowcharts and/or one or more blocks of the block diagrams.

Landscapes

  • Engineering & Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Quality & Reliability (AREA)
  • Soundproofing, Sound Blocking, And Sound Damping (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

A noise reduction method and system for monophonic speech, and a device and a readable storage medium. The method comprises: acquiring monophonic speech to be subjected to noise reduction (S1); building an initial noise reduction model based on an LSTM neural network (S2); acquiring a preset number of enhanced training samples, and using the preset number of enhanced training samples to train the initial noise reduction model, so as to obtain a noise reduction model (S3); and reducing noise of said monophonic speech by means of the noise reduction model to obtain human voice audio (S4). By means of training a model on the basis of an LSTM neural network, learning a noise rule of a whole time sequence is facilitated, thus achieving a better noise reduction effect. Moreover, on the basis of the complexity of a noise influence factor, an initial noise reduction model is trained again by means of enhanced training samples, such that the noise reduction effect of the noise reduction model is further improved.

Description

Monophonic speech noise reduction method, system, device and readable storage medium
This application claims priority to Chinese patent application No. 202011534575.3, entitled "Monophonic speech noise reduction method, system, device and readable storage medium", filed with the China Patent Office on December 22, 2020, the entire contents of which are incorporated herein by reference.
Technical Field
The present application belongs to the technical field of speech noise reduction, and in particular relates to a monophonic speech noise reduction method, system, device and readable storage medium.
Background
A few years ago, phone calls were often plagued by noise that seriously degraded call quality. Today, with the spread of smartphones, noise during calls has noticeably decreased. This is because most current smartphones use Qualcomm chips, most of which carry Qualcomm's proprietary CVC technology. CVC is a call noise reduction technology that works by capturing sound through the phone's two built-in microphones during a call. The primary microphone sits near the speaker's mouth and picks up the speaker's voice at a high level; the secondary microphone is farther from the mouth and picks up the voice at a lower level, while both microphones pick up ambient noise at almost the same level. By combining the signals collected by the two microphones, an algorithm can determine which sounds are the desired speaker's voice, thereby achieving noise-reduced calls.
However, the inventors realized that this technology still has the following drawbacks. First, it cannot process monophonic audio: the phone must have two microphones, so the technology does nothing for single-microphone phones. Second, it places requirements on the speaker's posture during the call: the speaker's voice source must be very close to the primary microphone, so the technology cannot be applied when the speaker is far from the microphone or uses a single-microphone headset.
Technical Problem
One objective of the embodiments of the present application is to provide a monophonic speech noise reduction method, system, device and readable storage medium, aiming to solve the technical problem that existing call noise reduction technology cannot process monophonic audio.
Technical Solution
To solve the above technical problem, the embodiments of the present application adopt the following technical solutions:
A first aspect of the embodiments of the present application provides a monophonic speech noise reduction method, including:
acquiring monophonic speech to be denoised; constructing an initial noise reduction model based on an LSTM neural network; acquiring a preset number of enhanced training samples and training the initial noise reduction model with them to obtain a noise reduction model; and denoising the monophonic speech to be denoised by means of the noise reduction model to obtain human voice audio.
A second aspect of the embodiments of the present application provides a monophonic speech noise reduction system, including:
an acquisition module, configured to acquire monophonic speech to be denoised;
a model building module, configured to construct an initial noise reduction model based on an LSTM neural network;
an enhanced training module, configured to acquire a preset number of enhanced training samples and train the initial noise reduction model with them to obtain a noise reduction model; and
a noise reduction module, configured to denoise the monophonic speech to be denoised by means of the noise reduction model to obtain human voice audio.
A third aspect of the embodiments of the present application provides a terminal device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements:
acquiring monophonic speech to be denoised; constructing an initial noise reduction model based on an LSTM neural network; acquiring a preset number of enhanced training samples and training the initial noise reduction model with them to obtain a noise reduction model; and denoising the monophonic speech to be denoised by means of the noise reduction model to obtain human voice audio.
A fourth aspect of the embodiments of the present application provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements:
acquiring monophonic speech to be denoised; constructing an initial noise reduction model based on an LSTM neural network; acquiring a preset number of enhanced training samples and training the initial noise reduction model with them to obtain a noise reduction model; and denoising the monophonic speech to be denoised by means of the noise reduction model to obtain human voice audio.
A fifth aspect of the embodiments of the present application further provides a computer program product which, when run on a terminal device, causes the terminal device to implement:
acquiring monophonic speech to be denoised; constructing an initial noise reduction model based on an LSTM neural network; acquiring a preset number of enhanced training samples and training the initial noise reduction model with them to obtain a noise reduction model; and denoising the monophonic speech to be denoised by means of the noise reduction model to obtain human voice audio.
Beneficial Effects
Compared with the prior art, the embodiments of the present application have the following advantages:
In the embodiments of the present application, after the monophonic speech to be denoised is acquired, an initial noise reduction model is constructed, a preset number of enhanced training samples are acquired and used to train the initial noise reduction model to obtain a noise reduction model, and the monophonic speech to be denoised is then denoised directly by the noise reduction model to obtain human voice audio. The noise reduction process is not restricted to two channels and can denoise any monophonic speech. The noise reduction model is obtained by training the initial noise reduction model with the enhanced training samples, and the initial noise reduction model is built by training an LSTM neural network on mixed audio. Because the noise category tends to stay roughly the same over the whole time series of a single call, training the model on the basis of an LSTM neural network makes it easier to learn the noise pattern of the whole time series and thereby achieve a better noise reduction effect. In addition, because the factors that influence noise are complex, training data can hardly cover all possible noise environments, so the initial noise reduction model is not robust to some noises. Training the initial noise reduction model again with the enhanced training samples lets it relearn noise types it has not yet learned, further improving the noise reduction effect of the noise reduction model.
Description of Drawings
To illustrate the technical solutions in the embodiments of the present application more clearly, the drawings used in the description of the embodiments or exemplary techniques are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present application, and those of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flowchart of the monophonic speech noise reduction method according to an embodiment of the present application;
FIG. 2 is a flowchart of the training of the initial noise reduction model according to an embodiment of the present application;
FIG. 3 is a flowchart of the training of the noise reduction model according to an embodiment of the present application;
FIG. 4 is a structural block diagram of the monophonic speech noise reduction system according to an embodiment of the present application;
FIG. 5 is a structural block diagram of the terminal device according to an embodiment of the present application.
Embodiments of the Invention
To help those skilled in the art better understand the solutions of the present application, the technical solutions in the embodiments of the present application are described clearly and completely below with reference to the drawings in those embodiments. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by those of ordinary skill in the art on the basis of the embodiments in the present application without creative effort shall fall within the scope of protection of the present application.
It should be noted that the terms "first", "second" and the like in the description, the claims and the above drawings of the present application are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that data so used may be interchanged where appropriate, so that the embodiments of the present application described here can be implemented in orders other than those illustrated or described here. Furthermore, the terms "comprising" and "having" and any variations of them are intended to cover non-exclusive inclusion; for example, a process, method, system, product or device that comprises a series of steps or units is not necessarily limited to the steps or units expressly listed, but may include other steps or units that are not expressly listed or that are inherent to the process, method, product or device.
The present application is described in further detail below with reference to the drawings:
Referring to FIG. 1, an embodiment of the present application provides a monophonic speech noise reduction method that denoises monophonic speech effectively and is well suited to calls made on single-microphone phones and single-microphone headsets. Specifically, the monophonic speech noise reduction method includes the following steps:
S1: Acquire the monophonic speech to be denoised.
Specifically, during a call on a single-microphone phone or a single-microphone headset, monophonic speech is acquired in real time through the microphone; the monophonic speech here may also be monophonic speech downloaded from the Internet. When call speech is denoised, the noise reduction may be performed before the monophonic speech is sent, when the monophonic speech arrives at the corresponding call device, or at both points.
S2: Construct an initial noise reduction model based on an LSTM neural network.
Specifically, to obtain a noise reduction model for monophonic speech, an initial noise reduction model must be obtained first. A call unfolds over a time series, and in many cases the noise category stays roughly the same over the whole time series of a single call (for example, wind noise that persists throughout). This embodiment therefore uses an LSTM (long short-term memory) network to build the initial model, which makes it easier to learn the noise pattern of the whole time series, and obtains the initial noise reduction model by training the LSTM neural network model on mixed audio. Specifically, referring to FIG. 2, this includes the following steps:
S201: A large amount of pure human voice audio and pure noise audio is collected, and the human voice audio and noise audio are then randomly combined to obtain a number of mixed audio clips. To meet the noise reduction requirement, each mixed audio clip includes one piece of human voice audio and at least one piece of noise audio, and may also include multiple pieces of human voice audio.
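The random combination in S201 can be sketched as follows. This is only an illustration under stated assumptions: clips are represented as lists of float samples, the mixture is a plain sample-wise sum, and the function name mix_audio and its parameters are hypothetical rather than taken from the application.

```python
import random


def mix_audio(voice_clips, noise_clips, max_noises=2, seed=None):
    """Randomly combine one voice clip with one or more noise clips.

    Hypothetical helper: the sample-wise sum and looping of short noise
    clips are assumptions made for this sketch.
    """
    rng = random.Random(seed)
    voice = rng.choice(voice_clips)
    count = rng.randint(1, min(max_noises, len(noise_clips)))
    chosen_noises = rng.sample(noise_clips, count)
    mixed = list(voice)
    for noise in chosen_noises:
        for i in range(len(mixed)):
            # Loop the noise clip if it is shorter than the voice clip.
            mixed[i] += noise[i % len(noise)]
    # Return the mixture together with the clean voice used as its target.
    return mixed, voice


voices = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.4, 0.3, 0.2]]
noises = [[0.01, -0.01], [0.02, 0.0, -0.02]]
mixed, clean = mix_audio(voices, noises, seed=0)
```

Keeping the clean voice clip alongside each mixture gives the later training step a reference against which the predicted human voice audio can be compared.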
S202: The mixed audio is framed, windowed and Fourier transformed to obtain a number of mixed audio frame spectra. Specifically, in this embodiment the mixed audio is divided into frames with a frame length of 25 ms and a frame shift of 10 ms, and each frame is then windowed. The window function used for windowing generally has a low-pass characteristic, and its purpose is to reduce leakage in the frequency domain. Window functions commonly used in speech signal analysis include the rectangular window, the Hamming window and the Hanning window, and different window functions can be chosen for different situations. The framed and windowed mixed audio is then converted from the time domain to the frequency domain by the Fourier transform, which maps the time-domain features to spectral features of a certain dimension, for example a number of basic sine waves, yielding a number of mixed audio frame spectra.
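A minimal sketch of the framing, windowing and Fourier transform in S202 is given below. The 25 ms frame length and 10 ms frame shift come from the text; the 16 kHz sampling rate, the choice of the Hamming window, the naive DFT and all function names are assumptions made for illustration only.

```python
import cmath
import math


def frame_signal(samples, sample_rate=16000, frame_ms=25, shift_ms=10):
    """Split a signal into overlapping frames (25 ms long, 10 ms apart)."""
    frame_len = sample_rate * frame_ms // 1000
    shift = sample_rate * shift_ms // 1000
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += shift
    return frames


def hamming(n):
    """Hamming window coefficients for a frame of n samples."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def magnitude_spectrum(frame):
    """Window one frame and return the magnitudes of its DFT (bins 0..n/2)."""
    n = len(frame)
    windowed = [s * w for s, w in zip(frame, hamming(n))]
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                  for t, x in enumerate(windowed))
        spectrum.append(abs(acc))
    return spectrum


# One second of a 1 kHz tone sampled at the assumed 16 kHz rate.
signal = [math.sin(2 * math.pi * 1000 * t / 16000) for t in range(16000)]
frames = frame_signal(signal)
spectrum = magnitude_spectrum(frames[0])
peak_bin = max(range(len(spectrum)), key=spectrum.__getitem__)
```

With these assumed settings each frame holds 400 samples, so adjacent spectral bins are 40 Hz apart and the 1 kHz tone peaks at bin 25. A production system would normally use a fast Fourier transform routine instead of the naive DFT shown here.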
S203: The mixed audio frame spectra are divided into a training set and a test set, an LSTM neural network model for binary classification is established, the LSTM neural network model is trained with the training set, the trained LSTM neural network model is tested with the test set, and the initial noise reduction model is obtained once the test is passed.
Specifically, the mixed audio frame spectra are randomly divided into a training set and a test set. The proportion of mixed audio frame spectra in the training set and the test set can be set as desired, and during iterative training the mixed audio frame spectra in the two sets can also be exchanged.
An LSTM neural network model for binary classification is then established. An existing LSTM neural network model generally includes an input layer, several hidden layers and an output layer. Here the mixed audio frame spectrum is fed to the input layer, several hidden layers are set, and the separated human voice spectrum and noise spectrum are the binary classification result of the output layer. The constructed LSTM neural network model is then trained with the training set and tested with the test set. The test is passed when the pass rate of the test results meets a preset pass rate threshold, and the initial noise reduction model is obtained. The pass rate of the test results is the proportion of mixed audio frame spectra in the test set that yield clear human voice audio after passing through the trained LSTM neural network model, relative to all mixed audio frame spectra in the test set.
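The structure described above (a spectrum as input, recurrent hidden state, and an output that splits each frame into a human voice spectrum and a noise spectrum) can be sketched with a single LSTM layer in plain Python. This is a toy under stated assumptions, not the application's model: the class name TinyLSTMSeparator, the per-bin sigmoid mask used to split each frame, the random untrained weights and the omission of biases are all choices made for illustration.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vector):
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


class TinyLSTMSeparator:
    """One LSTM layer plus a sigmoid output layer (biases omitted for brevity)."""

    def __init__(self, n_bins, hidden, seed=0):
        rng = random.Random(seed)

        def init(rows, cols):
            return [[rng.uniform(-0.1, 0.1) for _ in range(cols)]
                    for _ in range(rows)]

        # Forget, input and output gate weights, plus candidate-state weights.
        self.w_f = init(hidden, n_bins + hidden)
        self.w_i = init(hidden, n_bins + hidden)
        self.w_o = init(hidden, n_bins + hidden)
        self.w_c = init(hidden, n_bins + hidden)
        self.w_out = init(n_bins, hidden)  # output (classification) layer
        self.hidden = hidden

    def separate(self, frames):
        h = [0.0] * self.hidden
        c = [0.0] * self.hidden
        voice, noise = [], []
        for frame in frames:
            z = list(frame) + h
            f = [sigmoid(v) for v in matvec(self.w_f, z)]
            i = [sigmoid(v) for v in matvec(self.w_i, z)]
            o = [sigmoid(v) for v in matvec(self.w_o, z)]
            g = [math.tanh(v) for v in matvec(self.w_c, z)]
            c = [fv * cv + iv * gv for fv, cv, iv, gv in zip(f, c, i, g)]
            h = [ov * math.tanh(cv) for ov, cv in zip(o, c)]
            # Assumed output: per-bin share of the energy assigned to voice.
            mask = [sigmoid(v) for v in matvec(self.w_out, h)]
            voice.append([m * x for m, x in zip(mask, frame)])
            noise.append([(1.0 - m) * x for m, x in zip(mask, frame)])
        return voice, noise


model = TinyLSTMSeparator(n_bins=4, hidden=3)
mixture = [[1.0, 0.5, 0.2, 0.0], [0.8, 0.6, 0.1, 0.3]]
voice_spec, noise_spec = model.separate(mixture)
```

Because the hidden state and cell state are carried from frame to frame, the split applied to each frame can depend on the noise seen earlier in the same call, which is the property the text attributes to the LSTM.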
The LSTM neural network model is trained with the training set as follows:
The mixed audio frame spectra in the training set are input into the constructed LSTM neural network model, which outputs the predicted human voice spectrum and noise spectrum; the human voice audio and noise audio are then obtained by the inverse Fourier transform. The error between the predicted human voice audio and the actual human voice audio is back-propagated, and the parameters of the LSTM neural network model, including the weights of the forget gate, input gate and output gate, are updated iteratively by stochastic gradient descent until the number of training iterations reaches a preset value or the error between the predicted and actual human voice audio no longer decreases.
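The two stopping conditions described above (a preset number of training iterations, or an error that no longer decreases) can be sketched on a deliberately tiny stand-in model. The sketch below fits a single gain w so that w times a mixed value approximates the clean value; the one-parameter model, the function name train_sgd, the patience counter and the tolerance are assumptions chosen to keep the example short, and a real LSTM would update its gate weights by back-propagation in the same loop.

```python
import random


def train_sgd(pairs, lr=0.05, max_epochs=200, patience=5, tol=1e-9, seed=0):
    """Stochastic gradient descent with two stopping rules.

    Stops when max_epochs is reached or when the mean squared error has
    not improved by more than tol for `patience` consecutive epochs.
    """
    rng = random.Random(seed)
    data = list(pairs)
    w = 0.0
    best = float("inf")
    stale = 0
    history = []
    for _ in range(max_epochs):
        rng.shuffle(data)
        for x, y in data:
            grad = 2.0 * (w * x - y) * x  # derivative of (w*x - y)^2
            w -= lr * grad
        loss = sum((w * x - y) ** 2 for x, y in data) / len(data)
        history.append(loss)
        if loss < best - tol:
            best = loss
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                break  # the error no longer decreases
    return w, history


rng = random.Random(1)
inputs = [rng.uniform(-1.0, 1.0) for _ in range(50)]
pairs = [(x, 0.7 * x) for x in inputs]  # toy target: clean = 0.7 * mixed
weight, history = train_sgd(pairs)
```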
Preferably, when the LSTM neural network model is trained with the training set, its learning rate is updated by an annealing algorithm. The learning rate is a hyperparameter that governs how the gradient of the loss function is used to adjust the model weights in gradient descent. If the learning rate is too large, the updates may overshoot the global optimum of the loss function; if it is too small, the loss function changes very slowly, which greatly increases the cost of convergence and makes it easy to become trapped in a local minimum or saddle point. Using an annealing algorithm to let the learning rate change over time addresses this problem well.
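The text does not say which annealing schedule is used. As one possible illustration, the sketch below applies step decay, halving the learning rate every fixed number of epochs; the function name annealed_learning_rate and the decay and step values are assumptions.

```python
def annealed_learning_rate(initial_lr, epoch, decay=0.5, step=10):
    """Step decay: multiply the learning rate by `decay` every `step` epochs."""
    return initial_lr * decay ** (epoch // step)


# Learning rate at a few selected epochs, starting from 0.1.
schedule = [annealed_learning_rate(0.1, epoch) for epoch in (0, 9, 10, 25)]
```

Starting with a larger rate lets the early epochs move quickly, while the later, smaller rates reduce the risk of stepping over a minimum.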
The trained LSTM neural network model is then tested with the test set: the mixed audio frame spectra in the test set are input into the trained LSTM neural network model, and when the noise reduction pass rate for these spectra falls within the preset range, the trained LSTM neural network model is considered to have passed the test and the initial noise reduction model is obtained.
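The pass rate check can be sketched as below. How a result is judged to be clear human voice audio is not specified in the text, so this sketch assumes a hypothetical per-sample error score that must not exceed a threshold; the function names and both threshold values are likewise illustrative.

```python
def pass_rate(errors, clear_threshold):
    """Share of test items whose (assumed) error score counts as clear."""
    passed = sum(1 for error in errors if error <= clear_threshold)
    return passed / len(errors)


def is_model_qualified(errors, clear_threshold=0.05, pass_rate_threshold=0.9):
    """The test is passed when the pass rate reaches the preset threshold."""
    return pass_rate(errors, clear_threshold) >= pass_rate_threshold


# Hypothetical error scores for ten test items; one of them is not clear.
test_errors = [0.01, 0.02, 0.04, 0.03, 0.20, 0.01, 0.02, 0.03, 0.04, 0.01]
qualified = is_model_qualified(test_errors)
```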
S3: Acquire a preset number of enhanced training samples, and train the initial noise reduction model with them to obtain the noise reduction model.
Specifically, the initial noise reduction model can meet the noise reduction requirements for monophonic speech in some scenarios. In practice, however, the inventors found that, owing to the complexity of factors such as noise type, hardware and microphone recording distance, training data can hardly cover all possible noise environments, so the initial noise reduction model is not robust to some noises, and the intelligibility and clarity of the human voice audio it produces sometimes still fail to meet requirements. In that case the initial noise reduction model needs to learn the noise data pattern from the specific noise data, that is, to undergo enhanced training, the purpose of which is to relearn noise types not yet learned so that the pattern of this noise type can be recognized and the noise reduced. Accordingly, in this embodiment, monophonic speech that the initial noise reduction model fails to denoise acceptably is used as the enhanced training samples, and the final noise reduction model is obtained by training the initial noise reduction model with these samples.
Specifically, this embodiment provides a method for acquiring the enhanced training samples. On the premise that noise is continuous, when the human voice audio obtained after monophonic speech is denoised by the initial noise reduction model contains a similar waveform that occurs more than a preset number of times, and the energy of that similar waveform is greater than a preset energy threshold, the noise reduction of the current monophonic speech is considered unqualified. Such unqualified monophonic speech is used as enhanced training samples, and different users produce different enhanced training samples.
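The criterion above can be sketched as follows. The text does not define how waveform similarity is measured, so this sketch makes several assumptions: the denoised output is cut into fixed-length segments, similarity is the cosine similarity between segments, and energy is the mean squared amplitude. The function names and all threshold values are hypothetical.

```python
import math
import random


def energy(segment):
    return sum(s * s for s in segment) / len(segment)


def similarity(a, b):
    """Cosine similarity between two equal-length segments."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def is_denoising_unqualified(denoised, segment_len, max_occurrences=3,
                             similarity_threshold=0.9, energy_threshold=1e-3):
    """Flag output that still contains a loud, frequently repeating waveform."""
    segments = [denoised[i:i + segment_len]
                for i in range(0, len(denoised) - segment_len + 1, segment_len)]
    loud = [s for s in segments if energy(s) > energy_threshold]
    for reference in loud:
        # Count every loud segment that resembles the reference (itself included).
        occurrences = sum(1 for other in loud
                          if similarity(reference, other) >= similarity_threshold)
        if occurrences > max_occurrences:
            return True
    return False


# A residual hum: ten identical periods of a sine wave.
hum = [0.2 * math.sin(2 * math.pi * t / 50) for t in range(500)]
rng = random.Random(0)
irregular = [rng.uniform(-0.5, 0.5) for _ in range(500)]
```

In this toy setup the periodic hum is flagged as unqualified, while the irregular signal and pure silence are not, since neither contains a loud waveform that repeats.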
Specifically, referring to FIG. 3, the initial noise reduction model is trained with the enhanced training samples as follows:
A number of monophonic speech samples that the initial noise reduction model fails to denoise acceptably are acquired as the preset number of enhanced training samples; a number of test samples formed by combining human voice audio with noise audio are acquired; and the initial noise reduction model is trained with the preset number of enhanced training samples by unsupervised learning until the noise reduction effect of the trained model on the test samples is within a preset error of that of the initial noise reduction model on the test samples, and the noise reduction effect of the trained model on the enhanced training samples exceeds that of the initial noise reduction model on the enhanced training samples by more than a preset threshold, whereupon the noise reduction model is obtained.
If the initial noise reduction model after enhanced training suffers a large loss in noise reduction effect on the test samples, or shows no obvious improvement on the enhanced training samples, the initial noise reduction model is not updated, the enhanced training is considered to have failed, and enhanced training samples are acquired again for training.
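The acceptance rule for enhanced training can be written as a small check. The text does not define how the noise reduction effect is scored, so the sketch assumes a hypothetical score where higher is better; the function name and the two threshold values are also assumptions.

```python
def accept_enhanced_model(test_before, test_after,
                          enhanced_before, enhanced_after,
                          max_test_error=0.02, min_enhanced_gain=0.05):
    """Decide whether the model after enhanced training replaces the initial one.

    Scores are hypothetical noise reduction effect values (higher is better).
    """
    # The effect on the test samples must stay within the preset error.
    test_preserved = abs(test_after - test_before) <= max_test_error
    # The effect on the enhanced samples must improve by more than the threshold.
    enhanced_improved = (enhanced_after - enhanced_before) > min_enhanced_gain
    return test_preserved and enhanced_improved


accepted = accept_enhanced_model(0.90, 0.89, 0.40, 0.70)
rejected_regression = accept_enhanced_model(0.90, 0.80, 0.40, 0.70)
rejected_no_gain = accept_enhanced_model(0.90, 0.90, 0.40, 0.42)
```

When the check fails, the initial noise reduction model is kept unchanged and new enhanced training samples are collected, as the paragraph above describes.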
When the initial noise reduction model is trained with the preset number of enhanced training samples, the parameters of several hidden layers in the initial noise reduction model are fixed. Because this training is an unsupervised learning process, some of the hidden layers in the initial noise reduction model need to be fixed and left unadjusted during enhanced training to prevent the model from overfitting. As for which hidden layers to fix, all hidden layers except the classification layer are generally selected.
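Fixing the hidden layers while still updating the classification layer can be sketched with a simple flag per layer. The dictionary layout, the layer names and the function enhanced_training_step are hypothetical, and the gradients are made-up numbers used only to show which weights change.

```python
def enhanced_training_step(layers, gradients, lr=0.01):
    """Apply one gradient step, skipping every layer marked as frozen."""
    for name, layer in layers.items():
        if layer["frozen"]:
            continue  # fixed hidden layer: parameters are left untouched
        layer["weights"] = [w - lr * g
                            for w, g in zip(layer["weights"], gradients[name])]
    return layers


layers = {
    "lstm_1": {"weights": [0.5, -0.2], "frozen": True},
    "lstm_2": {"weights": [0.1, 0.3], "frozen": True},
    "classifier": {"weights": [0.7, -0.4], "frozen": False},
}
gradients = {
    "lstm_1": [1.0, 1.0],
    "lstm_2": [1.0, 1.0],
    "classifier": [1.0, -1.0],
}
enhanced_training_step(layers, gradients)
```

Only the classifier weights move in this step, which mirrors the general choice described above of fixing every hidden layer except the classification layer.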
Preferably, depending on the number of enhanced training samples, when that number exceeds a certain amount, several random nonlinear layers are added between the hidden layers and the classification layer of the initial noise reduction model. The purpose of the additional random nonlinear layers is to give the model greater capacity to model this kind of noise pattern, so as to obtain a better training effect.
S4: Denoise the monophonic speech to be denoised by means of the noise reduction model to obtain human voice audio.
Specifically, in this embodiment, the monophonic speech to be denoised is Fourier transformed to obtain the call speech spectrum, so that the spectrum is extracted as the feature; the call speech spectrum is input into the preset noise reduction model to obtain the human voice spectrum and the noise spectrum; and the human voice spectrum is inverse Fourier transformed to obtain the human voice audio.
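The three stages of S4 (Fourier transform, model, inverse Fourier transform) can be sketched for a single frame as follows. The naive DFT, the function names and the passthrough_model stand-in are assumptions; the stand-in simply labels the whole spectrum as human voice so that the round trip can be checked, and recombining overlapping frames is outside the scope of this sketch.

```python
import cmath
import math


def dft(frame):
    n = len(frame)
    return [sum(x * cmath.exp(-2j * math.pi * k * t / n)
                for t, x in enumerate(frame))
            for k in range(n)]


def idft(spectrum):
    n = len(spectrum)
    return [(sum(c * cmath.exp(2j * math.pi * k * t / n)
                 for k, c in enumerate(spectrum)) / n).real
            for t in range(n)]


def denoise_frame(frame, model):
    """Fourier transform, split into voice and noise, then invert the voice part."""
    spectrum = dft(frame)
    voice_spectrum, _noise_spectrum = model(spectrum)
    return idft(voice_spectrum)


def passthrough_model(spectrum):
    # Hypothetical stand-in for the trained noise reduction model:
    # treats everything as human voice and nothing as noise.
    return list(spectrum), [0j] * len(spectrum)


frame = [0.0, 0.5, 1.0, 0.5, 0.0, -0.5, -1.0, -0.5]
restored = denoise_frame(frame, passthrough_model)
```

A trained model would replace passthrough_model; any callable that maps a spectrum to a pair of voice and noise spectra fits this interface.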
In summary, in the monophonic speech noise reduction method of the present application, after the monophonic speech to be denoised is acquired, an initial noise reduction model is constructed, a preset number of enhanced training samples are acquired and used to train the initial noise reduction model to obtain a noise reduction model, and the monophonic speech to be denoised is denoised directly by the noise reduction model to obtain human voice audio. The noise reduction model is obtained by training the initial noise reduction model with the enhanced training samples, and the initial noise reduction model is built by training an LSTM neural network model on mixed audio. Because the noise category tends to stay roughly the same over the whole time series of a single call, training on the basis of an LSTM neural network model makes it easier to learn the noise pattern of the whole time series and thereby achieve a better noise reduction effect. In addition, because the factors that influence noise are complex, training data can hardly cover all possible noise environments, so the initial noise reduction model is not robust to some noises. Training the initial noise reduction model again with the enhanced training samples lets it relearn noise types it has not yet learned, further improving the noise reduction effect of the noise reduction model.
The greatest advantage of this monophonic speech noise reduction method is that it works on a single channel, which gives it a wide range of application and good compatibility. Applied to phone calls, the method does not require the phone to have two microphones and is suitable for any phone with a microphone. Once the method is applied, handset manufacturers no longer need to fit phones with two microphones, which can lower the cost of the phone, reduce its weight, and free the space of one microphone so that the phone can be made thinner and lighter. Second, the method places no requirement on the distance between the speaker and the microphone: it can be applied not only where the speaker is far from the microphone but also where the speaker uses a single-microphone headset for the call. Finally, the method can also denoise monophonic speech obtained by any means, such as speech downloaded from the Internet.
The following are device embodiments of the present application, which can be used to carry out the method embodiments of the present application. For details not disclosed in the device embodiments, refer to the method embodiments of the present application.
Referring to FIG. 4, in yet another embodiment of the present application, a monophonic speech noise reduction system is provided that can be used to implement the monophonic speech noise reduction method described above. Specifically, the monophonic speech noise reduction system includes an acquisition module, a model building module, an enhanced training module and a noise reduction module.
The acquisition module is used to acquire the monophonic speech to be denoised; the model building module is used to construct an initial noise reduction model based on an LSTM neural network; the enhanced training module is used to acquire a preset number of enhanced training samples and to train the initial noise reduction model with the preset number of enhanced training samples to obtain the noise reduction model; and the noise reduction module is used to denoise the monophonic speech to be denoised through the noise reduction model to obtain the human voice audio.
Preferably, the model building module of the monophonic speech noise reduction system includes a mixed audio acquisition module, a mixed audio processing module and a training module.
The mixed audio acquisition module is used to acquire several human voice audios and several noise audios and to combine them randomly to obtain several mixed audios, each mixed audio including one human voice audio and at least one noise audio; the mixed audio processing module is used to apply framing, windowing and a Fourier transform to the mixed audios to obtain several mixed audio frame spectra; and the training module is used to divide the mixed audio frame spectra into a training set and a test set, establish an LSTM neural network model for binary classification, train the LSTM neural network model on the training set, test the trained LSTM neural network model on the test set, and obtain the initial noise reduction model once the test is passed.
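By way of illustration only, the data preparation performed by the mixed audio acquisition and processing modules can be sketched as follows. This is a minimal sketch under stated assumptions: all function names, the Hann window, the frame length, the hop size and the 80/20 split are hypothetical choices that the application does not specify, and the Fourier transform of each windowed frame is omitted for brevity, so the split is applied to frames rather than to frame spectra.

```python
import math
import random


def mix(voice, noises):
    """Add one voice clip and one or more noise clips sample by sample."""
    length = min(len(voice), *(len(n) for n in noises))
    return [voice[i] + sum(n[i] for n in noises) for i in range(length)]


def make_mixtures(voices, noises, count, rng):
    """Randomly combine one voice clip with at least one noise clip per mixture."""
    mixtures = []
    for _ in range(count):
        voice = rng.choice(voices)
        chosen = rng.sample(noises, rng.randint(1, len(noises)))
        mixtures.append(mix(voice, chosen))
    return mixtures


def frame_and_window(signal, frame_len, hop):
    """Split a signal into overlapping frames and apply a Hann window to each."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * i / (frame_len - 1))
              for i in range(frame_len)]
    return [
        [signal[start + i] * window[i] for i in range(frame_len)]
        for start in range(0, len(signal) - frame_len + 1, hop)
    ]


def split(items, train_ratio, rng):
    """Shuffle the items and divide them into a training set and a test set."""
    shuffled = items[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


rng = random.Random(0)
voices = [[math.sin(0.1 * t) for t in range(64)] for _ in range(3)]
noises = [[rng.uniform(-0.1, 0.1) for _ in range(64)] for _ in range(4)]
mixtures = make_mixtures(voices, noises, count=5, rng=rng)
frames = [f for m in mixtures for f in frame_and_window(m, frame_len=16, hop=8)]
train_frames, test_frames = split(frames, train_ratio=0.8, rng=rng)
```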
Preferably, the training module of the monophonic speech noise reduction system includes a prediction module and a parameter update module.
The prediction module is used to input the mixed audio frame spectra in the training set into the LSTM neural network model to obtain a human voice spectrum and a noise spectrum, and to apply an inverse Fourier transform to them to obtain predicted human voice audio and predicted noise audio; the parameter update module is used to iteratively update the parameters of the LSTM neural network model according to the error between the predicted human voice audio and the actual human voice audio, until the number of training iterations reaches a preset value or the error between the predicted and actual human voice audio no longer decreases.
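By way of illustration only, the two stopping conditions of the parameter update module (a preset iteration count, or an error that no longer decreases) can be sketched as follows. This is a minimal sketch under stated assumptions: `train`, `step_fn`, `error_fn` and `min_delta` are hypothetical names, and a one-parameter gain fitted by gradient descent stands in for the LSTM neural network model.

```python
def train(params, step_fn, error_fn, max_iters, min_delta=1e-9):
    """Update params until max_iters is reached or the error stops decreasing."""
    best = error_fn(params)
    for iteration in range(1, max_iters + 1):
        candidate = step_fn(params)
        error = error_fn(candidate)
        if best - error <= min_delta:
            # The error no longer decreases: keep the previous parameters.
            return params, best, iteration - 1
        params, best = candidate, error
    return params, best, max_iters


# Toy problem: the "predicted voice" is gain * mixture, the true voice is 0.5 * mixture.
mixture = [1.0, -2.0, 3.0]
target = [0.5 * x for x in mixture]


def error_fn(gain):
    """Mean squared error between predicted and actual voice samples."""
    return sum((gain * x - y) ** 2 for x, y in zip(mixture, target)) / len(mixture)


def step_fn(gain, lr=0.05):
    """One gradient-descent update of the gain."""
    grad = sum(2 * (gain * x - y) * x for x, y in zip(mixture, target)) / len(mixture)
    return gain - lr * grad


gain, final_error, iters = train(0.0, step_fn, error_fn, max_iters=200)
```

Here training ends well before the iteration limit because the error plateaus first; with a smaller `max_iters` the iteration limit would end it instead.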
Preferably, the enhanced training module of the monophonic speech noise reduction system includes an enhanced training sample acquisition module, a test sample acquisition module and an enhanced training management module.
The enhanced training sample acquisition module is used to acquire several monophonic speech samples that the initial noise reduction model fails to denoise acceptably, as the preset number of enhanced training samples; the test sample acquisition module is used to acquire several test samples formed by combining human voice audio with several noise audios; and the enhanced training management module is used to train the initial noise reduction model with the preset number of enhanced training samples in an unsupervised learning manner, until the noise reduction effect of the trained initial noise reduction model on the test samples differs from that of the initial noise reduction model on the test samples by no more than a preset error, and the noise reduction effect of the trained initial noise reduction model on the enhanced training samples exceeds that of the initial noise reduction model on the enhanced training samples by a preset threshold.
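By way of illustration only, the termination condition applied by the enhanced training management module can be written as a single predicate. This is a minimal sketch under stated assumptions: the function and parameter names are hypothetical, the "noise reduction effect" is assumed to be a single numeric score where higher is better (the application does not name a metric), and "greater than ... by a preset threshold" is read as an improvement margin.

```python
def enhanced_training_done(test_before, test_after, hard_before, hard_after,
                           tolerance, min_gain):
    """Return True when enhanced training may stop.

    test_*: score on the ordinary test samples before and after enhanced training.
    hard_*: score on the enhanced training samples before and after.
    """
    # The effect on ordinary test samples must stay within the preset error.
    preserved = abs(test_after - test_before) <= tolerance
    # The effect on the previously failing samples must improve by the threshold.
    improved = (hard_after - hard_before) > min_gain
    return preserved and improved


# Hypothetical scores, for example signal-to-noise ratios in dB.
ok = enhanced_training_done(20.0, 19.8, 5.0, 9.0, tolerance=0.5, min_gain=3.0)
regressed = enhanced_training_done(20.0, 18.0, 5.0, 9.0, tolerance=0.5, min_gain=3.0)
too_small = enhanced_training_done(20.0, 19.8, 5.0, 6.0, tolerance=0.5, min_gain=3.0)
```

The first condition guards against forgetting what the initial model already handles; the second confirms that the additional training actually helped on the samples that motivated it.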
Preferably, the enhanced training module of the monophonic speech noise reduction system further includes a parameter fixing module, which is used to fix the parameters of several hidden layers of the initial noise reduction model while the initial noise reduction model is trained with the preset number of enhanced training samples.
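By way of illustration only, fixing the parameters of selected hidden layers during an update can be sketched as follows. This is a minimal sketch under stated assumptions: the layer names, the dictionary representation of the parameters and the plain gradient step are all hypothetical and are not taken from the application.

```python
def sgd_step(params, grads, frozen, lr=0.1):
    """Apply one gradient step, leaving every layer named in `frozen` untouched."""
    return {
        name: values if name in frozen
        else [v - lr * g for v, g in zip(values, grads[name])]
        for name, values in params.items()
    }


# Hypothetical model with two hidden layers and one classification layer.
params = {"lstm_1": [1.0, 2.0], "lstm_2": [3.0], "classifier": [4.0]}
grads = {"lstm_1": [1.0, 1.0], "lstm_2": [1.0], "classifier": [1.0]}
updated = sgd_step(params, grads, frozen={"lstm_1"})
```

Only `lstm_2` and `classifier` change in this example; the frozen layer keeps what it learned during initial training.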
Preferably, the enhanced training module of the monophonic speech noise reduction system further includes a random nonlinear layer adding module, which is used to add several random nonlinear layers between the hidden layers and the classification layer of the initial noise reduction model before the initial noise reduction model is trained with the preset number of enhanced training samples.
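By way of illustration only, inserting additional layers between the hidden layers and the classification layer can be sketched as follows. This is a minimal sketch under stated assumptions: "random nonlinear layer" is interpreted here as a randomly initialized linear map followed by `tanh`, which the application does not confirm, and all names and the stand-in `hidden` and `classifier` functions are hypothetical.

```python
import math
import random


def random_nonlinear_layer(size, rng):
    """Return a size-to-size layer with random weights followed by tanh."""
    weights = [[rng.gauss(0.0, 0.1) for _ in range(size)] for _ in range(size)]

    def layer(x):
        return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in weights]

    return layer


def insert_before_classifier(layers, new_layers):
    """Place new layers between the hidden layers and the final classification layer."""
    return layers[:-1] + new_layers + layers[-1:]


def forward(layers, x):
    """Run the input through each layer in order."""
    for layer in layers:
        x = layer(x)
    return x


def hidden(x):
    """Stand-in for the pretrained hidden (LSTM) layers."""
    return [2.0 * v for v in x]


def classifier(x):
    """Stand-in for the classification layer: one sigmoid output."""
    return [1.0 / (1.0 + math.exp(-sum(x)))]


rng = random.Random(0)
base = [hidden, classifier]
extended = insert_before_classifier(
    base, [random_nonlinear_layer(3, rng) for _ in range(2)]
)
output = forward(extended, [0.1, -0.2, 0.3])
```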
Preferably, the noise reduction module of the monophonic speech noise reduction system is used to apply a Fourier transform to the monophonic speech to be denoised to obtain a call speech spectrum; to input the call speech spectrum into the noise reduction model to obtain a human voice spectrum and a noise spectrum; and to apply an inverse Fourier transform to the human voice spectrum to obtain the human voice audio.
Referring to FIG. 5, in yet another embodiment of the present application, a terminal device is provided. At the hardware level, the terminal device includes a processor and a memory, and optionally also an internal bus and a network interface. The memory may include internal memory, such as high-speed random access memory, and may also include non-volatile memory, such as at least one disk storage. The terminal device may of course also include hardware required by other services. The processor, the network interface and the memory are connected to one another through the internal bus, which may be an Industry Standard Architecture bus, a Peripheral Component Interconnect bus, an Extended Industry Standard Architecture bus, or the like. The bus may be divided into an address bus, a data bus, a control bus and so on. The memory is used to store a program. Specifically, the program may include program code, and the program code includes computer operation instructions. The memory may include internal memory and non-volatile memory, and provides instructions and data to the processor.
The processor reads the corresponding computer program from the non-volatile memory into internal memory and then runs it, forming the above terminal device at the logical level. The processor may be a Central Processing Unit (CPU), or another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. It is the computing core and control core of the terminal, is suitable for implementing one or more instructions, and is specifically suitable for loading and executing one or more instructions so as to realize the corresponding method flow or function. Specifically:
A terminal device includes a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements:
acquiring the monophonic speech to be denoised; constructing an initial noise reduction model based on an LSTM neural network; acquiring a preset number of enhanced training samples and training the initial noise reduction model with the preset number of enhanced training samples to obtain a noise reduction model; and denoising the monophonic speech to be denoised through the noise reduction model to obtain human voice audio.
In an embodiment, the processor, when executing the computer program, further implements:
acquiring several human voice audios and several noise audios and combining them randomly to obtain several mixed audios, each mixed audio including one human voice audio and at least one noise audio; applying framing, windowing and a Fourier transform to the mixed audios to obtain several mixed audio frame spectra; and dividing the mixed audio frame spectra into a training set and a test set, establishing an LSTM neural network model for binary classification, training the LSTM neural network model on the training set, and testing the trained LSTM neural network model on the test set, the test being passed when the pass rate of the test results meets a preset pass rate threshold, whereupon the initial noise reduction model is obtained.
In an embodiment, the processor, when executing the computer program, further implements:
inputting the mixed audio frame spectra in the training set into the LSTM neural network model to obtain a human voice spectrum and a noise spectrum, and applying an inverse Fourier transform to them to obtain predicted human voice audio and predicted noise audio; and iteratively updating the parameters of the LSTM neural network model according to the error between the predicted human voice audio and the actual human voice audio, until the number of training iterations reaches a preset value or the error between the predicted and actual human voice audio no longer decreases.
In an embodiment, the processor, when executing the computer program, further implements:
acquiring several monophonic speech samples that the initial noise reduction model fails to denoise acceptably, as the preset number of enhanced training samples; acquiring several test samples formed by combining human voice audio with several noise audios; and training the initial noise reduction model with the preset number of enhanced training samples in an unsupervised learning manner, until the noise reduction effect of the trained initial noise reduction model on the test samples differs from that of the initial noise reduction model on the test samples by no more than a preset error, and the noise reduction effect of the trained initial noise reduction model on the enhanced training samples exceeds that of the initial noise reduction model on the enhanced training samples by a preset threshold.
In an embodiment, the processor, when executing the computer program, further implements:
fixing the parameters of several hidden layers of the initial noise reduction model while the initial noise reduction model is trained with the preset number of enhanced training samples.
In an embodiment, the processor, when executing the computer program, further implements:
adding several random nonlinear layers between the hidden layers and the classification layer of the initial noise reduction model before the initial noise reduction model is trained with the preset number of enhanced training samples.
In an embodiment, the processor, when executing the computer program, further implements:
applying a Fourier transform to the monophonic speech to be denoised to obtain a call speech spectrum; inputting the call speech spectrum into the noise reduction model to obtain a human voice spectrum and a noise spectrum; and applying an inverse Fourier transform to the human voice spectrum to obtain the human voice audio.
In yet another embodiment, the present application further provides a computer-readable storage medium (memory), which is a memory device in the terminal device used to store programs and data. It can be understood that the computer-readable storage medium here may include a built-in storage medium of the terminal device and may of course also include an extended storage medium supported by the terminal device. The computer-readable storage medium may be non-volatile or volatile. The computer-readable storage medium provides a storage space that stores the operating system of the terminal. The storage space also holds one or more instructions suitable for being loaded and executed by the processor; these instructions may be one or more computer programs (including program code). It should be noted that the computer-readable storage medium here may be a high-speed RAM memory or a non-volatile memory, such as at least one disk storage. All or part of the flow of the methods in the above embodiments can be accomplished by computer-readable instructions directing the relevant hardware; the computer-readable instructions may be stored in a non-volatile computer-readable storage medium and, when executed, may include the flows of the method embodiments described above.
One or more instructions stored in the computer-readable storage medium can be loaded and executed by the processor to implement the corresponding steps of the monophonic speech noise reduction method in the above embodiments. Specifically:
acquiring the monophonic speech to be denoised; constructing an initial noise reduction model based on an LSTM neural network; acquiring a preset number of enhanced training samples and training the initial noise reduction model with the preset number of enhanced training samples to obtain a noise reduction model; and denoising the monophonic speech to be denoised through the noise reduction model to obtain human voice audio.
In an embodiment, the computer program, when executed by the processor, further implements:
acquiring several human voice audios and several noise audios and combining them randomly to obtain several mixed audios, each mixed audio including one human voice audio and at least one noise audio; applying framing, windowing and a Fourier transform to the mixed audios to obtain several mixed audio frame spectra; and dividing the mixed audio frame spectra into a training set and a test set, establishing an LSTM neural network model for binary classification, training the LSTM neural network model on the training set, and testing the trained LSTM neural network model on the test set, the test being passed when the pass rate of the test results meets a preset pass rate threshold, whereupon the initial noise reduction model is obtained.
In an embodiment, the computer program, when executed by the processor, further implements:
inputting the mixed audio frame spectra in the training set into the LSTM neural network model to obtain a human voice spectrum and a noise spectrum, and applying an inverse Fourier transform to them to obtain predicted human voice audio and predicted noise audio; and iteratively updating the parameters of the LSTM neural network model according to the error between the predicted human voice audio and the actual human voice audio, until the number of training iterations reaches a preset value or the error between the predicted and actual human voice audio no longer decreases.
In an embodiment, the computer program, when executed by the processor, further implements:
acquiring several monophonic speech samples that the initial noise reduction model fails to denoise acceptably, as the preset number of enhanced training samples; acquiring several test samples formed by combining human voice audio with several noise audios; and training the initial noise reduction model with the preset number of enhanced training samples in an unsupervised learning manner, until the noise reduction effect of the trained initial noise reduction model on the test samples differs from that of the initial noise reduction model on the test samples by no more than a preset error, and the noise reduction effect of the trained initial noise reduction model on the enhanced training samples exceeds that of the initial noise reduction model on the enhanced training samples by a preset threshold.
In an embodiment, the computer program, when executed by the processor, further implements:
fixing the parameters of several hidden layers of the initial noise reduction model while the initial noise reduction model is trained with the preset number of enhanced training samples.
In an embodiment, the computer program, when executed by the processor, further implements:
adding several random nonlinear layers between the hidden layers and the classification layer of the initial noise reduction model before the initial noise reduction model is trained with the preset number of enhanced training samples.
In an embodiment, the computer program, when executed by the processor, further implements:
applying a Fourier transform to the monophonic speech to be denoised to obtain a call speech spectrum; inputting the call speech spectrum into the noise reduction model to obtain a human voice spectrum and a noise spectrum; and applying an inverse Fourier transform to the human voice spectrum to obtain the human voice audio.
Those skilled in the art will appreciate that the embodiments of the present application may be provided as a method, a system, or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM and optical storage) containing computer-usable program code.
The present application is described with reference to flowcharts and/or block diagrams of methods, devices (systems) and computer program products according to embodiments of the present application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor or other programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps are performed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present application and not to limit them. Although the present application has been described in detail with reference to the above embodiments, those of ordinary skill in the art should understand that the specific implementations of the present application may still be modified or equivalently replaced, and any modification or equivalent replacement that does not depart from the spirit and scope of the present application shall fall within the protection scope of the claims of the present application.

Claims (20)

  1. A monophonic speech noise reduction method, comprising the following steps:
    acquiring monophonic speech to be denoised;
    constructing an initial noise reduction model based on an LSTM neural network;
    acquiring a preset number of enhanced training samples, and training the initial noise reduction model with the preset number of enhanced training samples to obtain a noise reduction model;
    denoising the monophonic speech to be denoised through the noise reduction model to obtain human voice audio.
  2. The monophonic speech noise reduction method according to claim 1, wherein constructing the initial noise reduction model based on an LSTM neural network comprises:
    acquiring several human voice audios and several noise audios and combining them randomly to obtain several mixed audios, each mixed audio including one human voice audio and at least one noise audio;
    applying framing, windowing and a Fourier transform to the mixed audios to obtain several mixed audio frame spectra;
    dividing the mixed audio frame spectra into a training set and a test set, establishing an LSTM neural network model for binary classification, training the LSTM neural network model on the training set, and testing the trained LSTM neural network model on the test set, the test being passed when the pass rate of the test results meets a preset pass rate threshold, whereupon the initial noise reduction model is obtained.
  3. The monophonic speech noise reduction method according to claim 2, wherein training the LSTM neural network model on the training set comprises:
    inputting the mixed audio frame spectra in the training set into the LSTM neural network model to obtain a human voice spectrum and a noise spectrum, and applying an inverse Fourier transform to them to obtain predicted human voice audio and predicted noise audio;
    iteratively updating the parameters of the LSTM neural network model according to the error between the predicted human voice audio and the actual human voice audio, until the number of training iterations reaches a preset value or the error between the predicted and actual human voice audio no longer decreases.
  4. The monophonic speech noise reduction method according to claim 1, wherein acquiring the preset number of enhanced training samples and training the initial noise reduction model with the preset number of enhanced training samples comprises:
    acquiring several monophonic speech samples that the initial noise reduction model fails to denoise acceptably, as the preset number of enhanced training samples;
    acquiring several test samples formed by combining human voice audio with several noise audios;
    training the initial noise reduction model with the preset number of enhanced training samples in an unsupervised learning manner, until the noise reduction effect of the trained initial noise reduction model on the test samples differs from that of the initial noise reduction model on the test samples by no more than a preset error, and the noise reduction effect of the trained initial noise reduction model on the enhanced training samples exceeds that of the initial noise reduction model on the enhanced training samples by a preset threshold.
  5. The monophonic speech noise reduction method according to claim 4, wherein, when the initial noise reduction model is trained with the preset number of enhanced training samples, the parameters of several hidden layers of the initial noise reduction model are fixed.
  6. The monophonic speech noise reduction method according to claim 4, wherein, before the initial noise reduction model is trained with the preset number of enhanced training samples, several random nonlinear layers are added between the hidden layers and the classification layer of the initial noise reduction model.
  7. The monophonic speech noise reduction method according to claim 1, wherein denoising the monophonic speech to be denoised with the noise reduction model to obtain the human voice audio comprises:
    performing a Fourier transform on the monophonic speech to be denoised to obtain a call speech spectrum;
    inputting the call speech spectrum into the noise reduction model to obtain a human voice spectrum and a noise spectrum;
    performing an inverse Fourier transform on the human voice spectrum to obtain the human voice audio.
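The three steps of claim 7 (transform, separate, inverse transform) can be sketched in plain Python as follows. The naive DFT is used only to keep the example dependency-free, and `toy_model` is a hypothetical stand-in for the trained LSTM noise reduction model: it simply labels the lowest frequency bins as voice, which is an assumption made for illustration and not the method of the source.

```python
import cmath
import math


def dft(x):
    """Naive discrete Fourier transform of a real or complex sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Inverse DFT, returning the real part of each reconstructed sample."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n)
                for k in range(n)).real / n
            for t in range(n)]


def toy_model(spectrum, keep_bins=2):
    """Hypothetical separator: low bins (and their mirrored bins) count as voice."""
    n = len(spectrum)
    voice = [c if (k <= keep_bins or k >= n - keep_bins) else 0j
             for k, c in enumerate(spectrum)]
    noise = [c - v for c, v in zip(spectrum, voice)]
    return voice, noise


def denoise(signal, model):
    spectrum = dft(signal)                      # step 1: Fourier transform
    voice_spec, _noise_spec = model(spectrum)   # step 2: split into voice / noise
    return idft(voice_spec)                     # step 3: inverse transform


n = 16
voice = [math.cos(2 * math.pi * 1 * t / n) for t in range(n)]
noise = [0.5 * math.cos(2 * math.pi * 6 * t / n) for t in range(n)]
cleaned = denoise([v + z for v, z in zip(voice, noise)], toy_model)
print(max(abs(c - v) for c, v in zip(cleaned, voice)))  # close to zero
```

In practice the transform would be applied per windowed frame rather than to the whole signal, and the separation would come from the trained model rather than a fixed frequency cutoff.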
  8. A monophonic speech noise reduction system, comprising:
    an acquisition module, configured to acquire monophonic speech to be denoised;
    a model building module, configured to build an initial noise reduction model based on an LSTM neural network;
    an enhanced training module, configured to obtain a preset number of enhanced training samples and to train the initial noise reduction model with the preset number of enhanced training samples to obtain a noise reduction model; and
    a noise reduction module, configured to denoise the monophonic speech to be denoised with the noise reduction model to obtain human voice audio.
  9. A terminal device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements:
    acquiring monophonic speech to be denoised;
    building an initial noise reduction model based on an LSTM neural network;
    obtaining a preset number of enhanced training samples, and training the initial noise reduction model with the preset number of enhanced training samples to obtain a noise reduction model;
    denoising the monophonic speech to be denoised with the noise reduction model to obtain human voice audio.
  10. The terminal device according to claim 9, wherein the processor, when executing the computer program, further implements:
    obtaining a number of human voice audio clips and a number of noise audio clips and combining them randomly to obtain a number of mixed audio clips, each mixed audio clip comprising one human voice audio clip and at least one noise audio clip;
    performing framing, windowing and a Fourier transform on the mixed audio clips to obtain a number of mixed audio frame spectra;
    dividing the mixed audio frame spectra into a training set and a test set, building an LSTM neural network model for binary classification, training the LSTM neural network model with the training set, and testing the trained LSTM neural network model with the test set, wherein the test is passed and the initial noise reduction model is obtained when the pass rate of the test results meets a preset pass rate threshold.
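The data preparation in claim 10 (mixing, framing with windowing, and splitting into training and test sets) can be sketched as below. The frame length, hop size, Hann window, split fraction, and all function names are assumptions chosen for illustration; the source does not specify them. The per-frame Fourier transform is omitted here to keep the sketch short.

```python
import math
import random


def mix(voice, noises):
    """Add one voice clip and one or more equally long noise clips sample by sample."""
    return [v + sum(noise[i] for noise in noises) for i, v in enumerate(voice)]


def frame_and_window(signal, frame_len, hop):
    """Cut the signal into overlapping frames and apply a Hann window to each."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * i / (frame_len - 1))
              for i in range(frame_len)]
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frames.append([signal[start + i] * window[i] for i in range(frame_len)])
    return frames


def split_train_test(items, train_fraction, seed=0):
    """Shuffle reproducibly and split into a training set and a test set."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


voice = [math.sin(0.3 * t) for t in range(10)]
noise_a = [0.1] * 10
noise_b = [-0.05] * 10
frames = frame_and_window(mix(voice, [noise_a, noise_b]), frame_len=4, hop=2)
train, test = split_train_test(frames, train_fraction=0.75)
print(len(frames), len(train), len(test))
```

Each windowed frame would then be passed through a Fourier transform to obtain the mixed audio frame spectra that serve as model input.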
  11. The terminal device according to claim 10, wherein the processor, when executing the computer program, further implements:
    inputting the mixed audio frame spectra in the training set into the LSTM neural network model to obtain a human voice spectrum and a noise spectrum, and performing an inverse Fourier transform on them to obtain predicted human voice audio and predicted noise audio;
    iteratively updating the parameters of the LSTM neural network model according to the error between the predicted human voice audio and the actual human voice audio, until the number of training iterations reaches a preset value or the error between the predicted human voice audio and the actual human voice audio no longer decreases.
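The two stopping conditions in claim 11 (a preset number of iterations, or an error that no longer decreases) correspond to a standard early-stopping loop. The sketch below assumes a hypothetical `step` callable that performs one parameter update and returns the current error; the `patience` and `min_delta` parameters are illustrative additions, not taken from the source.

```python
def train_until_converged(step, max_iters, patience=1, min_delta=0.0):
    """Run `step` until `max_iters` is reached or the error stops decreasing.

    Returns (iterations_run, best_error).
    """
    best = float("inf")
    stale = 0
    for iteration in range(1, max_iters + 1):
        error = step()  # one update of the model parameters (hypothetical)
        if error < best - min_delta:
            best = error
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                # The error no longer decreases: stop early.
                return iteration, best
    # The preset number of training iterations was reached.
    return max_iters, best


# Toy usage: replay a fixed sequence of errors instead of training a real model.
errors = iter([5.0, 3.0, 2.0, 2.0, 2.0, 1.0])
print(train_until_converged(lambda: next(errors), max_iters=10))
```

A `patience` greater than one tolerates short plateaus before stopping, which is often useful when the error fluctuates between iterations.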
  12. The terminal device according to claim 9, wherein the processor, when executing the computer program, further implements:
    obtaining a number of monophonic speech recordings for which noise reduction by the initial noise reduction model was unqualified, as the preset number of enhanced training samples;
    obtaining a number of test samples formed by combining human voice audio with a number of noise audio clips;
    training the initial noise reduction model, in an unsupervised manner, with the preset number of enhanced training samples, until the noise reduction effect of the trained initial noise reduction model on the test samples is within a preset error of the noise reduction effect of the initial noise reduction model on the test samples, and the noise reduction effect of the trained initial noise reduction model on the enhanced training samples exceeds the noise reduction effect of the initial noise reduction model on the enhanced training samples by a preset threshold.
  13. The terminal device according to claim 12, wherein the processor, when executing the computer program, further implements:
    fixing the parameters of a number of hidden layers in the initial noise reduction model when the initial noise reduction model is trained with the preset number of enhanced training samples.
  14. The terminal device according to claim 12, wherein the processor, when executing the computer program, further implements:
    adding a number of random nonlinear layers between the hidden layers and the classification layer of the initial noise reduction model before the initial noise reduction model is trained with the preset number of enhanced training samples.
  15. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements:
    acquiring monophonic speech to be denoised;
    building an initial noise reduction model based on an LSTM neural network;
    obtaining a preset number of enhanced training samples, and training the initial noise reduction model with the preset number of enhanced training samples to obtain a noise reduction model;
    denoising the monophonic speech to be denoised with the noise reduction model to obtain human voice audio.
  16. The computer-readable storage medium according to claim 15, wherein the computer program, when executed by the processor, further implements:
    obtaining a number of human voice audio clips and a number of noise audio clips and combining them randomly to obtain a number of mixed audio clips, each mixed audio clip comprising one human voice audio clip and at least one noise audio clip;
    performing framing, windowing and a Fourier transform on the mixed audio clips to obtain a number of mixed audio frame spectra;
    dividing the mixed audio frame spectra into a training set and a test set, building an LSTM neural network model for binary classification, training the LSTM neural network model with the training set, and testing the trained LSTM neural network model with the test set, wherein the test is passed and the initial noise reduction model is obtained when the pass rate of the test results meets a preset pass rate threshold.
  17. The computer-readable storage medium according to claim 16, wherein the computer program, when executed by the processor, further implements:
    inputting the mixed audio frame spectra in the training set into the LSTM neural network model to obtain a human voice spectrum and a noise spectrum, and performing an inverse Fourier transform on them to obtain predicted human voice audio and predicted noise audio;
    iteratively updating the parameters of the LSTM neural network model according to the error between the predicted human voice audio and the actual human voice audio, until the number of training iterations reaches a preset value or the error between the predicted human voice audio and the actual human voice audio no longer decreases.
  18. The computer-readable storage medium according to claim 15, wherein the computer program, when executed by the processor, further implements:
    obtaining a number of monophonic speech recordings for which noise reduction by the initial noise reduction model was unqualified, as the preset number of enhanced training samples;
    obtaining a number of test samples formed by combining human voice audio with a number of noise audio clips;
    training the initial noise reduction model, in an unsupervised manner, with the preset number of enhanced training samples, until the noise reduction effect of the trained initial noise reduction model on the test samples is within a preset error of the noise reduction effect of the initial noise reduction model on the test samples, and the noise reduction effect of the trained initial noise reduction model on the enhanced training samples exceeds the noise reduction effect of the initial noise reduction model on the enhanced training samples by a preset threshold.
  19. The computer-readable storage medium according to claim 18, wherein the computer program, when executed by the processor, further implements:
    fixing the parameters of a number of hidden layers in the initial noise reduction model when the initial noise reduction model is trained with the preset number of enhanced training samples.
  20. The computer-readable storage medium according to claim 18, wherein the computer program, when executed by the processor, further implements:
    adding a number of random nonlinear layers between the hidden layers and the classification layer of the initial noise reduction model before the initial noise reduction model is trained with the preset number of enhanced training samples.
PCT/CN2021/083652 2020-12-22 2021-03-29 Noise reduction method and system for monophonic speech, and device and readable storage medium WO2022134351A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011534575.3A CN112614504A (en) 2020-12-22 2020-12-22 Single sound channel voice noise reduction method, system, equipment and readable storage medium
CN202011534575.3 2020-12-22

Publications (1)

Publication Number Publication Date
WO2022134351A1 (en) 2022-06-30

Family

ID=75244251

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/083652 WO2022134351A1 (en) 2020-12-22 2021-03-29 Noise reduction method and system for monophonic speech, and device and readable storage medium

Country Status (2)

Country Link
CN (1) CN112614504A (en)
WO (1) WO2022134351A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111568384A (en) * 2020-05-29 2020-08-25 上海联影医疗科技有限公司 Voice noise reduction method and device in medical scanning and computer equipment
CN113327626B (en) * 2021-06-23 2023-09-08 深圳市北科瑞声科技股份有限公司 Voice noise reduction method, device, equipment and storage medium
CN113539262B (en) * 2021-07-09 2023-08-22 广东金鸿星智能科技有限公司 Sound enhancement and recording method and system for voice control of electric door

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170061978A1 (en) * 2014-11-07 2017-03-02 Shannon Campbell Real-time method for implementing deep neural network based speech separation
CN107452389A (en) * 2017-07-20 2017-12-08 大象声科(深圳)科技有限公司 A kind of general monophonic real-time noise-reducing method
CN109841226A (en) * 2018-08-31 2019-06-04 大象声科(深圳)科技有限公司 A kind of single channel real-time noise-reducing method based on convolution recurrent neural network
US10614827B1 (en) * 2017-02-21 2020-04-07 Oben, Inc. System and method for speech enhancement using dynamic noise profile estimation
US20200349965A1 (en) * 2019-05-01 2020-11-05 Francesco Nesta Audio enhancement through supervised latent variable representation of target speech and noise

Also Published As

Publication number Publication date
CN112614504A (en) 2021-04-06

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 21908363; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 EP: PCT application non-entry in European phase (Ref document number: 21908363; Country of ref document: EP; Kind code of ref document: A1)