CN114155872A - Single-channel voice noise reduction method and device, electronic equipment and storage medium
- Publication number
- CN114155872A (application CN202111545638.XA)
- Authority
- CN
- China
- Prior art keywords
- voice
- frame
- discrete cosine
- noise
- cosine transform
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
- G10L21/0216—Noise filtering characterised by the method used for estimating noise
- G10L21/0224—Processing in the time domain
Abstract
The application relates to a single-channel voice noise reduction method and device, electronic equipment and a storage medium. The method comprises the following steps: acquiring a voice to be denoised; extracting features of the voice to be denoised and inputting the features into a pre-trained model to obtain a mask value of each frame of voice; multiplying the mask value of each frame of voice by the modified discrete cosine transform of the corresponding frame and then performing an inverse modified discrete cosine transform to obtain each noise-reduced frame of voice; and overlapping and adding the noise-reduced frames to obtain the noise-reduced voice. Compared with the FFT used by current approaches, the method has two advantages: first, the FFT is complex-valued, and since current neural network training tools do not support complex numbers, the network must be designed manually and the model complexity is high; second, the FFT has more parameters, for example a 512-point FFT yields 512 parameters in total for the real and imaginary parts, whereas a 512-point MDCT yields only 256 points, so the input and output parameters are halved and the noise reduction is simpler.
Description
Technical Field
The present application relates to the field of speech noise reduction technologies, and in particular, to a method and an apparatus for single-channel speech noise reduction, an electronic device, and a storage medium.
Background
Current single-channel speech noise reduction methods based on deep learning take the complex-valued spectrum (FFT) as input and achieve good results, but the amount of computation is large. Because complex numbers are not well supported by training frameworks, the real part and the imaginary part are currently handled as two separate data streams, which makes the model difficult to design and difficult to compress. Using the DCT instead of the FFT has been proposed: since there is only a real part, the model needs no special design and is much easier to compress. However, both the FFT and the DCT contain information redundancy when overlapping frames are used: taking a 512-point frame length and a 256-point frame shift as an example, 512 points of data are produced for every 256 new points, so the input and output are large and the parameters and computation of the intermediate layers are also large.
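The parameter counts discussed above can be checked numerically. The sketch below is illustrative only and is not part of the patent; the 512-point frame follows the example above, everything else (the random data, the use of NumPy) is an assumption:

```python
import numpy as np

rng = np.random.default_rng(0)
frame = rng.standard_normal(512)   # one 512-point frame of real-valued audio

# FFT path: np.fft.rfft returns 257 complex bins for a 512-point real frame.
# Bins 0 and 256 are purely real, so the spectrum carries 2*257 - 2 = 512
# independent real parameters per frame, matching the count in the text.
spec = np.fft.rfft(frame)
print(spec.shape)                  # (257,)

# MDCT path: a 512-point frame maps to 512 // 2 = 256 real coefficients
# (see the MDCT sketch later in this description), so the network input
# and output sizes are halved relative to the FFT representation.
print(512 // 2)                    # 256
```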
Disclosure of Invention
Based on the above problems, the present application provides a single-channel speech noise reduction method and apparatus, an electronic device, and a storage medium.
In a first aspect, an embodiment of the present application provides a method for single-channel speech noise reduction, including:
acquiring a voice to be denoised;
extracting the characteristics of the voice to be denoised, inputting the characteristics into a pre-trained model to obtain a mask value of each frame of voice;
multiplying the mask value of each frame of voice by the modified discrete cosine transform of the corresponding frame of voice, and then performing an inverse modified discrete cosine transform to obtain each noise-reduced frame of voice;
and overlapping and adding the noise-reduced voice of each frame to obtain the noise-reduced voice.
Further, in the above single-channel speech noise reduction method, extracting the features of the voice to be denoised and inputting them into a pre-trained model to obtain the mask value of each frame of voice includes:
framing and windowing the voice to be denoised and performing a modified discrete cosine transform to obtain the modified discrete cosine transform corresponding to each frame of voice;
and inputting the modified discrete cosine transform corresponding to each frame of voice into a pre-trained model to obtain a mask value of each frame of voice.
Further, in the above single-channel speech noise reduction method, the training step of the pre-trained model is as follows:
acquiring a training set, wherein the training set comprises multi-sentence clean voice data and multi-sentence noise data of various types, and the clean voice data is mixed with the noise data of the various types at different signal-to-noise ratios;
extracting the characteristics of the training set;
inputting the characteristics of the training set into a network model for training, and estimating by using a signal approximation method to obtain an implicit mask matrix;
multiplying the implicit mask matrix by the modified discrete cosine transform of the noise data in the characteristics of the training set, and then performing inverse modified discrete cosine transform to obtain enhanced voice on a time domain;
and back-propagating the error between the enhanced voice and the target voice in the time domain by using a loss function, and obtaining the pre-trained model when the loss keeps decreasing until convergence.
Further, in the above method for single-channel speech noise reduction, extracting features of training data includes:
framing and windowing each sentence of voice in the training data, and applying the modified discrete cosine transform to obtain the modified discrete cosine transform of each sentence of noise data and of each sentence of clean voice data.
Further, the above single-channel speech noise reduction method further includes:
acquiring verification data;
using the verification data to supervise the model during the training of the pre-trained model, without participating in error back-propagation;
the verification data comprises multi-sentence clean voice data and multi-sentence noise data of various types, and the multi-sentence clean voice data and the multi-sentence noise data of various types are mixed at different signal-to-noise ratios; the validation data is different from the training data.
Further, in the single-channel speech noise reduction method, the network model is a joint model of a convolutional neural network, a long short-term memory network and a fully connected network.
Further, in the above-mentioned single-channel speech noise reduction method, the loss function is SI-SNR, SNR or MSE.
In a second aspect, an embodiment of the present application further provides a single-channel speech noise reduction apparatus, including:
an acquisition module: used for acquiring the voice to be denoised;
an extraction module: used for extracting features of the voice to be denoised and inputting the features into a pre-trained model to obtain the mask value of each frame of voice;
a modified discrete transform module: used for multiplying the mask value of each frame of voice by the modified discrete cosine transform of the corresponding frame of voice, and then performing an inverse modified discrete cosine transform to obtain each noise-reduced frame of voice;
an overlap-add module: used for overlapping and adding each noise-reduced frame of voice to obtain the noise-reduced voice.
In a third aspect, an embodiment of the present application further provides an electronic device, including: a processor and a memory;
the processor is used for executing the single-channel voice noise reduction method by calling the program or the instruction stored in the memory.
In a fourth aspect, an embodiment of the present application further provides a computer-readable storage medium, where the computer-readable storage medium stores a program or instructions, and the program or instructions cause a computer to perform the above single-channel speech noise reduction method.
The embodiments of the application have the following advantages. The application relates to a single-channel voice noise reduction method and device, electronic equipment and a storage medium, wherein the method comprises: acquiring a voice to be denoised; extracting features of the voice to be denoised and inputting the features into a pre-trained model to obtain a mask value of each frame of voice; multiplying the mask value of each frame of voice by the modified discrete cosine transform of the corresponding frame and then performing an inverse modified discrete cosine transform to obtain each noise-reduced frame of voice; and overlapping and adding the noise-reduced frames to obtain the noise-reduced voice. The method replaces the FFT complex spectrum of the current mainstream technology with the modified discrete cosine transform. Compared with the FFT this has two advantages: first, the FFT is complex-valued, current neural network training tools do not support complex numbers, the network must be designed manually, and the model complexity is high; second, the FFT has more parameters, for example a 512-point FFT yields 512 parameters in total for the real and imaginary parts, whereas a 512-point MDCT yields only 256 points, so the input and output parameters are halved and the noise reduction is simpler.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments or the conventional technologies of the present application, the drawings used in the description of the embodiments or the conventional technologies are briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the present application, and for those skilled in the art, other drawings can be obtained from these drawings without creative effort.
Fig. 1 is a first schematic diagram illustrating a single-channel speech noise reduction method according to an embodiment of the present application;
fig. 2 is a schematic diagram of a single-channel speech noise reduction method according to an embodiment of the present application;
fig. 3 is a schematic diagram of a single-channel speech noise reduction method according to an embodiment of the present application;
fig. 4 is a schematic diagram of a single-channel speech noise reduction apparatus according to an embodiment of the present application;
fig. 5 is a schematic block diagram of an electronic device provided in an embodiment of the present application.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present application more comprehensible, embodiments of the present application are described in detail below with reference to the accompanying drawings. In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present application. This application can, however, be embodied in many forms other than those described herein, and those skilled in the art can make similar modifications without departing from the spirit of the application; the application is therefore not limited to the specific embodiments disclosed below.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein in the description of the present application is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.
Fig. 1 is a first schematic diagram of a single-channel speech noise reduction method according to an embodiment of the present application.
In a first aspect, an embodiment of the present application provides a method for single-channel speech noise reduction, which, with reference to fig. 1, includes four steps S101 to S104:
S101: acquiring the voice to be denoised.
Specifically, the speech to be denoised obtained in the embodiment of the present application is multi-frame speech.
S102: after extracting the characteristics of the voice to be denoised, inputting the characteristics into a pre-trained model to obtain the mask value of each frame of voice.
Specifically, in this embodiment of the present application, taking one frame of 512-point data of the voice to be denoised as an example, the voice to be denoised is framed, windowed and transformed with the modified discrete cosine transform, so that the modified discrete cosine transform corresponding to each frame of voice contains 256 points; the modified discrete cosine transform of each frame is then input into the pre-trained model to obtain the mask value of each frame of voice, and the output mask is also 256 points.
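The analysis step of S102 can be sketched in NumPy as follows. This is a minimal illustration, not the patent's implementation: the 512-point frame, 256-point frame shift and 256 MDCT points follow the text, while the direct-form MDCT, the sine window and all names are assumptions:

```python
import numpy as np

FRAME, HOP = 512, 256          # 512-point frames, 256-point frame shift

def sine_window(length: int) -> np.ndarray:
    n = np.arange(length)
    return np.sin(np.pi * (n + 0.5) / length)   # satisfies the Princen-Bradley condition

def mdct(frame: np.ndarray, window: np.ndarray) -> np.ndarray:
    """Direct-form MDCT of one windowed 2N-sample frame -> N real coefficients."""
    two_n = frame.shape[0]
    half = two_n // 2
    n = np.arange(two_n)[:, None]
    k = np.arange(half)[None, :]
    basis = np.cos(np.pi / half * (n + 0.5 + half / 2) * (k + 0.5))
    return (window * frame) @ basis

# Frame a waveform with 50% overlap and transform each frame.
speech = np.random.default_rng(0).standard_normal(16000)   # stand-in for the noisy input
win = sine_window(FRAME)
n_frames = 1 + (len(speech) - FRAME) // HOP
feats = np.stack([mdct(speech[i * HOP:i * HOP + FRAME], win) for i in range(n_frames)])
print(feats.shape)   # (n_frames, 256) -> the per-frame model input described above
```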
S103: multiplying the mask value of each frame of voice by the modified discrete cosine transform of the corresponding frame, and then performing an inverse modified discrete cosine transform to obtain each noise-reduced frame of voice.
Specifically, in the embodiment of the present application, the mask value of each frame of voice is multiplied element-wise by the corresponding modified discrete cosine transform (MDCT) to obtain the noise-reduced MDCT, which is then transformed back to the time domain by the inverse modified discrete cosine transform (IMDCT) to obtain each noise-reduced frame of voice.
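The masking and synthesis of S103 can be sketched as below; again this is illustrative only. The direct-form IMDCT is an assumption, and the mask values are random placeholders standing in for the model output:

```python
import numpy as np

def imdct(coeffs: np.ndarray, window: np.ndarray) -> np.ndarray:
    """Direct-form IMDCT of N coefficients -> one windowed 2N-sample frame.

    The 2/N scale is chosen so that windowed 50%-overlap-add reconstructs the input.
    """
    half = coeffs.shape[0]
    two_n = 2 * half
    n = np.arange(two_n)[:, None]
    k = np.arange(half)[None, :]
    basis = np.cos(np.pi / half * (n + 0.5 + half / 2) * (k + 0.5))
    return window * (basis @ coeffs) * (2.0 / half)

rng = np.random.default_rng(0)
noisy_mdct = rng.standard_normal(256)        # MDCT of one noisy frame (placeholder)
mask = rng.uniform(0.0, 1.0, size=256)       # stand-in for the model's per-bin mask
win = np.sin(np.pi * (np.arange(512) + 0.5) / 512)

enhanced_mdct = mask * noisy_mdct            # element-wise masking in the MDCT domain
enhanced_frame = imdct(enhanced_mdct, win)   # back to a 512-sample time-domain frame
print(enhanced_frame.shape)                  # (512,)
```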
S104: and overlapping and adding the noise-reduced voice of each frame to obtain the noise-reduced voice.
Specifically, in the embodiment of the present application, the noise-reduced frames are overlapped and added to reconstruct the multi-frame noise-reduced voice corresponding to the voice to be denoised.
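Overlap-add with the 256-point frame shift can be sketched as follows; the frame contents here are placeholders:

```python
import numpy as np

def overlap_add(frames: np.ndarray, hop: int = 256) -> np.ndarray:
    """Sum 50%-overlapping synthesis frames back into one waveform."""
    frame_len = frames.shape[1]
    out = np.zeros(hop * (len(frames) - 1) + frame_len)
    for i, f in enumerate(frames):
        out[i * hop:i * hop + frame_len] += f
    return out

denoised_frames = np.random.default_rng(0).standard_normal((60, 512))  # placeholder frames
waveform = overlap_add(denoised_frames)
print(waveform.shape)   # (15616,) = 59 * 256 + 512
```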
Fig. 2 is a schematic diagram of a single-channel speech noise reduction method according to an embodiment of the present application.
Further, in the above single-channel speech noise reduction method, extracting the features of the speech to be noise reduced, and inputting the extracted features into a pre-trained model to obtain the mask value of each frame of speech, with reference to fig. 2, the method includes two steps S201 to S202:
S201: framing and windowing the voice to be denoised and performing the modified discrete cosine transform to obtain the modified discrete cosine transform corresponding to each frame of voice.
S202: and inputting the modified discrete cosine transform corresponding to each frame of voice into a pre-trained model to obtain a mask value of each frame of voice.
Specifically, in the embodiment of the present application, taking one frame of 512-point data of the voice to be denoised as an example, the voice is framed, windowed and transformed with the modified discrete cosine transform, giving 256 points per frame. The modified discrete cosine transform of each frame is input into the pre-trained model to obtain the mask value of each frame of voice, and the output is also 256 points. The last Dense layer of the pre-trained model is a fully connected layer followed by a Sigmoid activation function; Sigmoid is an activation function provided by training tools, and its output value lies between 0 and 1, which is equivalent to a mask, so the mask value of each frame of voice is obtained.
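A minimal PyTorch sketch of such a mask head is shown below; only the final fully connected layer plus Sigmoid follows the text, the sizes and the dummy input are assumptions:

```python
import torch
import torch.nn as nn

# Final Dense (fully connected) layer with 256 outputs, followed by Sigmoid,
# so every output lies in (0, 1) and can be used as a per-bin mask.
mask_head = nn.Sequential(
    nn.Linear(256, 256),
    nn.Sigmoid(),
)

mdct_feats = torch.randn(1, 100, 256)   # (batch, frames, 256 MDCT points)
mask = mask_head(mdct_feats)            # same shape, every value in (0, 1)
print(mask.shape)                       # torch.Size([1, 100, 256])
```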
Fig. 3 is a schematic diagram of a single-channel speech noise reduction method according to an embodiment of the present application.
Further, in the above method for single-channel speech noise reduction, with reference to fig. 3, the training step of the pre-trained model includes five steps S301 to S305:
S301: acquiring a training set; the training set comprises multi-sentence clean voice data and multi-sentence noise data of various types; the clean voice data and the noise data of the various types are mixed at different signal-to-noise ratios.
S302: and extracting the characteristics of the training set.
Extracting the features of the training data includes: framing and windowing each sentence of voice in the training data, and applying the modified discrete cosine transform to obtain the modified discrete cosine transform of each sentence of noise data and of each sentence of clean voice data.
S303: inputting the characteristics of the training set into a network model for training, and estimating by using a signal approximation method to obtain an implicit mask matrix.
S304: and multiplying the implicit mask matrix by the modified discrete cosine transform of the noise data in the characteristics of the training set, and then performing the inverse modified discrete cosine transform to obtain the enhanced voice on the time domain.
S305: back-propagating the error between the enhanced voice and the target voice in the time domain by using a loss function, and obtaining the pre-trained model when the loss keeps decreasing until convergence.
Specifically, in this embodiment of the application, the pre-trained model is obtained by training on the training set through the above five steps S301 to S305. Taking one frame of 512-point data as an example, 256 points are obtained after the modified discrete cosine transform (MDCT). The last Dense layer of the network is a fully connected layer whose output is 256 points, followed by a Sigmoid activation function; Sigmoid is an activation function available in all training tools, and its output value lies between 0 and 1, which is equivalent to a mask. The mask is multiplied element-wise by the MDCT to obtain the noise-reduced MDCT, which is transformed back to the time domain by the inverse modified discrete cosine transform to obtain the enhanced speech in the time domain. The error between the enhanced speech and the target speech in the time domain is back-propagated using a loss function, and when the loss keeps decreasing until the model converges, the pre-trained model is obtained.
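One training step can be sketched in PyTorch as below. The masking, the IMDCT, the overlap-add and the time-domain loss follow the text; the stand-in model, the MSE loss (one of the losses named later), the fold-based overlap-add and all sizes are assumptions:

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

FRAME, HOP, BINS = 512, 256, 256

def mdct_basis(half: int) -> torch.Tensor:
    n = torch.arange(2 * half, dtype=torch.float32).unsqueeze(1)
    k = torch.arange(half, dtype=torch.float32).unsqueeze(0)
    return torch.cos(math.pi / half * (n + 0.5 + half / 2) * (k + 0.5))   # (2N, N)

basis = mdct_basis(BINS)
window = torch.sin(math.pi * (torch.arange(FRAME) + 0.5) / FRAME)

def analysis(wave: torch.Tensor) -> torch.Tensor:
    """(batch, samples) -> (batch, frames, 256) MDCT features."""
    frames = wave.unfold(1, FRAME, HOP) * window           # (batch, frames, 512)
    return frames @ basis

def synthesis(coeffs: torch.Tensor, n_samples: int) -> torch.Tensor:
    """(batch, frames, 256) -> (batch, samples) via windowed IMDCT and overlap-add."""
    frames = (coeffs @ basis.T) * (2.0 / BINS) * window    # (batch, frames, 512)
    return F.fold(frames.transpose(1, 2), output_size=(1, n_samples),
                  kernel_size=(1, FRAME), stride=(1, HOP)).squeeze(1).squeeze(1)

model = nn.Sequential(nn.Linear(BINS, BINS), nn.Sigmoid())  # stand-in mask estimator
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

noisy = torch.randn(4, 16128)   # placeholder noisy mixtures
clean = torch.randn(4, 16128)   # placeholder target (clean) speech

optimizer.zero_grad()
noisy_mdct = analysis(noisy)
mask = model(noisy_mdct)                          # estimated mask, values in (0, 1)
enhanced = synthesis(mask * noisy_mdct, noisy.shape[-1])
loss = F.mse_loss(enhanced, clean)                # time-domain error, back-propagated
loss.backward()
optimizer.step()
print(loss.item())
```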
Further, the above single-channel speech noise reduction method further includes:
acquiring verification data;
using the verification data to supervise the model during the training of the pre-trained model, without participating in error back-propagation;
the verification data comprises multi-sentence clean voice data and multi-sentence noise data of various types, and the multi-sentence clean voice data and the multi-sentence noise data of various types are mixed at different signal-to-noise ratios; the validation data is different from the training data.
Specifically, in the embodiment of the present application, clean voice data and noise are mixed at different signal-to-noise ratios, and the resulting mixtures are used as training data; a validation set is generated in the same way. The training set and the validation set differ in noise type, signal-to-noise ratio and speaker. The model is trained on the training set, and the validation set is used to monitor the model but does not participate in error back-propagation.
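Mixing a clean utterance with a noise clip at a target signal-to-noise ratio can be sketched as follows; the helper name, the SNR values and the random placeholder signals are illustrative choices, not taken from the patent:

```python
import numpy as np

def mix_at_snr(clean: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale the noise so that the clean-to-noise power ratio equals snr_db, then add."""
    noise = noise[:len(clean)]                    # assumes the noise clip is long enough
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    gain = np.sqrt(p_clean / (p_noise * 10.0 ** (snr_db / 10.0)))
    return clean + gain * noise

rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)                # placeholder clean utterance
noise = rng.standard_normal(16000)                # placeholder noise recording
training_mixtures = [mix_at_snr(clean, noise, snr) for snr in (-5, 0, 5, 10)]
print(len(training_mixtures))                     # 4 mixtures at different SNRs
```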
Further, in the single-channel speech noise reduction method, the network model is a joint model of a convolutional neural network, a long short-term memory network and a fully connected network.
Specifically, in the embodiment of the application, the network model is a classical joint model of a convolutional neural network, a long short-term memory network and a fully connected network, so the model is simple, the overall noise reduction performance is good, and the size and computation of the network model are greatly compressed.
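An illustrative PyTorch sketch of such a CNN + LSTM + fully connected mask estimator is given below; the layer sizes, the kernel size and the single LSTM layer are assumptions, only the overall combination follows the text:

```python
import torch
import torch.nn as nn

class MaskNet(nn.Module):
    """CNN + LSTM + fully connected mask estimator over per-frame MDCT features."""
    def __init__(self, n_bins: int = 256, conv_ch: int = 64, hidden: int = 256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_bins, conv_ch, kernel_size=3, padding=1),  # convolution over time
            nn.ReLU(),
        )
        self.lstm = nn.LSTM(conv_ch, hidden, batch_first=True)
        self.fc = nn.Sequential(nn.Linear(hidden, n_bins), nn.Sigmoid())

    def forward(self, mdct_feats: torch.Tensor) -> torch.Tensor:
        # mdct_feats: (batch, frames, bins) -> mask of the same shape, values in (0, 1)
        x = self.conv(mdct_feats.transpose(1, 2)).transpose(1, 2)
        x, _ = self.lstm(x)
        return self.fc(x)

model = MaskNet()
mask = model(torch.randn(2, 100, 256))
print(mask.shape)   # torch.Size([2, 100, 256])
```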
Further, in the above-mentioned single-channel speech noise reduction method, the loss function is SI-SNR, SNR or MSE.
Specifically, in the embodiment of the present application, an exemplary loss function is SI-SNR, which is defined by the following formula:
$$\text{SI-SNR} = 10 \log_{10} \frac{\lVert s_{\text{target}} \rVert^2}{\lVert e_{\text{noise}} \rVert^2}, \qquad s_{\text{target}} = \frac{\langle \hat{s}, s \rangle}{\lVert s \rVert^2}\, s, \qquad e_{\text{noise}} = \hat{s} - s_{\text{target}}$$

where $s$ and $\hat{s}$ denote the clean speech and the estimated speech respectively, $\langle \cdot,\cdot \rangle$ denotes the dot product of vectors, and $\lVert \cdot \rVert$ denotes the Euclidean norm. The model keeps training until the loss no longer decreases and converges, and the pre-trained model is obtained.
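A small PyTorch implementation of this loss is sketched below; negating the returned value gives a quantity to minimize. The zero-mean step and the epsilon are common practical choices and are not specified in the patent:

```python
import torch

def si_snr(estimate: torch.Tensor, target: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Scale-invariant SNR in dB between estimated and clean waveforms (batch, samples)."""
    estimate = estimate - estimate.mean(dim=-1, keepdim=True)   # common zero-mean step
    target = target - target.mean(dim=-1, keepdim=True)
    dot = (estimate * target).sum(dim=-1, keepdim=True)
    s_target = dot * target / (target.pow(2).sum(dim=-1, keepdim=True) + eps)
    e_noise = estimate - s_target
    ratio = s_target.pow(2).sum(dim=-1) / (e_noise.pow(2).sum(dim=-1) + eps)
    return 10.0 * torch.log10(ratio + eps)

loss = -si_snr(torch.randn(4, 16000), torch.randn(4, 16000)).mean()  # training loss
print(loss.item())
```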
Fig. 4 is a schematic diagram of a single-channel speech noise reduction apparatus according to an embodiment of the present application.
In a second aspect, an embodiment of the present application further provides a single-channel speech noise reduction apparatus, including:
the acquisition module 401: for obtaining the speech to be denoised.
Specifically, in this embodiment of the present application, the to-be-denoised voice acquired by the acquiring module 401 is a multi-frame voice.
The extraction module 402: the method is used for extracting the characteristics of the voice to be denoised and inputting the characteristics into a pre-trained model to obtain the mask value of each frame of voice.
Specifically, in this embodiment of the application, taking one frame of 512-point data of the voice to be denoised as an example, the extraction module 402 frames, windows and transforms the voice to be denoised with the modified discrete cosine transform, so that the modified discrete cosine transform corresponding to each frame of voice contains 256 points; it then inputs the modified discrete cosine transform of each frame into the pre-trained model to obtain the mask value of each frame of voice, and the output mask is also 256 points.
Modified discrete transform module 403: and the method is used for multiplying the mask value of each frame of voice by the modified discrete cosine transform corresponding to the mask value of each frame of voice, and then performing inverse modified discrete cosine transform to obtain each frame of voice after noise reduction.
Specifically, in this embodiment of the application, the modified discrete transform module 403 multiplies the mask value of each frame of voice element-wise by the corresponding modified discrete cosine transform (MDCT) to obtain the noise-reduced MDCT, and returns it to the time domain through the inverse modified discrete cosine transform (IMDCT) to obtain each noise-reduced frame of voice.
Overlap-and-add module 404: and overlapping and adding each frame of voice subjected to noise reduction to obtain noise-reduced voice.
Specifically, in this embodiment of the application, the overlap-add module 404 overlaps and adds each noise-reduced frame of voice to obtain the multi-frame noise-reduced voice corresponding to the voice to be denoised.
In a third aspect, an embodiment of the present application further provides an electronic device, including: a processor and a memory;
the processor is used for executing the single-channel voice noise reduction method by calling the program or the instruction stored in the memory.
In a fourth aspect, an embodiment of the present application further provides a computer-readable storage medium, where the computer-readable storage medium stores a program or instructions, and the program or instructions cause a computer to perform the above single-channel speech noise reduction method.
Fig. 5 is a schematic block diagram of an electronic device provided by an embodiment of the present disclosure.
As shown in fig. 5, the electronic apparatus includes: at least one processor 501, at least one memory 502, and at least one communication interface 503. The various components in the electronic device are coupled together by a bus system 504. A communication interface 503 for information transmission with an external device. It is understood that the bus system 504 is used to enable communications among the components. The bus system 504 includes a power bus, a control bus, and a status signal bus in addition to a data bus. For clarity of illustration, the various buses are labeled as bus system 504 in fig. 5.
It will be appreciated that the memory 502 in this embodiment can be either volatile memory or nonvolatile memory, or can include both volatile and nonvolatile memory.
In some embodiments, memory 502 stores elements, executable units or data structures, or a subset thereof, or an expanded set thereof as follows: an operating system and an application program.
The operating system includes various system programs, such as a framework layer, a core library layer, a driver layer, and the like, and is used for implementing various basic services and processing hardware-based tasks. The application programs, including various application programs such as a media player and a browser, are used to implement various application services. The program for implementing any one of the single-channel speech noise reduction methods provided by the embodiments of the present application may be included in an application program.
In this embodiment of the application, the processor 501 is configured to execute the steps of the embodiments of the single-channel speech noise reduction method provided by this embodiment of the application by calling a program or instructions stored in the memory 502, specifically a program or instructions stored in an application program, namely:
Acquiring a voice to be denoised;
extracting the characteristics of the voice to be denoised, inputting the characteristics into a pre-trained model to obtain a mask value of each frame of voice;
multiplying the mask value of each frame of voice by the modified discrete cosine transform corresponding to the mask value of each frame of voice, and then performing inverse modified discrete cosine transform to obtain each frame of voice after noise reduction;
and overlapping and adding the noise-reduced voice of each frame to obtain the noise-reduced voice.
Any one of the single-channel speech noise reduction methods provided by the embodiments of the present application may be applied to the processor 501 or implemented by the processor 501. The processor 501 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuits in hardware or by instructions in the form of software in the processor 501. The processor 501 may be a general-purpose processor, a Digital Signal Processor (DSP), an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
The steps of any one of the single-channel speech noise reduction methods provided by the embodiments of the present application may be performed directly by a hardware decoding processor, or performed by a combination of hardware and software units in the decoding processor. The software units may be located in RAM, flash memory, ROM, PROM or EPROM, registers, and other storage media well known in the art. The storage medium is located in the memory 502, and the processor 501 reads the information in the memory 502 and completes the steps of the single-channel speech noise reduction method in combination with its hardware.
Those skilled in the art will appreciate that although some embodiments described herein include some features included in other embodiments instead of others, combinations of features of different embodiments are meant to be within the scope of the application and form different embodiments.
Those skilled in the art will appreciate that the description of each embodiment has a respective emphasis, and reference may be made to the related description of other embodiments for those parts of an embodiment that are not described in detail.
While the invention has been described with reference to specific embodiments, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.
Claims (10)
1. A method for single-channel speech noise reduction, comprising:
acquiring a voice to be denoised;
extracting characteristics of the voice to be denoised, and inputting the characteristics into a pre-trained model to obtain a mask value of each frame of voice;
multiplying the mask value of each frame of voice by the modified discrete cosine transform of the corresponding frame of voice, and then performing an inverse modified discrete cosine transform to obtain each noise-reduced frame of voice;
and overlapping and adding the noise-reduced voice of each frame to obtain noise-reduced voice.
2. The method of claim 1, wherein the extracting the features of the speech to be noise-reduced and inputting the extracted features into a pre-trained model to obtain a mask value of each frame of speech comprises:
framing and windowing the voice to be denoised and performing a modified discrete cosine transform to obtain the modified discrete cosine transform corresponding to each frame of voice;
and inputting the modified discrete cosine transform corresponding to each frame of voice into a pre-trained model to obtain a mask value of each frame of voice.
3. The method of claim 1, wherein the training of the pre-trained model comprises the following steps:
acquiring a training set; the training set comprises multi-sentence clean voice data and multi-sentence noise data of various types; mixing the multi-sentence clean voice data and the multi-sentence noise data with different signal-to-noise ratios;
extracting features of the training set;
inputting the characteristics of the training set into a network model for training, and estimating by using a signal approximation method to obtain an implicit mask matrix;
multiplying the implicit mask matrix by the modified discrete cosine transform of the noise data in the characteristics of the training set, and then performing inverse modified discrete cosine transform to obtain enhanced voice on a time domain;
and returning the error of the enhanced voice and the target voice in the time domain by using a loss function, and obtaining the pre-trained model when the loss is continuously reduced until convergence.
4. The method of claim 3, wherein the extracting the feature of the training data comprises:
framing and windowing each sentence of voice in the training data, and applying the modified discrete cosine transform to obtain the modified discrete cosine transform of each sentence of noise data and of each sentence of clean voice data.
5. The method of claim 3, wherein the method further comprises:
acquiring verification data;
using the validation data to supervise the model during the training of the pre-trained model without participating in error back-propagation;
the verification data comprises multi-sentence clean voice data and multi-sentence noise data of various types, and the multi-sentence clean voice data and the multi-sentence noise data of various types are mixed at different signal-to-noise ratios; the validation data is different from the training data.
6. The method of claim 3, wherein the network model is a joint model of a convolutional neural network, a long short-term memory network and a fully connected network.
7. A method for single channel speech noise reduction according to claim 3, wherein the loss function is SI-SNR, SNR or MSE.
8. A single-channel speech noise reduction apparatus, comprising:
an acquisition module: used for acquiring a voice to be denoised;
an extraction module: used for extracting features of the voice to be denoised and inputting the features into a pre-trained model to obtain a mask value of each frame of voice;
a modified discrete transform module: used for multiplying the mask value of each frame of voice by the modified discrete cosine transform of the corresponding frame of voice, and then performing an inverse modified discrete cosine transform to obtain each noise-reduced frame of voice;
an overlap-add module: used for overlapping and adding each noise-reduced frame of voice to obtain the noise-reduced voice.
9. An electronic device, comprising: a processor and a memory;
the processor is used for executing a single-channel speech noise reduction method according to any one of claims 1 to 7 by calling the program or the instructions stored in the memory.
10. A computer-readable storage medium storing a program or instructions for causing a computer to perform the method of single-channel speech noise reduction according to any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111545638.XA CN114155872A (en) | 2021-12-16 | 2021-12-16 | Single-channel voice noise reduction method and device, electronic equipment and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114155872A | 2022-03-08 |
Family
ID=80451265
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111545638.XA Pending CN114155872A (en) | 2021-12-16 | 2021-12-16 | Single-channel voice noise reduction method and device, electronic equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114155872A (en) |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20160247502A1 (en) * | 2015-02-23 | 2016-08-25 | Electronics And Telecommunications Research Institute | Audio signal processing apparatus and method robust against noise |
CN110164465A (en) * | 2019-05-15 | 2019-08-23 | 上海大学 | A kind of sound enhancement method and device based on deep layer Recognition with Recurrent Neural Network |
CN113178204A (en) * | 2021-04-28 | 2021-07-27 | 云知声智能科技股份有限公司 | Low-power consumption method and device for single-channel noise reduction and storage medium |
CN113192528A (en) * | 2021-04-28 | 2021-07-30 | 云知声智能科技股份有限公司 | Single-channel enhanced voice processing method and device and readable storage medium |
Non-Patent Citations (2)
Title |
---|
QINGLONG LI等: "Real-time Monaural Speech Enhancement With Short-time Discrete Cosine Transform", 《ARXIV》, 9 February 2021 (2021-02-09) * |
Y. KOIZUMI等: "End-to-End Sound Source Enhancement Using Deep Neural Network in the Modified Discrete Cosine Transform Domain", 《2018 ICASSP》, 13 September 2018 (2018-09-13), pages 706 - 710 * |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117935838A (en) * | 2024-03-25 | 2024-04-26 | 深圳市声扬科技有限公司 | Audio acquisition method and device, electronic equipment and storage medium |
CN117935838B (en) * | 2024-03-25 | 2024-06-11 | 深圳市声扬科技有限公司 | Audio acquisition method and device, electronic equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||