CN114783455A - Method, apparatus, electronic device and computer readable medium for voice noise reduction - Google Patents


Info

Publication number
CN114783455A
CN114783455A (application CN202210490037.1A)
Authority
CN
China
Prior art keywords
mel
spectrum
voice
noise reduction
trained
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210490037.1A
Other languages
Chinese (zh)
Inventor
张超
魏庆凯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Kuaiyu Electronics Co ltd
Original Assignee
Beijing Kuaiyu Electronics Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Kuaiyu Electronics Co ltd filed Critical Beijing Kuaiyu Electronics Co ltd
Priority to CN202210490037.1A priority Critical patent/CN114783455A/en
Publication of CN114783455A publication Critical patent/CN114783455A/en
Pending legal-status Critical Current


Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 — Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 — Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 — Noise filtering
    • G10L21/0216 — Noise filtering characterised by the method used for estimating noise
    • G10L21/0264 — Noise filtering characterised by the type of parameter measurement, e.g. correlation techniques, zero crossing techniques or predictive techniques
    • G10L25/00 — Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/24 — characterised by the type of extracted parameters, the extracted parameters being the cepstrum
    • G10L25/30 — characterised by the analysis technique, using neural networks
    • G10L25/48 — specially adapted for particular use

Abstract

Embodiments of the present disclosure disclose methods, apparatuses, electronic devices, and computer-readable media for speech noise reduction. One embodiment of the method comprises: acquiring a target speech; preprocessing the target speech to obtain a mel spectrum of the target speech; inputting the mel spectrum into a pre-trained characteristic neural network to obtain an amplitude mask gain coefficient of the mel spectrum; performing amplitude masking on the mel spectrum according to the amplitude mask gain coefficient to obtain a noise-reduced mel spectrum; and inputting the noise-reduced mel spectrum into a pre-trained neural network vocoder to obtain noise-reduced speech. This embodiment achieves lightweight noise reduction of the target speech and a better denoising effect.

Description

Method, apparatus, electronic device and computer readable medium for voice noise reduction
Technical Field
The embodiment of the disclosure relates to the technical field of computer voice signal processing, in particular to a method and a device for voice noise reduction, electronic equipment and a computer readable medium.
Background
With the increasing maturity of communication technology, the traffic volume of voice/video calls keeps growing, and the terminals used by users are diversified, such as personal computers, mobile phones, and edge devices such as microphones. However, because users' environments vary widely, all kinds of background noise can severely degrade voice call quality, such as the sound of an indoor air-conditioner fan, keyboard typing, outdoor traffic, birdsong, and the like.
In addition, when other people are present in the user's communication environment, their speech also interferes with the voice call. To make the received voice clearer, the collected noisy speech needs to undergo noise reduction processing to filter out background noise and other speakers' voices.
Disclosure of Invention
This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
Some embodiments of the present disclosure propose methods, apparatuses, electronic devices and computer readable media for speech noise reduction to solve the technical problems mentioned in the background section above.
In a first aspect, some embodiments of the present disclosure provide a method for speech noise reduction, the method comprising: acquiring a target voice; preprocessing the target voice to obtain a Mel spectrum of the target voice; inputting the Mel spectrum into a pre-trained characteristic neural network to obtain an amplitude mask gain coefficient of the Mel spectrum; performing amplitude masking on the Mel spectrum according to the gain coefficient of the amplitude mask to obtain a noise-reduced Mel spectrum; and inputting the noise reduction Mel spectrum into a pre-trained neural network vocoder to obtain noise reduction voice.
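The five claimed steps compose naturally as a pipeline. The following is a minimal sketch in Python; every function body is an illustrative stand-in (the real preprocessing, feature network, and vocoder are described later in this document), and all names are hypothetical rather than the patent's implementation.

```python
# Hypothetical sketch of the five claimed steps; all functions are
# illustrative placeholders, not the patent's actual components.

def preprocess(speech):
    # Stand-in for framing/windowing/STFT/mel filtering: wrap each
    # sample into a one-band "mel spectrum" row.
    return [[abs(x)] for x in speech]

def feature_network(mel):
    # Stand-in for the trained characteristic network: emit a gain in
    # [0, 1] per time-frequency bin (a fixed 0.5 for illustration).
    return [[0.5 for _ in row] for row in mel]

def apply_mask(mel, gains):
    # Element-wise multiplication of mel spectrum and mask gains.
    return [[m * g for m, g in zip(mr, gr)] for mr, gr in zip(mel, gains)]

def vocoder(denoised_mel):
    # Stand-in for the neural vocoder: collapse each frame to a sample.
    return [row[0] for row in denoised_mel]

def denoise(speech):
    mel = preprocess(speech)                 # step 2: preprocessing
    gains = feature_network(mel)             # step 3: mask gain coefficients
    denoised_mel = apply_mask(mel, gains)    # step 4: amplitude masking
    return vocoder(denoised_mel)             # step 5: vocoder synthesis

print(denoise([2.0, 4.0]))  # → [1.0, 2.0]
```

The point of the sketch is only the data flow: speech → mel spectrum → gain mask → masked mel spectrum → speech.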
In a second aspect, some embodiments of the present disclosure provide a speech noise reduction apparatus, the apparatus comprising: an acquisition unit configured to acquire a target voice; a preprocessing unit configured to preprocess the target voice to obtain a mel spectrum of the target voice; a feature unit configured to input the mel spectrum to a pre-trained feature neural network to obtain an amplitude mask gain coefficient of the mel spectrum; an amplitude masking unit configured to perform amplitude masking on the mel spectrum according to the amplitude masking gain coefficient to obtain a noise reduction mel spectrum; and the generating unit is configured to input the noise reduction Mel spectrum to a pre-trained neural network vocoder to obtain noise reduction voice.
In a third aspect, an embodiment of the present application provides an electronic device, where the electronic device includes: one or more processors; and storage means for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the method as described in any implementation of the first aspect.
In a fourth aspect, the present application provides a computer-readable medium, on which a computer program is stored, which, when executed by a processor, implements the method as described in any implementation manner of the first aspect.
One of the above various embodiments of the present disclosure has the following beneficial effects: a target speech is acquired and preprocessed to obtain its mel spectrum; the mel spectrum is input to a pre-trained characteristic neural network to obtain an amplitude mask gain coefficient; amplitude masking is performed on the mel spectrum according to the amplitude mask gain coefficient to obtain a noise-reduced mel spectrum; and the noise-reduced mel spectrum is input to a pre-trained neural network vocoder to obtain noise-reduced speech. Through this processing of the target speech, the resulting noise-reduced speech is free of interference from background noise and other speakers' voices, achieving lightweight noise reduction of the target speech with a better noise reduction effect.
Drawings
The above and other features, advantages and aspects of various embodiments of the present disclosure will become more apparent by referring to the following detailed description when taken in conjunction with the accompanying drawings. Throughout the drawings, the same or similar reference numbers refer to the same or similar elements. It should be understood that the drawings are schematic and that elements are not necessarily drawn to scale.
FIG. 1 is a schematic diagram of an application scenario of a speech noise reduction method according to some embodiments of the present disclosure;
FIG. 2 is a flow diagram of some embodiments of a speech noise reduction method according to the present disclosure;
FIG. 3 is a schematic diagram of a characteristic neural network structure, according to some embodiments of the present disclosure;
FIG. 4 is a schematic diagram of a convolution unit structure according to some embodiments of the present disclosure;
FIG. 5 is a schematic diagram of a self-attention module structure according to some embodiments of the present disclosure;
FIG. 6 is a schematic block diagram of some embodiments of a speech noise reduction apparatus according to the present disclosure;
FIG. 7 is a schematic block diagram of an electronic device suitable for use in implementing some embodiments of the present disclosure.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the disclosure are shown in the drawings, it is to be understood that the disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete. It should be understood that the drawings and embodiments of the disclosure are for illustration purposes only and are not intended to limit the scope of the disclosure.
It should be noted that, for convenience of description, only the portions related to the invention are shown in the drawings. The embodiments and features of the embodiments in the present disclosure may be combined with each other without conflict.
It should be noted that the terms "first", "second", and the like in the present disclosure are only used for distinguishing different devices, modules or units, and are not used for limiting the order or interdependence relationship of the functions performed by the devices, modules or units.
It is noted that references to "a", "an", and "the" modifications in this disclosure are intended to be illustrative rather than limiting, and that those skilled in the art will recognize that "one or more" may be used unless the context clearly dictates otherwise.
The names of messages or information exchanged between devices in the embodiments of the present disclosure are for illustrative purposes only, and are not intended to limit the scope of the messages or information.
The present disclosure will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
Fig. 1 is a schematic diagram of an application scenario of a speech noise reduction method according to some embodiments of the present disclosure.
As shown in fig. 1, first, the server 101 may obtain a target voice 102, then may preprocess the target voice 102 to obtain a mel spectrum 103 of the target voice, input the mel spectrum 103 to a pre-trained characteristic neural network to obtain an amplitude mask gain coefficient 104, then perform amplitude masking on the mel spectrum 103 according to the amplitude mask gain coefficient 104 to obtain a noise reduction mel spectrum 105, and finally input the noise reduction mel spectrum 105 to a pre-trained neural network vocoder to obtain a noise reduction voice 106.
It is understood that the voice noise reduction method may be executed by a terminal device or by the server 101; the execution subject of the method may also be a device formed by integrating the terminal device and the server 101 through a network, or the method may be executed by various software programs. The terminal device may be any of various electronic devices with information processing capability, including but not limited to a smartphone, a tablet computer, an e-book reader, a laptop computer, a desktop computer, and the like. The execution subject may also be embodied as the server 101, software, etc. When the execution subject is software, it can be installed in the electronic devices listed above, and may be implemented, for example, as multiple pieces of software or software modules for providing distributed services, or as a single piece of software or a software module. No specific limitation is made here.
It should be understood that the number of servers in fig. 1 is merely illustrative. There may be any number of servers, as desired for implementation.
With continued reference to fig. 2, a flow 200 of some embodiments of a speech noise reduction method according to the present disclosure is shown. The voice noise reduction method comprises the following steps:
step 201, obtaining target voice.
In some embodiments, the subject (e.g., the server shown in fig. 1) performing the voice noise reduction method may obtain the target voice through a wired connection or a wireless connection. Here, the target speech is generally speech that requires noise reduction.
It should be noted that the above-mentioned wireless connection means may include, but are not limited to, 3G/4G connections, WiFi connections, Bluetooth connections, WiMAX connections, ZigBee connections, UWB (Ultra-Wideband) connections, and other wireless connection means now known or developed in the future.
Step 202, preprocessing the target voice to obtain a mel spectrum of the target voice.
In some embodiments, based on the target speech obtained in step 201, the execution subject (e.g., the server shown in fig. 1) may pre-process the target speech to obtain a mel spectrum of the target speech. Here, the preprocessing generally refers to a process of converting a target voice into a mel spectrum. There are various ways to convert the target speech into mel spectrum, which are not described herein again.
In some optional implementation manners of some embodiments, the execution main body may perform framing operation on the target speech to obtain frame-level speech of the target speech. And then windowing and short-time Fourier transform are carried out on the frame-level voice to obtain the amplitude spectrum of the frame-level voice frequency spectrum. Here, the windowing process generally uses a Hanning (Hanning) window function.
Then, the magnitude spectrum is processed by a mel filter to obtain a mel energy spectrum. The mel-Energy spectrum is subjected to Energy Normalization (PCEN) to obtain an Energy normalized mel-Energy spectrum. Here, the purpose of the energy normalization processing is to obtain the cumulative average of each dimension of the features, and then to divide the average by the features of the current frame, thereby implementing the normalization operation.
And then splicing the Mel energy spectrum and the normalized Mel energy spectrum to obtain a Mel spectrum of the target voice.
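The framing, windowing, and magnitude-spectrum steps above can be sketched as follows — a minimal pure-Python illustration using a naive DFT. The mel filter bank, PCEN, and splicing stages are omitted, and the frame/hop sizes in the usage comment are assumptions, not values from the patent.

```python
import cmath
import math

def frames(signal, frame_len, hop):
    """Split a signal into overlapping frame-level segments."""
    n = 1 + (len(signal) - frame_len) // hop
    return [signal[i * hop:i * hop + frame_len] for i in range(n)]

def hanning(n):
    """Hanning window, applied to each frame before the STFT."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * k / (n - 1)) for k in range(n)]

def magnitude_spectrum(frame):
    """Magnitude of the DFT of one windowed frame (naive O(N^2) DFT;
    real implementations use an FFT). Returns the N//2 + 1 positive bins."""
    win = hanning(len(frame))
    windowed = [x * w for x, w in zip(frame, win)]
    n = len(windowed)
    return [abs(sum(windowed[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n)))
            for k in range(n // 2 + 1)]

# Usage with illustrative sizes: 16-sample signal, frame length 8, hop 4
spectra = [magnitude_spectrum(f) for f in frames([0.0] * 16, 8, 4)]
```

A mel energy spectrum would then be obtained by multiplying each magnitude spectrum with a triangular mel filter bank matrix.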
In some optional implementations of some embodiments, the execution subject may perform energy normalization on the mel energy spectrum according to the following formula: PCEN(t, f) = (E(t, f) / (ε + M(t, f))^α + δ)^r − δ^r, where PCEN(t, f) denotes the energy-normalized output of each band channel; t denotes the time index; f denotes the frequency-band index; ε denotes a small constant (as an example, ε may be 10^−8); α denotes the energy normalization coefficient; δ and r denote dynamic-range coefficients; E(t, f) denotes the mel energy spectrum; and M(t, f) denotes the smoothed energy spectrum, determined according to the following formula: M(t, f) = (1 − s) · M(t − 1, f) + s · E(t, f), where s denotes a smoothing coefficient.
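Per-channel energy normalization with this pair of formulas can be sketched as below. The default values of s, α, δ and r are typical choices from the PCEN literature, not values given in this patent, and initializing the smoother with the first frame is likewise an assumption.

```python
def pcen(E, s=0.5, alpha=0.98, delta=2.0, r=0.5, eps=1e-8):
    """Per-channel energy normalization of a mel energy spectrum.

    E is a list of frames, each a list of band energies E(t, f).
    The smoother follows M(t, f) = (1 - s) * M(t-1, f) + s * E(t, f),
    and each bin is mapped to (E / (eps + M)**alpha + delta)**r - delta**r.
    """
    out, M = [], list(E[0])  # assumption: initialize smoother with frame 0
    for frame in E:
        M = [(1 - s) * m + s * e for m, e in zip(M, frame)]
        out.append([(e / (eps + m) ** alpha + delta) ** r - delta ** r
                    for e, m in zip(frame, M)])
    return out
```

For a constant-energy input, M converges to E, so the output approaches (1 + δ)^r − δ^r, illustrating the automatic gain control effect.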
The preprocessing stage not only uses the mel spectrum of the speech but also adds trainable per-channel energy normalization of the mel spectrum, which performs automatic gain control and dynamic-range control on the mel spectrum and supplements more past information along the time axis. At the same time, because its parameters are trainable, it adapts better to the noise reduction neural network, which helps improve the overall speech noise reduction effect.
Step 203, inputting the Mel spectrum into a pre-trained characteristic neural network to obtain an amplitude mask gain coefficient of the Mel spectrum.
In some embodiments, the execution subject may input the mel spectrum to a pre-trained characteristic neural network to obtain the amplitude mask gain coefficient of the mel spectrum. Here, the characteristic neural network is generally used to characterize the correspondence between the mel spectrum and the amplitude mask gain coefficient.
As an example, the characteristic neural network may be a correspondence table of sample mel spectra and sample amplitude mask gain coefficients compiled by researchers based on a large amount of data; the execution subject may find a sample mel spectrum in the correspondence table that is identical or similar to the mel spectrum, and determine the sample amplitude mask gain coefficient corresponding to that sample mel spectrum as the amplitude mask gain coefficient.
In some optional implementations of some embodiments, the above-described signature neural network includes a convolution module, a self-attention module, a gated-cycle unit module, and/or a deconvolution module connected in sequence.
As an example, the above-described characteristic neural network may be a neural network composed of a convolution module 301, a self-attention module 302, a gated loop unit module 303, and a deconvolution module 304 as shown in fig. 3.
Specifically, as shown in fig. 4, the convolution module 301 is composed of 5 convolution units with the same structure but different parameters. Each convolution unit consists, in order, of a point convolution layer 401 (PConv), a batch normalization layer 402 (BatchNorm), a linear rectification function layer 403 (ReLU), a depthwise convolution layer 404 (DSConv), a batch normalization layer 405 (BatchNorm), and a linear rectification function layer 406 (ReLU). The output of the last ReLU of each convolution unit is sent both to the next layer of the network and to the corresponding deconvolution unit of deconvolution module 407 as part of its input. The point convolution layers of all convolution units have kernel size 1 and stride 1. The depthwise convolution layers of the five convolution units have kernel sizes 5, 3, 5, 3 and 5, and strides 2, 1, 2, 1 and 2, respectively.
The convolution module in the characteristic neural network extracts compressed mel spectrum features and reduces the dimensionality of the features. The five convolution units with the same structure but different parameters reduce dimensionality while limiting the loss of feature information. The convolution units of the invention use point convolution and depthwise convolution; compared with single-step standard convolution, this greatly reduces the parameter count and the amount of computation while achieving a similar effect.
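The parameter saving claimed here can be checked with simple arithmetic. The sketch below compares weight counts for a single-step 1-D standard convolution against the point-convolution-plus-depthwise pair used in each convolution unit; the channel counts in the example are illustrative assumptions, not values from the patent.

```python
def standard_conv_params(c_in, c_out, k):
    # Single-step standard 1-D convolution: k * c_in weights per
    # output channel (biases ignored for simplicity).
    return k * c_in * c_out

def separable_conv_params(c_in, c_out, k):
    # Point convolution (kernel size 1): c_in * c_out weights, followed
    # by a depthwise convolution: k weights per channel.
    return c_in * c_out + k * c_out

# Example: 64 -> 64 channels with a kernel of size 5 (illustrative numbers)
print(standard_conv_params(64, 64, 5))   # 20480
print(separable_conv_params(64, 64, 5))  # 4416
```

With these sizes the separable pair needs roughly 4.6 times fewer weights, consistent with the "greatly reduced" claim.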
As shown in fig. 5, the self-attention module 500 is composed of a multi-head self-attention layer 501 (MultiheadAttention), a layer normalization layer 502 (LayerNorm), a point convolution layer 503 (PConv), a batch normalization layer 504 (BatchNorm), and a linear rectification function layer 505 (ReLU). The output of the convolution module is first fed into the multi-head self-attention layer; after passing through the layer normalization layer, it is added to the output of the convolution module, then enters the point convolution layer, and is finally processed by the batch normalization layer and the linear rectification function layer before being sent to the next module.
The self-attention module is intended to integrate the mel spectrum features extracted by the convolution module. A self-attention network is used instead of a recurrent neural network because self-attention is better at capturing global information, whereas a recurrent neural network focuses more on local features and time-domain relations. The 8-head self-attention module used in the invention achieves a better noise reduction effect with fewer parameters than a recurrent neural network.
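The principle of the self-attention layer can be illustrated with a single-head scaled dot-product attention sketch (the patent uses 8 heads; head splitting and the output projection are omitted here):

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention over lists of d-dimensional rows:
    softmax(Q K^T / sqrt(d)) V, computed row by row."""
    d = len(K[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in K]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out
```

Each output frame is a weighted mixture of all value frames, which is why self-attention captures global (whole-utterance) context that a recurrent layer reaches only step by step.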
The gated recurrent unit module is composed of a gated recurrent unit layer (GRU), a point convolution layer (PConv), a batch normalization layer (BatchNorm), and a linear rectification function layer (ReLU). The output of the self-attention module is first fed into the gated recurrent unit layer, and then passes through the point convolution layer, the batch normalization layer, and the linear rectification function layer in sequence.
The gated recurrent unit processes the relations between successive speech frames and establishes the information flow along the time axis. A recurrent neural network is used here instead of a self-attention network because the self-attention network cannot model the sequential context of the time axis. As for the choice of recurrent network, using gated recurrent units rather than a long short-term memory network further reduces the computational load. Because of the real-time requirement, a unidirectional gated recurrent network must be used so as to form a causal system.
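The computational saving of a GRU over an LSTM mentioned here comes from having three gates instead of four. A rough weight-count comparison (ignoring implementation details such as duplicated bias vectors) might look like this; the layer sizes in the example are illustrative assumptions:

```python
def gru_params(input_size, hidden_size):
    # 3 gates (update, reset, candidate): each has input weights,
    # recurrent weights, and a bias vector.
    return 3 * (hidden_size * (input_size + hidden_size) + hidden_size)

def lstm_params(input_size, hidden_size):
    # 4 gates (input, forget, cell, output) of the same shape.
    return 4 * (hidden_size * (input_size + hidden_size) + hidden_size)

# Illustrative sizes, not from the patent:
print(gru_params(256, 256))   # 393984
print(lstm_params(256, 256))  # 525312
```

At any size the GRU needs 25% fewer weights than the LSTM, which is the kind of saving the text refers to.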
The deconvolution module 304 is composed of 5 deconvolution units with the same structure but different parameters. Each deconvolution unit consists, in order, of a point convolution layer (PConv), a batch normalization layer (BatchNorm), a linear rectification function layer (ReLU), a deconvolution layer (ConvT), a batch normalization layer (BatchNorm), and a linear rectification function layer (ReLU). The first point convolution layer of each deconvolution unit receives input both from the previous layer of the network and from the last linear rectification function layer of the corresponding convolution unit of the convolution module; the two parts are spliced to form its overall input. The point convolution layer of each deconvolution unit has kernel size 1 and stride 1. The deconvolution layers of the five deconvolution units have kernel sizes 5, 3, 5, 3 and 5, and strides 2, 1, 2, 1 and 2, respectively.
The deconvolution module restores the compressed, dimension-reduced mel spectrum to the size of the original mel spectrum to generate a mel spectrum mask. The mel spectrum mask is multiplied with the mel spectrum of the original noisy speech to obtain the noise-reduced mel spectrum. The deconvolution module adopts a structure symmetric to the convolution module, with 5 deconvolution units, and can be regarded as the inverse operation of the convolution module. Because each deconvolution unit receives not only the output of the previous network layer but also directly the output of the corresponding convolution unit, gradient changes are more stable during network training, making it easier to train an effective network model.
In some optional implementations of some embodiments, the executing entity may obtain a training sample set, where the training sample set includes a sample mel spectrum and a sample amplitude mask gain coefficient corresponding to the sample speech; inputting the sample Mel spectrum into a model to be trained to obtain an amplitude mask gain coefficient; comparing the amplitude mask gain coefficient with the sample amplitude mask gain coefficient according to a preset loss function to determine a loss value; and determining that the model to be trained is trained in response to the loss value meeting a preset condition, and determining the model to be trained as a characteristic neural network.
The loss function of the characteristic neural network accounts for the loss between the noise-reduced mel spectrum and the clean-speech mel spectrum, and uses a weighted multi-scale loss summation that covers a wider range of slice lengths and gives longer slices more weight. Because speech is only short-time stationary and the duration of each phone varies, the weighted multi-scale loss, compared with a single loss function, improves the noise reduction effect after network training and reduces speech distortion.
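The patent does not give the loss formula; one plausible reading of "weighted multi-scale loss summation with longer slices weighted more" is sketched below, where the slice lengths, the L1 base loss, and the length-proportional weights are all assumptions:

```python
def l1(a, b):
    """Mean absolute error between two equal-length frames."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def multi_scale_loss(denoised, clean, scales=(1, 2, 4)):
    """Weighted sum of per-slice L1 losses over several slice lengths,
    with each slice weighted by its length (assumed weighting scheme)."""
    total, weight_sum = 0.0, 0.0
    for scale in scales:
        for start in range(0, len(denoised) - scale + 1, scale):
            sl_d = denoised[start:start + scale]
            sl_c = clean[start:start + scale]
            # average the per-frame L1 loss within the slice
            slice_loss = sum(l1(d, c) for d, c in zip(sl_d, sl_c)) / scale
            total += scale * slice_loss
            weight_sum += scale
    return total / weight_sum
```

Longer slices thus contribute more to the gradient, matching the stated intent of giving long slices more weight.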
In some optional implementations of some embodiments, the executing body may further adjust a relevant parameter in the model to be trained in response to the loss value not satisfying a preset condition.
Step 204, performing amplitude masking on the mel spectrum according to the amplitude mask gain coefficient to obtain a noise-reduced mel spectrum.
In some embodiments, the execution subject may perform amplitude masking on the mel spectrum according to the amplitude mask gain coefficient to obtain a noise-reduced mel spectrum. Specifically, amplitude masking generally refers to the execution subject multiplying the amplitude mask gain coefficient with the mel spectrum to obtain the noise-reduced mel spectrum.
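Amplitude masking as described here is an element-wise product of the mask gain coefficients and the mel spectrum, e.g.:

```python
def amplitude_mask(mel, gains):
    """Element-wise product of a mel spectrum (frames x bands) with
    amplitude mask gain coefficients of the same shape."""
    return [[m * g for m, g in zip(mel_row, gain_row)]
            for mel_row, gain_row in zip(mel, gains)]

# A gain near 0 suppresses a noisy bin; a gain near 1 keeps a clean bin.
print(amplitude_mask([[2.0, 4.0]], [[0.5, 1.0]]))  # [[1.0, 4.0]]
```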
Step 205, inputting the noise reduction Mel spectrum into a pre-trained neural network vocoder to obtain noise reduction voice.
In some embodiments, the executive body may input the noise-reduced mel spectrum to a pre-trained neural network vocoder to obtain noise-reduced speech. Here, the neural network vocoder is generally used to characterize the correspondence between the noise-reduced mel spectrum and the noise-reduced voice.
As an example, the neural network vocoder may be a correspondence table of sample noise-reduced mel spectra and sample noise-reduced speech compiled by researchers based on a large amount of data; the execution subject may find a sample noise-reduced mel spectrum in the correspondence table that is identical or similar to the noise-reduced mel spectrum, and determine the sample noise-reduced speech corresponding to that sample noise-reduced mel spectrum as the noise-reduced speech.
As yet another example, the neural network vocoder may be based on the Multi-band MelGAN technique, composed of a generator and a discriminator. The generator is used to produce speech and mainly comprises a convolution module, an up-sampling module, a residual block module, and an activation function module; the discriminator is used to judge the authenticity of the speech and comprises a plurality of full-band and sub-band discriminator modules. During training, the generator and the discriminator are trained adversarially in alternating cycles.
Compared with a traditional vocoder, the neural network vocoder generates speech from a mel spectrum, with the phase information completed automatically by the learned network, so the quality of the generated speech is greatly improved. Because the noise reduction method of the invention needs to run in real time, even though some large-scale neural network vocoders may perform better, their generation speed is extremely slow and can hardly meet the real-time requirement. The neural network vocoder based on the Multi-band MelGAN technique preserves the quality of the generated speech while greatly improving the generation speed, so the overall speed is superior to that of end-to-end approaches.
The invention provides a noise reduction method based on deep learning that does not use the widely applied end-to-end noise reduction approach, but instead proposes a two-stage noise reduction method. Compared with end-to-end networks that directly process the frequency-domain spectrum to generate a spectral mask, the first-stage noise reduction neural network takes the mel spectrum as input and finally generates a mel spectrum mask, which reduces the amount of computation by a factor of 4 to 8; moreover, the mel spectrum better matches human auditory perception, improving the perceived quality after noise reduction. To restore the mel spectrum to high-quality time-domain speech, the invention uses a neural network vocoder in the second stage. The neural network vocoder based on the Multi-band MelGAN technique not only generates high-quality speech but also greatly improves the generation speed compared with large-scale neural network vocoders, requiring only 3% of the speech duration when running on a CPU, which fully meets the real-time requirement of voice-call noise reduction.
The neural network as a whole is lightweight, using few network parameters and an extremely small amount of computation, making it easy to deploy on a PC or port to terminals such as embedded devices. It achieves the noise reduction effect of a large-scale network while offering real-time performance that large-scale networks cannot match: low hardware resource occupation with a good noise reduction effect.
One of the above various embodiments of the present disclosure has the following beneficial effects: a target speech is acquired; the target speech is preprocessed to obtain its Mel spectrum; the Mel spectrum is input to a pre-trained feature neural network to obtain an amplitude mask gain coefficient of the Mel spectrum; amplitude masking is performed on the Mel spectrum according to the gain coefficient to obtain a noise-reduced Mel spectrum; and the noise-reduced Mel spectrum is input to a pre-trained neural network vocoder to obtain noise-reduced speech. Through this processing of the target speech, interference from background noise and other voices is removed from the resulting speech, a lighter-weight noise reduction is achieved, and a better noise reduction effect is obtained.
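The amplitude-masking step described above can be illustrated with a minimal numpy sketch that applies a predicted gain mask element-wise to a Mel spectrum. The function name, array shapes, and toy values here are ours for illustration, not from the patent:

```python
import numpy as np

def apply_amplitude_mask(mel_spec, gain_mask):
    """Apply a per-bin gain mask to a Mel magnitude spectrum.

    mel_spec:  (n_mels, n_frames) non-negative Mel amplitudes
    gain_mask: (n_mels, n_frames) gains predicted by the first-stage
               network, expected in [0, 1] (1 = keep, 0 = suppress)
    """
    gain_mask = np.clip(gain_mask, 0.0, 1.0)  # keep gains in a valid range
    return mel_spec * gain_mask

# toy example: 4 Mel bands, 3 frames, uniform 0.5 gain
mel = np.array([[1.0, 2.0, 3.0],
                [4.0, 5.0, 6.0],
                [7.0, 8.0, 9.0],
                [1.0, 1.0, 1.0]])
mask = np.full_like(mel, 0.5)
denoised = apply_amplitude_mask(mel, mask)
```

A gain near 1 keeps a time-frequency bin, a gain near 0 suppresses it; the masked Mel spectrum is what the second-stage vocoder would receive as input.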
With further reference to fig. 6, as an implementation of the methods shown in the above figures, the present disclosure provides some embodiments of a speech noise reduction apparatus. These apparatus embodiments correspond to the method embodiments shown in fig. 2, and the apparatus may be applied in various electronic devices.
As shown in fig. 6, the speech noise reduction apparatus 600 of some embodiments includes: an acquisition unit 601, a preprocessing unit 602, a feature unit 603, an amplitude masking unit 604, and a generation unit 605. The acquisition unit 601 is configured to acquire a target speech; the preprocessing unit 602 is configured to preprocess the target speech to obtain a Mel spectrum of the target speech; the feature unit 603 is configured to input the Mel spectrum to a pre-trained feature neural network to obtain an amplitude mask gain coefficient of the Mel spectrum; the amplitude masking unit 604 is configured to perform amplitude masking on the Mel spectrum according to the amplitude mask gain coefficient to obtain a noise-reduced Mel spectrum; and the generation unit 605 is configured to input the noise-reduced Mel spectrum to a pre-trained neural network vocoder to obtain noise-reduced speech.
In an alternative implementation of some embodiments, the pre-processing unit is further configured to: perform a framing operation on the target speech to obtain frame-level speech of the target speech; perform windowing and a short-time Fourier transform on the frame-level speech to obtain the magnitude spectrum of the frame-level speech spectrum; process the magnitude spectrum with a Mel filter bank to obtain a Mel energy spectrum; perform energy normalization on the Mel energy spectrum to obtain an energy-normalized Mel energy spectrum; and concatenate the Mel energy spectrum with the normalized Mel energy spectrum to obtain the Mel spectrum of the target speech.
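The preprocessing chain above (framing, windowing plus STFT, Mel filtering) can be sketched in numpy as follows. The frame length, hop size, sample rate, and the simplified triangular filterbank construction are illustrative assumptions of ours, not parameters fixed by the patent:

```python
import numpy as np

def frame_signal(x, frame_len=400, hop=160):
    """Split a 1-D signal into overlapping frames (frame-level speech)."""
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]

def magnitude_spectrum(frames):
    """Hann window + short-time Fourier transform -> magnitude spectrum."""
    win = np.hanning(frames.shape[1])
    return np.abs(np.fft.rfft(frames * win, axis=1))

def mel_filterbank(n_mels=40, n_fft_bins=201, sr=16000):
    """Simplified triangular Mel filterbank (assumed layout)."""
    def hz_to_mel(f):
        return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    bin_pts = np.floor((n_fft_bins - 1) * mel_to_hz(mel_pts) / (sr / 2)).astype(int)
    fb = np.zeros((n_mels, n_fft_bins))
    for i in range(n_mels):
        l, c, r = bin_pts[i], bin_pts[i + 1], bin_pts[i + 2]
        if c > l:  # rising edge of the triangle
            fb[i, l:c] = (np.arange(l, c) - l) / (c - l)
        if r > c:  # falling edge of the triangle
            fb[i, c:r] = (r - np.arange(c, r)) / (r - c)
    return fb

# toy pipeline on 0.1 s of 16 kHz noise
x = np.random.default_rng(0).standard_normal(1600)
frames = frame_signal(x)               # framing operation
mag = magnitude_spectrum(frames)       # windowing + STFT magnitude
mel_energy = mag @ mel_filterbank().T  # Mel filtering -> Mel energy spectrum
```

In the patented method this Mel energy spectrum would then be energy-normalized and concatenated with its normalized version to form the network input.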
In an alternative implementation of some embodiments, the pre-processing unit is further configured to: perform the energy normalization on the Mel energy spectrum according to the following formula: PCEN(t, f) = (E(t, f)/(M(t, f) + ε)^α + δ)^r − δ^r, where PCEN(t, f) represents the energy-normalized value of each band channel; t represents the time index; f represents the frequency index; ε represents a small constant; α represents the energy normalization coefficient; δ and r represent dynamic-range coefficients; E(t, f) represents the Mel energy spectrum; and M(t, f) represents the smoothed energy spectrum, determined according to the following formula: M(t, f) = (1 − s)·M(t − 1, f) + s·E(t, f), where s denotes a smoothing coefficient.
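The per-channel energy normalization (PCEN) formula above can be transcribed directly into numpy. The coefficient defaults below are assumptions drawn from the PCEN literature, since the patent does not fix their values, and the initialization of the smoother with the first frame is also our assumption:

```python
import numpy as np

def pcen(E, s=0.025, eps=1e-6, alpha=0.98, delta=2.0, r=0.5):
    """Per-channel energy normalization of a Mel energy spectrum.

    E: (n_frames, n_mels) Mel energy spectrum.
    Computes PCEN(t, f) = (E / (M + eps)^alpha + delta)^r - delta^r,
    where M is the first-order smoothed energy:
    M(t, f) = (1 - s) * M(t - 1, f) + s * E(t, f).
    """
    M = np.zeros_like(E)
    M[0] = E[0]  # assumed initialization: seed the smoother with frame 0
    for t in range(1, E.shape[0]):
        M[t] = (1.0 - s) * M[t - 1] + s * E[t]
    return (E / (M + eps) ** alpha + delta) ** r - delta ** r

# toy constant Mel energy spectrum: 5 frames, 3 bands
E = np.ones((5, 3))
normalized = pcen(E)
```

Because the smoothed energy M adapts per band, PCEN acts as a per-channel automatic gain control, which makes the network input less sensitive to absolute loudness.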
In an alternative implementation of some embodiments, the above-described feature neural network includes a convolution module, a self-attention module, a gated recurrent unit (GRU) module, and/or a deconvolution module connected in sequence.
In an alternative implementation of some embodiments, the above-mentioned feature neural network is trained according to the following steps: acquiring a training sample set, wherein the training sample set includes sample Mel spectra and the sample amplitude mask gain coefficients corresponding to the sample speech; inputting a sample Mel spectrum into the model to be trained to obtain an amplitude mask gain coefficient; comparing the amplitude mask gain coefficient with the sample amplitude mask gain coefficient according to a preset loss function to determine a loss value; and, in response to the loss value meeting a preset condition, determining that the model to be trained is trained and taking it as the feature neural network.
In an optional implementation of some embodiments, the above speech noise reduction apparatus further includes an adjusting unit configured to: in response to the loss value not meeting the preset condition, adjust the relevant parameters in the model to be trained.
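The training procedure of the two preceding paragraphs — predict a mask, compare it with the sample mask under a preset loss, stop when the loss meets a preset condition, otherwise adjust the parameters — can be illustrated with a deliberately tiny stand-in model: a linear map trained by gradient descent on a mean-squared-error loss. The actual feature neural network, its loss function, and its optimizer are not specified to this level in the patent; everything below is a toy sketch:

```python
import numpy as np

rng = np.random.default_rng(1)

# toy "training set": sample Mel-spectrum frames and target mask gains
X = rng.random((32, 8))            # 32 sample frames, 8 Mel bands
W_true = rng.random((8, 8)) * 0.1  # hidden mapping used to fabricate targets
Y = X @ W_true                     # sample amplitude-mask gain coefficients

W = np.zeros((8, 8))               # model to be trained (a linear stand-in)
lr, threshold = 0.05, 1e-3         # preset condition: loss below threshold

for step in range(5000):
    pred = X @ W                             # predicted mask gain coefficients
    loss = np.mean((pred - Y) ** 2)          # preset loss function (MSE)
    if loss < threshold:                     # loss meets the preset condition:
        break                                # training is considered done
    grad = 2.0 * X.T @ (pred - Y) / len(X)   # otherwise adjust the parameters
    W -= lr * grad
```

The `if loss < threshold: break` branch plays the role of "determining that the model is trained", and the gradient step plays the role of the adjusting unit.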
It will be understood that the elements described in the apparatus 600 correspond to various steps in the method described with reference to fig. 2. Thus, the operations, features and advantages described above for the method are also applicable to the apparatus 600 and the units included therein, and are not described herein again.
Referring now to fig. 7, a schematic diagram of an electronic device (e.g., the server of fig. 1) 700 suitable for use in implementing some embodiments of the present disclosure is shown. The electronic device shown in fig. 7 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.
As shown in fig. 7, electronic device 700 may include a processing means (e.g., central processing unit, graphics processor, etc.) 701 that may perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM) 702 or a program loaded from a storage means 708 into a Random Access Memory (RAM) 703. In the RAM 703, various programs and data necessary for the operation of the electronic device 700 are also stored. The processing device 701, the ROM 702, and the RAM 703 are connected to each other by a bus 704. An input/output (I/O) interface 705 is also connected to bus 704.
Generally, the following devices may be connected to the I/O interface 705: input devices 706 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, or the like; an output device 707 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; and a communication device 709. The communication means 709 may allow the electronic device 700 to communicate with other devices, wireless or wired, to exchange data. While fig. 7 illustrates an electronic device 700 having various means, it is to be understood that not all illustrated means are required to be implemented or provided. More or fewer devices may be alternatively implemented or provided. Each block shown in fig. 7 may represent one device or may represent multiple devices as desired.
In particular, according to some embodiments of the present disclosure, the processes described above with reference to the flow diagrams may be implemented as computer software programs. For example, some embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer-readable medium, the computer program comprising program code for performing the method illustrated by the flow chart. In some such embodiments, the computer program may be downloaded and installed from a network via communications means 709, or may be installed from storage 708, or may be installed from ROM 702. The computer program, when executed by the processing device 701, performs the above-described functions defined in the methods of some embodiments of the present disclosure.
It should be noted that the computer readable medium described above in some embodiments of the present disclosure may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In some embodiments of the disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In some embodiments of the present disclosure, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. 
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.
In some embodiments, the clients and servers may communicate using any currently known or future developed network protocol, such as HTTP (HyperText Transfer Protocol), and may interconnect with any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), an internetwork (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed network.
The computer readable medium may be embodied in the electronic device; or may exist separately without being assembled into the electronic device. The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: acquiring target voice; preprocessing the target voice to obtain a Mel spectrum of the target voice; inputting the Mel spectrum into a pre-trained characteristic neural network to obtain an amplitude mask gain coefficient of the Mel spectrum; carrying out amplitude masking on the Mel spectrum according to the gain coefficient of the amplitude masking to obtain a noise-reduced Mel spectrum; and inputting the noise reduction Mel spectrum into a pre-trained neural network vocoder to obtain noise reduction voice.
Computer program code for carrying out operations for embodiments of the present disclosure may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, or C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in some embodiments of the present disclosure may be implemented by software or hardware. The described units may also be provided in a processor, and may be described as: a processor includes an acquisition unit, a preprocessing unit, a feature unit, an amplitude mask unit, and a generation unit. The names of these units do not in some cases constitute a limitation on the unit itself, and for example, a receiving unit may also be described as a "unit that acquires target speech".
The functions described herein above may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), system on a chip (SOCs), Complex Programmable Logic Devices (CPLDs), and the like.
The foregoing description is merely a description of preferred embodiments of the present disclosure and of the technical principles employed. Those skilled in the art will appreciate that the scope of the invention in the embodiments of the present disclosure is not limited to technical solutions formed by the specific combinations of the above technical features, and also covers other technical solutions formed by arbitrary combinations of the above technical features or their equivalents without departing from the inventive concept, for example, technical solutions formed by replacing the above features with (but not limited to) technical features having similar functions disclosed in the embodiments of the present disclosure.

Claims (10)

1. A method for speech noise reduction, comprising:
acquiring a target voice;
preprocessing the target voice to obtain a Mel spectrum of the target voice;
inputting the Mel spectrum to a pre-trained feature neural network to obtain an amplitude mask gain coefficient of the Mel spectrum;
carrying out amplitude masking on the Mel spectrum according to the amplitude masking gain coefficient to obtain a noise-reduction Mel spectrum;
and inputting the noise reduction Mel spectrum into a pre-trained neural network vocoder to obtain noise reduction voice.
2. The method of claim 1, wherein the pre-processing the target speech to obtain a mel spectrum of the target speech comprises:
performing framing operation on the target voice to obtain frame-level voice of the target voice;
windowing and short-time Fourier transform are carried out on the frame-level voice to obtain a magnitude spectrum of a frame-level voice frequency spectrum;
processing the amplitude spectrum by using a Mel filter to obtain a Mel energy spectrum;
carrying out energy normalization processing on the Mel energy spectrum to obtain an energy normalized Mel energy spectrum;
and concatenating the Mel energy spectrum and the normalized Mel energy spectrum to obtain a Mel spectrum of the target voice.
3. The method of claim 2, wherein said energy normalizing said mel-energy spectrum to obtain an energy normalized mel-energy spectrum comprises:
performing energy normalization processing on the Mel energy spectrum according to the following formula:
PCEN(t, f) = (E(t, f)/(M(t, f) + ε)^α + δ)^r − δ^r,
wherein PCEN (t, f) represents energy normalization for each band channel;
t represents a time domain sequence;
f represents a frequency domain sequence;
ε represents a constant;
alpha represents an energy normalization coefficient;
δ and r represent dynamic range coefficients;
e (t, f) represents a Mel energy spectrum;
m (t, f) represents a smoothed energy spectrum, determined according to the following formula:
M(t,f)=(1-s)·M(t-1,f)+s·E(t,f),
where s represents a smoothing coefficient.
4. The method of claim 1, wherein the feature neural network comprises sequentially connected convolution modules, self-attention modules, gated recurrent unit (GRU) modules, and/or deconvolution modules.
5. The method according to claim 1 or 4, wherein the feature neural network is trained according to the following steps:
acquiring a training sample set, wherein the training sample set comprises a sample Mel spectrum and a sample amplitude mask gain coefficient corresponding to the sample voice;
inputting the sample Mel spectrum to a model to be trained to obtain an amplitude mask gain coefficient;
comparing the amplitude mask gain coefficient with the sample amplitude mask gain coefficient according to a preset loss function to determine a loss value;
and determining that the model to be trained is trained in response to the loss value meeting a preset condition, and determining the model to be trained as a characteristic neural network.
6. The method of claim 5, wherein the method further comprises:
and adjusting relevant parameters in the model to be trained in response to the loss value not meeting a preset condition.
7. An apparatus for speech noise reduction, comprising:
an acquisition unit configured to acquire a target voice;
the preprocessing unit is configured to preprocess the target voice to obtain a Mel spectrum of the target voice;
a feature unit configured to input the Mel spectrum to a pre-trained feature neural network, resulting in an amplitude mask gain coefficient of the Mel spectrum;
the amplitude mask unit is configured to carry out amplitude mask on the Mel spectrum according to the amplitude mask gain coefficient to obtain a noise reduction Mel spectrum;
a generating unit configured to input the noise-reduced Mel spectrum to a pre-trained neural network vocoder to obtain noise-reduced speech.
8. The apparatus of claim 7, wherein the feature neural network comprises a sequentially connected convolution module, self-attention module, gated recurrent unit (GRU) module, and/or deconvolution module.
9. An electronic device, comprising:
one or more processors;
a storage device having one or more programs stored thereon,
when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-6.
10. A computer-readable medium, on which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1-6.
CN202210490037.1A 2022-05-07 2022-05-07 Method, apparatus, electronic device and computer readable medium for voice noise reduction Pending CN114783455A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210490037.1A CN114783455A (en) 2022-05-07 2022-05-07 Method, apparatus, electronic device and computer readable medium for voice noise reduction

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210490037.1A CN114783455A (en) 2022-05-07 2022-05-07 Method, apparatus, electronic device and computer readable medium for voice noise reduction

Publications (1)

Publication Number Publication Date
CN114783455A true CN114783455A (en) 2022-07-22

Family

ID=82435854

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210490037.1A Pending CN114783455A (en) 2022-05-07 2022-05-07 Method, apparatus, electronic device and computer readable medium for voice noise reduction

Country Status (1)

Country Link
CN (1) CN114783455A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114898766A (en) * 2022-07-12 2022-08-12 四川高速公路建设开发集团有限公司 Distributed optical fiber voice enhancement method based on GAN network and tunnel rescue system


Similar Documents

Publication Publication Date Title
CN110600017B (en) Training method of voice processing model, voice recognition method, system and device
CN108564963B (en) Method and apparatus for enhancing voice
WO2021196905A1 (en) Voice signal dereverberation processing method and apparatus, computer device and storage medium
WO2018223727A1 (en) Voiceprint recognition method, apparatus and device, and medium
US20240038252A1 (en) Sound signal processing method and apparatus, and electronic device
CN111462728A (en) Method, apparatus, electronic device and computer readable medium for generating speech
CN114898762A (en) Real-time voice noise reduction method and device based on target person and electronic equipment
US11776520B2 (en) Hybrid noise suppression for communication systems
CN112259116B (en) Noise reduction method and device for audio data, electronic equipment and storage medium
CN111462727A (en) Method, apparatus, electronic device and computer readable medium for generating speech
CN115602165A (en) Digital staff intelligent system based on financial system
WO2022166710A1 (en) Speech enhancement method and apparatus, device, and storage medium
CN114783455A (en) Method, apparatus, electronic device and computer readable medium for voice noise reduction
CN116913258B (en) Speech signal recognition method, device, electronic equipment and computer readable medium
CN114333893A (en) Voice processing method and device, electronic equipment and readable medium
CN116403594B (en) Speech enhancement method and device based on noise update factor
WO2024027295A1 (en) Speech enhancement model training method and apparatus, enhancement method, electronic device, storage medium, and program product
WO2022213825A1 (en) Neural network-based end-to-end speech enhancement method and apparatus
CN113823312B (en) Speech enhancement model generation method and device, and speech enhancement method and device
JP2024502287A (en) Speech enhancement method, speech enhancement device, electronic device, and computer program
CN114333891A (en) Voice processing method and device, electronic equipment and readable medium
CN111784567A (en) Method, apparatus, electronic device, and computer-readable medium for converting an image
Su et al. Learning an adversarial network for speech enhancement under extremely low signal-to-noise ratio condition
Srinivasarao An efficient recurrent Rats function network (Rrfn) based speech enhancement through noise reduction
WO2024055751A1 (en) Audio data processing method and apparatus, device, storage medium, and program product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination