CN115662461A - Noise reduction model training method, device and equipment - Google Patents

Noise reduction model training method, device and equipment

Info

Publication number
CN115662461A
CN115662461A (application CN202211301496.7A)
Authority
CN
China
Prior art keywords
sample
frequency domain
domain
frequency
speech signal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211301496.7A
Other languages
Chinese (zh)
Inventor
梁龙腾
李伟南
黄传辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Xiaodu Technology Co Ltd
Original Assignee
Shanghai Xiaodu Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Xiaodu Technology Co Ltd filed Critical Shanghai Xiaodu Technology Co Ltd
Priority to CN202211301496.7A priority Critical patent/CN115662461A/en
Publication of CN115662461A publication Critical patent/CN115662461A/en
Pending legal-status Critical Current

Landscapes

  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The disclosure provides a noise reduction model training method, device and equipment, relating to the technical field of artificial intelligence, and in particular to the technical fields of natural language processing and deep learning. One embodiment of the method comprises: acquiring a sample noisy speech signal; extracting sample frequency domain characteristics of the sample noisy speech signal; calculating a sample target frequency band gain and a sample target time domain energy ratio of the sample noisy speech signal; and taking the sample frequency domain characteristics as input, taking the sample target frequency band gain and the sample target time domain energy ratio as supervision, and training a network to obtain a noise reduction model. The noise reduction model trained by this embodiment has better noise reduction performance.

Description

Noise reduction model training method, device and equipment
Technical Field
The present disclosure relates to the field of artificial intelligence technology, and more particularly to the field of natural language processing and deep learning.
Background
Real-time voice communication brings great convenience to people's life and work. However, as technology and daily life continue to develop, the quality requirements for voice calls keep rising. In the face of various noisy product application environments, speech noise reduction technology plays an irreplaceable role: it can suppress background noise in speech signals, improve the clarity and intelligibility of speech, and thereby improve the quality and efficiency of voice interaction. In real-time calls, noise suppression capability and real-time operation capability play an even more important role.
At present, speech noise reduction is usually carried out with classical noise reduction algorithms such as spectral subtraction and adaptive-filter noise reduction.
Disclosure of Invention
The embodiments of the present disclosure provide a noise reduction model training method, apparatus, device, storage medium and program product.
In a first aspect, an embodiment of the present disclosure provides a noise reduction model training method, including: acquiring a sample noisy speech signal; extracting sample frequency domain characteristics of the sample noisy speech signal; calculating a sample target frequency band gain and a sample target time domain energy ratio of the sample noisy speech signal; and taking the sample frequency domain characteristics as input, taking the sample target frequency band gain and the sample target time domain energy ratio as supervision, and training a network to obtain a noise reduction model.
In a second aspect, an embodiment of the present disclosure provides a speech noise reduction method, including: acquiring a voice signal with noise; extracting frequency domain characteristics of a voice signal with noise; inputting the frequency domain characteristics into a noise reduction model to obtain M frequency band gains, wherein M is a positive integer, and the noise reduction model is obtained by training by adopting the method of the first aspect; and carrying out noise reduction based on the M frequency band gains to obtain a clean voice signal.
In a third aspect, an embodiment of the present disclosure provides a noise reduction model training apparatus, including: an acquisition module configured to acquire a sample noisy speech signal; an extraction module configured to extract sample frequency domain features of the sample noisy speech signal; a calculation module configured to calculate a sample target frequency band gain and a sample target time domain energy ratio of the sample noisy speech signal; and a training module configured to take the sample frequency domain features as input, take the sample target frequency band gain and the sample target time domain energy ratio as supervision, and train a network to obtain a noise reduction model.
In a fourth aspect, an embodiment of the present disclosure provides a speech noise reduction apparatus, including: an acquisition module configured to acquire a noisy speech signal; an extraction module configured to extract frequency domain features of a noisy speech signal; an input module configured to input the frequency domain features into a noise reduction model, resulting in M band gains, where M is a positive integer, and the noise reduction model is obtained by training using the apparatus of the third aspect; and the noise reduction module is configured to reduce noise based on the M frequency band gains to obtain a clean voice signal.
In a fifth aspect, an embodiment of the present disclosure provides an electronic device, including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method as described in any one of the implementations of the first aspect or the method as described in any one of the implementations of the second aspect.
In a sixth aspect, the disclosed embodiments propose a non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method as described in any one of the implementations of the first aspect or the method as described in any one of the implementations of the second aspect.
In a seventh aspect, the present disclosure provides a computer program product, which includes a computer program, and when executed by a processor, the computer program implements the method described in any of the implementation manners of the first aspect or the method described in any of the implementation manners of the second aspect.
According to the noise reduction model training method provided by the embodiment of the disclosure, the trained noise reduction model has good noise reduction performance and real-time performance, and has good suppression capability for both stationary noise and non-stationary noise.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
Other features, objects, and advantages of the disclosure will become apparent from a reading of the following detailed description of non-limiting embodiments which proceeds with reference to the accompanying drawings. The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a flow diagram of one embodiment of a noise reduction model training method according to the present disclosure;
FIG. 2 is a flow diagram of yet another embodiment of a noise reduction model training method according to the present disclosure;
FIG. 3 is a schematic diagram of training features in the noise reduction model training method of FIG. 2;
FIG. 4 is a network structure diagram of a noise reduction model in the noise reduction model training method of FIG. 2;
FIG. 5 is a flow diagram of one embodiment of a method of speech noise reduction according to the present disclosure;
FIG. 6 is a flow diagram of yet another embodiment of a speech noise reduction method according to the present disclosure;
FIG. 7 is a block flow diagram of the speech noise reduction method of FIG. 6;
FIG. 8 is a schematic illustration of a noise reduction effect;
FIG. 9 is a schematic illustration of yet another noise reduction effect;
FIG. 10 is a scene diagram of a noise reduction model training method and a speech noise reduction method that can implement embodiments of the present disclosure;
FIG. 11 is a schematic diagram of an embodiment of a noise reduction model training apparatus according to the present disclosure;
FIG. 12 is a schematic block diagram of one embodiment of a speech noise reduction apparatus according to the present disclosure;
FIG. 13 is a block diagram of an electronic device for implementing a noise reduction model training method and a speech noise reduction method according to an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
It should be noted that, in the present disclosure, the embodiments and features of the embodiments may be combined with each other without conflict. The present disclosure will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
FIG. 1 illustrates a flow 100 of one embodiment of a noise reduction model training method according to the present disclosure. The noise reduction model training method comprises the following steps:
step 101, obtaining a sample noisy speech signal.
In this embodiment, the executing subject of the noise reduction model training method may obtain a sample noisy speech signal. The sample noisy speech signal is a speech signal containing noise; it may be a speech signal collected in a noisy environment, or it may be obtained by fusing a clean speech signal with a noise signal.
Step 102, extracting the sample frequency domain characteristics of the sample noisy speech signal.
In this embodiment, the execution subject may extract sample frequency domain features of the sample noisy speech signal. The sample frequency domain features may include, but are not limited to: ERB (Equivalent Rectangular Bandwidth) cepstral coefficients, the first- and second-order differences of the ERB cepstral coefficients, the DCT (Discrete Cosine Transform) coefficients of the speech pitch correlation, the pitch period, and the like.
Here, the sample noisy speech signal is usually a time domain signal, which needs to be converted to the frequency domain in order to extract the frequency domain features. Specifically, converting a sample voice signal with noise into a frequency domain to obtain a sample voice signal with noise in the frequency domain; and extracting sample frequency domain characteristics from the sample frequency domain noisy speech signal.
Step 103, calculating a sample target frequency band gain and a sample target time domain energy ratio of the sample noisy speech signal.
In this embodiment, the execution body may calculate a sample target frequency band gain and a sample target time-domain energy ratio of the sample noisy speech signal. Wherein the sample target frequency band gain and the sample target time domain energy ratio may be calculated based on the sample noisy speech signal and its corresponding sample clean speech signal.
Step 104, taking the sample frequency domain characteristics as input, taking the sample target frequency band gain and the sample target time domain energy ratio as supervision, and training the network to obtain a noise reduction model.
In this embodiment, the executing entity may use the sample frequency domain characteristics as input, use the sample target frequency band gain and the sample target time domain energy ratio as supervision, and train the network to obtain the noise reduction model.
In general, the sample frequency domain features are input into the network, and the sample prediction band gain and the sample prediction time-domain energy ratio can be learned. And adjusting parameters of the network based on the difference between the sample predicted frequency band gain and the sample target frequency band gain and the difference between the sample predicted time domain energy ratio and the sample target time domain energy ratio, so that the two differences are small enough to obtain the noise reduction model.
According to the noise reduction model training method provided by the embodiment of the disclosure, the trained noise reduction model has good noise reduction performance and real-time performance, and has good suppression capability for both stationary noise and non-stationary noise.
With continued reference to FIG. 2, a flow 200 of yet another embodiment of a noise reduction model training method according to the present disclosure is illustrated. The noise reduction model training method comprises the following steps:
step 201, a sample clean speech signal and a sample noise signal are obtained.
In this embodiment, the executing subject of the noise reduction model training method may obtain a sample clean speech signal and a sample noise signal. Wherein the sample clean speech signal may be a speech signal collected in a quiet environment. The sample noise signal may be a noise signal acquired in a noisy environment.
Step 202, the sample clean speech signal is fused with the sample noise signal to obtain a sample noisy speech signal.
In this embodiment, the execution subject may fuse the sample clean speech signal and the sample noise signal to obtain a sample noisy speech signal.
Here, the samples are generated by fusion of the clean speech signal and the noise signal, thereby making the acquisition of the samples easier.
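As an illustration of step 202, here is a minimal numpy sketch of fusing a clean signal with a noise signal. Mixing at a chosen signal-to-noise ratio is an assumption added for illustration; the disclosure does not specify how the two signals are fused.

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Fuse a clean speech signal with a noise signal at a target SNR (in dB).

    `clean` and `noise` are 1-D float arrays at the same sample rate; the
    target-SNR mixing strategy is an assumption for illustration only.
    """
    noise = np.resize(noise, clean.shape)          # repeat/trim noise to match length
    p_clean = np.mean(clean ** 2) + 1e-12          # average power of clean speech
    p_noise = np.mean(noise ** 2) + 1e-12          # average power of noise
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10.0)))
    return clean + scale * noise                   # sample noisy speech signal

# Example: build one training sample at 5 dB SNR
# noisy = mix_at_snr(clean_wav, noise_wav, snr_db=5.0)
```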
Step 203, converting the sample voice signal with noise into a frequency domain to obtain a sample voice signal with noise in the frequency domain.
In this embodiment, the execution subject may convert the sample noise-containing speech signal into a frequency domain, so as to obtain a sample frequency domain noise-containing speech signal.
Typically, the sample noisy speech signal is a time domain signal. A Hamming window is first applied to the sample noisy speech signal, and a Fourier transform is then performed on the windowed signal, so that the sample frequency domain noisy speech signal can be obtained.
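A minimal numpy sketch of this step, framing the signal, applying a Hamming window, and taking the Fourier transform of each frame. The frame length and hop size below are illustrative assumptions, not values given by the disclosure.

```python
import numpy as np

def to_frequency_domain(x, frame_len=480, hop=240):
    """Convert a time-domain signal to per-frame spectra (Hamming window + FFT).

    frame_len/hop are illustrative choices (e.g. 30 ms / 15 ms at 16 kHz).
    Returns an array of shape (num_frames, frame_len // 2 + 1).
    """
    window = np.hamming(frame_len)
    num_frames = 1 + max(0, (len(x) - frame_len) // hop)
    spectra = np.empty((num_frames, frame_len // 2 + 1), dtype=np.complex128)
    for i in range(num_frames):
        frame = x[i * hop: i * hop + frame_len] * window
        spectra[i] = np.fft.rfft(frame)            # one-sided Fourier transform
    return spectra
```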
Step 204, converting the sample frequency domain noisy speech signal into the equivalent rectangular bandwidth (ERB) domain to obtain sample ERB domain frequency points.
In this embodiment, the execution subject may convert the sample frequency domain noisy speech signal into an ERB domain, so as to obtain a sample ERB domain frequency point.
In general, a sample frequency domain noisy speech signal can be converted into the ERB domain using equation (1):
X_erb(w) = 9.265 * log_e(1 + X(w) / (24.7 * 9.265))    (1)
where X_erb(w) is the sample ERB domain frequency point and X(w) is the sample frequency domain noisy speech signal.
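Equation (1) can be applied point-wise to the spectrum; a direct transcription in numpy is shown below. Whether X(w) denotes the magnitude or the power spectrum is not stated, so a magnitude spectrum is assumed.

```python
import numpy as np

def to_erb_domain(X_mag):
    """Equation (1): map frequency-domain magnitudes X(w) to the ERB domain."""
    return 9.265 * np.log(1.0 + X_mag / (24.7 * 9.265))
```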
Step 205, dividing the sample ERB domain frequency points into M sample ERB domain sub-bands.
In this embodiment, the execution subject may divide the sample ERB domain frequency points into M sample ERB domain subbands. Wherein M is a positive integer. For example, the sample ERB domain frequency point is equally divided into 30 sample ERB domain sub-bands.
Step 206, converting the ERB domain sub-bands of the M samples to the frequency domain to obtain frequency domain sub-bands of the M samples.
In this embodiment, the execution subject may convert the M sample ERB domain subbands into the frequency domain, and obtain M sample frequency domain subbands. Wherein, one sample ERB domain sub-band is converted to obtain one sample frequency domain sub-band.
In general, the sample ERB domain sub-bands may be converted to the frequency domain using equation (2):
[Equation (2) is reproduced only as an image in the original publication.]
where X_bin(w) is the sample frequency domain sub-band, 1 ≤ bin ≤ M, and X_erb(w) is the sample ERB domain sub-band.
Step 207, performing a Discrete Cosine Transform (DCT) on the logarithmic energy sums of the M sample frequency domain sub-bands to obtain M sample DCT coefficients.
In this embodiment, the execution entity may perform DCT on the logarithmic energy sums of the M sample frequency domain subbands to obtain M sample DCT coefficients.
In general, the logarithmic energy sums of the sample frequency domain sub-bands can be transformed with the DCT using equation (3):
[Equation (3) is reproduced only as an image in the original publication.]
where y_k is a sample DCT coefficient, C_k is a coefficient, 0 ≤ k ≤ M-1, and E_bin is the logarithmic energy sum of a sample frequency domain sub-band.
C_k is defined by equation (4):
[Equation (4) is reproduced only as an image in the original publication.]
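Equations (3) and (4) appear only as images in the original publication; the sketch below therefore substitutes the standard orthonormal DCT-II from scipy as an assumed equivalent, applied to the logarithmic sub-band energy sums.

```python
import numpy as np
from scipy.fftpack import dct

def band_log_energy_dct(subband_energies, eps=1e-12):
    """Compute M sample DCT coefficients from M sub-band energy sums.

    `subband_energies`: array of shape (M,) holding the energy sum of each
    sample frequency-domain sub-band. Using an orthonormal DCT-II here is an
    assumption standing in for equations (3)-(4), which are images in the source.
    """
    log_e = np.log(subband_energies + eps)         # logarithmic energy sums E_bin
    return dct(log_e, type=2, norm='ortho')        # M sample DCT coefficients y_k
```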
step 208, respectively calculating the first order difference and the second order difference of the first N sample DCT coefficients in the M sample DCT coefficients.
In this embodiment, the execution subject may calculate the first-order difference and the second-order difference of the first N sample DCT coefficients among the M sample DCT coefficients, respectively. Wherein N is a positive integer and is less than or equal to M. For example, the first and second order differences of the first 11 sample DCT coefficients among the 30 sample DCT coefficients are calculated, respectively.
In general, the first order difference of the sample DCT coefficients can be calculated by equation (5), and the second order difference of the sample DCT coefficients can be calculated by equation (6):
y1_k = y_(l,k) - y_(l-2,k)    (5)
y2_k = y_(l,k) - 2*y_(l-1,k) + y_(l-2,k)    (6)
where l is the frame index; y1_k is the first-order difference of the sample DCT coefficients, equal to the k-th sample DCT coefficient of the current frame minus that of two frames earlier; and y2_k is the second-order difference of the sample DCT coefficients, equal to the k-th sample DCT coefficient of the current frame, minus twice that of the previous frame, plus that of two frames earlier.
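A direct transcription of equations (5) and (6), computing the first- and second-order differences of the first N sample DCT coefficients across frames. Zero-padding the first two frames is an assumption, since the disclosure does not state a boundary rule.

```python
import numpy as np

def dct_deltas(y, n=11):
    """Equations (5)-(6): first- and second-order differences across frames.

    `y` has shape (num_frames, M); only the first `n` coefficients are used.
    Frames with l < 2 have no two-frames-earlier reference, so they are
    zero-padded here (an assumed boundary rule).
    """
    y = y[:, :n]
    pad = np.vstack([np.zeros((2, n)), y])         # pad two frames at the start
    y_l, y_lm1, y_lm2 = pad[2:], pad[1:-1], pad[:-2]
    d1 = y_l - y_lm2                               # y1_k = y_(l,k) - y_(l-2,k)
    d2 = y_l - 2.0 * y_lm1 + y_lm2                 # y2_k = y_(l,k) - 2*y_(l-1,k) + y_(l-2,k)
    return d1, d2
```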
Step 209, calculate the P-dimensional DCT coefficient and pitch period of the speech pitch correlation of the sample frequency domain noisy speech signal.
In this embodiment, the execution subject may calculate a pitch period and a P-dimensional DCT coefficient of the speech pitch correlation of the sample frequency domain noisy speech signal. Wherein P is a positive integer. For example, 11-dimensional DCT coefficients and pitch periods of the speech pitch correlation of the sample frequency domain noisy speech signal are calculated. It should be noted that, the method for calculating the P-dimensional DCT coefficient and pitch period of the speech pitch correlation belongs to the prior art, and is not described herein again.
Here, the DCT coefficient, the first and second order differences of the DCT coefficient, the pitch correlation degree, and the pitch period are extracted, so that the feature content input into the network is richer.
Step 210, calculating a sample target frequency band gain based on the M sample frequency domain subbands of the sample clean speech signal and the M sample frequency domain subbands of the sample noisy speech signal.
In this embodiment, the execution body may calculate the sample target frequency band gain based on M sample frequency domain subbands of the sample clean speech signal and M sample frequency domain subbands of the sample noisy speech signal. For example, based on 30 sample frequency-domain subbands of the sample clean speech signal and 30 sample frequency-domain subbands of the sample noisy speech signal, 30 sample target band gains are calculated.
In general, the sample target band gain can be calculated by equation (7):
G_bin = (∑_w X_clean_bin(w)) / (∑_w X_noisy_bin(w))    (7)
where G_bin is the sample target band gain, i.e. the ratio of the sample frequency domain sub-band of the sample clean speech signal to that of the sample noisy speech signal; X_clean_bin(w) is the sample frequency domain sub-band of the sample clean speech signal; and X_noisy_bin(w) is the sample frequency domain sub-band of the sample noisy speech signal.
Step 211, calculating a sample target time domain energy ratio based on the sample clean speech signal and the sample noisy speech signal.
In this embodiment, the execution subject may calculate the sample target time-domain energy ratio based on the sample clean speech signal and the sample noisy speech signal. For example, based on the sample clean speech signal and the sample noisy speech signal, a 1 sample target time-domain energy ratio is calculated.
In general, the sample target time-domain energy ratio can be calculated by equation (8):
G = (∑_t x_clean(t)) / (∑_t x_noisy(t))    (8)
where G is the sample target time domain energy ratio, i.e. the energy ratio of the sample clean speech signal to the sample noisy speech signal; x_clean(t) is the sample clean speech signal; and x_noisy(t) is the sample noisy speech signal.
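A minimal sketch of equations (7) and (8) as training targets. The sub-band array layout and the clipping of the target gains to [0, 1] are assumptions added for illustration.

```python
import numpy as np

def training_targets(clean_subbands, noisy_subbands, clean_t, noisy_t, eps=1e-12):
    """Compute the M sample target band gains (eq. 7) and the single sample
    target time-domain energy ratio (eq. 8).

    `clean_subbands`/`noisy_subbands`: arrays of shape (M, bins_per_band) with
    frequency-domain sub-band magnitudes; `clean_t`/`noisy_t`: time-domain
    signals for one frame. Clipping the gains to [0, 1] is an assumption to
    match the 0-1 output range of the network.
    """
    g_bin = clean_subbands.sum(axis=1) / (noisy_subbands.sum(axis=1) + eps)   # eq. (7)
    # Eq. (8) as written sums the signals directly; this follows the source formula.
    g = clean_t.sum() / (noisy_t.sum() + eps)                                  # eq. (8)
    return np.clip(g_bin, 0.0, 1.0), g
```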
Step 212, inputting the frequency domain characteristics of the samples into the network, and learning to obtain the sample prediction frequency band gain and the sample prediction time domain energy ratio.
In this embodiment, the execution subject may input the sample frequency domain feature to the network, and the network learns two objects to obtain the sample prediction frequency band gain and the sample prediction time-domain energy ratio. The number of the sample prediction frequency band gains is M, the value interval is between 0 and 1, and the sample prediction frequency band gains are used for carrying out weighting gain on the frequency spectrum of the sample noisy speech signal. The number of the sample prediction time domain energy ratios is 1, the sample prediction time domain energy ratios are used for assisting network training, reducing voice loss caused by noise reduction, reflecting the proportion or strength of components of a sample clean voice signal in a sample noisy voice signal, and being used for rough voice activity detection estimation.
Step 213, calculate a first loss function based on the sample target band gain and the sample predicted band gain.
In this embodiment, the execution body may calculate the first loss function based on the sample target band gain and the sample prediction band gain.
In general, the first loss function can be calculated by equation (9):
[Equation (9) is reproduced only as an image in the original publication.]
where the left-hand side of equation (9) is the first loss function; C_0 and C_1 are used to adjust the degree of loss (in practice, C_0 = 10 and C_1 = 10 achieve a better training effect); G_bin is the sample target band gain; and Ĝ_bin (the notation used here for the symbol shown as an image) is the sample predicted band gain.
Step 214, calculating a second loss function based on the sample target time-domain energy ratio and the sample prediction time-domain energy ratio.
In this embodiment, the execution subject may calculate the second loss function based on the sample target temporal energy ratio and the sample prediction temporal energy ratio.
In general, the second loss function can be calculated by equation (10):
[Equation (10) is reproduced only as an image in the original publication.]
where the left-hand side of equation (10) is the second loss function; C_2 is used to adjust the degree of loss (in practice, C_2 = 5 achieves a better training effect); G is the sample target time domain energy ratio; and Ĝ (the notation used here for the symbol shown as an image) is the sample predicted time domain energy ratio.
Step 215, adjusting parameters of the network based on the first loss function and the second loss function to obtain a noise reduction model.
In this embodiment, the executing entity may adjust parameters of the network based on the first loss function and the second loss function to obtain the noise reduction model.
Generally, a total loss function can be obtained by performing weighted summation on the first loss function and the second loss function, and a value of the total loss function is reduced by adjusting parameters of the network based on the total loss function until the network converges, so that the noise reduction model can be obtained.
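The exact forms of equations (9) and (10) appear only as images in the original publication, so the sketch below assumes simple squared-error terms scaled by the constants C_0, C_1 and C_2 named in the text, combined by the weighted summation described in step 215.

```python
import tensorflow as tf

# Loss-adjustment constants named in the text (C0 = C1 = 10, C2 = 5).
C0, C1, C2 = 10.0, 10.0, 5.0

def total_loss(g_bin_true, g_bin_pred, g_true, g_pred, w_band=1.0, w_energy=1.0):
    """Weighted sum of the band-gain loss and the time-domain energy-ratio loss.

    The squared-error forms are assumptions: equations (9) and (10) are only
    images in the source, so just the constants C0, C1, C2 and the weighted
    summation of step 215 are taken from the text. C1's exact role in
    equation (9) is not recoverable here.
    """
    band_loss = C0 * tf.reduce_mean(tf.square(g_bin_true - g_bin_pred))    # stands in for eq. (9)
    energy_loss = C2 * tf.reduce_mean(tf.square(g_true - g_pred))          # stands in for eq. (10)
    return w_band * band_loss + w_energy * energy_loss
```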
As can be seen from fig. 2, compared with the embodiment corresponding to fig. 1, the flow 200 of the noise reduction model training method in this embodiment highlights a sample obtaining step, a frequency domain feature extracting step, and a loss calculating step. Therefore, the scheme described in the embodiment generates the sample by fusing the clean voice signal and the noise signal, so that the sample is more convenient to acquire. The DCT coefficient, the first order difference and the second order difference of the DCT coefficient, the pitch correlation degree and the pitch period are extracted, so that the feature content input into the network is richer. And calculating a loss function for band gain training and a loss function for time domain energy ratio gain training, and taking the loss functions as training targets of the network, so as to train a noise reduction model with better effect.
For ease of understanding, FIG. 3 shows a schematic diagram of training features in the noise reduction model training method of FIG. 2. As shown in fig. 3, the sample clean speech signal is fused with the sample noise signal to generate a sample noisy speech signal. ERB frequency band division and DCT are carried out on the basis of the sample noisy speech signal, and a 30-dimensional DCT coefficient, an 11-dimensional first-order difference and an 11-dimensional second-order difference are obtained. And extracting fundamental tone based on the sample noisy speech signal to obtain 11-dimensional fundamental tone correlation degree and a fundamental tone period. And calculating a frequency domain sub-band energy ratio and a time domain energy ratio based on the sample clean speech signal and the sample noisy speech signal to obtain 1 sample time domain energy ratio and 30 sample frequency band gains to be used as training targets.
Further, fig. 4 shows the network structure of the noise reduction model in the noise reduction model training method of fig. 2. As shown in fig. 4, the 30-dimensional DCT coefficients, the 11-dimensional first- and second-order differences, the 11-dimensional pitch correlation, and the pitch period, 64 features in total, are input to the network. The network includes 1 fully connected layer Dense(96), 3 gated recurrent unit layers GRU(96), 1 fully connected layer Dense(30) and 1 fully connected layer Dense(1). The Dense(30) layer outputs the 30 band gains G_bin, and the Dense(1) layer outputs the 1 time-domain energy ratio G.
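A Keras sketch of the fig. 4 architecture (Dense(96), three GRU(96) layers, Dense(30) and Dense(1) output heads). The activation functions are assumptions; sigmoid outputs match the stated 0-1 range of the band gains.

```python
import tensorflow as tf

def build_noise_reduction_model(num_features=64, num_bands=30):
    """Sketch of the fig. 4 network: Dense(96) -> 3x GRU(96) -> Dense(30) + Dense(1).

    Inputs are per-frame 64-dimensional feature vectors over time; activations
    (tanh for the Dense(96) layer, sigmoid for the outputs) are assumptions.
    """
    features = tf.keras.Input(shape=(None, num_features))            # (batch, frames, 64)
    x = tf.keras.layers.Dense(96, activation="tanh")(features)
    for _ in range(3):
        x = tf.keras.layers.GRU(96, return_sequences=True)(x)
    band_gains = tf.keras.layers.Dense(num_bands, activation="sigmoid", name="band_gains")(x)
    energy_ratio = tf.keras.layers.Dense(1, activation="sigmoid", name="energy_ratio")(x)
    return tf.keras.Model(features, [band_gains, energy_ratio])
```

In training, the 64-dimensional feature sequences are fed to this model and its two outputs are supervised with the sample target band gains and the sample target time-domain energy ratio, as described in steps 212-215.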
FIG. 5 illustrates a flow 500 of one embodiment of a speech noise reduction method according to the present disclosure. The voice noise reduction method comprises the following steps:
step 501, obtaining a voice signal with noise.
In this embodiment, the executing subject of the voice noise reduction method may acquire a noisy speech signal. The noisy speech signal is a speech signal containing noise, and may be a speech signal collected in a noisy environment.
Step 502, extracting frequency domain characteristics of the voice signal with noise.
In this embodiment, the execution subject may extract frequency domain features of the noisy speech signal. The frequency domain features may include, but are not limited to: ERB cepstral coefficients, the first- and second-order differences of the ERB cepstral coefficients, the DCT coefficients of the speech pitch correlation, the pitch period, and the like.
Here, the noisy speech signal is usually a time domain signal, which needs to be converted into the frequency domain in order to extract the frequency domain features. Specifically, converting the voice signal with noise into a frequency domain to obtain a voice signal with noise in the frequency domain; and extracting frequency domain characteristics from the frequency domain voice signal with noise.
It should be noted that the frequency domain feature extraction method may refer to the sample frequency domain feature extraction method in fig. 2, and is not described herein again.
Step 503, inputting the frequency domain characteristics into the noise reduction model to obtain M band gains.
In this embodiment, the execution subject may input the frequency domain features to the noise reduction model, and obtain M band gains. Where M is a positive integer, the noise reduction model may be obtained by training using the noise reduction model training method shown in fig. 1 or fig. 2, which is not described herein again.
In general, by inputting the frequency domain features into the noise reduction model, M band gains and 1 time domain energy ratio can be obtained; noise reduction uses only the M band gains.
It should be noted that the trained noise reduction model may first be converted into a TensorFlow Lite model, and the noise reduction process may then be invoked through TensorFlow Lite.
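A minimal sketch of that conversion using the TensorFlow Lite converter, assuming the model was built with Keras as in the earlier sketch. Enabling TF-select ops is a precaution for the GRU layers, not a step taken from the disclosure.

```python
import tensorflow as tf

def export_tflite(keras_model, path="noise_reduction.tflite"):
    """Convert a trained Keras noise reduction model to TensorFlow Lite."""
    converter = tf.lite.TFLiteConverter.from_keras_model(keras_model)
    # Depending on the TF version, GRU layers may need TF-select ops enabled:
    converter.target_spec.supported_ops = [
        tf.lite.OpsSet.TFLITE_BUILTINS,
        tf.lite.OpsSet.SELECT_TF_OPS,
    ]
    with open(path, "wb") as f:
        f.write(converter.convert())

# Inference is then driven through the TF Lite interpreter:
# interpreter = tf.lite.Interpreter(model_path="noise_reduction.tflite")
# interpreter.allocate_tensors()
```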
Step 504, denoising based on the M frequency band gains to obtain a clean voice signal.
In this embodiment, the execution main body may perform noise reduction based on M band gains, so as to obtain a clean speech signal.
Generally, based on M band gains, a frequency domain clean speech signal can be obtained, and a clean speech signal can be obtained by converting the frequency domain clean speech signal into a time domain.
The voice noise reduction method provided by the embodiment of the disclosure has good suppression capability for both stationary noise and non-stationary noise. It can be applied to devices with audio and video functions, such as smart-screen devices, smart speakers, mobile phones and computers, to suppress noise, improve sound quality and improve the user's call experience.
With continued reference to fig. 6, a flow 600 of yet another embodiment of a speech noise reduction method according to the present disclosure is shown. The voice noise reduction method comprises the following steps:
step 601, acquiring a voice signal with noise.
In this embodiment, the specific operation of step 601 has been described in detail in step 501 in the embodiment shown in fig. 5, and is not described herein again.
Step 602, converting the voice signal with noise to a frequency domain to obtain a voice signal with noise in the frequency domain.
In this embodiment, the main executing body of the speech noise reduction method may convert the noise-containing speech signal into a frequency domain, so as to obtain a frequency domain noise-containing speech signal.
In general, the noisy speech signal is a time domain signal. A Hamming window is first applied to the noisy speech signal, and a Fourier transform is then performed on the windowed signal, so that the frequency domain noisy speech signal can be obtained.
Step 603, extracting frequency domain characteristics from the frequency domain noisy speech signal.
In this embodiment, the execution subject may extract frequency domain features from the frequency domain noisy speech signal. The frequency domain features may include, but are not limited to: ERB cepstral coefficients, the first- and second-order differences of the ERB cepstral coefficients, the DCT coefficients of the speech pitch correlation, the pitch period, and the like.
It should be noted that the frequency domain feature extraction method may refer to the sample frequency domain feature extraction method in fig. 2, and is not described herein again.
Step 604, inputting the frequency domain characteristics to a noise reduction model to obtain M band gains.
In this embodiment, the specific operation of step 604 is described in detail in step 503 in the embodiment shown in fig. 5, and is not described herein again.
Step 605, interpolating the M band gains, and weighting the frequency points with the interpolated band gains to obtain a frequency domain clean voice signal.
In this embodiment, the execution main body may interpolate the M band gains and weight the frequency points with the interpolated band gains, so as to obtain a frequency domain clean speech signal.
Step 606, converting the frequency domain clean speech signal to a time domain to obtain a clean speech signal.
In this embodiment, the execution subject may convert the frequency domain clean speech signal into the time domain to obtain the clean speech signal.
Generally, the clean speech signal is obtained by performing an inverse Fourier transform on the frequency domain clean speech signal.
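A sketch of steps 605 and 606 for a single frame, assuming linear interpolation of the M band gains onto the FFT bins; the disclosure does not specify the interpolation method or the band-centre layout.

```python
import numpy as np

def apply_band_gains(noisy_spectrum, band_gains, band_centers):
    """Interpolate M band gains onto FFT bins, weight the spectrum, and
    return the time-domain frame via the inverse FFT.

    `noisy_spectrum`: complex one-sided spectrum of one frame;
    `band_centers`: FFT-bin index of each band's centre, in increasing order
    (an assumed layout, since the disclosure does not give the scheme).
    """
    bins = np.arange(len(noisy_spectrum))
    gain_per_bin = np.interp(bins, band_centers, band_gains)   # linear interpolation
    clean_spectrum = gain_per_bin * noisy_spectrum             # weight each frequency point
    return np.fft.irfft(clean_spectrum)                        # back to the time domain
```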
For ease of understanding, fig. 7 shows a flow diagram of the speech noise reduction method of fig. 6. As shown in fig. 7, the noisy speech signal is Fourier transformed to obtain a frequency domain noisy speech signal. Features are extracted from the frequency domain noisy speech signal to obtain the frequency domain features. The frequency domain features are input to the TFLite model converted from the noise reduction model to obtain the band gains. Gain interpolation is performed on the band gains to obtain a frequency domain clean speech signal. An inverse Fourier transform is performed on the frequency domain clean speech signal to obtain the clean speech signal. Fig. 8 and 9 respectively show noise reduction effect diagrams of the voice noise reduction method of this embodiment; in each figure, the upper part is the noisy audio and the lower part is the noise-reduced audio. It can be seen that the voice noise reduction method of this embodiment can effectively suppress noise and improve sound quality.
For ease of understanding, fig. 10 shows a scene diagram of a noise reduction model training method and a speech noise reduction method that can implement embodiments of the present disclosure. As shown in fig. 10, feature extraction is performed on training data, and model training is performed on the obtained features to obtain a noise reduction model. And extracting the characteristics of the voice with noise, inputting the obtained characteristics into a noise reduction model for noise reduction, and obtaining clean voice.
With further reference to fig. 11, as an implementation of the methods shown in the above-mentioned figures, the present disclosure provides an embodiment of a noise reduction model training apparatus, where the embodiment of the apparatus corresponds to the embodiment of the method shown in fig. 1, and the apparatus may be specifically applied to various electronic devices.
As shown in fig. 11, the noise reduction model training apparatus 1100 of the present embodiment may include: an acquisition module 1101, an extraction module 1102, a calculation module 1103, and a training module 1104. Wherein, the obtaining module 1101 is configured to obtain a sample noisy speech signal; an extracting module 1102 configured to extract sample frequency domain features of the sample noisy speech signal; a calculating module 1103 configured to calculate a sample target frequency band gain and a sample target time-domain energy ratio of the sample noisy speech signal; and the training module 1104 is configured to train the network by taking the frequency domain characteristics of the sample as input, taking the frequency band gain of the sample target and the time domain energy ratio of the sample target as supervision, and obtaining a noise reduction model.
In this embodiment, in the noise reduction model training apparatus 1100: the specific processing of the obtaining module 1101, the extracting module 1102, the calculating module 1103 and the training module 1104 and the technical effects thereof can refer to the related descriptions of steps 101 to 104 in the corresponding embodiment of fig. 1, which are not repeated herein.
In some optional implementations of this embodiment, the extracting module 1102 includes: a conversion sub-module configured to convert the sample noisy speech signal to the frequency domain, resulting in a sample frequency domain noisy speech signal; an extraction sub-module configured to extract sample frequency domain features from the sample frequency domain noisy speech signal.
In some optional implementations of this embodiment, the extracting sub-module includes: the first conversion unit is configured to convert the sample frequency domain noisy speech signal into an equivalent rectangular bandwidth ERB domain to obtain a sample ERB domain frequency point; the dividing unit is configured to divide the sample ERB domain frequency points into M sample ERB domain sub-bands, wherein M is a positive integer; a second conversion unit configured to convert the M sample ERB domain sub-bands to the frequency domain, resulting in M sample frequency domain sub-bands; and the transformation unit is configured to perform Discrete Cosine Transform (DCT) on the logarithmic energy sums of the M sample frequency domain sub-bands to obtain M sample DCT coefficients.
In some optional implementations of this embodiment, the extracting sub-module further includes: the first calculating unit is configured to calculate a first order difference and a second order difference of the first N sample DCT coefficients in the M sample DCT coefficients respectively, wherein N is a positive integer and is less than or equal to M.
In some optional implementations of this embodiment, the extracting sub-module further includes: and a second calculating unit configured to calculate a pitch period and a P-dimensional DCT coefficient of the speech pitch correlation of the sample frequency domain noisy speech signal, where P is a positive integer.
In some optional implementations of this embodiment, the obtaining module 1101 is further configured to: obtaining a sample clean voice signal and a sample noise signal; fusing the sample clean voice signal and the sample noise signal to obtain a sample voice signal with noise; and the calculation module 1103 is further configured to: calculating a sample target frequency band gain based on M sample frequency domain sub-bands of the sample clean voice signal and M sample frequency domain sub-bands of the sample noisy voice signal; based on the sample clean speech signal and the sample noisy speech signal, a sample target time-domain energy ratio is calculated.
In some optional implementations of this embodiment, the training module 1104 is further configured to: inputting the frequency domain characteristics of the samples into a network, and learning to obtain the predicted frequency band gain of the samples and the predicted time domain energy ratio of the samples; calculating a first loss function based on the sample target band gain and the sample predicted band gain; calculating a second loss function based on the sample target time-domain energy ratio and the sample prediction time-domain energy ratio; and adjusting parameters of the network based on the first loss function and the second loss function to obtain a noise reduction model.
With further reference to fig. 12, as an implementation of the methods shown in the above-mentioned figures, the present disclosure provides an embodiment of a speech noise reduction apparatus, which corresponds to the method embodiment shown in fig. 5, and which can be applied in various electronic devices.
As shown in fig. 12, the speech noise reduction apparatus 1200 of the present embodiment may include: an acquisition module 1201, an extraction module 1202, an input module 1203, and a noise reduction module 1204. The obtaining module 1201 is configured to obtain a noisy speech signal; an extraction module 1202 configured to extract frequency domain features of a noisy speech signal; an input module 1203, configured to input the frequency domain features into a noise reduction model, so as to obtain M band gains, where M is a positive integer, and the noise reduction model is obtained by training using the apparatus shown in fig. 5 or fig. 6; a noise reduction module 1204 configured to perform noise reduction based on the M band gains to obtain a clean speech signal.
In this embodiment, in the speech noise reduction apparatus 1200: the specific processing of the obtaining module 1201, the extracting module 1202, the inputting module 1203 and the denoising module 1204 and the technical effects thereof may refer to the related descriptions of steps 501 to 504 in the corresponding embodiment of fig. 5, which are not repeated herein.
In some optional implementations of this embodiment, the extraction module 1202 is further configured to: converting the voice signal with noise into a frequency domain to obtain a voice signal with noise in the frequency domain; extracting frequency domain characteristics from the frequency domain voice signal with noise; and the noise reduction module 1204 is further configured to: interpolating M frequency band gains, and weighting the frequency points of the interpolated frequency band gains to obtain a frequency domain clean voice signal; and converting the frequency domain clean voice signal into a time domain to obtain a clean voice signal.
In the technical scheme of the disclosure, the collection, storage, use, processing, transmission, provision, disclosure and other processing of the personal information of the related user are all in accordance with the regulations of related laws and regulations and do not violate the good customs of the public order.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
Fig. 13 illustrates a schematic block diagram of an example electronic device 1300 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 13, the apparatus 1300 includes a computing unit 1301 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 1302 or a computer program loaded from a storage unit 1308 into a Random Access Memory (RAM) 1303. In the RAM 1303, various programs and data necessary for the operation of the device 1300 can also be stored. The calculation unit 1301, the ROM1302, and the RAM 1303 are connected to each other via a bus 1304. An input/output (I/O) interface 1305 is also connected to bus 1304.
A number of components in the device 1300 connect to the I/O interface 1305, including: an input unit 1306 such as a keyboard, a mouse, and the like; an output unit 1307 such as various types of displays, speakers, and the like; storage unit 1308, such as a magnetic disk, optical disk, or the like; and a communication unit 1309 such as a network card, modem, wireless communication transceiver, etc. The communication unit 1309 allows the device 1300 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunication networks.
Computing unit 1301 may be a variety of general and/or special purpose processing components with processing and computing capabilities. Some examples of computing unit 1301 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The computing unit 1301 performs the various methods and processes described above, such as a noise reduction model training method or a speech noise reduction method. For example, in some embodiments, the noise reduction model training method or the speech noise reduction method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as storage unit 1308. In some embodiments, some or all of the computer program may be loaded onto and/or installed onto device 1300 via ROM 1302 and/or communications unit 1309. When the computer program is loaded into the RAM 1303 and executed by the computing unit 1301, one or more steps of the noise reduction model training method or the speech noise reduction method described above may be performed. Alternatively, in other embodiments, the computing unit 1301 may be configured to perform the noise reduction model training method or the speech noise reduction method by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems on a chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user may provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server combined with a blockchain.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in this disclosure may be performed in parallel or sequentially or in a different order, as long as the desired results of the technical solutions provided by this disclosure can be achieved, and are not limited herein.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (21)

1. A noise reduction model training method comprises the following steps:
acquiring a sample voice signal with noise;
extracting sample frequency domain characteristics of the sample noisy speech signal;
calculating a sample target frequency band gain and a sample target time domain energy ratio of the sample noisy speech signal;
and taking the sample frequency domain characteristics as input, taking the sample target frequency band gain and the sample target time domain energy ratio as supervision, and training a network to obtain a noise reduction model.
2. The method according to claim 1, wherein said extracting sample frequency domain features of the sample noisy speech signal comprises:
converting the sample voice signal with noise into a frequency domain to obtain a sample voice signal with noise in the frequency domain;
and extracting the sample frequency domain characteristics from the sample frequency domain noisy speech signal.
3. The method according to claim 2, wherein said extracting the sample frequency-domain features from the sample frequency-domain noisy speech signal comprises:
converting the sample frequency domain voice signal with the noise into an equivalent rectangular bandwidth ERB domain to obtain a sample ERB domain frequency point;
dividing the sample ERB domain frequency points into M sample ERB domain sub-bands, wherein M is a positive integer;
converting the sub-bands of the ERB domain of the M samples into a frequency domain to obtain sub-bands of the frequency domain of the M samples;
and respectively carrying out Discrete Cosine Transform (DCT) on the logarithmic energy sum of the M sample frequency domain sub-bands to obtain M sample DCT coefficients.
4. The method according to claim 3, wherein said extracting the sample frequency-domain features from the sample frequency-domain noisy speech signal, further comprises:
and respectively calculating first order difference and second order difference of the first N sample DCT coefficients in the M sample DCT coefficients, wherein N is a positive integer and is less than or equal to M.
5. The method according to any one of claims 2-4, wherein the extracting the sample frequency-domain features from the sample frequency-domain noisy speech signal, further comprises:
and calculating a P-dimensional DCT coefficient and a pitch period of the voice pitch correlation degree of the sample frequency domain voice signal with the noise, wherein P is a positive integer.
6. The method of claim 2, wherein said obtaining a sample noisy speech signal comprises:
obtaining a sample clean voice signal and a sample noise signal;
fusing the sample clean voice signal and the sample noise signal to obtain the sample noisy voice signal; and
the calculating of the sample target frequency band gain and the sample target time domain energy ratio of the sample noisy speech signal includes:
calculating the sample target frequency band gain based on the M sample frequency domain sub-bands of the sample clean speech signal and the M sample frequency domain sub-bands of the sample noisy speech signal;
and calculating the sample target time domain energy ratio based on the sample clean speech signal and the sample noisy speech signal.
7. The method according to any one of claims 1-6, wherein the training a network with the sample frequency domain features as input and the sample target frequency band gain and the sample target time domain energy ratio as supervision to obtain a noise reduction model comprises:
inputting the sample frequency domain characteristics to the network, and learning to obtain sample prediction frequency band gain and a sample prediction time domain energy ratio;
calculating a first loss function based on the sample target band gain and the sample predicted band gain;
calculating a second loss function based on the sample target temporal energy ratio and the sample predicted temporal energy ratio;
and adjusting parameters of the network based on the first loss function and the second loss function to obtain the noise reduction model.
8. A method of speech noise reduction, comprising:
acquiring a noisy speech signal;
extracting frequency domain features of the noisy speech signal;
inputting the frequency domain features into a noise reduction model to obtain M frequency band gains, wherein M is a positive integer, and the noise reduction model is obtained by training according to the method of any one of claims 1-7;
and denoising based on the M frequency band gains to obtain a clean speech signal.
9. The method according to claim 8, wherein said extracting frequency domain features of the noisy speech signal comprises:
converting the noisy speech signal into the frequency domain to obtain a frequency domain noisy speech signal;
extracting the frequency domain features from the frequency domain noisy speech signal; and
the denoising based on the M frequency band gains to obtain the clean speech signal includes:
interpolating the M frequency band gains, and weighting the frequency points of the frequency domain noisy speech signal with the interpolated band gains to obtain a frequency domain clean speech signal;
and converting the frequency domain clean speech signal into the time domain to obtain the clean speech signal.
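For the inference path of claims 8 and 9, the M frequency band gains predicted by the model can be interpolated to per-bin gains, applied to the noisy spectrum, and inverted back to the time domain. Linear interpolation over assumed band centre frequencies is shown here; the disclosure does not commit to a specific interpolation method.

```python
import numpy as np

def apply_band_gains(noisy_spectrum, band_gains, band_centers_hz, sample_rate=16000):
    """Interpolate M band gains to every frequency bin of one frame and weight the
    noisy spectrum; band_centers_hz (assumed increasing) is a hypothetical input
    holding the centre frequencies of the sub-bands used during feature extraction."""
    n_bins = noisy_spectrum.shape[-1]
    freqs = np.linspace(0.0, sample_rate / 2.0, n_bins)
    per_bin_gain = np.interp(freqs, band_centers_hz, band_gains)
    return noisy_spectrum * per_bin_gain  # frequency domain clean speech estimate

def to_time_domain(clean_spectrum, frame_len=480):
    """Inverse FFT of one enhanced frame; overlap-add across frames would then
    reconstruct the full clean speech signal."""
    return np.fft.irfft(clean_spectrum, n=frame_len)
```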
10. A noise reduction model training apparatus comprising:
an acquisition module configured to acquire a sample noisy speech signal;
an extraction module configured to extract sample frequency domain features of the sample noisy speech signal;
a calculation module configured to calculate a sample target frequency band gain and a sample target time domain energy ratio of the sample noisy speech signal;
and a training module configured to train a network with the sample frequency domain features as input and the sample target frequency band gain and the sample target time domain energy ratio as supervision to obtain a noise reduction model.
11. The apparatus of claim 10, wherein the extraction module comprises:
a conversion sub-module configured to convert the sample noisy speech signal to the frequency domain, resulting in a sample frequency domain noisy speech signal;
an extraction sub-module configured to extract the sample frequency-domain features from the sample frequency-domain noisy speech signal.
12. The apparatus of claim 11, wherein the extraction submodule comprises:
a first conversion unit configured to convert the sample frequency domain noisy speech signal into an equivalent rectangular bandwidth ERB domain to obtain sample ERB domain frequency points;
a dividing unit configured to divide the sample ERB domain frequency points into M sample ERB domain sub-bands, wherein M is a positive integer;
a second conversion unit configured to convert the M sample ERB domain sub-bands to the frequency domain, resulting in M sample frequency domain sub-bands;
and a transformation unit configured to perform Discrete Cosine Transform (DCT) on the logarithmic energy sums of the M sample frequency domain sub-bands respectively to obtain M sample DCT coefficients.
13. The apparatus of claim 12, wherein the extraction sub-module further comprises:
a first calculating unit configured to calculate first and second order differences of first N sample DCT coefficients among the M sample DCT coefficients, respectively, wherein N is a positive integer and N is less than or equal to M.
14. The apparatus of any one of claims 11-13, wherein the extraction sub-module further comprises:
a second calculating unit configured to calculate a pitch period of the sample frequency domain noisy speech signal and P-dimensional DCT coefficients of the speech pitch correlation, wherein P is a positive integer.
15. The apparatus of claim 11, wherein the acquisition module is further configured to:
obtain a sample clean speech signal and a sample noise signal;
fuse the sample clean speech signal and the sample noise signal to obtain the sample noisy speech signal; and
the computing module is further configured to:
calculate the sample target frequency band gain based on the M sample frequency domain sub-bands of the sample clean speech signal and the M sample frequency domain sub-bands of the sample noisy speech signal;
and calculate the sample target time domain energy ratio based on the sample clean speech signal and the sample noisy speech signal.
16. The apparatus of any of claims 10-15, wherein the training module is further configured to:
input the sample frequency domain features into the network, and learn to obtain a sample predicted frequency band gain and a sample predicted time domain energy ratio;
calculate a first loss function based on the sample target frequency band gain and the sample predicted frequency band gain;
calculate a second loss function based on the sample target time domain energy ratio and the sample predicted time domain energy ratio;
and adjust parameters of the network based on the first loss function and the second loss function to obtain the noise reduction model.
17. A speech noise reduction apparatus comprising:
an acquisition module configured to acquire a noisy speech signal;
an extraction module configured to extract frequency domain features of the noisy speech signal;
an input module configured to input the frequency domain features into a noise reduction model to obtain M frequency band gains, wherein M is a positive integer, and the noise reduction model is trained using the apparatus of any one of claims 10-16;
and a noise reduction module configured to perform noise reduction based on the M frequency band gains to obtain a clean speech signal.
18. The apparatus of claim 17, wherein the extraction module is further configured to:
convert the noisy speech signal into the frequency domain to obtain a frequency domain noisy speech signal;
extract the frequency domain features from the frequency domain noisy speech signal; and
the noise reduction module is further configured to:
interpolate the M frequency band gains, and weight the frequency points of the frequency domain noisy speech signal with the interpolated band gains to obtain a frequency domain clean speech signal;
and convert the frequency domain clean speech signal into the time domain to obtain the clean speech signal.
19. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-7 or the method of any one of claims 8-9.
20. A non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 1-7 or the method of any one of claims 8-9.
21. A computer program product comprising a computer program which, when executed by a processor, implements the method of any one of claims 1-7 or the method of any one of claims 8-9.
CN202211301496.7A 2022-10-24 2022-10-24 Noise reduction model training method, device and equipment Pending CN115662461A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211301496.7A CN115662461A (en) 2022-10-24 2022-10-24 Noise reduction model training method, device and equipment

Publications (1)

Publication Number Publication Date
CN115662461A true CN115662461A (en) 2023-01-31

Family

ID=84992362

Country Status (1)

Country Link
CN (1) CN115662461A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116665692A (en) * 2023-07-27 2023-08-29 荣耀终端有限公司 Voice noise reduction method and terminal equipment
CN116665692B (en) * 2023-07-27 2023-10-20 荣耀终端有限公司 Voice noise reduction method and terminal equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination