CN113808602A - Speech enhancement method, model training method and related equipment - Google Patents


Info

Publication number
CN113808602A
Authority
CN
China
Prior art keywords
speech
neural network
estimated
network module
module
Prior art date
Legal status
Pending
Application number
CN202110129897.8A
Other languages
Chinese (zh)
Inventor
雪巍
蔡玉玉
吴俊仪
全刚
张超
杨帆
丁国宏
何晓冬
Current Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Wodong Tianjun Information Technology Co Ltd
Original Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Wodong Tianjun Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Jingdong Century Trading Co Ltd and Beijing Wodong Tianjun Information Technology Co Ltd
Priority to CN202110129897.8A
Publication of CN113808602A
Priority to PCT/CN2022/073197 (WO2022161277A1)

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L21/0232 Processing in the frequency domain
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Quality & Reliability (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Filters That Use Time-Delay Elements (AREA)

Abstract

The present disclosure provides a speech enhancement method, a model training method, and related devices. The speech enhancement model comprises a speech prediction neural network module, a noise estimation neural network module, and a linear filtering module, and the model training method comprises the following steps: acquiring the noisy speech magnitude spectrum and the clean speech magnitude spectrum of each speech pair in a training set; obtaining a first feature set and a second feature set from the noisy speech magnitude spectrum; inputting the first feature set into the speech prediction neural network module, which outputs a first quasi-estimated clean speech magnitude spectrum and a prediction error; inputting the second feature set into the noise estimation neural network module, which outputs an estimated noise energy; inputting the first quasi-estimated clean speech magnitude spectrum, the prediction error, and the estimated noise energy into the linear filtering module, which outputs an estimated clean speech magnitude spectrum; and calculating a model loss from the clean speech magnitude spectrum and the estimated clean speech magnitude spectrum to train the speech enhancement model. The present disclosure thereby optimizes speech enhancement.

Description

Speech enhancement method, model training method and related equipment
Technical Field
The present disclosure relates to the field of internet technologies, and in particular, to a speech enhancement method, a model training method, and a related device.
Background
With the rapid development of speech recognition technology, it has been applied to scenes such as intelligent hardware and intelligent telephone customer service. Because the accuracy of recognition results is closely tied to work efficiency and the user interaction experience, expectations for speech recognition quality keep rising. Since the application scenes of speech recognition largely follow users' daily-life and work needs, the input speech signal cannot be guaranteed to be clean and noise-free. When recognizing noisy speech in such background environments, the noise degrades the quality of the speech signal, making recognition results inaccurate and reducing the user's efficiency in human-computer interaction and audio transcription. Speech enhancement techniques, which address audio noise interference in complex noise environments, are therefore a key component of speech recognition technology.
Speech enhancement aims to process speech containing noise and output clean speech audio. The main approaches fall into two categories: linear filtering methods based on signal processing, such as wiener filtering, Kalman filtering, and filters based on the minimum mean square error; and machine learning based methods, such as those based on recurrent neural networks, convolutional-recurrent neural networks, and UNET networks.
A linear filtering method based on signal processing first presets statistical models of speech and noise, solves for the optimal filter under a certain optimization criterion, and applies it to the noisy audio to enhance the speech. A machine learning based method uses a large amount of training data and a chosen network structure to train, under a supervised learning framework, a nonlinear function mapping noisy speech to clean speech.
Although linear filtering methods do not require large-scale data for training, they usually design the optimization function from expert knowledge, so under some conditions the model assumptions about speech or noise are too idealized, e.g., assuming the noise is stationary, which causes significant performance degradation in practical scenarios, especially under non-stationary noise. Machine learning based speech enhancement trains a neural network on a large corpus to obtain the mapping from noisy speech features to clean speech, and can markedly improve performance under complex non-stationary noise. However, its performance is clearly limited by the variety of noise in the corpus; when the corpus is limited, overfitting often occurs, resulting in poor generalization to out-of-set noise. The main reason is that machine learning based methods rely too heavily on existing neural network structures and do not incorporate traditional signal-processing expert knowledge, making it difficult to improve the network by designing regularization methods that conform to optimal speech signal processing.
Therefore, a technical problem to be solved by those skilled in the art is how to optimize a speech enhancement method so that it maintains good enhancement performance under both stationary and complex non-stationary noise while improving the generalization performance of speech enhancement.
It is noted that the information disclosed in the background section above is only for enhancement of understanding of the background of the present disclosure, and therefore, may include information that does not constitute prior art that is known to those of ordinary skill in the art.
Disclosure of Invention
In view of the above, the present disclosure provides a speech enhancement method, a model training method, and related devices, which maintain good enhancement performance under both stationary noise and complex non-stationary noise by optimizing the speech enhancement method, while improving the generalization performance of speech enhancement.
One aspect of the present disclosure provides a speech enhancement model training method, the speech enhancement model including a speech prediction neural network module, a noise estimation neural network module, and a linear filtering module,
the training method of the speech enhancement model comprises the following steps:
acquiring a noisy speech magnitude spectrum and a clean speech magnitude spectrum of each speech pair in a speech training set, wherein a speech pair comprises an associated clean speech signal and noisy speech signal;
obtaining a first feature set and a second feature set from the noisy speech magnitude spectrum;
inputting the first feature set into the speech prediction neural network module, wherein the speech prediction neural network module is used for outputting a first quasi-estimated clean speech magnitude spectrum and a prediction error;
inputting the second feature set into the noise estimation neural network module, the noise estimation neural network module being used for outputting an estimated noise energy;
inputting the first quasi-estimated clean speech magnitude spectrum and the prediction error output by the speech prediction neural network module, and the estimated noise energy output by the noise estimation neural network module, into the linear filtering module, wherein the linear filtering module is used for outputting an estimated clean speech magnitude spectrum;
and calculating a model loss from the clean speech magnitude spectrum and the estimated clean speech magnitude spectrum, and training the speech enhancement model according to the model loss.
In some embodiments of the present disclosure, obtaining the noisy speech magnitude spectrum and the clean speech magnitude spectrum of each speech pair in the speech training set includes:
performing a time-domain to frequency-domain transformation step on the clean speech signal of the speech pair;
and performing the time-domain to frequency-domain transformation step on the noisy speech signal of the speech pair,
wherein the time-domain to frequency-domain transformation step comprises:
framing a speech signal to be processed;
performing a Fourier transform on each frame of the speech signal to be processed to obtain a frame Fourier spectrum of each frame;
splicing the frame Fourier spectra of the frames of the speech signal to be processed along the time axis to obtain the Fourier spectrum of the speech signal to be processed;
and generating the magnitude spectrum of the speech signal to be processed based on the magnitude of each frequency point of its Fourier spectrum.
In some embodiments of the present disclosure, the speech prediction neural network module is a time-series neural network model, the first feature set is a noisy speech magnitude spectrum sequence of a plurality of consecutive frames, the first quasi-estimated clean speech magnitude spectrum output by the speech prediction neural network module is a first quasi-estimated clean speech magnitude spectrum sequence having the same dimensions as the noisy speech magnitude spectrum sequence, and the prediction error output by the speech prediction neural network module is a prediction error sequence having the same dimensions as the noisy speech magnitude spectrum sequence.
In some embodiments of the present disclosure, the noise estimation neural network module is a multi-layer fully-connected network, and the second feature set includes the noisy speech magnitude spectra of the current frame and of a neighborhood window around the current frame.
In some embodiments of the present disclosure, the linear filtering module includes a wiener filtering module, a Kalman gain calculation module, and a linear combination module,
wherein the wiener filtering module is used for outputting a wiener filtering solution of the clean speech magnitude spectrum as a second quasi-estimated clean speech magnitude spectrum according to the estimated noise energy output by the noise estimation neural network module and the second feature set;
the Kalman gain calculation module is used for outputting an optimal Kalman gain G according to the prediction error output by the speech prediction neural network module and the estimated noise energy output by the noise estimation neural network module;
and the linear combination module is used for calculating, according to the optimal Kalman gain G, a linear combination of the first quasi-estimated clean speech magnitude spectrum output by the speech prediction neural network module and the second quasi-estimated clean speech magnitude spectrum, as the estimated clean speech magnitude spectrum.
In some embodiments of the present disclosure, calculating, according to the optimal Kalman gain G, the linear combination of the first quasi-estimated clean speech magnitude spectrum output by the speech prediction neural network module and the second quasi-estimated clean speech magnitude spectrum, as the estimated clean speech magnitude spectrum, includes:
taking (1-G) as a first weight of the first quasi-estimated clean speech magnitude spectrum;
taking the optimal Kalman gain G as a second weight of the second quasi-estimated clean speech magnitude spectrum;
and calculating a weighted sum of the first quasi-estimated clean speech magnitude spectrum and the second quasi-estimated clean speech magnitude spectrum according to the first weight and the second weight, and taking the weighted sum as the estimated clean speech magnitude spectrum.
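The weighted-sum step above can be sketched as follows; a minimal numpy sketch, where the array names (`s1`, `s2`, `G`) are illustrative and not from the patent:

```python
import numpy as np

def combine_estimates(s1: np.ndarray, s2: np.ndarray, G: np.ndarray) -> np.ndarray:
    """Linearly combine two clean-speech magnitude estimates.

    s1: first quasi-estimated magnitude spectrum (from the prediction network),
        weighted by (1 - G).
    s2: second quasi-estimated magnitude spectrum (the wiener solution),
        weighted by the Kalman gain G.
    All arrays share the same (frames, bands) shape.
    """
    return (1.0 - G) * s1 + G * s2

# With G = 0 the combination trusts the prediction network entirely;
# with G = 1 it trusts the wiener solution entirely.
s1 = np.array([[1.0, 2.0]])
s2 = np.array([[3.0, 4.0]])
combined = combine_estimates(s1, s2, 0.5 * np.ones_like(s1))
```

The gain G thus acts per time-frequency point as an interpolation weight between the learned prediction and the signal-processing estimate.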
In some embodiments of the present disclosure, calculating a model loss from the clean speech magnitude spectrum and the estimated clean speech magnitude spectrum, and training the speech enhancement model according to the model loss, comprises:
optimizing parameters of the speech prediction neural network module and the noise estimation neural network module using a back propagation algorithm.
According to another aspect of the present disclosure, there is also provided a speech enhancement method, including:
acquiring a to-be-enhanced speech magnitude spectrum and a to-be-enhanced speech phase spectrum of a speech signal to be enhanced;
obtaining a first feature set and a second feature set from the to-be-enhanced speech magnitude spectrum;
inputting the first feature set and the second feature set into a trained speech enhancement model, the speech enhancement model comprising a speech prediction neural network module, a noise estimation neural network module, and a linear filtering module, wherein the first feature set serves as the input of the speech prediction neural network module, which outputs a first quasi-estimated clean speech magnitude spectrum and a prediction error; the second feature set serves as the input of the noise estimation neural network module, which outputs an estimated noise energy; and the first quasi-estimated clean speech magnitude spectrum and the prediction error output by the speech prediction neural network module, together with the estimated noise energy output by the noise estimation neural network module, are input into the linear filtering module, which outputs an estimated clean speech magnitude spectrum of the speech signal to be enhanced;
and restoring an enhanced speech signal of the speech signal to be enhanced from the estimated clean speech magnitude spectrum and the to-be-enhanced speech phase spectrum.
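The restoration step, combining the estimated clean magnitude spectrum with the phase of the noisy input, can be sketched as follows. This is a minimal numpy sketch that assumes non-overlapping frames for brevity; the embodiments below frame with overlap, which would additionally require windowed overlap-add. All names are illustrative:

```python
import numpy as np

def reconstruct(mag: np.ndarray, phase: np.ndarray, frame_len: int) -> np.ndarray:
    """Rebuild a time-domain signal from an estimated magnitude spectrum and
    the phase spectrum of the noisy input.

    mag, phase: arrays of shape (num_frames, frame_len // 2 + 1).
    Assumes non-overlapping frames (a simplification of the patent's framing).
    """
    spectrum = mag * np.exp(1j * phase)           # recombine magnitude and phase
    frames = np.fft.irfft(spectrum, n=frame_len)  # inverse FFT per frame
    return frames.reshape(-1)                     # concatenate the frames

# Round trip: analysing a signal and reconstructing with its own magnitude
# and phase recovers the original samples.
x = np.sin(np.arange(64) * 0.3)
X = np.fft.rfft(x.reshape(4, 16), axis=-1)
y = reconstruct(np.abs(X), np.angle(X), frame_len=16)
```

In enhancement, `np.abs(X)` would be replaced by the model's estimated clean magnitude spectrum while the noisy phase `np.angle(X)` is kept unchanged.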
According to another aspect of the present disclosure, there is also provided a speech enhancement model training apparatus, the speech enhancement model including a speech prediction neural network module, a noise estimation neural network module, and a linear filtering module,
wherein the speech enhancement model training apparatus comprises:
a first acquisition module configured to acquire a noisy speech magnitude spectrum and a clean speech magnitude spectrum of each speech pair in a speech training set, a speech pair including an associated clean speech signal and noisy speech signal;
a second acquisition module configured to obtain a first feature set and a second feature set from the noisy speech magnitude spectrum;
a first input module configured to input the first feature set into the speech prediction neural network module, the speech prediction neural network module being used to output a first quasi-estimated clean speech magnitude spectrum and a prediction error;
a second input module configured to input the second feature set into the noise estimation neural network module, the noise estimation neural network module being used to output an estimated noise energy;
an output module configured to input the first quasi-estimated clean speech magnitude spectrum and the prediction error output by the speech prediction neural network module, and the estimated noise energy output by the noise estimation neural network module, into the linear filtering module, the linear filtering module being used to output an estimated clean speech magnitude spectrum;
and a training module configured to calculate a model loss from the clean speech magnitude spectrum and the estimated clean speech magnitude spectrum, and train the speech enhancement model according to the model loss.
Yet another aspect of the present disclosure provides an electronic device, comprising: a processor; a memory having executable instructions stored therein; wherein the executable instructions, when executed by the processor, implement the speech enhancement model training method and/or the speech enhancement method of any of the above embodiments.
Yet another aspect of the present disclosure provides a computer-readable storage medium storing a program, wherein the program is configured to implement the speech enhancement model training method and/or the speech enhancement method according to any of the above embodiments when executed.
Compared with the prior art, the beneficial effects of the present disclosure include at least the following:
The speech enhancement model comprises a speech prediction neural network module, a noise estimation neural network module, and a linear filtering module. Enhancing speech signals with this model combines a linear filtering method based on signal processing with a speech enhancement method based on machine learning: the machine learning method improves the enhancement performance of the linear filtering method under complex non-stationary noise, while the linear filtering method improves the generalization performance of the machine learning method, thereby optimizing speech enhancement.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure. It is apparent that the drawings described below are only some embodiments of the disclosure, and that other drawings may be derived from those drawings by one of ordinary skill in the art without inventive effort.
FIG. 1 shows a flow diagram of a method of speech enhancement model training in an embodiment of the present disclosure;
FIG. 2 is a schematic diagram illustrating the structure of a speech enhancement model in an embodiment of the present disclosure;
FIG. 3 illustrates a flow diagram of a speech enhancement method in an embodiment of the present disclosure;
FIG. 4 is a block diagram of a speech enhancement model training apparatus according to an embodiment of the present disclosure;
FIG. 5 shows a block diagram of a speech enhancement apparatus in an embodiment of the present disclosure;
FIG. 6 shows a schematic structural diagram of an electronic device in an embodiment of the disclosure; and
fig. 7 shows a schematic structural diagram of a computer-readable storage medium in an embodiment of the disclosure.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art.
Furthermore, the drawings are merely schematic illustrations of the present disclosure and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, and thus their repetitive description will be omitted. Some of the block diagrams shown in the figures are functional entities and do not necessarily correspond to physically or logically separate entities. These functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.
The step numbers in the following embodiments are merely used to indicate different execution contents, and the execution order between the steps is not strictly limited. The use of "first," "second," and similar terms in the detailed description is not intended to imply any order, quantity, or importance, but rather is used to distinguish one element from another. It should be noted that features of the embodiments of the disclosure and of the different embodiments may be combined with each other without conflict.
Fig. 1 shows the main steps of a speech enhancement model training method in an embodiment. Referring to fig. 1, the speech enhancement model provided by the present disclosure includes a speech prediction neural network module, a noise estimation neural network module, and a linear filtering module. The method for training the speech enhancement model comprises the following steps. Step S110: acquiring a noisy speech magnitude spectrum and a clean speech magnitude spectrum of each speech pair in a speech training set, wherein a speech pair comprises an associated clean speech signal and noisy speech signal. Step S120: obtaining a first feature set and a second feature set from the noisy speech magnitude spectrum. Step S130: inputting the first feature set into the speech prediction neural network module, which outputs a first quasi-estimated clean speech magnitude spectrum and a prediction error. Step S140: inputting the second feature set into the noise estimation neural network module, which outputs an estimated noise energy. Step S150: inputting the first quasi-estimated clean speech magnitude spectrum and the prediction error output by the speech prediction neural network module, and the estimated noise energy output by the noise estimation neural network module, into the linear filtering module, which outputs an estimated clean speech magnitude spectrum. Step S160: calculating a model loss from the clean speech magnitude spectrum and the estimated clean speech magnitude spectrum, and training the speech enhancement model according to the model loss.
In the speech enhancement method of this embodiment, the speech enhancement model includes the speech prediction neural network module, the noise estimation neural network module, and the linear filtering module. Performing speech enhancement with this model combines the linear filtering method based on signal processing with the speech enhancement method based on machine learning, so that the machine learning method improves the enhancement performance of the linear filtering method under complex non-stationary noise, the linear filtering method improves the generalization performance of the machine learning method, and speech enhancement is optimized.
The following describes the speech enhancement model training method in detail with reference to fig. 2 and a specific example.
Specifically, the speech training set in step S110 may include a plurality of speech pairs, each comprising an associated clean speech signal and a noisy speech signal, where the noisy speech signal is obtained by adding noise at a certain signal-to-noise ratio (SNR) to the clean speech signal. The SNR can be set as desired; for example, in some embodiments it may range from -10 dB to 30 dB, and the present disclosure is not limited thereto.
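Constructing the noisy member of a speech pair at a chosen SNR can be sketched as follows; a minimal numpy sketch of one standard mixing procedure (the patent itself does not prescribe one, so function and variable names are illustrative):

```python
import numpy as np

def mix_at_snr(clean: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Add noise to clean speech at a target signal-to-noise ratio (in dB).

    The noise is rescaled so that 10*log10(P_clean / P_noise) equals snr_db,
    then added to the clean signal to form the noisy member of a speech pair.
    """
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10.0)))
    return clean + scale * noise

rng = np.random.default_rng(0)
clean = np.sin(np.arange(16000) * 0.1)      # one second of a tone at 16 kHz
noise = rng.standard_normal(16000)          # white noise
noisy = mix_at_snr(clean, noise, snr_db=10.0)
```

Sweeping `snr_db` over the stated -10 dB to 30 dB range would produce training pairs across the difficulty spectrum.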
Specifically, obtaining the noisy speech magnitude spectrum and the clean speech magnitude spectrum of each speech pair in step S110 may be implemented by the following steps: performing the time-domain to frequency-domain transformation step on the clean speech signal of the speech pair to obtain the clean speech magnitude spectrum; and performing the time-domain to frequency-domain transformation step on the noisy speech signal of the speech pair to obtain the noisy speech magnitude spectrum.
Specifically, when the clean speech signal or the noisy speech signal is taken as the speech signal to be processed, the time-domain to frequency-domain transformation step is implemented as follows:
First, the speech signal x(t) to be processed is framed, where t is the sample index of the speech signal. In some implementations, each frame may be 8 to 32 milliseconds long, with 50% to 75% overlap between adjacent frames; the frame length and overlap can be set as desired, and the disclosure is not limited thereto. Keeping adjacent frames overlapped during framing exploits the temporal correlation of the signal and facilitates windowing for the Fourier transform in the subsequent step. Second, a Fourier transform is performed on each frame of the speech signal to obtain the frame Fourier spectrum of each frame; in some implementations, a short-time Fourier transform with 64 to 512 frequency points may be applied to each frame, and the number of frequency points can be set as desired. Third, the frame Fourier spectra of the frames are spliced along the time axis to obtain the Fourier spectrum X(t, f) of the speech signal to be processed, which is a two-dimensional complex-valued short-time Fourier spectrum, where t is the frame index and f is the frequency index. Finally, the magnitude spectrum |X(t, f)| of the speech signal is generated from the magnitude of each frequency point of X(t, f).
Based on the above steps, the time-domain to frequency-domain transformation can be applied to the clean speech signal and the noisy speech signal, yielding the clean speech magnitude spectrum |S(t, f)| and the noisy speech magnitude spectrum |Y(t, f)|, respectively.
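The framing, per-frame Fourier transform, and magnitude steps can be sketched as follows; a minimal numpy sketch in which the Hann window and the concrete values (256-sample frames with 50% overlap, i.e. 16 ms at 16 kHz, within the stated 8-32 ms and 50%-75% ranges) are illustrative choices, not prescribed by the patent:

```python
import numpy as np

def magnitude_spectrum(signal: np.ndarray, frame_len: int = 256,
                       hop: int = 128) -> np.ndarray:
    """Frame the signal with overlap, Fourier-transform each windowed frame,
    and stack the per-frame magnitudes along the time axis -> |X(t, f)|.

    Returns an array of shape (num_frames, frame_len // 2 + 1).
    """
    window = np.hanning(frame_len)                    # taper each frame
    num_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(num_frames)])
    return np.abs(np.fft.rfft(frames, axis=-1))       # magnitude per bin

x = np.sin(2 * np.pi * 1000 * np.arange(4096) / 16000)  # 1 kHz tone, 16 kHz rate
mag = magnitude_spectrum(x)
```

With a 256-point transform at 16 kHz, the frequency resolution is 62.5 Hz per bin, so the 1 kHz tone concentrates its energy in bin 16.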
Referring to fig. 2, fig. 2 is a schematic structural diagram of a speech enhancement model in an embodiment of the present disclosure. The speech enhancement model 200 includes a speech prediction neural network module 210, a noise estimation neural network module 220, and a linear filtering module 230.
The speech prediction neural network module 210 may be a time-series neural network model, for example a multi-layer long short-term memory (LSTM) recurrent neural network. In this embodiment, each layer of the multi-layer LSTM network may have 256 to 1024 nodes, with the same number of nodes in every layer. The time-series neural network model of the present disclosure is not limited to the above.
Since the speech prediction neural network module 210 is a time-series neural network model, the first feature set can be a noisy speech magnitude spectrum sequence of a plurality of consecutive frames. Specifically, the t-th frame of the magnitude spectrum sequence may be defined as y(t) = [|Y(t, 1)|, |Y(t, 2)|, …, |Y(t, F)|]^T, where F is the total number of frequency bands. The noisy speech magnitude spectrum sequence is then Y_A[k] = [y(L×k), y(L×k+1), …, y(L×k+L-1)], where k is the sequence index and L is the sequence length.
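Chunking the frame-wise magnitude spectra into the length-L input sequences Y_A[k] can be sketched as follows; a minimal numpy sketch in which dropping any trailing frames that do not fill a whole sequence is an assumption, since the patent does not specify the handling of leftovers:

```python
import numpy as np

def make_sequences(y: np.ndarray, L: int) -> np.ndarray:
    """Split a (num_frames, F) magnitude-spectrum array into consecutive
    non-overlapping sequences of L frames each: Y_A[k] = y[L*k : L*k + L].

    Trailing frames that do not fill a whole sequence are dropped (assumption).
    Returns an array of shape (num_sequences, L, F).
    """
    num_seq = y.shape[0] // L
    return y[: num_seq * L].reshape(num_seq, L, y.shape[1])

y = np.arange(10 * 4, dtype=float).reshape(10, 4)  # 10 frames, F = 4 bands
seqs = make_sequences(y, L=3)                      # -> 3 sequences of 3 frames
```

Each `seqs[k]` would then be fed to the time-series model as one input sequence.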
The speech prediction neural network module (time-series neural network model) 210 has two outputs: the first quasi-estimated clean speech magnitude spectrum and the prediction error. The time-series neural network model predicts, in a sequence-to-sequence manner, a first quasi-estimated clean speech magnitude spectrum sequence and a corresponding prediction error sequence from the input feature sequence. The first quasi-estimated clean speech magnitude spectrum output by the module is thus a sequence with the same dimensions as the noisy speech magnitude spectrum sequence Y_A[k], and the prediction error output by the module is a prediction error sequence with the same dimensions as Y_A[k]. In particular, each element of the first quasi-estimated clean speech magnitude spectrum sequence may be denoted |S_NN(t, f)|, and the values of the prediction error sequence are the prediction error variances, denoted σ_NN²(t, f).
The noise estimation neural network module 220 may be a multi-layer fully connected network. In this embodiment, each layer of the multi-layer fully connected network may have 256 to 1024 nodes, with the same number of nodes in each layer, and the present disclosure is not limited thereto. In this embodiment, the second feature set includes the noisy speech magnitude spectra of the current frame (frame t) and a neighborhood window [t−N, t−N+1, …, t, …, t+N−1, t+N], i.e. Y_B(t) = [y(t−N), y(t−N+1), …, y(t+N−1), y(t+N)], where N is the width of the neighborhood window. The noise estimation neural network module 220 outputs a noise energy vector of the current frame with dimension F×1, where the f-th element of the noise energy vector represents the estimated noise energy of the f-th frequency band, denoted σ_n²(t,f).
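By way of an illustrative sketch only, the neighborhood-window feature Y_B(t) for the noise estimation neural network module may be assembled as below; the edge-repeat padding at the boundaries of the spectrogram and the function name are assumptions not stated in the description:

```python
import numpy as np

def neighborhood_feature(mag_spec, t, N):
    """Build the second feature set for frame t: the noisy magnitude
    spectra of frames [t-N, ..., t, ..., t+N] flattened into one vector.
    Frames outside the spectrogram are replaced by the nearest edge frame
    (an assumption for illustration)."""
    F, T = mag_spec.shape
    idx = np.clip(np.arange(t - N, t + N + 1), 0, T - 1)
    return mag_spec[:, idx].T.reshape(-1)  # (2N+1)*F values

mag = np.arange(20, dtype=float).reshape(4, 5)  # F = 4 bands, T = 5 frames
feat = neighborhood_feature(mag, t=2, N=1)
# the feature covers frames 1, 2, 3 -> length (2*1+1)*4 = 12
```

The fully connected network then maps this (2N+1)×F input to the F×1 noise energy vector of the current frame.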
The linear filtering module 230 includes a wiener filtering module 231, a kalman gain calculation module 232, and a linear combination module 233.
The wiener filtering module 231 is configured to output a wiener filtering solution of the pure speech amplitude spectrum as a second quasi-estimated pure speech amplitude spectrum according to the estimated noise energy output by the noise estimation neural network module 220 and the second feature set.
Specifically, the total speech energy of a time-frequency point may be calculated from the noisy speech magnitude spectrum of the second feature set as σ_y²(t,f) = |Y(t,f)|².
The wiener filtering module 231 may obtain the wiener filtering solution of the clean speech magnitude spectrum under the minimum mean square error criterion according to the following formula:

|S_Wiener(t,f)| = ((σ_y²(t,f) − σ_n²(t,f)) / σ_y²(t,f)) × |Y(t,f)|

where |S_Wiener(t,f)| is the wiener filtering solution of the clean speech magnitude spectrum, σ_y²(t,f) is the total speech energy of the time-frequency point of the f-th frequency band of the t-th frame, σ_n²(t,f) is the noise energy of the time-frequency point of the f-th frequency band of the t-th frame, and |Y(t,f)| is the noisy speech magnitude spectrum.
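As a non-limiting sketch of the wiener filtering module, the gain above can be computed per time-frequency point as follows; the flooring of negative gains to zero and the small denominator floor are practical assumptions added for numerical robustness, not features recited in the description:

```python
import numpy as np

def wiener_magnitude(noisy_mag, noise_energy, floor=1e-8):
    """Wiener-style MMSE estimate of the clean magnitude spectrum:
    gain = max(sigma_y^2 - sigma_n^2, 0) / sigma_y^2, applied to |Y|,
    with sigma_y^2 = |Y|^2 taken from the noisy spectrum."""
    total_energy = noisy_mag ** 2                       # sigma_y^2(t, f)
    gain = np.maximum(total_energy - noise_energy, 0.0) \
        / np.maximum(total_energy, floor)
    return gain * noisy_mag

noisy = np.array([2.0, 1.0, 0.5])        # |Y(t, f)| for three bands
noise = np.array([1.0, 1.0, 1.0])        # estimated noise energy per band
est = wiener_magnitude(noisy, noise)
# band 0: (4 - 1) / 4 * 2 = 1.5; bands where noise >= energy floor to 0
```

The output of this function plays the role of the second quasi-estimated clean speech magnitude spectrum.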
The kalman gain calculation module 232 is configured to output an optimal kalman gain according to the prediction error output by the speech prediction neural network module 210 and the estimated noise energy output by the noise estimation neural network module 220. Specifically, the kalman gain calculation module 232 may determine the optimal kalman gain based on classical kalman filtering theory according to the following formula:

G = σ_p²(t,f) / (σ_p²(t,f) + σ_n²(t,f))

where G is the optimal kalman gain, σ_p²(t,f) is the variance of the prediction error of the time-frequency point of the f-th frequency band of the t-th frame, and σ_n²(t,f) is the noise energy of the time-frequency point of the f-th frequency band of the t-th frame.
The linear combination module 233 is configured to calculate, according to the optimal kalman gain, a linear combination result of the first quasi-estimated clean speech magnitude spectrum output by the speech prediction neural network module and the second quasi-estimated clean speech magnitude spectrum, as the estimated clean speech magnitude spectrum. Specifically, the linear combination module 233 may calculate the estimated clean speech magnitude spectrum according to the following formula:

|S_o(t,f)| = G × |S_Wiener(t,f)| + (1 − G) × |S_NN(t,f)|

where |S_o(t,f)| is the estimated clean speech magnitude spectrum, G is the optimal kalman gain and serves as the second weight of the second quasi-estimated clean speech magnitude spectrum |S_Wiener(t,f)|, and (1 − G) is the first weight of the first quasi-estimated clean speech magnitude spectrum |S_NN(t,f)|.
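As an illustrative sketch, the kalman gain calculation and the linear combination can be fused into one step; this assumes the classical per-bin gain form G = σ_p²/(σ_p² + σ_n²), where a large prediction error variance shifts weight toward the wiener branch, and the function name and floor are assumptions:

```python
import numpy as np

def kalman_combine(s_nn, s_wiener, pred_var, noise_energy, floor=1e-8):
    """Blend the network prediction |S_NN| and the wiener solution
    |S_Wiener| per time-frequency bin:
      G = pred_var / (pred_var + noise_energy)
      |S_o| = G * |S_Wiener| + (1 - G) * |S_NN|"""
    g = pred_var / np.maximum(pred_var + noise_energy, floor)
    return g * s_wiener + (1.0 - g) * s_nn

s_nn = np.array([1.0, 1.0])          # first quasi-estimate per bin
s_wiener = np.array([3.0, 3.0])      # second quasi-estimate per bin
pred_var = np.array([1.0, 3.0])      # prediction error variance per bin
noise = np.array([1.0, 1.0])         # estimated noise energy per bin
out = kalman_combine(s_nn, s_wiener, pred_var, noise)
# G = [0.5, 0.75] -> out = [2.0, 2.5]
```

When the network is confident (small prediction variance), the output tracks |S_NN|; otherwise it falls back toward the signal-processing branch.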
In some embodiments of the present disclosure, step S160 in fig. 1 — calculating a model loss from the clean speech magnitude spectrum and the estimated clean speech magnitude spectrum (for example, by means of a loss function) and training the speech enhancement model according to the model loss — may be implemented by optimizing the parameters of the speech prediction neural network module and the noise estimation neural network module with a back propagation algorithm. Specifically, the speech enhancement model can adaptively learn the parameters of the speech prediction neural network module and the noise estimation neural network module under the minimum mean square error criterion, based on the back propagation algorithm of neural networks.
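Since the linear filtering module consists of closed-form differentiable operations, gradients of the loss can propagate back through it to both network modules. A minimal numpy sketch of the minimum mean square error criterion is shown below; an actual implementation would compute this inside an autodiff framework, which is an assumption beyond the description:

```python
import numpy as np

def mse_loss(clean_mag, est_mag):
    """Model loss under the minimum mean square error criterion:
    mean squared difference between the clean magnitude spectrum
    and the estimated clean magnitude spectrum."""
    return float(np.mean((clean_mag - est_mag) ** 2))

loss = mse_loss(np.array([1.0, 2.0, 3.0]),
                np.array([1.0, 2.0, 1.0]))
# loss = mean([0, 0, 4]) = 4/3
```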
The above are merely illustrative of various implementations of the present invention, which is not limited thereto, and the implementations may be implemented alone or in combination.
The embodiment of the present disclosure further provides a speech enhancement method, which is used for enhancing a speech signal based on a trained speech enhancement model. Fig. 3 shows a flowchart of a speech enhancement method in an embodiment of the present disclosure, as shown in fig. 3, including:
Step S310: acquiring a voice amplitude spectrum to be enhanced and a voice phase spectrum to be enhanced of the voice signal to be enhanced.
Specifically, step S310 may perform short-time fourier transform on the speech signal to be enhanced to obtain a speech amplitude spectrum to be enhanced and a speech phase spectrum to be enhanced.
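The short-time Fourier transform of step S310 may be sketched as follows; the Hann window, frame length of 256 samples, and hop of 128 samples are illustrative assumptions, not parameters fixed by the description:

```python
import numpy as np

def stft_mag_phase(signal, frame_len=256, hop=128):
    """Short-time Fourier transform of a 1-D signal; returns the
    magnitude spectrum and phase spectrum used in steps S310/S340."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    spec = np.fft.rfft(frames, axis=1).T       # shape (F, T)
    return np.abs(spec), np.angle(spec)

# a 440 Hz tone sampled at 16 kHz, 1024 samples long
x = np.sin(2 * np.pi * 440 * np.arange(1024) / 16000)
mag, phase = stft_mag_phase(x)
# F = frame_len // 2 + 1 = 129 bands, T = 7 frames
```

The magnitude spectrum feeds the feature extraction of step S320; the phase spectrum is retained for the restoration of step S340.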
Step S320: and obtaining a first feature set and a second feature set according to the voice magnitude spectrum to be enhanced.
Specifically, the step of obtaining the first feature set and the second feature set based on the speech magnitude spectrum to be enhanced may be the same as the step of obtaining the first feature set and the second feature set based on the noisy speech magnitude spectrum, and is not repeated herein.
Step S330: inputting the first feature set and the second feature set into a trained speech enhancement model, the speech enhancement model comprising a speech prediction neural network module, a noise estimation neural network module, and a linear filtering module. The first feature set is used as the input of the speech prediction neural network module, which outputs a first quasi-estimated clean speech magnitude spectrum and a prediction error; the second feature set is used as the input of the noise estimation neural network module, which outputs an estimated noise energy. The first quasi-estimated clean speech magnitude spectrum and the prediction error output by the speech prediction neural network module, together with the estimated noise energy output by the noise estimation neural network module, are input into the linear filtering module, which outputs an estimated clean speech magnitude spectrum of the speech signal to be enhanced.
Specifically, the structure of the speech enhancement model, the speech prediction neural network module, the noise estimation neural network module, and the linear filtering module may be implemented with reference to fig. 2 and the related description of fig. 2. The speech enhancement model may be obtained via training in a training method as shown in fig. 1.
Step S340: and restoring according to the estimated pure voice magnitude spectrum and the voice phase spectrum to be enhanced to obtain an enhanced voice signal of the voice signal to be enhanced.
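The restoration of step S340 — recombining the estimated magnitude with the phase of the speech to be enhanced and inverting the transform — may be sketched by overlap-add as below; for brevity this sketch omits synthesis-window compensation, which a production implementation would include (an assumption of this illustration):

```python
import numpy as np

def reconstruct(est_mag, phase, frame_len=256, hop=128):
    """Restore a time-domain signal from the estimated clean magnitude
    spectrum and the phase spectrum of the speech to be enhanced."""
    spec = est_mag * np.exp(1j * phase)                 # shape (F, T)
    frames = np.fft.irfft(spec.T, n=frame_len, axis=1)  # back to time frames
    out = np.zeros(hop * (frames.shape[0] - 1) + frame_len)
    for i, frame in enumerate(frames):                  # overlap-add
        out[i * hop:i * hop + frame_len] += frame
    return out

mag = np.ones((129, 4))       # toy estimated magnitude, F = 129, T = 4
phase = np.zeros((129, 4))    # toy phase spectrum
y = reconstruct(mag, phase)
# output length = 128 * 3 + 256 = 640 samples
```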
Therefore, in the speech enhancement method of this embodiment, the speech enhancement model includes the speech prediction neural network module, the noise estimation neural network module, and the linear filtering module, so that enhancing the speech signal through the speech enhancement model combines the linear filtering method based on signal processing with the speech enhancement method based on machine learning: the machine learning method improves the speech enhancement performance of the signal-processing-based linear filtering method under complex non-stationary noise, and the linear filtering method improves the generalization performance of the machine-learning-based speech enhancement method, thereby achieving optimized speech enhancement.
The embodiment of the present disclosure further provides a speech enhancement model training apparatus, which can be used to implement the speech enhancement model training method described in any of the above embodiments. The speech enhancement model comprises a speech prediction neural network module, a noise estimation neural network module, and a linear filtering module. Fig. 4 shows a block diagram of a speech enhancement model training apparatus in an embodiment of the present disclosure. As shown in fig. 4, the speech enhancement model training apparatus 410 in this embodiment includes a first obtaining module 411, a second obtaining module 412, a first input module 413, a second input module 414, an output module 415, and a training module 416. The first obtaining module 411 is configured to obtain a noisy speech magnitude spectrum and a clean speech magnitude spectrum of each speech pair in the speech training set, where a speech pair includes an associated clean speech signal and a noisy speech signal. The second obtaining module 412 is configured to obtain a first feature set and a second feature set according to the noisy speech magnitude spectrum. The first input module 413 is configured to input the first feature set into the speech prediction neural network module, which outputs a first quasi-estimated clean speech magnitude spectrum and a prediction error. The second input module 414 is configured to input the second feature set into the noise estimation neural network module, which outputs an estimated noise energy. The output module 415 is configured to input the first quasi-estimated clean speech magnitude spectrum and the prediction error output by the speech prediction neural network module, together with the estimated noise energy output by the noise estimation neural network module, into the linear filtering module, which outputs the estimated clean speech magnitude spectrum.
The training module 416 is configured to calculate a model loss based on the clean speech magnitude spectrum and the estimated clean speech magnitude spectrum, and train the speech enhancement model based on the model loss. The specific principle of each module can be referred to any of the above embodiments of the speech enhancement model training method, and the description is not repeated here.
The embodiment of the present disclosure further provides a speech enhancement apparatus, which can be used to implement the speech enhancement method described in any of the above embodiments.
Fig. 5 shows the main modules of the speech enhancement apparatus in the embodiment. Referring to fig. 5, the speech enhancement apparatus 420 in the embodiment includes a third obtaining module 421, a fourth obtaining module 422, an enhancement module 423, and a restoration module 424. The third obtaining module 421 is configured to obtain a speech magnitude spectrum to be enhanced and a speech phase spectrum to be enhanced of the speech signal to be enhanced. The fourth obtaining module 422 is configured to obtain the first feature set and the second feature set according to the speech magnitude spectrum to be enhanced. The enhancement module 423 is configured to input the first feature set and the second feature set into a trained speech enhancement model, the speech enhancement model comprising a speech prediction neural network module, a noise estimation neural network module, and a linear filtering module. The first feature set is used as the input of the speech prediction neural network module, which outputs a first quasi-estimated clean speech magnitude spectrum and a prediction error; the second feature set is used as the input of the noise estimation neural network module, which outputs an estimated noise energy. The first quasi-estimated clean speech magnitude spectrum and the prediction error output by the speech prediction neural network module, together with the estimated noise energy output by the noise estimation neural network module, are input into the linear filtering module, which outputs an estimated clean speech magnitude spectrum of the speech signal to be enhanced. The restoration module 424 is configured to restore, according to the estimated clean speech magnitude spectrum and the speech phase spectrum to be enhanced, an enhanced speech signal of the speech signal to be enhanced.
The specific principle of each module can be referred to any of the above embodiments of the speech enhancement method, and the description is not repeated here.
In the speech enhancement model training apparatus and the speech enhancement apparatus of this embodiment, the speech enhancement model includes the speech prediction neural network module, the noise estimation neural network module, and the linear filtering module, so that enhancing the speech signal through the speech enhancement model combines the linear filtering method based on signal processing with the speech enhancement method based on machine learning: the machine learning method improves the speech enhancement performance of the signal-processing-based linear filtering method under complex non-stationary noise, and the linear filtering method improves the generalization performance of the machine-learning-based speech enhancement method, thereby achieving optimized speech enhancement.
Fig. 4 and fig. 5 are only schematic diagrams of the speech enhancement model training apparatus and the speech enhancement apparatus provided by the present disclosure; the splitting, combining, and adding of modules without departing from the concept of the present disclosure are within its protection scope. The speech enhancement model training apparatus and the speech enhancement apparatus provided by the present disclosure may be implemented by software, hardware, firmware, plug-ins, or any combination thereof, and the present disclosure is not limited thereto.
The embodiment of the present disclosure further provides an electronic device, which includes a processor and a memory, where the memory stores executable instructions, and when the executable instructions are executed by the processor, the method for training a speech enhancement model and/or a method for speech enhancement described in any of the above embodiments are implemented.
In the electronic device of the present disclosure, the speech enhancement model includes the speech prediction neural network module, the noise estimation neural network module, and the linear filtering module, so that enhancing the speech signal through the speech enhancement model combines the linear filtering method based on signal processing with the speech enhancement method based on machine learning: the machine learning method improves the speech enhancement performance of the signal-processing-based linear filtering method under complex non-stationary noise, and the linear filtering method improves the generalization performance of the machine-learning-based speech enhancement method, thereby achieving optimized speech enhancement.
Fig. 6 is a schematic structural diagram of an electronic device in an embodiment of the present disclosure, and it should be understood that fig. 6 only schematically illustrates various modules, which may be virtual software modules or actual hardware modules, and the combination, the splitting, and the addition of the remaining modules of these modules are within the scope of the present disclosure.
As shown in fig. 6, the electronic device 500 is embodied in the form of a general purpose computing device. The components of the electronic device 500 include, but are not limited to: at least one processing unit 510, at least one memory unit 520, a bus 530 connecting different platform components (including memory unit 520 and processing unit 510), a display unit 540, etc.
Wherein the storage unit stores a program code, which can be executed by the processing unit 510, to cause the processing unit 510 to perform the steps of the speech enhancement model training method and/or the speech enhancement method described in any of the embodiments above. For example, processing unit 510 may perform the steps shown in fig. 1 and 3.
The memory unit 520 may include a readable medium in the form of a volatile memory unit, such as a random access memory unit (RAM)5201 and/or a cache memory unit 5202, and may further include a read only memory unit (ROM) 5203.
Storage unit 520 may also include a program/utility 5204 having one or more program modules 5205, such program modules 5205 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.
Bus 530 may be one or more of any of several types of bus structures including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus architectures.
The electronic device 500 may also communicate with one or more external devices 600, and the external devices 600 may be one or more of a keyboard, a pointing device, a bluetooth device, etc. These external devices 600 enable a user to interactively communicate with the electronic device 500. The electronic device 500 can also communicate with one or more other computing devices, including routers, modems. Such communication may occur via input/output (I/O) interfaces 550. Also, the electronic device 500 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network, such as the internet) via the network adapter 560. The network adapter 560 may communicate with other modules of the electronic device 500 via the bus 530. It should be appreciated that although not shown in the figures, other hardware and/or software modules may be used in conjunction with the electronic device 500, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage platforms, to name a few.
The embodiments of the present disclosure also provide a computer-readable storage medium for storing a program, and the program is executed to implement the speech enhancement model training method and/or the speech enhancement method described in any of the above embodiments. In some possible embodiments, various aspects of the disclosure may also be implemented in the form of a program product comprising program code for causing a terminal device to perform the speech enhancement model training method and/or the speech enhancement method described in any of the above embodiments, when the program product is run on the terminal device.
The computer-readable storage medium of the present disclosure enables a speech enhancement model to include a speech prediction neural network module, a noise estimation neural network module, and a linear filtering module, so as to enhance a speech signal through the speech enhancement model, thereby combining a linear filtering method based on signal processing and a speech enhancement method based on machine learning, improving speech enhancement performance of the linear filtering method based on signal processing under complex non-stationary noise by using the speech enhancement method based on machine learning, improving generalization performance of the speech enhancement method based on machine learning by using the linear filtering method based on signal processing, and implementing optimization of speech enhancement.
Fig. 7 is a schematic structural diagram of a computer-readable storage medium of the present disclosure. Referring to fig. 7, a program product 700 for implementing the above method according to an embodiment of the present disclosure is described, which may employ a portable compact disc read-only memory (CD-ROM), include program code, and run on a terminal device, such as a personal computer. However, the program product of the present disclosure is not limited thereto; in this document, a readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of readable storage media include, but are not limited to: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
A readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electromagnetic, optical, or any suitable combination thereof. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations of the present disclosure may be written in any combination of one or more programming languages, including an object-oriented programming language such as Java or C++ and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computing device, such as through the internet using an internet service provider.
The foregoing is a more detailed description of the present disclosure in connection with specific preferred embodiments, and it is not intended that the specific embodiments of the present disclosure be limited to these descriptions. For those skilled in the art to which the disclosure pertains, several simple deductions or substitutions may be made without departing from the concept of the disclosure, which should be considered as falling within the protection scope of the disclosure.

Claims (11)

1. A method for training a speech enhancement model, wherein the speech enhancement model comprises a speech prediction neural network module, a noise estimation neural network module and a linear filtering module,
the training method of the speech enhancement model comprises the following steps:
acquiring a noisy speech amplitude spectrum and a clean speech amplitude spectrum of each speech pair in a speech training set, wherein the speech pair comprises a related clean speech signal and a noisy speech signal;
obtaining a first characteristic set and a second characteristic set according to the noisy speech amplitude spectrum;
inputting the first feature set into the speech prediction neural network module, wherein the speech prediction neural network module is used for outputting a first quasi-estimated pure speech magnitude spectrum and a prediction error;
inputting the second set of features into the noise estimation neural network module, the noise estimation neural network module for outputting an estimated noise energy;
inputting a first quasi-estimated pure speech amplitude spectrum and a prediction error output by the speech prediction neural network module and estimated noise energy output by the noise estimation neural network module into the linear filtering module, wherein the linear filtering module is used for outputting an estimated pure speech amplitude spectrum;
and calculating model loss according to the pure speech amplitude spectrum and the estimated pure speech amplitude spectrum, and training the speech enhancement model according to the model loss.
2. The method of speech enhancement model training according to claim 1, wherein said obtaining noisy speech magnitude spectrum and clean speech magnitude spectrum for each speech pair in the speech training set comprises:
performing a time domain to frequency domain transformation step on the clean speech signals of the speech pair;
a time-domain to frequency-domain transformation step is performed on the noisy speech signal of the speech pair,
the time domain to frequency domain transforming step comprises:
framing a voice signal to be processed;
carrying out Fourier transform on each frame of the voice signal to be processed to obtain a frame Fourier spectrum of each frame;
splicing the frame Fourier spectrums of each frame of the voice signal to be processed according to a time axis to obtain the Fourier spectrums of the voice signal to be processed;
and generating the amplitude spectrum of the voice signal to be processed based on the amplitude of each frequency point of the Fourier spectrum of the voice signal to be processed.
3. The method for training speech enhancement models according to claim 1, wherein the speech prediction neural network module is a time-series neural network model, the first feature set is a noise magnitude spectrum sequence of a plurality of consecutive frames, the first quasi-estimated clean speech magnitude spectrum output by the speech prediction neural network module is a first quasi-estimated clean speech magnitude spectrum sequence having the same dimension as the noise magnitude spectrum sequence, and the prediction error output by the speech prediction neural network module is a prediction error sequence having the same dimension as the noise magnitude spectrum sequence.
4. The method of training a speech enhancement model according to claim 1, wherein the noise estimation neural network module is a multi-layer fully-connected network model, and the second feature set comprises noise-containing speech magnitude spectra of a current frame and a neighborhood window of the current frame.
5. The speech enhancement model training method of claim 1, wherein the linear filtering module comprises a wiener filtering module, a Kalman gain calculation module, and a linear combination module,
the wiener filtering module is used for outputting a wiener filtering solution of the pure speech amplitude spectrum as a second quasi-estimation pure speech amplitude spectrum according to the estimated noise energy output by the noise estimation neural network module and the second characteristic set;
the Kalman gain calculation module is used for outputting an optimal Kalman gain G according to the prediction error output by the voice prediction neural network module and the estimated noise energy output by the noise estimation neural network module;
and the linear combination module is used for calculating a linear combination result of the first quasi-estimated pure speech amplitude spectrum and the second quasi-estimated pure speech amplitude spectrum output by the speech prediction neural network module according to the optimal Kalman gain G and taking the linear combination result as the estimated pure speech amplitude spectrum.
6. The method of training speech enhancement models according to claim 5, wherein calculating a linear combination result of the first quasi-estimated clean speech magnitude spectrum and the second quasi-estimated clean speech magnitude spectrum output by the speech prediction neural network module according to the optimal Kalman gain G, as the estimated clean speech magnitude spectrum, comprises:
taking (1-G) as a first weight of the first quasi-estimated clean speech magnitude spectrum;
taking the optimal Kalman gain G as a second weight of the second quasi-estimation pure speech magnitude spectrum;
and calculating a weighted sum of the first quasi-estimated pure speech amplitude spectrum and the second quasi-estimated pure speech amplitude spectrum according to the first weight and the second weight, and taking the weighted sum as the estimated pure speech amplitude spectrum.
7. The method of training a speech enhancement model according to any one of claims 1 to 6, wherein said calculating a model loss from the clean speech magnitude spectrum and the estimated clean speech magnitude spectrum, and training the speech enhancement model according to the model loss comprises:
and optimizing parameters of the speech prediction neural network module and the noise estimation neural network module by adopting a back propagation algorithm.
8. A method of speech enhancement, comprising:
acquiring a voice amplitude spectrum to be enhanced and a voice phase spectrum to be enhanced of a voice signal to be enhanced;
obtaining a first feature set and a second feature set according to the voice magnitude spectrum to be enhanced;
inputting the first feature set and the second feature set into a trained speech enhancement model, the speech enhancement model comprising a speech prediction neural network module, a noise estimation neural network module, and a linear filtering module, wherein the first feature set is used as an input of the speech prediction neural network module, the speech prediction neural network module is used for outputting a first quasi-estimated pure speech magnitude spectrum and a prediction error, the second set of features is provided as an input to the noise estimation neural network module, which is configured to output an estimated noise energy, the first quasi-estimated pure speech amplitude spectrum and the prediction error output by the speech prediction neural network module, the estimated noise energy output by the noise estimation neural network module and input into the linear filtering module, the linear filtering module is used for outputting an estimated pure voice magnitude spectrum of the voice signal to be enhanced;
and restoring according to the estimated pure voice magnitude spectrum and the voice phase spectrum to be enhanced to obtain an enhanced voice signal of the voice signal to be enhanced.
9. A speech enhancement model training device, wherein the speech enhancement model comprises a speech prediction neural network module, a noise estimation neural network module, and a linear filtering module, and
the speech enhancement model training device comprises:
a first acquisition module configured to acquire a noisy speech magnitude spectrum and a clean speech magnitude spectrum of each speech pair in a speech training set, the speech pair comprising an associated clean speech signal and a noisy speech signal;
a second acquisition module configured to obtain a first feature set and a second feature set according to the noisy speech magnitude spectrum;
a first input module configured to input the first feature set into the speech prediction neural network module, the speech prediction neural network module being configured to output a first quasi-estimated clean speech magnitude spectrum and a prediction error;
a second input module configured to input the second feature set into the noise estimation neural network module, the noise estimation neural network module being configured to output an estimated noise energy;
an output module configured to input the first quasi-estimated clean speech magnitude spectrum and the prediction error output by the speech prediction neural network module, and the estimated noise energy output by the noise estimation neural network module, into the linear filtering module, the linear filtering module being configured to output an estimated clean speech magnitude spectrum; and
a training module configured to calculate a model loss from the clean speech magnitude spectrum and the estimated clean speech magnitude spectrum, and train the speech enhancement model from the model loss.
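The loss computation and joint parameter update performed by the training module (and by the back propagation step of claim 7) can be sketched numerically. The single-linear-layer "modules", the simplified 1/(1+noise) filtering rule, the synthetic spectra, and the finite-difference gradient step standing in for back propagation are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy stand-ins for the two trainable modules: one linear layer each, acting
# on magnitude-spectrum frames (hypothetical shapes; the patent does not fix
# the architectures).
F = 8                                      # frequency bins per frame
W_speech = 0.1 * rng.standard_normal((F, F))
W_noise = 0.1 * rng.standard_normal((F, F))

noisy_mag = np.abs(rng.standard_normal((32, F)))
clean_mag = 0.8 * noisy_mag                # synthetic "clean" target spectra

def loss_fn(Ws, Wn):
    quasi_clean = noisy_mag @ Ws           # speech prediction module
    noise_energy = np.abs(noisy_mag @ Wn)  # noise estimation module
    est_clean = quasi_clean / (1.0 + noise_energy)  # simplified linear filtering
    # Model loss between clean and estimated clean magnitude spectra.
    return np.mean((est_clean - clean_mag) ** 2)

# One finite-difference gradient step on both modules jointly, standing in
# for one back-propagation update.
eps, lr = 1e-5, 0.01
base = loss_fn(W_speech, W_noise)
grad_s = np.zeros_like(W_speech)
grad_n = np.zeros_like(W_noise)
for i in range(F):
    for j in range(F):
        Wp = W_speech.copy(); Wp[i, j] += eps
        grad_s[i, j] = (loss_fn(Wp, W_noise) - base) / eps
        Wp = W_noise.copy(); Wp[i, j] += eps
        grad_n[i, j] = (loss_fn(W_speech, Wp) - base) / eps
W_speech -= lr * grad_s
W_noise -= lr * grad_n
after = loss_fn(W_speech, W_noise)
print(base, after)
```

Because the update descends the loss on the magnitude spectra, a single step lowers the loss, mirroring the claim's scheme of optimizing both neural network modules against one model loss.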
10. An electronic device, comprising:
a processor;
a memory having executable instructions stored therein;
wherein the executable instructions, when executed by the processor, implement:
the speech enhancement model training method according to any one of claims 1 to 7; and/or
The speech enhancement method of claim 8.
11. A computer-readable storage medium storing a program which, when executed, implements:
the speech enhancement model training method according to any one of claims 1 to 7; and/or
The speech enhancement method of claim 8.
CN202110129897.8A 2021-01-29 2021-01-29 Speech enhancement method, model training method and related equipment Pending CN113808602A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202110129897.8A CN113808602A (en) 2021-01-29 2021-01-29 Speech enhancement method, model training method and related equipment
PCT/CN2022/073197 WO2022161277A1 (en) 2021-01-29 2022-01-21 Speech enhancement method, model training method, and related device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110129897.8A CN113808602A (en) 2021-01-29 2021-01-29 Speech enhancement method, model training method and related equipment

Publications (1)

Publication Number Publication Date
CN113808602A true CN113808602A (en) 2021-12-17

Family

ID=78892819

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110129897.8A Pending CN113808602A (en) 2021-01-29 2021-01-29 Speech enhancement method, model training method and related equipment

Country Status (2)

Country Link
CN (1) CN113808602A (en)
WO (1) WO2022161277A1 (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114267368A (en) * 2021-12-22 2022-04-01 北京百度网讯科技有限公司 Training method of audio noise reduction model, and audio noise reduction method and device
WO2022161277A1 (en) * 2021-01-29 2022-08-04 北京沃东天骏信息技术有限公司 Speech enhancement method, model training method, and related device
CN116843514A (en) * 2023-08-29 2023-10-03 北京城建置业有限公司 Property comprehensive management system and method based on data identification
WO2024018390A1 (en) * 2022-07-19 2024-01-25 Samsung Electronics Co., Ltd. Method and apparatus for speech enhancement

Families Citing this family (2)

Publication number Priority date Publication date Assignee Title
CN116052706B (en) * 2023-03-30 2023-06-27 苏州清听声学科技有限公司 Low-complexity voice enhancement method based on neural network
CN117789744B (en) * 2024-02-26 2024-05-24 青岛海尔科技有限公司 Voice noise reduction method and device based on model fusion and storage medium

Citations (4)

Publication number Priority date Publication date Assignee Title
CN103208291A (en) * 2013-03-08 2013-07-17 华南理工大学 Speech enhancement method and device applicable to strong noise environments
US8639502B1 (en) * 2009-02-16 2014-01-28 Arrowhead Center, Inc. Speaker model-based speech enhancement system
CN105489226A (en) * 2015-11-23 2016-04-13 湖北工业大学 Wiener filtering speech enhancement method for multi-taper spectrum estimation of pickup
CN109767782A (en) * 2018-12-28 2019-05-17 中国科学院声学研究所 A kind of sound enhancement method improving DNN model generalization performance

Family Cites Families (3)

Publication number Priority date Publication date Assignee Title
CN109767781A (en) * 2019-03-06 2019-05-17 哈尔滨工业大学(深圳) Speech separating method, system and storage medium based on super-Gaussian priori speech model and deep learning
CN110211598A (en) * 2019-05-17 2019-09-06 北京华控创为南京信息技术有限公司 Intelligent sound noise reduction communication means and device
CN113808602A (en) * 2021-01-29 2021-12-17 北京沃东天骏信息技术有限公司 Speech enhancement method, model training method and related equipment

Patent Citations (4)

Publication number Priority date Publication date Assignee Title
US8639502B1 (en) * 2009-02-16 2014-01-28 Arrowhead Center, Inc. Speaker model-based speech enhancement system
CN103208291A (en) * 2013-03-08 2013-07-17 华南理工大学 Speech enhancement method and device applicable to strong noise environments
CN105489226A (en) * 2015-11-23 2016-04-13 湖北工业大学 Wiener filtering speech enhancement method for multi-taper spectrum estimation of pickup
CN109767782A (en) * 2018-12-28 2019-05-17 中国科学院声学研究所 A kind of sound enhancement method improving DNN model generalization performance

Non-Patent Citations (1)

Title
Wei Xue et al.: "Neural Kalman Filtering for Speech Enhancement", arXiv, pages 1-5 *

Cited By (5)

Publication number Priority date Publication date Assignee Title
WO2022161277A1 (en) * 2021-01-29 2022-08-04 北京沃东天骏信息技术有限公司 Speech enhancement method, model training method, and related device
CN114267368A (en) * 2021-12-22 2022-04-01 北京百度网讯科技有限公司 Training method of audio noise reduction model, and audio noise reduction method and device
WO2024018390A1 (en) * 2022-07-19 2024-01-25 Samsung Electronics Co., Ltd. Method and apparatus for speech enhancement
CN116843514A (en) * 2023-08-29 2023-10-03 北京城建置业有限公司 Property comprehensive management system and method based on data identification
CN116843514B (en) * 2023-08-29 2023-11-21 北京城建置业有限公司 Property comprehensive management system and method based on data identification

Also Published As

Publication number Publication date
WO2022161277A1 (en) 2022-08-04

Similar Documents

Publication Publication Date Title
CN113808602A (en) Speech enhancement method, model training method and related equipment
CN111971743B (en) Systems, methods, and computer readable media for improved real-time audio processing
CN111081268A (en) Phase-correlated shared deep convolutional neural network speech enhancement method
KR100549133B1 (en) Noise reduction method and device
Lin et al. Speech enhancement using multi-stage self-attentive temporal convolutional networks
CN104685562B (en) Method and apparatus for reconstructing echo signal from noisy input signal
CN113436643B (en) Training and application method, device and equipment of voice enhancement model and storage medium
CN111508518B (en) Single-channel speech enhancement method based on joint dictionary learning and sparse representation
CN113345460B (en) Audio signal processing method, device, equipment and storage medium
CN112767959B (en) Voice enhancement method, device, equipment and medium
CN111968658A (en) Voice signal enhancement method and device, electronic equipment and storage medium
CN116013344A (en) Speech enhancement method under multiple noise environments
CN115223583A (en) Voice enhancement method, device, equipment and medium
Islam et al. Supervised single channel speech enhancement based on stationary wavelet transforms and non-negative matrix factorization with concatenated framing process and subband smooth ratio mask
Fattah et al. Identification of autoregressive moving average systems based on noise compensation in the correlation domain
CN111916060B (en) Deep learning voice endpoint detection method and system based on spectral subtraction
CN111681649B (en) Speech recognition method, interaction system and achievement management system comprising system
Astudillo et al. Uncertainty propagation
CN116682444A (en) Single-channel voice enhancement method based on waveform spectrum fusion network
CN115662461A (en) Noise reduction model training method, device and equipment
Zhang et al. Blind source separation of postnonlinear convolutive mixture
Ullah et al. Semi-supervised transient noise suppression using OMLSA and SNMF algorithms
Wang et al. Speech Enhancement Control Design Algorithm for Dual‐Microphone Systems Using β‐NMF in a Complex Environment
Chen Noise reduction of bird calls based on a combination of spectral subtraction, Wiener filtering, and Kalman filtering
Li et al. Expectation‐maximisation for speech source separation using convolutive transfer function

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination