CN114121030A - Method and apparatus for generating speech enhancement model and speech enhancement method and apparatus


Info

Publication number
CN114121030A
Authority
CN
China
Prior art keywords
model
audio
output result
speech enhancement
processing
Legal status
Pending
Application number
CN202111623706.XA
Other languages
Chinese (zh)
Inventor
陆丛希
李林锴
周昊帅
袁宇帆
孙鸿程
Current Assignee
Shanghai Youwei Intelligent Technology Co ltd
Original Assignee
Shanghai Youwei Intelligent Technology Co ltd
Application filed by Shanghai Youwei Intelligent Technology Co ltd filed Critical Shanghai Youwei Intelligent Technology Co ltd
Priority to CN202111623706.XA priority Critical patent/CN114121030A/en
Publication of CN114121030A publication Critical patent/CN114121030A/en
Priority to PCT/CN2022/138735 priority patent/WO2023124984A1/en


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Quality & Reliability (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The application discloses a method for generating a speech enhancement model, comprising: acquiring audio training data; acquiring a first model and a second model, wherein the first model is a deep neural network model, and the resource occupation of the second model during operation is less than that of the first model; training the first model and the second model based on the audio training data, comprising: acquiring a first audio with a frame length of M from an input audio; inputting the first audio into the first model for processing to obtain a first output result; obtaining a second audio with a frame length of N from the input audio, the second audio following the first audio, and N < M; inputting the second audio and the first output result into the second model for processing to obtain a second output result; and updating parameters of the first model and the second model based on the second output result and the output audio to obtain a trained first model and a trained second model; and generating a speech enhancement model based on the trained first model and the trained second model.

Description

Method and apparatus for generating speech enhancement model and speech enhancement method and apparatus
Technical Field
The present application relates to audio processing technology, and more particularly, to a method and apparatus for generating a speech enhancement model, and a speech enhancement method and apparatus.
Background
Speech enhancement refers to a technique of suppressing or reducing noise interference and extracting a useful speech signal from an audio signal when the speech signal is interfered with, or even submerged in, various kinds of noise. Speech enhancement is widely used in fields such as mobile phones, video or teleconferencing systems, speech recognition and hearing aids. In recent years, with the wide use of neural network technology, the application of deep neural network technology to speech enhancement has brought significant improvements to speech enhancement technology. However, conventional deep neural network models typically require audio as input that is longer than 8 milliseconds (e.g., 10 milliseconds or 16 milliseconds), plus the delay of the algorithm itself, such that the total delay can exceed 20 milliseconds. Too long a delay makes the deep neural network model inapplicable to devices (e.g., hearing aids) that require high real-time performance. In addition, the high computational load of the deep neural network model also limits its application in low power devices.
Therefore, there is a need to provide a speech enhancement model to solve the above problems in the prior art.
Disclosure of Invention
An object of the present application is to provide a method and an apparatus for generating a speech enhancement model, and a method and an apparatus for speech enhancement using the speech enhancement model, which have the characteristics of low power consumption and low latency.
In one aspect of the present application, a method for generating a speech enhancement model is provided. The method for generating a speech enhancement model comprises: obtaining audio training data comprising noisy input audio and noiseless output audio corresponding to the noisy input audio; obtaining a first model and a second model, wherein the first model is a deep neural network model, and the resource occupation of the second model in operation is less than that of the first model; training the first model and the second model based on the audio training data, comprising: acquiring a first audio with a frame length of M from the input audio; inputting the first audio into the first model for processing to obtain a first output result; obtaining a second audio having a frame length of N from the input audio, the second audio following the first audio and N < M; inputting the second audio and the first output result into the second model for processing to obtain a second output result; and updating parameters of the first model and the second model based on the second output result and the output audio to obtain a trained first model and a trained second model; and generating a speech enhancement model based on the trained first model and the trained second model.
In another aspect of the present application, there is also provided an apparatus for generating a speech enhancement model. The apparatus for generating a speech enhancement model comprises: a processor; and storage means for storing a computer program operable on the processor; wherein the computer program, when executed by the processor, causes the processor to perform the above-described method for generating a speech enhancement model.
In another aspect of the present application, a non-transitory computer-readable storage medium is also provided. The non-transitory computer-readable storage medium has stored thereon a computer program which, when executed by a processor, implements the method for generating a speech enhancement model described above.
In another aspect of the present application, a method of speech enhancement is also provided. The speech enhancement method comprises the following steps: acquiring audio data; obtaining a speech enhancement model, wherein the speech enhancement model comprises a first model and a second model, the first model is a deep neural network model, and the resource occupation of the second model is less than that of the first model when the second model runs; processing the audio data using the speech enhancement model to attenuate or remove noise signals in the audio data, comprising: acquiring a first audio with a frame length of M from the audio data; inputting the first audio into the first model for processing to obtain a first output result; obtaining a second audio having a frame length of N from the audio data, the second audio following the first audio and N < M; inputting the second audio and the first output result into the second model for processing to obtain a second output result; and outputting the second output result as enhanced audio data.
In another aspect of the present application, a speech enhancement apparatus is also provided. The speech enhancement device comprises: a processor; and storage means for storing a computer program operable on the processor; wherein the computer program, when executed by the processor, causes the processor to perform the speech enhancement method described above.
In another aspect of the present application, a non-transitory computer-readable storage medium is also provided. The non-transitory computer-readable storage medium has stored thereon a computer program which, when executed by a processor, implements the speech enhancement method described above.
The foregoing is a summary of the application that may be simplified, generalized, and details omitted, and thus it should be understood by those skilled in the art that this section is illustrative only and is not intended to limit the scope of the application in any way. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter.
Drawings
The above-described and other features of the present disclosure will become more fully apparent from the following description and appended claims, taken in conjunction with the accompanying drawings. It is appreciated that these drawings depict only several embodiments of the disclosure and are therefore not to be considered limiting of its scope. The present disclosure will be described more clearly and in detail by using the accompanying drawings.
FIG. 1 shows a block diagram of a convolutional recurrent neural network;
FIG. 2 shows a block diagram of an RNNoise model;
FIG. 3 shows a flow diagram of a method for generating a speech enhancement model according to an embodiment of the present application;
FIG. 4 shows a flow diagram of a method of training a first model and a second model based on audio training data according to an embodiment of the present application;
FIG. 5 is a schematic diagram illustrating a first model and a second model processing respective frames of audio data when performing the method of FIG. 4 according to an embodiment of the present application;
FIG. 6 shows a flow diagram of a method of training a first model and a second model based on audio training data according to another embodiment of the present application;
FIG. 7 is a schematic diagram illustrating a first model and a second model processing respective frames of audio data when performing the method of FIG. 6 according to an embodiment of the present application;
FIG. 8 shows a flow diagram of a method of speech enhancement according to an embodiment of the present application;
FIG. 9 is a schematic diagram illustrating a first model and a second model processing respective frames of audio data when performing the method of FIG. 8 according to an embodiment of the present application;
FIG. 10 shows a schematic diagram of the first model and the second model processing each audio data frame when executing the method of FIG. 8 according to another embodiment of the present application.
Detailed Description
In the following detailed description, reference is made to the accompanying drawings, which form a part hereof. In the drawings, like reference numerals generally refer to like parts throughout the various views unless the context dictates otherwise. The illustrative embodiments described in the detailed description, drawings, and claims are not intended to be limiting. Other embodiments may be utilized, and other changes may be made, without departing from the spirit or scope of the subject matter of the present application. It will be understood that aspects of the present disclosure, as generally described in the present disclosure and illustrated in the figures herein, may be arranged, substituted, combined, and designed in a wide variety of different configurations, all of which form part of the present disclosure.
Referring to FIG. 1, a block diagram of a Convolutional Recurrent Neural Network (CRNN) 10 is shown.
A convolutional recurrent neural network is a neural network obtained by combining a convolutional neural network (CNN) and a recurrent neural network (RNN). As shown in FIG. 1, the convolutional recurrent neural network 10 generally includes a CNN module 110, an RNN module 120, and a Connectionist Temporal Classification (CTC) module 130. Specifically, the CNN module 110 is configured to perform feature extraction on the audio data, the RNN module 120 (for example, using Long Short-Term Memory (LSTM) neural units) is configured to predict the sequence of audio data features and output a predicted label distribution, and the CTC module 130 is configured to convert the label distribution obtained by the RNN module 120 into a final label sequence. The CTC algorithm is a loss computation method. In practical applications, each layer or module of the convolutional recurrent neural network has various parameters for processing data, such as weighting coefficients and/or bias coefficients. After training with specific training data, the parameters in the convolutional recurrent neural network can be determined, enabling speech enhancement for similar speech data/signals.
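As an illustration only, the following Python sketch shows one possible CNN-plus-RNN mask estimator of the kind described above, assuming a PyTorch implementation. The layer sizes, feature dimensions, and the sigmoid mask head are hypothetical choices for illustration and are not the architecture disclosed in this application; the CTC/loss stage is omitted.

```python
import torch
import torch.nn as nn

class CRNNMaskEstimator(nn.Module):
    def __init__(self, n_freq_bins: int = 257, hidden: int = 256):
        super().__init__()
        # CNN module: local feature extraction over the time-frequency spectrum
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 16, kernel_size=3, padding=1), nn.ReLU(),
        )
        # RNN module: models the temporal sequence of extracted features
        self.rnn = nn.LSTM(input_size=16 * n_freq_bins, hidden_size=hidden,
                           batch_first=True)
        # Output head: one mask value per time-frequency unit, in [0, 1]
        self.head = nn.Sequential(nn.Linear(hidden, n_freq_bins), nn.Sigmoid())

    def forward(self, spec: torch.Tensor) -> torch.Tensor:
        # spec: (batch, time, freq) magnitude spectrogram
        b, t, f = spec.shape
        x = self.cnn(spec.unsqueeze(1))              # (b, 16, t, f)
        x = x.permute(0, 2, 1, 3).reshape(b, t, -1)  # flatten channels and freq
        x, _ = self.rnn(x)
        return self.head(x)                          # time-frequency mask
```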
For a speech signal, information in both the time domain and the frequency domain can be presented simultaneously through a time-frequency spectrum. In order to remove or reduce the noise signal in the speech signal, the time-frequency spectrum of the speech signal may be processed with a time-frequency mask to obtain an enhanced time-frequency spectrum. For example, each element in the time-frequency mask may be regarded as a ratio of the clean speech signal to the noise signal in a different time-frequency unit, so that the time-frequency spectrum of the enhanced speech signal may be obtained by multiplying the time-frequency mask with the time-frequency spectrum of the speech signal. The output of the convolutional recurrent neural network 10 in FIG. 1 is typically a time-frequency mask. In some examples, the output of the convolutional recurrent neural network 10 is an ideal binary mask (IBM), whose elements may be set to 0 or 1 depending on the signal-to-noise ratio at each time-frequency unit in the speech signal: where noise is dominant, the value of the element corresponding to the time-frequency unit is set to 0; where speech is dominant, the value of the element corresponding to the time-frequency unit is set to 1. In other examples, the output of the convolutional recurrent neural network 10 is an ideal ratio mask (IRM). Unlike the "0 or 1" strategy of the ideal binary mask, the ideal ratio mask computes the energy ratio between speech and noise to obtain a number between 0 and 1, which is output as the value of the element corresponding to the time-frequency unit. The ideal ratio mask is an evolution of the ideal binary mask and can more accurately reflect the degree of noise suppression in each time-frequency unit, thereby further improving the quality and intelligibility of the separated speech. In other embodiments, the output of the convolutional recurrent neural network 10 may also directly be the enhanced speech signal obtained by applying the ideal binary mask or the ideal ratio mask.
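A minimal sketch of the two mask types described above is given below, assuming the clean speech and noise magnitude spectrograms are available separately (which is only the case for synthetic training data). The 0 dB threshold and the small constants added for numerical stability are illustrative assumptions.

```python
import numpy as np

def ideal_binary_mask(speech_mag: np.ndarray, noise_mag: np.ndarray,
                      threshold_db: float = 0.0) -> np.ndarray:
    # 1 where speech dominates the time-frequency unit, 0 where noise dominates
    snr_db = 20.0 * np.log10((speech_mag + 1e-8) / (noise_mag + 1e-8))
    return (snr_db > threshold_db).astype(np.float32)

def ideal_ratio_mask(speech_mag: np.ndarray, noise_mag: np.ndarray) -> np.ndarray:
    # Energy ratio between speech and speech-plus-noise, a value between 0 and 1
    speech_e, noise_e = speech_mag ** 2, noise_mag ** 2
    return np.sqrt(speech_e / (speech_e + noise_e + 1e-8))

# Applying a mask: element-wise product with the noisy spectrogram,
# e.g. enhanced_mag = mask * noisy_mag
```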
Referring to FIG. 2, a block diagram of an RNNoise model 20 is shown.
The RNNoise model 20 is another neural network model that can process data with relatively few processing parameters. Specifically, the RNNoise model 20 includes a Voice Activity Detection module 210, a Noise Spectral Estimation module 220, and a Spectral Subtraction module 230. The voice activity detection module 210 includes two fully connected layers and a Gated Recurrent Unit (GRU) located between the two fully connected layers; it receives input features (e.g., spectral features) of the audio signal, can detect when the input audio signal contains voice and when it contains only noise, and outputs a voice activity detection result. The noise spectrum estimation module 220 includes a gated recurrent unit that can receive the input features and combine information obtained from the voice activity detection module 210 to estimate features of the noise spectrum. The spectral subtraction module 230 includes a gated recurrent unit and a fully connected layer, which may subtract the noise portion from the input audio signal based on information obtained from the voice activity detection module 210 and the noise spectrum estimation module 220 to obtain an enhanced speech signal. In the embodiment shown in FIG. 2, the spectral subtraction module 230 outputs the enhanced speech signal by applying different gains to different frequency bands of the input audio signal. The gated recurrent units and/or the fully connected layers of the RNNoise model also have various parameters for processing data. Similarly, these parameters in the RNNoise model may be determined by training with specific training data to meet the requirements of speech enhancement processing. For further description of the RNNoise model, reference may be made to the article "A Hybrid DSP/Deep Learning Approach to Real-Time Full-Band Speech Enhancement" by Jean-Marc Valin (arXiv:1709.08243 [cs.SD], 2017), the entire content of which is incorporated by reference into the present application.
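The following is a loose PyTorch sketch of the three-module structure described above (voice activity detection, noise spectrum estimation, and spectral subtraction via per-band gains). The layer widths, feature dimension, and band count are illustrative assumptions and are not the parameters of RNNoise itself.

```python
import torch
import torch.nn as nn

class RNNoiseLike(nn.Module):
    def __init__(self, n_features: int = 42, n_bands: int = 22):
        super().__init__()
        # Voice activity detection: fully connected layer, GRU, fully connected layer
        self.vad_in = nn.Sequential(nn.Linear(n_features, 24), nn.Tanh())
        self.vad_gru = nn.GRU(24, 24, batch_first=True)
        self.vad_out = nn.Sequential(nn.Linear(24, 1), nn.Sigmoid())
        # Noise spectrum estimation: GRU fed with the features and the VAD state
        self.noise_gru = nn.GRU(n_features + 24, 48, batch_first=True)
        # Spectral subtraction: GRU plus fully connected layer producing per-band gains
        self.denoise_gru = nn.GRU(n_features + 24 + 48, 96, batch_first=True)
        self.gain_out = nn.Sequential(nn.Linear(96, n_bands), nn.Sigmoid())

    def forward(self, feats: torch.Tensor):
        # feats: (batch, time, n_features) spectral features of the input audio
        x = self.vad_in(feats)
        vad_h, _ = self.vad_gru(x)
        vad = self.vad_out(vad_h)                               # voice activity result
        noise_h, _ = self.noise_gru(torch.cat([feats, vad_h], dim=-1))
        den_h, _ = self.denoise_gru(torch.cat([feats, vad_h, noise_h], dim=-1))
        gains = self.gain_out(den_h)                            # per-band gains
        return gains, vad
```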
It is noted that the RNNoise model has fewer parameters (e.g., fewer than 100,000) but is less effective in speech enhancement than a convolutional recurrent neural network. In contrast, a convolutional recurrent neural network can obtain a better speech enhancement effect, but it usually has more than 1,000,000 parameters, which means that more resources (e.g., computing resources and/or storage resources) are occupied at runtime, and in turn the power consumption of the device becomes too high. In addition, the convolutional recurrent neural network usually requires an input frame length of 10-16 ms; together with the delay generated by the computation itself, the total end-to-end delay is usually about 20 ms. However, for devices with high real-time requirements such as hearing aids, once the end-to-end delay exceeds 10 ms, the processed sound is aliased with the original sound, which produces a severe "echo" sensation and degrades speech intelligibility. In order to reduce the delay, the input frame length of the convolutional recurrent neural network can be shortened, but a shorter frame length means that the convolutional recurrent neural network must run more frequently, which further increases the power consumption of the device and is not conducive to deployment in an embedded system.
In view of the problems of the prior art, the present application provides a method for generating a speech enhancement model. According to the method, after audio training data are obtained, a first model with larger resource occupation and a second model with smaller resource occupation are trained on the training data. In the training process, a first audio with a longer frame length is obtained from an input audio, and the first audio is input into the first model for processing to obtain a first output result; then, a second audio with a shorter frame length, following the first audio, is obtained from the input audio, and the second audio and the first output result are input into the second model for processing to obtain a second output result; and the parameters of the first model and the second model are updated based on the second output result and the target output audio in the training data to obtain a trained first model and a trained second model. After training is finished, a speech enhancement model is generated based on the trained first model and the trained second model. The speech enhancement model generated by the method combines the advantages of the two models, so that a better speech enhancement effect can be obtained with low power consumption and low delay.
The method for generating a speech enhancement model according to the present application is described in detail below with reference to the accompanying drawings. FIG. 3 illustrates a flow diagram of a method 30 for generating a speech enhancement model according to some embodiments of the present application, including in particular steps 310-340 as follows.
At step 310, audio training data is obtained that includes noisy input audio and noiseless output audio corresponding to the noisy input audio.
In some embodiments, the input audio in the training data may be noisy human speech (Noisy Speech) and the output audio may be noiseless human speech, also referred to as clean speech (Clean Speech). For example, when preparing training data, a number of original human speech recordings may be collected as the output audio, and different noises may then be added to the original speech to obtain noisy speech corresponding to the original speech as the input audio. As another example, when preparing training data, a near-field microphone close to a speaker may be used to obtain clean speech as the output audio, while a far-field microphone further from the speaker is used to obtain noisy speech as the input audio. As yet another example, the training data includes noisy input audio and noise-free output audio, where the output audio is generated by processing the input audio using various noise suppression or reduction techniques.
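A minimal sketch of the first data-preparation approach mentioned above follows: mixing noise into clean speech at a chosen signal-to-noise ratio to form an (input audio, output audio) pair. The SNR value and the looping of the noise signal are assumptions for illustration.

```python
import numpy as np

def mix_at_snr(clean: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    noise = np.resize(noise, clean.shape)           # loop or trim the noise to length
    clean_power = np.mean(clean ** 2) + 1e-12
    noise_power = np.mean(noise ** 2) + 1e-12
    # Scale the noise so that the mixture has the requested SNR
    scale = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + scale * noise                    # noisy input audio

# output_audio = clean speech as recorded; input_audio = mix_at_snr(clean, noise, 5.0)
```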
In some embodiments, the audio training data is from an open source speech Dataset, such as a NOIZEUS Dataset, a VOiCES Dataset, a CHIME Dataset, and the like.
It is to be understood that the target speech in the training data may be other types of sounds besides human speech, such as animal sounds, musical instrument sounds, machine operation sounds, and the like. It should be noted that "noiseless" in this application does not require that the audio data contain no sounds other than speech; rather, it refers to audio data in which the clarity and intelligibility of speech are improved relative to the "noisy" audio or in which noise is significantly attenuated, or audio data in which the clarity and intelligibility of speech meet certain criteria.
And 320, acquiring a first model and a second model, wherein the first model is a deep neural network model, and the resource occupation of the second model in operation is less than that of the first model.
In some embodiments, the second model may occupy less resources at runtime than the first model, which may mean that the second model occupies less storage resources (e.g., memory, cache, or other storage space occupied for storing parameters of the model or intermediate data) at runtime than the first model, and/or the second model occupies less computing resources (e.g., the amount of operations of a processor, graphics card, or other computing unit occupied for speech enhancement of audio data per unit time) than the first model. It is to be appreciated that in other embodiments, the resource occupancy may also not be limited to the storage resources or the computing resources described above, and may also include occupancy of other hardware resources (e.g., bus resources, power resources, etc.) on the hardware platform to be deployed.
In some embodiments, the first model and the second model are both deep neural network models, but the second model has fewer parameters than the first model, so that the second model occupies fewer resources at runtime than the first model. For example, the ratio of the number of parameters of the second model to the number of parameters of the first model is less than 1/2, or preferably less than 1/5, e.g., less than 1/800, less than 1/500, less than 1/200, less than 1/100, or less than 1/10, etc. The specific ratio of the number of parameters is not limited in this application.
For example, the first model may be the convolutional recurrent neural network model shown in FIG. 1, which typically has more than 1,000,000 parameters; the second model may be the RNNoise model shown in FIG. 2, which typically has fewer than 1,000,000 parameters, for example around 100,000 parameters. As another example, the first model may be the convolutional recurrent neural network model shown in FIG. 1, and the second model may be a convolutional recurrent neural network model generated by compressing the first model (e.g., by a factor of 2 to 100). The compression described above involves pruning and/or quantizing the parameters of the convolutional recurrent neural network model.
It is understood that the first model and the second model can also be other deep neural network models suitable for speech enhancement, such as a convolutional neural network (CNN), a recurrent neural network (RNN), a deep belief network (DBN), a restricted Boltzmann machine (RBM), a fully-connected network (FCN), a deep convolutional network (DCN), a long short-term memory (LSTM) network, or a gated recurrent unit (GRU), etc., which are not listed exhaustively here.
In other embodiments, the first model is a deep neural network model and the second model is a hybrid model that includes a digital signal processing model and a simplified neural network model. The digital signal processing model can filter noise signals through conventional digital signal processing techniques, thereby realizing speech enhancement. For example, the digital signal processing model may implement speech enhancement based on spectral subtraction, wavelet transforms, or Wiener filtering. The input frame length of the digital signal processing model is usually shorter and its resource occupation is smaller, which helps the finally obtained speech enhancement model achieve low delay and low power consumption.
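As one illustration, the sketch below shows a basic spectral-subtraction routine, one possible form of the digital signal processing model mentioned above. The noise magnitude estimate is assumed to come from noise-only frames, and the flooring factor is an illustrative choice rather than a value taken from this application.

```python
import numpy as np

def spectral_subtraction(frame: np.ndarray, noise_mag: np.ndarray,
                         floor: float = 0.02) -> np.ndarray:
    spec = np.fft.rfft(frame)
    mag, phase = np.abs(spec), np.angle(spec)
    # Subtract the estimated noise magnitude, keeping a small spectral floor
    clean_mag = np.maximum(mag - noise_mag, floor * mag)
    return np.fft.irfft(clean_mag * np.exp(1j * phase), n=len(frame))
```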
In some embodiments, the simplified neural network model in the second model may not itself have a speech enhancement function, but it can fuse the output result of the first model with the output result of the digital signal processing model, so that the second model as a whole still achieves speech enhancement. For example, the second model may include an energy spectrum estimation (Power Spectrum Estimation) model and one or more fully connected layers. The energy spectrum estimation model is a digital signal processing model that can estimate the energy spectrum of noise for use in suppressing or cancelling the noise signal in the audio data. The fully connected layer, although not having a speech enhancement function itself, can fuse the output result of the energy spectrum estimation model with the output result of the first model.
It will be appreciated that the parameters in the first and second models (e.g., parameters of the deep neural network model, parameters of the digital signal processing model, or parameters of the fully connected layer) may have initial values before training. After training with the aforementioned training data, these parameters in the models may be updated.
Step 330, training the first model and the second model based on the audio training data.
In some embodiments, a first audio having a frame length of M may be first obtained from an input audio; and inputting the first audio frequency into the first model for processing so as to obtain a first output result. Next, a second audio having a frame length of N may be obtained from the input audio, where the second audio follows the first audio and N < M. The second audio and the first output result may then be input to a second model for processing to obtain a second output result. In this way, based on the second output result and the output audio, the parameters of the first model and the second model may be updated, thereby obtaining a trained first model and a trained second model.
The purpose of training the first and second models is to update the values of the parameters in the models to better achieve the goal of speech enhancement. In the above embodiments, the steps of inputting the first audio into the first model for processing to obtain the first output result and inputting the second audio and the first output result into the second model for processing to obtain the second output result may be referred to as forward propagation, and the step of updating the parameters of the first model and the second model based on the second output result and the output audio to obtain the trained first model and the trained second model may be referred to as back propagation. In some embodiments, the first and second models may be trained by controlling the forward and backward propagation processes based on a loss function. The loss function is optimized, for example by a gradient descent algorithm, to be as small as possible. In some embodiments, the forward propagation and the backward propagation may be performed over multiple iterations; during the iterations, the updated parameter values of the trained first model and the trained second model generated in one round of training are respectively assigned to the first model and the second model used in the next round. After multiple iterations, the error between the second output result and the output audio in the training data becomes smaller and smaller, which optimizes the speech enhancement model.
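The forward/backward flow described above can be sketched as a single training step, assuming both models are PyTorch modules and that the first audio, second audio, and target tensors have already been sliced from the training data. The optimizer choice and model interfaces are assumptions for illustration, not the application's exact procedure.

```python
import torch

def train_step(first_model, second_model, optimizer, loss_fn,
               first_audio, second_audio, target):
    # Forward propagation through both models
    first_out = first_model(first_audio)                 # coarse result, frame length M
    second_out = second_model(second_audio, first_out)   # refined result, frame length N
    loss = loss_fn(second_out, target)
    # Back propagation updates the parameters of both models jointly
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Example optimizer covering the parameters of both models:
# optimizer = torch.optim.Adam(list(first_model.parameters()) +
#                              list(second_model.parameters()), lr=1e-4)
```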
Additional details of training the first and second models based on audio training data may also be found in the specific embodiments described below in connection with fig. 4-7.
Step 340, generating a speech enhancement model based on the trained first model and the trained second model.
In some embodiments, in the above-mentioned training of the first model and the second model based on the audio training data, if an error between the second output result and the output audio in the training data is smaller than a preset error value or converges to a certain value, the training process may be ended, and the speech enhancement model is generated based on the trained first model and the trained second model.
It should be noted that speech is a complex time-varying signal, but it usually has strong correlation between time frames. This correlation is reflected, for example, in the co-articulation phenomenon during speaking: the several words before and after a word often influence the word being spoken, that is, there is long-term correlation between frames of speech. In the embodiments of the application, the first audio processed by the first model precedes the second audio processed by the second model, so that when the first output result generated by the first model is input into the second model together with the second audio for processing, the correlation between the second audio and the first audio is taken into account, and the second output result generated by the second model is thus conducive to a better speech enhancement effect on the second audio. In addition, in the embodiments of the application, the frame lengths of the first audio and the second audio are M and N, respectively, with N < M; when the second output result of the second model is subsequently used as the output result of the entire speech enhancement model, the delay can be significantly reduced compared to using only the first output result.
Referring to FIG. 4, a flowchart of a method for training a first model and a second model based on audio training data according to an embodiment of the present application is shown, which specifically includes the following steps 402 to 416. FIG. 5 shows a schematic diagram of the first model and the second model processing respective frames of audio data when performing the method of FIG. 4. In the examples of FIGS. 4 and 5, the first model and the second model are both deep neural network models, and the second model occupies fewer resources at runtime than the first model. For example, the first model may be the convolutional recurrent neural network shown in FIG. 1, and the second model may be the RNNoise model shown in FIG. 2.
The method begins at step 402 by obtaining a first audio having a frame length M from an input audio.
As mentioned above, the first model has a better speech enhancement effect, but it needs an audio signal with a longer frame length as input, and it occupies more resources at runtime than the second model. Therefore, in the embodiments of the application, the first model is mainly used to obtain coarse-grained information from the audio input with a large frame length; this coarse-grained information is not directly used as the final output of the whole speech enhancement model, but is instead input to the second model for further processing. Considering the strong correlation between the time frames of the input audio, the second model can further perform fine processing on the subsequent audio frames based on the coarse-grained information, which is conducive to improving the speech enhancement effect.
In some embodiments, the frame length M of the first audio may be between 8ms and 32ms, e.g., 10ms, 16ms, 24ms, etc. Since different types of deep neural network models have different requirements on the frame length of the input audio, in a specific application, a specific value of the first audio frame length M may be determined based on the application requirements and the characteristics of the first model.
For example, in the example of fig. 5, the frame length M of the first audio obtained from the input audio is 16 ms. Assuming that the current time is t, the second model processes the input audio after time t, and the first audio obtained from the input audio may be an audio frame between time t-16ms and t. It is understood that when training is just started, t is 0, and there is no audio data between t-16ms and t, so blank audio (e.g., all feature values of audio are 0) may be used as the first audio at this time. In other examples, randomly generated audio or preset audio may also be used as the first audio.
Step 404, inputting the first audio into the first model for processing to obtain a first output result.
In the example of FIG. 5, the first model is a convolutional recurrent neural network; after the first audio between t-16ms and t is processed by the first model, an ideal binary mask, an ideal ratio mask, or an enhanced speech signal obtained by applying the ideal binary mask or the ideal ratio mask to the first audio can be obtained.
A second audio having a frame length N is obtained from the input audio, the second audio following the first audio, and N < M, step 406.
In some embodiments, the frame length N of the second audio may be between 1ms and 8ms, e.g., 2ms, 4ms, 6ms, etc. In a particular application, a particular value of the second audio frame length N may be determined based on application requirements and characteristics of the second model.
It should be noted that, in the embodiments of the application, the second model needs to process the second audio following the first audio using the first output result produced by the first model; however, the correlation between the time frames of speech is local, so the second audio should not lag too far behind the first audio. Preferably, in some examples, if the first model processes a first audio between t-M and t, the second audio is audio data between t and t+M.
In the example of fig. 5, the first audio is an audio frame between t-16ms and t, and the second audio is an audio frame between time t and t +16 ms. In the case where the frame length N of the second audio is 2ms, the second audio obtained from the input audio may be an audio frame between time t to t +2 ms.
The frame length of the second audio is related to the output frame length of the subsequently generated speech enhancement model. For example, the frame length N of the second audio is equal to the output frame length of the subsequently generated speech enhancement model, such that the frame length of the second audio generally determines the power consumption and delay of the subsequently generated speech enhancement model. The shorter the frame length of the second audio, the shorter the end-to-end delay of the speech enhancement model. However, a shorter frame length means that the speech enhancement model operates more frequently, and power consumption is increased accordingly. For example, if the frame length of the second audio is 2ms, the end-to-end delay of the speech enhancement model is 2ms plus the delay generated by hardware operation; in addition, the speech enhancement model needs to run every 2ms to continue speech enhancement of the input audio. Therefore, in a specific application, the frame length of the second audio can be set according to the requirements of the application scene on time delay and power consumption. For example, for an application scenario requiring a lower delay, the frame length of the second audio is set shorter; for application scenarios requiring lower power consumption, the frame length of the second audio is set longer.
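The delay/power trade-off described above can be made concrete with a small calculation. In the sketch below, the 1 ms processing overhead is a placeholder assumption; the point is only that halving the output frame length halves the buffering delay but doubles how often the speech enhancement model must run.

```python
hw_delay_ms = 1.0                       # assumed hardware/processing overhead
for n_ms in (1, 2, 4, 8):               # candidate second-audio frame lengths
    runs_per_second = 1000 / n_ms       # how often the model must run
    end_to_end_ms = n_ms + hw_delay_ms  # frame buffering delay plus overhead
    print(f"frame {n_ms} ms: {runs_per_second:.0f} runs/s, ~{end_to_end_ms:.0f} ms delay")
```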
In step 408, a third audio with frame length P before the second audio is obtained from the input audio.
In the example of fig. 5, the first model and the second model are both deep neural network models. Due to the strong correlation between the time frames of the input audio, the deep neural network model usually achieves a better speech enhancement effect only when the frame length of the input audio is larger than a certain value. Although the second model is used to process the second audio with the frame length N to obtain the output result with the frame length N, if a third audio with the frame length P before the second audio is added to the input of the second model, it is beneficial to improve the speech enhancement effect. In general, the frame length P of the third audio may be determined based on the requirements of the second model for the input audio frame length. For example, assuming that the second model can achieve a better speech enhancement effect only if the frame length of the input audio is greater than X, the frame length P of the third audio should preferably be equal to or greater than (X-N).
Specifically, in the example of fig. 5, the second audio is an audio frame between time t and t +2ms, and the frame length of the third audio before the acquired second audio is 8ms, that is, the third audio is an audio frame between t-8ms and t. It will be appreciated that when training is just started, i.e. when t is 0, there is no audio data between t-8ms and t, and blank audio may be used as the third audio. In other examples, randomly generated audio or preset audio may also be used as the third audio.
And step 410, combining the third audio and the second audio to form a fourth audio with the frame length of P + N.
In the example of fig. 5, a third audio between t-8ms and t and a second audio between t and t +2ms are combined to form a fourth audio having a frame length of 10 ms.
In step 412, the fourth audio and the first output result are input to the second model for processing to obtain an intermediate output result with a frame length of P + N.
In the example of FIG. 5, the 10ms fourth audio between t-8ms and t+2ms and the first output result obtained by the first model processing the first audio between t-16ms and t are input into the second model for processing to obtain an intermediate output result with a frame length of 10ms. The intermediate output result may be an ideal binary mask, an ideal ratio mask, or an enhanced speech signal processed by the ideal binary mask or the ideal ratio mask.
Taking the second model as the RNNoise model shown in FIG. 2 as an example, with the received input features being the spectral features of the audio signal, the fourth audio and the first output result can be jointly expressed as spectral features meeting the RNNoise model's input requirements and then input to the RNNoise model for processing. Since the input of the second model includes both the coarse-grained information obtained by the first model processing the first audio between t-16ms and t and the audio information between t-8ms and t+2ms, this processing is conducive to obtaining a better speech enhancement effect.
And step 414, obtaining a result with the length of N corresponding to the second audio from the intermediate output result as a second output result.
In some embodiments, although the second model processes the fourth audio having a frame length of P + N and obtains an intermediate output result having a frame length of P + N, only the segment of length N corresponding to the second audio is extracted from the intermediate output result as the second output result.
Specifically, in the example of FIG. 5, only the portion of the intermediate output result between t and t+2ms is taken as the second output result. That is, each time the second model is run, it can be considered to process only the newly input second audio between t and t+2ms, while the preceding third audio between t-8ms and t serves only as auxiliary data to improve the speech enhancement effect.
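Steps 408-414 can be sketched as follows, using sample counts instead of milliseconds. The 16 kHz sample rate and the treatment of the second model as an arbitrary callable whose output length matches its audio input are assumptions for illustration.

```python
import numpy as np

SR = 16000                          # assumed sample rate
N = int(0.002 * SR)                 # second audio: 2 ms
P = int(0.008 * SR)                 # third audio: 8 ms of preceding context

def refine_frame(second_model, input_audio, t_idx, first_output):
    third_audio = input_audio[t_idx - P:t_idx]                   # context before t
    second_audio = input_audio[t_idx:t_idx + N]                   # new frame to enhance
    fourth_audio = np.concatenate([third_audio, second_audio])    # length P + N
    intermediate = second_model(fourth_audio, first_output)       # length P + N
    return intermediate[-N:]    # keep only the part corresponding to the second audio
```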
And step 416, updating parameters of the first model and the second model based on the second output result and the output audio to obtain a trained first model and a trained second model.
In some embodiments, the frame length M of the first audio is K times the frame length N of the second audio (i.e., M = K × N), where K is an integer greater than 1. In this way, in the process of training the first model and the second model based on the audio training data, K consecutive second audios with frame length N following the first audio are obtained from the input audio, and the K second audios may be sequentially input to the second model together with the first output result for processing to obtain K second output results. That is, K consecutive second audios with frame length N following the first audio are obtained, and steps 406-414 in FIG. 4 are performed for each second audio to obtain the corresponding second output result. Then, based on the K second output results and the output audio in the training data, the parameters of the first model and the second model may be updated, thereby obtaining a trained first model and a trained second model.
Continuing with the example of FIG. 5, the 16ms frame length of the first audio is 8 times the 2ms frame length of the second audio, so the second model is executed 8 times to process the 8 second audios between t and t+2ms, between t+2ms and t+4ms, ..., and between t+14ms and t+16ms, respectively, and the results between t and t+2ms, between t+2ms and t+4ms, ..., and between t+14ms and t+16ms in the output are taken as 8 second output results. The parameters of the first model and the second model are then updated based on the 8 second output results and the corresponding output audio in the training data. In the subsequent training process, the 16ms of audio between t and t+16ms in the input audio can be acquired as the first audio, the eight 2ms audio frames between t+16ms and t+32ms in the input audio can be acquired as the second audios, and the training of the first model and the second model continues. And so on, until the input audio in the audio training data ends.
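A sketch of this windowing follows: one first-model pass per M-length window, followed by K = M // N second-model passes over the subsequent frames. The sample-rate arithmetic and model interfaces are assumptions carried over from the previous sketch, whose refine_frame function can be passed in as the last argument.

```python
SR = 16000                          # assumed sample rate, as in the previous sketch
N = int(0.002 * SR)                 # second audio frame: 2 ms
M = int(0.016 * SR)                 # first audio frame: 16 ms
K = M // N                          # 8 second audios per first audio

def training_windows(input_audio, first_model, second_model, refine_frame):
    outputs = []
    for t_idx in range(M, len(input_audio) - M, M):
        first_audio = input_audio[t_idx - M:t_idx]            # [t-16ms, t)
        first_output = first_model(first_audio)               # coarse first output result
        for k in range(K):                                     # [t, t+16ms) in 2 ms steps
            outputs.append(refine_frame(second_model, input_audio,
                                        t_idx + k * N, first_output))
    return outputs
```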
It should be noted that, although the embodiments of the application have been described above with the frame length M of the first audio being an integer multiple of the frame length N of the second audio, the present application is not limited thereto. In other embodiments, the frame length M of the first audio is greater than the frame length N of the second audio, but M is not an integer multiple of N. For example, the frame length M of the first audio is 16ms, while the frame length N of the second audio is 5ms. Still assuming that the first audio is the audio frame between t-16ms and t, the second audio may be the audio frame between t and t+5ms; the second model is executed 3 times to process the 3 second audio frames between t and t+5ms, between t+5ms and t+10ms, and between t+10ms and t+15ms in the input audio, and the corresponding results between t and t+5ms, between t+5ms and t+10ms, and between t+10ms and t+15ms in the output are taken as 3 second output results; then, the parameters of the first model and the second model are updated based on the 3 second output results and the corresponding output audio in the training data. In the subsequent training process, the 16ms of audio between t-1ms and t+15ms in the input audio can be acquired as the first audio, the three 5ms audio frames between t+15ms and t+30ms in the input audio can be acquired as the second audios, and the training of the first model and the second model continues. And so on, until the input audio in the audio training data ends.
In some embodiments, the speech enhancement model may be trained using, for example, an error back propagation algorithm or other training algorithms for existing neural networks. In the training process with the error back propagation algorithm, the error between the K second output results generated during training and the corresponding noise-free target output audio in the training data can be calculated, and the error is then propagated backward to adjust or update the parameters of the first model and the second model. The error back propagation algorithm may perform the above training steps in multiple iterative loops until a training stop condition is reached.
Referring to FIG. 6, a flowchart of a method for training a first model and a second model based on audio training data according to another embodiment of the present application is shown, which specifically includes the following steps 602 to 612. FIG. 7 shows a schematic diagram of the first model and the second model processing respective frames of audio data when performing the method of FIG. 6. In the examples of FIGS. 6 and 7, the first model is a deep neural network model and the second model is a hybrid model including a digital signal processing model and a simplified neural network model. For example, the first model may be a convolutional recurrent neural network as shown in FIG. 1, and the second model may include an energy spectrum estimation model and one or more fully connected layers.
Methods for training the first and second models in some embodiments are described below in conjunction with fig. 6 and 7. It should be noted that the method for training the first model and the second model shown in fig. 6 and 7 has the same or similar points as the method for training the first model and the second model shown in fig. 4 and 5, and the detailed description of the same or similar points is omitted, and the difference between the two points is emphasized.
In step 602, a first audio with a frame length of M is obtained from an input audio.
In the example of fig. 7, audio frames of length 16ms between t-16ms and t are taken from the input audio as the first audio.
Step 604, input the first audio to the first model for processing to obtain a first output result.
In the example of FIG. 7, the first audio between t-16ms and t is input to the first model for processing to obtain a first output result.
And step 606, acquiring a second audio with the frame length of N from the input audio, wherein the second audio is subsequent to the first audio, and N < M.
In the example of fig. 7, an audio frame having a length of 2ms between t and t +2ms is acquired as the second audio.
Step 608, the second audio is input to the digital signal processing model for processing to obtain an intermediate output result.
Unlike the second model of FIGS. 4 and 5, in the examples of FIGS. 6 and 7 the second model is a hybrid model including a digital signal processing model and a simplified neural network model. The digital signal processing model may filter or suppress noise signals in the input audio through conventional digital signal processing techniques. The digital signal processing model has fewer parameters and, compared with a deep neural network model, requires a shorter input audio frame length, which helps reduce the delay of the finally generated speech enhancement model. The simplified neural network model may be a neural network from which hidden layers such as gated recurrent units, long short-term memory units, and/or convolution units have been removed, retaining only units (e.g., one or more convolutional layers) capable of fusing the output result of the first model with the output result of the digital signal processing model. Generally, the simplified neural network model has no speech enhancement function of its own and is only used to fuse the output result of the first model with the output result of the digital signal processing model. The present application is not so limited, however, and in some embodiments the simplified neural network model may also have a speech enhancement function, e.g., with relatively few processing parameters.
In the example of FIG. 7, after the second audio between t and t+2ms is input to the digital signal processing model for processing, the noise signal in the second audio can be filtered or suppressed, and an intermediate output result with a length of 2ms is obtained.
And step 610, inputting the intermediate output result and the first output result into the simplified neural network model for processing to obtain a second output result.
In the example of FIG. 7, the intermediate output result corresponding to the second audio between t and t+2ms is input to the simplified neural network model together with the first output result produced by the first model for processing, to obtain a second output result with a frame length of 2ms. Since the input of the simplified neural network model includes both the coarse-grained information obtained by the first model processing the first audio between t-16ms and t and the fine-grained audio between t and t+2ms, this processing is conducive to obtaining a better speech enhancement effect.
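A PyTorch sketch of such a hybrid second model is given below: a DSP stage (for example, the spectral-subtraction function sketched earlier) produces the intermediate output result, and a small fusion network combines it with the coarse first output result. The fusion layer sizes and the flat-vector representation of the first output result are illustrative assumptions.

```python
import torch
import torch.nn as nn

class HybridSecondModel(nn.Module):
    def __init__(self, frame_len: int, first_out_dim: int, hidden: int = 64):
        super().__init__()
        # Simplified neural network: fully connected layers that fuse the DSP
        # intermediate result with the first output result; no recurrent units.
        self.fuse = nn.Sequential(
            nn.Linear(frame_len + first_out_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, frame_len),
        )

    def forward(self, dsp_out: torch.Tensor, first_out: torch.Tensor) -> torch.Tensor:
        # dsp_out: (batch, frame_len) intermediate result of the DSP model
        # first_out: (batch, first_out_dim) coarse result from the first model
        return self.fuse(torch.cat([dsp_out, first_out], dim=-1))
```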
And step 612, updating parameters of the first model and the second model based on the second output result and the output audio to obtain the trained first model and the trained second model.
In the example of FIG. 7, the 16ms frame length of the first audio is 8 times the 2ms frame length of the second audio, and the second model is executed 8 times to process the 8 second audios between t and t+2ms, between t+2ms and t+4ms, ..., and between t+14ms and t+16ms, respectively, so as to obtain 8 second output results. The parameters of the first model, the digital signal processing model, and the simplified neural network model are then updated based on the 8 second output results and the corresponding output audio in the training data.
It should be noted that the frame lengths of the respective audios are only used as examples in the above example (for example, the frame length of the first audio is 16ms, the frame length of the second audio is 2ms, and the like), but the present application is not limited thereto, and in other examples, the first model and the second model may be trained with other frame lengths based on a specific application scenario and characteristics of the first model and the second model.
In addition, it should be noted that, the method for training the first model and the second model of the present application by using the forward propagation and the backward propagation methods is described in detail above with reference to fig. 4 to 7, but the present application is not limited thereto, and in other examples, other training algorithms capable of updating the parameters of the first model and the second model to obtain a better speech enhancement effect may also be used.
In another aspect of the present application, a speech enhancement method is also provided. The method uses the speech enhancement model generated in the preceding embodiments to enhance acquired audio data. In particular, the speech enhancement model generated in the foregoing embodiments may be used in various audio devices, for example in speech recognition systems or for speech enhancement in hearing aids. The effect of speech enhancement may differ across audio devices, but both speech recognition devices and hearing aid devices need to remove noise from speech while maintaining the integrity and intelligibility of the speech as much as possible. In embodiments of the present application, the speech enhancement model may be applied to electronic devices having an audio acquisition and/or audio output function, such as hearing aids, earphones, mobile communication terminals, smart home products, and the like.
The speech enhancement method of the present application will now be described with reference to the drawings. FIG. 8 shows a flow diagram of a speech enhancement method 80 according to some embodiments of the present application, which includes steps 810 to 840 as follows.
Step 810, audio data is obtained.
In some embodiments, the audio data may be acquired by an audio capture unit of the audio device. For example, when a speech enhancement request from the user is received, a microphone of the audio device is turned on to collect audio data.
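As a purely illustrative sketch of such audio acquisition on a general-purpose device, the following uses the Python sounddevice package; the 16 kHz mono format and the one-second buffer are assumptions, and dedicated devices such as hearing aids would use their own audio drivers instead.

import sounddevice as sd

SAMPLE_RATE = 16_000   # assumed sample rate
DURATION_S = 1.0       # assumed capture length per request

def capture_audio():
    """Record a short mono buffer once a speech enhancement request is received."""
    frames = int(DURATION_S * SAMPLE_RATE)
    audio = sd.rec(frames, samplerate=SAMPLE_RATE, channels=1, dtype="float32")
    sd.wait()           # block until the recording has finished
    return audio[:, 0]  # return a 1-D array of samples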
Various noises exist in the environment around the audio device, such as natural white noise, burst noise produced by people, and reverberation caused by room reflections, so noise is mixed into the speech signal received by the audio acquisition unit. This mixed-in noise may degrade the quality and intelligibility of the received speech and thus affect the processing performed by subsequent speech units. The main purpose of speech enhancement is to perform noise reduction on noisy audio data: through speech enhancement, various interfering signals can be effectively suppressed and the target speech signal enhanced. The target speech signal may be, for example, a human voice, an animal sound, a musical instrument, or the sound of operating machinery.
Step 820, a speech enhancement model is obtained, wherein the speech enhancement model includes a first model and a second model, the first model is a deep neural network model, and the second model occupies fewer resources at runtime than the first model.
In some embodiments, the first model and the second model are both deep neural network models, and the second model has fewer parameters than the first model. For example, the ratio of the number of parameters of the second model to the number of parameters of the first model is less than 1/2, and preferably less than 1/5. In some embodiments, the first model is a convolutional recurrent neural network model and the second model is an RNNoise model.
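As a quick illustration of the parameter-count comparison, assuming PyTorch modules, the following counts parameters for two placeholder networks; the two GRU models below are chosen only to make the ratio concrete and are not the convolutional recurrent neural network of fig. 1 or the RNNoise model of fig. 2.

import torch.nn as nn

def num_params(model: nn.Module) -> int:
    return sum(p.numel() for p in model.parameters())

# Placeholder stand-ins for the large first model and the small second model.
first_model = nn.GRU(input_size=257, hidden_size=512, num_layers=2, batch_first=True)
second_model = nn.GRU(input_size=42, hidden_size=96, num_layers=1, batch_first=True)

ratio = num_params(second_model) / num_params(first_model)
print(f"parameter ratio: {ratio:.3f}")   # well below 1/2 for these placeholders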
In some embodiments, the first model is a deep neural network model and the second model is a hybrid model that includes a digital signal processing model and a simplified neural network model. For example, the second model includes an energy spectrum estimation model and one or more fully connected layers.
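A minimal sketch of such a hybrid second model, assuming PyTorch; the 2ms frame of 32 samples, the 32-dimensional coarse context vector, and the layer sizes are illustrative assumptions. In this sketch the FFT-magnitude step stands in for the energy spectrum estimation model and self.fc for the one or more fully connected layers.

import torch
import torch.nn as nn

class HybridSecondModel(nn.Module):
    """Energy-spectrum front end (the DSP part) followed by small fully connected layers."""

    def __init__(self, frame_len: int = 32, ctx_dim: int = 32, n_hidden: int = 48):
        super().__init__()
        n_bins = frame_len // 2 + 1            # rFFT bins of one short frame
        self.fc = nn.Sequential(
            nn.Linear(n_bins + ctx_dim, n_hidden),
            nn.ReLU(),
            nn.Linear(n_hidden, n_bins),
            nn.Sigmoid(),                      # per-bin gain in [0, 1]
        )

    def forward(self, frame: torch.Tensor, coarse_ctx: torch.Tensor) -> torch.Tensor:
        spec = torch.fft.rfft(frame)                             # energy spectrum estimation
        gain = self.fc(torch.cat([spec.abs(), coarse_ctx], dim=-1))
        return torch.fft.irfft(spec * gain, n=frame.shape[-1])   # enhanced short frame

Here coarse_ctx stands for (a projection of) the first model's output, so the small network predicts a per-bin gain from the frame's energy spectrum conditioned on the coarse-grained information.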
The speech enhancement model may be one generated using the method for generating a speech enhancement model in the above embodiments. For specific details of the speech enhancement model, reference may therefore be made to the description above in conjunction with figs. 3-7, which is not repeated here.
Step 830, the audio data is processed using the speech enhancement model to attenuate or remove noise signals in the audio data.
In the example of fig. 8, step 830 further includes steps 8302 through 8308 as follows: step 8302, obtaining a first audio with a frame length of M from the audio data; step 8304, inputting the first audio into the first model for processing to obtain a first output result; step 8306, obtaining a second audio with a frame length of N from the audio data, wherein the second audio follows the first audio and N < M; and step 8308, inputting the second audio and the first output result into the second model for processing to obtain a second output result.
Step 840, the second output result is output as the enhanced audio data.
The steps 830 and 840 of processing the audio data using the speech enhancement model and obtaining enhanced audio data are further described with reference to the specific examples of fig. 9 and 10.
In the example of fig. 9, the first model and the second model are both deep neural network models, and the second model occupies fewer resources at runtime than the first model. For example, the first model may be the convolutional recurrent neural network shown in fig. 1, and the second model may be the RNNoise model shown in fig. 2.
Specifically, a first audio with a frame length of 16ms is first acquired from the audio data. Assuming that the current time is t and the second model processes the audio data after time t, the first audio acquired from the audio data may be the audio frame between time t-16ms and t. This 16ms first audio is input into the first model for processing to obtain a first output result. For the second model, the 2ms audio frame between time t and t+2ms is acquired from the audio data as the second audio, the 8ms audio frame between time t-8ms and t is acquired as the third audio, and the third audio and the second audio are combined to form a fourth audio with a frame length of 10ms. The fourth audio and the first output result of the first model are then input together into the second model for processing to obtain an intermediate output result with a frame length of 10ms. Finally, the 2ms portion corresponding to the second audio is taken from the intermediate output result as the second output result, which is the enhanced audio data generated after the noise signal in the audio data is weakened or removed. Thereafter, the 7 further second audios between t+2ms and t+4ms, …, and between t+14ms and t+16ms may be obtained and sequentially input into the second model together with the first output result for similar processing, so as to output the corresponding enhanced audio data. After the second model has processed the 8 consecutive second audios between t and t+16ms, the first model may be run again on the audio between t and t+16ms to generate a new first output result, which is passed to the second model. The second model then processes the 8 second audios between t+16ms and t+32ms using the new first output result to obtain new enhanced audio data. By analogy, speech enhancement is performed continuously on the audio data acquired by the audio acquisition unit.
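The scheduling in this example can be summarized in the following sketch, assuming 16 kHz audio and callable first_model and second_model objects; it illustrates the frame bookkeeping only and is not a required implementation.

import numpy as np

SR = 16_000
BIG = int(0.016 * SR)     # 16 ms -> 256 samples
SMALL = int(0.002 * SR)   # 2 ms  -> 32 samples
CTX = int(0.008 * SR)     # 8 ms of look-back context for the second model

def enhance_stream(audio: np.ndarray, first_model, second_model):
    """Yield enhanced 2 ms frames: the first model runs once per 16 ms,
    the second model runs once per 2 ms (eight times per first-model run)."""
    t = BIG                                        # need 16 ms of history first
    while t + BIG <= len(audio):
        first_out = first_model(audio[t - BIG:t])  # coarse info over the last 16 ms
        for k in range(BIG // SMALL):              # eight fine-grained steps
            lo = t + k * SMALL
            fourth = audio[lo - CTX:lo + SMALL]    # 8 ms context + 2 ms new = 10 ms
            intermediate = second_model(fourth, first_out)   # 10 ms output
            yield intermediate[-SMALL:]            # keep only the last 2 ms
        t += BIG                                   # advance to the next 16 ms block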
In the example of fig. 10, the first model is a deep neural network model, and the second model is a hybrid model including a digital signal processing model and a simplified neural network model. For example, the first model may be a convolutional recurrent neural network as shown in fig. 1, and the second model may include an energy spectrum estimation model and one or more fully-connected layers.
Specifically, a first audio with a frame length of 16ms is first acquired from the audio data. Assuming that the current time is t and the second model processes the audio data after time t, the first audio acquired from the audio data may be the audio frame between time t-16ms and t. This 16ms first audio is input into the first model for processing to obtain a first output result. For the digital signal processing model in the second model, the 2ms audio frame between time t and t+2ms is acquired from the audio data as the second audio. After this second audio is input into the digital signal processing model for processing, the noise signal in it can be filtered or suppressed, and a 2ms intermediate output result is obtained. The intermediate output result corresponding to the second audio between t and t+2ms is then input into the simplified neural network model together with the first output result generated by the first model, yielding a second output result with a frame length of 2ms. The second output result is the enhanced audio data generated after the noise signal in the audio data is weakened or removed. Thereafter, the 7 further second audios between t+2ms and t+4ms, …, and between t+14ms and t+16ms may be obtained and input into the digital signal processing model to obtain 7 intermediate output results, and these 7 intermediate output results are input together with the first output result into the simplified neural network model for processing to obtain 7 second output results. After the second model has processed the 8 consecutive second audios between t and t+16ms, the first model may be run again on the audio between t and t+16ms to generate a new first output result, which is passed to the second model. The second model then processes the 8 second audios between t+16ms and t+32ms using the new first output result to obtain new enhanced audio data. By analogy, speech enhancement is performed continuously on the audio data acquired by the audio acquisition unit.
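The corresponding sketch for this hybrid variant, under the same assumptions as before (16 kHz audio; callable first_model, dsp_model and small_model objects), differs only in how each 2 ms frame is handled.

import numpy as np

SR = 16_000
BIG = int(0.016 * SR)     # 16 ms
SMALL = int(0.002 * SR)   # 2 ms

def enhance_stream_hybrid(audio: np.ndarray, first_model, dsp_model, small_model):
    """DSP pre-filtering per 2 ms frame, then a simplified neural network
    conditioned on the coarse output of the first model."""
    t = BIG
    while t + BIG <= len(audio):
        first_out = first_model(audio[t - BIG:t])            # runs once per 16 ms
        for k in range(BIG // SMALL):                        # runs eight times per 16 ms
            lo = t + k * SMALL
            intermediate = dsp_model(audio[lo:lo + SMALL])   # filter/suppress noise
            yield small_model(intermediate, first_out)       # enhanced 2 ms frame
        t += BIG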
It should be noted that, in the embodiments described above in conjunction with figs. 9 and 10, the frame lengths of the audios are only examples (for example, a 16ms first audio and a 2ms second audio); the present application is not limited thereto, and in other examples the first model and the second model may operate with other frame lengths based on the specific application scenario and the characteristics of the two models. Furthermore, the speech enhancement method described above with reference to figs. 8 to 10 shares many details with the method for generating a speech enhancement model described in the previous embodiments; where the description of the speech enhancement method is not exhaustive, reference may therefore be made to the method for generating a speech enhancement model described above with reference to figs. 3 to 7, and the details are not repeated here.
It can be seen from the above embodiments that the speech enhancement model of the present application fuses two models with different resource occupation and, after training, deploys them on the corresponding speech enhancement device: the large model, which occupies more resources, extracts coarse-grained information from audio data with a large frame length, while the small model, which occupies fewer resources, uses that coarse-grained information to perform fine-grained processing on audio data with a small frame length. The speech enhancement model can therefore reduce latency and power consumption while maintaining the speech enhancement effect. Taking the speech enhancement method shown in fig. 9 as an example, which outputs enhanced audio data once every 2ms, for audio data with a frame length of 16ms the large model (i.e., the first model) needs to run once and the small model (i.e., the second model) needs to run eight times. By contrast, if the large model alone were used to output enhanced audio data once every 2ms, it would need to run eight times for 16ms of audio data, and since the large model has many parameters (for example, more than one million), its resource occupation at runtime is high, which would significantly increase the power consumption of the audio device. Conversely, if the small model alone were used to output enhanced audio data once every 2ms, it would also need to run eight times for 16ms of audio data, but because the small model has far fewer parameters, the speech enhancement effect would be significantly lower than that of the speech enhancement model of the present application.
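The trade-off argued above can be made concrete with a back-of-the-envelope count; the per-invocation costs below are arbitrary illustrative numbers, not measurements.

# Illustrative per-invocation costs (arbitrary units), not measured values.
COST_LARGE = 100           # assumed cost of one large-model (first model) run
COST_SMALL = 5             # assumed cost of one small-model (second model) run
RUNS_PER_16MS = 8          # one 2 ms output every 2 ms over a 16 ms window

combined = COST_LARGE + COST_SMALL * RUNS_PER_16MS   # scheme of the present application
large_only = COST_LARGE * RUNS_PER_16MS              # large model alone, every 2 ms
small_only = COST_SMALL * RUNS_PER_16MS              # small model alone, every 2 ms

print(combined, large_only, small_only)   # 140, 800, 40 (the last trades away quality)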
In some embodiments, the present application further provides a non-transitory computer-readable storage medium having computer-executable code stored thereon which, when executed by a processor, implements the steps of the method illustrated in figs. 3, 4 and/or 6, thereby implementing the method for generating a speech enhancement model in the above embodiments of the present application. In some embodiments, the present application further provides an apparatus for generating a speech enhancement model, which includes a processor and a memory, the memory being configured to store a computer program capable of running on the processor; when the computer program is executed by the processor, the method for generating a speech enhancement model in the above embodiments of the present application can be implemented.
In other embodiments, the present application further provides a non-transitory computer-readable storage medium having computer-executable code stored thereon which, when executed by a processor, implements the steps of the method shown in fig. 8, thereby implementing the speech enhancement method in the above embodiments of the present application. In some embodiments, the present application further provides a speech enhancement device comprising a processor and a memory, the memory being configured to store a computer program capable of running on the processor; when the computer program is executed by the processor, the speech enhancement method in the above embodiments of the present application can be implemented.
In addition, the embodiments of the present application may be realized by hardware, software, or a combination of software and hardware. The hardware portion may be implemented using dedicated logic; the software portion may be stored in a memory and executed by a suitable instruction execution system, such as a microprocessor, or by specially designed hardware. Those skilled in the art will appreciate that the apparatus and methods described above may be implemented using computer-executable instructions and/or embodied in processor control code, such code being provided, for example, on a carrier medium such as a disk, CD- or DVD-ROM, on a programmable memory such as read-only memory (firmware), or on a data carrier such as an optical or electronic signal carrier. The apparatus and its modules may be implemented by hardware circuits such as very-large-scale integrated circuits or gate arrays, by semiconductors such as logic chips and transistors, by programmable hardware devices such as field-programmable gate arrays and programmable logic devices, by software executed by various types of processors, or by a combination of such hardware circuits and software, for example firmware.
It should be noted that although several steps or modules of the method, apparatus and storage medium for generating a speech enhancement model, and of the speech enhancement method, device and storage medium, are mentioned in the detailed description above, such partitioning is merely exemplary and not mandatory. Indeed, according to embodiments of the present application, the features and functions of two or more modules described above may be embodied in a single module. Conversely, the features and functions of one module described above may be further divided and embodied by a plurality of modules.
Other variations to the disclosed embodiments can be understood and effected by those skilled in the art from a study of the specification, the disclosure, the drawings, and the appended claims. In the claims, the word "comprising" does not exclude other elements or steps, and the words "a" or "an" do not exclude a plurality. In the practical application of the present application, one element may perform the functions of several technical features recited in the claims. Any reference signs in the claims shall not be construed as limiting the scope.

Claims (25)

1. A method for generating a speech enhancement model, the method for generating a speech enhancement model comprising:
obtaining audio training data comprising noisy input audio and noiseless output audio corresponding to the noisy input audio;
obtaining a first model and a second model, wherein the first model is a deep neural network model, and the resource occupation of the second model in operation is less than that of the first model;
training the first model and the second model based on the audio training data, comprising:
acquiring a first audio with a frame length of M from the input audio;
inputting the first audio into the first model for processing to obtain a first output result;
obtaining a second audio having a frame length of N from the input audio, the second audio following the first audio and N < M;
inputting the second audio and the first output result into the second model for processing to obtain a second output result; and
updating parameters of the first model and the second model based on the second output result and the output audio to obtain a trained first model and a trained second model; and
generating a speech enhancement model based on the trained first model and the trained second model.
2. The method of claim 1, wherein the second model is a deep neural network model and the parameters of the second model are less than the parameters of the first model.
3. The method of claim 2, wherein the ratio of the number of parameters of the second model to the number of parameters of the first model is less than 1/2.
4. The method of claim 2, wherein the first model is a convolutional recurrent neural network model and the second model is an RNNoise model.
5. The method of claim 2,
training the first model and the second model based on the audio training data further comprises:
acquiring a third audio with a frame length P before the second audio from the input audio; and
combining the third audio with the second audio to form a fourth audio with a frame length of P + N;
wherein inputting the second audio and the first output result into the second model for processing to obtain a second output result comprises:
inputting the fourth audio and the first output result into the second model for processing to obtain an intermediate output result with a frame length of P + N; and
acquiring a result with a length of N corresponding to the second audio from the intermediate output result as the second output result.
6. The method of claim 1, wherein the second model comprises a digital signal processing model and a simplified neural network model.
7. The method of claim 6, wherein the second model comprises an energy spectrum estimation model and a fully connected layer.
8. The method of claim 6, wherein inputting the second audio and the first output result to the second model for processing to obtain a second output result comprises:
inputting the second audio into the digital signal processing model for processing so as to obtain an intermediate output result; and
inputting the intermediate output result and the first output result into the simplified neural network model for processing to obtain the second output result.
9. The method of claim 1, wherein M is K × N, and K is an integer greater than 1.
10. The method of claim 9, wherein in training the first model and the second model based on the audio training data, K consecutive second audios having a frame length of N after the first audio are obtained from the input audio, the K second audios are sequentially input to the second model together with the first output result to be processed to obtain K second output results, and parameters of the first model and the second model are updated based on the K second output results and the output audio to obtain the trained first model and the trained second model.
11. The method of claim 1, wherein the step of training the first model and the second model based on the audio training data is performed in a plurality of iterations, wherein during the iterative execution, updated values of parameters of the trained first model and the trained second model generated in a previous training are assigned to be used by the pre-trained first model and the pre-trained second model in a subsequent training, respectively.
12. An apparatus for generating a speech enhancement model, comprising:
a processor; and
a storage device for storing a computer program operable on the processor;
wherein the computer program, when executed by the processor, causes the processor to carry out the method for generating a speech enhancement model according to any of claims 1-11.
13. A non-transitory computer-readable storage medium, having stored thereon a computer program which, when executed by a processor, implements a method for generating a speech enhancement model according to any one of claims 1-11.
14. A method of speech enhancement, the method comprising:
acquiring audio data;
obtaining a speech enhancement model, wherein the speech enhancement model comprises a first model and a second model, the first model is a deep neural network model, and the resource occupation of the second model is less than that of the first model when the second model runs;
processing the audio data using the speech enhancement model to attenuate or remove noise signals in the audio data, comprising:
acquiring a first audio with a frame length of M from the audio data;
inputting the first audio into the first model for processing to obtain a first output result;
obtaining a second audio having a frame length of N from the audio data, the second audio following the first audio and N < M; and
inputting the second audio and the first output result into the second model for processing to obtain a second output result; and
outputting the second output result as enhanced audio data.
15. The speech enhancement method of claim 14 wherein the second model is a deep neural network model and the parameters of the second model are less than the parameters of the first model.
16. The speech enhancement method of claim 15 wherein the ratio of the number of parameters of the second model to the number of parameters of the first model is less than 1/2.
17. The speech enhancement method of claim 15 wherein the first model is a convolutional recurrent neural network model and the second model is an RNNoise model.
18. The speech enhancement method of claim 15,
processing the audio data using the speech enhancement model to attenuate or remove noise signals in the audio data further comprises:
acquiring a third audio with a frame length P before the second audio from the audio data; and
combining the third audio with the second audio to form a fourth audio with a frame length of P + N;
wherein inputting the second audio and the first output result into the second model for processing to obtain a second output result comprises:
inputting the fourth audio and the first output result into the second model for processing to obtain an intermediate output result with a frame length of P + N; and
acquiring a result with a length of N corresponding to the second audio from the intermediate output result as the second output result.
19. The speech enhancement method of claim 14 wherein the second model comprises a digital signal processing model and a simplified neural network model.
20. The speech enhancement method of claim 19 wherein the second model comprises an energy spectrum estimation model and a fully connected layer.
21. The speech enhancement method of claim 19 wherein inputting the second audio and the first output result to the second model for processing to obtain a second output result comprises:
inputting the second audio into the digital signal processing model for processing so as to obtain an intermediate output result; and
inputting the intermediate output result and the first output result into the simplified neural network model for processing to obtain the second output result.
22. The speech enhancement method of claim 14 wherein M is K x N, K being an integer greater than 1.
23. The speech enhancement method according to claim 22, wherein, in the process of processing the audio data using the speech enhancement model, K consecutive second audios of frame length N following the first audio are obtained from the audio data, and the K second audios are sequentially input to the second model together with the first output result to be processed to obtain K second output results.
24. A speech enhancement device, comprising:
a processor; and
a storage device for storing a computer program operable on the processor;
wherein the computer program, when executed by the processor, causes the processor to perform the speech enhancement method of any of claims 14-23.
25. A non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the speech enhancement method of any one of claims 14-23.
CN202111623706.XA 2021-12-28 2021-12-28 Method and apparatus for generating speech enhancement model and speech enhancement method and apparatus Pending CN114121030A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202111623706.XA CN114121030A (en) 2021-12-28 2021-12-28 Method and apparatus for generating speech enhancement model and speech enhancement method and apparatus
PCT/CN2022/138735 WO2023124984A1 (en) 2021-12-28 2022-12-13 Method and device for generating speech enhancement model, and speech enhancement method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111623706.XA CN114121030A (en) 2021-12-28 2021-12-28 Method and apparatus for generating speech enhancement model and speech enhancement method and apparatus

Publications (1)

Publication Number Publication Date
CN114121030A true CN114121030A (en) 2022-03-01

Family

ID=80362705

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111623706.XA Pending CN114121030A (en) 2021-12-28 2021-12-28 Method and apparatus for generating speech enhancement model and speech enhancement method and apparatus

Country Status (2)

Country Link
CN (1) CN114121030A (en)
WO (1) WO2023124984A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023124984A1 (en) * 2021-12-28 2023-07-06 上海又为智能科技有限公司 Method and device for generating speech enhancement model, and speech enhancement method and device

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10614827B1 (en) * 2017-02-21 2020-04-07 Oben, Inc. System and method for speech enhancement using dynamic noise profile estimation
CN111508519B (en) * 2020-04-03 2022-04-26 北京达佳互联信息技术有限公司 Method and device for enhancing voice of audio signal
CN111477248B (en) * 2020-04-08 2023-07-28 腾讯音乐娱乐科技(深圳)有限公司 Audio noise detection method and device
CN111768795A (en) * 2020-07-09 2020-10-13 腾讯科技(深圳)有限公司 Noise suppression method, device, equipment and storage medium for voice signal
CN113470685B (en) * 2021-07-13 2024-03-12 北京达佳互联信息技术有限公司 Training method and device for voice enhancement model and voice enhancement method and device
CN114121030A (en) * 2021-12-28 2022-03-01 上海又为智能科技有限公司 Method and apparatus for generating speech enhancement model and speech enhancement method and apparatus

Also Published As

Publication number Publication date
WO2023124984A1 (en) 2023-07-06

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination