CN114333895A - Speech enhancement model, electronic device, storage medium, and related methods - Google Patents


Info

Publication number
CN114333895A
CN114333895A
Authority
CN
China
Prior art keywords
data
convolution
frequency domain
time
voice
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210022926.5A
Other languages
Chinese (zh)
Inventor
赵胜奎 (Shengkui Zhao)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Damo Institute Hangzhou Technology Co Ltd
Original Assignee
Alibaba Damo Institute Hangzhou Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Damo Institute Hangzhou Technology Co Ltd filed Critical Alibaba Damo Institute Hangzhou Technology Co Ltd
Priority to CN202210022926.5A
Publication of CN114333895A
Legal status: Pending


Abstract

Embodiments of the present application provide a speech enhancement model, an electronic device, a storage medium, and related methods. The speech enhancement method includes: converting noisy speech data into time-frequency domain feature data; generating a masking value for the noisy speech data according to the long-range correlation of the time-frequency domain feature data in the frequency direction; and generating enhanced speech data of the noisy speech data according to the masking value and the time-frequency domain feature data. This scheme can improve the speech enhancement effect on speech signals.

Description

Speech enhancement model, electronic device, storage medium, and related methods
Technical Field
Embodiments of the present application relate to the field of speech processing technologies, and in particular, to a speech enhancement model, an electronic device, a storage medium, and a related method.
Background
Speech enhancement is a speech processing technique for extracting the useful speech signal from a noise background after the signal has been disturbed or drowned out by various noises, thereby suppressing and reducing noise interference. Speech enhancement is widely applied in Real-Time Communication (RTC) scenarios such as audio-video conferencing and online courseware.
At present, when speech enhancement is performed on speech data, temporal features along the time direction and short-range frequency features along the frequency direction are extracted, and speech enhancement is then performed on the speech data according to the temporal features and the short-range frequency features to suppress noise in the speech data.
However, noise whose frequency differs greatly from the useful speech signal cannot be effectively recognized based on the temporal features and the short-range frequency features alone, so such noise cannot be filtered out of the useful speech signal, and the speech enhancement effect on speech signals is poor.
Disclosure of Invention
In view of the above, embodiments of the present application provide a speech enhancement model, an electronic device, a storage medium and a related method, so as to at least solve or alleviate the above-mentioned problems.
According to a first aspect of embodiments of the present application, there is provided a speech enhancement method, including: converting the voice data with noise into time-frequency domain characteristic data; generating a masking value of the voice data with noise according to the long-range correlation of the time-frequency domain characteristic data in the frequency direction; and generating the enhanced voice data of the voice data with noise according to the masking value and the time-frequency domain characteristic data.
According to a second aspect of embodiments of the present application, there is provided a speech enhancement model, including: an encoder, an attention module, a loop module, and a decoder. The encoder is used for processing the time-frequency domain feature data corresponding to the noisy speech data; the attention module is used for processing the output data of the encoder; the loop module is used for processing the output data of the encoder; and the decoder is used for processing the output data of the attention module and the loop module to obtain correlation feature data characterizing the long-range correlation of the time-frequency domain feature data in the frequency direction.
According to a third aspect of embodiments of the present application, there is provided a speech recognition method, including: acquiring voice data to be processed, wherein the voice data to be processed includes noise and is one of the following: audio-video conference voice data, online education voice data, and live-streaming voice data; converting the voice data to be processed into time-frequency domain feature data; generating a masking value of the voice data to be processed according to the long-range correlation of the time-frequency domain feature data in the frequency direction; generating enhanced voice data of the voice data to be processed according to the masking value and the time-frequency domain feature data; and performing voice recognition on the enhanced voice data to obtain a recognition result.
According to a fourth aspect of embodiments of the present application, there is provided an electronic device, including a processor, a memory, a communication interface, and a communication bus through which the processor, the memory, and the communication interface communicate with one another. The memory is configured to store at least one executable instruction, and the executable instruction causes the processor to perform operations corresponding to the speech enhancement method according to the first aspect or the speech recognition method according to the third aspect.
According to a fifth aspect of embodiments of the present application, there is provided a computer storage medium having stored thereon a computer program which, when executed by a processor, implements a speech enhancement method as described in the above first aspect or a speech recognition method as described in the above third aspect.
According to a sixth aspect of embodiments herein, there is provided a computer program product comprising computer instructions for instructing a computing device to perform operations corresponding to the speech enhancement method according to the first aspect or operations corresponding to the speech recognition method according to the third aspect.
According to the technical solutions provided by the embodiments of the present application, after the noisy speech data is converted into time-frequency domain feature data, the long-range correlation of the time-frequency domain feature data in the frequency direction can characterize the correlation between noise signals whose frequencies differ greatly from the useful speech signal and the useful speech signal itself. The noise signals and the useful speech signal in the time-frequency domain feature data can therefore be determined based on this long-range correlation, a masking value of the noisy speech data is generated, and the enhanced speech data of the noisy speech data is generated according to the masking value and the time-frequency domain feature data. Since noise signals with large frequency differences are filtered out of the generated enhanced speech data, the speech enhancement effect can be improved.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some of the embodiments described in the embodiments of the present application, and those skilled in the art can obtain other drawings from them.
FIG. 1 is a schematic diagram of an exemplary system in which one embodiment of the present application may be implemented;
FIG. 2 is a flow diagram of a speech enhancement method of one embodiment of the present application;
FIG. 3 is a schematic illustration of a speech enhancement model of an embodiment of the present application;
FIG. 4 is a schematic illustration of a speech enhancement model of another embodiment of the present application;
FIG. 5 is a schematic diagram of a first convolution loop block of one embodiment of the present application;
FIG. 6 is a flow chart of a model training method of one embodiment of the present application;
FIG. 7 is a schematic diagram of an auxiliary network of one embodiment of the present application;
FIG. 8 is a flow diagram of a speech recognition method of one embodiment of the present application;
FIG. 9 is a schematic diagram of a speech enhancement apparatus according to an embodiment of the present application;
FIG. 10 is a schematic view of an electronic device of an embodiment of the application.
Detailed Description
In order to make those skilled in the art better understand the technical solutions in the embodiments of the present application, the technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only a part of the embodiments of the present application, not all of them. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in the present application shall fall within the protection scope of the embodiments of the present application.
In the embodiments of the present application, in order to improve the effect of speech enhancement, the noisy speech data is converted into time-frequency domain feature data, a masking value of the noisy speech data is generated according to the long-range correlation of the time-frequency domain feature data in the frequency direction, and the enhanced speech data of the noisy speech data is then generated according to the masking value and the time-frequency domain feature data. The long-range correlation of the time-frequency domain feature data in the frequency direction can characterize the correlation between noise signals whose frequencies differ greatly from the useful speech signal and the useful speech signal itself, so noise with large frequency differences can be effectively filtered out of the useful speech signal according to this correlation. Generating the masking value based on the long-range correlation in the frequency direction and then generating the enhanced speech data based on the masking value therefore filters the noise in the noisy speech data more effectively, and the speech enhancement effect can be improved.
In specific implementation, the speech enhancement method provided by the embodiment of the application can be used in various application scenarios. For example, a certain cloud service system may provide a voice enhancement service, which may be implemented by the scheme provided in the embodiment of the present application. Specifically, the cloud service system provides a voice enhancement model and provides a cloud voice enhancement interface for a user, a plurality of users can call the interface in respective application systems, the cloud service system runs a related processing program after receiving the call, the voice enhancement is realized through the voice enhancement model, and enhanced voice data is returned. In addition, the voice enhancement method provided by the embodiment of the present application can also be used in a localized device, for example, the voice enhancement method provided by the embodiment of the present application can be implemented in a local device such as an audio video conference terminal, an online education terminal, and the like.
FIG. 1 illustrates an exemplary system suitable for use with the speech enhancement method of embodiments of the present application. As shown in fig. 1, the system 100 may include a server 102, a communication network 104, and/or one or more user devices 106, illustrated in fig. 1 as a plurality of user devices.
Server 102 may be any suitable server for storing information, data, programs, and/or any other suitable type of content. In some embodiments, server 102 may perform any suitable functions. For example, in some embodiments, the server 102 may be used for speech enhancement. As an alternative example, in some embodiments, server 102 may be used for speech enhancement by a speech enhancement model. As another example, in some embodiments, the server 102 may be used to send speech enhancement results to the user device.
In some embodiments, the communication network 104 may be any suitable combination of one or more wired and/or wireless networks. For example, the communication network 104 may include, but is not limited to, the internet, an intranet, a Wide Area Network (WAN), a Local Area Network (LAN), a wireless network, a Digital Subscriber Line (DSL) network, a frame relay network, an Asynchronous Transfer Mode (ATM) network, a Virtual Private Network (VPN), and/or any other suitable communication network. The user device 106 can be connected to the communication network 104 by one or more communication links (e.g., communication link 112), and the communication network 104 can be linked to the server 102 via one or more communication links (e.g., communication link 114). The communication links may be any links suitable for communicating data between the user device 106 and the server 102, such as network links, dial-up links, wireless links, hardwired links, any other suitable communication links, or any suitable combination of such links.
User devices 106 may include any one or more user devices adapted to receive or collect voice data. In some embodiments, user devices 106 may comprise any suitable type of device. For example, in some embodiments, the user device 106 may include a mobile device, a tablet computer, a laptop computer, a desktop computer, a wearable computer, a game console, a media player, a vehicle entertainment system, and/or any other suitable type of user device.
Although server 102 is illustrated as one device, in some embodiments, any suitable number of devices may be used to perform the functions performed by server 102. For example, in some embodiments, multiple devices may be used to implement the functions performed by the server 102. Alternatively, the functionality of the server 102 may be implemented using a cloud service.
Speech enhancement method
Based on the above system, the embodiment of the present application provides a speech enhancement method, which is described below by using a plurality of embodiments.
Fig. 2 is a schematic flowchart of a speech enhancement method according to an embodiment of the present application. As shown in fig. 2, the method comprises the steps of:
step 201, converting the voice data with noise into time-frequency domain characteristic data.
Noisy speech data may be represented as x = y + z, where x denotes the noisy speech data, y denotes the clean speech signal, and z denotes the noise signal. The noisy speech data describes the time-domain characteristics of the speech signal, i.e., how the amplitude of the speech signal changes over time. In order to perform speech enhancement on noisy speech data, the time-frequency characteristics of the noisy speech signal are required, which can represent both the time-domain and the frequency-domain characteristics of the noisy speech signal.
Optionally, the noisy speech data is converted into time-frequency domain feature data by a Short-Time Fourier Transform (STFT).
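As an illustrative sketch only (the patent does not prescribe an implementation), the conversion of step 201 could be performed with an STFT as follows in Python/PyTorch; the frame length, hop size, and Hann window are assumed values.

```python
import torch

def to_time_frequency(noisy_speech: torch.Tensor,
                      n_fft: int = 512, hop: int = 256) -> torch.Tensor:
    """Convert time-domain noisy speech (shape [samples]) into complex
    time-frequency domain feature data of shape [freq_bins, frames]."""
    window = torch.hann_window(n_fft)
    # return_complex=True yields X = X_r + j*X_i directly
    return torch.stft(noisy_speech, n_fft=n_fft, hop_length=hop,
                      window=window, return_complex=True)
```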
Step 202, generating a masking value of the voice data with noise according to the long-range correlation of the time-frequency domain characteristic data in the frequency direction.
The long-range correlation of the noisy speech data in the frequency axis direction may be captured from the time-frequency domain feature data, and a masking value of the noisy speech data is generated based on the captured long-range correlation, the masking value indicating a useful speech signal and a noisy speech signal in the time-frequency domain feature data.
The time-frequency domain feature data includes long-range correlation of the noisy speech data along the frequency axis direction, the time-frequency domain feature data may be input to a speech enhancement model, the long-range correlation of the noisy speech data along the frequency axis direction is captured by the speech enhancement model, and a useful speech signal and a noise speech signal in the time-frequency domain feature data are determined based on the captured long-range correlation, thereby generating a masking value indicating the useful speech signal and the noise speech signal in the time-frequency domain feature data.
Step 203, generating enhanced voice data of the voice data with noise according to the masking value and the time-frequency domain characteristic data.
Since the masking value of the time-frequency domain feature data can indicate the useful speech signal and the noise speech signal in the time-frequency domain feature data, the noise speech signal in the time-frequency domain feature data can be filtered according to the masking value, and the enhanced speech data mainly comprising the useful speech signal in the time-frequency domain feature data is obtained.
In order to facilitate subsequent processing such as speech recognition and speech playing, the enhanced speech data in the time-frequency domain needs to be converted into the time domain. In an alternative implementation, the enhanced speech data may be converted to the Time domain by Inverse Short-Time Fourier Transform (ISTFT).
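A minimal sketch of step 203 together with the ISTFT conversion, under the same assumptions as the STFT sketch above; `mask` is the complex masking value described in step 202.

```python
import torch

def enhance(X: torch.Tensor, mask: torch.Tensor,
            n_fft: int = 512, hop: int = 256) -> torch.Tensor:
    """Apply the complex masking value element-by-element (step 203) and
    convert the enhanced time-frequency data back to the time domain."""
    Y_hat = X * mask                      # element-wise complex multiplication
    window = torch.hann_window(n_fft)
    return torch.istft(Y_hat, n_fft=n_fft, hop_length=hop, window=window)
```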
In the embodiment of the present application, after the noisy speech data is converted into time-frequency domain feature data, the long-range correlation of the time-frequency domain feature data in the frequency direction can characterize the correlation between noise signals whose frequencies differ greatly from the useful speech signal and the useful speech signal itself. The noise signals and the useful speech signal in the time-frequency domain feature data can therefore be determined based on this long-range correlation so as to generate the masking value of the noisy speech data, and the enhanced speech data of the noisy speech data is generated according to the masking value and the time-frequency domain feature data. Since noise signals with large frequency differences are filtered out of the generated enhanced speech data, the speech enhancement effect can be improved.
In one possible implementation, the time-frequency domain feature data may be input into a speech enhancement model, which identifies the noise signal and the useful speech signal in the time-frequency domain feature data based on the long-range correlation of the data in the frequency direction, yielding correlation feature data capable of indicating the noise signal and the useful speech signal; the correlation feature data is then processed by an activation function to obtain the masking value.
When the noisy speech data is converted into time-frequency domain feature data, it can be converted into time-frequency domain feature data in complex form, and the corresponding speech enhancement model is then a neural network model based on complex values. For example, the complex time-frequency domain feature data can be expressed as

$$V = V_r + jV_i \in \mathbb{C}^{C \times T \times F},$$

where $V$ denotes the time-frequency domain feature data, $V_r$ its real part, $V_i$ its imaginary part, $j$ is the imaginary unit with $j^2 = -1$, $\mathbb{C}$ denotes the complex domain, $C$ denotes the channel dimension of the time-frequency domain feature data, $T$ its frame dimension, and $F$ its frequency dimension.
Because the time-frequency domain feature data is in complex form and the speech enhancement model processing it is a neural network model based on complex values, the correlation feature data output by the speech enhancement model is also in complex form. When the correlation feature data is processed by an activation function, the real part of the correlation feature data can be processed by a real-part activation function and the imaginary part by an imaginary-part activation function, and the outputs of the two activation functions then form the masking value.
Optionally, both the real part activation function and the imaginary part activation function are hyperbolic tangent functions (tanh), and of course, the real part activation function and the imaginary part activation function may also be other types of functions, and the types of the real part activation function and the imaginary part activation function are not limited in the embodiments of the present application.
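An illustrative sketch of the real-part/imaginary-part activation just described, assuming PyTorch complex tensors; the function name is hypothetical.

```python
import torch

def complex_mask(correlation: torch.Tensor) -> torch.Tensor:
    """Bound the complex correlation feature data into a masking value by
    applying tanh separately to the real and imaginary parts."""
    m_r = torch.tanh(correlation.real)    # real-part activation function
    m_i = torch.tanh(correlation.imag)    # imaginary-part activation function
    return torch.complex(m_r, m_i)
```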
In the embodiment of the application, noisy speech data is converted into complex time-frequency domain feature data, the complex data is processed by a speech enhancement model based on a complex-valued neural network to obtain complex correlation feature data, and the real part and the imaginary part of the correlation feature data are then processed by activation functions to obtain the masking value. Performing speech enhancement with a complex-valued neural network allows the time-frequency domain feature data to be processed across its channel, frame, and frequency dimensions using complex-specific operations, so that high-level feature representations of the time-frequency domain feature data can be extracted and speech enhancement performed on their basis, ensuring the speech enhancement effect.
In a possible implementation, the speech enhancement model used for speech enhancement includes an encoder, an attention module, a loop module, and a decoder. After the time-frequency domain feature data is input into the speech enhancement model, the encoder processes the time-frequency domain feature data, the attention module and the loop module each process the output data of the encoder, and the decoder processes the output data of the attention module and of the loop module to obtain correlation feature data indicating the noise signal and the useful speech signal in the time-frequency domain feature data.
In the embodiment of the application, the encoder and decoder are implemented by Convolutional Neural Networks (CNNs) and the loop module by a Recurrent Neural Network (RNN). High-level features of the noisy speech data can be extracted by the encoder and decoder, and the long-term time dependence of the noisy speech data can be modeled by the loop module, so that a speech enhancement model with this convolutional-recurrent structure can effectively filter noise signals out of the time-frequency domain feature data, ensuring the effect of speech enhancement on the noisy speech data.
After the encoder processes the time-frequency domain characteristic data, the attention module processes output data of the encoder, filters interference information in the output data of the encoder, and then inputs the filtered data into a decoder for processing. The attention module processes the output data of the encoder based on an attention mechanism, can filter interference information in the output data of the encoder, and then the decoder processes the filtered data, so that noise signals and useful voice signals in the time-frequency domain characteristic data can be more accurately identified, and the effect of enhancing the voice is further improved.
In one possible implementation, the encoder includes M first convolution loop blocks, M being a natural number greater than or equal to 2, each first convolution loop block including a first convolution unit for a first convolution process and a first loop unit for a first loop process. For the ith first convolution loop block in the encoder, the first convolution unit performs the first convolution process on the input data, and the first loop unit performs the first loop process on the result of the first convolution process to obtain the output data of the block, where 1 ≤ i ≤ M. The output data of each first convolution loop block is input into the attention module, and the output data of the Mth first convolution loop block is input into the loop module.
In an embodiment of the present application, the encoder includes at least two first convolution loop blocks, each of which includes a first convolution unit and a first loop unit; the first convolution unit performs a first convolution process on the data input into the block, and the first loop unit performs a first loop process on the result of the first convolution process. After each first convolution loop block has processed its input data, its output data is input into the attention module, and the output data of the Mth first convolution loop block is input into the loop module. Performing convolution-loop processing on the time-frequency domain feature data through the first convolution loop blocks effectively and comprehensively extracts the high-level features of the time-frequency domain feature data, so that speech enhancement can be performed on the noisy speech data based on the extracted high-level features, ensuring the speech enhancement effect. Because the output data of each first convolution loop block is input into the attention module, which filters out interference information, the input to the decoder contains less interference information, the decoder can accurately recognize the noise signal and the useful speech information, and correlation feature data that accurately characterizes the noise signal and the useful speech signal is generated.
In each first convolution loop block, after the first convolution unit performs convolution processing on the input data, the first loop unit performs loop processing on the output data of the first convolution unit. The first loop unit can loop in the frequency direction, i.e., perform frequency recursion, and thereby extract the long-range correlation of the time-frequency domain feature data along the frequency direction. When the decoder subsequently processes the output data of the encoder, it can identify the noise signal and the useful speech signal based on this long-range correlation and generate correlation feature data that accurately indicates them, thereby ensuring the effect of speech enhancement on the noisy speech data.
In a possible implementation, when i is 1, the input data of the ith first convolution loop block is the time-frequency domain feature data; when i is 2 to M, the input data of the ith first convolution loop block is the output data of the (i-1)th first convolution loop block.
In the embodiment of the present application, the first of the first convolution loop blocks takes the time-frequency domain feature data as input data, and every other first convolution loop block takes the output data of the previous first convolution loop block as input data. Starting from the second first convolution loop block, each block performs convolution-loop processing on the output data of the previous block and inputs the result into the next block, and the last (Mth) first convolution loop block inputs its result into the loop module. Because the encoder comprises a plurality of first convolution loop blocks, each taking the previous block's output as input, the convolution-loop processing of each block can accurately and comprehensively extract the high-level features in the time-frequency domain feature data, so that speech enhancement can be performed on the noisy speech data more accurately.
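A structural sketch of this chaining, assuming PyTorch; the block modules themselves are hypothetical placeholders for the convolution unit plus loop unit described above.

```python
import torch.nn as nn

class Encoder(nn.Module):
    """Chain of M first convolution loop blocks: block i consumes block
    i-1's output; every block's output is kept for the attention module,
    and the last output also feeds the loop module."""
    def __init__(self, blocks: nn.ModuleList):
        super().__init__()
        self.blocks = blocks   # each block: convolution unit + loop unit

    def forward(self, x):
        skips = []
        for block in self.blocks:
            x = block(x)
            skips.append(x)    # routed to the attention module
        return x, skips        # x: input to the loop module
```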
In one possible implementation, when the attention module is used to process the output data of the encoder, the attention module may process the output data of each first convolution loop block separately and input its own output data into the decoder.
The encoder comprises a plurality of first convolution loop blocks. After each first convolution loop block performs convolution-loop processing on its input data, it inputs its output data into the attention module, and the attention module processes the output data of each first convolution loop block separately so as to filter out the interference information in each. The attention module then sends its processing results to the decoder, which generates the correlation feature data from them. Because the data received by the decoder contains less interference information, the noise signal and the useful speech signal in the time-frequency domain feature data can be identified more accurately from the received data, so that correlation feature data accurately indicating the noise signal and the useful speech signal is generated, and the effect of speech enhancement on the noisy speech data can be improved.
In one possible implementation, the loop module includes one or more loop units implemented by a recurrent neural network. When the loop module includes a plurality of loop units, the first loop unit takes the output data of the Mth first convolution loop block as input data, and every other loop unit takes the output data of the previous loop unit as input data. Loop processing of the output data of the Mth first convolution loop block by the loop units effectively extracts the long-term time dependence of the time-frequency domain feature data, so that the decoder can generate the correlation feature data taking this dependence into account; the generated correlation feature data can then accurately indicate the noise signal and the useful speech signal in the time-frequency domain feature data, improving the effect of speech enhancement on the noisy speech data.
Alternatively, the loop module may include loop units implemented by a Feedforward Sequential Memory Network (FSMN). When the time-frequency domain feature data is in complex form and the speech enhancement model is based on a complex-valued neural network, the loop unit is an FSMN capable of processing complex values.
In one possible implementation, corresponding to the encoder with its M first convolution loop blocks, the decoder comprises M second convolution loop blocks, each comprising a second convolution unit for a second convolution process and a second loop unit for a second loop process. When the decoder processes the output data of the attention module and the output data of the loop module to obtain the correlation feature data, then for the ith of the M second convolution loop blocks, the second convolution unit performs the second convolution process on the block's first input data and second input data, and the second loop unit performs the second loop process on the result of the second convolution process to obtain the output data of the block. The output data of the Mth second convolution loop block is the correlation feature data. If i is 1, the first input data of the second convolution unit in the ith second convolution loop block is the output data of the attention module after it processes the output data of the Mth first convolution loop block, and the second input data is the output data of the loop module. If i is 2 to M, the first input data of the second convolution unit in the ith second convolution loop block is the output data of the attention module after it processes the output data of the (M+1-i)th first convolution loop block, and the second input data is the output data of the (i-1)th second convolution loop block.
In this embodiment, the encoder and the decoder have a symmetric structure: the encoder includes M first convolution loop blocks and the decoder includes M second convolution loop blocks. The output data of the ith first convolution loop block is processed by the attention module and then input into the (M+1-i)th second convolution loop block as its first input data; the output data of the loop module is input into the 1st second convolution loop block as its second input data; and starting from the second second convolution loop block, each second convolution loop block takes the output data of the previous second convolution loop block as its second input data. Because the encoder and the decoder adopt a symmetric structure, the time-frequency domain feature data input into the speech enhancement model and the correlation feature data output by the model have the same data structure, so the enhanced speech data of the noisy speech data can be obtained from the correlation feature data output by the decoder, ensuring the speech noise reduction effect.
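A mirror-image sketch of the decoder wiring just described, under the same PyTorch assumptions as the encoder sketch above; the fusion of the two inputs inside each block is left to the hypothetical block module.

```python
import torch.nn as nn

class Decoder(nn.Module):
    """Mirror of the encoder: the ith second convolution loop block fuses
    the attention-filtered output of encoder block M+1-i (first input)
    with the previous decoder block's output (second input)."""
    def __init__(self, blocks: nn.ModuleList, attention: nn.Module):
        super().__init__()
        self.blocks = blocks
        self.attention = attention

    def forward(self, loop_out, skips):
        x = loop_out                                      # loop module output
        for i, block in enumerate(self.blocks, start=1):
            first_input = self.attention(skips[len(skips) - i])
            x = block(first_input, x)
        return x                                          # correlation features
```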
In one possible implementation, the first loop unit in the first convolution loop block and the second loop unit in the second convolution loop block may be implemented by feedforward sequential memory networks. Because a feedforward sequential memory network can effectively perform frequency recursion and thereby capture the long-range correlation of the time-frequency domain feature data in the frequency direction, an encoder and decoder composed of convolution units and feedforward sequential memory networks can capture both the local time-frequency structure of the noisy speech data and the long-range correlation of the spectrum, thereby better distinguishing the noise signal from the useful speech signal and improving the speech enhancement effect.
Speech enhancement model
The following describes a speech enhancement model capable of implementing the speech enhancement method in the embodiment of the present application in detail. Fig. 3 is a schematic diagram of a speech enhancement model provided in an embodiment of the present application. As shown in FIG. 3, the speech enhancement model 31 includes an encoder 311, a decoder 312, an attention module 313, and a loop module 314. The encoder 311 is configured to process time-frequency domain feature data corresponding to noisy speech data, the attention module 313 is configured to process output data of the encoder, the loop module 314 is configured to process output data of the encoder, and the decoder 312 is configured to process output data of the attention module 313 and the loop module 314 to obtain correlation feature data used for characterizing long-range correlation of the time-frequency domain feature data in the frequency direction.
It should be noted that the speech enhancement model 31 may be used in the speech enhancement method in the foregoing embodiment, and is configured to generate the correlation characteristic data according to the time-frequency domain characteristic data, and for the processing process of each module in the speech enhancement model 31 on the data, reference may be made to the description in the foregoing method embodiment, which is not described herein again.
FIG. 4 is a schematic diagram of another speech enhancement model provided in an embodiment of the present application. As shown in fig. 4, the encoder 311 includes M first convolution loop blocks, which are, in sequence, first convolution loop block 3111 through first convolution loop block 311M, the ith being first convolution loop block 311i. The decoder 312 includes M second convolution loop blocks, which are, in sequence, second convolution loop block 3121 through second convolution loop block 312M, the ith being second convolution loop block 312i, where M is a natural number greater than or equal to 2. The loop module 314 includes N complex feedforward sequential memory networks (CFSMNs) connected in series, where N is a natural number; when N is greater than or equal to 2, the N CFSMNs are, in sequence, CFSMN 3141 through CFSMN 314N, the ith being CFSMN 314i. The 1st CFSMN 3141 is connected to the Mth first convolution loop block 311M, and the Nth CFSMN 314N is connected to the 1st second convolution loop block 3121.
The noisy speech data $x$ is converted by a Short-Time Fourier Transform (STFT) 32 into time-frequency domain feature data $X$ in complex-valued form, $X = X_r + jX_i$. After $X$ is input into the speech enhancement model 31, the model processes it and outputs correlation feature data. The real part of the correlation feature data is input into a real-part activation function (tanh) 331 to obtain the real part $\hat{M}_r$ of the masking value, and the imaginary part of the correlation feature data is input into an imaginary-part activation function (tanh) 332 to obtain the imaginary part $\hat{M}_i$; together they constitute the masking value in complex form, $\hat{M} = \hat{M}_r + j\hat{M}_i$. Element-by-element complex multiplication of the time-frequency domain feature data $X$ with the masking value $\hat{M}$ yields the enhanced speech data $\hat{Y}$ in the time-frequency domain, which is converted by an inverse short-time Fourier transform (ISTFT) 34 into the enhanced speech data $\hat{y}$ in the time domain.
Fig. 5 is a schematic diagram of a first convolution loop block according to an embodiment of the present disclosure. As shown in fig. 5, the first convolution loop block 311i includes a convolution unit 501, a normalization unit 502, an activation function 503, and a loop unit 504. The convolution unit 501 is implemented by a complex-valued two-dimensional convolutional neural network. The activation function 503 is the LeakyReLU function. The loop unit 504 is implemented by a CFSMN (complex feedforward sequential memory network).
The input data of the convolution unit 501 is a three-dimensional feature matrix

$$V = V_r + jV_i \in \mathbb{C}^{C \times T \times F},$$

where $V_r$ denotes the real part, $V_i$ the imaginary part, $j$ the imaginary unit with $j^2 = -1$, $\mathbb{C}$ the complex domain, $C$ the channel dimension, $T$ the frame dimension, and $F$ the frequency dimension. The convolution kernel of the convolution unit 501 is $W = W_r + jW_i$, where $W_r$ denotes the real part, $W_i$ the imaginary part, $C'$ the number of convolution kernels, and $T' \times F'$ the size of each kernel. The output data of the convolution unit 501 can be represented as $U = U_r + jU_i \in \mathbb{C}^{C' \times T \times F''}$, whose real part $U_r$ and imaginary part $U_i$ can be expressed as formula (1):

$$U_r = V_r * W_r - V_i * W_i, \qquad U_i = V_r * W_i + V_i * W_r \tag{1}$$

where $*$ denotes real-valued convolution. The convolution unit 501 performs causal convolution with $T' = 2$, a stride of 1, and zero padding in the time direction, and with $F' = 5$, a stride of 2, and no zero padding in the frequency direction, which halves the frequency dimension of the feature map corresponding to the time-frequency domain feature data block by block in the encoder 311 (the reduced frequency dimension is denoted $F''$).

The above process applies to the convolution unit 501 in each first convolution loop block, and the same $C'$ is used in the convolution units 501 of all first convolution loop blocks, ensuring that each block produces feature maps with the same number of channels, e.g., $C' = 128$.
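A sketch of the complex convolution in formula (1), assuming PyTorch; the causal zero padding in the time direction follows the $T' = 2$ setting described above.

```python
import torch.nn as nn
import torch.nn.functional as F

class ComplexConv2d(nn.Module):
    """Complex 2-D convolution per formula (1):
    U_r = V_r*W_r - V_i*W_i,  U_i = V_r*W_i + V_i*W_r."""
    def __init__(self, in_ch, out_ch, kernel=(2, 5), stride=(1, 2)):
        super().__init__()
        self.conv_r = nn.Conv2d(in_ch, out_ch, kernel, stride)  # W_r
        self.conv_i = nn.Conv2d(in_ch, out_ch, kernel, stride)  # W_i

    def forward(self, v_r, v_i):
        # causal zero padding along the time axis only (T' = 2 -> pad 1)
        v_r = F.pad(v_r, (0, 0, 1, 0))
        v_i = F.pad(v_i, (0, 0, 1, 0))
        u_r = self.conv_r(v_r) - self.conv_i(v_i)
        u_i = self.conv_i(v_r) + self.conv_r(v_i)
        return u_r, u_i
```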
The output data of the convolution unit 501 is processed by the normalization unit 502 and the activation function 503 and then input into the loop unit 504. The loop unit 504 applies the same processing to the real part and the imaginary part of its input data $U = U_r + jU_i$; only the processing of the real part is described below, and the processing of the imaginary part proceeds in the same way.

The real part $U_r$ of the data is reshaped from $C' \times T \times F''$ to $T \times F'' \times C'$, and from $U_r \in \mathbb{R}^{T \times F'' \times C'}$ a frequency sequence $S_r = \{ s_{f_1}, s_{f_2}, \ldots, s_{f_{F''}} \}$ can be formed. The frequency sequence is input into the real-part loop unit of the loop unit 504 for processing; for the $i$th component of the frequency sequence, the real-part loop unit computes formulas (2) to (4):

$$p_{f_i}^{\ell} = V^{\ell} s_{f_i}^{\ell} + b^{\ell} \tag{2}$$

$$\tilde{p}_{f_i}^{\ell} = p_{f_i}^{\ell} + \sum_{k=0}^{N_L} a_k^{\ell} \odot p_{f_{i-k}}^{\ell} + \sum_{k=1}^{N_R} c_k^{\ell} \odot p_{f_{i+k}}^{\ell} \tag{3}$$

$$s_{f_i}^{\ell+1} = \delta\!\left(H^{\ell} \tilde{p}_{f_i}^{\ell} + d^{\ell}\right) \tag{4}$$

where $f_i = f_1, f_2, \ldots, f_{F''}$ and the frame index $t$ is omitted, $\delta$ denotes the activation function ReLU, $N_L$ and $N_R$ denote the lookback and lookahead orders for the $i$th component, e.g., $N_L = 20$ and $N_R = 0$, and the remaining symbols are model parameters of the loop unit 504. Since the loop unit 504 in the embodiment of the present application includes only one CFSMN, $\ell$ takes the value 1; $s_{f_i}^{\ell+1}$ is the output of the current real-part loop unit and the input of the next real-part loop unit. For the first real-part loop unit, $S_r$ is the output of the activation function 503. In other embodiments, $\ell$ may take other positive integers, such as 2, 3, or 4.

The output $S_{\text{out}}$ of the loop unit 504 can be expressed as formula (5):

$$S_{\text{out}} = \mathrm{FSMN}_r(S_r) - \mathrm{FSMN}_i(S_i) + j\left(\mathrm{FSMN}_r(S_i) + \mathrm{FSMN}_i(S_r)\right) \tag{5}$$

where $\mathrm{FSMN}_r$ and $\mathrm{FSMN}_i$ denote the real-part and imaginary-part loop units of the loop unit 504, and $S_r$ and $S_i$ denote the real part and the imaginary part of the frequency sequence, respectively.
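A real-valued sketch of the memory computation in formulas (2) to (4), assuming PyTorch; the parameter shapes and ReLU placement follow the reconstruction above and should be read as one plausible reading of the original figure, not the patent's exact parameterization.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FSMNLayer(nn.Module):
    """Real-valued sketch of formulas (2)-(4): a linear projection, a
    weighted memory over N_L lookback and N_R lookahead positions of the
    frequency sequence, and a ReLU output transform."""
    def __init__(self, dim: int, proj: int, n_l: int = 20, n_r: int = 0):
        super().__init__()
        self.project = nn.Linear(dim, proj)                    # formula (2)
        self.taps = nn.Parameter(torch.randn(n_l + 1 + n_r, proj))
        self.out = nn.Linear(proj, dim)                        # formula (4)
        self.n_l, self.n_r = n_l, n_r

    def forward(self, s: torch.Tensor) -> torch.Tensor:
        # s: [T, F'', C'] -- one frequency sequence per frame
        p = self.project(s)
        p_pad = F.pad(p, (0, 0, self.n_l, self.n_r))           # pad F'' axis
        # formula (3): p~ = p + weighted sum of neighbouring components
        memory = sum(self.taps[k] * p_pad[:, k:k + p.shape[1], :]
                     for k in range(self.taps.shape[0]))
        return torch.relu(self.out(p + memory))
```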
After the output $S_{\text{out}} \in \mathbb{C}^{T \times F'' \times C'}$ of the loop unit 504 in the first convolution loop block 311M enters the loop module 314, the real part of the data is reshaped into $Q_r \in \mathbb{R}^{T \times H}$, where $H = F'' \times C'$, so that the CFSMN can process $Q_r$ two-dimensionally. Based on $Q_r$, a time sequence $Q_r = \{ q_1, q_2, \ldots, q_T \}$ is formed, and each CFSMN in the loop module 314 then applies the processing of formulas (2) to (4) above to $Q_r$ to fit the temporal dynamics. The output of the loop module 314 takes the form of formula (5) above.
It should be noted that the first convolution loop block processes the real part and the imaginary part of its input data similarly, and the loop module 314 processes its input data similarly to the loop unit in the first convolution loop block. Therefore, only the processing of the real part of the input data by the first convolution loop block has been described in detail in this embodiment of the application; for the processing of the imaginary part by the first convolution loop block and the processing of the input data by the loop module 314, reference may be made to that description.
It should be further noted that the structure of the decoder 312 is symmetrical to the structure of the encoder 311, so the process of processing the input data by the second convolution loop block 312i in the decoder 312 is similar to the process of processing the input data by the first convolution loop block 311i, and therefore the data processing process of the second convolution loop block 312i is not repeated, and the detailed process may refer to the above description of the first convolution loop block 311 i.
Speech enhancement model training method
For the speech enhancement model shown in fig. 3, an embodiment of the present application further provides a method for training the speech enhancement model. Fig. 6 is a schematic flowchart of a model training method provided in an embodiment of the present application, for training a speech enhancement model in the foregoing embodiment. As shown in fig. 6, the model training method includes the following steps:
step 601, inputting the time-frequency domain characteristic data of the training sample into the model to be trained.
The training samples may be formed from clean speech data and noise data. A time-domain training sample is converted to the time-frequency domain to obtain the time-frequency domain feature data of the training sample, for example, by a short-time Fourier transform (STFT).
Step 602, determining a first loss value of the model to be trained.
After the time-frequency domain feature data of the training sample is input into the model to be trained, the output result of the model to be trained is obtained, and the first loss value is obtained according to this output result. The first loss value of the model to be trained is determined based on at least one of a scale-invariant signal-to-noise ratio loss (SI-SNR loss), a masking loss, and a real-imaginary spectral loss.
In one possible implementation, the first loss value of the model to be trained can be expressed as formula (6):

$$\mathcal{L}_1 = \mathcal{L}_{\text{SI-SNR}}(y, \hat{y}) + \lambda_1 \mathcal{L}_{\text{mask}}(M, \hat{M}) + \lambda_2 \mathcal{L}_{\text{RI}}(Y, \hat{Y}) \tag{6}$$

where $\mathcal{L}_1$ denotes the first loss value, $\mathcal{L}_{\text{SI-SNR}}$ the signal-to-noise ratio loss, $\mathcal{L}_{\text{mask}}$ the masking loss, and $\mathcal{L}_{\text{RI}}$ the real-imaginary spectral loss; $y$ denotes the time-domain feature data of the clean speech in the training sample, $\hat{y}$ the time-domain enhanced speech data determined from the output of the speech enhancement model, $M$ the true masking value determined from the clean speech data in the training sample, $\hat{M}$ the masking value determined from the output of the speech enhancement model, $Y$ the time-frequency domain feature data of the clean speech in the training sample, $\hat{Y}$ the time-frequency domain enhanced speech data determined from the correlation feature data output by the speech enhancement model, and $\lambda_1$ and $\lambda_2$ preset coefficients.

The signal-to-noise ratio loss $\mathcal{L}_{\text{SI-SNR}}$ can be calculated by the formula

$$\mathcal{L}_{\text{SI-SNR}} = -10 \log_{10} \frac{\lVert y_{\text{target}} \rVert_2^2}{\lVert \hat{y} - y_{\text{target}} \rVert_2^2}, \qquad y_{\text{target}} = \frac{\langle \hat{y}, y \rangle}{\lVert y \rVert_2^2}\, y,$$

where $\lVert \cdot \rVert_2$ denotes the L2 norm and $\langle \cdot, \cdot \rangle$ denotes the dot product. The masking loss $\mathcal{L}_{\text{mask}}$ can be calculated by the formula $\mathcal{L}_{\text{mask}} = \lVert M - \hat{M} \rVert_2^2$, and the real-imaginary spectral loss $\mathcal{L}_{\text{RI}}$ by the formula $\mathcal{L}_{\text{RI}} = \lVert Y_r - \hat{Y}_r \rVert_2^2 + \lVert Y_i - \hat{Y}_i \rVert_2^2$.
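The SI-SNR term could be computed as follows, a sketch assuming 1-D time-domain PyTorch tensors and an added small epsilon for numerical stability.

```python
import torch

def si_snr_loss(y_hat: torch.Tensor, y: torch.Tensor,
                eps: float = 1e-8) -> torch.Tensor:
    """SI-SNR loss between enhanced speech y_hat and clean speech y
    (1-D time-domain tensors), per the formula above."""
    y_target = (torch.dot(y_hat, y) / (y.pow(2).sum() + eps)) * y
    e_noise = y_hat - y_target
    ratio = y_target.pow(2).sum() / (e_noise.pow(2).sum() + eps)
    return -10.0 * torch.log10(ratio + eps)
```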
Step 603, adjusting the model parameters of the model to be trained according to the first loss value until the first loss value meets a preset first condition.
After each time the time-frequency domain feature data of the training sample is input into the model to be trained, a first loss value of the model to be trained is obtained, and whether the first loss value meets a preset first condition is judged. If the first loss value meets the preset first condition, step 604 is executed; if not, the model parameters of the model to be trained are adjusted according to the first loss value, and step 601 is executed again until the first loss value meets the preset first condition.
The first condition may be that the first loss value is smaller than a preset first threshold value, that the first loss value is not decreased any more, or the like.
And step 604, determining a second loss value of the model to be trained.
After the first loss value meets the preset first condition, a second loss value of the model to be trained is determined according to the output result of the model to be trained. The second loss value of the model to be trained is determined based on at least one of the first loss value, an adversarial loss, and a depth feature loss.
In one possible implementation, the second loss value of the model to be trained can be expressed as formula (7):

$$\mathcal{L}_2 = \mathcal{L}_1 + \lambda_3 \mathcal{L}_{\text{adv}} + \lambda_4 \mathcal{L}_{\text{DF}} \tag{7}$$

where $\mathcal{L}_2$ denotes the second loss value, $\mathcal{L}_{\text{adv}}$ the adversarial loss, $\mathcal{L}_{\text{DF}}$ the depth feature loss, and $\lambda_3$ and $\lambda_4$ preset coefficients.

To determine the second loss value, an auxiliary network having the same model structure as the model to be trained is constructed. The structure of the auxiliary network is shown in fig. 7: it includes an encoder 701, a loop module 702, a decoder 703, and a linear projection module 704, from which the adversarial loss and the depth feature loss can be determined.

A discriminator $D$ is formed by the encoder 701 and the linear projection module 704 of the auxiliary network. The enhanced speech data $\hat{Y}$ in the time-frequency domain, determined from the correlation feature data output by the speech enhancement model, is input into the discriminator $D$, and the adversarial loss is determined from the output of the linear projection module 704, which can be calculated by the formula

$$\mathcal{L}_{\text{adv}} = \mathbb{E}\!\left[ \left( D(\hat{Y}) - 1 \right)^2 \right],$$

where $\mathbb{E}[\cdot]$ denotes the expected-value operation and $D(\hat{Y})$ denotes the output of the linear projection module 704. The time-frequency domain feature data $Y$ of the clean speech in the training sample and $\hat{Y}$ are input into the discriminator $D$, and the depth feature loss is determined from the outputs of the encoder 701, which can be calculated by the formula

$$\mathcal{L}_{\text{DF}} = \sum_{l=1}^{L_D} \left\lVert D_l(Y) - D_l(\hat{Y}) \right\rVert_2^2,$$

where $L_D$ denotes the number of first convolution loop blocks in the encoder 701, $D_l(\hat{Y})$ denotes the output of the $l$th first convolution loop block in the encoder 701 for $\hat{Y}$, and $D_l(Y)$ denotes the output of the $l$th first convolution loop block for $Y$.

In order to ensure that the discriminator $D$ has sufficient spectral information, a reconstruction network $C$ is formed by the encoder 701, the loop module 702, and the decoder 703 of the auxiliary network; its loss value is denoted $\mathcal{L}_C$, and the loss value of the discriminator $D$ is defined as formula (8):

$$\mathcal{L}_D = \mathbb{E}\!\left[ \left( D(Y) - 1 \right)^2 \right] + \mathbb{E}\!\left[ D(\hat{Y})^2 \right] + \lambda_5 \mathcal{L}_C \tag{8}$$

where $\lambda_5$ denotes a preset coefficient.
During the training of the model to be trained, the model parameters of the reconstruction network C, the discriminator D, and the model to be trained are trained alternately.
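A hedged sketch of the adversarial and depth feature losses above; the function and argument names are hypothetical, and the discriminator's encoder blocks and linear projection are passed in as callables.

```python
import torch

def adversarial_and_df_losses(encoder_blocks, project, Y_hat, Y):
    """Run the clean and enhanced spectra through the discriminator's
    encoder blocks, compare block-wise features (depth feature loss),
    and score the final projection (adversarial loss)."""
    feats_hat, feats_clean = [], []
    h_hat, h = Y_hat, Y
    for block in encoder_blocks:            # first convolution loop blocks
        h_hat, h = block(h_hat), block(h)
        feats_hat.append(h_hat)             # D_l(Y_hat)
        feats_clean.append(h)               # D_l(Y)
    l_adv = (project(h_hat) - 1.0).pow(2).mean()      # E[(D(Y_hat)-1)^2]
    l_df = sum((a - b).pow(2).sum()
               for a, b in zip(feats_clean, feats_hat))
    return l_adv, l_df
```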
Step 605, adjusting the model parameters of the model to be trained according to the second loss value until the second loss value meets a preset second condition, so as to obtain the speech enhancement model.
After each second loss value of the model to be trained is obtained, whether the second loss value meets a preset second condition is judged. If the second loss value meets the preset second condition, training is finished and the speech enhancement model is obtained. If not, the model parameters of the model to be trained are adjusted according to the second loss value, and step 601 is executed again until the second loss value meets the preset second condition, at which point model training is completed and the speech enhancement model is obtained.
In the embodiment of the application, the training process of the speech enhancement model is divided into two stages: the first stage trains the speech enhancement model with the first loss value as a reference, and the second stage trains it with the second loss value as a reference. The first loss value is determined according to at least one of the signal-to-noise ratio loss, the masking loss and the real-imaginary part spectrum loss; it indicates the signal-level loss of the voice signal, and training the voice enhancement model based on the first loss value allows the model to converge rapidly, improving the model training effect. The second loss value is determined according to at least one of the first loss value, the adversarial loss and the depth feature loss; training the voice enhancement model according to the second loss value improves the perceptual quality of its output, ensuring the voice enhancement effect of the voice enhancement model. A sketch of this two-stage criterion follows.
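This minimal sketch of the second-stage objective reuses `adversarial_loss` and `depth_feature_loss` from the sketch above; the plain L1 spectral error standing in for the first loss value and the weights `lam3`/`lam4` are placeholders, since the source does not fix them:

```python
import torch

def signal_level_loss(y_hat, y):
    # Placeholder first-stage criterion: a plain L1 spectral error stands in
    # for the SNR / masking / real-imaginary part spectrum terms.
    return torch.mean(torch.abs(y_hat - y))

def second_loss(y_hat, y, disc_proj, disc_feats, lam3=1.0, lam4=1.0):
    # Second-stage criterion: first loss value plus weighted adversarial
    # and depth feature terms.
    return (signal_level_loss(y_hat, y)
            + lam3 * adversarial_loss(disc_proj, y_hat)
            + lam4 * depth_feature_loss(disc_feats, y_hat, y))
```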
Speech recognition method
For an application scenario of the speech enhancement scheme provided in the embodiment of the present application in speech recognition, the embodiment of the present application provides a speech recognition method, as shown in fig. 8, the speech recognition method includes the following steps:
step 801, acquiring voice data to be processed, wherein the voice data to be processed includes noise, and the voice data to be processed includes one of the following: audio and video conference voice data, online education voice data and network live broadcast voice data;
step 802, converting the voice data to be processed into time-frequency domain characteristic data;
step 803, generating a masking value of the voice data to be processed according to the long-range correlation of the time-frequency domain characteristic data in the frequency direction;
step 804, generating enhanced voice data of the voice data to be processed according to the masking value and the time-frequency domain characteristic data;
step 805, performing voice recognition on the enhanced voice data to obtain a recognition result.
It should be noted that, for the specific speech enhancement process in the application scenario of the embodiment shown in fig. 8, reference may be made to the description in the foregoing embodiments, which is not repeated here.
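To make the flow of steps 801-805 concrete, here is a minimal end-to-end sketch; `enhance_model` (the trained speech enhancement model, assumed to return a complex mask) and `recognizer` (any downstream speech recognition system) are hypothetical stand-ins, and the STFT parameters are illustrative:

```python
import torch

def recognize(waveform, enhance_model, recognizer, n_fft=512, hop_length=256):
    window = torch.hann_window(n_fft)
    # Step 802: noisy waveform -> complex time-frequency feature data.
    spec = torch.stft(waveform, n_fft=n_fft, hop_length=hop_length,
                      window=window, return_complex=True)
    # Steps 803-804: predict a masking value and apply it to the features.
    mask = enhance_model(spec)
    enhanced_spec = mask * spec
    # Synthesize the enhanced voice data back to the time domain.
    enhanced = torch.istft(enhanced_spec, n_fft=n_fft, hop_length=hop_length,
                           window=window, length=waveform.shape[-1])
    # Step 805: speech recognition on the enhanced voice data.
    return recognizer(enhanced)
```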
Speech enhancement device
Corresponding to the above method embodiments, fig. 9 shows a schematic diagram of a speech enhancement apparatus, the speech enhancement apparatus comprising:
a converting unit 901, configured to convert the voice data with noise into time-frequency domain characteristic data;
a processing unit 902, configured to generate a masking value of the voice data with noise according to the long-range correlation of the time-frequency domain feature data in the frequency direction;
a generating unit 903, configured to generate enhanced speech data of the speech data with noise according to the masking value and the time-frequency domain feature data.
It should be noted that the speech enhancement apparatus of the present embodiment is used to implement the corresponding speech enhancement method in the foregoing method embodiments, and has the beneficial effects of the corresponding method embodiments, which are not described herein again.
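As an illustration of how the processing unit 902 and the generating unit 903 cooperate, the following sketch derives a complex masking value from the correlation feature data and applies it to the time-frequency features; `tanh` is an assumption, since the description only specifies that the real part and the imaginary part are processed through an activation function:

```python
import torch

def masking_value(corr_feats: torch.Tensor) -> torch.Tensor:
    # Processing unit 902: activate real and imaginary parts separately,
    # then recombine them into a complex masking value.
    return torch.complex(torch.tanh(corr_feats.real),
                         torch.tanh(corr_feats.imag))

def enhanced_features(mask: torch.Tensor, tf_feats: torch.Tensor) -> torch.Tensor:
    # Generating unit 903: element-wise complex masking of the
    # time-frequency domain feature data.
    return mask * tf_feats
```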
Electronic device
Fig. 10 is a schematic block diagram of an electronic device according to an embodiment of the present application, and a specific embodiment of the present application does not limit a specific implementation of the electronic device. As shown in fig. 10, the electronic device may include: a processor (processor)1002, a Communications Interface 1004, a memory 1006, and a Communications bus 1008. Wherein:
the processor 1002, communication interface 1004, and memory 1006 communicate with each other via a communication bus 1008.
A communication interface 1004 for communicating with other electronic devices or servers.
The processor 1002 is configured to execute the program 1010, and may specifically execute the relevant steps in any of the foregoing speech enhancement method embodiments.
In particular, the program 1010 may include program code that includes computer operating instructions.
The processor 1002 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present application. The electronic device may comprise one or more processors, which may be of the same type, such as one or more CPUs, or of different types, such as one or more CPUs and one or more ASICs.
The memory 1006 is used for storing the program 1010. The memory 1006 may comprise high-speed RAM memory, and may also include non-volatile memory (non-volatile memory), such as at least one disk memory.
The program 1010 may be specifically adapted to cause the processor 1002 to execute a speech enhancement method in any of the embodiments described above.
For specific implementation of each step in the program 1010, reference may be made to corresponding steps and corresponding descriptions in units in any of the foregoing speech enhancement method embodiments, which are not described herein again. It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described devices and modules may refer to the corresponding process descriptions in the foregoing method embodiments, and are not described herein again.
Through the electronic equipment provided by the embodiment of the application, after the voice data with noise is converted into the time-frequency domain characteristic data, the long-range correlation of the time-frequency domain characteristic data in the frequency direction can represent the correlation between noise signals and useful voice signals whose frequencies differ greatly in the noisy voice data. Noise signals with a large frequency difference from the useful voice signal can therefore be identified based on this long-range correlation, and the masking value of the noisy voice data is generated accordingly. The enhanced voice data is then generated according to the masking value and the time-frequency domain characteristic data, so that noise signals with a large frequency difference are filtered out of the enhanced voice data and the voice enhancement effect is improved.
Computer storage medium
The present application also provides a computer-readable storage medium storing instructions for causing a machine to perform a speech enhancement method as described herein. Specifically, a system or an apparatus equipped with a storage medium on which software program codes that realize the functions of any of the above-described embodiments are stored may be provided, and a computer (or a CPU or MPU) of the system or the apparatus is caused to read out and execute the program codes stored in the storage medium.
In this case, the program code itself read from the storage medium can realize the functions of any of the above-described embodiments, and thus the program code and the storage medium storing the program code constitute a part of the present application.
Examples of the storage medium for supplying the program code include a floppy disk, a hard disk, a magneto-optical disk, an optical disk (e.g., CD-ROM, CD-R, CD-RW, DVD-ROM, DVD-RAM, DVD-RW, DVD + RW), a magnetic tape, a nonvolatile memory card, and a ROM. Alternatively, the program code may be downloaded from a server computer via a communications network.
Computer program product
Embodiments of the present application further provide a computer program product, which includes computer instructions for instructing a computing device to perform operations corresponding to any of the above method embodiments.
It should be noted that, according to the implementation requirement, each component/step described in the embodiment of the present application may be divided into more components/steps, and two or more components/steps or partial operations of the components/steps may also be combined into a new component/step to achieve the purpose of the embodiment of the present application.
The above-described methods according to embodiments of the present application may be implemented in hardware or firmware, or as software or computer code that can be stored in a recording medium such as a CD-ROM, a RAM, a floppy disk, a hard disk, or a magneto-optical disk, or as computer code originally stored in a remote recording medium or a non-transitory machine-readable medium, downloaded through a network, and stored in a local recording medium, so that the methods described herein can be processed by such software, stored on a recording medium, using a general-purpose computer, a dedicated processor, or programmable or dedicated hardware such as an ASIC or FPGA. It will be appreciated that a computer, processor, microprocessor controller, or programmable hardware includes memory components (e.g., RAM, ROM, flash memory, etc.) that can store or receive software or computer code that, when accessed and executed by the computer, processor, or hardware, implements the methods described herein. Further, when a general-purpose computer accesses code for implementing the methods illustrated herein, execution of the code transforms the general-purpose computer into a special-purpose computer for performing the methods illustrated herein.
Those of ordinary skill in the art will appreciate that the various illustrative elements and method steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the embodiments of the present application.
The above embodiments are only used for illustrating the embodiments of the present application, and not for limiting the embodiments of the present application, and those skilled in the relevant art can make various changes and modifications without departing from the spirit and scope of the embodiments of the present application, so that all equivalent technical solutions also belong to the scope of the embodiments of the present application, and the scope of patent protection of the embodiments of the present application should be defined by the claims.

Claims (14)

1. A method of speech enhancement comprising:
converting the voice data with noise into time-frequency domain characteristic data;
generating a masking value of the voice data with noise according to the long-range correlation of the time-frequency domain characteristic data in the frequency direction;
and generating the enhanced voice data of the voice data with noise according to the masking value and the time-frequency domain characteristic data.
2. The speech enhancement method according to claim 1, wherein the generating of the masking value of the noisy speech data from the long-range correlation of the time-frequency domain feature data in the frequency direction comprises:
inputting the time-frequency domain feature data into a voice enhancement model to obtain correlation feature data output by the voice enhancement model;
and respectively processing the real part and the imaginary part of the correlation characteristic data through an activation function to obtain the masking value.
3. The speech enhancement method of claim 2 wherein the speech enhancement model comprises an encoder, an attention module, a loop module, and a decoder,
the inputting the time-frequency domain feature data into a speech enhancement model to obtain the correlation feature data output by the speech enhancement model includes:
processing the time-frequency domain feature data by using an encoder;
processing the output data of the encoder by using the attention module and the loop module respectively, wherein the attention module is used for filtering interference information in the output data of the encoder;
and processing the output data of the attention module and the output data of the loop module by using the decoder to obtain the correlation characteristic data.
4. The speech enhancement method according to claim 3, wherein the encoder includes M first convolution loop blocks, M being a natural number greater than or equal to 2, each first convolution loop block including a first convolution unit for first convolution processing and a first loop unit for first loop processing,
the processing the time-frequency domain feature data using an encoder includes:
for the ith first convolution loop block, performing first convolution processing on input data by using the first convolution unit in the first convolution loop block, and performing first loop processing on the result of the first convolution processing by using the first loop unit in the first convolution loop block to obtain output data of the first convolution loop block, wherein 1 ≤ i ≤ M;
the output data of each first convolution loop block is input to the attention module, and the output data of the Mth first convolution loop block is input to the loop module.
5. The speech enhancement method according to claim 4,
if the value of i is 1, the input data of the ith first convolution loop block is the time-frequency domain characteristic data;
if the value of i is 2 to M, the input data of the ith first convolution loop block is the output data of the (i-1)th first convolution loop block.
6. The speech enhancement method of claim 4, wherein processing the output data of the encoder using the attention module comprises:
and processing the output data of each first convolution loop block by using the attention module, and inputting the output data of the attention module to the decoder.
7. The speech enhancement method of claim 4, wherein processing the output data of the encoder using the loop module comprises:
and processing the output data of the Mth first convolution loop block by using the loop module, and inputting the output data of the loop module to the decoder.
8. The speech enhancement method of claim 4, wherein the decoder comprises M second convolution loop blocks, each second convolution loop block comprising a second convolution unit for second convolution processing and a second loop unit for second loop processing,
the processing, by the decoder, the output data of the attention module and the output data of the loop module to obtain the correlation characteristic data includes:
for the ith second convolution loop block, performing second convolution processing on first input data and second input data by using the second convolution unit in the second convolution loop block, and performing second loop processing on the result of the second convolution processing by using the second loop unit in the second convolution loop block to obtain output data of the second convolution loop block; wherein the output data of the Mth second convolution loop block is the correlation characteristic data,
if the value of i is 1, the first input data of the second convolution unit in the ith second convolution loop block is the output data of the attention module after processing the output data of the Mth first convolution loop block, and the second input data of the second convolution unit in the ith second convolution loop block is the output data of the loop module;
and if the value of i is 2 to M, the first input data of the second convolution unit in the ith second convolution loop block is the output data of the attention module after processing the output data of the (M+1-i)th first convolution loop block, and the second input data of the second convolution unit in the ith second convolution loop block is the output data of the (i-1)th second convolution loop block.
9. A speech enhancement model comprising: an encoder, an attention module, a loop module, and a decoder;
the encoder is used for processing the time-frequency domain characteristic data corresponding to the noisy speech data;
the attention module is used for processing the output data of the encoder;
the loop module is used for processing the output data of the encoder;
and the decoder is used for processing the output data of the attention module and the loop module to obtain correlation characteristic data for representing the long-range correlation of the time-frequency domain characteristic data in the frequency direction.
10. The speech enhancement model of claim 9, wherein the speech enhancement model is obtained by training as follows:
inputting the time-frequency domain characteristic data of the training sample into a model to be trained;
determining a first loss value of the model to be trained, wherein the first loss value is determined according to at least one of signal-to-noise ratio loss, masking loss and real-imaginary spectral loss;
adjusting the model parameters of the model to be trained according to the first loss value until the first loss value meets a preset first condition;
determining a second loss value of the model to be trained, wherein the second loss value is determined according to at least one of the first loss value, the adversarial loss and the depth feature loss;
and adjusting the model parameters of the model to be trained according to the second loss value until the second loss value meets a preset second condition, so as to obtain the voice enhancement model.
11. A speech recognition method comprising:
acquiring voice data to be processed, wherein the voice data to be processed comprises noise, and the voice data to be processed comprises one of the following: audio and video conference voice data, online education voice data and network live broadcast voice data;
converting the voice data to be processed into time-frequency domain characteristic data;
generating a masking value of the voice data to be processed according to the long-range correlation of the time-frequency domain characteristic data in the frequency direction;
generating enhanced voice data of the voice data to be processed according to the masking value and the time-frequency domain characteristic data;
and carrying out voice recognition on the enhanced voice data to obtain a recognition result.
12. An electronic device, comprising: a processor, a memory, a communication interface and a communication bus, wherein the processor, the memory and the communication interface complete mutual communication through the communication bus;
the memory is configured to store at least one executable instruction that causes the processor to perform operations corresponding to the speech enhancement method of any one of claims 1-8 or the speech recognition method of claim 11.
13. A computer storage medium having stored thereon a computer program which, when executed by a processor, implements a speech enhancement method according to any one of claims 1 to 8 or a speech recognition method according to claim 11.
14. A computer program product comprising computer instructions to instruct a computing device to perform operations corresponding to the speech enhancement method of any of claims 1-8 or the speech recognition method of claim 11.
CN202210022926.5A 2022-01-10 2022-01-10 Speech enhancement model, electronic device, storage medium, and related methods Pending CN114333895A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210022926.5A CN114333895A (en) 2022-01-10 2022-01-10 Speech enhancement model, electronic device, storage medium, and related methods

Publications (1)

Publication Number Publication Date
CN114333895A true CN114333895A (en) 2022-04-12

Family

ID=81027551

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210022926.5A Pending CN114333895A (en) 2022-01-10 2022-01-10 Speech enhancement model, electronic device, storage medium, and related methods

Country Status (1)

Country Link
CN (1) CN114333895A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114999508A (en) * 2022-07-29 2022-09-02 之江实验室 Universal speech enhancement method and device by using multi-source auxiliary information
CN114999508B (en) * 2022-07-29 2022-11-08 之江实验室 Universal voice enhancement method and device by utilizing multi-source auxiliary information
CN115273819A (en) * 2022-09-28 2022-11-01 深圳比特微电子科技有限公司 Sound event detection model establishing method and device and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination