CN113284507A - Training method and device of voice enhancement model and voice enhancement method and device


Info

Publication number
CN113284507A
Authority
CN
China
Prior art keywords
noise
mask ratio
noise data
data
reference scene
Legal status
Granted
Application number
CN202110529546.6A
Other languages
Chinese (zh)
Other versions
CN113284507B (en)
Inventor
张新
郑羲光
张晨
Current Assignee
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Beijing Dajia Internet Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Dajia Internet Information Technology Co Ltd filed Critical Beijing Dajia Internet Information Technology Co Ltd
Priority to CN202110529546.6A priority Critical patent/CN113284507B/en
Publication of CN113284507A publication Critical patent/CN113284507A/en
Application granted
Publication of CN113284507B publication Critical patent/CN113284507B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00: Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208: Noise filtering
    • G10L21/0216: Noise filtering characterised by the method used for estimating noise
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T: CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T90/00: Enabling technologies or technologies with a potential or indirect contribution to GHG emissions mitigation

Landscapes

  • Engineering & Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Quality & Reliability (AREA)
  • Noise Elimination (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The present disclosure relates to a training method and apparatus for a speech enhancement model, and a speech enhancement method and apparatus, wherein the speech enhancement model includes a noise mask ratio prediction network and a noise type discrimination network, and the training method includes: acquiring a noisy speech sample, wherein the noisy speech sample is formed by mixing a speaker speech sample and at least one scene noise data; inputting reference scene noise data in at least one scene noise data into a noise type discrimination network to obtain noise type characteristics of the reference scene noise data, wherein the reference scene noise data is scene noise data expected to be removed; inputting the amplitude spectrum and the noise type characteristics of the noise-containing voice sample into a noise mask ratio prediction network to obtain an estimated noise mask ratio of reference scene noise data; calculating a loss function based on the estimated noise mask ratio and the noise type characteristics; and adjusting parameters of a noise mask ratio prediction network and a noise type discrimination network through the calculated loss function, and training the voice enhancement model.

Description

Training method and device of voice enhancement model and voice enhancement method and device
Technical Field
The present disclosure relates to the field of audio technologies, and in particular, to a method and an apparatus for training a speech enhancement model and a method and an apparatus for speech enhancement.
Background
A noisy environment degrades the quality of voice communication. In current mainstream communication software, different speech enhancement algorithms are usually adopted to process noisy audio during a call, and traditional methods can handle steady-state noise. However, a common speech enhancement algorithm removes all noise in a scene and retains only the human voice, whereas the types of noise that users need to remove differ from scene to scene; therefore, a common speech enhancement algorithm cannot achieve speech enhancement for a specific scene.
Disclosure of Invention
The present disclosure provides a method and an apparatus for training a speech enhancement model, and a method and an apparatus for speech enhancement, so as to solve at least the problems in the related art described above, although the present disclosure is not required to solve any of the problems described above.
According to a first aspect of the embodiments of the present disclosure, there is provided a training method for a speech enhancement model, where the speech enhancement model includes a noise mask ratio prediction network and a noise type discrimination network, the training method including: acquiring a noisy speech sample, wherein the noisy speech sample is formed by mixing a speaker speech sample and at least one scene noise data; inputting reference scene noise data in the at least one scene noise data into the noise type discrimination network to obtain noise type characteristics of the reference scene noise data, wherein the reference scene noise data is scene noise data which is expected to be removed in the at least one scene noise data, and the speech enhancement model is used for obtaining an estimated speech enhancement signal obtained after removing the reference scene noise data from the noisy speech sample; inputting the amplitude spectrum of the noisy speech sample and the noise type characteristic into the noise mask ratio prediction network to obtain an estimated noise mask ratio of the reference scene noise data, wherein the noise mask ratio represents a ratio of the amplitude spectrum of the reference scene noise data to the amplitude spectrum of the noisy speech sample; calculating a loss function based on the estimated noise mask ratio of the reference scene noise data and the noise type characteristics of the reference scene noise data; and adjusting parameters of the noise mask ratio prediction network and the noise type discrimination network through the calculated loss function, and training the voice enhancement model.
Optionally, the inputting the amplitude spectrum of the noisy speech sample and the noise type characteristic into the noise mask ratio prediction network to obtain an estimated noise mask ratio of the reference scene noise data may include: connecting the amplitude spectrum of the noise-containing voice sample with the noise type characteristic in series; and inputting the features after series connection into the noise mask ratio prediction network to obtain the estimated noise mask ratio of the reference scene noise data.
Optionally, the inputting the amplitude spectrum of the noisy speech sample and the noise type characteristic into the noise mask ratio prediction network to obtain an estimated noise mask ratio of the reference scene noise data may include: inputting the amplitude spectrum of the noise-containing voice sample into a part of network in the noise mask ratio prediction network to obtain the local characteristics of the amplitude spectrum of the noise-containing voice sample; concatenating the local features with the noise type features; inputting the features after the series connection into the other part of the noise mask ratio prediction network to obtain the estimated noise mask ratio of the reference scene noise data.
Alternatively, the noise mask ratio prediction network may be a convolutional recurrent neural network including a convolutional neural network and a recurrent neural network.
Alternatively, a portion of the noise mask ratio prediction network may be a convolutional neural network of the convolutional recurrent neural network, and another portion of the noise mask ratio prediction network may be a recurrent neural network of the convolutional recurrent neural network.
Optionally, each scene noise data of the at least one scene noise data may have a true noise type tag vector; wherein calculating a loss function based on the estimated noise mask ratio of the reference scene noise data and the noise type characteristics of the reference scene noise data may include: obtaining an estimated amplitude spectrum of the voice enhancement signal by multiplying the compensated estimated noise mask ratio of the reference scene noise data with the amplitude spectrum of the noise-containing voice sample; obtaining an estimated noise type label vector by passing the noise type characteristics through a full connection layer; calculating a loss function based on the estimated magnitude spectrum of the speech enhancement signal, the magnitude spectrum of a target speech enhancement signal, the estimated noise type tag vector, and the true noise type tag vector of the reference scene noise data, wherein the target speech enhancement signal is the signal of the noisy speech sample after the reference scene noise data is removed.
Optionally, the calculating a loss function based on the estimated magnitude spectrum of the speech enhancement signal, the magnitude spectrum of the target speech enhancement signal, the estimated noise type tag vector, and the true noise type tag vector of the reference scene noise data may include: calculating a mean square error loss function based on the estimated magnitude spectrum of the speech enhancement signal and the magnitude spectrum of the target speech enhancement signal; calculating a cross entropy loss function based on the estimated noise type label vector and a true noise type label vector of the reference scene noise data; and summing the mean square error loss function and the cross entropy loss function to obtain the loss function.
Alternatively, the loss function may be expressed as:
loss = L_MSE(Mag_est, Mag_tar) + L_CE(ẑ, z), with L_CE = -Σ_{i=1}^{M} z_i · log(ẑ_i)
wherein Mag_est represents the magnitude spectrum of the estimated speech enhancement signal, Mag_tar represents the magnitude spectrum of the target speech enhancement signal, L_MSE represents the mean square error loss function, M represents the total number of scene noise types, i represents the traversal index, ẑ represents the estimated noise type label vector, z represents the true noise type label vector of the reference scene noise data, and L_CE represents the cross entropy loss function.
According to a second aspect of the embodiments of the present disclosure, there is provided a speech enhancement method performed based on a speech enhancement model including a noise mask ratio prediction network and a noise type discrimination network, the speech enhancement method including: acquiring a noisy speech signal to be enhanced and reference scene noise data, wherein the noisy speech signal to be enhanced comprises a speaker speech signal and at least one scene noise data, and the reference scene noise data is scene noise data which is expected to be removed from the at least one scene noise data; inputting the reference scene noise data into the noise type discrimination network to obtain the noise type characteristics of the reference scene noise data; inputting the amplitude spectrum of the noisy speech signal to be enhanced and the noise type characteristic into the noise mask ratio prediction network to obtain an estimated noise mask ratio of the reference scene noise data, wherein the noise mask ratio represents a ratio of the amplitude spectrum of the reference scene noise data to the amplitude spectrum of the noisy speech signal; obtaining an estimated speech enhancement signal based on the estimated noise mask ratio of the reference scene noise data and the noisy speech signal to be enhanced, wherein the estimated speech enhancement signal is the estimated speech enhancement signal obtained after removing the reference scene noise data from the noisy speech signal to be enhanced.
Optionally, the inputting the amplitude spectrum of the noisy speech signal to be enhanced and the noise type feature into the noise mask ratio prediction network to obtain an estimated noise mask ratio of the reference scene noise data may include: connecting the amplitude spectrum of the noisy speech signal to be enhanced with the noise type characteristic in series; and inputting the features after series connection into the noise mask ratio prediction network to obtain the estimated noise mask ratio of the reference scene noise data.
Optionally, the inputting the amplitude spectrum of the noisy speech signal to be enhanced and the noise type feature into the noise mask ratio prediction network to obtain an estimated noise mask ratio of the reference scene noise data may include: inputting the amplitude spectrum of the noisy speech signal to be enhanced into a part of network in the noise mask ratio prediction network to obtain the local characteristics of the amplitude spectrum of the noisy speech sample; concatenating the local features with the noise type features; inputting the features after the series connection into the other part of the noise mask ratio prediction network to obtain the estimated noise mask ratio of the reference scene noise data.
Alternatively, the noise mask ratio prediction network may be a convolutional recurrent neural network including a convolutional neural network and a recurrent neural network.
Alternatively, a portion of the noise mask ratio prediction network may be a convolutional neural network of the convolutional recurrent neural network, and another portion of the noise mask ratio prediction network may be a recurrent neural network of the convolutional recurrent neural network.
Optionally, the obtaining an estimated speech enhancement signal based on the estimated noise mask ratio of the reference scene noise data and the noisy speech signal to be enhanced may include: obtaining the amplitude spectrum of the estimated voice enhancement signal by multiplying the compensated estimated noise mask ratio of the reference scene noise data with the amplitude spectrum of the noise-containing voice signal to be enhanced; and combining the amplitude spectrum of the estimated voice enhancement signal with the phase spectrum of the noisy voice signal to be enhanced and executing time-frequency inverse transformation to obtain the estimated voice enhancement signal.
Alternatively, the reference scene noise data may be a scene noise segment that is pre-recorded in the environment where the speaker is located and is desired to be removed.
Optionally, the speech enhancement model is trained by using a training method of the speech enhancement model according to the present disclosure.
According to a third aspect of the embodiments of the present disclosure, there is provided a training apparatus for a speech enhancement model, the speech enhancement model including a noise mask ratio prediction network and a noise type discrimination network, the training apparatus including: a sample acquisition unit configured to: acquiring a noisy speech sample, wherein the noisy speech sample is formed by mixing a speaker speech sample and at least one scene noise data; a noise type estimation unit configured to: inputting reference scene noise data in the at least one scene noise data into the noise type discrimination network to obtain noise type characteristics of the reference scene noise data, wherein the reference scene noise data is scene noise data which is expected to be removed in the at least one scene noise data, and the speech enhancement model is used for obtaining an estimated speech enhancement signal obtained after removing the reference scene noise data from the noisy speech sample; a noise mask ratio estimation unit configured to: inputting the amplitude spectrum of the noisy speech sample and the noise type characteristic into the noise mask ratio prediction network to obtain an estimated noise mask ratio of the reference scene noise data, wherein the noise mask ratio represents a ratio of the amplitude spectrum of the reference scene noise data to the amplitude spectrum of the noisy speech sample; a loss function calculation unit configured to: calculating a loss function based on the estimated noise mask ratio of the reference scene noise data and the noise type characteristics of the reference scene noise data; a model training unit configured to: and adjusting parameters of the noise mask ratio prediction network and the noise type discrimination network through the calculated loss function, and training the voice enhancement model.
Alternatively, the noise mask ratio estimation unit may be configured to: connecting the amplitude spectrum of the noise-containing voice sample with the noise type characteristic in series; and inputting the features after series connection into the noise mask ratio prediction network to obtain the estimated noise mask ratio of the reference scene noise data.
Alternatively, the noise mask ratio estimation unit may be configured to: inputting the amplitude spectrum of the noise-containing voice sample into a part of network in the noise mask ratio prediction network to obtain the local characteristics of the amplitude spectrum of the noise-containing voice sample; concatenating the local features with the noise type features; inputting the features after the series connection into the other part of the noise mask ratio prediction network to obtain the estimated noise mask ratio of the reference scene noise data.
Alternatively, the noise mask ratio prediction network may be a convolutional recurrent neural network including a convolutional neural network and a recurrent neural network.
Alternatively, a portion of the noise mask ratio prediction network may be a convolutional neural network of the convolutional recurrent neural network, and another portion of the noise mask ratio prediction network may be a recurrent neural network of the convolutional recurrent neural network.
Optionally, each scene noise data of the at least one scene noise data may have a true noise type tag vector; wherein the loss function calculation unit may be configured to: obtaining an estimated amplitude spectrum of the voice enhancement signal by multiplying the compensated estimated noise mask ratio of the reference scene noise data with the amplitude spectrum of the noise-containing voice sample; obtaining an estimated noise type label vector by passing the noise type characteristics through a full connection layer; calculating a loss function based on the estimated magnitude spectrum of the speech enhancement signal, the magnitude spectrum of a target speech enhancement signal, the estimated noise type tag vector, and the true noise type tag vector of the reference scene noise data, wherein the target speech enhancement signal is the signal of the noisy speech sample after the reference scene noise data is removed.
Optionally, the loss function calculation unit may be configured to: calculating a mean square error loss function based on the estimated magnitude spectrum of the speech enhancement signal and the magnitude spectrum of the target speech enhancement signal; calculating a cross entropy loss function based on the estimated noise type label vector and a true noise type label vector of the reference scene noise data; and summing the mean square error loss function and the cross entropy loss function to obtain the loss function.
Alternatively, the loss function may be expressed as:
loss = L_MSE(Mag_est, Mag_tar) + L_CE(ẑ, z), with L_CE = -Σ_{i=1}^{M} z_i · log(ẑ_i)
wherein Mag_est represents the magnitude spectrum of the estimated speech enhancement signal, Mag_tar represents the magnitude spectrum of the target speech enhancement signal, L_MSE represents the mean square error loss function, M represents the total number of scene noise types, i represents the traversal index, ẑ represents the estimated noise type label vector, z represents the true noise type label vector of the reference scene noise data, and L_CE represents the cross entropy loss function.
According to a fourth aspect of the embodiments of the present disclosure, there is provided a speech enhancement apparatus that performs an operation based on a speech enhancement model including a noise mask ratio prediction network and a noise type discrimination network, the speech enhancement apparatus including: a data acquisition unit configured to: acquiring a noisy speech signal to be enhanced and reference scene noise data, wherein the noisy speech signal to be enhanced comprises a speaker speech signal and at least one scene noise data, and the reference scene noise data is scene noise data which is expected to be removed from the at least one scene noise data; a noise type estimation unit configured to: inputting the reference scene noise data into the noise type discrimination network to obtain the noise type characteristics of the reference scene noise data; a noise mask ratio estimation unit configured to: inputting the amplitude spectrum of the noisy speech signal to be enhanced and the noise type characteristic into the noise mask ratio prediction network to obtain an estimated noise mask ratio of the reference scene noise data, wherein the noise mask ratio represents a ratio of the amplitude spectrum of the reference scene noise data to the amplitude spectrum of the noisy speech signal; a speech enhancement unit configured to: obtaining an estimated speech enhancement signal based on the estimated noise mask ratio of the reference scene noise data and the noisy speech signal to be enhanced, wherein the estimated speech enhancement signal is the estimated speech enhancement signal obtained after removing the reference scene noise data from the noisy speech signal to be enhanced.
Alternatively, the noise mask ratio estimation unit may be configured to: connecting the amplitude spectrum of the noisy speech signal to be enhanced with the noise type characteristic in series; and inputting the features after series connection into the noise mask ratio prediction network to obtain the estimated noise mask ratio of the reference scene noise data.
Alternatively, the noise mask ratio estimation unit may be configured to: inputting the amplitude spectrum of the noisy speech signal to be enhanced into a part of network in the noise mask ratio prediction network to obtain the local characteristics of the amplitude spectrum of the noisy speech sample; concatenating the local features with the noise type features; inputting the features after the series connection into the other part of the noise mask ratio prediction network to obtain the estimated noise mask ratio of the reference scene noise data.
Alternatively, the noise mask ratio prediction network may be a convolutional recurrent neural network including a convolutional neural network and a recurrent neural network.
Alternatively, a portion of the noise mask ratio prediction network may be a convolutional neural network of the convolutional recurrent neural network, and another portion of the noise mask ratio prediction network may be a recurrent neural network of the convolutional recurrent neural network.
Optionally, the speech enhancement unit may be configured to: obtaining the amplitude spectrum of the estimated voice enhancement signal by multiplying the compensated estimated noise mask ratio of the reference scene noise data with the amplitude spectrum of the noise-containing voice signal to be enhanced; and combining the amplitude spectrum of the estimated voice enhancement signal with the phase spectrum of the noisy voice signal to be enhanced and executing time-frequency inverse transformation to obtain the estimated voice enhancement signal.
Alternatively, the reference scene noise data may be a scene noise segment that is pre-recorded in the environment where the speaker is located and is desired to be removed.
Alternatively, the speech enhancement model may be trained using a training method of the speech enhancement model according to the present disclosure.
According to a fifth aspect of embodiments of the present disclosure, there is provided an electronic apparatus including: at least one processor; at least one memory storing computer-executable instructions, wherein the computer-executable instructions, when executed by the at least one processor, cause the at least one processor to perform a training method or a speech enhancement method of a speech enhancement model according to the present disclosure.
According to a sixth aspect of embodiments of the present disclosure, there is provided a computer-readable storage medium, wherein instructions, when executed by at least one processor, cause the at least one processor to perform a method of training a speech enhancement model or a method of speech enhancement according to the present disclosure.
According to a seventh aspect of embodiments of the present disclosure, there is provided a computer program product comprising computer instructions which, when executed by at least one processor, implement a training method or a speech enhancement method of a speech enhancement model according to the present disclosure.
The technical scheme provided by the embodiment of the disclosure at least brings the following beneficial effects:
According to the training method and apparatus of the speech enhancement model and the speech enhancement method and apparatus of the present disclosure, specific scene noise can be input into the speech enhancement model as an auxiliary vector, so that the specific scene noise can be removed in the corresponding scene. In this way, when multiple kinds of scene noise are present, the specific scene noise can be removed according to the needs of the user, achieving the speech enhancement effect desired by the user.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure and are not to be construed as limiting the disclosure.
FIG. 1 is an overall system diagram illustrating a speech enhancement model according to an exemplary embodiment of the present disclosure.
FIG. 2 is a flowchart illustrating a method of training a speech enhancement model according to an exemplary embodiment of the present disclosure.
Fig. 3 is a schematic diagram illustrating a structure of a noise type discrimination network according to an exemplary embodiment of the present disclosure.
Fig. 4 is a diagram illustrating a noise mask ratio obtained using a noise mask ratio prediction network according to an exemplary embodiment of the present disclosure.
Fig. 5 is a schematic diagram illustrating a noise mask ratio obtained using a noise mask ratio prediction network according to another exemplary embodiment of the present disclosure.
Fig. 6 is a flowchart illustrating a voice enhancement method according to an exemplary embodiment of the present disclosure.
FIG. 7 is a block diagram illustrating a training apparatus of a speech enhancement model according to an exemplary embodiment of the present disclosure.
Fig. 8 is a block diagram illustrating a speech enhancement apparatus according to an exemplary embodiment of the present disclosure.
Fig. 9 is a block diagram of an electronic device 900 according to an example embodiment of the present disclosure.
Detailed Description
In order to make the technical solutions of the present disclosure better understood by those of ordinary skill in the art, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are capable of operation in sequences other than those illustrated or otherwise described herein. The embodiments described in the following examples do not represent all embodiments consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
In the present disclosure, the expression "at least one of the items" covers three parallel cases: "any one of the items", "a combination of any plurality of the items", and "all of the items". For example, "including at least one of A and B" covers the following three parallel cases: (1) including A; (2) including B; (3) including A and B. As another example, "performing at least one of step one and step two" covers the following three parallel cases: (1) performing step one; (2) performing step two; (3) performing step one and step two.
Regardless of the scene, a conventional speech enhancement algorithm removes all noise and keeps only human speech. For example, in a short-video shooting scene containing speech, singing and noise, the user needs to remove the noise while keeping both the speech and the singing; a traditional speech enhancement algorithm, however, removes the noise together with the singing and keeps only the speech, so the desired speech enhancement effect is not achieved.
In order to solve the technical problem, the present disclosure provides a scene-based speech enhancement algorithm, and in particular, the present disclosure provides a speech enhancement model, which uses specific scene noise as an auxiliary vector to be input into the speech enhancement model, so as to remove the specific scene noise in a specific scene, thereby removing the specific scene noise according to user requirements under the condition of including multiple scene noises, and achieving a speech enhancement effect desired by a user. Hereinafter, a training method and apparatus of a speech enhancement model and a speech enhancement method and apparatus according to an exemplary embodiment of the present disclosure will be described in detail with reference to fig. 1 to 9.
FIG. 1 is an overall system diagram illustrating a speech enhancement model according to an exemplary embodiment of the present disclosure.
Referring to fig. 1, a speech enhancement model according to the present disclosure may include a noise mask ratio prediction network for predicting a noise mask ratio of reference scene noise desired to be removed in noisy speech, and a noise type discrimination network for predicting a type of reference scene noise desired to be removed in noisy speech.
Specifically, during the training phase, noisy speech samples may be obtained, which may be formed by mixing a speaker speech sample with at least one scene noise data. The reference scene noise data desired to be removed in the at least one scene noise data may be input to a noise type discrimination network, resulting in noise type characteristics (i.e., an embedding vector) of the reference scene noise data. A time-frequency transform (e.g., Short-Time Fourier Transform, STFT) may be performed on the noisy speech sample to obtain a magnitude spectrum and a phase spectrum. The noise type characteristics of the reference scene noise data can be used as an auxiliary vector and input into a noise mask ratio prediction network together with the amplitude spectrum of the noisy speech sample, so as to obtain a predicted noise mask ratio (mask). The complement of the predicted noise mask ratio can be point-multiplied with the magnitude spectrum of the noisy speech to obtain the estimated magnitude spectrum of the speech enhancement signal. A predicted noise type label may be derived based on the noise type characteristics of the reference scene noise data. A mean square error loss function (MSE loss) may be calculated based on the estimated magnitude spectrum of the speech enhancement signal and the magnitude spectrum of the target speech enhancement signal (i.e., the speech signal after removing the reference scene noise data from the noisy speech samples), and a cross entropy loss function (cross entropy loss) may be calculated based on the predicted noise type label and the true noise type label of the reference scene noise data. The mean square error loss function and the cross entropy loss function are summed to obtain the final loss function (loss), which is used to jointly train the noise mask ratio prediction network and the noise type discrimination network in the speech enhancement model.
In the inference stage, the reference scene noise desired to be removed may be input to the noise type discrimination network to obtain the noise type feature (i.e., the embedding vector) of the reference scene noise data. A time-frequency transform (e.g., Short-Time Fourier Transform, STFT) may be performed on the noisy speech to obtain a magnitude spectrum and a phase spectrum. The noise type characteristics of the reference scene noise data can be used as an auxiliary vector and input into the noise mask ratio prediction network together with the magnitude spectrum of the noisy speech to obtain a predicted noise mask ratio (mask). The estimated speech with the reference scene noise data removed from the noisy speech may be obtained as the enhanced speech by point-multiplying the complemented predicted noise mask ratio with the magnitude spectrum of the noisy speech, combining the result with the phase spectrum of the noisy speech, and performing an inverse time-frequency transform (e.g., Inverse Short-Time Fourier Transform, ISTFT).
By utilizing the voice enhancement model disclosed by the invention, the specific scene noise in the noisy voice can be removed by means of the fragment of the specific scene noise, so that different voice enhancement effects based on different scenes are realized, and the user experience is improved.
FIG. 2 is a flowchart illustrating a method of training a speech enhancement model according to an exemplary embodiment of the present disclosure. Here, as described above, the speech enhancement model may include a noise mask ratio prediction network for predicting a noise mask ratio of reference scene noise desired to be removed in the noisy speech, and a noise type discrimination network for predicting a type of reference scene noise desired to be removed in the noisy speech. That is, the speech enhancement model may be used to obtain an estimated speech enhancement signal derived after removing the reference scene noise data from the noisy speech.
Referring to fig. 2, at step 201, a noisy speech sample may be obtained, wherein the noisy speech sample is formed by mixing a speaker speech sample with at least one scene noise data. Here, the types of scenes may include subway stations, cafes, buses, streets, and the like. Scene noise data may be obtained by recording in different scenes (e.g., subway stations, coffee shops, buses, streets, etc.) using a recording device. The speaker voice samples may be obtained from a voice data set or by recording voice segments of different speakers.
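For illustration only, the following Python sketch shows one way step 201 could be realized: mixing a speaker speech clip with one or more scene noise clips at a chosen signal-to-noise ratio. The function name mix_at_snr, the 5 dB SNR, and the equal-length trimming are assumptions of this sketch; the patent does not prescribe a particular mixing procedure.

```python
import numpy as np

def mix_at_snr(speech: np.ndarray, noises: list[np.ndarray], snr_db: float = 5.0) -> np.ndarray:
    """Mix a speaker speech clip with one or more scene noise clips at a target SNR.

    `speech` and each entry of `noises` are 1-D float arrays at the same sample rate;
    the SNR value and the equal-length trimming are illustrative choices, not from the patent.
    """
    length = min(len(speech), *(len(n) for n in noises))
    speech = speech[:length]
    noise = np.sum([n[:length] for n in noises], axis=0)  # combine all scene noises

    speech_power = np.mean(speech ** 2) + 1e-12
    noise_power = np.mean(noise ** 2) + 1e-12
    # Scale the noise so that 10 * log10(speech_power / scaled_noise_power) equals snr_db.
    gain = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + gain * noise

# Example: 1 s of speech mixed with two scene noises at 5 dB SNR (random stand-ins).
sr = 16000
speech = np.random.randn(sr).astype(np.float32)
noises = [np.random.randn(sr).astype(np.float32), np.random.randn(sr).astype(np.float32)]
noisy_sample = mix_at_snr(speech, noises, snr_db=5.0)
```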
In step 202, reference scene noise data in the at least one scene noise data may be input into the noise type discrimination network, so as to obtain a noise type characteristic of the reference scene noise data. Here, the reference scene noise data may be one of the at least one scene noise data, and is the scene noise data desired to be removed. In addition, the noise type feature is a fixed-dimension vector, which may also be referred to as an auxiliary vector, for describing scene noise type information. For example, the reference scene noise data may be represented as s, the noise type discrimination network may be represented as M_sv, and the noise type feature may be expressed as embedding; thus, the process of step 202 may be expressed as the following equation (1).
embedding = M_sv(s) (1)
According to an exemplary embodiment of the present disclosure, the input of the noise type discrimination network may be Mel-Frequency Cepstral Coefficients (MFCCs) of the reference scene noise data. For example, one implementation of a noise type discrimination network may be a three-layer long short-term memory (LSTM) network. Fig. 3 is a schematic diagram illustrating a structure of a noise type discrimination network according to an exemplary embodiment of the present disclosure. The MFCCs of the reference scene noise data can be input into three stacked LSTM layers (LSTM1, LSTM2 and LSTM3), the hidden state output by the last layer, LSTM3, is taken, and the auxiliary vector embedding is obtained through one fully connected layer (dense). Of course, the noise type determination network is not limited to the above network or model, and may be any other network that may implement noise type determination, which is not limited by the present disclosure.
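A minimal PyTorch sketch of a noise type discrimination network of the kind shown in fig. 3: three stacked LSTM layers over the MFCC sequence of the reference scene noise, with the last layer's final hidden state projected through a fully connected layer to the auxiliary vector embedding. The class name NoiseTypeNet, the hidden size, and the embedding dimension are illustrative assumptions, not values given by the patent.

```python
import torch
import torch.nn as nn

class NoiseTypeNet(nn.Module):
    """Noise type discrimination network sketch: 3-layer LSTM + fully connected layer."""

    def __init__(self, n_mfcc: int = 40, hidden: int = 256, emb_dim: int = 128):
        super().__init__()
        # Three stacked LSTM layers over the MFCC frames of the reference scene noise.
        self.lstm = nn.LSTM(input_size=n_mfcc, hidden_size=hidden,
                            num_layers=3, batch_first=True)
        # Fully connected layer mapping the last hidden state to the auxiliary vector.
        self.dense = nn.Linear(hidden, emb_dim)

    def forward(self, mfcc: torch.Tensor) -> torch.Tensor:
        # mfcc: (batch, frames, n_mfcc)
        _, (h_n, _) = self.lstm(mfcc)      # h_n: (num_layers, batch, hidden)
        embedding = self.dense(h_n[-1])    # take the hidden state of the last layer
        return embedding                   # (batch, emb_dim)

# Example: embedding for a 100-frame MFCC sequence of reference scene noise.
net = NoiseTypeNet()
embedding = net(torch.randn(1, 100, 40))   # shape (1, 128)
```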
Referring back to fig. 2, in step 203, the magnitude spectrum of the noisy speech sample and the noise type feature may be input into the noise mask ratio prediction network to obtain an estimated noise mask ratio of the reference scene noise data.
According to an exemplary embodiment of the present disclosure, the magnitude spectrum of a noisy speech sample may be obtained by time-frequency transformation. For example, a noisy speech sample may be transformed from the time domain to the frequency domain by the Short-Time Fourier Transform (STFT), and the amplitude information and phase information of each frame of the audio signal are obtained, so as to obtain the amplitude spectrum and the phase spectrum of the noisy speech sample. For example, if a mixed signal of length T containing several types of noise is x(t) in the time domain, the reference scene noise signal is w(t), and the speech after removing the reference scene noise is y(t), where t represents time and 0 < t ≤ T, then the noisy speech signal x(t) can be expressed in the time domain as the following formula (2).
x(t) = w(t) + y(t) (2)
The noisy speech signal x(t), after the short-time Fourier transform, can be expressed in the time-frequency domain as the following formula (3).
X(n,k)=W(n,k)+Y(n,k) (3)
wherein n is the frame index, 0 < n ≤ N, and N is the total number of frames; k is the frequency bin index, 0 < k ≤ K, and K is the total number of frequency points.
After the frequency-domain noisy signal X(n, k) is obtained, the amplitude spectrum Mag_ori and the phase spectrum Pha_ori of the noisy signal can be obtained, and can be expressed as the following formula (4).
Mag(n, k) = abs(X(n, k)), Pha(n, k) = angle(X(n, k)) (4)
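A short sketch of the time-frequency analysis in formulas (2)-(4) using torch.stft; the FFT size, hop length, and Hann window are illustrative choices, not values prescribed by the patent.

```python
import torch

def magnitude_and_phase(x: torch.Tensor, n_fft: int = 512, hop: int = 256):
    """STFT of a noisy waveform x(t), returning Mag(n, k) and Pha(n, k) as in formula (4)."""
    window = torch.hann_window(n_fft)
    X = torch.stft(x, n_fft=n_fft, hop_length=hop, window=window, return_complex=True)
    mag = X.abs()      # Mag(n, k) = abs(X(n, k))
    pha = X.angle()    # Pha(n, k) = angle(X(n, k))
    return mag, pha

# Example on one second of (random stand-in) noisy speech at 16 kHz.
x = torch.randn(16000)
mag_ori, pha_ori = magnitude_and_phase(x)   # shapes: (n_fft // 2 + 1, frames)
```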
After the amplitude spectrum of the noisy signal is obtained, the amplitude spectrum Mag(n, k) of the noisy speech and the noise type characteristic embedding may be input into the noise mask ratio prediction network M_se to obtain an estimated noise mask ratio mask_w of the reference scene noise data. That is, the above process can be expressed as the following formula (5).
mask_w = M_se(Mag, embedding) (5)
According to an exemplary embodiment of the present disclosure, the noise mask ratio of the reference scene noise data may be a ratio of the magnitude spectrum of the reference scene noise data to the magnitude spectrum of the noisy speech signal. Specifically, the noise mask ratio of the reference scene noise data may be a gain matrix with the same dimensions as the magnitude spectrum of the reference scene noise data, in which the value of each element lies in [0, 1]. Point-multiplying the noise mask ratio of the reference scene noise data with the amplitude spectrum of the noisy speech signal and performing an inverse time-frequency transform yields the reference scene noise data.
According to an exemplary embodiment of the present disclosure, the magnitude spectrum and the noise type characteristics of a noisy speech sample may be concatenated; and inputting the features after series connection into a noise mask ratio prediction network to obtain an estimated noise mask ratio of the reference scene noise data. According to another exemplary embodiment of the present disclosure, the magnitude spectrum of a noisy speech sample is input into a part of a network in a noise mask ratio prediction network to obtain a local feature of the magnitude spectrum of the noisy speech sample; connecting the local features and the noise type features in series; and inputting the features after the series connection into the other part of the noise mask ratio prediction network to obtain the estimated noise mask ratio of the reference scene noise data.
According to an exemplary embodiment of the present disclosure, the noise mask ratio prediction network may be a Convolutional Recurrent Neural Network (CRNN) including a Convolutional Neural Network (CNN) and a Recurrent Neural Network (RNN). Fig. 4 is a diagram illustrating a noise mask ratio obtained using a noise mask ratio prediction network according to an exemplary embodiment of the present disclosure. Referring to fig. 4, the magnitude spectrum (Mag) of the noisy speech sample and the noise type feature (embedding) may be concatenated (Concat), the concatenated feature may be input to the noise mask ratio prediction network, the CNN may be used to extract features, the features may be sent to the RNN to establish a context relationship, and the estimated noise mask ratio mask_w may be output. Fig. 5 is a schematic diagram illustrating a noise mask ratio obtained using a noise mask ratio prediction network according to another exemplary embodiment of the present disclosure. Referring to fig. 5, the magnitude spectrum (Mag) of a noisy speech sample may be input into a part of the network (e.g., CNN) in the noise mask ratio prediction network to obtain a local feature of the magnitude spectrum of the noisy speech sample, the local feature and the noise type feature (embedding) are concatenated (Concat), the concatenated feature is input into another part of the noise mask ratio prediction network (e.g., RNN), and an estimated noise mask ratio mask_w is obtained. Of course, the noise mask ratio prediction network is not limited to the above network or model, but may be any other network that may implement noise mask ratio prediction, and the disclosure is not limited thereto.
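A PyTorch sketch of the fig. 5 variant: a convolutional front end extracts local features from the magnitude spectrum, the auxiliary vector embedding is concatenated to each frame's features, and a recurrent part outputs a per-bin mask in [0, 1]. The class name MaskPredictor, the layer sizes, the choice of a GRU, and the sigmoid output are assumptions of this sketch rather than details given by the patent.

```python
import torch
import torch.nn as nn

class MaskPredictor(nn.Module):
    """CRNN noise mask ratio prediction sketch: CNN over Mag, concat embedding, RNN, per-bin mask."""

    def __init__(self, n_bins: int = 257, emb_dim: int = 128, channels: int = 16, hidden: int = 256):
        super().__init__()
        # Convolutional part: extract local time-frequency features from the magnitude spectrum.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, channels, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1), nn.ReLU(),
        )
        # Recurrent part: consumes local features concatenated with the noise type embedding.
        self.rnn = nn.GRU(channels * n_bins + emb_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_bins)

    def forward(self, mag: torch.Tensor, embedding: torch.Tensor) -> torch.Tensor:
        # mag: (batch, frames, n_bins), embedding: (batch, emb_dim)
        b, t, _ = mag.shape
        feats = self.cnn(mag.unsqueeze(1))                   # (b, channels, frames, n_bins)
        feats = feats.permute(0, 2, 1, 3).reshape(b, t, -1)  # flatten per frame
        emb = embedding.unsqueeze(1).expand(b, t, -1)        # repeat embedding for every frame
        rnn_out, _ = self.rnn(torch.cat([feats, emb], dim=-1))
        return torch.sigmoid(self.out(rnn_out))              # mask_w in [0, 1], (b, frames, n_bins)

# Example: predict a mask for a 100-frame magnitude spectrum with a 128-dim embedding.
mask_w = MaskPredictor()(torch.randn(1, 100, 257), torch.randn(1, 128))
```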
Referring back to fig. 2, at step 204, a loss function may be calculated based on the estimated noise mask ratio of the reference scene noise data and the noise type characteristics of the reference scene noise data.
According to an exemplary embodiment of the present disclosure, the estimated magnitude spectrum of the speech enhancement signal is obtained by multiplying the compensated estimated noise mask ratio of the reference scene noise data with the magnitude spectrum of the noisy speech sample. For example, this process can be expressed as the following equation (6).
Mag_est = Mag_ori ⊙ (1 - mask_w) (6)
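In code, formula (6) is simply an element-wise product of the complemented mask with the noisy magnitude spectrum; a tiny sketch with stand-in tensor shapes:

```python
import torch

# Formula (6): Mag_est = Mag_ori ⊙ (1 - mask_w), an element-wise (Hadamard) product.
mag_ori = torch.rand(1, 100, 257)      # noisy magnitude spectrum (stand-in shape: batch, frames, bins)
mask_w = torch.rand(1, 100, 257)       # estimated noise mask ratio, each element in [0, 1]
mag_est = mag_ori * (1.0 - mask_w)     # estimated magnitude spectrum of the speech enhancement signal
```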
Furthermore, an estimated noise type label vector may be obtained by passing the noise type features through a fully connected layer. That is, the noise type features may be projected through a fully connected layer to a fixed-dimension vector ẑ. Thus, the loss function may be calculated based on the estimated magnitude spectrum Mag_est of the speech enhancement signal, the magnitude spectrum Mag_tar of the target speech enhancement signal, the estimated noise type label vector ẑ, and the true noise type label vector z of the reference scene noise data. Here, the target speech enhancement signal is the signal obtained after the reference scene noise data has been removed from the noisy speech sample. Further, each of the at least one scene noise data has a true noise type label vector, e.g., a one-hot vector.
For example, a mean square error loss function may be calculated based on the estimated magnitude spectrum of the speech enhancement signal and the magnitude spectrum of the target speech enhancement signal. For example, the mean square error loss function can be expressed as the following equation (7).
L_MSE = (1/(N·K)) · Σ_{n,k} (Mag_est(n, k) - Mag_tar(n, k))^2 (7)
A cross entropy loss function may be calculated based on the estimated noise type label vector and a true noise type label vector of the reference scene noise data. For example, the cross entropy loss function can be expressed as the following equation (8).
L_CE = -Σ_{i=1}^{M} z_i · log(ẑ_i) (8)
The mean square error loss function and the cross entropy loss function may be summed to obtain a final loss function. For example, the final loss function can be expressed as the following equation (9).
loss = L_MSE + L_CE (9)
wherein Mag_est represents the magnitude spectrum of the estimated speech enhancement signal, Mag_tar represents the magnitude spectrum of the target speech enhancement signal, L_MSE represents the mean square error loss function, M represents the total number of scene noise types, i represents the traversal index, ẑ represents the estimated noise type label vector, z represents the true noise type label vector of the reference scene noise data, and L_CE represents the cross entropy loss function.
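A sketch of the loss in formulas (6)-(9): mean square error between the estimated and target magnitude spectra plus cross entropy between the predicted and true noise types. Using an integer class index for the true label is a convenience of F.cross_entropy and stands in for the one-hot vector z; the tensor shapes and M = 4 noise types are assumptions of this sketch.

```python
import torch
import torch.nn.functional as F

def total_loss(mask_w, mag_ori, mag_tar, noise_logits, true_noise_idx):
    """loss = L_MSE(Mag_est, Mag_tar) + L_CE(ẑ, z), following formulas (6)-(9)."""
    mag_est = mag_ori * (1.0 - mask_w)                # formula (6): complemented mask times noisy magnitude
    l_mse = F.mse_loss(mag_est, mag_tar)              # formula (7): mean square error loss
    l_ce = F.cross_entropy(noise_logits, true_noise_idx)  # formula (8): cross entropy, class-index target
    return l_mse + l_ce                               # formula (9): sum of the two terms

# Example with stand-in tensors: batch of 1, 100 frames, 257 bins, M = 4 noise types.
loss = total_loss(torch.rand(1, 100, 257), torch.rand(1, 100, 257),
                  torch.rand(1, 100, 257), torch.randn(1, 4), torch.tensor([2]))
```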
In step 205, parameters of the noise mask ratio prediction network and the noise type discrimination network may be adjusted through the calculated loss function, and the speech enhancement model is trained. That is, the value of the loss function in equation (9) may be back-propagated to jointly adjust the parameters of the noise mask ratio prediction network and the noise type discrimination network.
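A sketch of one training iteration for step 205, assuming the NoiseTypeNet, MaskPredictor, and total_loss sketches above; a single Adam optimizer holds the parameters of both networks (and of the fully connected classifier producing ẑ), so that back-propagating the loss adjusts them jointly. The learning rate and dimensions are illustrative.

```python
import torch

# Assumes NoiseTypeNet, MaskPredictor and total_loss from the sketches above.
noise_net, mask_net = NoiseTypeNet(), MaskPredictor()
classifier = torch.nn.Linear(128, 4)   # fully connected layer producing the noise type logits ẑ
optimizer = torch.optim.Adam(list(noise_net.parameters())
                             + list(mask_net.parameters())
                             + list(classifier.parameters()), lr=1e-3)

def train_step(mfcc_ref, mag_ori, mag_tar, true_noise_idx):
    """One joint update of the noise type discrimination and noise mask ratio prediction networks."""
    embedding = noise_net(mfcc_ref)                    # step 202: noise type characteristic
    mask_w = mask_net(mag_ori, embedding)              # step 203: estimated noise mask ratio
    loss = total_loss(mask_w, mag_ori, mag_tar,        # step 204: formulas (6)-(9)
                      classifier(embedding), true_noise_idx)
    optimizer.zero_grad()
    loss.backward()                                    # step 205: back-propagate to both networks
    optimizer.step()
    return loss.item()
```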
Fig. 6 is a flowchart illustrating a voice enhancement method according to an exemplary embodiment of the present disclosure. Here, the speech enhancement method according to the exemplary embodiment of the present disclosure may be implemented based on the speech enhancement model according to the present disclosure. A speech enhancement model according to the present disclosure may include a noise mask ratio prediction network and a noise type discrimination network. For example, a speech enhancement model according to the present disclosure may be trained by a training method using a speech enhancement model according to the present disclosure.
Referring to fig. 6, in step 601, a noisy speech signal to be enhanced including a speaker speech signal and at least one scene noise data, and reference scene noise data, which is scene noise data desired to be removed, among the at least one scene noise data, may be acquired. For example, the types of scenes may include subway stations, cafes, buses, streets, and so forth. For example, the reference scene noise data may be a scene noise segment that the user desires to remove, which is pre-recorded under the environment where the speaker is located.
In step 602, the reference scene noise data may be input into the noise type discrimination network, so as to obtain a noise type characteristic of the reference scene noise data. For example, the noise type discrimination network may be, but is not limited to, a structure as shown in fig. 3.
Here, the present disclosure does not limit the execution order of steps 601 and 602. For example, in step 601, reference scene noise data may be acquired before acquiring the noisy speech signal to be enhanced, or the noisy speech signal to be enhanced may be acquired before acquiring the reference scene noise data, or the noisy speech signal to be enhanced and the reference scene noise data may be acquired simultaneously. For another example, the reference scene noise data may be obtained in step 601, the noise type feature may be obtained in step 602, and the noisy speech signal to be enhanced may be obtained in step 601.
In step 603, the magnitude spectrum of the noisy speech signal to be enhanced and the noise type feature may be input into the noise mask ratio prediction network to obtain an estimated noise mask ratio of the reference scene noise data, where the noise mask ratio may represent a ratio of the magnitude spectrum of the reference scene noise data to the magnitude spectrum of the noisy speech signal.
According to an exemplary embodiment of the present disclosure, the amplitude spectrum of the noisy speech signal to be enhanced and the noise type feature may be concatenated; and inputting the features after series connection into the noise mask ratio prediction network to obtain the estimated noise mask ratio of the reference scene noise data. According to another exemplary embodiment of the present disclosure, the magnitude spectrum of the noisy speech signal to be enhanced may be input to a part of the network in the noise mask ratio prediction network, to obtain a local feature of the magnitude spectrum of the noisy speech sample; concatenating the local features with the noise type features; inputting the features after the series connection into the other part of the noise mask ratio prediction network to obtain the estimated noise mask ratio of the reference scene noise data.
According to an exemplary embodiment of the present disclosure, the noise mask ratio prediction network may be a Convolutional Recurrent Neural Network (CRNN) including a Convolutional Neural Network (CNN) and a Recurrent Neural Network (RNN). For example, as shown in fig. 4, the amplitude spectrum of the noisy speech signal to be enhanced and the noise type feature may be directly concatenated, and the concatenated feature may be input to the noise mask ratio prediction network to obtain the estimated noise mask ratio. For another example, as shown in fig. 5, the amplitude spectrum of the noisy speech signal to be enhanced may be input into a part of the noise mask ratio prediction network (e.g., CNN), so as to obtain a local feature of the amplitude spectrum of the noisy speech sample, and then the local feature and the noise type feature are concatenated, and then the concatenated feature is input into another part of the noise mask ratio prediction network (e.g., RNN), so as to obtain an estimated noise mask ratio.
At step 604, an estimated speech enhancement signal may be derived based on the estimated noise mask ratio of the reference scene noise data and the noisy speech signal to be enhanced, wherein the estimated speech enhancement signal is the estimated speech enhancement signal derived after removing the reference scene noise data from the noisy speech signal to be enhanced.
According to an exemplary embodiment of the present disclosure, the estimated magnitude spectrum of the speech enhancement signal may be obtained by point-multiplying the complemented estimated noise mask ratio of the reference scene noise data with the magnitude spectrum of the noisy speech signal to be enhanced, for example, as shown in equation (6). The amplitude spectrum of the estimated speech enhancement signal is then combined with the phase spectrum of the noisy speech signal to be enhanced, and an inverse time-frequency transform is performed to obtain the estimated speech enhancement signal. For example, the estimated magnitude spectrum Mag_est of the speech enhancement signal may be combined with the phase spectrum Pha_ori of the noisy speech signal to be enhanced, and the estimated speech enhancement signal y, from which the reference scene noise has been removed, is obtained through the Inverse Short-Time Fourier Transform (ISTFT). For example, the estimated speech enhancement signal can be expressed as the following equation (10).
y = ISTFT(Mag_est, Pha_ori) (10)
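A sketch of formula (10): the enhanced waveform is reassembled from the estimated magnitude spectrum and the phase spectrum of the noisy speech via the inverse short-time Fourier transform (torch.istft); the frame parameters must match the forward STFT and are illustrative here.

```python
import torch

def reconstruct(mag_est: torch.Tensor, pha_ori: torch.Tensor,
                n_fft: int = 512, hop: int = 256) -> torch.Tensor:
    """Formula (10): y = ISTFT(Mag_est, Pha_ori)."""
    spec = torch.polar(mag_est, pha_ori)               # complex spectrum from magnitude and phase
    window = torch.hann_window(n_fft)
    return torch.istft(spec, n_fft=n_fft, hop_length=hop, window=window)

# Example, continuing the STFT sketch above: with an identity mask, y approximates the input.
x = torch.randn(16000)
window = torch.hann_window(512)
X = torch.stft(x, 512, hop_length=256, window=window, return_complex=True)
y = reconstruct(X.abs(), X.angle())
```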
FIG. 7 is a block diagram illustrating a training apparatus of a speech enhancement model according to an exemplary embodiment of the present disclosure.
Referring to fig. 7, a training apparatus 700 of a speech enhancement model according to an exemplary embodiment of the present disclosure may include a sample acquisition unit 701, a noise type estimation unit 702, a noise mask ratio estimation unit 703, a loss function calculation unit 704, and a model training unit 705.
The sample acquiring unit 701 may acquire a noisy speech sample, wherein the noisy speech sample is formed by mixing a speaker speech sample and at least one scene noise data. Here, the types of scenes may include subway stations, cafes, buses, streets, and the like. Scene noise data may be obtained by recording in different scenes (e.g., subway stations, coffee shops, buses, streets, etc.) using a recording device. The speaker voice samples may be obtained from a voice data set or by recording voice segments of different speakers.
The noise type estimation unit 702 may input reference scene noise data in the at least one scene noise data into the noise type discrimination network, so as to obtain a noise type characteristic of the reference scene noise data. Here, the reference scene noise data may be one of the at least one scene noise data, and is the scene noise data desired to be removed. In addition, the noise type feature is a fixed-dimension vector, which may also be referred to as an auxiliary vector, for describing scene noise type information.
According to an exemplary embodiment of the present disclosure, the input of the noise type discrimination network may be Mel-Frequency Cepstral Coefficients (MFCCs) of the reference scene noise data. For example, one implementation of a noise type discrimination network may be a three-layer long short-term memory (LSTM) network. As shown in fig. 3, the noise type estimation unit 702 may input the MFCCs of the reference scene noise data into three stacked LSTM layers (LSTM1, LSTM2, LSTM3), take the hidden state output by the last layer, LSTM3, and obtain the auxiliary vector embedding through one fully connected layer (dense). Of course, the noise type determination network is not limited to the above network or model, and may be any other network that may implement noise type determination, which is not limited by the present disclosure.
The noise mask ratio estimation unit 703 may input the amplitude spectrum of the noise-containing speech sample and the noise type characteristic into the noise mask ratio prediction network to obtain an estimated noise mask ratio of the reference scene noise data.
According to an exemplary embodiment of the present disclosure, the magnitude spectrum of a noisy speech sample may be obtained by time-frequency transformation. For example, the noise mask ratio estimation unit 703 may transform the noisy speech sample from the time domain to the frequency domain through a Short-Time Fourier Transform (STFT) and obtain the amplitude information and phase information of each frame of the audio signal, thereby obtaining the magnitude spectrum and the phase spectrum of the noisy speech sample.
According to an exemplary embodiment of the present disclosure, the noise mask ratio of the reference scene noise data may be a ratio of the magnitude spectrum of the reference scene noise data to the magnitude spectrum of the noisy speech signal. Specifically, the noise mask ratio of the reference scene noise data may be a gain matrix with the same dimensions as the magnitude spectrum of the reference scene noise data, where the value of each element lies in [0, 1]. Multiplying the noise mask ratio of the reference scene noise data element-wise with the magnitude spectrum of the noisy speech signal and performing an inverse time-frequency transform yields the reference scene noise data.
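For illustration only, the following PyTorch sketch shows how magnitude and phase spectra may be obtained by a short-time Fourier transform and how the noise mask ratio defined above could be computed from the reference scene noise and the noisy speech; the STFT parameters and the clamping to [0, 1] are illustrative assumptions.

```python
# Minimal sketch of STFT magnitude/phase extraction and the noise mask ratio.
import torch

def magnitude_and_phase(wave, n_fft=512, hop_length=128):
    """STFT a waveform and split it into magnitude and phase spectra."""
    window = torch.hann_window(n_fft)
    spec = torch.stft(wave, n_fft=n_fft, hop_length=hop_length,
                      window=window, return_complex=True)
    return spec.abs(), spec.angle()

def ideal_noise_mask_ratio(noise_wave, noisy_wave):
    """Ratio of the reference-noise magnitude spectrum to the noisy-speech
    magnitude spectrum, clamped to [0, 1] so it acts as a gain matrix."""
    noise_mag, _ = magnitude_and_phase(noise_wave)
    noisy_mag, _ = magnitude_and_phase(noisy_wave)
    mask = noise_mag / (noisy_mag + 1e-8)      # avoid division by zero
    return mask.clamp(0.0, 1.0)
```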
According to an exemplary embodiment of the present disclosure, the noise mask ratio estimation unit 703 may concatenate the magnitude spectrum of the noisy speech sample and the noise type feature, and input the concatenated features into the noise mask ratio prediction network to obtain the estimated noise mask ratio of the reference scene noise data. According to another exemplary embodiment of the present disclosure, the noise mask ratio estimation unit 703 may input the magnitude spectrum of the noisy speech sample into one part of the noise mask ratio prediction network to obtain local features of the magnitude spectrum of the noisy speech sample, concatenate the local features with the noise type feature, and input the concatenated features into the other part of the noise mask ratio prediction network to obtain the estimated noise mask ratio of the reference scene noise data.
According to an exemplary embodiment of the present disclosure, the noise mask ratio prediction network may be a Convolutional Recurrent Neural Network (CRNN) including a Convolutional Neural Network (CNN) and a Recurrent Neural Network (RNN). For example, as shown in fig. 4, the noise mask ratio estimation unit 703 may concatenate (Concat) the magnitude spectrum (Mag) of the noisy speech sample and the noise type feature (embedding), input the concatenated features into the noise mask ratio prediction network, extract features using the CNN, feed the features into the RNN to establish a context relationship, and output the estimated noise mask ratio mask_w. For another example, as shown in fig. 5, the noise mask ratio estimation unit 703 may input the magnitude spectrum (Mag) of the noisy speech sample into one part of the noise mask ratio prediction network (e.g., the CNN) to obtain local features of the magnitude spectrum of the noisy speech sample, concatenate (Concat) the local features with the noise type feature (embedding), and input the concatenated features into the other part of the noise mask ratio prediction network (e.g., the RNN) to obtain the estimated noise mask ratio mask_w. Of course, the noise mask ratio prediction network is not limited to the above network or model, and may be any other network that can implement noise mask ratio prediction, which is not limited by the present disclosure.
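Since the present disclosure does not fix the layer configuration of the CRNN, the following PyTorch sketch only illustrates the early-fusion variant of fig. 4 under assumed dimensions; a Conv1d block and a GRU stand in for the CNN and RNN parts, and all layer sizes are illustrative assumptions.

```python
# Minimal sketch of a mask prediction CRNN with early fusion of the embedding.
import torch
import torch.nn as nn

class MaskPredictionCRNN(nn.Module):
    def __init__(self, n_freq=257, emb_dim=128, cnn_ch=32, rnn_hidden=256):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv1d(n_freq + emb_dim, cnn_ch, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        self.rnn = nn.GRU(cnn_ch, rnn_hidden, batch_first=True)
        self.out = nn.Linear(rnn_hidden, n_freq)

    def forward(self, mag, embedding):
        # mag: (batch, frames, n_freq); embedding: (batch, emb_dim)
        emb = embedding.unsqueeze(1).expand(-1, mag.size(1), -1)
        x = torch.cat([mag, emb], dim=-1)                 # concatenate per frame
        x = self.cnn(x.transpose(1, 2)).transpose(1, 2)   # CNN feature extraction
        x, _ = self.rnn(x)                                # context modelling
        return torch.sigmoid(self.out(x))                 # mask values in [0, 1]
```

The fig. 5 variant would instead apply the CNN to the magnitude spectrum alone and concatenate the embedding with the resulting local features before the RNN.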
The loss function calculation unit 704 may calculate a loss function based on the estimated noise mask ratio of the reference scene noise data and the noise type characteristics of the reference scene noise data.
According to an exemplary embodiment of the present disclosure, the loss function calculation unit 704 may obtain the estimated magnitude spectrum of the speech enhancement signal by multiplying the complement of the estimated noise mask ratio of the reference scene noise data with the magnitude spectrum of the noisy speech sample. The loss function calculation unit 704 may obtain an estimated noise type tag vector by passing the noise type feature through a fully connected layer; that is, the noise type feature may be projected through the fully connected layer to a vector of a fixed dimension. The loss function calculation unit 704 may then calculate a loss function based on the estimated magnitude spectrum of the speech enhancement signal, the magnitude spectrum of the target speech enhancement signal, the estimated noise type tag vector, and the true noise type tag vector of the reference scene noise data. Here, the target speech enhancement signal is the signal obtained after removing the reference scene noise data from the noisy speech sample. Further, each of the at least one scene noise data has a true noise type tag vector, e.g., a one-hot vector.
For example, the loss function calculation unit 704 may calculate a mean square error loss function based on the estimated magnitude spectrum of the speech enhancement signal and the magnitude spectrum of the target speech enhancement signal, as shown in equation (7). The loss function calculation unit 704 may calculate a cross entropy loss function based on the estimated noise type tag vector and the true noise type tag vector of the reference scene noise data, as shown in equation (8). The loss function calculation unit 704 may sum the mean square error loss function and the cross entropy loss function to obtain a final loss function, as shown in equation (9).
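For illustration only, the combined loss of equations (7) to (9) may be sketched as follows in PyTorch; here fc is a hypothetical fully connected layer projecting the noise type feature to class logits, and the true noise type label is assumed to be given as a class index (corresponding to the one-hot vector mentioned above).

```python
# Illustrative sketch of the combined loss of equations (7)-(9).
import torch.nn.functional as F

def combined_loss(mask_est, noisy_mag, target_mag, embedding, fc, true_label):
    # mask_est, noisy_mag, target_mag: (batch, frames, n_freq); true_label: (batch,)
    enhanced_mag = (1.0 - mask_est) * noisy_mag      # complement of the noise mask
    mse = F.mse_loss(enhanced_mag, target_mag)       # equation (7)
    logits = fc(embedding)                           # estimated noise type vector
    ce = F.cross_entropy(logits, true_label)         # equation (8)
    return mse + ce                                  # equation (9)
```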
The model training unit 705 may adjust parameters of the noise mask ratio prediction network and the noise type discrimination network according to the calculated loss function to train the speech enhancement model. That is, the model training unit 705 may back-propagate the loss value of equation (9) above to jointly adjust the parameters of the noise mask ratio prediction network and the noise type discrimination network.
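For illustration only, one joint optimization step might then look as follows, reusing the illustrative classes and loss sketched above; the optimizer, learning rate, and assumed number of noise classes are not specified by the present disclosure.

```python
# Illustrative joint training step; all hyperparameters are assumptions.
import itertools
import torch

noise_type_net = NoiseTypeNet()                      # sketched earlier
mask_net = MaskPredictionCRNN()                      # sketched earlier
label_fc = torch.nn.Linear(128, 10)                  # assumed: 10 noise classes
optimizer = torch.optim.Adam(
    itertools.chain(noise_type_net.parameters(),
                    mask_net.parameters(),
                    label_fc.parameters()),
    lr=1e-3)

def train_step(mfcc, noisy_mag, target_mag, true_label):
    # noisy_mag, target_mag: (batch, frames, n_freq)
    embedding = noise_type_net(mfcc)                 # noise type feature
    mask_est = mask_net(noisy_mag, embedding)        # estimated noise mask ratio
    loss = combined_loss(mask_est, noisy_mag, target_mag,
                         embedding, label_fc, true_label)
    optimizer.zero_grad()
    loss.backward()                                  # gradients reach both networks
    optimizer.step()
    return loss.item()
```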
Fig. 8 is a block diagram illustrating a speech enhancement apparatus according to an exemplary embodiment of the present disclosure. Here, the speech enhancement apparatus according to the exemplary embodiment of the present disclosure may be implemented based on a speech enhancement model according to the present disclosure. A speech enhancement model according to the present disclosure may include a noise mask ratio prediction network and a noise type discrimination network. For example, the speech enhancement model may be trained by the training method of a speech enhancement model according to the present disclosure.
Referring to fig. 8, a speech enhancement apparatus 800 according to an exemplary embodiment of the present disclosure may include a data acquisition unit 801, a noise type estimation unit 802, a noise mask ratio estimation unit 803, and a speech enhancement unit 804.
The data acquisition unit 801 may acquire a noisy speech signal to be enhanced, including a speaker speech signal and at least one scene noise data, and reference scene noise data, which is the scene noise data desired to be removed among the at least one scene noise data. For example, the types of scenes may include subway stations, cafes, buses, streets, and so forth. For example, the reference scene noise data may be a scene noise segment that the user desires to remove, pre-recorded in the environment where the speaker is located.
The noise type estimation unit 802 may input the reference scene noise data into the noise type discrimination network to obtain the noise type characteristic of the reference scene noise data.
Here, the present disclosure does not limit the order in which the data acquisition unit 801 and the noise type estimation unit 802 operate. For example, the data acquisition unit 801 may acquire the reference scene noise data first and then the noisy speech signal to be enhanced, may acquire the noisy speech signal to be enhanced first and then the reference scene noise data, or may acquire both simultaneously. For another example, the data acquisition unit 801 may acquire the reference scene noise data, the noise type estimation unit 802 may then obtain the noise type feature, and the data acquisition unit 801 may subsequently acquire the noisy speech signal to be enhanced.
The noise mask ratio estimation unit 803 may input the magnitude spectrum of the noisy speech signal to be enhanced and the noise type feature into the noise mask ratio prediction network to obtain an estimated noise mask ratio of the reference scene noise data, where the noise mask ratio may represent a ratio of the magnitude spectrum of the reference scene noise data to the magnitude spectrum of the noisy speech signal.
According to an exemplary embodiment of the present disclosure, the noise mask ratio estimation unit 803 may concatenate the magnitude spectrum of the noisy speech signal to be enhanced and the noise type feature, and input the concatenated features into the noise mask ratio prediction network to obtain the estimated noise mask ratio of the reference scene noise data. According to another exemplary embodiment of the present disclosure, the noise mask ratio estimation unit 803 may input the magnitude spectrum of the noisy speech signal to be enhanced into one part of the noise mask ratio prediction network to obtain local features of the magnitude spectrum of the noisy speech signal to be enhanced, concatenate the local features with the noise type feature, and input the concatenated features into the other part of the noise mask ratio prediction network to obtain the estimated noise mask ratio of the reference scene noise data.
According to an exemplary embodiment of the present disclosure, the noise mask ratio prediction network may be a Convolutional Recurrent Neural Network (CRNN) including a Convolutional Neural Network (CNN) and a Recurrent Neural Network (RNN). For example, as shown in fig. 4, the noise mask ratio estimation unit 803 may directly concatenate the magnitude spectrum of the noisy speech signal to be enhanced and the noise type feature, and input the concatenated features into the noise mask ratio prediction network to obtain the estimated noise mask ratio. For another example, as shown in fig. 5, the noise mask ratio estimation unit 803 may input the magnitude spectrum of the noisy speech signal to be enhanced into one part of the noise mask ratio prediction network (e.g., the CNN) to obtain local features of the magnitude spectrum of the noisy speech signal to be enhanced, concatenate the local features with the noise type feature, and input the concatenated features into the other part of the noise mask ratio prediction network (e.g., the RNN) to obtain the estimated noise mask ratio.
The speech enhancement unit 804 may derive an estimated speech enhancement signal based on the estimated noise mask ratio of the reference scene noise data and the noisy speech signal to be enhanced, wherein the estimated speech enhancement signal is the signal obtained after removing the reference scene noise data from the noisy speech signal to be enhanced.
According to an exemplary embodiment of the present disclosure, the speech enhancement unit 804 may multiply the complement of the estimated noise mask ratio of the reference scene noise data by the magnitude spectrum of the noisy speech signal to be enhanced to obtain the estimated magnitude spectrum of the speech enhancement signal, combine the estimated magnitude spectrum of the speech enhancement signal with the phase spectrum of the noisy speech signal to be enhanced, and perform an inverse time-frequency transform to obtain the estimated speech enhancement signal.
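For illustration only, an inference pass of the speech enhancement unit 804 could be sketched as follows, reusing the hypothetical helpers from the earlier sketches; tensor shapes and names are assumptions.

```python
# Minimal inference sketch; reuses the illustrative networks and helpers above.
import torch

@torch.no_grad()
def enhance(noisy_wave, reference_noise_mfcc):
    # noisy_wave: (batch, samples); reference_noise_mfcc: (batch, frames, n_mfcc)
    embedding = noise_type_net(reference_noise_mfcc)           # noise type feature
    noisy_mag, noisy_pha = magnitude_and_phase(noisy_wave)     # (batch, freq, frames)
    mask = mask_net(noisy_mag.transpose(1, 2), embedding)      # (batch, frames, freq)
    enhanced_mag = (1.0 - mask.transpose(1, 2)) * noisy_mag    # remove reference noise
    return reconstruct_enhanced_signal(enhanced_mag, noisy_pha)
```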
Fig. 9 is a block diagram of an electronic device 900 according to an example embodiment of the present disclosure.
Referring to fig. 9, an electronic device 900 includes at least one memory 901 and at least one processor 902, the at least one memory 901 having stored therein a set of computer-executable instructions that, when executed by the at least one processor 902, perform a method of training a speech enhancement model or a method of speech enhancement according to an exemplary embodiment of the present disclosure.
By way of example, the electronic device 900 may be a PC, a tablet device, a personal digital assistant, a smartphone, or another device capable of executing the above set of instructions. Here, the electronic device 900 need not be a single electronic device, but can be any arrangement or collection of circuits capable of executing the above-described instructions (or instruction sets), either individually or in combination. The electronic device 900 may also be part of an integrated control system or system manager, or may be configured as a portable electronic device that interfaces locally or remotely (e.g., via wireless transmission).
In the electronic device 900, the processor 902 may include a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a programmable logic device, a special purpose processor system, a microcontroller, or a microprocessor. By way of example, and not limitation, processors may also include analog processors, digital processors, microprocessors, multi-core processors, processor arrays, network processors, and the like.
The processor 902 may execute instructions or code stored in the memory 901, where the memory 901 may also store data. The instructions and data may also be transmitted or received over a network via a network interface device, which may employ any known transmission protocol.
The memory 901 may be integrated with the processor 902, for example, by having RAM or flash memory disposed within an integrated circuit microprocessor or the like. Further, memory 901 may comprise a stand-alone device, such as an external disk drive, storage array, or any other storage device usable by a database system. The memory 901 and the processor 902 may be operatively coupled or may communicate with each other, e.g., through I/O ports, network connections, etc., such that the processor 902 is able to read files stored in the memory.
In addition, the electronic device 900 may also include a video display (such as a liquid crystal display) and a user interaction interface (such as a keyboard, mouse, touch input device, etc.). All components of electronic device 900 may be connected to each other via a bus and/or a network.
According to an exemplary embodiment of the present disclosure, a computer-readable storage medium may also be provided, wherein instructions in the computer-readable storage medium, when executed by at least one processor, cause the at least one processor to perform the training method of a speech enhancement model or the speech enhancement method according to the present disclosure. Examples of the computer-readable storage medium herein include: read-only memory (ROM), random-access programmable read-only memory (PROM), electrically erasable programmable read-only memory (EEPROM), random-access memory (RAM), dynamic random-access memory (DRAM), static random-access memory (SRAM), flash memory, non-volatile memory, CD-ROM, CD-R, CD+R, CD-RW, CD+RW, DVD-ROM, DVD-R, DVD+R, DVD-RW, DVD+RW, DVD-RAM, BD-ROM, BD-R, BD-R LTH, BD-RE, Blu-ray or optical disc storage, a hard disk drive (HDD), a solid-state drive (SSD), card-type memory (such as a multimedia card, a Secure Digital (SD) card, or an eXtreme Digital (XD) card), magnetic tape, a floppy disk, a magneto-optical data storage device, an optical data storage device, and any other device configured to store a computer program and any associated data, data files, and data structures in a non-transitory manner and to provide them to a processor or computer so that the processor or computer can execute the computer program. The computer program in the computer-readable storage medium described above can run in an environment deployed in computer equipment, such as a client, a host, a proxy device, a server, and the like. Further, in one example, the computer program and any associated data, data files, and data structures are distributed across a networked computer system such that the computer program and any associated data, data files, and data structures are stored, accessed, and executed in a distributed fashion by one or more processors or computers.
According to an exemplary embodiment of the present disclosure, a computer program product may also be provided, comprising computer instructions executable by at least one processor to perform a method of training a speech enhancement model or a method of speech enhancement according to an exemplary embodiment of the present disclosure.
According to the training method and training apparatus of a speech enhancement model and the speech enhancement method and apparatus of the present disclosure, specific scene noise can be input into the speech enhancement model as an auxiliary vector and removed in that specific scene, so that, when multiple kinds of scene noise are present, the specific scene noise can be removed according to the user's needs, achieving the speech enhancement effect desired by the user.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This disclosure is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (10)

1. A training method of a speech enhancement model, wherein the speech enhancement model comprises a noise mask ratio prediction network and a noise type discrimination network, the training method comprising:
acquiring a noisy speech sample, wherein the noisy speech sample is formed by mixing a speaker speech sample and at least one scene noise data;
inputting reference scene noise data in the at least one scene noise data into the noise type discrimination network to obtain noise type characteristics of the reference scene noise data, wherein the reference scene noise data is scene noise data which is expected to be removed in the at least one scene noise data, and the speech enhancement model is used for obtaining an estimated speech enhancement signal obtained after removing the reference scene noise data from the noisy speech sample;
inputting the amplitude spectrum of the noisy speech sample and the noise type characteristic into the noise mask ratio prediction network to obtain an estimated noise mask ratio of the reference scene noise data, wherein the noise mask ratio represents a ratio of the amplitude spectrum of the reference scene noise data to the amplitude spectrum of the noisy speech sample;
calculating a loss function based on the estimated noise mask ratio of the reference scene noise data and the noise type characteristics of the reference scene noise data;
and adjusting parameters of the noise mask ratio prediction network and the noise type discrimination network according to the calculated loss function, and training the speech enhancement model.
2. The training method of claim 1, wherein said inputting the magnitude spectrum of the noisy speech sample and the noise type feature into the noise mask ratio prediction network to obtain the estimated noise mask ratio of the reference scene noise data comprises:
concatenating the magnitude spectrum of the noisy speech sample with the noise type feature;
and inputting the concatenated features into the noise mask ratio prediction network to obtain the estimated noise mask ratio of the reference scene noise data.
3. The training method of claim 1, wherein said inputting the magnitude spectrum of the noisy speech sample and the noise type feature into the noise mask ratio prediction network to obtain the estimated noise mask ratio of the reference scene noise data comprises:
inputting the magnitude spectrum of the noisy speech sample into one part of the noise mask ratio prediction network to obtain local features of the magnitude spectrum of the noisy speech sample;
concatenating the local features with the noise type feature;
and inputting the concatenated features into the other part of the noise mask ratio prediction network to obtain the estimated noise mask ratio of the reference scene noise data.
4. A training method as claimed in claim 2 or 3 wherein the noise mask ratio prediction network is a convolutional recurrent neural network comprising a convolutional neural network and a recurrent neural network.
5. A speech enhancement method performed based on a speech enhancement model including a noise mask ratio prediction network and a noise type discrimination network, the speech enhancement method comprising:
acquiring a noisy speech signal to be enhanced and reference scene noise data, wherein the noisy speech signal to be enhanced comprises a speaker speech signal and at least one scene noise data, and the reference scene noise data is scene noise data which is expected to be removed from the at least one scene noise data;
inputting the reference scene noise data into the noise type discrimination network to obtain the noise type characteristics of the reference scene noise data;
inputting the amplitude spectrum of the noisy speech signal to be enhanced and the noise type characteristic into the noise mask ratio prediction network to obtain an estimated noise mask ratio of the reference scene noise data, wherein the noise mask ratio represents a ratio of the amplitude spectrum of the reference scene noise data to the amplitude spectrum of the noisy speech signal;
obtaining an estimated speech enhancement signal based on the estimated noise mask ratio of the reference scene noise data and the noisy speech signal to be enhanced, wherein the estimated speech enhancement signal is the estimated speech enhancement signal obtained after removing the reference scene noise data from the noisy speech signal to be enhanced.
6. An apparatus for training a speech enhancement model, wherein the speech enhancement model includes a noise mask ratio prediction network and a noise type discrimination network, the apparatus comprising:
a sample acquisition unit configured to: acquiring a noisy speech sample, wherein the noisy speech sample is formed by mixing a speaker speech sample and at least one scene noise data;
a noise type estimation unit configured to: inputting reference scene noise data in the at least one scene noise data into the noise type discrimination network to obtain noise type characteristics of the reference scene noise data, wherein the reference scene noise data is scene noise data which is expected to be removed in the at least one scene noise data, and the speech enhancement model is used for obtaining an estimated speech enhancement signal obtained after removing the reference scene noise data from the noisy speech sample;
a noise mask ratio estimation unit configured to: inputting the amplitude spectrum of the noisy speech sample and the noise type characteristic into the noise mask ratio prediction network to obtain an estimated noise mask ratio of the reference scene noise data, wherein the noise mask ratio represents a ratio of the amplitude spectrum of the reference scene noise data to the amplitude spectrum of the noisy speech sample;
a loss function calculation unit configured to: calculating a loss function based on the estimated noise mask ratio of the reference scene noise data and the noise type characteristics of the reference scene noise data;
a model training unit configured to: adjusting parameters of the noise mask ratio prediction network and the noise type discrimination network according to the calculated loss function, and training the speech enhancement model.
7. A speech enhancement apparatus that performs an operation based on a speech enhancement model including a noise mask ratio prediction network and a noise type discrimination network, the speech enhancement apparatus comprising:
a data acquisition unit configured to: acquiring a noisy speech signal to be enhanced and reference scene noise data, wherein the noisy speech signal to be enhanced comprises a speaker speech signal and at least one scene noise data, and the reference scene noise data is scene noise data which is expected to be removed from the at least one scene noise data;
a noise type estimation unit configured to: inputting the reference scene noise data into the noise type discrimination network to obtain the noise type characteristics of the reference scene noise data;
a noise mask ratio estimation unit configured to: inputting the amplitude spectrum of the noisy speech signal to be enhanced and the noise type characteristic into the noise mask ratio prediction network to obtain an estimated noise mask ratio of the reference scene noise data, wherein the noise mask ratio represents a ratio of the amplitude spectrum of the reference scene noise data to the amplitude spectrum of the noisy speech signal;
a speech enhancement unit configured to: obtaining an estimated speech enhancement signal based on the estimated noise mask ratio of the reference scene noise data and the noisy speech signal to be enhanced, wherein the estimated speech enhancement signal is the estimated speech enhancement signal obtained after removing the reference scene noise data from the noisy speech signal to be enhanced.
8. An electronic device, comprising:
at least one processor;
at least one memory storing computer-executable instructions,
wherein the computer-executable instructions, when executed by the at least one processor, cause the at least one processor to perform the method of training a speech enhancement model according to any one of claims 1 to 4 or the method of speech enhancement according to claim 5.
9. A computer-readable storage medium, wherein instructions in the computer-readable storage medium, when executed by at least one processor, cause the at least one processor to perform the method of training a speech enhancement model according to any one of claims 1 to 4 or the method of speech enhancement according to claim 5.
10. A computer program product comprising computer instructions, characterized in that the computer instructions, when executed by at least one processor, implement the method of training a speech enhancement model according to any of claims 1 to 4 or the method of speech enhancement according to claim 5.
CN202110529546.6A 2021-05-14 2021-05-14 Training method and device for voice enhancement model and voice enhancement method and device Active CN113284507B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110529546.6A CN113284507B (en) 2021-05-14 2021-05-14 Training method and device for voice enhancement model and voice enhancement method and device

Publications (2)

Publication Number Publication Date
CN113284507A true CN113284507A (en) 2021-08-20
CN113284507B CN113284507B (en) 2024-02-13

Family

ID=77279202

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110529546.6A Active CN113284507B (en) 2021-05-14 2021-05-14 Training method and device for voice enhancement model and voice enhancement method and device

Country Status (1)

Country Link
CN (1) CN113284507B (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7043030B1 (en) * 1999-06-09 2006-05-09 Mitsubishi Denki Kabushiki Kaisha Noise suppression device
CN109065067A (en) * 2018-08-16 2018-12-21 福建星网智慧科技股份有限公司 A kind of conference terminal voice de-noising method based on neural network model
CN111429930A (en) * 2020-03-16 2020-07-17 云知声智能科技股份有限公司 Noise reduction model processing method and system based on adaptive sampling rate
CN111554315A (en) * 2020-05-29 2020-08-18 展讯通信(天津)有限公司 Single-channel voice enhancement method and device, storage medium and terminal
CN112652290A (en) * 2020-12-14 2021-04-13 北京达佳互联信息技术有限公司 Method for generating reverberation audio signal and training method of audio processing model
CN112712816A (en) * 2020-12-23 2021-04-27 北京达佳互联信息技术有限公司 Training method and device of voice processing model and voice processing method and device
CN112289333A (en) * 2020-12-25 2021-01-29 北京达佳互联信息技术有限公司 Training method and device of voice enhancement model and voice enhancement method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
CHENG Manman: "Algorithm analysis and research on speech enhancement in noisy environments", Informatization Research (信息化研究), vol. 41, no. 1, pages 29-34 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113593594A (en) * 2021-09-01 2021-11-02 北京达佳互联信息技术有限公司 Training method and device of voice enhancement model and voice enhancement method and device
CN113593594B (en) * 2021-09-01 2024-03-08 北京达佳互联信息技术有限公司 Training method and equipment for voice enhancement model and voice enhancement method and equipment
CN113921030A (en) * 2021-12-07 2022-01-11 江苏清微智能科技有限公司 Speech enhancement neural network training method and device based on weighted speech loss
CN114999508A (en) * 2022-07-29 2022-09-02 之江实验室 Universal speech enhancement method and device by using multi-source auxiliary information
CN114999508B (en) * 2022-07-29 2022-11-08 之江实验室 Universal voice enhancement method and device by utilizing multi-source auxiliary information

Also Published As

Publication number Publication date
CN113284507B (en) 2024-02-13

Similar Documents

Publication Publication Date Title
CN112289333B (en) Training method and device of voice enhancement model and voice enhancement method and device
CN113284507B (en) Training method and device for voice enhancement model and voice enhancement method and device
CN112927707B (en) Training method and device for voice enhancement model and voice enhancement method and device
JP6594839B2 (en) Speaker number estimation device, speaker number estimation method, and program
CN113241088B (en) Training method and device of voice enhancement model and voice enhancement method and device
CN112712816B (en) Training method and device for voice processing model and voice processing method and device
CN113470685B (en) Training method and device for voice enhancement model and voice enhancement method and device
CN112309426A (en) Voice processing model training method and device and voice processing method and device
CN114121029A (en) Training method and device of speech enhancement model and speech enhancement method and device
CN113035221B (en) Training method and device for voice processing model and voice processing method and device
KR102018286B1 (en) Method and Apparatus for Removing Speech Components in Sound Source
CN113593594B (en) Training method and equipment for voice enhancement model and voice enhancement method and equipment
CN114758668A (en) Training method of voice enhancement model and voice enhancement method
Yan et al. Exposing speech transsplicing forgery with noise level inconsistency
CN112652290B (en) Method for generating reverberation audio signal and training method of audio processing model
CN113555031B (en) Training method and device of voice enhancement model, and voice enhancement method and device
US20190385590A1 (en) Generating device, generating method, and non-transitory computer readable storage medium
CN114242110A (en) Model training method, audio processing method, device, equipment, medium and product
CN113223485B (en) Training method of beat detection model, beat detection method and device
CN113257267B (en) Method for training interference signal elimination model and method and equipment for eliminating interference signal
CN115641868A (en) Audio separation method and device, electronic equipment and computer readable storage medium
JP6827908B2 (en) Speech enhancement device, speech enhancement learning device, speech enhancement method, program
Patole et al. Acoustic environment identification using blind de-reverberation
JP6257537B2 (en) Saliency estimation method, saliency estimation device, and program
CN113990343A (en) Training method and device of voice noise reduction model and voice noise reduction method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant