CN113284507B - Training method and device for voice enhancement model and voice enhancement method and device

Training method and device for voice enhancement model and voice enhancement method and device

Info

Publication number
CN113284507B
CN113284507B
Authority
CN
China
Prior art keywords
noise
mask ratio
estimated
data
reference scene
Prior art date
Legal status
Active
Application number
CN202110529546.6A
Other languages
Chinese (zh)
Other versions
CN113284507A (en)
Inventor
张新
郑羲光
张晨
Current Assignee
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Beijing Dajia Internet Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Dajia Internet Information Technology Co Ltd filed Critical Beijing Dajia Internet Information Technology Co Ltd
Priority to CN202110529546.6A priority Critical patent/CN113284507B/en
Publication of CN113284507A publication Critical patent/CN113284507A/en
Application granted granted Critical
Publication of CN113284507B publication Critical patent/CN113284507B/en

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, characterised by the analysis technique
    • G10L25/30: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, characterised by the analysis technique using neural networks
    • G10L21/00: Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208: Noise filtering
    • G10L21/0216: Noise filtering characterised by the method used for estimating noise
    • Y02T90/00: Enabling technologies or technologies with a potential or indirect contribution to GHG emissions mitigation


Abstract

The present disclosure relates to a training method and apparatus for a speech enhancement model, and to a speech enhancement method and apparatus. The speech enhancement model includes a noise mask ratio prediction network and a noise type discrimination network. The training method includes: obtaining a noise-containing voice sample, wherein the noise-containing voice sample is formed by mixing a speaker voice sample with at least one scene noise data; inputting reference scene noise data among the at least one scene noise data into the noise type discrimination network to obtain a noise type feature of the reference scene noise data, wherein the reference scene noise data is the scene noise data that is desired to be removed; inputting the amplitude spectrum of the noise-containing voice sample and the noise type feature into the noise mask ratio prediction network to obtain an estimated noise mask ratio of the reference scene noise data; calculating a loss function based on the estimated noise mask ratio and the noise type feature; and adjusting parameters of the noise mask ratio prediction network and the noise type discrimination network using the calculated loss function to train the speech enhancement model.

Description

Training method and device for voice enhancement model and voice enhancement method and device
Technical Field
The present disclosure relates to the field of audio technologies, and in particular, to a method and apparatus for training a speech enhancement model, and a method and apparatus for speech enhancement.
Background
A noisy environment degrades the quality of voice communication. Current mainstream communication software therefore generally applies speech enhancement algorithms to the noise-containing audio during a call, and traditional methods can handle steady-state noise. However, a common speech enhancement algorithm removes all noise in the scene and keeps only the human voice, whereas the types of noise that users need to remove differ from scene to scene, so a common speech enhancement algorithm cannot achieve speech enhancement targeted at a specific scene.
Disclosure of Invention
The present disclosure provides a training method and apparatus for a speech enhancement model, and a speech enhancement method and apparatus, to at least solve the above-mentioned problems in the related art; however, the present disclosure is not required to solve any of the above-mentioned problems.
According to a first aspect of embodiments of the present disclosure, there is provided a training method of a speech enhancement model, the speech enhancement model including a noise mask ratio prediction network and a noise type discrimination network, the training method including: obtaining a noise-containing voice sample, wherein the noise-containing voice sample is formed by mixing a speaker voice sample and at least one scene noise data; inputting reference scene noise data in the at least one scene noise data into the noise type discrimination network to obtain noise type characteristics of the reference scene noise data, wherein the reference scene noise data is scene noise data which is expected to be removed from the at least one scene noise data, and the voice enhancement model is used for obtaining an estimated voice enhancement signal obtained after the reference scene noise data is removed from the noisy voice sample; inputting the magnitude spectrum of the noise-containing voice sample and the noise type characteristic into the noise mask ratio prediction network to obtain an estimated noise mask ratio of the noise data of the reference scene, wherein the noise mask ratio represents the ratio of the magnitude spectrum of the noise data of the reference scene to the magnitude spectrum of the noise-containing voice sample; calculating a loss function based on the estimated noise mask ratio of the reference scene noise data and the noise type characteristics of the reference scene noise data; and adjusting parameters of the noise mask ratio prediction network and the noise type discrimination network through the calculated loss function, and training the voice enhancement model.
Optionally, the inputting the magnitude spectrum of the noise-containing speech sample and the noise type feature into the noise mask ratio prediction network to obtain the estimated noise mask ratio of the reference scene noise data may include: the amplitude spectrum of the noise-containing voice sample and the noise type feature are connected in series; and inputting the characteristics after the series connection into the noise mask ratio prediction network to obtain the estimated noise mask ratio of the noise data of the reference scene.
Optionally, the inputting the magnitude spectrum of the noise-containing speech sample and the noise type feature into the noise mask ratio prediction network to obtain the estimated noise mask ratio of the reference scene noise data may include: inputting the amplitude spectrum of the noise-containing voice sample into a part of the noise mask ratio prediction network to obtain local characteristics of the amplitude spectrum of the noise-containing voice sample; concatenating the local feature with the noise-type feature; and inputting the characteristics after the series connection into another part of the noise mask ratio prediction network to obtain the estimated noise mask ratio of the noise data of the reference scene.
Alternatively, the noise mask ratio prediction network may be a convolutional recurrent neural network including a convolutional neural network and a recurrent neural network.
Alternatively, a portion of the noise mask ratio prediction network may be a convolutional neural network of the convolutional recurrent neural networks, and another portion of the noise mask ratio prediction network may be a recurrent neural network of the convolutional recurrent neural networks.
Optionally, each of the at least one scene noise data may have a true noise type tag vector; wherein calculating a loss function based on the estimated noise mask ratio of the reference scene noise data and the noise type characteristic of the reference scene noise data may include: multiplying the estimated noise mask ratio of the reference scene noise data, after being complemented, with the amplitude spectrum of the noise-containing voice sample to obtain an estimated amplitude spectrum of the voice enhancement signal; obtaining an estimated noise type tag vector by passing the noise type feature through a full connection layer; and calculating a loss function based on the estimated amplitude spectrum of the voice enhancement signal, the amplitude spectrum of the target voice enhancement signal, the estimated noise type tag vector and the real noise type tag vector of the reference scene noise data, wherein the target voice enhancement signal is the signal obtained after the reference scene noise data is removed from the noise-containing voice sample.
Optionally, the calculating a loss function based on the estimated magnitude spectrum of the speech enhancement signal, the magnitude spectrum of the target speech enhancement signal, the estimated noise-type tag vector, the real noise-type tag vector of the reference scene noise data may include: calculating a mean square error loss function based on the estimated magnitude spectrum of the speech enhancement signal and the magnitude spectrum of the target speech enhancement signal; calculating a cross entropy loss function based on the estimated noise-type tag vector and a true noise-type tag vector of the reference scene noise data; and summing the mean square error loss function and the cross entropy loss function to obtain the loss function.
Alternatively, the loss function may be expressed as:
loss = L_MSE(Mag_est, Mag_tar) + L_CE(ẑ, z) = L_MSE(Mag_est, Mag_tar) - Σ_{i=1..M} z_i · log(ẑ_i)

wherein Mag_est represents the amplitude spectrum of the estimated speech enhancement signal, Mag_tar represents the amplitude spectrum of the target speech enhancement signal, L_MSE represents the mean square error loss function, M represents the total number of scene noise types, i is the index running over the noise types, ẑ represents the estimated noise type tag vector, z represents the true noise type tag vector of the reference scene noise data, and L_CE represents the cross entropy loss function.
According to a second aspect of embodiments of the present disclosure, there is provided a speech enhancement method performed based on a speech enhancement model including a noise mask ratio prediction network and a noise type discrimination network, the speech enhancement method including: acquiring a noise-containing voice signal to be enhanced and reference scene noise data, wherein the noise-containing voice signal to be enhanced comprises a speaker voice signal and at least one scene noise data, and the reference scene noise data is scene noise data which is expected to be removed from the at least one scene noise data; inputting the reference scene noise data into the noise type discrimination network to obtain the noise type characteristics of the reference scene noise data; inputting the amplitude spectrum of the noise-containing voice signal to be enhanced and the noise type characteristic into the noise mask ratio prediction network to obtain an estimated noise mask ratio of the reference scene noise data, wherein the noise mask ratio represents the ratio of the amplitude spectrum of the reference scene noise data to the amplitude spectrum of the noise-containing voice signal; and obtaining an estimated voice enhancement signal based on the estimated noise mask ratio of the reference scene noise data and the noise-containing voice signal to be enhanced, wherein the estimated voice enhancement signal is an estimated voice enhancement signal obtained after the reference scene noise data is removed from the noise-containing voice signal to be enhanced.
Optionally, inputting the magnitude spectrum of the noise-containing speech signal to be enhanced and the noise type feature into the noise mask ratio prediction network to obtain an estimated noise mask ratio of the reference scene noise data may include: the amplitude spectrum of the noise-containing voice signal to be enhanced and the noise type feature are connected in series; and inputting the characteristics after the series connection into the noise mask ratio prediction network to obtain the estimated noise mask ratio of the noise data of the reference scene.
Optionally, inputting the magnitude spectrum of the noise-containing speech signal to be enhanced and the noise type feature into the noise mask ratio prediction network to obtain an estimated noise mask ratio of the reference scene noise data may include: inputting the amplitude spectrum of the noise-containing voice signal to be enhanced into a part of the noise mask ratio prediction network to obtain local characteristics of the amplitude spectrum of the noise-containing voice sample; concatenating the local feature with the noise-type feature; and inputting the characteristics after the series connection into another part of the noise mask ratio prediction network to obtain the estimated noise mask ratio of the noise data of the reference scene.
Alternatively, the noise mask ratio prediction network may be a convolutional recurrent neural network including a convolutional neural network and a recurrent neural network.
Alternatively, a portion of the noise mask ratio prediction network may be a convolutional neural network of the convolutional recurrent neural networks, and another portion of the noise mask ratio prediction network may be a recurrent neural network of the convolutional recurrent neural networks.
Optionally, the obtaining an estimated speech enhancement signal based on the estimated noise mask ratio of the reference scene noise data and the noise-containing speech signal to be enhanced may include: multiplying the estimated noise mask ratio of the reference scene noise data with the amplitude spectrum of the noise-containing voice signal to be enhanced after being complemented to obtain the estimated amplitude spectrum of the voice enhancement signal; combining the amplitude spectrum of the estimated voice enhancement signal with the phase spectrum of the noise-containing voice signal to be enhanced and performing time-frequency inverse transformation to obtain the estimated voice enhancement signal.
Alternatively, the reference scene noise data may be a scene noise segment that is prerecorded in the environment in which the speaker is located and desired to be removed.
Optionally, the speech enhancement model is trained using a training method of the speech enhancement model according to the present disclosure.
According to a third aspect of embodiments of the present disclosure, there is provided a training apparatus of a speech enhancement model including a noise mask ratio prediction network and a noise type discrimination network, the training apparatus comprising: a sample acquisition unit configured to: obtaining a noise-containing voice sample, wherein the noise-containing voice sample is formed by mixing a speaker voice sample and at least one scene noise data; a noise type estimation unit configured to: inputting reference scene noise data in the at least one scene noise data into the noise type discrimination network to obtain noise type characteristics of the reference scene noise data, wherein the reference scene noise data is scene noise data which is expected to be removed from the at least one scene noise data, and the voice enhancement model is used for obtaining an estimated voice enhancement signal obtained after the reference scene noise data is removed from the noisy voice sample; a noise mask ratio estimation unit configured to: inputting the magnitude spectrum of the noise-containing voice sample and the noise type characteristic into the noise mask ratio prediction network to obtain an estimated noise mask ratio of the noise data of the reference scene, wherein the noise mask ratio represents the ratio of the magnitude spectrum of the noise data of the reference scene to the magnitude spectrum of the noise-containing voice sample; a loss function calculation unit configured to: calculating a loss function based on the estimated noise mask ratio of the reference scene noise data and the noise type characteristics of the reference scene noise data; a model training unit configured to: and adjusting parameters of the noise mask ratio prediction network and the noise type discrimination network through the calculated loss function, and training the voice enhancement model.
Alternatively, the noise mask ratio estimation unit may be configured to: the amplitude spectrum of the noise-containing voice sample and the noise type feature are connected in series; and inputting the characteristics after the series connection into the noise mask ratio prediction network to obtain the estimated noise mask ratio of the noise data of the reference scene.
Alternatively, the noise mask ratio estimation unit may be configured to: inputting the amplitude spectrum of the noise-containing voice sample into a part of the noise mask ratio prediction network to obtain local characteristics of the amplitude spectrum of the noise-containing voice sample; concatenating the local feature with the noise-type feature; and inputting the characteristics after the series connection into another part of the noise mask ratio prediction network to obtain the estimated noise mask ratio of the noise data of the reference scene.
Alternatively, the noise mask ratio prediction network may be a convolutional recurrent neural network including a convolutional neural network and a recurrent neural network.
Alternatively, a portion of the noise mask ratio prediction network may be a convolutional neural network of the convolutional recurrent neural networks, and another portion of the noise mask ratio prediction network may be a recurrent neural network of the convolutional recurrent neural networks.
Optionally, each of the at least one scene noise data may have a true noise type tag vector; wherein the loss function calculation unit may be configured to: multiply the estimated noise mask ratio of the reference scene noise data, after being complemented, with the amplitude spectrum of the noise-containing voice sample to obtain an estimated amplitude spectrum of the voice enhancement signal; obtain an estimated noise type tag vector by passing the noise type feature through a full connection layer; and calculate a loss function based on the estimated amplitude spectrum of the voice enhancement signal, the amplitude spectrum of the target voice enhancement signal, the estimated noise type tag vector and the real noise type tag vector of the reference scene noise data, wherein the target voice enhancement signal is the signal obtained after the reference scene noise data is removed from the noise-containing voice sample.
Alternatively, the loss function calculation unit may be configured to: calculating a mean square error loss function based on the estimated magnitude spectrum of the speech enhancement signal and the magnitude spectrum of the target speech enhancement signal; calculating a cross entropy loss function based on the estimated noise-type tag vector and a true noise-type tag vector of the reference scene noise data; and summing the mean square error loss function and the cross entropy loss function to obtain the loss function.
Alternatively, the loss function may be expressed as:
loss = L_MSE(Mag_est, Mag_tar) + L_CE(ẑ, z) = L_MSE(Mag_est, Mag_tar) - Σ_{i=1..M} z_i · log(ẑ_i)

wherein Mag_est represents the amplitude spectrum of the estimated speech enhancement signal, Mag_tar represents the amplitude spectrum of the target speech enhancement signal, L_MSE represents the mean square error loss function, M represents the total number of scene noise types, i is the index running over the noise types, ẑ represents the estimated noise type tag vector, z represents the true noise type tag vector of the reference scene noise data, and L_CE represents the cross entropy loss function.
According to a fourth aspect of embodiments of the present disclosure, there is provided a voice enhancement apparatus that performs an operation based on a voice enhancement model including a noise mask ratio prediction network and a noise type discrimination network, the voice enhancement apparatus comprising: a data acquisition unit configured to: acquiring a noise-containing voice signal to be enhanced and reference scene noise data, wherein the noise-containing voice signal to be enhanced comprises a speaker voice signal and at least one scene noise data, and the reference scene noise data is scene noise data which is expected to be removed from the at least one scene noise data; a noise type estimation unit configured to: inputting the reference scene noise data into the noise type discrimination network to obtain the noise type characteristics of the reference scene noise data; a noise mask ratio estimation unit configured to: inputting the amplitude spectrum of the noise-containing voice signal to be enhanced and the noise type characteristic into the noise mask ratio prediction network to obtain an estimated noise mask ratio of the reference scene noise data, wherein the noise mask ratio represents the ratio of the amplitude spectrum of the reference scene noise data to the amplitude spectrum of the noise-containing voice signal; a speech enhancement unit configured to: and obtaining an estimated voice enhancement signal based on the estimated noise mask ratio of the reference scene noise data and the noise-containing voice signal to be enhanced, wherein the estimated voice enhancement signal is an estimated voice enhancement signal obtained after the reference scene noise data is removed from the noise-containing voice signal to be enhanced.
Alternatively, the noise mask ratio estimation unit may be configured to: the amplitude spectrum of the noise-containing voice signal to be enhanced and the noise type feature are connected in series; and inputting the characteristics after the series connection into the noise mask ratio prediction network to obtain the estimated noise mask ratio of the noise data of the reference scene.
Alternatively, the noise mask ratio estimation unit may be configured to: inputting the amplitude spectrum of the noise-containing voice signal to be enhanced into a part of the noise mask ratio prediction network to obtain local characteristics of the amplitude spectrum of the noise-containing voice sample; concatenating the local feature with the noise-type feature; and inputting the characteristics after the series connection into another part of the noise mask ratio prediction network to obtain the estimated noise mask ratio of the noise data of the reference scene.
Alternatively, the noise mask ratio prediction network may be a convolutional recurrent neural network including a convolutional neural network and a recurrent neural network.
Alternatively, a portion of the noise mask ratio prediction network may be a convolutional neural network of the convolutional recurrent neural networks, and another portion of the noise mask ratio prediction network may be a recurrent neural network of the convolutional recurrent neural networks.
Alternatively, the speech enhancement unit may be configured to: multiplying the estimated noise mask ratio of the reference scene noise data with the amplitude spectrum of the noise-containing voice signal to be enhanced after being complemented to obtain the estimated amplitude spectrum of the voice enhancement signal; combining the amplitude spectrum of the estimated voice enhancement signal with the phase spectrum of the noise-containing voice signal to be enhanced and performing time-frequency inverse transformation to obtain the estimated voice enhancement signal.
Alternatively, the reference scene noise data may be a scene noise segment that is prerecorded in the environment in which the speaker is located and desired to be removed.
Alternatively, the speech enhancement model may be trained using a training method of the speech enhancement model according to the present disclosure.
According to a fifth aspect of embodiments of the present disclosure, there is provided an electronic device, comprising: at least one processor; at least one memory storing computer-executable instructions, wherein the computer-executable instructions, when executed by the at least one processor, cause the at least one processor to perform a training method or a speech enhancement method according to a speech enhancement model of the present disclosure.
According to a sixth aspect of embodiments of the present disclosure, there is provided a computer readable storage medium storing instructions which, when executed by at least one processor, cause the at least one processor to perform the training method of the speech enhancement model or the speech enhancement method according to the present disclosure.
According to a seventh aspect of embodiments of the present disclosure, there is provided a computer program product comprising computer instructions which, when executed by at least one processor, implement a training method or a speech enhancement method according to the speech enhancement model of the present disclosure.
The technical solutions provided by the embodiments of the present disclosure bring at least the following beneficial effects:
According to the training method and apparatus of the speech enhancement model and the speech enhancement method and apparatus of the present disclosure, noise of a specific scene is input into the speech enhancement model as an auxiliary vector, so that the specific scene noise can be removed in that scene. Therefore, when multiple kinds of scene noise are present, the specific scene noise can be removed according to the user's needs, and the speech enhancement effect desired by the user is achieved.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the principles of the disclosure and do not constitute an undue limitation on the disclosure.
Fig. 1 is an overall system diagram illustrating a speech enhancement model according to an exemplary embodiment of the present disclosure.
Fig. 2 is a flowchart illustrating a training method of a speech enhancement model according to an exemplary embodiment of the present disclosure.
Fig. 3 is a schematic diagram illustrating a structure of a noise type discriminating network according to an exemplary embodiment of the present disclosure.
Fig. 4 is a schematic diagram illustrating deriving a noise mask ratio using a noise mask ratio prediction network according to an exemplary embodiment of the present disclosure.
Fig. 5 is a schematic diagram illustrating deriving a noise mask ratio using a noise mask ratio prediction network according to another exemplary embodiment of the present disclosure.
Fig. 6 is a flowchart illustrating a speech enhancement method according to an exemplary embodiment of the present disclosure.
Fig. 7 is a block diagram illustrating a training apparatus of a speech enhancement model according to an exemplary embodiment of the present disclosure.
Fig. 8 is a block diagram illustrating a voice enhancement device according to an exemplary embodiment of the present disclosure.
Fig. 9 is a block diagram of an electronic device 900 according to an exemplary embodiment of the present disclosure.
Detailed Description
In order to enable those skilled in the art to better understand the technical solutions of the present disclosure, the technical solutions of the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the foregoing figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the disclosure described herein may be capable of operation in sequences other than those illustrated or described herein. The embodiments described in the examples below are not representative of all embodiments consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present disclosure as detailed in the accompanying claims.
It should be noted that, in this disclosure, "at least one of the items" covers three parallel cases: "any one of the items", "a combination of any several of the items", and "all of the items". For example, "including at least one of A and B" covers the following three parallel cases: (1) including A; (2) including B; (3) including A and B. As another example, "performing at least one of step one and step two" covers the following three parallel cases: (1) performing step one; (2) performing step two; (3) performing step one and step two.
A traditional speech enhancement algorithm removes all noise in the scene and keeps only the human voice, regardless of the scene. For example, when people shoot a short video there may be speech, singing and noise in the scene, and the user wants to remove the noise while keeping both the speech and the singing; a conventional speech enhancement algorithm, however, removes the noise together with the singing and keeps only the speech, so the desired speech enhancement effect is not achieved.
In order to solve the above technical problem, the present disclosure provides a scene-based speech enhancement algorithm. Specifically, the present disclosure provides a speech enhancement model into which noise of a specific scene is input as an auxiliary vector, so that the specific scene noise can be removed in that scene; in this way, when multiple kinds of scene noise are present, the specific scene noise can be removed according to the user's needs, and the speech enhancement effect desired by the user is achieved. Hereinafter, a training method and apparatus of a speech enhancement model and a speech enhancement method and apparatus according to exemplary embodiments of the present disclosure will be described in detail with reference to fig. 1 to 9.
Fig. 1 is an overall system diagram illustrating a speech enhancement model according to an exemplary embodiment of the present disclosure.
Referring to fig. 1, a speech enhancement model according to the present disclosure may include a noise mask ratio prediction network for predicting a noise mask ratio of reference scene noise desired to be removed in noisy speech, and a noise type discrimination network for predicting a type of reference scene noise desired to be removed in noisy speech.
Specifically, during the training phase, noisy speech samples may be obtained, which may be mixed from speaker speech samples and at least one scene noise data. The reference scene noise data desired to be removed from the at least one scene noise data may be input into the noise type discrimination network to derive the noise type feature (i.e., an embedding vector) of the reference scene noise data. A time-frequency transform (e.g., a short-time Fourier transform (STFT)) may be performed on the noisy speech samples to obtain a magnitude spectrum and a phase spectrum. The noise type feature of the reference scene noise data may be input as an auxiliary vector to the noise mask ratio prediction network together with the magnitude spectrum of the noisy speech samples to yield a predicted noise mask ratio (mask). The predicted noise mask ratio may be complemented and multiplied element-wise with the magnitude spectrum of the noisy speech to obtain an estimated magnitude spectrum of the speech enhancement signal. A predicted noise type label may be derived from the noise type feature of the reference scene noise data. A mean square error loss function (MSE loss) may be calculated based on the estimated magnitude spectrum of the speech enhancement signal and the magnitude spectrum of the target speech enhancement signal (i.e., the speech signal obtained by removing the reference scene noise data from the noisy speech samples), and a cross entropy loss function (CrossEntropy loss) may be calculated based on the predicted noise type label and the true noise type label of the reference scene noise data. The mean square error loss function and the cross entropy loss function are summed to obtain the final loss function (loss), which is used to train the noise mask ratio prediction network and the noise type discrimination network in the speech enhancement model jointly.
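For illustration only, the training-phase data flow described above can be sketched in PyTorch-style code as follows. The helper name training_step, the passed-in modules noise_type_net and mask_net, the STFT parameters and the tensor shapes are assumptions made for this sketch, not details taken from the patent:

```python
import torch
import torch.nn.functional as F

def training_step(noisy_wave, target_wave, ref_noise_mfcc, true_label,
                  noise_type_net, mask_net, n_fft=512, hop=128):
    """One forward pass of the training flow sketched above (illustrative only).

    noisy_wave, target_wave : (batch, samples) float tensors
    ref_noise_mfcc          : (batch, frames, n_mfcc) MFCC of the reference noise
    true_label              : (batch, M) one-hot noise type labels
    """
    window = torch.hann_window(n_fft)

    # Time-frequency transform of the noisy sample and of the target signal
    # (the noisy speech with the reference scene noise removed).
    X = torch.stft(noisy_wave, n_fft, hop, window=window, return_complex=True)
    Y = torch.stft(target_wave, n_fft, hop, window=window, return_complex=True)
    mag_ori, mag_tar = X.abs(), Y.abs()

    # Embedding (auxiliary vector) of the reference scene noise, plus the
    # logits used for the noise type classification loss.
    embedding, logits = noise_type_net(ref_noise_mfcc)

    # Predicted mask of the reference scene noise; its complement keeps the rest.
    mask_w = mask_net(mag_ori, embedding)        # values in [0, 1]
    mag_est = mag_ori * (1.0 - mask_w)           # estimated enhanced magnitude

    mse = F.mse_loss(mag_est, mag_tar)                                  # MSE term
    ce = -(true_label * torch.log_softmax(logits, -1)).sum(-1).mean()   # CE term
    return mse + ce                                                     # joint loss
```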
In the inference stage, the reference scene noise that is desired to be removed may be input into the noise type discrimination network, resulting in the noise type feature (i.e., an embedding vector) of the reference scene noise data. A time-frequency transform (e.g., a short-time Fourier transform (STFT)) may be performed on the noisy speech to obtain a magnitude spectrum and a phase spectrum. The noise type feature of the reference scene noise data may be input as an auxiliary vector to the noise mask ratio prediction network together with the magnitude spectrum of the noisy speech to obtain a predicted noise mask ratio (mask). The predicted noise mask ratio may be complemented and multiplied element-wise with the magnitude spectrum of the noisy speech, combined with the phase spectrum of the noisy speech, and subjected to an inverse time-frequency transform (e.g., an inverse short-time Fourier transform (ISTFT)), resulting in an estimate of the speech obtained by removing the reference scene noise data from the noisy speech, which serves as the enhanced speech.
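A matching minimal sketch of the inference stage just described, under the same assumptions as the training sketch above (noise_type_net and mask_net are hypothetical module names, and the STFT parameters are illustrative):

```python
import torch

def enhance(noisy_wave, ref_noise_mfcc, noise_type_net, mask_net,
            n_fft=512, hop=128):
    """Remove the reference scene noise from noisy_wave (illustrative sketch)."""
    window = torch.hann_window(n_fft)
    X = torch.stft(noisy_wave, n_fft, hop, window=window, return_complex=True)
    mag_ori, pha_ori = X.abs(), X.angle()

    embedding, _ = noise_type_net(ref_noise_mfcc)   # embedding of the noise to remove
    mask_w = mask_net(mag_ori, embedding)

    # The complement of the mask keeps everything except the reference scene noise.
    mag_est = mag_ori * (1.0 - mask_w)

    # Recombine with the noisy phase and return to the time domain via ISTFT.
    spec_est = torch.polar(mag_est, pha_ori)
    return torch.istft(spec_est, n_fft, hop, window=window)
```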
By utilizing the voice enhancement model disclosed by the invention, the specific scene noise in the noise-containing voice can be removed by means of the segment of the specific scene noise, so that different voice enhancement effects based on different scenes are realized, and the user experience is improved.
Fig. 2 is a flowchart illustrating a training method of a speech enhancement model according to an exemplary embodiment of the present disclosure. Here, as described above, the speech enhancement model may include a noise mask ratio prediction network for predicting a noise mask ratio of reference scene noise desired to be removed in the noisy speech, and a noise type discrimination network for predicting a type of reference scene noise desired to be removed in the noisy speech. That is, the speech enhancement model may be used to obtain an estimated speech enhancement signal that results after removal of reference scene noise data from noisy speech.
Referring to fig. 2, in step 201, a noisy speech sample may be obtained, wherein the noisy speech sample is formed by mixing a speaker speech sample with at least one scene noise data. Here, the types of scenes may include subway stations, cafes, buses, streets, and the like. Scene noise data may be obtained by recording in different scenes (e.g., subway stations, cafes, buses, streets, etc.) using a recording device. The speaker speech samples may be obtained from a speech data set or by recording speech segments of different speakers.
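As an illustration of how such training mixtures might be constructed, a minimal sketch is given below. It assumes the speech sample and the scene noise clips are already loaded as equal-length float arrays at the same sample rate; the per-noise SNR handling is an assumption, since the patent does not specify how the mixing is performed:

```python
import numpy as np

def mix_noisy_sample(speech, scene_noises, snr_db=5.0):
    """Mix one speaker speech sample with one or more scene noise clips."""
    mixture = speech.copy()
    for noise in scene_noises:
        # Scale each noise clip so that it sits at the requested SNR w.r.t. the speech.
        speech_power = np.mean(speech ** 2)
        noise_power = np.mean(noise ** 2) + 1e-12
        gain = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
        mixture = mixture + gain * noise
    return mixture
```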
In step 202, reference scene noise data in the at least one scene noise data may be input into the noise type discrimination network to obtain the noise type feature of the reference scene noise data. Here, the reference scene noise data may be one of the at least one scene noise data and is the scene noise data desired to be removed. In addition, the noise type feature is a vector of fixed dimension describing the scene noise type information, also referred to as an auxiliary vector. For example, if the reference scene noise data is denoted s, the noise type discrimination network is denoted M_sv, and the noise type feature is denoted embedding, the process of step 202 may be expressed as the following formula (1):
embedding = M_sv(s)   (1)
According to an exemplary embodiment of the present disclosure, the input to the noise type discrimination network may be the Mel-frequency cepstral coefficients (MFCC) of the reference scene noise data. For example, one implementation of the noise type discrimination network is a three-layer long short-term memory (LSTM) network. Fig. 3 is a schematic diagram illustrating a structure of a noise type discrimination network according to an exemplary embodiment of the present disclosure. The MFCC of the reference scene noise data can be input into three LSTM layers (LSTM1, LSTM2 and LSTM3), the hidden state output by the last layer LSTM3 is taken, and the auxiliary vector embedding is obtained through a fully connected layer (dense). Of course, the noise type discrimination network is not limited to the above network or model and may be any other network that can realize noise type discrimination; the present disclosure is not limited thereto.
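A minimal PyTorch sketch of such a three-layer LSTM discrimination network is shown below. The layer sizes, the embedding dimension and the number of noise classes are illustrative assumptions; the classification head corresponds to the fully connected layer used later to obtain the estimated noise type label:

```python
import torch
import torch.nn as nn

class NoiseTypeNet(nn.Module):
    """MFCC frames of the reference scene noise -> fixed-dimension embedding."""
    def __init__(self, n_mfcc=40, hidden=128, emb_dim=64, num_classes=10):
        super().__init__()
        self.lstm = nn.LSTM(n_mfcc, hidden, num_layers=3, batch_first=True)
        self.dense = nn.Linear(hidden, emb_dim)             # embedding layer
        self.classifier = nn.Linear(emb_dim, num_classes)   # estimated noise type label

    def forward(self, mfcc):                  # mfcc: (batch, frames, n_mfcc)
        out, _ = self.lstm(mfcc)
        hidden_state = out[:, -1, :]          # hidden state at the last frame
        embedding = self.dense(hidden_state)  # auxiliary vector (embedding)
        logits = self.classifier(embedding)   # logits over the M noise types
        return embedding, logits
```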
Referring back to fig. 2, at step 203, the magnitude spectrum of the noisy speech samples and the noise-type features may be input into the noise mask ratio prediction network, resulting in an estimated noise mask ratio of the reference scene noise data.
According to an exemplary embodiment of the present disclosure, the magnitude spectrum of the noisy speech sample may be obtained by a time-frequency transform. For example, the noisy speech samples may be transformed from the time domain to the frequency domain by a short-time Fourier transform (STFT) to obtain the amplitude information and phase information of each frame of the audio signal, thereby obtaining the amplitude spectrum and phase spectrum of the noisy speech sample. For example, suppose a mixed signal of length T containing several types of noise is denoted x(t) in the time domain, the reference scene noise signal is w(t), and the speech after removing the reference scene noise is y(t), where t represents time and 0 < t ≤ T. Then the noisy speech signal x(t) can be expressed in the time domain as the following formula (2):
x(t) = w(t) + y(t)   (2)
After a short-time Fourier transform, the noisy speech signal x(t) can be expressed in the time-frequency domain as the following formula (3):
X(n, k) = W(n, k) + Y(n, k)   (3)
wherein n is the frame index, 0 < n ≤ N, with N being the total number of frames, and k is the frequency bin index, 0 < k ≤ K, with K being the total number of frequency bins.
After obtaining the noise-containing signal X(n, k) in the frequency domain, its amplitude spectrum Mag_ori and phase spectrum Pha_ori can be obtained, which can be expressed as the following formula (4):
Mag(n, k) = abs(X(n, k)),  Pha(n, k) = angle(X(n, k))   (4)
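For example, the magnitude spectrum and phase spectrum of formula (4) could be computed as follows (the frame size and hop length are illustrative assumptions):

```python
import torch

def stft_mag_phase(x, n_fft=512, hop=128):
    """Return the magnitude spectrum Mag(n, k) and phase spectrum Pha(n, k) of x."""
    window = torch.hann_window(n_fft)
    X = torch.stft(x, n_fft, hop_length=hop, window=window, return_complex=True)
    return X.abs(), X.angle()   # Mag = abs(X), Pha = angle(X)
```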
After obtaining the amplitude spectrum of the noisy signal, the amplitude spectrum Mag(n, k) of the noisy speech and the noise type feature embedding can be input into the noise mask ratio prediction network M_se to obtain the estimated noise mask ratio mask_w of the reference scene noise data. That is, the above process can be expressed as the following formula (5):
mask_w = M_se(Mag, embedding)   (5)
According to an exemplary embodiment of the present disclosure, the noise mask ratio of the reference scene noise data may be the ratio of the magnitude spectrum of the reference scene noise data to the magnitude spectrum of the noisy speech signal. In particular, the noise mask ratio of the reference scene noise data may be a gain matrix of the same size as the amplitude spectrum of the reference scene noise data, in which the value of each element lies in [0, 1]. Multiplying the noise mask ratio of the reference scene noise data element-wise with the amplitude spectrum of the noise-containing speech signal and then performing the inverse time-frequency transform yields the reference scene noise data.
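The relationship described here between the mask, the noisy magnitude spectrum and the reference scene noise can be illustrated by the following small sketch; the clamping to [0, 1] is an assumption made for the illustration, since the ratio of magnitudes can in principle exceed 1 for a sum of signals:

```python
import torch

# For a noisy signal x = w + y, an "ideal" mask of the reference noise w would be
#   mask_w(n, k) = Mag_W(n, k) / Mag_X(n, k), restricted to [0, 1].
def ideal_noise_mask(noisy_mag, noise_mag, eps=1e-8):
    return torch.clamp(noise_mag / (noisy_mag + eps), 0.0, 1.0)

# Multiplying mask_w element-wise with Mag_X, recombining with the noisy phase and
# applying the inverse time-frequency transform gives back an estimate of the
# reference scene noise itself; the complement (1 - mask_w) keeps the rest.
```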
According to an exemplary embodiment of the present disclosure, the amplitude spectrum of the noisy speech sample and the noise type feature may be concatenated, and the concatenated features are input into the noise mask ratio prediction network to obtain the estimated noise mask ratio of the reference scene noise data. According to another exemplary embodiment of the present disclosure, the amplitude spectrum of the noisy speech sample is input into one part of the noise mask ratio prediction network to obtain local features of the amplitude spectrum of the noisy speech sample; the local features are concatenated with the noise type feature; and the concatenated features are input into the other part of the noise mask ratio prediction network to obtain the estimated noise mask ratio of the reference scene noise data.
According to an exemplary embodiment of the present disclosure, the noise mask ratio prediction network may be a convolutional recurrent neural network (CRNN) including a convolutional neural network (CNN) and a recurrent neural network (RNN). Fig. 4 is a schematic diagram illustrating deriving a noise mask ratio using a noise mask ratio prediction network according to an exemplary embodiment of the present disclosure. Referring to fig. 4, the amplitude spectrum (Mag) of the noisy speech sample and the noise type feature (embedding) may be concatenated (Concat), and the concatenated features may be input into the noise mask ratio prediction network; features are extracted using the CNN and fed into the RNN for temporal context modeling, and the estimated noise mask ratio mask_w is output. Fig. 5 is a schematic diagram illustrating deriving a noise mask ratio using a noise mask ratio prediction network according to another exemplary embodiment of the present disclosure. Referring to fig. 5, the amplitude spectrum (Mag) of the noisy speech sample may be input into one part of the noise mask ratio prediction network (e.g., the CNN) to obtain local features of the amplitude spectrum; the local features are then concatenated (Concat) with the noise type feature (embedding), and the concatenated features are input into the other part of the noise mask ratio prediction network (e.g., the RNN) to obtain the estimated noise mask ratio mask_w. Of course, the noise mask ratio prediction network is not limited to the above network or model and may be any other network that can realize noise mask ratio prediction; the present disclosure is not limited thereto.
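A rough PyTorch sketch of the second variant (fig. 5), in which the CNN first extracts local features from the magnitude spectrum and the embedding is concatenated before the RNN, is given below. The layer sizes, the number of frequency bins and the use of a GRU are assumptions for the sketch; the patent does not specify the network at this level of detail:

```python
import torch
import torch.nn as nn

class MaskNet(nn.Module):
    """Magnitude spectrum + noise type embedding -> estimated mask of the noise."""
    def __init__(self, n_freq=257, emb_dim=64, cnn_ch=16, rnn_hidden=256):
        super().__init__()
        # "One part of the network": CNN extracting local features per frame.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, cnn_ch, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(cnn_ch, cnn_ch, kernel_size=3, padding=1), nn.ReLU(),
        )
        # "The other part": RNN over the concatenated (local features, embedding).
        self.rnn = nn.GRU(cnn_ch * n_freq + emb_dim, rnn_hidden, batch_first=True)
        self.out = nn.Linear(rnn_hidden, n_freq)

    def forward(self, mag, embedding):        # mag: (B, F, T), embedding: (B, E)
        feat = self.cnn(mag.unsqueeze(1))     # (B, C, F, T) local features
        b, c, f, t = feat.shape
        feat = feat.permute(0, 3, 1, 2).reshape(b, t, c * f)   # (B, T, C*F)
        emb = embedding.unsqueeze(1).expand(b, t, -1)          # repeat per frame
        out, _ = self.rnn(torch.cat([feat, emb], dim=-1))      # concat + RNN context
        mask_w = torch.sigmoid(self.out(out)).transpose(1, 2)  # (B, F, T) in [0, 1]
        return mask_w
```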
Referring back to fig. 2, at step 204, a loss function may be calculated based on the estimated noise mask ratio of the reference scene noise data and the noise type characteristics of the reference scene noise data.
According to an exemplary embodiment of the present disclosure, the estimated noise mask ratio of the reference scene noise data is complemented and multiplied by the amplitude spectrum of the noisy speech sample to obtain an estimated amplitude spectrum of the speech enhancement signal. This process may be expressed as the following equation (6), for example.
Mag_est = Mag_ori ⊙ (1 - mask_w)   (6)
Further, an estimated noise type label vector may be obtained by passing the noise type feature through a fully connected layer. That is, the noise type feature can be projected through the fully connected layer to a vector ẑ of fixed dimension.
Thus, the loss function may be calculated based on the estimated amplitude spectrum Mag_est of the speech enhancement signal, the amplitude spectrum Mag_tar of the target speech enhancement signal, the estimated noise type label vector ẑ, and the true noise type label vector z of the reference scene noise data. Here, the target speech enhancement signal is the signal obtained after removing the reference scene noise data from the noisy speech sample. Further, each of the at least one scene noise data has a true noise type label vector, e.g., a one-hot vector.
For example, a mean square error loss function may be calculated based on the estimated amplitude spectrum of the speech enhancement signal and the amplitude spectrum of the target speech enhancement signal. For example, the mean square error loss function can be expressed as the following formula (7):
L_MSE = (1/(N·K)) · Σ_{n=1..N} Σ_{k=1..K} (Mag_est(n, k) - Mag_tar(n, k))²   (7)
A cross entropy loss function may be calculated based on the estimated noise type label vector and the true noise type label vector of the reference scene noise data. For example, the cross entropy loss function may be expressed as the following formula (8):
L_CE = -Σ_{i=1..M} z_i · log(ẑ_i)   (8)
The mean square error loss function and the cross entropy loss function may be summed to obtain the final loss function. For example, the final loss function may be expressed as the following formula (9):
loss = L_MSE + L_CE   (9)
wherein Mag_est represents the amplitude spectrum of the estimated speech enhancement signal, Mag_tar represents the amplitude spectrum of the target speech enhancement signal, L_MSE represents the mean square error loss function, M represents the total number of scene noise types, i is the index running over the noise types, ẑ represents the estimated noise type label vector, z represents the true noise type label vector of the reference scene noise data, and L_CE represents the cross entropy loss function.
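As a small worked example of the cross entropy term (the numbers are purely illustrative): with M = 3 noise types, a true one-hot label z = (0, 1, 0) for the reference scene noise, and an estimated label vector ẑ = (0.1, 0.8, 0.1), formula (8) gives L_CE = -log(0.8) ≈ 0.22 (natural logarithm), and this value is added to the mean square error of formula (7) to form the joint loss of formula (9).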
In step 205, the speech enhancement model may be trained by adjusting the parameters of the noise mask ratio prediction network and the noise type discrimination network using the calculated loss function. That is, the value of formula (9) above may be back-propagated to jointly adjust the parameters of the noise mask ratio prediction network and the noise type discrimination network.
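An illustrative sketch of this joint update, reusing the hypothetical training_step helper and the NoiseTypeNet/MaskNet modules sketched earlier; the optimizer choice, the learning rate and the name training_batches (any iterator over (noisy sample, target, reference-noise MFCC, label) tuples) are assumptions:

```python
import torch

noise_type_net = NoiseTypeNet()   # sketched after fig. 3 above
mask_net = MaskNet()              # sketched after fig. 5 above

# The parameters of both networks are optimized jointly with the summed loss.
optimizer = torch.optim.Adam(
    list(noise_type_net.parameters()) + list(mask_net.parameters()), lr=1e-3)

for noisy_wave, target_wave, ref_noise_mfcc, true_label in training_batches:
    loss = training_step(noisy_wave, target_wave, ref_noise_mfcc, true_label,
                         noise_type_net, mask_net)
    optimizer.zero_grad()
    loss.backward()     # back-propagate the value of formula (9)
    optimizer.step()    # adjust the parameters of both networks
```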
Fig. 6 is a flowchart illustrating a speech enhancement method according to an exemplary embodiment of the present disclosure. Here, the voice enhancement method according to the exemplary embodiment of the present disclosure may be implemented based on the voice enhancement model according to the present disclosure. A speech enhancement model according to the present disclosure may include a noise mask ratio prediction network and a noise type discrimination network. For example, a speech enhancement model according to the present disclosure may be trained by using a training method of the speech enhancement model according to the present disclosure.
Referring to fig. 6, in step 601, a noisy speech signal to be enhanced and reference scene noise data may be obtained, wherein the noisy speech signal to be enhanced includes a speaker speech signal and at least one scene noise data, and the reference scene noise data is the scene noise data desired to be removed from among the at least one scene noise data. For example, the types of scenes may include subway stations, cafes, buses, streets, and the like. For example, the reference scene noise data may be a pre-recorded clip, recorded in the environment in which the speaker is located, of the scene noise that the user desires to remove.
In step 602, the reference scene noise data may be input into the noise type discrimination network to obtain a noise type characteristic of the reference scene noise data. For example, the noise type discrimination network may be, but is not limited to, the structure shown in fig. 3.
Here, the present disclosure does not limit the execution order of steps 601 and 602. For example, in step 601, the reference scene noise data may be acquired first and then the noise-containing speech signal to be enhanced may be acquired, or the noise-containing speech signal to be enhanced may be acquired first and then the reference scene noise data may be acquired, or the noise-containing speech signal to be enhanced and the reference scene noise data may be acquired simultaneously. As another example, the reference scene noise data may be acquired in step 601 and its noise type feature obtained in step 602 before the noise-containing speech signal to be enhanced is acquired in step 601.
In step 603, the magnitude spectrum of the noise-containing speech signal to be enhanced and the noise type feature may be input into the noise mask ratio prediction network to obtain an estimated noise mask ratio of the reference scene noise data, where the noise mask ratio may represent a ratio of the magnitude spectrum of the reference scene noise data to the magnitude spectrum of the noise-containing speech signal.
According to an exemplary embodiment of the present disclosure, the amplitude spectrum of the noisy speech signal to be enhanced and the noise-type feature may be concatenated; and inputting the characteristics after the series connection into the noise mask ratio prediction network to obtain the estimated noise mask ratio of the noise data of the reference scene. According to another exemplary embodiment of the present disclosure, the amplitude spectrum of the noise-containing speech signal to be enhanced may be input into a part of the noise mask ratio prediction network to obtain local features of the amplitude spectrum of the noise-containing speech sample; concatenating the local feature with the noise-type feature; and inputting the characteristics after the series connection into another part of the noise mask ratio prediction network to obtain the estimated noise mask ratio of the noise data of the reference scene.
According to an exemplary embodiment of the present disclosure, the noise mask ratio prediction network may be a Convolutional Recurrent Neural Network (CRNN) including a Convolutional Neural Network (CNN) and a Recurrent Neural Network (RNN). For example, as shown in fig. 4, the magnitude spectrum of the noise-containing speech signal to be enhanced and the noise type feature may be directly connected in series, and the characteristics after being connected in series are input into the noise mask ratio prediction network, so as to obtain an estimated noise mask ratio. For another example, as shown in fig. 5, the amplitude spectrum of the noise-containing speech signal to be enhanced may be input into a part of the network (e.g., CNN) in the noise mask ratio prediction network, to obtain local features of the amplitude spectrum of the noise-containing speech sample, then the local features are connected in series with the noise type features, and then the connected features are input into another part of the noise mask ratio prediction network (e.g., RNN) to obtain the estimated noise mask ratio.
In step 604, an estimated speech enhancement signal may be obtained based on the estimated noise mask ratio of the reference scene noise data and the noisy speech signal to be enhanced, wherein the estimated speech enhancement signal is the estimated speech enhancement signal obtained after removing the reference scene noise data from the noisy speech signal to be enhanced.
According to an exemplary embodiment of the present disclosure, the estimated noise mask ratio of the reference scene noise data may be complemented and multiplied element-wise with the amplitude spectrum of the noise-containing speech signal to be enhanced to obtain the amplitude spectrum of the estimated speech enhancement signal, for example as shown in formula (6). The amplitude spectrum of the estimated speech enhancement signal is then combined with the phase spectrum of the noise-containing speech signal to be enhanced and subjected to an inverse time-frequency transform to obtain the estimated speech enhancement signal. For example, the estimated amplitude spectrum Mag_est of the speech enhancement signal may be combined with the phase spectrum Pha_ori of the noise-containing speech signal to be enhanced, and the estimated speech enhancement signal y from which the reference scene noise has been removed is obtained through an inverse short-time Fourier transform (ISTFT). That is, the estimated speech enhancement signal may be expressed as the following formula (10):
y = ISTFT(Mag_est, Pha_ori)   (10)
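An end-to-end usage sketch of the inference flow of fig. 6, using the hypothetical enhance helper and the NoiseTypeNet/MaskNet modules sketched in the training section; the file names are placeholders, and torchaudio is used here only for convenience in loading audio and computing the MFCC features:

```python
import torchaudio

# Load the noisy recording and a prerecorded clip of the scene noise to remove.
noisy_wave, sr = torchaudio.load("noisy_speech.wav")        # placeholder file names
ref_noise, _ = torchaudio.load("cafe_noise_clip.wav")

# MFCC features of the reference scene noise, arranged as (batch, frames, n_mfcc).
mfcc = torchaudio.transforms.MFCC(sample_rate=sr, n_mfcc=40)(ref_noise)
ref_noise_mfcc = mfcc.transpose(1, 2)

enhanced = enhance(noisy_wave, ref_noise_mfcc, noise_type_net, mask_net)
torchaudio.save("enhanced.wav", enhanced, sr)                # y = ISTFT(Mag_est, Pha_ori)
```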
Fig. 7 is a block diagram illustrating a training apparatus of a speech enhancement model according to an exemplary embodiment of the present disclosure.
Referring to fig. 7, a training apparatus 700 of a speech enhancement model according to an exemplary embodiment of the present disclosure may include a sample acquisition unit 701, a noise type estimation unit 702, a noise mask ratio estimation unit 703, a loss function calculation unit 704, and a model training unit 705.
The sample acquisition unit 701 may acquire a noisy speech sample, wherein the noisy speech sample is formed by mixing a speaker speech sample with at least one scene noise data. Here, the types of scenes may include subway stations, cafes, buses, streets, and the like. Scene noise data may be obtained by recording in different scenes (e.g., subway stations, cafes, buses, streets, etc.) using a recording device. The speaker speech samples may be obtained from a speech data set or by recording speech segments of different speakers.
The noise type estimation unit 702 may input reference scene noise data in the at least one scene noise data into the noise type discrimination network, to obtain a noise type characteristic of the reference scene noise data. Here, the reference scene noise data may be one of the at least one scene noise data, and is scene noise data desired to be removed. Furthermore, the noise type feature is a vector of fixed dimensions, also referred to as an auxiliary vector, that describes the scene noise type information.
According to an exemplary embodiment of the present disclosure, the input to the noise type discrimination network may be the Mel-frequency cepstral coefficients (MFCC) of the reference scene noise data. For example, one implementation of the noise type discrimination network is a three-layer long short-term memory (LSTM) network. As shown in fig. 3, the noise type estimation unit 702 may input the MFCC of the reference scene noise data into three LSTM layers (LSTM1, LSTM2, LSTM3), take the hidden state output by the last layer LSTM3, and obtain the auxiliary vector embedding through a fully connected layer (dense). Of course, the noise type discrimination network is not limited to the above network or model and may be any other network that can realize noise type discrimination; the present disclosure is not limited thereto.
The noise mask ratio estimation unit 703 may input the magnitude spectrum of the noise-containing speech sample and the noise type feature into the noise mask ratio prediction network to obtain an estimated noise mask ratio of the reference scene noise data.
According to an exemplary embodiment of the present disclosure, the magnitude spectrum of the noisy speech sample may be obtained by a time-frequency transform. For example, the noise mask ratio estimation unit 703 may transform the noisy speech sample from the time domain to the frequency domain by a Short-Time Fourier Transform (STFT) to obtain the amplitude information and phase information of each frame of the audio signal, thereby obtaining the amplitude spectrum and phase spectrum of the noisy speech sample.
According to an exemplary embodiment of the present disclosure, the noise mask ratio of the reference scene noise data may be the ratio of the amplitude spectrum of the reference scene noise data to the amplitude spectrum of the noisy speech signal. In particular, the noise mask ratio of the reference scene noise data may be a gain matrix with the same dimensions as the amplitude spectrum of the reference scene noise data, where the value of each element lies in [0, 1]. Multiplying the noise mask ratio of the reference scene noise data element-wise by the amplitude spectrum of the noisy speech signal and then performing an inverse time-frequency transform recovers the reference scene noise data.
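By way of a non-authoritative illustration, the relationship between the magnitude spectra and the noise mask ratio can be sketched as follows; the STFT parameters and the small constant added to the denominator are implementation assumptions:

```python
import numpy as np
import librosa

def noise_mask_ratio(noisy_wave, noise_wave, n_fft=512, hop=256):
    """Gain matrix of the reference scene noise: ratio of its magnitude spectrum to that of the noisy speech.

    Both inputs are time-aligned waveforms; every element of the returned matrix lies in [0, 1].
    """
    mag_noisy = np.abs(librosa.stft(noisy_wave, n_fft=n_fft, hop_length=hop))
    mag_noise = np.abs(librosa.stft(noise_wave, n_fft=n_fft, hop_length=hop))
    return np.clip(mag_noise / (mag_noisy + 1e-8), 0.0, 1.0)
```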
According to an exemplary embodiment of the present disclosure, the noise mask ratio estimation unit 703 may concatenate the magnitude spectrum of the noisy speech sample with the noise type feature, and input the concatenated features into the noise mask ratio prediction network to obtain the estimated noise mask ratio of the reference scene noise data. According to another exemplary embodiment of the present disclosure, the noise mask ratio estimation unit 703 may input the amplitude spectrum of the noisy speech sample into one part of the noise mask ratio prediction network to obtain local features of the amplitude spectrum of the noisy speech sample, concatenate the local features with the noise type feature, and input the concatenated features into another part of the noise mask ratio prediction network to obtain the estimated noise mask ratio of the reference scene noise data.
According to an exemplary embodiment of the present disclosure, the noise mask ratio prediction network may be a Convolutional Recurrent Neural Network (CRNN) including a Convolutional Neural Network (CNN) and a Recurrent Neural Network (RNN). For example, as shown in fig. 4, the noise mask ratio estimation unit 703 may concatenate (Concat) the amplitude spectrum (Mag) of the noisy speech sample and the noise type feature (embedding), input the concatenated features into the noise mask ratio prediction network, extract features using the CNN, send the extracted features into the RNN to establish temporal context, and output the estimated noise mask ratio mask_w. For another example, as shown in fig. 5, the noise mask ratio estimation unit 703 may input the amplitude spectrum (Mag) of the noisy speech sample into one part of the noise mask ratio prediction network (e.g., the CNN) to obtain local features of the amplitude spectrum of the noisy speech sample, concatenate (Concat) the local features with the noise type feature (embedding), and input the concatenated features into another part of the noise mask ratio prediction network (e.g., the RNN) to obtain the estimated noise mask ratio mask_w. Of course, the noise mask ratio prediction network is not limited to the above network or model, but may be any other network that can implement noise mask ratio prediction, and the present disclosure is not limited thereto.
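A minimal PyTorch sketch of a fig. 5-style CRNN is shown below for illustration; the channel counts, kernel sizes, the use of a GRU as the recurrent part, and the sigmoid output are assumptions rather than details fixed by the present disclosure:

```python
import torch
import torch.nn as nn

class MaskCRNN(nn.Module):
    """Sketch of the noise mask ratio prediction network (fig. 5 style):
    CNN local features -> concatenation with the noise type embedding -> GRU -> per-bin mask in [0, 1]."""
    def __init__(self, n_bins=257, emb_dim=128, cnn_ch=32, rnn_hidden=256):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, cnn_ch, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(cnn_ch, cnn_ch, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.rnn = nn.GRU(cnn_ch * n_bins + emb_dim, rnn_hidden, batch_first=True)
        self.out = nn.Linear(rnn_hidden, n_bins)

    def forward(self, mag, embedding):
        # mag: (batch, frames, n_bins); embedding: (batch, emb_dim)
        feat = self.cnn(mag.unsqueeze(1))                  # (batch, cnn_ch, frames, n_bins)
        feat = feat.permute(0, 2, 1, 3).flatten(2)         # (batch, frames, cnn_ch * n_bins)
        emb = embedding.unsqueeze(1).expand(-1, feat.size(1), -1)
        feat = torch.cat([feat, emb], dim=-1)              # concatenate local features and embedding
        hidden, _ = self.rnn(feat)                         # establish temporal context
        return torch.sigmoid(self.out(hidden))             # estimated noise mask ratio mask_w
```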
The loss function calculation unit 704 may calculate a loss function based on the estimated noise mask ratio of the reference scene noise data and the noise type characteristic of the reference scene noise data.
According to an exemplary embodiment of the present disclosure, the loss function calculation unit 704 may obtain the estimated amplitude spectrum of the speech enhancement signal by multiplying the estimated noise mask ratio of the reference scene noise data with the amplitude spectrum of the noisy speech sample. The loss function calculation unit 704 may obtain an estimated noise type label vector by passing the noise type feature through a fully-connected layer; that is, the noise type feature may be projected through the fully-connected layer to a vector of fixed dimension. Accordingly, the loss function calculation unit 704 may calculate the loss function based on the estimated amplitude spectrum of the speech enhancement signal, the amplitude spectrum of the target speech enhancement signal, the estimated noise type tag vector, and the real noise type tag vector of the reference scene noise data. Here, the target speech enhancement signal is the signal obtained after the reference scene noise data has been removed from the noisy speech sample. Further, each of the at least one scene noise data has a true noise type tag vector, e.g., a one-hot vector.
For example, the loss function calculation unit 704 may calculate a mean square error loss function based on the estimated amplitude spectrum of the speech enhancement signal and the amplitude spectrum of the target speech enhancement signal, as shown in equation (7). The loss function calculation unit 704 may calculate a cross entropy loss function based on the estimated noise type tag vector and the real noise type tag vector of the reference scene noise data, as shown in equation (8). The loss function calculation unit 704 may sum the mean square error loss function and the cross entropy loss function to obtain a final loss function, as shown in equation (9).
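As an illustration only, the combined loss of equations (7)-(9) might be computed as in the following PyTorch sketch. Applying the complement (1 - mask) before the multiplication follows the form used when reconstructing the enhanced signal; the description above states the multiplication without the complement, so this detail, like the helper names, is an assumption:

```python
import torch
import torch.nn.functional as F

def total_loss(mask_w, mag_noisy, mag_target, embedding, noise_type_fc, true_label):
    """Sketch of the combined training loss (equations (7)-(9)).

    mask_w        : estimated noise mask ratio, shape (batch, frames, n_bins)
    mag_noisy     : amplitude spectrum of the noisy speech sample
    mag_target    : amplitude spectrum of the target speech enhancement signal
    embedding     : noise type feature from the noise type discrimination network
    noise_type_fc : fully-connected layer projecting the embedding to class logits
    true_label    : integer class index of the true noise type (carries the one-hot tag information)
    """
    mag_est = (1.0 - mask_w) * mag_noisy       # estimated amplitude spectrum of the speech enhancement signal
    mse = F.mse_loss(mag_est, mag_target)      # mean square error loss, equation (7)
    logits = noise_type_fc(embedding)          # estimated noise type label vector
    ce = F.cross_entropy(logits, true_label)   # cross entropy loss, equation (8)
    return mse + ce                            # final loss, equation (9)
```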
The model training unit 705 may train the speech enhancement model by adjusting the parameters of the noise mask ratio prediction network and the noise type discrimination network according to the calculated loss function. That is, the model training unit 705 may back-propagate the value of equation (9) above to jointly adjust the parameters of the noise mask ratio prediction network and the noise type discrimination network.
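Reusing the sketches above, a minimal joint training step could look as follows; the optimizer, learning rate, number of noise types, and the loader object are illustrative placeholders rather than details given in the present disclosure:

```python
import itertools
import torch

num_noise_types = 5                                       # assumed number of scene noise types
noise_type_net = NoiseTypeNet()                           # noise type discrimination network (sketch above)
mask_net = MaskCRNN()                                     # noise mask ratio prediction network (sketch above)
noise_type_fc = torch.nn.Linear(128, num_noise_types)     # projects the embedding to noise type logits

optimizer = torch.optim.Adam(
    itertools.chain(noise_type_net.parameters(), mask_net.parameters(), noise_type_fc.parameters()),
    lr=1e-3,
)

for mfcc, mag_noisy, mag_target, true_label in loader:    # 'loader' yields one training batch at a time
    embedding = noise_type_net(mfcc)                      # noise type feature of the reference scene noise
    mask_w = mask_net(mag_noisy, embedding)               # estimated noise mask ratio
    loss = total_loss(mask_w, mag_noisy, mag_target, embedding, noise_type_fc, true_label)
    optimizer.zero_grad()
    loss.backward()                                       # back-propagate the value of equation (9)
    optimizer.step()                                      # jointly adjust both networks' parameters
```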
Fig. 8 is a block diagram illustrating a voice enhancement device according to an exemplary embodiment of the present disclosure. Here, the voice enhancement apparatus according to the exemplary embodiment of the present disclosure may be implemented based on the voice enhancement model according to the present disclosure. A speech enhancement model according to the present disclosure may include a noise mask ratio prediction network and a noise type discrimination network. For example, a speech enhancement model according to the present disclosure may be trained by using a training method of the speech enhancement model according to the present disclosure.
Referring to fig. 8, a voice enhancement apparatus 800 according to an exemplary embodiment of the present disclosure may include a data acquisition unit 801, a noise type estimation unit 802, a noise mask ratio estimation unit 803, and a voice enhancement unit 804.
The data acquisition unit 801 may acquire a noise-containing speech signal to be enhanced and reference scene noise data, wherein the noise-containing speech signal to be enhanced includes a speaker speech signal and at least one kind of scene noise data, and the reference scene noise data is the scene noise data desired to be removed from among the at least one kind of scene noise data. For example, the types of scenes may include subway stations, cafes, buses, streets, and the like. For example, the reference scene noise data may be a pre-recorded scene noise clip, recorded in the environment in which the speaker is located, that the user desires to remove.
The noise type estimation unit 802 may input the reference scene noise data into the noise type discrimination network, to obtain a noise type characteristic of the reference scene noise data.
Here, the present disclosure does not limit the order in which the data acquisition unit 801 and the noise type estimation unit 802 operate. For example, the data acquisition unit 801 may acquire the reference scene noise data first and then the noise-containing speech signal to be enhanced, may acquire the noise-containing speech signal to be enhanced first and then the reference scene noise data, or may acquire both at the same time. For another example, the data acquisition unit 801 may first acquire the reference scene noise data, the noise type estimation unit 802 may then obtain the noise type feature, and the data acquisition unit 801 may then acquire the noise-containing speech signal to be enhanced.
The noise mask ratio estimation unit 803 may input the magnitude spectrum of the noise-containing speech signal to be enhanced and the noise type feature into the noise mask ratio prediction network to obtain an estimated noise mask ratio of the reference scene noise data, where the noise mask ratio may represent a ratio of the magnitude spectrum of the reference scene noise data to the magnitude spectrum of the noise-containing speech signal.
According to an exemplary embodiment of the present disclosure, the noise mask ratio estimation unit 803 may concatenate the amplitude spectrum of the noise-containing speech signal to be enhanced with the noise type feature, and input the concatenated features into the noise mask ratio prediction network to obtain the estimated noise mask ratio of the reference scene noise data. According to another exemplary embodiment of the present disclosure, the noise mask ratio estimation unit 803 may input the amplitude spectrum of the noise-containing speech signal to be enhanced into one part of the noise mask ratio prediction network to obtain local features of the amplitude spectrum of the noise-containing speech signal to be enhanced, concatenate the local features with the noise type feature, and input the concatenated features into another part of the noise mask ratio prediction network to obtain the estimated noise mask ratio of the reference scene noise data.
According to an exemplary embodiment of the present disclosure, the noise mask ratio prediction network may be a Convolutional Recurrent Neural Network (CRNN) including a Convolutional Neural Network (CNN) and a Recurrent Neural Network (RNN). For example, as shown in fig. 4, the noise mask ratio estimation unit 803 may directly concatenate the magnitude spectrum of the noise-containing speech signal to be enhanced and the noise type feature, and input the concatenated features into the noise mask ratio prediction network to obtain the estimated noise mask ratio. For another example, as shown in fig. 5, the noise mask ratio estimation unit 803 may input the amplitude spectrum of the noise-containing speech signal to be enhanced into one part of the noise mask ratio prediction network (for example, the CNN) to obtain local features of the amplitude spectrum of the noise-containing speech signal to be enhanced, then concatenate the local features with the noise type feature, and input the concatenated features into another part of the noise mask ratio prediction network (for example, the RNN) to obtain the estimated noise mask ratio.
The speech enhancement unit 804 may obtain an estimated speech enhancement signal based on the estimated noise mask ratio of the reference scene noise data and the noisy speech signal to be enhanced, wherein the estimated speech enhancement signal is an estimated speech enhancement signal obtained after removing the reference scene noise data from the noisy speech signal to be enhanced.
According to an exemplary embodiment of the present disclosure, the speech enhancement unit 804 may complement the estimated noise mask ratio of the reference scene noise data and multiply it by the amplitude spectrum of the noise-containing speech signal to be enhanced to obtain the amplitude spectrum of the estimated speech enhancement signal, and then combine the amplitude spectrum of the estimated speech enhancement signal with the phase spectrum of the noise-containing speech signal to be enhanced and perform an inverse time-frequency transform to obtain the estimated speech enhancement signal.
Fig. 9 is a block diagram of an electronic device 900 according to an exemplary embodiment of the present disclosure.
Referring to fig. 9, an electronic device 900 includes at least one memory 901 and at least one processor 902, the at least one memory 901 having stored therein a set of computer-executable instructions that, when executed by the at least one processor 902, perform a training method or a speech enhancement method of a speech enhancement model according to an exemplary embodiment of the present disclosure.
By way of example, the electronic device 900 may be a PC, a tablet device, a personal digital assistant, a smart phone, or another device capable of executing the above-described instruction set. Here, the electronic device 900 is not necessarily a single electronic device, but may be any apparatus or collection of circuits capable of executing the above-described instructions (or instruction set) individually or in combination. The electronic device 900 may also be part of an integrated control system or system manager, or may be configured as a portable electronic device that interfaces locally or remotely (e.g., via wireless transmission).
In the electronic device 900, the processor 902 may include a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a programmable logic device, a special-purpose processor system, a microcontroller, or a microprocessor. By way of example, and not limitation, the processor may also include an analog processor, a digital processor, a microprocessor, a multi-core processor, a processor array, a network processor, and the like.
The processor 902 may execute instructions or code stored in the memory 901, where the memory 901 may also store data. Instructions and data may also be transmitted and received over a network via a network interface device, which may employ any known transmission protocol.
The memory 901 may be integrated with the processor 902, for example, RAM or flash memory disposed within an integrated circuit microprocessor or the like. In addition, memory 901 may include a stand-alone device, such as an external disk drive, storage array, or other storage device usable by any database system. The memory 901 and the processor 902 may be operatively coupled or may communicate with each other, for example, through an I/O port, network connection, etc., such that the processor 902 is able to read files stored in the memory.
In addition, the electronic device 900 may also include a video display (such as a liquid crystal display) and a user interaction interface (such as a keyboard, mouse, touch input device, etc.). All components of the electronic device 900 may be connected to each other via buses and/or networks.
According to an exemplary embodiment of the present disclosure, a computer-readable storage medium may also be provided, wherein instructions in the computer-readable storage medium, when executed by at least one processor, cause the at least one processor to perform the training method of the speech enhancement model or the speech enhancement method according to the present disclosure. Examples of the computer-readable storage medium here include: read-only memory (ROM), random-access programmable read-only memory (PROM), electrically erasable programmable read-only memory (EEPROM), random-access memory (RAM), dynamic random-access memory (DRAM), static random-access memory (SRAM), flash memory, non-volatile memory, CD-ROM, CD-R, CD+R, CD-RW, CD+RW, DVD-ROM, DVD-R, DVD+R, DVD-RW, DVD+RW, DVD-RAM, BD-ROM, BD-R, BD-R LTH, BD-RE, Blu-ray or optical disk storage, hard disk drives (HDD), solid-state drives (SSD), card memory (such as multimedia cards, Secure Digital (SD) cards, or extreme digital (XD) cards), magnetic tape, floppy disks, magneto-optical data storage devices, hard disks, solid-state disks, and any other device configured to store the computer program and any associated data, data files, and data structures in a non-transitory manner and to provide the computer program and any associated data, data files, and data structures to a processor or computer so that the processor or computer can execute the program. The computer program in the computer-readable storage medium described above can be run in an environment deployed on a computer device, such as a client, a host, a proxy device, or a server. Further, in one example, the computer program and any associated data, data files, and data structures are distributed across networked computer systems such that the computer program and any associated data, data files, and data structures are stored, accessed, and executed in a distributed fashion by one or more processors or computers.
According to an exemplary embodiment of the present disclosure, a computer program product may also be provided, comprising computer instructions executable by at least one processor to perform a training method or a speech enhancement method of a speech enhancement model according to an exemplary embodiment of the present disclosure.
According to the training method and apparatus for a speech enhancement model and the speech enhancement method and apparatus of the present disclosure, specific scene noise can be supplied to the speech enhancement model as an auxiliary vector, so that the specific scene noise can be removed in the corresponding scene. In this way, when the noisy speech contains the specific scene noise, that noise can be removed according to the user's requirements, achieving the speech enhancement effect expected by the user.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This disclosure is intended to cover any variations, uses, or adaptations of the disclosure following the general principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with the true scope and spirit of the disclosure being indicated by the following claims.
It is to be understood that the present disclosure is not limited to the precise arrangements and instrumentalities shown in the drawings, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (34)

1. A method of training a speech enhancement model, the speech enhancement model comprising a noise mask ratio prediction network and a noise type discrimination network, the method comprising:
obtaining a noise-containing voice sample, wherein the noise-containing voice sample is formed by mixing a speaker voice sample and at least one scene noise data;
inputting reference scene noise data in the at least one scene noise data into the noise type discrimination network to obtain noise type characteristics of the reference scene noise data, wherein the reference scene noise data is scene noise data which is expected to be removed from the at least one scene noise data, and the voice enhancement model is used for obtaining an estimated voice enhancement signal obtained after the reference scene noise data is removed from the noisy voice sample;
inputting the magnitude spectrum of the noise-containing voice sample and the noise type characteristic into the noise mask ratio prediction network to obtain an estimated noise mask ratio of the noise data of the reference scene, wherein the noise mask ratio represents the ratio of the magnitude spectrum of the noise data of the reference scene to the magnitude spectrum of the noise-containing voice sample;
calculating a loss function based on the estimated noise mask ratio of the reference scene noise data and the noise type characteristics of the reference scene noise data;
and adjusting parameters of the noise mask ratio prediction network and the noise type discrimination network through the calculated loss function, and training the voice enhancement model.
2. The training method of claim 1 wherein said inputting the magnitude spectrum of the noisy speech samples and the noise-type features into the noise mask ratio prediction network yields an estimated noise mask ratio of the reference scene noise data comprises:
the amplitude spectrum of the noise-containing voice sample and the noise type feature are connected in series;
and inputting the characteristics after the series connection into the noise mask ratio prediction network to obtain the estimated noise mask ratio of the noise data of the reference scene.
3. The training method of claim 1 wherein said inputting the magnitude spectrum of the noisy speech samples and the noise-type features into the noise mask ratio prediction network yields an estimated noise mask ratio of the reference scene noise data comprises:
inputting the amplitude spectrum of the noise-containing voice sample into a part of the noise mask ratio prediction network to obtain local characteristics of the amplitude spectrum of the noise-containing voice sample;
concatenating the local feature with the noise-type feature;
and inputting the characteristics after the series connection into another part of the noise mask ratio prediction network to obtain the estimated noise mask ratio of the noise data of the reference scene.
4. A training method as claimed in claim 2 or 3 wherein the noise mask ratio prediction network is a convolutional recurrent neural network comprising a convolutional neural network and a recurrent neural network.
5. The training method of claim 4, wherein a portion of the noise mask ratio prediction network is a convolutional neural network of the convolutional recurrent neural networks and another portion of the noise mask ratio prediction network is a recurrent neural network of the convolutional recurrent neural networks.
6. The training method of claim 1, wherein each of the at least one scene noise data has a true noise type tag vector;
wherein calculating a loss function based on the estimated noise mask ratio of the reference scene noise data and the noise type characteristic of the reference scene noise data comprises:
multiplying the estimated noise mask ratio of the reference scene noise data with the amplitude spectrum of the noise-containing voice sample to obtain an estimated amplitude spectrum of the voice enhancement signal;
obtaining an estimated noise type label vector by passing the noise type feature through a full connection layer;
and calculating a loss function based on the estimated amplitude spectrum of the voice enhancement signal, the amplitude spectrum of the target voice enhancement signal, the estimated noise type tag vector and the real noise type tag vector of the reference scene noise data, wherein the target voice enhancement signal is a signal after the noise-containing voice sample removes the reference scene noise data.
7. The training method of claim 6 wherein said calculating a loss function based on said estimated magnitude spectrum of the speech enhancement signal, said magnitude spectrum of the target speech enhancement signal, said estimated noise-type tag vector, said true noise-type tag vector of the reference scene noise data comprises:
calculating a mean square error loss function based on the estimated magnitude spectrum of the speech enhancement signal and the magnitude spectrum of the target speech enhancement signal;
calculating a cross entropy loss function based on the estimated noise-type tag vector and a true noise-type tag vector of the reference scene noise data;
and summing the mean square error loss function and the cross entropy loss function to obtain the loss function.
8. The training method of claim 7, wherein the loss function is represented as:
Loss = L_MSE(Mag_est, Mag_tar) + L_CE(ẑ, z), where L_CE(ẑ, z) = -Σ_{i=1}^{M} z_i·log(ẑ_i);
wherein Mag_est represents the amplitude spectrum of the estimated speech enhancement signal, Mag_tar represents the amplitude spectrum of the target speech enhancement signal, L_MSE represents the mean square error loss function, M represents the total number of types of scene noise, i represents the traversal index, ẑ represents the estimated noise type tag vector, z represents the true noise type tag vector of the reference scene noise data, and L_CE represents the cross entropy loss function.
9. A speech enhancement method, wherein the speech enhancement method is performed based on a speech enhancement model comprising a noise mask ratio prediction network and a noise type discrimination network, the speech enhancement method comprising:
acquiring a noise-containing voice signal to be enhanced and reference scene noise data, wherein the noise-containing voice signal to be enhanced comprises a speaker voice signal and at least one scene noise data, and the reference scene noise data is scene noise data which is expected to be removed from the at least one scene noise data;
inputting the reference scene noise data into the noise type discrimination network to obtain the noise type characteristics of the reference scene noise data;
inputting the amplitude spectrum of the noise-containing voice signal to be enhanced and the noise type characteristic into the noise mask ratio prediction network to obtain an estimated noise mask ratio of the reference scene noise data, wherein the noise mask ratio represents the ratio of the amplitude spectrum of the reference scene noise data to the amplitude spectrum of the noise-containing voice signal;
and obtaining an estimated voice enhancement signal based on the estimated noise mask ratio of the reference scene noise data and the noise-containing voice signal to be enhanced, wherein the estimated voice enhancement signal is an estimated voice enhancement signal obtained after the reference scene noise data is removed from the noise-containing voice signal to be enhanced.
10. The method of speech enhancement according to claim 9, wherein said inputting the magnitude spectrum of the noisy speech signal to be enhanced and the noise-type features into the noise mask ratio prediction network yields an estimated noise mask ratio of the reference scene noise data, comprising:
the amplitude spectrum of the noise-containing voice signal to be enhanced and the noise type feature are connected in series;
and inputting the characteristics after the series connection into the noise mask ratio prediction network to obtain the estimated noise mask ratio of the noise data of the reference scene.
11. The method of speech enhancement according to claim 9, wherein said inputting the magnitude spectrum of the noisy speech signal to be enhanced and the noise-type features into the noise mask ratio prediction network yields an estimated noise mask ratio of the reference scene noise data, comprising:
inputting the amplitude spectrum of the noise-containing voice signal to be enhanced into a part of the noise mask ratio prediction network to obtain local characteristics of the amplitude spectrum of the noise-containing voice sample;
concatenating the local feature with the noise-type feature;
and inputting the characteristics after the series connection into another part of the noise mask ratio prediction network to obtain the estimated noise mask ratio of the noise data of the reference scene.
12. The speech enhancement method according to claim 10 or 11, wherein said noise mask ratio prediction network is a convolutional recurrent neural network comprising a convolutional neural network and a recurrent neural network.
13. The speech enhancement method of claim 12, wherein a portion of the noise mask ratio prediction network is a convolutional neural network of the convolutional recurrent neural networks and another portion of the noise mask ratio prediction network is a recurrent neural network of the convolutional recurrent neural networks.
14. The speech enhancement method of claim 9 wherein said deriving an estimated speech enhancement signal based on said estimated noise mask ratio of said reference scene noise data and said noisy speech signal to be enhanced comprises:
multiplying the estimated noise mask ratio of the reference scene noise data with the amplitude spectrum of the noise-containing voice signal to be enhanced after being complemented to obtain the estimated amplitude spectrum of the voice enhancement signal;
combining the amplitude spectrum of the estimated voice enhancement signal with the phase spectrum of the noise-containing voice signal to be enhanced and performing time-frequency inverse transformation to obtain the estimated voice enhancement signal.
15. The method of claim 9, wherein the reference scene noise data is a scene noise segment that is expected to be removed pre-recorded in an environment in which a speaker is located.
16. A speech enhancement method according to claim 9, wherein said speech enhancement model is trained using a training method for a speech enhancement model according to any of claims 1 to 8.
17. A training device for a speech enhancement model, wherein the speech enhancement model comprises a noise mask ratio prediction network and a noise type discrimination network, the training device comprising:
A sample acquisition unit configured to: obtaining a noise-containing voice sample, wherein the noise-containing voice sample is formed by mixing a speaker voice sample and at least one scene noise data;
a noise type estimation unit configured to: inputting reference scene noise data in the at least one scene noise data into the noise type discrimination network to obtain noise type characteristics of the reference scene noise data, wherein the reference scene noise data is scene noise data which is expected to be removed from the at least one scene noise data, and the voice enhancement model is used for obtaining an estimated voice enhancement signal obtained after the reference scene noise data is removed from the noisy voice sample;
a noise mask ratio estimation unit configured to: inputting the magnitude spectrum of the noise-containing voice sample and the noise type characteristic into the noise mask ratio prediction network to obtain an estimated noise mask ratio of the noise data of the reference scene, wherein the noise mask ratio represents the ratio of the magnitude spectrum of the noise data of the reference scene to the magnitude spectrum of the noise-containing voice sample;
a loss function calculation unit configured to: calculating a loss function based on the estimated noise mask ratio of the reference scene noise data and the noise type characteristics of the reference scene noise data;
A model training unit configured to: and adjusting parameters of the noise mask ratio prediction network and the noise type discrimination network through the calculated loss function, and training the voice enhancement model.
18. The training apparatus of claim 17, wherein the noise mask ratio estimation unit is configured to:
the amplitude spectrum of the noise-containing voice sample and the noise type feature are connected in series;
and inputting the characteristics after the series connection into the noise mask ratio prediction network to obtain the estimated noise mask ratio of the noise data of the reference scene.
19. The training apparatus of claim 17, wherein the noise mask ratio estimation unit is configured to:
inputting the amplitude spectrum of the noise-containing voice sample into a part of the noise mask ratio prediction network to obtain local characteristics of the amplitude spectrum of the noise-containing voice sample;
concatenating the local feature with the noise-type feature;
and inputting the characteristics after the series connection into another part of the noise mask ratio prediction network to obtain the estimated noise mask ratio of the noise data of the reference scene.
20. Training apparatus according to claim 18 or 19, wherein the noise mask ratio prediction network is a convolutional recurrent neural network comprising a convolutional neural network and a recurrent neural network.
21. The training apparatus of claim 20 wherein a portion of the noise mask ratio prediction network is a convolutional neural network of the convolutional recurrent neural networks and another portion of the noise mask ratio prediction network is a recurrent neural network of the convolutional recurrent neural networks.
22. The training apparatus of claim 17 wherein each of the at least one scene noise data has a true noise type tag vector;
wherein the loss function calculation unit is configured to:
multiplying the estimated noise mask ratio of the reference scene noise data with the amplitude spectrum of the noise-containing voice sample to obtain an estimated amplitude spectrum of the voice enhancement signal;
obtaining an estimated noise type label vector by passing the noise type feature through a full connection layer;
and calculating a loss function based on the estimated amplitude spectrum of the voice enhancement signal, the amplitude spectrum of the target voice enhancement signal, the estimated noise type tag vector and the real noise type tag vector of the reference scene noise data, wherein the target voice enhancement signal is a signal after the noise-containing voice sample removes the reference scene noise data.
23. The training apparatus of claim 22 wherein the loss function calculation unit is configured to:
calculating a mean square error loss function based on the estimated magnitude spectrum of the speech enhancement signal and the magnitude spectrum of the target speech enhancement signal;
calculating a cross entropy loss function based on the estimated noise-type tag vector and a true noise-type tag vector of the reference scene noise data;
and summing the mean square error loss function and the cross entropy loss function to obtain the loss function.
24. The training apparatus of claim 23 wherein the loss function is represented as:
Loss = L_MSE(Mag_est, Mag_tar) + L_CE(ẑ, z), where L_CE(ẑ, z) = -Σ_{i=1}^{M} z_i·log(ẑ_i);
wherein Mag_est represents the amplitude spectrum of the estimated speech enhancement signal, Mag_tar represents the amplitude spectrum of the target speech enhancement signal, L_MSE represents the mean square error loss function, M represents the total number of types of scene noise, i represents the traversal index, ẑ represents the estimated noise type tag vector, z represents the true noise type tag vector of the reference scene noise data, and L_CE represents the cross entropy loss function.
25. A speech enhancement apparatus that performs operations based on a speech enhancement model that includes a noise mask ratio prediction network and a noise type discrimination network, the speech enhancement apparatus comprising:
A data acquisition unit configured to: acquiring a noise-containing voice signal to be enhanced and reference scene noise data, wherein the noise-containing voice signal to be enhanced comprises a speaker voice signal and at least one scene noise data, and the reference scene noise data is scene noise data which is expected to be removed from the at least one scene noise data;
a noise type estimation unit configured to: inputting the reference scene noise data into the noise type discrimination network to obtain the noise type characteristics of the reference scene noise data;
a noise mask ratio estimation unit configured to: inputting the amplitude spectrum of the noise-containing voice signal to be enhanced and the noise type characteristic into the noise mask ratio prediction network to obtain an estimated noise mask ratio of the reference scene noise data, wherein the noise mask ratio represents the ratio of the amplitude spectrum of the reference scene noise data to the amplitude spectrum of the noise-containing voice signal;
a speech enhancement unit configured to: and obtaining an estimated voice enhancement signal based on the estimated noise mask ratio of the reference scene noise data and the noise-containing voice signal to be enhanced, wherein the estimated voice enhancement signal is an estimated voice enhancement signal obtained after the reference scene noise data is removed from the noise-containing voice signal to be enhanced.
26. The speech enhancement apparatus of claim 25, wherein the noise mask ratio estimation unit is configured to:
the amplitude spectrum of the noise-containing voice signal to be enhanced and the noise type feature are connected in series;
and inputting the characteristics after the series connection into the noise mask ratio prediction network to obtain the estimated noise mask ratio of the noise data of the reference scene.
27. The speech enhancement apparatus of claim 25, wherein the noise mask ratio estimation unit is configured to:
inputting the amplitude spectrum of the noise-containing voice signal to be enhanced into a part of the noise mask ratio prediction network to obtain local characteristics of the amplitude spectrum of the noise-containing voice sample;
concatenating the local feature with the noise-type feature;
and inputting the characteristics after the series connection into another part of the noise mask ratio prediction network to obtain the estimated noise mask ratio of the noise data of the reference scene.
28. The speech enhancement apparatus according to claim 26 or 27, wherein said noise mask ratio prediction network is a convolutional recurrent neural network comprising a convolutional neural network and a recurrent neural network.
29. The speech enhancement apparatus of claim 28, wherein a portion of the noise mask ratio prediction network is a convolutional neural network of the convolutional recurrent neural networks and another portion of the noise mask ratio prediction network is a recurrent neural network of the convolutional recurrent neural networks.
30. The speech enhancement apparatus of claim 25, wherein the speech enhancement unit is configured to:
multiplying the estimated noise mask ratio of the reference scene noise data with the amplitude spectrum of the noise-containing voice signal to be enhanced after being complemented to obtain the estimated amplitude spectrum of the voice enhancement signal;
combining the amplitude spectrum of the estimated voice enhancement signal with the phase spectrum of the noise-containing voice signal to be enhanced and performing time-frequency inverse transformation to obtain the estimated voice enhancement signal.
31. The speech enhancement apparatus of claim 25 wherein the reference scene noise data is a prerecorded scene noise segment that is desired to be removed in the environment of the speaker.
32. The speech enhancement apparatus of claim 25, wherein the speech enhancement model is trained using a training method of a speech enhancement model according to any of claims 1 to 8.
33. An electronic device, comprising:
at least one processor;
at least one memory storing computer-executable instructions,
wherein the computer executable instructions, when executed by the at least one processor, cause the at least one processor to perform the training method of the speech enhancement model of any one of claims 1 to 8 or the speech enhancement method of any one of claims 9 to 16.
34. A computer readable storage medium, characterized in that instructions in the computer readable storage medium, when executed by at least one processor, cause the at least one processor to perform the training method of the speech enhancement model of any of claims 1 to 8 or the speech enhancement method of any of claims 9 to 16.
CN202110529546.6A 2021-05-14 2021-05-14 Training method and device for voice enhancement model and voice enhancement method and device Active CN113284507B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110529546.6A CN113284507B (en) 2021-05-14 2021-05-14 Training method and device for voice enhancement model and voice enhancement method and device

Publications (2)

Publication Number Publication Date
CN113284507A CN113284507A (en) 2021-08-20
CN113284507B true CN113284507B (en) 2024-02-13

Family

ID=77279202

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110529546.6A Active CN113284507B (en) 2021-05-14 2021-05-14 Training method and device for voice enhancement model and voice enhancement method and device

Country Status (1)

Country Link
CN (1) CN113284507B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113593594B (en) * 2021-09-01 2024-03-08 北京达佳互联信息技术有限公司 Training method and equipment for voice enhancement model and voice enhancement method and equipment
CN113921030B (en) * 2021-12-07 2022-06-07 江苏清微智能科技有限公司 Speech enhancement neural network training method and device based on weighted speech loss
CN114999508B (en) * 2022-07-29 2022-11-08 之江实验室 Universal voice enhancement method and device by utilizing multi-source auxiliary information

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7043030B1 (en) * 1999-06-09 2006-05-09 Mitsubishi Denki Kabushiki Kaisha Noise suppression device
CN109065067A (en) * 2018-08-16 2018-12-21 福建星网智慧科技股份有限公司 A kind of conference terminal voice de-noising method based on neural network model
CN111429930A (en) * 2020-03-16 2020-07-17 云知声智能科技股份有限公司 Noise reduction model processing method and system based on adaptive sampling rate
CN111554315A (en) * 2020-05-29 2020-08-18 展讯通信(天津)有限公司 Single-channel voice enhancement method and device, storage medium and terminal
CN112289333A (en) * 2020-12-25 2021-01-29 北京达佳互联信息技术有限公司 Training method and device of voice enhancement model and voice enhancement method and device
CN112652290A (en) * 2020-12-14 2021-04-13 北京达佳互联信息技术有限公司 Method for generating reverberation audio signal and training method of audio processing model
CN112712816A (en) * 2020-12-23 2021-04-27 北京达佳互联信息技术有限公司 Training method and device of voice processing model and voice processing method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Algorithm Analysis and Research of Speech Enhancement in Noisy Environments; Cheng Manman; Informatization Research (《信息化研究》); Vol. 41, No. 1; pp. 29-34 *

Also Published As

Publication number Publication date
CN113284507A (en) 2021-08-20

Similar Documents

Publication Publication Date Title
CN113284507B (en) Training method and device for voice enhancement model and voice enhancement method and device
CN112289333B (en) Training method and device of voice enhancement model and voice enhancement method and device
Das et al. Long range acoustic and deep features perspective on ASVspoof 2019
CN112927707B (en) Training method and device for voice enhancement model and voice enhancement method and device
JP6594839B2 (en) Speaker number estimation device, speaker number estimation method, and program
CN112712816B (en) Training method and device for voice processing model and voice processing method and device
JP6553015B2 (en) Speaker attribute estimation system, learning device, estimation device, speaker attribute estimation method, and program
CN112309426A (en) Voice processing model training method and device and voice processing method and device
KR102018286B1 (en) Method and Apparatus for Removing Speech Components in Sound Source
CN113035221B (en) Training method and device for voice processing model and voice processing method and device
JP5395399B2 (en) Mobile terminal, beat position estimating method and beat position estimating program
CN113470685B (en) Training method and device for voice enhancement model and voice enhancement method and device
CN112652290B (en) Method for generating reverberation audio signal and training method of audio processing model
Felipe et al. Acoustic scene classification using spectrograms
CN110415722B (en) Speech signal processing method, storage medium, computer program, and electronic device
CN113555031B (en) Training method and device of voice enhancement model, and voice enhancement method and device
CN114242110A (en) Model training method, audio processing method, device, equipment, medium and product
CN115641868A (en) Audio separation method and device, electronic equipment and computer readable storage medium
Karantaidis et al. Efficient Capon-based approach exploiting temporal windowing for electric network frequency estimation
JP6912780B2 (en) Speech enhancement device, speech enhancement learning device, speech enhancement method, program
JP2019090930A (en) Sound source enhancement device, sound source enhancement learning device, sound source enhancement method and program
JP6257537B2 (en) Saliency estimation method, saliency estimation device, and program
CN112820267B (en) Waveform generation method, training method of related model, related equipment and device
Gul et al. Single channel speech enhancement by colored spectrograms
CN112309427B (en) Voice rollback method and device thereof

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant