CN113555031A - Training method and device of voice enhancement model and voice enhancement method and device - Google Patents


Info

Publication number
CN113555031A
Authority
CN
China
Prior art keywords: signal, speech, voice, enhancement, noisy
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110869479.2A
Other languages
Chinese (zh)
Other versions
CN113555031B (en)
Inventor
陈联武
张晨
张旭
郑羲光
任新蕾
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Beijing Dajia Internet Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Dajia Internet Information Technology Co Ltd filed Critical Beijing Dajia Internet Information Technology Co Ltd
Priority to CN202110869479.2A
Publication of CN113555031A
Application granted
Publication of CN113555031B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L21/0232 Processing in the frequency domain
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0264 Noise filtering characterised by the type of parameter measurement, e.g. correlation techniques, zero crossing techniques or predictive techniques
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/18 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/45 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of analysis window
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L2021/02082 Noise filtering the noise being echo, reverberation of the speech
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Abstract

The present disclosure relates to a training method and apparatus for a speech enhancement model, and to a speech enhancement method and apparatus. The training method includes: acquiring a training sample set; respectively inputting at least two spectra of a noisy speech signal into the corresponding feature extraction networks among at least two feature extraction networks to obtain at least two features of the noisy speech signal, where the at least two spectra are obtained based on at least two preset groups of different time-frequency conversion parameters; fusing the at least two features to obtain a fused feature; inputting the fused feature into a speech enhancement network to obtain an estimated enhanced spectrum of the noisy speech signal; determining a target loss function of the speech enhancement model based on an estimated time-domain signal corresponding to the estimated enhanced spectrum and the corresponding clean speech signal; and adjusting parameters of the at least two feature extraction networks and the speech enhancement network according to the target loss function to train the speech enhancement model.

Description

Training method and device of voice enhancement model and voice enhancement method and device
Technical Field
The present disclosure relates to the field of audio and video, and in particular, to a method and an apparatus for training a speech enhancement model, and a method and an apparatus for speech enhancement.
Background
Neural-network-based speech enhancement techniques currently fall into two main schemes: frequency-domain schemes and time-domain schemes. In a frequency-domain scheme, a Short-Time Fourier Transform (STFT) is used to convert the noisy speech signal into the frequency domain and extract spectral features, and a neural network then estimates the clean speech signal from the extracted spectral features. However, such a scheme applies the STFT to the noisy speech signal with a single fixed set of time-frequency conversion parameters to extract the spectral features. Because speech and noise signals vary greatly in real scenarios, different time-frequency conversion parameters should ideally be used for different signal characteristics (e.g., a short analysis window for short transient knocking noise, and a long analysis window for stationary signals with a low fundamental frequency). In practice, a frequency-domain scheme selects one fixed set of time-frequency conversion parameters according to the overall effect to extract the spectral features, and therefore cannot well cover the characteristics of all signals in the noisy speech signal.
Disclosure of Invention
The present disclosure provides a training method and apparatus for a speech enhancement model, and a speech enhancement method and apparatus, so as to at least solve the problem in the related art that selecting a single fixed set of time-frequency conversion parameters according to the overall effect to extract spectral features cannot well cover the characteristics of all signals in a noisy speech signal.
According to a first aspect of the embodiments of the present disclosure, there is provided a training method for a speech enhancement model, where the speech enhancement model includes at least two feature extraction networks and a speech enhancement network. The training method includes: acquiring a training sample set, where each training sample in the training sample set includes a noisy speech signal and a corresponding clean speech signal, and the noisy speech signal is obtained by adding noise and reverberation to the corresponding clean speech signal; respectively inputting at least two spectra of the noisy speech signal into the corresponding feature extraction networks among the at least two feature extraction networks to obtain at least two features of the noisy speech signal, where the at least two spectra are obtained based on at least two preset groups of different time-frequency conversion parameters; fusing the at least two features to obtain a fused feature; inputting the fused feature into the speech enhancement network to obtain an estimated enhanced spectrum of the noisy speech signal; determining a target loss function of the speech enhancement model based on an estimated time-domain signal corresponding to the estimated enhanced spectrum and the corresponding clean speech signal; and adjusting parameters of the at least two feature extraction networks and the speech enhancement network according to the target loss function to train the speech enhancement model.
Optionally, the output of the speech enhancement network is a mask of the noisy speech signal, where the mask represents the spectral proportion of the clean speech signal in the noisy speech signal, and inputting the fused feature into the speech enhancement network to obtain the estimated enhanced spectrum of the noisy speech signal includes: multiplying the spectrum of the noisy speech signal by the mask of the noisy speech signal to obtain the estimated enhanced spectrum of the noisy speech signal, where the spectrum of the noisy speech signal is obtained based on a preset group of time-frequency conversion parameters.
Optionally, the output of the speech enhancement network is an estimated enhanced spectrum of the noisy speech signal.
Optionally, before the at least two spectra of the noisy speech signal are respectively input to the corresponding feature extraction networks among the at least two feature extraction networks to obtain the at least two features of the noisy speech signal, the method further includes: acquiring the at least two preset groups of different time-frequency conversion parameters; and performing a short-time Fourier transform on the noisy speech signal based on each of the at least two groups of different time-frequency conversion parameters to obtain the at least two spectra of the noisy speech signal.
Optionally, fusing the at least two features to obtain the fused feature includes: performing weighted concatenation or weighted addition on the at least two features, where the weight corresponding to each of the at least two features is preset.
Optionally, a group of time-frequency conversion parameters includes at least one of a window length, a window shift, a window function, and a fast Fourier transform length.
According to a second aspect of the embodiments of the present disclosure, there is provided a speech enhancement method, including: acquiring a noisy speech signal to be processed; respectively inputting at least two spectra of the noisy speech signal to be processed into the corresponding feature extraction networks in a speech enhancement model to obtain at least two features of the noisy speech signal to be processed, where the at least two spectra are obtained based on at least two preset groups of different time-frequency conversion parameters; fusing the at least two features to obtain a fused feature; inputting the fused feature into a speech enhancement network in the speech enhancement model to obtain an enhanced spectrum of the noisy speech signal to be processed; and acquiring a time-domain signal corresponding to the enhanced spectrum and using the time-domain signal as an enhanced speech signal of the noisy speech signal to be processed.
Optionally, the output of the speech enhancement network is a mask of the noisy speech signal to be processed, where the mask represents the spectral proportion of the clean speech signal in the noisy speech signal to be processed, and inputting the fused feature into the speech enhancement network in the speech enhancement model to obtain the enhanced spectrum of the noisy speech signal to be processed includes: multiplying the spectrum of the noisy speech signal to be processed by the corresponding mask to obtain the enhanced spectrum of the noisy speech signal to be processed, where the spectrum of the noisy speech signal to be processed is obtained based on a preset group of time-frequency conversion parameters.
Optionally, the output of the speech enhancement network is an enhanced spectrum of the noisy speech signal to be processed.
Optionally, before the at least two spectra of the noisy speech signal to be processed are respectively input to the corresponding feature extraction networks in the speech enhancement model to obtain the at least two features of the noisy speech signal to be processed, the method further includes: acquiring the at least two preset groups of different time-frequency conversion parameters; and performing a short-time Fourier transform on the noisy speech signal to be processed based on each of the at least two groups of different time-frequency conversion parameters to obtain the at least two spectra of the noisy speech signal to be processed.
Optionally, fusing the at least two features to obtain the fused feature includes: performing weighted concatenation or weighted addition on the at least two features, where the weight corresponding to each of the at least two features is preset.
Optionally, a group of time-frequency conversion parameters includes at least one of a window length, a window shift, a window function, and a fast Fourier transform length.
Optionally, the speech enhancement model is trained based on the above-mentioned training method of the speech enhancement model.
According to a third aspect of the embodiments of the present disclosure, there is provided a training apparatus for a speech enhancement model, the speech enhancement model including at least two feature extraction networks and a speech enhancement network, the training apparatus including:
the training sample set acquisition unit is configured to acquire a training sample set, wherein each training sample in the training sample set comprises a noise-containing voice signal and a corresponding clean voice signal, and the noise-containing voice signal is a voice signal obtained by adding noise and reverberation to the corresponding clean voice signal; the characteristic extraction unit is configured to input at least two frequency spectrums of the noisy speech signal into corresponding characteristic extraction networks of the at least two characteristic extraction networks respectively to obtain at least two characteristics of the noisy speech signal, wherein the at least two frequency spectrums are obtained based on at least two groups of preset different time-frequency conversion parameters; the fusion unit is configured to perform fusion processing on at least two features to obtain fused features; the estimated enhanced spectrum acquisition unit is configured to input the fused features into a voice enhanced network to obtain an estimated enhanced spectrum of the noisy voice signal; a target loss function determination unit configured to determine a target loss function of the speech enhancement model based on the pre-estimated time domain signal corresponding to the pre-estimated enhancement spectrum and the corresponding clean speech signal; and the training unit is configured to adjust parameters of the at least two feature extraction networks and the voice enhancement network according to the target loss function and train the voice enhancement model.
Optionally, the output of the speech enhancement network is a mask of the noisy speech signal, where the mask represents the spectral proportion of the clean speech signal in the noisy speech signal, and the estimated enhanced spectrum obtaining unit is further configured to multiply the spectrum of the noisy speech signal by the mask of the noisy speech signal to obtain the estimated enhanced spectrum of the noisy speech signal, where the spectrum of the noisy speech signal is obtained based on a preset group of time-frequency conversion parameters.
Optionally, the output of the speech enhancement network is an estimated enhanced spectrum of the noisy speech signal.
Optionally, the feature extraction unit is further configured to obtain at least two preset groups of different time-frequency conversion parameters before the at least two frequency spectrums of the noisy speech signal are respectively input to corresponding feature extraction networks of the at least two feature extraction networks to obtain the at least two features of the noisy speech signal; and respectively carrying out short-time Fourier transform on the voice signal containing the noise based on at least two groups of different time-frequency conversion parameters to obtain at least two frequency spectrums of the voice signal containing the noise.
Optionally, the fusion unit is further configured to perform weighted concatenation or weighted addition on the at least two features, wherein a weight corresponding to each of the at least two features is preset.
Optionally, a group of time-frequency conversion parameters includes: at least one of a window length, a window shift, a window function, and a fast Fourier transform length.
According to a fourth aspect of the embodiments of the present disclosure, there is provided a speech enhancement apparatus, including: a signal acquisition unit configured to acquire a noisy speech signal to be processed; the characteristic extraction unit is configured to input at least two frequency spectrums of the noise-containing voice signal to be processed into corresponding characteristic extraction networks in the voice enhancement model respectively to obtain at least two characteristics of the noise-containing voice signal to be processed, wherein the at least two frequency spectrums are obtained based on at least two groups of preset different time-frequency conversion parameters; the fusion unit is configured to perform fusion processing on at least two features to obtain fused features; the enhanced spectrum acquisition unit is configured to input the fused features into a voice enhanced network in a voice enhanced model to obtain an enhanced spectrum of the noise-containing voice signal to be processed; and the enhanced voice signal acquisition unit is configured to acquire a time domain signal corresponding to the enhanced spectrum and take the time domain signal as an enhanced voice signal of the noisy voice signal to be processed.
Optionally, the output of the speech enhancement network is a mask of the noisy speech signal to be processed, where the mask represents a spectrum ratio of a clean speech signal in the noisy speech signal to be processed, and the enhanced spectrum obtaining unit is further configured to multiply the spectrum of the noisy speech signal to be processed by the corresponding mask to obtain an enhanced spectrum of the noisy speech signal to be processed, where the spectrum of the noisy speech signal to be processed is obtained based on a preset set of time-frequency conversion parameters.
Optionally, the output of the speech enhancement network is an enhanced spectrum of the noisy speech signal to be processed.
Optionally, the feature extraction unit is further configured to obtain at least two preset groups of different time-frequency conversion parameters before inputting at least two frequency spectrums of the noisy speech signal to be processed into the corresponding feature extraction networks in the speech enhancement model respectively to obtain at least two features of the noisy speech signal to be processed; and based on at least two groups of different time-frequency conversion parameters, carrying out short-time Fourier transform on the voice signal containing noise to be processed to obtain at least two frequency spectrums of the voice signal containing noise to be processed.
Optionally, the fusion unit is further configured to perform weighted concatenation or weighted addition on the at least two features, wherein a weight corresponding to each of the at least two features is preset.
Optionally, a group of time-frequency conversion parameters includes at least one of a window length, a window shift, a window function, and a fast Fourier transform length.
Optionally, the speech enhancement model is trained based on the above-mentioned training method of the speech enhancement model.
According to a fifth aspect of embodiments of the present disclosure, there is provided an electronic apparatus including: a processor; a memory for storing processor-executable instructions; wherein the processor is configured to execute the instructions to implement the training method of the speech enhancement model and the speech enhancement method according to the present disclosure.
According to a sixth aspect of the embodiments of the present disclosure, there is provided a computer-readable storage medium storing instructions that, when executed by at least one processor, cause the at least one processor to perform the training method of a speech enhancement model and the speech enhancement method according to the present disclosure.
According to a seventh aspect of the embodiments of the present disclosure, there is provided a computer program product including computer instructions that, when executed by a processor, implement the training method of a speech enhancement model and the speech enhancement method according to the present disclosure.
The technical solutions provided by the embodiments of the present disclosure bring at least the following beneficial effects:
According to the training method and apparatus for a speech enhancement model and the speech enhancement method and apparatus of the present disclosure, multiple groups of time-frequency conversion parameters are preset during training, and features are extracted from the spectra obtained with these time-frequency conversion parameters, so that multi-scale features of the noisy speech signal can be extracted; the multi-scale features are then fused, and the fused feature is used to train the speech enhancement model. Because the extracted multi-scale features contain different types of information of the noisy speech signal, that is, the characteristics of all signals in the noisy speech signal, a better training effect can be obtained from them, and the overall effect of the speech enhancement model is improved. The present disclosure therefore solves the problem in the related art that selecting a single fixed set of time-frequency conversion parameters according to the overall effect to extract spectral features cannot well cover the characteristics of all signals in a noisy speech signal.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure and are not to be construed as limiting the disclosure.
FIG. 1 is a schematic diagram illustrating an implementation scenario of a training method of a speech enhancement model according to an exemplary embodiment of the present disclosure;
FIG. 2 is a flow diagram illustrating a method of training a speech enhancement model in accordance with an exemplary embodiment;
FIG. 3 is a flow diagram illustrating a method of speech enhancement according to an exemplary embodiment;
FIG. 4 is a schematic diagram illustrating a speech enhancement system according to an exemplary embodiment;
FIG. 5 is a block diagram illustrating a speech enhancement model training apparatus in accordance with an exemplary embodiment;
FIG. 6 is a block diagram illustrating a speech enhancement apparatus according to an exemplary embodiment;
FIG. 7 is a block diagram of an electronic device 700 according to an embodiment of the present disclosure.
Detailed Description
In order to make the technical solutions of the present disclosure better understood by those of ordinary skill in the art, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are capable of operation in sequences other than those illustrated or otherwise described herein. The embodiments described in the following examples do not represent all embodiments consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
It should be noted that, in the present disclosure, the expression "at least one of the items" covers three parallel cases: "any one of the items", "any combination of several of the items", and "all of the items". For example, "including at least one of A and B" covers the following three parallel cases: (1) including A; (2) including B; (3) including A and B. As another example, "performing at least one of step one and step two" covers the following three parallel cases: (1) performing step one; (2) performing step two; (3) performing step one and step two.
The present disclosure provides a training method for a speech enhancement model and a speech enhancement method that can achieve a better training effect and improve the overall effect of the speech enhancement model. FIG. 1 is a schematic diagram illustrating an implementation scenario of the training method of a speech enhancement model according to an exemplary embodiment of the present disclosure. As shown in FIG. 1, the implementation scenario includes a server 100, a user terminal 110, and a user terminal 120. The number of user terminals is not limited to two, and the user terminals include, but are not limited to, mobile phones, personal computers, and the like; each user terminal may be equipped with a microphone for acquiring sound. The server may be a single server, a server cluster formed by several servers, a cloud computing platform, or a virtualization center.
After the server 100 receives a request for training a speech enhancement model (including at least two feature extraction networks and a speech enhancement network) sent by the user terminal 110 or 120, it may collect clean speech signals and noise signals received historically, mix the clean speech signals and noise signals in a preset manner, and add reverberation to obtain noisy speech signals. A noisy speech signal and the corresponding clean speech signal form one training sample for training the speech enhancement model, and a plurality of training samples obtained in this way are combined into a training sample set. After obtaining the training sample set, the server 100 respectively inputs at least two spectra of the noisy speech signal into the corresponding feature extraction networks among the at least two feature extraction networks to obtain at least two features of the noisy speech signal, where the at least two spectra are obtained based on at least two preset groups of different time-frequency conversion parameters. The server then fuses the obtained at least two features, inputs the fused feature into the speech enhancement network to obtain an estimated enhanced spectrum of the noisy speech signal, determines a target loss function of the speech enhancement model based on the estimated time-domain signal corresponding to the estimated enhanced spectrum and the corresponding clean speech signal, and adjusts parameters of the at least two feature extraction networks and the speech enhancement network according to the target loss function to train the speech enhancement model.
After the speech enhancement model is trained, the user terminals 110 and 120 receive a noisy speech signal (such as the voice of a speaker in a conference) through their microphones and send it to the server 100. After receiving the noisy speech signal, the server 100 respectively inputs at least two spectra of the noisy speech signal into the corresponding feature extraction networks in the speech enhancement model to obtain at least two features of the noisy speech signal to be processed, where the at least two spectra are obtained based on at least two preset groups of different time-frequency conversion parameters. The server then fuses the at least two features, inputs the fused feature into the speech enhancement network in the speech enhancement model to obtain an enhanced spectrum of the noisy speech signal to be processed, and finally obtains the time-domain signal corresponding to the enhanced spectrum, i.e., the enhanced speech signal of the noisy speech signal received by the user terminals 110 and 120, that is, the speaker's voice in the conference with noise and reverberation removed.
Hereinafter, a training method and apparatus of a speech enhancement model, a speech enhancement method and apparatus according to an exemplary embodiment of the present disclosure will be described in detail with reference to fig. 2 to 6.
FIG. 2 is a flowchart illustrating a method for training a speech enhancement model according to an exemplary embodiment, where the speech enhancement model includes at least two feature extraction networks and a speech enhancement network, and the method for training the speech enhancement model includes the following steps:
In step S201, a training sample set is obtained, where each training sample in the training sample set includes a noisy speech signal and a corresponding clean speech signal, and the noisy speech signal is obtained by adding noise and reverberation to the corresponding clean speech signal. The noisy speech signals and corresponding clean speech signals in the training sample set may be single-channel signals, multi-channel signals, or the framed signals required by a time-domain noise reduction system, which is not limited by the present disclosure.
Specifically, when the noisy speech signals and corresponding clean speech signals in the training sample set are single-channel signals, the noisy speech signals may be generated by data augmentation. That is, for a clean speech signal and a noise signal, the frequency response of hardware devices is simulated with various EQ filters, various room impulse responses are then used to simulate environmental reverberation, and finally the simulated speech and noise signals are mixed at different signal-to-noise ratios to generate the noisy speech signal, i.e., the noisy speech signal used during training together with its corresponding clean speech signal.
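By way of illustration only, a minimal sketch of such a data-augmentation step is given below; it assumes NumPy and SciPy are available, and the EQ filter, room impulse response, and SNR values are placeholders rather than values taken from the disclosure.

```python
import numpy as np
from scipy.signal import fftconvolve

def make_noisy_sample(clean, noise, rir, eq_fir, snr_db):
    """Build one (noisy, clean) training pair from a clean utterance.

    clean, noise : 1-D float arrays (time-domain signals)
    rir          : room impulse response simulating environmental reverberation
    eq_fir       : FIR filter approximating a device frequency response (EQ)
    snr_db       : target signal-to-noise ratio in dB
    """
    # Simulate the hardware frequency response with an EQ (FIR) filter.
    clean_eq = fftconvolve(clean, eq_fir, mode="full")[: len(clean)]
    # Simulate environmental reverberation with a room impulse response.
    reverberant = fftconvolve(clean_eq, rir, mode="full")[: len(clean)]
    # Scale the noise so that the mixture reaches the requested SNR.
    noise = noise[: len(reverberant)]
    speech_pow = np.mean(reverberant ** 2) + 1e-12
    noise_pow = np.mean(noise ** 2) + 1e-12
    gain = np.sqrt(speech_pow / (noise_pow * 10.0 ** (snr_db / 10.0)))
    noisy = reverberant + gain * noise
    return noisy, clean  # noisy input and clean target form one training sample
```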
Returning to FIG. 2, in step S202, at least two spectra of the noisy speech signal are respectively input to the corresponding feature extraction networks among the at least two feature extraction networks to obtain at least two features of the noisy speech signal, where the at least two spectra are obtained based on at least two preset groups of different time-frequency conversion parameters. Specifically, in this step, time-frequency conversion may be performed through multi-scale time-frequency analysis (STFT_1, STFT_2, …, STFT_M), and each resulting spectrum is then input to the corresponding feature extraction network (FeaNet_1, FeaNet_2, …, FeaNet_M), where M denotes the number of scales (i.e., of groups of time-frequency conversion parameters). When M = 1, this degenerates into a speech enhancement scheme with a fixed time-frequency resolution, so in the present disclosure M must satisfy M ≥ 2. A corresponding feature extraction network (FeaNet_1, FeaNet_2, …, FeaNet_M) is provided for the output of each STFT analysis. Since different STFT_m correspond to different FFT lengths, the input dimension of each FeaNet_m also differs. FeaNet_m may use different network structures; a typical FeaNet_m may consist of two Conv2d convolutional layers that extract the structural information of the spectrum, followed by a fully connected layer that maps the features to the required dimension.
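For illustration, a minimal sketch of one such per-scale feature extraction network is given below, assuming a PyTorch implementation; the channel counts, kernel sizes, and output feature dimension are illustrative assumptions and are not specified in the disclosure.

```python
import torch
import torch.nn as nn

class FeaNet(nn.Module):
    """Feature extraction network for one STFT scale: two Conv2d layers plus a fully connected layer."""

    def __init__(self, num_freq_bins: int, feat_dim: int = 256):
        super().__init__()
        # Two Conv2d layers extract local time-frequency structure from the magnitude spectrum.
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 16, kernel_size=3, padding=1), nn.ReLU(),
        )
        # A fully connected layer maps the flattened frequency axis to the required feature dimension.
        self.fc = nn.Linear(16 * num_freq_bins, feat_dim)

    def forward(self, spec_mag: torch.Tensor) -> torch.Tensor:
        # spec_mag: (batch, frames, num_freq_bins) magnitude spectrum of one STFT scale
        x = spec_mag.unsqueeze(1)             # (batch, 1, frames, freq)
        x = self.conv(x)                      # (batch, 16, frames, freq)
        x = x.permute(0, 2, 1, 3).flatten(2)  # (batch, frames, 16 * freq)
        return self.fc(x)                     # (batch, frames, feat_dim)
```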
According to an exemplary embodiment of the present disclosure, before the at least two spectra of the noisy speech signal are respectively input to the corresponding feature extraction networks among the at least two feature extraction networks to obtain the at least two features of the noisy speech signal, the method further includes: acquiring the at least two preset groups of different time-frequency conversion parameters; and performing a short-time Fourier transform on the noisy speech signal based on each of the at least two groups of different time-frequency conversion parameters to obtain the at least two spectra of the noisy speech signal. With this embodiment, short-time Fourier transforms can be performed with different time-frequency conversion parameters, so that multiple spectra of the noisy speech signal are obtained to cover all signals in the noisy speech signal.
According to an exemplary embodiment of the present disclosure, a group of time-frequency conversion parameters includes: at least one of a window length, a window shift, a window function, and a fast Fourier transform length.
Specifically, for STFT_m (m = 1, 2, …, M), the corresponding time-frequency conversion parameters may be selected according to the actual scene. For example, for a speech enhancement system with 16 kHz input, a typical configuration is shown in Table 1 below. This configuration corresponds to the time-frequency conversion parameter settings of the multi-scale analysis when M = 3; the window shift is uniformly set to 160 samples to facilitate alignment of the multi-scale features.
Table 1. Time-frequency conversion parameter settings for the 16 kHz speech enhancement system

         Window length   Window shift   Window function   FFT length
STFT_1   320             160            Hamming           320
STFT_2   512             160            Hamming           512
STFT_3   768             160            Kaiser            1024
STFT_0   512             160            Hamming           512
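As an illustrative sketch only, the multi-scale spectra corresponding to Table 1 could be computed as follows, assuming PyTorch's torch.stft; the configuration values mirror Table 1, while the code structure itself is an assumption.

```python
import torch

# Time-frequency conversion parameters taken from Table 1 (window length, window shift, window, FFT length).
STFT_CONFIGS = [
    {"win_length": 320, "hop_length": 160, "window": torch.hamming_window(320), "n_fft": 320},
    {"win_length": 512, "hop_length": 160, "window": torch.hamming_window(512), "n_fft": 512},
    {"win_length": 768, "hop_length": 160, "window": torch.kaiser_window(768),  "n_fft": 1024},
]

def multi_scale_spectra(noisy: torch.Tensor) -> list:
    """Return one magnitude spectrum per preset group of time-frequency conversion parameters.

    noisy: (batch, samples) time-domain noisy speech sampled at 16 kHz.
    """
    spectra = []
    for cfg in STFT_CONFIGS:
        spec = torch.stft(
            noisy,
            n_fft=cfg["n_fft"],
            hop_length=cfg["hop_length"],
            win_length=cfg["win_length"],
            window=cfg["window"],
            return_complex=True,
        )  # (batch, n_fft // 2 + 1, frames); frame counts align because the window shift is 160 for all scales
        spectra.append(spec.abs().transpose(1, 2))  # (batch, frames, freq_bins)
    return spectra
```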
Returning to FIG. 2, in step S203, the at least two features are fused to obtain a fused feature. For example, during fusion, different fusion weights {α_1, α_2, …, α_M} may be set for the features of different scales (i.e., the at least two features) according to the requirements of the actual scene. The fusion may adopt a concatenation-based method or an addition-based method, but is not limited thereto; any fusion method applicable to the present disclosure may be used. It should be noted that this step may be implemented by a feature fusion module whose input is the multi-scale features {F_1, F_2, …, F_M} (i.e., the at least two features) and whose output is the fused feature F_all.
According to an exemplary embodiment of the present disclosure, fusing the at least two features to obtain the fused feature includes: performing weighted concatenation or weighted addition on the at least two features, where the weight corresponding to each of the at least two features is preset. This embodiment enables convenient and fast fusion.
For example, when the dimensions of the multi-scale features (i.e., the at least two features) are not consistent, a concatenation-based fusion method may be employed:

F_all = concat(α_1 F_1, α_2 F_2, …, α_M F_M)

where α_m (m = 1, 2, …, M) takes a value in the interval [0, 1] and corresponds to one of the at least two weights, M is a positive integer, and a typical value is α_m = 1/M.
For another example, when the dimensions of the multi-scale features (i.e., the at least two features) are consistent, an addition-based fusion method may be employed:

F_all = α_1 F_1 + α_2 F_2 + … + α_M F_M

where α_m (m = 1, 2, …, M) takes a value in the interval [0, 1] and corresponds to one of the at least two weights, M is a positive integer, and a typical value is α_m = 1/M.
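A minimal sketch of the two fusion options (weighted concatenation and weighted addition) is given below, assuming PyTorch tensors of shape (batch, frames, feature_dim); the equal default weights are an illustrative assumption.

```python
from typing import List, Optional
import torch

def fuse_features(features: List[torch.Tensor],
                  weights: Optional[List[float]] = None,
                  mode: str = "concat") -> torch.Tensor:
    """Fuse multi-scale features {F_1, ..., F_M} into F_all by weighted concatenation or weighted addition."""
    if weights is None:
        weights = [1.0 / len(features)] * len(features)  # equal fusion weights as an illustrative default
    weighted = [w * f for w, f in zip(weights, features)]
    if mode == "concat":
        # Weighted concatenation: usable when the feature dimensions differ across scales.
        return torch.cat(weighted, dim=-1)
    # Weighted addition: requires all scales to share the same feature dimension.
    return torch.stack(weighted, dim=0).sum(dim=0)
```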
In step S204, the fused feature is input to the speech enhancement network to obtain an estimated enhanced spectrum of the noisy speech signal. For example, the input of the speech enhancement network is the multi-scale fused feature F_all, and the output may be the estimated enhanced spectrum of the clean speech signal or a mask of the clean speech signal. When the output is the mask of the clean speech signal, a multiplication module is additionally provided, and the mask is multiplied by the spectrum of the noisy speech signal to obtain the estimated enhanced spectrum of the clean speech signal. The present disclosure does not limit the structure of the speech enhancement network; to keep the system complexity low in practical scenarios, a typical speech enhancement network structure may be a two-layer RNN followed by one fully connected layer.
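For illustration, a minimal sketch of such a speech enhancement network (two recurrent layers followed by one fully connected layer emitting a mask) is given below, assuming PyTorch; the GRU cell type, hidden size, and sigmoid mask activation are assumptions.

```python
import torch
import torch.nn as nn

class EnhNet(nn.Module):
    """Speech enhancement network: two recurrent layers followed by one fully connected layer."""

    def __init__(self, feat_dim: int, num_freq_bins: int, hidden: int = 256):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden, num_layers=2, batch_first=True)
        self.fc = nn.Linear(hidden, num_freq_bins)

    def forward(self, fused: torch.Tensor) -> torch.Tensor:
        # fused: (batch, frames, feat_dim) multi-scale fused feature F_all
        h, _ = self.rnn(fused)
        # A sigmoid keeps the mask in [0, 1]; it is later multiplied with the noisy STFT_0 spectrum.
        return torch.sigmoid(self.fc(h))  # (batch, frames, num_freq_bins)
```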
According to an exemplary embodiment of the present disclosure, the output of the speech enhancement network is a mask of the noisy speech signal, where the mask represents the spectral proportion of the clean speech signal in the noisy speech signal, and inputting the fused feature into the speech enhancement network to obtain the estimated enhanced spectrum of the noisy speech signal includes: multiplying the spectrum of the noisy speech signal by the mask of the noisy speech signal to obtain the estimated enhanced spectrum of the noisy speech signal, where the spectrum of the noisy speech signal is obtained based on a preset group of time-frequency conversion parameters. It should be noted that this preset group of time-frequency conversion parameters may be one of the at least two preset groups of different time-frequency conversion parameters, or may be a separately preset group. With this embodiment, the mask-multiplication part is separated from the speech enhancement network, which reduces the complexity of the speech enhancement network.
According to an exemplary embodiment of the present disclosure, the output of the speech enhancement network is the estimated enhanced spectrum of the noisy speech signal. With this embodiment, the estimated enhanced spectrum can be obtained conveniently and quickly.
In step S205, a target loss function of the speech enhancement model is determined based on the estimated time-domain signal corresponding to the estimated enhanced spectrum and the corresponding clean speech signal. The target loss function is not limited in the present disclosure; a commonly used time-domain or frequency-domain loss function may be adopted, for example, the spectral mean square error (MSE), the mean absolute error (MAE) of the log energy spectrum, the time-domain MSE, and the like.
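By way of illustration, minimal sketches of the loss functions mentioned above (spectral MSE, log energy spectrum MAE, and time-domain MSE) are given below, assuming PyTorch; the epsilon constant is an assumption added for numerical stability.

```python
import torch

def spectral_mse(est_spec: torch.Tensor, ref_spec: torch.Tensor) -> torch.Tensor:
    """Mean square error between estimated and reference magnitude spectra."""
    return torch.mean((est_spec - ref_spec) ** 2)

def log_spectral_mae(est_spec: torch.Tensor, ref_spec: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Mean absolute error between log energy spectra."""
    return torch.mean(torch.abs(torch.log(est_spec ** 2 + eps) - torch.log(ref_spec ** 2 + eps)))

def time_domain_mse(est_wave: torch.Tensor, ref_wave: torch.Tensor) -> torch.Tensor:
    """Mean square error between the estimated time-domain signal and the clean speech signal."""
    return torch.mean((est_wave - ref_wave) ** 2)
```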
In step S206, parameters of the at least two feature extraction networks and the speech enhancement network are adjusted according to the target loss function to train the speech enhancement model. During training, a noisy speech signal is input, and an enhanced speech signal (i.e., the estimated time-domain signal) is finally obtained through the multi-scale feature extraction, the feature fusion, and the speech enhancement network. The value of the target loss function is computed from the enhanced speech signal and the corresponding clean speech signal, and the parameters of the at least two feature extraction networks and the speech enhancement network are updated with the goal of minimizing the target loss function until the speech enhancement model converges.
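A minimal sketch of one training iteration covering steps S202 to S206 is given below; it reuses the illustrative multi_scale_spectra, fuse_features, and time_domain_mse sketches above, and the STFT_0 configuration and optimizer choice are assumptions rather than details from the disclosure.

```python
import torch

def train_step(batch, fea_nets, enh_net, stft0_cfg, optimizer):
    """One optimization step: multi-scale features -> fusion -> mask -> estimated waveform -> loss.

    fea_nets and enh_net follow the FeaNet/EnhNet sketches above; the optimizer must hold the
    parameters of all feature extraction networks and of the speech enhancement network.
    """
    noisy, clean = batch  # (batch, samples) each
    # Step S202: one feature per preset group of time-frequency conversion parameters.
    spectra = multi_scale_spectra(noisy)
    features = [net(spec) for net, spec in zip(fea_nets, spectra)]
    # Step S203: fuse the multi-scale features.
    fused = fuse_features(features, mode="concat")
    # Step S204: estimate a mask and apply it to the STFT_0 spectrum of the noisy signal.
    noisy_spec0 = torch.stft(noisy, n_fft=stft0_cfg["n_fft"], hop_length=stft0_cfg["hop_length"],
                             win_length=stft0_cfg["win_length"], window=stft0_cfg["window"],
                             return_complex=True)
    mask = enh_net(fused).transpose(1, 2)  # (batch, freq_bins, frames)
    est_spec = mask * noisy_spec0
    # Step S205: loss between the estimated time-domain signal and the clean speech signal.
    est_wave = torch.istft(est_spec, n_fft=stft0_cfg["n_fft"], hop_length=stft0_cfg["hop_length"],
                           win_length=stft0_cfg["win_length"], window=stft0_cfg["window"],
                           length=noisy.shape[-1])
    loss = time_domain_mse(est_wave, clean)
    # Step S206: update the feature extraction networks and the speech enhancement network.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```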
Fig. 3 is a flowchart illustrating a speech enhancement method according to an exemplary embodiment, where the speech enhancement model shown in fig. 3 is trained based on any one of the above-mentioned training methods of the speech enhancement model, and the speech enhancement method includes the following steps:
in step S301, a noisy speech signal to be processed is acquired. The to-be-processed noisy speech signal may be a speech signal received by a microphone in the terminal, or may be any other speech signal that needs to be processed.
In step S302, at least two spectra of the noisy speech signal to be processed are respectively input to the corresponding feature extraction networks in the speech enhancement model to obtain at least two features of the noisy speech signal to be processed, where the at least two spectra are obtained based on at least two preset groups of different time-frequency conversion parameters. Specifically, in this step, time-frequency conversion may be performed through multi-scale time-frequency analysis (STFT_1, STFT_2, …, STFT_M), and each resulting spectrum is then input to the corresponding feature extraction network (FeaNet_1, FeaNet_2, …, FeaNet_M), where M denotes the number of scales (i.e., of groups of time-frequency conversion parameters). When M = 1, this degenerates into a speech enhancement scheme with a fixed time-frequency resolution, so in the present disclosure M must satisfy M ≥ 2. A corresponding feature extraction network (FeaNet_1, FeaNet_2, …, FeaNet_M) is provided for the output of each STFT analysis. Since different STFT_m correspond to different FFT lengths, the input dimension of each FeaNet_m also differs. FeaNet_m may use different network structures; a typical FeaNet_m may consist of two Conv2d convolutional layers that extract the structural information of the spectrum, followed by a fully connected layer that maps the features to the required dimension.
According to an exemplary embodiment of the present disclosure, before the at least two spectra of the noisy speech signal to be processed are respectively input to the corresponding feature extraction networks in the speech enhancement model to obtain the at least two features of the noisy speech signal to be processed, the method further includes: acquiring the at least two preset groups of different time-frequency conversion parameters; and performing a short-time Fourier transform on the noisy speech signal to be processed based on each of the at least two groups of different time-frequency conversion parameters to obtain the at least two spectra of the noisy speech signal to be processed. With this embodiment, short-time Fourier transforms can be performed with different time-frequency conversion parameters, so that multiple spectra of the noisy speech signal are obtained to cover all signals in the noisy speech signal.
According to an exemplary embodiment of the present disclosure, a group of time-frequency conversion parameters includes at least one of a window length, a window shift, a window function, and a fast Fourier transform length.
Specifically, for STFT_m (m = 1, 2, …, M), the corresponding time-frequency conversion parameters may be selected according to the actual scene. For example, for a speech enhancement system with 16 kHz input, a typical configuration is shown in Table 1 below. This configuration corresponds to the time-frequency conversion parameter settings of the multi-scale analysis when M = 3; the window shift is uniformly set to 160 samples to facilitate alignment of the multi-scale features.
Table 1. Time-frequency conversion parameter settings for the 16 kHz speech enhancement system

         Window length   Window shift   Window function   FFT length
STFT_1   320             160            Hamming           320
STFT_2   512             160            Hamming           512
STFT_3   768             160            Kaiser            1024
STFT_0   512             160            Hamming           512
In step S303, the at least two features are fused to obtain a fused feature. For example, during fusion, different fusion weights {α_1, α_2, …, α_M} may be set for the features of different scales (i.e., the at least two features) according to the requirements of the actual scene. The fusion may adopt a concatenation-based method or an addition-based method, but is not limited thereto; any fusion method applicable to the present disclosure may be used. It should be noted that this step may be implemented by a feature fusion module whose input is the multi-scale features {F_1, F_2, …, F_M} (i.e., the at least two features) and whose output is the fused feature F_all.
According to an exemplary embodiment of the present disclosure, fusing the at least two features to obtain the fused feature includes: performing weighted concatenation or weighted addition on the at least two features, where the weight corresponding to each of the at least two features is preset. This embodiment enables convenient and fast fusion.
For example, when the dimensions of the multi-scale features (i.e., the at least two features) are not consistent, a concatenation-based fusion method may be employed:

F_all = concat(α_1 F_1, α_2 F_2, …, α_M F_M)

where α_m (m = 1, 2, …, M) takes a value in the interval [0, 1] and corresponds to one of the at least two weights, M is a positive integer, and a typical value is α_m = 1/M.
For another example, when the dimensions of the multi-scale features (i.e., the at least two features) are consistent, an addition-based fusion method may be employed:

F_all = α_1 F_1 + α_2 F_2 + … + α_M F_M

where α_m (m = 1, 2, …, M) takes a value in the interval [0, 1] and corresponds to one of the at least two weights, M is a positive integer, and a typical value is α_m = 1/M.
In step S304, the fused feature is input to the speech enhancement network in the speech enhancement model to obtain an enhanced spectrum of the noisy speech signal to be processed. For example, the input of the speech enhancement network is the multi-scale fused feature F_all, and the output may be the estimated enhanced spectrum of the clean speech signal or a mask of the clean speech signal. When the output is the mask of the clean speech signal, a multiplication module is additionally provided, and the mask is multiplied by the spectrum of the noisy speech signal to obtain the estimated enhanced spectrum of the clean speech signal. The present disclosure does not limit the structure of the speech enhancement network; to keep the system complexity low in practical scenarios, a typical speech enhancement network structure may be a two-layer RNN followed by one fully connected layer.
According to an exemplary embodiment of the present disclosure, the output of the speech enhancement network is a mask of the noisy speech signal to be processed, where the mask represents the spectral proportion of the clean speech signal in the noisy speech signal to be processed, and inputting the fused feature into the speech enhancement network in the speech enhancement model to obtain the enhanced spectrum of the noisy speech signal to be processed includes: multiplying the spectrum of the noisy speech signal to be processed by the corresponding mask to obtain the enhanced spectrum of the noisy speech signal to be processed, where the spectrum of the noisy speech signal to be processed is obtained based on a preset group of time-frequency conversion parameters. It should be noted that this preset group of time-frequency conversion parameters may be one of the at least two preset groups of different time-frequency conversion parameters, or may be a separately preset group. With this embodiment, the mask-multiplication part is separated from the speech enhancement network, which reduces the complexity of the speech enhancement network.
According to an exemplary embodiment of the present disclosure, the output of the speech enhancement network is the enhanced spectrum of the noisy speech signal to be processed. With this embodiment, the enhanced spectrum can be obtained conveniently and quickly.
In step S305, a time domain signal corresponding to the enhanced spectrum is obtained, and the time domain signal is used as an enhanced speech signal of the noisy speech signal to be processed.
To facilitate an understanding of the above embodiments, the system is described below. FIG. 4 is a schematic diagram of a speech enhancement system according to an exemplary embodiment. As shown in FIG. 4, the noisy speech signal (Noisy) passes through multiple STFT modules (STFT_1, STFT_2, …, STFT_M) and the corresponding feature extraction networks (FeaNet_1, FeaNet_2, …, FeaNet_M) to obtain multi-scale features corresponding to the different time-frequency conversion parameters. The obtained multi-scale features are then fused by a feature Fusion Layer, and the speech enhancement network (enhNet) produces a Mask of the noisy speech signal. In addition, STFT_0 and ISTFT_0 are the time-frequency conversions corresponding to the speech enhancement operation: the noisy speech is passed through STFT_0 to obtain the noisy spectrum, the noisy spectrum is multiplied by the mask estimated by the speech enhancement network to obtain the enhanced speech spectrum, and the final time-domain enhanced speech, i.e., the final enhanced speech signal, is obtained through ISTFT_0. It should be noted that the time-frequency conversion parameters corresponding to STFT_0 may be preset.
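For illustration, a minimal sketch of the inference path of FIG. 4 (multi-scale STFTs, feature extraction, fusion, mask estimation, multiplication with the STFT_0 spectrum, and ISTFT_0) is given below, reusing the illustrative components sketched above; all module names and parameters are assumptions rather than the disclosed implementation.

```python
import torch

@torch.no_grad()
def enhance(noisy: torch.Tensor, fea_nets, enh_net, stft0_cfg) -> torch.Tensor:
    """Enhance a (batch, samples) noisy waveform and return the time-domain enhanced speech."""
    spectra = multi_scale_spectra(noisy)                              # STFT_1 .. STFT_M
    features = [net(spec) for net, spec in zip(fea_nets, spectra)]    # FeaNet_1 .. FeaNet_M
    fused = fuse_features(features, mode="concat")                    # Fusion Layer
    mask = enh_net(fused).transpose(1, 2)                             # enhNet mask, (batch, freq, frames)
    noisy_spec0 = torch.stft(noisy, n_fft=stft0_cfg["n_fft"], hop_length=stft0_cfg["hop_length"],
                             win_length=stft0_cfg["win_length"], window=stft0_cfg["window"],
                             return_complex=True)                     # STFT_0
    enhanced_spec = mask * noisy_spec0                                # masked (enhanced) spectrum
    return torch.istft(enhanced_spec, n_fft=stft0_cfg["n_fft"], hop_length=stft0_cfg["hop_length"],
                       win_length=stft0_cfg["win_length"], window=stft0_cfg["window"],
                       length=noisy.shape[-1])                        # ISTFT_0 -> enhanced waveform
```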
In summary, the present disclosure overcomes the performance limitation caused by using fixed time-frequency conversion parameters in speech noise reduction schemes of the related art, and provides a real-time speech noise reduction scheme based on multi-scale features. Time-frequency analysis with different time-frequency conversion parameters is applied to the input noisy speech signal to obtain the spectra corresponding to the various time-frequency conversion parameters; features are then extracted independently from each spectrum, the extracted features are fused, and finally a neural network estimates the clean speech signal based on the fused feature. By extracting features from multiple spectra of the noisy speech signal, the present disclosure can effectively extract information about different types of speech and noise and improve the overall effect of the model.
FIG. 5 is a block diagram illustrating an apparatus for training speech enhancement models in accordance with an exemplary embodiment. Referring to fig. 5, the apparatus includes a training sample set obtaining unit 50, a feature extracting unit 52, a fusing unit 54, a pre-estimation enhancement spectrum obtaining unit 56, a target loss function determining unit 58, and a training unit 510.
A training sample set obtaining unit 50 configured to obtain a training sample set, where each training sample in the training sample set includes a noisy speech signal and a corresponding clean speech signal, and the noisy speech signal is a speech signal after noise and reverberation are added to the corresponding clean speech signal; a feature extraction unit 52, configured to input at least two frequency spectrums of the noisy speech signal into corresponding feature extraction networks of the at least two feature extraction networks, respectively, to obtain at least two features of the noisy speech signal, where the at least two frequency spectrums are obtained based on at least two preset groups of different time-frequency conversion parameters; a fusion unit 54 configured to perform fusion processing on at least two features to obtain fused features; an estimated enhanced spectrum obtaining unit 56 configured to input the fused features into a voice enhanced network to obtain an estimated enhanced spectrum of the noisy voice signal; a target loss function determination unit 58 configured to determine a target loss function of the speech enhancement model based on the predicted time domain signal corresponding to the predicted enhancement spectrum and the corresponding clean speech signal; a training unit 510 configured to adjust parameters of the at least two feature extraction networks and the speech enhancement network according to the objective loss function, and train the speech enhancement model.
According to the embodiment of the present disclosure, the output of the voice enhancement network is a mask of the noisy speech signal, where the mask represents a spectrum ratio of a clean speech signal in the noisy speech signal, and the estimated enhanced spectrum obtaining unit 56 is further configured to multiply the spectrum of the noisy speech signal with the mask of the noisy speech signal to obtain an estimated enhanced spectrum of the noisy speech signal, where the spectrum of the noisy speech signal is obtained based on a preset set of time-frequency conversion parameters.
According to an embodiment of the present disclosure, the output of the speech enhancement network is the estimated enhanced spectrum of the noisy speech signal.
According to an embodiment of the present disclosure, the feature extraction unit 52 is further configured to obtain the at least two preset groups of different time-frequency conversion parameters before the at least two spectra of the noisy speech signal are respectively input into the corresponding feature extraction networks, and to perform a short-time Fourier transform on the noisy speech signal based on each of the at least two groups of different time-frequency conversion parameters to obtain the at least two spectra of the noisy speech signal.
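A short sketch of this step, assuming two illustrative parameter groups (the concrete values are not from the disclosure): the same noisy signal is transformed once per parameter group, yielding one spectrum per group.

```python
import torch

# assumed example parameter groups; a shared hop length keeps the frame axes aligned
param_sets = [
    {"n_fft": 512, "hop_length": 128, "win_length": 512},  # finer frequency resolution
    {"n_fft": 256, "hop_length": 128, "win_length": 256},  # finer time resolution
]

def multi_spectra(noisy, param_sets):
    specs = []
    for p in param_sets:
        win = torch.hann_window(p["win_length"])
        specs.append(torch.stft(noisy, p["n_fft"], hop_length=p["hop_length"],
                                win_length=p["win_length"], window=win,
                                return_complex=True))
    return specs   # one short-time Fourier spectrum per parameter group
```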
According to an embodiment of the present disclosure, the fusion unit 54 is further configured to perform weighted concatenation or weighted addition on the at least two features, wherein a weight corresponding to each of the at least two features is preset.
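The two fusion options named above can be sketched as follows; the per-feature weights are preset values and are shown here only as placeholders.

```python
import torch

def fuse_concat(features, weights):
    # weighted concatenation along the feature axis
    return torch.cat([w * f for w, f in zip(weights, features)], dim=-1)

def fuse_add(features, weights):
    # weighted addition (the features must share the same shape)
    out = torch.zeros_like(features[0])
    for w, f in zip(weights, features):
        out = out + w * f
    return out
```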
According to an embodiment of the disclosure, a group of time-frequency conversion parameters includes at least one of a window length, a window shift, a window function, and a fast Fourier transform length.
FIG. 6 is a block diagram illustrating a speech enhancement apparatus according to an exemplary embodiment, where the speech enhancement model includes at least two feature extraction networks and a speech enhancement network. Referring to fig. 6, the apparatus includes a signal acquisition unit 60, a feature extraction unit 62, a fusion unit 64, an enhanced spectrum obtaining unit 66, and an enhanced speech signal obtaining unit 68.
A signal acquisition unit 60 configured to acquire a noisy speech signal to be processed; a feature extraction unit 62 configured to input at least two spectra of the noisy speech signal to be processed into the corresponding feature extraction networks in the speech enhancement model, respectively, to obtain at least two features of the noisy speech signal to be processed, where the at least two spectra are obtained based on at least two preset groups of different time-frequency conversion parameters; a fusion unit 64 configured to perform fusion processing on the at least two features to obtain fused features; an enhanced spectrum obtaining unit 66 configured to input the fused features into the speech enhancement network in the speech enhancement model to obtain an enhanced spectrum of the noisy speech signal to be processed; and an enhanced speech signal obtaining unit 68 configured to obtain a time-domain signal corresponding to the enhanced spectrum and use the time-domain signal as the enhanced speech signal of the noisy speech signal to be processed.
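An end-to-end inference sketch for these units, reusing the MultiScaleEnhancer sketch above; the file names, the torchaudio I/O, and treating the channel dimension as the batch dimension are assumptions made only for illustration.

```python
import torch
import torchaudio

def enhance_file(model, in_path="noisy.wav", out_path="enhanced.wav"):
    noisy, sr = torchaudio.load(in_path)    # (channels, samples), sample rate
    with torch.no_grad():
        enhanced = model(noisy)             # time-domain enhanced speech signal
    torchaudio.save(out_path, enhanced, sr)
    return enhanced
```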
According to an embodiment of the disclosure, the output of the speech enhancement network is a mask of the noisy speech signal to be processed, where the mask represents the spectral proportion of the clean speech signal in the noisy speech signal to be processed, and the enhanced spectrum obtaining unit 66 is further configured to multiply the spectrum of the noisy speech signal to be processed by the corresponding mask to obtain the enhanced spectrum of the noisy speech signal to be processed, where the spectrum of the noisy speech signal to be processed is obtained based on a preset group of time-frequency conversion parameters.
According to an embodiment of the present disclosure, the output of the speech enhancement network is an enhanced spectrum of the noisy speech signal to be processed.
According to an embodiment of the present disclosure, the feature extraction unit 62 is further configured to obtain the at least two preset groups of different time-frequency conversion parameters before the at least two spectra of the noisy speech signal to be processed are respectively input into the corresponding feature extraction networks in the speech enhancement model, and to perform a short-time Fourier transform on the noisy speech signal to be processed based on each of the at least two groups of different time-frequency conversion parameters to obtain the at least two spectra of the noisy speech signal to be processed.
According to an embodiment of the present disclosure, the fusion unit 64 is further configured to perform weighted concatenation or weighted addition on the at least two features, wherein a weight corresponding to each of the at least two features is preset.
According to an embodiment of the present disclosure, a group of time-frequency conversion parameters includes at least one of a window length, a window shift, a window function, and a fast Fourier transform length.
According to an embodiment of the present disclosure, the speech enhancement model is trained by the above-described training method of the speech enhancement model.
According to an embodiment of the present disclosure, an electronic device may be provided. FIG. 7 is a block diagram of an electronic device 700 according to an embodiment of the present disclosure, which includes at least one memory 701 and at least one processor 702, where the at least one memory 701 stores a set of computer-executable instructions that, when executed by the at least one processor 702, perform the training method of a speech enhancement model and/or the speech enhancement method according to embodiments of the present disclosure.
By way of example, the electronic device 700 may be a PC, a tablet device, a personal digital assistant, a smartphone, or another device capable of executing the above set of instructions. The electronic device 700 need not be a single electronic device, and may be any collection of devices or circuits that can execute the above instructions (or instruction sets) individually or jointly. The electronic device 700 may also be part of an integrated control system or system manager, or may be configured as a portable electronic device that interfaces with a local or remote device (e.g., via wireless transmission).
In the electronic device 700, the processor 702 may include a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a programmable logic device, a special purpose processor system, a microcontroller, or a microprocessor. By way of example, and not limitation, processor 702 may also include an analog processor, a digital processor, a microprocessor, a multi-core processor, a processor array, a network processor, or the like.
The processor 702 may execute instructions or code stored in the memory 701, and the memory 701 may also store data. The instructions and data may also be transmitted or received over a network via a network interface device, which may employ any known transmission protocol.
The memory 701 may be integrated with the processor 702, for example, with RAM or flash memory arranged within an integrated circuit microprocessor or the like. Furthermore, the memory 701 may comprise a stand-alone device, such as an external disk drive, a storage array, or any other storage device usable by a database system. The memory 701 and the processor 702 may be operatively coupled, or may communicate with each other, for example, through I/O ports, network connections, and the like, so that the processor 702 can read files stored in the memory 701.
In addition, the electronic device 700 may also include a video display (such as a liquid crystal display) and a user interaction interface (such as a keyboard, mouse, touch input device, etc.). All components of the electronic device may be connected to each other via a bus and/or a network.
According to an embodiment of the present disclosure, a computer-readable storage medium may also be provided, where instructions in the computer-readable storage medium, when executed by at least one processor, cause the at least one processor to perform the training method of a speech enhancement model and the speech enhancement method of the embodiments of the present disclosure. Examples of the computer-readable storage medium here include: read-only memory (ROM), random-access programmable read-only memory (PROM), electrically erasable programmable read-only memory (EEPROM), random-access memory (RAM), dynamic random-access memory (DRAM), static random-access memory (SRAM), flash memory, non-volatile memory, CD-ROM, CD-R, CD+R, CD-RW, CD+RW, DVD-ROM, DVD-R, DVD+R, DVD-RW, DVD+RW, DVD-RAM, BD-ROM, BD-R, BD-R LTH, BD-RE, Blu-ray or optical disc storage, a hard disk drive (HDD), a solid-state drive (SSD), card-type memory (such as a multimedia card, a Secure Digital (SD) card, or an eXtreme Digital (XD) card), magnetic tape, a floppy disk, a magneto-optical data storage device, an optical data storage device, a hard disk, a solid-state disk, and any other device configured to store a computer program and any associated data, data files, and data structures in a non-transitory manner and to provide them to a processor or computer so that the processor or computer can execute the computer program. The computer program in the computer-readable storage medium described above can be run in an environment deployed on computer equipment such as a client, a host, a proxy device, or a server; furthermore, in one example, the computer program and any associated data, data files, and data structures are distributed across a networked computer system such that the computer program and any associated data, data files, and data structures are stored, accessed, and executed in a distributed fashion by one or more processors or computers.
According to an embodiment of the present disclosure, a computer program product is provided, which includes computer instructions, and the computer instructions, when executed by a processor, implement the training method of the speech enhancement model and the speech enhancement method of the embodiment of the present disclosure.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This disclosure is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (10)

1. A method for training a speech enhancement model, wherein the speech enhancement model comprises at least two feature extraction networks and a speech enhancement network, the method comprising:
obtaining a training sample set, wherein each training sample in the training sample set comprises a noise-containing voice signal and a corresponding clean voice signal, and the noise-containing voice signal is a voice signal obtained by adding noise and reverberation to the corresponding clean voice signal;
respectively inputting at least two frequency spectrums of a noise-containing voice signal into corresponding feature extraction networks in the at least two feature extraction networks to obtain at least two features of the noise-containing voice signal, wherein the at least two frequency spectrums are obtained based on at least two groups of preset different time-frequency conversion parameters;
fusing the at least two characteristics to obtain fused characteristics;
inputting the fused features into the voice enhancement network to obtain a pre-estimated enhancement frequency spectrum of the noisy voice signal;
determining a target loss function of the speech enhancement model based on the pre-estimated time domain signal corresponding to the pre-estimated enhancement spectrum and the corresponding clean speech signal;
and adjusting parameters of the at least two feature extraction networks and the voice enhancement network according to the target loss function, and training the voice enhancement model.
2. The training method of claim 1, wherein the output of the speech enhancement network is a mask of the noisy speech signal, wherein the mask represents a spectral proportion of a clean speech signal in the noisy speech signal,
inputting the fused features into the voice enhancement network to obtain the pre-estimated enhancement spectrum of the noisy voice signal, wherein the pre-estimated enhancement spectrum comprises:
and multiplying the frequency spectrum of the noisy speech signal with the mask of the noisy speech signal to obtain an estimated enhanced frequency spectrum of the noisy speech signal, wherein the frequency spectrum of the noisy speech signal is obtained based on a preset group of time-frequency conversion parameters.
3. The training method of claim 1, wherein the output of the speech enhancement network is an estimated enhanced spectrum of the noisy speech signal.
4. The training method according to claim 1, wherein before inputting at least two spectra of the noisy speech signal into corresponding feature extraction networks of the at least two feature extraction networks, respectively, to obtain at least two features of the noisy speech signal, the method further comprises:
acquiring at least two groups of preset different time-frequency conversion parameters;
and respectively carrying out short-time Fourier transform on the noise-containing voice signal based on the at least two groups of different time-frequency conversion parameters to obtain at least two frequency spectrums of the noise-containing voice signal.
5. A method of speech enhancement, comprising:
acquiring a to-be-processed noisy voice signal;
inputting at least two frequency spectrums of the noise-containing voice signal to be processed into corresponding feature extraction networks in a voice enhancement model respectively to obtain at least two features of the noise-containing voice signal to be processed, wherein the at least two frequency spectrums are obtained based on at least two groups of preset different time-frequency conversion parameters;
fusing the at least two characteristics to obtain fused characteristics;
inputting the fused features into a voice enhancement network in the voice enhancement model to obtain an enhanced frequency spectrum of the to-be-processed noisy voice signal;
and acquiring a time domain signal corresponding to the enhanced frequency spectrum, and taking the time domain signal as an enhanced voice signal of the voice signal containing noise to be processed.
6. An apparatus for training a speech enhancement model, wherein the speech enhancement model comprises at least two feature extraction networks and a speech enhancement network, the apparatus comprising:
a training sample set obtaining unit configured to obtain a training sample set, wherein each training sample in the training sample set includes a noisy speech signal and a corresponding clean speech signal, and the noisy speech signal is a speech signal obtained by adding noise and reverberation to the corresponding clean speech signal;
the device comprises a feature extraction unit, a feature extraction unit and a feature extraction unit, wherein the feature extraction unit is configured to input at least two frequency spectrums of a noisy speech signal into corresponding feature extraction networks in the at least two feature extraction networks respectively to obtain at least two features of the noisy speech signal, and the at least two frequency spectrums are obtained based on at least two groups of preset different time-frequency conversion parameters;
the fusion unit is configured to perform fusion processing on the at least two features to obtain fused features;
the estimated enhanced spectrum obtaining unit is configured to input the fused features into the voice enhanced network to obtain an estimated enhanced spectrum of the noisy voice signal;
a target loss function determination unit configured to determine a target loss function of the speech enhancement model based on the pre-estimated time domain signal corresponding to the pre-estimated enhancement spectrum and the corresponding clean speech signal;
and the training unit is configured to adjust parameters of the at least two feature extraction networks and the voice enhancement network according to the target loss function and train the voice enhancement model.
7. A speech enhancement apparatus, comprising:
a signal acquisition unit configured to acquire a noisy speech signal to be processed;
the characteristic extraction unit is configured to input at least two frequency spectrums of the to-be-processed noise-containing voice signal into corresponding characteristic extraction networks in a voice enhancement model respectively to obtain at least two characteristics of the to-be-processed noise-containing voice signal, wherein the at least two frequency spectrums are obtained based on at least two preset groups of different time-frequency conversion parameters;
the fusion unit is configured to perform fusion processing on the at least two features to obtain fused features;
the enhanced spectrum acquisition unit is configured to input the fused features into a voice enhanced network in the voice enhanced model to obtain an enhanced spectrum of the to-be-processed noisy voice signal;
and the enhanced voice signal acquisition unit is configured to acquire a time domain signal corresponding to the enhanced spectrum and use the time domain signal as an enhanced voice signal of the to-be-processed noisy voice signal.
8. An electronic device, comprising:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the method of training a speech enhancement model according to any one of claims 1 to 4 and/or the method of speech enhancement according to claim 5.
9. A computer-readable storage medium, wherein instructions in the computer-readable storage medium, when executed by at least one processor, cause the at least one processor to perform the method of training a speech enhancement model according to any one of claims 1 to 4 and/or the method of speech enhancement according to claim 5.
10. A computer program product comprising computer instructions, characterized in that the computer instructions, when executed by a processor, implement the method of training a speech enhancement model according to any of claims 1 to 4 and/or the method of speech enhancement according to claim 5.
CN202110869479.2A 2021-07-30 2021-07-30 Training method and device of voice enhancement model, and voice enhancement method and device Active CN113555031B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110869479.2A CN113555031B (en) 2021-07-30 2021-07-30 Training method and device of voice enhancement model, and voice enhancement method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110869479.2A CN113555031B (en) 2021-07-30 2021-07-30 Training method and device of voice enhancement model, and voice enhancement method and device

Publications (2)

Publication Number Publication Date
CN113555031A true CN113555031A (en) 2021-10-26
CN113555031B CN113555031B (en) 2024-02-23

Family

ID=78133309

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110869479.2A Active CN113555031B (en) 2021-07-30 2021-07-30 Training method and device of voice enhancement model, and voice enhancement method and device

Country Status (1)

Country Link
CN (1) CN113555031B (en)

Patent Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060025989A1 (en) * 2004-07-28 2006-02-02 Nima Mesgarani Discrimination of components of audio signals based on multiscale spectro-temporal modulations
US20100023327A1 (en) * 2006-11-21 2010-01-28 Iucf-Hyu (Industry-University Cooperation Foundation Hanyang University Method for improving speech signal non-linear overweighting gain in wavelet packet transform domain
CN101853664A (en) * 2009-03-31 2010-10-06 华为技术有限公司 Signal denoising method and device and audio decoding system
JP2012181475A (en) * 2011-03-03 2012-09-20 Univ Of Tokyo Method for extracting feature of acoustic signal and method for processing acoustic signal using the feature
US20150025881A1 (en) * 2013-07-19 2015-01-22 Audience, Inc. Speech signal separation and synthesis based on auditory scene analysis and speech modeling
CN106297768A (en) * 2015-05-11 2017-01-04 苏州大学 A kind of audio recognition method
CN108717856A (en) * 2018-06-16 2018-10-30 台州学院 A kind of speech-emotion recognition method based on multiple dimensioned depth convolution loop neural network
CN109935243A (en) * 2019-02-25 2019-06-25 重庆大学 Speech-emotion recognition method based on the enhancing of VTLP data and multiple dimensioned time-frequency domain cavity convolution model
CN111696572A (en) * 2019-03-13 2020-09-22 富士通株式会社 Speech separation apparatus, method and medium
WO2020232180A1 (en) * 2019-05-14 2020-11-19 Dolby Laboratories Licensing Corporation Method and apparatus for speech source separation based on a convolutional neural network
KR102191736B1 (en) * 2020-07-28 2020-12-16 주식회사 수퍼톤 Method and apparatus for speech enhancement with artificial neural network
CN112289333A (en) * 2020-12-25 2021-01-29 北京达佳互联信息技术有限公司 Training method and device of voice enhancement model and voice enhancement method and device
CN113113041A (en) * 2021-04-29 2021-07-13 电子科技大学 Voice separation method based on time-frequency cross-domain feature selection

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117409794A (en) * 2023-12-13 2024-01-16 深圳市声菲特科技技术有限公司 Audio signal processing method, system, computer device and storage medium
CN117409794B (en) * 2023-12-13 2024-03-15 深圳市声菲特科技技术有限公司 Audio signal processing method, system, computer device and storage medium

Also Published As

Publication number Publication date
CN113555031B (en) 2024-02-23

Similar Documents

Publication Publication Date Title
JP2021086154A (en) Method, device, apparatus, and computer-readable storage medium for speech recognition
CN112289333A (en) Training method and device of voice enhancement model and voice enhancement method and device
WO2017157319A1 (en) Audio information processing method and device
US11514925B2 (en) Using a predictive model to automatically enhance audio having various audio quality issues
CN112309426A (en) Voice processing model training method and device and voice processing method and device
CN114121029A (en) Training method and device of speech enhancement model and speech enhancement method and device
CN113284507A (en) Training method and device of voice enhancement model and voice enhancement method and device
CN113555031B (en) Training method and device of voice enhancement model, and voice enhancement method and device
CN113035221B (en) Training method and device for voice processing model and voice processing method and device
CN112652290B (en) Method for generating reverberation audio signal and training method of audio processing model
CN113241088B (en) Training method and device of voice enhancement model and voice enhancement method and device
CN113470685A (en) Training method and device of voice enhancement model and voice enhancement method and device
Primavera et al. Objective and subjective investigation on a novel method for digital reverberator parameters estimation
WO2023226572A1 (en) Feature representation extraction method and apparatus, device, medium and program product
CN112951263A (en) Speech enhancement method, apparatus, device and storage medium
CN114283833A (en) Speech enhancement model training method, speech enhancement method, related device and medium
CN114758668A (en) Training method of voice enhancement model and voice enhancement method
CN115641868A (en) Audio separation method and device, electronic equipment and computer readable storage medium
CN113593594A (en) Training method and device of voice enhancement model and voice enhancement method and device
CN113990327A (en) Method for training representation extraction model of speaking object and method for identifying identity of speaking object
CN114155852A (en) Voice processing method and device, electronic equipment and storage medium
CN109378012B (en) Noise reduction method and system for recording audio by single-channel voice equipment
US20190385590A1 (en) Generating device, generating method, and non-transitory computer readable storage medium
CN113539300A (en) Voice detection method and device based on noise suppression, storage medium and terminal
CN114822492B (en) Speech synthesis method and device, electronic equipment and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant