CN114758668A - Training method of voice enhancement model and voice enhancement method - Google Patents

Training method of voice enhancement model and voice enhancement method

Info

Publication number
CN114758668A
Authority
CN
China
Prior art keywords
speech
speaking object
voice
speaking
voice signal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210435868.9A
Other languages
Chinese (zh)
Inventor
许成林
郑羲光
张旭
陈联武
任新蕾
张晨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Beijing Dajia Internet Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Dajia Internet Information Technology Co Ltd filed Critical Beijing Dajia Internet Information Technology Co Ltd
Priority to CN202210435868.9A priority Critical patent/CN114758668A/en
Publication of CN114758668A publication Critical patent/CN114758668A/en
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L21/0224 Processing in the time domain
    • G10L21/0232 Processing in the frequency domain
    • G10L21/0264 Noise filtering characterised by the type of parameter measurement, e.g. correlation techniques, zero crossing techniques or predictive techniques
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Abstract

The present disclosure relates to a training method for a speech enhancement model and a speech enhancement method, including: acquiring training samples of a plurality of speaking objects; inputting a first clean speech signal sample of each speaking object into a speech characterization extractor; inputting the speech characterization of each speaking object into a classifier; inputting the speech characterization of each speaking object and the magnitude spectrum of its overlapped noisy speech signal sample into a speech extractor, and determining a predicted enhanced speech signal of the speaking object from a predicted magnitude spectrum mask of the enhanced speech signal of the speaking object; calculating a loss according to the enhanced speech signal, the second clean speech signal sample, the identification prediction result, and the identification label corresponding to each speaking object; and adjusting parameters of the speech extractor, the speech characterization extractor, and the classifier through the loss to train the speech enhancement model. The trained speech enhancement model can thus accurately extract the speech signal of a specified speaking object from the speech signals of a plurality of speaking objects.

Description

Training method of voice enhancement model and voice enhancement method
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a method for training a speech enhancement model, a method for speech enhancement, an apparatus for speech enhancement, an electronic device, a sound box, and a storage medium.
Background
With the development of deep learning technology, the speech enhancement system in the related art can remove noise signals in noisy audio signals, thereby improving the listening quality and intelligibility of non-noise signals.
However, speech enhancement systems in the related art can only filter out non-vocal interference, such as environmental noise, in a noisy audio signal; they cannot filter out interfering human voices. For example, if multiple speaking objects speak simultaneously in the current environment, a related-art speech enhancement system retains the speech signals of all the speaking objects. The unfiltered interfering voices then interfere with the speech signal of the specified speaking object and reduce the listening quality and intelligibility of that speech signal.
Disclosure of Invention
The present disclosure provides a training method for a speech enhancement model, a speech enhancement method, corresponding apparatuses, an electronic device, a sound box, and a storage medium, so as to at least solve the problem in the related art that a speech enhancement system retains the speech signals of all speaking objects, so that unfiltered interfering voices interfere with the speech signal of a specified speaking object and reduce its listening quality and intelligibility.
According to a first aspect of the embodiments of the present disclosure, there is provided a training method for a speech enhancement model, where the speech enhancement model includes a speech extractor, a speech characterization extractor, and a classifier. The training method includes: obtaining training samples of a plurality of speaking objects, where the training sample of each speaking object includes a first pure voice signal sample, a second pure voice signal sample, an overlapped voice noisy signal sample, and an identification tag of the speaking object, the overlapped voice noisy signal sample of the speaking object being obtained by overlapping the second pure voice signal sample of the speaking object with the voice signal of at least one other speaking object; inputting the first pure voice signal sample of each speaking object into the speech characterization extractor to obtain the voice representation of the speaking object; inputting the voice representation of each speaking object into the classifier to obtain the identification prediction result of the speaking object; inputting the voice representation of each speaking object and the amplitude spectrum of the overlapped voice noisy signal sample into the speech extractor to obtain a predicted amplitude spectrum mask of the enhanced voice signal of the speaking object, and determining the predicted enhanced voice signal of the speaking object according to the predicted amplitude spectrum mask, where the enhanced voice signal is the voice signal of the speaking object extracted from the overlapped voice noisy signal sample of the speaking object; calculating a loss according to the enhanced voice signal, the second pure voice signal sample, the identification prediction result, and the identification tag corresponding to each speaking object; and adjusting parameters of the speech extractor, the speech characterization extractor, and the classifier according to the loss to train the speech enhancement model.
Optionally, the calculating the loss according to the enhanced speech signal, the second pure speech signal sample, the identification prediction result, and the identification tag corresponding to each speaking object includes: calculating a first loss according to the identification prediction result and the identification label corresponding to each speaking object; calculating a second loss according to the enhanced voice signal and a second pure voice signal sample corresponding to each speaking object; and carrying out weighted summation on the first loss and the second loss to obtain the loss.
Optionally, the first loss is calculated by a cross entropy loss function.
Optionally, the second loss is calculated by a scale-invariant signal-to-distortion ratio loss function.
Optionally, the speech characterization extractor and the classifier are obtained through a first pre-training; the speech extractor is obtained via a second pre-training, wherein the first and second pre-training are independent of each other.
Optionally, the determining the predicted enhanced speech signal of the speaking object according to the predicted magnitude spectrum mask of the enhanced speech signal of the speaking object includes:
multiplying the amplitude spectrum of the overlapped voice noisy signal sample of the speaking object with the predicted amplitude spectrum mask of the enhanced voice signal of the speaking object to obtain the predicted amplitude spectrum of the enhanced voice signal of the speaking object;
and combining the predicted amplitude spectrum of the enhanced voice signal of the speaking object with the phase spectrum of the overlapped voice noisy signal sample and executing inverse short-time Fourier transform to obtain the predicted enhanced voice signal of the speaking object.
Optionally, the inputting the first clean speech signal sample of each speaking object into the speech characterization extractor includes:
carrying out short-time Fourier transform on the first pure voice signal sample of each speaking object to obtain the amplitude spectrum of the first pure voice signal sample of the speaking object;
acquiring a Mel cepstrum characteristic corresponding to the amplitude spectrum of a first pure voice signal sample of the speaking object;
and inputting the Mel cepstrum characteristics corresponding to the magnitude spectrum of the first pure voice signal sample of the speaking object into the voice characteristic extractor.
Optionally, the first clean speech signal samples are different from the second clean speech signal samples.
According to a second aspect of the embodiments of the present disclosure, there is provided a speech enhancement method, including: acquiring a tested voice signal; inputting a pre-registered voice signal of a target speaking object into a voice characterization extractor trained according to the training method disclosed by the disclosure to obtain a voice characterization of the target speaking object; and inputting the voice representation of the target speaking object and the tested voice signal into a voice extractor trained according to the training method disclosed by the disclosure, and obtaining the voice signal of the target speaking object extracted from the tested voice signal.
According to a third aspect of the embodiments of the present disclosure, there is provided a training apparatus for a speech enhancement model, the speech enhancement model including a speech extractor, a speech characterization extractor, and a classifier. The training apparatus includes: an acquisition module configured to acquire training samples of a plurality of speaking objects, where the training sample of each speaking object includes a first pure voice signal sample, a second pure voice signal sample, an overlapped voice noisy signal sample, and an identification tag of the speaking object, the overlapped voice noisy signal sample of the speaking object being obtained by overlapping the second pure voice signal sample of the speaking object with the voice signal of at least one other speaking object; a first input module configured to input the first pure voice signal sample of each speaking object into the speech characterization extractor to obtain the voice representation of the speaking object; a second input module configured to input the voice representation of each speaking object into the classifier to obtain the identification prediction result of the speaking object; a third input module configured to input the voice representation of each speaking object and the amplitude spectrum of the overlapped voice noisy signal sample into the speech extractor to obtain a predicted amplitude spectrum mask of the enhanced voice signal of the speaking object, and to determine the predicted enhanced voice signal of the speaking object according to the predicted amplitude spectrum mask, where the enhanced voice signal is the voice signal of the speaking object extracted from the overlapped voice noisy signal sample of the speaking object; a calculation module configured to calculate a loss according to the enhanced voice signal, the second pure voice signal sample, the identification prediction result, and the identification tag corresponding to each speaking object; and an adjustment module configured to adjust parameters of the speech extractor, the speech characterization extractor, and the classifier according to the loss to train the speech enhancement model.
Optionally, the computing module is configured to: calculating a first loss according to the identification prediction result and the identification label corresponding to each speaking object; calculating a second loss according to the enhanced voice signal and a second pure voice signal sample corresponding to each speaking object; and carrying out weighted summation on the first loss and the second loss to obtain the loss.
Optionally, the first loss is calculated by a cross entropy loss function.
Optionally, the second loss is calculated by a scale-invariant signal-to-distortion ratio loss function.
Optionally, the speech characterization extractor and the classifier are obtained through a first pre-training; the speech extractor is obtained via a second pre-training, wherein the first and second pre-training are independent of each other.
Optionally, the third input module is configured to:
multiplying the amplitude spectrum of the overlapped voice noise signal sample of the speaking object with the predicted amplitude spectrum mask of the enhanced voice signal of the speaking object to obtain the predicted amplitude spectrum of the enhanced voice signal of the speaking object;
and combining the predicted amplitude spectrum of the enhanced voice signal of the speaking object with the phase spectrum of the overlapped voice noisy signal sample and executing inverse short-time Fourier transform to obtain the predicted enhanced voice signal of the speaking object.
Optionally, the first input module is configured to:
carrying out short-time Fourier transform on the first pure voice signal sample of each speaking object to obtain the amplitude spectrum of the first pure voice signal sample of the speaking object;
acquiring a Mel cepstrum characteristic corresponding to the amplitude spectrum of a first pure voice signal sample of the speaking object;
and inputting the Mel cepstrum characteristic corresponding to the magnitude spectrum of the first pure voice signal sample of the speaking object into the voice characterization extractor.
Optionally, the first clean speech signal samples are different from the second clean speech signal samples.
According to a fourth aspect of the embodiments of the present disclosure, there is provided a speech enhancement apparatus, comprising: the acquisition module is configured to acquire a tested voice signal; a first input module configured to input a pre-registered voice signal of a target speaking object into a voice representation extractor trained according to the training device of the present disclosure, so as to obtain a voice representation of the target speaking object; and a second input module configured to input the voice representation of the target speaking object and the tested voice signal into a voice extractor trained according to the training device of the present disclosure, and obtain the voice signal of the target speaking object extracted from the tested voice signal.
According to a fifth aspect of embodiments of the present disclosure, there is provided an electronic apparatus including: a processor; a memory for storing the processor-executable instructions; wherein the processor is configured to execute the instructions to implement a method of training a speech enhancement model or a method of speech enhancement according to the present disclosure.
According to a sixth aspect of embodiments of the present disclosure, there is provided a computer-readable storage medium, wherein instructions, when executed by a processor of an electronic device, enable the electronic device to perform a method of training a speech enhancement model or a method of speech enhancement according to the present disclosure.
According to a seventh aspect of embodiments of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements a method of training a speech enhancement model or a method of speech enhancement according to the present disclosure.
According to an eighth aspect of embodiments of the present disclosure, there is provided an acoustic enclosure comprising a speech enhancement device according to the present disclosure.
According to a ninth aspect of the embodiments of the present disclosure, there is provided an acoustic enclosure, comprising: at least one processor; at least one memory storing computer-executable instructions, wherein the computer-executable instructions, when executed by the at least one processor, cause the at least one processor to perform a speech enhancement method according to the present disclosure.
The technical scheme provided by the embodiment of the disclosure at least brings the following beneficial effects:
Therefore, the training method for a speech enhancement model of the present disclosure trains the speech enhancement model with a large number of clean speech signal samples and overlapped noisy speech signal samples of speaking objects, so that the trained speech enhancement model can accurately extract the speech signal of a specified speaking object from the speech signals of a plurality of speaking objects, avoiding interference from interfering voices and improving the listening quality and intelligibility of the speech signal of the specified speaking object. Moreover, with the help of a pre-registered speech signal of a target speaking object, the speech enhancement method can use the trained speech enhancement model to extract the potential speech signal of the target speaking object from the tested speech signal, likewise avoiding interference from interfering voices, achieving speech enhancement for the specified speaking object, and improving the listening quality and intelligibility of its speech signal.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure and are not to be construed as limiting the disclosure.
FIG. 1 is a flow chart illustrating a method of training a speech enhancement model according to an exemplary embodiment of the present disclosure;
FIG. 2 is a schematic diagram illustrating a training phase of a speech enhancement model according to an exemplary embodiment of the present disclosure;
FIG. 3 is a flow chart illustrating a method of speech enhancement according to an exemplary embodiment of the present disclosure;
FIG. 4 is a schematic diagram illustrating a testing phase of a speech enhancement model according to an exemplary embodiment of the present disclosure;
FIG. 5 is a block diagram illustrating a training apparatus for a speech enhancement model according to an exemplary embodiment of the present disclosure;
FIG. 6 is a block diagram illustrating a speech enhancement apparatus according to an exemplary embodiment of the present disclosure;
FIG. 7 is a block diagram illustrating an electronic device according to an example embodiment of the present disclosure;
fig. 8 is a block diagram showing a structure of an acoustic enclosure according to an exemplary embodiment of the present disclosure;
fig. 9 is a block diagram showing a structure of an acoustic enclosure according to another exemplary embodiment of the present disclosure.
Detailed Description
In order to make the technical solutions of the present disclosure better understood, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the foregoing drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are capable of operation in other sequences than those illustrated or described herein. The embodiments described in the following examples do not represent all embodiments consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
In the present disclosure, the expression "at least one of the items" covers three parallel cases: "any one of the items", "a combination of any several of the items", and "all of the items". For example, "include at least one of A and B" covers the following three parallel cases: (1) including A; (2) including B; (3) including A and B. As another example, "perform at least one of step one and step two" covers the following three parallel cases: (1) performing step one; (2) performing step two; (3) performing step one and step two.
FIG. 1 is a flow chart illustrating a method of training a speech enhancement model, which may include a speech extractor, a speech characterization extractor, and a classifier, according to an exemplary embodiment of the present disclosure.
Referring to fig. 1, in step 101, training samples of a plurality of speaking objects may be obtained. The training sample of each speaking object may include a first clean speech signal sample x(t), a second clean speech signal sample s(t), an overlapped speech noisy signal sample y(t), and an identification label, where the overlapped speech noisy signal sample y(t) of the speaking object is obtained by overlapping the second clean speech signal sample s(t) of the speaking object with the speech signal of at least one other speaking object. Further, the overlapped speech noisy signal sample y(t) of the speaking object may also contain different types of noise. For example, noise data may be added to the speech signal of the speaking object at different signal-to-noise ratios (SNRs), and the noise may involve room impulse responses (RIRs) of different rooms, and so on. The overlapped speech noisy signal sample y(t) of the speaking object may also be a noisy speech signal sample of a single speaking object. It should be noted that a speaking object may be any object with speech output capability, for example, a real speaker or a virtual speaker.
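For illustration only, the construction of an overlapped speech noisy signal sample can be sketched in a few lines of Python. The snippet below is a hedged sketch rather than code from the patent; the function name mix_at_snr, the random stand-in signals, and the SNR values of 5 dB and 10 dB are all hypothetical.

```python
import numpy as np

def mix_at_snr(target: np.ndarray, interference: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale `interference` so the target-to-interference power ratio equals snr_db, then add it."""
    n = min(len(target), len(interference))
    target, interference = target[:n], interference[:n]
    p_target = np.mean(target ** 2) + 1e-12
    p_interf = np.mean(interference ** 2) + 1e-12
    scale = np.sqrt(p_target / (p_interf * 10.0 ** (snr_db / 10.0)))
    return target + scale * interference

# Overlapped speech noisy signal sample y(t): the second clean sample s(t) overlapped with
# another speaker at 5 dB SNR, then with background noise at 10 dB SNR (values illustrative).
s_t = np.random.randn(16000)            # stand-in for the second clean speech signal sample s(t)
other_speaker = np.random.randn(16000)  # stand-in for the interfering speaker's speech
ambient_noise = np.random.randn(16000)  # stand-in for environmental noise
y_t = mix_at_snr(mix_at_snr(s_t, other_speaker, 5.0), ambient_noise, 10.0)
```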
According to an exemplary embodiment of the disclosure, the first clean speech signal sample x(t) is different from the second clean speech signal sample s(t); that is, x(t) and s(t) may be speech signal samples of the same speaker with different contents, and each is a single-speaker speech signal sample with a high SNR.
Further, the first clean speech signal sample x(t) and the second clean speech signal sample s(t) are required to be speech signal samples of the same speaker with different contents for the following reason: at test time, the trained speech enhancement model extracts the potential speech signal of a specified speaking object from the tested speech signal with the help of the pre-registered speech signal of that speaking object. In an actual test scenario, however, the pre-registered speech signal of a speaker and the potential speech signal of that speaker in the tested speech signal usually differ in content.
For example, all employees of a company may register their own voice signals in advance, and during an online meeting of the employees of the company, the speaking content of each participant in the meeting process must be different from the content of the voice signals registered in advance. Therefore, in the training stage, in order to simulate a real test scenario as much as possible, the first clean speech signal sample x (t) and the second clean speech signal sample s (t) for training may be set as speech signal samples with different contents, so that it may be ensured that the speech enhancement model can extract a potential speech signal of a specified speaker from a tested speech signal as accurately as possible in an actual test scenario.
It should be noted that in the training phase of the speech enhancement model, the speech data of thousands of speaking objects are used to train the speech extractor, the speech characterization extractor, and the classifier. The speech extractor, the speech characterization extractor, and the classifier can each be constructed based on a neural network.
In step 102, the first clean speech signal sample x (t) of each speaking object may be input into a speech characterization extractor to obtain a speech characterization of the speaking object, where the speech characterization of the speaking object may be a voiceprint feature of the speaking object.
According to an exemplary embodiment of the present disclosure, the first clean speech signal sample x(t) of each speaking object may be subjected to a short-time Fourier transform (STFT) to obtain the magnitude spectrum |X(n, k)| of the first clean speech signal sample x(t) of the speaking object.
It should be noted that if an original audio signal of length T is represented in the time domain as x(t), where t denotes time and 0 < t ≤ T, then after the short-time Fourier transform, x(t) can be represented as:
X(n,k)=STFT(x(t))
where n is the frame index, 0 < n ≤ N, N being the total number of frames; k is the frequency-bin index, 0 < k ≤ K, K being the total number of frequency bins. The complex-valued X(n, k) comprises a time-frequency magnitude spectrum |X(n, k)| and a phase spectrum ∠X(n, k).
Then, the mel cepstrum features M(n, l) corresponding to the magnitude spectrum |X(n, k)| of the first clean speech signal sample x(t) of the speaking object can be obtained, where l is the feature dimension.
Next, the mel cepstrum features M(n, l) corresponding to the magnitude spectrum |X(n, k)| of the first clean speech signal sample x(t) of the speaking object may be input into the speech characterization extractor to obtain the speech characterization of the speaking object. For example, a multi-layer attention-based time-delay network f(·) can be used to process the frame-level mel cepstrum features M(n, l), resulting in frame-level features R(n, i), where i indexes the nodes of that network layer.
R(n,i)=f(Mel(|STFT(x(t))|))
After obtaining the frame-level features R(n, i), statistics pooling may be performed on them to obtain the sentence-level speech characterization e of the speaking object, which may be a vector:
e = statistics pooling(R(n, i))
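As a rough illustration of step 102, the following PyTorch sketch shows a frame-level network followed by statistics pooling (concatenated mean and standard deviation over frames) that maps mel features to a sentence-level characterization e. The layer types and sizes are hypothetical stand-ins for the multi-layer attention-based time-delay network f(·) described above, not the patent's actual architecture.

```python
import torch
import torch.nn as nn

class SpeechCharacterizationExtractor(nn.Module):
    """Frame-level network + statistics pooling -> utterance-level characterization e.
    Layer types and sizes are illustrative stand-ins, not the patent's architecture."""
    def __init__(self, n_mel: int = 40, hidden: int = 256, emb_dim: int = 128):
        super().__init__()
        # stand-in for the multi-layer (attention-based) time-delay network f(.)
        self.frame_net = nn.Sequential(
            nn.Conv1d(n_mel, hidden, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=3, dilation=2, padding=2), nn.ReLU(),
        )
        self.proj = nn.Linear(2 * hidden, emb_dim)  # applied after mean+std pooling

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        # mel: (batch, n_mel, frames) -> frame-level features R(n, i)
        r = self.frame_net(mel)
        # statistics pooling over the frame axis: concatenate mean and standard deviation
        stats = torch.cat([r.mean(dim=-1), r.std(dim=-1)], dim=-1)
        return self.proj(stats)  # sentence-level speech characterization e

extractor = SpeechCharacterizationExtractor()
e = extractor(torch.randn(2, 40, 300))  # e.shape == (2, 128)
```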
In step 103, the speech characterization e of each speaking object can be input into the classifier to obtain the identification prediction result of the speaking object. For example, a two-layer feed-forward network may be used to transform the speech characterization e of the speaking object and predict the probability that e belongs to each of the speaking objects in the training data. The identification prediction result may be an identity prediction probability.
It should be noted that, a short-time fourier transform may also be performed on the overlapped speech noisy signal samples y (t):
Y(n,k)=STFT(y(t))
where the complex-valued Y(n, k) comprises a magnitude spectrum |Y(n, k)| and a phase spectrum ∠Y(n, k).
In step 104, the speech characterization e of each speaking object and the magnitude spectrum |Y(n, k)| of the overlapped speech noisy signal sample y(t) may be input to the speech extractor to obtain a predicted magnitude spectrum mask M of the enhanced speech signal of the speaking object, and the predicted enhanced speech signal s'(t) of the speaking object is determined according to the predicted magnitude spectrum mask M. The enhanced speech signal s'(t) is the speech signal of the speaking object extracted from the overlapped speech noisy signal sample y(t) of the speaking object. It should be noted that the speech extractor of the present disclosure may include a feed-forward neural network layer, a gated recurrent unit (GRU) layer, and a dilated convolution layer.
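As a rough illustration of the speech extractor just described (a feed-forward layer, a GRU layer, and a dilated convolution layer, conditioned on the characterization e), the following PyTorch sketch predicts a magnitude-spectrum mask M from |Y(n, k)| and e. The layer sizes and the exact arrangement are hypothetical.

```python
import torch
import torch.nn as nn

class SpeechExtractor(nn.Module):
    """Predicts a magnitude-spectrum mask M for the target speaking object from the
    noisy magnitude spectrum |Y(n, k)| and the speech characterization e.
    The layer arrangement and sizes are simplified stand-ins."""
    def __init__(self, n_freq: int = 257, emb_dim: int = 128, hidden: int = 256):
        super().__init__()
        self.pre = nn.Linear(n_freq + emb_dim, hidden)       # feed-forward layer
        self.gru = nn.GRU(hidden, hidden, batch_first=True)  # gated recurrent layer
        self.dilated = nn.Conv1d(hidden, hidden, kernel_size=3, dilation=2, padding=2)  # dilated conv
        self.mask_out = nn.Linear(hidden, n_freq)

    def forward(self, mag: torch.Tensor, e: torch.Tensor) -> torch.Tensor:
        # mag: (batch, frames, n_freq); e: (batch, emb_dim), broadcast over frames
        e_rep = e.unsqueeze(1).expand(-1, mag.size(1), -1)
        h = torch.relu(self.pre(torch.cat([mag, e_rep], dim=-1)))
        h, _ = self.gru(h)
        h = torch.relu(self.dilated(h.transpose(1, 2))).transpose(1, 2)
        return torch.sigmoid(self.mask_out(h))                # mask M with values in [0, 1]

speech_extractor = SpeechExtractor()
mask = speech_extractor(torch.randn(2, 126, 257), torch.randn(2, 128))  # (2, 126, 257)
```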
According to an exemplary embodiment of the present disclosure, the magnitude spectrum |Y(n, k)| of the overlapped speech noisy signal sample y(t) of the speaking object may be multiplied by the predicted magnitude spectrum mask M of the enhanced speech signal of the speaking object to obtain the predicted magnitude spectrum |Y'(n, k)| of the enhanced speech signal of the speaking object; this implements filtering with the predicted magnitude spectrum mask M. Next, the predicted magnitude spectrum |Y'(n, k)| of the enhanced speech signal of the speaking object may be combined with the phase spectrum ∠Y(n, k) of the overlapped speech noisy signal sample y(t), and an inverse short-time Fourier transform (iSTFT) may be performed to obtain the predicted enhanced speech signal s'(t) of the speaking object:
s′(t)=iSTFT(g(STFT(x(t)),STFT(y(t))))
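The mask filtering and iSTFT reconstruction of this step can be sketched as follows. The snippet assumes PyTorch's torch.stft/torch.istft and illustrative n_fft and hop-length values; it is a sketch under those assumptions, not the patent's implementation.

```python
import torch

def reconstruct_enhanced(y_t: torch.Tensor, mask: torch.Tensor,
                         n_fft: int = 512, hop: int = 128) -> torch.Tensor:
    """Apply the predicted magnitude-spectrum mask to |Y(n, k)|, reuse the noisy phase
    spectrum, and invert with an inverse STFT to obtain the enhanced signal s'(t)."""
    win = torch.hann_window(n_fft)
    Y = torch.stft(y_t, n_fft, hop_length=hop, window=win, return_complex=True)  # (freq, frames)
    mag, phase = Y.abs(), torch.angle(Y)
    enhanced_mag = mag * mask                      # point-wise filtering: |Y'(n, k)| = |Y(n, k)| * M
    Y_enhanced = torch.polar(enhanced_mag, phase)  # recombine with the noisy phase spectrum
    return torch.istft(Y_enhanced, n_fft, hop_length=hop, window=win, length=y_t.shape[-1])

# The mask must match the STFT layout, e.g. (n_fft // 2 + 1, frames) for a 1-D signal.
y_t = torch.randn(16000)
mask = torch.rand(257, 126)
s_pred = reconstruct_enhanced(y_t, mask)
```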
Referring to FIG. 2, FIG. 2 is a schematic diagram illustrating a training phase of a speech enhancement model according to an exemplary embodiment of the present disclosure. In fig. 2, feature extraction may be performed on the first clean speech signal sample x(t) of each speaking object to obtain the mel cepstrum features M(n, l) corresponding to the first clean speech signal sample x(t) of the speaking object. Then, the mel cepstrum features M(n, l) corresponding to the first clean speech signal sample x(t) of the speaking object can be used as the input of the speech characterization extractor to obtain the speech characterization e of the speaking object.
For the overlapped speech noisy signal sample y(t) of the speaking object, a short-time Fourier transform can be performed on y(t) to obtain the complex-valued Y(n, k), which comprises a magnitude spectrum |Y(n, k)| and a phase spectrum ∠Y(n, k). The speech characterization e of the speaking object and the magnitude spectrum |Y(n, k)| of the overlapped speech noisy signal sample y(t) can be used as the input of the speech extractor; with the help of the speech characterization e of the speaking object, the predicted magnitude spectrum mask M of the enhanced speech signal of the speaking object can be obtained.
Next, filtering can be performed with the predicted magnitude spectrum mask M of the enhanced speech signal of the speaking object. For example, the magnitude spectrum |Y(n, k)| of the overlapped speech noisy signal sample y(t) of the speaking object can be multiplied point-wise with the predicted magnitude spectrum mask M to obtain the predicted magnitude spectrum |Y'(n, k)| of the enhanced speech signal of the speaking object. Finally, the predicted magnitude spectrum |Y'(n, k)| of the enhanced speech signal of the speaking object can be combined with the phase spectrum ∠Y(n, k) of the overlapped speech noisy signal sample y(t) and an iSTFT can be performed to obtain the predicted enhanced speech signal s'(t) of the speaking object.
Moreover, the speech characterization e of the speaking object can be used as the input of the classifier to obtain the identification prediction result of the speaking object, that is, the identity prediction probability of the speaking object with respect to each of speaking object 1, speaking object 2, …, speaking object m can be obtained.
In step 105, the loss may be calculated according to the enhanced speech signal s'(t), the second clean speech signal sample s(t), the identification prediction result, and the identification label corresponding to each speaking object.
According to an exemplary embodiment of the present disclosure, the speech extractor needs to maximize the SNR of the extracted speech signal of the potential speaking object, while the speech characterization extractor needs to maximize the classification accuracy over speaking objects. A joint training scheme can therefore be adopted in which the loss function corresponding to the speech extractor and the loss function corresponding to the speech characterization extractor are weighted and summed to serve as the loss function used in actual training.
For example, the first loss J1 may be calculated from the identification prediction result and the identification label corresponding to each speaking object, and the second loss J2 may be calculated from the enhanced speech signal s'(t) and the second clean speech signal sample s(t) corresponding to each speaking object. Next, the first loss J1 and the second loss J2 can be weighted and summed to obtain the loss J used in actual training:
J=α*J1+J2
where α is a weight.
According to an exemplary embodiment of the present disclosure, the first loss J1 can be calculated by a cross-entropy loss function:
J1 = -Σ_{c=1..C} P_c · log P(c|e)
where C is the total number of speaking objects in the training data; P_c equals 1 when the speech characterization e of the speech segment is produced by speaking object c and equals 0 otherwise; and P(c|e) is the identification prediction result with which the classifier predicts that the speech characterization e of the speech segment was produced by speaking object c.
According to an exemplary embodiment of the present disclosure, the second loss J2 can be calculated by a scale-invariant signal-to-distortion ratio (SI-SDR) loss function, in its standard form:
J2 = -10·log10( ||s_target||² / ||s'(t) - s_target||² ), where s_target = (⟨s'(t), s(t)⟩ / ||s(t)||²) · s(t)
i.e. the negative SI-SDR between the predicted enhanced speech signal s'(t) and the second clean speech signal sample s(t).
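A hedged sketch of the two losses and their weighted sum J = α·J1 + J2 follows, using the standard cross-entropy and SI-SDR formulations; the weight α = 0.5 is an arbitrary illustrative value, not one stated in the patent.

```python
import torch
import torch.nn.functional as F

def si_sdr_loss(est: torch.Tensor, ref: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Negative scale-invariant SDR between the predicted enhanced signal s'(t) and the
    second clean sample s(t); minimizing it maximizes the SI-SDR."""
    est = est - est.mean(dim=-1, keepdim=True)
    ref = ref - ref.mean(dim=-1, keepdim=True)
    proj = (torch.sum(est * ref, dim=-1, keepdim=True)
            / (torch.sum(ref ** 2, dim=-1, keepdim=True) + eps)) * ref
    noise = est - proj
    ratio = torch.sum(proj ** 2, dim=-1) / (torch.sum(noise ** 2, dim=-1) + eps)
    return -(10.0 * torch.log10(ratio + eps)).mean()

def joint_loss(logits: torch.Tensor, labels: torch.Tensor,
               est: torch.Tensor, ref: torch.Tensor, alpha: float = 0.5) -> torch.Tensor:
    """J = alpha * J1 + J2, with J1 the speaker-classification cross entropy and J2 the SI-SDR loss."""
    j1 = F.cross_entropy(logits, labels)   # first loss J1
    j2 = si_sdr_loss(est, ref)             # second loss J2
    return alpha * j1 + j2
```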
in step 106, parameters of the speech extractor, the speech characterization extractor, and the classifier may be adjusted by the loss J, thereby training the speech enhancement model.
According to an exemplary embodiment of the present disclosure, the speech characterization extractor and the classifier may be obtained through a first pre-training; the speech extractor may be obtained through a second pre-training. Wherein the first pre-training and the second pre-training are independent of each other.
At this time, the process of training the speech enhancement model of the present disclosure may include three stages:
in the first training phase, the speech characterization extractor and classifier may be pre-trained using a large number of clean speech signal samples of a single speaking subject. The method comprises the steps of obtaining the prediction probability that a pure speech signal sample of a single speaking object belongs to each speaking object in a training data set by utilizing a classifier, then calculating a loss function of cross-entropy (cross-entropy) according to the real identity label of each speaking object, and adjusting parameters of a speech characterization extractor and the classifier through the loss function of the cross-entropy so as to achieve the purpose of pre-training the speech characterization extractor and the classifier.
In the second training phase, the pre-trained speech characterization extractor can be used to extract the speech characterization e corresponding to the first clean speech signal sample of a speaking object, and with the help of e the potential speech signal of the speaking object can be extracted from the overlapped speech noisy signal sample of the speaking object. The SI-SDR loss function can then be calculated between the extracted speech signal of the speaking object and the second clean speech signal sample of the speaking object contained in the overlapped speech noisy signal sample, whose content differs from that of the first clean speech signal sample. Next, the parameters of the speech extractor can be adjusted through the SI-SDR loss function, thereby pre-training the speech extractor. It should be noted that in the second training phase, the parameters of the pre-trained speech characterization extractor are fixed, i.e. the pre-trained speech characterization extractor is not trained again in this phase.
In the third training phase, the pre-trained speech characterization extractor, the pre-trained classifier, and the pre-trained speech extractor may be jointly trained simultaneously using a smaller learning rate. The cross-entropy loss function corresponding to the speech characterization extractor and the classifier and the SISDR loss function corresponding to the speech extractor may be weighted and summed to serve as the loss function of the third training stage. Next, parameters of the pre-trained speech characterization extractor, the pre-trained classifier, and the pre-trained speech extractor may be adjusted using the loss function obtained by the weighted summation, thereby implementing joint training of the speech enhancement model.
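The three-stage schedule described above can be outlined schematically as follows. The sketch assumes the module and loss sketches given earlier in this description, hypothetical data iterables (clean_batches, overlapped_batches), illustrative learning rates, and consistent tensor layouts across the helpers; it is an outline of the staged procedure, not the patent's training code.

```python
import torch
import torch.nn.functional as F

def train_three_stages(extractor, classifier, speech_extractor,
                       clean_batches, overlapped_batches,
                       si_sdr_loss, reconstruct, alpha: float = 0.5):
    """Schematic three-stage schedule: (1) pre-train the characterization extractor and
    classifier with cross entropy, (2) pre-train the speech extractor with the SI-SDR loss
    while the characterization extractor is frozen, (3) joint fine-tuning at a smaller lr."""
    # Stage 1: speaker-characterization pre-training on clean single-speaker data.
    opt1 = torch.optim.Adam(list(extractor.parameters()) + list(classifier.parameters()), lr=1e-3)
    for mel, label in clean_batches:
        loss = F.cross_entropy(classifier(extractor(mel)), label)
        opt1.zero_grad(); loss.backward(); opt1.step()

    # Stage 2: freeze the characterization extractor, pre-train the speech extractor.
    for p in extractor.parameters():
        p.requires_grad_(False)
    opt2 = torch.optim.Adam(speech_extractor.parameters(), lr=1e-3)
    for mel, noisy_mag, y_t, s_t, _ in overlapped_batches:
        with torch.no_grad():
            e = extractor(mel)
        loss = si_sdr_loss(reconstruct(y_t, speech_extractor(noisy_mag, e)), s_t)
        opt2.zero_grad(); loss.backward(); opt2.step()

    # Stage 3: unfreeze everything and jointly fine-tune with a smaller learning rate.
    for p in extractor.parameters():
        p.requires_grad_(True)
    params = (list(extractor.parameters()) + list(classifier.parameters())
              + list(speech_extractor.parameters()))
    opt3 = torch.optim.Adam(params, lr=1e-4)
    for mel, noisy_mag, y_t, s_t, label in overlapped_batches:
        e = extractor(mel)
        s_pred = reconstruct(y_t, speech_extractor(noisy_mag, e))
        loss = alpha * F.cross_entropy(classifier(e), label) + si_sdr_loss(s_pred, s_t)
        opt3.zero_grad(); loss.backward(); opt3.step()
```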
It should be noted that, compared with directly training the speech characterization extractor, the classifier, and the speech extractor together from scratch, separately pre-training each module and then jointly training them has two advantages:
1) the voice characterization extractor and the classifier can be trained by using more training data, so that the performance of the voice characterization extractor and the classifier is more complete;
2) by adopting a pre-training mode, the voice enhancement model can be trained more easily and can be converged more quickly.
Referring to fig. 3, fig. 3 is a flowchart illustrating a method of speech enhancement according to an exemplary embodiment of the present disclosure.
Referring to fig. 3, in step 301, a tested speech signal a(t) may be acquired. The tested speech signal a(t) may be the speech signal of one speaking object or the speech signals of a plurality of speaking objects, and may further contain environmental noise, such as the noise of a running air conditioner or the noise of a user typing on a keyboard.
It should be noted that the speech enhancement model of the present disclosure may be deployed on a terminal of a user, and the user needs to register a speech signal in advance and store the speech signal on the terminal of the user for use.
In step 302, a pre-registered speech signal of the target speaking object can be input into a speech characterization extractor trained according to the training method of the present disclosure to obtain the speech characterization e of the target speaking object.
In step 303, the speech characterization e of the target speaking object and the tested speech signal a(t) may be input into a speech extractor trained according to the training method of the present disclosure to obtain the speech signal s'(t) of the target speaking object extracted from the tested speech signal a(t).
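The test-time flow of steps 301-303 can be summarized in a short sketch. The helper callables (mel_features, magnitude_spectrum, reconstruct) are hypothetical placeholders for the feature-extraction and reconstruction steps described earlier; the function name enhance_for_target is likewise an assumption.

```python
import torch

def enhance_for_target(registered_wav: torch.Tensor, tested_wav: torch.Tensor,
                       extractor, speech_extractor,
                       mel_features, magnitude_spectrum, reconstruct) -> torch.Tensor:
    """Test-time flow of FIG. 3/FIG. 4: derive the target speaking object's characterization e
    from the pre-registered signal, then extract only that speaker from the tested signal."""
    with torch.no_grad():
        e = extractor(mel_features(registered_wav))                  # speech characterization e
        mask = speech_extractor(magnitude_spectrum(tested_wav), e)   # predicted magnitude-spectrum mask
        return reconstruct(tested_wav, mask)                         # predicted enhanced signal s'(t)
```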
Referring to FIG. 4, FIG. 4 is a schematic diagram illustrating a testing phase of a speech enhancement model according to an exemplary embodiment of the present disclosure. In fig. 4, the test procedure is shown on terminal 1 of Zhang San and terminal 2 of Li Si; in the test phase, both the classifier and the loss-function calculation module in fig. 2 are discarded.
When using the speech enhancement model of the present disclosure, Zhang San can register a segment of speech signal z1(t) in advance; feature extraction is performed on it to obtain the mel cepstrum features M(n, l)1 corresponding to the pre-registered speech signal z1(t). The mel cepstrum features M(n, l)1 corresponding to Zhang San's pre-registered speech signal z1(t) can then be used as the input of the speech characterization extractor to obtain Zhang San's speech characterization e1. When the speech enhancement model receives the tested speech a1(t), it can perform a short-time Fourier transform on it to obtain the complex-valued A1(n, k), which includes the magnitude spectrum |A1(n, k)| and the phase spectrum ∠A1(n, k). It should be noted that the tested speech a1(t) may contain at least one interfering voice. Zhang San's speech characterization e1 and the magnitude spectrum |A1(n, k)| of the tested speech a1(t) can be used as the input of the speech extractor; with the help of Zhang San's speech characterization e1, the predicted magnitude spectrum mask M1 of Zhang San's enhanced speech signal extracted from the tested speech a1(t) can be obtained.
Next, filtering can be performed with the predicted magnitude spectrum mask M1 of Zhang San's enhanced speech signal extracted from the tested speech a1(t). For example, the magnitude spectrum |A1(n, k)| of the tested speech a1(t) can be multiplied point-wise with the predicted magnitude spectrum mask M1 to obtain the predicted magnitude spectrum |A1'(n, k)| of Zhang San's enhanced speech signal. Finally, the predicted magnitude spectrum |A1'(n, k)| of Zhang San's enhanced speech signal can be combined with the phase spectrum ∠A1(n, k) of the tested speech a1(t) and an iSTFT can be performed to obtain the predicted enhanced speech signal s'1(t) of Zhang San. It should be noted that the predicted enhanced speech signal s'1(t) contains only Zhang San's voice; the interfering voices, environmental noise, and the like in the tested speech are filtered out.
When Li Si uses the speech enhancement model of the present disclosure, the process of extracting Li Si's potential speech signal from the tested speech is similar to the process of extracting Zhang San's potential speech signal from the tested speech, and is not repeated here. In fig. 4, Li Si's pre-registered speech signal segment is z2(t), the corresponding mel cepstrum features are M(n, l)2, Li Si's speech characterization is e2, the tested speech received by the speech enhancement model is a2(t), the magnitude spectrum obtained by the short-time Fourier transform is |A2(n, k)| and the phase spectrum is ∠A2(n, k), the predicted magnitude spectrum mask of Li Si's enhanced speech signal extracted from the tested speech a2(t) is M2, the predicted magnitude spectrum of Li Si's enhanced speech signal is |A2'(n, k)|, and the predicted enhanced speech signal of Li Si is s'2(t).
It should be noted that a speech enhancement system in the related art retains the speech signals of all speaking objects; the unfiltered interfering voices then interfere with the speech signal of the specified speaking object and reduce its listening quality and intelligibility. For example, suppose Wang Er joins a remote video conference and speaks at his workstation while the coworker Xiao Zhou at the neighboring workstation is discussing a problem with other colleagues; Wang Er's microphone simultaneously picks up Wang Er's voice, Xiao Zhou's voice, and environmental noise, such as the noise of a running air conditioner or keyboard typing. A speech enhancement system in the related art can only filter out the environmental noise, not the interfering voice. The listeners at the other end of the conference therefore hear both Wang Er's and Xiao Zhou's voices, although the far-end listeners really only want to hear Wang Er. Since Xiao Zhou's voice is also transmitted to the far end, the listening quality and intelligibility of Wang Er's speech for the far-end listeners are degraded.
During a video conference, the speech enhancement model of the present disclosure can filter out the picked-up environmental noise and interfering voices; the terminal of each participant can then distribute the clean speech, with environmental noise and interfering voices removed, to the terminals of the other participants through the server for playback. It should be noted that, during the video conference, when several participants all use the speech enhancement model on their respective terminals, the speech enhancement model on each participant's terminal performs the same speech extraction process. In this way, the speech enhancement model can extract the potential speech signal of a speaking object from the tested speech signal with the help of the pre-registered speech signal of that speaking object, avoiding interference from interfering voices, achieving speech enhancement for the specified speaking object, and improving the listening quality and intelligibility of the speech signal of the specified speaking object.
FIG. 5 is a block diagram illustrating a training apparatus of a speech enhancement model including a speech extractor, a speech characterization extractor, and a classifier according to an exemplary embodiment of the present disclosure.
Referring to fig. 5, the apparatus 500 for training a speech enhancement model may include an obtaining module 501, a first input module 502, a second input module 503, a third input module 504, a calculating module 505, and an adjusting module 506.
An obtaining module 501 configured to obtain training samples of a plurality of speaking subjects, where the training samples of each speaking subject include: the method comprises the steps that a first pure voice signal sample, a second pure voice signal sample, an overlapped voice noisy signal sample and an identification tag of a speaking object are obtained, wherein the overlapped voice noisy signal sample of the speaking object is obtained on the basis that the second pure voice signal sample of the speaking object is overlapped with voice signals of at least one other speaking object;
a first input module 502 configured to input a first clean speech signal sample of each speaking object into the speech characterization extractor, and obtain a speech characterization of the speaking object;
a second input module 503, configured to input the speech characterization of each speaking object into the classifier, and obtain an identification prediction result of the speaking object;
a third input module 504, configured to input the speech representation of each speaking object and the magnitude spectrum of the overlapped speech noisy signal sample into the speech extractor, obtain a predicted magnitude spectrum mask of the enhanced speech signal of the speaking object, and determine a predicted enhanced speech signal of the speaking object according to the predicted magnitude spectrum mask of the enhanced speech signal of the speaking object, where the enhanced speech signal is the speech signal of the speaking object extracted from the overlapped speech noisy signal sample of the speaking object;
a calculating module 505 configured to calculate a loss according to the enhanced speech signal, the second pure speech signal sample, the identification prediction result, and the identification tag corresponding to each speaking object;
an adjusting module 506 configured to adjust parameters of the speech extractor, the speech characterization extractor, and the classifier by the loss to train the speech enhancement model.
According to an exemplary embodiment of the present disclosure, the calculation module 505 is configured to:
calculating a first loss according to the identification prediction result and the identification label corresponding to each speaking object;
calculating a second loss according to the enhanced voice signal and a second pure voice signal sample corresponding to each speaking object;
and carrying out weighted summation on the first loss and the second loss to obtain the loss.
According to an exemplary embodiment of the present disclosure, the first loss is calculated by a cross entropy loss function.
According to an exemplary embodiment of the present disclosure, the second loss is calculated by a scale-invariant signal-to-distortion ratio loss function.
According to an exemplary embodiment of the present disclosure, the speech characterization extractor and the classifier are obtained through a first pre-training; the speech extractor is obtained via a second pre-training, wherein the first and second pre-training are independent of each other.
According to an exemplary embodiment of the present disclosure, the third input module 504 is configured to:
multiplying the amplitude spectrum of the overlapped voice noisy signal sample of the speaking object with the predicted amplitude spectrum mask of the enhanced voice signal of the speaking object to obtain the predicted amplitude spectrum of the enhanced voice signal of the speaking object;
and combining the predicted amplitude spectrum of the enhanced voice signal of the speaking object with the phase spectrum of the overlapped voice noisy signal sample and executing inverse short-time Fourier transform to obtain the predicted enhanced voice signal of the speaking object.
According to an exemplary embodiment of the present disclosure, the first input module 502 is configured to:
carrying out short-time Fourier transform on the first pure voice signal sample of each speaking object to obtain the magnitude spectrum of the first pure voice signal sample of the speaking object;
acquiring a Mel cepstrum characteristic corresponding to the amplitude spectrum of a first pure voice signal sample of the speaking object;
and inputting the Mel cepstrum characteristic corresponding to the magnitude spectrum of the first pure voice signal sample of the speaking object into the voice characterization extractor.
According to an exemplary embodiment of the disclosure, the first clean speech signal sample is different from the second clean speech signal sample.
Fig. 6 is a block diagram illustrating a speech enhancement apparatus according to an exemplary embodiment of the present disclosure.
Referring to fig. 6, the speech enhancement apparatus 600 may include an acquisition module 601, a first input module 602, and a second input module 603.
An obtaining module 601 configured to obtain a tested voice signal;
a first input module 602, configured to input a pre-registered voice signal of a target speaker into a voice characterization extractor trained according to the training apparatus of the present disclosure, so as to obtain a voice characterization of the target speaker;
a second input module 603 configured to input the speech representation of the target speaking object and the tested speech signal into a speech extractor trained according to the training apparatus of the present disclosure, and obtain a speech signal of the target speaking object extracted from the tested speech signal.
Fig. 7 is a block diagram illustrating an electronic device 700 according to an example embodiment of the present disclosure.
Referring to fig. 7, an electronic device 700 includes at least one memory 701 and at least one processor 702, the at least one memory 701 having instructions stored therein, which when executed by the at least one processor 702, perform a method of training a speech enhancement model or a method of speech enhancement according to an exemplary embodiment of the present disclosure.
By way of example, the electronic device 700 may be a PC, a tablet device, a personal digital assistant, a smartphone, or another device capable of executing the above instructions. The electronic device 700 need not be a single electronic device; it can also be any collection of devices or circuits capable of executing the above instructions (or instruction sets) individually or jointly. The electronic device 700 may also be part of an integrated control system or system manager, or may be configured as a portable electronic device that interfaces with local or remote devices (e.g., via wireless transmission).
In the electronic device 700, the processor 702 may include a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a programmable logic device, a special purpose processor system, a microcontroller, or a microprocessor. By way of example, and not limitation, processors may also include analog processors, digital processors, microprocessors, multi-core processors, processor arrays, network processors, and the like.
The processor 702 may execute instructions or code stored in the memory 701, wherein the memory 701 may also store data. The instructions and data may also be transmitted or received over a network via the network interface device, which may employ any known transmission protocol.
The memory 701 may be integrated with the processor 702, for example, with RAM or flash memory disposed within an integrated circuit microprocessor or the like. Further, memory 701 may comprise a stand-alone device, such as an external disk drive, storage array, or any other storage device usable by a database system. The memory 701 and the processor 702 may be operatively coupled, or may communicate with each other, such as through I/O ports, network connections, etc., to enable the processor 702 to read files stored in the memory.
In addition, the electronic device 700 may also include a video display (such as a liquid crystal display) and a user interaction interface (such as a keyboard, mouse, touch input device, etc.). All components of the electronic device 700 may be connected to each other via a bus and/or a network.
According to an exemplary embodiment of the present disclosure, there may also be provided a computer-readable storage medium, in which instructions, when executed by a processor of an electronic device, enable the electronic device to perform a training method of a speech enhancement model or a speech enhancement method according to the present disclosure.
Examples of the computer-readable storage medium herein include: read-only memory (ROM), programmable read-only memory (PROM), electrically erasable programmable read-only memory (EEPROM), random-access memory (RAM), dynamic random-access memory (DRAM), static random-access memory (SRAM), flash memory, non-volatile memory, CD-ROM, CD-R, CD+R, CD-RW, CD+RW, DVD-ROM, DVD-R, DVD+R, DVD-RW, DVD+RW, DVD-RAM, BD-ROM, BD-R, BD-R LTH, BD-RE, Blu-ray or optical disk storage, hard disk drive (HDD), solid-state drive (SSD), card-type memory (such as a multimedia card, a Secure Digital (SD) card, or an eXtreme Digital (XD) card), magnetic tape, floppy disk, magneto-optical data storage device, optical data storage device, hard disk, solid-state disk, and any other device configured to store a computer program and any associated data, data files, and data structures in a non-transitory manner and to provide them to a processor or computer so that the processor or computer can execute the computer program. The computer program in the computer-readable storage medium described above can run in an environment deployed in computer equipment such as a client, a host, a proxy device, or a server; furthermore, in one example, the computer program and any associated data, data files, and data structures are distributed across networked computer systems so that the computer program and any associated data, data files, and data structures are stored, accessed, and executed in a distributed fashion by one or more processors or computers.
According to an exemplary embodiment of the present disclosure, a computer program product may also be provided, comprising a computer program which, when being executed by a processor, carries out a method of training a speech enhancement model or a method of speech enhancement according to the present disclosure.
Fig. 8 is a block diagram illustrating a structure of an acoustic enclosure according to an exemplary embodiment of the present disclosure. As shown in fig. 8, an acoustic enclosure 800 according to an exemplary embodiment of the present disclosure includes: a speech enhancement device 600.
Fig. 9 is a block diagram illustrating a structure of an acoustic enclosure according to another exemplary embodiment of the present disclosure. As shown in fig. 9, an acoustic enclosure 900 according to an exemplary embodiment of the present disclosure includes: at least one memory 901 and at least one processor 902, the at least one memory 901 having stored therein a set of computer-executable instructions that, when executed by the at least one processor 902, perform a method of speech enhancement as described in the above exemplary embodiments.
As an example, the sound box 800 and the sound box 900 in the above exemplary embodiments may be understood as devices integrating a speaker and/or a microphone, for example, a smart sound box, a home sound box, a video conference device, or a teleconference device, and they may also be integrated into other devices. That is, it should be clear that any sound box that performs speech enhancement by using the speech enhancement method shown in the present disclosure falls within the scope of the present disclosure.
By way of example, the sound box 800 and the sound box 900 may also include other components for performing their functions as sound boxes, for example, but not limited to, at least one of the following: a signal acquisition unit and a signal processing unit, wherein the signal acquisition unit may acquire sound in the environment to form an audio signal, and the signal processing unit may process (for example, amplify) the audio signal acquired by the signal acquisition unit.
By way of example, the sound box 800 and the sound box 900 may be applied to, but are not limited to, at least one of the following scenarios: a video conference scenario, a home environment scenario, and an online teaching scenario. It should be understood that the present disclosure is not limited thereto and may also be applied to other suitable scenarios. In different usage scenarios, the composition and structure of the sound box 800 and the sound box 900 may differ, and it should be clear that any sound box that performs voice enhancement by using the voice enhancement method shown in the present disclosure falls within the intended scope of the present disclosure.
According to the training method of the speech enhancement model, the speech enhancement method, and the corresponding apparatus, electronic device, sound box, and storage medium of the present disclosure, because a large number of pure speech signal samples and overlapped noisy speech signal samples of speaking objects are used to train the speech enhancement model, the trained speech enhancement model can accurately extract the speech signal of a specified speaking object from the speech signals of a plurality of speaking objects, thereby avoiding interference of the sound signals of interfering persons with the speech signal of the specified speaking object and improving the listening quality and intelligibility of the speech signal of the specified speaking object. Moreover, with the help of the pre-registered speech signal of the target speaking object, the speech enhancement method can use the trained speech enhancement model to extract the potential speech signal of the target speaking object from the tested speech signal, so that interference of the voice signals of interfering persons with the speech signal of the specified speaking object is avoided, speech enhancement for the specified speaking object is achieved, and the listening quality and intelligibility of the speech signal of the specified speaking object are improved. Furthermore, the first pure speech signal sample and the second pure speech signal sample used for training may be set to speech signal samples with different contents, which helps ensure that, in an actual test scenario, the speech enhancement model extracts the potential speech signal of the specified speaking object from the tested speech signal as accurately as possible. Furthermore, in the training method of the speech enhancement model, the modules may first be pre-trained separately and then trained jointly, which ensures that the speech characterization extractor and the classifier can be trained with more training data and makes the performance of the speech enhancement model more complete; moreover, the pre-training approach makes the speech enhancement model easier to train and faster to converge.
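As a non-limiting illustration of the joint objective referred to above, the following sketch combines an identification loss and an enhancement loss by weighted summation; the concrete loss forms (cross-entropy and a spectral mean-squared error) and the weight `alpha` are assumptions of this sketch and are not mandated by the present disclosure.

```python
# Illustrative joint loss sketch (loss forms and weight are assumptions):
# first loss  = identification loss of the classifier,
# second loss = enhancement loss between the enhanced signal's magnitude
#               spectrum and the second pure voice signal sample's spectrum,
# total loss  = weighted summation of the two.
import torch
import torch.nn.functional as F

def joint_loss(enhanced_mag: torch.Tensor, clean_mag: torch.Tensor,
               id_logits: torch.Tensor, id_labels: torch.Tensor,
               alpha: float = 0.5) -> torch.Tensor:
    first_loss = F.cross_entropy(id_logits, id_labels)       # identification
    second_loss = F.mse_loss(enhanced_mag, clean_mag)         # enhancement
    return alpha * first_loss + (1.0 - alpha) * second_loss   # weighted sum
```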
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This disclosure is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (10)

1. A training method of a speech enhancement model, wherein the speech enhancement model comprises a speech extractor, a speech characterization extractor and a classifier, the training method comprising:
obtaining training samples of a plurality of speaking objects, wherein the training samples of each speaking object comprise a first pure voice signal sample, a second pure voice signal sample, an overlapped voice noisy signal sample, and an identification label of the speaking object, and wherein the overlapped voice noisy signal sample of the speaking object is obtained by superimposing voice signals of at least one other speaking object on the second pure voice signal sample of the speaking object;
inputting the first pure voice signal sample of each speaking object into the speech characterization extractor to obtain a speech characterization of the speaking object;
inputting the speech characterization of each speaking object into the classifier to obtain an identification prediction result of the speaking object;
inputting the speech characterization of each speaking object and a magnitude spectrum of the overlapped voice noisy signal sample into the speech extractor to obtain a predicted magnitude spectrum mask of an enhanced voice signal of the speaking object, and determining a predicted enhanced voice signal of the speaking object according to the predicted magnitude spectrum mask of the enhanced voice signal of the speaking object, wherein the enhanced voice signal is the voice signal of the speaking object extracted from the overlapped voice noisy signal sample of the speaking object;
calculating a loss according to the enhanced voice signal, the second pure voice signal sample, the identification prediction result, and the identification label corresponding to each speaking object; and
adjusting parameters of the speech extractor, the speech characterization extractor, and the classifier according to the loss, so as to train the speech enhancement model.
2. The training method of claim 1, wherein calculating the loss according to the enhanced voice signal, the second pure voice signal sample, the identification prediction result, and the identification label corresponding to each speaking object comprises:
calculating a first loss according to the identification prediction result and the identification label corresponding to each speaking object;
calculating a second loss according to the enhanced voice signal and the second pure voice signal sample corresponding to each speaking object;
and performing a weighted summation of the first loss and the second loss to obtain the loss.
3. A method of speech enhancement, the method comprising:
acquiring a tested voice signal;
inputting a pre-registered voice signal of a target speaking object into a speech characterization extractor trained by the training method according to any one of claims 1 to 2, to obtain a speech characterization of the target speaking object;
inputting the speech characterization of the target speaking object and the tested voice signal into a speech extractor trained by the training method according to any one of claims 1 to 2, to obtain the voice signal of the target speaking object extracted from the tested voice signal.
4. An apparatus for training a speech enhancement model, wherein the speech enhancement model comprises a speech extractor, a speech characterization extractor, and a classifier, the apparatus comprising:
an acquisition module configured to acquire training samples of a plurality of speaking objects, wherein the training samples of each speaking object include a first pure voice signal sample, a second pure voice signal sample, an overlapped voice noisy signal sample, and an identification label of the speaking object, and wherein the overlapped voice noisy signal sample of the speaking object is obtained by superimposing voice signals of at least one other speaking object on the second pure voice signal sample of the speaking object;
a first input module configured to input the first pure voice signal sample of each speaking object into the speech characterization extractor to obtain a speech characterization of the speaking object;
a second input module configured to input the speech characterization of each speaking object into the classifier to obtain an identification prediction result of the speaking object;
a third input module configured to input the speech characterization of each speaking object and a magnitude spectrum of the overlapped voice noisy signal sample into the speech extractor to obtain a predicted magnitude spectrum mask of an enhanced voice signal of the speaking object, and to determine a predicted enhanced voice signal of the speaking object according to the predicted magnitude spectrum mask of the enhanced voice signal of the speaking object, wherein the enhanced voice signal is the voice signal of the speaking object extracted from the overlapped voice noisy signal sample of the speaking object;
a calculation module configured to calculate a loss according to the enhanced voice signal, the second pure voice signal sample, the identification prediction result, and the identification label corresponding to each speaking object; and
an adjustment module configured to adjust parameters of the speech extractor, the speech characterization extractor, and the classifier according to the loss, so as to train the speech enhancement model.
5. A speech enhancement apparatus, characterized in that the speech enhancement apparatus comprises:
an acquisition module configured to acquire a tested voice signal;
a first input module configured to input a pre-registered voice signal of a target speaking object into the speech characterization extractor trained by the training apparatus of claim 4, to obtain a speech characterization of the target speaking object;
a second input module configured to input the speech characterization of the target speaking object and the tested voice signal into the speech extractor trained by the training apparatus of claim 4, to obtain the voice signal of the target speaking object extracted from the tested voice signal.
6. An electronic device, comprising:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the method of training a speech enhancement model according to any one of claims 1 to 2 or the method of speech enhancement according to claim 3.
7. A computer-readable storage medium, wherein instructions in the computer-readable storage medium, when executed by a processor of an electronic device, enable the electronic device to perform the method of training a speech enhancement model according to any one of claims 1 to 2 or the method of speech enhancement according to claim 3.
8. A computer program product comprising a computer program, characterized in that the computer program, when being executed by a processor, implements a method of training a speech enhancement model according to any one of claims 1 to 2 or a method of speech enhancement according to claim 3.
9. An acoustic enclosure, comprising:
the speech enhancement device of claim 5.
10. An acoustic enclosure, comprising:
at least one processor;
at least one memory storing computer-executable instructions;
wherein the computer-executable instructions, when executed by the at least one processor, cause the at least one processor to perform the speech enhancement method of claim 3.
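As a non-limiting illustration of the masking step recited in claims 1 and 4 above, the following sketch applies a predicted magnitude spectrum mask to the magnitude spectrum of an overlapped voice noisy signal and reconstructs a time-domain enhanced voice signal; the reuse of the noisy phase and the STFT parameters are assumptions of this sketch, not limitations of the claims.

```python
# Illustrative sketch of the masking step recited in claims 1 and 4
# (assumed details: the noisy phase is reused and the STFT parameters
# are chosen freely; the claims do not fix these choices).
import numpy as np
import librosa

def apply_predicted_mask(noisy_wav: np.ndarray, predicted_mask: np.ndarray,
                         n_fft: int = 512, hop_length: int = 160) -> np.ndarray:
    # Magnitude spectrum and phase of the overlapped voice noisy signal.
    noisy_spec = librosa.stft(noisy_wav, n_fft=n_fft, hop_length=hop_length)
    noisy_mag, noisy_phase = np.abs(noisy_spec), np.angle(noisy_spec)
    # Predicted enhanced magnitude spectrum = mask * noisy magnitude spectrum
    # (predicted_mask must have the same shape as noisy_mag).
    enhanced_mag = predicted_mask * noisy_mag
    # Predicted enhanced voice signal, reconstructed with the noisy phase.
    enhanced_spec = enhanced_mag * np.exp(1j * noisy_phase)
    return librosa.istft(enhanced_spec, hop_length=hop_length)
```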
CN202210435868.9A 2022-04-24 2022-04-24 Training method of voice enhancement model and voice enhancement method Pending CN114758668A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210435868.9A CN114758668A (en) 2022-04-24 2022-04-24 Training method of voice enhancement model and voice enhancement method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210435868.9A CN114758668A (en) 2022-04-24 2022-04-24 Training method of voice enhancement model and voice enhancement method

Publications (1)

Publication Number Publication Date
CN114758668A true CN114758668A (en) 2022-07-15

Family

ID=82333646

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210435868.9A Pending CN114758668A (en) 2022-04-24 2022-04-24 Training method of voice enhancement model and voice enhancement method

Country Status (1)

Country Link
CN (1) CN114758668A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116825123A (en) * 2023-06-19 2023-09-29 广东保伦电子股份有限公司 Tone quality optimization method and system based on audio push

Similar Documents

Publication Publication Date Title
CN112289333B (en) Training method and device of voice enhancement model and voice enhancement method and device
Eaton et al. Estimation of room acoustic parameters: The ACE challenge
Luo et al. Real-time single-channel dereverberation and separation with time-domain audio separation network.
US20210089967A1 (en) Data training in multi-sensor setups
Li et al. On the importance of power compression and phase estimation in monaural speech dereverberation
WO2021196905A1 (en) Voice signal dereverberation processing method and apparatus, computer device and storage medium
CN109767765A (en) Talk about art matching process and device, storage medium, computer equipment
Lin et al. Speech enhancement using multi-stage self-attentive temporal convolutional networks
CN107507625B (en) Sound source distance determining method and device
Shankar et al. Efficient two-microphone speech enhancement using basic recurrent neural network cell for hearing and hearing aids
CN112712816A (en) Training method and device of voice processing model and voice processing method and device
Lee et al. Speech dereverberation based on integrated deep and ensemble learning algorithm
Callens et al. Joint blind room acoustic characterization from speech and music signals using convolutional recurrent neural networks
CN113643714B (en) Audio processing method, device, storage medium and computer program
CN113470685B (en) Training method and device for voice enhancement model and voice enhancement method and device
CN114758668A (en) Training method of voice enhancement model and voice enhancement method
CN116648746A (en) Speaker-specific speech amplification
CN113241088B (en) Training method and device of voice enhancement model and voice enhancement method and device
Hepsiba et al. Enhancement of single channel speech quality and intelligibility in multiple noise conditions using wiener filter and deep CNN
CN113555031B (en) Training method and device of voice enhancement model, and voice enhancement method and device
Chi et al. Spectro-temporal modulation energy based mask for robust speaker identification
CN114283833A (en) Speech enhancement model training method, speech enhancement method, related device and medium
Chen et al. CITISEN: A Deep Learning-Based Speech Signal-Processing Mobile Application
CN113990327A (en) Method for training representation extraction model of speaking object and method for identifying identity of speaking object
Shankar et al. Real-time single-channel deep neural network-based speech enhancement on edge devices

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination