CN112133293A - Short speech sample compensation method based on a generative adversarial network, and storage medium - Google Patents

Short speech sample compensation method based on a generative adversarial network, and storage medium

Info

Publication number: CN112133293A
Application number: CN201911067181.9A
Authority: CN (China)
Prior art keywords: voice, generator, discriminator, compensation, real
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 胡章芳, 付亚芹
Current and original assignee: Chongqing University of Posts and Telecommunications
Application filed 2019-11-04 by Chongqing University of Posts and Telecommunications; priority to CN201911067181.9A
Publication of CN112133293A: 2020-12-25

Classifications

    • G PHYSICS / G10 MUSICAL INSTRUMENTS; ACOUSTICS / G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/063 Speech recognition: creation of reference templates; training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/16 Speech recognition: speech classification or search using artificial neural networks
    • G10L15/22 Speech recognition: procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L15/26 Speech recognition: speech to text systems
    • G10L25/24 Speech or voice analysis techniques in which the extracted parameters are the cepstrum
    • G10L25/30 Speech or voice analysis techniques using neural networks

Abstract

The invention seeks to protect a short speech sample compensation method and a storage medium based on a generative adversarial network. The method addresses the severe drop in recognition rate caused by insufficient corpus data under short speech conditions in a speaker recognition system. It assumes that the long speech distribution contains sufficient features for distinguishing speaker identity information, and extracts features capable of distinguishing speaker identity from the long speech as the condition input of the generator G and the discriminator D. The short speech is taken as the input of the generator G, which tries, with the aid of the condition information, to compensate the short speech into a sample close to the real long speech distribution, while the discriminator D tries to determine whether a given speech is a real long speech sample or a pseudo speech compensated by the generator. The invention completes the mapping from short speech samples to compensated speech samples and, while ensuring that the compensated speech has sufficient acoustic features, increases the universality and diversity of the training samples, thereby improving system robustness and reducing the equal error rate of speaker recognition.

Description

Short speech sample compensation method based on a generative adversarial network, and storage medium
Technical Field
The invention belongs to the field of speaker recognition, and particularly relates to a short speech sample compensation method based on a generative adversarial network.
Background
The Gaussian mixture model-universal background model (GMM-UBM) is a key method in speaker recognition, but it achieves a good recognition effect only when the speaker's speech is long. In a short speech environment the recognition performance drops drastically; in effect, a brief utterance contains insufficient acoustic features. In this case a speaker model based on statistical attributes does not describe the speaker well, and although such features have significant specificity, the model remains susceptible to noise interference because the features are too few. In the past few years deep learning has become very popular in the field of speaker recognition, and many methods use it to address the shortage of short speech samples. Intuitively, the strong feature learning capability of deep models should help solve the problem. However, training deep neural networks requires a large amount of data, and short speech contains little speaker identity information, which is one of the biggest obstacles to building speaker recognition systems with deep learning. Therefore, the invention provides a short speech sample compensation method and a storage medium based on a generative adversarial network, so that a speaker recognition system operating on the compensated short speech attains a higher recognition rate and better robustness.
Disclosure of Invention
The invention aims to solve the above problems in the prior art. It provides a short speech sample compensation method and a storage medium based on a generative adversarial network, which can effectively alleviate the severe drop in recognition rate caused by insufficient corpus data under short speech conditions in speaker recognition, while also addressing problems such as model collapse and unstable gradients during model training. The technical scheme of the invention is as follows:
A short speech sample compensation method based on a generative adversarial network, comprising the steps of:
S1, acquiring a voice signal through a microphone;
S2, sequentially preprocessing all the voice data acquired in step S1, including pre-emphasis, framing, windowing, fast Fourier transform, Mel filtering and discrete cosine transform; extracting the identity personality characteristic of the speaker's voice signal, namely the Mel-frequency cepstral coefficients (MFCC); and segmenting the voice signal to obtain short speech;
S3, constructing a generative adversarial network model composed of a generator model G and a discriminator model D, wherein the generator model G maps a random noise vector z to generated data G(z) that obeys the real data distribution P_data as closely as possible, and the discriminator model D determines whether an input sample is real data x or generated data G(z);
S4, constructing the optimization objective function V(D, G) of the generative adversarial network model and carrying out model training;
S5, constructing model-oriented learning tasks, namely a generator compensation performance metric training task and a discriminator feature label training task, wherein the generator compensation performance metric training task reduces the deviation between the compensated speech distribution and the real speech distribution, and the discriminator feature label training task improves the speaker-discrimination capability of the compensated speech.
Further, step S2 specifically includes:
S21: pre-emphasis, framing, windowing and fast Fourier transform are performed on all voice signals in sequence. The power spectrum is then calculated and passed through a triangular band-pass filter bank, and the filter outputs are converted to logarithmic form using the relationship between the Mel domain and linear frequency:

Mel(f) = 2595 log10(1 + f / 700)

The i-th dimensional feature component C_i of the MFCC feature parameters is finally obtained through the discrete cosine transform:

C_i = Σ_{m=1}^{M} S(m) cos(π i (m - 0.5) / M)

where S(m) is the logarithmic output of the m-th filter and M represents the number of filters, typically 20 to 28. The MFCC of the obtained speaker voice signal is taken as the identity personality characteristic.
S22: the speech signal is segmented to obtain short speech, forming long-speech and short-speech pairs.
Further, the generative adversarial network model constructed in step S3 is specifically:
S31: the generator G of the generative adversarial network model is a deep neural network. Short speech z is used as the input of the generator G, and a short speech sample passes through the generator G to yield a compensated speech sample G(z). The discriminator D is a deep neural network serving as a binary classifier: the compensated speech sample G(z) produced by the generator G and a real long speech sample x are alternately used as the input of the discriminator D under the same condition, and the discriminator D judges whether the given speech is a real long speech sample or was obtained through compensation by the generator;
S32: a conditional version of the generative adversarial network is used in the model; the conditional generative adversarial network (CGAN) is a conditional model formed by adding a condition extension to the GAN, so that the hidden layers of the generator G and the discriminator D introduce the speaker identity personality characteristic condition c, namely the Mel-frequency cepstral coefficients (MFCC), to better guide the mapping from short speech to compensated speech.
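One possible realisation of the conditional generator and discriminator of S31-S32 is sketched below in PyTorch. The fully connected layer sizes and the flattened feature and condition dimensions are assumptions made for illustration, not an architecture prescribed by the patent.

    # Hedged sketch of the CGAN in S3; all layer sizes are illustrative.
    import torch
    import torch.nn as nn

    FEAT_DIM, COND_DIM = 400, 20  # hypothetical flattened-MFCC / condition sizes

    class Generator(nn.Module):
        def __init__(self):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(FEAT_DIM + COND_DIM, 512), nn.ReLU(),
                nn.Linear(512, 512), nn.ReLU(),
                nn.Linear(512, FEAT_DIM))           # compensated sample G(z|c)

        def forward(self, z, c):                    # z: short speech, c: condition
            return self.net(torch.cat([z, c], dim=1))

    class Discriminator(nn.Module):
        def __init__(self):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(FEAT_DIM + COND_DIM, 512), nn.LeakyReLU(0.2),
                nn.Linear(512, 256), nn.LeakyReLU(0.2),
                nn.Linear(256, 1), nn.Sigmoid())    # P(input is real long speech)

        def forward(self, x, c):
            return self.net(torch.cat([x, c], dim=1))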
Further, step S4 constructs the optimization objective function V(D, G) of the generative adversarial network model and performs model training, specifically including:
S41: for the conditional version of the generative adversarial network, the objective function V(D, G) is optimized as follows:

min_G max_D V(D, G) = E_{x~P_data(x)}[log D(x|c)] + E_{z~P_z(z)}[log(1 - D(G(z|c)|c))]

where the term E_{x~P_data(x)}[log D(x|c)] reflects the probability with which the discriminator D, guided by the condition c, judges the real long speech data x to be real, and the term E_{z~P_z(z)}[log(1 - D(G(z|c)|c))] reflects the probability with which the discriminator D, given the same condition information, judges the compensated sample generated by the generator from the short speech z to be real data;
S42: during training, the generator G aims to compensate the short speech, under the guidance of the condition c, into speech that matches the real long speech distribution as closely as possible, while the discriminator D distinguishes the compensated speech of the generator G from the real long speech as well as it can; the generator G and the discriminator D thus form a dynamic game, and the discriminator D and the generator G are alternately optimized using the gradient descent method.
Further, the detailed steps of alternately optimizing the discriminator D and the generator G by the gradient descent method are as follows:
Step 1: select m samples {z^(1), z^(2), …, z^(m)} from the known short speech distribution P_z(z);
Step 2: select the corresponding real long speech data {x^(1), x^(2), …, x^(m)} from the training data;
Step 3: extract the condition information {c^(1), c^(2), …, c^(m)} from the real long speech;
Step 4: let the parameters of the discriminator D be θ_d; obtain the gradient of the following objective function with respect to the parameters, and add the gradient to θ_d during the update:

(1/m) Σ_{i=1}^{m} [log D(x^(i)|c^(i)) + log(1 - D(G(z^(i)|c^(i))|c^(i)))]

Step 5: let the parameters of the generator G be θ_g; obtain the gradient of the following objective function with respect to the parameters, and subtract the gradient from θ_g during the update:

(1/m) Σ_{i=1}^{m} log(1 - D(G(z^(i)|c^(i))|c^(i)))

The parameters of the generator G are then updated each time the parameters of the discriminator D are updated.
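A minimal training-loop sketch of steps 1 to 5, reusing the Generator and Discriminator sketched above; the optimizer choice, learning rates and the log-epsilon are illustrative assumptions.

    # Hedged sketch of the alternating updates; z, x, c are batches of m samples.
    import torch

    G, D = Generator(), Discriminator()
    opt_d = torch.optim.SGD(D.parameters(), lr=1e-3)  # ascends V, so negate below
    opt_g = torch.optim.SGD(G.parameters(), lr=1e-3)  # descends its objective
    eps = 1e-8                                        # numerical safety inside log

    def train_step(z, x, c):
        # Step 4: raise D's objective, i.e. minimise its negation.
        d_loss = -(torch.log(D(x, c) + eps)
                   + torch.log(1 - D(G(z, c).detach(), c) + eps)).mean()
        opt_d.zero_grad(); d_loss.backward(); opt_d.step()
        # Step 5: lower log(1 - D(G(z|c)|c)) with respect to G's parameters.
        g_loss = torch.log(1 - D(G(z, c), c) + eps).mean()
        opt_g.zero_grad(); g_loss.backward(); opt_g.step()
        return d_loss.item(), g_loss.item()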
Further, in step S5 a learning task is designed for the generator G and the discriminator D respectively to guide the compensation of the data during model training. The specific process is as follows:
S51: the generator compensation performance metric training task. The most direct way to measure the compensation performance of the generator G is to calculate the numerical difference between the compensated speech and the real speech. Assuming the N data are divided into groups, the degree of difference between the i-th group of compensated speech and the real long speech is measured by the mean squared error:

MSE = (1/N) Σ_i (observed_real,i - predicted_gan,i)^2

where observed_real,i represents the i-th group of data of the real speech samples and predicted_gan,i represents the i-th group of data of the speech samples compensated by the generative adversarial network multitask framework. With the goal of minimizing the MSE value, the objective function by which the generator G learns the difference between the compensated speech and the real long speech is as follows:

loss_G = E[(x - G(z|c))^2]

where E(·) computes the expected value and G(z|c) represents the compensated sample generated by the generator under the guidance of the condition c. The numerical difference function loss_G measures the compensation performance of the generator; the goal is to minimize this numerical difference function during training so that the compensation performance of the generator reaches its optimum;
S52: the discriminator feature label training task. This task improves the speaker-discrimination capability of the compensated speech: the MFCC features extracted from the real long speech represent the different speaker labels; after the compensated speech and the real long speech are input into the discriminator, whether a speech belongs to a given class feature label is predicted through feature distance measurement, and the cross entropy between the predicted feature label result and the real feature label is minimized.
Further, the cross-entropy objective function between the discriminator's predicted feature label result and the real feature label to be minimized is:

loss_label = - Σ_i (1/n_i) Σ_k p_i(k) log q_i(k)

where n_i represents the number of short speech segments intercepted from the i-th speech signal, p_i(k) is the empirical probability, observed by the discriminator on the basis of the facts, of the k-th class feature label to which the real long speech belongs, and q_i(k) is the prediction probability, calculated by the discriminator from the feature distance, of the k-th class feature label to which the compensated speech belongs. During training, the training of the discriminator is stabilized by continuously minimizing the cross-entropy loss over the feature labels to which the real speech and the compensated speech belong, so that the compensated speech carries more speaker identity features.
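The two auxiliary task losses of S5 might be computed as below. That the discriminator exposes an extra K-way classification head (label_logits) for the speaker feature labels is our assumption; the patent only specifies that the label is predicted through feature distance measurement.

    # Hedged sketch of the S5 multitask losses.
    import torch.nn.functional as F

    def compensation_loss(g_out, x_real):
        # S51: mean squared error between compensated and real long speech (loss_G).
        return F.mse_loss(g_out, x_real)

    def feature_label_loss(label_logits, speaker_ids):
        # S52: cross entropy between predicted and true speaker feature labels.
        return F.cross_entropy(label_logits, speaker_ids)

In training, these terms would be added to the adversarial objectives of the previous sketch, stabilising the discriminator and pulling the compensated distribution toward the real one.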
A storage medium having stored therein a computer program which, when read by a processor, performs any of the methods described above.
The invention has the following advantages and beneficial effects:
The invention provides a short speech sample compensation method based on a generative adversarial network, aimed at the severe drop in short speech recognition rate in speaker recognition systems. A conditional version of the generative adversarial network is used: on the premise that the long speech distribution contains sufficient features for distinguishing speaker identity information, features capable of distinguishing speaker identity are extracted from the long speech as the condition input of the generator G and the discriminator D. The short speech is taken as the input of the generator G, which tries, with the aid of the condition information, to compensate the short speech into a sample close to the real long speech distribution, while the discriminator D tries to determine whether a given speech is a real long speech sample or a pseudo speech compensated by the generator. The method completes the mapping from short speech samples to compensated speech samples and, while ensuring that the compensated speech has sufficient acoustic features, increases the universality and diversity of the training samples, thereby improving system robustness and reducing the equal error rate of speaker recognition.
By constructing the generative adversarial network model and completing the model training process, the method successfully maps short speech lacking identity personality characteristics into compensated speech with a stronger speaker-discrimination capability. While the compensated speech contains sufficient acoustic features, the universality and diversity of the training samples are increased, which effectively alleviates the severe drop in recognition rate caused by insufficient corpus data under short speech conditions in speaker recognition. To prevent problems such as model collapse and gradient instability during the training of the generative adversarial network model, the constructed model-oriented learning tasks, namely the generator compensation performance metric training task and the discriminator feature label training task, effectively stabilize the training process, reduce the deviation between the compensated speech distribution and the real speech distribution, and further improve the speaker-discrimination capability of the compensated speech.
Drawings
FIG. 1 is a flow chart of speaker recognition based on short speech compensation with a generative adversarial network according to the preferred embodiment of the present invention;
FIG. 2 is a structural diagram of the improved generative adversarial compensation model proposed by the invention;
FIG. 3 is a flow chart of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and in detail with reference to the accompanying drawings. The described embodiments are only some of the embodiments of the present invention.
As shown in FIGS. 1 to 3, the technical solution of the invention for solving the above technical problems is as follows:
S1, acquiring a voice signal through a microphone;
S2, preprocessing all the voice data, extracting the identity personality characteristic of the speaker's voice signal, namely the Mel-frequency cepstral coefficients (MFCC), and segmenting the voice signal to obtain short speech.
The pre-emphasis stage can be seen as a high-pass filter corresponding to the following equation, where a is the pre-emphasis coefficient (typically in the interval [0.95, 0.97]):

H(z) = 1 - a z^(-1)

A Hamming window ω(k), shown below, is used to smooth the frame edges, with k = 0, 1, …, K - 1 and K the frame length:

ω(k) = 0.54 - 0.46 cos(2πk / (K - 1))

In speech processing, the Mel-frequency cepstrum represents the short-time power spectrum of speech; it is based on a cosine transform of the log power spectrum on a non-linear Mel frequency scale. The relationship between Mel frequency and linear frequency is:

Mel(f) = 2595 log10(1 + f / 700)

The Mel filter bank is a set of triangular band-pass filters whose transfer function is given below, where M represents the number of filters (generally 20 to 28), 0 ≤ m ≤ M, and the function f(·) gives the center frequencies of the Mel band-pass filter bank:

H_m(k) = 0                                      for k < f(m - 1)
H_m(k) = (k - f(m - 1)) / (f(m) - f(m - 1))     for f(m - 1) ≤ k ≤ f(m)
H_m(k) = (f(m + 1) - k) / (f(m + 1) - f(m))     for f(m) ≤ k ≤ f(m + 1)
H_m(k) = 0                                      for k > f(m + 1)

The i-th dimensional feature component C_i of the MFCC feature parameters is obtained through the discrete cosine transform:

C_i = Σ_{m=1}^{M} S(m) cos(π i (m - 0.5) / M)

where S(m) is the logarithmic output of the m-th filter.
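For illustration, the per-step pipeline above can be written out from scratch roughly as follows; the frame length, FFT size, filter count and coefficient values are illustrative assumptions, not values fixed by the patent.

    # Hedged numpy/scipy sketch of pre-emphasis, Hamming windowing, power
    # spectrum, triangular Mel filter bank, log compression and DCT.
    import numpy as np
    from scipy.fftpack import dct

    def mel(f):                      # Mel(f) = 2595 * log10(1 + f / 700)
        return 2595.0 * np.log10(1.0 + f / 700.0)

    def inv_mel(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    def mfcc_frame(frame, sr=16000, n_fft=512, M=24, n_ceps=13, a=0.97):
        frame = np.append(frame[0], frame[1:] - a * frame[:-1])  # H(z) = 1 - a z^-1
        frame = frame * np.hamming(len(frame))                   # 0.54 - 0.46 cos(.)
        power = np.abs(np.fft.rfft(frame, n_fft)) ** 2 / n_fft   # power spectrum
        # Centre frequencies f(0..M+1), equally spaced on the Mel scale.
        pts = inv_mel(np.linspace(mel(0.0), mel(sr / 2.0), M + 2))
        bins = np.floor((n_fft + 1) * pts / sr).astype(int)
        fbank = np.zeros((M, n_fft // 2 + 1))
        for m in range(1, M + 1):                                # triangular H_m(k)
            left, centre, right = bins[m - 1], bins[m], bins[m + 1]
            fbank[m - 1, left:centre] = (np.arange(left, centre) - left) / max(centre - left, 1)
            fbank[m - 1, centre:right] = (right - np.arange(centre, right)) / max(right - centre, 1)
        log_energy = np.log(fbank @ power + 1e-10)               # S(m): log filter outputs
        return dct(log_energy, type=2, norm='ortho')[:n_ceps]    # C_i via the DCT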
S3, constructing the generative adversarial network model, which comprises a generator network and a discriminator network;
S4, constructing the optimization objective function V(D, G) of the model, where the optimization proceeds as follows:

min_G max_D V(D, G) = E_{x~P_data(x)}[log D(x|c)] + E_{z~P_z(z)}[log(1 - D(G(z|c)|c))]

where the term E_{x~P_data(x)}[log D(x|c)] reflects the probability with which the discriminator D, guided by the condition c, judges the real long speech data x to be real, and the term E_{z~P_z(z)}[log(1 - D(G(z|c)|c))] reflects the probability with which the discriminator D, given the same condition information, judges the compensated sample generated by the generator from the short speech z to be real data.
S5, training the model. In actual training, the discriminator D and the generator G are alternately optimized using the gradient descent method; the detailed steps are as follows:
Step 1: select m samples {z^(1), z^(2), …, z^(m)} from the known short speech distribution P_z(z).
Step 2: select the corresponding real long speech data {x^(1), x^(2), …, x^(m)} from the training data.
Step 3: extract the condition information {c^(1), c^(2), …, c^(m)} from the real long speech.
Step 4: let the parameters of the discriminator D be θ_d; obtain the gradient of the following objective function with respect to the parameters, and add the gradient to θ_d during the update:

(1/m) Σ_{i=1}^{m} [log D(x^(i)|c^(i)) + log(1 - D(G(z^(i)|c^(i))|c^(i)))]

Step 5: let the parameters of the generator G be θ_g; obtain the gradient of the following objective function with respect to the parameters, and subtract the gradient from θ_g during the update:

(1/m) Σ_{i=1}^{m} log(1 - D(G(z^(i)|c^(i))|c^(i)))

The parameters of the generator G are then updated each time the parameters of the discriminator D are updated.
S6, constructing the generator compensation performance metric training task. Since the gradient sometimes vanishes during training, it is more appropriate to have the generative adversarial network learn the difference between the compensated speech and the real speech; the objective function by which the generator G learns the difference between the compensated speech and the real long speech is:

loss_G = E[(x - G(z|c))^2]

The numerical difference function loss_G measures the compensation performance of the generator; the goal is to minimize this numerical difference function during training so as to optimize the compensation performance of the generator.
S7, constructing the discriminator feature label training task. Each MFCC feature extracted from the real long speech is represented by a different speaker label; after the compensated speech and the real long speech are input into the discriminator, whether a speech belongs to a given class feature label is predicted through feature distance measurement, and the cross entropy between the predicted feature label result and the real feature label is minimized. The cross-entropy objective function to be minimized is:

loss_label = - Σ_i (1/n_i) Σ_k p_i(k) log q_i(k)

where n_i represents the number of short speech segments intercepted from the i-th speech signal, p_i(k) is the empirical probability, observed by the discriminator on the basis of the facts, of the k-th class feature label to which the real long speech belongs, and q_i(k) is the prediction probability, calculated by the discriminator from the feature distance, of the k-th class feature label to which the compensated speech belongs. During training, the training of the discriminator is stabilized by continuously minimizing the cross-entropy loss over the feature labels to which the real speech and the compensated speech belong, so that the compensated speech carries more speaker identity features and the equal error rate of short speech in a speaker recognition system is reduced.
S8, the short speech sample compensation method based on the generative adversarial network is evaluated on a speaker recognition system based on the Gaussian mixture model-universal background model; experimental results show that the method effectively reduces the equal error rate of the speaker recognition system in a short speech environment.
The above examples are to be construed as merely illustrative and not limiting of the remainder of the disclosure. After reading the description of the invention, a person skilled in the art can make various changes or modifications to the invention, and these equivalent changes and modifications likewise fall within the scope of the invention defined by the claims.

Claims (8)

1. A short speech sample compensation method based on a generative adversarial network, comprising the steps of:
S1, acquiring a voice signal through a microphone;
S2, sequentially performing pre-emphasis, framing, windowing, fast Fourier transform, Mel filtering and discrete cosine transform on all the voice data acquired in step S1, extracting the identity personality characteristic of the speaker's voice signal, namely the Mel-frequency cepstral coefficients (MFCC), and segmenting the voice signal to obtain short speech;
S3, constructing a generative adversarial network model composed of a generator model G and a discriminator model D, wherein the generator model G maps a random noise vector z to generated data G(z) that obeys the real data distribution P_data as closely as possible, and the discriminator model D determines whether an input sample is real data x or generated data G(z);
S4, constructing the optimization objective function V(D, G) of the generative adversarial network model and carrying out model training;
S5, constructing model-oriented learning tasks, namely a generator compensation performance metric training task and a discriminator feature label training task, wherein the generator compensation performance metric training task reduces the deviation between the compensated speech distribution and the real speech distribution, and the discriminator feature label training task improves the speaker-discrimination capability of the compensated speech.
2. The short speech sample compensation method based on a generative adversarial network according to claim 1, wherein step S2 comprises the following steps:
S21: pre-emphasis, framing, windowing and fast Fourier transform are performed on all voice signals in sequence; the power spectrum is then calculated and passed through a triangular band-pass filter bank, and the filter outputs are converted to logarithmic form using the relationship between the Mel domain and linear frequency:

Mel(f) = 2595 log10(1 + f / 700)

the i-th dimensional feature component C_i of the MFCC feature parameters is finally obtained through the discrete cosine transform:

C_i = Σ_{m=1}^{M} S(m) cos(π i (m - 0.5) / M)

where S(m) is the logarithmic output of the m-th filter and M represents the number of filters, typically 20 to 28; the MFCC of the obtained speaker voice signal is taken as the identity personality characteristic;
S22: the speech signal is segmented to obtain short speech, forming long-speech and short-speech pairs.
3. The short speech sample compensation method based on a generative adversarial network according to claim 2, wherein the generative adversarial network model constructed in step S3 is specifically:
S31: the generator G of the generative adversarial network model is a deep neural network; short speech z is used as the input of the generator G, and a short speech sample passes through the generator G to yield a compensated speech sample G(z); the discriminator D is a deep neural network serving as a binary classifier: the compensated speech sample G(z) produced by the generator G and a real long speech sample x are alternately used as the input of the discriminator D under the same condition, and the discriminator D judges whether the given speech is a real long speech sample or was obtained through compensation by the generator;
S32: a conditional version of the generative adversarial network is used in the model; the conditional generative adversarial network (CGAN) is a conditional model formed by adding a condition extension to the GAN, so that the hidden layers of the generator G and the discriminator D introduce the speaker identity personality characteristic condition c, namely the Mel-frequency cepstral coefficients (MFCC), to better guide the mapping from short speech to compensated speech.
4. The short speech sample compensation method based on a generative adversarial network according to claim 3, wherein step S4 constructs the optimization objective function V(D, G) of the generative adversarial network model and performs model training, specifically including:
S41: for the conditional version of the generative adversarial network, the objective function V(D, G) is optimized as follows:

min_G max_D V(D, G) = E_{x~P_data(x)}[log D(x|c)] + E_{z~P_z(z)}[log(1 - D(G(z|c)|c))]

where the term E_{x~P_data(x)}[log D(x|c)] reflects the probability with which the discriminator D, guided by the condition c, judges the real long speech data x to be real, and the term E_{z~P_z(z)}[log(1 - D(G(z|c)|c))] reflects the probability with which the discriminator D, given the same condition information, judges the compensated sample generated by the generator from the short speech z to be real data;
S42: during training, the generator G aims to compensate the short speech, under the guidance of the condition c, into speech that matches the real long speech distribution as closely as possible, while the discriminator D distinguishes the compensated speech of the generator G from the real long speech as well as it can; the generator G and the discriminator D thus form a dynamic game, and the discriminator D and the generator G are alternately optimized using the gradient descent method.
5. The short speech sample compensation method based on a generative adversarial network according to claim 4, wherein the detailed steps of alternately optimizing the discriminator D and the generator G by the gradient descent method are as follows:
Step 1: select m samples {z^(1), z^(2), …, z^(m)} from the known short speech distribution P_z(z);
Step 2: select the corresponding real long speech data {x^(1), x^(2), …, x^(m)} from the training data;
Step 3: extract the condition information {c^(1), c^(2), …, c^(m)} from the real long speech;
Step 4: let the parameters of the discriminator D be θ_d; obtain the gradient of the following objective function with respect to the parameters, and add the gradient to θ_d during the update:

(1/m) Σ_{i=1}^{m} [log D(x^(i)|c^(i)) + log(1 - D(G(z^(i)|c^(i))|c^(i)))]

where m represents the number of samples;
Step 5: let the parameters of the generator G be θ_g; obtain the gradient of the following objective function with respect to the parameters, and subtract the gradient from θ_g during the update:

(1/m) Σ_{i=1}^{m} log(1 - D(G(z^(i)|c^(i))|c^(i)))

The parameters of the generator G are then updated each time the parameters of the discriminator D are updated.
6. The short speech sample compensation method based on a generative adversarial network according to claim 5, wherein in step S5 a learning task is designed for the generator G and the discriminator D respectively to guide the compensation of the data during model training; the specific process is as follows:
S51: the generator compensation performance metric training task. The most direct way to measure the compensation performance of the generator G is to calculate the numerical difference between the compensated speech and the real speech; assuming the N data are divided into groups, the degree of difference between the i-th group of compensated speech and the real long speech is measured by the mean squared error:

MSE = (1/N) Σ_i (observed_real,i - predicted_gan,i)^2

where observed_real,i represents the i-th group of data of the real speech samples and predicted_gan,i represents the i-th group of data of the speech samples compensated by the generative adversarial network multitask framework; with the goal of minimizing the MSE value, the objective function by which the generator G learns the difference between the compensated speech and the real long speech is as follows:

loss_G = E[(x - G(z|c))^2]

where E(·) computes the expected value and G(z|c) represents the compensated sample generated by the generator under the guidance of the condition c; the numerical difference function loss_G measures the compensation performance of the generator, and the goal is to minimize this numerical difference function during training so that the compensation performance of the generator reaches its optimum;
S52: the discriminator feature label training task: this task improves the speaker-discrimination capability of the compensated speech; the MFCC features extracted from the real long speech represent the different speaker labels; after the compensated speech and the real long speech are input into the discriminator, whether a speech belongs to a given class feature label is predicted through feature distance measurement, and the cross entropy between the predicted feature label result and the real feature label is minimized.
7. The short speech sample compensation method based on a generative adversarial network according to claim 6, wherein the cross-entropy objective function between the discriminator's predicted feature label result and the real feature label to be minimized is:

loss_label = - Σ_i (1/n_i) Σ_k p_i(k) log q_i(k)

where n_i represents the number of short speech segments intercepted from the i-th speech signal, p_i(k) is the empirical probability, observed by the discriminator on the basis of the facts, of the k-th class feature label to which the real long speech belongs, and q_i(k) is the prediction probability, calculated by the discriminator from the feature distance, of the k-th class feature label to which the compensated speech belongs; during training, the training of the discriminator is stabilized by continuously minimizing the cross-entropy loss over the feature labels to which the real speech and the compensated speech belong, so that the compensated speech carries more speaker identity features.
8. A storage medium having a computer program stored therein, wherein the computer program, when read by a processor, performs the method of any of claims 1 to 7.
CN201911067181.9A 2019-11-04 2019-11-04 Short speech sample compensation method based on a generative adversarial network, and storage medium Pending CN112133293A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911067181.9A CN112133293A (en) 2019-11-04 2019-11-04 Short speech sample compensation method based on a generative adversarial network, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911067181.9A CN112133293A (en) 2019-11-04 2019-11-04 Short speech sample compensation method based on a generative adversarial network, and storage medium

Publications (1)

Publication Number Publication Date
CN112133293A (en) 2020-12-25

Family

ID=73849548

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911067181.9A Pending CN112133293A (en) 2019-11-04 2019-11-04 Phrase voice sample compensation method based on generation countermeasure network and storage medium

Country Status (1)

Country Link
CN (1) CN112133293A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113488069A (en) * 2021-07-06 2021-10-08 浙江工业大学 Method and device for quickly extracting high-dimensional voice features based on generative countermeasure network
CN113553972A (en) * 2021-07-29 2021-10-26 青岛农业大学 Apple disease diagnosis method based on deep learning

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10152970B1 (en) * 2018-02-08 2018-12-11 Capital One Services, Llc Adversarial learning and generation of dialogue responses
CN108597496A (en) * 2018-05-07 2018-09-28 广州势必可赢网络科技有限公司 A kind of speech production method and device for fighting network based on production
CN108806708A (en) * 2018-06-13 2018-11-13 中国电子科技集团公司第三研究所 Voice de-noising method based on Computational auditory scene analysis and generation confrontation network model
CN108922518A (en) * 2018-07-18 2018-11-30 苏州思必驰信息科技有限公司 voice data amplification method and system
CN109346087A (en) * 2018-09-17 2019-02-15 平安科技(深圳)有限公司 Fight the method for identifying speaker and device of the noise robustness of the bottleneck characteristic of network
CN109147810A (en) * 2018-09-30 2019-01-04 百度在线网络技术(北京)有限公司 Establish the method, apparatus, equipment and computer storage medium of speech enhan-cement network

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
"Conditional generative adversarial networks for speech enhancement and noise-robust speaker verification", 《ARXIV PREPRINT ARXIV:1709.01703》 *
ZHANG J: "I-vector transformation using conditional generative adversarial networks for short utterance speaker verification", 《ARXIV PREPRINT ARXIV:1804.00290》 *
付亚芹: "Research and implementation of a speaker recognition method based on short speech" (基于短语音的说话人识别方法研究与实现), 《China Masters' Theses Full-text Database, Information Science and Technology》 *
刘海东: "Marking suspicious regions in breast cancer pathology images based on generative adversarial networks" (基于生成对抗网络的乳腺癌病理图像可疑区域标记), 《E-Science Technology & Application》 *
樊云云: "Research on deep learning methods for speaker recognition" (面向说话人识别的深度学习方法研究), 《China Masters' Theses Full-text Database, Information Science and Technology》 *
赵力: "Speech Signal Processing" (《语音信号处理》), China Machine Press, 30 April 2003 *

Similar Documents

Publication Publication Date Title
CN109817246B (en) Emotion recognition model training method, emotion recognition device, emotion recognition equipment and storage medium
US9536547B2 (en) Speaker change detection device and speaker change detection method
Yu et al. Active learning and semi-supervised learning for speech recognition: A unified framework using the global entropy reduction maximization criterion
US7684986B2 (en) Method, medium, and apparatus recognizing speech considering similarity between the lengths of phonemes
EP1515305B1 (en) Noise adaption for speech recognition
Song et al. Noise invariant frame selection: a simple method to address the background noise problem for text-independent speaker verification
Tong et al. A comparative study of robustness of deep learning approaches for VAD
Cui et al. Noise robust speech recognition using feature compensation based on polynomial regression of utterance SNR
CN108520752B (en) Voiceprint recognition method and device
EP3989217B1 (en) Method for detecting an audio adversarial attack with respect to a voice input processed by an automatic speech recognition system, corresponding device, computer program product and computer-readable carrier medium
TW201419270A (en) Method and apparatus for utterance verification
Renevey et al. Robust speech recognition using missing feature theory and vector quantization.
CN109192200A (en) A kind of audio recognition method
Li et al. SNR-invariant PLDA modeling in nonparametric subspace for robust speaker verification
CN110047504B (en) Speaker identification method under identity vector x-vector linear transformation
Perero-Codosero et al. X-vector anonymization using autoencoders and adversarial training for preserving speech privacy
CN112133293A (en) Short speech sample compensation method based on a generative adversarial network, and storage medium
Li et al. Oriental language recognition (OLR) 2020: Summary and analysis
Venkateswarlu et al. Novel approach for speech recognition by using self—organized maps
Chaudhari et al. Multigrained modeling with pattern specific maximum likelihood transformations for text-independent speaker recognition
Saeidi et al. Particle swarm optimization for sorted adapted gaussian mixture models
Singh Support vector machine based approaches for real time automatic speaker recognition system
Ghaemmaghami et al. Speakers in the wild (SITW): The QUT speaker recognition system
Khetri et al. Automatic speech recognition for marathi isolated words
Dat et al. Robust speaker verification using low-rank recovery under total variability space

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 2020-12-25)