CN112133293A - Short speech sample compensation method based on a generative adversarial network, and storage medium - Google Patents

Short speech sample compensation method based on a generative adversarial network, and storage medium

Info

Publication number: CN112133293A
Application number: CN201911067181.9A
Authority: CN (China)
Prior art keywords: voice, generator, discriminator, compensation, real
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 胡章芳, 付亚芹
Current and original assignee: Chongqing University of Posts and Telecommunications
Application filed 2019-11-04 by Chongqing University of Posts and Telecommunications; priority to CN201911067181.9A
Publication of CN112133293A: 2020-12-25

Classifications

    • G PHYSICS / G10 MUSICAL INSTRUMENTS; ACOUSTICS / G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/063 Speech recognition: creation of reference templates; training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/16 Speech recognition: speech classification or search using artificial neural networks
    • G10L15/22 Speech recognition: procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L15/26 Speech recognition: speech to text systems
    • G10L25/24 Speech or voice analysis techniques in which the extracted parameters are the cepstrum
    • G10L25/30 Speech or voice analysis techniques using neural networks

Abstract

The invention seeks to protect a short speech sample compensation method and a storage medium based on a generative adversarial network. The method addresses the severe drop in recognition rate caused by insufficient corpus data under short speech conditions in a speaker recognition system. It assumes that the long speech distribution contains sufficient features for distinguishing speaker identity information, and extracts features capable of distinguishing speaker identity from the long speech as the condition input of the generator G and the discriminator D. The short speech is taken as the input of the generator G, which tries, with the aid of the condition information, to compensate the short speech into a sample close to the real long speech distribution, while the discriminator D tries to determine whether a given speech is a real long speech sample or a pseudo speech compensated by the generator. The invention completes the mapping from short speech samples to compensated speech samples and, while ensuring that the compensated speech has sufficient acoustic features, increases the universality and diversity of the training samples, thereby improving system robustness and reducing the equal error rate of speaker recognition.

Description

Short speech sample compensation method based on a generative adversarial network, and storage medium
Technical Field
The invention belongs to the field of speaker recognition, and particularly relates to a short speech sample compensation method based on a generative adversarial network.
Background
The Gaussian mixture model-universal background model (GMM-UBM) is a key method in speaker recognition, but it achieves a good recognition effect only when the speaker's speech is long. In a short speech environment the recognition performance drops drastically; in effect, a brief utterance contains insufficient acoustic features. In this case a speaker model based on statistical attributes does not describe the speaker well, and although such features have significant specificity, the model remains susceptible to noise interference because the features are too few. In the past few years deep learning has become very popular in the field of speaker recognition, and many methods use it to address the shortage of short speech samples. Intuitively, the strong feature learning capability of deep models should help solve the problem. However, training deep neural networks requires a large amount of data, and short speech contains little speaker identity information, which is one of the biggest obstacles to building speaker recognition systems with deep learning. Therefore, the invention provides a short speech sample compensation method and a storage medium based on a generative adversarial network, so that a speaker recognition system operating on the compensated short speech attains a higher recognition rate and better robustness.
Disclosure of Invention
The invention aims to solve the above problems in the prior art. It provides a short speech sample compensation method and a storage medium based on a generative adversarial network, which can effectively alleviate the severe drop in recognition rate caused by insufficient corpus data under short speech conditions in speaker recognition, while also addressing problems such as model collapse and unstable gradients during model training. The technical scheme of the invention is as follows:
A short speech sample compensation method based on a generative adversarial network, comprising the steps of:
S1, acquiring a voice signal through a microphone;
S2, sequentially preprocessing all the voice data acquired in step S1, including pre-emphasis, framing, windowing, fast Fourier transform, Mel filtering and discrete cosine transform; extracting the identity personality characteristic of the speaker's voice signal, namely the Mel-frequency cepstral coefficients (MFCC); and segmenting the voice signal to obtain short speech;
S3, constructing a generative adversarial network model composed of a generator model G and a discriminator model D, wherein the generator model G maps a random noise vector z to generated data G(z) that obeys the real data distribution P_data as closely as possible, and the discriminator model D determines whether an input sample is real data x or generated data G(z);
S4, constructing the optimization objective function V(D, G) of the generative adversarial network model and carrying out model training;
S5, constructing model-oriented learning tasks, namely a generator compensation performance metric training task and a discriminator feature label training task, wherein the generator compensation performance metric training task reduces the deviation between the compensated speech distribution and the real speech distribution, and the discriminator feature label training task improves the speaker-discrimination capability of the compensated speech.
Further, step S2 specifically includes:
S21: pre-emphasis, framing, windowing and fast Fourier transform are performed on all voice signals in sequence. The power spectrum is then calculated and passed through a triangular band-pass filter bank, and the filter outputs are converted to logarithmic form using the relationship between the Mel domain and linear frequency:

Mel(f) = 2595 log10(1 + f / 700)

The i-th dimensional feature component C_i of the MFCC feature parameters is finally obtained through the discrete cosine transform:

C_i = Σ_{m=1}^{M} S(m) cos(π i (m - 0.5) / M)

where S(m) is the logarithmic output of the m-th filter and M represents the number of filters, typically 20 to 28. The MFCC of the obtained speaker voice signal is taken as the identity personality characteristic.
S22: the speech signal is segmented to obtain short speech, forming long-speech and short-speech pairs.
Further, the generative adversarial network model constructed in step S3 is specifically:
S31: the generator G of the generative adversarial network model is a deep neural network. Short speech z is used as the input of the generator G, and a short speech sample passes through the generator G to yield a compensated speech sample G(z). The discriminator D is a deep neural network serving as a binary classifier: the compensated speech sample G(z) produced by the generator G and a real long speech sample x are alternately used as the input of the discriminator D under the same condition, and the discriminator D judges whether the given speech is a real long speech sample or was obtained through compensation by the generator;
S32: a conditional version of the generative adversarial network is used in the model; the conditional generative adversarial network (CGAN) is a conditional model formed by adding a condition extension to the GAN, so that the hidden layers of the generator G and the discriminator D introduce the speaker identity personality characteristic condition c, namely the Mel-frequency cepstral coefficients (MFCC), to better guide the mapping from short speech to compensated speech.
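One possible realisation of the conditional generator and discriminator of S31-S32 is sketched below in PyTorch. The fully connected layer sizes and the flattened feature and condition dimensions are assumptions made for illustration, not an architecture prescribed by the patent.

    # Hedged sketch of the CGAN in S3; all layer sizes are illustrative.
    import torch
    import torch.nn as nn

    FEAT_DIM, COND_DIM = 400, 20  # hypothetical flattened-MFCC / condition sizes

    class Generator(nn.Module):
        def __init__(self):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(FEAT_DIM + COND_DIM, 512), nn.ReLU(),
                nn.Linear(512, 512), nn.ReLU(),
                nn.Linear(512, FEAT_DIM))           # compensated sample G(z|c)

        def forward(self, z, c):                    # z: short speech, c: condition
            return self.net(torch.cat([z, c], dim=1))

    class Discriminator(nn.Module):
        def __init__(self):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(FEAT_DIM + COND_DIM, 512), nn.LeakyReLU(0.2),
                nn.Linear(512, 256), nn.LeakyReLU(0.2),
                nn.Linear(256, 1), nn.Sigmoid())    # P(input is real long speech)

        def forward(self, x, c):
            return self.net(torch.cat([x, c], dim=1))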
Further, step S4 constructs the optimization objective function V(D, G) of the generative adversarial network model and performs model training, specifically including:
S41: for the conditional version of the generative adversarial network, the objective function V(D, G) is optimized as follows:

min_G max_D V(D, G) = E_{x~P_data(x)}[log D(x|c)] + E_{z~P_z(z)}[log(1 - D(G(z|c)|c))]

where the term E_{x~P_data(x)}[log D(x|c)] reflects the probability with which the discriminator D, guided by the condition c, judges the real long speech data x to be real, and the term E_{z~P_z(z)}[log(1 - D(G(z|c)|c))] reflects the probability with which the discriminator D, given the same condition information, judges the compensated sample generated by the generator from the short speech z to be real data;
S42: during training, the generator G aims to compensate the short speech, under the guidance of the condition c, into speech that matches the real long speech distribution as closely as possible, while the discriminator D distinguishes the compensated speech of the generator G from the real long speech as well as it can; the generator G and the discriminator D thus form a dynamic game, and the discriminator D and the generator G are alternately optimized using the gradient descent method.
Further, the detailed steps of alternately optimizing the discriminator D and the generator G by the gradient descent method are as follows:
Step 1: select m samples {z^(1), z^(2), …, z^(m)} from the known short speech distribution P_z(z);
Step 2: select the corresponding real long speech data {x^(1), x^(2), …, x^(m)} from the training data;
Step 3: extract the condition information {c^(1), c^(2), …, c^(m)} from the real long speech;
Step 4: let the parameters of the discriminator D be θ_d; obtain the gradient of the following objective function with respect to the parameters, and add the gradient to θ_d during the update:

(1/m) Σ_{i=1}^{m} [log D(x^(i)|c^(i)) + log(1 - D(G(z^(i)|c^(i))|c^(i)))]

Step 5: let the parameters of the generator G be θ_g; obtain the gradient of the following objective function with respect to the parameters, and subtract the gradient from θ_g during the update:

(1/m) Σ_{i=1}^{m} log(1 - D(G(z^(i)|c^(i))|c^(i)))

The parameters of the generator G are then updated each time the parameters of the discriminator D are updated.
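A minimal training-loop sketch of steps 1 to 5, reusing the Generator and Discriminator sketched above; the optimizer choice, learning rates and the log-epsilon are illustrative assumptions.

    # Hedged sketch of the alternating updates; z, x, c are batches of m samples.
    import torch

    G, D = Generator(), Discriminator()
    opt_d = torch.optim.SGD(D.parameters(), lr=1e-3)  # ascends V, so negate below
    opt_g = torch.optim.SGD(G.parameters(), lr=1e-3)  # descends its objective
    eps = 1e-8                                        # numerical safety inside log

    def train_step(z, x, c):
        # Step 4: raise D's objective, i.e. minimise its negation.
        d_loss = -(torch.log(D(x, c) + eps)
                   + torch.log(1 - D(G(z, c).detach(), c) + eps)).mean()
        opt_d.zero_grad(); d_loss.backward(); opt_d.step()
        # Step 5: lower log(1 - D(G(z|c)|c)) with respect to G's parameters.
        g_loss = torch.log(1 - D(G(z, c), c) + eps).mean()
        opt_g.zero_grad(); g_loss.backward(); opt_g.step()
        return d_loss.item(), g_loss.item()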
Further, in step S5 a learning task is designed for the generator G and the discriminator D respectively to guide the compensation of the data during model training. The specific process is as follows:
S51: the generator compensation performance metric training task. The most direct way to measure the compensation performance of the generator G is to calculate the numerical difference between the compensated speech and the real speech. Assuming the N data are divided into groups, the degree of difference between the i-th group of compensated speech and the real long speech is measured by the mean squared error:

MSE = (1/N) Σ_i (observed_real,i - predicted_gan,i)^2

where observed_real,i represents the i-th group of data of the real speech samples and predicted_gan,i represents the i-th group of data of the speech samples compensated by the generative adversarial network multitask framework. With the goal of minimizing the MSE value, the objective function by which the generator G learns the difference between the compensated speech and the real long speech is as follows:

loss_G = E[(x - G(z|c))^2]

where E(·) computes the expected value and G(z|c) represents the compensated sample generated by the generator under the guidance of the condition c. The numerical difference function loss_G measures the compensation performance of the generator; the goal is to minimize this numerical difference function during training so that the compensation performance of the generator reaches its optimum;
S52: the discriminator feature label training task. This task improves the speaker-discrimination capability of the compensated speech: the MFCC features extracted from the real long speech represent the different speaker labels; after the compensated speech and the real long speech are input into the discriminator, whether a speech belongs to a given class feature label is predicted through feature distance measurement, and the cross entropy between the predicted feature label result and the real feature label is minimized.
Further, the cross-entropy objective function between the discriminator's predicted feature label result and the real feature label to be minimized is:

loss_label = - Σ_i (1/n_i) Σ_k p_i(k) log q_i(k)

where n_i represents the number of short speech segments intercepted from the i-th speech signal, p_i(k) is the empirical probability, observed by the discriminator on the basis of the facts, of the k-th class feature label to which the real long speech belongs, and q_i(k) is the prediction probability, calculated by the discriminator from the feature distance, of the k-th class feature label to which the compensated speech belongs. During training, the training of the discriminator is stabilized by continuously minimizing the cross-entropy loss over the feature labels to which the real speech and the compensated speech belong, so that the compensated speech carries more speaker identity features.
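The two auxiliary task losses of S5 might be computed as below. That the discriminator exposes an extra K-way classification head (label_logits) for the speaker feature labels is our assumption; the patent only specifies that the label is predicted through feature distance measurement.

    # Hedged sketch of the S5 multitask losses.
    import torch.nn.functional as F

    def compensation_loss(g_out, x_real):
        # S51: mean squared error between compensated and real long speech (loss_G).
        return F.mse_loss(g_out, x_real)

    def feature_label_loss(label_logits, speaker_ids):
        # S52: cross entropy between predicted and true speaker feature labels.
        return F.cross_entropy(label_logits, speaker_ids)

In training, these terms would be added to the adversarial objectives of the previous sketch, stabilising the discriminator and pulling the compensated distribution toward the real one.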
A storage medium having stored therein a computer program which, when read by a processor, performs any of the methods described above.
The invention has the following advantages and beneficial effects:
The invention provides a short speech sample compensation method based on a generative adversarial network, aimed at the severe drop in short speech recognition rate in speaker recognition systems. A conditional version of the generative adversarial network is used: on the premise that the long speech distribution contains sufficient features for distinguishing speaker identity information, features capable of distinguishing speaker identity are extracted from the long speech as the condition input of the generator G and the discriminator D. The short speech is taken as the input of the generator G, which tries, with the aid of the condition information, to compensate the short speech into a sample close to the real long speech distribution, while the discriminator D tries to determine whether a given speech is a real long speech sample or a pseudo speech compensated by the generator. The method completes the mapping from short speech samples to compensated speech samples and, while ensuring that the compensated speech has sufficient acoustic features, increases the universality and diversity of the training samples, thereby improving system robustness and reducing the equal error rate of speaker recognition.
By constructing the generative adversarial network model and completing the model training process, the method successfully maps short speech lacking identity personality characteristics into compensated speech with a stronger speaker-discrimination capability. While the compensated speech contains sufficient acoustic features, the universality and diversity of the training samples are increased, which effectively alleviates the severe drop in recognition rate caused by insufficient corpus data under short speech conditions in speaker recognition. To prevent problems such as model collapse and gradient instability during the training of the generative adversarial network model, the constructed model-oriented learning tasks, namely the generator compensation performance metric training task and the discriminator feature label training task, effectively stabilize the training process, reduce the deviation between the compensated speech distribution and the real speech distribution, and further improve the speaker-discrimination capability of the compensated speech.
Drawings
FIG. 1 is a flow chart of speaker recognition based on short speech compensation with a generative adversarial network according to the preferred embodiment of the present invention;
FIG. 2 is a structural diagram of the improved generative adversarial compensation model proposed by the invention;
FIG. 3 is a flow chart of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and in detail with reference to the accompanying drawings. The described embodiments are only some of the embodiments of the present invention.
As shown in FIGS. 1 to 3, the technical solution of the invention for solving the above technical problems is as follows:
S1, acquiring a voice signal through a microphone;
S2, preprocessing all the voice data, extracting the identity personality characteristic of the speaker's voice signal, namely the Mel-frequency cepstral coefficients (MFCC), and segmenting the voice signal to obtain short speech.
The pre-emphasis stage can be seen as a high-pass filter corresponding to the following equation, where a is the pre-emphasis coefficient (typically in the interval [0.95, 0.97]):

H(z) = 1 - a z^(-1)

A Hamming window ω(k), shown below, is used to smooth the frame edges, with k = 0, 1, …, K - 1 and K the frame length:

ω(k) = 0.54 - 0.46 cos(2πk / (K - 1))

In speech processing, the Mel-frequency cepstrum represents the short-time power spectrum of speech; it is based on a cosine transform of the log power spectrum on a non-linear Mel frequency scale. The relationship between Mel frequency and linear frequency is:

Mel(f) = 2595 log10(1 + f / 700)

The Mel filter bank is a set of triangular band-pass filters whose transfer function is given below, where M represents the number of filters (generally 20 to 28), 0 ≤ m ≤ M, and the function f(·) gives the center frequencies of the Mel band-pass filter bank:

H_m(k) = 0                                      for k < f(m - 1)
H_m(k) = (k - f(m - 1)) / (f(m) - f(m - 1))     for f(m - 1) ≤ k ≤ f(m)
H_m(k) = (f(m + 1) - k) / (f(m + 1) - f(m))     for f(m) ≤ k ≤ f(m + 1)
H_m(k) = 0                                      for k > f(m + 1)

The i-th dimensional feature component C_i of the MFCC feature parameters is obtained through the discrete cosine transform:

C_i = Σ_{m=1}^{M} S(m) cos(π i (m - 0.5) / M)

where S(m) is the logarithmic output of the m-th filter.
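For illustration, the per-step pipeline above can be written out from scratch roughly as follows; the frame length, FFT size, filter count and coefficient values are illustrative assumptions, not values fixed by the patent.

    # Hedged numpy/scipy sketch of pre-emphasis, Hamming windowing, power
    # spectrum, triangular Mel filter bank, log compression and DCT.
    import numpy as np
    from scipy.fftpack import dct

    def mel(f):                      # Mel(f) = 2595 * log10(1 + f / 700)
        return 2595.0 * np.log10(1.0 + f / 700.0)

    def inv_mel(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    def mfcc_frame(frame, sr=16000, n_fft=512, M=24, n_ceps=13, a=0.97):
        frame = np.append(frame[0], frame[1:] - a * frame[:-1])  # H(z) = 1 - a z^-1
        frame = frame * np.hamming(len(frame))                   # 0.54 - 0.46 cos(.)
        power = np.abs(np.fft.rfft(frame, n_fft)) ** 2 / n_fft   # power spectrum
        # Centre frequencies f(0..M+1), equally spaced on the Mel scale.
        pts = inv_mel(np.linspace(mel(0.0), mel(sr / 2.0), M + 2))
        bins = np.floor((n_fft + 1) * pts / sr).astype(int)
        fbank = np.zeros((M, n_fft // 2 + 1))
        for m in range(1, M + 1):                                # triangular H_m(k)
            left, centre, right = bins[m - 1], bins[m], bins[m + 1]
            fbank[m - 1, left:centre] = (np.arange(left, centre) - left) / max(centre - left, 1)
            fbank[m - 1, centre:right] = (right - np.arange(centre, right)) / max(right - centre, 1)
        log_energy = np.log(fbank @ power + 1e-10)               # S(m): log filter outputs
        return dct(log_energy, type=2, norm='ortho')[:n_ceps]    # C_i via the DCT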
S3, constructing the generative adversarial network model, which comprises a generator network and a discriminator network;
S4, constructing the optimization objective function V(D, G) of the model, where the optimization proceeds as follows:

min_G max_D V(D, G) = E_{x~P_data(x)}[log D(x|c)] + E_{z~P_z(z)}[log(1 - D(G(z|c)|c))]

where the term E_{x~P_data(x)}[log D(x|c)] reflects the probability with which the discriminator D, guided by the condition c, judges the real long speech data x to be real, and the term E_{z~P_z(z)}[log(1 - D(G(z|c)|c))] reflects the probability with which the discriminator D, given the same condition information, judges the compensated sample generated by the generator from the short speech z to be real data.
S5, training the model. In actual training, the discriminator D and the generator G are alternately optimized using the gradient descent method; the detailed steps are as follows:
Step 1: select m samples {z^(1), z^(2), …, z^(m)} from the known short speech distribution P_z(z).
Step 2: select the corresponding real long speech data {x^(1), x^(2), …, x^(m)} from the training data.
Step 3: extract the condition information {c^(1), c^(2), …, c^(m)} from the real long speech.
Step 4: let the parameters of the discriminator D be θ_d; obtain the gradient of the following objective function with respect to the parameters, and add the gradient to θ_d during the update:

(1/m) Σ_{i=1}^{m} [log D(x^(i)|c^(i)) + log(1 - D(G(z^(i)|c^(i))|c^(i)))]

Step 5: let the parameters of the generator G be θ_g; obtain the gradient of the following objective function with respect to the parameters, and subtract the gradient from θ_g during the update:

(1/m) Σ_{i=1}^{m} log(1 - D(G(z^(i)|c^(i))|c^(i)))

The parameters of the generator G are then updated each time the parameters of the discriminator D are updated.
S6, constructing the generator compensation performance metric training task. Since the gradient sometimes vanishes during training, it is more appropriate to have the generative adversarial network learn the difference between the compensated speech and the real speech; the objective function by which the generator G learns the difference between the compensated speech and the real long speech is:

loss_G = E[(x - G(z|c))^2]

The numerical difference function loss_G measures the compensation performance of the generator; the goal is to minimize this numerical difference function during training so as to optimize the compensation performance of the generator.
S7, constructing the discriminator feature label training task. Each MFCC feature extracted from the real long speech is represented by a different speaker label; after the compensated speech and the real long speech are input into the discriminator, whether a speech belongs to a given class feature label is predicted through feature distance measurement, and the cross entropy between the predicted feature label result and the real feature label is minimized. The cross-entropy objective function to be minimized is:

loss_label = - Σ_i (1/n_i) Σ_k p_i(k) log q_i(k)

where n_i represents the number of short speech segments intercepted from the i-th speech signal, p_i(k) is the empirical probability, observed by the discriminator on the basis of the facts, of the k-th class feature label to which the real long speech belongs, and q_i(k) is the prediction probability, calculated by the discriminator from the feature distance, of the k-th class feature label to which the compensated speech belongs. During training, the training of the discriminator is stabilized by continuously minimizing the cross-entropy loss over the feature labels to which the real speech and the compensated speech belong, so that the compensated speech carries more speaker identity features and the equal error rate of short speech in a speaker recognition system is reduced.
S8, the short speech sample compensation method based on the generative adversarial network is evaluated on a speaker recognition system based on the Gaussian mixture model-universal background model; experimental results show that the method effectively reduces the equal error rate of the speaker recognition system in a short speech environment.
The above examples are to be construed as merely illustrative and not limiting of the remainder of the disclosure. After reading the description of the invention, a person skilled in the art can make various changes or modifications to the invention, and these equivalent changes and modifications likewise fall within the scope of the invention defined by the claims.

Claims (8)

1. A short speech sample compensation method based on a generative adversarial network, comprising the steps of:
S1, acquiring a voice signal through a microphone;
S2, sequentially performing pre-emphasis, framing, windowing, fast Fourier transform, Mel filtering and discrete cosine transform on all the voice data acquired in step S1, extracting the identity personality characteristic of the speaker's voice signal, namely the Mel-frequency cepstral coefficients (MFCC), and segmenting the voice signal to obtain short speech;
S3, constructing a generative adversarial network model composed of a generator model G and a discriminator model D, wherein the generator model G maps a random noise vector z to generated data G(z) that obeys the real data distribution P_data as closely as possible, and the discriminator model D determines whether an input sample is real data x or generated data G(z);
S4, constructing the optimization objective function V(D, G) of the generative adversarial network model and carrying out model training;
S5, constructing model-oriented learning tasks, namely a generator compensation performance metric training task and a discriminator feature label training task, wherein the generator compensation performance metric training task reduces the deviation between the compensated speech distribution and the real speech distribution, and the discriminator feature label training task improves the speaker-discrimination capability of the compensated speech.
2. The short speech sample compensation method based on a generative adversarial network according to claim 1, wherein step S2 comprises the following steps:
S21: pre-emphasis, framing, windowing and fast Fourier transform are performed on all voice signals in sequence; the power spectrum is then calculated and passed through a triangular band-pass filter bank, and the filter outputs are converted to logarithmic form using the relationship between the Mel domain and linear frequency:

Mel(f) = 2595 log10(1 + f / 700)

the i-th dimensional feature component C_i of the MFCC feature parameters is finally obtained through the discrete cosine transform:

C_i = Σ_{m=1}^{M} S(m) cos(π i (m - 0.5) / M)

where S(m) is the logarithmic output of the m-th filter and M represents the number of filters, typically 20 to 28; the MFCC of the obtained speaker voice signal is taken as the identity personality characteristic;
S22: the speech signal is segmented to obtain short speech, forming long-speech and short-speech pairs.
3. The short speech sample compensation method based on a generative adversarial network according to claim 2, wherein the generative adversarial network model constructed in step S3 is specifically:
S31: the generator G of the generative adversarial network model is a deep neural network; short speech z is used as the input of the generator G, and a short speech sample passes through the generator G to yield a compensated speech sample G(z); the discriminator D is a deep neural network serving as a binary classifier: the compensated speech sample G(z) produced by the generator G and a real long speech sample x are alternately used as the input of the discriminator D under the same condition, and the discriminator D judges whether the given speech is a real long speech sample or was obtained through compensation by the generator;
S32: a conditional version of the generative adversarial network is used in the model; the conditional generative adversarial network (CGAN) is a conditional model formed by adding a condition extension to the GAN, so that the hidden layers of the generator G and the discriminator D introduce the speaker identity personality characteristic condition c, namely the Mel-frequency cepstral coefficients (MFCC), to better guide the mapping from short speech to compensated speech.
4. The short speech sample compensation method based on a generative adversarial network according to claim 3, wherein step S4 constructs the optimization objective function V(D, G) of the generative adversarial network model and performs model training, specifically including:
S41: for the conditional version of the generative adversarial network, the objective function V(D, G) is optimized as follows:

min_G max_D V(D, G) = E_{x~P_data(x)}[log D(x|c)] + E_{z~P_z(z)}[log(1 - D(G(z|c)|c))]

where the term E_{x~P_data(x)}[log D(x|c)] reflects the probability with which the discriminator D, guided by the condition c, judges the real long speech data x to be real, and the term E_{z~P_z(z)}[log(1 - D(G(z|c)|c))] reflects the probability with which the discriminator D, given the same condition information, judges the compensated sample generated by the generator from the short speech z to be real data;
S42: during training, the generator G aims to compensate the short speech, under the guidance of the condition c, into speech that matches the real long speech distribution as closely as possible, while the discriminator D distinguishes the compensated speech of the generator G from the real long speech as well as it can; the generator G and the discriminator D thus form a dynamic game, and the discriminator D and the generator G are alternately optimized using the gradient descent method.
5. The short speech sample compensation method based on a generative adversarial network according to claim 4, wherein the detailed steps of alternately optimizing the discriminator D and the generator G by the gradient descent method are as follows:
Step 1: select m samples {z^(1), z^(2), …, z^(m)} from the known short speech distribution P_z(z);
Step 2: select the corresponding real long speech data {x^(1), x^(2), …, x^(m)} from the training data;
Step 3: extract the condition information {c^(1), c^(2), …, c^(m)} from the real long speech;
Step 4: let the parameters of the discriminator D be θ_d; obtain the gradient of the following objective function with respect to the parameters, and add the gradient to θ_d during the update:

(1/m) Σ_{i=1}^{m} [log D(x^(i)|c^(i)) + log(1 - D(G(z^(i)|c^(i))|c^(i)))]

where m represents the number of samples;
Step 5: let the parameters of the generator G be θ_g; obtain the gradient of the following objective function with respect to the parameters, and subtract the gradient from θ_g during the update:

(1/m) Σ_{i=1}^{m} log(1 - D(G(z^(i)|c^(i))|c^(i)))

The parameters of the generator G are then updated each time the parameters of the discriminator D are updated.
6. The short speech sample compensation method based on a generative adversarial network according to claim 5, wherein in step S5 a learning task is designed for the generator G and the discriminator D respectively to guide the compensation of the data during model training; the specific process is as follows:
S51: the generator compensation performance metric training task. The most direct way to measure the compensation performance of the generator G is to calculate the numerical difference between the compensated speech and the real speech; assuming the N data are divided into groups, the degree of difference between the i-th group of compensated speech and the real long speech is measured by the mean squared error:

MSE = (1/N) Σ_i (observed_real,i - predicted_gan,i)^2

where observed_real,i represents the i-th group of data of the real speech samples and predicted_gan,i represents the i-th group of data of the speech samples compensated by the generative adversarial network multitask framework; with the goal of minimizing the MSE value, the objective function by which the generator G learns the difference between the compensated speech and the real long speech is as follows:

loss_G = E[(x - G(z|c))^2]

where E(·) computes the expected value and G(z|c) represents the compensated sample generated by the generator under the guidance of the condition c; the numerical difference function loss_G measures the compensation performance of the generator, and the goal is to minimize this numerical difference function during training so that the compensation performance of the generator reaches its optimum;
S52: the discriminator feature label training task: this task improves the speaker-discrimination capability of the compensated speech; the MFCC features extracted from the real long speech represent the different speaker labels; after the compensated speech and the real long speech are input into the discriminator, whether a speech belongs to a given class feature label is predicted through feature distance measurement, and the cross entropy between the predicted feature label result and the real feature label is minimized.
7. The short speech sample compensation method based on a generative adversarial network according to claim 6, wherein the cross-entropy objective function between the discriminator's predicted feature label result and the real feature label to be minimized is:

loss_label = - Σ_i (1/n_i) Σ_k p_i(k) log q_i(k)

where n_i represents the number of short speech segments intercepted from the i-th speech signal, p_i(k) is the empirical probability, observed by the discriminator on the basis of the facts, of the k-th class feature label to which the real long speech belongs, and q_i(k) is the prediction probability, calculated by the discriminator from the feature distance, of the k-th class feature label to which the compensated speech belongs; during training, the training of the discriminator is stabilized by continuously minimizing the cross-entropy loss over the feature labels to which the real speech and the compensated speech belong, so that the compensated speech carries more speaker identity features.
8. A storage medium having a computer program stored therein, wherein the computer program, when read by a processor, performs the method of any of claims 1 to 7.
CN201911067181.9A 2019-11-04 2019-11-04 Short speech sample compensation method based on a generative adversarial network, and storage medium Pending CN112133293A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911067181.9A CN112133293A (en) 2019-11-04 2019-11-04 Short speech sample compensation method based on a generative adversarial network, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911067181.9A CN112133293A (en) 2019-11-04 2019-11-04 Short speech sample compensation method based on a generative adversarial network, and storage medium

Publications (1)

Publication Number Publication Date
CN112133293A (en) 2020-12-25

Family

ID=73849548

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911067181.9A Pending CN112133293A (en) 2019-11-04 2019-11-04 Phrase voice sample compensation method based on generation countermeasure network and storage medium

Country Status (1)

Country Link
CN (1) CN112133293A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113488069A (en) * 2021-07-06 2021-10-08 浙江工业大学 Method and device for quickly extracting high-dimensional voice features based on generative countermeasure network
CN113553972A (en) * 2021-07-29 2021-10-26 青岛农业大学 Apple disease diagnosis method based on deep learning

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10152970B1 (en) * 2018-02-08 2018-12-11 Capital One Services, Llc Adversarial learning and generation of dialogue responses
CN108597496A (en) * 2018-05-07 2018-09-28 广州势必可赢网络科技有限公司 A kind of speech production method and device for fighting network based on production
CN108806708A (en) * 2018-06-13 2018-11-13 中国电子科技集团公司第三研究所 Voice de-noising method based on Computational auditory scene analysis and generation confrontation network model
CN108922518A (en) * 2018-07-18 2018-11-30 苏州思必驰信息科技有限公司 voice data amplification method and system
CN109346087A (en) * 2018-09-17 2019-02-15 平安科技(深圳)有限公司 Fight the method for identifying speaker and device of the noise robustness of the bottleneck characteristic of network
CN109147810A (en) * 2018-09-30 2019-01-04 百度在线网络技术(北京)有限公司 Establish the method, apparatus, equipment and computer storage medium of speech enhan-cement network

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
"Conditional generative adversarial networks for speech enhancement and noise-robust speaker verification", 《ARXIV PREPRINT ARXIV:1709.01703》 *
ZHANG J: "I-vector transformation using conditional generative adversarial networks for short utterance speaker verification", 《ARXIV PREPRINT ARXIV:1804.00290》 *
付亚芹: "Research and implementation of a speaker recognition method based on short speech" (基于短语音的说话人识别方法研究与实现), 《China Masters' Theses Full-text Database, Information Science and Technology》 *
刘海东: "Marking suspicious regions in breast cancer pathology images based on generative adversarial networks" (基于生成对抗网络的乳腺癌病理图像可疑区域标记), 《E-Science Technology & Application》 *
樊云云: "Research on deep learning methods for speaker recognition" (面向说话人识别的深度学习方法研究), 《China Masters' Theses Full-text Database, Information Science and Technology》 *
赵力: "Speech Signal Processing" (《语音信号处理》), China Machine Press, 30 April 2003 *

Similar Documents

Publication Publication Date Title
CN109817246B (en) Emotion recognition model training method, emotion recognition device, emotion recognition equipment and storage medium
US9536547B2 (en) Speaker change detection device and speaker change detection method
Yu et al. Active learning and semi-supervised learning for speech recognition: A unified framework using the global entropy reduction maximization criterion
US7684986B2 (en) Method, medium, and apparatus recognizing speech considering similarity between the lengths of phonemes
EP1515305B1 (en) Noise adaption for speech recognition
Song et al. Noise invariant frame selection: a simple method to address the background noise problem for text-independent speaker verification
Tong et al. A comparative study of robustness of deep learning approaches for VAD
Cui et al. Noise robust speech recognition using feature compensation based on polynomial regression of utterance SNR
CN108520752B (en) Voiceprint recognition method and device
EP3989217B1 (en) Method for detecting an audio adversarial attack with respect to a voice input processed by an automatic speech recognition system, corresponding device, computer program product and computer-readable carrier medium
TW201419270A (en) Method and apparatus for utterance verification
Renevey et al. Robust speech recognition using missing feature theory and vector quantization.
CN109192200A (en) A kind of audio recognition method
Li et al. SNR-invariant PLDA modeling in nonparametric subspace for robust speaker verification
CN110047504B (en) Speaker identification method under identity vector x-vector linear transformation
Perero-Codosero et al. X-vector anonymization using autoencoders and adversarial training for preserving speech privacy
CN112133293A (en) Short speech sample compensation method based on a generative adversarial network, and storage medium
Li et al. Oriental language recognition (OLR) 2020: Summary and analysis
Venkateswarlu et al. Novel approach for speech recognition by using self—organized maps
Chaudhari et al. Multigrained modeling with pattern specific maximum likelihood transformations for text-independent speaker recognition
Saeidi et al. Particle swarm optimization for sorted adapted gaussian mixture models
Singh Support vector machine based approaches for real time automatic speaker recognition system
Ghaemmaghami et al. Speakers in the wild (SITW): The QUT speaker recognition system
Khetri et al. Automatic speech recognition for marathi isolated words
Dat et al. Robust speaker verification using low-rank recovery under total variability space

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 2020-12-25)