CN112133293A - Phrase voice sample compensation method based on generation countermeasure network and storage medium - Google Patents
- Publication number
- CN112133293A (application number CN201911067181.9A)
- Authority
- CN
- China
- Prior art keywords
- voice
- generator
- discriminator
- compensation
- real
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/16—Speech classification or search using artificial neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/26—Speech to text systems
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/24—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
Abstract
The invention requests to protect a short voice sample compensation method and a storage medium based on a generation countermeasure network. The method is used for solving the problem that, in a speaker recognition system, the recognition rate is seriously reduced because the short voice condition leaves the corpus data insufficient. It assumes that the long voice distribution contains sufficient characteristics for distinguishing the identity information of the speaker, and extracts the characteristics capable of distinguishing the identity of the speaker from the long voice as the condition input of the generator G and the discriminator D. The short voice is taken as input to the generator G, which, with the aid of the condition information, tries to compensate the short voice into a sample close to the true long voice distribution, while the discriminator D tries to determine whether a given voice is a true long voice sample or a pseudo voice compensated by the generator. The invention completes the mapping from the short voice sample to the compensated voice sample, and, while ensuring that the compensated voice has sufficient acoustic characteristics, increases the universality and diversity of the training samples, thereby improving system robustness and reducing the equal error rate of speaker recognition.
Description
Technical Field
The invention belongs to the field of speaker identification, and particularly relates to a short voice sample compensation method based on a generation countermeasure network.
Background
The Gaussian mixture model-universal background model (GMM-UBM) is a key method in speaker recognition systems, but it achieves a good recognition effect only when the speaker's voice is long. In a short voice environment the recognition performance drops drastically; in fact, a brief utterance means that the utterance contains insufficient acoustic features. In this case, a speaker model based on statistical attributes does not describe the speaker well; even though the available features have significant specificity, the model remains susceptible to noise interference because there are too few of them. In the past few years, deep learning has become very popular in the field of speaker recognition, and many methods use deep learning to address the short voice sample deficiency problem. Intuitively, the strong feature learning capability of deep models should help solve this problem. However, training deep neural networks requires a large amount of data, and short voices contain little speaker identity information, which is one of the biggest obstacles to building speaker recognition systems with deep learning. Therefore, the invention provides a short voice sample compensation method and a storage medium based on a generation countermeasure network, so that a speaker recognition system operating on the compensated short voice has a higher recognition rate and better robustness.
Disclosure of Invention
The invention aims to solve the problems in the prior art, and provides a short voice sample compensation method and a storage medium based on a generation countermeasure network, which can effectively solve the problem that, in speaker recognition, the recognition rate is seriously reduced because the short voice condition leaves the corpus data insufficient. At the same time, problems such as model collapse and unstable gradients during model training are addressed. The technical scheme of the invention is as follows:
A short voice sample compensation method based on a generation countermeasure network, comprising the steps of:
s1, acquiring a voice signal by a microphone;
s2, sequentially carrying out preprocessing including pre-emphasis, framing, windowing, fast Fourier transform, Mel filtering and discrete cosine transform on all the voice data acquired in the step S1, extracting the personal identity characteristic of the voice signal of the speaker, namely Mel frequency cepstrum coefficient MFCC, and dividing the voice signal to acquire short voice;
s3, constructing a generation countermeasure network model composed of a generator model G and a discriminator model D, wherein the generator model G maps a random noise vector z to generated data G(z) that obeys the real data distribution P_data as closely as possible, and the discriminator model D determines whether an input sample is the real data x or the generated data G(z);
S4, constructing an optimization objective function V (D, G) for generating an antagonistic network model, and carrying out model training;
s5, constructing a model-oriented learning task, namely a generator compensation performance measurement training task and a discriminator characteristic label training task, wherein the generator compensation performance measurement training task is used for reducing the deviation between compensation voice distribution and real voice distribution, and the discriminator characteristic label training task is used for improving the discrimination capability of the compensation voice speaker;
further, the step S2 specifically includes:
s21: and performing pre-emphasis, framing, windowing and fast Fourier transform on all voice signals in sequence. And then calculating a power spectrum, passing the obtained power spectrum through a triangular band-pass filter, and converting the filtering output result into a logarithmic form by using a relationship between the Mel domain and the linear frequency:
finally obtaining the ith dimension characteristic component C of the MFCC characteristic parameter through discrete cosine transformiThe expression of (a) is:
m represents the number of filters, and is typically 20 to 28. And taking the MFCC of the obtained speaker voice signal as the identity personality characteristic.
S22: the speech signal is segmented to obtain short speech, and long speech and short speech sound pairs are formed.
Further, the generating of the countermeasure network model constructed in the step S3 is specifically:
s31: The generator G of the generation countermeasure network model is a deep neural network. A short voice z is taken as the input of the generator G, and a short voice sample passes through the generator G to obtain a compensated voice sample G(z). The discriminator D is a deep neural network serving as a binary classifier. Under the same condition, the compensated voice sample G(z) produced by the generator G and the real long voice sample x are alternately taken as the input of the discriminator D, and the discriminator D judges whether a given voice is a real long voice sample or is obtained by the compensation of the generator;
s32: the conditional version of the generated countermeasure network is used in the model, namely the conditional generation countermeasure network CGAN is a conditional model formed by adding condition extension on the basis of GAN, so that the hidden layers of the generator G and the discriminator D introduce the speaker identity personality characteristic condition c, namely the Mel frequency cepstrum coefficient MFCC, and the mapping process from short voice to compensation voice is guided better.
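The conditioning described in s32 — feeding the speaker-identity condition c into both G and D — is commonly realized by concatenating c with each network's input, as in a standard CGAN. A minimal numpy sketch with hypothetical layer sizes (the 64-dim input and single hidden layer are illustrative assumptions, not the patent's architecture):

```python
import numpy as np

rng = np.random.default_rng(0)

def dense_relu(x, w, b):
    # One fully connected layer with ReLU activation
    return np.maximum(x @ w + b, 0.0)

# Hypothetical dimensions: short-voice input z (64-dim),
# condition c = MFCC speaker-identity vector (13-dim), batch of 8
z = rng.normal(size=(8, 64))
c = rng.normal(size=(8, 13))

# Generator sees [z, c]; its output is the compensated sample G(z|c)
g_in = np.concatenate([z, c], axis=1)                 # (8, 77)
w_g, b_g = rng.normal(size=(77, 64)) * 0.1, np.zeros(64)
g_out = dense_relu(g_in, w_g, b_g)                    # (8, 64)

# Discriminator sees [sample, c] with the SAME condition, scores in (0, 1)
d_in = np.concatenate([g_out, c], axis=1)             # (8, 77)
w_d, b_d = rng.normal(size=(77, 1)) * 0.1, np.zeros(1)
d_score = 1.0 / (1.0 + np.exp(-(d_in @ w_d + b_d)))   # D(G(z|c)|c)
```

The same concatenation is applied when the real long voice sample x is fed to D, so both branches are guided by identical condition information.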
Further, the step S4 is to construct an objective optimization function V (D, G) for generating an antagonistic network model, and perform model training at the same time, specifically including:
s41: For the conditional version of the generation countermeasure network, the optimization objective function V(D, G) is as follows:

min_G max_D V(D, G) = E_{x~P_data(x)}[log D(x|c)] + E_{z~P_z(z)}[log(1 − D(G(z|c)|c))]

wherein E_{x~P_data(x)}[log D(x|c)] represents the probability that the discriminator D, under the guidance of the condition c, judges whether the real long speech data x is real, and E_{z~P_z(z)}[log(1 − D(G(z|c)|c))] represents the probability with which the discriminator D, given the same condition information, judges whether the compensated sample generated by the generator from the short voice z is real data;
s42: in the training process, the generator G aims to compensate the short voice to the voice meeting the distribution of the real long voice as much as possible under the guidance of the condition c, and the discriminator D distinguishes the compensation voice of the generator G from the real long voice as much as possible, so that the generator G and the discriminator D form a dynamic game process, and the discriminator D and the generator G are alternately optimized by using a gradient descent method.
Further, the detailed steps of alternately optimizing the discriminator D and the generator G by using the gradient descent method are as follows:
Step 1: select m samples {z^(1), z^(2), …, z^(m)} from the known short voice distribution P_z(z);

Step 2: select the corresponding real long voice data {x^(1), x^(2), …, x^(m)} from the training data;

Step 3: extract the condition information {c^(1), c^(2), …, c^(m)} from the real long voice;

Step 4: let the parameters of the discriminator D be θ_d, obtain the gradient of the objective function below with respect to θ_d, and add the gradient when updating θ_d:

θ_d ← θ_d + η · ∇_{θ_d} (1/m) Σ_{i=1}^{m} [log D(x^(i)|c^(i)) + log(1 − D(G(z^(i)|c^(i))|c^(i)))]

Step 5: let the parameters of the generator G be θ_g, obtain the gradient of the objective function below with respect to θ_g, and subtract the gradient when updating θ_g:

θ_g ← θ_g − η · ∇_{θ_g} (1/m) Σ_{i=1}^{m} log(1 − D(G(z^(i)|c^(i))|c^(i)))
the parameters of generator G are then updated each time the parameters of discriminator D are updated.
Further, in the step S5, a learning task is designed for the generator G and the discriminator D respectively to guide a compensation process of the data in the model training process, and the specific process is as follows:
s51: Generator compensation performance metric training task. The most direct way to measure the compensation performance of the generator G is to calculate the numerical difference between the compensated voice and the real voice. Assuming that N data are divided into groups, the degree of difference between the i-th group of compensated voice and the real long voice is measured by the mean square error:

MSE = (1/N) Σ_{i=1}^{N} (observed_{real,i} − predicted_{gan,i})²

wherein observed_{real,i} represents the i-th group of data of the real speech samples and predicted_{gan,i} represents the i-th group of data of the voice samples compensated by the generation countermeasure network multitask framework. With the goal of minimizing the MSE value, the objective function by which the generator G learns the difference between the compensated speech and the real long speech is as follows:

loss_G = E[(x − G(z|c))²]

wherein E(·) is the expected-value calculation and G(z|c) represents the compensated sample generated by the generator under the guidance of condition c. The numerical difference function loss_G measures the compensation performance of the generator; the goal is to minimize this numerical difference function during training so that the compensation performance of the generator reaches the optimal state;
s52: training task of the feature label of the discriminator: the feature label training task of the discriminator is used for improving the distinguishing capability of the compensated voice speakers, MFCC features extracted from real long voice represent different speaker labels, after the compensated voice and the real long voice are input into the discriminator, whether the voice belongs to the feature label of the category is predicted through feature distance measurement, and the cross entropy between the result of the predicted feature label and the real feature label is minimized.
Further, the cross-entropy objective function between the discriminator's predicted feature-label result and the true feature label to be minimized is:

loss_D = −(1/n_i) Σ_{j=1}^{n_i} Σ_k p_k log(q_k)

wherein n_i represents the number of short voices intercepted from the i-th speech signal, p_k is the empirical probability, observed by the discriminator based on facts, of the k-th class feature label to which a real long speech belongs, and q_k is the prediction probability, calculated by the discriminator from the feature distance, of the k-th class feature label to which the compensated voice belongs. In the training process, the training of the discriminator is stabilized by continuously minimizing the cross-entropy loss of the feature labels to which the real voice and the compensated voice belong, so that the compensated voice carries more speaker identity features.
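The two auxiliary objectives of step S5 — the generator's mean-square-error metric and the discriminator's feature-label cross entropy — can be sketched numerically as follows. This is a minimal numpy sketch; how the sample groups are formed and how the distance-based prediction q is produced are assumed to happen elsewhere.

```python
import numpy as np

def generator_mse(observed_real, predicted_gan):
    """Mean square error between real long-voice data and the
    generator-compensated data (generator metric task, s51)."""
    observed_real = np.asarray(observed_real, dtype=float)
    predicted_gan = np.asarray(predicted_gan, dtype=float)
    return np.mean((observed_real - predicted_gan) ** 2)

def feature_label_cross_entropy(p_true, q_pred, eps=1e-12):
    """Cross entropy between the empirical feature-label distribution of
    real long voice (p) and the discriminator's distance-based prediction
    for the compensated voice (q) (discriminator label task, s52)."""
    p = np.asarray(p_true, dtype=float)
    q = np.clip(np.asarray(q_pred, dtype=float), eps, 1.0)
    return -np.sum(p * np.log(q))
```

When compensation is perfect the MSE is zero, and when the predicted label distribution matches the one-hot true label the cross entropy approaches zero — the two minimization targets described above.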
A storage medium having stored therein a computer program which, when read by a processor, performs any of the methods described above.
The invention has the following advantages and beneficial effects:
the invention provides a short voice sample compensation method based on a generation countermeasure network, aiming at the problem that the short voice recognition rate is seriously reduced in a speaker recognition system. The conditional version of the countermeasure network is generated, and the feature capable of distinguishing the identity of the speaker is extracted from the long voice as the conditional input of the generator G and the discriminator D, provided that the long voice distribution contains sufficient features for distinguishing the identity information of the speaker. The short voice is taken as input to the generator G, which tries to compensate the short voice to a sample close to the true long speech distribution with the aid of the condition information, while the discriminator D tries to determine whether the given speech is a true long speech sample or a pseudo speech compensated by the generator. The method completes the mapping from the phrase voice sample to the compensation voice sample, and increases the universality and diversity of the training sample while ensuring that the compensated voice has sufficient acoustic characteristics, thereby improving the system robustness and reducing the error rate of speaker recognition and the like.
The method completes the model training process by constructing and generating the confrontation network model, successfully maps the short voice lacking the identity individual characteristics into the compensation voice with stronger speaker distinguishing capability, increases the universality and diversity of the training sample while the compensation voice contains sufficient acoustic characteristics, and can effectively solve the problem that the recognition rate is seriously reduced due to insufficient corpus data caused by the short voice condition in speaker recognition. In order to prevent the problems of model collapse, gradient instability and the like in the training process of generating the confrontation network model, the constructed model-oriented learning task, namely the generator compensation performance measurement training task and the feature label training task of the discriminator effectively stabilizes the training process, reduces the deviation between the compensation voice distribution and the real voice distribution, and further improves the distinguishing capability of the compensation voice speaker.
Drawings
FIG. 1 is a flow chart of speaker recognition based on generation of anti-network short speech compensation according to the preferred embodiment of the present invention;
FIG. 2 is a diagram of an improved generation countermeasure compensation model structure proposed by the present invention;
FIG. 3 is a flow chart of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described in detail and clearly with reference to the accompanying drawings. The described embodiments are only some of the embodiments of the present invention.
The technical scheme for solving the technical problems is as follows:
as shown in fig. 1 to 3, the technical solution of the present invention for solving the above technical problems is:
s1, acquiring a voice signal by a microphone;
s2, preprocessing all voice data, extracting the personal identity characteristic of the speaker voice signal, namely Mel frequency cepstrum coefficient MFCC, and dividing the voice signal to obtain short voice;
the pre-emphasis part can be seen as a high-pass filter, corresponding to the following equation, where a is the pre-emphasis coefficient (typically in the interval [0.95,0.97 ]).
H(z) = 1 − a·z⁻¹
A Hamming window ω(k), shown below, is used to smooth the edge signal, where K is the frame length:

ω(k) = 0.54 − 0.46·cos(2πk/(K − 1)), k = 0, 1, …, K − 1
In speech processing, the Mel-frequency cepstrum is a representation of the short-time power spectrum of speech, based on a cosine transform of the log power spectrum on a nonlinear Mel frequency scale. The relationship between Mel frequency and linear frequency is given by:

Mel(f) = 2595·log_10(1 + f/700)

The Mel filter bank is a set of triangular band-pass filters whose transfer function is as follows, where M represents the number of filters, generally 20 to 28, 0 ≤ m ≤ M, and the f(·) function gives the center frequencies of the Mel band-pass filter bank:

H_m(k) = 0, for k < f(m−1);
H_m(k) = (k − f(m−1)) / (f(m) − f(m−1)), for f(m−1) ≤ k ≤ f(m);
H_m(k) = (f(m+1) − k) / (f(m+1) − f(m)), for f(m) ≤ k ≤ f(m+1);
H_m(k) = 0, for k > f(m+1).
The i-th dimension feature component C_i of the MFCC feature parameters is obtained through the discrete cosine transform:

C_i = sqrt(2/M)·Σ_{m=1}^{M} log S(m)·cos(iπ(m − 0.5)/M)

wherein S(m) is the output of the m-th filter.
s3, constructing and generating a confrontation network model, which comprises a generator network and a discriminator network;
s4, constructing the optimization objective function V(D, G) of the model, wherein the optimization is as follows:

min_G max_D V(D, G) = E_{x~P_data(x)}[log D(x|c)] + E_{z~P_z(z)}[log(1 − D(G(z|c)|c))]

wherein E_{x~P_data(x)}[log D(x|c)] represents the probability that the discriminator D, under the guidance of condition c, judges whether the real long speech data x is real, and E_{z~P_z(z)}[log(1 − D(G(z|c)|c))] represents the probability with which the discriminator D, given the same condition information, judges whether the compensated sample generated by the generator from the short voice z is real data.
And S5, training the model. In the actual training, the discriminator D and the generator G are alternately optimized by using a gradient descent method, and the detailed steps are as follows:
Step 1: select m samples {z^(1), z^(2), …, z^(m)} from the known short voice distribution P_z(z).

Step 2: select the corresponding real long voice data {x^(1), x^(2), …, x^(m)} from the training data.

Step 3: extract the condition information {c^(1), c^(2), …, c^(m)} from the real long voice.

Step 4: let the parameters of the discriminator D be θ_d, obtain the gradient of the objective function below with respect to θ_d, and add the gradient when updating θ_d:

θ_d ← θ_d + η · ∇_{θ_d} (1/m) Σ_{i=1}^{m} [log D(x^(i)|c^(i)) + log(1 − D(G(z^(i)|c^(i))|c^(i)))]

Step 5: let the parameters of the generator G be θ_g, obtain the gradient of the objective function below with respect to θ_g, and subtract the gradient when updating θ_g:

θ_g ← θ_g − η · ∇_{θ_g} (1/m) Σ_{i=1}^{m} log(1 − D(G(z^(i)|c^(i))|c^(i)))
The parameters of generator G are then updated each time the parameters of discriminator D are updated.
S6, the construction generator compensates the performance measurement training task. Aiming at the problem that gradient disappears sometimes in the training process, it is more appropriate to generate an antagonistic network to learn and compensate the difference between the voice and the real voice, and an objective function of a generator G for learning and compensating the difference between the voice and the real long voice is as follows:
numerical difference function loss measuring compensation performance of generatorGThe goal is to minimize the numerical difference function during training to optimize the compensation performance of the generator.
S7, constructing a discriminant feature tag training task, representing each MFCC feature extracted from the real long speech by different speaker tags, after compensating the speech and the real long speech, predicting whether the speech belongs to the class feature tag or not through feature distance measurement, and minimizing the cross entropy between the result of predicting the feature tag and the real feature tag. The cross entropy objective function between the minimum discriminator predicted feature tag result and the true feature tag is:
whereinIn order for the discriminator to observe the empirical probability of the class k signature to which a true long speech belongs based on facts,and calculating the prediction probability of the kth class feature label to which the compensation voice belongs according to the feature distance for the discriminator. In the training process, the training of the discriminator is stabilized by continuously minimizing the cross entropy loss of the feature labels to which the real voice and the compensation voice belong, so that the compensation voice carries more speaker identity features, and the equal error rate of the short voice in a speaker recognition system is reduced.
S8, a short voice sample compensation method based on the generation countermeasure network is evaluated on the speaker recognition system based on the Gaussian mixture model-general background model, and experimental results show that the method effectively reduces the equal error rate of the speaker recognition system in the short voice environment.
The above examples are to be construed as merely illustrative and not limitative of the remainder of the disclosure. After reading the description of the invention, the skilled person can make various changes or modifications to the invention, and these equivalent changes and modifications also fall into the scope of the invention defined by the claims.
Claims (8)
1. A short voice sample compensation method based on a generation countermeasure network, comprising the steps of:
s1, acquiring a voice signal by a microphone;
s2, sequentially carrying out pre-emphasis, framing, windowing, fast Fourier transform, Mel filtering and discrete cosine transform on all the voice data acquired in the step S1, extracting the personal identity characteristic of the speaker voice signal, namely Mel frequency cepstrum coefficient MFCC, and dividing the voice signal to acquire short voice;
s3, constructing a generation countermeasure network model composed of a generator model G and a discriminator model D, wherein the generator model G maps a random noise vector z to generated data G(z) that obeys the real data distribution P_data as closely as possible, and the discriminator model D determines whether an input sample is the real data x or the generated data G(z);
s4, constructing an optimization objective function V (D, G) for generating an antagonistic network model, and carrying out model training;
s5, constructing a model-oriented learning task, namely a generator compensation performance measurement training task and a discriminator characteristic label training task, wherein the generator compensation performance measurement training task is used for reducing the deviation between compensation voice distribution and real voice distribution, and the discriminator characteristic label training task is used for improving the discrimination capability of the compensation voice speaker.
2. The method for generating phrase voice sample compensation for confrontation network as claimed in claim 1, wherein said step S2 includes the following steps:
s21: Pre-emphasis, framing, windowing and fast Fourier transform are performed on all voice signals in sequence. The power spectrum is then calculated, the obtained power spectrum is passed through the triangular band-pass filters, and the filter outputs are converted into logarithmic form using the relationship between the Mel domain and linear frequency:

Mel(f) = 2595 · log_10(1 + f/700)

Finally, the i-th dimension feature component C_i of the MFCC feature parameters is obtained through the discrete cosine transform:

C_i = sqrt(2/M) · Σ_{m=1}^{M} log S(m) · cos(iπ(m − 0.5)/M)

wherein M represents the number of filters, generally 20 to 28, and S(m) is the output of the m-th filter; the MFCC of the obtained speaker voice signal is taken as the identity personality characteristic;
s22: the speech signal is segmented to obtain short speech, and long speech and short speech sound pairs are formed.
3. The method for compensating the phrase voice sample based on the generative confrontation network as claimed in claim 2, wherein the generative confrontation network model constructed in the step S3 is specifically:
s31: The generator G of the generation countermeasure network model is a deep neural network. A short voice z is taken as the input of the generator G, and a short voice sample passes through the generator G to obtain a compensated voice sample G(z). The discriminator D is a deep neural network serving as a binary classifier. Under the same condition, the compensated voice sample G(z) produced by the generator G and the real long voice sample x are alternately taken as the input of the discriminator D, and the discriminator D judges whether a given voice is a real long voice sample or is obtained by the compensation of the generator;
s32: the conditional version of the generated countermeasure network is used in the model, namely the conditional generation countermeasure network CGAN is a conditional model formed by adding condition extension on the basis of GAN, so that the hidden layers of the generator G and the discriminator D introduce the speaker identity personality characteristic condition c, namely the Mel frequency cepstrum coefficient MFCC, and the mapping process from short voice to compensation voice is guided better.
4. The method for generating short voice sample compensation for confrontation network according to claim 3, wherein the step S4 is to construct an objective optimization function V (D, G) for generating confrontation network model, and to train the model, specifically comprising:
s41: For the conditional version of the generation countermeasure network, the optimization objective function V(D, G) is as follows:

min_G max_D V(D, G) = E_{x~P_data(x)}[log D(x|c)] + E_{z~P_z(z)}[log(1 − D(G(z|c)|c))]

wherein E_{x~P_data(x)}[log D(x|c)] represents the probability that the discriminator D, under the guidance of condition c, judges whether the real long speech data x is real, and E_{z~P_z(z)}[log(1 − D(G(z|c)|c))] represents the probability with which the discriminator D, given the same condition information, judges whether the compensated sample generated by the generator from the short voice z is real data;
s42: in the training process, the generator G aims to compensate the short voice to the voice meeting the distribution of the real long voice as much as possible under the guidance of the condition c, and the discriminator D distinguishes the compensation voice of the generator G from the real long voice as much as possible, so that the generator G and the discriminator D form a dynamic game process, and the discriminator D and the generator G are alternately optimized by using a gradient descent method.
5. The method for generating phrase voice sample compensation for countermeasure network as claimed in claim 4, wherein the detailed steps of optimizing the alternation of the discriminator D and the generator G by using gradient descent method are as follows:
step 1: from knownShort speech distribution Pz(z)In which a number of samples z are selected(1),z(2)……,z(m)};
Step 2: selecting corresponding real long voice data { x from training data(1),x(2)……,x(m)};
And 3, step 3: extracting condition information { c) from real long voice(1),c(2)……,c(m)};
And 4, step 4: let the parameter of the discriminator D be thetadThe gradient of the objective function of the following formula with respect to the parameter is obtained, for θdAdding the gradient during updating;
And 5, step 5: let the generator G have a parameter θgThe gradient of the objective function of the following formula with respect to the parameter is obtained, for θgSubtracting the gradient at update time;
the parameters of generator G are then updated each time the parameters of discriminator D are updated.
6. The method for generating phrase sound sample compensation for confrontation network as claimed in claim 5, wherein said step S5 is to design learning task for generator G and discriminator D respectively to guide the compensation process of data during the model training process, and the specific process is as follows:
s51: Generator compensation performance metric training task. The most direct way to measure the compensation performance of the generator G is to calculate the numerical difference between the compensated voice and the real voice. Assuming that N data are divided into groups, the degree of difference between the i-th group of compensated voice and the real long voice is measured by the mean square error:

MSE = (1/N) Σ_{i=1}^{N} (observed_{real,i} − predicted_{gan,i})²

wherein observed_{real,i} represents the i-th group of data of the real speech samples and predicted_{gan,i} represents the i-th group of data of the voice samples compensated by the generation countermeasure network multitask framework. With the goal of minimizing the MSE value, the objective function by which the generator G learns the difference between the compensated speech and the real long speech is as follows:

loss_G = E[(x − G(z|c))²]

wherein E(·) is the expected-value calculation and G(z|c) represents the compensated sample generated by the generator under the guidance of condition c; the numerical difference function loss_G measures the compensation performance of the generator, and the goal is to minimize this numerical difference function during training so that the compensation performance of the generator reaches the optimal state;
S52: discriminator feature label training task: the feature label training task of the discriminator is used to improve the speaker discriminability of the compensated speech. MFCC features extracted from the real long speech represent the labels of different speakers; after the compensated speech and the real long speech are input into the discriminator, whether a speech segment belongs to the feature label of a given class is predicted by a feature distance measure, and the cross entropy between the predicted feature label and the real feature label is minimized.
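The S51 group-wise mean square error can be sketched as follows (the grouping and the toy feature vectors are illustrative assumptions; in the patent the inputs would be speech feature groups):

```python
import numpy as np

def group_mse(observed_real, predicted_gan):
    """Mean square error between one group of real long-speech features
    (observed_real) and the corresponding GAN-compensated features."""
    observed_real = np.asarray(observed_real, dtype=float)
    predicted_gan = np.asarray(predicted_gan, dtype=float)
    return float(np.mean((observed_real - predicted_gan) ** 2))

# Hypothetical toy feature vectors for one group:
real_group = [1.0, 2.0, 3.0, 4.0]
compensated_group = [1.0, 2.0, 3.0, 2.0]
print(group_mse(real_group, compensated_group))  # 1.0
```

Minimizing this value over all groups is the generator's compensation-performance objective; in practice it is averaged over batches, which is exactly the expectation E[(x − G(z|c))^2].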
7. The method as claimed in claim 6, wherein the cross-entropy objective function between the feature label predicted by the discriminator and the true feature label is the cross entropy between the true label distribution p and the predicted label distribution q, −Σ_k p_k log(q_k):
wherein n_i represents the number of short speech segments intercepted from the i-th speech signal, p_k is the empirical probability, observed from the data, that a real long speech belongs to the class-k feature label, and q_k is the probability the discriminator predicts for that label. During training, the training of the discriminator is stabilized by continuously minimizing the cross-entropy loss between the feature labels of the real speech and of the compensated speech, so that the compensated speech carries more speaker identity features.
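A minimal sketch of this claim-7 objective, with p_k as the empirical (here one-hot) probability that a real long speech carries the class-k feature label and q_k as the discriminator's predicted probability; the symbol names and the 3-speaker example are assumptions, since the original formula does not survive extraction:

```python
import numpy as np

def label_cross_entropy(p_true, q_pred, eps=1e-12):
    """Cross entropy -sum_k p_k * log(q_k) between the true feature-label
    distribution and the discriminator's predicted distribution."""
    p = np.asarray(p_true, dtype=float)
    q = np.clip(np.asarray(q_pred, dtype=float), eps, 1.0)  # avoid log(0)
    return float(-np.sum(p * np.log(q)))

# Hypothetical 3-speaker example: the true label is speaker 0.
p = [1.0, 0.0, 0.0]
q = [0.7, 0.2, 0.1]                 # discriminator prediction
loss = label_cross_entropy(p, q)    # equals -log(0.7)
```

The loss shrinks as the discriminator assigns more probability to the correct speaker label, which is what drives the compensated speech to carry more speaker identity features.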
8. A storage medium in which a computer program is stored, wherein the computer program, when executed by a processor, performs the method of any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911067181.9A CN112133293A (en) | 2019-11-04 | 2019-11-04 | Phrase voice sample compensation method based on generation countermeasure network and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112133293A true CN112133293A (en) | 2020-12-25 |
Family
ID=73849548
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911067181.9A Pending CN112133293A (en) | 2019-11-04 | 2019-11-04 | Phrase voice sample compensation method based on generation countermeasure network and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112133293A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113488069A (en) * | 2021-07-06 | 2021-10-08 | 浙江工业大学 | Method and device for fast extraction of high-dimensional speech features based on a generative adversarial network
CN113553972A (en) * | 2021-07-29 | 2021-10-26 | 青岛农业大学 | Apple disease diagnosis method based on deep learning |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108597496A (en) * | 2018-05-07 | 2018-09-28 | 广州势必可赢网络科技有限公司 | Speech generation method and device based on a generative adversarial network
CN108806708A (en) * | 2018-06-13 | 2018-11-13 | 中国电子科技集团公司第三研究所 | Voice denoising method based on computational auditory scene analysis and a generative adversarial network model
CN108922518A (en) * | 2018-07-18 | 2018-11-30 | 苏州思必驰信息科技有限公司 | Voice data augmentation method and system
US10152970B1 (en) * | 2018-02-08 | 2018-12-11 | Capital One Services, Llc | Adversarial learning and generation of dialogue responses
CN109147810A (en) * | 2018-09-30 | 2019-01-04 | 百度在线网络技术(北京)有限公司 | Method, apparatus, device and computer storage medium for establishing a speech enhancement network
CN109346087A (en) * | 2018-09-17 | 2019-02-15 | 平安科技(深圳)有限公司 | Noise-robust speaker recognition method and device based on bottleneck features of an adversarial network
Non-Patent Citations (6)
Title |
---|
MICHELSANTI D. ET AL: "Conditional generative adversarial networks for speech enhancement and noise-robust speaker verification", 《ARXIV PREPRINT ARXIV:1709.01703》 * |
ZHANG J: "I-vector transformation using conditional generative adversarial networks for short utterance speaker verification", 《ARXIV PREPRINT ARXIV:1804.00290》 * |
付亚芹: "基于短语音的说话人识别方法研究与实现", 《中国优秀硕士学位论文全文数据库信息科技辑》 * |
刘海东: "基于生成对抗网络的乳腺癌病理图像可疑区域标记", 《科研信息化技术与应用》 * |
樊云云: "面向说话人识别的深度学习方法研究", 《中国优秀硕士学位论文全文数据库信息科技辑》 * |
赵力: "《语音信号处理》", 30 April 2003, 机械工业出版社 * |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20201225 |