CN113299315B - Method for generating voice features through continuous learning without original data storage - Google Patents


Info

Publication number
CN113299315B
CN113299315B (application CN202110852843.4A)
Authority
CN
China
Prior art keywords
model
domain model
features
source domain
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110852843.4A
Other languages
Chinese (zh)
Other versions
CN113299315A (en)
Inventor
陶建华
马浩鑫
易江燕
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Automation of Chinese Academy of Science
Original Assignee
Institute of Automation of Chinese Academy of Science
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Automation of Chinese Academy of Science filed Critical Institute of Automation of Chinese Academy of Science
Priority to CN202110852843.4A priority Critical patent/CN113299315B/en
Publication of CN113299315A publication Critical patent/CN113299315A/en
Application granted granted Critical
Publication of CN113299315B publication Critical patent/CN113299315B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00
    • G10L 25/27 characterised by the analysis technique
    • G10L 25/30 characterised by the analysis technique using neural networks
    • G10L 25/03 characterised by the type of extracted parameters
    • G10L 25/12 characterised by the type of extracted parameters, the extracted parameters being prediction coefficients
    • G10L 25/21 characterised by the type of extracted parameters, the extracted parameters being power information

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a method for generating voice features through continuous learning without storing the original data, comprising the following steps: collecting audio data and extracting audio acoustic features to obtain linear frequency cepstral coefficient (LFCC) features; training a deep learning network model with the LFCC features to obtain a source domain model; and adding a regularization loss to the training loss function of the source domain model to constrain the direction of model parameter optimization, and updating the model parameters of the source domain model with newly acquired audio data to obtain a target domain model.

Description

Method for generating voice features through continuous learning without original data storage
Technical Field
The invention relates to the fields of speech processing and image processing, and in particular to a method for generating voice features through continuous learning without original data storage.
Background
Generated-speech detection is the task of judging whether an audio signal is genuine human speech or generated speech produced by replay recording, speech synthesis or voice conversion technologies.
An existing generated-speech discrimination model trained on a specific data set has a greatly reduced ability to detect unknown generated speech that does not match its training data, i.e. its generalization performance is low.
Meanwhile, as speech synthesis and voice conversion technologies keep developing, numerous speech generation methods have emerged, yet existing generated-speech detection schemes all face insufficient model generalization: for generation types unknown at training time, no model is robust and well generalized enough. For example, a model trained on the ASVspoof 2019 LA data set performs far worse on the ASVspoof 2019 PA data set because that generation type is absent from training; likewise, a model trained on data produced by a limited set of speech synthesis techniques struggles to detect newly synthesized speech. Existing practice trains a generated-speech discrimination model once on the available generated-speech data; when a new generation method appears, the new data can be mixed with the original data and the model retrained, but as the amount of data grows this makes computation and storage grow linearly and the cost becomes excessive. Moreover, because of privacy protection requirements for sensitive data, long-term storage of the original data may not be possible; and for a generated-speech detection model that is continuously updated online, retraining on the old data is not feasible at all.
In view of the above problems, it is important that the model be able to continuously learn newly generated speech.
To improve the discrimination performance of the model on unknown generated speech, model fine-tuning, joint training, extraction of more generalized acoustic features and the like can also be considered.
With model fine-tuning, 'catastrophic forgetting' occurs when the original model is fine-tuned on new data, so performance on the original data set drops sharply; joint training incurs large time and computation overheads, and in some special situations the original data cannot be obtained because of privacy protection or other confidentiality reasons, so training on all the data together is impossible.
For the problem that model detection performance drops markedly on unknown data sets, several related lines of research exist:
1. Multi-model fusion: a separate generated-speech discrimination model is trained for each data set, the models are then fused, and a combined score is produced.
2. Bidirectional adversarial domain adaptation: an extension of domain-adversarial training in which two domain discriminators, one for genuine speech and one for generated speech, are added to the network; the method trains on labeled source-domain data and unlabeled target-domain data to improve performance on domain-mismatched data sets.
3. Extraction of other generalized features: a front-end feature extractor is designed from the perspective of traditional signal processing, in the hope of obtaining more generalizable features such as extended CQCC or CQSPIC coefficients.
However, these studies have drawbacks: multi-model fusion requires new and old data to be trained together, which increases the training cost; bidirectional adversarial domain adaptation focuses only on performance on the new data and neglects performance on the old data sets; and extracting other features cannot guarantee that the chosen feature performs well for all generated-speech types.
The continual learning problem is how to overcome the 'catastrophic forgetting' encountered in fine-tuning, so that the model can be updated continuously with new data only, learning new tasks while retaining its memory of old tasks.
In addition to the above problems, in practical applications users need to know not only whether speech is genuine but also the specific generation type. A simple binary classification is then not persuasive enough as the model's output judgment, so the original genuine/fake binary classification is replaced by a multi-class classification over generation types, which is more practical.
Publication number CN111564163A discloses an RNN-based speech detection method for multiple forgery operations, comprising the following steps: 1) an original speech sample is obtained and subjected to M kinds of forgery processing to obtain M forged utterances plus 1 unprocessed original utterance; features are extracted from these utterances to obtain the LFCC matrices of the training speech samples, which are fed into an RNN classifier network for training to obtain a multi-class training model; 2) a segment of test speech is obtained and its features are extracted to obtain the LFCC matrix of the test speech data, which is fed into the RNN classifier trained in step 1) for classification; an output probability is obtained for each test utterance and all output probabilities are combined into the final prediction: if the prediction is the original speech, the test speech is recognized as original speech; if the prediction is speech subjected to a certain forgery operation, the test speech is recognized as forged speech subjected to the corresponding forgery operation.
Publication number CN112712809B discloses a speech detection method, apparatus, electronic device and storage medium: a plurality of speech feature vectors are extracted from the speech to be detected; the feature vectors are input into a plurality of pre-trained speech source models, and a first matching degree between the speech to be detected and the source type of each speech source model is determined; for each speech class model, a second matching degree between the speech to be detected and the class type corresponding to that model is determined on the basis of the determined first matching degrees; and the class type and source type of the speech to be detected are determined from the first and second matching degrees.
The prior art has the following defects:
When a new generation type appears and the model needs updating, there are two common solutions, direct fine-tuning and retraining on mixed new and old data, but they suffer from high computation cost and long retraining time.
For unknown types of generated speech there are multi-model fusion and adaptive training methods, but these also have corresponding disadvantages.
1. Direct fine-tuning: training the existing model on new data improves its performance on the new data but greatly reduces its recognition performance on the previous data.
2. Retraining from scratch: the new and old data are combined and the model is retrained repeatedly; as the data keep growing, training takes longer and longer and the time cost and computation expense increase.
3. Multi-model fusion: every newly added data set requires an additional model, which brings storage overhead.
4. Domain-adversarial adaptation: new and old data must be trained together, and in some cases this method cannot be used because the old data are unavailable for reasons such as privacy and security.
In addition, the existing methods target genuine/fake binary classification, whereas in practical applications users need to know not only whether speech is genuine but also the specific generation type, so multi-class classification over generation types is of great significance.
Disclosure of Invention
In view of the above, the present invention provides a method for generating voice features through continuous learning without original data storage, the method comprising:
S1: collecting audio data and extracting audio acoustic features to obtain LFCC features;
S2: training a deep learning network model with the LFCC features to obtain a source domain model;
S3: adding a regularization loss to the training loss function of the source domain model to constrain the direction of model parameter optimization, and updating the model parameters of the source domain model with newly collected audio data of new generation types to obtain a target domain model.
Preferably, the specific method for extracting the audio acoustic features to obtain the LFCC features comprises: sampling the collected audio data to obtain raw waveform points, and then performing pre-emphasis, framing, windowing and fast Fourier transform to obtain a Fourier power spectrum;
passing the Fourier power spectrum through a linear filter bank and performing a DCT (discrete cosine transform) to obtain the 60-dimensional LFCC (linear frequency cepstral coefficient) features of the audio;
where the window length is 25 frames and a 512-point FFT is performed.
Preferably, the deep learning network model is a lightweight convolutional neural network.
Preferably, the lightweight convolutional neural network finally outputs an N-way classification result through a fully connected layer, covering real speech and N-1 different types of generated speech.
Preferably, N is set to 50.
Preferably, the regularization loss includes a distillation regularization loss and a real-speech feature distribution consistency constraint.
Preferably, the training loss function L_total of the target domain model is:

L_total = L_original + α·L_distillation + β·L_real

wherein:
L_original: the training loss function of the source domain model;
L_distillation: the distillation regularization loss;
α: the weight of the distillation regularization loss, 0.5 ≤ α ≤ 1;
L_real: the real-speech feature distribution consistency constraint;
β: the weight of the real-speech feature distribution consistency constraint, 1 ≤ β ≤ 1.5.
Preferably, the distillation regularization loss is given by:

L_distillation = - Σ_i q_i^(s) · log q_i^(t)

q_i^(s) = (y_i^(s))^(1/T) / Σ_j (y_j^(s))^(1/T)

q_i^(t) = (y_i^(t))^(1/T) / Σ_j (y_j^(t))^(1/T)

wherein:
y_i^(s): the prediction probability of the i-th class output by the source domain model for the newly collected audio data;
Σ_j (y_j^(s))^(1/T): the accumulation over the outputs of the source domain model for the newly collected audio data;
y_i^(t): the prediction probability of the i-th class output by the target domain model;
Σ_j (y_j^(t))^(1/T): the accumulation over the outputs of the target domain model;
T: the temperature hyperparameter.
Preferably, the temperature hyperparameter is set in the range 1 ≤ T ≤ 2.
Preferably, the real-speech feature distribution consistency constraint is given by:

L_real = (1/M) · Σ_{k=1..M} ( 1 - (e_k^(s) · e_k^(t)) / (‖e_k^(s)‖ · ‖e_k^(t)‖) )

wherein:
M: the total number of real speech utterances;
e_k^(s): the embedded feature vector of the k-th real speech utterance output by the source domain model;
e_k^(t): the embedded feature vector of the k-th real speech utterance output by the target domain model;
‖e_k^(s)‖: the norm of the embedded feature vector of the k-th real speech utterance output by the source domain model;
‖e_k^(t)‖: the norm of the embedded feature vector of the k-th real speech utterance output by the target domain model.
Compared with the prior art, the technical solution provided by the embodiments of the present application has the following advantages:
1. Time, computation and storage overhead: each model update uses only the most recently trained model and the new data;
2. Continuous incremental learning: the method matches the reality that generation methods are constantly updated, so the generated-speech discrimination model keeps evolving as generation technology develops;
3. The performance of the model on old data does not degrade unacceptably: although only new data are used, the drop on old data is limited rather than catastrophic, and is far better than direct fine-tuning;
4. The output is a multi-class generation category, so the specific generation type is detected.
Drawings
FIG. 1 is a flowchart of a method for generating voice features through continuous learning without original data storage according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of the advantages of continuous learning over direct fine-tuning according to an embodiment of the present invention.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described below do not represent all embodiments consistent with the present invention; rather, they are merely examples of apparatus and methods consistent with certain aspects of the invention, as recited in the appended claims.
As shown in FIG. 1, a method for generating voice features through continuous learning without original data storage according to an embodiment of the present application includes:
s1: the method comprises the following steps of collecting audio data, extracting audio acoustic features and obtaining LFCC features, and specifically comprises the following steps: sampling the collected audio data to obtain original waveform points, and then performing pre-emphasis, framing, windowing and fast Fourier transform to obtain a Fourier power spectrum;
performing DCT (discrete cosine transformation) on the Fourier power spectrum through a linear filter bank to obtain 60-dimensional LFCC (linear frequency feedback control) characteristics of the audio;
wherein the window length is 25 frames, and 512-dimensional FFT is carried out;
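A minimal sketch of this front end is given below in Python with numpy and scipy. The 16 kHz sampling rate, the 10 ms hop, the interpretation of the window length of 25 as 25 ms, the 20-filter linear filter bank and the split of the 60 dimensions into 20 static coefficients plus deltas and delta-deltas are illustrative assumptions that are not specified in the text.

    import numpy as np
    from scipy.fftpack import dct

    def extract_lfcc(signal, sr=16000, n_fft=512, win_ms=25, hop_ms=10,
                     n_filters=20, n_ceps=20, pre_emph=0.97):
        # Pre-emphasis
        sig = np.append(signal[0], signal[1:] - pre_emph * signal[:-1])
        # Framing (assumed 25 ms window, 10 ms hop) and Hamming windowing
        win = int(sr * win_ms / 1000)
        hop = int(sr * hop_ms / 1000)
        sig = np.pad(sig, (0, max(0, win - len(sig))))
        n_frames = 1 + (len(sig) - win) // hop
        idx = np.arange(win)[None, :] + hop * np.arange(n_frames)[:, None]
        frames = sig[idx] * np.hamming(win)
        # 512-point FFT -> Fourier power spectrum
        power = np.abs(np.fft.rfft(frames, n_fft)) ** 2
        # Triangular filters spaced linearly in frequency (hence "linear" cepstral coefficients)
        edges = np.linspace(0, sr / 2, n_filters + 2)
        bins = np.floor((n_fft / 2) * edges / (sr / 2)).astype(int)
        fbank = np.zeros((n_filters, n_fft // 2 + 1))
        for m in range(1, n_filters + 1):
            left, center, right = bins[m - 1], bins[m], bins[m + 1]
            fbank[m - 1, left:center] = (np.arange(left, center) - left) / max(center - left, 1)
            fbank[m - 1, center:right] = (right - np.arange(center, right)) / max(right - center, 1)
        # Log filter-bank energies followed by a DCT give the cepstral coefficients
        logfb = np.log(np.maximum(power @ fbank.T, 1e-10))
        static = dct(logfb, type=2, axis=1, norm='ortho')[:, :n_ceps]
        # The 60-dimensional feature is assumed here to be static + delta + delta-delta
        delta = np.gradient(static, axis=0)
        delta2 = np.gradient(delta, axis=0)
        return np.concatenate([static, delta, delta2], axis=1)  # shape (n_frames, 60)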
s2: training a deep learning network model by applying the LFCC characteristics to obtain a source domain model;
the deep learning network model is a lightweight convolutional neural network comprising convolutional layers, max-pooling layers, max-feature-map outputs and batch normalization operations; it finally yields an 80-dimensional embedded feature vector and outputs an N-way classification result through a fully connected layer, covering real speech and N-1 types of generated speech (speech generated by different recording devices, different vocoders and different methods), with classification heads reserved for generated-speech types that may appear in the future; N is preset to 50;
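The sketch below (PyTorch) illustrates a network of this kind: max-feature-map activations, convolution, max pooling and batch normalization, an 80-dimensional embedding and a 50-way classification head. The specific number of layers, kernel sizes and channel widths are not given in the text and are chosen here only for illustration.

    import torch
    import torch.nn as nn

    class MaxFeatureMap(nn.Module):
        # Max-feature-map activation: split the channels in half and take the element-wise maximum.
        def forward(self, x):
            a, b = torch.chunk(x, 2, dim=1)
            return torch.max(a, b)

    class LightCNN(nn.Module):
        # Lightweight CNN: conv + MFM + max pooling + batch norm, an 80-dim embedding
        # and an N-way head (real speech plus N-1 generated-speech types).
        def __init__(self, n_classes=50, emb_dim=80):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(1, 32, 5, padding=2), MaxFeatureMap(),
                nn.MaxPool2d(2), nn.BatchNorm2d(16),
                nn.Conv2d(16, 64, 3, padding=1), MaxFeatureMap(),
                nn.MaxPool2d(2), nn.BatchNorm2d(32),
                nn.Conv2d(32, 64, 3, padding=1), MaxFeatureMap(),
                nn.MaxPool2d(2), nn.BatchNorm2d(32),
                nn.AdaptiveAvgPool2d(1),
            )
            self.embedding = nn.Linear(32, emb_dim)          # 80-dim embedded feature vector
            self.classifier = nn.Linear(emb_dim, n_classes)  # 50-way generation-type output

        def forward(self, x):
            # x: batch of LFCC features shaped (batch, 1, frames, coefficients)
            h = self.features(x).flatten(1)
            emb = self.embedding(h)
            return emb, self.classifier(emb)

Returning both the embedding and the logits makes it straightforward to apply the real-speech feature consistency constraint described later.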
the specific method for training the deep learning network model is as follows: the model is trained for 150 epochs with an adaptive moment estimation (Adam) optimizer, an initial learning rate of 0.001 and a batch size of 128;
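A sketch of this source-domain training loop is shown below; the LightCNN interface and the (features, label) data loader are the illustrative pieces introduced above rather than components required by the text, and the batch size of 128 is assumed to be set when the loader is built.

    import torch
    import torch.nn as nn

    def train_source_model(model, train_loader, epochs=150, lr=1e-3, device="cpu"):
        model = model.to(device)
        model.train()
        optimizer = torch.optim.Adam(model.parameters(), lr=lr)  # adaptive moment estimation
        criterion = nn.CrossEntropyLoss()                        # N-way generation-type classification
        for _ in range(epochs):
            for lfcc, label in train_loader:                     # batches of LFCC features and labels
                lfcc, label = lfcc.to(device), label.to(device)
                _, logits = model(lfcc)
                loss = criterion(logits, label)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
        return model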
model updating is based on the source domain model trained in the previous step: further training is carried out with new data on the basis of that model, i.e. the target domain model is initialized with the parameters of the previously trained source domain model. The common model fine-tuning operation directly optimizes the model parameters with the new data and a cross-entropy or other loss function. The present method is instead based on continual learning: a regularization constraint is added to the original loss function to constrain the direction of model parameter optimization;
as shown in FIG. 2, the advantage of continual learning over direct fine-tuning is as follows: fine-tuning is often accompanied by 'catastrophic forgetting' of the original data knowledge, i.e. with a simple fine-tuning strategy the final optimization result easily falls into a region that differs markedly from the original task, so the model's performance on the original task drops sharply; with the continual learning method, the updated model parameters remain close to the optimal parameter region of the original model, so the model retains its memory of the old task while learning the new one;
S3: a regularization loss is added to the training loss function of the source domain model to constrain the direction of model parameter optimization; the regularization loss comprises a distillation regularization loss and a real-speech feature distribution consistency constraint. The newly collected audio data of the new generation type are used: the corresponding acoustic features (LFCC) of the new data are extracted, an adaptive moment estimation (Adam) optimizer is again selected, the initial learning rate is set to 0.0001, the batch size is 64, training runs for 20 epochs, and the model parameters of the source domain model are updated to obtain the target domain model;
the training loss function L_total of the target domain model is:

L_total = L_original + α·L_distillation + β·L_real

wherein:
L_original: the training loss function of the source domain model;
L_distillation: the distillation regularization loss;
α: the weight of the distillation regularization loss, α = 0.7;
L_real: the real-speech feature distribution consistency constraint;
β: the weight of the real-speech feature distribution consistency constraint, β = 1.2;
the distillation regularization loss is given by:

L_distillation = - Σ_i q_i^(s) · log q_i^(t)

q_i^(s) = (y_i^(s))^(1/T) / Σ_j (y_j^(s))^(1/T)

q_i^(t) = (y_i^(t))^(1/T) / Σ_j (y_j^(t))^(1/T)

wherein:
y_i^(s): the prediction probability of the i-th class output by the source domain model for the newly collected audio data;
Σ_j (y_j^(s))^(1/T): the accumulation over the outputs of the source domain model for the newly collected audio data;
y_i^(t): the prediction probability of the i-th class output by the target domain model;
Σ_j (y_j^(t))^(1/T): the accumulation over the outputs of the target domain model;
T: the temperature hyperparameter, T = 2;
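A minimal PyTorch sketch of a distillation term of this kind follows; it assumes the temperature-softened cross-entropy form written above (softmax of the logits divided by T is equivalent to raising the prediction probabilities to the power 1/T and renormalizing).

    import torch
    import torch.nn.functional as F

    def distillation_loss(source_logits, target_logits, T=2.0):
        # Temperature-softened probabilities of the source domain model (kept fixed).
        q_src = F.softmax(source_logits.detach() / T, dim=1)
        # Log of the temperature-softened probabilities of the target domain model (being trained).
        log_q_tgt = F.log_softmax(target_logits / T, dim=1)
        # Cross-entropy between the softened source and target distributions.
        return -(q_src * log_q_tgt).sum(dim=1).mean()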
the real-speech feature distribution consistency constraint is given by:

L_real = (1/M) · Σ_{k=1..M} ( 1 - (e_k^(s) · e_k^(t)) / (‖e_k^(s)‖ · ‖e_k^(t)‖) )

wherein:
M: the total number of real speech utterances;
e_k^(s): the embedded feature vector of the k-th real speech utterance output by the source domain model;
e_k^(t): the embedded feature vector of the k-th real speech utterance output by the target domain model;
‖e_k^(s)‖: the norm of the embedded feature vector of the k-th real speech utterance output by the source domain model;
‖e_k^(t)‖: the norm of the embedded feature vector of the k-th real speech utterance output by the target domain model.
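The sketch below implements the consistency constraint in the cosine form written above (a form inferred from the defined inner products and norms) together with the combined loss L_total = L_original + α·L_distillation + β·L_real using α = 0.7 and β = 1.2.

    import torch
    import torch.nn.functional as F

    def real_consistency_loss(source_emb, target_emb):
        # Average (1 - cosine similarity) between the source and target embedded
        # feature vectors of the real speech utterances in the batch.
        cos = F.cosine_similarity(source_emb, target_emb, dim=1)
        return (1.0 - cos).mean()

    def total_loss(original_loss, dist_loss, real_loss, alpha=0.7, beta=1.2):
        # L_total = L_original + alpha * L_distillation + beta * L_real
        return original_loss + alpha * dist_loss + beta * real_loss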
Examples
In the ASVspoof 2019 LA data set, four generated-speech types, A13, A17, A10 and A19, and real speech are selected, and a lightweight convolutional neural network with 50 output classes is used;
first, the network is trained on A13 together with real speech (label 0 for real speech, label 1 for A13) to obtain Model 1; then, on this basis, A17 and real speech are used (label 0 for real speech, label 2 for A17) to further update the model, with training loss L_total = L_original + α·L_distillation + β·L_real, α = 0.7, β = 1.2, yielding Model 2; for comparison, Fine-tune 2 is obtained by directly fine-tuning Model 1 on A17 and real speech;
then, on this basis, A10 and real speech are used (label 0 for real speech, label 3 for A10) to further update the model with the same training loss (α = 0.7, β = 1.2), yielding Model 3; for comparison, Fine-tune 3 is obtained by directly fine-tuning Fine-tune 2 on A10 and real speech;
finally, A19 and real speech are used (label 0 for real speech, label 4 for A19) to further update the model with the same training loss (α = 0.7, β = 1.2), yielding Model 4; for comparison, Fine-tune 4 is obtained by directly fine-tuning Fine-tune 3 on A19 and real speech;
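One incremental update step of the kind used in this example can be sketched as follows. It reuses the illustrative helpers defined earlier (distillation_loss, real_consistency_loss and the embedding/logits model interface) and follows the 20-epoch, learning-rate 0.0001, batch-size 64 settings from S3; restricting the consistency term to the real-speech samples (label 0) in each batch is an illustrative simplification.

    import copy
    import torch
    import torch.nn as nn

    def continual_update(model, new_loader, epochs=20, lr=1e-4,
                         alpha=0.7, beta=1.2, T=2.0, device="cpu"):
        source = copy.deepcopy(model).to(device).eval()   # frozen source domain model
        target = model.to(device).train()                 # target domain model, initialized from source
        optimizer = torch.optim.Adam(target.parameters(), lr=lr)
        ce = nn.CrossEntropyLoss()
        for _ in range(epochs):
            for lfcc, label in new_loader:                # new generation type + real speech (label 0)
                lfcc, label = lfcc.to(device), label.to(device)
                with torch.no_grad():
                    src_emb, src_logits = source(lfcc)
                tgt_emb, tgt_logits = target(lfcc)
                loss = ce(tgt_logits, label)              # L_original on the new data
                loss = loss + alpha * distillation_loss(src_logits, tgt_logits, T)
                real = label == 0                         # consistency constraint on real speech only
                if real.any():
                    loss = loss + beta * real_consistency_loss(src_emb[real], tgt_emb[real])
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
        return target

Applying continual_update once per new generation type (A17, then A10, then A19) reproduces the Model 2, Model 3, Model 4 sequence described above; dropping the two regularization terms reduces the step to direct fine-tuning.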
the EER of Model 1 on A13, A17, A10 and A19 is tested and averaged to obtain avg_eer1;
the EER of Model 2 on A13, A17, A10 and A19 is tested and averaged to obtain avg_eer2;
the EER of Model 3 on A13, A17, A10 and A19 is tested and averaged to obtain avg_eer3;
the EER of Model 4 on A13, A17, A10 and A19 is tested and averaged to obtain avg_eer4;
the EER of Fine-tune 2 on A13, A17, A10 and A19 is tested and averaged to obtain avg_eer_2;
the EER of Fine-tune 3 on A13, A17, A10 and A19 is tested and averaged to obtain avg_eer_3;
the EER of Fine-tune 4 on A13, A17, A10 and A19 is tested and averaged to obtain avg_eer_4;
the evaluation index in the following table is the average EER (average equal error rate); the lower the average EER, the better the effect. The results are as follows:

                     After adding A13   After adding A17   After adding A10   After adding A19
    Present method   0.811              3.834              4.312              8.032
    Fine-tuning      0.811              14.422             14.404             43.719

It can be seen that the present method is superior to direct fine-tuning.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in this specification and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It is to be understood that although the terms first, second, third, etc. may be used herein to describe various information, these information should not be limited to these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the present invention. The word "if" as used herein may be interpreted as "at … …" or "when … …" or "in response to a determination", depending on the context.
While this specification contains many specific implementation details, these should not be construed as limitations on the scope of any invention or of what may be claimed, but rather as descriptions of features specific to particular embodiments of particular inventions. Certain features that are described in this specification in the context of separate embodiments can also be implemented in combination in a single embodiment. In other instances, features described in connection with one embodiment may be implemented as discrete components or in any suitable subcombination. Moreover, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or variation of a subcombination.
Similarly, while operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In some cases, multitasking and parallel processing may be advantageous. Moreover, the separation of various system modules and components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.
Thus, particular embodiments of the subject matter have been described. Other embodiments are within the scope of the following claims. In some cases, the actions recited in the claims can be performed in a different order and still achieve desirable results. Further, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some implementations, multitasking and parallel processing may be advantageous.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (5)

1. A method for generating voice features through continuous learning without original data storage, the method comprising:
S1: collecting audio data and extracting audio acoustic features to obtain LFCC features, i.e. linear frequency cepstral coefficients;
S2: training a deep learning network model with the LFCC features to obtain a source domain model;
s3: adding regularization loss on the basis of a training loss function of a source domain model, constraining the direction of model parameter optimization, and updating model parameters of the source domain model by using newly generated audio data to obtain a target domain model;
the deep learning network model is a lightweight convolutional neural network;
the regularization loss includes a distillation regularization loss and a true speech feature distribution consistency constraint,
training loss function for target domain modelL Total lossComprises the following steps:
Figure 342570DEST_PATH_IMAGE001
wherein the content of the first and second substances,
L original: a training loss function of the source domain model;
L distillation: distillation regularization loss;
α: weight loss by distillation regularization of 0.5. ltoreqα≤1;
L Reality (reality): constraint of consistency of distribution of real voice features;
β: the weight of the constraint of the consistency of the distribution of the real voice features is more than or equal to 1β≤1.5,
The specific formula of the distillation regularization loss is as follows:
Figure 541470DEST_PATH_IMAGE002
Figure 210349DEST_PATH_IMAGE003
Figure 390663DEST_PATH_IMAGE004
wherein the content of the first and second substances,
Figure 144993DEST_PATH_IMAGE005
: for newly collected audio data, the prediction probability of the sample of the ith category output by the source domain model is obtained;
Figure 882004DEST_PATH_IMAGE006
: accumulation of the output of the source domain model for newly acquired audio data;
Figure 343073DEST_PATH_IMAGE007
: the prediction probability of the sample of the ith category, which is output by the target domain model;
Figure 772917DEST_PATH_IMAGE008
: accumulation of the outputs of the target domain model;
T: the temperature of the liquid crystal is over-parameter,
the parameter of the temperature over-parameter is set to be more than or equal to 1T≤2。
2. The method for generating voice features through continuous learning without original data storage according to claim 1, wherein the specific method for extracting the audio acoustic features to obtain the LFCC features comprises:
sampling the collected audio data to obtain raw waveform points, and then performing pre-emphasis, framing, windowing and fast Fourier transform to obtain a Fourier power spectrum;
passing the Fourier power spectrum through a linear filter bank and performing a DCT (discrete cosine transform) to obtain the 60-dimensional LFCC (linear frequency cepstral coefficient) features of the audio;
where the window length is 25 frames and a 512-point FFT is performed.
3. The method for generating voice features through continuous learning without original data storage according to claim 1, wherein the lightweight convolutional neural network finally outputs an N-way classification result through a fully connected layer, covering real speech and N-1 different types of generated speech.
4. The method for generating voice features through continuous learning without original data storage according to claim 3, wherein N is set to 50.
5. The method for generating voice features through continuous learning without original data storage according to claim 1, wherein the real-speech feature distribution consistency constraint is given by:

L_real = (1/M) · Σ_{k=1..M} ( 1 - (e_k^(s) · e_k^(t)) / (‖e_k^(s)‖ · ‖e_k^(t)‖) )

wherein:
M: the total number of real speech utterances;
e_k^(s): the embedded feature vector of the k-th real speech utterance output by the source domain model;
e_k^(t): the embedded feature vector of the k-th real speech utterance output by the target domain model;
‖e_k^(s)‖: the norm of the embedded feature vector of the k-th real speech utterance output by the source domain model;
‖e_k^(t)‖: the norm of the embedded feature vector of the k-th real speech utterance output by the target domain model.
CN202110852843.4A 2021-07-27 2021-07-27 Method for generating voice features through continuous learning without original data storage Active CN113299315B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110852843.4A CN113299315B (en) 2021-07-27 2021-07-27 Method for generating voice features through continuous learning without original data storage

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110852843.4A CN113299315B (en) 2021-07-27 2021-07-27 Method for generating voice features through continuous learning without original data storage

Publications (2)

Publication Number Publication Date
CN113299315A CN113299315A (en) 2021-08-24
CN113299315B true CN113299315B (en) 2021-10-15

Family

ID=77331197

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110852843.4A Active CN113299315B (en) 2021-07-27 2021-07-27 Method for generating voice features through continuous learning without original data storage

Country Status (1)

Country Link
CN (1) CN113299315B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113488027A (en) * 2021-09-08 2021-10-08 中国科学院自动化研究所 Hierarchical classification generated audio tracing method, storage medium and computer equipment
CN115938390B (en) * 2023-01-06 2023-06-30 中国科学院自动化研究所 Continuous learning method and device for generating voice identification model and electronic equipment

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018160943A1 (en) * 2017-03-03 2018-09-07 Pindrop Security, Inc. Method and apparatus for detecting spoofing conditions
CN108022589A (en) * 2017-10-31 2018-05-11 努比亚技术有限公司 Aiming field classifier training method, specimen discerning method, terminal and storage medium
CN110287374B (en) * 2019-06-14 2023-01-03 天津大学 Self-attention video abstraction method based on distribution consistency
CN110428845A (en) * 2019-07-24 2019-11-08 厦门快商通科技股份有限公司 Composite tone detection method, system, mobile terminal and storage medium
CN111564163B (en) * 2020-05-08 2023-12-15 宁波大学 RNN-based multiple fake operation voice detection method
CN111667016B (en) * 2020-06-12 2023-01-06 中国电子科技集团公司第三十六研究所 Incremental information classification method based on prototype
CN111723203A (en) * 2020-06-15 2020-09-29 苏州意能通信息技术有限公司 Text classification method based on lifetime learning
CN111797844A (en) * 2020-07-20 2020-10-20 苏州思必驰信息科技有限公司 Adaptive model training method for antagonistic domain and adaptive model for antagonistic domain
CN112712809B (en) * 2021-03-29 2021-06-18 北京远鉴信息技术有限公司 Voice detection method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN113299315A (en) 2021-08-24

Similar Documents

Publication Publication Date Title
CN113299315B (en) Method for generating voice features through continuous learning without original data storage
US7724961B2 (en) Method for classifying data using an analytic manifold
CN111582320B (en) Dynamic individual identification method based on semi-supervised learning
CN111564163B (en) RNN-based multiple fake operation voice detection method
US20030236661A1 (en) System and method for noise-robust feature extraction
JP6904483B2 (en) Pattern recognition device, pattern recognition method, and pattern recognition program
CN110942091B (en) Semi-supervised few-sample image classification method for searching reliable abnormal data center
CN110940523B (en) Unsupervised domain adaptive fault diagnosis method
Bahari et al. Speaker age estimation and gender detection based on supervised non-negative matrix factorization
CN114503131A (en) Search device, search method, search program, and learning model search system
CN111310719B (en) Unknown radiation source individual identification and detection method
Prasad et al. Improving the performance of speech clustering method
CN116935892A (en) Industrial valve anomaly detection method based on audio key feature dynamic aggregation
Kamaruddin et al. Features extraction for speech emotion
CN108009434A (en) Rich model Stego-detection Feature Selection Algorithms based on rough set α-positive domain reduction
CN117423344A (en) Voiceprint recognition method and device based on neural network
CN112735442B (en) Wetland ecology monitoring system with audio separation voiceprint recognition function and audio separation method thereof
Sunu et al. Dimensionality reduction for acoustic vehicle classification with spectral embedding
CN113488027A (en) Hierarchical classification generated audio tracing method, storage medium and computer equipment
CN111667000A (en) Earthquake early warning method of adaptive field deep neural network
CN118098288B (en) Weak supervision voice depression detection method based on self-learning label correction
CN112700792B (en) Audio scene identification and classification method
CN116230012B (en) Two-stage abnormal sound detection method based on metadata comparison learning pre-training
CN116522240A (en) Open-set radiation source individual identification method based on self-adaptive threshold
CN117668645A (en) Water surface underwater target discrimination method based on multi-model fusion

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant