CN113299315B - Method for generating voice features through continuous learning without original data storage - Google Patents


Info

Publication number
CN113299315B
CN113299315B (application CN202110852843.4A)
Authority
CN
China
Prior art keywords
model
domain model
features
source domain
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110852843.4A
Other languages
Chinese (zh)
Other versions
CN113299315A (en)
Inventor
陶建华
马浩鑫
易江燕
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Automation of Chinese Academy of Science
Original Assignee
Institute of Automation of Chinese Academy of Science
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Automation of Chinese Academy of Science filed Critical Institute of Automation of Chinese Academy of Science
Priority to CN202110852843.4A priority Critical patent/CN113299315B/en
Publication of CN113299315A publication Critical patent/CN113299315A/en
Application granted granted Critical
Publication of CN113299315B publication Critical patent/CN113299315B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00
    • G10L 25/27 characterised by the analysis technique
    • G10L 25/30 characterised by the analysis technique using neural networks
    • G10L 25/03 characterised by the type of extracted parameters
    • G10L 25/12 characterised by the type of extracted parameters, the extracted parameters being prediction coefficients
    • G10L 25/21 characterised by the type of extracted parameters, the extracted parameters being power information

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a method for generating voice features through continuous learning without storing the original data, comprising the following steps: collecting audio data and extracting audio acoustic features to obtain linear frequency cepstral coefficient (LFCC) features; training a deep learning network model with the LFCC features to obtain a source domain model; and adding a regularization loss to the training loss function of the source domain model to constrain the direction of model parameter optimization, and updating the model parameters of the source domain model with newly acquired audio data to obtain a target domain model.

Description

Method for generating voice features through continuous learning without original data storage
Technical Field
The invention relates to the fields of speech processing and image processing, and in particular to a method for generating voice features through continuous learning without original data storage.
Background
Generated-speech detection is the task of judging whether an audio signal is genuine human speech or generated speech produced by replay recording, speech synthesis or voice conversion technologies.
An existing generated-speech discrimination model trained on a specific data set has a greatly reduced ability to detect unknown generated speech that does not match its training data, i.e. its generalization performance is low.
Meanwhile, as speech synthesis and voice conversion technologies keep developing, numerous speech generation methods have emerged, yet existing generated-speech detection schemes all face insufficient model generalization: for generation types unknown at training time, no model is robust and well generalized enough. For example, a model trained on the ASVspoof 2019 LA data set performs far worse on the ASVspoof 2019 PA data set because that generation type is absent from training; likewise, a model trained on data produced by a limited set of speech synthesis techniques struggles to detect newly synthesized speech. Existing practice trains a generated-speech discrimination model once on the available generated-speech data; when a new generation method appears, the new data can be mixed with the original data and the model retrained, but as the amount of data grows this makes computation and storage grow linearly and the cost becomes excessive. Moreover, because of privacy protection requirements for sensitive data, long-term storage of the original data may not be possible; and for a generated-speech detection model that is continuously updated online, retraining on the old data is not feasible at all.
In view of the above problems, it is important that the model be able to continuously learn newly generated speech.
To improve the discrimination performance of the model on unknown generated speech, model fine-tuning, joint training, extraction of more generalized acoustic features and the like can also be considered.
With model fine-tuning, 'catastrophic forgetting' occurs when the original model is fine-tuned on new data, so performance on the original data set drops sharply; joint training incurs large time and computation overheads, and in some special situations the original data cannot be obtained because of privacy protection or other confidentiality reasons, so training on all the data together is impossible.
For the problem that model detection performance drops markedly on unknown data sets, several related lines of research exist:
1. Multi-model fusion: a separate generated-speech discrimination model is trained for each data set, the models are then fused, and a combined score is produced.
2. Bidirectional adversarial domain adaptation: an extension of domain-adversarial training in which two domain discriminators, one for genuine speech and one for generated speech, are added to the network; the method trains on labeled source-domain data and unlabeled target-domain data to improve performance on domain-mismatched data sets.
3. Extraction of other generalized features: a front-end feature extractor is designed from the perspective of traditional signal processing, in the hope of obtaining more generalizable features such as extended CQCC or CQSPIC coefficients.
However, these studies have drawbacks: multi-model fusion requires new and old data to be trained together, which increases the training cost; bidirectional adversarial domain adaptation focuses only on performance on the new data and neglects performance on the old data sets; and extracting other features cannot guarantee that the chosen feature performs well for all generated-speech types.
The continual learning problem is how to overcome the 'catastrophic forgetting' encountered in fine-tuning, so that the model can be updated continuously with new data only, learning new tasks while retaining its memory of old tasks.
In addition to the above problems, in practical applications users need to know not only whether speech is genuine but also the specific generation type. A simple binary classification is then not persuasive enough as the model's output judgment, so the original genuine/fake binary classification is replaced by a multi-class classification over generation types, which is more practical.
Publication number CN111564163A discloses an RNN-based speech detection method for multiple forgery operations, comprising the following steps: 1) an original speech sample is obtained and subjected to M kinds of forgery processing to obtain M forged utterances plus 1 unprocessed original utterance; features are extracted from these utterances to obtain the LFCC matrices of the training speech samples, which are fed into an RNN classifier network for training to obtain a multi-class training model; 2) a segment of test speech is obtained and its features are extracted to obtain the LFCC matrix of the test speech data, which is fed into the RNN classifier trained in step 1) for classification; an output probability is obtained for each test utterance and all output probabilities are combined into the final prediction: if the prediction is the original speech, the test speech is recognized as original speech; if the prediction is speech subjected to a certain forgery operation, the test speech is recognized as forged speech subjected to the corresponding forgery operation.
Publication number CN112712809B discloses a speech detection method, apparatus, electronic device and storage medium: a plurality of speech feature vectors are extracted from the speech to be detected; the feature vectors are input into a plurality of pre-trained speech source models, and a first matching degree between the speech to be detected and the source type of each speech source model is determined; for each speech class model, a second matching degree between the speech to be detected and the class type corresponding to that model is determined on the basis of the determined first matching degrees; and the class type and source type of the speech to be detected are determined from the first and second matching degrees.
The prior art has the following defects:
When a new generation type appears and the model needs updating, there are two common solutions, direct fine-tuning and retraining on mixed new and old data, but they suffer from high computation cost and long retraining time.
For unknown types of generated speech there are multi-model fusion and adaptive training methods, but these also have corresponding disadvantages.
1. Direct fine-tuning: training the existing model on new data improves its performance on the new data but greatly reduces its recognition performance on the previous data.
2. Retraining from scratch: the new and old data are combined and the model is retrained repeatedly; as the data keep growing, training takes longer and longer and the time cost and computation expense increase.
3. Multi-model fusion: every newly added data set requires an additional model, which brings storage overhead.
4. Domain-adversarial adaptation: new and old data must be trained together, and in some cases this method cannot be used because the old data are unavailable for reasons such as privacy and security.
In addition, the existing methods target genuine/fake binary classification, whereas in practical applications users need to know not only whether speech is genuine but also the specific generation type, so multi-class classification over generation types is of great significance.
Disclosure of Invention
In view of the above, the present invention provides a method for generating voice features through continuous learning without original data storage, the method comprising:
S1: collecting audio data and extracting audio acoustic features to obtain LFCC features;
S2: training a deep learning network model with the LFCC features to obtain a source domain model;
S3: adding a regularization loss to the training loss function of the source domain model to constrain the direction of model parameter optimization, and updating the model parameters of the source domain model with newly collected audio data of new generation types to obtain a target domain model.
Preferably, the specific method for extracting the audio acoustic features to obtain the LFCC features comprises: sampling the collected audio data to obtain raw waveform points, and then performing pre-emphasis, framing, windowing and fast Fourier transform to obtain a Fourier power spectrum;
passing the Fourier power spectrum through a linear filter bank and performing a DCT (discrete cosine transform) to obtain the 60-dimensional LFCC (linear frequency cepstral coefficient) features of the audio;
where the window length is 25 frames and a 512-point FFT is performed.
Preferably, the deep learning network model is a lightweight convolutional neural network.
Preferably, the lightweight convolutional neural network finally outputs an N-way classification result through a fully connected layer, covering real speech and N-1 different types of generated speech.
Preferably, N is set to 50.
Preferably, the regularization loss includes a distillation regularization loss and a real-speech feature distribution consistency constraint.
Preferably, the training loss function L_total of the target domain model is:

L_total = L_original + α·L_distillation + β·L_real

wherein:
L_original: the training loss function of the source domain model;
L_distillation: the distillation regularization loss;
α: the weight of the distillation regularization loss, 0.5 ≤ α ≤ 1;
L_real: the real-speech feature distribution consistency constraint;
β: the weight of the real-speech feature distribution consistency constraint, 1 ≤ β ≤ 1.5.
Preferably, the distillation regularization loss is given by:

L_distillation = - Σ_i q_i^(s) · log q_i^(t)

q_i^(s) = (y_i^(s))^(1/T) / Σ_j (y_j^(s))^(1/T)

q_i^(t) = (y_i^(t))^(1/T) / Σ_j (y_j^(t))^(1/T)

wherein:
y_i^(s): the prediction probability of the i-th class output by the source domain model for the newly collected audio data;
Σ_j (y_j^(s))^(1/T): the accumulation over the outputs of the source domain model for the newly collected audio data;
y_i^(t): the prediction probability of the i-th class output by the target domain model;
Σ_j (y_j^(t))^(1/T): the accumulation over the outputs of the target domain model;
T: the temperature hyperparameter.
Preferably, the temperature hyperparameter is set in the range 1 ≤ T ≤ 2.
Preferably, the real-speech feature distribution consistency constraint is given by:

L_real = (1/M) · Σ_{k=1..M} ( 1 - (e_k^(s) · e_k^(t)) / (‖e_k^(s)‖ · ‖e_k^(t)‖) )

wherein:
M: the total number of real speech utterances;
e_k^(s): the embedded feature vector of the k-th real speech utterance output by the source domain model;
e_k^(t): the embedded feature vector of the k-th real speech utterance output by the target domain model;
‖e_k^(s)‖: the norm of the embedded feature vector of the k-th real speech utterance output by the source domain model;
‖e_k^(t)‖: the norm of the embedded feature vector of the k-th real speech utterance output by the target domain model.
Compared with the prior art, the technical solution provided by the embodiments of the present application has the following advantages:
1. Time, computation and storage overhead: each model update uses only the most recently trained model and the new data;
2. Continuous incremental learning: the method matches the reality that generation methods are constantly updated, so the generated-speech discrimination model keeps evolving as generation technology develops;
3. The performance of the model on old data does not degrade unacceptably: although only new data are used, the drop on old data is limited rather than catastrophic, and is far better than direct fine-tuning;
4. The output is a multi-class generation category, so the specific generation type is detected.
Drawings
FIG. 1 is a flowchart of a method for generating voice features through continuous learning without original data storage according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of the advantages of continuous learning over direct fine-tuning according to an embodiment of the present invention.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described below do not represent all embodiments consistent with the present invention; rather, they are merely examples of apparatus and methods consistent with certain aspects of the invention, as recited in the appended claims.
As shown in FIG. 1, a method for generating voice features through continuous learning without original data storage according to an embodiment of the present application includes:
s1: the method comprises the following steps of collecting audio data, extracting audio acoustic features and obtaining LFCC features, and specifically comprises the following steps: sampling the collected audio data to obtain original waveform points, and then performing pre-emphasis, framing, windowing and fast Fourier transform to obtain a Fourier power spectrum;
performing DCT (discrete cosine transformation) on the Fourier power spectrum through a linear filter bank to obtain 60-dimensional LFCC (linear frequency feedback control) characteristics of the audio;
wherein the window length is 25 frames, and 512-dimensional FFT is carried out;
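A minimal sketch of this front end is given below in Python with numpy and scipy. The 16 kHz sampling rate, the 10 ms hop, the interpretation of the window length of 25 as 25 ms, the 20-filter linear filter bank and the split of the 60 dimensions into 20 static coefficients plus deltas and delta-deltas are illustrative assumptions that are not specified in the text.

    import numpy as np
    from scipy.fftpack import dct

    def extract_lfcc(signal, sr=16000, n_fft=512, win_ms=25, hop_ms=10,
                     n_filters=20, n_ceps=20, pre_emph=0.97):
        # Pre-emphasis
        sig = np.append(signal[0], signal[1:] - pre_emph * signal[:-1])
        # Framing (assumed 25 ms window, 10 ms hop) and Hamming windowing
        win = int(sr * win_ms / 1000)
        hop = int(sr * hop_ms / 1000)
        sig = np.pad(sig, (0, max(0, win - len(sig))))
        n_frames = 1 + (len(sig) - win) // hop
        idx = np.arange(win)[None, :] + hop * np.arange(n_frames)[:, None]
        frames = sig[idx] * np.hamming(win)
        # 512-point FFT -> Fourier power spectrum
        power = np.abs(np.fft.rfft(frames, n_fft)) ** 2
        # Triangular filters spaced linearly in frequency (hence "linear" cepstral coefficients)
        edges = np.linspace(0, sr / 2, n_filters + 2)
        bins = np.floor((n_fft / 2) * edges / (sr / 2)).astype(int)
        fbank = np.zeros((n_filters, n_fft // 2 + 1))
        for m in range(1, n_filters + 1):
            left, center, right = bins[m - 1], bins[m], bins[m + 1]
            fbank[m - 1, left:center] = (np.arange(left, center) - left) / max(center - left, 1)
            fbank[m - 1, center:right] = (right - np.arange(center, right)) / max(right - center, 1)
        # Log filter-bank energies followed by a DCT give the cepstral coefficients
        logfb = np.log(np.maximum(power @ fbank.T, 1e-10))
        static = dct(logfb, type=2, axis=1, norm='ortho')[:, :n_ceps]
        # The 60-dimensional feature is assumed here to be static + delta + delta-delta
        delta = np.gradient(static, axis=0)
        delta2 = np.gradient(delta, axis=0)
        return np.concatenate([static, delta, delta2], axis=1)  # shape (n_frames, 60)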
s2: training a deep learning network model by applying the LFCC characteristics to obtain a source domain model;
the deep learning network model is a lightweight convolutional neural network comprising convolutional layers, max-pooling layers, max-feature-map outputs and batch normalization operations; it finally yields an 80-dimensional embedded feature vector and outputs an N-way classification result through a fully connected layer, covering real speech and N-1 types of generated speech (speech generated by different recording devices, different vocoders and different methods), with classification heads reserved for generated-speech types that may appear in the future; N is preset to 50;
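The sketch below (PyTorch) illustrates a network of this kind: max-feature-map activations, convolution, max pooling and batch normalization, an 80-dimensional embedding and a 50-way classification head. The specific number of layers, kernel sizes and channel widths are not given in the text and are chosen here only for illustration.

    import torch
    import torch.nn as nn

    class MaxFeatureMap(nn.Module):
        # Max-feature-map activation: split the channels in half and take the element-wise maximum.
        def forward(self, x):
            a, b = torch.chunk(x, 2, dim=1)
            return torch.max(a, b)

    class LightCNN(nn.Module):
        # Lightweight CNN: conv + MFM + max pooling + batch norm, an 80-dim embedding
        # and an N-way head (real speech plus N-1 generated-speech types).
        def __init__(self, n_classes=50, emb_dim=80):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(1, 32, 5, padding=2), MaxFeatureMap(),
                nn.MaxPool2d(2), nn.BatchNorm2d(16),
                nn.Conv2d(16, 64, 3, padding=1), MaxFeatureMap(),
                nn.MaxPool2d(2), nn.BatchNorm2d(32),
                nn.Conv2d(32, 64, 3, padding=1), MaxFeatureMap(),
                nn.MaxPool2d(2), nn.BatchNorm2d(32),
                nn.AdaptiveAvgPool2d(1),
            )
            self.embedding = nn.Linear(32, emb_dim)          # 80-dim embedded feature vector
            self.classifier = nn.Linear(emb_dim, n_classes)  # 50-way generation-type output

        def forward(self, x):
            # x: batch of LFCC features shaped (batch, 1, frames, coefficients)
            h = self.features(x).flatten(1)
            emb = self.embedding(h)
            return emb, self.classifier(emb)

Returning both the embedding and the logits makes it straightforward to apply the real-speech feature consistency constraint described later.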
the specific method for training the deep learning network model is as follows: the model is trained for 150 epochs with an adaptive moment estimation (Adam) optimizer, an initial learning rate of 0.001 and a batch size of 128;
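A sketch of this source-domain training loop is shown below; the LightCNN interface and the (features, label) data loader are the illustrative pieces introduced above rather than components required by the text, and the batch size of 128 is assumed to be set when the loader is built.

    import torch
    import torch.nn as nn

    def train_source_model(model, train_loader, epochs=150, lr=1e-3, device="cpu"):
        model = model.to(device)
        model.train()
        optimizer = torch.optim.Adam(model.parameters(), lr=lr)  # adaptive moment estimation
        criterion = nn.CrossEntropyLoss()                        # N-way generation-type classification
        for _ in range(epochs):
            for lfcc, label in train_loader:                     # batches of LFCC features and labels
                lfcc, label = lfcc.to(device), label.to(device)
                _, logits = model(lfcc)
                loss = criterion(logits, label)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
        return model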
model updating is based on the source domain model trained in the previous step: further training is carried out with new data on the basis of that model, i.e. the target domain model is initialized with the parameters of the previously trained source domain model. The common model fine-tuning operation directly optimizes the model parameters with the new data and a cross-entropy or other loss function. The present method is instead based on continual learning: a regularization constraint is added to the original loss function to constrain the direction of model parameter optimization;
as shown in FIG. 2, the advantage of continual learning over direct fine-tuning is as follows: fine-tuning is often accompanied by 'catastrophic forgetting' of the original data knowledge, i.e. with a simple fine-tuning strategy the final optimization result easily falls into a region that differs markedly from the original task, so the model's performance on the original task drops sharply; with the continual learning method, the updated model parameters remain close to the optimal parameter region of the original model, so the model retains its memory of the old task while learning the new one;
S3: a regularization loss is added to the training loss function of the source domain model to constrain the direction of model parameter optimization; the regularization loss comprises a distillation regularization loss and a real-speech feature distribution consistency constraint. The newly collected audio data of the new generation type are used: the corresponding acoustic features (LFCC) of the new data are extracted, an adaptive moment estimation (Adam) optimizer is again selected, the initial learning rate is set to 0.0001, the batch size is 64, training runs for 20 epochs, and the model parameters of the source domain model are updated to obtain the target domain model;
the training loss function L_total of the target domain model is:

L_total = L_original + α·L_distillation + β·L_real

wherein:
L_original: the training loss function of the source domain model;
L_distillation: the distillation regularization loss;
α: the weight of the distillation regularization loss, α = 0.7;
L_real: the real-speech feature distribution consistency constraint;
β: the weight of the real-speech feature distribution consistency constraint, β = 1.2;
the distillation regularization loss is given by:

L_distillation = - Σ_i q_i^(s) · log q_i^(t)

q_i^(s) = (y_i^(s))^(1/T) / Σ_j (y_j^(s))^(1/T)

q_i^(t) = (y_i^(t))^(1/T) / Σ_j (y_j^(t))^(1/T)

wherein:
y_i^(s): the prediction probability of the i-th class output by the source domain model for the newly collected audio data;
Σ_j (y_j^(s))^(1/T): the accumulation over the outputs of the source domain model for the newly collected audio data;
y_i^(t): the prediction probability of the i-th class output by the target domain model;
Σ_j (y_j^(t))^(1/T): the accumulation over the outputs of the target domain model;
T: the temperature hyperparameter, T = 2;
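A minimal PyTorch sketch of a distillation term of this kind follows; it assumes the temperature-softened cross-entropy form written above (softmax of the logits divided by T is equivalent to raising the prediction probabilities to the power 1/T and renormalizing).

    import torch
    import torch.nn.functional as F

    def distillation_loss(source_logits, target_logits, T=2.0):
        # Temperature-softened probabilities of the source domain model (kept fixed).
        q_src = F.softmax(source_logits.detach() / T, dim=1)
        # Log of the temperature-softened probabilities of the target domain model (being trained).
        log_q_tgt = F.log_softmax(target_logits / T, dim=1)
        # Cross-entropy between the softened source and target distributions.
        return -(q_src * log_q_tgt).sum(dim=1).mean()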
the real-speech feature distribution consistency constraint is given by:

L_real = (1/M) · Σ_{k=1..M} ( 1 - (e_k^(s) · e_k^(t)) / (‖e_k^(s)‖ · ‖e_k^(t)‖) )

wherein:
M: the total number of real speech utterances;
e_k^(s): the embedded feature vector of the k-th real speech utterance output by the source domain model;
e_k^(t): the embedded feature vector of the k-th real speech utterance output by the target domain model;
‖e_k^(s)‖: the norm of the embedded feature vector of the k-th real speech utterance output by the source domain model;
‖e_k^(t)‖: the norm of the embedded feature vector of the k-th real speech utterance output by the target domain model.
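The sketch below implements the consistency constraint in the cosine form written above (a form inferred from the defined inner products and norms) together with the combined loss L_total = L_original + α·L_distillation + β·L_real using α = 0.7 and β = 1.2.

    import torch
    import torch.nn.functional as F

    def real_consistency_loss(source_emb, target_emb):
        # Average (1 - cosine similarity) between the source and target embedded
        # feature vectors of the real speech utterances in the batch.
        cos = F.cosine_similarity(source_emb, target_emb, dim=1)
        return (1.0 - cos).mean()

    def total_loss(original_loss, dist_loss, real_loss, alpha=0.7, beta=1.2):
        # L_total = L_original + alpha * L_distillation + beta * L_real
        return original_loss + alpha * dist_loss + beta * real_loss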
Examples
In the ASVspoof 2019 LA data set, four generated-speech types, A13, A17, A10 and A19, and real speech are selected, and a lightweight convolutional neural network with 50 output classes is used;
first, the network is trained on A13 together with real speech (label 0 for real speech, label 1 for A13) to obtain Model 1; then, on this basis, A17 and real speech are used (label 0 for real speech, label 2 for A17) to further update the model, with training loss L_total = L_original + α·L_distillation + β·L_real, α = 0.7, β = 1.2, yielding Model 2; for comparison, Fine-tune 2 is obtained by directly fine-tuning Model 1 on A17 and real speech;
then, on this basis, A10 and real speech are used (label 0 for real speech, label 3 for A10) to further update the model with the same training loss (α = 0.7, β = 1.2), yielding Model 3; for comparison, Fine-tune 3 is obtained by directly fine-tuning Fine-tune 2 on A10 and real speech;
finally, A19 and real speech are used (label 0 for real speech, label 4 for A19) to further update the model with the same training loss (α = 0.7, β = 1.2), yielding Model 4; for comparison, Fine-tune 4 is obtained by directly fine-tuning Fine-tune 3 on A19 and real speech;
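One incremental update step of the kind used in this example can be sketched as follows. It reuses the illustrative helpers defined earlier (distillation_loss, real_consistency_loss and the embedding/logits model interface) and follows the 20-epoch, learning-rate 0.0001, batch-size 64 settings from S3; restricting the consistency term to the real-speech samples (label 0) in each batch is an illustrative simplification.

    import copy
    import torch
    import torch.nn as nn

    def continual_update(model, new_loader, epochs=20, lr=1e-4,
                         alpha=0.7, beta=1.2, T=2.0, device="cpu"):
        source = copy.deepcopy(model).to(device).eval()   # frozen source domain model
        target = model.to(device).train()                 # target domain model, initialized from source
        optimizer = torch.optim.Adam(target.parameters(), lr=lr)
        ce = nn.CrossEntropyLoss()
        for _ in range(epochs):
            for lfcc, label in new_loader:                # new generation type + real speech (label 0)
                lfcc, label = lfcc.to(device), label.to(device)
                with torch.no_grad():
                    src_emb, src_logits = source(lfcc)
                tgt_emb, tgt_logits = target(lfcc)
                loss = ce(tgt_logits, label)              # L_original on the new data
                loss = loss + alpha * distillation_loss(src_logits, tgt_logits, T)
                real = label == 0                         # consistency constraint on real speech only
                if real.any():
                    loss = loss + beta * real_consistency_loss(src_emb[real], tgt_emb[real])
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
        return target

Applying continual_update once per new generation type (A17, then A10, then A19) reproduces the Model 2, Model 3, Model 4 sequence described above; dropping the two regularization terms reduces the step to direct fine-tuning.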
the EER of Model 1 on A13, A17, A10 and A19 is tested and averaged to obtain avg_eer1;
the EER of Model 2 on A13, A17, A10 and A19 is tested and averaged to obtain avg_eer2;
the EER of Model 3 on A13, A17, A10 and A19 is tested and averaged to obtain avg_eer3;
the EER of Model 4 on A13, A17, A10 and A19 is tested and averaged to obtain avg_eer4;
the EER of Fine-tune 2 on A13, A17, A10 and A19 is tested and averaged to obtain avg_eer_2;
the EER of Fine-tune 3 on A13, A17, A10 and A19 is tested and averaged to obtain avg_eer_3;
the EER of Fine-tune 4 on A13, A17, A10 and A19 is tested and averaged to obtain avg_eer_4;
the evaluation index in the following table is the average EER (average equal error rate); the lower the average EER, the better the effect. The results are as follows:

                     After adding A13   After adding A17   After adding A10   After adding A19
    Present method   0.811              3.834              4.312              8.032
    Fine-tuning      0.811              14.422             14.404             43.719

It can be seen that the present method is superior to direct fine-tuning.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in this specification and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It is to be understood that although the terms first, second, third, etc. may be used herein to describe various information, these information should not be limited to these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the present invention. The word "if" as used herein may be interpreted as "at … …" or "when … …" or "in response to a determination", depending on the context.
While this specification contains many specific implementation details, these should not be construed as limitations on the scope of any invention or of what may be claimed, but rather as descriptions of features specific to particular embodiments of particular inventions. Certain features that are described in this specification in the context of separate embodiments can also be implemented in combination in a single embodiment. In other instances, features described in connection with one embodiment may be implemented as discrete components or in any suitable subcombination. Moreover, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or variation of a subcombination.
Similarly, while operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In some cases, multitasking and parallel processing may be advantageous. Moreover, the separation of various system modules and components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.
Thus, particular embodiments of the subject matter have been described. Other embodiments are within the scope of the following claims. In some cases, the actions recited in the claims can be performed in a different order and still achieve desirable results. Further, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some implementations, multitasking and parallel processing may be advantageous.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (5)

1. A method for generating voice features through continuous learning without original data storage, the method comprising:
S1: collecting audio data and extracting audio acoustic features to obtain LFCC features, i.e. linear frequency cepstral coefficients;
S2: training a deep learning network model with the LFCC features to obtain a source domain model;
s3: adding regularization loss on the basis of a training loss function of a source domain model, constraining the direction of model parameter optimization, and updating model parameters of the source domain model by using newly generated audio data to obtain a target domain model;
the deep learning network model is a lightweight convolutional neural network;
the regularization loss includes a distillation regularization loss and a true speech feature distribution consistency constraint,
training loss function for target domain modelL Total lossComprises the following steps:
Figure 342570DEST_PATH_IMAGE001
wherein the content of the first and second substances,
L original: a training loss function of the source domain model;
L distillation: distillation regularization loss;
α: weight loss by distillation regularization of 0.5. ltoreqα≤1;
L Reality (reality): constraint of consistency of distribution of real voice features;
β: the weight of the constraint of the consistency of the distribution of the real voice features is more than or equal to 1β≤1.5,
The specific formula of the distillation regularization loss is as follows:
Figure 541470DEST_PATH_IMAGE002
Figure 210349DEST_PATH_IMAGE003
Figure 390663DEST_PATH_IMAGE004
wherein the content of the first and second substances,
Figure 144993DEST_PATH_IMAGE005
: for newly collected audio data, the prediction probability of the sample of the ith category output by the source domain model is obtained;
Figure 882004DEST_PATH_IMAGE006
: accumulation of the output of the source domain model for newly acquired audio data;
Figure 343073DEST_PATH_IMAGE007
: the prediction probability of the sample of the ith category, which is output by the target domain model;
Figure 772917DEST_PATH_IMAGE008
: accumulation of the outputs of the target domain model;
T: the temperature of the liquid crystal is over-parameter,
the parameter of the temperature over-parameter is set to be more than or equal to 1T≤2。
2. The method for generating voice features through continuous learning without original data storage according to claim 1, wherein the specific method for extracting the audio acoustic features to obtain the LFCC features comprises:
sampling the collected audio data to obtain raw waveform points, and then performing pre-emphasis, framing, windowing and fast Fourier transform to obtain a Fourier power spectrum;
passing the Fourier power spectrum through a linear filter bank and performing a DCT (discrete cosine transform) to obtain the 60-dimensional LFCC (linear frequency cepstral coefficient) features of the audio;
where the window length is 25 frames and a 512-point FFT is performed.
3. The method for generating voice features through continuous learning without original data storage according to claim 1, wherein the lightweight convolutional neural network finally outputs an N-way classification result through a fully connected layer, covering real speech and N-1 different types of generated speech.
4. The method for generating voice features through continuous learning without original data storage according to claim 3, wherein N is set to 50.
5. The method for generating voice features through continuous learning without original data storage according to claim 1, wherein the real-speech feature distribution consistency constraint is given by:

L_real = (1/M) · Σ_{k=1..M} ( 1 - (e_k^(s) · e_k^(t)) / (‖e_k^(s)‖ · ‖e_k^(t)‖) )

wherein:
M: the total number of real speech utterances;
e_k^(s): the embedded feature vector of the k-th real speech utterance output by the source domain model;
e_k^(t): the embedded feature vector of the k-th real speech utterance output by the target domain model;
‖e_k^(s)‖: the norm of the embedded feature vector of the k-th real speech utterance output by the source domain model;
‖e_k^(t)‖: the norm of the embedded feature vector of the k-th real speech utterance output by the target domain model.
CN202110852843.4A 2021-07-27 2021-07-27 Method for generating voice features through continuous learning without original data storage Active CN113299315B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110852843.4A CN113299315B (en) 2021-07-27 2021-07-27 Method for generating voice features through continuous learning without original data storage

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110852843.4A CN113299315B (en) 2021-07-27 2021-07-27 Method for generating voice features through continuous learning without original data storage

Publications (2)

Publication Number Publication Date
CN113299315A CN113299315A (en) 2021-08-24
CN113299315B true CN113299315B (en) 2021-10-15

Family

ID=77331197

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110852843.4A Active CN113299315B (en) 2021-07-27 2021-07-27 Method for generating voice features through continuous learning without original data storage

Country Status (1)

Country Link
CN (1) CN113299315B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113488027A (en) * 2021-09-08 2021-10-08 中国科学院自动化研究所 Hierarchical classification generated audio tracing method, storage medium and computer equipment
CN115938390B (en) * 2023-01-06 2023-06-30 中国科学院自动化研究所 Continuous learning method and device for generating voice identification model and electronic equipment

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018160943A1 (en) * 2017-03-03 2018-09-07 Pindrop Security, Inc. Method and apparatus for detecting spoofing conditions
CN108022589A (en) * 2017-10-31 2018-05-11 努比亚技术有限公司 Aiming field classifier training method, specimen discerning method, terminal and storage medium
CN110287374B (en) * 2019-06-14 2023-01-03 天津大学 Self-attention video abstraction method based on distribution consistency
CN110428845A (en) * 2019-07-24 2019-11-08 厦门快商通科技股份有限公司 Composite tone detection method, system, mobile terminal and storage medium
CN111564163B (en) * 2020-05-08 2023-12-15 宁波大学 RNN-based multiple fake operation voice detection method
CN111667016B (en) * 2020-06-12 2023-01-06 中国电子科技集团公司第三十六研究所 Incremental information classification method based on prototype
CN111723203A (en) * 2020-06-15 2020-09-29 苏州意能通信息技术有限公司 Text classification method based on lifetime learning
CN111797844A (en) * 2020-07-20 2020-10-20 苏州思必驰信息科技有限公司 Adaptive model training method for antagonistic domain and adaptive model for antagonistic domain
CN112712809B (en) * 2021-03-29 2021-06-18 北京远鉴信息技术有限公司 Voice detection method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN113299315A (en) 2021-08-24

Similar Documents

Publication Publication Date Title
CN113299315B (en) Method for generating voice features through continuous learning without original data storage
US7724961B2 (en) Method for classifying data using an analytic manifold
CN111582320B (en) Dynamic individual identification method based on semi-supervised learning
CN111564163B (en) RNN-based multiple fake operation voice detection method
US20030236661A1 (en) System and method for noise-robust feature extraction
JP6904483B2 (en) Pattern recognition device, pattern recognition method, and pattern recognition program
CN110942091B (en) Semi-supervised few-sample image classification method for searching reliable abnormal data center
CN110940523B (en) Unsupervised domain adaptive fault diagnosis method
Bahari et al. Speaker age estimation and gender detection based on supervised non-negative matrix factorization
CN114503131A (en) Search device, search method, search program, and learning model search system
CN111310719B (en) Unknown radiation source individual identification and detection method
Prasad et al. Improving the performance of speech clustering method
CN116935892A (en) Industrial valve anomaly detection method based on audio key feature dynamic aggregation
Kamaruddin et al. Features extraction for speech emotion
CN108009434A (en) Rich model Stego-detection Feature Selection Algorithms based on rough set α-positive domain reduction
CN117423344A (en) Voiceprint recognition method and device based on neural network
CN112735442B (en) Wetland ecology monitoring system with audio separation voiceprint recognition function and audio separation method thereof
Sunu et al. Dimensionality reduction for acoustic vehicle classification with spectral embedding
CN113488027A (en) Hierarchical classification generated audio tracing method, storage medium and computer equipment
CN111667000A (en) Earthquake early warning method of adaptive field deep neural network
CN118098288B (en) Weak supervision voice depression detection method based on self-learning label correction
CN112700792B (en) Audio scene identification and classification method
CN116230012B (en) Two-stage abnormal sound detection method based on metadata comparison learning pre-training
CN116522240A (en) Open-set radiation source individual identification method based on self-adaptive threshold
CN117668645A (en) Water surface underwater target discrimination method based on multi-model fusion

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant