CN113299315B - Method for generating voice features through continuous learning without original data storage - Google Patents
- Publication number: CN113299315B (application CN202110852843.4A)
- Authority: CN (China)
- Prior art keywords: model, domain model, features, source domain, training
- Prior art date: 2021-07-27
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G10L25/30 — Speech or voice analysis techniques not restricted to a single one of groups G10L15/00–G10L21/00, characterised by the analysis technique using neural networks
- G10L25/03 — characterised by the type of extracted parameters
- G10L25/12 — the extracted parameters being prediction coefficients
- G10L25/21 — the extracted parameters being power information

(All under G — Physics; G10 — Musical instruments; acoustics; G10L — Speech analysis techniques or speech synthesis; speech recognition; speech or voice processing techniques; speech or audio coding or decoding.)
Abstract
The invention provides a method for generating voice features through continuous learning without storing the original data, comprising the following steps: collecting audio data and extracting audio acoustic features to obtain linear frequency cepstral coefficient (LFCC) features; training a deep learning network model on the LFCC features to obtain a source domain model; and adding a regularization loss to the training loss function of the source domain model to constrain the direction of model-parameter optimization, then updating the model parameters of the source domain model with newly collected audio data to obtain a target domain model.
Description
Technical Field
The invention relates to the fields of voice processing and image processing, and in particular to a method for generating voice features through continuous learning without original data storage.
Background
Generated-speech detection judges whether an audio sample is genuine human speech or generated speech produced by replay recording, speech synthesis, or voice conversion.

An existing generated-speech discrimination model trained on a specific dataset loses much of its ability to detect unknown generated speech that does not match its training data; its generalization performance is poor.

Meanwhile, as speech synthesis and voice conversion technologies keep developing, numerous speech-generation methods have emerged, yet existing generated-speech detection schemes all face insufficient model generalization: for generation types unseen in the training set, no model is both robust and well-generalized. For example, a model trained on the ASVspoof2019 LA dataset degrades sharply on the ASVspoof2019 PA dataset because those generation types are absent from its training data, and a model trained on data produced by a limited set of synthesis techniques struggles to detect newly synthesized speech. A discrimination model is usually trained once on the available generated-speech data; when a new generation method appears, the new data can be mixed with the original data and the model retrained, but as the data volume grows, this brings linear growth in computation and storage resources at excessive cost. Moreover, because of privacy protection of sensitive data, long-term storage of the original data may be impossible; and for generated-speech detection models that are continuously updated online, retraining on the old data is not feasible.
In view of these problems, it is important that the model be able to continually learn newly generated speech.

To improve the model's discrimination of unknown generated speech, one may also consider model fine-tuning, joint training, or extraction of more generalizable acoustic features.

With model fine-tuning, "catastrophic forgetting" occurs when the original model is fine-tuned on new data, so performance on the original dataset drops sharply. Joint training incurs large time and computation overheads, and in some special situations the original data cannot be obtained at all, for privacy protection or other confidentiality reasons, so training on all the data together is impossible.
For the marked drop in detection performance on unseen datasets, several lines of research exist:
1. Multi-model fusion: train one generated-speech discrimination model per dataset, then fuse the models and score jointly.
2. Dual adversarial domain adaptation: an extension of domain-adversarial training that adds two domain discriminators to the network, one for genuine and one for generated speech, and trains on labeled source-domain data and unlabeled target-domain data to improve performance on domain-mismatched datasets.
3. Extracting other generalizable features: design front-end feature extractors from a traditional signal-processing perspective in the hope of obtaining more generalizable features, such as extended CQCC or CQSPIC coefficients.
However, these approaches have drawbacks: multi-model fusion requires new and old data to be trained together, which raises training cost; dual adversarial domain adaptation focuses only on new-data performance and neglects performance on the old datasets; and extracting other features cannot guarantee that one feature works for detecting all generated-speech types.
The continual-learning problem is how to overcome the catastrophic forgetting encountered in fine-tuning, so that the model can be updated with new data only, learning new tasks while retaining its memory of old tasks.

Beyond these problems, in practical applications people need to know not only whether speech is genuine but also its specific generation type. Plain binary classification is then not persuasive enough as model output, so the original genuine/fake binary classification is turned into a multi-class classification over generation types, which is more practical.
Publication CN111564163A discloses an RNN-based speech detection method for multiple forgery operations, comprising: 1) obtaining original speech samples and applying M kinds of forgery processing to obtain M forged utterances plus 1 unprocessed original utterance; extracting features to obtain the LFCC matrices of the training samples; and feeding them into an RNN classifier network to train a multi-class model; 2) for a test utterance, extracting its LFCC matrix and feeding it into the RNN classifier trained in step 1), obtaining an output probability for each test utterance and combining all output probabilities into a final prediction: if the prediction is original speech, the test utterance is recognized as original; if the prediction is speech subjected to a certain forgery operation, the test utterance is recognized as forged speech subjected to the corresponding forgery operation.

Publication CN112712809B discloses a voice detection method, apparatus, electronic device, and storage medium: extract multiple pieces of speech feature information from the speech under test; input them into several pre-trained voice source models and determine a first matching degree between the speech under test and the source type of each voice source model; for each voice class model, determine a second matching degree between the speech under test and the class type of that model based on the first matching degrees; and determine the class type and source type of the speech under test from the first and second matching degrees.
The prior art has the following defects:
When a new generation type appears and the model must be updated, two common solutions exist, direct fine-tuning and retraining on mixed new and old data, but both suffer from high computation cost and long retraining time.
For unknown types of generated speech there are multi-model fusion and adversarial adaptation methods, but these also have corresponding disadvantages.
1. Direct fine-tuning: training the existing model on new data improves its performance on the new data but greatly degrades its recognition of the earlier data.
2. Retraining from scratch: new and old data are pooled and the model retrained; as the data keeps growing, training takes longer and longer, increasing time cost and computation overhead.
3. Multi-model fusion: every new dataset adds another model, which incurs storage overhead.
4. Domain-adversarial adaptation: new and old data must be trained together; in some cases the method cannot be used because the old data is unavailable for privacy or security reasons.
In addition, existing methods target genuine/fake binary classification, while in practical applications people want not only the authenticity of speech but also the specific generation type, so multi-class classification over generation types is of great significance.
Disclosure of Invention
In view of the above, the present invention provides a method for generating voice features through continuous learning without original data storage, the method comprising:
s1: collecting audio data, and extracting audio acoustic features to obtain LFCC features;
s2: training a deep learning network model by applying the LFCC characteristics to obtain a source domain model;
s3: regularization loss is added on the basis of a training loss function of the source domain model, the direction of model parameter optimization is constrained, and model parameters of the source domain model are updated by applying newly generated audio data to obtain a target domain model.
Preferably, the audio acoustic features are extracted to obtain the LFCC features as follows: the collected audio data is sampled to obtain raw waveform points, then pre-emphasis, framing, windowing, and a fast Fourier transform yield the Fourier power spectrum;
the Fourier power spectrum is passed through a linear filter bank and a DCT (discrete cosine transform) to obtain the 60-dimensional LFCC (linear frequency cepstral coefficient) features of the audio;
where the window length is 25 ms and a 512-point FFT is performed.
Preferably, the deep learning network model is a lightweight convolutional neural network.
Preferably, the lightweight convolutional neural network finally outputs an N-way classification result through a fully connected layer, covering genuine speech and N−1 different types of generated speech.
Preferably, N is set to 50.
Preferably, the regularization loss includes a distillation regularization loss and a genuine-speech feature-distribution consistency constraint.
Preferably, the training loss function of the target domain model, L_total, is:

L_total = L_original + α·L_distillation + β·L_real

where,
- L_original: the training loss function of the source domain model;
- L_distillation: the distillation regularization loss;
- α: the weight of the distillation regularization loss, 0.5 ≤ α ≤ 1;
- L_real: the genuine-speech feature-distribution consistency constraint;
- β: the weight of the consistency constraint, 1 ≤ β ≤ 1.5.
Preferably, the distillation regularization loss takes the temperature-scaled form:

L_distillation = −Σ_i p̂_i^(src) · log p̂_i^(tgt)

where,
- p̂_i^(src): for the newly collected audio data, the temperature-softened prediction probability of the i-th class output by the source domain model;
- p̂_i^(tgt): the temperature-softened prediction probability of the i-th class output by the target domain model;
- T: the temperature hyperparameter used to soften both probability distributions.
Preferably, the temperature hyperparameter is set to 1 ≤ T ≤ 2.
Preferably, the genuine-speech feature-distribution consistency constraint penalizes the deviation between the embedding norms that the two models produce for the same genuine utterances:

L_real = Σ_k ( ‖e_k^(src)‖ − ‖e_k^(tgt)‖ )²

where,
- ‖e_k^(src)‖: the norm of the embedded feature vector of the k-th genuine utterance output by the source domain model;
- ‖e_k^(tgt)‖: the norm of the embedded feature vector of the k-th genuine utterance output by the target domain model.
Compared with the prior art, the technical scheme provided by the embodiments of the present application has the following advantages:
1. Time, computation, and storage overhead: each model update uses only the previously trained model and the new data;
2. Continual incremental learning: this matches the reality that generation methods are constantly updated, letting the generated-speech discrimination model evolve along with generation technology;
3. The model's performance on old data does not degrade unacceptably: using only new data, performance on old data drops somewhat, but not catastrophically, and remains far better than with plain fine-tuning;
4. The output is a multi-class generation category, so the specific generation type is detected.
Drawings
FIG. 1 is a flowchart of a method for generating voice features through continuous learning without original data storage according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of the advantages of continual learning over direct fine-tuning according to an embodiment of the present invention.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present invention; rather, they are merely examples of apparatus and methods consistent with certain aspects of the invention, as detailed in the appended claims.
As shown in fig. 1, a method for generating speech features through continuous learning without raw data storage according to an embodiment of the present application includes:
s1: the method comprises the following steps of collecting audio data, extracting audio acoustic features and obtaining LFCC features, and specifically comprises the following steps: sampling the collected audio data to obtain original waveform points, and then performing pre-emphasis, framing, windowing and fast Fourier transform to obtain a Fourier power spectrum;
performing DCT (discrete cosine transformation) on the Fourier power spectrum through a linear filter bank to obtain 60-dimensional LFCC (linear frequency feedback control) characteristics of the audio;
wherein the window length is 25 frames, and 512-dimensional FFT is carried out;
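The following Python sketch illustrates such an LFCC pipeline with numpy and scipy. The hop length, the filter count, and the split of the 60 dimensions into 20 static coefficients plus deltas and delta-deltas are illustrative assumptions, not specifics taken from this patent:

```python
import numpy as np
from scipy.fftpack import dct

def lfcc(signal, sr=16000, n_fft=512, n_filters=20, n_ceps=20,
         frame_ms=25, hop_ms=10, pre_emph=0.97):
    sig = np.asarray(signal, dtype=float)
    sig = np.append(sig[0], sig[1:] - pre_emph * sig[:-1])        # pre-emphasis
    frame_len, hop = int(sr * frame_ms / 1000), int(sr * hop_ms / 1000)
    n_frames = 1 + (len(sig) - frame_len) // hop                  # framing
    frames = np.stack([sig[i*hop:i*hop + frame_len] for i in range(n_frames)])
    frames = frames * np.hamming(frame_len)                       # windowing
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft       # FFT power spectrum
    # Linear (equal-width) triangular filter bank -- the linear spacing is
    # what makes these LFCC rather than mel-spaced MFCC
    bins = np.floor((n_fft + 1) * np.linspace(0, sr / 2, n_filters + 2) / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        lo, mid, hi = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, lo:mid] = (np.arange(lo, mid) - lo) / max(mid - lo, 1)
        fbank[m - 1, mid:hi] = (hi - np.arange(mid, hi)) / max(hi - mid, 1)
    # Log filter-bank energies + DCT -> cepstral coefficients
    ceps = dct(np.log(power @ fbank.T + 1e-10), axis=1, norm='ortho')[:, :n_ceps]
    d1 = np.diff(ceps, axis=0, prepend=ceps[:1])                  # delta
    d2 = np.diff(d1, axis=0, prepend=d1[:1])                      # delta-delta
    return np.hstack([ceps, d1, d2])                              # (n_frames, 60)
```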
s2: training a deep learning network model by applying the LFCC characteristics to obtain a source domain model;
the deep learning network model is a lightweight convolutional neural network, comprises operations such as convolutional layers, maximum pooling layers, maximum feature mapping output and batch normalization operation, and finally obtains 80-dimensional embedded feature vectors, and finally outputs N classification results through a full connection layer, wherein the N classification results comprise real voice and N-1 generated voices (generated voices of different recording devices, different vocoders and different modes), a reserved classification output head is provided for the generated voices which may appear in the future, and N =50 is preset;
the specific method for training the deep learning network model comprises the following steps: model training is carried out for 150 rounds, an adaptive moment estimation optimizer is selected, the initial learning rate is set to be 0.001, and the size of each batch of data is 128;
the model updating is based on the source domain model trained in the previous step, and further training is carried out by adopting new data on the basis of the source domain model trained in the previous step, namely, the target domain model is initialized by utilizing the source domain model parameters trained in the previous step; the common model tuning operations are: using new data and cross entropy or other loss functions to directly optimize model parameters; the method is based on continuous learning, regularization constraint is added on the basis of the original loss function, and the direction of model parameter optimization is constrained;
as shown in fig. 2, the advantages of continuous learning compared to direct fine tuning are: model fine tuning is often accompanied by 'catastrophic forgetting' of the fine-tuned model to original data knowledge, namely, a fine tuning strategy is simply adopted, so that a final model optimization result is easy to fall into an area with obvious difference from an original task, and the performance of the model is greatly reduced on the original task; however, after the continuous learning method is adopted, the updated model parameters can still be close to the optimal parameter area of the original model, so that the memory capacity of the model on the old task is ensured while the model learns the new task;
s3: adding regularization loss on the basis of a training loss function of a source domain model, constraining the direction of model parameter optimization, wherein the regularization loss comprises distillation regularization loss and real voice feature distribution consistency constraint, applying newly-generated type audio data, extracting corresponding acoustic features (LFCC) of new data, then selecting an adaptive moment estimation optimizer in the same way, setting the initial learning rate to be 0.0001, setting the batch data size to be 64, training for 20 rounds, and updating model parameters of the source domain model to obtain a target domain model;
The training loss function of the target domain model, L_total, is:

L_total = L_original + α·L_distillation + β·L_real

where,
- L_original: the training loss function of the source domain model;
- L_distillation: the distillation regularization loss;
- α: the weight of the distillation regularization loss, α = 0.7;
- L_real: the genuine-speech feature-distribution consistency constraint;
- β: the weight of the consistency constraint, β = 1.2.
The distillation regularization loss takes the temperature-scaled form:

L_distillation = −Σ_i p̂_i^(src) · log p̂_i^(tgt)

where,
- p̂_i^(src): for the newly collected audio data, the temperature-softened prediction probability of the i-th class output by the source domain model;
- p̂_i^(tgt): the temperature-softened prediction probability of the i-th class output by the target domain model;
- T: the temperature hyperparameter, T = 2.
The genuine-speech feature-distribution consistency constraint is:

L_real = Σ_k ( ‖e_k^(src)‖ − ‖e_k^(tgt)‖ )²

where,
- ‖e_k^(src)‖: the norm of the embedded feature vector of the k-th genuine utterance output by the source domain model;
- ‖e_k^(tgt)‖: the norm of the embedded feature vector of the k-th genuine utterance output by the target domain model.
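The three terms can be combined as in the sketch below. This is a reconstruction under stated assumptions, reusing the LightCNN sketch above: the distillation term uses the standard temperature-softened cross-entropy, and the consistency term is assumed here to be a squared difference of embedding norms:

```python
import torch.nn.functional as F

def continual_loss(target_model, source_model, x, y, genuine_mask,
                   alpha=0.7, beta=1.2, T=2.0):
    logits_t = target_model(x)
    with torch.no_grad():                       # the source model stays frozen
        logits_s = source_model(x)

    l_original = F.cross_entropy(logits_t, y)   # ordinary training loss on new data

    # Distillation: soften both distributions with temperature T
    p_src = F.softmax(logits_s / T, dim=1)
    log_p_tgt = F.log_softmax(logits_t / T, dim=1)
    l_distill = -(p_src * log_p_tgt).sum(dim=1).mean()

    # Genuine-speech feature-distribution consistency: match embedding norms
    e_tgt = target_model(x[genuine_mask], return_embedding=True)
    with torch.no_grad():
        e_src = source_model(x[genuine_mask], return_embedding=True)
    l_real = ((e_tgt.norm(dim=1) - e_src.norm(dim=1)) ** 2).mean()

    return l_original + alpha * l_distill + beta * l_real
```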
Examples
On the ASVspoof2019 LA dataset, four generated-speech types, A13, A17, A10, and A19, are selected together with genuine speech, and a lightweight convolutional neural network with 50 output classes is used.
First, the network is trained on A13 and genuine speech together, with genuine speech labeled 0 and A13 labeled 1, to obtain model 1. Then, on this basis, A17 and genuine speech are used, with genuine speech labeled 0 and A17 labeled 2, for a further model update with training loss L_total = L_original + α·L_distillation + β·L_real, α = 0.7, β = 1.2, yielding model 2; for comparison, fine-tune 2 is obtained by directly fine-tuning model 1 on A17 and genuine speech.
Then, on this basis, A10 and genuine speech are used, with genuine speech labeled 0 and A10 labeled 3, for a further model update with the same training loss, α = 0.7, β = 1.2, yielding model 3; for comparison, fine-tune 3 is obtained by directly fine-tuning fine-tune 2 on A10 and genuine speech.
Then, on this basis, A19 and genuine speech are used, with genuine speech labeled 0 and A19 labeled 4, for a further model update with the same training loss, α = 0.7, β = 1.2, yielding model 4; for comparison, fine-tune 4 is obtained by directly fine-tuning fine-tune 3 on A19 and genuine speech.
The EER (equal error rate, sketched below) of model 1 on A13, A17, A10, and A19 is tested and averaged to give avg_eer1;
the EER of model 2 on A13, A17, A10, and A19 is tested and averaged to give avg_eer2;
the EER of model 3 on A13, A17, A10, and A19 is tested and averaged to give avg_eer3;
the EER of model 4 on A13, A17, A10, and A19 is tested and averaged to give avg_eer4;
the EER of fine-tune 2 on A13, A17, A10, and A19 is tested and averaged to give avg_eer_2;
the EER of fine-tune 3 on A13, A17, A10, and A19 is tested and averaged to give avg_eer_3;
the EER of fine-tune 4 on A13, A17, A10, and A19 is tested and averaged to give avg_eer_4.
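A common way to compute the EER used above, sketched with numpy; `genuine_scores` and `spoof_scores` are placeholder per-utterance model scores where higher means more likely genuine:

```python
import numpy as np

def eer(genuine_scores, spoof_scores):
    thresholds = np.sort(np.concatenate([genuine_scores, spoof_scores]))
    far = np.array([(spoof_scores >= t).mean() for t in thresholds])   # false accept rate
    frr = np.array([(genuine_scores < t).mean() for t in thresholds])  # false reject rate
    i = np.argmin(np.abs(far - frr))            # threshold where the two rates cross
    return (far[i] + frr[i]) / 2
```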
the evaluation index in the following table is average eer (average equal error rate), and the lower the average equal error rate, the better the effect, the results are as follows:
addition of A13 | Addition of A17 | Addition of A10 | Addition of A19 | |
Method for producing a composite material | 0.811 | 3.834 | 4.312 | 8.032 |
Fine tuning | 0.811 | 14.422 | 14.404 | 43.719 |
It can be seen that the present method is superior to direct fine tuning.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in this specification and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It is to be understood that although the terms first, second, third, etc. may be used herein to describe various information, these information should not be limited to these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the present invention. The word "if" as used herein may be interpreted as "at … …" or "when … …" or "in response to a determination", depending on the context.
While this specification contains many specific implementation details, these should not be construed as limitations on the scope of any invention or of what may be claimed, but rather as descriptions of features specific to particular embodiments of particular inventions. Certain features that are described in this specification in the context of separate embodiments can also be implemented in combination in a single embodiment. In other instances, features described in connection with one embodiment may be implemented as discrete components or in any suitable subcombination. Moreover, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or variation of a subcombination.
Similarly, while operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In some cases, multitasking and parallel processing may be advantageous. Moreover, the separation of various system modules and components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.
Thus, particular embodiments of the subject matter have been described. Other embodiments are within the scope of the following claims. In some cases, the actions recited in the claims can be performed in a different order and still achieve desirable results. Further, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some implementations, multitasking and parallel processing may be advantageous.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.
Claims (5)
1. A method for generating voice features through continuous learning without original data storage, the method comprising:
S1: collecting audio data and extracting audio acoustic features to obtain LFCC features, i.e., linear frequency cepstral coefficients;
S2: training a deep learning network model on the LFCC features to obtain a source domain model;
S3: adding a regularization loss to the training loss function of the source domain model to constrain the direction of model-parameter optimization, and updating the model parameters of the source domain model with audio data of newly appeared generation types to obtain a target domain model;
the deep learning network model is a lightweight convolutional neural network;
the regularization loss includes a distillation regularization loss and a true speech feature distribution consistency constraint,
training loss function for target domain modelL Total lossComprises the following steps:
wherein the content of the first and second substances,
L original: a training loss function of the source domain model;
L distillation: distillation regularization loss;
α: weight loss by distillation regularization of 0.5. ltoreqα≤1;
L Reality (reality): constraint of consistency of distribution of real voice features;
β: the weight of the constraint of the consistency of the distribution of the real voice features is more than or equal to 1β≤1.5,
the distillation regularization loss takes the temperature-scaled form:

L_distillation = −Σ_i p̂_i^(src) · log p̂_i^(tgt)

where,
- p̂_i^(src): for the newly collected audio data, the temperature-softened prediction probability of the i-th class output by the source domain model;
- p̂_i^(tgt): the temperature-softened prediction probability of the i-th class output by the target domain model;
- T: the temperature hyperparameter,
and the temperature hyperparameter is set to 1 ≤ T ≤ 2.
2. The method for generating voice features through continuous learning without original data storage according to claim 1, wherein the audio acoustic features are extracted to obtain the LFCC features as follows:
the collected audio data is sampled to obtain raw waveform points, then pre-emphasis, framing, windowing, and a fast Fourier transform yield the Fourier power spectrum;
the Fourier power spectrum is passed through a linear filter bank and a DCT (discrete cosine transform) to obtain the 60-dimensional LFCC (linear frequency cepstral coefficient) features of the audio;
where the window length is 25 ms and a 512-point FFT is performed.
3. The method for generating voice features through continuous learning without original data storage according to claim 1, wherein the lightweight convolutional neural network finally outputs an N-way classification result through a fully connected layer, covering genuine speech and N−1 different types of generated speech.
4. The method for generating voice features through continuous learning without original data storage according to claim 3, wherein N is set to 50.
5. The method for generating voice features through continuous learning without original data storage according to claim 1, wherein the genuine-speech feature-distribution consistency constraint is defined by:

L_real = Σ_k ( ‖e_k^(src)‖ − ‖e_k^(tgt)‖ )²

where,
- ‖e_k^(src)‖: the norm of the embedded feature vector of the k-th genuine utterance output by the source domain model;
- ‖e_k^(tgt)‖: the norm of the embedded feature vector of the k-th genuine utterance output by the target domain model.
Priority Applications (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202110852843.4A (CN113299315B) | 2021-07-27 | 2021-07-27 | Method for generating voice features through continuous learning without original data storage |
Publications (2)

| Publication Number | Publication Date |
|---|---|
| CN113299315A | 2021-08-24 |
| CN113299315B | 2021-10-15 |
Family ID: 77331197
Family Applications (1)

| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202110852843.4A (CN113299315B, active) | Method for generating voice features through continuous learning without original data storage | 2021-07-27 | 2021-07-27 |

Country Status (1)

| Country | Link |
|---|---|
| CN | CN113299315B (en) |
Families Citing this family (2)

| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN113488027A | 2021-09-08 | 2021-10-08 | 中国科学院自动化研究所 | Hierarchical-classification generated-audio tracing method, storage medium, and computer equipment |
| CN115938390B | 2023-01-06 | 2023-06-30 | 中国科学院自动化研究所 | Continual learning method and device for a generated-speech identification model, and electronic device |
Family Cites Families (9)

| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2018160943A1 | 2017-03-03 | 2018-09-07 | Pindrop Security, Inc. | Method and apparatus for detecting spoofing conditions |
| CN108022589A | 2017-10-31 | 2018-05-11 | 努比亚技术有限公司 | Target-domain classifier training method, sample identification method, terminal, and storage medium |
| CN110287374B | 2019-06-14 | 2023-01-03 | 天津大学 | Self-attention video summarization method based on distribution consistency |
| CN110428845A | 2019-07-24 | 2019-11-08 | 厦门快商通科技股份有限公司 | Synthetic speech detection method, system, mobile terminal, and storage medium |
| CN111564163B | 2020-05-08 | 2023-12-15 | 宁波大学 | RNN-based detection method for speech subjected to multiple forgery operations |
| CN111667016B | 2020-06-12 | 2023-01-06 | 中国电子科技集团公司第三十六研究所 | Prototype-based incremental information classification method |
| CN111723203A | 2020-06-15 | 2020-09-29 | 苏州意能通信息技术有限公司 | Text classification method based on lifelong learning |
| CN111797844A | 2020-07-20 | 2020-10-20 | 苏州思必驰信息科技有限公司 | Adversarial domain adaptive model training method and adversarial domain adaptive model |
| CN112712809B | 2021-03-29 | 2021-06-18 | 北京远鉴信息技术有限公司 | Voice detection method and apparatus, electronic device, and storage medium |
Legal Events

| Code | Title |
|---|---|
| PB01 | Publication |
| SE01 | Entry into force of request for substantive examination |
| GR01 | Patent grant |