CN115249487B - Incremental generated voice detection method and system replaying boundary negative samples - Google Patents


Info

Publication number
CN115249487B
Authority
CN
China
Prior art keywords
data
generated
batch
deep learning
learning network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210863709.9A
Other languages
Chinese (zh)
Other versions
CN115249487A (en)
Inventor
陶建华
马浩鑫
易江燕
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Automation of Chinese Academy of Science
Original Assignee
Institute of Automation of Chinese Academy of Science
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Automation of Chinese Academy of Science filed Critical Institute of Automation of Chinese Academy of Science
Priority to CN202210863709.9A priority Critical patent/CN115249487B/en
Publication of CN115249487A publication Critical patent/CN115249487A/en
Application granted granted Critical
Publication of CN115249487B publication Critical patent/CN115249487B/en

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The invention provides an incremental generated voice detection method and system replaying boundary negative samples, belonging to the field of generated voice detection. Under the condition that only a small number of old samples can be saved, representative generated voice data are selected and retained and added to the new training data in subsequent model updates, so as to reduce the model's forgetting of old knowledge. The model thereby keeps its detection ability for the original forged voices while acquiring it for novel forged voices, can respond quickly and promptly to novel unknown generated voices, and can be updated continuously.

Description

Incremental generated voice detection method and system replaying boundary negative samples
Technical Field
The invention belongs to the field of generated voice detection, and particularly relates to an incremental generated voice detection method and system replaying boundary negative samples.
Background
Generated voice detection is the task of judging whether audio is real human voice or generated voice produced by recording replay, speech synthesis, or voice conversion technologies.
At present, research in the field of generated voice detection is not mature. The biggest problem is insufficient model generalization; in particular, the detection performance of trained models drops sharply on mismatched data sets and on unknown types of generated voice.
Meanwhile, with the continuous development of speech synthesis and voice conversion technologies, numerous voice forgery means have emerged. Existing voice forgery detection schemes all face insufficient model generalization: for forgery types unknown to the training data set, no highly robust, well-generalizing model can detect them. For example, the performance of a model trained on data set 1 drops greatly on data set 2.
At present, most research improves the generalization of detection models along two lines: feature extraction and classifier design. On the feature-extraction side, a team from the Institute of Acoustics, Chinese Academy of Sciences found that high-frequency-region features are a main cause of the model overfitting to the current data, that low-frequency-region features are more robust, and that removing the silent segments in the speech aggravates the overfitting; they therefore adopted a detection method that takes low-frequency-region features as the model input. A Carnegie Mellon University team proposed an improved feature, applying a 2-dimensional discrete cosine transform to the logarithmic Mel spectrum, and added spectrum-widening and feature-normalization strategies to improve model generalization. A National University of Singapore team proposed the eQCC and CQSPIC features, further improvements on the CQCC feature. A Zhejiang University team employed the large-scale pre-trained network Wav2vec for audio feature extraction. On the classifier side, attention networks, multi-scale residual networks incorporating channel gating mechanisms, and attention-based convolutional neural networks have all been applied to generated voice detection.
Beyond these conventional ideas, a dual-adversarial domain adaptation method has also been proposed. As an extension of domain-adversarial training, it adds to the network two domain discriminators, one for real voices and one for fake voices, and trains with labeled source-domain data and unlabeled target-domain data, improving performance on domain-mismatched data sets.
The above solutions, while improving generalization to some extent, still have limitations. Specifically:
(1) For improvements at the feature and model level, it is difficult to demonstrate detection performance on unknown types of generated audio data or to determine the optimal feature and model structure.
(2) Cross-domain adversarial training requires all data to participate in the training process, so it faces excessive data volume and long model-update times.
(3) Generation technology is continuously updated; generation and detection form a coupled attack-and-defense technology, so training a detection model once and for all is not practical.
When a new forgery means appears and model fine-tuning is adopted, fine-tuning the original model on new data produces "catastrophic forgetting", greatly reducing performance on the original data set, while retraining on mixed new and old data incurs large time and computing overhead.
Disadvantages of the prior art
When a new generation type appears and the model must be updated, there are two common solutions, direct fine-tuning and retraining on mixed new and old data, but both suffer from high computational cost and long retraining time.
To improve model generalization to unknown types of generated voice, existing methods include multi-model fusion, adversarial training, and new features and network structures, but each has corresponding drawbacks.
1. Direct fine-tuning: training the existing model on new data improves its performance on the new data but greatly degrades its recognition of the earlier data.
2. Retraining from scratch: new and old data are combined and retrained together; as the data keep growing, training takes longer and longer, increasing time cost and computational overhead.
3. Multi-model fusion: a new model is added for every new generation type, incurring storage overhead.
4. Domain-adversarial adaptation: all data must participate, so it faces excessive data volume and long model-update times.
5. New features and models: it is difficult to demonstrate detection performance on unknown types of generated audio data or to determine the optimal feature and model structure.
6. Generation technology is continuously updated; generation and detection form a coupled attack-and-defense technology, so training a detection model once and for all is not practical.
Disclosure of Invention
In order to solve the above technical problems, the present invention provides a technical solution: an incremental generated voice detection method and system replaying boundary negative samples.
The first aspect of the invention discloses an incremental generated voice detection method replaying boundary negative samples, which comprises the following steps:
s1, audio acoustic feature extraction: extracting audio acoustic features of input audio;
s2, model training: inputting the audio acoustic features into a deep learning network, and training the deep learning network;
s3, selecting old data: when a kth batch of data is trained, selecting M/k generated voices from the generated voices which can be correctly detected by the deep learning network after the training through the average value of the feature vectors of the real voices in the kth batch of data and the feature vectors of the generated voices which can be correctly detected by the deep learning network after the training, deleting M/k generated voices from the generated voices stored in a kth batch of previous data, and mixing the selected M/k generated voices with the generated voices stored in the kth batch of previous data after the M/k generated voices are deleted to be used as selected old data; m is the total amount of old data;
s4, mixed playback of new and old data: when the (k + 1) th batch of data is trained, the selected old data and the (k + 1) th batch of data are mixed and played back to obtain (k + 1) th batch of training data;
step S5, incremental updating: applying the (k + 1) th batch of training data to carry out incremental updating on the trained deep learning network;
step S6: and applying the incrementally updated deep learning network to finish the voice detection.
According to the method of the first aspect of the present invention, in step S3, the method for calculating the average value of the feature vectors of the real voices in the k-th batch of data comprises:
calculating the average value of the feature vectors of the real voices in the k-th batch of data by applying the deep learning network trained on the current k-th batch of data; the feature vector of a real voice is the output produced by the trained deep learning network, with its last fully connected layer removed, when the real voice is fed in;
the specific formula for calculating the average value of the feature vectors of the real voice in the kth batch of data is as follows:
$$\mu_{\mathrm{true}}^{k}=\frac{1}{N_{\mathrm{true}}^{k}}\sum_{i=1}^{N_{\mathrm{true}}^{k}}g_{k}\left(x_{\mathrm{true}}^{i}\right)$$
where $g_{k}(\cdot)$ is the trained deep learning network, $x_{\mathrm{true}}^{i}$ is the audio acoustic feature of the i-th real voice, $g_{k}(x_{\mathrm{true}}^{i})$ is the feature vector of the i-th real voice, $N_{\mathrm{true}}^{k}$ is the number of real voices, and $\mu_{\mathrm{true}}^{k}$ is the average of the feature vectors of the real voices in the k-th batch of data.
According to the method of the first aspect of the present invention, in step S3, the method for calculating the feature vectors of the generated voices correctly detected by the trained deep learning network comprises:
applying the trained deep learning network to the current k-th batch of data, and selecting the set of generated voices in the current k-th batch of data that the trained deep learning network detects correctly;
then calculating the feature vectors of all generated voices in that set;
the feature vector of a generated voice is the output produced by the trained deep learning network, with its last fully connected layer removed, when a generated voice from the set is fed in.
According to the method of the first aspect of the present invention, in step S3, the criterion for correct detection by the trained deep learning network is: the predicted generation probability of the currently trained deep learning network is greater than 0.5.
In the method according to the first aspect of the present invention, in step S3, the method for selecting M/k generated voices from the generated voices correctly detected by the trained deep learning network comprises:
calculating the distance between the feature vector of each generated voice and the average feature vector of the real voices;
selecting the M/k generated voices with the shortest distances, namely the first M/k voices ordered from nearest to farthest, rounding M/k down if it is not an integer.
According to the method of the first aspect of the present invention, in step S3, before mixing the selected M/k generated voices with the generated voices saved from previous batches, the method further comprises:
pruning each batch among the generated voices saved from previous batches, in order from farthest to nearest, so that each batch retains a data volume of M/k.
According to the method of the first aspect of the present invention, in step S4, the method for mixing and playing back the selected old data with the (k+1)-th batch of data comprises:
copying and expanding the selected M old samples so that the total amount of old data is the same as that of the (k+1)-th batch of data, i.e., the ratio of old data to (k+1)-th batch data is 1:1.
The second aspect of the present invention discloses an incremental generated voice detection system replaying boundary negative samples, which comprises:
a first processing module configured for audio acoustic feature extraction: extracting audio acoustic features of input audio;
a second processing module configured for model training: inputting the audio acoustic features into a deep learning network and training the deep learning network;
a third processing module configured for old data selection: when the k-th batch of data has been trained, selecting M/k generated voices, by means of the average feature vector of the real voices in the k-th batch of data and the feature vectors of the generated voices correctly detected by the trained deep learning network, from the generated voices correctly detected by the trained deep learning network; deleting M/k generated voices from the generated voices saved from previous batches; and mixing the selected M/k generated voices with the remaining saved generated voices as the selected old data; M is the total amount of old data;
a fourth processing module configured for mixed playback of new and old data: when the (k+1)-th batch of data is trained, mixing and playing back the selected old data with the (k+1)-th batch of data to obtain the (k+1)-th batch of training data;
a fifth processing module configured for incremental updating: applying the (k+1)-th batch of training data to incrementally update the trained deep learning network;
and a sixth processing module configured to apply the incrementally updated deep learning network to complete voice detection.
According to the system of the second aspect of the present invention, the third processing module is configured to calculate the average value of the feature vectors of the real voices in the k-th batch of data, including:
calculating the average value of the feature vectors of the real voices in the k-th batch of data by applying the deep learning network trained on the current k-th batch of data; the feature vector of a real voice is the output produced by the trained deep learning network, with its last fully connected layer removed, when the real voice is fed in;
the specific formula for calculating the average value of the feature vectors of the real voice in the kth batch of data is as follows:
$$\mu_{\mathrm{true}}^{k}=\frac{1}{N_{\mathrm{true}}^{k}}\sum_{i=1}^{N_{\mathrm{true}}^{k}}g_{k}\left(x_{\mathrm{true}}^{i}\right)$$
where $g_{k}(\cdot)$ is the trained deep learning network, $x_{\mathrm{true}}^{i}$ is the audio acoustic feature of the i-th real voice, $g_{k}(x_{\mathrm{true}}^{i})$ is the feature vector of the i-th real voice, $N_{\mathrm{true}}^{k}$ is the number of real voices, and $\mu_{\mathrm{true}}^{k}$ is the average of the feature vectors of the real voices in the k-th batch of data.
According to the system of the second aspect of the present invention, the third processing module is configured to calculate the feature vector of the generated speech, which can be correctly detected by the trained deep learning network, including:
applying the trained deep learning network on the current kth batch of data, and selecting a generated voice set which can be correctly detected by the trained deep learning network in the current kth batch of data;
then calculating the feature vectors of all the generated voices in the generated voice set;
the feature vector of the generated voice is output of a structure of the deep learning network after the generated voice in the generated voice set is input into the deep learning network after training except for the last full connection layer.
According to the system of the second aspect of the present invention, the third processing module is configured to, the criteria that can be correctly detected by the trained deep learning network are: the generation probability of the deep learning network prediction after current training is larger than 0.5.
According to the system of the second aspect of the present invention, the third processing module is configured to select M/k generated voices from the generated voices correctly detected by the trained deep learning network, including:
calculating the distance between the feature vector of each generated voice and the average feature vector of the real voices;
selecting the M/k generated voices with the shortest distances, namely the first M/k voices ordered from nearest to farthest, rounding M/k down if it is not an integer.
According to the system of the second aspect of the present invention, before the selected M/k generated voices are mixed with the generated voices saved from previous batches, the third processing module is further configured for:
pruning each batch among the generated voices saved from previous batches, in order from farthest to nearest, so that each batch retains a data volume of M/k.
According to the system of the second aspect of the present invention, the fourth processing module is configured to mix and play back the selected old data with the (k+1)-th batch of data, including:
copying and expanding the selected M old samples so that the total amount of old data is the same as that of the (k+1)-th batch of data, i.e., the ratio of old data to (k+1)-th batch data is 1:1.
A third aspect of the invention discloses an electronic device. The electronic device comprises a memory and a processor; the memory stores a computer program, and when the processor executes the computer program, the steps of the incremental generated voice detection method replaying boundary negative samples in any one of the first aspect of the present disclosure are implemented.
A fourth aspect of the invention discloses a computer-readable storage medium. The computer-readable storage medium stores a computer program which, when executed by a processor, implements the steps of the incremental generated voice detection method replaying boundary negative samples in any one of the first aspect of the present disclosure.
The scheme provided by the invention has the following beneficial effects:
1. Time, computation, and storage overhead:
Each model update uses only the most recently trained model, the small amount of old data saved at the last training, and the new data.
2. Continuous incremental learning:
Considering that voice generation means are continuously updated, the generated voice detection model should be continuously updated as speech synthesis, voice conversion technology, and recording equipment advance.
3. The model's performance on old data does not degrade to an unacceptable level, i.e., catastrophic forgetting is avoided, and the incrementally updated model performs far better than a fine-tuned model.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.
FIG. 1 is a flow chart of a method for incrementally generating speech detection for playback boundary negative examples in accordance with an embodiment of the present invention;
FIG. 2 is a schematic diagram of a sample-based playback method according to an embodiment of the present invention;
FIG. 3 is a flowchart of an incremental generated voice detection method replaying boundary negative samples according to an embodiment of the present invention;
FIG. 4 is a diagram of a detailed training process of a fake voice detection deep learning network according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a SE module according to an embodiment of the present invention;
FIG. 6 is a block diagram of an incrementally-generated speech detection system in accordance with an embodiment of the present invention;
fig. 7 is a block diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, belong to the protection scope of the present invention.
As shown in fig. 2, after training on the current task ends, a small number of training samples are stored to participate in subsequent training. The colors of the neurons in the network correspond to the colors of the data, representing the tasks learned; the long rectangular box represents the stored data. After training on each batch of data ends, a small number of boundary generated voice samples are stored, with sample colors corresponding to different generation types.
The invention discloses an incremental generated voice detection method replaying boundary negative samples. Fig. 1 is a flowchart of the method according to an embodiment of the present invention; as shown in figs. 1 and 3, the method comprises:
s1, audio acoustic feature extraction: extracting audio acoustic features of input audio;
s2, model training: inputting the audio acoustic features into a deep learning network, and training the deep learning network;
s3, selecting old data: when a kth batch of data is trained, selecting M/k generated voices from the generated voices which can be correctly detected by the trained deep learning network through an average value of feature vectors of real voices in the kth batch of data and feature vectors of the generated voices which can be correctly detected by the trained deep learning network, deleting M/k generated voices from the generated voices stored in a kth batch of previous data, and mixing the selected M/k generated voices with the generated voices stored in the kth batch of previous data from which the M/k generated voices are deleted to serve as selected old data; the M is the total amount of the old data;
s4, mixed playback of new and old data: when training the (k + 1) th batch of data, mixing and playing back the selected old data and the (k + 1) th batch of data to obtain a (k + 1) th batch of training data;
step S5, incremental updating: applying the (k + 1) th batch of training data to carry out incremental updating on the trained deep learning network;
and S6, applying the incrementally updated deep learning network to complete voice detection.
In step S1, audio acoustic feature extraction: audio acoustic features of the input audio are extracted.
Specifically, the input audio is read as raw waveform samples, and then pre-emphasis, framing, windowing, fast Fourier transform, a linear filter bank, logarithm, and DCT are applied to obtain the 60-dimensional LFCC features of the audio, namely the audio acoustic features of the input audio; the window length is 25 ms, and a 512-point FFT is used. Alternatively, inverse Mel-frequency cepstral coefficients (IMFCC), constant-Q cepstral coefficients (CQCC), or the fast Fourier transform (FFT) spectrum may be adopted as the audio acoustic features of the input audio. A minimal extraction sketch follows.
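The following Python sketch illustrates this pipeline. It assumes 16 kHz mono input, a 10 ms hop, 20 linear filters, and a 60-dimensional output formed as 20 static coefficients plus deltas and delta-deltas; the patent does not state the hop length, filter count, or static/delta split, so those values, and all names, are assumptions.

```python
# Minimal LFCC sketch; hop length, filter count, and the 20+20+20
# static/delta/delta-delta split are assumptions not fixed by the text.
import numpy as np
from scipy.fftpack import dct

def lfcc(wave, sr=16000, n_fft=512, win=0.025, hop=0.010, n_filt=20, n_ceps=20):
    wave = np.append(wave[0], wave[1:] - 0.97 * wave[:-1])       # pre-emphasis
    wlen, step = int(sr * win), int(sr * hop)
    wave = np.pad(wave, (0, max(0, wlen - len(wave))))           # ensure one full frame
    n_frames = 1 + (len(wave) - wlen) // step
    idx = np.arange(wlen)[None, :] + step * np.arange(n_frames)[:, None]
    frames = wave[idx] * np.hamming(wlen)                        # framing + windowing
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2              # FFT power spectrum
    edges = np.linspace(0, n_fft // 2, n_filt + 2).astype(int)   # linear, not Mel, spacing
    fbank = np.zeros((n_filt, n_fft // 2 + 1))
    for m in range(1, n_filt + 1):                               # triangular filters
        l, c, r = edges[m - 1], edges[m], edges[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    log_e = np.log(power @ fbank.T + 1e-10)                      # log filter-bank energies
    ceps = dct(log_e, type=2, axis=1, norm='ortho')[:, :n_ceps]  # DCT -> static LFCC
    d1 = np.gradient(ceps, axis=0)                               # delta
    d2 = np.gradient(d1, axis=0)                                 # delta-delta
    return np.hstack([ceps, d1, d2])                             # (n_frames, 60)
```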
In step S2, model training: and inputting the audio acoustic features into a deep learning network, and training the deep learning network.
Specifically, as shown in fig. 4, the model adopts a LightCNN network, and the network structure has 32 layers, as shown in table 1:
TABLE 1
[Table 1 is published as an image in the original document and is not reproduced here; it specifies the 32-layer LightCNN structure.]
In the table, Conv is a convolutional layer; Max_Pool is a max-pooling layer; MFM is the Max-Feature-Map operation, which outputs the maximum over sub-feature maps and reduces dimensionality; filter is the convolution kernel size; stride is the step size of each move; and BatchNorm is the batch normalization operation. The final output of the network is the binary classification result: real or generated voice.
The model is trained for 100 epochs with the Adam optimizer, an initial learning rate of 0.001, and a batch size of 128. A minimal training-loop sketch is shown below.
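This PyTorch sketch follows the stated recipe (Adam, lr = 0.001, batch size 128, 100 epochs, cross-entropy); `LightCNN`, `train_set`, and the feature tensor layout are assumptions, not definitions from the patent.

```python
# Training-loop sketch; LightCNN and train_set are assumed to be defined
# elsewhere, as is the (B, 1, T, 60) LFCC tensor layout.
import torch
from torch import nn
from torch.utils.data import DataLoader

model = LightCNN(num_classes=2)          # assumed 32-layer variant from Table 1
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()
loader = DataLoader(train_set, batch_size=128, shuffle=True)

for epoch in range(100):
    for features, labels in loader:      # LFCC features; labels: 0 = real, 1 = generated
        optimizer.zero_grad()
        loss = criterion(model(features), labels)
        loss.backward()
        optimizer.step()
```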
SENet or ResNet may be chosen instead of LightCNN, as shown in fig. 5.
Given an input x with c_1 feature channels, a feature with c_2 channels is obtained through a series of convolutions and other general transformations. Unlike a conventional CNN, the resulting feature is then recalibrated through three operations.
The three operations are: the Squeeze operation, which compresses the feature along the spatial dimensions; the Excitation operation, which generates a weight for each feature channel through parameters w; and the Reweight operation, which treats the Excitation output weights as the importance of each feature channel after feature selection and multiplies them, channel by channel, onto the earlier feature, recalibrating the original feature in the channel dimension. A sketch of this block follows.
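A minimal PyTorch sketch of such a Squeeze-and-Excitation block is given below; the `reduction` bottleneck factor is an assumption (16 is the value commonly used in the SENet literature) and is not fixed by the patent text.

```python
# Squeeze-and-Excitation block sketch; reduction=16 is an assumed default.
import torch
from torch import nn

class SEBlock(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),  # Excitation: bottleneck
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),                                # per-channel weights in (0, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        s = x.mean(dim=(2, 3))            # Squeeze: global average pooling -> (B, C)
        w = self.fc(s).view(b, c, 1, 1)   # Excitation: channel importance weights
        return x * w                      # Reweight: recalibrate the input channels
```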
The model structures of ResNet-50, SE-ResNet50, and SE-ResNeXt-50 are shown in Table 2;
1×1 and 3×3 convolution kernels are used in these networks.
TABLE 2
[Table 2 is published as an image in the original document and is not reproduced here; it specifies the ResNet-50, SE-ResNet50, and SE-ResNeXt-50 structures.]
In step S3, old data selection: when the k-th batch of data has been trained, M/k generated voices are selected from the generated voices correctly detected by the trained deep learning network, by means of the average feature vector of the real voices in the k-th batch of data and the feature vectors of the correctly detected generated voices, and the selected M/k generated voices are mixed with the generated voices saved from previous batches as the selected old data; M is the total amount of old data.
In some embodiments, in step S3, the method for calculating the average value of the feature vectors of the real voices in the k-th batch of data comprises:
calculating the average value of the feature vectors of the real voices in the k-th batch of data by applying the deep learning network trained on the current k-th batch of data; the feature vector of a real voice is the output produced by the trained deep learning network, with its last fully connected layer removed, when the real voice is fed in;
the specific formula for calculating the average value of the feature vectors of the real voice in the kth batch of data is as follows:
$$\mu_{\mathrm{true}}^{k}=\frac{1}{N_{\mathrm{true}}^{k}}\sum_{i=1}^{N_{\mathrm{true}}^{k}}g_{k}\left(x_{\mathrm{true}}^{i}\right)$$
where $g_{k}(\cdot)$ is the trained deep learning network, $x_{\mathrm{true}}^{i}$ is the audio acoustic feature of the i-th real voice, $g_{k}(x_{\mathrm{true}}^{i})$ is the feature vector of the i-th real voice, $N_{\mathrm{true}}^{k}$ is the number of real voices, and $\mu_{\mathrm{true}}^{k}$ is the average of the feature vectors of the real voices in the k-th batch of data.
The method for calculating the feature vectors of the generated voices correctly detected by the trained deep learning network comprises the following steps:
applying the trained deep learning network to the current k-th batch of data, and selecting the set of generated voices in the current k-th batch of data that the trained deep learning network detects correctly;
then calculating the feature vectors of all generated voices in that set;
the feature vector of a generated voice is the output produced by the trained deep learning network, with its last fully connected layer removed, when a generated voice from the set is fed in.
The criterion for correct detection by the trained deep learning network is: the predicted generation probability of the currently trained deep learning network is greater than 0.5.
The method for selecting M/k generated voices from the generated voices correctly detected by the trained deep learning network comprises the following steps:
calculating the distance between the feature vector of each generated voice and the average feature vector of the real voices;
and selecting the M/k generated voices with the shortest distances.
Before the selected M/k generated voices are mixed with the generated voices saved from previous batches, the method further comprises:
pruning each batch among the generated voices saved from previous batches, in order from farthest to nearest (per step S34 below), so that each batch retains a data volume of M/k.
Specifically, if the total amount of stored old data is M, then when the k-th batch of data is trained, M/k generated voices are saved and selected for each batch. The selection process for these M/k generated voices is:
Step S31: the deep learning network trained on the k-th batch of data is applied to calculate the average feature vector of the real voices in the k-th batch; the feature vector of a real voice is the output produced by the trained network, with its last fully connected layer removed, when the real voice is fed in, specifically the 80-dimensional vector output after the BatchNorm31 layer of the deep learning network;
the specific formula for calculating the average value of the feature vectors of the real voice in the kth batch of data is as follows:
$$\mu_{\mathrm{true}}^{k}=\frac{1}{N_{\mathrm{true}}^{k}}\sum_{i=1}^{N_{\mathrm{true}}^{k}}g_{k}\left(x_{\mathrm{true}}^{i}\right)$$
where $g_{k}(\cdot)$ is the trained deep learning network, $x_{\mathrm{true}}^{i}$ is the audio acoustic feature of the i-th real voice, $g_{k}(x_{\mathrm{true}}^{i})$ is the feature vector of the i-th real voice, $N_{\mathrm{true}}^{k}$ is the number of real voices, and $\mu_{\mathrm{true}}^{k}$ is the average of the feature vectors of the real voices in the k-th batch of data.
Step S32: the trained deep learning network is applied to the k-th batch of data to select the set of generated voices in that batch that it detects correctly, $S_{\mathrm{gen}}=\{x_{\mathrm{gen}}^{1},x_{\mathrm{gen}}^{2},x_{\mathrm{gen}}^{3},\ldots,x_{\mathrm{gen}}^{N}\}$; the feature vectors $g_{k}(x_{\mathrm{gen}})$ of all generated voices in the set are then calculated.
The feature vector of a generated voice is the output produced by the trained deep learning network, with its last fully connected layer removed, when a generated voice from the set is fed in, i.e., the 80-dimensional vector output after the BatchNorm31 layer. The criterion for correct detection by the trained deep learning network is a predicted generation probability greater than 0.5.
Step S33: the distance between the feature vector of each generated voice and the average feature vector of the real voices is calculated, and the M/k generated voices with the shortest distances are selected. These M/k generated voices are the ones closest to the real voices, lying at the decision boundary of the network; such data are the most helpful for preserving the previously learned feature-boundary distribution during later training.
Step S34: each batch among the generated voices saved from previous batches is pruned in order from farthest to nearest, so that each batch retains a data volume of M/k. A minimal sketch of this selection procedure is given below.
In step S4, new and old data are mixed and played back: and when the (k + 1) th batch of data is trained, the selected old data and the (k + 1) th batch of data are mixed and played back to obtain the (k + 1) th batch of training data.
In some embodiments, in step S4, the method for mixing and playing back the selected old data with the (k+1)-th batch of data comprises:
copying and expanding the selected M old samples so that the total amount of old data equals that of the (k+1)-th batch of data, i.e., the ratio of old data to (k+1)-th batch data is 1:1, as sketched below.
At step S5, the incremental update: and applying the (k + 1) th batch of training data to carry out incremental updating on the trained deep learning network.
Specifically, the saved old samples, the new data, and the previously trained model are prepared, and the corresponding acoustic features (LFCC) are extracted; an Adam optimizer is selected with an initial learning rate of 0.0001 and a batch size of 128; training runs for 25 epochs with cross-entropy as the loss function. The updated model is finally obtained. A minimal sketch of this update step follows.
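Reusing the imports and names from the earlier training-loop and mixed-playback sketches, the incremental update might look like the following; the checkpoint path and dataset objects are assumptions.

```python
# Incremental-update sketch (step S5): resume from the last trained model and
# fine-tune on the mixed playback data (Adam, lr = 1e-4, batch 128, 25 epochs).
model.load_state_dict(torch.load("model_batch_k.pt"))   # assumed checkpoint path
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loader = DataLoader(mixed_playback(old_samples, new_batch),
                    batch_size=128, shuffle=True)

for epoch in range(25):
    for features, labels in loader:
        optimizer.zero_grad()
        loss = nn.CrossEntropyLoss()(model(features), labels)
        loss.backward()
        optimizer.step()
```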
In summary, the beneficial effects of the scheme provided by the invention are as follows:
1. Time, computation, and storage overhead:
Each model update uses only the most recently trained model, the small amount of old data saved at the last training, and the new data.
2. Continuous incremental learning:
Considering that voice generation means are continuously updated, the generated voice detection model should be continuously updated as speech synthesis, voice conversion technology, and recording equipment advance.
3. The model's performance on old data does not degrade to an unacceptable level, i.e., catastrophic forgetting is avoided, and the incrementally updated model performs far better than a fine-tuned model.
The invention discloses an incremental generated voice detection system for replaying boundary negative examples in a second aspect. FIG. 6 is a block diagram of an incrementally-generated speech detection system in accordance with an embodiment of the present invention; as shown in fig. 6, the system 100 includes:
a first processing module 101 configured for audio acoustic feature extraction: extracting audio acoustic features of input audio;
a second processing module 102 configured for model training: inputting the audio acoustic features into a deep learning network and training the deep learning network;
a third processing module 103 configured for old data selection: when the k-th batch of data has been trained, selecting M/k generated voices, by means of the average feature vector of the real voices in the k-th batch of data and the feature vectors of the generated voices correctly detected by the trained deep learning network, from the generated voices correctly detected by the trained deep learning network; deleting M/k generated voices from the generated voices saved from previous batches; and mixing the selected M/k generated voices with the remaining saved generated voices as the selected old data; M is the total amount of old data;
a fourth processing module 104 configured for mixed playback of new and old data: when the (k+1)-th batch of data is trained, mixing and playing back the selected old data with the (k+1)-th batch of data to obtain the (k+1)-th batch of training data;
a fifth processing module 105 configured for incremental updating: applying the (k+1)-th batch of training data to incrementally update the trained deep learning network;
and a sixth processing module 106 configured to apply the incrementally updated deep learning network to complete voice detection.
According to the system of the second aspect of the present invention, the third processing module 103 is configured to calculate the average value of the feature vectors of the real voices in the k-th batch of data, including:
calculating the average value of the feature vectors of the real voices in the k-th batch of data by applying the deep learning network trained on the k-th batch of data; the feature vector of a real voice is the output produced by the trained deep learning network, with its last fully connected layer removed, when the real voice is fed in;
the specific formula for calculating the average value of the feature vectors of the real voice in the kth batch of data is as follows:
$$\mu_{\mathrm{true}}^{k}=\frac{1}{N_{\mathrm{true}}^{k}}\sum_{i=1}^{N_{\mathrm{true}}^{k}}g_{k}\left(x_{\mathrm{true}}^{i}\right)$$
where $g_{k}(\cdot)$ is the trained deep learning network, $x_{\mathrm{true}}^{i}$ is the audio acoustic feature of the i-th real voice, $g_{k}(x_{\mathrm{true}}^{i})$ is the feature vector of the i-th real voice, $N_{\mathrm{true}}^{k}$ is the number of real voices, and $\mu_{\mathrm{true}}^{k}$ is the average of the feature vectors of the real voices in the k-th batch of data.
According to the system of the second aspect of the present invention, the third processing module 103 is configured to calculate the feature vector of the generated speech, which can be correctly detected by the trained deep learning network, including:
applying the trained deep learning network on the current kth batch of data, and selecting a generated voice set which can be correctly detected by the trained deep learning network in the current kth batch of data;
then calculating the feature vectors of all the generated voices in the generated voice set;
the feature vector of the generated voice is output of a structure of the deep learning network after the generated voice in the generated voice set is input into the deep learning network after training except for the last full connection layer.
According to the system of the second aspect of the present invention, the third processing module 103 is configured to be able to be correctly detected by the trained deep learning network according to the following criteria: the generation probability of the deep learning network prediction after the current training is larger than 0.5.
According to the system of the second aspect of the present invention, the third processing module 103 is configured to select M/k generated voices from the generated voices correctly detected by the trained deep learning network, including:
calculating the distance between the feature vector of each generated voice and the average feature vector of the real voices;
selecting the M/k generated voices with the shortest distances, namely the first M/k voices ordered from nearest to farthest, rounding M/k down if it is not an integer.
According to the system of the second aspect of the present invention, before the selected M/k generated voices are mixed with the generated voices saved from previous batches, the third processing module 103 is further configured for:
pruning each batch among the generated voices saved from previous batches, in order from farthest to nearest, so that each batch retains a data volume of M/k.
According to the system of the second aspect of the present invention, the fourth processing module 104 is configured to mix and play back the selected old data with the (k+1)-th batch of data, including:
copying and expanding the selected M old samples so that the total amount of old data is the same as that of the (k+1)-th batch of data, i.e., the ratio of old data to (k+1)-th batch data is 1:1.
A third aspect of the invention discloses an electronic device. The electronic device comprises a memory and a processor; the memory stores a computer program, and when the processor executes the computer program, the steps of the incremental generated voice detection method replaying boundary negative samples in any one of the first aspect of the present disclosure are implemented.
Fig. 7 is a block diagram of an electronic device according to an embodiment of the present invention, and as shown in fig. 7, the electronic device includes a processor, a memory, a communication interface, a display screen, and an input device, which are connected by a system bus. Wherein the processor of the electronic device is configured to provide computing and control capabilities. The memory of the electronic equipment comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The communication interface of the electronic device is used for carrying out wired or wireless communication with an external terminal, and the wireless communication can be realized through WIFI, an operator network, near Field Communication (NFC) or other technologies. The display screen of the electronic equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the electronic equipment can be a touch layer covered on the display screen, a key, a track ball or a touch pad arranged on the shell of the electronic equipment, an external keyboard, a touch pad or a mouse and the like.
It will be understood by those skilled in the art that the structure shown in fig. 7 is only a partial block diagram related to the technical solution of the present disclosure, and does not constitute a limitation of the electronic device to which the solution of the present application is applied, and a specific electronic device may include more or less components than those shown in the drawings, or combine some components, or have a different arrangement of components.
A fourth aspect of the invention discloses a computer-readable storage medium. The computer-readable storage medium stores a computer program which, when executed by a processor, implements the steps of the incremental generated voice detection method replaying boundary negative samples in any one of the first aspect of the present disclosure.
Note that the technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as a combination of technical features contains no contradiction, it should be considered within the scope of this specification. The above examples express only several embodiments of the present application, and their description is relatively specific and detailed, but they should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, and these all fall within the protection scope of the present application. Therefore, the protection scope of this patent application shall be subject to the appended claims.

Claims (8)

1. An incremental generated voice detection method replaying boundary negative samples, the method comprising:
s1, audio acoustic feature extraction: extracting audio acoustic features of input audio;
s2, model training: inputting the audio acoustic features into a deep learning network, and training the deep learning network;
s3, selecting old data: when a kth batch of data is trained, selecting M/k generated voices from the generated voices which can be correctly detected by the trained deep learning network through an average value of feature vectors of real voices in the kth batch of data and feature vectors of the generated voices which can be correctly detected by the trained deep learning network, deleting M/k generated voices from the generated voices stored in a kth batch of previous data, and mixing the selected M/k generated voices with the generated voices stored in the kth batch of previous data from which the M/k generated voices are deleted to serve as selected old data; m is the total amount of old data;
in step S3, the method for selecting M/k generated speeches from the generated speeches correctly detected by the trained deep learning network includes:
calculating the distance between the feature vector of the generated voice and the feature vector of the real voice;
selecting M/k generated voices with the shortest distance, specifically the first M/k generated voices from near to far, and taking down an integer if the M/k is a non-integer;
in the step S3, before mixing the sorted M/k generated voices with the kth generated voice stored in the previous data, the method further includes:
deleting each batch of data in the generated voice stored in the kth batch of previous data according to the sequence from far to near, so that the data volume of each batch of data is M/k;
s4, mixed playback of new and old data: when the (k + 1) th batch of data is trained, the selected old data and the (k + 1) th batch of data are mixed and played back to obtain (k + 1) th batch of training data;
step S5, incremental updating: applying the (k + 1) th batch of training data to carry out incremental updating on the trained deep learning network;
step S6: and applying the incrementally updated deep learning network to complete the voice detection.
2. The incremental generated voice detection method replaying boundary negative samples as claimed in claim 1, wherein in step S3, the method for calculating the average value of the feature vectors of the real voices in the k-th batch of data comprises:
calculating the average value of the feature vectors of the real voices in the k-th batch of data by applying the deep learning network trained on the current k-th batch of data; the feature vector of a real voice is the output produced by the trained deep learning network, with its last fully connected layer removed, when the real voice is fed in;
the specific formula for calculating the average value of the feature vectors of the real voice in the kth batch of data is as follows:
$$\mu_{\mathrm{true}}^{k}=\frac{1}{N_{\mathrm{true}}^{k}}\sum_{i=1}^{N_{\mathrm{true}}^{k}}g_{k}\left(x_{\mathrm{true}}^{i}\right)$$
where $g_{k}(\cdot)$ is the trained deep learning network, $x_{\mathrm{true}}^{i}$ is the audio acoustic feature of the i-th real voice, $g_{k}(x_{\mathrm{true}}^{i})$ is the feature vector of the i-th real voice, $N_{\mathrm{true}}^{k}$ is the number of real voices, and $\mu_{\mathrm{true}}^{k}$ is the average of the feature vectors of the real voices in the k-th batch of data.
3. The method according to claim 1, wherein in step S3, the method for calculating the feature vectors of the generated voices correctly detected by the trained deep learning network comprises:
applying the trained deep learning network to the current k-th batch of data, and selecting the set of generated voices in the current k-th batch of data that the trained deep learning network detects correctly;
then calculating the feature vectors of all generated voices in that set;
the feature vector of a generated voice is the output produced by the trained deep learning network, with its last fully connected layer removed, when a generated voice from the set is fed in.
4. The incremental generated voice detection method replaying boundary negative samples as claimed in claim 3, wherein in step S3, the criterion for correct detection by the trained deep learning network is: the predicted generation probability of the currently trained deep learning network is greater than 0.5.
5. The incremental generated voice detection method replaying boundary negative samples as claimed in claim 1, wherein in step S4, the method for mixing and playing back the selected old data with the (k+1)-th batch of data comprises:
copying and expanding the selected M old samples so that the total amount of old data is the same as that of the (k+1)-th batch of data, i.e., the ratio of old data to (k+1)-th batch data is 1:1.
6. An incremental generated voice detection system replaying boundary negative samples, the system comprising:
a first processing module configured to, audio acoustic feature extraction: extracting audio acoustic features of input audio;
a second processing module configured to model train: inputting the audio acoustic features into a deep learning network, and training the deep learning network;
a third processing module configured for old data selection: when the k-th batch of data has been trained, selecting M/k generated voices, by means of the average feature vector of the real voices in the k-th batch of data and the feature vectors of the generated voices correctly detected by the trained deep learning network, from the generated voices correctly detected by the trained deep learning network; deleting M/k generated voices from the generated voices saved from previous batches; and mixing the selected M/k generated voices with the remaining saved generated voices as the selected old data; M is the total amount of old data;
the selecting of M/k generated voices from the generated voices correctly detected by the trained deep learning network comprises:
calculating the distance between the feature vector of each generated voice and the average feature vector of the real voices;
selecting the M/k generated voices with the shortest distances, namely the first M/k generated voices ordered from nearest to farthest, rounding M/k down if it is not an integer;
before the selected M/k generated voices are mixed with the generated voices saved from previous batches, the module is further configured for:
pruning each batch among the generated voices saved from previous batches, in order from farthest to nearest, so that each batch retains a data volume of M/k;
a fourth processing module configured to play back the new and old data in a mixed manner: when the (k + 1) th batch of data is trained, the selected old data and the (k + 1) th batch of data are mixed and played back to obtain (k + 1) th batch of training data;
a fifth processing module configured to incrementally update: applying the (k + 1) th batch of training data to perform incremental updating on the trained deep learning network;
and the sixth processing module is configured to apply the incrementally updated deep learning network to complete the voice detection.
7. An electronic device, comprising a memory storing a computer program and a processor, wherein when the processor executes the computer program, the steps of the incremental generated voice detection method replaying boundary negative samples as claimed in any one of claims 1 to 5 are implemented.
8. A computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the incremental generated voice detection method replaying boundary negative samples as claimed in any one of claims 1 to 5.
CN202210863709.9A 2022-07-21 2022-07-21 Incremental generated voice detection method and system replaying boundary negative samples Active CN115249487B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210863709.9A CN115249487B (en) 2022-07-21 2022-07-21 Incremental generated voice detection method and system replaying boundary negative samples

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210863709.9A CN115249487B (en) 2022-07-21 2022-07-21 Incremental generated voice detection method and system replaying boundary negative samples

Publications (2)

Publication Number Publication Date
CN115249487A CN115249487A (en) 2022-10-28
CN115249487B true CN115249487B (en) 2023-04-14

Family

ID=83700101

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210863709.9A Active CN115249487B (en) 2022-07-21 2022-07-21 Incremental generated voice detection method and system replaying boundary negative samples

Country Status (1)

Country Link
CN (1) CN115249487B (en)

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7693713B2 (en) * 2005-06-17 2010-04-06 Microsoft Corporation Speech models generated using competitive training, asymmetric training, and data boosting
CN110265065B (en) * 2019-05-13 2021-08-03 厦门亿联网络技术股份有限公司 Method for constructing voice endpoint detection model and voice endpoint detection system
CN111564163B (en) * 2020-05-08 2023-12-15 宁波大学 RNN-based multiple fake operation voice detection method
CN114155875B (en) * 2022-02-09 2022-05-03 中国科学院自动化研究所 Method and device for identifying voice scene tampering, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN115249487A (en) 2022-10-28


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant