CN117612562A - Self-supervision voice fake identification training method and system based on multi-center single classification - Google Patents

Self-supervision voice fake identification training method and system based on multi-center single classification

Info

Publication number
CN117612562A
CN117612562A (application CN202311682362.9A)
Authority
CN
China
Prior art keywords
voice
training
speaker
speech
classification
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311682362.9A
Other languages
Chinese (zh)
Inventor
曹睿
沈宜
郭先会
马军
周伟中
邹严
郭兴文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Wanglian Anrui Network Technology Co ltd
CETC 30 Research Institute
Original Assignee
Shenzhen Wanglian Anrui Network Technology Co ltd
CETC 30 Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Wanglian Anrui Network Technology Co ltd, CETC 30 Research Institute filed Critical Shenzhen Wanglian Anrui Network Technology Co ltd
Priority to CN202311682362.9A priority Critical patent/CN117612562A/en
Publication of CN117612562A publication Critical patent/CN117612562A/en
Pending legal-status Critical Current


Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/48 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L 25/51 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/02 - Feature extraction for speech recognition; Selection of recognition unit
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 - Training
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/08 - Speech classification or search
    • G10L 15/16 - Speech classification or search using artificial neural networks
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/03 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/27 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L 25/30 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The invention belongs to the technical field of voice detection and discloses a self-supervised voice fake-identification training method and system based on multi-center single classification. The method comprises the following steps: inputting the processed voice data into a feature extraction module and extracting voice features with a pre-trained self-supervised front-end network; fusing the voice features extracted by the pre-trained self-supervised front end; inputting the fused voice features into a fake-identification network, adding a multi-center single-classification loss model, and jointly training and optimizing the loss model and the fake-identification network; and judging the authenticity of the voice against a threshold during inference. Because speakers with different characteristics occupy different positions in the embedding space, the invention introduces a multi-center embedding space for training, which addresses the high error rate of single-center embedding-space fake-identification methods; at the same time, multiple kinds of noise and reverberation are added to simulate real environments, effectively solving the low accuracy and poor generalization of voice fake identification in real-world conditions.

Description

Self-supervision voice fake identification training method and system based on multi-center single classification
Technical Field
The invention belongs to the technical field of voice detection, and particularly relates to a self-supervision voice fake identification training method and system based on multi-center single classification.
Background
The prior art provides a voice fake-identification method and system based on a one-class multi-scale residual network, and a voice fake-identification method based on self-supervised learning. The prior art also provides a multi-center single-classification method, SAMO (Speaker Attractor Multi-Center One-Class Learning for Voice Anti-Spoofing), as well as a self-supervised front-end model for automatic speech recognition (ASR), Whisper. The large and diverse speech corpus used to train Whisper helps improve model generalization.
But has the following problems:
(1) Existing self-supervised front-end feature extraction simply superimposes the multi-level voice features output by the model, without considering the characteristics of the features at different levels, so the resulting voice representation is not robust.
(2) One-class learning compresses real voice into a single cluster in the embedding space, keeps synthesized voice far from that cluster, and classifies everything outside the cluster as synthesized. However, because the voice characteristics of different speakers differ, real voice actually forms multiple clusters in the embedding space; forcing it into one cluster leads to the misclassification of some synthesized voices.
(3) Adding various kinds of reverberation or noise to simulate the background sounds of real scenes is not considered.
Disclosure of Invention
In order to overcome the problems in the related art, the disclosed embodiments of the invention provide a self-supervised voice fake-identification training method and system based on multi-center single classification, in particular a voice fake-identification method based on a self-supervised network and a residual attention network. The prior art uses either one-class learning or a self-supervised learning front end, but does not combine the two effectively. In addition, the prior art feeds the multi-level features output by the front end directly into the fake-identification network without processing them, which reduces model generalization, and its one-class learning uses a single center, ignoring the speaker diversity of real scenes. Considering the characteristics of the voice fake-identification task, the invention combines a one-class method with a self-supervised front end trained on a large amount of data, effectively improving accuracy, and improves both components: the multi-level features output by the front end are processed and learned with an attention mechanism to obtain the important features before being input into the fake-identification network, and the voice is identified with a multi-center single-classification method, which greatly improves model generalization and its performance in real scenes.
The technical scheme is as follows. The self-supervised voice fake-identification training method based on multi-center single classification is used for training and inference of speaker voice fake identification in real scenes and specifically comprises the following steps:
s1, processing voice data;
s2, inputting the processed voice data into a feature extraction module, and extracting voice features by using a pre-training self-supervision front-end network;
s3, fusing the voice features extracted from the pre-trained self-supervision front end;
s4, inputting the fused voice characteristics into a fake identification network, adding a multi-center single-classification loss model, and training and optimizing the multi-center single-classification loss model and the fake identification network;
s5, judging the authenticity of the voice by using a threshold value in an reasoning process.
In step S1, processing the voice data includes: adding reverberation, convolutional noise and background sounds to the input voice to increase generalization, and cutting the voice into fixed-length random frames.
In step S2, speech features are extracted using a pre-trained self-supervising front end network, comprising:
loading the pre-trained self-supervised front-end network model Whisper and its pre-trained weights, and inputting the processed voice data to obtain multi-level voice features; screening the voice features, retaining the features of the latter half of the layers, and then passing them to the feature fusion module for fusion.
in step S3, fusing the speech features extracted from the pre-trained self-monitoring front end, including: carrying out pooling operation and reducing characteristic dimension; the method comprises the steps of entering a self-attention mechanism based on convolution, passing through a 2-D convolution layer, entering an activation layer and a batch normalization layer, using convolution operation again to obtain final attention weight, normalizing through a softmax function, and multiplying the final attention-based voice feature by an original feature.
In step S4, training and optimizing the multi-center single-classification loss model and the fake-identification network includes: inputting the fused voice features into the fake-identification network, training the features with a multi-scale residual network, using SENet to model the correlation among features and strengthen the important ones, obtaining a global feature vector, inputting it into the multi-center single-classification loss model, performing iterative training optimization, and feeding back to optimize the loss model weights, the embedding space and the speaker centers.
Further, the loss network method of the multi-center single-classification loss model comprises: compressing real voice into multiple clusters in the embedding space to form multiple speaker centers, the clusters being formed from speaker identities during training, while false voice stays far from the real clusters in the embedding space, thereby revealing the authenticity of the voice.
Further, the loss function of the loss network method is calculated as:

$$L=\frac{1}{N}\sum_{i=1}^{N}\log\left(1+e^{a\,(m_{y_i}-d_i)\,(-1)^{y_i}}\right)$$

where $L$ is the loss, $N$ is the number of speakers in the batch, $e$ is a constant and $a$ is a scale factor; $y_i$ is the corresponding ground-truth label, taking the value 0 for real voice and 1 for false voice; $m_0, m_1$ are the boundaries of real and false voice respectively, and $m_{y_i}$ denotes the one of $m_0, m_1$ selected by $y_i$; $d_i$ measures the similarity between the voice embedding and the cluster embeddings using cosine similarity.
where $d_i$ is calculated as:

$$d_i=\begin{cases}\hat{w}_{s_i}^{\top}\hat{x}_i, & y_i=0\\ \max_{s\in S}\hat{w}_{s}^{\top}\hat{x}_i, & y_i=1\end{cases}$$

in which $\hat{x}_i$ is the normalized voice embedding, $\hat{w}_s$ is the voice embedding of speaker cluster $s$, $S$ is the set of different speakers in the training set with $s_i$ the speaker of the $i$-th utterance, and $w$ is the average embedded representation of a speaker's real voice.
Further, for input real audio, the similarity calculation compares the voice embedding with the voice embedding of the corresponding speaker center; for input synthesized voice, the similarity calculation compares the voice embedding with the closest of all speaker centers. By continuously reducing the loss, the multi-center single-classification loss model learns to compress voices belonging to the same speaker together and to push spoofing attacks away from the embedding space of the speaker centers.
Further, the similarity calculation includes: first initializing the multi-center single-classification loss model with random weights and initializing the multiple speaker centers as one-hot vector representations; inputting the global semantic features to calculate their similarity with the speaker centers; and, over training iterations, updating the multiple speaker centers in the embedding space with the average characterization of each speaker, feeding back to optimize the loss model weights, the embedding space and the speaker centers.
In the inference process of step S5, the voice to be detected is input into the fake-identification network and then into the multi-center single-classification loss model, the similarity between the representation of the voice in the embedding space and the representations of the speaker centers is calculated to obtain a score, and the authenticity of the voice is judged against a threshold; scores lie between 0 and 1, where 0 represents false and 1 represents true, so a score close to 1 is judged real and a score close to 0 false, with the specific threshold determined by the scene and the training results.
If the speaker appeared in the training set, the score is calculated against the corresponding speaker center; if not, inner products with all speaker centers are calculated and the maximum is taken, and the result is judged real or false against the threshold. The inner product is calculated as:

$$\mathrm{score}_i=\begin{cases}\hat{w}_{s_i}^{\top}\hat{x}_i, & s_i\in S\\ \max_{s\in S}\hat{w}_{s}^{\top}\hat{x}_i, & \text{otherwise}\end{cases}$$

where $\hat{x}_i$ is the normalized voice embedding of the $i$-th test utterance, $s_i$ is the corresponding speaker, $\hat{w}_{s_i}$ is the average embedding of the corresponding enrolled speaker, and $\hat{w}_s$ is the voice embedding of speaker $s$ in the training set.
Another object of the present invention is to provide a self-supervised voice fake-identification training system based on multi-center single classification, the system comprising:
the voice data processing module is used for processing voice data;
the voice feature extraction module is used for inputting the processed voice data into the feature extraction module and extracting voice features by using the pre-training self-supervision front-end network;
the feature fusion module is used for fusing the voice features extracted from the pre-trained self-supervision front end;
the fake identifying training module is used for inputting the fused voice characteristics into a fake identifying network, adding a multi-center single-classification loss model, and training and optimizing the multi-center single-classification loss model and the fake identifying network;
and the reasoning module is used for judging the authenticity of the voice by using the threshold value in the reasoning process.
Combining all the above technical schemes, the invention has the following advantages and positive effects. Building on existing self-supervised voice fake-identification methods, the invention provides a multi-center single-classification voice fake-identification method combined with hierarchical feature fusion of a self-supervised front end; the more robust front-end hierarchical features overcome the poor performance of hand-crafted features in the fake-identification field. Because speakers with different characteristics occupy different positions in the embedding space, a multi-center embedding space is introduced for training, which addresses the high error rate of single-center embedding-space methods; at the same time, multiple kinds of noise and reverberation are added to simulate real environments, effectively solving the low accuracy and poor generalization of voice fake identification in real-world conditions.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the principles of the disclosure;
FIG. 1 is a flowchart of a self-supervision speech authentication training method based on multi-center single classification provided by an embodiment of the invention;
FIG. 2 is a flow chart of fusion of extracted speech features and feature fusion modules provided by an embodiment of the present invention;
FIG. 3 is a flowchart for training and optimizing a loss model and a pseudo-network of multi-center single classification according to an embodiment of the present invention;
FIG. 4 is a flow chart of a loss network method of a loss model for multi-center single classification provided by an embodiment of the invention;
FIG. 5 is a schematic diagram of a self-supervision speech authentication training system based on multi-center single classification provided by an embodiment of the invention;
in the figure: 1. a voice data processing module; 2. a voice feature extraction module; 3. a feature fusion module; 4. the fake identification training module; 5. and an inference module.
Detailed Description
In order that the above objects, features and advantages of the invention may be readily understood, a more particular description of the invention is rendered below with reference to the appended drawings. In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention. However, the invention may be embodied in many other forms than those described herein, and those skilled in the art can make similar modifications without departing from its spirit; the invention is therefore not limited to the specific embodiments disclosed below.
The invention innovatively introduces a multi-center single-classification loss model to represent voice in an embedding space, simulating the diversity of real voice: through training optimization, it constructs multiple compact clusters for specific speakers with different characteristics in the embedding space, and judges the authenticity of input voice by comparing it with the clusters of real voice. Based on the meanings represented by the different levels of the front end, a representation that weights and fuses the hierarchical features of the self-supervised front end during training is chosen in place of simple feature addition, and the pre-trained self-supervised front end is fine-tuned to improve the generalization and reliability of the voice representation.
In addition, the scheme adds various kinds of reverberation, noise and background sounds to simulate real environments and increase model generalization.
In summary, the invention combines a multi-center single-classification voice fake-identification training method with a self-supervised front end to improve the accuracy and generalization of voice fake identification in real environments.
Embodiment 1: as shown in FIG. 1, the self-supervised voice fake-identification training method based on multi-center single classification provided by this embodiment of the invention is used for training and inference of speaker voice fake identification in real scenes.
The method specifically comprises the following steps:
s1, processing voice data;
s2, inputting the processed voice data into a feature extraction module, and extracting voice features by using a pre-training self-supervision front-end network;
s3, fusing the voice features extracted from the pre-trained self-supervision front end;
s4, inputting the fused voice characteristics into a fake identification network, adding a multi-center single-classification loss model, and training and optimizing the multi-center single-classification loss model and the fake identification network;
it will be appreciated that the pseudo-authentication network is a multi-scale residual network incorporating the squeze-and-ExcitationNetworks (SENet). And inputting the obtained result into a multi-center single-classification loss model to obtain true and false, and optimizing the false identification network and the loss model.
S5, judging the authenticity of the voice against a threshold during inference.
In step S1 of the embodiment of the present invention, processing the voice data includes: processing the input voice by adding reverberation, convolutional noise, background sounds and the like to increase generalization, and cutting the voice into fixed-length random frames, as illustrated in the sketch below.
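For illustration, the following Python sketch shows one way this preprocessing step could be realized; it assumes torchaudio 2.0 or later, and the function name, SNR range and 4-second frame length are illustrative assumptions rather than values fixed by the invention.

```python
import random
import torch
import torchaudio

def preprocess(wave: torch.Tensor, rir: torch.Tensor, noise: torch.Tensor,
               frame_len: int = 64600) -> torch.Tensor:
    """Augment one utterance, then cut a fixed-length random segment.

    wave, noise: (1, T) waveforms (noise assumed at least as long as wave);
    rir: (1, R) room impulse response; frame_len is about 4 s at 16 kHz.
    """
    # Reverberation: convolve the speech with a room impulse response.
    rir = rir / torch.linalg.vector_norm(rir)
    wave = torchaudio.functional.fftconvolve(wave, rir)[:, : wave.shape[1]]
    # Additive background noise at a random SNR (range is an illustrative choice).
    snr = torch.tensor([random.uniform(5.0, 20.0)])
    wave = torchaudio.functional.add_noise(wave, noise[:, : wave.shape[1]], snr)
    # Fixed-length random framing: repeat-pad short clips, then crop at random.
    if wave.shape[1] < frame_len:
        wave = wave.repeat(1, frame_len // wave.shape[1] + 1)
    start = random.randint(0, wave.shape[1] - frame_len)
    return wave[:, start : start + frame_len]
```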
As shown in fig. 2, in step S2 of the embodiment of the present invention, the pre-training self-supervising front end network is used to extract speech features, including:
loading the pre-trained self-supervised front-end network model Whisper and its pre-trained weights, and inputting the processed voice data to obtain multi-level voice features; screening the voice features, retaining the features of the latter half of the layers, and then passing them to the feature fusion module of step S3 for fusion.
the step S3 of fusion comprises the following steps: firstly, carrying out pooling operation, reducing feature dimension, then entering a self-attention mechanism based on convolution, firstly passing through a 2-D convolution layer, then entering an activation layer and a batch normalization layer, finally, again using convolution operation to obtain final attention weight, then carrying out normalization through a softmax function, and multiplying the final attention-based voice feature by the original feature. And obtaining the voice representation with high quality and low dimensionality.
Weighting and summing allocates higher attention to the important features and increases robustness. Both the pre-trained self-supervised front-end network and the fusion network in this process are continuously and iteratively optimized during training; a sketch of the two follows.
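To make the front-end and fusion steps concrete, the sketch below shows one plausible reading of this pipeline using the Hugging Face transformers implementation of Whisper; the layer split, the mean-pooling over layers and the attention shapes are interpretive assumptions, not the patent's exact architecture.

```python
import torch
import torch.nn as nn
from transformers import WhisperFeatureExtractor, WhisperModel

class WhisperFusionFrontEnd(nn.Module):
    """Extract multi-level Whisper encoder features, keep the latter half of
    the layers, and fuse them with a convolutional self-attention weighting."""

    def __init__(self, name: str = "openai/whisper-base"):
        super().__init__()
        self.fe = WhisperFeatureExtractor.from_pretrained(name)
        self.encoder = WhisperModel.from_pretrained(name).encoder  # fine-tuned later
        self.att = nn.Sequential(                  # convolution-based self-attention
            nn.Conv2d(1, 16, 3, padding=1),        # 2-D convolution layer
            nn.ReLU(),                             # activation layer
            nn.BatchNorm2d(16),                    # batch normalization layer
            nn.Conv2d(16, 1, 3, padding=1),        # convolution again -> attention map
        )

    def forward(self, waves):                      # waves: list of 1-D 16 kHz arrays
        inputs = self.fe(waves, sampling_rate=16000, return_tensors="pt")
        out = self.encoder(inputs.input_features, output_hidden_states=True)
        h = torch.stack(out.hidden_states, dim=1)  # (B, L+1, T, D) multi-level features
        h = h[:, h.shape[1] // 2 :]                # screening: keep latter-half layers
        pooled = h.mean(dim=1, keepdim=True)       # pooling reduces the layer dimension
        w = torch.softmax(self.att(pooled), dim=2) # softmax-normalized attention weights
        return (w * pooled).squeeze(1)             # multiply with the original features
```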
In step S4 of the embodiment of the present invention, the training optimization flow of the multi-center single-classification loss model and the fake-identification network is shown in FIG. 3. The fused voice features are input into the fake-identification network and trained with a multi-scale residual network incorporating Squeeze-and-Excitation Networks (SENet); SENet models the correlation among features and strengthens the important ones. The resulting global feature vector is input into the multi-center single-classification loss model, whose loss network method is shown in FIG. 4: real voice is compressed into multiple clusters in the embedding space to form multiple speaker centers, the clusters being formed from speaker identities during training, while false voice stays far from the real clusters in the embedding space, thereby revealing the authenticity of the voice.
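A minimal sketch of the SENet mechanism referred to here, in PyTorch; the channel count and reduction ratio are illustrative, and the full multi-scale residual back end is not reproduced.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation: model inter-channel correlation and re-weight
    (strengthen) the informative feature maps."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x):                    # x: (B, C, H, W)
        s = x.mean(dim=(2, 3))               # squeeze: global average pooling
        w = self.fc(s)[:, :, None, None]     # excitation: per-channel weights
        return x * w                         # strengthen the important channels

class SEResBlock(nn.Module):
    """One residual block with an SE unit, a building brick of the multi-scale
    residual fake-identification network (a sketch, not the exact topology)."""

    def __init__(self, c: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(c, c, 3, padding=1), nn.BatchNorm2d(c), nn.ReLU(inplace=True),
            nn.Conv2d(c, c, 3, padding=1), nn.BatchNorm2d(c),
        )
        self.se = SEBlock(c)

    def forward(self, x):
        return torch.relu(x + self.se(self.body(x)))
```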
The loss function of the loss network method is calculated as:

$$L=\frac{1}{N}\sum_{i=1}^{N}\log\left(1+e^{a\,(m_{y_i}-d_i)\,(-1)^{y_i}}\right)$$

where $L$ is the loss, $N$ is the number of speakers in the batch, $e$ is a constant and $a$ is a scale factor; $y_i$ is the corresponding ground-truth label, taking the value 0 for real voice and 1 for false voice; $m_0, m_1$ are the boundaries of real and false voice respectively, and $m_{y_i}$ denotes the one of $m_0, m_1$ selected by $y_i$.

$d_i$, which measures the similarity between the voice embedding and the cluster embeddings using cosine similarity, is calculated as:

$$d_i=\begin{cases}\hat{w}_{s_i}^{\top}\hat{x}_i, & y_i=0\\ \max_{s\in S}\hat{w}_{s}^{\top}\hat{x}_i, & y_i=1\end{cases}$$

where $\hat{x}_i$ is the normalized voice embedding, $\hat{w}_s$ is the voice embedding of speaker cluster $s$, $S$ is the set of different speakers in the training set with $s_i$ the speaker of the $i$-th utterance, and $w$ is the average embedded representation of a speaker's real voice.
For input real audio, the similarity calculation compares the voice embedding with the voice embedding of the corresponding speaker center; for input synthesized voice, it compares the voice embedding with the closest of all speaker centers. By continuously reducing the loss, the multi-center single-classification loss model learns to compress voices belonging to the same speaker together and to push spoofing attacks away from the embedding space of the speaker centers.
The implementation process comprises: initializing the multi-center single-classification loss model with random weights and the multiple speaker centers as one-hot vector representations; inputting the global semantic features to calculate their similarity with the speaker centers; and, over training iterations, updating the multiple speaker centers in the embedding space with the average characterization of each speaker, feeding back to optimize the loss model weights, the embedding space and the speaker centers.
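The following PyTorch sketch implements the loss just described, under the label convention above (0 = real, 1 = false); the boundary and scale values are illustrative assumptions, and for brevity the centers are kept as learnable parameters instead of being re-estimated from per-speaker average embeddings as the patent describes.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiCenterOneClassLoss(nn.Module):
    """L = 1/N * sum_i log(1 + exp(a * (m_{y_i} - d_i) * (-1)^{y_i})),
    with one center per training speaker, one-hot initialized."""

    def __init__(self, emb_dim: int, n_speakers: int,
                 m_real: float = 0.9, m_fake: float = 0.2, a: float = 20.0):
        super().__init__()
        assert emb_dim >= n_speakers, "one-hot init assumes emb_dim >= n_speakers"
        self.centers = nn.Parameter(torch.eye(n_speakers, emb_dim))  # one-hot init
        self.m_real, self.m_fake, self.a = m_real, m_fake, a

    def forward(self, emb, spk_ids, labels):
        """emb: (B, D); spk_ids: (B,) long speaker indices; labels: 0=real, 1=fake."""
        x = F.normalize(emb, dim=1)                    # normalized voice embeddings
        w = F.normalize(self.centers, dim=1)           # normalized speaker centers
        sims = x @ w.t()                               # cosine similarities (B, S)
        d_own = sims.gather(1, spk_ids[:, None]).squeeze(1)  # own-center similarity
        d_max = sims.max(dim=1).values                 # closest of all centers
        real = labels == 0
        d = torch.where(real, d_own, d_max)            # real -> own center, fake -> max
        m_y = torch.where(real, torch.full_like(d, self.m_real),
                          torch.full_like(d, self.m_fake))
        sign = torch.where(real, torch.ones_like(d), -torch.ones_like(d))  # (-1)^y
        return torch.log1p(torch.exp(self.a * (m_y - d) * sign)).mean()
```

Minimizing this pushes the similarity of real voice above $m_0$ toward its own speaker center and the similarity of false voice below $m_1$ for every center, which is exactly the clustering behavior described above.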
In step S5 of the embodiment of the invention, during inference, the voice to be detected is input into the fake-identification network and then into the multi-center single-classification loss model; the similarity between the representation of the voice in the embedding space and the representations of the speaker centers is calculated to obtain a score, and the authenticity of the voice is judged against a threshold.
Scores lie between 0 and 1, where 0 represents false and 1 represents true: a score close to 1 is judged real and a score close to 0 false, with the specific threshold determined by the scene and the training results.
If the speaker appeared in the training set, the score is calculated against the corresponding speaker center; if not, inner products with all speaker centers are calculated and the maximum is taken, and the result is judged real or false against the threshold. The inner product is calculated as:

$$\mathrm{score}_i=\begin{cases}\hat{w}_{s_i}^{\top}\hat{x}_i, & s_i\in S\\ \max_{s\in S}\hat{w}_{s}^{\top}\hat{x}_i, & \text{otherwise}\end{cases}$$

where $\hat{x}_i$ is the normalized voice embedding of the $i$-th test utterance, $s_i$ is the corresponding speaker, $\hat{w}_{s_i}$ is the average embedding of the corresponding enrolled speaker, and $\hat{w}_s$ is the voice embedding of speaker $s$ in the training set.
Through this embodiment, the accuracy of voice fake identification in real scenes can be effectively improved.
In an exemplary embodiment, under a training scene reflecting a real environment, training data from different persons carry different labels. During training, voice features are extracted, processed and fused, then input into the fake-identification network; because of the acoustic diversity of speakers, the voices naturally cluster and form multiple speaker embedding centers in the space. The initially initialized speaker embedding centers and the overall embedding space are continuously updated. During inference, if the input voice belongs to a speaker seen in training, it is judged against the corresponding embedding center using the threshold; if not, it is compared against the embedding centers of all speakers, as in the wiring sketch below.
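Tying the pieces together, a hypothetical training step could be wired as follows; the module names match the sketches above and are assumptions, not the patent's API.

```python
def train_step(batch, frontend, backbone, loss_fn, optimizer):
    """One iteration: augmented voice -> fused features -> embedding -> loss.
    batch = (waves, spk_ids, labels), labels 0 = real, 1 = fake."""
    waves, spk_ids, labels = batch
    feats = frontend(waves)                # Whisper layers + attention fusion
    emb = backbone(feats)                  # SE multi-scale residual network
    loss = loss_fn(emb, spk_ids, labels)   # multi-center one-class loss
    optimizer.zero_grad()
    loss.backward()                        # also fine-tunes the front end
    optimizer.step()
    return float(loss)
```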
In the invention, step S2 uses a pre-trained self-supervised front-end network such as Whisper to extract voice features; the features are then screened and the features of part of the layers are retained before entering the feature fusion module of step S3, which first performs a pooling operation to reduce the feature dimension and then applies a convolution-based self-attention mechanism: specifically, the features first pass through a 2-D convolution layer, then an activation layer and a batch normalization layer, and finally a convolution operation is applied again to obtain the final attention weights, which are normalized with a softmax function and multiplied with the original features to obtain the final attention-based voice features.
As can be seen from the above embodiments, in the security field the method can effectively improve voice recognition accuracy for fraud prevention, reducing fraud risk and protecting funds and personal information; it enhances the credibility of voice content in fields such as news broadcasting and television interviews, ensuring the trustworthiness and reliability of information; and it can effectively improve the reliability of audio forensics, safeguarding judicial fairness. With the widespread use of speech synthesis technology, the market demand for protecting privacy, preventing fraud and confirming the authenticity of voice keeps increasing.
No prior art yet combines a self-supervised learning front end such as Whisper with a multi-center single-classification training method, or adds a front-end voice feature fusion module, for voice fake identification.
The invention effectively improves the accuracy of the voice fake-identification model, effectively reduces the Equal Error Rate (EER) and the t-DCF metric, and improves model performance.
Embodiment 2: as shown in FIG. 5, the self-supervised voice fake-identification training system based on multi-center single classification provided by this embodiment of the invention is used for training and inference of speaker voice fake identification in real scenes.
The system specifically comprises the following modules:
the voice data processing module 1 is used for processing voice data;
the voice feature extraction module 2 is used for inputting the processed voice data into the feature extraction module and extracting voice features by using a pre-training self-supervision front-end network;
a feature fusion module 3 for fusing the pre-trained voice features extracted from the self-monitoring front end
The fake identifying training module 4 is used for inputting the fused voice characteristics into a fake identifying network, adding a multi-center single-classification loss model, and training and optimizing the multi-center single-classification loss model and the fake identifying network;
and the reasoning module 5 is used for judging the authenticity of the voice by using the threshold value in the reasoning process.
In the foregoing embodiments, the description of each embodiment has its own emphasis; for parts not described or illustrated in a particular embodiment, reference may be made to the related descriptions of other embodiments.
The content of the information interaction and the execution process between the devices/units and the like is based on the same conception as the method embodiment of the present invention, and specific functions and technical effects brought by the content can be referred to in the method embodiment section, and will not be described herein.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-described division of the functional units and modules is illustrated, and in practical application, the above-described functional distribution may be performed by different functional units and modules according to needs, i.e. the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-described functions. The functional units and modules in the embodiment may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit, where the integrated units may be implemented in a form of hardware or a form of a software functional unit. In addition, the specific names of the functional units and modules are only for distinguishing from each other, and are not used for limiting the protection scope of the present invention. For specific working processes of the units and modules in the system, reference may be made to corresponding processes in the foregoing method embodiments.
Based on the technical solutions described in the embodiments of the present invention, the following application examples may be further proposed.
According to an embodiment of the present application, the present invention also provides a computer apparatus, including: at least one processor, a memory, and a computer program stored in the memory and executable on the at least one processor, which when executed by the processor performs the steps of any of the various method embodiments described above.
Embodiments of the present invention also provide a computer readable storage medium storing a computer program which, when executed by a processor, performs the steps of the respective method embodiments described above.
The embodiment of the invention also provides an information data processing terminal, which is used for providing a user input interface to implement the steps in the method embodiments when being implemented on an electronic device, and the information data processing terminal is not limited to a mobile phone, a computer and a switch.
The embodiment of the invention also provides a server, which is used for realizing the steps in the method embodiments when being executed on the electronic device and providing a user input interface.
Embodiments of the present invention also provide a computer program product which, when run on an electronic device, causes the electronic device to perform the steps of the method embodiments described above.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on this understanding, the present application implements all or part of the flow of the methods of the above embodiments by instructing the relevant hardware through a computer program, which may be stored in a computer readable storage medium; when executed by a processor, the computer program may implement the steps of each of the method embodiments described above. The computer program comprises computer program code, which may be in source code form, object code form, an executable file, some intermediate form, etc. The computer readable medium may include at least: any entity or device capable of carrying computer program code to a photographing apparatus/terminal device, a recording medium, computer memory, read-only memory (ROM), random access memory (RAM), an electrical carrier signal, a telecommunication signal, and a software distribution medium, such as a USB flash drive, removable hard disk, magnetic disk or optical disk.
To further demonstrate the positive effects of the above embodiments, the following experiments were performed on the basis of the above technical solutions.
Because the timbre characteristics of different speakers in the data differ, the real voices of different speakers are found to naturally form multiple centers in the embedding space, whereas single-classification learning methods simply compress different speakers to one center, which can cause some attacks to be misclassified. A comparable experiment shows that on ASVspoof 2019 LA, the multi-center training method effectively improves model performance and reduces the EER by about 30% compared with a single-center single-classification training method. Using a self-supervised front end with feature fusion, e.g. Whisper or wav2vec 2.0, a 30% reduction in EER on the In-the-Wild dataset can be achieved compared with conventional front ends such as LFCC and MFCC. Combining the two shows improved generalization both theoretically and experimentally.
While the invention has been described with respect to what is presently considered to be the most practical and preferred embodiments, it is to be understood that the invention is not limited to the disclosed embodiments, but on the contrary, is intended to cover various modifications, equivalents, and alternatives falling within the spirit and scope of the invention.

Claims (10)

1. A self-supervised voice fake-identification training method based on multi-center single classification, characterized by being used for training and inference of speaker voice fake identification in real scenes and specifically comprising the following steps:
s1, processing voice data;
s2, inputting the processed voice data into a feature extraction module, and extracting voice features by using a pre-training self-supervision front-end network;
s3, fusing the voice features extracted from the pre-trained self-supervision front end;
s4, inputting the fused voice characteristics into a fake identification network, adding a multi-center single-classification loss model, and training and optimizing the multi-center single-classification loss model and the fake identification network;
s5, judging the authenticity of the voice by using a threshold value in an reasoning process.
2. The self-supervised voice fake-identification training method based on multi-center single classification according to claim 1, wherein in step S1, processing the voice data comprises: adding reverberation, convolutional noise and background sounds to the input voice to increase generalization, and cutting the voice into fixed-length random frames.
3. The self-supervised voice fake-identification training method based on multi-center single classification according to claim 1, wherein in step S2, extracting voice features with the pre-trained self-supervised front-end network comprises:
loading the pre-trained self-supervised front-end network model Whisper and its pre-trained weights, and inputting the processed voice data to obtain multi-level voice features; screening the voice features, retaining the features of the latter half of the layers, and then passing them to the feature fusion module for fusion;
and in step S3, fusing the voice features extracted by the pre-trained self-supervised front end comprises: performing a pooling operation to reduce the feature dimension; then entering a convolution-based self-attention mechanism, which passes through a 2-D convolution layer, an activation layer and a batch normalization layer, applies a convolution operation again to obtain the final attention weights, normalizes them with a softmax function, and multiplies them with the original features to obtain the final attention-based voice features.
4. The self-supervised voice fake-identification training method based on multi-center single classification according to claim 1, wherein in step S4, training and optimizing the multi-center single-classification loss model and the fake-identification network comprises: inputting the fused voice features into the fake-identification network, training the features with a multi-scale residual network, using SENet to model the correlation among features and strengthen the important ones, obtaining a global feature vector, inputting it into the multi-center single-classification loss model, performing iterative training optimization, and feeding back to optimize the loss model weights, the embedding space and the speaker centers.
5. The self-supervised voice fake-identification training method based on multi-center single classification according to claim 4, wherein the loss network method of the multi-center single-classification loss model comprises: compressing real voice into multiple clusters in the embedding space to form multiple speaker centers, the clusters being formed from speaker identities during training, while false voice stays far from the real clusters in the embedding space, thereby revealing the authenticity of the voice.
6. The self-supervised voice fake-identification training method based on multi-center single classification according to claim 5, wherein the loss function of the loss network method is calculated as:

$$L=\frac{1}{N}\sum_{i=1}^{N}\log\left(1+e^{a\,(m_{y_i}-d_i)\,(-1)^{y_i}}\right)$$

where $L$ is the loss, $N$ is the number of speakers in the batch, $e$ is a constant and $a$ is a scale factor; $y_i$ is the corresponding ground-truth label, taking the value 0 for real voice and 1 for false voice; $m_0, m_1$ are the boundaries of real and false voice respectively, and $m_{y_i}$ denotes the one of $m_0, m_1$ selected by $y_i$; $d_i$ measures the similarity between the voice embedding and the cluster embeddings using cosine similarity;

and wherein $d_i$ is calculated as:

$$d_i=\begin{cases}\hat{w}_{s_i}^{\top}\hat{x}_i, & y_i=0\\ \max_{s\in S}\hat{w}_{s}^{\top}\hat{x}_i, & y_i=1\end{cases}$$

in which $\hat{x}_i$ is the normalized voice embedding, $\hat{w}_s$ is the voice embedding of speaker cluster $s$, $S$ is the set of different speakers in the training set with $s_i$ the speaker of the $i$-th utterance, and $w$ is the average embedded representation of a speaker's real voice.
7. The self-supervised voice fake-identification training method based on multi-center single classification according to claim 6, wherein for input real audio the similarity calculation compares the voice embedding with the voice embedding of the corresponding speaker center;
for input synthesized voice, the similarity calculation compares the voice embedding with the closest of all speaker centers; and by continuously reducing the loss, the multi-center single-classification loss model learns to compress voices belonging to the same speaker together and to push spoofing attacks away from the embedding space of the speaker centers.
8. The self-supervised voice fake-identification training method based on multi-center single classification according to claim 7, wherein the similarity calculation comprises: first initializing the multi-center single-classification loss model with random weights and initializing the multiple speaker centers as one-hot vector representations; inputting the global semantic features to calculate their similarity with the speaker centers; and, over training iterations, updating the multiple speaker centers in the embedding space with the average characterization of each speaker, feeding back to optimize the loss model weights, the embedding space and the speaker centers.
9. The self-supervised voice fake-identification training method based on multi-center single classification according to claim 1, wherein in the inference process of step S5, the voice to be detected is input into the fake-identification network and then into the multi-center single-classification loss model, the similarity between the representation of the voice in the embedding space and the representations of the speaker centers is calculated to obtain a score, and the authenticity of the voice is judged against a threshold; scores lie between 0 and 1, where 0 represents false and 1 represents true, a score close to 1 being judged real and a score close to 0 false, with the specific threshold determined by the scene and the training results;
and wherein, if the speaker appeared in the training set, the score is calculated against the corresponding speaker center; if not, inner products with all speaker centers are calculated and the maximum is taken, and the result is judged real or false against the threshold; the inner product is calculated as:

$$\mathrm{score}_i=\begin{cases}\hat{w}_{s_i}^{\top}\hat{x}_i, & s_i\in S\\ \max_{s\in S}\hat{w}_{s}^{\top}\hat{x}_i, & \text{otherwise}\end{cases}$$

where $\hat{x}_i$ is the normalized voice embedding of the $i$-th test utterance, $s_i$ is the corresponding speaker, $\hat{w}_{s_i}$ is the average embedding of the corresponding enrolled speaker, and $\hat{w}_s$ is the voice embedding of speaker $s$ in the training set.
10. A self-supervised voice fake-identification training system based on multi-center single classification, characterized in that the system comprises:
the voice data processing module (1) is used for processing voice data;
the voice feature extraction module (2) is used for inputting the processed voice data into the feature extraction module and extracting voice features by using the pre-training self-supervision front-end network;
the feature fusion module (3) is used for fusing the voice features extracted from the pre-trained self-supervision front end;
the fake identification training module (4) is used for inputting the fused voice characteristics into a fake identification network, adding a multi-center single-classification loss model, and training and optimizing the multi-center single-classification loss model and the fake identification network;
and the inference module (5) is used for judging the authenticity of the voice against a threshold during inference.
CN202311682362.9A 2023-12-08 2023-12-08 Self-supervision voice fake identification training method and system based on multi-center single classification Pending CN117612562A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311682362.9A CN117612562A (en) 2023-12-08 2023-12-08 Self-supervision voice fake identification training method and system based on multi-center single classification

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311682362.9A CN117612562A (en) 2023-12-08 2023-12-08 Self-supervision voice fake identification training method and system based on multi-center single classification

Publications (1)

Publication Number Publication Date
CN117612562A true

Family

ID=89949765

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311682362.9A Pending CN117612562A (en) 2023-12-08 2023-12-08 Self-supervision voice fake identification training method and system based on multi-center single classification

Country Status (1)

Country Link
CN (1) CN117612562A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118016051A (en) * 2024-04-07 2024-05-10 中国科学院自动化研究所 Model fingerprint clustering-based generated voice tracing method and device


Similar Documents

Publication Publication Date Title
CN110310647B (en) Voice identity feature extractor, classifier training method and related equipment
CN112992126B (en) Voice authenticity verification method and device, electronic equipment and readable storage medium
CN106991312B (en) Internet anti-fraud authentication method based on voiceprint recognition
CN110188829B (en) Neural network training method, target recognition method and related products
CN117612562A (en) Self-supervision voice fake identification training method and system based on multi-center single classification
Ohi et al. Deep speaker recognition: Process, progress, and challenges
CN113011889B (en) Account anomaly identification method, system, device, equipment and medium
CN111798828B (en) Synthetic audio detection method, system, mobile terminal and storage medium
CN114678030A (en) Voiceprint identification method and device based on depth residual error network and attention mechanism
CN113628612A (en) Voice recognition method and device, electronic equipment and computer readable storage medium
Esmaeilpour et al. Multidiscriminator sobolev defense-GAN against adversarial attacks for end-to-end speech systems
Yu et al. Cam: Context-aware masking for robust speaker verification
CN113807940B (en) Information processing and fraud recognition method, device, equipment and storage medium
CN113241079A (en) Voice spoofing detection method based on residual error neural network
CN116595486A (en) Risk identification method, risk identification model training method and corresponding device
CN113593579B (en) Voiceprint recognition method and device and electronic equipment
CN113113048B (en) Speech emotion recognition method and device, computer equipment and medium
CN113111855B (en) Multi-mode emotion recognition method and device, electronic equipment and storage medium
CN112489678B (en) Scene recognition method and device based on channel characteristics
CN114822558A (en) Voiceprint recognition method and device, electronic equipment and storage medium
CN106373576A (en) Speaker confirmation method based on VQ and SVM algorithms, and system thereof
CN114049900B (en) Model training method, identity recognition device and electronic equipment
CN117313723B (en) Semantic analysis method, system and storage medium based on big data
Evsyukov et al. Antispoofing Countermeasures in Modern Voice Authentication Systems
CN113782033B (en) Voiceprint recognition method, voiceprint recognition device, voiceprint recognition equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination