CN114566170A - Lightweight voice spoofing detection algorithm based on one-class classification - Google Patents

Lightweight voice spoofing detection algorithm based on one-class classification

Info

Publication number
CN114566170A
Authority
CN
China
Prior art keywords
voice
class
speech
model
classification
Prior art date
Legal status
Pending
Application number
CN202210193172.XA
Other languages
Chinese (zh)
Inventor
彭海朋
任叶青
李丽香
赵洁
薛晓鹏
赵猛猛
孟寅
暴爽
Current Assignee
Beijing University of Posts and Telecommunications
Original Assignee
Beijing University of Posts and Telecommunications
Priority date
Filing date
Publication date
Application filed by Beijing University of Posts and Telecommunications
Priority: CN202210193172.XA
Publication: CN114566170A
Legal status: Pending

Classifications

    • G10L 17/04 — Speaker identification or verification techniques: training, enrolment or model building
    • G06F 18/214 — Pattern recognition: generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06F 18/241 — Pattern recognition: classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G10L 17/14 — Speaker identification or verification: use of phonemic categorisation or speech recognition prior to speaker recognition or verification
    • G10L 25/51 — Speech or voice analysis techniques specially adapted for comparison or discrimination


Abstract

The invention discloses a lightweight voice spoofing detection algorithm based on one-class classification. A new loss function, DOC-Softmax, is designed for the distinct characteristics of genuine and spoofed speech: a dispersion loss is introduced into the spoofed-speech space of the one-class classification loss OC-Softmax to alleviate the mismatch in feature distribution between training and test data, thereby improving the accuracy and generalization ability of the voice spoofing detection model. At the same time, a knowledge distillation framework is used to make the detector lightweight, reducing the number of model parameters and making it easy to deploy on mobile or embedded devices. The resulting model generalizes better than a model with the same structure and training data trained with hard labels alone.

Description

Lightweight voice spoofing detection algorithm based on one-class classification
Technical Field
The invention relates to the technical field of voiceprint recognition, and in particular to a lightweight voice spoofing detection algorithm based on one-class classification.
Background
Voiceprint recognition is a biometric recognition technology that identifies a speaker from the speaker information carried in a speech signal. It is contactless, robust to occlusion, does not require the user's focused attention, and relies on the speaker's voluntary cooperation; it is widely applied in finance, social security, government and enterprise, the Internet of Things, and other scenarios.
However, real application environments contain many uncertainties, especially deliberate malicious spoofing attacks, which sharply degrade the performance of existing voiceprint recognition systems. Voice spoofing refers to "fishing" an automatic speaker verification system with illegitimate audio, produced by means such as recording, speech synthesis, and voice conversion, so that unauthenticated speech passes verification. Currently there are three main types of spoofing attack: (1) deliberate impersonation by another speaker; (2) realistic speech produced by speech synthesis or voice conversion techniques; and (3) replay or splicing of recordings made with high-fidelity recording equipment. Of these three, deliberate impersonation can generally be identified as inauthentic by mainstream voiceprint recognition systems.
However, improvements in recording-device quality and the rapid development of speech-processing techniques such as speech synthesis and voice conversion pose increasingly serious challenges to spoofing detection and the security of voiceprint recognition systems. Voice spoofing detection applies deep learning or machine learning: hand-crafted features or raw speech are fed into a model for learning, with the final aim of discriminating genuine from spoofed speech. The residual network, knowledge distillation, the Softmax loss function, and the AM-Softmax loss function are introduced next.
(1) Residual network
In 2015, He Kaiming et al. proposed the residual network (ResNet) to alleviate the vanishing-gradient problem caused by increasing network depth in deep neural networks; it has been widely applied in image classification, object detection, speech recognition, and other fields. Residual networks address the degradation problem by introducing a deep residual learning framework. The main idea is to strip away the identical main part and highlight small variations: instead of directly fitting an underlying mapping H(x) with a few stacked nonlinear layers, the layers fit a residual mapping F(x) := H(x) − x, so that the original mapping becomes F(x) + x; optimizing the residual mapping is easier than optimizing the original mapping. The residual learning structure can be realized as a feedforward neural network with shortcut connections, as shown in Fig. 1. The shortcut performs an identity mapping, introduces no additional parameters, adds no computational complexity, and the whole network can still be trained end to end by back-propagation.
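The identity-shortcut idea F(x) + x can be sketched in a few lines. Below is a minimal NumPy illustration; the two-layer form of F and the layer shapes are illustrative assumptions, not the patent's exact architecture:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def residual_block(x, w1, w2):
    # F(x) = H(x) - x is the residual mapping fitted by the stacked layers;
    # the shortcut adds x back, so the block outputs relu(F(x) + x).
    f = relu(x @ w1) @ w2
    return relu(f + x)

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
w1 = rng.standard_normal((8, 8)) * 0.1
w2 = rng.standard_normal((8, 8)) * 0.1
y = residual_block(x, w1, w2)

# The shortcut adds no parameters: with F == 0 (zero weights) the block
# reduces to the identity mapping followed by the activation.
y_identity = residual_block(x, np.zeros((8, 8)), np.zeros((8, 8)))
```

This makes concrete why the shortcut is "free": the addition introduces no trainable weights and no extra complexity.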
(2) Knowledge distillation
Deep learning has achieved remarkable performance in many areas of computer vision, speech recognition, and natural language processing. However, most deep learning models are too computationally expensive to run on mobile or embedded devices, so models need to be compressed, and knowledge distillation is one of the key techniques for model compression. Knowledge distillation was first proposed by Hinton et al. and mainly covers three distillation settings: first, model compression, distilling the knowledge of a complex model into a small model; second, cross-modal knowledge transfer, transferring knowledge from one modality to another; and third, ensemble distillation, distilling the knowledge of multiple models into a single model. The invention uses a knowledge distillation framework to compress a one-class-classification voice spoofing detection model. The core idea is to first train a complex network model and then train a smaller network using the outputs of the complex network together with the true data labels. A knowledge distillation framework therefore usually contains a complex model (the Teacher model) and a small model (the Student model). The complex model, generally a single complex network or an ensemble of several networks, has good performance and generalization ability, while the small model has limited expressive power because of its small scale. The framework uses the knowledge learned by the large model to guide the training of the small model: the soft targets predicted by the Teacher model assist the hard-target training, so that the Student model attains performance comparable to the Teacher model while the number of parameters is greatly reduced, achieving model compression and lowering inference latency.
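As a hedged illustration of the soft-target idea, the sketch below combines a cross-entropy against the teacher's temperature-softened outputs with an ordinary hard-label cross-entropy. The temperature T and weight w are illustrative choices only; the patent's own student objective, described later, uses an MSE between confidence scores instead:

```python
import numpy as np

def softmax(z, T=1.0):
    # Temperature-scaled softmax; larger T gives softer distributions.
    z = z / T
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, hard_labels, T=4.0, w=0.5):
    # Soft-target term: cross-entropy against the teacher's softened outputs.
    p_teacher = softmax(teacher_logits, T)
    p_student_soft = softmax(student_logits, T)
    soft_ce = -np.mean(np.sum(p_teacher * np.log(p_student_soft + 1e-12), axis=1))
    # Hard-target term: ordinary cross-entropy against the true labels.
    p_student = softmax(student_logits)
    n = len(hard_labels)
    hard_ce = -np.mean(np.log(p_student[np.arange(n), hard_labels] + 1e-12))
    return w * soft_ce + (1 - w) * hard_ce

rng = np.random.default_rng(1)
teacher_logits = rng.standard_normal((5, 2)) * 3
student_logits = rng.standard_normal((5, 2))
labels = np.array([0, 1, 0, 1, 0])
loss = distillation_loss(student_logits, teacher_logits, labels)
```

The soft targets carry the teacher's relative confidence between classes, which is the extra "dark knowledge" the hard labels lack.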
(3) Softmax loss function and AM-Softmax loss function
The original Softmax loss for binary classification is:

$$L_{S}=-\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{w_{y_i}^{\top}x_i}}{e^{w_{0}^{\top}x_i}+e^{w_{1}^{\top}x_i}}$$

where x_i ∈ R^D is the embedding vector, y_i ∈ {0, 1} is the label of the i-th sample (y_i = 0 means the sample belongs to the target class, y_i = 1 to the non-target class), w_0, w_1 ∈ R^D are the weight vectors, and N is the number of samples in a batch.
AM-Softmax improves on Softmax by introducing an angular margin, making the embedding distribution of the two classes more compact:

$$L_{AMS}=-\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{\alpha(\hat w_{y_i}^{\top}\hat x_i-m)}}{e^{\alpha(\hat w_{y_i}^{\top}\hat x_i-m)}+e^{\alpha \hat w_{1-y_i}^{\top}\hat x_i}}$$

where α is a scale factor, m is the additive cosine margin, and ŵ_0, ŵ_1, and x̂ are the normalized versions of w_0, w_1, and x, respectively.
For these two loss functions, the embedding vectors of the target and non-target classes tend to converge in two opposite directions, w_0 − w_1 and w_1 − w_0. For AM-Softmax, the target and non-target classes share the same compact margin, and the larger m is, the more compact the embeddings.
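A minimal sketch of the two binary losses just defined (NumPy; the α and m values and the toy embeddings are illustrative):

```python
import numpy as np

def softmax_loss(x, y, W):
    # Original binary Softmax loss; W holds the two weight vectors w0, w1.
    logits = x @ W.T
    logits = logits - logits.max(axis=1, keepdims=True)
    p = np.exp(logits)
    p = p / p.sum(axis=1, keepdims=True)
    n = len(y)
    return -np.mean(np.log(p[np.arange(n), y] + 1e-12))

def am_softmax_loss(x, y, W, alpha=20.0, m=0.35):
    # AM-Softmax: cosine similarities with an additive margin m subtracted
    # from the target logit, scaled by alpha.
    xn = x / np.linalg.norm(x, axis=1, keepdims=True)
    Wn = W / np.linalg.norm(W, axis=1, keepdims=True)
    cos = xn @ Wn.T
    n = len(y)
    target = cos[np.arange(n), y] - m
    other = cos[np.arange(n), 1 - y]
    # Per-sample loss: log(1 + exp(-alpha * (target - other))).
    return np.mean(np.log1p(np.exp(-alpha * (target - other))))

W = np.array([[1.0, 0.0], [-1.0, 0.0]])   # w0 (target), w1 (non-target)
x = np.array([[2.0, 0.1], [-2.0, -0.1]])  # well-separated embeddings
y = np.array([0, 1])
l_softmax = softmax_loss(x, y, W)
l_am = am_softmax_loss(x, y, W)
```

With well-separated embeddings both losses are near zero; flipping the labels makes the margin-based loss blow up, which shows how the margin enforces compactness.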
In voice spoofing detection it is reasonable to train a compact embedding space for genuine speech, but a compact space trained for spoofing attacks may overfit to known attacks. In summary, spoofing detection algorithms have achieved many results, but they generalize poorly to unknown spoofing attacks, because most methods cast spoofing detection as a binary classification problem between genuine and spoofed speech, implicitly assuming that the feature distributions of training and test data are the same or similar. While this assumption is reasonable for genuine speech, it does not hold as well for spoofed speech: with the development of voice spoofing techniques such as voice conversion and speech synthesis, the spoofing attacks in a training set may never catch up with the expanding distribution of real-world attacks. In addition, most existing voice spoofing detection algorithms have complex network structures, heavy computation, and low speed, making them difficult to port to mobile or embedded devices.
Disclosure of Invention
Aiming at the problem of mismatched feature distributions between training and test data, the invention provides a lightweight voice spoofing detection algorithm based on one-class classification, with the goals of improving the accuracy and generalization ability of voice spoofing detection and reducing model inference latency.
In order to achieve the above purpose, the invention provides the following technical scheme:
a lightweight voice deception detection algorithm based on class classification utilizes a knowledge distillation framework to learn a feature space through a class classification loss function DOC-Softmax based on dispersion loss, real voice is embedded with a compact boundary in the feature space, certain distance is reserved between deception voice and the real voice, and dispersion loss is introduced into the deception voice feature space to maximize the distance from each deception voice sample to the center of the deception voice sample, so that the deception voice covers the whole deception voice space.
Further, the total loss L_DOCS of the dispersion-loss-based one-class classification loss function DOC-Softmax combines the one-class classification loss L_OCS and the dispersion loss L_D with weight λ:

$$L_{DOCS}=L_{OCS}+\lambda L_{D}$$

where w_0 ∈ R^D is the weight vector and ŵ_0 is its normalized version, the optimal direction of genuine speech; α is a scale factor; and two margins m_0 and m_1 are introduced to bound the angle θ_i between ŵ_0 and x̂_i for genuine speech and spoofed speech respectively, with m_0, m_1 ∈ [−1, 1] and m_0 > m_1.
The one-class classification loss function OC-Softmax is:

$$L_{OCS}=\frac{1}{N}\sum_{i=1}^{N}\log\left(1+e^{\alpha\left(m_{y_i}-\hat w_{0}^{\top}\hat x_i\right)(-1)^{y_i}}\right)$$

The two margins m_0 and m_1 (m_0, m_1 ∈ [−1, 1], m_0 > m_1) bound the angle θ_i between ŵ_0 and x̂_i for the genuine and spoof classes respectively. When y_i = 0, m_0 forces θ_i to be smaller than arccos m_0; when y_i = 1, m_1 forces θ_i to be larger than arccos m_1. A small arccos m_0 clusters the target class around the weight vector w_0, while a relatively large arccos m_1 keeps the non-target class away from w_0.
The dispersion loss is introduced as follows:

$$\mu=\frac{1}{M}\sum_{i:\,y_i=1}\hat x_i$$

$$L_{D}=\frac{M}{\sum_{i:\,y_i=1}\lVert\hat x_i-\mu\rVert_{2}+\epsilon}$$

where x_i ∈ R^D is the embedding vector, x̂_i is its normalized version, y_i ∈ {0, 1} is the label of the i-th sample (y_i = 0: genuine speech, y_i = 1: spoofed speech), N is the number of samples in a batch, M is the number of spoofed samples in a batch, ε is a small constant that avoids a zero denominator, and μ is the center of the spoofed samples in each batch. Minimizing the dispersion loss L_D maximizes the distance of the spoofed samples x̂_i from their center μ, so that spoofed speech covers as much of the spoofing region as possible.
Further, α = 20, m_0 = 0.9, m_1 = 0.2.
Further, when the speech sample is genuine speech, i.e. y_i = 0, m_0 forces θ_i to be smaller than arccos m_0, and a small arccos m_0 clusters genuine speech near the weight vector w_0; when the sample is spoofed speech, i.e. y_i = 1, m_1 forces θ_i to be larger than arccos m_1, and a relatively large arccos m_1 keeps spoofed speech away from w_0.
Further, the teacher model employs a network structure based on the deep residual network ResNet-18 and uses attention pooling instead of global average pooling.
Further, the teacher model takes the extracted LFCC features as input and uses the output of the fully connected layer as the embedding of the input speech.
Further, the embedding is fed into the DOC-Softmax loss function to obtain a confidence score s_i; the confidence score represents the probability that the input speech is genuine or spoofed, from which the classification result of the input speech is obtained.
Furthermore, the model architecture of the student model is largely the same as that of the teacher model, except that 3 residual modules are removed, and the soft labels predicted by the teacher model assist the hard labels in training the student model. The loss function of the student model has two parts: the dispersion-loss-based one-class classification loss L_DOCS, and the mean-square-error loss L_MSE between the student model's confidence score s_i^S and the teacher model's confidence score s_i^T:

$$L_{MSE}=\frac{1}{N}\sum_{i=1}^{N}\left(s_i^{S}-s_i^{T}\right)^{2}$$

The total loss of the student model combines L_DOCS and L_MSE with weight β:

$$L_{Student}=L_{DOCS}+\beta L_{MSE}$$
compared with the prior art, the invention has the beneficial effects that:
the lightweight voice deception detection algorithm based on one class of classification designs a new loss function DOC-Softmax aiming at characteristics of real voice and deceptive voice, namely, a dispersion loss function is introduced into a deceptive voice space of one class of classification loss function OC-Softmax to relieve the problem of unmatched feature distribution between training data and test data, so that the accuracy and the generalization capability of a voice deception detection model are improved. Meanwhile, the voice deception detection algorithm is designed into a lightweight voice deception detection algorithm by utilizing a knowledge distillation framework, so that the parameter quantity of the model is reduced, and the model is convenient to deploy to a mobile terminal or embedded equipment. The model has better generalization capability than a model obtained by using the same model structure and training data and only using a hard label training method.
Drawings
To illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings needed in the embodiments are briefly described below. Obviously, the drawings described below are only some embodiments of the invention, and those skilled in the art can derive other drawings from them.
Fig. 1 is a schematic diagram of a residual structure.
Fig. 2 is a diagram of a model architecture according to an embodiment of the present invention.
Fig. 3 is a schematic diagram of a residual error module according to an embodiment of the present invention.
FIG. 4 is a comparison of four loss functions of Softmax, AM-Softmax, OC-Softmax, and DOC-Softmax, wherein FIG. 4(a) is Softmax, FIG. 4(b) is AM-Softmax, FIG. 4(c) is OC-Softmax, and FIG. 4(d) is DOC-Softmax.
Detailed Description
The invention designs a lightweight voice spoofing detection algorithm based on one-class classification that alleviates the mismatch in feature distribution between training and test data. In one-class classification methods, the target class does not suffer from this distribution mismatch, while for the non-target class the samples in the training set are either absent or not statistically representative. The key idea of one-class classification is to capture the target-class distribution and place a tight classification boundary around it, so that all non-target data fall outside the boundary.
A one-class classification loss function, OC-Softmax, is defined as follows:

$$L_{OCS}=\frac{1}{N}\sum_{i=1}^{N}\log\left(1+e^{\alpha\left(m_{y_i}-\hat w_{0}^{\top}\hat x_i\right)(-1)^{y_i}}\right)\qquad(3)$$

This loss function has only one weight vector, w_0, which is the optimal direction for target-class embeddings; in this formula ŵ_0 and x̂_i denote the normalized w_0 and x_i. Two margins m_0 and m_1 (m_0, m_1 ∈ [−1, 1], m_0 > m_1) bound the angle θ_i between ŵ_0 and x̂_i for the genuine and spoof classes respectively. When y_i = 0, m_0 forces θ_i to be smaller than arccos m_0; when y_i = 1, m_1 forces θ_i to be larger than arccos m_1. A small arccos m_0 clusters the target class around the weight vector w_0, while a relatively large arccos m_1 keeps the non-target class away from w_0.
However, as can be seen from equation (3), when the target class tightly surrounds the weight vector w_0 and the non-target class tightly surrounds the opposite direction, denoted w_0′, the one-class classification loss L_OCS reaches its minimum; the final optimization direction for the non-target class is therefore still to cluster tightly around w_0′. Once the OC-Softmax function is optimized well enough, this loss is essentially no different from the AM-Softmax function.
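Under the definitions in equation (3), OC-Softmax can be sketched as follows (NumPy; the toy embeddings and weight vector are illustrative):

```python
import numpy as np

def oc_softmax_loss(x, y, w0, alpha=20.0, m0=0.9, m1=0.2):
    # Single weight vector w0; cos(theta_i) between normalized w0 and x_i.
    xn = x / np.linalg.norm(x, axis=1, keepdims=True)
    wn = w0 / np.linalg.norm(w0)
    cos = xn @ wn
    m_y = np.where(y == 0, m0, m1)       # margin depends on the label
    sign = np.where(y == 0, 1.0, -1.0)   # (-1)^{y_i}
    return np.mean(np.log1p(np.exp(alpha * (m_y - cos) * sign)))

w0 = np.array([1.0, 0.0])
x = np.array([[0.9, 0.1],     # genuine: small angle to w0
              [-0.8, 0.6]])   # spoofed: large angle to w0
y = np.array([0, 1])
loss_correct = oc_softmax_loss(x, y, w0)
loss_swapped = oc_softmax_loss(x, 1 - y, w0)  # labels flipped: much larger
```

The sign factor (−1)^{y_i} is what pulls genuine samples inside arccos m_0 while pushing spoofed samples outside arccos m_1.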
The invention aims to improve the accuracy and generalization ability of voice spoofing detection and to reduce model inference latency, and proposes a lightweight voice spoofing detection algorithm based on one-class classification. Specifically, a new loss function, DOC-Softmax, is designed for the distinct characteristics of genuine and spoofed speech: a dispersion loss is introduced into the spoofed-speech space of the one-class classification loss OC-Softmax to alleviate the mismatch in feature distribution between training and test data, thereby improving the accuracy and generalization ability of the voice spoofing detection model. At the same time, a knowledge distillation framework makes the detector lightweight, reducing the number of model parameters and making it easy to deploy on mobile or embedded devices.
The invention provides a lightweight voice spoofing detection algorithm based on one-class classification; it is a single-feature, single-system detection algorithm.
The method comprises two main parts: the design of a one-class classification loss function based on dispersion loss, and the design of a lightweight voice spoofing detection model based on knowledge distillation. A dispersion-loss-based one-class classification loss is used to learn a feature space in which genuine speech is embedded within a tight boundary and spoofed speech is kept at a distance from genuine speech. At the same time, the dispersion loss maximizes the distance of each spoofed sample from the center of the spoofed samples, so that spoofed speech covers the whole spoofed-speech space.
In an embodiment of the invention, 60-dimensional linear frequency cepstral coefficient (LFCC) features are extracted from each utterance, and a network structure based on the deep residual network ResNet-18 is adopted, with attention pooling replacing global average pooling. The network takes the extracted LFCC features as input, and the output confidence score represents the classification result; the performance of the spoofing detector is improved without using any data augmentation, feature fusion, or model ensembling. In addition, based on a knowledge distillation framework, a lightweight voice spoofing detection model, StudentNet, is designed from this network (TeacherNet), greatly reducing the model parameters (by a factor of about 30) and making the model easy to deploy on mobile or embedded devices.
1. Design of a one-class classification loss function based on dispersion loss
The comparison of four loss functions of Softmax, AM-Softmax, OC-Softmax and DOC-Softmax is shown in FIG. 4, wherein FIG. 4(a) is Softmax, FIG. 4(b) is AM-Softmax, FIG. 4(c) is OC-Softmax and FIG. 4(d) is DOC-Softmax.
For the two loss functions Softmax and AM-Softmax, the embedding vectors of the target and non-target classes tend to converge in two opposite directions, w_0 − w_1 and w_1 − w_0, as shown in Fig. 4(a-b). In AM-Softmax the target and non-target classes share the same compact margin, and the larger m is, the more compact the embeddings. In spoofing detection it is reasonable to train a compact embedding space for genuine speech, but training an equally compact space for spoofing attacks risks overfitting to known attacks. The one-class classification loss OC-Softmax handles this well by introducing two different margins that compact genuine speech and isolate spoofed speech; however, as equation (3) shows, L_OCS is minimized when the target class tightly surrounds the weight vector w_0 and the non-target class tightly surrounds the opposite direction w_0′. The final optimization direction of the non-target class is therefore still to cluster around w_0′, as shown in Fig. 4(c), and once OC-Softmax is optimized well enough this loss is essentially no different from AM-Softmax. We therefore introduce a dispersion loss for the non-target-class samples, as follows:
$$\mu=\frac{1}{M}\sum_{i:\,y_i=1}\hat x_i$$

$$L_{D}=\frac{M}{\sum_{i:\,y_i=1}\lVert\hat x_i-\mu\rVert_{2}+\epsilon}$$

where x_i ∈ R^D is the embedding vector, x̂_i is its normalized version, y_i ∈ {0, 1} is the label of the i-th sample (y_i = 0: genuine speech, y_i = 1: spoofed speech), N is the number of samples in a batch, M is the number of spoofed samples in a batch, ε is a very small constant that avoids a zero denominator, and μ is the center of the spoofed samples in each batch. Minimizing the dispersion loss L_D maximizes the distance of the spoofed samples x̂_i from their center μ, so that spoofed speech covers as much of the spoofing region as possible, i.e. as much as possible of the region where θ_i is larger than arccos m_1, as shown in Fig. 4(d); this increases the probability that spoofed speech falls into the spoofing region and improves the generalization performance of the model. Thus, the total loss L_DOCS combines the one-class classification loss L_OCS and the dispersion loss L_D with weight λ:

$$L_{DOCS}=L_{OCS}+\lambda L_{D}$$
where w_0 ∈ R^D is the weight vector and ŵ_0 is its normalized version, the optimal direction of genuine speech. α is a scale factor, with α = 20 in the present invention. Two margins m_0 and m_1 (m_0, m_1 ∈ [−1, 1], m_0 > m_1) are introduced to bound the angle θ_i between ŵ_0 and x̂_i for genuine and spoofed speech respectively; in the present invention m_0 = 0.9 and m_1 = 0.2. When the speech sample is genuine, i.e. y_i = 0, m_0 forces θ_i to be smaller than arccos m_0, and a small arccos m_0 clusters genuine speech near the weight vector w_0; when the sample is spoofed, i.e. y_i = 1, m_1 forces θ_i to be larger than arccos m_1, and a relatively large arccos m_1 keeps spoofed speech away from w_0.
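A sketch of the full DOC-Softmax computation, reading the dispersion term as the reciprocal of the total distance to the batch center of spoofed samples (the λ value and the toy batch below are illustrative assumptions):

```python
import numpy as np

def normalize(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def oc_softmax_loss(x, y, w0, alpha=20.0, m0=0.9, m1=0.2):
    cos = normalize(x) @ (w0 / np.linalg.norm(w0))
    m_y = np.where(y == 0, m0, m1)
    sign = np.where(y == 0, 1.0, -1.0)
    return np.mean(np.log1p(np.exp(alpha * (m_y - cos) * sign)))

def dispersion_loss(x, y, eps=1e-8):
    # L_D = M / (sum_i ||x_hat_i - mu|| + eps): minimizing it maximizes
    # the spread of the spoofed embeddings around their batch center mu.
    spoof = normalize(x)[y == 1]
    mu = spoof.mean(axis=0)
    dist = np.linalg.norm(spoof - mu, axis=1).sum()
    return len(spoof) / (dist + eps)

def doc_softmax_loss(x, y, w0, lam=0.1):
    # Total loss L_DOCS = L_OCS + lambda * L_D (lam is illustrative).
    return oc_softmax_loss(x, y, w0) + lam * dispersion_loss(x, y)

w0 = np.array([1.0, 0.0])
# One genuine sample plus spoofed samples, spread vs. clustered:
spread = np.array([[0.9, 0.1], [0.0, 1.0], [0.0, -1.0], [-1.0, 0.0]])
clustered = np.array([[0.9, 0.1], [-1.0, 0.01], [-1.0, -0.01], [-1.0, 0.02]])
y = np.array([0, 1, 1, 1])
ld_spread = dispersion_loss(spread, y)
ld_clustered = dispersion_loss(clustered, y)
total = doc_softmax_loss(spread, y, w0)
```

Spoofed embeddings clustered in one direction yield a much larger L_D than spread-out ones, which is exactly the pressure that keeps the non-target class from collapsing onto w_0′.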
2. Design of a lightweight voice spoofing detection model based on knowledge distillation
The overall architecture of the invention is shown in Fig. 2, with the residual module shown in Fig. 3. The teacher model, TeacherNet, is designed on the basis of the deep residual network ResNet-18, with an attentive temporal pooling layer replacing the global average pooling layer. The input of the teacher model is the extracted 60-dimensional linear frequency cepstral coefficient (LFCC) features, and the output of the fully connected layer is used as the embedding of the input speech, with dimension 256. The embedding is fed into the DOC-Softmax loss function to compute a confidence score s_i; the confidence score represents the probability that the input speech is genuine or spoofed, from which the classification result of the input speech is obtained.
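The attentive pooling swap can be sketched as a learned weighted average over frames. This is a minimal single-head sketch; the scoring parameterization v is an assumption, not the patent's exact layer:

```python
import numpy as np

def attentive_temporal_pooling(H, v):
    # H: (T, D) frame-level features; v: (D,) attention parameter.
    # Scores one value per frame, softmax-normalizes the scores, and
    # returns the attention-weighted average over time. Global average
    # pooling is the special case of uniform weights.
    scores = H @ v
    scores = scores - scores.max()
    a = np.exp(scores)
    a = a / a.sum()
    return a @ H

rng = np.random.default_rng(2)
H = rng.standard_normal((10, 4))   # 10 frames, 4-dim features
v = rng.standard_normal(4)
emb = attentive_temporal_pooling(H, v)
```

With v = 0 every frame gets the same weight and the layer degenerates to global average pooling, which shows it strictly generalizes the layer it replaces.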
The architecture of the student model, StudentNet, is largely the same as that of the teacher model, except that 3 residual modules are removed. The soft labels s_i^T predicted by the teacher model TeacherNet assist the hard labels in training StudentNet; the soft labels carry a great deal of information from TeacherNet's inductive reasoning. The loss function of the student model has two parts: the dispersion-loss-based one-class classification loss L_DOCS, and the mean-square-error loss L_MSE between the student model's confidence score s_i^S and the teacher model's confidence score s_i^T:

$$L_{MSE}=\frac{1}{N}\sum_{i=1}^{N}\left(s_i^{S}-s_i^{T}\right)^{2}$$

The total loss of the student model combines L_DOCS and L_MSE with weight β:

$$L_{Student}=L_{DOCS}+\beta L_{MSE}$$
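The student objective reduces to a few lines (the β value and toy scores are illustrative; the precomputed L_DOCS value is passed in):

```python
import numpy as np

def mse_score_loss(s_student, s_teacher):
    # L_MSE between the student's and teacher's confidence scores.
    return np.mean((s_student - s_teacher) ** 2)

def student_total_loss(l_docs, s_student, s_teacher, beta=1.0):
    # Total student loss: L_DOCS + beta * L_MSE.
    return l_docs + beta * mse_score_loss(s_student, s_teacher)

s_teacher = np.array([0.95, 0.10, 0.88])  # teacher confidence scores (soft labels)
s_student = np.array([0.90, 0.20, 0.80])
loss = student_total_loss(0.5, s_student, s_teacher, beta=1.0)
```

Matching the teacher's per-utterance scores, rather than only the hard labels, is what transfers the teacher's learned decision behavior to the much smaller student.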
the student Net trained based on the knowledge distillation framework has better generalization capability than the model obtained by using the identical model structure, training data and only using the hard label training method. The model parameter of the StudentNet is about 402K, which is reduced by about 30 times compared with the parameter quantity of the TeacherNet, and the model size of the StudentNet is 1590KB, which is convenient for the deployment to a mobile terminal or an embedded device. The voice deception detection algorithm provided by the invention is a single-system and single-feature algorithm, and under the condition of not using any data enhancement, feature fusion and model integration methods, the accuracy of the algorithm is improved, and the reasoning time delay of the model is greatly reduced.
The above embodiments are intended only to illustrate the technical solution of the invention, not to limit it. Although the invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that modifications may be made to the technical solutions described in the foregoing embodiments, or equivalents substituted for some of their technical features, without departing from the spirit and scope of the technical solutions of the embodiments of the invention.

Claims (8)

1. A lightweight voice spoofing detection algorithm based on one-class classification, characterized in that a knowledge distillation framework is utilized, and a feature space is learned through a dispersion-loss-based one-class classification loss function DOC-Softmax, so that genuine speech is embedded in the feature space with a compact boundary between spoofed speech and genuine speech, and a dispersion loss is introduced in the spoofed-speech feature space to maximize the distance of each spoofed-speech sample from the spoofed-sample center, so that spoofed speech covers the whole spoofing space.
2. The one-class-classification-based lightweight voice spoofing detection algorithm of claim 1, wherein the total loss L_DOCS of the dispersion-loss-based one-class classification loss function DOC-Softmax is the weighted sum of the one-class classification loss L_OCS and the dispersion loss L_D with weight λ, with the specific formula:

L_DOCS = L_OCS + λ · L_D

wherein w_0 is the weight vector, ŵ_0 is the normalization of w_0, α is a scaling factor, and two margins m_0 and m_1, with m_0, m_1 ∈ [−1, 1] and m_0 > m_1, are introduced to bound the angles θ_i between the genuine-speech weight vector ŵ_0 and the embeddings x̂_i of genuine speech and spoofed speech, respectively.
The formula of the one-class classification loss function OC-Softmax is as follows:

L_OCS = (1/N) Σ_{i=1}^{N} log(1 + exp(α (m_{y_i} − ŵ_0 x̂_i)(−1)^{y_i}))

The two margins m_0 and m_1 are introduced to bound the angle θ_i between ŵ_0 and x̂_i for the genuine class and the spoofed class, respectively: when y_i = 0, m_0 forces θ_i to be smaller than arccos m_0; when y_i = 1, m_1 forces θ_i to be larger than arccos m_1. A small arccos m_0 clusters the target class around the weight vector w_0, while a relatively large arccos m_1 keeps the non-target class away from w_0.
The dispersion loss is introduced with the following formulas:

μ = (1/M) Σ_{i: y_i = 1} x̂_i

L_D = 1 / ( (1/M) Σ_{i: y_i = 1} ‖x̂_i − μ‖₂ + ε )

wherein x_i is the embedding vector, x̂_i is the normalization of x_i, y_i ∈ {0, 1} is the label of the ith sample, with y_i = 0 meaning the sample is genuine speech and y_i = 1 meaning the sample is spoofed speech, N is the number of samples in a batch, M is the number of spoofed samples in the batch, ε is a small constant that avoids a zero denominator, and μ is the center of the spoofed samples in each batch. The dispersion loss L_D maximizes the distances of the spoofed-speech embeddings x̂_i from their center μ, so that spoofed speech covers the spoofing region as much as possible.
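A minimal NumPy sketch of the DOC-Softmax loss of this claim, combining OC-Softmax with the dispersion loss, follows for illustration. The function names, the reciprocal form of L_D with an ε-guarded denominator, and the default λ are assumptions consistent with the formulas in the claim:

```python
import numpy as np

def oc_softmax_loss(w0, x, y, alpha=20.0, m0=0.9, m1=0.2):
    """OC-Softmax: pull genuine embeddings (y=0) within arccos(m0) of w0,
    push spoofed embeddings (y=1) beyond arccos(m1)."""
    w0_hat = w0 / np.linalg.norm(w0)
    x_hat = x / np.linalg.norm(x, axis=1, keepdims=True)
    scores = x_hat @ w0_hat                      # cos(theta_i)
    margins = np.where(y == 0, m0, m1)           # m_{y_i}
    signs = np.where(y == 0, 1.0, -1.0)          # (-1)^{y_i}
    return float(np.mean(np.log1p(np.exp(alpha * (margins - scores) * signs))))

def dispersion_loss(x, y, eps=1e-8):
    """Reciprocal mean distance of spoofed embeddings from their batch
    center mu; minimizing this maximizes their spread (assumed form)."""
    x_hat = x / np.linalg.norm(x, axis=1, keepdims=True)
    spoof = x_hat[y == 1]
    mu = spoof.mean(axis=0)                      # spoofed-sample center
    return float(1.0 / (np.mean(np.linalg.norm(spoof - mu, axis=1)) + eps))

def doc_softmax_loss(w0, x, y, lam=0.1):
    """Total DOC-Softmax loss L_DOCS = L_OCS + lambda * L_D (lam assumed)."""
    return oc_softmax_loss(w0, x, y) + lam * dispersion_loss(x, y)
```

With the claimed margins, a genuine embedding aligned with w_0 incurs a much smaller OC-Softmax penalty than one pointing away from it, which is the intended one-class geometry.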
3. The one-class-classification-based lightweight voice spoofing detection algorithm of claim 2, wherein α = 20, m_0 = 0.9, m_1 = 0.2.
4. The one-class-classification-based lightweight voice spoofing detection algorithm of claim 2, wherein when the speech sample is genuine speech, i.e. y_i = 0, m_0 forces θ_i to be smaller than arccos m_0, and a small arccos m_0 clusters genuine speech near the weight vector w_0; when the speech sample is spoofed speech, i.e. y_i = 1, m_1 forces θ_i to be larger than arccos m_1, and a relatively large arccos m_1 keeps spoofed speech away from the weight vector w_0.
5. The one-class-classification-based lightweight voice spoofing detection algorithm of claim 1, wherein the teacher model employs a network structure based on the deep residual network ResNet-18 and uses attention pooling in place of global average pooling.
6. The one-class-classification-based lightweight voice spoofing detection algorithm of claim 1, wherein the teacher model takes the extracted LFCC features as input and uses the output of the fully connected layer as the embedding of the input speech.
7. The one-class-classification-based lightweight voice spoofing detection algorithm of claim 1, wherein the DOC-Softmax loss function is applied to the embedding to compute a confidence score s_i = ŵ_0 x̂_i; the confidence score represents the probability that the input speech is genuine speech or spoofed speech, from which the classification result of the input speech is obtained.
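As an illustrative sketch of this scoring step, the confidence score is the cosine similarity between the embedding and the genuine-speech weight vector; the decision threshold below is an assumption, not specified by the claim:

```python
import numpy as np

def confidence_score(w0, x):
    """Cosine similarity between the normalized genuine-speech weight
    vector w0 and the normalized embedding x of the input utterance."""
    w0_hat = w0 / np.linalg.norm(w0)
    x_hat = x / np.linalg.norm(x)
    return float(w0_hat @ x_hat)

def classify(score, threshold=0.5):
    """Label the utterance genuine when its score clears the threshold
    (the threshold value is a hypothetical choice for illustration)."""
    return "genuine" if score >= threshold else "spoofed"
```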
8. The one-class-classification-based lightweight voice spoofing detection algorithm of claim 1, wherein the model structure of the student model is substantially the same as that of the teacher model except that 3 residual modules are removed, and the student model is trained with the soft labels predicted by the teacher model assisting the hard labels, wherein the loss function of the student model comprises two parts: one is the dispersion-loss-based one-class classification loss L_DOCS, and the other is the mean square error loss L_MSE between the confidence score s_i^S output by the student model and the confidence score s_i^T output by the teacher model, with the specific formula:

L_MSE = (1/N) Σ_{i=1}^{N} (s_i^S − s_i^T)²

The total loss function of the student model minimizes the weighted sum of L_DOCS and L_MSE with weight β, with the specific formula:

L = L_DOCS + β · L_MSE
CN202210193172.XA 2022-03-01 2022-03-01 Lightweight voice spoofing detection algorithm based on class-one classification Pending CN114566170A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210193172.XA CN114566170A (en) 2022-03-01 2022-03-01 Lightweight voice spoofing detection algorithm based on class-one classification

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210193172.XA CN114566170A (en) 2022-03-01 2022-03-01 Lightweight voice spoofing detection algorithm based on class-one classification

Publications (1)

Publication Number Publication Date
CN114566170A true CN114566170A (en) 2022-05-31

Family

ID=81716280

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210193172.XA Pending CN114566170A (en) 2022-03-01 2022-03-01 Lightweight voice spoofing detection algorithm based on class-one classification

Country Status (1)

Country Link
CN (1) CN114566170A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115831127A (en) * 2023-01-09 2023-03-21 浙江大学 Voiceprint reconstruction model construction method and device based on voice conversion and storage medium
CN115831127B (en) * 2023-01-09 2023-05-05 浙江大学 Voiceprint reconstruction model construction method and device based on voice conversion and storage medium
CN116153336A (en) * 2023-04-19 2023-05-23 北京中电慧声科技有限公司 Synthetic voice detection method based on multi-domain information fusion


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination