CN114566170A - Lightweight voice spoofing detection algorithm based on one-class classification - Google Patents

Lightweight voice spoofing detection algorithm based on one-class classification

Info

Publication number
CN114566170A
Authority
CN
China
Prior art keywords
voice
class
speech
model
classification
Prior art date
Legal status
Pending
Application number
CN202210193172.XA
Other languages
Chinese (zh)
Inventor
彭海朋
任叶青
李丽香
赵洁
薛晓鹏
赵猛猛
孟寅
暴爽
Current Assignee
Beijing University of Posts and Telecommunications
Original Assignee
Beijing University of Posts and Telecommunications
Priority date
Filing date
Publication date
Application filed by Beijing University of Posts and Telecommunications
Priority: CN202210193172.XA
Publication: CN114566170A
Legal status: Pending

Classifications

    • G10L 17/04 — Speaker identification or verification techniques: training, enrolment or model building
    • G06F 18/214 — Pattern recognition: generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06F 18/241 — Pattern recognition: classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G10L 17/14 — Speaker identification or verification: use of phonemic categorisation or speech recognition prior to speaker recognition or verification
    • G10L 25/51 — Speech or voice analysis techniques specially adapted for comparison or discrimination


Abstract

The invention discloses a lightweight voice spoofing detection algorithm based on one-class classification. A new loss function, DOC-Softmax, is designed for the distinct characteristics of genuine and spoofed speech: a dispersion loss is introduced into the spoofed-speech space of the one-class classification loss OC-Softmax to alleviate the mismatch in feature distribution between training and test data, thereby improving the accuracy and generalization ability of the voice spoofing detection model. At the same time, a knowledge distillation framework is used to make the detector lightweight, reducing the number of model parameters and making it easy to deploy on mobile or embedded devices. The resulting model generalizes better than a model with the same structure and training data trained with hard labels alone.

Description

Lightweight voice spoofing detection algorithm based on one-class classification
Technical Field
The invention relates to the technical field of voiceprint recognition, and in particular to a lightweight voice spoofing detection algorithm based on one-class classification.
Background
Voiceprint recognition is a biometric recognition technology that identifies a speaker from the speaker information carried in a speech signal. It is contactless, robust to occlusion, does not require the user's focused attention, and relies on the speaker's voluntary cooperation; it is widely applied in finance, social security, government and enterprise, the Internet of Things, and other scenarios.
However, real application environments contain many uncertainties, especially deliberate malicious spoofing attacks, which sharply degrade the performance of existing voiceprint recognition systems. Voice spoofing refers to "fishing" an automatic speaker verification system with illegitimate audio, produced by means such as recording, speech synthesis, and voice conversion, so that unauthenticated speech passes verification. Currently there are three main types of spoofing attack: (1) deliberate impersonation by another speaker; (2) realistic speech produced by speech synthesis or voice conversion techniques; and (3) replay or splicing of recordings made with high-fidelity recording equipment. Of these three, deliberate impersonation can generally be identified as inauthentic by mainstream voiceprint recognition systems.
However, improvements in recording-device quality and the rapid development of speech-processing techniques such as speech synthesis and voice conversion pose increasingly serious challenges to spoofing detection and the security of voiceprint recognition systems. Voice spoofing detection applies deep learning or machine learning: hand-crafted features or raw speech are fed into a model for learning, with the final aim of discriminating genuine from spoofed speech. The residual network, knowledge distillation, the Softmax loss function, and the AM-Softmax loss function are introduced next.
(1) Residual network
In 2015, He Kaiming et al. proposed the residual network (ResNet) to alleviate the vanishing-gradient problem caused by increasing network depth in deep neural networks; it has been widely applied in image classification, object detection, speech recognition, and other fields. Residual networks address the degradation problem by introducing a deep residual learning framework. The main idea is to strip away the identical main part and highlight small variations: instead of directly fitting an underlying mapping H(x) with a few stacked nonlinear layers, the layers fit a residual mapping F(x) := H(x) − x, so that the original mapping becomes F(x) + x; optimizing the residual mapping is easier than optimizing the original mapping. The residual learning structure can be realized as a feedforward neural network with shortcut connections, as shown in Fig. 1. The shortcut performs an identity mapping, introduces no additional parameters, adds no computational complexity, and the whole network can still be trained end to end by back-propagation.
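The identity-shortcut idea F(x) + x can be sketched in a few lines. Below is a minimal NumPy illustration; the two-layer form of F and the layer shapes are illustrative assumptions, not the patent's exact architecture:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def residual_block(x, w1, w2):
    # F(x) = H(x) - x is the residual mapping fitted by the stacked layers;
    # the shortcut adds x back, so the block outputs relu(F(x) + x).
    f = relu(x @ w1) @ w2
    return relu(f + x)

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
w1 = rng.standard_normal((8, 8)) * 0.1
w2 = rng.standard_normal((8, 8)) * 0.1
y = residual_block(x, w1, w2)

# The shortcut adds no parameters: with F == 0 (zero weights) the block
# reduces to the identity mapping followed by the activation.
y_identity = residual_block(x, np.zeros((8, 8)), np.zeros((8, 8)))
```

This makes concrete why the shortcut is "free": the addition introduces no trainable weights and no extra complexity.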
(2) Knowledge distillation
Deep learning has achieved remarkable performance in many areas of computer vision, speech recognition, and natural language processing. However, most deep learning models are too computationally expensive to run on mobile or embedded devices, so models need to be compressed, and knowledge distillation is one of the key techniques for model compression. Knowledge distillation was first proposed by Hinton et al. and mainly covers three distillation settings: first, model compression, distilling the knowledge of a complex model into a small model; second, cross-modal knowledge transfer, transferring knowledge from one modality to another; and third, ensemble distillation, distilling the knowledge of multiple models into a single model. The invention uses a knowledge distillation framework to compress a one-class-classification voice spoofing detection model. The core idea is to first train a complex network model and then train a smaller network using the outputs of the complex network together with the true data labels. A knowledge distillation framework therefore usually contains a complex model (the Teacher model) and a small model (the Student model). The complex model, generally a single complex network or an ensemble of several networks, has good performance and generalization ability, while the small model has limited expressive power because of its small scale. The framework uses the knowledge learned by the large model to guide the training of the small model: the soft targets predicted by the Teacher model assist the hard-target training, so that the Student model attains performance comparable to the Teacher model while the number of parameters is greatly reduced, achieving model compression and lowering inference latency.
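As a hedged illustration of the soft-target idea, the sketch below combines a cross-entropy against the teacher's temperature-softened outputs with an ordinary hard-label cross-entropy. The temperature T and weight w are illustrative choices only; the patent's own student objective, described later, uses an MSE between confidence scores instead:

```python
import numpy as np

def softmax(z, T=1.0):
    # Temperature-scaled softmax; larger T gives softer distributions.
    z = z / T
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, hard_labels, T=4.0, w=0.5):
    # Soft-target term: cross-entropy against the teacher's softened outputs.
    p_teacher = softmax(teacher_logits, T)
    p_student_soft = softmax(student_logits, T)
    soft_ce = -np.mean(np.sum(p_teacher * np.log(p_student_soft + 1e-12), axis=1))
    # Hard-target term: ordinary cross-entropy against the true labels.
    p_student = softmax(student_logits)
    n = len(hard_labels)
    hard_ce = -np.mean(np.log(p_student[np.arange(n), hard_labels] + 1e-12))
    return w * soft_ce + (1 - w) * hard_ce

rng = np.random.default_rng(1)
teacher_logits = rng.standard_normal((5, 2)) * 3
student_logits = rng.standard_normal((5, 2))
labels = np.array([0, 1, 0, 1, 0])
loss = distillation_loss(student_logits, teacher_logits, labels)
```

The soft targets carry the teacher's relative confidence between classes, which is the extra "dark knowledge" the hard labels lack.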
(3) Softmax loss function and AM-Softmax loss function
The original Softmax loss for binary classification is:

$$L_{S}=-\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{w_{y_i}^{\top}x_i}}{e^{w_{0}^{\top}x_i}+e^{w_{1}^{\top}x_i}}$$

where x_i ∈ R^D is the embedding vector, y_i ∈ {0, 1} is the label of the i-th sample (y_i = 0 means the sample belongs to the target class, y_i = 1 to the non-target class), w_0, w_1 ∈ R^D are the weight vectors, and N is the number of samples in a batch.
AM-Softmax improves on Softmax by introducing an angular margin, making the embedding distribution of the two classes more compact:

$$L_{AMS}=-\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{\alpha(\hat w_{y_i}^{\top}\hat x_i-m)}}{e^{\alpha(\hat w_{y_i}^{\top}\hat x_i-m)}+e^{\alpha \hat w_{1-y_i}^{\top}\hat x_i}}$$

where α is a scale factor, m is the additive cosine margin, and ŵ_0, ŵ_1, and x̂ are the normalized versions of w_0, w_1, and x, respectively.
For these two loss functions, the embedding vectors of the target and non-target classes tend to converge in two opposite directions, w_0 − w_1 and w_1 − w_0. For AM-Softmax, the target and non-target classes share the same compact margin, and the larger m is, the more compact the embeddings.
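A minimal sketch of the two binary losses just defined (NumPy; the α and m values and the toy embeddings are illustrative):

```python
import numpy as np

def softmax_loss(x, y, W):
    # Original binary Softmax loss; W holds the two weight vectors w0, w1.
    logits = x @ W.T
    logits = logits - logits.max(axis=1, keepdims=True)
    p = np.exp(logits)
    p = p / p.sum(axis=1, keepdims=True)
    n = len(y)
    return -np.mean(np.log(p[np.arange(n), y] + 1e-12))

def am_softmax_loss(x, y, W, alpha=20.0, m=0.35):
    # AM-Softmax: cosine similarities with an additive margin m subtracted
    # from the target logit, scaled by alpha.
    xn = x / np.linalg.norm(x, axis=1, keepdims=True)
    Wn = W / np.linalg.norm(W, axis=1, keepdims=True)
    cos = xn @ Wn.T
    n = len(y)
    target = cos[np.arange(n), y] - m
    other = cos[np.arange(n), 1 - y]
    # Per-sample loss: log(1 + exp(-alpha * (target - other))).
    return np.mean(np.log1p(np.exp(-alpha * (target - other))))

W = np.array([[1.0, 0.0], [-1.0, 0.0]])   # w0 (target), w1 (non-target)
x = np.array([[2.0, 0.1], [-2.0, -0.1]])  # well-separated embeddings
y = np.array([0, 1])
l_softmax = softmax_loss(x, y, W)
l_am = am_softmax_loss(x, y, W)
```

With well-separated embeddings both losses are near zero; flipping the labels makes the margin-based loss blow up, which shows how the margin enforces compactness.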
In voice spoofing detection it is reasonable to train a compact embedding space for genuine speech, but a compact space trained for spoofing attacks may overfit to known attacks. In summary, spoofing detection algorithms have achieved many results, but they generalize poorly to unknown spoofing attacks, because most methods cast spoofing detection as a binary classification problem between genuine and spoofed speech, implicitly assuming that the feature distributions of training and test data are the same or similar. While this assumption is reasonable for genuine speech, it does not hold as well for spoofed speech: with the development of voice spoofing techniques such as voice conversion and speech synthesis, the spoofing attacks in a training set may never catch up with the expanding distribution of real-world attacks. In addition, most existing voice spoofing detection algorithms have complex network structures, heavy computation, and low speed, making them difficult to port to mobile or embedded devices.
Disclosure of Invention
Aiming at the problem of mismatched feature distributions between training and test data, the invention provides a lightweight voice spoofing detection algorithm based on one-class classification, with the goals of improving the accuracy and generalization ability of voice spoofing detection and reducing model inference latency.
In order to achieve the above purpose, the invention provides the following technical scheme:
a lightweight voice deception detection algorithm based on class classification utilizes a knowledge distillation framework to learn a feature space through a class classification loss function DOC-Softmax based on dispersion loss, real voice is embedded with a compact boundary in the feature space, certain distance is reserved between deception voice and the real voice, and dispersion loss is introduced into the deception voice feature space to maximize the distance from each deception voice sample to the center of the deception voice sample, so that the deception voice covers the whole deception voice space.
Further, the total loss L_DOCS of the dispersion-loss-based one-class classification loss function DOC-Softmax combines the one-class classification loss L_OCS and the dispersion loss L_D with weight λ:

$$L_{DOCS}=L_{OCS}+\lambda L_{D}$$

where w_0 ∈ R^D is the weight vector and ŵ_0 is its normalized version, the optimal direction of genuine speech; α is a scale factor; and two margins m_0 and m_1 are introduced to bound the angle θ_i between ŵ_0 and x̂_i for genuine speech and spoofed speech respectively, with m_0, m_1 ∈ [−1, 1] and m_0 > m_1.
The one-class classification loss function OC-Softmax is:

$$L_{OCS}=\frac{1}{N}\sum_{i=1}^{N}\log\left(1+e^{\alpha\left(m_{y_i}-\hat w_{0}^{\top}\hat x_i\right)(-1)^{y_i}}\right)$$

The two margins m_0 and m_1 (m_0, m_1 ∈ [−1, 1], m_0 > m_1) bound the angle θ_i between ŵ_0 and x̂_i for the genuine and spoof classes respectively. When y_i = 0, m_0 forces θ_i to be smaller than arccos m_0; when y_i = 1, m_1 forces θ_i to be larger than arccos m_1. A small arccos m_0 clusters the target class around the weight vector w_0, while a relatively large arccos m_1 keeps the non-target class away from w_0.
The dispersion loss is introduced as follows:

$$\mu=\frac{1}{M}\sum_{i:\,y_i=1}\hat x_i$$

$$L_{D}=\frac{M}{\sum_{i:\,y_i=1}\lVert\hat x_i-\mu\rVert_{2}+\epsilon}$$

where x_i ∈ R^D is the embedding vector, x̂_i is its normalized version, y_i ∈ {0, 1} is the label of the i-th sample (y_i = 0: genuine speech, y_i = 1: spoofed speech), N is the number of samples in a batch, M is the number of spoofed samples in a batch, ε is a small constant that avoids a zero denominator, and μ is the center of the spoofed samples in each batch. Minimizing the dispersion loss L_D maximizes the distance of the spoofed samples x̂_i from their center μ, so that spoofed speech covers as much of the spoofing region as possible.
Further, α = 20, m_0 = 0.9, m_1 = 0.2.
Further, when the speech sample is genuine speech, i.e. y_i = 0, m_0 forces θ_i to be smaller than arccos m_0, and a small arccos m_0 clusters genuine speech near the weight vector w_0; when the sample is spoofed speech, i.e. y_i = 1, m_1 forces θ_i to be larger than arccos m_1, and a relatively large arccos m_1 keeps spoofed speech away from w_0.
Further, the teacher model employs a network structure based on the deep residual network ResNet-18 and uses attention pooling instead of global average pooling.
Further, the teacher model takes the extracted LFCC features as input and uses the output of the fully connected layer as the embedding of the input speech.
Further, the embedding is fed into the DOC-Softmax loss function to obtain a confidence score s_i; the confidence score represents the probability that the input speech is genuine or spoofed, from which the classification result of the input speech is obtained.
Furthermore, the model architecture of the student model is largely the same as that of the teacher model, except that 3 residual modules are removed, and the soft labels predicted by the teacher model assist the hard labels in training the student model. The loss function of the student model has two parts: the dispersion-loss-based one-class classification loss L_DOCS, and the mean-square-error loss L_MSE between the student model's confidence score s_i^S and the teacher model's confidence score s_i^T:

$$L_{MSE}=\frac{1}{N}\sum_{i=1}^{N}\left(s_i^{S}-s_i^{T}\right)^{2}$$

The total loss of the student model combines L_DOCS and L_MSE with weight β:

$$L_{Student}=L_{DOCS}+\beta L_{MSE}$$
compared with the prior art, the invention has the beneficial effects that:
the lightweight voice deception detection algorithm based on one class of classification designs a new loss function DOC-Softmax aiming at characteristics of real voice and deceptive voice, namely, a dispersion loss function is introduced into a deceptive voice space of one class of classification loss function OC-Softmax to relieve the problem of unmatched feature distribution between training data and test data, so that the accuracy and the generalization capability of a voice deception detection model are improved. Meanwhile, the voice deception detection algorithm is designed into a lightweight voice deception detection algorithm by utilizing a knowledge distillation framework, so that the parameter quantity of the model is reduced, and the model is convenient to deploy to a mobile terminal or embedded equipment. The model has better generalization capability than a model obtained by using the same model structure and training data and only using a hard label training method.
Drawings
To illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings needed in the embodiments are briefly described below. Obviously, the drawings described below are only some embodiments of the invention, and those skilled in the art can derive other drawings from them.
Fig. 1 is a schematic diagram of a residual structure.
Fig. 2 is a diagram of a model architecture according to an embodiment of the present invention.
Fig. 3 is a schematic diagram of a residual error module according to an embodiment of the present invention.
FIG. 4 is a comparison of four loss functions of Softmax, AM-Softmax, OC-Softmax, and DOC-Softmax, wherein FIG. 4(a) is Softmax, FIG. 4(b) is AM-Softmax, FIG. 4(c) is OC-Softmax, and FIG. 4(d) is DOC-Softmax.
Detailed Description
The invention designs a lightweight voice spoofing detection algorithm based on one-class classification that alleviates the mismatch in feature distribution between training and test data. In one-class classification methods, the target class does not suffer from this distribution mismatch, while for the non-target class the samples in the training set are either absent or not statistically representative. The key idea of one-class classification is to capture the target-class distribution and place a tight classification boundary around it, so that all non-target data fall outside the boundary.
A one-class classification loss function, OC-Softmax, is defined as follows:

$$L_{OCS}=\frac{1}{N}\sum_{i=1}^{N}\log\left(1+e^{\alpha\left(m_{y_i}-\hat w_{0}^{\top}\hat x_i\right)(-1)^{y_i}}\right)\qquad(3)$$

This loss function has only one weight vector, w_0, which is the optimal direction for target-class embeddings; in this formula ŵ_0 and x̂_i denote the normalized w_0 and x_i. Two margins m_0 and m_1 (m_0, m_1 ∈ [−1, 1], m_0 > m_1) bound the angle θ_i between ŵ_0 and x̂_i for the genuine and spoof classes respectively. When y_i = 0, m_0 forces θ_i to be smaller than arccos m_0; when y_i = 1, m_1 forces θ_i to be larger than arccos m_1. A small arccos m_0 clusters the target class around the weight vector w_0, while a relatively large arccos m_1 keeps the non-target class away from w_0.
However, as can be seen from equation (3), when the target class tightly surrounds the weight vector w_0 and the non-target class tightly surrounds the opposite direction, denoted w_0′, the one-class classification loss L_OCS reaches its minimum; the final optimization direction for the non-target class is therefore still to cluster tightly around w_0′. Once the OC-Softmax function is optimized well enough, this loss is essentially no different from the AM-Softmax function.
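Under the definitions in equation (3), OC-Softmax can be sketched as follows (NumPy; the toy embeddings and weight vector are illustrative):

```python
import numpy as np

def oc_softmax_loss(x, y, w0, alpha=20.0, m0=0.9, m1=0.2):
    # Single weight vector w0; cos(theta_i) between normalized w0 and x_i.
    xn = x / np.linalg.norm(x, axis=1, keepdims=True)
    wn = w0 / np.linalg.norm(w0)
    cos = xn @ wn
    m_y = np.where(y == 0, m0, m1)       # margin depends on the label
    sign = np.where(y == 0, 1.0, -1.0)   # (-1)^{y_i}
    return np.mean(np.log1p(np.exp(alpha * (m_y - cos) * sign)))

w0 = np.array([1.0, 0.0])
x = np.array([[0.9, 0.1],     # genuine: small angle to w0
              [-0.8, 0.6]])   # spoofed: large angle to w0
y = np.array([0, 1])
loss_correct = oc_softmax_loss(x, y, w0)
loss_swapped = oc_softmax_loss(x, 1 - y, w0)  # labels flipped: much larger
```

The sign factor (−1)^{y_i} is what pulls genuine samples inside arccos m_0 while pushing spoofed samples outside arccos m_1.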
The invention aims to improve the accuracy and generalization ability of voice spoofing detection and to reduce model inference latency, and proposes a lightweight voice spoofing detection algorithm based on one-class classification. Specifically, a new loss function, DOC-Softmax, is designed for the distinct characteristics of genuine and spoofed speech: a dispersion loss is introduced into the spoofed-speech space of the one-class classification loss OC-Softmax to alleviate the mismatch in feature distribution between training and test data, thereby improving the accuracy and generalization ability of the voice spoofing detection model. At the same time, a knowledge distillation framework makes the detector lightweight, reducing the number of model parameters and making it easy to deploy on mobile or embedded devices.
The invention provides a lightweight voice spoofing detection algorithm based on one-class classification; it is a single-feature, single-system detection algorithm.
The method comprises two main parts: the design of a one-class classification loss function based on dispersion loss, and the design of a lightweight voice spoofing detection model based on knowledge distillation. A dispersion-loss-based one-class classification loss is used to learn a feature space in which genuine speech is embedded within a tight boundary and spoofed speech is kept at a distance from genuine speech. At the same time, the dispersion loss maximizes the distance of each spoofed sample from the center of the spoofed samples, so that spoofed speech covers the whole spoofed-speech space.
In an embodiment of the invention, 60-dimensional linear frequency cepstral coefficient (LFCC) features are extracted from each utterance, and a network structure based on the deep residual network ResNet-18 is adopted, with attention pooling replacing global average pooling. The network takes the extracted LFCC features as input, and the output confidence score represents the classification result; the performance of the spoofing detector is improved without using any data augmentation, feature fusion, or model ensembling. In addition, based on a knowledge distillation framework, a lightweight voice spoofing detection model, StudentNet, is designed from this network (TeacherNet), greatly reducing the model parameters (by a factor of about 30) and making the model easy to deploy on mobile or embedded devices.
1. Design of a one-class classification loss function based on dispersion loss
The comparison of four loss functions of Softmax, AM-Softmax, OC-Softmax and DOC-Softmax is shown in FIG. 4, wherein FIG. 4(a) is Softmax, FIG. 4(b) is AM-Softmax, FIG. 4(c) is OC-Softmax and FIG. 4(d) is DOC-Softmax.
For the two loss functions Softmax and AM-Softmax, the embedding vectors of the target and non-target classes tend to converge in two opposite directions, w_0 − w_1 and w_1 − w_0, as shown in Fig. 4(a-b). In AM-Softmax the target and non-target classes share the same compact margin, and the larger m is, the more compact the embeddings. In spoofing detection it is reasonable to train a compact embedding space for genuine speech, but training an equally compact space for spoofing attacks risks overfitting to known attacks. The one-class classification loss OC-Softmax handles this well by introducing two different margins that compact genuine speech and isolate spoofed speech; however, as equation (3) shows, L_OCS is minimized when the target class tightly surrounds the weight vector w_0 and the non-target class tightly surrounds the opposite direction w_0′. The final optimization direction of the non-target class is therefore still to cluster around w_0′, as shown in Fig. 4(c), and once OC-Softmax is optimized well enough this loss is essentially no different from AM-Softmax. We therefore introduce a dispersion loss for the non-target-class samples, as follows:
$$\mu=\frac{1}{M}\sum_{i:\,y_i=1}\hat x_i$$

$$L_{D}=\frac{M}{\sum_{i:\,y_i=1}\lVert\hat x_i-\mu\rVert_{2}+\epsilon}$$

where x_i ∈ R^D is the embedding vector, x̂_i is its normalized version, y_i ∈ {0, 1} is the label of the i-th sample (y_i = 0: genuine speech, y_i = 1: spoofed speech), N is the number of samples in a batch, M is the number of spoofed samples in a batch, ε is a very small constant that avoids a zero denominator, and μ is the center of the spoofed samples in each batch. Minimizing the dispersion loss L_D maximizes the distance of the spoofed samples x̂_i from their center μ, so that spoofed speech covers as much of the spoofing region as possible, i.e. as much as possible of the region where θ_i is larger than arccos m_1, as shown in Fig. 4(d); this increases the probability that spoofed speech falls into the spoofing region and improves the generalization performance of the model. Thus, the total loss L_DOCS combines the one-class classification loss L_OCS and the dispersion loss L_D with weight λ:

$$L_{DOCS}=L_{OCS}+\lambda L_{D}$$
where w_0 ∈ R^D is the weight vector and ŵ_0 is its normalized version, the optimal direction of genuine speech. α is a scale factor, with α = 20 in the present invention. Two margins m_0 and m_1 (m_0, m_1 ∈ [−1, 1], m_0 > m_1) are introduced to bound the angle θ_i between ŵ_0 and x̂_i for genuine and spoofed speech respectively; in the present invention m_0 = 0.9 and m_1 = 0.2. When the speech sample is genuine, i.e. y_i = 0, m_0 forces θ_i to be smaller than arccos m_0, and a small arccos m_0 clusters genuine speech near the weight vector w_0; when the sample is spoofed, i.e. y_i = 1, m_1 forces θ_i to be larger than arccos m_1, and a relatively large arccos m_1 keeps spoofed speech away from w_0.
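A sketch of the full DOC-Softmax computation, reading the dispersion term as the reciprocal of the total distance to the batch center of spoofed samples (the λ value and the toy batch below are illustrative assumptions):

```python
import numpy as np

def normalize(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def oc_softmax_loss(x, y, w0, alpha=20.0, m0=0.9, m1=0.2):
    cos = normalize(x) @ (w0 / np.linalg.norm(w0))
    m_y = np.where(y == 0, m0, m1)
    sign = np.where(y == 0, 1.0, -1.0)
    return np.mean(np.log1p(np.exp(alpha * (m_y - cos) * sign)))

def dispersion_loss(x, y, eps=1e-8):
    # L_D = M / (sum_i ||x_hat_i - mu|| + eps): minimizing it maximizes
    # the spread of the spoofed embeddings around their batch center mu.
    spoof = normalize(x)[y == 1]
    mu = spoof.mean(axis=0)
    dist = np.linalg.norm(spoof - mu, axis=1).sum()
    return len(spoof) / (dist + eps)

def doc_softmax_loss(x, y, w0, lam=0.1):
    # Total loss L_DOCS = L_OCS + lambda * L_D (lam is illustrative).
    return oc_softmax_loss(x, y, w0) + lam * dispersion_loss(x, y)

w0 = np.array([1.0, 0.0])
# One genuine sample plus spoofed samples, spread vs. clustered:
spread = np.array([[0.9, 0.1], [0.0, 1.0], [0.0, -1.0], [-1.0, 0.0]])
clustered = np.array([[0.9, 0.1], [-1.0, 0.01], [-1.0, -0.01], [-1.0, 0.02]])
y = np.array([0, 1, 1, 1])
ld_spread = dispersion_loss(spread, y)
ld_clustered = dispersion_loss(clustered, y)
total = doc_softmax_loss(spread, y, w0)
```

Spoofed embeddings clustered in one direction yield a much larger L_D than spread-out ones, which is exactly the pressure that keeps the non-target class from collapsing onto w_0′.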
2. Design of a lightweight voice spoofing detection model based on knowledge distillation
The overall architecture of the invention is shown in Fig. 2, with the residual module shown in Fig. 3. The teacher model, TeacherNet, is designed on the basis of the deep residual network ResNet-18, with an attentive temporal pooling layer replacing the global average pooling layer. The input of the teacher model is the extracted 60-dimensional linear frequency cepstral coefficient (LFCC) features, and the output of the fully connected layer is used as the embedding of the input speech, with dimension 256. The embedding is fed into the DOC-Softmax loss function to compute a confidence score s_i; the confidence score represents the probability that the input speech is genuine or spoofed, from which the classification result of the input speech is obtained.
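The attentive pooling swap can be sketched as a learned weighted average over frames. This is a minimal single-head sketch; the scoring parameterization v is an assumption, not the patent's exact layer:

```python
import numpy as np

def attentive_temporal_pooling(H, v):
    # H: (T, D) frame-level features; v: (D,) attention parameter.
    # Scores one value per frame, softmax-normalizes the scores, and
    # returns the attention-weighted average over time. Global average
    # pooling is the special case of uniform weights.
    scores = H @ v
    scores = scores - scores.max()
    a = np.exp(scores)
    a = a / a.sum()
    return a @ H

rng = np.random.default_rng(2)
H = rng.standard_normal((10, 4))   # 10 frames, 4-dim features
v = rng.standard_normal(4)
emb = attentive_temporal_pooling(H, v)
```

With v = 0 every frame gets the same weight and the layer degenerates to global average pooling, which shows it strictly generalizes the layer it replaces.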
The architecture of the student model, StudentNet, is largely the same as that of the teacher model, except that 3 residual modules are removed. The soft labels s_i^T predicted by the teacher model TeacherNet assist the hard labels in training StudentNet; the soft labels carry a great deal of information from TeacherNet's inductive reasoning. The loss function of the student model has two parts: the dispersion-loss-based one-class classification loss L_DOCS, and the mean-square-error loss L_MSE between the student model's confidence score s_i^S and the teacher model's confidence score s_i^T:

$$L_{MSE}=\frac{1}{N}\sum_{i=1}^{N}\left(s_i^{S}-s_i^{T}\right)^{2}$$

The total loss of the student model combines L_DOCS and L_MSE with weight β:

$$L_{Student}=L_{DOCS}+\beta L_{MSE}$$
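The student objective reduces to a few lines (the β value and toy scores are illustrative; the precomputed L_DOCS value is passed in):

```python
import numpy as np

def mse_score_loss(s_student, s_teacher):
    # L_MSE between the student's and teacher's confidence scores.
    return np.mean((s_student - s_teacher) ** 2)

def student_total_loss(l_docs, s_student, s_teacher, beta=1.0):
    # Total student loss: L_DOCS + beta * L_MSE.
    return l_docs + beta * mse_score_loss(s_student, s_teacher)

s_teacher = np.array([0.95, 0.10, 0.88])  # teacher confidence scores (soft labels)
s_student = np.array([0.90, 0.20, 0.80])
loss = student_total_loss(0.5, s_student, s_teacher, beta=1.0)
```

Matching the teacher's per-utterance scores, rather than only the hard labels, is what transfers the teacher's learned decision behavior to the much smaller student.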
the student Net trained based on the knowledge distillation framework has better generalization capability than the model obtained by using the identical model structure, training data and only using the hard label training method. The model parameter of the StudentNet is about 402K, which is reduced by about 30 times compared with the parameter quantity of the TeacherNet, and the model size of the StudentNet is 1590KB, which is convenient for the deployment to a mobile terminal or an embedded device. The voice deception detection algorithm provided by the invention is a single-system and single-feature algorithm, and under the condition of not using any data enhancement, feature fusion and model integration methods, the accuracy of the algorithm is improved, and the reasoning time delay of the model is greatly reduced.
The above embodiments are intended only to illustrate the technical solution of the invention, not to limit it. Although the invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that modifications may be made to the technical solutions described in the foregoing embodiments, or equivalents substituted for some of their technical features, without departing from the spirit and scope of the technical solutions of the embodiments of the invention.

Claims (8)

1. A lightweight voice spoofing detection algorithm based on one-class classification, characterized in that a knowledge distillation framework is utilized, and a feature space is learned through a dispersion-loss-based one-class classification loss function DOC-Softmax, so that genuine speech is embedded in the feature space with a compact boundary between spoofed speech and genuine speech, and a dispersion loss is introduced in the spoofed-speech feature space to maximize the distance of each spoofed-speech sample from the spoofed-sample center, so that spoofed speech covers the whole spoofing space.
2. The one-class-classification-based lightweight voice spoofing detection algorithm of claim 1, wherein the total loss L_DOCS of the dispersion-loss-based one-class classification loss function DOC-Softmax is the weighted sum of the one-class classification loss L_OCS and the dispersion loss L_D with weight λ, with the specific formula:

L_DOCS = L_OCS + λ · L_D

wherein w_0 is the weight vector, ŵ_0 is the normalization of w_0, α is a scaling factor, and two margins m_0 and m_1, with m_0, m_1 ∈ [−1, 1] and m_0 > m_1, are introduced to bound the angles θ_i between the genuine-speech weight vector ŵ_0 and the embeddings x̂_i of genuine speech and spoofed speech, respectively.
The formula of the one-class classification loss function OC-Softmax is as follows:

L_OCS = (1/N) Σ_{i=1}^{N} log(1 + exp(α (m_{y_i} − ŵ_0 x̂_i)(−1)^{y_i}))

The two margins m_0 and m_1 are introduced to bound the angle θ_i between ŵ_0 and x̂_i for the genuine class and the spoofed class, respectively: when y_i = 0, m_0 forces θ_i to be smaller than arccos m_0; when y_i = 1, m_1 forces θ_i to be larger than arccos m_1. A small arccos m_0 clusters the target class around the weight vector w_0, while a relatively large arccos m_1 keeps the non-target class away from w_0.
The dispersion loss is introduced with the following formulas:

μ = (1/M) Σ_{i: y_i = 1} x̂_i

L_D = 1 / ( (1/M) Σ_{i: y_i = 1} ‖x̂_i − μ‖₂ + ε )

wherein x_i is the embedding vector, x̂_i is the normalization of x_i, y_i ∈ {0, 1} is the label of the ith sample, with y_i = 0 meaning the sample is genuine speech and y_i = 1 meaning the sample is spoofed speech, N is the number of samples in a batch, M is the number of spoofed samples in the batch, ε is a small constant that avoids a zero denominator, and μ is the center of the spoofed samples in each batch. The dispersion loss L_D maximizes the distances of the spoofed-speech embeddings x̂_i from their center μ, so that spoofed speech covers the spoofing region as much as possible.
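A minimal NumPy sketch of the DOC-Softmax loss of this claim, combining OC-Softmax with the dispersion loss, follows for illustration. The function names, the reciprocal form of L_D with an ε-guarded denominator, and the default λ are assumptions consistent with the formulas in the claim:

```python
import numpy as np

def oc_softmax_loss(w0, x, y, alpha=20.0, m0=0.9, m1=0.2):
    """OC-Softmax: pull genuine embeddings (y=0) within arccos(m0) of w0,
    push spoofed embeddings (y=1) beyond arccos(m1)."""
    w0_hat = w0 / np.linalg.norm(w0)
    x_hat = x / np.linalg.norm(x, axis=1, keepdims=True)
    scores = x_hat @ w0_hat                      # cos(theta_i)
    margins = np.where(y == 0, m0, m1)           # m_{y_i}
    signs = np.where(y == 0, 1.0, -1.0)          # (-1)^{y_i}
    return float(np.mean(np.log1p(np.exp(alpha * (margins - scores) * signs))))

def dispersion_loss(x, y, eps=1e-8):
    """Reciprocal mean distance of spoofed embeddings from their batch
    center mu; minimizing this maximizes their spread (assumed form)."""
    x_hat = x / np.linalg.norm(x, axis=1, keepdims=True)
    spoof = x_hat[y == 1]
    mu = spoof.mean(axis=0)                      # spoofed-sample center
    return float(1.0 / (np.mean(np.linalg.norm(spoof - mu, axis=1)) + eps))

def doc_softmax_loss(w0, x, y, lam=0.1):
    """Total DOC-Softmax loss L_DOCS = L_OCS + lambda * L_D (lam assumed)."""
    return oc_softmax_loss(w0, x, y) + lam * dispersion_loss(x, y)
```

With the claimed margins, a genuine embedding aligned with w_0 incurs a much smaller OC-Softmax penalty than one pointing away from it, which is the intended one-class geometry.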
3. The one-class-classification-based lightweight voice spoofing detection algorithm of claim 2, wherein α = 20, m_0 = 0.9, m_1 = 0.2.
4. The one-class-classification-based lightweight voice spoofing detection algorithm of claim 2, wherein when the speech sample is genuine speech, i.e. y_i = 0, m_0 forces θ_i to be smaller than arccos m_0, and a small arccos m_0 clusters genuine speech near the weight vector w_0; when the speech sample is spoofed speech, i.e. y_i = 1, m_1 forces θ_i to be larger than arccos m_1, and a relatively large arccos m_1 keeps spoofed speech away from the weight vector w_0.
5. The one-class-classification-based lightweight voice spoofing detection algorithm of claim 1, wherein the teacher model employs a network structure based on the deep residual network ResNet-18 and uses attention pooling in place of global average pooling.
6. The one-class-classification-based lightweight voice spoofing detection algorithm of claim 1, wherein the teacher model takes the extracted LFCC features as input and uses the output of the fully connected layer as the embedding of the input speech.
7. The one-class-classification-based lightweight voice spoofing detection algorithm of claim 1, wherein the DOC-Softmax loss function is applied to the embedding to compute a confidence score s_i = ŵ_0 x̂_i; the confidence score represents the probability that the input speech is genuine speech or spoofed speech, from which the classification result of the input speech is obtained.
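As an illustrative sketch of this scoring step, the confidence score is the cosine similarity between the embedding and the genuine-speech weight vector; the decision threshold below is an assumption, not specified by the claim:

```python
import numpy as np

def confidence_score(w0, x):
    """Cosine similarity between the normalized genuine-speech weight
    vector w0 and the normalized embedding x of the input utterance."""
    w0_hat = w0 / np.linalg.norm(w0)
    x_hat = x / np.linalg.norm(x)
    return float(w0_hat @ x_hat)

def classify(score, threshold=0.5):
    """Label the utterance genuine when its score clears the threshold
    (the threshold value is a hypothetical choice for illustration)."""
    return "genuine" if score >= threshold else "spoofed"
```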
8. The one-class-classification-based lightweight voice spoofing detection algorithm of claim 1, wherein the model structure of the student model is substantially the same as that of the teacher model except that 3 residual modules are removed, and the student model is trained with the soft labels predicted by the teacher model assisting the hard labels, wherein the loss function of the student model comprises two parts: one is the dispersion-loss-based one-class classification loss L_DOCS, and the other is the mean square error loss L_MSE between the confidence score s_i^S output by the student model and the confidence score s_i^T output by the teacher model, with the specific formula:

L_MSE = (1/N) Σ_{i=1}^{N} (s_i^S − s_i^T)²

The total loss function of the student model minimizes the weighted sum of L_DOCS and L_MSE with weight β, with the specific formula:

L = L_DOCS + β · L_MSE
CN202210193172.XA 2022-03-01 2022-03-01 Lightweight voice spoofing detection algorithm based on class-one classification Pending CN114566170A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210193172.XA CN114566170A (en) 2022-03-01 2022-03-01 Lightweight voice spoofing detection algorithm based on class-one classification

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210193172.XA CN114566170A (en) 2022-03-01 2022-03-01 Lightweight voice spoofing detection algorithm based on class-one classification

Publications (1)

Publication Number Publication Date
CN114566170A true CN114566170A (en) 2022-05-31

Family

ID=81716280

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210193172.XA Pending CN114566170A (en) 2022-03-01 2022-03-01 Lightweight voice spoofing detection algorithm based on class-one classification

Country Status (1)

Country Link
CN (1) CN114566170A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115831127A (en) * 2023-01-09 2023-03-21 浙江大学 Voiceprint reconstruction model construction method and device based on voice conversion and storage medium
CN115831127B (en) * 2023-01-09 2023-05-05 浙江大学 Voiceprint reconstruction model construction method and device based on voice conversion and storage medium
CN116153336A (en) * 2023-04-19 2023-05-23 北京中电慧声科技有限公司 Synthetic voice detection method based on multi-domain information fusion


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination