CN114495950A - Voice deception detection method based on deep residual shrinkage network - Google Patents
Voice deception detection method based on deep residual shrinkage network

- Publication number: CN114495950A
- Application number: CN202210347480.3A
- Authority: CN (China)
- Prior art keywords: features, voice, depth, residual shrinkage, network
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications

- G10L17/02—Speaker identification or verification techniques: Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
- G10L17/18—Speaker identification or verification techniques: Artificial neural networks; Connectionist approaches
- G10L17/20—Speaker identification or verification techniques: Pattern transformations or operations aimed at increasing system robustness, e.g. against channel noise or different working conditions
- G10L17/22—Speaker identification or verification techniques: Interactive procedures; Man-machine interfaces
- G10L25/18—Speech or voice analysis techniques characterised by the type of extracted parameters: spectral information of each sub-band
- G10L25/24—Speech or voice analysis techniques characterised by the type of extracted parameters: the cepstrum
- G10L25/30—Speech or voice analysis techniques characterised by the analysis technique: using neural networks
Abstract
The invention discloses a voice deception detection method based on a deep residual shrinkage network. The method preprocesses the speech to be detected and transforms the preprocessed speech data to obtain the corresponding constant-Q cepstral coefficient features, Mel-frequency cepstral coefficient features, and spectrogram features. A deep residual shrinkage network then processes the three features separately to obtain three corresponding deep features. Each deep feature is fed into a deep neural network classifier, which computes the detection score corresponding to that feature. Finally, the detection scores corresponding to the three deep features are fused to judge whether the speech to be detected is genuine. The method improves discriminative feature learning in complex acoustic environments and improves system generalization, giving it a wider range of application scenarios.
Description
Technical Field
The application belongs to the technical field of voice detection and deep learning, and particularly relates to a voice deception detection method based on a deep residual shrinkage network.
Background
In recent years, identity authentication based on biometrics has played an increasingly important role in data security and access control. Driven by advances in acquisition and sensing devices, automatic speaker verification has attracted wide attention and is applied to smart-device login, access control, online banking, and similar scenarios. However, various voice forgery technologies threaten the security of automatic speaker verification systems. Four types of forged-voice spoofing attacks have been identified: speech synthesis, voice conversion, impersonation, and replay, all of which can generate fake speech similar to that of a legitimate user. Logical access attacks, dominated by speech synthesis and voice conversion, are perceptually indistinguishable from genuine speech, so separating forged speech from genuine user speech becomes ever more challenging. A growing body of research has demonstrated that automatic speaker verification systems are severely vulnerable to a variety of malicious spoofing attacks.
To counter the threat of spoofing attacks, researchers have continually sought effective anti-spoofing methods. Current voice spoofing detection systems consist mainly of a front-end feature extraction part and a back-end classifier part. Unlike the acoustic features used in speaker verification and general speech processing, spoofing detection requires acoustic features developed specifically for it. After the acoustic features are extracted, a high-performing classifier distinguishes genuine from fake speech. Among traditional machine learning methods, the Gaussian mixture model (GMM) is the most classical classification model; it trains quickly, but its detection accuracy is limited. With the rise of deep learning, various deep neural networks able to learn complex nonlinear features have also been applied to voice spoofing detection. Convolutional neural networks (CNNs) have strong representation learning ability and are widely used for extracting audio features. Recurrent neural networks (RNNs) have memory, thanks to their recurrent units and gating structures, and therefore hold certain advantages in handling time-series problems.
Although the training performance of existing methods has improved, unknown types of attack are encountered in practical applications, and these attacks are usually distributed with statistics different from those of known attacks, creating a large performance gap between training and deployment. This indicates that the generalization ability of spoofing detection systems to unknown attacks still needs improvement. In addition, because noise, reverberation, and channel interference are common in real environments, the performance of many spoofing detection systems degrades sharply when facing a complex acoustic environment.
Disclosure of Invention
The purpose of the application is to provide a voice spoofing detection method based on a deep residual shrinkage network, aimed at voice spoofing detection in complex acoustic environments.
In order to achieve the purpose, the technical scheme of the application is as follows:
a voice spoofing detection method based on a deep residual shrinkage network comprises the following steps:
preprocessing the voice to be detected, and transforming the voice characteristic data after preprocessing to obtain corresponding constant Q cepstrum coefficient characteristics, Mel frequency cepstrum coefficient characteristics and spectrogram characteristics;
respectively processing the constant Q cepstrum coefficient characteristic, the Mel frequency cepstrum coefficient characteristic and the spectrogram characteristic by adopting a depth residual shrinkage network to obtain three corresponding depth characteristics;
inputting the three depth features into a deep neural network classifier respectively, and calculating to obtain detection scores corresponding to the three depth features;
and fusing the detection scores corresponding to the three depth characteristics, and judging whether the voice to be detected is real voice or not.
Further, the deep residual shrinkage network comprises residual shrinkage building units; each residual shrinkage building unit comprises a convolution module, an adaptive threshold learning module, and a soft threshold module. The adaptive threshold learning module learns a threshold from the output of the convolution module, and the soft threshold module processes the outputs of the convolution module and the adaptive threshold learning module to highlight highly discriminative sound information.
Further, the deep residual shrinkage network for processing the constant-Q cepstral coefficient features stacks 6 residual shrinkage building units, the network for processing the Mel-frequency cepstral coefficient features stacks 9 residual shrinkage building units, and the network for processing the spectrogram features stacks 6 residual shrinkage building units.
Further, the deep neural network classifier comprises a Dropout layer, a first hidden fully connected layer, a Leaky-ReLU activation function layer, a second hidden fully connected layer, and a LogSoftmax layer.
Further, the probability of randomly discarding weights in the Dropout layer is 50%.
Further, the detection scores corresponding to the three deep features are fused according to the formula:

Score_fuse = Σ_i w_i · s_i

where Score_fuse is the fused joint detection score, w_i is the fusion weight, and s_i is the detection score corresponding to the i-th deep feature.
Further, transforming the preprocessed speech data to obtain the corresponding constant-Q cepstral coefficient features, Mel-frequency cepstral coefficient features, and spectrogram features comprises:

applying a constant-Q transform to the preprocessed speech data, computing the power spectrum and taking its logarithm, performing uniform resampling, and finally obtaining the constant-Q cepstral coefficient features through a discrete cosine transform;

applying a short-time Fourier transform (STFT) to the preprocessed speech data, mapping the spectrum to a Mel spectrum through filtering, and finally obtaining the Mel-frequency cepstral coefficient features through a discrete cosine transform;

and applying a short-time Fourier transform to the preprocessed speech data, computing the magnitude of each component, and finally converting to a logarithmic scale to obtain the spectrogram features.
According to the voice deception detection method based on the deep residual shrinkage network, a deep residual shrinkage network is constructed whose residual shrinkage building units contain an adaptive threshold learning module based on a deep attention mechanism and a soft threshold module. Each speech signal thus determines its own threshold according to its acoustic environment, forcing unimportant features to zero, eliminating noise-related information, and learning more discriminative high-level features, which improves discriminative feature learning in complex acoustic environments. To address the poor generalization of existing detection methods, three different acoustic feature extraction algorithms, CQCC, MFCC, and Spectrogram, are used to characterize the speech more comprehensively; the features are fed to the network separately, a weight is generated for each model according to the output performance of its feature, and multi-feature joint detection is performed, improving system generalization and widening the range of application scenarios.
Drawings
FIG. 1 is a flow chart of a voice spoofing detection method based on a deep residual shrinkage network according to the present application;
FIG. 2 is a block diagram of the overall network framework of the present application;
FIG. 3 is a schematic diagram of feature extraction of the present application;
FIG. 4 is a schematic structural diagram of a residual shrinkage building unit according to the present application;
FIG. 5 is a structural diagram of the deep residual shrinkage networks of the present application;
fig. 6 is a schematic structural diagram of the DNN classifier of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
In one embodiment, as shown in fig. 1, a voice spoofing detection method based on a deep residual shrinkage network is provided, comprising:

Step S1, preprocessing the speech to be detected, and transforming the preprocessed speech data to obtain the corresponding constant-Q cepstral coefficient features, Mel-frequency cepstral coefficient features, and spectrogram features.
As shown in fig. 2, this step implements feature extraction. The speech to be detected is first divided into frames, a padding operation is applied to speech data with fewer than 64000 sample points, and the data is finally normalized, completing the preprocessing. Unlike video data, audio data has no inherent concept of a frame; the audio collected in this application arrives as continuous segments for transmission and storage. For a program to process it in batches, the audio is segmented by a specified length (a time period or a number of samples) and organized into a structured data form; this is framing. A speech signal is non-stationary macroscopically but stationary microscopically, i.e., it has short-time stationarity (the signal can be regarded as approximately unchanged within 10-30 ms), so it can be divided into several short segments for processing, each short segment being called a frame.
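For illustration, a minimal preprocessing sketch in Python/NumPy follows. The 64000-sample target length comes from the text above; the zero-padding mode, the normalization choice, and the 25 ms / 10 ms frame geometry are illustrative assumptions, not values fixed by the patent.

```python
import numpy as np

def preprocess(wave: np.ndarray, target_len: int = 64000) -> np.ndarray:
    # Pad clips shorter than 64000 sample points (zero-padding assumed;
    # the text only says a padding operation is applied)
    if len(wave) < target_len:
        wave = np.pad(wave, (0, target_len - len(wave)))
    # Data normalization (zero mean, unit variance assumed)
    return (wave - wave.mean()) / (wave.std() + 1e-8)

def frame_signal(wave: np.ndarray, frame_len: int = 400, hop: int = 160) -> np.ndarray:
    # Short-time frames of 25 ms with a 10 ms hop at 16 kHz, chosen inside the
    # 10-30 ms short-time stationarity window cited above; assumes wave was
    # already padded by preprocess()
    n = 1 + (len(wave) - frame_len) // hop
    return np.stack([wave[i * hop : i * hop + frame_len] for i in range(n)])
```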
Then referring to fig. 3, the preprocessed speech feature data is transformed, wherein:
constant Q cepstral coefficient feature (CQCC): constant Q transformation is carried out on the preprocessed voice characteristic data, then a power spectrum is calculated and logarithm is taken, then uniform resampling is carried out, and finally constant Q cepstrum series characteristics are obtained through discrete cosine transformation.
For example, constant Q transformation is performed on the preprocessed voice feature data at a sampling frequency of 16kHz, then a power spectrum is calculated and logarithmized, then uniform resampling is performed at a uniform resampling period of 16kHz, and finally a constant Q cepstrum coefficient feature is obtained through discrete cosine transformation.
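A sketch of this CQCC pipeline using librosa and SciPy follows; the number of CQT bins, the bins per octave, the resampling factor, and the number of retained coefficients are illustrative assumptions rather than values given in the patent.

```python
import numpy as np
import librosa
from scipy.fft import dct
from scipy.signal import resample

def cqcc(wave: np.ndarray, sr: int = 16000, n_coeffs: int = 30) -> np.ndarray:
    # Constant-Q transform (84 bins, 12 per octave: illustrative defaults)
    C = librosa.cqt(wave, sr=sr, n_bins=84, bins_per_octave=12)
    log_power = np.log(np.abs(C) ** 2 + 1e-10)        # power spectrum, then log
    # Uniform resampling of the geometrically spaced bins onto a linear scale
    linear = resample(log_power, 2 * log_power.shape[0], axis=0)
    # Discrete cosine transform yields the cepstral coefficients
    return dct(linear, type=2, axis=0, norm='ortho')[:n_coeffs]
```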
Mel-frequency cepstral coefficient features (MFCC): a short-time Fourier transform (STFT) is applied to the preprocessed speech data, the spectrum is mapped to a Mel spectrum through filtering, and finally the Mel-frequency cepstral coefficient features are obtained through a discrete cosine transform.

For example, the short-time Fourier transform is applied to the preprocessed speech data at a sampling frequency of 16 kHz, the spectrum is mapped to a Mel spectrum by a filter bank, and the MFCC features are finally obtained through a discrete cosine transform. The application selects the first 24 coefficients and concatenates the MFCCs with their first and second derivatives to produce the MFCC feature representation, a two-dimensional matrix.
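A corresponding sketch with librosa, matching the 24 coefficients and the first and second derivatives described above (frame parameters are left at library defaults):

```python
import numpy as np
import librosa

def mfcc_features(wave: np.ndarray, sr: int = 16000) -> np.ndarray:
    m = librosa.feature.mfcc(y=wave, sr=sr, n_mfcc=24)  # first 24 coefficients
    d1 = librosa.feature.delta(m)                       # first derivative
    d2 = librosa.feature.delta(m, order=2)              # second derivative
    return np.concatenate([m, d1, d2], axis=0)          # (72, n_frames) matrix
```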
Spectrogram features (Spectrogram): a short-time Fourier transform is applied to the preprocessed speech data, the magnitude of each component is computed, and finally the result is converted to a logarithmic scale to obtain the spectrogram features.

For example, a short-time Fourier transform is computed on the preprocessed speech data with a Hamming window (window size 2048, 25% overlap). The magnitude of each component is then computed and converted to a logarithmic scale; the output matrix captures the time-frequency characteristics of the input audio waveform, yielding the spectrogram features.
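A corresponding sketch: a 2048-point Hamming window with 25% overlap implies a hop length of 1536 samples.

```python
import numpy as np
import librosa

def log_spectrogram(wave: np.ndarray) -> np.ndarray:
    # Hamming window, size 2048, 25% overlap -> hop length 2048 * 0.75 = 1536
    S = librosa.stft(wave, n_fft=2048, hop_length=1536, window='hamming')
    magnitude = np.abs(S)                       # magnitude of each component
    return librosa.amplitude_to_db(magnitude)   # convert to logarithmic scale
```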
Step S2, processing the constant-Q cepstral coefficient features, the Mel-frequency cepstral coefficient features, and the spectrogram features respectively with the trained deep residual shrinkage network to obtain three corresponding deep features.

In this embodiment, a deep residual shrinkage network is designed and built, and the constant-Q cepstral coefficient features, the Mel-frequency cepstral coefficient features, and the spectrogram features obtained by the transformation in step S1 are processed to obtain deep features of the speech to be detected.
In a specific embodiment, the residual shrinkage building unit (RSBU) adopted by the deep residual shrinkage network (DRSN), shown in fig. 4, comprises a convolution module, an adaptive threshold learning module, and a soft threshold module. The adaptive threshold learning module learns a threshold from the output of the convolution module, and the soft threshold module processes the outputs of the convolution module and the adaptive threshold learning module to highlight highly discriminative sound information.
The adaptive threshold learning module comprises an absolute-value operation (Absolute), global average pooling (GAP), two fully connected layers (FC), a BN layer, a ReLU function, and a sigmoid activation function. The output of the global average pooling has size 1×1; both fully connected layers have 32 neurons, and the BN between them uses BatchNorm1d with its parameter set to 32. A scaling parameter is obtained after the fully connected layers, and a sigmoid function finally constrains it to the range (0, 1), which can be expressed as:

α_c = 1 / (1 + e^(-z_c))

where z_c is the feature of the neuron in the c-th channel and α_c is the corresponding scaling coefficient.

The scaling coefficient α_c is multiplied by the result of the global average pooling to obtain the positive threshold τ_c.
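A PyTorch sketch of the adaptive threshold learning module as described: the 32-unit layer sizes follow the text, while the (batch, channel, height, width) tensor layout is an assumption.

```python
import torch
import torch.nn as nn

class AdaptiveThreshold(nn.Module):
    """Learns one positive threshold per channel from |features| via GAP."""
    def __init__(self, channels: int = 32):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, 32),
            nn.BatchNorm1d(32),
            nn.ReLU(),
            nn.Linear(32, channels),
            nn.Sigmoid(),                        # scaling coefficient in (0, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gap = x.abs().mean(dim=(2, 3))           # GAP of the absolute features
        alpha = self.fc(gap)                     # alpha_c = sigmoid(z_c)
        tau = alpha * gap                        # tau_c = alpha_c * GAP(|x|)
        return tau.unsqueeze(-1).unsqueeze(-1)   # broadcastable over H and W
```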
The soft threshold module of this embodiment is inserted into the residual shrinkage building unit (RSBU) as a nonlinear transformation layer. It compares the speech feature data with the threshold, sets features close to zero to zero, and retains the useful positive and negative features, so that noise attenuation can be carried out flexibly according to the current audio conditions and highly discriminative sound information is highlighted.
The soft threshold function can be expressed as:

y = x - τ, if x > τ
y = 0, if -τ ≤ x ≤ τ
y = x + τ, if x < -τ

where x is the input feature, y is the output feature, and τ is the threshold; different channels have different thresholds τ_c.
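In code, the piecewise function above collapses to a single line; a minimal PyTorch sketch:

```python
import torch

def soft_threshold(x: torch.Tensor, tau: torch.Tensor) -> torch.Tensor:
    # Equivalent to the piecewise definition: zero inside [-tau, tau],
    # shrink everything else toward zero by tau while keeping its sign
    return torch.sign(x) * torch.clamp(x.abs() - tau, min=0.0)
```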
As shown in fig. 5, the deep residual shrinkage networks of this embodiment process the constant-Q cepstral coefficient (CQCC) features, the Mel-frequency cepstral coefficient (MFCC) features, and the spectrogram features respectively. The network for processing the CQCC features stacks 6 residual shrinkage building units, the network for processing the MFCC features stacks 9 residual shrinkage building units, and the network for processing the spectrogram features stacks 6 residual shrinkage building units.

Stacking multiple RSBUs, as in this embodiment, strengthens the ability of the successive nonlinear transformations to learn discriminative features, and using the soft threshold as the shrinkage function eliminates noise-related information. Feeding in the CQCC features, the MFCC features, and the spectrogram features respectively yields the three deep features.
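The pieces combine into a residual shrinkage building unit roughly as sketched below, following the general DRSN design of Zhao et al. (cited under Non-Patent Citations); the convolution layout and channel count are assumptions, and AdaptiveThreshold and soft_threshold are the sketches given above.

```python
import torch
import torch.nn as nn

class RSBU(nn.Module):
    def __init__(self, channels: int = 32):
        super().__init__()
        self.conv = nn.Sequential(
            nn.BatchNorm2d(channels), nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels), nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )
        self.threshold = AdaptiveThreshold(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f = self.conv(x)                     # convolution module
        tau = self.threshold(f)              # adaptive threshold learning module
        return x + soft_threshold(f, tau)    # shrunken output + identity shortcut

# Branch networks: 6 RSBUs for CQCC, 9 for MFCC, 6 for the spectrogram
cqcc_branch = nn.Sequential(*[RSBU() for _ in range(6)])
mfcc_branch = nn.Sequential(*[RSBU() for _ in range(9)])
spec_branch = nn.Sequential(*[RSBU() for _ in range(6)])
```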
Step S3, feeding the three deep features respectively into a deep neural network classifier, and computing the detection score corresponding to each deep feature.

The deep neural network classifier of this step, shown in fig. 6, generates the detection score of the speech to be detected.
Specifically, the three deep features are each passed into a deep neural network classifier (DNN classifier) comprising a Dropout layer, a first hidden fully connected layer, a Leaky-ReLU activation function layer, a second hidden fully connected layer, and a LogSoftmax layer.

The deep residual shrinkage network adjusts the number of elements in the last dimension and reshapes the deep feature, which is then fed into the first layer of the DNN classifier, a Dropout layer whose probability of randomly discarding weights is 50%. The first fully connected layer is configured differently for each feature: the parameters of the first hidden fully connected layer are (32, 128) for the CQCC-DRSN model, (480, 128) for the MFCC-DRSN model, and (160, 128) for the Spectrogram-DRSN model. The second hidden fully connected layer, preceded by a Leaky-ReLU activation with α = 0.01, has its parameters set to (128, 2), and its logit units produce the classification. A LogSoftmax layer then performs a softmax over all elements of each row and takes the logarithm, converting the logits into detection scores.
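The classifier layout described above can be sketched as follows, with in_dim equal to 32, 480, or 160 depending on the feature branch, as stated in the text:

```python
import torch.nn as nn

def dnn_classifier(in_dim: int) -> nn.Sequential:
    return nn.Sequential(
        nn.Dropout(p=0.5),                  # 50% random weight-drop probability
        nn.Linear(in_dim, 128),             # first hidden fully connected layer
        nn.LeakyReLU(negative_slope=0.01),  # Leaky-ReLU with alpha = 0.01
        nn.Linear(128, 2),                  # second hidden fully connected layer
        nn.LogSoftmax(dim=1),               # softmax over each row, then log
    )
```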
For convenience of expression, the network models formed by combining the deep residual shrinkage network (DRSN) with the DNN classifier are named, according to the three different features, the MFCC-DRSN model, the CQCC-DRSN model, and the Spectrogram-DRSN model.
Step S4, fusing the detection scores corresponding to the three deep features, and judging whether the speech to be detected is genuine.
The combined detection unit performs weighted fusion on detection scores generated by the MFCC-DRSN model, the CQCC-DRSN model and the Spectrogram-DRSN model to obtain a final voice deception detection result.
Considering that in a typical spoofing detection task the number of genuine utterances is much smaller than the number of forged ones, all models are trained by minimizing a weighted cross-entropy loss in which the weights assigned to genuine and forged speech are in a 9:1 ratio, alleviating the imbalance in the training data distribution. The model parameters with the best performance during training are applied to the evaluation dataset, and the detection score of each single-feature DRSN model is obtained after model inference.
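Because the classifier ends in a LogSoftmax layer, the weighted cross-entropy can be realized with NLLLoss; in the sketch below the 9:1 weight ratio follows the text, while the class ordering (genuine first) is an assumption.

```python
import torch
import torch.nn as nn

# Weighted cross-entropy: weights for [genuine, forged] in a 9:1 ratio
criterion = nn.NLLLoss(weight=torch.tensor([9.0, 1.0]))
# usage: loss = criterion(log_probs, labels), where log_probs is the
# LogSoftmax output of the DNN classifier above
```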
A logistic regression model is established with the detection scores of the three single-feature DRSN models as independent variables and the detection result as the dependent variable; its expression takes the standard logistic form

y = 1 / (1 + e^(-(w1·s1 + w2·s2 + w3·s3)))

where w1, w2, w3 represent the weight parameters and s1, s2, s3 the detection scores.
The regression constants obtained from the model are normalized to finally yield the weight of each model. Joint detection is realized by weighted fusion of the score files; the detection scores are fused in the logistic-regression fashion as

Score_fuse = Σ_i w_i · s_i

where Score_fuse is the fused joint detection score, w_i is the fusion weight, and s_i is the detection score corresponding to the i-th deep feature.
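A sketch of the fusion step with scikit-learn; fitting a LogisticRegression and normalizing its coefficients follows the description above, while sum-normalization is one assumed choice of normalization.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fusion_weights(scores: np.ndarray, labels: np.ndarray) -> np.ndarray:
    # scores: (n_utterances, 3) single-model detection scores; labels: 0/1
    lr = LogisticRegression().fit(scores, labels)
    w = lr.coef_.ravel()
    return w / w.sum()          # normalized regression coefficients as weights

def fuse(scores: np.ndarray, w: np.ndarray) -> np.ndarray:
    return scores @ w           # Score_fuse = sum_i w_i * s_i
```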
The joint detection score is analyzed to obtain a decision threshold: if the joint detection score is below the threshold, the speech is judged genuine; otherwise, it is judged spoofed.
The voice deception detection method based on the deep residual shrinkage network has the following advantages:
(1) A deep residual shrinkage network (DRSN) is constructed, and a residual shrinkage building unit (RSBU) is designed that comprises an adaptive threshold learning module based on a deep attention mechanism and a soft threshold module, so that each speech signal determines its own threshold according to its acoustic environment, forcing unimportant features to zero, eliminating noise-related information, and learning more discriminative high-level features, thereby improving discriminative feature learning in complex acoustic environments.
(2) Three different acoustic feature extraction algorithms, CQCC, MFCC, and Spectrogram, are used to characterize the speech more comprehensively. The features are fed to the network separately, a weight is generated for each model according to the output performance of its feature, and multi-feature joint detection is performed, thereby improving system generalization.
The above embodiments express only several implementations of the present application; their description is relatively specific and detailed, but it should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, all of which fall within the protection scope of the present application. Therefore, the protection scope of this patent shall be subject to the appended claims.
Claims (7)
1. A voice spoofing detection method based on a deep residual shrinkage network, characterized in that the method comprises:

preprocessing the speech to be detected, and transforming the preprocessed speech data to obtain the corresponding constant-Q cepstral coefficient features, Mel-frequency cepstral coefficient features, and spectrogram features;

processing the constant-Q cepstral coefficient features, the Mel-frequency cepstral coefficient features, and the spectrogram features respectively with a deep residual shrinkage network to obtain three corresponding deep features;

feeding the three deep features respectively into a deep neural network classifier, and computing the detection score corresponding to each deep feature;

and fusing the detection scores corresponding to the three deep features, and judging whether the speech to be detected is genuine.
2. The method according to claim 1, wherein the deep residual shrinkage network comprises residual shrinkage building units, each comprising a convolution module, an adaptive threshold learning module, and a soft threshold module; the adaptive threshold learning module learns a threshold from the output of the convolution module, and the soft threshold module processes the outputs of the convolution module and the adaptive threshold learning module to highlight highly discriminative sound information.
3. The method of claim 2, wherein the deep residual shrinkage network for processing the constant-Q cepstral coefficient features stacks 6 residual shrinkage building units, the network for processing the Mel-frequency cepstral coefficient features stacks 9 residual shrinkage building units, and the network for processing the spectrogram features stacks 6 residual shrinkage building units.
4. The method of claim 1, wherein the deep neural network classifier comprises a Dropout layer, a first hidden fully connected layer, a Leaky-ReLU activation function layer, a second hidden fully connected layer, and a LogSoftmax layer.
5. The method of claim 4, wherein the probability of randomly discarding weights in the Dropout layer is 50%.
7. The method according to claim 1, wherein transforming the preprocessed speech data to obtain the corresponding constant-Q cepstral coefficient features, Mel-frequency cepstral coefficient features, and spectrogram features comprises:

applying a constant-Q transform to the preprocessed speech data, computing the power spectrum and taking its logarithm, performing uniform resampling, and finally obtaining the constant-Q cepstral coefficient features through a discrete cosine transform;

applying a short-time Fourier transform (STFT) to the preprocessed speech data, mapping the spectrum to a Mel spectrum through filtering, and finally obtaining the Mel-frequency cepstral coefficient features through a discrete cosine transform;

and applying a short-time Fourier transform to the preprocessed speech data, computing the magnitude of each component, and finally converting to a logarithmic scale to obtain the spectrogram features.
Priority Applications (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202210347480.3A | 2022-04-01 | 2022-04-01 | Voice deception detection method based on deep residual shrinkage network |
Publications (1)

| Publication Number | Publication Date |
|---|---|
| CN114495950A | 2022-05-13 |

Family

ID=81488846

Family Applications (1)

| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202210347480.3A (pending) | Voice deception detection method based on deep residual shrinkage network | 2022-04-01 | 2022-04-01 |

Country Status (1)

| Country | Link |
|---|---|
| CN | CN114495950A |
Patent Citations (4)

| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20200322377A1 | 2019-04-08 | 2020-10-08 | Pindrop Security, Inc. | Systems and methods for end-to-end architectures for voice spoofing detection |
| CN110674677A | 2019-08-06 | 2020-01-10 | 厦门大学 | Multi-mode multi-layer fusion deep neural network for anti-spoofing of human face |
| US20210233541A1 | 2020-01-27 | 2021-07-29 | Pindrop Security, Inc. | Robust spoofing detection system using deep residual neural networks |
| CN113241079A | 2021-04-29 | 2021-08-10 | 江西师范大学 | Voice spoofing detection method based on residual error neural network |
Non-Patent Citations (1)

| Title |
|---|
| Minghang Zhao et al., "Deep Residual Shrinkage Networks for Fault Diagnosis", IEEE, 31 December 2019, pages 1-10 |
Cited By (8)

| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN115424625A | 2022-08-31 | 2022-12-02 | 江西师范大学 | Speaker confirmation deception detection method based on multi-path network |
| CN115414048A | 2022-08-31 | 2022-12-02 | 长沙理工大学 | Electrocardiosignal denoising method, denoising system, electrocardiosignal denoising device and storage medium |
| CN116153336A | 2023-04-19 | 2023-05-23 | 北京中电慧声科技有限公司 | Synthetic voice detection method based on multi-domain information fusion |
| CN116862530A | 2023-06-25 | 2023-10-10 | 江苏华泽微福科技发展有限公司 | Intelligent after-sale service method and system |
| CN116862530B | 2023-06-25 | 2024-04-05 | 江苏华泽微福科技发展有限公司 | Intelligent after-sale service method and system |
| CN117393000A | 2023-11-09 | 2024-01-12 | 南京邮电大学 | Synthetic voice detection method based on neural network and feature fusion |
| CN117393000B | 2023-11-09 | 2024-04-16 | 南京邮电大学 | Synthetic voice detection method based on neural network and feature fusion |
| CN118585889A | 2024-08-06 | 2024-09-03 | 杭州电子科技大学 | Ship type identification method and system based on ship radiation noise data |
Similar Documents

| Publication | Title |
|---|---|
| CN114495950A | Voice deception detection method based on deep residual shrinkage network |
| Yu et al. | Spoofing detection in automatic speaker verification systems using DNN classifiers and dynamic acoustic features |
| CN106952649A | Method for distinguishing speek person based on convolutional neural networks and spectrogram |
| JPH02238495A | Time series signal recognizing device |
| CN113488073B | Fake voice detection method and device based on multi-feature fusion |
| CN106991312B | Internet anti-fraud authentication method based on voiceprint recognition |
| CN113436646B | Camouflage voice detection method adopting combined features and random forest |
| CN111613240B | Camouflage voice detection method based on attention mechanism and Bi-LSTM |
| Hassan et al. | Voice spoofing countermeasure for synthetic speech detection |
| Wu et al. | Adversarial sample detection for speaker verification by neural vocoders |
| Li et al. | Multi-task learning of deep neural networks for joint automatic speaker verification and spoofing detection |
| CN114898773A | Synthetic speech detection method based on deep self-attention neural network classifier |
| Lu et al. | Detecting Unknown Speech Spoofing Algorithms with Nearest Neighbors |
| Nelus et al. | Privacy-Preserving Siamese Feature Extraction for Gender Recognition versus Speaker Identification |
| You et al. | Device feature extraction based on parallel neural network training for replay spoofing detection |
| CN116386648A | Cross-domain voice fake identifying method and system |
| CN110706712A | Recording playback detection method in home environment |
| Choudhary et al. | Automatic speaker verification using gammatone frequency cepstral coefficients |
| Dua et al. | Audio Deepfake Detection Using Data Augmented Graph Frequency Cepstral Coefficients |
| Babu et al. | Exploration of Bonafide and Spoofed Audio Classification Using Machine Learning Models |
| Moonasar et al. | A committee of neural networks for automatic speaker recognition (ASR) systems |
| CN116230012B | Two-stage abnormal sound detection method based on metadata comparison learning pre-training |
| CN117809694B | Fake voice detection method and system based on time sequence multi-scale feature representation learning |
| Shawkat | Evaluation of Human Voice Biometrics and Frog Bioacoustics Identification Systems Based on Feature Extraction Method and Classifiers |
| Sivaramakrishnan et al. | Classification of Deep Fake Audio Using MFCC Technique |
Legal Events

| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |