Disclosure of Invention
In view of the deficiencies of the prior art, the technical problem to be solved by the present invention is to provide a playback voice detection method that improves the detection performance and robustness of playback voice detection algorithms.
The technical solution adopted by the present invention to solve the above technical problem is a playback voice detection method, characterized in that the method comprises the following steps:
1) a training stage:
1.1) inputting training voice samples, wherein the training voice samples comprise original voice and playback voice;
1.2) extracting cepstrum characteristics of a training voice sample;
1.3) training a residual network model on the extracted features to obtain network model parameters;
2) a testing stage:
2.1) inputting a test voice sample;
2.2) extracting cepstrum characteristics of the test voice sample;
2.3) using the residual network trained in step 1) to classify and score the features extracted from the test voice sample;
2.4) judging whether the test voice sample is playback voice.
Preferably, in order to retain more of the detailed information in the spectrum, full-frequency cepstral coefficient features are extracted in step 1.2) and step 2.2).
Further, the extraction method of the full-frequency cepstral coefficient features comprises the following steps: 1) framing and windowing the voice signal of the training voice sample or the test voice sample, and then performing a Fourier transform on each framed voice signal to obtain the spectral coefficients X_i(k) of the voice signal,
where i denotes the i-th frame after framing, k denotes the frequency bin within the i-th frame, k = 0, 1, 2, ..., N-1, j denotes the imaginary unit, m denotes the number of frames after the voice signal is framed, and N denotes the number of Fourier transform points;
2) the absolute value is then taken to obtain the corresponding magnitude spectral coefficients E_i(k);
3) a logarithm operation and a DCT transform are then applied to obtain the full-frequency cepstral coefficients BFCC(i) of the i-th frame.
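In conventional notation (using n for the sample index within frame i, and assuming the standard N-point DFT and a type-II DCT, since only the operations themselves are named above), these three steps correspond to:

```latex
\begin{aligned}
X_i(k) &= \sum_{n=0}^{N-1} x_i(n)\, e^{-j 2\pi k n / N}, \qquad k = 0, 1, \ldots, N-1, \\
E_i(k) &= \lvert X_i(k) \rvert, \\
\mathrm{BFCC}(i) &= \mathrm{DCT}\bigl(\log E_i(k)\bigr).
\end{aligned}
```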
Further, in order to enable training models based on different features to complement one another and obtain a better fused result, Mel-frequency cepstral coefficient features and constant-Q cepstral coefficient features are also extracted in step 1.2) and step 2.2). In the training stage, three residual networks are obtained, trained respectively on the full-frequency cepstral coefficient features, the Mel-frequency cepstral coefficient features and the constant-Q cepstral coefficient features; in the testing stage, the recognition results of the residual networks corresponding to the full-frequency cepstral coefficient features, the Mel-frequency cepstral coefficient features and the constant-Q cepstral coefficient features are obtained, and the three recognition results are fused for a comprehensive decision.
According to one aspect of the invention, in step 1.2) and step 2.2), the extracted features are mel-frequency cepstral coefficients.
According to another aspect of the invention, in step 1.2) and step 2.2), constant-Q cepstral coefficient features are extracted.
Preferably, the residual network comprises, connected in sequence, a two-dimensional convolution layer, a sequence of residual blocks, a Dropout layer, a first fully connected layer, an activation function layer, a GRU layer, a second fully connected layer and a network output layer.
Preferably, the activation function layer uses a leaky rectified linear unit (LeakyReLU).
Preferably, in order to increase the convergence rate of learning, a batch normalization layer is further provided between the two-dimensional convolution layer and the activation function layer.
To improve the detection accuracy, in step 2.4), the score of the residual network output is combined with the ASV system score to determine whether the test speech sample is original speech or played back speech.
Compared with the prior art, the present invention has the following advantages: the neural network can extract deeper features and more completely represent the detailed information in the voice signal, and by combining cepstral features of the voice signal with a deep residual network in a deep-learning framework, the detection performance of the system is effectively improved and the algorithm is more robust; the residual network can model distortions in both the time domain and the frequency domain well, thereby improving the classification accuracy of the neural network; and by extracting full-frequency cepstral coefficient features, no filter bank is needed, so more of the detailed information in the spectrum is retained.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar functions.
In the description of the present invention, it is to be understood that the terms "central," "longitudinal," "lateral," "length," "width," "thickness," "upper," "lower," "front," "rear," "left," "right," "vertical," "horizontal," "top," "bottom," "inner," "outer," "clockwise," "counterclockwise," "axial," "radial," "circumferential," and the like indicate orientations and positional relationships based on those shown in the drawings; they are used only for convenience in describing the present invention and to simplify the description, and are not intended to indicate or imply that the referenced devices or elements must have a particular orientation or be constructed and operated in a particular orientation. These directional terms are used for purposes of illustration and are not to be construed as limiting; for example, because the disclosed embodiments of the present invention may be oriented in different directions, "lower" is not necessarily limited to a direction opposite to or coincident with the direction of gravity. Furthermore, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature.
Referring to fig. 1, a playback voice detection method includes the steps of:
1) a training stage:
1.1) inputting a training voice sample;
1.2) extracting cepstrum characteristics;
1.3) training a residual network model to obtain network model parameters;
2) a testing stage:
2.1) inputting a test voice sample;
2.2) extracting cepstrum characteristics;
2.3) using the residual network trained in step 1) to classify and score;
2.4) combining the ASV system score with the score from step 2.3) to obtain the decision.
In the above steps, feature extraction for the training speech samples and the test speech samples may be implemented with a filter bank; that is, the features of a speech sample are determined according to the filter bank, which may be preconfigured according to actual requirements and is used to extract features from the speech sample, such as the traditional Mel-frequency cepstral coefficient features and constant-Q cepstral coefficient features.
1. Mel-frequency cepstral coefficient features: Mel-frequency cepstral coefficients (MFCCs) are speech feature parameters commonly used in the field of speaker recognition. They conform to the auditory characteristics of the human ear, which has different sensitivities to sound waves of different frequencies. Mel-frequency cepstral coefficients are cepstral coefficients extracted in the Mel-scale frequency domain; the Mel scale reflects the nonlinear characteristics of human auditory frequency perception, and its relationship to frequency is expressed by the following formula:
where F_mel is the perceived frequency in Mel and f is the actual frequency in hertz (Hz). Converting the speech signal to the perceptual frequency domain, rather than simply representing it by a Fourier transform, generally simulates human auditory processing better.
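The conventional form of this Mel-scale mapping, as used throughout the speech-processing literature (the constants below are the standard ones), is:

```latex
F_{\mathrm{mel}} = 2595 \,\log_{10}\!\left(1 + \frac{f}{700}\right)
```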
Referring to fig. 2, a flow chart of a specific extraction process of MFCC is shown, comprising the following steps:
1) the signal x(n) of the speech sample (training speech sample or test speech sample) is preprocessed and framed into x_i(m); a short-time Fourier transform (STFT) is then applied to each frame obtained after framing to yield the spectral coefficients of that frame;
where i denotes the i-th frame after framing and k denotes the frequency bin within the i-th frame, k = 1, 2, ...;
2) the magnitude spectral coefficients of each frame are computed from the spectral coefficients obtained above:
3) the resulting energy spectral coefficients are fed into a bank of Mel filters, where the energy is calculated; the Mel spectral coefficients are obtained by multiplying the energy spectral coefficients by the frequency response H_m(k) of each Mel filter and summing, i.e.:
where 0 ≤ m ≤ M indexes the m-th Mel filter, with M filters in total;
4) a logarithm operation and a DCT (discrete cosine transform) are then applied to the Mel spectral coefficients to obtain the Mel-frequency cepstral coefficients:
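For reference, the textbook formulas corresponding to steps 1) to 4) (with n indexing the samples within frame i and l indexing the cepstral coefficients; the exact normalization, and whether the magnitude or power spectrum is used, may differ from the original formulas) are:

```latex
\begin{aligned}
X_i(k) &= \sum_{n=0}^{N-1} x_i(n)\, e^{-j 2\pi k n / N}, \\
E_i(k) &= \lvert X_i(k) \rvert^{2}, \\
S_i(m) &= \sum_{k} E_i(k)\, H_m(k), \qquad 0 \le m \le M, \\
\mathrm{MFCC}_i(l) &= \sum_{m=0}^{M-1} \log S_i(m)\, \cos\!\left(\frac{\pi\, l\, (m + 0.5)}{M}\right).
\end{aligned}
```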
the standard MFCC only reflects the static characteristics of parameters in a voice signal, the dynamic characteristics of the voice signal can be obtained through the differential spectral coefficients of the static characteristics, and the static coefficients of the MFCC characteristics can be preferably combined with the dynamic characteristics of first-order coefficients and second-order coefficients to improve the recognition performance of the system.
2. Constant-Q cepstral coefficient features: constant-Q cepstral coefficients (CQCCs) are obtained from the constant-Q transform (CQT), a time-frequency transform of the speech signal. The CQT uses a set of filters whose ratio of center frequency to bandwidth is a constant Q. The frequency axis of the spectrum obtained by the CQT is nonlinear, because the center frequencies follow an exponential distribution and the window length used by the filter bank varies with frequency. When the time-domain speech signal is converted into the frequency domain, this provides higher frequency resolution in the low-frequency region and higher time resolution in the high-frequency region.
Referring to fig. 3, a flowchart of a specific extraction process of the constant Q cepstrum coefficient includes the following steps:
1) the signal x(n) of the speech sample (training speech sample or test speech sample) is converted by the CQT into the perceptual frequency-domain signal X^CQ(k); X^CQ(k) is calculated as follows:
where k denotes the frequency bin, k = 1, 2, ..., K, f_s is the sampling rate of the speech samples, and f_k is the center frequency of the filter, which follows an exponential distribution and is defined as follows:
where B is the number of frequency bins per octave, and f_1 is the center frequency of the lowest frequency bin, calculated by the following formula:
The Q factor, the ratio of the center frequency f_k to the bandwidth B_k, is a constant independent of k and is defined by the following formula:
The window function is a Hanning window; since the time resolution gradually decreases as the frequency resolution increases, the window length N_k is a function of k and decreases as k increases, and the window function is defined as follows:
2) the magnitude of the frequency value X_i(k) at the k-th frequency bin of the i-th frame is calculated:
3) the logarithmic spectral coefficients are then computed from the magnitudes:
where T_k denotes the total number of frames of the speech signal in the k-th band, k = 1, 2, ...;
4) And uniformly resampling the obtained logarithmic spectrum coefficients, wherein the new frequency representation is related to the original frequency representation as follows:
5) performing DCT on the resampled spectral coefficients, namely:
where p = 0, 1, ..., L-1, and L denotes the number of frequency bins after resampling.
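For reference, the standard constant-Q transform definitions (Brown's formulation, which the variable descriptions above follow; the exact normalization used here may differ) are:

```latex
\begin{aligned}
X^{CQ}(k) &= \frac{1}{N_k} \sum_{n=0}^{N_k - 1} x(n)\, w_{N_k}(n)\, e^{-j 2\pi Q n / N_k}, \\
f_k &= f_1 \cdot 2^{\frac{k-1}{B}}, \\
Q &= \frac{f_k}{B_k} = \frac{1}{2^{1/B} - 1}, \\
N_k &= \left\lceil \frac{Q f_s}{f_k} \right\rceil,
\end{aligned}
```

where w_{N_k}(n) denotes a Hanning window of length N_k.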
The two feature types above are existing feature extraction methods; in the present invention, the following feature extraction method can also be adopted:
in addition, the features of the speech sample may also be extracted without using a filter, specifically as follows:
3. Full-frequency cepstral coefficient features: compared with traditional features such as CQCC and MFCC, the full-frequency cepstral coefficients (BFCC) abandon the filter bank used in the traditional features; that is, the logarithm operation and DCT transform are applied directly to the spectral coefficients obtained by the Fourier transform. The advantage is that more of the detailed information in the spectrum is retained.
Referring to fig. 4, a flow chart of a specific extraction process of BFCC is shown, which includes the following steps:
1) a section of voice is subjected to framing and windowing, and then Fourier transform is carried out on a framed voice signal to obtain a spectral coefficient:
where i denotes the i-th frame after framing, k denotes the frequency bin within the i-th frame, k = 0, 1, 2, ..., N-1, j denotes the imaginary unit, m denotes the number of frames after the voice signal is framed, and N denotes the number of Fourier transform points; in this embodiment, N is 512;
2) the absolute value of the spectral coefficients is then taken to obtain the corresponding magnitude spectral coefficients:
3) further carrying out logarithm operation and DCT transformation to obtain the cepstrum coefficient:
in the BFCC feature, preferably, the logarithmic energy coefficient and the first and second order difference coefficients of the feature may also be added as the final feature vector.
In the present invention, the network model is a residual network whose overall architecture is shown in fig. 5. When different features are used as input, the overall architecture of the network model is unchanged; only the dimension of the input features at the input end changes.
The residual network comprises, connected in sequence, a two-dimensional convolution layer, a sequence of four identical residual blocks, a Dropout layer, a first fully connected layer, an activation function layer, a GRU layer, a second fully connected layer and a network output layer (classifier). The Dropout rate is set to 0.5, and the network output layer preferably uses a Softmax layer.
The activation function layer uses a leaky rectified linear unit (LeakyReLU), which evolved from the rectified linear unit (ReLU) activation function. The ReLU activation function activates a neuron only when the input exceeds a threshold: all negative values are set to 0, while positive values are passed through unchanged. However, it has a significant drawback during training: when the input is negative, learning with ReLU becomes very slow or the neuron stops functioning altogether, so its weights can no longer be updated, the neuron no longer activates for any data, and its gradient remains 0. Therefore, to overcome this drawback of the ReLU function, the present invention uses the LeakyReLU activation function in the network. Unlike the ReLU function, which sets all negative values to 0, the LeakyReLU function assigns a very small non-zero slope to negative values, which solves the problem that the neuron does not learn when the ReLU input is negative. Its mathematical expression is as follows:
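In the standard formulation (the slope α for negative inputs is a small positive constant, commonly 0.01; its exact value is not fixed here), the LeakyReLU is:

```latex
f(x) =
\begin{cases}
x,        & x \ge 0, \\
\alpha x, & x < 0.
\end{cases}
```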
in addition, the GRU layer can not only re-aggregate the frame-level features extracted from the upper layer into a single speech feature, but also has a simple model, and is very suitable for constructing a deeper network. In the residual error network, the two gates of the residual error network can also enable the efficiency of the whole network to be higher, the calculation is more time-saving, and the convergence speed of the model during training is obviously accelerated.
Each of the three extracted cepstral features is fed to the input of its residual network. Frame-level time-frequency features are first extracted by the two-dimensional convolution layer; the output of this layer is then passed through the four identical residual blocks to enable deeper training of the network, and the output of the last residual block is fed in sequence to the Dropout layer, the first fully connected layer, the activation function layer and the GRU layer. After the GRU layer, the utterance-level features are mapped into a new space by the second fully connected layer, converted into new features and fed into an output layer with only two node units to produce the classification logits; finally, the output of the second fully connected layer is fed into the Softmax layer, which converts the logits into a score probability distribution.
Because vanishing gradients may occur during training when the network is too deep, a batch normalization (BN) layer is added to the residual network. The BN layer uses a standardization operation to pull distributions that have drifted out of the normal range back into a standardized range. It is placed between the two-dimensional convolution layer and the activation function layer so that the data falls within the region where the activation function is sensitive, which enlarges the gradients and accelerates the convergence of learning.
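As an illustration of this architecture, the following is a minimal PyTorch sketch. The layer ordering follows the description above (Conv2d, BN, four identical residual blocks, Dropout with rate 0.5, a first fully connected layer, LeakyReLU, a GRU layer, a second fully connected layer with two output nodes, and Softmax); the internal layout of each residual block, the channel counts and the hidden sizes are illustrative assumptions rather than values given in the text.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """One residual block; its internal layout is an assumption, since the text
    only states that four identical residual blocks are used."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)
        self.act = nn.LeakyReLU(0.01)

    def forward(self, x):
        out = self.act(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.act(out + x)                   # identity shortcut

class PlaybackDetector(nn.Module):
    """Sketch of the overall network: Conv2d -> BN -> LeakyReLU -> 4 residual
    blocks -> Dropout(0.5) -> FC -> LeakyReLU -> GRU -> FC(2) -> Softmax."""
    def __init__(self, n_ceps=90, channels=32, hidden=64):
        super().__init__()
        self.conv = nn.Conv2d(1, channels, 3, padding=1)
        self.bn = nn.BatchNorm2d(channels)         # BN between conv and activation
        self.act = nn.LeakyReLU(0.01)
        self.res_blocks = nn.Sequential(*[ResidualBlock(channels) for _ in range(4)])
        self.dropout = nn.Dropout(0.5)
        self.fc1 = nn.Linear(channels * n_ceps, hidden)
        self.gru = nn.GRU(hidden, hidden, batch_first=True)
        self.fc2 = nn.Linear(hidden, 2)            # two output node units

    def forward(self, x):                          # x: (batch, 1, time, n_ceps)
        x = self.act(self.bn(self.conv(x)))        # frame-level time-frequency features
        x = self.res_blocks(x)
        x = self.dropout(x)
        b, c, t, f = x.shape
        x = x.permute(0, 2, 1, 3).reshape(b, t, c * f)
        x = self.act(self.fc1(x))
        _, h = self.gru(x)                         # aggregate frames to utterance level
        logits = self.fc2(h[-1])
        return torch.softmax(logits, dim=-1)       # score probability distribution
```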
In the training stage, the residual network can be optimized with the Adam algorithm. A learning rate of 10^-4 is used, the batch size is 32, and the training process is stopped after 50 epochs. The loss function is the binary cross-entropy between the predicted value and the target value, and the node outputs of the last fully connected layer are used as the prediction scores.
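A corresponding training-loop sketch with the stated hyperparameters (Adam, learning rate 10^-4, batch size 32, 50 epochs, binary cross-entropy loss) might look as follows; `train_loader` and the use of the second Softmax output as the prediction score are illustrative assumptions.

```python
import torch
import torch.nn as nn

model = PlaybackDetector()                                  # network sketch above
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)   # learning rate 10^-4
criterion = nn.BCELoss()                                    # binary cross-entropy

# train_loader is assumed to yield (features, labels) batches of size 32
for epoch in range(50):                                     # stop after 50 epochs
    for feats, labels in train_loader:
        optimizer.zero_grad()
        scores = model(feats)[:, 1]                         # probability of "playback"
        loss = criterion(scores, labels.float())
        loss.backward()
        optimizer.step()
```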
In the training stage of step 1), the cepstral features are extracted from the training voice samples (a training set containing both original voice and playback voice) and fed into the neural network to train the residual network models for original voice and playback voice. In the testing stage, the features of the test voice sample are extracted and fed into the residual network model trained in the training stage; the test voice is classified according to the score output by the network output layer, and this result is combined with the score of an ASV (automatic speaker verification) system as the final result for deciding whether the test voice sample is playback voice.
Through the above training and testing processes, one subsystem is obtained for each of the three cepstral features. The scores of the three subsystems are then fused; the fusion is a weighted combination of the three subsystem scores, given by the following formula:
S = i·S_BFCC + j·S_MFCC + k·S_CQCC
where i, j and k are the weight coefficients of the three subsystem scores, subject to the constraint i + j + k = 1, and S_BFCC, S_MFCC and S_CQCC are the normalized scores of the respective subsystems. By letting the training models of the different features cooperate in this way, a better fused result is obtained.
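A minimal sketch of this weighted score fusion (the weight values shown are arbitrary examples that satisfy the constraint; the subsystem scores are assumed to be already normalized):

```python
def fuse_scores(s_bfcc: float, s_mfcc: float, s_cqcc: float,
                i: float = 0.4, j: float = 0.3, k: float = 0.3) -> float:
    """Weighted fusion of the three normalized subsystem scores."""
    assert abs(i + j + k - 1.0) < 1e-9   # constraint: i + j + k = 1
    return i * s_bfcc + j * s_mfcc + k * s_cqcc
```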