CN105869630B

CN105869630B - Speaker's voice spoofing attack detection method and system based on deep learning

Info

Publication number: CN105869630B
Application number: CN201610478041.0A
Authority: CN
Inventors: 钱彦旻; 陈楠昕; 俞凯
Original assignee: Shanghai Jiaotong University
Current assignee: Sipic Technology Co Ltd
Priority date: 2016-06-27
Filing date: 2016-06-27
Publication date: 2019-08-02
Anticipated expiration: 2036-06-27
Also published as: CN105869630A

Abstract

A speaker voice spoofing attack detection method and system based on deep learning. An audio training set is constructed, and a deep feedforward neural network and a deep recurrent neural network are initialized and trained using, respectively, the multi-frame feature vectors and the single-frame vector sequences of the training set. In the test phase, the frame-level and sequence-level feature vectors of the audio under test are fed into two trained linear discriminant analysis models respectively; the two resulting scores are weighted and combined into a final score, which is compared with a predefined threshold to distinguish spoofed speech from genuine speech. The invention captures local features while also retaining global information. Because linear discriminant analysis is used as the classifier in the verification stage and the decision is made by score fusion, the accuracy of voice spoofing detection is greatly improved.

Description

Speaker's voice spoofing attack detection method and system based on deep learning

Technical field

The present invention relates to a technology in the field of intelligent speech processing, specifically a speaker voice spoofing attack detection method and system based on deep learning.

Background technique

A voice spoofing attack is a technique that forges the voice of a specific target speaker in order to attack an automatic speaker recognition system. Speaker recognition technology is now widely used in many fields, such as identity authentication, Internet security, human-computer interaction, banking and securities systems, and military and criminal investigation. In recent years, attacks on speaker recognition systems have fallen mainly into four classes: impersonation, replay, speech synthesis, and voice conversion. Studies have shown that the main problem of traditional voice spoofing attack detection lies in feature extraction: existing feature extraction methods have many shortcomings in both their ability to represent voice characteristics and their robustness.

In the existing technology of recent years, the feature parameters most often used in the feature extraction stage of voice spoofing attack detection and identification are spectral features, phase features, cochlea-based auditory features, and perceptual features. These feature extraction methods still fall short in characterizing the difference between genuine and spoofed speech, which affects detection accuracy. In addition, these methods all rely on the auditory features of the speech signal and lose its dynamic features, so their robustness is poor and their recognition performance is unsatisfactory.

In the recognition model stage, the mainstream methods are the Gaussian mixture model (GMM) and the support vector machine (SVM). Both methods are suited to processing continuous signals, but they are limited by their training criteria and have weak expressive power: they can only separate classes whose samples are easy to distinguish, so recognition performance is poor.

Summary of the invention

Existing traditional voice spoofing attack detection methods have the following limitations: their feature extraction cannot accurately characterize the features that distinguish spoofed speech from genuine speech, they lose the dynamic features of the speech signal, their robustness is poor, and their recognition performance is unsatisfactory. To address these limitations, the present invention proposes a speaker voice spoofing attack detection method and system based on deep learning. Feature vectors are extracted with deep learning models, using two different frameworks in the feature extraction stage: a frame-level feature representation based on a deep feedforward neural network, and a sequence-level feature representation based on a deep recurrent neural network. Together they capture local features while also retaining global information. In the verification stage, linear discriminant analysis is used as the classifier and the decision is made by score fusion. The present invention can greatly improve the accuracy of voice spoofing detection.

The present invention is achieved by the following technical solutions:

The present invention relates to a speaker voice spoofing attack detection method based on deep learning. An audio training set is constructed, and a deep feedforward neural network and a deep recurrent neural network are initialized and trained using, respectively, the multi-frame feature vectors and the single-frame vector sequences of the training set. In the test phase, the frame-level and sequence-level feature vectors of the audio under test are fed into two trained linear discriminant analysis models respectively; the two resulting scores are weighted and combined into a final score, which is compared with a predefined threshold to distinguish spoofed speech from genuine speech.

The training of the deep feedforward neural network and the deep recurrent neural network is specifically as follows. The deep feedforward neural network is trained with the acoustic features of the enrollment audio extracted by the multi-frame Mel filter bank, i.e. the filter-bank features; the audio training set is then passed through the deep feedforward neural network, and the frame-level feature vector of the audio is obtained from the last hidden layer of the network. The deep recurrent neural network is likewise trained with the acoustic features of the enrollment audio extracted by the multi-frame Mel filter bank; after feature normalization, the sequence-level feature vector of the audio is obtained from the last hidden layer of the deep recurrent neural network.

When training the deep feedforward neural network and the deep recurrent neural network, the learning rate used during backpropagation is determined by simulated annealing and an early-stopping strategy.
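The patent does not spell out the annealing schedule or the stopping rule. As a minimal sketch of one common reading, the function below halves the learning rate after any epoch in which the development-set loss fails to improve and stops after a fixed number of such epochs in a row. The function name `anneal_and_stop` and the parameters `initial_rate`, `decay`, and `patience` are illustrative assumptions, not values taken from the source.

```python
def anneal_and_stop(dev_losses, initial_rate=0.1, decay=0.5, patience=3):
    """Replay per-epoch development losses through a simple schedule.

    Returns (learning rate used in each epoch that ran, index of the last
    epoch that ran). All default values are hypothetical.
    """
    rate = initial_rate
    best = float("inf")
    bad_epochs = 0
    rates = []
    for epoch, loss in enumerate(dev_losses):
        rates.append(rate)
        if loss < best:
            # Improvement: remember it and reset the patience counter.
            best = loss
            bad_epochs = 0
        else:
            # No improvement: anneal the learning rate.
            bad_epochs += 1
            rate *= decay
            if bad_epochs >= patience:
                # Early stopping: too many epochs without improvement.
                return rates, epoch
    return rates, len(dev_losses) - 1
```

In a real training loop the development loss would be computed after each epoch rather than replayed from a list; the list form only keeps the sketch self-contained.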

The multi-frame input refers to a 31-frame window: the current frame plus 15 frames on each side.
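The 31-frame window can be illustrated with a short NumPy sketch that stacks every frame with its 15 left and 15 right neighbours. The function name `splice_frames` is hypothetical, and padding the utterance edges by repeating the first and last frame is an assumption; the source does not say how edges are handled.

```python
import numpy as np

def splice_frames(features, context=15):
    """Stack each frame with `context` neighbours on each side.

    features: array of shape (num_frames, feat_dim).
    Returns an array of shape (num_frames, (2 * context + 1) * feat_dim).
    Edge frames are padded by repeating the first/last frame (an assumption).
    """
    features = np.asarray(features, dtype=float)
    num_frames = features.shape[0]
    padded = np.pad(features, ((context, context), (0, 0)), mode="edge")
    # Window t covers padded rows t .. t + 2 * context, which correspond to
    # original frames t - context .. t + context.
    windows = [padded[t:t + 2 * context + 1].reshape(-1) for t in range(num_frames)]
    return np.stack(windows)
```

With 40 filter-bank coefficients per frame, for example, each spliced vector would have 31 × 40 = 1240 dimensions; the value 40 is only an example.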

The acoustic feature, i.e. the Mel filter bank feature, is obtained by filtering the audio signal to be detected in the frequency domain with a bank of Mel filters, which yields a filtered array, i.e. the Mel spectrum. Each band-pass filter outputs one filter-bank coefficient, so the length of the array equals the number of filters in the Mel filter bank.

The Mel filters may be, but are not limited to, triangular window filters.
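A minimal sketch of such a filter bank, assuming triangular filters spaced evenly on the Mel scale and applied to a per-frame power spectrum. The Mel formula used here (2595 · log10(1 + f/700)) is a common convention rather than something stated in the source, and the function names, the logarithm, and the `floor` value are likewise illustrative assumptions.

```python
import numpy as np

def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_filterbank(num_filters, fft_size, sample_rate):
    """Triangular Mel filters, shape (num_filters, fft_size // 2 + 1)."""
    num_bins = fft_size // 2 + 1
    bin_freqs = np.linspace(0.0, sample_rate / 2.0, num_bins)
    # num_filters + 2 edge points, equally spaced on the Mel scale.
    edges = mel_to_hz(np.linspace(0.0, hz_to_mel(sample_rate / 2.0), num_filters + 2))
    bank = np.zeros((num_filters, num_bins))
    for m in range(num_filters):
        left, centre, right = edges[m], edges[m + 1], edges[m + 2]
        rising = (bin_freqs - left) / (centre - left)
        falling = (right - bin_freqs) / (right - centre)
        # Triangle: the lower of the two slopes, clipped at zero.
        bank[m] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank

def log_filterbank_features(power_spectrum, bank, floor=1e-10):
    """(num_frames, num_bins) power spectrum -> (num_frames, num_filters)."""
    # One coefficient per filter, so the output length equals the filter count.
    return np.log(np.maximum(power_spectrum @ bank.T, floor))
```

Each row of the result has exactly one coefficient per filter, matching the statement that the array length equals the number of filters in the bank.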

The deep feedforward neural network comprises several hidden layers with full connections between adjacent hidden layers; its parameters are randomly initialized and trained by the backpropagation algorithm.

The deep recurrent neural network comprises several hidden layers in which, in addition to the full connections, each hidden layer has a connection to itself that propagates information from the previous time step, so that the network retains memory.
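The self-connection can be sketched with a single simple (Elman-style) recurrent layer. The source does not specify the cell type or the non-linearity, so the `tanh` activation, the zero initial state, and the names `recurrent_forward`, `w_in`, `w_rec`, and `bias` are all assumptions made for illustration.

```python
import numpy as np

def recurrent_forward(inputs, w_in, w_rec, bias):
    """Run one simple recurrent layer over an utterance.

    inputs: (num_frames, in_dim); w_in: (hid_dim, in_dim);
    w_rec: (hid_dim, hid_dim); bias: (hid_dim,).
    Returns the hidden states, shape (num_frames, hid_dim).
    """
    hidden = np.zeros(w_rec.shape[0])  # assumed zero initial state
    states = []
    for frame in inputs:
        # w_rec @ hidden is the hidden-to-hidden self-connection: it carries
        # information from the previous time step into the current one.
        hidden = np.tanh(w_in @ frame + w_rec @ hidden + bias)
        states.append(hidden)
    return np.stack(states)
```

Because every state depends on the one before it, the hidden state at a given frame summarizes all frames seen so far, which is what lets a recurrent layer produce a sequence-level representation.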

Different nodes of the network output layer represent different attack types or genuine human speech; the whole neural network classifies the input speech, with cross entropy as the objective function.

The frame-level and sequence-level feature vectors are output by the deep feedforward neural network and the deep recurrent neural network respectively, and are preferably normalized so that all vectors have the same L2 norm.
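One straightforward way to give all vectors the same L2 norm is to scale each to unit length, as in the sketch below. Unit length is an assumption (the source only requires the norms to be equal), and the `eps` guard against zero vectors is an added safeguard.

```python
import numpy as np

def l2_normalize(vectors, eps=1e-12):
    """Scale each row vector to unit L2 norm."""
    vectors = np.asarray(vectors, dtype=float)
    norms = np.linalg.norm(vectors, axis=-1, keepdims=True)
    # eps avoids division by zero for an all-zero vector.
    return vectors / np.maximum(norms, eps)
```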

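The two trained linear discriminant analysis (LDA) models are obtained as follows: the frame-level and sequence-level feature vectors taken from the last hidden layers of the deep feedforward neural network and the deep recurrent neural network are used to train two LDA models respectively. In each LDA model, the density of each class is modeled by a multivariate Gaussian distribution:

$$f_k(x) = \frac{1}{(2\pi)^{p/2}\,|\Sigma_k|^{1/2}} \exp\left(-\frac{1}{2}(x-\mu_k)^{T}\,\Sigma_k^{-1}\,(x-\mu_k)\right)$$

where x is the feature vector, p is its dimension, and Σ_k and μ_k are the covariance matrix and mean vector of the k-th class respectively. The LDA model assumes a covariance matrix shared by all classes:

$$\Sigma_k = \Sigma, \quad \forall k$$

and the posterior probability is given by Bayes' formula:

$$\Pr(G = k \mid X = x) = \frac{f_k(x)\,\pi_k}{\sum_{t=1}^{K} f_t(x)\,\pi_t}$$

where π_k is the prior probability of the k-th class, K is the total number of classes, G is the class index, and X is the observed feature vector.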
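The LDA model described above (one Gaussian per class, a covariance matrix shared by all classes, and a Bayes posterior) can be sketched in NumPy as follows. This is an illustration under stated assumptions rather than the patented implementation: the function names are hypothetical, the shared covariance is estimated by pooling the within-class scatter, every class is assumed to have at least one training vector, and the small `ridge` term is added only for numerical stability.

```python
import numpy as np

def fit_lda(features, labels, num_classes, ridge=1e-6):
    """Estimate class means, one shared covariance matrix, and class priors."""
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    dim = features.shape[1]
    means = np.zeros((num_classes, dim))
    priors = np.zeros(num_classes)
    pooled = np.zeros((dim, dim))
    for k in range(num_classes):
        class_feats = features[labels == k]
        means[k] = class_feats.mean(axis=0)
        priors[k] = len(class_feats) / len(features)
        centred = class_feats - means[k]
        pooled += centred.T @ centred
    # Shared covariance for all classes; the ridge term is an assumption.
    covariance = pooled / (len(features) - num_classes) + ridge * np.eye(dim)
    return means, covariance, priors

def lda_posteriors(x, means, covariance, priors):
    """Posterior probability of each class for one feature vector x."""
    precision = np.linalg.inv(covariance)
    diffs = x - means  # shape (num_classes, dim)
    # Log of density times prior; the normalizing constant is dropped because
    # the shared covariance makes it identical for every class.
    log_scores = -0.5 * np.einsum("kd,de,ke->k", diffs, precision, diffs) + np.log(priors)
    log_scores -= log_scores.max()  # for numerical stability
    weights = np.exp(log_scores)
    return weights / weights.sum()
```

Working with log densities and subtracting the maximum before exponentiating avoids underflow when the feature dimension is large.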

The score weights of the two LDA models are preferably adjusted according to their performance on the development set.
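The source does not give the fusion formula or the tuning criterion. As a hedged sketch, the code below assumes that each LDA model yields a score where higher means "more likely genuine", that the two scores are combined by a convex weight, and that the weight is chosen by a small grid search minimizing the number of errors on development trials. All names (`fuse_scores`, `is_genuine`, `tune_weight`) and the grid size are illustrative.

```python
def fuse_scores(frame_score, sequence_score, weight):
    """Weighted combination of the frame-level and sequence-level scores."""
    return weight * frame_score + (1.0 - weight) * sequence_score

def is_genuine(frame_score, sequence_score, weight, threshold):
    """Accept the utterance as genuine if the fused score reaches the threshold."""
    return fuse_scores(frame_score, sequence_score, weight) >= threshold

def tune_weight(dev_trials, threshold, steps=11):
    """Pick the weight with the fewest errors on development trials.

    dev_trials: list of (frame_score, sequence_score, genuine_label) tuples.
    """
    best_weight, best_errors = 0.0, None
    for i in range(steps):
        weight = i / (steps - 1)
        errors = sum(
            is_genuine(f, s, weight, threshold) != label
            for f, s, label in dev_trials
        )
        if best_errors is None or errors < best_errors:
            best_weight, best_errors = weight, errors
    return best_weight
```

A real system would more likely tune against an equal error rate than a raw error count at a fixed threshold; the simpler criterion is used here only to keep the example short.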

The number of classes equals the number of output-layer nodes of the neural network, i.e. the number of attack types plus 1.

The present invention also relates to a speaker voice spoofing attack detection system based on deep learning, comprising a log-spectrum feature extraction module, a deep neural network module, and a linear discriminant module. The log-spectrum feature extraction module is connected to the deep neural network module and transmits the acoustic feature information of the audio under test. The deep neural network module outputs feature vector information, derived from the acoustic feature information, to the linear discriminant module for training. After training, the linear discriminant module can judge and score the feature vector information of the audio under test, thereby detecting voice spoofing.

Technical effect

Compared with the prior art, the feature vectors extracted with deep learning in the present invention characterize speech features more accurately. In the classification and recognition stage, a linear discriminant analysis (LDA) model is used as the classifier, which reduces the differences within a class and enlarges the gap between classes; recognition performance is good, robustness is strong, and accuracy is greatly improved over existing methods. The technical effects of the present invention include:

1) recognition accuracy is greatly improved over existing methods;

2) the extracted features characterize the individual characteristics of the speaker more accurately;

3) the deep learning training strategy avoids over-fitting of the network;

4) deep learning makes the features more discriminative;

5) robustness is stronger across different channels and environments.

In addition, the present invention is more robust under unknown and complex conditions.

Brief description of the drawings

Fig. 1 is a flow diagram of the present invention.

Specific embodiment

Embodiment 1

This embodiment was tested on the newly released ASVSpoof2015 data set and compared with existing baseline methods; the results are shown in Table 1. As can be seen, the method proposed by the present invention achieves the best results reported to date.

The speaker voice spoofing attack detection system of this embodiment comprises a log-spectrum feature extraction module, a deep neural network module, and a linear discriminant module. The log-spectrum feature extraction module is connected to the deep neural network module and transmits the acoustic feature information of the audio under test. The deep neural network module outputs feature vector information, derived from the acoustic feature information, to the linear discriminant module for training. After training, the linear discriminant module can judge and score the feature vector information of the audio under test, thereby detecting voice spoofing.

The detection process of the above system in this embodiment is as follows:

Step 1) Construct the audio training set (the training set of ASVSpoof2015) and randomly initialize the deep neural networks, which consist of a deep feedforward neural network and a deep recurrent neural network.

The loss function of the deep neural networks is cross entropy, with an L2-norm (Euclidean) weight decay term whose coefficient is 10^-6.
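As a minimal sketch of this loss for a single training example, the code below adds a weight decay penalty with coefficient 10^-6 to a softmax cross-entropy term. The exact form of the penalty (a plain sum of squared weights, with no factor of 1/2) and the function names are assumptions; the source only names the two ingredients and the coefficient.

```python
import numpy as np

def softmax_cross_entropy(logits, target):
    """Cross entropy between softmax(logits) and a one-hot target class."""
    shifted = logits - np.max(logits)  # for numerical stability
    log_probs = shifted - np.log(np.sum(np.exp(shifted)))
    return -log_probs[target]

def regularized_loss(logits, target, weight_matrices, decay=1e-6):
    """Cross entropy plus an L2 weight decay term (assumed form)."""
    penalty = sum(np.sum(w ** 2) for w in weight_matrices)
    return softmax_cross_entropy(logits, target) + decay * penalty
```

The length of `logits` corresponds to the number of output nodes, i.e. the number of attack types plus one for genuine speech.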

Random initialization means that the initial values of the network parameters are drawn at random. The backpropagation algorithm based on stochastic gradient descent (SGD) is used to adjust the parameters of the deep feedforward neural network, and the SGD-based backpropagation through time (BPTT) algorithm is used to adjust the parameters of the deep recurrent neural network.

Step 2) Training phase: the deep feedforward neural network is trained with multi-frame feature vectors of the training audio, using a window of 31 frames (15 frames of context on each side); the deep recurrent neural network is trained with single-frame vector sequences of the training audio, using the SGD-based BPTT algorithm. The learning rate is determined by simulated annealing and an early-stopping strategy; training uses the cross-entropy criterion with a weight decay term of 10^-6. After network training is complete, the training audio is passed through the deep feedforward neural network and the deep recurrent neural network, and the frame-level and sequence-level feature vectors obtained from their last hidden layers are used to train two linear discriminant models. Finally, the score weights of the two models are adjusted according to their performance on the development set.

Step 3) Test phase: the frame-level and sequence-level feature vectors of the audio under test are computed and fed into the respective trained linear discriminant analysis models; the two results are weighted and combined into a final score, which is compared with the threshold obtained in training to distinguish spoofed speech from genuine speech.

A more detailed comparison between the present invention and existing algorithms is as follows:

Those skilled in the art may make local adjustments to the above specific implementation in different ways without departing from the principle and purpose of the invention. The scope of protection of the invention is defined by the claims and is not limited by the above specific implementation; every implementation within that scope is bound by the present invention.

Claims

1. A speaker voice spoofing attack detection method based on deep learning, characterized in that an audio training set is constructed, and a deep feedforward neural network and a deep recurrent neural network are initialized and trained using, respectively, the multi-frame feature vectors and the single-frame vector sequences of the training set; in the test phase, the frame-level and sequence-level feature vectors of the audio under test are fed into two trained linear discriminant analysis models respectively, the two resulting scores are weighted and combined into a final score, and the final score is compared with a predefined threshold to distinguish spoofed speech from genuine speech;

the two trained linear discriminant analysis models are obtained as follows: the frame-level and sequence-level feature vectors taken from the last hidden layers of the deep feedforward neural network and the deep recurrent neural network are used to train two linear discriminant models respectively; in each linear discriminant analysis model, the density of each class is modeled by a multivariate Gaussian distribution:

$$f_k(x) = \frac{1}{(2\pi)^{p/2}\,|\Sigma_k|^{1/2}} \exp\left(-\frac{1}{2}(x-\mu_k)^{T}\,\Sigma_k^{-1}\,(x-\mu_k)\right)$$

where x denotes the speech feature of each frame, p denotes the dimension of the feature variable, and Σ_k and μ_k are the covariance matrix and mean vector of the k-th class respectively; the linear discriminant analysis model assumes:

$$\Sigma_k = \Sigma, \quad \forall k$$

where Σ denotes the covariance matrix, which expresses the degree of correlation between the variables of each dimension, and K denotes the total number of Gaussians; and the posterior probability is given by Bayes' formula:

$$\Pr(G = k \mid X = x) = \frac{f_k(x)\,\pi_k}{\sum_{t=1}^{K} f_t(x)\,\pi_t}$$

where π_k is the prior probability of the k-th class, G denotes the Gaussian index, X denotes the observed feature vector, and π_t denotes the weight of the t-th Gaussian component.

2. The speaker voice spoofing attack detection method according to claim 1, characterized in that the training of the deep feedforward neural network and the deep recurrent neural network is specifically as follows: the deep feedforward neural network is trained with the acoustic features of the enrollment audio extracted by the multi-frame Mel filter bank, i.e. the filter-bank features; the audio training set is then passed through the deep feedforward neural network, and the frame-level feature vector of the audio is obtained from the last hidden layer of the network; the deep recurrent neural network is trained with the acoustic features of the enrollment audio extracted by the multi-frame Mel filter bank; after feature normalization, the sequence-level feature vector of the audio is obtained from the last hidden layer of the deep recurrent neural network.

3. The speaker voice spoofing attack detection method according to claim 1 or 2, characterized in that, when training the deep feedforward neural network and the deep recurrent neural network, the learning rate used during backpropagation is determined by simulated annealing and an early-stopping strategy.

4. The speaker voice spoofing attack detection method according to claim 2, characterized in that the acoustic feature, i.e. the Mel filter bank feature, is obtained by filtering the audio signal to be detected in the frequency domain with a bank of Mel filters, which yields a filtered array, i.e. the Mel spectrum, wherein each band-pass filter outputs one filter-bank coefficient and the length of the array equals the number of filters in the Mel filter bank.

5. The speaker voice spoofing attack detection method according to claim 1 or 2, characterized in that the deep feedforward neural network comprises several hidden layers with full connections between adjacent hidden layers, its parameters being randomly initialized and trained by the backpropagation algorithm; and the deep recurrent neural network comprises several hidden layers in which, in addition to the full connections, each hidden layer has a connection to itself that propagates information from the previous time step, so that the network retains memory.

6. The speaker voice spoofing attack detection method according to claim 1, characterized in that the frame-level and sequence-level feature vectors are output by the deep feedforward neural network and the deep recurrent neural network respectively, and are normalized so that all vectors have the same L2 norm.

7. The speaker voice spoofing attack detection method according to claim 1, characterized in that the score weights of the two linear discriminant analysis models are adjusted according to their performance on the development set.

8. The speaker voice spoofing attack detection method according to claim 1, characterized in that the number of classes equals the number of output-layer nodes of the neural network, i.e. the number of attack types plus 1.

9. A speaker voice spoofing attack detection system based on deep learning for implementing the method of any one of claims 1 to 8, characterized by comprising a log-spectrum feature extraction module, a deep neural network module, and a linear discriminant module, wherein: the log-spectrum feature extraction module is connected to the deep neural network module and transmits the acoustic feature information of the audio under test; the deep neural network module outputs feature vector information, derived from the acoustic feature information, to the linear discriminant module for training; and after training, the linear discriminant module can judge and score the feature vector information of the audio under test, thereby detecting voice spoofing.