CN105869630A

CN105869630A - Method and system for detecting speaker voice spoofing attacks based on deep learning

Info

Publication number: CN105869630A
Application number: CN201610478041.0A
Authority: CN
Inventors: 钱彦旻; 陈楠昕; 俞凯
Original assignee: Shanghai Jiaotong University
Current assignee: Sipic Technology Co Ltd
Priority date: 2016-06-27
Filing date: 2016-06-27
Publication date: 2016-08-17
Anticipated expiration: 2036-06-27
Also published as: CN105869630B

Abstract

The invention discloses a deep-learning-based method and system for detecting voice spoofing attacks on speakers. The method comprises constructing an audio training set, initializing a deep feed-forward neural network and a deep recurrent neural network, and training them with the multi-frame feature vectors and the single-frame vector sequences of the training set, respectively. In the test phase, the frame-level and sequence-level feature vectors of the audio under test are fed into two trained linear discriminant analysis (LDA) models, the two resulting scores are weighted to obtain a final score, and the final score is compared with a predefined threshold to decide whether the speech is spoofed. The method and system capture local features as well as global information; LDA models serve as the classifiers in the recognition and verification phase, and the decision is made by score fusion, which can greatly improve the accuracy of voice spoofing detection.

Description

Speaker voice spoofing attack detection method and system based on deep learning

Technical field

The present invention relates to a technology in the field of intelligent speech, in particular to a deep-learning-based method and system for detecting voice spoofing attacks on speakers.

Background technology

A voice spoofing attack is a technique that forges the voice of a specific target in order to attack an automatic speaker recognition system. Speaker recognition technology is now widely used in many fields, such as identity authentication, Internet security, human-computer interaction, banking and securities systems, and military criminal investigation. In recent years, attacks on speaker recognition systems have fallen mainly into four classes: impersonation, replay, speech synthesis, and voice conversion. Research shows that the main problem in traditional voice spoofing attack detection lies in feature extraction: existing feature extraction methods have many deficiencies in their ability to express the characteristics of the human voice and in robustness.

In existing technology, the feature parameters commonly used in the feature extraction part of voice spoofing detection and recognition are mainly spectral feature parameters, phase feature parameters, cochlea-based auditory features, and perceptual features. These feature extraction methods still fall short in characterizing what separates genuine from spoofed speech, which affects detection accuracy. In addition, these methods all use the auditory features of the speech signal but lose its dynamic features, so robustness is poor and recognition results are unsatisfactory.

For the recognition model, the mainstream methods are the Gaussian mixture model (GMM) and the support vector machine (SVM). Both approaches are suitable for processing continuous signals, but they are limited by their training criteria and have weak expressive power. As a result they can only separate classes whose differences are easy to distinguish, so their recognition performance is poor.

Summary of the invention

Existing traditional methods of voice spoofing attack detection have several shortcomings: their feature extraction cannot accurately characterize the features that distinguish spoofed speech from genuine speech, they lose the dynamic features of the speech signal, their robustness is poor, and their recognition performance is unsatisfactory. To address these shortcomings, the present invention proposes a deep-learning-based method and system for detecting voice spoofing attacks on speakers. In the feature extraction phase, deep learning models are used to extract feature vectors in two different frameworks: a frame-level feature representation based on a deep feed-forward neural network, and a sequence-level feature representation based on a deep recurrent neural network. Together they capture local features and also retain global information. In the recognition and verification phase, linear discriminant analysis is used as the classifier, and the decision is made by score fusion. The present invention can greatly improve the accuracy of voice spoofing detection.

The present invention is achieved by the following technical solutions:

The present invention relates to a deep-learning-based method for detecting voice spoofing attacks on speakers. An audio training set is built, and a deep feed-forward neural network and a deep recurrent neural network are initialized and trained with the multi-frame feature vectors and the single-frame vector sequences of the training set, respectively. In the test phase, the frame-level and sequence-level feature vectors of the audio under test are fed into two trained linear discriminant analysis models, the two resulting scores are weighted to give a final score, and the final score is compared with a predefined threshold to decide whether the speech is spoofed.

Training the deep feed-forward neural network and the deep recurrent neural network specifically comprises: extracting the acoustic features of the enrollment audio, i.e. Filter bank features, with a multi-frame Mel filter bank and using them to train the deep feed-forward neural network, then passing the audio training set through the deep feed-forward neural network and taking the frame-level feature vector of the audio from the last hidden layer of the network; and extracting the acoustic features of the enrollment audio with the multi-frame Mel filter bank and using them to train the deep recurrent neural network, then, after feature normalization, taking the sequence-level feature vector of the audio from the last hidden layer of the deep recurrent neural network.

During back-propagation in training the deep feed-forward neural network and the deep recurrent neural network, the learning rate is determined by simulated annealing and an early stopping strategy.
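The text does not spell out the annealing schedule or the stopping rule. The sketch below shows one common reading, in which the learning rate is reduced whenever the development-set loss fails to improve and training stops after several such epochs. The function name, the halving factor, the patience value, and the initial rate are all assumptions made for illustration, not values from the patent.

```python
def anneal_and_stop(dev_losses, initial_lr=0.1, decay=0.5, patience=3):
    """Replay a list of per-epoch development losses (hypothetical schedule).

    Returns the learning rate used in each epoch that was run and the index
    of the last epoch run.
    """
    lr = initial_lr
    best_loss = float("inf")
    bad_epochs = 0
    lr_history = []
    for epoch, loss in enumerate(dev_losses):
        lr_history.append(lr)
        if loss < best_loss:
            # Improvement: remember it and reset the patience counter.
            best_loss = loss
            bad_epochs = 0
        else:
            # No improvement: anneal the learning rate for the next epoch.
            bad_epochs += 1
            lr *= decay
            if bad_epochs >= patience:
                return lr_history, epoch  # early stop
    return lr_history, len(dev_losses) - 1
```

In a real training loop the development loss would be computed after each epoch instead of being read from a list.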

The multi-frame input refers to a 31-frame window, i.e. the current frame plus 15 frames on each side.
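As an illustration of the 31-frame window, the following sketch stacks every frame with its 15 left and 15 right neighbours using NumPy. The function name is hypothetical, and padding the utterance edges by repeating the first and last frame is an assumption, since the text does not say how edges are handled.

```python
import numpy as np

def splice_frames(features: np.ndarray, context: int = 15) -> np.ndarray:
    """Stack each frame with `context` neighbours on each side.

    features: (num_frames, feat_dim) per-frame acoustic features.
    Returns an array of shape (num_frames, (2 * context + 1) * feat_dim).
    """
    num_frames = features.shape[0]
    # Assumption: repeat the edge frames so every frame has a full window.
    padded = np.pad(features, ((context, context), (0, 0)), mode="edge")
    window = 2 * context + 1
    # Row t concatenates padded frames t .. t + window - 1,
    # which are original frames t - context .. t + context.
    return np.stack([padded[t:t + window].reshape(-1) for t in range(num_frames)])
```

With 40-dimensional features, each spliced vector has 31 * 40 = 1240 dimensions.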

The acoustic feature, i.e. the Mel filter bank acoustic feature, is obtained by filtering the audio signal under test in the frequency domain with a set of Mel filters, which yields a filtered array, i.e. the Mel spectrum. Each band-pass filter outputs one Filter bank coefficient, so the length of the array equals the number of filters in the Mel filter bank.

The Mel filters use, but are not limited to, triangular window filters.
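A minimal sketch of such a filter bank follows, with one triangular filter per band and one coefficient per filter. The Hz-to-Mel formula used here is one common convention, and the function names, the logarithm, and the small flooring constant are assumptions for illustration; the text only states that each band-pass filter yields one coefficient.

```python
import numpy as np

def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_filterbank(num_filters, fft_size, sample_rate):
    """Return a (num_filters, fft_size // 2 + 1) matrix of triangular filters."""
    num_bins = fft_size // 2 + 1
    bin_freqs = np.linspace(0.0, sample_rate / 2.0, num_bins)
    # Filter edges are equally spaced on the Mel scale.
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0),
                             num_filters + 2)
    hz_points = mel_to_hz(mel_points)
    fbank = np.zeros((num_filters, num_bins))
    for m in range(num_filters):
        left, center, right = hz_points[m], hz_points[m + 1], hz_points[m + 2]
        rising = (bin_freqs - left) / (center - left)
        falling = (right - bin_freqs) / (right - center)
        # Triangle: rises from left to center, falls from center to right.
        fbank[m] = np.maximum(0.0, np.minimum(rising, falling))
    return fbank

def filterbank_coefficients(power_spectrum, fbank):
    """One (log) energy per filter; the array length equals the filter count."""
    return np.log(fbank @ power_spectrum + 1e-10)
```

For example, `filterbank_coefficients(spectrum, mel_filterbank(40, 512, 16000))` maps a 257-bin power spectrum to 40 coefficients.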

The deep feed-forward neural network comprises several hidden layers that are fully connected to one another; the parameter values are randomly initialized and are updated by the back-propagation algorithm.
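The sketch below illustrates random initialization and how the activations of the last hidden layer can be read out as features once the classification output layer is left off. The sigmoid activation, the Gaussian initialization scale, the layer sizes, and the function names are assumptions; the text does not specify them.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_layers(sizes, seed=0):
    """Randomly initialise fully connected layers (hypothetical scale 0.1)."""
    rng = np.random.default_rng(seed)
    return [(rng.normal(0.0, 0.1, size=(n_in, n_out)), np.zeros(n_out))
            for n_in, n_out in zip(sizes[:-1], sizes[1:])]

def last_hidden_activations(inputs, hidden_layers):
    """Run inputs through the hidden layers only and return the final activations.

    inputs: (num_frames, input_dim); hidden_layers: list of (weights, bias).
    The output layer is deliberately not part of `hidden_layers`.
    """
    h = inputs
    for weights, bias in hidden_layers:
        h = sigmoid(h @ weights + bias)
    return h
```

Training by back-propagation is omitted here; only the forward pass used for feature extraction is shown.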

The deep recurrent neural network comprises several hidden layers which, in addition to the full connections, also contain connections from each hidden layer to itself; these propagate the information of the previous time step so that it is retained.
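The self-connection can be illustrated with a single recurrent layer: the hidden state at each time step depends on the current frame and on the hidden state of the previous step. This is a minimal sketch of a simple (Elman-style) recurrence; the tanh activation, the single layer, and the function name are assumptions, as the text does not name the recurrent unit.

```python
import numpy as np

def recurrent_hidden_states(inputs, w_in, w_rec, bias):
    """Return the hidden state at every time step of one recurrent layer.

    inputs: (num_frames, feat_dim); w_in: (feat_dim, hidden_dim);
    w_rec: (hidden_dim, hidden_dim) self-connection; bias: (hidden_dim,).
    """
    hidden = np.zeros(w_rec.shape[0])
    states = []
    for frame in inputs:
        # The w_rec term carries information from the previous time step.
        hidden = np.tanh(frame @ w_in + hidden @ w_rec + bias)
        states.append(hidden)
    return np.stack(states)
```

Setting `w_rec` to zero removes the memory and reduces the layer to an ordinary feed-forward layer applied frame by frame.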

Different nodes of the network output layer represent the different attack types or genuine human speech; the whole neural network classifies the input speech, with cross entropy as the objective function.

The frame-level and sequence-level feature vectors are output by the deep feed-forward neural network and the deep recurrent neural network, respectively, and are preferably length-normalized so that they have the same vector 2-norm.
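Normalizing to a common 2-norm can be sketched as follows. Scaling to unit length is an assumption (any shared length would satisfy the text), and the function name and the small floor against division by zero are illustrative.

```python
import numpy as np

def length_normalize(vectors, eps=1e-12):
    """Scale each row vector to unit Euclidean (2-norm) length."""
    norms = np.linalg.norm(vectors, axis=-1, keepdims=True)
    return vectors / np.maximum(norms, eps)
```

After this step, vectors from both networks lie on the same unit sphere.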

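The two trained linear discriminant analysis (LDA) models are obtained as follows: the frame-level and sequence-level feature vectors obtained from the last hidden layers of the deep feed-forward neural network and the deep recurrent neural network are used to train two LDA models, respectively. In each LDA model, the density of each class is modeled by a multivariate Gaussian distribution:

$$f_k(x) = \frac{1}{(2\pi)^{p/2}\,|\Sigma_k|^{1/2}} \exp\!\left(-\frac{1}{2}(x-\mu_k)^{T}\Sigma_k^{-1}(x-\mu_k)\right),$$

where $\Sigma_k$ and $\mu_k$ are the covariance matrix and the mean of the $k$-th class, respectively, and $p$ is the dimension of the feature vector $x$. The LDA model assumes that all classes share one covariance matrix, $\Sigma_k = \Sigma$ for all $k$, and the posterior probability is given by Bayes' formula:

$$P(G = k \mid X = x) = \frac{f_k(x)\,\pi_k}{\sum_{l=1}^{K} f_l(x)\,\pi_l},$$

where $\pi_k$ is the prior probability of the $k$-th class and $K$ is the number of classes.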
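As a hedged illustration of the LDA model, the sketch below estimates class priors, class means, and a pooled covariance matrix, then computes class posteriors with Bayes' formula. It relies on the standard LDA assumption that all classes share one covariance matrix, under which the posterior reduces to a softmax over linear discriminant functions. The function names and the pooled-covariance estimator are choices made for this example, not details from the patent.

```python
import numpy as np

def fit_lda(x, y, num_classes):
    """Estimate priors, class means and the shared covariance.

    x: (n, p) feature vectors; y: (n,) integer labels in [0, num_classes).
    """
    n = x.shape[0]
    priors = np.array([(y == k).mean() for k in range(num_classes)])
    means = np.stack([x[y == k].mean(axis=0) for k in range(num_classes)])
    centered = x - means[y]
    cov = centered.T @ centered / (n - num_classes)  # pooled estimate
    return priors, means, cov

def lda_posterior(x, priors, means, cov):
    """Posterior P(class k | x) for each row of x."""
    cov_inv = np.linalg.inv(cov)
    # delta_k(x) = x^T S^-1 mu_k - 0.5 mu_k^T S^-1 mu_k + log pi_k
    quad = 0.5 * np.sum((means @ cov_inv) * means, axis=1)
    scores = x @ cov_inv @ means.T - quad + np.log(priors)
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    post = np.exp(scores)
    return post / post.sum(axis=1, keepdims=True)
```

The terms of the Gaussian density that do not depend on the class cancel between numerator and denominator, which is why only the linear discriminant functions appear in the code.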

The score weights of the two LDA models are preferably adjusted according to performance on the development set.
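One way to adjust the weights is a grid search on the development set. The sketch below assumes that the two scores are combined as a convex combination, that higher scores mean genuine speech, and that performance is measured by the equal error rate (EER). None of these choices is stated in the text, and the function names are hypothetical.

```python
import numpy as np

def equal_error_rate(genuine_scores, spoof_scores):
    """Approximate EER by sweeping the threshold over all observed scores."""
    genuine = np.asarray(genuine_scores, dtype=float)
    spoof = np.asarray(spoof_scores, dtype=float)
    best_gap, eer = np.inf, 1.0
    for threshold in np.sort(np.concatenate([genuine, spoof])):
        false_accept = np.mean(spoof >= threshold)   # spoof taken as genuine
        false_reject = np.mean(genuine < threshold)  # genuine rejected
        gap = abs(false_accept - false_reject)
        if gap < best_gap:
            best_gap, eer = gap, (false_accept + false_reject) / 2.0
    return eer

def tune_fusion_weight(frame_genuine, seq_genuine, frame_spoof, seq_spoof,
                       grid=np.linspace(0.0, 1.0, 11)):
    """Pick the frame-level weight w (sequence weight is 1 - w) with lowest EER."""
    best_weight, best_eer = None, np.inf
    for w in grid:
        eer = equal_error_rate(w * frame_genuine + (1.0 - w) * seq_genuine,
                               w * frame_spoof + (1.0 - w) * seq_spoof)
        if eer < best_eer:
            best_weight, best_eer = w, eer
    return best_weight, best_eer
```

The inputs are NumPy arrays of development-set scores from the two LDA models, split into genuine and spoofed utterances.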

The number of classes equals the number of output-layer nodes of the neural network, i.e. the number of attack types plus 1.

The present invention also relates to a deep-learning-based system for detecting voice spoofing attacks on speakers, comprising a log-spectrum feature extraction module, a deep neural network module, and a linear discriminant module, wherein: the log-spectrum feature extraction module is connected to the deep neural network module and transmits the acoustic feature information of the audio under test; the deep neural network module outputs feature vector information, computed from the acoustic feature information, to the linear discriminant module for training; and after training, the linear discriminant module judges and scores the feature vector information of the audio under test, thereby detecting voice spoofing.

Technical effects

Compared with the prior art, the feature vectors extracted by deep learning as proposed in the present invention characterize the speech of a speaker more accurately. Using linear discriminant analysis (LDA) models as the classifiers in the classification stage reduces within-class differences and enlarges between-class differences, giving good recognition performance, strong robustness, and a large improvement in accuracy over existing methods. The technical effects of the present invention include:

1) recognition accuracy is greatly improved over existing methods;

2) the extracted features characterize the individual characteristics of the speaker more accurately;

3) the deep learning training strategy avoids over-fitting of the network;

4) deep learning makes the features more discriminative;

5) robustness is higher across different channels and environments.

In addition, the present invention is more robust under unknown, complex conditions.

Brief description of the drawings

Fig. 1 is a schematic flowchart of the present invention.

Detailed description of the invention

Embodiment 1

This embodiment is tested on the newly released ASVspoof 2015 data set and compared with existing baseline methods; the results are shown in Table 1. It can be seen that the method proposed in the present invention achieves the best results to date.

The speaker voice spoofing attack detection system of this embodiment comprises a log-spectrum feature extraction module, a deep neural network module, and a linear discriminant module, wherein: the log-spectrum feature extraction module is connected to the deep neural network module and transmits the acoustic feature information of the audio under test; the deep neural network module outputs feature vector information, computed from the acoustic feature information, to the linear discriminant module for training; and after training, the linear discriminant module judges and scores the feature vector information of the audio under test, thereby detecting voice spoofing.

The detection process of the above system in this embodiment is as follows:

Step 1) Build the audio training set (the training set of ASVspoof 2015) and randomly initialize the deep neural networks, which consist of the deep feed-forward neural network and the deep recurrent neural network.

The loss function of the deep neural networks is cross entropy, with a Euclidean-distance (L2 norm) weight decay term whose coefficient is 10^-6.
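The loss described above can be sketched as softmax cross entropy plus a weight decay penalty with coefficient 10^-6. Whether the penalty carries an extra factor of 1/2 and whether biases are included are not stated, so the form below (sum of squared weights, no 1/2, weight matrices only) is an assumption, as are the function names.

```python
import numpy as np

def cross_entropy(logits, labels):
    """Mean softmax cross entropy; logits: (n, classes), labels: (n,) ints."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def training_loss(logits, labels, weight_matrices, decay=1e-6):
    """Cross entropy plus an L2 weight decay term."""
    penalty = sum(np.sum(w ** 2) for w in weight_matrices)
    return cross_entropy(logits, labels) + decay * penalty
```

With a coefficient this small, the penalty acts as a mild regularizer and the cross-entropy term dominates the loss.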

Random initialization means that the initial values of the network parameters are drawn at random. The back-propagation algorithm based on stochastic gradient descent (SGD) is then used to adjust the parameters of the deep feed-forward neural network, and the SGD-based back-propagation through time (BPTT) algorithm is used to adjust the parameters of the deep recurrent neural network.

Step 2) Training phase: the deep feed-forward neural network is trained with the multi-frame feature vectors of the training audio, with a window size of 31 frames, extending 15 frames to the left and to the right; the deep recurrent neural network is trained with the single-frame vector sequences of the training audio, using the SGD-based BPTT algorithm. The learning rate is determined by simulated annealing and an early stopping strategy; training uses cross entropy with a weight decay term of 10^-6. After network training is complete, the training audio is passed through the deep feed-forward neural network and the deep recurrent neural network, and the frame-level and sequence-level feature vectors obtained from their last hidden layers are used to train two linear discriminant models. Finally, the score weights of the two models are adjusted according to performance on the development set.

Step 3) Test phase: the frame-level and sequence-level feature vectors of the audio under test are computed and fed into the corresponding trained linear discriminant analysis models; the two results are weighted to give the final score, which is compared with the threshold obtained in training to decide whether the speech is spoofed.
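The final decision in the test phase can be sketched as a weighted sum followed by a threshold comparison. The convex-combination form, the convention that a higher score means genuine speech, and the function names are assumptions; the text only says that the two results are weighted and compared with a threshold.

```python
def fuse_scores(frame_score, sequence_score, frame_weight):
    """Weighted sum of the frame-level and sequence-level LDA scores."""
    return frame_weight * frame_score + (1.0 - frame_weight) * sequence_score

def is_genuine(frame_score, sequence_score, frame_weight, threshold):
    """Accept the utterance as genuine if the fused score reaches the threshold."""
    return fuse_scores(frame_score, sequence_score, frame_weight) >= threshold
```

For instance, with a frame weight of 0.75, scores of 0.9 and 0.5 fuse to 0.8, which passes a threshold of 0.6.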

A comparison between the present invention and existing algorithms is given in the following table:

Those skilled in the art may make local adjustments to the above specific implementation in different ways without departing from the principle and purpose of the present invention. The scope of protection of the present invention is defined by the claims and is not limited by the above specific implementation; every implementation within that scope is bound by the present invention.

Claims

1. A deep-learning-based method for detecting voice spoofing attacks on speakers, characterized in that an audio training set is built, and a deep feed-forward neural network and a deep recurrent neural network are initialized and trained with the multi-frame feature vectors and the single-frame vector sequences of the training set, respectively; in the test phase, the frame-level and sequence-level feature vectors of the audio under test are fed into two trained linear discriminant analysis models, the two resulting scores are weighted to give a final score, and the final score is compared with a predefined threshold to decide whether the speech is spoofed.

2. The speaker voice spoofing attack detection method according to claim 1, characterized in that training the deep feed-forward neural network and the deep recurrent neural network specifically comprises: extracting the acoustic features of the enrollment audio, i.e. Filter bank features, with a multi-frame Mel filter bank and using them to train the deep feed-forward neural network, then passing the audio training set through the deep feed-forward neural network and taking the frame-level feature vector of the audio from the last hidden layer of the network; and extracting the acoustic features of the enrollment audio with the multi-frame Mel filter bank and using them to train the deep recurrent neural network, then, after feature normalization, taking the sequence-level feature vector of the audio from the last hidden layer of the deep recurrent neural network.

3. The speaker voice spoofing attack detection method according to claim 1 or 2, characterized in that, during back-propagation in training the deep feed-forward neural network and the deep recurrent neural network, the learning rate is determined by simulated annealing and an early stopping strategy.

4. The speaker voice spoofing attack detection method according to claim 1, characterized in that the acoustic feature, i.e. the Mel filter bank acoustic feature, is obtained by filtering the audio signal under test in the frequency domain with a set of Mel filters, which yields a filtered array, i.e. the Mel spectrum, wherein each band-pass filter outputs one Filter bank coefficient and the length of the array equals the number of filters in the Mel filter bank.

5. The speaker voice spoofing attack detection method according to claim 1 or 2, characterized in that the deep feed-forward neural network comprises several hidden layers that are fully connected to one another, with parameter values that are randomly initialized and updated by the back-propagation algorithm; and the deep recurrent neural network comprises several hidden layers which, in addition to the full connections, also contain connections from each hidden layer to itself, which propagate the information of the previous time step so that it is retained.

6. The speaker voice spoofing attack detection method according to claim 1, characterized in that the frame-level and sequence-level feature vectors are output by the deep feed-forward neural network and the deep recurrent neural network, respectively, and are length-normalized so that they have the same vector 2-norm.

7. The speaker voice spoofing attack detection method according to claim 1, characterized in that the two trained linear discriminant analysis models are obtained as follows: the frame-level and sequence-level feature vectors obtained from the last hidden layers of the deep feed-forward neural network and the deep recurrent neural network are used to train two LDA models, respectively; in each LDA model, the density of each class is modeled by a multivariate Gaussian distribution:

$$f_k(x) = \frac{1}{(2\pi)^{p/2}\,|\Sigma_k|^{1/2}} \exp\!\left(-\frac{1}{2}(x-\mu_k)^{T}\Sigma_k^{-1}(x-\mu_k)\right),$$

where $\Sigma_k$ and $\mu_k$ are the covariance matrix and the mean of the $k$-th class, respectively, and $p$ is the dimension of the feature vector $x$; the LDA model assumes that all classes share one covariance matrix, $\Sigma_k = \Sigma$ for all $k$, and the posterior probability is given by Bayes' formula:

$$P(G = k \mid X = x) = \frac{f_k(x)\,\pi_k}{\sum_{l=1}^{K} f_l(x)\,\pi_l},$$

where $\pi_k$ is the prior probability of the $k$-th class and $K$ is the number of classes.

8. The speaker voice spoofing attack detection method according to claim 1 or 7, characterized in that the score weights of the two LDA models are adjusted according to performance on the development set.

9. The speaker voice spoofing attack detection method according to claim 7, characterized in that the number of classes equals the number of output-layer nodes of the neural network, i.e. the number of attack types plus 1.

10. A deep-learning-based system for detecting voice spoofing attacks on speakers, characterized by comprising a log-spectrum feature extraction module, a deep neural network module, and a linear discriminant module, wherein: the log-spectrum feature extraction module is connected to the deep neural network module and transmits the acoustic feature information of the audio under test; the deep neural network module outputs feature vector information, computed from the acoustic feature information, to the linear discriminant module for training; and after training, the linear discriminant module judges and scores the feature vector information of the audio under test, thereby detecting voice spoofing.