Detailed Description of the Embodiments
To make the objects, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are a part, rather than all, of the embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative effort shall fall within the protection scope of the present invention.
Fig. 1 is a flowchart of an optimization method of a spoofed-recording detection neural network model provided by an embodiment of the present invention, which includes the following steps:
S11: constructing a spoofed-recording detection neural network model based on a feature extractor, a spoofing detector, and a domain predictor, wherein the feature extractor and the spoofing detector constitute a first branch, and the feature extractor and the domain predictor constitute a second branch;
S12: inputting source-domain data and target-domain data as input samples into the feature extractor, wherein the source-domain data carry spoofing labels and domain labels, and the target-domain data carry domain labels;
S13: feeding the output of the feature extractor separately into the spoofing detector and the domain predictor, and training the spoofed-recording detection neural network model so as to reduce the loss function value of the spoofing detector and reduce the loss function value of the domain predictor;
S14: performing adversarial training on the feature extractor based on the reduced loss function value of the domain predictor, so that the deep features output by the feature extractor to the spoofing detector are domain-invariant, spoofing-discriminative features.
In this embodiment, the "domain" described herein can be understood as follows: the configuration of a recorded audio includes the "playback device" (the device used to play the original audio), the "recording device" (the device used to record the replayed audio), and the "recording environment" (the ambient environment in which the audio is played, such as an office or a restaurant). Within the same dataset these configurations tend to be quite similar, whereas across different datasets the similarity can be quite low. Intuitively, within one dataset it is unlikely that every audio clip uses a different playback environment and device (environments and devices are often reused), while the overlap between two datasets may be almost zero, for example when the playback environments and devices used do not overlap at all. In other words, the domain difference across datasets is huge, far larger than in tests within the same dataset.
For step S11, a spoofed-recording detection neural network model is constructed based on a feature extractor, a spoofing detector, and a domain predictor. A conventional deep neural network for replay attack detection generally comprises two components: a feature extractor intended to discover discriminative features, and a spoofing detector that maps the features to spoofing labels, indicating whether an input is a spoofing attack or genuine speech.
To mitigate the influence of domain mismatch, an architecture is proposed that learns deep features which solve replay attack detection without distinguishing between domains. Unlike a traditional neural network, it establishes a new branch connected after the feature extractor through a gradient reversal layer, serving as a domain classifier. The first branch comprises the feature extractor and the spoofing detector and constitutes a standard feed-forward architecture. The second branch shares the feature extractor of the first branch and accesses a domain classifier through a gradient reversal layer (GRL).
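For illustration, a minimal PyTorch sketch of this two-branch topology is given below. It is only a schematic under assumed toy layer sizes and names; the concrete extractor of this embodiment is an LCNN, described later.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; scales the gradient by -lambda backward."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

class TwoBranchModel(nn.Module):
    """Branch 1: feature extractor -> spoofing detector (standard feed-forward).
    Branch 2: shared feature extractor -> gradient reversal -> domain classifier."""
    def __init__(self, feat_dim=257, hidden=128, lambd=1.0):
        super().__init__()
        self.lambd = lambd
        self.feature_extractor = nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU())
        self.spoof_detector = nn.Linear(hidden, 2)      # genuine vs. replay
        self.domain_classifier = nn.Linear(hidden, 2)   # source vs. target

    def forward(self, x):
        f = self.feature_extractor(x)
        y = self.spoof_detector(f)                                    # branch 1
        d = self.domain_classifier(GradReverse.apply(f, self.lambd))  # branch 2
        return y, d
```

Because the reversal layer sits only on the second branch, the spoofing detector is trained normally while the domain loss acts adversarially on the shared extractor.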
For step S12, the prepared source-domain data and target-domain data are input as input samples to the feature extractor, wherein the spoofing labels and domain labels of the source-domain data and the target-domain data are already known when the data are collected.
For step S13, for the source-domain data, the spoofing prediction loss is computed by the spoofing detector while the domain prediction loss is computed by the domain predictor; for the target-domain data, only the domain prediction loss of the domain predictor needs to be computed, because these data carry no spoofing labels. The spoofed-recording detection neural network model is then trained to reduce the loss function value of the spoofing detector and reduce the loss function value of the domain predictor.
For step S14, adversarial training is performed on the feature extractor based on the reduced loss function value of the domain predictor; through the adversarial training, the trained feature extractor loses the ability to distinguish between domains.
It can be seen that, in order to reduce the degradation of cross-domain test performance, this embodiment proposes an optimized spoofed-recording detection neural network model framework in which an additional domain prediction output is added to the traditional neural network model. Through adversarial training between the feature extractor and the domain predictor, the model finally learns deep features that are discriminative for recording attack detection but have no discriminative power for domain prediction, thereby improving the generalization performance of cross-domain tests and solving the problem of poor recognition in cross-dataset tests.
As an implementation, performing adversarial training on the feature extractor based on the reduced loss function value of the domain predictor comprises:
performing adversarial training on the feature extractor by passing the reduced loss function value of the domain predictor through a gradient reversal layer.
Further, after reversal by the gradient reversal layer, a minimized loss function value of the spoofing detector and a maximized loss function value of the domain predictor are determined.
In this embodiment, a GRL (gradient reversal layer) between the feature extractor and the domain predictor inverts the gradient during backpropagation. As a result, the loss function value of the domain predictor is maximized with respect to the feature extractor.
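The reversal can be verified with a few lines of PyTorch. The sketch below uses a compact, mathematically equivalent formulation of the GRL (an illustrative alternative to an autograd.Function, not necessarily the implementation of this embodiment): the forward value is x, while the gradient reaching x is scaled by -λ.

```python
import torch

def grad_reverse(x, lambd=1.0):
    # Forward value: (1 + lambd) * x - lambd * x == x (identity).
    # Backward: only the -lambd * x term carries gradient, so dL/dx is negated.
    return (1.0 + lambd) * x.detach() - lambd * x

x = torch.ones(3, requires_grad=True)
grad_reverse(x, 1.0).sum().backward()
print(x.grad)  # tensor([-1., -1., -1.]) -- the sign of the gradient is flipped
```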
It can be seen that this embodiment realizes, through the gradient reversal layer, the maximization of the loss function value of the domain predictor during backpropagation, which helps the optimized spoofed-recording detection neural network model achieve more accurate recognition.
As an implementation, when the data amounts of the source-domain data and the target-domain data are unbalanced, the domain with less data is oversampled so that the data amounts of the source-domain data and the target-domain data match.
During the acquisition of source-domain data and target-domain data, the data may be insufficient, causing the data amounts to be unbalanced during training and affecting the final optimization effect. To avoid this situation, the domain with less data is oversampled.
It can be seen that, by matching the data amounts of the source-domain data and the target-domain data, this embodiment ensures that sufficient data are available for training during optimization, improving the recognition effect of the optimized spoofed-recording detection neural network model.
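A minimal sketch of such oversampling, assuming each domain is simply a list of training samples (the names and sizes below are illustrative only):

```python
import random

def oversample_to_match(minority, majority, seed=0):
    """Randomly duplicate samples of the smaller domain until both domains
    contain the same number of samples."""
    rng = random.Random(seed)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return minority + extra

source = [f"src_{i}" for i in range(1000)]  # e.g. source-domain utterance ids
target = [f"tgt_{i}" for i in range(400)]   # fewer target-domain utterances
target = oversample_to_match(target, source)
assert len(target) == len(source)
```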
When the above steps are specifically implemented, a conventional deep neural network for replay attack detection usually comprises two components: a feature extractor intended to find discriminative features, and a spoofing detector that maps the features to spoofing labels, indicating whether they are spoofing attacks or genuine speech. Suppose the input sample is $x \in X$ and the output label is $y \in Y = \{[0,1], [1,0]\}$, where $X$ and $Y$ are the input feature space and the output label space, respectively. In the domain mismatch scenario, the source-domain data and the target-domain data share similar but different data distributions, denoted $S(x, y)$ and $T(x, y)$.
To mitigate the influence of domain mismatch, an architecture that can learn such deep features is proposed; Fig. 2 is a schematic diagram of the cross-domain recording spoofing attack detection framework based on domain adversarial training. Unlike a traditional neural network, a new branch is connected after the feature extractor through a gradient reversal layer and serves as the domain predictor. The framework therefore consists of two output layers: one for the spoofing label $y \in Y$ and the other for the domain label $d \in D$. Here $Y = D = \{[0,1], [1,0]\}$, since spoofing detection is usually modeled as a binary classification task.
Specifically, the mapping functions of the feature extractor $G_f(\cdot;\Theta_f)$, the spoofing detector $G_y(\cdot;\Theta_y)$, and the domain classifier $G_d(\cdot;\Theta_d)$ are as follows:

$$f = G_f(x; \Theta_f)$$
$$y = G_y(f; \Theta_y)$$
$$d = G_d(f; \Theta_d)$$
Let $x_i$ denote the $i$-th input sample with labels $y_i$ and $d_i$, indicating that $x_i$ comes from the source domain ($(x_i, y_i) \sim S(x, y)$ if $d_i = [0,1]$) or from the target domain ($(x_i, y_i) \sim T(x, y)$ if $d_i = [1,0]$). The spoofing detection loss and the domain prediction loss of the $i$-th input sample are expressed as ("domain" here refers to the field described above):

$$\mathcal{L}_y^i = \mathcal{L}_y\big(G_y(G_f(x_i; \Theta_f); \Theta_y),\, y_i\big)$$
$$\mathcal{L}_d^i = \mathcal{L}_d\big(G_d(G_f(x_i; \Theta_f); \Theta_d),\, d_i\big)$$
To find spoofing-discriminative and domain-invariant features, the goal is to find optimal parameters $\Theta_f$, $\Theta_y$, and $\Theta_d$ that minimize the spoofing detection loss while maximizing the domain prediction loss. The total loss of the whole network over $N$ input samples can therefore be stated as follows:

$$E(\Theta_f, \Theta_y, \Theta_d) = \frac{1}{N} \sum_{i=1}^{N} \mathcal{L}_y^i - \frac{\lambda}{N} \sum_{i=1}^{N} \mathcal{L}_d^i$$
where $\lambda$ is a positive coefficient that trades off the two losses during backpropagation, and the spoofing detection term is accumulated over the labeled source-domain samples only. Optimization can theoretically be achieved by finding the saddle point $(\hat{\Theta}_f, \hat{\Theta}_y, \hat{\Theta}_d)$ with

$$(\hat{\Theta}_f, \hat{\Theta}_y) = \arg\min_{\Theta_f, \Theta_y} E(\Theta_f, \Theta_y, \hat{\Theta}_d)$$
$$\hat{\Theta}_d = \arg\max_{\Theta_d} E(\hat{\Theta}_f, \hat{\Theta}_y, \Theta_d)$$
Stochastic gradient descent (SGD) is used with the help of the gradient reversal layer. For a source-domain sample, the gradient updates are

$$\Theta_f \leftarrow \Theta_f - \alpha \left( \frac{\partial \mathcal{L}_y^i}{\partial \Theta_f} - \lambda \frac{\partial \mathcal{L}_d^i}{\partial \Theta_f} \right)$$
$$\Theta_y \leftarrow \Theta_y - \alpha \frac{\partial \mathcal{L}_y^i}{\partial \Theta_y}, \qquad \Theta_d \leftarrow \Theta_d - \alpha \frac{\partial \mathcal{L}_d^i}{\partial \Theta_d}$$

where $\alpha$ is the learning rate. For a target-domain sample, the parameter $\Theta_y$ is not updated and the parameter $\Theta_d$ is still updated, while the update rule of the parameter $\Theta_f$ changes to:

$$\Theta_f \leftarrow \Theta_f + \alpha \lambda \frac{\partial \mathcal{L}_d^i}{\partial \Theta_f}$$
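With the gradient reversal layer in place, a single ordinary optimizer step realizes all of the update rules above at once. The following sketch assumes a model like the TwoBranchModel shown earlier; the function and variable names are illustrative:

```python
import torch.nn as nn

def train_step(model, optimizer, x_src, y_src, d_src, x_tgt, d_tgt):
    """One SGD step. Source samples contribute both the spoofing loss and the
    domain loss; target samples (no spoofing labels) contribute the domain
    loss only. The GRL inside `model` negates the domain gradient reaching
    Theta_f, turning descent on the domain loss into ascent."""
    ce = nn.CrossEntropyLoss()
    optimizer.zero_grad()
    y_pred_src, d_pred_src = model(x_src)   # source batch: both heads
    _, d_pred_tgt = model(x_tgt)            # target batch: domain head only
    loss = ce(y_pred_src, y_src) + ce(d_pred_src, d_src) + ce(d_pred_tgt, d_tgt)
    loss.backward()
    optimizer.step()
    return loss.item()
```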
To verify the effect of the method, experiments were conducted on the ASVspoof 2017 V.2 dataset and the PA part of the BTAS 2016 dataset (genuine audio and replay attacks only, denoted the BTAS-PA 2016 dataset). The data table shown in Fig. 3 lists the utterance counts of the ASVspoof 2017 V.2 dataset and the BTAS-PA 2016 dataset, giving detailed statistics of the utterance quantities of the two datasets.
For the ASVspoof 2017 V.2 dataset, all genuine audio comes from a subset of the original RedDots corpus, and the replayed audio is recorded under various playback and recording configurations, including different combinations of acoustic environments, playback devices, and recording devices. The BTAS 2016 dataset is based on the public AVspoof database, in which covert recordings are made under different settings and environmental conditions; in addition, replay attacks of two "unknown" types are added to the evaluation set, making it more challenging. Furthermore, the development sets and evaluation sets of the ASVspoof 2017 V.2 dataset and the BTAS-PA 2016 dataset are kept only as test sets in all experiments. For model selection, 10% of the training set is set aside as a validation set.
The front-end features are 257-dimensional spectrograms obtained by computing a 512-point fast Fourier transform every 10 milliseconds with a window size of 25 milliseconds. The Librosa library is used to extract the front-end features from the raw data, and the Kaldi toolkit applies cepstral mean and variance normalization to each utterance with a 300-frame sliding window. In addition, the mean and standard deviation of the training data are computed and used for global standardization.
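Assuming a 16 kHz sampling rate, the stated configuration (512-point FFT, 10 ms frame shift, 25 ms window) corresponds to the librosa call sketched below; the per-utterance normalization at the end is a simplification of the sliding-window CMVN described above, and the file name is a placeholder.

```python
import numpy as np
import librosa

y, sr = librosa.load("utt.wav", sr=16000)  # 16 kHz assumed
stft = librosa.stft(y, n_fft=512,          # 512-point FFT -> 257 frequency bins
                    hop_length=160,        # 10 ms frame shift
                    win_length=400)        # 25 ms window
spec = np.log(np.abs(stft) ** 2 + 1e-10)   # (257, T) log power spectrogram
spec = (spec - spec.mean(axis=1, keepdims=True)) / (spec.std(axis=1, keepdims=True) + 1e-10)
print(spec.shape)                          # (257, T)
```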
Training is carried out at the utterance level, which means padding must be applied because utterance lengths differ. To process all utterances of a batch in parallel, each utterance in a batch is padded to the longest one by repeating its own features. The batch size is set to 8 in all experiments.
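The repeat-padding just described can be written as a collate function for a PyTorch data loader (a sketch assuming each item is a (features, label) pair with features shaped (T, F)):

```python
import torch

def repeat_pad_collate(batch):
    """Pad every utterance in the batch to the longest one by cyclically
    repeating its own frames, rather than zero-padding."""
    max_len = max(feat.shape[0] for feat, _ in batch)
    feats, labels = [], []
    for feat, label in batch:
        reps = -(-max_len // feat.shape[0])           # ceiling division
        feats.append(feat.repeat(reps, 1)[:max_len])  # repeat, then truncate
        labels.append(label)
    return torch.stack(feats), torch.tensor(labels)
```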
All neural networks are implemented in PyTorch, and Xavier initialization is used for all parameter layers. Cross-entropy loss is used as the training criterion, and an SGD optimizer with a momentum of 0.9 and a learning rate of 0.0001 is used in the training of all models. In addition, an end-to-end scoring method is adopted: the performance metric EER (Equal Error Rate) is computed directly from the scores predicted by the neural network. The EER is calculated using the toolkit provided in the ASVspoof 2019 challenge.
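For reference, the EER can be computed from predicted scores in the standard way, e.g. with the generic sketch below (not the ASVspoof 2019 toolkit itself):

```python
import numpy as np

def compute_eer(scores, labels):
    """labels: 1 = genuine, 0 = replay; scores: higher means more genuine.
    Returns the operating point where false acceptance equals false rejection."""
    scores, labels = np.asarray(scores), np.asarray(labels)
    fars, frrs = [], []
    for t in np.sort(scores):
        fars.append(np.mean(scores[labels == 0] >= t))  # spoof accepted
        frrs.append(np.mean(scores[labels == 1] < t))   # genuine rejected
    idx = int(np.argmin(np.abs(np.array(fars) - np.array(frrs))))
    return (fars[idx] + frrs[idx]) / 2.0
```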
The LCNN (Light Convolutional Neural Network) was the winning system of the ASVspoof 2017 challenge, in which a max feature map (MFM) activation module is used after each CNN module. Since batch-wise padding is used instead of globally padding all utterances to the maximum length, the number of frames (denoted T) differs from batch to batch. The LCNN implementation is therefore adapted into a new version suitable for variable-length input features.
The details of the LCNN architecture are described in the topology parameter diagram of the LCNN model shown in Fig. 4. All max-pooling layers are used in a mode that keeps the network applicable to short utterances of fewer than 32 frames. In addition, average pooling is applied in the time dimension after the MaxPool5 layer, which significantly reduces the number of parameters in the fully connected (FC) layer FC6. Dropout layers with a rate of 0.5 are used at FC7 and FC8.
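The MFM activation used throughout the LCNN splits the channel dimension into two halves and takes their element-wise maximum, halving the channel count; a minimal sketch (layer sizes are illustrative):

```python
import torch
import torch.nn as nn

class MFM(nn.Module):
    """Max Feature Map: element-wise max over the two halves of the channels,
    acting as both an activation and a feature selector."""
    def forward(self, x):                 # x: (B, 2C, T, F)
        a, b = torch.chunk(x, 2, dim=1)   # split channels into two halves
        return torch.max(a, b)            # -> (B, C, T, F)

block = nn.Sequential(nn.Conv2d(1, 32, kernel_size=5, padding=2), MFM())
print(block(torch.randn(8, 1, 300, 257)).shape)  # torch.Size([8, 16, 300, 257])
```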
A DAT (domain adversarial training) framework based on the LCNN (LCNN-DAT) can be readily obtained from the baseline LCNN model. Specifically, the layers from Conv1 to MFM6 are regarded as the feature extractor, and the FC7 and FC8 layers constitute the spoofing detector. A copy of the spoofing detector, connected after the feature extractor through a gradient reversal layer, serves as the domain classifier. However, dropout is not used in the domain classifier.
To compensate for the imbalance between the amounts of source-domain and target-domain training data, the minority-domain training data are oversampled to match the majority-domain training data. The model is then trained with alternating batches of all source-domain data and all target-domain data. In addition, to suppress the noisy signal from the domain classifier in the early training stage, the following strategy is used to change the adaptation factor λ gradually from 0 to 1 rather than fixing it from the start:

$$\lambda = \frac{2}{1 + e^{-r \cdot e}} - 1$$

where r is set to 0.1 and e denotes the number of completed training iterations.
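This schedule (as reconstructed above) is a one-liner; the values printed below assume the stated r = 0.1 with e counted in training iterations:

```python
import math

def adaptation_lambda(e, r=0.1):
    """Grows smoothly from 0 toward 1 as training progresses, suppressing the
    noisy domain-classifier signal during the early iterations."""
    return 2.0 / (1.0 + math.exp(-r * e)) - 1.0

for e in (0, 10, 30, 60):
    print(e, round(adaptation_lambda(e), 3))  # 0.0, 0.462, 0.905, 0.995
```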
Here, the training, development, and test sets of the ASVspoof 2017 V.2 dataset and the BTAS-PA 2016 dataset are denoted A-train, A-dev, A-eval, B-train, B-dev, and B-eval, respectively. The data table shown in Fig. 5 compares the EER (%) performance of the baseline LCNN model and the LCNN-DAT model on A-dev, A-eval, B-dev, and B-eval. (Here, using A-train + B-train as training data means that A-train is the source-domain data and B-train is the target-domain data; for B-train + A-train, vice versa.)
The LCNN achieves an EER of 9.06% on A-dev and 12.39% on A-eval, which shows that the LCNN implemented here is slightly improved. In addition, the LCNN models all perform well on B-dev and B-eval, but this is the result of overfitting on B-train, which explains the significant performance differences. Although the LCNN models perform well within the same domain, their generalization ability across the two datasets is very poor. However, by introducing the domain adversarial training framework, the cross-domain performance degradation can be effectively reduced without weakening the overall performance in the original source domain. Specifically, with the LCNN-DAT model trained on A-train + B-train, the performance degradation on B-dev is relatively reduced by 38% and that on B-eval by 57%; with the LCNN-DAT model trained on B-train + A-train, the corresponding reductions are 33% on A-dev and 30% on A-eval. The results show that, by introducing domain adversarial training into the LCNN framework, the LCNN-DAT model generalizes better to cross-dataset replay attack detection than the model without DAT.
The above experiments perform domain adversarial training with the entire target-domain training set. Here, it is instead randomly divided into five folds, and the first 1, 2, 3, 4, and 5 folds are then used respectively as unlabeled target-domain training data, ensuring that a smaller training set is a subset of a larger training set.
Fig. 6 shows the EER of the LCNN or LCNN-DAT models trained on different amounts of training data, giving the results of all systems. In all cases, regardless of the amount of target-domain data used, a significant cross-domain performance improvement can be obtained. Moreover, it can be seen that with more target-domain training data, the LCNN-DAT models obtain better cross-domain generalization ability without affecting their overall performance on the original source-domain dataset. In addition, the relative improvement is more significant when the BTAS-PA 2016 dataset, rather than the ASVspoof 2017 V.2 dataset, is used as the target domain. The reason may be that the size of B-train is more than twice that of A-train, which effectively helps the LCNN-DAT model learn better from more target-domain data and achieve better cross-domain performance.
An embodiment of the present invention further provides an optimization system of a spoofed-recording detection neural network model, a structural schematic diagram of which is shown in the accompanying drawings. The system can execute the optimization method of the spoofed-recording detection neural network model described in any of the above embodiments and is configured in a terminal.
The optimization system of a spoofed-recording detection neural network model provided in this embodiment includes: a network model construction program module 11, a feature extraction program module 12, a loss function optimization program module 13, and a model optimization program module 14.
Specifically, the network model construction program module 11 is configured to construct a spoofed-recording detection neural network model based on a feature extractor, a spoofing detector, and a domain predictor, wherein the feature extractor and the spoofing detector constitute a first branch, and the feature extractor and the domain predictor constitute a second branch; the feature extraction program module 12 is configured to input source-domain data and target-domain data as input samples into the feature extractor, wherein the source-domain data carry spoofing labels and domain labels, and the target-domain data carry domain labels; the loss function optimization program module 13 is configured to feed the output of the feature extractor separately into the spoofing detector and the domain predictor, and to train the spoofed-recording detection neural network model to reduce the loss function value of the spoofing detector and reduce the loss function value of the domain predictor; and the model optimization program module 14 is configured to perform adversarial training on the feature extractor based on the reduced loss function value of the domain predictor, so that the deep features output by the feature extractor to the spoofing detector are domain-invariant, spoofing-discriminative features.
Further, the model optimization program module is configured to:
perform adversarial training on the feature extractor by passing the reduced loss function value of the domain predictor through a gradient reversal layer.
Further, after reversal by the gradient reversal layer, a minimized loss function value of the spoofing detector and a maximized loss function value of the domain predictor are determined.
Further, when the data amounts of the source-domain data and the target-domain data are unbalanced, the domain with less data is oversampled so that the data amounts of the source-domain data and the target-domain data match.
An embodiment of the present invention further provides a non-volatile computer storage medium storing computer-executable instructions, which can execute the optimization method of the spoofed-recording detection neural network model in any of the above method embodiments.
As an implementation, the non-volatile computer storage medium of the present invention stores computer-executable instructions that are set to:
construct a spoofed-recording detection neural network model based on a feature extractor, a spoofing detector, and a domain predictor, wherein the feature extractor and the spoofing detector constitute a first branch, and the feature extractor and the domain predictor constitute a second branch;
input source-domain data and target-domain data as input samples into the feature extractor, wherein the source-domain data carry spoofing labels and domain labels, and the target-domain data carry domain labels;
feed the output of the feature extractor separately into the spoofing detector and the domain predictor, and train the spoofed-recording detection neural network model to reduce the loss function value of the spoofing detector and reduce the loss function value of the domain predictor; and
perform adversarial training on the feature extractor based on the reduced loss function value of the domain predictor, so that the deep features output by the feature extractor to the spoofing detector are domain-invariant, spoofing-discriminative features.
As a non-volatile computer-readable storage medium, it can be used to store non-volatile software programs, non-volatile computer-executable programs, and modules, such as the program instructions/modules corresponding to the methods in the embodiments of the present invention. One or more program instructions are stored in the non-volatile computer-readable storage medium and, when executed by a processor, perform the optimization method of the spoofed-recording detection neural network model in any of the above method embodiments.
The non-volatile computer-readable storage medium may include a program storage area and a data storage area, wherein the program storage area can store an operating system and application programs required for at least one function, and the data storage area can store data created according to the use of the device, and the like. In addition, the non-volatile computer-readable storage medium may include high-speed random access memory and may further include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. In some embodiments, the non-volatile computer-readable storage medium optionally includes memories located remotely from the processor, and these remote memories can be connected to the device through a network. Examples of the above network include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
An embodiment of the present invention further provides an electronic device, comprising: at least one processor, and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the steps of the optimization method of the spoofed-recording detection neural network model of any embodiment of the present invention.
The client of the embodiment of the present application exists in a variety of forms, including but not limited to:
(1) Mobile communication devices: these devices are characterized by having mobile communication functions and take voice and data communication as their main goal. This type of terminal includes smart phones, multimedia phones, feature phones, low-end phones, and the like.
(2) Ultra-mobile personal computer devices: these devices belong to the category of personal computers, have computing and processing functions, and generally also have mobile Internet access. This type of terminal includes PDA, MID, and UMPC devices, such as tablet computers.
(3) Portable entertainment devices: these devices can display and play multimedia content, and include audio and video players, handheld devices, e-books, intelligent toys, and portable in-vehicle navigation devices.
(4) Other electronic devices with an audio detection function.
Herein, relational terms such as "first" and "second" are used merely to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Moreover, the terms "include" and "comprise" cover not only the listed elements but also other elements not explicitly listed, as well as elements inherent to such a process, method, article, or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of other identical elements in the process, method, article, or device that includes the element.
The apparatus embodiments described above are merely exemplary, wherein the units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the embodiments without creative effort.
Through the above description of the embodiments, those skilled in the art can clearly understand that each embodiment can be implemented by means of software plus a necessary general hardware platform, and certainly can also be implemented by hardware. Based on this understanding, the above technical solutions, or the part thereof contributing to the prior art, can essentially be embodied in the form of a software product. The computer software product may be stored in a computer-readable storage medium, such as a ROM/RAM, magnetic disk, or optical disc, and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute the methods described in the embodiments or certain parts of the embodiments.
Finally, it should be noted that the above embodiments are merely intended to illustrate the technical solutions of the present invention rather than to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they can still modify the technical solutions described in the foregoing embodiments or make equivalent replacements of some of the technical features therein, and such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.