CN109767776A - Deception speech detection method based on dense neural network - Google Patents

Deception speech detection method based on dense neural network

Info

Publication number
CN109767776A
CN109767776A (application number CN201910033384.XA)
Authority
CN
China
Prior art keywords
deception
layer
dense
formula
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910033384.XA
Other languages
Chinese (zh)
Other versions
CN109767776B (en)
Inventor
王泳
苏卓艺
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Polytechnic Normal University
Original Assignee
Guangdong Polytechnic Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Polytechnic Normal University
Priority to CN201910033384.XA
Publication of CN109767776A
Application granted
Publication of CN109767776B
Legal status: Active
Anticipated expiration


Abstract

The invention discloses a deception speech detection method based on a dense neural network, and relates in particular to the field of information security. The method comprises the following detection steps. Step 1: build a VT deception speech transformation model, which uses the STFT to break the coupling between the traditional time and frequency characteristics while keeping the rhythm unchanged. Step 2: build a convolutional neural network in which the output of each layer is sent to the next layer as input and transformed by a nonlinear operation. By establishing a dense convolutional network, the invention guarantees maximum information flow between layers and strengthens feature propagation; the dense connections have a regularizing effect that reduces overfitting on tasks with small training sets; and the dense convolutional network allows narrow layers, which greatly reduces the number of parameters, mitigates the degradation problem, supports the reuse of a limited number of neurons, and avoids relearning redundant feature maps, which simplifies training.

Description

Deception speech detection method based on dense neural network
Technical field
The present invention relates to the field of information security, and more particularly to a deception speech detection method based on a dense neural network.
Background technique
In today's society, speech deception is widespread and poses a serious challenge to public security, so identifying disguised speech among genuine speech is very important. Most current research concentrates on voice conversion (VC), speech synthesis and replay attacks. However, there is another deception mode in which speaker A's voice is transformed into some different voice (not a specific target speaker) so that an identification system cannot attribute the speech to A. This kind of transformation is known as VT (Voice Transformation), and it has received much less attention.
The patent application with publication number CN 106875007 A discloses an end-to-end convolutional long short-term memory deep neural network for speech fraud detection. Because the convolutional LSTM deep neural network directly optimizes feature extraction and classification for the task at hand, the learned input representation is more robust and effective, and the detection results improve across the board. Suitable features are assessed directly by training jointly with the classifier, so the model can adapt to any related task. Eliminating the front end greatly simplifies the pipeline, especially API calls: by combining classification and optimization in a single model, no separate classifier or feature extraction method has to be invoked with its own parameters.
In actual use, however, such networks still have drawbacks: as the number of layers increases, degradation may occur, and the conventional connection pattern leaves many layers contributing very little while consuming a large amount of computation.
Summary of the invention
To overcome the above drawbacks of the prior art, embodiments of the present invention provide a deception speech detection method based on a dense neural network. By establishing a dense convolutional network, the method guarantees maximum information flow between layers and strengthens feature propagation; the dense connections have a regularizing effect that reduces overfitting on tasks with small training sets; and the dense convolutional network allows narrow layers, greatly reducing the number of parameters, mitigating the degradation problem, supporting the reuse of a limited number of neurons, and avoiding relearning redundant feature maps, which simplifies training and thereby solves the problems mentioned in the background above.
To achieve the above object, the invention provides the following technical scheme: a deception speech detection method based on a dense neural network, specifically comprising the following detection steps.
Step 1: build the VT deception speech transformation model. The STFT is used to break the coupling between the traditional time and frequency characteristics while keeping the rhythm unchanged, wherein the VT deception can be described as follows.
Assume that xt(n) is a frame of length N of the input speech signal at time t. First, the FFT coefficients of xt(n) are given by formula (1):
where w(n) denotes a Hamming or Hanning window and k denotes the frequency index.
Then the instantaneous magnitude |F(k)| and the instantaneous frequency ω(k) are calculated in formulas (2) and (3), respectively:
where Δ denotes the deviation of the k-th frequency and Fs denotes the sampling frequency.
For VT deception, the instantaneous frequency ω(k) is modified by formula (4), where α denotes the scale factor, i.e. the deception factor:
ω'(k*α) = ω(k)*α, 0 ≤ k < N/2, 0 ≤ k*α < N/2   (4)
Linear interpolation is commonly used to modify the instantaneous magnitude, as shown in formula (5), where 0 ≤ k, k' < N/2, k = ⌊k'/α⌋ and μ = k'/α - k:
|F(k')| = μ|F(k)| + (1 - μ)|F(k+1)|   (5)
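As an illustrative aside (not part of the patent text), the interpolation of formula (5) can be sketched in NumPy; the magnitude array `mags` and the choice of semitone shift are illustrative assumptions:

```python
import numpy as np

def interp_magnitude(F_mag, alpha):
    """Interpolate |F| onto the scaled bin grid k' per formula (5).

    For each target bin k': k = floor(k'/alpha), mu = k'/alpha - k, and
    |F(k')| = mu*|F(k)| + (1 - mu)*|F(k+1)|, exactly as printed in the
    patent (a conventional linear interpolation would swap mu and 1-mu).
    """
    N = len(F_mag)
    out = np.zeros(N)
    for kp in range(N):
        pos = kp / alpha
        k = int(pos)              # k = floor(k'/alpha)
        if k + 1 >= N:            # past the last usable bin pair
            break
        mu = pos - k
        out[kp] = mu * F_mag[k] + (1 - mu) * F_mag[k + 1]
    return out

mags = np.array([1.0, 2.0, 3.0, 4.0, 5.0])            # toy magnitude spectrum
scaled = interp_magnitude(mags, alpha=2 ** (4 / 12))  # s = +4 semitones
```

The loop walks the target bins k' and blends the two neighbouring source bins, which is how the spectral envelope gets compressed or stretched by α.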
Another method of changing the instantaneous magnitude is the energy-preserving modification, as shown in formula (6).
Using the modified instantaneous frequency ω'(k) and instantaneous magnitude |F'(k)| indexed by k, the instantaneous phase φ'(k) is then calculated from ω'(k), and the transformed FFT coefficients are obtained by formula (7):
F'(k) = |F'(k)| e^(jφ'(k))   (7)
Finally, an inverse FFT is applied to F'(k) to obtain the VT speech.
As can be seen from formulas (4) and (5), the VT deception operation changes the spectral magnitudes, so implicit artifacts may be introduced into the deceptive speech signal. The spectrogram of the speech can therefore be used as the input to a deep neural network, from which deep features are extracted for classification. The spectrogram of an input speech signal is obtained by the short-time Fourier transform (STFT) given in formula (8), where the window size is 175 with 50% overlap.
In phonetics, the strength of the VT deception operation is measured by the deception factor α derived from the 12 semitones, as shown in formula (9):
α(s) = 2^(s/12)   (9)
Here s can take any integer value in the range [-12, +12]. A modification that is too weak or too strong causes the deception to fail or sound unnatural; therefore, in the experiments we selected the intermediate intervals [-8, -4] and [+4, +8], which have the strongest deception capability, for testing.
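The semitone-to-factor mapping of formula (9) and the tested intervals can be checked with a short Python sketch (illustrative only, not part of the patent):

```python
def deception_factor(s):
    """Formula (9): alpha(s) = 2**(s/12) for a shift of s semitones."""
    return 2 ** (s / 12)

# Intervals [-8, -4] and [+4, +8], reported to deceive most strongly.
tested_s = list(range(-8, -3)) + list(range(4, 9))
factors = {s: deception_factor(s) for s in tested_s}
```

A full octave (s = ±12) doubles or halves the frequencies, which is exactly the "too strong" end of the range the experiments avoid.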
Step 2: build a convolutional neural network in which the output Xl-1 of the previous layer is sent to the next layer as input and transformed by the nonlinear operation Hl into the output Xl, where Xl can be expressed as follows:
Xl = Hl(Xl-1)   (10)
As the number of layers increases, degradation may occur. Residual networks, highway networks and fractal networks all create a short path Xl-n from an earlier layer to a later layer, which effectively suppresses the degradation phenomenon, as shown in formula (11):
Xl = Hl(Xl-1) + Xl-n   (11)
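The plain feed-forward rule of formula (10) and the short-path rule of formula (11) can be illustrated with a toy NumPy sketch; the weighted-ReLU layer `H` is a stand-in assumption for the real convolutional layers:

```python
import numpy as np

def H(x, w):
    """Toy nonlinear layer (weighted ReLU), standing in for conv + activation."""
    return np.maximum(0.0, w * x)

def plain_layer(x_prev, w):
    # Formula (10): Xl = Hl(Xl-1)
    return H(x_prev, w)

def residual_layer(x_prev, x_skip, w):
    # Formula (11): Xl = Hl(Xl-1) + Xl-n, a short path from an earlier layer
    return H(x_prev, w) + x_skip

x0 = np.array([1.0, -2.0, 3.0])
x1 = plain_layer(x0, w=0.5)
x2 = residual_layer(x1, x0, w=2.0)   # the skip lets x0's values flow through
```

Note how the negative component of x0 survives into x2 only through the skip connection; in the plain chain of formula (10) it is destroyed by the nonlinearity.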
Step 3: performance measurement. The detection accuracy of VT deception is tested on the experimental corpora, where the detection accuracy can be described as follows:
D = (Gd + Sd)/(G + S)
where G and S are the numbers of genuine and deceptive segments in the test set, respectively, and Gd and Sd are the numbers of genuine segments correctly detected from G and deceptive segments correctly detected from S, respectively.
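The detection-accuracy definition translates directly into code; the segment counts below are illustrative, not the patent's experimental numbers:

```python
def detection_accuracy(Gd, Sd, G, S):
    """D = (Gd + Sd) / (G + S): fraction of genuine and deceptive
    segments in the test set that are correctly detected."""
    return (Gd + Sd) / (G + S)

# Illustrative counts only.
D = detection_accuracy(Gd=950, Sd=900, G=1000, S=1000)
```

Because the two error types are pooled into a single ratio, D weights genuine and deceptive segments by their share of the test set.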
In a preferred embodiment, step 2 further includes a dense convolutional network with an improved structure. In the dense convolutional network, any layer is connected directly to all succeeding layers, specifically expressed as
Xl = Hl([X0, X1, ..., Xl-1])
where X0, X1, ..., Xl-1 denote the outputs of all layers preceding layer l and [...] denotes concatenation. In addition, the output of each layer has k feature maps, where k is usually set to a small value.
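The dense connectivity rule Xl = Hl([X0, ..., Xl-1]) and the growth rate k can be illustrated with a toy NumPy sketch (the mean-based layer is a stand-in assumption for the real bottleneck layers):

```python
import numpy as np

def dense_layer(prev_outputs, k):
    """Hl([X0, ..., Xl-1]): concatenate all earlier outputs and emit
    k new feature maps (a toy mean-based map stands in for convolution)."""
    cat = np.concatenate(prev_outputs)   # the [...] concatenation
    return np.full(k, cat.mean())

k = 4                                    # growth rate, "a small value"
outputs = [np.ones(k)]                   # X0 from the initialization layer
input_widths = []
for _ in range(3):                       # three dense layers
    input_widths.append(sum(len(o) for o in outputs))
    outputs.append(dense_layer(outputs, k))
```

The input width grows by only k maps per layer, which is why dense connectivity keeps the layers narrow while still exposing every earlier feature map to every later layer.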
In a preferred embodiment, the input to the dense convolutional network is a set of single-channel spectrograms obtained by the STFT, each of size 90 × 88. The network consists of an initialization layer, three dense blocks, two transition layers, a global pooling layer and a linear layer. The three dense blocks consist of 6, 12 and 48 bottleneck layers, respectively. The linear layer is a fully connected layer followed by a softmax with two outputs, representing the probabilities of "genuine" and "deception". Each bottleneck layer contains 2 convolutional layers, so the entire dense convolutional network contains 2 × (6 + 12 + 48) + 1 + 1 + 1 = 135 convolutional layers.
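The stated layer count can be verified arithmetically; attributing the three "+1" terms to the initialization layer and the two transition layers is an assumption, since the patent does not label them:

```python
# Each bottleneck layer contributes 2 convolutional layers; the dense
# blocks hold 6, 12 and 48 bottlenecks. The three "+1" terms presumably
# cover the initialization layer and the two transition layers.
blocks = [6, 12, 48]
conv_layers = 2 * sum(blocks) + 1 + 1 + 1
```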
In a preferred embodiment, the bottleneck layer consists of a 1 × 1 convolution followed by a 3 × 3 convolution, replacing two 3 × 3 convolutions, and a transition layer connects two adjacent dense blocks to further reduce the size of the feature maps.
In a preferred embodiment, the experimental corpora in step 3 include Timit, NIST and UME, all in WAV format with an 8 kHz sampling rate, 16-bit quantization and a single channel.
In a preferred embodiment, Timit, NIST and UME each comprise a training set and a test set; the training sets are Timit-1, NIST-1 and UME-1, and the test sets are Timit-2, NIST-2 and UME-2, respectively.
Technical effects and advantages of the invention:
By establishing a dense convolutional network, the present invention guarantees maximum information flow between layers and strengthens feature propagation; the dense connections have a regularizing effect that reduces overfitting on tasks with small training sets; and the dense convolutional network allows narrow layers, greatly reducing the number of parameters, mitigating the degradation problem, supporting the reuse of a limited number of neurons, and avoiding relearning redundant feature maps, which simplifies training. As a result, the present invention does not need to manually select one or more specific features and then classify them with a separate classifier, as traditional machine learning methods do; instead, the proposed dense neural network spontaneously extracts the relevant features, from shallow edge features to deep features, and then classifies them, simplifying the whole pipeline while achieving a better result.
Detailed description of the invention
Fig. 1 is the speech detection flowchart of the invention;
Fig. 2 is the dense neural network structure diagram of the invention;
Fig. 3 is the internal structure diagram of the dense neural network of the invention.
Specific embodiment
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only a part of the embodiments of the present invention, not all of them. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative effort shall fall within the protection scope of the present invention.
Embodiment 1
As shown in Figs. 1-3, the present invention provides a deception speech detection method based on a dense neural network, specifically comprising the following detection steps.
Step 1: build the VT deception speech transformation model. The STFT is used to break the coupling between the traditional time and frequency characteristics while keeping the rhythm unchanged, wherein the VT deception can be described as follows.
Assume that xt(n) is a frame of length N of the input speech signal at time t. First, the FFT coefficients of xt(n) are given by formula (1):
where w(n) denotes a Hamming or Hanning window and k denotes the frequency index.
Then the instantaneous magnitude |F(k)| and the instantaneous frequency ω(k) are calculated in formulas (2) and (3), respectively:
where Δ denotes the deviation of the k-th frequency and Fs denotes the sampling frequency.
For VT deception, the instantaneous frequency ω(k) is modified by formula (4), where α denotes the scale factor, i.e. the deception factor:
ω'(k*α) = ω(k)*α, 0 ≤ k < N/2, 0 ≤ k*α < N/2   (4)
Linear interpolation is commonly used to modify the instantaneous magnitude, as shown in formula (5), where 0 ≤ k, k' < N/2, k = ⌊k'/α⌋ and μ = k'/α - k:
|F(k')| = μ|F(k)| + (1 - μ)|F(k+1)|   (5)
Another method of changing the instantaneous magnitude is the energy-preserving modification, as shown in formula (6).
Using the modified instantaneous frequency ω'(k) and instantaneous magnitude |F'(k)| indexed by k, the instantaneous phase φ'(k) is then calculated from ω'(k), and the transformed FFT coefficients are obtained by formula (7):
F'(k) = |F'(k)| e^(jφ'(k))   (7)
Finally, an inverse FFT is applied to F'(k) to obtain the VT speech.
As can be seen from formulas (4) and (5), the VT deception operation changes the spectral magnitudes, so implicit artifacts may be introduced into the deceptive speech signal. The spectrogram of the speech can therefore be used as the input to a deep neural network, from which deep features are extracted for classification. The spectrogram of an input speech signal is obtained by the short-time Fourier transform (STFT) given in formula (8), where the window size is 175 with 50% overlap.
In phonetics, the strength of the VT deception operation is measured by the deception factor α derived from the 12 semitones, as shown in formula (9):
α(s) = 2^(s/12)   (9)
Here s can take any integer value in the range [-12, +12]. A modification that is too weak or too strong causes the deception to fail or sound unnatural; therefore, in the experiments we selected the intermediate intervals [-8, -4] and [+4, +8], which have the strongest deception capability, for testing.
Step 2: build a convolutional neural network (CNN) in which the output Xl-1 of the previous layer is sent to the next layer as input and transformed by the nonlinear operation Hl into the output Xl, where Xl can be expressed as follows:
Xl = Hl(Xl-1)   (10)
As the number of layers increases, degradation may occur. Residual networks (ResNets), highway networks (Highway Networks) and fractal networks (FractalNets) all create a short path Xl-n from an earlier layer to a later layer, which effectively suppresses the degradation phenomenon, as shown in formula (11):
Xl = Hl(Xl-1) + Xl-n   (11)
Step 3: performance measurement. The detection accuracy of VT deception is tested on the experimental corpora, where the detection accuracy can be described as follows:
D = (Gd + Sd)/(G + S)
where G and S are the numbers of genuine and deceptive segments in the test set, respectively, and Gd and Sd are the numbers of genuine segments correctly detected from G and deceptive segments correctly detected from S, respectively.
Further, the experimental corpora in step 3 include Timit (6300 segments, 630 speakers), NIST (3560 segments, 356 speakers) and UME (4040 segments, 202 speakers), all in WAV format with an 8 kHz sampling rate, 16-bit quantization and a single channel.
Further, Timit (6300 segments, 630 speakers), NIST (3560 segments, 356 speakers) and UME (4040 segments, 202 speakers) each comprise a training set and a test set: the training sets are Timit-1 (3000 segments), NIST-1 (2000 segments) and UME-1 (2040 segments), and the test sets are Timit-2 (3300 segments), NIST-2 (1560 segments) and UME-2 (2000 segments), respectively.
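The corpus splits above can be checked for consistency against the stated totals with a short sketch over the numbers given in the description:

```python
# (training segments, test segments, total segments) per corpus,
# as listed in the description.
corpora = {
    "Timit": (3000, 3300, 6300),
    "NIST": (2000, 1560, 3560),
    "UME": (2040, 2000, 4040),
}
splits_consistent = all(tr + te == tot for tr, te, tot in corpora.values())
```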
Embodiment 2
Unlike Embodiment 1, step 2 further includes a dense convolutional network (DenseNet) with an improved structure. In the dense convolutional network (DenseNet), any layer is connected directly to all succeeding layers, specifically expressed as
Xl = Hl([X0, X1, ..., Xl-1])
where X0, X1, ..., Xl-1 denote the outputs of all layers preceding layer l and [...] denotes concatenation. In addition, the output of each layer has k feature maps, where k is usually set to a small value.
Further, the input to the dense convolutional network (DenseNet) is a set of single-channel spectrograms obtained by the STFT, each of size 90 × 88. The network consists of an initialization layer, three dense blocks, two transition layers, a global pooling layer and a linear layer. The three dense blocks consist of 6, 12 and 48 bottleneck layers, respectively. The linear layer is a fully connected layer followed by a softmax with two outputs, representing the probabilities of "genuine" and "deception". Each bottleneck layer contains 2 convolutional layers, so the entire dense convolutional network (DenseNet) contains 2 × (6 + 12 + 48) + 1 + 1 + 1 = 135 convolutional layers, which facilitates the automatic extraction of deep features through the 135-layer dense convolutional network and thereby improves computational efficiency.
Further, the bottleneck layer consists of a 1 × 1 convolution followed by a 3 × 3 convolution, replacing two 3 × 3 convolutions to reduce computation, and a transition layer connects two adjacent dense blocks to further reduce the size of the feature maps.
Based on Embodiment 2, an intra-corpus evaluation and a cross-corpus evaluation are carried out on the test and training sets, respectively:
(1) Intra-corpus evaluation
When the test set and the training set come from the same corpus, the detection results of this method and of the other methods are shown in the table below.
From the data in the table, the average detection accuracy of the proposed method is 2.58% higher than that of the traditional CNN model and 3.66% higher than that of the SVM model, because the decision in the dense convolutional network uses both the deep features and the early edge features, which further improves the accuracy.
(2) Cross-corpus evaluation
In real scenarios, the test speech and the training speech may come from different sources. One of the three corpora is chosen as the test data set and the other two serve as the training set; the experimental results are shown in the table below.
From the data in the table, the results of the first two schemes are both good, but scheme 3 is unsatisfactory. One possible reason is that the amount of NIST data is larger than that of the other two groups shown in Table 1, which suggests that the model trained on NIST has better generalization ability. Moreover, in scheme 1 the accuracy of the GNN method is 94.37% while our accuracy is 96.45%, showing that the proposed method is better than the GNN method.
Finally, it should be noted that the above are only preferred embodiments of the present invention and are not intended to limit it; any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall be included within its protection scope.

Claims (6)

1. A deception speech detection method based on a dense neural network, characterized in that it specifically comprises the following detection steps:
Step 1: build the VT deception speech transformation model, wherein the STFT is used to break the coupling between the traditional time and frequency characteristics while keeping the rhythm unchanged, and the VT deception can be described as follows:
assume that xt(n) is a frame of length N of the input speech signal at time t; first, the FFT coefficients of xt(n) are given by formula (1):
where w(n) denotes a Hamming or Hanning window and k denotes the frequency index;
then the instantaneous magnitude |F(k)| and the instantaneous frequency ω(k) are calculated in formulas (2) and (3), respectively:
where Δ denotes the deviation of the k-th frequency and Fs denotes the sampling frequency;
for VT deception, the instantaneous frequency ω(k) is modified by formula (4), where α denotes the scale factor, i.e. the deception factor:
ω'(k*α) = ω(k)*α, 0 ≤ k < N/2, 0 ≤ k*α < N/2   (4)
linear interpolation is commonly used to modify the instantaneous magnitude, as shown in formula (5), where 0 ≤ k, k' < N/2, k = ⌊k'/α⌋ and μ = k'/α - k:
|F(k')| = μ|F(k)| + (1 - μ)|F(k+1)|   (5)
another method of changing the instantaneous magnitude is the energy-preserving modification, as shown in formula (6);
using the modified instantaneous frequency ω'(k) and instantaneous magnitude |F'(k)| indexed by k, the instantaneous phase φ'(k) is then calculated from ω'(k), and the transformed FFT coefficients are obtained by formula (7):
F'(k) = |F'(k)| e^(jφ'(k))   (7)
finally, an inverse FFT is applied to F'(k) to obtain the VT speech;
as can be seen from formulas (4) and (5), the VT deception operation changes the spectral magnitudes, so implicit artifacts may be introduced into the deceptive speech signal; the spectrogram of the speech can therefore be used as the input to a deep neural network, from which deep features are extracted for classification, and the spectrogram of an input speech signal is obtained by the short-time Fourier transform (STFT) given in formula (8),
where the window size is 175 with 50% overlap; in phonetics, the strength of the VT deception operation is measured by the deception factor α derived from the 12 semitones, as shown in formula (9):
α(s) = 2^(s/12)   (9)
here s can take any integer value in the range [-12, +12]; a modification that is too weak or too strong causes the deception to fail or sound unnatural; therefore, in the experiments we selected the intermediate intervals [-8, -4] and [+4, +8], which have the strongest deception capability, for testing;
Step 2: build a convolutional neural network in which the output Xl-1 of the previous layer is sent to the next layer as input and transformed by the nonlinear operation Hl into the output Xl, where Xl can be expressed as follows:
Xl = Hl(Xl-1)   (10)
as the number of layers increases, degradation may occur; residual networks, highway networks and fractal networks all create a short path Xl-n from an earlier layer to a later layer, which effectively suppresses the degradation phenomenon, as shown in formula (11):
Xl = Hl(Xl-1) + Xl-n   (11);
Step 3: performance measurement, wherein the detection accuracy of VT deception is tested on the experimental corpora and can be described as follows:
D = (Gd + Sd)/(G + S)
where G and S are the numbers of genuine and deceptive segments in the test set, respectively, and Gd and Sd are the numbers of genuine segments correctly detected from G and deceptive segments correctly detected from S, respectively.
2. The deception speech detection method based on a dense neural network according to claim 1, characterized in that step 2 further includes a dense convolutional network with an improved structure; in the dense convolutional network, any layer is connected directly to all succeeding layers, specifically expressed as
Xl = Hl([X0, X1, ..., Xl-1])
where X0, X1, ..., Xl-1 denote the outputs of all layers preceding layer l and [...] denotes concatenation; in addition, the output of each layer has k feature maps, where k is usually set to a small value.
3. The deception speech detection method based on a dense neural network according to claim 2, characterized in that the input to the dense convolutional network is a set of single-channel spectrograms obtained by the STFT, each of size 90 × 88; the network consists of an initialization layer, three dense blocks, two transition layers, a global pooling layer and a linear layer; the three dense blocks consist of 6, 12 and 48 bottleneck layers, respectively; the linear layer is a fully connected layer followed by a softmax with two outputs representing the probabilities of "genuine" and "deception"; each bottleneck layer contains 2 convolutional layers, so the entire dense convolutional network contains 2 × (6 + 12 + 48) + 1 + 1 + 1 = 135 convolutional layers.
4. The deception speech detection method based on a dense neural network according to claim 3, characterized in that the bottleneck layer consists of a 1 × 1 convolution followed by a 3 × 3 convolution, replacing two 3 × 3 convolutions, and a transition layer connects two adjacent dense blocks to further reduce the size of the feature maps.
5. The deception speech detection method based on a dense neural network according to claim 1, characterized in that the experimental corpora in step 3 include Timit, NIST and UME, all in WAV format with an 8 kHz sampling rate, 16-bit quantization and a single channel.
6. The deception speech detection method based on a dense neural network according to claim 5, characterized in that Timit, NIST and UME each comprise a training set and a test set, wherein the training sets are Timit-1, NIST-1 and UME-1, and the test sets are Timit-2, NIST-2 and UME-2, respectively.
CN201910033384.XA 2019-01-14 2019-01-14 Deception voice detection method based on dense neural network Active CN109767776B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910033384.XA CN109767776B (en) 2019-01-14 2019-01-14 Deception voice detection method based on dense neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910033384.XA CN109767776B (en) 2019-01-14 2019-01-14 Deception voice detection method based on dense neural network

Publications (2)

Publication Number Publication Date
CN109767776A (en) 2019-05-17
CN109767776B CN109767776B (en) 2023-12-15

Family

ID=66452939

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910033384.XA Active CN109767776B (en) 2019-01-14 2019-01-14 Deception voice detection method based on dense neural network

Country Status (1)

Country Link
CN (1) CN109767776B (en)


Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102231277A (en) * 2011-06-29 2011-11-02 电子科技大学 Method for protecting mobile terminal privacy based on voiceprint recognition
US20130138428A1 (en) * 2010-01-07 2013-05-30 The Trustees Of The Stevens Institute Of Technology Systems and methods for automatically detecting deception in human communications expressed in digital form
CN105845127A (en) * 2015-01-13 2016-08-10 阿里巴巴集团控股有限公司 Voice recognition method and system
CN105869630A (en) * 2016-06-27 2016-08-17 上海交通大学 Method and system for detecting voice spoofing attack of speakers on basis of deep learning
CN106875007A (en) * 2017-01-25 2017-06-20 上海交通大学 End-to-end convolutional long short-term memory deep neural network for voice fraud detection
CN107293302A (en) * 2017-06-27 2017-10-24 苏州大学 Sparse spectral feature extraction method for a voice lie detection system
CN108597540A (en) * 2018-04-11 2018-09-28 南京信息工程大学 Speech emotion recognition method based on variational mode decomposition and extreme learning machine
CN108806698A (en) * 2018-03-15 2018-11-13 中山大学 Disguised voice recognition method based on convolutional neural networks


Cited By (11)

Publication number Priority date Publication date Assignee Title
CN110232928A * 2019-06-13 2019-09-13 苏州思必驰信息科技有限公司 Text-independent speaker verification method and device
CN110232928B * 2019-06-13 2021-05-25 思必驰科技股份有限公司 Text-independent speaker verification method and device
CN110211604A * 2019-06-17 2019-09-06 广东技术师范大学 Deep residual network structure for voice deformation detection
CN110390952A * 2019-06-21 2019-10-29 江南大学 Urban sound event classification method based on parallel dual-feature 2-DenseNet
CN110390952B * 2019-06-21 2021-10-22 江南大学 Urban sound event classification method based on parallel dual-feature 2-DenseNet
CN111243621A * 2020-01-14 2020-06-05 四川大学 Construction method of a GRU-SVM deep learning model for synthetic speech detection
CN111933154A * 2020-07-16 2020-11-13 平安科技(深圳)有限公司 Method and device for identifying counterfeit voice, and computer-readable storage medium
WO2021135454A1 * 2020-07-16 2021-07-08 平安科技(深圳)有限公司 Method, device, and computer-readable storage medium for recognizing fake speech
CN111933154B * 2020-07-16 2024-02-13 平安科技(深圳)有限公司 Method, device and computer-readable storage medium for recognizing fake voice
CN113506583A * 2021-06-28 2021-10-15 杭州电子科技大学 Disguised voice detection method using a residual network
CN113506583B * 2021-06-28 2024-01-05 杭州电子科技大学 Disguised voice detection method using a residual network

Also Published As

Publication number Publication date
CN109767776B (en) 2023-12-15

Similar Documents

Publication Publication Date Title
CN109767776A (en) A kind of deception speech detection method based on intensive neural network
CN105139857B (en) Countermeasure method against voice spoofing in automatic speaker identification
CN103617799B (en) English sentence pronunciation quality detection method suitable for mobile devices
CN108564942A (en) Sensitivity-adjustable speech emotion recognition method and system
CN102820033A (en) Voiceprint identification method
CN108711436A (en) Replay attack detection method for speaker verification systems based on high-frequency and bottleneck features
Auckenthaler et al. Improving a GMM speaker verification system by phonetic weighting
CN103578481B (en) Cross-language speech emotion recognition method
CN110120230B (en) Acoustic event detection method and device
JPH1083194A (en) Two-stage group selection method for speaker verification systems
CN109545191B (en) Real-time detection method for the starting position of the singing voice in a song
CN110211604A (en) Deep residual network structure for voice deformation detection
CN106409298A (en) Identification method for sound re-recording attacks
CN106531174A (en) Animal sound recognition method based on wavelet packet decomposition and spectrogram features
CN109346084A (en) Speaker recognition method based on a deep stacked autoencoder network
CN104575519A (en) Feature extraction method and device, and stress detection method and device
CN111611566B (en) Speaker verification system and replay attack detection method thereof
CN109920447B (en) Recording fraud detection method based on adaptive-filter amplitude-phase feature extraction
Fathullah et al. Improved large-margin softmax loss for speaker diarisation
Xiao Adaptive margin circle loss for speaker verification
CN105070300A (en) Speech emotion feature selection method based on speaker normalization
CN112767951A (en) Voice conversion visual detection method based on a deep dense network
CN112349267A (en) Synthesized speech detection method based on attention-mechanism features
CN110415707A (en) Speaker recognition method based on speech feature fusion and GMM
CN108665901A (en) Phoneme/syllable extraction method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 510665 293 Zhongshan Avenue, Tianhe District, Guangzhou, Guangdong.

Applicant after: GUANGDONG POLYTECHNIC NORMAL University

Address before: 510665 293 Zhongshan Avenue, Tianhe District, Guangzhou, Guangdong.

Applicant before: GUANGDONG POLYTECHNIC NORMAL University

GR01 Patent grant