CN110211604A - A kind of depth residual error network structure for voice deformation detection - Google Patents
A kind of depth residual error network structure for voice deformation detection
- Publication number: CN110211604A (application CN201910521871.0A)
- Authority: CN (China)
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06N3/045 — Computing arrangements based on biological models; neural networks; architecture: combinations of networks
- G06N3/048 — Neural networks; architecture: activation functions
- G06N3/08 — Neural networks: learning methods
- G10L15/16 — Speech classification or search using artificial neural networks
- G10L25/30 — Speech or voice analysis techniques characterised by the use of neural networks
Abstract
The present invention relates to a deep residual network structure for voice deformation detection. The network has 50 layers in total, with shortcut connections added to its convolutional neural network: each shortcut adds the feature map of the previous layer to the input of the next layer, and these feature mappings introduce no additional parameters. The network uses four max-pooling layers, and after the first convolutional layer it employs four types of block structure, one for each feature-map size. During convolution, one downsampling step is performed after each block, applied along the time dimension of the time-frequency diagram; and when the convolutional neural network extracts spectral features of deformed voice, the convolution kernels convolve only along the spectral-feature dimension. Finally, the network appends a global average-pooling layer and a fully connected layer, after which a sigmoid nonlinearity is used for result evaluation. The invention establishes a model for detecting deformed voice that can reliably classify speech as original or disguised.
Description
Technical field
The invention belongs to the field of speech recognition, and specifically relates to a deep residual network structure for voice deformation detection.
Background art
With the development of computers and other information technologies, many audio processing tools, such as Audacity, Cool Edit, PRAAT, and the matlab-based Real-Time Iterative Spectrogram Inversion (RTISI) algorithm, can perform voice deformation. They are widely used in audio forensics, entertainment, privacy protection, and other fields, but abuse of these software products has increased voice-based crime such as online fraud, voice-payment scams, and phone scams. On a personal computer or smartphone, anyone can easily disguise a recording of their own voice as someone else's, so that the speaker's voice is transformed in real time before transmission and the receiving party cannot recognize the speaker's identity. Criminals can easily use these tools to disguise their own or other people's voices, deceiving automatic speaker verification (ASV) systems or hiding a speaker's true identity, which poses serious and far-reaching security problems for society. Detecting whether speech has been deformed is therefore vitally important.
Currently, research on voice disguise attacks against ASV systems continues to emerge. Deceptive speech includes voice conversion, speech synthesis, replayed recordings, and the like, and many researchers have confirmed that such speech easily deceives ASV systems. Although some automatic speaker verification research groups have proposed corresponding countermeasures, the vulnerabilities that allow genuine and fake speech to be confused remain largely unexplored. Speaker verification systems still have weaknesses against different spoofing attacks, and the attack resistance of ASV systems needs further improvement.
At present, spoofing detection is generally carried out in the following representative ways:
1) Extracting phase features from the linear prediction residual of the speech signal for spoofing detection. (Hanilci C. Speaker verification anti-spoofing using linear prediction residual phase features [C], 2017 EUSIPCO)
2) Kamble et al. proposed instantaneous frequency cosine coefficients based on an energy separation algorithm for detecting genuine and fake speech. (Kamble M.R., Patil H.A. Novel energy separation based instantaneous frequency features for spoof speech detection [C], 2017 EUSIPCO)
3) Janicki proposed an algorithm that extracts audio-quality features from the linear prediction residual signal. (Janicki A. Spoofing countermeasure based on analysis of linear prediction error [C], in Proc. INTERSPEECH, 2015, pp. 2077–2081)
4) Alam et al. proposed a spoofing detection algorithm based on infinite impulse response constant-Q transform feature representations. (Alam J., Kenny P. Spoofing detection employing infinite impulse response constant Q transform-based feature representations [C], 2017 EUSIPCO)
Algorithms 1–4 above all rely on traditional hand-crafted features, such as phase features and MFCCs, which are then used to train models such as GMMs for detection. Traditional feature extraction is relatively complex and lacks generality.
5) Huixin Liang et al. proposed automatically extracting features with a convolutional neural network, building the network with deep learning methods; their experimental data were voice datasets with small disguise factors that sound like natural speech, and their average cross-database detection accuracy was 94.37%. (Huixin Liang, Xiaodan Lin, Qiong Zhang. Recognition of spoofed voice using convolutional neural networks [C]. IEEE Global Conference on Signal & Information Processing, 2017.) In those experiments, however, the network had few layers, and testing showed it was insufficient for extracting effective features.
Summary of the invention
To address the deficiencies of the prior art, the present invention provides a deep residual network structure for voice deformation detection, establishing a model for detecting deformed voice and thereby classifying speech as original or disguised by deformation.
To achieve the above object, the deep residual network structure for voice deformation detection of the present invention has 50 layers in total, with shortcut connections added to its convolutional neural network. A shortcut connection adds the feature map of the previous layer to the input of the next layer; these feature mappings introduce no additional parameters, and each layer is re-expressed as learning a residual function, so that it need not repeatedly learn features that have already been trained.

The time-frequency diagram input to the network has size 128*127. Four max-pooling layers are used, with filter size 1*2 and stride 2. After the first convolutional layer there are four types of block structure, one for each feature-map size; each block consists of three convolutional layers with kernels 1x1, 3x1, and 1x1, where the two 1x1 convolutional layers first reduce and then restore the data dimension.

During convolution, one downsampling step is performed after each block, applied along the time dimension of the time-frequency diagram; and when the convolutional neural network extracts spectral features of deformed voice, the convolution kernels convolve only along the spectral-feature dimension.

Finally, the network appends a global average-pooling layer and a 1000-way fully connected layer, after which a sigmoid nonlinearity is used for result evaluation.
The principle of voice deformation is to raise or lower the pitch of a sound by stretching or compressing its spectrum. In music and speech, the smallest interval between two tones is the semitone, and an octave consists of 12 semitones. Raising or lowering the pitch by one semitone therefore changes the frequency by a ratio of 2^(1/12).
Preferably, the network preprocesses the deformed voice data before the time-frequency diagram is input. Let x₀ be the pitch of the original voice and α the camouflage factor; the pitch of the deformed voice x is then

x = 2^(α/12) · x₀

The camouflage factor takes integer values in [−11, 11]. When the value lies in [1, 11], the voice spectrum is stretched and the pitch rises; conversely, the spectrum is compressed and the pitch falls. The smaller the absolute value of the camouflage factor, the closer the resulting deceptive speech is to the original.
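The semitone relation above can be sketched directly; this is an illustrative example, not code from the patent, and the sample frequencies are arbitrary.

```python
# Illustrative sketch of the pitch-shift relation x = 2**(alpha/12) * x0
# used to model voice deformation with camouflage factor alpha.

def deformed_pitch(x0_hz: float, alpha: int) -> float:
    """Pitch (Hz) of the deformed voice for camouflage factor alpha.

    alpha is an integer in [-11, 11]; positive values stretch the
    spectrum (raise the pitch), negative values compress it.
    """
    if not -11 <= alpha <= 11:
        raise ValueError("camouflage factor must lie in [-11, 11]")
    return 2.0 ** (alpha / 12.0) * x0_hz

# One semitone up from 220 Hz multiplies the frequency by 2**(1/12):
shifted = deformed_pitch(220.0, 1)   # about 233.08 Hz
```

Because the factor is an exponent of 2, applying +α and then −α returns exactly the original pitch, which matches the invertible stretch/compression described above.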
The data are then standardized; the standardization formula is

x̂ = (x − E[x]) / √(Var[x])

where E[x] is the feature mean and Var[x] is the feature variance.
The first step is mean removal: for each feature of the given data, the mean of that feature is subtracted, centering the dataset at 0. The purpose is to reduce the computation of the overall algorithm: the matrix formed by the data vectors is moved from the original coordinate system to one with 0 as its origin. The underlying assumption is that the time-frequency diagram is a roughly stationary data distribution, so subtracting the ensemble average over the samples removes the common part and highlights individual differences. Then, on top of the zero-mean data, the amplitude of each dimension is divided by the standard deviation of that feature, normalizing all dimensions into the same range. In this way, during network training, the training speed is increased, weight convergence is accelerated, the loss function is stabilized, vanishing- and exploding-gradient problems are prevented, and algorithm performance improves.
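The two-step standardization described above can be sketched as follows; this is a minimal numpy version, and the small epsilon guard is our addition, not part of the patent's formula.

```python
# Hedged sketch of the standardization step: subtract the per-feature
# mean, then divide by the per-feature standard deviation (sqrt of the
# variance), so every dimension falls into the same range.
import numpy as np

def standardize(features: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Zero-mean, unit-variance normalization along the sample axis (axis 0).

    eps guards against division by zero for constant features
    (our assumption; the patent's formula has no such term).
    """
    mean = features.mean(axis=0)   # E[x] per feature
    std = features.std(axis=0)     # sqrt(Var[x]) per feature
    return (features - mean) / (std + eps)

x = np.array([[1.0, 10.0], [3.0, 30.0], [5.0, 50.0]])  # 3 samples, 2 features
z = standardize(x)
```

After the call, each column of `z` has mean 0 and standard deviation (approximately) 1, which is exactly the "center at 0, then scale into the same range" procedure the text describes.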
Preferably, the time-frequency diagram input to the network is generated by the short-time Fourier transform, so that the feature information of the deformed voice has a relatively dense distribution, which is more favorable for neural network feature extraction.
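A time-frequency diagram of this kind can be sketched with a plain numpy STFT. The window length (254 samples, giving 128 frequency bins) and hop size below are our assumptions, chosen only to approximate the 128*127 input size the patent names; the patent does not give exact STFT parameters.

```python
# Minimal STFT magnitude spectrogram for 1 s of 16 kHz audio.
import numpy as np

def stft_magnitude(signal: np.ndarray, n_window: int = 254, hop: int = 124) -> np.ndarray:
    """Return a (frequency, time) magnitude map; rfft of a length-254
    frame gives 254 // 2 + 1 = 128 frequency bins."""
    window = np.hanning(n_window)
    n_frames = 1 + (len(signal) - n_window) // hop
    frames = np.stack([signal[i * hop:i * hop + n_window] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1)).T   # transpose to (freq, time)

x = np.random.default_rng(0).standard_normal(16000)  # 1 s at 16 kHz
tf = stft_magnitude(x)                               # shape (128, 127)
```

With these (assumed) parameters a 1 s, 16 kHz clip yields a 128x127 map: 128 frequency rows and 127 time columns, matching the input size stated for the network.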
Compared with traditional convolutional neural networks, this network structure automatically extracts better feature information, helped by the shortcut connections and the convolution-kernel settings. A shortcut connection adds the previous layer's feature map to the next layer's input; this feature mapping introduces neither additional parameters nor additional computational complexity, which allows the network to train more effectively. In this way each layer is re-expressed as learning a residual function and need not repeatedly learn features that have already been trained, so the resulting residual network is easier to train and optimize and achieves better convergence in deeper networks. In addition, this structure avoids vanishing or exploding gradients and does not degrade as the network deepens. The convolution-kernel settings are likewise better suited to the analysis of speech. Traditional neural network structures, by contrast, cannot avoid vanishing or exploding gradients, and their results may suffer from degradation.
Specifically, the present invention has the following advantages:
(1) The deep residual network classification model is built with deep learning methods; it is easier to optimize and avoids the accuracy degradation caused by increasing the number of network layers;
(2) The algorithm is general, giving good detection results for different deformation methods;
(3) The deep residual network readily gains accuracy from additional network layers, producing better results than previous networks;
(4) The convolution-kernel settings in the network structure are better suited to speech feature analysis and are beneficial to network training.
Brief description of the drawings
Fig. 1 is a schematic diagram of the network structure of the invention;
Fig. 2 is a schematic diagram of the block structure;
Fig. 3 shows the detection results for the ±4 camouflage factors under different deformation methods.
Specific embodiment
The technical solutions in the embodiments of the present invention will now be described clearly and completely with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the invention.
Referring to Figures 1 and 2, a deep residual network structure for voice deformation detection according to an embodiment of the invention has 50 layers in total, with shortcut connections A added to its convolutional neural network. A shortcut connection A adds the feature map of the previous layer to the input of the next layer; these feature mappings introduce no additional parameters, and each layer is re-expressed as learning a residual function, so that it need not repeatedly learn features that have already been trained.

Referring to Figures 1 and 2, the time-frequency diagram input to the network has size 128*127. Four max-pooling layers are used, with filter size 1*2 and stride 2. After the first convolutional layer there are four types of block structure, one for each feature-map size; each block consists of three convolutional layers with kernels 1x1, 3x1, and 1x1, where the two 1x1 convolutional layers first reduce and then restore the data dimension.

During convolution, one downsampling step is performed after each block, applied along the time dimension of the time-frequency diagram; and when the convolutional neural network extracts spectral features of deformed voice, the convolution kernels convolve only along the spectral-feature dimension.

Finally, the network appends a global average-pooling layer and a 1000-way fully connected layer, after which a sigmoid nonlinearity is used for result evaluation.
The principle of voice deformation is to raise or lower the pitch of a sound by stretching or compressing its spectrum. In music and speech, the smallest interval between two tones is the semitone, and an octave consists of 12 semitones. Raising or lowering the pitch by one semitone therefore changes the frequency by a ratio of 2^(1/12).
The network preprocesses the deformed voice data before the time-frequency diagram is input. Let x₀ be the pitch of the original voice and α the camouflage factor; the pitch of the deformed voice x is then

x = 2^(α/12) · x₀

The camouflage factor takes integer values in [−11, 11]. When the value lies in [1, 11], the voice spectrum is stretched and the pitch rises; conversely, the spectrum is compressed and the pitch falls. The smaller the absolute value of the camouflage factor, the closer the resulting deceptive speech is to the original. In this embodiment, the camouflage factor range used is [−8, −4] and [4, 8], ten camouflage factors in total.
The data are then standardized; the standardization formula is

x̂ = (x − E[x]) / √(Var[x])

where E[x] is the feature mean and Var[x] is the feature variance.
The first step is mean removal: for each feature of the given data, the mean of that feature is subtracted, centering the dataset at 0. The purpose is to reduce the computation of the overall algorithm: the matrix formed by the data vectors is moved from the original coordinate system to one with 0 as its origin. The underlying assumption is that the time-frequency diagram is a roughly stationary data distribution, so subtracting the ensemble average over the samples removes the common part and highlights individual differences. Then, on top of the zero-mean data, the amplitude of each dimension is divided by the standard deviation of that feature, normalizing all dimensions into the same range. In this way, during network training, the training speed is increased, weight convergence is accelerated, the loss function is stabilized, vanishing- and exploding-gradient problems are prevented, and algorithm performance improves.
The time-frequency diagram input to the network is generated by the short-time Fourier transform, so that the feature information of the deformed voice has a relatively dense distribution, which is more favorable for neural network feature extraction. In this embodiment, the data used are obtained by cutting the original voice segments into 1 s clips and applying the STFT to each 1 s clip to generate the corresponding time-frequency diagram; the sampling frequency is 16 kHz.
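The clip-cutting step above can be sketched as follows; dropping the trailing remainder shorter than 1 s is our assumption, since the patent does not say how partial clips are handled.

```python
# Sketch of the data preparation: cut a longer recording into
# non-overlapping 1 s clips at a 16 kHz sampling rate.
import numpy as np

def cut_into_clips(signal: np.ndarray, fs: int = 16000, clip_s: float = 1.0) -> np.ndarray:
    """Return an array of shape (n_clips, clip_len); any trailing samples
    shorter than one clip are discarded (our assumption)."""
    clip_len = int(fs * clip_s)
    n_clips = len(signal) // clip_len
    return signal[:n_clips * clip_len].reshape(n_clips, clip_len)

speech = np.zeros(3 * 16000 + 500)   # e.g. a 3.03 s recording
clips = cut_into_clips(speech)       # three 1 s clips of 16000 samples
```

Each row of `clips` is then a 1 s segment ready for the STFT step described above.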
Each block structure has three layers: 1x1, 3x1, and 1x1 convolutional layers. The two 1x1 convolutional layers first reduce the data dimension and then restore it; the 3x1 convolutional layer operates at the reduced dimension. The number after each convolutional layer denotes the number of convolution kernels used in that layer. One downsampling step follows each type of block. At the end, the network appends a global average-pooling layer and a 1000-way fully connected layer, and uses a sigmoid nonlinearity for result evaluation. The specific network parameters are shown in Table 1 below.
Table 1
The horizontal axis of the time-frequency diagram represents time and the vertical axis represents frequency features. In the convolutional layers, our filters convolve the time-frequency diagram along the vertical (frequency) direction; the filter sizes in this network are 8 × 1 and 3 × 1. Convolution kernels in neural networks are generally square matrices such as 3 × 3, because in image recognition the values within a local patch are usually highly correlated and form distinctive local features that are easy to detect, so image processing generally uses square kernels. For the time-frequency characteristics of speech, however, deformation stretches or contracts the spectrum, while the variation that voice introduces along the time axis can be considered consistent. The neural network must extract voice features from the deformed spectrum along the vertical frequency range, so this network uses such kernels. In addition, compared with square kernels, these kernels reduce the number of parameters, which helps avoid overfitting of the network.
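The two ideas this passage combines can be shown in a toy numpy sketch: a 3 × 1 kernel that slides only along the frequency axis of a (frequency, time) map, and an identity shortcut y = F(x) + x that adds no parameters. This is an illustration of the principle, not the patent's actual 50-layer implementation.

```python
# Toy sketch: frequency-only convolution plus a parameter-free shortcut.
import numpy as np

def conv_freq_3x1(x: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """'Same'-padded correlation along axis 0 (frequency) only; the time
    axis (axis 1) is never mixed, matching the 3x1 kernel in the text."""
    assert kernel.shape == (3,)
    padded = np.pad(x, ((1, 1), (0, 0)))       # pad the frequency axis only
    out = np.zeros_like(x)
    for k in range(3):
        out += kernel[k] * padded[k:k + x.shape[0], :]
    return out

def residual_unit(x: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    # Identity shortcut: the input feature map is added back unchanged,
    # so the unit only has to learn the residual F(x).
    return conv_freq_3x1(x, kernel) + x

tf_map = np.random.default_rng(1).standard_normal((128, 127))
y = residual_unit(tf_map, np.array([0.25, 0.5, 0.25]))   # same shape as input
```

Note that with an all-zero kernel the unit reduces exactly to the identity, which is why residual layers that contribute nothing useful do not hurt the mapping, the property the text credits for easier training of deep networks.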
The downsampling in the embodiment is carried out along the time dimension of the time-frequency diagram; the frequency dimension is never downsampled and is only average-pooled at the end. This both reduces the feature dimensionality and avoids losing frequency-dimension features, helping the network obtain good classification results.
When training the network, this embodiment minimizes the cross-entropy error with mini-batch stochastic gradient descent, and the hyperparameters are tuned on a validation set using supervised learning. Table 2 lists some important hyperparameters used to train the network; under this configuration, the proposed deep residual learning network model achieves very good recognition accuracy. β1 and β2 are ADAM optimizer parameters.
| Learning rate | 10^-4 | Mini-batch size | 32 |
| β1 | 0.9 | Training iterations | 50000 |
| β2 | 0.999 | Regularization coefficient | 10^-4 |

Table 2
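One ADAM parameter update using the Table 2 values can be sketched as follows; the epsilon constant is the usual ADAM stabilizer, an assumption on our part since the table does not list it.

```python
# Single ADAM step with the Table 2 hyperparameters
# (learning rate 1e-4, beta1 = 0.9, beta2 = 0.999).
import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-4, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad            # first-moment (mean) estimate
    v = b2 * v + (1 - b2) * grad ** 2       # second-moment estimate
    m_hat = m / (1 - b1 ** t)               # bias correction for step t
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

w = np.zeros(4)
m = np.zeros(4)
v = np.zeros(4)
grad = np.ones(4)
w, m, v = adam_step(w, grad, m, v, t=1)   # each weight moves by about -lr
```

After bias correction, the very first step moves each weight by approximately the learning rate in the direction opposite the gradient, which is a convenient sanity check on the β1/β2 bookkeeping.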
The experimental data used in the embodiment come from NIST SRE 2003 (NIST), TIMIT, and UME-ERJ (UME). The TIMIT database consists of 6300 voice segments with an average duration of 3 seconds, from 630 speakers; the NIST database consists of 3560 voice segments with an average duration of 5 seconds, from 356 speakers; and the UME database consists of 4040 voice segments with an average duration of 5 seconds, from 202 speakers. In the experiments, each of the 3 databases is divided into two datasets for network training and testing: TIMIT-1 has 3300 voice segments and TIMIT-2 has 3000; NIST-1 has 2000 voice segments and NIST-2 has 1560; UME-1 has 2040 voice segments and UME-2 has 2000. TIMIT-1, NIST-1, and UME-1 form the training sets; TIMIT-2, NIST-2, and UME-2 form the test sets.
In the experiments, four voice camouflage methods were used: Audacity, Cool Edit, PRAAT, and RTISI. The disguised voice generated by each camouflage method was added to each of the three databases used in our tests, in order to verify the generality of the algorithm. Each method yields disguised voice under 10 values of the camouflage factor α. When the camouflage factor is too small, the disguise effect is not obvious and the source speaker can be identified by human hearing; when the factor is too large, the noise in the deformed sound easily arouses suspicion. Sounds disguised with too small or too large a factor are therefore unsuitable for training and testing the algorithms related to this network. Accordingly, in our work the range of camouflage factors considered is [4, 8] and [−8, −4].
In the experiments, all voice segments were cut into 1 s segments, from which appropriate training and test data were then chosen. To evaluate the proposed speech recognition network, we conducted 3 classification experiments and compared the results with the network structure proposed by Liang Huixin:
Experiment 1: intra-database detection (training and testing on the same database);
Experiment 2: cross-database detection;
Experiment 3: detection of the ±4 camouflage factors under different deformation methods.
Experiment 1: intra-database detection. This experiment draws the required training and test sets from the same voice database. First, the voice of each of the 4 camouflage methods was trained and tested separately; each camouflage method covers the voice under its 10 camouflage-factor transformations, and the training and test sets come from the same camouflage method. Then the 4 camouflage methods, each with its 10 camouflage factors, were combined into one training set and one test set, and the resulting figure was compared with Liang's method. For the NIST database, the training data were drawn from NIST-1: 17000 original and 17000 deformed voice samples chosen at random, for 34000 training samples in total. The test set was drawn from NIST-2: for the first test, 2000 original and 2000 deformed samples were randomly extracted, 4000 test samples in total; the second, combined test used 10000 test samples in total, 5000 original and 5000 deformed chosen at random. The experimental results are shown in Table 3 below:
Table 3
The experimental results show that when the different camouflage methods are tested separately, each achieves results of 96.88% or above, indicating that the network configuration gives good detection results for different deformation methods. When the four methods are mixed and tested together, the result is 96.4%, which is 0.47% higher than Liang's method.
Experiment 2: cross-database detection. In real life the recording methods of different voice databases may differ, so our network should give good detection results across multiple voice databases, demonstrating its generality; detection across databases is therefore also very important. In our experiment, TIMIT and UME were first trained individually: 6500 original and 6500 deformed voice samples were drawn from TIMIT-1, 13000 voice segments in total, where the deformed voice covers the 4 camouflage methods, each applied with its 10 camouflage factors; and 6000 original and 6000 deformed samples were drawn from UME-1, 12000 voice segments in total, likewise covering the 4 camouflage methods with their 10 camouflage factors each. Then 4000 test samples of original and deformed voice were drawn from NIST-2, also covering the 4 camouflage methods with their 10 camouflage factors each, and the models produced from TIMIT-1 and UME-1 were tested separately. In addition, the 24800 training samples of TIMIT_1 and UME_1 were combined, of which TIMIT_1 contributed 12800 samples and UME_1 contributed 12000, and the combined model was tested with NIST_2. The experimental results are shown in Table 4 below:
Table 4
As the table shows, Liang's method achieves 94.37%, while our method scores 2.06% higher; our algorithm model therefore has a clear advantage.
Experiment 3: detection of the ±4 disguise factor under different deformation methods, as shown in Figure 3 of the description. In this final experiment, the present network structure detects the relatively small ±4 disguise factor. In our experiments the ±4 factors are among the smaller disguise factors and the closest to natural human speech, so they are comparatively difficult to detect, and detection accuracy on them is representative. In this experiment, we applied the ±4 disguise factors to each of the voice deformation methods. The training set contains 34,000 samples and the test set 4,000 samples, with original and disguised speech each accounting for half. The test results are compared with Liang's method. As Figure 3 shows, our results are all above 96.1%, while Liang's best result is 95.85%. Hence, whichever disguise method is used, the network detects the smaller disguise factors well.
Compared with a traditional convolutional neural network, this network structure automatically extracts better feature information; its shortcut connections and the configuration of its convolution kernels both contribute to this. A shortcut connection adds the previous layer's feature map to the next layer's input; this feature mapping introduces no additional parameters and no additional computational complexity, which lets the network train more effectively. Each layer can then be expressed as learning a residual function with respect to the previous layer, without repeatedly learning features that have already been trained. The resulting residual network is easier to train and optimize, and thus converges better in deeper networks. In addition, this structure avoids vanishing and exploding gradients and does not degrade as the network deepens, and the configuration of the convolution kernels is better suited to the analysis of speech. A traditional neural network structure, by contrast, cannot avoid vanishing or exploding gradients, and its results may suffer from the degradation problem.
Specifically, the present invention has the following advantages:
(1) A deep residual network classification model is built with deep learning methods; the model is easier to optimize and avoids the drop in accuracy that otherwise accompanies an increasing number of network layers;
(2) The algorithm is general-purpose and gives good detection results for different deformation methods;
(3) Our deep residual network readily gains accuracy as network layers are added, producing better results than previous networks;
(4) The configuration of the convolution kernels in the network structure is better suited to the feature analysis of speech and is more conducive to network training.
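The shortcut connection underlying advantage (1) can be illustrated with a minimal NumPy sketch; the layer function here is a toy stand-in for the 1x1/3x1/1x1 convolution stack, and all names are illustrative rather than from the patent:

```python
import numpy as np

def residual_block(x, f):
    """A residual block: f learns only the residual, and the identity
    shortcut adds the input back with no additional parameters and no
    additional computational complexity."""
    return f(x) + x  # shortcut connection: output = F(x) + x

# Toy layer function standing in for the 1x1/3x1/1x1 convolution stack.
def toy_layer(x):
    return 0.5 * x

y = residual_block(np.ones(4), toy_layer)  # each element: 0.5 + 1.0 = 1.5
```

Because the shortcut is a parameter-free identity mapping, if the optimal transformation for a layer is close to the identity, f only has to drive its output toward zero, which is what makes very deep stacks of such blocks easier to optimize.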
The foregoing is only a preferred embodiment of the present invention and is not intended to limit the invention; any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present invention shall be included in the protection scope of the present invention.
Claims (3)
1. A deep residual network structure for voice deformation detection, characterized in that: the network structure has 50 layers in total, and shortcut connections are added to the convolutional neural network of the network structure; a shortcut connection adds the feature map of the previous layer to the input of the next layer; the feature mapping of the shortcut connection introduces no additional parameters, so that each layer is expressed as learning the residual function of the previous layer without repeatedly learning features that have already been trained;
The time-frequency map input to the network structure has size (128*127), and four max-pooling layers are used, with filter size (1*2) and stride 2; after the first convolutional layer of the network structure there are four types of block structure for feature maps of different sizes, each block structure containing 3 convolutional layers of sizes 1x1, 3x1 and 1x1, where the two 1x1 convolutional layers first reduce and then restore the data dimensionality;
In the convolution process of the network structure, downsampling is performed once after each block structure, and this downsampling is carried out along the time dimension of the time-frequency map; when the convolutional neural network extracts the spectral features of the deformed speech, the convolution kernel convolves only along the spectral feature dimension;
At the end of the network structure, a global average pooling layer and a 1000-way fully connected layer are added, after which a sigmoid nonlinear function is used for result evaluation.
2. The deep residual network structure for voice deformation detection according to claim 1, characterized in that: before the time-frequency map is input, the network structure may preprocess the deformed voice data; assuming x0 is the pitch of the original voice and α is the disguise factor, the deformed voice x is obtained as

x = 2^(α/12) · x0;

the value of the disguise factor is an arbitrary integer in [-11, 11]; the data are then standardized, the standardization formula being

x̂ = (x - E[x]) / √Var[x]

where E[x] is the feature mean and Var[x] is the feature variance.
3. The deep residual network structure for voice deformation detection according to claim 1, characterized in that: the time-frequency map input to the network structure is generated by the short-time Fourier transform, so that the feature information of the deformed voice has a relatively dense distribution, which is more conducive to feature extraction by the neural network.
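As a minimal sketch of the preprocessing in claim 2, under the assumption that the pitch value is a scalar in Hz and the standardization is the usual zero-mean, unit-variance normalization (function names are illustrative, not from the patent):

```python
import numpy as np

def disguise_pitch(x0, alpha):
    """Deformed pitch per claim 2: x = 2**(alpha/12) * x0,
    where the disguise factor alpha is an integer in [-11, 11]."""
    if not -11 <= alpha <= 11:
        raise ValueError("disguise factor must lie in [-11, 11]")
    return 2.0 ** (alpha / 12.0) * x0

def standardize(x):
    """Standardization per claim 2: (x - E[x]) / sqrt(Var[x])."""
    return (x - x.mean()) / np.sqrt(x.var())

# A +4 semitone disguise (the factor examined in Experiment 3)
# raises a 220 Hz pitch to about 277.18 Hz.
shifted = disguise_pitch(220.0, 4)

# Standardized features have zero mean and unit variance.
feats = standardize(np.array([1.0, 2.0, 3.0, 4.0]))
```

The factor 2^(α/12) is the standard semitone ratio, so an integer disguise factor in [-11, 11] shifts the pitch by up to nearly an octave in either direction.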
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910521871.0A CN110211604A (en) | 2019-06-17 | 2019-06-17 | A kind of depth residual error network structure for voice deformation detection |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110211604A true CN110211604A (en) | 2019-09-06 |
Family
ID=67792995
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910521871.0A Pending CN110211604A (en) | 2019-06-17 | 2019-06-17 | A kind of depth residual error network structure for voice deformation detection |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110211604A (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111091047A (en) * | 2019-10-28 | 2020-05-01 | 支付宝(杭州)信息技术有限公司 | Living body detection method and device, server and face recognition equipment |
WO2021127978A1 (en) * | 2019-12-24 | 2021-07-01 | 深圳市优必选科技股份有限公司 | Speech synthesis method and apparatus, computer device and storage medium |
WO2021127811A1 (en) * | 2019-12-23 | 2021-07-01 | 深圳市优必选科技股份有限公司 | Speech synthesis method and apparatus, intelligent terminal, and readable medium |
CN113506583A (en) * | 2021-06-28 | 2021-10-15 | 杭州电子科技大学 | Disguised voice detection method using residual error network |
CN114822587A (en) * | 2021-01-19 | 2022-07-29 | 四川大学 | Audio feature compression method based on constant Q transformation |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106991999A (en) * | 2017-03-29 | 2017-07-28 | 北京小米移动软件有限公司 | Audio recognition method and device |
CN108345911A (en) * | 2018-04-16 | 2018-07-31 | 东北大学 | Surface Defects in Steel Plate detection method based on convolutional neural networks multi-stage characteristics |
US20180254046A1 (en) * | 2017-03-03 | 2018-09-06 | Pindrop Security, Inc. | Method and apparatus for detecting spoofing conditions |
CN108960053A (en) * | 2018-05-28 | 2018-12-07 | 北京陌上花科技有限公司 | Normalization processing method and device, client |
CN109767776A (en) * | 2019-01-14 | 2019-05-17 | 广东技术师范学院 | A kind of deception speech detection method based on intensive neural network |
CN109872720A (en) * | 2019-01-29 | 2019-06-11 | 广东技术师范学院 | It is a kind of that speech detection algorithms being rerecorded to different scenes robust based on convolutional neural networks |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110211604A (en) | A kind of depth residual error network structure for voice deformation detection | |
CN105869630B (en) | Speaker's voice spoofing attack detection method and system based on deep learning | |
CN104732978B (en) | The relevant method for distinguishing speek person of text based on combined depth study | |
Yegnanarayana et al. | Combining evidence from source, suprasegmental and spectral features for a fixed-text speaker verification system | |
CN108231067A (en) | Sound scenery recognition methods based on convolutional neural networks and random forest classification | |
CN109524014A (en) | A kind of Application on Voiceprint Recognition analysis method based on depth convolutional neural networks | |
CN106448684A (en) | Deep-belief-network-characteristic-vector-based channel-robust voiceprint recognition system | |
CN108711436A (en) | Speaker verification's system Replay Attack detection method based on high frequency and bottleneck characteristic | |
CN105488466B (en) | A kind of deep-neural-network and Acoustic Object vocal print feature extracting method | |
CN105513598B (en) | A kind of voice playback detection method based on the distribution of frequency domain information amount | |
CN109065072A (en) | A kind of speech quality objective assessment method based on deep neural network | |
CN101923855A (en) | Test-irrelevant voice print identifying system | |
CN106531174A (en) | Animal sound recognition method based on wavelet packet decomposition and spectrogram features | |
CN109545228A (en) | A kind of end-to-end speaker's dividing method and system | |
CN108520753A (en) | Voice lie detection method based on the two-way length of convolution memory network in short-term | |
CN109633289A (en) | A kind of red information detecting method of electromagnetism based on cepstrum and convolutional neural networks | |
CN111048097B (en) | Twin network voiceprint recognition method based on 3D convolution | |
CN109767776A (en) | A kind of deception speech detection method based on intensive neural network | |
CN102496366B (en) | Speaker identification method irrelevant with text | |
CN110364168A (en) | A kind of method for recognizing sound-groove and system based on environment sensing | |
CN100570712C (en) | Based on anchor model space projection ordinal number quick method for identifying speaker relatively | |
Kamruzzaman et al. | Speaker identification using mfcc-domain support vector machine | |
Zheng et al. | MSRANet: Learning discriminative embeddings for speaker verification via channel and spatial attention mechanism in alterable scenarios | |
Zhang et al. | Waveform level adversarial example generation for joint attacks against both automatic speaker verification and spoofing countermeasures | |
Mishra et al. | Speaker identification, differentiation and verification using deep learning for human machine interface |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
Application publication date: 20190906