CN110211604A - A kind of depth residual error network structure for voice deformation detection - Google Patents
A kind of depth residual error network structure for voice deformation detection
- Publication number: CN110211604A (application CN201910521871.0A)
- Authority: CN (China)
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06N3/045 — Computing arrangements based on biological models; neural networks; architecture: combinations of networks
- G06N3/048 — Neural networks; architecture: activation functions
- G06N3/08 — Neural networks: learning methods
- G10L15/16 — Speech classification or search using artificial neural networks
- G10L25/30 — Speech or voice analysis techniques characterised by the use of neural networks
Abstract
The present invention relates to a deep residual network structure for voice deformation detection. The network has 50 layers in total, with shortcut connections added to its convolutional neural network: each shortcut adds the feature map of the previous layer to the input of the next layer, and these feature mappings introduce no additional parameters. The network uses four max-pooling layers, and after the first convolutional layer it employs four types of block structure, one for each feature-map size. During convolution, one downsampling step is performed after each block, applied along the time dimension of the time-frequency diagram; and when the convolutional neural network extracts spectral features of deformed voice, the convolution kernels convolve only along the spectral-feature dimension. Finally, the network appends a global average-pooling layer and a fully connected layer, after which a sigmoid nonlinearity is used for result evaluation. The invention establishes a model for detecting deformed voice that can reliably classify speech as original or disguised.
Description
Technical field
The invention belongs to the field of speech recognition, and specifically relates to a deep residual network structure for voice deformation detection.
Background art
With the development of computers and other information technologies, many audio processing tools, such as Audacity, Cool Edit, PRAAT, and the matlab-based Real-Time Iterative Spectrogram Inversion (RTISI) algorithm, can perform voice deformation. They are widely used in audio forensics, entertainment, privacy protection, and other fields, but abuse of these software products has increased voice-based crime such as online fraud, voice-payment scams, and phone scams. On a personal computer or smartphone, anyone can easily disguise a recording of their own voice as someone else's, so that the speaker's voice is transformed in real time before transmission and the receiving party cannot recognize the speaker's identity. Criminals can easily use these tools to disguise their own or other people's voices, deceiving automatic speaker verification (ASV) systems or hiding a speaker's true identity, which poses serious and far-reaching security problems for society. Detecting whether speech has been deformed is therefore vitally important.
Currently, research on voice disguise attacks against ASV systems continues to emerge. Deceptive speech includes voice conversion, speech synthesis, replayed recordings, and the like, and many researchers have confirmed that such speech easily deceives ASV systems. Although some automatic speaker verification research groups have proposed corresponding countermeasures, the vulnerabilities that allow genuine and fake speech to be confused remain largely unexplored. Speaker verification systems still have weaknesses against different spoofing attacks, and the attack resistance of ASV systems needs further improvement.
At present, spoofing detection is generally carried out in the following representative ways:
1) Extracting phase features from the linear prediction residual of the speech signal for spoofing detection. (Hanilci C. Speaker verification anti-spoofing using linear prediction residual phase features [C], 2017 EUSIPCO)
2) Kamble et al. proposed instantaneous frequency cosine coefficients based on an energy separation algorithm for detecting genuine and fake speech. (Kamble M.R., Patil H.A. Novel energy separation based instantaneous frequency features for spoof speech detection [C], 2017 EUSIPCO)
3) Janicki proposed an algorithm that extracts audio-quality features from the linear prediction residual signal. (Janicki A. Spoofing countermeasure based on analysis of linear prediction error [C], in Proc. INTERSPEECH, 2015, pp. 2077–2081)
4) Alam et al. proposed a spoofing detection algorithm based on infinite impulse response constant-Q transform feature representations. (Alam J., Kenny P. Spoofing detection employing infinite impulse response constant Q transform-based feature representations [C], 2017 EUSIPCO)
Algorithms 1–4 above all rely on traditional hand-crafted features, such as phase features and MFCCs, which are then used to train models such as GMMs for detection. Traditional feature extraction is relatively complex and lacks generality.
5) Huixin Liang et al. proposed automatically extracting features with a convolutional neural network, building the network with deep learning methods; their experimental data were voice datasets with small disguise factors that sound like natural speech, and their average cross-database detection accuracy was 94.37%. (Huixin Liang, Xiaodan Lin, Qiong Zhang. Recognition of spoofed voice using convolutional neural networks [C]. IEEE Global Conference on Signal & Information Processing, 2017.) In those experiments, however, the network had few layers, and testing showed it was insufficient for extracting effective features.
Summary of the invention
To address the deficiencies of the prior art, the present invention provides a deep residual network structure for voice deformation detection, establishing a model for detecting deformed voice and thereby classifying speech as original or disguised by deformation.
To achieve the above object, the deep residual network structure for voice deformation detection of the present invention has 50 layers in total, with shortcut connections added to its convolutional neural network. A shortcut connection adds the feature map of the previous layer to the input of the next layer; these feature mappings introduce no additional parameters, and each layer is re-expressed as learning a residual function, so that it need not repeatedly learn features that have already been trained.

The time-frequency diagram input to the network has size 128*127. Four max-pooling layers are used, with filter size 1*2 and stride 2. After the first convolutional layer there are four types of block structure, one for each feature-map size; each block consists of three convolutional layers with kernels 1x1, 3x1, and 1x1, where the two 1x1 convolutional layers first reduce and then restore the data dimension.

During convolution, one downsampling step is performed after each block, applied along the time dimension of the time-frequency diagram; and when the convolutional neural network extracts spectral features of deformed voice, the convolution kernels convolve only along the spectral-feature dimension.

Finally, the network appends a global average-pooling layer and a 1000-way fully connected layer, after which a sigmoid nonlinearity is used for result evaluation.
The principle of voice deformation is to raise or lower the pitch of a sound by stretching or compressing its spectrum. In music and speech, the smallest interval between two tones is the semitone, and an octave consists of 12 semitones. Raising or lowering the pitch by one semitone therefore changes the frequency by a ratio of 2^(1/12).
Preferably, the network preprocesses the deformed voice data before the time-frequency diagram is input. Let x₀ be the pitch of the original voice and α the camouflage factor; the pitch of the deformed voice x is then

x = 2^(α/12) · x₀

The camouflage factor takes integer values in [−11, 11]. When the value lies in [1, 11], the voice spectrum is stretched and the pitch rises; conversely, the spectrum is compressed and the pitch falls. The smaller the absolute value of the camouflage factor, the closer the resulting deceptive speech is to the original.
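The semitone relation above can be sketched directly; this is an illustrative example, not code from the patent, and the sample frequencies are arbitrary.

```python
# Illustrative sketch of the pitch-shift relation x = 2**(alpha/12) * x0
# used to model voice deformation with camouflage factor alpha.

def deformed_pitch(x0_hz: float, alpha: int) -> float:
    """Pitch (Hz) of the deformed voice for camouflage factor alpha.

    alpha is an integer in [-11, 11]; positive values stretch the
    spectrum (raise the pitch), negative values compress it.
    """
    if not -11 <= alpha <= 11:
        raise ValueError("camouflage factor must lie in [-11, 11]")
    return 2.0 ** (alpha / 12.0) * x0_hz

# One semitone up from 220 Hz multiplies the frequency by 2**(1/12):
shifted = deformed_pitch(220.0, 1)   # about 233.08 Hz
```

Because the factor is an exponent of 2, applying +α and then −α returns exactly the original pitch, which matches the invertible stretch/compression described above.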
The data are then standardized; the standardization formula is

x̂ = (x − E[x]) / √(Var[x])

where E[x] is the feature mean and Var[x] is the feature variance.
The first step is mean removal: for each feature of the given data, the mean of that feature is subtracted, centering the dataset at 0. The purpose is to reduce the computation of the overall algorithm: the matrix formed by the data vectors is moved from the original coordinate system to one with 0 as its origin. The underlying assumption is that the time-frequency diagram is a roughly stationary data distribution, so subtracting the ensemble average over the samples removes the common part and highlights individual differences. Then, on top of the zero-mean data, the amplitude of each dimension is divided by the standard deviation of that feature, normalizing all dimensions into the same range. In this way, during network training, the training speed is increased, weight convergence is accelerated, the loss function is stabilized, vanishing- and exploding-gradient problems are prevented, and algorithm performance improves.
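The two-step standardization described above can be sketched as follows; this is a minimal numpy version, and the small epsilon guard is our addition, not part of the patent's formula.

```python
# Hedged sketch of the standardization step: subtract the per-feature
# mean, then divide by the per-feature standard deviation (sqrt of the
# variance), so every dimension falls into the same range.
import numpy as np

def standardize(features: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Zero-mean, unit-variance normalization along the sample axis (axis 0).

    eps guards against division by zero for constant features
    (our assumption; the patent's formula has no such term).
    """
    mean = features.mean(axis=0)   # E[x] per feature
    std = features.std(axis=0)     # sqrt(Var[x]) per feature
    return (features - mean) / (std + eps)

x = np.array([[1.0, 10.0], [3.0, 30.0], [5.0, 50.0]])  # 3 samples, 2 features
z = standardize(x)
```

After the call, each column of `z` has mean 0 and standard deviation (approximately) 1, which is exactly the "center at 0, then scale into the same range" procedure the text describes.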
Preferably, the time-frequency diagram input to the network is generated by the short-time Fourier transform, so that the feature information of the deformed voice has a relatively dense distribution, which is more favorable for neural network feature extraction.
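A time-frequency diagram of this kind can be sketched with a plain numpy STFT. The window length (254 samples, giving 128 frequency bins) and hop size below are our assumptions, chosen only to approximate the 128*127 input size the patent names; the patent does not give exact STFT parameters.

```python
# Minimal STFT magnitude spectrogram for 1 s of 16 kHz audio.
import numpy as np

def stft_magnitude(signal: np.ndarray, n_window: int = 254, hop: int = 124) -> np.ndarray:
    """Return a (frequency, time) magnitude map; rfft of a length-254
    frame gives 254 // 2 + 1 = 128 frequency bins."""
    window = np.hanning(n_window)
    n_frames = 1 + (len(signal) - n_window) // hop
    frames = np.stack([signal[i * hop:i * hop + n_window] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1)).T   # transpose to (freq, time)

x = np.random.default_rng(0).standard_normal(16000)  # 1 s at 16 kHz
tf = stft_magnitude(x)                               # shape (128, 127)
```

With these (assumed) parameters a 1 s, 16 kHz clip yields a 128x127 map: 128 frequency rows and 127 time columns, matching the input size stated for the network.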
Compared with traditional convolutional neural networks, this network structure automatically extracts better feature information, helped by the shortcut connections and the convolution-kernel settings. A shortcut connection adds the previous layer's feature map to the next layer's input; this feature mapping introduces neither additional parameters nor additional computational complexity, which allows the network to train more effectively. In this way each layer is re-expressed as learning a residual function and need not repeatedly learn features that have already been trained, so the resulting residual network is easier to train and optimize and achieves better convergence in deeper networks. In addition, this structure avoids vanishing or exploding gradients and does not degrade as the network deepens. The convolution-kernel settings are likewise better suited to the analysis of speech. Traditional neural network structures, by contrast, cannot avoid vanishing or exploding gradients, and their results may suffer from degradation.
Specifically, the present invention has the following advantages:
(1) The deep residual network classification model is built with deep learning methods; it is easier to optimize and avoids the accuracy degradation caused by increasing the number of network layers;
(2) The algorithm is general, giving good detection results for different deformation methods;
(3) The deep residual network readily gains accuracy from additional network layers, producing better results than previous networks;
(4) The convolution-kernel settings in the network structure are better suited to speech feature analysis and are beneficial to network training.
Brief description of the drawings
Fig. 1 is a schematic diagram of the network structure of the invention;
Fig. 2 is a schematic diagram of the block structure;
Fig. 3 shows the detection results for the ±4 camouflage factors under different deformation methods.
Specific embodiment
The technical solutions in the embodiments of the present invention will now be described clearly and completely with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the invention.
Referring to Figures 1 and 2, a deep residual network structure for voice deformation detection according to an embodiment of the invention has 50 layers in total, with shortcut connections A added to its convolutional neural network. A shortcut connection A adds the feature map of the previous layer to the input of the next layer; these feature mappings introduce no additional parameters, and each layer is re-expressed as learning a residual function, so that it need not repeatedly learn features that have already been trained.

Referring to Figures 1 and 2, the time-frequency diagram input to the network has size 128*127. Four max-pooling layers are used, with filter size 1*2 and stride 2. After the first convolutional layer there are four types of block structure, one for each feature-map size; each block consists of three convolutional layers with kernels 1x1, 3x1, and 1x1, where the two 1x1 convolutional layers first reduce and then restore the data dimension.

During convolution, one downsampling step is performed after each block, applied along the time dimension of the time-frequency diagram; and when the convolutional neural network extracts spectral features of deformed voice, the convolution kernels convolve only along the spectral-feature dimension.

Finally, the network appends a global average-pooling layer and a 1000-way fully connected layer, after which a sigmoid nonlinearity is used for result evaluation.
The principle of voice deformation is to raise or lower the pitch of a sound by stretching or compressing its spectrum. In music and speech, the smallest interval between two tones is the semitone, and an octave consists of 12 semitones. Raising or lowering the pitch by one semitone therefore changes the frequency by a ratio of 2^(1/12).
The network preprocesses the deformed voice data before the time-frequency diagram is input. Let x₀ be the pitch of the original voice and α the camouflage factor; the pitch of the deformed voice x is then

x = 2^(α/12) · x₀

The camouflage factor takes integer values in [−11, 11]. When the value lies in [1, 11], the voice spectrum is stretched and the pitch rises; conversely, the spectrum is compressed and the pitch falls. The smaller the absolute value of the camouflage factor, the closer the resulting deceptive speech is to the original. In this embodiment, the camouflage factor range used is [−8, −4] and [4, 8], ten camouflage factors in total.
The data are then standardized; the standardization formula is

x̂ = (x − E[x]) / √(Var[x])

where E[x] is the feature mean and Var[x] is the feature variance.
The first step is mean removal: for each feature of the given data, the mean of that feature is subtracted, centering the dataset at 0. The purpose is to reduce the computation of the overall algorithm: the matrix formed by the data vectors is moved from the original coordinate system to one with 0 as its origin. The underlying assumption is that the time-frequency diagram is a roughly stationary data distribution, so subtracting the ensemble average over the samples removes the common part and highlights individual differences. Then, on top of the zero-mean data, the amplitude of each dimension is divided by the standard deviation of that feature, normalizing all dimensions into the same range. In this way, during network training, the training speed is increased, weight convergence is accelerated, the loss function is stabilized, vanishing- and exploding-gradient problems are prevented, and algorithm performance improves.
The time-frequency diagram input to the network is generated by the short-time Fourier transform, so that the feature information of the deformed voice has a relatively dense distribution, which is more favorable for neural network feature extraction. In this embodiment, the data used are obtained by cutting the original voice segments into 1 s clips and applying the STFT to each 1 s clip to generate the corresponding time-frequency diagram; the sampling frequency is 16 kHz.
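The clip-cutting step above can be sketched as follows; dropping the trailing remainder shorter than 1 s is our assumption, since the patent does not say how partial clips are handled.

```python
# Sketch of the data preparation: cut a longer recording into
# non-overlapping 1 s clips at a 16 kHz sampling rate.
import numpy as np

def cut_into_clips(signal: np.ndarray, fs: int = 16000, clip_s: float = 1.0) -> np.ndarray:
    """Return an array of shape (n_clips, clip_len); any trailing samples
    shorter than one clip are discarded (our assumption)."""
    clip_len = int(fs * clip_s)
    n_clips = len(signal) // clip_len
    return signal[:n_clips * clip_len].reshape(n_clips, clip_len)

speech = np.zeros(3 * 16000 + 500)   # e.g. a 3.03 s recording
clips = cut_into_clips(speech)       # three 1 s clips of 16000 samples
```

Each row of `clips` is then a 1 s segment ready for the STFT step described above.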
Each block structure has three layers: 1x1, 3x1, and 1x1 convolutional layers. The two 1x1 convolutional layers first reduce the data dimension and then restore it; the 3x1 convolutional layer operates at the reduced dimension. The number after each convolutional layer denotes the number of convolution kernels used in that layer. One downsampling step follows each type of block. At the end, the network appends a global average-pooling layer and a 1000-way fully connected layer, and uses a sigmoid nonlinearity for result evaluation. The specific network parameters are shown in Table 1 below.
Table 1
The horizontal axis of the time-frequency diagram represents time and the vertical axis represents frequency features. In the convolutional layers, our filters convolve the time-frequency diagram along the vertical (frequency) direction; the filter sizes in this network are 8 × 1 and 3 × 1. Convolution kernels in neural networks are generally square matrices such as 3 × 3, because in image recognition the values within a local patch are usually highly correlated and form distinctive local features that are easy to detect, so image processing generally uses square kernels. For the time-frequency characteristics of speech, however, deformation stretches or contracts the spectrum, while the variation that voice introduces along the time axis can be considered consistent. The neural network must extract voice features from the deformed spectrum along the vertical frequency range, so this network uses such kernels. In addition, compared with square kernels, these kernels reduce the number of parameters, which helps avoid overfitting of the network.
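The two ideas this passage combines can be shown in a toy numpy sketch: a 3 × 1 kernel that slides only along the frequency axis of a (frequency, time) map, and an identity shortcut y = F(x) + x that adds no parameters. This is an illustration of the principle, not the patent's actual 50-layer implementation.

```python
# Toy sketch: frequency-only convolution plus a parameter-free shortcut.
import numpy as np

def conv_freq_3x1(x: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """'Same'-padded correlation along axis 0 (frequency) only; the time
    axis (axis 1) is never mixed, matching the 3x1 kernel in the text."""
    assert kernel.shape == (3,)
    padded = np.pad(x, ((1, 1), (0, 0)))       # pad the frequency axis only
    out = np.zeros_like(x)
    for k in range(3):
        out += kernel[k] * padded[k:k + x.shape[0], :]
    return out

def residual_unit(x: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    # Identity shortcut: the input feature map is added back unchanged,
    # so the unit only has to learn the residual F(x).
    return conv_freq_3x1(x, kernel) + x

tf_map = np.random.default_rng(1).standard_normal((128, 127))
y = residual_unit(tf_map, np.array([0.25, 0.5, 0.25]))   # same shape as input
```

Note that with an all-zero kernel the unit reduces exactly to the identity, which is why residual layers that contribute nothing useful do not hurt the mapping, the property the text credits for easier training of deep networks.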
The downsampling in the embodiment is carried out along the time dimension of the time-frequency diagram; the frequency dimension is never downsampled and is only average-pooled at the end. This both reduces the feature dimensionality and avoids losing frequency-dimension features, helping the network obtain good classification results.
When training the network, this embodiment minimizes the cross-entropy error with mini-batch stochastic gradient descent, and the hyperparameters are tuned on a validation set using supervised learning. Table 2 lists some important hyperparameters used to train the network; under this configuration, the proposed deep residual learning network model achieves very good recognition accuracy. β1 and β2 are ADAM optimizer parameters.
| Learning rate | 10^-4 | Mini-batch size | 32 |
| β1 | 0.9 | Training iterations | 50000 |
| β2 | 0.999 | Regularization coefficient | 10^-4 |

Table 2
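One ADAM parameter update using the Table 2 values can be sketched as follows; the epsilon constant is the usual ADAM stabilizer, an assumption on our part since the table does not list it.

```python
# Single ADAM step with the Table 2 hyperparameters
# (learning rate 1e-4, beta1 = 0.9, beta2 = 0.999).
import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-4, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad            # first-moment (mean) estimate
    v = b2 * v + (1 - b2) * grad ** 2       # second-moment estimate
    m_hat = m / (1 - b1 ** t)               # bias correction for step t
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

w = np.zeros(4)
m = np.zeros(4)
v = np.zeros(4)
grad = np.ones(4)
w, m, v = adam_step(w, grad, m, v, t=1)   # each weight moves by about -lr
```

After bias correction, the very first step moves each weight by approximately the learning rate in the direction opposite the gradient, which is a convenient sanity check on the β1/β2 bookkeeping.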
The experimental data used in the embodiment come from NIST SRE 2003 (NIST), TIMIT, and UME-ERJ (UME). The TIMIT database consists of 6300 voice segments with an average duration of 3 seconds, from 630 speakers; the NIST database consists of 3560 voice segments with an average duration of 5 seconds, from 356 speakers; and the UME database consists of 4040 voice segments with an average duration of 5 seconds, from 202 speakers. In the experiments, each of the 3 databases is divided into two datasets for network training and testing: TIMIT-1 has 3300 voice segments and TIMIT-2 has 3000; NIST-1 has 2000 voice segments and NIST-2 has 1560; UME-1 has 2040 voice segments and UME-2 has 2000. TIMIT-1, NIST-1, and UME-1 form the training sets; TIMIT-2, NIST-2, and UME-2 form the test sets.
In the experiments, four voice camouflage methods were used: Audacity, Cool Edit, PRAAT, and RTISI. The disguised voice generated by each camouflage method was added to each of the three databases used in our tests, in order to verify the generality of the algorithm. Each method yields disguised voice under 10 values of the camouflage factor α. When the camouflage factor is too small, the disguise effect is not obvious and the source speaker can be identified by human hearing; when the factor is too large, the noise in the deformed sound easily arouses suspicion. Sounds disguised with too small or too large a factor are therefore unsuitable for training and testing the algorithms related to this network. Accordingly, in our work the range of camouflage factors considered is [4, 8] and [−8, −4].
In the experiments, all voice segments were cut into 1 s segments, from which appropriate training and test data were then chosen. To evaluate the proposed speech recognition network, we conducted 3 classification experiments and compared the results with the network structure proposed by Liang Huixin:
Experiment 1: intra-database detection (training and testing on the same database);
Experiment 2: cross-database detection;
Experiment 3: detection of the ±4 camouflage factors under different deformation methods.
Experiment 1: intra-database detection. This experiment draws the required training and test sets from the same voice database. First, the voice of each of the 4 camouflage methods was trained and tested separately; each camouflage method covers the voice under its 10 camouflage-factor transformations, and the training and test sets come from the same camouflage method. Then the 4 camouflage methods, each with its 10 camouflage factors, were combined into one training set and one test set, and the resulting figure was compared with Liang's method. For the NIST database, the training data were drawn from NIST-1: 17000 original and 17000 deformed voice samples chosen at random, for 34000 training samples in total. The test set was drawn from NIST-2: for the first test, 2000 original and 2000 deformed samples were randomly extracted, 4000 test samples in total; the second, combined test used 10000 test samples in total, 5000 original and 5000 deformed chosen at random. The experimental results are shown in Table 3 below:
Table 3
The experimental results show that when the different camouflage methods are tested separately, each achieves results of 96.88% or above, indicating that the network configuration gives good detection results for different deformation methods. When the four methods are mixed and tested together, the result is 96.4%, which is 0.47% higher than Liang's method.
Experiment 2: cross-database detection. In real life the recording methods of different voice databases may differ, so our network should give good detection results across multiple voice databases, demonstrating its generality; detection across databases is therefore also very important. In our experiment, TIMIT and UME were first trained individually: 6500 original and 6500 deformed voice samples were drawn from TIMIT-1, 13000 voice segments in total, where the deformed voice covers the 4 camouflage methods, each applied with its 10 camouflage factors; and 6000 original and 6000 deformed samples were drawn from UME-1, 12000 voice segments in total, likewise covering the 4 camouflage methods with their 10 camouflage factors each. Then 4000 test samples of original and deformed voice were drawn from NIST-2, also covering the 4 camouflage methods with their 10 camouflage factors each, and the models produced from TIMIT-1 and UME-1 were tested separately. In addition, the 24800 training samples of TIMIT_1 and UME_1 were combined, of which TIMIT_1 contributed 12800 samples and UME_1 contributed 12000, and the combined model was tested with NIST_2. The experimental results are shown in Table 4 below:
Table 4
As the table shows, Liang's method achieves 94.37%, while our method scores 2.06% higher; our algorithm model therefore has a clear advantage.
Experiment 3: detection of the ±4 disguise factor under different deformation methods, as shown in Figure 3 of the description. In this final experiment, the present network structure detects the relatively small ±4 disguise factor. In our experiments the ±4 factors are among the smaller disguise factors and the closest to natural human speech, so they are comparatively difficult to detect, and detection accuracy on them is representative. In this experiment, we applied the ±4 disguise factors to each of the voice deformation methods. The training set contains 34,000 samples and the test set 4,000 samples, with original and disguised speech each accounting for half. The test results are compared with Liang's method. As Figure 3 shows, our results are all above 96.1%, while Liang's best result is 95.85%. Hence, whichever disguise method is used, the network detects the smaller disguise factors well.
Compared with a traditional convolutional neural network, this network structure automatically extracts better feature information; its shortcut connections and the configuration of its convolution kernels both contribute to this. A shortcut connection adds the previous layer's feature map to the next layer's input; this feature mapping introduces no additional parameters and no additional computational complexity, which lets the network train more effectively. Each layer can then be expressed as learning a residual function with respect to the previous layer, without repeatedly learning features that have already been trained. The resulting residual network is easier to train and optimize, and thus converges better in deeper networks. In addition, this structure avoids vanishing and exploding gradients and does not degrade as the network deepens, and the configuration of the convolution kernels is better suited to the analysis of speech. A traditional neural network structure, by contrast, cannot avoid vanishing or exploding gradients, and its results may suffer from the degradation problem.
Specifically, the present invention has the following advantages:
(1) A deep residual network classification model is built with deep learning methods; the model is easier to optimize and avoids the drop in accuracy that otherwise accompanies an increasing number of network layers;
(2) The algorithm is general-purpose and gives good detection results for different deformation methods;
(3) Our deep residual network readily gains accuracy as network layers are added, producing better results than previous networks;
(4) The configuration of the convolution kernels in the network structure is better suited to the feature analysis of speech and is more conducive to network training.
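The shortcut connection underlying advantage (1) can be illustrated with a minimal NumPy sketch; the layer function here is a toy stand-in for the 1x1/3x1/1x1 convolution stack, and all names are illustrative rather than from the patent:

```python
import numpy as np

def residual_block(x, f):
    """A residual block: f learns only the residual, and the identity
    shortcut adds the input back with no additional parameters and no
    additional computational complexity."""
    return f(x) + x  # shortcut connection: output = F(x) + x

# Toy layer function standing in for the 1x1/3x1/1x1 convolution stack.
def toy_layer(x):
    return 0.5 * x

y = residual_block(np.ones(4), toy_layer)  # each element: 0.5 + 1.0 = 1.5
```

Because the shortcut is a parameter-free identity mapping, if the optimal transformation for a layer is close to the identity, f only has to drive its output toward zero, which is what makes very deep stacks of such blocks easier to optimize.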
The foregoing is only a preferred embodiment of the present invention and is not intended to limit the invention; any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present invention shall be included in the protection scope of the present invention.
Claims (3)
1. A deep residual network structure for voice deformation detection, characterized in that: the network structure has 50 layers in total, and shortcut connections are added to the convolutional neural network of the network structure; a shortcut connection adds the feature map of the previous layer to the input of the next layer; the feature mapping of the shortcut connection introduces no additional parameters, so that each layer is expressed as learning the residual function of the previous layer without repeatedly learning features that have already been trained;
The time-frequency map input to the network structure has size (128*127), and four max-pooling layers are used, with filter size (1*2) and stride 2; after the first convolutional layer of the network structure there are four types of block structure for feature maps of different sizes, each block structure containing 3 convolutional layers of sizes 1x1, 3x1 and 1x1, where the two 1x1 convolutional layers first reduce and then restore the data dimensionality;
In the convolution process of the network structure, downsampling is performed once after each block structure, and this downsampling is carried out along the time dimension of the time-frequency map; when the convolutional neural network extracts the spectral features of the deformed speech, the convolution kernel convolves only along the spectral feature dimension;
At the end of the network structure, a global average pooling layer and a 1000-way fully connected layer are added, after which a sigmoid nonlinear function is used for result evaluation.
2. The deep residual network structure for voice deformation detection according to claim 1, characterized in that: before the time-frequency map is input, the network structure may preprocess the deformed voice data; assuming x0 is the pitch of the original voice and α is the disguise factor, the deformed voice x is obtained as

x = 2^(α/12) · x0;

the value of the disguise factor is an arbitrary integer in [-11, 11]; the data are then standardized, the standardization formula being

x̂ = (x - E[x]) / √Var[x]

where E[x] is the feature mean and Var[x] is the feature variance.
3. The deep residual network structure for voice deformation detection according to claim 1, characterized in that: the time-frequency map input to the network structure is generated by the short-time Fourier transform, so that the feature information of the deformed voice has a relatively dense distribution, which is more conducive to feature extraction by the neural network.
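As a minimal sketch of the preprocessing in claim 2, under the assumption that the pitch value is a scalar in Hz and the standardization is the usual zero-mean, unit-variance normalization (function names are illustrative, not from the patent):

```python
import numpy as np

def disguise_pitch(x0, alpha):
    """Deformed pitch per claim 2: x = 2**(alpha/12) * x0,
    where the disguise factor alpha is an integer in [-11, 11]."""
    if not -11 <= alpha <= 11:
        raise ValueError("disguise factor must lie in [-11, 11]")
    return 2.0 ** (alpha / 12.0) * x0

def standardize(x):
    """Standardization per claim 2: (x - E[x]) / sqrt(Var[x])."""
    return (x - x.mean()) / np.sqrt(x.var())

# A +4 semitone disguise (the factor examined in Experiment 3)
# raises a 220 Hz pitch to about 277.18 Hz.
shifted = disguise_pitch(220.0, 4)

# Standardized features have zero mean and unit variance.
feats = standardize(np.array([1.0, 2.0, 3.0, 4.0]))
```

The factor 2^(α/12) is the standard semitone ratio, so an integer disguise factor in [-11, 11] shifts the pitch by up to nearly an octave in either direction.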
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910521871.0A CN110211604A (en) | 2019-06-17 | 2019-06-17 | A kind of depth residual error network structure for voice deformation detection |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110211604A true CN110211604A (en) | 2019-09-06 |
Family
ID=67792995
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910521871.0A Pending CN110211604A (en) | 2019-06-17 | 2019-06-17 | A kind of depth residual error network structure for voice deformation detection |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110211604A (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111091047A (en) * | 2019-10-28 | 2020-05-01 | 支付宝(杭州)信息技术有限公司 | Living body detection method and device, server and face recognition equipment |
WO2021127978A1 (en) * | 2019-12-24 | 2021-07-01 | 深圳市优必选科技股份有限公司 | Speech synthesis method and apparatus, computer device and storage medium |
WO2021127811A1 (en) * | 2019-12-23 | 2021-07-01 | 深圳市优必选科技股份有限公司 | Speech synthesis method and apparatus, intelligent terminal, and readable medium |
CN113506583A (en) * | 2021-06-28 | 2021-10-15 | 杭州电子科技大学 | Disguised voice detection method using residual error network |
CN114822587A (en) * | 2021-01-19 | 2022-07-29 | 四川大学 | Audio feature compression method based on constant Q transformation |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106991999A (en) * | 2017-03-29 | 2017-07-28 | 北京小米移动软件有限公司 | Audio recognition method and device |
CN108345911A (en) * | 2018-04-16 | 2018-07-31 | 东北大学 | Surface Defects in Steel Plate detection method based on convolutional neural networks multi-stage characteristics |
US20180254046A1 (en) * | 2017-03-03 | 2018-09-06 | Pindrop Security, Inc. | Method and apparatus for detecting spoofing conditions |
CN108960053A (en) * | 2018-05-28 | 2018-12-07 | 北京陌上花科技有限公司 | Normalization processing method and device, client |
CN109767776A (en) * | 2019-01-14 | 2019-05-17 | 广东技术师范学院 | A kind of deception speech detection method based on intensive neural network |
CN109872720A (en) * | 2019-01-29 | 2019-06-11 | 广东技术师范学院 | It is a kind of that speech detection algorithms being rerecorded to different scenes robust based on convolutional neural networks |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110211604A (en) | A kind of depth residual error network structure for voice deformation detection | |
CN105869630B (en) | Speaker's voice spoofing attack detection method and system based on deep learning | |
CN104732978B (en) | The relevant method for distinguishing speek person of text based on combined depth study | |
Yegnanarayana et al. | Combining evidence from source, suprasegmental and spectral features for a fixed-text speaker verification system | |
CN108231067A (en) | Sound scenery recognition methods based on convolutional neural networks and random forest classification | |
CN109524014A (en) | A kind of Application on Voiceprint Recognition analysis method based on depth convolutional neural networks | |
CN106448684A (en) | Deep-belief-network-characteristic-vector-based channel-robust voiceprint recognition system | |
CN108711436A (en) | Speaker verification's system Replay Attack detection method based on high frequency and bottleneck characteristic | |
CN105488466B (en) | A kind of deep-neural-network and Acoustic Object vocal print feature extracting method | |
CN105513598B (en) | A kind of voice playback detection method based on the distribution of frequency domain information amount | |
CN109065072A (en) | A kind of speech quality objective assessment method based on deep neural network | |
CN101923855A (en) | Test-irrelevant voice print identifying system | |
CN106531174A (en) | Animal sound recognition method based on wavelet packet decomposition and spectrogram features | |
CN109545228A (en) | A kind of end-to-end speaker's dividing method and system | |
CN108520753A (en) | Voice lie detection method based on the two-way length of convolution memory network in short-term | |
CN109633289A (en) | A kind of red information detecting method of electromagnetism based on cepstrum and convolutional neural networks | |
CN111048097B (en) | Twin network voiceprint recognition method based on 3D convolution | |
CN109767776A (en) | A kind of deception speech detection method based on intensive neural network | |
CN102496366B (en) | Speaker identification method irrelevant with text | |
CN110364168A (en) | A kind of method for recognizing sound-groove and system based on environment sensing | |
CN100570712C (en) | Based on anchor model space projection ordinal number quick method for identifying speaker relatively | |
Kamruzzaman et al. | Speaker identification using mfcc-domain support vector machine | |
Zheng et al. | MSRANet: Learning discriminative embeddings for speaker verification via channel and spatial attention mechanism in alterable scenarios | |
Zhang et al. | Waveform level adversarial example generation for joint attacks against both automatic speaker verification and spoofing countermeasures | |
Mishra et al. | Speaker identification, differentiation and verification using deep learning for human machine interface |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
Application publication date: 20190906