CN108198561A - Re-recorded speech detection method based on convolutional neural networks - Google Patents
- Publication number: CN108198561A (application CN201711323563.4A)
- Authority: CN (China)
- Prior art keywords: layer, speech, re-recorded speech, convolutional layer, original speech
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G10L17/00: Speaker identification or verification techniques
- G10L17/02: Preprocessing operations, e.g. segment selection; pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; feature selection or extraction
- G10L17/04: Training, enrolment or model building
- G10L17/18: Artificial neural networks; connectionist approaches
- G10L25/18: Speech or voice analysis techniques characterised by the extracted parameters being spectral information of each sub-band

(All within section G, PHYSICS; class G10L, speech analysis or synthesis.)
Abstract
The invention discloses a re-recorded speech detection method based on convolutional neural networks. First, an original speech library and a re-recorded speech library are built. The spectrogram of each original speech sample is then extracted as a positive sample and the spectrogram of each re-recorded sample as a negative sample; a portion of the positive and negative samples forms the training set, and the remainder forms the test set. A convolutional neural network model is then trained on the training set. Finally, each sample in the test set is fed to the trained model to obtain a classification result. The advantage of the method is that, without being restricted to fixed text, it achieves high detection accuracy for re-recordings made with a variety of covert recording devices.
Description
Technical field
The present invention relates to speech detection techniques, and more particularly to a re-recorded speech detection method based on convolutional neural networks.
Background art
With the continuous development of the Internet and the rapid spread of portable smart terminals, people can conveniently transmit information through various digital media such as images, audio, and video. At the same time, as playback devices and high-fidelity recording devices become widespread, an attacker can easily capture a legitimate user's passphrase covertly while the user requests access to an authentication system. Re-recorded speech, captured by a high-fidelity recording device and replayed through a playback device, is highly similar to the original speech; some speaker verification systems cannot tell them apart, which compromises the rights of legitimate users. Because the recording devices are small, easy to conceal, and record with a high success rate, replaying re-recorded speech has become the easiest way to attack a voice authentication system. Re-recorded speech detection has therefore received wide attention in the industry.
In recent years, research on re-recorded speech detection has achieved some results.
The first class of methods exploits the randomness of speech production by comparing the peakmaps of original and re-recorded speech (Shang W, Stevenson M. A playback attack detector for speaker verification systems [C] // International Symposium on Communications, Control and Signal Processing. IEEE, 2008: 1144-1149). The proposed playback detection algorithm judges a signal as re-recorded if its peakmap similarity to a stored recording exceeds a set threshold, and as original otherwise. Building on this, later work improved the algorithm by adding the positional relationships of individual frequency peaks to the peakmap feature, deciding whether the speech to be verified is legitimate from its similarity to the original speech in this feature. These methods apply only to text-dependent verification systems and cannot be used for text-independent re-recorded speech detection, which greatly limits them.
The second class of methods relies on channel-pattern features, exploiting the difference between the channel of re-recorded speech and that of original speech. One algorithm detects re-recordings from the MFCCs (Mel-frequency cepstral coefficients) of silent segments: it models the original speech channel from the silent segments of the original data and then checks whether the channel of the speech under test matches that of the training speech, thereby deciding whether a replay attack has occurred. Another algorithm extracts channel-pattern noise arising from the different channels that original and re-recorded speech pass through, and obtains good classification results with an SVM (support vector machine). A third algorithm, based on the effect of the high-fidelity recording channel on the speech spectrum, proposes a detection method using long-window scale factors. These methods can only detect recordings from a single device; they do not analyse multiple covert recording and playback devices, and the channel-pattern noise extracted by the second algorithm is also inaccurate.
At present, most work on re-recorded speech detection targets a single covert recording device and playback device; detection of re-recordings from multiple recording devices has received little attention. In real life, however, high-fidelity recording devices such as voice recorders and smartphones are everywhere. They are easy to carry, inconspicuous, produce re-recordings highly similar to the original speech, and are currently the mainstream covert recording devices. It is therefore necessary to study re-recorded speech detection for a variety of recording devices.
Summary of the invention
The technical problem to be solved by the invention is to provide a re-recorded speech detection method based on convolutional neural networks that, without being restricted to fixed text, achieves high detection accuracy for re-recordings made with a variety of covert recording devices.
The technical solution adopted by the invention to solve the above problem is a re-recorded speech detection method based on convolutional neural networks, characterised by the following steps:
1. Build the original speech library and the re-recorded speech library. In a quiet environment, record the speakers with a capture device, collecting N1 original speech samples of different content; these N1 samples form the original speech library. Simulating a real covert-recording scenario, while the capture device records the speakers, record them simultaneously with at least two covert recording devices, replay the covert recordings through at least one playback device, and capture the replayed audio again with the same capture device, collecting N2 re-recorded samples in total; these N2 samples form the re-recorded speech library. Here N1 ≥ 1000 and N2 ≥ 2N1.
2. Extract the spectrogram of each original speech sample in the original speech library and of each re-recorded sample in the re-recorded speech library. Take each original-speech spectrogram as a positive sample and each re-recorded spectrogram as a negative sample. Randomly select 50-70% of the N1 positive samples and 50-70% of the N2 negative samples to form the training set; the remaining positive and negative samples form the test set.
3. Build the convolutional neural network model:
Step one: build the first convolutional layer. First, set the total number of filters in the first convolutional layer; second, set the size of its convolution kernels; third, define the output of the layer through the ReLU activation function as h_{p,j}^(1) = f(x_p * k_j^(1) + b_j^(1)), where 1 ≤ p ≤ P and P is the total number of samples in the training set; 1 ≤ j ≤ M1 and M1 is the total number of filters in the first convolutional layer; f(·) is the ReLU activation function; x_p is the p-th sample in the training set; the symbol "*" denotes convolution; k_j^(1) is the j-th convolution kernel of the first layer and b_j^(1) its bias; and h_{p,j}^(1) is the j-th feature map output by the first convolutional layer for x_p. Each x_p thus yields M1 feature maps after the first convolutional layer.
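The first-layer relation just described, one feature map per filter computed as the ReLU of a convolution plus a bias, can be illustrated with a toy NumPy sketch; the input and kernel values here are arbitrary, and only the mechanics follow the text:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def conv2d_valid(x, k):
    """Naive 'valid' 2-D cross-correlation, as used in CNN convolutional layers."""
    H, W = x.shape
    kh, kw = k.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            out[r, c] = np.sum(x[r:r + kh, c:c + kw] * k)
    return out

def conv_layer(x, kernels, biases):
    """One feature map h_j = ReLU(x * k_j + b_j) per filter."""
    return [relu(conv2d_valid(x, k) + b) for k, b in zip(kernels, biases)]
```

With M1 kernels, the layer returns M1 feature maps, matching the statement that each x_p yields M1 maps.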
Step two: build the second convolutional layer. First, set the total number of filters in the second convolutional layer; second, set the size of its convolution kernels; third, define the output of the layer through the ReLU activation function as h_{p,i}^(2) = f(h_p^(1) * k_i^(2) + b_i^(2)), where 1 ≤ i ≤ M2 and M2 is the total number of filters in the second convolutional layer; k_i^(2) is the i-th convolution kernel of the second layer and b_i^(2) its bias; and h_{p,i}^(2) is the i-th feature map output by the second convolutional layer. Each input thus yields M2 feature maps after the second convolutional layer.
Step three: build the pooling layer. First, set the size of the pooling kernel; second, choose the pooling algorithm; third, take the output of the second convolutional layer as the input of the pooling layer and compute the pooling-layer output.
Step four: build the fully connected layer. First, set the number of hidden nodes in the fully connected layer; second, choose the loss function; third, take the output of the pooling layer as the input of the fully connected layer and compute its output. This completes the convolutional neural network model.
4. Feed each sample in the test set to the trained convolutional neural network model to obtain its classification as original or re-recorded speech.
In step 3, the first convolutional layer has 32 filters with kernels of size 1 × 11; the second convolutional layer has 64 filters with kernels of size 2 × 6; the pooling kernel is 1 × 4 and the pooling algorithm is max pooling; the fully connected layer has 256 hidden nodes, and the loss function is the SoftMax regression function.
Compared with the prior art, the advantages of the invention are as follows:
1) the method for the present invention is by obtaining the sound spectrograph of raw tone and pirate recordings voice, and using a part of raw tone with
The sound spectrograph of pirate recordings voice has built the convolutional neural networks frame training pattern suitable for detection pirate recordings voice so that the present invention
Method is respectively provided with higher Detection accuracy in the case where not limited by text, for a variety of pirate recordings voices for using a hidden recorder equipment.
2) the method for the present invention is during convolutional neural networks frame training pattern is built, it is contemplated that the network number of plies, filter
The influence of the number of wave device and the size of convolution kernel to recognition effect, has weighed processing time and space complexity, establishes inspection
Survey the network number of plies of best results and network parameter setting.
3) the method for the present invention is verified through cross-over experiment, a kind of is used a hidden recorder and the situation of the pirate recordings voice of playback apparatus known
Under, have preferable discrimination for the pirate recordings voice in other sources, can effectively identify raw tone and it is a variety of use a hidden recorder and
The pirate recordings voice of playback apparatus, and experiment show that Detection accuracy has reached 99.26%.
4) the method for the present invention can detect it is a variety of use a hidden recorder and the pirate recordings voice of playback apparatus, more tally with the actual situation, have
Higher realistic meaning.
Description of the drawings
Fig. 1 is the overall block diagram of the method of the present invention;
Fig. 2a is the spectrogram of a segment of original speech recorded with an Aigo R6620 voice recorder;
Fig. 2b is the spectrogram of the re-recording obtained with an Aigo R6620 as the covert recording device and a Huawei AM08 as the playback device;
Fig. 2c is the spectrogram of the re-recording obtained with an iPhone 6 as the covert recording device and a Huawei AM08 as the playback device;
Fig. 2d is the spectrogram of the re-recording obtained with a SONY PX440 as the covert recording device and a Huawei AM08 as the playback device;
Fig. 3a is the spectrogram of another segment of original speech recorded with an Aigo R6620 voice recorder;
Fig. 3b is the spectrogram of the re-recording obtained with an Aigo R6620 as the covert recording device and a Philips DTM3115 as the playback device;
Fig. 3c is the spectrogram of the re-recording obtained with an iPhone 6 as the covert recording device and a Philips DTM3115 as the playback device;
Fig. 3d is the spectrogram of the re-recording obtained with a SONY PX440 as the covert recording device and a Philips DTM3115 as the playback device;
Fig. 4a shows the detection accuracy curves for re-recorded speech with the window length set to 512 points, 1024 Fourier points, and window shifts of 128 and 256 points;
Fig. 4b shows the detection loss curves for re-recorded speech under the same settings.
Detailed description of embodiments
The invention is described in further detail below with reference to the accompanying drawings and embodiments.
Deep learning essentially builds machine-learning models with many hidden layers and trains them on large-scale data to obtain numerous, highly representative features, so that samples can be classified and predicted with improved accuracy. Compared with hand-engineered feature extraction, the features obtained by a deep learning model reveal the rich internal information of big data and are more representative. Convolutional neural networks can extract the features hidden in massive data samples, which is why they are widely used across the fields of pattern recognition. The present invention therefore uses convolutional neural networks to detect re-recorded speech.
The overall block diagram of the re-recorded speech detection method based on convolutional neural networks proposed by the present invention is shown in Fig. 1. The method comprises the following steps:
1. Build the original speech library and the re-recorded speech library. In a quiet environment, record the speakers with a capture device, collecting N1 original speech samples of different content; these N1 samples form the original speech library. Simulating a real covert-recording scenario, while the capture device records the speakers, record them simultaneously with at least two covert recording devices, replay the covert recordings through at least one playback device, and capture the replayed audio again with the same capture device, collecting N2 re-recorded samples in total; these N2 samples form the re-recorded speech library. Here N1 ≥ 1000 and N2 ≥ 2N1.
While the capture device records the original speech, each speaker reads the corpus in their natural speaking style, with the capture device about 20 cm away; the covert recording devices are about 70 cm from the speaker; and the capture device that records the replayed audio is about 20 cm from the playback device. The capture, covert recording, and playback devices can all be existing high-fidelity recording equipment; an ordinary loudspeaker, for example, can serve as the playback device.
Table 1 lists the capture, covert recording, and playback devices used in this embodiment, and Table 2 gives details of the original and re-recorded speech obtained.
Table 1: capture, covert recording, and playback devices used in this embodiment
Table 2: details of the original and re-recorded speech obtained in this embodiment
2. Using existing techniques, extract the spectrogram of each original speech sample in the original speech library and of each re-recorded sample in the re-recorded speech library. Take each original-speech spectrogram as a positive sample and each re-recorded spectrogram as a negative sample. Randomly select 50-70% of the N1 positive samples and 50-70% of the N2 negative samples to form the training set; the remaining positive and negative samples form the test set.
A spectrogram contains a great deal of information related to the characteristics of the utterance: it combines the spectrum with the time-domain waveform and clearly shows how the speech spectrum changes over time. Compared with original speech, re-recorded speech has mostly gone through an extra recording and playback pass, and the covert recording and playback devices inevitably re-sample and re-encode the signal. Re-recorded speech therefore carries intrinsic attributes that differ from those of original speech.
Fig. 2a shows the spectrogram of a segment of original speech recorded with an Aigo R6620 voice recorder; its content is the Mandarin reading "Open sesame, I am a tycoon, a thousand miles apart sharing the same moon". Fig. 2b shows the spectrogram of the re-recording obtained with an Aigo R6620 as the covert recording device and a Huawei AM08 as the playback device; Fig. 2c shows the one obtained with an iPhone 6 and a Huawei AM08; Fig. 2d shows the one obtained with a SONY PX440 and a Huawei AM08. Fig. 3a shows the spectrogram of another segment of original speech recorded with the Aigo R6620, reading the same Mandarin content; Fig. 3b shows the re-recording obtained with an Aigo R6620 and a Philips DTM3115 playback device; Fig. 3c shows the one obtained with an iPhone 6 and a Philips DTM3115; Fig. 3d shows the one obtained with a SONY PX440 and a Philips DTM3115.
3. Build the convolutional neural network model:
Step one: build the first convolutional layer. First, set the total number of filters in the first convolutional layer; second, set the size of its convolution kernels; third, define the output of the layer through the ReLU activation function as h_{p,j}^(1) = f(x_p * k_j^(1) + b_j^(1)), where 1 ≤ p ≤ P and P is the total number of samples in the training set; 1 ≤ j ≤ M1 and M1 is the total number of filters in the first convolutional layer; f(·) is the ReLU activation function; x_p is the p-th sample in the training set; the symbol "*" denotes convolution; k_j^(1) is the j-th convolution kernel of the first layer and b_j^(1) its bias; and h_{p,j}^(1) is the j-th feature map output by the first convolutional layer for x_p. Each x_p thus yields M1 feature maps after the first convolutional layer.
Step two: build the second convolutional layer. First, set the total number of filters in the second convolutional layer; second, set the size of its convolution kernels; third, define the output of the layer through the ReLU activation function as h_{p,i}^(2) = f(h_p^(1) * k_i^(2) + b_i^(2)), where 1 ≤ i ≤ M2 and M2 is the total number of filters in the second convolutional layer; k_i^(2) is the i-th convolution kernel of the second layer and b_i^(2) its bias; and h_{p,i}^(2) is the i-th feature map output by the second convolutional layer. Each input thus yields M2 feature maps after the second convolutional layer.
Step three: build the pooling layer. First, set the size of the pooling kernel; second, choose the pooling algorithm; third, take the output of the second convolutional layer as the input of the pooling layer and compute the pooling-layer output.
Step four: build the fully connected layer. First, set the number of hidden nodes in the fully connected layer; second, choose the loss function; third, take the output of the pooling layer as the input of the fully connected layer and compute its output. This completes the convolutional neural network model.
In this embodiment, in step 3 the first convolutional layer has 32 filters with kernels of size 1 × 11; the second convolutional layer has 64 filters with kernels of size 2 × 6; the pooling kernel is 1 × 4 and the pooling algorithm is max pooling; the fully connected layer has 256 hidden nodes, and the loss function is the SoftMax regression function.
4. Feed each sample in the test set to the trained convolutional neural network model to obtain its classification as original or re-recorded speech.
To further demonstrate the feasibility and effectiveness of the method of the present invention, it was tested experimentally.
Choice of the number and size of the convolution kernels:
A convolutional neural network analyses local features with its convolution kernels, strengthens the robustness of the extracted features in the pooling layer, and finally builds a model through the fully connected layer to obtain the classification result. In this process, the convolvers analyse and extract the input features and have a large influence on the classification results. Two convolver parameters are commonly tuned: the size of the kernels and their number.
In principle, the number of convolution kernels (filters) equals the number of output feature maps: if there are N convolvers, N feature maps are output. As the number of filters grows, more feature maps are produced, the feature space the network can represent becomes larger, its learning capacity increases, and the recognition rate rises. Table 3 shows the influence of the number of filters on detection performance, and Table 4 the influence of kernel size. In Tables 3 and 4, ACC is the detection accuracy, Loss the loss rate, and Time the approximate time per iteration. The experiments in Table 3 keep the layer structure and all other parameters fixed and vary the filter counts of the two convolutional layers; those in Table 4 fix the filter counts at 32 and 64, the pooling kernel at 1 × 4, and the hidden nodes at 256, and vary the kernel sizes of the two convolutional layers. The experimental samples are 6300 original and 6300 re-recorded speech segments.
Table 3: influence of the number of filters on detection performance
Number of filters | ACC (%) | Loss | Time/iteration
16 ~ 32 | 98.39 | 0.048 | 238 s
32 ~ 32 | 98.57 | 0.043 | 321 s
32 ~ 64 | 98.97 | 0.034 | 360 s
64 ~ 64 | 99.04 | 0.031 | 420 s
Table 4: Influence of the kernel size on detection performance

Kernel size (layer 1~layer 2) | ACC (%) | Loss | Time/iteration
---|---|---|---
1 × 7~2 × 6 | 98.97 | 0.033 | 400 s
1 × 11~2 × 6 | 98.97 | 0.034 | 360 s
1 × 14~2 × 6 | 98.54 | 0.047 | 318 s
The data in Tables 3 and 4 show that detection performance improves as the number of filters grows: different filters extract different features from different angles, so with too few filters the useful information cannot be fully extracted, while with more filters the running time increases but the gain in recognition rate becomes marginal. In addition, as the kernel size is gradually refined the recognition rate rises, but only slightly, which indicates that the kernel size has a weak influence on detection performance. Taking all of this into account, a practical implementation can finally choose 32 and 64 filters for the two layers, with kernel sizes of 1 × 11 and 2 × 6.
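As a concrete illustration of "N kernels produce N feature maps," here is a minimal numpy sketch of a ReLU convolutional layer with 32 kernels of size 1 × 11. This is a toy reconstruction for exposition only, not the patented implementation; the input shape and random weights are assumptions.

```python
import numpy as np

def conv_layer(x, kernels, biases):
    """Valid 2-D convolution of input x (H, W) with a bank of kernels.
    N kernels produce N output feature maps, followed by ReLU."""
    kh, kw = kernels.shape[1], kernels.shape[2]
    H, W = x.shape
    out = np.zeros((kernels.shape[0], H - kh + 1, W - kw + 1))
    for n, (k, b) in enumerate(zip(kernels, biases)):
        for i in range(out.shape[1]):
            for j in range(out.shape[2]):
                out[n, i, j] = np.sum(x[i:i + kh, j:j + kw] * k) + b
    return np.maximum(out, 0.0)  # ReLU activation

rng = np.random.default_rng(0)
x = rng.standard_normal((16, 64))        # toy spectrogram patch (assumed shape)
k32 = rng.standard_normal((32, 1, 11))   # 32 kernels of size 1 x 11
maps = conv_layer(x, k32, np.zeros(32))
print(maps.shape)  # (32, 16, 54) -- one feature map per kernel
```

Doubling the number of kernels doubles the output feature maps (and the per-iteration cost), matching the Time column of Table 3.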
Influence of input spectrograms under different window shifts:
A speech signal is framed, windowed, and Fourier-transformed, and its energy spectral density is computed to obtain the spectrogram. Different window shifts produce different spectrograms containing different amounts of speech information. Fig. 4a gives the detection recognition-rate curves for pirate-recorded speech with a window length of 512 points, 1024 Fourier points, and window shifts of 128 and 256 points; Fig. 4b gives the corresponding detection loss rates under the same settings (window length 512 points, 1024 Fourier points, window shifts of 128 and 256 points). In Fig. 4a the abscissa Epoch is the number of iterations and the ordinate Accuracy is the detection recognition rate; in Fig. 4b the abscissa Epoch is the number of iterations and the ordinate Loss is the detection loss rate. The experiment uses 6300 original utterances and 6300 pirate-recorded utterances, with 70% used for training and the rest for testing.
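The spectrogram settings above (512-point window, 1024-point FFT, window shifts of 128 and 256 points) can be sketched in numpy. This is an illustrative reconstruction; the Hamming window, the log-energy scaling, and the 16 kHz sampling rate are assumptions not stated in the text.

```python
import numpy as np

def spectrogram(sig, win_len=512, n_fft=1024, hop=128):
    """Frame, window, and Fourier-transform the signal; the log energy
    spectral density of the frames forms the spectrogram."""
    win = np.hamming(win_len)
    n_frames = 1 + (len(sig) - win_len) // hop
    frames = np.stack([sig[i * hop : i * hop + win_len] * win
                       for i in range(n_frames)])
    spec = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2  # energy spectrum
    return np.log(spec + 1e-10)

# 1 s of a 1 kHz tone at an assumed 16 kHz sampling rate
sig = np.sin(2 * np.pi * 1000 * np.arange(16000) / 16000)
s128 = spectrogram(sig, hop=128)   # window shift 128 points
s256 = spectrogram(sig, hop=256)   # window shift 256 points
print(s128.shape, s256.shape)      # (122, 513) (61, 513)
```

A smaller window shift yields more frames (a finer time resolution) for the same signal, which is why the two shifts produce spectrograms carrying different amounts of speech information.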
Cross experiments:
Many kinds of covert recording devices and playback devices are used in recapturing, and different covert recording and playback devices influence the detection result differently. The purpose of the cross experiments is to better examine the applicability of the method of the present invention. In each experiment, the pirate-recorded speech obtained with one covert recording device and one playback device is used as the training speech, and the pirate-recorded speech obtained with any other covert recording device and playback device is used as the test speech. There are 6300 original utterances and 37800 pirate-recorded utterances. The detection result is reported as ACC (%). The experimental results are listed in Table 5.
The data in Table 5 show that cross experiments with the same playback device but different covert recording devices achieve good detection rates, above 93%; in particular, when the playback device is a Huawei AM08 and the covert recording device is an Aigo R6620, the detection rate of pirate-recorded speech reaches 99.28%. Cross experiments with different playback devices and different covert recording devices still show some detection effect, but the results are not as good as those with the same playback device and different covert recording devices. It follows that, compared with the covert recording device, the playback device has the greater influence on pirate-recorded speech.
Table 5: Cross-experiment results
Claims (2)
1. A pirate-recorded speech detection method based on a convolutional neural network, characterized by comprising the following steps:
① Build an original-speech library and a pirate-recorded speech library: in a quiet environment, use a collecting device to record the recording personnel and collect a total of N1 original utterances of different content; these N1 utterances form the original-speech library. Simulating the real covert-recording process, while the collecting device captures the original speech, use at least two covert recording devices to record the personnel covertly; then use at least one playback device to play back the covertly recorded speech, and use the same collecting device to capture the played-back speech, collecting a total of N2 pirate-recorded utterances; these N2 utterances form the pirate-recorded speech library. Here N1 ≥ 1000 and N2 ≥ 2N1.
② Extract the spectrogram of every original utterance in the original-speech library and of every pirate-recorded utterance in the pirate-recorded speech library; take each original utterance's spectrogram as a positive sample and each pirate-recorded utterance's spectrogram as a negative sample; then randomly select 50–70% of the N1 positive samples and 50–70% of the N2 negative samples to form the training set, and form the test set from the remaining positive and negative samples.
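The random 50–70% per-class split described above can be sketched with the Python standard library. This is an illustrative sketch, not the patented procedure; the 70% ratio and the 6300/12600 sample counts are taken from the experiments, and the fixed seed is an assumption for reproducibility.

```python
import random

def split_samples(positives, negatives, train_frac=0.7, seed=1):
    """Randomly draw train_frac of each class for training; the remaining
    positive and negative samples form the test set."""
    rnd = random.Random(seed)
    train, test = [], []
    for label, pool in ((1, positives), (0, negatives)):
        pool = list(pool)
        rnd.shuffle(pool)
        cut = round(train_frac * len(pool))
        train += [(x, label) for x in pool[:cut]]
        test += [(x, label) for x in pool[cut:]]
    return train, test

# e.g. 6300 positive (original) and 12600 negative (pirate-recorded) samples
train, test = split_samples(range(6300), range(12600))
print(len(train), len(test))  # 13230 5670
```

Drawing the fraction per class (rather than from the pooled samples) keeps the class ratio the same in the training and test sets even when N2 ≥ 2N1.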
③ Build the convolutional-neural-network training model:
First step, build the first convolutional layer: first, set the total number of filters in the first convolutional layer; second, set the size of its convolution kernels; third, determine the relationship between the output of the first convolutional layer and the ReLU activation function, described as y_{p,j}^{(1)} = f(x_p * w_j^{(1)} + b_j^{(1)}), where 1 ≤ p ≤ P and P is the total number of samples in the training set; 1 ≤ j ≤ M1 and M1 is the total number of filters in the first convolutional layer; f(·) is the ReLU activation function; x_p is the p-th sample in the training set; the symbol "*" is the convolution operator; w_j^{(1)} is the j-th convolution kernel of the first convolutional layer, whose size k^{(1)} is the set kernel size; b_j^{(1)} is the bias of y_{p,j}^{(1)}; and y_{p,j}^{(1)} is the j-th feature map that the first convolutional layer outputs for x_p, so that x_p yields M1 feature maps after the first convolutional layer.
Second step, build the second convolutional layer: first, set the total number of filters in the second convolutional layer; second, set the size of its convolution kernels; third, determine the relationship between the output of the second convolutional layer and the ReLU activation function, described as y_{p,i}^{(2)} = f(Σ_{j=1}^{M1} y_{p,j}^{(1)} * w_i^{(2)} + b_i^{(2)}), where 1 ≤ i ≤ M2 and M2 is the total number of filters in the second convolutional layer; w_i^{(2)} is the i-th convolution kernel of the second convolutional layer, whose size k^{(2)} is the set kernel size; b_i^{(2)} is the bias of y_{p,i}^{(2)}; and y_{p,i}^{(2)} is the i-th feature map output by the second convolutional layer, so that M2 feature maps are obtained after the second convolutional layer.
Third step, build the pooling layer: first, set the size of the pooling kernel; second, determine the pooling algorithm to use; third, take the output of the second convolutional layer as the input of the pooling layer and obtain the pooling layer's output.
Fourth step, build the fully connected layer: first, set the number of hidden nodes in the fully connected layer; second, determine the loss function to use; third, take the output of the pooling layer as the input of the fully connected layer and obtain its output, at which point the convolutional-neural-network training model is obtained.
④ Take each sample in the test set as input to the convolutional-neural-network training model and obtain the classification result: original speech or pirate-recorded speech.
2. The pirate-recorded speech detection method based on a convolutional neural network according to claim 1, characterized in that in step ③ the total number of filters in the first convolutional layer is 32 and the size of its convolution kernels is 1 × 11; the total number of filters in the second convolutional layer is 64 and the size of its convolution kernels is 2 × 6; the size of the pooling kernel is 1 × 4, and the pooling algorithm used is max pooling; the number of hidden nodes in the fully connected layer is 256, and the loss function used is the SoftMax regression function.
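The 1 × 4 max pooling and SoftMax loss specified in claim 2 can be illustrated with a small numpy sketch. This is an illustrative reconstruction for exposition, not the patented implementation.

```python
import numpy as np

def max_pool_1x4(fmap):
    """Non-overlapping 1 x 4 max pooling along the second axis of a feature map,
    as specified for the pooling layer (kernel size 1 x 4, max pooling)."""
    h, w = fmap.shape
    w4 = w - w % 4                      # drop any incomplete trailing group
    return fmap[:, :w4].reshape(h, w4 // 4, 4).max(axis=2)

def softmax_loss(logits, label):
    """SoftMax cross-entropy for the two-class original / pirate-recorded decision."""
    z = logits - logits.max()           # shift for numerical stability
    p = np.exp(z) / np.exp(z).sum()
    return -np.log(p[label] + 1e-12)

fmap = np.arange(12.0).reshape(1, 12)
print(max_pool_1x4(fmap))               # [[ 3.  7. 11.]]
print(softmax_loss(np.array([2.0, 0.0]), 0))
```

Max pooling keeps only the strongest response in each 1 × 4 group, which is the "reinforced robustness" of the extracted features described in the specification.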
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711323563.4A CN108198561A (en) | 2017-12-13 | 2017-12-13 | A kind of pirate recordings speech detection method based on convolutional neural networks |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108198561A true CN108198561A (en) | 2018-06-22 |
Family
ID=62574282
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201711323563.4A Pending CN108198561A (en) | 2017-12-13 | 2017-12-13 | A kind of pirate recordings speech detection method based on convolutional neural networks |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108198561A (en) |
- 2017-12-13: CN application CN201711323563.4A filed; publication CN108198561A (en); status Pending
Patent Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105118503A (en) * | 2015-07-13 | 2015-12-02 | 中山大学 | Ripped audio detection method |
Non-Patent Citations (1)
Title |
---|
XIAODAN LIN et al.: "Audio Recapture Detection With Convolutional Neural Networks", IEEE Transactions on Multimedia * |
Cited By (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109326294A (en) * | 2018-09-28 | 2019-02-12 | 杭州电子科技大学 | A text-related voiceprint key generation method |
CN109326294B (en) * | 2018-09-28 | 2022-09-20 | 杭州电子科技大学 | Text-related voiceprint key generation method |
CN109599117A (en) * | 2018-11-14 | 2019-04-09 | 厦门快商通信息技术有限公司 | An audio data recognition method and human-voice anti-replay identification system |
CN109801638A (en) * | 2019-01-24 | 2019-05-24 | 平安科技(深圳)有限公司 | Speech verification method, apparatus, computer equipment, and storage medium |
CN109801638B (en) * | 2019-01-24 | 2023-10-13 | 平安科技(深圳)有限公司 | Voice verification method, device, computer equipment and storage medium |
CN109872720A (en) * | 2019-01-29 | 2019-06-11 | 广东技术师范学院 | A re-recorded speech detection algorithm robust to different scenes, based on convolutional neural networks |
CN110223676A (en) * | 2019-06-14 | 2019-09-10 | 苏州思必驰信息科技有限公司 | Optimization method and system for a deception-recording detection neural network model |
CN110491391A (en) * | 2019-07-02 | 2019-11-22 | 厦门大学 | A deception speech detection method based on a deep neural network |
CN110459225A (en) * | 2019-08-14 | 2019-11-15 | 南京邮电大学 | A speaker recognition system based on CNN fusion features |
CN112270931A (en) * | 2020-10-22 | 2021-01-26 | 江西师范大学 | Method for deceptive voice detection based on a twin convolutional neural network |
CN113646833A (en) * | 2021-07-14 | 2021-11-12 | 东莞理工学院 | Voice adversarial sample detection method, device, equipment, and computer-readable storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108198561A (en) | A pirate-recorded speech detection method based on a convolutional neural network | |
CN109065030B (en) | Environmental sound recognition method and system based on a convolutional neural network | |
CN108231067A (en) | Sound scene recognition method based on a convolutional neural network and random forest classification | |
CN108711436B (en) | Speaker verification system replay attack detection method based on high-frequency and bottleneck features | |
CN104732978B (en) | Text-dependent speaker recognition method based on combined deep learning | |
CN104900235B (en) | Voiceprint recognition method based on pitch-period composite feature parameters | |
CN105788592A (en) | Audio classification method and apparatus | |
CN107507625B (en) | Sound source distance determination method and device | |
CN101923855A (en) | Text-independent voiceprint identification system | |
CN102723079B (en) | Automatic music and chord identification method based on sparse representation | |
CN113823293B (en) | Speaker recognition method and system based on speech enhancement | |
CN109378014A (en) | Mobile device source identification method and system based on convolutional neural networks | |
CN102982351A (en) | Porcelain insulator vibro-acoustic test data classification method based on a back-propagation (BP) neural network | |
CN109872720A (en) | A re-recorded speech detection algorithm robust to different scenes, based on convolutional neural networks | |
CN104221079A (en) | Modified Mel filter bank structure using spectral characteristics for sound analysis | |
CN105513598A (en) | Replayed speech detection method based on the frequency-domain distribution of information quantity | |
CN108766464A (en) | Automatic digital audio tampering detection method based on mains-frequency-fluctuation supervectors | |
CN111081223A (en) | Speech recognition method, device, equipment, and storage medium | |
CN111508524A (en) | Method and system for identifying speech source equipment | |
CN111402922B (en) | Audio signal classification method, device, and equipment based on small samples | |
CN117419915A (en) | Motor fault diagnosis method based on multi-source information fusion | |
CN110136746B (en) | Mobile phone source identification method in additive-noise environments based on fusion features | |
CN112786057B (en) | Voiceprint recognition method and device, electronic equipment and storage medium |
CN116705063B (en) | Multi-model fusion speech forgery identification method based on manifold measures |
CN113936667A (en) | Birdsong recognition model training method, recognition method, and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20180622 |