CN108198561A - A re-recorded speech detection method based on convolutional neural networks - Google Patents

A re-recorded speech detection method based on convolutional neural networks

Info

Publication number
CN108198561A
CN108198561A CN201711323563A CN 108198561 A
Authority
CN
China
Prior art keywords
layer
speech
re-recorded speech
convolutional layer
original speech
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201711323563.4A
Other languages
Chinese (zh)
Inventor
王让定 (Wang Rangding)
李璨 (Li Can)
严迪群 (Yan Diqun)
林朗 (Lin Lang)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ningbo University
Original Assignee
Ningbo University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ningbo University filed Critical Ningbo University
Priority to CN201711323563.4A priority Critical patent/CN108198561A/en
Publication of CN108198561A publication Critical patent/CN108198561A/en
Pending legal-status Critical Current

Links

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 - Speaker identification or verification techniques
    • G10L17/18 - Artificial neural networks; Connectionist approaches
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 - Speaker identification or verification techniques
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 - Speaker identification or verification techniques
    • G10L17/02 - Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 - Speaker identification or verification techniques
    • G10L17/04 - Training, enrolment or model building
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/18 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Measurement Of Mechanical Vibrations Or Ultrasonic Waves (AREA)

Abstract

The invention discloses a re-recorded speech detection method based on convolutional neural networks. First, an original speech library and a re-recorded speech library are built. Then the spectrogram of each original utterance in the original speech library is extracted as a positive sample, and the spectrogram of each re-recorded utterance in the re-recorded speech library as a negative sample; part of the positive and negative samples form the training set, and the remainder form the test set. Next, a convolutional neural network training model is built from the training set and the network framework. Finally, each test-set sample is fed as input into the trained model to obtain the classification result. The advantage of the method is that, without being restricted to specific text, it achieves high detection accuracy for re-recorded speech produced by a variety of covert recording devices.

Description

A re-recorded speech detection method based on convolutional neural networks
Technical field
The present invention relates to speech detection technology, and more particularly to a re-recorded speech detection method based on convolutional neural networks.
Background art
With the continuous development of the Internet and the rapid spread of portable intelligent terminals, people can transmit information ever more conveniently through various digital media (such as images, audio, and video). At the same time, as playback devices and high-fidelity recording devices become widespread, an attacker can easily record a legitimate user's passphrase covertly and replay it when requesting entry to an authentication system. Re-recorded speech, obtained by covert recording with a high-fidelity device and replayed through a playback device, is highly similar to the original speech; some speaker authentication systems cannot tell the two apart, which harms the interests of legitimate users. Because the recording devices are small, the recording is easy to make covertly, and the attack succeeds at a high rate, replaying re-recorded speech has become the easiest way to attack voice authentication systems. Re-recorded speech detection has therefore received wide attention in the industry.
In recent years, research on re-recorded speech detection has achieved certain results.
In the first class of methods, researchers exploited the randomness of speech production and compared the peakmaps of original and re-recorded speech (Shang W, Stevenson M. A playback attack detector for speaker verification systems [C] // International Symposium on Communications, Control and Signal Processing. IEEE, 2008: 1144-1149), proposing a playback detection algorithm based on peakmap similarity: if the peakmap similarity exceeds a set threshold, the speech is judged to be re-recorded; otherwise, it is judged original. On this basis, the algorithm was later improved by adding the positions of the individual frequency components to the peakmap feature and judging whether the speech under test is legitimate according to its similarity to the original speech in this feature. These methods apply only to text-dependent authentication systems and cannot handle text-independent re-recorded speech detection, which is a substantial limitation.
The second class of methods uses channel pattern features, exploiting the difference between the re-recording channel and the original channel. One algorithm detects re-recorded speech from the MFCCs (Mel-frequency cepstral coefficients) of silent segments: it models the original channel from silent segments of the original speech and checks whether the channel of the speech under test matches that of the training speech, thereby deciding whether a replay attack has occurred. Another algorithm exploits the different channels through which original and re-recorded speech are produced, extracts channel pattern noise, and obtains good classification results with an SVM (Support Vector Machine). A third algorithm considers the influence of the high-fidelity recording channel on the speech signal and proposes a re-recorded speech detection algorithm based on long-window scale factors. These methods can only detect speech recorded by a single device; they do not analyze multiple covert recording and playback devices, and the channel pattern noise extracted by the second algorithm is also inaccurate.
At present, most work on re-recorded speech detection targets a single pair of covert recording and playback devices; detection across multiple recording devices has received little attention. In real life, however, high-fidelity recording devices such as voice recorders and smartphones are everywhere; they are easy to carry and inconspicuous, the re-recorded speech they produce is highly similar to the original, and they are currently the mainstream covert recording devices. It is therefore necessary to study re-recorded speech detection for a variety of recording devices.
Summary of the invention
The technical problem to be solved by the invention is to provide a re-recorded speech detection method based on convolutional neural networks that, without being restricted to specific text, achieves high detection accuracy for re-recorded speech produced by a variety of covert recording devices.
The technical solution adopted by the invention to solve the above problem is a re-recorded speech detection method based on convolutional neural networks, characterized by comprising the following steps:
1. Build the original speech library and the re-recorded speech library: in a quiet environment, collect original speech from recording subjects with an acquisition device, gathering N1 original utterances of different content; these N1 utterances form the original speech library. To simulate the covert recording process realistically, while the acquisition device collects the original speech, at least two covert recording devices record the subjects at the same time; the covertly recorded speech is then replayed through at least one playback device, and the same acquisition device captures the replayed audio, yielding N2 re-recorded utterances in total; these N2 utterances form the re-recorded speech library. Here N1 ≥ 1000 and N2 ≥ 2N1.
2. Extract the spectrogram of each original utterance in the original speech library and of each re-recorded utterance in the re-recorded speech library. Take the spectrogram of each original utterance as a positive sample and the spectrogram of each re-recorded utterance as a negative sample. Randomly select 50-70% of the N1 positive samples and 50-70% of the N2 negative samples to form the training set; the remaining positive and negative samples form the test set.
3. Build the convolutional neural network training model:
The first step is to build the first convolutional layer. First, set the total number of filters in the layer; second, set the size of its convolution kernels; third, determine the layer's output through the ReLU activation, described as $c_j^{(1,p)} = f\big(x_p * k_j^{(1)} + b_j^{(1)}\big)$, where $1 \le p \le P$, P is the total number of samples in the training set, $1 \le j \le M_1$, $M_1$ is the total number of filters in the first convolutional layer, $f(\cdot)$ is the ReLU activation, $x_p$ is the p-th training sample, the symbol "*" denotes convolution, $k_j^{(1)}$ is the j-th convolution kernel of the first layer (of the set size), $b_j^{(1)}$ is its bias, and $c_j^{(1,p)}$ is the j-th feature map output by the first convolutional layer for $x_p$; each $x_p$ thus yields $M_1$ feature maps after the first convolutional layer.
The second step is to build the second convolutional layer. First, set the total number of filters in the layer; second, set the size of its convolution kernels; third, determine the layer's output through the ReLU activation, described as $c_i^{(2,p)} = f\big(\sum_{j=1}^{M_1} c_j^{(1,p)} * k_{i,j}^{(2)} + b_i^{(2)}\big)$, where $1 \le i \le M_2$, $M_2$ is the total number of filters in the second convolutional layer, $k_{i,j}^{(2)}$ is the second-layer kernel (of the set size) connecting the j-th input feature map to the i-th output, $b_i^{(2)}$ is its bias, and $c_i^{(2,p)}$ is the i-th feature map output by the second convolutional layer; each input thus yields $M_2$ feature maps after the second convolutional layer.
The third step is to build the pooling layer. First, set the size of the pooling kernel; second, determine the pooling algorithm to use; third, take the output of the second convolutional layer as the input of the pooling layer and obtain the pooling layer's output.
The fourth step is to build the fully connected layer. First, set the number of hidden nodes; second, determine the loss function to use; third, take the output of the pooling layer as the input of the fully connected layer and obtain its output. The convolutional neural network training model is now complete.
4. Take each sample in the test set as input to the trained convolutional neural network model and obtain the classification result: original speech or re-recorded speech.
In step 3, the total number of filters in the first convolutional layer is 32 and its kernel size is 1 × 11; the second convolutional layer has 64 filters with kernel size 2 × 6; the pooling kernel size is 1 × 4 and the pooling algorithm is max pooling; the fully connected layer has 256 hidden nodes, and the loss function is the SoftMax regression function.
Compared with the prior art, the advantages of the invention are:
1) the method for the present invention is by obtaining the sound spectrograph of raw tone and pirate recordings voice, and using a part of raw tone with The sound spectrograph of pirate recordings voice has built the convolutional neural networks frame training pattern suitable for detection pirate recordings voice so that the present invention Method is respectively provided with higher Detection accuracy in the case where not limited by text, for a variety of pirate recordings voices for using a hidden recorder equipment.
2) the method for the present invention is during convolutional neural networks frame training pattern is built, it is contemplated that the network number of plies, filter The influence of the number of wave device and the size of convolution kernel to recognition effect, has weighed processing time and space complexity, establishes inspection Survey the network number of plies of best results and network parameter setting.
3) the method for the present invention is verified through cross-over experiment, a kind of is used a hidden recorder and the situation of the pirate recordings voice of playback apparatus known Under, have preferable discrimination for the pirate recordings voice in other sources, can effectively identify raw tone and it is a variety of use a hidden recorder and The pirate recordings voice of playback apparatus, and experiment show that Detection accuracy has reached 99.26%.
4) the method for the present invention can detect it is a variety of use a hidden recorder and the pirate recordings voice of playback apparatus, more tally with the actual situation, have Higher realistic meaning.
Description of the drawings
Fig. 1 is the overall block diagram of the method of the invention;
Fig. 2a is the spectrogram of a segment of original speech recorded with an Aigo R6620 voice recorder;
Fig. 2b is the spectrogram of re-recorded speech obtained with covert recording device Aigo R6620 and playback device Huawei AM08;
Fig. 2c is the spectrogram of re-recorded speech obtained with covert recording device iPhone6 and playback device Huawei AM08;
Fig. 2d is the spectrogram of re-recorded speech obtained with covert recording device SONY PX440 and playback device Huawei AM08;
Fig. 3a is the spectrogram of another segment of original speech recorded with an Aigo R6620 voice recorder;
Fig. 3b is the spectrogram of re-recorded speech obtained with covert recording device Aigo R6620 and playback device Philips DTM3115;
Fig. 3c is the spectrogram of re-recorded speech obtained with covert recording device iPhone6 and playback device Philips DTM3115;
Fig. 3d is the spectrogram of re-recorded speech obtained with covert recording device SONY PX440 and playback device Philips DTM3115;
Fig. 4a shows the detection recognition rate curves for re-recorded speech with a 512-point window length, 1024 Fourier sampling points, and window shifts of 128 and 256 points;
Fig. 4b shows the detection loss rates for re-recorded speech with a 512-point window length, 1024 Fourier sampling points, and window shifts of 128 and 256 points.
Specific embodiment
The invention is described in further detail below with reference to the drawings and an embodiment.
Deep learning essentially builds machine-learning models with many hidden layers and trains them on large-scale data to obtain a large number of representative features, so that samples can be classified and predicted with improved accuracy. Compared with hand-designed feature extraction, the features obtained with a deep learning model reveal the rich internal information of big data and are more representative. Convolutional neural networks can extract the hidden features of massive data samples, which is why they are widely used across pattern recognition. The invention therefore uses a convolutional neural network to detect re-recorded speech.
The overall block diagram of the proposed re-recorded speech detection method based on convolutional neural networks is shown in Fig. 1; the method comprises the following steps:
1. Build the original speech library and the re-recorded speech library: in a quiet environment, collect original speech from recording subjects with an acquisition device, gathering N1 original utterances of different content; these N1 utterances form the original speech library. To simulate the covert recording process realistically, while the acquisition device collects the original speech, at least two covert recording devices record the subjects at the same time; the covertly recorded speech is then replayed through at least one playback device, and the same acquisition device captures the replayed audio, yielding N2 re-recorded utterances in total; these N2 utterances form the re-recorded speech library. Here N1 ≥ 1000 and N2 ≥ 2N1.
While the acquisition device collects original speech, the recording subjects read the corpus content according to their own speaking habits; the acquisition device is about 20 cm from the subject, the covert recording device about 70 cm, and the playback device about 20 cm from the acquisition device that captures the replayed speech. The acquisition, covert recording, and playback devices can all be existing high-fidelity recording equipment; an ordinary loudspeaker can serve as the playback device.
Table 1 lists the acquisition, covert recording, and playback devices used in this embodiment; Table 2 gives details of the original and re-recorded speech obtained.
Table 1 Acquisition, covert recording, and playback devices used in this embodiment
Table 2 Details of the original and re-recorded speech obtained in this embodiment
2. Extract, using existing techniques, the spectrogram of each original utterance in the original speech library and of each re-recorded utterance in the re-recorded speech library. Take the spectrogram of each original utterance as a positive sample and the spectrogram of each re-recorded utterance as a negative sample. Randomly select 50-70% of the N1 positive samples and 50-70% of the N2 negative samples to form the training set; the remaining positive and negative samples form the test set.
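The random per-class split in the step above can be sketched as follows (a minimal illustration; the 70% fraction, the helper name, and the sample counts are assumptions for the example, not fixed by the patent):

```python
import random

def split_samples(samples, train_frac=0.7, seed=0):
    """Randomly split a list of samples into a training part and a test part."""
    rnd = random.Random(seed)
    shuffled = samples[:]              # copy, leave the caller's list intact
    rnd.shuffle(shuffled)
    cut = int(round(len(shuffled) * train_frac))
    return shuffled[:cut], shuffled[cut:]

# positive (original-speech spectrograms) and negative (re-recorded) samples
# are split separately, so both classes keep the chosen train/test proportion
positives = [("original", i) for i in range(1000)]     # stands in for N1 = 1000
negatives = [("re-recorded", i) for i in range(2000)]  # stands in for N2 = 2000
pos_train, pos_test = split_samples(positives)
neg_train, neg_test = split_samples(negatives)
train_set = pos_train + neg_train
test_set = pos_test + neg_test
print(len(train_set), len(test_set))   # 2100 900
```

Splitting each class separately keeps the training and test sets balanced in the same 50-70% proportion required by the method.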
The spectrogram carries a large amount of information related to the sentence characteristics of speech; it combines the properties of the spectrum and the time-domain waveform and clearly shows how the speech spectrum changes over time. Compared with original speech, re-recorded speech has mostly gone through an extra recording and playback process, and the covert recording and playback devices inevitably re-capture and re-encode the speech signal; re-recorded speech therefore carries intrinsic attributes that differ from the original.
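As a concrete illustration of spectrogram extraction, the sketch below frames a signal, applies a Hamming window, and takes the log magnitude of the FFT of each frame; the 512-point window, 1024 Fourier points, and 256-point shift match the settings used later in the experiments, while the function name, the Hamming window, and the 16 kHz test tone are our assumptions:

```python
import numpy as np

def log_spectrogram(signal, win_len=512, n_fft=1024, hop=256):
    """Frame the signal, apply a Hamming window, FFT each frame, and
    return the log magnitude: the spectrogram image fed to the CNN."""
    window = np.hamming(win_len)
    n_frames = 1 + (len(signal) - win_len) // hop
    frames = np.stack([signal[i * hop : i * hop + win_len] * window
                       for i in range(n_frames)])
    mag = np.abs(np.fft.rfft(frames, n=n_fft, axis=1))   # (frames, n_fft//2 + 1)
    return 20.0 * np.log10(mag + 1e-10).T                # (freq bins, frames)

# 1 s of a synthetic 1 kHz tone at a 16 kHz sampling rate stands in for speech
fs = 16000
t = np.arange(fs) / fs
S = log_spectrogram(np.sin(2 * np.pi * 1000 * t))
print(S.shape)   # (513, 61): 513 frequency bins x 61 frames
```

The energy of the tone concentrates in the frequency bin nearest 1000 Hz, which is the kind of time-frequency structure the network learns to compare between original and re-recorded speech.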
Fig. 2a shows the spectrogram of a segment of original speech recorded with an Aigo R6620 voice recorder; its content is the Mandarin phrase "open sesame - I am a local tyrant - a thousand li share the moon". Fig. 2b, 2c, and 2d show the spectrograms of re-recorded speech obtained with covert recording devices Aigo R6620, iPhone6, and SONY PX440 respectively, each replayed through a Huawei AM08. Fig. 3a shows the spectrogram of another segment of original speech with the same Mandarin content recorded with the Aigo R6620 voice recorder; Fig. 3b, 3c, and 3d show the corresponding re-recordings made with covert recording devices Aigo R6620, iPhone6, and SONY PX440 respectively, each replayed through a Philips DTM3115.
3. Build the convolutional neural network training model:
The first step is to build the first convolutional layer. First, set the total number of filters in the layer; second, set the size of its convolution kernels; third, determine the layer's output through the ReLU activation, described as $c_j^{(1,p)} = f\big(x_p * k_j^{(1)} + b_j^{(1)}\big)$, where $1 \le p \le P$, P is the total number of samples in the training set, $1 \le j \le M_1$, $M_1$ is the total number of filters in the first convolutional layer, $f(\cdot)$ is the ReLU activation, $x_p$ is the p-th training sample, the symbol "*" denotes convolution, $k_j^{(1)}$ is the j-th convolution kernel of the first layer (of the set size), $b_j^{(1)}$ is its bias, and $c_j^{(1,p)}$ is the j-th feature map output by the first convolutional layer for $x_p$; each $x_p$ thus yields $M_1$ feature maps after the first convolutional layer.
The second step is to build the second convolutional layer. First, set the total number of filters in the layer; second, set the size of its convolution kernels; third, determine the layer's output through the ReLU activation, described as $c_i^{(2,p)} = f\big(\sum_{j=1}^{M_1} c_j^{(1,p)} * k_{i,j}^{(2)} + b_i^{(2)}\big)$, where $1 \le i \le M_2$, $M_2$ is the total number of filters in the second convolutional layer, $k_{i,j}^{(2)}$ is the second-layer kernel (of the set size) connecting the j-th input feature map to the i-th output, $b_i^{(2)}$ is its bias, and $c_i^{(2,p)}$ is the i-th feature map output by the second convolutional layer; each input thus yields $M_2$ feature maps after the second convolutional layer.
The third step is to build the pooling layer. First, set the size of the pooling kernel; second, determine the pooling algorithm to use; third, take the output of the second convolutional layer as the input of the pooling layer and obtain the pooling layer's output.
The fourth step is to build the fully connected layer. First, set the number of hidden nodes; second, determine the loss function to use; third, take the output of the pooling layer as the input of the fully connected layer and obtain its output. The convolutional neural network training model is now complete.
In this embodiment, in step 3, the first convolutional layer has 32 filters with kernel size 1 × 11; the second convolutional layer has 64 filters with kernel size 2 × 6; the pooling kernel size is 1 × 4 and the pooling algorithm is max pooling; the fully connected layer has 256 hidden nodes, and the loss function is the SoftMax regression function.
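A minimal forward pass with the layer sizes just listed (32 filters of 1 × 11, 64 filters of 2 × 6, 1 × 4 max pooling, 256 hidden nodes, 2-class softmax) can be sketched in plain NumPy. The toy input size and the random, untrained weights are illustrative only, and, as in most deep-learning frameworks, the "convolution" below is actually cross-correlation:

```python
import numpy as np

rng = np.random.default_rng(0)

def conv2d(x, kernels, bias):
    """'Valid' 2-D convolution (cross-correlation, as in most DL frameworks).
    x: (H, W, Cin); kernels: (kh, kw, Cin, Cout); bias: (Cout,)."""
    kh, kw, cin, cout = kernels.shape
    H, W, _ = x.shape
    out = np.empty((H - kh + 1, W - kw + 1, cout))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = x[i:i + kh, j:j + kw, :]
            out[i, j, :] = np.tensordot(patch, kernels,
                                        axes=([0, 1, 2], [0, 1, 2])) + bias
    return out

def relu(x):
    return np.maximum(x, 0.0)

def maxpool(x, ph, pw):
    H, W, C = x.shape
    Hp, Wp = H // ph, W // pw
    return x[:Hp * ph, :Wp * pw, :].reshape(Hp, ph, Wp, pw, C).max(axis=(1, 3))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# toy 8 x 64 single-channel "spectrogram" (the input size is illustrative only)
x = rng.standard_normal((8, 64, 1))
h = relu(conv2d(x, 0.1 * rng.standard_normal((1, 11, 1, 32)), np.zeros(32)))   # conv1: 32 filters, 1 x 11
h = relu(conv2d(h, 0.1 * rng.standard_normal((2, 6, 32, 64)), np.zeros(64)))   # conv2: 64 filters, 2 x 6
h = maxpool(h, 1, 4)                                                           # 1 x 4 max pooling
v = h.reshape(-1)                                                              # flatten
W1 = 0.01 * rng.standard_normal((v.size, 256))                                 # 256 hidden nodes
W2 = 0.01 * rng.standard_normal((256, 2))                                      # original vs re-recorded
p = softmax(relu(v @ W1) @ W2)
print(h.shape, p.shape)   # (7, 12, 64) (2,)
```

Tracing the shapes confirms the architecture: an 8 × 64 input shrinks to 8 × 54 × 32 after conv1, 7 × 49 × 64 after conv2, and 7 × 12 × 64 after pooling, before the fully connected layer produces the two-class softmax output.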
4. Take each sample in the test set as input to the trained convolutional neural network model and obtain the classification result: original speech or re-recorded speech.
To further illustrate the feasibility and validity of the method, experiments were carried out.
Selection of the number and size of the convolution kernels:
A convolutional neural network analyzes local features through its convolution kernels, strengthens the robustness of the extracted features through the pooling layer, and finally builds a model through the fully connected layer to obtain the classification result. In this process the convolution kernels analyze and extract the input features and have a large influence on the result. Two kernel parameters are commonly set: the kernel size and the number of kernels.
In principle, the number of convolution kernels (filters) equals the number of output feature maps: with N kernels, the output is N feature maps. As the number of filters increases, more feature maps are produced, the feature space the network can represent grows, its learning ability strengthens, and the recognition rate rises. Table 3 shows the influence of the number of filters on detection performance, and Table 4 the influence of kernel size. In Tables 3 and 4, ACC is the detection recognition rate, Loss is the loss rate, and Time is the approximate time per iteration. The experiments in Table 3 adjust the number of filters in the two convolutional layers while keeping the layer structure and all other factors fixed; those in Table 4 fix the filter numbers at 32~64, the pooling kernel size at 1 × 4, and the hidden nodes of the fully connected layer at 256, and vary the kernel sizes of the two layers. The experiment samples are 6300 original and 6300 re-recorded utterances.
Table 3 Influence of the number of filters on detection performance

Number of filters | ACC (%) | Loss  | Time/iteration
16~32             | 98.39   | 0.048 | 238 s
32~32             | 98.57   | 0.043 | 321 s
32~64             | 98.97   | 0.034 | 360 s
64~64             | 99.04   | 0.031 | 420 s
Table 4 Influence of the size of the convolution kernels on detection performance

Kernel sizes | ACC (%) | Loss  | Time/iteration
1×7 ~ 2×6    | 98.97   | 0.033 | 400 s
1×11 ~ 2×6   | 98.97   | 0.034 | 360 s
1×14 ~ 2×6   | 98.54   | 0.047 | 318 s
The data in Tables 3 and 4 show that detection performance improves as the number of filters increases: different filters extract different features from different angles, so too few filters cannot fully extract the useful information, while more filters lengthen the running time with only a marginal gain in recognition rate. In addition, as the kernel size is gradually refined the recognition rate rises, but only slightly, which indicates that kernel size has a weak influence on detection performance. Weighing these factors, a specific implementation can finally choose 32~64 filters and kernel sizes of 1 × 11 and 2 × 6.
Influence of input spectrograms under different window shifts:
A speech signal is framed, windowed, and Fourier-transformed, and its energy spectral density yields the spectrogram. Different window shifts produce different spectrograms containing different amounts of speech information. Fig. 4a shows the detection recognition rate curves for re-recorded speech with a 512-point window length, 1024 Fourier sampling points, and window shifts of 128 and 256 points; Fig. 4b shows the corresponding detection loss rates. In Fig. 4a the abscissa Epoch is the number of iterations and the ordinate Accuracy the detection recognition rate; in Fig. 4b the abscissa Epoch is the number of iterations and the ordinate Loss the detection loss rate. The experiment samples are 6300 original and 6300 re-recorded utterances; 70% are used for training and the rest for testing.
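The effect of the window shift on spectrogram size is easy to check. With the parameters above (a 512-point window on 1 s of speech at an assumed 16 kHz sampling rate), halving the shift from 256 to 128 points roughly doubles the number of frames, i.e. the time resolution of the spectrogram:

```python
def n_frames(n_samples, win_len=512, hop=256):
    # number of full analysis windows for a given window shift (hop)
    return 1 + (n_samples - win_len) // hop

print(n_frames(16000, hop=256))  # 61
print(n_frames(16000, hop=128))  # 122
```

A finer shift therefore gives the network more frames per utterance at the cost of a larger input and longer training time.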
Cross experiments:
During pirate recording, the surreptitious-recording devices and playback devices in use are of many types, and different combinations of surreptitious-recording and playback devices affect the detection result differently. The purpose of the cross experiments is to better examine the applicability of the method of the present invention. In the experiments, the pirate recordings speech obtained with one surreptitious-recording device and one playback device is used as training speech, and the pirate recordings speech obtained with any other surreptitious-recording device and playback device is used as test speech. There are 6300 original utterances and 37800 pirate recordings utterances. The test results are expressed as ACC (%). The experimental results are listed in Table 5.
From the data listed in Table 5 it can be seen that for cross experiments using the same playback device but different surreptitious-recording devices, a good detection rate is obtained, reaching more than 93%; in particular, when the playback device is a Huawei AM08 and the surreptitious-recording device is an Aigo R6620, the detection rate of pirate recordings speech reaches 99.28%. For cross experiments using different playback devices and different surreptitious-recording devices, the method of the present invention still has a certain detection effect, but the results are not as good as those obtained with the same playback device and different surreptitious-recording devices. It follows that, compared with the surreptitious-recording device, the playback device has a larger influence on pirate recordings speech.
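The cross-experiment protocol above (train on the pirate recordings from one surreptitious-recording/playback device pair, test on every other pair) can be sketched as an enumeration; the device names below are hypothetical placeholders, not the models evaluated in Table 5.

```python
from itertools import product

recorders = ["recorder_A", "recorder_B"]   # hypothetical surreptitious-recording devices
players = ["player_X", "player_Y", "player_Z"]  # hypothetical playback devices

# Each (recorder, player) pair yields one set of pirate recordings speech.
pairs = list(product(recorders, players))

# One cross experiment per ordered (training pair, test pair) combination.
experiments = [(train, test) for train in pairs for test in pairs if test != train]

print(len(pairs), len(experiments))  # 6 device pairs -> 30 cross experiments
```

Grouping the results by whether the training and test pairs share the playback device is what reveals the playback device's larger influence noted in the text.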
Table 5  Cross experiment results

Claims (2)

1. A pirate recordings speech detection method based on convolutional neural networks, characterized by comprising the following steps:
1. Build an original speech library and a pirate recordings speech library: in a quiet environment, original speech is collected from recording personnel using a collecting device, N1 original utterances of different content being collected in total; the N1 original utterances of different content form the original speech library. The surreptitious-recording process is simulated according to the real process: while the collecting device collects original speech from the recording personnel, at least two surreptitious-recording devices are used to record the same speech; then at least one playback device is used to play back the surreptitiously recorded speech, and the same collecting device is used again to collect the played-back speech, N2 pirate recordings utterances being collected in total; the N2 pirate recordings utterances form the pirate recordings speech library; wherein N1 ≥ 1000 and N2 ≥ 2N1;
2. Extract the spectrogram of each original utterance in the original speech library, and extract the spectrogram of each pirate recordings utterance in the pirate recordings speech library; then take the spectrogram of each original utterance as a positive sample and the spectrogram of each pirate recordings utterance as a negative sample; randomly select 50~70% of the N1 positive samples and 50~70% of the N2 negative samples to form the training set, and form the test set from the remaining positive samples and remaining negative samples;
3. Build the convolutional neural network framework training model:
The first step: build the first convolutional layer. First, set the total number of filters in the first convolutional layer; secondly, set the size of the convolution kernels in the first convolutional layer; thirdly, determine the relationship between the output of the first convolutional layer and the ReLU activation function, described as c_j^{(1),p} = f(x_p * k_j^{(1)} + b_j^{(1)}), wherein 1 ≤ p ≤ P, P denotes the total number of samples contained in the training set, 1 ≤ j ≤ M1, M1 denotes the total number of filters in the first convolutional layer, f(·) denotes the ReLU activation function, x_p denotes the p-th sample in the training set, the symbol "*" is the convolution operator, k_j^{(1)} denotes the j-th convolution kernel in the first convolutional layer, whose size is k^{(1)}, b_j^{(1)} denotes the bias of c_j^{(1),p}, and c_j^{(1),p} denotes the j-th feature map output by the first convolutional layer after x_p passes through it; x_p correspondingly yields M1 feature maps after the first convolutional layer;
The second step: build the second convolutional layer. First, set the total number of filters in the second convolutional layer; secondly, set the size of the convolution kernels in the second convolutional layer; thirdly, determine the relationship between the output of the second convolutional layer and the ReLU activation function, described as c_i^{(2),p} = f(Σ_{j=1}^{M1} c_j^{(1),p} * k_{i,j}^{(2)} + b_i^{(2)}), wherein 1 ≤ i ≤ M2, M2 denotes the total number of filters in the second convolutional layer, k_{i,j}^{(2)} denotes the convolution kernel in the second convolutional layer connecting the j-th first-layer feature map to the i-th second-layer feature map, whose size is k^{(2)}, b_i^{(2)} denotes the bias of c_i^{(2),p}, and c_i^{(2),p} denotes the i-th feature map output by the second convolutional layer after the first-layer feature maps pass through it; the first-layer feature maps correspondingly yield M2 feature maps after the second convolutional layer;
The third step: build the pooling layer. First, set the size of the kernel in the pooling layer; secondly, determine the pooling algorithm to be used; thirdly, take the output of the second convolutional layer as the input of the pooling layer and obtain the output of the pooling layer;
The fourth step: build the fully connected layer. First, set the number of hidden nodes in the fully connected layer; secondly, determine the loss function to be used; thirdly, take the output of the pooling layer as the input of the fully connected layer and obtain the output of the fully connected layer, thereby obtaining the convolutional neural network framework training model;
4. Take each sample in the test set as input to the convolutional neural network framework training model, and obtain the classification results of original speech and pirate recordings speech.
2. The pirate recordings speech detection method based on convolutional neural networks according to claim 1, characterized in that in step 3., the total number of filters in the first convolutional layer is 32 and the size of its convolution kernels is 1 × 11; the total number of filters in the second convolutional layer is 64 and the size of its convolution kernels is 2 × 6; the size of the kernel in the pooling layer is 1 × 4, and the pooling algorithm used is the max pooling algorithm; the number of hidden nodes in the fully connected layer is 256, and the loss function used is the SoftMax regression function.
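A minimal NumPy sketch of one forward pass through the network of claim 2 (32 filters of 1 × 11, 64 filters of 2 × 6, 1 × 4 max pooling, a 256-node fully connected layer, and a SoftMax output over the two classes). The 4 × 64 input spectrogram size and the random weights are assumptions for illustration, since the patent does not fix them, and the "*" operator is implemented as a sliding dot product (cross-correlation), as is conventional in CNNs.

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda z: np.maximum(z, 0.0)

def conv2d(x, k, b):
    """'Valid' 2-D convolution. x: (H, W, Cin); k: (kh, kw, Cin, Cout)."""
    kh, kw, _, c_out = k.shape
    H, W = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.zeros((H, W, c_out))
    for i in range(H):
        for j in range(W):
            # Contract the (kh, kw, Cin) patch against all kernels at once.
            out[i, j] = np.tensordot(x[i:i+kh, j:j+kw], k, axes=3) + b
    return out

def maxpool_1x4(x):
    """1 x 4 max pooling along the width axis."""
    H, W, C = x.shape
    W4 = W // 4
    return x[:, :W4*4].reshape(H, W4, 4, C).max(axis=2)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

x = rng.standard_normal((4, 64, 1))                  # assumed spectrogram input
k1, b1 = 0.1 * rng.standard_normal((1, 11, 1, 32)), np.zeros(32)
k2, b2 = 0.1 * rng.standard_normal((2, 6, 32, 64)), np.zeros(64)

h = relu(conv2d(x, k1, b1))                          # first convolutional layer + ReLU
h = relu(conv2d(h, k2, b2))                          # second convolutional layer + ReLU
h = maxpool_1x4(h).ravel()                           # pooling layer, then flatten
w_fc = 0.01 * rng.standard_normal((h.size, 256))     # fully connected, 256 hidden nodes
w_out = 0.01 * rng.standard_normal((256, 2))
probs = softmax(relu(h @ w_fc) @ w_out)              # original vs pirate recordings
print(probs.shape, float(probs.sum()))
```

With the assumed 4 × 64 input, the two convolutions yield 4 × 54 × 32 and 3 × 49 × 64 feature maps, and pooling reduces the width by a factor of four before the fully connected layer, matching the layer roles described in claim 1.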
CN201711323563.4A 2017-12-13 2017-12-13 A kind of pirate recordings speech detection method based on convolutional neural networks Pending CN108198561A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711323563.4A CN108198561A (en) 2017-12-13 2017-12-13 A kind of pirate recordings speech detection method based on convolutional neural networks

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711323563.4A CN108198561A (en) 2017-12-13 2017-12-13 A kind of pirate recordings speech detection method based on convolutional neural networks

Publications (1)

Publication Number Publication Date
CN108198561A true CN108198561A (en) 2018-06-22

Family

ID=62574282

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711323563.4A Pending CN108198561A (en) 2017-12-13 2017-12-13 A kind of pirate recordings speech detection method based on convolutional neural networks

Country Status (1)

Country Link
CN (1) CN108198561A (en)



Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105118503A (en) * 2015-07-13 2015-12-02 中山大学 Ripped audio detection method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
XIAODAN LIN et al.: "Audio Recapture Detection With Convolutional Neural Networks", IEEE Transactions on Multimedia *

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109326294A (en) * 2018-09-28 2019-02-12 杭州电子科技大学 A kind of relevant vocal print key generation method of text
CN109326294B (en) * 2018-09-28 2022-09-20 杭州电子科技大学 Text-related voiceprint key generation method
CN109599117A (en) * 2018-11-14 2019-04-09 厦门快商通信息技术有限公司 A kind of audio data recognition methods and human voice anti-replay identifying system
CN109801638A (en) * 2019-01-24 2019-05-24 平安科技(深圳)有限公司 Speech verification method, apparatus, computer equipment and storage medium
CN109801638B (en) * 2019-01-24 2023-10-13 平安科技(深圳)有限公司 Voice verification method, device, computer equipment and storage medium
CN109872720A (en) * 2019-01-29 2019-06-11 广东技术师范学院 It is a kind of that speech detection algorithms being rerecorded to different scenes robust based on convolutional neural networks
CN110223676A (en) * 2019-06-14 2019-09-10 苏州思必驰信息科技有限公司 The optimization method and system of deception recording detection neural network model
CN110491391A (en) * 2019-07-02 2019-11-22 厦门大学 A kind of deception speech detection method based on deep neural network
CN110459225A (en) * 2019-08-14 2019-11-15 南京邮电大学 A kind of speaker identification system based on CNN fusion feature
CN112270931A (en) * 2020-10-22 2021-01-26 江西师范大学 Method for carrying out deceptive voice detection based on twin convolutional neural network
CN113646833A (en) * 2021-07-14 2021-11-12 东莞理工学院 Voice confrontation sample detection method, device, equipment and computer readable storage medium

Similar Documents

Publication Publication Date Title
CN108198561A (en) A kind of pirate recordings speech detection method based on convolutional neural networks
CN109065030B (en) Convolutional neural network-based environmental sound identification method and system
CN108231067A (en) Sound scenery recognition methods based on convolutional neural networks and random forest classification
CN108711436B (en) Speaker verification system replay attack detection method based on high frequency and bottleneck characteristics
CN104732978B (en) The relevant method for distinguishing speek person of text based on combined depth study
CN104900235B (en) Method for recognizing sound-groove based on pitch period composite character parameter
CN105788592A (en) Audio classification method and apparatus thereof
CN107507625B (en) Sound source distance determining method and device
CN101923855A (en) Test-irrelevant voice print identifying system
CN102723079B (en) Music and chord automatic identification method based on sparse representation
CN113823293B (en) Speaker recognition method and system based on voice enhancement
CN109378014A (en) A kind of mobile device source discrimination and system based on convolutional neural networks
CN102982351A (en) Porcelain insulator vibrational acoustics test data sorting technique based on back propagation (BP) neural network
CN109872720A (en) It is a kind of that speech detection algorithms being rerecorded to different scenes robust based on convolutional neural networks
CN104221079A (en) Modified Mel filter bank structure using spectral characteristics for sound analysis
CN105513598A (en) Playback voice detection method based on distribution of information quantity in frequency domain
CN108766464A (en) Digital audio based on mains frequency fluctuation super vector distorts automatic testing method
CN111081223A (en) Voice recognition method, device, equipment and storage medium
CN111508524A (en) Method and system for identifying voice source equipment
CN111402922B (en) Audio signal classification method, device, equipment and storage medium based on small samples
CN117419915A (en) Motor fault diagnosis method for multi-source information fusion
CN110136746B (en) Method for identifying mobile phone source in additive noise environment based on fusion features
CN112786057B (en) Voiceprint recognition method and device, electronic equipment and storage medium
CN116705063B (en) Manifold measurement-based multi-model fusion voice fake identification method
CN113936667A (en) Bird song recognition model training method, recognition method and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20180622