CN108198561A - Re-recorded speech detection method based on convolutional neural networks - Google Patents
- Publication number: CN108198561A (application CN201711323563.4A)
- Authority: CN (China)
- Prior art keywords: layer, speech, re-recorded speech, convolutional layer, original speech
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G10L17/00: Speaker identification or verification techniques
- G10L17/02: Preprocessing operations, e.g. segment selection; pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; feature selection or extraction
- G10L17/04: Training, enrolment or model building
- G10L17/18: Artificial neural networks; connectionist approaches
- G10L25/18: Speech or voice analysis techniques characterised by the extracted parameters being spectral information of each sub-band

(All within section G, PHYSICS; class G10L, speech analysis or synthesis.)
Abstract
The invention discloses a re-recorded speech detection method based on convolutional neural networks. First, an original speech library and a re-recorded speech library are built. The spectrogram of each original speech sample is then extracted as a positive sample and the spectrogram of each re-recorded sample as a negative sample; a portion of the positive and negative samples forms the training set, and the remainder forms the test set. A convolutional neural network model is then trained on the training set. Finally, each sample in the test set is fed to the trained model to obtain a classification result. The advantage of the method is that, without being restricted to fixed text, it achieves high detection accuracy for re-recordings made with a variety of covert recording devices.
Description
Technical field
The present invention relates to speech detection techniques, and more particularly to a re-recorded speech detection method based on convolutional neural networks.
Background art
With the continuous development of the Internet and the rapid spread of portable smart terminals, people can conveniently transmit information through various digital media such as images, audio, and video. At the same time, as playback devices and high-fidelity recording devices become widespread, an attacker can easily capture a legitimate user's passphrase covertly while the user requests access to an authentication system. Re-recorded speech, captured by a high-fidelity recording device and replayed through a playback device, is highly similar to the original speech; some speaker verification systems cannot tell them apart, which compromises the rights of legitimate users. Because the recording devices are small, easy to conceal, and record with a high success rate, replaying re-recorded speech has become the easiest way to attack a voice authentication system. Re-recorded speech detection has therefore received wide attention in the industry.
In recent years, research on re-recorded speech detection has achieved some results.
The first class of methods exploits the randomness of speech production by comparing the peakmaps of original and re-recorded speech (Shang W, Stevenson M. A playback attack detector for speaker verification systems [C] // International Symposium on Communications, Control and Signal Processing. IEEE, 2008: 1144-1149). The proposed playback detection algorithm judges a signal as re-recorded if its peakmap similarity to a stored recording exceeds a set threshold, and as original otherwise. Building on this, later work improved the algorithm by adding the positional relationships of individual frequency peaks to the peakmap feature, deciding whether the speech to be verified is legitimate from its similarity to the original speech in this feature. These methods apply only to text-dependent verification systems and cannot be used for text-independent re-recorded speech detection, which greatly limits them.
The second class of methods relies on channel-pattern features, exploiting the difference between the channel of re-recorded speech and that of original speech. One algorithm detects re-recordings from the MFCCs (Mel-frequency cepstral coefficients) of silent segments: it models the original speech channel from the silent segments of the original data and then checks whether the channel of the speech under test matches that of the training speech, thereby deciding whether a replay attack has occurred. Another algorithm extracts channel-pattern noise arising from the different channels that original and re-recorded speech pass through, and obtains good classification results with an SVM (support vector machine). A third algorithm, based on the effect of the high-fidelity recording channel on the speech spectrum, proposes a detection method using long-window scale factors. These methods can only detect recordings from a single device; they do not analyse multiple covert recording and playback devices, and the channel-pattern noise extracted by the second algorithm is also inaccurate.
At present, most work on re-recorded speech detection targets a single covert recording device and playback device; detection of re-recordings from multiple recording devices has received little attention. In real life, however, high-fidelity recording devices such as voice recorders and smartphones are everywhere. They are easy to carry, inconspicuous, produce re-recordings highly similar to the original speech, and are currently the mainstream covert recording devices. It is therefore necessary to study re-recorded speech detection for a variety of recording devices.
Summary of the invention
The technical problem to be solved by the invention is to provide a re-recorded speech detection method based on convolutional neural networks that, without being restricted to fixed text, achieves high detection accuracy for re-recordings made with a variety of covert recording devices.
The technical solution adopted by the invention to solve the above problem is a re-recorded speech detection method based on convolutional neural networks, characterised by the following steps:
1. Build the original speech library and the re-recorded speech library. In a quiet environment, record the speakers with a capture device, collecting N1 original speech samples of different content; these N1 samples form the original speech library. Simulating a real covert-recording scenario, while the capture device records the speakers, record them simultaneously with at least two covert recording devices, replay the covert recordings through at least one playback device, and capture the replayed audio again with the same capture device, collecting N2 re-recorded samples in total; these N2 samples form the re-recorded speech library. Here N1 ≥ 1000 and N2 ≥ 2N1.
2. Extract the spectrogram of each original speech sample in the original speech library and of each re-recorded sample in the re-recorded speech library. Take each original-speech spectrogram as a positive sample and each re-recorded spectrogram as a negative sample. Randomly select 50-70% of the N1 positive samples and 50-70% of the N2 negative samples to form the training set; the remaining positive and negative samples form the test set.
3. Build the convolutional neural network model:
Step one: build the first convolutional layer. First, set the total number of filters in the first convolutional layer; second, set the size of its convolution kernels; third, define the output of the layer through the ReLU activation function as h_{p,j}^(1) = f(x_p * k_j^(1) + b_j^(1)), where 1 ≤ p ≤ P and P is the total number of samples in the training set; 1 ≤ j ≤ M1 and M1 is the total number of filters in the first convolutional layer; f(·) is the ReLU activation function; x_p is the p-th sample in the training set; the symbol "*" denotes convolution; k_j^(1) is the j-th convolution kernel of the first layer and b_j^(1) its bias; and h_{p,j}^(1) is the j-th feature map output by the first convolutional layer for x_p. Each x_p thus yields M1 feature maps after the first convolutional layer.
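The first-layer relation just described, one feature map per filter computed as the ReLU of a convolution plus a bias, can be illustrated with a toy NumPy sketch; the input and kernel values here are arbitrary, and only the mechanics follow the text:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def conv2d_valid(x, k):
    """Naive 'valid' 2-D cross-correlation, as used in CNN convolutional layers."""
    H, W = x.shape
    kh, kw = k.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            out[r, c] = np.sum(x[r:r + kh, c:c + kw] * k)
    return out

def conv_layer(x, kernels, biases):
    """One feature map h_j = ReLU(x * k_j + b_j) per filter."""
    return [relu(conv2d_valid(x, k) + b) for k, b in zip(kernels, biases)]
```

With M1 kernels, the layer returns M1 feature maps, matching the statement that each x_p yields M1 maps.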
Step two: build the second convolutional layer. First, set the total number of filters in the second convolutional layer; second, set the size of its convolution kernels; third, define the output of the layer through the ReLU activation function as h_{p,i}^(2) = f(h_p^(1) * k_i^(2) + b_i^(2)), where 1 ≤ i ≤ M2 and M2 is the total number of filters in the second convolutional layer; k_i^(2) is the i-th convolution kernel of the second layer and b_i^(2) its bias; and h_{p,i}^(2) is the i-th feature map output by the second convolutional layer. Each input thus yields M2 feature maps after the second convolutional layer.
Step three: build the pooling layer. First, set the size of the pooling kernel; second, choose the pooling algorithm; third, take the output of the second convolutional layer as the input of the pooling layer and compute the pooling-layer output.
Step four: build the fully connected layer. First, set the number of hidden nodes in the fully connected layer; second, choose the loss function; third, take the output of the pooling layer as the input of the fully connected layer and compute its output. This completes the convolutional neural network model.
4. Feed each sample in the test set to the trained convolutional neural network model to obtain its classification as original or re-recorded speech.
In step 3, the first convolutional layer has 32 filters with kernels of size 1 × 11; the second convolutional layer has 64 filters with kernels of size 2 × 6; the pooling kernel is 1 × 4 and the pooling algorithm is max pooling; the fully connected layer has 256 hidden nodes, and the loss function is the SoftMax regression function.
Compared with the prior art, the advantages of the invention are as follows:
1) the method for the present invention is by obtaining the sound spectrograph of raw tone and pirate recordings voice, and using a part of raw tone with
The sound spectrograph of pirate recordings voice has built the convolutional neural networks frame training pattern suitable for detection pirate recordings voice so that the present invention
Method is respectively provided with higher Detection accuracy in the case where not limited by text, for a variety of pirate recordings voices for using a hidden recorder equipment.
2) the method for the present invention is during convolutional neural networks frame training pattern is built, it is contemplated that the network number of plies, filter
The influence of the number of wave device and the size of convolution kernel to recognition effect, has weighed processing time and space complexity, establishes inspection
Survey the network number of plies of best results and network parameter setting.
3) the method for the present invention is verified through cross-over experiment, a kind of is used a hidden recorder and the situation of the pirate recordings voice of playback apparatus known
Under, have preferable discrimination for the pirate recordings voice in other sources, can effectively identify raw tone and it is a variety of use a hidden recorder and
The pirate recordings voice of playback apparatus, and experiment show that Detection accuracy has reached 99.26%.
4) the method for the present invention can detect it is a variety of use a hidden recorder and the pirate recordings voice of playback apparatus, more tally with the actual situation, have
Higher realistic meaning.
Description of the drawings
Fig. 1 is the overall block diagram of the method of the present invention;
Fig. 2a is the spectrogram of a segment of original speech recorded with an Aigo R6620 voice recorder;
Fig. 2b is the spectrogram of the re-recording obtained with an Aigo R6620 as the covert recording device and a Huawei AM08 as the playback device;
Fig. 2c is the spectrogram of the re-recording obtained with an iPhone 6 as the covert recording device and a Huawei AM08 as the playback device;
Fig. 2d is the spectrogram of the re-recording obtained with a SONY PX440 as the covert recording device and a Huawei AM08 as the playback device;
Fig. 3a is the spectrogram of another segment of original speech recorded with an Aigo R6620 voice recorder;
Fig. 3b is the spectrogram of the re-recording obtained with an Aigo R6620 as the covert recording device and a Philips DTM3115 as the playback device;
Fig. 3c is the spectrogram of the re-recording obtained with an iPhone 6 as the covert recording device and a Philips DTM3115 as the playback device;
Fig. 3d is the spectrogram of the re-recording obtained with a SONY PX440 as the covert recording device and a Philips DTM3115 as the playback device;
Fig. 4a shows the detection accuracy curves for re-recorded speech with the window length set to 512 points, 1024 Fourier points, and window shifts of 128 and 256 points;
Fig. 4b shows the detection loss curves for re-recorded speech under the same settings.
Detailed description of embodiments
The invention is described in further detail below with reference to the accompanying drawings and embodiments.
Deep learning essentially builds machine-learning models with many hidden layers and trains them on large-scale data to obtain numerous, highly representative features, so that samples can be classified and predicted with improved accuracy. Compared with hand-engineered feature extraction, the features obtained by a deep learning model reveal the rich internal information of big data and are more representative. Convolutional neural networks can extract the features hidden in massive data samples, which is why they are widely used across the fields of pattern recognition. The present invention therefore uses convolutional neural networks to detect re-recorded speech.
The overall block diagram of the re-recorded speech detection method based on convolutional neural networks proposed by the present invention is shown in Fig. 1. The method comprises the following steps:
1. Build the original speech library and the re-recorded speech library. In a quiet environment, record the speakers with a capture device, collecting N1 original speech samples of different content; these N1 samples form the original speech library. Simulating a real covert-recording scenario, while the capture device records the speakers, record them simultaneously with at least two covert recording devices, replay the covert recordings through at least one playback device, and capture the replayed audio again with the same capture device, collecting N2 re-recorded samples in total; these N2 samples form the re-recorded speech library. Here N1 ≥ 1000 and N2 ≥ 2N1.
While the capture device records the original speech, each speaker reads the corpus in their natural speaking style, with the capture device about 20 cm away; the covert recording devices are about 70 cm from the speaker; and the capture device that records the replayed audio is about 20 cm from the playback device. The capture, covert recording, and playback devices can all be existing high-fidelity recording equipment; an ordinary loudspeaker, for example, can serve as the playback device.
Table 1 lists the capture, covert recording, and playback devices used in this embodiment, and Table 2 gives details of the original and re-recorded speech obtained.
Table 1: capture, covert recording, and playback devices used in this embodiment
Table 2: details of the original and re-recorded speech obtained in this embodiment
2. Using existing techniques, extract the spectrogram of each original speech sample in the original speech library and of each re-recorded sample in the re-recorded speech library. Take each original-speech spectrogram as a positive sample and each re-recorded spectrogram as a negative sample. Randomly select 50-70% of the N1 positive samples and 50-70% of the N2 negative samples to form the training set; the remaining positive and negative samples form the test set.
A spectrogram contains a great deal of information related to the characteristics of the utterance: it combines the spectrum with the time-domain waveform and clearly shows how the speech spectrum changes over time. Compared with original speech, re-recorded speech has mostly gone through an extra recording and playback pass, and the covert recording and playback devices inevitably re-sample and re-encode the signal. Re-recorded speech therefore carries intrinsic attributes that differ from those of original speech.
Fig. 2a shows the spectrogram of a segment of original speech recorded with an Aigo R6620 voice recorder; its content is the Mandarin reading "Open sesame, I am a tycoon, a thousand miles apart sharing the same moon". Fig. 2b shows the spectrogram of the re-recording obtained with an Aigo R6620 as the covert recording device and a Huawei AM08 as the playback device; Fig. 2c shows the one obtained with an iPhone 6 and a Huawei AM08; Fig. 2d shows the one obtained with a SONY PX440 and a Huawei AM08. Fig. 3a shows the spectrogram of another segment of original speech recorded with the Aigo R6620, reading the same Mandarin content; Fig. 3b shows the re-recording obtained with an Aigo R6620 and a Philips DTM3115 playback device; Fig. 3c shows the one obtained with an iPhone 6 and a Philips DTM3115; Fig. 3d shows the one obtained with a SONY PX440 and a Philips DTM3115.
3. Build the convolutional neural network model:
Step one: build the first convolutional layer. First, set the total number of filters in the first convolutional layer; second, set the size of its convolution kernels; third, define the output of the layer through the ReLU activation function as h_{p,j}^(1) = f(x_p * k_j^(1) + b_j^(1)), where 1 ≤ p ≤ P and P is the total number of samples in the training set; 1 ≤ j ≤ M1 and M1 is the total number of filters in the first convolutional layer; f(·) is the ReLU activation function; x_p is the p-th sample in the training set; the symbol "*" denotes convolution; k_j^(1) is the j-th convolution kernel of the first layer and b_j^(1) its bias; and h_{p,j}^(1) is the j-th feature map output by the first convolutional layer for x_p. Each x_p thus yields M1 feature maps after the first convolutional layer.
Step two: build the second convolutional layer. First, set the total number of filters in the second convolutional layer; second, set the size of its convolution kernels; third, define the output of the layer through the ReLU activation function as h_{p,i}^(2) = f(h_p^(1) * k_i^(2) + b_i^(2)), where 1 ≤ i ≤ M2 and M2 is the total number of filters in the second convolutional layer; k_i^(2) is the i-th convolution kernel of the second layer and b_i^(2) its bias; and h_{p,i}^(2) is the i-th feature map output by the second convolutional layer. Each input thus yields M2 feature maps after the second convolutional layer.
Step three: build the pooling layer. First, set the size of the pooling kernel; second, choose the pooling algorithm; third, take the output of the second convolutional layer as the input of the pooling layer and compute the pooling-layer output.
Step four: build the fully connected layer. First, set the number of hidden nodes in the fully connected layer; second, choose the loss function; third, take the output of the pooling layer as the input of the fully connected layer and compute its output. This completes the convolutional neural network model.
In this embodiment, in step 3 the first convolutional layer has 32 filters with kernels of size 1 × 11; the second convolutional layer has 64 filters with kernels of size 2 × 6; the pooling kernel is 1 × 4 and the pooling algorithm is max pooling; the fully connected layer has 256 hidden nodes, and the loss function is the SoftMax regression function.
4. Feed each sample in the test set to the trained convolutional neural network model to obtain its classification as original or re-recorded speech.
To further demonstrate the feasibility and effectiveness of the method of the present invention, it was tested experimentally.
Choice of the number and size of the convolution kernels:
A convolutional neural network analyses local features with its convolution kernels, strengthens the robustness of the extracted features in the pooling layer, and finally builds a model through the fully connected layer to obtain the classification result. In this process, the convolvers analyse and extract the input features and have a large influence on the classification results. Two convolver parameters are commonly tuned: the size of the kernels and their number.
In principle, the number of convolution kernels (filters) equals the number of output feature maps: if there are N convolvers, N feature maps are output. As the number of filters grows, more feature maps are produced, the feature space the network can represent becomes larger, its learning capacity increases, and the recognition rate rises. Table 3 shows the influence of the number of filters on detection performance, and Table 4 the influence of kernel size. In Tables 3 and 4, ACC is the detection accuracy, Loss the loss rate, and Time the approximate time per iteration. The experiments in Table 3 keep the layer structure and all other parameters fixed and vary the filter counts of the two convolutional layers; those in Table 4 fix the filter counts at 32 and 64, the pooling kernel at 1 × 4, and the hidden nodes at 256, and vary the kernel sizes of the two convolutional layers. The experimental samples are 6300 original and 6300 re-recorded speech segments.
Table 3: influence of the number of filters on detection performance
Number of filters | ACC (%) | Loss | Time/iteration
16 ~ 32 | 98.39 | 0.048 | 238 s
32 ~ 32 | 98.57 | 0.043 | 321 s
32 ~ 64 | 98.97 | 0.034 | 360 s
64 ~ 64 | 99.04 | 0.031 | 420 s
Table 4: Influence of the kernel size on detection performance

Kernel size (layer 1~layer 2) | ACC (%) | Loss | Time/iteration
---|---|---|---
1 × 7~2 × 6 | 98.97 | 0.033 | 400 s
1 × 11~2 × 6 | 98.97 | 0.034 | 360 s
1 × 14~2 × 6 | 98.54 | 0.047 | 318 s
The data in Tables 3 and 4 show that detection performance improves as the number of filters grows: different filters extract different features from different angles, so with too few filters the useful information cannot be fully extracted, while with more filters the running time increases but the gain in recognition rate becomes marginal. In addition, as the kernel size is gradually refined the recognition rate rises, but only slightly, which indicates that the kernel size has a weak influence on detection performance. Taking all of this into account, a practical implementation can finally choose 32 and 64 filters for the two layers, with kernel sizes of 1 × 11 and 2 × 6.
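As a concrete illustration of "N kernels produce N feature maps," here is a minimal numpy sketch of a ReLU convolutional layer with 32 kernels of size 1 × 11. This is a toy reconstruction for exposition only, not the patented implementation; the input shape and random weights are assumptions.

```python
import numpy as np

def conv_layer(x, kernels, biases):
    """Valid 2-D convolution of input x (H, W) with a bank of kernels.
    N kernels produce N output feature maps, followed by ReLU."""
    kh, kw = kernels.shape[1], kernels.shape[2]
    H, W = x.shape
    out = np.zeros((kernels.shape[0], H - kh + 1, W - kw + 1))
    for n, (k, b) in enumerate(zip(kernels, biases)):
        for i in range(out.shape[1]):
            for j in range(out.shape[2]):
                out[n, i, j] = np.sum(x[i:i + kh, j:j + kw] * k) + b
    return np.maximum(out, 0.0)  # ReLU activation

rng = np.random.default_rng(0)
x = rng.standard_normal((16, 64))        # toy spectrogram patch (assumed shape)
k32 = rng.standard_normal((32, 1, 11))   # 32 kernels of size 1 x 11
maps = conv_layer(x, k32, np.zeros(32))
print(maps.shape)  # (32, 16, 54) -- one feature map per kernel
```

Doubling the number of kernels doubles the output feature maps (and the per-iteration cost), matching the Time column of Table 3.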
Influence of input spectrograms under different window shifts:
A speech signal is framed, windowed, and Fourier-transformed, and its energy spectral density is computed to obtain the spectrogram. Different window shifts produce different spectrograms containing different amounts of speech information. Fig. 4a gives the detection recognition-rate curves for pirate-recorded speech with a window length of 512 points, 1024 Fourier points, and window shifts of 128 and 256 points; Fig. 4b gives the corresponding detection loss rates under the same settings (window length 512 points, 1024 Fourier points, window shifts of 128 and 256 points). In Fig. 4a the abscissa Epoch is the number of iterations and the ordinate Accuracy is the detection recognition rate; in Fig. 4b the abscissa Epoch is the number of iterations and the ordinate Loss is the detection loss rate. The experiment uses 6300 original utterances and 6300 pirate-recorded utterances, with 70% used for training and the rest for testing.
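The spectrogram settings above (512-point window, 1024-point FFT, window shifts of 128 and 256 points) can be sketched in numpy. This is an illustrative reconstruction; the Hamming window, the log-energy scaling, and the 16 kHz sampling rate are assumptions not stated in the text.

```python
import numpy as np

def spectrogram(sig, win_len=512, n_fft=1024, hop=128):
    """Frame, window, and Fourier-transform the signal; the log energy
    spectral density of the frames forms the spectrogram."""
    win = np.hamming(win_len)
    n_frames = 1 + (len(sig) - win_len) // hop
    frames = np.stack([sig[i * hop : i * hop + win_len] * win
                       for i in range(n_frames)])
    spec = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2  # energy spectrum
    return np.log(spec + 1e-10)

# 1 s of a 1 kHz tone at an assumed 16 kHz sampling rate
sig = np.sin(2 * np.pi * 1000 * np.arange(16000) / 16000)
s128 = spectrogram(sig, hop=128)   # window shift 128 points
s256 = spectrogram(sig, hop=256)   # window shift 256 points
print(s128.shape, s256.shape)      # (122, 513) (61, 513)
```

A smaller window shift yields more frames (a finer time resolution) for the same signal, which is why the two shifts produce spectrograms carrying different amounts of speech information.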
Cross experiments:
Many kinds of covert recording devices and playback devices are used in recapturing, and different covert recording and playback devices influence the detection result differently. The purpose of the cross experiments is to better examine the applicability of the method of the present invention. In each experiment, the pirate-recorded speech obtained with one covert recording device and one playback device is used as the training speech, and the pirate-recorded speech obtained with any other covert recording device and playback device is used as the test speech. There are 6300 original utterances and 37800 pirate-recorded utterances. The detection result is reported as ACC (%). The experimental results are listed in Table 5.
The data in Table 5 show that cross experiments with the same playback device but different covert recording devices achieve good detection rates, above 93%; in particular, when the playback device is a Huawei AM08 and the covert recording device is an Aigo R6620, the detection rate of pirate-recorded speech reaches 99.28%. Cross experiments with different playback devices and different covert recording devices still show some detection effect, but the results are not as good as those with the same playback device and different covert recording devices. It follows that, compared with the covert recording device, the playback device has the greater influence on pirate-recorded speech.
Table 5: Cross-experiment results
Claims (2)
1. A pirate-recorded speech detection method based on a convolutional neural network, characterized by comprising the following steps:
① Build an original-speech library and a pirate-recorded speech library: in a quiet environment, use a collecting device to record the recording personnel and collect a total of N1 original utterances of different content; these N1 utterances form the original-speech library. Simulating the real covert-recording process, while the collecting device captures the original speech, use at least two covert recording devices to record the personnel covertly; then use at least one playback device to play back the covertly recorded speech, and use the same collecting device to capture the played-back speech, collecting a total of N2 pirate-recorded utterances; these N2 utterances form the pirate-recorded speech library. Here N1 ≥ 1000 and N2 ≥ 2N1.
② Extract the spectrogram of every original utterance in the original-speech library and of every pirate-recorded utterance in the pirate-recorded speech library; take each original utterance's spectrogram as a positive sample and each pirate-recorded utterance's spectrogram as a negative sample; then randomly select 50–70% of the N1 positive samples and 50–70% of the N2 negative samples to form the training set, and form the test set from the remaining positive and negative samples.
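The random 50–70% per-class split described above can be sketched with the Python standard library. This is an illustrative sketch, not the patented procedure; the 70% ratio and the 6300/12600 sample counts are taken from the experiments, and the fixed seed is an assumption for reproducibility.

```python
import random

def split_samples(positives, negatives, train_frac=0.7, seed=1):
    """Randomly draw train_frac of each class for training; the remaining
    positive and negative samples form the test set."""
    rnd = random.Random(seed)
    train, test = [], []
    for label, pool in ((1, positives), (0, negatives)):
        pool = list(pool)
        rnd.shuffle(pool)
        cut = round(train_frac * len(pool))
        train += [(x, label) for x in pool[:cut]]
        test += [(x, label) for x in pool[cut:]]
    return train, test

# e.g. 6300 positive (original) and 12600 negative (pirate-recorded) samples
train, test = split_samples(range(6300), range(12600))
print(len(train), len(test))  # 13230 5670
```

Drawing the fraction per class (rather than from the pooled samples) keeps the class ratio the same in the training and test sets even when N2 ≥ 2N1.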
③ Build the convolutional-neural-network training model:
First step, build the first convolutional layer: first, set the total number of filters in the first convolutional layer; second, set the size of its convolution kernels; third, determine the relationship between the output of the first convolutional layer and the ReLU activation function, described as y_{p,j}^{(1)} = f(x_p * w_j^{(1)} + b_j^{(1)}), where 1 ≤ p ≤ P and P is the total number of samples in the training set; 1 ≤ j ≤ M1 and M1 is the total number of filters in the first convolutional layer; f(·) is the ReLU activation function; x_p is the p-th sample in the training set; the symbol "*" is the convolution operator; w_j^{(1)} is the j-th convolution kernel of the first convolutional layer, whose size k^{(1)} is the set kernel size; b_j^{(1)} is the bias of y_{p,j}^{(1)}; and y_{p,j}^{(1)} is the j-th feature map that the first convolutional layer outputs for x_p, so that x_p yields M1 feature maps after the first convolutional layer.
Second step, build the second convolutional layer: first, set the total number of filters in the second convolutional layer; second, set the size of its convolution kernels; third, determine the relationship between the output of the second convolutional layer and the ReLU activation function, described as y_{p,i}^{(2)} = f(Σ_{j=1}^{M1} y_{p,j}^{(1)} * w_i^{(2)} + b_i^{(2)}), where 1 ≤ i ≤ M2 and M2 is the total number of filters in the second convolutional layer; w_i^{(2)} is the i-th convolution kernel of the second convolutional layer, whose size k^{(2)} is the set kernel size; b_i^{(2)} is the bias of y_{p,i}^{(2)}; and y_{p,i}^{(2)} is the i-th feature map output by the second convolutional layer, so that M2 feature maps are obtained after the second convolutional layer.
Third step, build the pooling layer: first, set the size of the pooling kernel; second, determine the pooling algorithm to use; third, take the output of the second convolutional layer as the input of the pooling layer and obtain the pooling layer's output.
Fourth step, build the fully connected layer: first, set the number of hidden nodes in the fully connected layer; second, determine the loss function to use; third, take the output of the pooling layer as the input of the fully connected layer and obtain its output, at which point the convolutional-neural-network training model is obtained.
④ Take each sample in the test set as input to the convolutional-neural-network training model and obtain the classification result: original speech or pirate-recorded speech.
2. The pirate-recorded speech detection method based on a convolutional neural network according to claim 1, characterized in that in step ③ the total number of filters in the first convolutional layer is 32 and the size of its convolution kernels is 1 × 11; the total number of filters in the second convolutional layer is 64 and the size of its convolution kernels is 2 × 6; the size of the pooling kernel is 1 × 4, and the pooling algorithm used is max pooling; the number of hidden nodes in the fully connected layer is 256, and the loss function used is the SoftMax regression function.
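The 1 × 4 max pooling and SoftMax loss specified in claim 2 can be illustrated with a small numpy sketch. This is an illustrative reconstruction for exposition, not the patented implementation.

```python
import numpy as np

def max_pool_1x4(fmap):
    """Non-overlapping 1 x 4 max pooling along the second axis of a feature map,
    as specified for the pooling layer (kernel size 1 x 4, max pooling)."""
    h, w = fmap.shape
    w4 = w - w % 4                      # drop any incomplete trailing group
    return fmap[:, :w4].reshape(h, w4 // 4, 4).max(axis=2)

def softmax_loss(logits, label):
    """SoftMax cross-entropy for the two-class original / pirate-recorded decision."""
    z = logits - logits.max()           # shift for numerical stability
    p = np.exp(z) / np.exp(z).sum()
    return -np.log(p[label] + 1e-12)

fmap = np.arange(12.0).reshape(1, 12)
print(max_pool_1x4(fmap))               # [[ 3.  7. 11.]]
print(softmax_loss(np.array([2.0, 0.0]), 0))
```

Max pooling keeps only the strongest response in each 1 × 4 group, which is the "reinforced robustness" of the extracted features described in the specification.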
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711323563.4A CN108198561A (en) | 2017-12-13 | 2017-12-13 | A kind of pirate recordings speech detection method based on convolutional neural networks |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108198561A true CN108198561A (en) | 2018-06-22 |
Family
ID=62574282
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201711323563.4A Pending CN108198561A (en) | 2017-12-13 | 2017-12-13 | A kind of pirate recordings speech detection method based on convolutional neural networks |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108198561A (en) |
- 2017-12-13: CN application CN201711323563.4A filed; publication CN108198561A (en); status Pending
Patent Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105118503A (en) * | 2015-07-13 | 2015-12-02 | 中山大学 | Ripped audio detection method |
Non-Patent Citations (1)
Title |
---|
XIAODAN LIN et al.: "Audio Recapture Detection With Convolutional Neural Networks", IEEE Transactions on Multimedia * |
Cited By (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109326294A (en) * | 2018-09-28 | 2019-02-12 | 杭州电子科技大学 | A text-related voiceprint key generation method |
CN109326294B (en) * | 2018-09-28 | 2022-09-20 | 杭州电子科技大学 | Text-related voiceprint key generation method |
CN109599117A (en) * | 2018-11-14 | 2019-04-09 | 厦门快商通信息技术有限公司 | An audio data recognition method and human-voice anti-replay identification system |
CN109801638A (en) * | 2019-01-24 | 2019-05-24 | 平安科技(深圳)有限公司 | Speech verification method, apparatus, computer equipment, and storage medium |
CN109801638B (en) * | 2019-01-24 | 2023-10-13 | 平安科技(深圳)有限公司 | Voice verification method, device, computer equipment and storage medium |
CN109872720A (en) * | 2019-01-29 | 2019-06-11 | 广东技术师范学院 | A re-recorded speech detection algorithm robust to different scenes, based on convolutional neural networks |
CN110223676A (en) * | 2019-06-14 | 2019-09-10 | 苏州思必驰信息科技有限公司 | Optimization method and system for a deception-recording detection neural network model |
CN110491391A (en) * | 2019-07-02 | 2019-11-22 | 厦门大学 | A deception speech detection method based on a deep neural network |
CN110459225A (en) * | 2019-08-14 | 2019-11-15 | 南京邮电大学 | A speaker recognition system based on CNN fusion features |
CN112270931A (en) * | 2020-10-22 | 2021-01-26 | 江西师范大学 | Method for deceptive voice detection based on a twin convolutional neural network |
CN113646833A (en) * | 2021-07-14 | 2021-11-12 | 东莞理工学院 | Voice adversarial sample detection method, device, equipment, and computer-readable storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108198561A (en) | A pirate-recorded speech detection method based on a convolutional neural network | |
CN109065030B (en) | Environmental sound recognition method and system based on a convolutional neural network | |
CN108231067A (en) | Sound scene recognition method based on a convolutional neural network and random forest classification | |
CN108711436B (en) | Speaker verification system replay attack detection method based on high-frequency and bottleneck features | |
CN104732978B (en) | Text-dependent speaker recognition method based on combined deep learning | |
CN104900235B (en) | Voiceprint recognition method based on pitch-period composite feature parameters | |
CN105788592A (en) | Audio classification method and apparatus | |
CN107507625B (en) | Sound source distance determination method and device | |
CN101923855A (en) | Text-independent voiceprint identification system | |
CN102723079B (en) | Automatic music and chord identification method based on sparse representation | |
CN113823293B (en) | Speaker recognition method and system based on speech enhancement | |
CN109378014A (en) | Mobile device source identification method and system based on convolutional neural networks | |
CN102982351A (en) | Porcelain insulator vibro-acoustic test data classification method based on a back-propagation (BP) neural network | |
CN109872720A (en) | A re-recorded speech detection algorithm robust to different scenes, based on convolutional neural networks | |
CN104221079A (en) | Modified Mel filter bank structure using spectral characteristics for sound analysis | |
CN105513598A (en) | Replayed speech detection method based on the frequency-domain distribution of information quantity | |
CN108766464A (en) | Automatic digital audio tampering detection method based on mains-frequency-fluctuation supervectors | |
CN111081223A (en) | Speech recognition method, device, equipment, and storage medium | |
CN111508524A (en) | Method and system for identifying speech source equipment | |
CN111402922B (en) | Audio signal classification method, device, and equipment based on small samples | |
CN117419915A (en) | Motor fault diagnosis method based on multi-source information fusion | |
CN110136746B (en) | Mobile phone source identification method in additive-noise environments based on fusion features | |
CN112786057B (en) | Voiceprint recognition method and device, electronic equipment and storage medium |
CN116705063B (en) | Multi-model fusion speech forgery identification method based on manifold measures |
CN113936667A (en) | Birdsong recognition model training method, recognition method, and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20180622 |