CN108281158A - Deep-learning-based voice liveness detection method, server and storage medium - Google Patents

Deep-learning-based voice liveness detection method, server and storage medium

Info

Publication number
CN108281158A
CN108281158A (application CN201810029892.6A)
Authority
CN
China
Prior art keywords
voice
neural network
network model
deep learning
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810029892.6A
Other languages
Chinese (zh)
Inventor
王健宗
郑斯奇
于夕畔
肖京
Current Assignee
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN201810029892.6A priority Critical patent/CN108281158A/en
Priority to PCT/CN2018/089203 priority patent/WO2019136909A1/en
Publication of CN108281158A publication Critical patent/CN108281158A/en

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 — Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 — characterised by the analysis technique
    • G10L25/30 — characterised by the analysis technique using neural networks
    • G10L25/48 — specially adapted for particular use
    • G10L25/51 — specially adapted for particular use for comparison or discrimination

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a deep-learning-based voice liveness detection method applied to a server, comprising: training a deep neural network model to obtain an optimal deep neural network model; obtaining speech to be detected and framing it to obtain a 1000*20 matrix; inputting the 1000*20 matrix into the optimal deep neural network model; computing on the 1000*20 matrix with the optimal deep neural network model to obtain a 1*4 output vector, the 1*4 output vector representing 4 voice classes; and selecting the class with the largest value in the 1*4 output vector as the class of the speech to be detected. The invention also discloses a server and a storage medium. By implementing this scheme, a higher level of security can be provided for voice control, promoting the development of speech recognition technology.

Description

Deep-learning-based voice liveness detection method, server and storage medium
Technical field
The present invention relates to the field of computer technology, and more particularly to a deep-learning-based voice liveness detection method, a server and a storage medium.
Background technology
With the continuous development of speech recognition technology, speech recognition applications such as voice control and voice payment have become more and more common. At present, however, speech recognition typically only identifies semantics and cannot reliably distinguish whether a voice was produced by a person or played back from a recording. Take Apple's Siri as an example: when waking an Apple terminal device with Siri, the device wakes on "hi, Siri" whether the phrase is spoken live by me or replayed from a recording; the source of the input voice cannot be distinguished. Liveness detection for voice is therefore particularly important. Voice liveness detection identifies whether the input is a real person speaking; voice not produced by a live speaker is commonly referred to as a forged recording, including music input, recording replay, and voice generated by technical means such as speech synthesis. Forged recordings are frequently used against the finance and security fields: an attacker breaks into voiceprint recognition with a forged recording to log in to the victim's account and steal assets, or otherwise damage the victim's reputation and property.
Invention content
In view of this, the present invention proposes a deep-learning-based voice liveness detection method, server and storage medium, so that before a voice is used for an application, it can be quickly detected whether the voice was directly produced by the user or maliciously forged by someone else. In this way, a higher level of security can be provided for voice control, promoting the development of speech recognition technology.
First, to achieve the above object, the present invention proposes a server. The server includes a memory and a processor, the memory storing a deep-learning-based voice liveness detection program that can run on the processor. When executed by the processor, the program implements the following steps: training a deep neural network model to obtain an optimal deep neural network model; obtaining speech to be detected and framing it to obtain a 1000*20 matrix; inputting the 1000*20 matrix into the optimal deep neural network model; computing on the 1000*20 matrix with the optimal deep neural network model to obtain a 1*4 output vector, the 1*4 output vector representing 4 voice classes; and selecting the class with the largest value in the 1*4 output vector as the class of the speech to be detected.
Optionally, when the deep-learning-based voice liveness detection program is executed by the processor, the step of training a deep neural network model to obtain an optimal deep neural network model includes: framing the training voice, taking every 1000 frames as one sample; attaching a class label to each sample; and using the labeled samples as training samples for the deep neural network model.
Optionally, when the deep-learning-based voice liveness detection program is executed by the processor, the step of computing on the 1000*20 matrix with the optimal deep neural network model to obtain a 1*4 output vector specifically includes: in the first layer, convolving the 1000*20 input features with convolution kernels; in the second to fourth layers, convolving with 1*1 kernels and applying the LeakyReLU activation function; in the fifth layer, pooling by extracting the maximum over 2*2 kernel ranges; in the sixth layer, flattening; in the seventh layer, reducing the dimensionality of the sixth layer's output to obtain an output Out7; and, taking Out7 as input, applying a softmax activation to output a 1*4 vector as the detection result.
In addition, to achieve the above object, the present invention also provides a deep-learning-based voice liveness detection method applied to a server, the method comprising the following steps: training a deep neural network model to obtain an optimal deep neural network model; obtaining speech to be detected and framing it to obtain a 1000*20 matrix; inputting the 1000*20 matrix into the optimal deep neural network model; computing on the 1000*20 matrix with the optimal deep neural network model to obtain a 1*4 output vector, the 1*4 output vector representing 4 voice classes; and selecting the class with the largest value in the 1*4 output vector as the class of the speech to be detected.
Optionally, the step of training a deep neural network model to obtain an optimal deep neural network model includes: framing the training voice, taking every 1000 frames as one sample; attaching a class label to each sample; and using the labeled samples as training samples for the deep neural network model.
Optionally, the step of obtaining speech to be detected and framing it to obtain a 1000*20 matrix specifically includes: after framing the speech to be detected, extracting 1000 frames and computing 20-dim MFCC features for each frame; and generating the 1000*20 matrix from the 20-dim MFCCs of the 1000 frames.
Optionally, the step of computing on the 1000*20 matrix with the optimal deep neural network model to obtain a 1*4 output vector specifically includes: in the first layer, convolving the 1000*20 input features with convolution kernels; in the second to fourth layers, convolving with 1*1 kernels and applying the LeakyReLU activation function; in the fifth layer, pooling by extracting the maximum over 2*2 kernel ranges; in the sixth layer, flattening; in the seventh layer, reducing the dimensionality of the sixth layer's output to obtain an output Out7; and, taking Out7 as input, applying a softmax activation to output a 1*4 vector as the detection result.
Optionally, the values of the 1*4 output vector lie in the range 0 to 1.
Further, to achieve the above object, the present invention also provides a storage medium storing a deep-learning-based voice liveness detection program, the program being executable by at least one processor so that the at least one processor performs the steps of the deep-learning-based voice liveness detection method described above.
Compared with the prior art, the deep-learning-based voice liveness detection method, server and storage medium proposed by the present invention first train a deep neural network model to obtain an optimal deep neural network model; secondly, obtain speech to be detected and frame it to obtain a 1000*20 matrix; thirdly, input the 1000*20 matrix into the optimal deep neural network model; then compute on the 1000*20 matrix with the optimal deep neural network model to obtain a 1*4 output vector representing 4 voice classes; and finally select the class with the largest value in the 1*4 output vector as the class of the speech to be detected. In this way, before a voice is used for an application, it can be quickly detected whether the voice was directly produced by the user or maliciously forged by someone else, so that a higher level of security can be provided for voice control, promoting the development of speech recognition technology.
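As a concrete reading of the five steps summarized above, the end-to-end flow can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: `model` and `extract_mfcc` are hypothetical placeholders standing in for the trained optimal deep neural network and the framing/MFCC step.

```python
def detect_liveness(recording, model, extract_mfcc):
    """Sketch of the detection flow: frame the recording into a 1000*20
    MFCC matrix, run the trained model to get a 1*4 score vector, and
    return the class with the largest score."""
    classes = ["genuine", "music forgery", "replay forgery", "synthetic forgery"]
    matrix = extract_mfcc(recording)            # expected: 1000 rows of 20 values
    assert len(matrix) == 1000 and all(len(row) == 20 for row in matrix)
    scores = model(matrix)                      # 1*4 vector, values in 0..1
    return classes[scores.index(max(scores))]
```

For example, a stub model returning `[0.1, 0.2, 0.6, 0.1]` would yield "replay forgery", since the third score is largest.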
Description of the drawings
Fig. 1 is a schematic diagram of an optional hardware architecture of the server of the present invention;
Fig. 2 is a program module diagram of a first embodiment of the deep-learning-based voice liveness detection program of the present invention;
Fig. 3 is a flow chart of a first embodiment of the deep-learning-based voice liveness detection method of the present invention;
Reference numerals:
The realization of the objects, functional features and advantages of the present invention will be further described with reference to the accompanying drawings in conjunction with the embodiments.
Specific implementation mode
In order to make the objects, technical solutions and advantages of the present invention clearer, the present invention is further elaborated below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the present invention and are not intended to limit it. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative work shall fall within the protection scope of the present invention.
It should be noted that descriptions involving "first", "second" and the like in the present invention are for descriptive purposes only and cannot be understood as indicating or implying relative importance or implicitly indicating the number of the technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In addition, the technical solutions of the embodiments may be combined with each other, but only on the basis that they can be implemented by those of ordinary skill in the art; when a combination of technical solutions is contradictory or cannot be realized, it should be considered that the combination does not exist and is not within the protection scope claimed by the present invention.
Referring to Fig. 1, there is shown a schematic diagram of an optional hardware architecture of the server 1.
The server 1 may be a computing device such as a rack-mount server, a blade server, a tower server or a cabinet server, and may be an independent server or a server cluster composed of multiple servers.
In this embodiment, the server 1 may include, but is not limited to, a memory 11, a processor 12 and a network interface 13 that can communicate with each other through a system bus.
The server 1 connects to a network through the network interface 13 to obtain information. The network may be a wireless or wired network such as an intranet, the Internet, a Global System for Mobile Communications (GSM) network, a Wideband Code Division Multiple Access (WCDMA) network, a 4G network, a 5G network, Bluetooth, Wi-Fi, or a voice-channel network.
It should be pointed out that Fig. 1 only shows the server 1 with components 11-13; it should be understood that not all of the shown components are required, and more or fewer components may be implemented instead.
The memory 11 includes at least one type of storage medium, including flash memory, hard disk, multimedia card, card-type memory (for example, SD or DX memory), random access memory (RAM), static random-access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disk, optical disk, etc. In some embodiments, the memory 11 may be an internal storage unit of the server 1, such as a hard disk or memory of the server 1. In other embodiments, the memory 11 may also be an external storage device of the server 1, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card or a flash card equipped on the server 1. Of course, the memory 11 may also include both an internal storage unit of the server 1 and an external storage device. In this embodiment, the memory 11 is generally used to store the operating system and various application software installed on the server 1, such as the program code of the deep-learning-based voice liveness detection program 200. In addition, the memory 11 may also be used to temporarily store various data that has been output or will be output.
The processor 12 may, in some embodiments, be a central processing unit (CPU), controller, microcontroller, microprocessor or other data processing chip. The processor 12 is generally used to control the overall operation of the server 1, such as performing data interaction or communication-related control and processing. In this embodiment, the processor 12 is used to run the program code stored in the memory 11 or to process data, for example to run the deep-learning-based voice liveness detection program 200.
The network interface 13 may include a wireless network interface or a wired network interface, and is generally used to establish a communication connection between the server 1 and other electronic devices.
In this embodiment, the deep-learning-based voice liveness detection program 200 is installed and run in the server 1. When the program 200 runs, the server 1 trains a deep neural network model to obtain an optimal deep neural network model; obtains speech to be detected and frames it to obtain a 1000*20 matrix; inputs the 1000*20 matrix into the optimal deep neural network model; computes on the 1000*20 matrix with the optimal deep neural network model to obtain a 1*4 output vector representing 4 voice classes; and selects the class with the largest value in the 1*4 output vector as the class of the speech to be detected. In this way, before a voice is used for an application, it can be quickly detected whether the voice was directly produced by the user or maliciously forged by someone else, so that a higher level of security can be provided for voice control, promoting the development of speech recognition technology.
So far, the application environment and the hardware structure and functions of the related devices of the embodiments of the present invention have been described in detail. Based on the above application environment and related devices, the embodiments of the present invention are proposed below.
First, the present invention proposes a deep-learning-based voice liveness detection program 200.
Referring to Fig. 2, there is shown a program module diagram of a first embodiment of the deep-learning-based voice liveness detection program 200 of the present invention.
In this embodiment, the server 1 includes a series of computer program instructions stored in the memory 11, namely the deep-learning-based voice liveness detection program 200. When these computer program instructions are executed by the processor 12, the voice liveness detection operations of the embodiments of the present invention can be realized. In some embodiments, based on the specific operations realized by the respective parts of the computer program instructions, the deep-learning-based voice liveness detection program 200 may be divided into one or more modules. For example, in Fig. 2, the program 200 is divided into a training module 201, a speech processing module 202, a matrix input module 203, a matrix computing module 204 and a judgment module 205. Among them:
The training module 201 is used to train a deep neural network model to obtain an optimal deep neural network model.
Specifically, the training module 201 frames the training voice, taking every 1000 frames as one sample; attaches a class label to each sample; and uses the labeled samples as training samples for the deep neural network model.
In this embodiment, the purpose of cutting every 1000 frames is to give the model a fixed-length input: recordings of different lengths would produce differently distributed MFCC (Mel-Frequency Cepstral Coefficients) features, and a non-fixed input feature easily makes the model's recognition inaccurate. For recordings shorter than 1000 frames but longer than 100 frames, all-zero frames are spliced at the end; recordings shorter than 100 frames are simply discarded and considered to contain no speech. The benefit of taking every 1000 frames of every recording as one training sample is that the model can learn the sound characteristics of each period of such a voice, which is more robust than training on only a single 1000-frame segment of one recording.
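The framing rule above (fixed 1000-frame samples, zero-padding for recordings of 100-999 frames, discarding anything under 100 frames) can be sketched in plain Python. This is one illustrative reading of the rule, not the patent's code; each frame is represented as a 20-value MFCC vector, and the handling of a short trailing chunk is an assumption.

```python
def prepare_sample(frames, target_len=1000, min_len=100, dim=20):
    """Cut a recording (a list of 20-value MFCC frame vectors) into
    fixed-length samples, following the rule described above:
    - recordings shorter than 100 frames are discarded (no speech);
    - every 1000 frames becomes one sample;
    - a chunk of 100-999 frames is padded at the end with all-zero frames;
    - a trailing chunk under 100 frames is dropped (assumed behavior)."""
    if len(frames) < min_len:
        return []  # too short: assumed to contain no speech
    samples = []
    for start in range(0, len(frames), target_len):
        chunk = frames[start:start + target_len]
        if len(chunk) < min_len and samples:
            break  # trailing remnant too short to keep
        padding = [[0.0] * dim for _ in range(target_len - len(chunk))]
        samples.append(chunk + padding)
    return samples
```

A 150-frame recording thus yields one sample whose last 850 rows are zero frames, while a 50-frame recording yields no sample at all.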
In the training stage, a label is attached to each input recording to be identified: the genuine class is [1 0 0 0], type-1 forgery is [0 1 0 0], type-2 forgery is [0 0 1 0] and type-3 forgery is [0 0 0 1] (one-hot encoding). The genuine class is, as the name suggests, real live speech. Forged voice is divided into three kinds of forgery: type 1 is music, type 2 is recording replay, and type 3 is technical voice forgery. Type-1 forgery refers to music as the voiceprint recognition input: because music contains rich acoustic components, it can pass through voice registration and verification normally, but it contains no information about the speaker's voice and is therefore not a target recording for voiceprint recognition. Type-2 forgery is mainly the simple replay of a recording: the target person's speech or music is recorded with a device such as a recording pen or mobile phone, and the recording is then replayed directly into the voiceprint recognition input. Type-3 forgery mainly refers to forging the target person's speech using speech synthesis or voice conversion technology: synthesized recordings are generally produced by collecting a certain amount of the target person's voice data and using synthesis means to generate speech for specified text content, while voice-conversion recordings directly alter the spectrum of an original recording; because this kind of forgery involves a large amount of speech processing technology, it is called all-round forgery.
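A minimal label table for the four classes can make the encoding concrete. It assumes standard one-hot encoding with the genuine class in the first position; the class names are descriptive stand-ins, not terms from the patent.

```python
# One-hot training labels for the four voice classes, as assumed above.
LABELS = {
    "genuine":   [1, 0, 0, 0],  # real live speech
    "music":     [0, 1, 0, 0],  # type-1 forgery: music input
    "replay":    [0, 0, 1, 0],  # type-2 forgery: recording replay
    "synthetic": [0, 0, 0, 1],  # type-3 forgery: synthesis / voice conversion
}

def one_hot(cls):
    """Return the 1*4 training label for a named class."""
    return LABELS[cls]
```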
As for how the DNN (deep neural network) is trained with the training samples, it is summarized as follows. Model training uses the open-source Keras framework. Considering hardware limitations, the DNN is trained with the minibatch technique: each batch size is set to 128, each iteration trains 1000 batches, and N iterations are trained in total. Each batch randomly selects 128 voice MFCC feature samples from the full data, generates the model output, and then updates the model parameters by back-propagating according to the loss function, thereby completing 1 batch; generating 1000 batches of data in this way completes 1000 training steps and yields the model output of one iteration. Under normal circumstances, the model with the best loss function value among 50 iterations is selected. The convolution kernel of the first layer is 9*20, Nfilters is set to 512, the loss function is set to the categorical cross-entropy (categorical_crossentropy) over all discriminated classes, and the optimizer is adagrad.
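The minibatch schedule (128 samples per batch, 1000 batches per iteration) can be sketched independently of any framework. Here `train_step` is a hypothetical callable standing in for the Keras forward pass plus parameter update; the real training would use categorical_crossentropy and the adagrad optimizer as stated above.

```python
import random

def run_iteration(dataset, train_step, batch_size=128, n_batches=1000):
    """One training iteration as described above: 1000 batches, each of
    128 samples drawn at random from the full data, with a parameter
    update (loss feedback) after every batch. Returns the mean loss."""
    total_loss = 0.0
    for _ in range(n_batches):
        batch = random.sample(dataset, batch_size)  # 128 random MFCC samples
        total_loss += train_step(batch)             # update model, get loss
    return total_loss / n_batches
```

Repeating `run_iteration` N times and keeping the model with the lowest loss corresponds to the "best of 50 iterations" selection described in the text.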
The speech processing module 202 is used to obtain speech to be detected and frame it to obtain a 1000*20 matrix.
In this embodiment, the speech processing module 202 is specifically used to: after framing the speech to be detected, extract 1000 frames and compute 20-dim MFCC features for each frame; and generate the 1000*20 matrix from the 20-dim MFCCs of the 1000 frames.
In this embodiment, the framing of the speech to be detected is performed in the same way as the processing of the training voice described above: for recordings shorter than 1000 frames but longer than 100 frames, all-zero frames are spliced at the end; recordings shorter than 100 frames are simply discarded and considered to contain no speech. The computation of MFCC features is a conventional algorithm and is not repeated in the present invention.
The matrix input module 203 is used to input the 1000*20 matrix into the optimal deep neural network model.
In this embodiment, the input layer of the optimal deep neural network model (DNN) obtained above takes a matrix input, so the 1000*20 matrix obtained by the speech processing module 202 can be input directly into the obtained optimal deep neural network model.
The matrix computing module 204 is used to compute on the 1000*20 matrix with the optimal deep neural network model to obtain a 1*4 output vector, the 1*4 output vector representing 4 voice classes.
Specifically, in the first layer of the DNN model the matrix computing module convolves the 1000*20 input features with convolution kernels; the purpose of this layer is to project the features of adjacent frames, obtaining N channel features after the Nfilters convolution kernels have been applied. In the second to fourth layers, convolution is performed with 1*1 kernels and the LeakyReLU activation function is used; the effect of these 1*1 kernels is to allow interaction between channels, so that the model learns more intra-frame and inter-frame features. The fifth layer performs pooling, extracting the maximum over 2*2 kernel ranges (2*2 MaxPooling, with the default stride of 1*1); this layer selects certain upper-layer nodes, reduces the number of model parameters, and makes over-fitting less likely. The sixth layer flattens the output nodes of the previous layer to obtain a 1*P feature vector. The seventh layer, a linear layer, reduces the dimensionality of the sixth layer's output to obtain the output Out7; taking Out7 as input, a softmax activation function outputs a 1*4 vector, i.e. 4 values, as the detection result.
The judgment module 205 is used to select the class with the largest value in the 1*4 output vector as the class of the speech to be detected, where the values of the 1*4 output vector lie in the range 0 to 1.
In this embodiment, the 1*4 output vector expresses, as four fractions in the range 0-1, the probabilities of belonging to the respective classes, i.e. the probabilities of real speech and of type-1, type-2 and type-3 forgery described above; the class with the largest of these four probabilities is the class of the input voice. In other words, the output values directly and effectively indicate whether the speech to be detected is live, genuinely spoken voice.
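The final softmax and the argmax decision of the judgment module can be illustrated numerically. This sketch assumes Out7 is a list of 4 raw linear scores; the class names are descriptive stand-ins for the four classes above.

```python
import math

CLASSES = ["genuine", "music forgery", "replay forgery", "synthetic forgery"]

def softmax(logits):
    """Map the 4 linear outputs (Out7) to the 1*4 vector of 0-1 scores
    summing to 1, as produced by the softmax activation."""
    m = max(logits)                      # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def judge(logits):
    """Judgment module: pick the class with the largest score."""
    probs = softmax(logits)
    return CLASSES[probs.index(max(probs))], probs
```

With `judge([4.0, 0.5, 0.2, 0.1])` the first score dominates, so the speech is judged genuine.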
Through the above program modules 201-205, the server proposed by the present invention trains a deep neural network model to obtain an optimal deep neural network model; obtains speech to be detected and frames it to obtain a 1000*20 matrix; inputs the 1000*20 matrix into the optimal deep neural network model; computes on the 1000*20 matrix with the optimal deep neural network model to obtain a 1*4 output vector representing 4 voice classes; and selects the class with the largest value in the 1*4 output vector as the class of the speech to be detected. In this way, before a voice is used for an application, it can be quickly detected whether the voice was directly produced by the user or maliciously forged by someone else, so that a higher level of security can be provided for voice control, promoting the development of speech recognition technology.
In addition, the present invention also proposes a deep-learning-based voice liveness detection method.
Referring to Fig. 3, there is shown a schematic implementation flow of a first embodiment of the deep-learning-based voice liveness detection method of the present invention. In this embodiment, the execution order of the steps in the flow chart shown in Fig. 3 may change according to different requirements, and certain steps may be omitted.
Step S301: train a deep neural network model to obtain an optimal deep neural network model.
Specifically, the above step includes: framing the training voice, taking every 1000 frames as one sample; attaching a class label to each sample; and using the labeled samples as training samples for the deep neural network model.
In the present embodiment, the purpose that every 1000 frame blocks is the input for making model have regular length, considers different length The recording of degree will produce different MFCC (Mel-Frequency Cepstral Coefficients, cepstral coefficients) feature Distribution Effect be easy to cause Model Identification inaccuracy if input feature vector on-fixed.For being shorter than 1000 frames and being longer than 100 frames Recording, we splice full 0 frame behind;For being shorter than the recording of 100 frames, we directly cast out, it is believed that are said without people Words.Every 1000 frames of all recording be the benefits of a training sample be can allow model learning to such voice each period sound Sound feature, than using some 1000 frame training effect of certain section of recording to have more robustness merely.
Training stage is tagged to each input recording to be identified, and if proper class is [0000], one kind is forged [0100], two classes forge [0010], and three classes forge [0001].Specifically, proper class is as its name suggests, and really it is as voice, and it is right In forging voice, it is divided into three kinds of forgeries, it is music that the first kind, which is forged, and the second class is forged forges for recording replay, and third class is forged It is forged for technical voice.It is music that first kind forgery, which refers to Application on Voiceprint Recognition input, and music is due to containing abundant acoustic constituents, energy Be normally carried out the registration and verification of voice, but and the information not comprising speaker's sound, therefore the not target recording of Application on Voiceprint Recognition; Second class forges the simple replay of predominantly recording, as spoken with target person under recording pen, the record of mobile phone equipment or music etc. Then voice directly replays the input terminal to Application on Voiceprint Recognition;Third class, which is forged, to be referred mainly to convert using phonetic synthesis or voice Technology carries out target person and speaks forgerys, and phonetic synthesis recording, which generally acquires a certain amount of voice data of target person, just can use synthesizing mean The voice that target person specifies content of text is generated, voice conversion recording is that the change of frequency spectrum is directly carried out to original recording, such It forges due to containing a large amount of voice process technologies, therefore referred to as poly-talented forgery.
As for how the DNN (deep neural network) is trained with the training samples, it is summarized as follows. Model training is carried out with the open-source Keras framework. In view of hardware limitations, the DNN is trained with the mini-batch technique: the batch size is set to 128, each training iteration comprises 1000 batches, and N iterations are trained in total. Each batch randomly selects 128 voice MFCC feature samples from the full data set and generates the model output; the model parameters are then updated by back-propagating the loss-function feedback, completing one batch. Generating 1000 such batches completes the 1000 training steps of one iteration and yields that iteration's model output. Under normal circumstances, the model with the optimal loss among 50 iterations is selected. The convolution kernel of the first layer is 9*20, Nfilters is set to 512, the loss function is the categorical cross-entropy (categorical_crossentropy) over all discriminated classes, and the optimizer is adagrad.
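A Keras training loop matching the stated settings might look like the following (a sketch: the loop structure, model-selection logic, and function name are assumptions; only the batch size, batches per iteration, loss, and optimizer come from the description):

```python
import numpy as np
from tensorflow import keras

def train(model, x, y, batch_size=128, batches_per_iter=1000, n_iters=50):
    """Mini-batch training: random batches of MFCC samples, 1000 batches
    per iteration, keeping the weights of the best-loss iteration."""
    model.compile(optimizer="adagrad", loss="categorical_crossentropy")
    best_loss, best_weights = float("inf"), model.get_weights()
    for _ in range(n_iters):
        for _ in range(batches_per_iter):
            # randomly select batch_size MFCC feature samples from all data
            idx = np.random.choice(len(x), batch_size)
            loss = float(model.train_on_batch(x[idx], y[idx]))
        if loss < best_loss:          # select the optimal-loss model
            best_loss, best_weights = loss, model.get_weights()
    model.set_weights(best_weights)
    return model
```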
Step S302: obtain the voice to be detected and perform framing on the voice to be detected to obtain a 1000*20 matrix.
In the present embodiment, the speech processing module 202 is specifically configured to: after framing the voice to be detected, extract 1000 frames and compute a 20-dimensional MFCC feature for each frame; then generate the 1000*20 matrix from the 20-dimensional MFCCs of the 1000 frames.
In the present embodiment, the framing of the voice to be detected is identical to the processing of the training voice described above: for recordings shorter than 1000 frames but longer than 100 frames, all-zero frames are spliced on at the end; recordings shorter than 100 frames are discarded, on the assumption that no one is speaking. The computation of MFCC features is a conventional algorithm and is not repeated in the present invention.
Step S303: input the 1000*20 matrix into the optimal deep neural network model.
In the present embodiment, the input layer of the optimal deep neural network model DNN obtained above takes a matrix as input, so the 1000*20 matrix produced by the speech processing module 202 can be fed directly into the optimal deep neural network model.
Step S304: compute on the 1000*20 matrix using the optimal deep neural network model to obtain a 1*4 output vector, the 1*4 output vector representing 4 voice classes.
Specifically, in the first layer of the DNN model the matrix computing module 204 convolves the input features (the 1000*20 matrix) with the first-layer convolution kernels; the purpose of this layer is to project the features of consecutive frames, with Nfilters (N filters) controlling how many channel features are obtained after convolution by each kernel. In the second to fourth layers, convolution is performed with 1*1 kernels and the LeakyReLU activation function is used; the effect of these 1*1 kernels is to let the channels connect and interact, so the model learns more multi-frame and inter-frame features. The fifth layer performs pooling, extracting the maximum value over 2*2 kernel ranges (2*2 MaxPooling) with the default stride of 1*1; this layer selects certain upper-layer nodes, reducing the number of model parameters and making over-fitting less likely. The sixth layer flattens: flattening the output nodes of the previous layer yields a 1*P feature vector. The seventh layer, a linear layer, performs dimensionality reduction on the sixth layer to obtain the output Out7; taking Out7 as the input of the seventh layer and applying the softmax activation function produces a 1*4 vector, i.e., 4 output values, as the detection result.
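The seven-layer structure can be sketched in Keras as follows (a sketch under assumptions: the 'same' padding on the first layer and the single input channel are added so that the stated 2*2, stride-1*1 pooling remains valid; layer details beyond those stated are illustrative):

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_dnn(n_filters=512):
    """Sketch of the seven-layer DNN described above."""
    return keras.Sequential([
        layers.Input(shape=(1000, 20, 1)),     # the 1000*20 MFCC matrix
        # Layer 1: 9*20 kernels project consecutive-frame features into
        # Nfilters channels ('same' padding is an assumption).
        layers.Conv2D(n_filters, (9, 20), padding="same"),
        # Layers 2-4: 1*1 convolutions let channels interact, each with
        # LeakyReLU activation.
        layers.Conv2D(n_filters, (1, 1)), layers.LeakyReLU(),
        layers.Conv2D(n_filters, (1, 1)), layers.LeakyReLU(),
        layers.Conv2D(n_filters, (1, 1)), layers.LeakyReLU(),
        # Layer 5: 2*2 max pooling with the default 1*1 stride.
        layers.MaxPooling2D(pool_size=(2, 2), strides=(1, 1)),
        # Layer 6: flatten to a 1*P feature vector.
        layers.Flatten(),
        # Layer 7: linear reduction to 4 units, then softmax gives the
        # 1*4 class-probability vector.
        layers.Dense(4),
        layers.Activation("softmax"),
    ])
```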
Step S305: select the class with the largest value in the 1*4 output vector as the class of the voice to be detected, where the values of the 1*4 output vector lie in the range 0 to 1.
In the present embodiment, the 1*4 output vector expresses, as four numbers in the range 0-1, the probability of belonging to each class: the probability of genuine speech and of the type-one, type-two, and type-three forgeries, respectively. The largest of these four probabilities identifies the class of the input voice; that is, the output values directly and effectively indicate whether the voice to be detected is live speech, i.e., genuine speech.
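The final decision is a simple argmax over the four probabilities (a sketch; the class-name strings are illustrative):

```python
import numpy as np

CLASSES = ["genuine", "music forgery", "replay forgery", "synthetic forgery"]

def classify(output_vec):
    """Pick the class with the largest probability in the 1*4 output."""
    probs = np.asarray(output_vec, dtype=float).ravel()
    return CLASSES[int(np.argmax(probs))], float(probs.max())
```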
Through steps S301-S305 above, the voice liveness detection method based on deep learning proposed by the present invention proceeds as follows: first, the deep neural network model is trained to obtain the optimal deep neural network model; secondly, the voice to be detected is obtained and framed to obtain a 1000*20 matrix; thirdly, the 1000*20 matrix is input into the optimal deep neural network model; then, the 1000*20 matrix is computed on using the optimal deep neural network model to obtain a 1*4 output vector representing 4 voice classes; finally, the class with the largest value in the 1*4 output vector is selected as the class of the voice to be detected. In this way, before a voice is used for a corresponding application, it can be quickly detected whether the voice was produced directly by the user or is a voice maliciously forged by someone else, providing a higher level of safety for voice-controlled security and promoting the development of speech recognition technology.
The present invention also provides another embodiment, namely a storage medium storing a deep-learning-based voice liveness detection program, the program being executable by at least one processor so that the at least one processor performs the steps of the deep-learning-based voice liveness detection method described above.
The serial numbers of the above embodiments of the present invention are for description only and do not represent the relative merits of the embodiments.
Through the above description of the embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by means of software plus a necessary general-purpose hardware platform, and can of course also be implemented by hardware, but in many cases the former is the better implementation. Based on this understanding, the technical solution of the present invention, in essence or in the part contributing to the prior art, can be embodied in the form of a software product stored in a storage medium (such as ROM/RAM, magnetic disk, or optical disc) and including instructions that cause a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, a network device, or the like) to execute the methods described in the embodiments of the present invention.
The above are only preferred embodiments of the present invention and are not intended to limit the patent scope of the invention; any equivalent structural or process transformation made using the contents of the present specification and drawings, applied directly or indirectly in other related technical fields, is likewise included within the patent protection scope of the present invention.

Claims (10)

1. A voice liveness detection method based on deep learning, applied to a server, characterized in that the method comprises the following steps:
training a deep neural network model to obtain an optimal deep neural network model;
obtaining a voice to be detected and framing the voice to be detected to obtain a 1000*20 matrix;
inputting the 1000*20 matrix into the optimal deep neural network model;
computing on the 1000*20 matrix using the optimal deep neural network model to obtain a 1*4 output vector, the 1*4 output vector representing 4 voice classes; and
selecting the class with the largest value in the 1*4 output vector as the class of the voice to be detected.
2. The voice liveness detection method based on deep learning according to claim 1, characterized in that the step of training a deep neural network model to obtain an optimal deep neural network model comprises:
framing a training voice and taking every 1000 frames as one sample;
applying a class label to each sample; and
using the labeled samples as training samples of the deep neural network model.
3. The voice liveness detection method based on deep learning according to claim 1, characterized in that the step of obtaining a voice to be detected and framing the voice to be detected to obtain a 1000*20 matrix specifically comprises:
after framing the voice to be detected, extracting 1000 frames and computing a 20-dimensional MFCC feature for each frame; and
generating the 1000*20 matrix from the 20-dimensional MFCCs of the 1000 frames.
4. The voice liveness detection method based on deep learning according to claim 1, characterized in that computing on the 1000*20 matrix using the optimal deep neural network model comprises:
performing convolution on the input features with convolution kernels in a first layer;
performing convolution with 1*1 convolution kernels in second to fourth layers, using a LeakyReLU activation function;
performing pooling in a fifth layer by extracting the maximum value over 2*2 kernel ranges;
flattening in a sixth layer;
performing dimensionality reduction on the sixth layer in a seventh layer to obtain an output Out7; and
taking Out7 as the input of the seventh layer and outputting, via a softmax activation function, a 1*4 vector as the detection result.
5. The voice liveness detection method based on deep learning according to any one of claims 1-4, characterized in that the 1*4 output vector consists of values in the range 0 to 1.
6. A server, characterized in that the server comprises a memory and a processor, the memory storing a deep-learning-based voice liveness detection program runnable on the processor, the deep-learning-based voice liveness detection program implementing the following steps when executed by the processor:
training a deep neural network model to obtain an optimal deep neural network model;
obtaining a voice to be detected and framing the voice to be detected to obtain a 1000*20 matrix;
inputting the 1000*20 matrix into the optimal deep neural network model;
computing on the 1000*20 matrix using the optimal deep neural network model to obtain a 1*4 output vector, the 1*4 output vector representing 4 voice classes; and
selecting the class with the largest value in the 1*4 output vector as the class of the voice to be detected.
7. The server according to claim 6, characterized in that, when the deep-learning-based voice liveness detection program is executed by the processor, the step of training a deep neural network model to obtain an optimal deep neural network model comprises:
framing a training voice and taking every 1000 frames as one sample;
applying a class label to each sample; and
using the labeled samples as training samples of the deep neural network model.
8. The server according to claim 6, characterized in that, when the deep-learning-based voice liveness detection program is executed by the processor, the step of obtaining a voice to be detected and framing the voice to be detected to obtain a 1000*20 matrix specifically comprises:
after framing the voice to be detected, extracting 1000 frames and computing a 20-dimensional MFCC feature for each frame; and
generating the 1000*20 matrix from the 20-dimensional MFCCs of the 1000 frames.
9. The server according to any one of claims 6-8, characterized in that, when the deep-learning-based voice liveness detection program is executed by the processor, the step of computing on the 1000*20 matrix using the optimal deep neural network model to obtain a 1*4 output vector comprises:
performing convolution on the input features with convolution kernels in a first layer;
performing convolution with 1*1 convolution kernels in second to fourth layers, using a LeakyReLU activation function;
performing pooling in a fifth layer by extracting the maximum value over 2*2 kernel ranges;
flattening in a sixth layer;
performing dimensionality reduction on the sixth layer in a seventh layer to obtain an output Out7; and
taking Out7 as the input of the seventh layer and outputting, via a softmax activation function, a 1*4 vector as the detection result.
10. A storage medium storing a deep-learning-based voice liveness detection program, the deep-learning-based voice liveness detection program being executable by at least one processor so that the at least one processor performs the steps of the voice liveness detection method based on deep learning according to any one of claims 1-5.
CN201810029892.6A 2018-01-12 2018-01-12 Voice biopsy method, server and storage medium based on deep learning Pending CN108281158A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201810029892.6A CN108281158A (en) 2018-01-12 2018-01-12 Voice biopsy method, server and storage medium based on deep learning
PCT/CN2018/089203 WO2019136909A1 (en) 2018-01-12 2018-05-31 Voice living-body detection method based on deep learning, server and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810029892.6A CN108281158A (en) 2018-01-12 2018-01-12 Voice biopsy method, server and storage medium based on deep learning

Publications (1)

Publication Number Publication Date
CN108281158A true CN108281158A (en) 2018-07-13

Family

ID=62803422

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810029892.6A Pending CN108281158A (en) 2018-01-12 2018-01-12 Voice biopsy method, server and storage medium based on deep learning

Country Status (2)

Country Link
CN (1) CN108281158A (en)
WO (1) WO2019136909A1 (en)


Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110504027A (en) * 2019-08-20 2019-11-26 东北大学 A kind of X-Ray rabat pneumonia intelligent diagnosis system and method based on deep learning

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102436810A (en) * 2011-10-26 2012-05-02 华南理工大学 Record replay attack detection method and system based on channel mode noise
CN105869630A (en) * 2016-06-27 2016-08-17 上海交通大学 Method and system for detecting voice spoofing attack of speakers on basis of deep learning
CN106409298A (en) * 2016-09-30 2017-02-15 广东技术师范学院 Identification method of sound rerecording attack
CN106531172A (en) * 2016-11-23 2017-03-22 湖北大学 Speaker voice playback identification method and system based on environmental noise change detection

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9697833B2 (en) * 2015-08-25 2017-07-04 Nuance Communications, Inc. Audio-visual speech recognition with scattering operators
CN107545248B (en) * 2017-08-24 2021-04-02 北京小米移动软件有限公司 Biological characteristic living body detection method, device, equipment and storage medium


Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
CHUNLEI ZHANG 等: "An Investigation of Deep-Learning Frameworks for Speaker Verification Antispoofing", 《IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING》 *
HUIXIN LIANG 等: "RECOGNITION OF SPOOFED VOICE USING CONVOLUTIONAL NEURAL NETWORKS", 《2017 IEEE GLOBALSIP》 *
XIAOHAI TIAN 等: "Spoofing Speech Detection using Temporal Convolutional Neural Network", 《2016 APSIPA》 *
CHEN MIN: "Introduction to Cognitive Computing" (《认知计算导论》), 30 April 2017, Huazhong University of Science and Technology Press *

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109036459A (en) * 2018-08-22 2018-12-18 百度在线网络技术(北京)有限公司 Sound end detecting method, device, computer equipment, computer storage medium
CN109036459B (en) * 2018-08-22 2019-12-27 百度在线网络技术(北京)有限公司 Voice endpoint detection method and device, computer equipment and computer storage medium
CN109346089A (en) * 2018-09-27 2019-02-15 深圳市声扬科技有限公司 Living body identity identifying method, device, computer equipment and readable storage medium storing program for executing
CN109801638A (en) * 2019-01-24 2019-05-24 平安科技(深圳)有限公司 Speech verification method, apparatus, computer equipment and storage medium
CN109801638B (en) * 2019-01-24 2023-10-13 平安科技(深圳)有限公司 Voice verification method, device, computer equipment and storage medium
CN115280410A (en) * 2020-01-13 2022-11-01 密歇根大学董事会 Safe automatic speaker verification system
EP4091164A4 (en) * 2020-01-13 2024-01-24 Univ Michigan Regents Secure automatic speaker verification system
CN111933154A (en) * 2020-07-16 2020-11-13 平安科技(深圳)有限公司 Method and device for identifying counterfeit voice and computer readable storage medium
CN111933154B (en) * 2020-07-16 2024-02-13 平安科技(深圳)有限公司 Method, equipment and computer readable storage medium for recognizing fake voice
CN112489677A (en) * 2020-11-20 2021-03-12 平安科技(深圳)有限公司 Voice endpoint detection method, device, equipment and medium based on neural network
CN112489677B (en) * 2020-11-20 2023-09-22 平安科技(深圳)有限公司 Voice endpoint detection method, device, equipment and medium based on neural network
CN112735381A (en) * 2020-12-29 2021-04-30 四川虹微技术有限公司 Model updating method and device
CN112735431A (en) * 2020-12-29 2021-04-30 三星电子(中国)研发中心 Model training method and device and artificial intelligence dialogue recognition method and device
CN112735381B (en) * 2020-12-29 2022-09-27 四川虹微技术有限公司 Model updating method and device
CN112735431B (en) * 2020-12-29 2023-12-22 三星电子(中国)研发中心 Model training method and device and artificial intelligent dialogue recognition method and device

Also Published As

Publication number Publication date
WO2019136909A1 (en) 2019-07-18

Similar Documents

Publication Publication Date Title
CN108281158A (en) Voice biopsy method, server and storage medium based on deep learning
CN107527620A (en) Electronic installation, the method for authentication and computer-readable recording medium
CN107993071A (en) Electronic device, auth method and storage medium based on vocal print
CN107564513A (en) Audio recognition method and device
CN107785015A (en) A kind of audio recognition method and device
CN109409971A (en) Abnormal order processing method and device
CN106104674A (en) Mixing voice identification
CN108154371A (en) Electronic device, the method for authentication and storage medium
CN110675862A (en) Corpus acquisition method, electronic device and storage medium
CN109308912A (en) Music style recognition methods, device, computer equipment and storage medium
CN108986798A (en) Processing method, device and the equipment of voice data
CN108694952A (en) Electronic device, the method for authentication and storage medium
CN111508524A (en) Method and system for identifying voice source equipment
CN111161713A (en) Voice gender identification method and device and computing equipment
CN109658943A (en) A kind of detection method of audio-frequency noise, device, storage medium and mobile terminal
CN107229691A (en) A kind of method and apparatus for being used to provide social object
Li et al. Anti-forensics of audio source identification using generative adversarial network
CN108650266A (en) Server, the method for voice print verification and storage medium
Wang et al. Robust speaker identification of iot based on stacked sparse denoising auto-encoders
Dixit et al. Review of audio deepfake detection techniques: Issues and prospects
US11630950B2 (en) Prediction of media success from plot summaries using machine learning model
CN116152938A (en) Method, device and equipment for training identity recognition model and transferring electronic resources
CN111933154B (en) Method, equipment and computer readable storage medium for recognizing fake voice
CN110415708A (en) Method for identifying speaker, device, equipment and storage medium neural network based
CN116564269A (en) Voice data processing method and device, electronic equipment and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20180713