CN108281158A - Deep-learning-based voice liveness detection method, server and storage medium - Google Patents

Deep-learning-based voice liveness detection method, server and storage medium

Info

Publication number
CN108281158A
CN108281158A (application CN201810029892.6A)
Authority
CN
China
Prior art keywords
voice
neural network
network model
deep learning
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810029892.6A
Other languages
Chinese (zh)
Inventor
王健宗
郑斯奇
于夕畔
肖京
Current Assignee
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN201810029892.6A priority Critical patent/CN108281158A/en
Priority to PCT/CN2018/089203 priority patent/WO2019136909A1/en
Publication of CN108281158A publication Critical patent/CN108281158A/en

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 — Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 — characterised by the analysis technique
    • G10L25/30 — characterised by the analysis technique using neural networks
    • G10L25/48 — specially adapted for particular use
    • G10L25/51 — specially adapted for particular use for comparison or discrimination

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a deep-learning-based voice liveness detection method applied to a server, comprising: training a deep neural network model to obtain an optimal deep neural network model; obtaining speech to be detected and framing it to obtain a 1000*20 matrix; inputting the 1000*20 matrix into the optimal deep neural network model; computing on the 1000*20 matrix with the optimal deep neural network model to obtain a 1*4 output vector, the 1*4 output vector representing 4 voice classes; and selecting the class with the largest value in the 1*4 output vector as the class of the speech to be detected. The invention also discloses a server and a storage medium. By implementing this scheme, a higher level of security can be provided for voice control, promoting the development of speech recognition technology.

Description

Deep-learning-based voice liveness detection method, server and storage medium
Technical field
The present invention relates to the field of computer technology, and more particularly to a deep-learning-based voice liveness detection method, a server and a storage medium.
Background technology
With the continuous development of speech recognition technology, speech recognition applications such as voice control and voice payment have become more and more common. At present, however, speech recognition typically only identifies semantics and cannot reliably distinguish whether a voice was produced by a person or played back from a recording. Take Apple's Siri as an example: when waking an Apple terminal device with Siri, the device wakes on "hi, Siri" whether the phrase is spoken live by me or replayed from a recording; the source of the input voice cannot be distinguished. Liveness detection for voice is therefore particularly important. Voice liveness detection identifies whether the input is a real person speaking; voice not produced by a live speaker is commonly referred to as a forged recording, including music input, recording replay, and voice generated by technical means such as speech synthesis. Forged recordings are frequently used against the finance and security fields: an attacker breaks into voiceprint recognition with a forged recording to log in to the victim's account and steal assets, or otherwise damage the victim's reputation and property.
Invention content
In view of this, the present invention proposes a deep-learning-based voice liveness detection method, server and storage medium, so that before a voice is used for an application, it can be quickly detected whether the voice was directly produced by the user or maliciously forged by someone else. In this way, a higher level of security can be provided for voice control, promoting the development of speech recognition technology.
First, to achieve the above object, the present invention proposes a server. The server includes a memory and a processor, the memory storing a deep-learning-based voice liveness detection program that can run on the processor. When executed by the processor, the program implements the following steps: training a deep neural network model to obtain an optimal deep neural network model; obtaining speech to be detected and framing it to obtain a 1000*20 matrix; inputting the 1000*20 matrix into the optimal deep neural network model; computing on the 1000*20 matrix with the optimal deep neural network model to obtain a 1*4 output vector, the 1*4 output vector representing 4 voice classes; and selecting the class with the largest value in the 1*4 output vector as the class of the speech to be detected.
Optionally, when the deep-learning-based voice liveness detection program is executed by the processor, the step of training a deep neural network model to obtain an optimal deep neural network model includes: framing the training voice, taking every 1000 frames as one sample; attaching a class label to each sample; and using the labeled samples as training samples for the deep neural network model.
Optionally, when the deep-learning-based voice liveness detection program is executed by the processor, the step of computing on the 1000*20 matrix with the optimal deep neural network model to obtain a 1*4 output vector specifically includes: in the first layer, convolving the 1000*20 input features with convolution kernels; in the second to fourth layers, convolving with 1*1 kernels and applying the LeakyReLU activation function; in the fifth layer, pooling by extracting the maximum over 2*2 kernel ranges; in the sixth layer, flattening; in the seventh layer, reducing the dimensionality of the sixth layer's output to obtain an output Out7; and, taking Out7 as input, applying a softmax activation to output a 1*4 vector as the detection result.
In addition, to achieve the above object, the present invention also provides a deep-learning-based voice liveness detection method applied to a server, the method comprising the following steps: training a deep neural network model to obtain an optimal deep neural network model; obtaining speech to be detected and framing it to obtain a 1000*20 matrix; inputting the 1000*20 matrix into the optimal deep neural network model; computing on the 1000*20 matrix with the optimal deep neural network model to obtain a 1*4 output vector, the 1*4 output vector representing 4 voice classes; and selecting the class with the largest value in the 1*4 output vector as the class of the speech to be detected.
Optionally, the step of training a deep neural network model to obtain an optimal deep neural network model includes: framing the training voice, taking every 1000 frames as one sample; attaching a class label to each sample; and using the labeled samples as training samples for the deep neural network model.
Optionally, the step of obtaining speech to be detected and framing it to obtain a 1000*20 matrix specifically includes: after framing the speech to be detected, extracting 1000 frames and computing 20-dim MFCC features for each frame; and generating the 1000*20 matrix from the 20-dim MFCCs of the 1000 frames.
Optionally, the step of computing on the 1000*20 matrix with the optimal deep neural network model to obtain a 1*4 output vector specifically includes: in the first layer, convolving the 1000*20 input features with convolution kernels; in the second to fourth layers, convolving with 1*1 kernels and applying the LeakyReLU activation function; in the fifth layer, pooling by extracting the maximum over 2*2 kernel ranges; in the sixth layer, flattening; in the seventh layer, reducing the dimensionality of the sixth layer's output to obtain an output Out7; and, taking Out7 as input, applying a softmax activation to output a 1*4 vector as the detection result.
Optionally, the values of the 1*4 output vector lie in the range 0 to 1.
Further, to achieve the above object, the present invention also provides a storage medium storing a deep-learning-based voice liveness detection program, the program being executable by at least one processor so that the at least one processor performs the steps of the deep-learning-based voice liveness detection method described above.
Compared with the prior art, the deep-learning-based voice liveness detection method, server and storage medium proposed by the present invention first train a deep neural network model to obtain an optimal deep neural network model; secondly, obtain speech to be detected and frame it to obtain a 1000*20 matrix; thirdly, input the 1000*20 matrix into the optimal deep neural network model; then compute on the 1000*20 matrix with the optimal deep neural network model to obtain a 1*4 output vector representing 4 voice classes; and finally select the class with the largest value in the 1*4 output vector as the class of the speech to be detected. In this way, before a voice is used for an application, it can be quickly detected whether the voice was directly produced by the user or maliciously forged by someone else, so that a higher level of security can be provided for voice control, promoting the development of speech recognition technology.
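As a concrete reading of the five steps summarized above, the end-to-end flow can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: `model` and `extract_mfcc` are hypothetical placeholders standing in for the trained optimal deep neural network and the framing/MFCC step.

```python
def detect_liveness(recording, model, extract_mfcc):
    """Sketch of the detection flow: frame the recording into a 1000*20
    MFCC matrix, run the trained model to get a 1*4 score vector, and
    return the class with the largest score."""
    classes = ["genuine", "music forgery", "replay forgery", "synthetic forgery"]
    matrix = extract_mfcc(recording)            # expected: 1000 rows of 20 values
    assert len(matrix) == 1000 and all(len(row) == 20 for row in matrix)
    scores = model(matrix)                      # 1*4 vector, values in 0..1
    return classes[scores.index(max(scores))]
```

For example, a stub model returning `[0.1, 0.2, 0.6, 0.1]` would yield "replay forgery", since the third score is largest.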
Description of the drawings
Fig. 1 is a schematic diagram of an optional hardware architecture of the server of the present invention;
Fig. 2 is a program module diagram of a first embodiment of the deep-learning-based voice liveness detection program of the present invention;
Fig. 3 is a flow chart of a first embodiment of the deep-learning-based voice liveness detection method of the present invention;
Reference numerals:
The realization of the objects, functional features and advantages of the present invention will be further described with reference to the accompanying drawings in conjunction with the embodiments.
Specific implementation mode
In order to make the objects, technical solutions and advantages of the present invention clearer, the present invention is further elaborated below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the present invention and are not intended to limit it. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative work shall fall within the protection scope of the present invention.
It should be noted that descriptions involving "first", "second" and the like in the present invention are for descriptive purposes only and cannot be understood as indicating or implying relative importance or implicitly indicating the number of the technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In addition, the technical solutions of the embodiments may be combined with each other, but only on the basis that they can be implemented by those of ordinary skill in the art; when a combination of technical solutions is contradictory or cannot be realized, it should be considered that the combination does not exist and is not within the protection scope claimed by the present invention.
Referring to Fig. 1, there is shown a schematic diagram of an optional hardware architecture of the server 1.
The server 1 may be a computing device such as a rack-mount server, a blade server, a tower server or a cabinet server, and may be an independent server or a server cluster composed of multiple servers.
In this embodiment, the server 1 may include, but is not limited to, a memory 11, a processor 12 and a network interface 13 that can communicate with each other through a system bus.
The server 1 connects to a network through the network interface 13 to obtain information. The network may be a wireless or wired network such as an intranet, the Internet, a Global System for Mobile Communications (GSM) network, a Wideband Code Division Multiple Access (WCDMA) network, a 4G network, a 5G network, Bluetooth, Wi-Fi, or a voice-channel network.
It should be pointed out that Fig. 1 only shows the server 1 with components 11-13; it should be understood that not all of the shown components are required, and more or fewer components may be implemented instead.
The memory 11 includes at least one type of storage medium, including flash memory, hard disk, multimedia card, card-type memory (for example, SD or DX memory), random access memory (RAM), static random-access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disk, optical disk, etc. In some embodiments, the memory 11 may be an internal storage unit of the server 1, such as a hard disk or memory of the server 1. In other embodiments, the memory 11 may also be an external storage device of the server 1, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card or a flash card equipped on the server 1. Of course, the memory 11 may also include both an internal storage unit of the server 1 and an external storage device. In this embodiment, the memory 11 is generally used to store the operating system and various application software installed on the server 1, such as the program code of the deep-learning-based voice liveness detection program 200. In addition, the memory 11 may also be used to temporarily store various data that has been output or will be output.
The processor 12 may, in some embodiments, be a central processing unit (CPU), controller, microcontroller, microprocessor or other data processing chip. The processor 12 is generally used to control the overall operation of the server 1, such as performing data interaction or communication-related control and processing. In this embodiment, the processor 12 is used to run the program code stored in the memory 11 or to process data, for example to run the deep-learning-based voice liveness detection program 200.
The network interface 13 may include a wireless network interface or a wired network interface, and is generally used to establish a communication connection between the server 1 and other electronic devices.
In this embodiment, the deep-learning-based voice liveness detection program 200 is installed and run in the server 1. When the program 200 runs, the server 1 trains a deep neural network model to obtain an optimal deep neural network model; obtains speech to be detected and frames it to obtain a 1000*20 matrix; inputs the 1000*20 matrix into the optimal deep neural network model; computes on the 1000*20 matrix with the optimal deep neural network model to obtain a 1*4 output vector representing 4 voice classes; and selects the class with the largest value in the 1*4 output vector as the class of the speech to be detected. In this way, before a voice is used for an application, it can be quickly detected whether the voice was directly produced by the user or maliciously forged by someone else, so that a higher level of security can be provided for voice control, promoting the development of speech recognition technology.
So far, the application environment and the hardware structure and functions of the related devices of the embodiments of the present invention have been described in detail. Based on the above application environment and related devices, the embodiments of the present invention are proposed below.
First, the present invention proposes a deep-learning-based voice liveness detection program 200.
Referring to Fig. 2, there is shown a program module diagram of a first embodiment of the deep-learning-based voice liveness detection program 200 of the present invention.
In this embodiment, the server 1 includes a series of computer program instructions stored in the memory 11, namely the deep-learning-based voice liveness detection program 200. When these computer program instructions are executed by the processor 12, the voice liveness detection operations of the embodiments of the present invention can be realized. In some embodiments, based on the specific operations realized by the respective parts of the computer program instructions, the deep-learning-based voice liveness detection program 200 may be divided into one or more modules. For example, in Fig. 2, the program 200 is divided into a training module 201, a speech processing module 202, a matrix input module 203, a matrix computing module 204 and a judgment module 205. Among them:
The training module 201 is used to train a deep neural network model to obtain an optimal deep neural network model.
Specifically, the training module 201 frames the training voice, taking every 1000 frames as one sample; attaches a class label to each sample; and uses the labeled samples as training samples for the deep neural network model.
In this embodiment, the purpose of cutting every 1000 frames is to give the model a fixed-length input: recordings of different lengths would produce differently distributed MFCC (Mel-Frequency Cepstral Coefficients) features, and a non-fixed input feature easily makes the model's recognition inaccurate. For recordings shorter than 1000 frames but longer than 100 frames, all-zero frames are spliced at the end; recordings shorter than 100 frames are simply discarded and considered to contain no speech. The benefit of taking every 1000 frames of every recording as one training sample is that the model can learn the sound characteristics of each period of such a voice, which is more robust than training on only a single 1000-frame segment of one recording.
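The framing rule above (fixed 1000-frame samples, zero-padding for recordings of 100-999 frames, discarding anything under 100 frames) can be sketched in plain Python. This is one illustrative reading of the rule, not the patent's code; each frame is represented as a 20-value MFCC vector, and the handling of a short trailing chunk is an assumption.

```python
def prepare_sample(frames, target_len=1000, min_len=100, dim=20):
    """Cut a recording (a list of 20-value MFCC frame vectors) into
    fixed-length samples, following the rule described above:
    - recordings shorter than 100 frames are discarded (no speech);
    - every 1000 frames becomes one sample;
    - a chunk of 100-999 frames is padded at the end with all-zero frames;
    - a trailing chunk under 100 frames is dropped (assumed behavior)."""
    if len(frames) < min_len:
        return []  # too short: assumed to contain no speech
    samples = []
    for start in range(0, len(frames), target_len):
        chunk = frames[start:start + target_len]
        if len(chunk) < min_len and samples:
            break  # trailing remnant too short to keep
        padding = [[0.0] * dim for _ in range(target_len - len(chunk))]
        samples.append(chunk + padding)
    return samples
```

A 150-frame recording thus yields one sample whose last 850 rows are zero frames, while a 50-frame recording yields no sample at all.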
In the training stage, a label is attached to each input recording to be identified: the genuine class is [1 0 0 0], type-1 forgery is [0 1 0 0], type-2 forgery is [0 0 1 0] and type-3 forgery is [0 0 0 1] (one-hot encoding). The genuine class is, as the name suggests, real live speech. Forged voice is divided into three kinds of forgery: type 1 is music, type 2 is recording replay, and type 3 is technical voice forgery. Type-1 forgery refers to music as the voiceprint recognition input: because music contains rich acoustic components, it can pass through voice registration and verification normally, but it contains no information about the speaker's voice and is therefore not a target recording for voiceprint recognition. Type-2 forgery is mainly the simple replay of a recording: the target person's speech or music is recorded with a device such as a recording pen or mobile phone, and the recording is then replayed directly into the voiceprint recognition input. Type-3 forgery mainly refers to forging the target person's speech using speech synthesis or voice conversion technology: synthesized recordings are generally produced by collecting a certain amount of the target person's voice data and using synthesis means to generate speech for specified text content, while voice-conversion recordings directly alter the spectrum of an original recording; because this kind of forgery involves a large amount of speech processing technology, it is called all-round forgery.
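A minimal label table for the four classes can make the encoding concrete. It assumes standard one-hot encoding with the genuine class in the first position; the class names are descriptive stand-ins, not terms from the patent.

```python
# One-hot training labels for the four voice classes, as assumed above.
LABELS = {
    "genuine":   [1, 0, 0, 0],  # real live speech
    "music":     [0, 1, 0, 0],  # type-1 forgery: music input
    "replay":    [0, 0, 1, 0],  # type-2 forgery: recording replay
    "synthetic": [0, 0, 0, 1],  # type-3 forgery: synthesis / voice conversion
}

def one_hot(cls):
    """Return the 1*4 training label for a named class."""
    return LABELS[cls]
```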
As for how the DNN (deep neural network) is trained with the training samples, it is summarized as follows. Model training uses the open-source Keras framework. Considering hardware limitations, the DNN is trained with the minibatch technique: each batch size is set to 128, each iteration trains 1000 batches, and N iterations are trained in total. Each batch randomly selects 128 voice MFCC feature samples from the full data, generates the model output, and then updates the model parameters by back-propagating according to the loss function, thereby completing 1 batch; generating 1000 batches of data in this way completes 1000 training steps and yields the model output of one iteration. Under normal circumstances, the model with the best loss function value among 50 iterations is selected. The convolution kernel of the first layer is 9*20, Nfilters is set to 512, the loss function is set to the categorical cross-entropy (categorical_crossentropy) over all discriminated classes, and the optimizer is adagrad.
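The minibatch schedule (128 samples per batch, 1000 batches per iteration) can be sketched independently of any framework. Here `train_step` is a hypothetical callable standing in for the Keras forward pass plus parameter update; the real training would use categorical_crossentropy and the adagrad optimizer as stated above.

```python
import random

def run_iteration(dataset, train_step, batch_size=128, n_batches=1000):
    """One training iteration as described above: 1000 batches, each of
    128 samples drawn at random from the full data, with a parameter
    update (loss feedback) after every batch. Returns the mean loss."""
    total_loss = 0.0
    for _ in range(n_batches):
        batch = random.sample(dataset, batch_size)  # 128 random MFCC samples
        total_loss += train_step(batch)             # update model, get loss
    return total_loss / n_batches
```

Repeating `run_iteration` N times and keeping the model with the lowest loss corresponds to the "best of 50 iterations" selection described in the text.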
The speech processing module 202 is used to obtain speech to be detected and frame it to obtain a 1000*20 matrix.
In this embodiment, the speech processing module 202 is specifically used to: after framing the speech to be detected, extract 1000 frames and compute 20-dim MFCC features for each frame; and generate the 1000*20 matrix from the 20-dim MFCCs of the 1000 frames.
In this embodiment, the framing of the speech to be detected is performed in the same way as the processing of the training voice described above: for recordings shorter than 1000 frames but longer than 100 frames, all-zero frames are spliced at the end; recordings shorter than 100 frames are simply discarded and considered to contain no speech. The computation of MFCC features is a conventional algorithm and is not repeated in the present invention.
The matrix input module 203 is used to input the 1000*20 matrix into the optimal deep neural network model.
In this embodiment, the input layer of the optimal deep neural network model (DNN) obtained above takes a matrix input, so the 1000*20 matrix obtained by the speech processing module 202 can be input directly into the obtained optimal deep neural network model.
The matrix computing module 204 is used to compute on the 1000*20 matrix with the optimal deep neural network model to obtain a 1*4 output vector, the 1*4 output vector representing 4 voice classes.
Specifically, in the first layer of the DNN model the matrix computing module convolves the 1000*20 input features with convolution kernels; the purpose of this layer is to project the features of adjacent frames, obtaining N channel features after the Nfilters convolution kernels have been applied. In the second to fourth layers, convolution is performed with 1*1 kernels and the LeakyReLU activation function is used; the effect of these 1*1 kernels is to allow interaction between channels, so that the model learns more intra-frame and inter-frame features. The fifth layer performs pooling, extracting the maximum over 2*2 kernel ranges (2*2 MaxPooling, with the default stride of 1*1); this layer selects certain upper-layer nodes, reduces the number of model parameters, and makes over-fitting less likely. The sixth layer flattens the output nodes of the previous layer to obtain a 1*P feature vector. The seventh layer, a linear layer, reduces the dimensionality of the sixth layer's output to obtain the output Out7; taking Out7 as input, a softmax activation function outputs a 1*4 vector, i.e. 4 values, as the detection result.
The judgment module 205 is used to select the class with the largest value in the 1*4 output vector as the class of the speech to be detected, where the values of the 1*4 output vector lie in the range 0 to 1.
In this embodiment, the 1*4 output vector expresses, as four fractions in the range 0-1, the probabilities of belonging to the respective classes, i.e. the probabilities of real speech and of type-1, type-2 and type-3 forgery described above; the class with the largest of these four probabilities is the class of the input voice. In other words, the output values directly and effectively indicate whether the speech to be detected is live, genuinely spoken voice.
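The final softmax and the argmax decision of the judgment module can be illustrated numerically. This sketch assumes Out7 is a list of 4 raw linear scores; the class names are descriptive stand-ins for the four classes above.

```python
import math

CLASSES = ["genuine", "music forgery", "replay forgery", "synthetic forgery"]

def softmax(logits):
    """Map the 4 linear outputs (Out7) to the 1*4 vector of 0-1 scores
    summing to 1, as produced by the softmax activation."""
    m = max(logits)                      # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def judge(logits):
    """Judgment module: pick the class with the largest score."""
    probs = softmax(logits)
    return CLASSES[probs.index(max(probs))], probs
```

With `judge([4.0, 0.5, 0.2, 0.1])` the first score dominates, so the speech is judged genuine.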
Through the above program modules 201-205, the server proposed by the present invention trains a deep neural network model to obtain an optimal deep neural network model; obtains speech to be detected and frames it to obtain a 1000*20 matrix; inputs the 1000*20 matrix into the optimal deep neural network model; computes on the 1000*20 matrix with the optimal deep neural network model to obtain a 1*4 output vector representing 4 voice classes; and selects the class with the largest value in the 1*4 output vector as the class of the speech to be detected. In this way, before a voice is used for an application, it can be quickly detected whether the voice was directly produced by the user or maliciously forged by someone else, so that a higher level of security can be provided for voice control, promoting the development of speech recognition technology.
In addition, the present invention also proposes a deep-learning-based voice liveness detection method.
Referring to Fig. 3, there is shown a schematic implementation flow of a first embodiment of the deep-learning-based voice liveness detection method of the present invention. In this embodiment, the execution order of the steps in the flow chart shown in Fig. 3 may change according to different requirements, and certain steps may be omitted.
Step S301: train a deep neural network model to obtain an optimal deep neural network model.
Specifically, the above step includes: framing the training voice, taking every 1000 frames as one sample; attaching a class label to each sample; and using the labeled samples as training samples for the deep neural network model.
In the present embodiment, the purpose that every 1000 frame blocks is the input for making model have regular length, considers different length The recording of degree will produce different MFCC (Mel-Frequency Cepstral Coefficients, cepstral coefficients) feature Distribution Effect be easy to cause Model Identification inaccuracy if input feature vector on-fixed.For being shorter than 1000 frames and being longer than 100 frames Recording, we splice full 0 frame behind;For being shorter than the recording of 100 frames, we directly cast out, it is believed that are said without people Words.Every 1000 frames of all recording be the benefits of a training sample be can allow model learning to such voice each period sound Sound feature, than using some 1000 frame training effect of certain section of recording to have more robustness merely.
Training stage is tagged to each input recording to be identified, and if proper class is [0000], one kind is forged [0100], two classes forge [0010], and three classes forge [0001].Specifically, proper class is as its name suggests, and really it is as voice, and it is right In forging voice, it is divided into three kinds of forgeries, it is music that the first kind, which is forged, and the second class is forged forges for recording replay, and third class is forged It is forged for technical voice.It is music that first kind forgery, which refers to Application on Voiceprint Recognition input, and music is due to containing abundant acoustic constituents, energy Be normally carried out the registration and verification of voice, but and the information not comprising speaker's sound, therefore the not target recording of Application on Voiceprint Recognition; Second class forges the simple replay of predominantly recording, as spoken with target person under recording pen, the record of mobile phone equipment or music etc. Then voice directly replays the input terminal to Application on Voiceprint Recognition;Third class, which is forged, to be referred mainly to convert using phonetic synthesis or voice Technology carries out target person and speaks forgerys, and phonetic synthesis recording, which generally acquires a certain amount of voice data of target person, just can use synthesizing mean The voice that target person specifies content of text is generated, voice conversion recording is that the change of frequency spectrum is directly carried out to original recording, such It forges due to containing a large amount of voice process technologies, therefore referred to as poly-talented forgery.
As for how the DNN (deep neural network) is trained with the training samples, it is summarized as follows. Model training is carried out with the open-source Keras framework. In view of hardware limitations, the DNN is trained with the mini-batch technique: the batch size is set to 128, each training iteration comprises 1000 batches, and N iterations are trained in total. Each batch randomly selects 128 voice MFCC feature samples from the full data set and generates the model output; the model parameters are then updated by back-propagating the loss-function feedback, completing one batch. Generating 1000 such batches completes the 1000 training steps of one iteration and yields that iteration's model output. Under normal circumstances, the model with the optimal loss among 50 iterations is selected. The convolution kernel of the first layer is 9*20, Nfilters is set to 512, the loss function is the categorical cross-entropy (categorical_crossentropy) over all discriminated classes, and the optimizer is adagrad.
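A Keras training loop matching the stated settings might look like the following (a sketch: the loop structure, model-selection logic, and function name are assumptions; only the batch size, batches per iteration, loss, and optimizer come from the description):

```python
import numpy as np
from tensorflow import keras

def train(model, x, y, batch_size=128, batches_per_iter=1000, n_iters=50):
    """Mini-batch training: random batches of MFCC samples, 1000 batches
    per iteration, keeping the weights of the best-loss iteration."""
    model.compile(optimizer="adagrad", loss="categorical_crossentropy")
    best_loss, best_weights = float("inf"), model.get_weights()
    for _ in range(n_iters):
        for _ in range(batches_per_iter):
            # randomly select batch_size MFCC feature samples from all data
            idx = np.random.choice(len(x), batch_size)
            loss = float(model.train_on_batch(x[idx], y[idx]))
        if loss < best_loss:          # select the optimal-loss model
            best_loss, best_weights = loss, model.get_weights()
    model.set_weights(best_weights)
    return model
```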
Step S302: obtain the voice to be detected and perform framing on the voice to be detected to obtain a 1000*20 matrix.
In the present embodiment, the speech processing module 202 is specifically configured to: after framing the voice to be detected, extract 1000 frames and compute a 20-dimensional MFCC feature for each frame; then generate the 1000*20 matrix from the 20-dimensional MFCCs of the 1000 frames.
In the present embodiment, the framing of the voice to be detected is identical to the processing of the training voice described above: for recordings shorter than 1000 frames but longer than 100 frames, all-zero frames are spliced on at the end; recordings shorter than 100 frames are discarded, on the assumption that no one is speaking. The computation of MFCC features is a conventional algorithm and is not repeated in the present invention.
Step S303: input the 1000*20 matrix into the optimal deep neural network model.
In the present embodiment, the input layer of the optimal deep neural network model DNN obtained above takes a matrix as input, so the 1000*20 matrix produced by the speech processing module 202 can be fed directly into the optimal deep neural network model.
Step S304: compute on the 1000*20 matrix using the optimal deep neural network model to obtain a 1*4 output vector, the 1*4 output vector representing 4 voice classes.
Specifically, in the first layer of the DNN model the matrix computing module 204 convolves the input features (the 1000*20 matrix) with the first-layer convolution kernels; the purpose of this layer is to project the features of consecutive frames, with Nfilters (N filters) controlling how many channel features are obtained after convolution by each kernel. In the second to fourth layers, convolution is performed with 1*1 kernels and the LeakyReLU activation function is used; the effect of these 1*1 kernels is to let the channels connect and interact, so the model learns more multi-frame and inter-frame features. The fifth layer performs pooling, extracting the maximum value over 2*2 kernel ranges (2*2 MaxPooling) with the default stride of 1*1; this layer selects certain upper-layer nodes, reducing the number of model parameters and making over-fitting less likely. The sixth layer flattens: flattening the output nodes of the previous layer yields a 1*P feature vector. The seventh layer, a linear layer, performs dimensionality reduction on the sixth layer to obtain the output Out7; taking Out7 as the input of the seventh layer and applying the softmax activation function produces a 1*4 vector, i.e., 4 output values, as the detection result.
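The seven-layer structure can be sketched in Keras as follows (a sketch under assumptions: the 'same' padding on the first layer and the single input channel are added so that the stated 2*2, stride-1*1 pooling remains valid; layer details beyond those stated are illustrative):

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_dnn(n_filters=512):
    """Sketch of the seven-layer DNN described above."""
    return keras.Sequential([
        layers.Input(shape=(1000, 20, 1)),     # the 1000*20 MFCC matrix
        # Layer 1: 9*20 kernels project consecutive-frame features into
        # Nfilters channels ('same' padding is an assumption).
        layers.Conv2D(n_filters, (9, 20), padding="same"),
        # Layers 2-4: 1*1 convolutions let channels interact, each with
        # LeakyReLU activation.
        layers.Conv2D(n_filters, (1, 1)), layers.LeakyReLU(),
        layers.Conv2D(n_filters, (1, 1)), layers.LeakyReLU(),
        layers.Conv2D(n_filters, (1, 1)), layers.LeakyReLU(),
        # Layer 5: 2*2 max pooling with the default 1*1 stride.
        layers.MaxPooling2D(pool_size=(2, 2), strides=(1, 1)),
        # Layer 6: flatten to a 1*P feature vector.
        layers.Flatten(),
        # Layer 7: linear reduction to 4 units, then softmax gives the
        # 1*4 class-probability vector.
        layers.Dense(4),
        layers.Activation("softmax"),
    ])
```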
Step S305: select the class with the largest value in the 1*4 output vector as the class of the voice to be detected, where the values of the 1*4 output vector lie in the range 0 to 1.
In the present embodiment, the 1*4 output vector expresses, as four numbers in the range 0-1, the probability of belonging to each class: the probability of genuine speech and of the type-one, type-two, and type-three forgeries, respectively. The largest of these four probabilities identifies the class of the input voice; that is, the output values directly and effectively indicate whether the voice to be detected is live speech, i.e., genuine speech.
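The final decision is a simple argmax over the four probabilities (a sketch; the class-name strings are illustrative):

```python
import numpy as np

CLASSES = ["genuine", "music forgery", "replay forgery", "synthetic forgery"]

def classify(output_vec):
    """Pick the class with the largest probability in the 1*4 output."""
    probs = np.asarray(output_vec, dtype=float).ravel()
    return CLASSES[int(np.argmax(probs))], float(probs.max())
```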
Through steps S301-S305 above, the voice liveness detection method based on deep learning proposed by the present invention proceeds as follows: first, the deep neural network model is trained to obtain the optimal deep neural network model; secondly, the voice to be detected is obtained and framed to obtain a 1000*20 matrix; thirdly, the 1000*20 matrix is input into the optimal deep neural network model; then, the 1000*20 matrix is computed on using the optimal deep neural network model to obtain a 1*4 output vector representing 4 voice classes; finally, the class with the largest value in the 1*4 output vector is selected as the class of the voice to be detected. In this way, before a voice is used for a corresponding application, it can be quickly detected whether the voice was produced directly by the user or is a voice maliciously forged by someone else, providing a higher level of safety for voice-controlled security and promoting the development of speech recognition technology.
The present invention also provides another embodiment, namely a storage medium storing a deep-learning-based voice liveness detection program, the program being executable by at least one processor so that the at least one processor performs the steps of the deep-learning-based voice liveness detection method described above.
The serial numbers of the above embodiments of the present invention are for description only and do not represent the relative merits of the embodiments.
Through the above description of the embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by means of software plus a necessary general-purpose hardware platform, and can of course also be implemented by hardware, but in many cases the former is the better implementation. Based on this understanding, the technical solution of the present invention, in essence or in the part contributing to the prior art, can be embodied in the form of a software product stored in a storage medium (such as ROM/RAM, magnetic disk, or optical disc) and including instructions that cause a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, a network device, or the like) to execute the methods described in the embodiments of the present invention.
The above are only preferred embodiments of the present invention and are not intended to limit the patent scope of the invention; any equivalent structural or process transformation made using the contents of the present specification and drawings, applied directly or indirectly in other related technical fields, is likewise included within the patent protection scope of the present invention.

Claims (10)

1. A voice liveness detection method based on deep learning, applied to a server, characterized in that the method comprises the following steps:
training a deep neural network model to obtain an optimal deep neural network model;
obtaining a voice to be detected and framing the voice to be detected to obtain a 1000*20 matrix;
inputting the 1000*20 matrix into the optimal deep neural network model;
computing on the 1000*20 matrix using the optimal deep neural network model to obtain a 1*4 output vector, the 1*4 output vector representing 4 voice classes; and
selecting the class with the largest value in the 1*4 output vector as the class of the voice to be detected.
2. The voice liveness detection method based on deep learning according to claim 1, characterized in that the step of training a deep neural network model to obtain an optimal deep neural network model comprises:
framing a training voice and taking every 1000 frames as one sample;
applying a class label to each sample; and
using the labeled samples as training samples of the deep neural network model.
3. The voice liveness detection method based on deep learning according to claim 1, characterized in that the step of obtaining a voice to be detected and framing the voice to be detected to obtain a 1000*20 matrix specifically comprises:
after framing the voice to be detected, extracting 1000 frames and computing a 20-dimensional MFCC feature for each frame; and
generating the 1000*20 matrix from the 20-dimensional MFCCs of the 1000 frames.
4. The voice liveness detection method based on deep learning according to claim 1, characterized in that computing on the 1000*20 matrix using the optimal deep neural network model comprises:
performing convolution on the input features with convolution kernels in a first layer;
performing convolution with 1*1 convolution kernels in second to fourth layers, using a LeakyReLU activation function;
performing pooling in a fifth layer by extracting the maximum value over 2*2 kernel ranges;
flattening in a sixth layer;
performing dimensionality reduction on the sixth layer in a seventh layer to obtain an output Out7; and
taking Out7 as the input of the seventh layer and outputting, via a softmax activation function, a 1*4 vector as the detection result.
5. The voice liveness detection method based on deep learning according to any one of claims 1-4, characterized in that the 1*4 output vector consists of values in the range 0 to 1.
6. A server, characterized in that the server comprises a memory and a processor, the memory storing a deep-learning-based voice liveness detection program runnable on the processor, the deep-learning-based voice liveness detection program implementing the following steps when executed by the processor:
training a deep neural network model to obtain an optimal deep neural network model;
obtaining a voice to be detected and framing the voice to be detected to obtain a 1000*20 matrix;
inputting the 1000*20 matrix into the optimal deep neural network model;
computing on the 1000*20 matrix using the optimal deep neural network model to obtain a 1*4 output vector, the 1*4 output vector representing 4 voice classes; and
selecting the class with the largest value in the 1*4 output vector as the class of the voice to be detected.
7. The server according to claim 6, characterized in that, when the deep-learning-based voice liveness detection program is executed by the processor, the step of training a deep neural network model to obtain an optimal deep neural network model comprises:
framing a training voice and taking every 1000 frames as one sample;
applying a class label to each sample; and
using the labeled samples as training samples of the deep neural network model.
8. The server according to claim 6, characterized in that, when the deep-learning-based voice liveness detection program is executed by the processor, the step of obtaining a voice to be detected and framing the voice to be detected to obtain a 1000*20 matrix specifically comprises:
after framing the voice to be detected, extracting 1000 frames and computing a 20-dimensional MFCC feature for each frame; and
generating the 1000*20 matrix from the 20-dimensional MFCCs of the 1000 frames.
9. The server according to any one of claims 6-8, characterized in that, when the deep-learning-based voice liveness detection program is executed by the processor, the step of computing on the 1000*20 matrix using the optimal deep neural network model to obtain a 1*4 output vector comprises:
performing convolution on the input features with convolution kernels in a first layer;
performing convolution with 1*1 convolution kernels in second to fourth layers, using a LeakyReLU activation function;
performing pooling in a fifth layer by extracting the maximum value over 2*2 kernel ranges;
flattening in a sixth layer;
performing dimensionality reduction on the sixth layer in a seventh layer to obtain an output Out7; and
taking Out7 as the input of the seventh layer and outputting, via a softmax activation function, a 1*4 vector as the detection result.
10. A storage medium storing a deep-learning-based voice liveness detection program, the deep-learning-based voice liveness detection program being executable by at least one processor so that the at least one processor performs the steps of the voice liveness detection method based on deep learning according to any one of claims 1-5.
CN201810029892.6A 2018-01-12 2018-01-12 Voice biopsy method, server and storage medium based on deep learning Pending CN108281158A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201810029892.6A CN108281158A (en) 2018-01-12 2018-01-12 Voice biopsy method, server and storage medium based on deep learning
PCT/CN2018/089203 WO2019136909A1 (en) 2018-01-12 2018-05-31 Voice living-body detection method based on deep learning, server and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810029892.6A CN108281158A (en) 2018-01-12 2018-01-12 Voice biopsy method, server and storage medium based on deep learning

Publications (1)

Publication Number Publication Date
CN108281158A true CN108281158A (en) 2018-07-13

Family

ID=62803422

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810029892.6A Pending CN108281158A (en) 2018-01-12 2018-01-12 Voice biopsy method, server and storage medium based on deep learning

Country Status (2)

Country Link
CN (1) CN108281158A (en)
WO (1) WO2019136909A1 (en)


Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110504027A (en) * 2019-08-20 2019-11-26 东北大学 A kind of X-Ray rabat pneumonia intelligent diagnosis system and method based on deep learning

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102436810A (en) * 2011-10-26 2012-05-02 华南理工大学 Record replay attack detection method and system based on channel mode noise
CN105869630A (en) * 2016-06-27 2016-08-17 上海交通大学 Method and system for detecting voice spoofing attack of speakers on basis of deep learning
CN106409298A (en) * 2016-09-30 2017-02-15 广东技术师范学院 Identification method of sound rerecording attack
CN106531172A (en) * 2016-11-23 2017-03-22 湖北大学 Speaker voice playback identification method and system based on environmental noise change detection

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9697833B2 (en) * 2015-08-25 2017-07-04 Nuance Communications, Inc. Audio-visual speech recognition with scattering operators
CN107545248B (en) * 2017-08-24 2021-04-02 北京小米移动软件有限公司 Biological characteristic living body detection method, device, equipment and storage medium


Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
CHUNLEI ZHANG 等: "An Investigation of Deep-Learning Frameworks for Speaker Verification Antispoofing", 《IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING》 *
HUIXIN LIANG 等: "RECOGNITION OF SPOOFED VOICE USING CONVOLUTIONAL NEURAL NETWORKS", 《2017 IEEE GLOBALSIP》 *
XIAOHAI TIAN 等: "Spoofing Speech Detection using Temporal Convolutional Neural Network", 《2016 APSIPA》 *
CHEN MIN: "Introduction to Cognitive Computing" (《认知计算导论》), 30 April 2017, Huazhong University of Science and Technology Press *

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109036459A (en) * 2018-08-22 2018-12-18 百度在线网络技术(北京)有限公司 Sound end detecting method, device, computer equipment, computer storage medium
CN109036459B (en) * 2018-08-22 2019-12-27 百度在线网络技术(北京)有限公司 Voice endpoint detection method and device, computer equipment and computer storage medium
CN109346089A (en) * 2018-09-27 2019-02-15 深圳市声扬科技有限公司 Living body identity identifying method, device, computer equipment and readable storage medium storing program for executing
CN109801638A (en) * 2019-01-24 2019-05-24 平安科技(深圳)有限公司 Speech verification method, apparatus, computer equipment and storage medium
CN109801638B (en) * 2019-01-24 2023-10-13 平安科技(深圳)有限公司 Voice verification method, device, computer equipment and storage medium
CN115280410A (en) * 2020-01-13 2022-11-01 密歇根大学董事会 Safe automatic speaker verification system
EP4091164A4 (en) * 2020-01-13 2024-01-24 Univ Michigan Regents Secure automatic speaker verification system
CN111933154A (en) * 2020-07-16 2020-11-13 平安科技(深圳)有限公司 Method and device for identifying counterfeit voice and computer readable storage medium
CN111933154B (en) * 2020-07-16 2024-02-13 平安科技(深圳)有限公司 Method, equipment and computer readable storage medium for recognizing fake voice
CN112489677A (en) * 2020-11-20 2021-03-12 平安科技(深圳)有限公司 Voice endpoint detection method, device, equipment and medium based on neural network
CN112489677B (en) * 2020-11-20 2023-09-22 平安科技(深圳)有限公司 Voice endpoint detection method, device, equipment and medium based on neural network
CN112735381A (en) * 2020-12-29 2021-04-30 四川虹微技术有限公司 Model updating method and device
CN112735431A (en) * 2020-12-29 2021-04-30 三星电子(中国)研发中心 Model training method and device and artificial intelligence dialogue recognition method and device
CN112735381B (en) * 2020-12-29 2022-09-27 四川虹微技术有限公司 Model updating method and device
CN112735431B (en) * 2020-12-29 2023-12-22 三星电子(中国)研发中心 Model training method and device and artificial intelligent dialogue recognition method and device

Also Published As

Publication number Publication date
WO2019136909A1 (en) 2019-07-18

Similar Documents

Publication Publication Date Title
CN108281158A (en) Voice biopsy method, server and storage medium based on deep learning
CN107527620A (en) Electronic installation, the method for authentication and computer-readable recording medium
CN107993071A (en) Electronic device, auth method and storage medium based on vocal print
CN107564513A (en) Audio recognition method and device
CN107785015A (en) A kind of audio recognition method and device
CN109409971A (en) Abnormal order processing method and device
CN106104674A (en) Mixing voice identification
CN108154371A (en) Electronic device, the method for authentication and storage medium
CN110675862A (en) Corpus acquisition method, electronic device and storage medium
CN109308912A (en) Music style recognition methods, device, computer equipment and storage medium
CN108986798A (en) Processing method, device and the equipment of voice data
CN108694952A (en) Electronic device, the method for authentication and storage medium
CN111508524A (en) Method and system for identifying voice source equipment
CN111161713A (en) Voice gender identification method and device and computing equipment
CN109658943A (en) A kind of detection method of audio-frequency noise, device, storage medium and mobile terminal
CN107229691A (en) A kind of method and apparatus for being used to provide social object
Li et al. Anti-forensics of audio source identification using generative adversarial network
CN108650266A (en) Server, the method for voice print verification and storage medium
Wang et al. Robust speaker identification of iot based on stacked sparse denoising auto-encoders
Dixit et al. Review of audio deepfake detection techniques: Issues and prospects
US11630950B2 (en) Prediction of media success from plot summaries using machine learning model
CN116152938A (en) Method, device and equipment for training identity recognition model and transferring electronic resources
CN111933154B (en) Method, equipment and computer readable storage medium for recognizing fake voice
CN110415708A (en) Method for identifying speaker, device, equipment and storage medium neural network based
CN116564269A (en) Voice data processing method and device, electronic equipment and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20180713