CN108281158A - Voice liveness detection method, server and storage medium based on deep learning - Google Patents
- Publication number: CN108281158A (application CN201810029892.6A)
- Authority: CN (China)
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques characterised by the analysis technique using neural networks
- G10L25/48—Speech or voice analysis techniques specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques specially adapted for particular use for comparison or discrimination
Abstract
The invention discloses a deep-learning-based voice liveness detection method applied to a server, comprising: training a deep neural network model to obtain an optimal deep neural network model; acquiring a voice to be detected and framing it to obtain a 1000×20 matrix; inputting the 1000×20 matrix into the optimal deep neural network model; computing on the 1000×20 matrix with the optimal deep neural network model to obtain a 1×4 output vector, the 1×4 output vector representing four voice classes; and selecting the class with the largest value in the 1×4 output vector as the class of the voice to be detected. The invention further discloses a server and a storage medium. By implementing the above scheme, a higher level of security can be provided for voice control, promoting the development of speech recognition technology.
Description
Technical field
The present invention relates to the field of computer technology, and more particularly to a deep-learning-based voice liveness detection method, server, and storage medium.
Background technology
With the continuous development of speech recognition technology, speech recognition applications are increasingly common, including voice control, voice payment, and so on. At present, however, speech recognition typically only identifies semantics and cannot reliably distinguish whether a voice was produced by a person or played back from a recording. For example, when waking an Apple terminal device with Siri, saying "hi, Siri" wakes the device regardless of whether it hears a live speaker or a recording; the device cannot distinguish the source of the input voice. Liveness detection for voice is therefore particularly important. Voice liveness detection identifies whether the input is a real person speaking. Voice not spoken by a real person is commonly called forged recording, and includes music input, recording replay, and voice generated by technical means such as speech synthesis. Forged recordings are frequently used against finance and security applications: an attacker breaks in by passing voiceprint recognition with a forged recording, logs in to the victim's account, and steals assets or damages the victim's reputation and property.
Summary of the invention
In view of this, the present invention proposes a deep-learning-based voice liveness detection method, server, and storage medium, so that before a voice is used for an application, it can be quickly detected whether the voice was directly produced by the user or maliciously forged by someone else. In this way, a higher level of security can be provided for voice control, promoting the development of speech recognition technology.
First, to achieve the above object, the present invention proposes a server comprising a memory and a processor, the memory storing a deep-learning-based voice liveness detection program runnable on the processor. When executed by the processor, the program implements the following steps: training a deep neural network model to obtain an optimal deep neural network model; acquiring a voice to be detected and framing it to obtain a 1000×20 matrix; inputting the 1000×20 matrix into the optimal deep neural network model; computing on the 1000×20 matrix with the model to obtain a 1×4 output vector representing four voice classes; and selecting the class with the largest value in the 1×4 output vector as the class of the voice to be detected.
Optionally, when the deep-learning-based voice liveness detection program is executed by the processor, the step of training a deep neural network model to obtain an optimal deep neural network model comprises: framing the training voice, taking every 1000 frames as one sample; attaching a class label to each sample; and using the labeled samples as training samples for the deep neural network model.
Optionally, when the deep-learning-based voice liveness detection program is executed by the processor, the step of computing on the 1000×20 matrix with the optimal deep neural network model to obtain a 1×4 output vector specifically comprises: in the first layer, convolving the 1000×20 input features with convolution kernels; in the second to fourth layers, convolving with 1×1 kernels and applying the LeakyReLU activation function; in the fifth layer, pooling by extracting the maximum value over 2×2 kernel ranges; in the sixth layer, flattening; in the seventh layer, reducing the dimensionality of the sixth-layer output to obtain an output Out7, then taking Out7 as input and applying the softmax activation function to produce a 1×4 vector as the detection result.
In addition, to achieve the above object, the present invention also provides a deep-learning-based voice liveness detection method applied to a server, the method comprising the following steps: training a deep neural network model to obtain an optimal deep neural network model; acquiring a voice to be detected and framing it to obtain a 1000×20 matrix; inputting the 1000×20 matrix into the optimal deep neural network model; computing on the 1000×20 matrix with the model to obtain a 1×4 output vector representing four voice classes; and selecting the class with the largest value in the 1×4 output vector as the class of the voice to be detected.
Optionally, the step of training a deep neural network model to obtain an optimal deep neural network model comprises: framing the training voice, taking every 1000 frames as one sample; attaching a class label to each sample; and using the labeled samples as training samples for the deep neural network model.
Optionally, the step of acquiring a voice to be detected and framing it to obtain a 1000×20 matrix specifically comprises: after framing the voice to be detected, extracting 1000 frames and computing a 20-dimensional MFCC feature for each frame; and generating the 1000×20 matrix from the 20-dimensional MFCCs of the 1000 frames.
Optionally, the step of computing on the 1000×20 matrix with the optimal deep neural network model to obtain a 1×4 output vector specifically comprises: in the first layer, convolving the 1000×20 input features with convolution kernels; in the second to fourth layers, convolving with 1×1 kernels and applying the LeakyReLU activation function; in the fifth layer, pooling by extracting the maximum value over 2×2 kernel ranges; in the sixth layer, flattening; in the seventh layer, reducing the dimensionality of the sixth-layer output to obtain an output Out7, then taking Out7 as input and applying the softmax activation function to produce a 1×4 vector as the detection result.
Optionally, the values of the 1×4 output vector lie in the range 0 to 1.
Further, to achieve the above object, the present invention also provides a storage medium storing a deep-learning-based voice liveness detection program executable by at least one processor, so that the at least one processor performs the steps of the deep-learning-based voice liveness detection method described above.
Compared with the prior art, the deep-learning-based voice liveness detection method, server, and storage medium proposed by the present invention first train a deep neural network model to obtain an optimal deep neural network model; second, acquire a voice to be detected and frame it to obtain a 1000×20 matrix; third, input the 1000×20 matrix into the optimal deep neural network model; then compute on the 1000×20 matrix with the model to obtain a 1×4 output vector representing four voice classes; and finally select the class with the largest value in the 1×4 output vector as the class of the voice to be detected. In this way, before a voice is used for an application, it can be quickly detected whether the voice was directly produced by the user or maliciously forged by someone else, providing a higher level of security for voice control and promoting the development of speech recognition technology.
Description of the drawings
Fig. 1 is a schematic diagram of an optional hardware architecture of the server of the present invention;
Fig. 2 is a program module diagram of the first embodiment of the deep-learning-based voice liveness detection program of the present invention;
Fig. 3 is a flow chart of the first embodiment of the deep-learning-based voice liveness detection method of the present invention.
Reference numerals:
The realization of the objects, functions, and advantages of the present invention will be further described with reference to the accompanying drawings and embodiments.
Detailed description of the embodiments
To make the objects, technical solutions, and advantages of the present invention clearer, the present invention is further elaborated below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are intended only to explain the present invention, not to limit it. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative work shall fall within the protection scope of the present invention.
It should be noted that descriptions involving "first", "second", and the like in the present invention are for description purposes only and cannot be interpreted as indicating or implying relative importance or implicitly indicating the quantity of the indicated technical features. Thus, a feature qualified by "first" or "second" may explicitly or implicitly include at least one such feature. In addition, the technical solutions of the embodiments may be combined with each other, provided the combination can be implemented by those of ordinary skill in the art; when a combination of technical solutions is contradictory or cannot be achieved, it should be understood that such a combination does not exist and does not fall within the protection scope claimed by the present invention.
As shown in Fig. 1, which is a schematic diagram of an optional hardware architecture of the server 1.
The server 1 may be a computing device such as a rack-mount server, blade server, tower server, or cabinet server, and may be an independent server or a server cluster formed of multiple servers.
In the present embodiment, the server 1 may include, but is not limited to, a memory 11, a processor 12, and a network interface 13 that can be communicatively connected to one another through a system bus.
The server 1 connects to a network through the network interface 13 to obtain information. The network may be a wireless or wired network such as an intranet, the Internet, the Global System for Mobile Communications (GSM), Wideband Code Division Multiple Access (WCDMA), a 4G network, a 5G network, Bluetooth, Wi-Fi, or a telephone network.
It should be pointed out that Fig. 1 shows only the server 1 with components 11-13; it should be understood that not all of the shown components are required, and more or fewer components may be implemented instead.
The memory 11 includes at least one type of storage medium, including flash memory, hard disk, multimedia card, card-type memory (e.g., SD or DX memory), random-access memory (RAM), static random-access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disk, optical disk, and so on. In some embodiments, the memory 11 may be an internal storage unit of the server 1, such as a hard disk or memory of the server 1. In other embodiments, the memory 11 may instead be an external storage device of the server 1, such as a plug-in hard disk, smart media card (SMC), secure digital (SD) card, or flash card equipped on the server 1. Of course, the memory 11 may also include both an internal storage unit of the server 1 and an external storage device. In the present embodiment, the memory 11 is commonly used to store the operating system and various application software installed on the server 1, such as the program code of the deep-learning-based voice liveness detection program 200. In addition, the memory 11 can also be used to temporarily store various data that has been output or is to be output.
The processor 12 may, in some embodiments, be a central processing unit (CPU), controller, microcontroller, microprocessor, or other data processing chip. The processor 12 is commonly used to control the overall operation of the server 1, such as performing data interaction or communication-related control and processing. In the present embodiment, the processor 12 is used to run the program code or process the data stored in the memory 11, for example to run the deep-learning-based voice liveness detection program 200.
The network interface 13 may include a wireless network interface or a wired network interface, and is commonly used to establish a communication connection between the server 1 and other electronic devices.
In the present embodiment, the deep-learning-based voice liveness detection program 200 is installed and run in the server 1. When the program 200 runs, the server 1 trains a deep neural network model to obtain an optimal deep neural network model; acquires a voice to be detected and frames it to obtain a 1000×20 matrix; inputs the 1000×20 matrix into the optimal deep neural network model; computes on the 1000×20 matrix with the model to obtain a 1×4 output vector representing four voice classes; and selects the class with the largest value in the 1×4 output vector as the class of the voice to be detected. In this way, before a voice is used for an application, it can be quickly detected whether the voice was directly produced by the user or maliciously forged by someone else, providing a higher level of security for voice control and promoting the development of speech recognition technology.
So far, the application environment and the hardware structure and functions of the relevant devices of the embodiments of the present invention have been described in detail. Based on the above application environment and relevant devices, the embodiments of the present invention are proposed below.
First, the present invention proposes a deep-learning-based voice liveness detection program 200.
As shown in Fig. 2, which is a program module diagram of the first embodiment of the deep-learning-based voice liveness detection program 200 of the present invention.
In the present embodiment, the server 1 includes a series of computer program instructions stored in the memory 11, namely the deep-learning-based voice liveness detection program 200. When these computer program instructions are executed by the processor 12, the voice liveness detection operations of the embodiments of the present invention can be realized. In some embodiments, according to the specific operations realized by each part of the computer program instructions, the program 200 may be divided into one or more modules. For example, in Fig. 2, the program 200 may be divided into a training module 201, a speech processing module 202, a matrix input module 203, a matrix computing module 204, and a judgment module 205. Among them:
The training module 201 is used to train a deep neural network model to obtain an optimal deep neural network model.
Specifically, the training module 201 frames the training voice, taking every 1000 frames as one sample; attaches a class label to each sample; and uses the labeled samples as training samples for the deep neural network model.
In the present embodiment, truncating at every 1000 frames gives the model a fixed-length input: recordings of different lengths would produce different distributions of MFCC (Mel-Frequency Cepstral Coefficients) features, and a non-fixed input feature size would easily make the model's identification inaccurate. For recordings shorter than 1000 frames but longer than 100 frames, all-zero frames are spliced at the end; recordings shorter than 100 frames are discarded directly, on the assumption that no one is speaking. The benefit of taking every 1000 frames of every recording as a training sample is that the model can learn the sound characteristics of such voice in every time period, which is more robust than training on a single 1000-frame segment of one recording.
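The fixed-length rule above can be sketched in NumPy, assuming the 20-dimensional per-frame features are stacked in an (n_frames, 20) array:

```python
from typing import Optional

import numpy as np

def fix_length(frames: np.ndarray) -> Optional[np.ndarray]:
    """Pad or truncate an (n_frames, 20) MFCC matrix to exactly 1000 frames.

    Rules from the text: recordings shorter than 100 frames are discarded
    (no one is assumed to be speaking); recordings between 100 and 1000
    frames are padded with all-zero frames at the end; longer recordings
    are cut to a 1000-frame sample (only the first block is returned here).
    """
    n, d = frames.shape
    if n < 100:
        return None                       # too short: discard
    if n < 1000:
        pad = np.zeros((1000 - n, d))
        return np.vstack([frames, pad])   # splice all-zero frames behind
    return frames[:1000]                  # one fixed-length sample
```

During training, each successive 1000-frame block of a long recording would become its own sample, per the text.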
In the training stage, each input recording to be identified is labeled: the genuine class is [1 0 0 0], type-one forgery [0 1 0 0], type-two forgery [0 0 1 0], and type-three forgery [0 0 0 1]. Specifically, the genuine class is, as the name suggests, real human voice. Forged voice falls into three types: type-one forgery is music, type-two forgery is recording replay, and type-three forgery is technical voice forgery. Type-one forgery refers to music fed into voiceprint recognition: because music contains rich acoustic components, it can pass through voice registration and verification normally, but it contains no information about a speaker's voice and is therefore not a legitimate target recording for voiceprint recognition. Type-two forgery is mainly simple replay of recordings: speech captured near the target person with a recording pen, mobile phone, or similar device and then replayed directly into the voiceprint recognition input. Type-three forgery mainly refers to impersonating the target person's speech using speech synthesis or voice conversion technology: speech synthesis generally collects a certain amount of the target person's voice data and then synthesizes speech of any specified text in the target's voice, while voice conversion directly changes the spectrum of an original recording. Because this type involves a large amount of speech processing technology, it is called the versatile forgery.
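The four labels can be tabulated as below (a sketch; it takes the genuine class as [1 0 0 0], since one-hot targets are what categorical cross-entropy training expects, and the English class names are illustrative):

```python
import numpy as np

# One label per class, in the order used throughout the text:
# genuine speech, music (type one), replay (type two), synthesis (type three).
LABELS = {
    "genuine":   np.array([1, 0, 0, 0]),
    "music":     np.array([0, 1, 0, 0]),
    "replay":    np.array([0, 0, 1, 0]),
    "synthesis": np.array([0, 0, 0, 1]),
}

def label_of(kind: str) -> np.ndarray:
    """Return the one-hot training target for a recording of class `kind`."""
    return LABELS[kind]
```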
As for how to train the DNN (deep neural network) with the training samples, it is summarized as follows. Model training is carried out with the open-source Keras framework. In view of hardware limitations, the DNN is trained with the minibatch technique: the batch size is set to 128, each training iteration runs 1000 batches, and N iterations are trained in total. Each batch randomly selects 128 voice MFCC feature samples from the full data set, generates the model output, and then updates the model parameters by feedback according to the loss function, completing one batch; 1000 such batches complete the 1000 training steps of one iteration and yield that iteration's model. Under normal circumstances, the model with the optimal loss is selected within 50 iterations. The convolution kernel of the first layer is 9×20 with Nfilters set to 512; the loss function is the categorical cross-entropy (categorical_crossentropy) over all classes; and the optimizer is Adagrad.
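The minibatch schedule can be sketched framework-agnostically; `train_step` below stands in for one Keras parameter update (e.g. a `train_on_batch` call), and the batch and iteration counts are the ones given in the text:

```python
import numpy as np

def train_epochs(features, labels, train_step, n_iters=50,
                 batches_per_iter=1000, batch_size=128, seed=0):
    """Run the minibatch schedule described in the text.

    Each batch draws `batch_size` random samples from the full data set,
    `batches_per_iter` batches make up one iteration, and the iteration
    with the lowest mean loss over `n_iters` iterations is kept.
    `train_step(x, y)` must update the model and return the batch loss,
    a stand-in for a framework call such as Keras's train_on_batch.
    """
    rng = np.random.default_rng(seed)
    best_loss, best_iter = float("inf"), -1
    for it in range(n_iters):
        losses = []
        for _ in range(batches_per_iter):
            idx = rng.integers(0, len(features), size=batch_size)
            losses.append(train_step(features[idx], labels[idx]))
        mean_loss = float(np.mean(losses))
        if mean_loss < best_loss:
            best_loss, best_iter = mean_loss, it
    return best_loss, best_iter
```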
The speech processing module 202 is used to acquire a voice to be detected and frame it to obtain a 1000×20 matrix.
In the present embodiment, the speech processing module 202 is specifically used to: after framing the voice to be detected, extract 1000 frames and compute a 20-dimensional MFCC feature for each frame; and generate the 1000×20 matrix from the 20-dimensional MFCCs of the 1000 frames.
In the present embodiment, the framing operation on the voice to be detected is the same as the processing of the training voice described above: for recordings shorter than 1000 frames but longer than 100 frames, all-zero frames are spliced at the end; recordings shorter than 100 frames are discarded directly, on the assumption that no one is speaking. The computation of MFCC features is a conventional algorithm and is not repeated in the present invention.
The matrix input module 203 is used to input the 1000×20 matrix into the optimal deep neural network model.
In the present embodiment, the input layer of the optimal deep neural network model (DNN) obtained above takes a matrix input, so the 1000×20 matrix obtained by the speech processing module 202 can be input directly into the optimal deep neural network model.
The matrix computing module 204 is used to compute on the 1000×20 matrix with the optimal deep neural network model to obtain a 1×4 output vector, the 1×4 output vector representing four voice classes.
Specifically, in the first layer of the DNN model, the matrix computing module 204 convolves the 1000×20 input features with convolution kernels; the purpose of this layer is to project features of neighbouring frames, with Nfilters controlling the number of filters so that N channel features are obtained after convolution. In the second to fourth layers, convolution is carried out with 1×1 kernels and the LeakyReLU activation function is applied; the effect of these 1×1 kernels is to allow interaction between channels, so that the model learns more intra-frame and inter-frame features. In the fifth layer, pooling extracts the maximum value over 2×2 kernel ranges (2×2 max-pooling, with the default stride 1×1); this layer selects certain upper-layer nodes and reduces the model parameters, making over-fitting less likely. The sixth layer flattens the previous layer's output nodes to obtain a 1×P feature vector. The seventh layer reduces the dimensionality of the sixth-layer output to obtain the output Out7; the seventh layer is a linear layer, and with Out7 as input, the softmax activation function produces a 1×4 vector, i.e., four values, as the detection result.
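The channel-mixing role of the 1×1 convolutions can be seen in a small NumPy sketch (the weights here are illustrative, and the LeakyReLU slope is an assumption; the text does not give it):

```python
import numpy as np

def conv1x1_leaky_relu(x, w, alpha=0.1):
    """Apply one of the 1x1 convolution layers described above, in NumPy.

    x: (frames, width, in_channels) feature map from the previous layer.
    w: (in_channels, out_channels) weight matrix.  A 1x1 kernel touches
    no neighbouring positions, so the convolution reduces to a
    per-position channel mix, which is exactly the cross-channel
    interaction the text attributes to these layers.
    """
    y = x @ w                             # channel mixing at every position
    return np.where(y > 0, y, alpha * y)  # LeakyReLU activation
```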
The judgment module 205 is used to select the class with the largest value in the 1×4 output vector as the class of the voice to be detected, where the values of the 1×4 output vector lie in the range 0 to 1.
In the present embodiment, the 1×4 output vector expresses, as four fractions in the range 0-1, the probability of belonging to each class: the probabilities of real speech and of type-one, type-two, and type-three forgery. The class with the largest of the four probabilities is the class of the input voice; that is, the output values intuitively and effectively detect whether the voice to be detected is live voice, i.e., real speech.
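The judgment step can be sketched as a softmax over the network's four output scores followed by an argmax (the class-name strings are illustrative; the text only numbers the classes):

```python
import numpy as np

# Illustrative names for genuine speech and forgery types one to three.
CLASSES = ["genuine", "music forgery", "replay forgery", "synthetic forgery"]

def classify(scores):
    """Map the network's final 1x4 scores to a detection result.

    softmax turns the four scores into probabilities in the 0-1 range
    that sum to one; the class with the largest probability is chosen,
    matching the judgment rule described above.
    """
    e = np.exp(scores - np.max(scores))   # numerically stable softmax
    probs = e / e.sum()
    return CLASSES[int(np.argmax(probs))], probs
```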
Through the above program modules 201-205, the server proposed by the present invention trains a deep neural network model to obtain an optimal deep neural network model; acquires a voice to be detected and frames it to obtain a 1000×20 matrix; inputs the 1000×20 matrix into the optimal deep neural network model; computes on the 1000×20 matrix with the model to obtain a 1×4 output vector representing four voice classes; and selects the class with the largest value in the 1×4 output vector as the class of the voice to be detected. In this way, before a voice is used for an application, it can be quickly detected whether the voice was directly produced by the user or maliciously forged by someone else, providing a higher level of security for voice control and promoting the development of speech recognition technology.
In addition, the present invention also proposes a voice liveness detection method based on deep learning.
Fig. 3 is a flow diagram of a first embodiment of the voice liveness detection method based on deep learning according to the present invention. In this embodiment, the execution order of the steps in the flowchart shown in Fig. 3 may change according to different requirements, and certain steps may be omitted.
Step S301: train a deep neural network model to obtain an optimal deep neural network model.
Specifically, the above step includes: framing the training voice, with every 1000 frames as one sample; performing class labeling on each sample; and using the labeled samples as training samples for the deep neural network model.
In this embodiment, the purpose of truncating to 1000 frames is to give the model a fixed-length input: recordings of different lengths would produce differently distributed MFCC (Mel-Frequency Cepstral Coefficients) features, and a non-fixed input feature size easily causes inaccurate model recognition. For recordings shorter than 1000 frames but longer than 100 frames, all-zero frames are spliced onto the end; recordings shorter than 100 frames are discarded outright, on the assumption that no one is speaking. Taking every 1000 frames of each recording as a training sample lets the model learn the sound characteristics of each time period of such voices, which is more robust than training on only a single 1000-frame segment of a particular recording.
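As a minimal sketch of the padding and discard rules above (the function and constant names are ours, and a 20-dimensional feature row per frame is assumed):

```python
import numpy as np
from typing import Optional

FRAMES = 1000     # fixed sample length: every sample is 1000 frames
MIN_FRAMES = 100  # shorter recordings are assumed to contain no speech
N_FEATS = 20      # 20-dimensional MFCC feature vector per frame

def to_fixed_length(feats: np.ndarray) -> Optional[np.ndarray]:
    """Pad a (T, 20) feature matrix with all-zero frames up to 1000 frames,
    keep the first 1000 frames of longer recordings (for training, a long
    recording can be split into successive 1000-frame samples), and
    discard recordings shorter than 100 frames."""
    t = feats.shape[0]
    if t < MIN_FRAMES:
        return None                              # cast out: no one speaking
    if t >= FRAMES:
        return feats[:FRAMES]                    # one 1000-frame sample
    pad = np.zeros((FRAMES - t, feats.shape[1]))
    return np.vstack([feats, pad])               # splice zero frames behind
```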
In the training stage, each input recording to be identified is tagged: the genuine class as [1 0 0 0], class-one forgery as [0 1 0 0], class-two forgery as [0 0 1 0], and class-three forgery as [0 0 0 1]. Specifically, the genuine class is, as the name suggests, truly human speech. Forged voice is divided into three kinds: class-one forgery is music, class-two forgery is recording replay, and class-three forgery is technical voice forgery. Class-one forgery refers to music being fed into voiceprint recognition: because music contains rich acoustic components, it can complete voice registration and verification normally, yet it contains no speaker information, so it is not a target recording for voiceprint recognition. Class-two forgery is mainly the simple replay of a recording, for example recording the target person speaking (or music) with a voice recorder or mobile phone and then replaying that voice directly into the input of the voiceprint recognition system. Class-three forgery mainly refers to forging the target person's speech with speech synthesis or voice conversion technology: speech-synthesis forgery generally collects a certain amount of the target person's voice data and uses synthesis to generate the target person's voice for specified text content, while voice-conversion forgery directly alters the spectrum of an original recording. Because this kind of forgery involves a large amount of speech processing technology, it is called the most versatile forgery.
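The four labels above can be sketched as one-hot targets (an illustration with names of our choosing; note that some printings of the text list the genuine label as all zeros, which cannot serve as a one-hot target for a softmax classifier, so [1 0 0 0] is assumed here):

```python
import numpy as np

# One-hot training targets for the four classes described above.
LABELS = {
    "genuine":               np.array([1.0, 0.0, 0.0, 0.0]),
    "forgery-1 (music)":     np.array([0.0, 1.0, 0.0, 0.0]),
    "forgery-2 (replay)":    np.array([0.0, 0.0, 1.0, 0.0]),
    "forgery-3 (synthesis)": np.array([0.0, 0.0, 0.0, 1.0]),
}
```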
As for how the DNN (deep neural network) is trained with the training samples, it can be summarized as follows. Model training is carried out with the open-source Keras framework. In view of hardware limitations, the DNN is trained with a minibatch technique: each batch size is set to 128, each iteration trains 1000 batches, and N iterations are trained in total. Each batch randomly selects 128 voice MFCC feature samples from the full data set, generates the model output, and then updates the model parameters by back-propagation according to the loss function, completing one batch; generating 1000 such batch steps completes the 1000 training steps of one iteration and yields that iteration's model output. Under normal circumstances, the model with the best loss over 50 iterations is selected. The first layer's convolution kernel is 9*20 with Nfilters set to 512; the loss function is the categorical cross-entropy over all classes (categorical_crossentropy); the optimizer is Adagrad.
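Under the stated configuration, the seven-layer network and its compilation might be sketched in Keras roughly as follows. This is a hedged illustration, not the authors' code: layer widths beyond the first are not fully specified in the text, and a 2*1 pool stands in for the described 2*2 pool, whose width does not fit the single-column feature map left after the first 9*20 convolution.

```python
from tensorflow import keras
from tensorflow.keras import layers

# Input: one 1000x20 MFCC matrix, treated as a single-channel image.
model = keras.Sequential([
    layers.Input(shape=(1000, 20, 1)),
    # Layer 1: 9x20 kernels, Nfilters=512 -> projects each 9-frame
    # context of all 20 MFCC coefficients into 512 channels.
    layers.Conv2D(512, (9, 20)),
    # Layers 2-4: 1x1 convolutions with LeakyReLU let channels interact.
    layers.Conv2D(512, (1, 1)), layers.LeakyReLU(),
    layers.Conv2D(512, (1, 1)), layers.LeakyReLU(),
    layers.Conv2D(512, (1, 1)), layers.LeakyReLU(),
    # Layer 5: max pooling with default stride 1 (2x1 assumed, see above).
    layers.MaxPooling2D(pool_size=(2, 1), strides=(1, 1)),
    # Layer 6: flatten to a 1xP feature vector.
    layers.Flatten(),
    # Layer 7: linear reduction to 4 classes with softmax output.
    layers.Dense(4, activation="softmax"),
])
model.compile(optimizer="adagrad", loss="categorical_crossentropy")
```

Training with `model.fit(x, y, batch_size=128)` on the one-hot-labeled 1000*20 samples then matches the minibatch scheme described above.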
Step S302: obtain a voice to be detected and frame it to obtain a 1000*20-dimensional matrix.
In this embodiment, this step specifically includes: after framing the voice to be detected, extracting 1000 frames and computing 20-dimensional MFCC features for each frame; and generating the 1000*20-dimensional matrix from the 20-dimensional MFCCs of the 1000 frames.
In this embodiment, the framing of the voice to be detected is the same as the processing of the training voice described above: for recordings shorter than 1000 frames but longer than 100 frames, all-zero frames are spliced onto the end, and recordings shorter than 100 frames are discarded outright, on the assumption that no one is speaking. The computation of MFCC features is a conventional algorithm and is not repeated here.
Step S303: input the 1000*20-dimensional matrix into the optimal deep neural network model.
In this embodiment, the input layer of the optimal deep neural network model DNN obtained above takes a matrix as input, so the 1000*20-dimensional matrix obtained in step S302 can be input directly into the optimal deep neural network model.
Step S304: use the optimal deep neural network model to compute on the 1000*20-dimensional matrix a 1*4-dimensional output vector, where the 1*4-dimensional output vector represents four voice classes.
Specifically, in the first layer of the DNN model, the 1000*20-dimensional input features are convolved with the convolution kernels; the purpose of this layer is consecutive-frame feature projection, with Nfilters (the number of filters) controlling the convolution so that N channel features are obtained after each kernel has been applied. In the second to fourth layers, convolution is performed with 1*1 kernels and the LeakyReLU activation function is used; the effect of these 1*1 kernels is to let the channels connect and interact, so that the model learns more intra-frame and inter-frame features. In the fifth layer, pooling is performed: the maximum value is extracted over 2*2 kernel ranges (2*2 MaxPooling, with the step size left at the default 1*1); this layer selects certain upper-layer nodes, reducing the model parameters and making overfitting less likely. The sixth layer flattens: flattening the previous layer's output nodes yields a 1*P-dimensional feature. The seventh layer performs dimensionality reduction on the sixth layer to obtain the output Out7, where the seventh layer is a linear layer; taking Out7 as input and applying the softmax activation function produces the 1*4 output vector, i.e. four values, as the detection result.
Step S305: select the class with the largest value in the 1*4-dimensional output vector as the class of the voice to be detected, where each entry of the 1*4-dimensional output vector is a value in the range 0 to 1.
In this embodiment, the 1*4-dimensional output vector represents, by four fractions in the 0-1 range, the probability that the input belongs to each class, i.e. the probabilities of genuine speech and of class-one, class-two, and class-three forgery. The class with the largest of these four probabilities is the class of the input voice, so the output values directly and effectively indicate whether the voice to be detected is live, genuine speech.
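The judgment in step S305 reduces to an argmax over the four probabilities; a minimal sketch (class names are ours) might look like:

```python
import numpy as np

CLASSES = ["genuine", "forgery-1 (music)",
           "forgery-2 (replay)", "forgery-3 (synthesis/conversion)"]

def classify(output_vec: np.ndarray) -> str:
    """Map the model's 1*4 softmax output (four values in [0, 1]) to the
    class with the largest probability; only 'genuine' counts as live."""
    assert output_vec.shape == (4,)
    return CLASSES[int(np.argmax(output_vec))]
```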
Through the above steps S301-S305, the voice liveness detection method based on deep learning proposed by the present invention first trains a deep neural network model to obtain an optimal deep neural network model; second, obtains a voice to be detected and frames it to obtain a 1000*20-dimensional matrix; third, inputs the 1000*20-dimensional matrix into the optimal deep neural network model; then uses the optimal deep neural network model to compute on the 1000*20-dimensional matrix a 1*4-dimensional output vector representing four voice classes; and finally selects the class with the largest value in the 1*4-dimensional output vector as the class of the voice to be detected. In this way, before a voice is used in an application, it can quickly be determined whether the voice was produced directly by the user or is a voice maliciously forged by someone else, providing a higher level of security for voice control and promoting the development of speech recognition technology.
The present invention also provides another embodiment, namely a storage medium storing a voice liveness detection program based on deep learning, the program being executable by at least one processor so as to cause the at least one processor to perform the steps of the voice liveness detection method based on deep learning described above.
The above embodiment numbers of the present invention are for description only and do not represent the relative merits of the embodiments.
Through the above description of the embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by software plus the necessary general-purpose hardware platform, and of course also by hardware, but in many cases the former is the better implementation. Based on this understanding, the technical solution of the present invention, in essence or in the part contributing to the prior art, can be embodied in the form of a software product stored in a storage medium (such as ROM/RAM, magnetic disk, or optical disc) and including instructions for causing a terminal device (which may be a mobile phone, computer, server, air conditioner, network device, etc.) to execute the methods described in the embodiments of the present invention.
The above are only preferred embodiments of the present invention and are not intended to limit the scope of the invention. Any equivalent structural or flow transformation made using the contents of the specification and drawings of the present invention, applied directly or indirectly in other related technical fields, is likewise included within the scope of patent protection of the present invention.
Claims (10)
1. A voice liveness detection method based on deep learning, applied to a server, characterized in that the method comprises the following steps:
training a deep neural network model to obtain an optimal deep neural network model;
obtaining a voice to be detected and framing the voice to be detected to obtain a 1000*20-dimensional matrix;
inputting the 1000*20-dimensional matrix into the optimal deep neural network model;
computing on the 1000*20-dimensional matrix with the optimal deep neural network model to obtain a 1*4-dimensional output vector, the 1*4-dimensional output vector representing four voice classes; and
selecting the class with the largest value in the 1*4-dimensional output vector as the class of the voice to be detected.
2. The voice liveness detection method based on deep learning according to claim 1, characterized in that the step of training a deep neural network model to obtain an optimal deep neural network model comprises:
framing a training voice, with every 1000 frames as one sample;
performing class labeling on each sample; and
using the labeled samples as training samples for the deep neural network model.
3. The voice liveness detection method based on deep learning according to claim 1, characterized in that the step of obtaining a voice to be detected and framing the voice to be detected to obtain a 1000*20-dimensional matrix specifically comprises:
after framing the voice to be detected, extracting 1000 frames and computing 20-dimensional MFCC features for each frame; and
generating the 1000*20-dimensional matrix from the 20-dimensional MFCCs of the 1000 frames.
4. The voice liveness detection method based on deep learning according to claim 1, characterized in that the computing on the 1000*20-dimensional matrix with the optimal deep neural network model comprises:
performing convolution on the 1000*20-dimensional input features with convolution kernels in a first layer;
performing convolution with 1*1 convolution kernels in second to fourth layers, using a LeakyReLU activation function;
performing pooling in a fifth layer, extracting maximum values over 2*2 kernel ranges;
flattening in a sixth layer;
performing dimensionality reduction on the sixth layer in a seventh layer to obtain an output Out7; and
taking Out7 as input in the seventh layer and applying a softmax activation function to output a 1*4 vector as the detection result.
5. The voice liveness detection method based on deep learning according to any one of claims 1-4, characterized in that the values of the 1*4-dimensional output vector are in the range 0 to 1.
6. A server, characterized in that the server comprises a memory and a processor, the memory storing a voice liveness detection program based on deep learning that is executable on the processor, wherein the voice liveness detection program based on deep learning, when executed by the processor, implements the following steps:
training a deep neural network model to obtain an optimal deep neural network model;
obtaining a voice to be detected and framing the voice to be detected to obtain a 1000*20-dimensional matrix;
inputting the 1000*20-dimensional matrix into the optimal deep neural network model;
computing on the 1000*20-dimensional matrix with the optimal deep neural network model to obtain a 1*4-dimensional output vector, the 1*4-dimensional output vector representing four voice classes; and
selecting the class with the largest value in the 1*4-dimensional output vector as the class of the voice to be detected.
7. The server according to claim 6, characterized in that when the voice liveness detection program based on deep learning is executed by the processor, the step of training a deep neural network model to obtain an optimal deep neural network model comprises:
framing a training voice, with every 1000 frames as one sample;
performing class labeling on each sample; and
using the labeled samples as training samples for the deep neural network model.
8. The server according to claim 6, characterized in that when the voice liveness detection program based on deep learning is executed by the processor, the step of obtaining a voice to be detected and framing the voice to be detected to obtain a 1000*20-dimensional matrix specifically comprises:
after framing the voice to be detected, extracting 1000 frames and computing 20-dimensional MFCC features for each frame; and
generating the 1000*20-dimensional matrix from the 20-dimensional MFCCs of the 1000 frames.
9. The server according to any one of claims 6-8, characterized in that when the voice liveness detection program based on deep learning is executed by the processor, the step of computing on the 1000*20-dimensional matrix with the optimal deep neural network model to obtain a 1*4-dimensional output vector comprises:
performing convolution on the 1000*20-dimensional input features with convolution kernels in a first layer;
performing convolution with 1*1 convolution kernels in second to fourth layers, using a LeakyReLU activation function;
performing pooling in a fifth layer, extracting maximum values over 2*2 kernel ranges;
flattening in a sixth layer;
performing dimensionality reduction on the sixth layer in a seventh layer to obtain an output Out7; and
taking Out7 as input in the seventh layer and applying a softmax activation function to output a 1*4 vector as the detection result.
10. A storage medium, the storage medium storing a voice liveness detection program based on deep learning, wherein the voice liveness detection program based on deep learning is executable by at least one processor so as to cause the at least one processor to perform the steps of the voice liveness detection method based on deep learning according to any one of claims 1-5.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810029892.6A CN108281158A (en) | 2018-01-12 | 2018-01-12 | Voice biopsy method, server and storage medium based on deep learning |
PCT/CN2018/089203 WO2019136909A1 (en) | 2018-01-12 | 2018-05-31 | Voice living-body detection method based on deep learning, server and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810029892.6A CN108281158A (en) | 2018-01-12 | 2018-01-12 | Voice biopsy method, server and storage medium based on deep learning |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108281158A true CN108281158A (en) | 2018-07-13 |
Family
ID=62803422
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810029892.6A Pending CN108281158A (en) | 2018-01-12 | 2018-01-12 | Voice biopsy method, server and storage medium based on deep learning |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN108281158A (en) |
WO (1) | WO2019136909A1 (en) |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109036459A (en) * | 2018-08-22 | 2018-12-18 | 百度在线网络技术(北京)有限公司 | Sound end detecting method, device, computer equipment, computer storage medium |
CN109346089A (en) * | 2018-09-27 | 2019-02-15 | 深圳市声扬科技有限公司 | Living body identity identifying method, device, computer equipment and readable storage medium storing program for executing |
CN109801638A (en) * | 2019-01-24 | 2019-05-24 | 平安科技(深圳)有限公司 | Speech verification method, apparatus, computer equipment and storage medium |
CN111933154A (en) * | 2020-07-16 | 2020-11-13 | 平安科技(深圳)有限公司 | Method and device for identifying counterfeit voice and computer readable storage medium |
CN112489677A (en) * | 2020-11-20 | 2021-03-12 | 平安科技(深圳)有限公司 | Voice endpoint detection method, device, equipment and medium based on neural network |
CN112735381A (en) * | 2020-12-29 | 2021-04-30 | 四川虹微技术有限公司 | Model updating method and device |
CN112735431A (en) * | 2020-12-29 | 2021-04-30 | 三星电子(中国)研发中心 | Model training method and device and artificial intelligence dialogue recognition method and device |
CN115280410A (en) * | 2020-01-13 | 2022-11-01 | 密歇根大学董事会 | Safe automatic speaker verification system |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110504027A (en) * | 2019-08-20 | 2019-11-26 | 东北大学 | A kind of X-Ray rabat pneumonia intelligent diagnosis system and method based on deep learning |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102436810A (en) * | 2011-10-26 | 2012-05-02 | 华南理工大学 | Record replay attack detection method and system based on channel mode noise |
CN105869630A (en) * | 2016-06-27 | 2016-08-17 | 上海交通大学 | Method and system for detecting voice spoofing attack of speakers on basis of deep learning |
CN106409298A (en) * | 2016-09-30 | 2017-02-15 | 广东技术师范学院 | Identification method of sound rerecording attack |
CN106531172A (en) * | 2016-11-23 | 2017-03-22 | 湖北大学 | Speaker voice playback identification method and system based on environmental noise change detection |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9697833B2 (en) * | 2015-08-25 | 2017-07-04 | Nuance Communications, Inc. | Audio-visual speech recognition with scattering operators |
CN107545248B (en) * | 2017-08-24 | 2021-04-02 | 北京小米移动软件有限公司 | Biological characteristic living body detection method, device, equipment and storage medium |
2018
- 2018-01-12 CN CN201810029892.6A patent/CN108281158A/en active Pending
- 2018-05-31 WO PCT/CN2018/089203 patent/WO2019136909A1/en active Application Filing
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102436810A (en) * | 2011-10-26 | 2012-05-02 | 华南理工大学 | Record replay attack detection method and system based on channel mode noise |
CN105869630A (en) * | 2016-06-27 | 2016-08-17 | 上海交通大学 | Method and system for detecting voice spoofing attack of speakers on basis of deep learning |
CN106409298A (en) * | 2016-09-30 | 2017-02-15 | 广东技术师范学院 | Identification method of sound rerecording attack |
CN106531172A (en) * | 2016-11-23 | 2017-03-22 | 湖北大学 | Speaker voice playback identification method and system based on environmental noise change detection |
Non-Patent Citations (4)
Title |
---|
CHUNLEI ZHANG 等: "An Investigation of Deep-Learning Frameworks for Speaker Verification Antispoofing", 《IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING》 * |
HUIXIN LIANG 等: "RECOGNITION OF SPOOFED VOICE USING CONVOLUTIONAL NEURAL NETWORKS", 《2017 IEEE GLOBALSIP》 * |
XIAOHAI TIAN 等: "Spoofing Speech Detection using Temporal Convolutional Neural Network", 《2016 APSIPA》 * |
CHEN MIN: "Introduction to Cognitive Computing" (《认知计算导论》), 30 April 2017, Huazhong University of Science and Technology Press *
Cited By (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109036459A (en) * | 2018-08-22 | 2018-12-18 | 百度在线网络技术(北京)有限公司 | Sound end detecting method, device, computer equipment, computer storage medium |
CN109036459B (en) * | 2018-08-22 | 2019-12-27 | 百度在线网络技术(北京)有限公司 | Voice endpoint detection method and device, computer equipment and computer storage medium |
CN109346089A (en) * | 2018-09-27 | 2019-02-15 | 深圳市声扬科技有限公司 | Living body identity identifying method, device, computer equipment and readable storage medium storing program for executing |
CN109801638A (en) * | 2019-01-24 | 2019-05-24 | 平安科技(深圳)有限公司 | Speech verification method, apparatus, computer equipment and storage medium |
CN109801638B (en) * | 2019-01-24 | 2023-10-13 | 平安科技(深圳)有限公司 | Voice verification method, device, computer equipment and storage medium |
CN115280410A (en) * | 2020-01-13 | 2022-11-01 | 密歇根大学董事会 | Safe automatic speaker verification system |
EP4091164A4 (en) * | 2020-01-13 | 2024-01-24 | Univ Michigan Regents | Secure automatic speaker verification system |
CN111933154A (en) * | 2020-07-16 | 2020-11-13 | 平安科技(深圳)有限公司 | Method and device for identifying counterfeit voice and computer readable storage medium |
CN111933154B (en) * | 2020-07-16 | 2024-02-13 | 平安科技(深圳)有限公司 | Method, equipment and computer readable storage medium for recognizing fake voice |
CN112489677A (en) * | 2020-11-20 | 2021-03-12 | 平安科技(深圳)有限公司 | Voice endpoint detection method, device, equipment and medium based on neural network |
CN112489677B (en) * | 2020-11-20 | 2023-09-22 | 平安科技(深圳)有限公司 | Voice endpoint detection method, device, equipment and medium based on neural network |
CN112735381A (en) * | 2020-12-29 | 2021-04-30 | 四川虹微技术有限公司 | Model updating method and device |
CN112735431A (en) * | 2020-12-29 | 2021-04-30 | 三星电子(中国)研发中心 | Model training method and device and artificial intelligence dialogue recognition method and device |
CN112735381B (en) * | 2020-12-29 | 2022-09-27 | 四川虹微技术有限公司 | Model updating method and device |
CN112735431B (en) * | 2020-12-29 | 2023-12-22 | 三星电子(中国)研发中心 | Model training method and device and artificial intelligent dialogue recognition method and device |
Also Published As
Publication number | Publication date |
---|---|
WO2019136909A1 (en) | 2019-07-18 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108281158A (en) | Voice biopsy method, server and storage medium based on deep learning | |
CN107527620A (en) | Electronic installation, the method for authentication and computer-readable recording medium | |
CN107993071A (en) | Electronic device, auth method and storage medium based on vocal print | |
CN107564513A (en) | Audio recognition method and device | |
CN107785015A (en) | A kind of audio recognition method and device | |
CN109409971A (en) | Abnormal order processing method and device | |
CN106104674A (en) | Mixing voice identification | |
CN108154371A (en) | Electronic device, the method for authentication and storage medium | |
CN110675862A (en) | Corpus acquisition method, electronic device and storage medium | |
CN109308912A (en) | Music style recognition methods, device, computer equipment and storage medium | |
CN108986798A (en) | Processing method, device and the equipment of voice data | |
CN108694952A (en) | Electronic device, the method for authentication and storage medium | |
CN111508524A (en) | Method and system for identifying voice source equipment | |
CN111161713A (en) | Voice gender identification method and device and computing equipment | |
CN109658943A (en) | A kind of detection method of audio-frequency noise, device, storage medium and mobile terminal | |
CN107229691A (en) | A kind of method and apparatus for being used to provide social object | |
Li et al. | Anti-forensics of audio source identification using generative adversarial network | |
CN108650266A (en) | Server, the method for voice print verification and storage medium | |
Wang et al. | Robust speaker identification of iot based on stacked sparse denoising auto-encoders | |
Dixit et al. | Review of audio deepfake detection techniques: Issues and prospects | |
US11630950B2 (en) | Prediction of media success from plot summaries using machine learning model | |
CN116152938A (en) | Method, device and equipment for training identity recognition model and transferring electronic resources | |
CN111933154B (en) | Method, equipment and computer readable storage medium for recognizing fake voice | |
CN110415708A (en) | Method for identifying speaker, device, equipment and storage medium neural network based | |
CN116564269A (en) | Voice data processing method and device, electronic equipment and readable storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
Application publication date: 20180713 |