CN108922513A

CN108922513A - Speech differentiation method, apparatus, computer equipment and storage medium

Info

Publication number: CN108922513A
Application number: CN201810561788.1A
Authority: CN
Inventors: 涂宏
Original assignee: Ping An Technology Shenzhen Co Ltd
Current assignee: Ping An Technology Shenzhen Co Ltd
Priority date: 2018-06-04
Filing date: 2018-06-04
Publication date: 2018-11-30
Anticipated expiration: 2038-06-04
Also published as: WO2019232846A1; CN108922513B

Abstract

The invention discloses a kind of speech differentiation method, apparatus, computer equipment and storage mediums.The speech differentiation method includes：Based on the original voice data to be distinguished of voice activity detection algorithms processing, target voice data to be distinguished is obtained；Based on target voice data to be distinguished, corresponding ASR phonetic feature is obtained；The ASR phonetic feature is input in preparatory trained ASR-RNN model and is distinguished, target is obtained and distinguishes result.Target voice and interference voice can be distinguished well using the speech differentiation method still can carry out accurate speech differentiation in the very big situation of voice data noise jamming.

Description

Speech differentiation method, apparatus, computer equipment and storage medium

Technical field

The present invention relates to speech processes field more particularly to a kind of speech differentiation method, apparatus, computer equipment and storage Medium.

Background technique

Speech differentiation, which refers to, carries out mute screening to the voice of input, only retains the voice segments more meaningful to identification (i.e. Target voice).There is very big deficiency in current speech differentiation method, especially in the presence of noise, with noise Become larger, the difficulty for carrying out speech differentiation is bigger, can not accurately distinguish out target voice and interference voice, lead to speech differentiation Effect is undesirable.

Summary of the invention

The embodiment of the present invention provides a kind of speech differentiation method, apparatus, computer equipment and storage medium, with solve into The undesirable problem of row speech differentiation effect.

The embodiment of the present invention provides a kind of speech differentiation method, including：

Based on the original voice data to be distinguished of voice activity detection algorithms processing, target voice data to be distinguished is obtained；

Based on target voice data to be distinguished, corresponding ASR phonetic feature is obtained；

The ASR phonetic feature is input in preparatory trained ASR-RNN model and is distinguished, target is obtained and distinguishes As a result.

The embodiment of the present invention provides a kind of speech differentiation device, including：

Target voice data to be distinguished obtains module, for handling original voice to be distinguished based on voice activity detection algorithms Data obtain target voice data to be distinguished；

Phonetic feature obtains module, and for being based on target voice data to be distinguished, it is special to obtain corresponding ASR voice Sign；

Target distinguishes result and obtains module, for the ASR phonetic feature to be input to preparatory trained ASR-RNN mould It is distinguished in type, obtains target and distinguish result.

The embodiment of the present invention provides a kind of computer equipment, including memory, processor and is stored in the memory In and the computer program that can run on the processor, the processor realize institute's predicate when executing the computer program The step of sound differentiating method.

The embodiment of the present invention provides a kind of computer readable storage medium, and the computer-readable recording medium storage has meter The step of calculation machine program, the computer program realizes speech differentiation method when being executed by processor.

In speech differentiation method, apparatus, computer equipment and storage medium provided by the embodiment of the present invention, it is primarily based on The original voice data to be distinguished of voice activity detection algorithms processing, obtains target voice data to be distinguished, original language to be distinguished Sound data are first distinguished once by voice activity detection algorithms, obtain the smaller target of range voice data to be distinguished, Neng Gouchu Step effectively removes non-voice.It is then based on target voice data to be distinguished, obtains corresponding ASR phonetic feature, is subsequent basis The ASR phonetic feature carries out corresponding ASR-RNN model identification and provides technical foundation.Finally ASR phonetic feature is input to pre- It is first distinguished in trained ASR-RNN model, obtains target and distinguish as a result, the ASR-RNN model is special according to ASR voice The identification model for being used to accurately distinguish voice of the feature specialized training of sign and voice in timing, can be from target language to be distinguished Target voice and interference voice are correctly distinguished in sound data, improve the accuracy of speech differentiation.

Detailed description of the invention

In order to illustrate the technical solution of the embodiments of the present invention more clearly, below by institute in the description to the embodiment of the present invention Attached drawing to be used is needed to be briefly described, it should be apparent that, the accompanying drawings in the following description is only some implementations of the invention Example, for those of ordinary skill in the art, without any creative labor, can also be according to these attached drawings Obtain other attached drawings.

Fig. 1 is an applied environment figure of speech differentiation method in one embodiment of the invention；

Fig. 2 is a flow chart of speech differentiation method in one embodiment of the invention；

Fig. 3 is a specific flow chart of step S10 in Fig. 2；

Fig. 4 is a specific flow chart of step S20 in Fig. 2；

Fig. 5 is a specific flow chart of step S21 in Fig. 4；

Fig. 6 is a specific flow chart of step S24 in Fig. 4；

Fig. 7 is the specific flow chart in Fig. 2 before step S30；

Fig. 8 is a schematic diagram of speech differentiation device in one embodiment of the invention；

Fig. 9 is a schematic diagram of computer equipment in one embodiment of the invention.

Specific embodiment

Following will be combined with the drawings in the embodiments of the present invention, and technical solution in the embodiment of the present invention carries out clear, complete Site preparation description, it is clear that described embodiments are some of the embodiments of the present invention, instead of all the embodiments.Based on this hair Embodiment in bright, every other implementation obtained by those of ordinary skill in the art without making creative efforts Example, shall fall within the protection scope of the present invention.

Fig. 1 shows the application environment of speech differentiation method provided in an embodiment of the present invention.The application of the audio recognition method Environment includes server-side and client, wherein is attached between server-side and client by network.Client be can with The equipment that family carries out human-computer interaction, the including but not limited to equipment such as computer, smart phone and plate.Server-side specifically can be with solely The server cluster of vertical server or multiple servers composition is realized.Speech differentiation method provided in an embodiment of the present invention is answered For server-side.

As shown in Fig. 2, Fig. 2 shows a flow chart of speech differentiation method in the present embodiment, which includes Following steps：

S10：Based on the original voice data to be distinguished of voice activity detection algorithms processing, target voice number to be distinguished is obtained According to.

Wherein, voice activity detection (Voice Activity Detection, hereinafter referred to as VAD), it is therefore an objective to from sound The prolonged mute phase is identified and eliminated in signal stream, and traffic resource is saved in the case where not reducing quality of service to reach Effect can save valuable bandwidth resources, reduce time delay end to end, promote user experience.Voice activity detection algorithms (vad algorithm) i.e. voice activity detection when the algorithm that specifically uses, the algorithm can there are many.It is to be appreciated that VAD can be answered Used in speech differentiation, target voice and interference voice can be distinguished.Target voice refers to that vocal print consecutive variations are bright in voice data Aobvious phonological component, interference voice can be in voice data since silence is without the phonological component of pronunciation, be also possible to ring Border noise.Original voice data to be distinguished is the voice data to be distinguished that most original is got, the original voice data to be distinguished Refer to that vad algorithm to be employed carries out the preliminary voice data for distinguishing processing.Target voice data to be distinguished refers to living by voice Detection algorithm is moved to original after distinguishing voice data and handling, the voice data for being used to carry out speech differentiation of acquisition.

In the present embodiment, original voice data to be distinguished is handled using vad algorithm, from original voice number to be distinguished Go out target voice and interference voice according to middle preliminary screening, and the target voice part that preliminary screening is gone out is as target language to be distinguished Sound data.It is to be appreciated that the interference voice gone out for preliminary screening need not distinguish again, to improve the effect of speech differentiation Rate.And from the original content for still having interference voice wait distinguish the target voice that preliminary screening in voice data goes out, especially when When the original noise wait distinguish voice data is bigger, if the interference voice (such as noise) that the target voice that preliminary screening goes out mixes It is more, it is clear that voice can not effectively be distinguished using vad algorithm at this time, therefore interference language should be mixed by what preliminary screening went out The target voice of sound carries out more accurate distinguish as target voice data to be distinguished, with the target voice gone out to preliminary screening. Preliminary speech differentiation is carried out to original voice data to be distinguished by using vad algorithm, can according to preliminary screening it is original to It distinguishes voice data to be repartitioned, while removing a large amount of interference voice, be conducive to subsequent further speech differentiation.

In a specific embodiment, as shown in figure 3, in step S10, based on voice activity detection algorithms processing it is original to Voice data is distinguished, target voice data to be distinguished is obtained, includes the following steps：

S11：Original voice data to be distinguished is handled according to short-time energy characteristic value calculation formula, is obtained corresponding The original data to be distinguished that short-time energy characteristic value is greater than first threshold are retained, are determined as the first original by short-time energy characteristic value Begin to distinguish voice data, wherein short-time energy characteristic value calculation formula isN is voice frame length, when s (n) is Signal amplitude on domain, n are the time.

Wherein, it is corresponding in its time domain to describe a frame voice (frame generally takes 10-30ms) for short-time energy characteristic value Energy, the time (i.e. voice frame length) for being interpreted as a frame " in short-term " of the short-time energy.Due in short-term capable of for target voice Measure feature value, the short-time energy characteristic value compared to interference voice (mute) can be higher by very much, therefore in short-term can according to this Measure feature value come distinguish target voice and interference voice.

In the present embodiment, original voice data to be distinguished is handled according to short-time energy characteristic value calculation formula and (is needed in advance Make the processing of framing to original voice data to be distinguished), calculate and obtain the short-time energy of original each frame of voice data to be distinguished The short-time energy characteristic value of each frame is compared with pre-set first threshold, will be greater than the original of first threshold by characteristic value Begin voice data to be distinguished retains, and is determined as the first original differentiation voice data.The first threshold in short-term can for measuring Measure feature value belongs to target voice or interferes the cut off value of voice.In the present embodiment, according to short-time energy characteristic value and The comparison result of one threshold value, can tentatively be distinguished from the angle of short-time energy characteristic value obtain it is original wait distinguish in voice data Target voice, and effectively remove and original largely interfere voice in voice data wait distinguish.

S12：Original voice data to be distinguished is handled according to zero-crossing rate characteristic value calculation formula, obtains corresponding mistake The original voice data to be distinguished that zero-crossing rate characteristic value is less than second threshold is retained, it is original to be determined as second by zero rate characteristic value Distinguish voice data, wherein zero-crossing rate characteristic value calculation formula isN is voice frame length, s (n) It is the time for the signal amplitude n in time domain.

Wherein, zero-crossing rate characteristic value is the number for describing voice signal waveform in a frame voice and passing through horizontal axis (zero level). Due to the zero-crossing rate characteristic value of target voice, the zero-crossing rate characteristic value compared to interference voice can be much lower, therefore can basis The short-time energy characteristic value come distinguish target voice and interference voice.

In the present embodiment, original voice data to be distinguished is handled according to zero-crossing rate characteristic value calculation formula, calculates and obtains The zero-crossing rate characteristic value of original each frame of voice data to be distinguished, by the zero-crossing rate characteristic value of each frame and pre-set second threshold It is compared, the original voice data to be distinguished for being less than second threshold is retained, and be determined as the second original differentiation voice data. The second threshold is the cut off value for belonging to target voice or interference voice for measuring short-time energy characteristic value.The present embodiment In, according to the comparison result of zero-crossing rate characteristic value and second threshold, it can tentatively distinguish and obtain from the angle of zero-crossing rate characteristic value It is original wait distinguish the target voice in voice data, and effectively remove and original largely interfere voice in voice data wait distinguish.

S13：Using the first original differentiation voice data and the second original differentiation voice data as target voice number to be distinguished According to.

In the present embodiment, the first original differentiation voice data is from original according to the angle of short-time energy characteristic value wait distinguish It distinguishes and obtains in voice data, the second original differentiation voice data is from original according to the angle of zero-crossing rate characteristic value to area It distinguishes and obtains in point voice data.First original differentiation voice data and the second original differentiation voice data are respectively from differentiation The different angle of voice is set out, the two angles can distinguish voice well, therefore by the first original differentiation voice data Together with merging (merging in a manner of taking intersection) with the second original differentiation voice data, as target voice data to be distinguished.

Step S11-S13 can tentatively be effectively removed original wait distinguish most interference voice number in voice data According to reservation mixes the original voice data to be distinguished of target voice and small part interference voice (such as noise), and this is original Voice data to be distinguished can make effective preliminary voice to original voice data to be distinguished as target voice data to be distinguished It distinguishes.

S20：Based on target voice data to be distinguished, corresponding ASR phonetic feature is obtained.

Wherein, ASR (Automatic Speech Recognition, automatic speech recognition technology) is to turn voice data It is changed to the technology of computer-readable input, such as converts the shapes such as key, binary coding or character string for voice data Formula.The phonetic feature in target voice data to be distinguished can be extracted by ASR, the voice extracted is corresponding thereto ASR phonetic feature.It is to be appreciated that the voice data that script computer can not be directly read can be converted to computer by ASR The ASR phonetic feature that can be read, the ASR phonetic feature can be indicated by the way of vector.

It in the present embodiment, is handled using ASR voice data to be distinguished to target, it is special to obtain corresponding ASR voice Sign, the ASR phonetic feature can be well reflected the potential feature of target voice data to be distinguished, can be special according to ASR voice Voice data to be distinguished to target is levied to distinguish, for it is subsequent according to the corresponding ASR-RNN of ASR phonetic feature progress (RNN, Recurrent neural networks, Recognition with Recurrent Neural Network) the important technology premise of model identification offer.

In a specific embodiment, as shown in figure 4, in step S20, based on target voice data to be distinguished, phase is obtained Corresponding ASR phonetic feature, includes the following steps：

S21：Voice data to be distinguished to target pre-processes, and obtains pretreatment voice data.

In the present embodiment, voice data to be distinguished to target is pre-processed, and obtains corresponding pretreatment voice number According to.Voice data to be distinguished to target pre-processes the ASR voice spy that can preferably extract target voice data to be distinguished Sign so that the ASR phonetic feature extracted is more representative of target voice data to be distinguished, with the use ASR phonetic feature into Row speech differentiation.

In a specific embodiment, as shown in figure 5, in step S21, voice data to be distinguished to target is located in advance Reason obtains pretreatment voice data, includes the following steps：

S211：Voice data to be distinguished to target makees preemphasis processing, and the calculation formula of preemphasis processing is s'_n=s_n-a* s_n-1, wherein s_nFor the signal amplitude in time domain, s_n-1For with s_nThe signal amplitude of corresponding last moment, s'_nFor preemphasis Signal amplitude in time domain afterwards, a are pre emphasis factor, and the value range of a is 0.9<a<1.0.

Wherein, preemphasis is a kind of signal processing mode compensated in transmitting terminal to input signal high fdrequency component.With The increase of signal rate, signal be damaged in transmission process it is very big, in order to enable receiving end to obtain relatively good signal waveform, With regard to needing to compensate impaired signal.The thought of pre-emphasis technique is exactly the high frequency in the transmitting terminal enhancing signal of transmission line Ingredient enables receiving end to obtain preferable signal waveform to compensate excessive decaying of the high fdrequency component in transmission process.In advance Exacerbation does not have an impact to noise, therefore can effectively improve output signal-to-noise ratio.

In the present embodiment, voice data to be distinguished to target makees preemphasis processing, and the formula of preemphasis processing is s'_n= s_n-a*s_n-1, wherein s_nFor the signal amplitude in time domain, i.e. the amplitude (amplitude) of voice expressed in the time domain of voice data, s_n-1For with s_nThe signal amplitude of opposite last moment, s'_nFor the signal amplitude in time domain after preemphasis, a is pre emphasis factor, The value range of a is 0.9<a<1.0, take the effect of 0.97 preemphasis relatively good here.Being handled using the preemphasis can eliminate It is interfered caused by vocal cords and lip etc. in voiced process, it can be with the pent-up radio-frequency head of effective compensation target voice data to be distinguished Point, and the formant of target voice data high frequency to be distinguished can be highlighted, reinforce the signal width of target voice data to be distinguished Degree helps to extract ASR phonetic feature.

S212：Target voice data to be distinguished after preemphasis is subjected to sub-frame processing.

In the present embodiment, in preemphasis target after distinguishing voice data, sub-frame processing should be also carried out.Framing refers to will be whole The voice signal of section is cut into the voice processing technology of several segments, and the size of every frame is in the range of 10-30ms, with general 1/2 Frame length is moved as frame.Frame moves the overlapping region for referring to adjacent two interframe, can be avoided adjacent two frame and changes excessive problem.To mesh It marks voice data to be distinguished and carries out sub-frame processing, target voice data to be distinguished can be divided into the voice data of several segments, it can To segment target voice data to be distinguished, convenient for the extraction of ASR phonetic feature.

S213：Target voice data to be distinguished after framing is subjected to windowing process, obtains pretreatment voice data, adding window Calculation formula beWherein, N is that window is long, and n is time, s_nFor the letter in time domain Number amplitude, s'_nFor the signal amplitude in time domain after adding window.

In the present embodiment, to target wait distinguish voice data carry out sub-frame processing after, the initial segment of each frame and end Discontinuous place can all occur in end, so framing is mostly also bigger with the error of target voice data to be distinguished.Using adding Window is able to solve this problem, and the voice data to be distinguished of the target after can making framing becomes continuously, and enables each frame Enough show the feature of periodic function.Windowing process is specifically referred to using at window function voice data to be distinguished to target Reason, window function can choose Hamming window, then the formula of the adding window isN is Hamming Window window is long, and n is time, s_nFor the signal amplitude in time domain, s'_nFor the signal amplitude in time domain after adding window.Language to be distinguished to target Sound data carry out windowing process, obtain pretreatment voice data, target after enabling to framing wait distinguish voice data when Signal on domain becomes the ASR phonetic feature for continuously facilitating to extract target voice data to be distinguished.

The pretreatment operation of above-mentioned steps S211-S213 voice data to be distinguished to target, to extract target language to be distinguished The ASR phonetic feature of sound data provides the foundation, and enables to the ASR phonetic feature extracted more representative of target language to be distinguished Sound data, and speech differentiation is carried out according to the ASR phonetic feature.

S22：Fast Fourier Transform (FFT) is made to pretreatment voice data, obtains the frequency spectrum of target voice data to be distinguished, and The power spectrum of target voice data to be distinguished is obtained according to frequency spectrum.

Wherein, Fast Fourier Transform (FFT) (Fast Fourier Transformation, abbreviation FFT), refers to and utilizes computer Calculate efficient, quick calculation method the general designation of discrete Fourier transform, abbreviation FFT.Computer meter can be made using this algorithm It calculates multiplication number required for discrete Fourier transform to be greatly reduced, the number of sampling points being especially transformed is more, fft algorithm meter The saving of calculation amount is more significant.

In the present embodiment, Fast Fourier Transform (FFT) is carried out to pretreatment voice data, voice data will be pre-processed from time domain Signal amplitude be converted to the signal amplitude (frequency spectrum) on frequency domain.The formula of the calculating frequency spectrum is1 ≤ k≤N, N are the size of frame, and s (k) is the signal amplitude on frequency domain, and s (n) is the signal amplitude in time domain, and n is the time, and i is Complex unit.After the frequency spectrum for obtaining pretreatment voice data, pretreatment voice data can be directly acquired according to the frequency spectrum The power spectrum for pre-processing voice data is known as the power spectrum of target voice data to be distinguished by power spectrum below.The calculating target The formula of the power spectrum of voice data to be distinguished is1≤k≤N, N are the size of frame, and s (k) is on frequency domain Signal amplitude.By the way that pretreatment voice data is converted to the signal amplitude on frequency domain, then root from the signal amplitude in time domain The power spectrum of target voice data to be distinguished is obtained according to the signal amplitude on the frequency domain, is the function from target voice data to be distinguished ASR phonetic feature is extracted in rate spectrum, and important technical foundation is provided.

S23：Using the power spectrum of melscale filter group processing target voice data to be distinguished, obtains target and wait distinguishing The Meier power spectrum of voice data.

It wherein, using the power spectrum of melscale filter group processing target voice data to be distinguished is carried out to power spectrum Mel-frequency analysis, mel-frequency analysis be based on human auditory perception analysis.Observation discovery, human ear are filtered just as one Device group is the same, only focuses on certain specific frequency components (sense of hearing of people is selective frequency), that is to say, that human ear is only It allows the signal of certain frequencies to pass through, and directly ignores the certain frequency signals for being not desired to perception.However these filters are sat in frequency But it is not univesral distribution on parameter, there are many filters in low frequency region, they is distributed than comparatively dense, but in high frequency region Domain, the number of filter just become fewer, are distributed very sparse.It is to be appreciated that melscale filter group is in low frequency part High resolution, the auditory properties with human ear are consistent, where this is also the physical significance of melscale.

In the present embodiment, using the power spectrum of melscale filter group processing target voice data to be distinguished, mesh is obtained The Meier power spectrum for marking voice data to be distinguished carries out cutting to frequency-region signal by using melscale filter group, so that The last corresponding numerical value of each frequency band, if the number of filter is 22, available target voice data to be distinguished Corresponding 22 energy values of Meier power spectrum.Mel-frequency analysis is carried out by the power spectrum to target voice data to be distinguished, So that the Meier power spectrum obtained after its analysis maintains the frequency-portions closely related with human ear characteristic, which can It is well reflected out the feature of target voice data to be distinguished.

S24：Cepstral analysis is carried out on Meier power spectrum, obtains the mel-frequency cepstrum system of target voice data to be distinguished Number.

Wherein, cepstrum (cepstrum) refers in Fu that a kind of Fourier transform spectrum of signal carries out again after logarithm operation Leaf inverse transformation, since general Fourier spectrum is complex number spectrum, thus cepstrum is also known as cepstrum.

In the present embodiment, cepstral analysis is carried out to Meier power spectrum, according to cepstrum as a result, analyzing and obtaining target and wait for area Divide the mel-frequency cepstrum coefficient of voice data.It, can be excessively high by script intrinsic dimensionality, it is difficult to directly make by the cepstral analysis The feature for including in the Meier power spectrum of target voice data to be distinguished, by carrying out cepstrum point on Meier power spectrum Analysis, is converted into wieldy feature (for the mel-frequency cepstrum coefficient feature vector for being trained or identifying).The Meier The coefficient that frequency cepstral coefficient can distinguish different phonetic as ASR phonetic feature, the ASR phonetic feature can reflect Difference between voice can be used to identify and distinguish between target voice data to be distinguished.

In a specific embodiment, as shown in fig. 6, in step S24, cepstral analysis is carried out on Meier power spectrum, is obtained The mel-frequency cepstrum coefficient for taking target voice data to be distinguished, includes the following steps：

S241：The logarithm for taking Meier power spectrum obtains Meier power spectrum to be transformed.

In the present embodiment, according to the definition of cepstrum, logarithm log is taken to Meier power spectrum, obtains Meier power to be transformed Compose m.

S242：Discrete cosine transform is made to Meier power spectrum to be transformed, obtains the Meier frequency of target voice data to be distinguished Rate cepstrum coefficient.

In the present embodiment, discrete cosine transform (Discrete Cosine is made to Meier power spectrum m to be transformed Transform, DCT), the mel-frequency cepstrum coefficient of corresponding target voice data to be distinguished is obtained, generally takes the 2nd to arrive 13rd coefficient is able to reflect the difference between voice data as ASR phonetic feature, the ASR phonetic feature.To Meier to be transformed The formula that power spectrum m makees discrete cosine transform is N is frame length, and m is Meier power spectrum to be transformed, and j is the independent variable of Meier power spectrum to be transformed.Due to being between Meier filter There is overlapping, so having correlation between the energy value obtained using melscale filter, discrete cosine transform can To carry out dimensionality reduction compression to Meier power spectrum m to be transformed and be abstracted, and corresponding ASR phonetic feature is obtained, compared to Fourier Transformation, the result of discrete cosine transform do not have imaginary part, there is apparent advantage in terms of calculating.

Step S21-S24 carries out the processing of feature extraction based on ASR technology voice data to be distinguished to target, final to obtain ASR phonetic feature can embody target voice data to be distinguished well, which can be in depth network model Training acquires ASR-RNN model, and result of the ASR-RNN model for obtaining training when carrying out speech differentiation is more smart Really, even if accurately noise and speech differentiation can also come under conditions of noise is very big.

It should be noted that the feature extracted above is mel-frequency cepstrum coefficient, it herein should not be by ASR phonetic feature It is a kind of to be limited to only mel-frequency cepstrum coefficient, and will be understood that the phonetic feature obtained using ASR technology, as long as can have Effect reflection voice data feature, all can be used as ASR phonetic feature and carries out identification and model training.

S30：ASR phonetic feature is input in preparatory trained ASR-RNN model and is distinguished, target is obtained and distinguishes As a result.

Wherein, ASR-RNN model refers to that the Recognition with Recurrent Neural Network model obtained using the training of ASR phonetic feature, RNN are referred to Recognition with Recurrent Neural Network (Recurrent neural networks).The ASR-RNN model is extracted using voice data to be trained ASR phonetic feature be trained, therefore the model can identify ASR phonetic feature, thus according to ASR phonetic feature Distinguish voice.Specifically, voice data to be trained includes target voice and noise, is extracted when carrying out ASR-RNN model training The ASR phonetic feature of target voice and the ASR phonetic feature of noise, the ASR-RNN model that training is obtained is according to ASR Phonetic feature identify target voice and interfere voice in noise (using VAD distinguish it is original when distinguishing voice data Eliminate most interference voice, as in voice data since silence is without the phonological component and a part of noise of pronunciation, So here ASR-DBN model distinguish interference voice specifically refer to noise components), realize to target voice and interference voice into The purpose that row is effectively distinguished.

In the present embodiment, ASR phonetic feature is input in preparatory trained ASR-RNN model and is distinguished, due to ASR phonetic feature is able to reflect the feature of voice data, therefore can be according to ASR-RNN model voice data to be distinguished to target The ASR phonetic feature of extraction is identified, to make accurate language according to ASR phonetic feature voice data to be distinguished to target Sound is distinguished.Trained ASR-RNN models coupling ASR phonetic feature and Recognition with Recurrent Neural Network carry out deep layer to feature in advance for this The characteristics of extraction, distinguishes voice from the ASR phonetic feature of voice data, the noise conditions very severe the case where Under still have very high accurate rate.Specifically, since the ASR feature extracted also contains the ASR phonetic feature of noise, In the ASR-RNN model, noise is also accurately to distinguish, solution current speech differentiating method (including but it is unlimited In VAD) the problem of speech differentiation can not be effectively carried out under the conditions of noise effect is biggish.

In a specific embodiment, ASR phonetic feature is being input to preparatory trained ASR-RNN mould by step S30 It is distinguished in type, before obtaining the step of target distinguishes result, speech differentiation method further includes following steps：Obtain ASR- RNN model.

As shown in fig. 7, the step of obtaining ASR-RNN model specifically includes：

S31：Voice data to be trained is obtained, and extracts the ASR phonetic feature to be trained of voice data to be trained.

Wherein, voice data training sample set needed for voice data to be trained refers to trained ASR-RNN model, should be wait instruct Practicing voice data can be the voice training collection for directlying adopt open source, or the voice by collecting great amount of samples voice data Training set.Being somebody's turn to do voice data to be trained is to distinguish target voice and interference voice (being herein specially noise) well in advance, The concrete mode that differentiation is taken, which can be, target voice and noise is respectively set different label values.For example, will language be trained Target voice part in sound data is collectively labeled as 1 (representing "true"), and noise components are collectively labeled as 0 (representing "false"), are passed through The accuracy that the label value being arranged in advance can examine ASR-RNN model to identify updates ASR- in order to provide improved reference Network parameter in RNN model continues to optimize ASR-RNN model.In the present embodiment, the ratio of target voice and noise specifically may be used To take 1:1, it can be avoided using the ratio and occurred due to target voice in voice data to be trained and not identical noise quantity Fitting phenomenon.Wherein, over-fitting refers in order to obtain unanimously hypothesis and makes to assume to become over stringent phenomenon, avoids over-fitting It is a core missions in classifier design.

In the present embodiment, obtain voice data to be trained, and extract the feature of the voice data to be trained, this feature i.e. to The step of training ASR phonetic feature, extracting ASR phonetic feature to be trained is identical as step S21-S24, and details are not described herein.To Training voice data includes the training sample of target voice and the training sample of noise, this two parts voice data has respective Therefore ASR phonetic feature can extract and train corresponding ASR-RNN model using ASR phonetic feature to be trained, so that Target voice and noise (noise can be accurately distinguished according to the ASR-RNN model that the ASR phonetic feature training to be trained obtains Belong to interference voice).

S32：Initialize RNN model.

Wherein, RNN model, that is, Recognition with Recurrent Neural Network model.RNN model includes the input layer being made of neuron, hidden layer And output layer.RNN model includes that the weight that each neuron connects between each layer and biasing, these weights and biasing determine The property and recognition effect of RNN model.With traditional neural network such as DNN (Deep Neural Network, depth nerve net Network) compare, RNN be a kind of pair of sequence data (such as time series) modeling neural network, i.e. the current output of a sequence with The output of front is related.The specific form of expression is that network can carry out the hiding layer state of front to remember and applied to current defeated In calculating out, i.e., the node between hidden layer is no longer connectionless but has connection, and the input of hidden layer not only includes defeated The output for entering layer further includes the output of last moment hidden layer.Since voice data has the characteristics that in timing, can adopt It is accurate to extract target voice and interference voice respective further feature in timing with voice data to be trained training RNN model, Realize accurately distinguishing for voice.

In the present embodiment, RNN model is initialized, which is weight and biasing in RNN model is arranged initial Value, when initial value initial setting up, can be set to lesser value, such as be arranged between section [- 0.3-0.3].It is reasonable initial Change RNN model, which can make model in the early stage, to be had compared with flexibly adjustment ability, can be had during model training to model The adjustment of effect causes the model trained to distinguish effect bad without keeping model just very poor in the adjustment capability of initial stage.

S33：ASR phonetic feature to be trained is input in RNN model, RNN model is obtained according to propagated forward algorithm Output valve, output value table are shown as：σ indicates activation primitive, connects between V expression hidden layer and output layer Weight, h_tIndicate that the hidden state of t moment, c indicate the biasing between hidden layer and output layer.

In the present embodiment, the process of RNN propagated forward is according to weight, the biasing for connecting each neuron in RNN model A series of linear operations and activation carried out in RNN model with the ASR phonetic feature to be trained of input according to time series are transported It calculates, each layer of network of output valve in obtained RNN model.Particularly, since RNN is to sequence (when specifically can be here Between sequence) neural network that is modeled of data, in the hidden state h for calculating t moment hidden layer_tWhen, it needs according to the t-1 moment Hidden layer state h_t-1It is acquired jointly with the ASR phonetic feature to be trained of t moment input.By the process of RNN model propagated forward, The propagated forward algorithm of available RNN model：For any time t, according to the ASR phonetic feature to be trained of input from RNN The input layer of model calculates the output to hidden layer, output (the i.e. hidden state h of the hidden layer_t) be expressed as：h_t=σ (Ux_t+ Wh_t-1+ b), wherein σ indicates that activation primitive (can specifically use tanh activation primitive, tanh can not in cyclic process here Difference between the disconnected feature for expanding ASR phonetic feature to be trained is conducive to distinguish target voice and noise), U indicates input layer To the weight connected between hidden layer, W indicates the weight connected between hidden layer (between the hidden layer realized by time series Connection), h_t-1Indicate that the hidden state at t-1 moment, b indicate the biasing between input layer and hidden layer.From hiding for RNN model Layer calculates the output for arriving output layer, and the output (i.e. the output valve of RNN model) of the output layer is expressed asIts In, (the softmax function is used for classification problem effect ratio for what activation primitive here specifically used can be softmax function Preferably), V indicates the weight connected between hidden layer and output layer, h_tIndicate that the hidden state of t moment, c indicate hidden layer and defeated Biasing between layer out.The output valve (output of output layer) of the RNN modelSequence one is as pressed by propagated forward algorithm The output valve being calculated layer by layer is properly termed as prediction output valve.It, can basis after server obtains the output valve of RNN model Network parameter (weight and biasing) in output valve update, adjustment RNN model, so that the RNN model obtained can be according to language The timing feature that sound has distinguishes, by the ASR phonetic feature of target voice and interfere voice ASR phonetic feature and The difference showed in timing obtains accurate recognition result.

S34：Error-duration model is carried out based on output valve, weight and the biasing of each layer of RNN model is updated, obtains ASR-RNN model, Wherein, the formula of update weight V is：V indicates to connect between hidden layer and output layer before updating Weight, V' indicates that the weight that connects between hidden layer and output layer after updating, α indicate that learning rate, t indicate t moment, τ table Show total duration,Indicate prediction output valve, y_tIndicate true output, h_tIndicate the hidden state of t moment, T representing matrix transposition Operation；Update biasing c formula be：C indicates the biasing before updated between hidden layer and output layer, c' table Show the biasing after updating between hidden layer and output layer；Update weight U formula be： U indicates that input layer is to the weight connected between hidden layer before updating, and U' indicates input layer after updating to connecting between hidden layer Weight, diag () indicate one diagonal matrix of construction or return to the square of diagonal entry on a matrix in vector form Battle array operation, δ_tIndicate the gradient of hiding layer state, x_tIndicate the ASR phonetic feature to be trained of t moment input；Update the public affairs of weight W Formula is：W indicates that the weight connected between hidden layer before updating, W' indicate more The weight connected between hidden layer after new；Update biasing b formula be：B indicates to update Biasing between preceding input layer and hidden layer, b' indicate the biasing after updated between input layer and hidden layer.

In the present embodiment, server-side is obtaining the output valve of RNN model (prediction output valve) according to propagated forward algorithm It afterwards, can basisWith the ASR phonetic feature to be trained for pre-setting label value, ASR phonetic feature to be trained is calculated at this The error generated when RNN model training, and suitable error function is constructed (such as using log error function come table according to the error Show the error of generation).Server-side uses the error function to carry out error-duration model, adjustment, the weight for updating each layer of RNN model again (U, W and V) and weight (b and c).Specifically, the label value pre-set is properly termed as true output and (represents objective thing Real, label value 1 represents target voice, and label value is that voice is interfered in 0 representative), use y_tIt indicates.During training RNN model, To having error when output before RNN model is calculated at each layer in time series, error function L can be used by measuring the error, It is expressed as：Wherein, t refers to t moment, and τ indicates total duration, L_tIndicate to be indicated by error function generates in t moment Error.It, can be according to BPTT (Back Propagation Trough Time, when being based on after server-side obtains error function Between back-propagation algorithm) update RNN model weight and biasing, obtain the ASR-RNN mould based on ASR phonetic feature to be trained Type.Specifically, the formula of update weight V is：Wherein, V indicates hidden layer and defeated before updating The weight connected between layer out, V' indicate that the weight connected between hidden layer and output layer after updating, α indicate learning rate,Indicate prediction output valve, y_tIndicate true output, h_tIndicate the hidden state of t moment, T representing matrix transposition operation.It updates Biasing c formula be：C indicates the biasing before updated between hidden layer and output layer, and c' is indicated after updating Biasing between hidden layer and output layer.It is compared to weight V and biasing c, weight U, weight W and biasing b, in backpropagation When, the gradient loss of a certain moment t loses two parts by the corresponding gradient loss of output of current location and the gradient at t+1 moment It codetermines.Therefore the update of weight U, weight W and biasing b need the gradient δ by hiding layer state_tIt obtains.The t sequence moment Hide the gradient δ of layer state_tIt is expressed as：δ_t+1With δ_tBetween exist connection, according to δ_t+1It can be in the hope of δ_t, connection Expression formula is：Wherein, δ_t+1Indicate that the t+1 sequence moment hides the ladder of layer state Degree, diag () indicate a kind of calculating function of matrix operation, and the calculating function is for constructing a diagonal matrix or with vector Form return a matrix on diagonal entry, h_t+1Indicate the hiding layer state at t+1 sequence moment.When then can be by obtaining τ Carve the gradient δ for hiding layer state_τ, utilize δ_t+1With δ_tBetween the expression formula that contacts By δ_τBackpropagation recursion obtains δ from level to level_t.Due to δ_τAt the time of below without other, therefore can be direct according to gradient calculating It obtains：It then can be according to δ_τRecursion acquires δ_t.Obtain δ_tAfterwards, it can calculate weight U, weight W and biasing b.Update weight U formula be：U indicates that input layer connects between hidden layer before updating The weight connect, U' indicate that input layer indicates to construct a diagonal matrix to the weight connected between hidden layer, diag () after updating Or the matrix operation of diagonal entry on a matrix, δ are returned in vector form_tIndicate the gradient of hiding layer state, x_tTable Show the ASR phonetic feature to be trained of t moment input；Update weight W formula be： W indicates that the weight connected between hidden layer before updating, W' indicate the weight connected between hidden layer after updating；Update biasing b's Formula is：B indicates the biasing before updated between input layer and hidden layer, and b' indicates to update Biasing between input layer and hidden layer afterwards.When all weights and the changing value of biasing both less than stop iteration threshold ∈, i.e., It can deconditioning；Alternatively, when training reaches maximum number of iterations MAX, deconditioning.By ASR phonetic feature to be trained in RNN The error generated between prediction output valve in model and the label value (true output) pre-set, it is real based on the error The update of existing RNN model each layer weight and biasing enables the ASR-RNN model finally obtained according to ASR phonetic feature, instruction Practice and learn the further feature about time series, realizes the purpose for accurately distinguishing voice.

Step S31-S34 is trained RNN model using ASR phonetic feature to be trained, so that the ASR- that training obtains RNN model can be according to the further feature that ASR phonetic feature is trained and study is about sequence (timing), can be according to target voice Voice is effectively distinguished with the ASR phonetic feature of interference voice and in conjunction with temporal factors.Under noise jamming serious situation, still So target voice and noise can accurately be distinguished.

In speech differentiation method provided by the present embodiment, it is original to be primarily based on voice activity detection algorithms (VAD) processing Voice data to be distinguished obtains target voice data to be distinguished, and original voice data to be distinguished is calculated by voice activity detection Method is first distinguished once, is obtained the smaller target of range voice data to be distinguished, can be tentatively effectively removed original language to be distinguished Interference voice data in sound data retains the original voice data to be distinguished for mixing target voice and interfering voice, and will The original voice data to be distinguished can make original voice data to be distinguished effective first as target voice data to be distinguished Speech differentiation is walked, a large amount of interference voice is removed.It is then based on target voice data to be distinguished, obtains corresponding ASR voice Feature, the ASR phonetic feature can make the result of speech differentiation more accurate, even if under conditions of noise is very big, it can also be with To accurately voice (such as noise) and target voice be interfered to distinguish, be carried out accordingly to be subsequent according to the ASR phonetic feature The identification of ASR-RNN model provides important technology premise.ASR phonetic feature is finally input to preparatory trained ASR-RNN It is distinguished in model, obtains target and distinguish as a result, the ASR-RNN model is the ASR language extracted according to voice data to be trained The feature specialized training of sound feature and voice in timing for effectively distinguishing the identification model of voice, can be from mixing mesh Poster sound and interference voice (due to having used VAD to distinguish once, so most of interference voice here refers to noise) Target voice data to be distinguished in correctly distinguish target voice and interference voice, improve the accuracy of speech differentiation.

It should be understood that the size of the serial number of each step is not meant that the order of the execution order in above-described embodiment, each process Execution sequence should be determined by its function and internal logic, the implementation process without coping with the embodiment of the present invention constitutes any limit It is fixed.

Fig. 8 shows the functional block diagram with the one-to-one speech differentiation device of speech differentiation method in embodiment.Such as Fig. 8 institute Show, which includes that target voice data to be distinguished obtains module 10, phonetic feature obtains module 20 and target area Result is divided to obtain module 30.Wherein, target voice data to be distinguished obtains module 10, phonetic feature obtains module 20 and target area The realization function step corresponding with speech differentiation method in embodiment for dividing result to obtain module 30 corresponds, to avoid going to live in the household of one's in-laws on getting married It states, the present embodiment is not described in detail one by one.

Target voice data to be distinguished obtains module 10, for handling original language to be distinguished based on voice activity detection algorithms Sound data obtain target voice data to be distinguished.

Phonetic feature obtains module 20, and for being based on target voice data to be distinguished, it is special to obtain corresponding ASR voice Sign.

Target distinguishes result and obtains module 30, for ASR phonetic feature to be input to preparatory trained ASR-RNN model In distinguish, obtain target distinguish result.

Preferably, it includes the first original differentiation voice data acquiring unit that target voice data to be distinguished, which obtains module 10, 11, the second original differentiation voice data acquiring unit 12 and target voice data acquiring unit 13 to be distinguished.

First original differentiation voice data acquiring unit 11, for according to short-time energy characteristic value calculation formula to it is original to It distinguishes voice data to be handled, obtains corresponding short-time energy characteristic value, short-time energy characteristic value is greater than first threshold Original data to be distinguished retain, and are determined as the first original differentiation voice data, wherein short-time energy characteristic value calculation formula isN is voice frame length, and s (n) is the signal amplitude in time domain, and n is the time.

Second original differentiation voice data acquiring unit 12 is used for according to zero-crossing rate characteristic value calculation formula to original to area Point voice data is handled, and corresponding zero-crossing rate characteristic value is obtained, by zero-crossing rate characteristic value be less than second threshold it is original to It distinguishes voice data to retain, is determined as the second original differentiation voice data, wherein zero-crossing rate characteristic value calculation formula isN is voice frame length, and s (n) is the signal amplitude in time domain, and n is the time.

Target voice data acquiring unit 13 to be distinguished is used for the first original differentiation voice data and the second original differentiation Voice data is as target voice data to be distinguished.

Preferably, it includes pretreatment voice data acquiring unit 21, power spectrum acquiring unit that phonetic feature, which obtains module 20, 22, Meier power spectrum acquiring unit 23 and mel-frequency cepstrum coefficient unit 24.

Pretreatment unit 21 obtains pretreatment voice data for pre-processing to target voice data to be distinguished.

Power spectrum acquiring unit 22 obtains target and waits distinguishing for making Fast Fourier Transform (FFT) to pretreatment voice data The frequency spectrum of voice data, and according to the power spectrum of frequency spectrum acquisition target voice data to be distinguished.

Meier power spectrum acquiring unit 23, for using melscale filter group processing target voice data to be distinguished Power spectrum obtains the Meier power spectrum of target voice data to be distinguished.

Mel-frequency cepstrum coefficient unit 24 obtains target and waits distinguishing for carrying out cepstral analysis on Meier power spectrum The mel-frequency cepstrum coefficient of voice data.

Preferably, pretreatment unit 21 includes preemphasis subelement 211, framing subelement 212 and adding window subelement 213.

Preemphasis subelement 211, for making preemphasis processing, the calculating of preemphasis processing to target voice data to be distinguished Formula is s'_n=s_n-a*s_n-1, wherein s_nFor the signal amplitude in time domain, s_n-1For with s_nThe signal of corresponding last moment Amplitude, s'_nFor the signal amplitude in time domain after preemphasis, a is pre emphasis factor, and the value range of a is 0.9<a<1.0.

Framing subelement 212, for the target voice data to be distinguished after preemphasis to be carried out sub-frame processing.

Adding window subelement 213 obtains pretreatment for the target voice data to be distinguished after framing to be carried out windowing process The calculation formula of voice data, adding window isWherein, N is that window is long, and n is the time, s_nFor the signal amplitude in time domain, s'_nFor the signal amplitude in time domain after adding window.

Preferably, mel-frequency cepstrum coefficient unit 24 includes that Meier power spectrum to be transformed obtains subelement 241 and Meier Frequency cepstral coefficient subelement 242.

Meier power spectrum to be transformed obtains subelement 241 and obtains Meier to be transformed for taking the logarithm of Meier power spectrum Power spectrum.

Mel-frequency cepstrum coefficient subelement 242 obtains mesh for making discrete cosine transform to Meier power spectrum to be transformed Mark the mel-frequency cepstrum coefficient of voice data to be distinguished.

Preferably, which further includes that ASR-RNN model obtains module 40, and ASR-RNN model obtains module 40 include ASR phonetic feature acquiring unit 41, initialization unit 42, output valve acquiring unit 43 and updating unit 44 to be trained.

ASR phonetic feature acquiring unit 41 to be trained for obtaining voice data to be trained, and extracts voice number to be trained According to ASR phonetic feature to be trained.

Initialization unit 42, for initializing RNN model.

Output valve acquiring unit 43, for ASR phonetic feature to be trained to be input in RNN model, according to propagated forward Algorithm obtains the output valve of RNN model, and output value table is shown as：σ indicate activation primitive, V indicate hidden layer and The weight connected between output layer, h_tIndicate that the hidden state of t moment, c indicate the biasing between hidden layer and output layer.

Updating unit 44 updates weight and the biasing of each layer of RNN model, obtains for carrying out error-duration model based on output valve Take ASR-RNN model, wherein update weight V formula be：V indicates hidden layer before updating The weight connected between output layer, V' indicate that the weight connected between hidden layer and output layer after updating, α indicate study Rate, t indicate t moment, and τ indicates total duration,Indicate prediction output valve, y_tIndicate true output, h_tIndicate hiding for t moment State, T representing matrix transposition operation；Update biasing c formula be：C indicate update before hidden layer and Biasing between output layer, c' indicate the biasing after updated between hidden layer and output layer；Update weight U formula be：U indicates that input layer indicates more to the weight connected between hidden layer, U' before updating Input layer is to the weight connected between hidden layer after new, and diag () indicates one diagonal matrix of construction or in vector form Return to the matrix operation of diagonal entry on a matrix, δ_tIndicate the gradient of hiding layer state, x_tIndicate t moment input to Training ASR phonetic feature；Update weight W formula be：W is indicated before updating The weight connected between hidden layer, W' indicate the weight connected between hidden layer after updating；Update biasing b formula be：B indicates the biasing before updated between input layer and hidden layer, and b' indicates to input after updating Biasing between layer and hidden layer.

The present embodiment provides a computer readable storage medium, computer journey is stored on the computer readable storage medium Sequence realizes speech differentiation method in embodiment when the computer program is executed by processor, no longer superfluous here to avoid repeating It states.Alternatively, realizing the function of each module/unit in speech differentiation device in embodiment when the computer program is executed by processor Can, to avoid repeating, which is not described herein again.

It is to be appreciated that the computer readable storage medium may include：The computer program code can be carried Any entity or device, recording medium, USB flash disk, mobile hard disk, magnetic disk, CD, computer storage, read-only memory (ROM, Read-Only Memory), random access memory (RAM, Random Access Memory), electric carrier signal and telecommunications letter Number etc..

Fig. 9 is the schematic diagram of computer equipment in the present embodiment.As shown in figure 9, computer equipment 50 include processor 51, Memory 52 and it is stored in the computer program 53 that can be run in memory 52 and on processor 51.Processor 51 executes meter Each step of speech differentiation method in embodiment, such as step S10, S20 and S30 shown in Fig. 2 are realized when calculation machine program 53. Alternatively, processor 51 realizes the function of each module/unit of speech differentiation device in embodiment when executing computer program 53, such as scheme The voice data to be distinguished of target shown in 8 obtains module 10, phonetic feature obtains module 20, target distinguishes result and obtains 30 and of module The function of ASR-RNN model acquisition module 40.

It is apparent to those skilled in the art that for convenience of description and succinctly, only with above-mentioned each function Can unit, module division progress for example, in practical application, can according to need and by above-mentioned function distribution by different Functional unit, module are completed, i.e., the internal structure of described device is divided into different functional unit or module, more than completing The all or part of function of description.

It, can also be in addition, the functional units in various embodiments of the present invention may be integrated into one processing unit It is that each unit physically exists alone, can also be integrated in one unit with two or more units.Above-mentioned integrated list Member both can take the form of hardware realization, can also realize in the form of software functional units.

Embodiment described above is merely illustrative of the technical solution of the present invention, rather than its limitations；Although referring to aforementioned reality Applying example, invention is explained in detail, those skilled in the art should understand that：It still can be to aforementioned each Technical solution documented by embodiment is modified or equivalent replacement of some of the technical features；These modification or Person's replacement, the spirit and scope for technical solution of various embodiments of the present invention that it does not separate the essence of the corresponding technical solution should all wrap Containing within protection scope of the present invention.

Claims

1. a kind of speech differentiation method, which is characterized in that including：

The ASR phonetic feature is input in preparatory trained ASR-RNN model and is distinguished, target is obtained and distinguishes knot Fruit.

2. speech differentiation method according to claim 1, which is characterized in that input the ASR phonetic feature described It is distinguished into preparatory trained ASR-RNN model, before obtaining the step of distinguishing result, the speech differentiation method is also Including：Obtain ASR-RNN model；

The step of acquisition ASR-RNN model includes：

Voice data to be trained is obtained, and extracts the ASR phonetic feature to be trained of the voice data to be trained；

Initialize RNN model；

ASR phonetic feature to be trained is input in RNN model, the output valve of RNN model, institute are obtained according to propagated forward algorithm Output value table is stated to be shown as：σ indicates that activation primitive, V indicate the weight connected between hidden layer and output layer, h_t Indicate that the hidden state of t moment, c indicate the biasing between hidden layer and output layer；

Error-duration model is carried out based on the output valve, weight and the biasing of each layer of RNN model is updated, obtains ASR-RNN model, In, the formula for updating weight V is：What V expression connected between hidden layer and output layer before updating Weight, V' indicates that the weight that connects between hidden layer and output layer after updating, α indicate that learning rate, t indicate t moment, τ table Show total duration,Indicate prediction output valve, y_tIndicate true output, h_tIndicate the hidden state of t moment, T representing matrix transposition Operation；Update biasing c formula be：C indicates the biasing before updated between hidden layer and output layer, c' table Show the biasing after updating between hidden layer and output layer；Update weight U formula be： U indicates that input layer is to the weight connected between hidden layer before updating, and U' indicates input layer after updating to connecting between hidden layer Weight, diag () indicate one diagonal matrix of construction or return to the square of diagonal entry on a matrix in vector form Battle array operation, δ_tIndicate the gradient of hiding layer state, x_tIndicate the ASR phonetic feature to be trained of t moment input；Update the public affairs of weight W Formula is：W indicates that the weight connected between hidden layer before updating, W' indicate more The weight connected between hidden layer after new；Update biasing b formula be：B indicates to update Biasing between preceding input layer and hidden layer, b' indicate the biasing after updated between input layer and hidden layer.

3. speech differentiation method according to claim 1, which is characterized in that described to be handled based on voice activity detection algorithms Original voice data to be distinguished obtains target voice data to be distinguished, including：

The original voice data to be distinguished is handled according to short-time energy characteristic value calculation formula, acquisition is corresponding in short-term Energy eigenvalue retains the original data to be distinguished that the short-time energy characteristic value is greater than first threshold, is determined as the One original differentiation voice data, short-time energy characteristic value calculation formula areWherein, N is voice frame length, and s (n) is Signal amplitude in time domain, n are the time；

The original voice data to be distinguished is handled according to zero-crossing rate characteristic value calculation formula, obtains corresponding zero-crossing rate The original voice data to be distinguished that the zero-crossing rate characteristic value is less than second threshold is retained, is determined as second by characteristic value Original differentiation voice data, zero-crossing rate characteristic value calculation formula areWherein, N is voice frame length, S (n) is the signal amplitude in time domain, and n is the time；

Using the described first original differentiation voice data and the second original differentiation voice data as target language to be distinguished Sound data.

4. speech differentiation method according to claim 1, which is characterized in that described to be based on target voice number to be distinguished According to, corresponding ASR phonetic feature is obtained, including：

Target voice data to be distinguished is pre-processed, pretreatment voice data is obtained；

Fast Fourier Transform (FFT) made to the pretreatment voice data, obtains the frequency spectrum of target voice data to be distinguished, and according to The frequency spectrum obtains the power spectrum of target voice data to be distinguished；

The power spectrum of target voice data to be distinguished is handled using melscale filter group, obtains target voice to be distinguished The Meier power spectrum of data；

Cepstral analysis is carried out on the Meier power spectrum, obtains the mel-frequency cepstrum coefficient of target voice data to be distinguished.

5. speech differentiation method according to claim 4, which is characterized in that described to target voice data to be distinguished It is pre-processed, obtains pretreatment voice data, including：

Preemphasis processing is made to target voice data to be distinguished, the calculation formula of preemphasis processing is s'_n=s_n-a*s_n-1, Wherein, s_nFor the signal amplitude in time domain, s_n-1For with s_nThe signal amplitude of corresponding last moment, s'_nWhen for after preemphasis Signal amplitude on domain, a are pre emphasis factor, and the value range of a is 0.9<a<1.0；

Target voice data to be distinguished after preemphasis is subjected to sub-frame processing；

Target voice data to be distinguished after framing is subjected to windowing process, obtains pretreatment voice data, the meter of adding window Calculating formula isWherein, N is that window is long, and n is time, s_nFor the signal width in time domain Degree, s'_nFor the signal amplitude in time domain after adding window.

6. speech differentiation method according to claim 4, which is characterized in that described to be fallen on the Meier power spectrum Spectrum analysis obtains the mel-frequency cepstrum coefficient of target voice data to be distinguished, including：

The logarithm for taking the Meier power spectrum obtains Meier power spectrum to be transformed；

Discrete cosine transform is made to the Meier power spectrum to be transformed, obtains the mel-frequency cepstrum of target voice data to be distinguished Coefficient.

7. a kind of speech differentiation device, which is characterized in that including：

Target voice data to be distinguished obtains module, for handling original voice number to be distinguished based on voice activity detection algorithms According to acquisition target voice data to be distinguished；

Phonetic feature obtains module, for being based on target voice data to be distinguished, obtains corresponding ASR phonetic feature；

Target distinguishes result and obtains module, for the ASR phonetic feature to be input in preparatory trained ASR-RNN model It distinguishes, obtains target and distinguish result.

8. speech differentiation device according to claim 7, which is characterized in that the speech differentiation device further includes ASR- RNN model obtains module, and the ASR-RNN model obtains module and includes：

ASR phonetic feature acquiring unit to be trained for obtaining voice data to be trained, and extracts the voice data to be trained ASR phonetic feature to be trained；

Initialization unit, for initializing RNN model；

Output valve acquiring unit is obtained for ASR phonetic feature to be trained to be input in RNN model according to propagated forward algorithm The output valve of RNN model is taken, the output value table is shown as：σ indicates that activation primitive, V indicate hidden layer and defeated The weight connected between layer out, h_tIndicate that the hidden state of t moment, c indicate the biasing between hidden layer and output layer；

Updating unit updates weight and the biasing of each layer of RNN model, obtains for carrying out error-duration model based on the output valve ASR-RNN model, wherein update weight V formula be：V indicate update before hidden layer and The weight connected between output layer, V' indicate that the weight connected between hidden layer and output layer after updating, α indicate study Rate, t indicate t moment, and τ indicates total duration,Indicate prediction output valve, y_tIndicate true output, h_tIndicate hiding for t moment State, T representing matrix transposition operation；Update biasing c formula be：C indicate update before hidden layer and Biasing between output layer, c' indicate the biasing after updated between hidden layer and output layer；Update weight U formula be：U indicates that input layer indicates more to the weight connected between hidden layer, U' before updating Input layer is to the weight connected between hidden layer after new, and diag () indicates one diagonal matrix of construction or in vector form Return to the matrix operation of diagonal entry on a matrix, δ_tIndicate the gradient of hiding layer state, x_tIndicate t moment input to Training ASR phonetic feature；Update weight W formula be：W is indicated before updating The weight connected between hidden layer, W' indicate the weight connected between hidden layer after updating；Update biasing b formula be：B indicates the biasing before updated between input layer and hidden layer, and b' indicates to input after updating Biasing between layer and hidden layer.

9. a kind of computer equipment, including memory, processor and storage are in the memory and can be in the processor The computer program of upper operation, which is characterized in that the processor realized when executing the computer program as claim 1 to The step of any one of 6 speech differentiation method.

10. a kind of computer readable storage medium, the computer-readable recording medium storage has computer program, and feature exists In the step of realization speech differentiation method as described in any one of claim 1 to 6 when the computer program is executed by processor Suddenly.