CN108766419A - Abnormal speech detection method based on deep learning - Google Patents

Abnormal speech detection method based on deep learning

Info

Publication number
CN108766419A
CN108766419A (application CN201810417478.2A)
Authority
CN
China
Prior art keywords
voice
layer
frequency
mel
frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810417478.2A
Other languages
Chinese (zh)
Other versions
CN108766419B (en)
Inventor
奉小慧
陈光科
贺前华
巫小兰
李艳雄
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority to CN201810417478.2A priority Critical patent/CN108766419B/en
Publication of CN108766419A publication Critical patent/CN108766419A/en
Application granted granted Critical
Publication of CN108766419B publication Critical patent/CN108766419B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/14 Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
    • G10L15/142 Hidden Markov Models [HMMs]
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Abstract

The invention discloses an abnormal speech discrimination method based on deep learning, comprising the following steps: obtain the input speech and pre-process it by resampling, pre-emphasis, framing and windowing to obtain pre-processed speech; extract mel-frequency cepstral coefficient (MFCC) feature vectors from the pre-processed speech; normalize speech segments of different frame counts to a fixed number of frames, so that each segment yields one MFCC feature matrix; build a convolutional deep belief network; feed the MFCC feature matrices into the convolutional deep belief network, train it, and classify the state of the input speech; according to the classification result, call the corresponding hidden Markov model for template matching to obtain the speech recognition result. The invention uses the multiple nonlinear transformation layers of the convolutional deep belief network to map the input MFCC features to a higher-dimensional space, and uses hidden Markov models to model speech of different states separately, which improves the recognition accuracy of the speech.

Description

Abnormal speech detection method based on deep learning
Technical field
The present invention relates to the field of intelligent speech processing, and more particularly to an abnormal speech discrimination method based on deep learning.
Background technology
Speech is one of the important means of human-machine interaction. After decades of research, speech recognition technology has developed greatly and has entered our daily life. However, existing speech recognition research still suffers from the following problems:
In real life, an abnormal health condition of the speaker or other causes can shift the input speech from normal speech to abnormal speech and introduce additional noise interference. Abnormal speech generally refers to speech with complex background noise, speech produced by deliberately changed articulation or speaking habits, speech affected by lesions of the vocal organs, and so on.
Another problem is that traditional speech recognition systems usually rely on linear prediction cepstral coefficients and mel-frequency cepstral coefficients. The main information carried by these low-level acoustic features is the pronunciation content itself, and such features are highly susceptible to interference from speaker, channel and noise information, which degrades the recognition performance of the system.
Summary of the invention
The primary object of the present invention is to overcome the shortcomings and deficiencies of the prior art and to provide an abnormal speech discrimination method based on deep learning, which uses the nonlinear transformation capability of a deep neural network to map lower-dimensional MFCC and LPCC parameters to a higher-dimensional space, better characterizes the high-level abstract information of the speech signal, models normal speech and abnormal speech separately, and effectively distinguishes normal speech from abnormal speech.
The object of the present invention is achieved by the following technical solution:
An abnormal speech discrimination method based on deep learning comprises the following steps:
S1. Obtain the input speech and pre-process it by resampling, pre-emphasis, framing and windowing to obtain pre-processed speech;
S2. Extract a mel-frequency cepstral coefficient feature vector from each frame of the pre-processed speech using a mel-frequency filter bank and the Fourier transform;
S3. Normalize speech segments with different frame counts to a fixed number of frames, so that each speech segment yields one mel-frequency cepstral coefficient feature matrix;
S4. Build a convolutional deep belief network;
S5. Feed the mel-frequency cepstral coefficient feature matrices into the convolutional deep belief network, train it, and classify the state of the input speech;
S6. According to the classification result, call the corresponding hidden Markov model for template matching to obtain the speech recognition result.
In step S1, the resampling frequency is 22.05 kHz and the coding format is WAV.
Pre-emphasis uses a first-order FIR high-pass filter whose transfer function is:
H(z) = 1 - a·z^(-1),
where a is the high-pass filter coefficient, with value 0.93. The speech signal after pre-emphasis is:
y(n) = sp(n) - a·sp(n-1), n = 0, 1, ..., Length-1
where y(n) is the speech signal after pre-emphasis, sp(n) is the speech signal before pre-emphasis, sp(n-1) is its one-sample delay, and Length is the length of the speech signal.
Framing and windowing: the speech is sliced so that a fixed-length segment of the input speech intercepted at fixed intervals forms one frame; a frame length of 25 ms and a frame shift of 10 ms are used, and a Hamming window is applied to each frame.
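The pre-processing chain above (pre-emphasis with a = 0.93, 25 ms frames, 10 ms frame shift, Hamming window) can be sketched as follows. This is a minimal NumPy illustration assuming the signal has already been resampled to 22.05 kHz and stored as a 1-D array; it is not the patent's implementation.

import numpy as np

def preprocess(sp, fs=22050, a=0.93, frame_ms=25, shift_ms=10):
    """Pre-emphasis followed by framing with a Hamming window.
    sp: 1-D array of speech samples, already resampled to fs."""
    # First-order FIR high-pass pre-emphasis: y(n) = sp(n) - a*sp(n-1)
    y = np.append(sp[0], sp[1:] - a * sp[:-1])

    frame_len = int(fs * frame_ms / 1000)   # 25 ms frame length in samples
    shift = int(fs * shift_ms / 1000)       # 10 ms frame shift in samples
    window = np.hamming(frame_len)

    num_frames = max(0, (len(y) - frame_len) // shift + 1)
    frames = np.empty((num_frames, frame_len))
    for t in range(num_frames):
        frames[t] = y[t * shift:t * shift + frame_len] * window
    return frames                            # shape: (number of frames, frame_len)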
In step S2, the detailed process is as follows:
V1. Design a bank of L triangular mel-frequency filters. Let Wl denote the frequency response of the l-th mel-frequency filter, l = 1, 2, ..., L; Fs is the resampling frequency of the speech signal; Q is the frame length of a given frame of the speech signal and also the number of points of the Fourier transform; fl and fh are respectively the lower and upper cut-off frequencies of the speech signal. A frame of speech of length Q yields Q frequency components after a Q-point fast Fourier transform, and o(l), c(l) and h(l) are respectively the index values, among the Q frequency components, of the lower-limit, center and upper-limit frequencies of the l-th mel-frequency filter. o(l), c(l) and h(l) satisfy the following relationships:
c(l-1) = o(l),
o(l+1) = c(l),
h(l) = c(l+1),
that is, the center frequency of the current filter lies at the lower edge of the next filter.
At the same time, o(l)|l=1 = fl and h(l)|l=L = fh; therefore, the index of the center frequency of the l-th mel-frequency filter among the Q frequency components is expressed as:
where Mel(f1) is the function mapping actual frequency to mel frequency, Mel^(-1)(f2) is the inverse of Mel(f1), f1 is the actual frequency, and f2 is the mel frequency.
The frequency response of the l-th mel-frequency filter is:
V2. For a frame of speech x(n), n = 0, 1, ..., Q-1, Q < Length, after resampling, pre-emphasis, framing and windowing, perform a Q-point fast Fourier transform to obtain its spectrum X(k) and amplitude spectrum |X(k)|.
V3. Pass the frame of speech through the mel-frequency filter bank to obtain the output amplitude spectrum of each filter.
V4. Take the logarithm of the output amplitude spectra of all the filters and then apply the discrete cosine transform to obtain the mel-frequency cepstral coefficients.
Among the L coefficients, the 2nd to the (M+1)-th, M coefficients in total, are taken to form the M-dimensional mel-frequency cepstral coefficient feature vector of each frame, C = {cmfcc(2), cmfcc(3), ..., cmfcc(M+1)}.
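As an illustration of steps V1 to V4, the sketch below computes the per-frame MFCC vector with NumPy and SciPy. The standard mel mapping Mel(f) = 2595·log10(1 + f/700) is assumed here because the patent's formula images are not reproduced; the filter count L and the exact edge-frequency handling are likewise assumptions.

import numpy as np
from scipy.fftpack import dct

def mel(f):                 # actual frequency -> mel frequency (standard mapping, assumed)
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_inv(m):             # mel frequency -> actual frequency
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc_frame(frame, fs=22050, L=26, M=12, f_low=0.0, f_high=None):
    """MFCC feature vector of one windowed frame (coefficients 2..M+1)."""
    f_high = f_high if f_high is not None else fs / 2.0
    Q = len(frame)                                    # FFT size equals the frame length
    spectrum = np.abs(np.fft.rfft(frame, Q))          # amplitude spectrum |X(k)|

    # L triangular mel filters: L+2 edge frequencies equally spaced on the mel scale
    edges = mel_inv(np.linspace(mel(f_low), mel(f_high), L + 2))
    bins = np.floor((Q + 1) * edges / fs).astype(int) # index values o(l), c(l), h(l)

    fbank = np.zeros((L, len(spectrum)))
    for l in range(L):
        o, c, h = bins[l], bins[l + 1], bins[l + 2]
        fbank[l, o:c] = (np.arange(o, c) - o) / max(c - o, 1)   # rising edge
        fbank[l, c:h] = (h - np.arange(c, h)) / max(h - c, 1)   # falling edge

    log_energy = np.log(fbank @ spectrum + 1e-10)     # log of the filter output amplitudes
    cepstrum = dct(log_energy, type=2, norm='ortho')  # discrete cosine transform
    return cepstrum[1:M + 1]                          # keep the 2nd..(M+1)-th coefficients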
In step S2, the mel-frequency cepstral coefficient feature matrix has dimension N × M, where N is the fixed number of frames of the speech segment after time normalization, i.e. the target frame count of the normalization, and M is the order of the mel-frequency cepstral coefficients of each frame; the mel-frequency cepstral coefficients are 12-dimensional, i.e. M = 12.
In step S3, because different speech segments have different durations, the number of frames they contain may differ, and because the input of the neural network must have a fixed dimension, the speech signal must be time-normalized so that the input of the neural network has a fixed size. Time normalization is carried out with the characteristic point sequence method, as follows:
Y1. Build a time-normalization network of n-N+1 layers, where N is the target frame count after normalization; the first layer has n frames and the last layer has N frames.
The i-th layer of the time-normalization network has n-(i-1) frames, each frame corresponding to one feature vector, forming a group of n-(i-1) feature vectors:
where i = 1, 2, ..., n-N+1, k = 1, 2, ..., n-(i-1), and C_k^(i) denotes the mel-frequency cepstral coefficient feature vector of the k-th frame of speech in the i-th layer of the network.
In particular, the feature vector group of the first layer of the network is the feature vector group of the network input.
Let w_k^(i) denote the weight of the speech frame represented by the vector C_k^(i); when i = 1,
Y2. Let d_j^(i) denote the distance between C_j^(i) and C_(j+1)^(i). For the i-th layer, find j such that for any k ≠ j, d_j^(i) ≤ d_k^(i).
Y3. Merge the two closest frames, reducing the frame count by one, i.e.:
where C_j^(i+1) denotes the mel-frequency cepstral coefficient feature vector of the j-th frame of speech in the (i+1)-th layer of the network; C_(j+1)^(i) denotes that of the (j+1)-th frame in the i-th layer; C_k^(i+1) denotes that of the k-th frame in the (i+1)-th layer; C_(k+1)^(i) denotes that of the (k+1)-th frame in the i-th layer; w_j^(i) denotes the weight of the j-th frame of the i-th layer; w_(j+1)^(i) denotes the weight of the (j+1)-th frame of the i-th layer; w_j^(i+1) denotes the weight of the j-th frame of the (i+1)-th layer; and d_j^(i) denotes the distance between the mel-frequency cepstral coefficient feature vectors of the j-th and (j+1)-th frames of speech in the i-th layer of the network.
Y4. Solve for C_j^(i+1), i.e.:
Repeat from i = 1 to i = n-N+1; each execution reduces the frame count by one, finally normalizing the n-frame speech signal to the fixed N frames.
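A compact sketch of this frame-merging normalization is given below. It follows the description in Y1 to Y4 (repeatedly find the closest pair of adjacent frames and merge them until N frames remain); the weighted-average merge rule and the accumulation of weights are assumptions, since the patent's merge formulas are not fully recoverable from the translation.

import numpy as np

def time_normalize(frames, N):
    """Normalize an (n, M) MFCC sequence to exactly N frames (n >= N assumed)
    by repeatedly merging the closest pair of adjacent frames."""
    vecs = [np.asarray(f, dtype=float) for f in frames]
    weights = [1.0] * len(vecs)                  # w_k^(1) = 1 for every input frame

    while len(vecs) > N:
        # d_j = distance between the MFCC vectors of adjacent frames j and j+1
        dists = [np.linalg.norm(vecs[j] - vecs[j + 1]) for j in range(len(vecs) - 1)]
        j = int(np.argmin(dists))                # index of the closest adjacent pair
        wj, wj1 = weights[j], weights[j + 1]
        merged = (wj * vecs[j] + wj1 * vecs[j + 1]) / (wj + wj1)   # weighted average (assumed)
        vecs[j:j + 2] = [merged]                 # the pair is replaced by one frame
        weights[j:j + 2] = [wj + wj1]            # the merged frame carries both weights
    return np.stack(vecs)                        # shape: (N, M)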
In step S4, the convolutional deep belief network is formed by stacking multiple convolutional restricted Boltzmann machines from bottom to top, and the output layer uses a Softmax classifier. Each convolutional restricted Boltzmann machine consists of one input layer V and one convolutional layer H. A pooling layer is added after the convolutional layer H of each convolutional restricted Boltzmann machine to perform a pooling operation with pooling size E' × F' and pooling stride s3 × s4, s3 = s4 = 2; the pooling layer of one convolutional restricted Boltzmann machine therefore serves as the input layer of the next convolutional restricted Boltzmann machine.
In step S5, the bottom convolutional restricted Boltzmann machine is trained first, and then the top convolutional restricted Boltzmann machine is trained, as follows:
Z1. Let the number of input channels of a convolutional restricted Boltzmann machine be I, each channel corresponding to a two-dimensional matrix of size y × s, i.e.:
V = {v1, v2, ..., vI}, vi ∈ R^(y×s), i = 1, 2, ..., I,
where V is the input layer of the convolutional restricted Boltzmann machine and vi ∈ R^(y×s) is the i-th channel of the input layer.
In particular, for the bottom convolutional restricted Boltzmann machine, I = 1, y = N, s = M; that is, the number of input channels of the bottom convolutional restricted Boltzmann machine is 1, corresponding to the two-dimensional N × M mel-frequency cepstral coefficient feature matrix fed into the convolutional deep belief network, where N is the target frame count of the time normalization and M is the order of the mel-frequency cepstral coefficients of each frame.
Z2. The convolution uses O convolution kernels, each kernel being a three-dimensional weight matrix of size I × E × F, i.e. the kernels are W = {w1, w2, ..., wO}, where wj ∈ R^(I×E×F), j = 1, 2, ..., O, and wj is the j-th convolution kernel; the number of output channels of the convolutional restricted Boltzmann machine is therefore O, and each output channel corresponds to one local feature of the input, i.e.:
the convolutional layer is H = {h1, h2, ..., hO}, where hj ∈ R^(N'×M'), j = 1, 2, ..., O, denotes the j-th channel of the convolutional layer, corresponding to a two-dimensional N' × M' feature matrix after feature mapping.
Z3. Let the convolution stride be s1 × s2. All neurons in the same channel of the input layer share a bias ai, i = 1, 2, ..., I, and the neurons in the same group of the convolutional layer share a bias bj, j = 1, 2, ..., O. The parameters to be trained are θ = {W, a, b}.
Z4. The convolutional restricted Boltzmann machine is an energy-based model, and its energy function is defined as:
The joint probability distribution of the input layer V and the convolutional layer H is obtained from the energy function, i.e. the joint probability distribution of all neuron values of the input layer V and all neuron values of the convolutional layer H:
where Z denotes the partition function; i = 1, ..., I, p = 1, ..., y, q = 1, ..., s, j = 1, ..., O, m = 1, ..., N', k = 1, ..., M'.
The marginal probability distribution of the input layer V, i.e. the probability distribution of all neuron values of the input layer V, is:
If the training sample set TS contains T samples, the log-likelihood function on the input layer V is:
where L(TS, θ) is the likelihood function on the input layer V:
Z5. The log-likelihood function is maximized using a gradient ascent algorithm combined with the contrastive divergence algorithm, from which the parameters θ are obtained.
Z6. The output layer is a Softmax layer with 2 neurons, whose outputs respectively represent the probabilities that the sample is normal speech and abnormal speech; in actual recognition, the class with the larger probability is taken as the final classification result. The output layer is fully connected to the pooling layer of the top convolutional restricted Boltzmann machine, and the parameters to be trained are the biases c1, c2 of the two output neurons and the connection weights between the output layer and the top pooling layer:
where r is the number of neurons of the top hidden layer; for i = 1, WE1 is the connection weight vector between the 1st neuron of the Softmax output layer and the neurons of the top pooling layer, and WE1k is the connection weight between the 1st neuron of the Softmax output layer and the k-th neuron of the top pooling layer, k ≤ r; for i = 2, WE2 is the connection weight vector between the 2nd neuron of the Softmax output layer and the neurons of the top pooling layer, and WE2k is the connection weight between the 2nd neuron of the Softmax output layer and the k-th neuron of the top pooling layer.
Let the output of the top pooling layer be G = {g1, g2, ..., gr}, where gi is the output of the i-th neuron of the top pooling layer. The input of the output layer is then fi = WEi·G^T + ci, where f1 is the input of the first output neuron (i = 1) and f2 is the input of the second output neuron (i = 2).
The output value of the output layer is:
where y1 (i = 1) is the output of the 1st neuron of the Softmax output layer and represents the probability that the input speech is normal speech, and y2 (i = 2) is the output of the 2nd neuron of the Softmax output layer and represents the probability that the input speech is abnormal speech.
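For the two-class output layer of Z6, the forward computation reduces to the few lines below (a NumPy sketch, where G, WE and c stand for the trained top pooling-layer output, output weights and output biases).

import numpy as np

def softmax_output(G, WE, c):
    """Posterior probabilities [p(normal speech), p(abnormal speech)] for one sample.
    G: top pooling-layer output, shape (r,); WE: output weights, shape (2, r); c: biases, shape (2,)."""
    f = WE @ G + c                   # f_i = WE_i * G^T + c_i
    e = np.exp(f - f.max())          # subtract the maximum for numerical stability
    return e / e.sum()               # y_i = exp(f_i) / sum_j exp(f_j)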
Z7. The Softmax classifier is trained in mini-batch fashion: a batch of T training samples is fetched at a time for training and the parameters are updated once; the loss function used is:
where J is the sum of squared errors during training, t_i^(j) is the ideal (target) output of the i-th neuron of the classifier for the j-th fetched sample (if the j-th sample is normal speech then t_1^(j) = 1 and t_2^(j) = 0, and vice versa), and y_i^(j) is the actual output of the i-th neuron of the classifier for the j-th fetched sample.
F1 is the harmonic mean of the precision P and the recall R, i.e.:
where the precision P is the ratio of the number of samples correctly identified as abnormal speech to the number of all samples identified as abnormal speech, the recall R is the ratio of the number of samples correctly identified as abnormal speech to the number of all abnormal speech samples, TP is the number of samples among the T samples correctly identified as abnormal speech, FP is the number of samples among the T samples wrongly identified as abnormal speech, and FN is the number of samples among the T samples in which abnormal speech is wrongly identified as normal speech.
During training, the parameters are updated with a gradient descent algorithm until the loss function converges to the set value or the number of iterations reaches the maximum number of iterations, at which point training ends.
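The quantities used in Z7 can be illustrated with the following sketch, which computes the batch sum-of-squared-errors loss and the precision, recall and F1 for a mini-batch of T samples, taking abnormal speech as the positive class. The one-hot column ordering (column 0 = normal, column 1 = abnormal) is an assumption made for illustration.

import numpy as np

def batch_metrics(y_pred, y_true):
    """y_pred: (T, 2) softmax outputs; y_true: (T, 2) one-hot targets
    (column 0 = normal speech, column 1 = abnormal speech, assumed)."""
    loss = np.sum((y_true - y_pred) ** 2)        # sum of squared errors over the batch

    pred_abnormal = y_pred.argmax(axis=1) == 1
    true_abnormal = y_true.argmax(axis=1) == 1
    TP = np.sum(pred_abnormal & true_abnormal)   # abnormal speech correctly detected
    FP = np.sum(pred_abnormal & ~true_abnormal)  # normal speech flagged as abnormal
    FN = np.sum(~pred_abnormal & true_abnormal)  # abnormal speech missed

    P = TP / (TP + FP) if TP + FP else 0.0       # precision
    R = TP / (TP + FN) if TP + FN else 0.0       # recall
    F1 = 2 * P * R / (P + R) if P + R else 0.0   # harmonic mean of P and R
    return loss, P, R, F1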
In step S6, the detailed process is: in the on-line recognition part, the deep neural network loads the network structure and the weight coefficients obtained from training, and the hidden Markov models load the models learned in advance; the speech input in real time is pre-processed and its features are extracted. HMM templates are built for normal speech and abnormal speech separately, i.e. an HMM template is trained with normal speech to obtain the HMM template for normal speech, and an HMM template is trained with abnormal speech to obtain the HMM template for abnormal speech. During on-line recognition, the corresponding HMM template is called according to the decision output by the convolutional deep belief network, and the mel-frequency cepstral coefficients of the input speech signal are used for template matching to obtain the final speech recognition result.
Compared with the prior art, the present invention has the following advantages and beneficial effects:
The present invention combines a convolutional deep belief network with hidden Markov models. The multiple nonlinear transformation layers of the convolutional deep belief network map the input MFCC features to a higher-dimensional space, characterize the speech information more fully, and effectively distinguish abnormal speech from normal speech; the time-series modelling capability of hidden Markov models is then used to model speech of different states separately, which greatly improves the recognition accuracy of the speech.
Description of the drawings
Fig. 1 is the off-line training flow chart of the abnormal speech discrimination method based on deep learning of the present invention;
Fig. 2 is the on-line recognition flow chart of the abnormal speech discrimination method based on deep learning of the present invention;
Fig. 3 is a schematic diagram of the speech state recognition by the convolutional deep belief network of the abnormal speech discrimination method based on deep learning of the present invention.
Detailed description of the embodiments
The present invention is described in further detail below with reference to the embodiments and the accompanying drawings, but the embodiments of the present invention are not limited thereto.
Embodiment 1
An abnormal speech discrimination method based on deep learning comprises the following steps:
First step: obtain the input speech and pre-process it by resampling, pre-emphasis, framing and windowing, etc., to obtain pre-processed speech.
Resampling: the input speech may have different sampling frequencies and coding formats; to facilitate data processing and analysis, the original input speech signal is resampled to unify the sampling frequency and the coding format. The sampling frequency is 22.05 kHz and the coding format is WAV.
Pre-emphasis: the power spectrum of the speech signal decreases as the frequency increases and most of the energy is concentrated in the low-frequency range; to boost the high-frequency part of the original speech signal, pre-emphasis is applied to the original input speech signal using a first-order FIR high-pass filter whose transfer function is:
H(z) = 1 - a·z^(-1),
where a is the high-pass filter coefficient, with value 0.93. The speech signal after pre-emphasis is:
y(n) = sp(n) - a·sp(n-1), n = 0, 1, ..., Length-1
where y(n) is the speech signal after pre-emphasis, sp(n) is the speech signal before pre-emphasis, sp(n-1) is its one-sample delay, and Length is the length of the speech signal.
Framing and windowing: the speech is sliced so that a fixed-length segment of the input speech intercepted at fixed intervals forms one frame; a frame length of 25 ms and a frame shift of 10 ms are used, and a Hamming window is applied to each frame.
Second step: extract mel-frequency cepstral coefficient features from each frame of the pre-processed speech.
The detailed process is as follows:
V1. Design a bank of L triangular mel-frequency filters. Let Wl denote the frequency response of the l-th mel-frequency filter, l = 1, 2, ..., L; Fs is the resampling frequency of the speech signal; Q is the frame length of a given frame of the speech signal and also the number of points of the Fourier transform; fl and fh are respectively the lower and upper cut-off frequencies of the speech signal. A frame of speech of length Q yields Q frequency components after a Q-point fast Fourier transform, and o(l), c(l) and h(l) are respectively the index values, among the Q frequency components, of the lower-limit, center and upper-limit frequencies of the l-th mel-frequency filter. o(l), c(l) and h(l) satisfy the following relationships:
c(l-1) = o(l),
o(l+1) = c(l),
h(l) = c(l+1),
that is, the center frequency of the current filter lies at the lower edge of the next filter.
At the same time, o(l)|l=1 = fl and h(l)|l=L = fh; therefore, the index of the center frequency of the l-th mel-frequency filter among the Q frequency components is expressed as:
where Mel(f1) is the function mapping actual frequency to mel frequency, Mel^(-1)(f2) is the inverse of Mel(f1), f1 is the actual frequency, and f2 is the mel frequency.
The frequency response of the l-th mel-frequency filter is:
V2. For a frame of speech x(n), n = 0, 1, ..., Q-1, after resampling, pre-emphasis, framing and windowing, perform a Q-point fast Fourier transform to obtain its spectrum X(k) and amplitude spectrum |X(k)|.
V3. Pass the frame of speech through the mel-frequency filter bank to obtain the output amplitude spectrum of each filter.
V4. Take the logarithm of the output amplitude spectra of all the filters and then apply the discrete cosine transform to obtain the mel-frequency cepstral coefficients.
Among the L coefficients, the 2nd to the (M+1)-th, M coefficients in total, are taken to form the M-dimensional mel-frequency cepstral coefficient feature vector of each frame, C = {cmfcc(2), cmfcc(3), ..., cmfcc(M+1)}.
Third step: normalize speech segments with different frame counts to a fixed number of frames; each speech segment yields one mel-frequency cepstral coefficient feature matrix.
Because different speech segments have different durations, the number of frames they contain may differ, and because the input of the neural network must have a fixed dimension, the speech signal must be time-normalized so that the input of the neural network has a fixed size. Time normalization is carried out with the characteristic point sequence method, as follows:
Y1. Build a time-normalization network of n-N+1 layers, where N is the target frame count after normalization; the first layer has n frames and the last layer has N frames.
The i-th layer of the time-normalization network has n-(i-1) frames, each frame corresponding to one feature vector, forming a group of n-(i-1) feature vectors:
where i = 1, 2, ..., n-N+1, k = 1, 2, ..., n-(i-1), and C is the mel-frequency cepstral coefficient feature vector of each frame of speech.
In particular, the feature vector group of the first layer of the network is the feature vector group of the network input.
Let w_k^(i) denote the weight of the speech frame represented by the vector C_k^(i); when i = 1,
Y2. Let d_j^(i) denote the distance between C_j^(i) and C_(j+1)^(i). For the i-th layer, find j such that for any k ≠ j, d_j^(i) ≤ d_k^(i).
Y3. Merge the two closest frames, reducing the frame count by one, i.e.:
where C_j^(i+1) denotes the mel-frequency cepstral coefficient feature vector of the j-th frame of speech in the (i+1)-th layer of the network; C_(j+1)^(i) denotes that of the (j+1)-th frame in the i-th layer; C_k^(i+1) denotes that of the k-th frame in the (i+1)-th layer; C_(k+1)^(i) denotes that of the (k+1)-th frame in the i-th layer; w_j^(i) denotes the weight of the j-th frame of the i-th layer; w_(j+1)^(i) denotes the weight of the (j+1)-th frame of the i-th layer; w_j^(i+1) denotes the weight of the j-th frame of the (i+1)-th layer; and d_j^(i) denotes the distance between the mel-frequency cepstral coefficient feature vectors of the j-th and (j+1)-th frames of speech in the i-th layer of the network.
Y4. Solve for C_j^(i+1), i.e.:
Repeat from i = 1 to i = n-N+1; each execution reduces the frame count by one, finally normalizing the n-frame speech signal to the fixed N frames.
After time normalization, each speech segment corresponds to one mel-frequency cepstral coefficient feature matrix of dimension N × M, where N is the target frame count of the time normalization of the speech segment and M is the order of the mel-frequency cepstral coefficients of each frame; the mel-frequency cepstral coefficients are 12-dimensional, i.e. M = 12.
Fourth step: build the convolutional deep belief network. The deep belief network is formed by stacking two convolutional restricted Boltzmann machines and has two convolutional layers, two pooling layers, one visible layer and one output layer. For the first convolutional restricted Boltzmann machine, the number of neural units of the visible layer is N × M = 200 × 12, the number of convolution kernels is 10, the kernel size is 2 × 2, the convolution stride is 2 × 2, the kernels are initialized with random values drawn from a Gaussian distribution with mean 0 and variance 0.01, the initial bias of the visible layer is 0 and the initial bias of the convolutional layer is -0.1; 100 samples are fetched per iteration and the number of iteration rounds is 100. The pooling size of the first pooling layer is 2 × 2 and the pooling stride is 2 × 2. For the second convolutional restricted Boltzmann machine, the number of convolution kernels is 10, the kernel size is 10 × 2 × 2, the convolution stride is 2 × 2, the kernels are initialized with random values drawn from a Gaussian distribution with mean 0 and variance 0.01, and the initial bias of the second convolutional layer is -0.1; 100 samples are fetched per iteration and the number of iteration rounds is 100. The pooling size of the second pooling layer is 2 × 2 and the pooling stride is 2 × 2. All convolutional layers use the sigmoid activation function. The output layer has 2 neurons and outputs the posterior probabilities of normal speech and abnormal speech. The convergence value of the output-layer loss function is set to 0.004 and the maximum number of iterations to 1000.
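The feature-map arithmetic implied by this configuration can be checked with a few lines of Python. The valid (no padding) floor convention assumed below is not stated in the patent; with that convention the frequency axis already shrinks to width 1 after the second convolution, so the exact size of the top pooling layer depends on the border-handling convention actually used.

def stage_size(h, w, kh, kw, sh, sw):
    """Output height/width of a valid convolution or pooling stage (floor convention assumed)."""
    return (h - kh) // sh + 1, (w - kw) // sw + 1

h, w = 200, 12                         # N x M MFCC feature matrix, one input channel
h, w = stage_size(h, w, 2, 2, 2, 2)    # CRBM 1 convolution, 10 kernels 2x2, stride 2 -> 100 x 6
h, w = stage_size(h, w, 2, 2, 2, 2)    # CRBM 1 pooling, 2x2, stride 2                -> 50 x 3
h, w = stage_size(h, w, 2, 2, 2, 2)    # CRBM 2 convolution, 10 kernels 2x2, stride 2 -> 25 x 1
print(h, w)                            # the second pooling then acts only along the time axis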
Fifth step: feed the mel-frequency cepstral coefficient feature matrices into the convolutional deep belief network, train it, and classify the state of the input speech; Fig. 1 is the training flow chart of the off-line stage.
The bottom convolutional restricted Boltzmann machine is trained first, and then the top convolutional restricted Boltzmann machine is trained, as follows:
Z1. Let the number of input channels of a convolutional restricted Boltzmann machine be I, each channel corresponding to a two-dimensional matrix of size y × s, i.e.:
V = {v1, v2, ..., vI}, vi ∈ R^(y×s), i = 1, 2, ..., I,
where V is the input layer of the convolutional restricted Boltzmann machine and vi ∈ R^(y×s) is the i-th channel of the input layer.
In particular, for the bottom convolutional restricted Boltzmann machine, I = 1, y = N, s = M; that is, the number of input channels of the bottom convolutional restricted Boltzmann machine is 1, corresponding to the two-dimensional N × M mel-frequency cepstral coefficient feature matrix fed into the convolutional deep belief network, where N is the target frame count of the time normalization and M is the order of the mel-frequency cepstral coefficients of each frame.
Z2. The convolution uses O convolution kernels, each kernel being a three-dimensional weight matrix of size I × E × F, i.e. the kernels are W = {w1, w2, ..., wO}, where wj ∈ R^(I×E×F), j = 1, 2, ..., O, and wj is the j-th convolution kernel; the number of output channels of the convolutional restricted Boltzmann machine is therefore O, and each output channel corresponds to one local feature of the input, i.e.:
the convolutional layer is H = {h1, h2, ..., hO}, where hj ∈ R^(N'×M'), j = 1, 2, ..., O, denotes the j-th channel of the convolutional layer, corresponding to a two-dimensional N' × M' feature matrix after feature mapping.
Z3. Let the convolution stride be s1 × s2. All neurons in the same channel of the input layer share a bias ai, i = 1, 2, ..., I, and the neurons in the same group of the convolutional layer share a bias bj, j = 1, 2, ..., O. The parameters to be trained are θ = {W, a, b}.
Z4. The convolutional restricted Boltzmann machine is an energy-based model, and its energy function is defined as:
The joint probability distribution of the input layer V and the convolutional layer H is obtained from the energy function, i.e. the joint probability distribution of all neuron values of the input layer V and all neuron values of the convolutional layer H:
where Z denotes the partition function; i = 1, ..., I, p = 1, ..., y, q = 1, ..., s, j = 1, ..., O, m = 1, ..., N', k = 1, ..., M'.
The marginal probability distribution of the input layer V, i.e. the probability distribution of all neuron values of the input layer V, is:
If the training sample set TS contains T samples, the log-likelihood function on the input layer V is:
where L(TS, θ) is the likelihood function on the input layer V:
Z5. The log-likelihood function is maximized using a gradient ascent algorithm combined with the contrastive divergence algorithm, from which the parameters θ are obtained.
Z6. The output layer is a Softmax layer with 2 neurons, whose outputs respectively represent the probabilities that the sample is normal speech and abnormal speech; in actual recognition, the class with the larger probability is taken as the final classification result. The output layer is fully connected to the pooling layer of the top convolutional restricted Boltzmann machine, and the parameters to be trained are the biases c1, c2 of the two output neurons and the connection weights between the output layer and the top pooling layer:
where r is the number of neurons of the top hidden layer; for i = 1, WE1 is the connection weight vector between the 1st neuron of the Softmax output layer and the neurons of the top pooling layer, and WE1k is the connection weight between the 1st neuron of the Softmax output layer and the k-th neuron of the top pooling layer, k ≤ r; for i = 2, WE2 is the connection weight vector between the 2nd neuron of the Softmax output layer and the neurons of the top pooling layer, and WE2k is the connection weight between the 2nd neuron of the Softmax output layer and the k-th neuron of the top pooling layer.
Let the output of the top pooling layer be G = {g1, g2, ..., gr}, where gi is the output of the i-th neuron of the top pooling layer. The input of the output layer is then fi = WEi·G^T + ci, where f1 is the input of the first output neuron (i = 1) and f2 is the input of the second output neuron (i = 2).
The output value of the output layer is:
where y1 (i = 1) is the output of the 1st neuron of the Softmax output layer and represents the probability that the input speech is normal speech, and y2 (i = 2) is the output of the 2nd neuron of the Softmax output layer and represents the probability that the input speech is abnormal speech.
Z7. The Softmax classifier is trained in mini-batch fashion: a batch of T training samples is fetched at a time for training and the parameters are updated once; the loss function used is:
where J is the sum of squared errors during training, t_i^(j) is the ideal (target) output of the i-th neuron of the classifier for the j-th fetched sample (if the j-th sample is normal speech then t_1^(j) = 1 and t_2^(j) = 0, and vice versa), and y_i^(j) is the actual output of the i-th neuron of the classifier for the j-th fetched sample.
F1 is the harmonic mean of the precision P and the recall R, i.e.:
where the precision P is the ratio of the number of samples correctly identified as abnormal speech to the number of all samples identified as abnormal speech, the recall R is the ratio of the number of samples correctly identified as abnormal speech to the number of all abnormal speech samples, TP is the number of samples among the T samples correctly identified as abnormal speech, FP is the number of samples among the T samples wrongly identified as abnormal speech, and FN is the number of samples among the T samples in which abnormal speech is wrongly identified as normal speech.
During training, the parameters are updated with a gradient descent algorithm until the loss function converges to the set value or the number of iterations reaches the maximum number of iterations, at which point training ends.
Sixth step: according to the classification result, call the corresponding hidden Markov model for template matching to obtain the speech recognition result.
In the on-line recognition part, as shown in Fig. 2, the deep neural network loads the network structure and the weight coefficients obtained from training, and the hidden Markov models load the models learned in advance; the speech input in real time is pre-processed and its features are extracted. HMM templates are built for normal speech and abnormal speech separately, i.e. an HMM template is trained with normal speech to obtain the HMM template for normal speech, and an HMM template is trained with abnormal speech to obtain the HMM template for abnormal speech. During on-line recognition, the corresponding HMM template is called according to the decision output by the convolutional deep belief network, and the mel-frequency cepstral coefficients of the input speech signal are used for template matching to obtain the final speech recognition result; the principle flow chart is shown in Fig. 3.
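A minimal sketch of this on-line decision flow is given below; the predict_proba and score interfaces stand in for the trained convolutional deep belief network and the HMM templates (an hmmlearn model, for example, exposes a score method returning a log-likelihood) and are assumptions, not the patent's API.

def recognize(mfcc_matrix, cdbn, normal_hmms, abnormal_hmms):
    """On-line recognition: the CDBN decides normal vs. abnormal speech,
    then the matching HMM template set is used for template matching.

    cdbn          : object with predict_proba(mfcc_matrix) -> [p_normal, p_abnormal]
    normal_hmms,
    abnormal_hmms : dicts mapping a template label to an HMM exposing
                    score(features) -> log-likelihood
    """
    p_normal, p_abnormal = cdbn.predict_proba(mfcc_matrix)
    templates = normal_hmms if p_normal >= p_abnormal else abnormal_hmms

    # Template matching: the HMM with the highest log-likelihood wins.
    scores = {label: hmm.score(mfcc_matrix) for label, hmm in templates.items()}
    return max(scores, key=scores.get)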
The above embodiment is a preferred embodiment of the present invention, but the embodiments of the present invention are not limited by the above embodiment. Any other change, modification, substitution, combination or simplification made without departing from the spirit and principle of the present invention shall be an equivalent replacement and is included within the protection scope of the present invention.

Claims (10)

1. An abnormal speech discrimination method based on deep learning, characterized by comprising the following steps:
S1. obtaining input speech and pre-processing the input speech to obtain pre-processed speech, the pre-processing comprising resampling, pre-emphasis, framing and windowing;
S2. extracting a mel-frequency cepstral coefficient feature vector from each frame of the pre-processed speech using a mel-frequency filter bank and the Fourier transform;
S3. normalizing speech segments with different frame counts to a fixed number of frames, each speech segment yielding one mel-frequency cepstral coefficient feature matrix;
S4. building a convolutional deep belief network;
S5. feeding the mel-frequency cepstral coefficient feature matrices into the convolutional deep belief network, training it, and classifying the state of the input speech;
S6. according to the classification result, calling a hidden Markov model for template matching to obtain a speech recognition result.
2. The abnormal speech discrimination method based on deep learning according to claim 1, characterized in that, in step S1,
the resampling frequency is 22.05 kHz and the coding format is WAV;
the pre-emphasis uses a first-order FIR high-pass filter whose transfer function is:
H(z) = 1 - a·z^(-1),
where a is the high-pass filter coefficient, with value 0.93; the speech signal after pre-emphasis is:
y(n) = sp(n) - a·sp(n-1), n = 0, 1, ..., Length-1;
where y(n) is the speech signal after pre-emphasis, sp(n) is the speech signal before pre-emphasis, sp(n-1) is its one-sample delay, and Length is the length of the speech signal;
the framing and windowing is specifically: slicing the speech so that a fixed-length segment of the input speech intercepted at fixed intervals forms one frame, using a frame length of 25 ms and a frame shift of 10 ms, and applying a Hamming window to each frame.
3. The abnormal speech discrimination method based on deep learning according to claim 1, characterized in that step S2 is specifically:
V1. designing a bank of L triangular mel-frequency filters, where Wl denotes the frequency response of the l-th mel-frequency filter, l = 1, 2, ..., L, Fs is the resampling frequency of the speech signal, Q is the frame length of a given frame of the speech signal and also the number of points of the Fourier transform, and fl and fh are respectively the lower and upper cut-off frequencies of the speech signal; a frame of speech of length Q yields Q frequency components after a Q-point fast Fourier transform, and o(l), c(l) and h(l) are respectively the index values, among the Q frequency components, of the lower-limit, center and upper-limit frequencies of the l-th mel-frequency filter; o(l), c(l) and h(l) satisfy the following relationships:
c(l-1) = o(l),
o(l+1) = c(l),
h(l) = c(l+1),
that is, the center frequency of the current filter lies at the lower edge of the next filter;
at the same time, o(l)|l=1 = fl and h(l)|l=L = fh, so the index of the center frequency of the l-th mel-frequency filter among the Q frequency components is expressed as:
where Mel(f1) is the function mapping actual frequency to mel frequency, Mel^(-1)(f2) is the inverse of Mel(f1), f1 is the actual frequency, and f2 is the mel frequency;
the frequency response of the l-th mel-frequency filter is:
where k is the index of the frequency component among the Q frequency components;
V2. for a frame of speech x(n), n = 0, 1, ..., Q-1, Q < Length, after resampling, pre-emphasis, framing and windowing, performing a Q-point fast Fourier transform to obtain its spectrum X(k) and amplitude spectrum |X(k)|:
V3. passing the frame of speech through the mel-frequency filter bank to obtain the output amplitude spectrum of each filter:
V4. taking the logarithm of the output amplitude spectra of all the filters and then applying the discrete cosine transform to obtain the mel-frequency cepstral coefficients:
among the L coefficients, the 2nd to the (M+1)-th, M coefficients in total, are taken to form the M-dimensional mel-frequency cepstral coefficient feature vector of each frame, C = {cmfcc(2), cmfcc(3), ..., cmfcc(M+1)}.
4. The abnormal speech discrimination method based on deep learning according to claim 1, characterized in that, in step S2, the mel-frequency cepstral coefficient feature matrix has dimension N × M, where N is the target frame count of the time normalization of the speech segment and M is the order of the mel-frequency cepstral coefficients of each frame; the mel-frequency cepstral coefficients are 12-dimensional, i.e. M = 12.
5. The abnormal speech discrimination method based on deep learning according to claim 1, characterized in that, in step S3, the normalization is specifically:
Y1. building a time-normalization network of n-N+1 layers, where N is the target frame count of the normalization, the first layer has n frames and the last layer has N frames;
the i-th layer of the time-normalization network has n-(i-1) frames, each frame corresponding to one feature vector, forming a group of n-(i-1) feature vectors:
where C_k^(i) is the mel-frequency cepstral coefficient feature vector of the k-th frame of speech in the i-th layer of the network;
in particular, the feature vector group of the first layer of the network is the feature vector group of the network input;
let w_k^(i) denote the weight of the speech frame represented by the vector C_k^(i); when i = 1,
Y2. letting d_j^(i) denote the distance between C_j^(i) and C_(j+1)^(i), for the i-th layer, finding j such that for any k ≠ j, d_j^(i) ≤ d_k^(i);
Y3. merging the two closest frames, reducing the frame count by one, i.e.:
where C_j^(i+1) denotes the mel-frequency cepstral coefficient feature vector of the j-th frame of speech in the (i+1)-th layer of the network; C_(j+1)^(i) denotes that of the (j+1)-th frame in the i-th layer; C_k^(i+1) denotes that of the k-th frame in the (i+1)-th layer; C_(k+1)^(i) denotes that of the (k+1)-th frame in the i-th layer; w_j^(i) denotes the weight of the j-th frame of the i-th layer; w_(j+1)^(i) denotes the weight of the (j+1)-th frame of the i-th layer; w_j^(i+1) denotes the weight of the j-th frame of the (i+1)-th layer; and d_j^(i) denotes the distance between the mel-frequency cepstral coefficient feature vectors of the j-th and (j+1)-th frames of speech in the i-th layer of the network;
Y4. solving for C_j^(i+1), i.e.:
repeating from i = 1 to i = n-N+1, each execution reducing the frame count by one, finally normalizing the n-frame speech signal to the fixed N frames.
6. The abnormal speech discrimination method based on deep learning according to claim 1, characterized in that, in step S4, the convolutional deep belief network is formed by stacking more than one convolutional restricted Boltzmann machine from bottom to top, and the output layer uses a Softmax classifier; each convolutional restricted Boltzmann machine consists of one input layer V and one convolutional layer H.
7. The abnormal speech discrimination method based on deep learning according to claim 6, characterized in that a pooling layer is added after the convolutional layer H of each convolutional restricted Boltzmann machine to perform a pooling operation with pooling size E' × F' and pooling stride s3 × s4, s3 = s4 = 2, so that the pooling layer of one convolutional restricted Boltzmann machine serves as the input layer of the next convolutional restricted Boltzmann machine.
8. The abnormal speech discrimination method based on deep learning according to claim 6, characterized in that, in step S5, the bottom convolutional restricted Boltzmann machine is trained first and then the top convolutional restricted Boltzmann machine is trained, specifically:
Z1. letting the number of input channels of a convolutional restricted Boltzmann machine be I, each channel corresponding to a two-dimensional matrix of size y × s, i.e.:
V = {v1, v2, ..., vI}, vi ∈ R^(y×s), i = 1, 2, ..., I,
where V is the input layer of the convolutional restricted Boltzmann machine and vi ∈ R^(y×s) is the i-th channel of the input layer;
for the bottom convolutional restricted Boltzmann machine, I = 1, y = N, s = M, i.e. the number of input channels of the bottom convolutional restricted Boltzmann machine is 1, corresponding to the two-dimensional N × M mel-frequency cepstral coefficient feature matrix fed into the convolutional deep belief network, where N is the target frame count of the time normalization and M is the order of the mel-frequency cepstral coefficients of each frame;
Z2. the convolution uses O convolution kernels, each kernel being a three-dimensional weight matrix of size I × E × F, i.e. the kernels are W = {w1, w2, ..., wO}, where wj ∈ R^(I×E×F), j = 1, 2, ..., O, and wj is the j-th convolution kernel; the number of output channels of the convolutional restricted Boltzmann machine is therefore O, and each output channel corresponds to one local feature of the input, i.e.:
the convolutional layer is H = {h1, h2, ..., hO}, where hj ∈ R^(N'×M'), j = 1, 2, ..., O, denotes the j-th channel of the convolutional layer, corresponding to a two-dimensional N' × M' feature matrix after feature mapping;
Z3. letting the convolution stride be s1 × s2, all neurons in the same channel of the input layer sharing a bias ai, i = 1, 2, ..., I, and the neurons in the same group of the convolutional layer sharing a bias bj, j = 1, 2, ..., O; the parameters to be trained are θ = {W, a, b};
Z4. the convolutional restricted Boltzmann machine is an energy-based model whose energy function is defined as:
the joint probability distribution of the input layer V and the convolutional layer H is obtained from the energy function, i.e. the joint probability distribution of all neuron values of the input layer V and all neuron values of the convolutional layer H:
where Z denotes the partition function; i = 1, ..., I, p = 1, ..., y, q = 1, ..., s, j = 1, ..., O, m = 1, ..., N', k = 1, ..., M';
the marginal probability distribution of the input layer V, i.e. the probability distribution of all neuron values of the input layer V, is:
if the training sample set TS contains T samples, the log-likelihood function on the input layer V is:
where L(TS, θ) is the likelihood function on the input layer V:
Z5. the log-likelihood function is maximized using a gradient ascent algorithm combined with the contrastive divergence algorithm, from which the parameters θ are obtained;
Z6. the output layer is a Softmax layer with 2 neurons, whose outputs respectively represent the probabilities that the sample is normal speech and abnormal speech, and in actual recognition the class with the larger probability is taken as the final classification result; the output layer is fully connected to the pooling layer of the top convolutional restricted Boltzmann machine, and the parameters to be trained are the biases c1, c2 of the two output neurons and the connection weights between the output layer and the top pooling layer:
where r is the number of neurons of the top hidden layer; for i = 1, WE1 is the connection weight vector between the 1st neuron of the Softmax output layer and the neurons of the top pooling layer, and WE1k is the connection weight between the 1st neuron of the Softmax output layer and the k-th neuron of the top pooling layer, k ≤ r; for i = 2, WE2 is the connection weight vector between the 2nd neuron of the Softmax output layer and the neurons of the top pooling layer, and WE2k is the connection weight between the 2nd neuron of the Softmax output layer and the k-th neuron of the top pooling layer;
letting the output of the top pooling layer be G = {g1, g2, ..., gr}, where gi is the output of the i-th neuron of the top pooling layer, the input of the output layer is fi = WEi·G^T + ci, where f1 is the input of the first output neuron (i = 1) and f2 is the input of the second output neuron (i = 2);
the output value of the output layer is yi, where y1 (i = 1) is the output of the 1st neuron of the Softmax output layer and represents the probability that the input speech is normal speech, and y2 (i = 2) is the output of the 2nd neuron of the Softmax output layer and represents the probability that the input speech is abnormal speech;
Z7. when training the Softmax classifier, training is performed in mini-batch fashion: a batch of T training samples is fetched at a time for training and the parameters are updated once; the loss function used is:
where J is the sum of squared errors during training, t_i^(j) is the ideal (target) output of the i-th neuron of the classifier for the j-th fetched sample (if the j-th sample is normal speech then t_1^(j) = 1 and t_2^(j) = 0, and vice versa), and y_i^(j) is the actual output of the i-th neuron of the classifier for the j-th fetched sample;
F1 is the harmonic mean of the precision P and the recall R, i.e.:
where the precision P is the ratio of the number of samples correctly identified as abnormal speech to the number of all samples identified as abnormal speech, the recall R is the ratio of the number of samples correctly identified as abnormal speech to the number of all abnormal speech samples, TP is the number of samples among the T samples correctly identified as abnormal speech, FP is the number of samples among the T samples wrongly identified as abnormal speech, and FN is the number of samples among the T samples in which abnormal speech is wrongly identified as normal speech;
during training, the parameters are updated with a gradient descent algorithm until the loss function converges to the set value or the number of iterations reaches the maximum number of iterations, at which point training ends.
9. The abnormal speech discrimination method based on deep learning according to claim 1, characterized in that the detailed process of step S6 is: in the on-line recognition part, the deep neural network loads the network structure and the weight coefficients obtained from training, and the hidden Markov models load the models learned in advance; the speech input in real time is pre-processed and its features are extracted; HMM templates are built for normal speech and abnormal speech separately; during on-line recognition, the corresponding HMM template is called according to the decision output by the convolutional deep belief network, and the mel-frequency cepstral coefficients of the input speech signal are used for template matching to obtain the final speech recognition result.
10. The abnormal speech discrimination method based on deep learning according to claim 9, characterized in that the HMM template for normal speech is obtained by training an HMM template with normal speech, and the HMM template for abnormal speech is obtained by training an HMM template with abnormal speech.
CN201810417478.2A 2018-05-04 2018-05-04 Abnormal voice distinguishing method based on deep learning Active CN108766419B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810417478.2A CN108766419B (en) 2018-05-04 2018-05-04 Abnormal voice distinguishing method based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810417478.2A CN108766419B (en) 2018-05-04 2018-05-04 Abnormal voice distinguishing method based on deep learning

Publications (2)

Publication Number Publication Date
CN108766419A true CN108766419A (en) 2018-11-06
CN108766419B CN108766419B (en) 2020-10-27

Family

ID=64009048

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810417478.2A Active CN108766419B (en) 2018-05-04 2018-05-04 Abnormal voice distinguishing method based on deep learning

Country Status (1)

Country Link
CN (1) CN108766419B (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102129860A (en) * 2011-04-07 2011-07-20 魏昕 Text-related speaker recognition method based on infinite-state hidden Markov model
CN104157290A (en) * 2014-08-19 2014-11-19 大连理工大学 Speaker recognition method based on deep learning
CN105206270A (en) * 2015-08-20 2015-12-30 长安大学 Isolated digit speech recognition classification system and method combining principal component analysis (PCA) with restricted Boltzmann machine (RBM)
US20170200092A1 (en) * 2016-01-11 2017-07-13 International Business Machines Corporation Creating deep learning models using feature augmentation
CN106782602A (en) * 2016-12-01 2017-05-31 南京邮电大学 Speech emotion recognition method based on long short-term memory network and convolutional neural networks
CN106941005A (en) * 2017-02-24 2017-07-11 华南理工大学 Vocal cord abnormality detection method based on speech acoustic features
CN107464568A (en) * 2017-09-25 2017-12-12 四川长虹电器股份有限公司 Text-independent speaker recognition method and system based on three-dimensional convolutional neural network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
IDO ARIAV et al.: "A deep architecture for audio-visual voice activity detection in the presence of transients", Signal Processing *
LI Wei et al.: "Speaker retrieval based on deep speaker vectors", Journal of Huazhong University of Science and Technology (Natural Science Edition) *

Cited By (30)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112955954B (en) * 2018-12-21 2024-04-12 华为技术有限公司 Audio processing device and method for audio scene classification
CN112955954A (en) * 2018-12-21 2021-06-11 华为技术有限公司 Audio processing device and method for audio scene classification
CN111354373A (en) * 2018-12-21 2020-06-30 中国科学院声学研究所 Audio signal classification method based on neural network intermediate layer characteristic filtering
US11776532B2 (en) 2018-12-21 2023-10-03 Huawei Technologies Co., Ltd. Audio processing apparatus and method for audio scene classification
CN110058689A (en) * 2019-04-08 2019-07-26 深圳大学 Smart device input method based on facial vibration
CN110322887B (en) * 2019-04-28 2021-10-15 武汉大晟极科技有限公司 Multi-type audio signal energy feature extraction method
CN110322887A (en) * 2019-04-28 2019-10-11 武汉大晟极科技有限公司 Multi-type audio signal energy feature extraction method
WO2020239910A3 (en) * 2019-05-28 2021-02-18 Corti Aps An intelligent computer aided decision support system
EP3745412A1 (en) * 2019-05-28 2020-12-02 Corti ApS An intelligent computer aided decision support system
WO2021000498A1 (en) * 2019-07-04 2021-01-07 平安科技(深圳)有限公司 Composite speech recognition method, device, equipment, and computer-readable storage medium
CN110390929A (en) * 2019-08-05 2019-10-29 中国民航大学 Construction method of Chinese and English civil aviation air-ground communication acoustic model based on CDNN-HMM
CN110706720A (en) * 2019-08-16 2020-01-17 广东省智能制造研究所 Acoustic anomaly detection method for end-to-end unsupervised deep support network
CN110706720B (en) * 2019-08-16 2022-04-22 广东省智能制造研究所 Acoustic anomaly detection method for end-to-end unsupervised deep support network
CN110600015A (en) * 2019-09-18 2019-12-20 北京声智科技有限公司 Voice dense classification method and related device
CN110782901A (en) * 2019-11-05 2020-02-11 深圳大学 Method, storage medium and device for identifying voice of network telephone
CN110782901B (en) * 2019-11-05 2021-12-24 深圳大学 Method, storage medium and device for identifying voice of network telephone
CN111027675A (en) * 2019-11-22 2020-04-17 南京大学 Automatic adjusting method and system for multimedia playing setting
CN111044285A (en) * 2019-11-22 2020-04-21 军事科学院系统工程研究院军用标准研究中心 Method for diagnosing faults of mechanical equipment under complex conditions
CN111027675B (en) * 2019-11-22 2023-03-07 南京大学 Automatic adjusting method and system for multimedia playing setting
CN110931046A (en) * 2019-11-29 2020-03-27 福州大学 Audio high-level semantic feature extraction method and system for overlapped sound event detection
CN111128227A (en) * 2019-12-30 2020-05-08 云知声智能科技股份有限公司 Sound detection method and device
CN111724770B (en) * 2020-05-19 2022-04-01 中国电子科技网络信息安全有限公司 Audio keyword identification method for generating confrontation network based on deep convolution
CN111724770A (en) * 2020-05-19 2020-09-29 中国电子科技网络信息安全有限公司 Audio keyword identification method for generating confrontation network based on deep convolution
CN111508501B (en) * 2020-07-02 2020-09-29 成都晓多科技有限公司 Voice recognition method and system with accent for telephone robot
CN111508501A (en) * 2020-07-02 2020-08-07 成都晓多科技有限公司 Voice recognition method and system with accent for telephone robot
CN112750428A (en) * 2020-12-29 2021-05-04 平安普惠企业管理有限公司 Voice interaction method and device and computer equipment
CN113361647A (en) * 2021-07-06 2021-09-07 青岛洞听智能科技有限公司 Method for identifying type of missed call
CN113959071A (en) * 2021-07-21 2022-01-21 北京金茂绿建科技有限公司 Machine learning assistance-based operation control optimization method for centralized water chilling unit air conditioning system
CN113959071B (en) * 2021-07-21 2023-05-26 北京金茂绿建科技有限公司 Centralized water chilling unit air conditioning system operation control optimization method based on machine learning assistance
CN113689633A (en) * 2021-08-26 2021-11-23 浙江力石科技股份有限公司 Scenic spot human-computer interaction method, device and system

Also Published As

Publication number Publication date
CN108766419B (en) 2020-10-27

Similar Documents

Publication Publication Date Title
CN108766419A (en) A kind of abnormal speech detection method based on deep learning
CN112364779B (en) Underwater sound target identification method based on signal processing and deep-shallow network multi-model fusion
CN106782511A Speech recognition method based on a modified linear deep autoencoder network
CN109559736B (en) Automatic dubbing method for movie actors based on confrontation network
CN107221320A Method, apparatus, device and computer storage medium for training an acoustic feature extraction model
CN113488058B (en) Voiceprint recognition method based on short voice
CN111048097B (en) Twin network voiceprint recognition method based on 3D convolution
CN108520753A Speech lie detection method based on convolutional bidirectional long short-term memory network
CN110111797A Speaker recognition method based on Gaussian supervector and deep neural network
CN103456302B Emotional speaker recognition method based on emotion GMM model weight synthesis
CN111986679A (en) Speaker confirmation method, system and storage medium for responding to complex acoustic environment
Sukhwal et al. Comparative study of different classifiers based speaker recognition system using modified MFCC for noisy environment
CN102496366B Text-independent speaker identification method
Zheng et al. MSRANet: Learning discriminative embeddings for speaker verification via channel and spatial attention mechanism in alterable scenarios
Sun et al. A novel convolutional neural network voiceprint recognition method based on improved pooling method and dropout idea
CN113571095B (en) Speech emotion recognition method and system based on nested deep neural network
CN111243621A (en) Construction method of GRU-SVM deep learning model for synthetic speech detection
Kamble et al. Emotion recognition for instantaneous Marathi spoken words
Aggarwal et al. Performance evaluation of artificial neural networks for isolated Hindi digit recognition with LPC and MFCC
CN114299995A (en) Language emotion recognition method for emotion assessment
CN111785262B (en) Speaker age and gender classification method based on residual error network and fusion characteristics
CN114882888A (en) Voiceprint recognition method and system based on variational self-coding and countermeasure generation network
CN113963718A (en) Voice session segmentation method based on deep learning
Gade et al. Hybrid Deep Convolutional Neural Network based Speaker Recognition for Noisy Speech Environments
CN113450830A (en) Voice emotion recognition method of convolution cyclic neural network with multiple attention mechanisms

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant