CN108766419B - Abnormal voice distinguishing method based on deep learning - Google Patents


Info

Publication number
CN108766419B
Authority
CN
China
Prior art keywords
layer
voice
frame
convolution
input
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810417478.2A
Other languages
Chinese (zh)
Other versions
CN108766419A (en)
Inventor
奉小慧
陈光科
贺前华
巫小兰
李艳雄
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology (SCUT)
Priority to CN201810417478.2A
Publication of CN108766419A
Application granted
Publication of CN108766419B
Legal status: Active
Anticipated expiration

Classifications

    • G10L 15/16: Speech classification or search using artificial neural networks
    • G10L 15/063: Training (creation of reference templates; training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice)
    • G10L 15/08: Speech classification or search
    • G10L 15/142: Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
    • G10L 15/26: Speech to text systems
    • G10L 25/24: Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being the cepstrum
    • G10L 25/30: Speech or voice analysis techniques characterised by the analysis technique, using neural networks

Abstract

The invention discloses a method for distinguishing abnormal voices based on deep learning. The method comprises the following steps: obtaining input voice, and performing resampling, pre-emphasis and framing/windowing preprocessing on the input voice to obtain preprocessed voice; extracting a Mel frequency cepstrum coefficient (MFCC) feature vector from each frame of the preprocessed voice; warping voice sections with different frame numbers to a fixed frame number, so that each voice section yields a corresponding Mel frequency cepstrum coefficient feature matrix; establishing a convolution depth confidence network; inputting the Mel frequency cepstrum coefficient feature matrix into the convolution depth confidence network, training it, and classifying the state of the input voice; and calling a hidden Markov model to perform template matching according to the classification result to obtain a voice recognition result. The invention uses the multiple nonlinear transformation layers of the convolution depth confidence network to map the input MFCC features to a higher-dimensional space, and uses hidden Markov models to model the voices in different states separately, thereby improving the accuracy of voice recognition.

Description

Abnormal voice distinguishing method based on deep learning
Technical Field
The invention relates to the field of intelligent voice processing research, and in particular to an abnormal voice distinguishing method based on deep learning.
Background
Speech is one of the important ways in which humans interact with machines. After decades of research, speech recognition technology has developed greatly and has penetrated into daily life. However, existing speech recognition research still has the following problems:
In real life, an abnormal health condition of the speaker or other causes can shift the input voice from normal voice to abnormal voice, and more noise interference may be introduced. Abnormal speech generally refers to speech with complex background noise, speech produced by intentionally changing the speaking style or habit, speech affected by lesions of the vocal organs, and the like.
Another problem is that conventional speech recognition systems often use linear predictive cepstral coefficients and Mel frequency cepstral coefficients. The main information in these low-level acoustic features is the pronunciation text feature, and the speaker information is easily interfered with by this information as well as by channel and noise information, so the recognition performance of the system degrades.
Disclosure of Invention
The invention mainly aims to overcome the defects of the prior art and provide an abnormal speech distinguishing method based on deep learning, which uses the nonlinear transformation capability of a deep neural network to map low-dimensional MFCC and LPCC parameters to a high-dimensional space, better represents the high-level abstract information of the speech signal, models normal speech and abnormal speech separately, and effectively distinguishes normal speech from abnormal speech.
The purpose of the invention is realized by the following technical scheme:
An abnormal speech distinguishing method based on deep learning comprises the following steps:
s1, acquiring input voice, and performing resampling, pre-emphasis and frame-dividing and windowing pre-processing on the input voice to obtain pre-processed voice;
s2, extracting a Mel frequency cepstrum coefficient feature vector for each frame of voice of the preprocessed voice by utilizing a Mel frequency filter bank and Fourier transform;
s3, regulating the voice sections with different frame numbers to a fixed frame number, and obtaining a corresponding Mel frequency cepstrum coefficient characteristic matrix for each voice section;
s4, establishing a convolution depth confidence network;
s5, inputting the Mel frequency cepstrum coefficient feature matrix into a convolution depth confidence network, training, and classifying the state of input voice;
and S6, calling the hidden Markov model to carry out template matching according to the classification result to obtain a voice recognition result.
In step S1, the sampling frequency of the resampling is 22.05kHz, and the encoding mode is wav format;
the pre-emphasis uses a first order FIR high pass filter with a transfer function of:
$H(z) = 1 - a z^{-1}$
wherein a is a high-pass filter coefficient and takes a value of 0.93; the pre-emphasized speech signal is:
y(n) = sp(n) - a·sp(n-1), n = 0, 1, …, Length-1
wherein, y (n) is the pre-emphasis voice signal, sp (n) is the voice signal before pre-emphasis, sp (n-1) is the time shift of the voice signal, and Length is the voice signal Length;
the frame windowing specifically comprises: the method comprises the steps of slicing the voice, intercepting an audio signal with a fixed length in the input voice at fixed time intervals into a frame, and performing frame division windowing by adopting a Hamming window with the frame length of 25ms and the frame shift of 10 ms.
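As an illustration of this preprocessing step, the following minimal NumPy sketch performs resampling, pre-emphasis with a = 0.93 and Hamming-window framing (25 ms frames, 10 ms shift); the linear-interpolation resampler and the function name are simplifying assumptions, not prescribed by the patent:

```python
import numpy as np

def preprocess(signal, fs_in, fs_out=22050, a=0.93, frame_ms=25, shift_ms=10):
    """Resample, pre-emphasize and split a 1-D speech signal into windowed frames."""
    # Resample to 22.05 kHz (simple linear interpolation as a stand-in resampler)
    t_in = np.arange(len(signal)) / fs_in
    t_out = np.arange(0.0, t_in[-1], 1.0 / fs_out)
    x = np.interp(t_out, t_in, signal)
    # Pre-emphasis: y(n) = x(n) - a * x(n-1)
    y = np.append(x[0], x[1:] - a * x[:-1])
    # Framing with a Hamming window: 25 ms frame length, 10 ms frame shift
    flen, fshift = int(fs_out * frame_ms / 1000), int(fs_out * shift_ms / 1000)
    window = np.hamming(flen)
    frames = [y[s:s + flen] * window for s in range(0, len(y) - flen + 1, fshift)]
    return np.array(frames)
```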
In step S2, the specific procedure is as follows:
V1. Design L Mel frequency filters with a triangular shape. Let $W_l(k)$, $l = 1, 2, \ldots, L$, be the frequency response of the l-th Mel frequency filter, Fs the resampling frequency of the speech signal, and Q the frame length of one frame of the speech signal, which is also the number of points of the Fourier transform; $f_l$ and $f_h$ are respectively the lower and upper cut-off frequencies of the speech signal. A Q-point fast Fourier transform of a frame of speech with frame length Q yields Q frequency components, and o(l), c(l), h(l) are respectively the index values of the lower-limit, center and upper-limit frequencies of the l-th Mel frequency filter among the Q frequency components; o(l), c(l), h(l) satisfy the following relationships:
c(l-1)=o(l),
o(l+1)=c(l),
h(l)=c(l+1),
that is, the position of the center frequency of the current filter is at the side lobe attenuation position of the next filter;
At the same time, $o(l)|_{l=1} = f_l$ and $h(l)|_{l=L} = f_h$. Thus, the index value of the center frequency of the l-th Mel frequency filter among the Q frequency components is expressed as:
$$c(l) = \left\lfloor \frac{Q}{F_s}\,\mathrm{Mel}^{-1}\!\Big(\mathrm{Mel}(f_l) + l\,\frac{\mathrm{Mel}(f_h) - \mathrm{Mel}(f_l)}{L+1}\Big) \right\rfloor$$
where $\mathrm{Mel}(f_1)$ is the function mapping the actual frequency to the Mel frequency, $\mathrm{Mel}^{-1}(f_2)$ is the inverse of $\mathrm{Mel}(f_1)$, $f_1$ is the actual frequency, and $f_2$ is the Mel frequency:
$$\mathrm{Mel}(f_1) = 2595\,\log_{10}\!\Big(1 + \frac{f_1}{700}\Big),$$
$$\mathrm{Mel}^{-1}(f_2) = 700\,\Big(10^{f_2/2595} - 1\Big).$$
The frequency response of the l-th Mel frequency filter is:
$$W_l(k) = \begin{cases} \dfrac{k - o(l)}{c(l) - o(l)}, & o(l) \le k \le c(l) \\[4pt] \dfrac{h(l) - k}{h(l) - c(l)}, & c(l) < k \le h(l) \\[4pt] 0, & \text{otherwise.} \end{cases}$$
V2. Perform a Q-point fast Fourier transform on a frame of speech signal x(n), n = 0, 1, …, Q-1 (Q < Length), obtained after resampling, pre-emphasis and framing/windowing, to obtain the spectrum X(k) and the magnitude spectrum |X(k)|:
$$|X(k)| = \left|\sum_{n=0}^{Q-1} x(n)\,e^{-j 2\pi n k / Q}\right|,\quad k = 0, 1, \ldots, Q-1.$$
v3, passing a frame of speech signal through a mel-frequency filter bank to obtain an output amplitude spectrum of each filter:
$$m(l) = \sum_{k=o(l)}^{h(l)} W_l(k)\,|X(k)|,\quad l = 1, 2, \ldots, L$$
v4, carrying out logarithm operation on the output amplitude spectrums of all the filters, and further carrying out discrete cosine transform to obtain a Mel frequency cepstrum coefficient:
$$c_{mfcc}(i) = \sqrt{\tfrac{2}{L}}\,\sum_{l=1}^{L} \log\big(m(l)\big)\,\cos\!\Big(\frac{\pi i\,(2l-1)}{2L}\Big),\quad i = 1, 2, \ldots, L$$
The 2nd to (M+1)-th of the L coefficients are taken to form the M-dimensional Mel frequency cepstrum coefficient feature vector of each frame: $C = \{c_{mfcc}(2), c_{mfcc}(3), \ldots, c_{mfcc}(M+1)\}$.
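A compact NumPy/SciPy sketch of V1 to V4 follows; the equal-Mel spacing and unnormalized triangular filters are common conventions assumed here and may differ in detail from the patent's filter design, but the final step keeps coefficients 2 to M+1 as described above:

```python
import numpy as np
from scipy.fft import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc_frame(frame, fs, n_filters=26, n_ceps=12, f_low=0.0, f_high=None):
    """MFCC feature vector of one pre-emphasized, Hamming-windowed frame."""
    q = len(frame)
    f_high = f_high or fs / 2.0
    spectrum = np.abs(np.fft.rfft(frame, n=q))                    # |X(k)|
    # Triangular Mel filter bank: L + 2 points equally spaced on the Mel scale
    mel_pts = np.linspace(hz_to_mel(f_low), hz_to_mel(f_high), n_filters + 2)
    bins = np.floor((q / fs) * mel_to_hz(mel_pts)).astype(int)    # o(l), c(l), h(l)
    fbank = np.zeros((n_filters, len(spectrum)))
    for l in range(n_filters):
        o, c, h = bins[l], bins[l + 1], bins[l + 2]
        fbank[l, o:c] = (np.arange(o, c) - o) / max(c - o, 1)     # rising edge
        fbank[l, c:h] = (h - np.arange(c, h)) / max(h - c, 1)     # falling edge
    log_m = np.log(fbank @ spectrum + 1e-10)                      # log filter outputs
    ceps = dct(log_m, type=2, norm='ortho')                       # discrete cosine transform
    return ceps[1:n_ceps + 1]                                     # 2nd .. (M+1)-th coefficients
```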
In step S2, the dimension of the Mel frequency cepstrum coefficient feature matrix is N × M, where N is the fixed frame number after the speech is normalized, i.e. the target frame number of the time warping, and M is the order of the Mel frequency cepstrum coefficients of each frame of speech; the Mel frequency cepstrum coefficients are 12-dimensional, i.e. M = 12.
In step S3, since the time lengths of different speech segments differ, the numbers of frames they contain may not be consistent; because the input of the neural network must have a fixed dimension, the speech signal needs to be time-warped so that the network input has a fixed size. Time warping is performed with a feature point sequence method, specifically:
Y1. Build an (n-N+1)-layer time warping network, where n is the frame number of the input speech segment and N is the target frame number after time warping; the first layer has n frames and the last layer has N frames.
The i-th layer of the time warping network has n-(i-1) frames, each frame corresponding to one feature vector, so that the layer forms a group of n-(i-1) feature vectors $\{C^i_1, C^i_2, \ldots, C^i_{n-(i-1)}\}$, where $i = 1, 2, \ldots, n-N+1$, $k = 1, 2, \ldots, n-(i-1)$, and $C^i_k$ denotes the Mel frequency cepstrum coefficient feature vector of the k-th frame of speech at the i-th layer of the network.
In particular, the feature vector group of the first layer of the network is the group of feature vectors input to the network: $\{C^1_1, C^1_2, \ldots, C^1_n\}$. Let $u^i_k$ denote the weight of the speech frame represented by $C^i_k$; when i = 1, every frame weight takes the initial value $u^1_k = 1$.
Y2. Let $d^i_k$ denote the distance between $C^i_k$ and $C^i_{k+1}$, i.e. $d^i_k = d(C^i_k, C^i_{k+1})$. For the i-th layer, find j such that for any k ≠ j, $d^i_j \le d^i_k$.
Y3. Merge the two closest frames and reduce the frame number by one, namely:
$$C^{i+1}_j = \frac{u^i_j\,C^i_j + u^i_{j+1}\,C^i_{j+1}}{u^i_j + u^i_{j+1}},\qquad u^{i+1}_j = u^i_j + u^i_{j+1},\qquad C^{i+1}_k = \begin{cases} C^i_k, & k < j \\ C^i_{k+1}, & k > j \end{cases}$$
where $C^{i+1}_j$ denotes the Mel frequency cepstrum coefficient feature vector of the j-th frame of speech at layer i+1 of the network; $C^i_{j+1}$ denotes that of the (j+1)-th frame at layer i; $C^{i+1}_k$ denotes that of the k-th frame at layer i+1; $C^i_{k+1}$ denotes that of the (k+1)-th frame at layer i; $u^i_j$ and $u^i_{j+1}$ denote the weights of the j-th and (j+1)-th frames of speech at layer i; $u^{i+1}_j$ denotes the weight of the j-th frame at layer i+1; and $d^i_j$ denotes the distance between the Mel frequency cepstrum coefficient feature vectors of the j-th and (j+1)-th frames at layer i.
Y4. Solve for the weights and inter-frame distances of layer i+1 in the same way, i.e. $u^{i+1}_k = u^i_k$ for k < j, $u^{i+1}_k = u^i_{k+1}$ for k > j, and recompute $d^{i+1}_k$ for the remaining frames.
Repeat this for i = 1 to i = n-N+1; each pass reduces the frame number by one, so that the n-frame speech signal is finally warped to a fixed N frames.
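The warping procedure Y1 to Y4 can be sketched as follows; the Euclidean distance and the unit initial weights are assumptions made for illustration, since the patent only speaks of a distance between feature vectors and of frame weights:

```python
import numpy as np

def time_warp(frames, target_n):
    """Warp an (n, M) MFCC matrix to (target_n, M) by repeatedly merging the two
    adjacent frames whose feature vectors are closest (weighted-average merge)."""
    feats = [np.asarray(f, dtype=float) for f in frames]
    weights = [1.0] * len(feats)
    while len(feats) > target_n:
        dists = [np.linalg.norm(feats[k] - feats[k + 1]) for k in range(len(feats) - 1)]
        j = int(np.argmin(dists))                        # closest adjacent pair
        merged = (weights[j] * feats[j] + weights[j + 1] * feats[j + 1]) \
                 / (weights[j] + weights[j + 1])
        feats[j:j + 2] = [merged]                        # frame count drops by one
        weights[j:j + 2] = [weights[j] + weights[j + 1]]
    return np.vstack(feats)
```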
In step S4, the convolution depth confidence network is formed by stacking a plurality of convolution limited Boltzmann machines from top to bottom, and the output layer adopts a Softmax classifier. Each convolution limited Boltzmann machine consists of one input layer V and one convolution layer H. A pooling layer is added after the convolution layer H of each convolution limited Boltzmann machine and a pooling operation is performed, with pooling size E′ × F′ and pooling stride s3 × s4, where s3 = s4 = 2, so that the pooling layer of one convolution limited Boltzmann machine serves as the input layer of the next convolution limited Boltzmann machine.
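For reference, a plain max-pooling routine with the 2 × 2 pooling size and stride used here might look as follows (max pooling is assumed; the patent only states that a pooling operation is performed):

```python
import numpy as np

def pool_channel(h, pool=(2, 2), stride=(2, 2)):
    """Pool one convolution-layer channel h of size N' x M' down to its pooling layer."""
    n_out = (h.shape[0] - pool[0]) // stride[0] + 1
    m_out = (h.shape[1] - pool[1]) // stride[1] + 1
    out = np.empty((n_out, m_out))
    for r in range(n_out):
        for c in range(m_out):
            rs, cs = r * stride[0], c * stride[1]
            out[r, c] = h[rs:rs + pool[0], cs:cs + pool[1]].max()
    return out
```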
Step S5, training the convolution limited boltzmann machine at the bottom, and then training the convolution limited boltzmann machine at the top, specifically:
z1, the number of input channels of the convolution limited Boltzmann machine is set as I, and each channel corresponds to a two-dimensional matrix with the size of y multiplied by s, namely:
$$V = \{v_1, v_2, \ldots, v_I\},\quad v_i \in R^{y \times s},\ i = 1, 2, \ldots, I,$$
where V is the input layer of the convolution limited Boltzmann machine and $v_i \in R^{y \times s}$ is the i-th channel of the input layer;
in particular, in the convolution limited Boltzmann machine at the bottom layer, I = 1, y = N and s = M, i.e. the number of input channels of the bottom convolution limited Boltzmann machine is 1, corresponding to the two-dimensional Mel frequency cepstrum coefficient feature matrix of size N × M input to the convolution depth confidence network, where N is the time-warped target frame number and M is the order of the Mel frequency cepstrum coefficients of each frame of speech;
Z2. The convolution uses O convolution kernels, each of which is a three-dimensional weight matrix of size I × E × F, i.e. the convolution kernels are $W = \{w_1, w_2, \ldots, w_O\}$, where $w_j \in R^{I \times E \times F}$, $j = 1, 2, \ldots, O$, and $w_j$ is the j-th convolution kernel; the number of output channels of the convolution limited Boltzmann machine is therefore O, and each output channel corresponds to a certain local feature of the input, namely:
the convolution layer $H = \{h_1, h_2, \ldots, h_O\}$, where $h_j \in R^{N' \times M'}$, $j = 1, 2, \ldots, O$; the j-th channel of the convolution layer corresponds to a two-dimensional feature matrix of size N′ × M′ after feature mapping;
Z3. Set the convolution stride to s1 × s2. All neurons in the same channel of the input layer share a bias $a_i$, $i = 1, 2, \ldots, I$, and neurons in the same group of the convolution layer share a bias $b_j$, $j = 1, 2, \ldots, O$; the parameters to be trained are θ = {W, a, b};
Z4. The convolution limited Boltzmann machine is an energy-based model whose energy function is defined as:
$$E(V, H) = -\sum_{j=1}^{O}\sum_{m=1}^{N'}\sum_{k=1}^{M'} h^j_{m,k}\Big(\sum_{i=1}^{I}\sum_{e=1}^{E}\sum_{f=1}^{F} w_j^{i,e,f}\, v^i_{(m-1)s1+e,\,(k-1)s2+f}\Big) - \sum_{j=1}^{O} b_j \sum_{m=1}^{N'}\sum_{k=1}^{M'} h^j_{m,k} - \sum_{i=1}^{I} a_i \sum_{p=1}^{y}\sum_{q=1}^{s} v^i_{p,q}$$
The joint probability distribution of the input layer V and the convolution layer H is obtained from the energy function, i.e. the joint probability distribution over all neuron values $\{v^i_{p,q}\}$ of the input layer V and all neuron values $\{h^j_{m,k}\}$ of the convolution layer H:
$$P(V, H; \theta) = \frac{1}{Z(\theta)}\, e^{-E(V, H)},$$
where
$$Z(\theta) = \sum_{V, H} e^{-E(V, H)}$$
is the partition function; i = 1, …, I, p = 1, …, y, q = 1, …, s, j = 1, …, O, m = 1, …, N′, k = 1, …, M′.
The marginal probability distribution of the input layer V, i.e. the probability distribution over all neuron values of the input layer V, is:
$$P(V; \theta) = \sum_{H} P(V, H; \theta) = \frac{1}{Z(\theta)} \sum_{H} e^{-E(V, H)}.$$
If the training sample set TS has T samples, the log-likelihood function on the input layer V is:
$$\ln L(TS, \theta) = \sum_{t=1}^{T} \ln P(V^{(t)}; \theta),$$
where L(TS, θ) is the likelihood function on the input layer V:
$$L(TS, \theta) = \prod_{t=1}^{T} P(V^{(t)}; \theta).$$
Z5. A gradient ascent algorithm combined with the contrastive divergence algorithm is adopted to maximize the log-likelihood function and thereby obtain the parameters θ, i.e.
$$\theta^{*} = \arg\max_{\theta}\, \ln L(TS, \theta).$$
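A minimal contrastive-divergence (CD-1) sketch for one convolution limited Boltzmann machine is given below. It deliberately simplifies the patent's setting: a single input channel, stride 1, binary units and no pooling; a faithful reproduction would extend it to the stride s1 × s2, real-valued MFCC inputs and the stacked, pooled machines described above:

```python
import numpy as np
from scipy.signal import correlate2d, convolve2d

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class CRBM:
    """Single-channel convolutional RBM trained with one step of contrastive divergence."""
    def __init__(self, n_kernels=10, kernel_size=(2, 2), seed=0):
        self.rng = np.random.default_rng(seed)
        self.W = self.rng.normal(0.0, 0.01, size=(n_kernels,) + kernel_size)
        self.a = 0.0                          # shared visible bias
        self.b = np.full(n_kernels, -0.1)     # one hidden bias per kernel

    def hidden_probs(self, v):
        # P(h_j = 1 | v): "valid" cross-correlation with each kernel plus its bias
        return np.stack([sigmoid(correlate2d(v, w, mode="valid") + b)
                         for w, b in zip(self.W, self.b)])

    def visible_probs(self, h):
        # P(v = 1 | h): sum of "full" convolutions of the hidden maps with the kernels
        act = sum(convolve2d(hj, wj, mode="full") for hj, wj in zip(h, self.W))
        return sigmoid(act + self.a)

    def cd1_update(self, v0, lr=0.01):
        h0 = self.hidden_probs(v0)
        h0_sample = (self.rng.random(h0.shape) < h0).astype(float)
        v1 = self.visible_probs(h0_sample)            # one Gibbs step (reconstruction)
        h1 = self.hidden_probs(v1)
        for j in range(len(self.W)):                  # positive minus negative statistics
            self.W[j] += lr * (correlate2d(v0, h0[j], mode="valid")
                               - correlate2d(v1, h1[j], mode="valid"))
        self.a += lr * (v0.mean() - v1.mean())
        self.b += lr * (h0.mean(axis=(1, 2)) - h1.mean(axis=(1, 2)))
```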
Z6. A Softmax output layer is adopted as the output layer; it has two neurons, whose outputs respectively represent the probability that the sample is normal speech and the probability that it is abnormal speech, and in actual recognition the category corresponding to the larger probability is taken as the final classification result. The output layer is fully connected to the pooling layer of the topmost convolution limited Boltzmann machine, and the parameters to be trained are the biases $c_1, c_2$ of the two output-layer neurons and the connection weights between the output layer and the topmost pooling layer:
$$WE_i = \{we_i^1, we_i^2, \ldots, we_i^r\},\quad i = 1, 2,$$
where r is the number of neurons of the topmost hidden (pooling) layer; when i = 1, $WE_1$ is the connection weight vector between the 1st neuron of the Softmax output layer and the neurons of the topmost pooling layer, and $we_1^k$ (k ≤ r) is the connection weight between the 1st neuron of the Softmax output layer and the k-th neuron of the topmost pooling layer; when i = 2, $WE_2$ is the connection weight vector between the 2nd neuron of the Softmax output layer and the neurons of the topmost pooling layer, and $we_2^k$ is the connection weight between the 2nd neuron of the Softmax output layer and the k-th neuron of the topmost pooling layer.
Let the output of the topmost pooling layer be $G = \{g_1, g_2, \ldots, g_r\}$, where $g_i$ is the output of the i-th neuron of the topmost pooling layer. The input of the output layer is then $f_i = WE_i\,G^{T} + c_i$, where $f_1$ (i = 1) is the input of the first output-layer neuron and $f_2$ (i = 2) is the input of the second output-layer neuron;
The output values of the output layer are:
$$y_i = \frac{e^{f_i}}{\sum_{j=1}^{2} e^{f_j}},\quad i = 1, 2,$$
where $y_1$ (i = 1), the output of the 1st neuron of the Softmax output layer, represents the probability that the input speech belongs to normal speech, and $y_2$ (i = 2), the output of the 2nd neuron of the Softmax output layer, represents the probability that the input speech belongs to abnormal speech;
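The output-layer computation above transcribes directly into a few lines of NumPy (the shift by the maximum of f is a standard numerical-stability detail, not part of the patent's formula):

```python
import numpy as np

def softmax_head(G, WE, c):
    """Z6 forward pass: G is the flattened topmost pooling output of length r,
    WE the 2 x r connection-weight matrix, c the two output-layer biases."""
    f = WE @ G + c                 # f_i = WE_i * G^T + c_i
    e = np.exp(f - f.max())        # numerically stable exponentials
    return e / e.sum()             # y_1 = P(normal speech), y_2 = P(abnormal speech)
```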
Z7. When the Softmax classifier is trained, training is carried out in Mini-batch mode: T training samples are grabbed at a time for training and the parameters are updated once, using a loss function built from the error sum of squares J and the F1 measure defined below,
where J is the sum of squared errors during training,
$$J = \sum_{j=1}^{T}\sum_{i=1}^{2}\big(\hat{y}_i^{j} - y_i^{j}\big)^2,$$
$\hat{y}_i^{j}$ is the ideal output value of the i-th neuron of the classifier for the grabbed j-th sample: if the j-th sample belongs to normal speech, $\hat{y}_1^{j} = 1$ and $\hat{y}_2^{j} = 0$; otherwise $\hat{y}_1^{j} = 0$ and $\hat{y}_2^{j} = 1$. $y_i^{j}$ is the actual output value of the i-th neuron of the classifier for the grabbed j-th sample;
F1 is the harmonic mean of the precision P and the recall R, i.e.:
$$F1 = \frac{2PR}{P + R},\qquad P = \frac{TP}{TP + FP},\qquad R = \frac{TP}{TP + FN},$$
the accuracy rate P is the ratio of the number of samples which are correctly identified as abnormal speech to the number of samples which are identified as abnormal speech, the recall rate is the ratio of the number of samples which are correctly identified as abnormal speech to the number of samples which are all abnormal speech, TP is the number of samples which are correctly identified as abnormal speech in T samples, FP is the number of samples which are incorrectly identified as abnormal speech in T samples, and FN is the number of samples which are incorrectly identified as normal speech in T samples;
during training, updating the parameters by adopting a gradient descent algorithm until the loss function converges to a set value or the iteration times reach the maximum iteration times, and ending the training.
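A small helper shows how the per-batch metrics and the stopping rule described above can be computed; the loss-convergence value 0.004 and the 1000-iteration cap are the illustrative defaults given in Embodiment 1:

```python
def batch_metrics(pred_abnormal, true_abnormal):
    """Precision P, recall R and F1 over one batch, following the definitions above."""
    tp = sum(p and t for p, t in zip(pred_abnormal, true_abnormal))
    fp = sum(p and not t for p, t in zip(pred_abnormal, true_abnormal))
    fn = sum(t and not p for p, t in zip(pred_abnormal, true_abnormal))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def should_stop(loss, iteration, loss_target=0.004, max_iterations=1000):
    """Training ends when the loss converges to the set value or the iteration cap is hit."""
    return loss <= loss_target or iteration >= max_iterations
```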
Step S6, the specific process is: in the online identification part, a deep neural network finishes the loading of a network structure and weight coefficients obtained by training, a hidden Markov model finishes the loading of a pre-learned model, and the preprocessing and the feature extraction are carried out on the real-time input voice; respectively establishing HMM templates of normal and abnormal voices, namely training the HMM template by using the normal voices and establishing the HMM template of the normal voices; training an HMM template by using the abnormal speech to establish an HMM template of the abnormal speech; and during online recognition, calling different HMM templates according to a judgment result output by the convolution depth confidence network, and performing template matching on a Mel frequency cepstrum coefficient of an input voice signal to obtain a final voice recognition result.
Compared with the prior art, the invention has the following advantages and beneficial effects:
the method combines the convolution deep learning confidence network and the hidden Markov model, utilizes the multilayer nonlinear transformation layers of the convolution deep confidence network to map the input MFCC characteristics to a higher dimensional space, more comprehensively represents voice information, and effectively distinguishes abnormal voice and normal voice; and then, the time series modeling capability of the hidden Markov model is utilized to respectively model the voices in different states, so that the recognition accuracy of the voices is greatly improved.
Drawings
FIG. 1 is a flowchart of an off-line training method for abnormal speech discrimination based on deep learning according to the present invention;
FIG. 2 is a flow chart of the online recognition of the abnormal speech distinguishing method based on deep learning according to the present invention;
FIG. 3 is a schematic diagram of a convolution deep belief network recognition speech state of an abnormal speech discrimination method based on deep learning according to the present invention.
Detailed Description
The present invention will be described in further detail with reference to examples and drawings, but the present invention is not limited thereto.
Example 1
An abnormal speech distinguishing method based on deep learning comprises the following steps:
the first step is as follows: acquiring input voice, and carrying out preprocessing such as resampling, pre-emphasis, framing and windowing on the input voice to obtain preprocessed voice;
the resampling specifically comprises: the input voice has different sampling frequencies and coding modes, so that the original input voice signal is resampled to facilitate the processing and analysis of data, and the sampling frequencies and the coding modes are unified; the sampling frequency is 22.05kHz, and the coding mode is wav format.
The pre-emphasis is specifically as follows: the power spectrum of the sound signal is reduced along with the increase of the frequency, most energy is concentrated in a low-frequency range, in order to improve the high-frequency part of the original sound signal, the original input sound signal is subjected to pre-emphasis processing, a first-order FIR high-pass filter is adopted, and the transmission function of the FIR high-pass filter is as follows:
$H(z) = 1 - a z^{-1}$
wherein a is a high-pass filter coefficient and takes a value of 0.93; the pre-emphasized speech signal is:
y(n) = sp(n) - a·sp(n-1), n = 0, 1, ..., Length-1
wherein, y (n) is the pre-emphasis voice signal, sp (n) is the voice signal before pre-emphasis, sp (n-1) is the time shift of the voice signal, and Length is the voice signal Length;
the frame windowing specifically comprises: the method comprises the steps of slicing the voice, intercepting an audio signal with a fixed length in the input voice at fixed time intervals into a frame, and performing frame division windowing by adopting a Hamming window with the frame length of 25ms and the frame shift of 10 ms.
The second step is that: extracting Mel frequency cepstrum coefficient characteristics of each frame of voice of the preprocessed voice;
the specific process is as follows:
V1. Design L Mel frequency filters with a triangular shape. Let $W_l(k)$, $l = 1, 2, \ldots, L$, be the frequency response of the l-th Mel frequency filter, Fs the resampling frequency of the speech signal, and Q the frame length of one frame of the speech signal, which is also the number of points of the Fourier transform; $f_l$ and $f_h$ are respectively the lower and upper cut-off frequencies of the speech signal. A Q-point fast Fourier transform of a frame of speech with frame length Q yields Q frequency components, and o(l), c(l), h(l) are respectively the index values of the lower-limit, center and upper-limit frequencies of the l-th Mel frequency filter among the Q frequency components; o(l), c(l), h(l) satisfy the following relationships:
c(l-1)=o(l),
o(l+1)=c(l),
h(l)=c(l+1),
that is, the position of the center frequency of the current filter is at the side lobe attenuation position of the next filter;
At the same time, $o(l)|_{l=1} = f_l$ and $h(l)|_{l=L} = f_h$. Thus, the index value of the center frequency of the l-th Mel frequency filter among the Q frequency components is expressed as:
$$c(l) = \left\lfloor \frac{Q}{F_s}\,\mathrm{Mel}^{-1}\!\Big(\mathrm{Mel}(f_l) + l\,\frac{\mathrm{Mel}(f_h) - \mathrm{Mel}(f_l)}{L+1}\Big) \right\rfloor$$
where $\mathrm{Mel}(f_1)$ is the function mapping the actual frequency to the Mel frequency, $\mathrm{Mel}^{-1}(f_2)$ is the inverse of $\mathrm{Mel}(f_1)$, $f_1$ is the actual frequency, and $f_2$ is the Mel frequency:
$$\mathrm{Mel}(f_1) = 2595\,\log_{10}\!\Big(1 + \frac{f_1}{700}\Big),$$
$$\mathrm{Mel}^{-1}(f_2) = 700\,\Big(10^{f_2/2595} - 1\Big).$$
The frequency response of the l-th Mel frequency filter is:
$$W_l(k) = \begin{cases} \dfrac{k - o(l)}{c(l) - o(l)}, & o(l) \le k \le c(l) \\[4pt] \dfrac{h(l) - k}{h(l) - c(l)}, & c(l) < k \le h(l) \\[4pt] 0, & \text{otherwise.} \end{cases}$$
V2. Perform a Q-point fast Fourier transform on a frame of speech signal x(n), n = 0, 1, …, Q-1, obtained after resampling, pre-emphasis and framing/windowing, to obtain the spectrum X(k) and the magnitude spectrum |X(k)|:
$$|X(k)| = \left|\sum_{n=0}^{Q-1} x(n)\,e^{-j 2\pi n k / Q}\right|,\quad k = 0, 1, \ldots, Q-1.$$
v3, passing a frame of speech signal through a mel-frequency filter bank to obtain an output amplitude spectrum of each filter:
$$m(l) = \sum_{k=o(l)}^{h(l)} W_l(k)\,|X(k)|,\quad l = 1, 2, \ldots, L$$
v4, carrying out logarithm operation on the output amplitude spectrums of all the filters, and further carrying out discrete cosine transform to obtain a Mel frequency cepstrum coefficient:
$$c_{mfcc}(i) = \sqrt{\tfrac{2}{L}}\,\sum_{l=1}^{L} \log\big(m(l)\big)\,\cos\!\Big(\frac{\pi i\,(2l-1)}{2L}\Big),\quad i = 1, 2, \ldots, L$$
The 2nd to (M+1)-th of the L coefficients are taken to form the M-dimensional Mel frequency cepstrum coefficient feature vector of each frame: $C = \{c_{mfcc}(2), c_{mfcc}(3), \ldots, c_{mfcc}(M+1)\}$.
The third step: regulating the voice sections with different frame numbers to a fixed frame number, and obtaining a corresponding Mel frequency cepstrum coefficient characteristic matrix for each voice section;
Because the time lengths of different speech segments differ, the numbers of frames contained in different speech segments may not be consistent; since the input of the neural network must have a fixed dimension, the speech signal needs to be time-warped so that the network input has a fixed size. Time warping is performed with a feature point sequence method, and the warping process is specifically as follows:
Y1. Build an (n-N+1)-layer time warping network, where n is the frame number of the input speech segment and N is the target frame number after time warping; the first layer has n frames and the last layer has N frames.
The i-th layer of the time warping network has n-(i-1) frames, each frame corresponding to one feature vector, forming a group of n-(i-1) feature vectors $\{C^i_1, C^i_2, \ldots, C^i_{n-(i-1)}\}$, where $i = 1, 2, \ldots, n-N+1$, $k = 1, 2, \ldots, n-(i-1)$, and $C^i_k$ is the Mel frequency cepstrum coefficient feature vector of the k-th frame of speech at the i-th layer of the network.
In particular, the feature vector group of the first layer of the network is the group of feature vectors input to the network: $\{C^1_1, C^1_2, \ldots, C^1_n\}$. Let $u^i_k$ denote the weight of the speech frame represented by $C^i_k$; when i = 1, every frame weight takes the initial value $u^1_k = 1$.
Y2. Let $d^i_k$ denote the distance between $C^i_k$ and $C^i_{k+1}$, i.e. $d^i_k = d(C^i_k, C^i_{k+1})$. For the i-th layer, find j such that for any k ≠ j, $d^i_j \le d^i_k$.
Y3. Merge the two closest frames and reduce the frame number by one, namely:
$$C^{i+1}_j = \frac{u^i_j\,C^i_j + u^i_{j+1}\,C^i_{j+1}}{u^i_j + u^i_{j+1}},\qquad u^{i+1}_j = u^i_j + u^i_{j+1},\qquad C^{i+1}_k = \begin{cases} C^i_k, & k < j \\ C^i_{k+1}, & k > j \end{cases}$$
where $C^{i+1}_j$ is the Mel frequency cepstrum coefficient feature vector of the j-th frame of speech at layer i+1 of the network; $C^i_{j+1}$ is that of the (j+1)-th frame at layer i; $C^{i+1}_k$ is that of the k-th frame at layer i+1; $C^i_{k+1}$ is that of the (k+1)-th frame at layer i; $u^i_j$ and $u^i_{j+1}$ are the weights of the j-th and (j+1)-th frames at layer i; $u^{i+1}_j$ is the weight of the j-th frame at layer i+1; and $d^i_j$ is the distance between the Mel frequency cepstrum coefficient feature vectors of the j-th and (j+1)-th frames at layer i.
Y4. Solve for the weights and inter-frame distances of layer i+1 in the same way, i.e. $u^{i+1}_k = u^i_k$ for k < j, $u^{i+1}_k = u^i_{k+1}$ for k > j, and recompute $d^{i+1}_k$ for the remaining frames.
Repeat this for i = 1 to i = n-N+1; each pass reduces the frame number by one, so that the n-frame speech signal is finally warped to a fixed N frames.
After time warping, each speech segment corresponds to one Mel frequency cepstrum coefficient feature matrix. The dimension of the matrix is N × M, where N is the target frame number of the time warping of the segment and M is the order of the Mel frequency cepstrum coefficients of each frame of speech; the Mel frequency cepstrum coefficients are 12-dimensional, i.e. M = 12.
The fourth step: establishing a convolution depth confidence network. The depth confidence network is formed by stacking two convolution limited Boltzmann machines and comprises 2 convolution layers, 2 pooling layers, 1 visible layer and 1 output layer. For the first convolution limited Boltzmann machine, the number of visible-layer neurons is N × M = 200 × 12, the number of convolution kernels is 10, the kernel size is 2 × 2, the convolution stride is 2 × 2, and the initial kernel values are random values drawn from a Gaussian distribution with mean 0 and variance 0.01. The initial bias of the visible layer is 0 and the initial bias of the convolution layer is -0.1. One iteration takes 100 samples, and the number of iterations is 100. The pooling size of the first pooling layer is 2 × 2 with a pooling stride of 2 × 2. For the second convolution limited Boltzmann machine, the number of convolution kernels is 10, the kernel size is 10 × 2 × 2, the convolution stride is 2 × 2, the initial kernel values are random values drawn from a Gaussian distribution with mean 0 and variance 0.01, and the initial bias of the second convolution layer is -0.1. One iteration takes 100 samples, and the number of iterations is 100. The pooling size of the second pooling layer is 2 × 2 with a pooling stride of 2 × 2. All convolution layers use the Sigmoid activation function. The number of neurons in the output layer is 2, outputting the posterior probabilities of normal speech and abnormal speech. For the output layer, the convergence value of the loss function is set to 0.004 and the maximum number of iterations is 1000.
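For reference, the hyperparameters of this embodiment can be collected in a single configuration dictionary (the key names are illustrative; this only restates the values above and is not an executable model definition):

```python
CDBN_CONFIG = {
    "input_shape": (200, 12),                 # N x M MFCC feature matrix
    "crbm1": {"kernels": 10, "kernel_size": (2, 2), "stride": (2, 2),
              "init_std": 0.01, "visible_bias": 0.0, "hidden_bias": -0.1,
              "batch_size": 100, "iterations": 100,
              "pool_size": (2, 2), "pool_stride": (2, 2)},
    "crbm2": {"kernels": 10, "kernel_size": (10, 2, 2), "stride": (2, 2),
              "init_std": 0.01, "hidden_bias": -0.1,
              "batch_size": 100, "iterations": 100,
              "pool_size": (2, 2), "pool_stride": (2, 2)},
    "activation": "sigmoid",
    "output": {"neurons": 2, "loss_target": 0.004, "max_iterations": 1000},
}
```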
The fifth step: inputting the mel frequency cepstrum coefficient feature matrix into a convolution depth confidence network, training, and classifying states of input voice, wherein fig. 1 is a training flow chart of an off-line state;
firstly, training the convolution limited Boltzmann machine at the bottom layer, and then training the convolution limited Boltzmann machine at the top layer, specifically:
z1, the number of input channels of the convolution limited Boltzmann machine is set as I, and each channel corresponds to a two-dimensional matrix with the size of y multiplied by s, namely:
$$V = \{v_1, v_2, \ldots, v_I\},\quad v_i \in R^{y \times s},\ i = 1, 2, \ldots, I,$$
where V is the input layer of the convolution limited Boltzmann machine and $v_i \in R^{y \times s}$ is the i-th channel of the input layer;
in particular, in the convolution limited Boltzmann machine at the bottom layer, I = 1, y = N and s = M, i.e. the number of input channels of the bottom convolution limited Boltzmann machine is 1, corresponding to the two-dimensional Mel frequency cepstrum coefficient feature matrix of size N × M input to the convolution depth confidence network, where N is the time-warped target frame number and M is the order of the Mel frequency cepstrum coefficients of each frame of speech;
Z2. The convolution uses O convolution kernels, each of which is a three-dimensional weight matrix of size I × E × F, i.e. the convolution kernels are $W = \{w_1, w_2, \ldots, w_O\}$, where $w_j \in R^{I \times E \times F}$, $j = 1, 2, \ldots, O$, and $w_j$ is the j-th convolution kernel; the number of output channels of the convolution limited Boltzmann machine is therefore O, and each output channel corresponds to a certain local feature of the input, namely:
the convolution layer $H = \{h_1, h_2, \ldots, h_O\}$, where $h_j \in R^{N' \times M'}$, $j = 1, 2, \ldots, O$; the j-th channel of the convolution layer corresponds to a two-dimensional feature matrix of size N′ × M′ after feature mapping;
Z3. Set the convolution stride to s1 × s2. All neurons in the same channel of the input layer share a bias $a_i$, $i = 1, 2, \ldots, I$, and neurons in the same group of the convolution layer share a bias $b_j$, $j = 1, 2, \ldots, O$; the parameters to be trained are θ = {W, a, b};
Z4. The convolution limited Boltzmann machine is an energy-based model whose energy function is defined as:
$$E(V, H) = -\sum_{j=1}^{O}\sum_{m=1}^{N'}\sum_{k=1}^{M'} h^j_{m,k}\Big(\sum_{i=1}^{I}\sum_{e=1}^{E}\sum_{f=1}^{F} w_j^{i,e,f}\, v^i_{(m-1)s1+e,\,(k-1)s2+f}\Big) - \sum_{j=1}^{O} b_j \sum_{m=1}^{N'}\sum_{k=1}^{M'} h^j_{m,k} - \sum_{i=1}^{I} a_i \sum_{p=1}^{y}\sum_{q=1}^{s} v^i_{p,q}$$
The joint probability distribution of the input layer V and the convolution layer H is obtained from the energy function, i.e. the joint probability distribution over all neuron values $\{v^i_{p,q}\}$ of the input layer V and all neuron values $\{h^j_{m,k}\}$ of the convolution layer H:
$$P(V, H; \theta) = \frac{1}{Z(\theta)}\, e^{-E(V, H)},$$
where
$$Z(\theta) = \sum_{V, H} e^{-E(V, H)}$$
is the partition function; i = 1, …, I, p = 1, …, y, q = 1, …, s, j = 1, …, O, m = 1, …, N′, k = 1, …, M′.
The marginal probability distribution of the input layer V, i.e. the probability distribution over all neuron values of the input layer V, is:
$$P(V; \theta) = \sum_{H} P(V, H; \theta) = \frac{1}{Z(\theta)} \sum_{H} e^{-E(V, H)}.$$
If the training sample set TS has T samples, the log-likelihood function on the input layer V is:
$$\ln L(TS, \theta) = \sum_{t=1}^{T} \ln P(V^{(t)}; \theta),$$
where L(TS, θ) is the likelihood function on the input layer V:
$$L(TS, \theta) = \prod_{t=1}^{T} P(V^{(t)}; \theta).$$
Z5. A gradient ascent algorithm combined with the contrastive divergence algorithm is adopted to maximize the log-likelihood function and thereby obtain the parameters θ, i.e.
$$\theta^{*} = \arg\max_{\theta}\, \ln L(TS, \theta).$$
Z6. A Softmax output layer is adopted as the output layer; it has two neurons, whose outputs respectively represent the probability that the sample is normal speech and the probability that it is abnormal speech, and in actual recognition the category corresponding to the larger probability is taken as the final classification result. The output layer is fully connected to the pooling layer of the topmost convolution limited Boltzmann machine, and the parameters to be trained are the biases $c_1, c_2$ of the two output-layer neurons and the connection weights between the output layer and the topmost pooling layer:
$$WE_i = \{we_i^1, we_i^2, \ldots, we_i^r\},\quad i = 1, 2,$$
where r is the number of neurons of the topmost hidden (pooling) layer; when i = 1, $WE_1$ is the connection weight vector between the 1st neuron of the Softmax output layer and the neurons of the topmost pooling layer, and $we_1^k$ (k ≤ r) is the connection weight between the 1st neuron of the Softmax output layer and the k-th neuron of the topmost pooling layer; when i = 2, $WE_2$ is the connection weight vector between the 2nd neuron of the Softmax output layer and the neurons of the topmost pooling layer, and $we_2^k$ is the connection weight between the 2nd neuron of the Softmax output layer and the k-th neuron of the topmost pooling layer.
Let the output of the topmost pooling layer be $G = \{g_1, g_2, \ldots, g_r\}$, where $g_i$ is the output of the i-th neuron of the topmost pooling layer. The input of the output layer is then $f_i = WE_i\,G^{T} + c_i$, where $f_1$ (i = 1) is the input of the first output-layer neuron and $f_2$ (i = 2) is the input of the second output-layer neuron;
The output values of the output layer are:
$$y_i = \frac{e^{f_i}}{\sum_{j=1}^{2} e^{f_j}},\quad i = 1, 2,$$
where $y_1$ (i = 1), the output of the 1st neuron of the Softmax output layer, represents the probability that the input speech belongs to normal speech, and $y_2$ (i = 2), the output of the 2nd neuron of the Softmax output layer, represents the probability that the input speech belongs to abnormal speech;
Z7. When the Softmax classifier is trained, training is carried out in Mini-batch mode: T training samples are grabbed at a time for training and the parameters are updated once, using a loss function built from the error sum of squares J and the F1 measure defined below,
where J is the sum of squared errors during training,
$$J = \sum_{j=1}^{T}\sum_{i=1}^{2}\big(\hat{y}_i^{j} - y_i^{j}\big)^2,$$
$\hat{y}_i^{j}$ is the ideal output value of the i-th neuron of the classifier for the grabbed j-th sample: if the j-th sample belongs to normal speech, $\hat{y}_1^{j} = 1$ and $\hat{y}_2^{j} = 0$; otherwise $\hat{y}_1^{j} = 0$ and $\hat{y}_2^{j} = 1$. $y_i^{j}$ is the actual output value of the i-th neuron of the classifier for the grabbed j-th sample;
F1 is the harmonic mean of the precision P and the recall R, i.e.:
$$F1 = \frac{2PR}{P + R},\qquad P = \frac{TP}{TP + FP},\qquad R = \frac{TP}{TP + FN},$$
the accuracy rate P is the ratio of the number of samples which are correctly identified as abnormal speech to the number of samples which are identified as abnormal speech, the recall rate is the ratio of the number of samples which are correctly identified as abnormal speech to the number of samples which are all abnormal speech, TP is the number of samples which are correctly identified as abnormal speech in T samples, FP is the number of samples which are incorrectly identified as abnormal speech in T samples, and FN is the number of samples which are incorrectly identified as normal speech in T samples;
during training, updating the parameters by adopting a gradient descent algorithm until the loss function converges to a set value or the iteration times reach the maximum iteration times, and ending the training.
And a sixth step: and calling a hidden Markov model to perform template matching according to the classification result to obtain a voice recognition result:
an online identification part, as shown in fig. 2, a deep neural network finishes the loading of a network structure and weight coefficients obtained by training, a hidden markov model finishes the loading of a pre-learned model, and the preprocessing and feature extraction are carried out on the real-time input voice; respectively establishing HMM templates of normal and abnormal voices, namely training the HMM template by using the normal voices and establishing the HMM template of the normal voices; training an HMM template by using the abnormal speech to establish an HMM template of the abnormal speech; during online recognition, different HMM templates are called according to the judgment result output by the convolution depth confidence network, and the mel-frequency cepstrum coefficient of the input speech signal is subjected to template matching to obtain a final speech recognition result, wherein the schematic flow chart is shown in fig. 3.
The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and all such changes, modifications, substitutions, combinations, and simplifications are intended to be included in the scope of the present invention.

Claims (8)

1. An abnormal speech distinguishing method based on deep learning is characterized by comprising the following steps:
s1, acquiring input voice, and preprocessing the input voice to obtain preprocessed voice, wherein the preprocessing comprises resampling, pre-emphasis, framing and windowing;
s2, extracting a Mel frequency cepstrum coefficient feature vector for each frame of voice of the preprocessed voice by utilizing a Mel frequency filter bank and Fourier transform;
s3, regulating the voice sections with different frame numbers to a fixed frame number, and obtaining a corresponding Mel frequency cepstrum coefficient characteristic matrix for each voice section;
s4, establishing a convolution depth confidence network; the convolution depth confidence network is formed by stacking more than one convolution limited Boltzmann machine from top to bottom, and the output layer adopts a Softmax classifier; the convolution limited Boltzmann machine is composed of a layer of input layer V and a layer of convolution layer H;
s5, inputting the Mel frequency cepstrum coefficient feature matrix into a convolution depth confidence network, training, and classifying the state of input voice; the process is as follows: firstly, training the convolution limited Boltzmann machine at the bottom layer, and then training the convolution limited Boltzmann machine at the top layer, specifically:
z1, the number of input channels of the convolution limited Boltzmann machine is set as I, and each channel corresponds to a two-dimensional matrix with the size of y multiplied by s, namely:
$$V = \{v_1, v_2, \ldots, v_I\},\quad v_i \in R^{y \times s},\ i = 1, 2, \ldots, I,$$
wherein V is the input layer of the convolution limited Boltzmann machine and $v_i \in R^{y \times s}$ is the i-th channel of the input layer;
in the convolution limited Boltzmann machine at the bottom layer, I = 1, y = N and s = M, i.e. the number of input channels of the bottom convolution limited Boltzmann machine is 1, corresponding to the two-dimensional Mel frequency cepstrum coefficient feature matrix of size N × M input to the convolution depth confidence network, where N is the time-warped target frame number and M is the order of the Mel frequency cepstrum coefficients of each frame of speech;
Z2, the convolution uses O convolution kernels, each of which is a three-dimensional weight matrix of size I × E × F, i.e. the convolution kernels are $W = \{w_1, w_2, \ldots, w_O\}$, where $w_j \in R^{I \times E \times F}$, $j = 1, 2, \ldots, O$, and $w_j$ is the j-th convolution kernel; the number of output channels of the convolution limited Boltzmann machine is therefore O, and each output channel corresponds to a certain local feature of the input, namely:
the convolution layer $H = \{h_1, h_2, \ldots, h_O\}$, where $h_j \in R^{N' \times M'}$, $j = 1, 2, \ldots, O$, denotes the j-th channel of the convolution layer and corresponds to a two-dimensional feature matrix of size N′ × M′ after feature mapping;
Z3, the convolution stride is set to s1 × s2; all neurons in the same channel of the input layer share a bias $a_i$, $i = 1, 2, \ldots, I$, and neurons in the same group of the convolution layer share a bias $b_j$, $j = 1, 2, \ldots, O$; the parameters to be trained are θ = {W, a, b};
Z4, the convolution limited Boltzmann machine is an energy-based model whose energy function is defined as:
$$E(V, H) = -\sum_{j=1}^{O}\sum_{m=1}^{N'}\sum_{k=1}^{M'} h^j_{m,k}\Big(\sum_{i=1}^{I}\sum_{e=1}^{E}\sum_{f=1}^{F} w_j^{i,e,f}\, v^i_{(m-1)s1+e,\,(k-1)s2+f}\Big) - \sum_{j=1}^{O} b_j \sum_{m=1}^{N'}\sum_{k=1}^{M'} h^j_{m,k} - \sum_{i=1}^{I} a_i \sum_{p=1}^{y}\sum_{q=1}^{s} v^i_{p,q};$$
the joint probability distribution of the input layer V and the convolution layer H is obtained from the energy function, i.e. the joint probability distribution over all neuron values $\{v^i_{p,q}\}$ of the input layer V and all neuron values $\{h^j_{m,k}\}$ of the convolution layer H:
$$P(V, H; \theta) = \frac{1}{Z(\theta)}\, e^{-E(V, H)},$$
wherein $Z(\theta) = \sum_{V, H} e^{-E(V, H)}$ is the partition function; i = 1, …, I, p = 1, …, y, q = 1, …, s, j = 1, …, O, m = 1, …, N′, k = 1, …, M′;
the marginal probability distribution of the input layer V, i.e. the probability distribution over all neuron values of the input layer V, is:
$$P(V; \theta) = \sum_{H} P(V, H; \theta) = \frac{1}{Z(\theta)} \sum_{H} e^{-E(V, H)};$$
if the training sample set TS has T samples, the log-likelihood function on the input layer V is:
$$\ln L(TS, \theta) = \sum_{t=1}^{T} \ln P(V^{(t)}; \theta),$$
where L(TS, θ) is the likelihood function on the input layer V:
$$L(TS, \theta) = \prod_{t=1}^{T} P(V^{(t)}; \theta);$$
Z5, a gradient ascent algorithm combined with the contrastive divergence algorithm is adopted to maximize the log-likelihood function and thereby obtain the parameters θ, i.e.
$$\theta^{*} = \arg\max_{\theta}\, \ln L(TS, \theta);$$
Z6, a Softmax output layer is adopted as the output layer; it has two neurons, whose outputs respectively represent the probability that the sample is normal speech and the probability that it is abnormal speech, and in actual recognition the category corresponding to the larger probability is taken as the final classification result; the output layer is fully connected to the pooling layer of the topmost convolution limited Boltzmann machine, and the parameters to be trained are the biases $c_1, c_2$ of the two output-layer neurons and the connection weights between the output layer and the topmost pooling layer:
$$WE_i = \{we_i^1, we_i^2, \ldots, we_i^r\},\quad i = 1, 2,$$
wherein r is the number of neurons of the topmost hidden layer; when i = 1, $WE_1$ is the connection weight vector between the 1st neuron of the Softmax output layer and the neurons of the topmost pooling layer, and $we_1^k$ (k ≤ r) is the connection weight between the 1st neuron of the Softmax output layer and the k-th neuron of the topmost pooling layer; when i = 2, $WE_2$ is the connection weight vector between the 2nd neuron of the Softmax output layer and the neurons of the topmost pooling layer, and $we_2^k$ is the connection weight between the 2nd neuron of the Softmax output layer and the k-th neuron of the topmost pooling layer;
let the output of the topmost pooling layer be $G = \{g_1, g_2, \ldots, g_r\}$, where $g_i$ is the output of the i-th neuron of the topmost pooling layer; the input of the output layer is then $f_i = WE_i\,G^{T} + c_i$, where $f_1$ (i = 1) is the input of the first output-layer neuron and $f_2$ (i = 2) is the input of the second output-layer neuron;
the output values of the output layer are
$$y_i = \frac{e^{f_i}}{\sum_{j=1}^{2} e^{f_j}},\quad i = 1, 2,$$
wherein $y_1$ (i = 1), the output of the 1st neuron of the Softmax output layer, represents the probability that the input speech belongs to normal speech, and $y_2$ (i = 2), the output of the 2nd neuron of the Softmax output layer, represents the probability that the input speech belongs to abnormal speech;
Z7, when the Softmax classifier is trained, training is carried out in Mini-batch mode: T training samples are grabbed at a time for training and the parameters are updated once, using a loss function built from the error sum of squares J and the F1 measure defined below,
wherein J is the sum of squared errors during training,
$$J = \sum_{j=1}^{T}\sum_{i=1}^{2}\big(\hat{y}_i^{j} - y_i^{j}\big)^2,$$
$\hat{y}_i^{j}$ is the ideal output value of the i-th neuron of the classifier for the grabbed j-th sample: if the j-th sample belongs to normal speech, $\hat{y}_1^{j} = 1$ and $\hat{y}_2^{j} = 0$; otherwise $\hat{y}_1^{j} = 0$ and $\hat{y}_2^{j} = 1$; $y_i^{j}$ is the actual output value of the i-th neuron of the classifier for the grabbed j-th sample;
F1 is the harmonic mean of the precision P and the recall R, i.e.:
$$F1 = \frac{2PR}{P + R},\qquad P = \frac{TP}{TP + FP},\qquad R = \frac{TP}{TP + FN},$$
the accuracy rate P is the ratio of the number of samples which are correctly identified as abnormal speech to the number of samples which are identified as abnormal speech, the recall rate is the ratio of the number of samples which are correctly identified as abnormal speech to the number of samples which are all abnormal speech, TP is the number of samples which are correctly identified as abnormal speech in T samples, FP is the number of samples which are incorrectly identified as abnormal speech in T samples, and FN is the number of samples which are incorrectly identified as normal speech in T samples;
during training, updating parameters by adopting a gradient descent algorithm until the loss function converges to a set value or the iteration times reach the maximum iteration times, and ending the training;
and S6, calling the hidden Markov model to carry out template matching according to the classification result to obtain a voice recognition result.
2. The abnormal speech distinguishing method based on deep learning of claim 1, wherein in step S1,
the sampling frequency of the resampling is 22.05kHz, and the encoding mode is wav format;
the pre-emphasis adopts a first-order FIR high-pass filter, and the transmission function of the pre-emphasis is as follows:
$H(z) = 1 - a z^{-1}$
wherein a is a high-pass filter coefficient and takes a value of 0.93; the pre-emphasized speech signal is:
y(n) = sp(n) - a·sp(n-1), n = 0, 1, …, Length-1;
wherein, y (n) is the pre-emphasis voice signal, sp (n) is the voice signal before pre-emphasis, sp (n-1) is the time shift of the voice signal, and Length is the voice signal Length;
the frame windowing specifically comprises: the method comprises the steps of slicing the voice, intercepting an audio signal with a fixed length in the input voice at fixed time intervals into a frame, and performing frame division windowing by adopting a Hamming window with the frame length of 25ms and the frame shift of 10 ms.
3. The abnormal speech distinguishing method based on deep learning of claim 1, wherein the step S2 specifically includes:
V1, design L Mel frequency filters with a triangular shape; let $W_l(k)$, $l = 1, 2, \ldots, L$, be the frequency response of the l-th Mel frequency filter, Fs the resampling frequency of the speech signal, and Q the frame length of one frame of the speech signal, which is also the number of points of the Fourier transform; $f_l$ and $f_h$ are respectively the lower and upper cut-off frequencies of the speech signal; a Q-point fast Fourier transform of a frame of speech with frame length Q yields Q frequency components, and o(l), c(l), h(l) are respectively the index values of the lower-limit, center and upper-limit frequencies of the l-th Mel frequency filter among the Q frequency components; o(l), c(l), h(l) satisfy the following relationships:
c(l-1)=o(l),
o(l+1)=c(l),
h(l)=c(l+1),
that is, the position of the center frequency of the current filter is at the side lobe attenuation position of the next filter;
at the same time, $o(l)|_{l=1} = f_l$ and $h(l)|_{l=L} = f_h$; thus, the index value of the center frequency of the l-th Mel frequency filter among the Q frequency components is expressed as:
$$c(l) = \left\lfloor \frac{Q}{F_s}\,\mathrm{Mel}^{-1}\!\Big(\mathrm{Mel}(f_l) + l\,\frac{\mathrm{Mel}(f_h) - \mathrm{Mel}(f_l)}{L+1}\Big) \right\rfloor$$
wherein $\mathrm{Mel}(f_1)$ is the function mapping the actual frequency to the Mel frequency, $\mathrm{Mel}^{-1}(f_2)$ is the inverse of $\mathrm{Mel}(f_1)$, $f_1$ is the actual frequency, and $f_2$ is the Mel frequency:
$$\mathrm{Mel}(f_1) = 2595\,\log_{10}\!\Big(1 + \frac{f_1}{700}\Big),$$
$$\mathrm{Mel}^{-1}(f_2) = 700\,\Big(10^{f_2/2595} - 1\Big);$$
the frequency response of the l-th Mel frequency filter is:
$$W_l(k) = \begin{cases} \dfrac{k - o(l)}{c(l) - o(l)}, & o(l) \le k \le c(l) \\[4pt] \dfrac{h(l) - k}{h(l) - c(l)}, & c(l) < k \le h(l) \\[4pt] 0, & \text{otherwise,} \end{cases}$$
wherein k is a subscript value of the first frequency component in the Q frequency components;
V2, perform a Q-point fast Fourier transform on a frame of speech signal x(n), n = 0, 1, …, Q-1 (Q < Length), obtained after resampling, pre-emphasis and framing/windowing, to obtain the spectrum X(k) and the magnitude spectrum |X(k)|:
$$|X(k)| = \left|\sum_{n=0}^{Q-1} x(n)\,e^{-j 2\pi n k / Q}\right|,\quad k = 0, 1, \ldots, Q-1;$$
v3, passing a frame of speech signal through a mel-frequency filter bank to obtain an output amplitude spectrum of each filter:
$$m(l) = \sum_{k=o(l)}^{h(l)} W_l(k)\,|X(k)|,\quad l = 1, 2, \ldots, L;$$
v4, carrying out logarithm operation on the output amplitude spectrums of all the filters, and further carrying out discrete cosine transform to obtain a Mel frequency cepstrum coefficient:
$$c_{mfcc}(i) = \sqrt{\tfrac{2}{L}}\,\sum_{l=1}^{L} \log\big(m(l)\big)\,\cos\!\Big(\frac{\pi i\,(2l-1)}{2L}\Big),\quad i = 1, 2, \ldots, L;$$
the 2nd to (M+1)-th of the L coefficients are taken to form the M-dimensional Mel frequency cepstrum coefficient feature vector of each frame: $C = \{c_{mfcc}(2), c_{mfcc}(3), \ldots, c_{mfcc}(M+1)\}$.
4. The method for distinguishing abnormal speech according to claim 1, wherein in step S2, the dimension of the Mel frequency cepstrum coefficient feature matrix is N × M, N is the number of target frames for time warping of the speech, M is the order of the Mel frequency cepstrum coefficients of each frame of speech, and the Mel frequency cepstrum coefficients are 12-dimensional, that is, M = 12.
5. The method for distinguishing abnormal speech according to claim 1, wherein in step S3, the normalization specifically comprises:
y1, building a time-warping network with n − N + 1 layers, wherein n is the number of frames of the input speech, N is the target number of frames of the time warping, the number of frames of the first layer is n, and the number of frames of the last layer is N;
the i-th layer of the time-warping network has n − (i − 1) frames, each frame corresponding to one feature vector, forming a group of n − (i − 1) feature vectors:
{C_1^(i), C_2^(i), …, C_(n−(i−1))^(i)},
wherein C_j^(i) is the Mel-frequency cepstrum coefficient feature vector of the j-th frame of speech at the i-th layer of the network;
in particular, the feature vector group of the first layer of the network, i.e. the feature vector group input to the network, is:
{C_1^(1), C_2^(1), …, C_n^(1)};
let w_j^(i) denote the weight of the speech frame represented by the vector C_j^(i); when i = 1, w_j^(1) = 1 for every j;
y2, let d_j^(i) = d(C_j^(i), C_(j+1)^(i)) denote the distance between C_j^(i) and C_(j+1)^(i); for the i-th layer, find j such that for any k ≠ j, d_j^(i) ≤ d_k^(i);
y3, merging the two frames with the smallest distance and reducing the number of frames by one, namely:
C_j^(i+1) = ( w_j^(i) · C_j^(i) + w_(j+1)^(i) · C_(j+1)^(i) ) / ( w_j^(i) + w_(j+1)^(i) ),
w_j^(i+1) = w_j^(i) + w_(j+1)^(i);
wherein C_j^(i+1) denotes the Mel-frequency cepstrum coefficient feature vector of the j-th frame of speech at the (i+1)-th layer of the network; C_(j+1)^(i) denotes that of the (j+1)-th frame at the i-th layer; C_k^(i+1) denotes that of the k-th frame at the (i+1)-th layer; C_(k+1)^(i) denotes that of the (k+1)-th frame at the i-th layer; w_j^(i) and w_(j+1)^(i) denote the weights of the j-th and (j+1)-th frames of speech at the i-th layer; w_j^(i+1) denotes the weight of the j-th frame of speech at the (i+1)-th layer; and d_j^(i) denotes the distance between the Mel-frequency cepstrum coefficient feature vectors of the j-th and (j+1)-th frames of speech at the i-th layer of the network;
y4, solving for the remaining frames C_k^(i+1) and weights w_k^(i+1) with k ≠ j, namely:
C_k^(i+1) = C_k^(i) and w_k^(i+1) = w_k^(i), for k < j;
C_k^(i+1) = C_(k+1)^(i) and w_k^(i+1) = w_(k+1)^(i), for k > j;
repeating steps y2 to y4 for i = 1 to i = n − N + 1, the number of frames decreasing by one at each iteration, so that the n-frame speech signal is finally warped to a fixed N frames.
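The warping of steps y1–y4 amounts to repeatedly merging the closest pair of adjacent frames until N frames remain. The sketch below assumes a Euclidean distance for d(·, ·), which the claim does not fix, and uses the frame weights as merge counts; the function and variable names are illustrative only.

import numpy as np

def time_warp(C, N):
    """Warp an (n, M) matrix of per-frame MFCC vectors to exactly N frames by
    repeatedly merging the two adjacent frames whose feature vectors are closest
    (weighted-average merge, weight = number of original frames absorbed)."""
    frames = [c.astype(float) for c in C]    # C_j^(i)
    weights = [1.0] * len(frames)            # w_j^(1) = 1
    while len(frames) > N:
        # distances d_j^(i) between adjacent frames of the current layer
        d = [np.linalg.norm(frames[j] - frames[j + 1]) for j in range(len(frames) - 1)]
        j = int(np.argmin(d))                # pair with the smallest distance
        merged = (weights[j] * frames[j] + weights[j + 1] * frames[j + 1]) \
                 / (weights[j] + weights[j + 1])
        frames[j] = merged                   # C_j^(i+1)
        weights[j] = weights[j] + weights[j + 1]
        del frames[j + 1]                    # remaining frames shift left by one
        del weights[j + 1]
    return np.stack(frames)                  # shape (N, M)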
6. The method for distinguishing abnormal speech based on deep learning of claim 1, wherein each convolutional restricted Boltzmann machine adds a pooling layer after its convolution layer H and performs a pooling operation, the pooling size being E′ × F′ and the pooling step being s3 × s4 with s3 = s4 = 2, so that the pooling layer of one convolutional restricted Boltzmann machine serves as the input layer of the next convolutional restricted Boltzmann machine.
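As a rough sketch of the pooling referred to in this claim, the snippet below applies an ordinary 2 × 2, stride-2 max pooling to a convolution-layer feature map; the choice of max pooling (rather than the probabilistic pooling sometimes used in convolutional restricted Boltzmann machines) is an assumption made for illustration.

import numpy as np

def pool2x2(H, pool=2, stride=2):
    """Downsample a convolution-layer feature map H (height, width) with a
    pool x pool window and the given stride; the output would feed the next layer."""
    out_h = (H.shape[0] - pool) // stride + 1
    out_w = (H.shape[1] - pool) // stride + 1
    P = np.empty((out_h, out_w))
    for r in range(out_h):
        for c in range(out_w):
            P[r, c] = H[r * stride:r * stride + pool,
                        c * stride:c * stride + pool].max()
    return P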
7. The method for distinguishing abnormal speech according to claim 1, wherein step S6 comprises: in the online recognition part, the deep neural network loads the network structure and weight coefficients obtained by training, the hidden Markov model loads the pre-learned model, and the real-time input speech is preprocessed and its features are extracted; HMM templates are established for normal speech and for abnormal speech respectively; during online recognition, the corresponding HMM template is called according to the decision result output by the convolutional deep belief network, and the Mel-frequency cepstrum coefficients of the speech signal are input for template matching to obtain the final speech recognition result.
8. The method for distinguishing abnormal speech based on deep learning of claim 7, wherein the HMM template for normal speech is obtained by training an HMM template with normal speech, and the HMM template for abnormal speech is obtained by training an HMM template with abnormal speech.
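A minimal sketch of the online decision flow of claims 7 and 8 follows; the cdbn.predict and template.decode interfaces, the parameter names and the way the pre-loaded models are passed in are hypothetical placeholders, not the patent's actual modules.

def recognize(warped_features, mfcc_features, cdbn, hmm_normal, hmm_abnormal):
    """Online part: the trained convolutional deep belief network decides whether the
    utterance is abnormal, and the matching HMM template then performs recognition
    on the utterance's Mel-frequency cepstrum coefficients."""
    is_abnormal = cdbn.predict(warped_features)              # hypothetical classifier interface
    template = hmm_abnormal if is_abnormal else hmm_normal   # call the corresponding HMM template
    return template.decode(mfcc_features)                    # hypothetical template-matching interface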
CN201810417478.2A 2018-05-04 2018-05-04 Abnormal voice distinguishing method based on deep learning Active CN108766419B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810417478.2A CN108766419B (en) 2018-05-04 2018-05-04 Abnormal voice distinguishing method based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810417478.2A CN108766419B (en) 2018-05-04 2018-05-04 Abnormal voice distinguishing method based on deep learning

Publications (2)

Publication Number Publication Date
CN108766419A CN108766419A (en) 2018-11-06
CN108766419B true CN108766419B (en) 2020-10-27

Family

ID=64009048

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810417478.2A Active CN108766419B (en) 2018-05-04 2018-05-04 Abnormal voice distinguishing method based on deep learning

Country Status (1)

Country Link
CN (1) CN108766419B (en)

Families Citing this family (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112955954B (en) * 2018-12-21 2024-04-12 华为技术有限公司 Audio processing device and method for audio scene classification
CN111354373B (en) * 2018-12-21 2023-05-12 中国科学院声学研究所 Audio signal classification method based on neural network intermediate layer characteristic filtering
CN110058689A (en) * 2019-04-08 2019-07-26 深圳大学 A kind of smart machine input method based on face's vibration
CN110322887B (en) * 2019-04-28 2021-10-15 武汉大晟极科技有限公司 Multi-type audio signal energy feature extraction method
EP3745412A1 (en) * 2019-05-28 2020-12-02 Corti ApS An intelligent computer aided decision support system
CN110444202B (en) * 2019-07-04 2023-05-26 平安科技(深圳)有限公司 Composite voice recognition method, device, equipment and computer readable storage medium
CN110390929A (en) * 2019-08-05 2019-10-29 中国民航大学 Chinese and English civil aviaton land sky call acoustic model construction method based on CDNN-HMM
CN110706720B (en) * 2019-08-16 2022-04-22 广东省智能制造研究所 Acoustic anomaly detection method for end-to-end unsupervised deep support network
CN110600015B (en) * 2019-09-18 2020-12-15 北京声智科技有限公司 Voice dense classification method and related device
CN110782901B (en) * 2019-11-05 2021-12-24 深圳大学 Method, storage medium and device for identifying voice of network telephone
CN111044285A (en) * 2019-11-22 2020-04-21 军事科学院系统工程研究院军用标准研究中心 Method for diagnosing faults of mechanical equipment under complex conditions
CN111027675B (en) * 2019-11-22 2023-03-07 南京大学 Automatic adjusting method and system for multimedia playing setting
CN110931046A (en) * 2019-11-29 2020-03-27 福州大学 Audio high-level semantic feature extraction method and system for overlapped sound event detection
CN111128227B (en) * 2019-12-30 2022-06-17 云知声智能科技股份有限公司 Sound detection method and device
CN111724770B (en) * 2020-05-19 2022-04-01 中国电子科技网络信息安全有限公司 Audio keyword identification method for generating confrontation network based on deep convolution
CN111508501B (en) * 2020-07-02 2020-09-29 成都晓多科技有限公司 Voice recognition method and system with accent for telephone robot
CN112750428A (en) * 2020-12-29 2021-05-04 平安普惠企业管理有限公司 Voice interaction method and device and computer equipment
CN113361647A (en) * 2021-07-06 2021-09-07 青岛洞听智能科技有限公司 Method for identifying type of missed call
CN113959071B (en) * 2021-07-21 2023-05-26 北京金茂绿建科技有限公司 Centralized water chilling unit air conditioning system operation control optimization method based on machine learning assistance
CN113689633B (en) * 2021-08-26 2023-03-17 浙江力石科技股份有限公司 Scenic spot human-computer interaction method, device and system

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106782602A (en) * 2016-12-01 2017-05-31 南京邮电大学 Speech-emotion recognition method based on length time memory network and convolutional neural networks
CN107464568A (en) * 2017-09-25 2017-12-12 四川长虹电器股份有限公司 Based on the unrelated method for distinguishing speek person of Three dimensional convolution neutral net text and system

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102129860B (en) * 2011-04-07 2012-07-04 南京邮电大学 Text-related speaker recognition method based on infinite-state hidden Markov model
CN104157290B (en) * 2014-08-19 2017-10-24 大连理工大学 A kind of method for distinguishing speek person based on deep learning
CN105206270B (en) * 2015-08-20 2019-04-02 长安大学 A kind of isolated digit speech recognition categorizing system and method combining PCA and RBM
US10373073B2 (en) * 2016-01-11 2019-08-06 International Business Machines Corporation Creating deep learning models using feature augmentation
CN106941005A (en) * 2017-02-24 2017-07-11 华南理工大学 A kind of vocal cords method for detecting abnormality based on speech acoustics feature

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106782602A (en) * 2016-12-01 2017-05-31 南京邮电大学 Speech-emotion recognition method based on length time memory network and convolutional neural networks
CN107464568A (en) * 2017-09-25 2017-12-12 四川长虹电器股份有限公司 Based on the unrelated method for distinguishing speek person of Three dimensional convolution neutral net text and system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
A deep architecture for audio-visual voice activity detection in the presence of transients; Ido Ariav et al.; Signal Processing; 2017-07-12; pp. 64-67 *

Also Published As

Publication number Publication date
CN108766419A (en) 2018-11-06

Similar Documents

Publication Publication Date Title
CN108766419B (en) Abnormal voice distinguishing method based on deep learning
KR100908121B1 (en) Speech feature vector conversion method and apparatus
CN108447495B (en) Deep learning voice enhancement method based on comprehensive feature set
Gulzar et al. Comparative analysis of LPCC, MFCC and BFCC for the recognition of Hindi words using artificial neural networks
Bhattacharjee A comparative study of LPCC and MFCC features for the recognition of Assamese phonemes
KR20080078466A (en) Multi-stage speech recognition apparatus and method
CN111899757B (en) Single-channel voice separation method and system for target speaker extraction
Moritz et al. Noise robust distant automatic speech recognition utilizing NMF based source separation and auditory feature extraction
Pardede et al. Convolutional neural network and feature transformation for distant speech recognition
CN113763965A (en) Speaker identification method with multiple attention characteristics fused
Cai et al. The DKU system for the speaker recognition task of the 2019 VOiCES from a distance challenge
CN113571095A (en) Speech emotion recognition method and system based on nested deep neural network
Palo et al. Comparison of neural network models for speech emotion recognition
Li et al. A Convolutional Neural Network with Non-Local Module for Speech Enhancement.
Aggarwal et al. Performance evaluation of artificial neural networks for isolated Hindi digit recognition with LPC and MFCC
WO2023070874A1 (en) Voiceprint recognition method
Gu et al. Restoring high frequency spectral envelopes using neural networks for speech bandwidth extension
Sunny et al. Feature extraction methods based on linear predictive coding and wavelet packet decomposition for recognizing spoken words in malayalam
Raju et al. AUTOMATIC SPEECH RECOGNITION SYSTEM USING MFCC-BASED LPC APPROACH WITH BACK PROPAGATED ARTIFICIAL NEURAL NETWORKS.
Zheng et al. Throat microphone speech enhancement via progressive learning of spectral mapping based on lstm-rnn
Sunil Kumar et al. Phoneme recognition using zerocrossing interval distribution of speech patterns and ANN
Alex et al. Performance analysis of SOFM based reduced complexity feature extraction methods with back propagation neural network for multilingual digit recognition
CN111785262A (en) Speaker age and gender classification method based on residual error network and fusion characteristics
Nijhawan et al. A comparative study of two different neural models for speaker recognition systems
Singh et al. A novel algorithm using MFCC and ERB gammatone filters in speech recognition

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant