CN108766419B - Abnormal voice distinguishing method based on deep learning - Google Patents


Info

Publication number
CN108766419B
Authority
CN
China
Prior art keywords
layer
voice
frame
convolution
input
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810417478.2A
Other languages
Chinese (zh)
Other versions
CN108766419A (en)
Inventor
奉小慧
陈光科
贺前华
巫小兰
李艳雄
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology (SCUT)
Priority to CN201810417478.2A
Publication of CN108766419A
Application granted
Publication of CN108766419B
Legal status: Active
Anticipated expiration

Classifications

    • G10L 15/16: Speech classification or search using artificial neural networks
    • G10L 15/063: Training (creation of reference templates; training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice)
    • G10L 15/08: Speech classification or search
    • G10L 15/142: Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
    • G10L 15/26: Speech to text systems
    • G10L 25/24: Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being the cepstrum
    • G10L 25/30: Speech or voice analysis techniques characterised by the analysis technique, using neural networks

Abstract

The invention discloses a method for distinguishing abnormal voices based on deep learning. The method comprises the following steps: obtaining input voice, and performing resampling, pre-emphasis and framing/windowing preprocessing on the input voice to obtain preprocessed voice; extracting a Mel frequency cepstrum coefficient (MFCC) feature vector from each frame of the preprocessed voice; warping voice sections with different frame numbers to a fixed frame number, so that each voice section yields a corresponding Mel frequency cepstrum coefficient feature matrix; establishing a convolution depth confidence network; inputting the Mel frequency cepstrum coefficient feature matrix into the convolution depth confidence network, training it, and classifying the state of the input voice; and calling a hidden Markov model to perform template matching according to the classification result to obtain a voice recognition result. The invention uses the multiple nonlinear transformation layers of the convolution depth confidence network to map the input MFCC features to a higher-dimensional space, and uses hidden Markov models to model the voices in different states separately, thereby improving the accuracy of voice recognition.

Description

Abnormal voice distinguishing method based on deep learning
Technical Field
The invention relates to the field of intelligent voice processing research, and in particular to an abnormal voice distinguishing method based on deep learning.
Background
Speech is one of the important ways in which humans interact with machines. After decades of research, speech recognition technology has developed greatly and has penetrated into daily life. However, existing speech recognition research still has the following problems:
In real life, an abnormal health condition of the speaker or other causes can shift the input voice from normal voice to abnormal voice, and more noise interference may be introduced. Abnormal speech generally refers to speech with complex background noise, speech produced by intentionally changing the speaking style or habit, speech affected by lesions of the vocal organs, and the like.
Another problem is that conventional speech recognition systems often use linear predictive cepstral coefficients and Mel frequency cepstral coefficients. The main information in these low-level acoustic features is the pronunciation text feature, and the speaker information is easily interfered with by this information as well as by channel and noise information, so the recognition performance of the system degrades.
Disclosure of Invention
The invention mainly aims to overcome the defects of the prior art and provide an abnormal speech distinguishing method based on deep learning, which uses the nonlinear transformation capability of a deep neural network to map low-dimensional MFCC and LPCC parameters to a high-dimensional space, better represents the high-level abstract information of the speech signal, models normal speech and abnormal speech separately, and effectively distinguishes normal speech from abnormal speech.
The purpose of the invention is realized by the following technical scheme:
An abnormal speech distinguishing method based on deep learning comprises the following steps:
s1, acquiring input voice, and performing resampling, pre-emphasis and frame-dividing and windowing pre-processing on the input voice to obtain pre-processed voice;
s2, extracting a Mel frequency cepstrum coefficient feature vector for each frame of voice of the preprocessed voice by utilizing a Mel frequency filter bank and Fourier transform;
s3, regulating the voice sections with different frame numbers to a fixed frame number, and obtaining a corresponding Mel frequency cepstrum coefficient characteristic matrix for each voice section;
s4, establishing a convolution depth confidence network;
s5, inputting the Mel frequency cepstrum coefficient feature matrix into a convolution depth confidence network, training, and classifying the state of input voice;
and S6, calling the hidden Markov model to carry out template matching according to the classification result to obtain a voice recognition result.
In step S1, the sampling frequency of the resampling is 22.05kHz, and the encoding mode is wav format;
the pre-emphasis uses a first order FIR high pass filter with a transfer function of:
$H(z) = 1 - a z^{-1}$
wherein a is a high-pass filter coefficient and takes a value of 0.93; the pre-emphasized speech signal is:
y(n) = sp(n) - a·sp(n-1), n = 0, 1, …, Length-1
wherein, y (n) is the pre-emphasis voice signal, sp (n) is the voice signal before pre-emphasis, sp (n-1) is the time shift of the voice signal, and Length is the voice signal Length;
the frame windowing specifically comprises: the method comprises the steps of slicing the voice, intercepting an audio signal with a fixed length in the input voice at fixed time intervals into a frame, and performing frame division windowing by adopting a Hamming window with the frame length of 25ms and the frame shift of 10 ms.
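As an illustration of this preprocessing step, the following minimal NumPy sketch performs resampling, pre-emphasis with a = 0.93 and Hamming-window framing (25 ms frames, 10 ms shift); the linear-interpolation resampler and the function name are simplifying assumptions, not prescribed by the patent:

```python
import numpy as np

def preprocess(signal, fs_in, fs_out=22050, a=0.93, frame_ms=25, shift_ms=10):
    """Resample, pre-emphasize and split a 1-D speech signal into windowed frames."""
    # Resample to 22.05 kHz (simple linear interpolation as a stand-in resampler)
    t_in = np.arange(len(signal)) / fs_in
    t_out = np.arange(0.0, t_in[-1], 1.0 / fs_out)
    x = np.interp(t_out, t_in, signal)
    # Pre-emphasis: y(n) = x(n) - a * x(n-1)
    y = np.append(x[0], x[1:] - a * x[:-1])
    # Framing with a Hamming window: 25 ms frame length, 10 ms frame shift
    flen, fshift = int(fs_out * frame_ms / 1000), int(fs_out * shift_ms / 1000)
    window = np.hamming(flen)
    frames = [y[s:s + flen] * window for s in range(0, len(y) - flen + 1, fshift)]
    return np.array(frames)
```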
In step S2, the specific procedure is as follows:
V1. Design L Mel frequency filters with a triangular shape. Let $W_l(k)$, $l = 1, 2, \ldots, L$, be the frequency response of the l-th Mel frequency filter, Fs the resampling frequency of the speech signal, and Q the frame length of one frame of the speech signal, which is also the number of points of the Fourier transform; $f_l$ and $f_h$ are respectively the lower and upper cut-off frequencies of the speech signal. A Q-point fast Fourier transform of a frame of speech with frame length Q yields Q frequency components, and o(l), c(l), h(l) are respectively the index values of the lower-limit, center and upper-limit frequencies of the l-th Mel frequency filter among the Q frequency components; o(l), c(l), h(l) satisfy the following relationships:
c(l-1)=o(l),
o(l+1)=c(l),
h(l)=c(l+1),
that is, the position of the center frequency of the current filter is at the side lobe attenuation position of the next filter;
At the same time, $o(l)|_{l=1} = f_l$ and $h(l)|_{l=L} = f_h$. Thus, the index value of the center frequency of the l-th Mel frequency filter among the Q frequency components is expressed as:
$$c(l) = \left\lfloor \frac{Q}{F_s}\,\mathrm{Mel}^{-1}\!\Big(\mathrm{Mel}(f_l) + l\,\frac{\mathrm{Mel}(f_h) - \mathrm{Mel}(f_l)}{L+1}\Big) \right\rfloor$$
where $\mathrm{Mel}(f_1)$ is the function mapping the actual frequency to the Mel frequency, $\mathrm{Mel}^{-1}(f_2)$ is the inverse of $\mathrm{Mel}(f_1)$, $f_1$ is the actual frequency, and $f_2$ is the Mel frequency:
$$\mathrm{Mel}(f_1) = 2595\,\log_{10}\!\Big(1 + \frac{f_1}{700}\Big),$$
$$\mathrm{Mel}^{-1}(f_2) = 700\,\Big(10^{f_2/2595} - 1\Big).$$
The frequency response of the l-th Mel frequency filter is:
$$W_l(k) = \begin{cases} \dfrac{k - o(l)}{c(l) - o(l)}, & o(l) \le k \le c(l) \\[4pt] \dfrac{h(l) - k}{h(l) - c(l)}, & c(l) < k \le h(l) \\[4pt] 0, & \text{otherwise.} \end{cases}$$
V2. Perform a Q-point fast Fourier transform on a frame of speech signal x(n), n = 0, 1, …, Q-1 (Q < Length), obtained after resampling, pre-emphasis and framing/windowing, to obtain the spectrum X(k) and the magnitude spectrum |X(k)|:
$$|X(k)| = \left|\sum_{n=0}^{Q-1} x(n)\,e^{-j 2\pi n k / Q}\right|,\quad k = 0, 1, \ldots, Q-1.$$
v3, passing a frame of speech signal through a mel-frequency filter bank to obtain an output amplitude spectrum of each filter:
$$m(l) = \sum_{k=o(l)}^{h(l)} W_l(k)\,|X(k)|,\quad l = 1, 2, \ldots, L$$
v4, carrying out logarithm operation on the output amplitude spectrums of all the filters, and further carrying out discrete cosine transform to obtain a Mel frequency cepstrum coefficient:
$$c_{mfcc}(i) = \sqrt{\tfrac{2}{L}}\,\sum_{l=1}^{L} \log\big(m(l)\big)\,\cos\!\Big(\frac{\pi i\,(2l-1)}{2L}\Big),\quad i = 1, 2, \ldots, L$$
The 2nd to (M+1)-th of the L coefficients are taken to form the M-dimensional Mel frequency cepstrum coefficient feature vector of each frame: $C = \{c_{mfcc}(2), c_{mfcc}(3), \ldots, c_{mfcc}(M+1)\}$.
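A compact NumPy/SciPy sketch of V1 to V4 follows; the equal-Mel spacing and unnormalized triangular filters are common conventions assumed here and may differ in detail from the patent's filter design, but the final step keeps coefficients 2 to M+1 as described above:

```python
import numpy as np
from scipy.fft import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc_frame(frame, fs, n_filters=26, n_ceps=12, f_low=0.0, f_high=None):
    """MFCC feature vector of one pre-emphasized, Hamming-windowed frame."""
    q = len(frame)
    f_high = f_high or fs / 2.0
    spectrum = np.abs(np.fft.rfft(frame, n=q))                    # |X(k)|
    # Triangular Mel filter bank: L + 2 points equally spaced on the Mel scale
    mel_pts = np.linspace(hz_to_mel(f_low), hz_to_mel(f_high), n_filters + 2)
    bins = np.floor((q / fs) * mel_to_hz(mel_pts)).astype(int)    # o(l), c(l), h(l)
    fbank = np.zeros((n_filters, len(spectrum)))
    for l in range(n_filters):
        o, c, h = bins[l], bins[l + 1], bins[l + 2]
        fbank[l, o:c] = (np.arange(o, c) - o) / max(c - o, 1)     # rising edge
        fbank[l, c:h] = (h - np.arange(c, h)) / max(h - c, 1)     # falling edge
    log_m = np.log(fbank @ spectrum + 1e-10)                      # log filter outputs
    ceps = dct(log_m, type=2, norm='ortho')                       # discrete cosine transform
    return ceps[1:n_ceps + 1]                                     # 2nd .. (M+1)-th coefficients
```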
In step S2, the dimension of the Mel frequency cepstrum coefficient feature matrix is N × M, where N is the fixed frame number after the speech is normalized, i.e. the target frame number of the time warping, and M is the order of the Mel frequency cepstrum coefficients of each frame of speech; the Mel frequency cepstrum coefficients are 12-dimensional, i.e. M = 12.
In step S3, since the time lengths of different speech segments differ, the numbers of frames they contain may not be consistent; because the input of the neural network must have a fixed dimension, the speech signal needs to be time-warped so that the network input has a fixed size. Time warping is performed with a feature point sequence method, specifically:
Y1. Build an (n-N+1)-layer time warping network, where n is the frame number of the input speech segment and N is the target frame number after time warping; the first layer has n frames and the last layer has N frames.
The i-th layer of the time warping network has n-(i-1) frames, each frame corresponding to one feature vector, so that the layer forms a group of n-(i-1) feature vectors $\{C^i_1, C^i_2, \ldots, C^i_{n-(i-1)}\}$, where $i = 1, 2, \ldots, n-N+1$, $k = 1, 2, \ldots, n-(i-1)$, and $C^i_k$ denotes the Mel frequency cepstrum coefficient feature vector of the k-th frame of speech at the i-th layer of the network.
In particular, the feature vector group of the first layer of the network is the group of feature vectors input to the network: $\{C^1_1, C^1_2, \ldots, C^1_n\}$. Let $u^i_k$ denote the weight of the speech frame represented by $C^i_k$; when i = 1, every frame weight takes the initial value $u^1_k = 1$.
Y2. Let $d^i_k$ denote the distance between $C^i_k$ and $C^i_{k+1}$, i.e. $d^i_k = d(C^i_k, C^i_{k+1})$. For the i-th layer, find j such that for any k ≠ j, $d^i_j \le d^i_k$.
Y3. Merge the two closest frames and reduce the frame number by one, namely:
$$C^{i+1}_j = \frac{u^i_j\,C^i_j + u^i_{j+1}\,C^i_{j+1}}{u^i_j + u^i_{j+1}},\qquad u^{i+1}_j = u^i_j + u^i_{j+1},\qquad C^{i+1}_k = \begin{cases} C^i_k, & k < j \\ C^i_{k+1}, & k > j \end{cases}$$
where $C^{i+1}_j$ denotes the Mel frequency cepstrum coefficient feature vector of the j-th frame of speech at layer i+1 of the network; $C^i_{j+1}$ denotes that of the (j+1)-th frame at layer i; $C^{i+1}_k$ denotes that of the k-th frame at layer i+1; $C^i_{k+1}$ denotes that of the (k+1)-th frame at layer i; $u^i_j$ and $u^i_{j+1}$ denote the weights of the j-th and (j+1)-th frames of speech at layer i; $u^{i+1}_j$ denotes the weight of the j-th frame at layer i+1; and $d^i_j$ denotes the distance between the Mel frequency cepstrum coefficient feature vectors of the j-th and (j+1)-th frames at layer i.
Y4. Solve for the weights and inter-frame distances of layer i+1 in the same way, i.e. $u^{i+1}_k = u^i_k$ for k < j, $u^{i+1}_k = u^i_{k+1}$ for k > j, and recompute $d^{i+1}_k$ for the remaining frames.
Repeat this for i = 1 to i = n-N+1; each pass reduces the frame number by one, so that the n-frame speech signal is finally warped to a fixed N frames.
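The warping procedure Y1 to Y4 can be sketched as follows; the Euclidean distance and the unit initial weights are assumptions made for illustration, since the patent only speaks of a distance between feature vectors and of frame weights:

```python
import numpy as np

def time_warp(frames, target_n):
    """Warp an (n, M) MFCC matrix to (target_n, M) by repeatedly merging the two
    adjacent frames whose feature vectors are closest (weighted-average merge)."""
    feats = [np.asarray(f, dtype=float) for f in frames]
    weights = [1.0] * len(feats)
    while len(feats) > target_n:
        dists = [np.linalg.norm(feats[k] - feats[k + 1]) for k in range(len(feats) - 1)]
        j = int(np.argmin(dists))                        # closest adjacent pair
        merged = (weights[j] * feats[j] + weights[j + 1] * feats[j + 1]) \
                 / (weights[j] + weights[j + 1])
        feats[j:j + 2] = [merged]                        # frame count drops by one
        weights[j:j + 2] = [weights[j] + weights[j + 1]]
    return np.vstack(feats)
```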
In step S4, the convolution depth confidence network is formed by stacking a plurality of convolution limited Boltzmann machines from top to bottom, and the output layer adopts a Softmax classifier. Each convolution limited Boltzmann machine consists of one input layer V and one convolution layer H. A pooling layer is added after the convolution layer H of each convolution limited Boltzmann machine and a pooling operation is performed, with pooling size E′ × F′ and pooling stride s3 × s4, where s3 = s4 = 2, so that the pooling layer of one convolution limited Boltzmann machine serves as the input layer of the next convolution limited Boltzmann machine.
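For reference, a plain max-pooling routine with the 2 × 2 pooling size and stride used here might look as follows (max pooling is assumed; the patent only states that a pooling operation is performed):

```python
import numpy as np

def pool_channel(h, pool=(2, 2), stride=(2, 2)):
    """Pool one convolution-layer channel h of size N' x M' down to its pooling layer."""
    n_out = (h.shape[0] - pool[0]) // stride[0] + 1
    m_out = (h.shape[1] - pool[1]) // stride[1] + 1
    out = np.empty((n_out, m_out))
    for r in range(n_out):
        for c in range(m_out):
            rs, cs = r * stride[0], c * stride[1]
            out[r, c] = h[rs:rs + pool[0], cs:cs + pool[1]].max()
    return out
```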
Step S5, training the convolution limited boltzmann machine at the bottom, and then training the convolution limited boltzmann machine at the top, specifically:
z1, the number of input channels of the convolution limited Boltzmann machine is set as I, and each channel corresponds to a two-dimensional matrix with the size of y multiplied by s, namely:
$$V = \{v_1, v_2, \ldots, v_I\},\quad v_i \in R^{y \times s},\ i = 1, 2, \ldots, I,$$
where V is the input layer of the convolution limited Boltzmann machine and $v_i \in R^{y \times s}$ is the i-th channel of the input layer;
in particular, in the convolution limited Boltzmann machine at the bottom layer, I = 1, y = N and s = M, i.e. the number of input channels of the bottom convolution limited Boltzmann machine is 1, corresponding to the two-dimensional Mel frequency cepstrum coefficient feature matrix of size N × M input to the convolution depth confidence network, where N is the time-warped target frame number and M is the order of the Mel frequency cepstrum coefficients of each frame of speech;
Z2. The convolution uses O convolution kernels, each of which is a three-dimensional weight matrix of size I × E × F, i.e. the convolution kernels are $W = \{w_1, w_2, \ldots, w_O\}$, where $w_j \in R^{I \times E \times F}$, $j = 1, 2, \ldots, O$, and $w_j$ is the j-th convolution kernel; the number of output channels of the convolution limited Boltzmann machine is therefore O, and each output channel corresponds to a certain local feature of the input, namely:
the convolution layer $H = \{h_1, h_2, \ldots, h_O\}$, where $h_j \in R^{N' \times M'}$, $j = 1, 2, \ldots, O$; the j-th channel of the convolution layer corresponds to a two-dimensional feature matrix of size N′ × M′ after feature mapping;
Z3. Set the convolution stride to s1 × s2. All neurons in the same channel of the input layer share a bias $a_i$, $i = 1, 2, \ldots, I$, and neurons in the same group of the convolution layer share a bias $b_j$, $j = 1, 2, \ldots, O$; the parameters to be trained are θ = {W, a, b};
Z4. The convolution limited Boltzmann machine is an energy-based model whose energy function is defined as:
$$E(V, H) = -\sum_{j=1}^{O}\sum_{m=1}^{N'}\sum_{k=1}^{M'} h^j_{m,k}\Big(\sum_{i=1}^{I}\sum_{e=1}^{E}\sum_{f=1}^{F} w_j^{i,e,f}\, v^i_{(m-1)s1+e,\,(k-1)s2+f}\Big) - \sum_{j=1}^{O} b_j \sum_{m=1}^{N'}\sum_{k=1}^{M'} h^j_{m,k} - \sum_{i=1}^{I} a_i \sum_{p=1}^{y}\sum_{q=1}^{s} v^i_{p,q}$$
The joint probability distribution of the input layer V and the convolution layer H is obtained from the energy function, i.e. the joint probability distribution over all neuron values $\{v^i_{p,q}\}$ of the input layer V and all neuron values $\{h^j_{m,k}\}$ of the convolution layer H:
$$P(V, H; \theta) = \frac{1}{Z(\theta)}\, e^{-E(V, H)},$$
where
$$Z(\theta) = \sum_{V, H} e^{-E(V, H)}$$
is the partition function; i = 1, …, I, p = 1, …, y, q = 1, …, s, j = 1, …, O, m = 1, …, N′, k = 1, …, M′.
The marginal probability distribution of the input layer V, i.e. the probability distribution over all neuron values of the input layer V, is:
$$P(V; \theta) = \sum_{H} P(V, H; \theta) = \frac{1}{Z(\theta)} \sum_{H} e^{-E(V, H)}.$$
If the training sample set TS has T samples, the log-likelihood function on the input layer V is:
$$\ln L(TS, \theta) = \sum_{t=1}^{T} \ln P(V^{(t)}; \theta),$$
where L(TS, θ) is the likelihood function on the input layer V:
$$L(TS, \theta) = \prod_{t=1}^{T} P(V^{(t)}; \theta).$$
Z5. A gradient ascent algorithm combined with the contrastive divergence algorithm is adopted to maximize the log-likelihood function and thereby obtain the parameters θ, i.e.
$$\theta^{*} = \arg\max_{\theta}\, \ln L(TS, \theta).$$
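A minimal contrastive-divergence (CD-1) sketch for one convolution limited Boltzmann machine is given below. It deliberately simplifies the patent's setting: a single input channel, stride 1, binary units and no pooling; a faithful reproduction would extend it to the stride s1 × s2, real-valued MFCC inputs and the stacked, pooled machines described above:

```python
import numpy as np
from scipy.signal import correlate2d, convolve2d

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class CRBM:
    """Single-channel convolutional RBM trained with one step of contrastive divergence."""
    def __init__(self, n_kernels=10, kernel_size=(2, 2), seed=0):
        self.rng = np.random.default_rng(seed)
        self.W = self.rng.normal(0.0, 0.01, size=(n_kernels,) + kernel_size)
        self.a = 0.0                          # shared visible bias
        self.b = np.full(n_kernels, -0.1)     # one hidden bias per kernel

    def hidden_probs(self, v):
        # P(h_j = 1 | v): "valid" cross-correlation with each kernel plus its bias
        return np.stack([sigmoid(correlate2d(v, w, mode="valid") + b)
                         for w, b in zip(self.W, self.b)])

    def visible_probs(self, h):
        # P(v = 1 | h): sum of "full" convolutions of the hidden maps with the kernels
        act = sum(convolve2d(hj, wj, mode="full") for hj, wj in zip(h, self.W))
        return sigmoid(act + self.a)

    def cd1_update(self, v0, lr=0.01):
        h0 = self.hidden_probs(v0)
        h0_sample = (self.rng.random(h0.shape) < h0).astype(float)
        v1 = self.visible_probs(h0_sample)            # one Gibbs step (reconstruction)
        h1 = self.hidden_probs(v1)
        for j in range(len(self.W)):                  # positive minus negative statistics
            self.W[j] += lr * (correlate2d(v0, h0[j], mode="valid")
                               - correlate2d(v1, h1[j], mode="valid"))
        self.a += lr * (v0.mean() - v1.mean())
        self.b += lr * (h0.mean(axis=(1, 2)) - h1.mean(axis=(1, 2)))
```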
Z6. A Softmax output layer is adopted as the output layer; it has two neurons, whose outputs respectively represent the probability that the sample is normal speech and the probability that it is abnormal speech, and in actual recognition the category corresponding to the larger probability is taken as the final classification result. The output layer is fully connected to the pooling layer of the topmost convolution limited Boltzmann machine, and the parameters to be trained are the biases $c_1, c_2$ of the two output-layer neurons and the connection weights between the output layer and the topmost pooling layer:
$$WE_i = \{we_i^1, we_i^2, \ldots, we_i^r\},\quad i = 1, 2,$$
where r is the number of neurons of the topmost hidden (pooling) layer; when i = 1, $WE_1$ is the connection weight vector between the 1st neuron of the Softmax output layer and the neurons of the topmost pooling layer, and $we_1^k$ (k ≤ r) is the connection weight between the 1st neuron of the Softmax output layer and the k-th neuron of the topmost pooling layer; when i = 2, $WE_2$ is the connection weight vector between the 2nd neuron of the Softmax output layer and the neurons of the topmost pooling layer, and $we_2^k$ is the connection weight between the 2nd neuron of the Softmax output layer and the k-th neuron of the topmost pooling layer.
Let the output of the topmost pooling layer be $G = \{g_1, g_2, \ldots, g_r\}$, where $g_i$ is the output of the i-th neuron of the topmost pooling layer. The input of the output layer is then $f_i = WE_i\,G^{T} + c_i$, where $f_1$ (i = 1) is the input of the first output-layer neuron and $f_2$ (i = 2) is the input of the second output-layer neuron;
The output values of the output layer are:
$$y_i = \frac{e^{f_i}}{\sum_{j=1}^{2} e^{f_j}},\quad i = 1, 2,$$
where $y_1$ (i = 1), the output of the 1st neuron of the Softmax output layer, represents the probability that the input speech belongs to normal speech, and $y_2$ (i = 2), the output of the 2nd neuron of the Softmax output layer, represents the probability that the input speech belongs to abnormal speech;
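The output-layer computation above transcribes directly into a few lines of NumPy (the shift by the maximum of f is a standard numerical-stability detail, not part of the patent's formula):

```python
import numpy as np

def softmax_head(G, WE, c):
    """Z6 forward pass: G is the flattened topmost pooling output of length r,
    WE the 2 x r connection-weight matrix, c the two output-layer biases."""
    f = WE @ G + c                 # f_i = WE_i * G^T + c_i
    e = np.exp(f - f.max())        # numerically stable exponentials
    return e / e.sum()             # y_1 = P(normal speech), y_2 = P(abnormal speech)
```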
Z7. When the Softmax classifier is trained, training is carried out in Mini-batch mode: T training samples are grabbed at a time for training and the parameters are updated once, using a loss function built from the error sum of squares J and the F1 measure defined below,
where J is the sum of squared errors during training,
$$J = \sum_{j=1}^{T}\sum_{i=1}^{2}\big(\hat{y}_i^{j} - y_i^{j}\big)^2,$$
$\hat{y}_i^{j}$ is the ideal output value of the i-th neuron of the classifier for the grabbed j-th sample: if the j-th sample belongs to normal speech, $\hat{y}_1^{j} = 1$ and $\hat{y}_2^{j} = 0$; otherwise $\hat{y}_1^{j} = 0$ and $\hat{y}_2^{j} = 1$. $y_i^{j}$ is the actual output value of the i-th neuron of the classifier for the grabbed j-th sample;
F1 is the harmonic mean of the precision P and the recall R, i.e.:
$$F1 = \frac{2PR}{P + R},\qquad P = \frac{TP}{TP + FP},\qquad R = \frac{TP}{TP + FN},$$
the accuracy rate P is the ratio of the number of samples which are correctly identified as abnormal speech to the number of samples which are identified as abnormal speech, the recall rate is the ratio of the number of samples which are correctly identified as abnormal speech to the number of samples which are all abnormal speech, TP is the number of samples which are correctly identified as abnormal speech in T samples, FP is the number of samples which are incorrectly identified as abnormal speech in T samples, and FN is the number of samples which are incorrectly identified as normal speech in T samples;
during training, updating the parameters by adopting a gradient descent algorithm until the loss function converges to a set value or the iteration times reach the maximum iteration times, and ending the training.
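A small helper shows how the per-batch metrics and the stopping rule described above can be computed; the loss-convergence value 0.004 and the 1000-iteration cap are the illustrative defaults given in Embodiment 1:

```python
def batch_metrics(pred_abnormal, true_abnormal):
    """Precision P, recall R and F1 over one batch, following the definitions above."""
    tp = sum(p and t for p, t in zip(pred_abnormal, true_abnormal))
    fp = sum(p and not t for p, t in zip(pred_abnormal, true_abnormal))
    fn = sum(t and not p for p, t in zip(pred_abnormal, true_abnormal))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def should_stop(loss, iteration, loss_target=0.004, max_iterations=1000):
    """Training ends when the loss converges to the set value or the iteration cap is hit."""
    return loss <= loss_target or iteration >= max_iterations
```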
Step S6, the specific process is: in the online identification part, a deep neural network finishes the loading of a network structure and weight coefficients obtained by training, a hidden Markov model finishes the loading of a pre-learned model, and the preprocessing and the feature extraction are carried out on the real-time input voice; respectively establishing HMM templates of normal and abnormal voices, namely training the HMM template by using the normal voices and establishing the HMM template of the normal voices; training an HMM template by using the abnormal speech to establish an HMM template of the abnormal speech; and during online recognition, calling different HMM templates according to a judgment result output by the convolution depth confidence network, and performing template matching on a Mel frequency cepstrum coefficient of an input voice signal to obtain a final voice recognition result.
Compared with the prior art, the invention has the following advantages and beneficial effects:
the method combines the convolution deep learning confidence network and the hidden Markov model, utilizes the multilayer nonlinear transformation layers of the convolution deep confidence network to map the input MFCC characteristics to a higher dimensional space, more comprehensively represents voice information, and effectively distinguishes abnormal voice and normal voice; and then, the time series modeling capability of the hidden Markov model is utilized to respectively model the voices in different states, so that the recognition accuracy of the voices is greatly improved.
Drawings
FIG. 1 is a flowchart of an off-line training method for abnormal speech discrimination based on deep learning according to the present invention;
FIG. 2 is a flow chart of the online recognition of the abnormal speech distinguishing method based on deep learning according to the present invention;
FIG. 3 is a schematic diagram of a convolution deep belief network recognition speech state of an abnormal speech discrimination method based on deep learning according to the present invention.
Detailed Description
The present invention will be described in further detail with reference to examples and drawings, but the present invention is not limited thereto.
Example 1
An abnormal speech distinguishing method based on deep learning comprises the following steps:
the first step is as follows: acquiring input voice, and carrying out preprocessing such as resampling, pre-emphasis, framing and windowing on the input voice to obtain preprocessed voice;
the resampling specifically comprises: the input voice has different sampling frequencies and coding modes, so that the original input voice signal is resampled to facilitate the processing and analysis of data, and the sampling frequencies and the coding modes are unified; the sampling frequency is 22.05kHz, and the coding mode is wav format.
The pre-emphasis is specifically as follows: the power spectrum of the sound signal is reduced along with the increase of the frequency, most energy is concentrated in a low-frequency range, in order to improve the high-frequency part of the original sound signal, the original input sound signal is subjected to pre-emphasis processing, a first-order FIR high-pass filter is adopted, and the transmission function of the FIR high-pass filter is as follows:
$H(z) = 1 - a z^{-1}$
wherein a is a high-pass filter coefficient and takes a value of 0.93; the pre-emphasized speech signal is:
y(n) = sp(n) - a·sp(n-1), n = 0, 1, ..., Length-1
wherein, y (n) is the pre-emphasis voice signal, sp (n) is the voice signal before pre-emphasis, sp (n-1) is the time shift of the voice signal, and Length is the voice signal Length;
the frame windowing specifically comprises: the method comprises the steps of slicing the voice, intercepting an audio signal with a fixed length in the input voice at fixed time intervals into a frame, and performing frame division windowing by adopting a Hamming window with the frame length of 25ms and the frame shift of 10 ms.
The second step is that: extracting Mel frequency cepstrum coefficient characteristics of each frame of voice of the preprocessed voice;
the specific process is as follows:
V1. Design L Mel frequency filters with a triangular shape. Let $W_l(k)$, $l = 1, 2, \ldots, L$, be the frequency response of the l-th Mel frequency filter, Fs the resampling frequency of the speech signal, and Q the frame length of one frame of the speech signal, which is also the number of points of the Fourier transform; $f_l$ and $f_h$ are respectively the lower and upper cut-off frequencies of the speech signal. A Q-point fast Fourier transform of a frame of speech with frame length Q yields Q frequency components, and o(l), c(l), h(l) are respectively the index values of the lower-limit, center and upper-limit frequencies of the l-th Mel frequency filter among the Q frequency components; o(l), c(l), h(l) satisfy the following relationships:
c(l-1)=o(l),
o(l+1)=c(l),
h(l)=c(l+1),
that is, the position of the center frequency of the current filter is at the side lobe attenuation position of the next filter;
At the same time, $o(l)|_{l=1} = f_l$ and $h(l)|_{l=L} = f_h$. Thus, the index value of the center frequency of the l-th Mel frequency filter among the Q frequency components is expressed as:
$$c(l) = \left\lfloor \frac{Q}{F_s}\,\mathrm{Mel}^{-1}\!\Big(\mathrm{Mel}(f_l) + l\,\frac{\mathrm{Mel}(f_h) - \mathrm{Mel}(f_l)}{L+1}\Big) \right\rfloor$$
where $\mathrm{Mel}(f_1)$ is the function mapping the actual frequency to the Mel frequency, $\mathrm{Mel}^{-1}(f_2)$ is the inverse of $\mathrm{Mel}(f_1)$, $f_1$ is the actual frequency, and $f_2$ is the Mel frequency:
$$\mathrm{Mel}(f_1) = 2595\,\log_{10}\!\Big(1 + \frac{f_1}{700}\Big),$$
$$\mathrm{Mel}^{-1}(f_2) = 700\,\Big(10^{f_2/2595} - 1\Big).$$
The frequency response of the l-th Mel frequency filter is:
$$W_l(k) = \begin{cases} \dfrac{k - o(l)}{c(l) - o(l)}, & o(l) \le k \le c(l) \\[4pt] \dfrac{h(l) - k}{h(l) - c(l)}, & c(l) < k \le h(l) \\[4pt] 0, & \text{otherwise.} \end{cases}$$
V2. Perform a Q-point fast Fourier transform on a frame of speech signal x(n), n = 0, 1, …, Q-1, obtained after resampling, pre-emphasis and framing/windowing, to obtain the spectrum X(k) and the magnitude spectrum |X(k)|:
$$|X(k)| = \left|\sum_{n=0}^{Q-1} x(n)\,e^{-j 2\pi n k / Q}\right|,\quad k = 0, 1, \ldots, Q-1.$$
v3, passing a frame of speech signal through a mel-frequency filter bank to obtain an output amplitude spectrum of each filter:
$$m(l) = \sum_{k=o(l)}^{h(l)} W_l(k)\,|X(k)|,\quad l = 1, 2, \ldots, L$$
v4, carrying out logarithm operation on the output amplitude spectrums of all the filters, and further carrying out discrete cosine transform to obtain a Mel frequency cepstrum coefficient:
$$c_{mfcc}(i) = \sqrt{\tfrac{2}{L}}\,\sum_{l=1}^{L} \log\big(m(l)\big)\,\cos\!\Big(\frac{\pi i\,(2l-1)}{2L}\Big),\quad i = 1, 2, \ldots, L$$
The 2nd to (M+1)-th of the L coefficients are taken to form the M-dimensional Mel frequency cepstrum coefficient feature vector of each frame: $C = \{c_{mfcc}(2), c_{mfcc}(3), \ldots, c_{mfcc}(M+1)\}$.
The third step: regulating the voice sections with different frame numbers to a fixed frame number, and obtaining a corresponding Mel frequency cepstrum coefficient characteristic matrix for each voice section;
Because the time lengths of different speech segments differ, the numbers of frames contained in different speech segments may not be consistent; since the input of the neural network must have a fixed dimension, the speech signal needs to be time-warped so that the network input has a fixed size. Time warping is performed with a feature point sequence method, and the warping process is specifically as follows:
Y1. Build an (n-N+1)-layer time warping network, where n is the frame number of the input speech segment and N is the target frame number after time warping; the first layer has n frames and the last layer has N frames.
The i-th layer of the time warping network has n-(i-1) frames, each frame corresponding to one feature vector, forming a group of n-(i-1) feature vectors $\{C^i_1, C^i_2, \ldots, C^i_{n-(i-1)}\}$, where $i = 1, 2, \ldots, n-N+1$, $k = 1, 2, \ldots, n-(i-1)$, and $C^i_k$ is the Mel frequency cepstrum coefficient feature vector of the k-th frame of speech at the i-th layer of the network.
In particular, the feature vector group of the first layer of the network is the group of feature vectors input to the network: $\{C^1_1, C^1_2, \ldots, C^1_n\}$. Let $u^i_k$ denote the weight of the speech frame represented by $C^i_k$; when i = 1, every frame weight takes the initial value $u^1_k = 1$.
Y2. Let $d^i_k$ denote the distance between $C^i_k$ and $C^i_{k+1}$, i.e. $d^i_k = d(C^i_k, C^i_{k+1})$. For the i-th layer, find j such that for any k ≠ j, $d^i_j \le d^i_k$.
Y3. Merge the two closest frames and reduce the frame number by one, namely:
$$C^{i+1}_j = \frac{u^i_j\,C^i_j + u^i_{j+1}\,C^i_{j+1}}{u^i_j + u^i_{j+1}},\qquad u^{i+1}_j = u^i_j + u^i_{j+1},\qquad C^{i+1}_k = \begin{cases} C^i_k, & k < j \\ C^i_{k+1}, & k > j \end{cases}$$
where $C^{i+1}_j$ is the Mel frequency cepstrum coefficient feature vector of the j-th frame of speech at layer i+1 of the network; $C^i_{j+1}$ is that of the (j+1)-th frame at layer i; $C^{i+1}_k$ is that of the k-th frame at layer i+1; $C^i_{k+1}$ is that of the (k+1)-th frame at layer i; $u^i_j$ and $u^i_{j+1}$ are the weights of the j-th and (j+1)-th frames at layer i; $u^{i+1}_j$ is the weight of the j-th frame at layer i+1; and $d^i_j$ is the distance between the Mel frequency cepstrum coefficient feature vectors of the j-th and (j+1)-th frames at layer i.
Y4. Solve for the weights and inter-frame distances of layer i+1 in the same way, i.e. $u^{i+1}_k = u^i_k$ for k < j, $u^{i+1}_k = u^i_{k+1}$ for k > j, and recompute $d^{i+1}_k$ for the remaining frames.
Repeat this for i = 1 to i = n-N+1; each pass reduces the frame number by one, so that the n-frame speech signal is finally warped to a fixed N frames.
After time warping, each speech segment corresponds to one Mel frequency cepstrum coefficient feature matrix. The dimension of the matrix is N × M, where N is the target frame number of the time warping of the segment and M is the order of the Mel frequency cepstrum coefficients of each frame of speech; the Mel frequency cepstrum coefficients are 12-dimensional, i.e. M = 12.
The fourth step: establishing a convolution depth confidence network. The depth confidence network is formed by stacking two convolution limited Boltzmann machines and comprises 2 convolution layers, 2 pooling layers, 1 visible layer and 1 output layer. For the first convolution limited Boltzmann machine, the number of visible-layer neurons is N × M = 200 × 12, the number of convolution kernels is 10, the kernel size is 2 × 2, the convolution stride is 2 × 2, and the initial kernel values are random values drawn from a Gaussian distribution with mean 0 and variance 0.01. The initial bias of the visible layer is 0 and the initial bias of the convolution layer is -0.1. One iteration takes 100 samples, and the number of iterations is 100. The pooling size of the first pooling layer is 2 × 2 with a pooling stride of 2 × 2. For the second convolution limited Boltzmann machine, the number of convolution kernels is 10, the kernel size is 10 × 2 × 2, the convolution stride is 2 × 2, the initial kernel values are random values drawn from a Gaussian distribution with mean 0 and variance 0.01, and the initial bias of the second convolution layer is -0.1. One iteration takes 100 samples, and the number of iterations is 100. The pooling size of the second pooling layer is 2 × 2 with a pooling stride of 2 × 2. All convolution layers use the Sigmoid activation function. The number of neurons in the output layer is 2, outputting the posterior probabilities of normal speech and abnormal speech. For the output layer, the convergence value of the loss function is set to 0.004 and the maximum number of iterations is 1000.
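For reference, the hyperparameters of this embodiment can be collected in a single configuration dictionary (the key names are illustrative; this only restates the values above and is not an executable model definition):

```python
CDBN_CONFIG = {
    "input_shape": (200, 12),                 # N x M MFCC feature matrix
    "crbm1": {"kernels": 10, "kernel_size": (2, 2), "stride": (2, 2),
              "init_std": 0.01, "visible_bias": 0.0, "hidden_bias": -0.1,
              "batch_size": 100, "iterations": 100,
              "pool_size": (2, 2), "pool_stride": (2, 2)},
    "crbm2": {"kernels": 10, "kernel_size": (10, 2, 2), "stride": (2, 2),
              "init_std": 0.01, "hidden_bias": -0.1,
              "batch_size": 100, "iterations": 100,
              "pool_size": (2, 2), "pool_stride": (2, 2)},
    "activation": "sigmoid",
    "output": {"neurons": 2, "loss_target": 0.004, "max_iterations": 1000},
}
```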
The fifth step: inputting the mel frequency cepstrum coefficient feature matrix into a convolution depth confidence network, training, and classifying states of input voice, wherein fig. 1 is a training flow chart of an off-line state;
firstly, training the convolution limited Boltzmann machine at the bottom layer, and then training the convolution limited Boltzmann machine at the top layer, specifically:
z1, the number of input channels of the convolution limited Boltzmann machine is set as I, and each channel corresponds to a two-dimensional matrix with the size of y multiplied by s, namely:
$$V = \{v_1, v_2, \ldots, v_I\},\quad v_i \in R^{y \times s},\ i = 1, 2, \ldots, I,$$
where V is the input layer of the convolution limited Boltzmann machine and $v_i \in R^{y \times s}$ is the i-th channel of the input layer;
in particular, in the convolution limited Boltzmann machine at the bottom layer, I = 1, y = N and s = M, i.e. the number of input channels of the bottom convolution limited Boltzmann machine is 1, corresponding to the two-dimensional Mel frequency cepstrum coefficient feature matrix of size N × M input to the convolution depth confidence network, where N is the time-warped target frame number and M is the order of the Mel frequency cepstrum coefficients of each frame of speech;
Z2. The convolution uses O convolution kernels, each of which is a three-dimensional weight matrix of size I × E × F, i.e. the convolution kernels are $W = \{w_1, w_2, \ldots, w_O\}$, where $w_j \in R^{I \times E \times F}$, $j = 1, 2, \ldots, O$, and $w_j$ is the j-th convolution kernel; the number of output channels of the convolution limited Boltzmann machine is therefore O, and each output channel corresponds to a certain local feature of the input, namely:
the convolution layer $H = \{h_1, h_2, \ldots, h_O\}$, where $h_j \in R^{N' \times M'}$, $j = 1, 2, \ldots, O$; the j-th channel of the convolution layer corresponds to a two-dimensional feature matrix of size N′ × M′ after feature mapping;
Z3. Set the convolution stride to s1 × s2. All neurons in the same channel of the input layer share a bias $a_i$, $i = 1, 2, \ldots, I$, and neurons in the same group of the convolution layer share a bias $b_j$, $j = 1, 2, \ldots, O$; the parameters to be trained are θ = {W, a, b};
Z4. The convolution limited Boltzmann machine is an energy-based model whose energy function is defined as:
$$E(V, H) = -\sum_{j=1}^{O}\sum_{m=1}^{N'}\sum_{k=1}^{M'} h^j_{m,k}\Big(\sum_{i=1}^{I}\sum_{e=1}^{E}\sum_{f=1}^{F} w_j^{i,e,f}\, v^i_{(m-1)s1+e,\,(k-1)s2+f}\Big) - \sum_{j=1}^{O} b_j \sum_{m=1}^{N'}\sum_{k=1}^{M'} h^j_{m,k} - \sum_{i=1}^{I} a_i \sum_{p=1}^{y}\sum_{q=1}^{s} v^i_{p,q}$$
The joint probability distribution of the input layer V and the convolution layer H is obtained from the energy function, i.e. the joint probability distribution over all neuron values $\{v^i_{p,q}\}$ of the input layer V and all neuron values $\{h^j_{m,k}\}$ of the convolution layer H:
$$P(V, H; \theta) = \frac{1}{Z(\theta)}\, e^{-E(V, H)},$$
where
$$Z(\theta) = \sum_{V, H} e^{-E(V, H)}$$
is the partition function; i = 1, …, I, p = 1, …, y, q = 1, …, s, j = 1, …, O, m = 1, …, N′, k = 1, …, M′.
The marginal probability distribution of the input layer V, i.e. the probability distribution over all neuron values of the input layer V, is:
$$P(V; \theta) = \sum_{H} P(V, H; \theta) = \frac{1}{Z(\theta)} \sum_{H} e^{-E(V, H)}.$$
If the training sample set TS has T samples, the log-likelihood function on the input layer V is:
$$\ln L(TS, \theta) = \sum_{t=1}^{T} \ln P(V^{(t)}; \theta),$$
where L(TS, θ) is the likelihood function on the input layer V:
$$L(TS, \theta) = \prod_{t=1}^{T} P(V^{(t)}; \theta).$$
Z5. A gradient ascent algorithm combined with the contrastive divergence algorithm is adopted to maximize the log-likelihood function and thereby obtain the parameters θ, i.e.
$$\theta^{*} = \arg\max_{\theta}\, \ln L(TS, \theta).$$
Z6. A Softmax output layer is adopted as the output layer; it has two neurons, whose outputs respectively represent the probability that the sample is normal speech and the probability that it is abnormal speech, and in actual recognition the category corresponding to the larger probability is taken as the final classification result. The output layer is fully connected to the pooling layer of the topmost convolution limited Boltzmann machine, and the parameters to be trained are the biases $c_1, c_2$ of the two output-layer neurons and the connection weights between the output layer and the topmost pooling layer:
$$WE_i = \{we_i^1, we_i^2, \ldots, we_i^r\},\quad i = 1, 2,$$
where r is the number of neurons of the topmost hidden (pooling) layer; when i = 1, $WE_1$ is the connection weight vector between the 1st neuron of the Softmax output layer and the neurons of the topmost pooling layer, and $we_1^k$ (k ≤ r) is the connection weight between the 1st neuron of the Softmax output layer and the k-th neuron of the topmost pooling layer; when i = 2, $WE_2$ is the connection weight vector between the 2nd neuron of the Softmax output layer and the neurons of the topmost pooling layer, and $we_2^k$ is the connection weight between the 2nd neuron of the Softmax output layer and the k-th neuron of the topmost pooling layer.
Let the output of the topmost pooling layer be $G = \{g_1, g_2, \ldots, g_r\}$, where $g_i$ is the output of the i-th neuron of the topmost pooling layer. The input of the output layer is then $f_i = WE_i\,G^{T} + c_i$, where $f_1$ (i = 1) is the input of the first output-layer neuron and $f_2$ (i = 2) is the input of the second output-layer neuron;
The output values of the output layer are:
$$y_i = \frac{e^{f_i}}{\sum_{j=1}^{2} e^{f_j}},\quad i = 1, 2,$$
where $y_1$ (i = 1), the output of the 1st neuron of the Softmax output layer, represents the probability that the input speech belongs to normal speech, and $y_2$ (i = 2), the output of the 2nd neuron of the Softmax output layer, represents the probability that the input speech belongs to abnormal speech;
Z7. When the Softmax classifier is trained, training is carried out in Mini-batch mode: T training samples are grabbed at a time for training and the parameters are updated once, using a loss function built from the error sum of squares J and the F1 measure defined below,
where J is the sum of squared errors during training,
$$J = \sum_{j=1}^{T}\sum_{i=1}^{2}\big(\hat{y}_i^{j} - y_i^{j}\big)^2,$$
$\hat{y}_i^{j}$ is the ideal output value of the i-th neuron of the classifier for the grabbed j-th sample: if the j-th sample belongs to normal speech, $\hat{y}_1^{j} = 1$ and $\hat{y}_2^{j} = 0$; otherwise $\hat{y}_1^{j} = 0$ and $\hat{y}_2^{j} = 1$. $y_i^{j}$ is the actual output value of the i-th neuron of the classifier for the grabbed j-th sample;
F1 is the harmonic mean of the precision P and the recall R, i.e.:
$$F1 = \frac{2PR}{P + R},\qquad P = \frac{TP}{TP + FP},\qquad R = \frac{TP}{TP + FN},$$
the accuracy rate P is the ratio of the number of samples which are correctly identified as abnormal speech to the number of samples which are identified as abnormal speech, the recall rate is the ratio of the number of samples which are correctly identified as abnormal speech to the number of samples which are all abnormal speech, TP is the number of samples which are correctly identified as abnormal speech in T samples, FP is the number of samples which are incorrectly identified as abnormal speech in T samples, and FN is the number of samples which are incorrectly identified as normal speech in T samples;
during training, updating the parameters by adopting a gradient descent algorithm until the loss function converges to a set value or the iteration times reach the maximum iteration times, and ending the training.
And a sixth step: and calling a hidden Markov model to perform template matching according to the classification result to obtain a voice recognition result:
an online identification part, as shown in fig. 2, a deep neural network finishes the loading of a network structure and weight coefficients obtained by training, a hidden markov model finishes the loading of a pre-learned model, and the preprocessing and feature extraction are carried out on the real-time input voice; respectively establishing HMM templates of normal and abnormal voices, namely training the HMM template by using the normal voices and establishing the HMM template of the normal voices; training an HMM template by using the abnormal speech to establish an HMM template of the abnormal speech; during online recognition, different HMM templates are called according to the judgment result output by the convolution depth confidence network, and the mel-frequency cepstrum coefficient of the input speech signal is subjected to template matching to obtain a final speech recognition result, wherein the schematic flow chart is shown in fig. 3.
The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and all such changes, modifications, substitutions, combinations, and simplifications are intended to be included in the scope of the present invention.

Claims (8)

1. An abnormal speech distinguishing method based on deep learning is characterized by comprising the following steps:
s1, acquiring input voice, and preprocessing the input voice to obtain preprocessed voice, wherein the preprocessing comprises resampling, pre-emphasis, framing and windowing;
s2, extracting a Mel frequency cepstrum coefficient feature vector for each frame of voice of the preprocessed voice by utilizing a Mel frequency filter bank and Fourier transform;
s3, regulating the voice sections with different frame numbers to a fixed frame number, and obtaining a corresponding Mel frequency cepstrum coefficient characteristic matrix for each voice section;
s4, establishing a convolution depth confidence network; the convolution depth confidence network is formed by stacking more than one convolution limited Boltzmann machine from top to bottom, and the output layer adopts a Softmax classifier; the convolution limited Boltzmann machine is composed of a layer of input layer V and a layer of convolution layer H;
s5, inputting the Mel frequency cepstrum coefficient feature matrix into a convolution depth confidence network, training, and classifying the state of input voice; the process is as follows: firstly, training the convolution limited Boltzmann machine at the bottom layer, and then training the convolution limited Boltzmann machine at the top layer, specifically:
z1, the number of input channels of the convolution limited Boltzmann machine is set as I, and each channel corresponds to a two-dimensional matrix with the size of y multiplied by s, namely:
$$V = \{v_1, v_2, \ldots, v_I\},\quad v_i \in R^{y \times s},\ i = 1, 2, \ldots, I,$$
wherein V is the input layer of the convolution limited Boltzmann machine and $v_i \in R^{y \times s}$ is the i-th channel of the input layer;
in the convolution limited Boltzmann machine at the bottom layer, I = 1, y = N and s = M, i.e. the number of input channels of the bottom convolution limited Boltzmann machine is 1, corresponding to the two-dimensional Mel frequency cepstrum coefficient feature matrix of size N × M input to the convolution depth confidence network, where N is the time-warped target frame number and M is the order of the Mel frequency cepstrum coefficients of each frame of speech;
Z2, the convolution uses O convolution kernels, each of which is a three-dimensional weight matrix of size I × E × F, i.e. the convolution kernels are $W = \{w_1, w_2, \ldots, w_O\}$, where $w_j \in R^{I \times E \times F}$, $j = 1, 2, \ldots, O$, and $w_j$ is the j-th convolution kernel; the number of output channels of the convolution limited Boltzmann machine is therefore O, and each output channel corresponds to a certain local feature of the input, namely:
the convolution layer $H = \{h_1, h_2, \ldots, h_O\}$, where $h_j \in R^{N' \times M'}$, $j = 1, 2, \ldots, O$, denotes the j-th channel of the convolution layer and corresponds to a two-dimensional feature matrix of size N′ × M′ after feature mapping;
Z3, the convolution stride is set to s1 × s2; all neurons in the same channel of the input layer share a bias $a_i$, $i = 1, 2, \ldots, I$, and neurons in the same group of the convolution layer share a bias $b_j$, $j = 1, 2, \ldots, O$; the parameters to be trained are θ = {W, a, b};
Z4, the convolution limited Boltzmann machine is an energy-based model whose energy function is defined as:
$$E(V, H) = -\sum_{j=1}^{O}\sum_{m=1}^{N'}\sum_{k=1}^{M'} h^j_{m,k}\Big(\sum_{i=1}^{I}\sum_{e=1}^{E}\sum_{f=1}^{F} w_j^{i,e,f}\, v^i_{(m-1)s1+e,\,(k-1)s2+f}\Big) - \sum_{j=1}^{O} b_j \sum_{m=1}^{N'}\sum_{k=1}^{M'} h^j_{m,k} - \sum_{i=1}^{I} a_i \sum_{p=1}^{y}\sum_{q=1}^{s} v^i_{p,q};$$
the joint probability distribution of the input layer V and the convolution layer H is obtained from the energy function, i.e. the joint probability distribution over all neuron values $\{v^i_{p,q}\}$ of the input layer V and all neuron values $\{h^j_{m,k}\}$ of the convolution layer H:
$$P(V, H; \theta) = \frac{1}{Z(\theta)}\, e^{-E(V, H)},$$
wherein $Z(\theta) = \sum_{V, H} e^{-E(V, H)}$ is the partition function; i = 1, …, I, p = 1, …, y, q = 1, …, s, j = 1, …, O, m = 1, …, N′, k = 1, …, M′;
the marginal probability distribution of the input layer V, i.e. the probability distribution over all neuron values of the input layer V, is:
$$P(V; \theta) = \sum_{H} P(V, H; \theta) = \frac{1}{Z(\theta)} \sum_{H} e^{-E(V, H)};$$
if the training sample set TS has T samples, the log-likelihood function on the input layer V is:
$$\ln L(TS, \theta) = \sum_{t=1}^{T} \ln P(V^{(t)}; \theta),$$
where L(TS, θ) is the likelihood function on the input layer V:
$$L(TS, \theta) = \prod_{t=1}^{T} P(V^{(t)}; \theta);$$
Z5, a gradient ascent algorithm combined with the contrastive divergence algorithm is adopted to maximize the log-likelihood function and thereby obtain the parameters θ, i.e.
$$\theta^{*} = \arg\max_{\theta}\, \ln L(TS, \theta);$$
Z6, a Softmax output layer is adopted as the output layer; it has two neurons, whose outputs respectively represent the probability that the sample is normal speech and the probability that it is abnormal speech, and in actual recognition the category corresponding to the larger probability is taken as the final classification result; the output layer is fully connected to the pooling layer of the topmost convolution limited Boltzmann machine, and the parameters to be trained are the biases $c_1, c_2$ of the two output-layer neurons and the connection weights between the output layer and the topmost pooling layer:
$$WE_i = \{we_i^1, we_i^2, \ldots, we_i^r\},\quad i = 1, 2,$$
wherein r is the number of neurons of the topmost hidden layer; when i = 1, $WE_1$ is the connection weight vector between the 1st neuron of the Softmax output layer and the neurons of the topmost pooling layer, and $we_1^k$ (k ≤ r) is the connection weight between the 1st neuron of the Softmax output layer and the k-th neuron of the topmost pooling layer; when i = 2, $WE_2$ is the connection weight vector between the 2nd neuron of the Softmax output layer and the neurons of the topmost pooling layer, and $we_2^k$ is the connection weight between the 2nd neuron of the Softmax output layer and the k-th neuron of the topmost pooling layer;
let the output of the topmost pooling layer be $G = \{g_1, g_2, \ldots, g_r\}$, where $g_i$ is the output of the i-th neuron of the topmost pooling layer; the input of the output layer is then $f_i = WE_i\,G^{T} + c_i$, where $f_1$ (i = 1) is the input of the first output-layer neuron and $f_2$ (i = 2) is the input of the second output-layer neuron;
the output values of the output layer are
$$y_i = \frac{e^{f_i}}{\sum_{j=1}^{2} e^{f_j}},\quad i = 1, 2,$$
wherein $y_1$ (i = 1), the output of the 1st neuron of the Softmax output layer, represents the probability that the input speech belongs to normal speech, and $y_2$ (i = 2), the output of the 2nd neuron of the Softmax output layer, represents the probability that the input speech belongs to abnormal speech;
Z7, when the Softmax classifier is trained, training is carried out in Mini-batch mode: T training samples are grabbed at a time for training and the parameters are updated once, using a loss function built from the error sum of squares J and the F1 measure defined below,
wherein J is the sum of squared errors during training,
$$J = \sum_{j=1}^{T}\sum_{i=1}^{2}\big(\hat{y}_i^{j} - y_i^{j}\big)^2,$$
$\hat{y}_i^{j}$ is the ideal output value of the i-th neuron of the classifier for the grabbed j-th sample: if the j-th sample belongs to normal speech, $\hat{y}_1^{j} = 1$ and $\hat{y}_2^{j} = 0$; otherwise $\hat{y}_1^{j} = 0$ and $\hat{y}_2^{j} = 1$; $y_i^{j}$ is the actual output value of the i-th neuron of the classifier for the grabbed j-th sample;
F1 is the harmonic mean of the precision P and the recall R, i.e.:
$$F1 = \frac{2PR}{P + R},\qquad P = \frac{TP}{TP + FP},\qquad R = \frac{TP}{TP + FN},$$
the accuracy rate P is the ratio of the number of samples which are correctly identified as abnormal speech to the number of samples which are identified as abnormal speech, the recall rate is the ratio of the number of samples which are correctly identified as abnormal speech to the number of samples which are all abnormal speech, TP is the number of samples which are correctly identified as abnormal speech in T samples, FP is the number of samples which are incorrectly identified as abnormal speech in T samples, and FN is the number of samples which are incorrectly identified as normal speech in T samples;
during training, updating parameters by adopting a gradient descent algorithm until the loss function converges to a set value or the iteration times reach the maximum iteration times, and ending the training;
and S6, calling the hidden Markov model to carry out template matching according to the classification result to obtain a voice recognition result.
2. The abnormal speech distinguishing method based on deep learning of claim 1, wherein in step S1,
the sampling frequency of the resampling is 22.05kHz, and the encoding mode is wav format;
the pre-emphasis adopts a first-order FIR high-pass filter, and the transmission function of the pre-emphasis is as follows:
$H(z) = 1 - a z^{-1}$
wherein a is a high-pass filter coefficient and takes a value of 0.93; the pre-emphasized speech signal is:
y(n) = sp(n) - a·sp(n-1), n = 0, 1, …, Length-1;
wherein, y (n) is the pre-emphasis voice signal, sp (n) is the voice signal before pre-emphasis, sp (n-1) is the time shift of the voice signal, and Length is the voice signal Length;
the frame windowing specifically comprises: the method comprises the steps of slicing the voice, intercepting an audio signal with a fixed length in the input voice at fixed time intervals into a frame, and performing frame division windowing by adopting a Hamming window with the frame length of 25ms and the frame shift of 10 ms.
3. The abnormal speech distinguishing method based on deep learning of claim 1, wherein the step S2 specifically includes:
V1, design L Mel frequency filters with a triangular shape; let $W_l(k)$, $l = 1, 2, \ldots, L$, be the frequency response of the l-th Mel frequency filter, Fs the resampling frequency of the speech signal, and Q the frame length of one frame of the speech signal, which is also the number of points of the Fourier transform; $f_l$ and $f_h$ are respectively the lower and upper cut-off frequencies of the speech signal; a Q-point fast Fourier transform of a frame of speech with frame length Q yields Q frequency components, and o(l), c(l), h(l) are respectively the index values of the lower-limit, center and upper-limit frequencies of the l-th Mel frequency filter among the Q frequency components; o(l), c(l), h(l) satisfy the following relationships:
c(l-1)=o(l),
o(l+1)=c(l),
h(l)=c(l+1),
that is, the position of the center frequency of the current filter is at the side lobe attenuation position of the next filter;
at the same time, $o(l)|_{l=1} = f_l$ and $h(l)|_{l=L} = f_h$; thus, the index value of the center frequency of the l-th Mel frequency filter among the Q frequency components is expressed as:
$$c(l) = \left\lfloor \frac{Q}{F_s}\,\mathrm{Mel}^{-1}\!\Big(\mathrm{Mel}(f_l) + l\,\frac{\mathrm{Mel}(f_h) - \mathrm{Mel}(f_l)}{L+1}\Big) \right\rfloor$$
wherein $\mathrm{Mel}(f_1)$ is the function mapping the actual frequency to the Mel frequency, $\mathrm{Mel}^{-1}(f_2)$ is the inverse of $\mathrm{Mel}(f_1)$, $f_1$ is the actual frequency, and $f_2$ is the Mel frequency:
$$\mathrm{Mel}(f_1) = 2595\,\log_{10}\!\Big(1 + \frac{f_1}{700}\Big),$$
$$\mathrm{Mel}^{-1}(f_2) = 700\,\Big(10^{f_2/2595} - 1\Big);$$
the frequency response of the l-th Mel frequency filter is:
$$W_l(k) = \begin{cases} \dfrac{k - o(l)}{c(l) - o(l)}, & o(l) \le k \le c(l) \\[4pt] \dfrac{h(l) - k}{h(l) - c(l)}, & c(l) < k \le h(l) \\[4pt] 0, & \text{otherwise,} \end{cases}$$
wherein k is a subscript value of the first frequency component in the Q frequency components;
V2, perform a Q-point fast Fourier transform on a frame of speech signal x(n), n = 0, 1, …, Q-1 (Q < Length), obtained after resampling, pre-emphasis and framing/windowing, to obtain the spectrum X(k) and the magnitude spectrum |X(k)|:
$$|X(k)| = \left|\sum_{n=0}^{Q-1} x(n)\,e^{-j 2\pi n k / Q}\right|,\quad k = 0, 1, \ldots, Q-1;$$
v3, passing a frame of speech signal through a mel-frequency filter bank to obtain an output amplitude spectrum of each filter:
$$m(l) = \sum_{k=o(l)}^{h(l)} W_l(k)\,|X(k)|,\quad l = 1, 2, \ldots, L;$$
v4, carrying out logarithm operation on the output amplitude spectrums of all the filters, and further carrying out discrete cosine transform to obtain a Mel frequency cepstrum coefficient:
$$c_{mfcc}(i) = \sqrt{\tfrac{2}{L}}\,\sum_{l=1}^{L} \log\big(m(l)\big)\,\cos\!\Big(\frac{\pi i\,(2l-1)}{2L}\Big),\quad i = 1, 2, \ldots, L;$$
the 2nd to (M+1)-th of the L coefficients are taken to form the M-dimensional Mel frequency cepstrum coefficient feature vector of each frame: $C = \{c_{mfcc}(2), c_{mfcc}(3), \ldots, c_{mfcc}(M+1)\}$.
4. The method for distinguishing abnormal speech according to claim 1, wherein in step S2, the dimension of the Mel frequency cepstrum coefficient feature matrix is N × M, N is the number of target frames for time warping of the speech, M is the order of the Mel frequency cepstrum coefficients of each frame of speech, and the Mel frequency cepstrum coefficients are 12-dimensional, that is, M = 12.
5. The method for distinguishing abnormal speech according to claim 1, wherein in step S3, the normalization specifically comprises:
y1, building a time-warping network with n − N + 1 layers, wherein n is the number of frames of the input speech, N is the target number of frames of the time warping, the number of frames of the first layer is n, and the number of frames of the last layer is N;
the i-th layer of the time-warping network has n − (i − 1) frames, each frame corresponding to one feature vector, forming a group of n − (i − 1) feature vectors:
{C_1^(i), C_2^(i), …, C_(n−(i−1))^(i)},
wherein C_j^(i) is the Mel-frequency cepstrum coefficient feature vector of the j-th frame of speech at the i-th layer of the network;
in particular, the feature vector group of the first layer of the network, i.e. the feature vector group input to the network, is:
{C_1^(1), C_2^(1), …, C_n^(1)};
let w_j^(i) denote the weight of the speech frame represented by the vector C_j^(i); when i = 1, w_j^(1) = 1 for every j;
y2, let d_j^(i) = d(C_j^(i), C_(j+1)^(i)) denote the distance between C_j^(i) and C_(j+1)^(i); for the i-th layer, find j such that for any k ≠ j, d_j^(i) ≤ d_k^(i);
y3, merging the two frames with the smallest distance and reducing the number of frames by one, namely:
C_j^(i+1) = ( w_j^(i) · C_j^(i) + w_(j+1)^(i) · C_(j+1)^(i) ) / ( w_j^(i) + w_(j+1)^(i) ),
w_j^(i+1) = w_j^(i) + w_(j+1)^(i);
wherein C_j^(i+1) denotes the Mel-frequency cepstrum coefficient feature vector of the j-th frame of speech at the (i+1)-th layer of the network; C_(j+1)^(i) denotes that of the (j+1)-th frame at the i-th layer; C_k^(i+1) denotes that of the k-th frame at the (i+1)-th layer; C_(k+1)^(i) denotes that of the (k+1)-th frame at the i-th layer; w_j^(i) and w_(j+1)^(i) denote the weights of the j-th and (j+1)-th frames of speech at the i-th layer; w_j^(i+1) denotes the weight of the j-th frame of speech at the (i+1)-th layer; and d_j^(i) denotes the distance between the Mel-frequency cepstrum coefficient feature vectors of the j-th and (j+1)-th frames of speech at the i-th layer of the network;
y4, solving for the remaining frames C_k^(i+1) and weights w_k^(i+1) with k ≠ j, namely:
C_k^(i+1) = C_k^(i) and w_k^(i+1) = w_k^(i), for k < j;
C_k^(i+1) = C_(k+1)^(i) and w_k^(i+1) = w_(k+1)^(i), for k > j;
repeating steps y2 to y4 for i = 1 to i = n − N + 1, the number of frames decreasing by one at each iteration, so that the n-frame speech signal is finally warped to a fixed N frames.
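The warping of steps y1–y4 amounts to repeatedly merging the closest pair of adjacent frames until N frames remain. The sketch below assumes a Euclidean distance for d(·, ·), which the claim does not fix, and uses the frame weights as merge counts; the function and variable names are illustrative only.

import numpy as np

def time_warp(C, N):
    """Warp an (n, M) matrix of per-frame MFCC vectors to exactly N frames by
    repeatedly merging the two adjacent frames whose feature vectors are closest
    (weighted-average merge, weight = number of original frames absorbed)."""
    frames = [c.astype(float) for c in C]    # C_j^(i)
    weights = [1.0] * len(frames)            # w_j^(1) = 1
    while len(frames) > N:
        # distances d_j^(i) between adjacent frames of the current layer
        d = [np.linalg.norm(frames[j] - frames[j + 1]) for j in range(len(frames) - 1)]
        j = int(np.argmin(d))                # pair with the smallest distance
        merged = (weights[j] * frames[j] + weights[j + 1] * frames[j + 1]) \
                 / (weights[j] + weights[j + 1])
        frames[j] = merged                   # C_j^(i+1)
        weights[j] = weights[j] + weights[j + 1]
        del frames[j + 1]                    # remaining frames shift left by one
        del weights[j + 1]
    return np.stack(frames)                  # shape (N, M)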
6. The method for distinguishing abnormal speech based on deep learning of claim 1, wherein each convolutional restricted Boltzmann machine adds a pooling layer after its convolution layer H and performs a pooling operation, the pooling size being E′ × F′ and the pooling step being s3 × s4 with s3 = s4 = 2, so that the pooling layer of one convolutional restricted Boltzmann machine serves as the input layer of the next convolutional restricted Boltzmann machine.
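As a rough sketch of the pooling referred to in this claim, the snippet below applies an ordinary 2 × 2, stride-2 max pooling to a convolution-layer feature map; the choice of max pooling (rather than the probabilistic pooling sometimes used in convolutional restricted Boltzmann machines) is an assumption made for illustration.

import numpy as np

def pool2x2(H, pool=2, stride=2):
    """Downsample a convolution-layer feature map H (height, width) with a
    pool x pool window and the given stride; the output would feed the next layer."""
    out_h = (H.shape[0] - pool) // stride + 1
    out_w = (H.shape[1] - pool) // stride + 1
    P = np.empty((out_h, out_w))
    for r in range(out_h):
        for c in range(out_w):
            P[r, c] = H[r * stride:r * stride + pool,
                        c * stride:c * stride + pool].max()
    return P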
7. The method for distinguishing abnormal speech according to claim 1, wherein step S6 comprises: in the online recognition part, the deep neural network loads the network structure and weight coefficients obtained by training, the hidden Markov model loads the pre-learned model, and the real-time input speech is preprocessed and its features are extracted; HMM templates are established for normal speech and for abnormal speech respectively; during online recognition, the corresponding HMM template is called according to the decision result output by the convolutional deep belief network, and the Mel-frequency cepstrum coefficients of the speech signal are input for template matching to obtain the final speech recognition result.
8. The method for distinguishing abnormal speech based on deep learning of claim 7, wherein the HMM template for normal speech is obtained by training an HMM template with normal speech, and the HMM template for abnormal speech is obtained by training an HMM template with abnormal speech.
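A minimal sketch of the online decision flow of claims 7 and 8 follows; the cdbn.predict and template.decode interfaces, the parameter names and the way the pre-loaded models are passed in are hypothetical placeholders, not the patent's actual modules.

def recognize(warped_features, mfcc_features, cdbn, hmm_normal, hmm_abnormal):
    """Online part: the trained convolutional deep belief network decides whether the
    utterance is abnormal, and the matching HMM template then performs recognition
    on the utterance's Mel-frequency cepstrum coefficients."""
    is_abnormal = cdbn.predict(warped_features)              # hypothetical classifier interface
    template = hmm_abnormal if is_abnormal else hmm_normal   # call the corresponding HMM template
    return template.decode(mfcc_features)                    # hypothetical template-matching interface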
CN201810417478.2A 2018-05-04 2018-05-04 Abnormal voice distinguishing method based on deep learning Active CN108766419B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810417478.2A CN108766419B (en) 2018-05-04 2018-05-04 Abnormal voice distinguishing method based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810417478.2A CN108766419B (en) 2018-05-04 2018-05-04 Abnormal voice distinguishing method based on deep learning

Publications (2)

Publication Number Publication Date
CN108766419A CN108766419A (en) 2018-11-06
CN108766419B true CN108766419B (en) 2020-10-27

Family

ID=64009048

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810417478.2A Active CN108766419B (en) 2018-05-04 2018-05-04 Abnormal voice distinguishing method based on deep learning

Country Status (1)

Country Link
CN (1) CN108766419B (en)

Families Citing this family (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112955954B (en) * 2018-12-21 2024-04-12 华为技术有限公司 Audio processing device and method for audio scene classification
CN111354373B (en) * 2018-12-21 2023-05-12 中国科学院声学研究所 Audio signal classification method based on neural network intermediate layer characteristic filtering
CN110058689A (en) * 2019-04-08 2019-07-26 深圳大学 A kind of smart machine input method based on face's vibration
CN110322887B (en) * 2019-04-28 2021-10-15 武汉大晟极科技有限公司 Multi-type audio signal energy feature extraction method
EP3745412A1 (en) * 2019-05-28 2020-12-02 Corti ApS An intelligent computer aided decision support system
CN110444202B (en) * 2019-07-04 2023-05-26 平安科技(深圳)有限公司 Composite voice recognition method, device, equipment and computer readable storage medium
CN110390929A (en) * 2019-08-05 2019-10-29 中国民航大学 Chinese and English civil aviaton land sky call acoustic model construction method based on CDNN-HMM
CN110706720B (en) * 2019-08-16 2022-04-22 广东省智能制造研究所 Acoustic anomaly detection method for end-to-end unsupervised deep support network
CN110600015B (en) * 2019-09-18 2020-12-15 北京声智科技有限公司 Voice dense classification method and related device
CN110782901B (en) * 2019-11-05 2021-12-24 深圳大学 Method, storage medium and device for identifying voice of network telephone
CN111044285A (en) * 2019-11-22 2020-04-21 军事科学院系统工程研究院军用标准研究中心 Method for diagnosing faults of mechanical equipment under complex conditions
CN111027675B (en) * 2019-11-22 2023-03-07 南京大学 Automatic adjusting method and system for multimedia playing setting
CN110931046A (en) * 2019-11-29 2020-03-27 福州大学 Audio high-level semantic feature extraction method and system for overlapped sound event detection
CN111128227B (en) * 2019-12-30 2022-06-17 云知声智能科技股份有限公司 Sound detection method and device
CN111724770B (en) * 2020-05-19 2022-04-01 中国电子科技网络信息安全有限公司 Audio keyword identification method for generating confrontation network based on deep convolution
CN111508501B (en) * 2020-07-02 2020-09-29 成都晓多科技有限公司 Voice recognition method and system with accent for telephone robot
CN112750428A (en) * 2020-12-29 2021-05-04 平安普惠企业管理有限公司 Voice interaction method and device and computer equipment
CN113361647A (en) * 2021-07-06 2021-09-07 青岛洞听智能科技有限公司 Method for identifying type of missed call
CN113959071B (en) * 2021-07-21 2023-05-26 北京金茂绿建科技有限公司 Centralized water chilling unit air conditioning system operation control optimization method based on machine learning assistance
CN113689633B (en) * 2021-08-26 2023-03-17 浙江力石科技股份有限公司 Scenic spot human-computer interaction method, device and system

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106782602A (en) * 2016-12-01 2017-05-31 南京邮电大学 Speech-emotion recognition method based on length time memory network and convolutional neural networks
CN107464568A (en) * 2017-09-25 2017-12-12 四川长虹电器股份有限公司 Based on the unrelated method for distinguishing speek person of Three dimensional convolution neutral net text and system

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102129860B (en) * 2011-04-07 2012-07-04 南京邮电大学 Text-related speaker recognition method based on infinite-state hidden Markov model
CN104157290B (en) * 2014-08-19 2017-10-24 大连理工大学 A kind of method for distinguishing speek person based on deep learning
CN105206270B (en) * 2015-08-20 2019-04-02 长安大学 A kind of isolated digit speech recognition categorizing system and method combining PCA and RBM
US10373073B2 (en) * 2016-01-11 2019-08-06 International Business Machines Corporation Creating deep learning models using feature augmentation
CN106941005A (en) * 2017-02-24 2017-07-11 华南理工大学 A kind of vocal cords method for detecting abnormality based on speech acoustics feature

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106782602A (en) * 2016-12-01 2017-05-31 南京邮电大学 Speech-emotion recognition method based on length time memory network and convolutional neural networks
CN107464568A (en) * 2017-09-25 2017-12-12 四川长虹电器股份有限公司 Based on the unrelated method for distinguishing speek person of Three dimensional convolution neutral net text and system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
A deep architecture for audio-visual voice activity detection in the presence of transients; Ido Ariav et al.; Signal Processing; 2017-07-12; pp. 64-67 *

Also Published As

Publication number Publication date
CN108766419A (en) 2018-11-06

Similar Documents

Publication Publication Date Title
CN108766419B (en) Abnormal voice distinguishing method based on deep learning
KR100908121B1 (en) Speech feature vector conversion method and apparatus
CN108447495B (en) Deep learning voice enhancement method based on comprehensive feature set
Gulzar et al. Comparative analysis of LPCC, MFCC and BFCC for the recognition of Hindi words using artificial neural networks
Bhattacharjee A comparative study of LPCC and MFCC features for the recognition of Assamese phonemes
KR20080078466A (en) Multi-stage speech recognition apparatus and method
CN111899757B (en) Single-channel voice separation method and system for target speaker extraction
Moritz et al. Noise robust distant automatic speech recognition utilizing NMF based source separation and auditory feature extraction
Pardede et al. Convolutional neural network and feature transformation for distant speech recognition
CN113763965A (en) Speaker identification method with multiple attention characteristics fused
Cai et al. The DKU system for the speaker recognition task of the 2019 VOiCES from a distance challenge
CN113571095A (en) Speech emotion recognition method and system based on nested deep neural network
Palo et al. Comparison of neural network models for speech emotion recognition
Li et al. A Convolutional Neural Network with Non-Local Module for Speech Enhancement.
Aggarwal et al. Performance evaluation of artificial neural networks for isolated Hindi digit recognition with LPC and MFCC
WO2023070874A1 (en) Voiceprint recognition method
Gu et al. Restoring high frequency spectral envelopes using neural networks for speech bandwidth extension
Sunny et al. Feature extraction methods based on linear predictive coding and wavelet packet decomposition for recognizing spoken words in malayalam
Raju et al. AUTOMATIC SPEECH RECOGNITION SYSTEM USING MFCC-BASED LPC APPROACH WITH BACK PROPAGATED ARTIFICIAL NEURAL NETWORKS.
Zheng et al. Throat microphone speech enhancement via progressive learning of spectral mapping based on lstm-rnn
Sunil Kumar et al. Phoneme recognition using zerocrossing interval distribution of speech patterns and ANN
Alex et al. Performance analysis of SOFM based reduced complexity feature extraction methods with back propagation neural network for multilingual digit recognition
CN111785262A (en) Speaker age and gender classification method based on residual error network and fusion characteristics
Nijhawan et al. A comparative study of two different neural models for speaker recognition systems
Singh et al. A novel algorithm using MFCC and ERB gammatone filters in speech recognition

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant