CN108766419A - A kind of abnormal speech detection method based on deep learning - Google Patents
- Publication number
- CN108766419A (application CN201810417478.2A)
- Authority
- CN
- China
- Prior art keywords
- voice
- layer
- frequency
- mel
- frame
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/16—Speech classification or search using artificial neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/14—Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
- G10L15/142—Hidden Markov Models [HMMs]
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/26—Speech to text systems
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/24—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
Abstract
The invention discloses an abnormal speech detection method based on deep learning, comprising the following steps: obtain input speech and preprocess it by resampling, pre-emphasis, framing and windowing to obtain preprocessed speech; extract Mel-frequency cepstral coefficient (MFCC) feature vectors from the preprocessed speech; normalize speech segments of differing frame counts to a fixed frame count, so that each segment yields a corresponding MFCC feature matrix; build a convolutional deep belief network; feed the MFCC feature matrices into the convolutional deep belief network, train it, and classify the state of the input speech; according to the classification result, invoke the corresponding hidden Markov model for template matching to obtain the speech recognition result. The invention uses the multiple nonlinear transformation layers of the convolutional deep belief network to map the input MFCC features to a higher-dimensional space, and uses hidden Markov models to model speech in different states separately, improving the accuracy of speech recognition.
Description
Technical field
The present invention relates to the field of intelligent speech processing, and more particularly to an abnormal speech detection method based on deep learning.
Background technology
Speech is one of the most important channels of human-machine interaction. After decades of research, speech recognition technology has developed greatly and has entered our daily life. However, existing speech recognition research still faces the following problems:
In real life, abnormal health conditions of the speaker or other causes can shift the input speech from normal speech toward abnormal speech, and can introduce additional noise interference. Abnormal speech generally refers to speech with complex background noise, speech in which the speaking style or habit is deliberately changed, speech produced by lesions of the vocal organs, and the like.
Another problem is that traditional speech recognition systems often use linear prediction cepstral coefficients (LPCC) and Mel-frequency cepstral coefficients (MFCC). The main information carried by these low-level acoustic features is the pronunciation content, and these features are highly susceptible to interference from speaker characteristics, channel effects, and noise, which degrades the recognition performance of the system.
Invention content
The primary object of the present invention is to overcome the shortcomings and deficiencies of the prior art by providing an abnormal speech detection method based on deep learning, which uses the nonlinear transformation ability of a deep neural network to map relatively low-dimensional MFCC and LPCC parameters to a higher-dimensional space, better characterizing the high-level abstract information of the speech signal, and which models normal speech and abnormal speech separately so as to effectively distinguish the two.
The object of the present invention is achieved through the following technical solution:
An abnormal speech detection method based on deep learning comprises the following steps:
S1. Obtain the input speech and preprocess it by resampling, pre-emphasis, framing and windowing, to obtain preprocessed speech.
S2. Extract a Mel-frequency cepstral coefficient (MFCC) feature vector from each frame of the preprocessed speech using a Mel filter bank and the Fourier transform.
S3. Normalize speech segments with different frame counts to a fixed frame count; each segment then yields a corresponding MFCC feature matrix.
S4. Build a convolutional deep belief network.
S5. Feed the MFCC feature matrices into the convolutional deep belief network, train it, and classify the state of the input speech.
S6. According to the classification result, invoke the corresponding hidden Markov model for template matching to obtain the speech recognition result.
In step S1, the resampling frequency is 22.05 kHz and the coding format is WAV.
Pre-emphasis uses a first-order FIR high-pass filter with transfer function
H(z) = 1 - a z^{-1},
where a is the high-pass filter coefficient, taken as 0.93. The speech signal after pre-emphasis is
y(n) = sp(n) - a * sp(n-1), n = 0, 1, ..., Length-1,
where y(n) is the pre-emphasized signal, sp(n) is the signal before pre-emphasis, sp(n-1) is the signal delayed by one sample, and Length is the signal length.
Framing and windowing: the speech is sliced, taking a fixed-length segment of the input speech at fixed intervals as one frame; a frame length of 25 ms and a frame shift of 10 ms are used with a Hamming window.
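For illustration only, the preprocessing of step S1 (pre-emphasis with a = 0.93, 25 ms Hamming-windowed frames with a 10 ms shift at 22.05 kHz) can be sketched as follows; the function name and NumPy implementation are illustrative and not part of the claimed method:

```python
import numpy as np

def preprocess(sp, a=0.93, fs=22050, frame_ms=25, shift_ms=10):
    """Pre-emphasis y(n) = sp(n) - a*sp(n-1), then framing with a
    Hamming window (25 ms frames, 10 ms shift, as in step S1)."""
    y = np.append(sp[0], sp[1:] - a * sp[:-1])   # first-order FIR high-pass
    frame_len = int(fs * frame_ms / 1000)        # 551 samples at 22.05 kHz
    shift = int(fs * shift_ms / 1000)            # 220 samples
    n_frames = 1 + (len(y) - frame_len) // shift
    window = np.hamming(frame_len)
    frames = np.stack([y[i * shift: i * shift + frame_len] * window
                       for i in range(n_frames)])
    return frames
```

Each row of the returned matrix is one windowed frame ready for the FFT of step S2.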
In step S2, the detailed process is as follows:
V1. Design a bank of L triangular Mel filters. Let W_l be the frequency response of the l-th Mel filter, l = 1, 2, ..., L; Fs the resampling frequency of the speech signal; Q the frame length of a given frame (Q is also the number of FFT points); and f_l, f_h the lower and upper cutoff frequencies of the speech signal. A frame of length Q yields Q frequency components after a Q-point fast Fourier transform. Let o(l), c(l), h(l) be the indices, among the Q frequency components, of the lower-limit, center, and upper-limit frequencies of the l-th Mel filter. They satisfy
c(l-1) = o(l),
o(l+1) = c(l),
h(l) = c(l+1),
that is, the center frequency of the current filter coincides with the lower cutoff of the next filter, so adjacent filters overlap.
In addition, o(l)|_{l=1} corresponds to f_l and h(l)|_{l=L} corresponds to f_h. The index of the center frequency of the l-th Mel filter among the Q frequency components can therefore be expressed as
c(l) = (Q / Fs) * Mel^{-1}( Mel(f_l) + l * (Mel(f_h) - Mel(f_l)) / (L+1) ),
where Mel(f_1) is the function mapping actual frequency to Mel frequency and Mel^{-1}(f_2) is its inverse, f_1 being actual frequency and f_2 Mel frequency:
Mel(f_1) = 2595 * log10(1 + f_1 / 700), Mel^{-1}(f_2) = 700 * (10^{f_2/2595} - 1).
The frequency response of the l-th Mel filter is the triangle
W_l(k) = (k - o(l)) / (c(l) - o(l)), for o(l) <= k <= c(l),
W_l(k) = (h(l) - k) / (h(l) - c(l)), for c(l) < k <= h(l),
W_l(k) = 0, otherwise.
V2. For a frame x(n), n = 0, 1, ..., Q-1, Q < Length, obtained after resampling, pre-emphasis, framing and windowing, perform a Q-point fast Fourier transform and obtain its spectrum X(k) and amplitude spectrum |X(k)|:
X(k) = sum_{n=0}^{Q-1} x(n) e^{-j 2 pi n k / Q}, k = 0, 1, ..., Q-1.
V3. Pass the frame through the Mel filter bank to obtain the output amplitude spectrum of each filter:
F(l) = sum_k W_l(k) |X(k)|, l = 1, 2, ..., L.
V4. Take the logarithm of the output amplitudes of all filters and apply a discrete cosine transform to obtain the Mel-frequency cepstral coefficients:
c_mfcc(m) = sum_{l=1}^{L} log(F(l)) * cos( pi m (l - 0.5) / L ), m = 1, 2, ..., L.
Of the L coefficients, the 2nd through the (M+1)-th (M coefficients in total) are taken to form the M-dimensional MFCC feature vector of each frame:
C = {c_mfcc(2), c_mfcc(3), ..., c_mfcc(M+1)}.
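The computation of V1 through V4 for a single windowed frame can be sketched as follows. This is a minimal NumPy sketch of the standard Mel filter bank plus DCT pipeline described above; the parameter defaults (L = 24 filters, full 0 to Fs/2 band) are illustrative assumptions, not values fixed by the text:

```python
import numpy as np

def mel(f):        # Mel(f) = 2595 * log10(1 + f/700)
    return 2595.0 * np.log10(1.0 + f / 700.0)

def inv_mel(m):    # inverse Mel mapping
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc_frame(x, fs=22050, L=24, M=12, f_lo=0.0, f_hi=11025.0):
    """MFCC of one windowed frame x: |FFT| -> triangular Mel filter
    bank -> log -> DCT, keeping coefficients 2..M+1 as in step V4."""
    Q = len(x)
    mag = np.abs(np.fft.rfft(x, Q))                 # amplitude spectrum |X(k)|
    # L+2 boundary points equally spaced on the Mel scale -> FFT bin indices
    pts = inv_mel(np.linspace(mel(f_lo), mel(f_hi), L + 2))
    bins = np.floor((Q + 1) * pts / fs).astype(int)
    fb = np.zeros((L, len(mag)))
    for l in range(L):                              # triangular response W_l
        o, c, h = bins[l], bins[l + 1], bins[l + 2]
        fb[l, o:c] = (np.arange(o, c) - o) / max(c - o, 1)
        fb[l, c:h] = (h - np.arange(c, h)) / max(h - c, 1)
    logE = np.log(fb @ mag + 1e-10)                 # log filter-bank outputs
    m = np.arange(L)                                # DCT over the L outputs
    dct = np.cos(np.pi * np.outer(m, np.arange(L) + 0.5) / L)
    coeffs = dct @ logE
    return coeffs[1:M + 1]                          # 2nd .. (M+1)-th coefficient
```

Applied to each row produced by the framing step, this yields the per-frame M-dimensional feature vector C.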
In step S3, each MFCC feature matrix has dimension N x M, where N is the fixed frame count after normalizing the speech segment, i.e. the target frame count of time warping, and M is the order of the MFCC of each frame; here the MFCC is 12-dimensional, i.e. M = 12.
In step S3, because speech segments differ in duration, they may contain different numbers of frames, while the input of a neural network must have a fixed dimension. The speech signal must therefore be time-warped so that the network input has a fixed size. Time warping uses a feature-point merging method, specifically:
Y1. Build a time-warping network of n - N + 1 layers, where N is the target frame count after warping; the first layer has n frames and the last layer has N frames.
Layer i of the time-warping network has n - (i-1) frames, each corresponding to one feature vector, forming a set of n - (i-1) feature vectors
C^i = {c^i_1, c^i_2, ..., c^i_{n-(i-1)}},
where i = 1, 2, ..., n-N+1, k = 1, 2, ..., n-(i-1), and c^i_k denotes the MFCC feature vector of the k-th frame of speech at layer i.
In particular, the feature vectors of the first layer are the input feature vectors of the network. Let u^i_k denote the weight of the speech frame represented by vector c^i_k; for i = 1, u^1_k = 1 for all k.
Y2. Let d^i_j denote the distance between c^i_j and c^i_{j+1}, i.e. d^i_j = ||c^i_j - c^i_{j+1}||. At layer i, find j such that for any k != j, d^i_j <= d^i_k.
Y3. Merge the two nearest frames, so that the frame count decreases by one:
c^{i+1}_j = (u^i_j c^i_j + u^i_{j+1} c^i_{j+1}) / (u^i_j + u^i_{j+1}),
u^{i+1}_j = u^i_j + u^i_{j+1},
c^{i+1}_k = c^i_k and u^{i+1}_k = u^i_k for k < j; c^{i+1}_k = c^i_{k+1} and u^{i+1}_k = u^i_{k+1} for k > j,
where c^{i+1}_j is the MFCC feature vector of the j-th frame at layer i+1, c^i_{j+1} is that of the (j+1)-th frame at layer i, u^i_j is the weight of the j-th frame at layer i, u^i_{j+1} is the weight of the (j+1)-th frame at layer i, u^{i+1}_j is the weight of the j-th frame at layer i+1, and d^i_j is the distance between the MFCC feature vectors of the j-th and (j+1)-th frames at layer i.
Y4. Repeat from i = 1 to i = n - N + 1; each pass reduces the frame count by one, finally warping the n-frame speech signal to a fixed N frames.
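Steps Y1 through Y4 can be sketched directly: repeatedly merge the closest pair of adjacent frames by a weighted mean until N frames remain. The function below is an illustrative NumPy sketch of this merging scheme:

```python
import numpy as np

def time_warp(feats, N):
    """Shrink an (n, M) MFCC matrix to (N, M) by repeatedly merging the
    two adjacent frames whose feature vectors are closest, using the
    weighted average of steps Y1-Y4; merged frames accumulate weight."""
    frames = [f.astype(float) for f in feats]
    weights = [1.0] * len(frames)                  # u^1_k = 1 for all k
    while len(frames) > N:
        d = [np.linalg.norm(frames[j] - frames[j + 1])
             for j in range(len(frames) - 1)]
        j = int(np.argmin(d))                      # nearest adjacent pair
        w = weights[j] + weights[j + 1]
        merged = (weights[j] * frames[j] + weights[j + 1] * frames[j + 1]) / w
        frames[j:j + 2] = [merged]                 # frame count drops by one
        weights[j:j + 2] = [w]
    return np.stack(frames)
```

For example, warping four frames to three merges the two identical leading frames first, leaving the distinct frames untouched.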
In step S4, the convolutional deep belief network is formed by stacking several convolutional restricted Boltzmann machines (CRBMs), with a Softmax classifier as the output layer. Each CRBM consists of an input layer V and a convolutional layer H. A pooling layer is attached after the convolutional layer H of each CRBM to perform a pooling operation, with pool size E' x F' and pooling stride s3 x s4, s3 = s4 = 2. The pooling layer of one CRBM therefore serves as the input layer of the next CRBM.
In step S5, the bottom CRBM is trained first, then the CRBMs above it, specifically:
Z1. Let the CRBM have I input channels, each corresponding to a two-dimensional matrix of size y x s, i.e.
V = {v_1, v_2, ..., v_I}, v_i in R^{y x s}, i = 1, 2, ..., I,
where V is the input layer of the CRBM and v_i in R^{y x s} is the i-th channel of the input layer.
In particular, for the bottom CRBM, I = 1, y = N, s = M; that is, the bottom CRBM has one input channel, corresponding to the N x M two-dimensional MFCC feature matrix fed into the convolutional deep belief network, where N is the target frame count of time warping and M is the MFCC order of each frame.
Z2. The convolution uses O kernels, each a three-dimensional weight matrix of size I x E x F, i.e. kernels W = {w_1, w_2, ..., w_O}, where w_j in R^{I x E x F}, j = 1, 2, ..., O. The CRBM therefore has O output channels, each corresponding to some local feature of the input:
convolutional layer H = {h_1, h_2, ..., h_O}, h_j in R^{N' x M'}, j = 1, 2, ..., O,
where h_j, the j-th channel of the convolutional layer, is the N' x M' two-dimensional feature matrix obtained after feature mapping.
Z3. Let the convolution stride be s1 x s2. All neurons in the same input channel share a bias a_i, i = 1, 2, ..., I, and the neurons of the same channel of the convolutional layer share a bias b_j, j = 1, 2, ..., O. The parameters to be trained are theta = {W, a, b}.
Z4. A CRBM is an energy-based model, with energy function
E(V, H) = - sum_{j=1}^{O} sum_{m,k} h_j^{m,k} (w~_j * V)_{m,k} - sum_{j=1}^{O} b_j sum_{m,k} h_j^{m,k} - sum_{i=1}^{I} a_i sum_{p,q} v_i^{p,q},
where w~_j denotes the kernel w_j flipped in both dimensions and * denotes valid convolution. The joint probability distribution of the input layer V and the convolutional layer H, i.e. of all neuron values v_i^{p,q} of V and all neuron values h_j^{m,k} of H, is obtained from the energy function:
P(V, H) = (1/Z) exp(-E(V, H)),
where Z = sum_{V,H} exp(-E(V, H)) is the partition function; i = 1, ..., I, p = 1, ..., y, q = 1, ..., s, j = 1, ..., O, m = 1, ..., N', k = 1, ..., M'.
The marginal probability distribution of the input layer V, i.e. the probability distribution of all neuron values v_i^{p,q} of V, is
P(V) = (1/Z) sum_H exp(-E(V, H)).
If the training sample set TS contains T samples, the log-likelihood function on the input layer V is
ln L(TS, theta) = sum_{t=1}^{T} ln P(V_t | theta),
where L(TS, theta) = prod_{t=1}^{T} P(V_t | theta) is the likelihood function on V.
Z5. Maximize the log-likelihood function using a gradient ascent algorithm combined with the contrastive divergence algorithm to obtain the parameters theta, i.e.
theta* = argmax_theta ln L(TS, theta).
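The gradient-ascent-plus-contrastive-divergence training of step Z5 can be illustrated on a plain (non-convolutional) Bernoulli restricted Boltzmann machine; this simplified stand-in shows one CD-1 parameter update on theta = {W, a, b} and is an assumption-laden sketch, not the convolutional update of the invention:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(v0, W, a, b, lr=0.1):
    """One contrastive-divergence (CD-1) update for a Bernoulli RBM:
    approximate the gradient of ln P(v) and take one gradient-ascent
    step on the weights W and biases a (visible), b (hidden)."""
    ph0 = sigmoid(v0 @ W + b)                    # positive phase P(h=1|v0)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    pv1 = sigmoid(h0 @ W.T + a)                  # one Gibbs step back to v
    ph1 = sigmoid(pv1 @ W + b)                   # negative phase
    W += lr * (v0.T @ ph0 - pv1.T @ ph1) / len(v0)
    a += lr * (v0 - pv1).mean(axis=0)
    b += lr * (ph0 - ph1).mean(axis=0)
    return W, a, b
```

In the convolutional case the same positive-phase/negative-phase structure holds, with the matrix products replaced by convolutions over the shared kernels.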
Z6. The output layer is a Softmax layer with 2 neurons, whose outputs represent the probabilities that the sample is normal speech and abnormal speech respectively; at recognition time, the class with the larger probability is taken as the final classification result. The output layer is fully connected to the pooling layer of the top CRBM, and the parameters to be trained are the biases c_1, c_2 of the two output neurons and the connection weights between the output layer and the top pooling layer:
WE_i = {we_i^1, we_i^2, ..., we_i^r}, i = 1, 2,
where r is the number of neurons of the top pooling layer; for i = 1, WE_1 is the vector of connection weights between the 1st neuron of the Softmax output layer and the neurons of the top pooling layer, we_1^k being the connection weight between the 1st output neuron and the k-th neuron of the top pooling layer, k <= r; for i = 2, WE_2 is the vector of connection weights between the 2nd neuron of the Softmax output layer and the neurons of the top pooling layer, we_2^k being the connection weight between the 2nd output neuron and the k-th neuron of the top pooling layer.
Let the output of the top pooling layer be G = {g_1, g_2, ..., g_r}, where g_i is the output of the i-th neuron of the top pooling layer. The inputs of the output layer are then f_i = WE_i G^T + c_i, where f_1 (i = 1) is the input of the first output neuron and f_2 (i = 2) the input of the second.
The output values of the output layer are
y_i = exp(f_i) / (exp(f_1) + exp(f_2)), i = 1, 2,
where y_1, the output of the 1st Softmax neuron, represents the probability that the input speech is normal speech, and y_2, the output of the 2nd Softmax neuron, represents the probability that the input speech is abnormal speech.
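The output computation of step Z6 can be sketched as follows; the max-subtraction for numerical stability is an implementation detail added here, not part of the text:

```python
import numpy as np

def softmax_out(G, WE, c):
    """Softmax output layer of step Z6: f_i = WE_i . G + c_i,
    y_i = exp(f_i) / sum_j exp(f_j).  With two output neurons,
    y_1 and y_2 are the posterior probabilities of normal and
    abnormal speech respectively."""
    f = WE @ G + c                 # inputs of the output neurons
    e = np.exp(f - f.max())        # subtract max for numerical stability
    return e / e.sum()
```

The predicted state is simply the index of the larger of the two outputs.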
Z7. The Softmax classifier is trained in mini-batch mode: T training samples are fetched at a time and the parameters are updated once per batch, with loss function
J = (1/2) sum_{j=1}^{T} sum_{i=1}^{2} ( yhat_i^{(j)} - y_i^{(j)} )^2,
where J is the sum of squared errors during training, yhat_i^{(j)} is the desired output of the i-th neuron of the classifier for the j-th fetched sample (if the j-th sample is normal speech, yhat_1^{(j)} = 1 and yhat_2^{(j)} = 0, and conversely otherwise), and y_i^{(j)} is the actual output of the i-th neuron of the classifier for the j-th fetched sample.
F1 is the harmonic mean of the precision P and the recall R, i.e.
F1 = 2PR / (P + R), P = TP / (TP + FP), R = TP / (TP + FN),
where the precision P is the ratio of the number of samples correctly identified as abnormal speech to the number of all samples identified as abnormal speech, and the recall R is the ratio of the number of samples correctly identified as abnormal speech to the number of all abnormal speech samples; TP is the number of the T samples correctly identified as abnormal speech, FP is the number erroneously identified as abnormal speech, and FN is the number of abnormal samples erroneously identified as normal speech.
During training, the parameters are updated with a gradient descent algorithm until the loss function converges to the set value or the number of iterations reaches the maximum, at which point training ends.
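The evaluation quantities of step Z7 reduce to a few lines; this sketch treats abnormal speech as the positive class, as the text does:

```python
def f1_score(tp, fp, fn):
    """Precision P = TP/(TP+FP), recall R = TP/(TP+FN),
    F1 = 2PR/(P+R), with abnormal speech as the positive class."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    return 2 * p * r / (p + r)
```

For example, 8 correctly flagged abnormal samples with 2 false alarms and 2 misses give P = R = F1 = 0.8.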
The detailed process of step S6 is as follows. In the online recognition part, the deep neural network loads the network structure and the weight coefficients obtained by training, and the hidden Markov models load the models learned in advance; the speech input in real time is preprocessed and its features extracted. HMM templates are built separately for normal and abnormal speech: normal speech is used to train an HMM template, establishing the HMM template for normal speech, and abnormal speech is used to train an HMM template, establishing the HMM template for abnormal speech. During online recognition, different HMM templates are invoked according to the decision output by the convolutional deep belief network, template matching is performed on the MFCCs of the input speech signal, and the final speech recognition result is obtained.
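The two-stage decision of step S6 can be sketched as follows. This is only an illustrative skeleton: the classifier output selects the normal or abnormal template set, and a nearest-template distance stands in for HMM likelihood scoring (a real system would score each HMM, e.g. with the forward algorithm); the `templates` mapping of state to reference MFCC matrices is an assumed structure:

```python
import numpy as np

def recognize(mfcc, class_probs, templates):
    """Pick the template set (normal/abnormal) from the CDBN's softmax
    output, then match the input MFCC matrix against that set.
    `templates` maps state -> {word: reference MFCC matrix} (assumed)."""
    state = "normal" if class_probs[0] >= class_probs[1] else "abnormal"
    pool = templates[state]
    # stand-in for HMM template matching: nearest reference matrix
    return min(pool, key=lambda w: np.linalg.norm(mfcc - pool[w]))
```

The point of the design is that each HMM set only ever sees speech of its own state, so the temporal models stay well matched to their inputs.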
Compared with the prior art, the present invention has the following advantages and beneficial effects:
The invention combines a convolutional deep belief network with hidden Markov models. Using the multiple nonlinear transformation layers of the convolutional deep belief network, the input MFCC features are mapped to a higher-dimensional space, characterizing the speech information more fully and effectively distinguishing abnormal speech from normal speech; the time-series modeling ability of hidden Markov models is then used to model speech in different states separately, substantially improving the accuracy of speech recognition.
Description of the drawings
Fig. 1 is the offline training flow chart of the abnormal speech detection method based on deep learning of the present invention;
Fig. 2 is the online recognition flow chart of the method;
Fig. 3 is a schematic diagram of speech-state recognition by the convolutional deep belief network of the method.
Specific implementation mode
The present invention will now be described in further detail with reference to the embodiments and the accompanying drawings, but the embodiments of the present invention are not limited thereto.
Embodiment 1
An abnormal speech detection method based on deep learning comprises the following steps:
First step: obtain the input speech and apply preprocessing such as resampling, pre-emphasis, framing and windowing, to obtain preprocessed speech.
Resampling: the input speech may have different sampling frequencies and coding formats; to facilitate processing and analysis of the data, the original input speech signal is resampled to unify the sampling frequency and coding format. The sampling frequency is 22.05 kHz and the coding format is WAV.
Pre-emphasis: the power spectrum of a speech signal decreases as frequency increases, with most of the energy concentrated in the low-frequency range; to boost the high-frequency part of the original speech signal, pre-emphasis is applied to the original input signal using a first-order FIR high-pass filter with transfer function
H(z) = 1 - a z^{-1},
where a is the high-pass filter coefficient, taken as 0.93. The speech signal after pre-emphasis is
y(n) = sp(n) - a * sp(n-1), n = 0, 1, ..., Length-1,
where y(n) is the pre-emphasized signal, sp(n) is the signal before pre-emphasis, sp(n-1) is the signal delayed by one sample, and Length is the signal length.
Framing and windowing: the speech is sliced, taking a fixed-length segment of the input speech at fixed intervals as one frame; a frame length of 25 ms and a frame shift of 10 ms are used with a Hamming window.
Second step: extract Mel-frequency cepstral coefficient features from each frame of the preprocessed speech.
The detailed process is as follows:
V1. Design a bank of L triangular Mel filters. Let W_l be the frequency response of the l-th Mel filter, l = 1, 2, ..., L; Fs the resampling frequency of the speech signal; Q the frame length of a given frame (Q is also the number of FFT points); and f_l, f_h the lower and upper cutoff frequencies of the speech signal. A frame of length Q yields Q frequency components after a Q-point fast Fourier transform. Let o(l), c(l), h(l) be the indices, among the Q frequency components, of the lower-limit, center, and upper-limit frequencies of the l-th Mel filter. They satisfy
c(l-1) = o(l),
o(l+1) = c(l),
h(l) = c(l+1),
that is, the center frequency of the current filter coincides with the lower cutoff of the next filter, so adjacent filters overlap.
In addition, o(l)|_{l=1} corresponds to f_l and h(l)|_{l=L} corresponds to f_h. The index of the center frequency of the l-th Mel filter among the Q frequency components can therefore be expressed as
c(l) = (Q / Fs) * Mel^{-1}( Mel(f_l) + l * (Mel(f_h) - Mel(f_l)) / (L+1) ),
where Mel(f_1) is the function mapping actual frequency to Mel frequency and Mel^{-1}(f_2) is its inverse, f_1 being actual frequency and f_2 Mel frequency:
Mel(f_1) = 2595 * log10(1 + f_1 / 700), Mel^{-1}(f_2) = 700 * (10^{f_2/2595} - 1).
The frequency response of the l-th Mel filter is the triangle
W_l(k) = (k - o(l)) / (c(l) - o(l)), for o(l) <= k <= c(l),
W_l(k) = (h(l) - k) / (h(l) - c(l)), for c(l) < k <= h(l),
W_l(k) = 0, otherwise.
V2. For a frame x(n), n = 0, 1, ..., Q-1, obtained after resampling, pre-emphasis, framing and windowing, perform a Q-point fast Fourier transform and obtain its spectrum X(k) and amplitude spectrum |X(k)|:
X(k) = sum_{n=0}^{Q-1} x(n) e^{-j 2 pi n k / Q}, k = 0, 1, ..., Q-1.
V3. Pass the frame through the Mel filter bank to obtain the output amplitude spectrum of each filter:
F(l) = sum_k W_l(k) |X(k)|, l = 1, 2, ..., L.
V4. Take the logarithm of the output amplitudes of all filters and apply a discrete cosine transform to obtain the Mel-frequency cepstral coefficients:
c_mfcc(m) = sum_{l=1}^{L} log(F(l)) * cos( pi m (l - 0.5) / L ), m = 1, 2, ..., L.
Of the L coefficients, the 2nd through the (M+1)-th (M coefficients in total) are taken to form the M-dimensional MFCC feature vector of each frame:
C = {c_mfcc(2), c_mfcc(3), ..., c_mfcc(M+1)}.
Third step: normalize speech segments with different frame counts to a fixed frame count; each segment then yields a corresponding MFCC feature matrix.
Because speech segments differ in duration, they may contain different numbers of frames, while the input of a neural network must have a fixed dimension; the speech signal must therefore be time-warped so that the network input has a fixed size. Time warping uses a feature-point merging method, and the warping process is specifically:
Y1. Build a time-warping network of n - N + 1 layers, where N is the target frame count after warping; the first layer has n frames and the last layer has N frames.
Layer i of the time-warping network has n - (i-1) frames, each corresponding to one feature vector, forming a set of n - (i-1) feature vectors
C^i = {c^i_1, c^i_2, ..., c^i_{n-(i-1)}},
where i = 1, 2, ..., n-N+1, k = 1, 2, ..., n-(i-1), and c^i_k denotes the MFCC feature vector of the k-th frame of speech at layer i.
In particular, the feature vectors of the first layer are the input feature vectors of the network, i.e. the per-frame MFCC feature vectors C. Let u^i_k denote the weight of the speech frame represented by vector c^i_k; for i = 1, u^1_k = 1 for all k.
Y2. Let d^i_j denote the distance between c^i_j and c^i_{j+1}, i.e. d^i_j = ||c^i_j - c^i_{j+1}||. At layer i, find j such that for any k != j, d^i_j <= d^i_k.
Y3. Merge the two nearest frames, so that the frame count decreases by one:
c^{i+1}_j = (u^i_j c^i_j + u^i_{j+1} c^i_{j+1}) / (u^i_j + u^i_{j+1}),
u^{i+1}_j = u^i_j + u^i_{j+1},
c^{i+1}_k = c^i_k and u^{i+1}_k = u^i_k for k < j; c^{i+1}_k = c^i_{k+1} and u^{i+1}_k = u^i_{k+1} for k > j,
where c^{i+1}_j is the MFCC feature vector of the j-th frame at layer i+1, c^i_{j+1} is that of the (j+1)-th frame at layer i, u^i_j is the weight of the j-th frame at layer i, u^{i+1}_j is the weight of the j-th frame at layer i+1, and d^i_j is the distance between the MFCC feature vectors of the j-th and (j+1)-th frames at layer i.
Y4. Repeat from i = 1 to i = n - N + 1; each pass reduces the frame count by one, finally warping the n-frame speech signal to a fixed N frames.
After time warping, each speech segment corresponds to an MFCC feature matrix of dimension N x M, where N is the target frame count of time warping for the segment and M is the MFCC order of each frame; the MFCC is 12-dimensional, i.e. M = 12.
Fourth step: build the convolutional deep belief network. The deep belief network is formed by stacking two convolutional restricted Boltzmann machines, giving 2 convolutional layers, 2 pooling layers, 1 visible layer, and 1 output layer. For the first CRBM, the number of visible-layer neural units is N x M = 200 x 12, the number of convolution kernels is 10, the kernel size is 2 x 2, the convolution stride is 2 x 2, and the kernels are initialized with random values drawn from a Gaussian distribution with mean 0 and variance 0.01. The initial bias of the visible layer is 0 and the initial bias of the convolutional layer is -0.1. Each iteration fetches 100 samples, and the number of iteration rounds is 100. The pool size of the first pooling layer is 2 x 2 with pooling stride 2 x 2. For the second CRBM, the number of convolution kernels is 10, the kernel size is 10 x 2 x 2, the convolution stride is 2 x 2, the kernels are initialized with random values drawn from a Gaussian distribution with mean 0 and variance 0.01, and the initial bias of the second convolutional layer is -0.1. Each iteration fetches 100 samples, and the number of iteration rounds is 100. The pool size of the second pooling layer is 2 x 2 with pooling stride 2 x 2. All convolutional layers use the Sigmoid activation function. The output layer has 2 neurons, outputting the posterior probabilities of normal speech and abnormal speech. The loss-function convergence value of the output layer is set to 0.004, and the maximum number of iterations is 1000.
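The layer sizes implied by these hyperparameters can be checked with the usual valid-convolution formula; this sketch assumes 'valid' borders (no padding), which the text does not state explicitly. Note that with these sizes the second feature map is only 1 wide along the coefficient axis, so the second 2 x 2 pooling presumably acts along time only:

```python
def conv_out(n, k, s):
    """Output length of a valid convolution/pooling: floor((n-k)/s) + 1."""
    return (n - k) // s + 1

# Shapes through the embodiment's network (200 x 12 input,
# 2 x 2 kernels, stride 2 everywhere), assuming 'valid' borders:
h1 = (conv_out(200, 2, 2), conv_out(12, 2, 2))        # first conv layer
p1 = (conv_out(h1[0], 2, 2), conv_out(h1[1], 2, 2))   # first pooling layer
h2 = (conv_out(p1[0], 2, 2), conv_out(p1[1], 2, 2))   # second conv layer
```

Under these assumptions the first CRBM maps 200 x 12 to 100 x 6 feature maps, pooled to 50 x 3, and the second convolution yields 25 x 1 maps per kernel.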
Fifth step: feed the MFCC feature matrices into the convolutional deep belief network, train it, and classify the state of the input speech; Fig. 1 is the offline training flow chart.
The bottom CRBM is trained first, then the top CRBM, specifically:
Z1, set convolution be limited Boltzmann machine input channel number as I, each channel correspond to size for y × s two dimension
Matrix, i.e.,:
V={ v1,v2,...,vI},vi∈Ry×s, i=1,2 ..., I,
Wherein, V is the input layer that convolution is limited Boltzmann machine, vi∈Ry×sFor i-th of channel of input layer;
Particularly, the convolution of bottom is limited in Boltzmann machine, and I=1, y=N, s=M, the i.e. convolution of bottom are limited Bohr
Hereby the input channel number of graceful machine is 1, the two-dimentional mel-frequency cepstrum that the corresponding size for inputting convolution depth confidence network is N × M
Coefficient characteristics matrix, N are the target frame number of Time alignment, and M is the M rank mel-frequency cepstrum coefficients of every frame voice;
Z2: The convolution uses O kernels, each kernel being a three-dimensional weight matrix of size I × E × F, i.e. the kernel set W = {w1, w2, ..., wO}, where wj ∈ R^(I×E×F), j = 1, 2, ..., O, and wj is the j-th kernel. The convolutional restricted Boltzmann machine therefore has O output channels, each output channel corresponding to one local feature of the input, i.e.:
the convolutional layer H = {h1, h2, ..., hO}, where hj ∈ R^(N′×M′), j = 1, 2, ..., O, denotes the j-th channel of the convolutional layer, corresponding to a two-dimensional feature matrix of size N′ × M′ after feature mapping.
Z3: Set the convolution stride to s1 × s2. All neurons in the same input channel share a bias ai, i = 1, 2, ..., I, and neurons in the same group of the convolutional layer share a bias bj, j = 1, 2, ..., O. The parameters to be trained are θ = {W, a, b}.
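As a quick sanity check of the Z1–Z3 setup, the feature-map size N′ × M′ and the number of trainable parameters in θ = {W, a, b} can be computed directly. The sketch below assumes a valid (un-padded) convolution and uses a hypothetical 5 × 5 kernel; neither choice is specified in the text.

```python
# Sketch: feature-map size and parameter count of the convolutional RBM
# layer described in Z1-Z3. "Valid" (no-padding) convolution is an
# assumption; the patent does not state a padding scheme.

def conv_output_size(y, s, E, F, s1, s2):
    """Feature-map size N' x M' for a y x s input, E x F kernel, stride s1 x s2."""
    n_out = (y - E) // s1 + 1
    m_out = (s - F) // s2 + 1
    return n_out, m_out

def crbm_param_count(I, O, E, F):
    """Trainable parameters theta = {W, a, b}: O kernels of I*E*F weights each,
    one shared bias per input channel (a), one per output channel (b)."""
    return O * I * E * F + I + O

# Bottom-layer CRBM from the text: one channel holding the N x M MFCC matrix.
N, M = 100, 12                                   # hypothetical N; M = 12 per claim 4
n_p, m_p = conv_output_size(N, M, 5, 5, 1, 1)    # hypothetical 5x5 kernel, stride 1
params = crbm_param_count(1, 8, 5, 5)            # hypothetical O = 8 output channels
```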
Z4: The convolutional restricted Boltzmann machine is an energy-based model, whose energy function is defined over the input layer V and the convolutional layer H.
The joint probability distribution of the input layer V and the convolutional layer H, i.e. the joint probability distribution of all neuron values of the input layer V and all neuron values of the convolutional layer H, is obtained from the energy function, normalized by the partition function, with i = 1, ..., I, p = 1, ..., y, q = 1, ..., s, j = 1, ..., O, m = 1, ..., N′, k = 1, ..., M′.
The marginal probability distribution of the input layer V, i.e. the probability distribution of all neuron values of the input layer V, is obtained by summing the joint distribution over the convolutional layer H.
If the training sample set TS contains T samples, the log-likelihood function on the input layer V is ln L(TS, θ), where L(TS, θ) is the likelihood function on the input layer V.
Z5: The log-likelihood function is maximized using the gradient ascent algorithm combined with the contrastive divergence algorithm, thereby obtaining the parameters θ.
Z6: The output layer is a Softmax output layer with 2 neurons, whose outputs represent the probabilities that the sample is normal speech and abnormal speech respectively; in actual recognition, the class with the larger probability is taken as the final classification result. The output layer is fully connected to the pooling layer of the top convolutional restricted Boltzmann machine, and the parameters to be trained are the biases c1, c2 of the two output-layer neurons and the connection weights between the output layer and the top pooling layer:
where r is the number of neurons in the top hidden layer; when i = 1, WE1 is the connection-weight vector between the 1st neuron of the Softmax output layer and the neurons of the top pooling layer, whose k-th component is the connection weight between the 1st neuron of the Softmax output layer and the k-th neuron of the top pooling layer, k ≤ r; when i = 2, WE2 is the connection-weight vector between the 2nd neuron of the Softmax output layer and the neurons of the top pooling layer, whose k-th component is the connection weight between the 2nd neuron of the Softmax output layer and the k-th neuron of the top pooling layer.
Let the output of the top pooling layer be G = {g1, g2, ..., gr}, where gi is the output of the i-th neuron of the top pooling layer; the input of the output layer is then fi = WEi·G^T + ci, where f1 (i = 1) is the input of the first output-layer neuron and f2 (i = 2) is the input of the second output-layer neuron.
The output value of the output layer is:
yi = e^fi / (e^f1 + e^f2), i = 1, 2,
where y1, the output of the 1st neuron of the Softmax output layer, represents the probability that the input speech is normal speech, and y2, the output of the 2nd neuron of the Softmax output layer, represents the probability that the input speech is abnormal speech.
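A minimal sketch of the Z6 computation, from the pooling-layer output G to the two posterior probabilities. The weights and biases here are illustrative values, not taken from the patent.

```python
import math

def output_layer(G, WE, c):
    """Z6: inputs f_i = WE_i . G + c_i, outputs y_i = e^{f_i} / (e^{f_1} + e^{f_2}),
    so y_1 + y_2 = 1 (posteriors of normal vs. abnormal speech)."""
    f = [sum(w * g for w, g in zip(WE_i, G)) + c_i for WE_i, c_i in zip(WE, c)]
    z = sum(math.exp(fi) for fi in f)
    return [math.exp(fi) / z for fi in f]

# Toy pooling output and weights (illustrative only):
G = [0.2, 0.7, 0.1]
WE = [[0.5, -0.3, 0.8], [-0.2, 0.6, 0.1]]
c = [0.05, -0.05]
y = output_layer(G, WE, c)   # y[0]: P(normal), y[1]: P(abnormal)
```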
Z7: When training the Softmax classifier, Mini-batch training is used: T training samples are fetched at a time for training, and the parameters are updated once per batch, with the loss function:
J = Σ_{j=1..T} Σ_{i=1,2} (d_i(j) − y_i(j))²,
where J is the sum of squared errors during training, d_i(j) is the ideal output of the i-th neuron of the classifier for the j-th fetched sample (if the j-th sample is normal speech, d(j) = (1, 0), and otherwise d(j) = (0, 1)), and y_i(j) is the actual output value of the i-th neuron of the classifier for the j-th fetched sample.
F1 is the harmonic mean of the precision P and the recall R, i.e.:
F1 = 2PR / (P + R),
where the precision P = TP/(TP + FP) is the proportion of samples correctly identified as abnormal speech among all samples identified as abnormal speech, and the recall R = TP/(TP + FN) is the proportion of samples correctly identified as abnormal speech among all abnormal speech samples; TP is the number of the T samples correctly identified as abnormal speech, FP is the number of the T samples erroneously identified as abnormal speech, and FN is the number of the T samples erroneously identified as normal speech.
During training, the parameters are updated with the gradient descent algorithm until the loss function converges to the set value or the number of iterations reaches the maximum number of iterations, at which point training ends.
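The verbal definitions of precision, recall, and F1 in Z7 translate directly into code:

```python
def prf1(TP, FP, FN):
    """Precision, recall and F1 as defined in Z7:
    P = TP/(TP+FP), R = TP/(TP+FN), F1 = 2PR/(P+R) (harmonic mean)."""
    P = TP / (TP + FP)
    R = TP / (TP + FN)
    return P, R, 2 * P * R / (P + R)

# Example: 8 abnormal samples found correctly, 2 false alarms, 2 misses.
P, R, F1 = prf1(TP=8, FP=2, FN=2)
```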
Step 6: According to the classification result, the hidden Markov model is called for template matching to obtain the speech recognition result.
In the online recognition part, as shown in Fig. 2, the deep neural network loads the network structure and the weight coefficients obtained by training, and the hidden Markov model loads the model trained in advance; the speech input in real time is preprocessed and its features are extracted. HMM templates for normal and abnormal speech are established respectively: the HMM template for normal speech is trained with normal speech, and the HMM template for abnormal speech is trained with abnormal speech. During online recognition, different HMM templates are called according to the decision output by the convolutional deep belief network, and template matching is performed on the mel-frequency cepstral coefficients of the input speech signal to obtain the final speech recognition result; the principle flow chart is shown in Fig. 3.
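The online dispatch of step 6 can be sketched as follows; the HMM scorers here are hypothetical stand-ins (a real implementation would use the forward or Viterbi probability of each trained template):

```python
# Sketch of the online dispatch in step 6: the CDBN posterior decides which
# HMM template set scores the MFCC features. The "score" callables below are
# hypothetical stand-ins for a real HMM forward/Viterbi scorer.

def recognize(mfcc_matrix, cdbn_posterior, normal_hmms, abnormal_hmms):
    """cdbn_posterior = (p_normal, p_abnormal) from the Softmax layer."""
    templates = normal_hmms if cdbn_posterior[0] >= cdbn_posterior[1] else abnormal_hmms
    # Template matching: pick the template scoring highest on the input features.
    return max(templates, key=lambda t: t["score"](mfcc_matrix))["label"]

# Toy templates with constant scorers (illustrative only):
normal_hmms = [{"label": "hello", "score": lambda m: 0.9},
               {"label": "stop",  "score": lambda m: 0.4}]
abnormal_hmms = [{"label": "hello", "score": lambda m: 0.7}]
result = recognize([[0.0] * 12], (0.8, 0.2), normal_hmms, abnormal_hmms)
```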
The above embodiment is a preferred embodiment of the present invention, but the embodiments of the present invention are not limited by the above embodiment; any other change, modification, substitution, combination, or simplification made without departing from the spirit and principles of the present invention shall be an equivalent replacement and is included within the protection scope of the present invention.
Claims (10)
1. An abnormal speech discrimination method based on deep learning, characterized by comprising the following steps:
S1, acquiring input speech and preprocessing it to obtain preprocessed speech, the preprocessing comprising resampling, pre-emphasis, and framing with windowing;
S2, extracting a mel-frequency cepstral coefficient feature vector from each frame of the preprocessed speech using a mel-frequency filter bank and the Fourier transform;
S3, normalizing speech segments with different frame counts to a fixed frame count, each speech segment yielding a corresponding mel-frequency cepstral coefficient feature matrix;
S4, establishing a convolutional deep belief network;
S5, inputting the mel-frequency cepstral coefficient feature matrix into the convolutional deep belief network, training it, and classifying the state of the input speech;
S6, according to the classification result, calling a hidden Markov model for template matching to obtain a speech recognition result.
2. The abnormal speech discrimination method based on deep learning according to claim 1, characterized in that, in step S1:
the sampling frequency of the resampling is 22.05 kHz, and the encoding format is wav;
the pre-emphasis uses a first-order FIR high-pass filter with transfer function
H(z) = 1 − a·z^(−1),
wherein a is the high-pass filter coefficient, with value 0.93; the speech signal after pre-emphasis is
y(n) = sp(n) − a·sp(n−1), n = 0, 1, ..., Length−1,
wherein y(n) is the speech signal after pre-emphasis, sp(n) is the speech signal before pre-emphasis, sp(n−1) is the time-shifted speech signal, and Length is the length of the speech signal;
the framing with windowing is specifically: the speech is sliced, a fixed-length segment of the input speech being intercepted at fixed intervals as one frame, with a frame length of 25 ms and a frame shift of 10 ms, and framing and windowing are performed with a Hamming window.
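A sketch of the preprocessing in claim 2: pre-emphasis with a = 0.93 followed by 25 ms Hamming-windowed frames with a 10 ms shift. The handling of the first sample n = 0, which has no predecessor sp(−1), is an assumption.

```python
import math

def preemphasis(sp, a=0.93):
    """First-order FIR high-pass H(z) = 1 - a*z^-1: y(n) = sp(n) - a*sp(n-1).
    The first sample is passed through unchanged (assumption: sp(-1) = 0 style
    boundary handling is not specified in the claim)."""
    return [sp[n] - a * sp[n - 1] if n > 0 else sp[0] for n in range(len(sp))]

def frame_and_window(sp, fs=22050, frame_ms=25, shift_ms=10):
    """Slice into 25 ms frames with a 10 ms shift and apply a Hamming window."""
    flen = int(fs * frame_ms / 1000)
    shift = int(fs * shift_ms / 1000)
    ham = [0.54 - 0.46 * math.cos(2 * math.pi * n / (flen - 1)) for n in range(flen)]
    frames = []
    for start in range(0, len(sp) - flen + 1, shift):
        frames.append([sp[start + n] * ham[n] for n in range(flen)])
    return frames

emphasized = preemphasis([1.0, 1.0, 1.0])
frames = frame_and_window([0.0] * 2000)
```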
3. The abnormal speech discrimination method based on deep learning according to claim 1, characterized in that step S2 is specifically:
V1, designing L triangular mel-frequency filters, wherein Wl is the frequency response of the l-th mel-frequency filter, l = 1, 2, ..., L; Fs is the resampling frequency of the speech signal; Q is the frame length of a frame of the speech signal and also the number of Fourier transform points; fl and fh are respectively the lower and upper cutoff frequencies of the speech signal; a frame of speech of length Q yields Q frequency components after a Q-point fast Fourier transform; o(l), c(l), h(l) are respectively the index values, among the Q frequency components, of the lower-limit, center, and upper-limit frequencies of the l-th mel-frequency filter; o(l), c(l), h(l) satisfy:
c(l−1) = o(l),
o(l+1) = c(l),
h(l) = c(l+1),
that is, the center frequency of the current filter lies at the side-lobe attenuation position of the next filter;
meanwhile, o(l)|l=1 = fl and h(l)|l=L = fh; therefore, the index value of the center frequency of the l-th mel-frequency filter among the Q frequency components is expressed as:
wherein Mel(f1) is the function mapping actual frequency to mel frequency, Mel^(−1)(f2) is the inverse function of Mel(f1), f1 is the actual frequency, and f2 is the mel frequency;
the frequency response of the l-th mel-frequency filter is:
wherein k is the index value of a frequency component among the Q frequency components;
V2, performing a Q-point fast Fourier transform on a frame x(n), n = 0, 1, ..., Q−1, Q < Length, of the speech signal after resampling, pre-emphasis, and framing with windowing, to obtain its spectrum X(k) and magnitude spectrum |X(k)|;
V3, passing the frame of the speech signal through the mel-frequency filter bank to obtain the output magnitude spectrum of each filter;
V4, taking the logarithm of the output magnitude spectra of all filters and then applying a discrete cosine transform to obtain the mel-frequency cepstral coefficients:
the 2nd through (M+1)-th of the L coefficients, M coefficients in total, are taken to form the M-dimensional mel-frequency cepstral coefficient feature vector of each frame, C = {cmfcc(2), cmfcc(3), ..., cmfcc(M+1)}.
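The mapping Mel(f1) and its inverse are not reproduced in the text above (the formula image is lost). The sketch below assumes the standard mel scale Mel(f) = 2595·lg(1 + f/700) and filter centers spaced uniformly on that scale between fl and fh, which matches the constraints o(1) = fl, h(L) = fh, and c(l−1) = o(l).

```python
import math

def mel(f):
    """Standard mel mapping, assumed for Mel(f1) in the claim."""
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_inv(m):
    """Inverse mapping Mel^-1(f2)."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def center_indices(L, fl, fh, Fs, Q):
    """Index c(l) of each filter's centre frequency among the Q FFT bins.
    Centres are placed equally on the mel scale between Mel(fl) and Mel(fh);
    edge l of filter l is the centre of filter l-1, per the claim."""
    ml, mh = mel(fl), mel(fh)
    edges = [mel_inv(ml + (mh - ml) * l / (L + 1)) for l in range(L + 2)]
    return [round(f * Q / Fs) for f in edges[1:-1]]

centers = center_indices(L=26, fl=0.0, fh=8000.0, Fs=22050.0, Q=512)
```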
4. The abnormal speech discrimination method based on deep learning according to claim 1, characterized in that, in step S2, the dimension of the mel-frequency cepstral coefficient feature matrix is N × M, wherein N is the normalized target frame count of the speech segment and M is the number of mel-frequency cepstral coefficients per frame; the mel-frequency cepstral coefficients are 12-dimensional, i.e. M = 12.
5. The abnormal speech discrimination method based on deep learning according to claim 1, characterized in that, in step S3, the normalization is specifically:
Y1, constructing a time-normalization network of n−N+1 layers, wherein N is the target frame count of the time normalization; the first layer has n frames and the last layer has N frames;
the i-th layer of the time-normalization network has n−(i−1) frames, each frame corresponding to one feature vector, forming a group of n−(i−1) feature vectors:
{C(i,1), C(i,2), ..., C(i, n−(i−1))},
wherein C(i,k) is the mel-frequency cepstral coefficient feature vector of the k-th frame of speech in the i-th layer of the network;
in particular, the feature-vector group of the first layer of the network is the feature-vector group input to the network;
letting w(i,k) denote the weight of the speech frame represented by the vector C(i,k), with w(1,k) = 1 for all k when i = 1;
Y2, letting d(i,j) denote the distance between C(i,j) and C(i,j+1); for the i-th layer, finding j such that for any k ≠ j, d(i,j) ≤ d(i,k);
Y3, merging the two nearest frames, so that the frame count decreases by one, i.e.:
C(i+1,j) = [w(i,j)·C(i,j) + w(i,j+1)·C(i,j+1)] / [w(i,j) + w(i,j+1)],
w(i+1,j) = w(i,j) + w(i,j+1),
with C(i+1,k) = C(i,k) for k < j and C(i+1,k) = C(i,k+1) for k > j,
wherein C(i+1,j) is the mel-frequency cepstral coefficient feature vector of the j-th frame in layer i+1 of the network; C(i,j+1) is the mel-frequency cepstral coefficient feature vector of the (j+1)-th frame in layer i; C(i+1,k) is the mel-frequency cepstral coefficient feature vector of the k-th frame in layer i+1; C(i,k+1) is the mel-frequency cepstral coefficient feature vector of the (k+1)-th frame in layer i; w(i,j) is the weight of the j-th frame in layer i; w(i,j+1) is the weight of the (j+1)-th frame in layer i; w(i+1,j) is the weight of the j-th frame in layer i+1; d(i,j) is the distance between the mel-frequency cepstral coefficient feature vectors of the j-th frame and the (j+1)-th frame in layer i;
Y4, solving the merge for each layer: this is repeated from i = 1 until i = n−N+1, each execution reducing the frame count by one, finally normalizing the n-frame speech signal to the fixed N frames.
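The normalization of claim 5 can be sketched as an iterative merge; the weighted-average form of the merged vector is an assumption consistent with the weight updates described in Y3.

```python
def time_normalize(frames, N):
    """Merge the two closest adjacent frames (squared Euclidean distance between
    their MFCC vectors) until N frames remain. The merge is a weight-carrying
    average: an assumption consistent with the Y3 weight updates."""
    vecs = [list(f) for f in frames]
    w = [1.0] * len(vecs)                       # Y1: layer-1 weights are all 1
    while len(vecs) > N:
        # Y2: find j minimising the distance between frames j and j+1
        dists = [sum((a - b) ** 2 for a, b in zip(vecs[j], vecs[j + 1]))
                 for j in range(len(vecs) - 1)]
        j = dists.index(min(dists))
        # Y3: weighted average of the two nearest frames; their weights add up
        tot = w[j] + w[j + 1]
        vecs[j] = [(w[j] * a + w[j + 1] * b) / tot
                   for a, b in zip(vecs[j], vecs[j + 1])]
        w[j] = tot
        del vecs[j + 1]
        del w[j + 1]
    return vecs

normalized = time_normalize([[0.0], [0.0], [10.0]], N=2)
```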
6. The abnormal speech discrimination method based on deep learning according to claim 1, characterized in that, in step S4, the convolutional deep belief network is formed by stacking more than one convolutional restricted Boltzmann machine, with a Softmax classifier as the output layer; each convolutional restricted Boltzmann machine consists of one input layer V and one convolutional layer H.
7. The abnormal speech discrimination method based on deep learning according to claim 6, characterized in that, for the convolutional restricted Boltzmann machine, a pooling layer is added after the convolutional layer H of each convolutional restricted Boltzmann machine to perform pooling, with pooling size E′ × F′ and pooling stride s3 × s4, s3 = s4 = 2; the pooling layer of the previous convolutional restricted Boltzmann machine is therefore the input layer of the next convolutional restricted Boltzmann machine.
8. The abnormal speech discrimination method based on deep learning according to claim 6, characterized in that, in step S5, the bottom convolutional restricted Boltzmann machine is trained first, and then the top-layer convolutional restricted Boltzmann machine is trained, specifically:
Z1, letting the number of input channels of the convolutional restricted Boltzmann machine be I, each channel corresponding to a two-dimensional matrix of size y × s, i.e.:
V = {v1, v2, ..., vI}, vi ∈ R^(y×s), i = 1, 2, ..., I,
wherein V is the input layer of the convolutional restricted Boltzmann machine, and vi ∈ R^(y×s) is the i-th channel of the input layer;
in the bottom convolutional restricted Boltzmann machine, I = 1, y = N, and s = M, i.e. the bottom convolutional restricted Boltzmann machine has 1 input channel, corresponding to the N × M two-dimensional mel-frequency cepstral coefficient feature matrix input to the convolutional deep belief network, wherein N is the target frame count of the time normalization and M is the number of mel-frequency cepstral coefficients per speech frame;
Z2, the convolution using O kernels, each kernel being a three-dimensional weight matrix of size I × E × F, i.e. the kernel set W = {w1, w2, ..., wO}, wherein wj ∈ R^(I×E×F), j = 1, 2, ..., O, and wj is the j-th kernel; the convolutional restricted Boltzmann machine therefore has O output channels, each output channel corresponding to one local feature of the input, i.e.:
the convolutional layer H = {h1, h2, ..., hO}, wherein hj ∈ R^(N′×M′), j = 1, 2, ..., O, denotes the j-th channel of the convolutional layer, corresponding to a two-dimensional feature matrix of size N′ × M′ after feature mapping;
Z3, setting the convolution stride to s1 × s2, all neurons in the same input channel sharing a bias ai, i = 1, 2, ..., I, and neurons in the same group of the convolutional layer sharing a bias bj, j = 1, 2, ..., O, the parameters to be trained being θ = {W, a, b};
Z4, the convolutional restricted Boltzmann machine being an energy-based model whose energy function is defined over the input layer V and the convolutional layer H; the joint probability distribution of the input layer V and the convolutional layer H, i.e. the joint probability distribution of all neuron values of the input layer V and all neuron values of the convolutional layer H, is obtained from the energy function, normalized by the partition function, with i = 1, ..., I, p = 1, ..., y, q = 1, ..., s, j = 1, ..., O, m = 1, ..., N′, k = 1, ..., M′;
the marginal probability distribution of the input layer V, i.e. the probability distribution of all neuron values of the input layer V, is obtained by summing the joint distribution over the convolutional layer H; if the training sample set TS contains T samples, the log-likelihood function on the input layer V is ln L(TS, θ), wherein L(TS, θ) is the likelihood function on the input layer V;
Z5, maximizing the log-likelihood function using the gradient ascent algorithm combined with the contrastive divergence algorithm, thereby obtaining the parameters θ;
Z6, the output layer being a Softmax output layer with 2 neurons, whose outputs represent the probabilities that the sample is normal speech and abnormal speech respectively, the class with the larger probability being taken as the final classification result in actual recognition; the output layer is fully connected to the pooling layer of the top convolutional restricted Boltzmann machine, and the parameters to be trained are the biases c1, c2 of the two output-layer neurons and the connection weights between the output layer and the top pooling layer:
wherein r is the number of neurons in the top hidden layer; when i = 1, WE1 is the connection-weight vector between the 1st neuron of the Softmax output layer and the neurons of the top pooling layer, whose k-th component is the connection weight between the 1st neuron of the Softmax output layer and the k-th neuron of the top pooling layer, k ≤ r; when i = 2, WE2 is the connection-weight vector between the 2nd neuron of the Softmax output layer and the neurons of the top pooling layer, whose k-th component is the connection weight between the 2nd neuron of the Softmax output layer and the k-th neuron of the top pooling layer;
letting the output of the top pooling layer be G = {g1, g2, ..., gr}, wherein gi is the output of the i-th neuron of the top pooling layer, the input of the output layer is fi = WEi·G^T + ci, wherein f1 (i = 1) is the input of the first output-layer neuron and f2 (i = 2) is the input of the second output-layer neuron;
the output value of the output layer is yi = e^fi / (e^f1 + e^f2), i = 1, 2, wherein y1, the output of the 1st neuron of the Softmax output layer, represents the probability that the input speech is normal speech, and y2, the output of the 2nd neuron of the Softmax output layer, represents the probability that the input speech is abnormal speech;
Z7, when training the Softmax classifier, Mini-batch training being used: T training samples are fetched at a time for training and the parameters are updated once per batch, with the loss function:
J = Σ_{j=1..T} Σ_{i=1,2} (d_i(j) − y_i(j))²,
wherein J is the sum of squared errors during training, d_i(j) is the ideal output of the i-th neuron of the classifier for the j-th fetched sample (if the j-th sample is normal speech, d(j) = (1, 0), and otherwise d(j) = (0, 1)), and y_i(j) is the actual output value of the i-th neuron of the classifier for the j-th fetched sample;
F1 is the harmonic mean of the precision P and the recall R, i.e.:
F1 = 2PR / (P + R),
wherein the precision P = TP/(TP + FP) is the proportion of samples correctly identified as abnormal speech among all samples identified as abnormal speech, and the recall R = TP/(TP + FN) is the proportion of samples correctly identified as abnormal speech among all abnormal speech samples; TP is the number of the T samples correctly identified as abnormal speech, FP is the number of the T samples erroneously identified as abnormal speech, and FN is the number of the T samples erroneously identified as normal speech;
during training, the parameters are updated with the gradient descent algorithm until the loss function converges to the set value or the number of iterations reaches the maximum number of iterations, at which point training ends.
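One Mini-batch update of the Softmax layer in Z7 can be sketched as follows. The exact form of the squared-error loss is reconstructed (the patent's formula image is lost), and the learning rate is a hypothetical choice.

```python
import math

def softmax(f):
    z = sum(math.exp(v) for v in f)
    return [math.exp(v) / z for v in f]

def minibatch_update(batch, WE, c, lr=0.1):
    """One Mini-batch gradient-descent update of the Softmax layer under the
    squared-error loss J = sum_j sum_i (y_i - d_i)^2 (reconstructed form).
    batch = [(G, d), ...] where G is the top pooling-layer output and d the
    ideal one-hot output. Returns updated (WE, c)."""
    gW = [[0.0] * len(WE[0]) for _ in WE]
    gc = [0.0] * len(c)
    for G, d in batch:
        f = [sum(w * g for w, g in zip(WE_i, G)) + ci for WE_i, ci in zip(WE, c)]
        y = softmax(f)
        for k in range(len(c)):
            # dJ/df_k through the softmax Jacobian dy_i/df_k = y_i*(1[i=k] - y_k)
            df = 2.0 * sum((y[i] - d[i]) * y[i] * ((i == k) - y[k])
                           for i in range(len(c)))
            gc[k] += df
            for m in range(len(G)):
                gW[k][m] += df * G[m]
    WE = [[w - lr * g for w, g in zip(WE_k, gW_k)] for WE_k, gW_k in zip(WE, gW)]
    c = [ci - lr * gi for ci, gi in zip(c, gc)]
    return WE, c

# Toy usage: one update on a single-feature batch (illustrative values).
WE, c = [[0.0], [0.0]], [0.0, 0.0]
batch = [([1.0], [1.0, 0.0]), ([0.2], [0.0, 1.0])]
WE, c = minibatch_update(batch, WE, c)
```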
9. The abnormal speech discrimination method based on deep learning according to claim 1, characterized in that the detailed process of step S6 is: in the online recognition part, the deep neural network loads the network structure and the weight coefficients obtained by training, and the hidden Markov model loads the model trained in advance; the speech input in real time is preprocessed and its features are extracted; HMM templates for normal speech and abnormal speech are established respectively; during online recognition, different HMM templates are called according to the decision output by the convolutional deep belief network, and template matching is performed on the mel-frequency cepstral coefficients of the input speech signal to obtain the final speech recognition result.
10. The abnormal speech discrimination method based on deep learning according to claim 9, characterized in that the HMM template for normal speech is obtained by training with normal speech, and the HMM template for abnormal speech is obtained by training with abnormal speech.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810417478.2A CN108766419B (en) | 2018-05-04 | 2018-05-04 | Abnormal voice distinguishing method based on deep learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108766419A true CN108766419A (en) | 2018-11-06 |
CN108766419B CN108766419B (en) | 2020-10-27 |
Family
ID=64009048
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810417478.2A Active CN108766419B (en) | 2018-05-04 | 2018-05-04 | Abnormal voice distinguishing method based on deep learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108766419B (en) |
Cited By (20)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110058689A (en) * | 2019-04-08 | 2019-07-26 | 深圳大学 | A kind of smart machine input method based on face's vibration |
CN110322887A (en) * | 2019-04-28 | 2019-10-11 | 武汉大晟极科技有限公司 | A kind of polymorphic type audio signal energies feature extracting method |
CN110390929A (en) * | 2019-08-05 | 2019-10-29 | 中国民航大学 | Chinese and English civil aviaton land sky call acoustic model construction method based on CDNN-HMM |
CN110600015A (en) * | 2019-09-18 | 2019-12-20 | 北京声智科技有限公司 | Voice dense classification method and related device |
CN110706720A (en) * | 2019-08-16 | 2020-01-17 | 广东省智能制造研究所 | Acoustic anomaly detection method for end-to-end unsupervised deep support network |
CN110782901A (en) * | 2019-11-05 | 2020-02-11 | 深圳大学 | Method, storage medium and device for identifying voice of network telephone |
CN110931046A (en) * | 2019-11-29 | 2020-03-27 | 福州大学 | Audio high-level semantic feature extraction method and system for overlapped sound event detection |
CN111027675A (en) * | 2019-11-22 | 2020-04-17 | 南京大学 | Automatic adjusting method and system for multimedia playing setting |
CN111044285A (en) * | 2019-11-22 | 2020-04-21 | 军事科学院系统工程研究院军用标准研究中心 | Method for diagnosing faults of mechanical equipment under complex conditions |
CN111128227A (en) * | 2019-12-30 | 2020-05-08 | 云知声智能科技股份有限公司 | Sound detection method and device |
CN111354373A (en) * | 2018-12-21 | 2020-06-30 | 中国科学院声学研究所 | Audio signal classification method based on neural network intermediate layer characteristic filtering |
CN111508501A (en) * | 2020-07-02 | 2020-08-07 | 成都晓多科技有限公司 | Voice recognition method and system with accent for telephone robot |
CN111724770A (en) * | 2020-05-19 | 2020-09-29 | 中国电子科技网络信息安全有限公司 | Audio keyword identification method for generating confrontation network based on deep convolution |
EP3745412A1 (en) * | 2019-05-28 | 2020-12-02 | Corti ApS | An intelligent computer aided decision support system |
WO2021000498A1 (en) * | 2019-07-04 | 2021-01-07 | 平安科技(深圳)有限公司 | Composite speech recognition method, device, equipment, and computer-readable storage medium |
CN112750428A (en) * | 2020-12-29 | 2021-05-04 | 平安普惠企业管理有限公司 | Voice interaction method and device and computer equipment |
CN112955954A (en) * | 2018-12-21 | 2021-06-11 | 华为技术有限公司 | Audio processing device and method for audio scene classification |
CN113361647A (en) * | 2021-07-06 | 2021-09-07 | 青岛洞听智能科技有限公司 | Method for identifying type of missed call |
CN113689633A (en) * | 2021-08-26 | 2021-11-23 | 浙江力石科技股份有限公司 | Scenic spot human-computer interaction method, device and system |
CN113959071A (en) * | 2021-07-21 | 2022-01-21 | 北京金茂绿建科技有限公司 | Machine learning assistance-based operation control optimization method for centralized water chilling unit air conditioning system |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102129860A (en) * | 2011-04-07 | 2011-07-20 | 魏昕 | Text-related speaker recognition method based on infinite-state hidden Markov model |
CN104157290A (en) * | 2014-08-19 | 2014-11-19 | 大连理工大学 | Speaker recognition method based on depth learning |
CN105206270A (en) * | 2015-08-20 | 2015-12-30 | 长安大学 | Isolated digit speech recognition classification system and method combining principal component analysis (PCA) with restricted Boltzmann machine (RBM) |
CN106782602A (en) * | 2016-12-01 | 2017-05-31 | 南京邮电大学 | Speech-emotion recognition method based on length time memory network and convolutional neural networks |
CN106941005A (en) * | 2017-02-24 | 2017-07-11 | 华南理工大学 | A kind of vocal cords method for detecting abnormality based on speech acoustics feature |
US20170200092A1 (en) * | 2016-01-11 | 2017-07-13 | International Business Machines Corporation | Creating deep learning models using feature augmentation |
CN107464568A (en) * | 2017-09-25 | 2017-12-12 | 四川长虹电器股份有限公司 | Based on the unrelated method for distinguishing speek person of Three dimensional convolution neutral net text and system |
Non-Patent Citations (2)
Title |
---|
IDO ARIAV et al.: "A deep architecture for audio-visual voice activity detection in the presence of transients", 《SIGNAL PROCESSING》 *
LI WEI et al.: "Speaker retrieval based on deep speaker vectors", 《Journal of Huazhong University of Science and Technology (Natural Science Edition)》 *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108766419A (en) | A kind of abnormal speech detection method based on deep learning | |
CN112364779B (en) | Underwater sound target identification method based on signal processing and deep-shallow network multi-model fusion | |
CN106782511A (en) | Amendment linear depth autoencoder network audio recognition method | |
CN109559736B (en) | Automatic dubbing method for movie actors based on an adversarial network | |
CN107221320A (en) | Train method, device, equipment and the computer-readable storage medium of acoustic feature extraction model | |
CN113488058B (en) | Voiceprint recognition method based on short voice | |
CN111048097B (en) | Twin network voiceprint recognition method based on 3D convolution | |
CN108520753A (en) | Speech lie detection method based on a convolutional bidirectional long short-term memory network | |
CN110111797A (en) | Speaker recognition method based on Gaussian supervectors and a deep neural network | |
CN103456302B (en) | A kind of emotional speaker recognition method based on the synthesis of emotion GMM Model Weight | |
CN111986679A (en) | Speaker confirmation method, system and storage medium for responding to complex acoustic environment | |
Sukhwal et al. | Comparative study of different classifiers based speaker recognition system using modified MFCC for noisy environment | |
CN102496366B (en) | Text-independent speaker identification method | |
Zheng et al. | MSRANet: Learning discriminative embeddings for speaker verification via channel and spatial attention mechanism in alterable scenarios | |
Sun et al. | A novel convolutional neural network voiceprint recognition method based on improved pooling method and dropout idea | |
CN113571095B (en) | Speech emotion recognition method and system based on nested deep neural network | |
CN111243621A (en) | Construction method of GRU-SVM deep learning model for synthetic speech detection | |
Kamble et al. | Emotion recognition for instantaneous Marathi spoken words | |
Aggarwal et al. | Performance evaluation of artificial neural networks for isolated Hindi digit recognition with LPC and MFCC | |
CN114299995A (en) | Language emotion recognition method for emotion assessment | |
CN111785262B (en) | Speaker age and gender classification method based on residual error network and fusion characteristics | |
CN114882888A (en) | Voiceprint recognition method and system based on a variational autoencoder and a generative adversarial network | |
CN113963718A (en) | Voice session segmentation method based on deep learning | |
Gade et al. | Hybrid Deep Convolutional Neural Network based Speaker Recognition for Noisy Speech Environments | |
CN113450830A (en) | Voice emotion recognition method of convolution cyclic neural network with multiple attention mechanisms |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||