CN108766419A - Abnormal speech detection method based on deep learning - Google Patents

Abnormal speech detection method based on deep learning

Info

Publication number
CN108766419A
CN108766419A (application CN201810417478.2A)
Authority
CN
China
Prior art keywords
voice
layer
frequency
mel
frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810417478.2A
Other languages
Chinese (zh)
Other versions
CN108766419B (en)
Inventor
奉小慧
陈光科
贺前华
巫小兰
李艳雄
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority to CN201810417478.2A priority Critical patent/CN108766419B/en
Publication of CN108766419A publication Critical patent/CN108766419A/en
Application granted granted Critical
Publication of CN108766419B publication Critical patent/CN108766419B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/14 Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
    • G10L15/142 Hidden Markov Models [HMMs]
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Abstract

The invention discloses an abnormal speech discrimination method based on deep learning, comprising the following steps: obtain the input speech and pre-process it by resampling, pre-emphasis, framing and windowing to obtain pre-processed speech; extract mel-frequency cepstral coefficient (MFCC) feature vectors from the pre-processed speech; normalize speech segments of different frame counts to a fixed number of frames, so that each segment yields one MFCC feature matrix; build a convolutional deep belief network; feed the MFCC feature matrices into the convolutional deep belief network, train it, and classify the state of the input speech; according to the classification result, call the corresponding hidden Markov model for template matching to obtain the speech recognition result. The invention uses the multiple nonlinear transformation layers of the convolutional deep belief network to map the input MFCC features to a higher-dimensional space, and uses hidden Markov models to model speech of different states separately, which improves the recognition accuracy of the speech.

Description

Abnormal speech detection method based on deep learning
Technical field
The present invention relates to the field of intelligent speech processing, and more particularly to an abnormal speech discrimination method based on deep learning.
Background technology
Speech is one of the important means of human-machine interaction. After decades of research, speech recognition technology has developed greatly and has entered our daily life. However, existing speech recognition research still suffers from the following problems:
In real life, an abnormal health condition of the speaker or other causes can shift the input speech from normal speech to abnormal speech and introduce additional noise interference. Abnormal speech generally refers to speech with complex background noise, speech produced by deliberately changed articulation or speaking habits, speech affected by lesions of the vocal organs, and so on.
Another problem is that traditional speech recognition systems usually rely on linear prediction cepstral coefficients and mel-frequency cepstral coefficients. The main information carried by these low-level acoustic features is the pronunciation content itself, and such features are highly susceptible to interference from speaker, channel and noise information, which degrades the recognition performance of the system.
Summary of the invention
The primary object of the present invention is to overcome the shortcomings and deficiencies of the prior art and to provide an abnormal speech discrimination method based on deep learning, which uses the nonlinear transformation capability of a deep neural network to map lower-dimensional MFCC and LPCC parameters to a higher-dimensional space, better characterizes the high-level abstract information of the speech signal, models normal speech and abnormal speech separately, and effectively distinguishes normal speech from abnormal speech.
The object of the present invention is achieved by the following technical solution:
An abnormal speech discrimination method based on deep learning comprises the following steps:
S1. Obtain the input speech and pre-process it by resampling, pre-emphasis, framing and windowing to obtain pre-processed speech;
S2. Extract a mel-frequency cepstral coefficient feature vector from each frame of the pre-processed speech using a mel-frequency filter bank and the Fourier transform;
S3. Normalize speech segments with different frame counts to a fixed number of frames, so that each speech segment yields one mel-frequency cepstral coefficient feature matrix;
S4. Build a convolutional deep belief network;
S5. Feed the mel-frequency cepstral coefficient feature matrices into the convolutional deep belief network, train it, and classify the state of the input speech;
S6. According to the classification result, call the corresponding hidden Markov model for template matching to obtain the speech recognition result.
In step S1, the resampling frequency is 22.05 kHz and the coding format is WAV.
Pre-emphasis uses a first-order FIR high-pass filter whose transfer function is:
H(z) = 1 - a·z^(-1),
where a is the high-pass filter coefficient, with value 0.93. The speech signal after pre-emphasis is:
y(n) = sp(n) - a·sp(n-1), n = 0, 1, ..., Length-1
where y(n) is the speech signal after pre-emphasis, sp(n) is the speech signal before pre-emphasis, sp(n-1) is its one-sample delay, and Length is the length of the speech signal.
Framing and windowing: the speech is sliced so that a fixed-length segment of the input speech intercepted at fixed intervals forms one frame; a frame length of 25 ms and a frame shift of 10 ms are used, and a Hamming window is applied to each frame.
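The pre-processing chain above (pre-emphasis with a = 0.93, 25 ms frames, 10 ms frame shift, Hamming window) can be sketched as follows. This is a minimal NumPy illustration assuming the signal has already been resampled to 22.05 kHz and stored as a 1-D array; it is not the patent's implementation.

import numpy as np

def preprocess(sp, fs=22050, a=0.93, frame_ms=25, shift_ms=10):
    """Pre-emphasis followed by framing with a Hamming window.
    sp: 1-D array of speech samples, already resampled to fs."""
    # First-order FIR high-pass pre-emphasis: y(n) = sp(n) - a*sp(n-1)
    y = np.append(sp[0], sp[1:] - a * sp[:-1])

    frame_len = int(fs * frame_ms / 1000)   # 25 ms frame length in samples
    shift = int(fs * shift_ms / 1000)       # 10 ms frame shift in samples
    window = np.hamming(frame_len)

    num_frames = max(0, (len(y) - frame_len) // shift + 1)
    frames = np.empty((num_frames, frame_len))
    for t in range(num_frames):
        frames[t] = y[t * shift:t * shift + frame_len] * window
    return frames                            # shape: (number of frames, frame_len)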
In step S2, the detailed process is as follows:
V1. Design a bank of L triangular mel-frequency filters. Let Wl denote the frequency response of the l-th mel-frequency filter, l = 1, 2, ..., L; Fs is the resampling frequency of the speech signal; Q is the frame length of a given frame of the speech signal and also the number of points of the Fourier transform; fl and fh are respectively the lower and upper cut-off frequencies of the speech signal. A frame of speech of length Q yields Q frequency components after a Q-point fast Fourier transform, and o(l), c(l) and h(l) are respectively the index values, among the Q frequency components, of the lower-limit, center and upper-limit frequencies of the l-th mel-frequency filter. o(l), c(l) and h(l) satisfy the following relationships:
c(l-1) = o(l),
o(l+1) = c(l),
h(l) = c(l+1),
that is, the center frequency of the current filter lies at the lower edge of the next filter.
At the same time, o(l)|l=1 = fl and h(l)|l=L = fh; therefore, the index of the center frequency of the l-th mel-frequency filter among the Q frequency components is expressed as:
where Mel(f1) is the function mapping actual frequency to mel frequency, Mel^(-1)(f2) is the inverse of Mel(f1), f1 is the actual frequency, and f2 is the mel frequency.
The frequency response of the l-th mel-frequency filter is:
V2. For a frame of speech x(n), n = 0, 1, ..., Q-1, Q < Length, after resampling, pre-emphasis, framing and windowing, perform a Q-point fast Fourier transform to obtain its spectrum X(k) and amplitude spectrum |X(k)|.
V3. Pass the frame of speech through the mel-frequency filter bank to obtain the output amplitude spectrum of each filter.
V4. Take the logarithm of the output amplitude spectra of all the filters and then apply the discrete cosine transform to obtain the mel-frequency cepstral coefficients.
Among the L coefficients, the 2nd to the (M+1)-th, M coefficients in total, are taken to form the M-dimensional mel-frequency cepstral coefficient feature vector of each frame, C = {cmfcc(2), cmfcc(3), ..., cmfcc(M+1)}.
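As an illustration of steps V1 to V4, the sketch below computes the per-frame MFCC vector with NumPy and SciPy. The standard mel mapping Mel(f) = 2595·log10(1 + f/700) is assumed here because the patent's formula images are not reproduced; the filter count L and the exact edge-frequency handling are likewise assumptions.

import numpy as np
from scipy.fftpack import dct

def mel(f):                 # actual frequency -> mel frequency (standard mapping, assumed)
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_inv(m):             # mel frequency -> actual frequency
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc_frame(frame, fs=22050, L=26, M=12, f_low=0.0, f_high=None):
    """MFCC feature vector of one windowed frame (coefficients 2..M+1)."""
    f_high = f_high if f_high is not None else fs / 2.0
    Q = len(frame)                                    # FFT size equals the frame length
    spectrum = np.abs(np.fft.rfft(frame, Q))          # amplitude spectrum |X(k)|

    # L triangular mel filters: L+2 edge frequencies equally spaced on the mel scale
    edges = mel_inv(np.linspace(mel(f_low), mel(f_high), L + 2))
    bins = np.floor((Q + 1) * edges / fs).astype(int) # index values o(l), c(l), h(l)

    fbank = np.zeros((L, len(spectrum)))
    for l in range(L):
        o, c, h = bins[l], bins[l + 1], bins[l + 2]
        fbank[l, o:c] = (np.arange(o, c) - o) / max(c - o, 1)   # rising edge
        fbank[l, c:h] = (h - np.arange(c, h)) / max(h - c, 1)   # falling edge

    log_energy = np.log(fbank @ spectrum + 1e-10)     # log of the filter output amplitudes
    cepstrum = dct(log_energy, type=2, norm='ortho')  # discrete cosine transform
    return cepstrum[1:M + 1]                          # keep the 2nd..(M+1)-th coefficients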
In step S2, the mel-frequency cepstral coefficient feature matrix has dimension N × M, where N is the fixed number of frames of the speech segment after time normalization, i.e. the target frame count of the normalization, and M is the order of the mel-frequency cepstral coefficients of each frame; the mel-frequency cepstral coefficients are 12-dimensional, i.e. M = 12.
In step S3, because different speech segments have different durations, the number of frames they contain may differ, and because the input of the neural network must have a fixed dimension, the speech signal must be time-normalized so that the input of the neural network has a fixed size. Time normalization is carried out with the characteristic point sequence method, as follows:
Y1. Build a time-normalization network of n-N+1 layers, where N is the target frame count after normalization; the first layer has n frames and the last layer has N frames.
The i-th layer of the time-normalization network has n-(i-1) frames, each frame corresponding to one feature vector, forming a group of n-(i-1) feature vectors:
where i = 1, 2, ..., n-N+1, k = 1, 2, ..., n-(i-1), and C_k^(i) denotes the mel-frequency cepstral coefficient feature vector of the k-th frame of speech in the i-th layer of the network.
In particular, the feature vector group of the first layer of the network is the feature vector group of the network input.
Let w_k^(i) denote the weight of the speech frame represented by the vector C_k^(i); when i = 1,
Y2. Let d_j^(i) denote the distance between C_j^(i) and C_(j+1)^(i). For the i-th layer, find j such that for any k ≠ j, d_j^(i) ≤ d_k^(i).
Y3. Merge the two closest frames, reducing the frame count by one, i.e.:
where C_j^(i+1) denotes the mel-frequency cepstral coefficient feature vector of the j-th frame of speech in the (i+1)-th layer of the network; C_(j+1)^(i) denotes that of the (j+1)-th frame in the i-th layer; C_k^(i+1) denotes that of the k-th frame in the (i+1)-th layer; C_(k+1)^(i) denotes that of the (k+1)-th frame in the i-th layer; w_j^(i) denotes the weight of the j-th frame of the i-th layer; w_(j+1)^(i) denotes the weight of the (j+1)-th frame of the i-th layer; w_j^(i+1) denotes the weight of the j-th frame of the (i+1)-th layer; and d_j^(i) denotes the distance between the mel-frequency cepstral coefficient feature vectors of the j-th and (j+1)-th frames of speech in the i-th layer of the network.
Y4. Solve for C_j^(i+1), i.e.:
Repeat from i = 1 to i = n-N+1; each execution reduces the frame count by one, finally normalizing the n-frame speech signal to the fixed N frames.
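A compact sketch of this frame-merging normalization is given below. It follows the description in Y1 to Y4 (repeatedly find the closest pair of adjacent frames and merge them until N frames remain); the weighted-average merge rule and the accumulation of weights are assumptions, since the patent's merge formulas are not fully recoverable from the translation.

import numpy as np

def time_normalize(frames, N):
    """Normalize an (n, M) MFCC sequence to exactly N frames (n >= N assumed)
    by repeatedly merging the closest pair of adjacent frames."""
    vecs = [np.asarray(f, dtype=float) for f in frames]
    weights = [1.0] * len(vecs)                  # w_k^(1) = 1 for every input frame

    while len(vecs) > N:
        # d_j = distance between the MFCC vectors of adjacent frames j and j+1
        dists = [np.linalg.norm(vecs[j] - vecs[j + 1]) for j in range(len(vecs) - 1)]
        j = int(np.argmin(dists))                # index of the closest adjacent pair
        wj, wj1 = weights[j], weights[j + 1]
        merged = (wj * vecs[j] + wj1 * vecs[j + 1]) / (wj + wj1)   # weighted average (assumed)
        vecs[j:j + 2] = [merged]                 # the pair is replaced by one frame
        weights[j:j + 2] = [wj + wj1]            # the merged frame carries both weights
    return np.stack(vecs)                        # shape: (N, M)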
In step S4, the convolutional deep belief network is formed by stacking multiple convolutional restricted Boltzmann machines from bottom to top, and the output layer uses a Softmax classifier. Each convolutional restricted Boltzmann machine consists of one input layer V and one convolutional layer H. A pooling layer is added after the convolutional layer H of each convolutional restricted Boltzmann machine to perform a pooling operation with pooling size E' × F' and pooling stride s3 × s4, s3 = s4 = 2; the pooling layer of one convolutional restricted Boltzmann machine therefore serves as the input layer of the next convolutional restricted Boltzmann machine.
In step S5, the bottom convolutional restricted Boltzmann machine is trained first, and then the top convolutional restricted Boltzmann machine is trained, as follows:
Z1. Let the number of input channels of a convolutional restricted Boltzmann machine be I, each channel corresponding to a two-dimensional matrix of size y × s, i.e.:
V = {v1, v2, ..., vI}, vi ∈ R^(y×s), i = 1, 2, ..., I,
where V is the input layer of the convolutional restricted Boltzmann machine and vi ∈ R^(y×s) is the i-th channel of the input layer.
In particular, for the bottom convolutional restricted Boltzmann machine, I = 1, y = N, s = M; that is, the number of input channels of the bottom convolutional restricted Boltzmann machine is 1, corresponding to the two-dimensional N × M mel-frequency cepstral coefficient feature matrix fed into the convolutional deep belief network, where N is the target frame count of the time normalization and M is the order of the mel-frequency cepstral coefficients of each frame.
Z2. The convolution uses O convolution kernels, each kernel being a three-dimensional weight matrix of size I × E × F, i.e. the kernels are W = {w1, w2, ..., wO}, where wj ∈ R^(I×E×F), j = 1, 2, ..., O, and wj is the j-th convolution kernel; the number of output channels of the convolutional restricted Boltzmann machine is therefore O, and each output channel corresponds to one local feature of the input, i.e.:
the convolutional layer is H = {h1, h2, ..., hO}, where hj ∈ R^(N'×M'), j = 1, 2, ..., O, denotes the j-th channel of the convolutional layer, corresponding to a two-dimensional N' × M' feature matrix after feature mapping.
Z3. Let the convolution stride be s1 × s2. All neurons in the same channel of the input layer share a bias ai, i = 1, 2, ..., I, and the neurons in the same group of the convolutional layer share a bias bj, j = 1, 2, ..., O. The parameters to be trained are θ = {W, a, b}.
Z4. The convolutional restricted Boltzmann machine is an energy-based model, and its energy function is defined as:
The joint probability distribution of the input layer V and the convolutional layer H is obtained from the energy function, i.e. the joint probability distribution of all neuron values of the input layer V and all neuron values of the convolutional layer H:
where Z denotes the partition function; i = 1, ..., I, p = 1, ..., y, q = 1, ..., s, j = 1, ..., O, m = 1, ..., N', k = 1, ..., M'.
The marginal probability distribution of the input layer V, i.e. the probability distribution of all neuron values of the input layer V, is:
If the training sample set TS contains T samples, the log-likelihood function on the input layer V is:
where L(TS, θ) is the likelihood function on the input layer V:
Z5. The log-likelihood function is maximized using a gradient ascent algorithm combined with the contrastive divergence algorithm, from which the parameters θ are obtained.
Z6. The output layer is a Softmax layer with 2 neurons, whose outputs respectively represent the probabilities that the sample is normal speech and abnormal speech; in actual recognition, the class with the larger probability is taken as the final classification result. The output layer is fully connected to the pooling layer of the top convolutional restricted Boltzmann machine, and the parameters to be trained are the biases c1, c2 of the two output neurons and the connection weights between the output layer and the top pooling layer:
where r is the number of neurons of the top hidden layer; for i = 1, WE1 is the connection weight vector between the 1st neuron of the Softmax output layer and the neurons of the top pooling layer, and WE1k is the connection weight between the 1st neuron of the Softmax output layer and the k-th neuron of the top pooling layer, k ≤ r; for i = 2, WE2 is the connection weight vector between the 2nd neuron of the Softmax output layer and the neurons of the top pooling layer, and WE2k is the connection weight between the 2nd neuron of the Softmax output layer and the k-th neuron of the top pooling layer.
Let the output of the top pooling layer be G = {g1, g2, ..., gr}, where gi is the output of the i-th neuron of the top pooling layer. The input of the output layer is then fi = WEi·G^T + ci, where f1 is the input of the first output neuron (i = 1) and f2 is the input of the second output neuron (i = 2).
The output value of the output layer is:
where y1 (i = 1) is the output of the 1st neuron of the Softmax output layer and represents the probability that the input speech is normal speech, and y2 (i = 2) is the output of the 2nd neuron of the Softmax output layer and represents the probability that the input speech is abnormal speech.
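For the two-class output layer of Z6, the forward computation reduces to the few lines below (a NumPy sketch, where G, WE and c stand for the trained top pooling-layer output, output weights and output biases).

import numpy as np

def softmax_output(G, WE, c):
    """Posterior probabilities [p(normal speech), p(abnormal speech)] for one sample.
    G: top pooling-layer output, shape (r,); WE: output weights, shape (2, r); c: biases, shape (2,)."""
    f = WE @ G + c                   # f_i = WE_i * G^T + c_i
    e = np.exp(f - f.max())          # subtract the maximum for numerical stability
    return e / e.sum()               # y_i = exp(f_i) / sum_j exp(f_j)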
Z7. The Softmax classifier is trained in mini-batch fashion: a batch of T training samples is fetched at a time for training and the parameters are updated once; the loss function used is:
where J is the sum of squared errors during training, t_i^(j) is the ideal (target) output of the i-th neuron of the classifier for the j-th fetched sample (if the j-th sample is normal speech then t_1^(j) = 1 and t_2^(j) = 0, and vice versa), and y_i^(j) is the actual output of the i-th neuron of the classifier for the j-th fetched sample.
F1 is the harmonic mean of the precision P and the recall R, i.e.:
where the precision P is the ratio of the number of samples correctly identified as abnormal speech to the number of all samples identified as abnormal speech, the recall R is the ratio of the number of samples correctly identified as abnormal speech to the number of all abnormal speech samples, TP is the number of samples among the T samples correctly identified as abnormal speech, FP is the number of samples among the T samples wrongly identified as abnormal speech, and FN is the number of samples among the T samples in which abnormal speech is wrongly identified as normal speech.
During training, the parameters are updated with a gradient descent algorithm until the loss function converges to the set value or the number of iterations reaches the maximum number of iterations, at which point training ends.
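The quantities used in Z7 can be illustrated with the following sketch, which computes the batch sum-of-squared-errors loss and the precision, recall and F1 for a mini-batch of T samples, taking abnormal speech as the positive class. The one-hot column ordering (column 0 = normal, column 1 = abnormal) is an assumption made for illustration.

import numpy as np

def batch_metrics(y_pred, y_true):
    """y_pred: (T, 2) softmax outputs; y_true: (T, 2) one-hot targets
    (column 0 = normal speech, column 1 = abnormal speech, assumed)."""
    loss = np.sum((y_true - y_pred) ** 2)        # sum of squared errors over the batch

    pred_abnormal = y_pred.argmax(axis=1) == 1
    true_abnormal = y_true.argmax(axis=1) == 1
    TP = np.sum(pred_abnormal & true_abnormal)   # abnormal speech correctly detected
    FP = np.sum(pred_abnormal & ~true_abnormal)  # normal speech flagged as abnormal
    FN = np.sum(~pred_abnormal & true_abnormal)  # abnormal speech missed

    P = TP / (TP + FP) if TP + FP else 0.0       # precision
    R = TP / (TP + FN) if TP + FN else 0.0       # recall
    F1 = 2 * P * R / (P + R) if P + R else 0.0   # harmonic mean of P and R
    return loss, P, R, F1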
In step S6, the detailed process is: in the on-line recognition part, the deep neural network loads the network structure and the weight coefficients obtained from training, and the hidden Markov models load the models learned in advance; the speech input in real time is pre-processed and its features are extracted. HMM templates are built for normal speech and abnormal speech separately, i.e. an HMM template is trained with normal speech to obtain the HMM template for normal speech, and an HMM template is trained with abnormal speech to obtain the HMM template for abnormal speech. During on-line recognition, the corresponding HMM template is called according to the decision output by the convolutional deep belief network, and the mel-frequency cepstral coefficients of the input speech signal are used for template matching to obtain the final speech recognition result.
Compared with the prior art, the present invention has the following advantages and beneficial effects:
The present invention combines a convolutional deep belief network with hidden Markov models. The multiple nonlinear transformation layers of the convolutional deep belief network map the input MFCC features to a higher-dimensional space, characterize the speech information more fully, and effectively distinguish abnormal speech from normal speech; the time-series modelling capability of hidden Markov models is then used to model speech of different states separately, which greatly improves the recognition accuracy of the speech.
Description of the drawings
Fig. 1 is the off-line training flow chart of the abnormal speech discrimination method based on deep learning of the present invention;
Fig. 2 is the on-line recognition flow chart of the abnormal speech discrimination method based on deep learning of the present invention;
Fig. 3 is a schematic diagram of the speech state recognition by the convolutional deep belief network of the abnormal speech discrimination method based on deep learning of the present invention.
Detailed description of the embodiments
The present invention is described in further detail below with reference to the embodiments and the accompanying drawings, but the embodiments of the present invention are not limited thereto.
Embodiment 1
An abnormal speech discrimination method based on deep learning comprises the following steps:
First step: obtain the input speech and pre-process it by resampling, pre-emphasis, framing and windowing, etc., to obtain pre-processed speech.
Resampling: the input speech may have different sampling frequencies and coding formats; to facilitate data processing and analysis, the original input speech signal is resampled to unify the sampling frequency and the coding format. The sampling frequency is 22.05 kHz and the coding format is WAV.
Pre-emphasis: the power spectrum of the speech signal decreases as the frequency increases and most of the energy is concentrated in the low-frequency range; to boost the high-frequency part of the original speech signal, pre-emphasis is applied to the original input speech signal using a first-order FIR high-pass filter whose transfer function is:
H(z) = 1 - a·z^(-1),
where a is the high-pass filter coefficient, with value 0.93. The speech signal after pre-emphasis is:
y(n) = sp(n) - a·sp(n-1), n = 0, 1, ..., Length-1
where y(n) is the speech signal after pre-emphasis, sp(n) is the speech signal before pre-emphasis, sp(n-1) is its one-sample delay, and Length is the length of the speech signal.
Framing and windowing: the speech is sliced so that a fixed-length segment of the input speech intercepted at fixed intervals forms one frame; a frame length of 25 ms and a frame shift of 10 ms are used, and a Hamming window is applied to each frame.
Second step: extract mel-frequency cepstral coefficient features from each frame of the pre-processed speech.
The detailed process is as follows:
V1. Design a bank of L triangular mel-frequency filters. Let Wl denote the frequency response of the l-th mel-frequency filter, l = 1, 2, ..., L; Fs is the resampling frequency of the speech signal; Q is the frame length of a given frame of the speech signal and also the number of points of the Fourier transform; fl and fh are respectively the lower and upper cut-off frequencies of the speech signal. A frame of speech of length Q yields Q frequency components after a Q-point fast Fourier transform, and o(l), c(l) and h(l) are respectively the index values, among the Q frequency components, of the lower-limit, center and upper-limit frequencies of the l-th mel-frequency filter. o(l), c(l) and h(l) satisfy the following relationships:
c(l-1) = o(l),
o(l+1) = c(l),
h(l) = c(l+1),
that is, the center frequency of the current filter lies at the lower edge of the next filter.
At the same time, o(l)|l=1 = fl and h(l)|l=L = fh; therefore, the index of the center frequency of the l-th mel-frequency filter among the Q frequency components is expressed as:
where Mel(f1) is the function mapping actual frequency to mel frequency, Mel^(-1)(f2) is the inverse of Mel(f1), f1 is the actual frequency, and f2 is the mel frequency.
The frequency response of the l-th mel-frequency filter is:
V2. For a frame of speech x(n), n = 0, 1, ..., Q-1, after resampling, pre-emphasis, framing and windowing, perform a Q-point fast Fourier transform to obtain its spectrum X(k) and amplitude spectrum |X(k)|.
V3. Pass the frame of speech through the mel-frequency filter bank to obtain the output amplitude spectrum of each filter.
V4. Take the logarithm of the output amplitude spectra of all the filters and then apply the discrete cosine transform to obtain the mel-frequency cepstral coefficients.
Among the L coefficients, the 2nd to the (M+1)-th, M coefficients in total, are taken to form the M-dimensional mel-frequency cepstral coefficient feature vector of each frame, C = {cmfcc(2), cmfcc(3), ..., cmfcc(M+1)}.
Third step: normalize speech segments with different frame counts to a fixed number of frames; each speech segment yields one mel-frequency cepstral coefficient feature matrix.
Because different speech segments have different durations, the number of frames they contain may differ, and because the input of the neural network must have a fixed dimension, the speech signal must be time-normalized so that the input of the neural network has a fixed size. Time normalization is carried out with the characteristic point sequence method, as follows:
Y1. Build a time-normalization network of n-N+1 layers, where N is the target frame count after normalization; the first layer has n frames and the last layer has N frames.
The i-th layer of the time-normalization network has n-(i-1) frames, each frame corresponding to one feature vector, forming a group of n-(i-1) feature vectors:
where i = 1, 2, ..., n-N+1, k = 1, 2, ..., n-(i-1), and C is the mel-frequency cepstral coefficient feature vector of each frame of speech.
In particular, the feature vector group of the first layer of the network is the feature vector group of the network input.
Let w_k^(i) denote the weight of the speech frame represented by the vector C_k^(i); when i = 1,
Y2. Let d_j^(i) denote the distance between C_j^(i) and C_(j+1)^(i). For the i-th layer, find j such that for any k ≠ j, d_j^(i) ≤ d_k^(i).
Y3. Merge the two closest frames, reducing the frame count by one, i.e.:
where C_j^(i+1) denotes the mel-frequency cepstral coefficient feature vector of the j-th frame of speech in the (i+1)-th layer of the network; C_(j+1)^(i) denotes that of the (j+1)-th frame in the i-th layer; C_k^(i+1) denotes that of the k-th frame in the (i+1)-th layer; C_(k+1)^(i) denotes that of the (k+1)-th frame in the i-th layer; w_j^(i) denotes the weight of the j-th frame of the i-th layer; w_(j+1)^(i) denotes the weight of the (j+1)-th frame of the i-th layer; w_j^(i+1) denotes the weight of the j-th frame of the (i+1)-th layer; and d_j^(i) denotes the distance between the mel-frequency cepstral coefficient feature vectors of the j-th and (j+1)-th frames of speech in the i-th layer of the network.
Y4. Solve for C_j^(i+1), i.e.:
Repeat from i = 1 to i = n-N+1; each execution reduces the frame count by one, finally normalizing the n-frame speech signal to the fixed N frames.
After time normalization, each speech segment corresponds to one mel-frequency cepstral coefficient feature matrix of dimension N × M, where N is the target frame count of the time normalization of the speech segment and M is the order of the mel-frequency cepstral coefficients of each frame; the mel-frequency cepstral coefficients are 12-dimensional, i.e. M = 12.
Fourth step: build the convolutional deep belief network. The deep belief network is formed by stacking two convolutional restricted Boltzmann machines and has two convolutional layers, two pooling layers, one visible layer and one output layer. For the first convolutional restricted Boltzmann machine, the number of neural units of the visible layer is N × M = 200 × 12, the number of convolution kernels is 10, the kernel size is 2 × 2, the convolution stride is 2 × 2, the kernels are initialized with random values drawn from a Gaussian distribution with mean 0 and variance 0.01, the initial bias of the visible layer is 0 and the initial bias of the convolutional layer is -0.1; 100 samples are fetched per iteration and the number of iteration rounds is 100. The pooling size of the first pooling layer is 2 × 2 and the pooling stride is 2 × 2. For the second convolutional restricted Boltzmann machine, the number of convolution kernels is 10, the kernel size is 10 × 2 × 2, the convolution stride is 2 × 2, the kernels are initialized with random values drawn from a Gaussian distribution with mean 0 and variance 0.01, and the initial bias of the second convolutional layer is -0.1; 100 samples are fetched per iteration and the number of iteration rounds is 100. The pooling size of the second pooling layer is 2 × 2 and the pooling stride is 2 × 2. All convolutional layers use the sigmoid activation function. The output layer has 2 neurons and outputs the posterior probabilities of normal speech and abnormal speech. The convergence value of the output-layer loss function is set to 0.004 and the maximum number of iterations to 1000.
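The feature-map arithmetic implied by this configuration can be checked with a few lines of Python. The valid (no padding) floor convention assumed below is not stated in the patent; with that convention the frequency axis already shrinks to width 1 after the second convolution, so the exact size of the top pooling layer depends on the border-handling convention actually used.

def stage_size(h, w, kh, kw, sh, sw):
    """Output height/width of a valid convolution or pooling stage (floor convention assumed)."""
    return (h - kh) // sh + 1, (w - kw) // sw + 1

h, w = 200, 12                         # N x M MFCC feature matrix, one input channel
h, w = stage_size(h, w, 2, 2, 2, 2)    # CRBM 1 convolution, 10 kernels 2x2, stride 2 -> 100 x 6
h, w = stage_size(h, w, 2, 2, 2, 2)    # CRBM 1 pooling, 2x2, stride 2                -> 50 x 3
h, w = stage_size(h, w, 2, 2, 2, 2)    # CRBM 2 convolution, 10 kernels 2x2, stride 2 -> 25 x 1
print(h, w)                            # the second pooling then acts only along the time axis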
Fifth step: feed the mel-frequency cepstral coefficient feature matrices into the convolutional deep belief network, train it, and classify the state of the input speech; Fig. 1 is the training flow chart of the off-line stage.
The bottom convolutional restricted Boltzmann machine is trained first, and then the top convolutional restricted Boltzmann machine is trained, as follows:
Z1. Let the number of input channels of a convolutional restricted Boltzmann machine be I, each channel corresponding to a two-dimensional matrix of size y × s, i.e.:
V = {v1, v2, ..., vI}, vi ∈ R^(y×s), i = 1, 2, ..., I,
where V is the input layer of the convolutional restricted Boltzmann machine and vi ∈ R^(y×s) is the i-th channel of the input layer.
In particular, for the bottom convolutional restricted Boltzmann machine, I = 1, y = N, s = M; that is, the number of input channels of the bottom convolutional restricted Boltzmann machine is 1, corresponding to the two-dimensional N × M mel-frequency cepstral coefficient feature matrix fed into the convolutional deep belief network, where N is the target frame count of the time normalization and M is the order of the mel-frequency cepstral coefficients of each frame.
Z2. The convolution uses O convolution kernels, each kernel being a three-dimensional weight matrix of size I × E × F, i.e. the kernels are W = {w1, w2, ..., wO}, where wj ∈ R^(I×E×F), j = 1, 2, ..., O, and wj is the j-th convolution kernel; the number of output channels of the convolutional restricted Boltzmann machine is therefore O, and each output channel corresponds to one local feature of the input, i.e.:
the convolutional layer is H = {h1, h2, ..., hO}, where hj ∈ R^(N'×M'), j = 1, 2, ..., O, denotes the j-th channel of the convolutional layer, corresponding to a two-dimensional N' × M' feature matrix after feature mapping.
Z3. Let the convolution stride be s1 × s2. All neurons in the same channel of the input layer share a bias ai, i = 1, 2, ..., I, and the neurons in the same group of the convolutional layer share a bias bj, j = 1, 2, ..., O. The parameters to be trained are θ = {W, a, b}.
Z4. The convolutional restricted Boltzmann machine is an energy-based model, and its energy function is defined as:
The joint probability distribution of the input layer V and the convolutional layer H is obtained from the energy function, i.e. the joint probability distribution of all neuron values of the input layer V and all neuron values of the convolutional layer H:
where Z denotes the partition function; i = 1, ..., I, p = 1, ..., y, q = 1, ..., s, j = 1, ..., O, m = 1, ..., N', k = 1, ..., M'.
The marginal probability distribution of the input layer V, i.e. the probability distribution of all neuron values of the input layer V, is:
If the training sample set TS contains T samples, the log-likelihood function on the input layer V is:
where L(TS, θ) is the likelihood function on the input layer V:
Z5. The log-likelihood function is maximized using a gradient ascent algorithm combined with the contrastive divergence algorithm, from which the parameters θ are obtained.
Z6. The output layer is a Softmax layer with 2 neurons, whose outputs respectively represent the probabilities that the sample is normal speech and abnormal speech; in actual recognition, the class with the larger probability is taken as the final classification result. The output layer is fully connected to the pooling layer of the top convolutional restricted Boltzmann machine, and the parameters to be trained are the biases c1, c2 of the two output neurons and the connection weights between the output layer and the top pooling layer:
where r is the number of neurons of the top hidden layer; for i = 1, WE1 is the connection weight vector between the 1st neuron of the Softmax output layer and the neurons of the top pooling layer, and WE1k is the connection weight between the 1st neuron of the Softmax output layer and the k-th neuron of the top pooling layer, k ≤ r; for i = 2, WE2 is the connection weight vector between the 2nd neuron of the Softmax output layer and the neurons of the top pooling layer, and WE2k is the connection weight between the 2nd neuron of the Softmax output layer and the k-th neuron of the top pooling layer.
Let the output of the top pooling layer be G = {g1, g2, ..., gr}, where gi is the output of the i-th neuron of the top pooling layer. The input of the output layer is then fi = WEi·G^T + ci, where f1 is the input of the first output neuron (i = 1) and f2 is the input of the second output neuron (i = 2).
The output value of the output layer is:
where y1 (i = 1) is the output of the 1st neuron of the Softmax output layer and represents the probability that the input speech is normal speech, and y2 (i = 2) is the output of the 2nd neuron of the Softmax output layer and represents the probability that the input speech is abnormal speech.
Z7. The Softmax classifier is trained in mini-batch fashion: a batch of T training samples is fetched at a time for training and the parameters are updated once; the loss function used is:
where J is the sum of squared errors during training, t_i^(j) is the ideal (target) output of the i-th neuron of the classifier for the j-th fetched sample (if the j-th sample is normal speech then t_1^(j) = 1 and t_2^(j) = 0, and vice versa), and y_i^(j) is the actual output of the i-th neuron of the classifier for the j-th fetched sample.
F1 is the harmonic mean of the precision P and the recall R, i.e.:
where the precision P is the ratio of the number of samples correctly identified as abnormal speech to the number of all samples identified as abnormal speech, the recall R is the ratio of the number of samples correctly identified as abnormal speech to the number of all abnormal speech samples, TP is the number of samples among the T samples correctly identified as abnormal speech, FP is the number of samples among the T samples wrongly identified as abnormal speech, and FN is the number of samples among the T samples in which abnormal speech is wrongly identified as normal speech.
During training, the parameters are updated with a gradient descent algorithm until the loss function converges to the set value or the number of iterations reaches the maximum number of iterations, at which point training ends.
Sixth step: according to the classification result, call the corresponding hidden Markov model for template matching to obtain the speech recognition result.
In the on-line recognition part, as shown in Fig. 2, the deep neural network loads the network structure and the weight coefficients obtained from training, and the hidden Markov models load the models learned in advance; the speech input in real time is pre-processed and its features are extracted. HMM templates are built for normal speech and abnormal speech separately, i.e. an HMM template is trained with normal speech to obtain the HMM template for normal speech, and an HMM template is trained with abnormal speech to obtain the HMM template for abnormal speech. During on-line recognition, the corresponding HMM template is called according to the decision output by the convolutional deep belief network, and the mel-frequency cepstral coefficients of the input speech signal are used for template matching to obtain the final speech recognition result; the principle flow chart is shown in Fig. 3.
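A minimal sketch of this on-line decision flow is given below; the predict_proba and score interfaces stand in for the trained convolutional deep belief network and the HMM templates (an hmmlearn model, for example, exposes a score method returning a log-likelihood) and are assumptions, not the patent's API.

def recognize(mfcc_matrix, cdbn, normal_hmms, abnormal_hmms):
    """On-line recognition: the CDBN decides normal vs. abnormal speech,
    then the matching HMM template set is used for template matching.

    cdbn          : object with predict_proba(mfcc_matrix) -> [p_normal, p_abnormal]
    normal_hmms,
    abnormal_hmms : dicts mapping a template label to an HMM exposing
                    score(features) -> log-likelihood
    """
    p_normal, p_abnormal = cdbn.predict_proba(mfcc_matrix)
    templates = normal_hmms if p_normal >= p_abnormal else abnormal_hmms

    # Template matching: the HMM with the highest log-likelihood wins.
    scores = {label: hmm.score(mfcc_matrix) for label, hmm in templates.items()}
    return max(scores, key=scores.get)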
The above embodiment is a preferred embodiment of the present invention, but the embodiments of the present invention are not limited by the above embodiment. Any other change, modification, substitution, combination or simplification made without departing from the spirit and principle of the present invention shall be an equivalent replacement and is included within the protection scope of the present invention.

Claims (10)

1. An abnormal speech discrimination method based on deep learning, characterized by comprising the following steps:
S1. obtaining input speech and pre-processing the input speech to obtain pre-processed speech, the pre-processing comprising resampling, pre-emphasis, framing and windowing;
S2. extracting a mel-frequency cepstral coefficient feature vector from each frame of the pre-processed speech using a mel-frequency filter bank and the Fourier transform;
S3. normalizing speech segments with different frame counts to a fixed number of frames, each speech segment yielding one mel-frequency cepstral coefficient feature matrix;
S4. building a convolutional deep belief network;
S5. feeding the mel-frequency cepstral coefficient feature matrices into the convolutional deep belief network, training it, and classifying the state of the input speech;
S6. according to the classification result, calling a hidden Markov model for template matching to obtain a speech recognition result.
2. The abnormal speech discrimination method based on deep learning according to claim 1, characterized in that, in step S1,
the resampling frequency is 22.05 kHz and the coding format is WAV;
the pre-emphasis uses a first-order FIR high-pass filter whose transfer function is:
H(z) = 1 - a·z^(-1),
where a is the high-pass filter coefficient, with value 0.93; the speech signal after pre-emphasis is:
y(n) = sp(n) - a·sp(n-1), n = 0, 1, ..., Length-1;
where y(n) is the speech signal after pre-emphasis, sp(n) is the speech signal before pre-emphasis, sp(n-1) is its one-sample delay, and Length is the length of the speech signal;
the framing and windowing is specifically: slicing the speech so that a fixed-length segment of the input speech intercepted at fixed intervals forms one frame, using a frame length of 25 ms and a frame shift of 10 ms, and applying a Hamming window to each frame.
3. The abnormal speech discrimination method based on deep learning according to claim 1, characterized in that step S2 is specifically:
V1. designing a bank of L triangular mel-frequency filters, where Wl denotes the frequency response of the l-th mel-frequency filter, l = 1, 2, ..., L, Fs is the resampling frequency of the speech signal, Q is the frame length of a given frame of the speech signal and also the number of points of the Fourier transform, and fl and fh are respectively the lower and upper cut-off frequencies of the speech signal; a frame of speech of length Q yields Q frequency components after a Q-point fast Fourier transform, and o(l), c(l) and h(l) are respectively the index values, among the Q frequency components, of the lower-limit, center and upper-limit frequencies of the l-th mel-frequency filter; o(l), c(l) and h(l) satisfy the following relationships:
c(l-1) = o(l),
o(l+1) = c(l),
h(l) = c(l+1),
that is, the center frequency of the current filter lies at the lower edge of the next filter;
at the same time, o(l)|l=1 = fl and h(l)|l=L = fh, so the index of the center frequency of the l-th mel-frequency filter among the Q frequency components is expressed as:
where Mel(f1) is the function mapping actual frequency to mel frequency, Mel^(-1)(f2) is the inverse of Mel(f1), f1 is the actual frequency, and f2 is the mel frequency;
the frequency response of the l-th mel-frequency filter is:
where k is the index of the frequency component among the Q frequency components;
V2. for a frame of speech x(n), n = 0, 1, ..., Q-1, Q < Length, after resampling, pre-emphasis, framing and windowing, performing a Q-point fast Fourier transform to obtain its spectrum X(k) and amplitude spectrum |X(k)|:
V3. passing the frame of speech through the mel-frequency filter bank to obtain the output amplitude spectrum of each filter:
V4. taking the logarithm of the output amplitude spectra of all the filters and then applying the discrete cosine transform to obtain the mel-frequency cepstral coefficients:
among the L coefficients, the 2nd to the (M+1)-th, M coefficients in total, are taken to form the M-dimensional mel-frequency cepstral coefficient feature vector of each frame, C = {cmfcc(2), cmfcc(3), ..., cmfcc(M+1)}.
4. The abnormal speech discrimination method based on deep learning according to claim 1, characterized in that, in step S2, the mel-frequency cepstral coefficient feature matrix has dimension N × M, where N is the target frame count of the time normalization of the speech segment and M is the order of the mel-frequency cepstral coefficients of each frame; the mel-frequency cepstral coefficients are 12-dimensional, i.e. M = 12.
5. The abnormal speech discrimination method based on deep learning according to claim 1, characterized in that, in step S3, the normalization is specifically:
Y1. building a time-normalization network of n-N+1 layers, where N is the target frame count of the normalization, the first layer has n frames and the last layer has N frames;
the i-th layer of the time-normalization network has n-(i-1) frames, each frame corresponding to one feature vector, forming a group of n-(i-1) feature vectors:
where C_k^(i) is the mel-frequency cepstral coefficient feature vector of the k-th frame of speech in the i-th layer of the network;
in particular, the feature vector group of the first layer of the network is the feature vector group of the network input;
let w_k^(i) denote the weight of the speech frame represented by the vector C_k^(i); when i = 1,
Y2. letting d_j^(i) denote the distance between C_j^(i) and C_(j+1)^(i), for the i-th layer, finding j such that for any k ≠ j, d_j^(i) ≤ d_k^(i);
Y3. merging the two closest frames, reducing the frame count by one, i.e.:
where C_j^(i+1) denotes the mel-frequency cepstral coefficient feature vector of the j-th frame of speech in the (i+1)-th layer of the network; C_(j+1)^(i) denotes that of the (j+1)-th frame in the i-th layer; C_k^(i+1) denotes that of the k-th frame in the (i+1)-th layer; C_(k+1)^(i) denotes that of the (k+1)-th frame in the i-th layer; w_j^(i) denotes the weight of the j-th frame of the i-th layer; w_(j+1)^(i) denotes the weight of the (j+1)-th frame of the i-th layer; w_j^(i+1) denotes the weight of the j-th frame of the (i+1)-th layer; and d_j^(i) denotes the distance between the mel-frequency cepstral coefficient feature vectors of the j-th and (j+1)-th frames of speech in the i-th layer of the network;
Y4. solving for C_j^(i+1), i.e.:
repeating from i = 1 to i = n-N+1, each execution reducing the frame count by one, finally normalizing the n-frame speech signal to the fixed N frames.
6. The abnormal speech discrimination method based on deep learning according to claim 1, characterized in that, in step S4, the convolutional deep belief network is formed by stacking more than one convolutional restricted Boltzmann machine from bottom to top, and the output layer uses a Softmax classifier; each convolutional restricted Boltzmann machine consists of one input layer V and one convolutional layer H.
7. The abnormal speech discrimination method based on deep learning according to claim 6, characterized in that a pooling layer is added after the convolutional layer H of each convolutional restricted Boltzmann machine to perform a pooling operation with pooling size E' × F' and pooling stride s3 × s4, s3 = s4 = 2, so that the pooling layer of one convolutional restricted Boltzmann machine serves as the input layer of the next convolutional restricted Boltzmann machine.
8. The abnormal speech discrimination method based on deep learning according to claim 6, characterized in that, in step S5, the bottom convolutional restricted Boltzmann machine is trained first and then the top convolutional restricted Boltzmann machine is trained, specifically:
Z1. letting the number of input channels of a convolutional restricted Boltzmann machine be I, each channel corresponding to a two-dimensional matrix of size y × s, i.e.:
V = {v1, v2, ..., vI}, vi ∈ R^(y×s), i = 1, 2, ..., I,
where V is the input layer of the convolutional restricted Boltzmann machine and vi ∈ R^(y×s) is the i-th channel of the input layer;
for the bottom convolutional restricted Boltzmann machine, I = 1, y = N, s = M, i.e. the number of input channels of the bottom convolutional restricted Boltzmann machine is 1, corresponding to the two-dimensional N × M mel-frequency cepstral coefficient feature matrix fed into the convolutional deep belief network, where N is the target frame count of the time normalization and M is the order of the mel-frequency cepstral coefficients of each frame;
Z2. the convolution uses O convolution kernels, each kernel being a three-dimensional weight matrix of size I × E × F, i.e. the kernels are W = {w1, w2, ..., wO}, where wj ∈ R^(I×E×F), j = 1, 2, ..., O, and wj is the j-th convolution kernel; the number of output channels of the convolutional restricted Boltzmann machine is therefore O, and each output channel corresponds to one local feature of the input, i.e.:
the convolutional layer is H = {h1, h2, ..., hO}, where hj ∈ R^(N'×M'), j = 1, 2, ..., O, denotes the j-th channel of the convolutional layer, corresponding to a two-dimensional N' × M' feature matrix after feature mapping;
Z3. letting the convolution stride be s1 × s2, all neurons in the same channel of the input layer sharing a bias ai, i = 1, 2, ..., I, and the neurons in the same group of the convolutional layer sharing a bias bj, j = 1, 2, ..., O; the parameters to be trained are θ = {W, a, b};
Z4. the convolutional restricted Boltzmann machine is an energy-based model whose energy function is defined as:
the joint probability distribution of the input layer V and the convolutional layer H is obtained from the energy function, i.e. the joint probability distribution of all neuron values of the input layer V and all neuron values of the convolutional layer H:
where Z denotes the partition function; i = 1, ..., I, p = 1, ..., y, q = 1, ..., s, j = 1, ..., O, m = 1, ..., N', k = 1, ..., M';
the marginal probability distribution of the input layer V, i.e. the probability distribution of all neuron values of the input layer V, is:
if the training sample set TS contains T samples, the log-likelihood function on the input layer V is:
where L(TS, θ) is the likelihood function on the input layer V:
Z5. the log-likelihood function is maximized using a gradient ascent algorithm combined with the contrastive divergence algorithm, from which the parameters θ are obtained;
Z6. the output layer is a Softmax layer with 2 neurons, whose outputs respectively represent the probabilities that the sample is normal speech and abnormal speech, and in actual recognition the class with the larger probability is taken as the final classification result; the output layer is fully connected to the pooling layer of the top convolutional restricted Boltzmann machine, and the parameters to be trained are the biases c1, c2 of the two output neurons and the connection weights between the output layer and the top pooling layer:
where r is the number of neurons of the top hidden layer; for i = 1, WE1 is the connection weight vector between the 1st neuron of the Softmax output layer and the neurons of the top pooling layer, and WE1k is the connection weight between the 1st neuron of the Softmax output layer and the k-th neuron of the top pooling layer, k ≤ r; for i = 2, WE2 is the connection weight vector between the 2nd neuron of the Softmax output layer and the neurons of the top pooling layer, and WE2k is the connection weight between the 2nd neuron of the Softmax output layer and the k-th neuron of the top pooling layer;
letting the output of the top pooling layer be G = {g1, g2, ..., gr}, where gi is the output of the i-th neuron of the top pooling layer, the input of the output layer is fi = WEi·G^T + ci, where f1 is the input of the first output neuron (i = 1) and f2 is the input of the second output neuron (i = 2);
the output value of the output layer is yi, where y1 (i = 1) is the output of the 1st neuron of the Softmax output layer and represents the probability that the input speech is normal speech, and y2 (i = 2) is the output of the 2nd neuron of the Softmax output layer and represents the probability that the input speech is abnormal speech;
Z7. when training the Softmax classifier, training is performed in mini-batch fashion: a batch of T training samples is fetched at a time for training and the parameters are updated once; the loss function used is:
where J is the sum of squared errors during training, t_i^(j) is the ideal (target) output of the i-th neuron of the classifier for the j-th fetched sample (if the j-th sample is normal speech then t_1^(j) = 1 and t_2^(j) = 0, and vice versa), and y_i^(j) is the actual output of the i-th neuron of the classifier for the j-th fetched sample;
F1 is the harmonic mean of the precision P and the recall R, i.e.:
where the precision P is the ratio of the number of samples correctly identified as abnormal speech to the number of all samples identified as abnormal speech, the recall R is the ratio of the number of samples correctly identified as abnormal speech to the number of all abnormal speech samples, TP is the number of samples among the T samples correctly identified as abnormal speech, FP is the number of samples among the T samples wrongly identified as abnormal speech, and FN is the number of samples among the T samples in which abnormal speech is wrongly identified as normal speech;
during training, the parameters are updated with a gradient descent algorithm until the loss function converges to the set value or the number of iterations reaches the maximum number of iterations, at which point training ends.
9. The abnormal speech discrimination method based on deep learning according to claim 1, characterized in that the detailed process of step S6 is: in the on-line recognition part, the deep neural network loads the network structure and the weight coefficients obtained from training, and the hidden Markov models load the models learned in advance; the speech input in real time is pre-processed and its features are extracted; HMM templates are built for normal speech and abnormal speech separately; during on-line recognition, the corresponding HMM template is called according to the decision output by the convolutional deep belief network, and the mel-frequency cepstral coefficients of the input speech signal are used for template matching to obtain the final speech recognition result.
10. The abnormal speech discrimination method based on deep learning according to claim 9, characterized in that the HMM template for normal speech is obtained by training an HMM template with normal speech, and the HMM template for abnormal speech is obtained by training an HMM template with abnormal speech.
CN201810417478.2A 2018-05-04 2018-05-04 Abnormal voice distinguishing method based on deep learning Active CN108766419B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810417478.2A CN108766419B (en) 2018-05-04 2018-05-04 Abnormal voice distinguishing method based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810417478.2A CN108766419B (en) 2018-05-04 2018-05-04 Abnormal voice distinguishing method based on deep learning

Publications (2)

Publication Number Publication Date
CN108766419A true CN108766419A (en) 2018-11-06
CN108766419B CN108766419B (en) 2020-10-27

Family

ID=64009048

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810417478.2A Active CN108766419B (en) 2018-05-04 2018-05-04 Abnormal voice distinguishing method based on deep learning

Country Status (1)

Country Link
CN (1) CN108766419B (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102129860A (en) * 2011-04-07 2011-07-20 魏昕 Text-related speaker recognition method based on infinite-state hidden Markov model
CN104157290A (en) * 2014-08-19 2014-11-19 大连理工大学 Speaker recognition method based on deep learning
CN105206270A (en) * 2015-08-20 2015-12-30 长安大学 Isolated digit speech recognition classification system and method combining principal component analysis (PCA) with restricted Boltzmann machine (RBM)
US20170200092A1 (en) * 2016-01-11 2017-07-13 International Business Machines Corporation Creating deep learning models using feature augmentation
CN106782602A (en) * 2016-12-01 2017-05-31 南京邮电大学 Speech emotion recognition method based on long short-term memory network and convolutional neural networks
CN106941005A (en) * 2017-02-24 2017-07-11 华南理工大学 Vocal cord abnormality detection method based on speech acoustic features
CN107464568A (en) * 2017-09-25 2017-12-12 四川长虹电器股份有限公司 Text-independent speaker recognition method and system based on three-dimensional convolutional neural network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
IDO ARIAV et al.: "A deep architecture for audio-visual voice activity detection in the presence of transients", Signal Processing *
LI Wei et al.: "Speaker retrieval based on deep speaker vectors", Journal of Huazhong University of Science and Technology (Natural Science Edition) *

Cited By (30)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112955954B (en) * 2018-12-21 2024-04-12 华为技术有限公司 Audio processing device and method for audio scene classification
CN112955954A (en) * 2018-12-21 2021-06-11 华为技术有限公司 Audio processing device and method for audio scene classification
CN111354373A (en) * 2018-12-21 2020-06-30 中国科学院声学研究所 Audio signal classification method based on neural network intermediate layer characteristic filtering
US11776532B2 (en) 2018-12-21 2023-10-03 Huawei Technologies Co., Ltd. Audio processing apparatus and method for audio scene classification
CN110058689A (en) * 2019-04-08 2019-07-26 深圳大学 Smart device input method based on facial vibration
CN110322887B (en) * 2019-04-28 2021-10-15 武汉大晟极科技有限公司 Multi-type audio signal energy feature extraction method
CN110322887A (en) * 2019-04-28 2019-10-11 武汉大晟极科技有限公司 Multi-type audio signal energy feature extraction method
WO2020239910A3 (en) * 2019-05-28 2021-02-18 Corti Aps An intelligent computer aided decision support system
EP3745412A1 (en) * 2019-05-28 2020-12-02 Corti ApS An intelligent computer aided decision support system
WO2021000498A1 (en) * 2019-07-04 2021-01-07 平安科技(深圳)有限公司 Composite speech recognition method, device, equipment, and computer-readable storage medium
CN110390929A (en) * 2019-08-05 2019-10-29 中国民航大学 Construction method of Chinese and English civil aviation air-ground communication acoustic model based on CDNN-HMM
CN110706720A (en) * 2019-08-16 2020-01-17 广东省智能制造研究所 Acoustic anomaly detection method for end-to-end unsupervised deep support network
CN110706720B (en) * 2019-08-16 2022-04-22 广东省智能制造研究所 Acoustic anomaly detection method for end-to-end unsupervised deep support network
CN110600015A (en) * 2019-09-18 2019-12-20 北京声智科技有限公司 Voice dense classification method and related device
CN110782901A (en) * 2019-11-05 2020-02-11 深圳大学 Method, storage medium and device for identifying voice of network telephone
CN110782901B (en) * 2019-11-05 2021-12-24 深圳大学 Method, storage medium and device for identifying voice of network telephone
CN111027675A (en) * 2019-11-22 2020-04-17 南京大学 Automatic adjusting method and system for multimedia playing setting
CN111044285A (en) * 2019-11-22 2020-04-21 军事科学院系统工程研究院军用标准研究中心 Method for diagnosing faults of mechanical equipment under complex conditions
CN111027675B (en) * 2019-11-22 2023-03-07 南京大学 Automatic adjusting method and system for multimedia playing setting
CN110931046A (en) * 2019-11-29 2020-03-27 福州大学 Audio high-level semantic feature extraction method and system for overlapped sound event detection
CN111128227A (en) * 2019-12-30 2020-05-08 云知声智能科技股份有限公司 Sound detection method and device
CN111724770B (en) * 2020-05-19 2022-04-01 中国电子科技网络信息安全有限公司 Audio keyword identification method for generating confrontation network based on deep convolution
CN111724770A (en) * 2020-05-19 2020-09-29 中国电子科技网络信息安全有限公司 Audio keyword identification method for generating confrontation network based on deep convolution
CN111508501B (en) * 2020-07-02 2020-09-29 成都晓多科技有限公司 Voice recognition method and system with accent for telephone robot
CN111508501A (en) * 2020-07-02 2020-08-07 成都晓多科技有限公司 Voice recognition method and system with accent for telephone robot
CN112750428A (en) * 2020-12-29 2021-05-04 平安普惠企业管理有限公司 Voice interaction method and device and computer equipment
CN113361647A (en) * 2021-07-06 2021-09-07 青岛洞听智能科技有限公司 Method for identifying type of missed call
CN113959071A (en) * 2021-07-21 2022-01-21 北京金茂绿建科技有限公司 Machine learning assistance-based operation control optimization method for centralized water chilling unit air conditioning system
CN113959071B (en) * 2021-07-21 2023-05-26 北京金茂绿建科技有限公司 Centralized water chilling unit air conditioning system operation control optimization method based on machine learning assistance
CN113689633A (en) * 2021-08-26 2021-11-23 浙江力石科技股份有限公司 Scenic spot human-computer interaction method, device and system

Also Published As

Publication number Publication date
CN108766419B (en) 2020-10-27

Similar Documents

Publication Publication Date Title
CN108766419A (en) A kind of abnormal speech detection method based on deep learning
CN112364779B (en) Underwater sound target identification method based on signal processing and deep-shallow network multi-model fusion
CN106782511A Speech recognition method based on a modified linear deep autoencoder network
CN109559736B (en) Automatic dubbing method for movie actors based on confrontation network
CN107221320A Method, apparatus, device and computer storage medium for training an acoustic feature extraction model
CN113488058B (en) Voiceprint recognition method based on short voice
CN111048097B (en) Twin network voiceprint recognition method based on 3D convolution
CN108520753A Speech lie detection method based on convolutional bidirectional long short-term memory network
CN110111797A Speaker recognition method based on Gaussian supervector and deep neural network
CN103456302B Emotional speaker recognition method based on emotion GMM model weight synthesis
CN111986679A (en) Speaker confirmation method, system and storage medium for responding to complex acoustic environment
Sukhwal et al. Comparative study of different classifiers based speaker recognition system using modified MFCC for noisy environment
CN102496366B Text-independent speaker identification method
Zheng et al. MSRANet: Learning discriminative embeddings for speaker verification via channel and spatial attention mechanism in alterable scenarios
Sun et al. A novel convolutional neural network voiceprint recognition method based on improved pooling method and dropout idea
CN113571095B (en) Speech emotion recognition method and system based on nested deep neural network
CN111243621A (en) Construction method of GRU-SVM deep learning model for synthetic speech detection
Kamble et al. Emotion recognition for instantaneous Marathi spoken words
Aggarwal et al. Performance evaluation of artificial neural networks for isolated Hindi digit recognition with LPC and MFCC
CN114299995A (en) Language emotion recognition method for emotion assessment
CN111785262B (en) Speaker age and gender classification method based on residual error network and fusion characteristics
CN114882888A (en) Voiceprint recognition method and system based on variational self-coding and countermeasure generation network
CN113963718A (en) Voice session segmentation method based on deep learning
Gade et al. Hybrid Deep Convolutional Neural Network based Speaker Recognition for Noisy Speech Environments
CN113450830A (en) Voice emotion recognition method of convolution cyclic neural network with multiple attention mechanisms

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant